Towards Universal Languages for Tractable Ontology Mediated Query Answering

  • 2019-11-26 06:07:20
  • Heng Zhang, Yan Zhang, Jia-Huai You, Zhiyong Feng, Guifei Jiang
  • 1

Abstract

An ontology language for ontology mediated query answering (OMQA-language) isuniversal for a family of OMQA-languages if it is the most expressive one amongthis family. In this paper, we focus on three families of tractableOMQA-languages, including first-order rewritable languages and languages whosedata complexity of the query answering is in AC0 or PTIME. On the negativeside, we prove that there is, in general, no universal language for each ofthese families of languages. On the positive side, we propose a novel property,the locality, to approximate the first-order rewritability, and show that thereexists a language of disjunctive embedded dependencies that is universal forthe family of OMQA-languages with locality. All of these results apply to OMQAwith query languages such as conjunctive queries, unions of conjunctive queriesand acyclic conjunctive queries.

 

Quick Read (beta)

Towards Universal Languages for Tractable Ontology Mediated Query Answering

Heng Zhang,1 Yan Zhang,2,3 Jia-Huai You,4 Zhiyong Feng,1 Guifei Jiang 5

1College of Intelligence and Computing, Tianjin University, Tianjin, China
2School of Computing, Engineering and Mathematics, Western Sydney University, Penrith, Australia
3School of Computer Science and Technology, Huazhong University of Technology and Science, Wuhan, China
4Department of Computing Science, University of Alberta, Edmonton, Canada
5College of Software, Nankai University, Tianjin, China
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract

An ontology language for ontology mediated query answering (OMQA-language) is universal for a family of OMQA-languages if it is the most expressive one among this family. In this paper, we focus on three families of tractable OMQA-languages, including first-order rewritable languages and languages whose data complexity of the query answering is in AC0 or PTIME. On the negative side, we prove that there is, in general, no universal language for each of these families of languages. On the positive side, we propose a novel property, the locality, to approximate the first-order rewritability, and show that there exists a language of disjunctive embedded dependencies that is universal for the family of OMQA-languages with locality. All of these results apply to OMQA with query languages such as conjunctive queries, unions of conjunctive queries and acyclic conjunctive queries.

Towards Universal Languages for Tractable Ontology Mediated Query Answering


Heng Zhang,1 Yan Zhang,2,3 Jia-Huai You,4 Zhiyong Feng,1 Guifei Jiang 5 1College of Intelligence and Computing, Tianjin University, Tianjin, China 2School of Computing, Engineering and Mathematics, Western Sydney University, Penrith, Australia 3School of Computer Science and Technology, Huazhong University of Technology and Science, Wuhan, China 4Department of Computing Science, University of Alberta, Edmonton, Canada 5College of Software, Nankai University, Tianjin, China [email protected], [email protected], [email protected], [email protected], [email protected]

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Ontology mediated query answering (OMQA) is a paradigm that generalizes the traditional database querying by enriching the database with a domain ontology (?). This paradigm has played an important role in the semantic web (??), data modelling (?), data exchange (?) and data integration (?), and has recently emerged as one of the central issues in knowledge representation as well as in databases.

A long-term major topic for OMQA is to identify proper languages that specify ontologies. There have been a large number of ontology languages proposed for OMQA since the mid 2000s. For instance, in description logics, the DL-Lite family (?), -family (?) and other variants have been proposed and extensively studied. More recently, the Datalog± family, a.k.a. existential rule languages, or dependencies in databases, have been rediscovered as promising languages for OMQA, see, e.g., (???). Most of these languages enjoy good computational properties such as the first-order rewritability or PTIME data complexity.

While all these languages are of their specific features and hence are useful in different applications, it is not realistic to implement OMQA-systems for all of them. So a natural question arises: Can we find the largest one (in the expressiveness) among the family of first-order rewritable (or PTIME-tractable) OMQA-languages? Let us call the largest language in the above sense a universal language. Clearly, it is of great theoretical and practical importance to identify the existence of universal language w.r.t. some kind of tractability, which is also the main task of this paper.

It is worth noting that the universality is one of the major principles for designing languages in both computer science and logic. For example, almost all the traditional programming languages, including C, Java and Prolog, are known to be universal for the family of Turing complete programming languages; propositional logic can express all boolean functions; and by the well-known Lindström theorem the first-order logic is the largest one among the logics that enjoy both the compactness and the Löwenheim-Skolem property; see, e.g., (?). In databases, first-order language is shown to be universal for the family of query languages with data complexity in AC0, and Datalog universal for the family of query languages with data complexity in PTIME; see, e.g., (?).

Some recent work in OMQA has been done along the line of identifying universal languages. ? (?) proved that, under a certain syntactic classification, some languages in the DL-Lite family are the maximal fragments of description logic with the first-order rewritability. By regarding OMQA as traditional database querying, ? (?) showed that weakly-guarded tuple-generating dependencies (TGDs) capture the class of EXPTIME queries; ? (?) proved that general TGDs capture the class of recursively enumerable queries. In the setting of schema mapping, ? (?) showed that the language of weakly-acyclic TGDs is universal for languages of TGDs with finite semi-oblivious chase. All of these results shed new insights on understanding the expressiveness of existential rules, but it is worth noting that OMQA is significantly different from both traditional database querying and schema mapping. To understand the expressiveness in the framework of OMQA, ? (?) proved that the language of disjunctive embedded dependencies is universal for the family of recursively enumerable OMQA-languages. Along this line, this paper will focus on tractable OMQA.

Aimed at exploiting universal languages for the tractable OMQA, in this paper we focus on three families of OMQA-languages, including first-order rewritable languages and languages whose data complexity is in AC0 or PTIME. Our contributions are summarized as follows. On one hand, we prove that there is, in general, no universal language for each of the above families of languages. On the other hand, by restricting the number of database constants involved in query answering, we propose a novel property, called the locality, to approximate the first-order rewritability, and identify the existence of universal language for the family of local OMQA-languages. All of these results hold for OMQA with query languages such as conjunctive queries, unions of conjunctive queries and acyclic conjunctive queries.

Preliminaries

Databases and Instances. We use a countably infinite set of constants and a countably infinite set of variables, and assume they are disjoint. Every term is either a constant or a variable. A relational schema consists of a set of relation symbols. Each relation symbol has an arity which is a natural number. An atoms over R (or -atom) is either an equality, or a relational atom built upon terms and relation symbols in . A fact is a variable-free relational atom. Each instance over R (or -instance) consists of a set of facts over . Instances that are finite are called databases. Suppose I is an instance. Let adom(I) denote the set of constants that occur in I. Let DB[] denote the class of all databases over schema . Given a set A of constants, by I|A we denote the subset of I in which each fact involves only constants in A.

Let I and J be instances over a relational schema , and Cadom(I)adom(J). Then every C-homomorphism from I to J is a function h:adom(I)adom(J) such that (i) R(a)I implies R(h(a))J for all relation symbols R and all tuples a of constants, and (ii) h(c)=c for all cC. If such h exists, we say that I is C-homomorphic to J, and write ICJ; in addition, we write ICJ if h is injective. For simplicity, C will be dropped if it is empty.

Queries. Fix as a relational schema. By a query over (or -query) we mean a formula built upon atoms over in some logic. The logic could be first-order logic, second-order logic, or other variants. A query is boolean if it has no free variables. For convenience, given any query q, let const(q) denote the set of constants that occur in q.

Every first-order formula is called a first-order query. A conjunctive query (CQ) is a query of the form y.φ(x,y) where φ is a finite but nonempty conjunction of relational atoms. Let q be a boolean CQ. We use [q] to denote the database that consists of all the atoms that appear in q, where variables in atoms are regarded as special constants. The Gaifman graph of q is an undirected graph with each term in q as a vertex, and with each pair of distinct terms as an edge if they cooccur in some atom in q. A boolean CQ is called acyclic if its Gaifman graph is acyclic. A union of conjunctive query (UCQ) is a first-order formula built upon atoms by connectives , and quantifier . Clearly, every UCQ is equivalent to a disjunction of CQs.

Every Datalog¬ program consists of a finite set of rules of the form xy(φ(x,y)α(x)), where α is a relational atom and φ is a finite conjunction of atoms or negated atoms; α and φ are called the head and the body of the rule, respectively. Each variable in x should have at least one positive occurrence in φ. A relation symbol is called intentional if it has at least one occurrence in the head of some rule, and extensional otherwise. No intensional relation symbol is allowed to appear in a negated atom. A Datalog¬ query is of the form (Π,𝑃)(x) where Π is a Datalog¬ program, P an extensional relation symbol, and x a variable tuple of a proper length. It is well-known that every Datalog¬ query can be translated to an equivalent formula in least fixpoint logic, see, e.g., (?).

Only boolean queries will be used in this work. For convenience, we employ CQ, ACQ and UCQ to denote the classes of boolean CQs, boolean acyclic CQs and boolean UCQs, respectively. Let FO denote the class of boolean first-order queries, FO(+,×) denote the class of boolean first-order queries that involve two built-in arithmetic relations + and ×, and Datalog()¬ denote the class of boolean Datalog¬ queries that involve a built-in successor relation Succ, and special constants min and max, denoting the minimum and the maximum elements, respectively, under the underlying order. Given a class 𝒞 of queries and a relational schema , let 𝒞[] denote the class of -queries that belong to 𝒞.

In the theory of descriptive complexity (?), it was proved that FO(+,×) and Datalog()¬ exactly capture complexity classes AC0 and PTIME respectively, where AC0 denotes the class of languages recognized by a uniform family of circuits with constant depth and polynomial size, and PTIME denotes the class of languages recognized by a deterministic Turing machine in polynomial time.

Dependencies. A disjunctive embedded dependency (DED) over a relational schema is a sentence σ of the form

xy(ϕ(x,y)z1.ψ1(x,z1)zk.ψk(x,zk))

where k0, ϕ is a conjunction of relational -atoms involving terms from xy only, each ψi is a conjunction of atoms over involving terms from xzi only, and each variable in x has at least one occurrence in ϕ. In particular, σ is called a tuple-generating dependency (TGD) if it is equality-free and k=1. For simplicity, we omit the universal quantifiers and the brackets appearing outside the atoms.

Let D be a database, Σ a set of (first-order) sentences, and q a boolean query; all of them are over a common relational schema . We write DΣq if, for all -instances I with DI, if Iσ for all sentences σΣ then Iq, where the satisfaction relation is defined in the standard way.

Ontologies and Languages in OMQA

Before identifying the existence of universal languages for OMQA, we need some notions to clarify what an ontology in OMQA is, and what an ontology language in OMQA is. To make the presented results more applicable, we will define these notions in a language-independent way.

To define ontologies in OMQA, below we generalize the notion introduced in (?) from CQs to more general query languages such as UCQs.

Definition 1.

Let 𝒟 and 𝒬 be relational schemas, and 𝒬 a class of queries. A quasi-OMQA[Q]-ontology over (𝒟,𝒬) is a set of ordered pairs (D,q), where D is a 𝒟-database and q a boolean 𝒬-query in 𝒬 such that 𝑐𝑜𝑛𝑠𝑡(q)adom(D).

Moreover, a quasi-OMQA[𝒬]-ontology O over (𝒟,𝒬) is called an OMQA[Q]-ontology if all of the following hold:

  1. 1.

    O is closed under query conjunctions, i.e.,
    (D,q)O&(D,p)O&qp𝒬(D,qp)O.

  2. 2.

    O is closed under query implications, i.e.,
    (D,q)O&p𝒬&qp(D,p)O.

  3. 3.

    O is closed under injective database homomorphisms, i.e.,
    (D,q)O&Dconst(q)D(D,q)O.

Given any logical theory Σ, we can interpret it as a quasi-OMQA[𝒬]-ontology over (𝒟,𝒬) as follows:

[[Σ]]𝒟,𝒬𝒬={(D,q):DDB[𝒟]&q𝒬[𝒬]&DΣq}.

It is easy to see that, for theories Σ in almost all the classical logic, [[Σ]]𝒟,𝒬𝒬 is indeed an OMQA[𝒬]-ontology.

With the notion of ontology, we are then able to present an abstract definition for ontology languages in OMQA.

Definition 2.

Let V be a finite but nonempty set, 𝒟 and 𝒬 relational schemas, and 𝒬 a class of queries. Then every OMQA[Q]-language over (𝒟,𝒬) (with vocabulary V) is defined as an ordered pair (T,M) such that:

  1. 1.

    T consists of a decidable set of theories, each of which is a finite string over V (i.e., an element of V);

  2. 2.

    M is a semantic mapping, i.e., a function that maps each theory in T to an OMQA[𝒬]-ontology over (𝒟,𝒬).

Example 1.

Let 𝒟 and 𝒬 be relational schemas, 𝒬 a class of queries, and T a decidable class of finite sets of DEDs. Let M be a function that maps each set ΣT to [[Σ]]𝒟,𝒬𝒬. It is easy to see that =(T,M) is an OMQA[𝒬]-language.

The language defined above is called a DED-language over (𝒟,𝒬) (induced by T). In particular, if T consists of all finite sets of DEDs, we call it the full DED-language over (𝒟,𝒬). Unfortunately, it had been proved in (?) that query answering with the full DED-language is uncomputable. In this work, we thus focus on tractable OMQA-languages. We will consider two kinds of tractability:

Definition 3.

Let 𝒟 and 𝒬 be relational schemas, 𝒞 and 𝒬 classes of queries, and 𝒦 a complexity class. An OMQA[𝒬]-language =(T,M) over (𝒟,𝒬) is

  1. 1.

    𝒞-rewritable if there is a computable function rew that maps each ordered pair (t,q)T×𝒬[𝒬] to a boolean query φt,q𝒞[𝒟] such that (D,q)M(t) iff Dφt,q; in this case, rew is called a 𝒞-rewriting function of .

  2. 2.

    𝒦-compilable if there is a computable function com that maps each ordered pair (t,q)T×𝒬[𝒬] to a Turing machine 𝕄t,q, whose running time belongs to the class 𝒦, such that (D,q)M(t) iff 𝕄t,q accepts on the input D; in this case, com is called a 𝒦-compiler of .

Example 2.

According to (?), the language of linear TGDs is both FO-rewritable and AC0-compilable, and the language of guarded TGDs is both Datalog()¬-rewritable and PTIME-compilable.

Remark 1.

Clearly, there is a nonuniform way to redefine notions in Definition 3 by allowing rewriting functions and compilers to be uncomputable. However, it is worth noting that languages defined in such a way could be intractable. In fact, there is a nonuniform FO-rewritable OMQA-language in which the query answering is highly undecidable.

Next we give the definition of universal OMQA-language.

Definition 4.

Let 𝒬 be a class of queries, 𝒟 and 𝒬 relational schemas, and =(T,M) and =(T,M) OMQA[𝒬]-languages over (𝒟,𝒬). Then we say that is at least as expressive as , written , if for each theory tT there is a theory tT such that M(t)=M(t); and has the same expressiveness as if both and .

An OMQA[𝒬]-language is called universal for a family of OMQA[𝒬]-languages over (𝒟,𝒬) if (i) , and (ii) for all languages , we have that .

Nonexistence for the General Case

One ambitious goal in OMQA is to find some universal language for the tractable OMQA. Unfortunately, the following theorem shows that this goal is in general unachievable.

Theorem 1.

Let D and Q be relational schemas such that Q contains a relation symbol of arity 2, and suppose C{𝐹𝑂,𝐹𝑂(+,×),𝐷𝑎𝑡𝑎𝑙𝑜𝑔¬()} and ACQ Q𝑈𝐶𝑄. Then there is no universal language for the family of C-rewritable OMQA[Q]-languages over (D,Q).

Since AC0 and PTIME are exactly captured by FO(+,×) and Datalog()¬ respectively, by Theorem 1 we have

Corollary 2.

Let D and Q be relational schemas such that Q contains at least one relation symbol of arity 2, and suppose K{AC0, PTIME} and ACQ Q𝑈𝐶𝑄. Then there is no universal language for the family of K-compilable OMQA[Q]-languages over (D,Q).

To prove Theorem 1, the general idea is to implement a diagonalization argument as follows. Assume by contradiction that there is a universal language for the desired family. We first give an effective enumeration for all nontrivial ontologies defined in the universal language. With this enumeration, we then construct a new OMQA[𝒬]-ontology O and a new language in which O is definable; Finally we show that is still 𝒞-rewritable, which leads to a contradiction.

Proof of Theorem 1.

Only consider the case where 𝒞=FO and 𝒬=UCQ. Assume by contradiction that there is a universal language for FO-rewritable OMQA[UCQ]-languages over (𝒟,𝒬). Let =(T,M) be such a language. Our task is to define another FO-rewritable OMQA[UCQ]-language that is strictly more expressive than . To do this, we first construct an ontology that is not definable in .

Before we present the construction, some notations are needed. W.l.o.g., we assume that there is a binary relation symbol R in 𝒬. Note that, by a repetition of the parameters, R can be always simulated by another relation symbol of arity >2. For example, one can use S(x,x,y) to simulate R(x,y). With this assumption, we first define a sequence of acyclic CQs. For all integers n1, we define

qn=x0xn(R(x0,x1)R(xn-1,xn)).

Intuitively, qn asserts that there is a cycle-free path (via R) of length n+1 in the intended model.

Let s=(s1,s2,) be an effective enumeration11 1 I.e., there is a Turing machine to generate such an enumeration. of all the theories in T. Such an enumeration clearly exists. Now our task is to construct countably infinite sequences N and t, where N=(N1,N2,) is a sequence of positive integers, and t=(t1,t2,) is a sequence of theories in T. The sequences are required to have the following properties:

  1. 1.

    N is monotonic increasing, i.e., NiNj if i<j;

  2. 2.

    For all k1 there exists a database D with (D,qk)M(tk) and |adom(D)|Nk;

  3. 3.

    For all tTt there exists i>0 such that |adom(D)|>Ni+1 for all databases D with (D,qi)M(t).

Procedure 1 is devoted to generate the desired sequences.

\Fori1 \KwTo \Fornn \KwTo \Forj1 \KwTon \lIfD s.t. (D,qi)M(sj) &|adom(D)|n goto line 9 \IfD s.t. (D,qi)M(sj) &|adom(D)|n+1 nn+1 \[email protected]
goto line 9 \[email protected]
delete sj from s \[email protected]
Procedure 1 Generating Sequences t and N

Now we have the following property:

Claim 1. The sequences N and t generated by Procedure 1 satisfy Properties (1-3).

Proof.

Properties 1 and 2 are clear from Procedure 1. So it remains to show Property 3. Suppose t=sk for some k1. Since t has no occurrence in t, according to Procedure 1, we know that, whenever lines 5 and 6 are executed for j=k, conditions in both “if” statements must be false. (Otherwise we will have tt.) In addition, as n increases arbitrarily, we know that line 6 must be executed. This means that there is some i1 such that |adom(D)|>Ni+1 for all databases D with (D,qi)M(t), which then yields the claim. ∎

Now we are able to construct the desired ontology. To do this, we first define some notations. For n1, let λn denote the sentence x1xn1i<jn¬(xi=xj), which asserts that the intended domain contains at least n elements. Given a boolean UCQ q, if there exists an integer k1 such that qkq, let φq denote λNm+1 where m is the least integer among such ks, and let φq denote the sentence x¬(x=x) (always false) if no such ks exist. Furthermore, we define

O={(D,q):DDB[𝒟]&qUCQ[𝒬]&Dφq}.

It is not difficult to prove the following properties:

Claim 2. Let p and p be boolean UCQs. Then we have:

  1. 1.

    If pp then φpφp;

  2. 2.

    φppφpφp

Proof.

1. For the case where there exists no integer i1 such that qip, we have that φp=x¬(x=x), which is always unsatisfiable. This implies that φpφp trivially.

Now it remains to show the case where there exists i1 such that qip. Let m be the least integer such that qmp. Then we have φp=λNm+1. From pp, we then have qmp. Let n be the least integer such that qnp. Then it is clear that nm. According to Property 1, we also know that NnNm, which implies that λNm+1λNn+1, or equivalently, φpφp. This proves Statement 1.

2. For the case where there is no integer i1 such that qip, we have that φp=x¬(x=x)=φpp, which implies the desired equivalence. The same argument applies to the case where there is no integer i1 such that qip.

Now, it remains to consider the case where there are integers i and j such that qip and qjp. Let m and n denote the least integers among such is and js, respectively. W.l.o.g., suppose mn. Then we have both qmqnp and qmp. Combining both of them, we know that m is the least integer such that qmpp. Thus, we have that φp=λNm+1=φpp. On the other hand, it is also clear that λNm+1λNn+1, or equivalently φpφp, which implies that φpφpφp. Consequently, we obtain that φppφpφp, which completes the proof. ∎

Claim 3. O is an OMQA[UCQ]-ontology.

Proof.

The closure property of O under injective database homomorphisms is clear since, for any boolean UCQ q, φq is preserved under injective homomorphisms.

Next we show that O is closed under query conjunctions. Suppose (D,p)O and (D,p)O. By definition, we have both Dφp and Dφp, which means that Dφpφp. By Statement 2 of Claim 2, we then have that Dφpp, which implies that (D,pp)O as desired.

Now it remains to show the closure of O under query implications. Suppose (D,p)O and pp. We need to prove (D,p)O. From (D,p)O we have Dφp, and from pp, we have φpφp by Statement 1 of Claim 2. Combining both of these, we obtain Dφp. By definition, it follows that (D,p)O, which completes the proof. ∎

Claim 4. OM(t) for any theory tT.

Proof.

First consider the case where t occurs in t, and suppose t=tk for some k1. According to the definition of t, we know that there is a database D with (D,qk)M(tk) and |adom(D)|Nk. On the other hand, by the definition of φqk has no model D with |adom(D)|Nk, which means that there is no database D with (D,qk)O and |adom(D)|Nk. Consequently, we have OM(t).

Now it remains to consider the case where t does not occur in t. By Claim 1, it suffices to show that for every integer k1 there exists a database D with (D,qk)O and |adom(D)|Nk+1, or equivalently, φqk has a model that contains at most Nk+1 elements. According to the definition of φqk, the latter is indeed true. This also implies that OM(t), and completes the proof immediately. ∎

With Claims 3 and 4, we are now in the position to prove the desired theorem. Let t be a binary string such that tT, and let T=T{t}. Following the decidability of T, we have the decidability of T. Let M be a function that extends M by mapping t to O, and let =(T,M). By Claim 3, we know that is an OMQA[UCQ]-language. Suppose rew is an FO-rewriting function. Let rew be a function that extends rew by mapping (t,q) to φq for all boolean UCQs q. By a slight modification to Procedure 1, one can easily devise an algorithm to compute Ni (and si) on given integer i1. This implies that rew is computable. By definition, we know that rew is an FO-rewriting function, which implies that is FO-rewritable. By Claim 4, we also know that is strictly more expressive than , a contradiction as desired. And this completes the proof. ∎

Remark 2.

Since the sentence φq defined in the above proof is also an FO(+,×)-sentence, so the proof directly applies to the case of FO(+,×). For a proof of the remaining case, one can convert φq to a Datalog()¬-program.

Locality to the Rescue

In the last section, we proved that there is no universal language for tractable OMQA in general. Then, a natural question arises as to whether one can find a natural property that approximates the tractability but still allows the existence of a universal language. The challenge here is that the property should be manageable enough to avoid a diagonalization argument (see the proof of Theorem 1). Below we propose a property as an approximation of the FO-rewritability.

Locality as Approximation of FO-rewritability

A bound function is a computable function : such that (n)n for n. To simplify the presentation, we fix a way to represent bound functions, e.g., one can represent each bound function by a Turing machine that computes it. A class of bound functions is called decidable if the class of representations of those bound functions is decidable.

To measure the size of a query, we fix a computable function |||| that maps each UCQ to a positive integer. Clearly, there are many methods to define ||||. The only restriction is that we require ||pq||||p||+||q|| for all UCQs p and q.

Definition 5.

Let 𝒟 and 𝒬 be relational schemas, and 𝒬 a class of queries, O an OMQA[𝒬]-ontology over (𝒟,𝒬), and a bound function. Then O is called -local if for all boolean 𝒬-queries q𝒬 and all 𝒟-databases D there is a set A, which consists of at most (||q||) constants, such that

(D,q)O iff (D|A,q)O.

Furthermore, given an OMQA[𝒬]-language , a bound function and a class 𝖥 of bound functions, is called -local if all-OMQA[𝒬] ontologies defined in is -local, and is 𝖥-local if it is -local for some bound function 𝖥.

One might question why the bounded locality is a good approximation to the first-order rewritability. Let +FO() denote the class of first-order sentences built on atoms and inequalities by using connectives , and the quantifier . Obviously, this class is exactly the class of UCQs with inequalities. It had been observed by ? that +FO() captures the class of first-order sentences that preserved under injective homomorphisms (?). It remains open whether such a preservation theorem holds on finite structures (or databases). If this is indeed true, by the following proposition we then have that an OMQA-language is FO-rewritable iff it is -local for some bound function .

Proposition 3.

Let D and Q be relational schemas, O an OMQA[UCQ]-ontology over (D,Q), and a bound function. Then O is -local iff for each boolean Q-UCQ q there is a +𝐹𝑂()-sentence φ involving at most (||q||) terms such that (D,q)O iff Dφ for all D-databases D.

Proof.

For the direction of “if”, let us assume that for each boolean 𝒬-UCQ q there is a +FO()-sentence φ involving at most (||q||) terms such that (D,q)O iff Dφ for all 𝒟-databases D. We need to show that O is -local. Let q be a boolean 𝒬-UCQ, and φ a +FO()-sentence involving at most (||q||) terms such that (D,q)O iff Dφ for all 𝒟-databases D. By the assumption, such a sentence does exist. Suppose φ=xψ where ψ is a quantifier-free existential positive first-order formula with inequalities and involving at most (||q||) terms. Let D be a 𝒟-database. If Dφ, then let A={s(x):xx}const(φ), where s is an assignment such that Dψ[s]. Otherwise, let A be any subset of adom(D) such that |A|k. In both cases we have the following: (i) |A|k, and (ii) Dφ iff D|Aφ. From the latter, we know that (D,q)O iff (D|A,q)O. We thus yields that O is -local as desired.

Conversely, suppose O is -local. Let q be a boolean 𝒬-UCQ. Given a 𝒟-database D, let Aadom(D) be a witness of the locality of O w.r.t. q, let φD denote the conjunction of all facts in D; let φ^D be the formula obtained from φD by replacing each constant that does not occur in q by a fresh variable; and let ψD denote the sentence x(φ^Dλx), where x is the tuple of all variables occurring in φ^D, and λx denotes the conjunction of xixj for each pair of distinct variables xi,xjx. It is easy to see that, up to logical equivalence, there is only a finite number of ψD for all 𝒟-databases D such that (D,q)O. Let ψq be a disjunction of ψD for all 𝒟-databases D with |adom(D)|(||q||) and (D,q)O. Clearly, ψq is equivalent to a +FO()-sentence that involves at most (||q||) terms. To complete the proof, it suffices to show the following property:

Claim. (D,q)O iff Dψq for all databases D over 𝒟.

Now it remains to prove the claim. Let D be a database over 𝒟. We first prove the direction of “only if”. Suppose (D,q)O. Since O is -local, there should be a 𝒟-database DD such that (D,q)O and |adom(D)|(||q||). By the definition of ψq, we know that ψD is equivalent to a disjunct of ψq. It is clear that DψD. From DD we the have DψD, which implies Dψq as desired.

For the converse, we assume that Dψq. Then there is a database D over 𝒟 with |adom(D)|(||q||) such that (i) (D,q)O and (ii) ψD is a disjunct of ψq. From (ii) we have DψD, which means that there is an injective C-homomorphism from D to D, where C=const(q). As O is closed under injective database homomorphisms, we have (D,q)O as desired, which completes the proof. ∎

Remark 3.

Proposition 3 reveals an intrinsic connection between the bounded locality and the complexity of rewritings. We will elaborate this in an extended version of this paper.

Universal Language for Local OMQA

Now it remains to know whether the bounded locality allows the existence of universal languages. For convenience, in the rest of this section, we fix F as a decidable class of bound functions; fix D and Q as a pair of disjoint relational schemas. The disjointness will not introduce any real limitation. For instance, in a DED-language, given any set Σ of DEDs, one can construct another set Σ of DEDs by introducing a fresh relation symbol 𝑅 for each 𝑅𝒟, and adding copy rules of the form 𝑅(x)𝑅(x). Clearly, Σ has the same behaviour over (𝒟,𝒬) as Σ over (𝒟,𝒬), where 𝒟 denotes the schema consisting of all the fresh symbols.

Surprisingly, we have the following result.

Theorem 4.

Let Q be a decidable class of UCQs. Then there exists a DED-language that is universal for the family of F-local OMQA[Q]-languages over (D,Q).

Let be any bound function in 𝖥. To prove Theorem 4, the general idea is to develop a transformation that converts every DED set to an -local DED set. In addition, for each DED set that is already -local, the transformation is required to preserve the semantics of query answering. If such a transformation exists, since DED is universal for the family of OMQA-languages in which query answering is recursively enumerable, we then obtain a universal language for the family of 𝖥-local OMQA-languages.

Let us begin with a finite set Σ of DEDs over a relational schema 𝒟𝒬. To implement the desired transformation, we first show how to construct an -local OMQA[𝒬]-ontology from Σ. As a natural idea, one may expect to define the desired ontology by removing all the pairs (D,q) from the original ontology (defined by Σ) where q is not -local on D. Unfortunately, the ontology defined above is in general not well-defined. To construct the desired ontology, the -locality and the closure under both query conjunctions and query implications should be maintained simultaneously.

Below we explain how to construct the ontology. We need to fix a strict linear order over 𝒬-UCQs firstly. The strict linear order is required to satisfy pq for all 𝒬-UCQs p and q such that ||p||<||q||. Clearly, such an order always exists. For the given set Σ of DEDs, let SΣ be the set that consists of the ordered pair (D,q) if D is a 𝒟-database, q is a 𝒬-UCQ in 𝒬, and the following condition holds:

p 𝒬[𝒬]:||p||||pr^(q)||&pr^(q)p (1)
Aadom(D) s.t. |A|(||p||)&D|AΣp

where pr(q) is the set of boolean UCQs p such that pq and (D,p)SΣ, and pr^(q) denotes the conjunction of q and all UCQs in pr(q). Moreover, we define OΣ as the minimum superset of SΣ that is closed under query conjunctions, query implications and injective database homomorphisms.

The constructed ontology enjoys several properties which will play important roles in our proof for Theorem 4.

Lemma 5.

If [[Σ]]D,QQ is -local, then OΣ=[[Σ]]D,QQ.

Proof.

Follows from the definition of OΣ. ∎

Lemma 6.

OΣ is an -local OMQA[Q]-ontology.

Proof.

We first claim that, for all UCQs q with (D,q)OΣ, there exists an integer k1 and UCQs p1,,pk such that (D,pi)SΣ for each pi and p1pkq. This can be proved by a routine induction of the construction of OΣ.

With the claim, w.l.o.g., we assume p1p2pk. Let p denote the query p1pk. Obviously, it holds that pr^(pk)p, which implies that pr^(pk)q immediately. On the other hand, by definition we know that pi is -local on D, i.e., there is a set Aiadom(D) such that D|AiΣpi. Let A=A1Ak. We then have D|AΣp and |A|(||p1||)++(||pk||)(||p1||++||pk||)(||p||). Consequently, we obtain that p is -local on D.

Now, it remains to show that q is -local on D. For the case where ||p||||q||, from pq and D|AΣp, we obtain that D|AΣq, which implies that q is -local on D. For the other case, it must be true that ||p||>||q||. From the fact that ||pr^(pk)||||p||, we know that ||pr^(pk)||>||q||. Since (D,pk)Sk is true, by Condition (1) we know that there is a set Badom(D) such that D|BΣq, which means that q is -local on D. This then completes the proof. ∎

Lemma 7.

OΣ is recursively enumerable.

Proof.

The lemma is a corollary of the following facts: (1) The validity problem for inference in first-order logic is recursively enumerable; (2) The query containment problem for UCQs is decidable; (3) There are only a finite number of boolean 𝒬-UCQs p with ||p||||pr^(q)||; (4) There are only a finite number of subsets of adom(D). (5) Both and |||| are computable. (6) 𝒬 is decidable. ∎

Now, to define the transformation, it remains to show how to encode the ontology OΣ by another set of DEDs. Suppose D is the underlying database, and q the underlying query. The encoding will be implemented in the following way:

  1. 1.

    Simulate the query answering of q under OΣ and D.

  2. 2.

    If the answer of Stage 1 is positive, then nondeterministically copy disjuncts of q to generate the universal models.

The main challenges of implementing the above encoding are as follows. Firstly, instead of a single universal model, we need to generate a set of universal models in Stage 2. It is not clear whether the technique of generating universal model in (?) can be applied to this situation. Secondly, to encode the computation in Stage 1, a successor relation is needed. But it seems impossible to define such a relation in the language of DEDs explicitly.

Below we explain how to implement the encoding.

Defining Successor and Arithmetic Relations.   To implement the desired encoding, a successor relation needs to be defined so that the constants in the underlying database D can be ranged over. As there is no negation in the body of DEDs, it seems impossible to construct DEDs to traverse ALL constants in adom(D). Fortunately, thanks to the closure of OMQA-ontologies under injective database homomorphisms, we do not need a successor relation on the full domain. The reason is as follows. Suppose we want to show that a query q is derivable from the database D under some ontology O. As O is closed under injective database homomorphisms, it is equivalent to show whether there is a subset A of adom(D) such that q is derivable from D|A under O.

To range over subsets of adom(D), we employ the partial successor relations on adom(D), each of which is a successor relation on some subset of adom(D). Clearly, there is a partial successor relation for each subset A of adom(D). With the mentioned property, we will define some DEDs to generate partial successor relations on adom(D). To check whether q is derivable from D under O, it would be sufficient to test whether the computation of Stage 1 halts with “accept” under a certain partial successor relation.

Our method to generate partial successor relations was inspired by ?’s technique to define successor relations in the language of TGDs (?). In that paper they showed that every homomorphism-closed database query can be defined by a set of TGDs. It is worth noting that the ontology mediated queries focused on this paper are not necessary to be closed under homomorphisms. So their technique cannot be applied directly. Fortunately, a linear order on adom(D) can be easily defined by a set of DEDs, and with this order, we are able to use their idea to generate all partial successor relations that are compatible with the defined order. Now we show how to implement this idea.

Let AD be a unary relation symbol that will be interpreted as adom(D). Clearly, such a relation can be easily defined by some DEDs. With the relation AD, a linear order relation Less over AD can then be defined in a routine way:

𝐴𝐷(x)𝐴𝐷(y)𝐿𝑒𝑠𝑠(x,y)x=y𝐿𝑒𝑠𝑠(y,x) (2)
𝐿𝑒𝑠𝑠(x,y)𝐿𝑒𝑠𝑠(y,z)𝐿𝑒𝑠𝑠(x,z) (3)
𝐿𝑒𝑠𝑠(x,x) (4)

To generate all partial successor relations compatible with Less, we link each constant c in adom(D) with a alias a by the relation 𝐿𝑖𝑛𝑘(c,a). Suppose a1 and a2 are aliases of constants c1 and c2 respectively, by 𝑁𝑒𝑥𝑡(a1,a2) we mean that c2 is the immediate successor of c1 in the underlying successor relation. The head (resp., the tail) of a partial successor relation is denoted by 𝐹𝑖𝑟𝑠𝑡(a) (resp., 𝐿𝑎𝑠𝑡(b)). In particular, we use a as the name of the underlying relation. Every partial successor relation is required to have a head and a tail. To generate these relations, we use the following DEDs:

𝐴𝐷(x)v𝐿𝑖𝑛𝑘(x,v)𝐿𝑎𝑠𝑡(v)𝐹𝑖𝑟𝑠𝑡(v) (5)
𝐴𝐷(x)v𝐿𝑖𝑛𝑘(x,v)𝐿𝑎𝑠𝑡(v)𝑃𝑎𝑟𝑡𝑖𝑎𝑙(v) (6)
𝐿𝑒𝑠𝑠(x,y)𝐿𝑖𝑛𝑘(y,v)𝑃𝑎𝑟𝑡𝑖𝑎𝑙(v)u𝐿𝑖𝑛𝑘(x,u)𝑁𝑒𝑥𝑡(u,v)𝐹𝑖𝑟𝑠𝑡(u) (7)
𝐿𝑒𝑠𝑠(x,y)𝐿𝑖𝑛𝑘(y,v)𝑃𝑎𝑟𝑡𝑖𝑎𝑙(v)u𝐿𝑖𝑛𝑘(x,u)𝑁𝑒𝑥𝑡(u,v)𝑃𝑎𝑟𝑡𝑖𝑎𝑙(u) (8)

To understand how these DEDs work in more detail, please refer to the following example.

Figure 1: The Instance Generated in Example 3.
Example 3.

Let D be a database which involves only constants c1,c2 and c3, and suppose the linear order defined by Less is c1>c2>c3. By an exhaustive application of DEDs (6-8),22 2 To make the figure simple, we use the semi-oblivious chase. we obtain an instance as illustrated by Figure 1.

As seen in Figure 1, there are 7 partial successor relations s1,,s7 generated in the instance. For instance, s1 defines the partial successor relation involving only c1; s5 defines the relation c2>c3; s7 defines the relation c1>c2>c3.

To encode Stage 1 mentioned before, we need to generate a linear order and the corresponding successor relation on a countably infinite domain, making sure they are compatible with the underlying partial successor relation on adom(D). More relations are needed to do this. 𝑍𝑒𝑟𝑜(o,a) means that a is the least element under the order o; 𝐷𝑀𝑎𝑥(o,a) states that a is the largest element in adom(D) under the order o; 𝑆𝑢𝑐𝑐(o,a,b) denotes that b is the immediate successor of a under the order o; and 𝐿𝑇(o,a,b) asserts that a is less than b under the order o. For a technical reason, we also need some auxiliary relations. 𝑇𝐶(o,a) denotes that a is not less than the largest element in adom(D) under the order o, and 𝑅𝑇(o,a,c) means that a is an alias of c and it is used to build the order o. All of these are defined by the following DEDs:

𝐹𝑖𝑟𝑠𝑡(w)𝐿𝑖𝑛𝑘(x,w)𝑍𝑒𝑟𝑜(w,x)𝑅𝑇(w,w,x) (9)
𝑁𝑒𝑥𝑡(u,v)𝑅𝑇(w,u,x)𝐿𝑖𝑛𝑘(y,v)𝑆𝑢𝑐𝑐(w,x,y)𝑅𝑇(w,v,y) (10)
𝐿𝑎𝑠𝑡(v)𝑅𝑇(w,v,x))𝐷𝑀𝑎𝑥(w,x)𝑇𝐶(w,x) (11)
𝑇𝐶(v,x)z.𝑆𝑢𝑐𝑐(v,x,y)𝑇𝐶(v,y) (12)
𝑆𝑢𝑐𝑐(v,x,y)𝐿𝑇(v,x,y) (13)
𝐿𝑇(v,x,y)𝐿𝑇(v,y,z)𝐿𝑇(v,x,z) (14)

With these relations, it is routine to define arithmetic relations such as 𝐴𝑑𝑑(o,a,b,c) (asserting that c=a+b under the order o) and 𝐵𝑖𝑡i(o,a,b) (asserting that the b-th bit of the binary representation of a is i). We omit the details here.

Simulating Query Answering under OΣ.   With a partial successor relation and the related arithmetic relations, we are now in the position to define some DEDs to simulate the query answering of q under OΣ and D.

Our encoding that implements the simulation of query answering is almost the same as that in Section 5.3 of (?). As proved by ? (see Proposition 6 of (?)), all recursively enumerable OMQA-ontologies can be recognized by a certain class of Turing machines, called convergent 2-bounded nondeterministic Turing machines. Although the queries involved in that work are only boolean CQs, by a similar argument one can show that the result can be generalized to the case where boolean UCQs are involved. The only difference is that, to deal with UCQs, we have to change the format of input slightly. Due to the space limit, we omit the details here.

With the result mentioned above, we can then find a convergent 2-bounded nondeterministic Turing machine 𝕄Σ to recognize OΣ. By employing the DEDs defined in Section 5.3 of (?) (with a slight modification to specify the partial successor relation), we are then able to simulate the computation of 𝕄Σ on the input (D,q).

To restore the result of the query answering, we use a binary relation symbol Accept. By 𝐴𝑐𝑐𝑒𝑝𝑡(o,a) we mean that the machine 𝕄Σ halts on input (D,q) with “accept”, and o is the partial successor relation to implement the computation.

Generating Universal Models.   By applying all the DEDs that we have constructed, we will obtain the class of boolean UCQs that are derivable from D under OΣ. With such a class of UCQs, now our task is to construct a universal model set.

Given a class 𝒟 of databases and a set C of constants, let

C𝒟={D:D𝒟}

where, for each database D𝒟, D is an isomorphic copy of D such that, for any pair of distinct databases D1,D2𝒟, only constants from C will be shared by D1 and D2.

Let D be a 𝒟-database, and O an OMQA[UCQ]-ontology over (𝒟,𝒬). Given a boolean UCQ q, let 𝒟q denote the set consisting of [p] for each disjunct (a boolean CQ) of q. Let

Γ(O,D)={𝒟q:(D,q)O}.

Let 𝒰(O,D) denote the set that consists of CH for each minimum hitting set H of Γ(O,D), where C=adom(D).

Proposition 8.

Let O be an OMQA[UCQ]-ontology over (D,Q), let D be a D-database, and let q be a boolean Q-UCQ such that 𝑐𝑜𝑛𝑠𝑡(q)𝑎𝑑𝑜𝑚(D). Then (D,q)O iff Iq for all instances IU(O,D).

Proof.

Let Λ denote the set of all boolean UCQs p such that (D,p)O. We first show a property as follows.

Claim. Λq iff Iq for all instances I𝒰(O,D).

Proof.

First consider the direction of “only if”. Suppose we have Λq, and let I be any instance in 𝒰(O,D). We need to prove that Iq. According to the definition of 𝒰(O,D), we know that there is a minimum hitting set H of Γ(O,D) such that I=𝑎𝑑𝑜𝑚(D)H. This implies that for each UCQ q0Λ there is a disjunct p (which is a boolean CQ) of q0 such that [p] has an isomorphic copy in I. Consequently, we have that Iq0 for all boolean UCQs q0Λ. From the assumption that Λq, we conclude Iq as desired.

Next let us turn to the direction of “if”. Suppose Iq for all instances I𝒰(O,D). Now our task is to prove that Λq. Let J be an arbitrary instance such that Jp for all boolean UCQs pΛ. Take p as any boolean UCQ in Λ. W.l.o.g., we write p as the form p1pn where each pi is a boolean CQ. Let φp{p1,,pn} be any disjunct of p such that Jφp. Such a disjunct always exists because Jp. Suppose φp is of the form xpψp, where ψp is a conjunction of atoms and xp a tuple of variables. Let sp be an assignment such that Jψp[sp]. Let H denote the set that consists of [φp] for each UCQ pΛ, and let I denote the instance 𝑎𝑑𝑜𝑚(D)H. Let h be a mapping that maps the isomorphic copy of x in I to sp(x) if pΛ and xxp, and maps each constant in 𝑎𝑑𝑜𝑚(D) to itself. Clearly, h is an 𝑎𝑑𝑜𝑚(D)-homomorphism from I to J. By the assumption made in the begin of this paragraph, we know Iq. As q is preserved under 𝑎𝑑𝑜𝑚(D)-homomorphisms, we conclude Jq, which completes the proof of the claim. ∎

According to the above claim, to prove the desired proposition, it suffices to show that (D,q)O iff Λq. The direction of “only if” is trivial since from (D,q)O we already have qΛ. It thus remains to show the converse. Suppose Λq. According to the compactness, there is a finite subset Λ0 of Λ such that Λ0q. Let q0 denote the conjunction of all UCQs in Λ0. Obviously, q0 is also a boolean UCQ. By the definition of Λ, we know that for each UCQ qΛ0 we have (D,q)O. Since O is closed under query conjunctions, we obtain (D,q0)O. By Λ0q, it also holds that q0q. Furthermore, by applying the closure property of O under query implications, we conclude (D,q)O, which completes the proof of the proposition immediately. ∎

With Proposition 8, we are now able to construct a set of DEDs to generate 𝒰(O,D). Several relations are needed to access the encoding of a query. We use 𝑈𝐶𝑄(o,a) to denote that, under the order o, a is the representation (e.g., the Gödel number) of a boolean UCQ, and use 𝑈𝑛𝑖𝑜𝑛(o,a,b,c) to assert that, under the order o, the boolean UCQ a is the disjunction of a boolean CQ b and a boolean UCQ c. By 𝐶𝑄(o,a) we mean that a is the encoding of some boolean CQ under the order o, and by 𝑄𝑉𝑎𝑟(o,a,b) we means that b is a variable that occurs in the boolean CQ encoded by a under the order o. Moreover, we assume that 𝑄1,,𝑄n enumerates all the relation symbols in 𝒬. For i=1,,k, let 𝐻𝑎𝑠𝑄i(o,a,t) denote that 𝑄i(t) is an atom in the CQ encoded by a. It is easy to see that all these relations can be defined by standard arithmetic relations.

Now, we use the following DEDs to nondeterministically choose which disjunct of a boolean UCQ to be true:

𝐴𝑐𝑐𝑒𝑝𝑡(v,x)𝑇𝑟𝑢𝑒(v,x) (15)
𝑇𝑟𝑢𝑒(v,x)𝑈𝑛𝑖𝑜𝑛(v,x,y,z)𝑇𝑟𝑢𝑒(v,y)𝑇𝑟𝑢𝑒(v,z) (16)

where 𝑇𝑟𝑢𝑒(o,a) states that the boolean CQ encoded by a under the order o is chosen to be true in the intended model.

To generate a copy of a boolean CQ in the intended universal model, we employ the following DEDs:

𝐶𝑄(v,x)𝑄𝑉𝑎𝑟(v,x,y)z𝐶𝑜𝑝𝑦(v,x,y,z) (17)
𝐶𝑄(v,x)𝐴𝐷(y)𝐶𝑜𝑝𝑦(v,x,y,y) (18)
𝑇𝑟𝑢𝑒(v,x)𝐶𝑄(v,x)𝐻𝑎𝑠𝑄i(v,x,y)𝐶𝑜𝑝𝑦(v,x,y,z)𝑄i(z) (19)

where 𝐶𝑜𝑝𝑦(v,x,y,z) denotes 1jk𝐶𝑜𝑝𝑦(v,x,yj,zj) if y=y1yk, z=z1zk, and k is the arity of 𝑄i. Intuitively, the first rule generates a copy for each variable in the CQ q, the second one asserts that the constant in q will not change, and the third one then copy atoms in q into the universal model and implement some necessary substitutions.

Let loc(Σ) denote the set of DEDs that we have defined in this section. From the encoding and Proposition 8 we have

Lemma 9.

[[loc(Σ)]]𝒟,𝒬𝒬=OΣ.

Clearly, given the representation of any bound function and any finite set Σ of DEDs, constructing loc(Σ) is computable. Let T𝖥 consist of loc(Σ) for each finite set Σ of DEDs and each bound function 𝖥. Since 𝖥 is decidable, we know that T𝖥 is also decidable. Let 𝖥 be the DED-language induced by T𝖥. By Lemmas 95 and 6, 𝖥 must be universal for the family of 𝖥-local OMQA-languages.

Concluding Remarks

We have established the nonexistence of universal language for both FO-rewritable and PTIME-tractable OMQA. As a rescue, we also proposed a novel property, called the locality, as an approximation to the FO-rewritability, and proved that there is some language of DEDs which is universal for OMQA with bounded locality. In spite of the unnaturalness of the constructed language, we believe that the proposed property would shed light on finding natural universal languages, as well as on identifying new tractable languages.

References