The Query Translation Landscape: a Survey

  • 2019-10-07 22:37:33
  • Mohamed Nadjib Mami, Damien Graux, Harsh Thakkar, Simon Scerri, Sören Auer, Jens Lehmann

Abstract

Whereas the availability of data has seen a manyfold increase in past years, its value can only be shown if the data variety is effectively tackled, one of the prominent Big Data challenges. The lack of data interoperability limits the potential of its collective use for novel applications. Achieving interoperability through the full transformation and integration of diverse data structures remains an ideal that is hard, if not impossible, to achieve. Instead, methods that can simultaneously interpret different types of data available in different data structures and formats have been explored. On the other hand, many query languages have been designed to enable users to interact with the data, from relational, to object-oriented, to hierarchical, to the multitude of emerging NoSQL languages. Therefore, the interoperability issue could be solved not by enforcing physical data transformation, but by looking at techniques that are able to query heterogeneous sources using one uniform language. Both industry and research communities have been keen to develop such techniques, which require the translation of a chosen 'universal' query language to the various data model specific query languages that make the underlying data accessible. In this article, we survey more than forty query translation methods and tools for popular query languages, and classify them according to eight criteria. In particular, we study which query language is the most suitable candidate for that 'universal' query language. Further, the results enable us to discover the weakly addressed and unexplored translation paths, to discover gaps and to learn lessons that can benefit future research in the area.

 


Mohamed Nadjib Mami Enterprise Information Systems, Fraunhofer IAIS, St. Augustin & Dresden, Germany Damien Graux Harsh Thakkar Smart Data Analytics group, University of Bonn, Germany Simon Scerri Enterprise Information Systems, Fraunhofer IAIS, St. Augustin & Dresden, Germany Sören Auer Jens Lehmann
October 2019


Introduction

Query languages have come a long way during the last few decades. The first database query language, SQL, was formally introduced in the early seventies [chamberlin1974sequel] following the earlier proposed and well-received relational model [codd1970relational]. SQL has influenced the design of dozens of query languages, from several SQL dialects, to object-oriented, graph, columnar, and the various NoSQL languages. These query languages are implemented and used in an unprecedented variety of storage and data management systems. In order to leverage the advantages of these solutions, companies and institutions are choosing to store their data in different representations, a phenomenon known as Polyglot Persistence [sadalage2013nosql]. As a result, large data repositories with heterogeneous data sources are being generated (also known as Data Lakes [2010dixon]), exposing various query interfaces to the user. Integrating this heterogeneous data (Big Data Variety [laney2012deja]) into a unified format and system, as has historically been the case with, e.g., data warehouses, is nowadays becoming irrelevant. This is because (1) data is very large in size (Big Data Volume), and (2) companies are less likely to sacrifice data freshness, especially with the advances in streaming and IoT technologies (Big Data Velocity).

On the other hand, while computer scientists spent the last decades looking for the holy grail of data representation and querying, it is now accepted that no optimal data storage and query paradigm exists. Instead, different storage and query paradigms have different characteristics, especially in terms of representation, query expressivity and scalability, and different approaches strike a different balance between expressivity and scalability. While SQL, for example, comprises sophisticated data structuring and a very expressive query language, NoSQL trades schema and query expressivity for scalability. As a result, since no optimal representation exists, different storage and query paradigms have their right to exist based on the requirements of various use cases.

Given the resulting high variety, the challenge is how the collected data sources can be accessed in a uniform ad hoc way. Learning the syntax of their respective query languages is counterproductive, as these query languages may substantially differ in both their syntax and semantics. A plausible approach is to develop means to map and translate between different storage and query paradigms. One way to achieve this is by leveraging the existing query translators, and building wrappers that allow the conversion of a query in a single language to the various query languages of the underlying data sources. This has stressed the need for a better understanding of the translation methods between the query languages.

The topic covered in this survey, namely Query Translation, is horizontal to and directly concerns many Computer Science domains, from Information Retrieval, Databases, Data Integration, Data Analytics and Polyglot Persistence to Data Publishing and Archiving. Thus, the topic can be of interest to a broad audience: from researchers working specifically on Query Translation topics, to general users who solely interact with an existing system using those query languages and need to transition from one language to another.

Related Surveys.

Several studies investigating query translation methods exist in the literature. They typically tackle pair-wise translation methods between two specific types of query languages, e.g., [krishnamurthy2003xml] surveys XML languages-to-SQL query translations, and [michel2014survey, spanos2012bringing, sahoo2009survey] survey SPARQL-to-SQL query translations. However, to the best of our knowledge, no survey has tackled the problem of universal translation across several query languages.

Contributions.

In this survey article we take a broader view over the query translation landscape. We consider existing query translation methods that target many widely-used and standardized query languages. Those include query languages that have withstood the test of time and recent ones experiencing rapid adoption. The contributions of this article can be summarised as follows:

  • We propose eight criteria shaping what we call a Query Translation Identity Card; each criterion represents an aspect of the translation method.

  • We review the translation methods that exist between the most popular query languages, whereby popularity is judged based on a set of defined measures. We then categorize them based on the defined criteria.

  • We provide a set of graphical representations of the various criteria in order to facilitate information reading, including a historical timeline of the query translation evolution.

  • We discuss our findings, including the weakly addressed query translation paths or the unexplored ones, and report on some identified gaps and lessons learned.

Considered Query Languages

We chose the most popular query languages in four database categories: relational, graph, hierarchical and document-oriented databases. We look at the standardization effort, number of citations to relevant publications, categorizations found in recently published works and technologies using the query languages. Subsequently, we introduce our chosen query languages and motivate the choice. We provide a query example for each of these query languages. Our example query corresponds to the following natural language query: “Find the city of residence of all persons named Max.”

Relational Query Languages

SQL

is the de facto relational query language first described in [chamberlin1974sequel]. It has been an ANSI/ISO standard since 1986/1987 and is continually receiving updates [gulutzan1999sql], the latest of which was published in 2016.
Example: SELECT place FROM Person WHERE name = "Max"

Graph Query Languages

The recently published work at the ACM Computing Surveys [angles2017foundations] features three query languages: SPARQL, Cypher and Gremlin. A blogpost [topGraphDatabase] published by IBM Developer in 2017 sees those query languages as most popular; GraphQL is also mentioned, but it has far less scientific and technological adoption.

SPARQL

is the de facto language for querying RDF data. Of the three surveyed graph query languages, only SPARQL became a standard (by W3C in 2008), and it is still receiving updates [sparql10, sparql11], the latest of which is SPARQL 1.1 [gearon2013sparql], published in 2013. Research articles on SPARQL foundations [perez2006semantics, perez2009semantics, prudhommeaux2008sparql] are among the most cited across all graph query languages.
Example: SELECT ?c WHERE {?p :type :Person . ?p :name "Max" . ?p :city ?c }

Cypher

is Neo4j’s query language, developed in 2011 and open-sourced in 2015 under the OpenCypher project [green2018opencypher]. Cypher has recently been formally described in a scientific article [francis2018cypher]. At the time of writing, Neo4j tops the DB engine ranking [DBEnginesRanking] of Graph DBMSs.
Example: MATCH (p:Person) WHERE p.name = "Max" RETURN p.city

Gremlin

[rodriguez2015gremlin] is the traversal query language of Apache TinkerPop [rodriguez2015gremlin]. It first appeared in 2009 and predates Cypher. It also covers a wider range of graph query processing: declarative (pattern matching) and imperative (graph traversal). Thus, it has a larger technological adoption. For example, it has libraries in more programming languages: Java, Groovy, Python, Scala, Clojure, PHP, and JavaScript; and it is integrated in more renowned data processing technologies, e.g., Hadoop and Spark, and graph databases, e.g., Amazon Neptune, Azure Cosmos, OrientDB, etc.
Example (declarative): g.V().match(__.as('a')
.hasLabel('Person').has('name','Max').as('p'), __.as('p')
.out('city').values().as('c')).select('c')

Example (imperative): g.V().hasLabel('Person')
.has('name','Max').out('city').values()

Hierarchical Query Languages

This family is dominantly represented by XML query languages. XML appeared more than two decades ago and became a W3C Recommendation in 1998 [bray1997extensible]; it is used mainly for data exchange between applications. The W3C-recommended XML query languages are XPath and XQuery.

XPath

allows defining path expressions that navigate XML trees from a root parent to descendant children. XPath was standardized by W3C in 1999, and is continually receiving updates [clark1999xml, berglund2003xml, boag2002xquery], with the latest one in 2017 [dyck2017xml].
Example: //person[./name='Max']/city

XQuery

is the de facto XML query language. XQuery is also considered a functional programming language, as it allows calling and writing functions to interact with XML documents. XQuery uses XPath for path expressions, and can perform insert, update and delete operations. It was initially suggested in 2002 [boag2002xquery], standardized by W3C in 2007 and most recently updated in 2017 [jonathan2017xml].
Example: for $x in doc("persons.xml")/person where $x/name='Max' return $x/city

Document Query Languages

The representative document database that we choose is MongoDB. MongoDB (first released in 2009) is the document database that attracted the most attention both from academia and industry. At the time of writing, MongoDB tops the DB engine ranking [DBEnginesRanking] for document stores.

MongoDB operations.

MongoDB does not have a proper query language like SQL or SPARQL, but rather interacts with documents by means of query operations in a JSON-like format.
Example: db.person.find({name: "Max"}, {city: 1})

Query Translation Paths

In this section, we introduce the various translation paths between the selected query languages. Figure 1 shows a visual representation, where the nodes correspond to the considered query languages and the directed arrows correspond to the translation direction; the thickness of the arrows reflects the number of works on the respective query translation path.

SQL <> XML languages

The interest in using a relational database as a backbone for storing and querying XML appeared as early as 1999 [he1999relational]. Even though the XML model differs substantially from the relational model, e.g., multi-level nesting of data, cycles, recursive graph traversals, etc., storing XML data in RDBMSs was sought to benefit from their query efficiency and storage scalability.

XPath/XQuery-to-SQL:

XML documents have to be flattened, or shredded, into relations so they can be loaded into or mapped to relational tables. The ultimate goal is to hide the specificity of the back-end store, and make users feel as if they are directly dealing with the original XML documents. In parallel, there are efforts to provide an XML view on top of relational databases. The rationale is to unify the access, using XML, and also to benefit from XML querying capabilities, e.g., expressing path traversals and recursion.
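To make the shredding idea concrete, the following is a minimal sketch, not the method of any surveyed work. It assumes a hypothetical `persons` document and a single hypothetical `person(id, name, city)` table, flattens the XML into that table with Python's standard library, and then answers the running example with SQL instead of XPath.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document, invented for this illustration.
XML = """<persons>
  <person><name>Max</name><city>Bonn</city></person>
  <person><name>Eva</name><city>Dresden</city></person>
</persons>"""


def shred(xml_text, conn):
    """Flatten every <person> element into one row of a 'person' table."""
    conn.execute(
        "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    root = ET.fromstring(xml_text)
    for pid, person in enumerate(root.findall("person"), start=1):
        conn.execute(
            "INSERT INTO person VALUES (?, ?, ?)",
            (pid, person.findtext("name"), person.findtext("city")),
        )


conn = sqlite3.connect(":memory:")
shred(XML, conn)

# XPath on the original document:  //person[./name='Max']/city
# Equivalent SQL over the shredded table:
rows = conn.execute("SELECT city FROM person WHERE name = ?", ("Max",)).fetchall()
cities = [row[0] for row in rows]

# The same path evaluated directly on the XML, for comparison.
xpath_cities = [
    e.text for e in ET.fromstring(XML).findall(".//person[name='Max']/city")
]
```

Real shredding schemes must also handle nesting, ordering and recursion, which this one-table layout deliberately ignores.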

SQL-to-XPath/XQuery:

This covers approaches for storing XML in native XML stores, but adding an SQL interface to enable the querying of XML by SQL users. Metadata about how XML data is mapped to the relational model is required.

SQL <> SPARQL

SPARQL-to-SQL:

Similarly to XML, the interest in bridging the gap between the RDF data model and the relational model emerged as early as RDF itself. This was motivated by a variety of use cases. For example, RDBMSs were suggested to store RDF data [melnik2001storing, prud2004optimal], even before SPARQL standardization. Also, the Semantic Web community suggested a well-received data integration proposal, whereby disparate relational data sources are mapped to a unified ontology model and then queried uniformly [prud2004optimal, noy2004semantic]. The concept evolved to become the popular OBDA, Ontology-Based Data Access [poggi2008linking], which powers many applications today.

SQL-to-SPARQL:

The other direction received less attention. The two main motivations presented were enhancing interoperability between the two worlds in general, and enabling reusability of the wealth of existing relational-oriented tools over RDF data, e.g., reporting and visualization.

SQL-to-Document-based

The main motivation behind exploring this path was to enable SQL users and legacy systems to access the new class of NoSQL document databases with their sole SQL knowledge.

SPARQL-to-Document:

The rationale here is identical to that of SPARQL-to-SQL, with one extra consideration: scalability. Native triple stores become prone to scalability issues when storing and querying significant amounts of RDF data. Users resorted to more scalable solutions to store and query the data [hausenblas2012large]. The database solution most studied by the research community, we found, was MongoDB.

SQL <> Graph-based

SQL-to-Cypher:

This path is considered for the same reasons as SQL-to-Document, mainly to help users with SQL knowledge work with graph data stored in Neo4j.

Cypher-to-SQL:

The rationale is to allow running graph queries over relational databases. It has also been advocated that using relational databases to store graph data can be beneficial in certain cases, benefiting from the efficient index-based retrieval RDBMSs offer.

Gremlin-to-SQL:

The aim here is to allow executing Gremlin traversals (without side-effect steps) on top of relational databases, in order to leverage the optimization techniques built into RDBMSs. In order to do so, the property graph data is represented and stored as relational tables.

SQL-to-Gremlin:

The main motivation is to enable relational database users to migrate to graph databases in order to leverage the advantages of graph-based functions (e.g., depth-first search, shortest paths, etc.) and data analytical applications that require distributed graph data processing.

SPARQL <> XML languages

SPARQL-to-XPath/XQuery:

Similarly to SQL-to-XML paths, this path seeks to build interoperability environments between semantic and XML database systems, to enable ontology-based data access to XML data, and to add a semantic layer on top of XML data and services for integration purposes.

XPath/XQuery-to-SPARQL:

Enabling XPath traversal or XQuery functional programming styles on top of RDF data can be an interesting feature to equip native RDF stores with, in order to bring adopters from the XML world into the Semantic Web world.

SPARQL <> Graph-based

SPARQL-to-Gremlin:

This path aims to bridge the gap between the Semantic Web and Graph database communities by enabling SPARQL querying of property graph databases. Users well versed in SPARQL query language can avoid learning another query language, as Gremlin supports both OLTP and OLAP graph processors, covering a wide variety of graph databases.

[Figure 1 diagram: the nodes are SQL, SPARQL, XPath/XQuery, Gremlin, Cypher and Document-based; each arrow is a translation path labelled with the works found on it, as listed below.]

  • SQL-to-SPARQL: [ramanujam2009r2d, ramanujam2009r2dextra, rachapalli2011retro]

  • SPARQL-to-SQL: [stadlerconnecting, kiminki2010sparql, priyatna2014formalisation, Chebotko06semanticspreserving, lu2008effective, elliott2009complete, unbehauen2012accessing, sequeda2013ultrawrap, journals/ws/Rodriguez-MuroR15]

  • SPARQL-to-XPath/XQuery: [bikakis2015sparql2xquery, bikakis2009querying, bikakis2009semantic, groppe2008embedding, fischer2011translating]

  • XPath/XQuery-to-SPARQL: [droop2007translating]

  • XPath/XQuery-to-SQL: [krishnamurthy2004efficient, fan2005query, mani2006join, georgiadis2007xpath, hu2008adaptive, min2008xtron]

  • SQL-to-XPath/XQuery: [vidhya2009query, vidhya2010insert, jigyasu2006sql, halverson2004rox]

  • SPARQL-to-Gremlin: [thakkar2018stitch, thakkar2018two]

  • SQL-to-Gremlin: [sqlgremlin]

  • Gremlin-to-SQL: [sun2015sqlgraph]

  • Cypher-to-SQL: [cyp2sql, steer2017cytosm]

  • SQL-to-Document-based: [querymongo, mongoDB-translator-teiid, unityjdbc]

  • SPARQL-to-Document-based: [mutharaju2013d, unbehauen-semantics-2016-sparqlmap-m, botoeva2016obda, conf/ontobras/AraujoABW17, michel2016generic]

Figure 1: Query translation paths found and studied.

Survey Methodology

Our study of the literature revealed a set of generic query translation patterns and common aspects that can be used to classify the surveyed query translation methods and tools. We refer to them as translation criteria and organize them in three categories, forming what we call the Query Translation Identity Card.

I. Translation Properties

  1. Translation type: Describes how the target query is obtained.

    (a) Direct: the translation generates the destination query starting from, and by analyzing only, the original query.

    (b) Intermediate/meta query language-based: the translation generates the destination query by passing through an intermediate (meta-)language.

    (c) Storage scheme-aware: the translation generates queries depending on how data is internally structured or partitioned.

    (d) Schema information-aware: the translation depends mainly on the schema information of the underlying data.

    (e) Mapping language-based: the translation generates the destination query using a set of mapping rules expressed in an established/standardized third-party mapping language, e.g., R2RML [das2011r2rml].

  2. Translation coverage: Describes how much of the origin query language syntax is covered. For example, projection and filtering are preserved, while joining and update are dropped.

II. Translation Optimization

  1. Optimization strategies: Describes any optimization techniques applied during query translation, e.g., reordering joins in a query plan to reduce intermediate results.

  2. Translation relationship: Describes how many destination queries can be generated starting from the input query: one-to-one, one-to-many. Generally, it is desirable to reduce the number of destination queries to one, so we consider this an optimization aspect. We separate it from the previous point, however, as it has a separate (discrete) value range.

III. Community Factors

  1. Availability: Describes whether the translation method implementation or prototype is openly available. That can be known, for example, by checking if the reference to the source code repository or download page is still available.

  2. Adoption: Describes the degree of acceptance of the translation method by the community by, for example, enumerating the research publications citing it.

  3. Evaluation: Assesses whether the translation method has been empirically evaluated. For example [halverson2004rox] evaluates the various schema options and their effect on query execution, using the TPC-H benchmark.

  4. Metadata: Provides some related information about the presented translation method, such as date of first and last release/update. For example, this helps to obtain an indication about whether the solution is still maintained.
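The eight criteria can be read as the fields of a record. The sketch below shows one possible encoding; the class and field names are our own illustrative choices and not part of the survey. The example instance only fills in what the text states about ROX (its path, its direct translation type and its TPC-H evaluation) and leaves everything else unset.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TranslationType(Enum):
    DIRECT = "direct"
    INTERMEDIATE_LANGUAGE = "intermediate/meta query language-based"
    STORAGE_SCHEME_AWARE = "storage scheme-aware"
    SCHEMA_INFORMATION_AWARE = "schema information-aware"
    MAPPING_LANGUAGE = "mapping language-based"


class Relationship(Enum):
    ONE_TO_ONE = "one-to-one"
    ONE_TO_MANY = "one-to-many"


@dataclass
class IdentityCard:
    """Hypothetical record for one surveyed translation method."""
    source_language: str
    target_language: str
    # I. Translation properties
    translation_type: TranslationType
    coverage: Optional[str] = None
    # II. Translation optimization
    optimizations: List[str] = field(default_factory=list)
    relationship: Optional[Relationship] = None
    # III. Community factors
    available: Optional[bool] = None
    citations: Optional[int] = None
    evaluated: Optional[bool] = None
    first_release: Optional[int] = None
    last_release: Optional[int] = None


# Only facts stated in the text are filled in; the rest stays None.
rox = IdentityCard(
    source_language="SQL",
    target_language="XPath/XQuery",
    translation_type=TranslationType.DIRECT,
    evaluated=True,  # evaluated using the TPC-H benchmark
)
```

Keeping unknown criteria as `None` rather than guessing a default makes gaps in the classification visible.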

Criteria-based Classification

Scope definition. Given the broad scope tackled in this survey, it is important to limit the search space. Therefore, we take measures to favor quality, high influence and completeness, as well as to preserve a certain level of novelty, at least in paths with the highest number of works. The measures are as follows:

  • We do not consider work that describes the query translation very marginally or that has a broad scope with little focus on the query translation aspects.

  • We only consider works proposed during the last fifteen years, i.e., after 2003. This applies in particular to XML-related translations; however, interested readers may refer to an existing survey covering older XML translation works [krishnamurthy2003xml].

It is also important to explicitly prune the scope in terms of what is not considered for the study:

  • We do not address post-query translation steps, e.g., results format and representation.

  • As the aim of this survey is to explore the methods and capacities, we do not comment on the results of empirical evaluations of the individual works. This is also due to the vast heterogeneity between the languages, their underlying data and use-cases.

  • The translation method is summarized, which may entail that certain details are omitted. The goal is to allow the reader to discover the literature; interested readers are encouraged to refer to the individual publications for the full details.

In the following, we refer to the articles and tools by citation and, when applicable, by name, and directly describe the query translation methods they present. Further, it should not be inferred that the article or tool presents solely translation methods, but often, other aspects are also tackled, e.g., data migration, which are considered out-of-scope of the current study. Finally, in order to give the survey a temporal context, works are listed in a chronological order.

I. Translation Properties

1. Translation type:

(a) Direct:
SQL-to-XPath/XQuery:

ROX [halverson2004rox] aims at directly querying native XML stores using a SQL interface. The method consists of creating relational views, called NICKNAMEs, over a native XML store. The NICKNAME contains schema descriptions of the rows that would be returned starting from XML input data, including mappings between those rows and XML elements expressed in the form of XPath calls. Nested parent-child XML elements are captured, in the NICKNAME definition, by expressing primary and foreign keys between the corresponding NICKNAMEs. [vidhya2010insert, vidhya2009query] propose a set of algorithms enabling direct logical translations of simple SQL INSERT, UPDATE, DELETE and RENAME queries to statements in the XUpdate language (XUpdate is an extension of XPath that allows manipulating XML documents). For INSERT, the SQL query has to be slightly extended to state at which position relative to the context node (preceding or following) the new node is to be inserted.

SPARQL-to-SQL:

[Chebotko06semanticspreserving] defines a set of primitives that allow to (a) extract the relation where triples matching a triple pattern are stored, (b) extract the relational attribute whose value may match a given triple pattern in a certain position (s,p,o), (c) generate a distinct name from a triple pattern variable or URI, (d) generate SQL conditions (WHERE) given a triple pattern and the latter primitive, and (e) generate SQL projections (SELECT) given a triple pattern and the latter three primitives. A translation function returns a SQL query by combining and building up the previous primitives given a graph pattern. The translation function generates SQL joins from UNIONs and OPTIONALs between sub-graph patterns. FSparql2Sql [lu2008effective] is an early work focusing on the various cases of filter in SPARQL queries. While RDF objects can take many forms, like IRIs (Internationalized Resource Identifiers) and literals with and without language and/or datatype tags, values stored in an RDBMS are generally atomic textual or numeric values. Therefore, the various cases of RDF objects are assigned primitive data types, called ’facets’, e.g., facets for IRIs, datatype tags and language tags are of primitive type String. This way, filter operands become complex, so they need to be bound dynamically. To achieve that, the CASE WHEN … THEN expressions that are part of SQL-92 are exploited. [elliott2009complete] proposes several translation SQL model-algorithms implementing different operators of a SPARQL query (algebra). In contrast to many existing works, this work aims to generate flat/un-nested SQL queries, instead of multi-level nested queries, so SQL query optimizers can achieve better performance. This is done via SQL augmentations, i.e., SPARQL operators gradually augment the SQL query instead of creating a new nested one. The algorithms implement functions, each of which generates a part of the final SQL query.
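As an illustration of a direct SPARQL-to-SQL translation, the sketch below turns a basic graph pattern into one flat SQL query. It is not the algorithm of any of the works above: it assumes that all triples live in a single hypothetical table `triples(s, p, o)`, handles only conjunctions of triple patterns, and uses invented data. Constants become WHERE conditions, repeated variables become join conditions, and projected variables become the SELECT list.

```python
import sqlite3


def is_var(term):
    return term.startswith("?")


def bgp_to_sql(patterns, projection):
    """Translate (s, p, o) triple patterns into one flat SQL query over
    a single table triples(s, p, o), using one table alias per pattern."""
    where, params = [], []
    first_seen = {}  # variable -> column where it was first bound
    for i, pattern in enumerate(patterns):
        for col, term in zip("spo", pattern):
            ref = f"t{i}.{col}"
            if not is_var(term):
                where.append(f"{ref} = ?")  # constant -> selection
                params.append(term)
            elif term in first_seen:
                where.append(f"{ref} = {first_seen[term]}")  # shared variable -> join
            else:
                first_seen[term] = ref
    select = ", ".join(f"{first_seen[v]} AS {v[1:]}" for v in projection)
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    sql = f"SELECT {select} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# Invented sample data for the running example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    (":max", ":type", ":Person"), (":max", ":name", "Max"), (":max", ":city", ":Bonn"),
    (":eva", ":type", ":Person"), (":eva", ":name", "Eva"), (":eva", ":city", ":Dresden"),
])

# SELECT ?c WHERE { ?p :type :Person . ?p :name "Max" . ?p :city ?c }
sql, params = bgp_to_sql(
    [("?p", ":type", ":Person"), ("?p", ":name", "Max"), ("?p", ":city", "?c")],
    ["?c"],
)
result = conn.execute(sql, params).fetchall()
```

The generated query contains no nested sub-queries, in the spirit of the flat SQL targeted by [elliott2009complete]; OPTIONAL, UNION and FILTER would each need additional rules.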

SQL-to-Document-based:

QueryMongo [querymongo] is a Web-based translator that accepts a SQL query and generates an equivalent MongoDB query. The translation is based solely on the SQL query syntax, i.e., it does not consider any data or schema. No explanation of the translation approach is provided. [sql-to-mongo-db-query-converter] is a library providing an API to translate SQL to MongoDB queries. The translation is likewise based on the SQL query syntax only.
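Since neither tool documents its approach, the following is only a sketch of what a purely syntax-based translation can look like; it is not the implementation of QueryMongo or of the converter library. The function names and the supported SQL subset (a single table and at most one equality condition) are assumptions made for this illustration.

```python
import json
import re

# Supported subset: SELECT <cols> FROM <table> [WHERE <field> = <value>]
PATTERN = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<field>\w+)\s*=\s*(?P<value>'[^']*'|\d+))?\s*;?\s*$",
    re.IGNORECASE,
)


def sql_to_mongo(sql):
    """Return (collection, query document, projection document)."""
    m = PATTERN.match(sql)
    if m is None:
        raise ValueError("unsupported SQL: " + sql)
    query = {}
    if m.group("field"):
        raw = m.group("value")
        # Quoted values stay strings, bare digits become integers.
        query[m.group("field")] = raw[1:-1] if raw.startswith("'") else int(raw)
    cols = [c.strip() for c in m.group("cols").split(",")]
    projection = {} if cols == ["*"] else {c: 1 for c in cols}
    return m.group("table"), query, projection


def render(collection, query, projection):
    """Render the translation as a MongoDB shell find() call."""
    args = [json.dumps(query)]
    if projection:
        args.append(json.dumps(projection))
    return f"db.{collection}.find({', '.join(args)})"


mongo = render(*sql_to_mongo("SELECT city FROM Person WHERE name = 'Max'"))
everything = render(*sql_to_mongo("SELECT * FROM Person"))
```

Because no schema is consulted, the translator cannot tell whether `Person` or `city` actually exist in the target database, which is exactly the limitation of a syntax-only approach.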

SPARQL-to-XPath/XQuery:

[groppe2008embedding] does not provide a direct translation of SPARQL, but of SPARQL embedded inside XQuery. The method involves first representing SPARQL in the form of a tree of operators. There are operators for projection, filtering, joining, optional and union; they declare how the output (XQuery) of the corresponding operations is represented. The translation involves data translation, from RDF to XML, and the translation of the operators to XQuery queries accordingly. For each triple, an XML element is created with three sub-elements, one for each triple term (s, p and o). The translation from an operator into XQuery constructs is based on transformation rules, which replace the embedded SPARQL constructs with XQuery constructs. In XQL2Xquery [fischer2011translating], variables of the basic graph pattern (BGP) are mapped to XQuery values. A for loop and a path expression are used to retrieve subjects and bind any variables encountered; then, nested under every variable, a loop iterates over the predicates and binds their variables. Objects are iterated over in a similar, nested way. Next, BGP constants and filters are mapped to an XQuery where clause. OPTIONAL is mapped to an XQuery function implementing a left outer join. For filters, XQuery value comparisons are employed (e.g., eq, ne). ORDER BY is mapped to order by in a FLWOR expression. LIMIT and OFFSET are handled using position on the results. REDUCED is translated into a NO-OP.

XPath/XQuery-to-SPARQL:

[droop2007translating] presents a translation method that includes data transformation from XML to RDF. During the data transformation process, XML nodes are annotated with information used to support all XPath axes. For example, type information, attributes, namespaces, parent-child relationships, information necessary for recursive XPath, etc. The above annotations conform to the structure of the generated RDF and are used to generate the final SPARQL query.

Gremlin-to-SQL:

[sun2015sqlgraph] proposes a direct mapping approach for translating a subset of Gremlin queries (those without side-effect steps) into SQL, leveraging relational query optimizers. The technique relies on a novel schema that exploits both relational and non-relational storage for property graph data: relational storage is used for adjacency information, and JSON storage for vertex and edge attributes.
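The sketch below illustrates the general idea of compiling a side-effect-free traversal into SQL by wrapping the query built so far in one sub-query per step. It does not reproduce the SQLGraph schema: for simplicity it stores properties in a plain hypothetical `vertex_prop` table instead of JSON, and the table names, the step encoding and the data are all assumptions.

```python
import sqlite3


def gremlin_to_sql(steps):
    """Compile a list of (step, *args) tuples into one nested SQL query.
    Each step wraps the query built so far as the sub-query 'prev'."""
    sql, params = "SELECT id FROM vertex", []  # g.V()
    for name, *args in steps:
        if name == "hasLabel":
            sql = (f"SELECT prev.id AS id FROM ({sql}) prev "
                   "JOIN vertex v ON v.id = prev.id WHERE v.label = ?")
        elif name == "has":
            sql = (f"SELECT prev.id AS id FROM ({sql}) prev "
                   "JOIN vertex_prop p ON p.vid = prev.id "
                   "WHERE p.pkey = ? AND p.pvalue = ?")
        elif name == "out":
            sql = (f"SELECT e.dst AS id FROM ({sql}) prev "
                   "JOIN edge e ON e.src = prev.id WHERE e.label = ?")
        elif name == "values":
            sql = (f"SELECT p.pvalue AS val FROM ({sql}) prev "
                   "JOIN vertex_prop p ON p.vid = prev.id WHERE p.pkey = ?")
        else:
            raise ValueError(f"unsupported step: {name}")
        # The new placeholders always follow the sub-query textually,
        # so appending keeps the parameter order correct.
        params.extend(args)
    return sql, params


# Invented property graph for the running example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertex (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE vertex_prop (vid INTEGER, pkey TEXT, pvalue TEXT);
CREATE TABLE edge (src INTEGER, label TEXT, dst INTEGER);
""")
conn.executemany("INSERT INTO vertex VALUES (?, ?)",
                 [(1, "Person"), (2, "Person"), (3, "City"), (4, "City")])
conn.executemany("INSERT INTO vertex_prop VALUES (?, ?, ?)",
                 [(1, "name", "Max"), (2, "name", "Eva"),
                  (3, "name", "Bonn"), (4, "name", "Dresden")])
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                 [(1, "city", 3), (2, "city", 4)])

# g.V().hasLabel('Person').has('name','Max').out('city').values('name')
sql, params = gremlin_to_sql([
    ("hasLabel", "Person"), ("has", "name", "Max"),
    ("out", "city"), ("values", "name"),
])
result = conn.execute(sql, params).fetchall()
```

Here `values('name')` names the property explicitly, unlike the key-less `values()` of the earlier example, because the sketch needs to know which property column to select.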

SPARQL-to-Gremlin:

Gremlinator [thakkar2018stitch, thakkar2018two] proposes a direct translation of SPARQL queries to Gremlin pattern matching traversals, by mapping each triple pattern within a SPARQL query to a corresponding single step in the Gremlin traversal language. This is made possible by the match()-step in Gremlin, which offers a SPARQL-style declarative construct. Within a single match()-step, multiple single-step traversals can be combined to form a complex traversal, analogous to how multiple basic graph patterns constitute a complex SPARQL query [thakkar2017towards].
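The idea of mapping each triple pattern to one clause of a match()-step can be sketched in Python as follows. This is a simplified illustration and not Gremlinator's actual algorithm: we assume that every subject is a variable, that a variable object means the predicate is an edge label, and that a constant object means the predicate is a property key. The function names are hypothetical.

```python
def triple_to_clause(s, p, o):
    """Map one triple pattern to a single clause of a Gremlin match()-step."""
    start = f"__.as('{s[1:]}')"  # the subject is assumed to be a variable
    if o.startswith("?"):
        # Assumption: a variable object means the predicate is an edge label.
        return f"{start}.out('{p}').as('{o[1:]}')"
    # Assumption: a constant object means the predicate is a property key.
    return f"{start}.has('{p}', '{o}')"

def bgp_to_gremlin(patterns):
    """Combine the clauses of all triple patterns into one match()-step."""
    clauses = ", ".join(triple_to_clause(*t) for t in patterns)
    return f"g.V().match({clauses})"

traversal = bgp_to_gremlin([("?a", "knows", "?b"), ("?b", "name", "josh")])
```

The two triple patterns end up as two clauses of a single match()-step, which mirrors how a basic graph pattern groups its triple patterns.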

(b) Intermediate/meta query language-based:
SPARQL-to-SQL:

Type-ARQuE [kiminki2010sparql] uses an intermediate query language called AQL (Abstract Query Language). AQL is designed to stand between SQL and SPARQL; it extends the relational algebra (in particular the join) and accommodates both SQL and SPARQL semantics. It is represented as a tree of expressions and joins between them, containing selects and orders. The translation process consists of three stages: (1) the SPARQL query is parsed and translated to an AQL query, (2) the AQL query undergoes a series of transformations (simplifications) preparing it for SQL transformation, and (3) the AQL query is translated to the target SQL dialect, transforming the AQL join tree into an SQL join tree, along with the other select and order expressions. Examples of stage 2 simplifications are type inference, nested join flattening, and joining inner joins with their parents. In [journals/ws/Rodriguez-MuroR15], Datalog is used as an intermediate language between SPARQL and SQL. The SPARQL query is translated into a Datalog program with similar semantics. The first phase translates the SPARQL query to a set of Datalog rules. The translation adopts a syntactic variation of the method presented in [polleres2013relation], incorporating built-in predicates available in SQL and avoiding negation, e.g., LeftJoin, isNull, isNotNull, NOT. The second phase generates an SQL query starting from the Datalog rules. Datalog atoms (ans, triple, Join, Filter and LeftJoin) are mapped to equivalent relational algebra operators: ans and triple are mapped to a projection, while filters and joins are mapped to the equivalent relational filters and joins, respectively.
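As an illustration of one of the stage 2 simplifications mentioned above, nested join flattening, consider the following Python sketch. The tuple-based tree encoding is our own assumption and does not reflect AQL's actual data structures. Only inner joins are flattened, since they are associative; any other node (for instance a left join) is left untouched.

```python
def flatten_joins(node):
    """Flatten nested inner joins.

    ('join', A, ('join', B, C)) becomes ('join', A, B, C). Leaves are plain
    strings; any tuple whose tag is not 'join' is returned unchanged.
    """
    if not isinstance(node, tuple) or node[0] != "join":
        return node
    flat = []
    for child in node[1:]:
        child = flatten_joins(child)
        if isinstance(child, tuple) and child[0] == "join":
            flat.extend(child[1:])  # lift the grandchildren one level up
        else:
            flat.append(child)
    return ("join", *flat)

tree = ("join", "t1", ("join", "t2", ("join", "t3", "t4")))
flat_tree = flatten_joins(tree)
```

A flat join node of this kind can then be emitted as a single FROM clause with several joined tables instead of nested sub-selects.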

SPARQL-to-Document:

In [michel2016generic], a generic two-step SPARQL-to-X approach is suggested, with a showcase using MongoDB. The article proposes to convert a SPARQL query to a pivot intermediate query language called Abstract Query Language (AQL). The translation uses a set of mappings in the xR2RML mapping language, which describe how data in the target databases is mapped into the RDF model, without converting the data to RDF. AQL has a grammar that is similar to SQL both syntactically and semantically. The BGP part of a SPARQL query is decomposed into a set of expressions in AQL. Next, the xR2RML mappings are checked for any maps matching the contained triple patterns. The matching maps detected are used to translate individual triple patterns to atomic abstract queries. Queries in AQL are then translated to the query language of the target database. Unsupported operations, like JOIN in MongoDB, are assumed to be left to a higher-level query engine.

(c) Storage scheme-aware:
XPath/XQuery-to-SQL:

In [min2008xtron], XTRON, a relational XML management system, is presented. The article suggests a schema-oblivious way of storing and querying XML data. XML documents are stored uniformly in identical relational tables using a fixed predefined relational model. Generated queries then have to abide by this fixed relational schema.

SPARQL-to-Document:

D-SPARQ [mutharaju2013d] focuses on the efficient processing of join operations between the triple patterns of a SPARQL query. RDF data is physically materialized in a cluster of MongoDB stores, following a specific graph partitioning scheme. SPARQL queries are converted to MongoDB queries following the same scheme.

Cypher-to-SQL:

Cyp2sql [cyp2sql] is a tool for the automatic transformation of both data and queries from Neo4j to a relational database. During the transformation, the following tables are created: Nodes, Edges, Labels, Relationship types, plus materialized views to store the adjacency list of the nodes. Cypher queries are then translated to SQL queries tailored to that data storage scheme.
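The following Python sketch, using the standard sqlite3 module, illustrates how a Cypher pattern can be answered over a nodes/edges storage scheme. The schema is a deliberately reduced assumption of ours (two tables with hypothetical column names); the actual tool also creates label and relationship-type tables as well as materialized views, which are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT, name TEXT);
CREATE TABLE edges (src INTEGER, dst INTEGER, type TEXT);
INSERT INTO nodes VALUES (1, 'Person', 'Alice'), (2, 'Person', 'Bob'), (3, 'Person', 'Carol');
INSERT INTO edges VALUES (1, 2, 'KNOWS'), (2, 3, 'KNOWS');
""")

# Cypher: MATCH (a:Person {name: 'Alice'})-[:KNOWS]->(b) RETURN b.name
# Each node in the pattern becomes an alias of the nodes table and the
# relationship becomes a join through the edges table.
sql = """
SELECT b.name
FROM nodes AS a
JOIN edges AS e ON e.src = a.id AND e.type = 'KNOWS'
JOIN nodes AS b ON b.id = e.dst
WHERE a.label = 'Person' AND a.name = 'Alice'
"""
friends = [row[0] for row in conn.execute(sql)]
```

Longer path patterns would add one further pair of joins per hop, which is why storage-scheme-aware translators typically precompute adjacency information.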

SQL-to-Gremlin:

SQL-Gremlin [sqlgremlin] is a proof-of-concept SQL-to-Gremlin translator. The translation requires that the underlying graph data is given a relational schema, where elements from the graph are mapped to tables and attributes. However, there is no reported scientific study that discusses the translation approach. SQL2Gremlin [sql2gremlin] is a tool for converting SQL queries to Gremlin queries. They show how to reproduce the effect of SQL queries using Gremlin traversals. A pre-defined graph model is used during the translation; as an example, Northwind relational data was loaded as a graph inside Gremlin.

(d) Schema information-aware:
XPath/XQuery-to-SQL:

The process in [krishnamurthy2004efficient] uses summary information on the relational integrity constraints, computed in a pre-processing phase. An XML view is constructed by mapping elements from the XML schema to elements from the relational schema. The XML view is a tree where the nodes map to table names and the leaves to column names. An SQL query is built by going from the root to the leaves of this tree, where each traversal from one node to another corresponds to a join between the two corresponding tables. In [fan2005query], XML data is shredded into relations based on an XML schema (DTD) and saved in an RDBMS. The article extends XPath expressions to allow capturing recursive queries against a recursive schema. XPath queries with the extended expressions can then be translated into an equivalent sequence of SQL queries using a common RDBMS operator (LFP: Simple Least Fixpoint). [mani2006join], in turn, builds a virtual XML view on top of relational databases using XQuery; the focus of the article is on the optimization of the intermediate relational algebra.
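The root-to-leaf construction of an SQL query from an XML view can be sketched as follows in Python. The view, the table names and the join conditions are hypothetical and chosen by us for illustration only; in the surveyed approach they would be derived from the schema mapping and the integrity constraints.

```python
# Hypothetical XML view: each edge between two element nodes carries the join
# condition between the tables that the two nodes are mapped to.
VIEW_EDGES = {
    ("customer", "orders"): "orders.customer_id = customer.id",
    ("orders", "item"): "item.order_id = orders.id",
}

def path_to_sql(path, column):
    """Translate a root-to-leaf path such as /customer/orders/item."""
    tables = path.strip("/").split("/")
    sql = f"SELECT {tables[-1]}.{column} FROM {tables[0]}"
    # Every step from a node to its child adds one join.
    for parent, child in zip(tables, tables[1:]):
        sql += f" JOIN {child} ON {VIEW_EDGES[(parent, child)]}"
    return sql

item_names = path_to_sql("/customer/orders/item", "name")
```

A path of n element steps therefore yields n-1 joins, which explains why join elimination is a recurring optimization for this translation path.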

SQL-to-SPARQL:

R2D [ramanujam2009r2d, ramanujam2009r2dextra] proposes to create a virtual normalized relational schema (view) on top of RDF data. Schema elements are extracted from the RDF schema; if the schema is missing or incomplete, schema information is extracted by thoroughly exploring the data. r2d:TableMap, r2d:keyField and r2d:refersToTableMap denote a relational table, its primary key, and its foreign key, respectively. A relational view is created using those schema constructs, against which SQL queries are posed. SQL queries are translated into SPARQL queries. For every SQL variable that is projected, filtered or aggregated (with GROUP BY), a variable is added to the SPARQL SELECT. SQL WHERE conditions are added to the SPARQL FILTER, LIKE is mapped to regex(), and blank nodes are used in a number of cases. In RETRO [rachapalli2011retro], RDF data is exhaustively parsed to extract a domain-specific relational schema. The schema corresponds to the so-called vertical partitioning, i.e., one table for every extracted predicate, where each table is composed of the attributes <subject, object>. Then, the translation algorithm parses the SQL query posed against the extracted relational schema and iteratively builds the SPARQL query.
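The mapping of an SQL LIKE condition to a SPARQL regex() filter can be illustrated with the Python sketch below. It is our own simplified reading of that step, not R2D's implementation: the helper names are hypothetical, and LIKE with an ESCAPE clause as well as case-sensitivity differences between systems are not handled.

```python
import re

# Characters that have a special meaning in regular expressions.
REGEX_META = set(".^$*+?()[]{}|\\")

def like_to_regex(pattern):
    """Convert an SQL LIKE pattern into an anchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any sequence of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        elif ch in REGEX_META:
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "^" + "".join(parts) + "$"

def like_to_filter(var, pattern):
    """Build a SPARQL FILTER for the condition: var LIKE pattern."""
    # Backslashes and quotes must be escaped inside a SPARQL string literal.
    literal = like_to_regex(pattern).replace("\\", "\\\\").replace('"', '\\"')
    return f'FILTER regex(str(?{var}), "{literal}")'

# Local sanity check of the generated expression with Python's regex engine.
assert re.match(like_to_regex("Al%"), "Alice")

name_filter = like_to_filter("name", "Al%")
```

The anchors ^ and $ matter here: LIKE matches the whole value, whereas regex() by default matches any substring.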

SQL-to-Document-based:

[mongoDB-translator-teiid] requires the user to provide a MongoDB schema, expressed in a relational form using tables, procedures, and functions. [unityjdbc] provides a JDBC access to MongoDB documents by building a representative schema, which is, in turn, constructed by sampling MongoDB data and fitting the least-general type representing the data.

SQL-to-XPath/XQuery:

AquaLogic Data Services Platform [jigyasu2006sql] builds an XML-based layer on top of heterogeneous data sources and services. To allow SQL access to relational data, relational schema is mapped to AquaLogic DSP artifacts (internal data organization), e.g., service function to relational tables.

SPARQL-to-Document:

[botoeva2016obda], in the context of OBDA, suggests a two-step approach, whereby the relational model is used as an intermediate model between SPARQL and MongoDB queries. Notions of MongoDB type constraints (schema) and mapping assertions are imposed on MongoDB data, both of which are used during the first phase of query translation to create relational views. The schema is extracted from the data stored in MongoDB. MongoDB mappings relate MongoDB paths (e.g., student.name) to ontology properties. A SPARQL query is first decomposed into a set of translatable sub-queries. Using the MongoDB mappings, MongoDB queries are created. OntoMongo [conf/ontobras/AraujoABW17] proposes an OBDA on top of NoSQL stores, applied to MongoDB. An ontology, a conceptual layer, and a mapping between the ontology and the conceptual layer are involved. The conceptual layer adopts the object-oriented programming model, i.e., classes and hierarchies of classes. Data is accessed via Object-Document Mapping (ODM) calls. SPARQL triple patterns are grouped by their shared subject variable (star-shaped). Each group of triples is assumed to belong to one class defined in the mappings; the class name is denoted by the variable of the shared subject. A MongoDB query can be created by mapping query classes to classes in the conceptual model, which are then used to call MongoDB terms via the ODM. The lack of a JOIN operation in MongoDB is compensated for by a combination of two unwind commands, each concerning one side (class) of the join.
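Grouping triple patterns by their shared subject variable, as described above for the star-shaped groups, can be sketched in a few lines of Python. The function name and the tuple encoding of triple patterns are our own assumptions.

```python
from collections import defaultdict

def group_by_subject(patterns):
    """Group (s, p, o) triple patterns into star-shaped groups per subject."""
    groups = defaultdict(list)
    for s, p, o in patterns:
        groups[s].append((p, o))
    return dict(groups)

stars = group_by_subject([
    ("?student", "rdf:type", "ex:Student"),
    ("?student", "ex:name", "?n"),
    ("?course", "ex:title", "?t"),
])
```

Each resulting group can then be matched against one class of the conceptual model, while variables shared between groups indicate where a join is needed.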

Cypher-to-SQL:

Cytosm [steer2017cytosm] presents a middleware that allows graph queries to be executed directly on non-graph databases. The application relies on gTop (graph Topology) to build a form of schema on top of graph data. gTop consists of two components: (1) an Abstract Property Graph model and (2) a mapping to the relational model. It captures the structure of property graphs, i.e., node and edge types and their properties, and provides a mapping between the graph query language and the relational query language, mapping nodes to rows of tables, and edges to either fields of rows or a sequence of table-join operations. Query translation is twofold. (1) Using the gTop abstract model, Cypher path expressions (from the MATCH keyword) are visited and a set of restricted OpenCypher [francis2018cypher] queries, denoted rOCQ, is generated; these queries do not contain multi-hop edges or anonymous entities, which are not possible to translate to SQL. (2) The rOCQ are parsed and an intermediate SQL-like representation is generated, having one SELECT and WITH SELECT for each MATCH. SELECT variables are checked to determine whether they require information from the RDBMS and whether they depend on each other. Then, the mapping part of gTop is used to map nodes to relational tables. Finally, edges are resolved into JOINs, also based on the gTop mappings.

SPARQL-to-XPath/XQuery:

SPARQL2XQuery is described in several publications [bikakis2015sparql2xquery, bikakis2009querying, bikakis2009semantic]. The translation is based on a mapping model between an OWL ontology (existing or user-defined) and an XML Schema. Mappings can either be automatically extracted by analyzing the ontology and the XML schema, or manually curated by a domain expert. SPARQL queries are posed against the ontology without knowledge of the XML schema. The BGP (Basic Graph Pattern) of the SPARQL query is normalized into a form where each graph pattern (GP) is UNION-free, so that each GP can be processed independently and more efficiently. XPaths are bound to GP variables; there are various forms of binding for the various types of variables. Next, GPs are translated into an equivalent XQuery expression using the mappings; for each variable of a triple, a For or Let clause using the variable binding is created.

SPARQL-to-SQL:

Ultrawrap [sequeda2013ultrawrap] implements an RDF2RDB mapping that allows SPARQL queries to be executed on top of existing RDBMSs. It creates an RDF ontology from the SQL schema, based on which it next creates a set of logical RDF views over the RDBMS. The views, called Tripleviews, are an extension of the well-known triple table (subject, predicate, object) with two additional columns: subject and object primary keys. Four Tripleviews are created: types, which stores subjects along with their types in the DB; varchar(size), which stores only textual attributes; int, which stores only numeric attributes; and object properties, which stores join links between DB tables. Given a SPARQL query, each triple pattern maps to a Tripleview.

(e) Mapping language-based:
SPARQL-to-SQL:

In SparqlMap [unbehauen2012accessing], the triple patterns of a SPARQL query are individually examined to extract R2RML triple maps. Methods are applied to find the candidate set of triple maps, and then to prune it to produce a set that prepares for the subsequent query translation. Given a SPARQL query, a recursive query generation process is devised, yielding a single but nested SQL query. Sub-queries are created for individual mapped triple patterns and for reconciling those via JOIN or UNION. Nested subqueries over the RDBMS tables extract not only the columns but also structural information like the term type (resource, literal, etc.), concatenate multiple columns to form IRIs, etc. To generalize the technique of [journals/ws/Rodriguez-MuroR15] (Datalog as intermediate language) to arbitrary relational schemata, R2RML is incorporated. For every R2RML triple map, a set of Datalog rules is generated reflecting the same semantics. A triple atom is created for every combination of subject map, property map and object map on a translated logical table. Finally, the translation process from Datalog to SQL is extended to deal with the new rules introduced by the R2RML mappings. [priyatna2014formalisation] extends a previously published translation method [Chebotko06semanticspreserving] to involve user-defined R2RML mappings. In particular, it incorporates R2RML mappings in the α and β mappings as well as in the genCondSQL(), genPRSQL() and trans() functions. For each, an algorithm is devised, considering the various situations found in R2RML mappings, like the absence of a Reference Object Map.

SPARQL-to-Document:

SparqlMap-M [unbehauen-semantics-2016-sparqlmap-m] enables querying document stores using SPARQL without RDF data materialization. It is based on a previous SPARQL-to-SQL translator, SparqlMap [unbehauen2012accessing], so it adopts a relational model to virtually represent the data. Documents are mapped to relations using an extension of R2RML that allows capturing duplicated, denormalized data, which is a common characteristic of document data. The lack of union and join capabilities is mitigated by a multi-level query execution, producing and reusing intermediate results. Selection parts are pushed to the document store, while the union and join are executed using an internal RDF store.
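One recurring step in mapping-language-based translation is turning a mapping template into an SQL expression that concatenates columns into an IRI. The Python sketch below illustrates this for R2RML-style templates such as http://example.org/person/{id}. It is a simplification under our own assumptions: escaped braces and the IRI-safe encoding of column values are ignored, and the helper name and the CAST-based SQL fragment are hypothetical.

```python
import re

def template_to_sql(template):
    """Turn an R2RML-style template into an SQL concatenation expression."""
    # re.split with a capture group alternates literal text and column names.
    parts = re.split(r"\{([^{}]+)\}", template)
    expressions = []
    for position, part in enumerate(parts):
        if not part:
            continue
        if position % 2:  # odd positions hold column names
            expressions.append(f"CAST({part} AS VARCHAR)")
        else:             # even positions hold literal text
            expressions.append("'" + part.replace("'", "''") + "'")
    return " || ".join(expressions)

iri_expr = template_to_sql("http://example.org/{dept}/emp/{id}")
```

Such an expression can be placed in the SELECT list of the generated sub-query, or inverted to turn a constant IRI in the SPARQL query into a condition on the underlying columns.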

2. Translation coverage:

We note the following before starting our review of the works:

  • The coverage is extracted not only from the core of the articles, but also from the evaluation sections and from the online pages of the implementations (when available). For example, [sequeda2013ultrawrap, unbehauen-semantics-2016-sparqlmap-m] evaluate using all 12 BSBM benchmark queries, which cover more scope than that of the article; the corresponding Web page of [sequeda2013ultrawrap] mentions features that go beyond both the core and the evaluation section of the article.

  • We mention the supported query feature but we do not assume its completeness, e.g., [conf/ontobras/AraujoABW17] supports filters but only for equality condition. Interested users are encouraged to seek details from the corresponding articles/tools.

  • Table 2 shows that some works [Chebotko06semanticspreserving, conf/ontobras/AraujoABW17] support only one feature. This does not necessarily imply insignificance, but reflects a choice to dedicate the full study to covering that particular feature, e.g., various shapes of graph patterns or different cases of OPTIONAL.

SQL-to-X and SPARQL-to-X:

See Table 1 and Table 2 for the translation methods and tools from SQL and from SPARQL, respectively. For SQL, the WHERE clause is an essential part of most useful queries; hence, it is supported by all methods. GROUP BY is the next most commonly supported feature, as it enables a significant class of SQL queries: analytical and aggregation queries. The sorting operation ORDER BY is supported to a lesser extent. UNION and especially JOIN are operations of typically high cost; they are among the least supported features. As the most researched query categories are of a retrieval nature, modification queries such as INSERT, UPDATE and DELETE are very weakly addressed. DISTINCT and nested queries are rarely supported, which might also be attributed to their typical cost, e.g., DISTINCT requires sorting, and nested queries generate large intermediate results. EXCEPT, UPSERT, and CREATE are only supported by individual works. For SPARQL, query operation support is more prominent across the reviewed works. FILTER, UNION and OPTIONAL are the most commonly supported query operations, covered by up to 60% of the surveyed works. To a lesser extent, DISTINCT, LIMIT and ORDER BY are supported by about half of the works. The remaining query operations are each supported by only a few works, e.g., DESCRIBE, CONSTRUCT, ASK, blank nodes, datatype(), bound(), isLiteral(), isURI(), etc. GRAPH, SUB-GRAPH and BIND are examples of interesting query operations that are only supported by individual works. In general, DESCRIBE, CONSTRUCT and ASK are far less prominent SPARQL query constructs in comparison to SELECT, which is present in all the works. isURI() and isLiteral() are SPARQL-specific functions with no direct equivalent in other languages.

XPath/XQuery-to-SQL:

The queries [krishnamurthy2004efficient] focuses on are simple path expressions, including descendant axis traversal, i.e., //. [fan2005query] enables XPath recursive queries against a recursive schema. [mani2006join] focuses on optimizing relational algebra; only a simple XPath query is used for the example. [georgiadis2007xpath] covers simple, ancestor, following, parent, following-sibling and descendant-or-self XPath queries. In [hu2008adaptive], the supported queries are XPath queries with descendant/child axes and simple conditions. [min2008xtron] translates XQuery queries with path expressions, including the descendant axis (//), the dereference operator (=>) and FLWR expressions.

XPath/XQuery-to-SPARQL:

[droop2007translating] mentions support for recursive XPath queries, with descendant, following and preceding axes, as well as for filters.

Cypher-to-SQL:

[steer2017cytosm] experiments with queries containing MATCH, WITH, WHERE, RETURN, DISTINCT, CASE, ORDER BY and LIMIT, and with the following patterns: simple patterns with known nodes and relationships, -> and <- directions, and variable-length relationships. [cyp2sql] is able to translate MATCH, WITH, WHERE, RETURN, DISTINCT, ORDER BY, LIMIT, SKIP, UNION, count(), collect(), exists(), label(), id(), and rich pattern cases, e.g., (a or empty)–()–(b or empty), [a or empty]-[b]-(c or empty), -> and <-, (a) --> (b).

Work | DISTINCT | WHERE/REGEX | JOIN | UNION | GROUP BY/HAVING | ORDER BY | LIMIT/OFFSET | INSERT/UPDATE | DELETE/DROP | Nested queries | Others
SQL-to-XPath/XQuery:
[halverson2004rox] | ? | ✓/ | ? | ? | ✓/ | ✓ | ? | ? | ✓/ | ? |
[jigyasu2006sql] | ? | ✓/ | ✓ | ✓ | ? | ✓ | ? | ? | ? | ✓ |
[vidhya2009query, vidhya2010insert] | ? | ✓/ | ? | ? | ? | ? | ? | ✓/✓ | ✓/ | ? | RENAME
SQL-to-SPARQL:
[rachapalli2011retro] | ? | ✓/ | ✓ | ✓ | ? | ? | ? | ? | ? | | EXCEPT
[ramanujam2009r2d, ramanujam2009r2dextra] | ? | ✓/✓ | ? | ? | ✓/ | ? | ? | ? | ? | ? |
SQL-to-Document-based:
[querymongo] | ✓ | ✓/✓ | ✗ | ✗ | ✓/✓ | ✓ | ✓/ | ✗ | ✗ | ✗ |
[mongoDB-translator-teiid] | ✓ | ✓/ | ? | ? | ✓/✓ | ✓ | ✓/✓ | ✓/ | ✓/ | ? |
[unityjdbc] | ? | ✓/✓ | ✓ | ? | ✓/ | ? | ✓/✓ | ✓/✓ | /✓ | ? | CREATE, DROP, UPSERT, date, string, math functions
[sql-to-mongo-db-query-converter] | ? | ✓/✓ | ? | ? | ✓/ | ✓ | ? | ? | ✓/ | ? | some Boolean filters
[neo4jsql] | ? | ✓/ | ✓ | ? | ✓/ | ✓ | ? | ? | ? | ? |
SQL-to-Gremlin:
[sqlgremlin] | ✓ | ✓/✓ | ? | ✓ | ✓/ | ✓ | ? | ? | ? | ✓ |

Table 1: SQL features supported in SQL-to-X query translations. ✓ is supported, ✗ is not supported, ? is not (clearly) mentioned as supported. Others are features provided only by individual works.
Work | DISTINCT/REDUCED | FILTER/regex() | OPTIONAL | UNION | ORDER BY | LIMIT/OFFSET | Blank nodes | datatype()/lang() | isURI()/isLiteral() | DESCRIBE/bound() | CONSTRUCT/ASK | Others
SPARQL-to-SQL:
[kiminki2010sparql] | ✓/ | ? | ? | ✓ | ✗ | ✓/ | ? | ? | ? | ? | ? |
[priyatna2014formalisation] | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
[elliott2009complete] | ✓/ | ? | ✓ | ✓ | ✓ | ✓/ | ✓ | ✓/ | ✓ | ✓/ | ✓/✓ | GRAPH, FROM NAMED, isBlank()
[unbehauen2012accessing] | ?/ | ? | ✓ | ? | ✓ | ? | ? | ? | ? | ? | ? |
[thakkar2018stitch, thakkar2018two] | ✓/✗ | ✓/✓ | ✓ | ✓ | ✓ | ✓/✓ | ✓ | ? | ? | ✗/✗ | ✗/✗ | GROUP BY, SUBGRAPH, REMOTE
[lu2008effective] | ? | ✓ | ✓ | ✓ | ? | ? | ? | ✓/ | /✓ | /✓ | ? |
[sequeda2013ultrawrap] | ✓/ | ✓/✓ | ✓ | ✓ | ✓ | ✓/✓ | ? | /✓ | ? | ✓/✓ | ? | BIND
[journals/ws/Rodriguez-MuroR15] | ✓/ | ✓/ | ✓ | ✓ | ✓ | ✓/✓ | ? | /✓ | ? | ? | ? |
[Chebotko06semanticspreserving] | ? | ? | ✓ | ? | ? | ? | ? | ? | ? | ? | ? |
SPARQL-to-Document:
[conf/ontobras/AraujoABW17] | ? | ✓/ | ? | ? | ? | ? | ? | ? | ? | ? | ? |
[unbehauen-semantics-2016-sparqlmap-m] | ✓/ | ✓/✓ | ✓ | ✓ | ✓ | ✓/✓ | ? | /✓ | ? | ✓/✓ | ✓/ |
[mutharaju2013d] | ? | ✗ | ✗ | ? | ✗ | ? | ? | ? | ? | ? | |
[botoeva2016obda] | ? | ✓/ | ✗ | ? | ✗ | ? | ? | ? | ? | ? | ? |
SPARQL-to-XPath/XQuery:
[bikakis2015sparql2xquery, bikakis2009querying, bikakis2009semantic, bikakis2014supporting] | ✓/✓ | ✓/✓ | ✓ | ✓ | ✓ | ✓/✓ | ✓ | ? | ? | ✓/ | ✓/✓ | DELETE, INSERT
[fischer2011translating] | ✓/✓ | ✓/ | ✓ | ✓ | ✓ | ✓/✓ | ? | ? | ? | ? | ? |
[groppe2008embedding] | ? | ✓/✓ | ✓ | ✓ | ✓ | ? | ? | ? | ? | ? | ? |

Table 2: SPARQL features supported in SPARQL-to-X query translations. As in Table 1, ✓ is supported, ✗ is not supported, ? is not (clearly) mentioned as supported. Others are features provided only by individual works.

II. Translation Optimization

3. Optimization strategies

In this section, we use the terms previously introduced in Transformation type (1), in order to avoid repetitions.

XPath/XQuery-to-SQL:

[krishnamurthy2004efficient] suggests eliminating joins by eliminating unnecessary prefix traversals, i.e., the first traversals from the root. [mani2006join] proposes a set of rewrite rules meant to detect and eliminate redundant joins in the relational algebra of the SQL queries resulting from the translation of XML queries. During query translation, [fan2005query] suggests an algorithm leveraging the structure of the XML schema: pushing selections and projections into the LFP (Simple Least Fixpoint) operator. PPFS+ [georgiadis2007xpath] mainly seeks to leverage RDBMS storage of shredded XML data. Based on an empirical evaluation, nested loop join was chosen to apply merge queries over the shredded XML. The authors try to improve query performance by generating pipelined plans that reduce the time to "first results". To ensure that XPath results follow the order of the original XML document and have as few duplicates as possible, redundant orders (ORDER BY) are eliminated, and ordering operations are pushed down the query plan tree. As a physical optimization, the article resorts to indexed file organization for the shredded relations. Even though XTRON [min2008xtron] is schema-oblivious by nature, some schema/structural information is used to speed up query response, namely by encoding simple paths of XML elements into intervals of real numbers using a specific algorithm (Reverse Arithmetic Encoder). This encoding reduces the number of self-joins in the generated SQL queries.

SQL-to-XPath/XQuery:

ROX [halverson2004rox] suggests a cost-based optimization to generate optimal query plans, and physical indexes for quick node look-up; however, no details are given.

SPARQL-to-SQL:

The method in [priyatna2014formalisation] optimizes certain SQL query cases that negatively impact (some) RDBMSs. In particular, sub-query elimination and self-join elimination query rewriting techniques are applied. The former removes non-correlated subqueries from the query by pushing down projections and selections; the latter removes the self-joins occurring in those queries. [elliott2009complete] implements an optimization technique called "early project simplification", which removes from the SELECT clause the variables that are not needed during query processing. In SparqlMap [unbehauen2012accessing], filter expressions are pushed to the graph patterns, and nested SQL queries are flattened to minimize self-joins. In FSparql2Sql [lu2008effective], the translation method may generate an abnormal SQL query with a lot of CASE expressions and constants. The query is optimized by replacing complex expressions by simpler ones, e.g., by manipulating different logical orders, or removing useless ones. The translation approach in Ultrawrap [sequeda2013ultrawrap] is expected to generate a view consisting of a very large union of many SELECT-FROM-WHERE statements. To mitigate this, two strategies are applied: detection of unsatisfiable conditions, and self-join elimination. The former detects, even before executing a query, whether it would yield empty results due to the presence of contradictions, e.g., a WHERE predicate equating an attribute to two different values; it also prunes unnecessary UNION sub-trees, e.g., by removing an empty argument from the UNION. The latter applies in case two attributes of the same table are projected or filtered individually and then joined. The generated SQL query in [journals/ws/Rodriguez-MuroR15] may be sub-optimal due to the presence of, e.g., joins of UNION-subqueries, redundant joins with respect to keys, and unsatisfiable conditions. Using techniques from Logic Programming, Partial Evaluation is used to optimize the Datalog rules dealing with ans and triple atoms, by iteratively filtering out options that would not generate valid answers; Goal Derivation in Nested Atoms and Partial SLD-tree with JOIN and LEFT JOIN deal with join atoms. Techniques from Semantic Query Optimization are applied to detect unsatisfiable queries, e.g., joins equating two different constants, and to simplify trivially satisfiable conditions like x=x. The generated query in [Chebotko06semanticspreserving] is optimized using simplifications, e.g., removing redundant projections that do not contribute to a join or to conditions in subqueries, removing True values from some conditions, reducing join conditions based on logical evaluations, omitting left outer joins in case of SPARQL UNION when the union'ed relations have identical schemata, pushing down projections into SELECT subqueries, etc.
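The detection of unsatisfiable conditions and the resulting pruning of UNION branches can be illustrated with the following Python sketch. It covers only the simplest case, a conjunction of equality tests against constants, and the list-based encoding of conditions is our own assumption rather than the representation used by any of the surveyed systems.

```python
def is_unsatisfiable(conditions):
    """Check a conjunction of (column, constant) equality tests.

    The conjunction can never hold if the same column is equated
    to two different constants.
    """
    seen = {}
    for column, value in conditions:
        if column in seen and seen[column] != value:
            return True
        seen[column] = value
    return False

def prune_union(branches):
    """Drop the UNION branches whose WHERE clause can never be true."""
    return [branch for branch in branches if not is_unsatisfiable(branch)]

branches = [
    [("predicate", "name"), ("predicate", "age")],  # contradiction
    [("predicate", "name"), ("subject", "s1")],
]
kept = prune_union(branches)
```

Because the check is purely syntactic, it can run at translation time, so contradictory branches never reach the RDBMS.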

SPARQL-to-Document:

Query optimization in D-SPARQ [mutharaju2013d] is based on a "divide and conquer"-like principle. It groups triple patterns into independent blocks of triples, which can run more efficiently in parallel. For example, star-shaped pattern groups are considered as indivisible blocks. Within one star pattern group, the triple patterns are ordered by the number of triples involving their predicate. This boosts query processing by reducing the selectivity of the individual pattern groups. In the relational-based OBDA of [botoeva2016obda], the intermediate relational query is simplified by applying structural optimization, e.g., replacing a join of unions by a union of joins, and semantic optimization, e.g., redundant self-join elimination. In [michel2016generic], the generated MongoDB query is optimized by pushing filters to the level of triple patterns, by self-join elimination through merging atomic queries that share the same FROM part, and by self-union elimination through merging UNIONs of atomic queries that share the same FROM part.
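The ordering of triple patterns within a star-shaped group by predicate statistics can be sketched in Python as follows. Two points are assumptions on our side: that the order is ascending (rarest predicate first), and that patterns whose predicate has no statistics are placed last. The counts used in the example are made up for illustration.

```python
def order_patterns(patterns, predicate_counts):
    """Order (s, p, o) patterns by the number of triples using their predicate.

    Assumption: ascending order, and predicates without statistics go last.
    """
    return sorted(
        patterns,
        key=lambda pattern: predicate_counts.get(pattern[1], float("inf")),
    )

counts = {"rdf:type": 100000, "ex:email": 1200, "ex:name": 50000}  # illustrative
star = [
    ("?p", "rdf:type", "ex:Person"),
    ("?p", "ex:name", "?n"),
    ("?p", "ex:email", "?e"),
]
ordered = order_patterns(star, counts)
```

Since sorted() is stable, patterns with equal counts keep their original relative order.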

Cypher-to-SQL:

Cyp2sql [cyp2sql] stores graph data following a specific table scheme, which is designed to optimize specific queries. For example, the Label table is created to overcome the problem of prevalent NULL values in the Nodes table. The query translator decides, at query time, which relationship to use to obtain node information. Relationship data is stored in the Edges table (storing all relationships) as well as in separate per-relationship tables (duplicated). Further optimization is gained from using a couple of metafiles populated during schema conversion, e.g., a node property list per label type, used to narrow down the search for nodes.

SPARQL-to-XPath/XQuery:

In [groppe2008embedding], a logical optimization is applied to the operator tree in order to generate a reorganized equivalent tree with faster translation time (no more details given). Next, a physical optimization aims to find the algorithm that implements the operator with the best estimated performance.

Gremlin-to-SQL:

SQLGraph [sun2015sqlgraph] proposes a translation optimization whereby the non-selective pipe g.V (retrieve all vertices in g) or g.E (retrieve all edges in g) is replaced by a sequence of attribute-based filter pipes (filter pipes that select graph elements based on specific values). For example, the non-selective first pipe g.V is explicitly merged with the more selective filter pipe filter{it.tag == 'w'} in the translation. For the query evaluation, the optimization strategies of the RDBMS are leveraged.
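The effect of merging a non-selective start pipe with the filter pipes that follow it can be illustrated with the Python sketch below. The table names, the one-column-per-attribute layout and the tuple encoding of pipes are hypothetical simplifications of ours; the surveyed system stores attributes in JSON and uses a different schema. Values are inlined without escaping, so the sketch is for illustration only.

```python
TABLES = {"V": "vertices", "E": "edges"}  # hypothetical table names

def pipes_to_sql(pipes):
    """Translate a start pipe followed by attribute filter pipes into SQL.

    Example input for g.V.filter{it.tag == 'w'}:
        [("V",), ("filter", "tag", "w")]
    """
    source, *rest = pipes
    conditions = []
    for kind, attribute, value in rest:
        if kind != "filter":
            raise ValueError(f"unsupported pipe: {kind}")
        conditions.append(f"{attribute} = '{value}'")
    sql = f"SELECT * FROM {TABLES[source[0]]}"
    if conditions:
        # The filters are merged into the scan instead of being applied
        # after all vertices (or edges) have been retrieved.
        sql += " WHERE " + " AND ".join(conditions)
    return sql

merged = pipes_to_sql([("V",), ("filter", "tag", "w")])
```

With the condition inside the WHERE clause, the relational optimizer is free to use an index on the filtered attribute rather than scanning the whole table.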

4. Translation relationship

This information is not always explicitly stated, and we cannot make assumptions based on the architectures or the algorithms, so we only report when there is a clear statement about the type of relationship. Information is collected in Table 3.

Work One-to-one One-to-many
SQL-to-XPath/XQuery:
[halverson2004rox] ROX ✓
SPARQL-to-SQL:
[kiminki2010sparql] Type-ARQuE ✓
[lu2008effective] FSparql2Sql ✓
SQL-to-SPARQL:
[ramanujam2009r2d, ramanujam2009r2dextra] R2D ✓
SQL-to-Document-based:
[querymongo] QueryMongo ✓
Gremlin-to-SQL:
[sun2015sqlgraph] SQLGraph ✓
Table 3: Query Translation relationship.

III. Community Factors

For better readability and structure, we collect the information in Table 4. The last column rates the community effect using stars (★), which are to be interpreted as follows. ★: 'Implemented'; ★★: 'Implemented and Evaluated' or 'Implemented and Available (for download)'; ★★★: 'Implemented, Evaluated and Available (for download)'.

Paper/tool | YFR | YLR | nR | nC | Implementation Reference | Community
XPath/XQuery-to-SQL:
[krishnamurthy2004efficient] 57 ★
[fan2005query] 2005 37 ★★
[mani2006join] 2006 1 ★★
[georgiadis2007xpath] PPFS+ 40 ★★
[hu2008adaptive] 5 ★★
[min2008xtron] XTRON 23 ★★
SQL-to-XPath/XQuery:
[vidhya2009query, vidhya2010insert] 1, 5
[jigyasu2006sql] AquaLogic 2006 2008 22 Acquired by Oracle and merged in its products ★★
[halverson2004rox] 65 ★★
SPARQL-to-SQL:
[stadlerconnecting] Sparqlify 2013 2018 30 2 https://github.com/SmartDataAnalytics/Sparqlify ★★★
[kiminki2010sparql] Type-ARQuE 2010 6 http://www.cs.hut.fi/~skiminki/type-arque/index.html ★★
[priyatna2014formalisation] Morph translator 2014 2018 37 74 Part of Morph-RDB: https://github.com/oeg-upm/morph-rdb ★★★
[Chebotko06semanticspreserving] 151 ★★
[lu2008effective] 28 ★★
[elliott2009complete] 78 ★★★
[unbehauen2012accessing] SPARQLMap 22 ★★★
[sequeda2013ultrawrap] Ultrawrap 99 https://capsenta.com/ultrawrap ★★
[journals/ws/Rodriguez-MuroR15] 52 Part of Ontop: https://github.com/ontop/ontop ★★
SQL-to-SPARQL:
[ramanujam2009r2d, ramanujam2009r2dextra] R2D 19, 15 ★★
[rachapalli2011retro] 14
SQL-to-Document-based:
[querymongo] Query Mongo
[mongoDB-translator-teiid] MongoDB Translator ★
[unityjdbc] UnityJDBC ★
SPARQL-to-Document-based:
[mutharaju2013d] D-SPARQ 11 ★★
[unbehauen-semantics-2016-sparqlmap-m] SparqlMap-M 2015 2017 12 2 https://github.com/tomatophantastico/sparqlmap ★★★
[botoeva2016obda] 19 Extends Ontop but no reference found ★
[conf/ontobras/AraujoABW17] OntoMongo 2017 1 https://github.com/thdaraujo/onto-mongo ★★
[michel2016generic] 2014 2015 6 5 https://github.com/frmichel/morph-xr2rml/tree/query_rewrite ★★
Cypher-to-SQL:
[steer2017cytosm] Cytosm 2017 1 2 https://github.com/cytosm/cytosm ★★★
[cyp2sql] Cyp2sql 2017 2017 1 https://github.com/DTG-FRESCO/cyp2sql ★★
Gremlin-to-SQL:
[sun2015sqlgraph] SQLGraph 2015 44 ★★
SQL-to-Gremlin:
[sqlgremlin] SQL-Gremlin 2015 2016 1 https://github.com/twilmes/sql-gremlin ★
SPARQL-to-XPath/XQuery:
[bikakis2015sparql2xquery, bikakis2009querying, bikakis2009semantic] SPARQL2XQuery 29, 11, 21 http://www.dblab.ntua.gr/~bikakis/SPARQL2XQuery.html ★★★
[groppe2008embedding] 45 ★★
[fischer2011translating] XQL2Xquery 6 ★★
XPath/XQuery-to-SPARQL:
[droop2007translating] 21 ★★
SPARQL-to-Gremlin:
[thakkar2018stitch, thakkar2018two] Gremlinator 2018 6 https://github.com/apache/tinkerpop/tree/master/sparql-gremlin ★★★

Table 4: Community Factors. YFR: year of first release, YLR: year of last release, nR: number of releases, nC: number of citations (from Google Scholar). If nR=1, the first release is also the last release, and the last-release year is that of the last update.

Discussions and Conclusion

Weakly addressed paths.

Although one would presume that SQL-to-Document-based translation is a well-supported path given the popularity of SQL and document databases, the literature on it remains modest. Most of the efforts provide marginal contributions on top of the more general SQL-to-NoSQL translation. Furthermore, the translation along this path is in all cases far from complete and does not follow the systematic methodology observed in other efforts in this study. Some of these works are [dos2013providing, schreiner2015sqltokeynosql, lawrence2014integration]. Similarly, despite the popularity of SQL and Gremlin, Gremlin-to-SQL translation has attracted little attention. That may be due to the large difference between the semantics of Gremlin’s graph traversal model and SQL’s relational model. In general, work on translating between SQL and the MongoDB and Gremlin languages is still at a relatively early stage, partially because the semantics and complexity of MongoDB’s document language and of Gremlin lack a strong formal foundation. On the other hand, the path XPath/XQuery-to-SPARQL has far fewer works than its reverse. This is possibly because SPARQL is more frequently used for solving integration problems as part of the OBDA framework, which involves translating SPARQL queries into the query languages of the underlying data sources.
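To make the SQL-to-Document-based path concrete, the following is a minimal sketch of translating a very small SQL fragment into the filter and projection documents that MongoDB's find operation accepts. It is not the method of any surveyed work: the function name sql_to_mongo, the supported fragment (a single table, AND-connected comparisons, no nesting) and the regex-based parsing are all assumptions made for illustration. Only the comparison operators ($eq, $ne, $gt, $gte, $lt, $lte) are taken from MongoDB's query language.

```python
import re

# Hypothetical, illustration-only translator; not taken from any surveyed tool.
# SQL comparison operators mapped to MongoDB query operators.
_OPS = {"=": "$eq", "!=": "$ne", ">": "$gt", ">=": "$gte", "<": "$lt", "<=": "$lte"}


def _literal(raw):
    """Parse a SQL literal: a single-quoted string, an integer, or a float."""
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == raw[-1] == "'":
        return raw[1:-1]
    try:
        return int(raw)
    except ValueError:
        return float(raw)


def sql_to_mongo(sql):
    """Translate 'SELECT cols FROM coll [WHERE a op v AND ...]' into a
    (collection, filter, projection) triple in the shape used by find()."""
    m = re.fullmatch(
        r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+?))?\s*;?\s*",
        sql,
        flags=re.IGNORECASE,
    )
    if m is None:
        raise ValueError("unsupported query: " + sql)
    cols, collection, where = m.groups()

    # SELECT * means "no projection"; otherwise include each listed field.
    projection = None if cols.strip() == "*" else {c.strip(): 1 for c in cols.split(",")}

    # Each AND-connected comparison becomes one operator entry on its field.
    # Assumption: string literals do not themselves contain ' AND '.
    query = {}
    if where:
        for cond in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE):
            c = re.fullmatch(r"(\w+)\s*(>=|<=|!=|=|>|<)\s*(.+)", cond.strip())
            if c is None:
                raise ValueError("unsupported condition: " + cond)
            field, op, raw = c.groups()
            query.setdefault(field, {})[_OPS[op]] = _literal(raw)
    return collection, query, projection
```

Under these assumptions, the query SELECT name, age FROM users WHERE status = 'A' AND age > 30 yields the collection users, the filter {"status": {"$eq": "A"}, "age": {"$gt": 30}} and the projection {"name": 1, "age": 1}. Joins, OR, nested documents and aggregation are exactly the parts such a sketch leaves out, which is where the completeness concerns raised above arise.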

  1974    chamberlin1974sequel.   SQL introduced.
  2002    boag2002xquery.   XQuery introduced.
  2003    berglund2003xml.   XPath introduced.
  2004    halverson2004rox ROX [krishnamurthy2004efficient, krishnamurthy2004recursive].   SQL-to-XPath/XQuery XPath/XQuery-to-SQL.
  2005    fan2005query.   XPath-to-SQL.
  2006    mani2006join.   XQuery-to-SQL.
  2007    droop2007translating georgiadis2007xpath.   XPath/XQuery-to-SPARQL XPath/XQuery-to-SQL.
  2008    prudhommeaux2008sparql hu2008adaptive lu2008effective min2008xtron XTRON.   SPARQL introduced XML-to-SQL SPARQL-to-SQL XQuery-to-SQL.
  2009    fan2009query vidhya2009query elliott2009complete bikakis2009querying, bikakis2009semantic ramanujam2009r2d.   XPath-to-SQL SQL-to-XPath SPARQL-to-SQL SPARQL-to-XQuery SQL-to-SPARQL.
  2010    vidhya2010insert kiminki2010sparql Type-ARQuE.   SQL-to-XQuery SPARQL-to-SQL.
  2011    das2011r2rml R2RML atay2011schema fischer2011translating rachapalli2011retro RETRO.   SQL-to-SPARQL XML-to-SQL SQL- and SPARQL-to-XQuery SQL-to-SPARQL.
  2012    rodriguez2012quest Quest unbehauen2012accessing SPARQLMap.   SPARQL-to-SQL SPARQL-to-SQL.
  2013    dos2013providing sequeda2013ultrawrap Ultrawrap.   SQL-to-Document-based SPARQL-to-SQL.
  2014    bikakis2014supporting priyatna2014formalisation Morph lawrence2014integration.   SPARQL-to-XQuery SPARQL-to-SQL SQL-to-Document-based.
  2015    sun2015sqlgraph SQLGraph bikakis2015sparql2xquery.   Gremlin-to-SQL SPARQL-to-XQuery.
  2016    unbehauen-semantics-2016-sparqlmap-m SparqlMap-M.   SQL-to-Document-based.
  2017    steer2017cytosm Cytosm.   Cypher-to-SQL.
  2018    thakkar2018stitch, thakkar2018two Gremlinator.   SPARQL-to-Gremlin.

Table 5: Timeline recording publication years of the considered query languages and methods.

Missing paths.

We have not found any articles or software tools for the following paths: SQL-to-Cypher, Gremlin-to-SPARQL, XPath/XQuery-to-Cypher and vice versa, XPath/XQuery-to-Gremlin and vice versa, and Cypher-to-Document-based and vice versa. We see opportunities in tackling those translation paths, with rationales similar to those behind the paths already tackled. For example, although SPARQL and Gremlin fundamentally differ in their approaches to querying graph data, one being based on graph pattern matching and the other on graph traversals, they are both graph query languages. A translation from one to the other not only enables interoperability between systems supporting those languages, but also makes data from one world available to the other without requiring users to learn the other query language [DBLP:conf/amw/AnglesTT19]. Similarly, since XML languages have a rooted notion of traversal, a conversion to and from Gremlin is natural. In fact, according to [lindaaker2018graphhistory], the early prototype of Gremlin used XPath for querying graph data.
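The contrast between pattern matching and traversal can be illustrated with a toy translator that turns a list of SPARQL-like triple patterns into a Gremlin match() traversal, emitted as a string. This is only a sketch under strong assumptions and does not reproduce any surveyed tool: the function name bgp_to_gremlin is hypothetical, subjects must be variables, a variable object is assumed to be reached over an outgoing edge labelled with the predicate, and a constant object is assumed to be a string-valued vertex property.

```python
# Hypothetical, illustration-only translator from triple patterns to Gremlin.
def bgp_to_gremlin(patterns, select):
    """Translate triple patterns (subject, predicate, object) into a Gremlin
    match() traversal string. Terms starting with '?' are variables."""
    steps = []
    for s, p, o in patterns:
        if not s.startswith("?"):
            raise ValueError("this sketch only supports variable subjects")
        start = f"__.as('{s[1:]}')"
        if o.startswith("?"):
            # Assumption: variable object -> follow an outgoing edge labelled p.
            steps.append(f"{start}.out('{p}').as('{o[1:]}')")
        else:
            # Assumption: constant object -> string property p on the subject.
            steps.append(f"{start}.has('{p}', '{o}')")
    labels = ", ".join(f"'{v[1:]}'" for v in select)
    return f"g.V().match({', '.join(steps)}).select({labels})"


# Example: "?a knows ?b . ?b name 'Bob'" projected on ?a and ?b.
traversal = bgp_to_gremlin(
    [("?a", "knows", "?b"), ("?b", "name", "Bob")],
    select=["?a", "?b"],
)
```

Each triple pattern becomes one traversal fragment anchored on a labelled step, so the declarative pattern is re-expressed as a set of small traversals that the match() step then joins on shared labels.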

Gaps and Lessons Learned.

The survey has also allowed us to identify gaps and learn lessons, which we summarize in the following points:

  • We noticed that optimizations applied during the query translation process have more potential to improve the overall translation performance than optimizations applied to the generated query. This is because, at query translation time, information from the system of the original query, e.g., statistics, can be leveraged to shape the resulting target query. This opportunity is no longer available once the query in the target language has been generated.

  • Looking at language scope coverage, the more sophisticated operations of query languages are still poorly covered, e.g., more join types and temporal functions in SQL; blank nodes, grouping and binding in SPARQL. Such operations are motivated by, and are at the core of, many modern analytical and real-time applications. Indeed, some of those features are newly introduced and some of the needs have only recently been exposed, in which case we call for both updating the existing works and building new solutions to embrace the new features and address the new needs.

  • Certain works present well-founded and well-defined query translation frameworks, from the query translation process to the various optimization strategies. However, the example queries they actually work on are simple and hardly represent real-world queries. Use-case-driven translation methods would be more helpful to reveal the useful query patterns and fragments, and to evaluate the translation methods and optimizations on real-world data.

  • The evaluation frameworks used by the query translation methods vary widely. Adopting a single standardized benchmark specialized in evaluating and assessing query translation aspects is paramount. Unfortunately, no such dedicated benchmark exists at the time of writing.
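The first point above, on translation-time optimization, can be illustrated with a small sketch: before the target query is emitted, the translator reorders the query's patterns using statistics from the source system, so that the most selective pattern comes first. The function name order_by_selectivity, the triple-pattern representation and the per-predicate counts are all hypothetical, and ordering by predicate frequency is just one simple heuristic, not the strategy of any particular surveyed work.

```python
# Hypothetical, illustration-only reordering step run at translation time.
def order_by_selectivity(patterns, predicate_counts):
    """Sort (subject, predicate, object) patterns so that those with the
    rarest predicate come first. Predicates without statistics go last."""
    return sorted(patterns, key=lambda t: predicate_counts.get(t[1], float("inf")))


# Made-up statistics: number of stored triples per predicate.
counts = {"type": 1_000_000, "worksAt": 50_000, "locatedIn": 200}

patterns = [
    ("?p", "type", "Person"),
    ("?p", "worksAt", "?c"),
    ("?c", "locatedIn", "Bonn"),
]

# The translator would emit the target query following this order.
ordered = order_by_selectivity(patterns, counts)
```

Here the locatedIn pattern is moved to the front, followed by worksAt and type. Once the target query has been generated without such information, the target system can only rely on its own statistics, which may not exist for the translated form of the data.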

Candidates for a ’universal’ query language.

After discovering and exploring the various query translation methods, it appears that SQL and SPARQL are the most suitable languages to act as a ’universal’ language for realizing heterogeneous data integration. They both have the largest number of translations to other languages (see outgoing edges in Figure 1). SQL is the oldest query language, with continuing development cycles and adoption. SPARQL is the stable query language of the so-called ontology-based data integration and access, which specializes in integrating data coming from heterogeneous sources.

Query Translation History.

We project the surveyed works onto a vertical timeline, shown in Table 5. The visualization allows us to make some remarks. SPARQL was very quickly recognized by the community, as works translating to and from SPARQL started to emerge the same year it was suggested. We cannot make a similar judgment about the adoption of SQL, XPath and XQuery, as they were introduced earlier than the timeframe we consider in this study, 2003-2019. Works on translating to and from SPARQL have continued to attract research efforts to date. Works translating to and from SQL are present in all the years of the timeline, except 2013. With less regularity, works translating to and from XML languages have also been continually published. However, despite the latest updates to these languages in 2017, we have not found any such works (at least complying with our criteria) published since 2015.

In this article, we have surveyed more than forty articles and tools around query translation between seven popular query languages. Although organizing the information was a complicated and sensitive task, the study allowed us to extract eight common criteria according to which we categorized the surveyed works. It also allowed us to discover which translation paths are not sufficiently addressed and which ones are not addressed yet, as well as to observe gaps and learn lessons for future research on the topic. We hope that reporting this knowledge opens new doors for research and development on the topic of query translation, and helps users of applications such as polyglot persistence and data lakes extract more value from their data by tackling the data variety issue.

References