
Suboptimal queries when using remote SPARQL endpoint #5

@elordis

Hello.

We have an existing triplestore with a SPARQL endpoint running and want to attach a GraphQL interface to it. The problem is that any query we try to run results in enormously large data transfers.
Our config is pretty simple:

{
    "DataSources": [
        {
            "Name": "corese",
            "Provider": "remote",
            "Default": true,
            "DefaultNamespace": "http://example.net/device.owl#",
            "Prefixes": {
                "net": "http://example.net/device.owl#"
            },
            "Settings": {
                "EndpointUri": "<omitted>"
            }
        }
    ],
    "Definitions": [
        {
            "Provider": "inline",
            "Settings": {
                "Schema": {
                    "Query": {
                        "Fields": [
                            {
                                "Name": "device",
                                "Object": "Device",
                                "IsArray": true
                            }
                        ]
                    },
                    "Interfaces": [
                        {
                            "Name": "IRdfsExtensions",
                            "Namespace": "http://www.w3.org/2000/01/rdf-schema#",
                            "Fields": [
                                {
                                    "Name": "label",
                                    "Scalar": "String"
                                }
                            ]
                        }
                    ],
                    "Types": [
                        {
                            "Name": "Device",
                            "Interfaces": [
                                "IRdfsExtensions"
                            ]
                        }
                    ]
                }
            }
        }
    ]
}

We do a simple query like this:

query {
	device(id: "http://example.net/device.owl#<omitted>"){
		label
	}
}

And it takes more than 4 seconds to complete on our database, while a hand-written SPARQL query for similar results completes in about 40 ms (see the sketch after the generated queries below).
I've decided to look at how our database is queried and discovered that GraphSPARQL pretty much tries to read the whole database on every request. E.g. the request above is translated into two queries.
The first one is fine:

CONSTRUCT
{ ?__s0 <https://schema.uibk.ac.at/GraphSPARQL/triples/p0> ?__o0 . }
WHERE
{
  { }
  UNION
  {
    ?__o0 a ?__s0 .
    FILTER((?__s0 = <http://example.net/device.owl#Device>) && (?__o0 = <http://example.net/device.owl#<omitted>>))
  }
}

But the second one reads every label in the database. Also, if I don't use a filter in the request above, the VALUES section still includes the results from the first query, of which there may be a few thousand rows.

CONSTRUCT
{ ?__s0 <https://schema.uibk.ac.at/GraphSPARQL/triples/p0> ?__o0 . }
WHERE
{
  { }
  UNION
  {
    {
      VALUES ( ?__s0 )
      {( <http://example.net/device.owl#<omitted>>)}
    }
    UNION
    { ?__s0 <http://www.w3.org/2000/01/rdf-schema#label> ?__o0 . }
  }
}
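
For comparison, a direct SPARQL query for the same data would look something like this (a minimal sketch; the device IRI is the same omitted placeholder as above), and this is the kind of query that completes in about 40 ms on our database:

SELECT ?label
WHERE
{
  <http://example.net/device.owl#<omitted>> <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}

It touches a single subject, while the generated second query returns every rdfs:label triple in the store.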

And that is with the simplest GraphQL queries. When we try to do anything resembling our production needs, things get even worse.

So, is everything working as intended? Or did we perhaps miss some hidden options to speed things up?
