This repository was archived by the owner on Feb 25, 2020. It is now read-only.

Query directly from raw files #528

@jdegoes

To make Precog much more accessible and user-friendly for local installs, and to prepare for work on a distributed version of Precog, we should allow querying directly on files stored in any format for which we have an input adapter, similar to how Hive and Pig handle data analysis.

This ticket covers refactoring the query engine to allow querying directly over JSON data files, CSV files and, of course, NIHDB 'files', in a file system containing a variety of file formats.

To do this, we need to define a suitable input adapter interface which exposes a Table-oriented view of a file format, and propagate whatever information is necessary to use a particular adapter. For example, CSV files (and possibly even JSON files) may be ambiguous on their own and require extra information, such as the delimiter, before they can be interpreted unambiguously as a Table.
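As a rough illustration of that adapter shape, here is a minimal Scala sketch. All names (`Row`, `AdapterOptions`, `InputAdapter`, `CsvAdapter`) are hypothetical and are not Precog's actual types; the real Table abstraction is far richer. The point is only that format-specific hints such as the delimiter travel alongside the input. The CSV parsing is deliberately naive (header on the first line, no quoting or escaping).

```scala
// Hypothetical sketch: these types are illustrative, not Precog's real API.

// One record of a table-oriented view, keyed by column name.
final case class Row(fields: Map[String, String])

// Format-specific hints needed to interpret an ambiguous input.
final case class AdapterOptions(delimiter: Char = ',')

// An input adapter turns raw lines of some format into rows.
trait InputAdapter {
  def name: String
  def load(lines: Iterator[String], options: AdapterOptions): Iterator[Row]
}

// Naive CSV adapter: the first line is the header; no quoting support.
object CsvAdapter extends InputAdapter {
  val name: String = "csv"

  def load(lines: Iterator[String], options: AdapterOptions): Iterator[Row] =
    if (!lines.hasNext) Iterator.empty
    else {
      val header = splitLine(lines.next(), options.delimiter)
      lines.map(line => Row(header.zip(splitLine(line, options.delimiter)).toMap))
    }

  // Splits on a literal character (not a regex) and trims each cell.
  private def splitLine(line: String, delimiter: Char): Vector[String] =
    line.split(delimiter).map(_.trim).toVector
}
```

With this shape, the same file contents can be read correctly only if the caller supplies the right options, e.g. `CsvAdapter.load(lines, AdapterOptions(delimiter = ';'))` for a semicolon-separated file.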

Some file "formats" may in fact be directories containing many files; we should think about how to handle these.

Note that as per @nuttycom's comment, we already have JSON-backed and even JDBC-backed table adapters. The exact functionality we lack is the ability to discriminate between alternate representations at runtime based on the actual string paths passed to the table load function, as well as an architecture that makes it easy to add new input adapters and rules for selecting them during runtime loads.

This ticket will be considered complete when it is possible to create a Quirrel script that loads data from a JSON file, a CSV file, and a NIHDB file and joins them all together, and when the associated architecture allows cleanly adding support and selection criteria for new input adapters. Adding one should require only defining the input adapter and describing the rules that dictate when it is used for dynamically loaded data (e.g. when the file extension or MIME type matches).
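One possible way to express "rules that dictate when an adapter is used" is a small registry of predicates over the load path, consulted at runtime. The sketch below is only an assumption about how this could look: `SelectionRule`, `AdapterRegistry`, `byExtension`, and `DefaultRules` are hypothetical names, adapters are identified by plain strings to keep the example self-contained, and only extension-based rules are shown (a MIME-type rule would be another predicate).

```scala
// Hypothetical sketch of runtime adapter selection; not Precog's real API.

// A rule pairs an adapter name with a predicate over the load path.
final case class SelectionRule(adapterName: String, matches: String => Boolean)

// Immutable registry: rules are tried in registration order.
final class AdapterRegistry(rules: Vector[SelectionRule]) {
  def register(rule: SelectionRule): AdapterRegistry =
    new AdapterRegistry(rules :+ rule)

  // Returns the name of the first adapter whose rule matches, if any.
  def select(path: String): Option[String] =
    rules.find(_.matches(path)).map(_.adapterName)
}

object AdapterRegistry {
  val empty: AdapterRegistry = new AdapterRegistry(Vector.empty)

  // Convenience rule: match on a case-insensitive file extension.
  def byExtension(adapterName: String, ext: String): SelectionRule =
    SelectionRule(adapterName, path => path.toLowerCase.endsWith("." + ext.toLowerCase))
}

// Example wiring; the extensions chosen here are assumptions.
object DefaultRules {
  val registry: AdapterRegistry = AdapterRegistry.empty
    .register(AdapterRegistry.byExtension("json", "json"))
    .register(AdapterRegistry.byExtension("csv", "csv"))
    .register(AdapterRegistry.byExtension("nihdb", "nihdb"))
}
```

Supporting a new format then amounts to one more `register` call with its own rule, and the table load function only needs to call `select` on the path string it was given. A path that matches no rule yields `None`, leaving the fallback policy to the caller.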
