diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index 26b4a571fc..4444973aa3 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -102,20 +102,18 @@ - [unnest](super-sql/operators/unnest.md) - [values](super-sql/operators/values.md) - [where](super-sql/operators/where.md) - - [SQL Clauses](super-sql/sql/intro.md) - - [FROM](super-sql/sql/from.md) + - [SQL](super-sql/sql/intro.md) - [SELECT](super-sql/sql/select.md) + - [FROM](super-sql/sql/from.md) - [WHERE](super-sql/sql/where.md) - [GROUP BY](super-sql/sql/group-by.md) - [HAVING](super-sql/sql/having.md) - - [FILTER](super-sql/sql/filter.md) - [VALUES](super-sql/sql/values.md) - - [ORDER](super-sql/sql/order.md) + - [ORDER BY](super-sql/sql/order-by.md) - [LIMIT](super-sql/sql/limit.md) - - [JOIN](super-sql/sql/join.md) - [WITH](super-sql/sql/with.md) - - [UNION](super-sql/sql/union.md) - - [INTERSECT](super-sql/sql/intersect.md) + - [JOIN](super-sql/sql/join.md) + - [Set Operators](super-sql/sql/set-ops.md) - [Functions](super-sql/functions/intro.md) - [Generics](super-sql/functions/generics/intro.md) - [coalesce](super-sql/functions/generics/coalesce.md) diff --git a/book/src/super-sql/aggregates/intro.md b/book/src/super-sql/aggregates/intro.md index cdf3a00d2e..3fe8533714 100644 --- a/book/src/super-sql/aggregates/intro.md +++ b/book/src/super-sql/aggregates/intro.md @@ -15,11 +15,11 @@ with the particular function, and Aggregate functions may appear in * the [aggregate](../operators/aggregate.md) operator, * an aggregate [shortcut](../operators/intro.md#shortcuts), or -* in [SQL operators](../sql/intro.md) when performing aggregations. +* in [SQL expressions](../sql/intro.md) when performing aggregations. 
When aggregate functions appear in context of grouping (e.g., the `by` clause of an [aggregate](../operators/aggregate.md) operator or a -[SQL operator](../sql/intro.md) with a [GROUP BY](../sql/group-by.md) clause), +[SELECT](../sql/select.md) query with a [GROUP BY](../sql/group-by.md) clause), then the aggregate function produces one output value for each unique combination of grouping expressions. diff --git a/book/src/super-sql/declarations/functions.md b/book/src/super-sql/declarations/functions.md index 522b25294f..249a38d316 100644 --- a/book/src/super-sql/declarations/functions.md +++ b/book/src/super-sql/declarations/functions.md @@ -140,7 +140,7 @@ fn stats(numbers): ( | sort this | avg(this),min(this),max(this),mode:=collect(this) | mode:=mode[len(mode)/2] -) +) values stats(a) # input {a:[3,1,2]} diff --git a/book/src/super-sql/expressions/inputs.md b/book/src/super-sql/expressions/inputs.md index ff973e9b90..f80a70f2e8 100644 --- a/book/src/super-sql/expressions/inputs.md +++ b/book/src/super-sql/expressions/inputs.md @@ -8,9 +8,10 @@ is always referenced as the special value `this`. In [relational scoping](../intro.md#relational-scoping), input data is referenced by specifying the columns of one or more tables. -See the [SQL section](../sql/intro.md#input-references) for -details on how columns are bound to [identifiers](../queries.md#identifiers), how table references -are resolved, and how `this` behaves in a SQL expression. +See the [SQL section](../sql/intro.md) for +details on how columns are [bound](../sql/intro.md#relational-bindings) +to [identifiers](../queries.md#identifiers), how table references +are resolved, and how [`this`](../sql/intro.md#this) behaves in a SQL expression. The type of `this` may be any [type](../types/intro.md). When `this` is a [record](../types/record.md), references @@ -29,12 +30,9 @@ tables or columns are referenced. 
In a SQL operator, if the input is not a record (i.e., not relational), then the input data can still be referred to as the value `this` and placed into an output relation using [SELECT](../sql/select.md). -When referring to non-relational inputs with `*`, there are no columns and -thus the select value is empty, i.e., the value `{}`. - -When non-record data is referenced in a SQL operator and the input -schema is dynamic and unknown, runtime [errors](../types/error.md) like `error("missing")` -will generally arise and be present in the output data. +Otherwise, column references to non-record data in dynamic inputs +generally cause runtime [errors](../types/error.md) +like `error("missing")`. ### Examples diff --git a/book/src/super-sql/expressions/intro.md b/book/src/super-sql/expressions/intro.md index 7c82effd8d..6e3e7493aa 100644 --- a/book/src/super-sql/expressions/intro.md +++ b/book/src/super-sql/expressions/intro.md @@ -21,7 +21,7 @@ used by pipe operators. While SQL expressions and pipe expressions share an identical syntax, their semantics diverge in some key ways: -* SQL expressions that reference `this` have [semantics](../sql/intro.md#accessing-this) +* SQL expressions that reference `this` have [semantics](../sql/intro.md#this) that depend on the SQL clause that expression appears in, * relational tables and/or columns cannot be referenced using aliases in pipe scoping, * double-quoted string [literals](literals.md) may be used in pipe expressions but are interpreted @@ -52,7 +52,7 @@ Operators include an array, set, record, [map](../types/map.md), string, or [bytes](../types/bytes.md), * [logic](logic.md) to combine predicates using Boolean logic, and * [slices](slices.md) to extract subsequences from arrays, sets, strings, and bytes. 
- + ### Identifier Resolution An identifier that appears as an operand in an expression is resolved to diff --git a/book/src/super-sql/expressions/subqueries.md b/book/src/super-sql/expressions/subqueries.md index c4f40cc200..799c09eba9 100644 --- a/book/src/super-sql/expressions/subqueries.md +++ b/book/src/super-sql/expressions/subqueries.md @@ -221,7 +221,7 @@ _Independent subqueries in SQL operators are supported while correlated subqueri let input = (values {x:1},{x:2},{x:3}) select x from input -where x >= (select avg(x) from input) +where x >= (select avg(x) from input) # input # expected output diff --git a/book/src/super-sql/intro.md b/book/src/super-sql/intro.md index c4c4c075e8..43ed8887bf 100644 --- a/book/src/super-sql/intro.md +++ b/book/src/super-sql/intro.md @@ -20,27 +20,16 @@ and the SuperSQL compiler often optimizes a query into an implementation different from the [dataflow](https://en.wikipedia.org/wiki/Dataflow) implied by the pipeline to achieve the same semantics with better performance. -While SuperSQL at its core is a pipe-oriented language, it is also -[backward compatible](../intro.md#supersql) with relational SQL in that any -arbitrarily complex SQL query may appear as a single pipe operator -anywhere in a SuperSQL pipe query. - -In other words, a single pipe operator that happens to be a standalone SQL query -is also a SuperSQL pipe query. -For example, these are all valid SuperSQL queries: -``` -SELECT 'hello, world' -SELECT * FROM table -SELECT * FROM f1.json JOIN f2.json ON f1.id=f2.id -SELECT watchers FROM https://api.github.com/repos/brimdata/super -``` - -## Interactive UX +## Friendly Syntax -To support an interactive pattern of usage, SuperSQL includes -[search](operators/search.md) syntax -reminiscent of Web or email keyword search along with -[_operator shortcuts_](operators/intro.md#shortcuts). 
+In addition to its user-friendly pipe syntax, +SuperSQL embraces two key design patterns that simplify +query editing for interactive usage: +* [shortcuts](operators/intro.md#shortcuts) that reduce +typing overhead and provide a concise syntax for common query patterns, and +* [search](operators/search.md) +reminiscent of Web or email keyword search, which is otherwise hard +to carry out with traditional SQL syntax. With shortcuts, verbose queries can be typed in a shorthand facilitating rapid data exploration. For example, the query @@ -49,7 +38,10 @@ SELECT count(), key FROM source GROUP BY key ``` -can be simplified as `from source | count() by key`. +can be simplified to +``` +from source | count() by key +``` With search, all of the string fields in a value can easily be searched for patterns, e.g., this query @@ -60,6 +52,23 @@ from source searches for the strings "example.com" and "urgent" in all of the string values in the input and also includes a numeric comparison regarding the field `message_length`. +## SQL Compatibility + +While SuperSQL at its core is a pipe-oriented language, it is also +[backward compatible](sql/intro.md) with relational SQL in that any +arbitrarily complex SQL query may appear as a single pipe operator +anywhere in a SuperSQL pipe query. + +In other words, a single pipe operator that happens to be a standalone SQL query +is also a SuperSQL pipe query. +For example, these are all valid SuperSQL queries: +``` +SELECT 'hello, world' +SELECT * FROM table +SELECT * FROM f1.json JOIN f2.json ON f1.id=f2.id +SELECT watchers FROM https://api.github.com/repos/brimdata/super +``` + ## Pipe Queries The entities that transform data within a SuperSQL pipeline are called @@ -109,7 +118,7 @@ fork ## Pipe Sources -Like SQL, input data for a query is typically sourced with the +Like SQL, input data for a pipe query is typically sourced with the [from](operators/from.md) operator. 
When `from` is not present, the file arguments to the @@ -331,12 +340,12 @@ The array subquery produces an array value so it is often desirable to [unnest](operators/unnest.md) this array with respect to the outer values as in ``` -from f1.json | unnest {outer:this,inner:[from f2.json | ...]} into ( ) +from f1.json | unnest {outer:this,inner:[from f2.json | ...]} into ( ) ``` -where `` can be an arbitrary pipe query that processes each +where `` is an arbitrary pipe query that processes each collection of unnested values separately as a unit for each outer value. -The `into ( )` body is an optional component of `unnest`, and if absent, -the unnested collection boundaries are ignored and all of the unnested data is output. +The `into ( )` body is an optional component of `unnest`, and if absent, +the unnested collection boundaries are ignored and all of the unnested data is output as a combined sequence. With the `unnest` operator, we can now consider how a [correlated subquery](https://en.wikipedia.org/wiki/Correlated_subquery) from SQL can be implemented purely as a pipe query with pipe scoping. @@ -363,7 +372,7 @@ giving the same result {s:21} ``` -## Strong Typing +## Type Checking Data in SuperSQL is always strongly typed. 
diff --git a/book/src/super-sql/operators/from.md b/book/src/super-sql/operators/from.md index 76bfb8d1d7..b7c019753a 100644 --- a/book/src/super-sql/operators/from.md +++ b/book/src/super-sql/operators/from.md @@ -1,37 +1,101 @@ -### Operator +# from -[✅](../intro.md#data-order)[🎲](../intro.md#data-order)  **from** — source data from databases, files, or URLs +[✅](../intro.md#data-order)[🎲](../intro.md#data-order) source data from databases, files, or URLs -### Synopsis +## Synopsis ``` -from [ ( format ) ] -from [@] -from [ ( format method headers body ) ] -from eval() [ ( format method headers body ) ] +from [ ( ) ] +from ``` +where `` has the form of +* a [text entity](../queries.md#text-entity) representing a file, URL, or pool name, +* an [f-string](../expressions/f-strings.md) representing a file, URL, or pool name, +* a [glob](../queries.md#glob) matching files in the local file system or pool names in a database, or +* a [regular expression](../queries.md#regular-expression) matching pool names + in a database; + +`` is an optional concatenation of named [options](#options); and, + +`` is an identifier referencing a +[declared query](../declarations/queries.md). -### Description +## Description The `from` operator identifies one or more sources of data as input to -a query and transmits that data to its output. +a query and transmits the data required by the query to its output. -It has two forms: -* a `from` pipe operator with [pipe scoping](../intro.md#pipe-scoping) as described here, or -* a SQL [`FROM`](../sql/from.md) clause with - [relational scoping](../intro.md#relational-scoping). +Unlike the [FROM](../sql/from.md) clause in a [SQL query](../sql/intro.md), +the pipe `from` merely sources data to its downstream operators +and does not include relational joins or table subqueries. 
As a pipe operator, `from` preserves the order of the data within a file, -URL, or a sorted pool but when multiple sources are identified, +URL, or a sorted pool but when multiple sources are identified +(e.g., as a file-system glob or regular expression matching pools), the data may be read in parallel and interleaved in an undefined order. -Optional arguments to `from` may be appended as a parenthesized concatenation -of arguments. +Optional arguments to `from` may be appended as a parenthesized concatenation of named [arguments](#options). + +### Entity Syntax + +How the entity is interpreted depends on whether the query is run +attached to or detached from a [database](../../command/db.md). + +When detached from a database, the entity must be a +[text entity](../queries.md#text-entity), +[f-string](../expressions/f-strings.md), or +[glob](../queries.md#glob). +A glob matches [files](#files) in the file system +while a text entity or f-string +is a [URL](#urls) if it parses as a URL; otherwise, it is presumed to be a file path. + +When attached to a database, the entity must be a +[text entity](../queries.md#text-entity), +[f-string](../expressions/f-strings.md), +[glob](../queries.md#glob), or a slash-delimited +[regular expression](../queries.md#regular-expression). +A regular expression matches [pools](#pools) in the attached database. +A text entity or f-string is a [URL](#urls) if it parses as a URL; otherwise, +it is presumed to be a pool name. + +Local files are not accessible when attached to a database. + +> [!NOTE] +> While pool names and file names have overlapping syntax, +> their use is disambiguated by the presence or absence of an attached +> database. -When reading from sources external to a [database](../../command/db.md) (e.g., URLs or files), -the format of each data source is automatically detected using heuristics. 
-To manually specify the format of a source and override the autodetection heuristic, +When the entity is an [f-string](../expressions/f-strings.md), +the `from` operator reads data from its upstream pipe operator +and for each input value, the f-string expression is evaluated and +used as the `` string argument. Each such entity is scanned +one at a time and the data is fed to the output of `from`. +When an entity does not exist, a structured error is produced and +the query continues execution. + +### Options + +Options to `from` may be appended as a parenthesized list of name/value pairs +having the form: +``` +( [ ... ] ) +``` +Each entity type supports a specific set of named options as described below. +When the entity comprises multiple sources (e.g., with a glob), then the +options apply to every entity matched. + +### Format Detection + +When reading data from files or URLs, the serialization format of the +input data is determined by the presence of a +[well-known extension](../../command/super.md#supported-formats) +(e.g., `.json`, `.sup`, etc.) on the file path or URL, +or if the extension is not present or unknown, the format is +[inferred](../../command/super.md#format-detection) +by inspecting the input data. + +To manually specify the format of a source and override these heuristics, a format argument may be appended as an argument and has the form ``` format @@ -40,96 +104,84 @@ where `` is the name of a supported [serialization format](../../command/super.md#supported-formats) and is parsed as a [text entity](../queries.md#text-entity). -When `from` references a file or URL entity whose name ends in a -[well-known extension](../../command/super.md#supported-formats) -(e.g., `.json`, `.sup`, etc.), auto-detection is disabled and the -format is implied by the extension name. 
+### Files -#### File-System Operation +When the `` argument is recognized as a file, the file +data required for the query +is read from the local file system, parsed as its specified or +detected serialization format, and emitted to its output. -When running detached from a database, the target of `from` -is either a -[text entity](../queries.md#text-entity) -or a file system [glob](../queries.md#glob). +File-system paths are interpreted relative to the directory in which +the [super](../../command/super.md) command is running. -If a text entity is parseable as an HTTP or HTTPS URL, -then the target is presumed to be a [URL](#url) and is processed -accordingly. Otherwise, the target is assumed to be a file -in the file system whose path is relative to the directory -in which the `super` command is running. +The only allowed option for file entities is the +[format](#format-detection) option described above. -If the target is a glob, then the glob is expanded and the files -are processed in an undefined order. Any operator arguments specified -after a glob target are applied to all of the matched files. - -Here are a few examples illustrating file references: +Here are some examples of file syntax: ``` -from "file.sup" from file.json +from 'file-with-dash.sup' +from /path/to/file.csv from file*.parq (format parquet) ``` -#### Database Operation - -When running attached to a database (i.e., using `super db`), -the target of `from` is either a -[text entity](../queries.md#text-entity) -or a [regular expression](../queries.md#regular-expression) -or [glob](../queries.md#glob) that matches pool names. - -If a text entity is parseable as an HTTP or HTTPS URL, -then the target is presumed to be a [URL](#url) and is processed -accordingly. Otherwise, the target is assumed to be the name -of a pool in the attached database. - -Local files are not accessible when attached to a database. 
+### Pools -Note that pool names and file names have similar syntax in `from` but -their use is disambiguated by the presence or absence of an attached -database. +When the `` argument is recognized as a [database](../../command/db.md) pool, +the data required for the query is read from the database and +emitted to its output. -When multiple data pools are referenced with a glob or regular expression, -they are scanned in an undefined order. - -The reference string for a pool may also be appended with an `@`-style -[commitish](../../database/intro.md#commitish), which specifies that -data is sourced from a specific commit in a pool's commit history. +The only allowed option for a pool is the commit argument having the form +``` +commit +``` +where `` is a +[commitish](../../database/intro.md#commitish) that specifies a specific +commit in the pool's log thereby allowing time travel. -When a single pool name is specified without an `@` reference, or -when using a glob or regular expression, the tip of the `main` -branch of each pool is accessed. +The commit argument may be abbreviated by appending to the pool name +an `@` character followed by the commitish, e.g., +``` +from Pool (commit 36AwHUt9s8usF7pi9x3l6LOl8IB) +``` +may instead be written as +``` +from Pool@36AwHUt9s8usF7pi9x3l6LOl8IB +``` +When a single pool name is specified without a `commit` option, or +when using a regular expression, the tip of the `main` branch +of each pool is accessed. -The format argument is not valid with a database source. +It is an error to specify a format option when the entity is +a pool. >[!NOTE] > Metadata from database pools also may be sourced using `from`. > This will be documented in a future release of SuperDB. -#### URL +### URLs Data sources identified by URLs can be accessed either when attached or detached from a database. 
-When the `` argument begins with `http:` or `https:` -and has the form of a valid URL, then the source is fetched remotely using the -indicated protocol. +As a [text entity](../queries.md#text-entity), typical URLs need not be quoted though URLs with special characters must be quoted. -As a [text entity](../queries.md#text-entity), typical URLs need not be quoted -though URLs with special characters must be quoted. +When the `` argument begins with `http:` or `https:` +and has the form of a valid URL, then the source is fetched remotely +using either HTTP or HTTPS. -A format argument may be appended to a URL reference. - -Other valid operator arguments control the body and headers of the HTTP request -that implement the data retrieval and include: -* method `` -* headers `` -* body `` +When the URL begins with `s3:` then data is fetched via +the Amazon S3 object service using the settings defined +by a [local configuration](../../dev/integrations/s3.md). +Named options for URL entities include `format`, `method`, `headers`, +and `body` as in +``` +from [ ( format method headers body ) ] +``` where - * `` is one of `GET`, `PUT`, `POST`, or `DELETE`, -* `` is a [record expression](../types/record.md) that defines the names and values -to be included as HTTP header options, and +* `` is a [record expression](../types/record.md) that defines the names and values to be included as HTTP header options, and * `` is a [text-entity](../queries.md#text-entity) string to be included as the body of the HTTP request. @@ -139,20 +191,13 @@ Each field of this record must either be a string or (to specify a header option appearing multiple times with different values) an array or set of strings. -#### Expression - -The `eval()` form of `from` provides a means to read data programmatically from -sources based on the `` argument to `eval`, which should return -a value of type [`string`](../types/string.md). 
-In this case, `from` reads values from its parent, applies `` to each -value, and interprets the string result as a target to be processed. +These options cannot be used with S3 URLs. -Each string value is interpreted as a from target and must be a file path -(when running detached from a database), a pool name (when attached to a database), -or a URL forming a sequence of targets which are read and output by the -`from` operator in the order encountered. +> [!NOTE] +> Currently, the headers expression must evaluate to a compile-time constant though +> this may change to allow run-time evaluation in a future version of SuperSQL. -#### Combining Data +### Combining Data To combine data from multiple sources using pipe operators, `from` may be used in combination with other operators like [`fork`](fork.md) and [`join`](join.md). @@ -182,49 +227,70 @@ from PoolOne | op1 | op2 | ... | ... ``` -### File Examples +## File Examples --- -_Source structured data from a local file_ +_Source structured data from a JSON file_ ```mdtest-command -echo '{greeting:"hello world!"}' > hello.sup -super -s -c 'from hello.sup | values greeting' +echo '{"greeting":"hello world!"}' > hello.json +super -s -c 'from hello.json | values greeting' ``` -=> ```mdtest-output "hello world!" 
``` --- -_Source data from a local file, but in "line" format_ +_Source super-structured data from a local file_ ```mdtest-command -super -s -c 'from hello.sup (format line)' +echo '1 2 {x:1} {s:1::(int64|string)} {s:"hello"::(int64|string)}' > vals.sup +super -s -c 'from vals.sup' ``` -=> ```mdtest-output -"{greeting:\"hello world!\"}" +1 +2 +{x:1} +{s:1::(int64|string)} +{s:"hello"::(int64|string)} ``` -### HTTP Example +--- + +## HTTP Example --- _Source data from a URL_ ``` -super -s -c 'from https://raw.githubusercontent.com/brimdata/super/main/package.json - | values name' +super -s -c 'from https://api.github.com/repos/brimdata/super | values name' ``` -=> ``` "super" ``` --- -### Database Examples +## F-String Example + +_Read from dynamically defined files and add a column_ + +```mdtest-command +echo '{a:1}{a:2}' > a.sup +echo '{b:3}{b:4}' > b.sup +echo '"a.sup" "b.sup"' | super -s -c "from f'{this}' | c:=coalesce(a,b)+1" - +``` +```mdtest-output +{a:1,c:2} +{a:2,c:3} +{b:3,c:4} +{b:4,c:5} +``` + +--- + +## Database Examples The remaining examples below assume the existence of the SuperDB database created and populated by the following commands: @@ -264,7 +330,6 @@ _Source data from the `main` branch of a pool_ ```mdtest-command super db -db example -s -c 'from coinflips' ``` -=> ```mdtest-output {flip:2,result:"tails"} {flip:1,result:"heads"} @@ -276,7 +341,6 @@ _Source data from a specific branch of a pool_ ```mdtest-command super db -db example -s -c 'from coinflips@trial' ``` -=> ```mdtest-output {flip:3,result:"heads"} {flip:2,result:"tails"} @@ -289,7 +353,6 @@ _Count the number of values in the `main` branch of all pools_ ```mdtest-command super db -db example -s -c 'from * | count()' ``` -=> ```mdtest-output 5 ``` @@ -305,7 +368,6 @@ super db -db example -s -c ' | values {...left, word:right.word} | sort' ``` -=> ```mdtest-output {flip:1,result:"heads",word:"one"} {flip:2,result:"tails",word:"two"} @@ -326,7 +388,6 @@ super db -db example -s -c ' | 
values f"There were {c} flips" ) | sort this' ``` -=> ```mdtest-output "There were 3 flips" {flip:1,result:"heads",word:"one"} @@ -334,22 +395,3 @@ super db -db example -s -c ' ``` --- - -#### Expression Example - -_Read from dynamically defined files and add a column_ - -```mdtest-command -echo '{a:1}{a:2}' > a.sup -echo '{b:3}{b:4}' > b.sup -echo '"a.sup" "b.sup"' | super -s -c "from f'{this}' | c:=coalesce(a,b)+1" - -``` -=> -```mdtest-output -{a:1,c:2} -{a:2,c:3} -{b:3,c:4} -{b:4,c:5} -``` - ---- diff --git a/book/src/super-sql/queries.md b/book/src/super-sql/queries.md index 417beec5ac..320327bce6 100644 --- a/book/src/super-sql/queries.md +++ b/book/src/super-sql/queries.md @@ -8,7 +8,7 @@ The syntactical structure of a query consists of Any valid SQL query may appear as a pipe operator and thus be embedded in a pipe query. A SQL query expressed as a pipe operator is -called a [SQL operator](sql/intro.md). +called a [SQL operator](sql/intro.md#sql-operator). Operator sequences may be parenthesized and nested to form lexical [scopes](#scope). @@ -66,7 +66,7 @@ as a field reference `this.PI` via pipe scoping. ```mdtest-spq fails {data-layout='no-labels'} {style='margin:auto;width:85%'} # spq -( +( const PI=3.14 values PI ) @@ -96,7 +96,7 @@ character may be included in a backtick string with Unicode escape `\u0060`. In SQL expressions, identifiers may also be enclosed in double-quoted strings. The [special value](intro.md#pipe-scoping) `this` is also available in SQL but has -[peculiar semantics](sql/intro.md#accessing-this) +[peculiar semantics](sql/intro.md#this) due to SQL scoping rules. To reference a column called `this` in a SQL expression, simply use double quotes, i.e., `"this"`. 
@@ -148,7 +148,7 @@ regexp(r'\w+(foo|bar)', this) But when used outside of expressions where an explicit indication of a regular expression is required (e.g., in a [search](operators/search.md) or -[from](operators/from.md#database-operation) operator), the RE2 is instead +[from](operators/from.md#pools) operator), the RE2 is instead prefixed and suffixed with a `/`, e.g., ``` /foo|bar/ @@ -184,7 +184,7 @@ to the `from` and [load](operators/load.md) operators. Specifically, a text entity is one of: * a [string literal](types/string.md) (double quoted, single quoted, or raw string), * an unquoted string consisting of a sequence of characters consisting of letters, digits, `_`, `$`, `.`, and `/`, or -* a simple URL consisting of a sequence of characters beginning with `http://` or `https://`, followed by dotted strings of letters, digits, `-`, and `_`, and in turn optionally followed by `/` and a sequence of characters consisting of letters, digits, `_`, `$`, `.`, and `/`. +* a simple URL consisting of a sequence of characters beginning with `http://` , `https://`, or `s3://` followed by dotted strings of letters, digits, `-`, and `_`, and in turn optionally followed by `/` and a sequence of characters consisting of letters, digits, `_`, `$`, `.`, and `/`. If a URL does not meet the constraints of the simple URL rule, e.g., containing a `:` or `&`, then it must be quoted. 
diff --git a/book/src/super-sql/sql/filter.md b/book/src/super-sql/sql/filter.md deleted file mode 100644 index 9b23aabf05..0000000000 --- a/book/src/super-sql/sql/filter.md +++ /dev/null @@ -1 +0,0 @@ -# FILTER diff --git a/book/src/super-sql/sql/from.md b/book/src/super-sql/sql/from.md index 443ba264d6..58816e8649 100644 --- a/book/src/super-sql/sql/from.md +++ b/book/src/super-sql/sql/from.md @@ -1 +1,175 @@ # FROM + +The `FROM` clause of a [SELECT](select.md) has the form +``` +FROM +``` +where a `` represents data sources (like files, +API endpoints, or database pools), table subqueries, pipe queries, +or joins. + +## Table Expressions + +A table expression `` has one of the following forms: + +``` + [ ( ) ] [ ] + [ ] +( ) [ ] + +( ) +``` + +`` is defined as in the pipe form of [from](../operators/from.md), namely one of +* a [text entity](../queries.md#text-entity) representing a file, URL, or pool name, +* an [f-string](../expressions/f-strings.md) representing a file, URL, or pool name, +* a [glob](../queries.md#glob) matching files in the local file system or pool names in a database, or +* a [regular expression](../queries.md#regular-expression) matching pool names + +`` are the [entity options](../operators/from.md#options) + as in pipe `from`. + +`` is the name of a common-table expression (CTE) +defined in a [WITH](with.md) clause or a +[declared query](../declarations/queries.md). + +`` is any [query](../queries.md) inclusive of +[SQL operators](intro.md#sql-operator) +or [pipe operators](../operators/intro.md). + +`` is any [JOIN](join.md) operation, which is defined to +recursively operate upon any `` defined here. + +Any `` may be parenthesized to control precedence +and evaluation order. 
+ +## Table Aliases + +The table expressions above that represent data-source entities +and table subqueries may be bound to a table alias +with the optional `` clause of the form +``` +[ AS ] +``` +where the `AS` keyword is optional and `` has the form +``` + [ ( [ , ... ] ) ] +``` +`
` and `` are [identifiers](../queries.md#identifiers) +naming a table or a table and the columns of the indicated table, +and an optional parenthesized list of columns positionally specifies the +column names of that table. + +Joined expressions and parenthesized table expressions cannot be assigned +aliases as the [relational scope](intro.md#relational-scopes) +produced by such expressions is composed of their constituent table names +and columns. + +## Input Table + +A `FROM` clause is a component of [SELECT](select.md) that +identifies the query's input data to create the _input table_ +for `SELECT`. + +The input table is accessed via a namespace comprised of +table and column [references](intro.md#relational-bindings) +that may then appear in the various expressions appearing throughout +the query. + +This namespace is called a [relational scope](intro.md#relational-scopes) +and the `FROM` clause creates the [input scope](intro.md#input-scope) +for `SELECT`. + +The namespace consists of the table names and aliases (and their +constituent columns) created by the initial `FROM` clause and +any [JOIN](join.md) clauses that appear. Any tables that are defined +in table subqueries in the `FROM` clause are not part of the +input scope. + +>[!NOTE] +> The SQL `FROM` clause is similar to the pipe form of the +> [from](../operators/from.md) operator but +> * uses [relational scoping](../intro.md#relational-scoping) instead of +> [pipe scoping](../intro.md#pipe-scoping), +> * allows the binding of table aliases to relational data sources, and +> * can be combined with [JOIN](join.md) clauses to implement relational joins. 
+ +## File Examples + +--- + +_Source structured data from a local file_ + +```mdtest-command +echo '{"greeting":"hello world!"}' > hello.json +super -s -c 'SELECT greeting FROM hello.json' +``` +```mdtest-output +{greeting:"hello world!"} +``` + +--- + +_Translate some CSV into Parquet and query it_ +```mdtest-command +echo 'Name,Email,Phone Number,Address +John Doe,john.doe@example.com,123-555-1234,"123 Example Address, City, State" +Jane Smith,jane.smith@example.com,123-555-5678,"456 Another Lane, Town, State"' > example.csv +super -f parquet -o example.parquet example.csv +super -s -c 'SELECT collect("Phone Number") as numbers FROM example.parquet' +``` +```mdtest-output +{numbers:["123-555-1234","123-555-5678"]} +``` + +--- + +## HTTP Example + +--- + +_Source data from a URL_ +``` +super -s -c "SELECT name FROM https://api.github.com/repos/brimdata/super" +``` +``` +{name:"super"} +``` + +--- + +### F-String Example + +--- + +_Read from dynamically defined files and add a column_ + +```mdtest-command +echo '{a:1}{a:2}' > a.sup +echo '{b:3}{b:4}' > b.sup +echo '"a.sup" "b.sup"' | super -s -c " +SELECT this, coalesce(a,b)+1 AS c +FROM f'{this}' +" - +``` +```mdtest-output +{that:{a:1},c:2} +{that:{a:2},c:3} +{that:{b:3},c:4} +{that:{b:4},c:5} +``` + +--- + +## Database Examples + +--- + +>[!NOTE] +> The SuperDB database will soon support super-structured types, which +> are required for SQL compatibility. Currently, database queries +> should be done with the pipe form of the [from](../operators/from.md) +> operator. SQL examples utilizing a SuperDB database will be documented +> here in a future version of SuperDB. + +--- diff --git a/book/src/super-sql/sql/group-by.md b/book/src/super-sql/sql/group-by.md index 1a92f7b80c..03f9079405 100644 --- a/book/src/super-sql/sql/group-by.md +++ b/book/src/super-sql/sql/group-by.md @@ -1 +1,84 @@ # GROUP BY + +A `GROUP BY` clause has the form +``` +GROUP BY | [ , | ... 
] +``` +where `` is an [expression](../expressions/index.md) +and `` is an expression +that evaluates to a compile-time constant integer indicating a column +number of the projection. + +A GROUP BY clause is a component of [SELECT](select.md) that defines +the grouping logic for a [grouped projection](select.md#grouped-projection). + +The expressions cause the input table's rows to be placed in groups, +one group for each unique value of the set of expressions present. +The table and column references in the grouping expressions bind +to the [input scope](intro.md#input-scope) and +[column aliases](intro.md#column-aliases). + +When an `` is specified, the grouping expression is taken from the +projection's expressions with the leftmost column numbered 1 and so forth. + +## Examples + +--- + +_Compute an aggregate on x with grouping column y_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,1), (2,2), (3,2) +) +SELECT sum(x),y +FROM T +GROUP BY y +ORDER BY y +# input + +# expected output +{sum:1,y:1} +{sum:5,y:2} +``` + +--- + +_Grouped table without an aggregate function_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,1), (2,2), (3,2) +) +SELECT (x+y)/3 as key +FROM T +GROUP BY (x+y)/3 +ORDER BY key +# input + +# expected output +{key:0} +{key:1} +``` + +--- + +_Group using projection column ordinal_ + +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,1), (2,2), (3,2) +) +SELECT sum(x),y +FROM T +GROUP BY 2 +ORDER BY y +# input + +# expected output +{sum:1,y:1} +{sum:5,y:2} +``` + +--- diff --git a/book/src/super-sql/sql/having.md b/book/src/super-sql/sql/having.md index c022d51f75..74cbf63cd5 100644 --- a/book/src/super-sql/sql/having.md +++ b/book/src/super-sql/sql/having.md @@ -1 +1,67 @@ # HAVING + +A `HAVING` clause has the form +``` +HAVING +``` +where `` is a Boolean-valued [expression](../expressions/index.md). 
+ +A HAVING clause is a component of [SELECT](select.md) that is applied +to the query's grouped output removing each value from the input table +for which `` is false. + +The predicate cannot refer to the input scope except for expressions +whose components are grouping expressions or aggregate functions whose +arguments refer to the input scope. + +## Examples + +--- +_Simple aggregate without GROUP BY_ +```mdtest-spq +# spq +SELECT 'hello, world' as message +HAVING count()=1 +# input + +# expected output +{message:"hello, world"} +``` +--- + +_HAVING referencing the grouping expression_ + +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +) +SELECT min(y) +FROM T +GROUP BY (x+y)/7 +HAVING (x+y)/7=0 +# input + +# expected output +{min:2} +``` +--- + +_HAVING clause without a grouped output is an error_ +```mdtest-spq fails +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +) +SELECT x +FROM T +HAVING y >= 4 +# input + +# expected output +HAVING clause requires aggregation functions and/or a GROUP BY clause at line 6, column 8: +HAVING y >= 4 + ~~~~~~ +``` + +--- \ No newline at end of file diff --git a/book/src/super-sql/sql/intersect.md b/book/src/super-sql/sql/intersect.md deleted file mode 100644 index 5c29738da5..0000000000 --- a/book/src/super-sql/sql/intersect.md +++ /dev/null @@ -1 +0,0 @@ -# INTERSECT diff --git a/book/src/super-sql/sql/intro.md b/book/src/super-sql/sql/intro.md index 00b4c8f476..7c60a6af74 100644 --- a/book/src/super-sql/sql/intro.md +++ b/book/src/super-sql/sql/intro.md @@ -1,19 +1,321 @@ -## SQL Operators +# SQL -TODO: document all the SQL clauses +SuperSQL is backward compatible with SQL in that any SQL query +is a SuperSQL [pipe operator](../operators/intro.md). +A SQL query used as a pipe operator in SuperSQL is called a _SQL operator_. +SQL operators can also be used recursively inside of a SQL operation +as [defined below](#sql-body). 
-XXX explain here how a SELECT query is Pipe operator -XXX figure out how to document FROM query without select as this overlaps -with the pipe operator form of FROM +## SQL Operator -### Identifier Scope +A SQL operator is a query having the form of a `` defined as +``` +[ ] + +[ ] +[ ] +``` +where +* `` is an optional [WITH](with.md) clause containing one or more + comma-separated common table expressions (CTEs), +* `` is a recursively defined query structure as [defined below](#sql-body), +* `` is an optional list of one or more sort expressions + in an [ORDER BY](order-by.md) clause, and +* `` is a [LIMIT](limit.md) clause constraining the number of rows in the output. -TODO: seciton on scoping +A SQL operator produces relational data in the form of sets of records +and may appear in several contexts including: +* the top-level query, +* as a parenthesized data-source element of a [FROM](from.md) clause, +* as a data-source element of a [JOIN](join.md) clause embedded in a + [FROM](from.md) clause, +* as a [subquery](../expressions/subqueries.md) in expressions, and +* as operands in a [set operation](set-ops.md). -see [issue](https://github.com/brimdata/super/issues/5974) +The optional `` component creates named SQL queries that are available +to any [FROM](from.md) clause contained (directly or recursively) within the +``. -CTE scoping +The output of the `` may be optionally sorted by the +`` component and limited in length by the `` component. -### Input References +Note that all of the elements of a `` are optional except the +``. Thus, any form of a simple `` may appear +anywhere a `` may appear. -### Accessing `this` +> [!NOTE] +> The `WINDOW` clause is not yet available in SuperSQL. + +## SQL Body + +A `` component has one of the following forms: +* a [SELECT](select.md) clause, +* a [VALUES](values.md) clause, +* a [set operation](set-ops.md), or +* a parenthesized query of the form `( )`. 
+ +Thus, a `` must include either a `SELECT` or `VALUES` component +as its foundation, i.e., a `` at its core is either a `SELECT` or +`VALUES` query. +Then, this core query may be combined in optionally parenthesized +set operations involving other `` or `` +components. + +## Table Structure + +Relational tables in SuperSQL are modeled as a sequence of records with +a uniform type. With input in this form, standard SQL syntax may define +a table alias that references an input sequence so that the fields of the +record type then correspond to relational columns. + +When the record type of the input data is known, SuperSQL treats +it as a relational schema thereby enabling familiar SQL concepts like +static type checking and unqualified column resolution. + +However, SuperSQL also allows for non-record data as well as data +whose type is unknown at compile time (e.g., large JSON files that are not +parsed for their type information prior to compilation). +A table reference to input data whose type is unknown is called +a _dynamic table_. + +Dynamic tables pose a challenge to traditional SQL semantics because +the compiler cannot know the columns that comprise the input and thus +cannot resolve a column reference to a dynamic table. Also, static +type checking as in traditional SQL cannot be carried out. + +To remedy this, SuperSQL supports dynamic tables in SQL operators +but restricts how they may be used as described below. + +> [!NOTE] +> The restrictions on dynamic tables avoid a situation where the +> semantics of a query is dependent on whether the input type is known. +> If column bindings were to change when the input type goes from +> unknown to known, then the semantics of the query would change simply +> because type information happened to be known. +> The constraints on dynamic tables are imposed to avoid this pitfall. 
+ +When SQL operators encounter data that is not in table form, +errors typically arise, e.g., compile-time errors indicating a query +referencing a non-existent column or, in the case of a dynamic table, +runtime errors indicating `error("missing")`. + +> [!NOTE] +> When querying highly heterogeneous data (e.g., JSON events), +> it is typically preferable to use [pipe operators](../operators/intro.md) +> on arbitrary data instead of SQL queries on tables. + +## Relational Scopes + +Identifiers that appear in SQL expressions are resolved in accordance +with the relational model, typically referring to tables and columns by name. + +Each SQL pipe operator defines one or more relational namespaces +that are independent of other SQL pipe operators and do not span across +pipe operator boundaries. A set of columns (from one or more tables) +comprising such a namespace is called a _relational scope_. + +A [FROM](from.md) clause creates a relational scope defined by a +namespace comprising one or more table names each containing +one or more column names from the top-level tables that +appear in the `FROM` body. + +A [VALUES](values.md) clause creates a relational scope defined by +the default column names `c0`, `c1`, etc. and typically appears as +a table expression in a `FROM` clause with a table alias to rename +the columns. + +Table names and column names do not need to be unique but when non-unique +names cause ambiguous references, then errors are reported and the query +fails to compile. + +A particular column is referenced by name using the syntax +``` + +``` +or +``` +
. +``` +where `
` and `` are identifiers. +The first form is called an _unqualified column reference_ while the +second form is called a _qualified column reference_. + +>[!NOTE] +> The `.` operator here is overloaded as it may (1) indicate +> a column inside of a table or (2) +> [dereference a record value](../expressions/dot.md). + +A table referenced without a column qualifier, as in +``` +
+``` +is simply called a _table reference_. Table references within expressions +result in values that comprise the entire row of the table as a record. + +## Input Scope + +A relational scope defined by the optional [FROM](from.md) clause is called +an _input scope_. + +An input scope is composed of the tables and constituent columns +that `FROM` defines, +which may in turn contain [JOIN](join.md) clauses and additional tables +and columns. +Tables defined in subqueries embedded in the `FROM` clause +are not part of the input scope and thus not visible. + +For example, this query +``` +SELECT * +FROM (VALUES (1),(2)) T(x) +CROSS JOIN (VALUES (3,4),(5,6)) U(y,z) +``` +creates an input scope with columns `T.x`, `U.y`, and `U.z`. + +## Output Scope + +A relational scope defined by a SQL operator that is not a +[SELECT](select.md) operation (i.e., a set operation or a +[VALUES](values.md) clause) is called an _output scope_. + +An output scope does not have a table name and is an anonymous +scope for which only unqualified column references apply. + +When an output scope appears as a table subquery within a +[FROM](from.md) clause, the output scope may be named with +a [table alias](from.md#table-aliases) and becomes +part of the input scope for the `SELECT` operation in which it appears. + +## Relational Bindings + +While identifiers in SQL expressions typically resolve to columns in tables, +they may also refer to lexically-scoped +[declarations](../declarations/intro.md) for constants, named queries, +and so forth. These bindings have a precedence higher than relational +bindings so an identifier is first resolved via +[lexical binding](../expressions/intro.md#identifier-resolution). + +When an identifier does not resolve to a declaration in a lexical scope, +then it is resolved as a table or column reference from +the [input scope](#input-scope) or to a +[column alias](#column-aliases). 
+ +The relational identifiers are bound to a table, column, or input expression +as follows: +* when [alias-first resolution](#column-aliases) is in effect: if the + identifier matches a column alias, + then the identifier is substituted with the corresponding column's + input expression; +* if the identifier resolves as an + [unqualified reference](#unqualified-references) + without error, then the identifier binds to that column; +* if the identifier resolves as an unqualified reference with an ambiguous column + error, then the error is reported and the query fails to compile; +* when alias resolution is in effect (but column-first is not in effect): + if the identifier matches a column alias, + then the identifier is substituted with the corresponding column's + input expression; +* when the identifier is a candidate for a + [qualified reference](#qualified-references) + (i.e., the identifier is followed by a `.` and a second identifier): + * if the identifier pair resolves as a qualified reference without error, + then the pair binds to that column; + * if the identifier pair resolves as a qualified reference with an + ambiguous column error, then the error is reported and the query fails + to compile; +* when the identifier is not a candidate for a qualified reference: + * if the identifier resolves as a table reference without error, + then the identifier binds to that table; + * if the identifier resolves as a table reference with an ambiguous table + error, then the error is reported and the query fails to compile. + +If no such matches are found, then an error is reported indicating +a non-existent table or column reference and the query fails to compile. 
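
The column-then-table part of this binding order can be modeled in a few lines of Python. This is only an illustrative sketch, not SuperSQL's implementation: the `scope` dictionary, the `resolve` function, and the `ResolutionError` class are hypothetical names, and the model deliberately ignores column aliases, qualified references, lexical declarations, and dynamic tables.

```python
class ResolutionError(Exception):
    """Stands in for a compile-time error in this sketch."""


def resolve(identifier, scope):
    """Resolve a bare identifier against an input scope.

    `scope` maps each table name to the list of its column names
    (a hypothetical representation chosen for this example).
    """
    # Column check comes first: find every table with a matching column.
    owners = [table for table, cols in scope.items() if identifier in cols]
    if len(owners) == 1:
        return ("column", owners[0], identifier)
    if len(owners) > 1:
        raise ResolutionError(f"ambiguous column reference: {identifier}")
    # No column matched, so fall back to a table reference.
    if identifier in scope:
        return ("table", identifier)
    raise ResolutionError(f"non-existent table or column: {identifier}")


scope = {"T": ["x", "y"], "U": ["y", "z"]}
print(resolve("x", scope))  # ('column', 'T', 'x')
print(resolve("T", scope))  # ('table', 'T')
```

In this model, `resolve("y", scope)` raises an error because both `T` and `U` contain a column `y`, mirroring the ambiguous-column case in the list above.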
+ +### Unqualified References + +An unqualified reference of the form `` is resolved by +searching the input scope over all columns where the identifier +and `` match: +* if there is exactly one match, then the identifier binds to that column; +* if there is more than one match, then an error is reported indicating + an ambiguous column reference and the query fails to compile; +* if there is no match, then the resolution fails without error. + +When there are multiple tables in scope and at least one of the tables +is dynamic, then unqualified references are not allowed. In this case, +an error is reported and the query fails to compile. + +### Qualified References + +A qualified reference of the form `
. ` is +resolved by searching the input scope over all tables that match +`
` and where the table contains a column matching ``: +* if there is exactly one match, then the identifier binds to that column; +* if there is more than one match, then an error is reported indicating + an ambiguous column reference and the query fails to compile; +* if there is no match, then the resolution fails without error. + +For dynamic tables, all qualified references bind to any column name and +runtime errors are generated when referencing columns that do not exist +in the dynamic table. + +### Table References + +A table reference of the form `
` is resolved by searching +the input scope over all tables that match `
`: +* if there is exactly one match, then the identifier binds to that table; +* if there is more than one match, then an error is reported indicating + an ambiguous table reference and the query fails to compile; +* if there is no match, then the resolution fails without error. + +Table references for dynamic tables are not allowed. In this case, +an error is reported and the query fails to compile. + +### Column Aliases + +Depending on the particular clause, column aliases +may be referenced in expressions. + +For expressions in `GROUP BY` and `ORDER BY` clauses +(and when pragma `pg` is false): +* if the identifier matches a column alias, then the corresponding expression + is substituted in place of the identifier, +* otherwise, the identifier is resolved to a table or column reference from + the input scope as described [previously](#relational-bindings). + +When pragma `pg` is true, the column alias check is performed _after_ the +input scope check, but only for `GROUP BY` expressions, which is +in line with PostgreSQL semantics. + +For expressions in an `ORDER BY` clause, the column alias check +is always performed _before_ the input scope check independent of +the pragma `pg` setting. + +For expressions in a `WHERE` clause, the identifier is resolved using +only the input scope, i.e., column aliases are not allowed. + +## `this` + +When querying dynamic tables, the `*` selector for all columns is not +available as the input columns are unknown. Also, the input is not guaranteed +to be relational. + +To remedy this, SuperSQL allows `this` to be referenced in expressions, +which resolves to the input row for `WHERE`, `GROUP BY` and the projected +expressions and resolves to the output row for `HAVING` and `ORDER BY`. 
+ +For example, the `SELECT` statement here, places `this` into a +first [column called that](../types/record.md#derived-field-names) +thereby producing a relational output: +```mdtest-spq {data-layout='no-labels&stacked'} +# spq +values 1, 'foo', {x:1} +| SELECT this +# input + +# expected output +{that:1} +{that:"foo"} +{that:{x:1}} +``` diff --git a/book/src/super-sql/sql/join.md b/book/src/super-sql/sql/join.md index a028935919..c58cf2dc3f 100644 --- a/book/src/super-sql/sql/join.md +++ b/book/src/super-sql/sql/join.md @@ -1 +1,190 @@ # JOIN + +A `JOIN` operation performs a relational join. + +Joins are _conditional_ when they have the form +``` + JOIN +``` +and are _non-conditional_ when having the form +``` + +``` +where +* `` is a [table expression](from.md#table-expressions) + as defined in the [FROM](from.md) clause, +* `` indicates the flavor of join as + [described below](#join-types), +* `` is either a comma (`,`) or the keywords `CROSS JOIN`, and +* `` is the join condition in one of two forms: + * `ON ` where `` is a Boolean-valued + [expression](../expressions/intro.md), or + * `USING ( , [ , ... ] )` where `` is an identifier indicating + the one or more columns. + +The `` on the left is called the _left table_ while the other +`` is the _right table_. The two tables form a +[relational scope](intro.md#relational-scopes) called the _join scope_ +consisting of the tables and columns from both tables. + +Join operations are left associative and all of the join types have +equal precedence. + +## Cross Join + +A non-conditional join forms its output by combining each row in the +left table with all of the rows in the right table forming a cross product +between the two tables. The order of the output rows is undefined. + +## Conditional Join + +Conditional joins logically form a cross join then filter the joined table +using the indicated join condition. + +The join condition may be an `ON` clause or a `USING` clause. 
+ +The `ON ` clause applies the `` to each combined row. +Table and column references within the `` expression +are resolved using the [relational scope](intro.md#relational-scopes) +created by the left and right tables. + +The `USING [ , ... ]` presumes each column is present in both +tables and applies an equality predicate for the indicated columns: +``` +. = . AND .=. +``` +where `` and `` are the names of the left and right tables. + +### Join Types + +For the `ON ` condition, the `` is evaluated for +every row in the cross product and rows are included or excluded based +on the predicate's result as well as the ``, which must be +one of: +* `LEFT [ OUTER ]` - produces an `INNER` join plus all rows in the left table + not present in the inner join +* `RIGHT [ OUTER ]` - produces an `INNER` join plus all rows in the right table + not present in the inner join +* `INNER` - produces the rows from the cross join that match the join condition, +* `ANTI` - produces the rows from the left table that are not in the inner join. + +If no `` is present, then an `INNER` join is presumed. + +>[!NOTE] +> `FULL OUTER JOIN` is not yet supported by SuperSQL. Also, note that +> `ANTI` is a left anti-join and there is no support for a right anti-join. 
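
The join types listed above can be modeled as a filter over the cross product. The following Python sketch is an illustration under stated assumptions rather than a description of how SuperSQL executes joins: the `join` function and its `kind` parameter are hypothetical, rows are plain dictionaries, and unmatched rows in a left join simply omit the right-hand columns (SuperSQL instead reports them as `error("missing")`).

```python
from itertools import product


def join(left, right, on, kind="inner"):
    """Model INNER, LEFT, and ANTI joins as a filtered cross product."""
    # Pair each left row index with the merged rows satisfying the predicate.
    matches = [(i, {**l, **r})
               for (i, l), r in product(enumerate(left), right)
               if on(l, r)]
    matched = {i for i, _ in matches}
    inner = [row for _, row in matches]
    unmatched = [l for i, l in enumerate(left) if i not in matched]
    if kind == "inner":
        return inner
    if kind == "left":
        # Inner join plus the left rows that found no partner.
        return inner + unmatched
    if kind == "anti":
        # Only the left rows that are absent from the inner join.
        return unmatched
    raise ValueError(f"unsupported join type: {kind}")


def x_equals_z(l, r):
    return l["x"] == r["z"]


T = [{"x": 1, "y": 2}, {"x": 3, "y": 4}, {"x": 5, "y": 6}]
U = [{"z": 2}, {"z": 3}]
print(join(T, U, x_equals_z))          # [{'x': 3, 'y': 4, 'z': 3}]
print(join(T, U, x_equals_z, "anti"))  # [{'x': 1, 'y': 2}, {'x': 5, 'y': 6}]
```

A right join would follow by swapping the roles of the two tables; it is left out here to keep the sketch short.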
+ +## Examples + +--- + +_Inner join_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +), +U(z) AS ( + VALUES (2), (3) +) +SELECT * +FROM T +JOIN U ON x=z +# input + +# expected output +{x:3,y:4,z:3} +``` + +--- + +_Left outer join_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +), +U(z) AS ( + VALUES (2), (3) +) +SELECT * +FROM T +LEFT JOIN U ON x=z +ORDER BY x +# input + +# expected output +{x:1,y:2,z:error("missing")} +{x:3,y:4,z:3} +{x:5,y:6,z:error("missing")} +``` + +--- + +_Right outer join_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +), +U(z) AS ( + VALUES (2), (3) +) +SELECT * +FROM T +RIGHT JOIN U ON x=z +ORDER BY x +# input + +# expected output +{x:3,y:4,z:3} +{x:error("missing"),y:error("missing"),z:2} +``` + +--- + +_Cross join_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +), +U(z) AS ( + VALUES (2), (3) +) +SELECT * +FROM T +CROSS JOIN U +ORDER BY z,y +# input + +# expected output +{x:1,y:2,z:2} +{x:3,y:4,z:2} +{x:5,y:6,z:2} +{x:1,y:2,z:3} +{x:3,y:4,z:3} +{x:5,y:6,z:3} +``` + +--- + +_Inner join with USING condition_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +), +W(y) AS ( + VALUES (2), (3) +) +SELECT * +FROM T +JOIN W USING (y) +# input + +# expected output +{y:2,x:1} +``` + +--- diff --git a/book/src/super-sql/sql/limit.md b/book/src/super-sql/sql/limit.md index 4670e57b1f..50550e30bf 100644 --- a/book/src/super-sql/sql/limit.md +++ b/book/src/super-sql/sql/limit.md @@ -1 +1,65 @@ # LIMIT + +A `LIMIT` clause has the form +``` +LIMIT [ OFFSET ] +``` +or +``` +OFFSET [ LIMIT ] +``` +where `` and `` are numeric [expressions](../expressions/index.md) +that evaluate to compile time constants. + +A `LIMIT` or `OFFSET` clause may appear after an `ORDER BY` clause or after +any [SQL operator](intro.md#sql-operator). + +`LIMIT` may precede `OFFSET` or vice versa and the order is not significant. 
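
As a rough mental model, and assuming rows are skipped before the cap is applied regardless of which keyword is written first, the combined effect of the two clauses resembles list slicing. The `limit_offset` helper below is a hypothetical Python illustration, not part of SuperSQL.

```python
def limit_offset(rows, limit=None, offset=0):
    """Skip `offset` rows, then cap the remainder at `limit` rows."""
    rest = rows[offset:]
    return rest if limit is None else rest[:limit]


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
print(limit_offset(rows, limit=2))            # [{'x': 1}, {'x': 2}]
print(limit_offset(rows, limit=2, offset=1))  # [{'x': 2}, {'x': 3}]
```

Passing the two values as keyword arguments reflects that their written order does not change the result.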
+ +`LIMIT` modifies the output of the preceding SQL operator by capping the number +of rows produced at ``. If the `OFFSET` clause is present, +then the first `` rows are ignored and the subsequent rows are produced +capping the output to `` rows. + +## Examples + +--- + +_Reduce table from three rows to two_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,1), (2,2), (3,2) +) +SELECT x +FROM T +ORDER BY x +LIMIT 2 +# input + +# expected output +{x:1} +{x:2} +``` + +--- + +_Reduce table from three rows to two skipping the first row_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,1), (2,2), (3,2) +) +SELECT x +FROM T +ORDER BY x +OFFSET 1 +LIMIT 2 +# input + +# expected output +{x:2} +{x:3} +``` + +--- diff --git a/book/src/super-sql/sql/order-by.md b/book/src/super-sql/sql/order-by.md new file mode 100644 index 0000000000..1f02dd3dd9 --- /dev/null +++ b/book/src/super-sql/sql/order-by.md @@ -0,0 +1,90 @@ +# ORDER BY + +An `ORDER BY` clause has the form +``` +ORDER BY [ , ... ] +``` +where each `` has the form +``` + | [ ASC | DESC] [ NULLS FIRST | NULLS LAST ] +``` +`` is an [expression](../expressions/index.md) indicating +the sort key of the resulting order and `` is an expression +that evaluates to a compile-time constant integer indicating a column +number of the sorted table. + +An `ORDER BY` clause may appear after any [SQL operator](intro.md#sql-operator) +and modifies the output of the preceding SQL operator by ordering the rows +by the value of `` or according to the column indicated by ``, +which is 1-based. + +The `ASC` keyword indicates an ascending sort order while `DESC` indicates +descending. If neither `ASC` nor `DESC` is present, then `ASC` is presumed. + +The `NULLS FIRST` keyword indicates that null values should appear first in +the sort order; otherwise they appear last. If a `NULLS` clause is not +present, then `NULLS LAST` is presumed. 
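
The defaults described above (`ASC` and `NULLS LAST`) can be illustrated with a small Python model of a single-key sort. This is a sketch only: the `order_by` function is hypothetical, Python's `None` stands in for a null value, and the model assumes that the `NULLS LAST` default applies to descending sorts as well, which the text above states without qualification.

```python
def order_by(rows, key, desc=False, nulls_first=False):
    """Sort rows by one column, placing nulls last unless asked otherwise."""
    nulls = [r for r in rows if r[key] is None]
    values = sorted((r for r in rows if r[key] is not None),
                    key=lambda r: r[key], reverse=desc)
    return nulls + values if nulls_first else values + nulls


rows = [{"x": 2}, {"x": None}, {"x": 1}]
print(order_by(rows, "x"))                    # [{'x': 1}, {'x': 2}, {'x': None}]
print(order_by(rows, "x", desc=True))         # [{'x': 2}, {'x': 1}, {'x': None}]
print(order_by(rows, "x", nulls_first=True))  # [{'x': None}, {'x': 1}, {'x': 2}]
```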
+ +When the `ORDER BY` clause follows a [SELECT](select.md) operation, +the sort expressions are evaluated with respect to its +[input scope](intro.md#input-scope) and resolve identifiers to +[column aliases](intro.md#column-aliases) at a precedence higher +than the input scope. + +When the `ORDER BY` clause follows a SQL operator +that is not a SELECT operation, then sort expressions are evaluated +with respect to the [output scope](intro.md#output-scope) +created by that operator. + +## Examples + +--- + +_Sort on a column_ +```mdtest-spq +# spq +SELECT x +ORDER BY x DESC +# input +{x:1,y:2} +{x:2,y:2} +{x:3,y:1} +# expected output +{x:3} +{x:2} +{x:1} +``` + +--- + +_Sort on two columns_ +```mdtest-spq +# spq +SELECT x +ORDER BY y,x +# input +{x:1,y:2} +{x:2,y:2} +{x:3,y:1} +# expected output +{x:3} +{x:1} +{x:2} +``` + +--- + +_Sort on aggregate function_ +```mdtest-spq +# spq +SELECT y +GROUP BY y +ORDER BY min(x) +# input +{x:1,y:2} +{x:2,y:2} +{x:3,y:1} +# expected output +{y:2} +{y:1} +``` diff --git a/book/src/super-sql/sql/order.md b/book/src/super-sql/sql/order.md deleted file mode 100644 index 505c4a7248..0000000000 --- a/book/src/super-sql/sql/order.md +++ /dev/null @@ -1 +0,0 @@ -# ORDER diff --git a/book/src/super-sql/sql/select.md b/book/src/super-sql/sql/select.md index 3fcd1fa710..ccdbe968cb 100644 --- a/book/src/super-sql/sql/select.md +++ b/book/src/super-sql/sql/select.md @@ -1 +1,247 @@ # SELECT + +A `SELECT` query has the form +``` +SELECT [ DISTINCT | ALL ] | [ AS ] [ , | [ AS ]... ] +[ FROM [ , ... ] ] +[ WHERE ] +[ GROUP BY | [ , | ... ]] +[ HAVING ] +``` +where +* `` is an [expression](../expressions/intro.md), +* `` is a [column pattern](#column-patterns), +* `` is an [identifier](../queries.md#identifiers), +* `` is an input as defined in the [FROM](from.md) clause, +* `` is a [Boolean-valued](../types/bool.md) expression, and +* `` is a column number as defined in [GROUP BY](group-by.md). 
+ +The list of expressions following the `SELECT` keyword is called +the _projection_ and the column names derived from the `AS` clauses +are referred to as the [_column aliases_](intro.md#column-aliases). + +A `SELECT` query may be used as a building block in more complex queries as it +is a [&lt;sql-body&gt;](intro.md#sql-body) in the structure of a +[&lt;sql-op&gt;](intro.md#sql-operator). +Likewise, it may be +[prefixed by](intro.md#sql-operator) a [WITH](with.md) clause +defining one or more CTEs and/or +[followed by](intro.md#sql-operator) optional +[ORDER BY](order-by.md) and [LIMIT](limit.md) clauses. + +Since a `<sql-body>` is also a `<sql-op>` and any +`<sql-op>` is a [pipe operator](../operators/intro.md), +a `SELECT` query may be used anywhere a pipe operator may appear. + +> [!NOTE] +> Grouping sets are not yet available in SuperSQL. + +## Execution Steps + +A `SELECT` query performs its computation by +* forming an input table indicated by its [FROM](from.md) clause, +* optionally filtering the input table with its [WHERE](where.md) clause, +* optionally grouping rows into aggregates, one for each unique set of + values of the grouping expressions specified by the [GROUP BY](group-by.md) clause, or grouping the entire input into a single aggregate row when + there are [aggregate functions](../aggregates/intro.md) present, +* optionally filtering aggregated rows with its [HAVING](having.md) clause, and finally +* producing an output table based on the list of + [expressions](../expressions/intro.md) or [column patterns](#column-patterns) + in the `SELECT` clause. + +A `SELECT` query typically takes its input from one or more +tables specified in the [FROM](from.md) clause, but when the +`FROM` clause is omitted, the query takes its input from the +parent pipe operator. +If there is no parent operator and `FROM` is omitted, then the +default input is a single `null` value. 
+ +A `FROM` clause may also take input from its parent when using +an [f-string](../expressions/f-strings.md) as its input table. +In this case, the input table is dynamically typed. + +## Column Patterns + +A column pattern, as indicated by `` above, +uses the `*` notation to match multiple columns +from the input table. In its standalone form, it matches all columns +in the input table, e.g., +``` +SELECT * FROM table1 CROSS JOIN table2 +``` +matches all columns from `table1` and `table2`. + +A column pattern may be prefixed with a table name, e.g., `table.*`, as in +``` +SELECT table2.* FROM table1 CROSS JOIN table2 +``` +which matches only the columns from the specified table. + +## The Projection + +The output of the `SELECT` query, called the projection, +is a set of rows formed from the list of expressions following +the `SELECT` keyword where each row is represented +by a [record](../types/record.md). +The record fields correspond to the columns of the table +and the field names and positions are fixed over the entire +result set. The type of a column may vary from row to row when the +`SELECT` expressions produce values of varying types. + +The names of the columns are specified by each `AS` clause. When the +`AS` clause is absent, the column name is +[derived](../types/record.md#derived-field-names) +from the expression in the same way field names are derived from +expressions in record expressions. + +>[!NOTE] Column names currently must be unique as the underlying record +> type requires distinct field names. Names are automatically deduplicated +> when there are conflicts. SuperSQL will support duplicate +> column names in a future release. + +The projection may be [grouped](#grouped-projection) +or [non-grouped](#non-grouped-projection). 
+ +### Grouped Projection + +A grouped projection occurs when either or both of the following hold: +* there is a [GROUP BY](group-by.md) clause, or +* there is at least one reference to an + [aggregate function](../aggregates/intro.md) in the projection, + in a `HAVING` clause, or in an `ORDER BY` clause. + +In a grouped projection, the `HAVING` clause, `ORDER BY` clause, and +the projection may refer only to inputs that are aggregate functions +(where the function arguments are bound to the input scope and column +aliases) or to expressions or combinations of expressions that appear +in the `GROUP BY` clause. + +Aggregate functions may be organized into expressions as any +other function but they may not appear anywhere inside an +argument to another aggregate function. + +There is one output row for each unique set of values of the +grouping expressions and the arguments for each instance of +each aggregate function are evaluated over the grouped set of values +optionally filtered with an aggregate function `FILTER` clause. + +### Non-grouped Projection + +A non-grouped projection occurs when there are no references to +aggregate functions and there is no `GROUP BY` clause. In this case, +there cannot be a `HAVING` clause. + +The projection formed here consists of the `SELECT` expressions +evaluated once for each row from the input table that is not +filtered out by the `WHERE` clause. 
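
The contrast between the two kinds of projection can be sketched in Python: a non-grouped projection yields one output row per input row, while a grouped projection yields one output row per distinct grouping value. The helper names `project` and `grouped_sum` are hypothetical and the code is only a model of the behavior described above, limited to a single grouping column and a single `sum` aggregate.

```python
def project(rows, columns):
    """Non-grouped projection: one output row for every input row."""
    return [{c: row[c] for c in columns} for row in rows]


def grouped_sum(rows, group_key, sum_key):
    """Grouped projection: one output row per distinct grouping value."""
    totals = {}
    for row in rows:
        k = row[group_key]
        totals[k] = totals.get(k, 0) + row[sum_key]
    return [{"sum": total, group_key: k} for k, total in totals.items()]


T = [{"x": 1, "y": 1}, {"x": 2, "y": 2}, {"x": 3, "y": 2}]
print(project(T, ["y"]))         # [{'y': 1}, {'y': 2}, {'y': 2}]
print(grouped_sum(T, "y", "x"))  # [{'sum': 1, 'y': 1}, {'sum': 5, 'y': 2}]
```

The second call corresponds to `SELECT sum(x),y FROM T GROUP BY y` over the same three rows.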
+ +## Examples + +--- + +_Hello world_ +```mdtest-spq +# spq +SELECT 'hello, world' AS message +# input + +# expected output +{message:"hello, world"} +``` + +--- + +_Reference to `this` to see default input is null_ + +```mdtest-spq +# spq +SELECT this +# input + +# expected output +{that:null} +``` + +--- + +_Mix alias and inferred column names_ + +```mdtest-spq +# spq +SELECT upper(s), upper(s[0:1])||s[1:] AS mixed +# input +{s:"foo"} +{s:"bar"} +# expected output +{upper:"FOO",mixed:"Foo"} +{upper:"BAR",mixed:"Bar"} +``` + +--- + +_Column names (currently) must be unique and are deduplicated_ + +```mdtest-spq +# spq +SELECT s, s +# input +{s:"foo"} +{s:"bar"} +# expected output +{s:"foo",s_1:"foo"} +{s:"bar",s_1:"bar"} +``` + +--- + +_Distinct values sorted_ + +```mdtest-spq +# spq +SELECT DISTINCT s ORDER BY s +# input +{s:"foo"} +{s:"bar"} +{s:"foo"} +# expected output +{s:"bar"} +{s:"foo"} +``` + +--- + +_Select entire rows as records using a table reference_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,1), (2,2), (3,2) +) +SELECT T +FROM T +# input + +# expected output +{T:{x:1,y:1}} +{T:{x:2,y:2}} +{T:{x:3,y:2}} +``` + +--- + +_Select entire rows as records using `this`_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,1), (2,2), (3,2) +) +SELECT this as table +FROM T +# input + +# expected output +{table:{x:1,y:1}} +{table:{x:2,y:2}} +{table:{x:3,y:2}} +``` + +--- diff --git a/book/src/super-sql/sql/set-ops.md b/book/src/super-sql/sql/set-ops.md new file mode 100644 index 0000000000..386a69d59b --- /dev/null +++ b/book/src/super-sql/sql/set-ops.md @@ -0,0 +1,192 @@ +# Set Operators + +Set operators combine two input tables produced by any +[SQL operator](intro.md#sql-operator) +using set union, set intersection, and set subtraction. + +A set operation has the form +``` + UNION [ALL | DISTINCT] + INTERSECT [ALL | DISTINCT] + EXCLUDE [ALL | DISTINCT] +``` +where `` is any [SQL operator](intro.md#sql-operator). 
+ +The set operators all have equal precedence and associate left to right. +Parentheses may be used to override the default left-to-right +evaluation order. + +The table produced by the first `` is called the _left table_ and +the table produced by the other `` is called the _right table_. + +>[!NOTE] +> Only the `UNION` set operator is currently supported. +> The `INTERSECT` and `EXCLUDE` operators will be available in +> a future version of SuperSQL. + +## UNION + +The `UNION` operation performs a relational set union between the left and +right tables. + +The number of columns in the two tables must be the same but the column +names need not match. The output table inherits the column names of +the left table and the columns from the right table are merged into the +output based on column position, not by name. + +If the `ALL` keyword is present, then all rows from both tables are +included in the output. + +If the `DISTINCT` keyword is present, then only unique rows are included +in the output. + +If neither the `ALL` nor the `DISTINCT` keyword is present, then `DISTINCT` +is presumed. + +## Non-relational Data + +When processing mixed-type tables or non-table inputs, the effect +of union can be achieved by simply combining pipe queries using +[fork](../operators/fork.md). + +When it is desirable to have a homogeneous output for such data, +data can be fused into one type with the [fuse](../operators/fuse.md) operator, +which resembles the _union-by-name_ variation available in some SQL dialects.
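Returning to the left-to-right rule for set operators, the following sketch shows how parentheses change which rows survive. It assumes that a parenthesized set operation can be written as the right operand exactly as shown; the column name `x` is arbitrary.

```sql
-- Without parentheses this would group as (A UNION ALL B) UNION C, and
-- the final UNION (DISTINCT by default) would drop the duplicate row.
-- With parentheses, the DISTINCT union is applied to B and C first and
-- the UNION ALL then keeps every row from A.
SELECT 1 AS x
UNION ALL
(
  SELECT 1 AS x
  UNION
  SELECT 2 AS x
)
```

By the rules stated for `ALL` and `DISTINCT`, the parenthesized form should yield three rows (the value 1 twice and the value 2 once), whereas the unparenthesized form would yield two.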
+ +## Examples + +--- + +_Basic union where column name inherited from left table_ + +```mdtest-spq +# spq +SELECT 1 as x +UNION +SELECT 2 as y +ORDER BY x +# input + +# expected output +{x:1} +{x:2} +``` + +--- + +_UNION results are distinct by default_ + +```mdtest-spq +# spq +SELECT 1 as x +UNION +SELECT 2 as y +UNION +SELECT 2 as z +ORDER BY x +# input + +# expected output +{x:1} +{x:2} +``` + +--- + +_UNION ALL retains duplicate rows_ + +```mdtest-spq +# spq +SELECT 1 as x +UNION ALL +SELECT 2 as y +UNION ALL +SELECT 2 as z +ORDER BY x +# input + +# expected output +{x:1} +{x:2} +{x:2} +``` + +--- + +_Misaligned tables cause a compilation error_ + +```mdtest-spq fails +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +), +U(z) AS ( + VALUES (2), (3) +) +SELECT * FROM T +UNION ALL +SELECT * from U +# input + +# expected output +set operations can only be applied to sources with the same number of columns at line 7, column 1: +SELECT * FROM T +~~~~~~~~~~~~~~~ +``` + +--- + +_Pad a table to align columns_ + +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +), +U(z) AS ( + VALUES (2), (3) +) +SELECT * FROM T +UNION ALL +SELECT *, 0 from U +ORDER BY x,y +# input + +# expected output +{x:1,y:2} +{x:2,y:0} +{x:3,y:0} +{x:3,y:4} +{x:5,y:6} +``` +--- + +_Fuse data as an alternative to a SQL UNION_ + +```mdtest-spq +# spq +fork + ( + WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) + ) + SELECT * FROM T + ) + ( + WITH U(z) AS ( + VALUES (2), (3) + ) + SELECT * FROM U + ) +| sort x,z +| fuse +# input + +# expected output +{x:1,y:2,z:null::int64} +{x:3,y:4,z:null::int64} +{x:5,y:6,z:null::int64} +{x:null::int64,y:null::int64,z:2} +{x:null::int64,y:null::int64,z:3} +``` +--- diff --git a/book/src/super-sql/sql/union.md b/book/src/super-sql/sql/union.md deleted file mode 100644 index b00e41fd42..0000000000 --- a/book/src/super-sql/sql/union.md +++ /dev/null @@ -1 +0,0 @@ -# UNION diff --git a/book/src/super-sql/sql/values.md 
b/book/src/super-sql/sql/values.md index e71c219d2b..5e611d9b71 100644 --- a/book/src/super-sql/sql/values.md +++ b/book/src/super-sql/sql/values.md @@ -1 +1,73 @@ # VALUES + +A `VALUES` clause has the form +``` +VALUES [ , ... ] +``` +where each `` has the form +``` +( [ , ... ] ) +``` +and `` is an [expression](../expressions/index.md) +that must evaluate to a compile-time constant. + +>[!NOTE] +> SuperSQL currently requires that `VALUES` expressions be compile-time constants. +> A future version of SuperSQL will support correlated subqueries and lateral +> joins, at which time the expressions may refer to relational inputs. + +Each tuple in the `VALUES` clause forms a row and the collection of +tuples forms a table with an [output scope](intro.md#output-scope) +whose columns are named `c0`, `c1`, etc. + +There is no tuple type in SuperSQL. Instead, each tuple expression is +translated to a record (i.e., relational row) with column names +`c0`, `c1`, etc. + +As it produces an output scope, the result of `VALUES` does not have a +table name. Typically, a `VALUES` clause is used as a table subquery +in a [FROM](from.md) clause and assigned table and column names with a +[table alias](from.md#table-aliases).
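As a small sketch of the default naming, a `VALUES` clause with two-element tuples and no table alias should produce rows whose columns are named `c0` and `c1`. The literal values here are arbitrary.

```sql
-- Each tuple becomes one row; with no table alias the columns
-- take the default names c0 and c1.
VALUES (1, 'a'), (2, 'b')
```

Wrapping the same clause in a `FROM` subquery with a table alias such as `T(id,label)` (hypothetical names) would replace the default names with `id` and `label`.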
+ +## Examples + +--- + +_Simple `VALUES` operation_ +```mdtest-spq +# spq +VALUES ('hello, world') +# input + +# expected output +{c0:"hello, world"} +``` + +--- + +_As a table subquery_ +```mdtest-spq +# spq +SELECT * +FROM (VALUES ('hello, world'),('to be or not to be')) T(message) +# input + +# expected output +{message:"hello, world"} +{message:"to be or not to be"} +``` + +--- + +_Column variation filled in with missing values_ +```mdtest-spq +# spq +SELECT * FROM (VALUES (1,2),(3)) T(x,y) +# input + +# expected output +{x:1,y:2} +{x:3,y:error("missing")} +``` + +--- \ No newline at end of file diff --git a/book/src/super-sql/sql/where.md b/book/src/super-sql/sql/where.md index 7250e3b696..401844b582 100644 --- a/book/src/super-sql/sql/where.md +++ b/book/src/super-sql/sql/where.md @@ -1 +1,79 @@ # WHERE + +A `WHERE` clause has the form +``` +WHERE +``` +where `` is a Boolean-valued [expression](../expressions/index.md). + +A WHERE clause is a component of [SELECT](select.md) that is applied +to the query's [input](from.md) removing each value from the input table +for which `` is false. + +The predicate may not contain any [aggregate functions](../aggregates/intro.md). + +As in [PostgreSQL](https://www.postgresql.org/), +table and column references in the `WHERE` clause bind only to the +[input scope](intro.md#input-scope). 
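One consequence of binding only to the input scope is that a column alias defined in the `SELECT` list is not visible in `WHERE`. The sketch below uses a hypothetical table `T(x)` and a hypothetical alias `z` to illustrate this.

```sql
-- Hypothetical inline table used only for illustration.
WITH T(x) AS (
  VALUES (1), (2), (3)
)
SELECT x+1 AS z
FROM T
-- z belongs to the output scope, so "WHERE z > 2" would not bind to
-- the alias under the rule above; the input expression is repeated instead.
WHERE x+1 > 2
```

Repeating the expression keeps the predicate expressed entirely in terms of input columns, which is the same approach PostgreSQL requires.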
+ +## Examples + +--- + +_Filter on y while selecting x_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +) +SELECT x +FROM T +WHERE y >= 4 +# input + +# expected output +{x:3} +{x:5} +``` + +--- + +_A subquery in the WHERE clause_ +```mdtest-spq +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +), +U(z) AS ( + VALUES (2), (3) +) +SELECT x +FROM T +WHERE y >= (SELECT MAX(z) FROM U) +# input + +# expected output +{x:3} +{x:5} +``` + +--- + +_Cannot use aggregate functions in WHERE_ +```mdtest-spq fails +# spq +WITH T(x,y) AS ( + VALUES (1,2), (3,4), (5,6) +) +SELECT x +FROM T +WHERE MIN(y) = 1 +# input + +# expected output +aggregate function "min" called in non-aggregate context at line 6, column 7: +WHERE MIN(y) = 1 + ~~~~~~ +``` + +--- diff --git a/book/src/super-sql/sql/with.md b/book/src/super-sql/sql/with.md index 78d19968bf..444edc2890 100644 --- a/book/src/super-sql/sql/with.md +++ b/book/src/super-sql/sql/with.md @@ -1 +1,90 @@ # WITH + +A [WITH](with.md) clause may precede any +[SQL operator](intro.md#sql-operator) and has the form +``` +WITH AS ( + +) +[ , AS ( ) ... ] +``` +where +* `` is a table alias with optional columns as defined +in a [FROM](from.md#table-aliases) clause, and +* `` is any [SQL operator](intro.md#sql-operator). + +`WITH` defines one or more common-table expressions (CTE) +each of which binds a name to the query body defined in the CTE. + +A CTE is similar to a [query declaration](../declarations/queries.md) +but the CTE body must be a [SQL operator](intro.md#sql-operator) +and the CTE name can be used only with a [FROM](from.md) clause +and is not accessible in an expression. + +The table aliases form a lexical scope +which is available in any `FROM` clause defined within the SQL operator +that follows the `WITH` clause and any `FROM` clauses recursively +defined within that operator. Additionally, a CTE alias is available to +the other CTEs that follow in the same `WITH` clause. 
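Since the column list in a CTE's table alias is optional, a CTE may also take its column names from its body. A minimal sketch, with hypothetical names `T` and `x`:

```sql
-- No column list after T: the column name x comes from the SELECT
-- in the CTE body rather than from the table alias.
WITH T AS (
  SELECT 1 AS x
)
SELECT x FROM T
```

Here `T` is referenced only in a `FROM` clause, consistent with the restriction that a CTE name is not accessible in an expression.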
+ +>[!NOTE] +> SuperSQL will support recursive CTEs in a future version. + +## Examples + +--- + +_Hello world_ +```mdtest-spq +# spq +WITH hello(message) AS ( + VALUES ('hello, world') +) +SELECT * FROM hello +# input + +# expected output +{message:"hello, world"} +``` + +--- + +_A first CTE referenced in a second CTE_ +```mdtest-spq +# spq +WITH T(x) AS ( + VALUES (1), (2), (3) +), +U(y) AS ( + SELECT x+1 FROM T +) +SELECT * FROM U +# input + +# expected output +{y:2} +{y:3} +{y:4} +``` + +--- + +_A nested CTE reaching into its parent scope_ +```mdtest-spq +# spq +WITH T(x) AS ( + VALUES (1), (2), (3) +) +SELECT ( + WITH U(y) AS ( + SELECT x+1 FROM T + ) + SELECT max(y) FROM U + ) as max +# input + +# expected output +{max:4} +``` + +---