Description
There should be a way to embed languages generically, without having to account for every possible comment, string, or other construct of the embedded language that happens to break the unrelated container syntax, and without having to work around that by adding a lot of unrelated "hack" rules.
Several languages allow specifying arbitrary embedded languages (Markdown is an example), and having to account for every language pair is impractical when this could be solved generically. The syntax of an embedded language should be determined on a second "pass", without breaking the syntax of whatever delimits it in the parent language, while still letting the parent's escape sequences (e.g. in strings) take precedence over the embedded language.
I think this would be the ideal implementation for the best embedded language support.
Allow a subPatterns field (and an optional replacementPatterns field with it) that uses this second-pass logic. They would be mutually exclusive with patterns. This is how it could work when subPatterns is present:
- The start..end|while rule is matched first, without considering any sub-patterns or replacement patterns. Let's say the text content between them is all stored into an innerText variable.
- Then apply replacement patterns if they exist. They are basically the same as patterns, except they use match and a replaceWith field to specify substitution within innerText. Place the result into a subCode variable, so the sub-patterns later operate on the substituted text. For example, if a &lt; to < substitution occurs, then sub-patterns operate on this new text. This allows you to replace escaping syntax from the parent language before the sub-pattern that includes the embedded language.
  - The replaceWith field can have back-references from its match groups. Those can be the literal group text, or the unicode char from the hex or decimal number from the group (for generic unicode escape sequences).
- Then apply subPatterns to just the subCode text atomically, on an inner/sub pass.
- For any regions of innerText that had replacements, apply the replacement scope name on top of whatever scopes come from the sub-patterns. This way you can inter-mix escaping syntax of both languages.
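The replacement pass described above could be sketched roughly as follows, in Python. This is only an illustration of the idea, not a proposed implementation: the names `Rule`, `Span` and `apply_replacements` are hypothetical, the `replace` callable stands in for the proposed replaceWith field, and Python's `re` is used in place of the Oniguruma engine that TextMate grammars actually run on. Each span records both the range in innerText and the range in subCode, so the replacement scope can later be layered on top of whatever the sub-patterns produce.

```python
import re
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    pattern: re.Pattern
    replace: Callable[[re.Match], str]  # stand-in for the proposed replaceWith
    scope: str                          # the rule's "name"


class Span(NamedTuple):
    src_start: int  # range in inner_text that was replaced
    src_end: int
    dst_start: int  # range of the substituted text in sub_code
    dst_end: int
    scope: str


def apply_replacements(inner_text: str, rules: list[Rule]) -> tuple[str, list[Span]]:
    """Scan left to right; at each position the first matching rule wins."""
    out: list[str] = []
    spans: list[Span] = []
    pos = 0      # read position in inner_text
    out_len = 0  # length of sub_code built so far
    while pos < len(inner_text):
        for rule in rules:
            m = rule.pattern.match(inner_text, pos)
            if m and m.end() > pos:  # ignore empty matches
                text = rule.replace(m)
                spans.append(Span(pos, m.end(), out_len, out_len + len(text), rule.scope))
                out.append(text)
                out_len += len(text)
                pos = m.end()
                break
        else:
            # No rule matched here: copy the character through unchanged.
            out.append(inner_text[pos])
            out_len += 1
            pos += 1
    return "".join(out), spans


# A backslash-escaped backtick or backslash is replaced by the bare character.
escape = Rule(re.compile(r"\\([`\\])"), lambda m: m.group(1),
              "constant.character.escape.my-lang")

sub_code, spans = apply_replacements(r'"\`"', [escape])
print(sub_code)  # "`"
print(spans[0])  # covers inner_text[1:3] and sub_code[1:2]
```

The sub-patterns would then tokenize `sub_code` on their own, and each `Span` tells the tokenizer where to add the escape scope and how to map positions back to the original text.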
Additionally, allow parent back-references in the "include" names, so you can add any arbitrary language ids.
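As a rough illustration of what back-references in "include" names might mean, the sketch below substitutes `$N` in an include string with group N of the parent begin match. The function name `resolve_include` is hypothetical, and Python's `re` stands in for the grammar's regex engine.

```python
import re


def resolve_include(include: str, begin_match: re.Match) -> str:
    """Replace each $N in the include name with group N of the parent match."""
    def group_text(ref: re.Match) -> str:
        # A group that did not participate in the match expands to nothing.
        return begin_match.group(int(ref.group(1))) or ""
    return re.sub(r"\$(\d+)", group_text, include)


# Hypothetical parent rule: a language id followed by a backtick.
begin = re.compile(r"([\w-]+)`")
print(resolve_include("source.$1", begin.match("json`{}`")))  # source.json
```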
A theoretical example:
{
  "name": "string.quoted.embedded-code.$1.my-lang",
  "begin": "([\\w-]+)`", // group 1 is the language id
  "beginCaptures": {
    "1": { "name": "entity.other.language.my-lang" }
  },
  "end": "`",
  "contentName": "meta.embedded.block.$1 source.$1",
  "replacementPatterns": [
    // $1 would replace with the char in group 1 below literally
    { "match": "\\\\([`\\\\])", "replaceWith": "$1", "name": "constant.character.escape.my-lang" },
    // $h1 could replace with the unicode char from the hex number matched by group 1
    { "match": "\\\\u(\\h{4})", "replaceWith": "$h1", "name": "constant.character.escape.my-lang" },
    // $d1 same as above, but for decimal numbers
    { "match": "\\\\c\\[(\\d+)\\]", "replaceWith": "$d1", "name": "constant.character.escape.my-lang" },
  ],
  "subPatterns": [
    // "include" could allow back-references from the parent begin/match pattern
    // to support arbitrary languages
    { "include": "source.$1" }
  ]
}
This would let you include any arbitrary embedded language without having to know anything about its syntax, and you could even have escaping in the parent language be recognized and everything would just work.
Example code for this theoretical my-lang:
(all escapes are from my-lang, except backslash is escaped twice, for both languages)
json`
{
  "backtick": "\`",
  "backslash": "\\\\",
  "slash": "\u002F",
  "percent": "\c[37]"
}
`
These would not break the syntax in my-lang, as the inner code is isolated:
json`"`
js`//`
cpp`/*`
python`#`
csharp`(`
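The three replaceWith forms used in the example grammar ($1, $h1, $d1) could be expanded along these lines. This is a sketch under assumptions: `expand_replace_with` is a hypothetical helper, and because Python's `re` has no `\h` class, the hex pattern is spelled out as `[0-9A-Fa-f]`.

```python
import re

# $N -> literal group text, $hN -> char from hex group, $dN -> char from decimal group
_REF = re.compile(r"\$([hd]?)(\d+)")


def expand_replace_with(template: str, match: re.Match) -> str:
    """Expand the proposed replaceWith back-references against a rule match."""
    def expand(ref: re.Match) -> str:
        kind = ref.group(1)
        text = match.group(int(ref.group(2)))
        if kind == "h":
            return chr(int(text, 16))
        if kind == "d":
            return chr(int(text, 10))
        return text
    return _REF.sub(expand, template)


# The escapes from the my-lang example above:
hex_m = re.match(r"\\u([0-9A-Fa-f]{4})", r"\u002F")
dec_m = re.match(r"\\c\[(\d+)\]", r"\c[37]")
lit_m = re.match(r"\\([`\\])", "\\\\")

print(expand_replace_with("$h1", hex_m))  # /
print(expand_replace_with("$d1", dec_m))  # %
print(expand_replace_with("$1", lit_m))   # a single backslash
```

With these expansions, the JSON sub-pass in the example would see "\\" for the "backslash" value (my-lang halves the four backslashes), which JSON then reads as one escaped backslash.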