Support arbitrary embedded languages on an inner pass without breaking container syntax #243

There should be a way to embed languages generically, without having to account for every comment, string, or other construct of the embedded language that happens to break the unrelated container syntax, and without having to work around that by adding a pile of unrelated "hack" rules.

Several languages allow arbitrary embedded languages (Markdown is one example), and accounting for every language-pair combination is impractical when this could be solved generically. The syntax of an embedded language should be determined on a second "pass", without breaking the syntax of whatever delimits it in the parent language, while still allowing the parent's escape sequences (e.g. in strings) to take precedence over the embedded language.

I think the following would be the ideal implementation for embedded language support.
Allow a `subPatterns` field (with an optional `replacementPatterns` field alongside it) that uses this second-pass logic. Both would be mutually exclusive with `patterns`. This is how it could work when `subPatterns` is present:
- The `begin`..`end`|`while` rule is matched first, without considering any sub-patterns or replacement patterns. Say the text between the delimiters is stored in an `innerText` variable.
- Then the replacement patterns are applied, if they exist. They work like `patterns`, except that each one uses `match` plus a `replaceWith` field to specify a substitution within `innerText`. The result is placed in a `subCode` variable, which is what the sub-patterns later operate on. For example, if a `&lt;` to `<` substitution occurs, the sub-patterns see the new text. This lets you undo the parent language's escaping syntax before the sub-pattern that includes the embedded language runs.
    - The `replaceWith` field can contain back-references to its `match` groups. A back-reference can expand to the literal group text, or to the Unicode character whose code point is the hex or decimal number in the group (for generic Unicode escape sequences).
- Then `subPatterns` is applied to just the `subCode` text, atomically, on an inner/sub pass.
- For any regions of `innerText` that had replacements, the replacement's scope name is applied on top of whatever scopes come from the sub-patterns. This way the escaping syntax of both languages can be intermixed.
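
To make the replacement step concrete, here is a minimal sketch of how it could behave. It is not an existing API: `Replacement`, `expand`, and `apply_replacements` are hypothetical names, the regexes use Python `re` syntax rather than Oniguruma, and "first rule that matches at the current position wins" is an assumption. It produces `subCode` together with the source and result offsets of each replaced region, which is the information needed to layer the replacement scope on top of the sub-pattern scopes afterwards.

```python
import re
from dataclasses import dataclass


@dataclass
class Replacement:
    match: str         # regex (Python `re` syntax in this sketch, not Oniguruma)
    replace_with: str  # template: $N = literal group, $hN = hex, $dN = decimal
    name: str          # scope to layer over the replaced region


def expand(template, m):
    """Expand $N, $hN and $dN in `template` using the groups of match `m`."""
    def one(ref):
        kind, text = ref.group(1), m.group(int(ref.group(2)))
        if kind == "h":
            return chr(int(text, 16))
        if kind == "d":
            return chr(int(text))
        return text
    return re.sub(r"\$([hd]?)(\d+)", one, template)


def apply_replacements(inner_text, rules):
    """Return (sub_code, spans).

    Each span is ((src_start, src_end), (dst_start, dst_end), scope), where
    src offsets index inner_text and dst offsets index sub_code.
    """
    compiled = [(re.compile(rule.match), rule) for rule in rules]
    pieces, spans = [], []
    src = dst = 0
    while src < len(inner_text):
        for pattern, rule in compiled:
            m = pattern.match(inner_text, src)
            if m and m.end() > src:  # assumed: first rule that matches here wins
                text = expand(rule.replace_with, m)
                spans.append(((src, m.end()), (dst, dst + len(text)), rule.name))
                pieces.append(text)
                src, dst = m.end(), dst + len(text)
                break
        else:  # no rule matched: copy one character through unchanged
            pieces.append(inner_text[src])
            src += 1
            dst += 1
    return "".join(pieces), spans


rules = [Replacement(r"&lt;", "<", "constant.character.escape.my-lang")]
sub_code, spans = apply_replacements("if (a &lt; b) {}", rules)
print(sub_code)  # if (a < b) {}
print(spans)     # [((6, 10), (6, 7), 'constant.character.escape.my-lang')]
```

The sub-patterns would then tokenize `sub_code`, and the recorded spans would let the tokenizer map its tokens back onto `innerText` and add the escape scope over offsets 6 to 10.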

Additionally, allow back-references to the parent rule's captures in `include` names, so that any arbitrary language id can be included.

A theoretical example:
```jsonc
{
  "name": "string.quoted.embedded-code.$1.my-lang",
  "begin": "([\\w-]+)`", // group 1 is the language id
  "beginCaptures": {
    "1": { "name": "entity.other.language.my-lang" }
  },
  "end": "`",
  "contentName": "meta.embedded.block.$1 source.$1",
  "replacementPatterns": [
    // $1 would replace with the char in group 1 below literally
    { "match": "\\\\([`\\\\])", "replaceWith": "$1", "name": "constant.character.escape.my-lang" },
    // $h1 could replace with the unicode char from the hex number matched by group 1
    { "match": "\\\\u(\\h{4})", "replaceWith": "$h1", "name": "constant.character.escape.my-lang" },
    // $d1 same as above, but for decimal numbers
    { "match": "\\\\c\\[(\\d+)\\]", "replaceWith": "$d1", "name": "constant.character.escape.my-lang" }
  ],
  "subPatterns": [
    // "include" could allow back-references from the parent begin/match pattern
    // to support arbitrary languages
    { "include": "source.$1" }
  ]
}
```
This would let you include any arbitrary embedded language without knowing anything about its syntax, and escaping in the parent language would still be recognized.
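
The back-reference in `include` could be resolved roughly as follows. This is only a sketch of the idea: `resolve_include` is a hypothetical helper, and it uses Python's `re` module in place of the grammar engine's regex matcher.

```python
import re


def resolve_include(include, begin_pattern, line):
    """Substitute $N in an include name with capture N of the begin match.

    Returns None when the begin pattern does not match the line.
    """
    m = re.search(begin_pattern, line)
    if m is None:
        return None

    def one(ref):
        # A group that did not participate in the match expands to "".
        return m.group(int(ref.group(1))) or ""

    return re.sub(r"\$(\d+)", one, include)


print(resolve_include("source.$1", r"([\w-]+)`", "json`{}`"))  # source.json
```

The same substitution would apply to `name` and `contentName`, which already accept capture references in the example above.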

Example code for this theoretical my-lang
(all escapes are my-lang escapes, except that the backslash is escaped twice, once for each language):
```
json`
{
  "backtick": "\`",
  "backslash": "\\\\",
  "slash": "\u002F",
  "percent": "\c[37]"
}
`
```
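
As a sanity check on the sample above, the three replacement rules can be emulated with a single Python regex. This is only an illustration under stated assumptions: `ESCAPE` and `decode` are hypothetical names, `[0-9A-Fa-f]` stands in for Oniguruma's `\h`, and the text between the backticks is taken as given. After the substitutions, the remaining text should parse as ordinary JSON.

```python
import json
import re

# The text between the backticks in the sample (raw string: backslashes are literal).
inner_text = r'''
{
  "backtick": "\`",
  "backslash": "\\\\",
  "slash": "\u002F",
  "percent": "\c[37]"
}
'''

# One alternative per replacement rule: literal escape, hex escape, decimal escape.
ESCAPE = re.compile(r"\\(?:([`\\])|u([0-9A-Fa-f]{4})|c\[(\d+)\])")


def decode(m):
    literal, hex_digits, decimal = m.groups()
    if literal is not None:
        return literal                    # "$1"
    if hex_digits is not None:
        return chr(int(hex_digits, 16))   # "$h1"
    return chr(int(decimal))              # "$d1"


sub_code = ESCAPE.sub(decode, inner_text)
data = json.loads(sub_code)
print(data["slash"], data["percent"], data["backtick"])  # / % `
```

The four backslashes collapse to two after the my-lang pass, and JSON then reads those two as a single backslash, which is why that value needs escaping for both languages.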
These would not break the syntax in my-lang, as the inner code is isolated:
```
json`"`
js`//`
cpp`/*`
python`#`
csharp`(`
```
