[WIP] rustdoc: Add tree-sitter syntax highlighting for non-Rust code blocks #149944
Integrate arborium (tree-sitter based highlighting) to provide syntax highlighting for non-Rust code blocks in documentation. Previously, code blocks like ```python or ```javascript were rendered as plain text. Supported languages: bash, c, cpp, css, go, html, java, javascript, json, python, ruby, sql, toml, typescript, yaml. The highlighting uses custom HTML elements (a-k for keywords, a-s for strings, etc.) which are styled via CSS to match rustdoc's existing color scheme across all themes (light, dark, ayu).
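To make the output format concrete, here is a minimal sketch of wrapping already-classified tokens in the custom elements mentioned above. It is not arborium's API: `TokenKind`, `escape_html`, and `render_tokens` are hypothetical names for illustration, and only the two tags named in the description (`a-k`, `a-s`) are covered.

```rust
/// Token classes for this sketch (hypothetical; the real set is larger).
#[derive(Clone, Copy, Debug, PartialEq)]
enum TokenKind {
    Keyword,
    Str,
    Plain,
}

/// Escape the characters that are unsafe inside HTML text content.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Wrap each (kind, text) token in its custom element.
/// Plain tokens are escaped but not wrapped.
fn render_tokens(tokens: &[(TokenKind, &str)]) -> String {
    let mut html = String::new();
    for &(kind, text) in tokens {
        let escaped = escape_html(text);
        let tag = match kind {
            TokenKind::Keyword => Some("a-k"),
            TokenKind::Str => Some("a-s"),
            TokenKind::Plain => None,
        };
        match tag {
            Some(tag) => html.push_str(&format!("<{}>{}</{}>", tag, escaped, tag)),
            None => html.push_str(&escaped),
        }
    }
    html
}
```

Short element names keep the per-token overhead small, and the theme stylesheets can then target `a-k`, `a-s`, and so on directly.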
Gate the arborium-based syntax highlighting behind an unstable flag (-Z unstable-options --highlight-foreign-code) so it can be tested before becoming the default behavior. The flag is threaded through:
- RenderOptions in config.rs
- SharedContext in context.rs
- Markdown/MarkdownWithToc structs in markdown.rs
- CodeBlocks iterator for actual highlighting
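The threading described above can be sketched roughly as follows. The struct names mirror the ones listed in the commit message, but the field name `highlight_foreign_code`, the constructors, and the argument parsing are assumptions for illustration, not rustdoc's actual code. In particular, this sketch simply treats the flag as off when `-Z unstable-options` is missing.

```rust
/// Hypothetical stand-in for the parsed command-line options.
struct RenderOptions {
    highlight_foreign_code: bool,
}

/// Hypothetical stand-in for the state shared across a rendering session.
struct SharedContext {
    highlight_foreign_code: bool,
}

/// Hypothetical stand-in for a Markdown rendering request.
struct Markdown<'a> {
    content: &'a str,
    highlight_foreign_code: bool,
}

/// Enable the flag only when both the gate and the flag itself are passed.
fn parse_args(args: &[&str]) -> RenderOptions {
    let unstable = args
        .windows(2)
        .any(|w| w[0] == "-Z" && w[1] == "unstable-options");
    let requested = args.contains(&"--highlight-foreign-code");
    RenderOptions {
        highlight_foreign_code: unstable && requested,
    }
}

impl SharedContext {
    /// Copy the flag out of the options once, at context creation.
    fn from_options(opts: &RenderOptions) -> Self {
        SharedContext {
            highlight_foreign_code: opts.highlight_foreign_code,
        }
    }

    /// Every Markdown value created from the context inherits the flag,
    /// so the code-block iterator can consult it without global state.
    fn markdown<'a>(&self, content: &'a str) -> Markdown<'a> {
        Markdown {
            content,
            highlight_foreign_code: self.highlight_foreign_code,
        }
    }
}
```

Passing the flag down explicitly means the disabled path stays byte-for-byte identical to today's output, which is what the "highlighting disabled" test case checks.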
Test cases:
- Highlighting enabled: Python, JavaScript, JSON produce arborium tags
- Language aliases work (py -> python, js -> javascript)
- Highlighting disabled: no arborium tags produced
- Unsupported languages fall back to plain escaped text
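The behavior these test cases describe could look roughly like the sketch below. The function names are hypothetical, the alias table contains only the two aliases named above plus the 15 listed languages, and the real highlighter is replaced by a caller-supplied function so that no arborium API is assumed.

```rust
/// Map a code-fence tag to a canonical language name, or None if unsupported.
fn canonical_language(tag: &str) -> Option<&'static str> {
    let tag = tag.trim().to_ascii_lowercase();
    Some(match tag.as_str() {
        "py" | "python" => "python",
        "js" | "javascript" => "javascript",
        "bash" => "bash",
        "c" => "c",
        "cpp" => "cpp",
        "css" => "css",
        "go" => "go",
        "html" => "html",
        "java" => "java",
        "json" => "json",
        "ruby" => "ruby",
        "sql" => "sql",
        "toml" => "toml",
        "typescript" => "typescript",
        "yaml" => "yaml",
        _ => return None,
    })
}

/// Minimal HTML escaping for the plain-text fallback ('&' must go first).
fn escape_html(code: &str) -> String {
    code.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Highlight only when the flag is on and the language is supported;
/// otherwise fall back to plain escaped text.
fn render_code_block(
    tag: &str,
    code: &str,
    enabled: bool,
    highlight: impl Fn(&str, &str) -> String,
) -> String {
    match canonical_language(tag) {
        Some(lang) if enabled => highlight(lang, code),
        _ => escape_html(code),
    }
}
```

Keeping the fallback as the default arm means an unknown tag can never produce unescaped output.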
Cached crates, build artifacts, cloned repositories, etc. quickly add up to tens of gigabytes of storage on a developer's machine. From that viewpoint, whether the binary is 171 MiB or 22 MiB (figures quoted from the blog post) does not really matter that much. Of course, other reasons may be much more important.
Going only by the viewpoint above, all would be convenient. If it's not feasible, at least assembly would be nice to have.
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
I've personally never encountered code blocks other than Rust on docs.rs. I'm sure there are some crates that do include them, but I'm uncertain how valuable this addition would be, particularly if it negatively impacts binary size or other aspects. I'm curious: has there been any user research or data gathering on this specific need?
Surely that's not true - TOML or JSON at least. The downsides of this approach are documented in excruciating detail here: https://fasterthanli.me/articles/my-gift-to-the-rust-docs-team — there are two other approaches mentioned.
None as far as I'm aware! There's no better time than the present, though. I wanted to make sure we were covered implementation-wise (again, as described in the article). It would be fairly easy to run a Rust user survey. Collecting the quantity of non-Rust code blocks on docs.rs is also a fairly simple affair. Syntax highlighting could ship as a separate (optional) component. There are many options when one is willing to put in the work. (I don't necessarily think we should increase the size of the rustdoc binary from 22M to 171M. I wonder what that looks like compressed: the WASM versions of the grammars compress very well.) If there's one thing I can ask, it's to do your own research before commenting. Thanks!
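As a rough idea of what counting non-Rust code blocks could involve, here is a minimal sketch that tallies the language tag of each fenced block in a Markdown doc string. It is an assumption-laden illustration: `count_fence_languages` is a hypothetical helper, untagged blocks are counted as Rust (rustdoc's default), and rustdoc attributes such as `no_run` or `ignore` are not mapped back to Rust, which a real survey would need to handle.

```rust
use std::collections::BTreeMap;

/// Count fenced code blocks per language tag in one Markdown document.
fn count_fence_languages(markdown: &str) -> BTreeMap<String, usize> {
    // Built at runtime so this example can itself sit inside a fenced block.
    let fence = "`".repeat(3);
    let mut counts = BTreeMap::new();
    let mut in_fence = false;
    for line in markdown.lines() {
        let trimmed = line.trim_start();
        if let Some(info) = trimmed.strip_prefix(fence.as_str()) {
            if in_fence {
                // Closing fence: nothing to count.
                in_fence = false;
            } else {
                in_fence = true;
                // The first token of the info string is taken as the language.
                let lang = info
                    .split(|c: char| c == ',' || c.is_whitespace())
                    .next()
                    .unwrap_or("")
                    .to_ascii_lowercase();
                let lang = if lang.is_empty() { "rust".to_string() } else { lang };
                *counts.entry(lang).or_insert(0) += 1;
            }
        }
    }
    counts
}
```

Run over the doc comments of every published crate, the aggregated map would give a first-order answer to how common TOML, JSON, or shell blocks actually are.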
To clarify, I meant other programming languages
Just went through
💔 Test for d9d0c0a failed: CI. Failed jobs:
Given how well Rust works in the Web Dev world, I can see JavaScript, TypeScript, JSON, TOML, SQL, CSS, and HTML doc blocks being used frequently. Maybe this would go a step further and encourage better documentation across the board when integrating with those other languages. I mean no shade, I just really love good documentation. Anything that the Rust team can do to facilitate that would be extremely helpful and welcome.
Just want to mention that this is not a system library, it's innocent portable C11 code. I would swear on my life that it's not gonna cause any problem down the line, but I understand this isn't going to be enough. Can I get a glimpse of the future? Say suddenly arborium uses a Rust core (instead of tree-sitter-core), a Rust compiler that generates Rust parsers, and that all of the 96 grammars that have an external scanner.c or scanner.cc are ported to pure Rust: what would you look at next? What would the vetting process look like? @GuillaumeGomez
That would remove one blocker. Next would be to look into reducing the size of the code base overall (if possible). This feature would be very nice to have; however, it's not "core", so if it adds tens of thousands of lines for that, it might be a bit too much (when I talk about lines of code, I don't include each language "tree", these are mostly fine I guess).
I'm interested in where the line is (and maybe we should take this discussion elsewhere). A minimal Rust re-implementation of tree-sitter for docs.rs wouldn't need all the incremental parsing stuff, which constitutes the bulk of the complexity of TS (as I understand it). We could also look into using bytecode instead of DFA for size gains at the expense of speed. When you looked at the current state of arborium, which dependencies specifically looked worrying? Does it matter as much if it's shipped as a separate, optional rustup component that rustdoc knows how to call to over... stdio or something? Help me understand the constraints here.
Trying to describe the bigger picture (not an official rustdoc position, just my opinion here):

First thing to clarify: there isn't a clear line on most things. We check, and if a majority of the team agrees, we add it. There are a few exceptions when something stands out too much, like having a C dependency (immediate blocker) or having more than 5 dependencies (which are not optional and not already in the rustc tree). These would require a debate to be accepted.

Here is how I see dependencies: more dependencies mean potentially more bugs, more security issues, and longer compile times (strongly depends on each dep of course), and eventually a bigger maintenance burden, because the rustdoc team needs to keep an eye on our deps (we contribute heavily to all of our direct dependencies to fix bugs, for perf reasons, etc., so that's part of the maintenance burden). So before adding a dependency, we need to check what it does, etc. If the crate has 100+ optional dependencies but in the end we only use 3, it's fine.

And then come the last two points, which influence each other: the impact on performance and how widely used it would be. If it is widely used and has a small impact, then we have the golden path; other cases are trickier.

Based on all this, do you have some more specific questions?
One possible alternative is to have simple extension points in rustdoc (e.g. HTML postprocessors or Markdown preprocessors), so crates can choose which “plugins” they want to use. Running a configured command on every HTML file is much less code to add for rustdoc and it will allow integrating with arborium, LaTeX renderers or any other kind of doc niceness one might want.
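To illustrate how small such an extension point could be, here is a minimal sketch of "run a configured command on every HTML file" using only the standard library. The function names and the calling convention (file path appended as the last argument, in-place rewrite by the external tool) are assumptions for illustration, not an existing rustdoc interface.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Recursively collect every `.html` file under `dir`, in sorted order.
fn collect_html_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            found.extend(collect_html_files(&path)?);
        } else if path.extension().and_then(|ext| ext.to_str()) == Some("html") {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}

/// Run `program args... <file>` once per generated HTML file.
/// The external tool is assumed to rewrite the file in place.
fn run_postprocessor(doc_root: &Path, program: &str, args: &[&str]) -> io::Result<usize> {
    let files = collect_html_files(doc_root)?;
    for file in &files {
        let status = Command::new(program).args(args).arg(file).status()?;
        if !status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("{} failed on {}", program, file.display()),
            ));
        }
    }
    Ok(files.len())
}
```

With this shape, rustdoc itself gains no highlighting dependency: the choice of tool (and its binary size) moves to whoever configures the command.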
I do! So you're not worried at all about the redistributable size? (22MB => 171MB uncompressed?) But you are worried about having to patch "tree-sitter but in Rust" for performance / security / etc.? I don't think you've commented on the "separate component" option, either redistributed as a rustup component or simply internal to docs.rs, like the arborium-rustdoc proof of concept here.
You mean the generated docs? If so, yes we are, but we're getting close to the best we can do for pure HTML output. And we're not only worried about "tree-sitter but in Rust", but about all new dependencies.
I don't think the complexity of a rustup component is worth the added value from highlighting other languages, especially when an alternative solution based on JS and highlight.js works perfectly. As for docs.rs, the issues I listed for rustdoc are basically the same but even worse: there are only two of us working on docs.rs.
the 22M -> 171M is for
Oof, yeah that's also a blocker imo.
No! I mean the rustdoc binary that is distributed via rustup. That one:

The generated HTML inflates very little. Have you had a chance to check out the Angle 3 section of my article? The video shows

My worry is that, when this discussion resumes after having "some good enough Rust version of tree-sitter", the size of the rustdoc binary will be the next blocker (I thought it would be the current blocker!). Then the only reasonable move would be to "run it in the docs.rs backend", at which point maybe "every C dependency is a blocker" no longer applies. So. Just trying to save some work here.

Maybe I should go into a little more detail about the technical aspects of the tree-sitter solution. Grammars are defined as

Sometimes, defining a grammar in a

The

However, these are not sufficient to perform parsing, querying and highlighting. You can think of them as mostly "definitions" or "state machines" (with some imperative code) that are loaded by the tree-sitter runtime itself, which is also written in portable C11. The authors saw it as an advantage! It lets it have zero dependencies and compile anywhere: on the web, on any native platform with a C compiler, etc.

Now, under a "no C policy", the plan would be to:
Part number 3 has already been attempted via C2Rust, and I don't believe this is necessarily the right approach. The result would probably be slightly slower, it might be slightly smaller too if we chose a different representation of "compiled grammars", and there wouldn't be any C in the build process of rustdoc or "some tool internal to docs.rs". This is a significant engineering effort that I'm happy to shoulder, I just want to make sure you understand what the end result would look like! I would love it if you could ask me questions to ensure there is no misunderstanding here as to what's being proposed. @GuillaumeGomez
I did, hence why I was surprised. Although I find the test to be a bit "off". It directly depends on the number of code examples, we have the same issue with rust highlighting: almost no changes between highlighted and not highlighted, but as you add more Rust code examples, the difference will increase. But that's another debate and quite secondary, if we reach this discussion, then it's pretty much that all other problems have been solved.
I confirm that going from 22MB to 177MB is also a blocker. You did well asking for confirmation. ^^'
Thanks for the technical details! Just to confirm: it is done when updating
I can already say that this approach was not viable last I checked. Or more like "not maintainable", i.e. it generates working code, but good luck trying to maintain it. Overall I'm surprised that a Rust equivalent of
Just want to highlight that A) those sizes are uncompressed and unstripped, and B) 177MB includes all 96 grammars, which I think is probably overkill.
Correct. By the time you're compiling arborium or something that depends on it, you are building a lot of C code (all of it vendored with zero system dependencies outside of malloc + a handful of isupper/islower functions), but not generating anything.
highlight.js is a pile of regexps. My personal position is that this is no way to highlight a programming language, unless you have literally exhausted all other options. Here, docs.rs has an easy, literally drop-in solution (insert arborium-rustdoc into the build pipeline) to get high-quality syntax highlighting with zero JavaScript dependencies to maintain or care about. I must admit I'm surprised at all the shields being raised re: that option specifically.
Just to sum things up: you suggest to add a new dependency for a minor feature (which would still be nice to have) which:
So in short: being worried about greatly increasing the maintenance burden of rustdoc for a minor feature doesn't seem that far-fetched to me. I understand your frustration: you spent a lot of time on this and would love for it to be handled by rustdoc directly by default, but based on all the reasons I listed, I think my position shouldn't come as a surprise.
Nope! You're talking about Angle 2, which I always predicted would prove untenable. I'm asking why Angle 3 is not being discussed. Angle 3 only involves post-processing rustdoc output. It's a horse of a completely different color. edit: also the VERY FIRST THING this PR says is "This is not intended to merge as-is, but as a basis for discussion."
You mean doing it in docs.rs instead? Quoting myself:
Do you mean that building a binary for
I have a lot of empathy there, but I was hoping to have a technical discussion.
I think it's natural that people are discussing Angle 2 here; after all, your PR implements Angle 2. It can be quite confusing to understand what exactly you mean by the alternatives, and it's IMO not ideal to force reviewers to read an article (as much as I think your articles are amazing, and I think pretty much everyone would agree with that!) to understand that. If you want to discuss Angle 3, I would suggest creating an issue specifically about Angle 3 (maybe on the docs.rs repo?), and explaining in that issue what exactly that would entail.
Summary
This PR adds syntax highlighting for non-Rust code blocks in rustdoc using arborium, a tree-sitter based highlighting library.
Currently, code blocks like ```python or ```javascript are rendered as plain text. This PR enables proper syntax highlighting for 15 languages: Python, JavaScript, TypeScript, Bash, C, C++, Go, Java, JSON, TOML, YAML, SQL, Ruby, CSS, and HTML.
Status
This is not intended to merge as-is, but as a basis for discussion.
There are other approaches being explored, such as post-processing the rustdoc HTML output rather than integrating directly into the rendering pipeline. See bearcove/arborium#36 for context on the different approaches.
Usage
Behind an unstable flag (-Z unstable-options --highlight-foreign-code):
Implementation
- arborium as a dependency with 15 language grammars
- CodeBlocks iterator in html/markdown.rs
- Custom HTML elements (<a-k> for keywords, <a-s> for strings, etc.)
Open questions