WIP: Parser update #69
* Switches underlying node tree from JsonML to a DOM-like structure.
* Ribbon interface extended to be able to return contextual sub-slices.
* Elements added to parse tree are tagged with source offset-position
|
Nodes are returned with line numbers when
function parseTextile (txSource) {
  const { ELEMENT_NODE } = textile.Element;
  return textile
    .parseTree(txSource)
    .visit(node => {
      // Tag each element node with its source line (pos.line + 1).
      if (node.nodeType === ELEMENT_NODE) {
        node.setAttribute('data-line', node.pos.line + 1);
      }
    })
    .toHTML();
}
See: test/line-numbers.js
|
Forgive me for stating the obvious, but a friendly reminder that this would be textile-js 3.0 due to the top-level API changes (e.g. removal of jsonml) |
|
@borgar Re: your comment on #51 (comment), I tested this out. For context, we are working on an editor component for textile text. Actually something very much like the GitHub comment editor I'm typing into right now -- with a 'Write' tab and a 'Preview' tab -- but for Textile. In the future we hope to combine these into a WYSIWYG experience. We have two needs with edit mode:
We can mostly already do (1) with the existing library -- we insert some sentinels into the input where the selection start/end is, and then visit all the nodes in the JSONML tree looking for the node containing a sentinel. Then just walk up the tree to get the ancestors, and compare the two lists of ancestors to get the active formatters for the current selection. There are some annoying edge cases (e.g. when the cursor is at
For (2) this PR doesn't help. The tree that's exposed is the pseudo-DOM tree, i.e. after the input has already been transformed into ~an Abstract Syntax Tree. But in order to reliably toggle a formatter on a particular selection, we would need to manipulate the Parse Tree instead (i.e. the one that preserves original whitespace, etc.), so that we can turn the tree back into a blob of textile source in a lossless manner.
I understand this project is a labor of love, so I certainly don't mean to make any demands :) Just wondering what's possible here and what's not.
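The sentinel approach described for (1) can be sketched roughly as follows. This is only an illustration under assumptions: the sentinel characters, the helper names (`markSelection`, `findAncestors`, `activeFormatters`) and the hand-written JsonML tree are all hypothetical, and nothing here calls textile-js itself. It compares ancestors by tag name only, so it does not address the edge cases mentioned above.

```javascript
// Hypothetical sentinels (private-use code points); not part of textile-js.
const START = '\uE000';
const END = '\uE001';

// Insert the sentinels into the source at the selection boundaries.
function markSelection (source, selStart, selEnd) {
  return source.slice(0, selStart) + START +
    source.slice(selStart, selEnd) + END +
    source.slice(selEnd);
}

// Depth-first search of a JsonML node: ["tag", {attrs}?, ...children].
// Returns the tag names from the root down to the text that contains
// `sentinel`, or null if the sentinel is not inside this subtree.
function findAncestors (node, sentinel, path = []) {
  if (typeof node === 'string') {
    return node.includes(sentinel) ? path : null;
  }
  const [ tag, ...rest ] = node;
  for (const child of rest) {
    // Skip the optional attributes object (and anything else unexpected).
    if (typeof child !== 'string' && !Array.isArray(child)) { continue; }
    const found = findAncestors(child, sentinel, [ ...path, tag ]);
    if (found) { return found; }
  }
  return null;
}

// Formatters active for the whole selection: the ancestors that the
// start and end sentinels have in common, compared from the root down.
function activeFormatters (tree) {
  const a = findAncestors(tree, START) || [];
  const b = findAncestors(tree, END) || [];
  const shared = [];
  for (let i = 0; i < a.length && i < b.length && a[i] === b[i]; i++) {
    shared.push(a[i]);
  }
  return shared;
}

// Hand-written stand-in for what a parser might return when the
// selection covers the word "and" inside nested bold and italic spans.
const exampleTree = [ 'p', { class: 'x' },
  'some ',
  [ 'strong', 'bold ', [ 'em', START + 'and' + END + ' italic' ] ],
  ' text'
];

console.log(activeFormatters(exampleTree)); // [ 'p', 'strong', 'em' ]
```

When the two sentinels land in different branches, only the common ancestors are reported, which matches the "compare the two lists" step but says nothing about formatting that covers just part of the selection.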
Code now uses ESM import/export syntax. The package has not been set to type module as tape does not support that yet. The project is ready to be switched over but I am happy to wait for support in the test runner for now. I have also cleaned up the tests so they now use template strings for multiline strings and arrow callback functions. Moved babel config and browserslist to rc files.
- Source offsets support for HTML
- Clean out more old commented code
- Move away from offset as 3rd parameter in elements
- Unit tests for source offset
|
Looks like |
|
The update here is written with three goals in mind:
This is why I have only done the bare minimum of source position tagging. My free time is in short supply so I have to optimize it somehow. I would have tried to add the end positions had I been aware of the need for them before I did most of the work here. I did consider including them but decided not to, in order to reduce the amount of work. Following this merge, however, it should not be too complicated to add end positions in as well. As to the other points, things quickly become more complicated there.
TextNodes are not tagged with the source offset position for two reasons. Firstly, (as said above) it was work I didn't think was needed for the #51 use-case. Secondly, I didn't want to have to answer the problematic question of how to deal with consistency: Not all emitted TextNodes represent text found in the source. Whitespace (for example linebreaks and list indentation tabs) is emitted according to rules and so
Similar problems arise with start and end offsets in that they don't fully represent the input. Consider this markup:
Which should render as:
<p class="class">paragraph 1</p>
<p class="class">paragraph 2</p>
The second paragraph shares attributes with the first. What meaning do start and end source positions have for the paragraphs? In your case, you would really need a model that includes a container node around the two paragraph nodes to represent an "extended block" and hold the shared list of properties (which behaves differently between types of blocks).
I am not a believer in using formal methods of parsing textile, or at least, I am not a believer in my own abilities to build such a thing. I initially tried to go that route when I first attempted to write this but textile is, to put it diplomatically, not a very elegant syntax. Textile also comes with all the problems of HTML which is famously no longer parsed using a formal grammar.
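The container model suggested above for extended blocks could look roughly like this. It is a sketch under assumptions, not something this PR implements: the source string assumes the markup in question is a `p(class)..` extended paragraph block, and the node shape (`type`, `attrs`, `start`, `end`, `children`) and the helpers `span` and `toHTML` are hypothetical.

```javascript
// Assumed input: an extended paragraph block whose attributes carry
// over to the following paragraph.
const source = 'p(class).. paragraph 1\n\nparagraph 2';

// Hypothetical node factory. `start` is inclusive and `end` is exclusive,
// both as string offsets into `source`.
function span (type, text, from) {
  const start = source.indexOf(text, from);
  return { type, start, end: start + text.length };
}

const first = span('p', 'paragraph 1', 0);
const second = span('p', 'paragraph 2', first.end);

// The container owns the shared attributes and the full source range;
// each paragraph only covers its own text.
const extendedBlock = {
  type: 'extended-block',
  attrs: { class: 'class' },
  start: 0,
  end: source.length,
  children: [ first, second ]
};

// Render each child with the attributes held by the container.
function toHTML (block) {
  const attrs = Object.entries(block.attrs)
    .map(([ key, value ]) => ` ${key}="${value}"`)
    .join('');
  return block.children
    .map(c => `<${c.type}${attrs}>${source.slice(c.start, c.end)}</${c.type}>`)
    .join('\n');
}

console.log(toHTML(extendedBlock));
```

With this shape, the question of what a paragraph's start and end mean gets a narrower answer: the `p(class)..` signature belongs to the container's range, not to either paragraph.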
I wouldn't have used the word "love" exactly. 😄 But yes, as with many things, I'm just some random guy doing this in his [limited] spare time.
My opinion is that textile sounds ill suited for what you are attempting. I humbly advise you to consider other less messy alternatives. But, of course, I don't know your motives or ultimate plan here. Some routes towards your goal that I can envision:
- The RedCloth project, textile for Ruby, is built on top of a formal grammar. You might be able to build on top of that grammar or port it to some other parser generator. This could yield the parse tree you want, affording you more control.
- This project can certainly be moved closer to what you need. As well as the obvious additions of end positions, I can look into adding the aforementioned extended block container. I have introduced hidden nodes into the output tree. They capture input (such as textile comments) that is not supposed to be rendered. These could be used to hold whitespace otherwise thrown away.
- Write something from scratch. This project is liberally licensed so parts of it might be used. If I were doing this from scratch today though, I think I would look at building the parser on top of http://unifiedjs.com/
|
Thank you very much for your detailed thoughts. I really appreciate all the time you've spent both on this project and on my questions!
Heh, I know exactly how you feel :D
I agree with you, but unfortunately the decision to use textile is something like a decade old, and now my task is to make a UI that isn't absolutely awful to use with it. But I'm aware of what a losing proposition that particular task is :(
Thankfully we don't expect that our users will be expert textile users. For example, the two-paragraph syntax you gave as an example is unlikely to be a case we care about supporting in a robust way. We have proceeded down a path of using our own regexes for dealing with block and list formatters, and are combining that with spelunking the
Thank you again.
These are not standard and not tested for.
Lang value now supports only chars legal in BCP-47 and underscore. Fixes #76
Yeah, I suspected this might be the case. I've certainly seen (and probably caused, let's be fair) my fair share of legacy.
Good news, then! I had a bit of time to look at this and I have added end offset positions! The elements now have
Normally the end position tries to be inclusive about trailing input. For example, a block's range will include trailing whitespace (
textile-js/test/source-offsets.js Lines 490 to 493 in 2e1f2bc
I have added an
So, I have a bit more work to do on this but do take a look and see if this is getting closer to something usable.
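To illustrate how start and end offsets with inclusive trailing whitespace might be used for source edits, here is a minimal sketch. The `{ start, end }` range shape and the helpers `contentRange` and `replaceRange` are assumptions made for the example; the actual property names on the elements are not shown in this thread.

```javascript
// Shrink a range so it no longer covers trailing whitespace.
// `start` is inclusive, `end` is exclusive (assumed convention).
function contentRange (source, { start, end }) {
  let contentEnd = end;
  while (contentEnd > start && /\s/.test(source[contentEnd - 1])) {
    contentEnd--;
  }
  return { start, end: contentEnd };
}

// Replace one range of the source and leave everything else untouched,
// so whitespace outside the range survives the edit.
function replaceRange (source, { start, end }, replacement) {
  return source.slice(0, start) + replacement + source.slice(end);
}

const source = 'p. first\n\np. second';

// Hypothetical range for the first block, including the blank line after it.
const blockRange = { start: 0, end: 10 };

const inner = contentRange(source, blockRange);
const edited = replaceRange(source, inner, 'h1. first');

console.log(JSON.stringify(edited)); // "h1. first\n\np. second"
```

Trimming before replacing keeps the blank line that separates the blocks, which is the kind of lossless edit discussed earlier in the thread.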
…rder is not ensured
This follows the PHP syntax output pretty much exactly. I am not a big fan of the output but I don't see a reason to deviate from it. Closes #72
Closes #77
The glyph conversions are run on all text nodes. This was a performance bottleneck so it has been rewritten for speed as well as given an accuracy overhaul. The most notable change is that glyphs are no longer entity encoded by default. This seems like a silly default for JavaScript as well as fairly pointless if no other non-ASCII entities are being encoded. An option has been added to switch to the older behavior though.
|
Hi again, apologies for the very long (gosh has it really been six months??) wait from me. We unfortunately had to shelve this particular component for a while, but now I'm back on it and about to try out your latest changes here. Thanks for bearing with me! |
Changes being made for next release: