
@borgar
Owner

@borgar borgar commented Mar 13, 2021

Changes being made for next release:

  • Switches underlying node tree from JsonML to a DOM-like structure.
  • Ribbon interface extended to be able to return contextual sub-slices.
  • Elements added to parse tree are tagged with source offset-position.
    • Block level elements
    • Inline level elements
    • HTML elements
  • Root interface changed a bit.
  • Return line-numbers (zero based) when getting parse-tree.
  • Modernizing the code a bit (destructuring, arrow functions, ...).
  • Switch to import/export syntax for imports.

  • Clean up outermost library interface
  • Document new interface

@borgar borgar changed the title Parser update WIP: Parser update Mar 13, 2021
@borgar borgar marked this pull request as draft March 13, 2021 14:54
@borgar
Owner Author

borgar commented Mar 13, 2021

Nodes are returned with line numbers when textile.parseTree() is used. They are zero based, so a one-based line number can be exposed in the rendered output like this:

function parseTextile (txSource) {
  const { ELEMENT_NODE } = textile.Element;
  return textile
    .parseTree(txSource)
    .visit(node => {
      if (node.nodeType === ELEMENT_NODE) {
        node.setAttribute('data-line', node.pos.line + 1);
      }
    })
    .toHTML();
}

See: test/line-numbers.js

@craigkovatch

Forgive me for stating the obvious, but a friendly reminder that this would be textile-js 3.0 due to the top-level API changes (e.g. removal of jsonml)

@craigkovatch

craigkovatch commented Mar 20, 2021

@borgar Re: your comment on #51 (comment), I tested this out.

For context, we are working on an editor component for textile text. Actually something very much like the GitHub comment editor I'm typing into right now -- with a 'Write' tab and a 'Preview' tab -- but for Textile. In the future we hope to combine these into a WYSIWYG experience.

We have two needs with edit mode:

  1. detect what Textile "formatters" (e.g. h3, underline, link, etc.) are currently active
  2. toggle a particular formatter on the current text selection

We can mostly already do (1) with the existing library -- we insert some sentinels into the input where the selection start/end is, and then visit all the nodes in the JSONML tree looking for the node containing a sentinel. Then just walk up the tree to get the ancestors, and compare the two lists of ancestors to get the active formatters for the current selection.
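
A minimal sketch of the sentinel approach described above. It assumes a simplified, hypothetical node shape (`{ tag, children }`, with plain strings as text) rather than the actual JSONML or textile-js tree, and the sentinel characters and function names are chosen purely for illustration:

```javascript
// Hypothetical node shape for illustration: { tag: 'p', children: [...] },
// where each child is either a nested node or a plain string (text).
const START = '\uE000'; // private-use characters, unlikely to occur in real input
const END = '\uE001';

// Insert the sentinels at the selection boundaries of the source text.
function markSelection (source, selStart, selEnd) {
  return source.slice(0, selStart) + START +
    source.slice(selStart, selEnd) + END +
    source.slice(selEnd);
}

// Return the ancestor tags (outermost first) of the text containing
// `sentinel`, or null if the sentinel is not found under `node`.
function ancestorsOf (node, sentinel, path = []) {
  if (typeof node === 'string') {
    return node.includes(sentinel) ? path : null;
  }
  const nextPath = [...path, node.tag];
  for (const child of node.children) {
    const found = ancestorsOf(child, sentinel, nextPath);
    if (found) return found;
  }
  return null;
}

// Formatters active for the whole selection are those on both ancestor paths.
function activeFormatters (tree) {
  const startPath = ancestorsOf(tree, START) || [];
  const endPath = ancestorsOf(tree, END) || [];
  return startPath.filter(tag => endPath.includes(tag));
}
```

In real use the marked source would be parsed first and the resulting tree passed to `activeFormatters`; the sentinels would also need to be stripped again before anything is shown to the user.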

There are some annoying edge cases (e.g. when the cursor is at h|3. hello) which are only solvable if we also do some parsing of our own. These cases would be much simpler if the offsets included on each node in this PR included both start and end (or pos and len) rather than just pos. I wonder if that would be a simple change to this PR? This would be a critically helpful change for us.

For (2) this PR doesn't help. The tree that's exposed is the pseudo-DOM tree, i.e. after the input has already been transformed into ~an Abstract Syntax Tree. But in order to reliably toggle a formatter on a particular selection, we would need to manipulate the Parse Tree instead (i.e. the one that preserves original whitespace, etc.), so that we can turn the tree back into a blob of textile source in a lossless manner.

I understand this project is a labor of love, so I certainly don't mean to make any demands :) Just wondering what's possible here and what's not.

borgar added 6 commits March 28, 2021 15:02
Code now uses ESM import/export syntax. The package has not been set to type module as tape does not support that yet. The project is ready to be switched over but I am happy to wait for support in the test runner for now.

I have also cleaned up the tests so they now use template strings for multiline strings and arrow callback functions.

Moved babel config and browserslist to rc files.
- Source offsets support for HTML
- Clean out more old commented code
- Move away from offset as 3rd parameter in elements
- Unit tests for source offset
@craigkovatch

Looks like TextNodes are currently not tagged with the source offset position -- is that intentional?

@borgar
Owner Author

borgar commented Apr 12, 2021

The update here is written with three goals in mind:

  1. Modernizing/cleaning the project a bit
  2. Replace node internals (because JSONML was hard to use)
  3. Add source positions to support the use case in "Add the option showOriginalLineNumber" (#51)

This is why I have only done the bare minimum of source position tagging. My free time is in short supply so I have to optimize it somehow. I would have tried to add the end positions had I been aware of the need for them before I did most of the work here. I did consider including them but decided against it to reduce the amount of work. Following this merge, however, it should not be too complicated to add end positions as well.

As to the other points, things quickly become more complicated there.

> Looks like TextNodes are currently not tagged with the source offset position -- is that intentional?

TextNodes are not tagged with the source offset position for two reasons. Firstly (as said above), it was work I didn't think was needed for the #51 use case. Secondly, I didn't want to have to answer the problematic question of how to deal with consistency: not all emitted TextNodes represent text found in the source. Whitespace (for example linebreaks and list indentation tabs) is emitted according to rules, so a \n between elements does not necessarily represent a \n found in the source.

> turn the tree back into a blob of textile source in a lossless manner

Similar problems arise with start and end offsets in that they don't fully represent the input. Consider this markup:

p(class).. paragraph 1

paragraph 2

Which should render as:

<p class="class">paragraph 1</p>
<p class="class">paragraph 2</p>

The second paragraph shares attributes with the first. What meaning do start and end source positions have for the paragraphs? In your case, you would really need a model that includes a container node around the two paragraph nodes to represent an "extended block" and hold the shared list of properties (which behaves differently between types of blocks).
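
To make the idea concrete, here is a rough sketch of what such a container could look like. The node shape, the property names (`type`, `attributes`, `children`) and the `renderExtended` helper are all hypothetical and are not the library's actual node types; the offsets are counted by hand against the markup above:

```javascript
const source = 'p(class).. paragraph 1\n\nparagraph 2';

// Hypothetical container node: it owns the shared attributes and the full
// source range, while each child only records the range of its own text.
const extendedBlock = {
  type: 'extended',
  tag: 'p',
  attributes: { class: 'class' },
  pos: { start: 0, end: 35 },
  children: [
    { pos: { start: 11, end: 22 } }, // "paragraph 1"
    { pos: { start: 24, end: 35 } }  // "paragraph 2"
  ]
};

// Render every child with the container's tag and shared attributes.
// No HTML escaping is done here; this is only meant to show the model.
function renderExtended (block, src) {
  const attrs = Object.entries(block.attributes)
    .map(([name, value]) => ` ${name}="${value}"`)
    .join('');
  return block.children
    .map(({ pos }) => `<${block.tag}${attrs}>${src.slice(pos.start, pos.end)}</${block.tag}>`)
    .join('\n');
}
```

With a model like this, "the range of the second paragraph" and "the range of the block that defines its attributes" become two separate questions with two separate answers.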

I am not a believer in using formal methods of parsing textile, or at least, I am not a believer in my own abilities to build such a thing. I initially tried to go that route when I first attempted to write this, but textile is, to put it diplomatically, not a very elegant syntax. Textile also comes with all the problems of HTML, which is famously no longer parsed using a formal grammar.

> I understand this project is a labor of love, so I certainly don't mean to make any demands :) Just wondering what's possible here and what's not.

I wouldn't have used the word "love" exactly. 😄 But yes, as with many things, I'm just some random guy doing this in his [limited] spare time.

My opinion is that textile sounds ill suited for what you are attempting. I humbly advise you to consider other less messy alternatives. But, of course, I don't know your motives or ultimate plan here. Some routes towards your goal that I can envision:

The RedCloth project, textile for Ruby, is built on top of a formal grammar. You might be able to build on top of that grammar or port it to some other parser generator. This could yield the parse tree you want, affording you more control.

This project can certainly be moved closer to what you need. As well as the obvious additions of end positions, I can look into adding the aforementioned extended block container. I have introduced hidden nodes into the output tree. They capture input (such as textile comments) that is not supposed to be rendered. These could be used to hold whitespace otherwise thrown away.

Write something from scratch. This project is liberally licensed so parts of it might be used. If I were doing this from scratch today though, I think I would look at building the parser on top of http://unifiedjs.com/

@craigkovatch

Thank you very much for your detailed thoughts. I really appreciate all the time you've spent both on this project and on my questions!

> I am not a believer in using formal methods of parsing textile, or at least, I am not a believer in my own abilities to build such a thing. I initially tried to go that route when I first attempted to write this, but textile is, to put it diplomatically, not a very elegant syntax. Textile also comes with all the problems of HTML, which is famously no longer parsed using a formal grammar.

Heh, I know exactly how you feel :D

> My opinion is that textile sounds ill suited for what you are attempting. I humbly advise you to consider other less messy alternatives.

I agree with you, but unfortunately the decision to use textile is something like a decade old, and now my task is to make a UI that isn't absolutely awful to use with it. But I'm aware of what a losing proposition that particular task is :( Thankfully we don't expect that our users will be expert textile users. For example, the two-paragraph syntax you gave as an example is unlikely to be a case we care about supporting in a robust way.

We have proceeded down a path of using our own regexes for dealing with block and list formatters, and are combining that with spelunking the textile-js tree to understand what phrase formatters are active on a particular node. End offsets would hugely simplify that, but in the meantime we can approximate it by looking naively for the same formatter after that formatter's "opening" character position.
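
Roughly, the naive approximation looks like the sketch below. The function name is made up for illustration, and limiting the search to the current line is an extra assumption of this sketch, not something textile guarantees:

```javascript
// Given the source and the position of a formatter's opening character
// (e.g. "_" or "*"), look for the next occurrence of the same character.
// Returns the offset just past the closing character, or -1 if none is
// found before the end of the line.
function approximateEnd (source, openPos) {
  const marker = source[openPos];
  const lineEnd = source.indexOf('\n', openPos);
  const limit = lineEnd === -1 ? source.length : lineEnd;
  const close = source.indexOf(marker, openPos + 1);
  return (close !== -1 && close < limit) ? close + 1 : -1;
}
```

This breaks down for nested or repeated markers (the first match wins), which is part of why real end offsets would be so much nicer.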

Thank you again.

borgar added 5 commits April 21, 2021 20:38
These are not standard and not tested for.
Lang value now supports only chars legal in BCP-47 and underscore.

Fixes #76
@borgar
Owner Author

borgar commented Apr 25, 2021

> I agree with you, but unfortunately the decision to use textile is something like a decade old, and now my task is to make a UI that isn't absolutely awful to use with it.

Yeah, I suspected this might be the case. I've certainly seen (and probably caused, let's be fair) my fair share of legacy.

> End offsets would hugely simplify that

Good news, then! I had a bit of time to look at this and I have added end offset positions! The elements now have .pos.start and .pos.end. (This is a change from the previous .pos.offset.)

Normally the end position tries to be inclusive of trailing input. For example, a block's range will include trailing whitespace (p. one\n\n). There may be exceptions to this, but I am aware of only one: the last table cell in a row does not include a closing | at the end of the line (the | does appear in the tr range, however).

[ 'tr', [ 91, 103 ], '| 1 | _2_ |\n' ],
[ 'td', [ 91, 95 ], '| 1 ' ],
[ 'td', [ 95, 101 ], '| _2_ ' ],
[ 'em', [ 97, 100 ], '_2_' ],
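
As a sketch of how these ranges could be consumed, the snippet below reuses the rows above as `[tag, [start, end], sourceText]` tuples and lists the tags whose range covers a given offset. `tagsAt` is a hypothetical helper, not part of the library, and it assumes the ranges are half-open (`start <= offset < end`), which matches the lengths of the strings shown:

```javascript
// Rows copied from the sample output above: [tag, [start, end], sourceText].
const ranges = [
  ['tr', [91, 103], '| 1 | _2_ |\n'],
  ['td', [91, 95], '| 1 '],
  ['td', [95, 101], '| _2_ '],
  ['em', [97, 100], '_2_']
];

// Return the tags covering `offset`, outermost first.
function tagsAt (rows, offset) {
  return rows
    .filter(([, [start, end]]) => start <= offset && offset < end)
    // Wider ranges first, so the result reads from outermost to innermost.
    .sort((a, b) => (b[1][1] - b[1][0]) - (a[1][1] - a[1][0]))
    .map(([tag]) => tag);
}
```

An offset inside `_2_` resolves to tr, td and em, while the offset of the closing | at the end of the row resolves to tr only, which reflects the table-cell exception described above.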

I have added an ExtendedNode in an attempt to solve the extended nodes problem touched on above. I am not happy with this solution so that is likely to change before I merge this branch.

So, I have a bit more work to do on this, but do take a look and see if this is getting closer to something usable.

borgar added 12 commits May 5, 2021 09:44
This follows the PHP syntax output pretty much exactly. I am not a big fan of the output but I don't see a reason to deviate from it.

Closes #72
The glyph conversions are run on all text nodes. This was a performance bottleneck so it has been rewritten for speed as well as given an accuracy overhaul.

The most notable change is that glyphs are no longer entity encoded by default. This seems like a silly default for JavaScript as well as fairly pointless if no other non-ASCII entities are being encoded. An option has been added to switch to the older behavior though.
@craigkovatch

Hi again, apologies for the very long (gosh has it really been six months??) wait from me. We unfortunately had to shelve this particular component for a while, but now I'm back on it and about to try out your latest changes here. Thanks for bearing with me!
