WIP: Parser update #69

borgar · 2021-03-13T14:54:08Z

Changes being made for next release:

Clean up outmost library interface
Document new interface

* Switches underlying node tree from JsonML to a DOM-like structure.
* Ribbon interface extended to be able to return contextual sub-slices.
* Elements added to parse tree are tagged with source offset-position.

borgar · 2021-03-13T14:59:43Z

Nodes are returned with line numbers when textile.parseTree() is used. They are zero based, so a one-based line number can be exposed in the rendered output like this:

function parseTextile (txSource) {
  const { ELEMENT_NODE } = textile.Element;
  return textile
    .parseTree(txSource)
    .visit(node => {
      if (node.nodeType === ELEMENT_NODE) {
        node.setAttribute('data-line', node.pos.line + 1);
      }
    })
    .toHTML();
}

See: test/line-numbers.js

craigkovatch · 2021-03-20T16:32:40Z

Forgive me for stating the obvious, but a friendly reminder that this would be textile-js 3.0 due to the top-level API changes (e.g. removal of jsonml)

craigkovatch · 2021-03-20T17:51:55Z

@borgar Re: your comment on #51 (comment), I tested this out.

For context, we are working on an editor component for textile text. Actually something very much like the GitHub comment editor I'm typing into right now -- with a 'Write' tab and a 'Preview' tab -- but for Textile. Maybe in the future we hope to combine these into a WYSIWYG experience.

We have two needs with edit mode:

1. detect what Textile "formatters" (e.g. h3, underline, link, etc.) are currently active
2. toggle a particular formatter on the current text selection

We can mostly already do (1) with the existing library -- we insert some sentinels into the input where the selection start/end is, and then visit all the nodes in the JSONML tree looking for the node containing a sentinel. Then just walk up the tree to get the ancestors, and compare the two lists of ancestors to get the active formatters for the current selection.
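A minimal sketch of the sentinel approach described above, assuming a plain JsonML tree of the form `[tagName, optionalAttributes, ...children]`. The sentinel characters and the helper names `ancestorsOf` and `activeFormatters` are hypothetical and not part of textile-js:

```javascript
// Hypothetical sentinels: private-use characters unlikely to occur in real input.
const START = '\uE000';
const END = '\uE001';

// Returns the tag names from the root down to the string containing `sentinel`,
// or null when the sentinel is not found in this subtree.
function ancestorsOf (node, sentinel, path = []) {
  if (typeof node === 'string') {
    return node.includes(sentinel) ? path : null;
  }
  if (!Array.isArray(node)) {
    return null; // attribute objects carry no text children
  }
  const [ tag, ...rest ] = node;
  for (const child of rest) {
    const found = ancestorsOf(child, sentinel, [ ...path, tag ]);
    if (found) { return found; }
  }
  return null;
}

// Formatters active over the whole selection: tags that enclose both ends.
function activeFormatters (tree, start = START, end = END) {
  const a = ancestorsOf(tree, start) || [];
  const b = ancestorsOf(tree, end) || [];
  return a.filter(tag => b.includes(tag));
}

const tree = [ 'p', [ 'strong', `${START}bold `, [ 'em', `both${END}` ] ], ' plain' ];
console.log(activeFormatters(tree)); // [ 'p', 'strong' ]
```

Comparing tag names alone cannot tell two sibling elements of the same type apart, so a real implementation would probably compare node identities rather than names.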

There are some annoying edge cases (e.g. when the cursor is at h|3. hello) which are only solvable if we also do some parsing of our own. These cases would be much simpler if the offsets included on each node in this PR included both start and end (or pos and len) rather than just pos. I wonder if that would be a simple change to this PR? This would be a critically helpful change for us.

For (2) this PR doesn't help. The tree that's exposed is the pseudo-DOM tree, i.e. after the input has already been transformed into ~an Abstract Syntax Tree. But in order to reliably toggle a formatter on a particular selection, we would need to manipulate the Parse Tree instead (i.e. the one that preserves original whitespace, etc.), so that we can turn the tree back into a blob of textile source in a lossless manner.

I understand this project is a labor of love, so I certainly don't mean to make any demands :) Just wondering what's possible here and what's not.

Code now uses ESM import/export syntax. The package has not been set to type module as tape does not support that yet. The project is ready to be switched over but I am happy to wait for support in the test runner for now. I have also cleaned up the tests so they now use template strings for multiline strings and arrow callback functions. Moved babel config and browserslist to rc files.

craigkovatch · 2021-04-08T18:21:35Z

Looks like TextNodes are currently not tagged with the source offset position -- is that intentional?

borgar · 2021-04-12T19:29:43Z

The update here is written with three goals in mind:

1. Modernize and clean up the project a bit
2. Replace node internals (because JSONML was hard to use)
3. Add source positions to support the use case in "Add the option showOriginalLineNumber" (#51)

This is why I have only done the bare minimum of source position tagging. My free time is in short supply so I have to optimize it somehow. I would have tried to add the end positions had I been aware of the need for them before I did most of the work here. I did consider including them but decided against it to reduce the amount of work. Following this merge, however, it should not be too complicated to add end positions as well.

As to the other points, things quickly become more complicated there.

> Looks like TextNodes are currently not tagged with the source offset position -- is that intentional?

TextNodes are not tagged with the source offset position for two reasons. Firstly (as said above), it was work I didn't think was needed for the #51 use case. Secondly, I didn't want to have to answer the problematic question of how to deal with consistency: not all emitted TextNodes represent text found in the source. Whitespace (for example linebreaks and list indentation tabs) is emitted according to rules, so a \n between elements does not necessarily represent a \n found in the source.

> turn the tree back into a blob of textile source in a lossless manner

Similar problems arise with start and end offsets in that they don't fully represent the input. Consider this markup:

p(class).. paragraph 1

paragraph 2

Which should render as:

<p class="class">paragraph 1</p>
<p class="class">paragraph 2</p>

The second paragraph shares attributes with the first. What meaning do start and end source positions have for the paragraphs? In your case, you would really need a model that includes a container node around the two paragraph nodes to represent an "extended block" and hold the shared list of properties (which behaves differently between types of blocks).
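To make the container idea concrete, here is a rough sketch of what such a model could look like. The node shapes (`type: 'extended'`, `attributes`, `pos`, `children`) and the `renderExtended` helper are hypothetical illustrations, not the library's actual classes, and the choice to let each range run up to the start of the next block is an assumption:

```javascript
// Hypothetical node shapes for illustration only.
const source = 'p(class).. paragraph 1\n\nparagraph 2';
const second = source.indexOf('paragraph 2');

const extended = {
  type: 'extended',
  attributes: { class: 'class' },         // shared by every child block
  pos: { start: 0, end: source.length },  // the container spans the whole construct
  children: [
    { tag: 'p', text: 'paragraph 1', pos: { start: 0, end: second } },
    { tag: 'p', text: 'paragraph 2', pos: { start: second, end: source.length } }
  ]
};

// Renders each child with the container's shared attributes.
// No HTML escaping is done here; this is only a sketch.
function renderExtended (node) {
  const attrs = Object.entries(node.attributes)
    .map(([ name, value ]) => ` ${name}="${value}"`)
    .join('');
  return node.children
    .map(child => `<${child.tag}${attrs}>${child.text}</${child.tag}>`)
    .join('\n');
}

console.log(renderExtended(extended));
```

With this shape the `p(class)..` signature belongs to the container's range, so the second paragraph's range only has to cover its own text.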

I am not a believer in using formal methods of parsing textile, or at least, I am not a believer in my own abilities to build such a thing. I initially tried to go that route when I first attempted to write this but textile is, to put it diplomatically, not a very elegant syntax. Textile also comes with all the problems of HTML which is famously no longer parsed using a formal grammar.

> I understand this project is a labor of love, so I certainly don't mean to make any demands :) Just wondering what's possible here and what's not.

I wouldn't have used the word "love" exactly. 😄 But yes, as with many things, I'm just some random guy doing this in his [limited] spare time.

My opinion is that textile sounds ill suited for what you are attempting. I humbly advise you to consider other less messy alternatives. But, of course, I don't know your motives or ultimate plan here. Some routes towards your goal that I can envision:

The RedCloth project, textile for Ruby, is built on top of a formal grammar. You might be able to build on top of that grammar or port it to some other parser generator. This could yield the parse tree you want, affording you more control.

This project can certainly be moved closer to what you need. As well as the obvious additions of end positions, I can look into adding the aforementioned extended block container. I have introduced hidden nodes into the output tree. They capture input (such as textile comments) that is not supposed to be rendered. These could be used to hold whitespace otherwise thrown away.

Write something from scratch. This project is liberally licensed so parts of it might be used. If I were doing this from scratch today though, I think I would look at building the parser on top of http://unifiedjs.com/

craigkovatch · 2021-04-12T19:43:23Z

Thank you very much for your detailed thoughts. I really appreciate all the time you've spent both on this project and on my questions!

> I am not a believer in using formal methods of parsing textile, or at least, I am not a believer in my own abilities to build such a thing. I initially tried to go that route when I first attempted to write this but textile is, to put it diplomatically, not a very elegant syntax. Textile also comes with all the problems of HTML which is famously no longer parsed using a formal grammar.

Heh, I know exactly how you feel :D

> My opinion is that textile sounds ill suited for what you are attempting. I humbly advise you to consider other less messy alternatives.

I agree with you, but unfortunately the decision to use textile is something like a decade old, and now my task is to make a UI that isn't absolutely awful to use with it. But I'm aware of what a losing proposition that particular task is :( Thankfully we don't expect that our users will be expert textile users. For example, the two-paragraph syntax you gave as an example is unlikely to be a case we care about supporting in a robust way.

We have proceeded down a path of using our own regexes for dealing with block and list formatters, and are combining that with spelunking the textile-js tree to understand what phrase formatters are active on a particular node. End offsets would hugely simplify that, but in the meantime we can approximate it by looking naively for the same formatter after that formatter's "opening" character position.
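The naive approximation mentioned above could look roughly like the following. This is a sketch under assumptions: the helper name `naiveFormatterRange` is hypothetical, and it ignores nesting, escaping, and Textile's rules about where a phrase marker may legally open or close:

```javascript
// Returns { start, end } (end exclusive) for a phrase that opens with `marker`
// at `openPos`, or null when the marker is not at that position or no closing
// marker follows it.
function naiveFormatterRange (source, openPos, marker) {
  if (!source.startsWith(marker, openPos)) { return null; }
  const close = source.indexOf(marker, openPos + marker.length);
  if (close === -1) { return null; }
  return { start: openPos, end: close + marker.length };
}

const text = 'a *bold* b';
const range = naiveFormatterRange(text, 2, '*');
console.log(range);                              // { start: 2, end: 8 }
console.log(text.slice(range.start, range.end)); // *bold*
```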

Thank you again.

borgar · 2021-04-25T19:47:30Z

> I agree with you, but unfortunately the decision to use textile is something like a decade old, and now my task is to make a UI that isn't absolutely awful to use with it.

Yeah, I suspected this might be the case. I've certainly seen (and probably caused, let's be fair) my fair share of legacy.

> End offsets would hugely simplify that

Good news, then! I had a bit of time to look at this and I have added end offset positions! The elements now have .pos.start and .pos.end. (This is a change from the previous .pos.offset.)

Normally the end position tries to be inclusive about trailing input. For example, a block's range will include trailing whitespace (p. one\n\n). There may be exceptions to this, but I am aware of only one: the last table cell in a row does not include a closing | at the end of the line (the | does appear in the tr range, however).

textile-js/test/source-offsets.js

Lines 490 to 493 in 2e1f2bc

    
    [ 'tr', [ 91, 103 ], '| 1 | _2_ |\n' ],
    [ 'td', [ 91, 95 ], '| 1 ' ],
    [ 'td', [ 95, 101 ], '| _2_ ' ],
    [ 'em', [ 97, 100 ], '_2_' ],
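As an illustration of how such ranges might be consumed, the sketch below finds which elements enclose a given cursor offset. It uses only the ranges from the test excerpt above; the `tagsAt` helper is hypothetical. The end offset is treated as exclusive, which matches the excerpt: '| 1 ' is four characters long and spans 91 to 95.

```javascript
// Ranges copied from the test excerpt: [ tagName, [ start, end ] ].
const ranges = [
  [ 'tr', [ 91, 103 ] ],
  [ 'td', [ 91, 95 ] ],
  [ 'td', [ 95, 101 ] ],
  [ 'em', [ 97, 100 ] ]
];

// Tags whose range contains the offset, treating `end` as exclusive.
function tagsAt (offset) {
  return ranges
    .filter(([ , [ start, end ] ]) => start <= offset && offset < end)
    .map(([ tag ]) => tag);
}

console.log(tagsAt(98)); // [ 'tr', 'td', 'em' ]
console.log(tagsAt(92)); // [ 'tr', 'td' ]
```

In an editor, the cursor position would be passed as the offset, and the resulting tag list would correspond to the formatters active at that position.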

I have added an ExtendedNode in an attempt to solve the extended nodes problem touched on above. I am not happy with this solution so that is likely to change before I merge this branch.

So, I have a bit more work to do on this but do take a look and see if this is getting closer to something useable.

The glyph conversions are run on all text nodes. This was a performance bottleneck so it has been rewritten for speed as well as given an accuracy overhaul. The most notable change is that glyphs are no longer entity encoded by default. This seems like a silly default for JavaScript as well as fairly pointless if no other non-ASCII entities are being encoded. An option has been added to switch to the older behavior though.

craigkovatch · 2021-10-26T20:22:00Z

Hi again, apologies for the very long (gosh has it really been six months??) wait from me. We unfortunately had to shelve this particular component for a while, but now I'm back on it and about to try out your latest changes here. Thanks for bearing with me!

Parser update

1e19580

* Switches underlying node tree from JsonML to a DOM-like structure.
* Ribbon interface extended to be able to return contextual sub-slices.
* Elements added to parse tree are tagged with source offset-position.

borgar changed the title ~~Parser update~~ WIP: Parser update Mar 13, 2021

borgar marked this pull request as draft March 13, 2021 14:54

borgar mentioned this pull request Mar 13, 2021

Add the option showOriginalLineNumber, #51

Closed

borgar added 2 commits March 14, 2021 22:40

Use ABBR since ACRONYM is deprecated

e581bb8

Added tests

9e25f35

borgar added 6 commits March 28, 2021 15:02

Update all packages and fix lint problems

0602531

Code lint and crud fixes

53a12b8

Add textile comments to the tree but keep them hidden

e911731

Remove garbage comment

671451d

Finished source offset support for HTML elements

01df600

- Source offsets support for HTML
- Clean out more old commented code
- Move away from offset as 3rd parameter in elements
- Unit tests for source offset

This was referenced Apr 12, 2021

Plans to update for compatibility with Textile 3.7.1? #55

Open

Complete 100% test coverage. #41

Open

borgar added 5 commits April 21, 2021 20:38

Remove {} fences

56115e7

These are not standard and not tested for.

Renamed Flow->Block and Phrase->Inline

72196f8

Wrap extended blocks in an extended node

212a150

Stricter matching for lang attributes

c402299

Lang value now supports only chars legal in BCP-47 and underscore. Fixes #76

Token end postions

2e1f2bc

borgar added 4 commits April 25, 2021 19:47

Remove unused null offset from node pos.

f587e86

Conform footnote handling to PHP v4

8371bc0

Closes #74

Support MediaWiki style definition list syntax

f38cc79

Closes #73

Add tests for def-lists source indexes

281286f

borgar added 12 commits May 5, 2021 09:44

translating offsets to lines needs to be done differently as source o…

2edd9f9

…rder is not ensured

Add support for endnotes

6b75fda

This follows the PHP syntax output pretty much exactly. I am not a big fan of the output but I don't see a reason to deviate from it. Closes #72

Simpler code

2434a1b

Disallow emitting link URI that have unsafe protocols

871382a

Closes #42

HTML is processed same as regular textile

01b5c2f

Support ID prefixing

42843f7

Closes #77

Upper case should not default HTML parsing or XSS

f85f5bc

Update re to a class to prevent pattern bleed

88f91fa

Support all regexp flags in Re

3f29341

Updated package dependencies

fa800f5

Remove npm commands that don't do anything

8e51d77

borgar added 5 commits August 6, 2023 17:11

Don't inline-linebreak if whitespace follows the newline

308c3f5

Closes #88

Add _* to npmignore

fa68982

Update deps and re-lint repo

5e6e149

Adding types and docs as well as a few minor things

1293c01

Update web editor

d3919d0
