Any ideas how to deal with context-sensitive grammar? #139

ForNeVeR · 2022-02-23T17:16:31Z

In my implementation of a C parser, I've finally hit the first solid roadblock: the place where I have to implement the infamous lexer hack.

Preamble

In C, the following constructs are the same:

int x;
int (x);

This is because the following are two different declarations in C:

typedef int *foo(int);
typedef int (*foo)(int);

So the grammar must allow the declared name (together with some additional syntax) to be enclosed in parentheses, and once that is allowed, why make the * mandatory?
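
To make the difference concrete, here is a small C sketch. All names in it (foo_fn, foo_ptr, address_of_storage, twice and so on) are made up for illustration. The first typedef names a function type returning int *, the second names a pointer to a function returning int, and a plain variable may be wrapped in redundant parentheses as well.

```c
#include <assert.h>

/* foo_fn names a function type: takes int, returns int *. */
typedef int *foo_fn(int);
/* foo_ptr names a pointer type: pointer to a function taking int, returning int. */
typedef int (*foo_ptr)(int);

static int storage = 42;

/* A function whose type matches foo_fn. */
static int *address_of_storage(int unused) {
    (void)unused;
    return &storage;
}

/* A function whose type matches what foo_ptr points to. */
static int twice(int n) {
    return 2 * n;
}

/* Redundant parentheses around a declarator are legal: int (x); means int x;. */
int parenthesized_declaration(void) {
    int (x);
    x = 7;
    return x;
}

int use_both_typedefs(void) {
    foo_fn *returns_pointer = address_of_storage; /* pointer to a foo_fn function */
    foo_ptr returns_int = twice;
    /* The call binds tighter than the unary *, so this is *(returns_pointer(0)). */
    return *returns_pointer(0) + returns_int(4);
}

/* Small self-check that ties the pieces together. */
void check_declarators(void) {
    assert(parenthesized_declaration() == 7);
    assert(use_both_typedefs() == 50);
}
```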

The issue

In my case, after adding a couple of new AST nodes from the C standard devoted to the parsing of these functional typedefs, the following code is now parsed as a variable declaration instead of a function call:

exit(exitCode);

According to a context-free interpretation of the grammar, this is a conflict: is it a declaration of variable exitCode of type exit, or is it a function call?

In other compilers, the so-called lexer hack is implemented (I've no idea, yet, why it is called a lexer hack when it involves so much parsing). As I understand it, during lexing/parsing we should keep a context with all already declared typedefs, and change the parser behavior based on the presence or absence of certain symbols in this typedef collection. Knowing that exit is not a typedef and not a type name at all would let us skip parsing exit(exitCode); as a variable declaration, and thus restore our ability to reason about the code at parse time.
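
A minimal sketch of that idea in C, under loud assumptions: TypedefTable, typedef_table_add, typedef_table_contains and classify_paren_statement are hypothetical names invented here, not part of Yoakke or of any real compiler, and scopes are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_TYPEDEFS 64

/* Hypothetical table of typedef names seen so far in the translation unit. */
typedef struct {
    const char *names[MAX_TYPEDEFS];
    size_t count;
} TypedefTable;

typedef enum { STMT_DECLARATION, STMT_FUNCTION_CALL } StatementKind;

/* Record a name introduced by a typedef declaration. */
void typedef_table_add(TypedefTable *table, const char *name) {
    assert(table->count < MAX_TYPEDEFS);
    table->names[table->count++] = name;
}

/* Linear lookup; a real implementation would likely use a hash set. */
bool typedef_table_contains(const TypedefTable *table, const char *name) {
    for (size_t i = 0; i < table->count; i++) {
        if (strcmp(table->names[i], name) == 0) {
            return true;
        }
    }
    return false;
}

/* Decide how to read `head(arg);` given what has been declared so far. */
StatementKind classify_paren_statement(const TypedefTable *table, const char *head) {
    return typedef_table_contains(table, head) ? STMT_DECLARATION
                                               : STMT_FUNCTION_CALL;
}
```

With an empty table, classify_paren_statement(&table, "exit") yields STMT_FUNCTION_CALL. A real implementation would also have to track scopes, because an inner declaration can shadow a typedef name.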

I'm not sure how to implement this approach in Yoakke, though, or whether it should be implemented in the parser, in the lexer, or in both.

I can see that "C frontend" is on the Yoakke roadmap (and I would be happy to contribute my implementation after it gets more mature; I believe that my MIT-licensed code may be easily sublicensed to Apache 2 and upstreamed – if desired, of course). So, what are your thoughts on the context-sensitivity of the C grammar? Is it possible to implement something like this in Yoakke actually? (I believe so, but it could require certain modifications of the codegen, which may or may not be easy to hack into.)

If you're interested in the real implementation and code devoted to this problem (please note that I'm neither demanding nor expecting any review or direct help, but merely asking for advice), you may take a look at the failing FullParserTest::AbsCallTest in this commit. It defines the expected and actual structure of the AST I want to achieve.

Answered by ForNeVeR

ForNeVeR (Author) · 2022-02-23T17:36:44Z

My current idea (inspired by @impworks, thank you!) is to introduce a special kind of "volatile AST node" which will denote this particular case of ambiguous AST.

In my parser, I will generate this node instead of the usual declaration node if I detect a construct of the form x(y); (this detection is certainly easily possible without any knowledge of context).

After that, at a later stage of compiling (when I am converting the original AST to intermediate AST, which is perhaps a questionable implementation detail of my project), I should be able to choose between two different AST forms (a declaration or a function call) based on the context, and thus will be able to convert this "volatile node" to a proper one and compile it correctly.
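
A rough sketch of how such a deferred decision could look, written in C for illustration only. The names ParenStatement, make_ambiguous and resolve are hypothetical, and this is not the code from the actual project: the parser emits an ambiguous node without consulting any context, and a later pass settles it against the type names collected by then.

```c
#include <assert.h>
#include <string.h>

typedef enum { NODE_AMBIGUOUS, NODE_DECLARATION, NODE_FUNCTION_CALL } NodeKind;

/* Hypothetical AST node for a statement of the form `head(inner);`. */
typedef struct {
    NodeKind kind;
    const char *head;  /* x in x(y); either a type name or a function name */
    const char *inner; /* y in x(y); either a declared variable or an argument */
} ParenStatement;

/* Parser side: no context is needed to produce the ambiguous node. */
ParenStatement make_ambiguous(const char *head, const char *inner) {
    ParenStatement node = { NODE_AMBIGUOUS, head, inner };
    return node;
}

/* Later pass: known_types holds the type names collected so far. */
ParenStatement resolve(ParenStatement node,
                       const char *const *known_types,
                       size_t type_count) {
    assert(node.kind == NODE_AMBIGUOUS);
    node.kind = NODE_FUNCTION_CALL; /* default when head is not a type name */
    for (size_t i = 0; i < type_count; i++) {
        if (strcmp(known_types[i], node.head) == 0) {
            node.kind = NODE_DECLARATION;
            break;
        }
    }
    return node;
}
```

For exit(exitCode); the head exit is not among the known types, so the node resolves to NODE_FUNCTION_CALL.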

A hack? Certainly. But is it better or worse than the lexer hack™? Who knows!

2 replies

LPeter1997 Feb 23, 2022
Maintainer

Ah yes, one of my suggestions was essentially this!

ForNeVeR Mar 27, 2022
Author

This solution was implemented in commit ForNeVeR/Cesium@718fd0d.

LPeter1997 (Maintainer) · 2022-02-23T17:40:17Z

Well, it's called a lexer hack because compilers usually solve this at the lexical level: they differentiate type identifiers from variable identifiers at the token level.

I have 2 ideas in mind.

The first would be introducing a syntax node that represents both constructs until they can be disambiguated.

The second is actually doing the lexer hack: you could introduce a set of known type names/variable names in the parser as a member variable, and make your transformation function register your types/variables on a successful construct parse. This way, you can disambiguate your cases in the parser. Note that your parsers can fail, and the parser assumes that it's stateless, but things like a typedef or a variable declaration usually stick around, and there's not a lot that can fail in that, when parsing.
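
The hazard with parser-side state is that an alternative may register a name and then fail. One way to guard against that, shown purely as an illustration, is a checkpoint/rollback pair. This is an added assumption and not something Yoakke is known to provide, and every name below (ParserState, state_checkpoint, state_rollback and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 64

/* Hypothetical parser state: type names registered by successful parses. */
typedef struct {
    const char *type_names[MAX_NAMES];
    size_t count;
} ParserState;

/* A checkpoint is just the table size at the moment an alternative starts. */
typedef size_t Checkpoint;

Checkpoint state_checkpoint(const ParserState *state) {
    return state->count;
}

/* Undo every registration made since the checkpoint was taken. */
void state_rollback(ParserState *state, Checkpoint checkpoint) {
    assert(checkpoint <= state->count);
    state->count = checkpoint;
}

/* Called by a transformation function after a typedef parses successfully. */
void state_register_type(ParserState *state, const char *name) {
    assert(state->count < MAX_NAMES);
    state->type_names[state->count++] = name;
}

/* Returns 1 if name is currently registered as a type, 0 otherwise. */
int state_is_type(const ParserState *state, const char *name) {
    for (size_t i = 0; i < state->count; i++) {
        if (strcmp(state->type_names[i], name) == 0) {
            return 1;
        }
    }
    return 0;
}
```

A backtracking parser would take a checkpoint before trying an alternative and roll back if that alternative fails, so names registered along a dead branch do not leak into later decisions.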

2 replies

ForNeVeR Feb 23, 2022
Author

things like a typedef or a variable declaration usually stick around, and there's not a lot that can fail in that, when parsing

That's an interesting proposition, actually. Yes, I agree that it theoretically should never be necessary to roll back this part of the state, thanks to typedef being a keyword.

LPeter1997 Feb 23, 2022
Maintainer

Yes, this is a dangerous assumption and I'd hate to work with it. If I was writing the parser, I'd definitely take the other option.

LPeter1997 (Maintainer) · 2022-02-23T17:43:05Z

Of course I'd love to see a C parser implementation contributed to Yoakke! Regarding the licensing, I'm really open to anything. If you want to keep it MIT, we can license that portion to be MIT.

1 reply

ForNeVeR Feb 23, 2022
Author

I have nothing against any decision on that, but usually it's easier to manage a single-license project from a legal standpoint (ew).

Though I can't see any harm for the library users if the library is licensed under MIT+Apache2 or something like that. We may decide that later, after I implement more of the AST :)

With Yoakke, 90% of the parser creation process is easy and mechanical: just copy-paste the syntax from the C standard, and that's it. There's nothing even to license in this code, it's literally the standard molten into the Yoakke form.

The remaining 10% is an entirely different story, though, and you know all of it because I always go and ask you when I'm stuck with that 10% :)
