January 2nd, 2008
Antlr lexing problem
I should probably post this on a mailing list instead, but for now I want to document my problem here. If anyone has any good suggestions I’d appreciate it.
I’m using Antlr to lex a language. The language is fixed and has some cumbersome features. One in particular is proving awkward to handle neatly with Antlr 3.
The problem is about recognizing identifiers. To make things really, really simple: an identifier can consist of the letter “s” and the character “:” in any order, in any quantity. An identifier can also be one of the three operators “=”, “:=” and “::=”. That is the whole language. Whitespace separation and so on is easy to handle. The requirements below are what give me trouble. The first three are simple baseline examples:
- “s” should lex into “s”
- “s:” should lex into “s:”
- “s::::” should lex into “s::::”
- “s:=” should lex into “s” and “:=”
- “s::=” should lex into “s:” and “:=”
- etc.
Now, the problem is obviously that any straightforward, greedy identifier rule will end up eating the last colon too. I can of course use a semantic predicate so that the identifier rule refuses to consume a colon when the character after it is “=”. This helps for the 4th case, but not for the 5th.
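To pin down the intended behaviour independently of Antlr, here is a minimal hand-written sketch in Python. It is not an Antlr grammar and not a solution to the Antlr problem itself; it only shows that the five cases above are satisfied when the identifier rule is non-greedy and stops as soon as the rest of the input starts with “=” or “:=”. The names `TOKEN` and `tokenize` are made up for illustration, and two behaviours are assumptions not stated above: a standalone “::=” is treated as a single operator, and “s=” is split into “s” and “=”.

```python
import re

# Operators are tried first, longest first. The identifier alternative is
# lazy and only ends where the remaining input starts with "=" or ":=",
# or at whitespace or end of input.
TOKEN = re.compile(
    r"""
      ::= | := | =                 # the three operators
    | [s:]+? (?= :?= | \s | \Z )   # identifier made of "s" and ":"
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens according to the cases listed above."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace only separates tokens
            continue
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for source in ["s", "s:", "s::::", "s:=", "s::="]:
        print(source, "->", tokenize(source))
```

The point of the sketch is that deciding where an identifier ends requires looking two characters ahead (a colon followed by “=”), which is presumably why a check on a single following character is not enough.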
Anyone care to help? =)
