Parsing language with lots of "keywords" #1542

maurymarkowitz · 2025-07-10T15:37:24Z

I really hope this is an allowable post here.

I came across Lark by accident while going down a Google rabbit hole, and I wonder if it might be able to solve a longstanding problem I've had.

I have previously written a system using flex/bison that runs old dialects of BASIC. The biggest problem I face is that BASIC does not require whitespace, which makes picking out keywords difficult. Consider "10 FOR I=1 TO 10", which can also be entered as "10FORI=1TO10". So is that "FOR I" or "FORI"? In BASIC, scanning stops as soon as it hits a complete keyword, so it emits a token at FOR. Coding this in flex/bison is really annoying - basically you build an array of keyword strings and loop over it, so now you have two lists to maintain.

I'm wondering whether anyone has come across something similar, and whether Lark offers a solution. I'm sure there's a term for this, but reading the docs didn't turn up anything that caught my eye.
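
For readers unfamiliar with the behaviour described above, here is a minimal sketch of keyword-first scanning in plain Python. It is not the flex/bison code from the original project: the `KEYWORDS` list, the `crunch` and `keyword_at` names, and the token labels are all hypothetical, and it assumes upper-case input.

```python
import re

# Hypothetical keyword list for illustration; a real dialect has many more.
KEYWORDS = ["FOR", "TO", "NEXT", "PRINT", "GOTO"]
DIGITS = re.compile(r"\d+")


def keyword_at(line, pos):
    """Return the keyword that starts at pos, or None if there is none."""
    return next((k for k in KEYWORDS if line.startswith(k, pos)), None)


def crunch(line):
    """Tokenize one BASIC line, emitting a keyword as soon as one is complete."""
    tokens = []
    pos = 0
    while pos < len(line):
        ch = line[pos]
        if ch == " ":
            pos += 1  # whitespace is optional, so it is simply skipped
            continue
        keyword = keyword_at(line, pos)
        if keyword:
            # Keywords are tried first, so "FORI" yields FOR and then I.
            tokens.append(("KEYWORD", keyword))
            pos += len(keyword)
        elif ch.isdigit():
            match = DIGITS.match(line, pos)
            tokens.append(("NUMBER", match.group()))
            pos = match.end()
        elif ch.isalpha():
            # A name runs until a non-alphanumeric character or the start
            # of a keyword, whichever comes first.
            end = pos + 1
            while (end < len(line) and line[end].isalnum()
                   and not keyword_at(line, end)):
                end += 1
            tokens.append(("NAME", line[pos:end]))
            pos = end
        else:
            tokens.append(("SYMBOL", ch))
            pos += 1
    return tokens


print(crunch("10FORI=1TO10"))
```

Both spellings of the statement produce the same token list here. The cost of this rule is that a name containing a keyword, such as TOTAL, is split at TO, and the keyword list has to be kept in sync with the grammar by hand.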

erezsh · 2025-07-11T06:30:06Z

Maintainer

@maurymarkowitz Yes, Lark's contextual lexer actually supports this innately.

See this sample code:

from lark import Lark

grammar = r"""
    start: line_statement*
    line_statement: NUMBER statement
    statement: "FOR" variable "=" expression "TO" expression
    expression: NUMBER
    variable: CNAME

    %import common.NUMBER
    %import common.WS_INLINE
    %import common.CNAME
    %ignore WS_INLINE
"""

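# With parser='lalr', Lark uses the contextual lexer by default;
# it is named explicitly here to make the dependency clear.
parser = Lark(grammar, start='start', parser='lalr', lexer='contextual')

# The same statement, with and without whitespace.
inputs = [
    '10 FOR I = 1 TO 10',
    '10FORI=1TO10'
]

for text in inputs:
    tree = parser.parse(text)
    print(tree.pretty())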
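
The unspaced input works because the contextual lexer only tries the terminals the parser can accept in its current state, so after the line number it looks for FOR and never considers CNAME. The toy below illustrates that idea with the standard library only. It is not how Lark is implemented: the `TERMINALS` table, the hand-written `EXPECTED` list, and the `contextual_lex` function are assumptions made for this sketch, and a real LALR parser derives the accepted terminals from its parse table.

```python
import re

# Terminal patterns, loosely mirroring the grammar above.
TERMINALS = {
    "NUMBER": r"\d+",
    "FOR": r"FOR",
    "TO": r"TO",
    "EQUAL": r"=",
    "CNAME": r"[A-Za-z_][A-Za-z0-9_]*",
}

# Hypothetical hand-written table: the terminals acceptable at each step of
#   NUMBER "FOR" CNAME "=" NUMBER "TO" NUMBER
EXPECTED = [
    {"NUMBER"}, {"FOR"}, {"CNAME"}, {"EQUAL"},
    {"NUMBER"}, {"TO"}, {"NUMBER"},
]


def contextual_lex(text):
    """Lex one FOR statement, trying only the terminals allowed at each step."""
    tokens = []
    pos = 0
    for accepted in EXPECTED:
        # Skip inline whitespace, like %ignore WS_INLINE in the grammar.
        while pos < len(text) and text[pos] in " \t":
            pos += 1
        for name in sorted(accepted):
            match = re.compile(TERMINALS[name]).match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise SyntaxError(f"expected one of {sorted(accepted)} at {pos}")
    if text[pos:].strip():
        raise SyntaxError(f"unexpected trailing input at {pos}")
    return tokens


print(contextual_lex("10FORI=1TO10"))

# Without parser context, the name pattern would swallow the keyword:
print(re.match(TERMINALS["CNAME"], "FORI=1TO10").group())
```

The last line shows the problem the context avoids: matched on its own, the name pattern consumes FORI as a single identifier.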

1 reply

maurymarkowitz Jul 11, 2025
Author

@erezsh

> Yes, Lark's contextual lexer actually supports this innately.

Well, this is very good news indeed. I'm sure I already saw that in the docs, but I'm not so familiar with the lingo, so I didn't realize what it meant.

The solutions I tried in flex were putting state rules on every single keyword, listing every single keyword in the identifier pattern (in an array, for instance), or writing my own scanner. Every option was messy, and it sort of upset the entire concept of the project in terms of being easy to read and maintain.

So I'm going to give this a whirl. Thanks! And I'm definitely not going to miss having to free()!
