Use a token as separator between rules #1549

rezemika · 2025-08-10T16:00:30Z


Hello and thank you for developing Lark!
This feels like a simple problem, but I can't find a (clean) solution for it. I'm looking for a way to require a separator between rules, one that must be present only when the rules on both sides of it are present. Let's use this basic grammar as an example.

FRUIT: "apple" | "banana" | "strawberry" | "raspberry"
ANIMAL: "cat" | "dog" | "elephant" | "mouse" | "horse"
VEHICLE: "car" | "bicycle" | "plane" | "boat"

SEPARATOR: ","

start: fruits? animals? vehicles?

fruits: FRUIT (SEPARATOR FRUIT)*
animals: ANIMAL (SEPARATOR ANIMAL)*
vehicles: VEHICLE (SEPARATOR VEHICLE)*

%ignore " "

So the following strings will be parsed successfully: "apple,banana cat,dog plane", "raspberry boat", "mouse plane", etc.
However, it will also parse "appleplane" or "banana mouseplane", as all spaces are ignored.
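To illustrate why ignoring spaces makes these inputs indistinguishable, here is a minimal sketch of a lexer that drops spaces. It is a hypothetical stand-in written with Python's `re` module, not Lark's actual implementation; `TOKEN_RE` and `tokenize` are names made up for this example.

```python
import re

# Hypothetical stand-in for a lexer that ignores spaces (not Lark's real code).
TOKEN_RE = re.compile(
    r"(?P<FRUIT>apple|banana|strawberry|raspberry)"
    r"|(?P<ANIMAL>cat|dog|elephant|mouse|horse)"
    r"|(?P<VEHICLE>car|bicycle|plane|boat)"
    r"|(?P<SEPARATOR>,)"
    r"|(?P<IGNORED> )"
)


def tokenize(text):
    """Return (type, value) pairs, discarding ignored spaces."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        if m.lastgroup != "IGNORED":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# Both inputs yield the same token stream, so a parser working on the
# tokens alone cannot tell them apart.
print(tokenize("apple plane"))
print(tokenize("appleplane"))
```

Once the space has been discarded, nothing in the token stream records whether it was there.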

I could do something like start: (fruits " ")? (animals " ")? vehicles?, but that would require a trailing space whenever there are only fruits and/or animals (and no vehicles), as in "apple mouse " (note the space at the end).

What surprised me is that I tried the Python grammar with the online IDE and it seems to ignore spaces the same way. For instance, whileTrue:pass produces the same valid AST as while True: pass, even though it raises a SyntaxError in a REPL. (And just to be clear: that is really not a criticism or a complaint, it's just to explain my thought process.)

Of course I could solve this really simple example with something like this.

start: fruits | animals | vehicles
       | fruits " " animals " " vehicles
       | (fruits " " animals) | (fruits " " vehicles)
       | animals " " vehicles

However, it would grow exponentially with the number of rules in start (and it makes the grammar quite difficult to read and more error-prone if there are many rules). So is there any way to specify a separator between EBNF rules that I don't know of, or is it an open question/problem?
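To make the growth concrete, here is a small sketch that enumerates the explicit alternatives for a given list of optional rules. The helper name `explicit_alternatives` is made up for this illustration; it only builds strings and does not use Lark.

```python
from itertools import combinations


def explicit_alternatives(rules, sep='" "'):
    """List every non-empty, order-preserving subset of `rules`,
    joined by the separator, as one grammar alternative each."""
    alts = []
    for size in range(1, len(rules) + 1):
        for combo in combinations(rules, size):
            alts.append(f" {sep} ".join(combo))
    return alts


three = explicit_alternatives(["fruits", "animals", "vehicles"])
print(len(three))  # 7, i.e. 2**3 - 1
for alt in three:
    print("|", alt)

five = explicit_alternatives(["a", "b", "c", "d", "e"])
print(len(five))  # 31, i.e. 2**5 - 1
```

With n optional rules this gives 2**n - 1 alternatives, which matches the seven alternatives written out by hand above for three rules.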

erezsh · 2025-08-10T16:42:26Z

erezsh
Aug 10, 2025
Maintainer

This is the intended behavior, but I understand it isn't always convenient.

You could try to require a space or comma right after the token by adding a lookahead to the terminal, in one of these two ways (writing from memory):

FRUIT1: ("apple" | "banana" | "strawberry" | "raspberry") /(?=[,\s])/

FRUIT2: ("apple" | "banana" | "strawberry" | "raspberry") /(?!\w)/
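The difference between the two lookaheads can be checked with Python's `re` module alone. This is only a sketch of the regex behavior, under the assumption that the terminals above compile to an equivalent Python regular expression; the names `WORDS`, `FRUIT1` and `FRUIT2` below are plain Python variables, not Lark terminals.

```python
import re

WORDS = r"(?:apple|banana|strawberry|raspberry)"

# Variant 1: the word must be followed by a comma or whitespace.
FRUIT1 = re.compile(WORDS + r"(?=[,\s])")
# Variant 2: the word must NOT be followed by another word character.
FRUIT2 = re.compile(WORDS + r"(?!\w)")

for text in ["apple,banana", "apple plane", "appleplane", "apple"]:
    print(
        repr(text),
        "FRUIT1:", FRUIT1.match(text) is not None,
        "FRUIT2:", FRUIT2.match(text) is not None,
    )
```

Both variants reject "appleplane". They differ at the end of the input: the positive lookahead in the first variant needs an actual comma or whitespace character, so a fruit that is the last thing in the string does not match, while the negative lookahead in the second variant accepts it.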

Regarding the example you gave, you can rewrite it in a more efficient way:

SEP: "," | " "
fruits_sep: fruits SEP
animals_sep: animals SEP

start: fruits_sep? (animals_sep? vehicles | animals)
        | fruits

You can also move the optional operator into the rule itself, since Lark supports empty rules; that will help with the exponential growth of the rules.

If none of these work well enough, we can discuss modifying the parser.

1 reply

rezemika Aug 13, 2025
Author

Oh, sorry for the late answer.
That's quite elegant, I'll try this! I don't strictly need this; I was mostly wondering if there was an idiomatic way to do it in an EBNF grammar. It would be great to have, but I don't know how useful it would be for common use cases (mostly for DSLs, I suppose), and I guess it could have some side effects too.
Thank you! :)
