Bringing Structure and Faithfulness to LLM Outputs

Exploring Grammar-Constrained Decoding (GCD) for LLMs. Comparing EGG, GAD, and FEG methods for improving efficiency, faithfulness, and flexibility in generating structured output.


Large Language Models (LLMs) are incredibly powerful, capable of generating human-like text, translating languages, and writing different kinds of creative content. However, getting them to reliably produce outputs that follow strict structural rules, such as valid JSON, well-formed code, or any format described by a formal specification like a context-free grammar or EBNF, can be challenging.

How can we guide these powerful models to generate not just any text, but text that conforms precisely to our requirements?

Grammar-Constrained Decoding (GCD)

The classic approach to this problem is Grammar-Constrained Decoding (GCD). Think of it as putting guardrails on the LLM's output.

Grammar-Constrained Decoding (GCD) Process

Imagine you want an LLM to generate simple mathematical expressions. At each step, as the LLM considers the next token (e.g., a number, an operator, a parenthesis), GCD checks whether adding that token would violate the rules of a valid mathematical expression, as defined by a formal grammar. If a token would lead down an invalid path (like putting two operators side by side), GCD masks that token, essentially telling the LLM, "Don't pick this one." In practice, this means hard-pruning the logits of invalid tokens at each generation step, which ensures the final output always follows the grammar rules.
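To make the masking mechanism concrete, here is a minimal sketch in Python with PyTorch. It is purely illustrative: mask_invalid_logits and the toy six-token vocabulary are invented for this example, and a real GCD implementation would obtain the allowed token set from a grammar checker.

```python
import torch

def mask_invalid_logits(logits: torch.Tensor, allowed_token_ids: set[int]) -> torch.Tensor:
    # Forbid everything by default, then re-allow the grammar-valid tokens.
    masked = torch.full_like(logits, float("-inf"))
    idx = torch.tensor(sorted(allowed_token_ids))
    masked[idx] = logits[idx]
    return masked

# Toy example: a 6-token vocabulary where the grammar only allows tokens 0, 2, and 5 next.
logits = torch.tensor([1.2, 0.3, 2.1, -0.5, 0.9, 1.7])
probs = torch.softmax(mask_invalid_logits(logits, {0, 2, 5}), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)  # can only ever pick a valid token
```

Because forbidden logits are set to negative infinity, they receive zero probability after the softmax, so sampling can never produce a token the grammar rules out.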

While effective at guaranteeing structure, classic GCD can be slow. It often needs to check every possible next token against the grammar rules at every single step, which becomes computationally expensive, especially with the massive vocabularies modern LLMs use. Another key challenge is the token misalignment problem: an LLM subword token such as "1+" may span several grammar terminals at once, or cover only part of one, so LLM tokens and grammar symbols do not line up one-to-one.

Making it Faster: Efficient Guided Generation (EGG)

This is where Efficient Guided Generation (EGG) comes in. EGG tackles the efficiency bottleneck of GCD.

FSM Masking Process of EGG

Instead of repeatedly checking tokens against the grammar on the fly, EGG uses a clever pre-processing step. It converts the grammar rules into a Finite-State Machine (FSM), a map of all possible valid states and transitions. Then, it builds an index that maps each state in the FSM to the specific vocabulary tokens that are allowed in that state.

During generation, instead of checking the whole vocabulary, the LLM just looks up the current FSM state in the index and instantly knows the valid next tokens. This drastically speeds up the process, reducing the cost per token from potentially checking the entire vocabulary (O(N)) down to a near-constant time lookup (O(1) on average). EGG makes enforcing complex grammars practical and fast, though its simplest forms might still struggle with the token misalignment nuances without significant online computation.
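As a toy illustration of the indexing idea, the sketch below builds an index for a tiny character-level FSM that accepts one or more digits (the regex [0-9]+) over an invented six-token vocabulary. The real EGG/Outlines implementation works over the tokenizer's full vocabulary and FSMs compiled from arbitrary regular expressions; every name here is made up for the example.

```python
# Invented toy setup: an FSM for the regex [0-9]+ with states
# 0 = "no digit seen yet" and 1 = "at least one digit seen" (accepting).
EOS = "<eos>"
vocab = {0: "1", 1: "42", 2: "+", 3: "7a", 4: "abc", 5: EOS}

def step_fsm(state: int, token_str: str) -> int | None:
    """Walk the FSM over a token's characters; None means the token is invalid here."""
    if token_str == EOS:
        return state if state == 1 else None  # ending is only legal in the accepting state
    for ch in token_str:
        if not ch.isdigit():
            return None
        state = 1  # any digit reaches / stays in the accepting state
    return state

# Offline preprocessing (done once per grammar, not per generation step):
# map each FSM state to the tokens that are valid there and the state they lead to.
index = {
    state: {tok_id: dest for tok_id, tok in vocab.items()
            if (dest := step_fsm(state, tok)) is not None}
    for state in (0, 1)
}

# Online decoding: a dictionary lookup replaces scanning the whole vocabulary.
state = 0
print(index[state])      # {0: 1, 1: 1} -> only "1" and "42" may be generated first
state = index[state][1]  # suppose the model picks token 1 ("42")
print(index[state])      # now "<eos>" (token 5) is also allowed
```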

Making it More Faithful: Grammar-Aligned Decoding (GAD)

While EGG primarily solves the speed problem, a different challenge remained: does simply masking invalid tokens distort the distribution of what the LLM generates? This is the question addressed by Grammar-Aligned Decoding (GAD).

ASAp Sampling Process of GAD

The GAD paper points out that methods like classic GCD (and by extension, efficient methods like EGG that use the same core masking principle) can inadvertently bias the LLM's output. By forcing the LLM away from certain paths, even if those paths initially looked promising according to the LLM's learned probabilities, we might end up with outputs that are grammatically correct but unlikely or unnatural according to the LLM's own distribution.

GAD aims for outputs that are not only grammatical but also statistically faithful to the LLM's predictions, conditioned on the grammar. It introduces the concept of Expected Future Grammaticality (EFG) - essentially, the probability that if we let the LLM continue generating from the current point, the resulting full sequence will be grammatically valid.
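In symbols (our notation, paraphrasing the paper's definition rather than quoting it), the EFG of a prefix w under the LLM's distribution P, with L the set of grammatical strings, can be written as:

```latex
\mathrm{EFG}(w) \;=\; \Pr_{y \sim P(\cdot \mid w)}\!\left[\, w \cdot y \in L \,\right]
\;=\; \sum_{y \,:\, w \cdot y \,\in\, L} P(y \mid w)
```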

Classic GCD uses a very simple, binary approximation of EFG: a continuation is treated as either possible (probability 1) or impossible (probability 0). GAD proposes a more sophisticated algorithm called Adaptive Sampling with Approximate Expected Futures (ASAp). ASAp works iteratively:

  1. It samples an output using the current best guess for EFG (initially, like GCD).
  2. It observes where the generation path hit dead ends (non-grammatical sequences).
  3. It uses the probability mass the LLM assigned to those dead ends to refine its estimate of the EFG for the prefixes along that path.
  4. It uses this improved EFG estimate for the next sampling iteration.

Over multiple samples, ASAp's EFG approximation gets closer to the true value, allowing it to generate outputs that better reflect the LLM's underlying probability distribution while still satisfying the grammar.
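The sketch below is a deliberately simplified rendering of this loop in Python, following the four steps above rather than the GAD paper's exact bookkeeping. lm_probs (the LLM's next-token distribution), can_complete (a binary GCD-style check), and the efg dictionary of estimates are all assumed placeholders.

```python
import random

def asap_sample(lm_probs, can_complete, eos, efg, max_len=64):
    """Draw one grammatical sample, refining the shared `efg` estimates as it goes."""
    prefix = ()
    while len(prefix) < max_len:
        probs = lm_probs(prefix)            # LLM's next-token distribution, {token: prob}
        weights = {}
        for tok, p in probs.items():
            child = prefix + (tok,)
            if not can_complete(child):     # GCD-style hard mask on dead ends
                continue
            # Re-weight each valid token by the current EFG estimate of the
            # prefix it creates (1.0 = the optimistic GCD approximation).
            weights[tok] = p * efg.get(child, 1.0)
        total = sum(weights.values())
        if total == 0:                      # every surviving estimate collapsed to zero
            break
        # Step 3: mass the LLM put on pruned or low-EFG continuations shrinks this
        # prefix's own EFG estimate, which is what the *next* iteration will use.
        efg[prefix] = total
        tok = random.choices(list(weights), [w / total for w in weights.values()])[0]
        prefix = prefix + (tok,)
        if tok == eos:
            break
    return prefix

# Usage: keep the same `efg` dict across calls so estimates sharpen over samples.
# efg = {}
# samples = [asap_sample(lm_probs, can_complete, "<eos>", efg) for _ in range(20)]
```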

Balancing Offline and Online Efficiency: Flexible and Efficient GCD (FEG)

The quest for efficiency didn't stop with EGG. Some efficient methods require very long offline preprocessing times to build their complex data structures, making them cumbersome for situations where grammars change frequently. This led to the development of approaches like Flexible and Efficient Grammar-Constrained Decoding (FEG).

Token-Level Lexing Transducer of the FEG Process

FEG aims for the best of both worlds: low online masking overhead (like the fastest methods) combined with significantly faster offline preprocessing. It tackles the token misalignment problem directly:

  • It uses Finite State Transducers (FSTs) to model the relationship between LLM tokens and sequences of grammar terminals, considering the current lexer state.
  • It precomputes a "token spanner table" that efficiently stores this mapping.
  • It also preprocesses the parser to distinguish between checks that require analyzing the parser's stack ("context-dependent") and those that don't ("always-accepted"), minimizing online computation.

This allows FEG to achieve state-of-the-art online speed while drastically reducing the offline preparation time compared to similarly fast online methods.
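The toy sketch below illustrates the spanner-table idea on a tiny, invented expression lexer. The states, tokens, and the ALWAYS_ACCEPTED split are all made up for illustration; FEG's actual tables are derived automatically from the tokenizer, the lexing transducer, and the grammar.

```python
from typing import Callable, NamedTuple

class SpanEntry(NamedTuple):
    terminals: tuple[str, ...]   # grammar terminals fully recognized by this LLM token
    next_lexer_state: str        # lexer state after consuming the token

# Invented spanner table for a tiny expression lexer. LLM subword tokens can
# straddle terminal boundaries: "1+" spans the terminals NUM and PLUS in one
# step, while "1" on its own leaves the lexer mid-number.
SPANNER_TABLE: dict[tuple[str, str], SpanEntry] = {
    ("START",  "1"):  SpanEntry((),              "IN_NUM"),
    ("START",  "1+"): SpanEntry(("NUM", "PLUS"), "START"),
    ("IN_NUM", "2"):  SpanEntry((),              "IN_NUM"),
    ("IN_NUM", "+"):  SpanEntry(("NUM", "PLUS"), "START"),
}

# Terminals whose acceptance never depends on the parser's stack can be approved
# without consulting the parser ("always-accepted"); the rest need an online,
# context-dependent check.
ALWAYS_ACCEPTED = {"NUM"}

def token_allowed(lexer_state: str, llm_token: str,
                  parser_accepts: Callable[[str], bool]) -> bool:
    entry = SPANNER_TABLE.get((lexer_state, llm_token))
    if entry is None:
        return False  # the token cannot even be lexed from this state
    return all(t in ALWAYS_ACCEPTED or parser_accepts(t) for t in entry.terminals)

# Example: from the start state, "1+" is allowed if the parser accepts PLUS here.
print(token_allowed("START", "1+", parser_accepts=lambda t: t == "PLUS"))  # True
```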

Comparing the Approaches

| Feature | Classic GCD | Efficient Guided Generation (EGG) | Grammar-Aligned Decoding (GAD/ASAp) | Flexible & Efficient GCD (FEG) |
| --- | --- | --- | --- | --- |
| Primary Goal | Grammatical correctness | Online efficiency + correctness | Faithfulness + correctness | Online & offline efficiency + flexibility |
| Mechanism | Token masking | FSM index + masking | Adaptive EFG approximation + re-weighting | FSTs, token spanner table, parser preprocessing + masking |
| Extension over Classic | Baseline | Speed via efficient lookups | Accuracy via distribution alignment | Balances offline/online costs, handles token misalignment |
| Probability Influence | Masks invalid tokens (potential bias) | Masks invalid tokens (potential bias) | Re-weights valid tokens based on approx. EFG | Masks invalid tokens (potential bias) |
| Key Improvement | Guarantees structure | Drastically improved online speed | Improved faithfulness to LLM distribution | Fast online speed with faster offline setup |
| Trade-offs | Slow; struggles with misalignment | High online cost (per the FEG paper); basic forms may not handle misalignment well | Requires multiple samples for convergence | Does not address faithfulness the way GAD does |

Conclusion

Choosing the right approach depends on your priorities:

  • If guaranteed grammatical output and the absolute fastest online generation are your main goals, and you can tolerate potentially high offline costs or are working with simpler grammars, approaches derived from EGG or methods like FEG are strong contenders.
  • If you need a balance of fast online speed and reasonable offline setup times, especially if your grammars change often (like in program synthesis), FEG offers a compelling advantage.
  • If you need outputs that not only follow the rules but also accurately reflect the LLM's learned likelihoods (perhaps for tasks requiring nuanced or diverse structured outputs), GAD (ASAp) provides a path towards greater faithfulness, albeit potentially at the cost of requiring more sampling iterations.

Understanding these different flavors of grammar-constrained decoding (classic GCD, EGG, GAD, and FEG) helps us better control and leverage the power of LLMs for generating precise, efficient, and faithful structured information.

References

Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning
Despite their impressive performance, large language models (LMs) still struggle with reliably generating complex output structures when not finetuned to follow the required output format exactly. To address this issue, grammar-constrained decoding (GCD) can be used to control the generation of LMs, guaranteeing that the output follows a given structure. Most existing GCD methods are, however, limited to specific tasks, such as parsing or code generation. In this work, we demonstrate that formal grammars can describe the output space for a much wider range of tasks and argue that GCD can serve as a unified framework for structured NLP tasks in general. For increased flexibility, we introduce input-dependent grammars, which allow the grammar to depend on the input and thus enable the generation of different output structures for different inputs. We then empirically demonstrate the power and flexibility of GCD-enhanced LMs on (1) information extraction, (2) entity disambiguation, and (3) constituency parsing. Our results indicate that grammar-constrained LMs substantially outperform unconstrained LMs or even beat task-specific finetuned models. Grammar constraints thus hold great promise for harnessing off-the-shelf LMs for a wide range of structured NLP tasks, especially where training data is scarce or finetuning is expensive. Code and data: https://github.com/epfl-dlab/GCD.
Efficient Guided Generation for Large Language Models
In this article we show how the problem of neural text generation can be constructively reformulated in terms of transitions between the states of a finite-state machine. This framework leads to an efficient approach to guiding text generation with regular expressions and context-free grammars by allowing the construction of an index over a language model’s vocabulary. The approach is model agnostic, allows one to enforce domain-specific knowledge and constraints, and enables the construction of reliable interfaces by guaranteeing the structure of the generated text. It adds little overhead to the token sequence generation process and significantly outperforms existing solutions. An implementation is provided in the open source Python library Outlines.
Grammar-Aligned Decoding
Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup. Constrained decoding approaches mitigate this problem by greedily restricting what tokens an LLM can output at each step to guarantee that the output matches a given constraint. Specifically, in grammar-constrained decoding (GCD), the LLM’s output must follow a given grammar. In this paper, we demonstrate that GCD techniques (and in general constrained decoding techniques) can distort the LLM’s distribution, leading to outputs that are grammatical but appear with likelihoods that are not proportional to the ones given by the LLM, and so ultimately are low-quality. We call the problem of aligning sampling with a grammar constraint, grammar-aligned decoding (GAD), and propose adaptive sampling with approximate expected futures (ASAp), a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM’s distribution conditioned on the given grammar constraint. Our algorithm uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes. Our evaluation on code generation and structured NLP tasks shows how ASAp often produces outputs with higher likelihood (according to the LLM’s distribution) than existing GCD techniques, while still enforcing the desired grammatical constraints.
Flexible and Efficient Grammar-Constrained Decoding
Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs matches such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can align with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
