
[Query] True explanation behind the Guidance project? #1040

Open
oflakne26 opened this issue Oct 4, 2024 · 1 comment

Comments

oflakne26 commented Oct 4, 2024

Hello,

I am a computational linguist working on an exciting project regarding NLP and machine translation.

Custom logits processors and, by extension, constrained decoding are of high interest to me, and I have found Guidance's strategies to be the most successful, after testing many of my own.

However, I cannot find a formal publication, such as a research paper, that introduces Guidance, at least not publicly. In fact, as far as I can tell, the project is nearly unknown to the mainstream media.

I would like to get to know the project better, especially how it was initially built. If you are associated with the developers and able to, could you please point me to some documentation on the methodology and where it derives from? I am not the best at reading source code and understanding it immediately.

Thank you! :)


Harsha-Nori (Collaborator) commented Oct 6, 2024

Hi @oflakne26,

Thank you so much for the kind words and your interest in guidance :). Sorry to make you search -- we haven't written up a formal publication yet, but plan to do so in the near future!

I can share some high level details of the project's history and how it works. Definitely feel free to ask more questions if there's anything I missed or if you'd like more detail.

The project started at Microsoft Research in ~October of 2022. A few of us were given early access to GPT-4, and started collaborating across the company to build the very first version of Copilot (back then, it was called Bing Chat or by its internal codename Sydney). We were all floored by the new capabilities of these models -- keep in mind that ChatGPT hadn't even been released yet -- and wrote about our research experiences in the Sparks of AGI paper: https://arxiv.org/abs/2303.12712

However, as amazed as we were by the potential, building real-world applications on top of these models was incredibly painful (even more so back then, when instruction tuning was less powerful). Integrating LLMs into traditional software stacks was hard for a number of reasons:

  1. Fundamentally, we couldn't guarantee that the models would output text in a way that downstream software could understand. This meant writing a ton of validation, healing, and retry logic, which is tedious for developers and a bad user experience too.

  2. On the flip side, we often wanted to add programmatic control flow to the way the models executed, i.e. conditional logic and loops on the outputs (e.g. if the model indicates it wants to do a web search, hook into the Bing Index).

  3. It was messy to manage prompt templates and changing interfaces to model APIs.

  4. LLM model inference was prohibitively expensive and had lots of room for optimization.

  5. Being unable to control models makes them more susceptible to jailbreaks or unintentional responsible AI harms.

To address these challenges, we started building Guidance as an internal tool just for our own development. As we began to rely on it more and more, and discover new optimization opportunities or new features, we decided to open source it and share it with the community in May of 2023.

The first version of guidance was a DSL (https://en.wikipedia.org/wiki/Domain-specific_language) that used a handlebars-like templating syntax, which looked something like this:

program = guidance("""
{{~#system~}}
You are a helpful assistant.
{{~function_def get_current_weather}}
{{~/system~}}

{{~#user~}}
Get the current weather in New York City.
{{~/user~}}

{{~#while True~}}
    {{~#assistant~}}
    {{gen 'answer' temperature=1.0 max_tokens=50 functions="auto"}}
    {{~/assistant~}}

    {{#if answer_function_call}}
        {{~#function~}}
        {{answer_function_call()}}
        {{~/function~}}
    {{else}}
        {{break}}
    {{/if}}
{{/while}}
""")

which was great in many ways (it works across languages, the programs can be serialized safely as simple strings, etc.). However, it also 1) had a steep learning curve for the Python community, 2) was painful for end users to debug, and 3) was a burden to keep developing and maintaining, since any time we wanted to add a new language feature (e.g. while loops), we had to update the entire language itself.

We made the decision over the summer to switch to a more Pythonic interface, which is the current syntax you see today. We'll probably write up a blog at some point about the design decisions that went into this interface change too!
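For flavor, here's a rough sketch of what the same kind of program looks like in the current Pythonic interface (a minimal illustration, not a polished snippet; the model name is just a placeholder for whatever backend you load):

from guidance import models, system, user, assistant, gen

# Load a chat-capable model (placeholder model name; swap in your own backend)
lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct")

with system():
    lm += "You are a helpful assistant."

with user():
    lm += "Get the current weather in New York City."

with assistant():
    lm += gen("answer", max_tokens=50, temperature=1.0)

print(lm["answer"])  # captured generations are accessible by name

Since everything is plain Python, control flow like the while loop in the old template is just an ordinary Python loop around these blocks.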

Ok, all that said, how does guidance work?

Fundamentally, guidance does three things today:

  1. Make it easy to develop programmatic workflows around interacting with LMs, including prompting.

  2. Enable users to control model outputs, e.g. to follow a regular expression or select from a list of items (see the sketch just after this list).

  3. Optimize interactions with language models at inference time, with features like guidance acceleration to speed up inference and token healing (https://towardsdatascience.com/the-art-of-prompt-design-prompt-boundaries-and-token-healing-3b2448b0be38) to get higher quality model outputs.
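As a concrete sketch of point 2, constraints in the current API look roughly like this (a minimal illustration; the model name is again a placeholder):

from guidance import models, gen, select

lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct")  # placeholder model

# Constrain the model to pick one item from a fixed list
lm += "The best direction to turn is: " + select(["left", "right"], name="direction")

# Constrain the model's output to match a regular expression
lm += "\nOn a scale of 1-10, my confidence is: " + gen(name="score", regex=r"\d+")

print(lm["direction"], lm["score"])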

We'll write up a proper academic paper about our approach at some point, but at a high level, the Python programs users write get turned into a formal grammar (https://en.wikipedia.org/wiki/Formal_grammar) under the hood -- a formal way of specifying what is and isn't allowed to be generated by the model. Grammars are used in e.g. programming languages to determine what is and isn't a valid program. Similarly, while users of guidance feel like they're writing plain Python, we help them formally specify exactly what the model is and isn't allowed to generate.
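To make the correspondence concrete, here's a toy illustration (the grammar in the comments is conceptual, not our actual internal representation):

from guidance import select

# The guidance expression...
rating = "Rating: " + select(["1", "2", "3"], name="stars") + "/3"

# ...corresponds conceptually to a formal grammar like
#   S     -> "Rating: " DIGIT "/3"
#   DIGIT -> "1" | "2" | "3"
# so the only strings the model can produce are
# "Rating: 1/3", "Rating: 2/3", and "Rating: 3/3".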

(I'm sure you know all of this already @oflakne26, but I wrote it out for the benefit of future readers.)

Once we have a grammar, we use a modified Earley parser (https://en.wikipedia.org/wiki/Earley_parser) under the hood to build token masks. To generate text, language models do next-token prediction -- determining which token is most likely to follow the text they have already seen. To do this, they calculate a probability distribution over all possible tokens before sampling one, and then autoregressively continue the process. By "token masking", I mean that our parser manipulates this probability distribution, suppressing any tokens that aren't allowed by the grammar so that the model can only sample tokens that are valid.
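To illustrate just the masking step (this is the core idea sketched in PyTorch, not our actual implementation):

import torch

def mask_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    # logits: the model's raw scores over the vocabulary for the next token.
    # Tokens the grammar disallows get -inf, so softmax assigns them zero
    # probability and sampling can only pick grammar-valid tokens.
    masked = torch.full_like(logits, float("-inf"))
    masked[allowed_token_ids] = logits[allowed_token_ids]
    return masked

# e.g. a 5-token vocabulary where the grammar only allows tokens 1 and 3:
logits = torch.tensor([2.0, 0.5, -1.0, 1.5, 0.0])
probs = torch.softmax(mask_logits(logits, [1, 3]), dim=-1)  # nonzero only at 1 and 3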

We chose to use an Earley parser because it's among the most flexible parsing algorithms out there (which means users can write much more expressive constraints). This differentiates us from most other constrained decoding libraries, which use simpler parsers that can only handle e.g. regular grammars. Earley parsers come with tradeoffs -- they can be quite slow if implemented naively -- so we've taken great care to write a very high performance implementation (for details, check out https://github.com/microsoft/llguidance).
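To see why that extra expressiveness matters: balanced parentheses are the classic example of a context-free language that no regular expression can capture, because matching requires unbounded counting -- yet an Earley parser accepts the grammar S -> "(" S ")" S | "" directly. Here's a tiny reference checker just to pin down the language (this is not how the parser itself works):

def is_balanced(s: str) -> bool:
    # Accepts exactly the language  S -> "(" S ")" S | ""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closer with no matching opener
                return False
        else:
            return False  # only parentheses belong to this language
    return depth == 0

assert is_balanced("(()())") and not is_balanced("(()")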

I talk about these concepts with a lot of heavy visuals in my //build talk from earlier this year, which might be worth a watch if you have the time: https://www.youtube.com/watch?v=qXMNPVVlCMs


This post is getting quite long, but hope it gives you some insight! Happy to answer any further questions you have, and thanks again for your excitement and interest :)
