Interactivity Overhaul (User Interface & Model Instrumentation & Network Comms) #1054

Open
wants to merge 216 commits into main

Conversation

@nopdive (Collaborator) commented Oct 22, 2024

Interactivity Overhaul

What you want, when you want.

-- some guidance developer (circa 2024)

Screenshare of updated UI in Jupyter notebook

Overview

This PR is the first of many focusing on interactivity. It introduces an updated user interface for notebooks, new instrumentation for models, and a network layer that handles bidirectional communication between the IPython kernel and the JavaScript client. To support this, model rendering has been reworked and tracing logic added to better support replays where required.

This PR also functions as a foundational step toward near-future work, including rendering across various environments (i.e. terminal support as a TUI and append-only outputs), upgraded benchmarking, and model inspection.

TL;DR

We added a lot of code to support better model metrics and visualization. We are getting ready for multimedia streaming, and we want users to be able to deeply inspect all their models without overheating the computer.

Acknowledgements

Big shoutouts to:

  • Loc (co-developed this PR): model instrumentation & metrics.
  • Jingya: consult & sketches on enhanced UI design.
  • Harsha: overall feedback & collab on prototypes.

Running this PR

  • cd packages/python/stitch && pip install -e .
  • Go run a notebook.

User Interface

Design principle: All visibility. No magic.

Overall, we're trying to show as much as we can about model outputs. When debugging outputs, there can be real ugliness that is often hidden away, including tokenization concerns and critical points that may dictate the rest of the output. The need for inspection grows as users define their own structured decoding grammars: unexpected over-constraining can occur during development.

The old user interface, which displayed HTML as a side effect in notebooks when models computed, has been replaced with a custom Jupyter widget (see Network Communications for more detail) that hosts an interactive sandboxed iframe. We still support a legacy mode if users prefer the previous UI.

Before
[image]

After
[image]

We're surfacing more information in the output at the expense of text density. There is simply more going on, and to keep some legibility we've increased text size and spacing, compensating for the two visual elements (highlighting and underlines) that are used to convey token info for scanning. A general metrics bar is also displayed for discoverability of token reduction and other efficiency metrics relevant when prompt engineering for reduced costs.

When users want further detail on a token, a tooltip shows the top 5 alternate token candidates alongside exact values for the visual elements. Highlighting is applied to candidates, accentuating tokens that include spaces.

We use a monospace typeface so that data-format outputs can be inspected more quickly (i.e. verticality can matter for balancing braces and indentation).

As users learn a system, a UI with easier discoverability can come at the cost of productivity. We've made all visual components optional to keep our power users in the flow, and in the future we intend to let users define defaults to fully support this.

For legacy mode (modeled after the previous UI), users can execute guidance.legacy_mode(True) at the start of their notebook.
[image]
Old school cool.
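In code, that's simply:

```python
import guidance

# Opt back into the previous HTML-based display for this notebook session.
guidance.legacy_mode(True)
```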

The Code

  • Added

    • guidance.visual module. Handles renderer creation (stitch or HTML display) and all required messaging. It also handles Jupyter cell-change detection to decide when widgets need to be instantiated or reset.
    • guidance.trace module. Tracks model inputs & outputs of an engine. Important for replaying to clients.
    • graphpaper-inline NPM package. Handles all client-side rendering and messaging. Written with Svelte/TypeScript/Tailwind/D3.
  • Changed

    • Rendering logic has been stripped from the Model class and delegated to a Renderer member where possible.
    • Relevant state logic has been augmented for inputs & outputs and is stored within the engine for tracing across models.
    • Role processing across guidance has been thinned. The Model class now generates role opener and closer text directly from its respective chat template (see the sketch below).
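To illustrate the last point, role text derived from a ChatML-style chat template might look like the following. This is a sketch; the class and method names here are illustrative rather than the exact guidance API.

```python
# Sketch: role opener/closer text coming straight from a chat template.
class ChatMLTemplate:
    def get_role_start(self, role_name: str) -> str:
        return f"<|im_start|>{role_name}\n"

    def get_role_end(self, role_name: str) -> str:
        return "<|im_end|>\n"

# The model emits get_role_start("user") before user content and
# get_role_end("user") after it, rather than relying on separate
# role-processing logic spread across the library.
```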

Instrumentation

Instrumentation is key for model inspection, debugging, and cost-sensitive prompt engineering; it also backs the new UI. Metrics are now collected for both general compute resources (CPU/GPU/RAM) and model tokens (including token counts/reduction, latency, type, and backtracking).
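As a rough illustration of the separate-process collection described under The Code below, here is a minimal sketch assuming psutil for CPU/RAM sampling. Names are illustrative, not the PR's actual API.

```python
import multiprocessing as mp
import time

import psutil  # assumed dependency for CPU/RAM sampling


def _monitor_loop(queue, stop, interval: float = 0.1) -> None:
    # Runs in a child process so sampling doesn't compete with the engine.
    while not stop.is_set():
        queue.put({
            "ts": time.time(),
            "cpu_percent": psutil.cpu_percent(interval=None),
            "ram_bytes": psutil.virtual_memory().used,
        })
        time.sleep(interval)


if __name__ == "__main__":
    queue = mp.Queue()
    stop = mp.Event()
    proc = mp.Process(target=_monitor_loop, args=(queue, stop), daemon=True)
    proc.start()
    time.sleep(0.5)  # ... model/engine work would happen here ...
    stop.set()
    proc.join()
    while not queue.empty():
        print(queue.get())
```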

The Code

  • Added (metric collection feature)

    • Added a Monitor class in _model.py to collect common metrics (CPU, RAM, GPU utilization, etc.)
      • Monitor runs in a separate process to avoid competing for resources with the model/engine process
    • Model now keeps stats on current input/output/backtrack tokens
    • At the end of a notebook cell's execution, we collect the probability of each token in the final model state, along with associated per-token stats such as
      • Latency
      • Whether the token was generated, force-forwarded, or from user input
  • Changed:

    • Replaced get_next_token with get_next_token_with_top_k to keep track of the issued token along with its associated top-k tokens (both constrained and unconstrained). The data is stored in the EngineOutput class.
    • Model now has a VisBytesChunk object to track which parts of a chunk come from user input, were generated by the engine, or were force-forwarded by the parser.
      VisBytesChunk also stores the list of EngineOutput objects generated by the engine during chunk generation.
      This facilitates checking whether tokens in the final state were generated, force-forwarded, or came from user input.
    • Added a get_per_token_topk_probs function to the Engine class to calculate the probability of each token in a token list.
      This function is used at the end of cell execution to calculate the probabilities of the model state in unconstrained mode (see the sketch after this list).
    • Added a get_per_token_stats function to the Model class to report stats for each token in the model state in unconstrained mode.
      Stats include the issued token, probability, latency, top-k, and masked top-k if available.
      Data from get_per_token_stats is reported to the UI for the new visualization.
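For intuition, the per-token top-k computation amounts to something like this PyTorch sketch. It is illustrative only, not the PR's implementation, which also has to handle multiple backends and the masked/constrained variants.

```python
import torch
import torch.nn.functional as F


def per_token_topk(logits: torch.Tensor, token_ids: torch.Tensor, k: int = 5):
    """Probability of each issued token plus its top-k alternates.

    logits: (seq_len, vocab_size) unconstrained logits per position.
    token_ids: (seq_len,) tokens actually issued at each position.
    """
    probs = F.softmax(logits, dim=-1)                    # (seq_len, vocab)
    issued = probs.gather(1, token_ids.unsqueeze(1))     # prob of issued token
    top_probs, top_ids = probs.topk(k, dim=-1)           # top-k alternates
    return issued.squeeze(1), top_probs, top_ids
```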

Network Communications

We have two emerging requirements that will impact future guidance development. One, the emergence of streaming multimedia around language models (audio/video). Two, user interactivity within the UI, requesting more data or computation that may not be feasible to pre-fetch or pre-calculate for a static client.

For user interactivity from the UI back to Python, it's also important that we cover as many notebook environments as possible. Each cloud notebook provider has its own quirks, which complicates client development. Some providers love resizing cell outputs indefinitely; others refuse to display HTML unless it's secured away in an isolated iframe.

All in all, we need a solution that is isolated, reasonably available across providers, and able to carry streams of messages between the server (Jupyter Python kernel) and the client (cell output with a touch of JS).

Stitch

It's 3:15AM, bi-directional comms was a mistake.

-- some guidance developer, minutes prior to passing out (circa 2024)

stitch is an auxiliary package we've created that handles bi-directional communication between a web client and a Jupyter Python kernel. It does this via a thin custom Jupyter widget that relays messages between the kernel and a sandboxed iframe hosting the web client. It looks something like this:

python code -> kernel-side jupyter widget -> kernel comms (ZMQ) -> client-side jupyter widget -> window event message -> sandboxed iframe -> web client (graphpaper-inline)

This package drives messages between the guidance.visual module and the graphpaper-inline client. All messages are streamed to allow near-real-time rendering within a notebook. Bi-directional comms are used to repair the display if initial messages have been missed (the client requests a full replay when it notices that the first message it receives has a non-zero identifier).
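The replay-repair logic is roughly the following, sketched in Python for brevity (the real client lives in graphpaper-inline and is written in TypeScript; names here are illustrative):

```python
class ReplayAwareClient:
    """Sketch of the display-repair handshake described above."""

    def __init__(self, send_to_kernel):
        self._send_to_kernel = send_to_kernel
        self._first_seen = None

    def on_message(self, msg: dict) -> None:
        if self._first_seen is None:
            self._first_seen = msg["id"]
            if msg["id"] != 0:
                # We missed earlier messages (e.g. the widget loaded late):
                # ask the kernel to stream the full history again.
                self._send_to_kernel({"type": "replay_request"})
        self._render(msg)

    def _render(self, msg: dict) -> None:
        ...  # hand off to the UI layer
```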

The Code

  • Added
    • stitch Python package. Can be found at packages/python/stitch.

Future work

We wanted to shoot for the stars, and ended up in the ocean. The following will occur after this PR.

Near future tasks:

  • User defaults for UI
  • Terminal support (non-interactive & shell)
  • Restyle
  • Richer visualizations
  • Memory re-architecture (broader than this PR)
  • Interactive support for multimedia
  • Guidance quality-of-life (visual diff testing)

nopdive and others added 30 commits September 24, 2024 10:01
Visualization components do better with state handled as traces that can
rewind. As such, definition and evaluation of a guidance grammar are
separated here while minimizing changes needed at the grammar level.
Probably need separate fields for tracking input and output of a given node.
Trace can now handle capture groups. State module moved to trace module.
Documentation added and some type changes.
Trace nodes have light adjustments. HTML renderer is connected but not fully working yet due to role closers.
Old HTML display now fully replaced. Fixed some roles issues as well.
Uses stitch for kernel to client communication. Need to redesign and hook in instrumentation.
Tooling appears to create a nameless role. Fixed.
Kernel messages still need to be re-implemented.
This package is required for Jupyter kernel comms via
a custom ipywidget.
Copyright headers now correctly pointing to Guidance Contributors.
Trace messages are now JSON serializable. Some minor fixes like adding a manifest for package.
Client has a race condition where it skips messages fired by stitch before it loads.
Had to send a heartbeat first, then send all buffered messages.
Client messages can be handled in engine. Output for print and log not working due to being in an ipywidget. Will need to re-implement with asyncio later.
Separate thread for send/recv on messages.
Final message sent on cell completion. Still needs further testing.
No more dictionaries to recv_msg!
This includes for HTML renderer.
Queue instantiation now deferred to asyncio background thread.
@JC1DA (Contributor) commented Dec 10, 2024

@nopdive Got this exception when I share the same base_lm across multiple cells:

Exception: Parent missing for trace node: identifier=174 parent=0:None:None children=[] input=None output=None

Examples:
Cell-1

base_lm = guidance.models.Transformers(
    model,
    device_map="auto",
    trust_remote_code=True,
    chat_template=QWen2_ChatTemplate,
    # chat_template=LLAMA3_1_ChatTemplate,
    # attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    top_k=1
)

Cell-2

lm = (
    base_lm
    + """\
1 + 1 = add(1, 1) = 2
3 + 5 = add(3, 5) = 8
11 + 9 = """
)
lm = lm + guidance.gen(max_tokens=33, name="result")

Cell-3

@guidance
def add(lm, input1, input2):
    lm += f" = {int(input1) + int(input2)}"
    return lm

lm = (
    base_lm
    + """\
1 + 1 = add(1, 1) = 2
3 + 5 = add(3, 5) = 8
11 + 9"""
)
lm = lm + guidance.gen(max_tokens=30, tools=[add])

@nopdive (Collaborator, Author) commented Dec 10, 2024

Just noticed that wall time and RAM continue to be monitored long after a notebook cell finishes executing -- can anyone else repro?
Edit: this is in vscode's notebook env -- not sure about vanilla jupyter

Yeah, I saw it. But only the active cell receives updates, so if you run another cell, the previous one stops receiving new data. @nopdive Should we stop the periodic metrics generator whenever a cell completes execution and re-create it when we enter a new cell, or should we just use a flag to pause it, or keep it as is?

TL;DR: Visually it should stop (if this isn't happening, it's a bug); the backend should keep running for resource metrics as long as an engine that requires it is running.

There's the problem of overhead in starting/stopping the background monitoring process. From what I recall, there's also an initialization window during which we don't capture resource events (i.e. GPU utilization until the ~100-200ms mark).

Pragmatically: resource metrics should visually end on cell execution. Resource metrics should be consumed/reported for a given cell until its execution completes (a job can start/end here in a thread, i.e. the renderer layer does this for messaging via a background asyncio thread). A single resource-monitoring process/thread should produce events until the last engine that needs monitoring is unalive. This is controversial in that it's a single producer across the library as opposed to per engine, but it does ease performance costs.

Later on, we should revisit threads/process usage across guidance. This is a larger problem than the PR itself of course, similar to a memory review that should also occur in the near future.
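A rough sketch of that "single producer, torn down by the last engine" shape (entirely hypothetical names, refcounted by engines):

```python
import threading

class SharedResourceMonitor:
    """Hypothetical: one metrics producer shared by all engines."""

    _lock = threading.Lock()
    _instance = None
    _refs = 0

    @classmethod
    def acquire(cls) -> "SharedResourceMonitor":
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
                cls._instance._start()  # spawn the background producer once
            cls._refs += 1
            return cls._instance

    @classmethod
    def release(cls) -> None:
        with cls._lock:
            cls._refs -= 1
            if cls._refs == 0 and cls._instance is not None:
                cls._instance._stop()  # last engine gone: stop producing
                cls._instance = None

    def _start(self) -> None: ...  # start background process/thread
    def _stop(self) -> None: ...   # join and clean up
```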

@nopdive (Collaborator, Author) commented Dec 10, 2024

@nopdive Got this exception when I share the same base_lm across multiple cells:

Exception: Parent missing for trace node: identifier=174 parent=0:None:None children=[] input=None output=None

This might be a regression from all the changes since the initial PR; let's write a test for it and figure this one out.
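Something along these lines, perhaps (a sketch using the Mock model; exact assertions TBD):

```python
import guidance

def test_shared_base_lm_across_cells():
    base_lm = guidance.models.Mock()  # stand-in for a real engine

    # "Cell 2": extend the shared base model once.
    lm1 = base_lm + "1 + 1 = " + guidance.gen(max_tokens=5, name="a")

    # "Cell 3": extend the same base model again; previously this raised
    # "Exception: Parent missing for trace node: ...".
    lm2 = base_lm + "3 + 5 = " + guidance.gen(max_tokens=5, name="b")
```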

nopdive and others added 17 commits December 10, 2024 11:46
Fix, this was set to stitch earlier.
1) Fix missing _recv_queue and _send_queue in AutoRenderer 2) Add enable_monitoring flag to transformers and llamacpp engines 3) Fix incorrect token metrics data in monitor
Fix missing anytree lib in tests + missing TokensMessage in model_registry
1) Only collect token metrics if echo is True 2) Use byte-strings for invalid UTF-8
Fix missing enable_backtrack and enable_ff_tokens in parser creation
I've ignored one of the tests around block definitions with grammars. Should this continue to be a feature in the next release?
@codecov-commenter commented Dec 11, 2024

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 55.19399% with 716 lines in your changes missing coverage. Please review.

Project coverage is 61.93%. Comparing base (c78e0b4) to head (182f836).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| guidance/models/_model.py | 52.74% | 353 Missing ⚠️ |
| guidance/visual/_renderer.py | 29.33% | 171 Missing ⚠️ |
| guidance/models/llama_cpp/_llama_cpp.py | 7.24% | 64 Missing ⚠️ |
| guidance/visual/_trace.py | 49.49% | 50 Missing ⚠️ |
| guidance/models/transformers/_transformers.py | 20.00% | 28 Missing ⚠️ |
| guidance/models/_mock.py | 30.55% | 25 Missing ⚠️ |
| guidance/visual/_jupyter.py | 44.44% | 10 Missing ⚠️ |
| guidance/trace/_trace.py | 96.85% | 5 Missing ⚠️ |
| guidance/_parser.py | 87.87% | 4 Missing ⚠️ |
| guidance/_utils.py | 92.85% | 2 Missing ⚠️ |
| ... and 2 more | | |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1054      +/-   ##
==========================================
- Coverage   66.67%   61.93%   -4.74%     
==========================================
  Files          65       72       +7     
  Lines        5173     6552    +1379     
==========================================
+ Hits         3449     4058     +609     
- Misses       1724     2494     +770     


Will need a review overall on blocks later.
Comment on lines +28 to +29
# TODO(nopdive): Review this exception later -- how should we be going about grammars in blocks overall.
@pytest.mark.skip(reason="requires review")

I think we need to revisit the entire concept of blocks, and I am not fundamentally opposed to requiring block openers/closers to be strings. We can push this discussion to a later date -- definitely doesn't need to be resolved in this PR
