v0.4.1

OpenLLM version 0.4.0 introduces several new features and improvements.

  • Unified API and Continuous Batching support: 0.4.0 brings a simplified API for OpenLLM. Users can now run LLMs with two new APIs.

    • await llm.generate(prompt, stop, **kwargs): one-shot generation for any given prompt

      import openllm, asyncio
      
      llm = openllm.LLM("HuggingFaceH4/zephyr-7b-beta")
      
      async def infer(prompt, **kwargs):
        return await llm.generate(prompt, **kwargs)
      
      asyncio.run(infer("Time is a definition of"))
    • await llm.generate_iterator(prompt, stop, **kwargs): streaming generation that yields tokens as they become ready; a client sketch for consuming this stream appears after this list

      import bentoml, openllm
      
      llm = openllm.LLM("HuggingFaceH4/zephyr-7b-beta")
      
      svc = bentoml.Service(name='zephyr-instruct', runners=[llm.runner])
      
      @svc.api(input=bentoml.io.Text(), output=bentoml.io.Text(media_type='text/event-stream'))
      async def prompt(input_text: str) -> str:
        async for generation in llm.generate_iterator(input_text):
          yield f"data: {generation.outputs[0].text}\\n\\n"
    • In an async context, calls to both llm.generate and llm.generate_iterator now support continuous batching for optimal throughput (see the concurrency sketch after this list).

    • The backend is now inferred automatically based on whether vllm is installed in the environment. To specify the backend manually, use the backend argument.

      openllm.LLM("HuggingFaceH4/zephyr-7b-beta", backend='pt')
    • Quantization can also be passed directly to this new LLM API.

      openllm.LLM("TheBloke/Mistral-7B-Instruct-v0.1-AWQ", quantize='awq')
  • Mistral Model: OpenLLM now supports Mistral. To start a Mistral server, simply execute openllm start mistral.

  • AWQ and SqueezeLLM Quantization: AWQ and SqueezeLLM are now supported with the vLLM backend. Simply pass --quantize awq or --quantize squeezellm to openllm start to use AWQ or SqueezeLLM quantization.

    IMPORTANT: To use AWQ, the model weights must already be quantized with AWQ. Look on the Hugging Face Hub for an AWQ variant of the model you want to use. Currently, only AWQ with vLLM is fully tested and supported.

  • General bug fixes: fixed a bug related to tag generation. Standalone Bentos that use this new API should work as expected if the model already exists in the model store.

    • For consistency, make sure to run openllm prune -y --include-bentos
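
As a quick way to exercise the streaming service shown above, the text/event-stream endpoint can be consumed with any HTTP client. Below is a minimal sketch (not part of the release itself) using httpx; it assumes the service is served locally on BentoML's default port 3000 and that the API is exposed at /prompt.

  import httpx

  # Stream the response and print each SSE "data:" payload as it arrives.
  with httpx.stream("POST", "http://localhost:3000/prompt",
                    content="Time is a definition of") as response:
      for line in response.iter_lines():
          if line.startswith("data: "):
              print(line[len("data: "):], flush=True)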
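
Because calls in an async context are batched continuously, the simplest way to benefit from the new API is to issue several generate calls concurrently. Below is a minimal sketch of that pattern; the prompts are arbitrary placeholders, and it assumes generate returns the same GenerationOutput shape used in the streaming example above.

  import asyncio
  import openllm

  llm = openllm.LLM("HuggingFaceH4/zephyr-7b-beta")

  async def main():
      prompts = [
          "Time is a definition of",
          "The capital of France is",
          "Write a haiku about batching:",
      ]
      # Issuing the calls concurrently lets the backend batch them as they arrive.
      results = await asyncio.gather(*(llm.generate(p) for p in prompts))
      for prompt, result in zip(prompts, results):
          print(prompt, "->", result.outputs[0].text)

  asyncio.run(main())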

Installation

pip install openllm==0.4.1

To upgrade from a previous version, use the following command:

pip install --upgrade openllm==0.4.1

Usage

All available models: openllm models

To start an LLM: python -m openllm start opt

To run OpenLLM within a container environment (requires GPUs): docker run --gpus all -it -P ghcr.io/bentoml/openllm:0.4.1 start opt

To run OpenLLM Clojure UI (community-maintained): docker run -p 8420:80 ghcr.io/bentoml/openllm-ui-clojure:0.4.1

Find more information about this release in the CHANGELOG.md

What's Changed

  • chore(runner): yield the outputs directly by @aarnphm in #573
  • chore(openai): simplify client examples by @aarnphm in #574
  • fix(examples): correct dependencies in requirements.txt [skip ci] by @aarnphm in #575
  • refactor: cleanup typing to expose correct API by @aarnphm in #576
  • fix(stubs): update initialisation types by @aarnphm in #577
  • refactor(strategies): move logics into openllm-python by @aarnphm in #578
  • chore(service): cleanup API by @aarnphm in #579
  • infra: disable npm updates and correct python packages by @aarnphm in #580
  • chore(deps): bump aquasecurity/trivy-action from 0.13.1 to 0.14.0 by @dependabot in #583
  • chore(deps): bump taiki-e/install-action from 2.21.7 to 2.21.8 by @dependabot in #581
  • chore(deps): bump sigstore/cosign-installer from 3.1.2 to 3.2.0 by @dependabot in #582
  • fix: device imports using strategies by @aarnphm in #584
  • fix(gptq): update config fields by @aarnphm in #585
  • fix: unbound variable for completion client by @aarnphm in #587
  • fix(awq): correct awq detection for support by @aarnphm in #586
  • feat(vllm): squeezellm by @aarnphm in #588
  • docs: update quantization notes by @aarnphm in #589
  • fix(cli): append model-id instruction to build by @aarnphm in #590
  • container: update tracing dependencies by @aarnphm in #591

Full Changelog: v0.4.0...v0.4.1