Aphrodite Engine - v0.6.1
What's Changed
- ci: exclude cu118 from build and add py_limited_api by @AlpinDale in #639
- fix: better async request cancellation by @AlpinDale in #641
- fix: gracefully handle missing chat template by @AlpinDale in #642
- chore: deduplicate nvlink check to cuda platform by @AlpinDale in #643
- fix: hardcoded float16 in embedding mode check by @AlpinDale in #645
- quadratic sampling: separate diff from logits to filter out NaNs by @50h100a in #644
- fix: RSLoRA support by @AlpinDale in #647
- feat: introduce `BaseAphroditeParameter` by @AlpinDale in #646
- fix: move zeromq rpc frontend to IPC instead of TCP by @AlpinDale in #652
- fix: input processor in internvl2 by @AlpinDale in #653
- fix: multiprocessing timeout by @AlpinDale in #654
- fix: GPTQ/AWQ on Colab by @AlpinDale in #655
- fix: make `merge_async_iterators.is_cancelled()` optional by @AlpinDale in #656
- fix: flashinfer outputs by @AlpinDale in #657
- fix: max_num_batched_tokens should not be limited for lora by @AlpinDale in #658
- fix: lora with pipeline parallel by @AlpinDale in #659
- fix: kill api server when pinging dead engine by @AlpinDale in #660
- fix: `get_num_blocks_touched` logic by @AlpinDale in #661
- chore: update the env.py script and the bug report template by @AlpinDale in #662
- feat: add INT8 W8A16 quant for TPU by @AlpinDale in #663
- feat: allow serving encoder-decoder models in the API server by @AlpinDale in #664
- fix: deps with TPU dockerfile by @AlpinDale in #665
- optimization: reduce end-to-end overhead from python obj allocation by @AlpinDale in #666
- fix: minor adjustments to scheduler and block manager by @AlpinDale in #667
- feat: enable using fp8 kv and prefix caching with chunked prefill by @AlpinDale in #668
- fix: mlpspeculator with padded vocab by @AlpinDale in #669
- feat: option to apply temperature scaling last by @AlpinDale in #670 (see sketch below)
- chore: decouple `should_modify_greedy_probs_inplace` by @AlpinDale in #671
- chore: better stream termination in async engine by @AlpinDale in #672
- chore: mamba cache single buffer by @AlpinDale in #673
- feat: mamba model support by @AlpinDale in #674
- fix: reinit procedure in `ModelInputForGPUBuilder` by @AlpinDale in #675
- feat: embeddings support for batched OAI endpoint by @AlpinDale in #676 (see sketch below)
- fix: fp8 checkpoints with fused linear modules by @AlpinDale in #677
- feat: add numpy implementation of `compute_slot_mapping` by @AlpinDale in #678
- fix: chunked prefill with v2 block manager by @AlpinDale in #679
- fix: phi3v batch inference with different aspect ratio images by @AlpinDale in #680
- chore: use mark_dynamic to reduce TPU compile times by @AlpinDale in #681
- chore: bump lmfe to v0.10.6 and include triton for tpu and xpu dockerfiles by @AlpinDale in #682
- refactor: base worker input refactor for multi-step by @AlpinDale in #683
- build: add empty device by @AlpinDale in #684
- chore: update flashinfer to v0.1.3 by @AlpinDale in #685
- feat: allow image embeddings for VLM input by @AlpinDale in #686
- feat: add progress bar for loading individual weight modules by @AlpinDale in #640
- chore: use public ECR for neuron image by @AlpinDale in #687
- fix: logit softcapping in flash-attn by @AlpinDale in #688
- chore: use scalar type to dispatch to different `gptq_marlin` kernels by @AlpinDale in #689
- fix: allow passing float for GiB arguments by @AlpinDale in #690
- build: bump cmake to 3.26 by @AlpinDale in #691
- fix: shut down ray dag workers cleanly by @AlpinDale in #692
- feat: add lora loading/unloading api endpoint by @AlpinDale in #693 (see sketch below)
- feat: add load/unload endpoints for soft-prompts by @AlpinDale in #694
- fix: loading chameleon model with TP>1 by @AlpinDale in #695
- fix: consolidated `is_tpu()` and suppress tpu import warning by @AlpinDale in #696
- fix: manually install triton for other devices to prevent outlines errors by @AlpinDale in #697
- feat: support for Audio modality by @AlpinDale in #698
- chore: migrate gptq_marlin to AphroditeParameters by @AlpinDale in #699
- chore: update fused MoE weight loading by @AlpinDale in #700
- feat: add Solar model support by @AlpinDale in #701
- feat: migrate awq and awq_marlin to AphroditeParameter by @AlpinDale in #702
- chore: spawn engine process from api server process by @AlpinDale in #703
- chore: use the `compressed-tensors` library to avoid code reuse by @AlpinDale in #704
- feat: add aphrodite plugin system by @AlpinDale in #705
- Revert "chore: use the `compressed-tensors` library to avoid code reuse (#704)" by @AlpinDale in #706
- feat: add support for multi-host TPU by @AlpinDale in #707
- fix: import ray under a guard by @AlpinDale in #708
- fix: empty sampler output when temperature is too low by @AlpinDale in #709
- fix: disable embeddings API for chat models by @AlpinDale in #710
- feat: implement mistral tokenizer mode by @AlpinDale in #711
- feat: support profiling with multiple multi-modal inputs per prompt by @AlpinDale in #712
- chore: multi-step args and sequence modifications by @AlpinDale in #713
- chore: set per-rank XLA cache for TPU by @AlpinDale in #714
- chore: add support for up to 2048 block size by @AlpinDale in #715
- fix: install protobuf for cpu by @AlpinDale in #716
- fix: weight loading for scalars by @AlpinDale in #718
- chore: quant config for speculative draft models by @AlpinDale in #719
- feat: enable prompt logprobs in OpenAI API by @AlpinDale in #720 (see sketch below)
- chore: update grafana template by @AlpinDale in #721
- ci: bump aphrodite to 0.6.1 by @AlpinDale in #722
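
A few of the user-facing changes above lend themselves to short sketches. Everything below is illustrative only: endpoint paths, field names, model names, and the port are assumptions noted inline, not confirmed Aphrodite APIs.

First, the idea behind #670 (apply temperature scaling last). Conventionally, temperature divides the logits before truncation filters such as top-k run; with the option enabled, the filters see the raw logits and temperature only reshapes the surviving candidates. A minimal numpy sketch, with a toy top-k filter standing in for the real sampler stages:

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Illustrative stand-in for the sampler's truncation filters:
    mask everything except the k highest logits."""
    out = np.full_like(logits, -np.inf)
    top = np.argpartition(logits, -k)[-k:]
    out[top] = logits[top]
    return out

def sample_probs(logits: np.ndarray, temperature: float, k: int,
                 temperature_last: bool) -> np.ndarray:
    # Conventional order: temperature first, then filters.
    if not temperature_last:
        logits = logits / temperature
    logits = top_k_filter(logits, k)
    # Order added in #670: scale after filtering, so a high temperature
    # flattens the surviving candidates without letting low-probability
    # tokens back into the top-k set.
    if temperature_last:
        logits = logits / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```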
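Next, the batched embeddings support from #676. The request shape follows the standard OpenAI embeddings API, where `input` may be a list; the port and model name are placeholders:

```python
import requests

resp = requests.post(
    "http://localhost:2242/v1/embeddings",  # port 2242 assumed as the default
    json={
        "model": "my-embedding-model",  # hypothetical served model name
        "input": ["first sentence", "second sentence"],  # one request, many inputs
    },
)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
```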
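The runtime LoRA endpoints from #693 might be exercised as follows. The routes and payload keys assume the vLLM-style `/v1/load_lora_adapter` convention; check your server's route list if they differ:

```python
import requests

BASE = "http://localhost:2242"  # adjust to your deployment

# Load an adapter without restarting the server (path and keys are assumptions).
requests.post(f"{BASE}/v1/load_lora_adapter", json={
    "lora_name": "my-adapter",          # hypothetical adapter name
    "lora_path": "/models/my-adapter",  # hypothetical adapter location
}).raise_for_status()

# ...requests can now target "model": "my-adapter"...

# Unload it when finished to free the slot.
requests.post(f"{BASE}/v1/unload_lora_adapter",
              json={"lora_name": "my-adapter"}).raise_for_status()
```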
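Finally, prompt logprobs through the OpenAI-compatible API (#720). The `prompt_logprobs` request field is an extra, non-standard parameter; the name is assumed to mirror the engine's sampling parameter of the same name:

```python
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",
    json={
        "model": "my-model",                  # hypothetical served model name
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "prompt_logprobs": 1,  # assumed extra field: top-1 logprob per *prompt* token
    },
)
resp.raise_for_status()
# The response field name is likewise an assumption; inspect the payload.
print(resp.json()["choices"][0].get("prompt_logprobs"))
```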
Full Changelog: v0.6.0.post1...v0.6.1