Aphrodite Engine - v0.6.1
What's Changed
- ci: exclude cu118 from build and add py_limited_api by @AlpinDale in #639
- fix: better async request cancellation by @AlpinDale in #641
- fix: gracefully handle missing chat template by @AlpinDale in #642
- chore: deduplicate nvlink check to cuda platform by @AlpinDale in #643
- fix: hardcoded float16 in embedding mode check by @AlpinDale in #645
- quadratic sampling: separate diff from logits to filter out NaNs by @50h100a in #644
- fix: RSLoRA support by @AlpinDale in #647
- feat: introduce `BaseAphroditeParameter` by @AlpinDale in #646
- fix: move zeromq rpc frontend to IPC instead of TCP by @AlpinDale in #652
- fix: input processor in internvl2 by @AlpinDale in #653
- fix: multiprocessing timeout by @AlpinDale in #654
- fix: GPTQ/AWQ on Colab by @AlpinDale in #655
- fix: make `merge_async_iterators.is_cancelled()` optional by @AlpinDale in #656
- fix: flashinfer outputs by @AlpinDale in #657
- fix: max_num_batched_tokens should not be limited for lora by @AlpinDale in #658
- fix: lora with pipeline parallel by @AlpinDale in #659
- fix: kill api server when pinging dead engine by @AlpinDale in #660
- fix: `get_num_blocks_touched` logic by @AlpinDale in #661
- chore: update the env.py script and the bug report template by @AlpinDale in #662
- feat: add INT8 W8A16 quant for TPU by @AlpinDale in #663
- feat: allow serving encoder-decoder models in the API server by @AlpinDale in #664
- fix: deps with TPU dockerfile by @AlpinDale in #665
- optimization: reduce end-to-end overhead from python obj allocation by @AlpinDale in #666
- fix: minor adjustments to scheduler and block manager by @AlpinDale in #667
- feat: enable using fp8 kv and prefix caching with chunked prefill by @AlpinDale in #668
- fix: mlpspeculator with padded vocab by @AlpinDale in #669
- feat: option to apply temperature scaling last by @AlpinDale in #670 (see sketch below)
- chore: decouple `should_modify_greedy_probs_inplace` by @AlpinDale in #671
- chore: better stream termination in async engine by @AlpinDale in #672
- chore: mamba cache single buffer by @AlpinDale in #673
- feat: mamba model support by @AlpinDale in #674
- fix: reinit procedure in `ModelInputForGPUBuilder` by @AlpinDale in #675
- feat: embeddings support for batched OAI endpoint by @AlpinDale in #676 (see sketch below)
- fix: fp8 checkpoints with fused linear modules by @AlpinDale in #677
- feat: add numpy implementation of `compute_slot_mapping` by @AlpinDale in #678
- fix: chunked prefill with v2 block manager by @AlpinDale in #679
- fix: phi3v batch inference with different aspect ratio images by @AlpinDale in #680
- chore: use mark_dynamic to reduce TPU compile times by @AlpinDale in #681
- chore: bump lmfe to v0.10.6 and include triton for tpu and xpu dockerfiles by @AlpinDale in #682
- refactor: base worker input refactor for multi-step by @AlpinDale in #683
- build: add empty device by @AlpinDale in #684
- chore: update flashinfer to v0.1.3 by @AlpinDale in #685
- feat: allow image embeddings for VLM input by @AlpinDale in #686
- feat: add progress bar for loading individual weight modules by @AlpinDale in #640
- chore: use public ECR for neuron image by @AlpinDale in #687
- fix: logit softcapping in flash-attn by @AlpinDale in #688
- chore: use scalar type to dispatch to different `gptq_marlin` kernels by @AlpinDale in #689
- fix: allow passing float for GiB arguments by @AlpinDale in #690
- build: bump cmake to 3.26 by @AlpinDale in #691
- fix: shut down ray dag workers cleanly by @AlpinDale in #692
- feat: add lora loading/unloading api endpoint by @AlpinDale in #693 (see sketch below)
- feat: add load/unload endpoints for soft-prompts by @AlpinDale in #694
- fix: loading chameleon model with TP>1 by @AlpinDale in #695
- fix: consolidated `is_tpu()` and suppress tpu import warning by @AlpinDale in #696
- fix: manually install triton for other devices to prevent outlines errors by @AlpinDale in #697
- feat: support for Audio modality by @AlpinDale in #698
- chore: migrate gptq_marlin to AphroditeParameters by @AlpinDale in #699
- chore: update fused MoE weight loading by @AlpinDale in #700
- feat: add Solar model support by @AlpinDale in #701
- feat: migrate awq and awq_marlin to AphroditeParameter by @AlpinDale in #702
- chore: spawn engine process from api server process by @AlpinDale in #703
- chore: use the `compressed-tensors` library to avoid code reuse by @AlpinDale in #704
- feat: add aphrodite plugin system by @AlpinDale in #705
- Revert "chore: use the `compressed-tensors` library to avoid code reuse (#704)" by @AlpinDale in #706
- feat: add support for multi-host TPU by @AlpinDale in #707
- fix: import ray under a guard by @AlpinDale in #708
- fix: empty sampler output when temperature is too low by @AlpinDale in #709
- fix: disable embeddings API for chat models by @AlpinDale in #710
- feat: implement mistral tokenizer mode by @AlpinDale in #711
- feat: support profiling with multiple multi-modal inputs per prompt by @AlpinDale in #712
- chore: multi-step args and sequence modifications by @AlpinDale in #713
- chore: set per-rank XLA cache for TPU by @AlpinDale in #714
- chore: add support for up to 2048 block size by @AlpinDale in #715
- fix: install protobuf for cpu by @AlpinDale in #716
- fix: weight loading for scalars by @AlpinDale in #718
- chore: quant config for speculative draft models by @AlpinDale in #719
- feat: enable prompt logprobs in OpenAI API by @AlpinDale in #720 (see sketch below)
- chore: update grafana template by @AlpinDale in #721
- ci: bump aphrodite to 0.6.1 by @AlpinDale in #722
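
A few of the user-facing changes above lend themselves to short sketches. Everything below is illustrative only: endpoint paths, field names, model names, and the port are assumptions noted inline, not confirmed Aphrodite APIs.

First, the idea behind #670 (apply temperature scaling last). Conventionally, temperature divides the logits before truncation filters such as top-k run; with the option enabled, the filters see the raw logits and temperature only reshapes the surviving candidates. A minimal numpy sketch, with a toy top-k filter standing in for the real sampler stages:

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Illustrative stand-in for the sampler's truncation filters:
    mask everything except the k highest logits."""
    out = np.full_like(logits, -np.inf)
    top = np.argpartition(logits, -k)[-k:]
    out[top] = logits[top]
    return out

def sample_probs(logits: np.ndarray, temperature: float, k: int,
                 temperature_last: bool) -> np.ndarray:
    # Conventional order: temperature first, then filters.
    if not temperature_last:
        logits = logits / temperature
    logits = top_k_filter(logits, k)
    # Order added in #670: scale after filtering, so a high temperature
    # flattens the surviving candidates without letting low-probability
    # tokens back into the top-k set.
    if temperature_last:
        logits = logits / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```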
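Next, the batched embeddings support from #676. The request shape follows the standard OpenAI embeddings API, where `input` may be a list; the port and model name are placeholders:

```python
import requests

resp = requests.post(
    "http://localhost:2242/v1/embeddings",  # port 2242 assumed as the default
    json={
        "model": "my-embedding-model",  # hypothetical served model name
        "input": ["first sentence", "second sentence"],  # one request, many inputs
    },
)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
```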
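The runtime LoRA endpoints from #693 might be exercised as follows. The routes and payload keys assume the vLLM-style `/v1/load_lora_adapter` convention; check your server's route list if they differ:

```python
import requests

BASE = "http://localhost:2242"  # adjust to your deployment

# Load an adapter without restarting the server (path and keys are assumptions).
requests.post(f"{BASE}/v1/load_lora_adapter", json={
    "lora_name": "my-adapter",          # hypothetical adapter name
    "lora_path": "/models/my-adapter",  # hypothetical adapter location
}).raise_for_status()

# ...requests can now target "model": "my-adapter"...

# Unload it when finished to free the slot.
requests.post(f"{BASE}/v1/unload_lora_adapter",
              json={"lora_name": "my-adapter"}).raise_for_status()
```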
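Finally, prompt logprobs through the OpenAI-compatible API (#720). The `prompt_logprobs` request field is an extra, non-standard parameter; the name is assumed to mirror the engine's sampling parameter of the same name:

```python
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",
    json={
        "model": "my-model",                  # hypothetical served model name
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "prompt_logprobs": 1,  # assumed extra field: top-1 logprob per *prompt* token
    },
)
resp.raise_for_status()
# The response field name is likewise an assumption; inspect the payload.
print(resp.json()["choices"][0].get("prompt_logprobs"))
```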
Full Changelog: v0.6.0.post1...v0.6.1