BentoML Releases llm-optimizer, an Open-Source Tool for LLM Inference Benchmarking and Performance Optimization
BentoML recently released llm-optimizer, an open-source tool for benchmarking and optimizing LLM inference. It supports multiple inference frameworks and works with any open-source LLM.
llm-optimizer aims to automate the tedious manual work of LLM performance tuning. With just a few commands, you can run structured experiments, apply constraints, and visualize the results in one place.
Usage example
llm-optimizer estimate \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 1024 \
--output-len 512 \
--gpu A100 \
--num-gpus 2
Expected output
=== Configuration ===
Model: meta-llama/Llama-3.1-8B-Instruct
GPU: 2x A100
Precision: fp16
Input/Output: 1024/512 tokens
Target: throughput
Fetching model configuration...
Model: 8029995008.0B parameters, 32 layers
=== Performance Analysis ===
Best Latency (concurrency=1):
TTFT: 43.1 ms
ITL: 2.6 ms
E2E: 1.39 s
Best Throughput (concurrency=512):
Output: 18873.3 tokens/s
Input: 23767.8 tokens/s
Requests: 14.24 req/s
Bottleneck: Memory
=== Roofline Analysis ===
Hardware Ops/Byte Ratio: 142.5 ops/byte
Prefill Arithmetic Intensity: 52205.5 ops/byte
Decode Arithmetic Intensity: 50.9 ops/byte
Prefill Phase: Compute Bound
Decode Phase: Memory Bound
=== Concurrency Analysis ===
KV Cache Memory Limit: 688 concurrent requests
Prefill Compute Limit: 8 concurrent requests
Decode Capacity Limit: 13 concurrent requests
Theoretical Overall Limit: 8 concurrent requests
Empirical Optimal Concurrency: 16 concurrent requests
=== Tuning Commands ===
--- SGLANG ---
Simple (concurrency + TP/DP):
llm-optimizer --framework sglang --model meta-llama/Llama-3.1-8B-Instruct --gpus 2 --host 127.0.0.1 --server-args "tp_size*dp_size=[(1, 2), (2, 1)]" --client-args "num_prompts=1000;dataset_name=sharegpt;random_input=1024;random_output=512;num_prompts=1000;max_concurrency=[256, 512, 768]" --output-dir tuning_results --output-json tuning_results/config_1_sglang.json
Advanced (additional parameters):
llm-optimizer --framework sglang --model meta-llama/Llama-3.1-8B-Instruct --gpus 2 --host 127.0.0.1 --server-args "tp_size*dp_size=[(1, 2), (2, 1)];chunked_prefill_size=[1434, 2048, 2662];schedule_conservativeness=[0.3, 0.6, 1.0];schedule_policy=fcfs" --client-args "num_prompts=1000;dataset_name=sharegpt;random_input=1024;random_output=512;num_prompts=1000;max_concurrency=[256, 512, 768]" --output-dir tuning_results --output-json tuning_results/config_1_sglang.json
--- VLLM ---
Simple (concurrency + TP/DP):
llm-optimizer --framework vllm --model meta-llama/Llama-3.1-8B-Instruct --gpus 2 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 2), (2, 1)]" --client-args "num_prompts=1000;dataset_name=sharegpt;random_input=1024;random_output=512;num_prompts=1000;max_concurrency=[256, 512, 768]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json
Advanced (additional parameters):
llm-optimizer --framework vllm --model meta-llama/Llama-3.1-8B-Instruct --gpus 2 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 2), (2, 1)];max_num_batched_tokens=[1024, 1177, 1331]" --client-args "num_prompts=1000;dataset_name=sharegpt;random_input=1024;random_output=512;num_prompts=1000;max_concurrency=[256, 512, 768]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json
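The roofline section of the output is straightforward to reproduce by hand: a phase is compute-bound when its arithmetic intensity exceeds the hardware's ops/byte ratio, and memory-bound otherwise. A minimal Python sketch, using the intensities from the output above; the A100 peak-throughput and bandwidth constants are my own assumptions (llm-optimizer evidently uses slightly different ones, since it reports 142.5 ops/byte):

```python
# Illustrative roofline check: a phase is compute-bound when its
# arithmetic intensity exceeds the hardware ops/byte ratio.
# The A100 peak numbers below are assumed datasheet specs, not
# values taken from llm-optimizer's internals.

PEAK_FP16_OPS = 312e12    # assumed A100 dense fp16 tensor-core peak, ops/s
HBM_BANDWIDTH = 2.039e12  # assumed A100 80GB memory bandwidth, bytes/s

def bound(arithmetic_intensity: float) -> str:
    """Classify a phase against the hardware ops/byte ratio."""
    hw_ratio = PEAK_FP16_OPS / HBM_BANDWIDTH  # ~153 ops/byte
    return "Compute Bound" if arithmetic_intensity > hw_ratio else "Memory Bound"

# Intensities taken from the estimate output above:
print(bound(52205.5))  # prefill -> Compute Bound
print(bound(50.9))     # decode  -> Memory Bound
```

This is why the two phases land on opposite sides of the roofline: prefill reuses each weight across every prompt token, while decode re-reads all weights to emit a single token.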
The tool addresses a common challenge in LLM deployment: finding the best configuration for latency, throughput, and cost without relying on manual trial and error. llm-optimizer offers a structured way to explore the LLM performance landscape, replacing repetitive guesswork with systematic benchmarking and automated search across the configuration space.
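The KV-cache concurrency limit in the estimate output can also be approximated with a back-of-the-envelope calculation. A sketch under assumed values (Llama-3.1-8B attention geometry, fp16, a flat 10% activation reserve); this is not llm-optimizer's actual formula, so the result only lands in the same ballpark as the reported 688:

```python
# Rough KV-cache concurrency limit, assuming Llama-3.1-8B geometry
# (32 layers, 8 KV heads, head_dim 128, fp16) on 2x A100 80GB.
# All constants here are illustrative assumptions.

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # fp16

# K and V caches per token, across all layers
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

tokens_per_request = 1024 + 512  # input + output length from the run above
kv_bytes_per_request = kv_bytes_per_token * tokens_per_request

gpu_mem = 2 * 80e9                         # 2x A100 80GB
weight_bytes = 8.03e9 * bytes_per_elem     # fp16 weights
free_mem = (gpu_mem - weight_bytes) * 0.9  # assumed 10% activation reserve

max_concurrent = int(free_mem // kv_bytes_per_request)
print(kv_bytes_per_token, max_concurrent)  # 131072 bytes/token, ~640 requests
```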
Repository: https://github.com/bentoml/llm-optimizer
