Update TensorRT-LLM #846

kaiyux · 2024-01-09T12:09:37Z

Model Support
- Add examples for multimodal models (BLIP with OPT or T5, LLaVA)
Features
- Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Out-of-the-box support for the QWEN model
- Support for returning context and/or generation logits in the Triton backend
API
- Add a set of high-level APIs for end-to-end generation tasks, with the following features:
  - ModelConfig() as a clean configuration interface for LLM tasks
  - LLM() for LLM pipelines; it triggers any necessary engine building or model quantization silently in the background
  - generate() API for batched offline inference; both single-GPU and multi-GPU are supported
  - generate_async() API for asynchronous offline inference on a single GPU; streaming mode is supported
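The difference between the batched and the asynchronous call styles listed above can be sketched without the library itself. The snippet below does not import tensorrt_llm: `ToyModelConfig`, `ToyLLM`, `collect_stream`, the `streaming` parameter, and the model path are hypothetical stand-ins that only mimic the shape of the calls, and the "generation" is a placeholder that emits numbered tokens. The real signatures may differ.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToyModelConfig:
    """Hypothetical stand-in for a configuration object (not the real ModelConfig)."""
    model_dir: str
    max_new_tokens: int = 3


class ToyLLM:
    """Hypothetical stand-in for a pipeline object; it runs no model."""

    def __init__(self, config):
        self.config = config

    def _tokens(self, prompt):
        # Fake "generation": emit numbered placeholder tokens.
        return [f"<tok{i}>" for i in range(self.config.max_new_tokens)]

    def generate(self, prompts):
        # Batched offline style: return one finished string per prompt.
        return [p + " " + " ".join(self._tokens(p)) for p in prompts]

    async def generate_async(self, prompt, streaming=True):
        # Asynchronous style: yield pieces as they become available.
        tokens = self._tokens(prompt)
        if not streaming:
            yield " ".join(tokens)
            return
        for tok in tokens:
            await asyncio.sleep(0)  # hand control back to the event loop
            yield tok


async def collect_stream(llm, prompt):
    # Consume the stream piece by piece, as a streaming client would.
    return [piece async for piece in llm.generate_async(prompt)]


llm = ToyLLM(ToyModelConfig(model_dir="/path/to/model"))  # placeholder path
batch_outputs = llm.generate(["Hello", "To tell a story"])
streamed = asyncio.run(collect_stream(llm, "Hello"))
```

In this toy version, `generate()` blocks until every prompt in the batch is finished, while `generate_async()` lets the caller handle each piece as soon as it is yielded.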
Bug fixes
- Add pickle support for InferenceRequest (issue #701, "GptManager pybind 2/4TP run demo")
- Fix Mixtral-8x7b build failure with custom_all_reduce (issue #825, "Mixtral-8x7b build fails with custom_all_reduce")
Performance
- Performance optimization of beam search kernel
- Increase default freeGpuMemoryFraction parameter from 0.85 to 0.9 for higher throughput
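As a rough illustration of the freeGpuMemoryFraction change, the arithmetic below assumes the parameter is the share of currently free GPU memory reserved for the KV cache. The 60 GiB figure and the helper name `kv_cache_budget_gib` are made up for the example and are not measurements or library functions.

```python
def kv_cache_budget_gib(free_gib, fraction):
    """Memory reserved for the KV cache, given free GPU memory and a fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return free_gib * fraction


free_gib = 60.0  # assumed free memory after loading weights (illustrative only)
old_budget = kv_cache_budget_gib(free_gib, 0.85)  # previous default
new_budget = kv_cache_budget_gib(free_gib, 0.90)  # new default
extra = new_budget - old_budget  # additional KV cache space under these assumptions
```

Under these assumptions the budget grows from about 51 GiB to about 54 GiB, leaving room for more cached tokens and therefore more requests in flight at once.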
Documentation
- Add documentation for best practices for tuning the performance of TensorRT-LLM (See docs/source/perf_best_practices.md)
- Add documentation for Falcon AWQ support (See examples/falcon/README.md)

Update TensorRT-LLM

77d24b5

kaiyux marked this pull request as draft January 9, 2024 12:09

update

d043415

Shixiaowei02 marked this pull request as ready for review January 9, 2024 13:00

Shixiaowei02 approved these changes Jan 9, 2024

kaiyux merged commit d879430 into main Jan 9, 2024

kaiyux deleted the kaiyu/update branch January 9, 2024 13:03

xesdiny mentioned this pull request Jan 19, 2024

ModelRunnerCpp.generate throw tensorrt_llm::common::TllmException for the second time #912

Closed
