I am trying to follow the Mixtral example to build the engine, but I am unable to use '--load_by_shard' as stated in the instructions; it fails with the following error: AssertionError: MoE does not support sharded load. This contradicts the docs in the example, which state: "Note that when loading Mixtral weights you must use the --load_by_shard option." The model does attempt to load without this flag, but it consumes over 128 GB of RAM, causing OOM on my system. Is there any way to build the engine for Mixtral without using so much RAM, or to get --load_by_shard to work?
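For reference, a build invocation of roughly the following shape matches the log below (a sketch only: the model directory comes from the log, the output directory is a placeholder, and --use_inflight_batching is inferred from the "inflight batching mode" messages rather than confirmed):

# Sketch of the failing build, using flags from the TensorRT-LLM llama example.
python build.py \
    --model_dir /mixtral_model \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_inflight_batching \
    --load_by_shard \
    --output_dir ./mixtral_engine
# --load_by_shard is what triggers "MoE does not support sharded load";
# dropping it falls back to loading the full checkpoint into host RAM instead.

Full traceback: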
[01/08/2024-20:07:53] [TRT-LLM] [I] Using GPT attention plugin for inflight batching mode. Setting to default 'float16'
[01/08/2024-20:07:53] [TRT-LLM] [I] Using remove input padding for inflight batching mode.
[01/08/2024-20:07:53] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
You are using a model of type mixtral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
[01/08/2024-20:07:53] [TRT-LLM] [I] Serially build TensorRT engines.
[01/08/2024-20:07:53] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 123, GPU 10500 (MiB)
[01/08/2024-20:07:55] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1798, GPU +312, now: CPU 2057, GPU 10812 (MiB)
[01/08/2024-20:07:55] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/08/2024-20:07:55] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.2726 (GiB) Device 10.5589 (GiB)
[01/08/2024-20:07:56] [TRT-LLM] [I] Loading HF LLaMA ... from /mixtral_model
[01/08/2024-20:07:56] [TRT-LLM] [I] Loading weights from HF LLaMA...
Traceback (most recent call last):
File "/models/trtllm_examples/llama/build.py", line 934, in <module>
build(0, args)
File "/models/trtllm_examples/llama/build.py", line 878, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/models/trtllm_examples/llama/build.py", line 721, in build_rank_engine
load_from_hf_checkpoint(tensorrt_llm_llama,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 540, in load_from_hf_checkpoint
assert not tensorrt_llm_llama.moe_config.has_moe(
AssertionError: MoE does not support sharded load
The docs are mistaken, indeed. Thanks for the catch!
We'll work on making the model easier to build with less RAM. But Mixtral is quite a large model (~45B parameters), so you won't be able to fit it directly on an A6000 GPU. We're working on making that possible through quantization in the following releases.
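For rough numbers (a back-of-the-envelope estimate, not an official figure), ~45B parameters at 2 bytes per parameter in fp16 already exceed the 48 GB of an A6000 before any KV cache or activations:

# Assumption: ~45B params stored in fp16 (2 bytes each); an A6000 has 48 GiB of VRAM.
python3 -c "print(f'{45e9 * 2 / 1024**3:.1f} GiB of fp16 weights alone')"   # ~83.8 GiB

Short of quantization, the usual way to fit Mixtral is tensor parallelism across multiple GPUs (build.py's --world_size / --tp_size options).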
So does the correct command to build the engine for Mixtral require '--load_by_shard' or not? From this issue, it sounds like Mixtral support is going to improve in the next release.