I am trying to follow the Mixtral example to build the engine, but I am unable to use '--load_by_shard' as stated in the instructions; it fails with the following error: AssertionError: MoE does not support sharded load. This contradicts the docs in the example, which state: "Note that when loading Mixtral weights you must use the --load_by_shard option." The model does attempt to load without this flag, but it consumes over 128 GB of RAM, causing OOM on my system. Is there any way to build the engine for Mixtral without using so much RAM, or to get --load_by_shard to work?
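For reference, a build invocation of roughly the following shape matches the log below (a sketch only: the model directory comes from the log, the output directory is a placeholder, and --use_inflight_batching is inferred from the "inflight batching mode" messages rather than confirmed):

# Sketch of the failing build, using flags from the TensorRT-LLM llama example.
python build.py \
    --model_dir /mixtral_model \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_inflight_batching \
    --load_by_shard \
    --output_dir ./mixtral_engine
# --load_by_shard is what triggers "MoE does not support sharded load";
# dropping it falls back to loading the full checkpoint into host RAM instead.

Full traceback: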
[01/08/2024-20:07:53] [TRT-LLM] [I] Using GPT attention plugin for inflight batching mode. Setting to default 'float16'
[01/08/2024-20:07:53] [TRT-LLM] [I] Using remove input padding for inflight batching mode.
[01/08/2024-20:07:53] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
You are using a model of type mixtral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
[01/08/2024-20:07:53] [TRT-LLM] [I] Serially build TensorRT engines.
[01/08/2024-20:07:53] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 123, GPU 10500 (MiB)
[01/08/2024-20:07:55] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1798, GPU +312, now: CPU 2057, GPU 10812 (MiB)
[01/08/2024-20:07:55] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/08/2024-20:07:55] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.2726 (GiB) Device 10.5589 (GiB)
[01/08/2024-20:07:56] [TRT-LLM] [I] Loading HF LLaMA ... from /mixtral_model
[01/08/2024-20:07:56] [TRT-LLM] [I] Loading weights from HF LLaMA...
Traceback (most recent call last):
File "/models/trtllm_examples/llama/build.py", line 934, in <module>
build(0, args)
File "/models/trtllm_examples/llama/build.py", line 878, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/models/trtllm_examples/llama/build.py", line 721, in build_rank_engine
load_from_hf_checkpoint(tensorrt_llm_llama,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 540, in load_from_hf_checkpoint
assert not tensorrt_llm_llama.moe_config.has_moe(
AssertionError: MoE does not support sharded load
The docs are mistaken, indeed. Thanks for the catch!
We'll work on making the model easier to build with less RAM. But Mixtral is quite a large model (~45B parameters), so you won't be able to fit it directly on an A6000 GPU. We're working on making that possible through quantization in the following releases.
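For rough numbers (a back-of-the-envelope estimate, not an official figure), ~45B parameters at 2 bytes per parameter in fp16 already exceed the 48 GB of an A6000 before any KV cache or activations:

# Assumption: ~45B params stored in fp16 (2 bytes each); an A6000 has 48 GiB of VRAM.
python3 -c "print(f'{45e9 * 2 / 1024**3:.1f} GiB of fp16 weights alone')"   # ~83.8 GiB

Short of quantization, the usual way to fit Mixtral is tensor parallelism across multiple GPUs (build.py's --world_size / --tp_size options).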
So does the correct command to build the engine for Mixtral require '--load_by_shard' or not? From this issue, it sounds like Mixtral support is going to improve in the next release.