
Building Mixtral engine with --load_by_shard option gives contradictory error #842

Closed
JohnnyRacer opened this issue Jan 8, 2024 · 3 comments

Comments

@JohnnyRacer

I am trying to follow the Mixtral example to build the engine, but I am unable to use --load_by_shard as the instructions state; it fails with: AssertionError: MoE does not support sharded load. This contradicts the docs in the example, which say: "Note that when loading Mixtral weights you must use the --load_by_shard option." The model does attempt to load without this flag, but it consumes over 128 GB of RAM and OOMs my system. Is there any way to build the engine for Mixtral without using so much RAM, or to get --load_by_shard to work?

System spec:

Core i7 12700k
Nvidia A6000 48GB
128 GB RAM

Build command:

python build.py --model_dir /mixtral_model \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin \
                --load_by_shard \
                --output_dir /models/mixtral-0.1-trtllm

Full traceback:

[01/08/2024-20:07:53] [TRT-LLM] [I] Using GPT attention plugin for inflight batching mode. Setting to default 'float16'
[01/08/2024-20:07:53] [TRT-LLM] [I] Using remove input padding for inflight batching mode.
[01/08/2024-20:07:53] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
You are using a model of type mixtral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
[01/08/2024-20:07:53] [TRT-LLM] [I] Serially build TensorRT engines.
[01/08/2024-20:07:53] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 123, GPU 10500 (MiB)
[01/08/2024-20:07:55] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1798, GPU +312, now: CPU 2057, GPU 10812 (MiB)
[01/08/2024-20:07:55] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/08/2024-20:07:55] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.2726 (GiB) Device 10.5589 (GiB)
[01/08/2024-20:07:56] [TRT-LLM] [I] Loading HF LLaMA ... from /mixtral_model
[01/08/2024-20:07:56] [TRT-LLM] [I] Loading weights from HF LLaMA...
Traceback (most recent call last):
  File "/models/trtllm_examples/llama/build.py", line 934, in <module>
    build(0, args)
  File "/models/trtllm_examples/llama/build.py", line 878, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/models/trtllm_examples/llama/build.py", line 721, in build_rank_engine
    load_from_hf_checkpoint(tensorrt_llm_llama,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 540, in load_from_hf_checkpoint
    assert not tensorrt_llm_llama.moe_config.has_moe(
AssertionError: MoE does not support sharded load
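
For context, my understanding is that --load_by_shard exists to avoid materializing the whole HF state dict in host RAM at once. A minimal Python sketch of that idea using safetensors (the shard file name is hypothetical, and this is only an illustration of the concept, not TensorRT-LLM's actual loader):

from safetensors import safe_open

# Hypothetical shard name; Mixtral HF checkpoints ship as many such files.
shard = "/mixtral_model/model-00001-of-00019.safetensors"

with safe_open(shard, framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)  # only this one tensor is resident in RAM
        # ... convert/copy the tensor into the engine weights here ...
        del tensor  # release host memory before loading the next tensor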
@mickaelseznec
Collaborator

The docs are mistaken, indeed. Thanks for the catch!

We'll work on making the model easier to build with less RAM. But Mixtral is quite a large model (~45B parameters), so you won't be able to fit it directly on an A6000 GPU. We're working on making that possible through quantization in upcoming releases.
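
To make the memory pressure concrete, a back-of-the-envelope estimate (assuming fp16 HF weights and the ~45B parameter count above):

n_params = 45e9        # ~45B parameters, per the comment above
bytes_per_param = 2    # fp16
gib = n_params * bytes_per_param / 2**30
print(f"~{gib:.0f} GiB for a single fp16 copy of the weights")  # ~84 GiB
# If the HF state dict and the converted weights coexist during the build,
# peak host usage can roughly double, consistent with exhausting 128 GB.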

@JohnnyRacer
Author

> The docs are mistaken, indeed. Thanks for the catch!

So does the correct command to build the engine for Mixtral require --load_by_shard or not? Judging from this issue, it seems Mixtral support is going to improve in the next release.

@nv-guomingz
Collaborator

Thanks for your patience. I don't think the latest build command for this model requires --load_by_shard any more.

Please feel free to reopen this ticket if needed.
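
For reference, newer TensorRT-LLM releases split this into a checkpoint-conversion step and an engine-build step, with no --load_by_shard involved. A sketch of that flow; the flags are illustrative and may differ in your release, so check the Mixtral example README for your version:

# Illustrative only: flag names and script locations vary between releases.
python convert_checkpoint.py --model_dir /mixtral_model \
                             --output_dir /tmp/mixtral_ckpt \
                             --dtype float16
trtllm-build --checkpoint_dir /tmp/mixtral_ckpt \
             --gemm_plugin float16 \
             --output_dir /models/mixtral-0.1-trtllm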
