Prompt formatting for different versions of InternVL2 #2672

Open
2 of 4 tasks
nzarif opened this issue Jan 8, 2025 · 0 comments
Labels
bug Something isn't working

Comments


nzarif commented Jan 8, 2025

System Info

Hi,

Different versions of the InternVL2 model use different LLM architectures. For example, according to the model card, the 2B model uses InternLM2 while the 4B model uses Phi3 as its LLM. However, multimodal_model_runner() preprocesses and formats the prompt the same way for all versions of InternVL2 here. It appears to use the InternLM prompt formatting, even though the 4B model, for example, has Phi3 as its language engine.

If we want to use versions of InternVL that are built on an LLM architecture other than InternLM, should we modify the prompt formatting to match the format that the specific LLM expects? For example, when using the 4B model, should we format the prompt as Phi3 expects (a sketch of the two templates follows below)? Or is it fine to use multimodal_model_runner() for all versions of InternVL without changing the prompt formatting? I am asking because I'm observing very low accuracy for InternVL that also varies considerably between runs.
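
For reference, here is a minimal sketch of how the two chat templates differ. The template strings follow the published InternLM2 and Phi-3 chat formats; format_prompt and the bare <image> placeholder are simplified illustrations, not the runner's actual code.

```python
# Minimal sketch: wrapping a VQA question in each backbone's chat template.
# format_prompt is a hypothetical helper, not TensorRT-LLM code; in the real
# pipeline the <image> placeholder is expanded into image context tokens.
def format_prompt(question: str, backbone: str) -> str:
    if backbone == "internlm2":  # e.g. InternVL2-2B
        return (
            "<|im_start|>user\n"
            f"<image>\n{question}<|im_end|>\n"
            "<|im_start|>assistant\n"
        )
    if backbone == "phi3":  # e.g. InternVL2-4B
        return (
            "<|user|>\n"
            f"<image>\n{question}<|end|>\n"
            "<|assistant|>\n"
        )
    raise ValueError(f"unsupported backbone: {backbone}")
```

If the runner always emits the first form, a Phi3-based model would be conditioned on control tokens it was never trained on, which could plausibly explain low and unstable accuracy.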

Also, to use multimodal_model_runner() for InternVL2.5, should we make any specific changes?

Who can help?

@sunnyqgg

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the instructions here to generate the engine files for InternVL2-2B or InternVL2-4B, then use multimodal_model_runner() to run an evaluation on the model with your dataset (see the sketch below). Measure accuracy, latency, and throughput if you want.
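
To make the setup concrete, here is a hypothetical version of that evaluation loop. The runner construction and run() call mirror examples/multimodal/run.py in TensorRT-LLM, but argument names and signatures vary across releases, so treat them as illustrative rather than authoritative; the dataset field names assume the HuggingFaceM4/A-OKVQA dataset on the Hugging Face Hub.

```python
# Hypothetical A-OKVQA evaluation loop. Argument names and the run() return
# value are assumed from examples/multimodal/run.py and may differ in your
# TensorRT-LLM release; real runs need the full run.py argument set.
import argparse
from datasets import load_dataset
from tensorrt_llm.runtime import MultimodalModelRunner

args = argparse.Namespace(
    hf_model_dir="OpenGVLab/InternVL2-4B",   # tokenizer / processor source
    visual_engine_dir="trt_engines/visual",  # built vision-encoder engine
    llm_engine_dir="trt_engines/llm",        # built LLM engine
    max_new_tokens=32,
)
runner = MultimodalModelRunner(args)

ds = load_dataset("HuggingFaceM4/A-OKVQA", split="validation")
correct = 0
for sample in ds:
    # run() is assumed to return (input_text, output_text) as in run.py
    _, output = runner.run(sample["question"], sample["image"],
                           args.max_new_tokens)
    # simple exact-match scoring against the annotated direct answers
    correct += output.strip().lower() in [
        a.lower() for a in sample["direct_answers"]
    ]
print(f"accuracy: {correct / len(ds):.1%}")
```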

Expected behavior

See similar evaluation accuracy for different versions of InternVL when comparing the PyTorch model to the TRT model.

Actual behavior

Right now the TRT evaluation accuracy is much lower: around 60% for the PyTorch model versus below 20% for the TRT model.

Additional notes

I am using InternVL2 for VQA, and the dataset is A-OKVQA.

nzarif added the bug label on Jan 8, 2025