roadmap: mmap for keeping Model in VRAM when Flash Attention is used #1717

Open
dan-menlo opened this issue Nov 23, 2024 · 2 comments
@dan-menlo
Contributor

dan-menlo commented Nov 23, 2024

Goal

Feedback from WiseFarAI:

I am also wondering if you guys use Mmap and other ways of keeping the model in Vram memory and if/when Flash Attention is used. Since these parameters cannot easily be observed (their current setting) or changed from within the UI. Sometimes it seems like it tries to reload the model mid-conversation and the generation speed drops from 6 tokens to 2 per second.

@dan-menlo dan-menlo converted this from a draft issue Nov 23, 2024
@dan-menlo dan-menlo changed the title feedback: mmap for keeping Model in VRAM feedback: mmap for keeping Model in VRAM when Flash Attention is used Nov 23, 2024
@gabrielle-ong
Contributor

@vansangpfiev assigning to you to investigate, cc @dan-homebrew - will have a call with Sang next week

@dan-menlo dan-menlo changed the title feedback: mmap for keeping Model in VRAM when Flash Attention is used roadmap: mmap for keeping Model in VRAM when Flash Attention is used Dec 16, 2024
@vansangpfiev vansangpfiev moved this from Investigating to In Progress in Jan & Cortex Dec 23, 2024
@vansangpfiev
Contributor

llama.cpp optimizes memory usage by strategically distributing model components across VRAM and RAM:

  1. VRAM allocation:

    • Model weights
    • Key-value (KV) cache
    • Preprocessed prompt buffer

  2. RAM allocation:

    • Embedding lookup table
    • Auxiliary buffers for data transfer between VRAM and CPU
    • Model (the copy loaded or memory-mapped from disk)
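To get a feel for how much VRAM the KV cache in the list above can take, here is a back-of-the-envelope sketch. It is not llama.cpp's allocator: the function name and the hyperparameters in the example are illustrative assumptions, and real usage also depends on the cache data type and backend overhead.

```python
def kv_cache_bytes(n_layer, n_ctx, n_head_kv, head_dim, bytes_per_elem=2):
    """Rough KV cache size in bytes (hypothetical helper, not a llama.cpp API).

    Each layer stores one K and one V tensor, each holding
    n_ctx * n_head_kv * head_dim elements.
    bytes_per_elem=2 assumes an f16 cache.
    """
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Illustrative hyperparameters (assumed, not taken from this issue):
    # 32 layers, 4096-token context, 8 KV heads, head dimension 128.
    size = kv_cache_bytes(n_layer=32, n_ctx=4096, n_head_kv=8, head_dim=128)
    print(f"KV cache: {size / 2**20:.0f} MiB")
```

The cache grows linearly with context length, so a long conversation needs more VRAM than a short one for the same model.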

Reloading the model mid-conversation typically happens when the model's memory is swapped out, either because of system memory pressure or because of OS scheduling.

The model weights offloaded to the GPU stay in VRAM during inference. On the host side, for example with CUDA, the cudaHostRegister() API is used to pin (page-lock) host memory, so the OS cannot page it out and transfers between RAM and the GPU stay fast. https://github.com/ggerganov/llama.cpp/blob/8d59d911711b8f1ba9ec57c4b192ccd2628af033/ggml/src/ggml-cuda/ggml-cuda.cu#L2685

llama.cpp provides an mlock parameter (--mlock), which locks the model in RAM and prevents it from being swapped out.
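To illustrate what mmap plus mlock means at the OS level, here is a minimal Python sketch. It is not llama.cpp's code: map_and_lock is a hypothetical helper, it assumes a Unix-like system where libc exposes mlock(2), and the lock can fail if the process's locked-memory limit (RLIMIT_MEMLOCK) is too low.

```python
import ctypes
import ctypes.util
import mmap
import os
import tempfile


def map_and_lock(path):
    """Memory-map `path` and try to pin the mapping in RAM with mlock(2).

    Hypothetical helper for illustration. Returns (mapping, locked), where
    `locked` is False if mlock is unavailable or the call failed.
    """
    with open(path, "rb") as f:
        # ACCESS_COPY is a private copy-on-write mapping; ctypes needs a
        # writable buffer to take the mapping's address.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)

    locked = False
    libc_name = ctypes.util.find_library("c")
    if libc_name is not None:
        libc = ctypes.CDLL(libc_name, use_errno=True)
        if hasattr(libc, "mlock"):
            buf = (ctypes.c_char * len(mm)).from_buffer(mm)
            addr = ctypes.addressof(buf)
            # mlock returns 0 on success, -1 on failure.
            rc = libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(len(mm)))
            locked = rc == 0
            del buf  # release the exported buffer so mm.close() works later
    return mm, locked


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * 4096)  # stand-in for a model file
    mapping, locked = map_and_lock(tmp.name)
    print("locked in RAM:", locked)
    mapping.close()  # unmapping also drops the lock
    os.remove(tmp.name)
```

Without the lock, the OS is free to evict mapped pages under memory pressure and read them back from disk later, which is one way a model can appear to reload mid-conversation.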

cc: @dan-menlo @nguyenhoangthuan99

Projects
Status: In Progress

4 participants