I am also wondering if you use mmap and other ways of keeping the model in VRAM, and if/when Flash Attention is used, since these parameters cannot easily be observed or changed from within the UI. Sometimes it seems like the model is reloaded mid-conversation and the generation speed drops from 6 tokens per second to 2.
dan-menlo changed the title from "feedback: mmap for keeping Model in VRAM" to "feedback: mmap for keeping Model in VRAM when Flash Attention is used" on Nov 23, 2024
@vansangpfiev assigning to you to investigate, cc @dan-homebrew - will have a call with Sang next week
dan-menlo changed the title from "feedback: mmap for keeping Model in VRAM when Flash Attention is used" to "roadmap: mmap for keeping Model in VRAM when Flash Attention is used" on Dec 16, 2024
llama.cpp optimizes memory usage by strategically distributing model components across VRAM and RAM:
VRAM allocation:
- Model weights
- Key-value (KV) cache
- Preprocessed prompt buffer

RAM allocation:
- Embedding lookup table
- Auxiliary buffers for data transfer between VRAM and CPU

Model
The potential issue of the model being reloaded mid-conversation typically arises when the in-memory model is swapped out due to system memory pressure or OS scheduling.
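For reference, below is a minimal sketch (not Jan's or cortex.cpp's actual implementation) of the llama.cpp C API knobs behind this behaviour: `use_mmap` memory-maps the GGUF file, `use_mlock` pins the mapped pages so the OS cannot swap them out, `n_gpu_layers` controls how much of the model is offloaded to VRAM, and `flash_attn` toggles Flash Attention. Field and function names follow llama.cpp's `llama.h` around late 2024 and may differ in other versions; the model path is a placeholder.

```c
#include "llama.h"
#include <stdio.h>

int main(void) {
    llama_backend_init();

    // Model-level parameters: mmap the weights file, lock its pages in RAM,
    // and offload layers to the GPU.
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;    // offload all layers to VRAM if possible
    mparams.use_mmap     = true;  // map the GGUF file instead of copying it
    mparams.use_mlock    = true;  // pin pages so the OS cannot swap them out

    // "model.gguf" is a placeholder path for illustration.
    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // Context-level parameters: enable Flash Attention on supported backends.
    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 4096;
    cparams.flash_attn = true;

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // ... run inference ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

Surfacing the current values of these parameters in the UI, and whether mlock is in effect, would address the observability part of the feedback above.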
Goal
Feedback from WiseFarAI:
I am also wondering if you use mmap and other ways of keeping the model in VRAM, and if/when Flash Attention is used, since these parameters cannot easily be observed or changed from within the UI. Sometimes it seems like the model is reloaded mid-conversation and the generation speed drops from 6 tokens per second to 2.