I am also wondering if you use mmap and other ways of keeping the model in VRAM, and if/when Flash Attention is used, since these parameters cannot easily be observed or changed from within the UI. Sometimes it seems like the model is reloaded mid-conversation and the generation speed drops from 6 tokens per second to 2.
dan-menlo changed the title from "feedback: mmap for keeping Model in VRAM" to "feedback: mmap for keeping Model in VRAM when Flash Attention is used" on Nov 23, 2024
@vansangpfiev assigning to you to investigate, cc @dan-homebrew - will have a call with Sang next week
dan-menlo changed the title from "feedback: mmap for keeping Model in VRAM when Flash Attention is used" to "roadmap: mmap for keeping Model in VRAM when Flash Attention is used" on Dec 16, 2024
llama.cpp optimizes memory usage by strategically distributing model components across VRAM and RAM:
VRAM allocation:
- Model weights
- Key-value (KV) cache
- Preprocessed prompt buffer

RAM allocation:
- Embedding lookup table
- Auxiliary buffers for data transfer between VRAM and CPU

Model
The potential issue of the model being reloaded mid-conversation typically arises when the in-memory model is swapped out due to system memory pressure or OS scheduling.
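For reference, below is a minimal sketch (not Jan's or cortex.cpp's actual implementation) of the llama.cpp C API knobs behind this behaviour: `use_mmap` memory-maps the GGUF file, `use_mlock` pins the mapped pages so the OS cannot swap them out, `n_gpu_layers` controls how much of the model is offloaded to VRAM, and `flash_attn` toggles Flash Attention. Field and function names follow llama.cpp's `llama.h` around late 2024 and may differ in other versions; the model path is a placeholder.

```c
#include "llama.h"
#include <stdio.h>

int main(void) {
    llama_backend_init();

    // Model-level parameters: mmap the weights file, lock its pages in RAM,
    // and offload layers to the GPU.
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;    // offload all layers to VRAM if possible
    mparams.use_mmap     = true;  // map the GGUF file instead of copying it
    mparams.use_mlock    = true;  // pin pages so the OS cannot swap them out

    // "model.gguf" is a placeholder path for illustration.
    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // Context-level parameters: enable Flash Attention on supported backends.
    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 4096;
    cparams.flash_attn = true;

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // ... run inference ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

Surfacing the current values of these parameters in the UI, and whether mlock is in effect, would address the observability part of the feedback above.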
Goal
Feedback from WiseFarAI:
I am also wondering if you use mmap and other ways of keeping the model in VRAM, and if/when Flash Attention is used, since these parameters cannot easily be observed or changed from within the UI. Sometimes it seems like the model is reloaded mid-conversation and the generation speed drops from 6 tokens per second to 2.