
llama : refactor llama_context, llama_kv_cache, llm_build_context (v2) #12181

Open · ggerganov wants to merge 13 commits into master from gg/llama-kv-cache-v2
Conversation

ggerganov (Member) commented on Mar 4, 2025:

Alternative to #11213.

Overview

The implementation in #11213 became too complicated, trying to make too many changes at once. This is an alternative implementation that does not abstract llama_context. The PR introduces some new abstractions, improves the graph-build handling, and is an initial step toward the changes listed in the "Next" section below.

  • Rework the old llm_build_context into a new llm_graph_context, implemented in llama-graph.h/.cpp
  • Introduce llm_graph_input_... classes for handling graph inputs in a safer and cleaner way
  • Introduce llm_graph_result for extracting important tensors such as embeddings and logits, instead of searching for them by tensor name (see the sketch after this list)
  • Introduce the llm_memory_i concept that will abstract different cache/memory mechanisms. For now, llama_kv_cache is the only type of memory
  • Rework session saving/loading using the new llama_io_write_i and llama_io_read_i interfaces
  • Remove the "worst case" concept from the graph-building logic
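
For illustration, here is a minimal sketch of how the input/result abstractions fit together. The member names below are simplified placeholders, not the exact definitions in llama-graph.h:

```cpp
#include <memory>

struct ggml_tensor;   // opaque ggml tensor handle
struct llama_ubatch;  // micro-batch being processed

// Each graph input owns its tensor(s) and knows how to populate them from the
// current batch, instead of the context poking at tensors found by name.
class llm_graph_input_i {
public:
    virtual ~llm_graph_input_i() = default;
    virtual void set_input(const llama_ubatch * ubatch) = 0;
};

// The graph build returns the important output tensors directly, so callers
// no longer have to search the graph for tensors like "result_output".
class llm_graph_result_i {
public:
    virtual ~llm_graph_result_i() = default;
    virtual ggml_tensor * get_logits() = 0;
    virtual ggml_tensor * get_embd()   = 0;
};

using llm_graph_result_ptr = std::unique_ptr<llm_graph_result_i>;
```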

API changes

These API changes are only to make the naming convention more consistent. To migrate, simply replace the old calls with the new ones (see the example after this list).

  • Deprecate llama_kv_cache_... API
  • Add llama_kv_self_... API
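
Migration is a mechanical rename. For example (illustrative; consult llama.h for the complete list of deprecated/new pairs):

```cpp
// before (now deprecated):
llama_kv_cache_clear(ctx);
llama_kv_cache_seq_rm(ctx, seq_id, p0, p1);

// after:
llama_kv_self_clear(ctx);
llama_kv_self_seq_rm(ctx, seq_id, p0, p1);
```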

Next

  • Introduce a new model-arch interface and have the different models implement it
  • Add a new llama_kv_cache_recurrent class, moving all recurrent logic out of the existing llama_kv_cache_unified and simplifying it (see the sketch after this list)
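
As a rough sketch of that direction (the method set here is hypothetical; the actual interface is llm_memory_i in the source), the split could look like:

```cpp
#include "llama.h" // for llama_seq_id, llama_pos

// Common memory interface; concrete caches implement it.
class llm_memory_i {
public:
    virtual ~llm_memory_i() = default;
    virtual void clear() = 0;
    virtual bool seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
};

// Standard attention KV cache, with the recurrent special cases removed.
class llama_kv_cache_unified : public llm_memory_i { /* ... */ };

// State cache for recurrent architectures (e.g. Mamba, RWKV).
class llama_kv_cache_recurrent : public llm_memory_i { /* ... */ };
```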

@github-actions bot added the android, examples, python, and server labels (Mar 4, 2025)
@ggerganov force-pushed the gg/llama-kv-cache-v2 branch 7 times, most recently from 766edbf to 62ba774 (March 7, 2025 11:20)
@ggerganov marked this pull request as ready for review (March 7, 2025 11:26)
@ggerganov requested a review from ngxson as a code owner (March 7, 2025 11:26)
ggerganov (Member, Author) commented:

Planning to merge this tomorrow unless there are any suggestions for improvements.

@ggerganov force-pushed the gg/llama-kv-cache-v2 branch from 62ba774 to a170669 (March 11, 2025 11:53)