
gguf

This repository provides an automated CI/CD process to convert, test, and deploy IBM Granite models (published in safetensors format by the ibm-granite organization) to versioned IBM GGUF collections on the Hugging Face Hub under the ibm-research organization.

Target IBM models for format conversion

Format conversions (i.e., GGUF) and quantizations are only provided for model repositories canonically hosted in an official IBM Hugging Face organization.

Currently, this includes the following organizations:

Additionally, only a select set of IBM models from these organizations will be converted, based upon the following general criteria:

  • The IBM GGUF model needs to be referenced by an AI provider service as a "supported" model.

    • For example, a local AI provider service such as Ollama or a hosted service such as Replicate.
  • The GGUF model is referenced by a public blog, tutorial, demo, or other public use case.

Select quantizations will only be made available when:

  • Small form factor is justified:
    • e.g., reduced model size intended for running locally on small form-factor devices such as watches and mobile devices.
  • Performance provides a significant benefit without compromising accuracy (or increasing hallucination).

Supported IBM Granite models (GGUF)

Specifically, the following Granite model repositories are currently supported in GGUF format (grouped by collection), with their supported quantizations listed:

Language

Typically, this model category includes "instruct" models.

| HF (llama.cpp) Architecture | Source Repo. ID | Target Repo. ID |
| --- | --- | --- |
| GraniteForCausalLM (gpt2) | ibm-granite/granite-3.2-2b-instruct | ibm-research |
| GraniteForCausalLM (gpt2) | ibm-granite/granite-3.2-8b-instruct | ibm-research |
  • Supported quantizations: fp16, Q2_K, Q3_K_L, Q3_K_M, Q3_K_S, Q4_0, Q4_1, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K_M, Q5_K_S, Q6_K, Q8_0
Guardian
| HF (llama.cpp) Architecture | Source Repo. ID | Target Repo. ID |
| --- | --- | --- |
| GraniteMoeForCausalLM (granitemoe) | ibm-granite/granite-guardian-3.2-3b-a800m | ibm-research |
| GraniteMoeForCausalLM (granitemoe) | ibm-granite/granite-guardian-3.2-5b | ibm-research |
  • Supported quantizations: fp16, Q4_K_M, Q5_K_M, Q6_K, Q8_0
Vision
| HF (llama.cpp) Architecture | Source Repo. ID | Target Repo. ID |
| --- | --- | --- |
| GraniteForCausalLM (granite), LlavaNextForConditionalGeneration | ibm-granite/granite-vision-3.2-2b | ibm-research |
  • Supported quantizations: fp16, Q4_K_M, Q5_K_M, Q8_0
Embedding (dense)
| HF (llama.cpp) Architecture | Source Repo. ID | Target Repo. ID |
| --- | --- | --- |
| Roberta (roberta-bpe) | ibm-granite/granite-embedding-30m-english | ibm-research |
| Roberta (roberta-bpe) | ibm-granite/granite-embedding-125m-english | ibm-research |
| Roberta (roberta-bpe) | ibm-granite/granite-embedding-107m-multilingual | ibm-research |
| Roberta (roberta-bpe) | ibm-granite/granite-embedding-278m-multilingual | ibm-research |
  • Supported quantizations: fp16, Q8_0

Note: Sparse model architecture (i.e., RobertaMaskedLM) is not currently supported; therefore, there is no conversion for ibm-granite/granite-embedding-30m-sparse.

RAG LoRA support
  • LoRA support is currently planned (no date).

GGUF Conversion & Quantization

The GGUF format is defined in the GGUF specification. The specification describes the structure of the file, how it is encoded, and what information is included.

Currently, the primary means of converting from the HF safetensors format to GGUF is the canonical llama.cpp tool convert-hf-to-gguf.py.

For example:

python llama.cpp/convert-hf-to-gguf.py ./<model_repo> --outfile output_file.gguf --outtype q8_0
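
The convert script emits a GGUF at the precision given by --outtype (e.g., f16 or q8_0); the K-quantizations listed above (e.g., Q4_K_M) are typically produced in a second step with llama.cpp's llama-quantize tool. A minimal sketch, assuming llama.cpp has been built locally (the file names are illustrative, and the binary location depends on how llama.cpp was built):

# 1. Convert safetensors to GGUF at f16 precision (illustrative paths)
python llama.cpp/convert-hf-to-gguf.py ./granite-3.2-2b-instruct --outfile granite-3.2-2b-instruct-f16.gguf --outtype f16

# 2. Quantize the f16 GGUF down to Q4_K_M
./llama.cpp/build/bin/llama-quantize granite-3.2-2b-instruct-f16.gguf granite-3.2-2b-instruct-Q4_K_M.gguf Q4_K_M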

Alternatives (planned)

Ollama CLI
  • https://github.com/ollama/ollama/blob/main/docs/import.md#quantizing-a-model

    $ ollama create --quantize q4_K_M mymodel
    transferring model data
    quantizing F16 model to Q4_K_M
    creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
    creating new layer sha256:0853f0ad24e5865173bbf9ffcc7b0f5d56b66fd690ab1009867e45e7d2c4db0f
    writing manifest
    success
    

Note: The Ollama CLI tool only supports a subset of quantizations:
  • Rounding: q4_0, q4_1, q5_0, q5_1, q8_0
  • K-means: q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q5_K_M, q6_K

Hugging Face endorsed tool "ggml-org/gguf-my-repo"

Note:

  • Similar to the Ollama CLI, the web UI supports only a subset of quantizations.

GGUF Verification Testing

As a baseline, each converted model MUST run successfully with the following providers:

llama.cpp testing

llama.cpp - As the core implementation of the GGUF format, which is either a direct dependency of or forked into nearly all downstream GGUF providers, testing against it is essential. Specifically, tests verify that the model can be hosted using the llama-server service.
  • See the specific section on llama.cpp for more details on which version is considered "stable" and how the same version is used in both conversion and testing.
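
For reference, a minimal llama-server smoke test might look like the following (the model file name and prompt are illustrative; llama-server exposes a health endpoint and an OpenAI-compatible HTTP API):

# start the server on a local port with a converted model
llama-server -m ./granite-3.2-2b-instruct-Q4_K_M.gguf --port 8080

# in a second shell: check server health, then request a chat completion
curl http://localhost:8080/health
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello."}]}'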

Ollama testing

Ollama - As a key model service provider supported by higher-level frameworks and platforms (e.g., AnythingLLM, LM Studio, etc.), testing the ability to pull and run the model is essential.
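
As an illustration, a published GGUF repository can be pulled and run directly from Hugging Face by name (the hf.co/<org>/<repo> form below is a placeholder, not an actual published repository):

# pull a (single-file) GGUF model hosted on Hugging Face and run a quick prompt
ollama pull hf.co/<target_org>/<gguf_model_repo>
ollama run hf.co/<target_org>/<gguf_model_repo> "Say hello."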

Notes

  • The official Ollama Docker image ollama/ollama is available on Docker Hub.
  • Ollama does not yet support sharded GGUF models
    • "Ollama does not support this yet. Follow this issue for more info: ollama/ollama#5245"
    • e.g., ollama pull hf.co/Qwen/Qwen2.5-14B-Instruct-GGUF

Releasing GGUF model conversions & quantizations

This repository uses GitHub workflows and actions to convert IBM Granite models hosted on Hugging Face to GGUF format, quantize them, run build-verification tests on the resulting models, and publish them to target GGUF collections in IBM-owned Hugging Face organizations (e.g., ibm-research and ibm-granite).

Types of releases

There are 3 types of releases that can be performed on this repository:

  1. Test (private) - releases GGUF models to a test (or private) repo. on Huggingface.
  2. Preview (private) - releases GGUF models to a GGUF collection within the ibm-granite HF organization for time-limited access to select IBM partners (typically for pre-release testing and integration).
  3. Public - releases GGUF models to a public GGUF collection within the ibm-research HF organization for general use.

Note: The Hugging Face (HF) term "private" means that repos. and collections created in the target HF organization are visible only to organization contributors and hidden from normal users.

Configuring a release

Prior to "triggering" release workflows, some files need to be configured depending on the release type.

GitHub secrets

Project maintainers for this repo. are able to access the secrets (tokens) that are made available to the CI/CD release workflows/actions:

https://github.com/IBM/gguf/settings/secrets/actions

Secrets are used to authenticate with GitHub and Hugging Face (HF) and are already configured for the ibm-granite and ibm-research HF organizations for the "preview" and "public" release types.

For "test" (or private) builds, users can fork the repo. and add a repository secret named HF_TOKEN_TEST whose value is a token created on their test (personal, private) HF account, with sufficient privileges to write to repos. and collections.
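
For example, the secret can be added to a fork using the GitHub CLI (the fork name and token value below are illustrative placeholders):

# add your Hugging Face write token as a repository secret on your fork
gh secret set HF_TOKEN_TEST --repo <your-github-id>/gguf --body "<your-hf-write-token>"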

Collection mapping files (JSON)

Each release type has a collection mapping file that defines which model repositories are included, along with their titles, descriptions, and family designations. Family designations allow granular control over which model families are included in a release, which enables "staggered" releases, typically by model architecture. These files are:

Note: The version portion of the file path will vary depending on IBM Granite release version (e.g., granite-3.2).

What to update

The JSON collection mapping files have the following structure, using the "Public" release as an example:

{
    "collections": [
        {
            "title": "Granite 3.2 Models (GGUF)",
            "description": "GGUF-formatted versions of IBM Granite 3.2 models. Licensed under the Apache 2.0 license.",
            "items": [
                {
                    "type": "model",
                    "family": "instruct",
                    "repo_name": "granite-3.2-2b-instruct"
                },
                ...
                {
                    "type": "model",
                    "family": "vision",
                    "repo_name": "granite-vision-3.2-2b"
                },
                ...
                {
                    "type": "model",
                    "family": "guardian",
                    "repo_name": "granite-guardian-3.2-3b-a800m"
                },
                ...
                {
                    "type": "model",
                    "family": "embedding",
                    "repo_name": "granite-embedding-30m-english"
                },
                ...
            ]
        }
    ]
}

Simply add a new object under the items array for each new IBM Granite repo. you want added to the corresponding (GGUF) collection.

Currently, the only HF item type supported is model and valid families (which have supported workflows) include: instruct (language), vision, guardian and embedding.

Note: If you need to change the HF collection description, please know that HF limits this string to 150 chars. or less.

Release workflow files

Each release type has a corresponding (parent, master) workflow that configures and controls which model families (i.e., instruct (language), vision, guardian and embedding) are processed for a given GitHub (tagged) release.

For example, a 3.2-versioned release uses the following files, which correspond to one of the release types (i.e., Test, Preview or Public):

What to update

The YAML GitHub workflow files have a few environment variables that may need to be updated to reflect which collections, models, and quantizations should be included in the next GitHub (tagged) release. Using the "Public" release YAML file as an example:

env:
  ENABLE_INSTRUCT_JOBS: false
  ENABLE_VISION_JOBS: false
  ENABLE_GUARDIAN_JOBS: true
  SOURCE_INSTRUCT_REPOS: "[
    'ibm-granite/granite-3.2-2b-instruct',
    ...
  ]"
  TARGET_INSTRUCT_QUANTIZATIONS: "[
    'Q4_K_M',
    ...
  ]"
  SOURCE_GUARDIAN_REPOS: "[
    'ibm-granite/granite-guardian-3.2-3b-a800m',
    ...
  ]"
  TARGET_GUARDIAN_QUANTIZATIONS: "[
    'Q4_K_M',
    ...
  ]"
  SOURCE_VISION_REPOS: "[
    'ibm-granite/granite-vision-3.2-2b',
    ...
  ]"
  TARGET_VISION_QUANTIZATIONS: "[
    'Q4_K_M',
    ...
  ]"
  ...
  COLLECTION_CONFIG: "resources/json/granite-3.2/hf_collection_mapping_release_ibm_research.json"

Note: The COLLECTION_CONFIG env. var. provides the relative path to the collection configuration file, which is located in the resources/json directory of the repository for the specific Granite release.

Triggering a release

This section contains the steps required to successfully "trigger" a release workflow for one or more supported Granite model families (i.e., instruct (language), vision, guardian and embedding).

  1. Click the Releases link in the right column of the repo. home page, which should take you to https://github.com/IBM/gguf/releases.

  2. Click the "Draft a new release" button near the top of the releases page.

  3. Click the "Choose a tag" drop-down menu and enter a tag name that starts with one of the following strings, depending on which release type you want to "trigger":

    • Test: test-v3.2
    • Preview: preview-v3.2
    • Public: v3.2

    Treat these strings as "prefixes" to which you must append a unique build version. For example:

    • v3.2-rc-01 for a release candidate version 01
  4. Select "Create a new tag: on publish" near the bottom of the drop-down list.

  5. By convention, add the same "tag" name you created in the previous step into the "Release title" entry field.

  6. Adjust the "Set as a pre-release" and "Set as the latest release" checkboxes to your desired settings.

  7. Click the "Publish release" button.
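
Equivalently, the release (and its tag) can be created from the command line with the GitHub CLI, which also publishes it and therefore triggers the same workflows (the tag name below is illustrative):

# create and publish a pre-release with a tag that follows the "test" prefix convention
gh release create test-v3.2-rc-01 --repo IBM/gguf --title "test-v3.2-rc-01" --prerelease --notes "GGUF conversion test release"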

At this point, you can observe the CI/CD workflows being run by the GitHub service "runners". Please note that during heavy traffic times, assignment of a "runner" (for each workflow job) may take longer.

To observe the CI/CD process in action, navigate to the repository's Actions page and look for the name of the tag you entered for the release (above) in the workflow run title.
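
The same workflow runs can also be followed from the command line with the GitHub CLI, assuming it is installed and authenticated (the run ID is a placeholder taken from the list output):

# list recent workflow runs for this repository, then stream the logs of one run
gh run list --repo IBM/gguf --limit 10
gh run watch <run-id> --repo IBM/gguf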

Note: It is common to occasionally see some jobs "fail" due to network or scheduling timeout errors. In these cases, you can go into the failed workflow run and click on the "Re-run failed jobs" button to re-trigger the failed job(s).