This repository provides an automated CI/CD process to convert, test, and deploy IBM Granite models, in safetensors format, from the ibm-granite
organization to versioned IBM GGUF collections in Hugging Face Hub under the ibm-research
organization. This includes:
- Target IBM models for format conversion
- GGUF Conversion & Quantization
- GGUF Verification Testing
- References
- Releasing GGUF model conversions & quantizations
Format conversions (i.e., GGUF) and quantizations will only be provided for model repositories canonically hosted in an official IBM Huggingface organization.
Currently, this includes the following organizations:
- ibm-granite
Additionally, only a select set of IBM models from these orgs. will be converted based upon the following general criteria:
- The IBM GGUF model needs to be referenced by an AI provider service as a "supported" model.
- The GGUF model is referenced by a public blog, tutorial, demo, or other public use case.
  - Specifically, if the model is referenced in the IBM Granite Snack Cookbook.
Select quantizations will only be made available when:
- A small form factor is justified:
  - e.g., reduced model size intended for running locally on small form-factor devices such as watches and mobile devices.
- Performance provides a significant benefit without compromising accuracy (or enabling hallucination).
Specifically, the following Granite model repositories are currently supported in GGUF format (by collection), with the supported quantizations listed for each.
Typically, this model category includes "instruct" models.
| HF (llama.cpp) Architecture | Source Repo. ID | Target Repo. ID |
|---|---|---|
| GraniteForCausalLM (gpt2) | ibm-granite/granite-3.2-2b-instruct | ibm-research |
| GraniteForCausalLM (gpt2) | ibm-granite/granite-3.2-8b-instruct | ibm-research |
- Supported quantizations: `fp16`, `Q2_K`, `Q3_K_L`, `Q3_K_M`, `Q3_K_S`, `Q4_0`, `Q4_1`, `Q4_K_M`, `Q4_K_S`, `Q5_0`, `Q5_1`, `Q5_K_M`, `Q5_K_S`, `Q6_K`, `Q8_0`
| HF (llama.cpp) Architecture | Source Repo. ID | Target Repo. ID |
|---|---|---|
| GraniteMoeForCausalLM (granitemoe) | ibm-granite/granite-guardian-3.2-3b-a800m | ibm-research |
| GraniteMoeForCausalLM (granitemoe) | ibm-granite/granite-guardian-3.2-5b | ibm-research |
- Supported quantizations: `fp16`, `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`
| HF (llama.cpp) Architecture | Source Repo. ID | Target Repo. ID |
|---|---|---|
| GraniteForCausalLM (granite), LlavaNextForConditionalGeneration | ibm-granite/granite-vision-3.2-2b | ibm-research |
- Supported quantizations: `fp16`, `Q4_K_M`, `Q5_K_M`, `Q8_0`
| HF (llama.cpp) Architecture | Source Repo. ID | Target Repo. ID |
|---|---|---|
| Roberta (roberta-bpe) | ibm-granite/granite-embedding-30m-english | ibm-research |
| Roberta (roberta-bpe) | ibm-granite/granite-embedding-125m-english | ibm-research |
| Roberta (roberta-bpe) | ibm-granite/granite-embedding-107m-multilingual | ibm-research |
| Roberta (roberta-bpe) | ibm-granite/granite-embedding-278m-multilingual | ibm-research |
- Supported quantizations: `fp16`, `Q8_0`
Note: Sparse model architecture (i.e., RobertaMaskedLM) is not currently supported; therefore, there is no conversion for ibm-granite/granite-embedding-30m-sparse.
- LoRA support is currently planned (no date).
The GGUF format is defined in the GGUF specification. The specification describes the structure of the file, how it is encoded, and what information is included.
Currently, the primary means to convert from HF SafeTensors format to GGUF is the canonical llama.cpp conversion tool convert-hf-to-gguf.py.
For example:
```
python llama.cpp/convert-hf-to-gguf.py ./<model_repo> --outfile output_file.gguf --outtype q8_0
```
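For finer-grained control over quantization types, a common two-step pattern is to first convert to an unquantized GGUF and then quantize it with llama.cpp's llama-quantize tool. A minimal sketch, assuming a locally built llama.cpp; the model and file names below are illustrative, not the exact commands used by this repository's workflows:

```bash
# 1. Convert the HF safetensors checkpoint to an unquantized (f16) GGUF
python llama.cpp/convert-hf-to-gguf.py ./granite-3.2-2b-instruct \
  --outfile granite-3.2-2b-instruct-f16.gguf --outtype f16

# 2. Quantize the f16 GGUF to a smaller type (e.g., Q4_K_M); the binary's
#    location depends on how llama.cpp was built (older builds name it "quantize")
./llama.cpp/llama-quantize granite-3.2-2b-instruct-f16.gguf \
  granite-3.2-2b-instruct-Q4_K_M.gguf Q4_K_M
```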
- https://github.com/ollama/ollama/blob/main/docs/import.md#quantizing-a-model
```
$ ollama create --quantize q4_K_M mymodel
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
creating new layer sha256:0853f0ad24e5865173bbf9ffcc7b0f5d56b66fd690ab1009867e45e7d2c4db0f
writing manifest
success
```
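Note that `ollama create --quantize` reads a Modelfile that points at an unquantized model. A minimal sketch, assuming an F16 GGUF produced by the conversion step above (the file and model names are illustrative assumptions):

```bash
# Write an illustrative Modelfile pointing at an unquantized F16 GGUF (file name is an assumption)
cat > Modelfile <<'EOF'
FROM ./granite-3.2-2b-instruct-f16.gguf
EOF

# Create the Ollama model, quantizing it to Q4_K_M during import
ollama create --quantize q4_K_M granite-3.2-2b-instruct -f Modelfile
```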
Note: The Ollama CLI tool only supports a subset of quantizations:
- (rounding): `q4_0`, `q4_1`, `q5_0`, `q5_1`, `q8_0`
- k-means: `q3_K_S`, `q3_K_M`, `q3_K_L`, `q4_K_S`, `q4_K_M`, `q5_K_S`, `q5_K_M`, `q6_K`
Note:
- Similar to the Ollama CLI, the web UI supports only a subset of quantizations.
As a baseline, each converted model MUST successfully be run in the following providers:
- llama.cpp - As the core implementation of the GGUF format, which is either a direct dependency of or utilized as forked code in nearly all downstream GGUF providers, testing is essential. Specifically, testing verifies that the model can be hosted using the `llama-server` service (see the smoke-test sketch below).
  - See the specific section on llama.cpp for more details on which version is considered "stable" and how the same version will be used in both conversion and testing.
- Ollama - As a key model service provider supported by higher-level frameworks and platforms (e.g., AnythingLLM, LM Studio, etc.), testing the ability to `pull` and `run` the model is essential (see the smoke-test sketch below).
Notes:
- The official Ollama Docker image ollama/ollama is available on Docker Hub.
- Ollama does not yet support sharded GGUF models.
  - "Ollama does not support this yet. Follow this issue for more info: ollama/ollama#5245"
- e.g., `ollama pull hf.co/Qwen/Qwen2.5-14B-Instruct-GGUF`
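The sketch below illustrates the kind of smoke test implied above for both providers. The model file, port, and Hugging Face repository names are assumptions for illustration only, not the exact artifacts or checks used by this repository's workflows:

```bash
# llama.cpp: host a converted model with llama-server and probe the
# OpenAI-compatible chat endpoint (model file and port are assumptions)
./llama.cpp/llama-server -m granite-3.2-2b-instruct-Q4_K_M.gguf --port 8080 &
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello."}]}'

# Ollama: pull and run a published GGUF directly from Hugging Face
# (the repository name below is an assumption)
ollama pull hf.co/ibm-research/granite-3.2-2b-instruct-GGUF
ollama run hf.co/ibm-research/granite-3.2-2b-instruct-GGUF "Say hello."
```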
- GGUF format
  - Huggingface: GGUF - describes the format and some of the header structure.
  - llama.cpp:
    - GGUF Quantization types (`ggml_ftype`) - ggml/include/ggml.h
    - GGUF Quantization types (`LlamaFileType`) - gguf-py/gguf/constants.py
- GGUF Examples
- GGUF tools
  - GGUF-my-repo - Hugging Face space to build your own quants. without any setup. (ref. by llama.cpp example docs.)
  - CISCai/gguf-editor - batch conversion tool for GGUF models in HF model repos.
- llama.cpp Tutorials
  - How to convert any HuggingFace Model to gguf file format? - using the `llama.cpp/convert-hf-to-gguf.py` conversion script.
- Ollama tutorials
  - Importing a model - includes Safetensors, GGUF.
  - Use Ollama with any GGUF Model on Hugging Face Hub
  - Using Ollama models from Langchain - This example uses the `gemma2` model supported by Ollama.
This repository uses GitHub workflows and actions to convert IBM Granite models hosted on Huggingface to GGUF format, quantize them, run build-verification tests on the resultant models, and publish them to target GGUF collections in IBM-owned Huggingface organizations (e.g., `ibm-research` and `ibm-granite`).
There are 3 types of releases that can be performed on this repository:
- Test (private) - releases GGUF models to a test (or private) repo. on Huggingface.
- Preview (private) - releases GGUF models to a GGUF collection within the `ibm-granite` HF organization for time-limited access to select IBM partners (typically for pre-release testing and integration).
- Public - releases GGUF models to a public GGUF collection within the `ibm-research` HF organization for general use.
Note: The Huggingface (HF) term "private" means that repos. and collections created in the target HF organization are visible only to organization contributors and are hidden from normal users.
Prior to "triggering" release workflows, some files need to be configured depending on the release type.
Project maintainers for this repo. are able to access the secrets (tokens) that are made available to the CI/CD release workflows/actions:
https://github.com/IBM/gguf/settings/secrets/actions
Secrets are used to authenticate with Github and Huggingface (HF) and are already configured for the `ibm-granite` and `ibm-research` HF organizations for "preview" and "public" release types.
For "test" (or private) builds, users can fork the repo. and add a repository secret named HF_TOKEN_TEST
with a token (value) created on their test (personal, private) HF organization account with appropriate privileges to allow write access to repos. and collections.
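For example, the secret can be added to a fork using the GitHub CLI (a sketch; the repository name and token value are placeholders):

```bash
# Add the HF_TOKEN_TEST repository secret to your fork (placeholders shown)
gh secret set HF_TOKEN_TEST --repo <your-user-or-org>/gguf --body "<your-hf-write-token>"
```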
Each release type has a collection mapping file that defines which model repositories are included, along with their titles, descriptions, and family designations. Family designations allow granular control over which model families are included in a release, which allows for "staggered" releases, typically by model architecture. These files are:
- Test: resources/json/granite-3.2/hf_collection_mapping_test_private.json
- Preview: resources/json/granite-3.2/hf_collection_mapping_preview_ibm_granite.json
- Public: resources/json/granite-3.2/hf_collection_mapping_release_ibm_research.json
Note: The version portion of the file path will vary depending on the IBM Granite release version (e.g., `granite-3.2`).
The JSON collection mapping files have the following structure using the "Public" release as an example:
```json
{
  "collections": [
    {
      "title": "Granite 3.2 Models (GGUF)",
      "description": "GGUF-formatted versions of IBM Granite 3.2 models. Licensed under the Apache 2.0 license.",
      "items": [
        {
          "type": "model",
          "family": "instruct",
          "repo_name": "granite-3.2-2b-instruct"
        },
        ...
        {
          "type": "model",
          "family": "vision",
          "repo_name": "granite-vision-3.2-2b"
        },
        ...
        {
          "type": "model",
          "family": "guardian",
          "repo_name": "granite-guardian-3.2-3b-a800m"
        },
        ...
        {
          "type": "model",
          "family": "embedding",
          "repo_name": "granite-embedding-30m-english"
        },
        ...
      ]
    }
  ]
}
```
Simply add a new object under the `items` array for each new IBM Granite repo. you want added to the corresponding (GGUF) collection.
Currently, the only HF item type supported is `model`, and valid families (which have supported workflows) include: `instruct` (language), `vision`, `guardian`, and `embedding`.
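For example, an entry for the 8B instruct model (listed above as a supported source repo.) would be shaped as follows; treat this as an illustrative sketch of the item structure rather than an exact excerpt from the shipped mapping files:

```json
{
  "type": "model",
  "family": "instruct",
  "repo_name": "granite-3.2-8b-instruct"
}
```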
Note: If you need to change the HF collection description, please be aware that HF limits this string to 150 characters or fewer.
Each release type has a corresponding (parent, master) workflow that configures and controls which model families (i.e., `instruct` (language), `vision`, `guardian`, and `embedding`) are executed for a given GitHub (tagged) release.
For example, a `3.2` versioned release uses the following files, which correspond to one of the release types (i.e., Test, Preview, or Public):
- Test: .github/workflows/granite-3.2-release-test.yml
- Preview: .github/workflows/granite-3.2-release-preview-ibm-granite.yml
- Public: .github/workflows/granite-3.2-release-ibm-research.yml
The YAML GitHub workflow files have a few environment variables that may need to be updated to reflect which collections, models, and quantizations should be included in the next GitHub (tagged) release. Using the "Public" release YAML file as an example:
```yaml
env:
  ENABLE_INSTRUCT_JOBS: false
  ENABLE_VISION_JOBS: false
  ENABLE_GUARDIAN_JOBS: true
  SOURCE_INSTRUCT_REPOS: "[
    'ibm-granite/granite-3.2-2b-instruct',
    ...
  ]"
  TARGET_INSTRUCT_QUANTIZATIONS: "[
    'Q4_K_M',
    ...
  ]"
  SOURCE_GUARDIAN_REPOS: "[
    'ibm-granite/granite-guardian-3.2-3b-a800m',
    ...
  ]"
  TARGET_GUARDIAN_QUANTIZATIONS: "[
    'Q4_K_M',
    ...
  ]"
  SOURCE_VISION_REPOS: "[
    'ibm-granite/granite-vision-3.2-2b',
    ...
  ]"
  TARGET_VISION_QUANTIZATIONS: "[
    'Q4_K_M',
    ...
  ]"
  ...
  COLLECTION_CONFIG: "resources/json/granite-3.2/hf_collection_mapping_release_ibm_research.json"
```
Note: The `COLLECTION_CONFIG` environment variable provides the relative path to the collection configuration file, which is located in the `resources/json` directory of the repository for the specific Granite release.
This section contains the steps required to successfully "trigger" a release workflow for one or more supported Granite model families (i.e., `instruct` (language), `vision`, `guardian`, and `embedding`).
- Click the Releases link from the right column of the repo. home page, which should be the URL https://github.com/IBM/gguf/releases.
- Click the "Draft a new release" button near the top of the releases page.
- Click the "Choose a tag" drop-down menu and enter a tag name that starts with one of the following strings, relative to which release type you want to "trigger":
  - Test: `test-v3.2`
  - Preview: `preview-v3.2`
  - Public: `v3.2`

  Treat these strings as "prefixes" to which you must append a unique build version. For example: `v3.2-rc-01` for release candidate version `01`.
- Click "Create a new tag: on publish" near the bottom of the drop-down list.
- By convention, add the same "tag" name you created in the previous step into the "Release title" entry field.
- Adjust the "Set as a pre-release" and "Set as the latest release" checkboxes to your desired settings.
- Click the "Publish release" button.
At this point, you can observe the CI/CD workflows being run by the GitHub service "runners". Please note that during heavy traffic times, assignment of a "runner" (for each workflow job) may take longer.
To observe the CI/CD process in action, navigate to the repository's Actions page (https://github.com/IBM/gguf/actions) and look for the name of the tag you entered for the release (above) in the workflow run title.
Note: It is common to occasionally see some jobs "fail" due to network or scheduling timeout errors. In these cases, you can go into the failed workflow run and click on the "Re-run failed jobs" button to re-trigger the failed job(s).
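If you prefer the command line, failed jobs can also be re-run with the GitHub CLI; a sketch (the run ID is a placeholder you can look up with `gh run list`):

```bash
# List recent workflow runs to find the ID of the failed run
gh run list --repo IBM/gguf --limit 10

# Re-run only the failed jobs of that run (run ID is a placeholder)
gh run rerun <run-id> --repo IBM/gguf --failed
```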