sync : llama.cpp #1089

ggerganov · 2025-01-29T09:27:02Z

No description provided.

* fix: ggml: fix vulkan-shaders-gen build
  The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host.
* refactor: ggml: Improve vulkan-shaders-gen toolchain setup
  - Add GGML_SHADERS_GEN_TOOLCHAIN CMake option.
  - Auto-detect host toolchain if not set.
* refactor: ggml: Improve vulkan-shaders-gen toolchain setup
  Use configure_file to generate host_toolchain.cmake from template
* fix: ggml: Fix compile error
  Fix compile error not finding vulkan-shaders-gen
* fix: vulkan-shaders-gen build and path handling
  Fix build issues with vulkan-shaders-gen:
  - Add target dependency for correct build order
  - Use CMAKE_HOST_SYSTEM_NAME for executable suffix
  - Fix MSVC output directory in host toolchain
  - Normalize path handling for cross-compilation
* fix: improve host compiler detection in vulkan shader build
  Improve host compiler detection for vulkan shader generation:
  - Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches
  - Consolidate compiler detection logic
  - Fix Windows-specific MSVC detection
  - Ensure correct compiler search in cross-compilation
* refactor: Simplify CMake function for detecting host compiler
  Simplified the CMake function to improve the process of detecting the host compiler.
* fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt
  Since `vulkan-shader-gen.cpp` only requires the `glslc` executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: ecc93d0558fc3ecb8a5af69d2ece02fae4710ade)
* refactor: Rename host_toolchain.cmake.in
  - Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in
* refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
  Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN

* q6_k scale caching
* 16 bit unpack
* q4_k test (slow)
* revert it
* q3_k
* q2_k
* little stuff
* try precalculating products of a and q2_k scales
* Revert "try precalculating products of a and q2_k scales"
  This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.
* unpack should be u16, add vim swap to gitignore (about time)
* better q4_k scales
* q5_k
* better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations
* q2_k better dequant
* q3_k optimizations
* q3_k use hmask simd from cpu avx version
* make the caches happy
* q3_k separate out calculation
* q2_k separate out
* little stuff
* use calc_superblock everywhere
* q2_k optimize scale calculation
* more barriers

…11227)
* Add SVE support for q4_K_q8_K
* Update ggml/src/ggml-cpu/ggml-cpu-quants.c
  change to use K_SCALE_SIZE
  Co-authored-by: Georgi Gerganov <[email protected]>
---------
Co-authored-by: Georgi Gerganov <[email protected]>

…ma/11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
  Shaders are based on cpy.cu.
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination

* cmake : add sanitizer flags for llama.cpp ggml-ci
* tests : fix compile warnings ggml-ci
* cmake : move sanitizer flags to llama_add_compile_flags ggml-ci
* cmake : move llama.cpp compile flags to top level lists ggml-ci
* cmake : apply only sanitizer flags at top level ggml-ci
* tests : fix gguf context use in same_tensor_data
* gguf-test: tensor data comparison
* dummy : trigger ggml-ci
* unicode : silence gcc warnings ggml-ci
* ci : use sanitizer builds only in Debug mode ggml-ci
* cmake : add status messages [no ci]
---------
Co-authored-by: Johannes Gäßler <[email protected]>

* Implement host pool for matrix_info
  Creating a new memory pool on the host to store memory location for matrix_info needed to launch gemm_batch from oneMKL/oneMath. Removing complex support in gemm_batch since it is not used in llama.cpp
* Remove unnecessary headers and cast
* Reorder member variable to avoid warning on initialization
* Formatting
* Remove unused variable
* Address PR review feedback - remove warning
---------
Signed-off-by: nscipione <[email protected]>

mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert; this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.

With robustBufferAccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.

…11366)
See https://reproducible-builds.org/ for why this is good and https://reproducible-builds.org/specs/source-date-epoch/ for the definition of this variable.
Without this patch, compiling on different machines produced different binaries, which made verification of results difficult.
Fixes: #11317
This patch was done while working on reproducible builds for openSUSE.
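
For readers unfamiliar with the variable linked above: the specification defines SOURCE_DATE_EPOCH as a Unix timestamp that build tools should use in place of the current time. The following is a minimal sketch of that general convention only, not of what this patch changes; the helper name `build_timestamp` is hypothetical.

```cpp
#include <cstdlib>
#include <ctime>

// Hypothetical helper: picks the timestamp a build step should embed.
// A valid, non-negative integer in source_date_epoch wins; anything else
// falls back to the current time (which makes the output non-reproducible).
std::time_t build_timestamp(const char * source_date_epoch) {
    if (source_date_epoch != nullptr && *source_date_epoch != '\0') {
        char * end = nullptr;
        const long long value = std::strtoll(source_date_epoch, &end, 10);
        if (*end == '\0' && value >= 0) {
            return static_cast<std::time_t>(value);
        }
    }
    return std::time(nullptr);
}

// Convenience overload that reads the environment variable itself.
std::time_t build_timestamp() {
    return build_timestamp(std::getenv("SOURCE_DATE_EPOCH"));
}
```

With the variable exported to a fixed value, two machines calling `build_timestamp()` would embed the same time, which is the property a reproducible build needs.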

* Add initial ggml cmake package
* Add build numbers to ggml find-package
* Expand variables with GGML_ prefix
* Guard against adding to cache variable twice
* Add git to msys2 workflow
* Handle ggml-cpu-* variants
* Link ggml/ggml-base libraries to their targets
* Replace main-cmake-pkg with simple-cmake-pkg
* Interface features require c_std_90
* Fix typo
* Removed unnecessary bracket from status message
* Update examples/simple-cmake-pkg/README.md
  Co-authored-by: Georgi Gerganov <[email protected]>
* Update examples/simple-cmake-pkg/README.md
  Co-authored-by: Georgi Gerganov <[email protected]>
---------
Co-authored-by: Georgi Gerganov <[email protected]>

Implemented F16 src1 (mask) support in ggml_sycl_op_soft_max(), for which a pragma deprecation warning was added during #5021. To do this, it had to be decoupled from ggml_sycl_op_flatten, which always considered src1 to be of fp32 type (many OP functions depend on it).
* SYCL: SOFTMAX F16 mask support and other fixes
* test-backend-ops: Add F16 mask test cases

qnixsynapse and others added 30 commits January 29, 2025 11:26

SYCL: Add gated linear attention kernel (llama/11175)

28ac740

* SYCL: Add Gated Linear attention kernel
* glahpp: add a space at the end of file
* gla: Put the barrier inside the main logic loop

RoPE: fix back, CUDA support for back + noncont. (llama/11240)

904a095

* RoPE: fix back, CUDA support for back + noncont.
* fix comments reg. non-cont. RoPE support [no-ci]

CUDA: backwards pass for misc. ops, add tests (llama/11257)

9b5d224

* CUDA: backwards pass for misc. ops, add tests
* remove restrict from pointers

vulkan: optimize coopmat2 q2_k dequant function (llama/11130)

934f7ec

vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (llama/11206)

be82ddf

Do masking on whole dwords, fetch all scales at once.

rpc : early register backend devices (llama/11262)

72a5ae0

Early register RPC devices and do not propagate RPC specifics in the llama model structures. ref: #10609

vulkan: fix coopmat2 flash attention for non-contiguous inputs (llama…

782151b

…/11281) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.

metal : fix out-of-bounds write (llama/11314)

4ec5cf7

ggml-ci

rpc : better caching of the base buffer pointer (llama/11331)

54c1e2f

There is no need to use map, just store the base pointer in the buffer context.
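
The idea of keeping the base pointer in the buffer context rather than in a separate map can be sketched as below. This is an illustration under assumptions, not the actual RPC backend code: `buffer_context`, `get_base` and the `fetch_remote` callback (standing in for the round trip to the server) are all hypothetical names.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical buffer context: the cached base pointer lives next to the
// buffer's other state, so no lookup keyed by buffer handle is needed.
struct buffer_context {
    uint64_t remote_handle = 0;
    void *   base_ptr      = nullptr;  // nullptr means "not fetched yet"
};

// Returns the base pointer, asking the remote side only on the first call.
void * get_base(buffer_context & ctx,
                const std::function<void *(uint64_t)> & fetch_remote) {
    if (ctx.base_ptr == nullptr) {
        ctx.base_ptr = fetch_remote(ctx.remote_handle);
    }
    return ctx.base_ptr;
}
```

Because the cache shares the lifetime of the context, it is dropped together with the buffer and needs no separate cleanup step.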

vulkan: sort shaders for more deterministic binary (llama/11315)

4bec73f

Fixes #11306.
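
As a rough illustration of why sorting helps determinism: if entries are emitted in whatever order they were produced, the generated file can differ between runs, while sorting first makes the output depend only on the set of names. The sketch below assumes a hypothetical `emit_shader_index` helper and is not the generator's actual code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical emit step: writes one line per shader name in sorted order,
// so the result is independent of the order in which names were collected.
std::string emit_shader_index(std::vector<std::string> names) {
    std::sort(names.begin(), names.end());
    std::string out;
    for (const std::string & name : names) {
        out += name;
        out += '\n';
    }
    return out;
}
```

Two runs that collect the same shaders in a different order then produce byte-identical output.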

Vulkan-run-test: fix mmq_wg_denoms (llama/11343)

05a0c1d

This looks like a copy-and-paste error. *mmq_wg_denoms should be used together with *warptile_mmq, instead of wg_denoms.

tests: fix some mul_mat test gaps (llama/11375)

027cb54

Now that we have batched mat-vec mul Vulkan shaders for up to n==8, these tests weren't actually exercising the mat-mat mul path. Test n==9 as well. Also, change to use all_types.

CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380)

cd3261d

rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (llama/11356)

8c59363

CUDA: fix FP16 cuBLAS GEMM (llama/11396)

7957144

hip : Add hipGraph and VMM support to ROCM (llama/11362)

c8e6c31

* Add hipGraph support
* Enable VMM on rocm

Hip: disable VMM on hip as it seams that it dosent work in some confi…

2a0f12c

…gurations (llama/11420)

vulkan: compile shaders on-demand (llama/11406)

d886b3b

Reduce first-run startup time and memory consumption. Should fix #11339.
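
On-demand compilation in general means registering every pipeline up front but paying the compile cost only when one is first requested. The following is a minimal single-threaded sketch of that pattern; `pipeline`, `pipeline_cache` and the `compile` callback are hypothetical and do not mirror the Vulkan backend's real types.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical pipeline record; a real one would also hold shader handles.
struct pipeline {
    std::string name;
    bool        compiled = false;
};

class pipeline_cache {
public:
    explicit pipeline_cache(std::function<void(pipeline &)> compile)
        : compile_(std::move(compile)) {}

    // Registering is cheap: nothing is compiled here.
    void declare(const std::string & name) { pipelines_[name].name = name; }

    // The compile callback runs only the first time a pipeline is requested.
    pipeline & get(const std::string & name) {
        pipeline & p = pipelines_.at(name);
        if (!p.compiled) {
            compile_(p);
            p.compiled = true;
        }
        return p;
    }

private:
    std::function<void(pipeline &)> compile_;
    std::map<std::string, pipeline> pipelines_;
};
```

Pipelines that are never requested are never compiled, which is where the startup-time and memory savings would come from. A multi-threaded version would need a lock around `get`.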

metal : use residency sets (llama/11427)

a756ee5

* metal : use residency sets ggml-ci
* metal : restore commandBufferWithUnretainedReferences calls [no ci]
* metal : release descriptors ggml-ci
* metal : check env GGML_METAL_NO_RESIDENCY ggml-ci
* metal : fix build + clean-up ggml-ci

metal: Handle null returned from MTLCreateSystemDefaultDevice() (llam…

bc64584

…a/11441) This fixes a segmentation fault when running tests when no Metal devices are available (for example, when not linked with the Core Graphics framework or otherwise).

Haus1 and others added 8 commits January 29, 2025 11:26

AMD: parse the architecture as supplied by gcnArchName (llama/11244)

ec4d3e7

The value provided by minor doesn't include stepping for AMD; parse the value returned by gcnArchName instead to retrieve an accurate ID.
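
A sketch of what parsing such a name could look like follows. It assumes the usual AMD target naming, where a string such as `gfx90a:sramecc+:xnack-` consists of `gfx`, a decimal major version, one hex digit for the minor version and one hex digit for the stepping, optionally followed by `:`-separated feature flags. The `gcn_arch` struct and `parse_gcn_arch` function are hypothetical and are not the code added by this commit.

```cpp
#include <cctype>
#include <string>

// Hypothetical result type for the parsed architecture name.
struct gcn_arch {
    int arch_major;
    int arch_minor;
    int stepping;
};

// Parses names like "gfx1030" or "gfx90a:sramecc+:xnack-".
// Returns false if the name does not follow the assumed pattern.
bool parse_gcn_arch(const std::string & name, gcn_arch & out) {
    // Drop feature flags after the first ':' (find() returning npos keeps all).
    const std::string target = name.substr(0, name.find(':'));
    if (target.size() < 6 || target.compare(0, 3, "gfx") != 0) {
        return false;
    }
    const std::string digits    = target.substr(3);
    const std::string major_str = digits.substr(0, digits.size() - 2);
    if (major_str.size() > 3) {
        return false;
    }
    for (unsigned char c : major_str) {
        if (!std::isdigit(c)) {
            return false;
        }
    }
    const char minor_c    = digits[digits.size() - 2];
    const char stepping_c = digits[digits.size() - 1];
    if (!std::isxdigit(static_cast<unsigned char>(minor_c)) ||
        !std::isxdigit(static_cast<unsigned char>(stepping_c))) {
        return false;
    }
    out.arch_major = std::stoi(major_str);
    out.arch_minor = std::stoi(std::string(1, minor_c), nullptr, 16);
    out.stepping   = std::stoi(std::string(1, stepping_c), nullptr, 16);
    return true;
}
```

The stepping digit is exactly the part the commit message says is missing from the numeric minor field, which is why the name string is the more precise source.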

cmake : don't fail on GGML_CPU=OFF (llama/11457)

8ac9155

HIP: Only call rocblas_initialize on rocblas versions with the multip…

807b5f2

…le instantation bug (llama/11080) This disables the workaround on rocblas versions where the bug is fixed (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.

HIP: Supress transformation warning in softmax.cu

df283f7

Loops with bounds not known at compile time cannot be unrolled. When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM can't unroll the loops here.
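
The pattern being described can be sketched in plain C++ as follows. The function `row_sum` and its parameters are made up for illustration and are not taken from softmax.cu; the sketch only shows how a template value of 0 turns a compile-time trip count into a runtime one.

```cpp
// Hypothetical kernel-style helper: a non-zero template argument fixes the
// column count at compile time, while 0 means "use the runtime value".
template <int ncols_template>
float row_sum(const float * x, int ncols_runtime) {
    const int ncols = ncols_template == 0 ? ncols_runtime : ncols_template;
    float sum = 0.0f;
    // For ncols_template != 0 the trip count is a constant and the compiler
    // is free to unroll fully; for ncols_template == 0 it is only known at
    // run time, so an unroll request on this loop cannot be honored.
    for (int i = 0; i < ncols; ++i) {
        sum += x[i];
    }
    return sum;
}
```

Both instantiations compute the same result; only the optimization opportunities differ, which is why a diagnostic about a failed unroll in the runtime-bound case is harmless.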

sync : llama.cpp

5bcbe65

ggml-ci

scripts : sync cmake

5ce4ce2

cmake : sync new file

41ae935

ggml-ci

ggerganov merged commit 475e012 into master Jan 29, 2025
9 checks passed

ggerganov deleted the sync-llama.cpp-25-01-29 branch January 29, 2025 10:57