Releases: Project-HAMi/HAMi
v2.5.0
Major features:
- Support the dynamic MIG feature; see the dynamic MIG documentation
- Reinstalling HAMi will NOT crash GPU tasks
- All configuration now lives in a single ConfigMap; you can customize the HAMi installation by modifying its content (see details)
Major bug fixes:
- Fix an issue where hami-core gets stuck on tasks using 'cuMallocAsync'
- Fix hami-core getting stuck on images with a high glibc version, such as 'tf-serving:latest'
What's Changed
⬆️ Dependencies
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by @dependabot in #631
- Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by @dependabot in #676
- Bump actions/upload-artifact from 4.4.3 to 4.5.0 by @dependabot in #717
- Bump docker/build-push-action from 6.9.0 to 6.10.0 by @dependabot in #644
- Bump docker/build-push-action from 6.10.0 to 6.11.0 by @dependabot in #792
🔨 Other Changes
- Fix Kubernetes version string handling by stripping metadata by @Nimbus318 in #623
- Update vGPUmonitor to add dynamic adjustment on core and memory limit by @archlitchi in #624
- feat: support device plugin daemonset update strategy by @devenami in #628
- add ut about schedule policy by @yt-huang in #638
- Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch. by @haitwang-cloud in #626
- add ut for the scheduler by @shijinye in #645
- docs(issue-tmpl): add FAQ link to issue templates by @Nimbus318 in #647
- fix: filter device registry to node by @lengrongfu in #639
- Add self-hosted runner by @archlitchi in #659
- fix-example-yaml by @WQL782795 in #667
- update docs by @yangshiqi in #668
- add ut for ascend by @shijinye in #664
- optimization map init in test by @lengrongfu in #678
- Optimize monitor by @for800000 in #683
- fix code lint failed by @lengrongfu in #685
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by @Nimbus318 in #687
- fix vGPUmonitor deviceidx is always 0 by @lengrongfu in #684
- add ut for pkg/scheduler/event.go by @Penguin-zlh in #688
- add ut for nodes by @shijinye in #695
- add license for pkg/scheduler/event_test.go by @Penguin-zlh in #706
- fix: exception happen when creating multiple ascend-gpu pods concurrently by @lijm87 in #575
- add ut for device/nvidia by @shijinye in #657
- add ut for pkg/monitor/nvidia/v0/spec.go by @yt-huang in #670
- Enable Dynamic-mig feature for HAMi by @archlitchi in #708
- Fix chart can not be deployed properly by @archlitchi in #711
- Fix NodeLock issue by @archlitchi in #714
- fix example yaml by @lixd in #709
- add ut for device/cambricon by @shijinye in #712
- Update dynamic mig documents and examples by @archlitchi in #718
- random time may be zero by @shijinye in #697
- fix grafana dashboard and clarify dashboard usage more clearly. by @jiangsanyin in #543
- doc(README): add examples for GPU sharing and update-examples by @xiaoyao in #665
- add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go by @yt-huang in #673
- Add design document to 'dynamic-mig' feature by @archlitchi in #725
- fix(doc): fix a typo and resolve markdown warnings in the tasklist by @elrondwong in #724
- add ut for pkg/util/nodelock/nodelock.go by @learner0810 in #719
- test: add ut for pkg/version/version.go by @Penguin-zlh in #677
- Update on mig mode by @archlitchi in #726
- Update documents for config & config_cn by @archlitchi in #729
- set PASS_DEVICE_SPECS ENV to device-plugin by @jingzhe6414 in #690
- fix device-plugin-version by @learner0810 in #743
- feat: Return the nodes that failed to be scheduled back to the scheduler by @chaunceyjiang in #746
- fix(log): fix missing log output in nvidiadeviceplugin server by @elrondwong in #735
- support configuration resources limits and requests by @flpanbin in #739
- feat(test): add TestMarshalNodeDevices scenarios by @elrondwong in #747
- print flags for device-plugin and scheduler by @flpanbin in #756
- Fix typos, add more contributors and maintainers. by @yangshiqi in #765
- Add a mind map(Chinese and English) to help understand this project by @oceanweave in #764
- [Docs] update config pages by @windsonsea in #760
- add ut for device-map by @KubeKyrie in #762
- refactor(ci): use go.mod file for Go version in workflows by @yxxhero in #766
- support set log level for device plugin by @flpanbin in #771
- feat: Restart/Upgrade device-plugin will not affect services. by @chaunceyjiang in #767
- add ut nvml devices by @KubeKyrie in #773
- add ut for device-map by @KubeKyrie in #772
- Optimize the time format layout by @learner0810 in #741
- fix: nvidia-device-plugin no version info by @chaunceyjiang in #779
- HAMi supports e2e by @Rei1010 in #775
- Proposal: enable E2E test by @Rei1010 in #633
- add ut for device/iluvatar by @shijinye in #795
- add ut for device/hygon by @shijinye in #787
- add ut for pkg/monitor/nvidia/v1 by @shijinye in #780
- refactor(logging): enhance log messages for device resource counting by @haitwang-cloud in #778
- Enrich pod health check by @Rei1010 in #801
- docs: fix broken link by @lixd in #802
- Optimize the E2E execution logic by @Rei1010 in #803
- optimize MetricsBindAddress to MetricsBindPort by @phoenixwu0229 in #796
- fix: handle the node nil issue & E2E test failure by @haitwang-cloud in #804
- add ut for device/mthreads by @shijinye in #808
- fix: Resolve formatting issue in ConfigMap causing display anomalies by @lixd in #814
- [docs] Update ascend910b-support.md by @windsonsea in #816
- Refine metrics logs by @haitwang-cloud in #817
- Update mig-related logics and refine logs by @archlitchi in #833
- Add 910B4 config to device-configmap for ascend by @lijm87 in #828
- [docs] fix: glibc version requirement in README by @chinaran in #826
- Update HAMi-core for v2.5.0 by @archlitchi in #834
- Fix multi-process device memory count issue by @archlitchi in #835
- bump version to v2.5.0 by @wawa0210 in #836
- Fix CI by @archlitchi in #838
- Fix CI release by @archlitchi in #840
- Fix release ci by @archlitchi in #841
- Fix Dockerfile to make CI pass by @archlitchi in #846
- Fix E2E failure with pod status check by @Rei1010 in htt...
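One entry in the list above, the Kubernetes version string fix (#623), is easy to illustrate. Some distributions append build metadata after a `+` (for example, k3s reports versions such as `v1.28.3+k3s1`), and that suffix can trip up naive version parsing. The following is a minimal sketch of the idea only, written in Python for brevity; it is not HAMi's actual code, and the helper names `strip_build_metadata` and `parse_major_minor` are hypothetical.

```python
def strip_build_metadata(version: str) -> str:
    """Return the version without build metadata (the part after '+')."""
    # partition() yields the text before the first '+', or the whole
    # string unchanged when no '+' is present.
    core, _, _ = version.partition("+")
    return core


def parse_major_minor(version: str) -> tuple[int, int]:
    """Parse strings such as 'v1.28.3+k3s1' into (major, minor)."""
    # Hypothetical helper: drop metadata first, then the leading 'v'.
    core = strip_build_metadata(version).lstrip("v")
    major, minor, *_ = core.split(".")
    return int(major), int(minor)


print(strip_build_metadata("v1.28.3+k3s1"))  # v1.28.3
print(parse_major_minor("v1.28.3+k3s1"))     # (1, 28)
```

Stripping the metadata before comparison keeps version checks independent of the distribution that built the cluster.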
v2.4.1
Major Features:
- Support Metax scheduling optimization
- Support Mthreads sGPU
- Add a ConfigMap, hami-scheduler-device, that holds all HAMi configuration
- Optimize installation process
Details
⬆️ Dependencies
- Bump actions/download-artifact from 3 to 4 by @dependabot in #529
- Bump docker/build-push-action from 6.8.0 to 6.9.0 by @dependabot in #528
- Bump actions/upload-artifact from 3.1.3 to 4.4.0 by @dependabot in #530
- Bump aquasecurity/trivy-action from 0.24.0 to 0.27.0 by @dependabot in #546
- Bump actions/upload-artifact from 4.4.0 to 4.4.3 by @dependabot in #541
- Bump ubuntu from 20.04 to 24.04 in /docker by @dependabot in #394
- Bump aquasecurity/trivy-action from 0.27.0 to 0.28.0 by @dependabot in #559
- Bump codecov/codecov-action from 4 to 5 by @dependabot in #613
🔨 Other Changes
- fix build badge status by @wawa0210 in #526
- update action-gh-release template file to more accurate matching by @wawa0210 in #527
- Refactor helm "Admission Webhook" config. by @4gt-104 in #532
- fix: error happen when allocate iluvatar device by @lijm87 in #522
- Fix code scanning alert-Incorrect conversion between integer types by @ghostloda in #556
- update hami-core version by @chaunceyjiang in #557
- Mthreads support by @archlitchi in #560
- Fix code scanning alert-Incorrect conversion between integer types by @ghostloda in #561
- update docs by @ghostloda in #567
- migrate hami slack to cncf hami group by @wawa0210 in #568
- Fix pod assignment issue when pod already has a node assigned by @chaunceyjiang in #564
- fix(scheduler): prevent array out-of-bounds when GPU containers are placed between non-GPU containers by @Nimbus318 in #572
- improve pkg/k8sutil/pod.go ut coverage by @wawa0210 in #570
- Metax GPU topo-awareness support by @archlitchi in #574
- Add WebUI to readme and readme_cn.md by @archlitchi in #578
- remove watermark of MetaX topo diagrams by @obnah in #581
- update HAMi Talks and References by @wawa0210 in #582
- fix: assign to wrong devices when 1 pod has 2+ containers request GPU by @joy717 in #593
- docs: fix deployments path in README by @dublc in #608
- Add unified configMap and update charts by @archlitchi in #614
- Fix configMap device-config not properly installed by @archlitchi in #616
- fix CI: race condition error by @archlitchi in #618
- Pre release to v2.4.1 by @archlitchi in #619
New Contributors
- @4gt-104 made their first contribution in #532
- @lijm87 made their first contribution in #522
- @ghostloda made their first contribution in #556
- @Nimbus318 made their first contribution in #572
- @obnah made their first contribution in #581
- @dublc made their first contribution in #608
Full Changelog: v2.4.0...v2.4.1
v2.4.0
What's Changed
✨ New Features
- Support Huawei Ascend 910p for GA by @peizhaoyou in #389, https://github.com/Project-HAMi/ascend-device-plugin
- Support for multiple versions of cudevshr for vGPUmonitor by @zoyopei in #458
- Add filter device when register node by uuid or index by @lengrongfu in #495
- Support Ascend custom configuration file settings for NPU virtualization by @wawa0210 in #510
- Add event handlers registration by @wawa0210 in #417
- Official support for the ARM architecture
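The device filter from #495 in the list above can be pictured with a small sketch. It assumes the filter excludes matching devices from registration, which the PR title alone does not confirm; the `Device` type, the `filter_devices` function, and the parameter names are all hypothetical and are not taken from HAMi's Go code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    index: int
    uuid: str


def filter_devices(devices, skip_uuids=(), skip_indexes=()):
    """Keep only devices whose UUID and index are both absent from the skip lists."""
    skip_uuids = set(skip_uuids)
    skip_indexes = set(skip_indexes)
    return [
        d for d in devices
        if d.uuid not in skip_uuids and d.index not in skip_indexes
    ]


# Made-up device list for illustration.
devices = [Device(0, "GPU-aaa"), Device(1, "GPU-bbb"), Device(2, "GPU-ccc")]

# Exclude one device by UUID and another by index.
registered = filter_devices(devices, skip_uuids=["GPU-bbb"], skip_indexes=[2])
print([d.index for d in registered])  # [0]
```

Accepting either identifier is convenient because indexes are easy to read off a node, while UUIDs stay stable if the device ordering changes.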
🐛 Bug Fixes
- fixed go build image version by @chaunceyjiang in #405
- fix: fix duplicate resource keys in configmap by @devenami in #422
- fix data race when read pods info by @lengrongfu in #419
- fix OpenSSF Best Practices by @wawa0210 in #478
- fix CI and go-lint check error by @archlitchi in #486
- fix trivy scan failed by @wawa0210 in #504 and #507
- Fix HAMi image is too large and uses inappropriate base image by @wawa0210 in #508
- fix device configmap by @zoyopei in #494
- fix chart lint always running when charts has no change by @wawa0210 in #501
🔨 Other Changes
- Proposal: support GPU Utilization Metrics by @chaunceyjiang in #258
- disable PreferredAllocation by @lengrongfu in #415
- optimization code by @lengrongfu in #401
- add vgpu doc and update the readme. by @william-wang in #430
- add hami vulnerability scan and report by @wawa0210 in #433
- add CodeQL analysis by @wawa0210 in #432
- update hami logo by @lengrongfu in #456
- add node record pod info by @lengrongfu in #451
- support code Coverage Analytics by @wawa0210 in #473
- add dev branch ci & remove unused binary by @archlitchi in #487
- refactoring ascend device code by @zoyopei in #492
- Remake README.md by @archlitchi in #489
- support pr commit can also build images by @wawa0210 in #499
- Add HAMi release process by @wawa0210 in #520
New Contributors
- @william-wang made their first contribution in #430
- @devenami made their first contribution in #422
Full Changelog: v2.3.13...v2.4.0
hami-2.3
Heterogeneous AI Computing Virtualization Middleware