smartmontools broken for some hw (Current 7.4 request to move back to 7.2) #21077

Open
scottpaulsen opened this issue Dec 6, 2024 · 3 comments
@scottpaulsen

Description

The current version of smartmontools (7.4) does not work with all HW. Version 7.2, which shipped in the 202311 SONiC image, worked fine with our HW.

The specific drive model that fails is an NVMe drive:
Model Number: HFS480GEJ8X176N

Steps to reproduce the issue:

  1. smartctl -a /dev/nvme0n1 ;# Returns 4 (see "smartctl rc 4" in the problem output) instead of 0 and bails out when reading the self-test log.
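For readers interpreting the return code: the smartctl exit status is a bitmask, so the `smartctl rc 4` in the problem output means only bit 2 is set. Below is a minimal sketch of a decoder, assuming the bit meanings from the EXIT STATUS section of the smartctl man page (paraphrased here). The names `SMARTCTL_STATUS_BITS` and `decode_smartctl_status` are hypothetical helpers, not part of smartmontools or SONiC.

```python
# Bit meanings paraphrased from the smartctl man page (EXIT STATUS section).
SMARTCTL_STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed, no IDENTIFY data, or device in low-power mode",
    2: "a SMART or other ATA command to the disk failed, or there was a checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "status OK, but some attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_smartctl_status(rc):
    """Return the description of every bit set in a smartctl exit status."""
    return [msg for bit, msg in SMARTCTL_STATUS_BITS.items() if rc & (1 << bit)]


if __name__ == "__main__":
    # rc 4 is the value captured in this issue.
    for reason in decode_smartctl_status(4):
        print(reason)
```

An exit status of 4 therefore indicates that a command to the drive failed, not that the drive itself was reported as failing (that would be bit 3).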

Describe the results you received:

Describe the results you expected:

smartmontools should be downgraded to 7.2 or moved to a current 7.5 build.

Output of show version:

root@sonic:/opt/cisco/etc/sonic# show version

SONiC Software Version: SONiC.mckenzie-dev_202405.0-dirty-20241206.102347
SONiC OS Version: 12
Distribution: Debian 12.8
Kernel: 6.1.0-22-2-amd64
Build commit: 31089c683
Build date: Fri Dec 6 18:47:19 UTC 2024
Built by: scott@vxr-slurm-255

Platform: x86_64-85_rp_o-r0
HwSKU: Cisco-85-RP-O
ASIC: cisco-8000
ASIC Count: 1
Serial Number: FLM282802LK
Model Number: 85-RP-O
Hardware Revision: 0.3
Uptime: 18:21:57 up 1:12, 1 user, load average: 3.48, 3.28, 3.19
Date: Mon 04 Nov 2024 18:21:57

Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-platform-monitor latest 9e3c6075cb3d 460MB
docker-platform-monitor mckenzie-dev_202405.0-dirty-20241206.102347 9e3c6075cb3d 460MB
docker-snmp latest 5ea91caa41de 375MB
docker-snmp mckenzie-dev_202405.0-dirty-20241206.102347 5ea91caa41de 375MB
docker-dhcp-relay latest 6fff90024fb5 340MB
docker-macsec latest cfc4248c3ed7 362MB
docker-eventd latest 9bd4289f7d18 331MB
docker-eventd mckenzie-dev_202405.0-dirty-20241206.102347 9bd4289f7d18 331MB
docker-gbsyncd-cisco latest 1b8e755098fc 391MB
docker-gbsyncd-cisco mckenzie-dev_202405.0-dirty-20241206.102347 1b8e755098fc 391MB
docker-fpm-frr latest 869549655343 391MB
docker-fpm-frr mckenzie-dev_202405.0-dirty-20241206.102347 869549655343 391MB
docker-nat latest 52a617932035 362MB
docker-nat mckenzie-dev_202405.0-dirty-20241206.102347 52a617932035 362MB
docker-sflow latest 10f9c43c13a2 360MB
docker-sflow mckenzie-dev_202405.0-dirty-20241206.102347 10f9c43c13a2 360MB
docker-orchagent latest dcc8958c9627 372MB
docker-orchagent mckenzie-dev_202405.0-dirty-20241206.102347 dcc8958c9627 372MB
docker-sonic-mgmt-framework latest 3b9d0ef54431 418MB
docker-sonic-mgmt-framework mckenzie-dev_202405.0-dirty-20241206.102347 3b9d0ef54431 418MB
docker-teamd latest 85bb0d0538d7 359MB
docker-teamd mckenzie-dev_202405.0-dirty-20241206.102347 85bb0d0538d7 359MB
docker-router-advertiser latest a16b1ff34dfb 331MB
docker-router-advertiser mckenzie-dev_202405.0-dirty-20241206.102347 a16b1ff34dfb 331MB
docker-lldp latest 17ea44604be6 377MB
docker-lldp mckenzie-dev_202405.0-dirty-20241206.102347 17ea44604be6 377MB
docker-database latest d814df62760d 339MB
docker-database mckenzie-dev_202405.0-dirty-20241206.102347 d814df62760d 339MB
docker-sonic-gnmi latest df9565c0a9eb 415MB
docker-sonic-gnmi mckenzie-dev_202405.0-dirty-20241206.102347 df9565c0a9eb 415MB
docker-mux latest ac37bd65a4d7 383MB
docker-mux mckenzie-dev_202405.0-dirty-20241206.102347 ac37bd65a4d7 383MB
docker-ipxeserver-cisco latest 621a752c8ee8 353MB
docker-ipxeserver-cisco mckenzie-dev_202405.0-dirty-20241206.102347 621a752c8ee8 353MB
docker-syncd-cisco latest 13b58b3604ca 1.1GB
docker-syncd-cisco mckenzie-dev_202405.0-dirty-20241206.102347 13b58b3604ca 1.1GB

root@sonic:/opt/cisco/etc/sonic#

Output of show techsupport:

Problem output:

vxr-slurm-255:~> cat /nobackup/scott/scott-smart
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.0-22-2-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: HFS480GEJ8X176N
Serial Number: ****
Firmware Version: 51090A30
PCI Vendor/Subsystem ID: 0x1c5c
IEEE OUI Identifier: 0xace42e
Total NVM Capacity: 480,103,981,056 [480 GB]
Unallocated NVM Capacity: 0
Controller ID: 0
NVMe Version: 1.4
Number of Namespaces: 16
Namespace 1 Size/Capacity: 480,103,981,056 [480 GB]
Namespace 1 Utilization: 11,587,231,744 [11.5 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: ace42e 00452d0b0e
Local Time is: Mon Nov 4 18:36:30 2024 UTC
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x065f): Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec Get_LBA_Sts Lockdown
Optional NVM Commands (0x00df): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x7e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg Log0_FISE_MI Telmtry_Ar_4
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 74 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Namespace 1 Features (0x12): NA_Fields NP_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.25W    0.00W       -    0  0  0  0    30000   30000
 1 +     7.00W    0.00W       -    1  1  1  1    30000   30000
 2 +     6.00W    0.00W       -    2  2  2  2    30000   30000
 3 +     5.00W    0.00W       -    3  3  3  3    30000   30000
 4 -     5.00W       -        -    3  3  3  3    30000   30000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 34 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 13,967,376 [7.15 TB]
Data Units Written: 843,112 [431 GB]
Host Read Commands: 56,028,524
Host Write Commands: 10,870,631
Controller Busy Time: 43
Power Cycles: 67
Power On Hours: 1,762
Unsafe Shutdowns: 65
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 29 Celsius
Temperature Sensor 2: 39 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x002)

smartctl rc 4
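In the output above, the overall health result is PASSED and the only failure is the self-test log read. A monitoring script could separate those two signals as in the sketch below. This is only an illustration under stated assumptions: `summarize_smartctl` is a hypothetical helper, the regular expressions are written against the text format shown in this issue, and the bit masks follow the smartctl man page (bit 2 = command failed, bit 3 = DISK FAILING).

```python
import re


def summarize_smartctl(output, rc):
    """Summarize smartctl text output and exit status (hypothetical helper)."""
    health = re.search(r"self-assessment test result:\s*(\w+)", output)
    # Lines such as "Read Self-test Log failed: Invalid Field in Command (0x002)"
    failed_reads = re.findall(r"^Read (.+?) failed: (.+)$", output, flags=re.MULTILINE)
    return {
        "health": health.group(1) if health else None,
        "failed_log_reads": failed_reads,
        "command_failed_bit": bool(rc & 0x04),  # bit 2 of the exit status
        "disk_failing_bit": bool(rc & 0x08),    # bit 3 of the exit status
    }


if __name__ == "__main__":
    sample = (
        "SMART overall-health self-assessment test result: PASSED\n"
        "\n"
        "Read Self-test Log failed: Invalid Field in Command (0x002)\n"
    )
    print(summarize_smartctl(sample, 4))
```

Whether a caller should treat a failed log read on an otherwise healthy drive as non-fatal is a policy decision for the platform code; the sketch only makes the distinction visible.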


Additional information you deem important (e.g. issue happens only occasionally):

With our HW it fails 100% of the time; with the 202311 image loaded, it passes.

@tjchadaga
Contributor

@scottpaulsen - have you tested with 7.5 and do you see that the issue is fixed with that version?

@tjchadaga added the Triaged label on Dec 18, 2024
@tjchadaga
Contributor

@prgeor - please take a look

@scottpaulsen
Author

Yes, it is fixed with 7.5; however, 7.5 is not yet released. I pulled the latest build and hacked it in for our private image.
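To check which smartmontools version an image carries, the first line of the smartctl output (for example `smartctl 7.4 2023-08-01 r5530 [...]` in the problem output) can be parsed. The sketch below is a rough illustration: `smartctl_version` and `reported_status` are hypothetical helpers, and the table only records what was reported in this issue for the HFS480GEJ8X176N drive, not general compatibility.

```python
import re

# Outcomes reported in this issue for the HFS480GEJ8X176N drive only.
REPORTED = {
    (7, 2): "reported working",
    (7, 4): "reported failing",
    (7, 5): "reported fixed (pre-release build)",
}


def smartctl_version(banner):
    """Extract (major, minor) from a smartctl banner line, or None."""
    match = re.match(r"smartctl (\d+)\.(\d+)", banner)
    return (int(match.group(1)), int(match.group(2))) if match else None


def reported_status(banner):
    """Map a banner line to the outcome reported in this issue."""
    return REPORTED.get(smartctl_version(banner), "not covered by this issue")


if __name__ == "__main__":
    banner = "smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.0-22-2-amd64] (local build)"
    print(smartctl_version(banner), reported_status(banner))
```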
