Fork fixes issue where RP would not launch #283

Open
wants to merge 20 commits into
base: master

fischer-felix

I was having this issue which prevented the application from starting altogether.
Since this seems to have been fixed in this fork, I think it would make sense to merge it upstream.

emerge-e-world and others added 16 commits December 2, 2021 17:54
QString truncates strings after the first occurrence of a null character. If a sysfs file contains null characters in the middle of its contents, the output of getValueFromSysFsFile() is incomplete.

This replaces all occurrences of null characters in the QByteArray returned from QFile::readAll() before converting it to a QString.
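
For readers who want to see the idea in isolation, here is a minimal sketch using only the C++ standard library instead of Qt. The helper name `sanitizeSysfsBytes` and the choice of a space as the replacement character are assumptions made for illustration; the commit itself works on the `QByteArray` returned by `QFile::readAll()` and does not say which replacement character it uses.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper (not taken from the fork): replace embedded NUL bytes
// in a raw buffer so that a later conversion to a string type that treats
// NUL as a terminator does not cut the data short.
// The replacement character (a space) is an assumption for this sketch.
std::string sanitizeSysfsBytes(std::string raw, char replacement = ' ')
{
    std::replace(raw.begin(), raw.end(), '\0', replacement);
    return raw;
}
```

Called on a five-byte buffer such as `std::string("ab\0cd", 5)`, the helper returns a string of the same length with the NUL replaced, so nothing after it is lost.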
This also fixes segfaults on startup when incomplete pp_od_clk_voltage tables are being parsed.
If the PC returns from sleep/hibernation we can enforce restoration of
the correct fan control mode by checking if the correct value is set for
pwm_enable and restore the value if necessary
To get around the issue that PWM isn't restored after hibernation it's
enough to check the pwm_enable file - we don't need to also store the
current PWM state into a variable.
Make sure an item is actually selected and can be moved in the desired
direction. I.e. don't try to move left if it is already the first item
and don't move right if it is the last.

Without this bounds check, clicking on move left/right leads to crashes
due to out-of-range accesses on ui->listWidget.
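
The check described above can be sketched as a small pure function. The names `MoveDirection` and `canMoveItem` are hypothetical and not taken from the fork. The sketch assumes the caller passes the selected row (a negative value meaning "nothing selected", which is what `QListWidget::currentRow()` reports in that case) and the number of items in the list.

```cpp
#include <cassert>

enum class MoveDirection { Left, Right };

// Hypothetical helper: returns true only when an item is selected and
// moving it one position in the given direction stays inside
// the valid index range [0, itemCount - 1].
bool canMoveItem(int currentRow, int itemCount, MoveDirection direction)
{
    if (currentRow < 0 || currentRow >= itemCount)
        return false; // nothing selected, or a stale index
    if (direction == MoveDirection::Left)
        return currentRow > 0;          // first item cannot move left
    return currentRow < itemCount - 1;  // last item cannot move right
}
```

A button handler would call this first and return early when it yields false, so the list is never indexed out of range.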
…same ValueID type

Changed ValueID from a simple enum to a class that contains two components: the ValueID (as before) and an instance id. This allows storing and addressing multiple instances of the same value type in gpu::gpuData using unique keys.

This commit only adds the new class and updates syntax for some uses of ValueID, without changing any program logic. Overall the new ValueID class should be a drop-in replacement for the old enum if only one instance of a ValueID type is needed. Some explicit static_casts to int and type conversions to int in settings.cpp and dialogs/dialog_deinetopbaritem.cpp needed to be updated, though.

This is prep work for introducing support for multiple temperature sensors when available by hwmon. Sensors will be distinguished by each having a unique ValueID::instance value per sensor for each of the TEMPERATURE_* valueID types.
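
A rough sketch of what such a two-part key could look like, written against the standard library rather than Qt. Every name here (`ValueType`, `ValueId`, `GpuData`) is a stand-in chosen for illustration, and the fork's real class may be laid out quite differently. The sketch shows only the two properties the commit message mentions: the key combines a value type with an instance id, and a bare type still converts implicitly so that single-instance call sites keep working.

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical stand-ins for the fork's types; names are illustrative only.
enum class ValueType { TemperatureCurrent, TemperatureMin, TemperatureMax, FanSpeed };

struct ValueId {
    ValueType type;
    int instance; // 0 when only one instance of a type exists

    // Implicit construction from the bare type keeps old call sites compiling.
    ValueId(ValueType t, int inst = 0) : type(t), instance(inst) {}

    // Ordering on (type, instance) makes the pair usable as a unique map key.
    bool operator<(const ValueId &other) const {
        return std::tie(type, instance) < std::tie(other.type, other.instance);
    }
    bool operator==(const ValueId &other) const {
        return type == other.type && instance == other.instance;
    }
};

// Simplified placeholder for a container keyed by ValueId.
using GpuData = std::map<ValueId, double>;
```

With this, two temperature sensors can be stored side by side as `ValueId(ValueType::TemperatureCurrent, 0)` and `ValueId(ValueType::TemperatureCurrent, 1)`, while a lookup with a plain `ValueType::FanSpeed` still resolves to instance 0.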
When temperature sensors via the HWMON backend are supported, detect all
available temperature sensors.

If the labels provided by hwmon are known, we assign known instance ids
to the ValueIDs of the sensor data. As of now, this includes edge,
junction and mem temperature, which will be assigned the instances
T_EDGE, T_JUNCTION and T_MEM for all TEMPERATURE_* values.

For those instances, getNameOfValueID() now returns the appropriate
localized strings for UI labels.

For all sensors detected with unknown labels, a free instance id is
assigned, and the label provided by HWMON is used to construct labels for
the UI, as returned by getNameOfValueID().
makes data list aware of potential multiple instances of
TEMPERATURE_CURRENT values from multiple temperature sensors.

Readings from all sensors will now be listed properly.
makes radeon_profile::resetMinMax() aware of multiple temperature
sensors.

When the user requests a reset of the min/max values in the settings
tab, now the min/max values for all temperature sensors will be reset.
Make topbar items aware of multiple instances of TEMPERATURE_CURRENT
values, so user can display sensors readings from any sensor.
When a default topbar schema is created, make sure the main large label
for temperature is using the T_EDGE sensor.

Also conditionally generate a label pair for T_JUNCTION and T_MEM sensors, when
available by the driver.
allow user to select the temperature sensor to monitor for each
TEMPERATURE type event.
add a third column, display crit and emergency constants for temp sensors, if available.
Hysteresis was a global setting, applied independently of fan profile.
This is somewhat confusing, as one could expect it to be an aspect of a
fan profile, since one sets it in the fan profile tab.

This adds a new FanProfile struct that can store additional settings
besides the temperature steps and adds hysteresis as one of those
settings.

The global hysteresis value stored in older radeon-profile-settings will
still be read and applied to all loaded profiles. Once new config files
are written, this value is moved to the xml file and stored as an attribute
of each fan profile.
Adds a new combo box to the fan profile tab to allow the user to select
a temperature sensor as data source for this fan curve.

radeon_profile::adjustFanSpeed() will now read the current temperature
of the temperature sensor selected in the currently active fan profile.

This setting is a per fan profile setting. It is stored in the
newly added FanProfile struct and written to and read from the xml config file
for each fan profile. By default, the edge temperature sensor is
selected. This also applies when an older config file is read without
the new config attribute.
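
To make the two fan profile commits above more concrete, here is a minimal sketch of a per-profile settings struct and of the fallback for older config files. The field names, the numeric values of the sensor ids and the helper `applyLegacyHysteresis` are all assumptions for illustration. Only the names `FanProfile`, `T_EDGE`, `T_JUNCTION` and `T_MEM`, the per-profile hysteresis, and the edge sensor default come from the commit messages.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sensor instance ids named in the commit messages; the numeric values are assumed.
enum SensorInstance { T_EDGE = 0, T_JUNCTION = 1, T_MEM = 2 };

// Illustrative layout only; the fork's real FanProfile may differ.
struct FanProfile {
    std::map<int, int> steps;   // temperature (deg C) -> fan speed (%)
    int hysteresis = 0;         // per-profile, no longer a global setting
    int sensor = T_EDGE;        // default when the config has no sensor attribute
};

// Hypothetical helper: apply the global hysteresis value read from an older
// settings file to every loaded profile.
void applyLegacyHysteresis(std::map<std::string, FanProfile> &profiles, int globalHysteresis)
{
    for (auto &entry : profiles)
        entry.second.hysteresis = globalHysteresis;
}
```

Because the defaults live in the struct itself, a profile loaded from an old file without the new attributes ends up on the edge sensor without any extra code path.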
@eitch

eitch commented Jan 16, 2022

Are you sure this merge would help? I tried it myself and it doesn't work; it fails with the same error message.

@fischer-felix
Author

It does the trick for me, but could you provide more details, like driver version, distro, kernel parameters, etc.?

radeon-profile.mp4

@eitch

eitch commented Jan 17, 2022

True, the "no file name specified" message might not be the actual problem. I also get a segmentation fault. My details are below (I also tested before with mesa 21.2.x, with the same result):

$ uname -a
Linux eitchtower 5.15.11-76051511-generic #202112220937~1640185481~21.10~b3a2c21 SMP Wed Dec 22 15:41:49 U x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Pop
Description:	Pop!_OS 21.10
Release:	21.10
Codename:	impish

$ glxinfo
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 6800 XT (SIENNA_CICHLID, DRM 3.42.0, 5.15.11-76051511-generic, LLVM 13.0.0) (0x73bf)
    Version: 21.3.4
    Accelerated: yes
    Video memory: 16384MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 14582 MB, largest block: 14582 MB
    VBO free aux. memory - total: 15986 MB, largest block: 15986 MB
    Texture free memory - total: 14582 MB, largest block: 14582 MB
    Texture free aux. memory - total: 15986 MB, largest block: 15986 MB
    Renderbuffer free memory - total: 14582 MB, largest block: 14582 MB
    Renderbuffer free aux. memory - total: 15986 MB, largest block: 15986 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 16384 MB
    Total available memory: 32752 MB
    Currently available dedicated video memory: 14582 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon RX 6800 XT (SIENNA_CICHLID, DRM 3.42.0, 5.15.11-76051511-generic, LLVM 13.0.0)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 21.3.4 - kisak-mesa PPA

@emerge-e-world

If my fork fixed your startup issue then it was probably this bug, #279, which is fixed in my branch. I made a pull request (#280) for my fixes as well, but marazmista seems to be inactive at the moment. Feel free to use my branch though, I committed a number of bug fixes in there.

The "QFSFileEngine::open" error message from #278 seems unrelated to any of my fixes. It is also probably not the cause of the segfault; a call to "QString::toUInt()" is the more likely culprit, as Natim commented in #278. I'll see if I can reproduce that and find the bug.

@eitch

eitch commented Feb 17, 2022

@emerge-e-world sadly your fork does not fix the issue... Without sudo your version starts, but with sudo i get the segfault...

@emerge-e-world

Could you compile it with debug messages enabled and post the output leading up to the segfault?
You can do so by running:

qmake-qt5 CONFIG+=debug
make

Then run the resulting ./target/radeon-profile binary from a terminal and copy/paste all the output here.

@eitch

eitch commented Feb 22, 2022

$ qmake CONFIG+=debug
$ make clean
$ make -j32
$ sudo ./target/radeon-profile
Creating application object
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
Creating radeon_profile
Loading configuration
Creating ui elements
Analyzing screen  0
  Analyzing output  0
    Analyzing active mode
      Property is EDID, parsing it
Using QCharRef with an index pointing outside the valid range of a QString. The corresponding behavior is deprecated, and will be changed in a future version of Qt.
Using QCharRef with an index pointing outside the valid range of a QString. The corresponding behavior is deprecated, and will be changed in a future version of Qt.
Using QCharRef with an index pointing outside the valid range of a QString. The corresponding behavior is deprecated, and will be changed in a future version of Qt.
Searching PnP ID:  "GSM"
Found PnP ID:  "GSM" -> "LG Electronics"
  Analyzing output  1
  Analyzing output  2
  Analyzing output  3
Initializing device
Card detected:
 module:  "amdgpu" 
 sysName(path):  "card0" 
 name:  "Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)"
hwmon path:  "/sys/class/drm/card0/device/hwmon/hwmon2/"
Opened /dev/dri/renderD128 , fd number: 48
QFSFileEngine::open: No file name specified
Detected power method based on sysfs power_method file: 3
Power method: Power Profile Modes. Creating profiles list
ioctlHandler: everything ok
detected 3 temperature sensors: edge,junction,mem
ASSERT failure in QList<T>::operator[]: "index out of range", file /usr/include/x86_64-linux-gnu/qt5/QtCore/qlist.h, line 579
Aborted

@emerge-e-world

Thanks. It looks like the error happens when parsing one of the pp tables (containing the power level / clock descriptions for your card). I suspect that the amdgpu driver changed the output for navi 2x cards somewhat, and this may happen with other Navi 2x based cards as well.
Could you post the contents of all the files called /sys/class/drm/card0/device/pp*? Just run the following command in a terminal:
more /sys/class/drm/card0/device/pp* | cat

@eitch

eitch commented Feb 24, 2022

$ more /sys/class/drm/card0/device/pp* | cat
::::::::::::::
/sys/class/drm/card0/device/pp_cur_state
::::::::::::::
0
::::::::::::::
/sys/class/drm/card0/device/pp_dpm_dcefclk
::::::::::::::
0: 417Mhz 
1: 960Mhz *
2: 1200Mhz 
::::::::::::::
/sys/class/drm/card0/device/pp_dpm_dclk
::::::::::::::
::::::::::::::
/sys/class/drm/card0/device/pp_dpm_fclk
::::::::::::::
0: 500Mhz 
1: 1276Mhz *
2: 1941Mhz 
::::::::::::::
/sys/class/drm/card0/device/pp_dpm_mclk
::::::::::::::
0: 96Mhz 
1: 456Mhz 
2: 673Mhz 
3: 1000Mhz *
::::::::::::::
/sys/class/drm/card0/device/pp_dpm_pcie
::::::::::::::
0: 2.5GT/s, x1 310Mhz 
1: 8.0GT/s, x8 619Mhz *
::::::::::::::
/sys/class/drm/card0/device/pp_dpm_sclk
::::::::::::::
0: 500Mhz *
1: 2575Mhz 
::::::::::::::
/sys/class/drm/card0/device/pp_dpm_socclk
::::::::::::::
0: 480Mhz 
1: 800Mhz *
2: 1200Mhz 
::::::::::::::
/sys/class/drm/card0/device/pp_dpm_vclk
::::::::::::::
::::::::::::::
/sys/class/drm/card0/device/pp_features
::::::::::::::
features high: 0x00003763 low: 0xa37ffdff
No. Feature               Bit : State
00. DPM_PREFETCHER       ( 0) : enabled
01. DPM_GFXCLK           ( 1) : enabled
02. DPM_GFX_GPO          ( 2) : enabled
03. DPM_UCLK             ( 3) : enabled
04. DPM_FCLK             ( 4) : enabled
05. DPM_SOCCLK           ( 5) : enabled
06. DPM_MP0CLK           ( 6) : enabled
07. DPM_LINK             ( 7) : enabled
08. DPM_DCEFCLK          ( 8) : enabled
09. DPM_XGMI             ( 9) : disabled
10. MEM_VDDCI_SCALING    (10) : enabled
11. MEM_MVDD_SCALING     (11) : enabled
12. DS_GFXCLK            (12) : enabled
13. DS_SOCCLK            (13) : enabled
14. DS_FCLK              (14) : enabled
15. DS_LCLK              (15) : enabled
16. DS_DCEFCLK           (16) : enabled
17. DS_UCLK              (17) : enabled
18. GFX_ULV              (18) : enabled
19. FW_DSTATE            (19) : enabled
20. GFXOFF               (20) : enabled
21. BACO                 (21) : enabled
22. MM_DPM_PG            (22) : enabled
23. PPT                  (24) : enabled
24. TDC                  (25) : enabled
25. APCC_PLUS            (26) : disabled
26. GTHR                 (27) : disabled
27. ACDC                 (28) : disabled
28. VR0HOT               (29) : enabled
29. VR1HOT               (30) : disabled
30. FW_CTF               (31) : enabled
31. FAN_CONTROL          (32) : enabled
32. THERMAL              (33) : enabled
33. GFX_DCS              (34) : disabled
34. RM                   (35) : disabled
35. LED_DISPLAY          (36) : disabled
36. GFX_SS               (37) : enabled
37. OUT_OF_BAND_MONITOR  (38) : enabled
38. TEMP_DEPENDENT_VMIN  (39) : disabled
39. MMHUB_PG             (40) : enabled
40. ATHUB_PG             (41) : enabled
41. APCC_DFLL            (42) : enabled
42. RSMU_SMN_CG          (44) : enabled
::::::::::::::
/sys/class/drm/card0/device/pp_force_state
::::::::::::::

::::::::::::::
/sys/class/drm/card0/device/pp_mclk_od
::::::::::::::
0
::::::::::::::
/sys/class/drm/card0/device/pp_num_states
::::::::::::::
states: 1
0 default
::::::::::::::
/sys/class/drm/card0/device/pp_od_clk_voltage
::::::::::::::
OD_SCLK:
0: 500Mhz
1: 2384Mhz
OD_MCLK:
0: 97Mhz
1: 1000MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       2800Mhz
MCLK:     674Mhz       1075Mhz
::::::::::::::
/sys/class/drm/card0/device/pp_power_profile_mode
::::::::::::::
PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS MinFreqType MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
 0 BOOTUP_DEFAULT*:
                    0(       GFXCLK)       0       5       1       0       4     800 4587520  -65536       0
                    1(       SOCCLK)       0       5       1       0       1       0 3276800  -65536   -6553
                    2(        MEMLK)       0       5       1       0       4     800  327680  -65536       0
 1 3D_FULL_SCREEN :
                    0(       GFXCLK)       0       5       1       0       4     650 5242880   -3276       0
                    1(       SOCCLK)       0       5       1       0       1       0  655360  -65536   -6553
                    2(        MEMLK)       0       5       4     850       4     800  327680  -65536       0
 2   POWER_SAVING :
                    0(       GFXCLK)       0       5       1       0       3       0 5898240  -65536       0
                    1(       SOCCLK)       0       5       1       0       1       0 3407872  -65536   -6553
                    2(        MEMLK)       0       5       1       0       3       0 1966080  -65536       0
 3          VIDEO :
                    0(       GFXCLK)       0       5       1       0       4     500 4587520  -65536       0
                    1(       SOCCLK)       0       5       1       0       1       0 3473408  -65536   -6553
                    2(        MEMLK)       0       5       1       0       4     500 1966080  -65536       0
 4             VR :
                    0(       GFXCLK)       0       5       4    1000       1       0 3276800       0       0
                    1(       SOCCLK)       0       5       1       0       1       0  655360  -65536   -6553
                    2(        MEMLK)       0       5       1       0       4     800  327680  -65536       0
 5        COMPUTE :
                    0(       GFXCLK)       0       5       4    1000       1       0 3932160       0       0
                    1(       SOCCLK)       0       5       1       0       1       0  655360  -65536   -6553
                    2(        MEMLK)       0       5       4     850       3       0  327680  -65536  -32768
 6         CUSTOM :
                    0(       GFXCLK)       0       5       1       0       4     800 4587520  -65536       0
                    1(       SOCCLK)       0       5       1       0       1       0 3276800  -65536   -6553
                    2(        MEMLK)       0       5       1       0       4     800  327680  -65536       0
::::::::::::::
/sys/class/drm/card0/device/pp_sclk_od
::::::::::::::
0
::::::::::::::
/sys/class/drm/card0/device/pp_table
::::::::::::::
[binary contents of pp_table omitted]

@eitch

eitch commented Feb 24, 2022

Thanks for the help! Really happy about getting some feedback, and hoping for the application to soon work again =).

Parsing tables in pp_od_clk_voltage is somewhat fragile in the current implementation, as assumptions made on the contents of that file tend to not apply with newer cards. This may lead to segfaults that are easier to debug when users can submit debug output that somewhat shows where the code breaks.
Preliminary fixes to support navi2x cards. This is the first step to get radeon-profile functional with navi2x based cards, by working around a pp_od_clk_voltage file that has no FreqVolt pair tables but does have OD_SCLK/OD_MCLK tables like vega20.
Proper voltage control may still require support for the new OD_VDDGFX_OFFSET value that is not implemented yet. OC features might be missing, but startup without segfaulting should work.
This is a temporary workaround to make the incomplete pp_od_clk_voltage parsing logic for navi2x work without segfaulting, when no voltage steps information is available.
When parsing the OD_RANGE or volt/freq pairs, three columns are expected
in those tables. However, bounds checks only tested for the case when
only one column (i.e. a new table header) is found.

Properly abort parsing when fewer than 3 columns are found and ignore
those lines.
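
The column check described in this commit message can be illustrated with a small standalone parser. This is a sketch under stated assumptions, not the fork's code: the function names are hypothetical, it uses the standard library instead of Qt, and it applies one minimum column count to every table, whereas the real parser handles the individual tables separately.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Split a line into whitespace-separated columns.
static Row splitColumns(const std::string &line)
{
    std::istringstream in(line);
    Row columns;
    for (std::string word; in >> word;)
        columns.push_back(word);
    return columns;
}

// Hypothetical parser: groups rows under the most recent "NAME:" header and
// keeps a row only if it has at least minColumns columns, so later code can
// index columns 0..minColumns-1 without going out of range.
std::map<std::string, std::vector<Row>> parseOdTables(const std::string &text,
                                                      std::size_t minColumns)
{
    std::map<std::string, std::vector<Row>> tables;
    std::istringstream in(text);
    std::string current;
    for (std::string line; std::getline(in, line);) {
        const Row columns = splitColumns(line);
        if (columns.empty())
            continue; // blank line
        if (columns.size() == 1 && columns[0].back() == ':') {
            // A single column ending in ':' starts a new table.
            current = columns[0].substr(0, columns[0].size() - 1);
            tables[current]; // register the table even if it stays empty
            continue;
        }
        if (current.empty() || columns.size() < minColumns)
            continue; // too short to index safely: ignore the line
        tables[current].push_back(columns);
    }
    return tables;
}
```

Fed the pp_od_clk_voltage contents posted earlier in this thread with `minColumns` set to 3, this sketch keeps the two OD_RANGE rows and skips the two-column OD_SCLK/OD_MCLK rows and the lone `0mV` line instead of indexing past their end.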
@emerge-e-world

You're welcome! I hope I'll be able to make it work for your card, and by extension navi cards overall. It is a bit trickier without the hardware to test. Originally I started my fork to collect some general bug fixes and fix things that broke with my radeon VII in particular, but I think I have a good idea what is going wrong.

Anyhow, I've got a preliminary version that should fix the segfault, and work around some things missing that may need some more work. Assuming that particular segfault is the only major problem with your card, it should now start and then be mostly functional.

The part that may need more work is manual voltage/frequency settings in the overclocking tab. It may show nothing or not apply correctly. The slider for automatic percent overclock should likely work though. Anything else I expect to work.

You can find those fixes in this branch:
https://github.com/emerge-e-world/radeon-profile/tree/navi_fixes

You can clone it with:
git clone https://github.com/emerge-e-world/radeon-profile -b navi_fixes

and then compile and run as usual:

qmake CONFIG+=debug
make -j
./target/radeon-profile

In case it still crashes elsewhere, I added some more debug output to pin it down better. So if that happens, posting the output again would help.

@eitch

eitch commented Feb 24, 2022

Thanks, so this actually works. I can now set the power cap as well. This is the log during that:

$ sudo ./target/radeon-profile
Please touch the device.
Creating application object
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
Creating radeon_profile
Loading configuration
Creating ui elements
Analyzing screen  0
  Analyzing output  0
    Analyzing active mode
      Property is EDID, parsing it
Using QCharRef with an index pointing outside the valid range of a QString. The corresponding behavior is deprecated, and will be changed in a future version of Qt.
Using QCharRef with an index pointing outside the valid range of a QString. The corresponding behavior is deprecated, and will be changed in a future version of Qt.
Using QCharRef with an index pointing outside the valid range of a QString. The corresponding behavior is deprecated, and will be changed in a future version of Qt.
Searching PnP ID:  "GSM"
Found PnP ID:  "GSM" -> "LG Electronics"
  Analyzing output  1
  Analyzing output  2
  Analyzing output  3
Initializing device
Card detected:
 module:  "amdgpu" 
 sysName(path):  "card0" 
 name:  "Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)"
hwmon path:  "/sys/class/drm/card0/device/hwmon/hwmon2/"
Opened /dev/dri/renderD128 , fd number: 48
QFSFileEngine::open: No file name specified
Detected power method based on sysfs power_method file: 3
Power method: Power Profile Modes. Creating profiles list
ioctlHandler: everything ok
detected 3 temperature sensors: edge,junction,mem
parsing OD_SCLK table
parsing OD_MCLK table
parsing OD_RANGE table
parsing OC tables done.
ioctlHandler: everything ok
GPU max core clk:  2575 
 max mem clk:  1000 
 vram size:  16304
Handling found device features
Creating power profiles control buttons
Fan control available
setting up OC UI elements
creating OC profile list/graph for:  "default"
UI init complete.
parsing OD_SCLK table
parsing OD_MCLK table
parsing OD_RANGE table
parsing OC tables done.
parsing OD_SCLK table
parsing OD_MCLK table
parsing OD_RANGE table
parsing OC tables done.
parsing OD_SCLK table
parsing OD_MCLK table
parsing OD_RANGE table
parsing OC tables done.
creating OC profile list/graph for:  "default"
parsing OD_SCLK table
parsing OD_MCLK table
parsing OD_RANGE table
parsing OC tables done.
parsing OD_SCLK table
parsing OD_MCLK table
parsing OD_RANGE table
parsing OC tables done.

My card is water cooled, so I don't have any fans on it. I'll need to play around with the settings and see if the overclock sliders work. Thanks anyhow for this start.

@eitch

eitch commented Feb 24, 2022

How can i now enable the daemon, so i don't have to run as root?

@emerge-e-world

Awesome, that looks very good. The fan control should most likely not be any different with navi, and the debug output suggests it is detected correctly. So I'd assume this would work if you had an air cooler attached. We'll have to see if someone with another card can test this.

The daemon is a separate package. I did not make any changes to the daemon for my fork, so you can just use the latest binary package from your distro if it has one (the package should be called "radeon-profile-daemon"), or compile it yourself from marazmista's repo here:
https://github.com/marazmista/radeon-profile-daemon

To build, follow the instructions there (it's the same qmake && make steps), then run sudo make install to install it.
After that you can manually start the daemon with:
sudo systemctl start radeon-profile-daemon.service
and/or enable it at boot:
sudo systemctl enable radeon-profile-daemon.service

@emerge-e-world

The changes to fix the startup segfault for navi2x (RX 6000 series cards) are now merged into the master branch in my repo. So anybody who had the same issue as @eitch, please use the master branch from:
https://github.com/emerge-e-world/radeon-profile
Any feedback on further issues with navi based cards would be appreciated, since I don't have a card to test myself. Please open bug reports in my repo if you find things not working properly.

@eitch

eitch commented Feb 28, 2022

@emerge-e-world right, that worked. I installed it from the repo and am now able to start the client without sudo.

Overclocking doesn't seem to do anything. Looking at the screen:

image

I can't seem to be able to change the clocks manually, and the percent overclock does nothing, my clock always stays at 2475MHz. Any ideas?

@emerge-e-world

emerge-e-world commented Feb 28, 2022

I think there may be a rather confusing UI bug in the General OC tab. Could you try this: enable both percent overclock and manual frequency control. Then move the core and/or mem clock percent slider up, then push apply at least twice. Now if the frequency values shown in the tables change, it may have worked. If so, you may try to run some workload and see if the clocks actually go higher. (They may just spike for short bursts if your temps are getting too high. You can plot and monitor both in the Graphs tab.)
If this works then this is just a bug in the UI. That'll have to be fixed in either case :)

Another idea: If you click on the states table tab, is there a "Set Ranges" button at the bottom shown? I expect not, but if there is, can you try to click on it and increase the "maximum core clock" slider there, then hit Save and Apply, and check if your clocks increase.

@eitch

eitch commented Mar 1, 2022

Clicking apply twice does change the second checkbox in the left list from 0 Mhz to 2575Mhz, but nothing happens while running vkmark. And this is not a thermal issue, as you can see from the temps.

image

image

@emerge-e-world

emerge-e-world commented Mar 2, 2022

I had a look at the amdgpu kernel driver doc to find any changes in the clock/voltage control for navi2x to figure out what needs changing. It seems – except for a new voltage offset feature – it should work the same as with navi1x and vega20 cards. And therefore in principle work with the existing code.

However, your pp_od_clk_voltage table is missing the section called OD_VDDC_CURVE, describing the frequency/voltage curve. This is what made me originally think there would need to be some bigger changes to make it work. But I now rather think your amdgpu driver is just not exposing it.

Maybe it is as simple as it being disabled. Did you boot with an amdgpu.ppfeaturemask kernel parameter? It is a parameter for the amdgpu driver that unlocks some driver features that are disabled by default, including some overclocking bits.

If you didn't, could you try the following?

Add the parameter amdgpu.ppfeaturemask=0xffffffff to your kernel parameter list in your bootloader. You seem to run PopOS, on that distro you should be able to add kernel parameters like this:

sudo kernelstub -a amdgpu.ppfeaturemask=0xffffffff

then reboot.
After you've rebooted, you can verify that it was actually applied with:

cat /proc/cmdline

The "amdgpu.ppfeaturemask=0xffffffff" string should be somewhere in there.

If that worked, please post the output of:

cat /sys/class/drm/card0/device/pp_od_clk_voltage

again. If there is now an OD_VDDC_CURVE section in there, you can try again applying an overclock. There should then also be the "set ranges" button now available under the states table tab, where you can increase the max allowed clock speed.

@eitch

eitch commented Mar 2, 2022

Sadly this is not the case, as i have had this feature activated quite a while already:

eitch@eitchtower:~$ cat /proc/cmdline
initrd=\EFI\Pop_OS-c20a2fe8-0d53-4c86-bfcd-b373b2b6bceb\initrd.img root=UUID=c20a2fe8-0d53-4c86-bfcd-b373b2b6bceb ro quiet loglevel=0 systemd.show_status=false splash amdgpu.ppfeaturemask=0xffffffff
eitch@eitchtower:~$ cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2374Mhz
OD_MCLK:
0: 97Mhz
1: 1000MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       2800Mhz
MCLK:     674Mhz       1075Mhz

@emerge-e-world

OK, well that would have been too easy. I'll see if I can work on getting the OD_RANGE settings to work (the set ranges button), since the values are there. Maybe not setting those caps the maximum allowed clocks. I'll also look at exposing the new VDDGFX_OFFSET in the UI. It may take a couple of days before I can get to it though.

In the meantime, could you check if the amdgpu driver reports any errors when you try to apply OC settings? This should get you any kernel messages from the driver:

sudo dmesg | grep amdgpu

@eitch

eitch commented Mar 28, 2022

Sorry for the delay, i've been on holidays. Here is the listing:

[    0.000000] Command line: initrd=\EFI\Pop_OS-c20a2fe8\initrd.img root=UUID=c20a2fe8 ro quiet loglevel=0 systemd.show_status=false splash amdgpu.ppfeaturemask=0xffffffff
[    0.057528] Kernel command line: initrd=\EFI\Pop_OS\initrd.img root=UUID=c20a2fe8 ro quiet loglevel=0 systemd.show_status=false splash amdgpu.ppfeaturemask=0xffffffff
[    2.080809] [drm] amdgpu kernel modesetting enabled.
[    2.085476] amdgpu: Ignoring ACPI CRAT on non-APU system
[    2.085478] amdgpu: Virtual CRAT table created for CPU
[    2.085482] amdgpu: Topology: Add CPU node
[    2.085527] fb0: switching to amdgpu from EFI VGA
[    2.085560] amdgpu 0000:0e:00.0: vgaarb: deactivate vga console
[    2.085587] amdgpu 0000:0e:00.0: enabling device (0006 -> 0007)
[    2.085615] amdgpu 0000:0e:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    2.087256] amdgpu 0000:0e:00.0: amdgpu: Fetched VBIOS from VFCT
[    2.087257] amdgpu: ATOM BIOS: 113-D41201-XTC
[    2.087297] amdgpu 0000:0e:00.0: amdgpu: MEM ECC is not presented.
[    2.087298] amdgpu 0000:0e:00.0: amdgpu: SRAM ECC is not presented.
[    2.087305] amdgpu 0000:0e:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[    2.087306] amdgpu 0000:0e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    2.087307] amdgpu 0000:0e:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    2.087340] [drm] amdgpu: 16368M of VRAM memory ready
[    2.087340] [drm] amdgpu: 16368M of GTT memory ready.
[    2.087605] amdgpu 0000:0e:00.0: amdgpu: PSP runtime database doesn't exist
[    4.341650] amdgpu 0000:0e:00.0: amdgpu: Will use PSP to load VCN firmware
[    4.556253] amdgpu 0000:0e:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    4.556282] amdgpu 0000:0e:00.0: amdgpu: use vbios provided pptable
[    4.626347] amdgpu 0000:0e:00.0: amdgpu: SMU is initialized successfully!
[    5.097568] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    5.138720] amdgpu: HMM registered 16368MB device memory
[    5.138777] amdgpu: SRAT table not found
[    5.138777] amdgpu: Virtual CRAT table created for GPU
[    5.139198] amdgpu: Topology: Add dGPU node [0x73bf:0x1002]
[    5.139201] kfd kfd: amdgpu: added device 1002:73bf
[    5.139227] amdgpu 0000:0e:00.0: amdgpu: SE 4, SH per SE 2, CU per SH 10, active_cu_number 72
[    5.143449] fbcon: amdgpudrmfb (fb0) is primary device
[    5.143451] amdgpu 0000:0e:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[    5.156789] amdgpu 0000:0e:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    5.156791] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    5.156792] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    5.156792] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    5.156793] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    5.156794] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    5.156795] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    5.156795] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    5.156796] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    5.156797] amdgpu 0000:0e:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    5.156798] amdgpu 0000:0e:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    5.156798] amdgpu 0000:0e:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[    5.156799] amdgpu 0000:0e:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
[    5.156800] amdgpu 0000:0e:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
[    5.156801] amdgpu 0000:0e:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[    5.156801] amdgpu 0000:0e:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[    5.156802] amdgpu 0000:0e:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[    5.156803] amdgpu 0000:0e:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 1
[    5.156804] amdgpu 0000:0e:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 1
[    5.156805] amdgpu 0000:0e:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 1
[    5.156805] amdgpu 0000:0e:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 1
[    5.157958] [drm] Initialized amdgpu 3.44.0 20150101 for 0000:0e:00.0 on minor 0
[    7.123169] snd_hda_intel 0000:0e:00.1: bound 0000:0e:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])

@eitch

eitch commented Mar 28, 2022

It doesn't matter which combinations of buttons i press, the logs don't change. I'm on kernel 5.16.12

@eitch

eitch commented Mar 30, 2022

Does this also help you:

$ cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2374Mhz
OD_MCLK:
0: 97Mhz
1: 1000MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       2800Mhz
MCLK:     674Mhz       1075Mhz

@v-fox

v-fox commented Nov 2, 2022

Tried it out on an RX580 and I don't think that frequency curves are getting applied. Only the "frequency control" toggle seems to work for sure (because it activates the 1000 MHz intermediate memory state, which is disabled by default in the driver because it causes flickering). "Percent overclock" might work, but the "GPU max core clk: / max mem clk: " startup message always shows the default values UNLESS the radeon-profile binary has suid applied to it with chmod u+s (even starting with sudo doesn't change that). BUT that only works if it is patched with this to prevent Qt's "safety suicide":

diff --git a/radeon-profile/main.cpp b/radeon-profile/main.cpp
index e3f6532..3d99571 100644
--- a/radeon-profile/main.cpp
+++ b/radeon-profile/main.cpp
@@ -6,6 +6,7 @@ int main(int argc, char *argv[])
 {
     qDebug() << "Creating application object";
 
+    QApplication::setSetuidAllowed(true);
     QApplication a(argc, argv);
     QTranslator translator;
     QLocale locale;

5 participants