When I run `DRI_PRIME=1 glxgears -info` and `DRI_PRIME=0 glxgears -info` with AC power plugged in, the system halts and is then forced to shut down automatically. I tried mainline kernels from 4.10rc7 to v4.14rc5 and they all have the same problem.
Please attach the corresponding dmesg output.
(In reply to Shih-Yuan Lee from comment #0)
> While I am doing the tests with AC plugged in by `DRI_PRIME=1 glxgears
> -info` and `DRI_PRIME=0 glxgears -info`, the system halts and then is forced
> to shutdown automatically.
To clarify, DRI_PRIME=1 glxgears works, the problem only occurs with DRI_PRIME=0 glxgears?
If I execute the command on battery power, the system does not halt.
Using DRI_PRIME=0 is just to switch between Intel and AMD Graphics. Of course, we can omit it completely.
The system halt issue happens when executing `DRI_PRIME=1 glxgears -info`.
u@u:~$ DRI_PRIME=1 glxgears -info
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
GL_RENDERER = Gallium 0.4 on AMD HAINAN (DRM 2.46.0, LLVM 3.8.0)
GL_VERSION = 3.0 Mesa 11.2.0
GL_VENDOR = X.Org
...
u@u:~$ DRI_PRIME=0 glxgears -info
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
GL_RENDERER = Mesa DRI Intel(R) Kabylake GT1.5
GL_VERSION = 3.0 Mesa 11.2.0
GL_VENDOR = Intel Open Source Technology Center
...
(In reply to Michel Dänzer from comment #1)
> Please attach the corresponding dmesg output.
>
> (In reply to Shih-Yuan Lee from comment #0)
> > While I am doing the tests with AC plugged in by `DRI_PRIME=1 glxgears
> > -info` and `DRI_PRIME=0 glxgears -info`, the system halts and then is forced
> > to shutdown automatically.
>
> To clarify, DRI_PRIME=1 glxgears works, the problem only occurs with
> DRI_PRIME=0 glxgears?
Sorry, I pasted the wrong logs. GL_VERSION should be "3.0 Mesa 17.0.7" instead.
(In reply to Shih-Yuan Lee from comment #3)
There is no problem on battery power, and because the system halts, I am unable to collect any dmesg output.
$ DRI_PRIME=1 glxgears -info
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
GL_RENDERER = Gallium 0.4 on AMD HAINAN (DRM 2.50.0 / 4.14.0-041400rc5-generic, LLVM 4.0.0)
GL_VERSION = 3.0 Mesa 17.0.7
GL_VENDOR = X.Org
...
$ DRI_PRIME=0 glxgears -info
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
GL_RENDERER = Mesa DRI Intel(R) Kabylake GT1.5
GL_VERSION = 3.0 Mesa 17.0.7
GL_VENDOR = Intel Open Source Technology Center
...
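To check which GPU actually rendered a given run, the `GL_*` lines of the `glxgears -info` output above can be picked out mechanically. The following is only a minimal sketch: `parse_gl_info` and `rendered_on_amd` are hypothetical helper names, and the "AMD" substring test is a heuristic assumed from the renderer strings in this report, not a documented interface.

```python
def parse_gl_info(output):
    """Collect the 'GL_XXX = value' lines printed by `glxgears -info`."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(" = ")
        # Skip lines without the " = " separator (banner text, "...").
        if sep and key.startswith("GL_"):
            info[key.strip()] = value.strip()
    return info


def rendered_on_amd(info):
    """Heuristic (assumption): the renderer string names the AMD GPU."""
    return "AMD" in info.get("GL_RENDERER", "")


# Sample text copied from the DRI_PRIME=1 run in this report.
AMD_SAMPLE = (
    "Running synchronized to the vertical refresh. The framerate should be\n"
    "approximately the same as the monitor refresh rate.\n"
    "GL_RENDERER = Gallium 0.4 on AMD HAINAN "
    "(DRM 2.50.0 / 4.14.0-041400rc5-generic, LLVM 4.0.0)\n"
    "GL_VERSION = 3.0 Mesa 17.0.7\n"
    "GL_VENDOR = X.Org\n"
)

info = parse_gl_info(AMD_SAMPLE)
print(info["GL_VERSION"], rendered_on_amd(info))
```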
(In reply to Shih-Yuan Lee from comment #5)
> There is no problem under the battery mode, and because the system halts
> that makes unable to collect any dmesg.
Please attach dmesg captured in battery mode or before the problem occurs.
Created attachment 134936 [details] dmesg by drm.debug=0xe The messages just before the system halt.
Hmm, I notice these errors:
[ 2.050887] [drm:radeon_acpi_init [radeon]] Call to ATCS verify_interface failed: -5
[ 2.050994] [drm:radeon_acpi_init [radeon]] Call to ATIF verify_interface failed: -5
I think these are ACPI calls. It might be worth checking that your BIOS/EFI is up to date and, if that doesn't fix things, playing around with the acpi_osi= options.
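For anyone comparing several dmesg captures, the two `verify_interface` failures quoted above can be extracted with a small script. This is a sketch only: the regular expression is modelled on the exact wording of the two lines in this report and may not match other kernel versions, and `find_acpi_failures` is a hypothetical helper name. The value -5 is the negated errno EIO.

```python
import re

# Pattern modelled on the two lines quoted in this report (assumption:
# other kernels may word the message differently).
ACPI_FAIL_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+"
    r"\[drm:radeon_acpi_init \[radeon\]\]\s+"
    r"Call to (?P<method>\w+) verify_interface\s+failed: (?P<errno>-?\d+)"
)


def find_acpi_failures(dmesg_text):
    """Return (timestamp, ACPI method, error code) for each failure line."""
    return [
        (float(m["time"]), m["method"], int(m["errno"]))
        for m in ACPI_FAIL_RE.finditer(dmesg_text)
    ]


sample = (
    "[    2.050887] [drm:radeon_acpi_init [radeon]] "
    "Call to ATCS verify_interface failed: -5\n"
    "[    2.050994] [drm:radeon_acpi_init [radeon]] "
    "Call to ATIF verify_interface failed: -5\n"
)
for failure in find_acpi_failures(sample):
    print(failure)
```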
I have tried acpi_osi="Windows 2009", "Windows 2012", "Windows 2013" and "Windows 2015" on the latest mainline kernel 4.14rc6, and they all have the same errors and halt the system. The BIOS is also up to date.
(In reply to Mike Lothian from comment #8)
> Hmm, I notice these errors:
>
> [ 2.050887] [drm:radeon_acpi_init [radeon]] Call to ATCS verify_interface
> failed: -5
> [ 2.050994] [drm:radeon_acpi_init [radeon]] Call to ATIF verify_interface
> failed: -5
>
> which I think are ACPI calls, it might be worth checking your BIOS/EFI is up
> to date and if that doesn't fix things maybe play around with the acpi_osi=
> options
Did this ever work for you?
(In reply to Mike Lothian from comment #10)
> Did this ever work for you?
What do you mean by this?
BTW, this is a new Dell laptop that is still in development.
I meant: is this a regression, i.e. did it use to work with an older kernel or Mesa? If it's a new system, perhaps not.
Yup, this is a new system. `DRI_PRIME=1 glxgears` never worked properly before.
Does anything change when you boot the system with radeon.runpm=0? This means the card never powers down.
What distro are you running?
You mention trying older kernel versions; did you try older Mesa versions too?
Can you attach your Xorg.0.log too?
Do you also see the issue with amdgpu rather than using the radeon kernel driver?
Created attachment 135027 [details] Xorg.0.log
(In reply to Mike Lothian from comment #15)
> Are there any changes when you boot the system with radeon.runpm=0, this
> will mean the card never powers down
>
> What distro are you running?
>
> You mention trying older kernel version, did you try older mesa versions too?
>
> Can you attach your Xorg.0.log too
radeon.runpm=0 doesn't make any difference.
I am running Ubuntu 16.04 LTS, which used Linux kernel 4.4 and Mesa 11.2.0 before I upgraded the system. After the upgrade, it uses Mesa 17.0.7 instead.
(In reply to Mike Lothian from comment #16)
> Do you also see the issue with amdgpu rather than using the radeon kernel
> driver?
amdgpu does not work with this AMD GPU when using the kernel parameters "amdgpu.si_support=1 radeon.si_support=0" on Linux kernel 4.14rc6. The X Window System cannot start up.
01:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Jet PRO [Radeon R5 M230] [1002:6665] (rev c3)
	Subsystem: Dell Jet PRO [Radeon R5 M230] [1028:0844]
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 129
	Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at d0000000 (64-bit, non-prefetchable) [size=256K]
	Region 4: I/O ports at e000 [size=256]
	Expansion ROM at d0040000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: radeon
	Kernel modules: radeon, amdgpu
You have to blacklist radeon to use amdgpu, as both modules try to claim the device.
Created attachment 135048 [details] blacklist radeon dmesg
Created attachment 135049 [details] blacklist radeon Xorg.0.log
(In reply to Mike Lothian from comment #19)
> You have to blacklist radeon to use amdgpu as both modules try and claim the
> device
After I blacklist radeon, `xrandr --listproviders` shows no AMD graphics provider.
[ 1.937326] amdgpu 0000:01:00.0: enabling device (0000 -> 0003)
[ 1.937633] amdgpu 0000:01:00.0: SI support provided by radeon.
[ 1.937635] amdgpu 0000:01:00.0: Use radeon.si_support=0 amdgpu.si_support=1 to override.
After I use 'radeon.si_support=0 amdgpu.si_support=1', the X Window System cannot start up.
Created attachment 135059 [details] attachment-12667-0.html
Can you show the dmesg and Xorg.0.log with radeon.si_support=0 amdgpu.si_support=1?
If I use radeon.dpm=0, the issue does not occur.
The issue does not occur with Mesa 11.2.0 on Ubuntu 16.04. I found it on Mesa 17.0.7, and Mesa 17.2.4 has it as well.
This was tested to regress between Mesa 12.0.3 and 12.0.5, and bisect points to
commit d3d33918c79d9e87aedaf6f70ed39f75eed262a0
Author: Michel Dänzer <michel.daenzer@amd.com>
Date: Wed Aug 17 17:02:04 2016 +0900
    loader/dri3: Overhaul dri3_update_num_back
as the first bad commit.
Thanks for bisecting, but I don't think that commit can be directly responsible for a GPU hang. Before that commit, the DRI3 code in Mesa would only use one back buffer for glxgears, which means that the GPU could only start rendering a new frame after the previous one had finished presenting. Maybe that somehow prevented the hang. A possible test for this theory is running vblank_mode=0 DRI_PRIME=1 glxgears with Mesa 12.0.3; does that also trigger the hang?
`vblank_mode=0 DRI_PRIME=1 glxgears` also triggers the GPU lockup. With radeon.dpm=0 the lockup does not happen, but there is tearing all the time.
Tearing is expected with vblank_mode=0.
Tearing does not happen on battery power; it only happens when AC power is plugged in. Is this behavior also expected?
With vblank_mode=0, the only thing that can prevent tearing is luck.
Forwarding a comment from an engineer:
"While reading the source code of the radeon module, I found there is a bug [1] related to dpm and clocks. So I decided to do some experiments. I tried setting different max_sclk and max_mclk values to see if the issue goes away.
1. max_sclk: 70000, max_mclk: 75000 --> have the same issue
2. max_sclk: 50000, max_mclk: 60000 --> pass multi-run test (more than 50 runs)
[1] https://bugs.freedesktop.org/show_bug.cgi?id=76490
"
(In reply to Michel Dänzer from comment #27)
> Thanks for bisecting, but I don't think that commit can be directly
> responsible for a GPU hang. Before that commit, the DRI3 code in Mesa would
> only use one back buffer for glxgears, which means that the GPU could only
> start rendering a new frame after the previous one had finished presenting.
> Maybe that somehow prevented the hang.
That commit "fixed" a performance regression at the time because it ended up causing enough of a delay that the clocks didn't ramp up. So it probably exposed a kernel dpm issue. Without it, the clocks never ramped up enough to cause an issue. With it, they did.
(In reply to Timo Aaltonen from comment #32)
> forwarding a comment from an engineer:
>
> "During viewing the source code of radeon module, I found there is a bug [1]
> related to the dpm and clocks. So I decided to do some experiments.
> Tried to set different max_sclk and max_mclk to see if the issue is gone.
> 1. max_sclk: 70000, max_mclk: 75000 --> have the same issue
> 2. max_sclk: 50000, max_mclk: 60000 --> pass multi-run test (more than 50
> runs)
>
> [1] https://bugs.freedesktop.org/show_bug.cgi?id=76490
> "
I think Sonny fixed this. It was due to using the wrong firmware.
[ 1.827060] [drm] initializing kernel modesetting (HAINAN 0x1002:0x6665 0x1028:0x0844 0xC3).
This chip should be using radeon/banks_k_2_smc.bin smc firmware. Is that available on the test system and kernel?
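The modesetting line quoted above carries the same IDs that lspci reported earlier ([1002:6665], subsystem [1028:0844], rev c3), so it can be decoded to confirm which chip variant a given dmesg belongs to. A minimal sketch follows, assuming the line format is exactly as quoted in this report; `parse_modeset_line` and the field names are hypothetical.

```python
import re

# Format assumed from the single line quoted in this report:
# (FAMILY 0xVVVV:0xDDDD 0xSSSS:0xssss 0xRR)
MODESET_RE = re.compile(
    r"initializing kernel modesetting \((?P<family>\w+) "
    r"0x(?P<vendor>[0-9A-Fa-f]{4}):0x(?P<device>[0-9A-Fa-f]{4}) "
    r"0x(?P<subvendor>[0-9A-Fa-f]{4}):0x(?P<subdevice>[0-9A-Fa-f]{4}) "
    r"0x(?P<revision>[0-9A-Fa-f]{2})\)"
)


def parse_modeset_line(line):
    """Return the chip family and numeric PCI IDs, or None if no match."""
    m = MODESET_RE.search(line)
    if m is None:
        return None
    ids = {k: int(v, 16) for k, v in m.groupdict().items() if k != "family"}
    ids["family"] = m["family"]
    return ids


line = (
    "[ 1.827060] [drm] initializing kernel modesetting "
    "(HAINAN 0x1002:0x6665 0x1028:0x0844 0xC3)."
)
chip = parse_modeset_line(line)
print(chip["family"], hex(chip["device"]), hex(chip["revision"]))
```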
The following commits are relevant:
abb2e3c1ce64c8bba678973800c34ea1dc97c42c
6458bd4dfd9414cba5804eb9907fe2a824278c34
ef736d394e85b1bf1fd65ba5e5257b85f6c82325
4e6e98b1e48c9474aed7ce03025ec319b941e26e
Does reverting a628392cf03e0eef21b345afbb192cbade041741 fix the issue?
(In reply to Alex Deucher from comment #33)
> I think Sonny fixed this. It was due to using the wrong firmware.
> [ 1.827060] [drm] initializing kernel modesetting (HAINAN 0x1002:0x6665
> 0x1028:0x0844 0xC3). This chip should be using radeon/banks_k_2_smc.bin smc
> firmware. Is that available on the test system and kernel?
The firmware radeon/banks_k_2_smc.bin is on the test system.
With Ubuntu kernel 4.4.0-101-generic, I am not sure whether the radeon driver is using this firmware.
With Ubuntu kernel 4.13.0-16-generic, I tried both the amdgpu and radeon drivers, but the system hangs with both. As soon as the system hangs, amdgpu_pm_info shows 'invalid dpm profile 15'.
(In reply to Alex Deucher from comment #34)
> The following commits are relevant:
> abb2e3c1ce64c8bba678973800c34ea1dc97c42c
> 6458bd4dfd9414cba5804eb9907fe2a824278c34
> ef736d394e85b1bf1fd65ba5e5257b85f6c82325
> 4e6e98b1e48c9474aed7ce03025ec319b941e26e
These commits should already be included in Ubuntu kernel 4.13.0-16-generic.
(In reply to Alex Deucher from comment #35)
> Does reverting a628392cf03e0eef21b345afbb192cbade041741 fix the issue?
Reverting this commit does not fix the issue.
BTW, with 4.13.0-16-generic, I changed max_sclk in drm/radeon/si_dpm.c (as we did with Ubuntu kernel 4.4.0-101-generic) from 75000 to 65000, but still hit the hang.
(In reply to Robert Liu from comment #36)
> BTW, with 4.13.0-16-generic, I change the max_sclk in drm/radeon/si_dpm.c
> (what we did with Ubuntu kernel 4.4.0-101-generic) from 75000 to 65000, but
> still met the hang issue.
With max_sclk restricted to 65000 and max_mclk to 80000, neither radeon nor amdgpu has the issue.
Created attachment 135647 [details] [review] workaround for radeon workarounds for radeon and amdgpu to fix the issue.
Created attachment 135648 [details] [review] workaround for amdgpu
Created attachment 135662 [details] dmesg
(In reply to Alex Deucher from comment #38)
> Created attachment 135647 [details] [review] [review]
> workaround for radeon
>
> workarounds for radeon and amdgpu to fix the issue.
I applied this patch on top of the Ubuntu-4.4.0-101.124 Linux kernel. It seems to fix the issue at first, but a problem shows up later on.
$ seq 20 | while read i; do echo Loop $i; DRI_PRIME=1 glxgears -info|head -n 5; done
Loop 1
radeon: Failed to allocate virtual address for buffer:
radeon: size : 65536 bytes
radeon: alignment : 4096 bytes
radeon: domains : 4
radeon: va : 0x0000000000800000
radeon: Failed to deallocate virtual address for buffer:
radeon: size : 65536 bytes
radeon: va : 0x800000
radeon: Failed to allocate virtual address for buffer:
radeon: size : 65536 bytes
radeon: alignment : 4096 bytes
radeon: domains : 4
radeon: va : 0x0000000000800000
radeon: Failed to deallocate virtual address for buffer:
radeon: size : 65536 bytes
radeon: va : 0x800000
radeonsi: Failed to create a context.
Loop 2
...
So far, with max_sclk set to 60000 and max_mclk set to 80000, the system has passed a 24-hour burn-in test (`vblank_mode=0 DRI_PRIME=1 glmark2 --run-forever`).
Another issue I found: when the AC adapter is removed, the system goes to suspend. After I wake it up, it continues running the benchmark.
(In reply to Robert Liu from comment #41)
> Another issue found is when removing the adapter, the system goes to
> suspend.
That's not directly related to graphics drivers.
I can still reproduce the issue after setting max_sclk to 60000 and max_mclk to 80000.
I tried max_sclk = 50000 and max_mclk = 60000 on Ubuntu-4.4.0-112.135, but I can still reproduce the GPU lockup. It passes the first run of `seq 100 | while read i; do echo Loop $i; DRI_PRIME=1 glxgears -info|head -n 3; done`, but fails on the second run of the same command.
I can still reproduce this issue on Ubuntu 18.04 with `seq 100 | while read i; do echo Loop $i; DRI_PRIME=1 glxgears -info|head -n2; done`.
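The shell loop above starts the client, keeps the first few lines of output, and lets it exit. The same procedure can be scripted so that each loop's output is kept for later comparison. This is only a sketch under assumptions: `first_lines` and `run_loops` are hypothetical helpers, and stopping the child with `terminate()` stands in for the pipe closing behind `head`.

```python
import os
import subprocess


def first_lines(cmd, n, extra_env=None):
    """Start cmd, return its first n stdout lines, then stop it.

    Roughly mirrors `cmd | head -n N`; extra_env is added to the
    environment (e.g. {"DRI_PRIME": "1"}).
    """
    env = dict(os.environ, **(extra_env or {}))
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, env=env, text=True)
    lines = []
    try:
        while len(lines) < n:
            line = proc.stdout.readline()
            if not line:  # child exited or closed stdout early
                break
            lines.append(line.rstrip("\n"))
    finally:
        proc.stdout.close()
        proc.terminate()
        proc.wait()
    return lines


def run_loops(cmd, loops, n, extra_env=None):
    """Run cmd `loops` times and collect the first n lines of each run."""
    results = []
    for i in range(1, loops + 1):
        # Flush so the loop number is visible even if the machine locks up.
        print(f"Loop {i}", flush=True)
        results.append(first_lines(cmd, n, extra_env))
    return results
```

Calling `run_loops(["glxgears", "-info"], loops=100, n=2, extra_env={"DRI_PRIME": "1"})` would correspond to the loop quoted above. As reported in this thread, the failure may only show up on a second pass, so a single clean pass is not conclusive.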
The Linux kernel in comment 45 was 4.15.0-10.11 from Ubuntu 18.04. With a later version, 4.15.0-12.13, I cannot reproduce this issue on Ubuntu 18.04. 4.15.0-12.13 contains the following commit:
commit 239b5f64e12b1f09f506c164dff0374924782979
Author: Alex Deucher <alexander.deucher@amd.com>
Date: Tue Nov 21 12:09:38 2017 -0500
    drm/radeon: Add dpm quirk for Jet PRO (v2)
    Fixes stability issues.
    v2: clamp sclk to 600 Mhz
    Bug: https://bugs.freedesktop.org/show_bug.cgi?id=103370
    Acked-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org
diff --git a/drivers/gpu/drm/radeon/si_dpm.c b/drivers/gpu/drm/radeon/si_dpm.c
index ee3e742..97a0a63 100644
--- a/drivers/gpu/drm/radeon/si_dpm.c
+++ b/drivers/gpu/drm/radeon/si_dpm.c
@@ -2984,6 +2984,11 @@ static void si_apply_state_adjust_rules(struct radeon_device *rdev,
 		    (rdev->pdev->device == 0x6667)) {
 			max_sclk = 75000;
 		}
+		if ((rdev->pdev->revision == 0xC3) ||
+		    (rdev->pdev->device == 0x6665)) {
+			max_sclk = 60000;
+			max_mclk = 80000;
+		}
 	} else if (rdev->family == CHIP_OLAND) {
 		if ((rdev->pdev->revision == 0xC7) ||
 		    (rdev->pdev->revision == 0x80) ||
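As a reading aid for the hunk above, the five added lines can be modelled in isolation. This sketch makes two assumptions. First, the clock values are taken to be in units of 10 kHz, inferred from the commit message ("clamp sclk to 600 Mhz") next to `max_sclk = 60000`; applying the same unit to `max_mclk` is a guess. Second, the enclosing chip-family branch is not visible in the hunk, so it is not modelled. `jet_pro_clock_caps` and `to_mhz` are hypothetical names, not kernel functions.

```python
def jet_pro_clock_caps(revision, device):
    """Model only the condition added by the patch.

    Returns (max_sclk, max_mclk) in assumed 10 kHz units, or None when
    the added lines would leave the limits untouched.
    """
    # The patch uses `||`: either the revision or the device ID matching
    # is enough to apply the caps.
    if revision == 0xC3 or device == 0x6665:
        return (60000, 80000)
    return None


def to_mhz(clk_10khz):
    """Convert a clock value in assumed 10 kHz units to MHz."""
    return clk_10khz / 100


caps = jet_pro_clock_caps(revision=0xC3, device=0x6665)
print([to_mhz(c) for c in caps])
```

Because the condition is an OR, a board with device ID 0x6665 gets the caps at any revision, and any board reaching this code with revision 0xC3 gets them regardless of device ID.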