Created attachment 131248 [details]
dmesg output on amdgpu load failure

Inconsistent amdgpu driver loading for an RX-480:

Most of the time, the driver fails to load, starting with the error:

[drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)

The screen goes into stand-by and I get no display output. The rest of the system still loads normally and I can ssh in and look around.

------------

Very rarely, though, the driver loads normally (KMS enabled, display still active).

------------

I can use nomodeset to prevent amdgpu from being initialized, and the display continues to work, but without 3D acceleration and with just one monitor working (out of two).

To confirm that the GPU actually works (both in general and in Linux with amdgpu), I tested it on a friend's PC with the same distro I'm using (Debian Testing/amd64). The driver loaded normally on the first try and worked consistently for all the reboots we tried.

Since then I've read everything I could find on the issue, without finding a solution. There are similar reports for other AMD video cards, but none of them explains or resolves the inconsistency.

Things tried so far:

Debian testing and unstable with most of the kernels released since 4.7.0-1 up to their latest kernel 4.9.25-1 (4.9.0-3 in their versioning system).
Gentoo with the genkernel 4.9.16.
Gentoo custom kernel 4.9.16.
Gentoo with kernel 4.11 from https://cgit.freedesktop.org/~agd5f/linux/ (drm-fixes-4.11).
Alternating ACPI and most, if not all, amdgpu parameters in all tested kernels.
Using different outputs (DVI, HDMI and both. I cannot test DisplayPort).

The Debian and Gentoo systems are fresh installs on different drives. I can test things on either one.

I'd be OK with my MB or CPU somehow being too old or incompatible, but the few successful boots tell me that it can and *does* work on my PC.

I'm attaching two dmesg logs, one for the working boot and one for the failing one. They were captured shortly after each other and nothing was changed in between. I'm attaching lspci output too.

Please let me know if you need any other info or clarification on the error.

Regards.
HV
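Since the working and failing logs come from back-to-back boots, one way to spot where they diverge is a diff that ignores the kernel timestamps. The following is only a minimal sketch in Python, assuming both logs have been saved as text; the function names `strip_timestamps` and `diff_logs` are illustrative and not part of any existing tool.

```python
import difflib
import re

# Matches the "[   11.957731] " prefix the kernel puts on each dmesg line.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamps(text):
    """Return the log lines with the leading kernel timestamp removed."""
    return [TIMESTAMP.sub("", line) for line in text.splitlines()]


def diff_logs(working, failing):
    """Unified diff of two dmesg dumps, ignoring timestamps (no context lines)."""
    return list(difflib.unified_diff(
        strip_timestamps(working),
        strip_timestamps(failing),
        fromfile="working",
        tofile="failing",
        lineterm="",
        n=0,
    ))
```

To use it, read the two saved logs into strings (for example from hypothetical files `dmesg-ok.txt` and `dmesg-fail.txt`) and print each line returned by `diff_logs`. Lines that differ only in their timestamp drop out, so what remains is mostly the messages that appear in one boot but not the other.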
Created attachment 131249 [details] dmesg output with amdgpu loading successfully
Created attachment 131250 [details] lspci short output
Sounds like it may be a power supply issue. Do you have another power supply you could try?
Not at the moment. Maybe some cheapo PSU lying around, but I'm not sure it's going to be able to handle the draw.

I currently have an Antec VP500P (500W), and the GPU works on Windows 7 on the same PC (installed alongside Debian and Gentoo). It's been used extensively on Windows without any issues.

My friend's PC does have a more powerful PSU (some Seasonic 750W). So maybe during init it needs a bit more of a push on Linux to start up. Is this possible?

I'll borrow the 750W PSU during the weekend and give it a try. In the meantime, is there anything else I can test or info I can provide? If not, I'll update/confirm during the weekend.
I tested a Seasonic 750W PSU with the RX480 but I got the same error on boot (amdgpu not loading; the rest of the system boots normally and I still get display output with nomodeset on).

Is there anything else I can test for this issue?

Regards
HV
Hi!

I have the exact same issue ``[drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)`` with an R9 285 GPU. I've had this problem for ages and had been using nomodeset to get by.

I'm trying this on Debian 9 (stretch), with kernel ``Linux debian 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux``. I've attached the full dmesg.

The interesting thing is that this seems to be related to the motherboard: when using the very same card (R9 285) in another system, *with the same software* (Debian 9), it works! It is not a problem with the other hardware: the power supply is brand new (Seasonic G550W), the RAM tests fine, the SSD is brand new, the CMOS battery too, etc.

The problem occurs on an Asus M2N SLI Deluxe motherboard based system, and it disappears when using the card in an equally old Asus P5W DH Deluxe based system.

I notice there is a message saying that the clock source is unstable right before the error occurs; could it be related? Here is an excerpt of the dmesg:

[ 11.552754] failed to send pre message 5b ret is 0
[ 11.748489] failed to send message 5b ret is 0
[ 11.748536] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 11.748538] clocksource: 'acpi_pm' wd_now: f69088 wd_last: ba62a0 mask: ffffff
[ 11.748539] clocksource: 'tsc' cs_now: 167976623b cs_last: 12698f30d7 mask: ffffffffffffffff
[ 11.957731] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 11.957840] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -22
[ 11.957879] amdgpu 0000:03:00.0: amdgpu_init failed
[ 12.171671] failed to send pre message 133 ret is 0
[ 12.385424] failed to send message 133 ret is 0
[ 12.385433] DPM is not running right now, no need to disable DPM!
[ 12.386774] clocksource: Switched to clocksource acpi_pm
[ 12.772038]

I can provide the dmesg from the working system (using the same software with the same card) if that would be useful.
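To check what the kernel logged just before the ring test failure in a full dmesg (for instance, whether the clocksource warning always precedes it), the log can be filtered by timestamp. This is a rough sketch in Python; the helper names `parse` and `before_ring_failure` and the 0.5 second default window are assumptions chosen for illustration, not anything taken from the driver.

```python
import re

# "[ 11.957731] message" -> (timestamp in seconds, message text)
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
RING_FAIL = re.compile(r"ring \d+ test failed")


def parse(text):
    """Split a dmesg dump into (timestamp, message) pairs; skip other lines."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    return entries


def before_ring_failure(text, window=0.5):
    """Messages logged within `window` seconds before the first ring test failure."""
    entries = parse(text)
    for stamp, msg in entries:
        if RING_FAIL.search(msg):
            return [(t, m) for t, m in entries if stamp - window <= t < stamp]
    return []  # no ring test failure in this log
```

Running `before_ring_failure` on both the failing log and a log from the working system would show whether the "failed to send message" and clocksource lines only appear on the failing boot. That would suggest a correlation, not prove a cause.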
Created attachment 133465 [details] Success with customized kernel version 4.13.0-rc2+
I managed to make it work by compiling the latest 4.13.0-rc2+ from here: git://people.freedesktop.org/~agd5f/linux.

I used the 'drm-next-4.14-wip' branch, and customized it a bit (added the CIK option for the amdgpu driver, removed AGP support, bumped the event rate to 1000 Hz, dropped a few Intel specific options (I'm using an AMD K10 class CPU) and enabled a few AMD specific options).

I'll try to narrow down exactly what fixes it, but one thing we can see is that the clock source skew problem no longer shows up in the kernel messages.
Created attachment 133466 [details]
Debian 9 stretch stock kernel failing to initialize R9 285 on Asus M2N SLI Deluxe motherboard

I had forgotten to attach that one.
A few more data points.

None of the 'vanilla' kernels could initialize the R9 285 (tonga 1.2) card on this Asus M2N SLI Deluxe motherboard (remember that the same card works easily on an Asus P5W DH Deluxe based system).

I've tried building the following kernels (reusing the Debian 9 stable 4.9.0 kernel config as a starting point) and booted them, but I would always get a CAFEDEAD error:

* 4.11.0 from stretch-backports (didn't need to build this one)
* 4.12.7 from kernel.org
* 4.13.0-rc4-1 from kernel.org

None of them worked. So my only success so far is with the drm-next-4.14-wip branch from git://people.freedesktop.org/~agd5f/linux.

I've included the dmesg I get when booting the 4.13.0-rc4 kernel; it has new error output mentioning powerplay:

[ 5.514767] amdgpu: [powerplay] failed to send message 254 ret is 0
[ 5.514793] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
[ 5.515491] amdgpu: [powerplay] Invalid VDDGFX value!
[ 5.515491] amdgpu: [powerplay] Get EVV Voltage Failed. Abort Driver loading!
[ 5.515493] amdgpu: [powerplay] amdgpu: powerplay initialization failed
[ 5.703544] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 5.703601] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -22
[ 5.703630] amdgpu 0000:03:00.0: amdgpu_init failed
Created attachment 133478 [details]
dmesg failed init with 4.13.0-rc4

Failure to initialize an R9 285 on an Asus M2N SLI Deluxe motherboard with the 4.13.0-rc4 kernel.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/165.