Bug 100964 - RX-480 [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
Reported: 2017-05-08 01:38 UTC by HV
Modified: 2019-11-19 08:16 UTC

Attachments
dmesg output on amdgpu load failure (61.15 KB, text/plain)
2017-05-08 01:38 UTC, HV
dmesg output with amdgpu loading successfully (58.12 KB, text/plain)
2017-05-08 01:41 UTC, HV
lspci short output (1.69 KB, text/plain)
2017-05-08 01:42 UTC, HV
Success with customized kernel version 4.13.0-rc2+ (59.95 KB, text/plain)
2017-08-13 04:31 UTC, Maxim Cournoyer
Debian 9 stretch stock kernel failing to initialize R9 285 on Asus M2N SLI Deluxe motherboard (67.10 KB, text/plain)
2017-08-13 04:38 UTC, Maxim Cournoyer
dmesg failed init with 4.13.0-rc4 (60.50 KB, text/plain)
2017-08-13 21:55 UTC, Maxim Cournoyer

Description HV 2017-05-08 01:38:51 UTC
Created attachment 131248 [details]
dmesg output on amdgpu load failure

Inconsistent amdgpu driver loading for an RX-480:

Most of the time, the driver will fail to load starting with the error:
    [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
The screen goes into stand-by and I get no display output. The rest of
the system still loads normally and I can ssh and look around.
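
When checking many boots for this failure, it can help to pull the error out of saved dmesg text automatically. The following is a minimal sketch, not part of the original report: the function name `find_ring_test_failures` is hypothetical, and the pattern only assumes the message format quoted above.

```python
import re

# Matches the ring-test failure message quoted in this report.
RING_TEST_RE = re.compile(
    r"ring (?P<ring>\d+) test failed "
    r"\(scratch\((?P<reg>0x[0-9A-Fa-f]+)\)=(?P<value>0x[0-9A-Fa-f]+)\)"
)


def find_ring_test_failures(dmesg_text):
    """Return one dict per ring-test failure found in the given dmesg text."""
    failures = []
    for line in dmesg_text.splitlines():
        match = RING_TEST_RE.search(line)
        if match:
            failures.append({
                "ring": int(match.group("ring")),
                "scratch_reg": int(match.group("reg"), 16),
                "value": int(match.group("value"), 16),
            })
    return failures


# The error line from this report, used here as sample input.
sample = (
    "[drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: "
    "ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)\n"
)
for failure in find_ring_test_failures(sample):
    print(failure)
```

As far as I understand the driver, 0xCAFEDEAD is the placeholder written to the scratch register before the test, so seeing it in the error suggests the ring never executed the test write.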

------------
But it will, very seldom, load the driver normally (KMS enabled, display
still active).
------------

I can use nomodeset to prevent amdgpu from being initialized and the
display continues to work, but without 3D accel and just one monitor
working (out of two).

To confirm that the GPU actually works (both in general and in linux
with amdgpu), I tested it on a friend's PC with the same distro I'm
using (Debian Testing/amd64). The driver loaded normally on the first
try and worked consistently for all the reboots we tried.

Since then I've read everything I could find on the issue, but I couldn't
find a solution. There are similar reports for other AMD video cards,
but none of them offers a fix for the inconsistent behavior.

Things tried so far:
    Debian testing and unstable with most of the kernels released since
        4.7.0-1 up to their latest kernel 4.9.25-1 (4.9.0-3 in their
        versioning system).
    Gentoo with the genkernel 4.9.16.
    Gentoo custom kernel 4.9.16
    Gentoo with kernel 4.11 from https://cgit.freedesktop.org/~agd5f/linux/
        (drm-fixes-4.11).
    Toggling ACPI options and most, if not all, amdgpu module parameters
        in all tested kernels.
    Using different outputs (DVI, HDMI and both. I cannot test
        DisplayPort).
    
The Debian and Gentoo installs are fresh in different drives. I can
test things on either one.

I'd be ok with my MB or CPU somehow being too old or incompatible. But
the few successful boots tell me that it can and *does* work on my pc.

I'm attaching two dmesg logs, one for the working boot and one for the
failing one. They were run shortly after each other and nothing was
changed in between.
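
For anyone comparing the two logs, a rough way to see what only the failing boot printed is to strip the kernel timestamps and list the messages missing from the working log. This is a minimal sketch under that assumption; the helper names `normalize` and `only_in_failing` are hypothetical, and the sample lines are made up for illustration rather than taken from the attachments.

```python
import re

# Leading kernel timestamp such as "[   11.957731] ".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def normalize(line):
    """Drop the timestamp and surrounding whitespace so boots are comparable."""
    return TIMESTAMP_RE.sub("", line).strip()


def only_in_failing(working_lines, failing_lines):
    """Return distinct messages present in the failing log but not the working one."""
    seen = {normalize(line) for line in working_lines}
    result = []
    for line in failing_lines:
        text = normalize(line)
        if text and text not in seen:
            result.append(text)
            seen.add(text)  # report each distinct message only once
    return result


# Illustrative input only; real use would read the two attached dmesg files.
working = [
    "[    1.000000] amdgpu: init ok",
    "[    2.000000] common line",
]
failing = [
    "[    1.500000] common line",
    "[    3.000000] amdgpu_init failed",
    "[    3.100000] amdgpu_init failed",
]
print(only_in_failing(working, failing))
```

Messages that embed addresses or counters will differ between any two boots, so the output still needs a manual read.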

I'm attaching lspci output too.

Please let me know if you need any other info or clarification on the
error.

Regards.

HV
Comment 1 HV 2017-05-08 01:41:10 UTC
Created attachment 131249 [details]
dmesg output with amdgpu loading successfully
Comment 2 HV 2017-05-08 01:42:26 UTC
Created attachment 131250 [details]
lspci short output
Comment 3 Alex Deucher 2017-05-08 15:01:29 UTC
Sounds like it may be a power supply issue.  Do you have another power supply you could try?
Comment 4 HV 2017-05-08 16:44:30 UTC
Not at the moment. Maybe some cheapo PSU lying around, but I'm not sure
it's going to be able to handle the draw.

I currently have an Antec VP500P (500 W) and the GPU works on Windows 7
on the same PC (installed alongside Debian and Gentoo). It's been used
extensively on Windows without any issues.

My friend's PC does have a more powerful PSU (a Seasonic 750 W). So
maybe the card needs a bit more power during init on Linux. Is this
possible?

I'll borrow the 750w PSU during the weekend and give it a try.

In the meantime, is there anything else I can test or info I can
provide? If not, I'll update/confirm during the weekend.
Comment 5 HV 2017-05-13 21:41:27 UTC
I tested a Seasonic 750 W PSU with the RX 480 but I got the same error on boot (amdgpu not loading; the rest of the system boots normally and I still get display output with nomodeset on).

Is there anything else I can test for this issue?

Regards

HV
Comment 6 Maxim Cournoyer 2017-08-13 01:29:38 UTC
Hi! I have the exact same issue ``[drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)`` with an R9 285 GPU. I've had this problem for ages and have been using nomodeset to get by.

I'm trying this on Debian 9 (stretch), with kernel ``Linux debian 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux``. I've attached the full dmesg.

The interesting thing is that this seems to be related to the motherboard; when using the very same card (R9 285) in another system, *with the same software* (Debian 9), it works! It is not a hardware problem: the power supply is brand new (Seasonic G550W), the RAM tests fine, the SSD is brand new, the CMOS battery too, etc.

The problem occurs on a system based on the Asus M2N SLI Deluxe motherboard, and it disappears when the card is used in an equally old system based on the Asus P5W DH Deluxe.

I notice there is a message saying that the clock source is unstable right before the error occurs; could it be related?

Here is an excerpt of the dmesg: 
[   11.552754] 
                failed to send pre message 5b ret is 0 
[   11.748489] 
                failed to send message 5b ret is 0 
[   11.748536] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[   11.748538] clocksource:                       'acpi_pm' wd_now: f69088 wd_last: ba62a0 mask: ffffff
[   11.748539] clocksource:                       'tsc' cs_now: 167976623b cs_last: 12698f30d7 mask: ffffffffffffffff
[   11.957731] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
[   11.957840] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -22
[   11.957879] amdgpu 0000:03:00.0: amdgpu_init failed
[   12.171671] 
                failed to send pre message 133 ret is 0 
[   12.385424] 
                failed to send message 133 ret is 0 
[   12.385433] DPM is not running right now, no need to disable DPM!
[   12.386774] clocksource: Switched to clocksource acpi_pm
[   12.772038]

I can provide dmesg from the working system (using the same software with the same card) if judged useful.
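
To check whether other messages, such as the clocksource warning, consistently show up just before the ring test fails, the excerpt can be scanned for everything logged within a short window ahead of the error. This is a minimal sketch, not a diagnosis: the helper names, the 0.5 s default window, and the treatment of lines without a timestamp as continuations of the previous entry are all assumptions made for illustration.

```python
import re

ENTRY_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s?(?P<msg>.*)$")


def parse_entries(lines):
    """Turn dmesg lines into (timestamp, message) pairs.

    Lines without a timestamp are appended to the previous entry, which
    is how the "failed to send ..." messages appear in the excerpt.
    """
    entries = []
    for raw in lines:
        match = ENTRY_RE.match(raw)
        if match:
            entries.append([float(match.group("ts")), match.group("msg").strip()])
        elif entries and raw.strip():
            entries[-1][1] = (entries[-1][1] + " " + raw.strip()).strip()
    return [(ts, msg) for ts, msg in entries]


def messages_before_failure(lines, window=0.5, marker="ring 0 test failed"):
    """Return non-empty entries logged up to `window` seconds before the marker."""
    entries = parse_entries(lines)
    for ts, msg in entries:
        if marker in msg:
            return [(t, m) for t, m in entries if ts - window <= t < ts and m]
    return []


# Lines copied from the excerpt in this comment.
excerpt = [
    "[   11.552754] ",
    "                failed to send pre message 5b ret is 0 ",
    "[   11.748489] ",
    "                failed to send message 5b ret is 0 ",
    "[   11.748536] clocksource: timekeeping watchdog on CPU0: Marking "
    "clocksource 'tsc' as unstable because the skew is too large:",
    "[   11.957731] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: "
    "ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)",
]
for ts, msg in messages_before_failure(excerpt):
    print(ts, msg)
```

Messages appearing close together in time do not establish a cause; running the same scan on the working system's log would show whether the clocksource warning also occurs there.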
Comment 7 Maxim Cournoyer 2017-08-13 04:31:48 UTC
Created attachment 133465 [details]
Success with customized kernel version 4.13.0-rc2+
Comment 8 Maxim Cournoyer 2017-08-13 04:37:09 UTC
I managed to make it work by compiling the latest 4.13.0-rc2+ from git://people.freedesktop.org/~agd5f/linux. I used the 'drm-next-4.14-wip' branch and customized it a bit (added the CIK option for the amdgpu driver, removed AGP support, bumped the event rate to 1000 Hz, dropped a few Intel-specific options since I'm using an AMD K10 class CPU, and enabled a few AMD-specific options). I'll try to narrow down exactly what fixes it, but one thing we can see is that there are no longer any clock source skew problems apparent in the kernel messages.
Comment 9 Maxim Cournoyer 2017-08-13 04:38:48 UTC
Created attachment 133466 [details]
Debian 9 stretch stock kernel failing to initialize R9 285 on Asus M2N SLI Deluxe motherboard

I had forgotten to attach that one.
Comment 10 Maxim Cournoyer 2017-08-13 21:51:27 UTC
A few more data points. None of the 'vanilla' kernels could initialize the R9 285 (tonga 1.2) card on this Asus M2N SLI Deluxe motherboard (remember that the same card works easily on an Asus P5W DH Deluxe based system).

I've tried building the following kernels (reusing the Debian 9 stable 4.9.0 kernel config as a starting point) and booted them, but I would always get a CAFEDEAD error:

* 4.11.0 from stretch-backports (didn't need to build this one)
* 4.12.7 from kernel.org
* 4.13.0-rc4-1 from kernel.org

None of them worked. So my only success so far is with the drm-next-4.14-wip branch from git://people.freedesktop.org/~agd5f/linux.

I've included the dmesg I get when booting the 4.13.0-rc4 kernel. It has new error output mentioning powerplay:

[    5.514767] amdgpu: [powerplay] 
                failed to send message 254 ret is 0 
[    5.514793] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
[    5.515491] amdgpu: [powerplay] Invalid VDDGFX value!
[    5.515491] amdgpu: [powerplay] Get EVV Voltage Failed.  Abort Driver loading!
[    5.515493] amdgpu: [powerplay] amdgpu: powerplay initialization failed
[    5.703544] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
[    5.703601] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -22
[    5.703630] amdgpu 0000:03:00.0: amdgpu_init failed
Comment 11 Maxim Cournoyer 2017-08-13 21:55:45 UTC
Created attachment 133478 [details]
dmesg failed init with 4.13.0-rc4

Failure to initialize an R9 285 on an Asus M2N SLI Deluxe motherboard with the 4.13.0-rc4 kernel.
Comment 12 Martin Peres 2019-11-19 08:16:21 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/165.

