Bug 78853

Summary: Panic(?) after 10-20 minutes use of R7 265 in X
Product: xorg
Reporter: org.freedesktop
Component: Driver/Radeon
Assignee: xf86-video-ati maintainers <xorg-driver-ati>
Status: RESOLVED MOVED
QA Contact: Xorg Project Team <xorg-team>
Severity: normal
Priority: medium
CC: julien.isorce
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
Attachments:
Description                                Flags
Xorg log                                   none
Dmesg                                      none
Xorg log                                   none
Dmesg with 3.15-rc5 and new firmware       none
Xorg log with 3.15-rc5 and new firmware    none
Post-crash reboot dmesg.                   none
Output of lspci -vnn                       none
lspci -vnn (with privileges)               none

Description org.freedesktop 2014-05-18 10:12:23 UTC
I'm experiencing a strange crash that causes the machine to reboot after 10-20 minutes of use in Xorg. The crash does not appear to require any use of OpenGL (simply using a browser for that time on an xfce desktop will trigger it). There are no useful error messages logged in dmesg before the crash happens. When the crash occurs, the screen is quickly replaced with garbage, flickers, and then the machine reboots a few seconds later (presumably there is a panic message on the console, but I cannot see it with X in the way).

The card is an R7 265. Specifically, this one:

  http://www.scan.co.uk/products/2gb-xfx-radeon-r7-265-5600mhz-gddr5-gpu-900mhz-boost-925mhz-1280-streams-dport-dvi-hdmi

The card is only a week old, and the machine has been using a 6570 for a few years and has been completely stable.

I've performed the usual troubleshooting (have run memtest to eliminate problems with RAM, have kept a careful eye on temperature sensors to eliminate thermal issues - the machine is well ventilated and in a well ventilated area). I can't completely eliminate the power supply yet but I am reasonably confident that it's not broken (it's a good quality Seasonic PSU). I'll be obtaining another power supply today to eliminate that.

The operating system is Arch Linux on x86-64. Please see attached dmesg and xorg logs.

I'm not sure what other information I can provide (or if there's a reliable means of getting panic information from the machine), please advise!
Comment 1 org.freedesktop 2014-05-18 10:12:52 UTC
Created attachment 99268
Xorg log
Comment 2 org.freedesktop 2014-05-18 10:13:40 UTC
Created attachment 99269
Dmesg

The full dmesg as it appears right up until the crash.
Comment 3 org.freedesktop 2014-05-18 10:19:44 UTC
Created attachment 99270
Xorg log

Actual X log (incorrectly uploaded dmesg twice).
Comment 4 Alex Deucher 2014-05-19 02:12:40 UTC
Possibly a duplicate of bug 75992.  Can you try a newer kernel with the patches and new firmware referenced on that bug?
Comment 5 org.freedesktop 2014-05-19 17:01:45 UTC
Apologies for the delay; I don't currently have access to the machine. I will have access tomorrow and will try 3.15-rc5 as you suggest.
Comment 6 org.freedesktop 2014-05-19 17:03:38 UTC
I should mention: Further testing on the day I reported indicated minor video corruption in the bios (random pixels appeared to be stuck). I'm trying to rule out faulty hardware, so I have to ask: Does the firmware uploaded by the driver persist after a reboot? If the firmware doesn't persist, then presumably video corruption in the bios would indicate faulty hardware.
Comment 7 Alex Deucher 2014-05-19 18:13:35 UTC
(In reply to comment #6)
> I should mention: Further testing on the day I reported indicated minor
> video corruption in the bios (random pixels appeared to be stuck). I'm
> trying to rule out faulty hardware, so I have to ask: Does the firmware
> uploaded by the driver persist after a reboot? If the firmware doesn't
> persist, then presumably video corruption in the bios would indicate faulty
> hardware.

It may if it's a warm reboot.
Comment 8 org.freedesktop 2014-05-21 15:21:13 UTC
Hello. 3.15.0-rc5 is built. I notice from my own dmesg that the kernel says "loaded PITCAIRN firmware". To be clear, does this mean that I copy all of the PITCAIRN* files from your firmware directory into /usr/lib/firmware/radeon?
(Backing up the originals, of course!).
Comment 9 Alex Deucher 2014-05-21 15:24:29 UTC
(In reply to comment #8)
> Hello. 3.15.0-rc5 is built. I notice from my own dmesg that the kernel says
> "loaded PITCAIRN firmware". To be clear, does this mean that I copy all of
> the PITCAIRN* files from your firmware directory into
> /usr/lib/firmware/radeon?
> (Backing up the originals, of course!).

You only need to add PITCAIRN_mc2.bin.  The rest are the same as what you have.  Make sure the firmware is included in your initrd if you are using one.
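As a rough sketch of how the initrd could be checked: on Arch, the image contents can be listed with lsinitcpio (part of mkinitcpio) and the listing searched for the new file. The helper name and the image path below are placeholders for illustration, not details taken from this report.

```shell
#!/bin/sh
# Succeeds if a file listing read from stdin contains the given radeon
# firmware file. "listing_has_firmware" is a hypothetical helper name.
listing_has_firmware() {
    grep -q -F -- "radeon/$1"
}

# Usage (the image path is a placeholder; use the image built for your kernel):
#   lsinitcpio /path/to/initramfs.img | listing_has_firmware PITCAIRN_mc2.bin \
#       && echo "firmware is in the initrd"
```

If the file is missing, the image would need to be regenerated after copying the firmware into place.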
Comment 10 org.freedesktop 2014-05-21 15:26:16 UTC
Right, thanks (and thanks for the impressively quick response!)
Comment 11 org.freedesktop 2014-05-21 16:55:25 UTC
Ok, the machine still crashes, but the failure mode seems to have changed.

Previously, the screen would flicker, become garbage, and then the machine would reboot.

With the new kernel and firmware, the machine ran for about an hour in X without crashing. I then tried to provoke a crash with some reasonably intensive OpenGL work, namely running Half-Life 2. Please see "dmesg-new.txt" and "xorg-new.txt" for the logs for the new kernel.

The game ran for a couple of minutes with the default settings. I then adjusted the settings one at a time (increasing texture quality to max, increasing shadow quality to max, etc). The screen went black and the machine instantly rebooted.

I tried the same thing again, but this time the screen flickered, became garbage, went black, but the machine didn't reboot. I logged in over ssh and checked the dmesg. There were no *useful* messages beyond the status of the network card changing as can be seen in the last dmesg. However, the kernel apparently appended one line to the dmesg: "mce: [Hardware error] Machine check events logged", but there was no more detail than that. I attempted to kill X from the ssh connection and the machine hard rebooted when X went down.

As I'm sitting here writing this at the console login prompt, the kernel has again printed "mce: [Hardware error] Machine check events logged". Please see "dmesg-aftercrash.txt" for the dmesg as it appears right now.
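To pull just these lines out of a long dmesg, a small filter along the following lines may help. The function name is made up for illustration. Decoding the logged events themselves would need a separate tool such as mcelog, assuming it is installed.

```shell
#!/bin/sh
# Print only the machine-check / hardware-error lines from a kernel log
# supplied on stdin. "hw_error_lines" is a hypothetical helper name.
hw_error_lines() {
    grep -E 'mce:|\[Hardware [Ee]rror\]'
}

# Usage:
#   dmesg | hw_error_lines
```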
Comment 12 org.freedesktop 2014-05-21 16:55:54 UTC
Created attachment 99518
Dmesg with 3.15-rc5 and new firmware
Comment 13 org.freedesktop 2014-05-21 16:56:23 UTC
Created attachment 99519
Xorg log with 3.15-rc5 and new firmware
Comment 14 org.freedesktop 2014-05-21 16:57:05 UTC
Created attachment 99520
Post-crash reboot dmesg.
Comment 15 org.freedesktop 2014-05-21 17:00:24 UTC
I've just noticed that there is a slightly more detailed "Hardware error" logged in the post-crash dmesg, starting at [10.159751]. I'm not sure if this is related, but I've not seen it before.
Comment 16 Alex Deucher 2014-05-21 18:05:38 UTC
Does disabling dpm help?  Try booting with radeon.dpm=0 on the kernel command line in grub.
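A minimal sketch for confirming after the reboot that the option actually took effect. The helper name is hypothetical; /proc/cmdline is the standard location of the booted command line, and the sysfs path in the usage comment assumes the radeon module exposes its dpm parameter there.

```shell
#!/bin/sh
# Print the radeon.dpm value found in a kernel command line string,
# or "unset" if the option is absent. If the option appears more than
# once, the last occurrence is reported.
dpm_from_cmdline() {
    value=unset
    for word in $1; do
        case "$word" in
            radeon.dpm=*) value=${word#radeon.dpm=} ;;
        esac
    done
    printf '%s\n' "$value"
}

# Usage on the running system:
#   dpm_from_cmdline "$(cat /proc/cmdline)"
#   cat /sys/module/radeon/parameters/dpm   # value seen by the loaded module
```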
Comment 17 org.freedesktop 2014-05-21 18:42:53 UTC
Hm, yes! That seems to have done it. I've been unable to crash it so far...
Comment 18 org.freedesktop 2014-05-21 18:45:14 UTC
Attached the output of lspci -vnn, as I just noticed you asked for it in the other report.
Comment 19 org.freedesktop 2014-05-21 18:45:43 UTC
Created attachment 99525
Output of lspci -vnn
Comment 20 org.freedesktop 2014-05-21 18:46:51 UTC
Created attachment 99526
lspci -vnn (with privileges)
Comment 21 org.freedesktop 2014-05-22 20:17:59 UTC
I'm not sure if this constitutes a resolution.

I assume radeon.dpm=0 disables power management, which seems undesirable to say the least...
Comment 22 Alex Deucher 2014-05-22 21:16:17 UTC
(In reply to comment #21)
> I'm not sure if this constitutes a resolution.
> 
> I assume radeon.dpm=0 disables power management, which seems undesirable to
> say the least...

It's a workaround until we solve why dpm is not stable on your system.
Comment 23 org.freedesktop 2014-05-22 21:52:31 UTC
Right. I'll leave it in your hands then. Let me know if there's any more information I can provide.
Comment 24 Alex Deucher 2014-05-22 22:39:58 UTC
Does attachment 98997 help?
Comment 25 org.freedesktop 2014-05-23 11:04:02 UTC
If anything, it actually crashes faster with the 98997 patch (didn't get a chance to get into HL2, it crashed within a minute once X was started).
Comment 26 org.freedesktop 2014-05-24 12:16:02 UTC
More data:

Finally managed to capture this with netconsole:

[  379.363208] radeon 0000:01:00.0: ring 0 stalled for more than 87623msec
[  379.363227] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000002733 last fence id 0x0000000000002732 on ring 0)
[  379.363237] radeon 0000:01:00.0: failed to get a new IB (-35)
[  379.363246] [drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
[  380.192295] radeon 0000:01:00.0: Saved 8749 dwords of commands on ring 0.
[  380.218544] radeon 0000:01:00.0: GPU softreset: 0x0000034D
[  380.218550] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA3503028
[  380.218554] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x2D000006
[  380.218557] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x28000006
[  380.218561] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20020FC0
[  380.218582] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[  380.218586] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  380.218590] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[  380.218593] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00408006
[  380.218597] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x84438647
[  380.218600] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483146
[  380.218604] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  380.218607] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  380.218611] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[  381.041852] radeon 0000:01:00.0: Wait for MC idle timedout !
[  381.041858] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[  381.041913] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00120500
[  381.043071] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[  381.043098] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[  381.043102] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[  381.043105] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000EC0
[  381.043127] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[  381.043130] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  381.043134] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  381.043137] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  381.043140] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[  381.043144] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  381.043147] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  381.043201] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[  386.042431] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[  386.042439] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing C50E (len 254, WS 0, PS 4) @ 0xC538
[  386.042443] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing BB6E (len 78, WS 12, PS 8) @ 0xBBA7
[  386.048397] [drm] probing gen 2 caps for device 1022:9603 = 300d02/0
[  386.048403] [drm] PCIE gen 2 link speeds already enabled
[  386.383371] radeon 0000:01:00.0: Wait for MC idle timedout !
[  386.546075] radeon 0000:01:00.0: Wait for MC idle timedout !

Out of around ten crashes, only one of them actually logged a message. I have no idea if any of the above is helpful.
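For anyone trying to reproduce this kind of capture, here is a minimal sketch of a netconsole setup. It assumes the parameter format from the kernel's netconsole documentation (src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac). The helper name, ports, addresses, and interface are all placeholders; the setup actually used for the log above is not recorded in this report.

```shell
#!/bin/sh
# Build a netconsole= parameter string.
# Args: src_port src_ip dev tgt_port tgt_ip [tgt_mac]
# The target MAC is optional; it is left empty here when not given.
# "netconsole_param" is a hypothetical helper name.
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "${6:-}"
}

# Usage, as root on the crashing machine (all values are placeholders):
#   modprobe netconsole \
#       netconsole="$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20)"
# On the receiving machine, listen on the target UDP port (6666 here) with a
# UDP listener such as netcat; the exact flags differ between netcat variants.
```

Because the messages leave the machine as they are printed, they can survive a crash that never reaches the local disk. This is a best-effort channel, though, which fits with only one crash in ten producing any output.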
Comment 27 org.freedesktop 2014-06-01 14:18:48 UTC
New info:

The machine appears to be unstable even if dpm is disabled. I've no idea why the machine didn't crash last time I tried without dpm. Using the profile based switching and setting:

  # echo profile > /sys/class/drm/card0/device/power_method
  # echo high > /sys/class/drm/card0/device/power_profile

... causes the machine to become unstable within minutes (the screen flickers, and messages similar to those in my previous comment appear in dmesg). Sometimes the screen will clear for a minute or so, but then it starts flickering again and eventually the machine reboots.
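To confirm which settings are in effect when the instability starts, the same sysfs files can be read back. A small sketch, with a hypothetical helper name and the card0 path from the commands above as the default:

```shell
#!/bin/sh
# Print the current power_method and power_profile for a radeon device
# directory (defaults to the card0 path used above). Files that are not
# readable are reported as "unavailable".
show_power_state() {
    dev=${1:-/sys/class/drm/card0/device}
    for f in power_method power_profile; do
        if [ -r "$dev/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dev/$f")"
        else
            printf '%s: unavailable\n' "$f"
        fi
    done
}

# Usage:
#   show_power_state
```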
Comment 28 org.freedesktop 2014-06-01 14:39:36 UTC
I've just installed the catalyst drivers.

They also crash.

I'm beginning to suspect faulty hardware.
Comment 29 Martin Peres 2019-11-19 07:46:51 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-ati/issues/103.
