I'm experiencing a strange crash that causes the machine to reboot after 10-20 minutes of use in Xorg. The crash does not appear to require any use of OpenGL (simply using a browser for that long on an Xfce desktop will trigger it). There are no useful error messages logged in dmesg before the crash happens. When the crash occurs, the screen is quickly replaced with garbage, flickers, and then the machine reboots a few seconds later (presumably there is a panic message on the console, but I cannot see it with X in the way). The card is an R7 265. Specifically, this one: http://www.scan.co.uk/products/2gb-xfx-radeon-r7-265-5600mhz-gddr5-gpu-900mhz-boost-925mhz-1280-streams-dport-dvi-hdmi The card is only a week old; before that, the machine ran a 6570 for a few years and was completely stable. I've performed the usual troubleshooting: I have run memtest to eliminate problems with RAM, and have kept a careful eye on the temperature sensors to eliminate thermal issues (the machine is well ventilated and in a well ventilated area). I can't completely eliminate the power supply yet, but I am reasonably confident that it's not broken (it's a good quality Seasonic PSU). I'll be obtaining another power supply today to rule that out. The operating system is Arch Linux on x86-64. Please see the attached dmesg and Xorg logs. I'm not sure what other information I can provide, or whether there's a reliable means of getting panic information from the machine. Please advise!
Created attachment 99268 [details] Xorg log
Created attachment 99269 [details] Dmesg The full dmesg as it appears right up until the crash.
Created attachment 99270 [details] Xorg log Actual X log (incorrectly uploaded dmesg twice).
Possibly a duplicate of bug 75992. Can you try a newer kernel with the patches and new firmware referenced on that bug?
Apologies for the delay, I don't currently have access to the machine. I will have access tomorrow and will try 3.15-rc5 as you suggest.
I should mention: further testing on the day I reported this showed minor video corruption in the BIOS screens (random pixels appeared to be stuck). I'm trying to rule out faulty hardware, so I have to ask: does the firmware uploaded by the driver persist after a reboot? If the firmware doesn't persist, then presumably video corruption in the BIOS would indicate faulty hardware.
(In reply to comment #6)
> I should mention: Further testing on the day I reported indicated minor
> video corruption in the bios (random pixels appeared to be stuck). I'm
> trying to rule out faulty hardware, so I have to ask: Does the firmware
> uploaded by the driver persist after a reboot? If the firmware doesn't
> persist, then presumably video corruption in the bios would indicate faulty
> hardware.
It may if it's a warm reboot.
Hello. 3.15.0-rc5 is built. I notice from my own dmesg that the kernel says "loaded PITCAIRN firmware". To be clear, does this mean that I copy all of the PITCAIRN* files from your firmware directory into /usr/lib/firmware/radeon? (Backing up the originals, of course!).
(In reply to comment #8)
> Hello. 3.15.0-rc5 is built. I notice from my own dmesg that the kernel says
> "loaded PITCAIRN firmware". To be clear, does this mean that I copy all of
> the PITCAIRN* files from your firmware directory into
> /usr/lib/firmware/radeon?
> (Backing up the originals, of course!).
You only need to add PITCAIRN_mc2.bin. The rest are the same as what you have. Make sure the firmware is included in your initrd if you are using one.
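One way to check the initrd advice above is to list the image contents and look for the firmware file. The helper below is only a sketch: `listing_has_firmware` is a hypothetical name, and it simply searches whatever file listing it receives on stdin.

```shell
# Succeeds if a file listing on stdin contains the named radeon firmware file.
# Usage: <listing command> | listing_has_firmware PITCAIRN_mc2.bin
listing_has_firmware() {
    # -F: match the path fragment literally; -q: no output, exit status only.
    grep -q -F "radeon/$1"
}
```

On Arch with mkinitcpio, the listing could come from `lsinitcpio /boot/initramfs-linux.img | listing_has_firmware PITCAIRN_mc2.bin && echo present`. The image path is an assumption; a self-built kernel may use a different image name.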
Right, thanks (and thanks for the impressively quick response!)
OK, the machine still crashes, but the failure mode seems to have changed. Previously, the screen would flicker, become garbage, and then the machine would reboot. With the new kernel and firmware, the machine ran for about an hour in X without crashing. I then tried to provoke a crash by doing some reasonably intensive OpenGL work, namely running Half-Life 2. Please see "dmesg-new.txt" and "xorg-new.txt" for the logs from the new kernel. The game ran for a couple of minutes with the default settings. I then adjusted the settings one at a time (increasing texture quality to max, increasing shadow quality to max, etc). The screen went black and the machine instantly rebooted. I tried the same thing again; this time the screen flickered, became garbage, and went black, but the machine didn't reboot. I logged in over ssh and checked dmesg. There were no *useful* messages beyond the status of the network card changing, as can be seen in the last dmesg. However, the kernel did append one line to dmesg: "mce: [Hardware error] Machine check events logged", with no more detail than that. I attempted to kill X from the ssh connection, and the machine hard rebooted when X went down. As I'm sitting here writing this at the console login prompt, the kernel has again printed "mce: [Hardware error] Machine check events logged". Please see "dmesg-aftercrash.txt" for the dmesg as it appears right now.
Created attachment 99518 [details] Dmesg with 3.15-rc5 and new firmware
Created attachment 99519 [details] Xorg log with 3.15-rc5 and new firmware
Created attachment 99520 [details] Post-crash reboot dmesg.
I've just noticed that there is a slightly more detailed "Hardware error" logged in the post-crash dmesg, starting at [10.159751]. I'm not sure if this is related, but I've not seen it before.
Does disabling dpm help? Try booting with radeon.dpm=0 on the kernel command line in grub.
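After rebooting, it is worth confirming that the parameter actually reached the kernel. The helper below is a minimal sketch that checks a command-line string for an exact, space-separated token; `cmdline_has_token` is a hypothetical name, not part of any tool.

```shell
# Succeeds if the kernel command line in $1 contains the exact token $2.
# Tokens are assumed to be separated by single spaces, as in /proc/cmdline.
cmdline_has_token() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# On the running system (not executed here):
#   cmdline_has_token "$(cat /proc/cmdline)" radeon.dpm=0 && echo "dpm disabled at boot"
```

To make the setting permanent with GRUB, the usual approach (assuming a stock GRUB 2 setup) is to add `radeon.dpm=0` to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and regenerate the config with `grub-mkconfig -o /boot/grub/grub.cfg`.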
Hm, yes! That seems to have done it. I've been unable to crash it so far...
Attached lspci -vnn, as I just noticed you asked for them in the other report.
Created attachment 99525 [details] Output of lspci -vnn
Created attachment 99526 [details] lspci -vnn (with privileges)
I'm not sure if this constitutes a resolution. I assume radeon.dpm=0 disables power management, which seems undesirable to say the least...
(In reply to comment #21)
> I'm not sure if this constitutes a resolution.
>
> I assume radeon.dpm=0 disables power management, which seems undesirable to
> say the least...
It's a workaround until we solve why dpm is not stable on your system.
Right. I'll leave it in your hands then. Let me know if there's any more information I can provide.
Does attachment 98997 help?
If anything, it actually crashes faster with the 98997 patch (didn't get a chance to get into HL2, it crashed within a minute once X was started).
More data: Finally managed to capture this with netconsole:
[ 379.363208] radeon 0000:01:00.0: ring 0 stalled for more than 87623msec
[ 379.363227] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000002733 last fence id 0x0000000000002732 on ring 0)
[ 379.363237] radeon 0000:01:00.0: failed to get a new IB (-35)
[ 379.363246] [drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
[ 380.192295] radeon 0000:01:00.0: Saved 8749 dwords of commands on ring 0.
[ 380.218544] radeon 0000:01:00.0: GPU softreset: 0x0000034D
[ 380.218550] radeon 0000:01:00.0: GRBM_STATUS = 0xA3503028
[ 380.218554] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x2D000006
[ 380.218557] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x28000006
[ 380.218561] radeon 0000:01:00.0: SRBM_STATUS = 0x20020FC0
[ 380.218582] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
[ 380.218586] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 380.218590] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000
[ 380.218593] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00408006
[ 380.218597] radeon 0000:01:00.0: R_008680_CP_STAT = 0x84438647
[ 380.218600] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44483146
[ 380.218604] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57
[ 380.218607] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 380.218611] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 381.041852] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 381.041858] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[ 381.041913] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00120500
[ 381.043071] radeon 0000:01:00.0: GRBM_STATUS = 0x00003028
[ 381.043098] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006
[ 381.043102] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006
[ 381.043105] radeon 0000:01:00.0: SRBM_STATUS = 0x20000EC0
[ 381.043127] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
[ 381.043130] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 381.043134] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[ 381.043137] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
[ 381.043140] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000
[ 381.043144] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 381.043147] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57
[ 381.043201] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 386.042431] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[ 386.042439] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing C50E (len 254, WS 0, PS 4) @ 0xC538
[ 386.042443] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing BB6E (len 78, WS 12, PS 8) @ 0xBBA7
[ 386.048397] [drm] probing gen 2 caps for device 1022:9603 = 300d02/0
[ 386.048403] [drm] PCIE gen 2 link speeds already enabled
[ 386.383371] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 386.546075] radeon 0000:01:00.0: Wait for MC idle timedout !
Out of around ten crashes, only one of them actually logged a message. I have no idea if any of the above is helpful.
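Since only about one crash in ten logged anything, it may help to filter each saved capture for the handful of messages that mark the hang itself. The following is a rough sketch: `gpu_hang_lines` is a hypothetical helper, and its pattern list is taken only from the messages quoted in the capture above, so it is not an exhaustive list of radeon errors.

```shell
# Print only the kernel log lines that indicate a GPU hang or a failed
# recovery. Reads a kernel log on stdin, one message per line.
gpu_hang_lines() {
    grep -E 'GPU lockup|ring [0-9]+ stalled|GPU softreset|atombios stuck|Wait for MC idle timedout'
}

# Example against a saved netconsole capture (file name is an assumption):
#   gpu_hang_lines < netconsole.log
```

Register dumps such as GRBM_STATUS are deliberately left out of the filter; they are easier to read in the full log once the hang has been located.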
New info: The machine appears to be unstable even if dpm is disabled. I've no idea why the machine didn't crash the last time I tried without dpm. Using profile-based switching and setting:
# echo profile > /sys/class/drm/card0/device/power_method
# echo high > /sys/class/drm/card0/device/power_profile
... causes the machine to become unstable within minutes (the screen flickers, and messages similar to those posted in my last comment appear in dmesg). Sometimes the screen will clear for a minute or so, but then it starts flickering again and eventually the machine reboots.
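When comparing runs like these, it can be useful to record which power-management settings were actually in effect before each crash. The sketch below prints the relevant sysfs files from a device directory. `show_power_state` is a hypothetical helper, and which of these files exist depends on whether dpm is enabled, so missing files are skipped.

```shell
# Print the radeon power-management settings found in a device directory.
# Usage: show_power_state /sys/class/drm/card0/device
show_power_state() {
    for f in power_method power_profile power_dpm_state; do
        # Skip files the driver does not expose in the current mode.
        if [ -r "$1/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$1/$f")"
        fi
    done
}
```

Taking the directory as an argument keeps the helper usable for a second card (for example card1) without changes.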
I've just installed the Catalyst drivers. They also crash. I'm beginning to suspect faulty hardware.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-ati/issues/103.