Bugzilla – Bug 75992
Display freezes & corruption with an r7 260x on 3.14-rc6
Last modified: 2014-09-26 04:22:57 UTC
Created attachment 95520 [details]
log of boot
I recently added an R7 260X to my system. While the card works with 3.13, it's supposed to work much better with 3.14-rc. This is not the case. My system is unstable without radeon.dpm=0, which was the default in 3.13.
Linux 3.14-rc6 with an up-to-date Arch, stable X and mesa-git (10.2). Mesa 10.1 and 10.0 also show very similar problems.
When X started I did notice some corruption. There are sets of two rectangles, each about 2 or 3 mm high and 25 mm or so wide, with the second about a cm below the first. This often occurs in chromium, especially when scrolling. Running the unigine-sanctuary or unigine-tropics demo/benchmark programs also produces the above problems and eventually stalls.
Created attachment 95521 [details]
xorg log of bug
Created attachment 95523 [details]
simple example of corruption (there can be many more rectangles)
This problem is still occurring with rc7 too. Booted with dpm, the system is NOT usable. X stalls and C+A+D is needed to get the box back.
Created attachment 96290 [details] [review]
Does the attached patch help? It disables most dpm features. If so can you narrow down which specific features are problematic?
I'm building an rc8 with this applied now. Any advice on what to enable and in what order?
(In reply to comment #5)
> I'm building a rc8 with this applied now. Any advise on what to enable and
> in what order?
It doesn't really matter in what order you test. Just start at the top and enable features one by one until you hit a problem. If you hit a problem, leave the problematic feature disabled and check the rest. Once you've identified which feature(s) are problematic on your board, let me know.
The only setting I cannot revert in the patch is:
pi->mclk_dpm_key_disabled = 1;
To avoid hundreds of: radeon 0000:01:00.0: GPU fault detected: 146 0x0aa20804
and 147 type errors I had to add: export R600_DEBUG=nohyperz in the X startup
as documented in: https://bugzilla.kernel.org/show_bug.cgi?id=66981
Also found some funny stuff to do with primary and secondary graphic cards.
Booting with the builtin hd4600 as the primary, with no radeon-specific options on the kernel command line, does NOT enable dpm on the secondary radeon card, and trying to use radeon.dpm=1 causes the X startup to stall. Note that in all cases an xorg.conf file selects the radeon card to be used.
Created attachment 96460 [details] [review]
(In reply to comment #7)
> The only setting I cannot revert in the patch is:
> pi->mclk_dpm_key_disabled = 1;
Thanks for narrowing it down. Does the attached patch help? Remove all previous patches when testing this.
With the testing patch removed and the possible fix added I am not seeing corruptions; however, the tests are running at 6-7fps vs 30... It is progress though.
(In reply to comment #9)
> With thest testing patch removed and the possible fix added I am not seeing
> corruptions; however the tests are running a 6-7fps vs 30... It is progress
Are you seeing performance issues with pi->mclk_dpm_key_disabled = 1; as well?
Created attachment 96532 [details] [review]
Does this patch fix the stability and performance issues?
disable_mclk_switching = true; gives 30fps and corruption (and eventually a gpu stall)
disable_mclk_switching = true; and pi->mclk_dpm_key_disabled = 1; no corruptions but speed is slow
with mclk_switching = true; and the patch disabling WREG32 setups, I get 30fps & corruptions and gpu stalls/crashes
Please attach a copy of your vbios:
(use lspci to get the bus id)
cd /sys/bus/pci/devices/<pci bus id>
echo 1 > rom
cat rom > /tmp/vbios.rom
echo 0 > rom
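The steps above can be wrapped in a small shell function. This is only a sketch: the function name dump_vbios and its two arguments are made up for illustration, and it assumes it is run as root against the sysfs directory of the card (for example /sys/bus/pci/devices/0000:01:00.0, the bus id that shows up in the logs in this report).

```shell
# Hypothetical helper: copy the video BIOS of one PCI device to a file.
# $1 = sysfs device directory, $2 = output file.
dump_vbios() {
    dev_dir=$1
    out=$2
    if [ ! -e "$dev_dir/rom" ]; then
        echo "no rom file in $dev_dir" >&2
        return 1
    fi
    # Writing 1 makes the ROM readable through the rom file.
    echo 1 > "$dev_dir/rom" || return 1
    cat "$dev_dir/rom" > "$out"
    status=$?
    # Always switch ROM access off again, even if the copy failed.
    echo 0 > "$dev_dir/rom"
    return "$status"
}
```

Usage would be something like: dump_vbios /sys/bus/pci/devices/0000:01:00.0 /tmp/vbios.rom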
Created attachment 96568 [details]
pcie 0000:01:00.0 rom
To verify the HW is okay, I installed the catalyst drivers (14.3) and am able to run the same set of tests with no corruption at about 85fps vs 30fps with corruption or 6fps without corruption.
The r7 260x is a fairly new card. That it works at all with the opensource driver stack is GREAT. As demonstrated by the fglrx drivers, there is lots of room for improvements though <GRIN>.
Hi, I can confirm this regression on kernel 3.14 using the FOSS driver and an r7 260x, of course.
I experience artifacts, constant hangs and, as a result of the hangs, "colorfully" filled screens that render the system unusable.
Can I help somehow? I'm not a developer, just to warn you...
Ah, and I also get those rectangles. So it's definitely the same bug!
Just to add my 2c to this issue: if I set radeon.dpm=0 on the command line, the card defaults to the "profile" power method.
If I try to:
echo "high" > /sys/class/drm/card0/device/power_profile
the screen goes black and I have to shut down using the power button (Ctrl+Alt+Del doesn't even work).
This happens both while on the desktop or in another vt.
Happens to me too. Only radeon.dpm=0 stabilizes the system.
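For anyone comparing results, it may help to confirm which radeon.dpm value a given boot actually used by reading it back from the kernel command line. A minimal sketch, assuming a POSIX shell; the function name radeon_dpm_setting is hypothetical, and it simply reports the first radeon.dpm= argument it finds, or "unset" if the option was not given:

```shell
# Hypothetical helper: print the radeon.dpm value from a kernel command line.
# With no argument it reads the running kernel's /proc/cmdline.
radeon_dpm_setting() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for arg in $cmdline; do
        case $arg in
            radeon.dpm=*)
                # Strip the option name and print only the value.
                printf '%s\n' "${arg#radeon.dpm=}"
                return 0
                ;;
        esac
    done
    printf 'unset\n'
}
```

Running radeon_dpm_setting with no argument checks the current boot; "unset" only means the option was absent, so the driver default for that kernel applies.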
Can you post the pci ids and subsystem ids for your cards? e.g., lspci -vnn
Created attachment 97173 [details]
Here you go.
Just to see what happens with the latest drm fixes I built linux from git
(last commit 4ba85265790ba3681deeaf73f018c0eb829a7341). I am still seeing corruptions and eventually get stalls. Lots of
Apr 10 14:17:11 localhost kernel: [ 2390.829175] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
Apr 10 14:17:11 localhost kernel: [ 2390.829178] radeon 0000:01:00.0: GPU fault detected: 147 0x049a4408
Apr 10 14:17:11 localhost kernel: [ 2390.829179] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 10 14:17:11 localhost kernel: [ 2390.829180] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
with a few like (maybe one for every 100 of the one above):
Apr 10 14:17:11 localhost kernel: [ 2390.828639] VM fault (0x08, vmid 13) at page 0, read from 'TC3' (0x54433300) (68)
Apr 10 14:17:11 localhost kernel: [ 2390.828642] radeon 0000:01:00.0: GPU fault detected: 147 0x049a0408
Apr 10 14:17:11 localhost kernel: [ 2390.828643] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 10 14:17:11 localhost kernel: [ 2390.828644] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x1A008008
Apr 10 14:17:11 localhost kernel: [ 2390.805747] VM fault (0x08, vmid 13) at page 31524, read from 'TC2' (0x54433200) (72)
Apr 10 14:17:11 localhost kernel: [ 2390.805749] radeon 0000:01:00.0: GPU fault detected: 147 0x049a4808
Apr 10 14:17:11 localhost kernel: [ 2390.805749] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 10 14:17:11 localhost kernel: [ 2390.805750] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr 10 13:42:44 localhost kernel: [ 321.759360] VM fault (0x04, vmid 1) at page 29021, read from 'TC3' (0x54433300) (68)
Apr 10 13:42:44 localhost kernel: [ 321.759362] radeon 0000:01:00.0: GPU fault detected: 146 0x0ba20404
Apr 10 13:42:44 localhost kernel: [ 321.759363] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 10 13:42:44 localhost kernel: [ 321.759363] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
interposed between the much more common message above.
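To put rough numbers on how often each fault type shows up, the "GPU fault detected" lines can be tallied by their first number (146 or 147 in the logs above). This is just a sketch; the function name count_gpu_faults is made up, and it only assumes the message format shown in this report:

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<fault number> <count>" for every "GPU fault detected" line.
count_gpu_faults() {
    sed -n 's/.*GPU fault detected: \([0-9][0-9]*\) .*/\1/p' |
        sort -n |
        uniq -c |
        awk '{ print $2, $1 }'
}
```

For example: dmesg | count_gpu_faults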
These errors happen with and without hyperz enabled
( via the following in .xinitrc
export R600_DEBUG=nohyperz )
This is with ddx, glamor and mesa built from today's git (13:30 or so EDT)
Once I can get xorg rc to build with builtin glamor I'll try with it too.
On the positive side I was able to get a few benchmarks to run to completion and they are up from 30fps on .14 to 40fps with .15-git.
The fglrx drivers are faster (80fps) but eventually crash after a day or so which is not unexpected with the (beta) versions.
There are now four of us reporting this issue in .14
Created attachment 97175 [details] [review]
disable voltage control
Does this patch help?
Created attachment 97176 [details] [review]
Another patch to try.
*** Bug 77281 has been marked as a duplicate of this bug. ***
Created attachment 97191 [details]
As my bug report (77281) is a duplicate of this bug, I will help you to figure out the issue. I have the same issue, as you may know.
Here is my lspci.
I hope you also got some information from my bug report. Please ask me as much as you want to. I really want to fix this.
"echo "high" > /sys/class/drm/card0/device/power_profile" crashes me too when I try to set it after "radeon.dpm=0".
Because the default one is at a really low level. No OpenGL animation runs smoothly.
My whole display turned black, but the computer was still turned on. ... No reactions anymore. Hard restart required.
Created attachment 97194 [details]
lspci -vnn for me too
Neither patch fixes the issue
disable voltage control - no corruptions, 6.5fps
limit mclk - lots of corruptions (not just the normal one or two) and an almost
immediate gpu stall/crash
Pressed enter too quickly. Both patches were tested against linux-git as of about 5pm EDT.
Created attachment 97196 [details]
lspci -vnn (just the graphics card details posted)
(In reply to comment #32)
> limit mclk - lots of corruptions (not just the normal one or two) and an
> immediate gpu stall/crash
Can you try adjusting the mclk value in that patch down? E.g., change 157500 to:
I've tried 80000 and 40000 with no luck. Lots of corruptions and the gpu crashes/stall fairly quickly. More tomorrow.
Does it matter when the mclk is changed?
What about setting it to the max and disabling any other changes?
(In reply to comment #36)
> I've tried 80000 and 40000 with no luck. Lots of corruptions and the gpu
> crashes/stall fairly quickly. More tomorrow.
> Does when the mclk is changed have any importance?
> What about setting it to the max and disabling any other changes?
That is what attachment 96532 [details] [review] did and it didn't work.
Created attachment 97208 [details] [review]
try newer mc ucode
Try this patch in conjunction with the updated mc firmware here:
(In reply to comment #38)
> Created attachment 97208 [details] [review]
> try newer mc ucode
> Try this patch in conjunction with the updated mc firmware here:
Make sure the new ucode gets used. You may have to update your initrd, etc.
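One way to check which MC firmware the driver picked up is to look in dmesg for the line the driver prints while loading microcode, such as "radeon/BONAIRE_mc.bin: 31464 bytes". The following is a sketch only: the function name loaded_mc_firmware is hypothetical, and it assumes the log format seen on the kernels discussed in this report.

```shell
# Hypothetical helper: read kernel log text on stdin and print the name of
# the last MC firmware file (..._mc.bin or ..._mc2.bin) the driver reported.
loaded_mc_firmware() {
    sed -n 's/.*radeon\/\([A-Za-z0-9]*_mc2\{0,1\}\.bin\):.*/\1/p' | tail -n 1
}
```

For example: dmesg | loaded_mc_firmware. If it still prints the old file name after installing new firmware, the initrd is probably carrying the old copy and needs to be rebuilt.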
Alex, I am using Archlinux and I don't know if my firmware is already up to date or not.
When I look at http://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/log/ , the last firmware updates came out in March. And that's exactly the version I have currently installed.
So is the firmware you have posted not released yet? Because I get two different MD5 sums for your firmware binary and mine.
Also, what does "Make sure the new ucode gets used" mean exactly? Could you explain that a bit please? =)
Well, I just updated the firmware (without the patch), rebuilt my initramfs and reactivated dpm, and it seems like the flickering and the black holes on my display are gone.
I will test it with some OpenGL applications and see. :)
It works!!! Hahahaha
I am sorry but I am just happy!
With the new Firmware, all my issues are gone! :D
I did not patch the kernel; I don't know if I should. But anyway, the new firmware works perfectly for me. Can you push it upstream please so the distros can get it?
Thank you Alex!
If you need further help, just contact me! :D
Created attachment 97219 [details]
Error in dmesg
There are still errors about switching power profile in dmesg
My above comment was made after using the new firmware posted above
I rebuilt linux-git with the newer mc ucode patch and manually replaced the BONAIRE_mc.bin firmware file and rebuilt my initrd.
unigine tropics completes at 33fps (fullscreen 1920x1200)
unigine sanctuary completes at 45fps (fullscreen 1920x1200)
tropics is the test that usually killed things quickly. It ran at about 30fps when it worked at all on .14 and about 40fps on .15 with old firmware. fglrx manages about 80fps.
That being said the new firmware is not as fluid as the older. There are times that there are hesitations when the camera is panning which I did not notice previously. However my tests do run without stalling or corruption.
Would you like me to test with and without the patch on .14?
You can add my tested-by to the 'newer mc ucode patch' for linux-git
'Tested-by: Ed Tomlinson <firstname.lastname@example.org>'
Tests were run with hyperz disabled as per my comments above. With hyperz enabled, with the new firmware, I am NOT seeing the gpu faults shown above. Mesa 10.2 and DDX are from yesterday's git.
I saw you pushed those fixes upstream, Alex, that's good to see!
But will you also push your new firmware binaries upstream?
That's the main fix.
Thank you for your fixes!
(In reply to comment #47)
> I saw you pushed those fixes upstream, Alex, thats good to see!
> But do you also push your new Firmware binarys upstream?
New binaries are here:
They will make their way into the Linux firmware tree eventually.
But without the new firmware, the bug is not really fixed, is it?
We need the new firmware for things to run smoothly.
Egon, I would say this bug is dead. It's not smooth here but that could easily be the new kernel - it's not reached rc1 yet... IMHO if there is too much jitter we should open a new bug.
(In reply to comment #50)
> Egon I would say this bug is dead. Its not smooth here but that could
> easily be the new kernel - its not reached rc1 yet... IMHO IF there is too
> much jitter we should open a new bug.
What do you mean, I thought the bug is fixed now, and the fix will be included in future releases..
"Eventually" means that it *will* reach the kernel firmware. Since you are probably German (Egon is a German name), you might have misunderstood it for "maybe". "Eventually" has got a completely different meaning in English, see http://www.dict.cc/?s=eventually
Another data point for everyone. The unigine tropics benchmark completes at 41fps (20% faster), without problems, using the 184.108.40.2062 X (1.66-rc2) build with the new firmware, kernel 3.15-rc1 + patch, and three fixes from: https://bugs.freedesktop.org/show_bug.cgi?id=64297 with llvm 3.4 & mesa-git.
So in what kernel/-firmware release will it be included?
(In reply to comment #53)
> So in what kernel/-firmware release will it be included?
Regression on 3.14.3. Screen corruption is back. Using Radeon microcode from 2014-04-30.
dmesg is full of messages like this:
[ 658.696174] radeon 0000:01:00.0: GPU fault detected: 147 0x0ee20808
[ 658.696175] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 658.696177] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 658.696178] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
(In reply to comment #55)
Without commit  new microcode (BONAIRE_mc2.bin) is not loaded. This commit is currently included only in >=3.15-rc2
Ok, tried 3.15_rc5 and I get no corruption. But, booting took a long time (calibrating clocksource TSC) and I got this message:
radeon 0000:01:00.0: Direct firmware load failed with error -2
Here's the dmesg | grep drm:
[ 0.703606] [drm] Initialized drm 1.1.0 20060810
[ 0.703728] [drm] radeon kernel modesetting enabled.
[ 0.704231] [drm] initializing kernel modesetting (BONAIRE 0x1002:0x6658 0x174B:0xE253).
[ 0.704390] [drm] register mmio base: 0xFE9C0000
[ 0.704477] [drm] register mmio size: 262144
[ 0.704565] [drm] doorbell mmio base: 0xCF800000
[ 0.704649] [drm] doorbell mmio size: 8388608
[ 0.707062] [drm] Detected VRAM RAM=2048M, BAR=256M
[ 0.707148] [drm] RAM width 128bits DDR
[ 0.707658] [drm] radeon: 2048M of VRAM memory ready
[ 0.707744] [drm] radeon: 1024M of GTT memory ready.
[ 0.707846] [drm] Loading BONAIRE Microcode
[ 60.691462] [drm] radeon/BONAIRE_mc.bin: 31464 bytes
[ 60.691551] [drm] Internal thermal controller with fan control
[ 60.691679] [drm] probing gen 2 caps for device 1002:5978 = 300d02/0
[ 60.699424] [drm] radeon: dpm initialized
[ 120.799122] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 120.801322] [drm] probing gen 2 caps for device 1002:5978 = 300d02/0
[ 120.801413] [drm] PCIE gen 2 link speeds already enabled
[ 120.810251] [drm] PCIE GART of 1024M enabled (table at 0x0000000000277000).
[ 120.811879] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 120.811968] [drm] Driver supports precise vblank timestamp query.
[ 120.812206] [drm] radeon: irq initialized.
[ 120.814520] [drm] ring test on 0 succeeded in 3 usecs
[ 120.814689] [drm] ring test on 1 succeeded in 3 usecs
[ 120.814790] [drm] ring test on 2 succeeded in 3 usecs
[ 120.815030] [drm] ring test on 3 succeeded in 2 usecs
[ 120.815127] [drm] ring test on 4 succeeded in 2 usecs
[ 120.871242] [drm] ring test on 5 succeeded in 2 usecs
[ 120.891334] [drm] UVD initialized successfully.
[ 120.891718] [drm] ib test on ring 0 succeeded in 0 usecs
[ 120.891961] [drm] ib test on ring 1 succeeded in 0 usecs
[ 120.892205] [drm] ib test on ring 2 succeeded in 0 usecs
[ 120.892467] [drm] ib test on ring 3 succeeded in 0 usecs
[ 120.892719] [drm] ib test on ring 4 succeeded in 0 usecs
[ 120.913684] [drm] ib test on ring 5 succeeded
[ 120.934151] [drm] Radeon Display Connectors
[ 120.934232] [drm] Connector 0:
[ 120.934318] [drm] DP-1
[ 120.934402] [drm] HPD2
[ 120.934488] [drm] DDC: 0x6530 0x6530 0x6534 0x6534 0x6538 0x6538 0x653c 0x653c
[ 120.934615] [drm] Encoders:
[ 120.934700] [drm] DFP1: INTERNAL_UNIPHY2
[ 120.934786] [drm] Connector 1:
[ 120.934864] [drm] HDMI-A-1
[ 120.934948] [drm] HPD3
[ 120.935033] [drm] DDC: 0x6550 0x6550 0x6554 0x6554 0x6558 0x6558 0x655c 0x655c
[ 120.935152] [drm] Encoders:
[ 120.935230] [drm] DFP2: INTERNAL_UNIPHY2
[ 120.935315] [drm] Connector 2:
[ 120.935394] [drm] DVI-D-1
[ 120.935479] [drm] HPD1
[ 120.935563] [drm] DDC: 0x6560 0x6560 0x6564 0x6564 0x6568 0x6568 0x656c 0x656c
[ 120.935684] [drm] Encoders:
[ 120.935763] [drm] DFP3: INTERNAL_UNIPHY1
[ 120.935850] [drm] Connector 3:
[ 120.935928] [drm] DVI-I-1
[ 120.936006] [drm] HPD6
[ 120.936085] [drm] DDC: 0x6580 0x6580 0x6584 0x6584 0x6588 0x6588 0x658c 0x658c
[ 120.936211] [drm] Encoders:
[ 120.936295] [drm] DFP4: INTERNAL_UNIPHY
[ 120.936381] [drm] CRT1: INTERNAL_KLDSCP_DAC1
[ 120.990787] [drm] fb mappable at 0xD047A000
[ 120.990873] [drm] vram apper at 0xD0000000
[ 120.990952] [drm] size 7299072
[ 120.991037] [drm] fb depth is 24
[ 120.991116] [drm] pitch is 6912
[ 120.991261] fbcon: radeondrmfb (fb0) is primary device
[ 121.025398] radeon 0000:01:00.0: fb0: radeondrmfb frame buffer device
[ 121.026375] [drm] Initialized radeon 2.38.0 20080528 for 0000:01:00.0 on minor
Just saw that I'm loading the wrong firmware (_mc, not _mc2). Sorry. I'll try this now.
Well, same thing with _mc2.bin, except this line (size changed):
radeon/BONAIRE_mc2.bin: 31792 bytes
I still get the Direct firmware load error and calibrating clocksource TSC takes a long time too. Usually I get this slowdown when the drm firmware cannot be loaded correctly.
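Error -2 is -ENOENT, meaning the firmware loader could not find a file it was asked for. A quick check is to see whether the expected files are present in the firmware directory. The sketch below assumes the usual /lib/firmware/radeon location; the function name missing_firmware is made up for illustration. If the driver is loaded from the initrd (or built in), the files also need to be inside the initrd, which this check does not cover.

```shell
# Hypothetical helper: print each firmware file name that is NOT present.
# $1 = firmware directory, remaining arguments = file names to look for.
missing_firmware() {
    fw_dir=$1
    shift
    for name in "$@"; do
        [ -f "$fw_dir/$name" ] || printf '%s\n' "$name"
    done
}
```

For example: missing_firmware /lib/firmware/radeon BONAIRE_mc.bin BONAIRE_mc2.bin. No output means both files are there.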