This is with the newest mesa from git and kernel 3.8-rc1 (plus this patch: http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.8&id=668bbc81baf0f34df832d8aca5c7d5e19a493c68 ).

The screen first freezes (the mouse can still be moved, but the keyboard does not respond, not even to Magic SysRq), then the monitor goes off (standby) and comes back on with only garbage on the screen.

I'm not sure whether this has anything to do with it (it should get fixed anyway), but dmesg gets spammed with this:

[ 533.928472] radeon 0000:03:00.0: GPU fault detected: 146 0x00335514
[ 533.928477] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 533.928483] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

The value in the first line isn't always the same, for example:

[ 533.928374] radeon 0000:03:00.0: GPU fault detected: 146 0x0033ed14
[ 533.928379] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 533.928385] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Created attachment 72006 [details] Full dmesg output
Is this a regression? Does it happen with older versions of mesa or kernel? If it's a regression can you identify which component (mesa or kernel) and bisect?
May also be related to bug 58354.
!Is this a regression? Does it happen with older versions of mesa or kernel?! Not that I know about. "May also be related to bug 58354." Do you have the path noted there ("drm/radeon: use DMA engine for VM page table updates on cayman/TN") ? I would loce to try to revert this patch and test it, but I'm unable to google it.
I should really read before I click save, sorry. Here again:

"Is this a regression? Does it happen with older versions of mesa or kernel?"
Not that I know about.

"May also be related to bug 58354."
Do you have a link to the patch noted there ("drm/radeon: use DMA engine for VM page table updates on cayman/TN")? I would love to try to revert this patch and test it, but I'm unable to google it.
Created attachment 72041 [details]
New dmesg output

Never mind, I found the patch here: http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-next&id=33e5467871b3007c4e6deea95b2cac38a55ff9f5

I reverted it and have had no crash so far (but as the crashes are random they might still occur). On the other hand, the dmesg messages are still there. I'm uploading the new output just in case it is needed.

While I was writing this, Minecraft (which I used to trigger the crashes) crashed (right before and shortly after the crash the mouse was in slow motion and I thought it would crash right away).
Crashes are still there after reverting "drm/radeon: use DMA engine for VM page table updates on cayman/TN"
But this crash was different: the image froze, but the monitor neither went into standby nor came back with garbage.
Bisected mesa. This is a mesa bug caused by http://cgit.freedesktop.org/mesa/mesa/commit/?id=6532eb17baff6e61b427f29e076883f8941ae664 Can anybody move this to the right place or do I have to re-create the report (and if so: Where) ?
I was too fast with this. While the error messages in dmesg are gone, it still crashes randomly, but this time the computer just froze completely. I think this bug report in fact covers at least two bugs.
Also, the error messages aren't completely gone. I went back to mesa commit cf5632094ba0c19d570ea47025cf6da75ef8457a (mesa: Allow glReadBuffer(GL_NONE) for winsys framebuffers.) and played Minecraft for a bit. Suddenly everything slowed down and the screen started to corrupt. I looked into dmesg and the messages were back. I made a video starting after I killed Minecraft (when the corruption slowly disappeared); once all the corruption was gone, the message spam stopped again: https://www.dropbox.com/s/su1b6oaeiz028y2/out-86.ogv

I will do more bisecting, but as this is really random it may take a long time. Also, I hope my hardware hasn't been damaged by 6532eb17baff6e61b427f29e076883f8941ae664 (is this possible and if so: Is there any way to get my money back?)
I went back as far as http://cgit.freedesktop.org/mesa/mesa/commit/?id=6c99f2101fbd3edb7d5899c44ca9d984a3c0f8b6 and the bug is still there (not the crashes directly, at least I couldn't trigger them, but the error messages), so either this bug is really old or it's not a mesa bug (or, and that would be really bad, it damaged the hardware).
After going back to kernel 3.6 (3.7 not tested) I'm unable to re-trigger this bug, even after doing more of the actions that triggered it than in any test before. So I'm pretty sure this is a kernel bug! Is anybody able to help me bisect the kernel? I don't even know which tree (drm-next?).
drm-next should be fine.
Thanks for the fast reply. Just to be sure before I spend a few hours cloning for no reason:

git clone git://people.freedesktop.org/~airlied/linux
git checkout drm-next

should be a good start, right?
git checkout drm-next-3.8, I guess
or better drm-fixes-3.8 just to be sure. (it has few relevant commits on top)
There is no drm-next-3.8 nor drm-fixes-3.8 at ~airlied/linux, see: http://cgit.freedesktop.org/~airlied/linux/refs/ - That's why I asked what exactly to clone as this step will take hours.
This was a long night but I finally got it: Bad commit: http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb
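For readers unfamiliar with the procedure used above: bisecting is a binary search over an ordered commit history. The following is only a minimal Python sketch of that idea, not what `git bisect` runs internally. The commit labels and the `is_bad` callback are hypothetical, and the `flaky_aware` wrapper is an assumption about how one might cope with a bug that triggers randomly, as this one does.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in an oldest-to-newest history.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid  # the first bad commit is after mid
    return commits[hi]


def flaky_aware(test_once: Callable[[str], bool], attempts: int) -> Callable[[str], bool]:
    """Treat a commit as bad if any of several test runs shows the bug."""
    def is_bad(commit: str) -> bool:
        return any(test_once(commit) for _ in range(attempts))
    return is_bad


# Hypothetical history: the bug is introduced in "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
print(first_bad_commit(history, lambda c: history.index(c) >= 5))
```

With a randomly triggering bug, a single clean run does not prove a commit is good, so a wrong "good" verdict can send the search to the wrong commit. Repeating the test per commit reduces that risk but does not remove it.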
I'm going crazy. I just removed the bad patch from 3.8-rc1 and updated mesa to the newest git version (for the bisect I had to stay at a9048aa6e6abcbeb498ef286630be30729aebaf3 because of a patch missing in the bisected tree) and the bug is back again. I don't know how to find the root of it and I have a headache because of it. There seems to be something really wrong with memory management, but it's way over my head.
Here's my final summary: If both http://cgit.freedesktop.org/mesa/mesa/commit/?id=6532eb17baff6e61b427f29e076883f8941ae664 and http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb are missing, the bug doesn't trigger. If the first one is present, the bug is triggered extremely often, spamming dmesg. If only the second is present, the bug triggers randomly (a good way to trigger it: lots of exploding TNT in Minecraft. Just build TNT pillars and ignite them until you are at bedrock). My last hope is that some genius hacker who knows the driver has an "ah, I see the problem" moment. :(
(In reply to comment #5)
> I should really read before I click save, sorry. Here again:
>
> "Is this a regression? Does it happen with older versions of mesa or
> kernel?"
> Not that I know about.
>
> "May also be related to bug 58354."
> Do you have a link to the patch noted there ("drm/radeon: use DMA engine for
> VM page table updates on cayman/TN") ? I would love to try to revert this
> patch and test it, but I'm unable to google it.

"Is this a regression? Does it happen with older versions of mesa or kernel?"
Yes. Previous kernel 3.7 doesn't show this problem.
(In reply to comment #22)
> "Is this a regression? Does it happen with older versions of mesa or
> kernel?"
> Yes. Previous kernel 3.7 doesn't show this problem.

Can you bisect? Is it the same commit Thomas landed on or another one?
(In reply to comment #23)
> (In reply to comment #22)
> > "Is this a regression? Does it happen with older versions of mesa or
> > kernel?"
> > Yes. Previous kernel 3.7 doesn't show this problem.
>
> Can you bisect? Is it the same commit Thomas landed on or another one?

Pretty sure it is the same problem. With kernel 3.8.0-rcx, just launching Gnome Shell starts flooding my logs with:

radeon 0000:0X:00.0: GPU fault detected: 146 0x00xxxxxx
radeon 0000:0X:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
radeon 0000:0X:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same commit.

An occasional crash may come from something different, but the incessant flood is a big problem. In a single session, I end up with kernel.log and everything.log being over 52GB each. I'm also sure this message is only triggered if something is going wrong.

I'll let you know when I'm done bisecting to figure out what is triggering this flood.
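With logs growing this large, summarising the fault lines is more practical than reading them. Below is a minimal Python sketch that counts "GPU fault detected" lines per device and reported value. The regular expression is an assumption derived only from the log lines quoted in this report, and it does not try to interpret what the numbers mean.

```python
import re
from collections import Counter

# Pattern assumed from the lines quoted in this report, e.g.
# "radeon 0000:03:00.0: GPU fault detected: 146 0x00335514"
FAULT_RE = re.compile(r"radeon (\S+): GPU fault detected: (\d+) (0x[0-9a-fA-F]+)")


def count_gpu_faults(lines):
    """Count fault lines, keyed by (PCI device, hex value on the fault line)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            device, _number, value = match.groups()
            counts[(device, value)] += 1
    return counts


# Sample taken from the first comment of this report.
SAMPLE = """\
[ 533.928374] radeon 0000:03:00.0: GPU fault detected: 146 0x0033ed14
[ 533.928379] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 533.928385] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 533.928472] radeon 0000:03:00.0: GPU fault detected: 146 0x00335514
[ 533.928477] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 533.928483] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
"""

for (device, value), n in sorted(count_gpu_faults(SAMPLE.splitlines()).items()):
    print(device, value, n)
```

For a real log, the same function can be fed an open file object, which is read line by line, so the whole file never has to fit in memory.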
(In reply to comment #24)
> I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.

Maybe you should just compile
http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb (bad) and
http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=9d89d78e3a20980205966fba6345645547e59ceb (good).
It would be faster than bisecting and if you get another result than me you can still do a full bisect afterwards.
(In reply to comment #25)
> (In reply to comment #24)
> > I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.
>
> Maybe you should just compile
> http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb (bad) and
> http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=9d89d78e3a20980205966fba6345645547e59ceb (good).
> It would be faster than bisecting and if you get another result than me
> you can still do a full bisect afterwards.

That's what I'll do, it makes sense.
(In reply to comment #26)
> (In reply to comment #25)
> > (In reply to comment #24)
> > > I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.
> >
> > Maybe you should just compile
> > http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb (bad) and
> > http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=9d89d78e3a20980205966fba6345645547e59ceb (good).
> > It would be faster than bisecting and if you get another result than me
> > you can still do a full bisect afterwards.
>
> That's what I'll do, it makes sense.

It seems both are bad: crashed on logon with 9d89d and both flooded my logs.
The flood is caused by:

Commit: 4ac0533abaec2b83a7f2c675010eedd55664bc26
Author: Jerome Glisse <jglisse@redhat.com> 2012-12-13 12:08:11
Committer: Alex Deucher <alexander.deucher@amd.com> 2012-12-14 10:45:24
Parent: 9af20792124850369e764965690b99b20623dfc4 (drm/radeon: fix fence locking in the pageflip callback)
Branch: remotes/origin/master
Follows: v3.7-rc7
Precedes: v3.8-rc1

    drm/radeon: fix htile buffer size computation for command stream checker

    Fix the size computation of the htile buffer.

    Signed-off-by: Jerome Glisse <jglisse@redhat.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

However, I think this is not related to the lockups/crashes. So the bug's description actually points to two different bugs: the flood and the crashes. Should I open a separate bug for the flood of "GPU fault detected" messages?
I just created a new bug (bug 59089) for the GPU fault flood, which is not directly linked to the crashes, since the one happens without the other.
Should be fixed with this mesa commit: http://cgit.freedesktop.org/mesa/mesa/commit/?id=4332f6fc185f968e7563e748b8c949021937c935
(In reply to comment #30)
> Should be fixed with this mesa commit:
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=4332f6fc185f968e7563e748b8c949021937c935

Sadly it isn't.
You're using a Cayman card, but which model exactly?
Does a 3.8 kernel work OK if you revert mesa back to cf5632094ba0c19d570ea47025cf6da75ef8457a?

I think
r600g: rework flusing and synchronization pattern v7
http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
may be problematic on cayman.
(In reply to comment #33)
> Does a 3.8 kernel it work ok if you revert mesa back to
> cf5632094ba0c19d570ea47025cf6da75ef8457a?

(In reply to comment #12)
> I did go back till
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=6c99f2101fbd3edb7d5899c44ca9d984a3c0f8b6
> and the bug is still there

> I think
> r600g: rework flusing and synchronization pattern v7
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
> may be problematic on cayman.

I'm currently updating my kernel to 3.8-rc3, then I'll test the newest mesa and cf5632094ba0c19d570ea47025cf6da75ef8457a again.
Still there with 3.8-rc3 + mesa cf5632094ba0c19d570ea47025cf6da75ef8457a
Did you test with mesa reverted to before the following commit: http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
This patch might help: http://people.freedesktop.org/~glisse/0001-drm-radeon-exclude-system-placement-when-validating-.patch
(In reply to comment #37)
> This patch might help:

I applied it to a 3.8-rc3 kernel, and while I haven't seen the message spam so far, the GPU crashes extremely often (so often that this may be why I don't get to see the spam). Either the image freezes or the monitor goes into standby. In both cases the keyboard no longer responds (not even to Magic SysRq).
(In reply to comment #38)
> (In reply to comment #37)
> > This patch might help:
>
> I applied it to a 3.8-rc3 kernel and while I didn't see the message spam
> till now the GPU crashes extremely often (so often that this might be the
> case I'm unable to see the spam). Either the image freezes or the monitor
> goes into standby. In both cases the keyboard doesn't react anymore (not
> even SysMagRQ).

Does it do the same thing without the patch? I applied it yesterday and I haven't seen any difference.
(In reply to comment #39)
> Does it do the same thing without the patch?

It has random crashes without it, too, yes, but far less frequently. In fact I had to revert that patch to be able to use my desktop for more than 5 minutes again.
I got a crash with a BUG message. I'm sorry for the bad image quality but I had no better camera available (that's why I made that many images):

http://img571.imageshack.us/img571/5517/dsc02036ws.jpg
http://img254.imageshack.us/img254/6779/dsc02037i.jpg
http://img834.imageshack.us/img834/4889/dsc02038cz.jpg
http://img835.imageshack.us/img835/5993/dsc02039s.jpg
http://img338.imageshack.us/img338/1946/dsc02040b.jpg
http://img5.imageshack.us/img5/5683/dsc02041hc.jpg
http://img69.imageshack.us/img69/8716/dsc02042vj.jpg
http://img594.imageshack.us/img594/8600/dsc02043cnt.jpg

I also have a video, so if the images aren't enough just ask.
I updated my kernel to 3.8-rc5 and mesa to http://cgit.freedesktop.org/mesa/mesa/commit/?id=952e6e9f3b0eb179f67345f00e5a7f1dbaa7bdd5 (I can't go higher because of https://bugs.freedesktop.org/show_bug.cgi?id=60038 ) and disabled huge pages in the kernel, and now things are different. First, the message spam seems to be gone completely, and second, the GPU doesn't crash anymore. At one point the image froze, but switching to the console and back solved this. I'll see if it continues like that and later re-enable huge pages to see what happens then.
And again I was too fast with this. I started another game and the dmesg spam was there again.
And it crashed again, too. :(
Is this still an issue with the latest kernel and Mesa?
Also, does setting this environment variable help? R600_DEBUG=nohyperz
Over here, with 3.12.6 and these

$ cat /etc/environment
LIBGL_DRIVERS_PATH=/opt/xorg/lib/dri/
RADEON_VA=0
R600_DEBUG=nodma

all appears stable. (git llvm, libclc, mesa, etc)
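To try one of the R600_DEBUG values mentioned in the last two comments without editing /etc/environment, the variable can be set for a single process. This is a minimal Python sketch. The helper names are hypothetical, and the child command here is only a stand-in that echoes the variable; it would be replaced by the actual game or compositor command.

```python
import os
import subprocess
import sys


def r600_debug_env(value):
    """Return a copy of the current environment with R600_DEBUG set to value."""
    env = dict(os.environ)
    env["R600_DEBUG"] = value
    return env


def run_with_r600_debug(command, value):
    """Run command (a list of arguments) with R600_DEBUG set for that process only."""
    return subprocess.run(command, env=r600_debug_env(value), check=False)


# Stand-in child process that only prints the variable it received;
# replace it with the real program to test.
child = [sys.executable, "-c", "import os; print(os.environ['R600_DEBUG'])"]
run_with_r600_debug(child, "nohyperz")
```

Because only the child's environment is changed, the rest of the session keeps its normal settings, which makes it easier to compare runs with and without the option.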
(In reply to comment #45)
> Is this still an issue with the latest kernel and Mesa?

Sorry for the delay. It seems to be fixed.