Bug 58667 - VM-related crashes on CAYMAN
Summary: VM-related crashes on CAYMAN
Status: RESOLVED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 58354
Reported: 2012-12-23 00:05 UTC by Thomas Rohloff
Modified: 2014-08-05 20:35 UTC (History)


Attachments
Full dmesg output (1.34 MB, text/plain)
2012-12-23 00:07 UTC, Thomas Rohloff
New dmesg output (1.34 MB, text/plain)
2012-12-23 19:15 UTC, Thomas Rohloff

Description Thomas Rohloff 2012-12-23 00:05:11 UTC
This is with newest mesa from git with kernel 3.8-rc1 (+ this patch: http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.8&id=668bbc81baf0f34df832d8aca5c7d5e19a493c68 )

The screen first freezes (the mouse can still be moved, but the keyboard does not respond, not even to Magic SysRq), then the monitor goes off (standby) and comes back on with only garbage on the screen.

I'm not sure whether this is related (it should be fixed anyway), but dmesg gets spammed with this:
[  533.928472] radeon 0000:03:00.0: GPU fault detected: 146 0x00335514
[  533.928477] radeon 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  533.928483] radeon 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

where the address isn't always the same, example:
[  533.928374] radeon 0000:03:00.0: GPU fault detected: 146 0x0033ed14
[  533.928379] radeon 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  533.928385] radeon 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
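
A small sketch for sizing the flood, assuming only the message layout quoted above: the function below reads kernel log text on stdin, keeps the "GPU fault detected" lines, and counts how often each trailing fault value occurs. The name `summarize_gpu_faults` is made up for illustration and is not part of any radeon tooling.

```shell
#!/bin/sh
# Hypothetical helper: count radeon "GPU fault detected" lines per fault value.
# Reads kernel log text on stdin and prints "<count> <value>" pairs,
# sorted by the fault value (the last field of each matching line).
summarize_gpu_faults() {
    grep 'GPU fault detected' |
        awk '{ counts[$NF]++ } END { for (v in counts) print counts[v], v }' |
        sort -k2
}
```

It could be fed from the live log with `dmesg | summarize_gpu_faults`, or from one of the attached dmesg files with `summarize_gpu_faults < dmesg.txt` (the file name is only an example).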
Comment 1 Thomas Rohloff 2012-12-23 00:07:47 UTC
Created attachment 72006 [details]
Full dmesg output
Comment 2 Alex Deucher 2012-12-23 00:08:19 UTC
Is this a regression?  Does it happen with older versions of mesa or kernel?  If it's a regression can you identify which component (mesa or kernel) and bisect?
Comment 3 Alex Deucher 2012-12-23 00:13:29 UTC
May also be related to bug 58354.
Comment 4 Thomas Rohloff 2012-12-23 01:28:40 UTC
!Is this a regression?  Does it happen with older versions of mesa or kernel?!
Not that I know about.
"May also be related to bug 58354."
Do you have the path noted there ("drm/radeon: use DMA engine for VM page table updates on cayman/TN") ? I would loce to try to revert this patch and test it, but I'm unable to google it.
Comment 5 Thomas Rohloff 2012-12-23 01:30:03 UTC
I should really read before I click save, sorry. Here again:

"Is this a regression?  Does it happen with older versions of mesa or kernel?"
Not that I know about.

"May also be related to bug 58354."
Do you have a link to the patch noted there ("drm/radeon: use DMA engine for VM page table updates on cayman/TN") ? I would love to try to revert this patch and test it, but I'm unable to google it.
Comment 6 Thomas Rohloff 2012-12-23 19:15:31 UTC
Created attachment 72041 [details]
New dmesg output

Never mind, I found the patch here: http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-next&id=33e5467871b3007c4e6deea95b2cac38a55ff9f5

I reverted it and no crash so far (but as they are random they might still occur). On the other side the dmesg messages are still there. Uploading the new output just in case it is needed.

While writing this, Minecraft (which I used to trigger the crashes) crashed (right before and shortly after the crash the mouse was in slow motion, and I thought the system would crash right away).
Comment 7 Thomas Rohloff 2012-12-23 19:35:25 UTC
Crashes are still there after reverting "drm/radeon: use DMA engine for VM page table updates on cayman/TN"
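
For readers who have not done this before, the "revert one suspect commit and retest" step can be sketched on a throwaway repository. This is not the kernel tree: the file name, file contents and commit messages below are invented for the demo, and only `git revert --no-edit <commit>` corresponds to what was done here.

```shell
#!/bin/sh
# Demo only: revert a single suspect commit in a temporary repository
# while keeping a later, unrelated commit in place.
revert_demo() (
    set -e
    repo=$(mktemp -d)
    cd "$repo"
    git init -q .
    git config user.email "revert-demo@example.invalid"
    git config user.name "Revert Demo"
    git config commit.gpgsign false

    echo "cpu page table updates" > vm.txt      # baseline behaviour
    git add vm.txt
    git commit -q -m "baseline"

    echo "dma page table updates" > vm.txt      # the suspect change
    git commit -q -a -m "suspect change"
    suspect=$(git rev-parse HEAD)

    echo "unrelated" > other.txt                # later work that must survive
    git add other.txt
    git commit -q -m "later unrelated change"

    # Create a new commit that undoes only the suspect change.
    git revert --no-edit "$suspect" > /dev/null
    cat vm.txt                                  # baseline content is restored

    cd /
    rm -rf "$repo"
)
```

In a real kernel tree the same `git revert` call would be followed by a rebuild and reboot before retesting.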
Comment 8 Thomas Rohloff 2012-12-23 19:45:05 UTC
But this crash was different: the image froze, but the monitor neither went into standby nor came back with garbage.
Comment 9 Thomas Rohloff 2012-12-23 21:45:39 UTC
Bisected mesa.

This is a mesa bug caused by http://cgit.freedesktop.org/mesa/mesa/commit/?id=6532eb17baff6e61b427f29e076883f8941ae664

Can anybody move this to the right place or do I have to re-create the report (and if so: Where) ?
Comment 10 Thomas Rohloff 2012-12-23 22:29:44 UTC
I was too fast with this. While the error messages in dmesg are gone, it still crashes randomly, but this time the computer just froze completely. I think this bug report is in fact at least two bugs.
Comment 11 Thomas Rohloff 2012-12-25 19:21:30 UTC
Also, the error messages aren't completely gone.
I went back to mesa commit cf5632094ba0c19d570ea47025cf6da75ef8457a (mesa: Allow glReadBuffer(GL_NONE) for winsys framebuffers.) and played Minecraft a bit. Suddenly everything slowed down and the screen started to corrupt. I looked into dmesg and the messages were back.
I made a video starting after I killed Minecraft (when the corruption slowly disappeared); once all the corruption was gone, the message spam stopped again: https://www.dropbox.com/s/su1b6oaeiz028y2/out-86.ogv

I will do more bisecting, but as this is really random it may take a long time. Also, I hope my hardware hasn't been damaged by 6532eb17baff6e61b427f29e076883f8941ae664 (is this possible, and if so: is there any way to get my money back?)
Comment 12 Thomas Rohloff 2012-12-25 23:25:28 UTC
I went back as far as http://cgit.freedesktop.org/mesa/mesa/commit/?id=6c99f2101fbd3edb7d5899c44ca9d984a3c0f8b6 and the bug is still there (not the crashes directly, at least I couldn't trigger them, but the error messages), so either this bug is really old or it's not a mesa bug (or, and that would be really bad, it damaged the hardware).
Comment 13 Thomas Rohloff 2012-12-25 23:46:29 UTC
After going back to kernel 3.6 (3.7 not tested) I'm unable to re-trigger this bug, even after doing more of the actions that triggered it than in any test before. So I'm pretty sure this is a kernel bug!

Is anybody able to help me bisect the kernel? I don't even know which tree to use (drm-next?)
Comment 14 Dmitry Cherkassov 2012-12-25 23:59:26 UTC
drm-next should be fine.
Comment 15 Thomas Rohloff 2012-12-26 00:20:28 UTC
Thanks for the fast reply. Just to be sure before I spend a few hours cloning for no reason:

git clone git://people.freedesktop.org/~airlied/linux
git checkout drm-next

should be a good start, right?
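
For reference, the bisect itself follows a fixed pattern once the tree is cloned: mark one known-bad and one known-good commit and let git narrow down the range. The sketch below shows that pattern with `git bisect run` on a throwaway repository, not on the kernel. The `state` file, the "FAULT" marker and the eight commits are invented for the demo; with a kernel each step means building, booting and testing by hand, then calling `git bisect good` or `git bisect bad`.

```shell
#!/bin/sh
# Demo only: find the first bad commit in a temporary repository.
# Commits 1-5 are "good"; commits 6-8 contain the marker "FAULT".
bisect_demo() (
    set -e
    repo=$(mktemp -d)
    cd "$repo"
    git init -q .
    git config user.email "bisect-demo@example.invalid"
    git config user.name "Bisect Demo"
    git config commit.gpgsign false

    i=1
    while [ "$i" -le 8 ]; do
        if [ "$i" -ge 6 ]; then
            echo "FAULT $i" > state
        else
            echo "ok $i" > state
        fi
        git add state
        git commit -q -m "commit $i"
        i=$((i + 1))
    done

    good=$(git rev-list --max-parents=0 HEAD)    # root commit, known good
    git bisect start HEAD "$good" > /dev/null 2>&1

    # The test command exits 0 for a good revision and 1 for a bad one.
    git bisect run sh -c '! grep -q FAULT state' > /dev/null 2>&1

    # After a finished bisect, refs/bisect/bad names the first bad commit.
    git log -1 --format=%s refs/bisect/bad
    git bisect reset > /dev/null 2>&1

    cd /
    rm -rf "$repo"
)
```

Calling `bisect_demo` prints the subject of the first bad commit, which is "commit 6" in this demo.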
Comment 16 Dmitry Cherkassov 2012-12-26 00:31:31 UTC
git checkout drm-next-3.8 
i guess
Comment 17 Dmitry Cherkassov 2012-12-26 00:32:17 UTC
or better drm-fixes-3.8
just to be sure. (it has few relevant commits on top)
Comment 18 Thomas Rohloff 2012-12-26 00:37:23 UTC
There is no drm-next-3.8 nor drm-fixes-3.8 at ~airlied/linux, see: http://cgit.freedesktop.org/~airlied/linux/refs/ - That's why I asked what exactly to clone as this step will take hours.
Comment 19 Thomas Rohloff 2012-12-26 06:40:25 UTC
This was a long night but I finally got it: Bad commit: http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb
Comment 20 Thomas Rohloff 2012-12-26 07:08:43 UTC
I'm going crazy. I just removed the bad patch from 3.8-rc1 and updated mesa to the newest git version (therefore I had to stay at a9048aa6e6abcbeb498ef286630be30729aebaf3 because of a patch missing in the bisected tree) and the bug is back again.

I don't know how to find the root cause, and it is giving me a headache. There seems to be something really wrong with memory management, but it's way over my head.
Comment 21 Thomas Rohloff 2012-12-26 07:42:46 UTC
Here's my final summary:

If both http://cgit.freedesktop.org/mesa/mesa/commit/?id=6532eb17baff6e61b427f29e076883f8941ae664 and http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb are missing, the bug doesn't trigger. If the first one is there, the bug is triggered extremely often, spamming dmesg. If only the second is there, the bug triggers randomly (a good way to trigger it: lots of exploding TNT in Minecraft. Just build TNT pillars and ignite them until you are at bedrock).

My last hope is that some genius hacker who knows the driver has an "ah, I see the problem" moment. :(
Comment 22 Alexandre Demers 2013-01-03 22:10:11 UTC
(In reply to comment #5)
> I should really read before I click save, sorry. Here again:
> 
> "Is this a regression?  Does it happen with older versions of mesa or
> kernel?"
> Not that I know about.
> 
> "May also be related to bug 58354."
> Do you have a link to the patch noted there ("drm/radeon: use DMA engine for
> VM page table updates on cayman/TN") ? I would love to try to revert this
> patch and test it, but I'm unable to google it.

"Is this a regression?  Does it happen with older versions of mesa or kernel?"
Yes. Previous kernel 3.7 doesn't show this problem.
Comment 23 Alex Deucher 2013-01-04 15:20:42 UTC
(In reply to comment #22)
> "Is this a regression?  Does it happen with older versions of mesa or
> kernel?"
> Yes. Previous kernel 3.7 doesn't show this problem.

Can you bisect?  Is it the same commit Thomas landed on or another one?
Comment 24 Alexandre Demers 2013-01-05 20:26:53 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > "Is this a regression?  Does it happen with older versions of mesa or
> > kernel?"
> > Yes. Previous kernel 3.7 doesn't show this problem.
> 
> Can you bisect?  Is it the same commit Thomas landed on or another one?

Pretty sure it is the same problem. With kernel 3.8.0-rcx, just launching Gnome Shell starts flooding my logs with:
radeon 0000:0X:00.0: GPU fault detected: 146 0x00xxxxxx
radeon 0000:0X:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
radeon 0000:0X:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing. Having a crash here and there from time to time may come from something different, but the incessant flood is a big one. In a single session, I end up with kernel.log and everything.log being over 52GB each. I'm also sure this message is only triggered if something wrong is going on. I'll let you know when I'm done bisecting to figure out what is triggering this flood.
Comment 25 Thomas Rohloff 2013-01-05 20:47:16 UTC
(In reply to comment #24)
> I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.

Maybe you should just compile http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb (bad) and http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=9d89d78e3a20980205966fba6345645547e59ceb (good). It would be faster than bisecting and if you get another result than me you can still do a full bisect afterwards.
Comment 26 Alexandre Demers 2013-01-05 20:57:25 UTC
(In reply to comment #25)
> (In reply to comment #24)
> > I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.
> 
> Maybe you should just compile
> http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.
> 8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb (bad) and
> http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.
> 8&id=9d89d78e3a20980205966fba6345645547e59ceb (good). It would be faster
> than bisecting and if you get another result than me you can still do a full
> bisect afterwards.

That's what I'll do, it makes sense.
Comment 27 Alexandre Demers 2013-01-06 00:11:43 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > (In reply to comment #24)
> > > I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.
> > 
> > Maybe you should just compile
> > http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.
> > 8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb (bad) and
> > http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.
> > 8&id=9d89d78e3a20980205966fba6345645547e59ceb (good). It would be faster
> > than bisecting and if you get another result than me you can still do a full
> > bisect afterwards.
> 
> That's what I'll do, it makes sense.

It seems both are bad: crashed on logon with 9d89d and both flooded my logs.
Comment 28 Alexandre Demers 2013-01-06 03:30:33 UTC
The flood is caused by:
Commit: 4ac0533abaec2b83a7f2c675010eedd55664bc26

Author: Jerome Glisse <jglisse@redhat.com>  2012-12-13 12:08:11
Committer: Alex Deucher <alexander.deucher@amd.com>  2012-12-14 10:45:24
Parent: 9af20792124850369e764965690b99b20623dfc4 (drm/radeon: fix fence locking in the pageflip callback)
Branch: remotes/origin/master
Follows: v3.7-rc7
Precedes: v3.8-rc1

    drm/radeon: fix htile buffer size computation for command stream checker
    
    Fix the size computation of the htile buffer.
    
    Signed-off-by: Jerome Glisse <jglisse@redhat.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

However, I think this is not related to the lockups/crashes. So the bug's description actually points to two different bugs: the flood and the crashes. Should I open a different bug for the flood of "GPU fault detected" messages?
Comment 29 Alexandre Demers 2013-01-06 20:43:00 UTC
I just created a new bug (bug 59089) for the GPU fault flood, which is not directly linked to the crashes: the former happens without the latter.
Comment 30 Alex Deucher 2013-01-07 20:33:51 UTC
Should be fixed with this mesa commit:
http://cgit.freedesktop.org/mesa/mesa/commit/?id=4332f6fc185f968e7563e748b8c949021937c935
Comment 31 Thomas Rohloff 2013-01-08 04:59:23 UTC
(In reply to comment #30)
> Should be fixed with this mesa commit:
> http://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=4332f6fc185f968e7563e748b8c949021937c935

Sadly it isn't.
Comment 32 Alexandre Demers 2013-01-09 01:49:27 UTC
You're using a Cayman card, but which model exactly?
Comment 33 Alex Deucher 2013-01-09 13:59:22 UTC
Does a 3.8 kernel work ok if you revert mesa back to cf5632094ba0c19d570ea47025cf6da75ef8457a?

I think
r600g: rework flusing and synchronization pattern v7
http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
may be problematic on cayman.
Comment 34 Thomas Rohloff 2013-01-12 10:41:05 UTC
(In reply to comment #33)
> Does a 3.8 kernel it work ok if you revert mesa back to
> cf5632094ba0c19d570ea47025cf6da75ef8457a?

(In reply to comment #12)
> I did go back till
> http://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=6c99f2101fbd3edb7d5899c44ca9d984a3c0f8b6 and the bug is still there

> 
> I think
> r600g: rework flusing and synchronization pattern v7
> http://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
> may be problematic on cayman.

I'm currently updating my kernel to 3.8-rc3, then I'll test the newest mesa and cf5632094ba0c19d570ea47025cf6da75ef8457a again.
Comment 35 Thomas Rohloff 2013-01-12 11:59:08 UTC
Still there with 3.8-rc3 + mesa cf5632094ba0c19d570ea47025cf6da75ef8457a
Comment 36 Jerome Glisse 2013-01-14 15:48:30 UTC
Did you test with mesa reverted to before following commit :
http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
Comment 38 Thomas Rohloff 2013-01-17 17:28:49 UTC
(In reply to comment #37)
> This patch might help:

I applied it to a 3.8-rc3 kernel, and while I haven't seen the message spam so far, the GPU crashes extremely often (so often that this might be why I'm unable to see the spam). Either the image freezes or the monitor goes into standby. In both cases the keyboard doesn't react anymore (not even to Magic SysRq).
Comment 39 Alexandre Demers 2013-01-17 18:42:46 UTC
(In reply to comment #38)
> (In reply to comment #37)
> > This patch might help:
> 
> I applied it to a 3.8-rc3 kernel and while I didn't see the message spam
> till now the GPU crashes extremely often (so often that this might be the
> case I'm unable to see the spam). Either the image freezes or the monitor
> goes into standby. In both cases the keyboard doesn't react anymore (not
> even Magic SysRq).

Does it do the same thing without the patch? I applied it yesterday and I haven't seen any difference.
Comment 40 Thomas Rohloff 2013-01-17 19:27:59 UTC
(In reply to comment #39)
> Does it do the same thing without the patch?

It has random crashes without it, too, yes. But way less frequently. In fact I had to revert that patch to be able to use my desktop for more than 5 minutes again.
Comment 42 Thomas Rohloff 2013-01-30 08:40:30 UTC
I updated my kernel to 3.8-rc5 and mesa to http://cgit.freedesktop.org/mesa/mesa/commit/?id=952e6e9f3b0eb179f67345f00e5a7f1dbaa7bdd5 (can't go higher because of https://bugs.freedesktop.org/show_bug.cgi?id=60038 ) and disabled huge pages in the kernel, and now things are different. First, the message spam seems to be gone completely, and second, the GPU doesn't crash anymore. At one point the image froze, but switching to the console and back solved this.

I'll look if it continues like that and later on re-enable huge pages to see what happens then.
Comment 43 Thomas Rohloff 2013-01-30 09:13:49 UTC
And again I was too fast with this. I started another game and the dmesg spam was there again.
Comment 44 Thomas Rohloff 2013-01-30 09:26:05 UTC
And it crashed again, too. :(
Comment 45 Marek Olšák 2014-01-23 15:14:26 UTC
Is this still an issue with the latest kernel and Mesa?
Comment 46 Marek Olšák 2014-01-23 15:20:11 UTC
Also, does setting this environment variable help?

R600_DEBUG=nohyperz
Comment 47 udo 2014-01-26 10:19:07 UTC
Over here, with 3.12.6 and these
$ cat /etc/environment 
LIBGL_DRIVERS_PATH=/opt/xorg/lib/dri/
RADEON_VA=0
R600_DEBUG=nodma

all appears stable.

(git llvm, libclc, mesa, etc)
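
Instead of setting the variable system-wide in /etc/environment, it can be limited to a single program for a quick test. The wrapper below is only a sketch: the name `with_r600_debug` is made up, and the only values it has been described with here are the ones from this thread (`nohyperz`, `nodma`).

```shell
#!/bin/sh
# Hypothetical wrapper: run one command with R600_DEBUG set for that
# process only, leaving the rest of the session untouched.
# Usage: with_r600_debug <value> <command> [args...]
with_r600_debug() {
    r600_value=$1
    shift
    env R600_DEBUG="$r600_value" "$@"
}
```

For example, `with_r600_debug nohyperz glxgears` would start a test program with HyperZ disabled, assuming glxgears is installed; any other OpenGL program can be substituted.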
Comment 48 Thomas Rohloff 2014-01-26 11:33:17 UTC
(In reply to comment #45)
> Is this still an issue with the latest kernel and Mesa?

Sorry for the delay. It seems to be fixed.

