Bug 45018

Summary: [bisected] rendering regression and va conflicts since added support for virtual address space on cayman v11
Product: Mesa Reporter: Alexandre Demers <alexandre.f.demers>
Component: Drivers/Gallium/r600 Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: major    
Priority: medium CC: alexdeucher, h.judt, v10lator
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Good rendering
Bad rendering
dmesg with bad rendering after running the app
xorg.log with bad rendering
dmesg with bo conflict
GPU lock
Kernel crash
radeon_cs_update_pages.jpg
dmesg after lock with latest patch
screenshot showing garbled fonts in blender-2.62
Different error message
possible fix
dmesg related to the xsession-error file
snippet when gnome-shell is able to fall bak on its feet
dmesg drm-next
xsession with drm-next
Adding an environment variable to disable VM if wanted
Free va early in the kernel
Free va earyl
fixes to wait on the bo and to free the va after the kernel
Possible fix.
Properly protect virtual address
Properly protect virtual address
Properly protect virtual address
Properly protect virtual address against kernel 3.5
Properly protect virtual address kernel 3.5 v2
Properly protect virtual address v2

Description Alexandre Demers 2012-01-20 22:00:56 UTC
Created attachment 55888 [details]
Good rendering

When testing RenderFeatTest.bin64, the shadows on test07 are not rendered correctly anymore. Bisecting identified the following commit as culprit:

bb1f0cf3508630a9a93512c79badf8c493c46743 is the first bad commit
commit bb1f0cf3508630a9a93512c79badf8c493c46743
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Fri Dec 2 10:20:29 2011 -0500

    r600g: add support for virtual address space on cayman v11

I'll be attaching pictures to show the regression.
Comment 1 Alexandre Demers 2012-01-20 22:03:32 UTC
By the way, I'm using latest drm and kernel from git. I have pcie_gen2 enabled. I'm running Ubuntu Oneiric with a HD6950.
Comment 2 Alexandre Demers 2012-01-20 22:04:48 UTC
Created attachment 55889 [details]
Bad rendering

Projected shadows are not rendered correctly anymore.
Comment 3 Alex Deucher 2012-01-21 06:48:47 UTC
Please attach your xorg log and dmesg output.
Comment 4 Alexandre Demers 2012-01-21 07:06:26 UTC
Created attachment 55912 [details]
dmesg with bad rendering after running the app
Comment 5 Alexandre Demers 2012-01-21 07:06:59 UTC
Created attachment 55913 [details]
xorg.log with bad rendering
Comment 6 Alexandre Demers 2012-01-21 07:07:31 UTC
Should I add the logs from the good rendering?
Comment 7 Alexandre Demers 2012-01-23 22:20:21 UTC
One of the latest commits fixed the issue. Many commits were related to r600g and some were specific to cayman, so one of them must have fixed it. Closing.
Comment 8 Alexandre Demers 2012-01-25 10:01:51 UTC
I must have mixed something up when testing. It is still there with latest git.
Comment 9 Alexandre Demers 2012-01-27 21:59:44 UTC
Here is why I thought the bug was fixed: for another reason, I booted with a 3.2 kernel instead of 3.3-rc1. The bug is not visible under kernel 3.2, but it is under 3.3-rc1 since the bisected commit. I will try with a 3.3-rc2 kernel once it is available.
Comment 10 Michel Dänzer 2012-01-28 04:52:09 UTC
(In reply to comment #9)
> The bug is not visible under kernel 3.2, [...]

3.2 lacks Radeon virtual address space support.

> I will try with a 3.3-rc2 kernel once it will be available.

That's unlikely to make a difference, this is likely a userspace bug.
Comment 11 Alexandre Demers 2012-01-31 16:35:53 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > The bug is not visible under kernel 3.2, [...]
> 
> 3.2 lacks Radeon virtual address space support.
> 
> > I will try with a 3.3-rc2 kernel once it will be available.
> 
> That's unlikely to make a difference, this is likely a userspace bug.

Indeed and I can now confirm it. 3.3-rc2 doesn't change the problem.
Comment 12 Jerome Glisse 2012-02-03 15:10:11 UTC
Can you please record an apitrace of the affected test. Thank you

https://github.com/apitrace
Comment 13 Alexandre Demers 2012-02-03 19:26:16 UTC
(In reply to comment #12)
> Can you please record an apitrace of the affected test. Thank you
> 
> https://github.com/apitrace

I'll try it during the weekend. I should be able to give you an apitrace by Sunday night or Monday.
Comment 14 Alexandre Demers 2012-02-05 17:45:20 UTC
Here I uploaded the apitrace:
http://www.mediafire.com/?mnlmwe6x4j305zm

It was too big to be posted here, so it's available on mediafire.
Comment 15 Alexandre Demers 2012-02-14 22:44:50 UTC
I'm now running kernel 3.3-rc3 and some applications are completely freezing my system. I haven't seen anything special in the various logs. The only errors available are about address conflicts for bos, but they are not related to the applications that are crashing my system. RenderFeatTest.bin64 and Sanctuary from Unigine are examples of those apps. Gnome-shell is also resetting, freezing or crashing from time to time (the problem could come from X when crashing).
Comment 16 Alexandre Demers 2012-02-15 15:54:48 UTC
With yesterday's gits (mesa, drm, ddx), Gnome-shell is now freezing right after I log in. From time to time, I get a "GPU hanged for X msec" message and then it resets. I'll try to bisect.
Comment 17 Alexandre Demers 2012-02-16 16:19:08 UTC
I was not able to find the root of the problem. However, I have many messages telling me the following:
radeon 0000:01:00.0: bo ffff880214144000 va 0x024000000 conflict with (bo ffff880213ffcc00 0x02340000 0x03340000)

And at some point:
gnome-shell segfault at 14 ip 00007f20b053a82a sp 00007fff23165000 error 4 in r600_dri.so[7f20b0418000+405000]

I'll continue investigating the bug, but each time I go back to kernel 3.2 (no virtual address space), it works like a charm.
Comment 18 Alexandre Demers 2012-02-16 16:31:44 UTC
Created attachment 57178 [details]
dmesg with bo conflict

Latest dmesg with the bo conflicts.
Comment 19 Alexandre Demers 2012-02-18 08:39:32 UTC
I've been able to get a log in dmesg when the GPU locked. Just after relaunching X and Gnome-Shell, I was able to reproduce a lock/crash I'm experiencing from time to time without any log to post. So I took a picture of it. I'm attaching all that just now.
Comment 20 Alexandre Demers 2012-02-18 08:40:56 UTC
Created attachment 57234 [details]
GPU lock

GPU lock and reset
Comment 21 Alexandre Demers 2012-02-18 08:42:31 UTC
Created attachment 57235 [details]
Kernel crash

Picture of the kernel crash
Comment 22 Harald Judt 2012-02-18 09:37:28 UTC
Created attachment 57237 [details]
radeon_cs_update_pages.jpg

I experience a very similar lockup problem. It always happens when rotating the desktop cube (compiz).

It seems the rest of the system still works (playing music via mpd etc.), but I can't switch VT. However, I took a picture of the output (see attached jpg).
Comment 23 Alexandre Demers 2012-02-19 13:39:42 UTC
Great to know I'm not the only one with this problem. By the way, it is still there with kernel 3.3-rc4 and latest gits.
Comment 24 Jerome Glisse 2012-02-21 09:44:50 UTC
I pushed a mesa fix for a bo allocation issue. If you enable 2d tiling properly you shouldn't have lockups anymore. There is also a kernel patch to fix a kernel issue after gpu lockup.

http://lists.freedesktop.org/archives/dri-devel/2012-February/019293.html

To properly enable 2d tiling you need libdrm from git and ddx from git and add:

Option     "ColorTiling2D"              "true"

to the gpu device section of your xorg configuration.
Comment 25 Alexandre Demers 2012-02-21 20:25:39 UTC
Does this imply that when not using 2d tiling it shouldn't crash or lock anymore or is it specific to 2d tiling usage?

(In reply to comment #24)
> I pushed a mesa fix for bo allocation issue. If you enable 2d tiling properly
> you shouldn't have lockup anymore. There is also a kernel patch to fix kernel
> issue after gpu lockup.
> 
> http://lists.freedesktop.org/archives/dri-devel/2012-February/019293.html
> 
> To properly enable 2d tiling you need libdrm from git and ddx from git and
> add:
> 
> Option     "ColorTiling2D"              "true"
> 
> To your gpu device section of xorg configuration
Comment 26 Alex Deucher 2012-02-22 06:32:43 UTC
(In reply to comment #25)
> Does this imply that when not using 2d tiling it shouldn't crash or lock
> anymore or is it specific to 2d tiling usage?

It shouldn't lock up, but if it does (for any reason, not necessarily VM related), the kernel patch should allow the kernel to recover more gracefully if the reset fails.
Comment 27 Jerome Glisse 2012-02-22 09:50:14 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > Does this imply that when not using 2d tiling it shouldn't crash or lock
> > anymore or is it specific to 2d tiling usage?
> 
> It shouldn't lock up, but if it does (for any reason, not necessarily VM
> related), the kernel patch should allow the kernel to recover more gracefully if
> the reset fails.

Well, actually the 2D tiling path fixes a bunch of issues that led to GPU lockups. So with 2D tiling enabled there is less chance of a lockup.
Comment 28 Harald Judt 2012-02-25 16:03:32 UTC
Ok. Now with latest git, there is an improvement: no kernel crash anymore, but compiz still hangs when rotating the cube and the X screen freezes. Fortunately, I can switch VT, pkill -9 and restart compiz, and X is usable again. At least no need to reboot!

The following line appeared in dmesg when the freeze happened:
radeon 0000:01:00.0: offset 0x300000 is in reserved area 0x800000
Comment 29 Alexandre Demers 2012-02-25 17:31:17 UTC
Created attachment 57645 [details]
dmesg after lock with latest patch

I haven't used 2D tiling yet, but I'll try it soon. Meanwhile, I've installed kernel 3.3-rc5 with latest gits (mesa, drm and xorg driver) and it still locks. I'm attaching the output found in dmesg.
Comment 30 Harald Judt 2012-02-27 17:24:12 UTC
Created attachment 57740 [details]
screenshot showing garbled fonts in blender-2.62

Besides the lockups and the rendering regressions already mentioned, the commit bb1f0cf3508630a9a93512c79badf8c493c46743 "r600g: add support for virtual address space on cayman v11" garbles the fonts in blender-2.62.

git bisect start
# bad: [bf4fedcef3e345f5117232d58bd9000c2441de74] r600g: use u_default_transfer_flush_region for all resource types
git bisect bad bf4fedcef3e345f5117232d58bd9000c2441de74
# good: [f9c9933f9c7f72f12be27ccda98c965c75f08a12] mesa: Bump version number to 8.0 (final)
git bisect good f9c9933f9c7f72f12be27ccda98c965c75f08a12
# good: [fe77fd3983ba3da16ec53c58a790c381b07387ce] docs: Add 8.0.1 release notes
git bisect good fe77fd3983ba3da16ec53c58a790c381b07387ce
# good: [6fe42b603d0ec9e13a8b7d6c46c6d89da3a6a614] mesa: Include glx tests Makefile.in in tarball
git bisect good 6fe42b603d0ec9e13a8b7d6c46c6d89da3a6a614
# skip: [ac3a765589a881c56f351514d6436760edd4a291] r300g: set minimum point size to 1.0 for non-sprite non-aa points
git bisect skip ac3a765589a881c56f351514d6436760edd4a291
# bad: [8bfadc802f6c3c85de4c429b2a87d0bdb1705028] st/vdpau: implement uploads to interlaced video buffers
git bisect bad 8bfadc802f6c3c85de4c429b2a87d0bdb1705028
# bad: [c45771905f237d9285465dfce955440582ee51e5] swrast: use stencil packing function in s_stencil.c
git bisect bad c45771905f237d9285465dfce955440582ee51e5
# bad: [5a0f395bcf70e524492e766a07cf0b816b42a20d] glsl: Fix leak of LinkedTransformFeedback.Varyings.
git bisect bad 5a0f395bcf70e524492e766a07cf0b816b42a20d
# bad: [39491d1d31d9f03437816fbb4f2872761ae1157c] r600g: vertex id support.
git bisect bad 39491d1d31d9f03437816fbb4f2872761ae1157c
# good: [6950a4faf650fe119ee97aa18b006eed099038be] mesa: Throw the required error for glCopyTex{Sub,}Image from multisample FBO.
git bisect good 6950a4faf650fe119ee97aa18b006eed099038be
# good: [27915708ed4519cc5606e81fb789e8427501f355] docs: new page describing how to build, install VMware SVGA3D guest driver
git bisect good 27915708ed4519cc5606e81fb789e8427501f355
# bad: [bfcffd4d721d87bb6287980a09e0296ceed0bba3] r600g: fix r600 f2i to be trans only emitted.
git bisect bad bfcffd4d721d87bb6287980a09e0296ceed0bba3
# good: [6c2c2c5a07c81a15a89519a8a84ef7c69698903b] scons: Fix libGL.so build.
git bisect good 6c2c2c5a07c81a15a89519a8a84ef7c69698903b
# bad: [5250bd00c00ac8470320f4fae1d74425132f2083] r600g: add missing r32 uint/sint fbo formats.
git bisect bad 5250bd00c00ac8470320f4fae1d74425132f2083
# bad: [bb1f0cf3508630a9a93512c79badf8c493c46743] r600g: add support for virtual address space on cayman v11
git bisect bad bb1f0cf3508630a9a93512c79badf8c493c46743
Comment 31 Harald Judt 2012-03-02 07:10:27 UTC
Using current git of kernel, xorg-server, xf86-video-ati and mesa, the screen still freezes every once in a while and dmesg shows these messages: 

radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000
radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000
Comment 32 Harald Judt 2012-03-04 04:07:31 UTC
And today an unrecoverable error with 3.3-rc6, forcing a reboot of the machine:

radeon 0000:01:00.0: GPU lockup CP stall for more than 10060msec
GPU lockup (waiting for 0x00026292 last fence id 0x0002628F)
radeon 0000:01:00.0: GPU softreset 
radeon 0000:01:00.0:   GRBM_STATUS=0xF5702028
radeon 0000:01:00.0:   GRBM_STATUS_SE0=0xFC000004
radeon 0000:01:00.0:   GRBM_STATUS_SE1=0xFC000004
radeon 0000:01:00.0:   SRBM_STATUS=0x200000C0
radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_ADDR   0x09E3D64C
radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00071001
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
radeon 0000:01:00.0:   GRBM_SOFT_RESET=0x0000DF7B
radeon 0000:01:00.0:   GRBM_STATUS=0x00003828
radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000007
radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000007
radeon 0000:01:00.0:   SRBM_STATUS=0x200000C0
radeon 0000:01:00.0: GPU reset succeed
Comment 33 Harald Judt 2012-03-04 05:38:29 UTC
Ok, the last gpu lockup has nothing to do with this; it is specific to an application and occurs on kernel-3.2 too.
Comment 34 Alexandre Demers 2012-03-04 13:46:48 UTC
Tested with latest kernel, mesa and drm gits and the problem is still there.

Is there a way to disable radeon virtual addressing when loading the kernel? I'm sorry to ask, but from where I stand, this regression is preventing me from having a reliable experience with my computer (freezes, crashes and locks) and I was not having this problem prior to this commit series (mesa/kernel). I think it should be ironed out and disabled until things are fixed (as Intel RC6 had been until recently). We are getting near a new kernel release (3.3) where this will be enabled by default, so we can expect this problem to be reported a lot more once a new stable kernel is available.
Comment 35 Alex Deucher 2012-03-04 14:30:32 UTC
(In reply to comment #34)
> Is there a way to disable radeon virtual addressing when loading the kernel?

You can disable it in mesa.  Just set ws->info.r600_virtual_address to FALSE in do_winsys_init() in radeon_drm_winsys.c.
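
A minimal sketch of what such a switch could look like if it were made selectable at run time instead of hard-coded. Only the field name r600_virtual_address comes from the comment above; the struct, the function names and the RADEON_SKETCH_NO_VA variable are made up for illustration and are not the actual Mesa code.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the winsys info struct; only the field
 * named in the comment above is modelled. */
struct sketch_winsys_info {
    bool r600_virtual_address;
};

/* Decide whether virtual addressing stays enabled.
 * kernel_has_va: whether the kernel reported VA support.
 * env_value:     value of an opt-out variable, or NULL when unset. */
static bool sketch_want_virtual_address(bool kernel_has_va,
                                        const char *env_value)
{
    if (env_value != NULL && strcmp(env_value, "1") == 0)
        return false;   /* user opted out at run time */
    return kernel_has_va;
}

/* Equivalent of "set the field to FALSE", but driven by the opt-out. */
static void sketch_winsys_init(struct sketch_winsys_info *info,
                               bool kernel_has_va,
                               const char *env_value)
{
    info->r600_virtual_address =
        sketch_want_virtual_address(kernel_has_va, env_value);
}
```

A caller would pass getenv("RADEON_SKETCH_NO_VA") as env_value, so the default behaviour is unchanged and the feature is only turned off when the (hypothetical) variable is set to 1.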
Comment 36 Alexandre Demers 2012-03-04 16:06:56 UTC
(In reply to comment #35)
> (In reply to comment #34)
> > Is there a way to disable radeon virtual addressing when loading the kernel?
> 
> You can disable it in mesa.  Just set ws->info.r600_virtual_address to FALSE in
> do_winsys_init() in radeon_drm_winsys.c.

Hi Alex.

I want to point out this is not an option for the average user nor is it an option to turn off virtual address "on the fly". The average user will not recompile code; only if we are lucky will he use a flag to disable or enable an option, as long as it is easily accessible. You are taking the point of view of a dev or, at best, a tester willing to go beyond testing the code as it is.

I know I can run a 3.2 kernel, I know I can compile a different version or bisect or submit patches, I know I can switch from Gnome Shell to another window manager without fancy effects or that I can disable options if I follow your advice. But this is not accessible to the average user.

Please, consider another option for the average users that will use compiled code available soon.

Meanwhile, I'm still completely dedicated to solving this issue if I can do anything else. I'm sure other people following this bug are also willing to go further to help you fix this issue. Can we provide you with something more? apitrace, register states?
Comment 37 Alex Deucher 2012-03-04 16:24:02 UTC
(In reply to comment #36)
> I know I can run a 3.2 kernel, I know I can compile a different version or
> bisect or submit patches, I know I can switch from Gnome Shell to another
> window manager without fancy effects or that I can disable options if I follow
> your advice. But this is not accessible to the average user.

You can run an older mesa release as well.  It's probably better as a mesa knob than a kernel knob.

> 
> Please, consider another option for the average users that will use compiled
> code available soon.

We can add a mesa option if we aren't able to get this fixed in time for the next mesa release, but for now I'd prefer to leave it enabled otherwise most users will just disable it and not test the current code which won't help in getting it fixed.
Comment 38 Alexandre Demers 2012-03-05 20:40:48 UTC
Some news: today, I updated xserver and it seems I'm now able to boot under Gnome-Shell correctly.

However, launching RenderFeatTest.bin64 still hangs exactly where it has been hanging for some time now and freezes my window manager. At least, it seems one of the problems was related to xserver.

I hope I'll be able to find something new in the logs.
Comment 39 Alexandre Demers 2012-03-06 22:25:26 UTC
(In reply to comment #38)
> Some news: today, I updated xserver and it seems I'm now able to boot under
> Gnome-Shell correctly.
> 
> However, launching RenderFeatTest.bin64 still hangs exactly where it has been
> hanging for some time now and freeze my window manager. At least, it seems one
> of the problem was related to xserver.
> 
> I'll hope I'll be able to find something new in the logs.

After one testing day, it happened again. It's just not happening at start as it was doing, but more randomly. Too bad.
Comment 40 Harald Judt 2012-03-20 11:50:26 UTC
(In reply to comment #35)
> (In reply to comment #34)
> > Is there a way to disable radeon virtual addressing when loading the kernel?
> 
> You can disable it in mesa.  Just set ws->info.r600_virtual_address to FALSE in
> do_winsys_init() in radeon_drm_winsys.c.

Thanks, as expected this also cures the garbled fonts in blender.

(In reply to comment #37)
> We can add a mesa option if we aren't able to get this fixed in time for the
> next mesa release, but for now I'd prefer to leave it enabled otherwise most
> users will just disable it and not test the current code which won't help in
> getting it fixed.

We're already 2 users affected and very willing to test and help ;-) What information could we provide to further improve the situation?
Comment 41 Alex Deucher 2012-03-20 12:11:26 UTC
Are you still getting any messages like the following in your dmesg with the latest mesa from git?

radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000
radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000

I pushed a patch yesterday that fixed up a missing va setup, although I don't think the driver should hit that path with cayman and vm support.
Comment 42 Alexandre Demers 2012-03-21 07:35:36 UTC
(In reply to comment #41)
> Are you still getting any messages like the following in your dmesg with the
> latest mesa from git?
> 
> radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000
> radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000
> 
> I pushed a patch yesterday that fixed up a missing va setup, although I don't
> think the driver should hit that path with cayman and vm support.

Upgraded yesterday, using latest 3.3.0 kernel with latest drm and mesa. For now, it seems I'm not seeing it. However, I'll be testing it more in the next few days, I'll be mostly doing more than just using the desktop (I'll run some demos and games that were triggering the error). I'll keep you updated.
Comment 43 Alexandre Demers 2012-03-21 21:11:41 UTC
Created attachment 58846 [details]
Different error message

Running RenderFeatTest.bin64 with yesterday's gits still crashes at the same spot, but doesn't produce the "radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000" error.

GPU locks up, as you can see in the dmesg output. Once hung, I have to reset my computer to be able to use the radeon driver again, otherwise I'm running under software rendering (softpipe).
Comment 44 Alexandre Demers 2012-04-03 19:24:36 UTC
Just to let you know I've moved from Ubuntu to Arch. This week, kernel 3.0 came in and the problem is obviously appearing as expected. Still locks up, still hangs, still fails to resume:

[ 1454.142346] radeon 0000:01:00.0: offset 0x100000 is in reserved area 0x800000
[ 1454.142955] [drm:radeon_cs_parser_relocs] *ERROR* gem object lookup failed 0x10
[ 1454.142959] [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -2!
[ 1454.155602] [drm:radeon_cs_parser_relocs] *ERROR* gem object lookup failed 0x10
[ 1454.155606] [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -2!
[ 1465.203216] radeon 0000:01:00.0: GPU lockup CP stall for more than 10030msec
[ 1465.203220] GPU lockup (waiting for 0x0001E557 last fence id 0x0001E554)
[ 1465.418088] radeon 0000:01:00.0: GPU softreset 
[ 1465.418092] radeon 0000:01:00.0:   GRBM_STATUS=0xF5700828
[ 1465.418094] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0xFC000001
[ 1465.418096] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0xFC000001
[ 1465.418098] radeon 0000:01:00.0:   SRBM_STATUS=0x20020FC0
[ 1465.418101] radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_ADDR   0x000779DD
[ 1465.418103] radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00072001
[ 1465.418105] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000005B9
[ 1465.418108] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020A400C
[ 1465.579826] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 1465.579828] radeon 0000:01:00.0:   GRBM_SOFT_RESET=0x0000DF7B
[ 1465.579936] radeon 0000:01:00.0:   GRBM_STATUS=0x80103828
[ 1465.579938] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x04000007
[ 1465.579940] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x04000007
[ 1465.579941] radeon 0000:01:00.0:   SRBM_STATUS=0x20020FC0
[ 1465.580943] radeon 0000:01:00.0: GPU reset succeed
[ 1465.771511] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 1465.942884] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 1465.944796] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[ 1465.944872] radeon 0000:01:00.0: WB enabled
[ 1465.944874] [drm] fence driver on ring 0 use gpu addr 0x40000c00 and cpu addr 0xffff88021ea01c00
[ 1465.944876] [drm] fence driver on ring 1 use gpu addr 0x40000c04 and cpu addr 0xffff88021ea01c04
[ 1465.944878] [drm] fence driver on ring 2 use gpu addr 0x40000c08 and cpu addr 0xffff88021ea01c08
[ 1466.140829] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8500)=0xCAFEDEAD)
[ 1466.140831] [drm:cayman_resume] *ERROR* cayman startup failed on resume


I'll be testing kernel 3.4-rc1 soon and I'll play with 2D tiling.
Comment 45 Alexandre Demers 2012-04-22 22:56:08 UTC
I'm now working with a 3.4-rc4 kernel. I activated ColorTiling2D. However, it didn't change anything.

On the other hand, if you have the hardware at hand, I want to let you know that since the problem appeared (after kernel 3.2 and the culprit commit in mesa), you should be able to recreate it by running the piglit tests (r600.tests). I'm able to recreate it each time if I also enable GLSL130. Doing the same with a 3.2 kernel does not produce the crash, as expected.
Comment 46 Alex Deucher 2012-04-25 06:41:45 UTC
Does this kernel patch help?
http://lists.freedesktop.org/archives/dri-devel/2012-April/022037.html
Comment 47 Michel Dänzer 2012-04-25 07:11:51 UTC
(In reply to comment #46)
> Does this kernel patch help?
> http://lists.freedesktop.org/archives/dri-devel/2012-April/022037.html

I was wondering about that as well, but I'm afraid it can't, as Christian pointed out.

(In reply to comment #44)
> [ 1454.142346] radeon 0000:01:00.0: offset 0x100000 is in reserved area
> 0x800000

The offset 0x100000 comes from userspace, so it could still be a pure userspace problem. Maybe it runs out of VM space and into the reserved area or something like that.
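
As a rough illustration of that complaint, here is a sketch that assumes the kernel rejects any requested offset below the reserved size printed in the message (0x800000). The macro and function names are made up for this sketch; only the two numbers come from the log line.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size of the reserved range as printed in the log line
 * (0x800000 = 8 MiB). The macro name is an assumption. */
#define SKETCH_VA_RESERVED_SIZE 0x800000u

/* True when a requested VA offset falls inside the reserved range
 * [0, SKETCH_VA_RESERVED_SIZE) and would therefore be refused. */
static bool sketch_va_offset_is_reserved(uint64_t offset)
{
    return offset < SKETCH_VA_RESERVED_SIZE;
}
```

Under that reading, the offsets seen in this report (0x100000, 0x200000, 0x300000) are all inside the reserved range, so whatever picks the offset in userspace is handing out an address it should never have considered free.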
Comment 48 Alexandre Demers 2012-04-25 08:21:12 UTC
(In reply to comment #47)
> (In reply to comment #46)
> > Does this kernel patch help?
> > http://lists.freedesktop.org/archives/dri-devel/2012-April/022037.html
> 
> I was wondering about that as well, but I'm afraid it can't, as Christian
> pointed out.
> 
> (In reply to comment #44)
> > [ 1454.142346] radeon 0000:01:00.0: offset 0x100000 is in reserved area
> > 0x800000
> 
> The offset 0x100000 comes from userspace, so it could still be a pure userspace
> problem. Maybe it runs out of VM space and into the reserved area or something
> like that.

I'll try it just in case ASAP. However, this means it probably won't be until Friday or the weekend.
Comment 49 Alex Deucher 2012-04-25 08:59:57 UTC
Created attachment 60582 [details] [review]
possible fix

Does this patch help?
Comment 50 Michel Dänzer 2012-04-25 09:16:19 UTC
Comment on attachment 60582 [details] [review]
possible fix

Review of attachment 60582 [details] [review]:
-----------------------------------------------------------------

::: src/gallium/winsys/radeon/drm/radeon_drm_bo.c
@@ +221,4 @@
>              pipe_mutex_unlock(mgr->bo_va_mutex);
>              return offset;
>          }
> +        if (waste < hole->size && (hole->size - waste) >= size) {

AFAICT the 'if (offset >= (hole->offset + hole->size))' test further up is a roundabout way of saying 'if (waste >= hole->size)', so I'm afraid this won't have any effect.
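
To spell the equivalence out, here is a sketch that assumes offset is computed as hole->offset plus waste, where waste is the padding needed to reach the requested alignment. The struct and helper names are stand-ins for illustration, not the real winsys code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a free range in the VA manager. */
struct sketch_va_hole {
    uint64_t offset;
    uint64_t size;
};

/* Padding needed to round hole_offset up to alignment
 * (alignment == 0 means no alignment requirement). */
static uint64_t sketch_waste(uint64_t hole_offset, uint64_t alignment)
{
    uint64_t rem;

    if (alignment == 0)
        return 0;
    rem = hole_offset % alignment;
    return rem ? alignment - rem : 0;
}

/* Early test as quoted: the aligned offset lands at or past the
 * end of the hole, so the hole is skipped. */
static bool sketch_skip_by_offset(const struct sketch_va_hole *hole,
                                  uint64_t alignment)
{
    uint64_t offset = hole->offset + sketch_waste(hole->offset, alignment);

    return offset >= hole->offset + hole->size;
}

/* Same condition after subtracting hole->offset from both sides. */
static bool sketch_skip_by_waste(const struct sketch_va_hole *hole,
                                 uint64_t alignment)
{
    return sketch_waste(hole->offset, alignment) >= hole->size;
}
```

Since both helpers always agree, any hole that reaches the patched line has already passed waste < hole->size, which is why the added check cannot change the outcome.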
Comment 51 Michel Dänzer 2012-04-26 11:50:28 UTC
Does the Mesa patch series at http://lists.freedesktop.org/archives/mesa-dev/2012-April/021211.html help?

Beware that it's only lightly tested, and I'll be away now for a long weekend. If there's a problem with the patches, I'll look into it next week.
Comment 52 Alexandre Demers 2012-04-26 14:19:41 UTC
(In reply to comment #51)
> Does the Mesa patch series at
> http://lists.freedesktop.org/archives/mesa-dev/2012-April/021211.html help?
> 
> Beware that it's only lightly tested, and I'll be away now for a long weekend.
> If there's a problem with the patches, I'll look into it next week.

No, it doesn't. But it's not worse either.
Comment 53 Alexandre Demers 2012-04-26 19:41:37 UTC
(In reply to comment #46)
> Does this kernel patch help?
> http://lists.freedesktop.org/archives/dri-devel/2012-April/022037.html

No, it doesn't.
Comment 54 Alexandre Demers 2012-05-04 21:21:18 UTC
On latest git (3cd7bee48f7caf7850ea64d40f43875d4c975507), in src/gallium/drivers/r600/r600_hw_context.c, on line 194, shouldn't it be:
- int offset
+ unsigned offset

Also, at line 1259, I'm not quite sure why it is shifted by 2. Most of the time, offset is shifted by 8. Just looking through the code to see if something could have been missed...
Comment 55 Michel Dänzer 2012-05-07 03:07:07 UTC
(In reply to comment #54)
> On latest git (3cd7bee48f7caf7850ea64d40f43875d4c975507), in
> src/gallium/drivers/r600/r600_hw_context.c, on line 194, shouldn't it be:
> - int offset
> + unsigned offset

That might be slightly better, but it doesn't really matter. It's the offset from the start of the MMIO aperture, so it would only matter if the register aperture grew beyond 2GB, which we're almost 5 orders of magnitude short of. Very unlikely.


> Also, at line 1259, I'm not quite sure why it is shifted by 2. Most of the
> time, offset is usually shifted by 8.

It's just converting offset from units of 32 bits to bytes.
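
A tiny sketch of that conversion, with a made-up helper name: one 32-bit unit is 4 bytes, so a left shift by 2 is a multiplication by 4.

```c
#include <stdint.h>

/* Convert an offset counted in 32-bit units (dwords) to bytes.
 * Shifting left by 2 multiplies by 4, the size of one dword. */
static uint32_t sketch_dwords_to_bytes(uint32_t offset_dw)
{
    return offset_dw << 2;
}
```

For example, a dword offset of 0x100 becomes a byte offset of 0x400.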


> Just looking through the code to see if something could have been missed...

Right now it would be most useful to track down why radeon_bomgr_find_va / radeon_bomgr_force_va ends up returning the offset the kernel complains about.
Comment 56 Alexandre Demers 2012-05-30 11:18:28 UTC
(In reply to comment #55)
> (In reply to comment #54)
> > On latest git (3cd7bee48f7caf7850ea64d40f43875d4c975507), in
> > src/gallium/drivers/r600/r600_hw_context.c, on line 194, shouldn't it be:
> > - int offset
> > + unsigned offset
> 
> That might be slightly better, but it doesn't really matter. It's the offset
> from the start of the MMIO aperture, so it would only matter if the register
> aperture grew beyond 2GB, which we're almost 5 orders of magnitude short of.
> Very unlikely.
> 
> 
> > Also, at line 1259, I'm not quite sure why it is shifted by 2. Most of the
> > time, offset is usually shifted by 8.
> 
> It's just converting offset from units of 32 bits to bytes.
> 
> 
> > Just looking through the code to see if something could have been missed...
> 
> Right now it would be most useful to track down why radeon_bomgr_find_va /
> radeon_bomgr_force_va ends up returning the offset the kernel complains about.

What do you suggest? I'll be playing with kernel 3.5-rc1 when available, but I don't think that will fix it. Is there or could there be a way to track what's going on with a debug switch or something similar?
Comment 57 Alexandre Demers 2012-06-03 18:26:12 UTC
Now running kernel 3.5-rc1 with latest mesa, drm, ddx and still locking the GPU. As always, easy to reproduce by running piglit r600 tests.
Comment 58 Alexandre Demers 2012-06-05 19:18:10 UTC
I noticed a different clue that could help track down the bug: when X doesn't completely freeze, there is a backtrace in .xsession-error. So I'm attaching both dmesg and the xsession snippet related to this crash.
Comment 59 Alexandre Demers 2012-06-05 19:19:18 UTC
Created attachment 62618 [details]
dmesg related to the xsession-error file

This dmesg happened with the next attachment: xsession-error.txt
Comment 60 Alexandre Demers 2012-06-05 19:20:47 UTC
Created attachment 62619 [details]
snippet when gnome-shell is able to fall bak on its feet

Snippet from when gnome-shell is able to fall back on its feet. Should be used with the last dmesg submitted.
Comment 61 Christian König 2012-06-06 02:15:10 UTC
Please also try this patch:
http://lists.freedesktop.org/archives/dri-devel/2012-June/023735.html

It doesn't fix anything rendering related, but instead fixes a deadlock introduced with the vm patch. It isn't the complete solution to the problem, but it might still be an improvement.

Christian.
Comment 62 Alexandre Demers 2012-06-06 15:37:07 UTC
(In reply to comment #61)
> Please also try this patch:
> http://lists.freedesktop.org/archives/dri-devel/2012-June/023735.html
> 
> It doesn't fix anything rendering related, but instead fixes a deadlock
> introduced with the vm patch It isn't the complete solution of the problem it
> might still be an improvement.
> 
> Christian.

Thanks Christian. I just tested the patch and it still fails. Running piglit on r600.test hangs, kills Xorg, restarts without any 3D support and produces the following:

-----
dmesg:

[   44.640434] retire_capture_urb: 1 callbacks suppressed
[   64.193666] radeon 0000:01:00.0: bo ffff88021b1d2400 va 0x0180D000 conflict with (bo ffff880221d00400 0x0180D000 0x0180E000)
[   64.242569] radeon 0000:01:00.0: bo ffff880221d1dc00 va 0x0184E000 conflict with (bo ffff8802135ac800 0x0184E000 0x0184F000)
[   64.369362] radeon 0000:01:00.0: bo ffff880222126800 va 0x01841000 conflict with (bo ffff88021b3b4400 0x01841000 0x01842000)
[   64.832098] radeon 0000:01:00.0: bo ffff88021352dc00 va 0x01859000 conflict with (bo ffff880222c42800 0x01859000 0x0185B000)
[   65.486230] EXT4-fs (sdc2): warning: maximal mount count reached, running e2fsck is recommended
[   65.540929] EXT4-fs (sdc2): mounted filesystem with ordered data mode. Opts: (null)
[   69.016383] radeon 0000:01:00.0: bo ffff880221d1e000 va 0x0402D000 conflict with (bo ffff880221fc5000 0x0402D000 0x0402E000)
[   69.017579] radeon 0000:01:00.0: bo ffff880221d1b400 va 0x0404D000 conflict with (bo ffff880206061400 0x0404D000 0x0404E000)
[  471.209470] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[  471.209482] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000001ee7 last fence id 0x0000000000001ee4)
[  471.708793] radeon 0000:01:00.0: GPU lockup CP stall for more than 10500msec
[  471.708803] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000001ee5)
[  471.708812] radeon 0000:01:00.0: failed to get a new IB (-35)
[  471.708818] [drm:radeon_cs_ib_chunk] *ERROR* Failed to get ib !
[  471.712988] radeon 0000:01:00.0: GPU softreset 
[  471.712996] radeon 0000:01:00.0:   GRBM_STATUS=0xB3703828
[  471.713001] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x24000007
[  471.713006] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x3D000007
[  471.713012] radeon 0000:01:00.0:   SRBM_STATUS=0x200206C0
[  471.713017] radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_ADDR   0x00000000
[  471.713023] radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00000000
[  471.713029] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  471.713035] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x07070010
[  471.862829] radeon 0000:01:00.0: Wait for MC idle timedout !
[  471.862831] radeon 0000:01:00.0:   GRBM_SOFT_RESET=0x0000DF7B
[  471.862933] radeon 0000:01:00.0:   GRBM_STATUS=0x00003828
[  471.862934] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000007
[  471.862936] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000007
[  471.862937] radeon 0000:01:00.0:   SRBM_STATUS=0x200206C0
[  471.863938] radeon 0000:01:00.0: GPU reset succeed
[  472.044573] radeon 0000:01:00.0: Wait for MC idle timedout !
[  472.202790] radeon 0000:01:00.0: Wait for MC idle timedout !
[  472.204511] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[  472.204582] radeon 0000:01:00.0: WB enabled
[  472.204584] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880221964c00
[  472.204586] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000040000c04 and cpu addr 0xffff880221964c04
[  472.204587] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000040000c08 and cpu addr 0xffff880221964c08
[  472.387014] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8500)=0xCAFEDEAD)
[  472.387015] [drm:cayman_resume] *ERROR* cayman startup failed on resume
[  472.406246] radeon 0000:01:00.0: ffff88021c7d3800 unpin not necessary
[  472.406260] radeon 0000:01:00.0: ffff88021c7d3c00 unpin not necessary
[  472.407518] radeon 0000:01:00.0: GPU softreset 
[  472.407525] radeon 0000:01:00.0:   GRBM_STATUS=0xA0003828
[  472.407530] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000007
[  472.407536] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000007
[  472.407541] radeon 0000:01:00.0:   SRBM_STATUS=0x200206C0
[  472.407546] radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_ADDR   0x00000000
[  472.407552] radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00000000
[  472.407557] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  472.407562] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x07070010
[  472.577076] radeon 0000:01:00.0: Wait for MC idle timedout !
[  472.577080] radeon 0000:01:00.0:   GRBM_SOFT_RESET=0x0000DF7B
[  472.577183] radeon 0000:01:00.0:   GRBM_STATUS=0x00003828
[  472.577185] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000007
[  472.577186] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000007
[  472.577188] radeon 0000:01:00.0:   SRBM_STATUS=0x200206C0
[  472.578190] radeon 0000:01:00.0: GPU reset succeed
[  472.756629] radeon 0000:01:00.0: Wait for MC idle timedout !
[  472.912577] radeon 0000:01:00.0: Wait for MC idle timedout !
[  472.914304] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[  472.914377] radeon 0000:01:00.0: WB enabled
[  472.914380] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880221964c00
[  472.914382] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000040000c04 and cpu addr 0xffff880221964c04
[  472.914383] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000040000c08 and cpu addr 0xffff880221964c08
[  473.094478] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8500)=0xCAFEDEAD)
[  473.094480] [drm:cayman_resume] *ERROR* cayman startup failed on resume
[  539.092664] retire_capture_urb: 1 callbacks suppressed

-----
xsession-errors

Tracker-Message: Setting up monitor for changes to config file:'/home/dema1701/.config/tracker/tracker-miner-fs.cfg'
Tracker-Message: Setting up monitor for changes to config file:'/home/dema1701/.config/tracker/tracker-store.cfg'
Starting log:
  File:'/home/dema1701/.local/share/tracker/tracker-miner-fs.log'
** Message: applet now removed from the notification area
** (process:1189): WARNING **: Trying to register gtype 'GMountMountFlags' as enum when in fact it is of type 'GFlags'
** (process:1189): WARNING **: Trying to register gtype 'GDriveStartFlags' as enum when in fact it is of type 'GFlags'
** (process:1189): WARNING **: Trying to register gtype 'GSocketMsgFlags' as enum when in fact it is of type 'GFlags'
Tracker-Message: Setting up monitor for changes to config file:'/home/dema1701/.config/tracker/tracker-store.cfg'
Starting log:
  File:'/home/dema1701/.local/share/tracker/tracker-store.log'
radeon: Failed to allocate a buffer:
radeon:    size      : 256 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:865 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy
Mesa: User error: GL_OUT_OF_MEMORY in glTexSubImage2D
radeon: Failed to allocate a buffer:
radeon:    size      : 2560 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:865 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy
radeon: Failed to allocate a buffer:
radeon:    size      : 2560 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:865 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy
radeon: Failed to allocate a buffer:
radeon:    size      : 256 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:865 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy
Window manager warning: Failed to load theme "Ambiance": Failed to find a valid file for theme Ambiance

** Message: applet now embedded in the notification area
** Message: Stopping registered applet secret agent because GNOME Shell is running
radeon: Failed to allocate a buffer:
radeon:    size      : 256 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:865 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy
Mesa: User error: GL_OUT_OF_MEMORY in glTexSubImage2D
radeon: Failed to allocate a buffer:
radeon:    size      : 2816 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:865 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy
Window manager warning: CurrentTime used to choose focus window; focus window may not be correct.
Window manager warning: Got a request to focus the no_focus_window with a timestamp of 0.  This shouldn't happen!
Window manager warning: Log level 16: STACK_OP_ADD: window 0x2600002 already in stack
Window manager warning: Log level 16: STACK_OP_ADD: window 0x2600002 already in stack
Window manager warning: Log level 16: STACK_OP_ADD: window 0x2600002 already in stack
Window manager warning: Log level 16: STACK_OP_ADD: window 0x1e00002 already in stack
Window manager warning: Log level 16: STACK_OP_ADD: window 0x1e00002 already in stack
Window manager warning: Log level 16: STACK_OP_ADD: window 0x2600002 already in stack
Window manager warning: Log level 16: STACK_OP_ADD: window 0x2600002 already in stack
Window manager warning: Log level 16: STACK_OP_ADD: window 0x2600002 already in stack
** Message: Active session changed
** Message: Active session changed

(gnome-settings-daemon:1120): color-plugin-WARNING **: Done switch to new account, reload devices
** Message: Active session changed
** Message: Active session changed
** Message: Active session changed
gnome-session[1085]: Gdk-WARNING: gnome-session: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
(gnome-settings-daemon:1120): Gdk-WARNING **: gnome-settings-daemon: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
(gnome-screensaver:1191): Gdk-WARNING **: gnome-screensaver: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
(evolution-alarm-notify:1196): Gdk-WARNING **: evolution-alarm-notify: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
(gnome-shell-calendar-server:1227): Gdk-WARNING **: gnome-shell-calendar-server: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
(gnome-terminal:1406): Gdk-WARNING **: gnome-terminal: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
(nautilus:1385): Gdk-WARNING **: nautilus: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
(nm-applet:1185): Gdk-WARNING **: nm-applet: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
applet.py: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting.

(deja-dup-monitor:1184): GVFS-RemoteVolumeMonitor-WARNING **: Owner :1.17 of volume monitor org.gtk.Private.UDisks2VolumeMonitor disconnected from the bus; removing drives/volumes/mounts
g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting.

Received signal:15->'Terminated'
g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting.
g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting.

Received signal:15->'Terminated'
OK

(tracker-miner-fs:1183): GVFS-RemoteVolumeMonitor-WARNING **: Owner :1.17 of volume monitor org.gtk.Private.UDisks2VolumeMonitor disconnected from the bus; removing drives/volumes/mounts
(tracker-miner-fs:1183): GLib-GIO-CRITICAL **: Error while sending AddMatch() message: The connection is closed
(tracker-miner-fs:1183): GLib-GIO-CRITICAL **: Error while sending AddMatch() message: The connection is closed
(tracker-miner-fs:1183): GLib-GIO-CRITICAL **: Error while sending AddMatch() message: The connection is closed

OK

radeon: The kernel rejected CS, see dmesg for more information.
Window manager warning: Log level 16: gnome-shell: Fatal IO error 16 (Device or resource busy) on X server :0.
Comment 63 Alexandre Demers 2012-07-10 00:21:56 UTC
Now running latest drm-next just in case. Still the same error, but with a little something new: with the regular kernel, once the GPU has crashed, it stays that way. With the drm-next branch, it loops. Attaching some files in a moment.

I just started Gnome Shell, then opened a terminal window and launched piglit r600 tests.

I'm pretty sure (dmesg):
[   66.238981] radeon 0000:01:00.0: bo ffff88020f46bc00 va 0x0183B000 conflict with (bo ffff88021b65d000 0x0183B000 0x0183C000)
[   66.271373] radeon 0000:01:00.0: bo ffff880222cc9400 va 0x01814000 conflict with (bo ffff880221a50800 0x01814000 0x01815000)
[   66.334540] radeon 0000:01:00.0: bo ffff880222b70000 va 0x01809000 conflict with (bo ffff8802230a9000 0x01809000 0x0180A000)

corresponds to (.xsession-error):

radeon: Failed to allocate a buffer:
radeon:    size      : 256 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:869 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy
Mesa: User error: GL_OUT_OF_MEMORY in glTexSubImage
radeon: Failed to allocate a buffer:
radeon:    size      : 256 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:869 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy
radeon: Failed to allocate a buffer:
radeon:    size      : 256 bytes
radeon:    alignment : 256 bytes
radeon:    domains   : 2
EE r600_texture.c:869 r600_texture_get_transfer - failed to create temporary texture to hold untiled copy

Then (dmesg):

[  196.710933] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[  196.710946] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000675 last fence id 0x000000000000066c)
[  196.711129] radeon 0000:01:00.0: couldn't schedule ib
[  196.711239] radeon 0000:01:00.0: couldn't schedule ib
[  196.711805] radeon 0000:01:00.0: couldn't schedule ib
[  196.715732] radeon 0000:01:00.0: couldn't schedule ib
[  196.715975] radeon 0000:01:00.0: couldn't schedule ib
[  196.716362] radeon 0000:01:00.0: couldn't schedule ib
[  196.716627] radeon 0000:01:00.0: couldn't schedule ib
[  196.718012] radeon 0000:01:00.0: couldn't schedule ib
[  196.718262] radeon 0000:01:00.0: couldn't schedule ib
[  196.718480] radeon 0000:01:00.0: couldn't schedule ib
[  196.718985] radeon 0000:01:00.0: couldn't schedule ib
[  196.920396] radeon 0000:01:00.0: couldn't schedule ib
[  196.920703] radeon 0000:01:00.0: couldn't schedule ib
[  196.921084] radeon 0000:01:00.0: couldn't schedule ib
[  196.921318] radeon 0000:01:00.0: couldn't schedule ib
[  196.921558] radeon 0000:01:00.0: couldn't schedule ib
[  196.921898] radeon 0000:01:00.0: couldn't schedule ib
[  196.952350] radeon 0000:01:00.0: couldn't schedule ib
[  196.952386] [drm:radeon_cs_ib_chunk] *ERROR* Failed to schedule IB !
[  196.952439] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  196.952494] IP: [<ffffffffa050080d>] radeon_fence_ref+0xd/0x40 [radeon]
[  196.952531] PGD 221dc4067 PUD 2228ff067 PMD 0 
[  196.952556] Oops: 0000 [#1] PREEMPT SMP 
[  196.952579] CPU 1 
[  196.952617] Modules linked in: fuse snd_usb_audio snd_usbmidi_lib snd_rawmidi powernow_k8 snd_seq_device radeon ttm joydev snd_hda_codec_hdmi ppdev evdev pwc snd_hda_codec_realtek r8712u(C) r8169 mperf parport_pc parport sp5100_tco usb_storage uas drm_kms_helper drm videobuf2_vmalloc videobuf2_memops hid_logitech_dj pcspkr processor snd_hda_intel snd_hda_codec i2c_algo_bit mii hid_generic videobuf2_core videodev media wmi kvm_amd snd_hwdep snd_pcm snd_page_alloc snd_timer psmouse i2c_piix4 usbhid firewire_ohci hid serio_raw i2c_core firewire_core k10temp kvm microcode crc_itu_t snd edac_core button soundcore edac_mce_amd ext4 crc16 jbd2 mbcache pata_acpi sr_mod sd_mod cdrom pata_atiixp ata_generic ohci_hcd ahci libahci libata ehci_hcd usbcore scsi_mod usb_common
[  196.952957] 
[  196.952969] Pid: 715, comm: Xorg Tainted: G         C   3.5.0-rc4-VANILLA-46957-g74da01d #1 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[  196.953044] RIP: 0010:[<ffffffffa050080d>]  [<ffffffffa050080d>] radeon_fence_ref+0xd/0x40 [radeon]
[  196.953092] RSP: 0018:ffff8802230e9b48  EFLAGS: 00010286
...


and it loops.
Comment 64 Alexandre Demers 2012-07-10 00:22:55 UTC
Created attachment 64052 [details]
dmesg drm-next

dmesg with latest drm-next branch
Comment 65 Alexandre Demers 2012-07-10 00:23:46 UTC
Created attachment 64053 [details]
xsession with drm-next

.xsession with drm-next branch
Comment 66 Alexandre Demers 2012-07-23 18:49:17 UTC
(In reply to comment #37)
> (In reply to comment #36)
> > I know I can run a 3.2 kernel, I know I can compile a different version or
> > bisect or submit patches, I know I can switch from Gnome Shell to another
> > window manager without fancy effects or that I can disable options if I follow
> > your advise. But this is not accessible to the average user.
> 
> You can run an older mesa release as well.  It's probably a better as a mesa
> knob than a kernel knob.
> 
> > 
> > Please, consider another option for the average users that will use compiled
> > code available soon.
> 
> We can add a mesa option if we aren't able to get this fixed in time for the
> next mesa release, but for now I'd prefer to leave it enabled otherwise most
> users will just disable it and not test the current code which won't help in
> getting it fixed.

So it's been a while now with no improvement (even with the proposed patches or when running a drm-next kernel). Could we add this flag now so it will be possible to disable VM for cayman if wanted? This way, people will still use VM by default, but those encountering this problem will be able to use their card without this code locking it up. They will also be able to re-enable VM to test for any improvement or regression. Nobody's losing anything. I'll be able to test other commits and new features by running programs and piglit tests, and once in a while I'll test the VM code (or any patches or fixes the devs suggest to me).
Comment 67 Alexandre Demers 2012-07-24 06:53:33 UTC
Created attachment 64585 [details] [review]
Adding an environment variable to disable VM if wanted

By setting R600_VM=0, we disable the virtual address space code path. By default, the path will still be enabled and used. However, if set to 0, it will prevent some cards (mostly CAYMAN it seems) from locking up or crashing because of the VM code. It is a workaround until we figure out why it is locking up.
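
For illustration only, here is a minimal sketch of how such a switch could be read. This is not the attached patch: `parse_vm_flag` and `flag_from_env` are hypothetical names, and the rule "unset or empty keeps the default, 0 disables" is an assumption based on the description above. Only `getenv` and `strcmp` from the C standard library are used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: interpret the raw value of the variable.
 * Unset or empty keeps the default; "0" disables; anything else enables. */
static bool parse_vm_flag(const char *value, bool default_enabled)
{
    if (value == NULL || value[0] == '\0')
        return default_enabled;
    return strcmp(value, "0") != 0;
}

/* Hypothetical wrapper: look the variable up in the environment. */
static bool flag_from_env(const char *name, bool default_enabled)
{
    assert(name != NULL);
    return parse_vm_flag(getenv(name), default_enabled);
}
```

A caller would then do something like `use_vm = flag_from_env("R600_VM", true);`, so the VM path stays on unless the user explicitly exports R600_VM=0.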
Comment 68 Alexandre Demers 2012-07-25 18:11:54 UTC
I was thinking about it yesterday: is it possible that we are not tracking something in the virtual address spaces that we should be? That could explain why we are getting messages like "radeon 0000:01:00.0: bo ffff880212cb7000 va 0x00C26000 conflict with (bo ffff880222cc9400 0x00C26000 0x00C27000)" and so on.
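
To make the message concrete, here is a rough sketch of the kind of overlap test that would produce such a conflict. It is not the kernel's code: `struct va_range` and `va_ranges_overlap` are hypothetical, and I am assuming the two numbers in parentheses are the start and end of the existing mapping, treated as a half-open range [start, end).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open range [start, end) in the GPU virtual address space. */
struct va_range {
    uint64_t start;
    uint64_t end;
};

/* Two half-open ranges overlap when each one starts before the other ends. */
static bool va_ranges_overlap(struct va_range a, struct va_range b)
{
    assert(a.start < a.end && b.start < b.end);
    return a.start < b.end && b.start < a.end;
}
```

Under that assumption, a new bo asking for 0x00C26000 collides with the mapping at 0x00C26000..0x00C27000, while a request starting at 0x00C27000 would not.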
Comment 69 Alexandre Demers 2012-07-27 18:26:19 UTC
(In reply to comment #67)
> Created attachment 64585 [details] [review] [review]
> Adding an environment variable to disable VM if wanted
> 
> By setting R600_VM=0, we disable the virtual address space code path. By
> default, the path will still be enabled and used. However, if set to 0, it will
> prevent some cards (mostly CAYMAN it seems) from locking up or crashing because
> of the VM code. It is a work around until we figure out why it is locking.

Please, if someone could review and commit if possible.

Thank you.
Alexandre Demers
Comment 70 Alex Deucher 2012-07-31 15:10:38 UTC
Does this kernel patch help?
http://lists.freedesktop.org/archives/dri-devel/2012-July/025704.html
Comment 71 Anthony Waters 2012-08-01 01:18:00 UTC
I have been having this same issue with respect to rendering regressions, and I have also experienced the error relating to va conflicts. I investigated it a bit and I think the cause of the rendering regression is this: when a va is freed through radeon_bomgr_free_va and subsequently reused in radeon_bomgr_find_va, the GPU may not be done with the memory yet, so it gets overwritten before the GPU has finished with it.

I experimented with this a bit: by not reusing any va_holes in radeon_bomgr_find_va, the rendering regression goes away, at the expense of continually eating up the memory. So I looked for a way to make sure the va was only freed when it wasn't in use any more, and it turns out that worked as well.

To test this, I placed a call to radeon_bo_wait before the call to radeon_bomgr_free_va within radeon_bo_destroy. In radeon_drm_bo.c, the code looks something like this:
    if (mgr->va) {
        radeon_bo_wait(bo, RADEON_USAGE_READWRITE);
        radeon_bomgr_free_va(mgr, bo->va, bo->va_size);
    }

It causes busy waiting currently and could be improved by tracking the destroyed bos that need to be freed from va when they are not busy, if this is ultimately the way to solve it.
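
As a rough sketch of the non-blocking variant described above (not a patch against Mesa), destroyed buffers could be queued and their va released only once the GPU is known to be done with them. Everything here is hypothetical: `struct va_reclaimer`, `va_defer_free` and `va_reclaim` are made-up names, the fixed-size queue is a simplification, and a monotonically increasing fence number stands in for "the bo is no longer busy" (fence wrap-around is ignored).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Hypothetical record of a destroyed buffer whose va may still be in flight. */
struct pending_va {
    uint64_t va;
    uint64_t size;
    uint64_t last_fence;   /* fence signalled once the GPU is done with it */
};

struct va_reclaimer {
    struct pending_va pending[MAX_PENDING];
    size_t count;
};

/* Queue a range instead of freeing it right away. Returns false if full. */
static bool va_defer_free(struct va_reclaimer *r, uint64_t va,
                          uint64_t size, uint64_t last_fence)
{
    assert(r != NULL && size > 0);
    if (r->count == MAX_PENDING)
        return false;
    r->pending[r->count++] = (struct pending_va){ va, size, last_fence };
    return true;
}

/* Release every queued range whose fence has signalled and keep the rest.
 * release() stands in for returning the range to the va allocator. */
static size_t va_reclaim(struct va_reclaimer *r, uint64_t completed_fence,
                         void (*release)(uint64_t va, uint64_t size))
{
    size_t released = 0, kept = 0;

    assert(r != NULL);
    for (size_t i = 0; i < r->count; i++) {
        if (r->pending[i].last_fence <= completed_fence) {
            if (release != NULL)
                release(r->pending[i].va, r->pending[i].size);
            released++;
        } else {
            r->pending[kept++] = r->pending[i];
        }
    }
    r->count = kept;
    return released;
}
```

The idea would be to call `va_defer_free` from the destroy path and `va_reclaim` at some convenient point (for example before searching for a free va), so nothing has to busy-wait.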
Comment 72 Anthony Waters 2012-08-01 02:09:17 UTC
Also, I believe the source of "radeon 0000:01:00.0: bo ffff8802ea5ec800 va 0x038EC000 conflict with (bo ffff8803eb464000 0x038EC000 0x038ED000)" is due to a race condition.  It appears that after the call to radeon_bomgr_free_va the virtual address space is in a state where user space sees that freed address as available but the kernel hasn't been notified yet, until the drmIoctl call I assume.

I'm not sure if there are multiple threads allowed to interact with radeon_drm_bo.c, but if there are then the user space can request a virtual address that hasn't been freed yet by the kernel.

I moved the call to radeon_bomgr_free_va to be after the drmIoctl in radeon_bo_destroy. I'll run through the piglit tests to see if it fixes the errors.
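
A toy model of the ordering problem, just to illustrate the reasoning. It is single-threaded and not the real driver code: `struct va_slot` and both `destroy_*` functions are hypothetical, and the kernel side is reduced to one flag. The check in the middle of each function marks the window in which another thread could run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a single virtual address from both sides. */
struct va_slot {
    bool kernel_mapped;   /* kernel still has a bo mapped at this address */
    bool user_free;       /* user-space allocator considers it available */
};

/* The address must never look free to user space while still mapped. */
static bool slot_is_consistent(const struct va_slot *s)
{
    assert(s != NULL);
    return !(s->user_free && s->kernel_mapped);
}

/* Free in user space first, then tell the kernel.
 * Returns whether the slot was consistent in between. */
static bool destroy_free_first(struct va_slot *s)
{
    s->user_free = true;
    bool ok_in_window = slot_is_consistent(s);
    s->kernel_mapped = false;
    return ok_in_window;
}

/* Tell the kernel first, then free in user space. */
static bool destroy_unmap_first(struct va_slot *s)
{
    s->kernel_mapped = false;
    bool ok_in_window = slot_is_consistent(s);
    s->user_free = true;
    return ok_in_window;
}
```

In this model only the second ordering keeps the slot consistent during the window, which is the reason for moving the free after the ioctl.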
Comment 73 Jerome Glisse 2012-08-01 02:14:12 UTC
Created attachment 65013 [details] [review]
Free va early in the kernel

Diagnosis was kind of obvious, but it just popped into my mind that ttm was sometimes delaying the deletion. So the attached kernel patch should fix the issue without any mesa patch.
Comment 74 Jerome Glisse 2012-08-01 02:15:19 UTC
The way i build my kernel must hide this latency i guess...
Comment 75 Alexandre Demers 2012-08-01 03:26:16 UTC
This is all good news. So I'll test both patches and I'll see if it also fixes the thing for me. Awaters (I don't know your name, you'll have to tell me), if what you found fixes my 6 month old problem, I'll offer you a beer (or whatever you'd like to drink). I'll be back soon with some news (good I hope).
Comment 76 Jerome Glisse 2012-08-01 03:40:53 UTC
Created attachment 65014 [details] [review]
Free va earyl

This one builds (minor typo fixed)
Comment 77 Alexandre Demers 2012-08-01 16:09:21 UTC
(In reply to comment #70)
> Does this kernel patch help?
> http://lists.freedesktop.org/archives/dri-devel/2012-July/025704.html

No, it doesn't (well not about the present bug).
Comment 78 Jerome Glisse 2012-08-01 16:59:03 UTC
(In reply to comment #77)
> (In reply to comment #70)
> > Does this kernel patch help?
> > http://lists.freedesktop.org/archives/dri-devel/2012-July/025704.html
> 
> No, it doesn't (well not about the present bug).

This patch is mostly for the lockup situation; it does not affect the va issue. My patch should definitely fix the va issue. Alex's patch might fix the lockup on top of that.
Comment 79 Alexandre Demers 2012-08-01 18:06:40 UTC
(In reply to comment #78)
> (In reply to comment #77)
> > (In reply to comment #70)
> > > Does this kernel patch help?
> > > http://lists.freedesktop.org/archives/dri-devel/2012-July/025704.html
> > 
> > No, it doesn't (well not about the present bug).
> 
> This patch is mostly for the lockup situation, it does not affect the va issue.
> My patch should definitely fix va issue. Alex patch might fix lockup on top of
> that.

OK, so I should try them together then. I should be able to test it tonight. As of this morning with Alex's patch only, va issue was still reported but I had no time to test it further for lockups.
Comment 80 Anthony Waters 2012-08-02 00:41:23 UTC
I tried both patches, the one from comment 76 and the one from comment 70, neither fixed the issue with the rendering regression or the va conflict.
Comment 81 Anthony Waters 2012-08-02 00:47:06 UTC
Created attachment 65051 [details] [review]
fixes to wait on the bo and to free the va after the kernel

These are the changes I made to make it work in mesa. The first change, inserting radeon_bo_wait, was so that the va wouldn't be immediately reallocated for a different bo while the GPU was still using it, which caused the rendering regression.

The second change was to move the freeing of the va in mesa to after the kernel has freed it, so that the kernel's list would be updated before mesa's list.

Hopefully this provides more insight to the issue/cause
Comment 82 Alexandre Demers 2012-08-02 01:02:42 UTC
(In reply to comment #80)
> I tried both patches, the one from comment 76 and the one from comment 70,
> neither fixed the issue with the rendering regression or the va conflict.

Same here, I was rebuilding my kernel from scratch just in case.
Comment 83 Jerome Glisse 2012-08-02 03:46:43 UTC
How do you trigger the va issue ? piglit ? I was not able to reproduce. It's kind of painful to debug in the dark.
Comment 84 Anthony Waters 2012-08-03 01:28:25 UTC
I randomly saw it when I was playing a game of Warcraft 3, the terrain textures would blink.  I'll check the piglit tests and mesa demos to see if I can reproduce the issue with them.
Comment 85 Anthony Waters 2012-08-03 02:07:06 UTC
I found a demo that has the issue: the program 'reflect', in the src/demo folder of the mesa demos repository. After I start it up and press 's' to see the stencil buffer, the white plane blinks continuously. Applying the patch 'fixes to wait on the bo and to free the va after the kernel' removes the blinking, as does disabling va through the variable ws->info.r600_virtual_address.

The other issue with the kernel reporting a va conflict is going to be a little harder to reproduce because it appears to be caused by a race condition.

I'll still look for other demos that have the issue.
Comment 86 Alexandre Demers 2012-08-03 03:00:00 UTC
(In reply to comment #85)
> I found a demo that has the issue, in the demos repository for mesa within the
> src/demo folder the program 'reflect'.  After I start it up and press 's' to
> see the stencil buffer the white plan blinks continuously.  Applying the patch
> 'fixes to wait on the bo and to free the va after the kernel' removes the
> blinking, as does disabling va through the variable
> ws->info.r600_virtual_address.
> 
> The other issue with the kernel reporting a va conflict is going to be a little
> harder to reproduce because it appears to be caused by a race condition.
> 
> I'll still look for other demos that have the issue.

Yes, I understand it can be hard for you to track, Jerome. For the va issue, on my side, it is as simple as logging in to KDE or Gnome 3. Before logging in, there is no va error in dmesg. Once I'm in, there are usually 3 or sometimes 6 errors (they are written in blocks of 3, so I suspect it tries a first time, fails for some reason, and tries again a second time).

I also experience the issue when watching some movies. With Anthony's patch, va issues are gone and I watched a couple of shows yesterday without any problem. Before the patch, it would blink and get corrupted after about 16 minutes and then crash. So, Anthony has put a finger on something.

However, I also run piglit tests and some other applications like RendererFeatTest64 (which is an application released before Amnesia came out to test OpenGL performance, if I recall correctly). With Anthony's patch, I'm still able to lock the display every time (if I play music at the same time, it will continue to play but I won't be able to change terminal, even if sometimes my mouse pointer can still be moved). RendererFeatTest64 will always lock at the same test, but that is not the case for the piglit tests (even if it often happens at the same test or near it).

I'm installing a freshly compiled kernel 3.5.0 with both Alex's and your patches (by the way, they can't be applied on the latest drm-next branch) and I'll tell you if I'm still experiencing the lockups. I'll also try Anthony's test to see if I get the same results (blinking without his patch, OK with it).
Comment 87 Alexandre Demers 2012-08-03 06:03:39 UTC
(In reply to comment #86)
> (In reply to comment #85)
> > I found a demo that has the issue, in the demos repository for mesa within the
> > src/demo folder the program 'reflect'.  After I start it up and press 's' to
> > see the stencil buffer the white plan blinks continuously.  Applying the patch
> > 'fixes to wait on the bo and to free the va after the kernel' removes the
> > blinking, as does disabling va through the variable
> > ws->info.r600_virtual_address.
> > 
> > The other issue with the kernel reporting a va conflict is going to be a little
> > harder to reproduce because it appears to be caused by a race condition.
> > 
> > I'll still look for other demos that have the issue.
> 
> Yes, I understand it can be hard to track for you Jerome. Well for the va
> issue, on my side, it is as simple as logging in KDE or Gnome 3. Before logging
> in, there is no va error in dmesg. Once I'm in, there are usually 3 or
> sometimes 6 errors (they are written in block of 3, so I suspect it tries a
> first time and for some reason it fails and try again second time).
> 
> I also experience the issue when watching some movies. With Anthony's patch, va
> issues are gone and I watched a couple of shows yesterday without any problem.
> Before the patch, it would blink and get corrupted after about 16 minutes and
> then crash. So, Anthony has put a finger on something.
> 
> However, I also run piglit tests and some other applications like
> RendererFeatTest64 (which is an application released before Amnesia went out to
> test OpenGL performances if I recall recorrectly). With Anthony's patch, I'm
> still able to lock the display everytime (if I play music at the same time, it
> will continue to play but I won't be able to change terminal even if sometimes
> my mouse pointer can still be moved). RendererFeatTest64 will always lock at
> the same test, but it is not the same for piglit tests (even if it happens
> often at the same or near the same).
> 
> I'm installing a freshly compiled kernel 3.5.0 with both Alex and your patches
> (by the way, they can't be applied on latest drm-next branch) and I'll tell you
> if I'm still experiencing the lockups. I'll also try Anthony's test to see if I
> get the same results (blinking without his patch, OK with it)

Well it still locks up even with the patches. I also tested the reflect demo and I don't have any blink without Anthony's patch, but we may be experiencing different symptoms of the same problem.
Comment 88 Michel Dänzer 2012-08-03 07:47:17 UTC
(In reply to comment #86)
> So, Anthony has put a finger on something.

Yes, I think something like Anthony's patch is needed due to asynchronous GPU processing: when the userspace driver assigns virtual address space for a new BO, the GPU may not have finished processing command streams using previous BOs occupying that same virtual address space.

However, the userspace driver shouldn't wait synchronously for the BO to go idle when destroying it but should instead defer destruction (or at least the freeing of the virtual address space) until it notices the BO has become idle.


> With Anthony's patch, I'm still able to lock the display everytime

And these lockups do not happen when not using virtual address space? Can you provide the dmesg output of the GPU reset for such a lockup? Ideally from a single piglit test reproducing it.
Comment 89 Alexandre Demers 2012-08-03 08:05:07 UTC
(In reply to comment #88)
> (In reply to comment #86)
> > So, Anthony has put a finger on something.
> 
> Yes, I think something like Anthony's patch is needed due to asynchronous GPU
> processing: when the userspace driver assigns virtual address space for a new
> BO, the GPU may not have finished processing command streams using previous BOs
> occupying that same virtual address space.
> 
> However, the userspace driver shouldn't wait synchronously for the BO to go
> idle when destroying it but should instead defer destruction (or at least the
> freeing of the virtual address space) until it notices the BO has become idle.
> 
> 
> > With Anthony's patch, I'm still able to lock the display everytime
> 
> And these lockups do not happen when not using virtual address space? Can you
> provide the dmesg output of the GPU reset for such a lockup? Ideally from a
> single piglit test reproducing it.

Nope, no lockup without va (I may only be lucky though if the bug is there but only shown when using va). I'll try to find a way to get dmesg... It has been a problem since the start for that part, but I may be able to use another computer to log in remotely. May take a couple of days to do though.
Comment 90 Michel Dänzer 2012-08-03 08:13:03 UTC
(In reply to comment #89)
> Nope, no lockup without va (I may only be lucky though if the bug is there but
> only shown when using va).

That's indeed possible: Using virtual address space will catch out of bounds memory access that may otherwise go unnoticed.

So, I think in this report we should focus on the rendering regression(s), and track the lockups in other reports.
Comment 91 Christian König 2012-08-03 12:58:04 UTC
I just fixed a memory leak in radeonsi, and it looks like I'm hitting the same problem now.

Do I understand it correctly that the userspace VM manager is releasing allocations too early and not waiting for async buffer use to end?

That should be easy to fix.
Comment 92 Michel Dänzer 2012-08-03 13:21:22 UTC
(In reply to comment #91)
> I just fixed a memory leak in radeonsi, and it looks like I'm hitting the same
> problem now.

Ah cool, you found it already. :)

> Do I understand it correctly that the userspace VM manager is releasing
> allocations to early and not waiting for async buffer use to end?

That's my working theory.
Comment 93 Michel Dänzer 2012-08-03 13:26:32 UTC
(In reply to comment #92)
> > Do I understand it correctly that the userspace VM manager is releasing
> > allocations to early and not waiting for async buffer use to end?
> 
> That's my working theory.

Also, if it wasn't the case, I don't see how Anthony's patch could make a difference.
Comment 94 Jerome Glisse 2012-08-03 14:39:59 UTC
(In reply to comment #88)
> (In reply to comment #86)
> > So, Anthony has put a finger on something.
> 
> Yes, I think something like Anthony's patch is needed due to asynchronous GPU
> processing: when the userspace driver assigns virtual address space for a new
> BO, the GPU may not have finished processing command streams using previous BOs
> occupying that same virtual address space.
> 
> However, the userspace driver shouldn't wait synchronously for the BO to go
> idle when destroying it but should instead defer destruction (or at least the
> freeing of the virtual address space) until it notices the BO has become idle.
> 
> 
> > With Anthony's patch, I'm still able to lock the display everytime
> 
> And these lockups do not happen when not using virtual address space? Can you
> provide the dmesg output of the GPU reset for such a lockup? Ideally from a
> single piglit test reproducing it.

No, Anthony's patch should not be needed. Once userspace calls the kernel to destroy a bo, userspace should be able to reuse the va right away, even if the kernel is delaying the bo destruction. My patch should fix the va issue. Note that the patch attached here has a bug, but it should not affect the va handling.
Comment 95 Alexandre Demers 2012-08-03 14:51:56 UTC
(In reply to comment #90)
> (In reply to comment #89)
> > Nope, no lockup without va (I may only be lucky though if the bug is there but
> > only shown when using va).
> 
> That's indeed possible: Using virtual address space will catch out of bounds
> memory access that may otherwise go unnoticed.
> 
> So, I think in this report we should focus on the rendering regression(s), and
> track the lockups in other reports.

OK, I'll open another bug for the lockups. This one will be renamed for va issues and rendering regression. I'll wait until tonight to make changes to see if someone objects.
Comment 96 Christian König 2012-08-03 15:03:52 UTC
Created attachment 65093 [details] [review]
Possible fix.

It's hard and inefficient to solve this problem completely in the kernel.

Since we patch the VM table synchronously but use it asynchronously, we will always end up needing to wait for the GPU to finish using a bo before patching in the new VA.

Please take a look at the attached patch; it should fix the issue nicely in userspace.
Comment 97 Marek Olšák 2012-08-03 15:20:12 UTC
(In reply to comment #96)
> Created attachment 65093 [details] [review]
> Possible fix.
> 
> It's hard and uneffecient to solve this problem completely in the kernel.
> 
> Since we patch the VM table synchronously, but use it asynchronously we will
> always end up needing to wait for a bo use by the GPU to end before patching in
> the new VA.
> 
> Please take a look at the attached patch it should fix the issue nicely in
> userspace.

Please use the radeon_bo_is_busy function. Calling DRM_RADEON_GEM_BUSY directly is not reliable because of the thread offloading of the CS ioctl. The same applies to any other kernel queries and commands which depend on the CS ioctl.
Comment 98 Jerome Glisse 2012-08-03 16:54:04 UTC
Created attachment 65095 [details] [review]
Properly protect virtual address

Properly protect virtual address

Patch against Linus master, gonna attach patch against 3.5 next.
Comment 99 Jerome Glisse 2012-08-03 16:56:00 UTC
Created attachment 65096 [details] [review]
Properly protect virtual address

Properly protect virtual address

Patch against Linus master, gonna attach patch against 3.5 next.

Sorry previous one was wrong one.
Comment 100 Jerome Glisse 2012-08-03 16:59:41 UTC
Created attachment 65097 [details] [review]
Properly protect virtual address

Properly protect virtual address

Patch against Linus master, gonna attach patch against 3.5 next.

Again, sorry previous one was wrong one.
Comment 101 Jerome Glisse 2012-08-03 17:05:15 UTC
Created attachment 65098 [details] [review]
Properly protect virtual address against kernel 3.5

Same patch against 3.5
Comment 102 Jerome Glisse 2012-08-03 19:04:54 UTC
Created attachment 65101 [details] [review]
Properly protect virtual address kernel 3.5 v2

Updated
Comment 103 Jerome Glisse 2012-08-03 19:05:47 UTC
Created attachment 65102 [details] [review]
Properly protect virtual address v2

Against Linus master
Comment 104 Alexandre Demers 2012-08-03 19:44:34 UTC
(In reply to comment #103)
> Created attachment 65102 [details] [review]
> Properly protect virtual address v2
> 
> Against Linus master

I will test them later today. They should take care of the va issues, right? Probably nothing to do with lockups?
Comment 105 Jerome Glisse 2012-08-03 19:46:50 UTC
Well, for the va issue you also need the mesa patch from Christian. This patch mostly fixes the kernel side; it might help with the lockups, though here piglit still locks up hard with the latest mesa.
Comment 106 Anthony Waters 2012-08-04 02:05:34 UTC
I tried the patch from Christian in comment 96 atop of mesa git and the patch from Jerome in comment 102 atop of linux-3.5 and I no longer experience the rendering regression and I have not seen the va conflict error, thanks.
Comment 107 Alexandre Demers 2012-08-04 03:55:56 UTC
Tested with 3.6-rc1 and latest mesa with both respective patches. No va issue anymore.

However, lockups still happen with RendererFeatTest64: I tried to run some tests and my system locked up completely and restarted. This seems to be a different problem, though, not related to the va conflict issue. So I'll open a different bug for the lockups revealed by the same commit (as previously said, without virtual address space, it doesn't lock).
Comment 108 Alexandre Demers 2012-08-05 04:29:39 UTC
Oops, I've hit a va error again. I've been using my computer all day long, going from one window to another, using Flash on Openstreetmap and Google Map. The error could explain some lockups I've experienced. I hit the card's maximum memory from what I understand of the error. Should I put collected info here or under bug 53111?
Comment 109 Alexandre Demers 2012-08-05 04:34:02 UTC
(In reply to comment #108)
> Oops, I've hit a va error again. I've been using my computer all day long,
> going from one window to another, using Flash on Openstreetmap and Google Map.
> The error could explain some lockups I've experienced. I hit the card's maximum
> memory from what I understand of the error. Should I put collected info here or
> under bug 53111?

Here is the error message without any log for now. I'll wait to see if it should be tracked here:
[54804.656571] radeon 0000:01:00.0: offset 0x400000 is in reserved area 0x800000
[54805.166815] radeon 0000:01:00.0: bo ffff8800c227d800 va 0x02B00000 conflict with (bo ffff880202702400 0x02440000 0x03440000)
[54805.177976] radeon 0000:01:00.0: bo ffff8800c227b000 va 0x02C38000 conflict with (bo ffff880202702400 0x02440000 0x03440000)
[54805.178980] radeon 0000:01:00.0: bo ffff880061241400 va 0x02C38000 conflict with (bo ffff880202702400 0x02440000 0x03440000)
[54805.253953] radeon 0000:01:00.0: bo ffff88021b183800 va 0x00900000 conflict with (bo ffff8802222fc000 0x00900000 0x00901000)
[54806.900210] radeon 0000:01:00.0: va above limit (0x00100200 > 0x00100000)
[54806.927121] radeon 0000:01:00.0: va above limit (0x001000B0 > 0x00100000)
[54811.663812] radeon 0000:01:00.0: bo ffff880223631c00 va 0x01278000 conflict with (bo ffff88020270b000 0x01200000 0x01700000)
[54813.069082] radeon 0000:01:00.0: bo ffff88021b183800 va 0x00900000 conflict with (bo ffff8802222fc000 0x00900000 0x00901000)
[54813.075691] radeon 0000:01:00.0: bo ffff88007f002c00 va 0x00900000 conflict with (bo ffff8802222fc000 0x00900000 0x00901000)
[54813.075886] radeon 0000:01:00.0: bo ffff88007f002000 va 0x00900000 conflict with (bo ffff8802222fc000 0x00900000 0x00901000)
[54813.075961] gnome-shell[1025]: segfault at 50 ip 00007f8af5ebe019 sp 00007fff80159650 error 4 in r600_dri.so[7f8af5e53000+4b1000]
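
For readers of the log above, a small sketch of the kind of overlap test that would produce a "conflict with" message. This is an illustration, not the kernel's code: it assumes the two numbers in parentheses are the start and end of an existing mapping, and `struct va_range`, `va_range_make` and `va_ranges_overlap` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open virtual address range [start, end). */
struct va_range {
    uint64_t start;
    uint64_t end;
};

/* Build a range from a start address and a non-zero size. */
static struct va_range va_range_make(uint64_t start, uint64_t size)
{
    assert(size > 0 && start + size > start); /* no empty range, no wrap */
    return (struct va_range){ start, start + size };
}

/* Two half-open ranges overlap when each one starts before the
 * other one ends. */
static bool va_ranges_overlap(struct va_range a, struct va_range b)
{
    return a.start < b.end && b.start < a.end;
}
```

Under that assumption, a new mapping at 0x02B00000 falls inside the existing range 0x02440000 to 0x03440000, and a new mapping at 0x00900000 collides with the existing one-page range 0x00900000 to 0x00901000, which matches the messages logged.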
Comment 110 Christian König 2012-08-08 10:48:35 UTC
I just pushed a minor bugfix to mesa master that, in conjunction with Jerome's kernel patch, should eliminate the last VA issues.

Please retest it again.

Christian.
Comment 111 Alexandre Demers 2012-08-08 13:29:34 UTC
(In reply to comment #110)
> I just pushed a minor bugfix to mesa master, that in conjunction with Jeromes
> kernel patch should eliminate the last VA issues.
> 
> Please retest it again.
> 
> Christian.

You must be referring to commit 8c44e5a144009a03c20befa6468d19d41c802795. Do I still need to apply your previous patch as well (attachment 65093)? I'll try it tonight, but it may be a bit more complicated to reproduce: I'll have to play for a while until it does or doesn't trigger the last reported vm error.
Comment 112 Alexandre Demers 2012-08-09 01:38:41 UTC
(In reply to comment #111)
> (In reply to comment #110)
> > I just pushed a minor bugfix to mesa master, that in conjunction with Jeromes
> > kernel patch should eliminate the last VA issues.
> > 
> > Please retest it again.
> > 
> > Christian.
> 
> You must be refering to commit 8c44e5a144009a03c20befa6468d19d41c802795. Do I
> still need to apply your previous patch also (attachment 65093)? I'll try it
> tonight, but it may take a bit more complicated to reproduce, I'll have to play
> for a while until it does or doesn't trigger the last reported vm error.

Well, I tested it with your previous patch on top of 68bccc40f55aee7f4af8eb64b15a95f0b49d6a17 and it was not working properly. First, I had to modify your patch to apply on top of latest git. After applying it, compiling and installing, I rebooted and I was unable to load the login screen. I removed the patch, rebuilt a clean mesa from 68bccc40f55aee7f4af8eb64b15a95f0b49d6a17, installed and relaunched Xorg and... I was able to log in. So I'm now testing latest mesa (68bccc40f55aee7f4af8eb64b15a95f0b49d6a17) with kernel 3.6-rc1 + Jerome's patch. I should be able to tell you soon if it works. Meanwhile, if I should have applied something different, let me know.

To Jerome: I could test your [PATCH] drm/radeon: delay virtual address destruction to bo destruction. But first, I want to make sure Christian's patch does what it should do.
Comment 113 Alexandre Demers 2012-08-09 05:20:04 UTC
(In reply to comment #112)
> (In reply to comment #111)
> > (In reply to comment #110)
> > > I just pushed a minor bugfix to mesa master, that in conjunction with Jeromes
> > > kernel patch should eliminate the last VA issues.
> > > 
> > > Please retest it again.
> > > 
> > > Christian.
> > 
> > You must be refering to commit 8c44e5a144009a03c20befa6468d19d41c802795. Do I
> > still need to apply your previous patch also (attachment 65093)? I'll try it
> > tonight, but it may take a bit more complicated to reproduce, I'll have to play
> > for a while until it does or doesn't trigger the last reported vm error.
> 
> Well, I tested it with your previous patch on top of
> 68bccc40f55aee7f4af8eb64b15a95f0b49d6a17 and it was not working properly.
> First, I had to modify your patch to apply on top of latest git. After applying
> it, compiling and installing, I rebooted and I was unable to load the logging
> screen. I removed the patch, rebuilt a clean mesa from
> 68bccc40f55aee7f4af8eb64b15a95f0b49d6a17, installed and relaunched Xorg and...
> I was able to log in. So I'm now testing latest mesa
> (68bccc40f55aee7f4af8eb64b15a95f0b49d6a17) with kernel 3.6-rc1 + Jerome's
> patch. I should be able to tell you soon if it works. Meanwhile, if I should
> have applied something different, let me know.
> 
> To Jerome: I could test your [PATCH] drm/radeon: delay virtual address
> destruction to bo destruction. But first, I want to make sure Christian's patch
> does what it should do.

Bug still there with latest mesa git (without your previous patch as explained previously).
Aug  9 01:03:29 Xander kernel: [13308.165749] radeon 0000:01:00.0: offset 0x400000 is in reserved area 0x800000
Aug  9 01:03:29 Xander kernel: [13308.232245] radeon 0000:01:00.0: bo ffff880223646400 va 0x02B00000 conflict with (bo ffff8801e3edc400 

Locked and reset without any notice.
Comment 114 Alex Deucher 2012-08-09 14:14:04 UTC
Please test mesa from git (no additional patches) and make sure your kernel has this patch:
http://lists.freedesktop.org/archives/dri-devel/2012-August/026015.html
(no other kernel patches).
Comment 115 Alexandre Demers 2012-08-09 14:56:34 UTC
(In reply to comment #114)
> Please test mesa from git (no additional patches) and make sure your kernel has
> this patch:
> http://lists.freedesktop.org/archives/dri-devel/2012-August/026015.html
> (no other kernel patches).

That looks pretty much like what I was testing with (latest mesa git without any patch, as explained in comment 112), where I had already applied Jerome's patch v2 (no other patch). v4 doesn't seem to have any major differences (according to the comments for v3 and v4). Nevertheless, I'll recompile kernel 3.6-rc1 with patch v4 just in case, though I would be surprised if that made a difference from the test/error reported in comment 113.
Comment 116 Alexandre Demers 2012-08-09 15:24:41 UTC
(In reply to comment #113)
> (In reply to comment #112)
> > (In reply to comment #111)
> > > (In reply to comment #110)
> > > > I just pushed a minor bugfix to mesa master, that in conjunction with Jeromes
> > > > kernel patch should eliminate the last VA issues.
> > > > 
> > > > Please retest it again.
> > > > 
> > > > Christian.
> > > 
> > > You must be refering to commit 8c44e5a144009a03c20befa6468d19d41c802795. Do I
> > > still need to apply your previous patch also (attachment 65093)? I'll try it
> > > tonight, but it may take a bit more complicated to reproduce, I'll have to play
> > > for a while until it does or doesn't trigger the last reported vm error.
> > 
> > Well, I tested it with your previous patch on top of
> > 68bccc40f55aee7f4af8eb64b15a95f0b49d6a17 and it was not working properly.
> > First, I had to modify your patch to apply on top of latest git. After applying
> > it, compiling and installing, I rebooted and I was unable to load the logging
> > screen. I removed the patch, rebuilt a clean mesa from
> > 68bccc40f55aee7f4af8eb64b15a95f0b49d6a17, installed and relaunched Xorg and...
> > I was able to log in. So I'm now testing latest mesa
> > (68bccc40f55aee7f4af8eb64b15a95f0b49d6a17) with kernel 3.6-rc1 + Jerome's
> > patch. I should be able to tell you soon if it works. Meanwhile, if I should
> > have applied something different, let me know.
> > 
> > To Jerome: I could test your [PATCH] drm/radeon: delay virtual address
> > destruction to bo destruction. But first, I want to make sure Christian's patch
> > does what it should do.
> 
> Bug still there with latest mesa git (without your previous patch as explained
> previously).
> Aug  9 01:03:29 Xander kernel: [13308.165749] radeon 0000:01:00.0: offset
> 0x400000 is in reserved area 0x800000
> Aug  9 01:03:29 Xander kernel: [13308.232245] radeon 0000:01:00.0: bo
> ffff880223646400 va 0x02B00000 conflict with (bo ffff8801e3edc400 
> 
> Locked and reset without any notice.

Two things I've noticed:
1- the error points directly at "offset 0x400000 is in reserved area 0x800000" since I applied Christian's and Jerome's patches, which is a different error from errors before patches.
2- the error only happens after a while, when switching between windows (under Gnome 3 in that case). I had to alt+tab and show my whole desktop (top left corner) many times before it happened. I played with my desktop all night long.

So, it's as if the pointer keeps increasing until it reaches its limit. Either we are not correctly releasing previous addresses (or we are forgetting some along the way), or we are not aware of every released address; in both cases we are pushed forward until we hit a wall.

And if someone could explain to me what this message and these addresses mean, I'd appreciate it. How is it possible that an offset of 0x400000 ends up in a reserved area allocated at 0x800000? We must not be offsetting from 0, obviously.
Comment 117 Alex Deucher 2012-08-09 18:50:34 UTC
(In reply to comment #116)
> 
> And if someone could explain me what this message/addresses means, I'd
> appreciate it. How is it possible that an offset of 0x400000 ends up in a
> reserved area allocated at 0x800000? We must not be offsetting from 0
> obviously.

The first 8 MB of the client's VM space are reserved for kernel use and not available for the client to use.  The client is not allowed to use an address below 0x800000.  If an address ends up there, the kernel flags it.  That's the message you are seeing.
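
As a rough illustration of that rule (a sketch, not the actual kernel check; `VA_RESERVED_SIZE` and `va_offset_allowed` are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The first 8 MB of the client's VM space are reserved for the kernel. */
#define VA_RESERVED_SIZE (8u << 20)

static_assert(VA_RESERVED_SIZE == 0x800000, "8 MB is 0x800000");

/* A client offset is acceptable only if it lies at or above the
 * reserved area; anything below it gets flagged. */
static bool va_offset_allowed(uint64_t offset)
{
    return offset >= VA_RESERVED_SIZE;
}
```

So "offset 0x400000 is in reserved area 0x800000" reads as: the requested offset 0x400000 is below the 0x800000 boundary, hence inside the reserved 8 MB. The second number is the size of the reserved area, not a base address.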
Comment 118 Alexandre Demers 2012-08-11 04:49:31 UTC
Reproduced again with exactly the setup Alex told me to use (kernel 3.6-rc1+Jerome's patch v4 and latest mesa containing Christian's fix). To reproduce, I clicked repeatedly on Activities on top left corner of Gnome shell until it locked:

Everything.log
---
Aug 11 00:23:08 Xander kernel: [92926.580673] radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000
Aug 11 00:23:08 Xander kernel: [92926.587281] [drm:radeon_cs_parser_relocs] *ERROR* gem object lookup failed 0x11
Aug 11 00:23:08 Xander kernel: [92926.587291] [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -2!
Aug 11 00:23:08 Xander kernel: [92926.597151] radeon 0000:01:00.0: offset 0x200000 is in reserved area 0x800000
Aug 11 00:23:18 Xander kernel: [92937.073091] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
Aug 11 00:23:18 Xander kernel: [92937.073105] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000009ea1d last fence id 0x000000000009ea1c)
Aug 11 00:23:18 Xander kernel: [92937.074236] radeon 0000:01:00.0: Saved 15 dwords of commands on ring 0.
Aug 11 00:23:18 Xander kernel: [92937.074243] radeon 0000:01:00.0: GPU softreset 
Aug 11 00:23:18 Xander kernel: [92937.074248] radeon 0000:01:00.0:   GRBM_STATUS=0xF5700828
Aug 11 00:23:18 Xander kernel: [92937.074253] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0xFC000001
Aug 11 00:23:18 Xander kernel: [92937.074258] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0xFC000001
Aug 11 00:23:18 Xander kernel: [92937.074263] radeon 0000:01:00.0:   SRBM_STATUS=0x20020FC0
Aug 11 00:23:18 Xander kernel: [92937.074269] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Aug 11 00:23:18 Xander kernel: [92937.074274] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
Aug 11 00:23:18 Xander kernel: [92937.074279] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008004
Aug 11 00:23:18 Xander kernel: [92937.074284] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80228647
Aug 11 00:23:18 Xander kernel: [92937.074289] radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_ADDR   0x00074124
Aug 11 00:23:18 Xander kernel: [92937.074294] radeon 0000:01:00.0:   VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00071001
Aug 11 00:23:18 Xander kernel: [92937.074300] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000021F1
Aug 11 00:23:18 Xander kernel: [92937.074305] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020A4004
Aug 11 00:23:19 Xander kernel: [92937.223150] radeon 0000:01:00.0: Wait for MC idle timedout !
Aug 11 00:23:19 Xander kernel: [92937.223152] radeon 0000:01:00.0:   GRBM_SOFT_RESET=0x0000DF7B
Aug 11 00:23:19 Xander kernel: [92937.223254] radeon 0000:01:00.0:   GRBM_STATUS=0x80103828
Aug 11 00:23:19 Xander kernel: [92937.223256] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x04000007
Aug 11 00:23:19 Xander kernel: [92937.223257] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x04000007
Aug 11 00:23:19 Xander kernel: [92937.223258] radeon 0000:01:00.0:   SRBM_STATUS=0x20020FC0
Aug 11 00:23:19 Xander kernel: [92937.223260] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Aug 11 00:23:19 Xander kernel: [92937.223262] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Aug 11 00:23:19 Xander kernel: [92937.223263] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Aug 11 00:23:19 Xander kernel: [92937.223264] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
Aug 11 00:23:19 Xander kernel: [92937.224266] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Aug 11 00:23:19 Xander kernel: [92937.230003] [drm] probing gen 2 caps for device 1022:9603 = 2/0
Aug 11 00:23:19 Xander kernel: [92937.230004] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
Aug 11 00:23:19 Xander kernel: [92937.388426] radeon 0000:01:00.0: Wait for MC idle timedout !
Aug 11 00:23:19 Xander kernel: [92937.546743] radeon 0000:01:00.0: Wait for MC idle timedout !
Aug 11 00:23:19 Xander kernel: [92937.548662] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
Aug 11 00:23:19 Xander kernel: [92937.548751] radeon 0000:01:00.0: WB enabled
Aug 11 00:23:19 Xander kernel: [92937.548754] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff88022332dc00
Aug 11 00:23:19 Xander kernel: [92937.548755] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000040000c04 and cpu addr 0xffff88022332dc04
Aug 11 00:23:19 Xander kernel: [92937.548757] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000040000c08 and cpu addr 0xffff88022332dc08
Aug 11 00:23:19 Xander kernel: [92937.752374] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
Aug 11 00:23:19 Xander kernel: [92937.752377] [drm:cayman_resume] *ERROR* cayman startup failed on resume

Could it be a previously hidden bug that the patches from Jerome and Christian dug up?
Comment 119 Michel Dänzer 2012-08-15 16:07:30 UTC
(In reply to comment #118)

Try the Mesa patches from http://lists.freedesktop.org/archives/mesa-dev/2012-August/025715.html .
Comment 120 Alexandre Demers 2012-08-16 00:38:42 UTC
(In reply to comment #119)
> (In reply to comment #118)
> 
> Try the Mesa patches from
> http://lists.freedesktop.org/archives/mesa-dev/2012-August/025715.html .

Testing right now.

If this doesn't fix the problem, may I suggest adding some debug output, enabled by an environment variable, to track what the vm_mgr is allocating, keeping and releasing?
Comment 121 Alexandre Demers 2012-08-16 15:35:45 UTC
(In reply to comment #120)
> (In reply to comment #119)
> > (In reply to comment #118)
> > 
> > Try the Mesa patches from
> > http://lists.freedesktop.org/archives/mesa-dev/2012-August/025715.html .
> 
> Testing right now.
> 
> May I suggest adding some debug info with an env variable switch to be able to
> track what the vm_mgr is doing, keeping and forgetting if this doesn't fix the
> problem or something similar?

I've been trying to reproduce the latest VA issue all evening without being able to. So even if this doesn't fix the problem for good, your patches help a lot. I'll continue to test tonight. Good to know your patches were committed this morning.

However, keep in mind I haven't tested anything for the other lockups (piglit tests and some other OpenGL apps).
Comment 122 Alex Deucher 2012-08-16 16:59:49 UTC
*** Bug 53291 has been marked as a duplicate of this bug. ***
Comment 123 Thomas Rohloff 2012-08-16 20:10:32 UTC
(In reply to comment #119)
> (In reply to comment #118)
> 
> Try the Mesa patches from
> http://lists.freedesktop.org/archives/mesa-dev/2012-August/025715.html .

Not sure if this is related or if I should open a new report, but since these patches I get this when I try to start compiz with GLAMOR acceleration: http://pastebin.com/WbxMT0V9 - before, I got the "conflict with" messages, and without GLAMOR I get (and got) no messages at all, but compiz loads slowly and the screen flickers while doing so.

P.S. Also the desktop is corrupted with GLAMOR. This is better since these patches but still there.
Comment 124 Thomas Rohloff 2012-08-16 21:03:39 UTC
And there are some random rendering issues that weren't there before the patches, like using the wrong texture.

Good: http://img713.imageshack.us/img713/492/mcgood.png
Bad: http://img96.imageshack.us/img96/6417/mcbad.png

Also water in the game flashes white (seems to choose the wrong texture sometimes in the animation, too) and sometimes the whole game screen flashes blue.
Comment 125 Alexandre Demers 2012-08-17 03:00:53 UTC
(In reply to comment #124)
> And there are some random rendering issues that wasn't there before the
> patches, like using the wrong texture.
> 
> Good: http://img713.imageshack.us/img713/492/mcgood.png
> Bad: http://img96.imageshack.us/img96/6417/mcbad.png
> 
> Also water in the game flashes white (seems to choose the wrong texture
> sometimes in the animation, too) and sometimes the whole game screen flashes
> blue.

I can't officially answer your question, but I think it should be tracked under a different bug since you are using Glamor. If I were you, I would create a new bug entry with a reference to this one.
Comment 126 Alexandre Demers 2012-08-17 03:18:27 UTC
Good news on my side: I have been unable to recreate the bug so far. So I went on to running piglit tests. Sadly, for that part, it still locks (now tracked under bug 53111).

I won't say for sure the vm problem is fixed, but if it's still there, latest patches helped a lot since I was able to run more than twice as long as usual without any problem.
Comment 127 Michel Dänzer 2012-08-17 07:26:52 UTC
(In reply to comment #126)
> I won't say for sure the vm problem is fixed, but if it's still there, latest
> patches helped a lot since I was able to run more than twice as long as usual
> without any problem.

Great! Resolving this bug as fixed.

Any other remaining issues, in particular Thomas' glamor issues, should be tracked in separate bug reports.
