Bug 53111 - [bisected] lockups since added support for virtual address space on cayman v11
Status: RESOLVED FIXED
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
 
Reported: 2012-08-04 04:00 UTC by Alexandre Demers
Modified: 2013-02-01 04:36 UTC

Attachments
dmesg of piglit r600.test crash (6.47 KB, text/plain)
2012-08-04 04:13 UTC, Anthony Waters
apitrace (38.02 KB, application/octet-stream)
2012-08-18 05:09 UTC, Alexandre Demers
bad rendering on test 7, where it used to lock (106.99 KB, image/jpeg)
2012-08-19 23:04 UTC, Alexandre Demers

Description Alexandre Demers 2012-08-04 04:00:01 UTC
When running RendererFeatTest64, it always locks at the same test. Lockups also happen when running piglit r600.test, locking always near the same test (sanity tests are OK). If we disable virtual address space as explained under bug 45018, no lockups happen.
Comment 1 Anthony Waters 2012-08-04 04:13:14 UTC
Created attachment 65108
dmesg of piglit r600.test crash

I also have the same issue. Here is the dmesg output of the crash I get when running the piglit test case r600.test. This is with virtual address space enabled and the patches from bug 45018 applied.
Comment 2 Alexandre Demers 2012-08-04 04:32:01 UTC
A small note for anyone who arrives here without having followed bug 45018:

Bisecting identified the following commit as culprit:

bb1f0cf3508630a9a93512c79badf8c493c46743 is the first bad commit
commit bb1f0cf3508630a9a93512c79badf8c493c46743
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Fri Dec 2 10:20:29 2011 -0500

    r600g: add support for virtual address space on cayman v11
Comment 3 Michel Dänzer 2012-08-06 13:27:48 UTC
FWIW, r600.tests should no longer be used; quick-driver.tests replaces it. I assume it still happens with the latter though, so here are some debugging tips:

For isolating a single piglit test that locks up, it may help to run piglit-run.py with -c 0 to prevent several tests from running in parallel.

For isolating the cause of a lockup, it may help to add some debugging output about virtual addresses to the r600g driver, and compare that to the fault address in the VM_CONTEXT1_PROTECTION_FAULT_ADDR register.
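
As a rough illustration of the kind of debugging output meant here, the sketch below formats the virtual address range of a buffer so it can later be compared with a fault address. The struct `dbg_bo` and the function `dbg_format_bo_va` are hypothetical names made up for this example; they are not part of r600g or the radeon winsys, and the right place to call such a helper in the driver is left open.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a driver buffer object; not an r600g type. */
struct dbg_bo {
    const char *name;  /* label chosen by whoever adds the debug output */
    uint64_t va;       /* start of the buffer's GPU virtual address range */
    uint64_t size;     /* size of the buffer in bytes */
};

/* Format one log line describing the VA range [va, va + size).
 * Returns the snprintf result (number of characters that were needed). */
static int dbg_format_bo_va(char *out, size_t len, const struct dbg_bo *bo)
{
    return snprintf(out, len,
                    "bo %s: va 0x%016" PRIx64 " - 0x%016" PRIx64,
                    bo->name, bo->va, bo->va + bo->size);
}
```

The formatted line could then be written with fprintf(stderr, ...) each time a buffer is mapped into the VM, giving a list of ranges to check the fault address against.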
Comment 4 Alexandre Demers 2012-08-06 14:42:39 UTC
(In reply to comment #3)
> FWIW, r600.tests should no longer be used in favour of quick-driver.tests. I
> assume it still happens with the latter though, so here's some debugging tips:
> 
> For isolating a single piglit test that locks up, it may help to run
> piglit-run.py with -c 0 to prevent several tests from running in parallel.
> 
> For isolating the cause of a lockup, it may help to add some debugging output
> about virtual addresses to the r600g driver, and compare that to the fault
> address in the VM_CONTEXT1_PROTECTION_FAULT_ADDR register.

Your info will be helpful for the piglit tests; I'll try that later. As for the debug calls, I'll let someone else propose a patch so that they end up in the right spot.
Comment 5 Alexandre Demers 2012-08-07 01:21:27 UTC
Tested running one piglit test at a time (thanks Michel) and it always locks on "texturing/depthstencil-render-miplevels 146 s=z24_s8_d=z32f_s8". It locks hard, resets, stays locked and usually restarts the computer.
Comment 6 Anthony Waters 2012-08-09 03:07:19 UTC
The fault address in the VM_CONTEXT1_PROTECTION_FAULT_ADDR register is less than the start of the virtual address area, unless that is due to the bug?
Comment 7 Michel Dänzer 2012-08-09 07:05:26 UTC
(In reply to comment #6)
> The fault address in the VM_CONTEXT1_PROTECTION_FAULT_ADDR register is less
> than the start of the virtual address area, unless that is due to the bug?

Sorry, should have mentioned that the address in VM_CONTEXT1_PROTECTION_FAULT_ADDR is shifted right by 12 bits (i.e. it's the page frame number).
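
To make the shift concrete: since the register holds a page frame number, the byte address is recovered by shifting left by 12 bits (4 KiB pages). The sketch below does that conversion and checks whether the result lies inside a given VA range. The function names and the assumption that the value is read as a 32-bit register are mine, for illustration only; nothing here is taken from the kernel or from r600g.

```c
#include <stdbool.h>
#include <stdint.h>

/* The register value is the fault address shifted right by 12 bits. */
#define FAULT_ADDR_SHIFT 12

/* Convert the page frame number read from the register to a byte address. */
static uint64_t fault_pfn_to_addr(uint32_t reg_value)
{
    return (uint64_t)reg_value << FAULT_ADDR_SHIFT;
}

/* Return true if the faulting page starts inside [va_start, va_start + size). */
static bool fault_in_range(uint32_t reg_value, uint64_t va_start, uint64_t size)
{
    uint64_t addr = fault_pfn_to_addr(reg_value);
    return addr >= va_start && addr - va_start < size;
}
```

For example, a register value of 0x800 corresponds to byte address 0x800000, so a raw register value can look far below the start of the VA area even when the fault lies inside it.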
Comment 8 Alexandre Demers 2012-08-17 03:49:31 UTC
Is there a way to use apitrace in combination with piglit? I'd like to trace the problematic test.
Comment 9 Michel Dänzer 2012-08-17 07:28:50 UTC
(In reply to comment #8)
> Is there a way to use apitrace in combination with piglit? I'd like to trace
> the problematic test.

The first step would be to reproduce the problem by manually running the problematic test from the piglit/bin directory. Then you should be able to apitrace it just like any other app.
Comment 10 Alexandre Demers 2012-08-18 05:08:00 UTC
Well, it seems running it through qapitrace doesn't lock. But running only this single test in a terminal does.

One thing though: when using qapitrace and looking up state, the framebuffer under surfaces is pretty much garbage whatever stage I look at. I don't know if this is expected from depthstencil-render-miplevels 146 s=z24_s8_d=z32f_s8.
Comment 11 Alexandre Demers 2012-08-18 05:09:01 UTC
Created attachment 65723
apitrace
Comment 12 Alexandre Demers 2012-08-19 22:22:24 UTC
I tried to trace RenderFeatTest (one of the other applications locking my system). It behaved as with the piglit test: it didn't crash. However, the rendering is corrupted where it locks when launched from a terminal. The trace is 75MB when compressed, if you want me to upload it somewhere.
Comment 13 Alexandre Demers 2012-08-19 23:03:48 UTC
(In reply to comment #12)
> I tried to trace RenderFeatTest (one of the other applications locking my
> system). It did as with the piglit test: it didn't crash. However, the
> rendering is corrupted where it locks when launched from a terminal. Trace is
> 75MB when compressed if you want me to upload it somewhere.

I forgot to say: it doesn't lock anymore at all. I should have written "... where it locked when launched from a terminal". It was locking at test 7. I'm attaching a screenshot from that test.

I'll bisect to see if I can find which commit "fixed" the lock.
Comment 14 Alexandre Demers 2012-08-19 23:04:33 UTC
Created attachment 65813
bad rendering on test 7, where it used to lock
Comment 15 Michel Dänzer 2012-08-20 14:59:34 UTC
(In reply to comment #10)
> Well, it seems running it through qapitrace doesn't lock.

The apitrace looks incomplete: it doesn't contain any actual rendering operations.
Comment 16 Alexandre Demers 2012-08-20 15:05:03 UTC
(In reply to comment #15)
> (In reply to comment #10)
> > Well, it seems running it through qapitrace doesn't lock.
> 
> The apitrace looks incomplete: it doesn't contain any actual rendering
> operations.

I'll rerun it at home tonight.
Comment 17 Alexandre Demers 2012-08-22 03:02:45 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > I tried to trace RenderFeatTest (one of the other applications locking my
> > system). It did as with the piglit test: it didn't crash. However, the
> > rendering is corrupted where it locks when launched from a terminal. Trace is
> > 75MB when compressed if you want me to upload it somewhere.
> 
> I forgot to say: it doesn't lock anymore at all. I should have written "...
> where it locked when launched from a terminal". It was locking at test 7. I'm
> attaching a screenshot from that test.
> 
> I'll bisect to see if I can find which commit "fixed" the lock.

I was not able to figure out the combination that fixed the thing. Well, let's focus on the piglit test that locks the beast.
Comment 18 Alexandre Demers 2012-08-22 05:32:58 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #10)
> > > Well, it seems running it through qapitrace doesn't lock.
> > 
> > The apitrace looks incomplete: it doesn't contain any actual rendering
> > operations.
> 
> I'll rerun it at home tonight.

You were right, I had missed a ";" between the arguments. Bam, locked. I was unable to retrieve a trace. Well, I may try to run it in debug mode to see where it stops later this week.
Comment 19 Alexandre Demers 2012-08-23 04:12:52 UTC
So about this locking piglit test (depthstencil-render-miplevels 146 s=z24_s8_d=z32f_s8), I've been able to track it down to:
line 218: 		piglit_report_result(PIGLIT_SKIP);

I don't know if we are supposed to be hitting this path, but either way, it seems piglit_report_result(PIGLIT_SKIP) locks. I suppose this function must be releasing resources before exiting, but something wrong is happening in there.

By the way, I'm now running kernel 3.6.0-rc3 with latest drm and mesa.
Comment 20 Michel Dänzer 2012-08-23 06:45:54 UTC
(In reply to comment #19)
> So about this locking piglit test (depthstencil-render-miplevels 146
> s=z24_s8_d=z32f_s8), I've been able to track it down to:
> line 218:         piglit_report_result(PIGLIT_SKIP);

How did you determine that? It's weird, I wouldn't expect a skipped test to produce any actual GPU rendering.
Comment 21 Alexandre Demers 2012-08-23 13:13:25 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > So about this locking piglit test (depthstencil-render-miplevels 146
> > s=z24_s8_d=z32f_s8), I've been able to track it down to:
> > line 218:         piglit_report_result(PIGLIT_SKIP);
> 
> How did you determine that? It's weird, I wouldn't expect a skipped test to
> produce any actual GPU rendering.

I used gdb and stepped through the code until it locked. It gets out at level 0, after going through:

 /**
 * Attach the proper miplevel of each texture to the framebuffer
 */
void
set_up_framebuffer_for_miplevel(int level)...

Before this call, there is a framebuffer initialization:
	GLuint fbo;
	glGenFramebuffers(1, &fbo);
	glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo);
	glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);

	for (int level = 0; level <= max_miplevel; ++level) {
		set_up_framebuffer_for_miplevel(level);
Comment 22 Alexandre Demers 2012-08-30 21:34:48 UTC
It seems Marek has more weight than me about lockups related to VM on Cayman (problem first reported as bug 45018). Patch by Marek to disable VM by default for Cayman: http://lists.freedesktop.org/archives/mesa-dev/2012-August/026590.html

If you have any news on the subject, feel free to add info in the current bug. To Marek: are you experiencing the same first lockup in the piglit tests as reported in comment 10? I'm sure I have seen a previous comment from another dev who was also experiencing lockups on Cayman, but I can't find who that was.
Comment 23 Alex Deucher 2012-08-30 21:42:20 UTC
(In reply to comment #22)
> It seems Marek has more weight than me about lockups related to VM on
> Cayman (problem first reported as bug 45018).

Well, we were hoping to get this resolved in time for 9.0, but as it's getting pretty close now, it's probably better to disable it at least for the 9.0 release.  The problem is, when it's disabled, there's not much chance of anyone testing it, so it's not likely to ever get properly fixed.  Also, SI only supports VM, so we can't disable VM for SI.
Comment 24 Alexandre Demers 2012-08-30 22:38:36 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > It seems Marek has more weight than me about lockups related to VM on
> > Cayman (problem first reported as bug 45018).
> 
> Well, we were hoping to get this resolved in time for 9.0, but as it's getting
> pretty close now, it's probably better to disable it at least for the 9.0
> release.  The problem is, when it's disabled, there's not much chance of anyone
> testing it, so it's not likely to ever get properly fixed.  Also, SI only
> supports VM, so we can't disable VM for SI.

Meanwhile, since the fixes committed for bug 45018 helped me a lot, I'll gladly keep VM activated to test it. After all, my desktop is now usable: I've been running for 3 days without any lockup, while I was previously only able to run for a couple of hours before restarting. So, if you have any patches you want tested that could help, ask me.
Comment 25 Alexandre Demers 2012-09-06 17:19:09 UTC
I'll have to confirm it later today by disabling VM, but I'm pretty sure I experienced a lock (can be reproduced every time) related to VM when testing with Unigine Tropics. It loaded, the demo began and then it locked when the island appeared at the horizon (I guess that's what it is since it was the first time I was running this demo).

From the retrieved logs, I could only identify a GPU lock with a reset that failed to reset rings properly.
Comment 26 Anthony Waters 2012-10-03 02:47:07 UTC
As I mentioned in bug 55416, I received a new lockup due to VA being enabled; however, the lockups only started occurring after commit c8b06dccff9cb89e20378664f3cbc202876a180f. Disabling VA also prevents the lockups, so it may be similar to what was mentioned in comment 25. I will check if that piglit test still locks up for me.
Comment 27 Alexandre Demers 2012-12-16 21:31:50 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > (In reply to comment #19)
> > > So about this locking piglit test (depthstencil-render-miplevels 146
> > > s=z24_s8_d=z32f_s8), I've been able to track it down to:
> > > line 218:         piglit_report_result(PIGLIT_SKIP);
> > 
> > How did you determine that? It's weird, I wouldn't expect a skipped test to
> > produce any actual GPU rendering.
> 
> I used gdb and stepped through the code until it locked. It gets out at level 0,
> after going through:
> 
>  /**
>  * Attach the proper miplevel of each texture to the framebuffer
>  */
> void
> set_up_framebuffer_for_miplevel(int level)...
> 
> Before this call, there is a framebuffer initialization:
> 	GLuint fbo;
> 	glGenFramebuffers(1, &fbo);
> 	glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo);
> 	glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
> 
> 	for (int level = 0; level <= max_miplevel; ++level) {
> 		set_up_framebuffer_for_miplevel(level);

It seems that with latest mesa, drm, xf86 and kernel 3.7.0-rc7-71633-g3b6b59b from drm-next, it doesn't fail on this test anymore. It does lock however on a different one. I'll debug it and see where it locks.
Comment 28 Alexandre Demers 2013-02-01 04:35:57 UTC
I'm closing this bug: the original triggering application no longer locks up, and many things have changed since then. I'll reopen it if for any reason I can still point to this exact problem, but I think it was more of an umbrella bug, with many others hidden under it.

