Bug 82828

Summary: Regression: Crash in 3Dmark2001
Product: Mesa
Reporter: Stefan Dösinger <stefandoesinger>
Component: Drivers/Gallium/r300
Assignee: Default DRI bug account <dri-devel>
Status: VERIFIED FIXED
Severity: blocker
Priority: high
CC: andabata12, cwabbott0, kenneth, pavel.ondracka, pedretti.fabio
Version: git
Keywords: bisected, regression
Hardware: Other
OS: All
Attachments: Backtrace
full backtrace from piglit crash
debugging patch
another debugging patch
RADEON_DEBUG=fp,vp output
proposed fix

Description Stefan Dösinger 2014-08-19 21:11:31 UTC
Created attachment 104921 [details]
Backtrace

Since commit e78a01d5e6f77e075fe667a0f0ccb10d89c0dd58 3DMark2001 crashes in the Nature test when it is run in Wine with the ARB shader backend on r300g.

The 3DMark2001 download can be found here: http://www.futuremark.com/benchmarks/legacy

I used Wine 1.7.22 for testing, but I am certain that the bug can be reproduced with newer Wine releases because the ARB shader code hasn't been changed in a while. My GPU is a Radeon X1600.

To reproduce the bug you have to enable the ARB shader backend by starting Wine's regedit and setting HKEY_CURRENT_USER/Software/Wine/Direct3D/UseGLSL to disabled. Create the Direct3D key and UseGLSL string value if needed.

I do not see the crash on r600g (Tested with 9a071e33, Radeon HD 5770).

A backtrace is attached. The backtrace was generated with Mesa 1c4f141a.
Comment 1 Tom Stellard 2014-08-20 17:01:25 UTC
*** Bug 82852 has been marked as a duplicate of this bug. ***
Comment 2 José Jorge 2014-08-26 13:06:37 UTC
I can confirm the same bug on an ATI X600 Mobile with Mesa 10.3.0 RC1.
Flightgear, at least, triggers it.
Comment 3 Pavel Ondračka 2014-08-29 16:43:32 UTC
Yeah, it's affecting multiple apps, and I also see over 100 crashing piglit tests after this commit on my RV530, so this should be easy to reproduce even without Wine.
Comment 4 Connor Abbott 2014-08-29 20:06:45 UTC
All the crashes are in the same place, right?

Can you run it under gdb and print out n2 and the contents of g->nodes[n].adjacency_list (it's an array with g->nodes[n].adjacency_count elements) after the segfault? How about the former before the ra_simplify() call in the ra_allocate() call that's segfaulting? (If you don't know how to do this, see http://stackoverflow.com/questions/2956889/how-to-set-a-counter-for-a-gdb-breakpoint)

I'm guessing that it's segfaulting because n2 is some bogus value. n2 comes from the adjacency_list, which is something generated before the allocator actually runs by code I didn't touch and then never modified afterward, and the code that's segfaulting wasn't modified by the commit in question, so the two most likely options I see are that either this is exposing a bug somewhere else (like in r300g) or the new ra_simplify() is somehow corrupting the adjacency_list. I don't know how r300g sets up the register conflicts and register classes, though, so I can't guess why it works fine on i965 but fails for r300g.
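
One way to test the "bogus n2" guess without stepping through gdb is to scan a node's adjacency list for entries that are not valid node numbers. The following is only a minimal sketch: `demo_node` and `first_bogus_neighbour` are hypothetical names, and the struct models just the two fields mentioned above, not the allocator's real node type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for one interference-graph node.
 * Only the two fields discussed above are modelled. */
struct demo_node {
    unsigned int *adjacency_list;  /* indices of neighbouring nodes */
    unsigned int adjacency_count;  /* number of valid entries in the list */
};

/* Returns the position of the first neighbour index that is not a valid
 * node number, or -1 if every entry is below node_count. */
static int first_bogus_neighbour(const struct demo_node *node,
                                 unsigned int node_count)
{
    assert(node != NULL);
    for (unsigned int i = 0; i < node->adjacency_count; i++) {
        if (node->adjacency_list[i] >= node_count)
            return (int)i;
    }
    return -1;
}
```

If this returns -1 for every node, the list itself is intact and the problem has to be in whatever is stored in the neighbouring node.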
Comment 5 Pavel Ondračka 2014-08-30 07:19:25 UTC
Created attachment 105451 [details]
full backtrace from piglit crash

(In reply to comment #4)
> All the crashes are in the same place, right?
> 
> Can you run it under gdb and print out n2 and the contents of
> g->nodes[n].adjacency_list (it's an array with g->nodes[n].adjacency_count
> elements) after the segfault? How about the former before the ra_simplify()
> call in the ra_allocate() call that's segfaulting? (If you don't know how to
> do this, see
> http://stackoverflow.com/questions/2956889/how-to-set-a-counter-for-a-gdb-
> breakpoint)
> 
> I'm guessing that it's segfaulting because n2 is some bogus value. n2 comes
> from the adjacency_list, which is something generated before the allocator
> actually runs by code I didn't touch and then never modified afterward, and
> the code that's segfaulting wasn't modified by the commit in question, so
> the two most likely options I see are that either this is exposing a bug
> somewhere else (like in r300g) or the new ra_simplify() is somehow
> corrupting the adjacency_list. I don't know how r300g sets up the register
> conflicts and register classes, though, so I can't guess why it works fine
> on i965 but fails for r300g.

OK, I'm not sure I know what I'm doing, but I picked one random crashing piglit test:

/bin/shader_runner tests/shaders/glsl-fs-loop-continue.shader_test -auto

Program received signal SIGSEGV, Segmentation fault.
0xb76391a9 in ra_select (g=0x80c2058) at ../../src/mesa/program/register_allocate.c:525
525			BITSET_TEST(g->regs->regs[r].conflicts, g->nodes[n2].reg)) {

print n2
$2 = 0

print n
$7 = 1

print g->nodes[n].adjacency_count
$1 = 3

print g->nodes[n].adjacency_list
$3 = (unsigned int *) 0x80c1b58

print g->nodes[n].adjacency_list[0]
$4 = 1

print g->nodes[n].adjacency_list[1]
$5 = 0

print g->nodes[n].adjacency_list[2]
$6 = 2

full backtrace attached.
Comment 6 Connor Abbott 2014-08-30 17:54:27 UTC
(In reply to comment #5)
> Created attachment 105451 [details]
> full backtrace from piglit crash
> 
> (In reply to comment #4)
> > All the crashes are in the same place, right?
> > 
> > Can you run it under gdb and print out n2 and the contents of
> > g->nodes[n].adjacency_list (it's an array with g->nodes[n].adjacency_count
> > elements) after the segfault? How about the former before the ra_simplify()
> > call in the ra_allocate() call that's segfaulting? (If you don't know how to
> > do this, see
> > http://stackoverflow.com/questions/2956889/how-to-set-a-counter-for-a-gdb-
> > breakpoint)
> > 
> > I'm guessing that it's segfaulting because n2 is some bogus value. n2 comes
> > from the adjacency_list, which is something generated before the allocator
> > actually runs by code I didn't touch and then never modified afterward, and
> > the code that's segfaulting wasn't modified by the commit in question, so
> > the two most likely options I see are that either this is exposing a bug
> > somewhere else (like in r300g) or the new ra_simplify() is somehow
> > corrupting the adjacency_list. I don't know how r300g sets up the register
> > conflicts and register classes, though, so I can't guess why it works fine
> > on i965 but fails for r300g.
> 
> OK, so not sure if I know what I'm doing but selecting one random crashing
> piglit test
> 
> /bin/shader_runner tests/shaders/glsl-fs-loop-continue.shader_test -auto
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0xb76391a9 in ra_select (g=0x80c2058) at
> ../../src/mesa/program/register_allocate.c:525
> 525			BITSET_TEST(g->regs->regs[r].conflicts, g->nodes[n2].reg)) {
> 
> print n2
> $2 = 0
> 
> print n
> $7 = 1
> 
> print g->nodes[n].adjacency_count
> $1 = 3
> 
> print g->nodes[n].adjacency_list
> $3 = (unsigned int *) 0x80c1b58
> 
> print g->nodes[n].adjacency_list[0]
> $4 = 1
> 
> print g->nodes[n].adjacency_list[1]
> $5 = 0
> 
> print g->nodes[n].adjacency_list[2]
> $6 = 2
> 
> full backtrace attached.

Can you print out the value of g->nodes[n2].reg? I think it may be NO_REG (0xffffffff), even though it shouldn't be (if a node is not on the stack, then it's supposed to be assigned a register already).
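
To illustrate why a NO_REG value would crash at that line: a bitset test indexes a word array by the bit number, so a register number of 0xffffffff lands far outside any realistically sized conflict set. This is a rough sketch with made-up helper names (`DEMO_NO_REG`, `demo_bit_word`, `demo_reg_is_testable`); it assumes the bitset is an array of `unsigned int` words and does not reproduce Mesa's actual BITSET macros.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_NO_REG (~0U)  /* assumed to match NO_REG (0xffffffff) */
#define DEMO_WORD_BITS (sizeof(unsigned int) * CHAR_BIT)

/* Index of the bitset word that would be read when testing bit `bit`. */
static size_t demo_bit_word(unsigned int bit)
{
    return bit / DEMO_WORD_BITS;
}

/* True if `reg` can safely be used as a bit index into a conflict set
 * that was sized for reg_count registers. */
static int demo_reg_is_testable(unsigned int reg, unsigned int reg_count)
{
    assert(reg_count > 0);
    return reg != DEMO_NO_REG && reg < reg_count;
}
```

With 32-bit words, `demo_bit_word(DEMO_NO_REG)` selects a word roughly 134 million entries past the start of the array, which is consistent with a segfault rather than a silently wrong result.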

Comment 7 Connor Abbott 2014-08-30 20:50:32 UTC
Oh, and I forgot to mention:

If you do find that g->nodes[n2].reg is NO_REG, the next step would be to break at the end of ra_simplify() (but make sure to stop at the last time the breakpoint gets hit before the segfault, using the stackoverflow post I linked to) and print out the values of all the nodes (g->nodes[0], g->nodes[1], ..., g->nodes[g->count - 1]). All the ones with .reg = NO_REG should also have .in_stack = true.

If one has .reg = NO_REG and .in_stack = false, then in ra_simplify() we should have reached line 468, in which case we either pushed it onto the stack (if pq_test() returned true) or considered it for optimistic coloring (if pq_test() returned false). So if we finished the loop, then progress == false, meaning no nodes were pushed on the stack and no nodes were considered for optimistic coloring (see the places where we set progress = true), so no nodes should have .reg = NO_REG and .in_stack = false when we leave ra_simplify().

Then, in ra_select(), whenever we set .in_stack = false (line 536) we also set .reg to something else (line 541), unless we run out of registers, in which case we bail out and r300g will complain about running out of registers. So it seems strange to me that this would happen, but it's also the most likely explanation of why it's segfaulting.
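
The invariant described here (no node may leave ra_simplify() with .reg = NO_REG and .in_stack = false) could also be checked in code rather than by hand in gdb. A minimal sketch, assuming a simplified node with only those two fields; `demo_ra_node`, `DEMO_NO_REG` and `first_unhandled_node` are illustrative names, not part of register_allocate.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NO_REG (~0U)  /* assumed to match NO_REG (0xffffffff) */

/* Hypothetical, cut-down node: only the two fields the invariant uses. */
struct demo_ra_node {
    unsigned int reg;  /* assigned register, or DEMO_NO_REG */
    bool in_stack;     /* true once the node has been pushed */
};

/* Returns the index of the first node that has neither a register nor a
 * place on the stack, or -1 if the invariant holds for all nodes. */
static int first_unhandled_node(const struct demo_ra_node *nodes,
                                unsigned int count)
{
    assert(nodes != NULL || count == 0);
    for (unsigned int i = 0; i < count; i++) {
        if (nodes[i].reg == DEMO_NO_REG && !nodes[i].in_stack)
            return (int)i;
    }
    return -1;
}
```

Called at the end of simplification, a non-negative return value would point directly at the node that was skipped.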
Comment 8 Pavel Ondračka 2014-09-01 10:31:28 UTC
Ok, so indeed I got NO_REG for g->nodes[n2].reg

print g->nodes[n2].reg
$1 = 4294967295

Then I set a breakpoint at the end of ra_simplify() (it gets called just once before the crash):

Breakpoint 1, ra_simplify (g=0x80c2058)
    at ../../src/mesa/program/register_allocate.c:491
491	}
(gdb) print g->count
$2 = 3

print g->nodes[0]
$3 = {adjacency = 0x81b8968, adjacency_list = 0x80c1658, 
  adjacency_list_size = 4, adjacency_count = 3, class = 0, reg = 4294967295, 
  in_stack = false, q_total = 4294967295, spill_cost = 0}

print g->nodes[1]
$4 = {adjacency = 0x81b3d18, adjacency_list = 0x80c1b58, 
  adjacency_list_size = 4, adjacency_count = 3, class = 2, reg = 4294967295, 
  in_stack = true, q_total = 2, spill_cost = 0}

print g->nodes[2]
$5 = {adjacency = 0x81c0328, adjacency_list = 0x80c18d8, 
  adjacency_list_size = 4, adjacency_count = 3, class = 3, reg = 4294967295, 
  in_stack = true, q_total = 4, spill_cost = 0}
Comment 9 Connor Abbott 2014-09-01 21:37:04 UTC
Created attachment 105572 [details] [review]
debugging patch
Comment 10 Connor Abbott 2014-09-01 21:39:12 UTC
Can you try the patch I attached and tell me what output you get between the last "--- begin simplify ---" and "--- end simplify ---" pair?
Comment 11 Pavel Ondračka 2014-09-02 06:21:53 UTC
Full test output with debugging patch:

$ bin/shader_runner tests/shaders/glsl-fs-loop-continue.shader_test -auto
r300: DRM version: 2.38.0, Name: ATI RV530, ID: 0x71c5, GB: 1, Z: 2
r300: GART size: 509 MB, VRAM size: 256 MB
r300: AA compression RAM: YES, Z compression RAM: YES, HiZ RAM: YES
--- begin simplify ---
got here with node 2
pushing node 2 onto the stack
got here with node 1
pushing node 1 onto the stack
got here with node 0
got here with node 0
--- end simplify ---
Segmentation fault (SIGSEGV)
Comment 12 Connor Abbott 2014-09-02 18:00:26 UTC
Created attachment 105630 [details] [review]
another debugging patch

Ok, it looks like the problem is that node 0's q_total is bogus, which means it never even gets considered for optimistic coloring. To help me figure out why this is, can you apply this patch to master (not on top of the other patch) and tell me the output of the piglit test now?
Comment 13 Tom Stellard 2014-09-02 18:09:25 UTC
(In reply to comment #12)
> Created attachment 105630 [details] [review]
> another debugging patch
> 
> Ok, it looks like the problem is that node 0's q_total is bogus, which means
> it never even gets considered for optimistic coloring. To help me figure out
> why this is, can you apply this patch to master (not on top of the other
> patch) and tell me the output of the piglit test now?

I'm not sure if this matters, but r300g pre-allocates the input registers before calling ra_allocate_no_spills().
Comment 14 Connor Abbott 2014-09-02 18:23:31 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > Created attachment 105630 [details] [review]
> > another debugging patch
> > 
> > Ok, it looks like the problem is that node 0's q_total is bogus, which means
> > it never even gets considered for optimistic coloring. To help me figure out
> > why this is, can you apply this patch to master (not on top of the other
> > patch) and tell me the output of the piglit test now?
> 
> I'm not sure if this matters, but r300g pre-allocates the input registers
> before calling ra_allocate_no_spills().

I think there are no input registers in this case (there's a NumInputs = 0 somewhere in the backtrace) so there aren't any pre-allocated nodes here.
Comment 15 Tom Stellard 2014-09-02 18:45:18 UTC
Can you post the output of RADEON_DEBUG=ps,vs ?
Comment 16 Marek Olšák 2014-09-02 18:49:16 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > (In reply to comment #12)
> > > Created attachment 105630 [details] [review]
> > > another debugging patch
> > > 
> > > Ok, it looks like the problem is that node 0's q_total is bogus, which means
> > > it never even gets considered for optimistic coloring. To help me figure out
> > > why this is, can you apply this patch to master (not on top of the other
> > > patch) and tell me the output of the piglit test now?
> > 
> > I'm not sure if this matters, but r300g pre-allocates the input registers
> > before calling ra_allocate_no_spills().
> 
> I think there are no input registers in this case (there's a NumInputs = 0
> somewhere in the backtrace) so there aren't any pre-allocated nodes here.

What Tom probably meant is that inputs are loaded to temps before the fragment shader starts, so inputs and temps pretty much share the temporary file. Not sure how relevant it is to this issue, but obviously you can't rename the temps which are supposed to contain inputs.
Comment 17 Pavel Ondračka 2014-09-02 21:09:37 UTC
output with second debug patch:

bin/shader_runner tests/shaders/glsl-fs-loop-continue.shader_test -auto
r300: DRM version: 2.38.0, Name: ATI RV530, ID: 0x71c5, GB: 1, Z: 2
r300: GART size: 509 MB, VRAM size: 256 MB
r300: AA compression RAM: YES, Z compression RAM: YES, HiZ RAM: YES
increasing q total, old q total = 0, n1 = 0, n2 = 1, value = 1
increasing q total, old q total = 0, n1 = 1, n2 = 0, value = 1
increasing q total, old q total = 1, n1 = 0, n2 = 2, value = 1
increasing q total, old q total = 0, n1 = 2, n2 = 0, value = 1
increasing q total, old q total = 1, n1 = 1, n2 = 2, value = 1
increasing q total, old q total = 1, n1 = 2, n2 = 1, value = 3
decreasing q total, old q total = 2, n = 2, n2 = 0, value = 0
decreasing q total, old q total = 2, n = 2, n2 = 1, value = 0
decreasing q total, old q total = 2, n = 1, n2 = 0, value = 3
Segmentation fault (SIGSEGV)
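
The arithmetic in this log can be replayed to see where the bogus q_total comes from. The sketch below rests on an interpretation of the log, not on the allocator source: the "increasing" lines appear to update node n1 and the "decreasing" lines node n2, since that is what makes the "old q total" column consistent. All names (`demo_q_step`, `demo_log`, `demo_replay`) are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* One line of the debug log, reduced to the node whose q_total changes. */
struct demo_q_step {
    unsigned int node;   /* n1 for "increasing", n2 for "decreasing" (assumed) */
    unsigned int value;  /* the "value" column */
    int increase;        /* 1 = "increasing", 0 = "decreasing" */
};

/* The nine log lines above, in order. */
static const struct demo_q_step demo_log[] = {
    {0, 1, 1}, {1, 1, 1}, {0, 1, 1}, {2, 1, 1}, {1, 1, 1}, {2, 3, 1},
    {0, 0, 0}, {1, 0, 0}, {0, 3, 0},
};
#define DEMO_LOG_LEN (sizeof(demo_log) / sizeof(demo_log[0]))

/* Replays the log for one node using unsigned arithmetic, as the
 * "old q total" values suggest the real counter does. */
static unsigned int demo_replay(const struct demo_q_step *steps,
                                size_t step_count, unsigned int node)
{
    unsigned int q_total = 0;
    assert(steps != NULL);
    for (size_t i = 0; i < step_count; i++) {
        if (steps[i].node != node)
            continue;
        if (steps[i].increase)
            q_total += steps[i].value;
        else
            q_total -= steps[i].value;  /* wraps around below zero */
    }
    return q_total;
}
```

Replaying node 0 gives 1 + 1 - 0 - 3, which wraps to 4294967295 in 32-bit unsigned arithmetic; nodes 1 and 2 end at 2 and 4. Those are the q_total values gdb printed in comment 8. Note that node 0 was credited 1 for its edge to node 1 but debited 3 for the same edge, so the increase and decrease do not seem to use the same value.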
Comment 18 Pavel Ondračka 2014-09-02 21:17:58 UTC
Created attachment 105641 [details]
RADEON_DEBUG=fp,vp output

(In reply to comment #15)
> Can you post the output of RADEON_DEBUG=ps,vs ?

I suppose you mean RADEON_DEBUG=fp,vp?
Comment 19 Connor Abbott 2014-09-03 00:26:07 UTC
Created attachment 105645 [details] [review]
proposed fix

Does this patch fix the piglit failures? For doing a full piglit run, I'd recommend comparing the commit before my series, where the mess started (d72d67832bd7a5f2aa0c402333a7de6804ad35ef), and the last commit (e78a01d5e6f77e075fe667a0f0ccb10d89c0dd58) with my fix on top.
Comment 20 Pavel Ondračka 2014-09-04 10:17:34 UTC
Your patch does indeed fix the crashing tests. I still see some piglit regressions, but those should be either bug 82882 or bug 82978.
Thanks for the fix.
Comment 21 Connor Abbott 2014-09-09 22:46:41 UTC
FYI, I posted the fix I attached as http://lists.freedesktop.org/archives/mesa-dev/2014-September/067343.html, along with a few other patches that clean up things I noticed when fixing this. I don't have commit access, so I'm waiting for someone to push the series before I close this issue.
Comment 22 Fabio Pedretti 2014-09-12 09:47:26 UTC
Can someone push Connor's patches and backport the fix in time for 10.3?

r300 is seriously broken without this fix, with many apps crashing, and it would be nice to have it fixed in time for 10.3.
Comment 23 Andreas Boll 2014-09-12 14:09:58 UTC
Fixed with commit afd82dcad127b64381ca6d80d0e499368074f474
