Summary: | Regression: Crash in 3Dmark2001 | ||
---|---|---|---|
Product: | Mesa | Reporter: | Stefan Dösinger <stefandoesinger> |
Component: | Drivers/Gallium/r300 | Assignee: | Default DRI bug account <dri-devel> |
Status: | VERIFIED FIXED | QA Contact: | |
Severity: | blocker | ||
Priority: | high | CC: | andabata12, cwabbott0, kenneth, pavel.ondracka, pedretti.fabio |
Version: | git | Keywords: | bisected, regression |
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Backtrace, full backtrace from piglit crash, debugging patch, another debugging patch, RADEON_DEBUG=fp,vp output, proposed fix |
Description
Stefan Dösinger
2014-08-19 21:11:31 UTC
*** Bug 82852 has been marked as a duplicate of this bug. ***

I can confirm the same bug on an ATI X600 Mobile with Mesa 10.3.0 RC1. FlightGear, at least, triggers it.

Yeah, it's affecting multiple apps, and I also see over 100 crashing piglit tests after this commit on my RV530, so this should be easy to reproduce even without Wine.

All the crashes are in the same place, right?

Can you run it under gdb and print out n2 and the contents of g->nodes[n].adjacency_list (it's an array with g->nodes[n].adjacency_count elements) after the segfault? How about the former before the ra_simplify() call in the ra_allocate() call that's segfaulting? (If you don't know how to do this, see http://stackoverflow.com/questions/2956889/how-to-set-a-counter-for-a-gdb-breakpoint)

I'm guessing that it's segfaulting because n2 is some bogus value. n2 comes from the adjacency_list, which is generated before the allocator actually runs, by code I didn't touch, and is never modified afterward. The code that's segfaulting wasn't modified by the commit in question either. So the two most likely options I see are that either this is exposing a bug somewhere else (like in r300g) or the new ra_simplify() is somehow corrupting the adjacency_list. I don't know how r300g sets up the register conflicts and register classes, though, so I can't guess why it works fine on i965 but fails for r300g.

Created attachment 105451 [details]
full backtrace from piglit crash

(In reply to comment #4)
OK, I'm not sure I know what I'm doing, but I picked one crashing piglit test at random:

    /bin/shader_runner tests/shaders/glsl-fs-loop-continue.shader_test -auto

    Program received signal SIGSEGV, Segmentation fault.
    0xb76391a9 in ra_select (g=0x80c2058) at ../../src/mesa/program/register_allocate.c:525
    525 BITSET_TEST(g->regs->regs[r].conflicts, g->nodes[n2].reg)) {

    print n2
    $2 = 0
    print n
    $7 = 1
    print g->nodes[n].adjacency_count
    $1 = 3
    print g->nodes[n].adjacency_list
    $3 = (unsigned int *) 0x80c1b58
    print g->nodes[n].adjacency_list[0]
    $4 = 1
    print g->nodes[n].adjacency_list[1]
    $5 = 0
    print g->nodes[n].adjacency_list[2]
    $6 = 2

Full backtrace attached.

(In reply to comment #5)
Can you print out the value of g->nodes[n2].reg? I think it may be NO_REG (0xffffffff), even though it shouldn't be (if a node is not on the stack, then it's supposed to be assigned a register already).

(In reply to comment #5)
Oh, and I forgot to mention: if you do find that g->nodes[n2].reg is NO_REG, the next step would be to break at the end of ra_simplify() (but make sure to stop at the last time the breakpoint gets hit before the segfault, using the Stack Overflow post I linked to) and print out the values of all the nodes (g->nodes[0], g->nodes[1], ..., g->nodes[g->count - 1]). All the ones with .reg = NO_REG should also have .in_stack = true.
If one has .reg = NO_REG and .in_stack = false, then in ra_simplify() we should have reached line 468, in which case we either push it onto the stack (if pq_test() returns true) or consider it for optimistic coloring (if pq_test() returns false). So if we finished the loop, then progress == false, meaning no nodes were pushed on the stack and no nodes were considered for optimistic coloring (see the places where we set progress = true), so no nodes should have .reg = NO_REG and .in_stack = false when we leave ra_simplify(). Then, in ra_select(), whenever we set .in_stack = false (line 536) we also set .reg to something else (line 541), unless we run out of registers, in which case we bail out and r300g will complain about running out of registers. So it seems strange to me that that would happen, but it is also the most likely explanation of why it's segfaulting.

OK, so I did indeed get NO_REG for g->nodes[n2].reg:

    print g->nodes[n2].reg
    $1 = 4294967295

Then I set a breakpoint at the end of ra_simplify() (it gets called just once before the crash):

    Breakpoint 1, ra_simplify (g=0x80c2058) at ../../src/mesa/program/register_allocate.c:491
    491 }
    (gdb) print g->count
    $2 = 3
    print g->nodes[0]
    $3 = {adjacency = 0x81b8968, adjacency_list = 0x80c1658, adjacency_list_size = 4, adjacency_count = 3, class = 0, reg = 4294967295, in_stack = false, q_total = 4294967295, spill_cost = 0}
    print g->nodes[1]
    $4 = {adjacency = 0x81b3d18, adjacency_list = 0x80c1b58, adjacency_list_size = 4, adjacency_count = 3, class = 2, reg = 4294967295, in_stack = true, q_total = 2, spill_cost = 0}
    print g->nodes[2]
    $5 = {adjacency = 0x81c0328, adjacency_list = 0x80c18d8, adjacency_list_size = 4, adjacency_count = 3, class = 3, reg = 4294967295, in_stack = true, q_total = 4, spill_cost = 0}

Created attachment 105572 [details] [review]
debugging patch

Can you try the patch I attached and tell me what output you get between the last "--- begin simplify ---" and "--- end simplify ---" pair?
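The invariant described above (on leaving ra_simplify(), every node without a register must be on the stack) can be written as a small check. This is only a sketch: `struct sketch_node` and `first_unplaced_node()` are hypothetical names, not part of Mesa, and the struct keeps just the two members from the gdb dump that the check needs. `NO_REG` is defined here as 0xffffffff, the value quoted in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_REG UINT32_MAX /* 0xffffffff, the value quoted in the thread */

/* Hypothetical cut-down node: only the members needed for the check,
 * named after the fields visible in the gdb dump. */
struct sketch_node {
   uint32_t reg;
   bool in_stack;
};

/* Returns the index of the first node that has no register and is not
 * on the stack either, or -1 if every node satisfies the invariant. */
static int
first_unplaced_node(const struct sketch_node *nodes, size_t count)
{
   for (size_t i = 0; i < count; i++) {
      if (nodes[i].reg == NO_REG && !nodes[i].in_stack)
         return (int)i;
   }
   return -1;
}
```

Fed the three nodes from the dump (all with reg = 4294967295, and in_stack = false, true, true), the check reports node 0, the same node that ra_select() later reads as n2 when it crashes.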
Full test output with debugging patch:

    $ bin/shader_runner tests/shaders/glsl-fs-loop-continue.shader_test -auto
    r300: DRM version: 2.38.0, Name: ATI RV530, ID: 0x71c5, GB: 1, Z: 2
    r300: GART size: 509 MB, VRAM size: 256 MB
    r300: AA compression RAM: YES, Z compression RAM: YES, HiZ RAM: YES
    --- begin simplify ---
    got here with node 2
    pushing node 2 onto the stack
    got here with node 1
    pushing node 1 onto the stack
    got here with node 0
    got here with node 0
    --- end simplify ---
    Segmentation fault (SIGSEGV)

Created attachment 105630 [details] [review]
another debugging patch

OK, it looks like the problem is that node 0's q_total is bogus, which means it never even gets considered for optimistic coloring. To help me figure out why this is, can you apply this patch to master (not on top of the other patch) and tell me the output of the piglit test now?

(In reply to comment #12)
I'm not sure if this matters, but r300g pre-allocates the input registers before calling ra_allocate_no_spills().
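Comment 12 says a bogus q_total keeps node 0 from ever being considered for optimistic coloring. The thread does not show the selection code, so the following is only a sketch of one way that could happen. It assumes the candidate is the node with the lowest q_total, found with a strict less-than comparison against an all-ones starting value. `pick_optimistic_node()` and `NO_NODE` are hypothetical names used for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_NODE UINT32_MAX /* hypothetical "nothing chosen" marker */

/* Assumed selection rule: pick the candidate with the lowest q_total.
 * A candidate whose q_total is already UINT32_MAX can never compare
 * strictly below the starting value, so it is never picked. */
static uint32_t
pick_optimistic_node(const uint32_t *q_total, size_t count)
{
   uint32_t best = NO_NODE;
   uint32_t lowest = UINT32_MAX;

   for (size_t i = 0; i < count; i++) {
      if (q_total[i] < lowest) {
         lowest = q_total[i];
         best = (uint32_t)i;
      }
   }
   return best;
}
```

Under that assumption, a lone remaining candidate with q_total = 4294967295 yields NO_NODE: nothing is pushed and no progress is made. That would be consistent with the two "got here with node 0" lines that are not followed by a push.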
(In reply to comment #13)
I think there are no input registers in this case (there's a NumInputs = 0 somewhere in the backtrace), so there aren't any pre-allocated nodes here.

Can you post the output of RADEON_DEBUG=ps,vs ?

(In reply to comment #14)
What Tom probably meant is that inputs are loaded to temps before the fragment shader starts, so inputs and temps pretty much share the temporary file. Not sure how relevant it is to this issue, but obviously you can't rename the temps which are supposed to contain inputs.

Output with the second debug patch:

    bin/shader_runner tests/shaders/glsl-fs-loop-continue.shader_test -auto
    r300: DRM version: 2.38.0, Name: ATI RV530, ID: 0x71c5, GB: 1, Z: 2
    r300: GART size: 509 MB, VRAM size: 256 MB
    r300: AA compression RAM: YES, Z compression RAM: YES, HiZ RAM: YES
    increasing q total, old q total = 0, n1 = 0, n2 = 1, value = 1
    increasing q total, old q total = 0, n1 = 1, n2 = 0, value = 1
    increasing q total, old q total = 1, n1 = 0, n2 = 2, value = 1
    increasing q total, old q total = 0, n1 = 2, n2 = 0, value = 1
    increasing q total, old q total = 1, n1 = 1, n2 = 2, value = 1
    increasing q total, old q total = 1, n1 = 2, n2 = 1, value = 3
    decreasing q total, old q total = 2, n = 2, n2 = 0, value = 0
    decreasing q total, old q total = 2, n = 2, n2 = 1, value = 0
    decreasing q total, old q total = 2, n = 1, n2 = 0, value = 3
    Segmentation fault (SIGSEGV)

Created attachment 105641 [details]
RADEON_DEBUG=fp,vp output

(In reply to comment #15)
> Can you post the output of RADEON_DEBUG=ps,vs ?

I suppose you mean RADEON_DEBUG=fp,vp?

Created attachment 105645 [details] [review]
proposed fix

Does this patch fix the piglit failures?
For doing a full piglit run, I'd recommend comparing the commit before my series where the mess started (d72d67832bd7a5f2aa0c402333a7de6804ad35ef) and the last commit (e78a01d5e6f77e075fe667a0f0ccb10d89c0dd58) with my fix on top.

Your patch does indeed fix the crashing tests. I still see some piglit regressions, but those should be either bug 82882 or bug 82978. Thanks for the fix.

FYI, I posted the fix I attached as http://lists.freedesktop.org/archives/mesa-dev/2014-September/067343.html, along with a few other patches that clean up things I noticed when fixing this, but I don't have commit access, so I'm waiting for someone to push the series before I close this issue.

Can someone push Connor's patches and backport the fix in time for 10.3? r300 is seriously broken without this fix, with many apps crashing, and it would be nice to have it fixed in time for 10.3.

Fixed with commit afd82dcad127b64381ca6d80d0e499368074f474 |
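The log from the second debugging patch can be replayed with plain unsigned arithmetic to see where the bogus q_total comes from. This sketch is not Mesa code, and `replay_q_total_log()` is a hypothetical helper. It assumes each "increasing" line updates the total of n1 and each "decreasing" line updates the total of n2. That reading is an interpretation, chosen because it is the one under which every logged "old q total" matches.

```c
#include <stdint.h>

/* Replays the q_total updates printed by the second debugging patch,
 * using 32-bit unsigned totals as in the gdb dump. */
static void
replay_q_total_log(uint32_t q_total[3])
{
   q_total[0] = q_total[1] = q_total[2] = 0;

   /* "increasing q total" lines: n1's total grows by value. */
   q_total[0] += 1; /* n1 = 0, n2 = 1, old = 0 */
   q_total[1] += 1; /* n1 = 1, n2 = 0, old = 0 */
   q_total[0] += 1; /* n1 = 0, n2 = 2, old = 1 */
   q_total[2] += 1; /* n1 = 2, n2 = 0, old = 0 */
   q_total[1] += 1; /* n1 = 1, n2 = 2, old = 1 */
   q_total[2] += 3; /* n1 = 2, n2 = 1, old = 1 */

   /* "decreasing q total" lines: n2's total shrinks by value. */
   q_total[0] -= 0; /* n = 2, n2 = 0, old = 2 */
   q_total[1] -= 0; /* n = 2, n2 = 1, old = 2 */
   q_total[0] -= 3; /* n = 1, n2 = 0, old = 2: 2 - 3 wraps around */
}
```

The replay ends with totals of 4294967295, 2 and 4 for nodes 0, 1 and 2, the same q_total values as in the earlier gdb dump. On this reading, the edge between nodes 0 and 1 added 1 to node 0's total but later subtracted 3 from it, and that mismatch is what makes the unsigned value wrap.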