Bug 34969

Summary: [nouveau] Card lockup on openarena
Product: xorg
Reporter: Nikolay Rysev <mad.f3ka>
Component: Driver/nouveau
Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: normal
Priority: medium
CC: mad.f3ka, randrik
Version: unspecified
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
Attachments:
Description    Flags
messages.log part    none
nouveau dmesg    none
log of buffer activity, from NVFX_TRACE_DRAW=1 ./vertexrate > NV_TRACE.log 2>&1    none
patch for libdrm command buffer tracing    none
Minimal vertexrate.c    none
NOUVEAU_TRACE.log    none
modified vertexrate.c    none
NOUVEAU_TRACE logs    none
hand-typed "log"    none

Description Nikolay Rysev 2011-03-03 04:13:00 UTC
Created attachment 44068 [details]
messages.log part

Each time I try to start openarena or any other game, the card locks up (not in the menu, but in game).

I'm not sure, but the log in messages seems to be related to this issue (see the messages attachment).

Software versions:
kernel 2.6.37.2
libdrm 2.4.23
mesa 7.10.1
xf86-video-nouveau 0.0.16_git20101217

Hardware:
06:00.0 VGA compatible controller: nVidia Corporation NV43 [GeForce 6600] (rev a2)
Comment 1 Andrew Randrianasulu 2011-07-14 22:40:18 UTC
While nvfx (the 3D Gallium driver) isn't supported, I ran a few tests with nouveau-git (commit 72690683e35680d912e9b8ff2ee0b7a18631dd0d - "drm/nouveau/pm: Document and expose CL and WR for 0x1002Cx") and mesa-7.12-devel (OpenGL version string: 2.1 Mesa 7.12-devel (git-a09b7f7)).

Other software - Q3 Arena Linux demo.

The results are kind of strange.

Default settings
Timedemo 1 
demo demo002 - locks up the card very quickly; the machine can still be powered off via ACPI.
demo demo001 - works!

geometry detail set to low
demo demo001 - not tested
demo demo002 - works!

geometry detail set to medium
demo demo002 - locks up after some time (not at the start of the level)

For perflvl=0

even demo001 locks up after some time, in the middle (always at the same place).

So I guess we run ahead of or behind the GPU, mostly when sending geometry data? This needs more testing, for example with the Mesa perf demos.

currently my performance levels are

[drm] nouveau 0000:01:00.0: mem timing table length unknown: 14
[drm] nouveau 0000:01:00.0: 2 available performance level(s)
[drm] nouveau 0000:01:00.0: 0: core 300MHz memory 1000MHz voltage 1300mV fanspeed 100%
[drm] nouveau 0000:01:00.0: 1: core 500MHz memory 1000MHz voltage 1400mV fanspeed 100%
[drm] nouveau 0000:01:00.0: c: core 100MHz memory 501MHz voltage 1300mV
[TTM] Zone  kernel: Available graphics memory: 255992 kiB.


and perflvl 1 locks up the whole machine pretty fast (before I managed to run any benchmark, even with just a composited desktop, via XRender on KDE 3.5.10)

I guess I need an additional bug to track this, and probably one more to track invisible VTs after S3 sleep.
Comment 2 Andrew Randrianasulu 2011-07-14 22:41:58 UTC
Created attachment 49119 [details]
nouveau dmesg
Comment 3 Andrew Randrianasulu 2011-07-14 23:21:33 UTC
OK, perf/vertexrate locks up the card.

If I disable everything but immediate mode (in vertexrate.c), I get:

guest@slax:~/botva/src/demos/src/perf$ ./vertexrate
Vertex rate (10000 x Vertex4f)
  Immediate mode: 11.2 million verts/sec

perf/vbo works, but as far as I can see it doesn't draw anything visible.
Comment 4 Andrew Randrianasulu 2011-07-14 23:43:39 UTC
Ok, VBO glDrawArrays works too:

guest@slax:~/botva/src/demos/src/perf$ ./vertexrate
Vertex rate (10000 x Vertex4f)
  Immediate mode: 11.5 million verts/sec
  VBO glDrawArrays: 12.8 million verts/sec
guest@slax:~/botva/src/demos/src/perf$                      

Uncommenting anything else locks up the card.
Comment 5 Andrew Randrianasulu 2011-07-16 17:23:02 UTC
Created attachment 49192 [details]
log of buffer activity, from NVFX_TRACE_DRAW=1 ./vertexrate > NV_TRACE.log 2>&1
Comment 6 Andrew Randrianasulu 2011-07-19 20:27:50 UTC
Sorry, in my case the auxiliary power was NOT connected to the card. Now the fastest perflevel works for Q3/demo001 (up to 78.0 fps at 1280x1024x32@60, but the GPU temperature rose to 92 C).

But the unmodified vertexrate demo still freezes the card, even with the default boot clocks :[
Comment 7 Christoph Bumiller 2011-07-20 02:58:54 UTC
Try cutting the vertexrate demo down to a minimal subset that causes the lockup and trace the command buffer.
You're getting DMA_VTX_PROTECTION so you're likely accessing out of bounds vertex data somehow.
If you can find the commands that cause it and compare with the gallium code you might be able to find the error.

Attached patch for tracing command buffer (apply to libdrm).
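
To make "out of bounds vertex data" concrete, here is a minimal Python sketch of the generic arithmetic that decides whether a draw stays inside a vertex buffer. It is not taken from the nvfx or gallium code: the function names, the parameter names, and the layout (a fixed stride with one attribute read per vertex) are all assumptions made for illustration. The sizes in the demo assume a Vertex4f is four 32-bit floats, i.e. 16 bytes.

```python
def max_safe_vertex_count(buffer_size, offset, stride, attrib_size):
    """Return how many vertices can be fetched without reading past the buffer.

    Hypothetical helper: vertex i is assumed to occupy bytes
    [offset + i * stride, offset + i * stride + attrib_size).
    """
    if stride <= 0 or attrib_size <= 0 or offset < 0:
        raise ValueError("stride and attrib_size must be positive, offset non-negative")
    # Bytes left for the start address of the last vertex that still fits.
    usable = buffer_size - offset - attrib_size
    if usable < 0:
        return 0
    return usable // stride + 1


def draw_is_in_bounds(buffer_size, offset, stride, attrib_size, first, count):
    """Check that a draw of `count` vertices starting at `first` stays in bounds."""
    if first < 0:
        return False
    if count <= 0:
        return True
    limit = max_safe_vertex_count(buffer_size, offset, stride, attrib_size)
    return first + count <= limit


if __name__ == "__main__":
    size = 10000 * 16  # room for exactly 10000 16-byte vertices
    print(draw_is_in_bounds(size, 0, 16, 16, first=0, count=10000))
    print(draw_is_in_bounds(size, 0, 16, 16, first=1, count=10000))
```

When reading a command buffer trace, the same comparison can be done by hand: take the buffer size, offset and stride that were programmed, and check them against the vertex range of each draw.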
Comment 8 Christoph Bumiller 2011-07-20 03:00:34 UTC
Created attachment 49332 [details] [review]
patch for libdrm command buffer tracing
Comment 9 Andrew Randrianasulu 2011-07-20 03:23:05 UTC
Created attachment 49333 [details]
Minimal vertexrate.c
Comment 10 Andrew Randrianasulu 2011-07-20 03:24:34 UTC
Created attachment 49334 [details]
NOUVEAU_TRACE.log

for channel 3
Comment 11 Andrew Randrianasulu 2011-07-20 04:39:31 UTC
(In reply to comment #8)
> Created an attachment (id=49332) [details]
> patch for libdrm command buffer tracing

And how do I parse this trace into a more human-readable format? demmio from envytools?
Comment 12 Andrew Randrianasulu 2011-07-20 04:41:04 UTC
Oops, I mean ./dedma. I was unable to find anything useful.
Comment 13 Andrew Randrianasulu 2011-08-02 04:06:28 UTC
More info.

When I run NVFX_SWTNL=1 ./vertexrate, it also works without a lockup. I was playing with the amount of vertex data, and found that defining MAX_VERTS up to 1263 works OK, but 1264 and up leads to a card lockup.

So I traced the two cases and diffed the first full trace (with lockup) against part of the second trace (33 Mb in size, because the test ran to completion).
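
A threshold like the 1263/1264 boundary can be narrowed down with a binary search instead of trying values one by one. The sketch below is only an illustration under stated assumptions: `first_failing` and the `passes` callback are hypothetical names, and the search assumes the behaviour is monotonic (every value below the threshold works, every value at or above it fails). In practice each probe would mean rebuilding vertexrate.c with a different MAX_VERTS and running it by hand, so `passes` stands in for that manual step.

```python
def first_failing(lo, hi, passes):
    """Return the smallest n in (lo, hi] for which passes(n) is False.

    Preconditions (not checked, since a real probe may hang the card):
    passes(lo) is True, passes(hi) is False, and passes is monotonic.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid  # mid still works, so the threshold is above it
        else:
            hi = mid  # mid fails, so the threshold is at or below it
    return hi


if __name__ == "__main__":
    # Stand-in predicate that mimics the observation reported above.
    print(first_failing(1, 10000, lambda n: n <= 1263))
```

Each loop iteration halves the remaining range, which matters when a failed probe costs a GPU lockup.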
Comment 14 Andrew Randrianasulu 2011-08-02 04:07:19 UTC
Created attachment 49823 [details]
modified vertexrate.c
Comment 15 Andrew Randrianasulu 2011-08-02 04:10:22 UTC
Created attachment 49824 [details]
NOUVEAU_TRACE logs
Comment 16 Andrew Randrianasulu 2011-11-14 15:16:55 UTC
So, I bisected mesa tree (7.11 branch)

Result:

2a904fd6a0cb80eec6dec2bae07fd8778b04caf3 is the first bad commit
commit 2a904fd6a0cb80eec6dec2bae07fd8778b04caf3
Author: Marek Olšák <maraeo@gmail.com>
Date:   Sun Dec 26 04:30:51 2010 +0100

    st/mesa: set vertex arrays state only when necessary

    The vertex arrays state should be set only when (_NEW_ARRAY | _NEW_PROGRAM)
    is dirty. This assumes user buffer content is mutable, which will be
    sorted out in the next commit. The following usage case should be much faster
    now:

    for (i = 0; i < 1000; i++) {
       glDrawElements(...);
    }

    Or even:

    for (i = 0; i < 1000; i++) {
       glSomeStateChangeOtherThanArraysOrProgram(...);
       glDrawElements(...);
    }

    The performance increase from this may be significant in some apps and
    negligible in others. It is especially noticable in the Torcs game (r300g):
        Before: 15.4 fps
        After: 20 fps

    Also less looping over attribs in st_draw_vbo yields slight speed-up
    in apps with lots of glDraw* calls.

:040000 040000 45e5630d445206ce8c7eab6ac6bfe144901695bb 927efd20e354ecae459227bd464dbb6001cb448e M      src


guest@slax:~/botva/src/src/mesa$ git bisect log
git bisect start
# good: [0f7325b89038937bd428f7c89ed9859189a0ab0b] i965: Emit texel offsets in sampler messages.
git bisect good 0f7325b89038937bd428f7c89ed9859189a0ab0b
# bad: [8767fe2437094f33db140a6b92f25116de4fc371] mesa: Sort extensions in extension string by year.
git bisect bad 8767fe2437094f33db140a6b92f25116de4fc371
# bad: [fbd681f1a03f6ad62432107dc94e02674f6de7bf] i915g: Use dump function in sw winsys
git bisect bad fbd681f1a03f6ad62432107dc94e02674f6de7bf
# good: [1f5b67416810f7331fe71db0f767418473083701] egl_dri2: add nouveau support.
git bisect good 1f5b67416810f7331fe71db0f767418473083701
# bad: [e0481cac7d57757d75a39763a1dd36b915979bb4] svga: Disable surface cache for textures
git bisect bad e0481cac7d57757d75a39763a1dd36b915979bb4
# bad: [fc5ab1b19780ef97c5e7f6257a2d91121503bd53] mesa: use gl_format type instead of GLuint
git bisect bad fc5ab1b19780ef97c5e7f6257a2d91121503bd53
# bad: [56029ce52bafbc51b5b6660383767257b7770cd7] r300g: inline some of the pipe_buffer_map/unmap calls
git bisect bad 56029ce52bafbc51b5b6660383767257b7770cd7
# good: [9e96ea0652dda64f8eb311d7dfc9c50519ad02f0] r600g: add alignment cases for linear aligned
git bisect good 9e96ea0652dda64f8eb311d7dfc9c50519ad02f0
# bad: [588fa884d212eba5ffbc69fda75db37d7c77214c] gallium: notify drivers about possible changes in user buffer contents
git bisect bad 588fa884d212eba5ffbc69fda75db37d7c77214c
# good: [cfaf217135d8a8e903b3fbf380f18170df018f0c] vbo: bind arrays only when necessary
git bisect good cfaf217135d8a8e903b3fbf380f18170df018f0c
# good: [cdca3c58aa2d9549f5188910e2a77b438516714f] gallium: remove pipe_vertex_buffer::max_index
git bisect good cdca3c58aa2d9549f5188910e2a77b438516714f
# bad: [2a904fd6a0cb80eec6dec2bae07fd8778b04caf3] st/mesa: set vertex arrays state only when necessary
git bisect bad 2a904fd6a0cb80eec6dec2bae07fd8778b04caf3
Comment 17 Marek Olšák 2011-11-14 15:28:23 UTC
(In reply to comment #16)
> So, I bisected mesa tree (7.11 branch)
> 
> Result:
> 
> 2a904fd6a0cb80eec6dec2bae07fd8778b04caf3 is the first bad commit
> commit 2a904fd6a0cb80eec6dec2bae07fd8778b04caf3
> Author: Marek Olšák <maraeo@gmail.com>
> Date:   Sun Dec 26 04:30:51 2010 +0100
> 
>     st/mesa: set vertex arrays state only when necessary

I guess the bisection failed in this case, because that commit contains a bug that was fixed later.
Comment 18 Andrew Randrianasulu 2011-11-14 16:06:19 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > So, I bisected mesa tree (7.11 branch)
> > 
> > Result:
> > 
> > 2a904fd6a0cb80eec6dec2bae07fd8778b04caf3 is the first bad commit
> > commit 2a904fd6a0cb80eec6dec2bae07fd8778b04caf3
> > Author: Marek Olšák <maraeo@gmail.com>
> > Date:   Sun Dec 26 04:30:51 2010 +0100
> > 
> >     st/mesa: set vertex arrays state only when necessary
> 
> I guess the bisection failed in this case, because that commit contains a bug
> that was fixed later.

"gallium: notify drivers about possible changes in user buffer contents"? It was well masked for Q3 Arena until extension sorting exposed it there, but I ran my (reduced) ./vertexrate test, and you can see the whole bisect log... Maybe nvfx played too much with mesa/gallium internals.
Comment 19 Andrew Randrianasulu 2011-11-14 16:36:06 UTC
Created attachment 53562 [details]
hand-typed "log"

While trying a self-made live CD based on slackware-current, I found that Slackware's Mesa works fine for both Celestia and Q3 timedemo demo002. Slackware currently uses Mesa 7.10.2, libdrm 2.4.24 and a nouveau DDX from March 2011. I compiled the 7.11 branch and discovered that it hangs the video card just like 7.12 does on my main system. Then I just set the good commit to the last one from Luca. Even if I keep all other components the same (libdrm-2.4.27, xf86-video-nouveau a few commits behind git, kernel 3.2.0-rc1 mainline), the freeze comes and goes with just the Mesa version. My bisect builds were compiled without any LLVM support (my normal build uses LLVM 2.9). A nice mystery, and thankfully the whole machine does not lock up hard, only the video card, it seems.
Comment 20 Andrew Randrianasulu 2012-04-14 14:53:55 UTC
At least my hangs seem to be fixed by the big driver rewrite. Currently using 3.4.0-rc2 nouveau kernel, libdrm up to commit 292da616fe1f936ca78a3fa8e1b1b19883e343b6 ("nouveau: pull in major libdrm rewrite"), mesa master up to commit b2df031a959f36743527b9abc89913ce4f895de3 ("r300/compiler: Fix nested flow control in r500 vertex shaders") and xf86-video-nouveau up to commit fb3a36b1e5af0f81bb266da894d3442eed8e4e55 ("nve0: initial exa/xv acceleration for kepler chipsets").

Please re-test with the new code.
Comment 21 Pierre Moreau 2014-12-09 18:52:24 UTC
Moving to Nouveau.

Are you still experiencing this issue with an updated graphics stack, i.e. kernel 3.18, mesa 10.4, libdrm 2.4.58 and xf86-video-nouveau 1.0.11?
Comment 22 Nikolay Rysev 2014-12-10 08:38:45 UTC
(In reply to Pierre Moreau from comment #21)
> Moving to Nouveau.
> 
> Are you still experiencing this issue with an updated graphics stack, i.e.
> kernel 3.18, mesa 10.4, libdrm 2.4.58 and xf86-video-nouveau 1.0.11?

Sorry, but I cannot test because I no longer have that card.
If the new code works fine for Andrew, I think we can change the status of the bug to "resolved/fixed" or something similar.
Thanks.
