Bug 86864 - [rv6xx] RADEON_FLAG_GTT_WC causes GPU to reset when playing Second Life / other games
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2014-11-29 23:58 UTC by Shawn Starr
Modified: 2016-01-29 07:26 UTC (History)

Attachments
lspci verbose output (33.34 KB, text/plain)
2014-11-30 00:13 UTC, Shawn Starr
kernel dmesg output (79.80 KB, text/plain)
2014-11-30 00:22 UTC, Shawn Starr
Running threads listed (12.12 KB, text/plain)
2014-12-01 05:56 UTC, Shawn Starr

Description Shawn Starr 2014-11-29 23:58:36 UTC
I bisected Mesa git master and found that the commit below causes the GPU to lock up (the DRI driver hangs, but I can still ssh in and reboot the laptop cleanly).

Kernel: 3.17.2-300.fc21.x86_64

[root@devbox mesa-20141127]# git bisect good
7b4276d7acf2e0f77044cb50caa6ad936fa78786 is the first bad commit
commit 7b4276d7acf2e0f77044cb50caa6ad936fa78786
Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Tue Aug 26 18:21:50 2014 +0900

    r600g,radeonsi: Always use GTT again for PIPE_USAGE_STREAM buffers
    
    Putting those in VRAM can cause long pauses due to buffers being moved
    into / out of VRAM.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=84662
    Cc: mesa-stable@lists.freedesktop.org
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 672dc7643603cfccc8b9c85c83f37db32c26beeb 8e4a9f33de638b7df98284614a0da3bb3373f320 M      src

With this reverted from this point back to Sept 7th, I have stability with the r600g DRI driver.
Comment 1 Shawn Starr 2014-11-30 00:09:14 UTC
This was started from git master, but the bisect starts from: 73dd50acf6d244979c2a657906aa56d3ac60d550~1
Comment 2 Shawn Starr 2014-11-30 00:11:48 UTC
lspci output for the GPU:

01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RV635/M86 [Mobility Radeon HD 3650] [1002:9591] (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device [17aa:2127]
        Physical Slot: 1-1
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 30
        Region 0: Memory at d0000000 (32-bit, prefetchable) [size=256M]
        Region 1: I/O ports at 2000 [size=256]
        Region 2: Memory at cfff0000 (32-bit, non-prefetchable) [size=64K]
        [virtual] Expansion ROM at cff00000 [disabled] [size=128K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0100c  Data: 41e2
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Kernel driver in use: radeon
        Kernel modules: radeon
Comment 3 Shawn Starr 2014-11-30 00:13:31 UTC
Created attachment 110244
lspci verbose output
Comment 4 Shawn Starr 2014-11-30 00:22:28 UTC
Created attachment 110248
kernel dmesg output
Comment 5 Shawn Starr 2014-11-30 01:07:39 UTC
Following a suggestion to disable CP DMA (R600_DEBUG=nocpdma):

GPU crashed hard:

[  540.397058] radeon 0000:01:00.0: ring 0 stalled for more than 10066msec
[  540.403687] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000001ddb1 last fence id 0x000000000001ddb0 on ring 0)
[  540.413226] radeon 0000:01:00.0: failed to get a new IB (-35)
[  540.419088] [drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
[  540.642386] radeon 0000:01:00.0: Saved 537 dwords of commands on ring 0.
[  540.648841] radeon 0000:01:00.0: GPU softreset: 0x00000008
[  540.654444] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA0000030
[  540.661829] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[  540.667683] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200000C0
[  540.674041] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  540.680737] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  540.687094] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00020186
[  540.693250] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80028645
[  540.697631] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  540.751932] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00004001
[  540.754222] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[  540.759528] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
[  540.761834] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[  540.763420] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200080C0
[  540.765966] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  540.768397] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  540.770988] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  540.773460] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80100000
[  540.777015] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  540.778511] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[  540.784272] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[  540.786484] radeon 0000:01:00.0: WB enabled
[  540.788382] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff88003603ec00
[  540.823636] [drm] ring test on 0 succeeded in 1 usecs
[  550.825064] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
[  550.830843] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000001ddc2 last fence id 0x000000000001ddb1 on ring 0)
[  550.841797] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[  550.847294] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[  550.854825] radeon 0000:01:00.0: ib ring test failed (-35).
[  551.074568] radeon 0000:01:00.0: GPU softreset: 0x00000009
[  551.080710] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA2733030
[  551.087107] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000103
[  551.093704] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200000C0
[  551.100378] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  551.106751] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00008002
[  551.112612] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008086
[  551.119845] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80018645
[  551.125381] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  551.182032] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEF
[  551.187907] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[  551.194887] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
[  551.202106] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[  551.208302] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200080C0
[  551.214418] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  551.221656] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  551.227603] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  551.233900] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80100000
[  551.241368] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  551.247877] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[  551.256336] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[  551.261595] radeon 0000:01:00.0: WB enabled
[  551.265559] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff88003603ec00
[  551.305998] [drm] ring test on 0 succeeded in 1 usecs
[  551.310130] [drm] ib test on ring 0 succeeded in 0 usecs
[  551.315154] switching from power state:
[  551.318241]  ui class: none
[  551.320942]  internal class: boot 
[  551.324177]  caps: video 
[  551.326370]  uvd    vclk: 0 dclk: 0
[  551.329313]          power level 0    sclk: 60000 mclk: 70000 vddc: 1100
[  551.334239]          power level 1    sclk: 60000 mclk: 70000 vddc: 1100
[  551.340383]          power level 2    sclk: 60000 mclk: 70000 vddc: 1100
[  551.345543]  status: c b 
[  551.348176] switching to power state:
[  551.351659]  ui class: performance
[  551.354237]  internal class: none
[  551.358272]  caps: single_disp video 
[  551.361736]  uvd    vclk: 0 dclk: 0
[  551.364876]          power level 0    sclk: 11000 mclk: 40500 vddc: 900
[  551.369971]          power level 1    sclk: 30000 mclk: 70000 vddc: 1100
[  551.375776]          power level 2    sclk: 60000 mclk: 70000 vddc: 1100
[  551.381047]  status: r 
[  562.318061] radeon 0000:01:00.0: ring 0 stalled for more than 10048msec
[  562.325126] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000001ddd0 last fence id 0x000000000001ddcf on ring 0)
[  562.335405] retire_capture_urb: 4 callbacks suppressed
[  562.339810] radeon 0000:01:00.0: failed to get a new IB (-35)
[  562.345503] [drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
[  562.567926] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[  562.568803] Dazed and confused, but trying to continue
[  562.578130] radeon 0000:01:00.0: Saved 537 dwords of commands on ring 0.
[  562.583853] radeon 0000:01:00.0: GPU softreset: 0x00000009
[  562.589369] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA2733830
[  562.595914] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000103
[  562.600921] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200010C0
[  562.607704] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  562.613598] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00008000
[  562.620112] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008806
[  562.626883] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x800106C5
[  562.633039] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  562.882450] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEF
[  562.888905] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[  562.896025] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
[  562.902355] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[  562.908844] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200090C0
[  562.915298] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  562.921101] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  562.928340] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  562.934049] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80100000
[  562.940499] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  562.946972] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[  563.124312] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[  563.129580] radeon 0000:01:00.0: WB enabled
[  563.133126] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff88003603ec00
[  563.345369] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
[  563.353039] [drm:r600_resume] *ERROR* r600 startup failed on resume
[  563.358920] switching from power state:
[  563.363823]  ui class: none
[  563.366549]  internal class: boot 
[  563.369190]  caps: video 
[  563.371732]  uvd    vclk: 0 dclk: 0
[  563.374927]          power level 0    sclk: 60000 mclk: 70000 vddc: 1100
[  563.381397]          power level 1    sclk: 60000 mclk: 70000 vddc: 1100
[  563.387026]          power level 2    sclk: 60000 mclk: 70000 vddc: 1100
[  563.393153]  status: c b 
[  563.396783] switching to power state:
[  563.400357]  ui class: performance
[  563.403811]  internal class: none
[  563.406935]  caps: single_disp video 
[  563.411205]  uvd    vclk: 0 dclk: 0
[  563.413805]          power level 0    sclk: 11000 mclk: 40500 vddc: 900
[  563.420862]          power level 1    sclk: 30000 mclk: 70000 vddc: 1100
[  563.423187]          power level 2    sclk: 60000 mclk: 70000 vddc: 1100
[  563.425557]  status: r
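
The three "GPU lockup" messages in the log above each name the fence being waited for and the last fence that signalled. A minimal sketch for pulling those values out of a saved dmesg dump is shown here; the function name `find_lockups`, the `gap` field, and the regular expression are illustrative assumptions based only on the message format seen in this report, not part of any radeon tooling.

```python
import re

# Matches the radeon "GPU lockup" message format as it appears in this report.
# Assumption: other kernel versions may word this line differently.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+radeon\s+\S+:\s+GPU lockup "
    r"\(waiting for 0x(?P<waiting>[0-9a-fA-F]+) "
    r"last fence id 0x(?P<last>[0-9a-fA-F]+) on ring (?P<ring>\d+)\)"
)


def find_lockups(dmesg_text):
    """Return one dict per 'GPU lockup' line found in dmesg_text."""
    lockups = []
    for match in LOCKUP_RE.finditer(dmesg_text):
        waiting = int(match.group("waiting"), 16)
        last = int(match.group("last"), 16)
        lockups.append({
            "time": float(match.group("time")),   # seconds since boot
            "ring": int(match.group("ring")),
            "waiting": waiting,                   # fence id being waited for
            "last": last,                         # last fence id that signalled
            "gap": waiting - last,                # difference between the two ids
        })
    return lockups


# The three lockup lines copied from the log in this comment.
SAMPLE = """\
[  540.403687] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000001ddb1 last fence id 0x000000000001ddb0 on ring 0)
[  550.830843] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000001ddc2 last fence id 0x000000000001ddb1 on ring 0)
[  562.325126] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000001ddd0 last fence id 0x000000000001ddcf on ring 0)
"""

if __name__ == "__main__":
    for entry in find_lockups(SAMPLE):
        print(entry)
```

Run against the full attached dmesg output, this would make it easier to see whether every lockup occurs on ring 0 and how far the fence counter advanced between resets.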
Comment 6 Michel Dänzer 2014-12-01 02:38:58 UTC
So current Mesa Git master is still affected by this problem, and reverting commit 7b4276d7acf2e0f77044cb50caa6ad936fa78786 on top of that works around it?

Does setting the environment variable RADEON_THREAD=0 work around the problem? If not, please attach gdb to the Second Life process to make sure the environment variable takes effect, i.e. there's no CS helper thread spawned by the r600g driver.
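
One way to check that the variable actually reached the game process is to read its start-up environment from /proc on Linux. The sketch below assumes a Linux /proc filesystem; the helper names `parse_environ` and `process_env_value` are hypothetical. It only shows what the process was started with, so it complements the gdb check for a CS helper thread rather than replacing it.

```python
import os


def parse_environ(raw):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty piece after the trailing NUL
        name, sep, value = entry.partition(b"=")
        if sep:
            env[os.fsdecode(name)] = os.fsdecode(value)
    return env


def process_env_value(pid, name):
    """Return the value of `name` in the start-up environment of `pid`, or None.

    /proc/<pid>/environ reflects the environment at exec time, not later
    changes the process makes to its own environment.
    """
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return parse_environ(handle.read()).get(name)
```

For example, `process_env_value(pid, "RADEON_THREAD")` should return `"0"` if the variable was exported before the game was launched, where `pid` is a placeholder for the game's process ID.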
Comment 7 Shawn Starr 2014-12-01 02:56:18 UTC
Reverting the patch does stop the lockup.

I will test RADEON_THREAD=0 and confirm w/ gdb running the game.
Comment 8 Shawn Starr 2014-12-01 03:56:06 UTC
With the environment variable set before starting X (in .bashrc), running the game under gdb still locks up the GPU.

No crash was reported in the game (I broke in from gdb). Here are the threads:

(gdb) info threads
  Id   Target Id         Frame
  14   Thread 0x7fffe0602700 (LWP 1625) "do-not-directly" 0x00000037a5cc491d in nanosleep () from /lib64/libc.so.6
  13   Thread 0x7ffff000f700 (LWP 1624) "do-not-directly" 0x00000037a640c578 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  12   Thread 0x7fffe0e03700 (LWP 1623) "threaded-ml" 0x00000037a5cf51dd in poll () from /lib64/libc.so.6
  11   Thread 0x7fffc23a4700 (LWP 1621) "gdbus" 0x00000037a5cf51dd in poll () from /lib64/libc.so.6
  10   Thread 0x7fffc2ba5700 (LWP 1620) "do-not-directly" 0x00000037a5d00d83 in epoll_wait () from /lib64/libc.so.6
  9    Thread 0x7fffe27fc700 (LWP 1615) "do-not-directly" 0x00000037a640c590 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  8    Thread 0x7fffe2ffd700 (LWP 1614) "do-not-directly" 0x00000037a640c590 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  7    Thread 0x7fffe37fe700 (LWP 1613) "do-not-directly" 0x00000037a640c590 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  6    Thread 0x7fffe3fff700 (LWP 1612) "do-not-directly" 0x00000037a640c590 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  5    Thread 0x7ffff0b46700 (LWP 1611) "do-not-directly" 0x00000037a640c590 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  4    Thread 0x7ffff1347700 (LWP 1610) "do-not-directly" 0x00000037a640f8fd in nanosleep () from /lib64/libpthread.so.0
  3    Thread 0x7ffff1b48700 (LWP 1609) "do-not-directly" 0x00000037a640c590 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  2    Thread 0x7ffff2403700 (LWP 1583) "do-not-directly" 0x00000037a640f8fd in nanosleep () from /lib64/libpthread.so.0
* 1    Thread 0x7ffff270aa00 (LWP 1579) "do-not-directly" 0x00000037a640f8fd in nanosleep () from /lib64/libpthread.so.0
Comment 9 Michel Dänzer 2014-12-01 04:26:54 UTC
Unfortunately, the r600g helper thread doesn't set a distinctive thread name. Please attach (as opposed to paste) the output of 'thread apply all bt' in gdb when setting RADEON_THREAD=0.
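
A possible way to capture that output into a file suitable for attaching is to run gdb in batch mode against the running process. This is only a sketch: the helper names `backtrace_command` and `save_backtraces` are hypothetical, and it assumes gdb is installed and that the caller has permission to attach to the process.

```python
import subprocess


def backtrace_command(pid):
    """Build a gdb command line that dumps backtraces of all threads of `pid`."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def save_backtraces(pid, out_path):
    """Run gdb against `pid` and write everything it prints to `out_path`."""
    with open(out_path, "w") as out:
        # check=False: keep whatever gdb printed even if it exits non-zero.
        subprocess.run(backtrace_command(pid), stdout=out,
                       stderr=subprocess.STDOUT, check=False)
```

Calling `save_backtraces(pid, "threads.txt")` with the game's process ID would leave a text file that can be attached to the bug instead of pasted.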
Comment 10 Shawn Starr 2014-12-01 05:56:56 UTC
Created attachment 110280
Running threads listed
Comment 11 Michel Dänzer 2016-01-29 07:26:14 UTC
Fixed with current kernels.

