Bug 85144 - i965 vaapi / opengl stops working or hangs gpu at least since git commit after git1406031225.081488
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
Reported: 2014-10-17 14:01 UTC by Paulo Dias
Modified: 2015-01-31 10:23 UTC (History)

Attachments
xorg.0.log (50.19 KB, text/plain)
2014-10-21 15:32 UTC, Paulo Dias
xorg.0.log HEAD (49.81 KB, text/plain)
2014-10-21 15:43 UTC, Paulo Dias

Description Paulo Dias 2014-10-17 14:01:12 UTC
if i keep using mesa git, xorg git, radeon git, drm git, kernel 3.17 and above AND intel git1406031225.081488, everything works fine:

- VAAPI works (encoding/decoding)

easily tested by using:

mpv --vo=vaapi --hwdec=vaapi file.mp4 (low cpu usage)
mpv --vo=opengl --hwdec=vaapi file.mp4 (even lower cpu usage)

- OpenGL works fine
- KWin compositing works fine, same for Unity
- Chrome HW accel works fine

NOW, if i upgrade the i965 driver to any commit past git1406031225.081488, all hell breaks loose:

- VAAPI crashes the GPU
- Chrome hangs with segfault, GPU hangs
- KWIN compositing works
- OpenGL works for a few minutes, then the GPU crashes

i have a DELL LATITUDE 3540 (HD4400, RADEON HD8850m) hybrid
Comment 1 Paulo Dias 2014-10-17 14:20:01 UTC
just to clarify, the problems start after any revision past commit 08148896196443a8582c30b47ff546acca78d69c
Comment 2 Chris Wilson 2014-10-17 15:13:50 UTC
All of those are client errors...
Comment 3 Paulo Dias 2014-10-17 15:52:51 UTC
Hi Chris, can you clarify a little further? It would be nice if you could point out whether there's a workaround or a temporary fix.
Comment 4 Paulo Dias 2014-10-17 22:27:13 UTC
after bisecting from the last known good commit to master, i got this from git bisect:

git bisect good 08148896196443a8582c30b47ff546acca78d69c
git bisect bad f33d44f41ef0f287375b7a6b1c117abff5a23b19

Bisecting: 136 revisions left to test after this (roughly 7 steps)
[f4b930318c68e0e07d677ebc7b4caa27912561db] sna/dri2: Replace assertion with code to skip updating the back buffer

now what?
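
For readers unfamiliar with what the bisect is doing here: it is a binary search over the commits between the known-good and known-bad revisions, so each build-and-test round halves the remaining range (which is why 136 remaining revisions come out at roughly 7 more steps). A minimal sketch in C, assuming a strictly linear history; real git history is a graph, so git's own step estimate can differ. `first_bad_index`, `is_bad_fn` and `bad_from_threshold` are hypothetical names for illustration, not part of git.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical predicate: nonzero if the build at commit index i shows the
 * bug. In a real bisect this is "build the driver, restart X, test". */
typedef int (*is_bad_fn)(size_t index, void *ctx);

/* Example predicate: every commit at or after *ctx is bad. */
static int bad_from_threshold(size_t index, void *ctx)
{
    return index >= *(size_t *)ctx;
}

/*
 * Binary search for the first bad commit in a linear history.
 * Precondition: commit `good` is known good, commit `bad` is known bad,
 * good < bad, and every commit after the first bad one is also bad.
 * Returns the index of the first bad commit; *steps (if non-NULL)
 * receives the number of commits that had to be tested.
 */
static size_t first_bad_index(size_t good, size_t bad,
                              is_bad_fn is_bad, void *ctx, size_t *steps)
{
    size_t tested = 0;

    assert(good < bad);
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(mid, ctx))
            bad = mid;   /* bug present: first bad commit is mid or earlier */
        else
            good = mid;  /* bug absent: first bad commit is after mid */
        tested++;
    }
    if (steps)
        *steps = tested;
    return bad;
}
```

In practice the answer to "now what?" is to build and test the commit git checked out, report the result with `git bisect good` or `git bisect bad`, and repeat until git names the first bad commit.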
Comment 5 Paulo Dias 2014-10-18 00:18:21 UTC
found the offending commit:

129656e4a82d4bf799e5c1d75d0dcb4480f6eb09 is the first bad commit
commit 129656e4a82d4bf799e5c1d75d0dcb4480f6eb09
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 17 15:01:06 2014 +0100

    sna/dri2: Disable SwapLimit buffers with buggy prime implementations
    
    If there is a GPU screen, we have to assume that the DRI2 code may pass
    around the wrong pointers to ReuseBufferNotify until the fix is
    released:

    commit 4d92fab39c4225e89f2d157a1f559cb0618a6eaa
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Wed Jun 18 11:14:43 2014 +0100

        dri2: Use the PrimeScreen when creating/reusing buffers

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
                                                                                                                                                                                                                                                   
:040000 040000 02a8a213e19dcd5d8a7a59ec9ef07bbccbe594e5 e3c3e278e5479e30ea417186f1a65d4775aa5701 M      src
Comment 6 Paulo Dias 2014-10-18 00:59:16 UTC
the problem is this snippet:

#if DRI2INFOREC_VERSION < 6

#define xorg_can_triple_buffer() 0
#define swap_limit(d, l) false

#else

#if XORG_VERSION_CURRENT >= XORG_VERSION_NUMERIC(1,15,99,904,0)
/* Prime fixed for triple buffer support */
#define xorg_can_triple_buffer() 1
#elif XORG_VERSION_CURRENT < XORG_VERSION_NUMERIC(1,12,99,901,0)
/* Before numGPUScreens was introduced */
#define xorg_can_triple_buffer() 1
#else
/* Subject to crashers when combining triple buffering and Prime */
inline static bool xorg_can_triple_buffer(void)
{
        return screenInfo.numGPUScreens == 0;
}
#endif

if i change that to return screenInfo.numGPUScreens == 1, then i won't have the chrome/vaapi/opengl crashes, but DRI_PRIME crashes X.
Comment 7 Paulo Dias 2014-10-18 01:16:25 UTC
i mean return 1;
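
The gate in the quoted snippet can be modelled outside the driver to see which setups lose triple buffering. This is only a sketch: the compile-time checks are turned into runtime parameters, the `DRI2INFOREC_VERSION < 6` branch is left out, and `version_numeric` uses an encoding assumed to match `XORG_VERSION_NUMERIC` (only the ordering of versions matters here). `can_triple_buffer` is a hypothetical stand-in, not the driver function.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed encoding, modelled on XORG_VERSION_NUMERIC; only the relative
 * ordering of versions matters for this sketch. */
static long version_numeric(int major, int minor, int patch, int snap)
{
    return major * 10000000L + minor * 100000L + patch * 1000L + snap;
}

/*
 * Hypothetical runtime model of the quoted #if ladder.
 * server_version:  encoded X server version the driver was built against.
 * num_gpu_screens: stand-in for screenInfo.numGPUScreens.
 */
static bool can_triple_buffer(long server_version, int num_gpu_screens)
{
    assert(num_gpu_screens >= 0);

    if (server_version >= version_numeric(1, 15, 99, 904))
        return true;   /* Prime fixed for triple buffer support */
    if (server_version < version_numeric(1, 12, 99, 901))
        return true;   /* before numGPUScreens was introduced */
    /* affected range: only allowed when no GPU (Prime) screen is present */
    return num_gpu_screens == 0;
}
```

With a server in the affected range and one GPU screen (as on a hybrid laptop like the one in this report) the model returns false, which is the case the bisected commit introduced. Forcing the function to return 1 skips that protection, which would explain why DRI_PRIME then crashes X.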
Comment 8 Paulo Dias 2014-10-21 15:32:58 UTC
Created attachment 108190
xorg.0.log
Comment 9 Paulo Dias 2014-10-21 15:43:26 UTC
Created attachment 108191
xorg.0.log HEAD
Comment 10 Paulo Dias 2014-10-21 15:44:20 UTC
When using HEAD:

google-chrome can't use the gpu:

groo@kerberos ~ ->
 13:41 Ter Out 21$ google-chrome
ATTENTION: default value of option force_s3tc_enable overridden by environment.
ATTENTION: option value of option force_s3tc_enable ignored.
[3957:3957:1021/134232:ERROR:CONSOLE(0)] "Error in event handler for (unknown): Cannot read property 'length' of undefined
Stack trace: TypeError: Cannot read property 'length' of undefined
    at refresh_event (chrome-extension://ehhkfhegcenpfoanmgfpfhnmdmflkbgk/js/main.js:469:14)
    at disconnectListener (extensions::messaging:335:9)
    at EventImpl.dispatchToListener (extensions::event_bindings:397:22)
    at Event.publicClass.(anonymous function) [as dispatchToListener] (extensions::utils:93:26)
    at EventImpl.dispatch_ (extensions::event_bindings:379:35)
    at EventImpl.dispatch (extensions::event_bindings:403:17)
    at Event.publicClass.(anonymous function) [as dispatch] (extensions::utils:93:26)
    at dispatchOnDisconnect (extensions::messaging:290:27)", source: chrome-extension://ehhkfhegcenpfoanmgfpfhnmdmflkbgk/index.html (0)
[3,364554496:15:42:32.587568] Native Client module will be loaded at base address 0x00006c1300000000
[3957:4071:1021/134233:ERROR:get_updates_processor.cc(240)] PostClientToServerMessage() failed during GetUpdates
[3957:3957:1021/134238:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension oenpjldbckebacipkfbcoppmiflglnib
[3957:3957:1021/134238:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension hehilldlghfkbmmojagnecggemfkfpcc
[3957:3957:1021/134238:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension ajeeigigjapddbkkekjgpgolgodcpblc
[3957:3957:1021/134238:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension fkmopoamfjnmppabeaphohombnjcjgla
[3999:4007:1021/154244:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
ATTENTION: default value of option force_s3tc_enable overridden by environment.
ATTENTION: option value of option force_s3tc_enable ignored.
[WARNING:flash/platform/pepper/pep_module.cpp(63)] SANDBOXED
[4406:4411:1021/154257:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
ATTENTION: default value of option force_s3tc_enable overridden by environment.
ATTENTION: option value of option force_s3tc_enable ignored.
[3957:3992:1021/134301:ERROR:connection_factory_impl.cc(344)] Failed to connect to MCS endpoint with error -7
[4469:4474:1021/154309:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.

dmesg:

[   46.960033] Watchdog[3133]: segfault at 0 ip 00007f1f7ac8a2db sp 00007f1f6a46a7d0 error 6 in chrome[7f1f76ef0000+4ffd000]
[   71.492683] Watchdog[3573]: segfault at 0 ip 00007fc6af6cc2db sp 00007fc69eeac7d0 error 6 in chrome[7fc6ab932000+4ffd000]
[   89.967681] Watchdog[3647]: segfault at 0 ip 00007f132d6ce2db sp 00007f131ceae7d0 error 6 in chrome[7f1329934000+4ffd000]
[  179.021128] Watchdog[4007]: segfault at 0 ip 00007f7dde7a12db sp 00007f7dcdf817d0 error 6 in chrome[7f7ddaa07000+4ffd000]
[  191.434554] Watchdog[4411]: segfault at 0 ip 00007f24aacd02db sp 00007f249a4b07d0 error 6 in chrome[7f24a6f36000+4ffd000]
[  203.654600] Watchdog[4474]: segfault at 0 ip 00007fefb6f482db sp 00007fefa67287d0 error 6 in chrome[7fefb31ae000+4ffd000]
Comment 11 Paulo Dias 2014-10-21 18:46:56 UTC
if i disable both:

TripleBuffer off
SwapbuffersWait off

the driver works as expected, but i lose PRIME (Xorg doesn't allow declaring 2 devices and keeping PRIME, apparently)
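
For reference, a sketch of how those two options would look in an xorg.conf Device section, assuming the option names documented in the intel(4) man page (`TripleBuffer` and `SwapbuffersWait`). The Identifier string is a placeholder, and this fragment does nothing about the PRIME limitation mentioned above.

```
Section "Device"
    Identifier "Intel Graphics"
    Driver     "intel"
    Option     "TripleBuffer"    "off"
    Option     "SwapbuffersWait" "off"
EndSection
```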
Comment 12 Chris Wilson 2014-10-21 20:42:48 UTC
The two xtraces (hanging in GetBuffers) look to be the result of SwapLimiting, should be fixed by

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index 6359377..c8c71c5 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -309,6 +309,9 @@ sna_dri2_reuse_buffer(DrawablePtr draw, DRI2BufferPtr buffer)
 
 static bool swap_limit(DrawablePtr draw, int limit)
 {
+       if (!xorg_can_triple_buffer())
+               return false;
+
        DBG(("%s: draw=%ld setting swap limit to %d\n", __FUNCTION__, (long)draw->id, limit));
        DRI2SwapLimit(draw, limit);
        return true;

I think.
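
The effect of that early return can be sketched as a standalone fragment. The server's `DRI2SwapLimit` is replaced by a stub that records calls, `xorg_can_triple_buffer()` by a plain flag, and the drawable by an integer id; all three are stand-ins so the logic can be exercised without an X server, and the names `triple_buffer_allowed` and `stub_swap_limit` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static bool triple_buffer_allowed;  /* stand-in for xorg_can_triple_buffer() */
static int  swap_limit_calls;       /* how often the stub below was reached */
static int  last_limit;             /* last limit passed to the stub */

/* Stub standing in for the server's DRI2SwapLimit(); records the request. */
static void stub_swap_limit(int drawable_id, int limit)
{
    (void)drawable_id;
    swap_limit_calls++;
    last_limit = limit;
}

/* Mirrors the patched swap_limit(): refuse before touching the server
 * whenever triple buffering is not considered safe. */
static bool swap_limit(int drawable_id, int limit)
{
    assert(limit > 0);
    if (!triple_buffer_allowed)
        return false;

    stub_swap_limit(drawable_id, limit);
    return true;
}
```

With the flag false the stub is never reached, so the swap limit is left untouched on setups where triple buffering is unsafe; that is the behaviour the patch restores.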
Comment 13 Paulo Dias 2014-10-21 22:31:40 UTC
THAT DID IT! it fixed the crashes, vaapi, etc :D im happy as a clam :D
Comment 14 Paulo Dias 2014-10-21 22:34:45 UTC
spoke a little too soon :P it's much better, but chrome still crashes (although it comes back, the gpu resets)

[3098:3098:1021/202821:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension fkmopoamfjnmppabeaphohombnjcjgla
[WARNING:flash/platform/pepper/pep_module.cpp(63)] SANDBOXED
[3138:3154:1021/223206:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
ATTENTION: default value of option force_s3tc_enable overridden by environment.
[3098:3098:1021/203217:ERROR:gpu_process_transport_factory.cc(442)] Lost UI shared context.
ATTENTION: option value of option force_s3tc_enable ignored.
[4088:4099:1021/223258:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.

i can reproduce this crash when, for example, i go to the edit combo in this bug report and try to change the status.

but overall it works, i have 4 tabs opened without crashes, but its still a little unstable.
Comment 15 Paulo Dias 2014-10-21 22:39:28 UTC
spoke too soon, vaapi and prime are now stable, chrome still crashes after a few opened tabs and some gpu crashes

ATTENTION: default value of option force_s3tc_enable overridden by environment.
ATTENTION: option value of option force_s3tc_enable ignored.
[3098:3225:1021/202814:ERROR:get_updates_processor.cc(240)] PostClientToServerMessage() failed during GetUpdates
[3098:3098:1021/202814:ERROR:CONSOLE(0)] "Error in event handler for (unknown): Cannot read property 'length' of undefined
Stack trace: TypeError: Cannot read property 'length' of undefined
    at refresh_event (chrome-extension://ehhkfhegcenpfoanmgfpfhnmdmflkbgk/js/main.js:469:14)
    at disconnectListener (extensions::messaging:335:9)
    at EventImpl.dispatchToListener (extensions::event_bindings:397:22)
    at Event.publicClass.(anonymous function) [as dispatchToListener] (extensions::utils:93:26)
    at EventImpl.dispatch_ (extensions::event_bindings:379:35)
    at EventImpl.dispatch (extensions::event_bindings:403:17)
    at Event.publicClass.(anonymous function) [as dispatch] (extensions::utils:93:26)
    at dispatchOnDisconnect (extensions::messaging:290:27)", source: chrome-extension://ehhkfhegcenpfoanmgfpfhnmdmflkbgk/index.html (0)
[3,1485068544:22:28:14.930413] Native Client module will be loaded at base address 0x0000258000000000
[3098:3098:1021/202821:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension oenpjldbckebacipkfbcoppmiflglnib
[3098:3098:1021/202821:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension hehilldlghfkbmmojagnecggemfkfpcc
[3098:3098:1021/202821:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension ajeeigigjapddbkkekjgpgolgodcpblc
[3098:3098:1021/202821:ERROR:extension_downloader.cc(700)] Invalid URL: '' for extension fkmopoamfjnmppabeaphohombnjcjgla
[WARNING:flash/platform/pepper/pep_module.cpp(63)] SANDBOXED
[3138:3154:1021/223206:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
ATTENTION: default value of option force_s3tc_enable overridden by environment.
[3098:3098:1021/203217:ERROR:gpu_process_transport_factory.cc(442)] Lost UI shared context.
ATTENTION: option value of option force_s3tc_enable ignored.
[4088:4099:1021/223258:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
ATTENTION: default value of option force_s3tc_enable overridden by environment.
[3098:3098:1021/203309:ERROR:gpu_process_transport_factory.cc(442)] Lost UI shared context.
ATTENTION: option value of option force_s3tc_enable ignored.
[282:288:1021/203309:ERROR:gpu_channel_host.cc(151)] GpuChannelHost::CreateViewCommandBuffer failed.
[282:288:1021/203309:ERROR:webgraphicscontext3d_command_buffer_impl.cc(243)] Failed to initialize command buffer.
[4206:4212:1021/223527:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
[3098:3098:1021/203527:ERROR:gpu_process_transport_factory.cc(442)] Lost UI shared context.
[3098:3134:1021/203647:ERROR:channel.cc(316)] RawChannel read error (connection broken)
[3098:3134:1021/203647:ERROR:channel.cc(316)] RawChannel read error (connection broken)

but much better
Comment 16 Paulo Dias 2014-10-21 22:49:34 UTC
if i disable compositing in kwin for ex, i dont get the rendering problems, so something broke with this code and compositing.
Comment 17 Chris Wilson 2014-10-22 06:23:51 UTC
(In reply to Paulo Dias from comment #16)
> if i disable compositing in kwin for ex, i dont get the rendering problems,
> so something broke with this code and compositing.

kwin itself is broken, https://bugs.kde.org/show_bug.cgi?id=336589

GPU hangs from a client process are more than likely a client bug. Look at the error state (/sys/class/drm/card0/error)
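
The error state file named above is plain text, so the first part of it can be captured for a bug report with a few lines of C. A minimal sketch: `copy_error_state` is a hypothetical helper, and reading the sysfs file may require root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy lines from the file at `path` to `out` until max_lines complete
 * (newline-terminated) lines have been copied or the file ends.
 * Returns the number of complete lines copied, or -1 if the file cannot
 * be opened. A final line without a newline is copied but not counted.
 */
static int copy_error_state(const char *path, FILE *out, int max_lines)
{
    char buf[4096];
    int lines = 0;
    FILE *in;

    assert(path && out);
    in = fopen(path, "r");
    if (!in)
        return -1;

    while (lines < max_lines && fgets(buf, sizeof buf, in)) {
        size_t len = strlen(buf);
        fputs(buf, out);
        if (len > 0 && buf[len - 1] == '\n')
            lines++;
    }
    fclose(in);
    return lines;
}
```

A call such as `copy_error_state("/sys/class/drm/card0/error", stdout, 40)` would print the start of the error state right after a hang; the full file is what should be attached to a report.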
Comment 18 Chris Wilson 2015-01-31 10:23:55 UTC
Ah, looks like the Chrome crash is a known unrelated issue and this was just fake triple buffering breakage that we fixed.