109072 – GPU hang in blender 2.80

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Kenneth Graunke
QA Contact:	Intel 3D Bugs Mailing List

Reported:	2018-12-16 21:14 UTC by Vladimir Pinchuk
Modified:	2019-01-04 11:30 UTC
CC List:	2 users

Attachments
dmesg (7.55 MB, text/plain) 2018-12-16 21:14 UTC, Vladimir Pinchuk
/sys/class/drm/card0/error (38.22 KB, text/plain) 2018-12-16 21:15 UTC, Vladimir Pinchuk
lscpu (1.40 KB, text/plain) 2018-12-24 09:40 UTC, Vladimir Pinchuk
how it looks like (1.55 MB, image/gif) 2018-12-25 12:51 UTC, Vladimir Pinchuk
apitrace (26.58 MB, application/octet-stream) 2018-12-26 10:01 UTC, Vladimir Pinchuk

Description Vladimir Pinchuk 2018-12-16 21:14:15 UTC

Created attachment 142829 [details]
dmesg

This can be reproduced on Fedora 29, with both the stock kernel and a drm-tip kernel, by opening Blender 2.80 and rotating the cube for a while. The whole desktop freezes for about 10 seconds and then unfreezes.

Logs from:
4.20.0-rc6+ x86_64

Comment 1 Vladimir Pinchuk 2018-12-16 21:15:28 UTC

Created attachment 142830 [details]
/sys/class/drm/card0/error

Comment 2 Vladimir Pinchuk 2018-12-16 21:26:05 UTC

It is a Dell XPS 13 9360

Comment 3 Lakshmi 2018-12-18 04:17:50 UTC

Product is set as Mesa assuming it's a Mesa bug.

Comment 4 Vladimir Pinchuk 2018-12-23 08:36:11 UTC

Reproduced it with GNOME on Xorg.

Comment 5 Danylo 2018-12-24 09:32:48 UTC

Hello, 

We'll also need Mesa's version.

Also, your CPU has UHD Graphics 620, correct?

Comment 6 Vladimir Pinchuk 2018-12-24 09:39:31 UTC

These logs are from Mesa 18.2.6, but I could also reproduce the same bug on 18.3.1.

Comment 7 Vladimir Pinchuk 2018-12-24 09:40:07 UTC

Created attachment 142878 [details]
lscpu

Comment 8 Vladimir Pinchuk 2018-12-24 15:02:05 UTC

I just realized that this large dmesg is hard to navigate. When booted without drm.debug, the relevant lines are:

[65454.798345] [drm] GPU HANG: ecode 9:0:0x84dffefc, in blender [31698], reason: hang on rcs0, action: reset
[65454.798349] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[65454.798351] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[65454.798352] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[65454.798354] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[65454.798356] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[65454.799424] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
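
For anyone digging through a similarly large dmesg capture, a small filter can pull out just the hang reports. This is a minimal sketch in Python, assuming the message format shown above; the function name and the regular expression are illustrative, not part of any i915 or Mesa tooling, and other kernel versions may word the message differently.

```python
import re

# Pattern modelled on the i915 "GPU HANG" line quoted above (assumption:
# other kernel versions may format this message differently).
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\S+), in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.*?), action: (?P<action>\w+)"
)


def find_gpu_hangs(dmesg_text):
    """Return one dict per GPU HANG line found in the given dmesg text."""
    hangs = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            hangs.append(match.groupdict())
    return hangs
```

Feeding it the output of `dmesg` (or the attached file) gives the error code, process name, PID, reason, and action for each hang, which makes it easier to compare reports from different machines.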

Comment 9 Danylo 2018-12-24 15:22:19 UTC

I was going to ask for an apitrace of Blender, but apitrace currently crashes when replaying the trace...

I cannot reproduce the hang with:

i7-7500U - HD Graphics 620

Arch Linux
Blender 2.8
Mesa 18.2.4/18.3.1/master
Kernel 4.19.2

Maybe someone else will be able to reproduce it, or I'll fix apitrace.
Also, I haven't looked closely at the crash dump yet.

Comment 10 Vladimir Pinchuk 2018-12-24 19:54:29 UTC

Is a bisect of the Blender codebase worth trying? The hang does not occur with 2.79b.

Comment 11 Denis 2018-12-25 11:03:26 UTC

Hi Vladimir, I don't think that would help right now. I also tried Blender 2.80 on a UHD 630 GPU, but on Ubuntu 18 under X. I couldn't reproduce the hang either, so it might be specific to Wayland. I'll install it tomorrow and check.

To be sure that I have the steps right:
1. Launch the app.
2. Rotate the cube with the middle mouse button. (For how long in your case?)
3. The hang occurs?

Comment 12 Vladimir Pinchuk 2018-12-25 12:50:09 UTC

Yes, the steps are correct. The time before the hang occurs varies, but it is usually under 10 seconds.

Comment 13 Vladimir Pinchuk 2018-12-25 12:51:58 UTC

Created attachment 142888 [details]
how it looks like

Recorded under GNOME on Wayland with Peek.

Comment 14 Vladimir Pinchuk 2018-12-25 12:57:57 UTC

As I already mentioned, this happens on both Xorg and Wayland, even if I run the Windows version of Blender under Wine. The only difference is that on Xorg the mouse cursor still moves, while on Wayland it is frozen too.

Comment 15 Danylo 2018-12-26 09:39:18 UTC

Could you record an apitrace of Blender?

https://github.com/apitrace/apitrace/blob/master/docs/USAGE.markdown

> apitrace trace blender-2.8

Then post the trace file. I found that I couldn't replay the Blender 2.8 trace because of a possibly unnecessary assert; with the assert removed, replay works.

If you have some time, you can build apitrace yourself (https://github.com/apitrace/apitrace/blob/master/docs/INSTALL.markdown) and check whether replaying the trace hangs the GPU. You would need to comment out one line:

retrace/glws_glx.cpp:147

> //assert(!pbuffer);

Since the steps are simple, I doubt the trace will hang on other machines, but it is still worth checking.

Also, you had GNOME in all cases, so it may be worth checking with another desktop environment.

Comment 16 Vladimir Pinchuk 2018-12-26 10:01:15 UTC

Created attachment 142898 [details]
apitrace

Apitrace of blender. I have built apitrace and checked, and it indeed reproduces the hang. The hang also occurs if I run blender under Xwayland in weston, so I doubt it is related to the DE. I haven't tried on bare X though.

Comment 17 Danylo 2018-12-26 10:06:34 UTC

Thanks!

The hang is reproducible with this trace:

> [ 4796.729606] [drm] GPU HANG: ecode 9:0:0x84dffefc, in glretrace [13521], reason: hang on rcs0, action: reset

Comment 18 Mark Janes 2018-12-26 14:17:36 UTC

I can't reproduce this on SKL with Debian testing:

   Linux 4.18.0 
   mesa 18.1.9

I built mesa master, and tested with a drm-tip kernel as well, and couldn't reproduce with the trace file.  I tried XWayland and Xorg.

Usually, KBL and SKL have similar failure patterns.  Danylo, does the hang reproduce every time you retrace the file on KBL?

Comment 19 Danylo 2018-12-26 14:24:58 UTC

> Usually, KBL and SKL have similar failure patterns.  Danylo, does the hang reproduce every time you retrace the file on KBL?

Yes, it always hangs on Kaby Lake and Coffee Lake; I didn't test on other machines.

Comment 20 Andrii K 2018-12-27 09:17:18 UTC

Bisected to commit: a363bb2cd0e2a141f2c60be005009703bffcbe4e
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Tue Apr 10 01:18:25 2018 -0700

    i965: Allocate VMA in userspace for full-PPGTT systems.
    
    This patch enables soft-pinning of all buffers, allowing us to skip
    relocation processing entirely.  All systems with full PPGTT and > 4GB
    of VMA should gain these benefits.  This should be most Gen8+.
    
    Unfortunately, this excludes a few systems:
    - Cherryview (only has 32-bit addressing, despite 48-bit pointers)
    - Broadwell with a 32-bit kernel
    - Anybody running pre-4.5 kernel.
    
    We may enable it for Cherryview in the future, but it would require
    some tweaks to the memory zone.
    
    Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>

Comment 21 Andrii K 2018-12-27 09:18:05 UTC

With this commit and INTEL_DEBUG=reemit there is no hang.
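
To repeat this check against the attached trace, the replay can be launched with the variable set. The following is a minimal sketch in Python, assuming `glretrace` (the replayer named in the kernel log above) is on the PATH; the helper names and the trace filename are hypothetical, and any existing INTEL_DEBUG value is simply overridden.

```python
import os
import subprocess


def build_reemit_env(base_env=None):
    """Return a copy of the environment with INTEL_DEBUG=reemit set.

    Any INTEL_DEBUG value already present is overridden (a simplification
    made for this sketch).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["INTEL_DEBUG"] = "reemit"
    return env


def replay_with_reemit(trace_path, retrace_binary="glretrace"):
    """Replay a trace with INTEL_DEBUG=reemit and return the exit code."""
    completed = subprocess.run(
        [retrace_binary, trace_path], env=build_reemit_env()
    )
    return completed.returncode


if __name__ == "__main__":
    # "blender.trace" is a placeholder name for the attached apitrace file.
    print(replay_with_reemit("blender.trace"))
```

Running the same replay once with and once without the variable, then checking dmesg for a new GPU HANG line, shows whether the setting makes the hang go away on a given machine.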

Comment 22 Lionel Landwerlin 2019-01-02 12:54:24 UTC

I managed to get some understanding of what's going on here:

Since we switched to softpinning all buffers, the vertex buffers are no longer restricted to the low 4GB region, so we run into the same HW issue as before.
In effect, softpinning VBOs nullifies the 32-bit reloc flag in genX(emit_vertex_buffer_state) (genX_state_upload.c).

I'm not quite sure how to fix this apart from disabling softpinning on all buffer objects, because buffers can be reused from one type to another (transform feedback output as vertices, for instance)...

Comment 23 Kenneth Graunke 2019-01-02 19:04:01 UTC

(In reply to Lionel Landwerlin from comment #22)
> I managed to get some understanding of what's going on here:
> 
> Since we switched to softpinning all buffers, the vertex buffers are no
> longer restricted to the low 4GB region, so we run into the same HW issue
> as before. In effect, softpinning VBOs nullifies the 32-bit reloc flag in
> genX(emit_vertex_buffer_state) (genX_state_upload.c).
> 
> I'm not quite sure how to fix this apart from disabling softpinning on all
> buffer objects, because buffers can be reused from one type to another
> (transform feedback output as vertices, for instance)...

We ought to be doing VF cache invalidations when VB[i] or IB transition between different 4GB segments.  But, maybe we're not doing those properly. :(
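
The idea of invalidating on a 4GB segment transition can be sketched as follows. This is an illustration in Python rather than the driver's C code, and the class and method names are hypothetical; it only models the bookkeeping (remember the upper 32 address bits per vertex buffer slot, and report when they change), not how i965 actually emits the invalidation. Treating the first binding of a slot as not needing an invalidation is an assumption of this sketch.

```python
class VertexBufferTracker:
    """Tracks the upper 32 address bits last seen for each VB slot."""

    def __init__(self):
        # slot index -> upper 32 bits of the last address bound there
        self.last_high_bits = {}

    def needs_vf_invalidate(self, slot, address):
        """Return True if this binding moves the slot to another 4GB segment."""
        high = address >> 32
        previous = self.last_high_bits.get(slot)
        self.last_high_bits[slot] = high
        # Assumption: the first binding of a slot has no earlier segment
        # to conflict with, so it does not request an invalidation.
        return previous is not None and previous != high


tracker = VertexBufferTracker()
print(tracker.needs_vf_invalidate(0, 0x0000_1000))    # first use of slot 0
print(tracker.needs_vf_invalidate(0, 0x1_0000_1000))  # crosses into the next 4GB segment
```

With softpinning, addresses above 4GB become possible for any buffer, so a check of this kind has to run for every vertex buffer (and the index buffer) on each draw rather than relying on a relocation flag.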

Comment 24 Lionel Landwerlin 2019-01-03 16:58:27 UTC

With this MR, the hang no longer occurs on my system when replaying the trace:

https://gitlab.freedesktop.org/mesa/mesa/merge_requests/62

Comment 25 Lionel Landwerlin 2019-01-04 11:30:12 UTC

This should be fixed on master by the following commit:

commit 31e4c9ce400341df9b0136419b3b3c73b8c9eb7e
Author: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Date:   Thu Jan 3 16:18:48 2019 +0000

    i965: add CS stall on VF invalidation workaround

Thanks for reporting this!
