Bug 99720

Summary: GPU HANG: ecode 9:0:0x85dffffb with LibreOffice
Product: Mesa
Reporter: Jan Drewes <dr.jan.drewes>
Component: Drivers/DRI/i965
Assignee: vladimir.campos
Status: RESOLVED FIXED
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal
Priority: high
CC: alejandro_aero, dannagifford, gary.c.wang, intel-gfx-bugs, listes, ricardo.vega, sebastian_himmler
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments: /sys/class/drm/card1/error
another error log, this time from /sys/class/drm/card0/error
Xorg.0.log
/sys/class/drm/card0/error
lspci -vvv
gpu crash dump
lspci -vvv

Description Jan Drewes 2017-02-09 08:17:43 UTC
Created attachment 129423 [details]
/sys/class/drm/card1/error

[16267.939845] [drm] GPU HANG: ecode 9:0:0x85dffffb, in plasmashell [1732], reason: Hang on render ring, action: reset
[16267.939847] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[16267.939849] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[16267.939850] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[16267.939851] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[16267.939852] [drm] GPU crash dump saved to /sys/class/drm/card1/error
[16267.939947] drm/i915: Resetting chip after gpu hang
[16267.940489] [drm] GuC firmware load skipped
[16269.953879] [drm] RC6 on

This happened on an Optimus system while using the Intel GPU for the Plasma desktop, on Kubuntu 16.10. All Kubuntu updates as of Feb 08 2017 are installed.
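
For anyone triaging similar logs, the hang line above carries the error code, the process, and the reason in a fixed-looking layout. A minimal sketch of pulling those fields out follows. The regular expression is inferred only from the log lines quoted in this report, not from kernel documentation, and `parse_hang` is a hypothetical helper name.

```python
import re

# Field layout inferred from the "GPU HANG" lines quoted in this report.
# This is an assumption, not a documented kernel log format.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fx:]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_hang(line):
    """Return the hang fields as a dict, or None if the line is not a hang report."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


# Sample line copied from the dmesg output in the description.
sample = (
    "[16267.939845] [drm] GPU HANG: ecode 9:0:0x85dffffb, in plasmashell [1732], "
    "reason: Hang on render ring, action: reset"
)
print(parse_hang(sample))
```

Lines that are not hang reports (for example the "RC6 on" line) simply return `None`, so the helper can be mapped over a whole dmesg dump.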
Comment 1 Jan Drewes 2017-05-13 00:30:30 UTC
Created attachment 131342 [details]
another error log, this time from /sys/class/drm/card0/error

This bug triggers an X restart every time it occurs, causing me to lose all unsaved work. It appears to happen most frequently when using LibreOffice Impress to edit a complex slide.

I have since upgraded to Kubuntu 17.04 and the bug remains unchanged, as far as I can tell.

Note: I am using a custom kernel with an APST fix applied (see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184), but I have encountered the bug with the stock kernel as well. Anything NVMe related (like the APST fix) should not affect the graphics stack anyways.
Comment 2 Mark Janes 2017-05-13 01:18:00 UTC
Mesa devs at Intel have not had a good way to reproduce this hang.  If you can attach a complex slide that causes gpu hang reliably, we might be able to fix this issue.

Some gpu hangs have been fixed upstream in linux, mesa, and sna.  You may find the issue resolved with upstream sources.  Switching to modesetting may improve things also.  Please indicate your hardware and attach your xorg.log with the repro steps.
Comment 3 Chris Wilson 2017-05-13 08:03:22 UTC
(In reply to Mark Janes from comment #2)
> Mesa devs at Intel have not had a good way to reproduce this hang.  If you
> can attach a complex slide that causes gpu hang reliably, we might be able
> to fix this issue.
> 
> Some gpu hangs have been fixed upstream in linux, mesa, and sna.  You may
> find the issue resolved with upstream sources.  Switching to modesetting may
> improve things also.  Please indicate your hardware and attach your xorg.log
> with the repro steps.

This is a mesa bug, please stop deflecting. The hardware is reported in the error state.
Comment 4 Mark Janes 2017-05-15 18:17:23 UTC
I'm not deflecting, I'm asking for help reproducing the issue.
Comment 5 Jan Drewes 2017-06-15 13:49:09 UTC
I have been editing many libre-office documents, and all of them (in impress, in writer) appear to be able to trigger the bug.

Once the bug has happened, it appears to be much more likely to happen again unless I reboot the machine (restarting the X server does not help).

I am now running the latest ubuntu kernel (Linux 4.10.0-23-generic #25-Ubuntu SMP Fri Jun 9 09:39:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux) and the situation is unchanged.
Could the fact that this machine has a 4K/UHD screen contribute somehow, as my other Intel machine (HP Core M tablet with FHD screen) has never crashed like this yet? Or might it be a hidden Optimus problem? (Both nvidia and nouveau kernel modules are blacklisted/not loaded, bumblebee is not configured, and the nvidia GPU is switched off at all times.)

As any text/graphics document in LibreOffice appears to trigger the bug, I am not sure it would make sense to attach one. They all contain some amount of personal/confidential information.

What else can I do to help chase this down or at least find a workaround?

Cheers

Jan
Comment 6 Jan Drewes 2017-06-15 14:07:25 UTC
Created attachment 131981 [details]
Xorg.0.log

added Xorg log file
Comment 7 Mark Janes 2017-06-20 21:06:31 UTC
Jan, our team set up a HiDPI SKL GT2 with the default Kubuntu 16.10 install and spent hours manipulating large LibreOffice Impress documents. We couldn't generate a GPU hang.
Comment 8 Jan Drewes 2017-06-23 08:16:13 UTC
I really do appreciate your efforts. 
However, the problem remains. The error has changed a little, though: it appears that a crash report is no longer saved.

Jun 23 09:43:42 Trouble kernel: [26432.693147] [drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [1223], reason: Hang on render ring, action: reset 
Jun 23 09:43:42 Trouble kernel: [26432.693178] drm/i915: Resetting chip after gpu hang 
Jun 23 09:43:42 Trouble kernel: [26432.693294] [drm] RC6 on 
Jun 23 09:43:42 Trouble kernel: [26432.707575] [drm] GuC firmware load skipped 
Jun 23 09:43:45 Trouble systemd[1]: Started Session 17 of user jan. 
Jun 23 09:43:52 Trouble kernel: [26442.636333] drm/i915: Resetting chip after gpu hang 
Jun 23 09:43:52 Trouble kernel: [26442.636402] [drm] RC6 on 
Jun 23 09:43:52 Trouble kernel: [26442.650659] [drm] GuC firmware load skipped 

On Kubuntu, all updates as of June 23 2017.
Linux Trouble 4.10.0-25-generic #29-Ubuntu SMP Tue Jun 20 15:00:02 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


...and until now it has ONLY happened with LibreOffice, never with anything else (OpenGL games, VirtualBox/Windows10/MicrosoftOffice2016, wine/MicrosoftOffice2010, Matlab, intensive use of Firefox, etc.).

I am not sure what to do here... just wait for the next (K)ubuntu-version to hopefully fix things?

Would there be any promise in applying the 2017Q1 Intel Graphics Stack Recipe, or should I try to get some updated mesa from somewhere? Any recommendations?
Comment 9 Mark Janes 2017-06-23 17:04:25 UTC
Use the oibaf ppa if it is compatible with kubuntu.
Comment 10 Hesham Ahmed 2017-06-24 20:54:28 UTC
Created attachment 132225 [details]
/sys/class/drm/card0/error

I am also seeing the same bug on very similar hardware: Arch Linux with kernel 4.11.6 and Gnome 3.24.2 on a Dell Precision 5510 with a 4K screen. The issue is impossible to reproduce reliably. It always appears at random while using LibreOffice; I mostly use Calc, but the hang also happens in Writer and Impress. Repeating the same task on the same file doesn't reproduce the error. As is the case with the OP, the nvidia GPU is switched off.

[drm] GPU HANG: ecode 9:0:0x85dffffb, in Xwayland [7741], reason: Hang on render ring, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
kernel: drm/i915: Resetting chip after gpu hang
[drm] RC6 on
org.gnome.Shell.desktop[7718]: libinput error: libinput bug: timer: offset negative (-2080361)
[drm] GuC firmware load skipped
drm/i915: Resetting chip after gpu hang
[drm] RC6 on
[drm] GuC firmware load skipped
org.gnome.Shell.desktop[7718]: intel_do_flush_locked failed: Input/output error
libreoffice-calc.desktop[27753]: X IO Error
Comment 11 Jan Drewes 2017-06-25 01:53:49 UTC
I tried the oibaf ppa in the past (maybe 2 months ago), but the error remained, and I purged the ppa again.

I guess Hesham Ahmed's post also tells me that I don't need to try the mainline kernels from 4.11.x. Would it make sense for me to try 4.12-rc6?

I guess there is no immediate way to identify what LibreOffice is doing that no other software seems to do in order to trigger the bug?
Comment 12 Jan Drewes 2017-06-25 06:29:31 UTC
...and now it is also hitting me on another Intel-based system. As I could not work reliably on my main system (Precision 5110, see reports above), I have switched to my other laptop, an HP Elite X2 1012 G1 (tablet-type, Skylake Core M, Intel(R) Core(TM) m5-6Y54 CPU). Again in LibreOffice, the screen suddenly froze (but not the mouse), and I found this in syslog:


Jun 25 08:20:11 Chaos kernel: [12746.287681] [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [1150], reason: Hang on render ring, action: reset 
Jun 25 08:20:11 Chaos kernel: [12746.287752] drm/i915: Resetting chip after gpu hang 
Jun 25 08:20:11 Chaos kernel: [12746.289722] [drm] RC6 on 
Jun 25 08:20:11 Chaos kernel: [12746.307587] [drm] GuC firmware load skipped 
Jun 25 08:20:31 Chaos kernel: [12766.193242] drm/i915: Resetting chip after gpu hang 
Jun 25 08:20:31 Chaos kernel: [12766.193350] [drm] RC6 on 
Jun 25 08:20:31 Chaos kernel: [12766.210600] [drm] GuC firmware load skipped 
Jun 25 08:20:50 Chaos kernel: [12785.200890] drm/i915: Resetting chip after gpu hang 
Jun 25 08:20:50 Chaos kernel: [12785.202985] [drm] RC6 on 
Jun 25 08:20:50 Chaos kernel: [12785.214964] [drm] GuC firmware load skipped 
Jun 25 08:21:02 Chaos kernel: [12797.168631] drm/i915: Resetting chip after gpu hang 
Jun 25 08:21:02 Chaos kernel: [12797.168756] [drm] RC6 on 
Jun 25 08:21:02 Chaos kernel: [12797.180730] [drm] GuC firmware load skipped 

After a little while, X restarted (...work lost...).

I will assume this is the same bug?
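
The repeated "Resetting chip after gpu hang" lines above suggest the GPU is being reset again and again before X finally gives up. A rough sketch for measuring the gaps between resets is shown here; it assumes the bracketed number is the kernel timestamp in seconds, and `reset_intervals` is a hypothetical helper name.

```python
import re

# Matches the bracketed kernel timestamp in front of the reset message.
# Assumes the timestamp is in seconds, as is usual for dmesg output.
RESET_RE = re.compile(r"\[\s*(\d+\.\d+)\] drm/i915: Resetting chip after gpu hang")


def reset_intervals(lines):
    """Return the time in seconds between consecutive GPU reset messages."""
    stamps = []
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            stamps.append(float(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Reset lines copied from the syslog excerpt above.
log = [
    "Jun 25 08:20:11 Chaos kernel: [12746.287752] drm/i915: Resetting chip after gpu hang",
    "Jun 25 08:20:11 Chaos kernel: [12746.289722] [drm] RC6 on",
    "Jun 25 08:20:31 Chaos kernel: [12766.193242] drm/i915: Resetting chip after gpu hang",
    "Jun 25 08:20:50 Chaos kernel: [12785.200890] drm/i915: Resetting chip after gpu hang",
    "Jun 25 08:21:02 Chaos kernel: [12797.168631] drm/i915: Resetting chip after gpu hang",
]
for gap in reset_intervals(log):
    print(f"{gap:.1f} s between resets")
```

In this excerpt the resets are roughly 12 to 20 seconds apart, which matches the impression of a screen that stays frozen for about a minute before X restarts.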
Comment 13 Jan Drewes 2017-06-25 06:34:17 UTC
Created attachment 132228 [details]
lspci -vvv

Added output of lspci -vvv
Comment 14 Jan Drewes 2017-06-25 06:49:38 UTC
Bugs that appear similar to me, all involve GPU hangs and LibreOffice:
Bug 100905
Bug 100794
Bug 95062
Comment 15 Jan Drewes 2017-06-25 07:03:11 UTC
In Bug 95062 Danna Gifford suggested the following:

>> A work-around seems to be starting Libre Office without hardware acceleration
>> with the variable LIBGL_ALWAYS_SOFTWARE
>> 
>> e.g. to start Impress from the terminal
>> $ LIBGL_ALWAYS_SOFTWARE=1 loimpress

I am testing this now.
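
For those who would rather not type the variable each time, the same workaround can be wrapped in a small launcher. This is only a sketch: `software_gl_env` and `launch_with_software_gl` are hypothetical helper names, and it assumes the LibreOffice launcher (`loimpress` in the quoted command) is on PATH.

```python
import os
import subprocess


def software_gl_env(base_env=None):
    """Return a copy of the environment with Mesa software rendering forced."""
    env = dict(os.environ if base_env is None else base_env)
    # Same variable as in the quoted shell command.
    env["LIBGL_ALWAYS_SOFTWARE"] = "1"
    return env


def launch_with_software_gl(command):
    """Start `command` (a list of arguments) with software rendering.

    Returns the subprocess.Popen handle without waiting for the program to exit.
    """
    return subprocess.Popen(command, env=software_gl_env())


# Hypothetical usage, assuming `loimpress` is installed and on PATH:
# launch_with_software_gl(["loimpress"])
```

The variable is set only for the launched program, so the desktop itself keeps using hardware acceleration.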
Comment 16 Jan Drewes 2017-06-25 08:22:33 UTC
Another similar report (hang with libreoffice) found on LKML: 
http://lkml.iu.edu/hypermail/linux/kernel/1704.3/01901.html
Comment 17 Paul-Antoine Arras 2017-06-25 11:40:58 UTC
*** Bug 100905 has been marked as a duplicate of this bug. ***
Comment 18 Jan Drewes 2017-06-26 02:39:04 UTC
Re testing the workaround: yesterday, I worked with LibreOffice (multiple documents) for several hours without a crash. Before, being able to work for that long would have been highly unlikely. Therefore, it seems to me as if disabling hardware acceleration actually worked. Now at least, my computer is fully operational again - but of course, this doesn't fix the bug...
Comment 19 Alejandro Lorenzo 2017-09-05 18:04:55 UTC
I think I am hitting the same bug with Linux 4.12.10 on a Dell Precision 5510.

I will attach my card error, output of dmesg is:

[36249.475910] [drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [1345], reason: Hang on rcs, action: reset
[36249.475912] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[36249.475912] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[36249.475912] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[36249.475913] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[36249.475913] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[36249.475940] drm/i915: Resetting chip after gpu hang
[36249.476049] [drm] RC6 off
[36257.467677] drm/i915: Resetting chip after gpu hang
[36257.467912] [drm] RC6 off
[36265.535608] drm/i915: Resetting chip after gpu hang
[36265.535819] [drm] RC6 off
[36273.467462] drm/i915: Resetting chip after gpu hang
[36273.467723] [drm] RC6 off
[36286.523448] drm/i915: Resetting chip after gpu hang
[36286.523714] [drm] RC6 off
Comment 20 Alejandro Lorenzo 2017-09-05 18:05:40 UTC
Created attachment 133971 [details]
gpu crash dump
Comment 21 Alejandro Lorenzo 2017-09-05 18:07:06 UTC
Definitely related to Libreoffice, as it is the only program that triggers this behaviour
Comment 22 Danna Gifford 2017-09-05 20:29:09 UTC
Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz

I originally proposed the workaround in comment 15. However, in testing I've found that it does not significantly reduce crashes for me after all. Booting with nomodeset as a grub option stops the gpu hangs associated with libreoffice, but with a severe performance penalty and the loss of HiDPI graphics scaling and suspend.
Comment 23 Alejandro Lorenzo 2017-09-06 08:36:27 UTC
It's been 7 months, several other bugs report very similar behaviour with LibreOffice and it seems something is just not right.

Can we do anything to get this finally fixed, Vladimir?
Comment 24 Alejandro Lorenzo 2017-09-06 09:10:21 UTC
Created attachment 133986 [details]
lspci -vvv
Comment 25 Paul-Antoine Arras 2017-09-06 20:08:16 UTC
(In reply to Danna Gifford from comment #22)
> I originally proposed the workaround in comment 15. However, in testing I've
> found that it does not significantly reduce crashes for me after all. 

It doesn't work for me either.
The workaround I found is to run LO in Xephyr.
Comment 26 Danna Gifford 2017-09-07 09:00:11 UTC
I'm currently testing the Intel drivers installed using intel-graphics-update-tool (see link below). I think this is providing i965-va-driver version 1.8.3-1 (whereas 1.7.3-1 is available from the Zesty repo).

I don't have xserver-xorg-video-intel installed.

So far, things seem to be more stable running LibreOffice.

https://01.org/linuxgraphics/downloads/intel-graphics-update-tool-linux-os-v2.0.5

Sorry I'd try to provide more detail but quite busy at the moment.
Comment 27 Alejandro Lorenzo 2017-09-07 09:12:21 UTC
I have taken a look at Intel's page about the intel-graphics-update-tool and I don't think it is related. I am running Debian testing, which ships version 1.8.3 of the libraries that tool updates. That is newer than the 1.8.0 installed by the tool, and I am still affected by this bug.

In case it is related to kernel code i have tested:
- 4.9.2
- 4.9.8
- 4.9.9
- 4.10
- 4.11
- 4.12-rc3
- 4.12-rc6
- 4.12-rc7
- 4.12.4
- 4.12.10

Of these, 4.12.10 gave by far the best results. I ran for maybe two or three weeks without a hang, so I thought it could be resolved, but the other day I got the very same old hang. It seems the kernel changes made it rarer, but the cause is still there.

I am currently testing 4.13 in my machine. Will tell you about the results.
Comment 28 Danna Gifford 2017-09-07 22:48:45 UTC
(In reply to Alejandro Lorenzo from comment #27)
> I have taken a look at the Intel's page about the intel-graphics-update-tool
> and i don't think that should be related. I am running Debian testing, which
> contains version 1.8.3 of the affected libraries by that tool, which is more
> modern than the 1.8.0 installed with the tool and still, i am affected by
> this bug.

As Alejandro predicted, it doesn't fix it.  However, running with 1.8.3 on 4.10.0-34-generic gave me several hours without a hang tonight, compared with several minutes on 1.7.3, so perhaps an improvement (but could just be very stochastic). It seems something has changed, because it also wouldn't reset the chip/restart X after the hang, and I had to reboot.
Comment 29 Alejandro Lorenzo 2017-09-12 08:30:33 UTC
The bug is marked as NEEDINFO. What info is necessary? It seems nobody is actually taking care of this bug.
Comment 30 Sebastian Himmler 2017-09-12 15:06:20 UTC
I can reproduce this as well. It has a heavy impact on my work when the system crashes.


Sep 12 15:54:42 localhost kernel: [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xwayland [2067], reason: Hang on rcs, action: reset
Sep 12 15:54:42 localhost kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 12 15:54:42 localhost kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 12 15:54:42 localhost kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 12 15:54:42 localhost kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Sep 12 15:54:42 localhost kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Sep 12 15:54:42 localhost kernel: drm/i915: Resetting chip after gpu hang
Sep 12 15:54:42 localhost kernel: [drm] RC6 on
Sep 12 15:54:42 localhost org.gnome.Shell.desktop[2047]: Window manager warning: last_user_time (114029411) is greater than comparison timestamp (114029001).  This most likely represents a buggy client sending inaccurate timestamps in messages such as _NET_ACTIVE_WINDOW.  Trying to work around...
Sep 12 15:54:42 localhost org.gnome.Shell.desktop[2047]: Window manager warning: 0x6e00032 (itsa_brief) appears to be one of the offending windows with a timestamp of 114029411.  Working around...
Sep 12 15:54:42 localhost systemd-udevd[665]: Network interface NamePolicy= disabled on kernel command line, ignoring.
Sep 12 15:54:50 localhost kernel: drm/i915: Resetting chip after gpu hang
Sep 12 15:54:50 localhost kernel: [drm] RC6 on
Sep 12 15:54:53 localhost kernel: asynchronous wait on fence i915:gnome-shell[2047]/1:53b49 timed out
Sep 12 15:54:58 localhost kernel: drm/i915: Resetting chip after gpu hang
Sep 12 15:54:58 localhost kernel: [drm] RC6 on
Sep 12 15:55:06 localhost kernel: drm/i915: Resetting chip after gpu hang
Sep 12 15:55:06 localhost kernel: [drm] RC6 on
Sep 12 15:55:14 localhost kernel: drm/i915: Resetting chip after gpu hang
Sep 12 15:55:14 localhost kernel: [drm] RC6 on
Sep 12 15:55:14 localhost org.gnome.Shell.desktop[2047]: intel_do_flush_locked failed: Input/output error
Sep 12 15:55:14 localhost libreoffice-splash.desktop[21306]: X IO Error
Comment 31 Elizabeth 2017-09-26 21:28:06 UTC
Hello everybody, 
I'm trying to reproduce the issue on a SKL eDP 4K GT2, with no luck so far. I'm using Kubuntu 17.04 with the latest drm-tip. I managed to make LibreOffice crash using Impress with a presentation full of images and GIFs, and one time X closed, but I didn't get the hang, just "Atomic update failure" messages and an oom-killer warning.
 
Could you please help me with the following information?
1. What LibreOffice version are you using?  
2. What Mesa version are you using?
3. Are you using firmware, guc and huc?
4. How long (in hours) have you been working in LibreOffice when the hang happens?
5. Have you verified whether HW acceleration is enabled?
6. Is there a step list to reproduce the issue?
7. Have you tried to reproduce with intel_iommu=igfx_off parameter on grub?

Also, since this issue is reproducible on your devices, could you please attach a full dmesg captured with the "drm.debug=0xe log_buf_len=4M" parameters set in grub, and/or a clean kern.log?
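
Before capturing a dmesg for this request, it can help to confirm that the parameters actually reached the running kernel. A minimal sketch follows; it reads `/proc/cmdline`, and the helper names are hypothetical. The buffer-size parameter is assumed here to be the kernel's `log_buf_len`.

```python
def missing_kernel_params(cmdline, wanted):
    """Return the entries of `wanted` that are not whole tokens of `cmdline`."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


# Parameters requested in this comment; `log_buf_len` is the assumed spelling.
WANTED = ["drm.debug=0xe", "log_buf_len=4M"]

# Illustrative command line, not taken from any machine in this report.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro drm.debug=0xe"
print(missing_kernel_params(example, WANTED))
```

On a real system, `missing_kernel_params(read_cmdline(), WANTED)` returns an empty list once both parameters are active after a reboot.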

Sharing my own information:
1. LibreOffice: version 5.3.1.2 build id 1:5.3.1-0ubuntu2 
2. Mesa version: 17.0.7
3. No firmwares(guc/huc) used.
4. Approximately one hour. Once LibreOffice has crashed the first time, it takes about half an hour or less, even after a reboot.
5. Disabled in my case.
6. I tried using glxgears + YouTube + a heavy LibreOffice Impress document, then copy, paste, write, presentation mode (F5), Esc, repeated until LibreOffice crashes.
7. No, since I haven't been able to reproduce.

This is my configuration. Before the latest drm-tip I also tried the latest 4.10, since the Kubuntu distro provides this version; I could not reproduce with either.

======================================
             Software
======================================
kernel version              : 4.14.0-rc2-drm-tip-ww39-commit-0b65077+
architecture                : x86_64
hardware acceleration       : disabled
swap partition              : disabled

======================================
             Hardware
======================================
platform                   : Skylake
cpu information            : Intel(R) Core(TM) m5-6Y57 CPU @ 1.10GHz
gpu card                   : Intel Corporation HD Graphics 515 (rev 07) (prog-if 00 [VGA controller])
memory ram                 : 3.83 GB
max memory ram             : 16 GB
display resolution         : 3840x2160
hard drive                 : 74GiB (80GB)
current cd clock frequency : 540000 kHz
displays connected         : eDP-1

======================================
             Firmware
======================================
dmc fw loaded             : yes
dmc version               : 1.26
guc fw loaded             : NONE
guc version wanted        : 0.0
guc version found         : 0.0

======================================
             kernel parameters
======================================
initcall_debug drm.debug=0xe log_bug_len=2M

As a side note, the last error_state reported by Alejandro Lorenzo is a different issue: its ecode is different, and his hang is inside the batch of the render ring, while the others are outside the batch. That hang should be filed as a separate bug.

Thanks in advance for your time.
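
To make the ecode comparison above concrete, hang lines collected from several reporters can be grouped by their error code. This is only an illustrative sketch with a hypothetical helper name. Matching ecodes do not by themselves prove a shared root cause, and the inside/outside-the-batch distinction comes from the error state, not from this line.

```python
import re
from collections import defaultdict

# Captures the ecode and the process name from a "GPU HANG" line.
# The layout is inferred from the log lines quoted in this thread.
ECODE_RE = re.compile(r"GPU HANG: ecode (\S+?), in (\S+) \[\d+\]")


def group_by_ecode(lines):
    """Map each ecode to the list of process names that reported it."""
    groups = defaultdict(list)
    for line in lines:
        match = ECODE_RE.search(line)
        if match:
            groups[match.group(1)].append(match.group(2))
    return dict(groups)


# Hang lines copied from the description, comment 10, and comment 19.
reports = [
    "[drm] GPU HANG: ecode 9:0:0x85dffffb, in plasmashell [1732], reason: Hang on render ring, action: reset",
    "[drm] GPU HANG: ecode 9:0:0x85dffffb, in Xwayland [7741], reason: Hang on render ring, action: reset",
    "[drm] GPU HANG: ecode 9:0:0x86dffffd, in Xorg [1345], reason: Hang on rcs, action: reset",
]
for ecode, processes in group_by_ecode(reports).items():
    print(ecode, processes)
```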
Comment 32 achat1024 2017-10-30 21:21:40 UTC
Hello,

I have also been fighting with this bug for a while.

Tried to update GPU driver using Intel Graphic Update Tool 2.02 and while LO did not hang anymore my whole desktop had lots of issues.

I need to do more testing myself but wanted to point towards this potential solution linked to issues with the HWE stack on 16.04. So far seems to work both in terms of the LO hang and for my desktop:

https://askubuntu.com/questions/964576/libreoffice-5-1-6-2-crashes-ubuntu-16-04-64-bit

My config is

Skylake (GT2) HD Graphic 520
Mesa 17.0.7
Kernel 4.10.0-37
KDE neon user (Ubuntu 16.04)
Plasma 5.11.2
Comment 33 Mark Janes 2017-12-05 23:39:15 UTC
A similar GPU hang was recently fixed for 2D workloads.  It would help us if someone affected by this libreoffice crash would attempt to reproduce it with mesa 17.3.0rc6

see also:
https://bugs.freedesktop.org/show_bug.cgi?id=103555
Comment 34 Alejandro Lorenzo 2018-08-30 12:33:51 UTC
Just so you know, it's been a long time since I've seen this happen with an up-to-date kernel + mesa, so I would say this has been fixed.
Comment 35 Danna Gifford 2018-08-30 14:10:12 UTC
(In reply to Alejandro Lorenzo from comment #34)
> Just so you know, it's been a long time since i've seen this happen with
> up-to-date kernel + mesa, so i would say this has been fixed

Same for me.
