Bug 104411 - [CCS] lemonbar-xft GPU hang
Summary: [CCS] lemonbar-xft GPU hang
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: 17.3
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: regression
Duplicates: 99867 104383 104394 104462 104689 104991 105184 105195 105314 105315
Depends on:
Blocks: mesa-18.0
Reported: 2017-12-29 06:28 UTC by Axel Fischer
Modified: 2018-03-20 15:34 UTC (History)
CC List: 12 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg detailing hang (58.80 KB, text/x-log)
2017-12-29 06:28 UTC, Axel Fischer
GPU crash dump (58.29 KB, text/plain)
2017-12-29 06:28 UTC, Axel Fischer
dmesg gpu hang after a fresh boot (58.04 KB, text/x-log)
2018-01-03 15:57 UTC, Stephan Fackler
GPU crash dump (62.39 KB, text/x-log)
2018-01-03 15:58 UTC, Stephan Fackler
Simple script which leads to GPU hang when piped into lemonbar-xft (66 bytes, application/x-shellscript)
2018-01-20 11:12 UTC, Stephan Fackler

Description Axel Fischer 2017-12-29 06:28:15 UTC
Created attachment 136441
dmesg detailing hang

The GPU hang is reliably reproducible with mesa 17.3.0 and 17.3.1 by visiting Google Maps in Firefox. Downgrading mesa to version 17.2.7 fixes the issue.
Comment 1 Axel Fischer 2017-12-29 06:28:54 UTC
Created attachment 136442
GPU crash dump
Comment 2 Elizabeth 2017-12-29 15:28:16 UTC
Thanks for your time, Axel.
Could you help us bisect the issue to find the culprit commit?
Meanwhile, I'll try to reproduce it on my side.
Comment 3 Axel Fischer 2017-12-29 16:07:31 UTC
(In reply to Elizabeth from comment #2)
> Thanks for your time, Axel.
> Could you help us bisect the issue to find the culprit commit?
> Meanwhile, I'll try to reproduce it on my side.

Certainly, I am glad to help. I will start bisecting within the next few days.
Comment 4 Axel Fischer 2017-12-30 09:15:59 UTC
I just finished the git bisect. This issue was introduced with commit ea0d2e98ecb369ab84e78c84709c0930ea8c293a. Unfortunately, I do not know enough about mesa or the Intel drivers to further debug the problem and create a patch myself. However, please let me know if I can provide additional information that would be helpful.
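
For readers unfamiliar with the technique, the search that git bisect performs can be sketched as follows. This is only an illustration, not Mesa tooling: the commit names and the `is_bad` predicate are hypothetical stand-ins for "build this Mesa commit and check whether the GPU hangs".

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    bad as well.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


# Hypothetical history: "c0" stands for the last good release and "c7" for
# the first release that hangs. The hang is assumed to start at "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_bad_commit(history, lambda c: history.index(c) >= 5)
```

Each step halves the remaining range, so even a long range of commits between two releases needs only a handful of builds.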
Comment 5 Kenneth Graunke 2018-01-02 00:04:30 UTC
Interesting, thank you for bisecting!  So far, I'm not able to reproduce this :(

Does it work if you run with INTEL_DEBUG=norbc ?
Comment 6 Axel Fischer 2018-01-02 09:41:32 UTC
(In reply to Kenneth Graunke from comment #5)
> Interesting, thank you for bisecting!  So far, I'm not able to reproduce
> this :(
> 
> Does it work if you run with INTEL_DEBUG=norbc ?

Thank you for working on this. Yes, when setting INTEL_DEBUG=norbc it works without problems.
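
A minimal sketch of applying this workaround to a single program rather than the whole session is shown below. The helper names `env_with_intel_debug` and `run_with_flag` are hypothetical. The sketch relies on INTEL_DEBUG being a comma-separated list of flags, so an existing value is extended rather than overwritten.

```python
import os
import subprocess
from typing import Mapping, Sequence


def env_with_intel_debug(flag: str, base: Mapping[str, str]) -> dict:
    """Return a copy of base with flag added to INTEL_DEBUG."""
    env = dict(base)
    # Keep any flags that are already set and avoid adding a duplicate.
    flags = [f for f in env.get("INTEL_DEBUG", "").split(",") if f]
    if flag not in flags:
        flags.append(flag)
    env["INTEL_DEBUG"] = ",".join(flags)
    return env


def run_with_flag(cmd: Sequence[str], flag: str = "norbc") -> int:
    """Run cmd with the flag set only for that process; return its exit code."""
    return subprocess.run(list(cmd), env=env_with_intel_debug(flag, os.environ)).returncode


# Example call (not executed here): run_with_flag(["firefox"])
```

Setting the variable per process keeps the rest of the desktop on the default code path, which makes it easier to tell which application triggers the hang.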
Comment 7 Stephan Fackler 2018-01-03 15:55:27 UTC
I very likely have the same bug on my Thinkpad T470 (curiously, Axel Fischer is using a Thinkpad T460 model) and can confirm almost everything that has already been said.

- The GPU hang occurs reliably when using Mesa 17.3.0 and 17.3.1. Downgrading to 17.2.6 on Arch fixed the issue for me as well.
- Everything runs without any problems as soon as INTEL_DEBUG=norbc is set.

However, for me the GPU hang always occurs within a few seconds of starting Xorg. I have not yet done a git bisect.

I am attaching my dmesg log and GPU crash dump in the hope that they give further hints. If I can help further, just tell me and I will try to do so.
Comment 8 Stephan Fackler 2018-01-03 15:57:18 UTC
Created attachment 136523
dmesg gpu hang after a fresh boot
Comment 9 Stephan Fackler 2018-01-03 15:58:52 UTC
Created attachment 136524
GPU crash dump
Comment 10 Mark Janes 2018-01-05 18:54:40 UTC
Elizabeth, since Ken can't reproduce this, can you please take a look?  You might be able to discover the factors that trigger the hang.
Comment 11 Alejandro Lorenzo 2018-01-09 18:59:08 UTC
I would say this also affects my Dell XPS Precision 9550. Working with Eclipse Kepler causes a hang in a very predictable manner.

With INTEL_DEBUG=norbc set, everything works without hangs. I am also downgrading to 17.2.5 (Debian Testing packages); that is still in progress.

Details about the processor:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
microcode       : 0xba
cpu MHz         : 2600.000
cache size      : 6144 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs            :
bogomips        : 5184.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
Comment 12 Elizabeth 2018-01-17 22:45:55 UTC
Hello everybody,
Sorry for the delay. I tried to reproduce this on a SKL laptop without success. I installed Gentoo, Xorg, the default kernel, KDE, Firefox and Mesa.
In Firefox I opened Google Maps and used map view, satellite view and street view, and the hang didn't happen.

I'm using:
Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
Gentoo Base System release 2.4.1
Firefox ESR 52.5.2 (64-bit)
X.Org X Server 1.19.5
plasmashell 5.10.5
Qt 5.7.1
KDE Frameworks 5.40.0
kf5-config 1.0
Memory 8Gb

I tried default kernel and mesa,
kernel-genkernel-x86_64-4.9.72-gentoo
Mesa 17.2.7 and Mesa 17.3.1

And latest drm-tip and mesa, 
Linux gentoo 4.15.0-rc8+ #4 SMP Wed Jan 17 15:16:57 CST 2018 x86_64 
Mesa 17.3.1 and Mesa 17.4.0-devel

What desktop environment are you using? Did Firefox produce a crash log when the hang occurred? Are you using any special feature of Maps? Are you logged in? How long does it take until the device hangs? Stephan, Alejandro, how can you reliably reproduce the hang in your case?

Thank you.
Comment 13 Axel Fischer 2018-01-17 23:40:25 UTC
Hi Elizabeth,

I am on Gentoo's testing branch. I am only using a window manager (herbstluftwm 0.7.0) but no desktop environment. The issue is reproducible with:

Firefox 57.0.4 built against GTK 2.24.31
X.Org X server 1.19.6
Core i7-6600U
24 GiB memory

The device hang happens immediately when loading Google maps (normal map view). I was logged in when it happened. Firefox did not produce any crash log.

In case it is not reproducible for you, I can try to debug it and see if I can figure out what the problem is. I have a lot of experience with C++, but none with C and no knowledge of Mesa. That might take some time, though, since I will have to do a lot of reading and learning first.
Comment 14 Stephan Fackler 2018-01-18 16:40:50 UTC
Hi Elizabeth!

Thanks for your effort in investigating the issue.

For me starting X with my standard configuration always leads to a GPU hang within at most 10 seconds after start up of X.

I'm using a Thinkpad T470 with Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz and 32GB of RAM on Arch Linux. I'm currently using 

linux 4.14.13-1
mesa 17.3.2-2
xorg-server 1.19.6-2

The following information may be useful. As Axel is using herbstluftwm and I do as well, I was curious whether the WM is the issue. And indeed, I get the following interesting result:

1) awesomewm without lemonbar: no crash
2) herbstluftwm with lemonbar: crash
3) herbstluftwm without lemonbar: no crash

In fact, I'm not using the standard lemonbar package from Arch, but the package lemonbar-xft-git from AUR which is simply a build of https://github.com/krypt-n/bar

Over the next few days, I will check further combinations, and the lemonbar issue in particular, to narrow down the precise reason for the crash. Sadly, I do not have time to dig deeper right now, but I wanted to share this information in the meantime.
Comment 15 Mark Janes 2018-01-18 17:18:39 UTC
Axel: please check that this is not a dup of 
https://bugs.freedesktop.org/show_bug.cgi?id=104214

You should be able to build and test master or the 17.3 release branch, which has the fix ready for the 17.3.1 release.
Comment 16 Stephan Fackler 2018-01-20 11:12:09 UTC
Created attachment 136865
Simple script which leads to GPU hang when piped into lemonbar-xft

I have now further investigated the reason for the GPU crash on my laptop and could construct the following minimal example. Steps to reproduce the crash on my laptop:

1) Install https://github.com/krypt-n/bar which gives you the lemonbar executable
2) ./crash_me.sh | lemonbar -f "Droid Sans"

The script pipes lines into lemonbar which then should be shown on the bar.

Further observations:

1) The GPU hang occurs after the first update of lemonbar. In the script this is after 1 second (sleep 1). If this is changed to e.g. sleep 30, the crash occurs after 30 seconds.
2) The crash occurs for some XFT fonts, but not for all. I suspect that the crash originates from XFT font rendering.
3) Changing the echoed text to something different makes the crash go away. So oddly, this seems to be a rather finely tuned issue.
4) The crash occurs independently of the chosen X window manager.

If I can help further, I will try to do so.
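
The attached script itself is not reproduced in this report, so the following is only a sketch of the general shape described above: write one line for the bar, wait, then write another so that lemonbar performs its first update. The function name `feed` and the placeholder text are hypothetical. Per observation 3, the hang depends on the exact text, so the placeholders are not expected to trigger it.

```python
import sys
import time
from typing import Callable, Iterable, TextIO


def feed(lines: Iterable[str], delay: float,
         out: TextIO = sys.stdout,
         sleep: Callable[[float], None] = time.sleep) -> None:
    """Write each line to out, pausing for delay seconds between lines."""
    for index, line in enumerate(lines):
        if index > 0:
            sleep(delay)  # the first update of the bar happens after this pause
        out.write(line + "\n")
        out.flush()  # a pipe is block-buffered, so flush after every line


# Example call (not executed here), with placeholder text only:
# feed(["placeholder text 1", "placeholder text 2"], delay=1.0)
```

A script that makes this call could be piped into lemonbar in the same way as crash_me.sh in step 2.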
Comment 17 Kenneth Graunke 2018-01-23 02:02:17 UTC
Thank you, this is very helpful!  I can reproduce the lemonbar GPU hang with your script.  (To avoid crashing X while debugging, you can run it in Xephyr -glamor as well...)

I've determined that always_flush_cache=true, INTEL_DEBUG=reemit, and always_flush_batch=true all work around the problem.  We may be missing a flush somewhere.  I'm not sure where.
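
To try each of these workarounds in isolation, something like the sketch below could be used. It assumes that all three settings are passed as environment variables, as they are written in this comment. The dictionary keys and the helper name `workaround_env` are hypothetical.

```python
from typing import Dict, Mapping

# The three settings named above, one per entry, so that each can be
# tested on its own.
WORKAROUNDS: Dict[str, Dict[str, str]] = {
    "flush-cache": {"always_flush_cache": "true"},
    "reemit": {"INTEL_DEBUG": "reemit"},
    "flush-batch": {"always_flush_batch": "true"},
}


def workaround_env(name: str, base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base with exactly one workaround applied.

    Note that an existing INTEL_DEBUG value is replaced, not extended, so
    that only the selected workaround is active.
    """
    env = dict(base)
    env.update(WORKAROUNDS[name])
    return env


# Example (not executed here): pass workaround_env("reemit", os.environ) as
# the env argument of subprocess.run() when starting the reproducer.
```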

Renaming bug to mention CCS and Lemonbar.
Comment 18 Jason Ekstrand 2018-01-23 23:19:24 UTC
This may help:

https://patchwork.freedesktop.org/series/36957/
Comment 19 Kenneth Graunke 2018-01-24 05:03:36 UTC
It does not. :(
Comment 20 Mark Janes 2018-01-24 07:20:54 UTC
*** Bug 104383 has been marked as a duplicate of this bug. ***
Comment 21 Mark Janes 2018-01-24 19:00:53 UTC
patch on list:

https://patchwork.freedesktop.org/series/37023/
Comment 22 Jason Ekstrand 2018-01-25 03:10:37 UTC
This is fixed by the following commit in master:

commit 20f70ae3858bc213e052a8434f0e637eb36203c4
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Tue Jan 23 23:47:26 2018 -0800

    i965/draw: Set NEW_AUX_STATE when draw aux changes
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104411
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104383
    Fixes: ea0d2e98ecb369ab84e78c84709c0930ea8c293a
    Cc: mesa-stable@lists.freedesktop.org
    Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
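
Judging only from the commit subject, the fix marks a piece of state as dirty whenever the auxiliary surface usage for drawing changes. The toy model below illustrates that general dirty-flag pattern. It is not the Mesa code: the class, the method names, the bit value and the "ccs"/"none" strings are all hypothetical.

```python
NEW_AUX_STATE = 1 << 0  # illustrative bit value, not the one used by the driver


class DrawState:
    """Toy model of dirty-flag tracking for per-draw aux usage."""

    def __init__(self) -> None:
        self.dirty = 0
        self.draw_aux = None
        self.emitted = []  # log of aux usages for which state was re-emitted

    def prepare_draw(self, aux_usage: str) -> None:
        # Flag the state as dirty only when the aux usage actually changes.
        if aux_usage != self.draw_aux:
            self.draw_aux = aux_usage
            self.dirty |= NEW_AUX_STATE

    def emit(self) -> None:
        # State that depends on the aux usage is rebuilt only when flagged.
        if self.dirty & NEW_AUX_STATE:
            self.emitted.append(self.draw_aux)
        self.dirty = 0


state = DrawState()
for usage in ["ccs", "ccs", "none"]:
    state.prepare_draw(usage)
    state.emit()
```

In this model, leaving out the `self.dirty |= NEW_AUX_STATE` line would mean later draws keep using state built for the old aux usage.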
Comment 23 Mark Janes 2018-01-25 16:51:07 UTC
*** Bug 104462 has been marked as a duplicate of this bug. ***
Comment 24 Mark Janes 2018-01-25 18:09:16 UTC
*** Bug 104689 has been marked as a duplicate of this bug. ***
Comment 25 Mark Janes 2018-03-01 08:17:30 UTC
*** Bug 105184 has been marked as a duplicate of this bug. ***
Comment 26 Mark Janes 2018-03-01 20:12:26 UTC
*** Bug 105314 has been marked as a duplicate of this bug. ***
Comment 27 Mark Janes 2018-03-01 20:12:53 UTC
*** Bug 105195 has been marked as a duplicate of this bug. ***
Comment 28 goliath 2018-03-03 19:23:50 UTC
*** Bug 105315 has been marked as a duplicate of this bug. ***
Comment 29 Mark Janes 2018-03-05 15:00:13 UTC
*** Bug 104394 has been marked as a duplicate of this bug. ***
Comment 30 Mark Janes 2018-03-15 20:04:54 UTC
*** Bug 99867 has been marked as a duplicate of this bug. ***
Comment 31 Mark Janes 2018-03-20 15:34:27 UTC
*** Bug 104991 has been marked as a duplicate of this bug. ***

