Bug 100680 - [BDW] X sometimes hangs, sometimes produces strange artifacts
Summary: [BDW] X sometimes hangs, sometimes produces strange artifacts
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-14 03:02 UTC by cmalchik
Modified: 2018-03-02 16:07 UTC
CC List: 1 user

See Also:
i915 platform: BDW
i915 features: GPU hang


Attachments
from /sys/class/drm/card0/error (24.94 KB, text/plain)
2017-04-14 03:02 UTC, cmalchik
no flags

Description cmalchik 2017-04-14 03:02:31 UTC
Created attachment 130838
from /sys/class/drm/card0/error

This happens unpredictably: sometimes I get a few hours of smooth usage, and sometimes X freezes after just a minute or so. The freeze is usually preceded by a few seconds of jitter or strange artifacts in my terminal. Sometimes the display recovers after a few seconds; other times it stays frozen indefinitely, or at least until I lose patience and reboot.

The last time this happened I got some jitters and the screen froze, but returned to normal after a few seconds. Here are the last few lines of dmesg, which prompted me to submit this bug:

[ 1598.035647] DMAR: DRHD: handling fault status reg 2
[ 1598.035661] DMAR: [DMA Write] Request device [00:02.0] fault addr ff829000 [fault reason 23] Unknown
[ 1608.787489] [drm] GPU HANG: ecode 8:0:0x86dffffd, in Xorg [978], reason: Hang on render ring, action: reset
[ 1608.787491] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1608.787491] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1608.787492] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1608.787492] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1608.787493] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1608.787542] drm/i915: Resetting chip after gpu hang
[ 1640.588633] DMAR: DRHD: handling fault status reg 3
[ 1640.588637] DMAR: [DMA Write] Request device [00:02.0] fault addr ff829000 [fault reason 23] Unknown
[ 1640.588640] DMAR: DRHD: handling fault status reg 2
[ 1640.588642] DMAR: [DMA Write] Request device [00:02.0] fault addr ff832000 [fault reason 23] Unknown
[ 1640.588647] DMAR: DRHD: handling fault status reg 2
[ 1640.588649] DMAR: [DMA Write] Request device [00:02.0] fault addr ff835000 [fault reason 23] Unknown
[ 1640.588650] DMAR: [DMA Write] Request device [00:02.0] fault addr ff836000 [fault reason 23] Unknown
[ 1640.588655] DMAR: DRHD: handling fault status reg 3
[ 1640.588656] DMAR: [DMA Write] Request device [00:02.0] fault addr ff838000 [fault reason 23] Unknown
[ 1640.588660] DMAR: DRHD: handling fault status reg 2
[ 1640.588662] DMAR: [DMA Write] Request device [00:02.0] fault addr ff83b000 [fault reason 23] Unknown
[ 1640.588672] DMAR: DRHD: handling fault status reg 3
[ 1640.588674] DMAR: [DMA Write] Request device [00:02.0] fault addr ff87f000 [fault reason 23] Unknown
[ 1640.588684] DMAR: DRHD: handling fault status reg 3
[ 1640.588685] DMAR: [DMA Write] Request device [00:02.0] fault addr ff881000 [fault reason 23] Unknown
[ 1640.588689] DMAR: DRHD: handling fault status reg 2
[ 1640.588690] DMAR: [DMA Write] Request device [00:02.0] fault addr ff884000 [fault reason 23] Unknown
[ 1640.588693] DMAR: DRHD: handling fault status reg 2
[ 1640.588694] DMAR: [DMA Write] Request device [00:02.0] fault addr ff885000 [fault reason 23] Unknown
[ 1659.761304] drm/i915: Resetting chip after gpu hang
[ 1659.761464] dmar_fault: 76 callbacks suppressed
[ 1659.761467] DMAR: DRHD: handling fault status reg 3
[ 1659.761480] DMAR: [DMA Write] Request device [00:02.0] fault addr ff85e000 [fault reason 23] Unknown
[ 1659.761652] DMAR: DRHD: handling fault status reg 2
[ 1659.761660] DMAR: [DMA Write] Request device [00:02.0] fault addr ff85e000 [fault reason 23] Unknown
[ 1659.763325] DMAR: DRHD: handling fault status reg 3
[ 1659.763336] DMAR: [DMA Write] Request device [00:02.0] fault addr ff85e000 [fault reason 23] Unknown
[ 1659.763645] DMAR: DRHD: handling fault status reg 2
[ 1659.763654] DMAR: [DMA Write] Request device [00:02.0] fault addr ff85e000 [fault reason 23] Unknown
[ 1659.764067] DMAR: DRHD: handling fault status reg 3
[ 1659.764076] DMAR: [DMA Write] Request device [00:02.0] fault addr ff829000 [fault reason 23] Unknown
[ 1659.764630] DMAR: DRHD: handling fault status reg 2
[ 1659.764640] DMAR: [DMA Write] Request device [00:02.0] fault addr ff829000 [fault reason 23] Unknown
[ 1659.764647] DMAR: [DMA Write] Request device [00:02.0] fault addr ff82a000 [fault reason 23] Unknown
[ 1659.764698] DMAR: DRHD: handling fault status reg 2
[ 1659.764705] DMAR: [DMA Write] Request device [00:02.0] fault addr ff82a000 [fault reason 23] Unknown
[ 1659.765415] DMAR: DRHD: handling fault status reg 2
[ 1659.765424] DMAR: [DMA Write] Request device [00:02.0] fault addr ff82a000 [fault reason 23] Unknown
[ 1659.765431] DMAR: [DMA Write] Request device [00:02.0] fault addr ff82b000 [fault reason 23] Unknown
[ 1677.745357] drm/i915: Resetting chip after gpu hang
[ 1677.760715] dmar_fault: 248 callbacks suppressed
[ 1677.760716] DMAR: DRHD: handling fault status reg 3
[ 1677.760720] DMAR: [DMA Write] Request device [00:02.0] fault addr ff829000 [fault reason 23] Unknown
[ 1677.760724] DMAR: DRHD: handling fault status reg 2
[ 1677.760727] DMAR: [DMA Write] Request device [00:02.0] fault addr ff832000 [fault reason 23] Unknown
[ 1677.760731] DMAR: DRHD: handling fault status reg 2
[ 1677.760733] DMAR: [DMA Write] Request device [00:02.0] fault addr ff836000 [fault reason 23] Unknown
[ 1677.760736] DMAR: DRHD: handling fault status reg 3
[ 1677.760738] DMAR: [DMA Write] Request device [00:02.0] fault addr ff83a000 [fault reason 23] Unknown
[ 1677.760742] DMAR: DRHD: handling fault status reg 3
[ 1677.760745] DMAR: [DMA Write] Request device [00:02.0] fault addr ff83f000 [fault reason 23] Unknown
[ 1677.760748] DMAR: DRHD: handling fault status reg 2
[ 1677.760750] DMAR: [DMA Write] Request device [00:02.0] fault addr ff843000 [fault reason 23] Unknown
[ 1677.760752] DMAR: DRHD: handling fault status reg 2
[ 1677.760755] DMAR: [DMA Write] Request device [00:02.0] fault addr ff847000 [fault reason 23] Unknown
[ 1677.760757] DMAR: DRHD: handling fault status reg 3
[ 1677.760759] DMAR: [DMA Write] Request device [00:02.0] fault addr ff84a000 [fault reason 23] Unknown
[ 1677.760763] DMAR: DRHD: handling fault status reg 2
[ 1677.760765] DMAR: [DMA Write] Request device [00:02.0] fault addr ff84e000 [fault reason 23] Unknown
[ 1677.760768] DMAR: DRHD: handling fault status reg 2
[ 1677.760771] DMAR: [DMA Write] Request device [00:02.0] fault addr ff852000 [fault reason 23] Unknown
[ 1689.713436] drm/i915: Resetting chip after gpu hang
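Every DMAR write fault in the log above is reported for device 00:02.0 (the Iris Graphics 6100 GPU in the lspci output below), and all faulting addresses fall between ff829000 and ff885000. A minimal shell sketch for tallying such faults per device and address, assuming the dmesg output is available as plain text; `summarize_dmar_faults` is a hypothetical helper, not an existing tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize_dmar_faults reads dmesg text on stdin and prints
# "<count> <device> <fault addr>" for each distinct DMAR write fault, most frequent first.
# Lines that are not "[DMA Write] ... fault addr" reports are ignored.
summarize_dmar_faults() {
  sed -n 's/.*DMAR: \[DMA Write] Request device \[\([0-9a-f:.]*\)] fault addr \([0-9a-f]*\).*/\1 \2/p' |
    sort | uniq -c | sort -rn
}

# Example on two lines copied from the log above.
printf '%s\n' \
  '[ 1598.035661] DMAR: [DMA Write] Request device [00:02.0] fault addr ff829000 [fault reason 23] Unknown' \
  '[ 1640.588637] DMAR: [DMA Write] Request device [00:02.0] fault addr ff829000 [fault reason 23] Unknown' |
  summarize_dmar_faults
```

On a live system the same function can be fed with `dmesg | summarize_dmar_faults` (reading dmesg may require root).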

Some more info:

$ uname -a
Linux myvoidlinuxbox 4.10.10_1 #1 SMP PREEMPT Thu Apr 13 14:00:02 UTC 2017 x86_64 GNU/Linux
$ lspci
00:00.0 Host bridge: Intel Corporation Broadwell-U Host Bridge -OPI (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Iris Graphics 6100 (rev 09)
00:03.0 Audio device: Intel Corporation Broadwell-U Audio Controller (rev 09)
00:14.0 USB controller: Intel Corporation Wildcat Point-LP USB xHCI Controller (rev 03)
00:16.0 Communication controller: Intel Corporation Wildcat Point-LP MEI Controller #1 (rev 03)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (3) I218-V (rev 03)
00:1b.0 Audio device: Intel Corporation Wildcat Point-LP High Definition Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #1 (rev e3)
00:1c.3 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #4 (rev e3)
00:1d.0 USB controller: Intel Corporation Wildcat Point-LP USB EHCI Controller (rev 03)
00:1f.0 ISA bridge: Intel Corporation Wildcat Point-LP LPC Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation Wildcat Point-LP SATA Controller [AHCI Mode] (rev 03)
00:1f.3 SMBus: Intel Corporation Wildcat Point-LP SMBus Controller (rev 03)
02:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)
$
Comment 1 Elizabeth 2017-06-22 17:08:52 UTC
Hello, could you please boot with the kernel parameter "drm.debug=0xe" added in GRUB and provide the full dmesg log? Is this present in the latest kernel update? Thanks.
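The drm.debug value is a bitmask in which each bit enables one category of DRM debug messages. A minimal sketch that decodes it, assuming the low five bits map to CORE, DRIVER, KMS, PRIME and ATOMIC as in the 4.x-era kernel DRM headers; `decode_drm_debug` is a hypothetical helper and higher bits are ignored:

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode_drm_debug MASK prints the DRM debug categories
# enabled by a drm.debug bitmask. Bit layout assumed from 4.x-era DRM headers:
# 0x01 CORE, 0x02 DRIVER, 0x04 KMS, 0x08 PRIME, 0x10 ATOMIC (higher bits ignored).
decode_drm_debug() {
  local mask=$(( $1 ))   # bash arithmetic accepts hex such as 0xe
  local names=(CORE DRIVER KMS PRIME ATOMIC)
  local enabled=()
  local i
  for i in "${!names[@]}"; do
    if (( mask & (1 << i) )); then
      enabled+=("${names[i]}")
    fi
  done
  echo "${enabled[*]}"
}

decode_drm_debug 0xe    # DRIVER KMS PRIME
decode_drm_debug 0x1e   # DRIVER KMS PRIME ATOMIC
```

Under this layout, the 0xe requested here enables DRIVER, KMS and PRIME messages, while the 0x1e used in the kern.log instructions later in the thread also adds ATOMIC.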
Comment 2 Elizabeth 2017-07-20 22:27:13 UTC
Hello,
Is this still valid? Could you please try to reproduce on kernel 4.13?
Thanks.
Comment 3 cmalchik 2017-07-21 00:20:03 UTC
(In reply to Elizabeth from comment #2)
> Hello,
> Is this still valid? Could you please try to reproduce on kernel 4.13?
> Thanks.

hey,

sorry, i've been busy and am not currently using the computer that i had this bug on. i should have time to test this in the next couple of weeks.
Comment 4 Elizabeth 2017-08-03 18:55:17 UTC
(In reply to cmalchik from comment #3)
> (In reply to Elizabeth from comment #2)
> sorry, i've been busy and am not currently using the computer that i had
> this bug on. i should have time to test this in the next couple of weeks.
Thanks for the update and the time invested, waiting for tests.
Comment 5 Elizabeth 2017-09-12 15:48:53 UTC
Hello, sorry for pestering. Still valid with latest drm-tip or vanilla mainline? Thanks.
Comment 6 cmalchik 2018-01-03 22:36:08 UTC
(In reply to Elizabeth from comment #5)
> Hello, sorry for pestering. Still valid with latest drm-tip or vanilla
> mainline? Thanks.

still valid.  i'm having the same issues after a fresh void install and updating to linux 4.14.11.  my libdrm version is 2.4.89.
Comment 7 Elizabeth 2018-01-04 23:17:00 UTC
Could you try to get a dmesg or a clean kern.log with debug info?
Comment 8 cmalchik 2018-01-27 06:10:21 UTC
hi elizabeth,

tentatively, the issue seems to be fixed in linux 4.14.15.  do you know of any change between 4.14.11 and 4.14.15 that might have fixed it?

i can also confirm that downgrading to e.g. 4.13.15 introduces the problem again (i would've tried 4.14.11 but couldn't get it from the repos).

if it would still be useful, i can try to get a dmesg or kern.log when the problem happens in 4.13.15. but i am short on time and don't know how to get a kern.log with debug info, so if this is still needed, step-by-step instructions would help. thanks for following up on this thus far!
Comment 9 Elizabeth 2018-01-29 17:46:40 UTC
Not really sure. Probably some of the dma-buf changes. I don't believe a kern.log would be helpful right now, since we don't have a known culprit. I'm marking this as fixed for now; if the issue comes back, please mark it as REOPENED and attach updated logs. Thanks for your time :)

About the kern.log:
On Ubuntu:
$ sudo nano /etc/default/grub
add the parameter to the kernel command line (the GRUB_CMDLINE_LINUX_DEFAULT line):
"drm.debug=0x1e"
$ sudo update-grub
Either delete your kern.log or rename it if you want to keep it:
$ sudo rm /var/log/kern.log
or
$ sudo mv /var/log/kern.log kern.log.bkp
$ sudo reboot
After restarting, reproduce the issue and copy the log so you can share it:
$ sudo cp /var/log/kern.log kern_log_my_issue
Attach the log to the bug on FDO (bugs.freedesktop.org).
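The GRUB edit in these steps can also be scripted instead of done in nano. A minimal sketch, assuming Ubuntu's /etc/default/grub contains a `GRUB_CMDLINE_LINUX_DEFAULT="..."` line and that GNU sed is available; `add_drm_debug` is a hypothetical helper, not an Ubuntu tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: add_drm_debug FILE MASK appends drm.debug=MASK to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of FILE, editing it in place (GNU sed -i).
add_drm_debug() {
  local file=$1 mask=$2
  # Leave the file alone if a drm.debug= setting is already present.
  if grep -q 'drm\.debug=' "$file"; then
    return 0
  fi
  # Insert " drm.debug=MASK" just before the closing quote of that line.
  sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 drm.debug=${mask}\"/" "$file"
}

# Usage on Ubuntu (needs root; keep a backup, then regenerate the GRUB config):
#   sudo cp /etc/default/grub /etc/default/grub.bak
#   sudo bash -c "$(declare -f add_drm_debug); add_drm_debug /etc/default/grub 0x1e"
#   sudo update-grub
```

Skipping the edit when drm.debug= is already present keeps the helper safe to run twice.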

To verify that the debug info is included, search the log for "drm": most of it should consist of lines containing that word.
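The check described above can be done from the shell. A minimal sketch, where `drm_line_share` is a hypothetical helper and kern_log_my_issue is the copy made in the earlier steps:

```shell
#!/usr/bin/env bash
# Hypothetical helper: drm_line_share FILE reports how many lines of FILE mention
# "drm" out of the total, as a quick check that drm.debug output reached the log.
drm_line_share() {
  local file=$1
  local drm total
  drm=$(grep -c 'drm' "$file")   # prints 0 (and exits 1) when nothing matches
  total=$(wc -l < "$file")
  echo "${drm} of ${total} lines mention drm"
}

# Example:
#   drm_line_share kern_log_my_issue
# The mask the running kernel is actually using can be read back (root only):
#   sudo cat /sys/module/drm/parameters/debug
```

If only a small fraction of lines mention drm, the parameter most likely did not reach the kernel command line (compare with `cat /proc/cmdline`).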

