Bug 97805 - Occasional GPU Hang: ecode 9:0:0x84dffff8, kernel: 4.7.3,
Summary: Occasional GPU Hang: ecode 9:0:0x84dffff8, kernel: 4.7.3,
Status: RESOLVED INVALID
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-09-14 14:36 UTC by Artem Shinkarov
Modified: 2017-02-10 22:38 UTC
CC List: 1 user

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
Dmesg (71.11 KB, text/plain)
2016-09-14 14:36 UTC, Artem Shinkarov
Bzipped error from the graphics card (21.79 KB, application/x-bzip2)
2016-09-14 14:37 UTC, Artem Shinkarov
New dmesg with stack traces (21.41 KB, text/plain)
2016-10-12 16:33 UTC, Artem Shinkarov

Description Artem Shinkarov 2016-09-14 14:36:39 UTC
Created attachment 126517
Dmesg

I have observed occasional GPU hangs since at least kernel 4.7.0. They do not seem to be tied to any specific rendering activity, as I spend most of my time working in a terminal.

The crash from the dmesg looks like this:
[  624.838943] ------------[ cut here ]------------
[  624.838956] WARNING: CPU: 1 PID: 2865 at drivers/gpu/drm/i915/intel_pm.c:3675 skl_update_other_pipe_wm+0x148/0x150 [i915]
[  624.838957] WARN_ON(!wm_changed)
[  624.838959] Modules linked in:
[  624.838960]  snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic uvcvideo videobuf2_vmalloc btusb btrtl videobuf2_memops btintel videobuf2_v4l2 btbcm videobuf2_core bluetooth videodev nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) rtsx_pci_sdmmc dell_wmi dell_laptop dell_smbios dcdbas brcmfmac x86_pkg_temp_thermal snd_hda_intel i915 intel_powerclamp snd_hda_codec brcmutil coretemp mmc_core snd_hwdep snd_hda_core rtsx_pci mfd_core wmi dell_smo8800 efivarfs
[  624.838979] CPU: 1 PID: 2865 Comm: X Tainted: P           OE   4.7.3-gentoo #1
[  624.838980] Hardware name: Dell Inc. XPS 15 9550/0N7TVV, BIOS 01.02.00 04/07/2016
[  624.838981]  0000000000000000 ffff8804a753b940 ffffffff8135f9d8 ffff8804a753b990
[  624.838983]  0000000000000000 ffff8804a753b980 ffffffff8108ca06 00000e5b00000003
[  624.838985]  ffff8804a753ba04 ffff8804a8f4a2e0 ffff8804aa1ab000 ffff8804aaa45bd0
[  624.838987] Call Trace:
[  624.838991]  [<ffffffff8135f9d8>] dump_stack+0x4d/0x65
[  624.838994]  [<ffffffff8108ca06>] __warn+0xc6/0xe0
[  624.838996]  [<ffffffff8108ca6a>] warn_slowpath_fmt+0x4a/0x50
[  624.839005]  [<ffffffffa00d58f8>] skl_update_other_pipe_wm+0x148/0x150 [i915]
[  624.839013]  [<ffffffffa00d5a5c>] skl_update_wm+0x15c/0x5f0 [i915]
[  624.839024]  [<ffffffffa0162da0>] ? intel_ddi_enable_transcoder_func+0x170/0x240 [i915]
[  624.839032]  [<ffffffffa00d9729>] intel_update_watermarks+0x19/0x20 [i915]
[  624.839044]  [<ffffffffa0145616>] haswell_crtc_enable+0x746/0x8c0 [i915]
[  624.839057]  [<ffffffffa0140fda>] intel_atomic_commit+0x5ea/0x1460 [i915]
[  624.839059]  [<ffffffff8147c014>] ? drm_atomic_set_crtc_for_connector+0x94/0x100
[  624.839061]  [<ffffffff8147c962>] drm_atomic_commit+0x32/0x50
[  624.839064]  [<ffffffff81459acc>] drm_atomic_helper_set_config+0x7c/0xb0
[  624.839066]  [<ffffffff8146bda0>] drm_mode_set_config_internal+0x60/0x110
[  624.839068]  [<ffffffff814704f4>] drm_mode_setcrtc+0x414/0x530
[  624.839070]  [<ffffffff814626de>] drm_ioctl+0x13e/0x510
[  624.839071]  [<ffffffff814700e0>] ? drm_mode_setplane+0x1c0/0x1c0
[  624.839074]  [<ffffffff811c5edd>] do_vfs_ioctl+0x8d/0x590
[  624.839076]  [<ffffffff81098015>] ? recalc_sigpending+0x15/0x50
[  624.839078]  [<ffffffff811c6454>] SyS_ioctl+0x74/0x80
[  624.839081]  [<ffffffff8183f5db>] entry_SYSCALL_64_fastpath+0x13/0x8f
[  624.839082] ---[ end trace ee9b30cab71e9ffd ]---

The system is: Linux temanbk 4.7.3-gentoo #1 SMP Thu Sep 8 18:02:23 BST 2016 x86_64 Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz GenuineIntel GNU/Linux

The dmesg and error.bz2 are attached.
Comment 1 Artem Shinkarov 2016-09-14 14:37:08 UTC
Created attachment 126518
Bzipped error from the graphics card
Comment 2 yann 2016-09-14 15:22:36 UTC
Artem, you have two different issues here. Regarding the watermark warning (the backtrace from the kernel log that you pasted), I recommend updating your kernel, because some improvements were made there recently (see https://cgit.freedesktop.org/drm-intel/log/drivers/gpu/drm/i915/intel_pm.c).

Regarding the second one, the GPU hang, I am assigning this to the Mesa product.

From this error dump, the hang is happening in a render ring batch, with the active head at 0xff02c63c and 0x7b000005 (3DPRIMITIVE) as IPEHR.

Batch extract (around 0xff02c63c):

0xff02c614:      0x78090007: 3DSTATE_VERTEX_ELEMENTS
0xff02c618:      0x26000000:    buffer 4: valid, type 0x0000, src offset 0x0000 bytes
0xff02c61c:      0x22220000:    (0.0, 0.0, 0.0, 0.0), dst offset 0x00 bytes
0xff02c620:      0x26f60000:    buffer 4: valid, type 0x00f6, src offset 0x0000 bytes
0xff02c624:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
0xff02c628:      0x26d80004:    buffer 4: valid, type 0x00d8, src offset 0x0004 bytes
0xff02c62c:      0x12230000:    (X, 0.0, 0.0, 1.0), dst offset 0x00 bytes
0xff02c630:      0x26850008:    buffer 4: valid, type 0x0085, src offset 0x0008 bytes
0xff02c634:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
Bad length 7 in (null), expected 6-6
0xff02c638:      0x7b000005: 3DPRIMITIVE: fail sequential
0xff02c63c:      0x00000000:    vertex count
0xff02c640:      0x0000000f:    start vertex
0xff02c644:      0x00000262:    instance count
0xff02c648:      0x00000001:    start instance
0xff02c64c:      0x00000000:    index bias
0xff02c650:      0x00000000: MI_NOOP
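To make the "Bad length" line easier to read, the two command headers in the extract above can be split into their bit fields. The Python sketch below is a hypothetical helper, not part of any Intel or Mesa tool. It assumes the GFXPIPE header layout from Intel's public PRMs (bits 31:29 command type, 28:27 subtype, 26:24 opcode, 23:16 sub-opcode, 7:0 DWord length, with the total command length being the field value plus 2); those bit positions should be checked against the PRM for the specific GPU generation.

```python
# Hypothetical helper, not part of any Intel/Mesa tool. Assumes the GFXPIPE
# command header layout described in the lead-in (verify against the PRM).

def decode_gfxpipe_header(dw0: int) -> dict:
    """Split a GFXPIPE command header dword into its bit fields."""
    return {
        "type": (dw0 >> 29) & 0x7,        # bits 31:29, 3 = GFXPIPE
        "subtype": (dw0 >> 27) & 0x3,     # bits 28:27
        "opcode": (dw0 >> 24) & 0x7,      # bits 26:24
        "subopcode": (dw0 >> 16) & 0xFF,  # bits 23:16
        # bits 7:0 hold (total length - 2), so add 2 to get the dword count
        "total_dwords": (dw0 & 0xFF) + 2,
    }


def last_dword_addr(header_addr: int, dw0: int) -> int:
    """Address of the final dword of the command whose header is at header_addr."""
    return header_addr + 4 * (decode_gfxpipe_header(dw0)["total_dwords"] - 1)


if __name__ == "__main__":
    # Header addresses and values copied from the batch extract.
    for name, addr, dw0 in [
        ("3DSTATE_VERTEX_ELEMENTS", 0xFF02C614, 0x78090007),
        ("3DPRIMITIVE", 0xFF02C638, 0x7B000005),
    ]:
        fields = decode_gfxpipe_header(dw0)
        print(f"{name}: {fields['total_dwords']} dwords, "
              f"last at 0x{last_dword_addr(addr, dw0):08x}")
```

Under these assumptions, 0x7b000005 declares a 7-dword 3DPRIMITIVE ending at 0xff02c650, while the decoder expected 6 dwords. This suggests that the dword printed as MI_NOOP is really the last dword of the 3DPRIMITIVE, and that the decoder's field labels may not match this command layout. Likewise, 0x78090007 declares 9 dwords: the header plus four 2-dword vertex elements, matching the four elements listed.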
Comment 3 Artem Shinkarov 2016-09-14 16:42:41 UTC
Dear Yann, thanks for the quick reply. I will try a drm-intel kernel in the next few days and will report whether it helps with the problem.

BTW, do you have any idea when the changes you pointed to will be merged into the kernel, if at all?

Also, I didn't mention this before, but it might be relevant:
in xorg.conf, I explicitly set DRI to 3. This gets rid of the annoying tearing that I observe when, for example, I scroll in Firefox.

Here is the relevant part of xorg.conf:
Section "Device"
   Identifier  "Intel Graphics"
   Driver      "intel"
   Option      "DRI"    "3"
EndSection
Comment 4 yann 2016-09-15 11:40:44 UTC
(In reply to Artem Shinkarov from comment #3)
> Dear Yann, thanks for a quick reply.  I will give a drm-intel kernel a go in
> the next few days, and will report whether it helped with the problem.
> 
> BTW, any ideas when the changes that you were pointing to will be merged
> into the kernel, if at all?

Some of them are already part of 4.8, and I think that by 4.9 (I don't have a crystal ball :)) they should be part of it.

> 
> Also, I didn't mention this before, but it might be relevant:
> in the xorg.conf, I explicitly set DRI to 3.  This gets rid of the annoying
> tearing that I observe when for example I do some scrolling in firefox.
> 
> Here is a relevant part of the xorg.conf
> Section "Device"
>    Identifier  "Intel Graphics"
>    Driver      "intel"
>    Option      "DRI"    "3"
> EndSection

Right, with DRI3 you don't need to set "TearFree" to "true".
Comment 5 Artem Shinkarov 2016-10-12 16:32:39 UTC
A late update from my side.

As suggested, I compiled a fresh kernel from the drm-intel repository (commit 9aa8c0cdbc076bcc0486d7a31922a0f77c032fe7) and have been running it for several weeks now.

Most of the annoying behaviour has gone away. However, a few issues remain. First, I occasionally see the following message in dmesg:

[  397.680131] [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe A FIFO underrun

which causes a short blink of the screen.

Second, just today the computer hung, and a later inspection of the system log revealed stack traces caused by vblank wait timeouts in the functions "intel_atomic_commit_tail" and "drm_wait_one_vblank". The stack traces are attached as [2016-12-10-dmesg.txt]. I don't think I can extract more information, as I had to do a hard reset.

Any suggestions?
Comment 6 Artem Shinkarov 2016-10-12 16:33:10 UTC
Created attachment 127252
New dmesg with stack traces
Comment 7 Matt Turner 2016-11-04 00:34:59 UTC
Please test a newer version of Mesa (12 or 13) and mark the bug as REOPENED
if you can reproduce the issue, or RESOLVED/* if you cannot.
Comment 8 Annie 2017-02-10 22:38:54 UTC
Dear Reporter,

This Mesa bug has been in the "NEEDINFO" status for over 60 days. I am closing this bug based on lack of response but feel free to reopen if resolution is still needed. Please ensure you're supplying the correct information as requested.

Thank you.
