105155 – [bdw] GPU HANG: ecode 8:0:0x86dffffd, in Xorg

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

Reported:	2018-02-18 17:33 UTC by Samuel Thibault
Modified:	2019-09-25 19:09 UTC
CC List:	1 user

Attachments
error dump (41.73 KB, text/plain) 2018-02-18 17:33 UTC, Samuel Thibault
Xorg log (60.57 KB, text/plain) 2018-02-18 17:34 UTC, Samuel Thibault
dmesg (55.54 KB, text/plain) 2018-02-18 17:34 UTC, Samuel Thibault
lspci (8.07 KB, text/plain) 2018-02-18 17:35 UTC, Samuel Thibault
xorg.conf (890 bytes, text/plain) 2018-02-18 17:36 UTC, Samuel Thibault
packages versions (6.22 KB, text/plain) 2018-02-18 17:38 UTC, Samuel Thibault
error dump (50.80 KB, application/octet-stream) 2018-03-15 08:30 UTC, Samuel Thibault
before xrandr call (1.38 KB, text/plain) 2018-03-21 18:56 UTC, Samuel Thibault
after xrandr call (1.38 KB, text/plain) 2018-03-21 18:56 UTC, Samuel Thibault
dmesg 4.5.12 (324.38 KB, text/plain) 2018-03-23 01:16 UTC, Samuel Thibault
dmesg 4.16.0-rc6 (595.40 KB, text/plain) 2018-03-23 08:17 UTC, Samuel Thibault

Description Samuel Thibault 2018-02-18 17:33:39 UTC

Created attachment 137426 [details]
error dump

Hello,

When I leave the modesetting driver's AccelMethod option at its default, I very easily get a GPU hang. I just need to run:

startx fvwm

I get an X session with eDP-1 on the left and DP-2 on the right, which I change with

xrandr --output eDP-1 --auto --output DP-2 --auto --above eDP-1

and then I get a GPU hang. I am attaching the /sys/class/drm/card0/error output now and will attach the other logs as well.
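
For anyone reproducing this, the error state named above can be copied somewhere persistent right after a hang. The following is a minimal sketch, not part of the original report: the function name `save_gpu_error_state` and the destination file naming are illustrative, and reading the sysfs file may require root.

```shell
# Sketch: copy the i915 error state to a timestamped file after a hang.
# The default source path is the one named in this report; the function
# name and the output file naming are illustrative assumptions.
save_gpu_error_state() {
    src="${1:-/sys/class/drm/card0/error}"
    dest_dir="${2:-.}"
    # Nothing to save if the file is missing or unreadable.
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    dest="$dest_dir/gpu-error-$(date +%Y%m%d-%H%M%S).txt"
    # Read the file to EOF and store a regular copy.
    cat "$src" > "$dest" || return 1
    # Print the path of the saved copy so callers can attach it.
    echo "$dest"
}
```

Called with no arguments, it reads `/sys/class/drm/card0/error` and writes into the current directory.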

Comment 1 Samuel Thibault 2018-02-18 17:34:34 UTC

Created attachment 137427 [details]
Xorg log

Comment 2 Samuel Thibault 2018-02-18 17:34:56 UTC

Created attachment 137428 [details]
dmesg

Comment 3 Samuel Thibault 2018-02-18 17:35:31 UTC

Created attachment 137429 [details]
lspci

Comment 4 Samuel Thibault 2018-02-18 17:36:42 UTC

Created attachment 137430 [details]
xorg.conf

Comment 5 Samuel Thibault 2018-02-18 17:38:00 UTC

Created attachment 137431 [details]
packages versions

Comment 6 Elizabeth 2018-02-22 17:57:37 UTC

Hello Samuel, does it make any difference if you add intel_iommu=igfx_off to the kernel command line in grub? What distro and desktop environment are you using? Would it be possible for you to test with mesa 18.0.0-rc4?
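
A small sketch of how one might confirm that the parameter actually reached the running kernel. `has_kernel_param` is a hypothetical helper, not an existing tool, and the GRUB steps in the comments assume a Debian-style setup.

```shell
# Sketch: report whether an exact parameter is present on a kernel
# command line. has_kernel_param is a hypothetical helper name.
has_kernel_param() {
    cmdline="$1"
    param="$2"
    # Pad with spaces so only whole, space-delimited words match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
    esac
    return 1
}

# Assumed Debian-style workflow (not confirmed by this report):
#   1. append intel_iommu=igfx_off to GRUB_CMDLINE_LINUX_DEFAULT
#      in /etc/default/grub
#   2. run update-grub as root and reboot
#   3. has_kernel_param "$(cat /proc/cmdline)" intel_iommu=igfx_off && echo active
```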

Comment 7 Samuel Thibault 2018-02-22 19:11:46 UTC

It makes a huge difference indeed! No issue at all so far, despite playing videos, doing some OpenGL, etc.

Comment 8 Elizabeth 2018-03-14 23:05:58 UTC

Hello Samuel, were you able to test with mesa 18.0.0-rc4 or latest 17.3.6 release?

Comment 9 Samuel Thibault 2018-03-14 23:21:55 UTC

17.3.6 was getting the same result.
I have just upgraded to Debian experimental's 18.0.0-rc4 and my reproduction test case no longer shows any issue. I'll see how well it goes in the long run.

Comment 10 Samuel Thibault 2018-03-15 08:30:24 UTC

Created attachment 138121 [details]
error dump

This morning the same symptom happened, here is the error dump.
linux 4.15.0, mesa 18.0.0~rc4

Comment 11 Elizabeth 2018-03-15 15:17:31 UTC

(In reply to Samuel Thibault from comment #10)
> Created attachment 138121 [details]
> error dump
> 
> This morning the same symptom happened, here is the error dump.
> linux 4.15.0, mesa 18.0.0~rc4
So steps from comment #1 still produce it?

Comment 12 Samuel Thibault 2018-03-15 23:03:27 UTC

Yes. I guess I was just lucky yesterday and should have tried more times.

Comment 13 Elizabeth 2018-03-16 19:55:09 UTC

Hi, I'm going to try to replicate the issue. I found a BDW with a DP output, but it was having HW issues, so I'll try with another BDW with HDMI. To summarize: Arch Linux + fvwm + eDP and an external display + AccelMethod left at its default, and finally xrandr to change the outputs, right?

Comment 14 Samuel Thibault 2018-03-16 19:56:50 UTC

It's Debian Buster :) but yes that's it.

Comment 15 Elizabeth 2018-03-16 20:18:21 UTC

(In reply to Samuel Thibault from comment #14)
> It's Debian Buster :) but yes that's it.
Oh, you're right: (Debian 7.3.0-1).
I was looking at a different log.

Comment 16 Elizabeth 2018-03-20 22:06:51 UTC

Hi again, I failed to replicate on a BDW with HDMI. This is what I have done so far:

1. Installed Debian sid
2. Updated and upgraded the system
3. Installed a newer kernel (because I needed to apply this patch https://patchwork.kernel.org/patch/10156067/ for acpi)
4. Installed fvwm
5. startx
6. Connected the HDMI output (after booting had finished)
7. Used:
 xrandr --output eDP-1 --auto --output DP-2 --auto --(above/below/left-of/right-of) eDP-1
8. No hang happened. (I can use a terminal only on the primary display; not sure if that's expected from fvwm.)

As for the AccelMethod, I did nothing to it:
gfx@debian:~$ sudo find / -name xorg.conf
/usr/share/doc/xserver-xorg-video-intel/xorg.conf
gfx@debian:~$ cat /usr/share/doc/xserver-xorg-video-intel/xorg.conf
Section "Device"
        Identifier "Intel"
        Driver "intel"
#       Option "AccelMethod" "uxa"
EndSection
gfx@debian:~$

What am I missing to be able to replicate it?

Thanks.

gfx@debian:~$ uname -a
Linux debian 4.16.0-rc6 #1 SMP Tue Mar 20 02:10:14 PDT 2018 x86_64 GNU/Linux
gfx@debian:~$ glxinfo | grep -i "opengl version"
OpenGL version string: 3.0 Mesa 17.3.6
gfx@debian:~$

Comment 17 Samuel Thibault 2018-03-21 18:55:33 UTC

Hello,

fvwm indeed doesn't currently detect xinerama changes, but let's just take it out of the picture.

- I boot with the external screen plugged into VGA, so the linux console shows up on both screens.
- I run startx xterm, it works fine, one screen on the right of the other.
- In the xterm window, I run "xrandr --output eDP-1 --auto --output DP-2 --auto --above eDP-1" to get one screen above the other.
- I can press Enter in xterm a couple of times and it still works, until it has to scroll, and at that point things hang.

I'll attach the xrandr output before and after the change, in case details there matter.
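
Before issuing the xrandr call from the steps above, it can help to confirm which output names the server reports as connected. A minimal sketch, assuming the usual `xrandr --query` line format (`NAME connected ...`); `list_connected_outputs` is a hypothetical helper name.

```shell
# Sketch: print the names of connected outputs from `xrandr --query`
# text supplied on stdin. The helper name is illustrative.
list_connected_outputs() {
    # Output lines look like "eDP-1 connected primary 1920x1080+0+0 ...";
    # "disconnected" outputs and indented mode lines do not match.
    awk '$2 == "connected" { print $1 }'
}

# Usage inside the X session:
#   xrandr --query | list_connected_outputs
```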

Comment 18 Samuel Thibault 2018-03-21 18:56:13 UTC

Created attachment 138256 [details]
before xrandr call

Comment 19 Samuel Thibault 2018-03-21 18:56:29 UTC

Created attachment 138257 [details]
after xrandr call

Comment 20 Elizabeth 2018-03-22 20:09:09 UTC

I tried again with the HDMI connected from the beginning, and there has been no issue so far.
Could you try the same kernel as mine (4.16.0-rc6)? It's the mainline one from https://www.kernel.org.
You could also add the parameter drm.debug=0x1e in grub to get more debug information, and run dmesg -w over ssh to check for any errors occurring just before the hang.
You mentioned vga before, is that correct? Don't you mean DP?
The error state indicates a mesa related issue, but did this work with a previous version of mesa?
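
For reference, the drm.debug value is a bitmask of debug categories. The sketch below decodes it; the bit names follow the kernel's DRM_UT_* debug categories as I understand them for kernels of the 4.15/4.16 era, so treat the mapping as an assumption on other versions. `decode_drm_debug` is a hypothetical helper.

```shell
# Sketch: decode a drm.debug bitmask into category names.
# Assumed mapping (DRM_UT_* categories, 4.15/4.16-era kernels):
#   0x01 CORE, 0x02 DRIVER, 0x04 KMS, 0x08 PRIME,
#   0x10 ATOMIC, 0x20 VBL, 0x40 STATE, 0x80 LEASE
decode_drm_debug() {
    mask=$(( $1 ))   # accepts decimal or 0x-prefixed hex
    names=""
    bit=1
    for name in CORE DRIVER KMS PRIME ATOMIC VBL STATE LEASE; do
        if [ $(( mask & bit )) -ne 0 ]; then
            # Append with a separating space only when names is non-empty.
            names="${names:+$names }$name"
        fi
        bit=$(( bit * 2 ))
    done
    echo "$names"
}
```

Under that mapping, `decode_drm_debug 0x1e` prints `DRIVER KMS PRIME ATOMIC`, i.e. everything except the very verbose core messages.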

Comment 21 Samuel Thibault 2018-03-23 01:16:38 UTC

Created attachment 138292 [details]
dmesg 4.5.12

I'll have to compile that kernel, but in the meantime here are the dmesg results from 4.15.12 with drm.debug=0x1e.

I actually didn't need to use ssh: even when Xorg looks frozen, Ctrl-Alt-F2 works (it just takes some time to take effect; apparently moving the mouse helps).

>  You mentioned vga before, is that correct? Don't you mean DP?

Well, it's really a VGA plug that I have on this laptop, even if in xrandr it happens to be called DP-2.

> did this work with a previous version of mesa?

As far as I can remember, I have never been able to get a stable Xorg workspace without disabling acceleration, before adding intel_iommu=igfx_off.
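
As an illustration of what "disabling acceleration" can look like for the modesetting driver, here is a sketch that prints a Device section. It is not the reporter's actual xorg.conf (that is in the attachment and not reproduced here); the identifier string and the helper name `print_modesetting_conf` are made up. The modesetting driver documents "glamor" (its default) and "none" as AccelMethod values.

```shell
# Sketch: emit an xorg.conf Device section for the modesetting driver.
# The Identifier and the function name are illustrative assumptions.
print_modesetting_conf() {
    accel="${1:-none}"   # "none" disables acceleration, "glamor" is the default
    cat <<EOF
Section "Device"
    Identifier "Intel Graphics"
    Driver "modesetting"
    Option "AccelMethod" "$accel"
EndSection
EOF
}

# Usage (file path is an example only):
#   print_modesetting_conf none > /etc/X11/xorg.conf.d/20-modesetting.conf
```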

Comment 22 Samuel Thibault 2018-03-23 08:17:50 UTC

Created attachment 138304 [details]
dmesg 4.16.0-rc6

The symptoms are indeed a bit different with 4.16.0-rc6 (it seems to be able to recover), but I'm still getting hangs. Here is the dmesg with debugging enabled.

Comment 23 Elizabeth 2018-03-23 15:48:23 UTC

Well, I found it interesting that these messages appear just before the hang report in dmesg:

[  154.524158] DMAR: DRHD: handling fault status reg 3
[  154.524160] DMAR: [DMA Read] Request device [00:02.0] fault addr 38d000 [fault reason 05] PTE Write access is not set

Looking around, I believe I found why you have to use intel_iommu=igfx_off:

"On HPE ProLiant Gen9-series servers running Red Hat Enterprise Linux 6, Red Hat Enterprise Linux 7, SUSE Linux Enterprise Server 11 SP3, or SUSE Linux Enterprise Server 12 with the I/O Memory Management Unit (IOMMU) option Enabled in the ROM-Based Setup Utility (RBSU) and with "intel_iommu=on" added to the Linux kernel boot parameters, the IP addresses assigned to interface will not be accessible and a message similar to "CPU stuck" may be displayed on the console. In addition, DMAR fault messages are logged in the /var/log/messages as follows:

> dmar: DRHD: handling fault status reg 2 
> dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr 791dc000 
> DMAR:[fault reason 05] PTE Write access is not set 
> dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr 791dc000 
> DMAR:[fault reason 05] PTE Write access is not set 
> dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr 791dc000 
> DMAR:[fault reason 05] PTE Write access is not set

This occurs because of a known limitation that the bnx2x driver has with the Option Card Black Box - Active Health (OCBB) feature when IOMMU is enabled. The network adapter firmware will attempt to access a memory area that is no longer assigned the network devices when bringing up/down the interface or loading/unloading the driver. When this occurs, a reboot is required."

Information from here https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c04565693
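
The DMAR fault lines quoted at the top of this comment can be pulled out of a saved dmesg with something like the sketch below. It assumes the single-line message format seen in the attached dmesg (not the older split format in the HPE excerpt), and `extract_dmar_faults` is a hypothetical helper.

```shell
# Sketch: extract DMAR fault records from dmesg text on stdin.
# Prints: <pci device> <Read|Write> <fault addr> <fault reason>
# Assumes the one-line format shown in this report's dmesg.
extract_dmar_faults() {
    sed -n 's/.*DMAR: \[DMA \([A-Za-z]*\)] Request device \[\([0-9a-f:.]*\)] fault addr \([0-9a-f]*\).*fault reason \([0-9]*\).*/\2 \1 \3 \4/p'
}

# Usage:
#   dmesg | extract_dmar_faults
```

For the line quoted above, this yields `00:02.0 Read 38d000 05`; device 00:02.0 is the address where Intel integrated graphics normally sits, which can be checked against the lspci attachment.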

Comment 24 GitLab Migration User 2019-09-25 19:09:25 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1692.