Bug 106931 - GPU HANG: ecode 9:0:0x85dffffb, in electron [21270], reason: Ring hung, action: reset
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-15 21:47 UTC by Jim
Modified: 2018-08-13 09:45 UTC

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
The attached file is the output from /sys/class/drm/card0/error (529.58 KB, text/x-log)
2018-06-15 21:47 UTC, Jim

Description Jim 2018-06-15 21:47:23 UTC
Created attachment 140177
The attached file is the output from /sys/class/drm/card0/error

Sorry if this is more noise on top of bugs 102397 and 102470 (both chrome and chromium), but I found this similar issue in the dmesg output:

[Tue Jun 12 19:31:46 2018] [drm] stuck on render ring
[Tue Jun 12 19:31:46 2018] [drm] GPU HANG: ecode 9:0:0x85dffffb, in electron [21270], reason: Ring hung, action: reset
[Tue Jun 12 19:31:46 2018] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[Tue Jun 12 19:31:46 2018] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[Tue Jun 12 19:31:46 2018] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[Tue Jun 12 19:31:46 2018] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[Tue Jun 12 19:31:46 2018] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[Tue Jun 12 19:31:46 2018] drm/i915: Resetting chip after gpu hang
[Tue Jun 12 19:31:48 2018] [drm] RC6 on

This was the very last dmesg entry, and grepping for "\[drm\]" returned only the same output.

$ dmesg -T |grep "\[drm\]"
[Tue Jun 12 19:31:46 2018] [drm] stuck on render ring
[Tue Jun 12 19:31:46 2018] [drm] GPU HANG: ecode 9:0:0x85dffffb, in electron [21270], reason: Ring hung, action: reset
[Tue Jun 12 19:31:46 2018] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[Tue Jun 12 19:31:46 2018] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[Tue Jun 12 19:31:46 2018] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[Tue Jun 12 19:31:46 2018] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[Tue Jun 12 19:31:46 2018] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[Tue Jun 12 19:31:48 2018] [drm] RC6 on
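
When the same check has to be run across several machines, the hang line in the output above can be split into fields mechanically. The following is a minimal Python sketch and not part of the original report: the function name `parse_hang` is hypothetical, and the labels `gen` and `engine` for the first two ecode fields are an assumption about their meaning, not something the log itself states.

```python
import re

# Matches the "GPU HANG" line in the format shown in the dmesg output above.
# The group names "gen" and "engine" are assumed labels for the first two
# ecode fields; the log does not name them.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<comm>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_hang(line):
    """Return a dict of fields from a GPU HANG line, or None if no match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the purely numeric fields; keep the ecode hash as a hex string.
    for key in ("gen", "engine", "pid"):
        fields[key] = int(fields[key])
    return fields
```

Feeding it each line of `dmesg -T` and keeping the non-None results would give one record per hang, which makes it easier to see whether the same process or ecode recurs.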

Also, I've attached the /sys/class/drm/card0/error log (named sys_class_drm_card0_error_20180618).

Thanks.
Comment 1 Chris Wilson 2018-06-15 21:54:03 UTC
Hmm, first things first, please update both kernel and userspace drivers, both are quite old.
Comment 2 Jim 2018-06-15 22:23:50 UTC
Here's some extra info regarding my hardware:

--------------------------------------------------------------------------------
$ lspci 
00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM Registers (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Sky Lake Integrated Graphics (rev 06)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:16.3 Serial controller: Intel Corporation Sunrise Point-H KT Redirection (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #17 (rev f1)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)
00:1d.2 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)

--------------------------------------------------------------------------------
$ sudo dmidecode --type bios
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.0 present.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
	Vendor: American Megatrends Inc.
	Version: 1802
	Release Date: 07/06/2016
	Address: 0xF0000
	Runtime Size: 64 kB
	ROM Size: 16384 kB
	Characteristics:
		PCI is supported
		APM is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		BIOS ROM is socketed
		EDD is supported
		5.25"/1.2 MB floppy services are supported (int 13h)
		3.5"/720 kB floppy services are supported (int 13h)
		3.5"/2.88 MB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		ACPI is supported
		USB legacy is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 5.11

Handle 0x0058, DMI type 13, 22 bytes
BIOS Language Information
	Language Description Format: Long
	Installable Languages: 8
		en|US|iso8859-1
		fr|FR|iso8859-1
		zh|CN|unicode
		<BAD INDEX>
		<BAD INDEX>
		<BAD INDEX>
		<BAD INDEX>
		<BAD INDEX>
	Currently Installed Language: en|US|iso8859-1

--------------------------------------------------------------------------------
$ sudo dmidecode --type baseboard
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.0 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
	Manufacturer: ASUSTeK COMPUTER INC.
	Product Name: Q170M-C
	Version: Rev X.0x
	Serial Number: ***************
	Asset Tag: Default string
	Features:
		Board is a hosting board
		Board is replaceable
	Location In Chassis: Default string
	Chassis Handle: 0x0003
	Type: Motherboard
	Contained Object Handles: 0

Handle 0x0024, DMI type 10, 8 bytes
On Board Device 1 Information
	Type: Video
	Status: Enabled
	Description: To Be Filled By O.E.M.
On Board Device 2 Information
	Type: Ethernet
	Status: Enabled
	Description: To Be Filled By O.E.M.

Handle 0x003E, DMI type 41, 11 bytes
Onboard Device
	Reference Designation:  Onboard IGD
	Type: Video
	Status: Enabled
	Type Instance: 1
	Bus Address: 0000:00:02.0

Handle 0x003F, DMI type 41, 11 bytes
Onboard Device
	Reference Designation:  Onboard LAN
	Type: Ethernet
	Status: Enabled
	Type Instance: 1
	Bus Address: 0000:00:19.0

Handle 0x0040, DMI type 41, 11 bytes
Onboard Device
	Reference Designation:  Onboard 1394
	Type: Other
	Status: Enabled
	Type Instance: 1
	Bus Address: 0000:03:1c.2

--------------------------------------------------------------------------------
$ sudo dmidecode --type processor
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.0 present.

Handle 0x0045, DMI type 4, 48 bytes
Processor Information
	Socket Designation: LGA1151
	Type: Central Processor
	Family: Core i5
	Manufacturer: Intel(R) Corporation
	ID: E3 06 05 00 FF FB EB BF
	Signature: Type 0, Family 6, Model 94, Stepping 3
	Flags:
		FPU (Floating-point unit on-chip)
		VME (Virtual mode extension)
		DE (Debugging extension)
		PSE (Page size extension)
		TSC (Time stamp counter)
		MSR (Model specific registers)
		PAE (Physical address extension)
		MCE (Machine check exception)
		CX8 (CMPXCHG8 instruction supported)
		APIC (On-chip APIC hardware supported)
		SEP (Fast system call)
		MTRR (Memory type range registers)
		PGE (Page global enable)
		MCA (Machine check architecture)
		CMOV (Conditional move instruction supported)
		PAT (Page attribute table)
		PSE-36 (36-bit page size extension)
		CLFSH (CLFLUSH instruction supported)
		DS (Debug store)
		ACPI (ACPI supported)
		MMX (MMX technology supported)
		FXSR (FXSAVE and FXSTOR instructions supported)
		SSE (Streaming SIMD extensions)
		SSE2 (Streaming SIMD extensions 2)
		SS (Self-snoop)
		HTT (Multi-threading)
		TM (Thermal monitor supported)
		PBE (Pending break enabled)
	Version: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
	Voltage: 1.0 V
	External Clock: 100 MHz
	Max Speed: 3600 MHz
	Current Speed: 3200 MHz
	Status: Populated, Enabled
	Upgrade: Other
	L1 Cache Handle: 0x0042
	L2 Cache Handle: 0x0043
	L3 Cache Handle: 0x0044
	Serial Number: To Be Filled By O.E.M.
	Asset Tag: To Be Filled By O.E.M.
	Part Number: To Be Filled By O.E.M.
	Core Count: 4
	Core Enabled: 4
	Thread Count: 4
	Characteristics:
		64-bit capable
		Multi-Core
		Execute Protection
		Enhanced Virtualization
		Power/Performance Control

--------------------------------------------------------------------------------
$ sudo dmidecode --type memory   
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.0 present.

Handle 0x0046, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: None
	Maximum Capacity: 64 GB
	Error Information Handle: Not Provided
	Number Of Devices: 4

Handle 0x0047, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x0046
	Error Information Handle: Not Provided
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 4096 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_A1
	Bank Locator: BANK 0
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2133 MHz
	Manufacturer: Kingston
	Serial Number: 33322423
	Asset Tag: 9876543210
	Part Number: 9905678-033.A00G    
	Rank: 1
	Configured Clock Speed: 2133 MHz
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: 1.2 V

(only the first RAM slot is in use)
Comment 3 Jim 2018-06-15 22:36:52 UTC
(In reply to Chris Wilson from comment #1)
> Hmm, first things first, please update both kernel and userspace drivers,
> both are quite old.

I agree. However, in this current environment, we are kinda stuck on kernel version 4.4.0-31-generic. But thanks for the reply and I'll bring that up with my team.

In the meantime, I can gather a few non-production systems, update them (apt update && apt upgrade && apt dist-upgrade), and run them under the same load to see what happens. I assume that we'll have trouble reproducing the issue because I haven't noticed this before (not to say it hasn't ever happened before, I just noticed it today).

Thanks for the reply.
Comment 4 James Ausmus 2018-06-15 23:05:16 UTC
Please try using https://cgit.freedesktop.org/drm-tip and send dmesg with the following added to the kernel params:

drm.debug=0x1e log_buf_len=4M
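
Before collecting dmesg it can help to confirm that the running kernel was actually booted with both parameters. The following is a rough Python sketch under stated assumptions: the helper names `missing_params` and `read_cmdline` are hypothetical, and it only compares whitespace-separated tokens from `/proc/cmdline` against the two parameters requested above.

```python
# The two parameters requested in this comment.
REQUIRED = ("drm.debug=0x1e", "log_buf_len=4M")


def missing_params(cmdline, required=REQUIRED):
    """Return the required parameters that are absent from a cmdline string."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux)."""
    with open(path) as handle:
        return handle.read().strip()
```

Calling `missing_params(read_cmdline())` on the test machine should return an empty list once both parameters are in effect after a reboot.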


Do you have a feel for how often the issue happens?
Comment 5 Jani Saarinen 2018-06-25 10:09:16 UTC
Reporter, any updates to this?
Comment 6 Jani Saarinen 2018-08-13 09:45:02 UTC
No feedback in many months, closing as resolved, works for me.
Please re-open if this is still the case after testing the latest https://cgit.freedesktop.org/drm-tip, and send dmesg with drm.debug=0x1e log_buf_len=4M.

