26723 – frequent X server crash ; [drm:i915_hangcheck_elapsed] ; hardware wedged ; have to reboot.


Status:	RESOLVED DUPLICATE of bug 26345

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	Carl Worth
QA Contact:	Xorg Project Team

Reported:	2010-02-24 02:00 UTC by theonewiththeevillook
Modified:	2010-06-08 14:32 UTC
CC List:	6 users

Attachments
dmesg output, after the crash. (49.21 KB, text/plain) 2010-02-24 02:00 UTC, theonewiththeevillook
xorg.conf in use. (3.37 KB, text/plain) 2010-02-24 02:03 UTC, theonewiththeevillook
kernel configuration for the kernel in use. (12.73 KB, application/gzip) 2010-02-24 02:06 UTC, theonewiththeevillook
Xorg log after the crash (14.61 KB, text/plain) 2010-02-24 02:13 UTC, theonewiththeevillook

Description theonewiththeevillook 2010-02-24 02:00:21 UTC

Created attachment 33522
dmesg output, after the crash.

Using Gentoo, with:
Xorg intel driver 2.9.1
libdrm 2.4.18
kernel 2.6.32

My graphics hardware (output of lspci):
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) (prog-if 00 [VGA controller])
	Subsystem: IBM NetVista A30p
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at 88000000 (32-bit, prefetchable) [size=128M]
	Region 1: Memory at 80000000 (32-bit, non-prefetchable) [size=512K]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [d0] Power Management version 1
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

Here is what happens: while I'm doing nothing special (such as filling in a form in Firefox), the screen goes black. Pressing Ctrl+Alt+Fn changes nothing on screen, but I can type commands blind (like "reboot"). I can also log in through ssh and collect some information (see below for dmesg and intel_gpu_dump). I cannot trigger the bug on demand; it seems to happen randomly (it happened 3 times between 08:30 and 10:00 CET).

One last thing, which might be related: reading the EDID is a problem on this box. Most of the time it is corrupted: the end of it is missing (hexdump shows ff), usually resulting in a checksum error.

Although the symptoms are different, I suspect this is related to bug #25765. I should add that in the past, with previous versions of the kernel, libdrm and the intel driver, I have seen various symptoms: a hard crash (no ssh possible, no SysRq keys working); an X server crash that still left tty1-6 usable; or the screen misbehaving in X, for example frozen but with the mouse still working, or almost frozen with some apps (Emacs) able to partially modify the screen. Each time I hoped that the next version of kernel/Xorg/libdrm would fix it (my expectations were high after bug #25475 was set to "FIXED"), but each time a new symptom seemed to appear.

Thanks for looking into it, and please let me know if I should provide more info, or if I can otherwise help to debug. 

dmesg output (full log attached):
=============
[81604.882014] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[81604.882026] render error detected, EIR: 0x00000000
[81604.882030] i915: Waking up sleeping processes
[81604.882047] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 2146572 at 2146570)
[81604.882367] reboot required
[81606.038019] [drm:i915_gem_idle] *ERROR* hardware wedged
============= (end of partial dmesg output)
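
For anyone triaging similar logs, the hang signature in the excerpt above can be picked out mechanically. The following is a minimal Python sketch, not an i915 tool: the function name `scan_dmesg`, the report layout and the regular expressions are illustrative assumptions derived only from the lines quoted above, and other kernel versions may word these messages differently.

```python
import re

# Patterns copied from the dmesg excerpt in this report (assumption: other
# kernels may phrase these messages differently).
HANGCHECK = re.compile(r"\[drm:i915_hangcheck_elapsed\] \*ERROR\* Hangcheck timer elapsed")
WEDGED = re.compile(r"\[drm:i915_gem_idle\] \*ERROR\* hardware wedged")
WAIT = re.compile(r"i915_wait_request returns (-?\d+) \(awaiting (\d+) at (\d+)\)")


def scan_dmesg(text):
    """Report which GPU-hang markers appear in a dmesg dump (hypothetical helper)."""
    report = {"hangcheck": False, "wedged": False, "wait": None}
    for line in text.splitlines():
        if HANGCHECK.search(line):
            report["hangcheck"] = True
        if WEDGED.search(line):
            report["wedged"] = True
        m = WAIT.search(line)
        if m:
            code, awaiting, current = (int(g) for g in m.groups())
            report["wait"] = {
                "code": code,            # -5 is -EIO on Linux
                "awaiting": awaiting,    # number the driver was waiting for
                "current": current,      # number reached when the wait failed
                "behind": awaiting - current,
            }
    return report


SAMPLE = """\
[81604.882014] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[81604.882026] render error detected, EIR: 0x00000000
[81604.882030] i915: Waking up sleeping processes
[81604.882047] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 2146572 at 2146570)
[81604.882367] reboot required
[81606.038019] [drm:i915_gem_idle] *ERROR* hardware wedged
"""

if __name__ == "__main__":
    print(scan_dmesg(SAMPLE))
```

On the excerpt above this flags both the hangcheck and the wedged messages, and shows the wait failing two steps short of the awaited number (2146572 versus 2146570).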

and intel_gpu_dump gives this:
============= (gpu dump)
ACTHD: 0x0164d238
EIR: 0x00000000
EMR: 0xffffff69
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x41800000
IPEIR: 0x00000000
INSTDONE: 0x01ffffc1
  busy: GMBUS
  busy: FBC
  busy: Secondary ring 3
  busy: Secondary ring 2
  busy: Secondary ring 1
  busy: Secondary ring 0
  busy: Primary ring 1
Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write

Warning: Ignoring unrecognized line at 
/sys/kernel/debug/dri/0/i915_ringbuffer_data:1:
No ringbuffer setup
============= (end of gpu dump)
(the last three lines were sent to stderr)
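
To compare dumps from different crashes, the register lines above can be turned into numbers. This is a minimal Python sketch under stated assumptions: `parse_gpu_dump` is a hypothetical helper, it only understands the `NAME: 0xVALUE` and `busy: UNIT` line shapes visible in this dump, and it does not attempt to interpret individual register bits.

```python
import re

# Matches lines such as "IPEHR: 0x41800000" (upper-case register name, hex value).
REG_LINE = re.compile(r"^([A-Z_]+):\s*0x([0-9a-fA-F]+)$")


def parse_gpu_dump(text):
    """Return (registers, busy_units) from intel_gpu_dump-style text.

    Hypothetical helper: every line that does not fit one of the two known
    shapes is silently skipped.
    """
    registers = {}
    busy_units = []
    for raw in text.splitlines():
        line = raw.strip()
        m = REG_LINE.match(line)
        if m:
            registers[m.group(1)] = int(m.group(2), 16)
        elif line.startswith("busy:"):
            busy_units.append(line.split(":", 1)[1].strip())
    return registers, busy_units


DUMP = """\
ACTHD: 0x0164d238
EIR: 0x00000000
EMR: 0xffffff69
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x41800000
IPEIR: 0x00000000
INSTDONE: 0x01ffffc1
  busy: GMBUS
  busy: FBC
  busy: Secondary ring 3
  busy: Secondary ring 2
  busy: Secondary ring 1
  busy: Secondary ring 0
  busy: Primary ring 1
Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write
"""

if __name__ == "__main__":
    regs, busy = parse_gpu_dump(DUMP)
    for name, value in regs.items():
        print(f"{name:10s} 0x{value:08x}")
    print("busy units:", ", ".join(busy))
```

Keeping the values as integers makes it easy to check a single register, for example `regs["IPEHR"] == 0x41800000`, across several reports.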

get-edid | parse-edid output:
============= (edid)
Section "Monitor"
        # Block type: 2:0 3:fd
        # Block type: 2:0 3:fc
        Identifier "CPD-E230"
        VendorName "SNY"
        ModelName "CPD-E230"
        # Block type: 2:0 3:fd
        HorizSync 30-85
        VertRefresh 48-170
        # Max dot clock (video bandwidth) 190 MHz
        # Block type: 2:0 3:fc
        # Block type: 2:0 3:ff
        # DPMS capabilities: Active off:yes  Suspend:no  Standby:no

        Mode    "1024x768"      # vfreq 84.997Hz, hfreq 68.677kHz
                DotClock        94.500000
                HTimings        1024 1072 1168 1376
                VTimings        768 769 772 808
                Flags   "+HSync" "+VSync"
        EndMode
        # Block type: 2:0 3:fd
        # Block type: 2:0 3:fc
        # Block type: 2:0 3:ff
EndSection
============= (end of edid)
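
As a sanity check on the good EDID, the refresh rates in the mode comment follow from the dot clock and the last (total) values of HTimings and VTimings. The sketch below is illustrative only; the function name `mode_rates` is an assumption, and the sync ranges are the HorizSync and VertRefresh values printed above.

```python
def mode_rates(dot_clock_mhz, h_total, v_total):
    """Return (hfreq in kHz, vfreq in Hz) for a mode line.

    h_total and v_total are the last numbers of HTimings and VTimings.
    """
    hfreq_khz = dot_clock_mhz * 1000.0 / h_total
    vfreq_hz = dot_clock_mhz * 1_000_000.0 / (h_total * v_total)
    return hfreq_khz, vfreq_hz


if __name__ == "__main__":
    # Values from the "1024x768" mode above.
    hfreq, vfreq = mode_rates(94.5, 1376, 808)
    print(f"hfreq {hfreq:.3f} kHz, vfreq {vfreq:.3f} Hz")

    # Ranges advertised by the monitor section above.
    print("within HorizSync 30-85:", 30 <= hfreq <= 85)
    print("within VertRefresh 48-170:", 48 <= vfreq <= 170)
```

This reproduces the 68.677 kHz and 84.997 Hz figures in the parse-edid comment, and both fall inside the advertised ranges.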

Example of how wrong the edid can be:
============= (wrong edid)
00000000  00 ff ff ff ff ff ff 00  4d d9 71 07 01 01 01 01  |........M.q.....|
00000010  31 0b 01 02 0e 21 18 96  2b 0c c9 a0 57 47 9b 27  |1....!..+...WG.'|
00000020  12 48 4c ff ff 80 31 59  45 59 61 5b ff ff ff ff  |.HL...1YEYa[....|
00000030  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00000080
============= (end of wrong edid)
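
The checksum error mentioned earlier can be reproduced from this dump. An EDID base block is 128 bytes, starts with the fixed header 00 ff ff ff ff ff ff 00, and all 128 bytes must sum to 0 modulo 256. The following minimal Python sketch applies that rule; `check_edid_block` is a hypothetical helper name, and the bytes after offset 0x30 are filled with ff as the `*` in the hexdump indicates.

```python
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def check_edid_block(block):
    """Return (header_ok, checksum_ok) for one 128-byte EDID base block."""
    if len(block) != 128:
        raise ValueError("an EDID base block must be exactly 128 bytes")
    header_ok = block[:8] == EDID_HEADER
    # All 128 bytes, including the final checksum byte, must sum to 0 mod 256.
    checksum_ok = sum(block) % 256 == 0
    return header_ok, checksum_ok


# First 0x30 bytes copied from the hexdump above; the remaining 80 bytes are
# 0xff, which is what the '*' line (repeated rows up to 0x80) stands for.
CORRUPT_EDID = bytes.fromhex(
    "00ffffffffffff00" "4dd9710701010101"
    "310b01020e211896" "2b0cc9a057479b27"
    "12484cffff803159" "4559615bffffffff"
) + b"\xff" * 80

if __name__ == "__main__":
    header_ok, checksum_ok = check_edid_block(CORRUPT_EDID)
    print("header ok:", header_ok)
    print("checksum ok:", checksum_ok)
```

For the dump above the header is intact but the checksum test fails, which is consistent with the tail of the block having been read as ff.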

xorg.conf follows as an attachment, as well as my kernel .config.

Comment 1 theonewiththeevillook 2010-02-24 02:03:01 UTC

Created attachment 33523
xorg.conf in use.

Comment 2 theonewiththeevillook 2010-02-24 02:06:35 UTC

Created attachment 33524
kernel configuration for the kernel in use.

I really have no idea if this is useful, but I said I would attach it.

Comment 3 theonewiththeevillook 2010-02-24 02:13:57 UTC

Created attachment 33525
Xorg log after the crash

This log was obviously made after an automatic restart of the X server, so it might be useless as well, but I include it for completeness...

(Side note: I see that many people receive emails because of my attachments; I'm really sorry about this. I don't know whether too many attachments are better than too few, or whether I should have made a tarball instead. This is not addressed in the Bug Writing Guidelines, so if someone has an answer, my inbox is always open! Thanks)

Comment 4 Cyril Brulebois 2010-02-26 22:10:21 UTC

I'm getting this regularly (every few hours usually) as well.

I'm running something similar:
 * Debian's 2.6.32-9 kernel (including up to 2.6.32.9 plus patches)
 * Intel driver 2.9.1
 * libdrm 2.4.18.

Hardware:
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 0201

As for reproducibility, some clues are available in #25475:
http://bugs.freedesktop.org/show_bug.cgi?id=25475#c48
http://bugs.freedesktop.org/show_bug.cgi?id=25475#c83

I haven't tried them yet, though; I wanted to leave a note on the tracker before doing so. AFAIR my crashes happened either while scrolling down in a web browser or while scrolling down in git log's internal pager.

Comment 5 Chris Wilson 2010-03-29 04:10:01 UTC

The original bug is a dup of 26345, as identified by:
IPEHR: 0x41800000


*** This bug has been marked as a duplicate of bug 26345 ***
