Bug 17713 - [845G] GPU hanging on X start (sometimes)
Status: RESOLVED FIXED
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: 7.3 (2007.09)
Hardware: x86 (IA32) Linux (All)
Importance: medium critical
Assigned To: Wang Zhenyu
QA Contact: Xorg Project Team
Keywords: NEEDINFO
Duplicates: 17670 18270
Depends on:
Blocks:

Reported: 2008-09-22 04:10 UTC by david manyé
Modified: 2009-11-05 22:16 UTC (History)


Attachments
log for version 1.4.2 from a machine that refuses to boot X (32.59 KB, text/plain)
2008-09-22 04:12 UTC, david manyé
log for version 1.4.2 from a machine that boot X correctly (35.98 KB, text/plain)
2008-09-22 04:13 UTC, david manyé
log for version 1.5.0 from the same machine that refuses to boot X with 1.4.2 (27.41 KB, text/plain)
2008-09-22 04:14 UTC, david manyé
Xorg.0.logs, dmesgs, and lsmod in working versus nonworking times (29.00 KB, application/x-bzip-compressed-tar)
2008-12-18 20:41 UTC, Joey Adams
xserver 1.5.3 + intel drv 2.5.1 (5.36 KB, application/gzip)
2009-01-15 02:27 UTC, david manyé
X starts successfully, mode debug on, rev 3 machine (83.12 KB, application/octet-stream)
2009-02-26 03:04 UTC, david manyé
X fails to, mode debug on, rev 3 machine (72.29 KB, application/octet-stream)
2009-02-26 03:06 UTC, david manyé
output log for intel_reg_dumper in console mode (1.78 KB, application/gzip)
2009-08-03 01:54 UTC, david manyé
output log for intel_reg_dumper in graphic mode (1.84 KB, application/gzip)
2009-08-03 01:55 UTC, david manyé

Description david manyé 2008-09-22 04:10:55 UTC
i have a lab with ~30 near identical computers. they all have an integrated intel video adapter: lspci shows:

00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

some are "rev 01" and others are "rev 03".

we run gnu/linux debian lenny which currently uses:

xserver-xorg             1:7.3+15         the X.Org X server
xserver-xorg-core        2:1.4.2-4        Xorg X server - core server
xserver-xorg-video-intel 2:2.3.2-2+lenny2 X.Org X server -- Intel i8xx, i9xx...

when booting the machines, sometimes some of them (not always the same ones) crash when trying to start X. a gdm message appears saying that X cannot be started.

rebooting a machine a few times usually solves the problem.

we have other labs with other intel chipsets without problem.

on one of the crashing machines, i've installed a more recent version (debian experimental):

xserver-xorg-video-intel 2:2.4.2-1
xserver-xorg             1:7.4~3
xserver-xorg-core        2:1.5.0-1

and the computer seems to hang (but i can ssh to it remotely).

i attach a .tar.gz with three Xorg logs:
 Xorg.log.0.1.4.2.crashed
 Xorg.log.0.1.5.0.crashed
 Xorg.log.0.1.4.2.ok

the first two are from the same machine, which booted neither 1.4.2 nor 1.5.0. the third log is from another (identical) machine that successfully started X with 1.4.2. i think the names are self-explanatory.

also, i tried unsuccessfully to reinitialize the graphics card using vbetool (http://packages.debian.org/lenny/vbetool). 

debian bug 498703 (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498703) refers to the same bug explained here.

feel free to ask for more info or tests. any help will be appreciated.
Comment 1 david manyé 2008-09-22 04:12:26 UTC
Created attachment 19093 [details]
log for version 1.4.2 from a machine that refuses to boot X
Comment 2 david manyé 2008-09-22 04:13:13 UTC
Created attachment 19094 [details]
log for version 1.4.2 from a machine that boot X correctly
Comment 3 david manyé 2008-09-22 04:14:00 UTC
Created attachment 19095 [details]
log for version 1.5.0 from the same machine that refuses to boot X with 1.4.2
Comment 4 Gordon Jin 2008-09-23 02:20:38 UTC
I'd suggest focusing on the problem with the latest driver.

Note: when we say driver version, we mean the xorg-video-intel version (2.4.2), not the xserver version (1.5.0).
Comment 5 Michael Fu 2008-09-25 19:56:15 UTC
*** Bug 17670 has been marked as a duplicate of this bug. ***
Comment 6 Felix Miata 2008-09-30 18:39:19 UTC
I have openSUSE Factory and kernel-2.6.27-rc6-7-pae on an i845G in a Dell GX260 desktop. Xorg.0.log shows Xorg 1.5.0 with Intel module version 2.4.97. Attempting to use X either directly or via Sax2 completely locks up the system. Excerpt from Xorg.0.log:
Fatal server error:
lockup

Error in I830WaitLpRing(), timeout for 2 seconds
pgetbl_ctl: 0x3ffe0001 getbl_err: 0x00000000
ipeir: 0x00000000 iphdr: 0x05000000
LP ring tail: 0x00000020 head: 0x0000000c len: 0x0001f001 start 0x00000000
eir: 0x0000 esr: 0x0000 emr: 0xff7b
instdone: 0xffc1 instpm: 0x0000
memmode: 0x00000000 instps: 0x00000040
hwstam: 0xffff ier: 0x0000 imr: 0xffff iir: 0x0000
Ring at virtual 0xaf897000 head 0xc tail 0x20 count 5 acthd 0x311a8
Comment 7 Felix Miata 2008-09-30 19:03:01 UTC
The system from comment 6, booted into Mandriva Cooker 2.6.27-rc7.5.1 with Xorg 1.4.2/Intel 2.4.2-4, seems to work fine.
Comment 8 Felix Miata 2008-10-22 10:25:40 UTC
I still crash on 1.5.2 and 2.4.97 on openSUSE Factory's 2.6.27.1.
Comment 9 Felix Miata 2008-10-22 13:49:26 UTC
I have both Intel-motherboard and Foxconn/Dell-motherboard 845G systems. The problem is gone with 2.5.0 and 1.5.2 on the Intel board, but not on the Dell, regardless of whether the BIOS video buffer is set to 1M or 8M.
Comment 10 Gordon Jin 2008-11-16 18:31:20 UTC
How about adding "AccelMethod NoAccel" under the Device section in xorg.conf?

This bug seems a bit similar to bug#17291.
Comment 11 Felix Miata 2008-11-16 18:51:13 UTC
(In reply to comment #10)
> How about if you add "AccelMethod NoAccel" in under device section in
> xorg.conf?

...
1.5.2
...
Parse error on line 164 of section Device in file /etc/x11/xorg.conf
  "AccelMethod NoAccel" is not a valid keyword in this section."
...
Comment 12 Felix Miata 2008-11-16 18:58:41 UTC
Adding 'Option "NoAccel"' in Section "Device" in xorg.conf enables X startup success on the GX260.
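For readers applying the same workaround, here is a minimal sketch of what the working form looks like and how it differs from the bare keyword that the parser rejected in comment 11. The xorg.conf fragment is embedded as a string; the Identifier value and the helper name `device_options` are placeholders for illustration, not taken from this report.

```python
import re

# Hypothetical minimal xorg.conf fragment. The Identifier value is a
# placeholder; Option "NoAccel" is the form reported to work above.
XORG_CONF = '''
Section "Device"
    Identifier "Intel 845G"
    Driver     "intel"
    Option     "NoAccel"
EndSection
'''

def device_options(conf_text):
    """Return the set of Option names found inside Section "Device" blocks."""
    options = set()
    in_device = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if re.match(r'Section\s+"Device"', line, re.IGNORECASE):
            in_device = True
        elif re.match(r'EndSection\b', line, re.IGNORECASE):
            in_device = False
        elif in_device:
            m = re.match(r'Option\s+"([^"]+)"', line, re.IGNORECASE)
            if m:
                options.add(m.group(1))
    return options

print("NoAccel" in device_options(XORG_CONF))  # True
```

A bare `AccelMethod NoAccel` line is not an `Option` entry, so this check does not pick it up, which mirrors the parse error quoted in comment 11.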
Comment 13 Stefan Dirsch 2008-12-13 15:56:30 UTC
Isn't this a duplicate of Bug #18270?
Comment 14 Michael Fu 2008-12-14 20:54:06 UTC
*** Bug 18270 has been marked as a duplicate of this bug. ***
Comment 15 Joey Adams 2008-12-18 20:41:25 UTC
Created attachment 21290 [details]
Xorg.0.logs, dmesgs, and lsmod in working versus nonworking times
Comment 16 Joey Adams 2008-12-18 20:46:29 UTC
Yay, glad I did a search.  Indeed, this happens to me when I get a "(WW) intel(0): PRB0_HEAD (0x00000004) and PRB0_TAIL (0x00000000) indicate ring buffer not flushed".  Additionally, I get "underrun on pipe A" and a couple of other potentially related errors.  I wonder if they're related.

Since I have free time, plenty of knowledge of C, and a very very tiny bit of knowledge about X drivers, I'll see if I can hack away at this.  Don't count on me, though :)

Anyway, here's the bug report I was about to post:

Title:  Intel driver locks up system at startup randomly; underrun on pipe A

I have an Intel 82845G/GL integrated chipset on an HP Pavilion 503n displaying on a 17-inch LCD monitor.  The following bug happens on the latest Ubuntu Intrepid on Linux 2.6.27-9-generic as well as in Ubuntu Hardy on Linux 2.6.24-14-generic.  All my testing is on Ubuntu Intrepid for this bug information.

My system locks up completely (can't be accessed even through VT switching or SSH) when Ubuntu Intrepid starts up, but this problem occurs randomly.  I'm guessing the randomness is caused by Ubuntu racing to start X before starting a few other system services.  Other times, the driver runs fine, 3D and all, except for these potentially related problems:

1. A blank screen after VT-switching a bit and switching back to F7 or wherever the X server is running.
2. Horizontal jumping effects (seemingly lasting only three or so screen refreshes; a tiny fraction of a second) after resuming from suspend.  These effects appear in greater frequency at higher resolutions (over ten times more on 1280x1024 than on 1024x768), and they happen more often when a lot of 2D graphics action is going on (glxgears and video don't cause that much jumpiness, but GNOME's progress bar causes it like crazy).

The lockup as well as these two problems are always accompanied by one or more instances of this message in /var/log/Xorg.0.log:

(EE) intel(0): underrun on pipe A!

Before a lockup occurred, the (WW) lines below appeared in the log in a test:

(II) intel(0): Fixed memory allocation layout:
(II) intel(0): 0x00000000-0x0001ffff: ring buffer (128 kB)
-- snip --
(II) intel(0): 0x08000000:            end of aperture
(WW) intel(0): PRB0_HEAD (0x00000004) and PRB0_TAIL (0x00000000) indicate ring buffer not flushed
(WW) intel(0): Existing errors found in hardware state.

When the blank screen problem occurred, these (WW) lines appeared instead in a separate test:

(WW) intel(0): ESR is 0x00000010, page table error
(WW) intel(0): PGTBL_ER is 0x00000011
(WW) intel(0): Existing errors found in hardware state.

When the resume from suspend jumping effects happened, no (WW) lines after the "Fixed memory allocation layout:" occurred in yet another test.
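Since the symptoms above are told apart by which messages show up in Xorg.0.log, a small scanner can help when comparing many logs. The following is only a sketch: the patterns are copied from the excerpts in this comment, while the function name `scan_log` and the category names are made up for illustration.

```python
import re

# Patterns taken from the log excerpts quoted in this report.
MARKERS = {
    "pipe_underrun": re.compile(r"underrun on pipe \w"),
    "ring_not_flushed": re.compile(
        r"PRB0_HEAD .* and PRB0_TAIL .* indicate ring buffer not flushed"),
    "page_table_error": re.compile(r"ESR is 0x[0-9a-fA-F]+, page table error"),
    "ring_timeout": re.compile(r"Error in I830WaitLpRing\(\)"),
}

def scan_log(text):
    """Count how many lines of an Xorg log match each known marker."""
    counts = {name: 0 for name in MARKERS}
    for line in text.splitlines():
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

SAMPLE = """(EE) intel(0): underrun on pipe A!
(WW) intel(0): PRB0_HEAD (0x00000004) and PRB0_TAIL (0x00000000) indicate ring buffer not flushed
(WW) intel(0): Existing errors found in hardware state.
"""

print(scan_log(SAMPLE))
```

To use it on a real machine, pass the contents of /var/log/Xorg.0.log instead of `SAMPLE`.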

In my last occurrence of a lockup, the X cursor appeared and the mouse worked for a second or two before X crashed and brought Linux with it.  Before X's demise, the following appears in the log:

(EE) intel(0): underrun on pipe A!
(EE) intel(0): underrun on pipe A!
-- snip --
Error in I830WaitLpRing(), timeout for 2 seconds
pgetbl_ctl: 0x3ff60001 getbl_err: 0x00000000
ipeir: 0x00000000 iphdr: 0x54300004
LP ring tail: 0x000002c0 head: 0x000001e4 len: 0x0001f001 start 0x00000000
eir: 0x0000 esr: 0x0000 emr: 0xff7b
instdone: 0xffc1 instpm: 0x0000
memmode: 0x00000000 instps: 0x00000024
hwstam: 0xfffe ier: 0x0002 imr: 0x053c iir: 0x0080
Ring at virtual 0xaf89b000 head 0x1e4 tail 0x2c0 count 55
Ring at virtual 0xaf89b000 head 0x1e4 tail 0x2c0 count 55
-- supersnip --
Ring at virtual 0xaf89b000 head 0x1e4 tail 0x2c0 count 55
Ring end
space: 130844 wanted 131064
(II) intel(0): [drm] removed 1 reserved context for kernel
(II) intel(0): [drm] unmapping 8192 bytes of SAREA 0xf8aee000 at 0xb7b0c000
(II) intel(0): [drm] Closed DRM master.

Fatal server error:
lockup

(II) Macintosh mouse button emulation: Close
(II) UnloadModule: "evdev"
(II) AT Translated Set 2 keyboard: Close
(II) UnloadModule: "evdev"
(II) Logitech Trackball: Close
(II) UnloadModule: "evdev"
(II) AIGLX: Suspending AIGLX clients for VT switch

Backtrace:
0: /usr/X11R6/bin/X(xf86SigHandler+0x79) [0x80c3009]
1: [0xb803b400]
2: /usr/lib/xorg/modules/drivers//intel_drv.so [0xb7ab2b50]
3: /usr/X11R6/bin/X [0x80d6b0a]
4: /usr/lib/xorg/modules/extensions//libglx.so [0xb7b6cbe9]
5: /usr/X11R6/bin/X(AbortDDX+0x79) [0x80a8b09]
6: /usr/X11R6/bin/X(AbortServer+0x28) [0x813c498]
7: /usr/X11R6/bin/X(FatalError+0x63) [0x813caa3]
8: /usr/lib/xorg/modules/drivers//intel_drv.so(I830WaitLpRing+0x201) [0xb7aa71d1]
9: /usr/lib/xorg/modules/drivers//intel_drv.so(I830Sync+0x1c3) [0xb7aa75e3]
10: /usr/lib/xorg/modules/drivers//intel_drv.so [0xb7acf7ea]
11: /usr/lib/xorg/modules//libexa.so(exaWaitSync+0x65) [0xb79b0045]
12: /usr/lib/xorg/modules//libexa.so(ExaDoPrepareAccess+0x7e) [0xb79b123e]
13: /usr/lib/xorg/modules//libexa.so(ExaCheckPutImage+0x103) [0xb79b8e03]
14: /usr/lib/xorg/modules//libexa.so [0xb79b2585]
15: /usr/X11R6/bin/X [0x817948d]
16: /usr/X11R6/bin/X(ProcPutImage+0x15e) [0x808951e]
17: /usr/X11R6/bin/X(Dispatch+0x34f) [0x808c89f]
18: /usr/X11R6/bin/X(main+0x47d) [0x8071d1d]
19: /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7c44685]
20: /usr/X11R6/bin/X [0x8071101]
Saw signal 11.  Server aborting.
(II) AIGLX: Suspending AIGLX clients for VT switch

Complete logs are attached (see "Xorg.0.logs, dmesgs, and lsmod in working versus nonworking times"), and the DESCRIPTIONS file in the tarball explains each of the 9 logs.
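The "space: 130844 wanted 131064" line in the dump above can be reproduced from the logged head and tail values. The sketch below assumes free space is computed as head minus (tail + 8), wrapped by the ring size; that formula is an assumption on my part, chosen because it matches the logged numbers exactly, and `ring_space` is a name made up for this illustration.

```python
RING_SIZE = 128 * 1024  # "ring buffer (128 kB)" in the log above

def ring_space(head, tail, size=RING_SIZE):
    """Free bytes in the ring, assuming space = head - (tail + 8), wrapped."""
    space = head - (tail + 8)
    if space < 0:
        space += size
    return space

# Values from the timeout dump in this comment: head 0x1e4, tail 0x2c0.
print(ring_space(0x1e4, 0x2c0))  # 130844, matches "space: 130844"
print(RING_SIZE - 8)             # 131064, matches "wanted 131064"
```

Read this way, the server was waiting for the ring to drain completely (size minus 8 bytes free), but head stayed at 0x1e4 while tail was at 0x2c0, so the remaining 220 bytes of commands were never consumed and the 2-second timeout fired.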
Comment 17 Stefan Dirsch 2009-01-02 07:25:14 UTC
openSUSE 11.1/Novell SLE11 specific
-----------------------------------

  * The issue is that sax2, which runs during installation (and also on 
    LiveCD), does not use Xserver's "-br" option yet and the problem only 
    occurs when this option is not being set. This is the reason why 
    xdm/gdm/kdm, which now use "-br" option by default, work after 
    installation (after being forced to reboot). Still it's a driver bug.
Comment 18 MaLing 2009-01-06 22:59:52 UTC
(In reply to comment #17)
> openSUSE 11.1/Novell SLE11 specific
> -----------------------------------
>   * The issue is that sax2, which runs during installation (and also on 
>     LiveCD), does not use Xserver's "-br" option yet and the problem only 
>     occurs when this option is not being set. This is the reason why 
>     xdm/gdm/kdm, which now use "-br" option by default, work after 
>     installation (after being forced to reboot). Still it's a driver bug.

hi david,

Could you try comment 7?

Thanks
Ma Ling
Comment 19 MaLing 2009-01-13 18:27:01 UTC
ping  david  ~
Comment 20 david manyé 2009-01-15 02:00:53 UTC
(In reply to comment #19)
> ping  david  ~
> 

hello, 

sorry for the delay but there were health issues (which aren't solved yet).

the problem still isn't solved.

today i've booted all 30 computers in the lab. about 1/6 failed. of those that failed, some froze (kernel hang) and others had gdm failing and retrying to start X.

on one of the computers where gdm keeps retrying, i've installed the intel driver from debian experimental (v2.5.1), but gdm complains the same way. i've also installed the -dbg package to get more info...

i'll try to install mandriva to see what happens... also, i'll try with a 2.6.27 debian kernel compiled by me (i've tried with an unofficial version and it seems kernel and X don't like each other).
Comment 21 david manyé 2009-01-15 02:27:03 UTC
Created attachment 22006 [details]
xserver 1.5.3 + intel drv 2.5.1

these are the logs from X (1.5.3) and from gdm with intel driver 2.5.1
Comment 22 Wang Zhenyu 2009-01-19 19:54:29 UTC
Please run the 'intel_gtt' dumper under src/reg_dumper and provide us two logs: one with AGP disabled (the agp=off kernel parameter should work), and another from a normal boot with AGP enabled.

Comment 23 david manyé 2009-01-21 01:42:31 UTC
(In reply to comment #22)
> Please use 'intel_gtt' dumper under src/reg_dumper, provide us two logs; one is
> when AGP is disabled (agp=off kernel param should work), and another one is
> normal boot with AGP enabled.
> 

i've used intel driver v2.6.0 sources. 

# ./intel_gtt 
Unsupported chipset for gtt dumper
# dmesg | grep Chipset
[    7.410788] agpgart: Detected an Intel 830M Chipset.

Comment 24 MaLing 2009-02-03 06:54:14 UTC
hi david
Could you help us generate two log files with the ModeDebug option enabled, one for the comment #1 case and one for the comment #2 case, and attach them again?

Thanks
Ma Ling

Comment 25 david manyé 2009-02-26 03:04:48 UTC
Created attachment 23320 [details]
X starts successfully, mode debug on, rev 3 machine
Comment 26 david manyé 2009-02-26 03:06:19 UTC
Created attachment 23321 [details]
X fails to, mode debug on, rev 3 machine
Comment 27 david manyé 2009-02-26 03:11:51 UTC
(In reply to comment #24)
> hi david
> Could you help us to generate two log files with Modedebug option under comment
> #1 and #2 case respectively, then paste them again.
> 
> Thanks
> Ma Ling
> 

sorry for the long delay. i hope it's not too late ;-) 

i've sent the logs you ask and i want to add something i've found. about 85% of the machines have this:

00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE
Chipset Integrated Graphics Device (rev 01)

the rest of the machines are the same but "rev 3" instead of "rev 1". 

what i can see is that "rev 1" machines usually work and "rev 3" machines usually fail. in other words, the "rev 3" machines fail more often than the "rev 1" ones.

thanks.
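With about 30 machines to sort into "rev 1" and "rev 3" groups, the revision can be pulled out of the lspci line shown above. This is a minimal sketch: it assumes the revision appears as a trailing "(rev NN)" in hexadecimal, as in the quoted output, and `gpu_revision` is a made-up helper name.

```python
import re

# Matches the trailing "(rev NN)" field of an lspci line.
REV_RE = re.compile(r"\(rev ([0-9a-fA-F]+)\)\s*$")

def gpu_revision(lspci_line):
    """Return the revision as an int, or None if the line has no rev field."""
    match = REV_RE.search(lspci_line)
    return int(match.group(1), 16) if match else None

LINE = ("00:02.0 VGA compatible controller: Intel Corporation "
        "82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)")

print(gpu_revision(LINE))  # 1
```

Running this over the lspci output collected from each machine gives a quick way to check whether failures really cluster on the "rev 3" boards.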
Comment 28 Jesse Barnes 2009-05-11 11:21:29 UTC
Adjusting severity: crashes & hangs should be marked critical.
Comment 29 david manyé 2009-06-25 01:45:09 UTC
hello, 

i tested again with:
  kernel 2.6.30
  xorg 1.6.2 rc1
  intel driver 2.7.1

and all machines boot X correctly, though it is easy to crash the "rev 3" ones when going back from console to X or when gdm starts again after an X session log out.

for the next few months we plan not to use the "rev 3" machines, so this bug hopefully won't appear. if you want to close this bug, feel free to do so. if you want me to test something more, feel free to ask.

thanks.
Comment 30 Eric Anholt 2009-07-13 13:00:01 UTC
Does it also occur with KMS enabled?  switching to/from text mode has always been quite unreliable, and KMS should fix that.
Comment 31 Eric Anholt 2009-07-15 14:18:40 UTC
Also, intel_gpu_dump output with the hung X Server would help.
Comment 32 ykzhao 2009-07-31 06:18:52 UTC
Hi, David
     How about the test result?
 thanks.
Comment 33 david manyé 2009-08-03 01:51:16 UTC
(In reply to comment #32)
> Hi, David
>      How about the test result?
>  thanks.
> 

about kms: kms in the debian kernel is not active by default. sadly i don't have much time to recompile my own kernel and get it working ...

about intel_gpu_dump: sorry, i haven't found it in either the debian sid/unstable precompiled package or the debian sources.

i've tried intel_reg_dumper (don't know if it will help) and basically i get just two different dumps: one in console mode and another in graphic mode. there's no difference whether the X server is working or hung. i'll attach both outputs.

i don't have much time, and in a week i'll be on holiday for a month, and i expect to have a lot of urgent work when i come back, so i probably won't be able to find time to follow this bug. also, these failing machines are scheduled to be replaced soon. i'll ask to keep one of them to play with this bug...


thanks.


Comment 34 david manyé 2009-08-03 01:54:44 UTC
Created attachment 28286 [details]
output log for intel_reg_dumper in console mode
Comment 35 david manyé 2009-08-03 01:55:17 UTC
Created attachment 28287 [details]
output log for intel_reg_dumper in graphic mode
Comment 36 Wang Zhenyu 2009-08-06 00:43:55 UTC
Please test my three patches on bug #23082, which aim to fix the 845G problem in KMS with UXA.
Comment 37 Wang Zhenyu 2009-09-28 00:42:23 UTC
This should be fixed by Eric's
commit e517a5e97080bbe52857bd0d7df9b66602d53c4d
Author: Eric Anholt <eric@anholt.net>
Date:   Thu Sep 10 17:48:48 2009 -0700

    agp/intel: Fix the pre-9xx chipset flush.

    Ever since we enabled GEM, the pre-9xx chipsets (particularly 865) have had
    serious stability issues.  Back in May a wbinvd was added to the DRM to
    work around much of the problem.  Some failure remained -- easily visible
    by dragging a window around on an X -retro desktop, or by looking at
    bugzilla.

    The chipset flush was on the right track -- hitting the right amount of
    memory, and it appears to be the only way to flush on these chipsets, but
    the flush page was mapped uncached.  As a result, the writes trying to
    clear the writeback cache ended up bypassing the cache, and not flushing
    anything!  The wbinvd would flush out other writeback data and often cause
    the data we wanted to get flushed, but not always.  By removing the setting
    of the page to UC and instead just clflushing the data we write to try to
    flush it, we get the desired behavior with no wbinvd.

    This exports clflush_cache_range(), which was laying around and happened to
    basically match the code I was otherwise going to copy from the DRM.

    Signed-off-by: Eric Anholt <eric@anholt.net>
    Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
    Cc: stable@kernel.org

Please test with upstream kernel with KMS.
Comment 38 Wang Zhenyu 2009-10-19 23:47:13 UTC
Could you verify with a new Linus kernel? Otherwise this is just a timeout warning, and I'll mark it as fixed...
Comment 39 david manyé 2009-11-02 02:17:30 UTC
(In reply to comment #38)
> Could you verify with new linus kernel? otherwise this is just time out warning,
> and I'll mark it as fixed...
> 

until today i've had no time to do any tests. as i told you earlier, the computers have already been replaced and the few that were left cannot be used to test anything. so let this bug rest in peace.
Comment 40 Wang Zhenyu 2009-11-05 22:16:12 UTC
OK, marking this as fixed by Eric's patch in the kernel. Feel free to reopen if you get a chance to retest and the issue is still there.