24753 – [i915] Occasional X freezes / GPU lockups with driver version 2.9.0

Summary: [i915] Occasional X freezes / GPU lockups with driver version 2.9.0

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2009-10-27 04:49 UTC by Michal Suchanek
Modified:	2010-03-11 11:42 UTC
CC List:	8 users

See Also:
i915 platform:
i915 features:

Attachments
X log (24.90 KB, text/plain) 2009-10-27 04:50 UTC, Michal Suchanek
dmesg (99.36 KB, text/plain) 2009-10-27 04:50 UTC, Michal Suchanek
an archive of logs and GPU dumps collected from two lockups (669.66 KB, application/x-compressed-tar) 2009-10-27 04:55 UTC, Michal Suchanek
pre-lockup intel_gpu_dump (no compositing) (76.19 KB, application/octet-stream) 2009-10-29 00:56 UTC, Justus-bulk
post-lockup intel_gpu_dump (no compositing) (195.75 KB, application/octet-stream) 2009-10-29 00:57 UTC, Justus-bulk
GPU dump, Xorg.0.log, etc. (563.47 KB, application/x-compressed) 2009-11-04 12:21 UTC, Robert Huitl
Output of pstree -pA after the lockup (10.78 KB, text/plain) 2009-11-04 12:22 UTC, Robert Huitl
Output of pstree -pA for yet another lockup (9.99 KB, text/plain) 2009-11-06 07:51 UTC, Robert Huitl
GPU dump, dmesg output reporting hung task (620.55 KB, application/x-gzip) 2009-11-17 04:38 UTC, Robert Huitl
Corrupt rendering (143.83 KB, image/png) 2009-11-17 04:48 UTC, Robert Huitl
gpudump from wedged eeepc 900 (319.59 KB, application/gzip) 2009-11-18 09:58 UTC, Daniel Kahn Gillmor
gpu-dump, xorg backtrace, pstree output, etc. (483.57 KB, application/x-gzip) 2009-11-20 08:27 UTC, Robert Huitl
Two more intel_gpu_dumps (584.35 KB, application/x-gzip) 2009-11-30 11:05 UTC, Robert Huitl
Older corruption with driver 2.8.1 (30.64 KB, image/png) 2009-12-08 05:48 UTC, Robert Huitl
Ugly font rendering issue with the latest svn builds of libdrm and intel (2009-12-17) (87.72 KB, image/png) 2009-12-17 05:56 UTC, Christian Schafmeister

Description Michal Suchanek 2009-10-27 04:49:44 UTC

Logs from driver version 2.9.0 as requested in bug 23116

Comment 1 Michal Suchanek 2009-10-27 04:50:22 UTC

Created attachment 30735
X log

Comment 2 Michal Suchanek 2009-10-27 04:50:54 UTC

Created attachment 30736
dmesg

Comment 3 Michal Suchanek 2009-10-27 04:55:05 UTC

Created attachment 30737
an archive of logs and GPU dumps collected from two lockups

The bugzilla limit does not allow attaching the dumps in plain text so there you go.

Comment 4 Chris Wilson 2009-10-27 05:37:24 UTC

Hmm, in 6, gpu dump missed the interesting batch buffer:

Ringbuffer:
0x0000cf48:      0x18800080: MI_BATCH_BUFFER_START
0x0000cf4c:      0x0e53a001:    dword 1
0x0000cf50: HEAD 0x02000004: MI_FLUSH

ACTHD: 0x0e53d664

But the first batch buffer dumped is the batchbuffer at 0x02812000, which is the third in the ringbuffer queue.

However, both dumps contain instances like:

0x083f9b30:      0x7d000006: 3DSTATE_MAP_STATE
0x083f9b34:      0x00000003:    mask
0x083f9b38:      0x0b471000:    map 0 MS2
0x083f9b3c:      0x00000194:    map 0 MS3
0x083f9b40:      0x01e00000:    map 0 MS4
0x083f9b44:      0x0b736000:    map 1 MS2
0x083f9b48:      0x00000184:    map 1 MS3
0x083f9b4c:      0x01e00000:    map 1 MS4
...
0x083f9c18:      0x7f1c0011: 3DPRIMITIVE inline RECTLIST
0x083f9c1c:      0x4392f000:     V0.X = 293.875000
0x083f9c20:      0x4475f800:     V0.Y = 983.875000
0x083f9c24:      0x43930000:     V0.T0.X = 294.000000
0x083f9c28:      0x44760000:     V0.T0.Y = 984.000000
0x083f9c2c:      0x3f800000:     V0.T1.X = 1.000000
0x083f9c30:      0x3f800000:     V0.T1.Y = 1.000000
0x083f9c34:      0x43927000:     V1.X = 292.875000
0x083f9c38:      0x4475f800:     V1.Y = 983.875000
0x083f9c3c:      0x43928000:     V1.T0.X = 293.000000
0x083f9c40:      0x44760000:     V1.T0.Y = 984.000000
0x083f9c44:      0x00000000:     V1.T1.X = 0.000000
0x083f9c48:      0x3f800000:     V1.T1.Y = 1.000000
0x083f9c4c:      0x43927000:     V2.X = 292.875000
0x083f9c50:      0x4475b800:     V2.Y = 982.875000
0x083f9c54:      0x43928000:     V2.T0.X = 293.000000
0x083f9c58:      0x4475c000:     V2.T0.Y = 983.000000
0x083f9c5c:      0x00000000:     V2.T1.X = 0.000000
0x083f9c60:      0x00000000:     V2.T1.Y = 0.000000

which at first glance is a most surprising instruction sequence. Both textures are a single pixel high (and >1 pixel wide, so not a 1x1R solid), with the mask being scaled over the entire rectangle but with the src being apparently tiled.

The gpu dies shortly afterwards.

Given you are using 2.9.0, we should have addressed any possible access beyond the end of the texture -- but this still looks like the most suspicious set of instructions.

What applications were you running at the time? And in particular is it cairo based?
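
For anyone cross-checking the vertex data in the dump above: the float printed next to each dword can be reproduced by hand. The sketch below assumes the dwords are IEEE 754 single-precision values, which is consistent with the decoded numbers shown; `decode_f32` is a hypothetical helper written for this note, not part of intel-gpu-tools.

```sh
#!/bin/sh
# decode_f32: hypothetical helper, not part of intel-gpu-tools.
# Interprets a 32-bit hex dword as an IEEE 754 single-precision float.
# NaN/Inf (exponent 255) are not handled in this sketch.
decode_f32() {
    awk -v s="$(( ($1 >> 31) & 1 ))" \
        -v e="$(( ($1 >> 23) & 0xff ))" \
        -v m="$(( $1 & 0x7fffff ))" 'BEGIN {
            if (e == 0)
                v = (m / 8388608) * 2^(-126);         # zero or denormal
            else
                v = (1 + m / 8388608) * 2^(e - 127);  # normalised value
            if (s) v = -v;
            printf "%.6f\n", v;
        }'
}

decode_f32 0x4392f000   # V0.X in the dump
decode_f32 0x3f800000   # V0.T1.X in the dump
```

The two calls print 293.875000 and 1.000000, matching the values the dump shows for those dwords.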

Comment 5 Michal Suchanek 2009-10-27 05:58:01 UTC

Most likely Firefox and Thunderbird (mozilla.com build) and gkrellm which use GTK/pango/cairo.

rxvt-unicode also links with cairo for some reason which is somewhat unexpected.

Comment 6 Chris Wilson 2009-10-27 06:40:12 UTC

gkrellm only seems to be using pangocairo; everything else it does seems to be hitting core requests and so is unlikely to be the source of the strange instruction set. (At least from watching the default gkrellm install here.)

Similarly, rxvt-unicode is using the core drawing API in all its glory, and is again unlikely to be the culprit.

I'm not convinced that Thunderbird uses cairo for anything other than the Gtk+ integration (or at least system cairo), but it is at least generating a stream of XRender requests.

Firefox is a heavy user of XRender.

Michal, if you can narrow down the trigger that would be most useful. From the list you gave, the most likely candidates are the moz apps.

Comment 7 Michal Suchanek 2009-10-27 08:12:51 UTC

It is not exactly easy to check. The lockup usually takes a few days to reproduce and I am using the named applications pretty much all the time.

I have recently switched to KMS. The switch required me to restart the system occasionally for unrelated reasons and I had no lockups since the switch, possibly because I was not running the same X server long enough.

Can the lockup result from an application that has no mapped window?

If not then Firefox is definitely the culprit. 
I use it the most and iirc it was always visible when the X server locked up.

Comment 8 Chris Wilson 2009-10-27 08:25:20 UTC

Thanks Michal,
I don't think KMS should have much impact upon this bug. But as I haven't spotted a definitive cause of death, if you do experience any more lockups, please do grab a gpu dump and describe what was happening at the time of the hang - so that we can see if we can establish a pattern. (Rendering can occur to offscreen buffers, so just because an application isn't visible doesn't mean that it has stopped sending commands to the GPU (via X or GL).)

Comment 9 Justus-bulk 2009-10-29 00:54:51 UTC

This bug looks a lot like Bug #20560.

I have the same issue, same GPU, same driver. It first happened after upgrading from xserver-xorg-video-intel 2:2.3.2-2+lenny6 to 2:2.8.1-1.

See https://bugs.freedesktop.org/attachment.cgi?id=30243 for a pre-lockup GPU dump and https://bugs.freedesktop.org/attachment.cgi?id=30244 for a post-lockup GPU dump, with X compositing enabled.

I'll now attach two recent dumps, pre- and post-lockup (with an overnight suspend-to-RAM in-between), with compositing disabled.

It's difficult to tell what triggers this, but I believe that most of my GPU lockups occurred while okular, firefox or openoffice were rendering.

It seems that with compositing enabled, my GPU hangs very soon, between minutes and a few days, usually some hours after starting up, whereas with compositing disabled, the machine runs fine for at least a week.

Comment 10 Justus-bulk 2009-10-29 00:56:52 UTC

Created attachment 30787
pre-lockup intel_gpu_dump (no compositing)

Comment 11 Justus-bulk 2009-10-29 00:57:49 UTC

Created attachment 30788
post-lockup intel_gpu_dump (no compositing)

Comment 12 Chris Wilson 2009-10-29 01:28:41 UTC

(In reply to comment #9)
> This bug looks a lot like Bug #20560.

No it doesn't, I can't see the similarity between the gpu dump here and Thomas's (and there have been evidently quite a few substantial changes in the driver since that bug report).

> It seems that with compositing enabled, my GPU hangs very soon, between minutes
> and a few days, usually some hours after starting up, whereas with compositing
> disabled, the machine runs fine for at least a week.

Please try updating your software to the current stable releases (including the kernel). The gpu dump you attached contained at least one example of a bug that we know we have fixed, but does have a passing similarity to the one here. If the (non-composited) hang reoccurs after updating, please file a new bug report with the gpu dump, so that we can judge whether or not it is indeed a duplicate.

Comment 13 Michal Suchanek 2009-10-29 01:40:28 UTC

I would think that nx1 bitmaps are nothing unusual. They are often used for background in web pages.

Comment 14 Chris Wilson 2009-10-29 01:55:04 UTC

(In reply to comment #13)
> I would think that nx1 bitmaps are nothing unusual. They are often used for
> background in web pages.

Yeah having 1D bitmaps is not surprising, and maybe we should cater to those to save a few bytes in the command stream... But that *single* pixel being rendered that I highlighted from the dump, samples the entire 1x389 mask and the entire 1x405 source. Which is odd.

Comment 15 Daniel Kahn Gillmor 2009-10-30 10:57:47 UTC

I just had another lockup with this version myself, but of course i hadn't set up any way to generate a dump once the UI was locked so i don't have a dump to offer.

The only apps i had open during the last crash were:  icedove 2 (thunderbird variant) and iceweasel 3.5 (firefox variant), rxvt-unicode, emacs 22 (X11-based), and korganizer.

the lockups seem to be on the order of several days apart, and i basically never shut my machine down except for kernel upgrades -- it's always sleep/resume cycles and rotating between several variant external monitors.  I'll post a dump if i can get one.  I've modified my ACPI scripts to record a dump during a poweroff button event (i think).  if there are better/easier ways to do that, please point me toward them.

Comment 16 Robert Huitl 2009-11-04 12:20:38 UTC

Finally another lockup. I managed to get a GPU dump. Last thing I did was scroll in Konsole, with Kate in the background, Opera running but invisible. I also attached the output of pstree -pA so you can see what applications were running at the time of the crash.

Backtrace of X server, sorry, no debug symbols:
#0  0xffffe424 in __kernel_vsyscall ()
#1  0xb7463719 in ioctl () from /lib/libc.so.6
#2  0xb725dd28 in drm_intel_gem_bo_map_gtt () from /usr/lib/libdrm_intel.so.1
#3  0xb71ef991 in ?? () from /usr/lib/xorg/modules/drivers//intel_drv.so
#4  0x0b3cf078 in ?? ()
#5  0x00000000 in ?? ()

Comment 17 Robert Huitl 2009-11-04 12:21:53 UTC

Created attachment 30969
GPU dump, Xorg.0.log, etc.

Comment 18 Robert Huitl 2009-11-04 12:22:37 UTC

Created attachment 30970
Output of pstree -pA after the lockup

Comment 19 Robert Huitl 2009-11-04 12:24:35 UTC

By the way, system configuration:

- xorg-server-1.6.5
- mesa-7.5.2
- xf86-video-intel-2.9.0-r1
- libdrm-2.4.13
- Kernel 2.6.31.5, KMS enabled, no additional patches applied

Comment 20 Robert Huitl 2009-11-06 07:51:59 UTC

Created attachment 31014
Output of pstree -pA for yet another lockup

Comment 21 Robert Huitl 2009-11-06 07:57:28 UTC

I might have found a good way to reproduce the bug. But first:

Yesterday I had another lockup. I attached the output of pstree. At the time of the lockup, the preferences window of Eclipse was active on one screen, Konsole opened on the other.

Today another lockup. The system had been restarted only a few minutes ago, and I opened a large file (~2000k lines) in Eclipse and started scrolling by dragging the scrollbar handle around. I noticed how fast the window contents are updated, while the scrollbar handle seemed to lag behind - I guess they're using HW acceleration to update the text view. After a couple of seconds of scrolling, X locked up.

Comment 22 Chris Wilson 2009-11-06 12:27:13 UTC

That latest dump seems to share very little similarity with the initial dumps. In the batch that died, it does a 2D blit to the target surface then proceeds to manually tile a texture to the same surface, where it promptly dies in the middle of vertex fetch. Although the command stream is rather inefficient, it doesn't seem obviously broken.

However, using 2D and then 3D to the same surface without a flush may result in invalid rendering, but I don't recall the documentation warning that it may hang the chip. But doing so is frightfully easy, so...

Can you update to the latest bits from xf86-video-intel.git? Within that tree are currently a couple of debug options that I added to identify problems with using 2D + 3D without the appropriate flushes, namely:

  Section "Device"
    Option "DebugFlushCaches" "1"
  EndSection

I'd appreciate if you could try running with that option enabled for a while to see if the missing flushes might be causing more than just invalid rendering. Thanks.

Comment 23 Robert Huitl 2009-11-16 10:26:55 UTC

Chris, I'm running the driver with your debug flushes since Nov 6 and haven't had a lockup since then. Great :D

I do have those drawing glitches you referred to, usually there are empty lines where text should have been rendered (e.g. in konsole). So this might very well be a cache-related problem...

Comment 24 Chris Wilson 2009-11-16 10:34:22 UTC

(In reply to comment #23)
> Chris, I'm running the driver with your debug flushes since Nov 6 and haven't
> had a lockup since then. Great :D

Thanks for the update! Scary that the missing flushes should affect system stability...

> I do have those drawing glitches you referred to, usually there are empty lines
> where text should have been rendered (e.g. in konsole). So this might very well
> be a cache-related problem...

Just areas of missing text or corrupt rendering? The forced flushing should have eliminated all areas of corrupt rendering, the missing text is then likely to be missing instructions -- as bizarre as that sounds.

Comment 25 Robert Huitl 2009-11-17 04:38:00 UTC

Created attachment 31254
GPU dump, dmesg output reporting hung task

Comment 26 Robert Huitl 2009-11-17 04:45:44 UTC

Cheered too soon, I had a lockup yesterday. But I'm not sure if it is the same problem, at least the Xorg backtrace is different:

#0  0xffffe424 in __kernel_vsyscall ()
#1  0xb7384719 in ioctl () from /lib/libc.so.6
#2  0xb715946d in drmIoctl () from /usr/lib/libdrm.so.2
#3  0xb7159882 in drmCommandNone () from /usr/lib/libdrm.so.2
#4  0xb711c02f in ?? () from /usr/lib/xorg/modules/drivers//intel_drv.so
#5  0x00000008 in ?? ()
#6  0x00000018 in ?? ()
#7  0xbfc2af58 in ?? ()
#8  0x08225280 in ?? ()
#9  0x00000000 in ?? ()

Particularly interesting is dmesg:

[420840.472049] INFO: task i915/0:2350 blocked for more than 120 seconds.
(see attachment)

I think I was working with Opera at the time of the lockup.

While writing this I had one of the more severe corruptions, not just missing text. I will attach a screen shot. I could scroll up and down in less and the corruption persisted, until I forced a redraw (e.g. by scrolling the line out of view or selecting it).

Comment 27 Robert Huitl 2009-11-17 04:48:36 UTC

Created attachment 31255
Corrupt rendering

Comment 28 Chris Wilson 2009-11-17 05:51:49 UTC

(In reply to comment #26)
> Cheered too soon, I had a lockup yesterday. But I'm not sure if it is the same
> problem, at least the Xorg backtrace is different:

Hmm, that backtrace is a bit of a mystery. It appears that you are running an old kernel with a recent xlib driver. I can see the MI_FLUSH between each operation, but you had a ringbuffer overflow which has been fixed in the kernel.

The peculiarity about this dump is that the current ACTHD is beyond the end of the batchbuffer - which suggests a CPU cache flushing error, of which several were fixed after the ringbuffer overflow...

In short I suspect that this occurrence of the hang is due to older bugs.
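
The "ACTHD beyond the end of the batchbuffer" check is plain address arithmetic and can be sketched as follows. The numbers for this particular dump are not quoted in the thread, so the example call uses the ringbuffer excerpt from comment 4 instead. Two things are assumptions here: that the low two bits of the MI_BATCH_BUFFER_START address dword are flags rather than address bits, and that the batch size is known from elsewhere (the dumps do not print it). Both helper names are hypothetical.

```sh
#!/bin/sh
# acthd_offset START ACTHD: hypothetical helper. Prints ACTHD's byte offset
# from the start of a batch buffer. The two low bits of the
# MI_BATCH_BUFFER_START address dword are assumed to be flags and are
# masked off.
acthd_offset() {
    echo $(( $2 - ($1 & ~3) ))
}

# acthd_in_batch START SIZE ACTHD: succeeds if ACTHD lies inside the batch.
# SIZE has to come from elsewhere; the dumps in this report do not include it.
acthd_in_batch() {
    off=$(acthd_offset "$1" "$3")
    [ "$off" -ge 0 ] && [ "$off" -lt "$(( $2 ))" ]
}

acthd_offset 0x0e53a001 0x0e53d664   # values from the excerpt in comment 4
```

For the comment 4 values this prints 13924 (0x3664), i.e. the GPU had advanced about 13.6 KiB into that batch when it stopped.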

Comment 29 Robert Huitl 2009-11-17 06:02:51 UTC

I'm running 2.6.31.5, do you know what version contains the fix?

Comment 30 Daniel Kahn Gillmor 2009-11-18 09:56:53 UTC

I just had a graphics lockup with debian's 2.6.30-2-686 kernel, with:

xserver-xorg-video-intel 2:2.9.0-1
libdrm-intel1 2.4.14-1+b1
xserver-xorg-core 2:1.6.5-1
intel-gpu-tools 1.0.1-1

since the graphics were locked up, and i don't run an externally-available ssh daemon or serial console on this machine, i had to get a gpudump by triggering it with an ACPI event.  how are other people doing this?

my hardware is an Asus eeepc 900, and the relevant PCI devices are reported as:

00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 04)
00:02.1 Display controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 04)

relevant lines from the kernel log include (across the hang and the reboot):

0 dkg@pip:~$ grep 915 /var/log/kern.log
Nov 17 17:46:39 pip kernel: [712250.402182] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0
Nov 17 17:49:10 pip kernel: [712351.373849] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0
Nov 17 23:15:41 pip kernel: [712430.091569] usb 3-1: configuration #1 chosen from 1 choice
Nov 17 23:53:22 pip kernel: [714691.119911] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0
Nov 18 10:15:07 pip kernel: [714699.869154] pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900007, writing 0x900003)
Nov 18 11:18:23 pip kernel: [718496.804027] [drm:i915_gem_idle] *ERROR* hardware wedged
Nov 18 11:24:21 pip kernel: [    1.229157] NET: Registered protocol family 10
Nov 18 11:24:21 pip kernel: [    1.756059] agpgart-intel 0000:00:00.0: Intel 915GM Chipset
Nov 18 11:25:20 pip kernel: [   84.173230] [drm:i915_gem_detect_bit_6_swizzle] *ERROR* Couldn't read from MCHBAR.  Disabling tiling.
Nov 18 11:25:20 pip kernel: [   84.173271] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
Nov 18 11:25:37 pip kernel: [  101.403890] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0
0 dkg@pip:~$ 

In the X session, i was running icedove 2.0.0.22, iceweasel 3.5.4, emacs22 (X11-enabled), korganizer, openbox, and had nm-applet and the korganizer alert applet active in my dock.  The wedge happened as i was trying to view a new message in icedove, an action which usually does not cause a lockup.

viewing the same message in icedove after a restart did not cause the same "hardware wedged" error.

i'll attach the gpudump shortly.  Any other information which would be useful?  What actions should i add to my ACPI hook for use during future hangs?
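
A side note on the `grep 915` used above: it also picked up the usb and NET lines, apparently only because their timestamps happen to contain the digits 915. A tighter filter might look like the sketch below. The pattern list is an assumption based on the message prefixes visible in this log, and `i915_log_lines` is a hypothetical helper name.

```sh
#!/bin/sh
# i915_log_lines: hypothetical helper. Reads kernel log text on stdin and
# keeps only lines carrying a drm, i915 or agpgart-intel marker, instead of
# every line that happens to contain the digits "915".
i915_log_lines() {
    grep -E '\[drm|i915|agpgart-intel'
}

# Example use (log path as in the comment above):
#   i915_log_lines < /var/log/kern.log
```

Lines about the PCI device itself (such as the "restoring config space" message) are dropped by this pattern; adding the device address to the alternation would keep them.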

Comment 31 Daniel Kahn Gillmor 2009-11-18 09:58:24 UTC

Created attachment 31295
gpudump from wedged eeepc 900

here is the gpudump gathered during the hang.

Comment 32 Chris Wilson 2009-11-18 10:43:19 UTC

(In reply to comment #29)
> I'm running 2.6.31.5, do you know what version contains the fix?

* scratches head.

The commit I'm expecting to have fixed the wrap-around as shown in your last dump was

commit 0ef82af7253c1929a3995f271b8b0db462d1a0c3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Sep 5 18:07:06 2009 +0100

    drm/i915: Pad ringbuffer with NOOPs before wrapping
    
which was in 2.6.31. The kernel version wasn't reported in the dmesg -- is there a chance that you were running an earlier kernel for that run?

Comment 33 Chris Wilson 2009-11-18 10:55:21 UTC

(In reply to comment #30)

Daniel, you are running an old combination of kernel and drivers - the dmesg output alone has warnings that have been fixed and should give improved performance.

Though I could not see an apparent cause (not without verifying the stride and length of the buffers, information that is not included in the dump), the dump was very interesting. Whatever application you were using likes to draw lots of vertical lines a pixel at a time -- and manages to circumvent the logic designed to amalgamate such requests.

Daniel, please in future file new bug reports unless you have an absolutely identical gpu dump to an existing bug report. There are many, many ways in which we can hang the GPU, so apparently similar bugs can have completely different causes. Therefore we assume each hang is a different bug (and so demands a separate bug report) until we can prove otherwise.

Comment 34 Daniel Kahn Gillmor 2009-11-18 11:10:45 UTC

Chris, thanks for the suggestions.  i thought that since this bug report mentions version 2.9.0, and i'm running 2.9.0, that this was the relevant place to put things.  I will file a separate report in the future.

Do you need me to be running 2.9.1 to accept a bug report?  Do you need me to be running the 2.6.31 kernel?

Any advice for other things i should try to capture during a future hang?  i'd like my ACPI hook to capture the info you'd want to see.

Comment 35 Chris Wilson 2009-11-18 11:31:16 UTC

(In reply to comment #34)
> Chris, thanks for the suggestions.  i thought that since this bug report
> mentions version 2.9.0, and i'm running 2.9.0, that this was the relevant place
> to put things.  I will file a separate report in the future.

Thanks Daniel.

> Do you need me to be running 2.9.1 to accept a bug report?  Do you need me to
> be running the 2.6.31 kernel?

Not need per se, it just helps to reduce the number of duplicates and known bugs that I need to check through.

> Any advice for other things i should try to capture during a future hang?  i'd
> like my ACPI hook to capture the info you'd want to see.

Along with the gpu dump and dmesg, tar up /sys/kernel/debug/dri/0 and /var/log/Xorg.log.

Comment 36 Robert Huitl 2009-11-19 07:15:06 UTC

> commit 0ef82af7253c1929a3995f271b8b0db462d1a0c3
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Sat Sep 5 18:07:06 2009 +0100
> 
>     drm/i915: Pad ringbuffer with NOOPs before wrapping
> 
> which was in 2.6.31. The kernel version wasn't reported in the dmesg -- is
> there a chance that you were running an earlier kernel for that run?

I verified that I was running 2.6.31.5. I built it on Oct 23 and it's been used since then. However, I found that your patch is NOT in 2.6.31.5. I think it's queued for 2.6.32, so I applied it manually.
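
One way to check whether a given commit is part of a release is to ask git about ancestry, as in the sketch below. It assumes a kernel git checkout that carries the relevant release tag (for example a clone of the stable tree); `commit_in_release` is a hypothetical helper name. A fix that was cherry-picked into a stable branch gets a new commit id, so a negative answer here is a hint rather than proof.

```sh
#!/bin/sh
# commit_in_release COMMIT REF: hypothetical helper. Run inside a kernel git
# checkout. Succeeds if COMMIT is an ancestor of REF (a tag or branch).
# Cherry-picked backports have different ids and are NOT detected.
commit_in_release() {
    git merge-base --is-ancestor "$1" "$2"
}

# Example (needs a clone that contains both the commit and the tag):
#   commit_in_release 0ef82af7253c1929a3995f271b8b0db462d1a0c3 v2.6.31.5 \
#       && echo "included" || echo "not an ancestor"
```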


@Daniel: You asked how to acquire debug information. I always SSH into the machine when it locks up. I use this script:

#!/bin/sh
# Collect i915 debug state after a lockup; run as root (e.g. over SSH).
# Mount debugfs only if it is not mounted already.
grep -q ' /sys/kernel/debug debugfs ' /proc/mounts ||
    mount -t debugfs debugfs /sys/kernel/debug
datestr=$(date +%Y%m%d)
out=dri_debug-$datestr
mkdir -p "$out/proc_dri_0"
# GEM and i915 state from debugfs and /proc
cp -r /sys/kernel/debug/dri/0/i915* "$out"
cp /proc/dri/0/gem* "$out/proc_dri_0"
# Decoded ring and batch buffers, kernel log, X and display manager logs
intel_gpu_dump > "$out/intel_gpu_dump.txt"
dmesg > "$out/dmesg.txt"
cp /var/log/Xorg.0.log "$out/"
cp /var/log/kdm.log "$out/kdm.log"
tar czf "$out.tgz" "$out/"

Should be fairly easy to call it from your ACPI hook.
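
A minimal sketch of such a hook, for machines without SSH access during a hang. It assumes acpid is in use and passes the event string (something like "button/power PBTN 00000080 00000000"; the exact text varies by machine) as arguments to the action script. The install path in COLLECT and the function names are invented for illustration.

```sh
#!/bin/sh
# Hypothetical acpid action script. COLLECT is an invented install path for
# the collection script shown above.
COLLECT=${COLLECT:-/usr/local/sbin/dri-debug-collect.sh}

# Succeeds if the event string names the power button.
is_power_button() {
    case "$1" in
        button/power*) return 0 ;;
        *)             return 1 ;;
    esac
}

handle_event() {
    if is_power_button "$1" && [ -x "$COLLECT" ]; then
        # Run in the background so a stuck tool cannot block acpid.
        "$COLLECT" >/dev/null 2>&1 &
        echo "collecting"
    else
        echo "ignored"
    fi
}

# A real hook would end with:  handle_event "$1"
```

Note that the power button usually also triggers a shutdown, so the collection has to finish (or the default handler be disabled) before the machine goes down.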

Comment 37 Daniel Kahn Gillmor 2009-11-19 08:35:06 UTC

Thanks, Robert.  your script grabbed a few things i hadn't (like /proc/dri and the display manager log), which i added to the script called by my ACPI hook.  very useful.  hopefully i won't have any more lockups, but if i do, i've got good data collection set up now.

Comment 38 Robert Huitl 2009-11-20 08:26:20 UTC

And once again, another lockup. Last thing I did was delete a mail in Kontact/Kmail, the last rendering probably took place because I accepted the "Really delete mail?" dialog.

This was with 0ef82af7253c1929a3995f271b8b0db462d1a0c3 applied. I didn't rebuild the kernel binary, just the modules, but rebooted to make sure the new i915.ko is used.

There are some startling backtraces in dmesg, but they were about 1 hour old and related to the PWC module (which I think is somewhat broken at the moment).

Comment 39 Robert Huitl 2009-11-20 08:27:13 UTC

Created attachment 31346
gpu-dump, xorg backtrace, pstree output, etc.

Comment 40 Chris Wilson 2009-11-20 08:48:07 UTC

(In reply to comment #39)
> Created an attachment (id=31346)

That batch buffer looks promising. There really is nothing that could go wrong except if we screwed up the apparently simple blit to the final buffer - which is equally odd. The buffer exists in the active list, but we don't print out the size so I can't verify that the buffer is large enough for the request. Annoying.

Time to think what information I need to add to debugfs.

Comment 41 Robert Huitl 2009-11-30 11:05:38 UTC

Created attachment 31608
Two more intel_gpu_dumps

Hi Chris, I had two more lockups. I attached the output of intel_gpu_dump only, but I have all the other files as well if you need them.

Comment 42 Chris Wilson 2009-11-30 11:28:18 UTC

(In reply to comment #41)
> Created an attachment (id=31608)
> Two more intel_gpu_dumps
> 
> Hi Chris, I had two more lockups. I attached the output of intel_gpu_dump only,
> but I have all the other files as well if you need them.

Hmm, another couple of weird dumps. The first reports an ACTHD location not among the listed batchbuffers, and the second dies in the middle of a perfectly sane looking series of glyphs.

Comment 43 Robert Huitl 2009-12-07 07:33:03 UTC

Chris, I don't understand the internal workings of the GPU, but as I observe a lot of missing and/or invalid renderings, I suspect that the corresponding data might not be correctly flushed to the GPU. Could this also be the case with the command stream, in a way such that the GPU sees different/partial commands (but intel_gpu_dump sees the correct data from the CPU cache)?

BTW, I updated to 2.6.32. No lockups so far, but an X crash instead.

[110636.596069] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[110636.596082] render error detected, EIR: 0x00000000
[110636.596089] i915: Waking up sleeping processes
[110636.596107] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 40033003 at 40033002)
[110636.596788] reboot required
[110636.927496] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

Is this just the 2.6.32 way of dealing with the freeze, or a totally different bug?

Thanks,
Robert

Comment 44 Chris Wilson 2009-12-07 14:54:57 UTC

(In reply to comment #43)
> Chris, I don't understand the internal workings of the GPU, but as I observe a
> lot of missing and/or invalid renderings,

Richard, just need to check, I've been tinkering with cache flushes in xf86-video-intel recently -- with which versions of the driver have you been seeing this style of corruption? If it is older than my tinkering, then yes this could be a very significant observation.

> I suspect that the corresponding data
> might not be correctly flushed to the GPU. Could this also be the case with the
> command stream, in a way such that the GPU sees different/partial commands (but
> intel_gpu_dump sees the correct data from the CPU cache)?

Yes, that style of cache coherency issue is what I fear may be the root cause of this bug.
 
> BTW, I updated to 2.6.32. No lockups so far, but an X crash instead.

If you are updating, make sure you do pull the bleeding edge from libdrm and xf86-video-intel [I've been tinkering... ;-]

> [110636.596069] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed...
> GPU hung
> [110636.596082] render error detected, EIR: 0x00000000
> [110636.596089] i915: Waking up sleeping processes
> [110636.596107] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5
> (awaiting 40033003 at 40033002)
> [110636.596788] reboot required
> [110636.927496] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
> 
> Is this just the 2.6.32 way of dealing with the freeze, or a totally different
> bug?

Same bug, the manner of the death has just changed slightly. Newer chipsets have gained the ability to automatically recover from gpu hangs, alas not the venerable 8xx though. Did X really crash, or appear to hang? With an up-to-date libdrm.git and xf86-video-intel.git, X should not be crashing -- just no longer updating the screen and probably flooding the logs.
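
Since a hung GPU may now show up only as a frozen screen plus log messages, a log check can tell the two cases apart. The sketch below matches only the hang messages quoted in this report; other kernel versions may word them differently, and `gpu_wedged` is a hypothetical helper name.

```sh
#!/bin/sh
# gpu_wedged: hypothetical helper. Reads kernel log text on stdin and
# succeeds if it contains one of the hang messages quoted in this report.
gpu_wedged() {
    grep -q -E 'hardware wedged|Hangcheck timer elapsed|Execbuf while wedged'
}

# Example use:
#   dmesg | gpu_wedged && echo "GPU hang reported by the kernel"
```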

Comment 45 Robert Huitl 2009-12-08 05:47:56 UTC

(In reply to comment #44)
> Richard, just need to check, I've been tinkering with cache flushes in
> xf86-video-intel recently -- with which versions of the driver have you been
> seeing this style of corruption? If it is older than my tinkering, then yes
> this could be a very significant observation.

I know I had a corruption on Oct 21 with driver 2.8.1. It looked a bit different than current corruptions, though, which are usually yellow, while the old one was monochrome only ;-)


> > BTW, I updated to 2.6.32. No lockups so far, but an X crash instead.
> 
> If you are updating, make sure you do pull the bleeding edge from libdrm and
> xf86-video-intel [I've been tinkering... ;-]

I updated libdrm to b84314a86ea4ad30e0f57a71b4ef0fa138fb24c6 and xf86-video-intel to c1afc831c8fe4cbececee7dfa23506a6746c2425. Very unstable, I had lockups within minutes:

[  291.044333] [drm:i915_gem_object_pin] *ERROR* Failure to install fence: -28
[  291.105608] ------------[ cut here ]------------
[  291.105616] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:2122!
[  291.105620] invalid opcode: 0000 [#1] PREEMPT
[  291.105624] last sysfs file: /sys/class/power_supply/BAT0/energy_full
[  291.105627] Modules linked in: ext3 jbd mbcache joydev hdaps tp_smapi thinkpad_ec sco bnep rfcomm l2cap crc16 bluetooth lib80211_crypt_tkip iptable_mangle iptable_filter ip_tables x_tables snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device dm_crypt dm_mod fuse cpufreq_ondemand cpufreq_powersave i2c_dev fan acpi_cpufreq i915 8250_pci 8250 serial_core snd_intel8x0 drm_kms_helper snd_intel8x0m drm snd_ac97_codec thinkpad_acpi i2c_algo_bit rfkill ac97_bus video ipw2200 snd_pcm sdhci_pci usbhid snd_timer yenta_socket i2c_i801 thermal tg3 libipw sdhci mmc_core led_class backlight rsrc_nonstatic pcmcia_core battery processor sg snd i2c_core ac output lib80211 nvram libphy thermal_sys hwmon uhci_hcd snd_page_alloc pcspkr button ehci_hcd evdev
[  291.105698]
[  291.105703] Pid: 12034, comm: X Not tainted (2.6.32 #1) 25256NG
[  291.105707] EIP: 0060:[<f9c74b92>] EFLAGS: 00213246 CPU: 0
[  291.105722] EIP is at i915_gem_evict_everything+0xfe/0x109 [i915]
[  291.105726] EAX: f6206000 EBX: 00000000 ECX: f4c47bc8 EDX: f6689e08
[  291.105730] ESI: 00000000 EDI: f6bed800 EBP: f6689e08 ESP: f6207df8
[  291.105733]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[  291.105738] Process X (pid: 12034, ti=f6206000 task=f6856070 task.ti=f6206000)
[  291.105741] Stack:
[  291.105742]  0000000c 0000000c ffffffe4 f40689c0 f9c75e66 f6207e54 c1060f27 001029be
[  291.105750] <0> 001028bf 001029bf 00102abe f6123640 f6207e9c f6bed800 f6689000 f6bed810
[  291.105757] <0> 00000000 0000000c 00000000 00000000 f4fb7c00 f4069e00 00102abe f4ff8800
[  291.105765] Call Trace:
[  291.105779]  [<f9c75e66>] ? i915_gem_execbuffer+0x791/0xe64 [i915]
[  291.105789]  [<c1060f27>] ? unmap_mapping_range_vma+0x3b/0x95
[  291.105804]  [<f9692440>] ? drm_ioctl+0x1bb/0x239 [drm]
[  291.105817]  [<f9c756d5>] ? i915_gem_execbuffer+0x0/0xe64 [i915]
[  291.105831]  [<f9c76884>] ? i915_gem_fault+0xe5/0x114 [i915]
[  291.105836]  [<c105fe11>] ? __do_fault+0x44/0x32d
[  291.105843]  [<c107b000>] ? vfs_ioctl+0x49/0x5f
[  291.105847]  [<c107b54f>] ? do_vfs_ioctl+0x47f/0x4bb
[  291.105855]  [<c101535e>] ? do_page_fault+0x26b/0x281
[  291.105860]  [<c107b5b7>] ? sys_ioctl+0x2c/0x42
[  291.105865]  [<c1002834>] ? sysenter_do_call+0x12/0x26
[  291.105868] Code: 00 00 39 83 f8 0d 00 00 0f 94 c0 0f b6 d8 eb 02 31 db 89 e0 25 00 e0 ff ff ff 48 14 f6 40 08 08 74 05 e8 91 6d 5e c7 84 db 75 04 <0f> 0b eb fe 5b 89 f0 5e 5f 5d c3 57 56 53 89 c3 8b 78 08 8b 70
[  291.105906] EIP: [<f9c74b92>] i915_gem_evict_everything+0xfe/0x109 [i915] SS:ESP 0068:f6207df8
[  291.105922] ---[ end trace ef6205f4d33f2798 ]---

I reverted xf86-video-intel, but kept the git libdrm. Works for now.
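
For anyone decoding the oops above: the -28 in the "Failure to install fence" line is a negated errno value, and the "EIP is at" line names the faulting function as symbol+offset/size. Below is a minimal Python sketch (not part of the original report) that pulls both out of two lines copied from the log. The helper names decode_kernel_errno and parse_fault_site are made up for illustration.

```python
import errno
import os
import re

# Two lines copied from the dmesg excerpt above.
LOG = """\
[  291.044333] [drm:i915_gem_object_pin] *ERROR* Failure to install fence: -28
[  291.105722] EIP is at i915_gem_evict_everything+0xfe/0x109 [i915]
"""


def decode_kernel_errno(value):
    """Map a negative kernel return value such as -28 to its errno name."""
    return errno.errorcode.get(abs(value), "UNKNOWN")


def parse_fault_site(log):
    """Return (symbol, offset, size, module) from an 'EIP is at' line, or None."""
    m = re.search(
        r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?", log
    )
    if m is None:
        return None
    symbol, offset, size, module = m.groups()
    return symbol, int(offset, 16), int(size, 16), module


# Extract the numeric error from the fence failure message.
fence_error = int(re.search(r"Failure to install fence: (-?\d+)", LOG).group(1))
error_name = decode_kernel_errno(fence_error)
fault_site = parse_fault_site(LOG)

print(fence_error, error_name, os.strerror(abs(fence_error)))
print(fault_site)
```

On Linux, errno 28 is ENOSPC, so the fence failure means the driver ran out of fence registers just before the BUG in i915_gem_evict_everything.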


> Same bug, the manner of the death has just changed slightly. Newer chipsets
> have gained the ability to automatically recover from gpu hangs, alas not the
> venerable 8xx though. Did X really crash, or appear to hang?

Kind of both. It crashed: the screen went black and the hard drive started working, as it always does when X dies and X applications are terminated. X was then restarted, but the login manager did not appear. Running "/etc/init.d/kdm restart" did restart X, but there was still no KDM; the screen just stayed black.

Comment 46 Robert Huitl 2009-12-08 05:48:26 UTC

Created attachment 31833 [details]
Older corruption with driver 2.8.1

Comment 47 Michal Suchanek 2009-12-08 06:49:03 UTC

I could not reproduce lockups with 

linux 2.6.32-rc3 KMS 
video-intel 2.9.0 
libdrm 2.4.14 
mesa 7.6
X server 1.6.5

I still got occasional corruption in urxvt, though.

I think that the only difference from the locking-up setup is the kernel upgrade and switch to KMS.

Now I have upgraded kernel to 2.6.32 with a patch which allows running KMS without frobbing the card with non-KMS X server.

Comment 48 Antti Mäkelä 2009-12-10 01:23:59 UTC

(In reply to comment #47)
> I could not reproduce lockups with 
> 
> linux 2.6.32-rc3 KMS 
> video-intel 2.9.0 
> libdrm 2.4.14 
> mesa 7.6
> X server 1.6.5
> 
> I still got occasional corruption in urxvt, though.
> 
> I think that the only difference from the locking-up setup is the kernel
> upgrade and switch to KMS.
> 
> Now I have upgraded kernel to 2.6.32 with a patch which allows running KMS
> without frobbing the card with non-KMS X server.
> 

  After upgrading to kernel 2.6.32, I still get lockups a few hours after resuming from suspend, but with this kernel I can at least switch to a text console with Ctrl+Alt+F1 and reboot the computer gracefully.

  I'm running xorg-server 1.6.4, mesa 7.5, video-intel 2.9.1, user mode setting (UMS, not KMS), libdrm 2.4.13. I'm wondering if upgrading to your versions would do the trick.

Comment 49 Christian Schafmeister 2009-12-10 02:21:01 UTC

I made some tests today with the following configuration:

distribution: gentoo
kernel 2.6.32 (KMS enabled)
mesa 7.5.1
libdrm svn (2009-12-10) 
intel driver svn (2009-12-10) (using uxa)
xorg server 1.6.5 (I deleted xorg.conf ... so everything gets set up dynamically)

After one suspend/resume cycle, loading any page with heavy JavaScript is enough to test whether my system freezes, and sadly it still freezes after about 5 minutes. The good news: now only the X server freezes, and I can switch to a console to reboot or kill the X server. When the freeze occurs, dmesg is flooded with this message:

ERROR* Execbuf while wedged
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
...


My Xorg log is flooded with this one:

(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
...
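
To tell quickly whether a given log shows this "wedged" state, one could count the messages quoted in this comment. Below is a minimal Python sketch under that assumption. The signature names and the count_signatures helper are hypothetical; the message fragments are copied from the excerpts above.

```python
from collections import Counter

# Message fragments taken verbatim from the dmesg and Xorg log excerpts above.
# The dictionary keys are illustrative labels, not driver terminology.
SIGNATURES = {
    "execbuf_while_wedged": "Execbuf while wedged",
    "batch_submit_failed": "Failed to submit batch buffer",
    "gtt_map_failed": "gtt bo map failed",
}


def count_signatures(text):
    """Count how many lines of a log contain each known failure fragment."""
    counts = Counter()
    for line in text.splitlines():
        for name, fragment in SIGNATURES.items():
            if fragment in line:
                counts[name] += 1
    return counts


SAMPLE = """\
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
"""

counts = count_signatures(SAMPLE)
print(dict(counts))
```

In practice the text would come from the saved output of dmesg or from Xorg.0.log rather than from an inline sample.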


Strangely, a hibernate/sleep combination seems to work. I configured TuxOnIce to enter suspend-to-RAM after saving memory to swap, rather than powering off completely:

## Powerdown method - 3 for suspend-to-RAM, 4 for ACPI S4 sleep, 5 for poweroff
PowerdownMethod 3

So saving the memory to swap seems to make the difference.

I hope this helps solve the bug.

greetings
Christian

Comment 50 Antti Mäkelä 2009-12-10 02:29:39 UTC

(In reply to comment #49)
> I made some tests today with the following configuration:
> ERROR* Execbuf while wedged
> [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

  I'm getting the exact same symptoms.

Comment 51 Michal Suchanek 2009-12-10 05:34:34 UTC

swsusp does not work for me for some reason: on resume the image fails to load, so I use STR (suspend-to-RAM) only:

sg_start --stop --pc=3 /dev/disk/by-id/ieee1394*
echo mem > /sys/power/state

Comment 52 Christian Schafmeister 2009-12-11 03:03:50 UTC

(In reply to comment #47)
> I could not reproduce lockups with 
> 
> linux 2.6.32-rc3 KMS 
> video-intel 2.9.0 
> libdrm 2.4.14 
> mesa 7.6
> X server 1.6.5

I ran some tests with exactly these versions and I still get lockups after about 5 minutes of heavy JavaScript usage.

The driver versions behave differently:

intel drivers < 2.9.1:
Random screen corruption (which can be cleared by switching to vt1 and back). After some minutes I just get a black screen and can no longer switch VTs. I cannot even reach the notebook remotely via ssh, so I think the whole system freezes in this case.

intel drivers >= 2.9.1:
No more screen corruption, but the black screen still occurs. In this case the system does not freeze: I can switch to vt1 and kill X, but a restarted X server is unusable. The driver seems to stop updating the screen; it looks like a screenshot of my login manager, and the only thing that changes is the mouse cursor. So I have to reboot to get a working X server again.

I had some font rendering issues with the svn version of the driver some days ago, but these seem to be fixed in the latest svn build, which sadly still doesn't suspend/resume correctly.

Comment 53 Christian Schafmeister 2009-12-17 05:55:25 UTC

I'm currently testing the latest svn version of libdrm and the intel driver. My system has been running for 2 hours now since the last suspend/resume. It seems stable, but I get some ugly font issues again. I'll attach a screenshot. The font issues disappear when I select the text. They mainly appear in console windows and on web pages.

I hope this helps.

Greetings
Christian

Comment 54 Christian Schafmeister 2009-12-17 05:56:24 UTC

Created attachment 32145 [details]
Ugly font rendering issue with the latest svn builds of libdrm and intel (2009-12-17)

Comment 55 Chris Wilson 2010-02-10 06:27:05 UTC

There are lots of different bugs identified here that I recognize as being fixed. The first is likely fixed by

commit 4f0f871730b76730ca58209181d16725b0c40184
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 10 09:45:13 2010 +0000

    intel: Handle resetting of input params after EINTR during SET_TILING
    
The ENOSPC oops by:

commit fdcde592c2c48e143251672cf2e82debb07606bd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 9 08:32:54 2010 +0000

    intel: Account for potential pinned buffers hogging fences

and

commit 0ce907f89118aa8748f950700b6919b1d8d8a038
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jan 23 20:26:35 2010 +0000

    drm/i915: Prevent use of uninitialized pointers along error path.

The garbled fonts turned out to be several nasty bugs in their own right.

I believe all the bugs mentioned in this report have been addressed, so please open a new bug report if symptoms persist.
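
For readers unfamiliar with the pattern named in the first commit title: when a call that writes back into its argument is interrupted with EINTR, a bare retry may resend values the failed attempt already overwrote. The Python sketch below only simulates that general idea. It is not the libdrm code, the commit body is not quoted in this report, and FakeDevice, set_tiling_with_retry, and the parameter names are all hypothetical.

```python
import errno


class FakeDevice:
    """Hypothetical stand-in for an ioctl that overwrites its in/out argument."""

    def __init__(self, interruptions):
        self.interruptions = interruptions
        self.seen = []  # snapshot of the parameters received on each attempt

    def set_tiling(self, params):
        self.seen.append(dict(params))
        if self.interruptions > 0:
            self.interruptions -= 1
            # The interrupted call clobbers the in/out field before failing.
            params["tiling_mode"] = 0
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return params["tiling_mode"]


def set_tiling_with_retry(device, handle, tiling_mode, stride):
    """Retry on EINTR, rebuilding the input parameters before every attempt."""
    while True:
        # Reset the inputs inside the loop so a clobbered value is never resent.
        params = {"handle": handle, "tiling_mode": tiling_mode, "stride": stride}
        try:
            return device.set_tiling(params)
        except InterruptedError:
            continue


device = FakeDevice(interruptions=2)
result = set_tiling_with_retry(device, handle=1, tiling_mode=1, stride=512)
print(result, len(device.seen))
```

If the params dictionary were built once outside the loop, the second attempt in this simulation would send tiling_mode 0 instead of the requested value.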

Comment 56 Michal Suchanek 2010-02-10 08:11:40 UTC

Indeed, the occasional lockups changed to rare lockups.

I had only one recently and did not try to trace if it was related to X or something else.

Thanks

Comment 57 Antti Mäkelä 2010-02-10 10:34:03 UTC

(In reply to comment #55)
> There are lots of different bugs identified here that I recognize as being
> fixed. The first is likely fixed by

  Are these in any released version yet, or only in source tree?

  (Looking at intellinuxgraphics.org shows 2.10.0 as the latest and that's from January).
