Bug 26645 - 945GM: gpu hangs when using xscreensaver
Summary: 945GM: gpu hangs when using xscreensaver
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: 7.5 (2009.10)
Hardware: Other All
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Duplicates: 28274
Depends on:
Blocks:
 
Reported: 2010-02-19 03:31 UTC by Tomas M.
Modified: 2010-06-25 09:41 UTC (History)
CC List: 3 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg of the hang triggered with xscreensaver (537 bytes, application/octet-stream)
2010-02-19 03:32 UTC, Tomas M.
intel gpu dump (93.17 KB, application/x-bzip)
2010-02-19 03:33 UTC, Tomas M.
Record batch buffer at time of error (15.37 KB, patch)
2010-02-19 04:02 UTC, Chris Wilson
patched 2.6.33-rc8, triggered with xscreensaver preview window (757.54 KB, application/octet-stream)
2010-02-19 05:13 UTC, Tomas M.
running glknots xscreensaver (758.25 KB, application/octet-stream)
2010-02-19 06:15 UTC, Tomas M.
with your linux-2.6 branch (757.54 KB, application/octet-stream)
2010-02-19 09:47 UTC, Tomas M.
content of /sys/kernel/debug/dri/0 (205.36 KB, application/x-gzip)
2010-04-25 13:17 UTC, Jan De Luyck
gpu dump (117.08 KB, application/x-gzip)
2010-04-25 13:18 UTC, Jan De Luyck
xorg 1.8 log when killed by glknot (27.28 KB, application/octet-stream)
2010-04-29 03:50 UTC, Tomas M.

Description Tomas M. 2010-02-19 03:31:01 UTC
this is not the only way to hang the GPU, but i managed to hang it twice while browsing the screensaver list and previewing them in the small window of xscreensaver (it takes about 5~10 minutes).

another way to produce a similar hang is using the video overlay feature included in kernel 2.6.33

if we assume they are 2 completely different bugs, we will test only the xscreensaver one, which seems to be easier to trigger.

xorg 7.5
kernel 2.6.33-rc8
libdrm 2.4.18
xf86-video-intel 2.10.0

hardware: 
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
Comment 1 Tomas M. 2010-02-19 03:32:13 UTC
Created attachment 33414 [details]
dmesg of the hang triggered with xscreensaver
Comment 2 Tomas M. 2010-02-19 03:33:42 UTC
Created attachment 33415 [details]
intel gpu dump
Comment 3 Chris Wilson 2010-02-19 04:02:27 UTC
Created attachment 33416 [details] [review]
Record batch buffer at time of error

That gpu dump does not correspond to the hang. Try using the attached patch to capture the faulting batch buffer in /debug/dri/.../i915_error_state
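For anyone collecting the same data, here is a minimal sketch of saving the error state to a regular file so it can be attached. It assumes debugfs is mounted and that the DRI directory is passed in by the caller (for example /sys/kernel/debug/dri/0, as in a later attachment on this bug); the helper name `save_error_state` is made up for illustration.

```python
from pathlib import Path


def save_error_state(debugfs_dri: Path, dest: Path) -> Path:
    """Copy i915_error_state out of a debugfs DRI directory into dest.

    debugfs_dri is supplied by the caller; this sketch does not guess
    where debugfs is mounted.
    """
    src = debugfs_dri / "i915_error_state"
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found (is debugfs mounted?)")
    # Read the whole file at once: debugfs files report a size of 0,
    # so avoid anything that relies on the reported size.
    dest.write_bytes(src.read_bytes())
    return dest
```

Reading debugfs usually requires root, so a call such as `save_error_state(Path("/sys/kernel/debug/dri/0"), Path("i915_error_state.txt"))` would have to run with sufficient privileges.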
Comment 4 Tomas M. 2010-02-19 05:13:42 UTC
Created attachment 33417 [details]
patched 2.6.33-rc8, triggered with xscreensaver preview window

took about 10 minutes to trigger.

when this happened, i could switch to a vt
Comment 5 Chris Wilson 2010-02-19 05:49:38 UTC
Wow. That is weird: the batchbuffer executed looks like an uninitialised blob of memory:

batchbuffer at 0x0fab9000:
0x0fab9000:      0x00000000: MI_NOOP
0x0fab9004:      0x00000000: MI_NOOP
0x0fab9008:      0x00000000: MI_NOOP
0x0fab900c:      0x00000000: MI_NOOP
0x0fab9010:      0x0fb00000: MI UNKNOWN
0x0fab9014: HEAD 0x00000000: MI_NOOP
[The non-zero values are consistent with the kernel relocations.]

ringbuffer:
0x007dab60:      0x10800001: MI_STORE_DATA_INDEX
0x007dab64:      0x00000080:    dword 1
0x007dab68:      0x0004f0bb:    dword 2
0x007dab6c:      0x01000000: MI_USER_INTERRUPT
0x007dab70:      0x02000000: MI_FLUSH
0x007dab74:      0x00000000: MI_NOOP
0x007dab78:      0x18800080: MI_BATCH_BUFFER_START
0x007dab7c:      0x0fab9001:    dword 1

seqno at time of hang: 4f0bb,
i.e. there is no doubt that we intended to execute that buffer.
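To make the dump above easier to follow, here is a rough sketch of how the MI command names can be recovered from the raw dwords. It assumes the usual MI encoding (command type in bits 31:29 equal to 0, opcode in bits 28:23) and only lists the opcodes that appear in this dump. The function names are illustrative, and treating the low two bits of the MI_BATCH_BUFFER_START operand as flags rather than address bits is an assumption based on the address shown in the dump.

```python
# Opcodes limited to the commands visible in the dump above.
MI_OPCODES = {
    0x00: "MI_NOOP",
    0x02: "MI_USER_INTERRUPT",
    0x04: "MI_FLUSH",
    0x21: "MI_STORE_DATA_INDEX",
    0x31: "MI_BATCH_BUFFER_START",
}


def decode_mi(dword: int) -> str:
    """Name an MI command header, or report it as unknown."""
    if (dword >> 29) != 0:
        return "not an MI command"
    opcode = (dword >> 23) & 0x3F
    return MI_OPCODES.get(opcode, "MI UNKNOWN")


def batch_address(operand: int) -> int:
    """Strip the low flag bits from the MI_BATCH_BUFFER_START operand."""
    return operand & ~0x3


for value in (0x10800001, 0x01000000, 0x02000000, 0x18800080, 0x0FB00000):
    print(f"0x{value:08x}: {decode_mi(value)}")
print(f"batch buffer at 0x{batch_address(0x0FAB9001):08x}")
```

With this decoding, the ring's final command points at 0x0fab9000, which is the batch buffer shown at the top of the dump.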
Comment 6 Tomas M. 2010-02-19 05:53:10 UTC
(In reply to comment #5)
> Wow. That is weird: the batchbuffer executed looks like an uninitialised blob of
> memory:

> 
> seqno at time of hang: 4f0bb,
> i.e. there is no doubt that we intended to execute that buffer.
> 


i'm good at triggering bugs. for the rest... :/
Comment 7 Tomas M. 2010-02-19 06:15:35 UTC
Created attachment 33419 [details]
running glknots xscreensaver

tried to trigger the bug again, (and provide more data on the randomness of the buffer).

i noticed glknots, before drawing a knot, renders a garbage frame. restarting the screensaver several times did trigger this faster.

attaching a new i915_error_state. if more are needed, let me know.
Comment 8 Chris Wilson 2010-02-19 06:26:48 UTC
(In reply to comment #7)
> attaching a new i915_error_state. if more are needed, let me know.

That follows the same pattern as the first, so I think we can identify the symptoms of the bug at least.

Judging from the timing of the hangs, do you think these are triggered by the GL application or by the X server? [Given the empty state of the batch buffer, it's a bit hard to identify from where the batches are being submitted... Hmm...]

Comment 9 Tomas M. 2010-02-19 06:33:17 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > attaching a new i915_error_state. if more are needed, let me know.
> 
> That follows the same pattern as the first, so I think we can identify the
> symptoms of the bug at least.
> 
> Judging from the timing of the hangs, do you think these are triggered by the GL
> application or by the X server? [Given the empty state of the batch buffer,
> it's a bit hard to identify from where the batches are being submitted...
> Hmm...]
> 

i can only trigger this with xscreensaver.

could this be an xscreensaver bug (or glknots's) instead? i don't know how the driver works or interacts with software.

are broken apps expected to break the driver?

i could try the overlay issue too and see if we get a similar result if needed. just let me know, it takes far longer to trigger.
Comment 10 Tomas M. 2010-02-19 06:39:46 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > attaching a new i915_error_state. if more are needed, let me know.
> 
> That follows the same pattern as the first, so I think we can identify the
> symptoms of the bug at least.
> 
> Judging from the timing of the hangs, do you think these are triggered by the GL
> application or by the X server? [Given the empty state of the batch buffer,
> it's a bit hard to identify from where the batches are being submitted...
> Hmm...]
> 

i've just re-read your question.

i think it's the app which is submitting the first frame with garbage. (or whatever this means)

but my question stands: shouldn't the driver sanitize this data?
Comment 11 Chris Wilson 2010-02-19 06:42:19 UTC
(In reply to comment #9)
> could this be an xscreensaver bug (or glknots's) instead? i don't know how the
> driver works or interacts with software.

No worries, I'm trying to identify the call path - but I suspect that it is more or less irrelevant to the actual bug.

> are broken apps expected to break the driver?

It's a broken driver, either GEM, X or GL.
 
> i could try the overlay issue too and see if we get a similar result if
> needed. just let me know, it takes far longer to trigger.

I suspect the overlay issue is a separate issue since that involves several different code paths - but this is a bizarre bug that may indeed be cropping up in other places.

Comment 12 Chris Wilson 2010-02-19 08:35:08 UTC
Out of curiosity can you try the branch:
  git pull git://anongit.freedesktop.org/~ickle/linux-2.6 error-state
it has one patch to pwrite() that may be relevant here.
Comment 13 Tomas M. 2010-02-19 08:47:25 UTC
(In reply to comment #12)
> Out of curiosity can you try the branch:
>   git pull git://anongit.freedesktop.org/~ickle/linux-2.6 error-state
> it has one patch to pwrite() that may be relevant here.
> 

$  git pull git://anongit.freedesktop.org/~ickle/linux-2.6 error-state
fatal: The remote end hung up unexpectedly

what do you make out of that?
Comment 14 Chris Wilson 2010-02-19 08:52:47 UTC
My fault, I had not made that tree public and so it was only accessible via ssh. Fixed, though it will not be visible until the next cronjob fires [~30 minutes].
Comment 15 Tomas M. 2010-02-19 09:46:00 UTC
(In reply to comment #12)
> Out of curiosity can you try the branch:
>   git pull git://anongit.freedesktop.org/~ickle/linux-2.6 error-state
> it has one patch to pwrite() that may be relevant here.
> 

it died the same way. attaching i915_error_state
Comment 16 Tomas M. 2010-02-19 09:47:13 UTC
Created attachment 33429 [details]
with your linux-2.6 branch
Comment 17 Chris Wilson 2010-02-19 10:19:22 UTC
Indeed, the error looks identical. Rules out one possibility, thanks.
Comment 18 Tomas M. 2010-02-19 15:09:00 UTC
(In reply to comment #17)
> Indeed, the error looks identical. Rules out one possibility, thanks.
> 

kernel 2.6.32.8 has the same problem.

dmesg gets spammed with the wedged errors

Comment 19 Tomas M. 2010-03-08 04:40:58 UTC
X.Org X Server 1.7.5.901 (1.7.6 RC 1)

tested and bug still present
Comment 20 Tomas M. 2010-03-14 11:14:02 UTC
X.Org X Server 1.7.5.902 (1.7.6 RC 2)
libdrm 2.4.19


bug still present
Comment 21 Tomas M. 2010-03-31 04:48:53 UTC
mesa 7.7.1 and its childs.

still present :(

i'm not sure if it is helpful or not to post re-tests with new versions of packages. if it just adds noise, please say so.
Comment 22 Chris Wilson 2010-03-31 04:54:42 UTC
Tomas, the reminders are quite helpful, thanks. This is perhaps the most worrying bug on i915 -- I haven't found anything that could suggest how this might even occur. Upon relocation the kernel is handing us freshly zeroed pages, pages which we have just written to with the instructions for the batch! Gah!
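One way to spot this symptom in further error states is to measure how much of the dumped batch is zero. The sketch below assumes the line layout shown in comment 5 (`address: [HEAD] value: decoded name`); the helper name `zero_fraction` and the idea of using a fraction as the signal are illustrative, not something the driver or the dump tool provides.

```python
import re

# Matches lines such as "0x0fab9014: HEAD 0x00000000: MI_NOOP".
DWORD_LINE = re.compile(
    r"^0x[0-9a-f]{8}:\s+(?:HEAD\s+)?0x([0-9a-f]{8}):", re.IGNORECASE
)


def zero_fraction(dump: str) -> float:
    """Return the fraction of dumped dwords whose value is zero."""
    values = []
    for line in dump.splitlines():
        match = DWORD_LINE.match(line.strip())
        if match:
            values.append(int(match.group(1), 16))
    if not values:
        raise ValueError("no dword lines found in dump")
    return values.count(0) / len(values)


# The batch buffer excerpt quoted in comment 5.
sample = """\
0x0fab9000:      0x00000000: MI_NOOP
0x0fab9004:      0x00000000: MI_NOOP
0x0fab9008:      0x00000000: MI_NOOP
0x0fab900c:      0x00000000: MI_NOOP
0x0fab9010:      0x0fb00000: MI UNKNOWN
0x0fab9014: HEAD 0x00000000: MI_NOOP
"""
print(f"{zero_fraction(sample):.2f} of the dumped dwords are zero")
```

A batch that was really written by userspace should contain mostly non-zero command dwords, so a value close to 1.0 would suggest the same "zeroed pages plus relocations" pattern.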
Comment 23 Jan De Luyck 2010-04-25 13:16:40 UTC
I also seem to be hitting this bug. I can reproduce it by just starting xscreensaver, though sometimes i get the same effect (blank screen, machine remotely manageable) with apps like google-chrome, vncviewer, ...

I'll attach my debug/dri/0 dir, and also the gpu dump.
Comment 24 Jan De Luyck 2010-04-25 13:17:49 UTC
Created attachment 35284 [details]
content of /sys/kernel/debug/dri/0
Comment 25 Jan De Luyck 2010-04-25 13:18:30 UTC
Created attachment 35285 [details]
gpu dump
Comment 26 Tomas M. 2010-04-29 03:49:54 UTC
i've installed

xorg 1.8
intel graphics 2.11
mesa 7.8.1

the scene changed a bit.

now glknot fails to preview or to run fullscreen.

and about 1 time out of 10, it kills X.

i'm attaching the xorg.log which contains "some info"
Comment 27 Tomas M. 2010-04-29 03:50:48 UTC
Created attachment 35330 [details]
xorg 1.8 log when killed by glknot
Comment 28 Chris Wilson 2010-04-29 06:05:31 UTC
(In reply to comment #27)
> Created an attachment (id=35330) [details]
> xorg 1.8 log when killed by glknot

Does dmesg report a GPU error or hang when glknot dies? The current hypothesis is that glknot dies leaving the DRI client state in the XServer inconsistent and leading to the XServer dying from the segfault. (So two bugs.)
Comment 29 Tomas M. 2010-05-16 08:31:11 UTC
(In reply to comment #28)
> (In reply to comment #27)
> > Created an attachment (id=35330) [details]
> > xorg 1.8 log when killed by glknot
> 
> Does dmesg report a GPU error or hang when glknot dies? The current hypothesis
> is that glknot dies leaving the DRI client state in the XServer inconsistent
> and leading to the XServer dying from the segfault. (So two bugs.)

yes. here it is

----
[drm:i915_gem_madvise_ioctl] *ERROR* Attempted i915_gem_madvise_ioctl() on a pinned object
----
Comment 30 Tomas M. 2010-05-18 04:12:33 UTC
xscreensaver 5.11: no changes.
Comment 31 Tomas M. 2010-05-31 04:24:32 UTC
*** Bug 28274 has been marked as a duplicate of this bug. ***
Comment 32 Matthias Hopf 2010-05-31 05:49:19 UTC
Chris' commit 8accf0a8 in Mesa master fixes the issue for us.

Still, this can only be understood as a workaround; in the long term the kernel
must be fixed.
Comment 33 Chris Wilson 2010-05-31 06:00:26 UTC
(In reply to comment #32)
> Chris' commit 8accf0a8 in Mesa master fixes the issue for us.
> 
> Still, this can only be understood as a workaround; in the long term the kernel
> must be fixed.

What fix do you propose for the kernel? The only protection we could add is to perform command stream validation, and that is only likely to catch the gross errors that we can spot even more easily in userspace with asserts like above, or even the hypothetical validator.
Comment 34 Chris Wilson 2010-06-25 09:41:34 UTC
I am satisfied that this bug was the result of the buffer overrun in mesa.
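For readers unfamiliar with this class of bug, the sketch below illustrates the general idea of a userspace check that refuses to write past the end of a batch. It is not the Mesa code and not the referenced commit: the `Batch` class, its methods, and the sizes are hypothetical and exist only to show how a space check catches an overrun at emit time instead of on the GPU.

```python
class Batch:
    """Toy fixed-size command buffer measured in 32-bit dwords (hypothetical)."""

    def __init__(self, size_dwords: int) -> None:
        self.size_dwords = size_dwords
        self.dwords = []

    def space(self) -> int:
        """Number of dwords that can still be emitted."""
        return self.size_dwords - len(self.dwords)

    def emit(self, *values: int) -> None:
        """Append dwords, failing loudly instead of overrunning the buffer."""
        if len(values) > self.space():
            raise OverflowError(
                f"need {len(values)} dwords, only {self.space()} left"
            )
        self.dwords.extend(v & 0xFFFFFFFF for v in values)


batch = Batch(size_dwords=4)
batch.emit(0x02000000, 0x00000000)  # fits: two of four dwords used
try:
    batch.emit(0, 0, 0)             # would overrun: only two dwords remain
except OverflowError as err:
    print("rejected:", err)
```

A check of this kind fails in the application that made the mistake, which is much easier to debug than a hang reported later by the kernel.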

