While this is not the only way to hang the CPU, I managed to hang it twice while browsing the screensaver list and previewing screensavers in the small xscreensaver window (it takes about 5-10 minutes). Another way to produce a similar hang is to use the video overlay feature included in kernel 2.6.33. If we assume they are two completely different bugs, we will test only the xscreensaver one, which seems easier to trigger.

xorg 7.5
kernel 2.6.33-rc8
libdrm 2.4.18
xf86-video-intel 2.10.0

hardware:
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
Created attachment 33414 [details] dmesg of the hang triggered with xscreensaver
Created attachment 33415 [details] intel gpu dump
Created attachment 33416 [details] [review]
Record batch buffer at time of error

That GPU dump does not correspond to the hang. Try the attached patch to capture the faulting batch buffer in /debug/dri/.../i915_error_state
Created attachment 33417 [details]
patched 2.6.33-rc8, triggered with xscreensaver preview window

Took about 10 minutes to trigger. When this happened, I could switch to a VT.
Wow. That is weird: the batchbuffer executed looks like an uninitialised blob of memory:

batchbuffer at 0x0fab9000:
0x0fab9000: 0x00000000: MI_NOOP
0x0fab9004: 0x00000000: MI_NOOP
0x0fab9008: 0x00000000: MI_NOOP
0x0fab900c: 0x00000000: MI_NOOP
0x0fab9010: 0x0fb00000: MI UNKNOWN
0x0fab9014: HEAD 0x00000000: MI_NOOP

[The non-zero values are consistent with the kernel relocations.]

ringbuffer:
0x007dab60: 0x10800001: MI_STORE_DATA_INDEX
0x007dab64: 0x00000080: dword 1
0x007dab68: 0x0004f0bb: dword 2
0x007dab6c: 0x01000000: MI_USER_INTERRUPT
0x007dab70: 0x02000000: MI_FLUSH
0x007dab74: 0x00000000: MI_NOOP
0x007dab78: 0x18800080: MI_BATCH_BUFFER_START
0x007dab7c: 0x0fab9001: dword 1

seqno at time of hang: 4f0bb,
i.e. there is no doubt that we intended to execute that buffer.
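For anyone following the decode by hand, here is a minimal sketch of how the ring entries above map to command names. It assumes the usual i915 MI encoding (bits 31:29 zero, opcode in bits 28:23 of the header dword) and that the low two bits of the MI_BATCH_BUFFER_START address dword are flags rather than address bits. The opcode table covers only the commands seen in this dump, and the function names are made up for illustration; this is not how intel_gpu_dump is implemented.

```python
# Hypothetical helper, not part of intel_gpu_dump or the kernel.
# Assumption: MI commands have bits 31:29 == 0 and the opcode in bits 28:23.
MI_OPCODES = {
    0x00: "MI_NOOP",
    0x02: "MI_USER_INTERRUPT",
    0x04: "MI_FLUSH",
    0x21: "MI_STORE_DATA_INDEX",
    0x31: "MI_BATCH_BUFFER_START",
}


def decode_mi(dword):
    """Return the MI command name for a header dword, or 'MI UNKNOWN'."""
    if (dword >> 29) != 0:
        return "not an MI command"
    opcode = (dword >> 23) & 0x3F
    return MI_OPCODES.get(opcode, "MI UNKNOWN")


def batch_start_address(address_dword):
    """Strip the low flag bits (assumed to be bits 1:0) from the address dword."""
    return address_dword & ~0x3


if __name__ == "__main__":
    # Header dwords taken from the ring buffer and batch buffer dump.
    for header in (0x10800001, 0x01000000, 0x02000000, 0x18800080, 0x0FB00000):
        print(f"0x{header:08x}: {decode_mi(header)}")
    print(f"batch at 0x{batch_start_address(0x0FAB9001):08x}")
```

With the values from the dump, the address dword 0x0fab9001 resolves to 0x0fab9000, which is the batch buffer shown above.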
(In reply to comment #5)
> Wow. That is weird the batchbuffer executed looks like an uninitialised blob of
> memory:
>
> seqno at time of hang: 4f0bb,
> i.e. there is no doubt that we intended to execute that buffer.

I'm good at triggering bugs. For the rest... :/
Created attachment 33419 [details]
running glknots xscreensaver

Tried to trigger the bug again (and to provide more data on the randomness of the buffer). I noticed that glknots, before drawing a knot, renders a garbage frame. Restarting the screensaver several times triggered the hang faster. Attaching a new i915_error_state. If more are needed, let me know.
(In reply to comment #7)
> attaching a new i915_error_state. if more are needed, let me know.

That follows the same pattern as the first, so I think we can identify the symptoms of the bug at least.

Judging from the timing of the hangs, do you think these are triggered by the GL application or by the X server? [Given the empty state of the batch buffer, it's a bit hard to identify from where the batches are being submitted... Hmm...]
(In reply to comment #8)
> (In reply to comment #7)
> > attaching a new i915_error_state. if more are needed, let me know.
>
> That follows the same pattern as the first, so I think we can identify the
> symptoms of the bug at least.
>
> Judging from the timing of the hangs, do you think these are trigged by the GL
> application or by the X server? [Given the empty state of the batch buffer,
> it's a bit hard to identify from where the batches are being submitted...
> Hmm...]

I can only trigger this with xscreensaver. Could this be an xscreensaver bug (or a glknots one) instead? I don't know how the driver works or interacts with software. Are broken apps expected to break the driver?

I could try the overlay issue too and see if we get a similar result, if needed. Just let me know; it takes far longer to trigger.
(In reply to comment #8)
> (In reply to comment #7)
> > attaching a new i915_error_state. if more are needed, let me know.
>
> That follows the same pattern as the first, so I think we can identify the
> symptoms of the bug at least.
>
> Judging from the timing of the hangs, do you think these are trigged by the GL
> application or by the X server? [Given the empty state of the batch buffer,
> it's a bit hard to identify from where the batches are being submitted...
> Hmm...]

I've just re-read your question. I think it's the app that is submitting the first frame with garbage (or whatever this means). But my question stands: shouldn't the driver sanitize this data?
(In reply to comment #9)
> could this be a xscreensaver bug (or glknots's) instead? i dont know how the
> driver works or interacts with software.

No worries, I'm trying to identify the call path - but I suspect that it is more or less irrelevant to the actual bug.

> are broken apps expected to break the driver?

It's a broken driver, either GEM, X or GL.

> i could try the overlay issue too and see if we get a similar result if
> needed.just let me know, it takes far longer to trigger.

I suspect the overlay issue is a separate issue since that involves several different code paths - but this is a bizarre bug that may indeed be cropping up in other places.
Out of curiosity can you try the branch:
  git pull git://anongit.freedesktop.org/~ickle/linux-2.6 error-state
it has one patch to pwrite() that may be relevant here.
(In reply to comment #12)
> Out of curiosity can you try the branch:
> git pull git://anongit.freedesktop.org/~ickle/linux-2.6 error-state
> it has one patch to pwrite() that may be relevant here.

$ git pull git://anongit.freedesktop.org/~ickle/linux-2.6 error-state
fatal: The remote end hung up unexpectedly

What do you make of that?
My fault, I had not made that tree public and so it was only accessible via ssh. Fixed, though it will not be visible until the next cronjob fires [~30 minutes].
(In reply to comment #12)
> Out of curiosity can you try the branch:
> git pull git://anongit.freedesktop.org/~ickle/linux-2.6 error-state
> it has one patch to pwrite() that may be relevant here.

It died the same way. Attaching i915_error_state.
Created attachment 33429 [details] with your linux-2.6 branch
Indeed, the error looks identical. Rules out one possibility, thanks.
(In reply to comment #17)
> Indeed, the error looks identical. Rules out one possibility, thanks.

Kernel 2.6.32.8 has the same problem. dmesg gets spammed with the wedged errors.
X.Org X Server 1.7.5.901 (1.7.6 RC 1) tested and bug still present
X.Org X Server 1.7.5.902 (1.7.6 RC 2) libdrm 2.4.19 bug still present
mesa 7.7.1 and its children: still present :(

I'm not sure whether it is helpful to post re-tests with new package versions. If it just adds noise, please say so.
Tomas, the reminders are quite helpful, thanks. This is perhaps the most worrying bug on i915 -- I haven't found anything that could suggest how this might even occur. Upon relocation the kernel is handing us freshly zeroed pages, pages which we have just written to with the instructions for the batch! Gah!
I also seem to be hitting this bug. I can reproduce it by just starting xscreensaver, though sometimes i get the same effect (blank screen, machine remotely manageable) with apps like google-chrome, vncviewer, ... I'll attach my debug/dri/0 dir, and also the gpu dump.
Created attachment 35284 [details] content of /sys/kernel/debug/dri/0
Created attachment 35285 [details] gpu dump
I've installed:
xorg 1.8
intel graphics 2.11
mesa 7.8.1

The scene changed a bit: now glknot fails to preview or to run fullscreen, and 1 time out of 10 it kills X. I'm attaching the xorg.log, which contains "some info".
Created attachment 35330 [details] xorg 1.8 log when killed by glknot
(In reply to comment #27)
> Created an attachment (id=35330) [details]
> xorg 1.8 log when killed by glknot

Does dmesg report a GPU error or hang when glknot dies? The current hypothesis is that glknot dies leaving the DRI client state in the XServer inconsistent and leading to the XServer dying from the segfault. (So two bugs.)
(In reply to comment #28)
> (In reply to comment #27)
> > Created an attachment (id=35330) [details]
> > xorg 1.8 log when killed by glknot
>
> Does dmesg report a GPU error or hang when glknot dies? The current hypothesis
> is that glknot dies leaving the DRI client state in the XServer inconsistent
> and leading to the XServer dying from the segfault. (So two bugs.)

Yes. Here it is:
----
[drm:i915_gem_madvise_ioctl] *ERROR* Attempted i915_gem_madvise_ioctl() on a pinned object
----
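When collecting reports like this one, filtering the kernel log for DRM errors can save some scrolling. The following is only a small sketch, assuming the messages follow the "[drm:function] *ERROR* text" shape seen above; the function name and the regular expression are illustrative and not taken from any i915 tool.

```python
import re

# Assumed shape of a DRM error line: "[drm:<function>] *ERROR* <message>"
DRM_ERROR = re.compile(r"\[drm:(?P<func>\w+)\]\s+\*ERROR\*\s+(?P<msg>.*)")


def drm_errors(log_lines):
    """Yield (function, message) pairs for every DRM error line in the log."""
    for line in log_lines:
        match = DRM_ERROR.search(line)
        if match:
            yield match.group("func"), match.group("msg").strip()


if __name__ == "__main__":
    # The error line reported in this comment, plus an unrelated line.
    sample = [
        "usb 1-1: new high speed USB device",
        "[drm:i915_gem_madvise_ioctl] *ERROR* Attempted "
        "i915_gem_madvise_ioctl() on a pinned object",
    ]
    for func, msg in drm_errors(sample):
        print(f"{func}: {msg}")
```

In practice the input would be the saved output of dmesg, read line by line, rather than a hard-coded list.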
xscreensaver 5.11: no changes.
*** Bug 28274 has been marked as a duplicate of this bug. ***
Chris' commit 8accf0a8 in Mesa master fixes the issue for us. Still, this can only be understood as a workaround, in the long term the kernel must be fixed.
(In reply to comment #32)
> Chris' commit 8accf0a8 in Mesa master fixes the issue for us.
>
> Still, this can only be understood as a workaround, in the long term the kernel
> must be fixed.

What fix do you propose for the kernel? The only protection we could add is to perform command stream validation, and that is only likely to catch the gross errors that we can spot even more easily in userspace with asserts like above, or even the hypothetical validator.
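To make "gross errors" concrete, here is a toy sketch of the kind of check such a validator might run over a batch. It is not how the kernel or Mesa validates anything. It assumes the i915 MI encoding (bits 31:29 zero, opcode in bits 28:23, MI_BATCH_BUFFER_END as opcode 0x0a), treats every dword as a potential command header, and ignores multi-dword commands, so a real validator would need to be considerably smarter. All names are hypothetical.

```python
# Toy illustration only; names and checks are hypothetical.
# Assumption: MI opcode lives in bits 28:23, MI_BATCH_BUFFER_END is opcode 0x0a.
MI_BATCH_BUFFER_END = 0x0A


def mi_opcode(dword):
    """Return the MI opcode of a dword, or None if it is not an MI command."""
    if (dword >> 29) != 0:
        return None
    return (dword >> 23) & 0x3F


def gross_errors(batch):
    """Return a list of obvious problems found in a batch (list of dwords)."""
    if not batch:
        return ["batch is empty"]
    problems = []
    if all(dword == 0 for dword in batch):
        problems.append("batch is all zeroes (looks uninitialised)")
    if not any(mi_opcode(dword) == MI_BATCH_BUFFER_END for dword in batch):
        problems.append("no MI_BATCH_BUFFER_END found")
    return problems


if __name__ == "__main__":
    # The six dwords of the faulting batch from the first error state.
    faulting = [0x0, 0x0, 0x0, 0x0, 0x0FB00000, 0x0]
    print(gross_errors(faulting))
```

Even this crude check flags the batch from the first error state, since it never reaches an MI_BATCH_BUFFER_END, but it would say nothing about a batch that is well-formed yet simply wrong.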
I am satisfied that this bug was the result of the buffer overrun in mesa.