Created attachment 31780 [details] xorg log
Hi there,
I just experienced a severe crash of xorg-server, which triggered the standard restart of the server. However, it didn't come back...
Some information about my system:
kernel: vanilla-2.6.32
libdrm: latest git master
mesa: latest git master
xf86-video-intel: latest git master
xorg-server: 1.7.3
gfx hardware: 00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
Xorg crashes with this error message:
Failed to map batchbuffer: Input/output error
(attaching entire xorg log...)
The kernel log contains a lot of:
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
(and probably more, again attaching the complete log)
Greets,
Tobias
Created attachment 31781 [details] kernel log / dmesg
Managed to reliably trigger this by restoring my Seamonkey (Mozilla suite) session, which resulted in a bunch of tabs opening (and rendering) at the same time. This either crashes or freezes X (VT switch still works), and the kernel log looks similar. Going back to intel-2.9.1 fixes the issue, so I'm adding regression to the keywords.
Tobias, after a GPU hang such as the one you have here, we actually need a GPU dump. Could you please follow the guide at http://intellinuxgraphics.org/how_to_report_bug.html to grab a dump and additional information? Thanks.
This isn't working for me:
$ intel_gpu_dump
intel_gpu_dump: Couldn't find i915 debugfs directory.
Is debugfs mounted? You might try mounting it with a command such as:
sudo mount -t debugfs debugfs /sys/kernel/debug
$ cat /sys/module/drm/parameters/debug
6
$ mount
debugfs on /sys/kernel/debug type debugfs (rw)
$ zgrep DEBUG_FS /proc/config.gz
CONFIG_DEBUG_FS=y
I'm omitting the lsmod output, but I can assure you that drm and i915 are loaded. Any ideas?
Anyway, Chris, even after your recent patch to video-intel the lockup still happens.
Oops, sorry... looks like I forgot make modules_install. Dumping is now working; however, it was a bit harder to produce a lockup this time. Anyway, as soon as the lockup came (X didn't restart this time) I switched to the console and made the dump. The funny thing is that the cursor was still movable in X. No idea what this means...
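
The checks from the two comments above can be rolled into one small script. This is only a sketch: the function name find_i915_debugfs and the DEBUGFS override are made up for illustration, and it assumes that a working i915 module exposes at least one entry named i915_* under a numbered dri directory in debugfs. It does not reproduce whatever lookup intel_gpu_dump itself performs.

```shell
#!/bin/sh
# Sketch: sanity checks before running intel_gpu_dump.
# DEBUGFS may be overridden; /sys/kernel/debug is the mount point used in this thread.
DEBUGFS="${DEBUGFS:-/sys/kernel/debug}"

# Print the first dri/N directory that holds i915_* entries; fail if none exists.
# (Hypothetical helper, assumes i915 debugfs entries are named i915_*.)
find_i915_debugfs() {
    for dir in "$1"/dri/*; do
        [ -d "$dir" ] || continue
        for entry in "$dir"/i915_*; do
            if [ -e "$entry" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    return 1
}

if found=$(find_i915_debugfs "$DEBUGFS"); then
    echo "i915 debugfs entries found in $found"
else
    echo "no i915 debugfs entries under $DEBUGFS/dri" >&2
    echo "check that debugfs is mounted and that the installed i915 module matches the running kernel" >&2
fi
```

If the second message appears even though debugfs is mounted and CONFIG_DEBUG_FS=y, a stale module (for example a forgotten make modules_install) is one thing worth ruling out.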
Created attachment 31812 [details] intel_gpu_dump output
Created attachment 31813 [details] new kernel log corresponds to the situation where the dump was made
I'd like to add that I can also see minor visual corruptions, e.g. to fonts - I think these were mentioned in the other bugreport.
Just did a git pull for xf86-video-intel, libdrm and mesa, but the hang still occurs. Considering that there is some stabilization work going on for a new release, I'm bumping the severity to critical. It's quite random when the hang kicks in. Sometimes it takes about half an hour, sometimes you're done in five minutes (or fewer), so this makes the driver pretty much unusable for me. Anything else I can test? Does upgrading the kernel to one of the drm-next-xyz branches make sense?
Greets,
Tobias
Created attachment 32054 [details] Xorg log
Created attachment 32055 [details] kern.log
I am having the same experience on my computer. My system information:
System environment:
-- chipset: Intel Corporation 82945G/GZ Integrated Graphics Controller
-- system architecture: i686
-- xf86-video-intel: 2.9.99.901+git20091209.093bb9eb
-- xserver: 7.4
-- mesa: 7.7-rc2
-- libdrm: 2.4.16
-- kernel: 2.6.32-020632-generic kernel
-- Linux distribution: Ubuntu 9.10
-- Machine or mobo model: Shuttle K45
-- Display connector: VGA
As for reproduction steps, I can't find any. The display usually hangs when I try to close a web page in firefox that has flash video streaming in it. But this might also be related to the upgrade of the flash plugin two days ago. I also got the freeze while working in blender. The last freeze was triggered merely by turning the mouse wheel in Thunderbird.
My more specific information about mesa is: 7.7.0~git20091211+mesa-7-7-branch.8413a3ae
Created attachment 32081 [details] Intel gpu dump from the last crash
Created attachment 32082 [details] dmesg output
Created attachment 32141 [details] Intel GPU dump 2
These crashes are frequent! The latest crash happened with mesa 7.0~git20091216+mesa-7-7-branch.bf75ee9c. Why is this bug's priority marked as "medium"? Sending another intel gpu dump.
I also have a question, is this intel driver or mesa related? Tobias do you have any idea about this?
I would blame it on either libdrm or xf86-video-intel, if we can assume that the bug isn't in the kernel DRM module code. I don't see mesa producing this issue, since I don't need any 3D application to cause the hang. Just regular 2D rendering does it for me. Since I don't use any composite stuff or fancy 3D desktop effects the problem should not be mesa but the DDX. Like I already stated above this issue is a regression, since downgrading video-intel solves the problem. So it's either a new issue inside video-intel or some code changes trigger an (already existing) error in libdrm/DRM kernel module.
If it's not mesa, I would vote for libdrm rather than xf86-video-intel. Who knows. Until this problem is picked by a developer, I don't think we can know for sure. I wonder why is there so little activity for this bug. I am getting at least 10 freezes a day. As for the libdrm, I am using libdrm 2.4.16+git20091211.edc77dd2. I'll try downgrading it and see if that might help.
I have both mesa and libdrm at git master tip. Only video-intel is at version 2.9.1 - once I upgrade to git master I get the issues. @Petar: Why exactly do you vote for libdrm?
I was probably wrong. I noticed the freezes after upgrading libdrm on the 12th of December. Unfortunately there was a big rendering problem with the intel driver, and a few versions (updates from git) were not tested by me for anything except that rendering problem. I downgraded libdrm and the corruption occurred. Now I have upgraded to the latest libdrm and downgraded to intel driver 2.9.0+git20091111.dbb68168. So far no freezes. As with the rendering problems, this bug might have been introduced after commit dbb68168dc909ab2ec1d935322c3fd8581e666f1 Revert "configure: make --disable-dri work even if the server supports DRI". There are 46 commits after this one in the last driver I was using that still had the problem. Good luck finding the culprit :)
The driver I have been using for a day and a half, built from commits up to and including dbb68168dc909ab2ec1d935322c3fd8581e666f1, is not showing any signs of this bug.
Tobias, do you have any experience in reverting commits?
It's very unlikely that commit dbb68168dc909ab2ec1d935322c3fd8581e666f1 is the culprit, since I don't use --disable-dri at all and the commit mainly touches the autotools scripts. Concerning commit reversion: You just use git-revert with the commit id as parameter. It's all documented in the manpage. You probably want to add the -n option.
(In reply to comment #24)
> It's very unlikely that commit dbb68168dc909ab2ec1d935322c3fd8581e666f1 is the
> culprit, since I don't use --disable-dri at all and the commit mainly touches
> the autotools scripts.
>
> Concerning commit reversion: You just use git-revert with the commit id as
> parameter. It's all documented in the manpage. You probably want to add the -n
> option.
>
I am not saying that commit dbb68168dc909ab2ec1d935322c3fd8581e666f1 is the culprit, but that the bug was introduced afterwards. I was asking you about commit reversal because I have an idea of how you could test for the problematic commit, since you can reliably reproduce the bug. This might be a stupid idea, but why not try it anyway. You should use a "binary search": add, let's say, 20 commits after commit dbb68168dc909ab2ec1d935322c3fd8581e666f1 and test. If there is a problem, then halve that to 10 and do the test again. I hope you understand what I'm trying to say. This might be a naive approach that won't work if the commits are interdependent, but it's a lot better than adding 40+ commits one at a time. Then again, it might be a totally stupid idea :). I am not testing by using git. I am using prebuilt testing packages from the Ubuntu driver testing repository. If you decide to test this way, be careful about this commit:
commit 3f11bbec420080151406c203af292e55177e77d1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Nov 29 21:39:41 2009 +0000
uxa-glyphs: Enable TILING_X on glyph caches.
as it introduced some other bug (25406) and was reverted afterwards.
What you mean is a bisection, which is already fully implemented in git bisect.
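
For illustration, the halving strategy discussed above (the one git bisect automates) can be sketched in plain shell. Everything here is a stand-in: is_bad is a hypothetical hook that in practice would mean "build this commit, use it for a few hours, did the GPU hang?", FIRST_BAD simulates the unknown answer, and the sketch assumes every commit before the first bad one is good, which the interdependent commits mentioned in this thread may violate.

```shell
#!/bin/sh
# Sketch: find the first bad entry in an ordered range by halving.
# Assumes entries 1..N are ordered oldest to newest, everything before
# the first bad entry is good, and entry N is known to be bad.

# Hypothetical test hook, driven by FIRST_BAD for demonstration only.
is_bad() {
    [ "$1" -ge "$FIRST_BAD" ]
}

first_bad() {
    lo=1        # lowest index that may still be the first bad one
    hi=$1       # highest index, known to be bad
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if is_bad "$mid"; then
            hi=$mid               # first bad entry is at mid or earlier
        else
            lo=$(( mid + 1 ))     # first bad entry is after mid
        fi
    done
    echo "$lo"
}

# Simulate 46 candidate commits with the regression at number 29.
FIRST_BAD=29
first_bad 46
```

With 46 candidates this needs roughly six test builds rather than 46. With git itself, the equivalent session is git bisect start, git bisect bad on the broken revision, git bisect good on the known-good one, and then marking each checked-out revision good or bad after testing.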
Super duper if that function is already implemented. The question is do you have time to do the bisecting? If you can locate the problematic commit, we can write to the developer responsible for it, because as far as I can see, not much attention was given to this bug although they already started testing "release candidates" for the next version of the intel driver. Sorry I can't be of more help right now, I don't have much experience in building the driver by myself from git. But if you can find the problematic commit I can remove it from the testing driver packages already built for the Ubuntu distribution that I'm using and do the testing.
Locating the bug by bisecting (manually) is not working for me due to another serious bug present between commit dbb68168dc909ab2ec1d935322c3fd8581e666f1 and commit 37f631d669c165c4fb56ccd7a6fc0a432f453b52. Tobias, can you try to trigger the bug with a driver containing commits up to and including commit 47416b1eea09b238a997636d35998d71e0d18161 (i965: Maximum number of vertices per composite is 24, not 18)? I've been using this build for 3 hours with no crashes so far, except for the freeze I'm getting with full screen flash videos (which is yet another bug).
Sorry, but I have no time to debug this bug. Bisecting would take an immense amount of time, since each bisection step requires at least some hours of testing with different applications (in case the driver doesn't crash immediately). The other problem you already mentioned yourself: it's not guaranteed that a bisection step compiles cleanly, or that it doesn't introduce another serious issue. I only do bisection for wine; mesa and most X components are hell to bisect. For now I'm going to remain at 2.9.1 till Chris shows up and gives me new instructions. I hope the GPU dump gives the intel devs some clue as to which commit could be the culprit. Once we have a list of possible commits I can do tests by reverting the particular commits.
None of the dumps so far actually capture the erroneous batch buffer. I have a patch in the works that should capture the error better, and once I've returned to civilisation (and a reliable internet connection) I'll make it available and hopefully it will find the culprit causing these bugs. Thanks for your patience and perseverance.
I have noticed reverting libdrm to a 11-25-2009 checkout fixes the problem here. The problem started somewhere between the commits on 11-30 and the 2.4.16 release on 12-03. In my case browsing a directory with about 800 video thumbnails in nautilus triggers a crash reliably. Scrolling in chrome/firefox also seems to be a common trigger for it.
Created attachment 32257 [details] intel_gpu_dump - not original reporter
Acer Aspire One AOA150 intel_gpu_dump during a hang, in case it helps. Versions at the time of this dump:
Ubuntu 10.04 Lucid
xserver: master 1.7.99.2 20091217 checkout at 0cb638dc
libdrm: master 2.4.17 20091221 checkout at fdb33d56
mesa: master 7.8.0 20091221 checkout at 71678a7
kernel: drm-intel-next up to commit cf74ecbbff3e3b45bae61d28d2220f74d853e2f0 (drm/i915: remove render reclock support)
Also happens with stock 2.6.32, and 2.6.33-rc1.
Xorg.0.log:
[14925.959682] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[14925.960043] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[14925.961466] (EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(repeat)
dmesg:
[ 5490.524107] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 5490.524133] render error detected, EIR: 0x00000000
[ 5490.524316] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 483783 at 483781)
[ 5490.526299] [drm:i915_gem_execbuffer] *ERROR* i915_gem_do_execbuffer returns -5
(repeat)
intel_gpu_top shows Sampler Cache, Filtering, Bypass FIFO, Pixel shader and Color calculator at 100% during a hang.
(In reply to comment #32) > [ 5490.526299] [drm:i915_gem_execbuffer] *ERROR* i915_gem_do_execbuffer returns > -5 Sorry, I just noticed it does not give the Execbuff while wedged error with this drm-intel-next based kernel and instead says this now in place of it, but the crash seems identical otherwise in both cases. I'll upload a dump from a 2.6.32 kernel instead when I crash next if it would be more useful.
Created attachment 32264 [details] intel_gpu_dump 2 - not original reporter (In reply to comment #32) New dump, same versions and error messages as in comment #32
Narrowed it down to two commits at least, after a few days of bisecting. The libdrm that has actually been stable is 2.4.16 with these two commits reverted:
intel: Check and propagate errors from building reloc-tree
792fed1e2460f96459141b5a628dd5ab4fbb87db
intel: Repeat execbuffer after EINTR
b73612e4fd69565aa2c5c2e9677f3e0af1945f7d
Reverting just the 792fed one didn't fix it, but I've had 31 hours uptime with both reverted. I haven't tried with just b73612 reverted yet.
8 hours in on 2.4.17 with just this commit reverted, I was *really* lucky to last this long without it reverted so it might just be this commit that was causing the problem. libdrm with both of those commits I mentioned in the last comment reverted was 100% stable though after 3 days uptime. intel: Repeat execbuffer after EINTR b73612e4fd69565aa2c5c2e9677f3e0af1945f7d
(In reply to comment #36) > 8 hours in on 2.4.17 with just this commit reverted, I was *really* lucky to > last this long without it reverted so it might just be this commit that was > causing the problem. libdrm with both of those commits I mentioned in the last > comment reverted was 100% stable though after 3 days uptime. > > intel: Repeat execbuffer after EINTR > b73612e4fd69565aa2c5c2e9677f3e0af1945f7d > Robert if you applied the patch in libdrm 2.4.17+git20091230.c5c503b5-0ubuntu0sarvatt~karmic I can tell you it's NOT working. I upgraded libdrm and tried with the latest intel driver from xorg edgers and Xorg crashed again (while playing some flash game on facebook). Chris Wilson where the hell are you? Next time you decide to travel somewhere, make sure they have good internet :)
(In reply to comment #36) > 8 hours in on 2.4.17 with just this commit reverted, I was *really* lucky to > last this long without it reverted so it might just be this commit that was > causing the problem. libdrm with both of those commits I mentioned in the last > comment reverted was 100% stable though after 3 days uptime. > > intel: Repeat execbuffer after EINTR > b73612e4fd69565aa2c5c2e9677f3e0af1945f7d Hmm. That's extremely unlikely to be the cause, it might just be conceivable that somebody is doing something extremely fishy in a signal handler that is being run when EINTR is being provoked. But, seriously, this looks like a wild goose chase.
Indeed I eventually crashed about 14 hours in with just that revert, and 792fed1e2460f96459141b5a628dd5ab4fbb87db doesn't revert cleanly from 2.4.17. It's hard to bisect because the first 3/4ths of the commits from 11-30 to 12-03 have other freezing problems on their own in that I can crash it loading a large image in firefox so it's hard to have enough uptime to actually get this crash. 2.4.16 with those 2 commits reverted was stable for 3 days though, and I haven't had more than 24 hours uptime since 2.4.16 was released because of this crash.
Bisections sound good in theory, but in practice they are not working, at least not in this case, most probably because some of the commits touch the same code. I am not sure how graphics card driver development works, but from what I've seen so far it's a devil's business and it's incredibly complex. How come these bugs don't show up on the developers' machines, and yet they show up when we use the driver? Isn't it possible to build testing applications that will exercise different portions of the driver code systematically and in one pass when we install the driver on our machines (and run the testing application)? I even wonder why the "http://intellinuxgraphics.org/how_to_report_bug.html" link is needed. Can't this information gathering be automated by running a single script? Sorry for this little rambling, I know this is a place for bug reporting and not for discussions of development models :)
Created attachment 32455 [details] [review] Record batch buffer at time of error This is the kernel patch that I hope Eric will accept into drm-intel-next that should capture the batch buffer that is triggering this error. After applying this patch (and the hang occurs) can you upload the contents of /sys/kernel/debug/dri/0/i915_error_state, please?
I've been trying to trigger this bug with the drm-intel-next kernel from 2010-01-06 and had no success after using my system for two and a half days (with a few system suspends during that time). Today I tried with the vanilla 2.6.32.3 kernel and the bug reappeared after only 3 minutes. I am not even sure if the patch was included in the drm-intel-next kernel from the 6th of January. Now I'm trying with a drm-intel-next kernel built today (8th of January). If the bug does not reappear while using the drm-intel-next kernel, is that because this bug is present in the kernel and not in the intel driver? Any suggestions, Chris?
Petar, that would be a logical conclusion if drm-intel-next proves stable. (At the very least it means that -intel is doing something that previous kernels struggle with, something we should be wary of.) I'm at a loss to think of which recent kernel changes may have impacted -intel. And for the record, I've not yet written a version of the batch buffer capture patch that satisfies Eric, so it is not yet included in drm-intel-next. Thanks for your patience and continued testing.
Too bad your patch didn't get into drm-intel-next. So the following info might be totally unusable to you, but I am giving it for reference because it is somewhat different (well, at least the dmesg error text).
Linux kernel: drm-intel-next 2.6.33-997 (built on the 8th of January)
Libdrm: 2.4.17+git20091230.c5c503b5
xserver-xorg-video-intel: 2.10.0 git20100104.09103514
x.org: 7.4
When this bug is triggered dmesg shows:
[drm:i915_gem_execbuffer] *ERROR* i915_gem_do_execbuffer returns -5
This is the output of /sys/kernel/debug/dri/0/i915_error_state:
Time: 1262991388 s 668451 us
EIR: 0x00000000
PGTBL_ER: 0x00000000
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x54f00006
INSTDONE: 0x7fc00081
ACTHD: 0x0593c138
It seems that Santa was protecting my computer from freezing for 2 days, but now he is gone. Got the X.org freeze with drm-intel-next 2.6.32-997 from the 6th of January too. Same dmesg message, similar /sys/kernel/debug/dri/0/i915_error_state output:
Time: 1262992580 s 316716 us
EIR: 0x00000000
PGTBL_ER: 0x00000000
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x54f00006
INSTDONE: 0x7fc00081
ACTHD: 0x0e1f6138
Once again going back to the rock solid (I hope) xserver-xorg-video-intel_2.9.0+git20091111.dbb68168 :)
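
The two i915_error_state dumps above agree on every register and differ only in the timestamp and ACTHD. When comparing dumps from several hangs, a small helper can pull out single registers. This is a sketch under assumptions: error_state_field is a made-up name, and it assumes the file holds one "NAME: value" pair per line, possibly indented, as the pasted dumps suggest.

```shell
#!/bin/sh
# Sketch: extract one register value from an i915_error_state dump.
# Assumes one "NAME: value" pair per line, optionally indented.

error_state_field() {
    # $1 = register name (for example IPEHR), $2 = file holding the dump
    awk -v name="$1" -F': *' '
        { sub(/^[ \t]+/, "") }          # drop leading indentation
        $1 == name { print $2; exit }   # print the value of the first match
    ' "$2"
}

# Demonstration with the values pasted in this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Time: 1262991388 s 668451 us
EIR: 0x00000000
PGTBL_ER: 0x00000000
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x54f00006
INSTDONE: 0x7fc00081
ACTHD: 0x0593c138
EOF
for reg in IPEHR INSTDONE ACTHD; do
    printf '%s=%s\n' "$reg" "$(error_state_field "$reg" "$sample")"
done
rm -f "$sample"
```

On a live system the second argument would be /sys/kernel/debug/dri/0/i915_error_state, read right after the hang.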
Robert Hooker do you have any connections with Ubuntu kernel builders? If you do, can you somehow persuade them to build a kernel package of drm-intel-next that includes Chris Wilson's patch? Thanks in advance :)
Created attachment 32701 [details] latest intel_gpu_dump output
Sorry, I was busy so I couldn't report back sooner. The issue still exists with latest libdrm, mesa and xf86-video-intel git master. I'm using vanilla sources 2.6.32.3 with Chris' patch applied.
$ cat i915_error_state
Time: 1263837225 s 979362 us
EIR: 0x00000010
PGTBL_ER: 0x00000010
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x00000000
INSTDONE: 0x03ffffc0
ACTHD: 0x00000000
I'm also going to attach the new output of intel_gpu_dump.
i'm affected by this, and so are people from http://bugzilla.kernel.org/process_bug.cgi
i first thought it was a kernel issue (i'm not that experienced with debugging). then i started testing different versions of xf86-video-intel and came to report here.
a reliable way to trigger this bug is to enable compiz and set its resize method to "normal" (through compiz config settings manager); it should then update window contents on the fly while resizing. open chromium or firefox and start resizing like crazy. it takes a few minutes to trigger.
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
xf86-video-intel 2.9.1 works.
xf86-video-intel 2.9.99.901 does not.
tried to bisect, but half of the commits between those versions break the build, or Xorg badly.
libdrm 2.4.17
I think Tomas meant: http://bugzilla.kernel.org/show_bug.cgi?id=15004 (using my comment as an excuse to add me to CC :p)
(In reply to comment #48) > a reliable way to trigger this bug is to enable compiz and setup its resize > method as "normal"(through compiz config settings manager) it should upgrade > windows contents on the fly while resizing. I think you have a different bug, since the original report is from a system not using a compositing manager. The issue with driver bugs is that they all have superficially identical symptoms and we need to catch the driver in the act of crashing the hardware to be able to find and distinguish between bugs. To grab a dump follow the instructions at http://intellinuxgraphics.org/how_to_report_bug.html . One issue faced by the reporters of the original bug is that the batchbuffer has been cleared from the lists prior to dumping the GPU contents, so we have not yet been able to identify the cause.
(In reply to comment #50) > (In reply to comment #48) > > a reliable way to trigger this bug is to enable compiz and setup its resize > > method as "normal"(through compiz config settings manager) it should upgrade > > windows contents on the fly while resizing. > > I think you have a different bug, since the original report is from a system > not using a compositing manager. The issue with driver bugs is that they all > have superficially identical symptoms and we need to catch the driver in the > act of crashing the hardware to be able to find and distinguish between bugs. > To grab a dump follow the instructions at > http://intellinuxgraphics.org/how_to_report_bug.html . One issue faced by the > reporters of the original bug is that the batchbuffer has been cleared from the > lists prior to dumping the GPU contents, so we have not yet been able to > identify the cause. > i will try to catch a gpu dump, but im almost certain its the same bug. i said i could trigger this easily with compiz's resize plugin. but the issue is present with metacity and no 3d rendering whatsoever (this is the first test i did, weeks ago). sorry if i missed this bit out. waiting for the driver to trip takes a LOT of time, this is why i posted a reliable way to trigger this.
Created attachment 32719 [details] Tomas M. intel gpu dump
(In reply to comment #50)
> One issue faced by the
> reporters of the original bug is that the batchbuffer has been cleared from the
> lists prior to dumping the GPU contents, so we have not yet been able to
> identify the cause.
>
Huh? I thought your kernel patch was supposed to solve this issue, or what is it for?
The information that the patch captures will be in i915_error_state. You will in fact need an updated patch to capture the batchbuffer on an i8xx (since BBADDR didn't exist for that generation of GPUs).
Created attachment 32723 [details] [review] Record batch buffer at time of error [v2]
OK, so my information from comment #47 should suffice, right? Since my hardware is i915-based.
Hmm, that i915_error_state would imply that the active list was indeed empty at the time that the error was captured which as you can imagine is not exactly a lot of information as to the cause of the error.
So what should I do now? This is what I get after the X output gets stuck.
I'm getting the following log on 945gm (very easy to reproduce with kde 4.4rc1):
Jan 16 20:09:23 anarsoul-laptop kernel: [86786.087142] [drm:i915_gem_object_pin] *ERROR* Failure to install fence: -28
Jan 16 20:09:23 anarsoul-laptop kernel: [86786.168476] [drm:i915_gem_object_pin] *ERROR* Failure to install fence: -28
Jan 16 20:09:23 anarsoul-laptop kernel: [86786.168481] [drm:i915_gem_execbuffer] *ERROR* Failed to pin buffer 19 of 26, total 16060416 bytes: -28
Jan 16 20:09:23 anarsoul-laptop kernel: [86786.168486] [drm:i915_gem_execbuffer] *ERROR* 1039 objects [22 pinned], 153628672 object bytes [27529216 pinned], 28577792/260308992 gtt bytes
Jan 16 20:09:25 anarsoul-laptop kernel: [86787.832115] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jan 16 20:09:25 anarsoul-laptop kernel: [86787.832130] render error detected, EIR: 0x00000000
Jan 16 20:09:25 anarsoul-laptop kernel: [86787.832134] i915: Waking up sleeping processes
Jan 16 20:09:25 anarsoul-laptop kernel: [86787.832186] reboot required
Jan 16 20:09:25 anarsoul-laptop kernel: [86787.832206] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 14478201 at 14478197)
Jan 16 20:09:25 anarsoul-laptop kernel: [86787.832668] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
Is it the same bug? Should I file another bug report?
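
To compare kernel logs from different reporters, it can help to tally the i915 error signatures that have appeared in this thread. The sketch below is only that: count_i915_errors is a made-up helper, the pattern list is taken from the log excerpts pasted here and is not exhaustive, and matching counts do not prove that two hangs share a cause. For reference, -5 is EIO (the "Input/output error" in the Xorg log) and -28 is ENOSPC.

```shell
#!/bin/sh
# Sketch: count known i915 error signatures in a saved kernel log.
# The patterns come from the log excerpts quoted in this thread.

count_i915_errors() {
    # $1 = file holding the kernel log (for example saved dmesg output)
    for pattern in \
        'Hangcheck timer elapsed' \
        'Execbuf while wedged' \
        'returns -5' \
        'Failure to install fence: -28'
    do
        # grep -c prints 0 when nothing matches; -F treats the pattern literally
        printf '%s\t%s\n' "$(grep -c -F -- "$pattern" "$1")" "$pattern"
    done
}
```

A typical use would be to save the log with dmesg > kern.txt and then run count_i915_errors kern.txt; each output line is a count, a tab, and the signature.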
(In reply to comment #55)
> Created an attachment (id=32723) [details]
> Record batch buffer at time of error [v2]
>
hi chris, if you still believe it's a different bug, please say so and i will open a new bug. i tried your v2 patch in order to collect more info, but 2.6.33-rc4 with this patch and xf86-video-intel 2.9.99 produces a hard lock:
- ssh: no route to host
- sysrq-REISUB not responding
- external VGA display shuts off, internal LVDS screen frozen
the only way to power off is to keep the power button pressed for a few seconds, so i could not collect any info :( tried this twice just to be sure.
Chris, do you need any additional info?
Created attachment 32829 [details] i915_error_state
I've got this error state with the first version of the patch; the second doesn't apply to 2.6.32.6.
It seems A17-affected machines (usually those who use 2 memory modules of same size) are not affected by this bug :)
(In reply to comment #63) > It seems A17-affected machines (usually those who use 2 memory modules of same > size) are not affected by this bug :) > i cannot test this since my laptop has one so-dimm slot only. :( will try and test the first version of the patch.
(In reply to comment #63) > It seems A17-affected machines (usually those who use 2 memory modules of same > size) are not affected by this bug :) > Sorry for this stupid question, are you referring to machines with two RAM modules? Mine has two slots, each holding 1GB of RAM and I'm hit by the bug. Then again as far as I know there are different BIOS settings about the method and portion of the RAM that is going to be used as a video memory. Can you be more specific about your comment?
Created attachment 32836 [details] got this with the first patch. and kernel 2.6.32.6
(In reply to comment #65)
> Sorry for this stupid question, are you referring to machines with two RAM
> modules? Mine has two slots, each holding 1GB of RAM and I'm hit by the bug.
> Then again as far as I know there are different BIOS settings about the method
> and portion of the RAM that is going to be used as a video memory. Can you be
> more specific about your comment?
Uh, then it's something else... My friend didn't hit this bug, and he has the same software (gentoo ~x86); the only difference is in RAM: he has 2x1gb, and my machine has 512mb+2gb.
A17-affected machines are machines with 2 memory modules of the same size (in this case memory will work in interleaved mode).
Quote from i915_gem_tiling.c:
On mobile 9xx chipsets, channel interleave by the CPU is determined by DCC. For single-channel, neither the CPU nor the GPU do swizzling. For dual channel interleaved, the GPU's interleave is bit 9 and 10 for X tiled, and bit 9 for Y tiled. The CPU's interleave is independent, and can be based on either bit 11 (haven't seen this yet) or bit 17 (common).
back on the issue: since the error logs and dumps of the GPU don't provide any additional info on what is causing this, and half of the commits from 2.9.1 to 2.9.99 break the build or xorg itself (making a bisection quite difficult), is there a way to determine a list of commits that could be introducing this regression? in short: is there a dev willing to provide a list of suspects? i'm willing to test them all out.
Adding some bits of info: My system also is using both RAM slots (2 x 1GB) and yes, I have the GPU hang issues.
I can't confirm the ram fact. My Thinkpad X41 is using a 512mb and 1gb ram module and I'm also affected by this annoying bug.
(In reply to comment #33) > (In reply to comment #32) > > > [ 5490.526299] [drm:i915_gem_execbuffer] *ERROR* i915_gem_do_execbuffer returns > > -5 > > Sorry, I just noticed it does not give the Execbuff while wedged error with > this drm-intel-next based kernel and instead says this now in place of it, but > the crash seems identical otherwise in both cases. I'll upload a dump from a > 2.6.32 kernel instead when I crash next if it would be more useful. BTW, back when it said "Execbuff while wedged" I could recover the GPU by doing a suspend-resume cycle on my X41. After the resume the kernel would say something along the lines of "VT switch, reenabling wedged GPU even though this might not work" and after that it would recover. With newer kernels that say "i915_gem_do_execbuffer returns -5" I can only reboot to recover...
(In reply to comment #19) > I wonder why is there so little activity for this bug. I am getting > at least 10 freezes a day. I'm also seeing this, but more like 0-3 freezes a day. Most recently I upgraded to the git version of xf86-video-intel, but I still see it. When I first started using KMS everything worked fine until most likely a newer xf86-video-intel version entered Debian/testing. Though for me this bug is independent of KMS (in contrast to https://bugs.freedesktop.org/show_bug.cgi?id=26346), I also get these GPU hangs without using KMS.
Created attachment 32943 [details] dmesg, gpu_dump output and Xorg log - not original reporter
Has anyone already tested latest drm-intel-next git (http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git;a=shortlog;h=drm-intel-next)?
(In reply to comment #74)
> Has anyone already tested latest drm-intel-next git
> (http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git;a=shortlog;h=drm-intel-next)?
>
I did. Xorg crashed again.
dmesg:
...
[38665.264015] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[38665.264026] render error detected, EIR: 0x00000000
[38665.264051] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 6436471 at 6436463)
/sys/kernel/debug/dri/0/i915_error_state:
Time: 1265058362 s 761303 us
EIR: 0x00000000
PGTBL_ER: 0x00000000
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x54300004
INSTDONE: 0x7fc44081
ACTHD: 0x0360dbbc
Also I'm sending another gpu_dump.
Created attachment 32980 [details] Intel_gpu_dump output linux-image-2.6.33-997-generic_2.6.33-997.201001301338 xserver-xorg-video-intel 2.10.0+git20100127.918151a7
FWIW I've found out that I can flip /sys/kernel/debug/dri/0/i915_wedged back to 0 by writing to it. If I do that after a suspend to RAM I can continue using the system without having to reboot. In case X11 didn't terminate because it got a failed system call, I can even continue the X11 session without problems.
#!/bin/sh
sync
chvt 1
echo mem > /sys/power/state
echo 0 > /sys/kernel/debug/dri/0/i915_wedged
chvt 7
Maybe that helps someone else suffering from this bug...
Regarding the RAM, I'm also using an X41 (Tablet) with one 512 MB and one 1 GB RAM module.
00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
-- xf86-video-intel: git 20100205 1a76fa5574e8e8f88ac3518a4e4494e1af301dc1
-- xserver: 7.4 (Debian/testing)
-- libdrm: git 20100205 1802e1a4e747b5906d3af10c4a53fd457eddcbb4
-- kernel: 2.6.33-rc6
-- Linux distribution: Debian/testing
-- Machine or mobo model: IBM X41 Tablet 1866CTO
-- Display connector: LVDS
Hi! Your hint looks promising and I want to test it. My problem is that I don't have an entry "i915_wedged" in the /sys/kernel/debug/dri/0/ path. I have a lot of other entries there, but not this one. Is it only available in newer kernel versions? Currently I'm using vanilla-sources 2.5.32. Which version do you use? Greetings Christian
Short correction :-) > Currently I'm using vanilla-sources 2.5.32 It's kernel version 2.6.32
I'm using 2.6.33-rc6. Note that in some earlier kernel version (2.6.32? 2.6.31?) the kernel would automatically unwedge after resume for me. So maybe for you it might be sufficient to just do a suspend-resume cycle.
I upgraded to kernel-2.6.33-rc6 and your "hotfix" seems to be working! Thanks a lot!! Finally I can use my laptop as a laptop again (suspend is working!!!!!). On kernels <= 2.6.32 the suspend-resume cycle didn't fix the problem for me.
I have two numbered entries under /sys/kernel/debug/dri. I don't know if this is hardware dependent, but I have to set them both to 0 so that my system is stable after a suspend/resume cycle. I'm now running with an uptime of 10 hours and have suspended about 4 times (but none of the suspend cycles was needed because of the bug).
The only bugs left here:
- Sometimes the screen flickers (especially when I play video files). Switching to another VT and back solves the problem. This behaviour occurs about once an hour.
- Sometimes there are ugly font issues. Characters don't get rendered correctly. If a character like "a" gets only half rendered, all other "a" characters on the screen have the same issue. I already had this problem in an earlier svn version and don't know what is causing it or how to fix it. I can make a screenshot if it helps solve the problem.
Greetings Christian
(In reply to comment #81)
> I'm using 2.6.33-rc6.
> Note that in some earlier kernel version (2.6.32? 2.6.31?) the kernel would
> automatically unwedge after resume for me. So maybe for you it might be
> sufficient to just do a suspend-resume cycle.
For me, there is a very reliable way to reproduce this bug: open the _Epiphany_ web browser and go to the following website: http://www.powerdeveloper.org/forums/index.php
Then click on the "Developers" forum. For me it sometimes even crashes before I click the "Developers" forum.
I'm using:
2.6.33-rc6
Xorg 1.7.4
Intel 2.10
libdrm 2.4.17
Hope that helps...
(In reply to comment #82)
> I have two numbered entries under /sys/debug/kernel/dri. I don't know if this
> is hardware dependent, but I have to set them both to 0 so that my system gets
> stable after a suspend / resume cycle. I'm now running with an uptime of 10
> hours and suspended about 4 times (but non of the suspend cycles was used
> because of the bug).
For me, if I set the entry in directory '0' to 0, the one in directory '64' is also set to 0.
> The only bugs left here:
> - Sometimes the screen flickers (especially when I play video files). Switching
I had that one often with Flash, but never with MPlayer. However, since updating to the git version of xf86-video-intel I don't think I have seen that bug. I still switched MPlayer to the backend scaler now that it is available again, since I'm hoping that will use less power than using the 3D engine to blit the video. It would be interesting to know whether that makes a measurable difference; I did not check that...
> - Sometimes there are ugly font issues. Characters don't get rendered
> correctly. If a character like "a" gets rendered only half, all other "a"
> characters on the screen have the same issue. I already had this problem in an
> earlier svn version and don't know what is causing this and how to fix it. I
> can make a screenshot if it helps solving the problem.
I also see that one, but for me it only started after upgrading to the git version of xf86-video-intel. With the Debian/testing version I didn't experience that particular bug.
I still have those though:
https://bugs.freedesktop.org/show_bug.cgi?id=26346
http://uguu.de/~ranma/intel_kms_bugs/
Just an addition to Tobias Dietrich's suspend hotfix: I have to set
echo 1 > /sys/kernel/debug/dri/0/i915_wedged
echo 1 > /sys/kernel/debug/dri/0/i915_wedged
before suspending. Otherwise suspend mode eats my battery ultra fast. Perhaps this is a ThinkPad X41 problem. After resuming I set the same value to 0 again. This works for me to keep the battery drain low.
Greetings Christian
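The wedge-before-suspend, unwedge-after-resume sequence described above could be wrapped up roughly as follows. This is only a sketch of the reported workaround; whether it is needed or safe on other hardware is not established. DRI_DEBUGFS, SUSPEND_CMD, set_wedged and suspend_wedged are hypothetical names introduced here, and the loop touches every i915_wedged entry under the dri directory instead of naming one.

```sh
#!/bin/sh
# Sketch only. Both variables are hypothetical overrides so the sequence
# can be dry-run without root and without actually suspending.
DRI_DEBUGFS="${DRI_DEBUGFS:-/sys/kernel/debug/dri}"
SUSPEND_CMD="${SUSPEND_CMD:-echo mem > /sys/power/state}"

# Write the given value (0 or 1) to every existing i915_wedged entry.
set_wedged() {
    for f in "$DRI_DEBUGFS"/*/i915_wedged; do
        if [ -e "$f" ]; then
            echo "$1" > "$f"
        fi
    done
    return 0
}

# Mark the GPU wedged, suspend, then clear the flag again after resume.
suspend_wedged() {
    set_wedged 1
    sync
    sh -c "$SUSPEND_CMD"
    set_wedged 0
}
```

Run as root, suspend_wedged would take the place of the plain "echo mem > /sys/power/state" step.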
(In reply to comment #85)
> Just an addition to Tobias Dietrich's suspend hotfix:
>
> I have to set
>
> echo 1 > /sys/kernel/debug/dri/0/i915_wedged
> echo 1 > /sys/kernel/debug/dri/0/i915_wedged
>
> before suspending. Otherwise suspend mode eats my battery ultra fast. Perhaps
> this is a thinkpad x41 problem. After resuming I set the same value to 0 again.
> This works for me to keep the battery drain low.
Hmm, FWIW I never tested battery drain during suspend. I don't think I disconnected AC power even once over the last month. (^^;
This bug also affects me. I described it in kernel bugzilla: http://bugzilla.kernel.org/show_bug.cgi?id=15188 For me, it occurs only if there was at least one suspend. Usually the bug triggers 0-2 times per day. I'm using latest kernel (2.6.32.7) and userspace (libdrm 2.4.17, xf86-video-intel 2.10.0, mesa 7.7, xorg-server 1.7.4).
Everyone whose bug is not fixed by this [libdrm], please open a new bug report:

commit 4f0f871730b76730ca58209181d16725b0c40184
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Feb 10 09:45:13 2010 +0000

    intel: Handle resetting of input params after EINTR during SET_TILING

    The SET_TILING is pernicious in that it overwrites the input arguments
    following an error in order to report the current tiling state of the
    buffer. This caught us by surprise as we then fed those arguments back
    into to the ioctl unmodified following an EINTR and so the kernel then
    reported success for the no-op. We interpreted this success as meaning
    that the tiling on the buffer had changed so updated our state and
    started using the buffer incorrectly in the new tiled/untiled manner.
    This lead to all sorts of random corruption and GPU hangs, even though
    the batch buffers would look sane (when the GPU had not wandered off
    into forbidden territory).

    References:
    Bug 25475 - [i915] Xorg crash / Execbuf while wedged
    http://bugs.freedesktop.org/show_bug.cgi?id=25475
    Bug 25554 - i830_uxa_prepare_access: gtt bo map failed: Input/output error
    http://bugs.freedesktop.org/show_bug.cgi?id=25554
    (And probably every other weird bug in the last few months.)

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
(In reply to comment #88)
> intel: Handle resetting of input params after EINTR during SET_TILING
> .....
I tested the problem with
- libdrm svn (with the mentioned commit included)
- intel driver svn (also tested it with 2.9.1)
- mesa 7.5.2
- xorg 7.4
and the problem still exists. But I get a different dmesg output when X is unusable again:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 22980 at 22976)
ath5k phy0: unsupported jumbo
Greetings Christian
I don't know if I should file a new bug report since the dmesg output changed.
Please file a new bug report for different errors, as the aforementioned commit fixes the original bug.
(In reply to comment #88)
> Everyone whose bug is not fixed by this [libdrm], please open a new bug report:
> commit 4f0f871730b76730ca58209181d16725b0c40184
Great. I've recompiled libdrm now. If I don't see any errors I'll close https://bugs.freedesktop.org/show_bug.cgi?id=26346, as I'm assuming it will be fixed by that commit.
Just to report back: I updated my kernel to vanilla-sources 2.6.32.8 and everything else (mesa, libdrm, xf86-video-intel) to git master and yeahh.. the issue seems to be gone. At least I haven't been able to trigger it yet :) Thanks again, Chris, for hunting this one down! Greets, Tobias
(In reply to comment #92)
> (In reply to comment #88)
> > Everyone whose bug is not fixed by this [libdrm], please open a new bug report:
> > commit 4f0f871730b76730ca58209181d16725b0c40184
>
> Great. I've recompiled libdrm now, if I don't see any errors I'll close
> https://bugs.freedesktop.org/show_bug.cgi?id=26346 as I'm assuming this will be
> fixed by that commit.
Okay, I haven't seen this bug since upgrading to that libdrm, so I consider this one fixed. Unfortunately the issues described at https://bugs.freedesktop.org/show_bug.cgi?id=26346 still persist.
(In reply to comment #94)
> (In reply to comment #92)
> > (In reply to comment #88)
> > > Everyone whose bug is not fixed by this [libdrm], please open a new bug report:
> > > commit 4f0f871730b76730ca58209181d16725b0c40184
>
> Okay, I haven't seen this bug since upgrading to that libdrm so I considered
> this one fixed. Unfortunately the issues described at
I did experience another GPU hang yesterday. :(
Apparently, while now much less frequent, it can still happen.
Unfortunately I was in a hurry at the time and didn't make a gpu dump.
I'll open a new bug for it when it happens again.
(In reply to comment #95)
> (In reply to comment #94)
> > (In reply to comment #92)
> > > (In reply to comment #88)
> > > > Everyone whose bug is not fixed by this [libdrm], please open a new bug report:
> > > > commit 4f0f871730b76730ca58209181d16725b0c40184
> >
> > Okay, I haven't seen this bug since upgrading to that libdrm so I considered
> > this one fixed. Unfortunately the issues described at
>
> I did experience another GPU hang yesterday. :(
> Apparently while now much less frequent it can still happen.
> Unfortunately I was in a hurry at the time and didn't make a gpu dump.
> I'll open a new bug for it when it happens again.
>
I have experienced one or two GPU hangs since this bug was closed, but this time I couldn't switch to the console to get dmesg or perform a GPU dump.
It's obvious that the bug reported HERE was FIXED.
I'm sure there are other bugs present, but once one of us figures out more specifically what's wrong, they should open a new bug report.
> > >
> > > I did experience another GPU hang yesterday. :(
> > > Apparently while now much less frequent it can still happen.
> > > Unfortunately I was in a hurry at the time and didn't make a gpu dump.
> > > I'll open a new bug for it when it happens again.
> > >
>
> I experienced one or two GPU hangs since this bug was closed but this time I
> couldn't switch to console in order to get dmesg or perform a GPU dump.
> It's obvious that the bug reported HERE was FIXED.
> I'm sure there are other bugs present, but once some of us figures out more
> specifically what's wrong, he should open a new bug report.
>
Yes, there is another similar bug crawling around. I've got a dmesg and two gpu dumps queued up for a report.
A way to trigger it (not so reliable) is to start xscreensaver and go through the list of screensavers, letting them preview in the small window (takes a while).
Enabling the video overlay feature of kernel 2.6.33 also triggers it after hours of usage.