After upgrading the kernel from 2.6.29 to 2.6.30 I found a problem with udev using a lot of CPU time. The problem triggers at random: sometimes right at boot, other times some minutes later. Once triggered, only killing udev stops it. It turned out to be a DRM problem where something gets stuck and sends events with every interrupt. The problem is also present on 2.6.31 (~rc4).
Reference LKML thread: http://lkml.org/lkml/2009/6/28/11
Hardware: Dell Studio desktop, Intel G45 chipset, using the integrated graphics.
An example dmesg with a debug patch applied can be found here: http://lkml.org/lkml/2009/7/22/313
And a test patch that proved to solve the problem here: http://lkml.org/lkml/2009/7/22/327
Created attachment 29047 [details] [review] Init outputs before IRQ status Does this patch also work? In order to avoid the spurious interrupts we're supposed to initialize the PEG band gap to the correct voltage...
I just tested from latest linus' git and with this latest patch and I do see the problem still. I had nothing in dmesg when the problem triggered since I didn't apply the debug patch provided some time ago. However, I did have an error that might or might not be related to the patch (I can just say that git from about a week ago didn't show this error):
[ 23.764976] kbuildsycoca4 used greatest stack depth: 6028 bytes left
[ 75.535185] [drm:i915_gem_execbuffer] *ERROR* Object f638f3c0 appears more than once in object list
[ 75.602287] [drm:i915_gem_execbuffer] *ERROR* Object f6160060 appears more than once in object list
[ 75.617678] [drm:i915_gem_execbuffer] *ERROR* Object f6160060 appears more than once in object list
[ 75.671562] [drm:i915_gem_execbuffer] *ERROR* Object f638f420 appears more than once in object list
[ 75.818337] [drm:i915_gem_execbuffer] *ERROR* Object f638f480 appears more than once in object list
[ 75.881663] [drm:i915_gem_execbuffer] *ERROR* Object f638f4e0 appears more than once in object list
[ 75.901713] [drm:i915_gem_execbuffer] *ERROR* Object f638f4e0 appears more than once in object list
[ 76.150701] [drm:i915_gem_execbuffer] *ERROR* Object f638f300 appears more than once in object list
[ 76.170042] [drm:i915_gem_execbuffer] *ERROR* Object f638f300 appears more than once in object list
[ 76.242343] [drm:i915_gem_execbuffer] *ERROR* Object f638f540 appears more than once in object list
[ 76.306192] [drm:i915_gem_execbuffer] *ERROR* Object f638f420 appears more than once in object list
[ 76.312373] [drm:i915_gem_execbuffer] *ERROR* Object f638f420 appears more than once in object list
[ 76.317394] [drm:i915_gem_execbuffer] *ERROR* Object f638f420 appears more than once in object list
[ 76.451448] [drm:i915_gem_execbuffer] *ERROR* Object f638f540 appears more than once in object list
[ 76.455672] [drm:i915_gem_execbuffer] *ERROR* Object f638f540 appears more than once in object list
[ 76.471548] [drm:i915_gem_execbuffer] *ERROR* Object f638f540 appears more than once in object list
[ 76.552370] [drm:i915_gem_execbuffer] *ERROR* Object f638f3c0 appears more than once in object list
[ 76.567279] [drm:i915_gem_execbuffer] *ERROR* Object f638f3c0 appears more than once in object list
[ 76.614682] [drm:i915_gem_execbuffer] *ERROR* Object f638f300 appears more than once in object list
[ 76.627649] [drm:i915_gem_execbuffer] *ERROR* Object f638f300 appears more than once in object list
[ 76.697832] [drm:i915_gem_execbuffer] *ERROR* Object f638f660 appears more than once in object list
[ 76.711030] [drm:i915_gem_execbuffer] *ERROR* Object f638f660 appears more than once in object list
[ 76.845556] [drm:i915_gem_execbuffer] *ERROR* Object f638f300 appears more than once in object list
[ 76.862304] [drm:i915_gem_execbuffer] *ERROR* Object f638f300 appears more than once in object list
[ 77.228811] [drm:i915_gem_execbuffer] *ERROR* Object f638f600 appears more than once in object list
[ 77.241402] [drm:i915_gem_execbuffer] *ERROR* Object f638f600 appears more than once in object list
[ 77.298431] [drm:i915_gem_execbuffer] *ERROR* Object f638f3c0 appears more than once in object list
[ 77.318427] [drm:i915_gem_execbuffer] *ERROR* Object f638f3c0 appears more than once in object list
[ 77.446953] [drm:i915_gem_execbuffer] *ERROR* Object f638f5a0 appears more than once in object list
[ 77.462289] [drm:i915_gem_execbuffer] *ERROR* Object f638f5a0 appears more than once in object list
[ 77.601891] [drm:i915_gem_execbuffer] *ERROR* Object f638f3c0 appears more than once in object list
[ 77.614703] [drm:i915_gem_execbuffer] *ERROR* Object f638f3c0 appears more than once in object list
[ 187.018764] kio_thumbnail used greatest stack depth: 5644 bytes left
Let me know if I should try something else with this patch or another.
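A long run of near-identical error lines like the one above is easier to read as a per-object tally. The following is a minimal sketch, not part of the original report: the function name `summarize` and the embedded three-line sample are illustrative, and the regular expression assumes the exact message wording shown in this dmesg excerpt.

```python
import re
from collections import Counter

# Matches the duplicate-object error lines as they appear in the excerpt above.
PATTERN = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+\[drm:i915_gem_execbuffer\] \*ERROR\* "
    r"Object ([0-9a-f]+) appears more than once in object list"
)


def summarize(dmesg_text):
    """Return (hits per object address, first timestamp, last timestamp)."""
    hits = PATTERN.findall(dmesg_text)
    counts = Counter(addr for _, addr in hits)
    times = [float(ts) for ts, _ in hits]
    first = min(times) if times else None
    last = max(times) if times else None
    return counts, first, last


# Illustrative sample: the first three error lines from the log above.
sample = """\
[ 75.535185] [drm:i915_gem_execbuffer] *ERROR* Object f638f3c0 appears more than once in object list
[ 75.602287] [drm:i915_gem_execbuffer] *ERROR* Object f6160060 appears more than once in object list
[ 75.617678] [drm:i915_gem_execbuffer] *ERROR* Object f6160060 appears more than once in object list
"""

if __name__ == "__main__":
    counts, first, last = summarize(sample)
    for addr, n in counts.most_common():
        print(addr, n)
    print("span:", first, "to", last)
```

Feeding it the full `dmesg` output instead of `sample` would show how many distinct objects are involved and over what time window the errors occur.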
Oh well, I guess it's not the PEG band voltage bug then... I'll ping the display guys and see what I can come up with.
Created attachment 29419 [details] [review] another debug patch Hopefully this one allows your monitor to come back?
Yes, this patch solves the problem. In fact, the second part of the patch is the same as the test patch that already proved to solve it some time ago, but it was not considered the right fix. I guess the first part of this last patch is what was missing to make the first one a real fix? Thanks.
(In reply to comment #5)
> Yes, this patch solves the problem. In fact, the second part of the patch is
> the same as the test patch that already proved to solve it some time ago, but
> it was not considered the right fix. I guess the first part of this last patch
> is what was missing to make the first one a real fix?

No, this one still isn't right (chipset guys will get back to me soon I hope). It was a test patch for what sounds like a related issue; I've had one report that although the hack patch solves the stuck interrupt issue it also prevents monitors from syncing again when they're turned off and back on again (all the while attached). Pretty weird, but possibly related to the hotplug quirks on G45.
Ah, ok, I didn't know about that monitor problem and I really can't confirm if it solves that problem. I was just talking about the uevents thing in my previous reply.
I pulled from git today (the soon-to-be 2.6.32-rc1) and I can't reproduce the problem anymore. I'll check to make sure I haven't done anything wrong, maybe with the .config (though DRI and everything is working fine), but it does look like it is fixed. I do see the [drm:i915_gem_execbuffer] errors posted above, but that seems unrelated to this issue. Any idea of what could have fixed it? All upcoming distros will ship with kernel 2.6.31, so it would still be nice to know what fixed it and be able to backport it, if possible.
Interesting... no I'm not sure what may have fixed it offhand. Would it be too much trouble to bisect it? There have been some fixes to somewhat related areas, but nothing that should directly affect the stuck hotplug interrupt afaik...
I've just pulled from git again and bad news: I can see the problem again. I don't know why it seemed to be fixed some days ago, but I probably did something wrong. Sorry about the false report. I'm not sure I'll have the time and knowledge to perform a bisect, but in case I can do it, would it be worth bisecting between .29 and .30 to find the commit that introduced the problem? Or is it not too relevant at this point?
No, you don't need to bisect for the bad commit; I think I know what it's related to. I was hoping you could bisect to find the *good* commit, but it sounds like there isn't one. :p When I get back from travelling I'll dig through all the hotplug errata (apparently there were many) and see if I can come up with a real patch for this.
Small update: I have connected my monitor using the HDMI connector on my card (with an HDMI-to-DVI adapter) and I don't get interrupts anymore. Not sure if this was expected, but thought I should report it just in case. Previously I was connecting my monitor through VGA. Let me know if you need any further info.
Well, after _days_ of using the computer for many hours with the monitor connected via HDMI, the bug has shown its face again. So using HDMI doesn't completely solve the problem, but it makes it much harder to trigger (before, it took just 5-10 minutes of usage).
Created attachment 30988 [details] [review] Check hotplug status bits Looks like we were checking the wrong bits in the interrupt handler. Can you give this patch a try?
Unfortunately it doesn't seem to help. First I tested on current git but DRI was not working for some reason, so while I couldn't reproduce the bug I thought it was not a good test. So then I tested the patch on 2.6.31.5 (it applied with a trivial change) and there I could reproduce the bug in a few minutes (using the VGA connector). I'll try to retest on current git again to be sure (if I find the reason why DRI didn't work).
Ok, the DRI problem was a stupid typo in the boot parameters, so now it booted fine and just playing a Tux Racer game made the problem show up (even on HDMI). So the patch really doesn't help :(
Created attachment 31000 [details] [review] Handle spurious interrupts Ok maybe we need to use both sets; if we get an interrupt on a port but the live bit isn't set we should disable the port.
Created attachment 31001 [details] [review] Handle spurious interrupts #2 Oops, last one had live vs. hotplug interrupts in the wrong order.
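The idea behind these two attachments (an interrupt on a port whose live bit is not set is treated as spurious, and that port is disabled) can be modelled in a few lines. This is only a toy sketch of the logic as described in the comments, not the attached patch: the bit positions, port names and the `handle_hotplug` function are all hypothetical and do not reflect the real i915 register layout.

```python
# Toy model only: these bit positions are hypothetical, not the real i915 layout.
INT_BIT = {"B": 1 << 0, "C": 1 << 1, "D": 1 << 2}    # interrupt pending on port
LIVE_BIT = {"B": 1 << 8, "C": 1 << 9, "D": 1 << 10}  # sink currently attached


def handle_hotplug(status, enabled):
    """Classify pending hotplug interrupts.

    status  -- integer holding both interrupt and live bits
    enabled -- set of port names whose hotplug interrupt is enabled
    Returns (genuine, still_enabled) as two sets of port names.
    """
    genuine = set()
    still_enabled = set(enabled)
    for port in sorted(enabled):
        if not status & INT_BIT[port]:
            continue  # nothing pending on this port
        if status & LIVE_BIT[port]:
            genuine.add(port)            # interrupt backed by a live sink
        else:
            still_enabled.discard(port)  # treated as spurious: stop listening
    return genuine, still_enabled


if __name__ == "__main__":
    # Port B fires with a live sink; port D fires with nothing attached.
    status = INT_BIT["B"] | LIVE_BIT["B"] | INT_BIT["D"]
    print(handle_hotplug(status, {"B", "C", "D"}))
```

In this model port D is dropped from the enabled set after its first unbacked interrupt, while port C, which never fired, stays enabled.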
Sorry for the bad news, this one didn't help either. I could trigger the interrupt storm within a few minutes of usage via VGA.
Created attachment 31133 [details] [review] IRQ debug patch Can you guys reproduce the problem with this patch applied and attach the output to this bug? I had some similar data awhile back but I lost it, and I need new theories so I want to see the initial problem data again. Thanks.
Created attachment 31146 [details] dmesg with HDMI stuck interrupts Here is a full dmesg with the patch applied. Let me know if you need further information.
I wonder if DP_D is supposed to be enabled on your system at all... Can you try the patchset at https://bugs.freedesktop.org/show_bug.cgi?id=22785? It may need a refresh, I'll ping the author.
Ok, I'll try those patches when the author posts the refreshed ones. One thing I noticed is that xrandr -q reports this:
VGA1 disconnected (normal left inverted right x axis y axis)
DVI1 connected 1680x1050+0+0 (normal left inverted right x axis y axis) 474mm x 296mm
   1680x1050 60.0*+
   1280x1024 75.0
   1024x768 75.1 60.0
   800x600 75.0 60.3
   640x480 75.0 60.0
   720x400 70.1
DP1 disconnected (normal left inverted right x axis y axis)
But the DVI1 that appears connected is in fact an HDMI one. When I connect through VGA it also reports that I have a DVI output (disconnected in that case), but this computer does not have DVI at all.
Hm ok, well maybe the child device patchset will help after all...
I've tested the child device patchset and it did work correctly in detecting my HDMI output as HDMI, plus not detecting a nonexistent DP (I posted about it on the bug report). However, that didn't change the situation regarding the interrupt storm. I could easily reproduce the problem by playing tuxracer :(
Do you know which outputs were detected? I'm thinking if DP_D wasn't created, we should also disable interrupts from that source rather than enabling all of them...
I didn't apply the debug patch in my last test, but from my Xorg.0.log:
(II) intel(0): Integrated Graphics Chipset: Intel(R) G45/G43
(--) intel(0): Chipset: "G45/G43"
(II) intel(0): Output VGA1 has no monitor section
(II) intel(0): Output HDMI1 has no monitor section
(II) intel(0): Output VGA1 disconnected
(II) intel(0): Output HDMI1 connected
(II) intel(0): Using exact sizes for initial modes
(II) intel(0): Output HDMI1 using initial mode 1680x1050
Also xrandr reports only a disconnected VGA1 and a connected HDMI1. Should I apply the last debug patch and send the logs? I was also going to try those two previous patches you posted here with the child device ones applied, since I thought that maybe they didn't work just because HDMI was being detected as DVI.
I tested the previous patches with the child device ones and it didn't work either.
Created attachment 31712 [details] [review] debug output init Can you apply this patch and attach the output from when you load with drm.debug=6? I'm hoping the DP output causing problems is ignored; if so I can fix up the hotplug code to handle that case.
This patch doesn't apply on top of 2.6.32 and I can't seem to find anything similar in the source code to apply it manually. What should I do to test it?
It should apply to Eric's drm-intel-next branch.
I'm trying to get something useful, but even assuming I built the kernel correctly from the drm-intel-next branch (at least the patch applied and the kernel does work), when I boot with drm.debug=6 and try to get the dmesg I just get this line repeated all the time:
[ 40.881495] [drm:i915_add_request], 2242
[ 40.886500] [drm:i915_add_request], 2243
...
I tried to get dmesg without starting X, but again it is flooded by this:
[ 32.375745] [drm:i915_driver_irq_handler], hotplug event received, stat 0x38200000
[ 32.376571] [drm:i915_driver_irq_handler], hotplug event received, stat 0x30200000
Any idea how to keep these messages from flooding the log so I can capture it from the start?
You could try drm.debug=4 instead, I think that'll dump fewer messages.
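The two hotplug status values in that log differ in a single bit, which is easier to see once they are broken down into bit positions. The sketch below is plain arithmetic with a hypothetical helper name (`set_bits`); it deliberately does not name the bits, since mapping a position to a specific port has to come from the driver's register definitions.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if value >> i & 1]


a = 0x38200000  # first "hotplug event received" status in the log
b = 0x30200000  # second status in the log

if __name__ == "__main__":
    print(set_bits(a))      # [21, 27, 28, 29]
    print(set_bits(b))      # [21, 28, 29]
    print(set_bits(a ^ b))  # [27]: the only bit that toggles between the two
```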
Created attachment 31761 [details] DRM debug log Yes, that worked. Here is the log with drm.debug=4.
Created attachment 31953 [details] [review] enable hotplug only for detected outputs I didn't include all the output I wanted, but I'm hoping this is what you were running into. This patch only enables hotplug detection for outputs we actually initialize, so should minimize the chance of getting interrupts for outputs that don't exist. I also found a note about DP_D in some recent that I'll check out, it could also be what you're hitting.
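The approach described here, enabling hotplug detection only for outputs that were actually initialized, amounts to building the interrupt enable mask from the detected outputs instead of switching on every source. The following is a minimal sketch of that idea under stated assumptions: the output names, the bit assignments in `HOTPLUG_EN` and the function `hotplug_enable_mask` are hypothetical and are not taken from the attached patch or from the real register layout.

```python
# Hypothetical enable bits, one per hotplug source; not the real register layout.
HOTPLUG_EN = {"VGA": 1 << 0, "HDMI": 1 << 1, "DP": 1 << 2, "SDVO": 1 << 3}


def hotplug_enable_mask(initialized_outputs):
    """OR together the enable bits of the outputs that were initialized."""
    mask = 0
    for name in initialized_outputs:
        mask |= HOTPLUG_EN[name]
    return mask


# "Enable everything": every known source can raise an interrupt.
ALL_SOURCES = hotplug_enable_mask(HOTPLUG_EN)

if __name__ == "__main__":
    detected = ["VGA", "HDMI"]  # the two outputs the reporter's machine has
    print(bin(ALL_SOURCES))
    print(bin(hotplug_enable_mask(detected)))
```

With only VGA and HDMI initialized, the DP bit stays clear in this model, so a source that does not exist on the board cannot raise hotplug interrupts.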
This one looks REALLY good. I've been trying for an hour to reproduce the problem by all means and I've been unable. The problem is 95% reproducible within 5-10 minutes, so I'm almost certain that this patch fixed it. Thanks! :) Anyway I'll keep testing tomorrow (it's late here) and report back with a 100% definitive answer.
Ok, so I've built the same kernel from drm-intel-next without the patch and there I can easily reproduce the problem by simply running glxgears. On the patched kernel there is no way to reproduce it, so now I'm certain that this patch fixes the problem here. Thank you for all the effort you put into solving this issue! Side note: in case this patch is a candidate for being backported, I wonder if it depends on the other patches that make my outputs correctly detected. Up to (and including) 2.6.32, 3 outputs are detected here: VGA, DVI and DP, but I just have VGA and HDMI outputs. In drm-intel-next (therefore 2.6.33, I assume), the outputs are detected correctly as VGA and HDMI. Just in case it matters. If you'd like me to test any backport or if you want me to send any logs from the patched latest kernel, please let me know.
Thanks a lot for testing and confirming. Yeah, it does depend on correct output detection, which is only present in git (so it'll land in 2.6.33). I'll post it for review now.
commit b01f2c3a4a37d09a47ad73ccbb46d554d21cfeb0 drm/i915: only enable hotplug for detected outputs Fix on its way upstream.
I have just upgraded to 2.6.33 hoping to finally leave this bug behind, but found that it's still there. However, it's probably just because the outputs are not correctly detected. This computer only has 2 outputs: VGA (not used) and HDMI (used). But this is what "xrandr -q" says:
VGA1 disconnected (normal left inverted right x axis y axis)
HDMI1 connected 1680x1050+0+0 (normal left inverted right x axis y axis) 474mm x 296mm
   1680x1050 60.0*+
   1280x1024 75.0 60.0
   1152x864 75.0
   1024x768 75.1 60.0
   800x600 75.0 60.3
   640x480 75.0 60.0
   720x400 70.1
DP1 disconnected (normal left inverted right x axis y axis)
And it's that detected (and therefore initialized) DP output which causes the trouble (or so is my understanding). At some point detection worked well with the drm-intel-next branch (detecting only the 2 existing ones), but with 2.6.33 it again detects a nonexistent DP. Any ideas? Should I open a new report for this?
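A quick way to spot a phantom output like this is to compare the connectors listed by `xrandr -q` against the ones physically present on the machine. This is a minimal sketch, not something from the report: the helper names `connectors` and `unexpected` are illustrative, the sample text is a shortened copy of the output quoted above, and the regular expression assumes the "NAME connected/disconnected" line format shown there.

```python
import re

# A connector line starts at column 0: "<name> connected" or "<name> disconnected".
CONNECTOR = re.compile(r"^(\S+) (connected|disconnected)\b", re.MULTILINE)


def connectors(xrandr_text):
    """Map each connector name reported by xrandr -q to its status."""
    return dict(CONNECTOR.findall(xrandr_text))


def unexpected(xrandr_text, physical_prefixes):
    """Return reported connectors whose name matches no physical output type."""
    prefixes = tuple(physical_prefixes)
    return sorted(n for n in connectors(xrandr_text) if not n.startswith(prefixes))


# Shortened copy of the xrandr -q output quoted above.
sample = """\
VGA1 disconnected (normal left inverted right x axis y axis)
HDMI1 connected 1680x1050+0+0 (normal left inverted right x axis y axis) 474mm x 296mm
   1680x1050 60.0*+
DP1 disconnected (normal left inverted right x axis y axis)
"""

if __name__ == "__main__":
    print(connectors(sample))
    print(unexpected(sample, ["VGA", "HDMI"]))  # ['DP1']
```

On a live system the text could come from running `xrandr -q` and capturing its standard output; the indented mode lines are ignored because they do not start at column 0.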
Yeah, please open a new one. Would be especially good if you could bisect where things went bad.
I guess no need to bisect. I found that the child device patches were reverted by another commit (6207937d4feea000913e8ca23fe20c7744be7847) because they caused trouble for other people. I posted on the relevant report (bug #22785) so I hope that Zhao Yakui can look into another solution.