Summary: | [HSW bisected]igt/module_reload causes [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44004 and system hang with headless, with power well disabled | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | lu hua <huax.lu> | ||||||||||||
Component: | DRM/Intel | Assignee: | Chris Wilson <chris> | ||||||||||||
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||||||||
Severity: | major | ||||||||||||||
Priority: | highest | CC: | pierpaolo.franco, przanoni, xunx.fang, yangweix.shui, yi.sun | ||||||||||||
Version: | unspecified | ||||||||||||||
Hardware: | All | ||||||||||||||
OS: | Linux (All) | ||||||||||||||
Whiteboard: | |||||||||||||||
i915 platform: | i915 features: | ||||||||||||||
Attachments: |
|
Hm, the dmesg output seems to be rather sparse. Have you checked that full debug output is enabled over the console? You can set this with dmesg -n 8 Note that the dmesg tool shipping with most distros is broken, and the upstream maintainer unwilling to fix it (I have no idea about the amount of fail going on there). If your dmesg tool rejects the above command please compile&run (as root) the below littel C tool: #include <stdio.h> #include <sys/syslog.h> #include <sys/klog.h> int main() { klogctl(8, NULL, 8); } If the dmesg with full debug enabled doesn't show more I guess we need the bisect result to move forward here. But since there's quite a few commits to test grabbing a new dmesg might be quicker. Created attachment 83705 [details]
dmesg with c tool
Hm, module reload should dump much more information when the dmesg verbosity is correctly set over netconsole. So I still think there's something wrong with your netconsole setup. Can you please check whether your netconsole setup works correctly (= you get the same output over netconsole as if you grab the dmesg output locally) on a machine where module_reload works? If that works please check the netconsole output of the affected hsw machine with another test which is know to put debug information into dmesg but doesn't crash the machine, like testdisplay. Also, can you please start to bisect this issue so that we don't get blocked on getting netconsole to work better? Bisect shows: bf51d5e2cda5d36d98e4b46ac7fca9461e512c41 is the first bad commit commit bf51d5e2cda5d36d98e4b46ac7fca9461e512c41 Author: Paulo Zanoni <paulo.r.zanoni@intel.com> Date: Wed Jul 3 17:12:13 2013 -0300 drm/i915: switch disable_power_well default value to 1 Now that the audio driver is using our power well API, everything should be working correctly, so let's give it a try. Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Can you please double-check the bisect by booting with i915.disable_power_well=0 on latest -nightly? In any case one for Paulo it seems ... (In reply to comment #6) > Can you please double-check the bisect by booting with > i915.disable_power_well=0 on latest -nightly? > > In any case one for Paulo it seems ... Add i915.disable_power_well=0, this issue goes away. Created attachment 84069 [details] [review] Fix for the bug Hi Can you please verify this patch fixes the bug? Please notice that this patch only fixes the "unclaimed register" messages. When running module_reload I still see many problems which are not fixed by this patch: - We get a gpu hang most of the time - Backlight doesn't work - Sometimes we get an oops - There's a big lockdep warning every time. Thanks, Paulo (In reply to comment #8) > Created attachment 84069 [details] [review] [review] > Fix for the bug > > Hi > > Can you please verify this patch fixes the bug? > > Please notice that this patch only fixes the "unclaimed register" messages. > When running module_reload I still see many problems which are not fixed by > this patch: > - We get a gpu hang most of the time > - Backlight doesn't work > - Sometimes we get an oops > - There's a big lockdep warning every time. > > Thanks, > Paulo I test this patch. This issue still exists. (In reply to comment #9) > (In reply to comment #8) > > Created attachment 84069 [details] [review] [review] [review] > > Fix for the bug > > > > Hi > > > > Can you please verify this patch fixes the bug? > > > > Please notice that this patch only fixes the "unclaimed register" messages. > > When running module_reload I still see many problems which are not fixed by > > this patch: > > - We get a gpu hang most of the time > > - Backlight doesn't work > > - Sometimes we get an oops > > - There's a big lockdep warning every time. > > > > Thanks, > > Paulo > > I test this patch. This issue still exists. Which issue specifically: the black screen or the "unclaimed register" messages? Can you please post the new dmesg output? Thanks, Paulo Created attachment 84128 [details]
dmesg with patch "Fix for the bug"
(In reply to comment #11) > Created attachment 84128 [details] > dmesg with patch "Fix for the bug" Please the full dmesg. What I did here to test was: - Boot the machine - SSH to it from another machine - Run ./module_reload from the SSH session - dmesg > dmesg.txt Does that work for you? Paulo, I add drm.debug=0xe to grub, the attached dmesg is full dmesg.
>
> - Boot the machine
> - SSH to it from another machine
> - Run ./module_reload from the SSH session
> - dmesg > dmesg.txt
>
> Does that work for you?
Because system hang, I use netconsole.
Do you have any suggest for more dmesg?
(In reply to comment #13) > Paulo, I add drm.debug=0xe to grub, the attached dmesg is full dmesg. > > > > > - Boot the machine > > - SSH to it from another machine > > - Run ./module_reload from the SSH session > > - dmesg > dmesg.txt > > > > Does that work for you? > > Because system hang, I use netconsole. > > Do you have any suggest for more dmesg? I really can't reproduce your bug. Can you please attach the full "dmesg" from right before you run module_reload? Please do it with drm.debug=0xe. Also please attach the output of "lsmod" from right before when you run module_reload. Also, please check if when you run module_reload you get some error messages about module i915 still being loaded. Also, please check whether you're just getting a backlight bug or a real full machine hang. I'm running drm-intel-nightly. (In reply to comment #11) > Created attachment 84128 [details] > dmesg with patch "Fix for the bug" The errors on this log are for registers that should still work even when the power well is disabled... I wonder if there's some issue between the Audio and Graphics drivers. I added drm.debug=0xe. # lsmod Module Size Used by dm_mod 58278 0 iTCO_wdt 4535 1 iTCO_vendor_support 1584 1 iTCO_wdt snd_hda_codec_realtek 32195 1 pcspkr 1675 0 i2c_i801 8342 0 lpc_ich 12616 0 mfd_core 2401 1 lpc_ich snd_hda_intel 24636 0 snd_hda_codec 98736 2 snd_hda_codec_realtek,snd_hda_intel snd_hwdep 5141 1 snd_hda_codec snd_pcm 66617 2 snd_hda_codec,snd_hda_intel snd_page_alloc 5906 2 snd_pcm,snd_hda_intel snd_timer 15318 1 snd_pcm snd 50129 6 snd_hda_codec_realtek,snd_hwdep,snd_timer,snd_pcm,snd_hda_codec,snd_hda_intel soundcore 4210 1 snd battery 10931 0 ac 3999 0 acpi_cpufreq 6339 0 i915 529852 1 video 11092 1 i915 button 4333 1 i915 drm_kms_helper 22788 1 i915 drm 195808 2 i915,drm_kms_helper mperf 1027 1 acpi_cpufreq freq_table 2092 1 acpi_cpufreq Ok, I reinstalled my OS and now I can reproduce the bug. I also did an analysis: The bug is not really caused by module reload, it's caused by the "echo 0 > /sys/class/vtconsole/vtcon1/bind" command that's part of the module_reload script. Since I couldn't reproduce the bug in my previous installation, I suspect the bug is dependent on which console driver is used. We get the problem even if the power well is enabled when we run the command. But if the power well was *never* disabled, we don't get the problem. This means that the condition to make the bug happen later is just a disable/enable cycle on the power well. Another thing I notice is that the endless flood of "unclaimed register" messages doesn't seem to be triggered by our driver. It seems something else is causing these messages. Maybe something is writing to VGA memory? If we disable the north display error interrupts, everything appears to work (except for a few error messages on dmesg). Hi I believe this patch series fixes the problem: http://lists.freedesktop.org/archives/intel-gfx/2013-September/033984.html Can you please test? Thanks, Paulo (In reply to comment #18) > Hi > > I believe this patch series fixes the problem: > http://lists.freedesktop.org/archives/intel-gfx/2013-September/033984.html > > Can you please test? > > Thanks, > Paulo Test these patches output: ERROR: Module i915 is in use WARNING: i915.ko still loaded! module successfully loaded again (In reply to comment #19) > (In reply to comment #18) > > Hi > > > > I believe this patch series fixes the problem: > > http://lists.freedesktop.org/archives/intel-gfx/2013-September/033984.html > > > > Can you please test? > > > > Thanks, > > Paulo > > Test these patches > output: > ERROR: Module i915 is in use > WARNING: i915.ko still loaded! > module successfully loaded again Are you sure you did the test correctly? Did you kill X and all the other DRM clients when doing this? If you repeat the test with i915.disable_power_well=0 do you also get the same problem? All conditions necessary to run the other intel-gpu-tools tests apply when running module_reload. > Are you sure you did the test correctly? Did you kill X and all the other > DRM clients when doing this? If you repeat the test with > i915.disable_power_well=0 do you also get the same problem? Run with i915.disable_power_well=0, also has same output info. > > All conditions necessary to run the other intel-gpu-tools tests apply when > running module_reload. Clean boot system, then test it. (In reply to comment #21) > > Are you sure you did the test correctly? Did you kill X and all the other > > DRM clients when doing this? If you repeat the test with > > i915.disable_power_well=0 do you also get the same problem? > > Run with i915.disable_power_well=0, also has same output info. > > > > > All conditions necessary to run the other intel-gpu-tools tests apply when > > running module_reload. > > Clean boot system, then test it. I have no idea what happens on your "clean boot system" configuration. Does the desktop environment run? Do you have X running? That "i915.ko still loaded!" message really suggests there are programs you need to kill before doing the tests. Please make sure your testing environment is correct before you try to run module_reload. (In reply to comment #22) > (In reply to comment #21) > > > Are you sure you did the test correctly? Did you kill X and all the other > > > DRM clients when doing this? If you repeat the test with > > > i915.disable_power_well=0 do you also get the same problem? > > > > Run with i915.disable_power_well=0, also has same output info. > > > > > > > > All conditions necessary to run the other intel-gpu-tools tests apply when > > > running module_reload. > > > > Clean boot system, then test it. > > I have no idea what happens on your "clean boot system" configuration. Does > the desktop environment run? Do you have X running? "clean boot system" means system boot up without X and the desktop environment running. > That "i915.ko still loaded!" message really suggests there are programs you > need to kill before doing the tests. Please make sure your testing > environment is correct before you try to run module_reload. Since Kernel 3.12.0rc2, after running "echo 0 > /sys/class/vtconsole/vtcon1/bind" and "rmmod intel_ips &> /dev/null", lsmod shows there are still 2 programs using i915 module on haswell(I check all platform, this only happens on haswell). It causes i915 module can not be unloaded. I don't know what are the 2 programs, and how to kill them. [root@x-hsw24 ~]# lsmod Module Size Used by snd_hda_codec_realtek 34049 1 snd_hda_codec_hdmi 27236 1 snd_hda_intel 25263 0 snd_hda_codec 100145 3 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_intel snd_hwdep 5094 1 snd_hda_codec snd_seq 41921 0 snd_seq_device 4581 1 snd_seq snd_pcm 66241 3 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec iTCO_wdt 4551 1 iTCO_vendor_support 1608 1 iTCO_wdt snd_timer 15514 2 snd_seq,snd_pcm snd 50169 9 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hwdep,snd_seq,snd_seq_device,snd_pcm,snd_timer serio_raw 3929 0 pcspkr 1699 0 microcode 7049 0 soundcore 4219 1 snd i2c_i801 8374 0 snd_page_alloc 5930 2 snd_hda_intel,snd_pcm lpc_ich 12608 0 mfd_core 2441 1 lpc_ich acpi_cpufreq 6299 0 freq_table 2068 1 acpi_cpufreq uinput 6676 0 ipv6 246159 43 i915 531119 2 drm_kms_helper 22993 1 i915 drm 199664 2 i915,drm_kms_helper button 4261 1 i915 video 10625 1 i915 dm_mirror 11024 0 dm_region_hash 5831 1 dm_mirror dm_log 7220 2 dm_mirror,dm_region_hash dm_mod 66376 2 dm_mirror,dm_log (In reply to comment #23) > (In reply to comment #22) > > That "i915.ko still loaded!" message really suggests there are programs you > > need to kill before doing the tests. Please make sure your testing > > environment is correct before you try to run module_reload. > Since Kernel 3.12.0rc2, after running "echo 0 > > /sys/class/vtconsole/vtcon1/bind" and "rmmod intel_ips &> /dev/null", lsmod > shows there are still 2 programs using i915 module on haswell(I check all > platform, this only happens on haswell). It causes i915 module can not be > unloaded. I don't know what are the 2 programs, and how to kill them. Could be the audio driver holding module references for the power well stuff. Can you please unload all sound drivers and see whether that gets rid of the references? Unloading snd_hda_intel module gets rid of the references. Thanks, Daniel. I verify Paulo's patch series fixes the bug. So I guess we need to patch module_reload to also "rmmod snd_hda_intel" (or whatever the correct module is). (In reply to comment #26) > So I guess we need to patch module_reload to also "rmmod snd_hda_intel" (or > whatever the correct module is). Yup, I think we need to add that. Lu Hua, can you please file a new bug report that module_reload doesn't work on Haswell? It's a regression, so top priority bug. (In reply to comment #27) > (In reply to comment #26) > > So I guess we need to patch module_reload to also "rmmod snd_hda_intel" (or > > whatever the correct module is). > > Yup, I think we need to add that. Lu Hua, can you please file a new bug > report that module_reload doesn't work on Haswell? It's a regression, so top > priority bug. The ugly thing about it, is that, at least on my machine, we need to "pkill alsactl" in addition to stopping gdm. Do we want this on the script? (In reply to comment #27) > (In reply to comment #26) > So I guess we need to patch module_reload to > also "rmmod snd_hda_intel" (or > whatever the correct module is). Yup, I > think we need to add that. Lu Hua, can you please file a new bug report that > module_reload doesn't work on Haswell? It's a regression, so top priority > bug. Note: Bug 70336 is traced for this. Ok, Paulo's patch to the module reload test has landed. It still happens on latest -nightly kernel. (In reply to comment #31) > It still happens on latest -nightly kernel. What exactly? The unclaimed register error or the failure to unload the module? We've had a bunch of bugs here ... Run ./drv_module_reload ,It doesn't have any output. dmesg: [ 122.577167] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to c400c [ 122.577224] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c [ 122.577322] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c [ 122.577422] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 44008 [ 122.577521] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c [ 122.577619] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c [ 122.577720] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 44008 [ 122.577817] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c [ 122.577876] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 4400c [ 122.577974] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44008 [ 122.578033] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 44008 [ 122.578131] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c [ 122.578189] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register before interrupt [ 122.578286] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 44008 [ 122.578400] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c [ 122.578526] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c [ 122.578648] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c [ 122.578729] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c [ 122.578852] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 4400c [ 122.578973] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44008 [ 122.579055] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c [ 122.579138] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c [ 122.579197] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 4400c [ 122.579321] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register before interrupt [ 122.579440] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44008 [ 122.579520] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44040 [ 122.579576] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to c400c The patches that fix this bug have been submitted a long time already. See comment #25. Daniel didn't seem to be too interested in merging them... I'll try to investigate the problem a little bit more so I can provide a more detailed/precise description of the problem and why the current patches fix it (because of the questions from the reviewers). But I'm not sure what I should do to fix this bug if we're not willing to merge the patches. Paulo, the link in comment #18 is dead, so I have no idea which patches we're still missing ... (In reply to comment #35) > Paulo, the link in comment #18 is dead, so I have no idea which patches > we're still missing ... I didn't think those archives were moving targets! That's really bad! http://lists.freedesktop.org/archives/intel-gfx/2013-September/033825.html (In reply to comment #36) > (In reply to comment #35) > > Paulo, the link in comment #18 is dead, so I have no idea which patches > > we're still missing ... > > I didn't think those archives were moving targets! That's really bad! > http://lists.freedesktop.org/archives/intel-gfx/2013-September/033825.html In case they get lost again: http://lists.freedesktop.org/archives/intel-gfx/2013-September/ Then search for: [PATCH 0/5] module_reload fixes I did some more investigation, and it seems the Kernel gets stuck on an infinite loop that keeps calling VGA. I added some printks, and I at the moment we do the "echo 0" to the bind file (In reply to comment #38) > I did some more investigation, and it seems the Kernel gets stuck on an > infinite loop that keeps calling VGA. I added some printks, and I at the > moment we do the "echo 0" to the bind file (accidentally hit the "Save Changes" button) I did some more investigation, and it seems the Kernel gets stuck on an infinite loop that keeps calling VGA functions. I added some printks, and at the moment we do the "echo 0" to the bind file, we keep printing this forever: [ 431.878277] === write_vga [ 431.878278] === vgacon_set_cursor_size [ 431.878297] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register before interrupt [ 431.878331] === vgacon_cursor [ 431.878332] === vgacon_restore_screen [ 431.878332] === write_vga [ 431.878333] === vgacon_set_cursor_size [ 431.878373] === vgacon_dummy [ 431.878373] === vgacon_scroll [ 431.878374] === vgacon_restore_screen [ 431.878374] === vgacon_scrollback_update [ 431.878435] === vga_set_mem_top [ 431.878435] === write_vga [ 431.878439] === vgacon_dummy [ 431.878440] === vgacon_scroll [ 431.878440] === vgacon_restore_screen [ 431.878440] === vgacon_scrollback_update [ 431.878501] === vga_set_mem_top [ 431.878502] === write_vga [ 431.878503] === vgacon_cursor [ 431.878503] === vgacon_restore_screen [ 431.878504] === write_vga [ 431.878505] === vgacon_set_cursor_size [ 431.878524] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register before interrupt [ 431.878559] === vgacon_cursor [ 431.878559] === vgacon_restore_screen [ 431.878559] === write_vga [ 431.878560] === vgacon_set_cursor_size [ 431.878601] === vgacon_dummy [ 431.878601] === vgacon_scroll Also, the function that triggers the unclaimed register is clearly vgacon_set_cursor_size. It looks like we're stuck in an infinite loop inside console_unlock. The code that seems to fix the problem is just: vga_get_uninterruptible(dev->pdev, VGA_RSRC_LEGACY_IO); val = inb(VGA_MSR_READ); outb(val & ~VGA_MSR_MEM_EN, VGA_MSR_WRITE); vga_set_legacy_decoding(dev->pdev, VGA_RSRC_LEGACY_IO | VGA_RSRC_NORMAL_IO |VGA_RSRC_NORMAL_MEM); vga_put(dev->pdev, VGA_RSRC_LEGACY_IO); (In reply to comment #39) > (In reply to comment #38) > > I did some more investigation, and it seems the Kernel gets stuck on an > > infinite loop that keeps calling VGA. I added some printks, and I at the > > moment we do the "echo 0" to the bind file > > (accidentally hit the "Save Changes" button) > > I did some more investigation, and it seems the Kernel gets stuck on an > infinite loop that keeps calling VGA functions. I added some printks, and at > the moment we do the "echo 0" to the bind file, we keep printing this > forever: > > [ 431.878277] === write_vga > [ 431.878278] === vgacon_set_cursor_size > [ 431.878297] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register > before interrupt > [ 431.878331] === vgacon_cursor > [ 431.878332] === vgacon_restore_screen > [ 431.878332] === write_vga > [ 431.878333] === vgacon_set_cursor_size > [ 431.878373] === vgacon_dummy > [ 431.878373] === vgacon_scroll > [ 431.878374] === vgacon_restore_screen > [ 431.878374] === vgacon_scrollback_update > [ 431.878435] === vga_set_mem_top > [ 431.878435] === write_vga > [ 431.878439] === vgacon_dummy > [ 431.878440] === vgacon_scroll > [ 431.878440] === vgacon_restore_screen > [ 431.878440] === vgacon_scrollback_update > [ 431.878501] === vga_set_mem_top > [ 431.878502] === write_vga > [ 431.878503] === vgacon_cursor > [ 431.878503] === vgacon_restore_screen > [ 431.878504] === write_vga > [ 431.878505] === vgacon_set_cursor_size > [ 431.878524] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register > before interrupt > [ 431.878559] === vgacon_cursor > [ 431.878559] === vgacon_restore_screen > [ 431.878559] === write_vga > [ 431.878560] === vgacon_set_cursor_size > [ 431.878601] === vgacon_dummy > [ 431.878601] === vgacon_scroll > That just looks like normal vgacon activity when stuff changes on the screen. The momement you unbind fbcon, vgacon takes over again. The But if the legacy resources aren't enabled you get the unclaimed register errors. Either we'd need to restore the hardware to a state where VGA actually works, or vgacon shouldn't be allowed to rebind after someone has clobbered the state. I'd go for the latter since it seems easier. Maybe there was a patch already somewhere. Also restoring the hardware state could happen only at driver unload time, and by that time vgacon has already kicked in. So even in that case we'd need to prevent vgacon from taking over before the hardware state has been restored. This solves the problem: diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index a28347d..ea5b3f7 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -10801,6 +10801,22 @@ static void intel_init_quirks(struct drm_device *dev) } } +void i915_power_well_redisable_vga(struct drm_i915_private *dev_priv) +{ + struct drm_device *dev = dev_priv->dev; + u8 val; + + vga_get_uninterruptible(dev->pdev, VGA_RSRC_LEGACY_IO); + val = inb(VGA_MSR_READ); + DRM_DEBUG_KMS("=== msr_read read: 0x%04x\n", val); + outb(val & ~VGA_MSR_MEM_EN, VGA_MSR_WRITE); + DRM_DEBUG_KMS("=== msr_read write: 0x%04x\n", val & ~VGA_MSR_MEM_EN); + vga_set_legacy_decoding(dev->pdev, VGA_RSRC_LEGACY_IO | + VGA_RSRC_NORMAL_IO | + VGA_RSRC_NORMAL_MEM); + vga_put(dev->pdev, VGA_RSRC_LEGACY_IO); +} + /* Disable the VGA plane that we never use */ static void i915_disable_vga(struct drm_device *dev) { diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h index b5e7cb5..d7a5b13 100644 --- a/drivers/gpu/drm/i915/intel_drv.h +++ b/drivers/gpu/drm/i915/intel_drv.h @@ -705,6 +705,7 @@ void ironlake_check_encoder_dotclock(const struct intel_crtc_config *pipe_config, int dotclock); bool intel_crtc_active(struct drm_crtc *crtc); +void i915_power_well_redisable_vga(struct drm_i915_private *dev_priv); void i915_disable_vga_mem(struct drm_device *dev); void hsw_enable_ips(struct intel_crtc *crtc); void hsw_disable_ips(struct intel_crtc *crtc); diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c index 6374884..425807b 100644 --- a/drivers/gpu/drm/i915/intel_pm.c +++ b/drivers/gpu/drm/i915/intel_pm.c @@ -5694,6 +5694,8 @@ static void hsw_set_power_well(struct drm_device *dev, if (wait_for((I915_READ(HSW_PWR_WELL_DRIVER) & HSW_PWR_WELL_STATE_ENABLED), 20)) DRM_ERROR("Timeout enabling power well\n"); + + i915_power_well_redisable_vga(dev_priv); } if (IS_BROADWELL(dev)) { And the interesting thing, here's "dmesg | grep ===" from booting my machine with this patch (eDP+HDMI): [ 7.201853] [drm:i915_power_well_redisable_vga], === msr_read read: 0x0000 [ 7.201855] [drm:i915_power_well_redisable_vga], === msr_read write: 0x0000 So we read zero and write zero, and that seems to solve the problem.... (In reply to comment #41) > This solves the problem: > Oh, and by "solves the problem" I mean "solves the problem where you boot with eDP+HDMI, so when you unbind the power well is already enabled". The case where we only have eDP is sovled by other patches. Today I submitted "drm/i915: disable VGA mem when reenabling the power well": http://lists.freedesktop.org/archives/intel-gfx/2013-November/036733.html This fixes the case where the power well is *enabled* when we do the unbind. This won't fix the problem if you just have an eDP panel enabled. The case where the power well is *disabled* when we do the unbind will require more patches. I've pushed a completely untested patch series to just shut up vgacon to http://cgit.freedesktop.org/~danvet/drm/log/?h=vgacon-crap (In reply to comment #44) > I've pushed a completely untested patch series to just shut up vgacon to > > http://cgit.freedesktop.org/~danvet/drm/log/?h=vgacon-crap Just to document the bug what I said on IRC: this doesn't fix the bug. Function do_unregister_con_driver() returns -ENODEV, so we don't really unbind vgacon. This happens because vgacon doesn't have the CON_DRIVER_FLAG_MODULE flag. The vgacon is not initalized by do_register_con_driver(), it's initialized by con_init(), though the "conswitchp" variable which is initalized by arch/x86/kernel/setup.c. I am not sure if it's possible to just uninitialize vgacon... Reassigning bug back to default assignee. I already provided patches that fix the bug, twice, but it seems people don't like my current approach, yet they don't suggest/submit anything that actually fixes the bug. I continued the investigation, and posted the findings as a reply to the patch: http://lists.freedesktop.org/archives/intel-gfx/2013-December/037289.html (In reply to comment #47) > I continued the investigation, and posted the findings as a reply to the > patch: > http://lists.freedesktop.org/archives/intel-gfx/2013-December/037289.html Patch fail. (Patch is indented 1 space.) patching file drivers/gpu/drm/i915/intel_display.c Hunk #1 succeeded at 11083 with fuzz 1 (offset 19 lines). (Patch is indented 1 space.) patching file drivers/gpu/drm/i915/intel_pm.c Hunk #1 FAILED at 5703. 1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/i915/intel_pm.c.rej Paulo has a new set of patches iirc ... (In reply to comment #49) > Paulo has a new set of patches iirc ... We recently merged the patches to fix the case where the power well is _enabled_ while we do module_unload. I'm not sure how QA tests this bug, but this case should be fixed. What remains is the patch to enable the power well in case we unbind fbcon. I have an implementation for this, but it touches the FB layer and it seems people don't want to touch those files (even if we just add some callback support). I need do re-check if there's a way to implement this without touching anything outside drivers/gpu/drm/i915 (I already did it once, but can try again...). If we do this, the code will work regardless of the loaded console. Chris recently submitted "drm/i915: Kick out vga console" which kills vgacon and, as a consequence, also indirectly solves the problem. If Chris "kill vgacon" patch is good enough I'd very much prefer that one. We need it anyway and it's imo the sane thing to do in general ... Daneil, any update on this bug? Do we have new patches? Hua, pls help to retest. For reference Chris' patch: http://patchwork.freedesktop.org/patch/16792/ Please test this patch. Created attachment 99372 [details] dmesg (In reply to comment #53) > For reference Chris' patch: > > http://patchwork.freedesktop.org/patch/16792/ > > Please test this patch. Test this patch, the hang goes away. It causes call trace. output: module successfully unloaded module successfully loaded again [ 35.971940] WARNING: CPU: 4 PID: 4123 at drivers/gpu/drm/drm_crtc.c:4704 drm_mode_config_cleanup+0x164/0x212 [drm]() [ 35.971985] Modules linked in: ipv6 dm_mod snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic iTCO_wdt iTCO_vendor_support serio_raw pcspkr i2c_i801 lpc_ich mfd_c ore snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore battery wmi tpm_infineon tpm_tis tpm acpi_cpufreq i915(-) video button drm_kms_helper drm [last un loaded: snd_hda_intel] [ 35.973233] CPU: 4 PID: 4123 Comm: rmmod Not tainted 3.14.0_kcloud_f0122d_20140520+ #38 [ 35.973276] Hardware name: ASUS All Series/Z87-EXPERT, BIOS 1008 05/17/2013 [ 35.973318] 0000000000000006 ffffffff81709105 0000000000000000 ffffffff81033f78 [ 35.973508] ffffffffa000d305 ffff880251774000 ffff8802517743b8 ffff8802517742a0 [ 35.973698] ffff8802517743c8 ffffffffa000d305 ffff880251774000 ffff8802517742a0 [ 35.973890] Call Trace: [ 35.974011] [<ffffffff81709105>] ? dump_stack+0x41/0x51 [ 35.974132] [<ffffffff81033f78>] ? warn_slowpath_common+0x6f/0x84 [ 35.974255] [<ffffffffa000d305>] ? drm_mode_config_cleanup+0x164/0x212 [drm] [ 35.974379] [<ffffffffa000d305>] ? drm_mode_config_cleanup+0x164/0x212 [drm] [ 35.974507] [<ffffffffa00a6d15>] ? intel_modeset_cleanup+0xca/0xf0 [i915] [ 35.974633] [<ffffffffa00c9000>] ? i915_driver_unload+0xaf/0x28e [i915] [ 35.974760] [<ffffffffa000759e>] ? drm_dev_unregister+0x1e/0x8b [drm] [ 35.974883] [<ffffffffa000775b>] ? drm_put_dev+0x45/0x4e [drm] [ 35.975004] [<ffffffff812f3c79>] ? pci_device_remove+0x38/0x80 [ 35.975126] [<ffffffff813758ec>] ? __device_release_driver+0x7e/0xd4 [ 35.975246] [<ffffffff81375ec3>] ? driver_detach+0x71/0x9a [ 35.975367] [<ffffffff81375657>] ? bus_remove_driver+0x60/0x7d [ 35.975487] [<ffffffff812f3411>] ? pci_unregister_driver+0x17/0x75 [ 35.975611] [<ffffffffa0008ec8>] ? drm_pci_exit+0x37/0x96 [drm] [ 35.975733] [<ffffffff8107caf0>] ? SyS_delete_module+0x10c/0x182 [ 35.975855] [<ffffffff810024f9>] ? do_notify_resume+0x4e/0x5d [ 35.975991] [<ffffffff817138ea>] ? int_signal+0x12/0x17 [ 35.976114] [<ffffffff81713662>] ? system_call_fastpath+0x16/0x1b [ 35.976234] ---[ end trace c063e9f88c0c47fc ]--- [ 35.976357] [drm:intel_crtc_cursor_set] cursor off [ 35.976358] [drm:intel_crtc_cursor_set] cursor off [ 35.976359] [drm:intel_crtc_cursor_set] cursor off [ 35.977844] ------------[ cut here ]------------ (In reply to comment #54) > Created attachment 99372 [details] > dmesg > > (In reply to comment #53) > > For reference Chris' patch: > > > > http://patchwork.freedesktop.org/patch/16792/ > > > > Please test this patch. > > Test this patch, the hang goes away. > It causes call trace. This is bug 77511. We really need a script to traige backtraces to avoid such confusions. Chris, can you please add the select DUMMY_CONSOLE to your patch, add a commit message and re-submit it please? Maybe cc: David Herrmann on the patch, too. config DUMMY_CONSOLE bool depends on VGA_CONSOLE!=y || SGI_NEWPORT_CONSOLE!=y default y Hmm, that would seem to make "select DUMMY_CONSOLE" nonsense. Anyway, as I said, I'd like for Paulo to write the commit message as I am not sure just what and why this works. Hm, I've thought it would break. Probably easiest to push this out into a branch the 0-day tester looks at and wait for reports ;-) Fix pushed out to -fixes (though it was in -nightly for about a week now). commit a4de05268e674e8ed31df6348269e22d6c6a1803 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Thu Jun 5 16:20:46 2014 +0200 drm/i915: Kick out vga console Test on latest -nightly kernel, i915 is in use. Test on latest -queued kernel, error still exists. output(nightly kernel) rmmod: ERROR: Module i915 is in use WARNING: i915.ko still loaded! module successfully loaded again lsmod(nightly kenrel) Module Size Used by snd_hda_intel 15258 0 ip6table_filter 1067 0 ip6_tables 13864 1 ip6table_filter ipv6 253295 24 iptable_filter 1112 0 ip_tables 13467 1 iptable_filter ebtable_nat 1529 0 ebtables 19258 1 ebtable_nat x_tables 13403 5 ip6table_filter,ip_tables,iptable_filter,ebtables,ip6_tables dm_mod 67053 0 snd_hda_codec_realtek 46034 1 snd_hda_codec_generic 38382 1 snd_hda_codec_realtek iTCO_wdt 4695 0 iTCO_vendor_support 1624 1 iTCO_wdt snd_hda_codec_hdmi 31910 1 serio_raw 3929 0 pcspkr 1715 0 i2c_i801 8397 0 lpc_ich 13312 0 mfd_core 2537 1 lpc_ich snd_hda_controller 16013 1 snd_hda_intel snd_hda_codec 71968 5 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_intel,snd_hda_controller snd_hwdep 5309 1 snd_hda_codec snd_pcm 69054 4 snd_hda_codec_hdmi,snd_hda_codec,snd_hda_intel,snd_hda_controller snd_timer 15325 1 snd_pcm snd 49512 8 snd_hda_codec_realtek,snd_hwdep,snd_timer,snd_hda_codec_hdmi,snd_pcm,snd_hda_codec_generic,snd_hda_codec,snd_hda_intel soundcore 4178 2 snd,snd_hda_codec battery 12312 0 wmi 6915 0 tpm_infineon 6274 0 tpm_tis 6884 0 tpm 18810 2 tpm_tis,tpm_infineon acpi_cpufreq 5906 0 i915 720378 3 video 11915 1 i915 button 4229 1 i915 drm_kms_helper 27746 1 i915 drm 206876 3 i915,drm_kms_helper Test on latest -queued kernel: output: module successfully unloaded module successfully loaded again dmesg -r | egrep "<[1-3]>" |grep drm <3>[ 31.720889] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 209c <3>[ 31.726335] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to a090 <3>[ 31.945258] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to a090 <3>[ 31.951119] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 45404 The patch is only in -fixes. The module reload failure sounds like a separate regression. (In reply to comment #61) > The patch is only in -fixes. The module reload failure sounds like a > separate regression. Report Bug 80176 about the failure, the failure happens on -fixes and -nightly kernel. So I will verify this bug after bug 80176 fixed. Verified.Fixed. Closing verified+fixed. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 83695 [details] dmesg System Environment: -------------------------- Platform: Haswell Kernel: (drm-intel-fixes)61c2542b3b8e70fc1b82880c3ef7e2f930875b97 Bug detailed description: ----------------------------- It happens on Haswell with -fixes ,-queued, -nightly kernel. Run ./module_reload, system hang.No output info. If connect VGA or DP or HDMI, It works well. The latest known good commit:19b2dbde5732170a03bd82cc8bd442cf88d856f7 The latest known bad commit: d4eead50eb206b875f54f66cc0f6ec7d54122c28 Reproduce steps: ---------------------------- 1. ./module_reload