67813 – [HSW bisected]igt/module_reload causes [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44004 and system hang with headless, with power well disabled

Bug 67813 - [HSW bisected]igt/module_reload causes [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44004 and system hang with headless, with power well disabled

Summary: [HSW bisected]igt/module_reload causes [drm:hsw_unclaimed_reg_check] *ERROR* ...

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	All Linux (All)

Importance:	highest major
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-08-06 07:25 UTC by lu hua
Modified:	2016-09-30 13:25 UTC (History)
CC List:	5 users (show)

See Also:
i915 platform:
i915 features:

Attachments
dmesg (2.13 MB, text/plain) 2013-08-06 07:25 UTC, lu hua	no flags	Details
dmesg with c tool (1.24 MB, text/plain) 2013-08-06 08:53 UTC, lu hua	no flags	Details
Fix for the bug (1.76 KB, patch) 2013-08-14 17:33 UTC, Paulo Zanoni	no flags	Details \| Splinter Review
dmesg with patch "Fix for the bug" (4.32 KB, text/plain) 2013-08-16 08:14 UTC, lu hua	no flags	Details
dmesg (123.43 KB, text/plain) 2014-05-20 02:45 UTC, lu hua	no flags	Details
View All

Description lu hua 2013-08-06 07:25:09 UTC

Created attachment 83695 [details]
dmesg

System Environment:
--------------------------
Platform: Haswell
Kernel: (drm-intel-fixes)61c2542b3b8e70fc1b82880c3ef7e2f930875b97

Bug detailed description:
-----------------------------
It happens on Haswell with -fixes ,-queued, -nightly kernel.
Run ./module_reload, system hang.No output info.
If connect VGA or DP or HDMI, It works well.

The latest known good commit:19b2dbde5732170a03bd82cc8bd442cf88d856f7
The latest known bad commit: d4eead50eb206b875f54f66cc0f6ec7d54122c28

Reproduce steps:
----------------------------
1. ./module_reload

Comment 1 Daniel Vetter 2013-08-06 07:54:02 UTC

Hm, the dmesg output seems to be rather sparse. Have you checked that full debug output is enabled over the console? You can set this with

dmesg -n 8

Note that the dmesg tool shipping with most distros is broken, and the upstream maintainer unwilling to fix it (I have no idea about the amount of fail going on there). If your dmesg tool rejects the above command please compile&run (as root) the below littel C tool:

#include <stdio.h>
#include <sys/syslog.h>
#include <sys/klog.h>

int main()
{
        klogctl(8, NULL, 8);
}

If the dmesg with full debug enabled doesn't show more I guess we need the bisect result to move forward here. But since there's quite a few commits to test grabbing a new dmesg might be quicker.

Comment 2 lu hua 2013-08-06 08:53:52 UTC

Created attachment 83705 [details]
dmesg with c tool

Comment 3 Daniel Vetter 2013-08-06 11:11:47 UTC

Hm, module reload should dump much more information when the dmesg verbosity is correctly set over netconsole. So I still think there's something wrong with your netconsole setup.

Can you please check whether your netconsole setup works correctly (= you get the same output over netconsole as if you grab the dmesg output locally) on a machine where module_reload works? If that works please check the netconsole output of the affected hsw machine with another test which is know to put debug information into dmesg but doesn't crash the machine, like testdisplay.

Comment 4 Daniel Vetter 2013-08-06 11:35:05 UTC

Also, can you please start to bisect this issue so that we don't get blocked on getting netconsole to work better?

Comment 5 lu hua 2013-08-08 05:53:25 UTC

Bisect shows: bf51d5e2cda5d36d98e4b46ac7fca9461e512c41 is the first bad commit
commit bf51d5e2cda5d36d98e4b46ac7fca9461e512c41
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date:   Wed Jul 3 17:12:13 2013 -0300

    drm/i915: switch disable_power_well default value to 1

    Now that the audio driver is using our power well API, everything
    should be working correctly, so let's give it a try.

    Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Comment 6 Daniel Vetter 2013-08-08 06:37:25 UTC

Can you please double-check the bisect by booting with i915.disable_power_well=0 on latest -nightly?

In any case one for Paulo it seems ...

Comment 7 lu hua 2013-08-08 06:49:50 UTC

(In reply to comment #6)
> Can you please double-check the bisect by booting with
> i915.disable_power_well=0 on latest -nightly?
> 
> In any case one for Paulo it seems ...


Add i915.disable_power_well=0, this issue goes away.

Comment 8 Paulo Zanoni 2013-08-14 17:33:50 UTC

Created attachment 84069 [details] [review]
Fix for the bug

Hi

Can you please verify this patch fixes the bug?

Please notice that this patch only fixes the "unclaimed register" messages. When running module_reload I still see many problems which are not fixed by this patch:
- We get a gpu hang most of the time
- Backlight doesn't work
- Sometimes we get an oops
- There's a big lockdep warning every time.

Thanks,
Paulo

Comment 9 lu hua 2013-08-15 07:30:32 UTC

(In reply to comment #8)
> Created attachment 84069 [details] [review] [review]
> Fix for the bug
> 
> Hi
> 
> Can you please verify this patch fixes the bug?
> 
> Please notice that this patch only fixes the "unclaimed register" messages.
> When running module_reload I still see many problems which are not fixed by
> this patch:
> - We get a gpu hang most of the time
> - Backlight doesn't work
> - Sometimes we get an oops
> - There's a big lockdep warning every time.
> 
> Thanks,
> Paulo

I test this patch. This issue still exists.

Comment 10 Paulo Zanoni 2013-08-15 12:56:07 UTC

(In reply to comment #9)
> (In reply to comment #8)
> > Created attachment 84069 [details] [review] [review] [review]
> > Fix for the bug
> > 
> > Hi
> > 
> > Can you please verify this patch fixes the bug?
> > 
> > Please notice that this patch only fixes the "unclaimed register" messages.
> > When running module_reload I still see many problems which are not fixed by
> > this patch:
> > - We get a gpu hang most of the time
> > - Backlight doesn't work
> > - Sometimes we get an oops
> > - There's a big lockdep warning every time.
> > 
> > Thanks,
> > Paulo
> 
> I test this patch. This issue still exists.

Which issue specifically: the black screen or the "unclaimed register" messages? Can you please post the new dmesg output?

Thanks,
Paulo

Comment 11 lu hua 2013-08-16 08:14:42 UTC

Created attachment 84128 [details]
dmesg with patch "Fix for the bug"

Comment 12 Paulo Zanoni 2013-08-16 13:48:39 UTC

(In reply to comment #11)
> Created attachment 84128 [details]
> dmesg with patch "Fix for the bug"

Please the full dmesg.

What I did here to test was:

- Boot the machine
- SSH to it from another machine
- Run ./module_reload from the SSH session
- dmesg > dmesg.txt

Does that work for you?

Comment 13 lu hua 2013-08-19 05:54:45 UTC

Paulo, I add drm.debug=0xe to grub, the attached dmesg is full dmesg. 

> 
> - Boot the machine
> - SSH to it from another machine
> - Run ./module_reload from the SSH session
> - dmesg > dmesg.txt
> 
> Does that work for you?

Because system hang, I use netconsole.

Do you have any suggest for more dmesg?

Comment 14 Paulo Zanoni 2013-08-26 16:16:12 UTC

(In reply to comment #13)
> Paulo, I add drm.debug=0xe to grub, the attached dmesg is full dmesg. 
> 
> > 
> > - Boot the machine
> > - SSH to it from another machine
> > - Run ./module_reload from the SSH session
> > - dmesg > dmesg.txt
> > 
> > Does that work for you?
> 
> Because system hang, I use netconsole.
> 
> Do you have any suggest for more dmesg?

I really can't reproduce your bug. Can you please attach the full "dmesg" from right before you run module_reload? Please do it with drm.debug=0xe. Also please attach the output of "lsmod" from right before when you run module_reload.

Also, please check if when you run module_reload you get some error messages about module i915 still being loaded.

Also, please check whether you're just getting a backlight bug or a real full machine hang.

I'm running drm-intel-nightly.

Comment 15 Paulo Zanoni 2013-08-26 16:18:51 UTC

(In reply to comment #11)
> Created attachment 84128 [details]
> dmesg with patch "Fix for the bug"

The errors on this log are for registers that should still work even when the power well is disabled... I wonder if there's some issue between the Audio and Graphics drivers.

Comment 16 lu hua 2013-08-27 06:33:51 UTC

I added drm.debug=0xe.

# lsmod
Module                  Size  Used by
dm_mod                 58278  0
iTCO_wdt                4535  1
iTCO_vendor_support     1584  1 iTCO_wdt
snd_hda_codec_realtek    32195  1
pcspkr                  1675  0
i2c_i801                8342  0
lpc_ich                12616  0
mfd_core                2401  1 lpc_ich
snd_hda_intel          24636  0
snd_hda_codec          98736  2 snd_hda_codec_realtek,snd_hda_intel
snd_hwdep               5141  1 snd_hda_codec
snd_pcm                66617  2 snd_hda_codec,snd_hda_intel
snd_page_alloc          5906  2 snd_pcm,snd_hda_intel
snd_timer              15318  1 snd_pcm
snd                    50129  6 snd_hda_codec_realtek,snd_hwdep,snd_timer,snd_pcm,snd_hda_codec,snd_hda_intel
soundcore               4210  1 snd
battery                10931  0
ac                      3999  0
acpi_cpufreq            6339  0
i915                  529852  1
video                  11092  1 i915
button                  4333  1 i915
drm_kms_helper         22788  1 i915
drm                   195808  2 i915,drm_kms_helper
mperf                   1027  1 acpi_cpufreq
freq_table              2092  1 acpi_cpufreq

Comment 17 Paulo Zanoni 2013-09-25 21:49:35 UTC

Ok, I reinstalled my OS and now I can reproduce the bug. I also did an analysis:

The bug is not really caused by module reload, it's caused by the "echo 0 > /sys/class/vtconsole/vtcon1/bind" command that's part of the module_reload script.

Since I couldn't reproduce the bug in my previous installation, I suspect the bug is dependent on which console driver is used.

We get the problem even if the power well is enabled when we run the command. But if the power well was *never* disabled, we don't get the problem. This means that the condition to make the bug happen later is just a disable/enable cycle on the power well.

Another thing I notice is that the endless flood of "unclaimed register" messages doesn't seem to be triggered by our driver. It seems something else is causing these messages. Maybe something is writing to VGA memory?

If we disable the north display error interrupts, everything appears to work (except for a few error messages on dmesg).

Comment 18 Paulo Zanoni 2013-09-26 23:08:47 UTC

Hi

I believe this patch series fixes the problem:
http://lists.freedesktop.org/archives/intel-gfx/2013-September/033984.html

Can you please test?

Thanks,
Paulo

Comment 19 lu hua 2013-09-27 07:50:45 UTC

(In reply to comment #18)
> Hi
> 
> I believe this patch series fixes the problem:
> http://lists.freedesktop.org/archives/intel-gfx/2013-September/033984.html
> 
> Can you please test?
> 
> Thanks,
> Paulo

Test these patches
output:
ERROR: Module i915 is in use
WARNING: i915.ko still loaded!
module successfully loaded again

Comment 20 Paulo Zanoni 2013-09-27 13:28:57 UTC

(In reply to comment #19)
> (In reply to comment #18)
> > Hi
> > 
> > I believe this patch series fixes the problem:
> > http://lists.freedesktop.org/archives/intel-gfx/2013-September/033984.html
> > 
> > Can you please test?
> > 
> > Thanks,
> > Paulo
> 
> Test these patches
> output:
> ERROR: Module i915 is in use
> WARNING: i915.ko still loaded!
> module successfully loaded again

Are you sure you did the test correctly? Did you kill X and all the other DRM clients when doing this? If you repeat the test with i915.disable_power_well=0 do you also get the same problem?

All conditions necessary to run the other intel-gpu-tools tests apply when running module_reload.

Comment 21 lu hua 2013-09-29 05:38:22 UTC

> Are you sure you did the test correctly? Did you kill X and all the other
> DRM clients when doing this? If you repeat the test with
> i915.disable_power_well=0 do you also get the same problem?

Run with i915.disable_power_well=0, also has same output info.

> 
> All conditions necessary to run the other intel-gpu-tools tests apply when
> running module_reload.

Clean boot system, then test it.

Comment 22 Paulo Zanoni 2013-09-30 13:05:39 UTC

(In reply to comment #21)
> > Are you sure you did the test correctly? Did you kill X and all the other
> > DRM clients when doing this? If you repeat the test with
> > i915.disable_power_well=0 do you also get the same problem?
> 
> Run with i915.disable_power_well=0, also has same output info.
> 
> > 
> > All conditions necessary to run the other intel-gpu-tools tests apply when
> > running module_reload.
> 
> Clean boot system, then test it.

I have no idea what happens on your "clean boot system" configuration. Does the desktop environment run? Do you have X running?

That "i915.ko still loaded!" message really suggests there are programs you need to kill before doing the tests. Please make sure your testing environment is correct before you try to run module_reload.

Comment 23 fangxun 2013-10-09 07:33:40 UTC

(In reply to comment #22)
> (In reply to comment #21)
> > > Are you sure you did the test correctly? Did you kill X and all the other
> > > DRM clients when doing this? If you repeat the test with
> > > i915.disable_power_well=0 do you also get the same problem?
> > 
> > Run with i915.disable_power_well=0, also has same output info.
> > 
> > > 
> > > All conditions necessary to run the other intel-gpu-tools tests apply when
> > > running module_reload.
> > 
> > Clean boot system, then test it.
> 
> I have no idea what happens on your "clean boot system" configuration. Does
> the desktop environment run? Do you have X running?
 "clean boot system" means system boot up without X and the desktop environment running.

> That "i915.ko still loaded!" message really suggests there are programs you
> need to kill before doing the tests. Please make sure your testing
> environment is correct before you try to run module_reload.
  Since Kernel 3.12.0rc2, after running "echo 0 > /sys/class/vtconsole/vtcon1/bind" and "rmmod intel_ips &> /dev/null", lsmod shows there are still 2 programs using i915 module on haswell(I check all platform, this only happens on haswell). It causes i915 module can not be unloaded. I don't know what are the 2 programs, and how to kill them. 

[root@x-hsw24 ~]# lsmod
Module                  Size  Used by
snd_hda_codec_realtek    34049  1
snd_hda_codec_hdmi     27236  1
snd_hda_intel          25263  0
snd_hda_codec         100145  3 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_intel
snd_hwdep               5094  1 snd_hda_codec
snd_seq                41921  0
snd_seq_device          4581  1 snd_seq
snd_pcm                66241  3 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec
iTCO_wdt                4551  1
iTCO_vendor_support     1608  1 iTCO_wdt
snd_timer              15514  2 snd_seq,snd_pcm
snd                    50169  9 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hwdep,snd_seq,snd_seq_device,snd_pcm,snd_timer
serio_raw               3929  0
pcspkr                  1699  0
microcode               7049  0
soundcore               4219  1 snd
i2c_i801                8374  0
snd_page_alloc          5930  2 snd_hda_intel,snd_pcm
lpc_ich                12608  0
mfd_core                2441  1 lpc_ich
acpi_cpufreq            6299  0
freq_table              2068  1 acpi_cpufreq
uinput                  6676  0
ipv6                  246159  43
i915                  531119  2
drm_kms_helper         22993  1 i915
drm                   199664  2 i915,drm_kms_helper
button                  4261  1 i915
video                  10625  1 i915
dm_mirror              11024  0
dm_region_hash          5831  1 dm_mirror
dm_log                  7220  2 dm_mirror,dm_region_hash
dm_mod                 66376  2 dm_mirror,dm_log

Comment 24 Daniel Vetter 2013-10-09 08:36:31 UTC

(In reply to comment #23)
> (In reply to comment #22)
> > That "i915.ko still loaded!" message really suggests there are programs you
> > need to kill before doing the tests. Please make sure your testing
> > environment is correct before you try to run module_reload.
>   Since Kernel 3.12.0rc2, after running "echo 0 >
> /sys/class/vtconsole/vtcon1/bind" and "rmmod intel_ips &> /dev/null", lsmod
> shows there are still 2 programs using i915 module on haswell(I check all
> platform, this only happens on haswell). It causes i915 module can not be
> unloaded. I don't know what are the 2 programs, and how to kill them. 

Could be the audio driver holding module references for the power well stuff. Can you please unload all sound drivers and see whether that gets rid of the references?

Comment 25 fangxun 2013-10-09 09:48:02 UTC

Unloading snd_hda_intel module gets rid of the references. Thanks, Daniel.
I verify Paulo's patch series fixes the bug.

Comment 26 Paulo Zanoni 2013-10-09 12:53:37 UTC

So I guess we need to patch module_reload to also "rmmod snd_hda_intel" (or whatever the correct module is).

Comment 27 Daniel Vetter 2013-10-09 13:24:10 UTC

(In reply to comment #26)
> So I guess we need to patch module_reload to also "rmmod snd_hda_intel" (or
> whatever the correct module is).

Yup, I think we need to add that. Lu Hua, can you please file a new bug report that module_reload doesn't work on Haswell? It's a regression, so top priority bug.

Comment 28 Paulo Zanoni 2013-10-09 21:22:20 UTC

(In reply to comment #27)
> (In reply to comment #26)
> > So I guess we need to patch module_reload to also "rmmod snd_hda_intel" (or
> > whatever the correct module is).
> 
> Yup, I think we need to add that. Lu Hua, can you please file a new bug
> report that module_reload doesn't work on Haswell? It's a regression, so top
> priority bug.

The ugly thing about it, is that, at least on my machine, we need to "pkill alsactl" in addition to stopping gdm. Do we want this on the script?

Comment 29 fangxun 2013-10-14 02:02:54 UTC

(In reply to comment #27)
> (In reply to comment #26)
> So I guess we need to patch module_reload to
> also "rmmod snd_hda_intel" (or
> whatever the correct module is).

Yup, I
> think we need to add that. Lu Hua, can you please file a new bug report that
> module_reload doesn't work on Haswell? It's a regression, so top priority
> bug.

Note: Bug 70336 is traced for this.

Comment 30 Daniel Vetter 2013-11-18 08:12:59 UTC

Ok, Paulo's patch to the module reload test has landed.

Comment 31 lu hua 2013-11-19 03:39:53 UTC

It still happens on latest -nightly kernel.

Comment 32 Daniel Vetter 2013-11-19 19:02:00 UTC

(In reply to comment #31)
> It still happens on latest -nightly kernel.

What exactly? The unclaimed register error or the failure to unload the module? We've had a bunch of bugs here ...

Comment 33 lu hua 2013-11-20 07:59:37 UTC

Run ./drv_module_reload ,It doesn't have any output. 

dmesg:
[  122.577167] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to c400c
[  122.577224] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c
[  122.577322] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c
[  122.577422] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 44008
[  122.577521] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c
[  122.577619] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c
[  122.577720] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 44008
[  122.577817] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c
[  122.577876] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 4400c
[  122.577974] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44008
[  122.578033] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 44008
[  122.578131] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c
[  122.578189] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register before interrupt
[  122.578286] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 44008
[  122.578400] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c
[  122.578526] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c
[  122.578648] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c
[  122.578729] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to c400c
[  122.578852] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 4400c
[  122.578973] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44008
[  122.579055] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c
[  122.579138] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 4400c
[  122.579197] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 4400c
[  122.579321] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register before interrupt
[  122.579440] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44008
[  122.579520] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to 44040
[  122.579576] [drm:hsw_unclaimed_reg_check] *ERROR* Unclaimed write to c400c

Comment 34 Paulo Zanoni 2013-11-28 14:53:10 UTC

The patches that fix this bug have been submitted a long time already. See comment #25. Daniel didn't seem to be too interested in merging them...

I'll try to investigate the problem a little bit more so I can provide a more detailed/precise description of the problem and why the current patches fix it (because of the questions from the reviewers). But I'm not sure what I should do to fix this bug if we're not willing to merge the patches.

Comment 35 Daniel Vetter 2013-11-28 14:57:15 UTC

Paulo, the link in comment #18 is dead, so I have no idea which patches we're still missing ...

Comment 36 Paulo Zanoni 2013-11-28 15:36:48 UTC

(In reply to comment #35)
> Paulo, the link in comment #18 is dead, so I have no idea which patches
> we're still missing ...

I didn't think those archives were moving targets! That's really bad!
http://lists.freedesktop.org/archives/intel-gfx/2013-September/033825.html

Comment 37 Paulo Zanoni 2013-11-28 15:37:56 UTC

(In reply to comment #36)
> (In reply to comment #35)
> > Paulo, the link in comment #18 is dead, so I have no idea which patches
> > we're still missing ...
> 
> I didn't think those archives were moving targets! That's really bad!
> http://lists.freedesktop.org/archives/intel-gfx/2013-September/033825.html

In case they get lost again:
http://lists.freedesktop.org/archives/intel-gfx/2013-September/

Then search for:
[PATCH 0/5] module_reload fixes

Comment 38 Paulo Zanoni 2013-11-28 18:00:04 UTC

I did some more investigation, and it seems the Kernel gets stuck on an infinite loop that keeps calling VGA. I added some printks, and I at the moment we do the "echo 0" to the bind file

Comment 39 Paulo Zanoni 2013-11-28 20:14:49 UTC

(In reply to comment #38)
> I did some more investigation, and it seems the Kernel gets stuck on an
> infinite loop that keeps calling VGA. I added some printks, and I at the
> moment we do the "echo 0" to the bind file

(accidentally hit the "Save Changes" button)

I did some more investigation, and it seems the Kernel gets stuck on an infinite loop that keeps calling VGA functions. I added some printks, and at the moment we do the "echo 0" to the bind file, we keep printing this forever:

[  431.878277] === write_vga
[  431.878278] === vgacon_set_cursor_size
[  431.878297] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register before interrupt
[  431.878331] === vgacon_cursor  
[  431.878332] === vgacon_restore_screen
[  431.878332] === write_vga  
[  431.878333] === vgacon_set_cursor_size  
[  431.878373] === vgacon_dummy
[  431.878373] === vgacon_scroll  
[  431.878374] === vgacon_restore_screen  
[  431.878374] === vgacon_scrollback_update  
[  431.878435] === vga_set_mem_top  
[  431.878435] === write_vga  
[  431.878439] === vgacon_dummy  
[  431.878440] === vgacon_scroll
[  431.878440] === vgacon_restore_screen  
[  431.878440] === vgacon_scrollback_update  
[  431.878501] === vga_set_mem_top  
[  431.878502] === write_vga  
[  431.878503] === vgacon_cursor  
[  431.878503] === vgacon_restore_screen  
[  431.878504] === write_vga  
[  431.878505] === vgacon_set_cursor_size  
[  431.878524] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register before interrupt
[  431.878559] === vgacon_cursor  
[  431.878559] === vgacon_restore_screen  
[  431.878559] === write_vga  
[  431.878560] === vgacon_set_cursor_size
[  431.878601] === vgacon_dummy  
[  431.878601] === vgacon_scroll 

Also, the function that triggers the unclaimed register is clearly vgacon_set_cursor_size.

It looks like we're stuck in an infinite loop inside console_unlock.

The code that seems to fix the problem is just:
vga_get_uninterruptible(dev->pdev, VGA_RSRC_LEGACY_IO);
val = inb(VGA_MSR_READ);
outb(val & ~VGA_MSR_MEM_EN, VGA_MSR_WRITE);
vga_set_legacy_decoding(dev->pdev, VGA_RSRC_LEGACY_IO | VGA_RSRC_NORMAL_IO |VGA_RSRC_NORMAL_MEM);
vga_put(dev->pdev, VGA_RSRC_LEGACY_IO);

Comment 40 Ville Syrjala 2013-11-28 20:29:10 UTC

(In reply to comment #39)
> (In reply to comment #38)
> > I did some more investigation, and it seems the Kernel gets stuck on an
> > infinite loop that keeps calling VGA. I added some printks, and I at the
> > moment we do the "echo 0" to the bind file
> 
> (accidentally hit the "Save Changes" button)
> 
> I did some more investigation, and it seems the Kernel gets stuck on an
> infinite loop that keeps calling VGA functions. I added some printks, and at
> the moment we do the "echo 0" to the bind file, we keep printing this
> forever:
> 
> [  431.878277] === write_vga
> [  431.878278] === vgacon_set_cursor_size
> [  431.878297] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register
> before interrupt
> [  431.878331] === vgacon_cursor  
> [  431.878332] === vgacon_restore_screen
> [  431.878332] === write_vga  
> [  431.878333] === vgacon_set_cursor_size  
> [  431.878373] === vgacon_dummy
> [  431.878373] === vgacon_scroll  
> [  431.878374] === vgacon_restore_screen  
> [  431.878374] === vgacon_scrollback_update  
> [  431.878435] === vga_set_mem_top  
> [  431.878435] === write_vga  
> [  431.878439] === vgacon_dummy  
> [  431.878440] === vgacon_scroll
> [  431.878440] === vgacon_restore_screen  
> [  431.878440] === vgacon_scrollback_update  
> [  431.878501] === vga_set_mem_top  
> [  431.878502] === write_vga  
> [  431.878503] === vgacon_cursor  
> [  431.878503] === vgacon_restore_screen  
> [  431.878504] === write_vga  
> [  431.878505] === vgacon_set_cursor_size  
> [  431.878524] [drm:intel_uncore_check_errors] *ERROR* Unclaimed register
> before interrupt
> [  431.878559] === vgacon_cursor  
> [  431.878559] === vgacon_restore_screen  
> [  431.878559] === write_vga  
> [  431.878560] === vgacon_set_cursor_size
> [  431.878601] === vgacon_dummy  
> [  431.878601] === vgacon_scroll 
> 

That just looks like normal vgacon activity when stuff changes on the screen.

The momement you unbind fbcon, vgacon takes over again. The  But if the legacy resources aren't enabled you get the unclaimed register errors.

Either we'd need to restore the hardware to a state where VGA actually works,
or vgacon shouldn't be allowed to rebind after someone has clobbered the state. I'd go for the latter since it seems easier. Maybe there was a patch already somewhere. Also restoring the hardware state could happen only at driver unload time, and by that time vgacon has already kicked in. So even in that case we'd need to prevent vgacon from taking over before the hardware state has been restored.

Comment 41 Paulo Zanoni 2013-11-28 20:38:17 UTC

This solves the problem:


diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index a28347d..ea5b3f7 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -10801,6 +10801,22 @@ static void intel_init_quirks(struct drm_device *dev)
     }
 }
 
+void i915_power_well_redisable_vga(struct drm_i915_private *dev_priv)
+{
+    struct drm_device *dev = dev_priv->dev;
+    u8 val;
+
+    vga_get_uninterruptible(dev->pdev, VGA_RSRC_LEGACY_IO);
+    val = inb(VGA_MSR_READ);
+    DRM_DEBUG_KMS("=== msr_read read:  0x%04x\n", val);
+    outb(val & ~VGA_MSR_MEM_EN, VGA_MSR_WRITE);
+    DRM_DEBUG_KMS("=== msr_read write: 0x%04x\n", val & ~VGA_MSR_MEM_EN);
+    vga_set_legacy_decoding(dev->pdev, VGA_RSRC_LEGACY_IO |
+                       VGA_RSRC_NORMAL_IO |
+                       VGA_RSRC_NORMAL_MEM);
+    vga_put(dev->pdev, VGA_RSRC_LEGACY_IO);
+}
+
 /* Disable the VGA plane that we never use */
 static void i915_disable_vga(struct drm_device *dev)
 {
diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index b5e7cb5..d7a5b13 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -705,6 +705,7 @@ void
 ironlake_check_encoder_dotclock(const struct intel_crtc_config *pipe_config,
                 int dotclock);
 bool intel_crtc_active(struct drm_crtc *crtc);
+void i915_power_well_redisable_vga(struct drm_i915_private *dev_priv);
 void i915_disable_vga_mem(struct drm_device *dev);
 void hsw_enable_ips(struct intel_crtc *crtc);
 void hsw_disable_ips(struct intel_crtc *crtc);
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 6374884..425807b 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -5694,6 +5694,8 @@ static void hsw_set_power_well(struct drm_device *dev,
             if (wait_for((I915_READ(HSW_PWR_WELL_DRIVER) &
                       HSW_PWR_WELL_STATE_ENABLED), 20))
                 DRM_ERROR("Timeout enabling power well\n");
+
+            i915_power_well_redisable_vga(dev_priv);
         }
 
         if (IS_BROADWELL(dev)) {


And the interesting thing, here's "dmesg | grep ===" from booting my machine with this patch (eDP+HDMI):
[    7.201853] [drm:i915_power_well_redisable_vga], === msr_read read:  0x0000
[    7.201855] [drm:i915_power_well_redisable_vga], === msr_read write: 0x0000

So we read zero and write zero, and that seems to solve the problem....

Comment 42 Paulo Zanoni 2013-11-28 20:39:49 UTC

(In reply to comment #41)
> This solves the problem:
> 

Oh, and by "solves the problem" I mean "solves the problem where you boot with eDP+HDMI, so when you unbind the power well is already enabled".

The case where we only have eDP is sovled by other patches.

Comment 43 Paulo Zanoni 2013-11-29 13:42:48 UTC

Today I submitted "drm/i915: disable VGA mem when reenabling the power well":

http://lists.freedesktop.org/archives/intel-gfx/2013-November/036733.html

This fixes the case where the power well is *enabled* when we do the unbind. This won't fix the problem if you just have an eDP panel enabled.

The case where the power well is *disabled* when we do the unbind will require more patches.

Comment 44 Daniel Vetter 2013-11-29 14:48:48 UTC

I've pushed a completely untested patch series to just shut up vgacon to

http://cgit.freedesktop.org/~danvet/drm/log/?h=vgacon-crap

Comment 45 Paulo Zanoni 2013-11-29 17:14:17 UTC

(In reply to comment #44)
> I've pushed a completely untested patch series to just shut up vgacon to
> 
> http://cgit.freedesktop.org/~danvet/drm/log/?h=vgacon-crap

Just to document the bug what I said on IRC: this doesn't fix the bug.

Function do_unregister_con_driver() returns -ENODEV, so we don't really unbind vgacon. This happens because vgacon doesn't have the CON_DRIVER_FLAG_MODULE flag. The vgacon is not initalized by do_register_con_driver(), it's initialized by con_init(), though the "conswitchp" variable which is initalized by arch/x86/kernel/setup.c. I am not sure if it's possible to just uninitialize vgacon...

Comment 46 Paulo Zanoni 2013-12-03 15:55:33 UTC

Reassigning bug back to default assignee. I already provided patches that fix the bug, twice, but it seems people don't like my current approach, yet they don't suggest/submit anything that actually fixes the bug.

Comment 47 Paulo Zanoni 2013-12-11 17:49:57 UTC

I continued the investigation, and posted the findings as a reply to the patch: http://lists.freedesktop.org/archives/intel-gfx/2013-December/037289.html

Comment 48 lu hua 2013-12-13 05:12:05 UTC

(In reply to comment #47)
> I continued the investigation, and posted the findings as a reply to the
> patch:
> http://lists.freedesktop.org/archives/intel-gfx/2013-December/037289.html

Patch fail.
(Patch is indented 1 space.)
patching file drivers/gpu/drm/i915/intel_display.c
Hunk #1 succeeded at 11083 with fuzz 1 (offset 19 lines).
(Patch is indented 1 space.)
patching file drivers/gpu/drm/i915/intel_pm.c
Hunk #1 FAILED at 5703.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/i915/intel_pm.c.rej

Comment 49 Daniel Vetter 2013-12-16 09:19:33 UTC

Paulo has a new set of patches iirc ...

Comment 50 Paulo Zanoni 2013-12-16 20:14:42 UTC

(In reply to comment #49)
> Paulo has a new set of patches iirc ...

We recently merged the patches to fix the case where the power well is _enabled_ while we do module_unload. I'm not sure how QA tests this bug, but this case should be fixed.

What remains is the patch to enable the power well in case we unbind fbcon. I have an implementation for this, but it touches the FB layer and it seems people don't want to touch those files (even if we just add some callback support). I need do re-check if there's a way to implement this without touching anything outside drivers/gpu/drm/i915 (I already did it once, but can try again...). If we do this, the code will work regardless of the loaded console.

Chris recently submitted "drm/i915: Kick out vga console" which kills vgacon and, as a consequence, also indirectly solves the problem.

Comment 51 Daniel Vetter 2013-12-16 22:22:11 UTC

If Chris "kill vgacon" patch is good enough I'd very much prefer that one. We need it anyway and it's imo the sane thing to do in general ...

Comment 52 Guang Yang 2014-05-17 01:15:56 UTC

Daneil, any update on this bug? Do we have new patches?
Hua, pls help to retest.

Comment 53 Daniel Vetter 2014-05-19 09:12:20 UTC

For reference Chris' patch:

http://patchwork.freedesktop.org/patch/16792/

Please test this patch.

Comment 54 lu hua 2014-05-20 02:45:27 UTC

Created attachment 99372 [details]
dmesg

(In reply to comment #53)
> For reference Chris' patch:
> 
> http://patchwork.freedesktop.org/patch/16792/
> 
> Please test this patch.

Test this patch, the hang goes away.
It causes call trace.

output:
module successfully unloaded
module successfully loaded again

[   35.971940] WARNING: CPU: 4 PID: 4123 at drivers/gpu/drm/drm_crtc.c:4704 drm_mode_config_cleanup+0x164/0x212 [drm]()
[   35.971985] Modules linked in: ipv6 dm_mod snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic iTCO_wdt iTCO_vendor_support serio_raw pcspkr i2c_i801 lpc_ich mfd_c
ore snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore battery wmi tpm_infineon tpm_tis tpm acpi_cpufreq i915(-) video button drm_kms_helper drm [last un
loaded: snd_hda_intel]
[   35.973233] CPU: 4 PID: 4123 Comm: rmmod Not tainted 3.14.0_kcloud_f0122d_20140520+ #38
[   35.973276] Hardware name: ASUS All Series/Z87-EXPERT, BIOS 1008 05/17/2013
[   35.973318]  0000000000000006 ffffffff81709105 0000000000000000 ffffffff81033f78
[   35.973508]  ffffffffa000d305 ffff880251774000 ffff8802517743b8 ffff8802517742a0
[   35.973698]  ffff8802517743c8 ffffffffa000d305 ffff880251774000 ffff8802517742a0
[   35.973890] Call Trace:
[   35.974011]  [<ffffffff81709105>] ? dump_stack+0x41/0x51
[   35.974132]  [<ffffffff81033f78>] ? warn_slowpath_common+0x6f/0x84
[   35.974255]  [<ffffffffa000d305>] ? drm_mode_config_cleanup+0x164/0x212 [drm]
[   35.974379]  [<ffffffffa000d305>] ? drm_mode_config_cleanup+0x164/0x212 [drm]
[   35.974507]  [<ffffffffa00a6d15>] ? intel_modeset_cleanup+0xca/0xf0 [i915]
[   35.974633]  [<ffffffffa00c9000>] ? i915_driver_unload+0xaf/0x28e [i915]
[   35.974760]  [<ffffffffa000759e>] ? drm_dev_unregister+0x1e/0x8b [drm]
[   35.974883]  [<ffffffffa000775b>] ? drm_put_dev+0x45/0x4e [drm]
[   35.975004]  [<ffffffff812f3c79>] ? pci_device_remove+0x38/0x80
[   35.975126]  [<ffffffff813758ec>] ? __device_release_driver+0x7e/0xd4
[   35.975246]  [<ffffffff81375ec3>] ? driver_detach+0x71/0x9a
[   35.975367]  [<ffffffff81375657>] ? bus_remove_driver+0x60/0x7d
[   35.975487]  [<ffffffff812f3411>] ? pci_unregister_driver+0x17/0x75
[   35.975611]  [<ffffffffa0008ec8>] ? drm_pci_exit+0x37/0x96 [drm]
[   35.975733]  [<ffffffff8107caf0>] ? SyS_delete_module+0x10c/0x182
[   35.975855]  [<ffffffff810024f9>] ? do_notify_resume+0x4e/0x5d
[   35.975991]  [<ffffffff817138ea>] ? int_signal+0x12/0x17
[   35.976114]  [<ffffffff81713662>] ? system_call_fastpath+0x16/0x1b
[   35.976234] ---[ end trace c063e9f88c0c47fc ]---
[   35.976357] [drm:intel_crtc_cursor_set] cursor off
[   35.976358] [drm:intel_crtc_cursor_set] cursor off
[   35.976359] [drm:intel_crtc_cursor_set] cursor off
[   35.977844] ------------[ cut here ]------------

Comment 55 Daniel Vetter 2014-05-20 09:29:36 UTC

(In reply to comment #54)
> Created attachment 99372 [details]
> dmesg
> 
> (In reply to comment #53)
> > For reference Chris' patch:
> > 
> > http://patchwork.freedesktop.org/patch/16792/
> > 
> > Please test this patch.
> 
> Test this patch, the hang goes away.
> It causes call trace.

This is bug 77511. We really need a script to traige backtraces to avoid such confusions.

Chris, can you please add the select DUMMY_CONSOLE to your patch, add a commit message and re-submit it please?

Comment 56 Daniel Vetter 2014-05-20 09:29:52 UTC

Maybe cc: David Herrmann on the patch, too.

Comment 57 Chris Wilson 2014-05-20 09:44:47 UTC

config DUMMY_CONSOLE
        bool
        depends on VGA_CONSOLE!=y || SGI_NEWPORT_CONSOLE!=y
        default y

Hmm, that would seem to make "select DUMMY_CONSOLE" nonsense. Anyway, as I said, I'd like for Paulo to write the commit message as I am not sure just what and why this works.

Comment 58 Daniel Vetter 2014-05-21 16:56:49 UTC

Hm, I've thought it would break. Probably easiest to push this out into a branch the 0-day tester looks at and wait for reports ;-)

Comment 59 Jani Nikula 2014-06-11 08:56:01 UTC

Fix pushed out to -fixes (though it was in -nightly for about a week now).

commit a4de05268e674e8ed31df6348269e22d6c6a1803
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jun 5 16:20:46 2014 +0200

    drm/i915: Kick out vga console

Comment 60 lu hua 2014-06-16 06:51:49 UTC

Test on latest -nightly kernel, i915 is in use. Test on latest -queued kernel, error still exists.
output(nightly kernel)
rmmod: ERROR: Module i915 is in use
WARNING: i915.ko still loaded!
module successfully loaded again

lsmod(nightly kenrel)
Module                  Size  Used by
snd_hda_intel          15258  0
ip6table_filter         1067  0
ip6_tables             13864  1 ip6table_filter
ipv6                  253295  24
iptable_filter          1112  0
ip_tables              13467  1 iptable_filter
ebtable_nat             1529  0
ebtables               19258  1 ebtable_nat
x_tables               13403  5 ip6table_filter,ip_tables,iptable_filter,ebtables,ip6_tables
dm_mod                 67053  0
snd_hda_codec_realtek    46034  1
snd_hda_codec_generic    38382  1 snd_hda_codec_realtek
iTCO_wdt                4695  0
iTCO_vendor_support     1624  1 iTCO_wdt
snd_hda_codec_hdmi     31910  1
serio_raw               3929  0
pcspkr                  1715  0
i2c_i801                8397  0
lpc_ich                13312  0
mfd_core                2537  1 lpc_ich
snd_hda_controller     16013  1 snd_hda_intel
snd_hda_codec          71968  5 snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_intel,snd_hda_controller
snd_hwdep               5309  1 snd_hda_codec
snd_pcm                69054  4 snd_hda_codec_hdmi,snd_hda_codec,snd_hda_intel,snd_hda_controller
snd_timer              15325  1 snd_pcm
snd                    49512  8 snd_hda_codec_realtek,snd_hwdep,snd_timer,snd_hda_codec_hdmi,snd_pcm,snd_hda_codec_generic,snd_hda_codec,snd_hda_intel
soundcore               4178  2 snd,snd_hda_codec
battery                12312  0
wmi                     6915  0
tpm_infineon            6274  0
tpm_tis                 6884  0
tpm                    18810  2 tpm_tis,tpm_infineon
acpi_cpufreq            5906  0
i915                  720378  3
video                  11915  1 i915
button                  4229  1 i915
drm_kms_helper         27746  1 i915
drm                   206876  3 i915,drm_kms_helper

Test on latest -queued kernel:
output:
module successfully unloaded
module successfully loaded again

dmesg -r | egrep "<[1-3]>" |grep drm
<3>[   31.720889] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 209c
<3>[   31.726335] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to a090
<3>[   31.945258] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to a090
<3>[   31.951119] [drm:hsw_unclaimed_reg_clear] *ERROR* Unknown unclaimed register before writing to 45404

Comment 61 Daniel Vetter 2014-06-17 23:03:48 UTC

The patch is only in -fixes. The module reload failure sounds like a separate regression.

Comment 62 lu hua 2014-06-18 07:33:40 UTC

(In reply to comment #61)
> The patch is only in -fixes. The module reload failure sounds like a
> separate regression.

Report Bug 80176 about the failure, the failure happens on -fixes and -nightly kernel. So I will verify this bug after bug 80176 fixed.

Comment 63 lu hua 2014-08-01 08:21:47 UTC

Verified.Fixed.

Comment 64 Jari Tahvanainen 2016-09-30 13:25:37 UTC

Closing verified+fixed.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.