Created attachment 102512 [details]
boot log

System Environment:
--------------------------
Platform: Broadwell
Kernel: drm-intel-nightly/ed4d04defe2c6962efe8f4ba3587a8e69e06d2dd

Bug detailed description:
-------------------------
On a clean boot, the system fails to boot in 3 out of 5 cycles. It happens on Broadwell with the -nightly and -fixes kernels. Over 5 cycles on the -queued kernel, it boots fine.

Reproduce steps:
-------------------------
1. Clean boot the system.
Regression in the -fixes tree? Please bisect.
It also happens on BYT.
This issue exists on HSW, too.
Also what exactly is the failure mode? The boot log freezes at the second console takeover, so check you have:
commit 1bb9e632a0aeee1121e652ee4dc80e5e6f14bcd2
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Jul 8 10:02:43 2014 +0200
    drm/i915: Only unbind vgacon, not other console drivers
(In reply to comment #4)
> Also what exactly is the failure mode? The boot log freezes at the second
> console takeover, so check you have:
>
> commit 1bb9e632a0aeee1121e652ee4dc80e5e6f14bcd2
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date: Tue Jul 8 10:02:43 2014 +0200
>
> drm/i915: Only unbind vgacon, not other console drivers
We have this commit.
==Bisect results==
----------------------------
Bisect shows:
01527b3127997ef6370d5ad4fa25d96847fbf12a is the first bad commit
commit 01527b3127997ef6370d5ad4fa25d96847fbf12a
Author: Clint Taylor <clinton.a.taylor@intel.com>
Date:   Mon Jul 7 13:01:46 2014 -0700
    drm/i915/vlv: T12 eDP panel timing enforcement during reboot
    The panel power sequencer on vlv doesn't appear to accept changes to its T12 power down duration during warm reboots. This change forces a delay for warm reboots to the T12 panel timing as defined in the VBT table for the connected panel.
    Ver2: removed redundant pr_crit(), commented magic value for pp_div_reg
    Ver3: moved SYS_RESTART check earlier, new name for pp_div.
    Ver4: Minor issue changes
    Ver5: Move registration of reboot notifier to edp_connector_init, Added warning comment to handler about lack of PM notification.
Adding details of this commit:
commit 01527b3127997ef6370d5ad4fa25d96847fbf12a
Author:     Clint Taylor <clinton.a.taylor@intel.com>
AuthorDate: Mon Jul 7 13:01:46 2014 -0700
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Wed Jul 9 09:52:14 2014 +0200
This patch is VLV specific - why would it be causing boot failures on other platforms?
Comment on attachment 102512 [details] boot log Yeah I'm not seeing how this patch would affect non-BDW either. All the structures it checks in the pre VLV check look like they ought to be allocated and present. And the log doesn't have a crash in it... So either this is timing related or this is the wrong bisect result.
(In reply to comment #9)
> Comment on attachment 102512 [details]
> boot log
>
> Yeah I'm not seeing how this patch would affect non-BDW either. All the
> structures it checks in the pre VLV check look like they ought to be
> allocated and present. And the log doesn't have a crash in it...
>
> So either this is timing related or this is the wrong bisect result.
As the bug title says, this issue can't be reproduced 100% of the time. For each commit I booted 5 times. If the machine booted successfully with no crash in that round, I considered the commit good. Eventually the bisect showed this commit as the first bad commit. I also tested its parent commit five times and didn't get a crash. Since you both think this bisect is wrong, I will run more boots per commit to get a more credible bisect result. I will reply after my second bisect.
Here is the first bad commit. I must have been very lucky yesterday, because I tested this commit then and it worked well!
commit 1bb9e632a0aeee1121e652ee4dc80e5e6f14bcd2
Author:     Daniel Vetter <daniel.vetter@ffwll.ch>
AuthorDate: Tue Jul 8 10:02:43 2014 +0200
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Wed Jul 9 09:52:13 2014 +0200
    drm/i915: Only unbind vgacon, not other console drivers
    The console subsystem only provides a function to switch to a given console, but we want to actually only switch away from vgacon.
    Unconditionally switching to the dummy console resulted in switching away from fbcon in multi-gpu setups when other gpu drivers are loaded before i915. Then either the reinitialization of fbcon when i915 registers its fbdev emulation or the teardown of the fbcon driver killed the machine.
    So only switch to the dummy console when it's required. Kudos to Chris for the original idea, I've only refined it a bit to still unregister vgacon even when it's currently unused.
    This regression has been introduced in
My simple debugging is below. With the bad commit, the following code takes the wrong path (-fixes branch: 92ae62076957c5904509f755eea0075ad60f74c6). In drivers/video/fbdev/core/fbmem.c, line 1696, the function "static int do_unregister_framebuffer(struct fb_info *fb_info)" returns -EINVAL because of this check: if (i < 0 || i >= FB_MAX || registered_fb[i] != fb_info) More specifically, it is the "registered_fb[i] != fb_info" condition that causes the function to return -EINVAL.
My simple debugging is below; maybe it will help. With the bad commit, the following code takes the wrong path (-fixes branch: 92ae62076957c5904509f755eea0075ad60f74c6). In drivers/video/fbdev/core/fbmem.c, line 1696, the function "static int do_unregister_framebuffer(struct fb_info *fb_info)" returns -EINVAL because of this check: if (i < 0 || i >= FB_MAX || registered_fb[i] != fb_info) More specifically, it is the "registered_fb[i] != fb_info" condition that causes the function to return -EINVAL.
Excuse my crude debug method: I added two log messages around the check.
    printk(KERN_WARNING "fb_info %p registered_fb %p",fb_info,registered_fb[i]);
    if (i < 0 || i >= FB_MAX || ( registered_fb[i] != fb_info ))
        return -EINVAL;
    printk(KERN_WARNING "Pass this line?" );
On HSW, the log shows:
[ 2.082948]fb_info ffff8801454fc800 registered_fb ffff8801454fc800 (this is the last line)
On BDW, the log shows:
[ 27.535818] fb: switching to inteldrmfb from EFI VGA
[ 27.535821] fb_info ffff880149e06000 registered_fb ffff880149e06000
[ 27.673416] Pass this line?
[ 27.673418] Console: switching to colour dummy device 80x25
My previous comment was based on debugging the HSW machine, so it may not be accurate. I would be grateful if someone could take a minute to explain why this happens.
Created attachment 102667 [details] [review] Trace fb comings and goings Try this patch to see what tale it gives for the efifb.
(In reply to comment #15)
> Created attachment 102667 [details] [review]
> Trace fb comings and goings
>
> Try this patch to see what tale it gives for the efifb.
The output is no different. If I redirect the output to the serial port, it shows the following:
[ 26.895983] ACPI: Power Button [PWRF]
[ 26.898922] [drm] Memory usable by graphics device = 4096M
[ 26.898923] [drm] Replacing VGA console driver
[ 27.449211] fb: switching to inteldrmfb from EFI VGA
[ 27.509560] Unregistering EFI VGA framebuffer
[ 27.509577] Console: switching to colour dummy device 80x25
Please retest with latest drm-intel-fixes, that has a fix for vgacon unbinding. Note that the fbcon setup is done with the console_lock held, so if the machine dies in there no printk will reach netconsole or anything else really. You can try to debug with CONFIG_DRM_I915_FBDEV=n to avoid some of the fun, but this will likely not help in this case. Still worth a shot.
(In reply to comment #17)
> Please retest with latest drm-intel-fixes, that has a fix for vgacon
> unbinding.
Boot still fails.
> Note that the fbcon setup is done with the console_lock held, so if the
> machine dies in there no printk will reach netconsole or anything else
> really. You can try to debug with CONFIG_DRM_I915_FBDEV=n to avoid some of
> the fun, but this will likely not help in this case. Still worth a shot.
Sorry, I've confused myself. Are you sure about the bisect result in comment #11? That patch changes the code back to how it was on older kernels in some situations (which fixed a regression). If reverting it helps, it means older kernels (e.g. 3.15) should also have failed to boot.
(In reply to comment #19)
> Sorry I've confused myself. Are you sure about the bisect result in comment
> #11 ?
>
I tested its parent commit (f1e1c2129b79cfdaf07bca37c5a10569fe021abe) 10 times, and every boot succeeded. Starting from this commit (1bb9e632a0aeee1121e652ee4dc80e5e6f14bcd2), the failure rate becomes very high.
If I build the failing kernel with the "debug config", the machine boots successfully. I am attaching both the debug config and the normal config.
Created attachment 102893 [details] normal config
Created attachment 102894 [details] debug config
I noticed some differences between the debug kernel dmesg and the normal kernel dmesg.
Booting with the debug kernel (boots successfully), dmesg looks like this:
[ 5.295520] [drm] Memory usable by graphics device = 2048M
[ 5.299528] [drm:i915_gem_gtt_init] GMADR size = 256M
[ 5.299531] [drm:i915_gem_gtt_init] GTT stolen size = 32M
[ 5.299533] [drm:i915_gem_gtt_init] ppgtt mode: 1
[ 5.299535] [drm] Replacing VGA console driver
[ 5.314627] checking generic (b0000000 7e9000) vs hw (b0000000 10000000)
[ 5.314629] fb: switching to inteldrmfb from EFI VGA
[ 5.317939] Console: switching to colour dummy device 80x25
[ 5.318122] usb 1-5: new high-speed USB device number 3 using xhci_hcd
Booting with the normal kernel (boot failure), dmesg looks like this:
[ 26.898922] [drm] Memory usable by graphics device = 2048M
[ 26.898923] [drm] Replacing VGA console driver
[ 27.449211] fb: switching to inteldrmfb from EFI VGA
Created attachment 102897 [details] diff_configs
Can any developer reproduce this?
Note: This seems to be only reproducible with the normal config. Lu, please don't forget to add such crucial information from internal threads here.
(In reply to comment #25)
> Created attachment 102897 [details]
> diff_configs
Please attach unified diff (i.e. diff -u) since I can't read traditional diff output ;-)
A bunch of things to test:
- Please test 4c2e0990ade3251c9b5770aa8f06b06375b66f9f with the normal config extensively and make sure it really works. This is the parent of the commit which the offending patch tried to fix, so should have the same behaviour really.
- Please test a4de05268e674e8ed31df6348269e22d6c6a1803 with the normal config extensively and make sure it really works. This is the patch which introduced the regression the offending patch claims to fix.
- Please take the normal config and change config options step-by-step (yeah, this will take time so lower priority) until the kernel is stable. Then report which config option makes the kernel stable. Since it usually just takes a few reboots to hit the problem on broken kernels it's better to start with the broken .config: That way you don't have to boot 10+ times to make sure it really works, but can proceed to the next config option after the first hang.
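The step-by-step config comparison requested above can be narrowed down by first listing only the options that differ between the stable and the broken config. The following is a minimal sketch, assuming both files use the usual `CONFIG_X=value` / `# CONFIG_X is not set` format. The function names and the sample option values are hypothetical and are not taken from the attached configs or from any kernel tooling.

```python
import re

# Matches "CONFIG_FOO=value" and "# CONFIG_FOO is not set" lines.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value} for a kernel .config; unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def config_diff(good_text, bad_text):
    """Return sorted (option, good_value, bad_value) for options that differ.

    Simplification: an option missing from one file is treated as 'n'.
    """
    good, bad = parse_config(good_text), parse_config(bad_text)
    return [
        (name, good.get(name, "n"), bad.get(name, "n"))
        for name in sorted(set(good) | set(bad))
        if good.get(name, "n") != bad.get(name, "n")
    ]


# Hypothetical sample data, for illustration only.
debug_cfg = "CONFIG_FB_EFI=y\nCONFIG_PROVE_LOCKING=y\nCONFIG_DEBUG_KERNEL=y\n"
normal_cfg = (
    "CONFIG_FB_EFI=y\n"
    "# CONFIG_PROVE_LOCKING is not set\n"
    "# CONFIG_DEBUG_KERNEL is not set\n"
)
for name, good, bad in config_diff(debug_cfg, normal_cfg):
    print(name, good, "->", bad)
```

Each option in the resulting list is a candidate to flip in the broken config, one at a time, as described in the last bullet.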
> (In reply to comment #25) > > Created attachment 102897 [details] > > diff_configs > > Please attach unified diff (i.e. diff -u) since I can't read traditional > diff output ;-) > diff-u_configs.log is attached. :)
Created attachment 102900 [details] diff-u_configs
(In reply to comment #27)
> A bunch of things to test:
> - Please test 4c2e0990ade3251c9b5770aa8f06b06375b66f9f with the normal
> config extensively and make sure it really works. This is the parent of the
> commit which the offending patch tried to fix, so should have the same
> behaviour really.
-fixes branch (4c2e0990ade3251c9b5770aa8f06b06375b66f9f): I rebooted the machine at least 10 times and did not hit the issue.
> - Please test a4de05268e674e8ed31df6348269e22d6c6a1803 with the normal
> config extensively and make sure it really works. This is the patch
> which introduced the regression the offending patch claims to fix.
-fixes branch (a4de05268e674e8ed31df6348269e22d6c6a1803): I rebooted the machine at least 10 times and did not hit the issue.
More information about the failure rate of this issue. The four platforms BYT/HSW/BDW/BSW all show a similar failure rate:
1. With the debug mode config, the failure rate is about 2%.
2. With the non-debug mode config, the failure rate is about 50%.
(In reply to comment #31) > Update more info about this issue fail rate: > BYT/HSW/BDW/BSW four platforms have the similar fail rate as below: > 1. With debug mode config, the fail rate is about 2% > 2. With non-debug mode config, the fail rate is about 50% So can we please recheck the bisect? Afaik we've only done about 10 boot tests thus far each time, and it looks like a 2% failure rate could have crept through.
(In reply to comment #32)
> (In reply to comment #31)
> > Update more info about this issue fail rate:
> > BYT/HSW/BDW/BSW four platforms have the similar fail rate as below:
> > 1. With debug mode config, the fail rate is about 2%
> > 2. With non-debug mode config, the fail rate is about 50%
>
> So can we please recheck the bisect? Afaik we've only done about 10 boot
> tests thus far each time, and it looks like a 2% failure rate could have
> crept through.
I suggest focusing on non-debug mode, in which case 10 boots should be enough. A 2% failure rate is too hard to bisect.
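The argument about how many boots are enough can be made concrete with a little arithmetic. This is a minimal sketch, assuming every boot fails independently with a constant probability (which may not hold for a timing-dependent hang). The function names are hypothetical.

```python
import math


def miss_probability(failure_rate, boots):
    """Probability that `boots` independent boots all succeed on a bad kernel."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if boots < 0:
        raise ValueError("boots must be non-negative")
    return (1.0 - failure_rate) ** boots


def boots_needed(failure_rate, max_miss=0.05):
    """Smallest number of boots so that a bad kernel is missed with
    probability at most `max_miss`."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < max_miss < 1.0:
        raise ValueError("max_miss must be strictly between 0 and 1")
    return math.ceil(math.log(max_miss) / math.log(1.0 - failure_rate))


for rate in (0.02, 0.50):
    print(
        f"failure rate {rate:.0%}: "
        f"P(10 clean boots on a bad kernel) = {miss_probability(rate, 10):.4f}, "
        f"boots needed for a 5% miss chance = {boots_needed(rate)}"
    )
```

Under these assumptions, 10 clean boots would wrongly mark a bad kernel as good about 82% of the time at a 2% failure rate, but only about 0.1% of the time at a 50% failure rate, which matches the suggestion to bisect with the non-debug config.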
I am confused. You've provided configs and diffs which do not just work on any of the mentioned SHAs in this thread that I can find. There has been a lot of information. I would like the following table filled out, with the config files (not a config + a diff) attached.
Fill in the table:
SHA                 | config                    | failures per 10 boots
--------------------------------------------------------------------
<SHA of failing>    | <config name of failing>  | <pass rate of failing>
<SHA of good>       | <config name of good>     | <pass rate of good>
<SHA of any others> | <config names>            | <pass rate>
And please list the BIOS versions and platforms you are using in that table.
Created attachment 103243 [details] normal.config for attempt to reproduce 1bb9e63 The attached config file was originally based off of https://bugs.freedesktop.org/attachment.cgi?id=102893. That config with the diff seemed incomplete. 1bb9e63 | normal.config | 10/10 pass
(In reply to comment #36) > Created attachment 103243 [details] > normal.config for attempt to reproduce 1bb9e63 > > The attached config file was originally based off of > https://bugs.freedesktop.org/attachment.cgi?id=102893. That config with the > diff seemed incomplete. > > 1bb9e63 | normal.config | 10/10 pass Bios version: 77 Platform: WTM2
-fixes
f1e1c212 | normal.config | 79/79 pass
-nightly
411fa8b2 | debug.config  | 10/10 pass
411fa8b2 | normal.config | 3/10 pass
411fa8b2 | distri.config | 10/10 pass
BIOS: HSWLPTU1.86C.0135.R01.1311020052
Platform: SDS
> - Please take the normal config and change config options step-by-step
> (yeah, this will take time so lower priority) until the kernel is stable.
> Then report which config option makes the kernel stable. Since it usually
> just takes a few reboots to hit the problem on broken kernels it's better to
> start with the broken .config: That way you don't have to boot 10+ times to
> make sure it really works, but can proceed to the next config option after
> the first hang.
Hi, according to my preliminary config bisect, the two options below reduce the boot failure rate when they are added to "normal.config":
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_KERNEL=y
I tested on one machine:
-fixes
f1e1c212 | normal.config | 21/21 pass
1bb9e632 | normal.config | 13/14 pass
-nightly
411fa8b2 | normal.config | 18/19 pass
BIOS: V80.R01
Platform: WTM
On another machine:
1bb9e632 | normal.config | 2/5 pass
BIOS: V83.R00
Platform: WTM
(In reply to comment #40)
> I tested on one machine:
> -fixes
> f1e1c212 | normal.config | 21/21 pass
> 1bb9e632 | normal.config | 13/14 pass
>
> -nightly
> 411fa8b2 | normal.config | 18/19 pass
>
> BIOS: V80.R01
> platform : WTM
>
> On another machine:
> 1bb9e632 | normal.config | 2/5 pass
>
> BIOS: V83.R00
> platform: WTM
Updated machine info.
I tested on one BDW machine:
-fixes
f1e1c212 | normal.config | 21/21 pass
1bb9e632 | normal.config | 13/14 pass
-nightly
411fa8b2 | normal.config | 18/19 pass
BIOS: V80.R01
Platform: WTM2
On another BDW machine:
1bb9e632 | normal.config | 2/5 pass
BIOS: V83.R00
Platform: WTM2
(In reply to comment #41)
> Update machine info.
> I tested on one BDW machine:
> -fixes
> f1e1c212 | normal.config | 21/21 pass
> 1bb9e632 | normal.config | 13/14 pass
>
> -nightly
> 411fa8b2 | normal.config | 18/19 pass
>
> BIOS: V80.R01
> platform : WTM2
>
> On another BDW machine:
> 1bb9e632 | normal.config | 2/5 pass
>
> BIOS: V83.R00
> platform: WTM2
This time I have confirmed that all the machine info is correct. I am very sorry for my mistake.
First BDW machine:
-fixes
f1e1c212 | normal.config | 21/21 pass
1bb9e632 | normal.config | 13/14 pass
-nightly
411fa8b2 | normal.config | 18/19 pass
BIOS: V80
Platform: WTM1
Second BDW machine:
1bb9e632 | normal.config | 2/5 pass
BIOS: V82.R00
Platform: STP
Third BDW machine:
1bb9e632 | normal.config | 19/20 pass
BIOS: V82.R00
Platform: STP
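Result rows in the "SHA | config | passed/total pass" form used above are easy to tabulate mechanically. The following is a minimal sketch that parses such rows and pools them per (SHA, config) pair. The helper names are hypothetical, and pooling across machines is an assumption that hides the per-machine differences reported in this thread.

```python
import re
from collections import defaultdict

# "passed/total" with optional spaces, e.g. "13/14" or "3 /10".
_RESULT = re.compile(r"(\d+)\s*/\s*(\d+)")


def parse_rows(text):
    """Parse 'sha | config | passed/total pass' rows; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue
        m = _RESULT.search(parts[2])
        if not m:
            continue
        rows.append((parts[0], parts[1], int(m.group(1)), int(m.group(2))))
    return rows


def failure_rates(rows):
    """Pool rows per (sha, config); return {key: (failures, boots, rate)}."""
    totals = defaultdict(lambda: [0, 0])
    for sha, config, passed, total in rows:
        totals[(sha, config)][0] += total - passed
        totals[(sha, config)][1] += total
    return {
        key: (fails, boots, fails / boots if boots else 0.0)
        for key, (fails, boots) in totals.items()
    }


# Rows copied from the three BDW machines in the comment above.
report = """\
f1e1c212 | normal.config | 21/21 pass
1bb9e632 | normal.config | 13/14 pass
411fa8b2 | normal.config | 18/19 pass
1bb9e632 | normal.config | 2/5 pass
1bb9e632 | normal.config | 19/20 pass
"""
for (sha, config), (fails, boots, rate) in failure_rates(parse_rows(report)).items():
    print(f"{sha} {config}: {fails}/{boots} failed ({rate:.1%})")
```

Keeping the per-machine rows separate instead of pooling them would be the better choice when the failure rate depends on the machine, as the 2/5 versus 19/20 results suggest.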
Created attachment 103329 [details] [review] insert delay Please test this quick hack.
Also please test an otherwise broken kernel config with CONFIG_FB_EFI=n.
(In reply to comment #43)
> Created attachment 103329 [details] [review]
> insert delay
>
> Please test this quick hack.
This patch didn't work. When the machine crashed, I didn't see any of the log messages you added in the patch, or I missed them because the screen output scrolled by too quickly. If you need it, I can attach the output later.
(In reply to comment #44)
> Also please test an otherwise broken kernel config with CONFIG_FB_EFI=n.
This config works. I booted about 10 times with CONFIG_FB_EFI=n without a boot failure. After I changed CONFIG_FB_EFI=n back to CONFIG_FB_EFI=y, I booted 3 times and all 3 failed.
(In reply to comment #46)
> (In reply to comment #44)
> > Also please test an otherwise broken kernel config with CONFIG_FB_EFI=n.
> This amazing config works. I tested about 10 times without boot failure when
> using CONFIG_FB_EFI=n. I tested 3 times all failed, after I change to
> CONFIG_FB_EFI=y from CONFIG_FB_EFI=n.
Can you please make _really_ sure that this helps on all machines? This is a tricky bug so I don't want to jump to conclusions.
(In reply to comment #47)
> (In reply to comment #46)
> > (In reply to comment #44)
> > > Also please test an otherwise broken kernel config with CONFIG_FB_EFI=n.
> > This amazing config works. I tested about 10 times without boot failure when
> > using CONFIG_FB_EFI=n. I tested 3 times all failed, after I change to
> > CONFIG_FB_EFI=y from CONFIG_FB_EFI=n.
>
> Can you please make _really_ sure that this helps on all machines? This is a
> tricky bug so I don't want to jump to conclusion.
I retested a total of 272 times with CONFIG_FB_EFI=n on the machine that reproduces this bug most easily, and 10 times on each of the other two machines. All of them booted successfully. It is hard to be sure this config works on _all_ machines, because the issue does not reproduce on every reboot, even on the machine that hits it most easily. If this is not enough to show that the config works, I will need some time to satisfy your requirement. :)
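How much a run of failure-free boots actually proves can be estimated with a one-sided bound. This is a minimal sketch, assuming boots are independent and the failure rate is constant. The function name is hypothetical.

```python
def upper_bound_failure_rate(clean_boots, confidence=0.95):
    """Largest failure rate still consistent with `clean_boots` failure-free boots.

    Solves (1 - p) ** clean_boots = 1 - confidence for p, so any true failure
    rate above the returned value would have produced at least one failure
    with probability greater than `confidence`.
    """
    if clean_boots <= 0:
        raise ValueError("clean_boots must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_boots)


for boots in (10, 272):
    bound = upper_bound_failure_rate(boots)
    print(f"{boots} clean boots: failure rate below {bound:.2%} at 95% confidence")
```

Under these assumptions, 272 clean boots bound the remaining failure rate at roughly 1.1%, while 10 clean boots only bound it at roughly 26%. The long run on the easily failing machine is therefore strong evidence, and the 10-boot runs on the other two machines are much weaker.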
These days this issue no longer bothers us, since we build the kernel with CONFIG_FB_EFI=n. But we must also set CONFIG_FB_VESA=n to avoid boot failures on some machines. I think we are just hiding this issue instead of fixing it.
Given comment 48 in this bug, why is the bug not closed? If not closed, what is the next step to close it?
Created attachment 104073 [details] [review] Avoid fbcon memory corruption Considering that the bug has been narrowed down to a timing issue in takeover from efifb, it could just be bug 72765 - and should be fixed by the attached.
Created attachment 104117 [details]
boot log(patch)
(In reply to comment #51)
> Created attachment 104073 [details] [review]
> Avoid fbcon memory corruption
>
> Considering that the bug has been narrowed down to a timing issue in
> takeover from efifb, it could just be bug 72765 - and should be fixed by the
> attached.
With this patch applied, it still fails.
Daniel, what do you think about the bisected patch (mentioned in comment #11)?
(In reply to comment #53) > Daniel, how do you think about the bisected patch (mentioned in comment#11)? It's a red herring, I think it's clear now that EFIFB is the culprit and we've just been unlucky with the timing.
Adding Peter Jones, who is the efifb maintainer.
Dropping the regression marker on this one since it's just EFIFB being broken, and my patch just made it a bit easier to hit on a few machines. Still treat it as P1 since it's blocking QA.
And de-assigning since (due to lack of an efi system here that's already set up) I'm probably not the best guy to look at this right now.
Possible duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=86671
Please retest with current drm-intel-nightly, I'm guessing this may be fixed by
commit 0485c9dc24ec0939b42ca5104c0373297506b555
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Nov 14 10:09:49 2014 +0100
    drm/i915: Kick fbdev before vgacon
Also see bug 82439.
Tested 10 cycles on BYT and BDW; it works well. Closing.
Verified. Fixed.
Closing old verified.