Bug 58876

Summary: [GM45 DP-DVI dongle regression] transient failure to read EDID (fbcon fails then X works)
Product: DRI
Reporter: andreas.sturmlechner
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: high
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                                                  Flags
dmesg_3.8git_extdvi-noterm.log                               none
intel-reg-dump-3.8git-noterm.log                             none
dmesg_3.8git_extdvi-drmdebug.log                             none
dmesg_3.8git_extdvi-drmdebug-v2.log                          none
bisect.log                                                   none
20131228-2203_3.13.0-rc4+_dmesg.log                          none
20140201-2343_3.13.0+_dmesg.log                              none
20140609-1030_3.15.0_dmesg-OFF.log                           none
20140609-1030_3.15.0_i915regdump-OFF.log                     none
20140609-1330_3.15.0_dmesg-ON.log                            none
20140609-1330_3.15.0_i915regdump-ON.log                      none
20140609-1410_3.4.92-gentoo_dmesg-ON.log                     none
20140609-1410_3.4.92-gentoo_i915regdump-ON.log               none
20141003-2338_3.17.0-rc7+_dmesg-stop-OFF.log                 none
20141019-2125_3.17.0-rc7+_dmesg-stop-ON.log (drm.debug=6)    none
idle gmbus harder in takeover                                none

Description andreas.sturmlechner 2012-12-30 01:18:12 UTC
Created attachment 72287 [details]
dmesg_3.8git_extdvi-noterm.log

As seen in the attached dmesg output, the following line appears on every boot with kernel 3.8, while 3.6 and 3.7 are OK:

[    0.955012] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to bit banging on pin 5


The following two lines appear in those cases where there is no terminal:

[    1.000013] i915 0000:00:02.0: No connectors reported connected with modes
[    1.000058] [drm] Cannot find any crtc or sizes - going 1024x768


After init, my display goes into standby until X starts, which then works just fine, with the exception of the phenomenon in bug 58867, which is part of every boot.
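
For anyone comparing their own boots against this report, here is a minimal sketch that scans a dmesg dump for the three messages quoted above. The patterns are copied from the lines in this description; the function name `scan_dmesg` and the `SAMPLE` text are only illustrative, and the messages may be worded differently in other kernel versions.

```python
import re

# Patterns taken from the dmesg lines quoted in this report.
SIGNATURES = {
    "gmbus_timeout": re.compile(r"GMBUS \[i915 gmbus \w+\] timed out"),
    "no_connectors": re.compile(r"No connectors reported connected with modes"),
    "fallback_mode": re.compile(r"Cannot find any crtc or sizes"),
}

# The three lines from the description, used here as sample input.
SAMPLE = """\
[    0.955012] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to bit banging on pin 5
[    1.000013] i915 0000:00:02.0: No connectors reported connected with modes
[    1.000058] [drm] Cannot find any crtc or sizes - going 1024x768
"""


def scan_dmesg(text):
    """Return {signature_name: [matching lines]} for a dmesg dump."""
    hits = {name: [] for name in SIGNATURES}
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits


for name, lines in scan_dmesg(SAMPLE).items():
    print(f"{name}: {len(lines)} hit(s)")
```

To check a real boot, the text of a saved dmesg log can be passed to `scan_dmesg` instead of `SAMPLE`.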
Comment 1 andreas.sturmlechner 2012-12-30 01:31:14 UTC
Created attachment 72288 [details]
intel-reg-dump-3.8git-noterm.log

OK, either it is more like 9 in 10 times or this is again related to external display usage - I've lost track of that sometime after the n-th reboot today.
Comment 2 Chris Wilson 2012-12-30 10:25:24 UTC
The issue would appear to be a transient failure to read the EDID, but no idea why.
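
One way to check from userspace whether an EDID read succeeded is to validate the base block the kernel exposes. The sketch below assumes the usual EDID layout (128-byte base block, fixed 8-byte header, all bytes summing to 0 modulo 256). The function name `check_edid_block` is hypothetical, and the sysfs path mentioned afterwards depends on the connector name on a given machine.

```python
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])
EDID_BLOCK_SIZE = 128


def check_edid_block(block):
    """Classify an EDID base block as 'ok' or give the reason it is not usable."""
    if len(block) == 0:
        return "empty"          # nothing was read at all
    if len(block) < EDID_BLOCK_SIZE:
        return "short"          # truncated read
    base = block[:EDID_BLOCK_SIZE]
    if base[:8] != EDID_HEADER:
        return "bad header"
    if sum(base) % 256 != 0:
        return "bad checksum"   # all 128 bytes must sum to 0 mod 256
    return "ok"


# Synthetic block for demonstration only: valid header, zeroed payload,
# and a checksum byte chosen so the block sums to 0 mod 256.
body = EDID_HEADER + bytes(EDID_BLOCK_SIZE - len(EDID_HEADER) - 1)
demo = body + bytes([(-sum(body)) % 256])
print(check_edid_block(demo))
print(check_edid_block(b""))
```

On a running system the raw bytes can typically be read from `/sys/class/drm/<card>-<connector>/edid`; an empty file there would correspond to the `empty` case.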
Comment 3 andreas.sturmlechner 2012-12-30 10:38:10 UTC
Created attachment 72291 [details]
dmesg_3.8git_extdvi-drmdebug.log

dmesg log with drm.debug=6 attached
Comment 4 andreas.sturmlechner 2012-12-30 10:56:01 UTC
Created attachment 72294 [details]
dmesg_3.8git_extdvi-drmdebug-v2.log

now actually containing full dmesg output
Comment 5 andreas.sturmlechner 2012-12-30 17:20:27 UTC
OK, this seems to always work for LVDS, and never for the external display.
Comment 6 andreas.sturmlechner 2012-12-31 21:05:19 UTC
Access to a different setup, new results - it seems I can isolate the trouble to the DVI setup.

In short:

1) LVDS: OK
2) DP* to display: OK
3) DP* + DVI-D-Adaptor to display: no fbcon

* two UltraBase docking stations on different locations


Setup (3) is fine with older kernels - except that it usually needs the BIOS to bring up the connection.
Comment 7 andreas.sturmlechner 2013-01-05 13:32:40 UTC
I have updated to 3.8_rc2 and I'm back to setup (3) for a couple days - it's still the same, and it definitely happens only there. Since it's 100% reproducible I could go on and bisect now, unless someone has an idea what patch could be the problem.
Comment 8 Chris Wilson 2013-01-05 14:57:39 UTC
Hmm, a bisect would be valuable. There were a few DPCD handling patches that aimed to improve handling of dongles, but knowing what caused the regression here may lead to further improvements.
Comment 9 andreas.sturmlechner 2013-01-07 06:34:46 UTC
git bisect has ended with ce4a9cc579381bc70b12ebb91c57da31baf8e3b7 being the first bad commit, which doesn't make any sense to me at all. Maybe the actual bad commit is somewhere close before, but I was already far beyond the point of getting a healthy amount of sleep, so I stopped there. I don't have access to that setup now for a few days.
Comment 10 Daniel Vetter 2013-01-07 15:23:15 UTC
Hm, DPCD shouldn't affect things on the DP-DVI dongle, we'll use the HDMI encoder in that case. Still, a bisect would be really useful to track things down.
Comment 11 andreas.sturmlechner 2013-01-19 15:47:08 UTC
I would like to bisect more, but it's hard. During testing it became apparent that this is another timing-related issue. For several reboots it sometimes works, then at random it doesn't anymore. The same kernel image that seemingly reproduced the behaviour with absolute certainty weeks ago suddenly worked again a few days ago. The only thing I can say for sure is that I never saw that behaviour before the 3.8 merge window.
Comment 12 andreas.sturmlechner 2013-02-02 10:43:19 UTC
Just a small notice that it's still present in 3.8_rc6 (instant 1st boot experience). I haven't had time for more kernel building during the last weeks, and I should really outsource this to a beefier machine in the future - unless Lenovo brings out a non-pathetic successor to the X200s.
Comment 13 andreas.sturmlechner 2013-04-20 11:31:24 UTC
Now something interesting just happened. Coming back to that setup, I had forgotten to re-enable xdm in default runlevel, and once again was greeted with a black screen. Then, after what must have been about half a minute - right before resignation -, suddenly the terminal appeared in a non-native resolution (possibly the one from closed-lid LVDS). I've so far seen that non-native terminal res only when booting open-lid.

Observation made with vanilla-kernel-3.8.7. Any possibly related change to EDID or lid handling?
Comment 14 andreas.sturmlechner 2013-06-09 11:55:30 UTC
I was just trying 3.10 (rc5) for the first time and had a few reboots/cold starts, all with fbcon brought up successfully. Out of experience I won't call it fixed just yet, but it looks good. :)
Comment 15 Daniel Vetter 2013-06-16 11:54:35 UTC
So maybe we are indeed getting better at this DP whack-a-mole game ...
/me is hopeful

Please update this bug once you're confident that it works (or that it broke again).
Comment 16 andreas.sturmlechner 2013-06-16 11:59:05 UTC
Sorry, it has already happened again. :/

Someone with a similar problem (but with LVDS) has it working with i915 as a module. I will look into that, but the next thing will be looking into compile offload for all the bisecting that is going to be needed.
Comment 17 andreas.sturmlechner 2013-07-06 16:34:21 UTC
Created attachment 82124 [details]
bisect.log

Well, I spent the day bisecting but seem to have failed again. Based on the result I produced a revert and applied it over the 3.10.0 sources, but a few reboots later it was the same old trouble again. I'm on the verge of just throwing away that display and being done with it...
Comment 18 andreas.sturmlechner 2013-07-06 16:58:59 UTC
...except that the display is not to blame, as I just reproduced the same failure on the family's other shiny new LCD. Either way, the cost of bisecting this is just way too high, given the countless reboots required to gain *some* *questionable* certainty about whether a commit is good or bad.
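
To put a rough number on that cost: if a bad kernel only fails on a fraction p of boots, the chance that n boots in a row all succeed anyway is (1 - p)^n. The sketch below computes how many clean boots are needed before a kernel can be marked good with a chosen risk of being wrong. It assumes boots are independent with a fixed failure rate, which this bug may well violate, and the function name `boots_needed` is only illustrative.

```python
import math


def boots_needed(fail_rate, max_miss):
    """Smallest n such that (1 - fail_rate) ** n <= max_miss.

    fail_rate: probability that a *bad* kernel shows the failure on one boot.
    max_miss:  accepted probability of wrongly marking a bad kernel as good.
    """
    if not 0 < fail_rate < 1 or not 0 < max_miss < 1:
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(max_miss) / math.log(1 - fail_rate))


# Boots per bisect step for a few assumed failure rates, at 5% risk.
for rate in (0.9, 0.5, 0.2):
    print(f"fail rate {rate:.0%}: {boots_needed(rate, 0.05)} clean boots")
```

The total is this figure multiplied by the number of bisect steps, and a single wrong "good" verdict sends the rest of the bisect in the wrong direction.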
Comment 19 Jani Nikula 2013-09-10 14:04:49 UTC
Another release, another try?
Comment 20 andreas.sturmlechner 2013-09-13 21:30:02 UTC
I was following 3.11 since rc2 and it's all the same. But I don't trust that setup anymore, so I will finally carry over my second docking station just to be sure.
Comment 21 andreas.sturmlechner 2013-09-14 15:18:35 UTC
OK, this isn't funny anymore:

For months, I have been fighting with 3.11 RCs to detect my external display (the behaviour had regressed insofar as, after fbcon, it also didn't bring it up in X anymore), often rebooting several times until it worked.

- so I try the same on another external display: positive, same failure
- so I transfer my other docking station to reproduce it: positive

Then, having ruled out hardware issues with my DVI display, docking stations, DVI cables, I think to myself: let's try out good old 3.4 series, because I can't test this DP-DVI dongle on another system (soon, there'll be a Haswell box to the rescue), so at least I can try to reproduce success on older kernels to rule out a broken dongle.

- so, I build and boot into 3.4.61 once: success
- so, I build and boot into 3.10.11 afterwards: success with fbcon as well as X (??)
- so, I boot again into 3.11.0: success (fbcon and X) (???)


All the while, I have zero problems in my flat where the Thinkpad is connected to a Displayport screen.

Every time I think I could come to a conclusion, there comes my system and hits me right back in my face.
Comment 22 andreas.sturmlechner 2013-09-15 14:49:10 UTC
Today my system is back to normal. No fbcon/X with 3.11 all the time, X only after manually enabling output on the external display (while often it isn't even detected).

The only constant being that 3.4.61/62 (anything pre-3.8) works all the time.
Comment 23 andreas.sturmlechner 2013-11-10 12:41:41 UTC
3.12.0 Update: Blank screen (no fbcon, no X) on second try, so nothing has changed.

Meanwhile, 3.4 (.68) runs great and I'm glad it continues to be supported for some time.
Comment 24 Daniel Vetter 2013-11-11 06:49:26 UTC
(In reply to comment #23)
> 3.12.0 Update: Blank screen (no fbcon, no X) on second try, so nothing has
> changed.
> 
> Meanwhile, 3.4 (.68) runs great and I'm glad it continues to be supported
> for some time.

Hm, can you please try to bisect where this regression has been introduced?
Comment 25 Jani Nikula 2013-12-16 14:24:14 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > 3.12.0 Update: Blank screen (no fbcon, no X) on second try, so nothing has
> > changed.
> > 
> > Meanwhile, 3.4 (.68) runs great and I'm glad it continues to be supported
> > for some time.
> 
> Hm, can you please try to bisect where this regression has been introduced?

Either that, or a retry of drm-intel-nightly http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-nightly which has some DP dongle fixes since the last try.
Comment 26 andreas.sturmlechner 2013-12-27 16:11:13 UTC
I'm a few boots into current drm-intel-nightly, and so far it looks good! Also threw in Linus' git master (insta-fail), while the following reboot into the drm-intel-nightly image was successful once more.
Comment 27 andreas.sturmlechner 2013-12-28 17:16:07 UTC
Too soon once more. Unfortunately, the blank screen is back again...

(In reply to comment #24)
> (In reply to comment #23)
> > 3.12.0 Update: Blank screen (no fbcon, no X) on second try, so nothing has
> > changed.
> > 
> Hm, can you please try to bisect where this regression has been introduced?

I tried 3.8/3.9/3.10 again and now X never comes up when there's no fbcon. So I guess there was not a further regression in kernel, but rather a change in xorg-server.
Comment 28 andreas.sturmlechner 2013-12-28 23:52:27 UTC
Created attachment 91268 [details]
20131228-2203_3.13.0-rc4+_dmesg.log

attaching new dmesg, in one year the output has changed a bit.
Comment 29 Daniel Vetter 2014-01-08 19:28:35 UTC
(In reply to comment #26)
> I'm a few boots into current drm-intel-nightly, and so far it looks good!
> Also threw in Linus' git master (insta-fail), while the following reboot
> into the drm-intel-nightly image was successful once more.

Tentatively closing as working, thanks for reporting this issue and please reopen when it breaks again.
Comment 30 andreas.sturmlechner 2014-01-08 23:46:02 UTC
Sorry, perhaps I didn't state it clearly enough in my last two answers that the issue was back again the day after my tests.
Comment 31 Rodrigo Vivi 2014-01-29 16:34:55 UTC
Hi Andreas,

Could you please retry latest drm-intel-nightly? And attach log please.

Also it would be great if you could bisect between the version that worked on Dec 27 and the one that didn't work on Dec 28, trying to find the patch that reintroduced the issue.

Thanks
Comment 32 Chris Wilson 2014-01-29 16:52:29 UTC
If you look at the log it happened anyway, just the desktop environment queried the configuration so often that the fact that it failed once made no difference. It reported the EDID at the vital time and so everything appeared to work.

i.e. I don't think the problem was ever mysteriously fixed, just bad/good timing.
Comment 33 andreas.sturmlechner 2014-02-01 23:21:53 UTC
Created attachment 93201 [details]
20140201-2343_3.13.0+_dmesg.log

took me a few cold boots, but here is the latest blank screen dmesg log with drm-intel-nightly - would the output of intel_reg_dumper also be useful?

(In reply to comment #31)
> Also it would be great if you could bisect between the version that worked
> on Dec 27 and the one that didn't work on Dec 28, trying to find the patch
> that reintroduced the issue.

Every single kernel image built since 3.8-rc1 has at some point (repeatedly) failed to bring up the DP screen. git bisect between start of merge window and rc1 so far has been like a lottery with that kind of error...
Comment 34 Rodrigo Vivi 2014-06-05 20:29:10 UTC
Yes, I'd like to see the reg dumps before and after the failure.

But also I'm curious about the dmesg output when it *works*. Could you attach a dmesg with drm.debug=0xe from a good case where detection and EDID read go right?

Do suspend/resume sequences affect your results in any way?
Comment 35 andreas.sturmlechner 2014-06-09 11:39:51 UTC
Created attachment 100724 [details]
20140609-1030_3.15.0_dmesg-OFF.log

3.15.0 - no connection - dmesg with drm.debug=0xe
Comment 36 andreas.sturmlechner 2014-06-09 11:40:54 UTC
Created attachment 100725 [details]
20140609-1030_3.15.0_i915regdump-OFF.log

3.15.0 - no connection - intel-gpu-tools-1.6 regdump
Comment 37 andreas.sturmlechner 2014-06-09 11:41:57 UTC
Created attachment 100727 [details]
20140609-1330_3.15.0_dmesg-ON.log

3.15.0 - success - dmesg with drm.debug=0xe
Comment 38 andreas.sturmlechner 2014-06-09 11:43:30 UTC
Created attachment 100728 [details]
20140609-1330_3.15.0_i915regdump-ON.log

3.15.0 - success - intel-gpu-tools-1.6 regdump
Comment 39 andreas.sturmlechner 2014-06-09 12:22:36 UTC
Created attachment 100731 [details]
20140609-1410_3.4.92-gentoo_dmesg-ON.log

3.4.92 - 100% success rate - dmesg with drm.debug=0xe
Comment 40 andreas.sturmlechner 2014-06-09 12:23:57 UTC
Created attachment 100732 [details]
20140609-1410_3.4.92-gentoo_i915regdump-ON.log

3.4.92 - 100% success rate - intel-gpu-tools-1.6 regdump
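
Since the good and bad logs differ in every timestamp, a plain diff is noisy. The sketch below strips the leading `[ seconds.micros]` stamp from each line before comparing, so only lines that actually differ are reported. The sample text is synthetic (one placeholder line plus a line quoted earlier in this report), and `diff_logs` is an illustrative name, not part of any tool mentioned here.

```python
import difflib
import re

# Matches the "[    0.955012] " prefix that dmesg puts on each line.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamps(text):
    """Return the log lines with their leading timestamps removed."""
    return [TIMESTAMP.sub("", line) for line in text.splitlines()]


def diff_logs(good_text, bad_text):
    """Return only the lines unique to one log, prefixed '- ' (good) or '+ ' (bad)."""
    good = strip_timestamps(good_text)
    bad = strip_timestamps(bad_text)
    return [line for line in difflib.ndiff(good, bad)
            if line.startswith(("- ", "+ "))]


# Synthetic input: the first line is a placeholder, not real driver output.
GOOD = "[    0.900000] shared placeholder line\n"
BAD = (
    "[    0.912345] shared placeholder line\n"
    "[    1.000013] i915 0000:00:02.0: No connectors reported connected with modes\n"
)

for line in diff_logs(GOOD, BAD):
    print(line)
```

With the attached logs, the contents of a `dmesg-ON` file and the matching `dmesg-OFF` file would take the place of `GOOD` and `BAD`.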
Comment 41 andreas.sturmlechner 2014-06-09 12:40:32 UTC
(In reply to comment #34)
> Do suspend/resume sequences affect your results in any way?

I do not suspend/resume at all - however, I could look into configuring it since the swap partition is rather bored anyway.

There's also no consistency with cold- or reboots - sometimes 3.15.0 (as a placeholder for any >=3.8-rc1 kernel image) will work after reboot from out of 3.4, sometimes not, but then maybe - not necessarily - a coldboot into 3.15.0 will work fine. However, I don't think it has ever brought up a connection from *only* rebooting from 3.15.0 when it didn't work the first time - though I didn't try that for too long since staring at my constantly standbied display (only waking up for POST and grub2) is a rather sad experience.
Comment 42 Rodrigo Vivi 2014-09-30 20:52:34 UTC
Thanks for the logs and dump. They make me think that something is disabling the primary plane before it shows. Could you please continue the investigation with the following branch:
http://cgit.freedesktop.org/~vivijim/drm-intel/log/?h=58876-investigation
Please attach the logs for the good and bad case. Don't need the reg dumps.
Comment 43 andreas.sturmlechner 2014-10-03 22:15:41 UTC
Created attachment 107287 [details]
20141003-2338_3.17.0-rc7+_dmesg-stop-OFF.log

Thanks for the effort - new log with your branch.

I have not managed to get a 'good case' yet, however interesting stuff. The old match on 'no connectors reported' doesn't work anymore, and there even appear modelines for the external display when I switch between ttys. No signal though.
Comment 44 andreas.sturmlechner 2014-10-19 19:49:24 UTC
Created attachment 108073 [details]
20141019-2125_3.17.0-rc7+_dmesg-stop-ON.log (drm.debug=6)

I couldn't get any output on the external display in weeks, but today I accidentally used drm.debug=6 instead of 0xe, and suddenly there's a screen working as intended. And indeed, switching back and forth from these two debug settings, it either works or not. Oh fun... ;)

As a side note, the above mentioned 'No connectors reported' message is triggered by my trusty old (default) kernel param i915.panel_ignore_lid=0, so no change actually.
Comment 45 andreas.sturmlechner 2014-10-19 21:03:25 UTC
Fun fact: drm.debug=6 seems to raise the chance for screen output _considerably_, compared both to drm.debug=0xe and no drm.debug. Several reboots into the same 3.17 image were successful with that parameter, one unsuccessful.
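
For reference, `drm.debug` is a bitmask, so the two values differ in which message categories get logged. The sketch below decodes a value using the category bits as I understand them from the DRM core headers of that era (CORE 0x1, DRIVER 0x2, KMS 0x4, PRIME 0x8). That mapping is an assumption to verify against the kernel source for a given version, and later kernels define additional bits.

```python
# Assumed category bits for kernels of this period (3.x); later kernels add more.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
}


def decode_drm_debug(value):
    """Return the names of the debug categories enabled by a drm.debug value."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if value & bit]


for value in (6, 0xE):
    print(f"drm.debug={value:#x}: {', '.join(decode_drm_debug(value))}")
```

Under that mapping the two settings differ only in the PRIME category, so any difference in behaviour would more plausibly come from a change in boot timing than from the extra messages themselves. That is a guess, not something established in this report.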
Comment 46 Daniel Vetter 2014-11-27 16:58:08 UTC
Ok, this is definitely just the gmbus controller being pissed somehow and refusing to work. No idea why, but given that it's only happening at boot-up it's probably leftover bios state.
Comment 47 Daniel Vetter 2014-11-27 16:59:03 UTC
Created attachment 110134 [details] [review]
idle gmbus harder in takeover

A quick patch for you to test.
Comment 48 andreas.sturmlechner 2014-12-06 22:34:44 UTC
Thanks for new stuff to try out, I'm finally back at the setup. No change though, mostly bad runs and a few good ones with either no debug param, or drm.debug=6 or drm.debug=0xe, applied over 3.17.4 as well as 3.18-rc8.
Comment 49 Jesse Barnes 2014-12-08 23:18:53 UTC
Daniel, did you forget to add the new location for the i2c reset?  It just looks like you moved the function in the file as-is...
Comment 50 andreas.sturmlechner 2014-12-14 09:57:09 UTC
Given a new version, I would have time today for another round of testing on my weekend setup.
Comment 51 Jesse Barnes 2015-03-03 20:41:21 UTC
Andreas, did you try Daniel's patch?  On looking again I see I missed the fact he added a new function call in there that could help.
Comment 52 andreas.sturmlechner 2015-03-21 15:25:28 UTC
(In reply to Jesse Barnes from comment #51)
> Andreas, did you try Daniel's patch?
Yes I tried, results are in comment #48 - unfortunately no success.

However, once again I have high hopes for a new kernel version - 4.0.0 RCs look good so far! I've had a few reboots today on the troublesome setup, and while the failure was reproducible once more with 3.17.8 (including the latest patch), all good so far with 4.0.0. It looks as if the non-detection cases have been replaced by wrong-resolution detections there - much easier to live with, if it stays that way.
Comment 53 Jesse Barnes 2015-03-23 16:53:07 UTC
Well, we'd still like to fix any resolution detection problems!  Sounds like maybe a failed EDID read.  Can you file a new bug for that with logs against 4.0-rc?

Thanks,
Jesse
