Bug 107784 - [AMD tahiti XT] displayport broken
Summary: [AMD tahiti XT] displayport broken
Status: RESOLVED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu (show other bugs)
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
: highest blocker
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-02 17:22 UTC by Sylvain BERTRAND
Modified: 2018-09-10 09:15 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
kernel log (113.04 KB, text/plain)
2018-09-02 17:28 UTC, Sylvain BERTRAND
no flags Details
kernel log from the bad commit (121.45 KB, text/plain; charset=ISO-8859-1)
2018-09-05 21:52 UTC, Sylvain BERTRAND
no flags Details
kernel log from a good commit (51.98 KB, text/plain; charset=ISO-8859-1)
2018-09-05 21:54 UTC, Sylvain BERTRAND
no flags Details

Description Sylvain BERTRAND 2018-09-02 17:22:15 UTC
linux 4.19-rc1+ branch: amd-staging-drm-next commit:d0a96214993c5dad9c2a54b888209f0f5cafd060

amdgpu is unable to program my displayport monitor anymore (was working in 4.18-rc1+)

see the error messages in the kernel log
Comment 1 Sylvain BERTRAND 2018-09-02 17:28:47 UTC
Created attachment 141416 [details]
kernel log
Comment 2 Michel Dänzer 2018-09-03 09:17:30 UTC
Can you bisect?

P.S. Please enable CONFIG_UNWINDER_ORC in your kernel build, to make the backtraces in dmesg more useful.
Comment 3 Sylvain BERTRAND 2018-09-03 10:30:34 UTC
On Mon, Sep 03, 2018 at 09:17:30AM +0000, bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=107784
> 
> --- Comment #2 from Michel Dänzer <michel@daenzer.net> ---
> Can you bisect?

I don't think it is reasonable because something bad happens between the 4.18
to 4.19 jump, then I would have to bisect on thousands of commits.
Additionnaly, I had to remove my git repo of 4.18 and clone again 4.19 because
git was unable to perform the pull, even with reset and force (I probably don't
know some git magic here): lost the last working 4.18 commit.

I would need to know what drm/amdgpu related commits where added from
4.18->4.19, the person who did this has this knowledge.

I'll try to fool around with git to see if I can manage something in the mean
time (with dates and targetting drm/amdgpu directories).

> P.S. Please enable CONFIG_UNWINDER_ORC in your kernel build, to make the
> backtraces in dmesg more useful.

ORC breaks linux build. Only guess compile (actually kbuild and kconfig are
quite broken in this 4.19), but it's an optimized build without symbols.
Comment 4 Sylvain BERTRAND 2018-09-03 14:55:27 UTC
I did a manual bisection and got lucky:

faulty commit: 019cddc88f9e4ae0de2c76802f7137210c2101aa on amd-staging-drm-next

this commit is i2c related, right before the commit
5e8704ac1cfa9fceef94fcc457e18613b1589b34 which is the drm-next commit.
Comment 5 Michel Dänzer 2018-09-05 07:19:37 UTC
(In reply to Sylvain BERTRAND from comment #4)
> faulty commit: 019cddc88f9e4ae0de2c76802f7137210c2101aa on
> amd-staging-drm-next

That's a pure merge commit, so it's unlikely to be the one that actually caused the problem.

Without a plausible bisection result or at least usable backtraces in dmesg, no progress can be made here.
Comment 6 Sylvain BERTRAND 2018-09-05 11:50:58 UTC
I tested several times that:
	 - this commit is failing.
	 - the commit right before this one was working.

Then, I can reasonably say, this commit, whatever its type, is the one breaking
displayport programming.

Then, the code breaking displayport would be here: how hard this is to figure
out what code related to displayport was modified?

I can test again several times that this is the faulty commit.
Comment 7 Felix Schwarz 2018-09-05 12:28:41 UTC
In reply to Sylvain BERTRAND from comment #6)
> I tested several times that:
> 	 - this commit is failing.
> 	 - the commit right before this one was working.
> 
> Then, I can reasonably say, this commit, whatever its type, is the one
> breaking displayport programming.

A merge commit has two parents so you need to test both. It is possible that a merge commit did indeed introduce the bug but *highly* unlikely. (apologies if you knew all of this)

Sylvain: Are you familiar with "git bisect"? "git bisect" automatically take care of merge commits. Also it minimizes the number of tests you need to do.
Comment 8 Felix Schwarz 2018-09-05 12:33:38 UTC
Bugzilla: The tool which preserves all my stupid errors for eternity. :-/

Sorry, I just read the previous comments and of course you used git bisect already... Still you should test both parents of a merge commit to be sure.

When my git bisect ends at a merge commit, usually I messed up the build during bisection (e.g. running a previous build by mistake).
Comment 9 Sylvain BERTRAND 2018-09-05 13:38:34 UTC
Ok, I got it. Since my git knowledge is quite limited, this "merge" commit is
opening a vast sea of new commits to test. 

I'll dive into bisection using bisect (which actually deals with those merge
commits). I am a bit scared of the amount of commits, may take hours/days.

Please, in the foreseable futur, do not make amd-staging-drm-next lag that
much.

Coming back once I get the faulty _simple_ commit, wish me luck.
Comment 10 Felix Schwarz 2018-09-05 17:44:25 UTC
(In reply to Sylvain BERTRAND from comment #9)
> Ok, I got it. Since my git knowledge is quite limited, this "merge" commit is
> opening a vast sea of new commits to test. 
> 
> I'll dive into bisection using bisect (which actually deals with those merge
> commits). I am a bit scared of the amount of commits, may take hours/days.

Don't worry: If your problem is immediately obvious after booting this should be not too bad. You can also use the versions you build manually to "seed" the git bisection process. That way you reduce the number of bisection steps a bit.

If you encounter non-bootable kernels please use "git bisect skip" - this is important! DO NOT mark as "good"/"bad" in these cases.

> Please, in the foreseable futur, do not make amd-staging-drm-next lag that
> much.

I think the "drm-next" thing was mandated by Linus so there is not much which can be done at this point. But anyway if you have an easily reproducible problem bisection is not too bad.
Comment 11 Sylvain BERTRAND 2018-09-05 21:48:30 UTC
bisected: e2a9ca29b5edc89da2fddeae30e1070b272395c5

This commit is one in a series related to new TSC code.

I tried to switch the clocksource to hpet early in the boot process, did not
change anything.

Any ideas before I post an issue on kernel bugzilla?
Comment 12 Sylvain BERTRAND 2018-09-05 21:52:28 UTC
Created attachment 141464 [details]
kernel log from the bad commit
Comment 13 Sylvain BERTRAND 2018-09-05 21:54:21 UTC
Created attachment 141465 [details]
kernel log from a good commit
Comment 14 Michel Dänzer 2018-09-06 10:04:53 UTC
(In reply to Sylvain BERTRAND from comment #11)
> bisected: e2a9ca29b5edc89da2fddeae30e1070b272395c5
> 
> This commit is one in a series related to new TSC code

That's not consistent with the merge commit you identified earlier. So I'm afraid it's likely that you incorrectly classified some commits as good or bad. Maybe the problem doesn't occur 100% consistently even with bad commits, so try testing longer / more times before declaring a commit as good.
Comment 15 Sylvain BERTRAND 2018-09-06 12:31:56 UTC
On Thu, Sep 06, 2018 at 10:04:53AM +0000, bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=107784
> 
> --- Comment #14 from Michel Dänzer <michel@daenzer.net> ---
> That's not consistent with the merge commit you identified earlier. So I'm
> afraid it's likely that you incorrectly classified some commits as good or bad.
> Maybe the problem doesn't occur 100% consistently even with bad commits, so try
> testing longer / more times before declaring a commit as good.

Not consistent? Could you be more specific? Some git magic I forgot again?
This time I used git bisect to go through "merge" commits properly.

I did test countless times those commits: that would mean this TSC code would
"side-effect" an ultra-rare bad condition into an nearly-all-the-time bad
condition. That amount of bad luck?

Whatever, I'll update amd-staging-drm-next, and go through a full bisection
again. I'll need probably days (lucky: I don't work).
Comment 16 Michel Dänzer 2018-09-06 14:22:18 UTC
(In reply to Sylvain BERTRAND from comment #15)
> Not consistent? Could you be more specific?

You wrote:

(In reply to Sylvain BERTRAND from comment #6)
> 	 - this commit is failing.
> 	 - the commit right before this one was working.

"this commit" being 019cddc88f9e4ae0de2c76802f7137210c2101aa (the I2C merge), which has two parents. Both of those parent commits contain commit e2a9ca29b5edc89da2fddeae30e1070b272395c5 (a TSC commit) as part of their history. So you previously considered commit e2a9ca29b5edc89da2fddeae30e1070b272395c5 as both bad and good. That's the inconsistency.

This most likely means that you're not yet able to reliably determine that a given commit is bad, e.g. due to not testing (long) enough.
Comment 17 Sylvain BERTRAND 2018-09-06 16:20:19 UTC
On Thu, Sep 06, 2018 at 02:22:18PM +0000, bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=107784
> 
> --- Comment #16 from Michel Dänzer <michel@daenzer.net> ---
> "this commit" being 019cddc88f9e4ae0de2c76802f7137210c2101aa (the I2C merge),
> which has two parents. Both of those parent commits contain commit
> e2a9ca29b5edc89da2fddeae30e1070b272395c5 (a TSC commit) as part of their
> history. So you previously considered commit
> e2a9ca29b5edc89da2fddeae30e1070b272395c5 as both bad and good. That's the
> inconsistency.
> 
> This most likely means that you're not yet able to reliably determine that a
> given commit is bad, e.g. due to not testing (long) enough.

Wow! Then it is even worse of what I thought. Due to the violent leap from 4.18
to 4.19, there are zillions of commits, and even nlog(n) bisect will give me
ten_s_ of commits to test.

Please, could you refine your "long enough" for a blocker pb which happens at boot,
once xorg tries to program my displayport screen. That would be based on your
experience, something to give me the order of the "long enough".

That said, I have a hinch. I am going to setup a much cleaner test env: it's a
custom distro which boots in _really_ a few seconds (not in the range of most
mainstream distros boot time)-->I am going to slow it down, on purpose
(certainly in more than 1 spot). Then, I have an efi framebuffer and I saw some
issues about this->I am going to get rid of it. Then, I am not confident in my
monitor (see my other bugs), I may use the previous artificial slow down, to
power cycle the monitor, before xorg tries to detect and program it. Well, I'll
try to figure a way to put my monitor in a "probably" cleaner state (in respect
of displayport hotplug support). Oh, and just in case of, I'll stick to the
performance cpu governor.

If you have any advice about this based on your experience at knowledge , which
I cannot match, I'm all eyes and hears.
Comment 18 Sylvain BERTRAND 2018-09-06 21:10:27 UTC
Got something, sort of obvious for a human, impossible for bisect to know,
which explains why this bisection was a failure:
testing a commit in the middle of a series of commits which are to be taken as
a whole to be consistent for normal operations of the kernel, is wrong. That,
bisect cannot know.

In my case, the TSC commits are "good"... and testing time has nothing to do
with this: testing a commit in the middle of this series will have a side
effect (DP link/aux timings...) on what I am observing to decide if a commit is
"bad" or "good". In my case it will break this observable (my DP monitor is
_really_ not working) and I'll tag the commit as "bad", which is inconsistent.

The bottom of this, which is _obvious_ for an experienced git user on large
projects, at the time of big merges on the main branch of a project like linux,
if something breaks, bisect is probably more your enemy than your friend: it is
ok to ask "Can you bisect?" on a sub-sub-system branch which has been free from
big merges from any related other subsystems for a good while, but almost an
insult on such massive update.
Comment 19 Sylvain BERTRAND 2018-09-07 21:48:19 UTC
Breaking commit is a merge commit from upstream mainline:
--------
commit 13e091b6dd0e78a518a7d8756607d3acb8215768
Merge: eac341194426 1088c6eef261
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Aug 13 18:28:19 2018 -0700

    Merge branch 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    
    Pull x86 timer updates from Thomas Gleixner:
     "Early TSC based time stamping to allow better boot time analysis.
    
      This comes with a general cleanup of the TSC calibration code which
      grew warts and duct taping over the years and removes 250 lines of
      code. Initiated and mostly implemented by Pavel with help from various
      folks"
    
    * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
      x86/kvmclock: Mark kvm_get_preset_lpj() as __init
      x86/tsc: Consolidate init code
      sched/clock: Disable interrupts when calling generic_sched_clock_init()
      timekeeping: Prevent false warning when persistent clock is not available
      sched/clock: Close a hole in sched_clock_init()
      x86/tsc: Make use of tsc_calibrate_cpu_early()
      x86/tsc: Split native_calibrate_cpu() into early and late parts
      sched/clock: Use static key for sched_clock_running
      sched/clock: Enable sched clock early
      sched/clock: Move sched clock initialization and merge with generic clock
      x86/tsc: Use TSC as sched clock early
      x86/tsc: Initialize cyc2ns when tsc frequency is determined
      x86/tsc: Calibrate tsc only once
      ARM/time: Remove read_boot_clock64()
      s390/time: Remove read_boot_clock64()
      timekeeping: Default boot time offset to local_clock()
      timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
      s390/time: Add read_persistent_wall_and_boot_offset()
      x86/xen/time: Output xen sched_clock time from 0
      x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
      ...
--------
If I revert this merge commit on amd-staging-drm-next branch, my displayport monitor works again.

In the 1088c6eef261 branch, the one merged in mainline, displayport programming
stopped working on the following commit:
--------
commit e2a9ca29b5edc89da2fddeae30e1070b272395c5
Author: Pavel Tatashin <pasha.tatashin@oracle.com>
Date:   Thu Jul 19 16:55:39 2018 -0400

    x86/tsc: Initialize cyc2ns when tsc frequency is determined
    
    cyc2ns converts tsc to nanoseconds, and it is handled in a per-cpu data
    structure.
    
    Currently, the setup code for c2ns data for every possible CPU goes through
    the same sequence of calculations as for the boot CPU, but is based on the
    same tsc frequency as the boot CPU, and thus this is not necessary.
    
    Initialize the boot cpu when tsc frequency is determined. Copy the
    calculated data from the boot CPU to the other CPUs in tsc_init().
    
    In addition do the following:
    
     - Remove unnecessary zeroing of c2ns data by removing cyc2ns_data_init()
    
     - Split set_cyc2ns_scale() into two functions, so set_cyc2ns_scale() can be
       called when system is up, and wraps around __set_cyc2ns_scale() that can
       be called directly when system is booting but avoids saving restoring
       IRQs and going and waking up from idle.
----
I did not check if this issue was known from upstream yet.
Comment 20 Sylvain BERTRAND 2018-09-09 03:46:16 UTC
Found the bug, very probably.

It seems to be an upstream bug: a 32bits multiplication overflow on TSC
frequency introduced in recent TSC cleanup:

--------------------------------------------------------------------------------
commit cf7a63ef4e0203f6f33284c69e8188d91422de83
Author: Pavel Tatashin <pasha.tatashin@oracle.com>
Date:   Thu Jul 19 16:55:38 2018 -0400

    x86/tsc: Calibrate tsc only once
    
    During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
    and the second time in tsc_init().
    
    Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
    the calibration is done only early, and make tsc_init() to use the values
    already determined in tsc_early_init().
    
    Sometimes it is not possible to determine tsc early, as the subsystem that
    is required is not yet initialized, in such case try again later in
    tsc_init().
--------------------------------------------------------------------------------

One nasty thing: this commit is broken, it does not compile, hence it's a
bisect "skipped" commit.

Roughly, if you have a cpu with a frequency above 4.2GHz (max unsigned 32bits),
linux time subsystem gets broken leading to the timeouts in displayport
programming. Ofc, my cpu runs at 4.7GHz.
Comment 21 Dmitrii Tcvetkov 2018-09-09 05:57:08 UTC
(In reply to Sylvain BERTRAND from comment #20)
> Roughly, if you have a cpu with a frequency above 4.2GHz (max unsigned
> 32bits),
> linux time subsystem gets broken leading to the timeouts in displayport
> programming. Ofc, my cpu runs at 4.7GHz.

I hit the bug on AMD FX-9590 machine with RX 580 and 2 displayport monitors. As for you bisect wasn't successful for me and led me to network merge commit. 

Looks like you're right! If I downclock the CPU to 4.2 GHz then current mainline master (f8f65382c98a28e3c2b20) boots fine.
Comment 22 Dmitrii Tcvetkov 2018-09-09 13:05:06 UTC
(In reply to Dmitrii Tcvetkov from comment #21)
> (In reply to Sylvain BERTRAND from comment #20)
> > Roughly, if you have a cpu with a frequency above 4.2GHz (max unsigned
> > 32bits),
> > linux time subsystem gets broken leading to the timeouts in displayport
> > programming. Ofc, my cpu runs at 4.7GHz.
> 
> I hit the bug on AMD FX-9590 machine with RX 580 and 2 displayport monitors.
> As for you bisect wasn't successful for me and led me to network merge
> commit. 
> 
> Looks like you're right! If I downclock the CPU to 4.2 GHz then current
> mainline master (f8f65382c98a28e3c2b20) boots fine.

Relevant discussion about cf7a63ef4e0203f (x86/tsc: Calibrate tsc only once) http://lkml.iu.edu/hypermail/linux/kernel/1809.0/03226.html
Comment 23 Dmitrii Tcvetkov 2018-09-09 13:11:19 UTC
> Relevant discussion about cf7a63ef4e0203f (x86/tsc: Calibrate tsc only once)
> http://lkml.iu.edu/hypermail/linux/kernel/1809.0/03226.html

And there is a patch http://lkml.iu.edu/hypermail/linux/kernel/1809.0/04424.html
which is on it's way to mainline as I understand.
Comment 24 Sylvain BERTRAND 2018-09-09 14:59:39 UTC
Yep, it manifest also on 32 bits x86. I posted also a bug report.
Since it's critical we'll probably get a new rc very soon.
Comment 25 Dmitrii Tcvetkov 2018-09-10 05:20:49 UTC
The fix for TSC is merged: 17f6bac2249356c795 (x86/tsc: Prevent result truncation on 32bit)
Comment 26 Michel Dänzer 2018-09-10 09:15:34 UTC
Thanks for the follow-ups, resolving as not our bug.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.