linux 4.19-rc1+ branch: amd-staging-drm-next commit:d0a96214993c5dad9c2a54b888209f0f5cafd060 amdgpu is unable to program my displayport monitor anymore (was working in 4.18-rc1+) see the error messages in the kernel log
Created attachment 141416 [details] kernel log
Can you bisect? P.S. Please enable CONFIG_UNWINDER_ORC in your kernel build, to make the backtraces in dmesg more useful.
On Mon, Sep 03, 2018 at 09:17:30AM +0000, bugzilla-daemon@freedesktop.org wrote: > https://bugs.freedesktop.org/show_bug.cgi?id=107784 > > --- Comment #2 from Michel Dänzer <michel@daenzer.net> --- > Can you bisect? I don't think it is reasonable because something bad happens between the 4.18 to 4.19 jump, then I would have to bisect on thousands of commits. Additionnaly, I had to remove my git repo of 4.18 and clone again 4.19 because git was unable to perform the pull, even with reset and force (I probably don't know some git magic here): lost the last working 4.18 commit. I would need to know what drm/amdgpu related commits where added from 4.18->4.19, the person who did this has this knowledge. I'll try to fool around with git to see if I can manage something in the mean time (with dates and targetting drm/amdgpu directories). > P.S. Please enable CONFIG_UNWINDER_ORC in your kernel build, to make the > backtraces in dmesg more useful. ORC breaks linux build. Only guess compile (actually kbuild and kconfig are quite broken in this 4.19), but it's an optimized build without symbols.
I did a manual bisection and got lucky: faulty commit: 019cddc88f9e4ae0de2c76802f7137210c2101aa on amd-staging-drm-next this commit is i2c related, right before the commit 5e8704ac1cfa9fceef94fcc457e18613b1589b34 which is the drm-next commit.
(In reply to Sylvain BERTRAND from comment #4) > faulty commit: 019cddc88f9e4ae0de2c76802f7137210c2101aa on > amd-staging-drm-next That's a pure merge commit, so it's unlikely to be the one that actually caused the problem. Without a plausible bisection result or at least usable backtraces in dmesg, no progress can be made here.
I tested several times that: - this commit is failing. - the commit right before this one was working. Then, I can reasonably say, this commit, whatever its type, is the one breaking displayport programming. Then, the code breaking displayport would be here: how hard this is to figure out what code related to displayport was modified? I can test again several times that this is the faulty commit.
In reply to Sylvain BERTRAND from comment #6) > I tested several times that: > - this commit is failing. > - the commit right before this one was working. > > Then, I can reasonably say, this commit, whatever its type, is the one > breaking displayport programming. A merge commit has two parents so you need to test both. It is possible that a merge commit did indeed introduce the bug but *highly* unlikely. (apologies if you knew all of this) Sylvain: Are you familiar with "git bisect"? "git bisect" automatically take care of merge commits. Also it minimizes the number of tests you need to do.
Bugzilla: The tool which preserves all my stupid errors for eternity. :-/ Sorry, I just read the previous comments and of course you used git bisect already... Still you should test both parents of a merge commit to be sure. When my git bisect ends at a merge commit, usually I messed up the build during bisection (e.g. running a previous build by mistake).
Ok, I got it. Since my git knowledge is quite limited, this "merge" commit is opening a vast sea of new commits to test. I'll dive into bisection using bisect (which actually deals with those merge commits). I am a bit scared of the amount of commits, may take hours/days. Please, in the foreseable futur, do not make amd-staging-drm-next lag that much. Coming back once I get the faulty _simple_ commit, wish me luck.
(In reply to Sylvain BERTRAND from comment #9) > Ok, I got it. Since my git knowledge is quite limited, this "merge" commit is > opening a vast sea of new commits to test. > > I'll dive into bisection using bisect (which actually deals with those merge > commits). I am a bit scared of the amount of commits, may take hours/days. Don't worry: If your problem is immediately obvious after booting this should be not too bad. You can also use the versions you build manually to "seed" the git bisection process. That way you reduce the number of bisection steps a bit. If you encounter non-bootable kernels please use "git bisect skip" - this is important! DO NOT mark as "good"/"bad" in these cases. > Please, in the foreseable futur, do not make amd-staging-drm-next lag that > much. I think the "drm-next" thing was mandated by Linus so there is not much which can be done at this point. But anyway if you have an easily reproducible problem bisection is not too bad.
bisected: e2a9ca29b5edc89da2fddeae30e1070b272395c5 This commit is one in a series related to new TSC code. I tried to switch the clocksource to hpet early in the boot process, did not change anything. Any ideas before I post an issue on kernel bugzilla?
Created attachment 141464 [details] kernel log from the bad commit
Created attachment 141465 [details] kernel log from a good commit
(In reply to Sylvain BERTRAND from comment #11) > bisected: e2a9ca29b5edc89da2fddeae30e1070b272395c5 > > This commit is one in a series related to new TSC code That's not consistent with the merge commit you identified earlier. So I'm afraid it's likely that you incorrectly classified some commits as good or bad. Maybe the problem doesn't occur 100% consistently even with bad commits, so try testing longer / more times before declaring a commit as good.
On Thu, Sep 06, 2018 at 10:04:53AM +0000, bugzilla-daemon@freedesktop.org wrote: > https://bugs.freedesktop.org/show_bug.cgi?id=107784 > > --- Comment #14 from Michel Dänzer <michel@daenzer.net> --- > That's not consistent with the merge commit you identified earlier. So I'm > afraid it's likely that you incorrectly classified some commits as good or bad. > Maybe the problem doesn't occur 100% consistently even with bad commits, so try > testing longer / more times before declaring a commit as good. Not consistent? Could you be more specific? Some git magic I forgot again? This time I used git bisect to go through "merge" commits properly. I did test countless times those commits: that would mean this TSC code would "side-effect" an ultra-rare bad condition into an nearly-all-the-time bad condition. That amount of bad luck? Whatever, I'll update amd-staging-drm-next, and go through a full bisection again. I'll need probably days (lucky: I don't work).
(In reply to Sylvain BERTRAND from comment #15) > Not consistent? Could you be more specific? You wrote: (In reply to Sylvain BERTRAND from comment #6) > - this commit is failing. > - the commit right before this one was working. "this commit" being 019cddc88f9e4ae0de2c76802f7137210c2101aa (the I2C merge), which has two parents. Both of those parent commits contain commit e2a9ca29b5edc89da2fddeae30e1070b272395c5 (a TSC commit) as part of their history. So you previously considered commit e2a9ca29b5edc89da2fddeae30e1070b272395c5 as both bad and good. That's the inconsistency. This most likely means that you're not yet able to reliably determine that a given commit is bad, e.g. due to not testing (long) enough.
On Thu, Sep 06, 2018 at 02:22:18PM +0000, bugzilla-daemon@freedesktop.org wrote: > https://bugs.freedesktop.org/show_bug.cgi?id=107784 > > --- Comment #16 from Michel Dänzer <michel@daenzer.net> --- > "this commit" being 019cddc88f9e4ae0de2c76802f7137210c2101aa (the I2C merge), > which has two parents. Both of those parent commits contain commit > e2a9ca29b5edc89da2fddeae30e1070b272395c5 (a TSC commit) as part of their > history. So you previously considered commit > e2a9ca29b5edc89da2fddeae30e1070b272395c5 as both bad and good. That's the > inconsistency. > > This most likely means that you're not yet able to reliably determine that a > given commit is bad, e.g. due to not testing (long) enough. Wow! Then it is even worse of what I thought. Due to the violent leap from 4.18 to 4.19, there are zillions of commits, and even nlog(n) bisect will give me ten_s_ of commits to test. Please, could you refine your "long enough" for a blocker pb which happens at boot, once xorg tries to program my displayport screen. That would be based on your experience, something to give me the order of the "long enough". That said, I have a hinch. I am going to setup a much cleaner test env: it's a custom distro which boots in _really_ a few seconds (not in the range of most mainstream distros boot time)-->I am going to slow it down, on purpose (certainly in more than 1 spot). Then, I have an efi framebuffer and I saw some issues about this->I am going to get rid of it. Then, I am not confident in my monitor (see my other bugs), I may use the previous artificial slow down, to power cycle the monitor, before xorg tries to detect and program it. Well, I'll try to figure a way to put my monitor in a "probably" cleaner state (in respect of displayport hotplug support). Oh, and just in case of, I'll stick to the performance cpu governor. If you have any advice about this based on your experience at knowledge , which I cannot match, I'm all eyes and hears.
Got something, sort of obvious for a human, impossible for bisect to know, which explains why this bisection was a failure: testing a commit in the middle of a series of commits which are to be taken as a whole to be consistent for normal operations of the kernel, is wrong. That, bisect cannot know. In my case, the TSC commits are "good"... and testing time has nothing to do with this: testing a commit in the middle of this series will have a side effect (DP link/aux timings...) on what I am observing to decide if a commit is "bad" or "good". In my case it will break this observable (my DP monitor is _really_ not working) and I'll tag the commit as "bad", which is inconsistent. The bottom of this, which is _obvious_ for an experienced git user on large projects, at the time of big merges on the main branch of a project like linux, if something breaks, bisect is probably more your enemy than your friend: it is ok to ask "Can you bisect?" on a sub-sub-system branch which has been free from big merges from any related other subsystems for a good while, but almost an insult on such massive update.
Breaking commit is a merge commit from upstream mainline: -------- commit 13e091b6dd0e78a518a7d8756607d3acb8215768 Merge: eac341194426 1088c6eef261 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon Aug 13 18:28:19 2018 -0700 Merge branch 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 timer updates from Thomas Gleixner: "Early TSC based time stamping to allow better boot time analysis. This comes with a general cleanup of the TSC calibration code which grew warts and duct taping over the years and removes 250 lines of code. Initiated and mostly implemented by Pavel with help from various folks" * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) x86/kvmclock: Mark kvm_get_preset_lpj() as __init x86/tsc: Consolidate init code sched/clock: Disable interrupts when calling generic_sched_clock_init() timekeeping: Prevent false warning when persistent clock is not available sched/clock: Close a hole in sched_clock_init() x86/tsc: Make use of tsc_calibrate_cpu_early() x86/tsc: Split native_calibrate_cpu() into early and late parts sched/clock: Use static key for sched_clock_running sched/clock: Enable sched clock early sched/clock: Move sched clock initialization and merge with generic clock x86/tsc: Use TSC as sched clock early x86/tsc: Initialize cyc2ns when tsc frequency is determined x86/tsc: Calibrate tsc only once ARM/time: Remove read_boot_clock64() s390/time: Remove read_boot_clock64() timekeeping: Default boot time offset to local_clock() timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset() s390/time: Add read_persistent_wall_and_boot_offset() x86/xen/time: Output xen sched_clock time from 0 x86/xen/time: Initialize pv xen time in init_hypervisor_platform() ... -------- If I revert this merge commit on amd-staging-drm-next branch, my displayport monitor works again. In the 1088c6eef261 branch, the one merged in mainline, displayport programming stopped working on the following commit: -------- commit e2a9ca29b5edc89da2fddeae30e1070b272395c5 Author: Pavel Tatashin <pasha.tatashin@oracle.com> Date: Thu Jul 19 16:55:39 2018 -0400 x86/tsc: Initialize cyc2ns when tsc frequency is determined cyc2ns converts tsc to nanoseconds, and it is handled in a per-cpu data structure. Currently, the setup code for c2ns data for every possible CPU goes through the same sequence of calculations as for the boot CPU, but is based on the same tsc frequency as the boot CPU, and thus this is not necessary. Initialize the boot cpu when tsc frequency is determined. Copy the calculated data from the boot CPU to the other CPUs in tsc_init(). In addition do the following: - Remove unnecessary zeroing of c2ns data by removing cyc2ns_data_init() - Split set_cyc2ns_scale() into two functions, so set_cyc2ns_scale() can be called when system is up, and wraps around __set_cyc2ns_scale() that can be called directly when system is booting but avoids saving restoring IRQs and going and waking up from idle. ---- I did not check if this issue was known from upstream yet.
Found the bug, very probably. It seems to be an upstream bug: a 32bits multiplication overflow on TSC frequency introduced in recent TSC cleanup: -------------------------------------------------------------------------------- commit cf7a63ef4e0203f6f33284c69e8188d91422de83 Author: Pavel Tatashin <pasha.tatashin@oracle.com> Date: Thu Jul 19 16:55:38 2018 -0400 x86/tsc: Calibrate tsc only once During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(), and the second time in tsc_init(). Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so the calibration is done only early, and make tsc_init() to use the values already determined in tsc_early_init(). Sometimes it is not possible to determine tsc early, as the subsystem that is required is not yet initialized, in such case try again later in tsc_init(). -------------------------------------------------------------------------------- One nasty thing: this commit is broken, it does not compile, hence it's a bisect "skipped" commit. Roughly, if you have a cpu with a frequency above 4.2GHz (max unsigned 32bits), linux time subsystem gets broken leading to the timeouts in displayport programming. Ofc, my cpu runs at 4.7GHz.
(In reply to Sylvain BERTRAND from comment #20) > Roughly, if you have a cpu with a frequency above 4.2GHz (max unsigned > 32bits), > linux time subsystem gets broken leading to the timeouts in displayport > programming. Ofc, my cpu runs at 4.7GHz. I hit the bug on AMD FX-9590 machine with RX 580 and 2 displayport monitors. As for you bisect wasn't successful for me and led me to network merge commit. Looks like you're right! If I downclock the CPU to 4.2 GHz then current mainline master (f8f65382c98a28e3c2b20) boots fine.
(In reply to Dmitrii Tcvetkov from comment #21) > (In reply to Sylvain BERTRAND from comment #20) > > Roughly, if you have a cpu with a frequency above 4.2GHz (max unsigned > > 32bits), > > linux time subsystem gets broken leading to the timeouts in displayport > > programming. Ofc, my cpu runs at 4.7GHz. > > I hit the bug on AMD FX-9590 machine with RX 580 and 2 displayport monitors. > As for you bisect wasn't successful for me and led me to network merge > commit. > > Looks like you're right! If I downclock the CPU to 4.2 GHz then current > mainline master (f8f65382c98a28e3c2b20) boots fine. Relevant discussion about cf7a63ef4e0203f (x86/tsc: Calibrate tsc only once) http://lkml.iu.edu/hypermail/linux/kernel/1809.0/03226.html
> Relevant discussion about cf7a63ef4e0203f (x86/tsc: Calibrate tsc only once) > http://lkml.iu.edu/hypermail/linux/kernel/1809.0/03226.html And there is a patch http://lkml.iu.edu/hypermail/linux/kernel/1809.0/04424.html which is on it's way to mainline as I understand.
Yep, it manifest also on 32 bits x86. I posted also a bug report. Since it's critical we'll probably get a new rc very soon.
The fix for TSC is merged: 17f6bac2249356c795 (x86/tsc: Prevent result truncation on 32bit)
Thanks for the follow-ups, resolving as not our bug.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.