Bug 100548 - [BAT][CTG] DRM-Tip 4.11.0-rc5 introduced kms_pipe_crc_basic regression on C2D
Summary: [BAT][CTG] DRM-Tip 4.11.0-rc5 introduced kms_pipe_crc_basic regression on C2D
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
: highest critical
Assignee: Ville Syrjala
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: PatchMerged
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-04 07:21 UTC by Tomi Sarvela
Modified: 2017-07-27 16:49 UTC (History)
1 user (show)

See Also:
i915 platform: GM45
i915 features: display/Other


Attachments
lshw from cantiga (19.53 KB, text/plain)
2017-04-12 12:08 UTC, Marta Löfstedt
no flags Details
Patch from Peter Zijlstra (5.79 KB, patch)
2017-04-18 13:07 UTC, Marta Löfstedt
no flags Details | Splinter Review

Description Tomi Sarvela 2017-04-04 07:21:00 UTC
Failure example:
(kms_pipe_crc_basic:8910) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(kms_pipe_crc_basic:8910) igt-debugfs-DEBUG: Using generic frame CRC ABI
(kms_pipe_crc_basic:8910) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(kms_pipe_crc_basic:8910) DEBUG: CRC for this fb: 001492bb 001492bb 001492bb 001492bb 001492bb
(kms_pipe_crc_basic:8910) igt-debugfs-CRITICAL: Test assertion failure function igt_assert_crc_equal, file igt_debugfs.c:301:
(kms_pipe_crc_basic:8910) igt-debugfs-CRITICAL: Failed assertion: a->crc[i] == b->crc[i]
(kms_pipe_crc_basic:8910) igt-debugfs-CRITICAL: Last errno: 9, Bad file descriptor
(kms_pipe_crc_basic:8910) igt-debugfs-CRITICAL: error: 0x1492bb != 0x45a649


Commits in this build:
https://intel-gfx-ci.01.org/CI/CI_DRM_2460/commits_short.log


History:
https://intel-gfx-ci.01.org/CI/fi-ctg-l9400.html
Comment 1 Daniel Vetter 2017-04-06 07:27:25 UTC
We need the bisect here ...
Comment 2 Marta Löfstedt 2017-04-06 10:11:29 UTC
We took the CTG out of fram1 so I can manually bisect this regression.
Comment 3 Chris Wilson 2017-04-06 10:16:14 UTC
Note that Ville was suprised that ctg had a brief period of stability as the underlying bug wasn't fixed.
Comment 4 Marta Löfstedt 2017-04-06 13:50:20 UTC
Note, drivers/gpu/drm/* are identical between CI_DRM_2459 and CI_DRM_2460
Comment 5 Marta Löfstedt 2017-04-07 09:43:45 UTC
In order to do ezbench bisect, I would need some reasonably secure way to reproduce the issue. However, with T.o.T drm-tip (make oldconfig, make localmodconfig) I can't reproduce the issue while running the affected tests.
Also, by running the CI_DRM_2482 kernel that was on the system, the: 
igt@kms_pipe_crc_basic@read-crc-pipe-a
displays a green frame, but then the poor ol' laptop is hanging.
Comment 6 Marta Löfstedt 2017-04-10 06:38:36 UTC
I built CI_DRM_2459 with kernel conf and bootargs from build server, as expected I can't reproduce the issue. Did the same for CI_DRM_2460, and I actually managed to reproduce on fail on igt@kms_pipe_crc_basic@read-crc-pipe-b after 25 runs. 
So, I suggest we try local ezbench bisect for this. However, since this device has a non-predictable reboot behavior, it may need some nursing.
Comment 7 Marta Löfstedt 2017-04-11 10:15:53 UTC
First try with ezbech bisect failed.
Second try just started.
Comment 8 Marta Löfstedt 2017-04-12 06:15:52 UTC
The ezbench bisect is still running! check timing from my last comment.
Comment 9 Marta Löfstedt 2017-04-12 10:26:43 UTC
ezbench bisect is not 100% done yet, but the human(mupuf) interpretation is that the regression is caused by this commit:

 commit 7b09cc5a9debc86c903c2eff8f8a1fdef773c649
Author: Pavel Tatashin <pasha.tatashin@oracle.com>
Date:   Wed Mar 22 16:24:17 2017 -0400

    sched/clock: Fix broken stable to unstable transfer

This is currently being verified.
Comment 10 Martin Peres 2017-04-12 10:34:12 UTC
(In reply to Marta Löfstedt from comment #8)
> The ezbench bisect is still running! check timing from my last comment.

Yeah, I had to increase the number of runs of each test because 5 runs were not sufficient to reliably reproduce issues on all the tests, which prompted more bisecting jobs than necessary (from 1 to 5). I increased it to 20 and it's been doing less work.

The bisecting of the main issue was however done, it was now bisecting a change from skip to pass, which was not really useful to our needs.

Here is the suspected commit: https://cgit.freedesktop.org/drm-tip/commit/?id=7b09cc5a9debc86c903c2eff8f8a1fdef773c649

We reverted the patch and we are now letting the machine verify that this is enough to fix the issue.
Comment 11 Marta Löfstedt 2017-04-12 11:43:50 UTC
lspci:
00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
00:19.0 Ethernet controller: Intel Corporation 82567LM Gigabit Network Connection (rev 03)
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M-E LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
03:00.0 Network controller: Intel Corporation Ultimate N WiFi Link 5300
Comment 12 Marta Löfstedt 2017-04-12 12:08:18 UTC
50 runs with the patch reverted, no regressions.

I emailed: Pavel Tatashin and Peter Zijlstra about this and hope they can explain in more details why this only affects this poor old Intel Core 2 Duo.

Since the issue only appear to affect the CTG, maybe we should wait on feedback before we revert the patch.
Comment 13 Marta Löfstedt 2017-04-12 12:08:51 UTC
Created attachment 130812 [details]
lshw from cantiga
Comment 14 Ville Syrjala 2017-04-12 13:24:52 UTC
The real problem is most likely in i915 code, and it should be fixed by my g4x watermark rewrite (once I get around to posting them, and we get the into the tree).

As the system memory self residency goes up the more likely we're to see the problem on the display side, which probably explains why scheduler changes can end up masking/unmasking the problem to some degree.
Comment 15 Daniel Vetter 2017-04-12 13:52:54 UTC
Great work on the bisecting Marta/Martin!

If it's a probable bug in i915 we can still carry the quick revert of the sched patch in topic/core-for-CI if that helps with stability. Ville, can you pls take care of that?
Comment 16 Martin Peres 2017-04-12 14:19:41 UTC
(In reply to Daniel Vetter from comment #15)
> Great work on the bisecting Marta/Martin!
> 
> If it's a probable bug in i915 we can still carry the quick revert of the
> sched patch in topic/core-for-CI if that helps with stability. Ville, can
> you pls take care of that?

I agree. The revert has been pushed after a discussion on IRC.

Let's see what upstream has to say about this patch and what we can do on our side to make all of this less timing-sensitive.
Comment 17 Martin Peres 2017-04-12 15:55:52 UTC
The revert has been verified to work on CI twice. We are whitelisting the tests again.
Comment 18 Marta Löfstedt 2017-04-18 13:07:39 UTC
Created attachment 130898 [details] [review]
Patch from Peter Zijlstra

Peter Zijlstra wanted this patch tested
Comment 19 Marta Löfstedt 2017-04-18 14:11:17 UTC
I have tested the patch from peterz on the Cantiga:

drm-tip: git@4d374295 + revert of the revert:
"commit 841c8c95663fe5743f958dfcd1adc4e27b3748b8
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Apr 12 16:04:06 2017 +0200

    Revert "sched/clock: Fix broken stable to unstable transfer"
    
    This reverts commit 7b09cc5a9debc86c903c2eff8f8a1fdef773c649."

+ attached patch from peterz

I had 32 failing tests on 50 runs of the testlist with the affected tests.

Note, with just the 841c8c95663, i.e. the revert of 7b09cc5a9d, there was zero fails on 50 runs.
Also remember, with 7b09cc5a9d we failed more than one test on every run, so it appear as if the patch from peterz is an improvement, but it does not fix the regression.
Comment 20 Marta Löfstedt 2017-04-24 06:02:21 UTC
Peter Zijlstra has a new patch series for this issue.
I believe Ville has helped out with verification.
I will ask Ville on the status before we take the CTG out of the lab for verfication of Peterz patchset.
Comment 21 Marta Löfstedt 2017-04-24 06:47:20 UTC
Here are the peterz 8 patches:

https://patchwork.kernel.org/patch/9693191/
...
https://patchwork.kernel.org/patch/9693195/
Comment 22 Marta Löfstedt 2017-04-24 14:20:48 UTC
I can no longer reproduce the regression with Villes patch-set:
https://patchwork.freedesktop.org/series/23358/
Comment 23 Marta Löfstedt 2017-04-24 14:30:04 UTC
I could test the new Zijlstras patch-set tomorrow.
Comment 24 Marta Löfstedt 2017-04-25 09:23:02 UTC
Zijstras "new" patchset does not produce clean results. I have tested both with and without the tsc=unstable kernel boot parameter.

But, the failure rate has gone down significantly.

I have 14 tests and to 50 repetitions => 700 invocations

Before Zijlstra first patch, failure rate was approximately 50% => 350 fails
Ziljstra first patch-set => 30 fails
Ziljstra new patch-set => 15 fails + 1 dmesg-warn

So, this is a big improvment.

However, since Villes patchset appear to fix the problem by 100%, we don't need the Zijlstra patches for solving this regression.
Comment 25 Marta Löfstedt 2017-05-04 06:57:00 UTC
This issue would be solved when Villes patchset is merged.
Comment 26 Marta Löfstedt 2017-05-09 10:40:38 UTC
I assign this to Ville to be closed when his patch-set is merged.
Comment 27 Jani Saarinen 2017-05-15 13:45:59 UTC
Now patches were merged. 
https://cgit.freedesktop.org/drm-tip/commit/?id=04548cbada77c662b2af149d742a1d93aa3bc568
Comment 28 Marta Löfstedt 2017-05-16 10:13:19 UTC
Excellent that patches are merged, now the revert 471199349134ed612e15deec17c95fafe6b0723 can be removed, and we will soon see if this fix still holds.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.