Failure example: (kms_pipe_crc_basic:8910) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0' (kms_pipe_crc_basic:8910) igt-debugfs-DEBUG: Using generic frame CRC ABI (kms_pipe_crc_basic:8910) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0' (kms_pipe_crc_basic:8910) DEBUG: CRC for this fb: 001492bb 001492bb 001492bb 001492bb 001492bb (kms_pipe_crc_basic:8910) igt-debugfs-CRITICAL: Test assertion failure function igt_assert_crc_equal, file igt_debugfs.c:301: (kms_pipe_crc_basic:8910) igt-debugfs-CRITICAL: Failed assertion: a->crc[i] == b->crc[i] (kms_pipe_crc_basic:8910) igt-debugfs-CRITICAL: Last errno: 9, Bad file descriptor (kms_pipe_crc_basic:8910) igt-debugfs-CRITICAL: error: 0x1492bb != 0x45a649 Commits in this build: https://intel-gfx-ci.01.org/CI/CI_DRM_2460/commits_short.log History: https://intel-gfx-ci.01.org/CI/fi-ctg-l9400.html
We need the bisect here ...
We took the CTG out of fram1 so I can manually bisect this regression.
Note that Ville was suprised that ctg had a brief period of stability as the underlying bug wasn't fixed.
Note, drivers/gpu/drm/* are identical between CI_DRM_2459 and CI_DRM_2460
In order to do ezbench bisect, I would need some reasonably secure way to reproduce the issue. However, with T.o.T drm-tip (make oldconfig, make localmodconfig) I can't reproduce the issue while running the affected tests. Also, by running the CI_DRM_2482 kernel that was on the system, the: igt@kms_pipe_crc_basic@read-crc-pipe-a displays a green frame, but then the poor ol' laptop is hanging.
I built CI_DRM_2459 with kernel conf and bootargs from build server, as expected I can't reproduce the issue. Did the same for CI_DRM_2460, and I actually managed to reproduce on fail on igt@kms_pipe_crc_basic@read-crc-pipe-b after 25 runs. So, I suggest we try local ezbench bisect for this. However, since this device has a non-predictable reboot behavior, it may need some nursing.
First try with ezbech bisect failed. Second try just started.
The ezbench bisect is still running! check timing from my last comment.
ezbench bisect is not 100% done yet, but the human(mupuf) interpretation is that the regression is caused by this commit: commit 7b09cc5a9debc86c903c2eff8f8a1fdef773c649 Author: Pavel Tatashin <pasha.tatashin@oracle.com> Date: Wed Mar 22 16:24:17 2017 -0400 sched/clock: Fix broken stable to unstable transfer This is currently being verified.
(In reply to Marta Löfstedt from comment #8) > The ezbench bisect is still running! check timing from my last comment. Yeah, I had to increase the number of runs of each test because 5 runs were not sufficient to reliably reproduce issues on all the tests, which prompted more bisecting jobs than necessary (from 1 to 5). I increased it to 20 and it's been doing less work. The bisecting of the main issue was however done, it was now bisecting a change from skip to pass, which was not really useful to our needs. Here is the suspected commit: https://cgit.freedesktop.org/drm-tip/commit/?id=7b09cc5a9debc86c903c2eff8f8a1fdef773c649 We reverted the patch and we are now letting the machine verify that this is enough to fix the issue.
lspci: 00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07) 00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) 00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) 00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07) 00:19.0 Ethernet controller: Intel Corporation 82567LM Gigabit Network Connection (rev 03) 00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03) 00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03) 00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03) 00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03) 00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03) 00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03) 00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03) 00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03) 00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03) 00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03) 00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03) 00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03) 00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93) 00:1f.0 ISA bridge: Intel Corporation ICH9M-E LPC Interface Controller (rev 03) 00:1f.2 SATA controller: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] (rev 03) 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03) 03:00.0 Network controller: Intel Corporation Ultimate N WiFi Link 5300
50 runs with the patch reverted, no regressions. I emailed: Pavel Tatashin and Peter Zijlstra about this and hope they can explain in more details why this only affects this poor old Intel Core 2 Duo. Since the issue only appear to affect the CTG, maybe we should wait on feedback before we revert the patch.
Created attachment 130812 [details] lshw from cantiga
The real problem is most likely in i915 code, and it should be fixed by my g4x watermark rewrite (once I get around to posting them, and we get the into the tree). As the system memory self residency goes up the more likely we're to see the problem on the display side, which probably explains why scheduler changes can end up masking/unmasking the problem to some degree.
Great work on the bisecting Marta/Martin! If it's a probable bug in i915 we can still carry the quick revert of the sched patch in topic/core-for-CI if that helps with stability. Ville, can you pls take care of that?
(In reply to Daniel Vetter from comment #15) > Great work on the bisecting Marta/Martin! > > If it's a probable bug in i915 we can still carry the quick revert of the > sched patch in topic/core-for-CI if that helps with stability. Ville, can > you pls take care of that? I agree. The revert has been pushed after a discussion on IRC. Let's see what upstream has to say about this patch and what we can do on our side to make all of this less timing-sensitive.
The revert has been verified to work on CI twice. We are whitelisting the tests again.
Created attachment 130898 [details] [review] Patch from Peter Zijlstra Peter Zijlstra wanted this patch tested
I have tested the patch from peterz on the Cantiga: drm-tip: git@4d374295 + revert of the revert: "commit 841c8c95663fe5743f958dfcd1adc4e27b3748b8 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Apr 12 16:04:06 2017 +0200 Revert "sched/clock: Fix broken stable to unstable transfer" This reverts commit 7b09cc5a9debc86c903c2eff8f8a1fdef773c649." + attached patch from peterz I had 32 failing tests on 50 runs of the testlist with the affected tests. Note, with just the 841c8c95663, i.e. the revert of 7b09cc5a9d, there was zero fails on 50 runs. Also remember, with 7b09cc5a9d we failed more than one test on every run, so it appear as if the patch from peterz is an improvement, but it does not fix the regression.
Peter Zijlstra has a new patch series for this issue. I believe Ville has helped out with verification. I will ask Ville on the status before we take the CTG out of the lab for verfication of Peterz patchset.
Here are the peterz 8 patches: https://patchwork.kernel.org/patch/9693191/ ... https://patchwork.kernel.org/patch/9693195/
I can no longer reproduce the regression with Villes patch-set: https://patchwork.freedesktop.org/series/23358/
I could test the new Zijlstras patch-set tomorrow.
Zijstras "new" patchset does not produce clean results. I have tested both with and without the tsc=unstable kernel boot parameter. But, the failure rate has gone down significantly. I have 14 tests and to 50 repetitions => 700 invocations Before Zijlstra first patch, failure rate was approximately 50% => 350 fails Ziljstra first patch-set => 30 fails Ziljstra new patch-set => 15 fails + 1 dmesg-warn So, this is a big improvment. However, since Villes patchset appear to fix the problem by 100%, we don't need the Zijlstra patches for solving this regression.
This issue would be solved when Villes patchset is merged.
I assign this to Ville to be closed when his patch-set is merged.
Now patches were merged. https://cgit.freedesktop.org/drm-tip/commit/?id=04548cbada77c662b2af149d742a1d93aa3bc568
Excellent that patches are merged, now the revert 471199349134ed612e15deec17c95fafe6b0723 can be removed, and we will soon see if this fix still holds.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.