Created attachment 82900 [details]
dmesg

System Environment:
--------------------------
Platform: Ironlake
Kernel: (drm-intel-fixes) 363202bb22467ea1de6dd284b78eff5cf517db66

Bug detailed description:
-----------------------------
It randomly causes a system hang on Ironlake with the drm-intel-fixes and drm-intel-next-queued kernels. It happens about 1 in 5 runs.

Output:
Beginning test gpu-blit with 1366x768 @ 60Hz / RGB565 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / RGB565 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / RGB888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / RGB888 on pipe A, encoder TMDS, connector Embedded DisplayPort: SKIPPED
Beginning test gpu-blit with 1366x768 @ 60Hz / XRGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / XRGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / XRGB2101010 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / XRGB2101010 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / ARGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / ARGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / RGB565 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / RGB565 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / RGB888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / RGB888 on pipe A, encoder TMDS, connector Embedded DisplayPort: SKIPPED
Beginning test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / XRGB2101010 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / XRGB2101010 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / ARGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / ARGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED

Reproduce steps:
----------------------------
1. Run ./kms_render --run-subtest gpu-blit (5 cycles)
Can you please record the point of failure for a large number of runs (say ~10 fails)? My guess is that it has to be a modeset vs. blit race (this test needs a background thread also doing blits), and so I want to see if there is any commonality in which connector/crtc/etc. it hangs on.
Is this still an issue with latest i-g-t? Imre committed a few patches which might help here.
(In reply to comment #2)
> Is this still an issue with latest i-g-t? Imre committed a few patches which
> might help here.

It still happens on the latest commit. The kms_render subtest direct-render also randomly causes a system hang.
Random hard hangs are nearly impossible to solve, yet are still critical bugs. :( Perhaps bump drm.debug to 7 so we get all the ioctls as well. Unlikely to help clarify matters, but you never know...
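For reference, drm.debug can be bumped either at boot by adding drm.debug=0x7 to the kernel command line, or at runtime (as root) with echo 0x7 > /sys/module/drm/parameters/debug.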
Created attachment 83957 [details]
dmesg with drm.debug=7
Is this bisectable? If we run the kms_render test case about 50 times on each bisect step we may get some reliable bisecting results.
(In reply to comment #6)
> Is this bisectable? If we run the kms_render test case about 50 times on
> each bisect step we may get some reliable bisecting results.

We will try to bisect it.
I can't find a good commit.
Testing on drm-intel-fixes (8abbbaf6adb46157b6bd416f7616b555cc6a332f), it also happens.
(In reply to comment #8)
> I can't find a good commit.
> Testing on drm-intel-fixes (8abbbaf6adb46157b6bd416f7616b555cc6a332f), it
> also happens.

I can't reproduce this on the latest -nightly with my IVB EliteBook 8440p. At least one difference is the resolution (1600x900 vs. your 1366x768). The above commit is rather old (Mar 27), so could you retest with the latest -nightly instead (both with --run-subtest gpu-blit and direct-render)?

Also, since in your log the hang seems to happen at the last step, there is a slight chance gem_quiescent_gpu gets stuck somehow, so could you try both subtests also with the following igt patch:

diff --git a/lib/drmtest.c b/lib/drmtest.c
index 37d7da3..aa382ff 100644
--- a/lib/drmtest.c
+++ b/lib/drmtest.c
@@ -243,8 +243,10 @@ int drm_open_any(void)
 	if (__sync_fetch_and_add(&open_count, 1))
 		return fd;
 
+	if (0) {
 	gem_quiescent_gpu(fd);
 	igt_install_exit_handler(quiescent_gpu_at_exit);
+	}
 
 	return fd;
 }

Is it a hard hang, that is, you can't even ping/ssh the machine? Thanks.
Correction to comment #9: the EliteBook I ran the test on is an ILK, not an IVB.

Now I tried with a modified kernel to force your 1366x768 resolution without panel fitting on eDP, which results in corrupted output but otherwise should result in correct signal generation matching your timings exactly. I still can't reproduce the bug after running it for many hours today, so I'd need your input on comment #9.

Could you also attach your kernel's .config?
> diff --git a/lib/drmtest.c b/lib/drmtest.c
> index 37d7da3..aa382ff 100644
> --- a/lib/drmtest.c
> +++ b/lib/drmtest.c
> @@ -243,8 +243,10 @@ int drm_open_any(void)
>  	if (__sync_fetch_and_add(&open_count, 1))
>  		return fd;
>  
> +	if (0) {
>  	gem_quiescent_gpu(fd);
>  	igt_install_exit_handler(quiescent_gpu_at_exit);
> +	}
>  
>  	return fd;
>  }

Tested this patch; it still happens. When it hangs, I can't ping/ssh the machine.
Created attachment 84898 [details]
kernel config
Created attachment 84929 [details] [review]
fix ilk ring flush workaround

The only notable thing from your .config was that you're running a 32-bit kernel, but I couldn't reproduce the hang even with that.

I'm still not sure that it's not a ring flushing issue, since in your last reply you didn't provide separate results for --run-subtest gpu-blit and direct-render. Anyway, I noticed that Chris' infinite __wait_seqno patch makes a clear improvement in gem_quiescent_gpu() by getting rid of occasional stalls in it, so it would be worth trying (and I just realized he also suggested this earlier on IRC to me). Please give the following a go:

https://patchwork.kernel.org/patch/2849600/

I also found an igt regression in flush_on_ring_common(), and since that is related to ilk, it's also a possible cause. I attached the fix for it; could you please try that too?
(In reply to comment #13)
> Created attachment 84929 [details] [review] [review]
> fix ilk ring flush workaround
>
> The only notable thing from your .config was that you're running a 32-bit
> kernel, but I couldn't reproduce the hang even with that.
>
> I'm still not sure that it's not a ring flushing issue, since in your last
> reply you didn't provide separate results for --run-subtest gpu-blit and
> direct-render. Anyway, I noticed that Chris' infinite __wait_seqno patch
> makes a clear improvement in gem_quiescent_gpu() by getting rid of
> occasional stalls in it, so it would be worth trying (and I just realized
> he also suggested this earlier on IRC to me). Please give the following a
> go:
>
> https://patchwork.kernel.org/patch/2849600/
>
> I also found an igt regression in flush_on_ring_common(), and since that
> is related to ilk, it's also a possible cause. I attached the fix for it;
> could you please try that too?

I tested with these two patches; it happens in the fourth cycle.

Output:
Beginning test gpu-blit with 1366x768 @ 60Hz / RGB565 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / RGB565 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / RGB888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / RGB888 on pipe A, encoder TMDS, connector eDP: SKIPPED
Beginning test gpu-blit with 1366x768 @ 60Hz / XRGB8888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / XRGB8888 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / XRGB2101010 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / XRGB2101010 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / ARGB8888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / ARGB8888 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / RGB565 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 40Hz / RGB565 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / RGB888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 40Hz / RGB888 on pipe A, encoder TMDS, connector eDP: SKIPPED
Beginning test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS, connector eDP: PASSED
Created attachment 85064 [details]
disable steps one-by-one

(In reply to comment #14)
> (In reply to comment #13)
> [...]
> I tested with these two patches; it happens in the fourth cycle.
> Output:
> [...]
> Beginning test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder
> TMDS, connector eDP
> Test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS,
> connector eDP: PASSED

Hm, unless the output is shortened somehow, the above and your earlier dmesg seem to suggest that the hang happens during kmstest_remove_fb(), but we'd need to instrument the kernel to find the exact place. Before that, we could try to narrow things down by disabling each test step. Could you try applying the attached disable patches and see with which ones the gpu-blit test stops hanging (if at all :P)? First with patch 1, then patches 1+2, then 1+2+3, etc. Thanks.
Tested patch 1, 1+2, 1+2+3, 1+2+3+4, and 1+2+3+4+5; it still happens.
Created attachment 85190 [details] [review]
debug ironlake_crtc_disable

(In reply to comment #16)
> Tested patch 1, 1+2, 1+2+3, 1+2+3+4, and 1+2+3+4+5; it still happens.

Thanks, that narrows it down to a

  fb = kmstest_create_fb2();
  drmModeSetCrtc(fb);
  kmstest_remove_fb(fb);

loop. And considering your dmesg, that seems to point to somewhere in ironlake_crtc_disable() when calling kmstest_remove_fb(). I attached a kernel patch to add debug info to ironlake_crtc_disable().

I noticed only now that you are missing some useful debugging options from your kernel .config; please enable them (and I suggest keeping them on in the future too):

CONFIG_LOCKUP_DETECTOR
CONFIG_DETECT_HUNG_TASK
CONFIG_PROVE_LOCKING

Then please run the test again with all the disable patches 1-5 applied and the attached kernel patch, and provide a new dmesg log (with drm.debug=0xf, including everything starting from boot-up) along with the output from the test.
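To make the failing sequence concrete, here is a minimal plain-libdrm sketch of the same loop. This is an illustration, not igt code: the kmstest_* helpers roughly wrap these calls, and the fd, crtc/connector ids, mode, and buffer object (bo_handle, pitch) are assumed to have been set up elsewhere.

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Sketch of the create-fb / setcrtc / remove-fb loop; error handling
 * is omitted for brevity. */
static void setcrtc_rmfb_loop(int fd, uint32_t crtc_id,
			      uint32_t connector_id, drmModeModeInfo *mode,
			      uint32_t bo_handle, uint32_t pitch, int cycles)
{
	uint32_t fb_id;
	int i;

	for (i = 0; i < cycles; i++) {
		/* roughly what kmstest_create_fb2() does internally */
		drmModeAddFB(fd, mode->hdisplay, mode->vdisplay, 24, 32,
			     pitch, bo_handle, &fb_id);

		/* light up the pipe with the new framebuffer */
		drmModeSetCrtc(fd, crtc_id, fb_id, 0, 0,
			       &connector_id, 1, mode);

		/* removing the active fb force-disables the crtc; this is
		 * the kmstest_remove_fb() step where the hang shows up,
		 * apparently inside ironlake_crtc_disable() */
		drmModeRmFB(fd, fb_id);
	}
}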
(In reply to comment #17)
> Created attachment 85190 [details] [review] [review]
> debug ironlake_crtc_disable

Tested this patch; it still happens.
Created attachment 85229 [details]
dmesg with patch debug ironlake_crtc_disable
Created attachment 85244 [details] [review]
fix modeset disable sequence
Created attachment 85246 [details]
debug ironlake_crtc_disable-2

(In reply to comment #19)
> Created attachment 85229 [details]
> dmesg with patch debug ironlake_crtc_disable

It seems to hang in intel_disable_plane(). According to Ville, disabling clocks while planes are still enabled might cause this. I checked, and our disable sequence for ilk is not according to spec: we should disable the planes before disabling the port.

Could you try again with all of the following kernel/igt patches applied:
- fix modeset disable sequence
- debug ironlake_crtc_disable-2
- disable steps one-by-one (all of 1-5)

and send a full dmesg with drm.debug=0xf?
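To spell out the ordering in question, here is a schematic only, not the attached patch; the pipe step is assumed from the usual sequence, and intel_disable_plane()/encoder->post_disable() are the functions named in this discussion.

/* Schematic of the spec-mandated ilk disable order; the real code
 * lives in ironlake_crtc_disable() in i915. */
static void ilk_disable_order_sketch(void)
{
	/* 1. intel_disable_plane(): planes off while pipe/clocks still run */
	/* 2. disable the pipe itself */
	/* 3. encoder->post_disable(): port and clocks off last */
}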
> Could you try again with all of the following kernel/igt patches applied:
> - fix modeset disable sequence
> - debug ironlake_crtc_disable-2
> - disable steps one-by-one (all of 1-5)
>
> and send a full dmesg with drm.debug=0xf?

Tested these patches; it still happens.
Created attachment 85305 [details]
dmesg with patch comment 21
(In reply to comment #23)
> Created attachment 85305 [details]
> dmesg with patch comment 21

Ok, thanks a lot for testing these. I didn't get much closer to the root cause, but at least we eliminated some possible causes. To summarize: a loop as in comment 17 is enough to trigger the problem, and the hang seems to happen in intel_disable_plane(). The wrong ilk disable sequence I mentioned is only a red herring, since the port only gets disabled in encoder->post_disable(), so we can forget about that.

One more pattern I noticed in your dmesg is regular fifo underflows, which happen consistently with certain modes, resulting in lower watermark thresholds. And the hang happened in all cases with the same threshold values. Atm I have no idea why you get those; I don't get any underflows on my ilk even with your timings.
I could hit a hard hang - by causing pipe underruns - and it seems to be the same as what you see. Leaving the primary WM value at its default and keeping the LP WMs disabled got rid of the problem for me.

Also, Ville's WM rework seems to fix the issue for me, so could you give it a try:

git://gitorious.org/vsyrjala/linux.git watermarks_for_imre branch.
> git://gitorious.org/vsyrjala/linux.git watermarks_for_imre branch.

Ran 5 cycles on this branch; they all work well.
Hi lu hua,

Could you please verify with the latest drm-intel-nightly branch:
http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-nightly

It should be fixed, so we can close this bug.

Thanks,
Rodrigo.
Tested on the latest -nightly kernel. It works well.
Verified. Fixed.
Closing verified+fixed.