67243 – [ILK]igt/kms_render/gpu-blit randomly causes system hang

Summary: [ILK]igt/kms_render/gpu-blit randomly causes system hang

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	All Linux (All)

Importance:	high critical
Assignee:	Imre Deak
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-07-24 06:27 UTC by lu hua
Modified:	2016-11-22 08:49 UTC
CC List:	3 users

See Also:
i915 platform:
i915 features:

Attachments
dmesg (208.71 KB, text/plain) 2013-07-24 06:27 UTC, lu hua
dmesg with drm.debug=7 (124.87 KB, text/plain) 2013-08-12 07:11 UTC, lu hua
kernel config (108.42 KB, text/plain) 2013-08-30 07:27 UTC, lu hua
fix ilk ring flush workaround (746 bytes, patch) 2013-08-30 19:26 UTC, Imre Deak
disable steps one-by-one (10.00 KB, application/octet-stream) 2013-09-02 14:51 UTC, Imre Deak
debug ironlake_crtc_disable (2.30 KB, patch) 2013-09-04 11:08 UTC, Imre Deak
dmesg with patch debug ironlake_crtc_disable (2.73 MB, text/plain) 2013-09-05 07:24 UTC, lu hua
fix modeset disable sequence (1.16 KB, patch) 2013-09-05 12:04 UTC, Imre Deak
debug ironlake_crtc_disable-2 (2.22 KB, text/plain) 2013-09-05 12:17 UTC, Imre Deak
dmesg with patch comment 21 (74.64 KB, text/plain) 2013-09-06 05:55 UTC, lu hua

Description lu hua 2013-07-24 06:27:42 UTC

Created attachment 82900 [details]
dmesg

System Environment:
--------------------------
Platform:    Ironlake
Kernel:     (drm-intel-fixes)363202bb22467ea1de6dd284b78eff5cf517db66

Bug detailed description:
-----------------------------
The test randomly hangs the system on Ironlake with both the drm-intel-fixes and drm-intel-next-queued kernels.
It happens in 1 of 5 runs.

output:
Beginning test gpu-blit with 1366x768 @ 60Hz / RGB565 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / RGB565 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / RGB888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / RGB888 on pipe A, encoder TMDS, connector Embedded DisplayPort: SKIPPED
Beginning test gpu-blit with 1366x768 @ 60Hz / XRGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / XRGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / XRGB2101010 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / XRGB2101010 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / ARGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 60Hz / ARGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / RGB565 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / RGB565 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / RGB888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / RGB888 on pipe A, encoder TMDS, connector Embedded DisplayPort: SKIPPED
Beginning test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / XRGB2101010 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / XRGB2101010 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / ARGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort
Test gpu-blit with 1366x768 @ 40Hz / ARGB8888 on pipe A, encoder TMDS, connector Embedded DisplayPort: PASSED


Reproduce steps:
----------------------------
1. Run ./kms_render --run-subtest gpu-blit for 5 cycles

Comment 1 Chris Wilson 2013-07-24 09:05:06 UTC

Can you please record the point of failure for a large number of runs (say ~10 fails)? My guess is that it has to be a modeset vs blit race (this test needs a background thread also doing blits) and so I want to see if there is any commonality with which connector/crtc/etc that it hangs upon.
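
One way to collect the point of failure over many runs is to capture each run's stdout and classify its last non-empty line. The following is only a minimal sketch, assuming the output format shown in the description ("Beginning test ..." followed by "Test ...: RESULT"); the enum and the function name classify_last_line are hypothetical helpers, not part of igt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Where a captured run stopped, judged by its last non-empty output line. */
enum stop_point {
    STOP_EMPTY,       /* no output at all */
    STOP_DURING_TEST, /* last line is "Beginning test ..." with no result */
    STOP_AFTER_TEST,  /* last line is a "Test ...: RESULT" line */
    STOP_UNKNOWN      /* last line matches neither pattern */
};

/*
 * Copy the last non-empty line of `log` into `out` (at most out_len - 1
 * characters) and classify it. `log` is the captured stdout of one run.
 * Hypothetical helper for illustration only.
 */
enum stop_point classify_last_line(const char *log, char *out, size_t out_len)
{
    const char *last = NULL;
    size_t last_len = 0;

    assert(log != NULL && out != NULL && out_len > 0);

    /* Walk the log line by line, remembering the last non-empty line. */
    for (const char *p = log; *p != '\0'; ) {
        size_t len = strcspn(p, "\n");
        if (len > 0) {
            last = p;
            last_len = len;
        }
        p += len;
        if (*p == '\n')
            p++;
    }

    out[0] = '\0';
    if (last == NULL)
        return STOP_EMPTY;

    if (last_len >= out_len)
        last_len = out_len - 1;
    memcpy(out, last, last_len);
    out[last_len] = '\0';

    if (strncmp(out, "Beginning test ", 15) == 0)
        return STOP_DURING_TEST;
    if (strncmp(out, "Test ", 5) == 0)
        return STOP_AFTER_TEST;
    return STOP_UNKNOWN;
}
```

Tallying the copied line across roughly ten hung runs would show whether the hang clusters on one mode, format, connector, or pipe. Because the machine hard-hangs, the output would have to be captured remotely or flushed to disk after every line.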

Comment 2 Daniel Vetter 2013-08-04 23:00:47 UTC

Is this still an issue with latest i-g-t? Imre committed a few patches which might help here.

Comment 3 lu hua 2013-08-06 02:28:25 UTC

(In reply to comment #2)
> Is this still an issue with latest i-g-t? Imre committed a few patches which
> might help here.

It still happens on the latest commit.
The kms_render subtest direct-render also randomly causes a system hang.

Comment 4 Chris Wilson 2013-08-11 11:20:04 UTC

Random hard hangs are nearly impossible to solve, yet are still critical bugs. :(

Perhaps bumping the drm.debug to 7 so we get all the ioctls as well. Unlikely to help clarify matters, but you never know...
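
For reference, drm.debug is a bitmask of message categories. The sketch below decodes a mask into category names. It assumes the bit layout of kernels from that period (DRM_UT_CORE 0x01, DRM_UT_DRIVER 0x02, DRM_UT_KMS 0x04, DRM_UT_PRIME 0x08 in drmP.h), so the values should be checked against the kernel actually being tested. The DBG_* macros and describe_drm_debug are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Category bits, assumed to mirror DRM_UT_* in 2013-era drmP.h. */
#define DBG_CORE   0x01u /* DRM core messages, which is where ioctl traces are logged */
#define DBG_DRIVER 0x02u /* driver-specific messages */
#define DBG_KMS    0x04u /* modesetting messages */
#define DBG_PRIME  0x08u /* buffer-sharing (PRIME) messages */

/* Write a comma-separated list of the categories enabled in `mask` to `out`. */
void describe_drm_debug(unsigned int mask, char *out, size_t out_len)
{
    static const struct {
        unsigned int bit;
        const char *name;
    } cats[] = {
        { DBG_CORE, "core" },
        { DBG_DRIVER, "driver" },
        { DBG_KMS, "kms" },
        { DBG_PRIME, "prime" },
    };

    assert(out != NULL && out_len > 0);
    out[0] = '\0';

    for (size_t i = 0; i < sizeof cats / sizeof cats[0]; i++) {
        size_t used;

        if (!(mask & cats[i].bit))
            continue;
        used = strlen(out);
        /* Append ",name", or just "name" for the first entry. */
        snprintf(out + used, out_len - used, "%s%s", used ? "," : "", cats[i].name);
    }
}
```

Under these assumptions a mask of 7 enables the core, driver, and kms categories.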

Comment 5 lu hua 2013-08-12 07:11:59 UTC

Created attachment 83957 [details]
dmesg with drm.debug=7

Comment 6 Paulo Zanoni 2013-08-15 13:15:39 UTC

Is this bisectable? If we run the kms_render test case about 50 times on each bisect step we may get some reliable bisecting results.
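
As a rough check on how many runs a bisect step needs, the chance of wrongly marking a bad commit as good can be estimated from the per-run hang rate. This is only a sketch: it assumes runs are independent and that the hang rate is the same on every bad commit, and both function names are hypothetical.

```c
#include <assert.h>

/*
 * Probability that `runs` consecutive runs all pass on a commit that
 * hangs with per-run probability `p_hang` (runs assumed independent).
 */
double false_good_probability(double p_hang, unsigned int runs)
{
    double prob = 1.0;

    assert(p_hang >= 0.0 && p_hang <= 1.0);
    for (unsigned int i = 0; i < runs; i++)
        prob *= 1.0 - p_hang;
    return prob;
}

/*
 * Smallest number of clean runs after which the chance of a bad commit
 * having passed them all is at most `max_false_good`.
 */
unsigned int runs_needed(double p_hang, double max_false_good)
{
    unsigned int runs = 0;
    double prob = 1.0;

    assert(p_hang > 0.0 && p_hang <= 1.0);
    assert(max_false_good > 0.0 && max_false_good < 1.0);
    while (prob > max_false_good) {
        prob *= 1.0 - p_hang;
        runs++;
    }
    return runs;
}
```

With the 1-in-5 rate from the description, runs_needed(0.2, 0.01) evaluates to 21, so 50 runs per step leaves a wide margin as long as the independence assumption roughly holds.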

Comment 7 lu hua 2013-08-16 06:53:11 UTC

(In reply to comment #6)
> Is this bisectable? If we run the kms_render test case about 50 times on
> each bisect step we may get some reliable bisecting results.


We will try to bisect it.

Comment 8 lu hua 2013-08-26 07:58:57 UTC

I can't find a good commit.
I tested on drm-intel-fixes (8abbbaf6adb46157b6bd416f7616b555cc6a332f), and it also happens there.

Comment 9 Imre Deak 2013-08-27 15:51:21 UTC

(In reply to comment #8)
> I can't find a good commit.
> Test on drm-intel-fixes(8abbbaf6adb46157b6bd416f7616b555cc6a332f), It also
> happens.

I can't reproduce this on the latest -nightly with my IVB EliteBook 8440p. At least one difference is the resolution (1600x900 vs your 1366x768). The above commit is rather old (Mar 27), so could you retest with the latest -nightly instead (both with --run-subtest gpu-blit and direct-render)?

Also since in your log the hang seems to happen at the last step, there is a slight chance gem_quiescent_gpu gets stuck somehow, so could you try both subtests also with the following igt patch:

diff --git a/lib/drmtest.c b/lib/drmtest.c
index 37d7da3..aa382ff 100644
--- a/lib/drmtest.c
+++ b/lib/drmtest.c
@@ -243,8 +243,10 @@ int drm_open_any(void)
 	if (__sync_fetch_and_add(&open_count, 1))
 		return fd;
 
+	if (0) {
 	gem_quiescent_gpu(fd);
 	igt_install_exit_handler(quiescent_gpu_at_exit);
+	}
 
 	return fd;
 }

Is it a hard hang, that is you can't even ping/ssh the machine?

Thanks.

Comment 10 Imre Deak 2013-08-28 15:42:30 UTC

Correction to comment #9: the EliteBook I ran the test on was an ILK, not an IVB.

Now tried with a modified kernel to force your 1366x768 resolution without panel fitting on eDP, which results in corrupted output, but otherwise should result in correct signal generation matching your timings exactly. Still can't reproduce the bug after running it many hours today.

So I'd need your input on comment #9.

Could you also attach your kernel's .config?

Comment 11 lu hua 2013-08-30 07:27:05 UTC

> 
> diff --git a/lib/drmtest.c b/lib/drmtest.c
> index 37d7da3..aa382ff 100644
> --- a/lib/drmtest.c
> +++ b/lib/drmtest.c
> @@ -243,8 +243,10 @@ int drm_open_any(void)
>  	if (__sync_fetch_and_add(&open_count, 1))
>  		return fd;
>  
> +	if (0) {
>  	gem_quiescent_gpu(fd);
>  	igt_install_exit_handler(quiescent_gpu_at_exit);
> +	}
>  
>  	return fd;
>  }
> 
I tested with this patch; it still happens. When it hangs, I can't ping/ssh the machine.

Comment 12 lu hua 2013-08-30 07:27:38 UTC

Created attachment 84898 [details]
kernel config

Comment 13 Imre Deak 2013-08-30 19:26:13 UTC

Created attachment 84929 [details] [review]
fix ilk ring flush workaround

The only notable thing in your .config was that you're running a 32-bit kernel, but I couldn't reproduce the hang even on that.

I'm still not sure that it's not a ring flushing issue, since in your last reply you didn't provide separate results for --run-subtest gpu-blit and direct-render. Anyway, I noticed that Chris' infinite __wait_seqno patch makes a clear improvement in gem_quiescent_gpu(), by getting rid of occasional stalls in it, so it would be worth trying it (and I just realized he also suggested this earlier on IRC to me). Please give a go to the following:

https://patchwork.kernel.org/patch/2849600/

I also found an igt regression in flush_on_ring_common() and since that is related to ilk, it's also a possible cause. I attached the fix for it, could you please try that too?

Comment 14 lu hua 2013-09-02 09:07:32 UTC

(In reply to comment #13)
> Created attachment 84929 [details] [review]
> fix ilk ring flush workaround
> 
> The only notable from your .config was that you're running on 32 bit kernel,
> but I couldn't reproduce the hang even on that.
> 
> I'm still not sure that it's not a ring flushing issue, since in your last
> reply you didn't provide separate results for --run-subtest gpu-blit and
> direct-render. Anyway, I noticed that Chris' infinite __wait_seqno patch
> makes a clear improvement in gem_quiescent_gpu(), by getting rid of
> occasional stalls in it, so it would be worth trying it (and I just realized
> he also suggested this earlier on IRC to me). Please give a go to the
> following:
> 
> https://patchwork.kernel.org/patch/2849600/
> 
> I also found an igt regression in flush_on_ring_common() and since that is
> related to ilk, it's also a possible cause. I attached the fix for it, could
> you please try that too?


I tested with these two patches; the hang happens in the fourth cycle.
output:
Beginning test gpu-blit with 1366x768 @ 60Hz / RGB565 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / RGB565 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / RGB888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / RGB888 on pipe A, encoder TMDS, connector eDP: SKIPPED
Beginning test gpu-blit with 1366x768 @ 60Hz / XRGB8888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / XRGB8888 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / XRGB2101010 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / XRGB2101010 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 60Hz / ARGB8888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 60Hz / ARGB8888 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / RGB565 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 40Hz / RGB565 on pipe A, encoder TMDS, connector eDP: PASSED
Beginning test gpu-blit with 1366x768 @ 40Hz / RGB888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 40Hz / RGB888 on pipe A, encoder TMDS, connector eDP: SKIPPED
Beginning test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS, connector eDP
Test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS, connector eDP: PASSED

Comment 15 Imre Deak 2013-09-02 14:51:05 UTC

Created attachment 85064 [details]
disable steps one-by-one

(In reply to comment #14)
> (In reply to comment #13)
> [...]
> I test with these two patches, It happens in the fourth cycle.
> output:
> [...]
> Beginning test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder
> TMDS, connector eDP
> Test gpu-blit with 1366x768 @ 40Hz / XRGB8888 on pipe A, encoder TMDS,
> connector eDP: PASSED

Hm, unless the output is shortened somehow, the above output and your earlier dmesg seem to suggest that the hang happens during kmstest_remove_fb(), but we'd need to instrument the kernel to find the exact place.

Before that we could try to narrow down things by disabling each test step. Could you try applying the attached disable patches and see with which ones the gpu-blit test stops hanging (if at all :P)? First with patch 1, then patch 1+2, then 1+2+3 etc.

Thanks.

Comment 16 lu hua 2013-09-04 07:12:43 UTC

I tested patches 1, 1+2, 1+2+3, 1+2+3+4, and 1+2+3+4+5; it still happens.

Comment 17 Imre Deak 2013-09-04 11:08:46 UTC

Created attachment 85190 [details] [review]
debug ironlake_crtc_disable

(In reply to comment #16)
> Test patch 1, 1+2, 1+2+3, 1+2+3+4, 1+2+3+4+5, It still happens.

Thanks, that narrows it down to a:

fb = kmstest_create_fb2();
drmModeSetCrtc(fb);
kmstest_remove_fb(fb);

loop. And considering your dmesg that seems to point to somewhere in ironlake_crtc_disable() when calling kmstest_remove_fb(). I attached a kernel patch to add debug info to ironlake_crtc_disable().

I noticed only now that you are missing some useful debugging options from your kernel .config, please enable them (and I suggest keeping them on for the future too):

CONFIG_LOCKUP_DETECTOR
CONFIG_DETECT_HUNG_TASK
CONFIG_PROVE_LOCKING

Then please run the test again with all the disable patches 1-5 applied and the attached kernel patch, and provide a new dmesg log (with drm.debug=0xf and including everything starting from boot-up) and the output from the test.

Comment 18 lu hua 2013-09-05 07:23:25 UTC

(In reply to comment #17)
> Created attachment 85190 [details] [review]
> debug ironlake_crtc_disable
> 

I tested with this patch; it still happens.

Comment 19 lu hua 2013-09-05 07:24:12 UTC

Created attachment 85229 [details]
dmesg with patch debug ironlake_crtc_disable

Comment 20 Imre Deak 2013-09-05 12:04:33 UTC

Created attachment 85244 [details] [review]
fix modeset disable sequence

Comment 21 Imre Deak 2013-09-05 12:17:18 UTC

Created attachment 85246 [details]
debug ironlake_crtc_disable-2

(In reply to comment #19)
> Created attachment 85229 [details]
> dmesg with patch debug ironlake_crtc_disable

It seems to hang in intel_disable_plane. According to Ville, disabling clocks while planes are still on might cause this. I checked, and our disable sequence for ilk does not follow the spec: we should disable planes before disabling the port.

Could you try again with all of the following kernel/igt patches applied:
- fix modeset disable sequence
- debug ironlake_crtc_disable-2
- disable steps one-by-one (all of 1-5)

and send a full dmesg with drm.debug=0xf?

Comment 22 lu hua 2013-09-06 05:53:45 UTC

> Could you try again with all of the following kernel/igt patches applied:
> - fix modeset disable sequence
> - debug ironlake_crtc_disable-2
> - disable steps one-by-one (all of 1-5)
> 
> and send a full dmesg with drm.debug=0xf?

I tested with these patches; it still happens.

Comment 23 lu hua 2013-09-06 05:55:00 UTC

Created attachment 85305 [details]
dmesg with patch comment 21

Comment 24 Imre Deak 2013-09-06 09:54:39 UTC

(In reply to comment #23)
> Created attachment 85305 [details]
> dmesg with patch comment 21

Ok, thanks a lot for testing these. I didn't get much closer to the root cause but at least we eliminated some possible causes. To summarize, a loop as in comment 17 is enough to trigger the problem and the hang seems to happen in intel_disable_plane().

The wrong ilk disable sequence I mentioned is only a red herring, since the port only gets disabled in encoder->post_disable(), so we can forget about that.

One more pattern I noticed in your dmesg is regular fifo underflows, that happen consistently with certain modes, resulting in lower watermark thresholds. And the hang happened in all cases with the same threshold values. Atm, no idea why you get those, I don't get any underflows on my ilk even with your timings.

Comment 25 Imre Deak 2013-09-12 15:40:25 UTC

I could hit a hard hang by causing pipe underruns, and it seems to be the same as what you see. Leaving the primary WM value at its default and keeping the LP WMs disabled got rid of the problem for me. Ville's WM rework also seems to fix the issue for me, so could you give it a try:

git://gitorious.org/vsyrjala/linux.git watermarks_for_imre branch.

Comment 26 lu hua 2013-09-16 07:55:09 UTC

> 
> git://gitorious.org/vsyrjala/linux.git watermarks_for_imre branch.

I ran 5 cycles on this branch; they all worked well.

Comment 27 Rodrigo Vivi 2013-12-17 14:09:09 UTC

Hi lu hua,

Could you please verify with latest drm-intel-nightly branch:

http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-nightly

It should be fixed, so we can close this bug.

Thanks,
Rodrigo.

Comment 28 lu hua 2013-12-20 06:35:11 UTC

Tested on the latest -nightly kernel.
It works well.

Comment 29 lu hua 2013-12-20 06:35:47 UTC

Verified. Fixed.

Comment 30 Jari Tahvanainen 2016-11-22 08:49:33 UTC

Closing verified+fixed.