99312 – Long-running OpenCL kernels cause ring stalls and GPU lockups on Kabini when radeon.lockup_timeout is enabled

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

Blocks:	99553

Reported:	2017-01-07 18:25 UTC by Vedran Miletić
Modified:	2019-09-25 17:56 UTC
CC List:	0 users

Description Vedran Miletić 2017-01-07 18:25:19 UTC

Running long-lasting OpenCL kernels (e.g. GROMACS with a system of many atoms) using kernel 4.8.15, Mesa git, and LLVM git on a Kabini APU:

vendor_id       : AuthenticAMD
cpu family      : 22
model           : 0
model name      : AMD Athlon(tm) 5350 APU with Radeon(tm) R3
stepping        : 1
microcode       : 0x700010b

with GPU:

00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Kabini [Radeon HD 8400 / R3 Series] [1002:9830]

causes GPU lockups like:

[338584.980657] radeon 0000:00:01.0: ring 0 stalled for more than 10351msec
[338584.980811] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
[338585.484633] radeon 0000:00:01.0: ring 0 stalled for more than 10855msec
[338585.484789] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
[338585.988632] radeon 0000:00:01.0: ring 0 stalled for more than 11359msec
[338585.988787] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
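
The stall durations in these messages can be compared against the watchdog threshold. The following is a minimal sketch in C, assuming the dmesg line format shown above. `parse_ring_stall` and `DEFAULT_LOCKUP_TIMEOUT_MS` are hypothetical names introduced for illustration and are not part of the radeon driver. The 10000 ms default is taken from the `lockup_timeout` parameter description quoted in comment 1.

```c
#include <stdio.h>
#include <string.h>

/* Default radeon.lockup_timeout in ms, per the parameter description
 * quoted in comment 1 of this report. */
#define DEFAULT_LOCKUP_TIMEOUT_MS 10000L

/*
 * Hypothetical helper (not part of the radeon driver): extract the ring
 * index and stall duration from a dmesg line of the form
 *   "... ring 0 stalled for more than 10351msec"
 * Returns 1 and fills *ring and *stall_ms on success, 0 if the line does
 * not match. The outputs are left untouched on failure.
 */
static int parse_ring_stall(const char *line, int *ring, long *stall_ms)
{
    int r;
    long ms;
    const char *p = strstr(line, "ring ");

    if (p == NULL)
        return 0;
    if (sscanf(p, "ring %d stalled for more than %ldmsec", &r, &ms) != 2)
        return 0;

    *ring = r;
    *stall_ms = ms;
    return 1;
}
```

Applied to the first message above, this yields ring 0 and 10351 ms, which is 351 ms past the assumed 10000 ms default. The "GPU lockup" lines do not match the pattern and are rejected.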

The machine itself does not hang. This is reliably reproducible. Is there any other info I can provide?

Comment 1 John Bridgman 2017-01-07 18:36:56 UTC

If you have not already done so, try disabling the watchdog timer:


MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000 = 10 seconds, 0 = disable)");
module_param_named(lockup_timeout, radeon_lockup_timeout, int, 0444);

As part of HSA/ROC development we dropped the priority of compute work relative to graphics, which improved interactivity and *almost* eliminated timeouts without having to disable the timer. When I get back in the office I'll dig up the changes. In the meantime, I think disabling the timer will do what you need, although you will still have sluggish graphics while long-running kernels are active.

Lowering the priority of compute waves across the board won't be a fully general solution, because there are going to be some cases (e.g. Valve's recent work on using high-priority compute to improve VR smoothness) where compute will need to be *higher* priority than graphics, but it should cover most cases other than "simultaneously running GROMACS and VR".
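
To check which timeout value the loaded module is actually using, the parameter can be read back from sysfs. The following is a minimal sketch in C, assuming the usual sysfs location for module parameters. The 0444 permission mask in the `module_param_named` line above makes the parameter world-readable but not writable at runtime. `read_lockup_timeout` and `lockup_watchdog_disabled` are hypothetical helper names, not driver functions.

```c
#include <stdio.h>

/* Assumed sysfs location: module parameters declared with a non-zero
 * permission mask (0444 here) appear under
 * /sys/module/<module>/parameters/. */
#define RADEON_LOCKUP_TIMEOUT_PATH \
    "/sys/module/radeon/parameters/lockup_timeout"

/*
 * Hypothetical helper: read an integer parameter value from the given
 * file. Returns 1 and stores the value in *timeout_ms on success,
 * 0 if the file cannot be opened or does not start with an integer.
 */
static int read_lockup_timeout(const char *path, int *timeout_ms)
{
    int value;
    int ok;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return 0;
    ok = (fscanf(f, "%d", &value) == 1);
    fclose(f);

    if (ok)
        *timeout_ms = value;
    return ok;
}

/* Per the parameter description, a value of 0 disables the watchdog. */
static int lockup_watchdog_disabled(int timeout_ms)
{
    return timeout_ms == 0;
}
```

Calling `read_lockup_timeout(RADEON_LOCKUP_TIMEOUT_PATH, &v)` on a system with the radeon module loaded should report the active value. Because the parameter is read-only at runtime, it has to be set when the module is loaded, typically with `radeon.lockup_timeout=0` on the kernel command line or an `options radeon lockup_timeout=0` line in a modprobe configuration file.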

Comment 2 Vedran Miletić 2017-01-09 17:02:43 UTC

(In reply to John Bridgman from comment #1)
> If you have not already done so, try disabling the watchdog timer:
> 
> 
> MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000 =
> 10 seconds, 0 = disable)");
> module_param_named(lockup_timeout, radeon_lockup_timeout, int, 0444);
> 

Yup, that works around the problem.

> As part of HSA/ROC development we dropped the priority of compute work
> relative to graphics which improved interactivity and *almost* eliminated
> timeouts without having to disable the timer  - when I get back in the
> office I'll dig up the changes. In the meantime, I think disabling the timer
> will do what you need although you will still have sluggish graphics while
> long-running kernels are active.
> 

Eager to hear the details.

Comment 3 GitLab Migration User 2019-09-25 17:56:39 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new issue on our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1246
