Bug 99312 - Long-running OpenCL kernels cause ring stalls and GPU lockups on Kabini when radeon.lockup_timeout is enabled
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi (show other bugs)
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 99553
Reported: 2017-01-07 18:25 UTC by Vedran Miletić
Modified: 2019-05-11 20:07 UTC (History)
CC List: 0 users



Description Vedran Miletić 2017-01-07 18:25:19 UTC
Running long-lasting OpenCL kernels (e.g. GROMACS with a system of many atoms) using kernel 4.8.15, Mesa git, and LLVM git on a Kabini APU:

vendor_id       : AuthenticAMD
cpu family      : 22
model           : 0
model name      : AMD Athlon(tm) 5350 APU with Radeon(tm) R3
stepping        : 1
microcode       : 0x700010b

with GPU:

00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Kabini [Radeon HD 8400 / R3 Series] [1002:9830]

causes GPU lockups like:

[338584.980657] radeon 0000:00:01.0: ring 0 stalled for more than 10351msec
[338584.980811] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
[338585.484633] radeon 0000:00:01.0: ring 0 stalled for more than 10855msec
[338585.484789] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
[338585.988632] radeon 0000:00:01.0: ring 0 stalled for more than 11359msec
[338585.988787] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)

The machine does not hang. The problem is reliably reproducible. Is there any other info I can provide?
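For anyone triaging similar reports, here is a minimal Python sketch that pulls the stall durations and fence ids out of log lines like the ones above. The regular expressions are written against the exact messages quoted in this report and may not match other kernel versions. Reading "current fence id" as the last completed fence and "last fence id" as the last submitted one is an assumption on my part, not something confirmed in this thread.

```python
import re

# Sample lines copied verbatim from the report above.
LOG = """\
[338584.980657] radeon 0000:00:01.0: ring 0 stalled for more than 10351msec
[338584.980811] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
[338585.484633] radeon 0000:00:01.0: ring 0 stalled for more than 10855msec
[338585.484789] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
[338585.988632] radeon 0000:00:01.0: ring 0 stalled for more than 11359msec
[338585.988787] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
"""

# Patterns derived from the messages above (an assumption about their stability).
STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-f]+) "
    r"last fence id (0x[0-9a-f]+) on ring (\d+)\)"
)


def parse_stalls(text):
    """Return a list of (ring, stall_ms) for every 'ring N stalled' line."""
    return [(int(m.group(1)), int(m.group(2))) for m in STALL_RE.finditer(text)]


def parse_lockups(text):
    """Return a list of (ring, current_fence, last_fence) for every lockup line."""
    return [
        (int(m.group(3)), int(m.group(1), 16), int(m.group(2), 16))
        for m in LOCKUP_RE.finditer(text)
    ]


stalls = parse_stalls(LOG)
lockups = parse_lockups(LOG)

# Difference between the two fence ids in each lockup message.
pending = {last - cur for _, cur, last in lockups}

for ring, ms in stalls:
    print(f"ring {ring}: stalled for more than {ms} ms")
print("fence id gaps seen:", sorted(pending))
```

In these lines the two fence ids differ by one and the current fence id never advances, which would be consistent with a single long-running submission still executing rather than a queue of stuck work.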
Comment 1 John Bridgman 2017-01-07 18:36:56 UTC
If you have not already done so, try disabling the watchdog timer:


MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000 = 10 seconds, 0 = disable)");
module_param_named(lockup_timeout, radeon_lockup_timeout, int, 0444);

As part of HSA/ROC development we dropped the priority of compute work relative to graphics which improved interactivity and *almost* eliminated timeouts without having to disable the timer  - when I get back in the office I'll dig up the changes. In the meantime, I think disabling the timer will do what you need although you will still have sluggish graphics while long-running kernels are active.

Lowering the priority of compute waves across the board won't be a fully general solution, because there will be some cases (e.g. Valve's recent work on using high-priority compute to improve VR smoothness) where compute needs to be *higher* priority than graphics, but it should cover most cases other than "simultaneously running GROMACS and VR".
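A note on applying the workaround: the parameter quoted above is declared with mode 0444, so it is read-only once the module is loaded and has to be set at load time. The usual ways are `radeon.lockup_timeout=0` on the kernel command line or an `options radeon lockup_timeout=0` line in a modprobe.d configuration file. To check which value is actually in effect, something like the following Python sketch should work. It assumes the radeon module is loaded and exposes the parameter at the standard sysfs location; the function names are just illustrative.

```python
from pathlib import Path

# Standard sysfs location for a loaded module's parameters.
# Assumption: the radeon module is loaded on this machine.
PARAM_PATH = Path("/sys/module/radeon/parameters/lockup_timeout")


def describe_lockup_timeout(raw):
    """Interpret the raw parameter text as described by MODULE_PARM_DESC.

    The value is in milliseconds; 0 means the lockup watchdog is disabled.
    """
    ms = int(raw.strip())
    if ms == 0:
        return "watchdog disabled"
    return f"watchdog fires after {ms} ms"


def read_lockup_timeout(path=PARAM_PATH):
    """Read the parameter from sysfs, or report that it is unavailable."""
    try:
        return describe_lockup_timeout(path.read_text())
    except OSError:
        return "parameter not available (radeon module not loaded?)"


if __name__ == "__main__":
    print(read_lockup_timeout())
```

With the default described in the parameter text (10000 ms), this would report a 10-second watchdog, which matches the "stalled for more than 10351msec" message in the original report.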
Comment 2 Vedran Miletić 2017-01-09 17:02:43 UTC
(In reply to John Bridgman from comment #1)
> If you have not already done so, try disabling the watchdog timer:
> 
> 
> MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000 =
> 10 seconds, 0 = disable)");
> module_param_named(lockup_timeout, radeon_lockup_timeout, int, 0444);
> 

Yup, that works around the problem.

> As part of HSA/ROC development we dropped the priority of compute work
> relative to graphics which improved interactivity and *almost* eliminated
> timeouts without having to disable the timer  - when I get back in the
> office I'll dig up the changes. In the meantime, I think disabling the timer
> will do what you need although you will still have sluggish graphics while
> long-running kernels are active.
> 

Eager to hear the details.

