Bug 97985 - [BDW] System hang while running Qt QML application, with unclaimed registers
Summary: [BDW] System hang while running Qt QML application, with unclaimed registers
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 91960 97589
Depends on:
Blocks:
 
Reported: 2016-09-30 13:41 UTC by Anssi Hannula
Modified: 2018-04-25 06:37 UTC

See Also:
i915 platform: BDW
i915 features: GEM/Other


Attachments
BDW hang with intel-drm-nightly 20160927 (81.40 KB, text/plain)
2016-09-30 13:41 UTC, Anssi Hannula
drm-intel-nightly reports unclaimed register 0x220a8 (78.05 KB, text/plain)
2016-09-30 13:42 UTC, Anssi Hannula
drm-intel-nightly reports unclaimed register 0x44408 (77.30 KB, text/plain)
2016-09-30 13:43 UTC, Anssi Hannula
intel-drm-nightly 20160927 unclaimed read of 0x44324 (77.79 KB, text/plain)
2016-10-03 10:19 UTC, Anssi Hannula
intel-drm-nightly 20160927 unclaimed read of 0x44404 (77.70 KB, text/plain)
2016-10-03 10:20 UTC, Anssi Hannula
intel-drm-nightly 20160927 unclaimed write of 0x20a8 (77.34 KB, text/plain)
2016-10-03 10:20 UTC, Anssi Hannula
intel-drm-nightly 20160927 unclaimed write of 0x44324 (77.68 KB, text/plain)
2016-10-03 10:20 UTC, Anssi Hannula
intel-drm-nightly 20160927 unclaimed write of 0x44404 (77.66 KB, text/plain)
2016-10-03 10:21 UTC, Anssi Hannula
intel-drm-nightly 20160927 unclaimed write of 0x44324 with i915.enable_psr=0 (77.96 KB, text/plain)
2016-10-03 14:53 UTC, Anssi Hannula
intel-drm-nightly 20160927 unclaimed read of 0x44404 with i915.enable_psr=0 (77.83 KB, text/plain)
2016-10-03 15:45 UTC, Anssi Hannula
intel-drm-nightly 20160927 unclaimed read of 0x44324 with patchset 13230 (77.76 KB, text/plain)
2016-10-04 08:44 UTC, Anssi Hannula
Grab RPM reference around vblank (3.63 KB, patch)
2016-10-07 13:49 UTC, Chris Wilson
intel-drm-nightly 20161007 unclaimed writes/reads of 0x44324, writes of 0x20a8, with patch from comment 16 (134.72 KB, text/plain)
2016-10-07 14:57 UTC, Anssi Hannula
intel-drm-nightly 20170321 unclaimed accesses and hang (118.23 KB, text/plain)
2017-03-22 08:20 UTC, Anssi Hannula
attachment-21796-0.html (534 bytes, text/html)
2017-07-26 15:11 UTC, Anssi Hannula

Description Anssi Hannula 2016-09-30 13:41:53 UTC
Created attachment 126897 [details]
BDW hang with intel-drm-nightly 20160927

The system hangs whenever a Qt QML application has run for a while (minutes or hours). Qt QML uses OpenGL for rendering. It is a full system hang, not just GPU.

E.g. simply running the "quickwidget" example application in Qt 5.6.1 (qtdeclarative/examples/quick/quickwidgets/quickwidget , with the rotating red square) will trigger a full system freeze within a day (within minutes most of the time).

When the application starts, within seconds (i.e. long before the freeze) there is always an "Unclaimed register detected" warning in the log, one of these:
- Unclaimed register detected before writing to register 0x44324
- Unclaimed register detected before writing to register 0x220a8
- Unclaimed register detected before reading register 0x44408
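For anyone triaging similar logs, here is a minimal Python sketch that tallies these warnings by access type and register offset. The regular expression is inferred from the three message variants quoted above; the function name `summarize_unclaimed` is hypothetical, and other kernel versions may word the warning differently.

```python
import re
from collections import Counter

# Pattern inferred from the warning variants quoted in this report (an assumption;
# other kernel versions may phrase the message differently).
UNCLAIMED_RE = re.compile(
    r"Unclaimed register detected before (reading|writing to) register (0x[0-9a-fA-F]+)"
)


def summarize_unclaimed(log_text):
    """Count unclaimed-register warnings per (access type, register offset)."""
    counts = Counter()
    for match in UNCLAIMED_RE.finditer(log_text):
        access = "read" if match.group(1) == "reading" else "write"
        counts[(access, int(match.group(2), 16))] += 1
    return counts
```

It can be fed the contents of a saved serial log or dmesg capture, for example `summarize_unclaimed(open("bdw-hang-nightly-20160927.txt").read())`.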

This message is not there with e.g. glxgears (with which I have also not been able to reproduce the hang so far).

With a serial console attached, I see no output at the time of the freeze.

The issue also happens with the latest Intel processor microcode (loaded using the early load mechanism).

With "intel_idle.max_cstate=1" kernel parameter the hang does not occur, or at least it occurs so much more rarely that I haven't seen it.


The attached logs are all with drm-intel-nightly with drm.debug=0xe.
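As a reading aid for the `drm.debug=0xe` setting: the value is a bitmask of message categories. The sketch below decodes it using the category bits as I understand them from the DRM module parameter description for kernels of this era; treat the table as an assumption and check it against the kernel actually in use. The function name `decode_drm_debug` is hypothetical.

```python
# Category bits of the drm.debug module parameter (assumed from the DRM
# module parameter description; verify against your kernel version).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]
```

Under that table, `decode_drm_debug(0xe)` selects the DRIVER, KMS and PRIME categories, so core and vblank debug messages are not included in the attached logs.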

bdw-hang-nightly-20160927.txt contains a quickwidget hang captured via serial port, and it also contains an unclaimed register warning for register 0x44324.

bdw-unclaimed-0x220a8.txt and bdw-unclaimed-0x44408.txt contain the other variants of the "Unclaimed register" warning that I have seen. I did not wait for the freeze in these runs, but I have seen the hang follow those messages in other runs with a different kernel and without drm debugging.



The setup is:

System architecture: x86_64

Kernel version: drm-intel-nightly 2016y-09m-27d-16h-32m-56s UTC (also seen in 4.4.18, 4.4.22, 4.8-rc8).

Linux distribution: Yocto 2.1-based build (mesa 11.1.1, X.org 1.18.0).

Machine model: Sintrones VBOX-3610

Display connector: DVI (appears as HDMI1)
Comment 1 Anssi Hannula 2016-09-30 13:42:40 UTC
Created attachment 126898 [details]
drm-intel-nightly reports unclaimed register 0x220a8
Comment 2 Anssi Hannula 2016-09-30 13:43:00 UTC
Created attachment 126899 [details]
drm-intel-nightly reports unclaimed register 0x44408
Comment 3 yann 2016-09-30 15:39:24 UTC
Can you try to apply Chris's patch https://patchwork.freedesktop.org/series/13161/ and re-test?
Comment 4 Anssi Hannula 2016-10-03 10:19:04 UTC
With the patch, different unclaimed register accesses are reported (at the same time as before, i.e. a couple of seconds after I start the quickwidget example), at least the following (I'll attach logs):
- read/write of 0x44324
- read/write of 0x44404
- write of 0x20a8

As expected, the hangs still happen. With the quickwidget example, specifically, they seem to consistently happen within minutes (not hours).
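To relate the offsets seen in this bug to hardware blocks, the following Python sketch maps them to rough descriptions. The layout (per-ring interrupt mask at ring base + 0xa8, GT interrupt blocks from 0x44300, display pipe interrupt blocks from 0x44400, each block laid out as ISR/IMR/IIR/IER) is my reading of the gen8 definitions in i915_reg.h and should be treated as an assumption; `describe_register` is a hypothetical helper, not a driver function.

```python
# Offsets within each 0x10-byte interrupt block (assumed ISR/IMR/IIR/IER layout).
KIND = {0x0: "ISR", 0x4: "IMR", 0x8: "IIR", 0xC: "IER"}
RING_IMR_OFFSET = 0xA8
RING_BASES = {0x02000: "render ring", 0x22000: "blitter ring"}


def describe_register(offset):
    """Give a rough description of the register offsets discussed in this bug."""
    for base, ring in RING_BASES.items():
        if offset == base + RING_IMR_OFFSET:
            return f"{ring} IMR"
    if 0x44300 <= offset < 0x44340:
        block, reg = divmod(offset - 0x44300, 0x10)
        if reg in KIND:
            return f"GT interrupt block {block} {KIND[reg]}"
    if 0x44400 <= offset < 0x44430:
        pipe, reg = divmod(offset - 0x44400, 0x10)
        if reg in KIND:
            return f"display pipe {'ABC'[pipe]} {KIND[reg]}"
    return "unknown"
```

Under these assumptions, 0x44404 and 0x44408 fall in the display pipe A interrupt block, 0x44324 is the mask register of GT interrupt block 2, and 0x20a8 / 0x220a8 are ring interrupt mask registers.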
Comment 5 Anssi Hannula 2016-10-03 10:19:47 UTC
Created attachment 126964 [details]
intel-drm-nightly 20160927 unclaimed read of 0x44324
Comment 6 Anssi Hannula 2016-10-03 10:20:08 UTC
Created attachment 126965 [details]
intel-drm-nightly 20160927 unclaimed read of 0x44404
Comment 7 Anssi Hannula 2016-10-03 10:20:27 UTC
Created attachment 126966 [details]
intel-drm-nightly 20160927 unclaimed write of 0x20a8
Comment 8 Anssi Hannula 2016-10-03 10:20:48 UTC
Created attachment 126967 [details]
intel-drm-nightly 20160927 unclaimed write of 0x44324
Comment 9 Anssi Hannula 2016-10-03 10:21:08 UTC
Created attachment 126968 [details]
intel-drm-nightly 20160927 unclaimed write of 0x44404
Comment 10 Chris Wilson 2016-10-03 14:11:50 UTC
So vblank. Rumour is that rpm vs psr is currently broken; can you please test with i915.enable_psr=0?
Comment 11 Anssi Hannula 2016-10-03 14:53:30 UTC
Created attachment 126971 [details]
intel-drm-nightly 20160927 unclaimed write of 0x44324 with i915.enable_psr=0

First run with i915.enable_psr=0 resulted in the attached unclaimed write of 0x44324, very similar to before, and a hang shortly afterwards.

I did not have serial logging enabled on that run, so I'm not 100% sure there was no extra output at the time of the hang. I'm now running it a second time with logging enabled; the unclaimed write to 0x44324 has already appeared again and I'm waiting for the hang.
Comment 12 Anssi Hannula 2016-10-03 15:45:36 UTC
Created attachment 126976 [details]
intel-drm-nightly 20160927 unclaimed read of 0x44404 with i915.enable_psr=0

No kernel messages at hang time, same as before.

Also attached is an unclaimed read of 0x44404 with "i915.enable_psr=0"; it contains the same vblank-related backtrace as before without that parameter.
Comment 13 yann 2016-10-03 17:00:22 UTC
(In reply to Anssi Hannula from comment #12)
> Created attachment 126976 [details]
> intel-drm-nightly 20160927 unclaimed read of 0x44404 with i915.enable_psr=0
> 
> No kernel messages at hang time, same as before.
> 
> Attached is also an unclaimed read of 0x44404 with "i915.enable_psr=0", it
> contains the same vblank-related backtrace as before without that parameter.

Please try applying Chris's patch set https://patchwork.freedesktop.org/series/13230/
Comment 14 Anssi Hannula 2016-10-04 08:44:22 UTC
Created attachment 126987 [details]
intel-drm-nightly 20160927 unclaimed read of 0x44324 with patchset 13230

Seems to have had no effect, still unclaimed accesses and hangs.

Attached is a read of 0x44324 with the patchset.
Comment 15 Anssi Hannula 2016-10-07 13:34:12 UTC
Some additional details:

- Setting the DDX option "VSync" to off seems to work around the issue, so I guess the issue is indeed vblank related as per comment #10. So far I've run the problematic case on 4.4.22 (not tested with nightly yet) for several hours and I have gotten no freezes and no "unclaimed register" warnings with VSync off (with no other workarounds active).

- If I use the modesetting DDX (with glamor), the problem does not seem to appear (no warnings, no hang, on 4.4.22). Judging from the tearing I see, I guess VSync is just not getting used (at least with my versions of everything), so this is in effect the same as disabling VSync.

- "intel_idle.max_cstate=2" (2 = C1E-BDW) does not produce a hang with a ~12h run, but does produce the unclaimed register warnings. "intel_idle.max_cstate=3" (3 = C3-BDW) does also produce the hang.
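To check which idle state a given index refers to on a particular machine, the state names can be read from the cpuidle sysfs directory. This is a small Python sketch; the function name `list_cpuidle_states` is hypothetical, and it assumes the sysfs state index lines up with the value passed to `intel_idle.max_cstate`, as the C1E-BDW/C3-BDW numbering above suggests.

```python
from pathlib import Path


def list_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (index, name) pairs for the idle states exposed for one CPU."""
    states = []
    for state_dir in Path(cpu_dir).glob("state[0-9]*"):
        index = int(state_dir.name[len("state"):])
        name = (state_dir / "name").read_text().strip()
        states.append((index, name))
    return sorted(states)
```

Calling `list_cpuidle_states()` with and without the `intel_idle.max_cstate` parameter shows which deeper states were dropped from the list.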
Comment 16 Chris Wilson 2016-10-07 13:49:54 UTC
Created attachment 127096 [details] [review]
Grab RPM reference around vblank

This should fix the unclaimed registers whilst handling the vblank. I worry that it is overkill.
Comment 17 Anssi Hannula 2016-10-07 14:28:04 UTC
I can confirm that with patch of comment #16 I no longer see unclaimed reads/writes of 0x44404.

Unclaimed accesses of 0x44324 and 0x20a8 still remain, and hangs still happen.
Comment 18 Chris Wilson 2016-10-07 14:36:09 UTC
The RPS registers should already be guarded by a wakeref (the interrupts are only enabled whilst the GPU is active). Suggests that we are losing the wakeref, but that too should have shown up in a WARN when we tried to decrement it when already 0. Same for 0x20a8, which has an explicit wakeref across the interrupt handling.
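The reasoning here can be illustrated with a toy reference-count model in Python. This is not i915 code; the class `ToyDevice` and its methods are hypothetical. Register accesses only succeed while at least one wakeref is held, and dropping a reference that is not held is reported separately.

```python
class ToyDevice:
    """Toy model (not i915 code): registers are only 'claimed' while a wakeref is held."""

    def __init__(self):
        self.wakerefs = 0
        self.warnings = []

    def get_wakeref(self):
        self.wakerefs += 1

    def put_wakeref(self):
        # An unbalanced put is reported instead of letting the count go negative.
        if self.wakerefs == 0:
            self.warnings.append("wakeref underflow")
            return
        self.wakerefs -= 1

    def write(self, offset, value):
        # With no wakeref held the device may be powered down, so the access is unclaimed.
        if self.wakerefs == 0:
            self.warnings.append(f"unclaimed write of 0x{offset:x}")
            return False
        return True
```

In this model, an unbalanced `put_wakeref()` leaves an underflow warning in the list. That mirrors the argument above: if the wakeref were being lost through an extra decrement, a WARN would be expected in the logs in addition to the unclaimed access.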
Comment 19 Anssi Hannula 2016-10-07 14:57:42 UTC
Created attachment 127098 [details]
intel-drm-nightly 20161007 unclaimed writes/reads of 0x44324, writes of 0x20a8, with patch from comment 16

Here's also a log from today's nightly plus the comment #16 patch, with several instances of the remaining unclaimed accesses visible (via i915.mmio_debug).
Comment 20 yann 2016-10-07 15:51:18 UTC
New patch submitted by Chris (Grab RPM wakeref around enabling vblank interrupts):
 https://patchwork.freedesktop.org/series/13446/
Comment 21 Chris Wilson 2016-10-08 08:13:37 UTC
*** Bug 97589 has been marked as a duplicate of this bug. ***
Comment 22 Anssi Hannula 2016-10-10 08:45:57 UTC
(In reply to yann from comment #20)
> New patch submitted by Chris (Grab RPM wakeref around enabling vblank
> interrupts):
>  https://patchwork.freedesktop.org/series/13446/

Yes, this is the same patch as comment #16, with test results in comment #17 and comment #19.
Comment 23 Chris Wilson 2016-12-01 09:27:30 UTC
*** Bug 91960 has been marked as a duplicate of this bug. ***
Comment 24 Jari Tahvanainen 2017-03-21 13:57:57 UTC
Anssi, Chris - any news from either of you related to this? Still valid? Something cooking somewhere in order to get this resolved? Something common with bug 98995?
Comment 25 Chris Wilson 2017-03-21 14:11:16 UTC
commit 1f58c8e7eac0d4a7a59037dc18dbed2a9b5bd342
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 2 07:41:57 2017 +0000

    drm/i915: Restore the invalid access without RPM warning

should help identify the culprit if it still remains.
Comment 26 Anssi Hannula 2017-03-22 08:20:20 UTC
Created attachment 130372 [details]
intel-drm-nightly 20170321 unclaimed accesses and hang

Still valid; I attached a new serial log with current intel-drm-nightly (unclaimed accesses and hang).
Comment 27 Elizabeth 2017-07-26 15:02:33 UTC
Adding tag into "Whiteboard" field - ReadyForDev
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included
Comment 29 Jani Saarinen 2018-03-29 07:10:19 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), could you do that as well to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.
Comment 30 Jani Saarinen 2018-04-25 06:36:54 UTC
Closing; please re-open if the issue still exists.

