Created attachment 126897 [details]
BDW hang with intel-drm-nightly 20160927

The system hangs whenever a Qt QML application has run for a while (minutes or hours). Qt QML uses OpenGL for rendering. It is a full system hang, not just GPU.

E.g. simply running the "quickwidget" example application in Qt 5.6.1 (qtdeclarative/examples/quick/quickwidgets/quickwidget, with the rotating red square) will trigger a full system freeze within a day (within minutes most of the time).

When the application starts, within seconds (i.e. long before the freeze) there is always an "Unclaimed register detected" warning in the log, one of these:
- Unclaimed register detected before writing to register 0x44324
- Unclaimed register detected before writing to register 0x220a8
- Unclaimed register detected before reading register 0x44408

This message is not there with e.g. glxgears (with which I have also not been able to reproduce the hang so far).

With serial console I see no output at the time of freeze.

The issue happens also with latest intel processor microcode (loaded using the early load mechanism).

With "intel_idle.max_cstate=1" kernel parameter the hang does not occur, or at least it occurs so much more rarely that I haven't seen it.

The attached logs are all with drm-intel-nightly with drm.debug=0xe.

bdw-hang-nightly-20160927.txt contains a quickwidget hang captured via serial port, and it also contains an unclaimed register warning for register 0x44324.

bdw-unclaimed-0x220a8.txt and bdw-unclaimed-0x44408.txt contain the other variants of the "Unclaimed register" warning that I have seen, but I did not wait for the freeze to actually happen in these instances (but I've seen the hang happen with those messages in other runs with different kernel and without drm debugging).

The setup is:
System architecture: x86_64
Kernel version: drm-intel-nightly 2016y-09m-27d-16h-32m-56s UTC (also seen in 4.4.18, 4.4.22, 4.8-rc8).
Linux distribution: Yocto 2.1-based build (mesa 11.1.1, X.org 1.18.0).
Machine model: Sintrones VBOX-3610
Display connector: DVI (appears as HDMI1)
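For anyone comparing several serial logs, a small helper along these lines may make it easier to see which registers show up and how often. It is only a sketch: the regular expression covers just the three message wordings quoted in this report, and the function name `tally_unclaimed` is made up for illustration. Other kernels may word the warning differently.

```python
import re
from collections import Counter

# Covers only the wordings quoted in this report, e.g.
# "Unclaimed register detected before writing to register 0x44324".
UNCLAIMED_RE = re.compile(
    r"Unclaimed register detected before "
    r"(?P<op>reading|writing to) register (?P<reg>0x[0-9a-fA-F]+)"
)


def tally_unclaimed(lines):
    """Count (operation, register) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UNCLAIMED_RE.search(line)
        if match:
            op = "read" if match.group("op") == "reading" else "write"
            counts[(op, match.group("reg").lower())] += 1
    return counts


# Sample input: the three warnings listed in the report.
sample = [
    "Unclaimed register detected before writing to register 0x44324",
    "Unclaimed register detected before writing to register 0x220a8",
    "Unclaimed register detected before reading register 0x44408",
]
for (op, reg), count in sorted(tally_unclaimed(sample).items()):
    print(op, reg, count)
```

For a real capture, the same function can be fed an open log file instead of the sample list.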
Created attachment 126898 [details] drm-intel-nightly reports unclaimed register 0x220a8
Created attachment 126899 [details] drm-intel-nightly reports unclaimed register 0x44408
Can you try to apply Chris's patch https://patchwork.freedesktop.org/series/13161/ and re-test?
With the patch different unclaimed register accesses are reported (at the same time as before, i.e. a couple of seconds after I start the quickwidget example), at least the following (I'll attach logs):
- read/write of 0x44324
- read/write of 0x44404
- write of 0x20a8

As expected, the hangs still happen. With the quickwidget example, specifically, they seem to consistently happen within minutes (not hours).
Created attachment 126964 [details] intel-drm-nightly 20160927 unclaimed read of 0x44324
Created attachment 126965 [details] intel-drm-nightly 20160927 unclaimed read of 0x44404
Created attachment 126966 [details] intel-drm-nightly 20160927 unclaimed write of 0x20a8
Created attachment 126967 [details] intel-drm-nightly 20160927 unclaimed write of 0x44324
Created attachment 126968 [details] intel-drm-nightly 20160927 unclaimed write of 0x44404
So vblank. Rumour is that rpm vs psr is currently broken, can you please test with i915.enable_psr=0 ?
Created attachment 126971 [details] intel-drm-nightly 20160927 unclaimed write of 0x44324 with i915.enable_psr=0 First run with i915.enable_psr=0 resulted in the attached unclaimed write of 0x44324, very similar to before, and a hang shortly afterwards. I did not have serial logging enabled on that run so I'm not 100% sure there was no extra output at the time of the hang, but I'm running it a second time now with logging again (and now I've already got the unclaimed write to 0x44324 again, and now waiting for the hang).
Created attachment 126976 [details] intel-drm-nightly 20160927 unclaimed read of 0x44404 with i915.enable_psr=0 No kernel messages at hang time, same as before. Attached is also an unclaimed read of 0x44404 with "i915.enable_psr=0", it contains the same vblank-related backtrace as before without that parameter.
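When switching between boots with different workaround parameters, it can help to confirm which ones a given boot actually carried. The sketch below splits a kernel command line into key/value pairs. The function name `parse_cmdline` is hypothetical, and the sample string is made up from parameters mentioned in this thread. On a live system the string would come from /proc/cmdline.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value}.

    Flags without "=" map to None. Quoted values containing spaces
    are not handled; this is a deliberately simple sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Made-up example line using parameters mentioned in this thread.
example = "quiet i915.enable_psr=0 intel_idle.max_cstate=1 drm.debug=0xe"
params = parse_cmdline(example)
print(params.get("i915.enable_psr"))
print(params.get("intel_idle.max_cstate"))
```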
(In reply to Anssi Hannula from comment #12)
> Created attachment 126976 [details]
> intel-drm-nightly 20160927 unclaimed read of 0x44404 with i915.enable_psr=0
>
> No kernel messages at hang time, same as before.
>
> Attached is also an unclaimed read of 0x44404 with "i915.enable_psr=0", it
> contains the same vblank-related backtrace as before without that parameter.

Please try applying Chris' patch set https://patchwork.freedesktop.org/series/13230/
Created attachment 126987 [details] intel-drm-nightly 20160927 unclaimed read of 0x44324 with patchset 13230 Seems to have had no effect, still unclaimed accesses and hangs. Attached is a read of 0x44324 with the patchset.
Some additional details:
- Setting the DDX option "VSync" to off seems to work around the issue, so I guess the issue is indeed vblank related as per comment #10. So far I've run the problematic case on 4.4.22 (not tested with nightly yet) for several hours and I have gotten no freezes and no "unclaimed register" warnings with VSync off (with no other workarounds active).
- If I use the modesetting DDX (with glamor), the problem does not seem to appear (no warnings, no hang, on 4.4.22). Judging from the tearing I see, I guess VSync is just not getting used (at least with my versions of everything), so this is in effect the same as disabling VSync.
- "intel_idle.max_cstate=2" (2 = C1E-BDW) does not produce a hang with a ~12h run, but does produce the unclaimed register warnings. "intel_idle.max_cstate=3" (3 = C3-BDW) also produces the hang.
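To check which idle-state names a given machine reports, the cpuidle entries in sysfs can be listed with something like the following. This is a rough sketch: the function name `list_cpuidle_states` is made up, the default path assumes the usual sysfs cpuidle layout for cpu0, and whether the sysfs index lines up with the `intel_idle.max_cstate` numbering on a particular kernel is an assumption that should be verified rather than taken from this snippet.

```python
from pathlib import Path


def list_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return a sorted list of (index, name) for one CPU's idle states.

    Returns an empty list if the directory does not exist
    (for example when cpuidle is not available).
    """
    base = Path(cpu_dir)
    if not base.is_dir():
        return []
    states = []
    for state in base.glob("state[0-9]*"):
        suffix = state.name[len("state"):]
        name_file = state / "name"
        if suffix.isdigit() and name_file.is_file():
            states.append((int(suffix), name_file.read_text().strip()))
    return sorted(states)


for index, name in list_cpuidle_states():
    print(index, name)
```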
Created attachment 127096 [details] [review] Grab RPM reference around vblank This should fix the unclaimed registers whilst handling the vblank. I worry that it is overkill.
I can confirm that with patch of comment #16 I no longer see unclaimed reads/writes of 0x44404. Unclaimed accesses of 0x44324 and 0x20a8 still remain, and hangs still happen.
The RPS registers should already be guarded by a wakeref (the interrupts are only enabled whilst the GPU is active). That suggests we are losing the wakeref, but that too should have shown up in a WARN when we tried to decrement it when already 0. Same for 0x20a8, which has an explicit wakeref across the interrupt handling.
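The reasoning here can be illustrated with a toy reference-count model. This is not i915 code; the class `ToyDevice` and its methods are invented purely to show the two warning paths being discussed: an access with no reference held, and a release when the count is already zero.

```python
class ToyDevice:
    """Toy model: registers only respond while at least one wakeref is held."""

    def __init__(self):
        self.wakerefs = 0
        self.warnings = []

    def get(self):
        """Take a wakeref (device must stay awake)."""
        self.wakerefs += 1

    def put(self):
        """Release a wakeref; record a warning on an unbalanced release."""
        if self.wakerefs == 0:
            self.warnings.append("wakeref underflow")
            return
        self.wakerefs -= 1

    def access(self, reg):
        """Return True if the access is covered by a wakeref."""
        if self.wakerefs == 0:
            self.warnings.append(f"unclaimed access to {reg:#x}")
            return False
        return True


dev = ToyDevice()
dev.get()
dev.access(0x44324)   # covered by the wakeref
dev.put()
dev.access(0x44324)   # no wakeref held: recorded as unclaimed
dev.put()             # unbalanced release: recorded as underflow
print(dev.warnings)
```

In this model an unclaimed access to a guarded register can only happen after the count has dropped to zero, so a count dropped by an extra release would also have produced the underflow warning. That is the mismatch the comment points at: the logs show the first kind of warning without the second.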
Created attachment 127098 [details] intel-drm-nightly 20161007 unclaimed writes/reads of 0x44324, writes of 0x20a8, with patch from comment 16 Here's also a log from today's nightly plus the comment #16 patch, with several instances of the remaining unclaimed accesses visible (via i915.mmio_debug).
New patch submitted by Chris (Grab RPM wakeref around enabling vblank interrupts): https://patchwork.freedesktop.org/series/13446/
*** Bug 97589 has been marked as a duplicate of this bug. ***
(In reply to yann from comment #20)
> New patch submitted by Chris (Grab RPM wakeref around enabling vblank
> interrupts):
> https://patchwork.freedesktop.org/series/13446/

Yes, this is the same patch as comment #16, with test results in comment #17 and comment #19.
*** Bug 91960 has been marked as a duplicate of this bug. ***
Anssi, Chris - any news from either of you related to this? Still valid? Something cooking somewhere in order to get this resolved? Something common with bug 98995?
commit 1f58c8e7eac0d4a7a59037dc18dbed2a9b5bd342
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Mar 2 07:41:57 2017 +0000

    drm/i915: Restore the invalid access without RPM warning

should help identify the culprit if it still remains.
Created attachment 130372 [details] intel-drm-nightly 20170321 unclaimed accesses and hang Still valid, attached a new serial log with current intel-drm-nightly (unclaimed accesses and hang).
Adding tag into "Whiteboard" field - ReadyForDev
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included
First of all, sorry about the spam. This is a mass update for our bugs. Sorry if you find this annoying, but we are trying to understand whether this bug is still valid. If the investigation is still in progress, please ignore this, and I apologize! If you think the bug is no longer valid, please comment on the bug that it can be closed. If you haven't tested with our latest pre-upstream tree (drm-tip), could you do that as well to see whether the issue is still present there? If you cannot see the issue there, please comment on the bug.
Closing, please re-open if the issue still exists.