Bug 104975 - Delay in skl_disable_plane() causes a system freeze
Summary: Delay in skl_disable_plane() causes a system freeze
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: x86 (IA32) other
: high minor
Assignee: Ville Syrjala
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-02-06 20:00 UTC by Azhar
Modified: 2019-06-20 12:27 UTC (History)
9 users (show)

See Also:
i915 platform: KBL
i915 features: display/atomic, display/watermark


Attachments
Drm logs with 0x1e log level (5.43 MB, text/plain)
2018-02-06 20:00 UTC, Azhar
no flags Details
Screenshot of crashed display (5.02 MB, image/jpeg)
2018-02-06 20:03 UTC, Azhar
no flags Details
Screenshot with msleep for 50msec (5.84 MB, image/jpeg)
2018-02-07 01:23 UTC, Azhar
no flags Details
CONFIG_DRM_I915_DEBUG_VBLANK_EVADE enabled logs (20.68 KB, text/plain)
2018-02-15 01:35 UTC, Azhar
no flags Details
DOUBLE_BUFFER_CTL patch logs with delay (85.60 KB, text/plain)
2018-02-15 01:36 UTC, Azhar
no flags Details
Logs with patches rebased on 4.15.x (6.36 MB, text/plain)
2018-02-17 00:54 UTC, Azhar
no flags Details
Logs with patches rebased on 4.15.x (86.03 KB, text/plain)
2018-02-17 00:56 UTC, Azhar
no flags Details
Chromeos-4.14-kernel-logs-with-Marteen's patches (2.41 MB, text/plain)
2018-03-06 01:13 UTC, Azhar
no flags Details
Marteen's 4-15.x branch with prints in skl_disable_plane() (30.13 KB, text/plain)
2018-03-06 01:13 UTC, Azhar
no flags Details
Marteen's 4-15.x branch evasion warnings seen (45.45 MB, text/plain)
2018-03-06 22:40 UTC, Azhar
no flags Details
Always increase vblank evasion time to 250 us. (639 bytes, patch)
2018-03-07 12:23 UTC, Maarten Lankhorst
no flags Details | Splinter Review
vertical corruption unless wait for vblank after disabling cursor (2.83 MB, image/jpeg)
2018-04-14 00:44 UTC, marc.herbert
no flags Details
Add vblank wait to plane disable? (641 bytes, patch)
2018-04-16 08:33 UTC, Maarten Lankhorst
no flags Details | Splinter Review
dmesg logs with disable LP1+ wm patch (183.88 KB, text/plain)
2018-05-11 23:05 UTC, Azhar
no flags Details
dmesg logs with disable LP1+ wm patch debug log level 0x1E (41.92 MB, application/x-bzip)
2018-05-14 17:45 UTC, Azhar
no flags Details
dmesg logs of branch shared on Jul 12, 2018 (21.74 MB, application/x-zip-compressed)
2018-07-16 23:39 UTC, Azhar
no flags Details
kernel logs leading up to the freeze, 0x1e (925.77 KB, text/x-log)
2019-03-05 02:29 UTC, Juston Li
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Azhar 2018-02-06 20:00:37 UTC
Created attachment 137193 [details]
Drm logs with 0x1e log level

When a 50ms delay happens on KBL-Y and KBL-U between these two register writes, the system hard hangs with a corrupted display (see attachments). 100ms reproduces even more frequently.

diff --git a/drivers/gpu/drm/i915/intel_sprite.c b/drivers/gpu/drm/i915/intel_sprite.c
index 4a8a5d918a83..b94eddb04d24 100644
--- a/drivers/gpu/drm/i915/intel_sprite.c
+++ b/drivers/gpu/drm/i915/intel_sprite.c
@@ -311,30 +311,31 @@ void intel_pipe_update_end(struct intel_crtc_state *new_crtc_state)
        spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
 }

 void
 skl_disable_plane(struct intel_plane *plane, struct intel_crtc *crtc)
 {
        struct drm_i915_private *dev_priv = to_i915(plane->base.dev);
        enum plane_id plane_id = plane->id;
        enum pipe pipe = plane->pipe;
        unsigned long irqflags;

        spin_lock_irqsave(&dev_priv->uncore.lock, irqflags);

        I915_WRITE_FW(PLANE_CTL(pipe, plane_id), 0);

+       mdelay(50);
        I915_WRITE_FW(PLANE_SURF(pipe, plane_id), 0);
        POSTING_READ_FW(PLANE_SURF(pipe, plane_id));

        spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
 }


This has been reproduced on a variety of KBL-Y and KBL-U SKUs, and with both coreboot firmware and UEFI version KBLSE2R1. R00. X094. P02. 1706220744 on a KBL-Y RVP.

This was reproduced with chromeos-4.14 https://chromium.googlesource.com/chromiumos/third_party/kernel/+/9325afcd11534271/drivers/gpu/drm/i915/ and linux stable 4.15.1 tag

This is reproduced with maxcpus=1. 
Hardware video decoder is required to reproduce the issue.
Sysrq does not work after system freeze. ITP halt() fails to stop the CPUS.

The first reports came from a ChromeOS autotest which does the following:
1. Login as a test account
2. Play a vp9 or vp8 video for 10secs
3. restart ui (roughly equivalent to “service mydisplaymanager restart”)

Of course this test was initially not run with a mdelay() added in the kernel - which is why reproduction was extremely rare. It took weeks to narrow down the issue to that specific delay. 

To reproduce the issue manually (without running autotest) execute the below steps 
1) Play any video in 1080p from crosvideo.appspot.com
2) Monitor the i915_display_info to make sure the overlay plane is active.
3) Once overlay plane is active move the cursor on top of the overlay plane. This disables the overlay plane.
4) When the overlay plane is disabled the system freezes.
Comment 1 Azhar 2018-02-06 20:03:13 UTC
Created attachment 137194 [details]
Screenshot of crashed display
Comment 2 Azhar 2018-02-07 01:23:52 UTC
Created attachment 137202 [details]
Screenshot with msleep for 50msec
Comment 3 Maarten Lankhorst 2018-02-07 12:49:29 UTC
Hey,

I fear this is working as intended. With this information I can't narrow down what is your real problem. :(

The relevant error here is:

[  403.443634] [drm:drm_atomic_nonblocking_commit] committing 000000007afed85b nonblocking
[  403.544735] [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe A FIFO underrun
[  403.544750] [drm:intel_pipe_update_end] *ERROR* Atomic update failure on pipe A (start=24896 end=24902) time 100023 us, min 1073, max 1079, scanline start 450, end 457
[  403.544762] [drm:intel_fbc_underrun_work_fn] Disabling FBC due to FIFO underrun.

as can be seen in the dmesg, you get a lot of those ERRORs like below:
[  403.544750] [drm:intel_pipe_update_end] *ERROR* Atomic update failure on pipe A (start=24896 end=24902) time 100023 us, min 1073, max 1079, scanline start 450, end 457

All of the code between pipe_update_start and pipe_update_end needs to run without encountering a single vblank interrupt. As you can see in the dmesg, hitting a vblank interrupt
results in this ERROR, and might cause FIFO underruns and even system hangs because all the registers need to be updated in a single vblank interval.

Because of this pipe_update_start disables all local irqs and schedules the update to be run at least 100 us away from the vblank interrupt, and pipe_update_end complains loudly if we took longer than 100 us.

There's some code below in a ifdef CONFIG_DRM_I915_DEBUG_VBLANK_EVADE, you can use it to report when a delay happened, without necessarily hitting any hang, FIFO underrun or display corruption.

This might help you narrow down why vblank evasion sometimes takes longer than 100 us, or increase VBLANK_EVASION_TIME_US if necessary..

Good luck!
Comment 4 Ville Syrjala 2018-02-07 14:21:16 UTC
The fact that the system hangs is a little disappointing, but not entirely unexpected on SKL+. Maybe we could make it fail less spectacularly if we did the wm/ddb update more alongside the rest of the plane registers, maybe not.

As a more proper fix we could optimize the plane/pipe update some more. One of my unfinished attempts at just that is here:
git://github.com/vsyrjala/linux.git plane_update_optimization_split_lock_alt

Another thing we could perhaps do is utilize the "lock out vblank register latching" bit for the pipe as a backup for the atomic update fails. In theory we could even use that instead of the vblank evade, if we decided that we never want to override an already queued flip. But people seem to be warming up to the idea of overriding queueud flips recently, so I think should probably keep using the vblank evade anyway. Oh, actually I don't recall if that even locks out every pipe/plane register, but if not then we certainly couldn't rely on it alone.
Comment 5 Azhar 2018-02-08 06:30:20 UTC
Just to be more clear on the issue here:

On running autotest, mentioned above in a loop without any delay and without any spinlock, same system freeze/display corruption is observed. The issue is seen after running 300-400 iterations. I have started a test without any delay on linux-stable 4.15.1. Will share those logs once issue is reproduced.

The delay added also causes the same crash pattern and reproduces the issue more frequently/always within few seconds.
Comment 6 Azhar 2018-02-08 06:31:54 UTC
I tried increasing the VBLANK_EVASION_TIME_US, but that caused a Potential Atomic update failure error always and system crashed.

I also tried Ville's branch, but that too didn't help and the system still crashed.

I tried both of the above with a delay. I will give these a try without the delays too.
Comment 7 Jani Saarinen 2018-02-08 07:18:28 UTC
Hi.
What are FW's used in the system. Especially DMC. Can you post dmesg from the boot?
Comment 8 marc.herbert 2018-02-08 07:25:20 UTC
Initial reports were with DMC 1.01. An upgrade to DMC 1.04 didn't help.

We're currently merging an upgrade to 1.04 anyway to solve another issue.

No GuC or HuC is used.
Comment 9 Maarten Lankhorst 2018-02-08 09:01:07 UTC
You can't increase the vblank evasion time to a value above ~16 ms because of physics. :)

Assuming the problem is inside i915, there are 2 solutions, one easy and one hard..
1. Use the DOUBLE_BUFFER_CTL register to lock out updates, like above. Least amount of lines of code changed.
2. Grab dev_priv->uncore.lock in pipe_update_start and release it in pipe_update_end. All register updates need this lock so grabbing it would reduce contention.

I've posted the first solution to intel-gfx as a RFC, and to see if it passes our CI:
https://patchwork.freedesktop.org/series/37879/
Comment 10 Maarten Lankhorst 2018-02-09 11:10:06 UTC
Less hacky version, that should no longer have random delays between pipe_update_start and end: https://patchwork.freedesktop.org/series/37986/
Comment 11 marc.herbert 2018-02-12 23:36:50 UTC
What kind of test can reproduce this issue in a reasonable amount of time and verify any of the propositions currently floated? I mean without a sleep hack in the driver.
Comment 12 Maarten Lankhorst 2018-02-13 13:48:00 UTC
CONFIG_DRM_I915_DEBUG_VBLANK_EVADE is meant for this. It complains if we ever take longer than 100 us in vblank evasion, and how long this is. If it takes longer than 100 us we possibly hit the underrun, which possibly could result in a crash..

The sleep hack is only useful to check whether the DOUBLE_BUFFER_CTL patch works as intended.
Comment 13 Azhar 2018-02-15 01:34:46 UTC
Please find attached log files with :
1. Only DOUBLE_BUFFER_CTL patch with a delay of 10msecs. This causes a crash as mentioned in this bug.

2. With CONFIG_DRM_I915_DEBUG_VBLANK_EVADE enabled. The system did not crash in this case, but it did print out below prints many times

[ 3263.045247] [drm] Atomic update on pipe (A) took 123 us, max time under evasion is 100 us
[ 3929.221242] [drm] Atomic update on pipe (A) took 120 us, max time under evasion is 100 us
[ 3969.201734] [drm] Atomic update on pipe (A) took 145 us, max time under evasion is 100 us


Both these were tried on linux-stable 4.15.1

Also in the log messages I have deleted all chromeos specific messages related to camera and tpm.
Comment 14 Azhar 2018-02-15 01:35:49 UTC
Created attachment 137368 [details]
CONFIG_DRM_I915_DEBUG_VBLANK_EVADE enabled logs
Comment 15 Azhar 2018-02-15 01:36:30 UTC
Created attachment 137369 [details]
DOUBLE_BUFFER_CTL patch logs with delay
Comment 16 marc.herbert 2018-02-15 01:50:33 UTC
> > What kind of test...
> CONFIG_DRM_I915_DEBUG_VBLANK_EVADE is meant for this...

It's not very convenient to recompile the kernel for testing purposes. Unless you recommend this to be turned ON by default, even in production builds?

Even with this option turned on, manually restarting video playback is still for now the only known, slow and elusive way to reproduce this hard crash. There seems to be a tentative root-cause so could some (new?) lower-level and more focused (IGT?) test(s) stress overlays harder, better, faster and stronger?
Comment 17 Maarten Lankhorst 2018-02-15 14:09:42 UTC
I've rebased on top of v4.15.x, can you test the following branch works?

https://cgit.freedesktop.org/~mlankhorst/linux/log/?h=v4.15.x

Don't add a delay, but I do need the CONFIG_DRM_I915_DEBUG_VBLANK_EVADE logs. There should be no mention of things like:
[ 3263.045247] [drm] Atomic update on pipe (A) took 123 us, max time under evasion is 100 us

any more..
Comment 18 Maarten Lankhorst 2018-02-15 15:18:12 UTC
BTW it's harmless to keep CONFIG_DRM_I915_DEBUG_VBLANK_EVADE enabled in production kernels, but none of the other CONFIG_DRM_I915_DEBUG options should be enabled in the production kernels.
Comment 19 Azhar 2018-02-16 01:41:24 UTC
With the patches rebased on 4.15, I do not see the below message:

[drm] Atomic update on pipe (A) took 123 us, max time under evasion is 100 us 

There was a WARN_ON trigerred from below line once.

https://cgit.freedesktop.org/~mlankhorst/linux/tree/drivers/gpu/drm/drm_atomic.c?h=v4.15.x#n862

[18796.092771] WARNING: CPU: 1 PID: 22704 at /mnt/host/source/src/third_party/kernel/v4.4/drivers/gpu/drm/drm_atomic.c:862 drm_atomic_check_only+0x94/0x4dd
[18796.108037] Modules linked in: cmac rfcomm uinput lzo lzo_compress zram bridge btusb stp llc btrtl btbcm btintel bluetooth ecdh_generic ov13858 ov5670 v4l2_fwnode snd_soc_max98927 at24 dw9714 acpi_als i
pt_MASQUERADE nf_nat_masquerade_ipv4 xt_mark fuse snd_seq_dummy snd_seq snd_seq_device cfg80211 iio_trig_sysfs ip6table_filter cros_ec_light_prox cros_ec_sensors cros_ec_sensors_core industrialio_triggered
_buffer kfifo_buf industrialio r8152 mii joydev
[18796.152665] CPU: 1 PID: 22704 Comm: DrmThread Not tainted 4.15.3-00093-gc5aaf9d7faec #84
[18796.170382] RIP: 0010:drm_atomic_check_only+0x94/0x4dd
[18796.176128] RSP: 0018:ffffafb2007bfc58 EFLAGS: 00010246
[18796.181973] RAX: 0000000000000000 RBX: ffff98a96049c900 RCX: ffff98a81d43c400
[18796.189944] RDX: ffffffffb7c1191e RSI: 0000000000000010 RDI: ffff98a938a7a200
[18796.197921] RBP: ffffafb2007bfcb0 R08: 0000000000000020 R09: ffff98a81d432328
[18796.205899] R10: ffff98aa34ab4800 R11: ffff98a82da17800 R12: ffff98aa34b38000
[18796.213876] R13: ffff98aa34b0a000 R14: ffff98a81d43c400 R15: 0000000000000000
[18796.221847] FS:  00007b69e67b9700(0000) GS:ffff98aa3ec80000(0000) knlGS:0000000000000000
[18796.230894] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[18796.237316] CR2: 00003ada01075008 CR3: 00000001f766c002 CR4: 00000000003606e0
[18796.245294] Call Trace:
[18796.248037]  drm_atomic_nonblocking_commit+0x18/0x5a
[18796.253594]  drm_mode_atomic_ioctl+0x7e1/0x8de
[18796.258567]  ? drm_atomic_set_property+0x3f1/0x3f1
[18796.263928]  drm_ioctl+0x2ac/0x3dc
[18796.267732]  ? security_file_ioctl+0x35/0x49
[18796.272505]  SyS_ioctl+0x492/0x4d7
[18796.276305]  do_syscall_64+0x60/0x6e
[18796.280298]  entry_SYSCALL_64_after_hwframe+0x21/0x86
[18796.285945] RIP: 0033:0x7b69ec1e8967
[18796.289935] RSP: 002b:00007b69e67b8488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[18796.298391] RAX: ffffffffffffffda RBX: 000000000000001c RCX: 00007b69ec1e8967
[18796.306362] RDX: 00007b69e67b84f0 RSI: 00000000c03864bc RDI: 000000000000001d
[18796.314331] RBP: 00007b69e67b84b0 R08: 00003ada00de3b70 R09: 00003ada00de3b90
[18796.322301] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000001d
[18796.330263] R13: 00003ada00de3b80 R14: 00007b69e67b84f0 R15: 00000000c03864bc
[18796.338233] Code: 03 7e 18 4c 8b 2f 4d 85 ed 0f 84 be 00 00 00 48 8b 5f 18 4c 8b 53 08 4d 85 d2 0f 84 fd 01 00 00 48 83 7b 10 00 0f 85 f2 01 00 00 <0f> ff 48 c7 c2 18 1a c1 b7 e9 78 03 00 00 8b 4b 28 85
 c9 0f 88 
[18796.359348] ---[ end trace 9c41097535acdec5 ]---
[18796.366777] traps: DrmThread[22704] trap int3 ip:5575cbb07df2 sp:7b69e67b8990 error:0 in chrome[5575caa49000+86d4000]

I have kept the test running overnight, in case we get any new prints/warnings.
Comment 20 Azhar 2018-02-17 00:53:22 UTC
Attaching the logs with patches rebased on 4.15.x, no Atomic update evasion messages seen. The warning mentioned above seen couple of times.
Comment 21 Azhar 2018-02-17 00:54:00 UTC
Created attachment 137406 [details]
Logs with patches rebased on 4.15.x
Comment 22 Azhar 2018-02-17 00:56:32 UTC
Created attachment 137407 [details]
Logs with patches rebased on 4.15.x
Comment 23 Maarten Lankhorst 2018-02-22 09:09:18 UTC
The WARN_ON is removed by:

commit fa5aaeecf524ecbcae9755ee3d34b7b8ba412583
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Tue Jan 30 11:27:04 2018 +0100

    drm/atomic: Remove WARN_ON for invalid plane configuration.
    
    Userspace can set a FB_ID on a plane without setting CRTC_ID, which
    will fail with -EINVAL, but the kernel shouldn't warn about that.
    
    Same for !FB_ID and CRTC_ID being set.

This can safely be cherry-picked if needed, but userspace is requesting an invalid configuration. That bug should be fixed as well.
Comment 24 marc.herbert 2018-02-24 07:02:56 UTC
I believe "drm/i915: Disable SAGV on pre plane update." is related, correct?
https://lists.freedesktop.org/archives/intel-gfx/2018-February/thread.html#156735
Comment 25 marc.herbert 2018-02-24 15:06:58 UTC
(yes it is, just saw this bug number in the commit message)

Now that we have a tentative root-cause, how would a test that reproduces in a reasonable amount of time look like? Even a manual, non fully deterministic one. Any ideas, even vague?
Comment 26 Rodrigo Vivi 2018-02-26 21:35:04 UTC
According to Art:
"
Buffer allocation overlap among enabled planes will cause a full frame underrun, and that becomes a hard hange if pkgC or SAGV are enabled.  You need to make sure the plane is disabled before reallocating the buffer it uses.  For a single pipe it is sufficient to initiate the disabling of the plane before the reallocation.  For multiple pipes it can be more complex.
 
In this case you should be doing something like this to ensure plane 2A turns off before plane 1A steals the buffer
1.       PLANE_CTL_2A -> disabled
2.       PLANE_SURF_2A:  touch to arm double buffer update
3.       PLANE_BUF_CFG_1A -> (0-860)
4.       PLANE_SURF_1A: touch to arm double buffer update
If the planes are on different pipes there needs to be a wait for vblank between step 2 and 3 to ensure the plane 2A disable completed.

General requirement is to never have an enabled plane's buffer allocation overlap with another enabled plane.

If you disable plane X, don't give its buffer to plane Y until you are sure plane X is disabled.
Be careful about the double-buffer arming.  For some stange legacy reason, plane enable happens without needing arming from PLANE_SURF, so on disable->enable you write PLANE_CTL last with enable bit set.
Once plane is enabled, arming always comes from PLANE_SURF.
"

So I confirmed that our code actually doesn't seem to respect this.
We program the BUF_CFG during atomic commit begin CRTC but we actually
disable the plane on middle of atomic commit.

I did some experiments to move the BUF_CFG right to after disabling planes
and only when disabling planes and result here was horrible. Blank and corrupted
screens.

So, Ville and Marteen... ideas?
Comment 27 Maarten Lankhorst 2018-02-28 10:26:31 UTC
The patch series I suggested make those issues more unlikely.

Anything between pipe_update_start and pipe_update_end needs to run in a single vblank. By taking the spinlock in pipe_update_start we get rid of any race condition, and we can actually finish inside a single vblank deterministically.
Comment 28 Kristian Høgsberg 2018-02-28 18:23:54 UTC
Which userspace produces the warning in comment #19?
Comment 29 Kristian Høgsberg 2018-02-28 18:31:03 UTC
(In reply to Kristian Høgsberg from comment #28)
> Which userspace produces the warning in comment #19?

That is, if it's Chrome, we should fix that there, but I don't think it should warn either. If it's an EINVAL situation, i915 should just return that silently, and only make noise with drm debug enabled.
Comment 30 Azhar 2018-02-28 18:49:46 UTC
(In reply to Kristian Høgsberg from comment #29)
> (In reply to Kristian Høgsberg from comment #28)
> > Which userspace produces the warning in comment #19?
> 
> That is, if it's Chrome, we should fix that there, but I don't think it
> should warn either. 

Yes it is Chrome.

> If it's an EINVAL situation, i915 should just return
> that silently, and only make noise with drm debug enabled.


Patch suggested comment #23 removes the WARN_ON.
Comment 31 Ville Syrjala 2018-03-01 20:21:36 UTC
What is the status of this? Did Maarten's plane update/locking optimizations solve the original hard hang or not?
Comment 32 Azhar 2018-03-02 02:11:17 UTC
We have tried the patches on chromeos-4.14 kernel and were able to reproduce the system hang.

https://chromium.googlesource.com/chromiumos/third_party/kernel/+log/d5c36430d8ded4e1d/drivers/gpu/drm/i915


We are testing on Marteen's 4.15.x branch shared in comment #17.
Comment 33 Maarten Lankhorst 2018-03-02 11:35:04 UTC
Do you have the VBLANK evade warnings enabled? Do you get any warning or error from dmesg from that?
Comment 34 Azhar 2018-03-02 14:51:02 UTC
I didn't have the EVADE warnings enabled for chromeos, will enable them and re-run the test.
Comment 35 Azhar 2018-03-03 01:37:00 UTC
Had a lot of setup issues hence could not get the test results.

test on upstream kernel is also not completed due to the same reason.
Comment 36 Azhar 2018-03-06 01:11:36 UTC
After enabling the CONFIG_DRM_I915_DEBUG_VBLANK_EVADE on chromeos-4.14 kernel, was able to see the warning messages and the issue/system hang. Please check the attached logs.

On Marteen's 4.15 linux stable branch, with just couple of prints in skl_disable_plane() was able to reproduce the issue/system hang.


[   57.951813] [drm:skl_disable_plane] *ERROR* Disabling plane @ 321 : plane 2A
[   57.959722] [drm:drm_atomic_helper_commit_planes_on_crtc] *ERROR* Disabling plane @ 326 : plane 2A POST READ
[   57.970733] [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe A FIFO underrun
[   65.269469] [drm:skl_disable_plane] *ERROR* Disabling plane @ 321 : plane 2A
[   65.277357] [drm:drm_atomic_helper_commit_planes_on_crtc] *ERROR* Disabling plane @ 326 : plane 2A POST READ
[   69.290422] [drm:skl_disable_plane] *ERROR* Disabling plane @ 321 : plane 2A
[   69.298349] [drm:drm_atomic_helper_commit_planes_on_crtc] *ERROR* Disabling plane @ 326 : plane 2A POST READ
[   70.558620] [drm:skl_disable_plane] *ERROR* Disabling plane @ 321 : plane 2A
[   70.566506] [drm:drm_atomic_helper_commit_planes_on_crtc] *ERROR* Disabling plane @ 326 : plane 2A POST READ

Without the prints on Marteen's linux stable branch I do not see the issue/system hang.
Comment 37 Azhar 2018-03-06 01:13:01 UTC
Created attachment 137807 [details]
Chromeos-4.14-kernel-logs-with-Marteen's patches
Comment 38 Azhar 2018-03-06 01:13:40 UTC
Created attachment 137808 [details]
Marteen's 4-15.x branch with prints in skl_disable_plane()
Comment 39 Maarten Lankhorst 2018-03-06 09:00:21 UTC
The calls to kprintf will add a random delay, and will destroy the timing of the vblank evasion, so it's not surprising it causes a hang. We had to remove debug calls before because of that reason.

If the bug cannot be reproduced with the patch series any more, and you don't get warnings with CONFIG_DRM_I915_DEBUG_VBLANK_EVADE enabled, then I think it's fixed. :)
Comment 40 Azhar 2018-03-06 22:36:56 UTC
(In reply to Maarten Lankhorst from comment #39)
> The calls to kprintf will add a random delay, and will destroy the timing of
> the vblank evasion, so it's not surprising it causes a hang. We had to
> remove debug calls before because of that reason.
> 
> If the bug cannot be reproduced with the patch series any more, and you
> don't get warnings with CONFIG_DRM_I915_DEBUG_VBLANK_EVADE enabled, then I
> think it's fixed. :)

Sorry forgot to add yesterday, I still see couple of warnings with the above config option enabled. Attaching those logs.

Could you please give your views on Art's comments in comment #26. Do we follow the sequence mentioned by Art?
Comment 41 Azhar 2018-03-06 22:40:51 UTC
Created attachment 137843 [details]
Marteen's 4-15.x branch evasion warnings seen
Comment 42 Maarten Lankhorst 2018-03-07 12:23:46 UTC
Created attachment 137856 [details] [review]
Always increase vblank evasion time to 250 us.

With my patch series + this commit the issues should be gone. You should be able to keep Cstate and SAGV enabled without hitting issues.

If you still get a vblank evasion warning, let me know. :)
Comment 43 Tarun Vyas 2018-03-16 00:01:35 UTC
We have another test @ https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/client/site_tests/video_WebRtcResolutionSwitching/resolution-switching.js to reproduce this issue, which is pretty quick, but reproductions with drm-tip and other *recent* upstream kernels have been *inconsistent*. Once we have consistent results, we will try Maarten's patch series.
Comment 45 Jani Saarinen 2018-03-29 07:10:53 UTC
First of all. Sorry about spam.
This is mass update for our bugs. 

Sorry if you feel this annoying but with this trying to understand if bug still valid or not.
If bug investigation still in progress, please ignore this and I apologize!

If you think this is not anymore valid, please comment to the bug that can be closed.
If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.
Comment 46 Ben Widawsky 2018-04-09 20:53:11 UTC
Maarten, why don't we just do the sequence correctly? It seems dangerous to me to do anything else when the consequence is a hard hang (I admittedly don't know much about the atomic sequences).
Comment 47 Maarten Lankhorst 2018-04-10 07:54:46 UTC
Because there's no way to do so. Everything between intel_pipe_update_begin and intel_pipe_update_end must complete within 100/250 us, or we end up with flickering on the screen. The only thing we can alter is to take all locks in advance, to make the update itself limited to writing out a list of registers.
But even then we need to complete within 250 us, or we risk running into an underrun.

But it seems the underlying hangs were fixed by enabling IPC support on BXT+.
Comment 48 marc.herbert 2018-04-10 14:45:20 UTC
One or two frames missing per week is acceptable. One or two crashes per week is not.
Comment 49 Maarten Lankhorst 2018-04-12 13:11:57 UTC
Agreed, that's why I put together a series that eliminates spinlock contention and increases evasion time to 250 us. We run the evasion code with IRQs disabled, so at the time critical point the code is fully deterministic, even in the worst case.
Comment 50 marc.herbert 2018-04-12 18:29:21 UTC
Tarun's experience seem to differ: https://lists.freedesktop.org/archives/intel-gfx/2018-April/162279.html

> We run the evasion code with IRQs disabled, so at the time critical point the code is fully deterministic, even in the worst case.

Even if this can be proven with absolute certainty in some particular kernel version(s), it sounds like a guarantee difficult to maintain over time in hardware+firmware+software systems not designed for and generally not interested in real-time (and even more difficult to maintain in product kernels with many various backports - but I digress)
Comment 51 marc.herbert 2018-04-14 00:44:10 UTC
Created attachment 138830 [details]
vertical corruption unless wait for vblank after disabling cursor

Could this be related? See screenshot attached and workaround by:

Kristian H. Kristensen

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/877396

CHROMIUM: drm/i915: Wait for vblank after disabling cursor

The cursor disable write is latched on vsync but if we do a modeset
before that happens the write seems to be discarded and the cursor
stays on. The CURCNTL register will still read back 0 as if the write
took, but underlying hardware didn't change. We don't disable the
cursor often, so we can just force the pre-GEN9 path for cursor
that waits for vblank.

	 * Unfortunately, the GEN9+ double buffering seems to be buggy when it
	 * races with a modeset....

I'm guessing that this issue may well be present in upstream kernels and regular Linux distros, the difference being that no other distro so far has shipped with Y-tiled framebuffers.


(I doubt many distros have shipped with hardware overlays either -- Marc)
Comment 52 Maarten Lankhorst 2018-04-16 08:33:57 UTC
Created attachment 138861 [details] [review]
Add vblank wait to plane disable?

We don't currently force a vblank wait in crtc disable, would this fix that bug?
Comment 53 Jani Saarinen 2018-04-18 10:50:53 UTC
Any updates here?
Comment 54 Ville Syrjala 2018-04-18 18:04:55 UTC
git://github.com/vsyrjala/linux.git plane_disable_before_ddb_wm_changes

This disables all planes on the pipe which may be affected by watermark/ddb changes (or any other change for that matter) prior to changing the ddb allocations. Any plane that is supposed to remain visible will get re-enabled later in the sequence.
Comment 55 Maarten Lankhorst 2018-04-19 08:09:09 UTC
Patch makes sense, didn't think of that solution, but could we get rid of the POSTING_READs in update_plaen and disable_plane first?
Comment 56 Azhar 2018-04-20 00:25:14 UTC
(In reply to Ville Syrjala from comment #54)
> git://github.com/vsyrjala/linux.git plane_disable_before_ddb_wm_changes
> 
> This disables all planes on the pipe which may be affected by watermark/ddb
> changes (or any other change for that matter) prior to changing the ddb
> allocations. Any plane that is supposed to remain visible will get
> re-enabled later in the sequence.

I tried the patch on Linux 4.15.18 stable branch and was able to reproduce the issue.
Comment 57 Ville Syrjala 2018-04-20 16:09:52 UTC
(In reply to marc.herbert from comment #51)
> Created attachment 138830 [details]
> vertical corruption unless wait for vblank after disabling cursor
> 
> Could this be related? See screenshot attached and workaround by:
> 
> Kristian H. Kristensen
> 
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/
> 877396
> 
> CHROMIUM: drm/i915: Wait for vblank after disabling cursor
> 
> The cursor disable write is latched on vsync but if we do a modeset
> before that happens the write seems to be discarded and the cursor
> stays on. The CURCNTL register will still read back 0 as if the write
> took, but underlying hardware didn't change. We don't disable the
> cursor often, so we can just force the pre-GEN9 path for cursor
> that waits for vblank.
> 
> 	 * Unfortunately, the GEN9+ double buffering seems to be buggy when it
> 	 * races with a modeset....
> 
> I'm guessing that this issue may well be present in upstream kernels and
> regular Linux distros, the difference being that no other distro so far has
> shipped with Y-tiled framebuffers.
> 
> 
> (I doubt many distros have shipped with hardware overlays either -- Marc)

I'm definitely seeing something odd going on with the plane double buffering on my SKL here. The PLANE_CTL=0 write doesn't seem to stick sometimes and instead I'm seeing the plane scanning out stale data instead of being off. I suspect this could happen when the PLANE_CTL write lands before the vblank, and the PLANE_SURF write lands after the vblank, but I'll need to do some more low level hw poking to confirm that theory.
Comment 58 Ville Syrjala 2018-04-20 19:30:59 UTC
(In reply to Ville Syrjala from comment #57)
> (In reply to marc.herbert from comment #51)
> > Created attachment 138830 [details]
> > vertical corruption unless wait for vblank after disabling cursor
> > 
> > Could this be related? See screenshot attached and workaround by:
> > 
> > Kristian H. Kristensen
> > 
> > https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/
> > 877396
> > 
> > CHROMIUM: drm/i915: Wait for vblank after disabling cursor
> > 
> > The cursor disable write is latched on vsync but if we do a modeset
> > before that happens the write seems to be discarded and the cursor
> > stays on. The CURCNTL register will still read back 0 as if the write
> > took, but underlying hardware didn't change. We don't disable the
> > cursor often, so we can just force the pre-GEN9 path for cursor
> > that waits for vblank.
> > 
> > 	 * Unfortunately, the GEN9+ double buffering seems to be buggy when it
> > 	 * races with a modeset....
> > 
> > I'm guessing that this issue may well be present in upstream kernels and
> > regular Linux distros, the difference being that no other distro so far has
> > shipped with Y-tiled framebuffers.
> > 
> > 
> > (I doubt many distros have shipped with hardware overlays either -- Marc)
> 
> I'm definitely seeing something odd going on with the plane double buffering
> on my SKL here. The PLANE_CTL=0 write doesn't seem to stick sometimes and
> instead I'm seeing the plane scanning out stale data instead of being off. I
> suspect this could happen when the PLANE_CTL write lands before the vblank,
> and the PLANE_SURF write lands after the vblank, but I'll need to do some
> more low level hw poking to confirm that theory.

The race is something more subtle than that since manually poking those register across vblanks is obviously fine.

It looks like I managed to avoid the planes staying on past their bedtime with the following branch:
git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa

I hope it will also prevent the hard hangs. Please test.
Comment 59 Azhar 2018-04-21 00:59:44 UTC
(In reply to Ville Syrjala from comment #58)
> (In reply to Ville Syrjala from comment #57)
> > (In reply to marc.herbert from comment #51)
> > > Created attachment 138830 [details]
> > > vertical corruption unless wait for vblank after disabling cursor
> > > 
> > > Could this be related? See screenshot attached and workaround by:
> > > 
> > > Kristian H. Kristensen
> > > 
> > > https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/
> > > 877396
> > > 
> > > CHROMIUM: drm/i915: Wait for vblank after disabling cursor
> > > 
> > > The cursor disable write is latched on vsync but if we do a modeset
> > > before that happens the write seems to be discarded and the cursor
> > > stays on. The CURCNTL register will still read back 0 as if the write
> > > took, but underlying hardware didn't change. We don't disable the
> > > cursor often, so we can just force the pre-GEN9 path for cursor
> > > that waits for vblank.
> > > 
> > > 	 * Unfortunately, the GEN9+ double buffering seems to be buggy when it
> > > 	 * races with a modeset....
> > > 
> > > I'm guessing that this issue may well be present in upstream kernels and
> > > regular Linux distros, the difference being that no other distro so far has
> > > shipped with Y-tiled framebuffers.
> > > 
> > > 
> > > (I doubt many distros have shipped with hardware overlays either -- Marc)
> > 
> > I'm definitely seeing something odd going on with the plane double buffering
> > on my SKL here. The PLANE_CTL=0 write doesn't seem to stick sometimes and
> > instead I'm seeing the plane scanning out stale data instead of being off. I
> > suspect this could happen when the PLANE_CTL write lands before the vblank,
> > and the PLANE_SURF write lands after the vblank, but I'll need to do some
> > more low level hw poking to confirm that theory.
> 
> The race is something more subtle than that since manually poking those
> register across vblanks is obviously fine.
> 
> It looks like I managed to avoid the planes staying on past their bedtime
> with the following branch:
> git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> 
> I hope it will also prevent the hard hangs. Please test.


With this patch on Linux-stable 4.15.18 without the support of IPC on BXT+, still causes hard hangs.
Comment 60 Ville Syrjala 2018-04-23 12:53:46 UTC
(In reply to Azhar from comment #59)
> (In reply to Ville Syrjala from comment #58)
> > It looks like I managed to avoid the planes staying on past their bedtime
> > with the following branch:
> > git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> > 
> > I hope it will also prevent the hard hangs. Please test.
> 
> 
> With this patch on Linux-stable 4.15.18 without the support of IPC on BXT+,
> still causes hard hangs.

Argh. OK, so looks like what I had there is only sufficient for linear buffers for some unknown reason. To make the planes appear to work correctly with tiled buffers I had to frob the double buffer disable bits some more between the ddb/wm and plane programming steps.

I pushed an additional commit to the same branch:
git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa

Please test again.
Comment 61 Ville Syrjala 2018-04-23 13:14:11 UTC
(In reply to Ville Syrjala from comment #60)
> (In reply to Azhar from comment #59)
> > (In reply to Ville Syrjala from comment #58)
> > > It looks like I managed to avoid the planes staying on past their bedtime
> > > with the following branch:
> > > git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> > > 
> > > I hope it will also prevent the hard hangs. Please test.
> > 
> > 
> > With this patch on Linux-stable 4.15.18 without the support of IPC on BXT+,
> > still causes hard hangs.
> 
> Argh. OK, so looks like what I had there is only sufficient for linear
> buffers for some unknown reason. To make the planes appear to work correctly
> with tiled buffers I had to frob the double buffer disable bits some more
> between the ddb/wm and plane programming steps.
> 
> I pushed an additional commit to the same branch:
> git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> 
> Please test again.

Actually never mind. I still seems to fail here. Just makes a bit harder to hit :(
Comment 62 Ville Syrjala 2018-04-23 18:27:28 UTC
(In reply to Ville Syrjala from comment #61)
> (In reply to Ville Syrjala from comment #60)
> > (In reply to Azhar from comment #59)
> > > (In reply to Ville Syrjala from comment #58)
> > > > It looks like I managed to avoid the planes staying on past their bedtime
> > > > with the following branch:
> > > > git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> > > > 
> > > > I hope it will also prevent the hard hangs. Please test.
> > > 
> > > 
> > > With this patch on Linux-stable 4.15.18 without the support of IPC on BXT+,
> > > still causes hard hangs.
> > 
> > Argh. OK, so looks like what I had there is only sufficient for linear
> > buffers for some unknown reason. To make the planes appear to work correctly
> > with tiled buffers I had to frob the double buffer disable bits some more
> > between the ddb/wm and plane programming steps.
> > 
> > I pushed an additional commit to the same branch:
> > git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> > 
> > Please test again.
> 
> Actually never mind. I still seems to fail here. Just makes a bit harder to
> hit :(

OK, yet another commit pushed. So far this is showing a bit more promise here. I'll leave it running overnight to see if it ends up tripping over later.
Comment 63 Azhar 2018-04-23 20:58:54 UTC
(In reply to Ville Syrjala from comment #62)
> (In reply to Ville Syrjala from comment #61)
> > (In reply to Ville Syrjala from comment #60)
> > > (In reply to Azhar from comment #59)
> > > > (In reply to Ville Syrjala from comment #58)
> > > > > It looks like I managed to avoid the planes staying on past their bedtime
> > > > > with the following branch:
> > > > > git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> > > > > 
> > > > > I hope it will also prevent the hard hangs. Please test.
> > > > 
> > > > 
> > > > With this patch on Linux-stable 4.15.18 without the support of IPC on BXT+,
> > > > still causes hard hangs.
> > > 
> > > Argh. OK, so looks like what I had there is only sufficient for linear
> > > buffers for some unknown reason. To make the planes appear to work correctly
> > > with tiled buffers I had to frob the double buffer disable bits some more
> > > between the ddb/wm and plane programming steps.
> > > 
> > > I pushed an additional commit to the same branch:
> > > git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> > > 
> > > Please test again.
> > 
> > Actually never mind. I still seems to fail here. Just makes a bit harder to
> > hit :(
> 
> OK, yet another commit pushed. So far this is showing a bit more promise
> here. I'll leave it running overnight to see if it ends up tripping over
> later.


With the patches on the double_buffer_ctl_ddb_wa branhc, on Linux-stable 4.15.18 and without the support of IPC on BXT+, still causes hard hangs.
Comment 64 Azhar 2018-04-24 00:24:46 UTC
(In reply to Azhar from comment #63)
> (In reply to Ville Syrjala from comment #62)
> > (In reply to Ville Syrjala from comment #61)
> > > (In reply to Ville Syrjala from comment #60)
> > > > (In reply to Azhar from comment #59)
> > > > > (In reply to Ville Syrjala from comment #58)
> > > > > > It looks like I managed to avoid the planes staying on past their bedtime
> > > > > > with the following branch:
> > > > > > git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> > > > > > 
> > > > > > I hope it will also prevent the hard hangs. Please test.
> > > > > 
> > > > > 
> > > > > With this patch on Linux-stable 4.15.18 without the support of IPC on BXT+,
> > > > > still causes hard hangs.
> > > > 
> > > > Argh. OK, so looks like what I had there is only sufficient for linear
> > > > buffers for some unknown reason. To make the planes appear to work correctly
> > > > with tiled buffers I had to frob the double buffer disable bits some more
> > > > between the ddb/wm and plane programming steps.
> > > > 
> > > > I pushed an additional commit to the same branch:
> > > > git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa
> > > > 
> > > > Please test again.
> > > 
> > > Actually never mind. I still seems to fail here. Just makes a bit harder to
> > > hit :(
> > 
> > OK, yet another commit pushed. So far this is showing a bit more promise
> > here. I'll leave it running overnight to see if it ends up tripping over
> > later.
> 
> 
> With the patches on the double_buffer_ctl_ddb_wa branhc, on Linux-stable
> 4.15.18 and without the support of IPC on BXT+, still causes hard hangs.

I re-ran the test second time and it did crash again. But this time it took a longer time to crash. Earlier (before this patch) it used to crash within 15-20 mins, but this time it took more than 2 hours to crash. So looks like there is some improvement.
Comment 65 Ville Syrjala 2018-04-24 19:22:54 UTC
Hmm. I wonder if we're still barking up the right tree with the DDB angle.

I pushed the following hack:
b48dad3fe322 ("hacks: fixed ddb allocation for each plane")
to git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa_hacks

Would be nice to know whether that fully eliminates the hang. If it doesn't then we would seem to be on the wrong track.
Comment 66 Azhar 2018-04-25 00:42:52 UTC
(In reply to Ville Syrjala from comment #65)
> Hmm. I wonder if we're still barking up the right tree with the DDB angle.
> 
> I pushed the following hack:
> b48dad3fe322 ("hacks: fixed ddb allocation for each plane")
> to git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa_hacks
> 
> Would be nice to know whether that fully eliminates the hang. If it doesn't
> then we would seem to be on the wrong track.

The test has been running for more than 5 hours now, with ONLY the top commit "b48dad3fe322 ("hacks: fixed ddb allocation for each plane")"  on linux-stable 4.15.18 branch without IPC support on BXT+.

I will keep it running overnight.
Comment 67 Azhar 2018-04-25 17:48:17 UTC
(In reply to Azhar from comment #66)
> (In reply to Ville Syrjala from comment #65)
> > Hmm. I wonder if we're still barking up the right tree with the DDB angle.
> > 
> > I pushed the following hack:
> > b48dad3fe322 ("hacks: fixed ddb allocation for each plane")
> > to git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa_hacks
> > 
> > Would be nice to know whether that fully eliminates the hang. If it doesn't
> > then we would seem to be on the wrong track.
> 
> The test has been running for more than 5 hours now, with ONLY the top
> commit "b48dad3fe322 ("hacks: fixed ddb allocation for each plane")"  on
> linux-stable 4.15.18 branch without IPC support on BXT+.
> 
> I will keep it running overnight.

There was no crash with overnight run for above configuration.
Comment 68 Ville Syrjala 2018-04-25 18:17:28 UTC
(In reply to Azhar from comment #67)
> (In reply to Azhar from comment #66)
> > (In reply to Ville Syrjala from comment #65)
> > > Hmm. I wonder if we're still barking up the right tree with the DDB angle.
> > > 
> > > I pushed the following hack:
> > > b48dad3fe322 ("hacks: fixed ddb allocation for each plane")
> > > to git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa_hacks
> > > 
> > > Would be nice to know whether that fully eliminates the hang. If it doesn't
> > > then we would seem to be on the wrong track.
> > 
> > The test has been running for more than 5 hours now, with ONLY the top
> > commit "b48dad3fe322 ("hacks: fixed ddb allocation for each plane")"  on
> > linux-stable 4.15.18 branch without IPC support on BXT+.
> > 
> > I will keep it running overnight.
> 
> There was no crash with overnight run for above configuration.

OK. I guess we're still on the right track then.

Let's try to make this double buffer thing as simple as possible:
git://github.com/vsyrjala/linux.git double_buffer_ctl_simple

That approach is no good when multiple pipes are enabled, but for a single pipe case it's about as simple as we can make it.
Comment 69 Azhar 2018-04-25 23:51:06 UTC
(In reply to Ville Syrjala from comment #68)
> (In reply to Azhar from comment #67)
> > (In reply to Azhar from comment #66)
> > > (In reply to Ville Syrjala from comment #65)
> > > > Hmm. I wonder if we're still barking up the right tree with the DDB angle.
> > > > 
> > > > I pushed the following hack:
> > > > b48dad3fe322 ("hacks: fixed ddb allocation for each plane")
> > > > to git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa_hacks
> > > > 
> > > > Would be nice to know whether that fully eliminates the hang. If it doesn't
> > > > then we would seem to be on the wrong track.
> > > 
> > > The test has been running for more than 5 hours now, with ONLY the top
> > > commit "b48dad3fe322 ("hacks: fixed ddb allocation for each plane")"  on
> > > linux-stable 4.15.18 branch without IPC support on BXT+.
> > > 
> > > I will keep it running overnight.
> > 
> > There was no crash with overnight run for above configuration.
> 
> OK. I guess we're still on the right track then.
> 
> Let's try to make this double buffer thing as simple as possible:
> git://github.com/vsyrjala/linux.git double_buffer_ctl_simple
> 
> That approach is no good when multiple pipes are enabled, but for a single
> pipe case it's about as simple as we can make it.

In the above branch commit("grab double buffer ctl around the entire update") has a schedule_timeout call within a spinlock which is causing a kernel panic due to BUG.

[  204.671649] BUG: scheduling while atomic: kworker/u8:1/78/0x00000002
[  204.678769] Modules linked in: rfcomm cmac uinput cfg80211 ip6table_filter xt_nat bridge stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 btusb btrtl lzo btbcm btintel lzo_compress ov13858 bluetooth iptable_nat zram nf_nat_ipv4 nf_nat ov5670 v4l2_fwnode ecdh_generic snd_soc_max98927 dw9714 at24 acpi_als xt_mark fuse snd_seq_dummy snd_seq snd_seq_device iio_trig_sysfs cros_ec_light_prox cros_ec_sensors cros_ec_sensors_core industrialio_triggered_buffer kfifo_buf industrialio r8152 mii joydev
[  204.727322] CPU: 0 PID: 78 Comm: kworker/u8:1 Tainted: G        W        4.15.18-00010-gc4c90323f9ec #49
[  204.746677] Workqueue: events_unbound intel_atomic_commit_work
[  204.753196] Call Trace:
[  204.755943]  dump_stack+0x4d/0x63
[  204.759650]  __schedule_bug+0x5d/0x6b
[  204.763740]  __schedule+0x83/0x7b2
[  204.767542]  ? __switch_to_asm+0x40/0x70
[  204.771920]  ? __switch_to_asm+0x34/0x70
[  204.776297]  ? __switch_to_asm+0x34/0x70
[  204.780665]  ? __switch_to_asm+0x40/0x70
[  204.785055]  schedule+0x75/0x86
[  204.788566]  schedule_timeout+0x2de/0x34b
[  204.793053]  ? collect_expired_timers+0x10b/0x10b
[  204.798312]  intel_pipe_update_start+0x1ea/0x2e2
[  204.803473]  ? add_wait_queue+0x48/0x48
[  204.807761]  intel_begin_crtc_commit+0x6b/0x20d
[  204.812836]  drm_atomic_helper_commit_planes_on_crtc+0x4e/0x14f
[  204.819462]  skl_update_crtcs+0x172/0x1b7
[  204.823944]  intel_atomic_commit_tail+0x6dd/0x1d15
[  204.829299]  ? _raw_spin_unlock_irq+0xe/0x21
[  204.834085]  ? __schedule+0x583/0x7b2
[  204.838180]  worker_thread+0x42e/0x5da
[  204.842362]  ? queue_work_on+0x24/0x24
[  204.846552]  kthread+0x1e6/0x1ee
[  204.850153]  ? queue_work_on+0x24/0x24
[  204.854336]  ? kthread_create_worker+0x72/0x72
[  204.859297]  ret_from_fork+0x35/0x40
[  204.863441] BUG: sleeping function called from invalid context at /mnt/host/source/src/third_party/kernel/v4.4/kernel/sched/completion.c:102
[  204.877553] in_atomic(): 1, irqs_disabled(): 0, pid: 78, name: kworker/u8:1
[  204.885348] CPU: 1 PID: 78 Comm: kworker/u8:1 Tainted: G        W        4.15.18-00010-gc4c90323f9ec #49
[  204.904727] Workqueue: events_unbound intel_atomic_commit_work
[  204.911252] Call Trace:
[  204.913990]  dump_stack+0x4d/0x63
[  204.917696]  ___might_sleep+0x126/0x135
[  204.921975]  wait_for_common+0x32/0x69
[  204.926158]  drm_atomic_helper_wait_for_flip_done+0x50/0x7e
[  204.932391]  intel_atomic_commit_tail+0x6ea/0x1d15
[  204.937745]  ? _raw_spin_unlock_irq+0xe/0x21
[  204.942513]  ? __schedule+0x583/0x7b2
[  204.946610]  worker_thread+0x42e/0x5da
[  204.950793]  ? queue_work_on+0x24/0x24
[  204.954976]  kthread+0x1e6/0x1ee
[  204.958578]  ? queue_work_on+0x24/0x24
[  204.962764]  ? kthread_create_worker+0x72/0x72
[  204.967725]  ret_from_fork+0x35/0x40
[  204.971759] BUG: workqueue leaked lock or atomic: kworker/u8:1/0x7fffffff/78
[  204.971759]      last function: intel_atomic_commit_work
[  204.971775] BUG: scheduling while atomic: kworker/u8:3/713/0x00000002

For now I just commented and ran the test with the same patch and it still crashed.
Comment 70 Azhar 2018-05-02 17:42:42 UTC
(In reply to Azhar from comment #69)
> (In reply to Ville Syrjala from comment #68)
> > (In reply to Azhar from comment #67)
> > > (In reply to Azhar from comment #66)
> > > > (In reply to Ville Syrjala from comment #65)
> > > > > Hmm. I wonder if we're still barking up the right tree with the DDB angle.
> > > > > 
> > > > > I pushed the following hack:
> > > > > b48dad3fe322 ("hacks: fixed ddb allocation for each plane")
> > > > > to git://github.com/vsyrjala/linux.git double_buffer_ctl_ddb_wa_hacks
> > > > > 
> > > > > Would be nice to know whether that fully eliminates the hang. If it doesn't
> > > > > then we would seem to be on the wrong track.
> > > > 
> > > > The test has been running for more than 5 hours now, with ONLY the top
> > > > commit "b48dad3fe322 ("hacks: fixed ddb allocation for each plane")"  on
> > > > linux-stable 4.15.18 branch without IPC support on BXT+.
> > > > 
> > > > I will keep it running overnight.
> > > 
> > > There was no crash with overnight run for above configuration.
> > 
> > OK. I guess we're still on the right track then.
> > 
> > Let's try to make this double buffer thing as simple as possible:
> > git://github.com/vsyrjala/linux.git double_buffer_ctl_simple
> > 
> > That approach is no good when multiple pipes are enabled, but for a single
> > pipe case it's about as simple as we can make it.
> 
> In the above branch commit("grab double buffer ctl around the entire
> update") has a schedule_timeout call within a spinlock which is causing a
> kernel panic due to BUG.
> 
> [  204.671649] BUG: scheduling while atomic: kworker/u8:1/78/0x00000002
> [  204.678769] Modules linked in: rfcomm cmac uinput cfg80211
> ip6table_filter xt_nat bridge stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4
> btusb btrtl lzo btbcm btintel lzo_compress ov13858 bluetooth iptable_nat
> zram nf_nat_ipv4 nf_nat ov5670 v4l2_fwnode ecdh_generic snd_soc_max98927
> dw9714 at24 acpi_als xt_mark fuse snd_seq_dummy snd_seq snd_seq_device
> iio_trig_sysfs cros_ec_light_prox cros_ec_sensors cros_ec_sensors_core
> industrialio_triggered_buffer kfifo_buf industrialio r8152 mii joydev
> [  204.727322] CPU: 0 PID: 78 Comm: kworker/u8:1 Tainted: G        W       
> 4.15.18-00010-gc4c90323f9ec #49
> [  204.746677] Workqueue: events_unbound intel_atomic_commit_work
> [  204.753196] Call Trace:
> [  204.755943]  dump_stack+0x4d/0x63
> [  204.759650]  __schedule_bug+0x5d/0x6b
> [  204.763740]  __schedule+0x83/0x7b2
> [  204.767542]  ? __switch_to_asm+0x40/0x70
> [  204.771920]  ? __switch_to_asm+0x34/0x70
> [  204.776297]  ? __switch_to_asm+0x34/0x70
> [  204.780665]  ? __switch_to_asm+0x40/0x70
> [  204.785055]  schedule+0x75/0x86
> [  204.788566]  schedule_timeout+0x2de/0x34b
> [  204.793053]  ? collect_expired_timers+0x10b/0x10b
> [  204.798312]  intel_pipe_update_start+0x1ea/0x2e2
> [  204.803473]  ? add_wait_queue+0x48/0x48
> [  204.807761]  intel_begin_crtc_commit+0x6b/0x20d
> [  204.812836]  drm_atomic_helper_commit_planes_on_crtc+0x4e/0x14f
> [  204.819462]  skl_update_crtcs+0x172/0x1b7
> [  204.823944]  intel_atomic_commit_tail+0x6dd/0x1d15
> [  204.829299]  ? _raw_spin_unlock_irq+0xe/0x21
> [  204.834085]  ? __schedule+0x583/0x7b2
> [  204.838180]  worker_thread+0x42e/0x5da
> [  204.842362]  ? queue_work_on+0x24/0x24
> [  204.846552]  kthread+0x1e6/0x1ee
> [  204.850153]  ? queue_work_on+0x24/0x24
> [  204.854336]  ? kthread_create_worker+0x72/0x72
> [  204.859297]  ret_from_fork+0x35/0x40
> [  204.863441] BUG: sleeping function called from invalid context at
> /mnt/host/source/src/third_party/kernel/v4.4/kernel/sched/completion.c:102
> [  204.877553] in_atomic(): 1, irqs_disabled(): 0, pid: 78, name:
> kworker/u8:1
> [  204.885348] CPU: 1 PID: 78 Comm: kworker/u8:1 Tainted: G        W       
> 4.15.18-00010-gc4c90323f9ec #49
> [  204.904727] Workqueue: events_unbound intel_atomic_commit_work
> [  204.911252] Call Trace:
> [  204.913990]  dump_stack+0x4d/0x63
> [  204.917696]  ___might_sleep+0x126/0x135
> [  204.921975]  wait_for_common+0x32/0x69
> [  204.926158]  drm_atomic_helper_wait_for_flip_done+0x50/0x7e
> [  204.932391]  intel_atomic_commit_tail+0x6ea/0x1d15
> [  204.937745]  ? _raw_spin_unlock_irq+0xe/0x21
> [  204.942513]  ? __schedule+0x583/0x7b2
> [  204.946610]  worker_thread+0x42e/0x5da
> [  204.950793]  ? queue_work_on+0x24/0x24
> [  204.954976]  kthread+0x1e6/0x1ee
> [  204.958578]  ? queue_work_on+0x24/0x24
> [  204.962764]  ? kthread_create_worker+0x72/0x72
> [  204.967725]  ret_from_fork+0x35/0x40
> [  204.971759] BUG: workqueue leaked lock or atomic:
> kworker/u8:1/0x7fffffff/78
> [  204.971759]      last function: intel_atomic_commit_work
> [  204.971775] BUG: scheduling while atomic: kworker/u8:3/713/0x00000002
> 
> For now I just commented and ran the test with the same patch and it still
> crashed.

Ville, commenting the schedule_timeout here, git://github.com/vsyrjala/linux.git double_buffer_ctl_simple, still does crash the system. Is commenting the timeout, right thing here? If not, then will have to remove the spinlocks. Do you have any new patch/es?
Comment 71 Ville Syrjala 2018-05-04 20:22:55 UTC
Came up with one more random idea:
git://github.com/vsyrjala/linux.git double_buffer_ctl_simple_2
Comment 72 Azhar 2018-05-07 21:22:56 UTC
(In reply to Ville Syrjala from comment #71)
> Came up with one more random idea:
> git://github.com/vsyrjala/linux.git double_buffer_ctl_simple_2

This branch still had the spinlock issues (introduced with this patch "grab double buffer ctl around the entire update") mentioned in comment #69, have commented the schedule_timeout() in intel_pipe_update_start(), to move forward. 


1. adjust the DDB and plane update seqeunces
2. irqsave/restore for skl wm
3. double buffer disable for scalers too
4. grab double buffer ctl around the entire update
5. leave the "allow" bit on, but arm the ddb/wm with a SURF write
6. clear the double buffer allow bits immediately after ddb/wm change
7. drm/i915: Use DOUBLE_BUFFER_CTL and disable planes before frobbing wm/ddb

With above patches on top of Linux-stable 4.15.18 without the support of IPC on BXT+, system still crashes.
Comment 73 Ville Syrjala 2018-05-09 16:00:03 UTC
Pushed a few more things to 
git://github.com/vsyrjala/linux.git double_buffer_ctl_simple_2

cd7a3f1a64c7 ("write wm/buf_cfg regs when writing other plane regs")
0374055ebfdd ("clear upper half of ggtt and point all disabled planes there")
64bb19816a10 ("fix schedule_timeout vs. spinlock")
dd26b543fec6 ("disable LP1+ watermarks")

Please test again. If supposedly disabled planes remain on this time we should get getting fault errors in dmesg from the irq handler.
Comment 74 Azhar 2018-05-11 22:52:48 UTC
(In reply to Ville Syrjala from comment #73)
> Pushed a few more things to 
> git://github.com/vsyrjala/linux.git double_buffer_ctl_simple_2
> 
> cd7a3f1a64c7 ("write wm/buf_cfg regs when writing other plane regs")
> 0374055ebfdd ("clear upper half of ggtt and point all disabled planes there")
> 64bb19816a10 ("fix schedule_timeout vs. spinlock")
> dd26b543fec6 ("disable LP1+ watermarks")
> 
> Please test again. If supposedly disabled planes remain on this time we
> should get getting fault errors in dmesg from the irq handler.

After applying above patches I was able to reproduce the issue once, but was not able to collect the logs.

I have re-started the test again to get the dmesg logs with default log level.
Comment 75 Azhar 2018-05-11 23:04:11 UTC
(In reply to Azhar from comment #74)
> (In reply to Ville Syrjala from comment #73)
> > Pushed a few more things to 
> > git://github.com/vsyrjala/linux.git double_buffer_ctl_simple_2
> > 
> > cd7a3f1a64c7 ("write wm/buf_cfg regs when writing other plane regs")
> > 0374055ebfdd ("clear upper half of ggtt and point all disabled planes there")
> > 64bb19816a10 ("fix schedule_timeout vs. spinlock")
> > dd26b543fec6 ("disable LP1+ watermarks")
> > 
> > Please test again. If supposedly disabled planes remain on this time we
> > should get getting fault errors in dmesg from the irq handler.
> 
> After applying above patches I was able to reproduce the issue once, but was
> not able to collect the logs.
> 
> I have re-started the test again to get the dmesg logs with default log
> level.

Please find attached dmesg logs.
Comment 76 Azhar 2018-05-11 23:05:34 UTC
Created attachment 139508 [details]
dmesg logs with disable LP1+ wm patch
Comment 77 Francesco Balestrieri 2018-05-14 12:27:37 UTC
Removing the NEEDINFO state since AFAICT the info has been provided.
Comment 78 Azhar 2018-05-14 17:44:04 UTC
(In reply to Azhar from comment #75)
> (In reply to Azhar from comment #74)
> > (In reply to Ville Syrjala from comment #73)
> > > Pushed a few more things to 
> > > git://github.com/vsyrjala/linux.git double_buffer_ctl_simple_2
> > > 
> > > cd7a3f1a64c7 ("write wm/buf_cfg regs when writing other plane regs")
> > > 0374055ebfdd ("clear upper half of ggtt and point all disabled planes there")
> > > 64bb19816a10 ("fix schedule_timeout vs. spinlock")
> > > dd26b543fec6 ("disable LP1+ watermarks")
> > > 
> > > Please test again. If supposedly disabled planes remain on this time we
> > > should get getting fault errors in dmesg from the irq handler.
> > 
> > After applying above patches I was able to reproduce the issue once, but was
> > not able to collect the logs.
> > 
> > I have re-started the test again to get the dmesg logs with default log
> > level.
> 
> Please find attached dmesg logs.

Please find attached dmesg logs with debug log level 0x1E, when issue is reproduced with above patches.
Comment 79 Azhar 2018-05-14 17:45:29 UTC
Created attachment 139560 [details]
dmesg logs with disable LP1+ wm patch debug log level 0x1E
Comment 80 Francesco Balestrieri 2018-05-22 13:30:41 UTC
Ville, any thoughts on what to do next?
Comment 81 Francesco Balestrieri 2018-06-19 12:38:02 UTC
Lowering priority
Comment 82 Ville Syrjala 2018-07-12 20:32:29 UTC
I had a minor revelation today when chatting with danvet on irc about the troubles with the DOUBLE_BUFFER_CTL register. Maybe this finally does the plane updates in a safe way?

git://github.com/vsyrjala/linux.git skl_plane_update_ddb_sequence

It probably still has some issues with the cursor that I'll need to think about.
Comment 83 Azhar 2018-07-16 23:37:07 UTC
I tried this branch git://github.com/vsyrjala/linux.git skl_plane_update_ddb_sequence and the issue is still reproducible.

Attaching the dmesg logs - dmesg_logs_DOUBLE_BUFFER.zip
Comment 84 Azhar 2018-07-16 23:39:59 UTC
Created attachment 140654 [details]
dmesg logs of branch shared on Jul 12, 2018
Comment 85 Lakshmi 2018-09-10 06:31:33 UTC
Ville, any updates on this issue?
Comment 86 Ville Syrjala 2019-01-22 18:47:12 UTC
The watermark/ddb update series landed a while back. Retesting with drm-tip might make sense.
Comment 87 Lakshmi 2019-02-27 14:02:58 UTC
Azhar, can you verify the issue wit latest drmtip?
(https://cgit.freedesktop.org/drm-tip)
Comment 88 Juston Li 2019-03-05 02:28:43 UTC
(In reply to Lakshmi from comment #87)
> Azhar, can you verify the issue wit latest drmtip?
> (https://cgit.freedesktop.org/drm-tip)

Still able to reproduce the issue with drm-tip as of today:

commit 19d7e6b4a6c43104823680ed766240300f5ef876 (drm/drm-tip)
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Mon Mar 4 20:20:13 2019 +0200

    drm-tip: 2019y-03m-04d-18h-19m-23s UTC integration manifest

Logs leading up to the freeze attached.
Comment 89 Juston Li 2019-03-05 02:29:20 UTC
Created attachment 143527 [details]
kernel logs leading up to the freeze, 0x1e
Comment 90 Lakshmi 2019-06-04 10:15:41 UTC
Azhar, would you mind verifying this issue with drmtip?
Please attach the full dmesg from boot with drmtip. Dmesg from boot will be helpful to investigate this issue further.
Comment 91 Juston Li 2019-06-11 21:25:26 UTC
Retested recently and did not reproduce with 500 iterations of our autotest.

Bisect points to:
drm/i915: Switch to level-based DDB allocation algorithm (v5)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/i915/intel_pm.c?h=v5.1&id=d8e874980241dd1cee68e85c39d8e5372bbad494


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.