Summary: | Machine hang with i915.enable_psr=1 (Kaby Lake R) | ||
---|---|---|---|
Product: | DRI | Reporter: | Raphael Kubo da Costa <raphael.kubo.da.costa> |
Component: | DRM/Intel | Assignee: | Jose Roberto de Souza <jose.souza> |
Status: | RESOLVED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | critical | ||
Priority: | highest | CC: | intel-gfx-bugs, jose.souza |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | All | ||
Whiteboard: | Triaged, ReadyForDev | ||
i915 platform: | KBL | i915 features: | display/PSR |
Attachments: |
[ 9.548638] [drm] [transcoder EDP] PSR aux error [ 9.548740] [drm:intel_psr_disable_locked [i915]] Disabling PSR1 [ 9.557453] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.557512] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.557554] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.557657] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun [ 9.557700] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.557743] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.557811] [drm:intel_power_well_enable [i915]] enabling DC off [ 9.557838] [drm:gen9_set_dc_state [i915]] Setting DC state from 02 to 00 [ 9.557849] ------------[ cut here ]------------ [ 9.557886] WARN_ON((val & (1 << 30)) == 0) [ 9.557928] WARNING: CPU: 2 PID: 189 at drivers/gpu/drm/i915/display/intel_cdclk.c:834 skl_get_cdclk+0x228/0x250 [i915] [ 9.557975] Modules linked in: i915 crct10dif_pclmul crc32_pclmul i2c_algo_bit crc32c_intel drm_kms_helper ghash_clmulni_intel drm nvme e1000e serio_raw nvme_core wmi i2c_hid video fuse [ 9.558029] CPU: 2 PID: 189 Comm: kworker/2:2 Not tainted 5.3.7-200.fc30.x86_64+debug #1 [ 9.558054] Hardware name: Dell Inc. Latitude 7390/09386V, BIOS 1.11.0 07/04/2019 [ 9.558096] Workqueue: events intel_psr_work [i915] [ 9.558130] RIP: 0010:skl_get_cdclk+0x228/0x250 [i915] [ 9.558149] Code: b8 01 00 00 00 e9 cd fe ff ff c7 03 5c 26 05 00 31 c0 e9 c0 fe ff ff 48 c7 c6 98 5f 76 c0 48 c7 c7 cb 55 78 c0 e8 10 4a a6 e2 <0f> 0b 8b 53 04 e9 7d fe ff ff c7 03 80 97 06 00 b8 01 00 00 00 e9 [ 9.558202] RSP: 0018:ffffac1d80387b80 EFLAGS: 00010282 [ 9.558220] RAX: 0000000000000000 RBX: ffffac1d80387ba4 RCX: 0000000000000006 [ 9.558243] RDX: 00000000000040bf RSI: ffff8e93ea384160 RDI: 0000000000000246 [ 9.558265] RBP: ffff8e93ddb40000 R08: 0000000239b1c80b R09: 0000000000000000 [ 9.558288] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e93ddb40f28 [ 9.558314] R13: 0000000080000000 R14: ffff8e93ddb4bd30 R15: ffff8e93ddb40000 [ 9.558337] FS: 0000000000000000(0000) GS:ffff8e93ed800000(0000) knlGS:0000000000000000 [ 9.558362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9.558381] CR2: 00007ffae271a8a0 CR3: 000000040b612003 CR4: 00000000003606e0 [ 9.558405] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9.558427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9.558450] Call Trace: [ 9.558481] gen9_dc_off_power_well_enable+0x54/0x1b0 [i915] [ 9.559511] intel_power_well_enable+0x3a/0x50 [i915] [ 9.560535] __intel_display_power_get_domain.part.0+0x6f/0x90 [i915] [ 9.561560] intel_display_power_get+0x4c/0x60 [i915] [ 9.562580] intel_dp_aux_xfer+0xaa/0x910 [i915] [ 9.563881] ? __mutex_lock+0x68e/0x990 [ 9.564876] ? find_held_lock+0x32/0x90 [ 9.565869] ? find_held_lock+0x32/0x90 [ 9.566876] intel_dp_aux_transfer+0x105/0x220 [i915] [ 9.567876] drm_dp_dpcd_access+0x72/0x110 [drm_kms_helper] [ 9.569169] drm_dp_dpcd_write+0x26/0xa0 [drm_kms_helper] [ 9.570173] intel_psr_disable_locked+0xca/0x140 [i915] [ 9.571182] intel_psr_work+0xdb/0x160 [i915] [ 9.572170] process_one_work+0x272/0x5a0 [ 9.573151] worker_thread+0x50/0x3b0 [ 9.574708] kthread+0x108/0x140 [ 9.575670] ? process_one_work+0x5a0/0x5a0 [ 9.576629] ? kthread_park+0x80/0x80 [ 9.577581] ret_from_fork+0x3a/0x50 [ 9.578533] irq event stamp: 20710 [ 9.579772] hardirqs last enabled at (20709): [<ffffffffa31733b5>] vprintk_emit+0x155/0x360 [ 9.580728] hardirqs last disabled at (20710): [<ffffffffa30038da>] trace_hardirqs_off_thunk+0x1a/0x20 [ 9.581684] softirqs last enabled at (19646): [<ffffffffa3e0035d>] __do_softirq+0x35d/0x45d [ 9.582636] softirqs last disabled at (19633): [<ffffffffa30f2027>] irq_exit+0xf7/0x100 [ 9.583584] ---[ end trace ef3a13a5b16988f5 ]--- [ 9.584554] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.585832] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.586806] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.587760] ------------[ cut here ]------------ [ 9.588700] WARN_ON(intel_cdclk_needs_modeset(&dev_priv->cdclk.hw, &cdclk_state)) [ 9.588733] WARNING: CPU: 2 PID: 189 at drivers/gpu/drm/i915/display/intel_display_power.c:831 gen9_dc_off_power_well_enable+0x159/0x1b0 [i915] [ 9.591213] Modules linked in: i915 crct10dif_pclmul crc32_pclmul i2c_algo_bit crc32c_intel drm_kms_helper ghash_clmulni_intel drm nvme e1000e serio_raw nvme_core wmi i2c_hid video fuse [ 9.592184] CPU: 2 PID: 189 Comm: kworker/2:2 Tainted: G W 5.3.7-200.fc30.x86_64+debug #1 [ 9.593124] Hardware name: Dell Inc. Latitude 7390/09386V, BIOS 1.11.0 07/04/2019 [ 9.594097] Workqueue: events intel_psr_work [i915] [ 9.595035] RIP: 0010:gen9_dc_off_power_well_enable+0x159/0x1b0 [i915] [ 9.595933] Code: 00 00 f6 85 3b 0e 00 00 10 75 39 0f b6 85 ca 0d 00 00 e9 61 ff ff ff 48 c7 c6 80 b6 76 c0 48 c7 c7 12 68 78 c0 e8 ef ea a3 e2 <0f> 0b e9 0d ff ff ff 48 c7 c7 c8 b6 76 c0 e8 dc ea a3 e2 0f 0b e9 [ 9.596877] RSP: 0018:ffffac1d80387ba0 EFLAGS: 00010286 [ 9.597809] RAX: 0000000000000000 RBX: ffff8e93e2b73930 RCX: 0000000000000006 [ 9.598734] RDX: 00000000000040bf RSI: ffff8e93ea384160 RDI: 0000000000000246 [ 9.599653] RBP: ffff8e93ddb40000 R08: 000000023b87f6ba R09: 0000000000000000 [ 9.600577] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000001f [ 9.601510] R13: 0000000080000000 R14: ffff8e93ddb4bd30 R15: ffff8e93ddb40000 [ 9.602438] FS: 0000000000000000(0000) GS:ffff8e93ed800000(0000) knlGS:0000000000000000 [ 9.603368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9.604302] CR2: 00007ffae271a8a0 CR3: 000000040b612003 CR4: 00000000003606e0 [ 9.605237] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9.606175] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9.607100] Call Trace: [ 9.608016] intel_power_well_enable+0x3a/0x50 [i915] [ 9.608920] __intel_display_power_get_domain.part.0+0x6f/0x90 [i915] [ 9.609825] intel_display_power_get+0x4c/0x60 [i915] [ 9.610722] intel_dp_aux_xfer+0xaa/0x910 [i915] [ 9.611606] ? __mutex_lock+0x68e/0x990 [ 9.612451] ? find_held_lock+0x32/0x90 [ 9.613258] ? find_held_lock+0x32/0x90 [ 9.614073] intel_dp_aux_transfer+0x105/0x220 [i915] [ 9.614883] drm_dp_dpcd_access+0x72/0x110 [drm_kms_helper] [ 9.615693] drm_dp_dpcd_write+0x26/0xa0 [drm_kms_helper] [ 9.616469] intel_psr_disable_locked+0xca/0x140 [i915] [ 9.617285] intel_psr_work+0xdb/0x160 [i915] [ 9.618080] process_one_work+0x272/0x5a0 [ 9.618869] worker_thread+0x50/0x3b0 [ 9.619652] kthread+0x108/0x140 [ 9.620425] ? process_one_work+0x5a0/0x5a0 [ 9.621200] ? kthread_park+0x80/0x80 [ 9.621977] ret_from_fork+0x3a/0x50 [ 9.622710] irq event stamp: 20748 [ 9.623473] hardirqs last enabled at (20747): [<ffffffffa31733b5>] vprintk_emit+0x155/0x360 [ 9.624249] hardirqs last disabled at (20748): [<ffffffffa30038da>] trace_hardirqs_off_thunk+0x1a/0x20 [ 9.624988] softirqs last enabled at (20734): [<ffffffffa3e0035d>] __do_softirq+0x35d/0x45d [ 9.625768] softirqs last disabled at (20713): [<ffffffffa30f2027>] irq_exit+0xf7/0x100 [ 9.626547] ---[ end trace ef3a13a5b16988f6 ]--- [ 9.627343] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.628156] [drm:gen8_de_irq_handler [i915]] *ERROR* Fault errors on pipe A: 0x00000080 [ 9.629017] [drm:edp_panel_vdd_on [i915]] Turning eDP port A VDD on [ 9.629096] [drm:edp_panel_vdd_on [i915]] PP_STATUS: 0x80000008 PP_CONTROL: 0x0000000f [ 9.629468] [drm:intel_fbc_underrun_work_fn [i915]] Disabling FBC due to FIFO underrun. (In reply to Raphael Kubo da Costa from comment #0) > Created attachment 145835 [details] > dmesg with drm.debug=0x1e log_buf_len=1M > > I'm running Fedora with kernel 5.3.7, and have been experiencing this issue > for several months. Essentially, if I boot my laptop _without_ plugging it > to an external display (via an HDMI port, for example), then it is almost > 100% certain that it is either going to hang while waiting for my LUKS > password, during the boot process or if I later switch to a non-X11/Wayland > VT. The entire machine hangs, and I need to manually power it off and then > turn it on again. This machine is a Dell Latitude 7390 with an i7-8650U > processor. > > I played with a few boot parameters today, and noticed that any of the > following seem to stop the problem from happening: > * intel_idle.max_cstate=1 > * i915.enable_dc=0 > * i915.enable_psr=0 > * i915.dmc_firmware_path="" > * nomodeset i915.modeset=0 > > I'm attaching dmesg from a kernel-debug run with drm.debug=0x1e and > log_buf_len=1M that did not hang, but has two WARN_ONs there. I also have a > few other saved dmesgs, but they were produced without booting with > drm.debug=0x1e and log_buf_len=1M. CC'ing Jose. Hi Raphael Could you give a try with drm-tip kernel? If this still happens please share the logs. Looks like DMC is doing something really bad when exiting DC5/DC6 causing those pipe faults. Raphael, As Jose said, can you reproduce this issue using drm-tip (https://cgit.freedesktop.org/drm-tip) with kernel parameters drm.debug=0x1e log_buf_len=4M? Please attach full log from boot. I'm able to reproduce the issue with drm-tip at 5efc505498d2612451f6230a6347f4e0e1960e50 ("drm-tip: 2019y-10m-30d-22h-53m-25s UTC integration manifest"), but getting a log is very hard: most of the time this happens while I'm typing my LUKS password, or right after I finish typing and press Enter. The screen flashes once very quickly and the whole machine hangs, and it leaves no information in either /sys/class/drm/card0/error or /sys/fs/pstore (the laptop doesn't have ECC and I'm trying to use EFI variables for this). There are a few occasions where instead of hanging the kernel outputs the errors in the files I'll attach (2 dmesgs from 2 runs where this happened) while allowing me to finish typing my password and booting. Created attachment 145854 [details]
Second dmesg
Created attachment 145855 [details]
Third dmesg
Created attachment 145858 [details]
Hung machine with ignore_loglevel drm.debug=0x1e
In case it helps: when I boot with "drm.debug=0x1e log_buf_len=4M ignore_loglevel" I get a lot more information on the console when the machine hangs.
I'm attaching a screenshot of the last lines that appear on the screen before everything hangs. As far as I could reproduce, the last lines are always
[drm:intel_power_well_disable [i915]] disabling DC off
[drm:skl_enable_dc6 [i915]] Enabling DC6
[drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 02
Setting it to Highest priority considering the impact and reproduction rate. Hi Raphael Can you give a try with this branch? https://github.com/zehortigoza/linux/tree/psr-test Then please share the logs, also can you share the output of /sys/kernel/debug/dri/0/i915_dmc_info before the issue and after. Thanks Created attachment 145887 [details]
Fourth dmesg, commit 6eb50a2e5d102
dmesg from commit 6eb50a2e5d102 ("drm-tip: 2019y-11m-04d-18h-07m-25s UTC integration manifest"), the drm-tip commit before the PSR test commits in the psr-test branch.
Created attachment 145888 [details]
/sys/kernel/debug/dri/0/i915_dmc_info, commit 6eb50a2e5d102
Contents of /sys/kernel/debug/dri/0/i915_dmc_info at commit 6eb50a2e5d102 ("drm-tip: 2019y-11m-04d-18h-07m-25s UTC integration manifest"), the drm-tip commit before the PSR test commits in the psr-test branch.
Created attachment 145889 [details]
Fifth dmesg, commit 241ccbd67399c
dmesg from commit 241ccbd67399c ("wip: drm/i915/psr: Increase psr idle paterns"), tip of the psr-test branch.
Created attachment 145890 [details]
/sys/kernel/debug/dri/0/i915_dmc_info, commit 241ccbd67399c
Contents of /sys/kernel/debug/dri/0/i915_dmc_info at commit 241ccbd67399c ("wip: drm/i915/psr: Increase psr idle paterns"), tip of the psr-test branch.
Alright, I've tested your psr-test branch at commit 241ccbd67399c ("wip: drm/i915/psr: Increase psr idle paterns") and I'm unable to reproduce the machine hang there. ... However, drm-tip _without_ your patches (i.e. 6eb50a2e5d102, "drm-tip: 2019y-11m-04d-18h-07m-25s UTC integration manifest", i.e. the tree you based your patches on) also works! I've spent a few hours trying to figure out what happened between last week's drm-tip tree (5efc50549, "drm-tip: 2019y-10m-30d-22h-53m-25s UTC integration manifest") and yours. There's been rc6, and a lot of merges, but bisection (*) points to commit 58d124ea2739e1440ddd743d46c470fe724aca9a ("drm/i915: Complete crtc hw/uapi split, v6"), which was merged in commit 74dfc380537aa49ad0b741cd0358bdb9e0248e4b ("Merge remote-tracking branch 'drm-intel/drm-intel-next-queued' into drm-tip"). I can reliably reproduce my machine hang when using commit 2225f3c6f1d793 ("drm/i915: Perform automated conversions for crtc uapi/hw split, base -> uapi"), which is 58d124ea2739e1440ddd743d46c470fe724aca9a's parent, and I have not managed to cause any hangs since commit 58d124ea2739e1440ddd743d46c470fe724aca9a. I just have no idea if unrelated changes ended up fixing the worst cases of my actual issue and the machine no longer hangs, or if those commits were really supposed to fix this, or even whether the other failures in older dmesgs were actually other problems that are not related to the hang I was having. (*) Due to the merges it wasn't a straight bisect; I reset to commit 4b40699236ebc009cba7c690e96c5873e2a75d6f ("Merge remote-tracking branch 'drm-misc/drm-misc-next' into drm-tip"), which was the latest drm-tip merge before the drm-intel-next-queued series, and then used git merge to merge a different range of the interval [f05816cbbcd0aa0af1efdd888ea6964644197e13..dbb1a6fbbb809fe364f51293026616c58916f330]. (In reply to Raphael Kubo da Costa from comment #15) > Alright, I've tested your psr-test branch at commit 241ccbd67399c ("wip: > drm/i915/psr: Increase psr idle paterns") and I'm unable to reproduce the > machine hang there. > > ... However, drm-tip _without_ your patches (i.e. 6eb50a2e5d102, "drm-tip: > 2019y-11m-04d-18h-07m-25s UTC integration manifest", i.e. the tree you based > your patches on) also works! > > I've spent a few hours trying to figure out what happened between last > week's drm-tip tree (5efc50549, "drm-tip: 2019y-10m-30d-22h-53m-25s UTC > integration manifest") and yours. There's been rc6, and a lot of merges, but > bisection (*) points to commit 58d124ea2739e1440ddd743d46c470fe724aca9a > ("drm/i915: Complete crtc hw/uapi split, v6"), which was merged in commit > 74dfc380537aa49ad0b741cd0358bdb9e0248e4b ("Merge remote-tracking branch > 'drm-intel/drm-intel-next-queued' into drm-tip"). > > I can reliably reproduce my machine hang when using commit 2225f3c6f1d793 > ("drm/i915: Perform automated conversions for crtc uapi/hw split, base -> > uapi"), which is 58d124ea2739e1440ddd743d46c470fe724aca9a's parent, and I > have not managed to cause any hangs since commit > 58d124ea2739e1440ddd743d46c470fe724aca9a. > > I just have no idea if unrelated changes ended up fixing the worst cases of > my actual issue and the machine no longer hangs, or if those commits were > really supposed to fix this, or even whether the other failures in older > dmesgs were actually other problems that are not related to the hang I was > having. > > (*) Due to the merges it wasn't a straight bisect; I reset to commit > 4b40699236ebc009cba7c690e96c5873e2a75d6f ("Merge remote-tracking branch > 'drm-misc/drm-misc-next' into drm-tip"), which was the latest drm-tip merge > before the drm-intel-next-queued series, and then used git merge to merge a > different range of the interval > [f05816cbbcd0aa0af1efdd888ea6964644197e13.. > dbb1a6fbbb809fe364f51293026616c58916f330]. 58d124ea2739e1440ddd743d46c470fe724aca9a wasn't supposed to fix anything but maybe some register was left without programming since of the beginning of the conversion and when DMC was restoring the registers the missing part was causing the GTT fault. Anyways, thanks for testing and doing the bisect. I will close this issue then, please reopen if it regress. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 145835 [details] dmesg with drm.debug=0x1e log_buf_len=1M I'm running Fedora with kernel 5.3.7, and have been experiencing this issue for several months. Essentially, if I boot my laptop _without_ plugging it to an external display (via an HDMI port, for example), then it is almost 100% certain that it is either going to hang while waiting for my LUKS password, during the boot process or if I later switch to a non-X11/Wayland VT. The entire machine hangs, and I need to manually power it off and then turn it on again. This machine is a Dell Latitude 7390 with an i7-8650U processor. I played with a few boot parameters today, and noticed that any of the following seem to stop the problem from happening: * intel_idle.max_cstate=1 * i915.enable_dc=0 * i915.enable_psr=0 * i915.dmc_firmware_path="" * nomodeset i915.modeset=0 I'm attaching dmesg from a kernel-debug run with drm.debug=0x1e and log_buf_len=1M that did not hang, but has two WARN_ONs there. I also have a few other saved dmesgs, but they were produced without booting with drm.debug=0x1e and log_buf_len=1M.