Bug 111601

Summary: regression: deadlock-freeze due to kernel commit aa56a292ce623734ddd30f52d73f527d1f3529b5
Product: DRI Reporter: Dq8CokMHloQZw <howaboutsynergy>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: highest CC: albeec13, chris, howaboutsynergy, intel-gfx-bugs, leho, massimiliano.torromeo
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard: Triaged, ReadyForDev
i915 platform: CFL i915 features: GEM/Other
Attachments:
Description Flags
cloned here what I've referenced: memfreeze, le9i.patch, le9h.patch
none
semi-full log of stacktraces at the point of `return -EINTR`
none
got stacktraces also at the point of `callback failed with -4 in blockable context` not just on `return -EINTR` none

Description Dq8CokMHloQZw 2019-09-09 09:49:13 UTC
In kernel 5.3.0-rc8 with commit aa56a292ce623734ddd30f52d73f527d1f3529b5 present(bisected), only on Intel/i915 computer (not on AMD/Radeon laptop) I can easily reproduce a deadlock-like freeze of the entire system which apparently lasts indefinitely. Sysrq still works but the CPU/fans go low speed and there's no disk activity.

Apparently this freeze happens during low memory / memory pressure situations.
I've encountered the freeze during chromium compilation (possibly due to jumbo build being in use which means it uses much RAM, I've 32G RAM total)

I tried to post about this on the issue that that commit references here: https://bugzilla.kernel.org/show_bug.cgi?id=203317#c4

Not sure what more info to give at this point, please let me know.

Ah, the way I can easily reproduce this is by running the memfreeze (variant 1) script from here: https://gist.github.com/howaboutsynergy/d9818dcf462c071251a236787d8fbea6#file-memfreeze
But in order to avoid the usual disk-reading freeze/crawl, I'm also using the le9i.patch from here: https://gist.github.com/howaboutsynergy/cbfa3cc5e8093c26c29f5d411c16e6b1/941ebb19cb493fb1d090475fafc57d4f4c2638ef
(but le9h.patch is enough though, it just keeps `Active(file)` above 300MiB during memory pressure; whilst le9i.patch also keeps `Inactive(file)` above 64MiB - but this is not necessary) 

le9h.patch is here: https://gist.github.com/howaboutsynergy/04fd9be927835b055ac15b8b64658e85/166cb58f454f2bdae26cbe219b919084cddfc657

Note that even without those patches, I can still cause this deadlock freeze and I know that it's not the disk-reading freeze that le9*.patch is solving because it lacks any disk activity and fans are slow speed, also mouse is frozen forever and it doesn't snap back to normal after the specified timeout(for 'stress' command).

The only reason I think it's about i915 is because of:
1. "i915" in the stacktrace in Comment 1 here https://bugzilla.kernel.org/show_bug.cgi?id=203317#c1
2. it only happens on my Intel(i915) computer, not on Amd/Radeon laptop(I simply cannot reproduce the freeze on the AMD one)
Comment 1 Dq8CokMHloQZw 2019-09-09 10:00:54 UTC
Created attachment 145306 [details]
cloned here what I've referenced: memfreeze, le9i.patch, le9h.patch

just in case gist.github goes away for whatever reason.
Comment 2 Jani Nikula 2019-09-09 10:33:30 UTC
(In reply to howaboutsynergy from comment #0)
> In kernel 5.3.0-rc8 with commit aa56a292ce623734ddd30f52d73f527d1f3529b5
> present(bisected)

That would be

commit aa56a292ce623734ddd30f52d73f527d1f3529b5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 8 15:03:27 2019 +0100

    drm/i915/userptr: Acquire the page lock around set_page_dirty()

Cc: Chris.
Comment 3 Lakshmi 2019-09-09 12:52:06 UTC
(In reply to howaboutsynergy from comment #0)
> In kernel 5.3.0-rc8 with commit aa56a292ce623734ddd30f52d73f527d1f3529b5
> present(bisected), only on Intel/i915 computer (not on AMD/Radeon laptop) I
> can easily reproduce a deadlock-like freeze of the entire system which
> apparently lasts indefinitely. Sysrq still works but the CPU/fans go low
> speed and there's no disk activity.
> 
> Apparently this freeze happens during low memory / memory pressure
> situations.
> I've encountered the freeze during chromium compilation (possibly due to
> jumbo build being in use which means it uses much RAM, I've 32G RAM total)
> 
What's the platform under use?
Comment 4 Dq8CokMHloQZw 2019-09-09 13:24:48 UTC
(In reply to Lakshmi from comment #3)
> What's the platform under use?

Intel Core i7-8700K CPU, 12 cores
ASUS Prime Z370-A motherboard

[    0.373549] smpboot: CPU0: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz (family: 0x6, model: 0x9e, stepping: 0xa)
microcode	: 0xb4
[    0.000000] DMI: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019

32G RAM, no swap(during memfreeze test, but swap in zram during chromium compilation)
64G ext4 zram of /tmp and /var/tmp

/dev/zram1       63G   53M   60G   1% /tmp
/dev/zram2       63G   53M   60G   1% /var/tmp

$ swapon
NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  64G   0B   -2


$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-linux-git root=UUID=2b8b9ab8-7ac5-4586-aa42-d7ffb12de92a rw root_trim=yes rd.luks.allow-discards rd.luks.options=discard ipv6.disable=1 ipv6.disable_ipv6=1 ipv6.autoconf=0 loglevel=15 log_buf_len=16M ignore_loglevel printk.always_kmsg_dump=y printk.time=y printk.devkmsg=on mminit_loglevel=4 memory_corruption_check=1 fbcon=scrollback:4096k fbcon=font:ProFont6x11 net.ifnames=0 nolvm dobtrfs console=tty1 earlyprintk=vga audit=0 systemd.log_target=kmsg systemd.journald.forward_to_console=1 enforcing=0 udev.children-max=1256 rd.udev.children-max=1256 nohz=on oops=panic crashkernel=256M panic=0 page_poison=1 psi=1 sysrq_always_enabled random.trust_cpu=off logo.nologo lpj=0 mce=bootlog reboot=force,cold noexec=on nohibernate scsi_mod.use_blk_mq=1 consoleblank=120 mitigations=off nospectre_v1 nospectre_v2 spectre_v2=off nospec_store_bypass_disable kvm-intel.vmentry_l1d_flush=never l1tf=off nopti pti=off no_stf_barrier noibrs noibpb ssbd=force-off spectre_v2_user=off noretpoline mds=off rd.log=all noefi cpuidle.governor=menu zram.num_devices=3 zswap.enabled=0 zswap.same_filled_pages_enabled=1 zswap.compressor=zstd zswap.max_pool_percent=40 zswap.zpool=z3fold vsyscall=none i915.enable_fbc=1 i915.verbose_state_checks=1 i915.enable_hangcheck=1 i915.error_capture=1


I've just discovered why reading the kdump with `crash` didn't work, because crashkernel=256M was too low and it would OOM during `makedumpfile` so the dump would be incomplete/garbage(tested this only on the AMD/Radeon laptop which would do this with crashkernel=128M), so I will thus increase that to 512M on the Intel computer and have something next...
Comment 5 Dq8CokMHloQZw 2019-09-09 18:10:29 UTC
ok here's what I got, the last part of the dmesg:
```
[  299.780876] sysrq: Emergency Sync
[  299.802758] Emergency Sync complete
[  300.777865] sudo[2070]:     user : TTY=pts/1 ; PWD=/home/user ; USER=root ; COMMAND=/usr/bin/swapoff -a
[  300.778020] sudo[2070]: pam_unix(sudo:session): session opened for user root by (uid=0)
[  300.778907] sudo[2070]: pam_unix(sudo:session): session closed for user root
[  300.842945] sudo[2230]:     user : TTY=pts/1 ; PWD=/home/user ; USER=root ; COMMAND=/usr/bin/sysctl -w vm.oom_dump_tasks=0
[  300.843213] sudo[2230]: pam_unix(sudo:session): session opened for user root by (uid=0)
[  300.843947] sudo[2230]: pam_unix(sudo:session): session closed for user root
[  302.213705] stress invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  302.213708] CPU: 7 PID: 2237 Comm: stress Kdump: loaded Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #55
[  302.213709] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019
[  302.213709] Call Trace:
[  302.213713]  dump_stack+0x46/0x60
[  302.213715]  dump_header+0x45/0x17a
[  302.213716]  oom_kill_process.cold+0xb/0x10
[  302.213718]  out_of_memory+0x1aa/0x450
[  302.213719]  __alloc_pages_slowpath+0x9d9/0xc10
[  302.213721]  __alloc_pages_nodemask+0x268/0x2b0
[  302.213722]  alloc_pages_vma+0xc1/0x160
[  302.213724]  __handle_mm_fault+0xe13/0x12f0
[  302.213725]  handle_mm_fault+0xa9/0x1d0
[  302.213726]  __do_page_fault+0x23a/0x480
[  302.213728]  do_page_fault+0x1a/0x64
[  302.213729]  page_fault+0x39/0x40
[  302.213731] RIP: 0033:0x5ea9a660ec10
[  302.213732] Code: c0 0f 84 53 02 00 00 8b 54 24 0c 31 c0 85 d2 0f 94 c0 89 04 24 41 83 fd 02 0f 8f fa 00 00 00 31 c0 4d 85 ff 7e 10 0f 1f 40 00 <c6> 04 03 5a 4c 01 f0 49 39 c7 7f f4 4d 85 e4 0f 84 f4 01 00 00 7e
[  302.213733] RSP: 002b:00007ffc1d98c090 EFLAGS: 00010206
[  302.213733] RAX: 0000000375063000 RBX: 000078ba8d6cf010 RCX: 000078c1719b76fb
[  302.213734] RDX: 0000000000000001 RSI: 00000006e41c3000 RDI: 000078ba8d6cf000
[  302.213735] RBP: 00005ea9a660fa54 R08: 000078ba8d6cf010 R09: 0000000000000000
[  302.213735] R10: 0000000000000022 R11: 00000006e41c2000 R12: ffffffffffffffff
[  302.213736] R13: 0000000000000002 R14: 0000000000001000 R15: 00000006e41c2000
[  302.213736] Mem-Info:
[  302.213739] active_anon:7334737 inactive_anon:2479 isolated_anon:276
                active_file:18211 inactive_file:14386 isolated_file:71
                unevictable:94204 dirty:16 writeback:0 unstable:0
                slab_reclaimable:5767 slab_unreclaimable:15583
                mapped:28927 shmem:97461 pagetables:15786 bounce:0
                free:85636 free_pcp:1127 free_cma:0
[  302.213740] Node 0 active_anon:29338948kB inactive_anon:9916kB active_file:72844kB inactive_file:57544kB unevictable:376816kB isolated(anon):1104kB isolated(file):284kB mapped:115708kB dirty:64kB writeback:0kB shmem:389844kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 28256256kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  302.213741] Node 0 DMA free:15892kB min:104kB low:128kB high:152kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15980kB managed:15892kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  302.213742] lowmem_reserve[]: 0 1578 30909 30909
[  302.213744] Node 0 DMA32 free:127904kB min:10800kB low:13500kB high:16200kB active_anon:1500668kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2744176kB managed:1630064kB mlocked:0kB kernel_stack:0kB pagetables:1232kB bounce:0kB free_pcp:248kB local_pcp:248kB free_cma:0kB
[  302.213745] lowmem_reserve[]: 0 0 29331 29331
[  302.213746] Node 0 Normal free:198748kB min:199088kB low:248860kB high:298632kB active_anon:27837920kB inactive_anon:10104kB active_file:72956kB inactive_file:57616kB unevictable:376416kB writepending:64kB present:30654464kB managed:30039288kB mlocked:32kB kernel_stack:5696kB pagetables:61912kB bounce:0kB free_pcp:4260kB local_pcp:388kB free_cma:0kB
[  302.213747] lowmem_reserve[]: 0 0 0 0
[  302.213748] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
[  302.213751] Node 0 DMA32: 6*4kB (UM) 5*8kB (U) 4*16kB (UM) 7*32kB (UM) 5*64kB (UM) 2*128kB (U) 4*256kB (UM) 4*512kB (UM) 3*1024kB (UM) 1*2048kB (M) 29*4096kB (M) = 127904kB
[  302.213754] Node 0 Normal: 2322*4kB (UME) 1861*8kB (UE) 776*16kB (UME) 350*32kB (UME) 272*64kB (UE) 162*128kB (UME) 27*256kB (UE) 93*512kB (UM) 57*1024kB (UME) 0*2048kB 0*4096kB = 198832kB
[  302.213757] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  302.213758] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  302.213758] 130209 total pagecache pages
[  302.213759] 0 pages in swap cache
[  302.213759] Swap cache stats: add 0, delete 0, find 0/0
[  302.213760] Free swap  = 0kB
[  302.213760] Total swap = 0kB
[  302.213761] 8353655 pages RAM
[  302.213761] 0 pages HighMem/MovableOnly
[  302.213761] 432344 pages reserved
[  302.213762] 0 pages cma reserved
[  302.213762] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=stress,pid=2238,uid=1000
[  302.213767] Out of memory: Killed process 2238 (stress) total-vm:28906972kB, anon-rss:14587072kB, file-rss:276kB, shmem-rss:0kB
[  302.245057] oom_reaper: reaped process 2238 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  310.858232] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858335] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858408] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858477] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858624] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858692] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858759] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858825] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858890] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.858955] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859020] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859085] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859150] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859248] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859321] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859394] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859461] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859527] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859593] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859660] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859726] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859792] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859858] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859924] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.859990] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860056] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860122] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860195] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860273] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860340] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860406] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860472] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860538] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860604] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860669] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860735] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860802] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860868] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.860934] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  310.861000] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.
[  316.311671] gpg-agent[1077]: handler 0x7da757fff700 for fd 10 started
[  316.402397] gpg-agent[1077]: handler 0x7da757fff700 for fd 10 terminated
[  326.028889] sysrq: Emergency Sync
[  326.031866] Emergency Sync complete
[  327.359884] sysrq: Emergency Remount R/O
[  327.359938] Emergency Remount complete
[  328.452885] sysrq: Emergency Sync
[  328.454829] Emergency Sync complete
[  329.821886] sysrq: Trigger a crash
[  329.821892] Kernel panic - not syncing: sysrq triggered crash
[  329.821899] CPU: 8 PID: 0 Comm: swapper/8 Kdump: loaded Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #55
[  329.821902] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019
[  329.821905] Call Trace:
[  329.821910]  <IRQ>
[  329.821919]  dump_stack+0x46/0x60
[  329.821928]  panic+0xf6/0x2e1
[  329.821935]  ? printk+0x53/0x6a
[  329.821944]  sysrq_handle_crash+0xc/0x10
[  329.821951]  __handle_sysrq.cold+0x43/0xfd
[  329.821957]  sysrq_filter+0x2ea/0x3b0
[  329.821965]  input_to_handler+0x4a/0xe0
[  329.821973]  input_pass_values.part.0+0x109/0x130
[  329.821978]  input_handle_event+0x92/0x560
[  329.821982]  input_event+0x4d/0x70
[  329.821987]  hidinput_report_event+0x2e/0x40
[  329.821991]  hid_report_raw_event+0x25a/0x420
[  329.821996]  hid_input_report+0xf8/0x150
[  329.822002]  hid_irq_in+0x169/0x1a0
[  329.822007]  __usb_hcd_giveback_urb+0x6b/0xd0
[  329.822024]  xhci_giveback_urb_in_irq.isra.0+0x62/0x90 [xhci_hcd]
[  329.822041]  xhci_td_cleanup+0xf7/0x140 [xhci_hcd]
[  329.822055]  xhci_irq+0xe6b/0x1a90 [xhci_hcd]
[  329.822063]  __handle_irq_event_percpu+0x29/0xc0
[  329.822070]  handle_irq_event_percpu+0x2c/0x80
[  329.822076]  handle_irq_event+0x23/0x43
[  329.822082]  handle_edge_irq+0x72/0x180
[  329.822088]  handle_irq+0x11/0x20
[  329.822096]  do_IRQ+0x3e/0xc0
[  329.822101]  common_interrupt+0xf/0xf
[  329.822104]  </IRQ>
[  329.822109] RIP: 0010:cpuidle_enter_state+0x108/0x280
[  329.822115] Code: ff 56 48 41 89 c5 e8 e7 66 a1 ff 65 8b 3d e0 ee 95 48 e8 9b 65 a1 ff 49 89 c6 31 ff e8 a1 75 a1 ff 45 84 ff 75 1d fb 45 85 ed <79> 2b 41 c7 44 24 10 00 00 00 00 44 89 e8 5b 5d 41 5c 41 5d 41 5e
[  329.822119] RSP: 0018:ffffbd3e000fbe80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffde
[  329.822124] RAX: ffff9dd62da20a40 RBX: ffffffffb8062040 RCX: 000000000000001f
[  329.822127] RDX: 0000000000000000 RSI: 0000000022a1ce40 RDI: 0000000000000000
[  329.822130] RBP: 0000004cc9264b0a R08: 0000004ccaea38d8 R09: 000000007fffffff
[  329.822133] R10: ffff9dd62da1fb84 R11: ffff9dd62da1fb64 R12: ffff9dd62da29700
[  329.822135] R13: 0000000000000008 R14: 0000004ccaea38d8 R15: 0000000000000000
[  329.822142]  cpuidle_enter+0x24/0x40
[  329.822147]  do_idle+0x1c1/0x230
[  329.822151]  cpu_startup_entry+0x14/0x20
[  329.822158]  start_secondary+0x141/0x190
[  329.822163]  secondary_startup_64+0xa4/0xb0

```
Comment 6 Dq8CokMHloQZw 2019-09-09 20:20:29 UTC
[  310.860735] userptr_mn_invalidate_range_start+0x0/0x210 [i915] callback failed with -4 in blockable context.

mm/mmu_notifier.c:179:				pr_info("%pS callback failed with %d in %sblockable context.\n",

context:
```c
  int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
  {
    struct mmu_notifier *mn;
    int ret = 0;
    int id;
  
    id = srcu_read_lock(&srcu);
    hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
      if (mn->ops->invalidate_range_start) {
        int _ret = mn->ops->invalidate_range_start(mn, range);
        if (_ret) {
          pr_info("%pS callback failed with %d in %sblockable context.\n",                                                      
            mn->ops->invalidate_range_start, _ret,
            !mmu_notifier_range_blockable(range) ? "non-" : "");
          ret = _ret;
        }
      }
    }
    srcu_read_unlock(&srcu, id);
  
    return ret;
  }
  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
```

the `-4` is:

#define EINTR    4  /* Interrupted system call */                                                                               

context:
```c
static int
userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
				  const struct mmu_notifier_range *range)
{
	struct i915_mmu_notifier *mn =
		container_of(_mn, struct i915_mmu_notifier, mn);
	struct interval_tree_node *it;
	struct mutex *unlock = NULL;
	unsigned long end;
	int ret = 0;

	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
		return 0;

	/* interval ranges are inclusive, but invalidate range is exclusive */
	end = range->end - 1;

	spin_lock(&mn->lock);
	it = interval_tree_iter_first(&mn->objects, range->start, end);
	while (it) {
		struct drm_i915_gem_object *obj;

		if (!mmu_notifier_range_blockable(range)) {
			ret = -EAGAIN;
			break;
		}

		/*
		 * The mmu_object is released late when destroying the
		 * GEM object so it is entirely possible to gain a
		 * reference on an object in the process of being freed
		 * since our serialisation is via the spinlock and not
		 * the struct_mutex - and consequently use it after it
		 * is freed and then double free it. To prevent that
		 * use-after-free we only acquire a reference on the
		 * object if it is not in the process of being destroyed.
		 */
		obj = container_of(it, struct i915_mmu_object, it)->obj;
		if (!kref_get_unless_zero(&obj->base.refcount)) {
			it = interval_tree_iter_next(it, range->start, end);
			continue;
		}
		spin_unlock(&mn->lock);

		if (!unlock) {
			unlock = &mn->mm->i915->drm.struct_mutex;

			switch (mutex_trylock_recursive(unlock)) {
			default:
			case MUTEX_TRYLOCK_FAILED:
				if (mutex_lock_killable_nested(unlock, I915_MM_SHRINKER)) {
					i915_gem_object_put(obj);
					return -EINTR;
				}
				/* fall through */
			case MUTEX_TRYLOCK_SUCCESS:
				break;

			case MUTEX_TRYLOCK_RECURSIVE:
				unlock = ERR_PTR(-EEXIST);
				break;
			}
		}

		ret = i915_gem_object_unbind(obj);
		if (ret == 0)
			ret = __i915_gem_object_put_pages(obj, I915_MM_SHRINKER);
		i915_gem_object_put(obj);
		if (ret)
			goto unlock;

		spin_lock(&mn->lock);

		/*
		 * As we do not (yet) protect the mmu from concurrent insertion
		 * over this range, there is no guarantee that this search will
		 * terminate given a pathologic workload.
		 */
		it = interval_tree_iter_first(&mn->objects, range->start, end);
	}
	spin_unlock(&mn->lock);

unlock:
	if (!IS_ERR_OR_NULL(unlock))
		mutex_unlock(unlock);

	return ret;

}
```
drivers/gpu/drm/i915/gem/i915_gem_userptr.c
Comment 7 Lakshmi 2019-09-10 05:56:16 UTC
Setting the priority of the bug as highest considering it's regression.
Comment 8 Dq8CokMHloQZw 2019-09-10 10:36:19 UTC
Created attachment 145318 [details]
semi-full log of stacktraces at the point of `return -EINTR`

I got stacktraces from this:
```c
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 2caa594322bc..5a9dd14d3bea 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -138,6 +138,7 @@ userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 			case MUTEX_TRYLOCK_FAILED:
 				if (mutex_lock_killable_nested(unlock, I915_MM_SHRINKER)) {
 					i915_gem_object_put(obj);
+          WARN_ON(1);
 					return -EINTR;
 				}
 				/* fall through */
```


They look like this:
```
[  154.134495] ------------[ cut here ]------------
[  154.134498] WARN_ON(1)
[  154.134642] WARNING: CPU: 7 PID: 1931 at drivers/gpu/drm/i915/gem/i915_gem_userptr.c:141 userptr_mn_invalidate_range_start+0x176/0x220 [i915]
[  154.134684] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic xt_TCPMSS iptable_mangle iptable_security iptable_nat nf_nat iptable_raw nf_log_ipv4 nf_log_common xt_conntrack xt_LOG xt_connlimit nf_conncount nf_conntrack nf_defrag_ipv4 xt_hashlimit xt_multiport xt_owner xt_addrtype intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp i915 crct10dif_pclmul crc32_pclmul crc32c_intel i2c_algo_bit snd_hda_intel ghash_clmulni_intel drm_kms_helper snd_hda_codec syscopyarea snd_hwdep sysfillrect sysimgblt snd_hda_core fb_sys_fops snd_pcm iTCO_wdt intel_cstate drm snd_timer iTCO_vendor_support intel_uncore snd intel_rapl_perf e1000e drm_panel_orientation_quirks soundcore bfq i2c_i801 pcspkr xhci_pci xhci_hcd
[  154.134720] CPU: 7 PID: 1931 Comm: stress Kdump: loaded Tainted: G     U  W         5.3.0-rc8-gf74c2bb98776 #56
[  154.134723] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019
[  154.134804] RIP: 0010:userptr_mn_invalidate_range_start+0x176/0x220 [i915]
[  154.134809] Code: ff ff ff 48 89 ef e8 a9 37 cc cb 84 c0 74 08 48 89 ef e8 1d db f1 ff 48 c7 c6 81 00 89 c0 48 c7 c7 7e 00 89 c0 e8 63 ae 8e cb <0f> 0b 41 bf fc ff ff ff e9 97 fe ff ff be 01 00 00 00 48 89 ef e8
[  154.134816] RSP: 0018:ffffaf0607e03768 EFLAGS: 00010286
[  154.134820] RAX: 0000000000000000 RBX: ffffaf0607e03820 RCX: 0000000000000000
[  154.134824] RDX: 000000000000000a RSI: ffffffff8d56c9ca RDI: ffffffff8d56d9ca
[  154.134828] RBP: ffff94de3c88f900 R08: 00000023e321caf3 R09: 000000000000000a
[  154.134832] R10: 0000000000000000 R11: 00000000fffffffe R12: ffff94de3c418068
[  154.134837] R13: ffff94dea9dbfd98 R14: ffff94de47e70b90 R15: 0000000000000000
[  154.134842] FS:  0000766ecc735740(0000) GS:ffff94dead9c0000(0000) knlGS:0000000000000000
[  154.134846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  154.134849] CR2: 00007ded1ef46740 CR3: 00000007bdffe006 CR4: 00000000003606e0
[  154.134852] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  154.134859] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  154.134865] Call Trace:
[  154.134877]  __mmu_notifier_invalidate_range_start+0x4f/0x90
[  154.134897]  try_to_unmap_one+0x718/0x820
[  154.134905]  rmap_walk_file+0xe4/0x250
[  154.134933]  try_to_unmap+0xc1/0xf0
[  154.134942]  ? page_remove_rmap+0x2a0/0x2a0
[  154.134952]  ? page_not_mapped+0x10/0x10
[  154.134974]  ? page_get_anon_vma+0x70/0x70
[  154.134986]  migrate_pages+0x7aa/0x9a0
[  154.134999]  ? isolate_freepages_block+0x340/0x340
[  154.135007]  ? move_freelist_tail+0xd0/0xd0
[  154.135019]  compact_zone+0x656/0xa70
[  154.135029]  compact_zone_order+0xde/0x120
[  154.135042]  try_to_compact_pages+0x187/0x240
[  154.135065]  __alloc_pages_direct_compact+0x87/0x170
[  154.135084]  __alloc_pages_slowpath+0x1f8/0xc10
[  154.135096]  ? get_page_from_freelist+0xe80/0x1330
[  154.135117]  __alloc_pages_nodemask+0x268/0x2b0
[  154.135141]  alloc_pages_vma+0xc1/0x160
[  154.135149]  do_huge_pmd_anonymous_page+0x271/0x610
[  154.135166]  __handle_mm_fault+0xbfc/0x12f0
[  154.135173]  handle_mm_fault+0xa9/0x1d0
[  154.135187]  __do_page_fault+0x23a/0x480
[  154.135199]  do_page_fault+0x1a/0x64
[  154.135221]  page_fault+0x39/0x40
[  154.135229] RIP: 0033:0x5bfeb4f67c10
[  154.135244] Code: c0 0f 84 53 02 00 00 8b 54 24 0c 31 c0 85 d2 0f 94 c0 89 04 24 41 83 fd 02 0f 8f fa 00 00 00 31 c0 4d 85 ff 7e 10 0f 1f 40 00 <c6> 04 03 5a 4c 01 f0 49 39 c7 7f f4 4d 85 e4 0f 84 f4 01 00 00 7e
[  154.135266] RSP: 002b:00007ffcc3719e40 EFLAGS: 00010206
[  154.135272] RAX: 0000000374e52000 RBX: 00007667ebdae010 RCX: 0000766ecc85a6fb
[  154.135281] RDX: 0000000000000001 RSI: 00000006e0987000 RDI: 00007667ebdae000
[  154.135295] RBP: 00005bfeb4f68a54 R08: 00007667ebdae010 R09: 0000000000000000
[  154.135305] R10: 0000000000000022 R11: 00000006e0986000 R12: ffffffffffffffff
[  154.135315] R13: 0000000000000002 R14: 0000000000001000 R15: 00000006e0986000
[  154.135331] ---[ end trace 28cc7c376d08f23b ]---
[  154.135452] userptr_mn_invalidate_range_start+0x0/0x220 [i915] callback failed with -4 in blockable context.
```

semi-full log attached

Hope this is helpful.
Comment 9 Dq8CokMHloQZw 2019-09-10 11:06:16 UTC
Created attachment 145319 [details]
got stacktraces also at the point of `callback failed with -4 in blockable context` not just on `return -EINTR`

I just realized I could do better: get stacktraces also at this point
mm/mmu_notifier.c:179:				pr_info("%pS callback failed with %d in %sblockable context.\n",

using this code:
```c
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b5670620aea0..2ec61d700e72 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -179,6 +179,7 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 				pr_info("%pS callback failed with %d in %sblockable context.\n",
 					mn->ops->invalidate_range_start, _ret,
 					!mmu_notifier_range_blockable(range) ? "non-" : "");
+        WARN_ON(1);
 				ret = _ret;
 			}
 		}
```

The first try yielded no stacktraces, not sure why, but the second try was a success and it looks like this:
```
[  261.691955] ------------[ cut here ]------------
[  261.691961] WARN_ON(1)
[  261.692078] WARNING: CPU: 1 PID: 2240 at drivers/gpu/drm/i915/gem/i915_gem_userptr.c:141 userptr_mn_invalidate_range_start+0x176/0x220 [i915]
[  261.692082] Modules linked in: xt_comment xt_TCPMSS iptable_mangle iptable_security iptable_nat nf_nat iptable_raw nf_log_ipv4 nf_log_common xt_conntrack xt_LOG xt_connlimit nf_conncount nf_conntrack nf_defrag_ipv4 xt_hashlimit xt_multiport xt_owner xt_addrtype snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp i915 i2c_algo_bit crct10dif_pclmul drm_kms_helper crc32_pclmul snd_hda_intel crc32c_intel snd_hda_codec syscopyarea sysfillrect snd_hwdep sysimgblt snd_hda_core ghash_clmulni_intel fb_sys_fops drm intel_cstate iTCO_wdt snd_pcm e1000e intel_uncore iTCO_vendor_support snd_timer intel_rapl_perf bfq snd soundcore pcspkr i2c_i801 drm_panel_orientation_quirks xhci_pci xhci_hcd
[  261.692125] CPU: 1 PID: 2240 Comm: stress Kdump: loaded Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #57
[  261.692128] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019
[  261.692203] RIP: 0010:userptr_mn_invalidate_range_start+0x176/0x220 [i915]
[  261.692209] Code: ff ff ff 48 89 ef e8 c9 77 f5 c6 84 c0 74 08 48 89 ef e8 1d 3b e2 ff 48 c7 c6 81 c0 5f c0 48 c7 c7 7e c0 5f c0 e8 63 ee b7 c6 <0f> 0b 41 bf fc ff ff ff e9 97 fe ff ff be 01 00 00 00 48 89 ef e8
[  261.692213] RSP: 0018:ffffafc9c8f5f768 EFLAGS: 00010286
[  261.692217] RAX: 0000000000000000 RBX: ffffafc9c8f5f820 RCX: 0000000000000000
[  261.692220] RDX: 000000000000000a RSI: ffffffff8856c9ca RDI: ffffffff8856d9ca
[  261.692223] RBP: ffffa42007915500 R08: 0000003cee0e71d7 R09: 000000000000000a
[  261.692225] R10: 0000000000000000 R11: 00000000fffffffe R12: ffffa41ffd7a8068
[  261.692228] R13: ffffa42069d80c18 R14: ffffa42007a9b990 R15: 0000000000000000
[  261.692232] FS:  00007f7258f5a740(0000) GS:ffffa4206d840000(0000) knlGS:0000000000000000
[  261.692235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.692238] CR2: 0000737968747578 CR3: 00000007c7d1e006 CR4: 00000000003606e0
[  261.692241] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  261.692244] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  261.692246] Call Trace:
[  261.692261]  __mmu_notifier_invalidate_range_start+0x4f/0x90
[  261.692268]  try_to_unmap_one+0x718/0x820
[  261.692273]  rmap_walk_file+0xe4/0x250
[  261.692278]  try_to_unmap+0xc1/0xf0
[  261.692283]  ? page_remove_rmap+0x2a0/0x2a0
[  261.692287]  ? page_not_mapped+0x10/0x10
[  261.692291]  ? page_get_anon_vma+0x70/0x70
[  261.692296]  migrate_pages+0x7aa/0x9a0
[  261.692304]  ? isolate_freepages_block+0x340/0x340
[  261.692310]  ? move_freelist_tail+0xd0/0xd0
[  261.692314]  compact_zone+0x656/0xa70
[  261.692319]  compact_zone_order+0xde/0x120
[  261.692324]  try_to_compact_pages+0x187/0x240
[  261.692332]  __alloc_pages_direct_compact+0x87/0x170
[  261.692338]  __alloc_pages_slowpath+0x1f8/0xc10
[  261.692345]  ? get_page_from_freelist+0xe80/0x1330
[  261.692352]  __alloc_pages_nodemask+0x268/0x2b0
[  261.692357]  alloc_pages_vma+0xc1/0x160
[  261.692363]  do_huge_pmd_anonymous_page+0x271/0x610
[  261.692369]  __handle_mm_fault+0xbfc/0x12f0
[  261.692375]  handle_mm_fault+0xa9/0x1d0
[  261.692382]  __do_page_fault+0x23a/0x480
[  261.692388]  do_page_fault+0x1a/0x64
[  261.692395]  page_fault+0x39/0x40
[  261.692400] RIP: 0033:0x5afb2a4d7c10
[  261.692406] Code: c0 0f 84 53 02 00 00 8b 54 24 0c 31 c0 85 d2 0f 94 c0 89 04 24 41 83 fd 02 0f 8f fa 00 00 00 31 c0 4d 85 ff 7e 10 0f 1f 40 00 <c6> 04 03 5a 4c 01 f0 49 39 c7 7f f4 4d 85 e4 0f 84 f4 01 00 00 7e
[  261.692409] RSP: 002b:00007ffe9f9a6050 EFLAGS: 00010206
[  261.692413] RAX: 000000036dd5a000 RBX: 00007f6b76aa6010 RCX: 00007f725907f6fb
[  261.692415] RDX: 0000000000000001 RSI: 00000006e24b4000 RDI: 00007f6b76aa6000
[  261.692418] RBP: 00005afb2a4d8a54 R08: 00007f6b76aa6010 R09: 0000000000000000
[  261.692420] R10: 0000000000000022 R11: 00000006e24b3000 R12: ffffffffffffffff
[  261.692423] R13: 0000000000000002 R14: 0000000000001000 R15: 00000006e24b3000
[  261.692427] ---[ end trace 4cdaeb6ba05f3a75 ]---
[  261.692502] userptr_mn_invalidate_range_start+0x0/0x220 [i915] callback failed with -4 in blockable context.
[  261.692505] ------------[ cut here ]------------
[  261.692518] WARNING: CPU: 1 PID: 2240 at mm/mmu_notifier.c:182 __mmu_notifier_invalidate_range_start.cold+0x33/0x43
[  261.692520] Modules linked in: xt_comment xt_TCPMSS iptable_mangle iptable_security iptable_nat nf_nat iptable_raw nf_log_ipv4 nf_log_common xt_conntrack xt_LOG xt_connlimit nf_conncount nf_conntrack nf_defrag_ipv4 xt_hashlimit xt_multiport xt_owner xt_addrtype snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp i915 i2c_algo_bit crct10dif_pclmul drm_kms_helper crc32_pclmul snd_hda_intel crc32c_intel snd_hda_codec syscopyarea sysfillrect snd_hwdep sysimgblt snd_hda_core ghash_clmulni_intel fb_sys_fops drm intel_cstate iTCO_wdt snd_pcm e1000e intel_uncore iTCO_vendor_support snd_timer intel_rapl_perf bfq snd soundcore pcspkr i2c_i801 drm_panel_orientation_quirks xhci_pci xhci_hcd
[  261.692554] CPU: 1 PID: 2240 Comm: stress Kdump: loaded Tainted: G     U  W         5.3.0-rc8-gf74c2bb98776 #57
[  261.692557] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019
[  261.692563] RIP: 0010:__mmu_notifier_invalidate_range_start.cold+0x33/0x43
[  261.692567] Code: ec 9f ea 87 49 0f 44 cf 48 8b 43 10 48 8b 70 28 89 ea 48 c7 c7 78 9f e8 87 e8 55 f2 ef ff 48 c7 c7 c8 44 e6 87 e8 49 f2 ef ff <0f> 0b 41 89 ee e9 6f fc ff ff 90 90 90 90 90 90 53 48 83 ec 10 48
[  261.692570] RSP: 0018:ffffafc9c8f5f7c0 EFLAGS: 00010246
[  261.692574] RAX: 0000000000000024 RBX: ffffa42069d80c18 RCX: 0000000000000006
[  261.692576] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffffa4206d8564c0
[  261.692579] RBP: 00000000fffffffc R08: ffffafc9c8f5f675 R09: 000000000000167c
[  261.692582] R10: ffffa4208e079cb4 R11: ffffafc9c8f5f675 R12: ffffafc9c8f5f820
[  261.692584] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff87e89f71
[  261.692588] FS:  00007f7258f5a740(0000) GS:ffffa4206d840000(0000) knlGS:0000000000000000
[  261.692591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.692593] CR2: 0000737968747578 CR3: 00000007c7d1e006 CR4: 00000000003606e0
[  261.692596] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  261.692599] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  261.692601] Call Trace:
[  261.692607]  try_to_unmap_one+0x718/0x820
[  261.692612]  rmap_walk_file+0xe4/0x250
[  261.692617]  try_to_unmap+0xc1/0xf0
[  261.692621]  ? page_remove_rmap+0x2a0/0x2a0
[  261.692625]  ? page_not_mapped+0x10/0x10
[  261.692629]  ? page_get_anon_vma+0x70/0x70
[  261.692634]  migrate_pages+0x7aa/0x9a0
[  261.692640]  ? isolate_freepages_block+0x340/0x340
[  261.692646]  ? move_freelist_tail+0xd0/0xd0
[  261.692650]  compact_zone+0x656/0xa70
[  261.692655]  compact_zone_order+0xde/0x120
[  261.692660]  try_to_compact_pages+0x187/0x240
[  261.692667]  __alloc_pages_direct_compact+0x87/0x170
[  261.692673]  __alloc_pages_slowpath+0x1f8/0xc10
[  261.692680]  ? get_page_from_freelist+0xe80/0x1330
[  261.692687]  __alloc_pages_nodemask+0x268/0x2b0
[  261.692692]  alloc_pages_vma+0xc1/0x160
[  261.692698]  do_huge_pmd_anonymous_page+0x271/0x610
[  261.692703]  __handle_mm_fault+0xbfc/0x12f0
[  261.692709]  handle_mm_fault+0xa9/0x1d0
[  261.692715]  __do_page_fault+0x23a/0x480
[  261.692721]  do_page_fault+0x1a/0x64
[  261.692727]  page_fault+0x39/0x40
[  261.692730] RIP: 0033:0x5afb2a4d7c10
[  261.692734] Code: c0 0f 84 53 02 00 00 8b 54 24 0c 31 c0 85 d2 0f 94 c0 89 04 24 41 83 fd 02 0f 8f fa 00 00 00 31 c0 4d 85 ff 7e 10 0f 1f 40 00 <c6> 04 03 5a 4c 01 f0 49 39 c7 7f f4 4d 85 e4 0f 84 f4 01 00 00 7e
[  261.692737] RSP: 002b:00007ffe9f9a6050 EFLAGS: 00010206
[  261.692740] RAX: 000000036dd5a000 RBX: 00007f6b76aa6010 RCX: 00007f725907f6fb
[  261.692743] RDX: 0000000000000001 RSI: 00000006e24b4000 RDI: 00007f6b76aa6000
[  261.692745] RBP: 00005afb2a4d8a54 R08: 00007f6b76aa6010 R09: 0000000000000000
[  261.692748] R10: 0000000000000022 R11: 00000006e24b3000 R12: ffffffffffffffff
[  261.692750] R13: 0000000000000002 R14: 0000000000001000 R15: 00000006e24b3000
[  261.692754] ---[ end trace 4cdaeb6ba05f3a76 ]---
```

semi-full log (of all stacktraces encountered due to the WARN_ON(1) that I added) is attached.
Comment 10 Leho Kraav (:macmaN :lkraav) 2019-09-10 14:25:33 UTC
Thanks for filing this @howaboutsynergy.

I wonder if this is a duplicate of my i915 freeze struggles at https://bugs.freedesktop.org/show_bug.cgi?id=111500 or vice versa?
Comment 11 Dq8CokMHloQZw 2019-09-10 14:48:21 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #10)
Ha! That's very interesting! I can actually ssh to it while frozen!! (just like you)

I assumed that since during `stress` the CPU/fans are high speed, and when freeze happens they go low/silent/off and not even the mouse moves in X, that it's completely frozen, even though sysrq stuff would work.

So I can then see hung tasks via ssh, dmesg like:
```
[ 2062.553078] glibc64:getnameinfo.c:559/getnameinfo[6929]: sshd[6929](full:'sshd: user@pts/7') for user user(1000(eff:user(1000))) 2of2 successfully reverse-resolved requested IP address:
               192.168.0.78 192.168.0.78
[ 2098.511916] INFO: task kworker/10:0H:63 blocked for more than 36 seconds.
[ 2098.511923]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2098.511926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2098.511929] kworker/10:0H   D    0    63      2 0x80004000
[ 2098.511944] Workqueue: events_highpri intel_atomic_cleanup_work
[ 2098.511948] Call Trace:
[ 2098.511960]  ? __schedule+0x2c6/0x580
[ 2098.511966]  schedule+0x34/0xa0
[ 2098.511972]  schedule_preempt_disabled+0x5/0x10
[ 2098.511978]  __mutex_lock.isra.0+0x25f/0x4b0
[ 2098.511986]  ? pick_next_task_fair+0x1eb/0x700
[ 2098.511993]  intel_cleanup_plane_fb+0x28/0x70
[ 2098.512000]  drm_atomic_helper_cleanup_planes+0x4a/0x60
[ 2098.512005]  intel_atomic_cleanup_work+0x17/0x80
[ 2098.512011]  process_one_work+0x16b/0x2a0
[ 2098.512016]  worker_thread+0x48/0x390
[ 2098.512023]  kthread+0xee/0x130
[ 2098.512028]  ? process_one_work+0x2a0/0x2a0
[ 2098.512034]  ? kthread_park+0x70/0x70
[ 2098.512038]  ret_from_fork+0x1f/0x30
[ 2098.512076] INFO: task kworker/u24:1:5567 blocked for more than 36 seconds.
[ 2098.512079]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2098.512081] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2098.512083] kworker/u24:1   D    0  5567      2 0x80004000
[ 2098.512093] Workqueue: i915 __i915_gem_free_work
[ 2098.512095] Call Trace:
[ 2098.512102]  ? __schedule+0x2c6/0x580
[ 2098.512107]  schedule+0x34/0xa0
[ 2098.512113]  schedule_preempt_disabled+0x5/0x10
[ 2098.512119]  __mutex_lock.isra.0+0x25f/0x4b0
[ 2098.512125]  ? __pm_runtime_resume+0x53/0x70
[ 2098.512130]  __i915_gem_free_objects+0x71/0x1d0
[ 2098.512135]  __i915_gem_free_work+0x5f/0x90
[ 2098.512140]  process_one_work+0x16b/0x2a0
[ 2098.512144]  worker_thread+0x48/0x390
[ 2098.512151]  kthread+0xee/0x130
[ 2098.512155]  ? process_one_work+0x2a0/0x2a0
[ 2098.512160]  ? kthread_park+0x70/0x70
[ 2098.512164]  ret_from_fork+0x1f/0x30
[ 2098.512171] INFO: task stress:6620 blocked for more than 36 seconds.
[ 2098.512174]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2098.512176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2098.512178] stress          D    0  6620   6618 0x00000004
[ 2098.512183] Call Trace:
[ 2098.512189]  ? __schedule+0x2c6/0x580
[ 2098.512194]  schedule+0x34/0xa0
[ 2098.512199]  io_schedule+0xd/0x40
[ 2098.512206]  __lock_page+0x13a/0x230
[ 2098.512212]  ? file_fdatawait_range+0x20/0x20
[ 2098.512217]  set_page_dirty_lock+0x44/0x50
[ 2098.512224]  i915_gem_userptr_put_pages+0x135/0x180
[ 2098.512229]  __i915_gem_object_put_pages+0x50/0x90
[ 2098.512235]  userptr_mn_invalidate_range_start+0x17f/0x210
[ 2098.512243]  __mmu_notifier_invalidate_range_start+0x4f/0x90
[ 2098.512249]  try_to_unmap_one+0x718/0x820
[ 2098.512254]  rmap_walk_file+0xe4/0x250
[ 2098.512260]  try_to_unmap+0xc1/0xf0
[ 2098.512264]  ? page_remove_rmap+0x2a0/0x2a0
[ 2098.512268]  ? page_not_mapped+0x10/0x10
[ 2098.512272]  ? page_get_anon_vma+0x70/0x70
[ 2098.512277]  migrate_pages+0x7aa/0x9a0
[ 2098.512285]  ? isolate_freepages_block+0x340/0x340
[ 2098.512291]  ? move_freelist_tail+0xd0/0xd0
[ 2098.512295]  compact_zone+0x656/0xa70
[ 2098.512300]  ? set_next_entity+0x9e/0x1a0
[ 2098.512304]  compact_zone_order+0xde/0x120
[ 2098.512309]  try_to_compact_pages+0x187/0x240
[ 2098.512316]  __alloc_pages_direct_compact+0x87/0x170
[ 2098.512323]  __alloc_pages_slowpath+0x1f8/0xc10
[ 2098.512329]  ? get_page_from_freelist+0xe80/0x1330
[ 2098.512337]  __alloc_pages_nodemask+0x268/0x2b0
[ 2098.512342]  alloc_pages_vma+0xc1/0x160
[ 2098.512348]  do_huge_pmd_anonymous_page+0x271/0x610
[ 2098.512354]  __handle_mm_fault+0xbfc/0x12f0
[ 2098.512360]  handle_mm_fault+0xa9/0x1d0
[ 2098.512367]  __do_page_fault+0x23a/0x480
[ 2098.512373]  do_page_fault+0x1a/0x64
[ 2098.512378]  page_fault+0x39/0x40
[ 2098.512383] RIP: 0033:0x6453194e9c10
[ 2098.512391] Code: Bad RIP value.
[ 2098.512394] RSP: 002b:00007ffdcdfbbbc0 EFLAGS: 00010206
[ 2098.512398] RAX: 00000006c5c79000 RBX: 00007a6cfe787010 RCX: 00007a73db2596fb
[ 2098.512401] RDX: 0000000000000001 RSI: 00000006dc9ad000 RDI: 00007a6cfe787000
[ 2098.512404] RBP: 00006453194eaa54 R08: 00007a6cfe787010 R09: 0000000000000000
[ 2098.512406] R10: 0000000000000022 R11: 00000006dc9ac000 R12: ffffffffffffffff
[ 2098.512409] R13: 0000000000000002 R14: 0000000000001000 R15: 00000006dc9ac000
[ 2117.676377] gpg-agent[1084]: handler 0x7c2858d5b700 for fd 10 started
[ 2117.741468] gpg-agent[1084]: handler 0x7c2858d5b700 for fd 10 terminated
[ 2135.375908] INFO: task kworker/10:0H:63 blocked for more than 73 seconds.
[ 2135.375914]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2135.375917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2135.375920] kworker/10:0H   D    0    63      2 0x80004000
[ 2135.375935] Workqueue: events_highpri intel_atomic_cleanup_work
[ 2135.375938] Call Trace:
[ 2135.375950]  ? __schedule+0x2c6/0x580
[ 2135.375957]  schedule+0x34/0xa0
[ 2135.375963]  schedule_preempt_disabled+0x5/0x10
[ 2135.375969]  __mutex_lock.isra.0+0x25f/0x4b0
[ 2135.375977]  ? pick_next_task_fair+0x1eb/0x700
[ 2135.375984]  intel_cleanup_plane_fb+0x28/0x70
[ 2135.375991]  drm_atomic_helper_cleanup_planes+0x4a/0x60
[ 2135.375996]  intel_atomic_cleanup_work+0x17/0x80
[ 2135.376002]  process_one_work+0x16b/0x2a0
[ 2135.376007]  worker_thread+0x48/0x390
[ 2135.376014]  kthread+0xee/0x130
[ 2135.376019]  ? process_one_work+0x2a0/0x2a0
[ 2135.376025]  ? kthread_park+0x70/0x70
[ 2135.376029]  ret_from_fork+0x1f/0x30
[ 2135.376061] INFO: task kworker/u24:1:5567 blocked for more than 73 seconds.
[ 2135.376064]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2135.376066] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2135.376069] kworker/u24:1   D    0  5567      2 0x80004000
[ 2135.376078] Workqueue: i915 __i915_gem_free_work
[ 2135.376080] Call Trace:
[ 2135.376087]  ? __schedule+0x2c6/0x580
[ 2135.376093]  schedule+0x34/0xa0
[ 2135.376098]  schedule_preempt_disabled+0x5/0x10
[ 2135.376104]  __mutex_lock.isra.0+0x25f/0x4b0
[ 2135.376110]  ? __pm_runtime_resume+0x53/0x70
[ 2135.376114]  __i915_gem_free_objects+0x71/0x1d0
[ 2135.376119]  __i915_gem_free_work+0x5f/0x90
[ 2135.376124]  process_one_work+0x16b/0x2a0
[ 2135.376129]  worker_thread+0x48/0x390
[ 2135.376135]  kthread+0xee/0x130
[ 2135.376139]  ? process_one_work+0x2a0/0x2a0
[ 2135.376145]  ? kthread_park+0x70/0x70
[ 2135.376149]  ret_from_fork+0x1f/0x30
[ 2135.376155] INFO: task stress:6620 blocked for more than 73 seconds.
[ 2135.376158]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2135.376160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2135.376163] stress          D    0  6620   6618 0x00000004
[ 2135.376166] Call Trace:
[ 2135.376172]  ? __schedule+0x2c6/0x580
[ 2135.376178]  schedule+0x34/0xa0
[ 2135.376183]  io_schedule+0xd/0x40
[ 2135.376190]  __lock_page+0x13a/0x230
[ 2135.376197]  ? file_fdatawait_range+0x20/0x20
[ 2135.376201]  set_page_dirty_lock+0x44/0x50
[ 2135.376209]  i915_gem_userptr_put_pages+0x135/0x180
[ 2135.376214]  __i915_gem_object_put_pages+0x50/0x90
[ 2135.376220]  userptr_mn_invalidate_range_start+0x17f/0x210
[ 2135.376228]  __mmu_notifier_invalidate_range_start+0x4f/0x90
[ 2135.376234]  try_to_unmap_one+0x718/0x820
[ 2135.376239]  rmap_walk_file+0xe4/0x250
[ 2135.376244]  try_to_unmap+0xc1/0xf0
[ 2135.376248]  ? page_remove_rmap+0x2a0/0x2a0
[ 2135.376252]  ? page_not_mapped+0x10/0x10
[ 2135.376257]  ? page_get_anon_vma+0x70/0x70
[ 2135.376261]  migrate_pages+0x7aa/0x9a0
[ 2135.376269]  ? isolate_freepages_block+0x340/0x340
[ 2135.376275]  ? move_freelist_tail+0xd0/0xd0
[ 2135.376279]  compact_zone+0x656/0xa70
[ 2135.376284]  ? set_next_entity+0x9e/0x1a0
[ 2135.376288]  compact_zone_order+0xde/0x120
[ 2135.376293]  try_to_compact_pages+0x187/0x240
[ 2135.376300]  __alloc_pages_direct_compact+0x87/0x170
[ 2135.376307]  __alloc_pages_slowpath+0x1f8/0xc10
[ 2135.376314]  ? get_page_from_freelist+0xe80/0x1330
[ 2135.376321]  __alloc_pages_nodemask+0x268/0x2b0
[ 2135.376327]  alloc_pages_vma+0xc1/0x160
[ 2135.376332]  do_huge_pmd_anonymous_page+0x271/0x610
[ 2135.376338]  __handle_mm_fault+0xbfc/0x12f0
[ 2135.376344]  handle_mm_fault+0xa9/0x1d0
[ 2135.376351]  __do_page_fault+0x23a/0x480
[ 2135.376357]  do_page_fault+0x1a/0x64
[ 2135.376362]  page_fault+0x39/0x40
[ 2135.376367] RIP: 0033:0x6453194e9c10
[ 2135.376375] Code: Bad RIP value.
[ 2135.376379] RSP: 002b:00007ffdcdfbbbc0 EFLAGS: 00010206
[ 2135.376383] RAX: 00000006c5c79000 RBX: 00007a6cfe787010 RCX: 00007a73db2596fb
[ 2135.376386] RDX: 0000000000000001 RSI: 00000006dc9ad000 RDI: 00007a6cfe787000
[ 2135.376389] RBP: 00006453194eaa54 R08: 00007a6cfe787010 R09: 0000000000000000
[ 2135.376392] R10: 0000000000000022 R11: 00000006dc9ac000 R12: ffffffffffffffff
[ 2135.376395] R13: 0000000000000002 R14: 0000000000001000 R15: 00000006dc9ac000
[ 2172.239905] INFO: task kworker/10:0H:63 blocked for more than 110 seconds.
[ 2172.239912]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2172.239914] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2172.239918] kworker/10:0H   D    0    63      2 0x80004000
[ 2172.239932] Workqueue: events_highpri intel_atomic_cleanup_work
[ 2172.239935] Call Trace:
[ 2172.239947]  ? __schedule+0x2c6/0x580
[ 2172.239953]  schedule+0x34/0xa0
[ 2172.239959]  schedule_preempt_disabled+0x5/0x10
[ 2172.239965]  __mutex_lock.isra.0+0x25f/0x4b0
[ 2172.239973]  ? pick_next_task_fair+0x1eb/0x700
[ 2172.239980]  intel_cleanup_plane_fb+0x28/0x70
[ 2172.239986]  drm_atomic_helper_cleanup_planes+0x4a/0x60
[ 2172.239992]  intel_atomic_cleanup_work+0x17/0x80
[ 2172.239997]  process_one_work+0x16b/0x2a0
[ 2172.240003]  worker_thread+0x48/0x390
[ 2172.240010]  kthread+0xee/0x130
[ 2172.240014]  ? process_one_work+0x2a0/0x2a0
[ 2172.240020]  ? kthread_park+0x70/0x70
[ 2172.240024]  ret_from_fork+0x1f/0x30
[ 2172.240054] INFO: task kworker/u24:1:5567 blocked for more than 110 seconds.
[ 2172.240057]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2172.240059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2172.240062] kworker/u24:1   D    0  5567      2 0x80004000
[ 2172.240071] Workqueue: i915 __i915_gem_free_work
[ 2172.240073] Call Trace:
[ 2172.240080]  ? __schedule+0x2c6/0x580
[ 2172.240085]  schedule+0x34/0xa0
[ 2172.240091]  schedule_preempt_disabled+0x5/0x10
[ 2172.240097]  __mutex_lock.isra.0+0x25f/0x4b0
[ 2172.240102]  ? __pm_runtime_resume+0x53/0x70
[ 2172.240107]  __i915_gem_free_objects+0x71/0x1d0
[ 2172.240112]  __i915_gem_free_work+0x5f/0x90
[ 2172.240117]  process_one_work+0x16b/0x2a0
[ 2172.240122]  worker_thread+0x48/0x390
[ 2172.240128]  kthread+0xee/0x130
[ 2172.240132]  ? process_one_work+0x2a0/0x2a0
[ 2172.240137]  ? kthread_park+0x70/0x70
[ 2172.240141]  ret_from_fork+0x1f/0x30
[ 2172.240148] INFO: task stress:6620 blocked for more than 110 seconds.
[ 2172.240151]       Tainted: G     U            5.3.0-rc8-gf74c2bb98776 #62
[ 2172.240153] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2172.240155] stress          D    0  6620   6618 0x00000004
[ 2172.240159] Call Trace:
[ 2172.240165]  ? __schedule+0x2c6/0x580
[ 2172.240170]  schedule+0x34/0xa0
[ 2172.240176]  io_schedule+0xd/0x40
[ 2172.240182]  __lock_page+0x13a/0x230
[ 2172.240188]  ? file_fdatawait_range+0x20/0x20
[ 2172.240193]  set_page_dirty_lock+0x44/0x50
[ 2172.240200]  i915_gem_userptr_put_pages+0x135/0x180
[ 2172.240206]  __i915_gem_object_put_pages+0x50/0x90
[ 2172.240212]  userptr_mn_invalidate_range_start+0x17f/0x210
[ 2172.240219]  __mmu_notifier_invalidate_range_start+0x4f/0x90
[ 2172.240224]  try_to_unmap_one+0x718/0x820
[ 2172.240229]  rmap_walk_file+0xe4/0x250
[ 2172.240234]  try_to_unmap+0xc1/0xf0
[ 2172.240239]  ? page_remove_rmap+0x2a0/0x2a0
[ 2172.240243]  ? page_not_mapped+0x10/0x10
[ 2172.240247]  ? page_get_anon_vma+0x70/0x70
[ 2172.240251]  migrate_pages+0x7aa/0x9a0
[ 2172.240259]  ? isolate_freepages_block+0x340/0x340
[ 2172.240265]  ? move_freelist_tail+0xd0/0xd0
[ 2172.240269]  compact_zone+0x656/0xa70
[ 2172.240274]  ? set_next_entity+0x9e/0x1a0
[ 2172.240278]  compact_zone_order+0xde/0x120
[ 2172.240283]  try_to_compact_pages+0x187/0x240
[ 2172.240290]  __alloc_pages_direct_compact+0x87/0x170
[ 2172.240297]  __alloc_pages_slowpath+0x1f8/0xc10
[ 2172.240304]  ? get_page_from_freelist+0xe80/0x1330
[ 2172.240311]  __alloc_pages_nodemask+0x268/0x2b0
[ 2172.240317]  alloc_pages_vma+0xc1/0x160
[ 2172.240322]  do_huge_pmd_anonymous_page+0x271/0x610
[ 2172.240328]  __handle_mm_fault+0xbfc/0x12f0
[ 2172.240334]  handle_mm_fault+0xa9/0x1d0
[ 2172.240341]  __do_page_fault+0x23a/0x480
[ 2172.240347]  do_page_fault+0x1a/0x64
[ 2172.240352]  page_fault+0x39/0x40
[ 2172.240357] RIP: 0033:0x6453194e9c10
[ 2172.240365] Code: Bad RIP value.
[ 2172.240369] RSP: 002b:00007ffdcdfbbbc0 EFLAGS: 00010206
[ 2172.240373] RAX: 00000006c5c79000 RBX: 00007a6cfe787010 RCX: 00007a73db2596fb
[ 2172.240376] RDX: 0000000000000001 RSI: 00000006dc9ad000 RDI: 00007a6cfe787000
[ 2172.240379] RBP: 00006453194eaa54 R08: 00007a6cfe787010 R09: 0000000000000000
[ 2172.240382] R10: 0000000000000022 R11: 00000006dc9ac000 R12: ffffffffffffffff
[ 2172.240384] R13: 0000000000000002 R14: 0000000000001000 R15: 00000006dc9ac000
[ 2177.737800] gpg-agent[1084]: handler 0x7c285955c700 for fd 10 started
[ 2177.781519] gpg-agent[1084]: handler 0x7c285955c700 for fd 10 terminated


```
Comment 12 Dq8CokMHloQZw 2019-09-10 16:26:41 UTC
Looks like this was mentioned before: https://lkml.org/lkml/2019/8/9/433
Thanks to Leho Kraav (:macmaN :lkraav) for the pointer, also for the ssh idea! Cheers!
Comment 13 Chris Wilson 2019-09-12 20:41:29 UTC
commit 505a8ec7e11ae5236c4a154a1e24ef49a8349600 (linus/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 12 13:56:34 2019 +0100

    Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()"

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.