Bug 104011 - [CI] igt@gem_ppgtt@blt-vs-render-ctxn - incomplete - gem_ppgtt: page allocation failure: order:0, mode:0x14204d2(GFP_HIGHUSER|__GFP_RETRY_MAYFAIL|__GFP_RECLAIMABLE), nodemask=(null)
Summary: [CI] igt@gem_ppgtt@blt-vs-render-ctxn - incomplete - gem_ppgtt: page allocati...
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-01 08:34 UTC by Marta Löfstedt
Modified: 2018-01-05 17:32 UTC (History)
1 user (show)

See Also:
i915 platform: BXT
i915 features: GEM/Other


Attachments

Description Marta Löfstedt 2017-12-01 08:34:39 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3419/shard-apl2/igt@gem_ppgtt@blt-vs-render-ctxn.html

This caught my eye in dmesg:
<7>[  192.797271] [IGT] gem_eio: starting subtest throttle
<4>[  192.817585] Setting dangerous option reset - tainting kernel
<4>[  193.388010] hpet1: lost 7161 rtc interrupts
...
<7>[  196.788603] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915], irq posted? yes, current seqno=1a6b3, last=1a6b4
...
<4>[  209.345145] hpet1: lost 7161 rtc interrupts
<7>[  209.851911] [drm:i915_reset_device [i915]] resetting chip
<5>[  209.852184] i915 0000:00:02.0: Resetting chip after gpu hang
...
<7>[  210.168410] [IGT] gem_ppgtt: executing
<4>[  210.281048] hpet1: lost 7161 rtc interrupts
<7>[  210.296595] [IGT] gem_ppgtt: starting subtest blt-vs-render-ctxN
<4>[  211.179533] hpet1: lost 7161 rtc interrupts
...
<4>[  349.723277] hpet1: lost 7161 rtc interrupts
<4>[  350.192490] gem_ppgtt: page allocation failure: order:0, mode:0x14204d2(GFP_HIGHUSER|__GFP_RETRY_MAYFAIL|__GFP_RECLAIMABLE), nodemask=(null)
<4>[  350.192608] CPU: 0 PID: 2707 Comm: gem_ppgtt Tainted: G     U           4.15.0-rc1-CI-CI_DRM_3419+ #1
<4>[  350.192612] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[  350.192616] Call Trace:
<4>[  350.192627]  dump_stack+0x5f/0x86
<4>[  350.192635]  warn_alloc+0xad/0x130
<4>[  350.192647]  ? __mutex_unlock_slowpath+0x38/0x270
<4>[  350.192656]  __alloc_pages_nodemask+0xeb5/0x11d0
<4>[  350.192666]  ? mark_held_locks+0x66/0x90
<4>[  350.192686]  shmem_alloc_and_acct_page+0x7b/0x3a0
<4>[  350.192697]  shmem_getpage_gfp.isra.9+0x163/0xc90
<4>[  350.192715]  shmem_read_mapping_page_gfp+0x2e/0x50
<4>[  350.192775]  i915_gem_object_get_pages_gtt+0x13b/0x660 [i915]
<4>[  350.192826]  ? i915_gem_obj_prepare_shmem_write+0x201/0x220 [i915]
<4>[  350.192870]  ____i915_gem_object_get_pages+0x17/0x70 [i915]
<4>[  350.192914]  __i915_gem_object_get_pages+0x59/0x80 [i915]
<4>[  350.192959]  i915_gem_set_domain_ioctl+0x377/0x3e0 [i915]
<4>[  350.193005]  ? i915_gem_obj_prepare_shmem_write+0x220/0x220 [i915]
<4>[  350.193012]  drm_ioctl_kernel+0x60/0xa0
<4>[  350.193020]  drm_ioctl+0x290/0x330
<4>[  350.193066]  ? i915_gem_obj_prepare_shmem_write+0x220/0x220 [i915]
<4>[  350.193076]  ? remove_vma+0x56/0x60
<4>[  350.193082]  ? remove_vma+0x56/0x60
<4>[  350.193087]  ? remove_vma+0x56/0x60
<4>[  350.193092]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[  350.193101]  do_vfs_ioctl+0x8a/0x680
<4>[  350.193107]  ? entry_SYSCALL_64_fastpath+0x5/0x89
<4>[  350.193113]  ? trace_hardirqs_on_caller+0xde/0x1c0
<4>[  350.193120]  SyS_ioctl+0x36/0x70
<4>[  350.193128]  entry_SYSCALL_64_fastpath+0x1c/0x89
<4>[  350.193133] RIP: 0033:0x7f5cfdd49587
<4>[  350.193137] RSP: 002b:00007ffd39791888 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4>[  350.193144] RAX: ffffffffffffffda RBX: 0000559c607423c0 RCX: 00007f5cfdd49587
<4>[  350.193147] RDX: 00007ffd397918c0 RSI: 00000000400c645f RDI: 0000000000000003
<4>[  350.193151] RBP: 0000559c60739720 R08: 0000559c60739828 R09: 00007f5cfc22e000
<4>[  350.193154] R10: 0000000000000000 R11: 0000000000000246 R12: 0000559c60739720
<4>[  350.193158] R13: 0000559c60739828 R14: 0000000000000d29 R15: 0000559c5f784a30
<4>[  350.193318] Mem-Info:
<4>[  350.193330] active_anon:1688098 inactive_anon:211034 isolated_anon:32
                   active_file:139 inactive_file:150 isolated_file:0
                   unevictable:0 dirty:9 writeback:11 unstable:0
                   slab_reclaimable:22465 slab_unreclaimable:37058
                   mapped:4171 shmem:1822976 pagetables:1485 bounce:0
                   free:26698 free_pcp:1026 free_cma:0
<4>[  350.193336] Node 0 active_anon:6752392kB inactive_anon:844136kB active_file:556kB inactive_file:600kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:16684kB dirty:36kB writeback:44kB shmem:7291904kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 88064kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
<4>[  350.193343] DMA free:15896kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
<4>[  350.193346] lowmem_reserve[]: 0 1772 7774 7774
<4>[  350.193368] DMA32 free:39320kB min:15376kB low:19220kB high:23064kB active_anon:1449076kB inactive_anon:300352kB active_file:216kB inactive_file:408kB unevictable:0kB writepending:100kB present:1926488kB managed:1817944kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:2096kB local_pcp:428kB free_cma:0kB
<4>[  350.193371] lowmem_reserve[]: 0 0 6001 6001
<4>[  350.193393] Normal free:51576kB min:52068kB low:65084kB high:78100kB active_anon:5303204kB inactive_anon:543384kB active_file:96kB inactive_file:940kB unevictable:0kB writepending:40kB present:6291456kB managed:6146000kB mlocked:0kB kernel_stack:3296kB pagetables:5936kB bounce:0kB free_pcp:2008kB local_pcp:316kB free_cma:0kB
<4>[  350.193396] lowmem_reserve[]: 0 0 0 0
<4>[  350.193415] DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
<4>[  350.193486] DMA32: 75*4kB (ME) 161*8kB (UME) 63*16kB (UME) 34*32kB (ME) 10*64kB (M) 23*128kB (UME) 52*256kB (ME) 12*512kB (UME) 2*1024kB (E) 5*2048kB (M) 0*4096kB = 39012kB
<4>[  350.193561] Normal: 2268*4kB (UME) 1355*8kB (UME) 924*16kB (UME) 448*32kB (UME) 40*64kB (ME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 51592kB
<6>[  350.193620] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
<4>[  350.193624] 1830981 total pagecache pages
<4>[  350.193632] 7618 pages in swap cache
<4>[  350.193636] Swap cache stats: add 980158, delete 972501, find 6/14
<4>[  350.193659] Free swap  = 0kB
<4>[  350.193662] Total swap = 2097148kB
<4>[  350.193665] 2058482 pages RAM
<4>[  350.193669] 0 pages HighMem/MovableOnly
<4>[  350.193672] 63522 pages reserved
<4>[  350.625960] hpet1: lost 7161 rtc interrupts
<4>[  351.529516] hpet1: lost 7161 rtc interrupts
<4>[  351.563934] gem_ppgtt invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=1000
<4>[  351.564070] CPU: 3 PID: 2707 Comm: gem_ppgtt Tainted: G     U           4.15.0-rc1-CI-CI_DRM_3419+ #1
<4>[  351.564074] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[  351.564077] Call Trace:
<4>[  351.564090]  dump_stack+0x5f/0x86
<4>[  351.564098]  dump_header+0x60/0x424
<4>[  351.564106]  ? trace_hardirqs_on_caller+0xde/0x1c0
<4>[  351.564113]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4>[  351.564123]  oom_kill_process+0x2b3/0x650
<4>[  351.564134]  out_of_memory+0xf4/0x350
<4>[  351.564142]  __alloc_pages_nodemask+0x108f/0x11d0
<4>[  351.564171]  filemap_fault+0x456/0x600
<4>[  351.564185]  ext4_filemap_fault+0x2a/0x40
<4>[  351.564192]  __do_fault+0x15/0x70
<4>[  351.564198]  __handle_mm_fault+0x9ab/0xda0
<4>[  351.564217]  handle_mm_fault+0x14f/0x2f0
<4>[  351.564226]  __do_page_fault+0x2d1/0x560
<4>[  351.564239]  page_fault+0x22/0x30
<4>[  351.564245] RIP: 0033:0x7f5cfdccd610
<4>[  351.564249] RSP: 002b:00007ffd39791898 EFLAGS: 00010246
<4>[  351.564256] RAX: 0000000000000181 RBX: 0000559c607423c0 RCX: 00007f5cfdd49587
<4>[  351.564260] RDX: 0000000000000000 RSI: 0000559c607423b0 RDI: 00007f5cfe00cb00
<4>[  351.564264] RBP: 0000559c60741200 R08: 0000559c607395e0 R09: 00007ffd3979d080
<4>[  351.564268] R10: 000000000009588e R11: 0000000000000246 R12: 0000559c60739720
<4>[  351.564272] R13: 0000559c60739828 R14: 0000000000000d4a R15: 0000559c5f784a30
<4>[  351.564292] Mem-Info:
<4>[  351.564301] active_anon:1687917 inactive_anon:211004 isolated_anon:96
                   active_file:72 inactive_file:0 isolated_file:0
                   unevictable:0 dirty:2 writeback:11 unstable:0
                   slab_reclaimable:22451 slab_unreclaimable:37227
                   mapped:2735 shmem:1830444 pagetables:1485 bounce:0
                   free:26606 free_pcp:1126 free_cma:0
<4>[  351.564308] Node 0 active_anon:6751668kB inactive_anon:844016kB active_file:288kB inactive_file:0kB unevictable:0kB isolated(anon):384kB isolated(file):0kB mapped:10940kB dirty:8kB writeback:44kB shmem:7321776kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 10240kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
<4>[  351.564314] DMA free:15896kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
<4>[  351.564318] lowmem_reserve[]: 0 1772 7774 7774
<4>[  351.586427] DMA32 free:39820kB min:15376kB low:19220kB high:23064kB active_anon:1741616kB inactive_anon:5912kB active_file:44kB inactive_file:372kB unevictable:0kB writepending:88kB present:1926488kB managed:1817944kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:2544kB local_pcp:720kB free_cma:0kB
<4>[  351.586430] lowmem_reserve[]: 0 0 6001 6001
<4>[  351.586455] Normal free:52764kB min:52068kB low:65084kB high:78100kB active_anon:5006776kB inactive_anon:838476kB active_file:140kB inactive_file:0kB unevictable:0kB writepending:24kB present:6291456kB managed:6146000kB mlocked:0kB kernel_stack:3296kB pagetables:5940kB bounce:0kB free_pcp:2376kB local_pcp:644kB free_cma:0kB
<4>[  351.586458] lowmem_reserve[]: 0 0 0 0
<4>[  351.586479] DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
<4>[  351.586671] DMA32: 76*4kB (UME) 112*8kB (ME) 66*16kB (ME) 51*32kB (UME) 19*64kB (UME) 11*128kB (UME) 51*256kB (ME) 16*512kB (ME) 2*1024kB (E) 5*2048kB (M) 0*4096kB = 40048kB
<4>[  351.586754] Normal: 2499*4kB (UME) 1359*8kB (UME) 920*16kB (UME) 433*32kB (UME) 49*64kB (UME) 1*128kB (E) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 52708kB
<6>[  351.586824] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
<4>[  351.586828] 1830078 total pagecache pages
<4>[  351.586837] 0 pages in swap cache
<4>[  351.586842] Swap cache stats: add 980178, delete 980138, find 10/22
<4>[  351.586846] Free swap  = 0kB
<4>[  351.586850] Total swap = 2097148kB
<4>[  351.586854] 2058482 pages RAM
<4>[  351.586858] 0 pages HighMem/MovableOnly
<4>[  351.586861] 63522 pages reserved
<6>[  351.586865] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
<6>[  351.587119] [  193]     0   193    11556      316   122880        0             0 systemd-journal
<6>[  351.587421] [  211]     0   211    11266      392   114688        0         -1000 systemd-udevd
<6>[  351.587432] [  439]     0   439    44195      203   192512        0             0 thermald
<6>[  351.587440] [  443]   104   443    64105      231   139264        0             0 rsyslogd
<6>[  351.587449] [  451]     0   451   105640      883   335872        0             0 ModemManager
<6>[  351.587457] [  474]     0   474    11661      146   131072        0             0 systemd-logind
<6>[  351.587465] [  496]     0   496     8086       72   110592        0             0 cron
<6>[  351.587473] [  507]     0   507     1098       22    53248        0             0 acpid
<6>[  351.587481] [  512]   106   512    11478      257   135168        0          -900 dbus-daemon
<6>[  351.587488] [  529]   118   529    11830      107   139264        0             0 avahi-daemon
<6>[  351.587496] [  533]     0   533   255151      693   274432        0             0 repowerd
<6>[  351.587505] [  534]   118   534    11797       90   126976        0             0 avahi-daemon
<6>[  351.587513] [  546]     0   546    99822      746   348160        0             0 NetworkManager
<6>[  351.587521] [  604]     0   604     4906       63    77824        0             0 irqbalance
<6>[  351.587529] [  643]     0   643    88880      342   249856        0             0 upowerd
<6>[  351.587537] [  649]     0   649    76011      240   245760        0             0 polkitd
<6>[  351.587545] [  756]   102   756    12445      135   143360        0             0 systemd-resolve
<6>[  351.587553] [  757]     0   757    17504      182   180224        0         -1000 sshd
<6>[  351.587560] [  856]     0   856     4305       35    73728        0             0 agetty
<6>[  351.587568] [  936]     0   936    25378      245   237568        0             0 sshd
<6>[  351.587576] [  942]  1000   942    16248      233   167936        0             0 systemd
<6>[  351.587584] [  945]  1000   945    22381      562   208896        0             0 (sd-pam)
<6>[  351.587592] [  958]  1000   958    25412      279   233472        0             0 sshd
<6>[  351.587600] [  986]  1000   986  1295991    28692   696320        0             0 java
<6>[  351.587607] [ 1041]  1000  1041     1125       27    57344        0             0 sh
<6>[  351.587615] [ 1114]  1000  1114     3760       37    77824        0             0 dmesg
<6>[  351.587623] [ 1116]     0  1116    15572      116   172032        0             0 sudo
<6>[  351.587631] [ 1120]     0  1120     1090       18    57344        0             0 owatch
<6>[  351.587638] [ 1121]     0  1121   263175    32295   561152        0             0 python3
<6>[  351.587651] [ 1599]     0  1599     4030      214    73728        0             0 dhclient
<6>[  351.587659] [ 2706]     0  2706    14115     1857   147456        0          1000 gem_ppgtt
<6>[  351.587667] [ 2707]     0  2707    13859      104   126976        0          1000 gem_ppgtt
<6>[  351.587675] [ 2709]     0  2709    14115      105   126976        0          1000 gem_ppgtt
<6>[  351.587683] [ 2710]     0  2710    14115      235   126976        0          1000 gem_ppgtt
<6>[  351.587690] [ 2711]     0  2711    14115      105   126976        0          1000 gem_ppgtt
<6>[  351.587698] [ 2712]     0  2712    14115      301   126976        0          1000 gem_ppgtt
<6>[  351.587706] [ 2713]     0  2713    14115      301   126976        0          1000 gem_ppgtt
<6>[  351.587714] [ 2714]     0  2714    14115      105   126976        0          1000 gem_ppgtt
<6>[  351.587722] [ 2715]     0  2715    14627      302   131072        0          1000 gem_ppgtt
<3>[  351.587727] Out of memory: Kill process 2706 (gem_ppgtt) score 1000 or sacrifice child
<3>[  351.588765] Killed process 2715 (gem_ppgtt) total-vm:58508kB, anon-rss:416kB, file-rss:4kB, shmem-rss:788kB
<6>[  351.616709] oom_reaper: reaped process 2715 (gem_ppgtt), now anon-rss:0kB, file-rss:0kB, shmem-rss:788kB
<4>[  352.431360] hpet1: lost 7161 rtc interrupts
..
the just "stray" in rest of dmesg.
Comment 1 Mika Kuoppala 2017-12-01 08:55:45 UTC
From dmesg, the problems start already on igt trying to init before gem_eio/throttle. multiring resets (dead gpu?) already before throttle starts
and also then during throttle.
Comment 2 Chris Wilson 2017-12-01 09:36:30 UTC
(In reply to Mika Kuoppala from comment #1)
> From dmesg, the problems start already on igt trying to init before
> gem_eio/throttle. multiring resets (dead gpu?) already before throttle starts
> and also then during throttle.

That's gem_eio, it's purpose is to wedge the gpu and check abi for correct behaviour. Before gem_ppgtt started, it checked that the gpu was no longer wedged, so everything appears in order.

What is not so happy is the

<4>[  193.388010] hpet1: lost 7161 rtc interrupts

A dead timer. That would explain why we never were able to retire requests and so leaked everything, and everything remained marked as busy.
Comment 3 Chris Wilson 2017-12-01 20:49:32 UTC
(In reply to Chris Wilson from comment #2)
> What is not so happy is the
> 
> <4>[  193.388010] hpet1: lost 7161 rtc interrupts
> 
> A dead timer. That would explain why we never were able to retire requests
> and so leaked everything, and everything remained marked as busy.

No, it doesn't. If we thought everything was still active, we would not have filled swap (only inactive bo can be swapped). Which brings us back to a bo leak, but whose?
Comment 4 Marta Löfstedt 2018-01-05 07:58:59 UTC
Last seen:  CI_DRM_3419: 2017-11-30 / 241 runs ago


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.