Bug 89859 - [SNB/BSW]igt/gem_evict_everything/mlocked-hang causes oom killer
Summary: [SNB/BSW]igt/gem_evict_everything/mlocked-hang causes oom killer
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-01 06:45 UTC by lu hua
Modified: 2017-02-24 06:55 UTC (History)
CC List: 2 users

See Also:
i915 platform: BSW/CHT, SNB
i915 features: GEM/Other


Attachments
dmesg (124.77 KB, text/plain)
2015-04-01 06:45 UTC, lu hua

Description lu hua 2015-04-01 06:45:37 UTC
Created attachment 114797
dmesg

==System Environment==
--------------------------
Regression: not sure, new case

Non-working platforms: BSW

==kernel==
--------------------------
drm-intel-nightly/d72ff1ab1499711c941831400629c14493313b3a
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Mar 31 17:30:12 2015 +0200

    drm-intel-nightly: 2015y-03m-31d-15h-29m-33s UTC integration manifest

==Bug detailed description==
-----------------------------
Running igt/gem_evict_everything/mlocked-hang triggers the OOM killer on BSW with the latest drm-intel-nightly kernel.

output:
IGT-Version: 1.10-g2f0e3cd (x86_64) (Linux: 4.0.0-rc6_drm-intel-nightly_d72ff1_20150401+ x86_64)
child 0 died with signal 9, Killed
Subtest mlocked-hang failed.
**** DEBUG ****
Checking 1536 surfaces of size 1048576 bytes (total 1611399168) against RAM
Test requirement passed: !(total <= required)
Test requirement passed: !igt_run_in_simulation()
Test requirement passed: pin > sz
Pinning [1439, 2975] MiB
Test requirement passed: locked
****  END  ****
Subtest mlocked-hang: FAIL (478.382s)

real    7m58.787s
user    0m0.609s
sys     2m7.256s

==Reproduce steps==
---------------------------- 
1. time ./gem_evict_everything --run-subtest mlocked-hang
Comment 1 lu hua 2015-04-01 06:49:13 UTC
dmesg:
[  567.997476] Unable to purge GPU memory due lock contention.
[  568.064521] gmain invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[  568.064526] gmain cpuset=/ mems_allowed=0
[  568.064537] CPU: 3 PID: 3465 Comm: gmain Not tainted 4.0.0-rc6_drm-intel-nightly_d72ff1_20150401+ #136
[  568.064542]  0000000000000000 00000000000201da ffffffff817950c6 ffff880174860830
[  568.064564]  ffffffff817922b1 ffff880179c99060 0000000000000000 ffff880179c99298
[  568.064570]  ffff880179c99060 ffff88017fffb800 ffff880179c99060 ffff88017a0bd901
[  568.064576] Call Trace:
[  568.064589]  [<ffffffff817950c6>] ? dump_stack+0x40/0x50
[  568.064596]  [<ffffffff817922b1>] ? dump_header.isra.11+0x6b/0x196
[  568.064603]  [<ffffffff810d376c>] ? oom_kill_process+0xb5/0x374
[  568.064611]  [<ffffffff81041abe>] ? has_ns_capability_noaudit+0xd/0x14
[  568.064616]  [<ffffffff810d3eb5>] ? __out_of_memory+0x43d/0x45f
[  568.064622]  [<ffffffff810d4002>] ? out_of_memory+0x52/0x67
[  568.064629]  [<ffffffff810d7cc9>] ? __alloc_pages_nodemask+0x66e/0x6fc
[  568.064636]  [<ffffffff81104696>] ? alloc_pages_current+0xad/0xca
[  568.064641]  [<ffffffff810d2917>] ? filemap_fault+0x21f/0x37e
[  568.064648]  [<ffffffff810ee319>] ? do_set_pte+0x8b/0x98
[  568.064653]  [<ffffffff810ec61a>] ? __do_fault+0x3d/0x80
[  568.064659]  [<ffffffff810efa97>] ? handle_mm_fault+0x358/0xc4d
[  568.064667]  [<ffffffff810328f7>] ? __do_page_fault+0x21c/0x3d8
[  568.064673]  [<ffffffff8179c3e2>] ? page_fault+0x22/0x30
[  568.064677] Mem-Info:
[  568.064680] Node 0 DMA per-cpu:
[  568.064684] CPU    0: hi:    0, btch:   1 usd:   0
[  568.064688] CPU    1: hi:    0, btch:   1 usd:   0
[  568.064691] CPU    2: hi:    0, btch:   1 usd:   0
[  568.064695] CPU    3: hi:    0, btch:   1 usd:   0
[  568.064697] Node 0 DMA32 per-cpu:
[  568.064701] CPU    0: hi:  186, btch:  31 usd:   0
[  568.064705] CPU    1: hi:  186, btch:  31 usd:   0
[  568.064708] CPU    2: hi:  186, btch:  31 usd:   0
[  568.064711] CPU    3: hi:  186, btch:  31 usd:   0
[  568.064714] Node 0 Normal per-cpu:
[  568.064718] CPU    0: hi:  186, btch:  31 usd:   0
[  568.064721] CPU    1: hi:  186, btch:  31 usd:   0
[  568.064724] CPU    2: hi:  186, btch:  31 usd:   0
[  568.064728] CPU    3: hi:  186, btch:  31 usd:   0
Comment 2 ye.tian 2015-04-02 04:08:32 UTC
This issue also occurs with the nightly kernel on SNB. It does not occur with the fixes kernel.
There is no "Call Trace" in dmesg on the fixes kernel, but "GPU HANG" appears on both kernels.

dmesg:
-----------------
[  968.835330] [drm] stuck on render ring
[  968.835838] [drm] GPU HANG: ecode 6:0:0xe77fffff, in gem_evict_every [4614], reason:
[  968.855694] drm/i915: Resetting chip after gpu hang
[  972.607154] gem_evict_everything: exiting, ret=0

output (nightly kernel):
----------------
./gem_evict_everything --run-subtest mlocked-hang
IGT-Version: 1.10-g992f9f6 (x86_64) (Linux: 4.0.0-rc6_drm-intel-nightly_a8ae89_20150402+ x86_64)
child 0 died with signal 9, Killed
Subtest mlocked-hang failed.
**** DEBUG ****
Checking 1536 surfaces of size 1048576 bytes (total 1611399168) against RAM
Test requirement passed: !(total <= required)
Test requirement passed: !igt_run_in_simulation()
Test requirement passed: pin > sz
Pinning [1431, 2967] MiB
Test requirement passed: locked
****  END  ****
Subtest mlocked-hang: FAIL (54.330s)

output (fixes kernel):
---------------------
./gem_evict_everything --run-subtest mlocked-hang
IGT-Version: 1.10-g992f9f6 (x86_64) (Linux: 4.0.0-rc6_drm-intel-fixes_ee73c6_20150331+ x86_64)
Subtest mlocked-hang: SUCCESS (825.940s)
Comment 3 ye.tian 2015-04-22 09:18:46 UTC
Tested with the latest nightly kernel on SNB: this case times out and dmesg shows a WARNING.

output:
-----------------
root@x-sgb3:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# time ./gem_evict_everything --run-subtest mlocked-hang
IGT-Version: 1.10-gbeddb3b (x86_64) (Linux: 4.0.0_drm-intel-nightly_b9fe35_20150421+ x86_64)
Subtest mlocked-hang: SUCCESS (723.107s)

real    12m7.982s
user    0m4.546s
sys     3m52.365s


dmesg info:
-----------------------

[  757.698571] ------------[ cut here ]------------
[  757.698602] WARNING: CPU: 0 PID: 4588 at drivers/gpu/drm/i915/i915_gem_execbuffer.c:1256 i915_gem_ringbuffer_submission+0x2a9/0x86f [i915]()
[  757.698604] blitter ring didn't clear reload
[  757.698605] Modules linked in: ipv6 dm_mod snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support ppdev pcspkr serio_raw i2c_i801 firewire_ohci firewire_core crc_itu_t snd_hda_intel snd_hda_controller snd_hda_codec lpc_ich mfd_core snd_hda_core snd_hwdep snd_pcm joydev snd_timer snd soundcore parport_pc parport tpm_tis tpm acpi_cpufreq i915 button video drm_kms_helper drm
[  757.698624] CPU: 0 PID: 4588 Comm: gem_evict_every Tainted: G        W       4.0.0_drm-intel-nightly_b9fe35_20150421+ #368
[  757.698626] Hardware name:                  /DQ67SW, BIOS SWQ6710H.86A.0060.2011.1220.1805 12/20/2011
[  757.698627]  0000000000000000 0000000000000009 ffffffff81795847 ffff8802344ffbb8
[  757.698630]  ffffffff8103bd5a 0000000000000010 ffffffffa0096618 ffff880233fd9a40
[  757.698632]  ffff880002d94108 0000000000000000 ffff8802344ffe18 ffff880002d90000
[  757.698634] Call Trace:
[  757.698640]  [<ffffffff81795847>] ? dump_stack+0x40/0x50
[  757.698644]  [<ffffffff8103bd5a>] ? warn_slowpath_common+0x98/0xb0
[  757.698654]  [<ffffffffa0096618>] ? i915_gem_ringbuffer_submission+0x2a9/0x86f [i915]
[  757.698657]  [<ffffffff8103bdb7>] ? warn_slowpath_fmt+0x45/0x4a
[  757.698667]  [<ffffffffa0096618>] ? i915_gem_ringbuffer_submission+0x2a9/0x86f [i915]
[  757.698676]  [<ffffffffa0095ff0>] ? i915_gem_do_execbuffer.isra.13+0xca6/0xd88 [i915]
[  757.698689]  [<ffffffffa00a138a>] ? i915_gem_pwrite_ioctl+0x75a/0x7e0 [i915]
[  757.698693]  [<ffffffff8110948a>] ? __kmalloc+0x65/0x13d
[  757.698703]  [<ffffffffa0097085>] ? i915_gem_execbuffer2+0x16e/0x205 [i915]
[  757.698709]  [<ffffffffa00047ae>] ? drm_ioctl+0x322/0x38d [drm]
[  757.698719]  [<ffffffffa0096f17>] ? i915_gem_execbuffer+0x339/0x339 [i915]
[  757.698723]  [<ffffffff8111daa6>] ? do_vfs_ioctl+0x360/0x424
[  757.698726]  [<ffffffff810f1870>] ? __mm_populate+0xf6/0x107
[  757.698728]  [<ffffffff8111dbb3>] ? SyS_ioctl+0x49/0x7a
[  757.698731]  [<ffffffff8179b0f2>] ? system_call_fastpath+0x12/0x17
[  757.698733] ---[ end trace a5c8f727317bda67 ]---
Comment 4 ye.tian 2015-04-22 09:21:08 UTC
Tested on BSW with the latest nightly kernel; the issue still exists.

output:
----------------
root@x-bsw08:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# time ./gem_evict_everything --run-subtest mlocked-hang
IGT-Version: 1.10-g36ecc31 (x86_64) (Linux: 4.0.0_drm-intel-nightly_b9fe35_20150421+ x86_64)


^C^C^C^C^C^C


dmesg info:
-------------------

[ 8769.772750] ------------[ cut here ]------------
[ 8769.772841] kernel BUG at drivers/gpu/drm/i915/i915_drv.h:2737!
[ 8769.772934] invalid opcode: 0000 [#1] SMP
[ 8769.773006] Modules linked in: ipv6 dm_mod snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_hda_codec_generic serio_raw pcspkr snd_hda_intel snd_hda_controller i2c_i801 snd_hda_codec lpc_ich snd_hda_core snd_hwdep mfd_core snd_pcm snd_timer snd soundcore battery ac acpi_cpufreq i915 button video drm_kms_helper drm
[ 8769.773566] CPU: 2 PID: 627 Comm: gem_evict_every Not tainted 4.0.0_drm-intel-nightly_b9fe35_20150421+ #368
[ 8769.773715] task: ffff880179a04180 ti: ffff8801769a0000 task.ti: ffff8801769a0000
[ 8769.773830] RIP: 0010:[<ffffffffa009d520>]  [<ffffffffa009d520>] i915_gem_retire_requ                                                                                ests_ring+0xb2/0x16b [i915]
[ 8769.774021] RSP: 0018:ffff8801769a3c08  EFLAGS: 00010246
[ 8769.774103] RAX: ffff880002b8b5e8 RBX: ffff880002d54108 RCX: ffff880002b8b5e8
[ 8769.774212] RDX: ffff880002d54280 RSI: 0000000000000001 RDI: ffff880179037dc0
[ 8769.774321] RBP: 0000000000000000 R08: ffff880179037e80 R09: 0000000000000000
[ 8769.774429] R10: ffff880178832f30 R11: 00000000fffffffa R12: ffff880002d54280
[ 8769.774538] R13: ffff880002b8b500 R14: ffff880179edab40 R15: 0000000000000003
[ 8769.774648] FS:  00007f35bb26f740(0000) GS:ffff88017f500000(0000) knlGS:0000000000000000
[ 8769.774771] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8769.774859] CR2: 00007f3562e78008 CR3: 0000000178cf8000 CR4: 00000000001006e0
[ 8769.774967] Stack:
[ 8769.774999]  ffff880074557000 0000000000000008 ffff8801769a3d3e ffff880179ccd900
[ 8769.775125]  ffff880002af08c0 ffffffffa009508c 0000056900000202 ffff880178832ef8
[ 8769.775250]  ffff8801769a3c48 ffff8801769a3c48 0000000000000000 ffff880074557000
[ 8769.775376] Call Trace:
[ 8769.775447]  [<ffffffffa009508c>] ? i915_gem_execbuffer_reserve+0x25/0x2e3 [i915]
[ 8769.775592]  [<ffffffffa0095916>] ? i915_gem_do_execbuffer.isra.13+0x5cc/0xd88 [i915]
[ 8769.775719]  [<ffffffff81104767>] ? alloc_pages_current+0xad/0xca
[ 8769.775818]  [<ffffffff810e9ea6>] ? kmalloc_order+0x10/0x3d
[ 8769.775907]  [<ffffffff810e9eef>] ? kmalloc_order_trace+0x1c/0x7e
[ 8769.776028]  [<ffffffffa0097085>] ? i915_gem_execbuffer2+0x16e/0x205 [i915]
[ 8769.776147]  [<ffffffffa00047ae>] ? drm_ioctl+0x322/0x38d [drm]
[ 8769.776265]  [<ffffffffa0096f17>] ? i915_gem_execbuffer+0x339/0x339 [i915]
[ 8769.776375]  [<ffffffff8111daa6>] ? do_vfs_ioctl+0x360/0x424
[ 8769.776465]  [<ffffffff810f1870>] ? __mm_populate+0xf6/0x107
[ 8769.776556]  [<ffffffff8111dbb3>] ? SyS_ioctl+0x49/0x7a
[ 8769.776642]  [<ffffffff8179b0f2>] ? system_call_fastpath+0x12/0x17
[ 8769.776737] Code: 00 00 3b 45 18 78 26 4c 89 ef e8 fb e0 ff ff 48 8b 83 78 01 00 00 4c 39 e0 74 12 48 8b 68 70 4c 8d a8 18 ff ff ff 48 85 ed 75 c5 <0f> 0b 48 8b ab 88 00 00 00 48 85 ed 0f 84 9e 00 00 00 48 8b 45
[ 8769.777252] RIP  [<ffffffffa009d520>] i915_gem_retire_requests_ring+0xb2/0x16b [i915]
[ 8769.777402]  RSP <ffff8801769a3c08>
[ 8770.567704] [drm] stuck on render ring
[ 8770.588402] [drm] GPU HANG: ecode 8:0:0xe75ffffe, in gem_evict_every [625], reason: Ring hung, action: reset
[ 8770.588572] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 8770.588716] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 8770.588852] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 8770.589000] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 8770.589137] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 8770.589289] [drm:i915_reset_and_wakeup] resetting chip
[ 8800.384289] traps: kernel_oops[632] general protection ip:7fa68ce12028 sp:7ffd2557b770 error:0 in libapt-pkg.so.4.12.0[7fa68cdb6000+144000]
Comment 5 Chris Wilson 2015-04-22 09:28:23 UTC
(In reply to ye.tian from comment #4)
> Tested on BSW with the latest nightly kernel; the issue still exists.
> 
> output:
> ----------------
> root@x-bsw08:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# time
> ./gem_evict_everything --run-subtest mlocked-hang
> IGT-Version: 1.10-g36ecc31 (x86_64) (Linux:
> 4.0.0_drm-intel-nightly_b9fe35_20150421+ x86_64)
> 
> 
> ^C^C^C^C^C^C
> 
> 
> dmesg info:
> -------------------
> 
> [ 8769.772750] ------------[ cut here ]------------
> [ 8769.772841] kernel BUG at drivers/gpu/drm/i915/i915_drv.h:2737!
> [ 8769.772934] invalid opcode: 0000 [#1] SMP

This is a different bug. It may be a dupe of the shrinker-vs-hangcheck, but the oops is slightly different. Can you please file this as a new bug?
Comment 6 ye.tian 2015-04-23 06:24:48 UTC
(In reply to Chris Wilson from comment #5)
> (In reply to ye.tian from comment #4)
> > Tested on BSW with the latest nightly kernel; the issue still exists.
> > 
> > output:
> > ----------------
> > root@x-bsw08:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# time
> > ./gem_evict_everything --run-subtest mlocked-hang
> > IGT-Version: 1.10-g36ecc31 (x86_64) (Linux:
> > 4.0.0_drm-intel-nightly_b9fe35_20150421+ x86_64)
> > 
> > 
> > ^C^C^C^C^C^C
> > 
> > 
> > dmesg info:
> > -------------------
> > 
> > [ 8769.772750] ------------[ cut here ]------------
> > [ 8769.772841] kernel BUG at drivers/gpu/drm/i915/i915_drv.h:2737!
> > [ 8769.772934] invalid opcode: 0000 [#1] SMP
> 
> This is a different bug. It may be a dupe of the shrinker-vs-hangcheck, but
> the oops is slightly different. Can you please file this as a new bug?
OK, I have filed a new bug:
Bug 90152 - [BSW] Igt/gem_evict_everything subcase mlocked-hang causes oom killer and kernel BUG at drivers/gpu/drm/i915/i915_drv.h:2737!
Comment 7 ye.tian 2015-04-23 06:45:45 UTC
This case times out on SNB.
I found the following bug:
 
Bug 90002 - [BYT/SNB]igt/gem_evict_everything@mlocked-hang takes more than 10 minutes
Comment 8 yann 2017-02-24 06:55:10 UTC
(In reply to ye.tian from comment #7)
> This case times out on SNB.
> I found the following bug:
>  
> Bug 90002 - [BYT/SNB]igt/gem_evict_everything@mlocked-hang takes more than
> 10 minutes

Closing this bug: the timeout still occurs, but the OOM killer no longer triggers.
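
For anyone triaging similar logs: the dmesg excerpts in this thread show four distinct signatures (oom-killer invocation, "kernel BUG at", "WARNING:", and "GPU HANG"), and comment 5 treats the "kernel BUG" oops as a separate bug from the oom-killer report. Below is a minimal Python sketch that tags dmesg lines by signature. The function name, category labels, and regular expressions are assumptions made for this illustration, derived only from the log lines quoted above; they are not part of IGT or the kernel.

```python
import re

# Patterns derived from the dmesg lines quoted in this report.
# Category names are labels chosen for this sketch, not kernel terminology.
SIGNATURES = [
    ("oom-killer", re.compile(r"invoked oom-killer")),
    ("kernel-bug", re.compile(r"kernel BUG at ([^\s!]+)")),
    ("warning", re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+)")),
    ("gpu-hang", re.compile(r"GPU HANG: ecode ([^,\s]+)")),
]


def classify_dmesg(text):
    """Return (category, detail) pairs for matching lines, in log order.

    detail is the captured source location or error code, or an empty
    string when the pattern has no capture group.
    """
    found = []
    for line in text.splitlines():
        for name, pattern in SIGNATURES:
            match = pattern.search(line)
            if match:
                detail = match.group(1) if match.groups() else ""
                found.append((name, detail))
                break  # at most one category per line
    return found
```

Feeding it the output of `dmesg` after a test run gives a quick view of which of these failure modes a given kernel hit, which may help decide whether a result belongs in this bug, bug 90152, or bug 90002.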

