Bug 88865

Summary: [BDW]igt/gem_ringfill/render causes bad io access
Product: DRI
Reporter: Mika Kuoppala <mika.kuoppala>
Component: DRM/Intel
Assignee: Mika Kuoppala <mika.kuoppala>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: eero.t.tamminen, intel-gfx-bugs
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: BDW
i915 features: GEM/Other
Attachments:
Description: dmesg log
Flags: none

Description Mika Kuoppala 2015-01-29 11:49:09 UTC
After a reboot, running gem_ringfill --r render and quickly stopping the test with CTRL-C causes a warning with the following trace:

[  760.651973] gem_ringfill: executing
[  760.655535] gem_ringfill: starting subtest render
[  761.042459] ------------[ cut here ]------------
[  761.042491] WARNING: CPU: 0 PID: 1489 at lib/iomap.c:43 bad_io_access+0x3d/0x40()
[  761.042523] Bad IO access at port 0x0 (outl(val,port))
[  761.042544] Modules linked in: snd_hda_codec_hdmi i915 x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_hwdep snd_seq_midi crct10dif_pclmul crc32_pclmul i2c_algo_bit ghash_clmulni_intel drm_kms_helper snd_seq_midi_event snd_rawmidi aesni_intel drm aes_x86_64 glue_helper lrw snd_seq gf128mul ablk_helper cryptd snd_timer snd_seq_device serio_raw mei_me snd mei soundcore lpc_ich video bnep rfcomm bluetooth acpi_pad parport_pc ppdev mac_hid lp parport nls_iso8859_1 hid_generic usbhid hid e1000e ptp ahci libahci pps_core sdhci_acpi sdhci
[  761.042837] CPU: 0 PID: 1489 Comm: gem_ringfill Not tainted 3.19.0-rc6+ #60
[  761.042866] Hardware name: Intel Corporation Broadwell Client platform/SawTooth Peak, BIOS BDW-E1R1.86C.0092.R00.1408311942 08/31/2014
[  761.042913]  ffffffff81aa1fae ffff8800ab5efa68 ffffffff8173e8b8 0000000000000000
[  761.042948]  ffff8800ab5efab8 ffff8800ab5efaa8 ffffffff8107027a 004a128200000002
[  761.042982]  ffff8800aad25600 0000000000101001 000000000162d000 ffff8800aa6d1a20
[  761.043017] Call Trace:
[  761.043031]  [<ffffffff8173e8b8>] dump_stack+0x45/0x57
[  761.043056]  [<ffffffff8107027a>] warn_slowpath_common+0x8a/0xc0
[  761.043083]  [<ffffffff810702f6>] warn_slowpath_fmt+0x46/0x50
[  761.043125]  [<ffffffffa044460e>] ? intel_logical_ring_begin+0x3e/0x250 [i915]
[  761.043158]  [<ffffffff81398cfd>] bad_io_access+0x3d/0x40
[  761.043182]  [<ffffffff81398ea3>] iowrite32+0x33/0x40
[  761.043216]  [<ffffffffa0444de8>] gen8_emit_flush_render+0x68/0x100 [i915]
[  761.043257]  [<ffffffffa0444085>] logical_ring_flush_all_caches+0x35/0x50 [i915]
[  761.043298]  [<ffffffffa0445273>] gen8_init_rcs_context+0x53/0x190 [i915]
[  761.043335]  [<ffffffffa0445ad7>] intel_lr_context_deferred_create+0x657/0x8e0 [i915]
[  761.043376]  [<ffffffffa0420de8>] i915_gem_do_execbuffer.isra.22+0xf68/0x1000 [i915]
[  761.043411]  [<ffffffff811c06f5>] ? __kmalloc+0x55/0x1b0
[  761.043442]  [<ffffffffa042209c>] ? i915_gem_execbuffer2+0x6c/0x2c0 [i915]
[  761.043479]  [<ffffffffa04220e1>] i915_gem_execbuffer2+0xb1/0x2c0 [i915]
[  761.043519]  [<ffffffffa022eab4>] drm_ioctl+0x1a4/0x630 [drm]
[  761.043545]  [<ffffffff81125cdc>] ? acct_account_cputime+0x1c/0x20
[  761.043573]  [<ffffffff811ee918>] do_vfs_ioctl+0x2f8/0x510
[  761.043597]  [<ffffffff8109fae4>] ? vtime_account_user+0x54/0x60
[  761.043623]  [<ffffffff811eebb1>] SyS_ioctl+0x81/0xa0
[  761.043646]  [<ffffffff81746974>] ? int_check_syscall_exit_work+0x34/0x3d
[  761.043675]  [<ffffffff817466ed>] system_call_fastpath+0x16/0x1b
[  761.043700] ---[ end trace b5d74ad8a84ad5c9 ]---

Kernel:

commit 299c73cff6145719471b825be0a8aa88bd85378f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jan 28 17:23:05 2015 +0100

    drm-intel-nightly: 2015y-01m-28d-16h-22m-42s UTC integration manifest


Ville and I traced the problem to outstanding_lazy_request handling. I suspect that the
outstanding_lazy_request->ctx pointer still points to an old, already initialized context when we are actually trying to submit to a new context.
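
The suspected hazard can be illustrated with a toy model. This is only a sketch of the idea, not i915 code: the Context, Request and Ring classes, alloc_request(), retire() and olr_matches() are hypothetical names invented for the illustration, and only the outstanding_lazy_request field with its ctx pointer comes from the report above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Hypothetical stand-in for a GPU context."""
    name: str
    initialized: bool = False


@dataclass
class Request:
    """Hypothetical lazily allocated request that remembers its context."""
    ctx: Context


class Ring:
    """Hypothetical engine that caches one outstanding lazy request."""

    def __init__(self) -> None:
        self.outstanding_lazy_request: Optional[Request] = None

    def alloc_request(self, ctx: Context) -> Request:
        # Suspected hazard (an assumption, not confirmed driver behaviour):
        # a cached request is reused without checking that it was
        # allocated for the context now being submitted to.
        if self.outstanding_lazy_request is None:
            self.outstanding_lazy_request = Request(ctx)
        return self.outstanding_lazy_request

    def retire(self) -> None:
        # Drop the cached request once it has been dealt with.
        self.outstanding_lazy_request = None


def olr_matches(ring: Ring, ctx: Context) -> bool:
    """Return True when the cached request, if any, belongs to ctx."""
    olr = ring.outstanding_lazy_request
    return olr is None or olr.ctx is ctx


if __name__ == "__main__":
    ring = Ring()
    old = Context("old", initialized=True)
    new = Context("new")

    ring.alloc_request(old)        # left outstanding for the old context
    req = ring.alloc_request(new)  # submission to the new context
    print(req.ctx.name)            # prints "old": the stale ctx is reused
    print(olr_matches(ring, new))  # prints "False"
```

In this model, work meant for the new context is emitted against the state of the old one. Whether that matches the actual sequence behind the warning is exactly what remains to be confirmed.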

The relevant email thread:
1420724407-11767-1-git-send-email-mika.kuoppala@intel.com
Comment 1 Eero Tamminen 2015-01-29 14:56:38 UTC
I wonder whether this is related to the BDW-specific issue we're seeing with the (SynMark) context re-creation test case. When that test is run a dozen times on a freshly booted device, the kernel does a few OOM kills on the system because GTT page allocations fail.

The main memory users on the machine, both of which got OOM-killed, were:

* Jenkins Java client (a console app that maps nearly 3GB of RAM anonymously, of which only a few tens of MB are dirty)

* Ubuntu Compiz (which leaks X Windows resource references on each test invocation, a bug in Compiz that is several years old: https://bugs.launchpad.net/compiz/+bug/1065657)
Comment 2 Jeff Zheng 2015-03-18 03:21:37 UTC
Created attachment 114414 [details]
dmesg log

Tried on Braswell RVP Fab2 with a C0 stepping CPU and BIOS V59, using the drm-intel-testing tag drm-intel-testing-2015-03-13; the issue can be reproduced.
Comment 3 Chris Harris 2015-07-01 13:24:58 UTC
Should be fixed by the OLR removal patch set: http://cgit.freedesktop.org/drm-intel/commit/?id=a5ac0f907d5b713a89c960605f36c0ccb436022c
Comment 4 Mika Kuoppala 2015-08-05 09:04:41 UTC
OLR removal has been merged in drm-intel-nightly. Please retest
Comment 5 Jani Nikula 2016-04-21 12:41:34 UTC
(In reply to Mika Kuoppala from comment #4)
> OLR removal has been merged in drm-intel-nightly. Please retest

Timeout, presuming fixed and closing.
