Bug 98829 - [BDW] oops in intel_unpin_fb_obj
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 99134
Depends on:
Blocks:
 
Reported: 2016-11-23 12:38 UTC by mwa
Modified: 2017-07-24 22:39 UTC
CC List: 3 users

See Also:
i915 platform: BDW, SKL
i915 features:


Attachments
netconsole-dump (79.06 KB, text/plain)
2016-11-23 12:38 UTC, mwa

Description mwa 2016-11-23 12:38:07 UTC
Created attachment 128160
netconsole-dump

On the latest -nightly when running the following, a hard hang will sometimes present itself:

./nexuiz-linux-x86_64-glx -benchmark demos/demo1 -nosound 2>&1 | egrep -e '[0-9]+ frames'

This has so far been hard to reproduce, so bisection has proved futile, but I did finally get it to happen whilst netconsole was running. Please see the attached dump.
Comment 1 Chris Wilson 2016-11-23 12:45:45 UTC
Hmm, looks like quite general memory corruption. Probably due to the vma pinleak, i.e. the hardware was continuing to use pages returned to the system; calamity ensues. Alternatively, some other code went rogue and scribbled over large chunks of memory and just happened to write zero over the unpin_work...
Comment 2 Chris Wilson 2016-11-23 13:02:23 UTC
Matthew, whilst you still hopefully have that kernel, care to work out the line that died? It should be 120 bytes from the base of the pointer...
Comment 3 Chris Wilson 2016-11-23 13:09:26 UTC
120 bytes should be vma->vm i.e. the lockdep_assert_held(&vma->vm->dev->struct_mutex), and so vma is NULL.

Given the persistence of GGTT vma (they are only destroyed when the object is) it seems more likely that there was another change of state. Still that loophole is closed by https://patchwork.freedesktop.org/series/15325/
Comment 4 mwa 2016-11-23 22:51:56 UTC
Yeah, as expected the problematic line is:

0x000000000007c625 <+101>:	mov    0x78(%rax),%rcx

That would be vma->fence; vma->vm would be at 0x70. I guess I didn't have lockdep enabled... Anyway, as you said, it looks like the vma is NULL. Now we just need to figure out why...
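
The offset reasoning in comments 2 to 4 can be sketched as follows. This is only an illustration under stated assumptions: FakeVma is a hypothetical layout built to place members at the offsets quoted in comment 4 (0x70 and 0x78), not the real struct i915_vma, and the helper names are made up for this sketch. It shows why a load of a member at offset 0x78 (120 bytes) that faults at address 0x78 implies a NULL base pointer.

```python
import ctypes


class FakeVma(ctypes.Structure):
    """Hypothetical layout for illustration only; NOT the real struct i915_vma."""
    _fields_ = [
        ("padding", ctypes.c_uint8 * 0x70),  # placeholder for all earlier members
        ("vm", ctypes.c_uint64),             # pointer-sized member at 0x70 (per comment 4)
        ("fence", ctypes.c_uint64),          # pointer-sized member at 0x78 (per comment 4)
    ]


def fault_address(base, member_offset):
    """Address touched when loading a member through the pointer 'base'."""
    return base + member_offset


def implied_base(fault_addr, member_offset):
    """Base pointer implied by a fault address and the member offset in the instruction."""
    return fault_addr - member_offset


# 120 bytes from the base of the pointer is offset 0x78.
offset = FakeVma.fence.offset

# A NULL base plus that offset gives the reported fault address.
null_fault = fault_address(0, offset)

# Working backwards from the fault address gives a base pointer of zero.
base = implied_base(0x78, offset)
```

The same subtraction works for any small fault address: if the result is zero for a plausible member offset, a NULL structure pointer is the likely cause.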
Comment 5 Chris Wilson 2016-12-20 13:20:45 UTC
*** Bug 99134 has been marked as a duplicate of this bug. ***
Comment 6 Greg White 2017-01-05 11:38:58 UTC
This still reproduces on 4.10-rc2.
Comment 7 Zlatko Calusic 2017-01-16 09:47:10 UTC
Happens occasionally during logout from lightdm session. Afterwards, SysRq reboot is required. Skylake i7-6700K, 4.10.0-rc3+.

Jan 16 10:18:53 atlas kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
Jan 16 10:18:53 atlas kernel: IP: intel_unpin_fb_obj+0x63/0xd0 [i915]
Jan 16 10:18:53 atlas kernel: PGD 0 
Jan 16 10:18:53 atlas kernel: 
Jan 16 10:18:53 atlas kernel: Oops: 0000 [#1] PREEMPT SMP
Jan 16 10:18:53 atlas kernel: Modules linked in: cpuid btrfs ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs dm_mod uas usb_storage loop drbg ansi_cprng authenc echainiv xfrm6_mode_tunnel xfrm4_mode_tunnel hmac xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp fuse esp4 ah4 af_key xfrm_algo ip6table_filter ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_tcpudp xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat crc32c_generic binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul eeepc_wmi crc32_pclmul asus_wmi sparse_keymap ghash_clmulni_intel mxm_wmi pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate snd_hda_intel
Jan 16 10:18:53 atlas kernel:  snd_hda_codec snd_hda_core snd_hwdep intel_uncore snd_pcm snd_timer snd intel_rapl_perf iTCO_wdt mei_me joydev serio_raw pcspkr soundcore iTCO_vendor_support sg shpchp mei hci_uart btbcm btqca btintel battery wmi bluetooth rfkill intel_lpss_acpi intel_lpss mfd_core acpi_als tpm_tis evdev acpi_pad tpm_tis_core kfifo_buf industrialio tpm nf_conntrack msr zram zsmalloc ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c multipath linear raid1 raid0 md_mod sd_mod hid_generic usbhid ahci libahci crc32c_intel sata_sil24 i915 psmouse e1000e i2c_algo_bit ptp xhci_pci nvme pps_core i2c_i801 xhci_hcd drm_kms_helper nvme_core libata usbcore drm scsi_mod fan thermal video i2c_hid hid button fjes
Jan 16 10:18:53 atlas kernel: CPU: 0 PID: 4277 Comm: kworker/u16:0 Not tainted 4.10.0-rc3+ #1
Jan 16 10:18:53 atlas kernel: Hardware name: System manufacturer System Product Name/Z170-A, BIOS 2202 09/19/2016
Jan 16 10:18:53 atlas kernel: Workqueue: i915 intel_unpin_work_fn [i915]
Jan 16 10:18:53 atlas kernel: task: ffff9fabf3ab9e80 task.stack: ffffb8af60e24000
Jan 16 10:18:53 atlas kernel: RIP: 0010:intel_unpin_fb_obj+0x63/0xd0 [i915]
Jan 16 10:18:53 atlas kernel: RSP: 0018:ffffb8af60e27de8 EFLAGS: 00010246
Jan 16 10:18:53 atlas kernel: RAX: 0000000000000000 RBX: ffff9faa7f21f700 RCX: 0000000000000001
Jan 16 10:18:53 atlas kernel: RDX: ffffb8af60e27de8 RSI: ffff9fac54c03908 RDI: ffff9faa7f21f700
Jan 16 10:18:53 atlas kernel: RBP: ffff9fa94b96a500 R08: ffff9fa99f7a9f08 R09: 0000000000000002
Jan 16 10:18:53 atlas kernel: R10: 000000000000008a R11: 0000000000000075 R12: 0000000000000001
Jan 16 10:18:53 atlas kernel: R13: ffff9fac54c00000 R14: ffff9fac5b5e9c00 R15: ffff9fac54c00068
Jan 16 10:18:53 atlas kernel: FS:  0000000000000000(0000) GS:ffff9fac6ec00000(0000) knlGS:0000000000000000
Jan 16 10:18:53 atlas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 16 10:18:53 atlas kernel: CR2: 0000000000000078 CR3: 0000000373809000 CR4: 00000000003406f0
Jan 16 10:18:53 atlas kernel: Call Trace:
Jan 16 10:18:53 atlas kernel:  ? intel_unpin_work_fn+0x50/0x120 [i915]
Jan 16 10:18:53 atlas kernel:  ? process_one_work+0x18e/0x440
Jan 16 10:18:53 atlas kernel:  ? worker_thread+0x4a/0x480
Jan 16 10:18:53 atlas kernel:  ? kthread+0xf4/0x130
Jan 16 10:18:53 atlas kernel:  ? process_one_work+0x440/0x440
Jan 16 10:18:53 atlas kernel:  ? kthread_create_on_node+0x60/0x60
Jan 16 10:18:53 atlas kernel:  ? ret_from_fork+0x25/0x30
Jan 16 10:18:53 atlas kernel: Code: a9 fc ff ff ff 74 64 44 89 e2 48 89 ee 48 89 e7 e8 73 30 ff ff 48 8b 43 08 48 89 e2 48 89 df 48 8d b0 08 39 00 00 e8 bd 25 fc ff <48> 8b 50 78 48 85 d2 74 04 83 6a 20 01 48 89 c7 e8 c8 6b fc ff 
Jan 16 10:18:53 atlas kernel: RIP: intel_unpin_fb_obj+0x63/0xd0 [i915] RSP: ffffb8af60e27de8
Jan 16 10:18:53 atlas kernel: CR2: 0000000000000078
Jan 16 10:18:53 atlas kernel: ---[ end trace 97044bd2bd6079bb ]---
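
The faulting instruction in the oops above can be decoded by hand. In the Code: line, the byte in angle brackets marks where execution faulted, so the instruction begins with 48 8b 50 78. The following is a minimal sketch, assuming only this one encoding form (REX.W prefix, opcode 0x8B, 8-bit displacement, no SIB byte); the function name decode_mov_load_disp8 is hypothetical and this is not a general x86 disassembler.

```python
# Low eight 64-bit registers, indexed by their 3-bit encoding.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_load_disp8(code):
    """Decode 'mov disp8(%base),%dest' encoded as: 48 8B ModRM disp8.

    Returns (dest_register, base_register, displacement).
    Only this single form is handled; anything else raises ValueError.
    """
    if len(code) < 4:
        raise ValueError("need at least 4 bytes")
    rex, opcode, modrm, disp = code[0], code[1], code[2], code[3]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not REX.W + 8B (64-bit mov load)")
    mod = modrm >> 6          # addressing mode
    reg = (modrm >> 3) & 7    # destination register
    rm = modrm & 7            # base register
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only the disp8 form without a SIB byte is handled")
    if disp >= 0x80:          # disp8 is a signed byte
        disp -= 0x100
    return REG64[reg], REG64[rm], disp


# Bytes starting at the <48> marker in the Code: line.
dest, base_reg, disp = decode_mov_load_disp8(bytes.fromhex("48 8b 50 78"))

# Register values copied from the oops; only RAX matters here.
registers = {"rax": 0x0000000000000000}
effective_address = registers[base_reg] + disp
```

This decodes to a load from 0x78(%rax) into %rdx. With RAX = 0 in the register dump, the effective address is 0x78, which matches the CR2 value reported in the oops.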
Comment 8 Chris Wilson 2017-01-27 18:15:31 UTC
A workaround has been applied that will prevent the oops. The root cause is still unknown; hopefully we will stumble upon it soon enough.

commit be1e341513ca23b0668b7b0f26fa6e2ffc46ba20
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 16 15:21:27 2017 +0000

    drm/i915: Track pinned vma in intel_plane_state
Comment 9 Zlatko Calusic 2017-01-27 18:53:04 UTC
Great! And just in time, my other computer (BDW) locked up with the same oops.

Will test, and report after some time if it helps (or not).

Thanks!
Comment 10 Nicholas Sielicki 2017-01-27 22:45:51 UTC
I just tested and can confirm it's fixed in drm-intel-next.

If this is any help, prior to recent patches I was able to reproduce this
consistently by opening a video in vlc/mplayer/mpv and quickly clicking the
video, such that it was rapidly flipping between fullscreen and windowed.
Comment 11 Greg White 2017-01-27 23:08:39 UTC
Is there a patch for 4.10-rc5?  I'm seeing this intermittently, mostly at login/logoff and it's a problem.
Comment 12 yann 2017-01-31 09:51:45 UTC
(In reply to Greg White from comment #11)
> Is there a patch for 4.10-rc5?  I'm seeing this intermittently, mostly at
> login/logoff and it's a problem.

Reference to Maarten's "Backport vma fixes for 4.10-rc6" patch: https://patchwork.freedesktop.org/series/18825/
Comment 13 Greg White 2017-01-31 10:52:32 UTC
Thanks!

