Bug 99107

Summary: [SNB] BUG NULL pointer deref in gen6_ppgtt_insert_entries+0x154/0x1e0
Product: DRI
Reporter: Shlomi Fish <shlomif>
Component: DRM/Intel
Assignee: Chris Wilson <chris>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: dominik, intel-gfx-bugs, matthew.auld
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: HSW, SNB
i915 features: GEM/PPGTT
Attachments:
Description                 Flags
some output/debug files.    none
direct pagecache write      none

Description Shlomi Fish 2016-12-16 11:54:39 UTC
Created attachment 128498 [details]
some output/debug files.

The keyboard and screen of my Mageia v6 x86-64 system freeze after I play
this file in VLC (this also happens under startx with JWM as a new user):

http://www.shlomifish.org/Files/files/video/yay-ponies--YP-7R-06x12--vlc-causing-hang--first3M.mkv

Here are its specs:

    An Intel Core i3 CPU (x86-64).
    8 GB of RAM.
    Intel Corporation Sandy Bridge Integrated Graphics Controller (rev 09)
    A 2 TB hard-disk.
    A 21″ Wide LCD Screen by LG.
    Intel Corporation Cougar Point High Definition Audio Controller.
    Intel Corporation 82579V Gigabit Network Connection.

ssh to it still works. chvt doesn't work.

"systemctl start lightdm.service" as root doesn't reset the screen.
htop shows that 4 GB out of 7.71 GB are in use, but no process has a large
MEM% share.

The attached .zip contains the output of some commands as requested on #intel-gfx on freenode. Note that «cat /sys/kernel/debug/dri/0/i915_gem_objects» just hangs. Playing the file with mpv works fine.
Comment 1 Chris Wilson 2016-12-16 17:19:08 UTC
[ 2510.847137] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 2510.847176] IP: [<ffffffffc00fb9d4>] gen6_ppgtt_insert_entries+0x154/0x1e0 [i915]
[ 2510.847231] PGD 1baa0b067 PUD 193f37067 PMD 0
[ 2510.847251] Oops: 0000 [#1] SMP
[ 2510.847264] Modules linked in: fuse ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set xt_recent iptable_nat nf_nat_ipv4 xt_comment ipt_REJECT nf_reject_ipv4 xt_addrtype bridge stp llc xt_mark iptable_mangle xt_tcpudp xt_CT iptable_raw xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_NFLOG nfnetlink_log xt_LOG nf_log_ipv4 nf_log_common nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_nat nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack iptable_filter
[ 2510.847547]  ip_tables x_tables af_packet msr vboxnetadp(O) vboxnetflt(O) vboxdrv(O) intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm uvcvideo snd_usb_audio irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel snd_usbmidi_lib ghash_clmulni_intel videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev snd_rawmidi cryptd media snd_seq_device intel_cstate intel_uncore intel_rapl_perf snd_hda_codec_hdmi snd_hda_codec_realtek input_leds joydev snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core ppdev gpio_ich iTCO_wdt iTCO_vendor_support snd_hwdep psmouse snd_pcm snd_timer mei_me i2c_i801 i2c_smbus evdev fjes mei lpc_ich nuvoton_cir rc_core snd soundcore shpchp parport_pc parport e1000e ptp pps_core tpm_tis tpm_tis_core tpm sch_fq_codel ipv6 crc_ccitt
[ 2510.847843]  autofs4 hid_generic usbhid hid xhci_pci xhci_hcd ehci_pci ehci_hcd serio_raw usbcore usb_common i915 video button i2c_algo_bit drm_kms_helper drm ata_piix
[ 2510.847910] CPU: 1 PID: 7449 Comm: vlc Tainted: G           O    4.8.14-desktop-2.mga6 #1
[ 2510.847935] Hardware name:                  /DH67BL, BIOS BLH6710H.86A.0105.2011.0301.1654 03/01/2011
[ 2510.847962] task: ffff92408597b900 task.stack: ffff924051614000
[ 2510.847981] RIP: 0010:[<ffffffffc00fb9d4>]  [<ffffffffc00fb9d4>] gen6_ppgtt_insert_entries+0x154/0x1e0 [i915]
[ 2510.848025] RSP: 0018:ffff9240516179e8  EFLAGS: 00010246
[ 2510.848042] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000d37b6000
[ 2510.848063] RDX: ffff9240d352c000 RSI: 0000000000001000 RDI: 00000000d37b6000
[ 2510.848084] RBP: ffff924051617a38 R08: 0000000000000000 R09: ffff9240d352c000
[ 2510.848106] R10: 0000000000000000 R11: 0000000000000000 R12: ffff924090555c20
[ 2510.848127] R13: 0000000000000000 R14: ffff9240d360cffc R15: 0000000000000000
[ 2510.848149] FS:  00007fffb2eee700(0000) GS:ffff9240dfa80000(0000) knlGS:0000000000000000
[ 2510.848172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2510.848190] CR2: 00007fffb52e68ce CR3: 00000002143bd000 CR4: 00000000000406e0
[ 2510.848211] Stack:
[ 2510.848219]  ffff92408597b900 0000000000000200 ffff9240d352c000 0000100000000001
[ 2510.848247]  00000000d37b6000 0000000000000000 ffff92403f410e00 0000000000000002
[ 2510.848275]  0000000000000001 ffff9240b20ec480 ffff924051617a70 ffffffffc00fd230
[ 2510.848301] Call Trace:
[ 2510.848325]  [<ffffffffc00fd230>] aliasing_gtt_bind_vma+0x90/0xe0 [i915]
[ 2510.848359]  [<ffffffffc010228e>] i915_vma_bind+0xce/0x180 [i915]
[ 2510.848391]  [<ffffffffc01086fb>] i915_gem_object_do_pin+0x7eb/0xa20 [i915]
[ 2510.848424]  [<ffffffffc010895d>] i915_gem_object_pin+0x2d/0x30 [i915]
[ 2510.848457]  [<ffffffffc00f72b9>] i915_gem_execbuffer_reserve_vma.isra.20+0x99/0x160 [i915]
[ 2510.848494]  [<ffffffffc00f76e7>] i915_gem_execbuffer_reserve.isra.21+0x367/0x390 [i915]
[ 2510.848530]  [<ffffffffc00f89d5>] i915_gem_do_execbuffer.isra.24+0x735/0x1250 [i915]
[ 2510.848556]  [<ffffffffbd173439>] ? __alloc_pages_nodemask+0x169/0xd70
[ 2510.848577]  [<ffffffffbd187390>] ? shmem_getpage_gfp+0x4f0/0x9e0
[ 2510.850036]  [<ffffffffc00fa0b6>] i915_gem_execbuffer2+0x106/0x260 [i915]
[ 2510.851500]  [<ffffffffc0018f37>] drm_ioctl+0x1d7/0x4a0 [drm]
[ 2510.852953]  [<ffffffffc00f9fb0>] ? i915_gem_execbuffer+0x320/0x320 [i915]
[ 2510.854395]  [<ffffffffbd1f9042>] do_vfs_ioctl+0x92/0x5a0
[ 2510.855843]  [<ffffffffbd3b9179>] ? tomoyo_file_ioctl+0x19/0x20
[ 2510.857295]  [<ffffffffbd1f95c9>] SyS_ioctl+0x79/0x90
[ 2510.858743]  [<ffffffffbd003ade>] do_syscall_64+0x5e/0xc0
[ 2510.860192]  [<ffffffffbd73d2e5>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2510.861636] Code: 5f 5d c3 c7 45 cc 00 00 00 00 31 db 48 c7 45 d0 00 00 00 00 45 31 e4 e9 26 ff ff ff 8b 45 b8 48 8b 55 c0 48 8b 84 c2 c8 01 00 00 <4c> 8b 38 48 8b 45 b0 83 80 00 18 00 00 01 48 b8 00 00 00 00 00
[ 2510.863266] RIP  [<ffffffffc00fb9d4>] gen6_ppgtt_insert_entries+0x154/0x1e0 [i915]
[ 2510.864859]  RSP <ffff9240516179e8>
[ 2510.866435] CR2: 0000000000000000
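
For readers following the trace, the fields that matter are the RIP line (symbol, offset into the function, function size, module), the register dump, and the Code: line, where the faulting instruction starts at the byte in angle brackets. Below is a minimal Python sketch that extracts them from a trimmed excerpt of the oops above. The helper names (parse_rip, faulting_bytes, register) are hypothetical and not part of any kernel tooling. The marked bytes 4c 8b 38 appear to decode to mov (%rax),%r15, which with RAX = 0 is consistent with the reported fault address CR2 = 0.

```python
import re

# Trimmed excerpt of the oops above (the Code: line is shortened).
OOPS = """\
RIP: 0010:[<ffffffffc00fb9d4>]  [<ffffffffc00fb9d4>] gen6_ppgtt_insert_entries+0x154/0x1e0 [i915]
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000d37b6000
Code: 48 8b 84 c2 c8 01 00 00 <4c> 8b 38 48 8b 45 b0
CR2: 0000000000000000
"""


def parse_rip(text):
    """Return (symbol, offset, function_size, module) from the RIP line."""
    m = re.search(r"RIP: .*\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+) \[(\w+)\]", text)
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16), m.group(4)


def faulting_bytes(text, count=3):
    """Return `count` byte strings starting at the <xx> marker in the Code: line."""
    line = next(l for l in text.splitlines() if l.startswith("Code:"))
    tokens = line.split()[1:]
    idx = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [t.strip("<>") for t in tokens[idx:idx + count]]


def register(text, name):
    """Return the value of a named register (e.g. 'RAX', 'CR2') as an int."""
    m = re.search(rf"\b{name}: ([0-9a-f]{{16}})", text)
    return int(m.group(1), 16)


if __name__ == "__main__":
    symbol, offset, size, module = parse_rip(OOPS)
    print(f"fault in {module}:{symbol} at +{offset:#x} of {size:#x} bytes")
    print("faulting instruction bytes:", " ".join(faulting_bytes(OOPS)))
    print(f"RAX={register(OOPS, 'RAX'):#x} CR2={register(OOPS, 'CR2'):#x}")
```

The offset/size pair (0x154/0x1e0) is what one would feed to a disassembler or to the kernel's scripts/decode_stacktrace.sh together with a matching vmlinux to get back to a source line.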
Comment 2 Chris Wilson 2017-02-10 22:24:43 UTC
*** Bug 98760 has been marked as a duplicate of this bug. ***
Comment 3 Chris Wilson 2017-02-15 20:53:00 UTC
Revamped the ppgtt alloc in drm-tip, including some fault testing. Please test.
Comment 4 Shlomi Fish 2017-02-15 23:09:17 UTC
I tested this bug again today. VLC now fails to play the video properly: it shows green frames and garbage on the screen and sometimes segfaults, but the surrounding environment remains stable and functional. I have upgraded many components / packages of my Mageia v6 system, so maybe one of them broke the default VLC.
Comment 5 Chris Wilson 2017-02-16 07:41:09 UTC
But does it still work on an old kernel? If so, a single component to bisect - even if it tells us what introduced the NULL pointer deref - would be helpful.

Nothing in dmesg or other system log files? Not even a GPU hang? :|
Comment 6 Shlomi Fish 2017-02-16 11:24:36 UTC
(In reply to Chris Wilson from comment #5)
> But does it still work on an old kernel? If so, a single component to bisect
> - even if it tells us what introduced the NULL pointer deref - would be
> helpful.

I can try testing it on an old kernel, but it'll take some time to build it.

> 
> Nothing in dmesg or other system log files? Not even a GPU hang? :|

I don't see anything in dmesg after using vlc to play that file.
Comment 7 Shlomi Fish 2017-02-16 12:09:00 UTC
(In reply to Shlomi Fish from comment #6)
> (In reply to Chris Wilson from comment #5)
> > But does it still work on an old kernel? If so, a single component to bisect
> > - even if it tells us what introduced the NULL pointer deref - would be
> > helpful.
> 
> I can try testing it on an old kernel, but it'll take some time to build it.
> 

It turned out that I already had a 4.8.15 kernel built and installed on my computer. With it running, the original bug can still be reproduced: the screen and keyboard freeze.
Comment 8 Chris Wilson 2017-02-16 12:12:23 UTC
We need to go older then. Before it died, did it show the same corruption?
Comment 9 Shlomi Fish 2017-02-16 12:26:11 UTC
Hi Chris,

(In reply to Chris Wilson from comment #8)
> We need to go older then. Before it died, did it show the same corruption?

I'm not sure I understand you, but I'll try. VLC didn't get as far as showing the green background etc.; it froze the computer immediately.
Comment 10 Dominik 'Rathann' Mierzejewski 2017-02-17 18:02:58 UTC
(In reply to Chris Wilson from comment #3)
> Revamped the ppgtt alloc in drm-tip, including some fault testing. Please
> test.

I'd be happy to test on my Haswell-based machine, but I can't seem to find the right source repository to build from. Could you give me some hints about *what* I need to build and *where* to get it from? I'm currently running Fedora 25 on my machine.
Comment 11 Jani Nikula 2017-02-20 09:01:05 UTC
(In reply to Dominik 'Rathann' Mierzejewski from comment #10)
> Could you give me some hints
> about *what* I need to build and *where* to get it from? I'm currently
> running Fedora 25 on my machine.

A kernel built from the drm-tip branch of https://cgit.freedesktop.org/drm-tip
Comment 12 Shlomi Fish 2017-02-26 13:07:36 UTC
Hi all!

An update - I can reproduce a similar freeze again on the same Core i3 machine with the drm-tip kernel. Note that I'll have some problems sshing into it after the issue occurs until my laptop returns. The problematic commit is:

commit 0be4ca1aff160a0abfcb2047d487799e420be6b5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Feb 25 19:02:52 2017 +0000

    drm-tip: 2017y-02m-25d-19h-02m-26s UTC integration manifest

diff --git a/integration-manifest b/integration-manifest
new file mode 100644
index 0000000..e6b73f0
--- /dev/null
+++ b/integration-manifest
@@ -0,0 +1,22 @@
+drm-intel drm-intel-fixes c470abd4fde40ea6a0846a2beab642a578c0b8cd
+       Linux 4.10
+drm-upstream drm-fixes 18a0de8816766a0da7537ef82156b5418ba5cd6e
+       Merge branch 'drm-fixes-4.10' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Comment 13 Shlomi Fish 2017-02-28 09:15:00 UTC
Hi all!

(In reply to Shlomi Fish from comment #12)
> Hi all!
> 
> An update - I can reproduce a similar freeze again on the same Core i3
> machine with the drm-tip kernel. Note that I'll have some problems sshing
> into it after the issue occurs until my laptop returns. The problematic
> commit is:
> 

In case it wasn't clear - I am asking for help in getting the post-mortem information about that freeze. Can anyone provide me with a script for that?

Regards,

-- Shlomi Fish
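
One possible shape for such a script, as a hedged sketch rather than an official i915 debugging procedure: run it over ssh once the display has frozen, and give every command a timeout, since reading /sys/kernel/debug/dri/0/i915_gem_objects was reported above to hang. The function name collect, the output file names, and the choice of dmesg and /proc/meminfo as extra sources are assumptions made for illustration. Reading debugfs normally requires root. The kill after a timeout is best effort only: a reader stuck inside the kernel may not die, so the script does not wait for it.

```python
import subprocess
import sys
from pathlib import Path

# Commands to capture. The debugfs path comes from the report above; the
# other entries are assumptions about what is useful, not a fixed checklist.
DEFAULT_COMMANDS = {
    "dmesg.txt": ["dmesg"],
    "meminfo.txt": ["cat", "/proc/meminfo"],
    "i915_gem_objects.txt": ["cat", "/sys/kernel/debug/dri/0/i915_gem_objects"],
}


def collect(commands, outdir, timeout=10.0):
    """Run each command, saving stdout and stderr to outdir/<name>.

    Returns {name: status}; status is 'ok', 'failed', 'timeout' or 'missing'.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, argv in commands.items():
        with open(outdir / name, "wb") as out:
            try:
                proc = subprocess.Popen(argv, stdout=out, stderr=subprocess.STDOUT)
            except OSError:
                # The executable does not exist or cannot be started.
                results[name] = "missing"
                continue
            try:
                rc = proc.wait(timeout=timeout)
                results[name] = "ok" if rc == 0 else "failed"
            except subprocess.TimeoutExpired:
                # Best effort: do not wait afterwards, because a process
                # blocked in the kernel may never exit.
                proc.kill()
                results[name] = "timeout"
    return results


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "i915-postmortem"
    for name, status in collect(DEFAULT_COMMANDS, target).items():
        print(f"{name}: {status}")
```

A file that ends up with status 'timeout' is itself a data point, since it shows which interface blocks after the freeze.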
Comment 14 Chris Wilson 2017-03-04 00:57:10 UTC
I know where the memory starvation is coming from in v4.9+ -- it creates huge objects and pwrites into just a page, and then never uses the object. We are pinning the whole object, and since it is not used by the GPU, it is not currently being accounted for -- a side-effect of the obj->mm.lock work. Even with that fixed, this is going to behave badly.
Comment 15 Chris Wilson 2017-03-04 01:08:56 UTC
Ah, not just a page:

[   46.684240] i915_gem_pwrite_ioctl(0 + fffffffe / 100000000), pages? no
[   48.062115] i915_gem_pwrite_ioctl(0 + fffffffc / 100000000), pages? no
[   49.435955] i915_gem_pwrite_ioctl(0 + fffffffd / 100000000), pages? no
[   50.808551] i915_gem_pwrite_ioctl(0 + fffffffd / 100000000), pages? no
[   68.478379] i915_gem_pwrite_ioctl(0 + fffffffe / 100000000), pages? no
[  146.486648] i915_gem_pwrite_ioctl(0 + dcc / 1000), pages? no
[  146.535395] i915_gem_pwrite_ioctl(0 + 5 / 1000), pages? yes
[  146.535441] i915_gem_pwrite_ioctl(0 + ffffffff / 100000000), pages? no
[  148.969021] i915_gem_pwrite_ioctl(0 + fffffffd / 100000000), pages? no
[  150.438061] i915_gem_pwrite_ioctl(0 + fffffffd / 100000000), pages? no
[  151.852695] i915_gem_pwrite_ioctl(0 + fffffffd / 100000000), pages? no
[  169.815979] i915_gem_pwrite_ioctl(0 + fffffffd / 100000000), pages? no

4GiB at a time.
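
To make the numbers in the log above easier to read, here is a small Python sketch that parses a few of those lines and converts the hex values. It assumes each line has the form "offset + length / object size" in hexadecimal; the debug printk itself is not shown in this report, so that reading is an inference from the "4GiB at a time" remark. The names PATTERN and parse are hypothetical.

```python
import re

# A few lines copied from the debug output above.
LOG = """\
[   46.684240] i915_gem_pwrite_ioctl(0 + fffffffe / 100000000), pages? no
[  146.486648] i915_gem_pwrite_ioctl(0 + dcc / 1000), pages? no
[  146.535395] i915_gem_pwrite_ioctl(0 + 5 / 1000), pages? yes
[  146.535441] i915_gem_pwrite_ioctl(0 + ffffffff / 100000000), pages? no
"""

# Assumed field order: offset + length / object size, all in hex.
PATTERN = re.compile(
    r"i915_gem_pwrite_ioctl\(([0-9a-f]+) \+ ([0-9a-f]+) / ([0-9a-f]+)\), "
    r"pages\? (yes|no)"
)

GIB = 1 << 30


def parse(log):
    """Return one dict per pwrite line with integer byte counts."""
    writes = []
    for m in PATTERN.finditer(log):
        offset, length, size = (int(g, 16) for g in m.groups()[:3])
        writes.append(
            {
                "offset": offset,
                "length": length,
                "object_size": size,
                "has_pages": m.group(4) == "yes",
            }
        )
    return writes


if __name__ == "__main__":
    for w in parse(LOG):
        print(
            f"write of {w['length']} bytes at offset {w['offset']} "
            f"into a {w['object_size'] / GIB:.6f} GiB object, "
            f"pages already present: {w['has_pages']}"
        )
```

Under that reading, 0x100000000 is exactly 4 GiB and the large writes (0xfffffffc to 0xffffffff bytes) cover all but the last few bytes of the object, which matches the "not just a page" correction.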
Comment 16 Chris Wilson 2017-03-04 02:01:56 UTC
Created attachment 130060 [details] [review]
direct pagecache write
Comment 17 yann 2017-03-06 14:29:44 UTC
Reference to Chris' patchset: https://patchwork.freedesktop.org/series/20747/
Comment 18 Chris Wilson 2017-03-07 21:38:52 UTC
commit 7c55e2c5772dcf3cbacd0fa2bcfeefae416b73f7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 7 12:03:38 2017 +0000

    drm/i915: Use pagecache write to prepopulate shmemfs from pwrite-ioctl

The earlier BUG should have been resolved, and this commit should prevent vlc from triggering the regression in v4.10.
Comment 19 Dominik 'Rathann' Mierzejewski 2017-05-25 15:07:10 UTC
Confirmed, no longer seeing this on my Haswell machine with Fedora kernel-4.11.2-200.fc25.x86_64.
