Bug 99295 - [Regression BDW] kernel panic in Intel i915 module, complete system freeze in 4.10-rc2
Summary: [Regression BDW] kernel panic in Intel i915 module, complete system freeze in...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: highest blocker
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: PatchSubmitted
Keywords: bisect_pending, regression
Duplicates: 99684 100875 100887 101019 101625
Depends on:
Blocks:
 
Reported: 2017-01-06 05:50 UTC by Nicholas Stommel
Modified: 2017-08-21 11:48 UTC
CC List: 17 users

See Also:
i915 platform: BDW
i915 features: GEM/PPGTT


Attachments
Stacktrace (4.94 KB, text/plain)
2017-05-18 08:41 UTC, Maël Lavault
Do not drop pagetables. (2.74 KB, patch)
2017-05-18 10:06 UTC, Chris Wilson

Description Nicholas Stommel 2017-01-06 05:50:58 UTC
I am running kernel version 4.10-rc2 on Ubuntu, and several times my computer has become completely unresponsive, to the point where I had to forcibly cut the power. This is a complete and total freeze: all I/O devices stop responding. Each time, a similar message citing a kernel NULL pointer dereference within the i915 module is written to the syslog.
Any insight would be appreciated. This is rather critical, as it causes a severe system failure.

The following is the message from the syslog:

Jan  6 00:15:50 delphi kernel: [ 7283.501316] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Jan  6 00:15:50 delphi kernel: [ 7283.501416] IP: gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.501445] PGD 0 
Jan  6 00:15:50 delphi kernel: [ 7283.501446] 
Jan  6 00:15:50 delphi kernel: [ 7283.501464] Oops: 0002 [#1] SMP
Jan  6 00:15:50 delphi kernel: [ 7283.501478] Modules linked in: ccm rfcomm bnep snd_soc_sst_broadwell nls_iso8859_1 dm_crypt snd_soc_sst_haswell_pcm snd_soc_sst_firmware snd_soc_sst_ipc snd_soc_sst_dsp hp_wmi arc4 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_soc_rt298 kvm irqbypass crct10dif_pclmul crc32_pclmul iwlmvm ghash_clmulni_intel mac80211 pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf snd_hda_codec_hdmi iwlwifi rtsx_pci_ms memstick cfg80211 snd_hda_intel snd_soc_rt286 snd_soc_rl6347a snd_hda_codec snd_usb_audio snd_soc_ssm4567 serio_raw snd_hda_core snd_soc_core snd_usbmidi_lib snd_compress snd_seq_midi ac97_bus snd_hwdep snd_seq_midi_event snd_pcm_dmaengine snd_pcm snd_rawmidi mei_me shpchp lpc_ich mei joydev btusb hid_sensor_gyro_3d hid_sensor_magn_3d
Jan  6 00:15:50 delphi kernel: [ 7283.501744]  hid_sensor_incl_3d hid_sensor_accel_3d hid_sensor_rotation btrtl uvcvideo btbcm input_leds hid_sensor_trigger btintel industrialio_triggered_buffer hid_sensor_iio_common elan_i2c bluetooth videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core hid_multitouch videodev media snd_seq snd_seq_device snd_timer snd dw_dmac snd_soc_sst_acpi 8250_dw hp_accel intel_vbtn snd_soc_sst_match lis3lv02d i2c_designware_platform input_polldev i2c_designware_core spi_pxa2xx_platform soundcore hp_wireless mac_hid sparse_keymap acpi_pad acpi_als kfifo_buf industrialio parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_logitech_hidpp hid_sensor_custom hid_sensor_hub hid_logitech_dj hid_generic usbhid rtsx_pci_sdmmc i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt psmouse
Jan  6 00:15:50 delphi kernel: [ 7283.502011]  fb_sys_fops drm ahci libahci rtsx_pci wmi sdhci_acpi sdhci video i2c_hid hid fjes
Jan  6 00:15:50 delphi kernel: [ 7283.502049] CPU: 1 PID: 1223 Comm: Xorg Not tainted 4.10.0-rc2 #3
Jan  6 00:15:50 delphi kernel: [ 7283.502073] Hardware name: HP HP Spectre x360 Convertible  /802D, BIOS F.43 09/28/2016
Jan  6 00:15:50 delphi kernel: [ 7283.502102] task: ffff9c5c8245e3c0 task.stack: ffffc15b014e4000
Jan  6 00:15:50 delphi kernel: [ 7283.502142] RIP: 0010:gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.502171] RSP: 0018:ffffc15b014e7898 EFLAGS: 00010246
Jan  6 00:15:50 delphi kernel: [ 7283.502198] RAX: ffff9c5c12db9440 RBX: 0000000000000003 RCX: 0000000000000003
Jan  6 00:15:50 delphi kernel: [ 7283.502242] RDX: 0000000000000000 RSI: ffff9c5b6f4d6000 RDI: ffff9c5c7c6a8000
Jan  6 00:15:50 delphi kernel: [ 7283.502287] RBP: ffffc15b014e78f0 R08: 0000000000000000 R09: 0000000000000000
Jan  6 00:15:50 delphi kernel: [ 7283.502333] R10: 0000000000000000 R11: ffff9c5c8245e3c0 R12: ffff9c5a49144000
Jan  6 00:15:50 delphi kernel: [ 7283.502381] R13: ffff9c5c77777730 R14: 00000000fcc0c000 R15: 0000000000010000
Jan  6 00:15:50 delphi kernel: [ 7283.502430] FS:  00007fb025b72a40(0000) GS:ffff9c5c8ec80000(0000) knlGS:0000000000000000
Jan  6 00:15:50 delphi kernel: [ 7283.502479] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  6 00:15:50 delphi kernel: [ 7283.502514] CR2: 0000000000000018 CR3: 000000023cff2000 CR4: 00000000003406e0
Jan  6 00:15:50 delphi kernel: [ 7283.502559] Call Trace:
Jan  6 00:15:50 delphi kernel: [ 7283.502604]  gen8_alloc_va_range_3lvl+0xfb/0x9e0 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.502644]  ? swiotlb_map_sg_attrs+0x4b/0x140
Jan  6 00:15:50 delphi kernel: [ 7283.502695]  gen8_alloc_va_range+0x23d/0x470 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.502752]  i915_vma_bind+0x81/0x120 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.502806]  __i915_vma_do_pin+0x2a5/0x450 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.502860]  i915_gem_execbuffer_reserve_vma.isra.31+0x144/0x1b0 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.502927]  i915_gem_execbuffer_reserve.isra.32+0x39e/0x3d0 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.502989]  i915_gem_do_execbuffer.isra.38+0x637/0x1780 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.503050]  ? intel_runtime_pm_put+0x6e/0xa0 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.503104]  ? remap_pfn+0x4f/0x60 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.503152]  ? i915_memcpy_init_early+0x20/0x20 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.503210]  i915_gem_execbuffer2+0xc5/0x240 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.503258]  drm_ioctl+0x21b/0x4c0 [drm]
Jan  6 00:15:50 delphi kernel: [ 7283.503307]  ? i915_gem_execbuffer+0x310/0x310 [i915]
Jan  6 00:15:50 delphi kernel: [ 7283.503344]  ? apparmor_mmap_file+0x16/0x20
Jan  6 00:15:50 delphi kernel: [ 7283.503375]  ? timerqueue_del+0x24/0x70
Jan  6 00:15:50 delphi kernel: [ 7283.503402]  ? __remove_hrtimer+0x3c/0x70
Jan  6 00:15:50 delphi kernel: [ 7283.503433]  do_vfs_ioctl+0xa3/0x600
Jan  6 00:15:50 delphi kernel: [ 7283.503458]  ? do_setitimer+0xdc/0x230
Jan  6 00:15:50 delphi kernel: [ 7283.503482]  SyS_ioctl+0x79/0x90
Jan  6 00:15:50 delphi kernel: [ 7283.503503]  entry_SYSCALL_64_fastpath+0x1e/0xad
Jan  6 00:15:50 delphi kernel: [ 7283.503521] RIP: 0033:0x7fb0239a98b7
Jan  6 00:15:50 delphi kernel: [ 7283.503536] RSP: 002b:00007ffc196b25f8 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
Jan  6 00:15:50 delphi kernel: [ 7283.503565] RAX: ffffffffffffffda RBX: 000000000000032c RCX: 00007fb0239a98b7
Jan  6 00:15:50 delphi kernel: [ 7283.503592] RDX: 00007ffc196b2640 RSI: 0000000040406469 RDI: 000000000000000b
Jan  6 00:15:50 delphi kernel: [ 7283.503619] RBP: 00000000000000d9 R08: 00005572df4855f0 R09: 0000000000000000
Jan  6 00:15:50 delphi kernel: [ 7283.503646] R10: 0000000000000000 R11: 0000000000003246 R12: 0000000000000380
Jan  6 00:15:50 delphi kernel: [ 7283.503673] R13: 000000000000032c R14: 00005572dfeb1310 R15: 0000000000000000
Jan  6 00:15:50 delphi kernel: [ 7283.503716] Code: e6 48 8b 90 20 03 00 00 48 8b b8 d8 02 00 00 48 8b 52 08 48 83 ca 03 e8 ea cd ff ff 48 8b 45 b0 48 8b 4d c8 48 8b 10 48 8b 45 d0 <4c> 89 24 ca 48 0f ab 08 0f 1f 44 00 00 e9 53 ff ff ff 65 8b 05 
Jan  6 00:15:50 delphi kernel: [ 7283.503890] RIP: gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915] RSP: ffffc15b014e7898
Jan  6 00:15:50 delphi kernel: [ 7283.503948] CR2: 0000000000000018
Jan  6 00:15:50 delphi kernel: [ 7283.529230] ---[ end trace 539a400d167f3399 ]---
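The faulting address can be cross-checked against the register dump. Decoding the bytes at the `<4c>` marker in the Code: line by hand (a reading of the oops, not something stated in the report), `4c 89 24 ca` is a 64-bit store of R12 to `(%rdx,%rcx,8)`. With RDX = 0 and RCX = 3 that comes to 0x18, which matches CR2. A minimal sketch of that arithmetic in C, with hypothetical helper names:

```c
#include <stdint.h>

/* Effective address of an x86-64 base + index*scale memory operand.
 * The helper name is hypothetical and only serves this illustration. */
uint64_t effective_address(uint64_t base, uint64_t index, uint64_t scale)
{
    return base + index * scale;
}

/* Register values copied from the oops above. */
uint64_t faulting_address(void)
{
    const uint64_t rdx = 0x0; /* base register: a NULL pointer */
    const uint64_t rcx = 0x3; /* index register: slot 3 */

    return effective_address(rdx, rcx, 8); /* 8-byte entries */
}
```

Under that reading, `faulting_address()` evaluates to 0x18: a write to slot 3 of an array of 8-byte entries whose base pointer is NULL.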
Comment 1 Ernst Sjöstrand 2017-01-10 09:38:59 UTC
Got the same thing with:
http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-next/2017-01-09/

Jan  9 18:05:28 kvagga kernel: [25869.013107] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Jan  9 18:05:28 kvagga kernel: [25869.013154] IP: gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.013175] PGD 0 
Jan  9 18:05:28 kvagga kernel: [25869.013175] 
Jan  9 18:05:28 kvagga kernel: [25869.013188] Oops: 0002 [#1] SMP
Jan  9 18:05:28 kvagga kernel: [25869.013199] Modules linked in: msr rfcomm ftdi_sio usbserial bnep ax88179_178a usbnet mii snd_hda_codec_hdmi hid_logitech_hidpp hid_logitech_dj uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media btusb btrtl hid_generic hid_multitouch binfmt_misc nls_iso8859_1 i2c_designware_platform i2c_designware_core intel_rapl x86_pkg_temp_thermal intel_powerclamp dell_wmi dell_led coretemp snd_hda_codec_realtek snd_hda_codec_generic kvm_intel dell_laptop dell_smbios dcdbas kvm irqbypass snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul ghash_clmulni_intel snd_hda_core pcbc snd_hwdep snd_pcm aesni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi aes_x86_64 brcmfmac crypto_simd glue_helper cryptd snd_seq brcmutil intel_cstate intel_rapl_perf cfg80211 input_leds snd_seq_device
Jan  9 18:05:28 kvagga kernel: [25869.013394]  joydev snd_timer rtsx_pci_ms serio_raw memstick snd soundcore idma64 virt_dma processor_thermal_device mei_me intel_soc_dts_iosf shpchp mei intel_pch_thermal intel_lpss_pci hci_uart btbcm btqca btintel bluetooth int3403_thermal dell_smo8800 intel_lpss_acpi acpi_als intel_lpss int3402_thermal mac_hid int3400_thermal intel_hid int340x_thermal_zone acpi_thermal_rel acpi_pad kfifo_buf sparse_keymap industrialio parport_pc ppdev lp parport ip_tables x_tables autofs4 usbhid rtsx_pci_sdmmc mxm_wmi i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt psmouse fb_sys_fops nvme drm nvme_core ahci rtsx_pci libahci i2c_hid hid wmi pinctrl_sunrisepoint video pinctrl_intel fjes
Jan  9 18:05:28 kvagga kernel: [25869.013566] CPU: 2 PID: 4249 Comm: Compositor Tainted: G        W       4.10.0-996-generic #201701090024
Jan  9 18:05:28 kvagga kernel: [25869.013591] Hardware name: Dell Inc. XPS 15 9550/0N7TVV, BIOS 1.2.14 08/31/2016
Jan  9 18:05:28 kvagga kernel: [25869.013611] task: ffff8d3e218c5c00 task.stack: ffff9a5a43d74000
Jan  9 18:05:28 kvagga kernel: [25869.013642] RIP: 0010:gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.013663] RSP: 0018:ffff9a5a43d778a8 EFLAGS: 00010246
Jan  9 18:05:28 kvagga kernel: [25869.013679] RAX: ffff8d3cf2285880 RBX: 0000000000000003 RCX: 0000000000000003
Jan  9 18:05:28 kvagga kernel: [25869.013699] RDX: 0000000000000000 RSI: ffff8d3d1108f000 RDI: ffff8d3e21aa0000
Jan  9 18:05:28 kvagga kernel: [25869.013718] RBP: ffff9a5a43d77900 R08: 0000000000000000 R09: 0000000000000000
Jan  9 18:05:28 kvagga kernel: [25869.013738] R10: 0000000000000000 R11: ffff8d3e218c5c00 R12: ffff8d3e26a72000
Jan  9 18:05:28 kvagga kernel: [25869.013758] R13: ffff8d3ac0d8c810 R14: 00000000fdb44000 R15: 0000000000010000
Jan  9 18:05:28 kvagga kernel: [25869.013778] FS:  00007f22f36ff700(0000) GS:ffff8d3e3dc80000(0000) knlGS:0000000000000000
Jan  9 18:05:28 kvagga kernel: [25869.013800] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  9 18:05:28 kvagga kernel: [25869.013816] CR2: 0000000000000018 CR3: 000000045ff01000 CR4: 00000000003406e0
Jan  9 18:05:28 kvagga kernel: [25869.013836] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan  9 18:05:28 kvagga kernel: [25869.013855] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan  9 18:05:28 kvagga kernel: [25869.013875] Call Trace:
Jan  9 18:05:28 kvagga kernel: [25869.013897]  gen8_alloc_va_range_3lvl+0xfb/0x9e0 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.013914]  ? swiotlb_map_sg_attrs+0x49/0x110
Jan  9 18:05:28 kvagga kernel: [25869.013939]  gen8_alloc_va_range+0x23d/0x470 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.013967]  i915_vma_bind+0x81/0x120 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.013993]  __i915_vma_do_pin+0x29b/0x440 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.014018]  i915_gem_execbuffer_reserve_vma.isra.31+0x144/0x1b0 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.014047]  i915_gem_execbuffer_reserve.isra.32+0x39e/0x3d0 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.014076]  i915_gem_do_execbuffer.isra.38+0x493/0x1740 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.014094]  ? __schedule+0x23b/0x6f0
Jan  9 18:05:28 kvagga kernel: [25869.014106]  ? schedule+0x36/0x80
Jan  9 18:05:28 kvagga kernel: [25869.014118]  ? futex_wait_queue_me+0xd3/0x120
Jan  9 18:05:28 kvagga kernel: [25869.014142]  i915_gem_execbuffer2+0xa1/0x1e0 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.014163]  drm_ioctl+0x209/0x4c0 [drm]
Jan  9 18:05:28 kvagga kernel: [25869.014186]  ? i915_gem_execbuffer+0x310/0x310 [i915]
Jan  9 18:05:28 kvagga kernel: [25869.014201]  ? do_futex+0x1ff/0x530
Jan  9 18:05:28 kvagga kernel: [25869.014213]  ? __vfs_write+0xe5/0x160
Jan  9 18:05:28 kvagga kernel: [25869.014225]  do_vfs_ioctl+0xa3/0x600
Jan  9 18:05:28 kvagga kernel: [25869.014237]  ? SyS_futex+0x85/0x180
Jan  9 18:05:28 kvagga kernel: [25869.014248]  SyS_ioctl+0x79/0x90
Jan  9 18:05:28 kvagga kernel: [25869.014259]  entry_SYSCALL_64_fastpath+0x1e/0xad
Jan  9 18:05:28 kvagga kernel: [25869.014274] RIP: 0033:0x7f2319eae8b7
Jan  9 18:05:28 kvagga kernel: [25869.014291] RSP: 002b:00007f22f36fe518 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan  9 18:05:28 kvagga kernel: [25869.014313] RAX: ffffffffffffffda RBX: 00007f22e93ba800 RCX: 00007f2319eae8b7
Jan  9 18:05:28 kvagga kernel: [25869.014332] RDX: 00007f22f36fe560 RSI: 0000000040406469 RDI: 0000000000000032
Jan  9 18:05:28 kvagga kernel: [25869.014352] RBP: 00007f22f36fe560 R08: 00007f22dba53400 R09: 0000000000000000
Jan  9 18:05:28 kvagga kernel: [25869.014372] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469
Jan  9 18:05:28 kvagga kernel: [25869.014391] R13: 0000000000000032 R14: 00007f22e9397a40 R15: 0000000000001160
Jan  9 18:05:28 kvagga kernel: [25869.014411] Code: e6 48 8b 90 e0 02 00 00 48 8b b8 98 02 00 00 48 8b 52 08 48 83 ca 03 e8 ca cd ff ff 48 8b 45 b0 48 8b 4d c8 48 8b 10 48 8b 45 d0 <4c> 89 24 ca 48 0f ab 08 0f 1f 44 00 00 e9 53 ff ff ff 65 8b 05 
Jan  9 18:05:28 kvagga kernel: [25869.014489] RIP: gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915] RSP: ffff9a5a43d778a8
Jan  9 18:05:28 kvagga kernel: [25869.014513] CR2: 0000000000000018
Comment 2 Mika Kuoppala 2017-01-11 08:29:02 UTC
Could you compile with CONFIG_DRM_I915_DEBUG_GEM=y, rerun, and then see
if there are any warnings in dmesg prior to the NULL pointer dereference? Thanks.
Comment 3 rockorequin 2017-02-02 06:08:10 UTC
How do I build with CONFIG_DRM_I915_DEBUG_GEM=y? "make menuconfig" in both drm-intel-nightly and the main sources for 4.10-rc6 only produces a definition for "CONFIG_DRM_I915_DEBUG", and if I try setting CONFIG_DRM_I915_DEBUG=y and adding CONFIG_DRM_I915_DEBUG_GEM=y manually, it doesn't survive the build.
Comment 4 Chris Wilson 2017-02-02 08:52:41 UTC
DEBUG_GEM won't detect this error anyway. This has been showing up in earlier kernels where (as in this one) the error would have triggered a WARN equivalent to the GEM_BUG_ON.
Comment 5 rockorequin 2017-02-05 08:10:58 UTC
Is there another way I can provide more information then? I'm loath to test the kernel at the moment if it doesn't provide any info about the crash, because of course I lose any work I'm doing at the time, and it happened about three times in a week, so I'm back on linux-4.9 (which hasn't so far crashed in this way as far as I can tell).
Comment 6 Tomeu Vizoso 2017-02-06 14:17:19 UTC
*** Bug 99684 has been marked as a duplicate of this bug. ***
Comment 7 Chris Wilson 2017-02-06 20:55:50 UTC
From bug 99584:

It looks like we are hitting a use-after-free in gen8_ppgtt_alloc_page_directories with some pdp state. One possible theory from looking at the log is that the shrinker kicks in and starts swinging its axe, evicting one or more vma's, which results in said pdp being freed, I guess we didn't have anything else inserted in that range, which is why it was freed. But all of this could have happened while we were in the middle of allocating a va range for another vma which just so happens to touch the same pdp, and so with a little bad timing the free could have happened just after we check if we need to allocate a new pdp, resulting in all kinds of brokenness. It looks like something similar could also happen with a pd.
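The check-then-use window described in this theory can be illustrated with a deterministic toy model. This is only a sketch under stated assumptions: the `toy_` types and functions are hypothetical stand-ins, not i915 code, and the "shrinker" is simulated by an explicit call between the check and the use rather than by a real concurrent thread.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for a page directory and its parent table. */
struct toy_pd { int used_entries; };
struct toy_pdp { struct toy_pd *slot[4]; };

/* Stand-in for eviction: frees the directory and clears the slot. */
static void toy_evict(struct toy_pdp *pdp, int i)
{
    free(pdp->slot[i]);
    pdp->slot[i] = NULL;
}

/* Returns 1 if the allocator's earlier "already present" decision is stale
 * by the time it wants to use the slot, 0 if not, -1 on allocation failure. */
int toy_alloc_range(int shrinker_runs_in_between)
{
    struct toy_pdp pdp = { { NULL, NULL, NULL, NULL } };
    const int i = 3;
    int already_present, stale;

    pdp.slot[i] = calloc(1, sizeof(*pdp.slot[i]));
    if (!pdp.slot[i])
        return -1;

    already_present = (pdp.slot[i] != NULL);     /* check: no new pd needed */
    if (shrinker_runs_in_between)
        toy_evict(&pdp, i);                      /* free: eviction drops it */
    stale = already_present && pdp.slot[i] == NULL;
    if (!stale)
        pdp.slot[i]->used_entries++;             /* use: only safe if not stale */
    toy_evict(&pdp, i);                          /* cleanup; free(NULL) is fine */
    return stale;
}
```

In this model `toy_alloc_range(0)` returns 0 and `toy_alloc_range(1)` returns 1. The toy detects the stale decision instead of acting on it; real code without such a guard would go on to write through the freed or NULL pointer.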
Comment 8 Chris Wilson 2017-02-08 12:09:46 UTC
(In reply to Chris Wilson from comment #7)
> From bug 99584:
> 
> It looks like we are hitting a use-after-free in
> gen8_ppgtt_alloc_page_directories with some pdp state. One possible theory
> from looking at the log is that the shrinker kicks in and starts swinging
> its axe, evicting one or more vma's, which results in said pdp being freed,

Confirmed with a kselftest that this is exactly what is happening here. Or at least we have a bug exactly like that. The patches to adjust used_pte accounting fix my kselftest.
Comment 9 yann 2017-02-14 13:32:05 UTC
Reference to Chris' patch : https://patchwork.freedesktop.org/patch/138790/
(full patchset: https://patchwork.freedesktop.org/series/19615/)

Please then confirm the issue status with it.
Comment 10 Albert Hopkins 2017-02-14 21:56:17 UTC
I'm also experiencing this issue erratically on 4.10-rc8.  Should I try the patch from comment #9?
Comment 11 Chris Wilson 2017-02-15 15:04:12 UTC
commit dd19674bacba227ae5d3ce680cbc5668198894dc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 15 08:43:46 2017 +0000

    drm/i915: Remove bitmap tracking for used-ptes
Comment 12 Ernst Sjöstrand 2017-02-15 18:15:36 UTC
So that's available on drm-intel-next-queued and drm-intel-nightly?
Comment 13 Chris Wilson 2017-02-15 20:50:29 UTC
It's in dinq (so drm-intel-nightly or drm-tip). Backporting requires a lot of massaging or intervening patches - for just this bug, we could just disable RECLAIM in the allocations to avoid triggering the shrinker.
Comment 14 rockorequin 2017-02-17 07:16:15 UTC
> for just this bug, we could just disable RECLAIM
> in the allocations to avoid triggering the shrinker

Is this easy to do and is it being done for 4.8 final? I tried drm-intel-nightly but the nvidia module doesn't build for it (it does build for 4.10-rc8-generic, though, which seems odd), so I can't use the nvidia card on this intel/nvidia laptop with drm-intel-nightly.
Comment 15 Nicholas Stommel 2017-02-27 13:20:08 UTC
Attempting to apply the patch provided by Chris to the stock 4.10 source fails, and editing the 4.10.x drivers/gpu/drm/i915/i915_gem_gtt.c file to fit the mentioned patch has proved impossible, as there are too many differences and intervening patches across this file and virtually the entire i915 module. Is there any easier way to stop the kernel panic from happening in 4.10.x as a temporary stopgap measure before the fix becomes incorporated into the mainline kernel?
Or, as suggested, how would I disable RECLAIM in the allocations to avoid triggering the shrinker? A fix would be much appreciated.
Comment 16 Jamie Strandboge 2017-03-06 17:52:35 UTC
I'm seeing this on 4.10.1 as well and am curious about progress on a fix that is simpler to backport, whether that is the aforementioned RECLAIM adjustment or something along the lines of "I think by itself moving the used_pte tracking to allocate is enough to prevent the race with the shrinker." (from https://patchwork.freedesktop.org/patch/138790/).
Comment 17 Eric Blau 2017-03-15 21:49:01 UTC
I'm hitting this issue quite often on 4.10.2. Has anyone succeeded in backporting a fix to the 4.10.x stable series?
Comment 18 Nicholas Stommel 2017-03-21 01:52:43 UTC
How long will it take for this fix to 'trickle down' from drm-tip to an actual kernel point release? As far as I can tell from examining the source and Wilson's patch, the release candidate builds for 4.11 don't yet have this fix incorporated either. This is a fairly bad problem, and marking it as 'resolved' doesn't seem appropriate given the problem is still present with no fix for 4.10.x or even 4.11-rcx.
Comment 19 Jani Nikula 2017-03-21 15:09:20 UTC
(In reply to Nicholas Stommel from comment #18)
> This is a fairly bad problem, and marking it
> as 'resolved' doesn't seem appropriate given the problem is still present
> with no fix for 4.10.x or even 4.11-rcx.

Sorry, our tracking is based on our upstream branches, not backports to older kernels. Please do not reopen bugs that have been and remain fixed in drm-tip.
Comment 20 Daniel Drake 2017-04-04 15:39:36 UTC
I appreciate that this is fixed for a future kernel release, but could you be a bit more specific for how we might attempt a 4.10 workaround?

It was mentioned to disable RECLAIM, could you clarify which of the following calls you think should be modified:


i915_gem.c:

i915_gem_object_create() calls mapping_set_gfp_mask() with __GFP_RECLAIMABLE

i915_gem_load_init() twice calls KMEM_CACHE() with SLAB_RECLAIM_ACCOUNT


i915_gem_internal.c:

i915_gem_object_get_pages_internal() calls alloc_pages() with __GFP_RECLAIMABLE


Thanks.
Comment 21 Chris Wilson 2017-04-04 15:49:34 UTC
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index b4bde1452f2a..595277facc2b 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -33,7 +33,7 @@
 #include "intel_drv.h"
 #include "intel_frontbuffer.h"
 
-#define I915_GFP_DMA (GFP_KERNEL | __GFP_HIGHMEM)
+#define I915_GFP_DMA ((GFP_KERNEL | __GFP_HIGHMEM) & __GFP_RECLAIM)
 
 /**
  * DOC: Global GTT views
Comment 22 Chris Wilson 2017-04-04 15:50:00 UTC
Or less broken:

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index b4bde1452f2a..be75ab6a8ed6 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -33,7 +33,7 @@
 #include "intel_drv.h"
 #include "intel_frontbuffer.h"
 
-#define I915_GFP_DMA (GFP_KERNEL | __GFP_HIGHMEM)
+#define I915_GFP_DMA ((GFP_KERNEL | __GFP_HIGHMEM) & ~__GFP_RECLAIM)
 
 /**
  * DOC: Global GTT views
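
The two one-line patches in comments 21 and 22 differ only by a `~`, and the effect is easy to see with plain bit arithmetic. The sketch below uses made-up `TOY_` flag values (the real definitions live in the kernel's gfp headers and vary between versions); the one relationship it relies on is that GFP_KERNEL is composed of the reclaim bits plus __GFP_IO and __GFP_FS.

```c
/* Illustrative bit values only, NOT the kernel's real definitions. */
#define TOY_GFP_HIGHMEM        0x002u
#define TOY_GFP_IO             0x040u
#define TOY_GFP_FS             0x080u
#define TOY_GFP_DIRECT_RECLAIM 0x400u
#define TOY_GFP_KSWAPD_RECLAIM 0x800u

#define TOY_GFP_RECLAIM (TOY_GFP_DIRECT_RECLAIM | TOY_GFP_KSWAPD_RECLAIM)
#define TOY_GFP_KERNEL  (TOY_GFP_RECLAIM | TOY_GFP_IO | TOY_GFP_FS)

/* Comment 21: "& RECLAIM" keeps ONLY the reclaim bits. */
unsigned int toy_mask_comment21(void)
{
    return (TOY_GFP_KERNEL | TOY_GFP_HIGHMEM) & TOY_GFP_RECLAIM;
}

/* Comment 22: "& ~RECLAIM" clears the reclaim bits and keeps the rest. */
unsigned int toy_mask_comment22(void)
{
    return (TOY_GFP_KERNEL | TOY_GFP_HIGHMEM) & ~TOY_GFP_RECLAIM;
}
```

With these toy values the comment 21 mask collapses to just the reclaim bits, the opposite of the intent, while the comment 22 mask keeps the I/O, FS and HIGHMEM bits and drops reclaim, so an allocation using it should not enter reclaim and call back into the shrinker.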
Comment 23 Vasily Khoruzhick 2017-04-12 23:04:07 UTC
(In reply to Chris Wilson from comment #22)
> Or less broken:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c
> b/drivers/gpu/drm/i915/i915_gem_gtt.c
> index b4bde1452f2a..be75ab6a8ed6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> @@ -33,7 +33,7 @@
>  #include "intel_drv.h"
>  #include "intel_frontbuffer.h"
>  
> -#define I915_GFP_DMA (GFP_KERNEL | __GFP_HIGHMEM)
> +#define I915_GFP_DMA ((GFP_KERNEL | __GFP_HIGHMEM) & ~__GFP_RECLAIM)
>  
>  /**
>   * DOC: Global GTT views

Chris, could you send this or a similar patch to linux-stable@? Waiting another 3 months to get the fix doesn't sound good.
Comment 24 freedesktop 2017-04-13 12:50:27 UTC
Is there an easy way to test if I've successfully patched my kernel?

I think I've added the proposed patch (the 2nd one) but my kernel still froze today.
Comment 25 Daniel Drake 2017-04-13 14:38:23 UTC
Anyone can submit the patch for linux-stable review; it does not need to be Chris. But we do need someone to test and confirm the fix first. I don't have the affected hardware.

Can anyone else test this and feed back?

Also, has anyone affected by this issue (other than Chris) tested drm-tip to confirm that the freeze does not occur there?
Comment 26 freedesktop 2017-04-13 15:40:32 UTC
Regarding my comment 24.

The kernel was not correctly patched.  I think I got it right this time.

I've had an oops ~3 times a day.  If another one happens I will report back.
Comment 27 freedesktop 2017-04-14 07:47:55 UTC
My kernel still crashes after having applied the patch.

However the stack-trace is different.  I don't know if this crash is really related.

```
Apr 14 09:39:15 hobbes kernel: Oops: 0002 [#1] PREEMPT SMP
Apr 14 09:39:15 hobbes kernel: Modules linked in: ctr ccm bnep snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic hid_generic uvcvideo v
Apr 14 09:39:15 hobbes kernel:  intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_hda_inte
Apr 14 09:39:15 hobbes kernel:  crc32c_intel ahci libahci xhci_pci xhci_hcd libata rtsx_pci scsi_mod usbcore usb_common i8042 serio
Apr 14 09:39:15 hobbes kernel: CPU: 3 PID: 49 Comm: kswapd0 Tainted: G           O    4.10.9-1-ARCH #1
Apr 14 09:39:15 hobbes kernel: Hardware name: LENOVO 20FCS02400/20FCS02400, BIOS N1FET37W (1.11 ) 03/15/2016
Apr 14 09:39:15 hobbes kernel: task: ffff88021870bb00 task.stack: ffffc90000ffc000
Apr 14 09:39:15 hobbes kernel: RIP: 0010:bitmap_clear+0x3e/0x80
Apr 14 09:39:15 hobbes kernel: RSP: 0000:ffffc90000fffa28 EFLAGS: 00010206
Apr 14 09:39:15 hobbes kernel: RAX: 0000000171426008 RBX: 0000000000000042 RCX: ffffffffffffffff
Apr 14 09:39:15 hobbes kernel: RDX: 00000000000001be RSI: 0000000000000042 RDI: 0000000000000003
Apr 14 09:39:15 hobbes kernel: RBP: ffffc90000fffa28 R08: ffffffffffffffff R09: 0000000171426040
Apr 14 09:39:15 hobbes kernel: R10: 0000000000000180 R11: 0000000000000006 R12: 0000000001000000
Apr 14 09:39:15 hobbes kernel: R13: 00000000e3a42000 R14: ffff8800b4488680 R15: ffff88018f57e000
Apr 14 09:39:15 hobbes kernel: FS:  0000000000000000(0000) GS:ffff880222580000(0000) knlGS:0000000000000000
Apr 14 09:39:15 hobbes kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 14 09:39:15 hobbes kernel: CR2: 0000000171426008 CR3: 0000000002a09000 CR4: 00000000003406e0
Apr 14 09:39:15 hobbes kernel: Call Trace:
Apr 14 09:39:15 hobbes kernel:  gen8_ppgtt_clear_pdp+0x16b/0x4b0 [i915]
Apr 14 09:39:15 hobbes kernel:  gen8_ppgtt_clear_range+0x102/0x1a0 [i915]
Apr 14 09:39:15 hobbes kernel:  ppgtt_unbind_vma+0x21/0x30 [i915]
Apr 14 09:39:15 hobbes kernel:  i915_vma_unbind+0x73/0x300 [i915]
Apr 14 09:39:15 hobbes kernel:  i915_gem_object_unbind+0xc8/0x120 [i915]
Apr 14 09:39:15 hobbes kernel:  i915_gem_shrink+0x227/0x4a0 [i915]
Apr 14 09:39:15 hobbes kernel:  i915_gem_shrinker_scan+0x9d/0xb0 [i915]
Apr 14 09:39:15 hobbes kernel:  shrink_slab.part.14+0x1f1/0x450
Apr 14 09:39:15 hobbes kernel:  shrink_node+0x230/0x320
Apr 14 09:39:15 hobbes kernel:  kswapd+0x31c/0x730
Apr 14 09:39:15 hobbes kernel:  kthread+0x101/0x140
Apr 14 09:39:15 hobbes kernel:  ? mem_cgroup_shrink_node+0x1a0/0x1a0
Apr 14 09:39:15 hobbes kernel:  ? kthread_create_on_node+0x60/0x60
Apr 14 09:39:15 hobbes kernel:  ret_from_fork+0x2c/0x40
Apr 14 09:39:15 hobbes kernel: Code: 8d 54 11 c0 48 8d 04 c7 4c 89 c7 48 d3 e7 48 89 e5 45 85 d2 78 4a 45 89 d3 41 c1 eb 06 44 89 d9 4c 8d 4c c8 08 4
Apr 14 09:39:15 hobbes kernel: RIP: bitmap_clear+0x3e/0x80 RSP: ffffc90000fffa28
Apr 14 09:39:15 hobbes kernel: CR2: 0000000171426008
Apr 14 09:39:15 hobbes kernel: ---[ end trace 699cb04ab4a14163 ]---
```
Comment 28 freedesktop 2017-04-14 13:03:05 UTC
And another crash.  Again another stacktrace.

Before applying the patch the stacktrace was always ending with `gen8_pp...`

It's of course possible that I encounter different bugs now.


Apr 14 14:54:45 hobbes kernel: ------------[ cut here ]------------
Apr 14 14:54:45 hobbes kernel: kernel BUG at drivers/gpu/drm/drm_mm.c:413!
Apr 14 14:54:45 hobbes kernel: invalid opcode: 0000 [#1] PREEMPT SMP
Apr 14 14:54:45 hobbes kernel: Modules linked in: snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic usbhid ctr ccm joydev mousedev arc4 snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic iwlmv
Apr 14 14:54:45 hobbes kernel:  intel_rapl_perf input_leds mac_hid psmouse pcspkr e1000e ptp pps_core i2c_i801 btusb btrtl btbcm btintel bluetooth crc16 option usb_wwan usbserial cdc_ether usbnet mii uvcvideo videobuf2_vmall
Apr 14 14:54:45 hobbes kernel:  crc32c_intel ahci libahci xhci_pci libata xhci_hcd rtsx_pci scsi_mod usbcore usb_common i8042 serio
Apr 14 14:54:45 hobbes kernel: CPU: 2 PID: 391 Comm: Xorg Tainted: G           O    4.10.9-1-ARCH #1
Apr 14 14:54:45 hobbes kernel: Hardware name: LENOVO 20FCS02400/20FCS02400, BIOS N1FET37W (1.11 ) 03/15/2016
Apr 14 14:54:45 hobbes kernel: task: ffff880216c4e740 task.stack: ffffc9000197c000
Apr 14 14:54:45 hobbes kernel: RIP: 0010:drm_mm_insert_node_in_range_generic+0x374/0x3a0 [drm]
Apr 14 14:54:45 hobbes kernel: RSP: 0018:ffffc9000197fa38 EFLAGS: 00010202
Apr 14 14:54:45 hobbes kernel: RAX: ffff8800062bf000 RBX: 0000000096c3f000 RCX: 0000000098b73000
Apr 14 14:54:45 hobbes kernel: RDX: ffff8801b0751a10 RSI: 0000000000000000 RDI: 0000000000000000
Apr 14 14:54:45 hobbes kernel: RBP: ffffc9000197fab8 R08: 0000000096c3f000 R09: 0000000000000000
Apr 14 14:54:45 hobbes kernel: R10: ffffc9000197fb90 R11: 0000000000000009 R12: 0000000000000000
Apr 14 14:54:45 hobbes kernel: R13: 000000009bac6000 R14: ffff880218742000 R15: 000000009bac6000
Apr 14 14:54:45 hobbes kernel: FS:  00007fd9d77da940(0000) GS:ffff880222500000(0000) knlGS:0000000000000000
Apr 14 14:54:45 hobbes kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 14 14:54:45 hobbes kernel: CR2: 00007fd9b3cde000 CR3: 0000000207724000 CR4: 00000000003406e0
Apr 14 14:54:45 hobbes kernel: Call Trace:
Apr 14 14:54:45 hobbes kernel:  __i915_vma_do_pin+0x1f4/0x450 [i915]
Apr 14 14:54:45 hobbes kernel:  i915_gem_execbuffer_reserve_vma.isra.8+0x144/0x1b0 [i915]
Apr 14 14:54:45 hobbes kernel:  i915_gem_execbuffer_reserve.isra.9+0x319/0x3d0 [i915]
Apr 14 14:54:45 hobbes kernel:  i915_gem_do_execbuffer.isra.15+0x62e/0x1810 [i915]
Apr 14 14:54:45 hobbes kernel:  ? i915_gem_free_object+0x58/0x60 [i915]
Apr 14 14:54:45 hobbes kernel:  ? drm_gem_object_free+0x29/0x70 [drm]
Apr 14 14:54:45 hobbes kernel:  ? drm_gem_object_unreference_unlocked+0x3a/0x90 [drm]
Apr 14 14:54:45 hobbes kernel:  i915_gem_execbuffer2+0xc5/0x240 [i915]
Apr 14 14:54:45 hobbes kernel:  drm_ioctl+0x21b/0x4c0 [drm]
Apr 14 14:54:45 hobbes kernel:  ? i915_gem_execbuffer+0x310/0x310 [i915]
Apr 14 14:54:45 hobbes kernel:  do_vfs_ioctl+0xa3/0x5f0
Apr 14 14:54:45 hobbes kernel:  ? __fget+0x77/0xb0
Apr 14 14:54:45 hobbes kernel:  SyS_ioctl+0x79/0x90
Apr 14 14:54:45 hobbes kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa9
Apr 14 14:54:45 hobbes kernel: RIP: 0033:0x7fd9d56380d7
Apr 14 14:54:45 hobbes kernel: RSP: 002b:00007fff1dca1758 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
Apr 14 14:54:45 hobbes kernel: RAX: ffffffffffffffda RBX: 0000000000000018 RCX: 00007fd9d56380d7
Apr 14 14:54:45 hobbes kernel: RDX: 00007fff1dca1790 RSI: 0000000040406469 RDI: 0000000000000018
Apr 14 14:54:45 hobbes kernel: RBP: 00007fff1dca1790 R08: 0000000001da4830 R09: 0000000000000000
Apr 14 14:54:45 hobbes kernel: R10: 0000000000000001 R11: 0000000000003246 R12: 00007fd9d7701000
Apr 14 14:54:45 hobbes kernel: R13: 0000000000000030 R14: 00007fff1dca1790 R15: 00000000009320b0
Apr 14 14:54:45 hobbes kernel: Code: 48 ba 00 02 00 00 00 00 ad de 48 89 7e 10 48 89 56 18 48 8b 75 c0 e9 c5 fe ff ff 85 c9 74 0e 48 29 d6 48 89 75 c0 e9 ac fe ff ff <0f> 0b 8b 45 b0 29 d0 48 01 c6 48 89 75 c0 e9 99 fe ff ff
Apr 14 14:54:45 hobbes kernel: RIP: drm_mm_insert_node_in_range_generic+0x374/0x3a0 [drm] RSP: ffffc9000197fa38
Apr 14 14:54:45 hobbes kernel: ---[ end trace c46646f28c290a90 ]---
Comment 29 sitic 2017-04-17 14:30:13 UTC
Had a crash with a kernel using the patch from comment 22. I'll compile it again with CONFIG_DRM_I915_DEBUG_GEM=y.

Apr 17 16:06:17 desna kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Apr 17 16:06:17 desna kernel: IP: gen8_ppgtt_alloc_page_directories.isra.14+0x11f/0x270 [i915]
Apr 17 16:06:17 desna kernel: PGD 0 
Apr 17 16:06:17 desna kernel: 
Apr 17 16:06:17 desna kernel: Oops: 0002 [#1] PREEMPT SMP
Apr 17 16:06:17 desna kernel: Modules linked in: xfs libcrc32c crc32c_generic nilfs2 jfs nls_utf8 isofs uas usb_storage hid_cherry hid_generic usbhid ctr ccm rfcomm fuse bnep bbswitch(O) uvc
Apr 17 16:06:17 desna kernel:  snd_hda_ext_core i2c_i801 snd_soc_sst_match i2c_hid thermal hid snd_soc_core i915 hci_uart snd_compress btbcm snd_pcm_dmaengine btqca ac97_bus btintel ac snd_h
Apr 17 16:06:17 desna kernel:  xhci_pci libata xhci_hcd scsi_mod usbcore usb_common i8042 serio crc32c_intel [last unloaded: nvidia]
Apr 17 16:06:17 desna kernel: CPU: 1 PID: 7214 Comm: chrome Tainted: P           O    4.10.10-1-SITIC #1
Apr 17 16:06:17 desna kernel: Hardware name: ASUSTeK COMPUTER INC. UX410UQK/UX410UQK, BIOS UX410UQK.301 12/12/2016
Apr 17 16:06:17 desna kernel: task: ffff880264ec0ec0 task.stack: ffffc90004758000
Apr 17 16:06:17 desna kernel: RIP: 0010:gen8_ppgtt_alloc_page_directories.isra.14+0x11f/0x270 [i915]
Apr 17 16:06:17 desna kernel: RSP: 0018:ffffc9000475b890 EFLAGS: 00010286
Apr 17 16:06:17 desna kernel: RAX: ffff880168cc6000 RBX: 0000000000008000 RCX: 0000000000000003
Apr 17 16:06:17 desna kernel: RDX: 0000000000000000 RSI: ffff88015b76a000 RDI: ffff88025f328000
Apr 17 16:06:17 desna kernel: RBP: ffffc9000475b8e8 R08: 0000000000000000 R09: 0000000000000000
Apr 17 16:06:17 desna kernel: R10: 0000000000000000 R11: 0000000000000040 R12: ffff880189f1e000
Apr 17 16:06:17 desna kernel: R13: ffff880164185330 R14: 0000000000000003 R15: 00000000fff98000
Apr 17 16:06:17 desna kernel: FS:  00007fcac4835a00(0000) GS:ffff88026ec80000(0000) knlGS:0000000000000000
Apr 17 16:06:17 desna kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 17 16:06:17 desna kernel: CR2: 0000000000000018 CR3: 0000000160942000 CR4: 00000000003406e0
Apr 17 16:06:17 desna kernel: Call Trace:
Apr 17 16:06:17 desna kernel:  gen8_alloc_va_range_3lvl+0xf7/0x9c0 [i915]
Apr 17 16:06:17 desna kernel:  gen8_alloc_va_range+0x256/0x490 [i915]
Apr 17 16:06:17 desna kernel:  i915_vma_bind+0xab/0x1a0 [i915]
Apr 17 16:06:17 desna kernel:  __i915_vma_do_pin+0x2a5/0x450 [i915]
Apr 17 16:06:17 desna kernel:  i915_gem_execbuffer_reserve_vma.isra.8+0x144/0x1b0 [i915]
Apr 17 16:06:17 desna kernel:  i915_gem_execbuffer_reserve.isra.9+0x39e/0x3d0 [i915]
Apr 17 16:06:17 desna kernel:  i915_gem_do_execbuffer.isra.15+0x62e/0x1810 [i915]
Apr 17 16:06:17 desna kernel:  ? radix_tree_lookup_slot+0x22/0x50
Apr 17 16:06:17 desna kernel:  ? shmem_getpage_gfp+0xf9/0xc50
Apr 17 16:06:17 desna kernel:  i915_gem_execbuffer2+0xc5/0x240 [i915]
Apr 17 16:06:17 desna kernel:  drm_ioctl+0x21b/0x4c0 [drm]
Apr 17 16:06:17 desna kernel:  ? i915_gem_execbuffer+0x310/0x310 [i915]
Apr 17 16:06:17 desna kernel:  ? __seccomp_filter+0x67/0x2a0
Apr 17 16:06:17 desna kernel:  do_vfs_ioctl+0xa3/0x5f0
Apr 17 16:06:17 desna kernel:  ? __fget+0x77/0xb0
Apr 17 16:06:17 desna kernel:  SyS_ioctl+0x79/0x90
Apr 17 16:06:17 desna kernel:  do_syscall_64+0x54/0xc0
Apr 17 16:06:17 desna kernel:  entry_SYSCALL64_slow_path+0x25/0x25
Apr 17 16:06:17 desna kernel: RIP: 0033:0x7fcabdee50d7
Apr 17 16:06:17 desna kernel: RSP: 002b:00007fff64507e78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Apr 17 16:06:17 desna kernel: RAX: ffffffffffffffda RBX: 00002d7a32579000 RCX: 00007fcabdee50d7
Apr 17 16:06:17 desna kernel: RDX: 00007fff64507ec0 RSI: 00000000c0406469 RDI: 000000000000000e
Apr 17 16:06:17 desna kernel: RBP: 00007fff64507ec0 R08: 0000000000000000 R09: 0000000000000000
Apr 17 16:06:17 desna kernel: R10: 0000000000000050 R11: 0000000000000246 R12: 00000000c0406469
Apr 17 16:06:17 desna kernel: R13: 000000000000000e R14: 0000000000000000 R15: 0000000000000000
Apr 17 16:06:17 desna kernel: Code: 49 8b bc 24 d8 02 00 00 48 89 c6 48 89 45 c0 48 8b 52 08 48 83 ca 03 e8 50 e0 ff ff 48 8b 45 b0 48 8b 4d c8 48 8b 10 48 8b 45 c0 <48> 89 04 ca 48 8b 45 d0
Apr 17 16:06:17 desna kernel: RIP: gen8_ppgtt_alloc_page_directories.isra.14+0x11f/0x270 [i915] RSP: ffffc9000475b890
Apr 17 16:06:17 desna kernel: CR2: 0000000000000018
Apr 17 16:06:17 desna kernel: ---[ end trace 5b3658c58f87d1cd ]---
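
The faulting address in this oops can be cross-checked against the register dump. The following is a minimal sketch, assuming the bytes marked at the faulting RIP (`<48> 89 04 ca`) decode to `mov %rax,(%rdx,%rcx,8)`; that decoding was done by hand and is an assumption, and the helper name `effective_address` is hypothetical. The register values are copied from the dump above.

```python
# Register values copied from the oops above (RDX and RCX at the time of the fault).
regs = {"rdx": 0x0000000000000000, "rcx": 0x0000000000000003}


def effective_address(base, index, scale, disp=0):
    """Address of an x86-64 memory operand written as disp(base, index, scale)."""
    return (base + index * scale + disp) & 0xFFFFFFFFFFFFFFFF


# Assumed decoding of "48 89 04 ca": mov %rax,(%rdx,%rcx,8)
fault_address = effective_address(regs["rdx"], regs["rcx"], 8)
print(hex(fault_address))
```

Under that assumption the store targets entry 3 of an array whose base pointer is NULL, which agrees with the reported CR2 value of 0000000000000018.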
Comment 30 freedesktop 2017-04-18 14:27:30 UTC
I am now running tip.

No crash so far.

However, I am using Plasma (KDE) and plasma-shell crashes with:
`intel_do_flush_locked failed: Cannot allocate memory`

But no kernel crash and system is still working fine.

Please tell me if this is expected behavior or if I should file another bug.
(Kind of hard to know how a system should behave when RAM is running low.)
Comment 31 freedesktop 2017-04-20 15:22:27 UTC
3rd day without crashes on tip.

However, as mentioned before, plasmashell regularly exits.
Comment 32 Chris Wilson 2017-05-03 13:24:43 UTC
*** Bug 100875 has been marked as a duplicate of this bug. ***
Comment 33 Chris Wilson 2017-05-03 13:24:50 UTC
*** Bug 100887 has been marked as a duplicate of this bug. ***
Comment 34 Chris Wilson 2017-05-13 08:13:30 UTC
*** Bug 101019 has been marked as a duplicate of this bug. ***
Comment 35 Maël Lavault 2017-05-15 15:42:12 UTC
(In reply to Daniel Drake from comment #25)
> Anyone can submit the patch for linux-stable review, does not need to be
> Chris. But we do need someone to test and confirm the fix first. I don't
> have the affected hardware.
> 
> Can anyone else test this and feed back?
> 
> Also, has anyone affected by this issue (other than Chris) tested drm-tip to
> confirm that the freeze does not occur there?

Dan Aloni did a backport to 4.11; I can confirm it works really well!
https://bugs.freedesktop.org/show_bug.cgi?id=101019
Comment 36 Maël Lavault 2017-05-18 08:41:49 UTC
Created attachment 131401 [details]
Stacktrace

I ran with the fix for a few days without a freeze, but this morning I had a freeze again.

The fix only seems to make it less frequent.

See stacktrace.
Comment 37 Chris Wilson 2017-05-18 10:06:22 UTC
Created attachment 131404 [details] [review]
Do not drop pagetables.
Comment 38 Dan Aloni 2017-05-18 10:43:37 UTC
Chris, should the 'if (!pt->used_ptes) return true;' code in gen8_ppgtt_clear_pt be kept for the later version? (or if you can supply the equivalent patch for 4.12-rc1+ that would be good - I'd back-port it to my stabilized 4.11).
Comment 39 Chris Wilson 2017-05-18 10:53:48 UTC
v4.12 doesn't need a patch, as this has already been fixed upstream; see comment 11.
Comment 40 Dan Aloni 2017-05-18 11:17:52 UTC
Ok, so we'll test this commit with the earlier one on top of 4.11.1 and see how it goes. The Fedora kernel I'm building is based off this branch:

https://github.com/kernelim/linux/commits/drmfix2
Comment 41 Daniel Vetter 2017-05-18 12:58:44 UTC
(In reply to Dan Aloni from comment #40)
> Ok, so we'll test this commit with the earlier one on top of 4.11.1 and see
> how it goes. The Fedora kernel I'm building is based off this branch:
> 
> https://github.com/kernelim/linux/commits/drmfix2

Please test Chris' patch on top of plain upstream, not with other workarounds and patches for this bug applied.
Comment 42 Helio Loureiro 2017-05-19 16:12:13 UTC
Hi,

I'm running on kernel 4.11.1 and got hit by this bug as well.

May 17 20:01:43 elxaf7qtt32 kernel: [41279.298093] restoring control 00000000-0000-0000-0000-000000000101/10/5
May 17 20:01:43 elxaf7qtt32 kernel: [41279.298095] restoring control 00000000-0000-0000-0000-000000000101/12/11
May 17 20:03:40 elxaf7qtt32 kernel: [41395.829850] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
May 17 20:03:40 elxaf7qtt32 kernel: [41395.829922] IP: gen8_ppgtt_alloc_page_directories.isra.39+0x124/0x290 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.829957] PGD 0 
May 17 20:03:40 elxaf7qtt32 kernel: [41395.829958] 
May 17 20:03:40 elxaf7qtt32 kernel: [41395.829979] Oops: 0002 [#1] PREEMPT SMP
May 17 20:03:40 elxaf7qtt32 kernel: [41395.829999] Modules linked in: psmouse usbhid snd_usb_audio snd_usbmidi_lib cpuid cmac ctr ccm nvram msr pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) xfrm_user xfrm_algo br_netfilter xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat overlay nf_nat_ipv4 bridge stp llc ebtable_filter ebtables rfcomm bnep uvcvideo videobuf2_vmalloc btusb videobuf2_memops btrtl videobuf2_v4l2 btbcm btintel videobuf2_core bluetooth videodev media cdc_ether usbnet r8152 mii binfmt_misc dell_wmi dell_laptop dell_smbios dcdbas arc4 iwlmvm mac80211 intel_rapl intel_powerclamp coretemp iwlwifi snd_hda_codec_hdmi rtsx_pci_ms cfg80211 memstick snd_hda_codec_realtek joydev input_leds snd_hda_codec_generic serio_raw snd_soc_rt286 snd_soc_ssm4567 snd_soc_rl6347a snd_soc_core elan_i2c
May 17 20:03:40 elxaf7qtt32 kernel: [41395.830370]  snd_compress ac97_bus snd_hda_intel snd_seq_midi snd_pcm_dmaengine snd_seq_midi_event snd_hda_codec snd_rawmidi snd_hda_core snd_seq intel_pch_thermal snd_hwdep snd_pcm snd_seq_device lpc_ich mei_me shpchp snd_timer mei snd soundcore wmi intel_vbtn soc_button_array int3403_thermal intel_hid sparse_keymap snd_soc_sst_acpi dw_dmac snd_soc_sst_match i2c_designware_platform processor_thermal_device int3400_thermal acpi_pad intel_soc_dts_iosf acpi_thermal_rel acpi_als int3402_thermal i2c_designware_core spi_pxa2xx_platform 8250_dw int340x_thermal_zone tpm_crb kfifo_buf intel_smartconnect mac_hid industrialio kvm_intel ip6t_REJECT nf_reject_ipv6 kvm nf_log_ipv6 irqbypass xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp
May 17 20:03:40 elxaf7qtt32 kernel: [41395.830727]  xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack libcrc32c iptable_filter parport_pc ip_tables ppdev x_tables lp parport autofs4 btrfs xor raid6_pq algif_skcipher af_alg dm_crypt hid_generic hid mmc_block crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc rtsx_pci_sdmmc i915 aesni_intel i2c_algo_bit aes_x86_64 drm_kms_helper crypto_simd glue_helper syscopyarea cryptd sysfillrect sysimgblt fb_sys_fops ahci rtsx_pci drm libahci video sdhci_acpi sdhci [last unloaded: usbhid]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831006] CPU: 1 PID: 8174 Comm: chromium-browse Tainted: G           OE   4.11.1-helio #4
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831047] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A03 03/25/2015
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831091] task: ffff998ccd319d00 task.stack: ffffa2fd43150000
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831160] RIP: 0010:gen8_ppgtt_alloc_page_directories.isra.39+0x124/0x290 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831196] RSP: 0018:ffffa2fd431538b0 EFLAGS: 00010286
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831222] RAX: ffff998c43974700 RBX: 0000000000000003 RCX: 0000000000000003
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831266] RDX: 0000000000000000 RSI: ffff998c5be90000 RDI: 00000000ffffffff
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831310] RBP: ffffa2fd43153910 R08: 0000000000000018 R09: 0000000000000000
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831343] R10: 0000000000000000 R11: 0000000000000000 R12: ffff998c8b448000
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831396] R13: ffff998c703457b0 R14: 00000000fffe7000 R15: 0000000000008000
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831430] FS:  00007f660c42ca40(0000) GS:ffff998d5e480000(0000) knlGS:0000000000000000
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831467] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831495] CR2: 0000000000000018 CR3: 0000000213f3f000 CR4: 00000000003406a0
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831549] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831582] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831625] Call Trace:
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831668]  gen8_alloc_va_range_3lvl+0xc8/0x970 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831697]  ? finish_task_switch+0x83/0x230
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831730]  ? add_hole+0xfd/0x120 [drm]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831767]  gen8_alloc_va_range+0x273/0x440 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831809]  i915_vma_bind+0x85/0x210 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831848]  __i915_vma_do_pin+0x397/0x600 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831887]  ? i915_gem_do_execbuffer.isra.39+0x809/0x15c0 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831932]  i915_gem_execbuffer_reserve_vma.isra.30+0xc8/0x1f0 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.831978]  i915_gem_execbuffer_reserve.isra.31+0x449/0x480 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832023]  i915_gem_do_execbuffer.isra.39+0x526/0x15c0 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832063]  ? krealloc+0x79/0xc0
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832093]  ? reservation_object_get_fences_rcu+0x43/0x240
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832156]  ? i915_gem_object_wait_reservation+0xeb/0x1d0 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832201]  i915_gem_execbuffer2+0xa8/0x220 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832233]  drm_ioctl+0x1fc/0x450 [drm]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832287]  ? i915_gem_execbuffer+0x300/0x300 [i915]
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832315]  do_vfs_ioctl+0xa1/0x5d0
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832334]  ? __fget+0x77/0xb0
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832361]  SyS_ioctl+0x79/0x90
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832380]  do_syscall_64+0x5b/0xc0
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832400]  entry_SYSCALL64_slow_path+0x25/0x25
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832424] RIP: 0033:0x7f65f6732357
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832443] RSP: 002b:00007fff838f47d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832481] RAX: ffffffffffffffda RBX: 000055dc0a74f090 RCX: 00007f65f6732357
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832516] RDX: 00007fff838f4830 RSI: 0000000040406469 RDI: 00000000000000ed
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832551] RBP: 00007fff838f4830 R08: 00000000000000ed R09: 0000000000000036
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832585] R10: 000055dc0bfc7fa0 R11: 0000000000000246 R12: 0000000040406469
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832620] R13: 00000000000000ed R14: 00007f65df5980a0 R15: 000055dc0a774020
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832655] Code: 28 03 00 00 48 8b b8 e0 02 00 00 48 8b 52 08 48 83 ca 03 e8 bf e8 ff ff 48 8b 45 a8 4c 8b 45 c8 48 8b 4d c0 48 8b 10 48 8b 45 d0 <4e> 89 24 02 48 0f ab 08 0f 1f 44 00 00 e9 44 ff ff ff 65 8b 05 
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832791] RIP: gen8_ppgtt_alloc_page_directories.isra.39+0x124/0x290 [i915] RSP: ffffa2fd431538b0
May 17 20:03:40 elxaf7qtt32 kernel: [41395.832833] CR2: 0000000000000018
May 17 20:03:40 elxaf7qtt32 kernel: [41395.843564] ---[ end trace f38e3ace26421bd5 ]---
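
The `Oops: 0002` line in this report carries the x86 page-fault error code. The following is a minimal sketch of how its low bits can be read; the helper name `decode_page_fault_code` is hypothetical. The second part assumes the marked bytes `<4e> 89 24 02` decode to `mov %r12,(%rdx,%r8,1)`, which was worked out by hand and is an assumption, and it uses the RDX and R8 values from the dump above.

```python
def decode_page_fault_code(code):
    """Interpret the low bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(code & 0x01),  # False: page was not present
        "write": bool(code & 0x02),                 # True: the access was a write
        "user_mode": bool(code & 0x04),             # False: fault raised in kernel mode
        "reserved_bit": bool(code & 0x08),
        "instruction_fetch": bool(code & 0x10),
    }


# "Oops: 0002" is printed in hexadecimal.
info = decode_page_fault_code(int("0002", 16))

# Assumed operand (%rdx,%r8,1) with RDX = 0 and R8 = 0x18 from the register dump.
fault_address = 0x0 + 0x18 * 1

print(info)
print(hex(fault_address))
```

Read this way, the oops is a kernel-mode write to a page that is not present, at an address that matches CR2: 0000000000000018.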
Comment 43 Dan Aloni 2017-05-19 16:54:25 UTC
Hi all,

Following Daniel's suggestion, I've posted two modified 4.11.1 builds here:

https://copr.fedorainfracloud.org/coprs/alonid/kernel-4.11-drmfix/builds/

4.11.1-200.drmfixb.fc26 contains only 64b1d89f35
4.11.1-200.drmfix.fc26 contains 64b1d89f35 + my backport of dd19674bacba

Please tell which of them works best.
Comment 44 Daniel Vetter 2017-05-22 07:00:37 UTC
(In reply to Dan Aloni from comment #43)
> Hi all,
> 
> Following Daniel's suggestion, I've posted two modified 4.11.1 builds here:
> 
> https://copr.fedorainfracloud.org/coprs/alonid/kernel-4.11-drmfix/builds/
> 
> 4.11.1-200.drmfixb.fc26 contains only 64b1d89f35
> 4.11.1-200.drmfix.fc26 contains 64b1d89f35 + my backport of dd19674bacba

Git SHA-1s are meaningless outside of your git repo.

> Please tell which of them works best.

We need testing on 4.11.y + Chris' patch: https://bugs.freedesktop.org/attachment.cgi?id=131404 No other backports (except what's in 4.11.x stable of course) please. If you test with that backport (which papers over the bug, but misses a pile of corner-cases and is way too big for stable backporting) then you do not test Chris' patch (since the bug is hidden already by the other patch).
Comment 45 Dan Aloni 2017-05-22 07:52:13 UTC
Fedora builds of 4.11.2 + Chris's patch will be ready in about 4 hours under the repo (4.11.2-200.drmfixb.fc26):

https://copr.fedorainfracloud.org/coprs/alonid/kernel-4.11-drmfix/build/554992/
Comment 46 Helio Loureiro 2017-05-22 11:24:57 UTC
(In reply to Chris Wilson from comment #22)
> Or less broken:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c
> b/drivers/gpu/drm/i915/i915_gem_gtt.c
> index b4bde1452f2a..be75ab6a8ed6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> @@ -33,7 +33,7 @@
>  #include "intel_drv.h"
>  #include "intel_frontbuffer.h"
>  
> -#define I915_GFP_DMA (GFP_KERNEL | __GFP_HIGHMEM)
> +#define I915_GFP_DMA ((GFP_KERNEL | __GFP_HIGHMEM) & ~__GFP_RECLAIM)
>  
>  /**
>   * DOC: Global GTT views

Hi,

I applied this patch on top of 4.11.2. It apparently caught the bug, and the system did not crash.

May 22 11:20:33 elxaf7qtt32 kernel: CPU: 3 PID: 10790 Comm: kscreenlocker_g Tainted: G           OE   4.11.2-i915patch-helio+ #6
May 22 11:20:33 elxaf7qtt32 kernel: Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A03 03/25/2015
May 22 11:20:33 elxaf7qtt32 kernel: Call Trace:
May 22 11:20:33 elxaf7qtt32 kernel:  dump_stack+0x65/0x92
May 22 11:20:33 elxaf7qtt32 kernel:  warn_alloc+0x114/0x1b0
May 22 11:20:33 elxaf7qtt32 kernel:  __alloc_pages_slowpath+0xd69/0xe40
May 22 11:20:33 elxaf7qtt32 kernel:  __alloc_pages_nodemask+0x24b/0x260
May 22 11:20:33 elxaf7qtt32 kernel:  alloc_pages_current+0x95/0x140
May 22 11:20:33 elxaf7qtt32 kernel:  __setup_page_dma+0x21/0x130 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  alloc_pt+0x5e/0xb0 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  gen8_alloc_va_range_3lvl+0x1d7/0x970 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  gen8_alloc_va_range+0x273/0x440 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  i915_vma_bind+0x85/0x210 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  __i915_vma_do_pin+0x397/0x600 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  i915_gem_execbuffer_reserve_vma.isra.30+0xc8/0x1f0 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  i915_gem_execbuffer_reserve.isra.31+0x449/0x480 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  i915_gem_do_execbuffer.isra.39+0x526/0x15c0 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  ? refcount_dec_and_test+0x11/0x20
May 22 11:20:33 elxaf7qtt32 kernel:  ? i915_gem_pwrite_ioctl+0xbe/0x720 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  i915_gem_execbuffer2+0xa8/0x220 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  drm_ioctl+0x1fc/0x450 [drm]
May 22 11:20:33 elxaf7qtt32 kernel:  ? i915_gem_execbuffer+0x300/0x300 [i915]
May 22 11:20:33 elxaf7qtt32 kernel:  ? __dentry_kill+0x11d/0x170
May 22 11:20:33 elxaf7qtt32 kernel:  ? mntput_no_expire+0x2c/0x1c0
May 22 11:20:33 elxaf7qtt32 kernel:  do_vfs_ioctl+0xa1/0x5d0
May 22 11:20:33 elxaf7qtt32 kernel:  ? __fget+0x77/0xb0
May 22 11:20:33 elxaf7qtt32 kernel:  SyS_ioctl+0x79/0x90
May 22 11:20:33 elxaf7qtt32 kernel:  entry_SYSCALL_64_fastpath+0x1e/0xad
May 22 11:20:33 elxaf7qtt32 kernel: RIP: 0033:0x7f3d48da2357
May 22 11:20:33 elxaf7qtt32 kernel: RSP: 002b:00007ffc33b19248 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 22 11:20:33 elxaf7qtt32 kernel: RAX: ffffffffffffffda RBX: 00000000023a33e0 RCX: 00007f3d48da2357
May 22 11:20:33 elxaf7qtt32 kernel: RDX: 00007ffc33b192a0 RSI: 0000000040406469 RDI: 0000000000000008
May 22 11:20:33 elxaf7qtt32 kernel: RBP: 00000000020c5ee0 R08: 0000000000000008 R09: 0000000000000001
May 22 11:20:33 elxaf7qtt32 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000023fed88
May 22 11:20:33 elxaf7qtt32 kernel: R13: 00007ffc33b194c0 R14: 00007f3d4a3b1100 R15: 00007ffc33b19498
May 22 11:20:33 elxaf7qtt32 kernel: Mem-Info:
May 22 11:20:33 elxaf7qtt32 kernel: active_anon:1158734 inactive_anon:339188 isolated_anon:0
                                     active_file:182984 inactive_file:116538 isolated_file:64
                                     unevictable:16 dirty:265 writeback:0 unstable:0
                                     slab_reclaimable:24977 slab_unreclaimable:26928
                                     mapped:294138 shmem:351802 pagetables:29023 bounce:0
                                     free:25050 free_pcp:1066 free_cma:0
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 active_anon:4634936kB inactive_anon:1356752kB active_file:731936kB inactive_file:466152kB unevictable:64kB isolated(anon):0kB isolated(file):2
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 DMA free:15832kB min:136kB low:168kB high:200kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writependi
May 22 11:20:33 elxaf7qtt32 kernel: lowmem_reserve[]: 0 3027 7488 7488
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 DMA32 free:44708kB min:27268kB low:34084kB high:40900kB active_anon:2005440kB inactive_anon:394852kB active_file:299736kB inactive_file:318068
May 22 11:20:33 elxaf7qtt32 kernel: lowmem_reserve[]: 0 0 4460 4460
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 Normal free:39660kB min:40172kB low:50212kB high:60252kB active_anon:2628924kB inactive_anon:961708kB active_file:433156kB inactive_file:14716
May 22 11:20:33 elxaf7qtt32 kernel: lowmem_reserve[]: 0 0 0 0
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15832kB
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 DMA32: 465*4kB (UME) 350*8kB (UM) 228*16kB (UM) 462*32kB (UM) 184*64kB (UM) 48*128kB (UM) 4*256kB (ME) 2*512kB (ME) 2*1024kB (E) 0*2048kB 0*40
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 Normal: 4705*4kB (UMH) 2148*8kB (UMEH) 190*16kB (MEH) 9*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 40100kB
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 22 11:20:33 elxaf7qtt32 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 22 11:20:33 elxaf7qtt32 kernel: 658209 total pagecache pages
May 22 11:20:33 elxaf7qtt32 kernel: 6762 pages in swap cache
May 22 11:20:33 elxaf7qtt32 kernel: Swap cache stats: add 187215, delete 180458, find 205823/212955
May 22 11:20:33 elxaf7qtt32 kernel: Free swap  = 7555980kB
May 22 11:20:33 elxaf7qtt32 kernel: Total swap = 7925756kB
May 22 11:20:33 elxaf7qtt32 kernel: 1982055 pages RAM
May 22 11:20:33 elxaf7qtt32 kernel: 0 pages HighMem/MovableOnly
May 22 11:20:33 elxaf7qtt32 kernel: 51817 pages reserved
May 22 11:20:33 elxaf7qtt32 kernel: 0 pages cma reserved

I just noticed the kernel log after checking w/ journalctl.
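
For reference, the effect of the quoted `I915_GFP_DMA` change on the allocation mask can be sketched with plain bit arithmetic. This is only an illustration: the numeric flag values below are assumptions for a 4.11-era kernel and should be checked against `include/linux/gfp.h` for the tree in question.

```python
# Assumed bit values (illustrative; verify against include/linux/gfp.h).
GFP_HIGHMEM = 0x02
GFP_IO = 0x40
GFP_FS = 0x80
GFP_DIRECT_RECLAIM = 0x400000
GFP_KSWAPD_RECLAIM = 0x1000000

# __GFP_RECLAIM covers both reclaim paths; GFP_KERNEL includes it.
GFP_RECLAIM = GFP_DIRECT_RECLAIM | GFP_KSWAPD_RECLAIM
GFP_KERNEL = GFP_RECLAIM | GFP_IO | GFP_FS

old_mask = GFP_KERNEL | GFP_HIGHMEM                   # before the patch
new_mask = (GFP_KERNEL | GFP_HIGHMEM) & ~GFP_RECLAIM  # after the patch

print(hex(old_mask), hex(new_mask))
```

With these assumed values the patched mask evaluates to 0xc2, which agrees with the `mode:0xc2(__GFP_HIGHMEM|__GFP_IO|__GFP_FS)` string printed in a later allocation failure report in this thread. With both reclaim bits cleared, the allocation is not allowed to reclaim memory, so it can fail under memory pressure and produce a report like the one above.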
Comment 47 Daniel Vetter 2017-05-22 14:56:06 UTC
(In reply to Helio Loureiro from comment #46)
> (In reply to Chris Wilson from comment #22)
> > Or less broken:
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > index b4bde1452f2a..be75ab6a8ed6 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > @@ -33,7 +33,7 @@
> >  #include "intel_drv.h"
> >  #include "intel_frontbuffer.h"
> >  
> > -#define I915_GFP_DMA (GFP_KERNEL | __GFP_HIGHMEM)
> > +#define I915_GFP_DMA ((GFP_KERNEL | __GFP_HIGHMEM) & ~__GFP_RECLAIM)
> >  
> >  /**
> >   * DOC: Global GTT views
> 
> Hi,
> 
> I applied such patch on top of 4.11.2.  It apparently caught the bug and
> didn't crash the usage.

This patch is already rejected for backporting, testing it doesn't move this bug forward. Please instead test Chris' latest patch (and only that, no other hacks/backports applied please) attachment #2 [details] [review].

> I just noticed the kernel log after checking w/ journalctl.

This is just an OOM report (with a few lines missing at the beginning), which is somewhat expected with Chris' patch. It explains why kwin dies in execbuf with "out of memory" at least.
-Daniel
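
The per-zone figures in the quoted Mem-Info dump can be compared against the min watermarks with a little arithmetic. This is a simplified sketch, not the kernel's exact watermark logic: it assumes 4 kB pages, assumes the request may use zones up to Normal (index 2 of `lowmem_reserve[]`), and the helper name `above_min_watermark` is hypothetical. The free figures are only the snapshot taken when the report was printed.

```python
PAGE_KB = 4     # assumed page size in kB
NORMAL_IDX = 2  # assumed highest usable zone for this request (DMA=0, DMA32=1, Normal=2)

# free/min in kB and lowmem_reserve[] in pages, copied from the Mem-Info dump.
zones = {
    "DMA":    {"free_kb": 15832, "min_kb": 136,   "reserve_pages": [0, 3027, 7488, 7488]},
    "DMA32":  {"free_kb": 44708, "min_kb": 27268, "reserve_pages": [0, 0, 4460, 4460]},
    "Normal": {"free_kb": 39660, "min_kb": 40172, "reserve_pages": [0, 0, 0, 0]},
}


def above_min_watermark(zone, class_idx):
    """True if free pages exceed the min watermark plus the zone's lowmem reserve."""
    free_pages = zone["free_kb"] // PAGE_KB
    needed = zone["min_kb"] // PAGE_KB + zone["reserve_pages"][class_idx]
    return free_pages > needed


result = {name: above_min_watermark(z, NORMAL_IDX) for name, z in zones.items()}
print(result)
```

Under these assumptions none of the three zones clears its min watermark, which is consistent with an order-0 allocation failing once it is no longer allowed to reclaim.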
Comment 48 Daniel Vetter 2017-05-22 14:56:57 UTC
(In reply to Daniel Vetter from comment #47)
> This is just an OOM report (with a few lines missing at the beginning),
> which is somewhat expected with Chris' patch. It explains why kwin dies in
> execbuf with "out of memory" at least.

Correction: Your screensaver (kscreenlocker) is the thing that died.
Comment 49 Daniel Vetter 2017-05-22 19:06:38 UTC
(In reply to Daniel Vetter from comment #47)
> This patch is already rejected for backporting, testing it doesn't move this
> bug forward. Please instead test Chris' latest patch (and only that, no other
> hacks/backports applied please) attachment #2 [details] [review].

Meh, bugzilla failed me (or I failed bugzilla), the patch I meant was 
https://bugs.freedesktop.org/attachment.cgi?id=131404

I haz no idea how to reference stuff here :(
Comment 50 Helio Loureiro 2017-05-23 09:18:43 UTC
(In reply to Daniel Vetter from comment #48)
> (In reply to Daniel Vetter from comment #47)
> > This is just an OOM report (with a few lines missing at the beginning),
> > which is somewhat expected with Chris' patch. It explains why kwin dies in
> > execbuf with "out of memory" at least.
> 
> Correction: Your screensaver (kscreenlocker) is the thing that died.

This time, yes. But it has happened randomly to various applications running on X.

Just got another one:

May 23 11:04:20 elxaf7qtt32 kernel: kwin_x11: page allocation failure: order:0, mode:0xc2(__GFP_HIGHMEM|__GFP_IO|__GFP_FS), nodemask=(null)
May 23 11:04:20 elxaf7qtt32 kernel: kwin_x11 cpuset=/ mems_allowed=0
May 23 11:04:20 elxaf7qtt32 kernel: CPU: 0 PID: 25038 Comm: kwin_x11 Tainted: G           OE   4.11.2-i915patch-helio+ #6
May 23 11:04:20 elxaf7qtt32 kernel: Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A03 03/25/2015
May 23 11:04:20 elxaf7qtt32 kernel: Call Trace:
May 23 11:04:20 elxaf7qtt32 kernel:  dump_stack+0x65/0x92


If I search my logs:

ehellou@elxaf7qtt32:~$ journalctl -k | egrep "BUG:|RIP:|CPU:"
May 22 11:20:33 elxaf7qtt32 kernel: CPU: 3 PID: 10790 Comm: kscreenlocker_g Tainted: G           OE   4.11.2-i915patch-helio+ #6
May 22 11:20:33 elxaf7qtt32 kernel: RIP: 0033:0x7f3d48da2357
May 22 15:06:07 elxaf7qtt32 kernel: CPU: 2 PID: 28528 Comm: kwin_x11 Tainted: G           OE   4.11.2-i915patch-helio+ #6
May 22 15:06:07 elxaf7qtt32 kernel: RIP: 0033:0x7f7bbef1b357
May 23 11:04:20 elxaf7qtt32 kernel: CPU: 0 PID: 25038 Comm: kwin_x11 Tainted: G           OE   4.11.2-i915patch-helio+ #6
May 23 11:04:20 elxaf7qtt32 kernel: RIP: 0033:0x7f345f6eb357

Before, it was just freezing and I was forced to reboot.

But I will take a look at your recommended patch at https://bugs.freedesktop.org/attachment.cgi?id=131404
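
The `journalctl -k | egrep` search above lists every report header. A small sketch of how the same lines could be tallied per process follows; the three sample lines are the `CPU:` lines from the output above, and the regular expression is an assumption about the header layout rather than a stable kernel format.

```python
import re
from collections import Counter

# "CPU:" header lines copied from the journalctl output above.
log_lines = [
    "May 22 11:20:33 elxaf7qtt32 kernel: CPU: 3 PID: 10790 Comm: kscreenlocker_g Tainted: G           OE   4.11.2-i915patch-helio+ #6",
    "May 22 15:06:07 elxaf7qtt32 kernel: CPU: 2 PID: 28528 Comm: kwin_x11 Tainted: G           OE   4.11.2-i915patch-helio+ #6",
    "May 23 11:04:20 elxaf7qtt32 kernel: CPU: 0 PID: 25038 Comm: kwin_x11 Tainted: G           OE   4.11.2-i915patch-helio+ #6",
]

# Assumed layout: the process name follows "Comm: " and contains no spaces.
comm_pattern = re.compile(r"Comm: (\S+)")

counts = Counter()
for line in log_lines:
    match = comm_pattern.search(line)
    if match:
        counts[match.group(1)] += 1

for comm, count in counts.most_common():
    print(comm, count)
```

For the three reports listed here, this counts two for kwin_x11 and one for kscreenlocker_g.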
Comment 51 Maël Lavault 2017-05-23 15:02:13 UTC
I've been testing 4.11.1-drmfixb and I haven't had any issues so far.

I get some messages in the journal but no freeze:

[drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
Comment 52 Helio Loureiro 2017-05-27 16:20:26 UTC
(In reply to Maël Lavault from comment #51)
> I've been testing 4.11.1-drmfixb and I haven't had any issues so far.
> 
> I get some messages in the journal but no freeze:
> 
> [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO
> underrun

Hi,

Same here.  I manually applied the changes on top of kernel 4.11.2.  I see messages like this:

May 25 19:30:04 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
May 25 19:39:58 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
May 26 16:17:15 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
May 26 20:26:10 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
May 26 21:02:39 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
May 26 21:03:10 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
May 26 21:39:46 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
May 27 14:01:40 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
May 27 18:09:44 elxaf7qtt32 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

but nothing else.  No hangs, no lags, nothing.
Comment 53 Jani Nikula 2017-05-30 12:18:18 UTC
The stable backport request
http://mid.mail-archive.com/20170526082906.8982-1-daniel.vetter@ffwll.ch
Comment 54 Maël Lavault 2017-06-12 08:35:04 UTC
In which 4.11.x kernel version will it be backported?
Comment 55 Kyle 2017-06-14 16:11:25 UTC
Hey all,

This bug had been hitting me pretty badly on a Dell Precision 5510; falling back to the 4.9 kernel has resolved it.

Looks like 4.11.5 was just released[0], but it doesn't look like this patch made it in.  Can anyone confirm?


[0] https://lwn.net/Articles/725369/
Comment 56 Eric Blau 2017-06-14 16:53:44 UTC
No, the fix has not made it into any 4.11.y release yet, including 4.11.5.
Comment 57 Patrick Decat 2017-06-16 10:49:03 UTC
Applied on top of git://kernel.ubuntu.com/ubuntu/ubuntu-zesty.git Ubuntu-4.10.0-23.25 tag, this patch makes crashes happen way less often but they still happen:

Jun 16 12:21:15 patrickxps kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Jun 16 12:21:15 patrickxps kernel: IP: gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915]
Jun 16 12:21:15 patrickxps kernel: PGD 0
Jun 16 12:21:15 patrickxps kernel:
Jun 16 12:21:15 patrickxps kernel: Oops: 0002 [#1] SMP
Jun 16 12:21:15 patrickxps kernel: Modules linked in: ccm rfcomm xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4
Jun 16 12:21:15 patrickxps kernel: videobuf2_v4l2 input_leds btusb joydev btrtl serio_raw btbcm videobuf2_core btintel bluetooth videodev media dell_led dell_smbios hid_multitouch dcdbas snd_soc_rt298 snd_s
Jun 16 12:21:15 patrickxps kernel: int3400_thermal kfifo_buf mei shpchp intel_soc_dts_iosf int3406_thermal acpi_thermal_rel int340x_thermal_zone intel_hid industrialio sparse_keymap mac_hid intel_smartconne
Jun 16 12:21:15 patrickxps kernel: CPU: 0 PID: 1675 Comm: Xorg Tainted: P W O 4.10.0-23-generic #25
Jun 16 12:21:15 patrickxps kernel: Hardware name: Dell Inc. XPS 13 9343/0310JH, BIOS A12 05/09/2017
Jun 16 12:21:15 patrickxps kernel: task: ffff9ddf0dec1680 task.stack: ffffc18303234000
Jun 16 12:21:15 patrickxps kernel: RIP: 0010:gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915]
Jun 16 12:21:15 patrickxps kernel: RSP: 0018:ffffc18303237880 EFLAGS: 00010246
Jun 16 12:21:15 patrickxps kernel: RAX: ffff9dde000c1b40 RBX: 0000000000000003 RCX: 0000000000000003
Jun 16 12:21:15 patrickxps kernel: RDX: 0000000000000000 RSI: ffff9ddf07747000 RDI: ffff9ddf0b3f0000
Jun 16 12:21:15 patrickxps kernel: RBP: ffffc183032378d8 R08: 0000000000000000 R09: 0000000000000000
Jun 16 12:21:15 patrickxps kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff9dde001a2000
Jun 16 12:21:15 patrickxps kernel: R13: ffff9ddea3fb70d0 R14: 00000000fc73e000 R15: 0000000000010000
Jun 16 12:21:15 patrickxps kernel: FS: 00007fabaecbca40(0000) GS:ffff9ddf1f400000(0000) knlGS:0000000000000000
Jun 16 12:21:15 patrickxps kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 16 12:21:15 patrickxps kernel: CR2: 0000000000000018 CR3: 000000020fe66000 CR4: 00000000003406f0
Jun 16 12:21:15 patrickxps kernel: Call Trace:
Jun 16 12:21:15 patrickxps kernel: gen8_alloc_va_range_3lvl+0xfb/0x9e0 [i915]
Jun 16 12:21:15 patrickxps kernel: ? swiotlb_map_sg_attrs+0x49/0x110
Jun 16 12:21:15 patrickxps kernel: gen8_alloc_va_range+0x23d/0x470 [i915]
Jun 16 12:21:15 patrickxps kernel: i915_vma_bind+0x7e/0x170 [i915]
Jun 16 12:21:15 patrickxps kernel: __i915_vma_do_pin+0x2a5/0x450 [i915]
Jun 16 12:21:15 patrickxps kernel: i915_gem_execbuffer_reserve_vma.isra.31+0x144/0x1b0 [i915]
Jun 16 12:21:15 patrickxps kernel: i915_gem_execbuffer_reserve.isra.32+0x39e/0x3d0 [i915]
Jun 16 12:21:15 patrickxps kernel: i915_gem_do_execbuffer.isra.38+0x4ca/0x15c0 [i915]
Jun 16 12:21:15 patrickxps kernel: i915_gem_execbuffer2+0xa1/0x1e0 [i915]
Jun 16 12:21:15 patrickxps kernel: drm_ioctl+0x21b/0x4c0 [drm]
Jun 16 12:21:15 patrickxps kernel: ? i915_gem_execbuffer+0x310/0x310 [i915]
Jun 16 12:21:15 patrickxps kernel: do_vfs_ioctl+0xa3/0x610
Jun 16 12:21:15 patrickxps kernel: ? __audit_syscall_entry+0xad/0xf0
Jun 16 12:21:15 patrickxps kernel: ? syscall_trace_enter+0x1d9/0x2e0
Jun 16 12:21:15 patrickxps kernel: SyS_ioctl+0x79/0x90
Jun 16 12:21:15 patrickxps kernel: do_syscall_64+0x5b/0xc0
Jun 16 12:21:15 patrickxps kernel: entry_SYSCALL64_slow_path+0x25/0x25
Jun 16 12:21:15 patrickxps kernel: RIP: 0033:0x7fabac6d1987
Jun 16 12:21:15 patrickxps kernel: RSP: 002b:00007ffc93b26ac8 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
Jun 16 12:21:15 patrickxps kernel: RAX: ffffffffffffffda RBX: 000055b84c667030 RCX: 00007fabac6d1987
Jun 16 12:21:15 patrickxps kernel: RDX: 00007ffc93b26b10 RSI: 00000000c0406469 RDI: 000000000000000e
Jun 16 12:21:15 patrickxps kernel: RBP: 00007ffc93b26b10 R08: 0000000000000000 R09: 0000000000000000
Jun 16 12:21:15 patrickxps kernel: R10: 0000000000000348 R11: 0000000000003246 R12: 00000000c0406469
Jun 16 12:21:15 patrickxps kernel: R13: 000000000000000e R14: 0000000000000000 R15: 0000000000000000
Jun 16 12:21:15 patrickxps kernel: Code: e6 48 8b 90 20 03 00 00 48 8b b8 d8 02 00 00 48 8b 52 08 48 83 ca 03 e8 aa cd ff ff 48 8b 45 b0 48 8b 4d c8 48 8b 10 48 8b 45 d0 <4c> 89 24 ca 48 0f ab 08 0f 1f 44 00
Jun 16 12:21:15 patrickxps kernel: RIP: gen8_ppgtt_alloc_page_directories.isra.38+0x115/0x250 [i915] RSP: ffffc18303237880
Jun 16 12:21:15 patrickxps kernel: CR2: 0000000000000018
Jun 16 12:21:15 patrickxps kernel: ---[ end trace 1f97c57229ff8402 ]---
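
For anyone reading the trace above, here is a rough sketch of how the reported fault address can be cross-checked against the register dump. It assumes my own reading of the marked bytes in the Code: line (`<4c> 89 24 ca`) as `mov %r12,(%rdx,%rcx,8)`, i.e. a store to RDX + RCX*8; the helper names are made up for illustration and are not part of any kernel tooling.

```python
# Register values copied from the oops above.
RDX = 0x0000000000000000  # base register of the faulting store (assumed from the decode)
RCX = 0x0000000000000003  # index register of the faulting store (assumed from the decode)
CR2 = 0x0000000000000018  # faulting address reported by the CPU


def fault_address(base, index, scale=8):
    """Effective address of a base + index*scale memory operand."""
    return base + index * scale


def decode_page_fault_code(code):
    """Decode the low three bits of an x86 page-fault error code ("Oops: 0002")."""
    return {
        "page_present": bool(code & 0x1),  # 0 means the page was not mapped
        "write": bool(code & 0x2),         # 1 means the access was a write
        "user_mode": bool(code & 0x4),     # 0 means the fault happened in kernel mode
    }


if __name__ == "__main__":
    addr = fault_address(RDX, RCX)
    print("computed address:", hex(addr), "matches CR2:", addr == CR2)
    print("error code 0002:", decode_page_fault_code(0x0002))
```

If that reading is right, the trace shows a kernel-mode write through a NULL base pointer at slot 3 (offset 0x18), which is the same signature as the original report.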
Comment 58 Chris Wilson 2017-06-28 13:37:44 UTC
*** Bug 101625 has been marked as a duplicate of this bug. ***
Comment 59 Jonathan Ganc 2017-08-18 16:41:07 UTC
Can someone clarify if this has been backported to 4.10? I am running Ubuntu 17.04 (kernel 4.10.0-32-generic) and hit what I think is this same bug.
Comment 60 Jani Nikula 2017-08-21 11:48:05 UTC
(In reply to Jonathan Ganc from comment #59)
> Can someone clarify if this has been backported to 4.10? I am running Ubuntu
> 17.04 (kernel 4.10.0-32-generic) and hit what I think is this same bug.

It's been backported to v4.11.6+.
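
A quick way to compare a running kernel against that version is sketched below. It only compares upstream version numbers, and the function names are hypothetical; distribution kernels such as Ubuntu's 4.10.0-32-generic may carry their own backports, so a "False" here does not prove the fix is missing.

```python
import re


def kernel_version_tuple(release):
    """Parse the leading 'X.Y[.Z]' of a kernel release string into a tuple of ints."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def at_least(release, minimum=(4, 11, 6)):
    """True if the upstream version in `release` is >= `minimum`."""
    return kernel_version_tuple(release) >= minimum


if __name__ == "__main__":
    for release in ("4.10.0-32-generic", "4.11.5", "v4.11.6"):
        print(release, at_least(release))
```

On a live system the release string can be taken from `platform.release()` in the Python standard library.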

