Bug 99947 - [BDW] igt_ppgtt_lowlevel, GEM_BUG_ON(num_entries > pt->used_ptes)
Summary: [BDW] igt_ppgtt_lowlevel, GEM_BUG_ON(num_entries > pt->used_ptes)
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2017-02-24 18:11 UTC by mwa
Modified: 2017-02-27 16:20 UTC

i915 platform: BDW
i915 features: GEM/PPGTT


Attachments
dmesg (77.07 KB, text/plain)
2017-02-24 18:11 UTC, mwa

Description mwa 2017-02-24 18:11:48 UTC
Created attachment 129902
dmesg

Consistently reproducible when using an increased timeout.

[  135.370764] kernel BUG at drivers/gpu/drm/i915/i915_gem_gtt.c:683!
[  135.370766] invalid opcode: 0000 [#1] SMP
[  135.370766] Modules linked in: i915(+) drm_kms_helper drm rfcomm fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security ip6table_raw ip6table_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat libcrc32c nf_conntrack iptable_security iptable_raw iptable_mangle ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep arc4 iwlmvm intel_rapl mac80211 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iwlwifi mei_wdt iTCO_wdt iTCO_vendor_support irqbypass intel_cstate btusb snd_hda_codec_realtek snd_hda_codec_hdmi intel_uncore
[  135.370792]  snd_hda_codec_generic btrtl cfg80211 btbcm btintel snd_hda_codec intel_rapl_perf bluetooth snd_hwdep snd_hda_core i2c_i801 intel_pch_thermal snd_seq joydev rtsx_pci_ms snd_seq_device memstick mei_me nfsd mei snd_pcm shpchp lpc_ich thinkpad_acpi wmi auth_rpcgss snd_timer snd rfkill tpm_tis tpm_tis_core intel_rst soundcore tpm nfs_acl lockd grace sunrpc binfmt_misc dm_crypt hid_logitech_hidpp hid_logitech_dj hid_microsoft prime_numbers i2c_algo_bit rtsx_pci_sdmmc mmc_core e1000e crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ptp rtsx_pci pps_core serio_raw fjes video [last unloaded: drm]
[  135.370817] CPU: 1 PID: 2921 Comm: drv_selftest Tainted: G     U          4.10.0-debug+ #251
[  135.370818] Hardware name: LENOVO 20BW000FUK/20BW000FUK, BIOS JBET54WW (1.19 ) 11/06/2015
[  135.370820] task: ffff8ed6ee6bd7c0 task.stack: ffffae104194c000
[  135.370867] RIP: 0010:gen8_ppgtt_clear_pd+0x19a/0x250 [i915]
[  135.370868] RSP: 0000:ffffae104194f860 EFLAGS: 00010287
[  135.370870] RAX: 0000000000000000 RBX: 000003147c800000 RCX: 0000000000800000
[  135.370871] RDX: 0000000000000200 RSI: ffff8ed4cb778000 RDI: 0000000000000200
[  135.370872] RBP: ffffae104194f8a0 R08: ffff8ed6e93a2000 R09: ffff8ed6ee6bd7c0
[  135.370872] R10: ffff8ed6629f8ba0 R11: 0000000000000000 R12: ffff8ed6ee6bd7c0
[  135.370873] R13: 0000000000800000 R14: 00000000000001e4 R15: ffff8ed4cb778000
[  135.370875] FS:  00007f71799a9dc0(0000) GS:ffff8ed6fd840000(0000) knlGS:0000000000000000
[  135.370876] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  135.370876] CR2: 00007fa92a7e701d CR3: 000000022d992000 CR4: 00000000003406e0
[  135.370878] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  135.370878] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  135.370879] Call Trace:
[  135.370918]  gen8_ppgtt_clear_pdp+0x97/0xf0 [i915]
[  135.370953]  gen8_ppgtt_clear_4lvl+0xa4/0xe0 [i915]
[  135.370987]  lowlevel_hole+0x431/0x4f0 [i915]
[  135.371021]  exercise_ppgtt+0xac/0x110 [i915]
[  135.371054]  ? pot_hole+0x300/0x300 [i915]
[  135.371086]  igt_ppgtt_lowlevel+0x15/0x20 [i915]
[  135.371127]  __i915_subtests+0x3c/0xc0 [i915]
[  135.371160]  i915_gem_gtt_live_selftests+0x2f/0x40 [i915]
[  135.371197]  __run_selftests+0x113/0x1c0 [i915]
[  135.371233]  i915_live_selftests+0x35/0x60 [i915]
[  135.371266]  i915_pci_probe+0x67/0xb0 [i915]
[  135.371269]  local_pci_probe+0x45/0xa0
[  135.371271]  pci_device_probe+0x103/0x150
[  135.371273]  driver_probe_device+0x2bb/0x460
[  135.371275]  __driver_attach+0xdf/0xf0
[  135.371277]  ? driver_probe_device+0x460/0x460
[  135.371278]  bus_for_each_dev+0x6c/0xc0
[  135.371279]  driver_attach+0x1e/0x20
[  135.371281]  bus_add_driver+0x170/0x270
[  135.371282]  driver_register+0x60/0xe0
[  135.371284]  __pci_register_driver+0x4c/0x50
[  135.371316]  i915_init+0x6f/0x78 [i915]
[  135.371317]  ? 0xffffffffc0488000
[  135.371319]  do_one_initcall+0x52/0x1a0
[  135.371322]  ? __vunmap+0x81/0xd0
[  135.371324]  ? kmem_cache_alloc_trace+0x167/0x1c0
[  135.371326]  ? do_init_module+0x27/0x1f8
[  135.371328]  do_init_module+0x5f/0x1f8
[  135.371330]  load_module+0x25d7/0x29b0
[  135.371332]  ? __symbol_put+0x70/0x70
[  135.371333]  ? vfs_read+0x11b/0x130
[  135.371335]  SYSC_finit_module+0xdf/0x110
[  135.371337]  SyS_finit_module+0xe/0x10
[  135.371340]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  135.371341] RIP: 0033:0x7f71781debf9
[  135.371341] RSP: 002b:00007ffdef4d23d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  135.371343] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f71781debf9
[  135.371343] RDX: 0000000000000000 RSI: 00000000015e3740 RDI: 0000000000000008
[  135.371344] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
[  135.371345] R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffdef4d13d0
[  135.371345] R13: 00007ffdef4d13b0 R14: 0000000000000005 R15: 00000000015e9c10
[  135.371346] Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c1 e8 0c 45 8b 5a 10 ba 00 02 00 00 25 ff 01 00 00 29 c2 41 39 d3 8d 3c 10 0f 83 02 ff ff ff <0f> 0b 49 8b b0 48 03 00 00 44 89 f2 4c 89 ff 4c 89 4d c0 4c 89 
[  135.371403] RIP: gen8_ppgtt_clear_pd+0x19a/0x250 [i915] RSP: ffffae104194f860
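For readers decoding the oops, the following is a minimal sketch of how a PPGTT address splits into per-level table indices, assuming the usual gen8 layout of 4 KiB pages and 512 entries per level. The `toy_*` names are hypothetical and are not the driver's helpers. Reading RBX (0x3147c800000) as the start address and RCX (0x800000) as the length is also an assumption, although the values it produces line up with RAX (0x0), R14 (0x1e4) and RDX (0x200) in the register dump above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 4 KiB pages, 512 (2^9) entries per level, four levels. */
#define TOY_PAGE_SHIFT 12
#define TOY_LEVEL_BITS 9
#define TOY_ENTRIES    (1u << TOY_LEVEL_BITS) /* 512 */

/* Index into the table at `level` (0 = PT, 1 = PD, 2 = PDP, 3 = PML4). */
static unsigned int toy_index(uint64_t addr, unsigned int level)
{
	assert(level < 4);
	return (unsigned int)((addr >> (TOY_PAGE_SHIFT + TOY_LEVEL_BITS * level)) &
			      (TOY_ENTRIES - 1));
}

/*
 * Number of PTEs that a clear of [start, start + length) touches in the
 * first page table: it stops at the end of the table or of the range,
 * whichever comes first. Assumes a page-aligned length.
 */
static unsigned int toy_pte_count(uint64_t start, uint64_t length)
{
	uint64_t pages = length >> TOY_PAGE_SHIFT;
	unsigned int room = TOY_ENTRIES - toy_index(start, 0);

	return pages < room ? (unsigned int)pages : room;
}

/* The condition named in the bug title: clearing more than was ever used. */
static int toy_would_bug(unsigned int num_entries, unsigned int used_ptes)
{
	return num_entries > used_ptes;
}
```

With those assumed inputs, `toy_index(0x3147c800000, 0)` is 0, `toy_index(0x3147c800000, 1)` is 0x1e4, and `toy_pte_count(0x3147c800000, 0x800000)` is 0x200, so the clear covers a whole page table and any `used_ptes` count below 512 would trip the assertion.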
Comment 1 Chris Wilson 2017-02-24 18:20:20 UTC
First suspicion would be the allocate_va_range error handling. Hmm, we also need to tweak the timeout handling: it should check for the timeout and quit before expanding.
Comment 2 Chris Wilson 2017-02-24 18:23:09 UTC
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
index e23753181720..6bac267914df 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
@@ -237,18 +237,19 @@ static int lowlevel_hole(struct drm_i915_private *i915,
 
                        GEM_BUG_ON(addr + BIT_ULL(size) > vm->total);
 
+                       if (igt_timeout(end_time,
+                                       "%s timed out before %d/%d\n",
+                                       __func__, n, count)) {
+                               hole_end = hole_start; /* quit */
+                               break;
+                       }
+
                        if (vm->allocate_va_range &&
                            vm->allocate_va_range(vm, addr, BIT_ULL(size)))
                                break;
 
                        vm->insert_entries(vm, obj->mm.pages, addr,
                                           I915_CACHE_NONE, 0);
-                       if (igt_timeout(end_time,
-                                       "%s timed out after %d/%d\n",
-                                       __func__, n, count)) {
-                               hole_end = hole_start; /* quit */
-                               break;
-                       }
                }
                count = n;
Comment 3 Chris Wilson 2017-02-25 19:05:34 UTC
I hit this once on vanilla-ish drm-tip, but not since applying the pair of fixes. Can you still reproduce?
Comment 4 mwa 2017-02-25 19:53:58 UTC
Yup, still there. I've also hit it during igt_ppgtt_shrink now that you fixed the other issue. My timeout is 10s.
Comment 5 Chris Wilson 2017-02-27 13:22:36 UTC
commit bf75d59eff679d2e2b7af5c6958a088f8a458f7a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Feb 27 12:26:52 2017 +0000

    drm/i915: Only unwind the local pgtable layer if empty
    
    Only if we allocated the layer and the lower level failed should we
    remove this layer when unwinding. Otherwise we ignore the overlapping
    entries by overwriting the old layer with scratch.
    
    Fixes: c5d092a4293f ("drm/i915: Remove bitmap tracking for used-pml4")
    Fixes: e2b763caa6eb ("drm/i915: Remove bitmap tracking for used-pdpes")
    Reported-by: Matthew Auld <matthew.william.auld@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99947
    Testcase: igt/drv_selftest/live_gtt
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Matthew Auld <matthew.william.auld@gmail.com>
    Tested-by: Matthew Auld <matthew.auld@intel.com>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170227122654.27651-1-chris@chris-wilson.co.uk
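
The unwind rule in the commit title can be illustrated with a toy two-level table. This is a sketch under stated assumptions, not the i915 code: `toy_pd`, `toy_pt`, `toy_alloc_range` and the `toy_allocs_left` fault-injection counter are all hypothetical names. On failure the function rolls back only the entries it added in this call, then frees the upper layer only if nothing is left in it, so a layer that still holds entries from an earlier, overlapping allocation survives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_SLOTS 4

struct toy_pt { int dummy; };

struct toy_pd {
	struct toy_pt *pt[TOY_SLOTS];
	unsigned int used; /* number of populated slots */
};

/* Fault injection: how many more page-table allocations succeed (< 0: unlimited). */
static int toy_allocs_left = -1;

static struct toy_pt *toy_alloc_pt(void)
{
	if (toy_allocs_left == 0)
		return NULL;
	if (toy_allocs_left > 0)
		toy_allocs_left--;
	return calloc(1, sizeof(struct toy_pt));
}

/* Populate slots [first, first + count) of *slot, creating the layer if needed. */
static int toy_alloc_range(struct toy_pd **slot, unsigned int first,
			   unsigned int count)
{
	struct toy_pd *pd = *slot;
	bool added[TOY_SLOTS] = { false };
	unsigned int i;

	assert(first + count <= TOY_SLOTS);

	if (!pd) {
		pd = calloc(1, sizeof(*pd));
		if (!pd)
			return -1;
		*slot = pd;
	}

	for (i = first; i < first + count; i++) {
		if (pd->pt[i])
			continue; /* overlapping entry from an earlier call */
		pd->pt[i] = toy_alloc_pt();
		if (!pd->pt[i])
			goto unwind;
		added[i] = true;
		pd->used++;
	}
	return 0;

unwind:
	/* Roll back only what this call added. */
	for (i = first; i < first + count; i++) {
		if (!added[i])
			continue;
		free(pd->pt[i]);
		pd->pt[i] = NULL;
		pd->used--;
	}
	/* Remove this layer only if it is now empty. */
	if (pd->used == 0) {
		free(pd);
		*slot = NULL;
	}
	return -1;
}
```

Freeing the layer unconditionally on failure would discard entries that an earlier allocation still relies on, which is the kind of accounting mismatch a later clear could then assert on.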

