Bug 77587 - [BDW] GPU hang on resume from suspend
Summary: [BDW] GPU hang on resume from suspend
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: high critical
Assignee: Ben Widawsky
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-17 18:21 UTC by Timo Aaltonen
Modified: 2017-07-24 22:54 UTC

See Also:
i915 platform:
i915 features:


Attachments
error state (2.66 MB, text/plain)
2014-04-17 18:21 UTC, Timo Aaltonen
Use MMIO for PDPs (2.38 KB, patch)
2014-05-09 22:33 UTC, Ben Widawsky

Description Timo Aaltonen 2014-04-17 18:21:56 UTC
Created attachment 97527
error state

Testing a module based on 3.14 + bdw-backports, I sometimes get a GPU hang on resume, which results in a complete system hang shortly afterwards.

I managed to get an error state from it; attaching.
Comment 1 Timo Aaltonen 2014-05-05 10:07:15 UTC
So it turns out this is probably caused by incomplete PPGTT support in the backport series, which still uses aliasing PPGTT, but disabling it with i915.enable_ppgtt=0 didn't seem to work.
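One way to check whether a setting such as enable_ppgtt=0 actually reached the driver is to read the parameter back from sysfs. The following is a minimal sketch, not something taken from this report: the helper name read_module_param is hypothetical, and it assumes the parameter is exported under /sys/module/<module>/parameters/ and readable by the current user (it may be root-only). The module directory name is also an assumption; the trace later in this bug shows the backported driver loaded as i915_bdw rather than i915, so both names are tried.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return a module parameter's current value as a string.

    Returns None if the file does not exist or cannot be read
    (for example, the module is not loaded or the file is root-only).
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Both module names are assumptions; adjust to whatever lsmod reports.
    for module in ("i915", "i915_bdw"):
        print(module, "enable_ppgtt =", read_module_param(module, "enable_ppgtt"))
```

A result of None for both names would only mean the value could not be read this way, not that the option was ignored.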
Comment 2 Yang Kun (YK) 2014-05-06 08:16:45 UTC
This bug blocks us from supporting BDW in Ubuntu 14.04.

Bumping the importance to "High" + "Critical". Please let me know if this is inappropriate.
Comment 3 Ben Widawsky 2014-05-06 19:10:31 UTC
We've now been getting several reports of failures on resume. Are you certain this is a Broadwell-specific issue?
Comment 4 Timo Aaltonen 2014-05-06 19:56:10 UTC
The kernel trace just before the system dies shows gen8_ppgtt_cleanup() in it, so yes, I think this one is. The 'broadwell' git branch from before the recent rebase is stable.
Comment 5 Timo Aaltonen 2014-05-06 20:49:42 UTC
Also, it doesn't happen when the S3 cycles are scripted with fwts, but closing and opening the lid makes it hang after ~3 cycles.

[  899.708190] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  899.708330] IP: [<ffffffffa031e8db>] gen8_ppgtt_cleanup+0x1b/0x60 [i915_bdw]
[  899.708481] PGD 36b8c067 PUD d0ae0067 PMD 0 
[  899.708560] Oops: 0002 [#1] SMP 
[  899.708618] Modules linked in: rpcsec_gss_krb5 nfsv4 ctr ccm snd_hda_codec_realtek x86_pkg_temp_thermal coretemp kvm_intel sparse_keymap kvm crct10dif_pclmul arc4 dcdbas crc32_pclmul ghash_clmulni_intel aesni_intel rfcomm bnep aes_x86_64 lrw gf128mul bluetooth snd_hda_intel glue_helper ablk_helper uvcvideo cryptd snd_hda_codec snd_hwdep videobuf2_vmalloc snd_pcm videobuf2_memops psmouse snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi videobuf2_core iwlmvm videodev mac80211 snd_seq nfsd i915_bdw auth_rpcgss nfs_acl snd_seq_device nfs intel_ips iwlwifi lockd snd_timer sunrpc drm_kms_helper drm serio_raw mei_me fscache cfg80211 i2c_algo_bit snd lpc_ich mei soundcore wmi video acpi_pad parport_pc mac_hid ppdev lp parport ahci sdhci_pci e1000e libahci sdhci ptp pps_core
[  899.710220] CPU: 0 PID: 6083 Comm: kworker/0:5 Not tainted 3.13.0-25-generic #47tja1
[  899.710320] Hardware name: not-really
[  899.710416] Workqueue: events i915_error_work_func [i915_bdw]
[  899.710474] task: ffff8800ae372fe0 ti: ffff88009f938000 task.ti: ffff88009f938000
[  899.710540] RIP: 0010:[<ffffffffa031e8db>]  [<ffffffffa031e8db>] gen8_ppgtt_cleanup+0x1b/0x60 [i915_bdw]
[  899.710656] RSP: 0018:ffff88009f939d38  EFLAGS: 00010286
[  899.710704] RAX: 0000000000000000 RBX: ffff8800d09d3e00 RCX: 0000000000000001
[  899.710769] RDX: 0000000000000000 RSI: 00000000b400b3fe RDI: ffff8800d09d3e00
[  899.710836] RBP: ffff88009f939d40 R08: 0000000000000286 R09: 000000000000000b
[  899.710903] R10: 00000000e0065000 R11: ffffffff8118626f R12: ffff8800cc1e4000
[  899.710969] R13: ffff8800cc1e4000 R14: ffff8800cc1e5870 R15: 0000000000000000
[  899.711046] FS:  0000000000000000(0000) GS:ffff88011f400000(0000) knlGS:0000000000000000
[  899.711156] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  899.711225] CR2: 0000000000000008 CR3: 00000000d0b51000 CR4: 00000000003407f0
[  899.711291] Stack:
[  899.711321]  ffff8800d09d3e00 ffff88009f939d60 ffffffffa031efe9 0000000000000000
[  899.711461]  ffff8800cd21c000 ffff88009f939d90 ffffffffa0314fe2 ffff8800cd21c000
[  899.711591]  ffff8800cc1e4000 ffff8800cd21c020 0000000000000000 ffff88009f939dc8
[  899.711687] Call Trace:
[  899.711751]  [<ffffffffa031efe9>] i915_gem_cleanup_aliasing_ppgtt+0x29/0x50 [i915_bdw]
[  899.711851]  [<ffffffffa0314fe2>] i915_gem_init_hw+0x362/0x380 [i915_bdw]
[  899.711931]  [<ffffffffa0301ca1>] i915_reset+0xa1/0x180 [i915_bdw]
[  899.712008]  [<ffffffffa030901d>] i915_error_work_func+0xcd/0x120 [i915_bdw]
[  899.712086]  [<ffffffff810838a2>] process_one_work+0x182/0x450
[  899.712161]  [<ffffffff81084641>] worker_thread+0x121/0x410
[  899.712247]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
[  899.712348]  [<ffffffff8108b312>] kthread+0xd2/0xf0
[  899.712442]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
[  899.712563]  [<ffffffff81728ffc>] ret_from_fork+0x7c/0xb0
[  899.712664]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
[  899.712778] Code: 51 ff ff ff 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 8b 97 b0 00 00 00 48 8b 87 b8 00 00 00 48 89 fb <48> 89 42 08 48 89 10 48 b8 00 01 10 00 00 00 ad de 48 89 87 b0 
[  899.713280] RIP  [<ffffffffa031e8db>] gen8_ppgtt_cleanup+0x1b/0x60 [i915_bdw]
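For readers following the trace, here is a minimal Python sketch (not part of the original report) that pulls the faulting symbol and the register values out of the oops text and checks the fault address against them. The helper names parse_ip and parse_registers are hypothetical. Reading the marked Code: bytes `48 89 42 08` as `mov %rax,0x8(%rdx)` is an editorial decoding, not something stated in the report.

```python
import re

# Lines copied from the oops above, with the timestamps stripped.
OOPS = """\
IP: [<ffffffffa031e8db>] gen8_ppgtt_cleanup+0x1b/0x60 [i915_bdw]
RAX: 0000000000000000 RBX: ffff8800d09d3e00 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 00000000b400b3fe RDI: ffff8800d09d3e00
CR2: 0000000000000008 CR3: 00000000d0b51000 CR4: 00000000003407f0
"""


def parse_ip(text):
    """Return (symbol, offset, size, module) from the 'IP:' line."""
    m = re.search(
        r"IP: \[<[0-9a-f]+>\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+) \[(\w+)\]",
        text,
    )
    if m is None:
        raise ValueError("no IP line found")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16), m.group(4)


def parse_registers(text):
    """Return a dict mapping three-character register names to integers."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


if __name__ == "__main__":
    symbol, offset, size, module = parse_ip(OOPS)
    regs = parse_registers(OOPS)
    print(f"fault in {symbol}+{offset:#x} (function size {size:#x}), module {module}")
    # Assumed faulting instruction: mov %rax,0x8(%rdx), i.e. a store to RDX + 8.
    print("RDX + 8 == CR2:", regs["RDX"] + 8 == regs["CR2"])
```

With RDX = 0, the store targets address 0x8, which matches CR2 and the "NULL pointer dereference at 0000000000000008" line. The bytes that follow load 0xdead000000100100, which looks like the kernel's list poison value, so this is plausibly an inlined list_del() on a list node whose pointers were never set up or were already cleared. The report itself does not confirm that.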
Comment 6 Timo Aaltonen 2014-05-06 22:48:46 UTC
The easiest way to reproduce is to play big_buck_bunny_1080p.ogg with VLC.
Comment 7 Ben Widawsky 2014-05-06 23:00:28 UTC
I am unable to reproduce this with VLC + big_buck_bunny_1080p.ogg + bdw-backports (and non-composited desktop).

I'll try with the USB live image as soon as possible. 

I think we're really looking at two separate issues though. The first is the hang, and the second is bad cleanup after the hang. The latter I can reproduce by forcing the GPU into the wedged state. I'll try to come up with a patch to fix that. I have no ideas about the real issue, the hang.
Comment 8 Timo Aaltonen 2014-05-07 08:57:11 UTC
The vanilla bdw-backports kernel fails as well, so it's not just the distro kernel. But a kernel built from the old 'broadwell' branch based on 3.14 doesn't hang hard; there's still a GPU hang, but the system recovers from it.
Comment 9 Ben Widawsky 2014-05-08 01:39:40 UTC
So the only difference when you run bdw-backports is you have compositing, correct? I will try that.
Comment 10 Timo Aaltonen 2014-05-08 11:05:14 UTC
Yes, I have compiz working; our Mesa 10.1 is patched to enable BDW.

Can't reproduce the hard hang with vanilla 3.15-rc4, fwiw.
Comment 11 Timo Aaltonen 2014-05-09 21:59:34 UTC
The system freeze with S3/VLC no longer happens with the preliminary patch I got from Ben, but it triggers bug #76368.
Comment 12 Ben Widawsky 2014-05-09 22:33:26 UTC
Created attachment 98788
Use MMIO for PDPs
Comment 13 Timo Aaltonen 2014-05-15 22:04:47 UTC
Let's just close this. bdw-backports won't fly for long; I need to rebase to 3.15 anyway due to #76368.

