Created attachment 114084 [details] first error

The desktop freezes for a few seconds, with a GPU hang reported in dmesg. The error state is attached. In Xorg I have:

    Section "Device"
        Identifier "Intel Graphics"
        Driver "intel"
        # Option "AccelMethod" "uxa"
        Option "DRI" "False"
    EndSection

When the freeze happens again the error code may change.
Created attachment 114085 [details] second error
Please also attach your Xorg.0.log. The first looks familiar...
The second is batch buffer incoherence.
Could you see if this makes the X hangs go away:

xf86-video-intel:

    diff --git a/src/sna/kgem.c b/src/sna/kgem.c
    index a5571aa..adf52d6 100644
    --- a/src/sna/kgem.c
    +++ b/src/sna/kgem.c
    @@ -83,7 +83,7 @@ search_snoop_cache(struct kgem *kgem, unsigned int num_pages, unsigned flags);
     #define DBG_NO_FAST_RELOC 0
     #define DBG_NO_HANDLE_LUT 0
     #define DBG_NO_WT 0
    -#define DBG_NO_WC_MMAP 0
    +#define DBG_NO_WC_MMAP 1
     #define DBG_DUMP 0
     #define DBG_NO_MALLOC_CACHE 0
I will try, Chris. Any preferred place to get the kernel from? I have enabled DRI and now I can see more errors; I have attached them too. The GPU hang is fairly old, I have had it since summer. 3.14 was probably the last kernel where things were OK.

regards,
Alin
Created attachment 114086 [details] kwin error 1
Created attachment 114087 [details] kwin_error 2
Created attachment 114088 [details] kwin error 3
Created attachment 114089 [details] plasma crash
I have built and tried 3.10 and 3.11. Both showed crashes: 3.10 quickly, 3.11 only once and after a while. Now on 3.12. I have attached the errors.

Alin
Created attachment 114100 [details] kernel 3.10 error
Created attachment 114101 [details] kernel 3.11
Created attachment 114102 [details] kernel 3.12 e1
Created attachment 114103 [details] kernel 3.12 e2
Time for good/bad news. The errors on 3.10/3.11/3.12 are a different error originating from mesa. Please make sure you have DRI "false" in your xorg.conf and see if the errors persist. Even if they do, could you run with kernels 3.13->4.0 and attach the error states, so I can see when the error switches over to the more troubling incoherence issue.
That is great news. Yes, I will continue with the new kernels (some of them already built) over the weekend and report.

regards,
Alin
3.13 started to show the crashes I am used to. Errors added.
Created attachment 114119 [details] kernel 3.13 e1
Created attachment 114120 [details] kernel 3.13 e2
Created attachment 114121 [details] kernel 3.13 e3
Created attachment 114122 [details] kernel 3.13 e4
Created attachment 114123 [details] kernel 3.13 e5
Kernel 3.14 shows the same freezes... I am uploading the errors.
Created attachment 114124 [details] kernel 3.14 e1
Created attachment 114125 [details] kernel 3.14 e2
Created attachment 114126 [details] kernel 3.14 e3
Created attachment 114127 [details] kernel 3.14 e4
3.13/3.14 has started to die after the context switch and before the batchbuffer start. Different again. This is fun!
Great time to move to 3.15
with 3.15 crashes are fast... states added.
Created attachment 114130 [details] kernel 3.15 e1
Created attachment 114131 [details] kernel 3.15 e2
kernel 3.19 error state
Created attachment 114156 [details] kernel 3.19 e1
Created attachment 114157 [details] kernel 3.19 e2
Created attachment 114158 [details] kernel 3.19 e3
Created attachment 114159 [details] kernel 3.19 e4
Created attachment 114160 [details] kernel 3.19 e5
The 3.19 error states all look to be mesa hangs. Could you capture a fresh 4.0 error state as well?
latest 4.0.0-rc3 states attached
Created attachment 114164 [details] kernel 4.0.0-rc3 e1
Created attachment 114165 [details] kernel 4.0.0-rc3 e2
Created attachment 114166 [details] kernel 4.0.0-rc3 e3
Created attachment 114167 [details] kernel 4.0.0-rc3 e4
Now we have a split between death inside mesa for kwin_x11 (similar to the earlier error states) and a death on context restore like the ones from around 3.13/3.14.
DRI disabled, kernel 4.0.0-rc3
Created attachment 114168 [details] kernel 4.0.0-rc3 e1 dri disabled
Created attachment 114169 [details] kernel 4.0.0-rc3 e2 dri disabled
(In reply to Alin M Elena from comment #47)
> Created attachment 114168 [details]
> kernel 4.0.0-rc3 e1 dri disabled

Hangcheck be broken.
(In reply to Alin M Elena from comment #48)
> Created attachment 114169 [details]
> kernel 4.0.0-rc3 e2 dri disabled

The GPU appears to have fallen asleep.
(In reply to Chris Wilson from comment #49)
> (In reply to Alin M Elena from comment #47)
> > Created attachment 114168 [details]
> > kernel 4.0.0-rc3 e1 dri disabled
>
> Hangcheck be broken.

Ok, it's a bit deeper than that:

    last seqno write in ringbuffer: 1c2d
    last seqno in hws: 182d
Every batch before and after the last successful hws is using the same batch, which implies it must have been considered idle (i.e. seqno had advanced). Again something fishy with obj->active.
Another interesting tidbit: SYNC_1: 0x00001c2d which corresponds with the ringbuffer. So it appears that the hws ended up with a stale value long after "later" writes landed.
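To make the mismatch above concrete, here is a minimal sketch of a wrap-safe sequence number comparison applied to the three values quoted from the error state. It is modelled on the usual signed 32-bit difference trick for wrapping counters; the function names `seqno_passed` and `seqno_lag` are hypothetical helpers for illustration, not the driver's code.

```python
MASK32 = 0xFFFFFFFF


def seqno_passed(seq1: int, seq2: int) -> bool:
    """Return True if seq1 is at or after seq2 on a wrapping 32-bit counter."""
    # Equivalent to (int32_t)(seq1 - seq2) >= 0 in C.
    return ((seq1 - seq2) & MASK32) < 0x80000000


def seqno_lag(written: int, seen: int) -> int:
    """Return how many seqnos `seen` trails `written`, modulo 2**32."""
    return (written - seen) & MASK32


# Values taken from the error state discussed above.
ring_seqno = 0x1C2D  # last seqno write in ringbuffer
hws_seqno = 0x182D   # last seqno in hws
sync_1 = 0x1C2D      # SYNC_1 register

if __name__ == "__main__":
    # The hws value has not reached the last seqno written to the ring.
    print(seqno_passed(hws_seqno, ring_seqno))
    # Distance between the two values.
    print(hex(seqno_lag(ring_seqno, hws_seqno)))
    # SYNC_1 agrees with the ringbuffer, not with the hws.
    print(sync_1 == ring_seqno)
```

With these inputs the hws value trails the ring by 0x400, while SYNC_1 matches the ring, which is the stale-hws picture described in the comments above.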
I don't have a good explanation, but we may as well try punching a few things and see what we can break:

    diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
    index e5b3c6dbd467..c972f24d50cc 100644
    --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
    +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
    @@ -1087,18 +1087,20 @@ gen6_add_request(struct intel_engine_cs *ring)
     	int ret;
     	if (ring->semaphore.signal)
    -		ret = ring->semaphore.signal(ring, 4);
    +		ret = ring->semaphore.signal(ring, 6);
     	else
    -		ret = intel_ring_begin(ring, 4);
    +		ret = intel_ring_begin(ring, 6);
     	if (ret)
     		return ret;
    +	intel_ring_emit(ring, MI_ARB_ON_OFF | MI_ARB_DISABLE);
     	intel_ring_emit(ring, MI_STORE_DWORD_INDEX);
     	intel_ring_emit(ring, I915_GEM_HWS_INDEX << MI_STORE_DWORD_INDEX_SHIFT);
     	intel_ring_emit(ring, i915_gem_request_get_seqno(ring->outstanding_lazy_request));
     	intel_ring_emit(ring, MI_USER_INTERRUPT);
    +	intel_ring_emit(ring, MI_ARB_ON_OFF | MI_ARB_ENABLE);
     	__intel_ring_advance(ring);
     	return 0;
Created attachment 114176 [details] [review] Just a random shot in the dark
With DRI disabled, dmesg is clean after one night. With DRI enabled, the errors are still there.
With DRI enabled there are still issues, the same as before. In addition:

[ 742.562429] [drm] GPU HANG: ecode 7:0:0x97f4ffff, in chromium [2431], reason: Ring hung, action: reset
[ 742.562440] ------------[ cut here ]------------
[ 742.562486] WARNING: CPU: 1 PID: 2308 at /home/alin/lavello/linux/drivers/gpu/drm/i915/intel_display.c:9574 intel_mmio_flip_work_func+0x2ea/0x310 [i915]()
[ 742.562487] WARN_ON(__i915_wait_request(mmio_flip->req, crtc->reset_counter, false, NULL, NULL) != 0)
[ 742.562489] Modules linked in:
[ 742.562490] ctr ccm fuse ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet bnep dell_wmi sparse_keymap nls_iso8859_1 nls_cp437 vfat fat arc4 ath9k ath9k_common ath9k_hw snd_hda_codec_hdmi iTCO_wdt ath iTCO_vendor_support snd_hda_codec_realtek snd_hda_codec_generic mac80211 snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_timer kvm dm_mod snd crct10dif_pclmul ath3k btusb crc32_pclmul cfg80211 dell_laptop crc32c_intel uvcvideo dcdbas bluetooth ghash_clmulni_intel aesni_intel videobuf2_vmalloc aes_x86_64 videobuf2_memops glue_helper videobuf2_core lrw gf128mul ablk_helper rndis_host v4l2_common cryptd cdc_ether videodev usbnet mei_me joydev mii rfkill mei serio_raw pcspkr lpc_ich
[ 742.562532] i2c_i801 shpchp mfd_core soundcore tpm_tis tpm wmi thermal battery processor ac efivarfs xhci_pci xhci_hcd i915 i2c_algo_bit drm_kms_helper drm video button sg
[ 742.562544] CPU: 1 PID: 2308 Comm: kworker/1:1 Tainted: G U 4.0.0-rc3-1.gf264c86-desktop+ #1
[ 742.562545] Hardware name: Dell Inc. XPS L322X/0PJHXN, BIOS A10 08/28/2013
[ 742.562561] Workqueue: events intel_mmio_flip_work_func [i915]
[ 742.562563] ffffffffa0193448 ffff8800b26a7ce8 ffffffff81678e34 0000000000000000
[ 742.562565] ffff8800b26a7d38 ffff8800b26a7d28 ffffffff810657aa ffff880235b70000
[ 742.562568] ffff88003f8f68b0 ffff8800b3fb9f40 ffff88003f8f6000 ffffe8ffffc41900
[ 742.562570] Call Trace:
[ 742.562577] [<ffffffff81678e34>] dump_stack+0x4c/0x6e
[ 742.562582] [<ffffffff810657aa>] warn_slowpath_common+0x8a/0xc0
[ 742.562585] [<ffffffff81065826>] warn_slowpath_fmt+0x46/0x50
[ 742.562602] [<ffffffffa012e06a>] intel_mmio_flip_work_func+0x2ea/0x310 [i915]
[ 742.562605] [<ffffffff810893ed>] ? finish_task_switch+0x5d/0x100
[ 742.562609] [<ffffffff8107dfb5>] process_one_work+0x145/0x440
[ 742.562611] [<ffffffff8107e3d1>] worker_thread+0x121/0x450
[ 742.562614] [<ffffffff8107e2b0>] ? process_one_work+0x440/0x440
[ 742.562616] [<ffffffff810836f9>] kthread+0xc9/0xe0
[ 742.562629] [<ffffffff81083630>] ? kthread_create_on_node+0x180/0x180
[ 742.562631] [<ffffffff8167f798>] ret_from_fork+0x58/0x90
[ 742.562634] [<ffffffff81083630>] ? kthread_create_on_node+0x180/0x180
[ 742.562636] ---[ end trace 5268aa6c476a71d0 ]---
Created attachment 114185 [details] kernel 4.0.0-rc3 dri enabled + patch t1 e1
Created attachment 114186 [details] kernel 4.0.0-rc3 dri enabled + patch t1 e2
Created attachment 114188 [details] kernel 4.0.0-rc3 dri enabled + patch t1 e3
Created attachment 114189 [details] kernel 4.0.0-rc3 dri enabled + patch t1 e4
Created attachment 114190 [details] kernel 4.0.0-rc3 dri enabled + patch t1 e5
Created attachment 114191 [details] kernel 4.0.0-rc3 dri enabled + patch t1 e6
kwin seems to generate a trace in the kernel too:

[ 1133.003380] drm/i915: Resetting chip after gpu hang
[ 1147.017084] [drm] stuck on render ring
[ 1147.017546] [drm] GPU HANG: ecode 7:0:0x97f4ffff, in kwin_x11 [2782], reason: Ring hung, action: reset
[ 1147.017597] ------------[ cut here ]------------
[ 1147.017635] WARNING: CPU: 3 PID: 152 at /home/alin/lavello/linux/drivers/gpu/drm/i915/intel_display.c:9574 intel_mmio_flip_work_func+0x2ea/0x310 [i915]()
[ 1147.017637] WARN_ON(__i915_wait_request(mmio_flip->req, crtc->reset_counter, false, NULL, NULL) != 0)
[ 1147.017639] Modules linked in:
[ 1147.017640] ctr ccm fuse ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet bnep dell_wmi sparse_keymap nls_iso8859_1 nls_cp437 vfat fat arc4 ath9k ath9k_common ath9k_hw snd_hda_codec_hdmi iTCO_wdt ath iTCO_vendor_support snd_hda_codec_realtek snd_hda_codec_generic mac80211 snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_timer kvm dm_mod snd crct10dif_pclmul ath3k btusb crc32_pclmul cfg80211 dell_laptop crc32c_intel uvcvideo dcdbas bluetooth ghash_clmulni_intel aesni_intel videobuf2_vmalloc aes_x86_64 videobuf2_memops glue_helper videobuf2_core lrw gf128mul ablk_helper rndis_host v4l2_common cryptd cdc_ether videodev usbnet mei_me joydev mii rfkill mei serio_raw pcspkr lpc_ich
[ 1147.017674] i2c_i801 shpchp mfd_core soundcore tpm_tis tpm wmi thermal battery processor ac efivarfs xhci_pci xhci_hcd i915 i2c_algo_bit drm_kms_helper drm video button sg
[ 1147.017688] CPU: 3 PID: 152 Comm: kworker/3:1 Tainted: G U W 4.0.0-rc3-1.gf264c86-desktop+ #1
[ 1147.017689] Hardware name: Dell Inc. XPS L322X/0PJHXN, BIOS A10 08/28/2013
[ 1147.017705] Workqueue: events intel_mmio_flip_work_func [i915]
[ 1147.017707] ffffffffa0193448 ffff88003fb97ce8 ffffffff81678e34 0000000000000000
[ 1147.017709] ffff88003fb97d38 ffff88003fb97d28 ffffffff810657aa ffff88023f2cd900
[ 1147.017710] ffff88003f8f68b0 ffff88003f900640 ffff88003f8f6000 ffffe8ffffcc1900
[ 1147.017712] Call Trace:
[ 1147.017719] [<ffffffff81678e34>] dump_stack+0x4c/0x6e
[ 1147.017723] [<ffffffff810657aa>] warn_slowpath_common+0x8a/0xc0
[ 1147.017725] [<ffffffff81065826>] warn_slowpath_fmt+0x46/0x50
[ 1147.017740] [<ffffffffa012e06a>] intel_mmio_flip_work_func+0x2ea/0x310 [i915]
[ 1147.017743] [<ffffffff810893ed>] ? finish_task_switch+0x5d/0x100
[ 1147.017746] [<ffffffff8107dfb5>] process_one_work+0x145/0x440
[ 1147.017748] [<ffffffff8107e3d1>] worker_thread+0x121/0x450
[ 1147.017750] [<ffffffff8107e2b0>] ? process_one_work+0x440/0x440
[ 1147.017752] [<ffffffff810836f9>] kthread+0xc9/0xe0
[ 1147.017754] [<ffffffff81083630>] ? kthread_create_on_node+0x180/0x180
[ 1147.017756] [<ffffffff8167f798>] ret_from_fork+0x58/0x90
[ 1147.017758] [<ffffffff81083630>] ? kthread_create_on_node+0x180/0x180
[ 1147.017759] ---[ end trace 5268aa6c476a71d1 ]---
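When comparing many dmesg captures like the ones above, it can help to pull out just the "GPU HANG" lines. The following is a rough sketch that assumes the message layout seen in this thread (`ecode ..., in <process> [<pid>], reason: ..., action: ...`); the `GpuHang` type and `find_gpu_hangs` function are hypothetical helpers, and other kernel versions may format the line differently.

```python
import re
from typing import List, NamedTuple


class GpuHang(NamedTuple):
    timestamp: float  # seconds since boot, from the dmesg prefix
    ecode: str        # raw ecode field, e.g. "7:0:0x97f4ffff"
    process: str      # process blamed for the hang
    pid: int
    reason: str
    action: str


# Layout assumed from the log lines quoted in this thread.
_HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<ecode>[^,\s]+), "
    r"in (?P<proc>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def find_gpu_hangs(text: str) -> List[GpuHang]:
    """Return every GPU HANG record found in a dmesg dump.

    Uses finditer rather than line splitting, so it also works on
    logs whose line breaks were lost when pasted.
    """
    return [
        GpuHang(
            timestamp=float(m.group("ts")),
            ecode=m.group("ecode"),
            process=m.group("proc"),
            pid=int(m.group("pid")),
            reason=m.group("reason"),
            action=m.group("action"),
        )
        for m in _HANG_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "[ 1147.017084] [drm] stuck on render ring "
        "[ 1147.017546] [drm] GPU HANG: ecode 7:0:0x97f4ffff, "
        "in kwin_x11 [2782], reason: Ring hung, action: reset"
    )
    for hang in find_gpu_hangs(sample):
        print(hang.process, hang.pid, hang.ecode, hang.reason)
```

Grouping the results by `process` and `ecode` gives a quick view of whether successive hangs are the same failure or a new one.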
Created attachment 114192 [details] kernel 4.0.0-rc3 dri enabled + patch t1 e7
That trace is just the -EIO issue that should have been fixed with #requests. To be clear, if you run with DRI disabled and with the shotgun patch, do we see either hang? Give it a good long use.
OK... with DRI disabled and the patch I got an error in the end. State attached.

Alin
Created attachment 114196 [details] kernel 4.0.0-rc3 + patch t1 dri disabled e1
Created attachment 114204 [details] kernel 4.0.0-rc3 dri enabled + patch t1 rc6=0 e1
Created attachment 114205 [details] kernel 4.0.0-rc3 dri enabled + patch t1 rc6=0 e2
Created attachment 114206 [details] kernel 4.0.0-rc3 dri enabled + patch t1 rc6=0 e3
Same error that I was hoping was rc6 related. I need a new tree to bark at.
changed the memory and the bug is gone. Alin
/o\ Hoping this remains a hw issue.