Bug 88432

Summary: [all] OOPS in i915_error_capture()
Product: DRI
Component: DRM/Intel
Status: CLOSED FIXED
Severity: critical
Priority: high
Version: unspecified
Hardware: All
OS: Linux (All)
Reporter: lu hua <huax.lu>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: christophe.prigent, humberto.i.perez.rodriguez, intel-gfx-bugs, nicholas.hoath, przanoni
Whiteboard:
i915 platform: ALL
i915 features: GPU hang
Attachments:
Description          Flags
dmesg                none
HSW-ULT_dmesg.txt    none
BDW-U dmesg log      none
BDW dmesg log        none

Description lu hua 2015-01-15 01:23:21 UTC
Created attachment 112259 [details]
dmesg

==System Environment==
--------------------------
Regression: not sure

Non-working platforms: BDW

==kernel==
--------------------------
drm-intel-nightly/9c4bdce37d09c0682f04bb5e6d0567def5c8d786
commit 9c4bdce37d09c0682f04bb5e6d0567def5c8d786
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Jan 13 23:27:51 2015 +0100

    drm-intel-nightly: 2015y-01m-13d-22h-27m-23s UTC integration manifest

==Bug detailed description==
-----------------------------
The test sporadically hangs the system. Fail rate: 1/5.

output:
IGT-Version: 1.9-g5fb26d1 (x86_64) (Linux: 3.19.0-rc4_drm-intel-nightly_9c4bdc_20150114+ x86_64)

dmesg:
[  499.462482] BUG: unable to handle kernel paging request at 0000000100000088
[  499.462532] IP: [<ffffffffa01093d9>] capture_bo+0x4/0x14d [i915]
[  499.462583] PGD a1cdd067 PUD 0
[  499.462610] Oops: 0000 [#1] SMP
[  499.462636] Modules linked in: netconsole configfs ipv6 iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi ppdev dm_mod pcspkr i2c_i801 snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer lpc_ich mfd_core snd soundcore battery parport_pc parport ac acpi_cpufreq i915 button video drm_kms_helper drm cfbfillrect cfbimgblt cfbcopyarea [last unloaded: netconsole]
[  499.462917] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.19.0-rc4_drm-intel-nightly_9c4bdc_20150114+ #410
[  499.462971] task: ffff880149bb1800 ti: ffff880149bc0000 task.ti: ffff880149bc0000
[  499.463016] RIP: 0010:[<ffffffffa01093d9>]  [<ffffffffa01093d9>] capture_bo+0x4/0x14d [i915]
[  499.463075] RSP: 0018:ffff88014ec43ce0  EFLAGS: 00010203
[  499.463106] RAX: 0000000100000000 RBX: ffff8800a1de4000 RCX: ffff8800956817e0
[  499.463151] RDX: ffff880095698020 RSI: ffff880143c8fa80 RDI: ffff8800956817c0
[  499.463191] RBP: ffff880143c8fa00 R08: ffff880143c8fa80 R09: ffff8800a7d5dd08
[  499.463232] R10: 0000000000000c01 R11: ffffea000255d820 R12: ffff880144610000
[  499.463273] R13: 0000000000000008 R14: 0000000000000002 R15: 00000000000000bf
[  499.463315] FS:  0000000000000000(0000) GS:ffff88014ec40000(0000) knlGS:0000000000000000
[  499.463361] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  499.463395] CR2: 0000000100000088 CR3: 00000000a1cdc000 CR4: 00000000003407e0
[  499.463437] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  499.463484] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  499.463531] Stack:
[  499.463548]  ffffffffa00ac483 0000000000000300 ffff880143c8fa80 ffff8800a7d5dd08
[  499.463614]  ffff880095698020 ffff880095680000 ffff880100000c01 ffff880100000005
[  499.463684]  ffff880143c8fae0 ffff880144617b30 ffff880144436000 ffff8800956817e0
[  499.463742] Call Trace:
[  499.463760]  <IRQ>
[  499.463778]  [<ffffffffa00ac483>] ? i915_capture_error_state+0x681/0x1382 [i915]
[  499.463856]  [<ffffffffa00b4460>] ? i915_handle_error+0x7a/0x599 [i915]
[  499.463906]  [<ffffffffa00b4cd0>] ? i915_hangcheck_elapsed+0x305/0x399 [i915]
[  499.465609]  [<ffffffffa00b49cb>] ? i915_queue_hangcheck+0x4c/0x4c [i915]
[  499.467303]  [<ffffffff8107bf89>] ? call_timer_fn+0x46/0xe2
[  499.469006]  [<ffffffffa00b49cb>] ? i915_queue_hangcheck+0x4c/0x4c [i915]
[  499.470703]  [<ffffffff8107c3ce>] ? run_timer_softirq+0x1af/0x212
[  499.472389]  [<ffffffff8103eb14>] ? __do_softirq+0xdc/0x22f
[  499.474077]  [<ffffffff8103ed9b>] ? irq_exit+0x34/0x78
[  499.475761]  [<ffffffff81026b8e>] ? smp_apic_timer_interrupt+0x39/0x43
[  499.477451]  [<ffffffff8179fdaa>] ? apic_timer_interrupt+0x6a/0x70
[  499.479151]  <EOI>
[  499.479167]  [<ffffffff81009f18>] ? default_idle+0x34/0x8e
[  499.482553]  [<ffffffff8106554c>] ? cpu_startup_entry+0x170/0x2e0
[  499.484261] Code: c7 3d 33 12 a0 e8 3d a0 f0 ff b8 fb ff ff ff eb 09 b8 00 00 00 00 41 0f 4e c4 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e c3 48 8b 46 48 <48> 8b 90 88 00 00 00 89 17 8b 90 90 00 00 00 89 57 04 48 8b 90
[  499.486263] RIP  [<ffffffffa01093d9>] capture_bo+0x4/0x14d [i915]
[  499.488079]  RSP <ffff88014ec43ce0>
[  499.489862] CR2: 0000000100000088
[  499.491650] ---[ end trace 04d70c09331bf34f ]---
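
A reading of the register dump above, offered as a sketch and not as a confirmed diagnosis: the bytes at the marked position in the Code: line, `48 8b 90 88 00 00 00`, decode to `mov 0x88(%rax),%rdx`, and RAX holds 0x0000000100000000, so the effective address is RAX + 0x88 = 0x0000000100000088, which matches CR2. RAX was loaded by the preceding `mov 0x48(%rsi),%rax`, so the pointer read from the structure passed in RSI is not a kernel address at all, which would be consistent with that structure having been freed and reused. The helper names below are hypothetical, and the kernel-half boundary assumes 4-level paging on x86_64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lowest address of the kernel half of the canonical x86_64 address space.
 * Assumes 4-level paging (an assumption; the report does not state it). */
#define KERNEL_HALF_START UINT64_C(0xffff800000000000)

/* Effective address of a base + signed displacement memory operand. */
uint64_t effective_address(uint64_t base, int32_t disp)
{
    return base + (uint64_t)(int64_t)disp;
}

/* True if the value lies in the kernel half and so could be a kernel pointer. */
bool in_kernel_half(uint64_t addr)
{
    return addr >= KERNEL_HALF_START;
}
```

Feeding in the values from the dump, `effective_address(0x0000000100000000, 0x88)` reproduces CR2, while `in_kernel_half()` rejects the RAX value and accepts the RSI value 0xffff880143c8fa80.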

==Reproduce steps==
---------------------------- 
1. ./gem_evict_everything --run-subtest swapping-hang
Comment 1 Ding Heng 2015-01-16 07:57:46 UTC
Adding BSW to this bug.
Comment 2 Yi Sun 2015-01-19 03:19:43 UTC
Chris, you removed the platform 'BDW'. Did you mean that this issue is not specific to one platform?
Comment 3 Chris Wilson 2015-01-19 09:55:49 UTC
It's a bug in the capture code that is not specific to any architecture.
Comment 4 Chris Wilson 2015-01-19 09:56:31 UTC
(In reply to Chris Wilson from comment #3)
> It's a bug in the capture code that is not specific to any architecture.

(The issue is magnified by the partial seqno/request conversion.)
Comment 5 Chris Wilson 2015-01-27 12:50:26 UTC
*** Bug 88821 has been marked as a duplicate of this bug. ***
Comment 6 Chris Wilson 2015-03-05 09:47:04 UTC
*** Bug 89441 has been marked as a duplicate of this bug. ***
Comment 7 Gordon Jin 2015-04-02 06:42:15 UTC
Does the development team agree that this is highest priority? If so, can we move on?
Comment 8 Daniel Vetter 2015-04-02 08:08:08 UTC
Lost track here ... have we merged the patches, Chris?
Comment 9 Chris Wilson 2015-04-02 08:38:46 UTC
No, it is something that I addressed in the conversion to requests but has been overlooked.
Comment 10 Ander Conselvan de Oliveira 2015-04-29 10:49:57 UTC
Reducing bug priority after a discussion with Chris. Main points are

 - the bug is not a regression, it has been in the code base since the introduction of lockless error capture;
 - there is no user sighting of the bug;
 - the blocked test case (gem_evict_everything/swapping-hang) tests for an extreme corner-case.


Also, according to Chris, for the solution "we need a couple of spinlocks to serialize bo retirement vs error capture, but we need to avoid creating deadlocks, and that is the tricky part."
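
To illustrate the kind of serialization described in that quote, here is a minimal userspace sketch, not the i915 code: a single spinlock (built on C11 `atomic_flag`) guards the list of active buffers, so retirement cannot unlink and free a buffer while capture is walking the list. Capture copies only plain values while holding the lock and never takes a second lock, which is one simple way to stay clear of deadlock. All names (`struct bo`, `bo_add`, `bo_retire`, `capture_active`) are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct bo {                      /* hypothetical stand-in for a buffer object */
    size_t size;
    struct bo *next;
};

struct active_list {
    atomic_flag lock;            /* spinlock guarding head and every ->next */
    struct bo *head;
};

#define ACTIVE_LIST_INIT { ATOMIC_FLAG_INIT, NULL }

static void list_lock(struct active_list *l)
{
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        ;                        /* spin until the current holder releases */
}

static void list_unlock(struct active_list *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Allocate a buffer and push it on the list; NULL on allocation failure. */
struct bo *bo_add(struct active_list *l, size_t size)
{
    struct bo *bo = malloc(sizeof(*bo));
    if (!bo)
        return NULL;
    bo->size = size;
    list_lock(l);
    bo->next = l->head;
    l->head = bo;
    list_unlock(l);
    return bo;
}

/* Retirement: unlink under the lock, free only after unlinking.
 * The caller must pass a buffer that is currently on the list. */
void bo_retire(struct active_list *l, struct bo *victim)
{
    list_lock(l);
    for (struct bo **p = &l->head; *p; p = &(*p)->next) {
        if (*p == victim) {
            *p = victim->next;
            break;
        }
    }
    list_unlock(l);
    free(victim);
}

/* Capture: copy up to max sizes into out[] under the same lock. */
size_t capture_active(struct active_list *l, size_t *out, size_t max)
{
    size_t n = 0;
    list_lock(l);
    for (struct bo *bo = l->head; bo && n < max; bo = bo->next)
        out[n++] = bo->size;
    list_unlock(l);
    return n;
}
```

In the kernel the capture runs from a timer softirq (see the call trace in the description), so real locks would additionally need interrupt-safe variants; that detail is outside this sketch.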
Comment 11 Humberto Israel Perez Rodriguez 2015-08-12 14:25:51 UTC
Created attachment 117649 [details]
HSW-ULT_dmesg.txt

Hi, this issue also occurs with the latest configuration for HSW-ULT

-- Hardware --
Platform: Intel NUC D54250WYK
Processor: Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz
-- Software --
Linux distribution: Ubuntu 14.04.02 LTS 64Bits
BIOS: WYLPT10H.86A.0021.2013.1017.1606



Test Environment:
````````````````````````````````````
Kernel: tag drm-intel-testing-2015-07-31 (4.2-rc4) from git://anongit.freedesktop.org/drm-intel
Mesa: mesa-10.6.3 from http://cgit.freedesktop.org/mesa/mesa/
Xf86_video_intel: 2.99.917 from http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/
Libdrm: libdrm-2.4.62 from http://cgit.freedesktop.org/mesa/drm/
Cairo: 1.14.2 from http://cgit.freedesktop.org/cairo
libva: libva-1.6.0 from http://cgit.freedesktop.org/libva/
intel-driver: 1.6.0. from http://cgit.freedesktop.org/vaapi/intel-driver
xorg: 1.17.99 installed with script git_xorg.sh
Xserver: xorg-server-1.17.2 from http://cgit.freedesktop.org/xorg/xserver
Intel-gpu-tools: 1.11 from http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/


Notes: The test often hangs the system (fail rate: 4/5) and sometimes produces a dmesg warning.
	Attached HSW-ULT_dmesg.txt


If you need more information or have any questions, do not hesitate to contact me.
Comment 12 Elio 2015-08-13 19:15:52 UTC
Created attachment 117672 [details]
BDW-U dmesg log
Comment 13 cprigent 2015-10-08 16:36:13 UTC
Bug scrub:
Probably fixed, can you confirm?
Comment 14 Chris Wilson 2015-10-09 07:52:29 UTC
No. Error capture still dereferences requests without any serialisation with the freeing of said requests.
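
One general way to make such a dereference safe, shown here as a hedged userspace sketch and not as the fix that was eventually merged: the capture path takes its own reference before touching a request and backs off if the count has already dropped to zero. The pattern mirrors the kernel's `kref_get_unless_zero()`; the `struct request` type and the helper names below are hypothetical. The lookup itself must still be protected (for example by a lock or RCU) so that the memory holding the counter stays valid while the reference is attempted.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct request {                 /* hypothetical stand-in for a GPU request */
    atomic_uint refcount;
    uint32_t seqno;
};

/* Allocate a request; the creator holds the first reference. */
struct request *request_create(uint32_t seqno)
{
    struct request *rq = malloc(sizeof(*rq));
    if (!rq)
        return NULL;
    atomic_init(&rq->refcount, 1);
    rq->seqno = seqno;
    return rq;
}

/* Take a reference only if the request is still alive. */
bool request_get_unless_zero(struct request *rq)
{
    unsigned int old = atomic_load(&rq->refcount);
    while (old != 0) {
        /* On failure, old is refreshed with the current count. */
        if (atomic_compare_exchange_weak(&rq->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if this call freed the request. */
bool request_put(struct request *rq)
{
    if (atomic_fetch_sub(&rq->refcount, 1) == 1) {
        free(rq);
        return true;
    }
    return false;
}

/* Capture path: read the seqno only while holding a reference. */
bool capture_seqno(struct request *rq, uint32_t *seqno_out)
{
    if (!request_get_unless_zero(rq))
        return false;            /* request is already being torn down */
    *seqno_out = rq->seqno;
    request_put(rq);
    return true;
}
```

The design choice here is that the reader never blocks the freeing side; it either pins the request for the duration of the copy or skips it.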
Comment 15 Jairo Miramontes 2015-10-15 12:29:19 UTC
Created attachment 118888 [details]
BDW dmesg log

Bug Scrub:

Tested again on BDW using kernel 4.3.0 and got an error as well. The dmesg log is attached, and the environment I used is listed below.

````````````````````````````````````
Kernel:4.3.0-rc4 drm-intel-testing-2015-10-10
Mesa: mesa-11.0.2 
Xf86_video_intel: 2.99.917 
Libdrm: libdrm-2.4.65
Cairo: 1.14.2 
libva: libva-1.6.1 
intel-driver: 1.6.1
xorg: 1.17.99 installed with script git_xorg.sh
Xserver: xorg-server-1.17.2 
Intel-gpu-tools: 1.12
Comment 16 cprigent 2015-11-17 17:25:36 UTC
Bug scrub,
Assigned to Kimmo
Comment 17 Chris Wilson 2016-01-28 10:00:47 UTC
http://patchwork.freedesktop.org/patch/70010/
Comment 18 Paulo Zanoni 2016-03-08 19:38:35 UTC
(In reply to Chris Wilson from comment #17)
> http://patchwork.freedesktop.org/patch/70010/

Can anybody please confirm whether the patch above solves the problem or at least reduces the failure rate?

Thanks,
Paulo
Comment 19 yann 2016-05-20 09:25:22 UTC
Jairo, please re-test with the patch and confirm whether it is still occurring.
Comment 20 Elio 2016-06-01 18:53:56 UTC
It seems that the patch is not valid for drm-intel-next-2016-05-08-2069-gf1eaed1, the equivalent of drm-intel-testing-05-21-2016.

The patch does not apply cleanly to i915_gpu_error.c:

Hunk #3 FAILED at 1290.
1 out of 3 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gpu_error.c.rej
(04:05 AM) [gfx@gfx-ThinkCentre-M600] [drm-intel]$ : nano drivers/gpu/drm/i915/i915_gpu_error.c.rej
  GNU nano 2.5.3   File: drivers/gpu/drm/i915/i915_gpu_error.c.rej

--- drivers/gpu/drm/i915/i915_gpu_error.c
+++ drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1290,9 +1269,19 @@ void i915_capture_error_state(struct drm_device *dev, bo$
        }

        kref_init(&error->ref);
-       error->i915 = dev_priv;

-       stop_machine(capture, error, NULL);
+       i915_capture_gen_state(dev_priv, error);
+       i915_capture_reg_state(dev_priv, error);
+       i915_gem_record_fences(dev, error);
+       i915_gem_record_rings(dev, error);
+
+       i915_capture_active_buffers(dev_priv, error);
+       i915_capture_pinned_buffers(dev_priv, error);
+
+       do_gettimeofday(&error->time);
+
Comment 21 Humberto Israel Perez Rodriguez 2016-07-22 15:18:03 UTC
(In reply to Chris Wilson from comment #17)
> http://patchwork.freedesktop.org/patch/70010/

Hi Chris, we could not apply this patch to the latest kernel, 4.7.0-rc7. Could you double-check, please?
Comment 22 Chris Wilson 2016-07-24 09:30:06 UTC
Well, we are getting closer: it is only at about patch 90 in the queue now. The patch in situ is https://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=tasklet&id=c9a8be989704c323a87c2fd661b3a65815daa938
Comment 23 Jairo Miramontes 2016-08-30 21:54:48 UTC
This test is now being skipped due to "lack of memory". I tested on BXT and SKL using the following kernel:


===================================================================
commit 57de27e40b9741c17c6749a366e891faf8b22fcb
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Mon Aug 29 17:38:46 2016 +0200

    drm-intel-nightly: 2016y-08m-29d-15h-38m-26s UTC integration manifest
===================================================================

I am getting the following message 

IGT-Version: 1.15-g572a770 (x86_64) (Linux: 4.8.0-rc4drm-intel-nighly-ww35-commi                                   64)
Test requirement not met in function intel_require_memory, file intel_os.c:289:
Test requirement: __intel_check_memory(count, size, mode, &required, &total)
Estimated that we need 201,326,592 objects and 201,424,896 MiB for the test, but                                   89 MiB available (RAM) and a maximum of 1,611,544 objects


Note that the "estimated" memory requirement is abnormally large.
Comment 24 Chris Wilson 2016-08-30 22:55:11 UTC
(In reply to Jairo Miramontes from comment #23)
> I am getting the following message 
> 
> IGT-Version: 1.15-g572a770 (x86_64) (Linux:
> 4.8.0-rc4drm-intel-nighly-ww35-commi                                   64)
> Test requirement not met in function intel_require_memory, file
> intel_os.c:289:
> Test requirement: __intel_check_memory(count, size, mode, &required, &total)
> Estimated that we need 201,326,592 objects and 201,424,896 MiB for the test,
> but                                   89 MiB available (RAM) and a maximum
> of 1,611,544 objects
> 
> 
> Note that the "estimated" memory requirement is abnormally large.

But accurate. That test is irrelevant regarding this bug. The bug is a race condition in our error capture code that only depends upon running the error capture whilst the driver is active.
Comment 25 Chris Wilson 2016-10-12 11:05:51 UTC
commit 9f267eb8d2ea0a87f694da3f236067335e8cb7b9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Oct 12 10:05:19 2016 +0100

    drm/i915: Stop the machine whilst capturing the GPU crash dump
