Summary: | [all] OOPS in i915_error_capture() | |
---|---|---|---
Product: | DRI | Reporter: | lu hua <huax.lu>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | critical | |
Priority: | high | CC: | christophe.prigent, humberto.i.perez.rodriguez, intel-gfx-bugs, nicholas.hoath, przanoni
Version: | unspecified | |
Hardware: | All | |
OS: | Linux (All) | |
Whiteboard: | | |
i915 platform: | ALL | i915 features: | GPU hang
Attachments: | | |

Description
lu hua
2015-01-15 01:23:21 UTC
Add BSW to this bug.

Chris, did you mean that this issue is not platform-specific, given that you removed the platform 'BDW'?

It's a bug in the capture code that is not specific to any architecture.

(In reply to Chris Wilson from comment #3)
> It's a bug in the capture code that is not specific to any architecture.
(The issue is magnified by the partial seqno/request conversion.)

*** Bug 88821 has been marked as a duplicate of this bug. ***

*** Bug 89441 has been marked as a duplicate of this bug. ***

Does the development team agree that this is the highest priority? If so, can we move on?

Lost track here ... have we merged the patches, Chris?

No, it is something that I addressed in the conversion to requests, but it has been overlooked.

Reducing bug priority after a discussion with Chris. Main points are:
- the bug is not a regression; it has been in the code base since the introduction of lockless error capture;
- there is no user sighting of the bug;
- the blocked test case (gem_evict_everything/swapping-hang) tests for an extreme corner case.

Also, according to Chris, for the solution "we need a couple of spinlocks to serialize bo retirement vs error capture, but we need to avoid creating deadlocks, and that is the tricky part."

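The serialization idea quoted above (bo retirement vs. error capture) can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions: `struct request`, `struct engine`, `engine_init`, `engine_retire`, and `engine_capture` are hypothetical names and not i915 code, and a pthread mutex stands in for the kernel spinlocks mentioned in the quote. It does not attempt the deadlock avoidance that Chris calls the tricky part.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a GPU request; not the i915 structure. */
struct request {
    unsigned int seqno;
};

/* Hypothetical engine state: one lock guards the active request pointer. */
struct engine {
    pthread_mutex_t lock;
    struct request *active; /* NULL once the request has been retired */
};

/* Publish one active request. Returns 0 on success, -1 on failure. */
int engine_init(struct engine *e, unsigned int seqno)
{
    assert(e != NULL);
    if (pthread_mutex_init(&e->lock, NULL) != 0)
        return -1;
    e->active = malloc(sizeof(*e->active));
    if (e->active == NULL) {
        pthread_mutex_destroy(&e->lock);
        return -1;
    }
    e->active->seqno = seqno;
    return 0;
}

/* Retirement: unpublish and free the request while holding the lock. */
void engine_retire(struct engine *e)
{
    pthread_mutex_lock(&e->lock);
    free(e->active);
    e->active = NULL;
    pthread_mutex_unlock(&e->lock);
}

/*
 * Error capture: copy what is needed while holding the same lock, so the
 * request cannot be freed in the middle of the read.
 * Returns 1 if a request was captured, 0 if none was active.
 */
int engine_capture(struct engine *e, unsigned int *seqno_out)
{
    int captured = 0;

    assert(seqno_out != NULL);
    pthread_mutex_lock(&e->lock);
    if (e->active != NULL) {
        *seqno_out = e->active->seqno;
        captured = 1;
    }
    pthread_mutex_unlock(&e->lock);
    return captured;
}
```

Without the shared lock, `engine_capture` could read `e->active` after `engine_retire` has freed it, which is the kind of unserialized dereference this bug describes. The commit referenced at the end of this bug takes a different route and stops the machine while the crash dump is captured.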
Created attachment 117649 [details]
HSW-ULT_dmesg.txt

Hi, this issue also occurs with the latest configuration for HSW-ULT.

-- Hardware --
Platform: Intel NUC D54250WYK
Processor: Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz

-- Software --
Linux distribution: Ubuntu 14.04.02 LTS 64Bits
BIOS: WYLPT10H.86A.0021.2013.1017.1606

Test Environment:
Kernel: tag drm-intel-testing-2015-07-31 (4.2-rc4) from git://anongit.freedesktop.org/drm-intel
Mesa: mesa-10.6.3 from http://cgit.freedesktop.org/mesa/mesa/
Xf86_video_intel: 2.99.917 from http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/
Libdrm: libdrm-2.4.62 from http://cgit.freedesktop.org/mesa/drm/
Cairo: 1.14.2 from http://cgit.freedesktop.org/cairo
libva: libva-1.6.0 from http://cgit.freedesktop.org/libva/
intel-driver: 1.6.0 from http://cgit.freedesktop.org/vaapi/intel-driver
xorg: 1.17.99 installed with script git_xorg.sh
Xserver: xorg-server-1.17.2 from http://cgit.freedesktop.org/xorg/xserver
Intel-gpu-tools: 1.11 from http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/

Notes: It often causes a system hang.
Fail rate: 4/5, and it sometimes causes a dmesg warning.

HSW-ULT_dmesg.txt is attached. If you need more information or have any questions, do not hesitate to contact me.

Created attachment 117672 [details]
BDW-U dmesg log

Bug scrub: Probably fixed, can you confirm?

No. Error capture still dereferences requests without any serialisation with the freeing of said requests.

Created attachment 118888 [details]
BDW dmesg log
Bug Scrub:
Tested again on BDW using kernel 4.3.0 and got an error as well. The dmesg log is attached, and the environment I used is listed below:
Kernel: 4.3.0-rc4 drm-intel-testing-2015-10-10
Mesa: mesa-11.0.2
Xf86_video_intel: 2.99.917
Libdrm: libdrm-2.4.65
Cairo: 1.14.2
libva: libva-1.6.1
intel-driver: 1.6.1
xorg: 1.17.99 installed with script git_xorg.sh
Xserver: xorg-server-1.17.2
Intel-gpu-tools: 1.12

Bug scrub, assigned to Kimmo.

(In reply to Chris Wilson from comment #17)
> http://patchwork.freedesktop.org/patch/70010/
Can anybody please confirm whether the patch above solves the problem or at least reduces the failure rate?
Thanks, Paulo

Jairo, please re-test with the patch and confirm whether it is still occurring.

It seems that the patch is not valid for drm-intel-next-2016-05-08-2069-gf1eaed1.. equivalent for drm-intel-testing-05-21-2016. The file i915_gpu_error.c is not taking the patches.

Hunk #3 FAILED at 1290.
1 out of 3 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gpu_error.c.rej
(04:05 AM) [gfx@gfx-ThinkCentre-M600] [drm-intel]$ : nano drivers/gpu/drm/i915/i915_gpu_error.c.rej
GNU nano 2.5.3 File: drivers/gpu/drm/i915/i915_gpu_error.c.rej
--- drivers/gpu/drm/i915/i915_gpu_error.c
+++ drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1290,9 +1269,19 @@ void i915_capture_error_state(struct drm_device *dev, bo$
 }
 kref_init(&error->ref);
- error->i915 = dev_priv;
- stop_machine(capture, error, NULL);
+ i915_capture_gen_state(dev_priv, error);
+ i915_capture_reg_state(dev_priv, error);
+ i915_gem_record_fences(dev, error);
+ i915_gem_record_rings(dev, error);
+
+ i915_capture_active_buffers(dev_priv, error);
+ i915_capture_pinned_buffers(dev_priv, error);
+
+ do_gettimeofday(&error->time);
+

(In reply to Chris Wilson from comment #17)
> http://patchwork.freedesktop.org/patch/70010/
Hi Chris, we could not apply this patch to the latest kernel, 4.7.0-rc7. Could you double-check, please?

Well, we are getting closer: it is only at about patch 90 in the queue now.

The patch in situ is https://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=tasklet&id=c9a8be989704c323a87c2fd661b3a65815daa938

This test is now being skipped due to "lack of memory". I tested on BXT and SKL using the following kernel:

===================================================================
commit 57de27e40b9741c17c6749a366e891faf8b22fcb
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date: Mon Aug 29 17:38:46 2016 +0200

drm-intel-nightly: 2016y-08m-29d-15h-38m-26s UTC integration manifest
===================================================================

I am getting the following message:

IGT-Version: 1.15-g572a770 (x86_64) (Linux: 4.8.0-rc4drm-intel-nighly-ww35-commi 64)
Test requirement not met in function intel_require_memory, file intel_os.c:289:
Test requirement: __intel_check_memory(count, size, mode, &required, &total)
Estimated that we need 201,326,592 objects and 201,424,896 MiB for the test, but 89 MiB available (RAM) and a maximum of 1,611,544 objects

Notice that the "estimated" memory required is an abnormal amount of memory.

(In reply to Jairo Miramontes from comment #23)
> I am getting the following message
>
> IGT-Version: 1.15-g572a770 (x86_64) (Linux:
> 4.8.0-rc4drm-intel-nighly-ww35-commi 64)
> Test requirement not met in function intel_require_memory, file
> intel_os.c:289:
> Test requirement: __intel_check_memory(count, size, mode, &required, &total)
> Estimated that we need 201,326,592 objects and 201,424,896 MiB for the test,
> but 89 MiB available (RAM) and a maximum
> of 1,611,544 objects
>
>
> Notice the " estimated " memory required is an abnormal amount of memory.
But accurate.

That test is irrelevant to this bug. The bug is a race condition in our error capture code that only depends upon running the error capture whilst the driver is active.

commit 9f267eb8d2ea0a87f694da3f236067335e8cb7b9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Oct 12 10:05:19 2016 +0100

drm/i915: Stop the machine whilst capturing the GPU crash dump