Bug 88652 - [BSW/SKL ppgtt Bisected]igt/gem_evict_everything/major-hang causes system hang
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: highest critical
Assignee: Nick Hoath
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 88655 88688 88790 88817 88840 88845 88987 89000 89005
Depends on:
Blocks:
 
Reported: 2015-01-21 06:43 UTC by lu hua
Modified: 2017-08-14 08:33 UTC
CC List: 5 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg (2.27 MB, image/jpeg)
2015-01-21 06:43 UTC, lu hua
console output---call trace from running case to hang (10.70 KB, text/plain)
2015-03-05 07:25 UTC, wendy.wang

Description lu hua 2015-01-21 06:43:21 UTC
Created attachment 112584
dmesg

==System Environment==
--------------------------
Regression: yes
good commit: 1d83d957e621f160dfe0f08194e9c2fdd5fa7f3e
bad commit: 93180785d44e3d417099e293b9ff6eeb4fd20aa2

non-working platforms: BSW

==kernel==
--------------------------
drm-intel-nightly/d6bc7a6a0a7573350e8be8ec54002c20d1dbe1e0
commit d6bc7a6a0a7573350e8be8ec54002c20d1dbe1e0
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Jan 20 15:10:59 2015 +0100

    drm-intel-nightly: 2015y-01m-20d-14h-10m-40s UTC integration manifest

==Bug detailed description==
-----------------------------
It causes a system hang on the drm-intel-nightly and drm-intel-next-queued kernels.

output
IGT-Version: 1.9-g032f30c (x86_64) (Linux: 3.19.0-rc4_drm-intel-next-queued_931807_20150121+ x86_64)
Test requirement not met in function intel_require_memory, file intel_os.c:244:
Test requirement: !(total <= required)
Estimated that we need 6442455040 bytes for the test, but only have 1885339648 bytes available (RAM)
Subtest major-hang: SKIP (0.043s)
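
The skip in the output above comes down to one comparison between two byte counts. A minimal sketch of that arithmetic follows, using the numbers from the log. The names `meets_requirement` and `describe` are hypothetical, and reading `total` as the available RAM is an assumption based on the message text, not on the IGT source.

```python
# Byte counts copied from the test output above.
REQUIRED_BYTES = 6442455040
AVAILABLE_BYTES = 1885339648

GIB = 1024 ** 3
MIB = 1024 ** 2


def meets_requirement(total: int, required: int) -> bool:
    """Mirror the logged requirement `!(total <= required)`.

    Assumption: `total` is the memory available to the test.
    """
    return not (total <= required)


def describe(required: int, available: int) -> str:
    """Summarise the check in GiB, with a RUN or SKIP verdict."""
    verdict = "RUN" if meets_requirement(available, required) else "SKIP"
    return (f"{verdict}: need {required / GIB:.2f} GiB, "
            f"have {available / GIB:.2f} GiB")


if __name__ == "__main__":
    print(describe(REQUIRED_BYTES, AVAILABLE_BYTES))
```

With these inputs the requirement is 6 GiB plus 4096 bytes against 1798 MiB of RAM, so the subtest is skipped before it exercises the driver.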

==Reproduce steps==
---------------------------- 
1. ./gem_evict_everything --run-subtest major-hang
Comment 1 lu hua 2015-01-21 06:50:37 UTC
gem_evict_everything/minor-hang also has this issue.
Comment 2 Chris Wilson 2015-01-21 09:20:14 UTC
*** Bug 88655 has been marked as a duplicate of this bug. ***
Comment 3 lu hua 2015-01-23 07:52:02 UTC
It also happens on BDW.
Comment 4 lu hua 2015-01-26 02:53:48 UTC
./gem_evict_everything --run-subtest swapping-hang also causes system hang on SNB.
Comment 5 Chris Wilson 2015-01-26 08:58:08 UTC
(In reply to lu hua from comment #4)
> ./gem_evict_everything --run-subtest swapping-hang also causes system hang
> on SNB.

I doubt it is the same bug. Please file it separately and we can dup if it does match.
Comment 6 lu hua 2015-01-27 06:22:44 UTC
(In reply to Chris Wilson from comment #5)
> (In reply to lu hua from comment #4)
> > ./gem_evict_everything --run-subtest swapping-hang also causes system hang
> > on SNB.
> 
> I doubt it is the same bug. Please file it separately and we can dup if it
> does match.

OK, I have reported bug 88821 to track it on SNB. Thanks.
Comment 7 lu hua 2015-02-06 03:15:41 UTC
Reported bug 89000 to track BDW.
Tested on BSW with i915.enable_ppgtt=0: it works well.
Comment 8 lu hua 2015-02-10 06:09:36 UTC
Bisect shows: 6d3d8274bc45de4babb62d64562d92af984dd238 is the first bad commit.
commit 6d3d8274bc45de4babb62d64562d92af984dd238
Author:     Nick Hoath <nicholas.hoath@intel.com>
AuthorDate: Thu Jan 15 13:10:39 2015 +0000
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Tue Jan 27 09:50:53 2015 +0100

    drm/i915: Subsume intel_ctx_submit_request in to drm_i915_gem_request

    Move all remaining elements that were unique to execlists queue items
    in to the associated request.

    Issue: VIZ-4274

    v2: Rebase. Fixed issue of overzealous freeing of request.
    v3: Removed re-addition of cleanup work queue (found by Daniel Vetter)
    v4: Rebase.
    v5: Actual removal of intel_ctx_submit_request. Update both tail and postfix
    pointer in __i915_add_request (found by Thomas Daniel)
    v6: Removed unrelated changes

    Signed-off-by: Nick Hoath <nicholas.hoath@intel.com>
    Reviewed-by: Thomas Daniel <thomas.daniel@intel.com>
    [danvet: Reformat comment with strange linebreaks.]
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
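
For readers unfamiliar with how a "first bad commit" is found between a known good and a known bad commit, the search can be sketched as a binary search. This is only an illustration under simplifying assumptions: a linear history, ordered oldest to newest, in which every commit after the first bad one is also bad. The name `first_bad` and the commit labels in the usage note are hypothetical. Real `git bisect` works on a commit graph and is driven by manual or scripted test runs.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is known good and commits[-1] is known bad,
    and that the history switches from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]
```

For example, with the hypothetical history `["c0", "c1", ..., "c7"]` and a predicate that reports a hang from `c5` onwards, `first_bad` returns `"c5"` after testing only three commits.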
Comment 9 Jani Nikula 2015-02-10 07:56:26 UTC
Please retest with current drm-intel-nightly that has

commit f82107950e9bda3779610e37bdfdccae6fc16f87
Author: Nick Hoath <nicholas.hoath@intel.com>
Date:   Thu Jan 29 16:55:07 2015 +0000

    drm/i915: Fix a use-after-free in intel_execlists_retire_requests
Comment 10 Jani Nikula 2015-02-10 08:17:54 UTC
*** Bug 88790 has been marked as a duplicate of this bug. ***
Comment 11 Jani Nikula 2015-02-10 08:18:04 UTC
*** Bug 88845 has been marked as a duplicate of this bug. ***
Comment 12 Jani Nikula 2015-02-10 08:18:14 UTC
*** Bug 88688 has been marked as a duplicate of this bug. ***
Comment 13 Jani Nikula 2015-02-10 08:18:22 UTC
*** Bug 88840 has been marked as a duplicate of this bug. ***
Comment 14 Jani Nikula 2015-02-10 08:18:48 UTC
*** Bug 89000 has been marked as a duplicate of this bug. ***
Comment 15 Jani Nikula 2015-02-10 08:18:57 UTC
*** Bug 89005 has been marked as a duplicate of this bug. ***
Comment 16 Jani Nikula 2015-02-10 08:19:07 UTC
*** Bug 88817 has been marked as a duplicate of this bug. ***
Comment 17 Jani Nikula 2015-02-10 08:22:03 UTC
All the dupes have the same bisected bad commit.
Comment 18 lu hua 2015-02-10 09:01:02 UTC
*** Bug 88987 has been marked as a duplicate of this bug. ***
Comment 19 Jani Nikula 2015-02-10 11:47:57 UTC
Note, when you test and verify this bug, please have a look at what the steps to reproduce were in the duplicate bugs, and see if they are truly fixed too. Thanks.
Comment 20 lu hua 2015-02-11 01:06:44 UTC
Tested on the latest drm-intel-nightly kernel; this issue still exists.
Tested commit ad95125eaef18eebb9f47261ce3c99957f5953de
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Feb 9 21:31:04 2015 +0100

    drm-intel-nightly: 2015y-02m-09d-20h-26m-16s UTC integration manifest
Comment 21 lu hua 2015-02-11 06:08:00 UTC
Tested on the latest -nightly kernel: gem_concurrent_blit/cpu-bcs-early-read-forked-hang-blt also causes a system hang. With i915.enable_ppgtt=0 added, the hang does not occur.
Comment 22 lu hua 2015-02-11 06:34:57 UTC
Tested some gem_concurrent_blit*hang* cases on BDW and BSW: they all cause a system hang and bisect to the same commit.
Comment 23 Nick Hoath 2015-02-12 12:30:51 UTC
I've submitted a fix for this issue:
https://patchwork.kernel.org/patch/5819071/
Comment 24 Jani Nikula 2015-02-12 12:46:53 UTC
Nick, for tracking purposes we keep the bugs open until we've merged the patches upstream.

lu hua, please test Nick's patch.
Comment 25 lu hua 2015-02-13 02:36:13 UTC
(In reply to Nick Hoath from comment #23)
> I've submitted a fix for this issue:
> https://patchwork.kernel.org/patch/5819071/

Applied this patch on the latest -nightly kernel and tested ./gem_evict_everything --run-subtest major-hang on BDW and BSW: it works well.
I will test the duplicate bugs later.
Comment 26 lu hua 2015-02-13 03:19:39 UTC
It also impacts SKL.
Comment 27 Jani Nikula 2015-02-16 16:32:44 UTC
Nick's latest patch is at 
http://patchwork.freedesktop.org/patch/42508

Please test this, also against the tests in the duplicates.
Comment 28 Jani Nikula 2015-02-18 16:02:02 UTC
(In reply to Jani Nikula from comment #27)
> Nick's latest patch is at 

Another update, http://patchwork.freedesktop.org/patch/42729

> Please test this, also against the tests in the duplicates.
Comment 29 Jani Nikula 2015-02-24 13:19:57 UTC
Fixed by

commit b3a38998f042b862f5ba4d7f2268f3a8dfb4883a
Author: Nick Hoath <nicholas.hoath@intel.com>
Date:   Thu Feb 19 16:30:47 2015 +0000

    drm/i915: Fix a use after free, and unbalanced refcounting

in drm-intel-fixes.
Comment 30 Ding Heng 2015-02-27 05:51:50 UTC
(In reply to Jani Nikula from comment #29)
> Fixed by
> 
> commit b3a38998f042b862f5ba4d7f2268f3a8dfb4883a
> Author: Nick Hoath <nicholas.hoath@intel.com>
> Date:   Thu Feb 19 16:30:47 2015 +0000
> 
>     drm/i915: Fix a use after free, and unbalanced refcounting
> 
> in drm-intel-fixes.

I tested with kernel commit b3a38998f042b862f5ba4d7f2268f3a8dfb4883a.

./gem_evict_everything --run-subtest swapping-hang still causes a system hang.
./gem_evict_everything --run-subtest major-hang reports an error, claiming that it needs 6G of free memory for this case. The return code is 77.
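
A return code of 77 together with the memory-requirement message matches the SKIP result shown in the description, and 77 is also the conventional "skipped" exit status in automake-style test harnesses. A minimal sketch of how a wrapper script might tell the cases apart is below. The function name `classify_exit` is hypothetical, and lumping every other non-zero status together as a failure is a simplifying assumption.

```python
def classify_exit(code: int) -> str:
    """Map a test's exit status to a coarse result.

    Assumption: 0 is a pass, 77 is a skip (requirement not met),
    and any other status counts as a failure.
    """
    if code == 0:
        return "pass"
    if code == 77:
        return "skip"
    return "fail"
```

Under this reading, the major-hang result reported here is a skip caused by insufficient RAM and not a new failure of the subtest itself.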
Comment 31 Nick Hoath 2015-02-27 15:40:33 UTC
Hi,
   What's the failure rate you're getting for 
./gem_evict_everything --run-subtest swapping-hang ?

   I've just run it on BDW with the latest nightly, and I'm getting a 1/10 failure due to paging request BUG, with nothing to indicate 6d3d8274bc45de4babb62d64562d92af984dd238 is the cause.

   I can't run ./gem_evict_everything --run-subtest major-hang as there isn't enough memory on my system.
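
A failure rate like the 1/10 mentioned above can be measured by running the subtest repeatedly and counting failing runs. The sketch below assumes the failure returns control to the caller; a full system hang would take the script down with it, so it cannot capture that case. The names `failure_rate` and `run_swapping_hang` are hypothetical, and treating exit status 77 as a skip is an assumption taken from the return code quoted in comment 30.

```python
import subprocess
from typing import Callable


def failure_rate(run_once: Callable[[], int], runs: int = 10) -> float:
    """Call run_once() `runs` times and return the fraction of failures.

    A run counts as a failure when its exit status is neither 0 (pass)
    nor 77 (assumed to mean skip).
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        if run_once() not in (0, 77):
            failures += 1
    return failures / runs


def run_swapping_hang() -> int:
    """Run the subtest from this report once and return its exit status."""
    cmd = ["./gem_evict_everything", "--run-subtest", "swapping-hang"]
    return subprocess.run(cmd).returncode
```

Calling `failure_rate(run_swapping_hang, runs=10)` from the IGT tests directory would then give a number that is directly comparable between machines and kernel commits.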
Comment 32 Ding Heng 2015-03-02 03:34:34 UTC
(In reply to Nick Hoath from comment #31)
> Hi,
>    What's the failure rate you're getting for 
> ./gem_evict_everything --run-subtest swapping-hang ?
> 
>    I've just run it on BDW with the latest nightly, and I'm getting a 1/10
> failure due to paging request BUG, with nothing to indicate
> 6d3d8274bc45de4babb62d64562d92af984dd238 is the cause.
> 
>    I can't run ./gem_evict_everything --run-subtest major-hang as there
> isn't enough memory on my system.


./gem_evict_everything --run-subtest swapping-hang 

I tried it with nightly commit 855932144a48a66081a62288bea6f2bbbf48e2e7 (2015-02-28) on BDW and 0b2a1076c5cb4f383d6a8c940ffab1e27f241097 (2015-02-25) on BSW; the reproduction rate is 100%.

./gem_evict_everything --run-subtest major-hang 
This case needs 6G of free memory, and I don't have a machine with that much memory available. However, this case was able to run before (refer to comment 25), and the machine I use is the same as Lu Hua's. What's the reason for this increase in the memory requirement?
Comment 33 Nick Hoath 2015-03-03 09:22:56 UTC
Hi,
   I'm investigating the lockup I see, but please can I have your kernel console output when the hang occurs from:
./gem_evict_everything --run-subtest swapping-hang
   to see if it's a different problem.
Comment 34 Nick Hoath 2015-03-03 11:20:47 UTC
FWIW the hang I am investigating still occurs without 6d3d8274bc45de4babb62d64562d92af984dd238, so it will need a new bug if the kernel console output matches.
Comment 35 wendy.wang 2015-03-05 07:25:27 UTC
Created attachment 114019
console output---call trace from running case to hang
Comment 36 wendy.wang 2015-03-05 08:28:52 UTC
(In reply to wendy.wang from comment #35)
> Created attachment 114019
> console output---call trace from running case to hang

This log is from testing the 0b2a1076c5cb4f383d6a8c940ffab1e27f241097 (2015-02-25) drm-intel-nightly kernel.
Comment 37 Nick Hoath 2015-03-05 09:45:31 UTC
This latest trace is the same problem I'm investigating.
It pre-dates 6d3d8274bc45de4babb62d64562d92af984dd238.
The original bug introduced in 6d3d8274bc45de4babb62d64562d92af984dd238 is fixed and the fix upstreamed, and as such I am closing this bug as fixed. I have created bug 89441 to track the newly reported (pre-existing) issue.
Comment 38 Jari Tahvanainen 2017-08-14 08:33:46 UTC
Moving old bug from Verified to Closed.

