Bug 74007 - set-to-domain returned EIO
Summary: set-to-domain returned EIO
Status: CLOSED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: highest blocker
Assignee: Mika Kuoppala
QA Contact: Intel GFX Bugs mailing list
Reported: 2014-01-24 02:39 UTC by lu hua
Modified: 2016-10-12 09:04 UTC


Attachments
dmesg (15.83 KB, text/plain), 2014-01-24 02:39 UTC, lu hua
dmesg (122.73 KB, text/plain), 2014-02-12 09:00 UTC, lu hua
drm/i915: fix forcewake counts for gen8 (4.42 KB, patch), 2014-02-18 16:46 UTC, Mika Kuoppala
drm/i915: Fix forcewake counts for gen8 (2.49 KB, patch), 2014-02-18 17:24 UTC, Mika Kuoppala
dmesg with patch (127.39 KB, text/plain), 2014-02-19 06:13 UTC, lu hua
dmesg(queued) (126.63 KB, text/plain), 2014-03-07 07:34 UTC, lu hua

Description lu hua 2014-01-24 02:39:23 UTC
Created attachment 92707 [details]
dmesg

System Environment:
--------------------------
Platform: Broadwell
Kernel(drm-intel-nightly)83ac01f486397fd8d319a3e31d1f95beb05037c5

Bug detailed description:
-------------------------
Running one gem_reset_stats subtest and then another, or running the same gem_reset_stats subtest twice, hangs the system. gem_reset_stats is a new test case.

Run ./gem_reset_stats --run-subtest reset-count-vebox twice.
output:
# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g1bbb607 (x86_64) (Linux: 3.13.0-rc8_drm-intel-nightly_83ac01_20140123+ x86_64)
Test assertion failure function gem_set_domain, file drmtest.c:535:
Last errno: 5, Input/output error
Failed assertion: drmIoctl((fd), ((((1U) << (((0+8)+8)+14)) | ((('d')) << (0+8)) | (((0x40 + 0x1f)) << 0) | ((((sizeof(struct drm_i915_gem_set_domain)))) << ((0+8)+8)))), (&set_domain)) == 0
Subtest reset-count-vebox: FAIL
# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g1bbb607 (x86_64) (Linux: 3.13.0-rc8_drm-intel-nightly_83ac01_20140123+ x86_64)

Reproduce steps:
-------------------------
1. Run ./gem_reset_stats --run-subtest reset-count-vebox twice.
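
The failed assertion in the output above is the preprocessor expansion of an ioctl request number, which makes it hard to read. As a reading aid, here is a minimal C sketch that rebuilds the same expression and splits it into its fields. The struct is a local stand-in with three 32-bit fields, assumed to have the same size as struct drm_i915_gem_set_domain; the helper names are hypothetical and are not part of i-g-t or libdrm.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-in: three 32-bit fields, assumed to match the size of
 * struct drm_i915_gem_set_domain (handle, read_domains, write_domain). */
struct set_domain_standin {
	uint32_t handle;
	uint32_t read_domains;
	uint32_t write_domain;
};

static_assert(sizeof(struct set_domain_standin) == 12,
	      "stand-in is expected to be 12 bytes");

/* The expression from the failed assertion, with the real struct
 * replaced by the stand-in above. */
static uint32_t set_domain_request(void)
{
	return ((1U) << (((0 + 8) + 8) + 14)) |
	       (('d') << (0 + 8)) |
	       ((0x40 + 0x1f) << 0) |
	       ((uint32_t)sizeof(struct set_domain_standin) << ((0 + 8) + 8));
}

/* Field extractors for the generic Linux ioctl layout:
 * nr in bits 0-7, type in bits 8-15, size in bits 16-29, dir in bits 30-31. */
static unsigned ioc_nr(uint32_t req)   { return req & 0xffu; }
static unsigned ioc_type(uint32_t req) { return (req >> 8) & 0xffu; }
static unsigned ioc_size(uint32_t req) { return (req >> 16) & 0x3fffu; }
static unsigned ioc_dir(uint32_t req)  { return (req >> 30) & 0x3u; }

/* Print the request number and its decoded fields on one line. */
static void print_request(uint32_t req)
{
	printf("request=0x%08x dir=%u type='%c' nr=0x%02x size=%u\n",
	       (unsigned)req, ioc_dir(req), (int)ioc_type(req),
	       ioc_nr(req), ioc_size(req));
}
```

Calling print_request(set_domain_request()) from main() shows a write-direction request (dir 1) of type 'd' with number 0x5f (0x40 + 0x1f) and a 12-byte argument, that is, the set-domain ioctl the test was issuing when it saw errno 5.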
Comment 1 Mika Kuoppala 2014-01-24 07:48:52 UTC

*** This bug has been marked as a duplicate of bug 73652 ***
Comment 2 Chris Wilson 2014-01-24 08:23:12 UTC
Repeat after me, the driver is never, ever, allowed to return EIO through set-to-domain (or i915_gem_fault() and friends).

If that i-g-t assert is reliable, that is just what happened here.
Comment 3 Mika Kuoppala 2014-01-24 08:30:59 UTC
(In reply to comment #2)
> Repeat after me, the driver is never, ever, allowed to return EIO through
> set-to-domain (or i915_gem_fault() and friends).
> 
> If that i-g-t assert is reliable, that is just what happened here.

There are currently multiple problems with resets in nightly,
but what is happening here is that gem_reset_stats (btw, fixed in the most recent master) stops all rings after injecting a hang.

In nightly, this results in the same context being claimed twice for the same hang, so it gets banned. Then comes the set_domain() for that context and, well, it has
been banned, erroneously.
Comment 4 Chris Wilson 2014-01-24 08:35:13 UTC
I still strongly believe that these interfaces (set-to-domain, pagefault) are special in that they require defense against internal errors. They are userspace's last resort at keeping the machine limping along long enough for people to save their work and report a bug.
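
One way to picture the rule argued for here is a filter at the boundary of the memory access entry points, so that an internal -EIO never escapes through them. The sketch below is purely illustrative: the enum, the function names, and the choice to report success instead of -EIO are all assumptions, and nothing in this thread says the i915 driver is or should be structured this way.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of driver entry points (not i915 code). */
enum entry_point {
	ENTRY_EXECBUF,     /* command submission: may report -EIO */
	ENTRY_SET_DOMAIN,  /* memory access path: must stay usable */
	ENTRY_PAGEFAULT    /* memory access path: must stay usable */
};

static bool is_failsafe_path(enum entry_point ep)
{
	return ep == ENTRY_SET_DOMAIN || ep == ENTRY_PAGEFAULT;
}

/*
 * Filter an internal result before it is returned to userspace.
 * Assumption: on failsafe paths an internal -EIO is reported as success
 * so that userspace can still reach its buffers; every other result
 * passes through unchanged.
 */
static int filter_result(enum entry_point ep, int internal_ret)
{
	if (internal_ret == -EIO && is_failsafe_path(ep))
		return 0;
	return internal_ret;
}

/* Self-check of the rule as stated in the comments above. */
static void check_policy(void)
{
	assert(filter_result(ENTRY_SET_DOMAIN, -EIO) == 0);
	assert(filter_result(ENTRY_PAGEFAULT, -EIO) == 0);
	assert(filter_result(ENTRY_EXECBUF, -EIO) == -EIO);
	assert(filter_result(ENTRY_SET_DOMAIN, -EINTR) == -EINTR);
}
```

Only -EIO on the failsafe paths is affected; command submission can still report -EIO, and interruptions such as -EINTR still reach the caller.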
Comment 5 Mika Kuoppala 2014-01-24 08:44:33 UTC
(In reply to comment #4)
> I still strongly believe that these interfaces (set-to-domain, pagefault)
> are special in that they require defense against internal errors. They are
> userspace's last resort at keeping the machine limping along long enough for
> people to save their work and report a bug.

Ok. Do you also consider returning -EIO to be an issue if the test has gone wild with the 'stop_rings' debugfs interface?

Another example:
Userspace sends a hanging batch to multiple rings and then does set_domain(). Is -EIO acceptable in this case?
Comment 6 Chris Wilson 2014-01-24 08:49:01 UTC
Yes. i-g-t is not foolproof, and we must avoid shooting ourselves in the foot and the user in the head even in extreme cases.
Comment 7 Daniel Vetter 2014-01-27 07:47:22 UTC
I don't really see how a banned context results in an -EIO in set_domain. It really shouldn't happen though ...
Comment 8 Gordon Jin 2014-01-28 01:54:32 UTC
Is this BDW specific? If so we'd keep [BDW] in title.
Comment 9 Chris Wilson 2014-01-28 07:32:32 UTC
(In reply to comment #8)
> Is this BDW specific? If so we'd keep [BDW] in title.

No. The specifics of this bug Mika believes to be bug 73652. I want to address the wider issue of making sure that the driver never returns EIO from memory access functions - as these are required for failsafe.
Comment 10 Mika Kuoppala 2014-01-28 11:44:10 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > Is this BDW specific? If so we'd keep [BDW] in title.
> 
> No. The specifics of this bug Mika believes to be bug 73652. I want to
> address the wider issue of making sure that the driver never returns EIO
> from memory access functions - as these are required for failsafe.

I agree with Chris here. My initial analysis was way too hasty, and I wrongly
assumed that the banning had something to do with set_domain. It doesn't and it shouldn't.

I couldn't reproduce this on IVB.
Comment 11 Chris Wilson 2014-01-31 16:59:12 UTC
If "./gem_reset_stats --run-subtest reset-count-vebox" [twice] still fails, can you please paste the error message after applying

commit 48ad03ca0c5f078b8d12a64323fd93b3858041af
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 31 16:56:01 2014 +0000

    lib: Capture errno on entry
    
    When printing the errno, it is important that we capture the user errno
    before we make any library calls - as they may alter the value.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=74007
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
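
The commit message above describes a general C pitfall: any library call made between the failing operation and the error report may overwrite errno. A minimal sketch of the pattern follows. All function names are hypothetical and this is not the i-g-t code itself; noisy_library_call() merely simulates a call that changes errno as a side effect.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for any library call that may change errno. */
static void noisy_library_call(void)
{
	errno = EAGAIN;
}

/* Fragile: errno is read only after another call had a chance to replace it. */
static int report_late(void)
{
	noisy_library_call();
	return errno;
}

/* Robust: snapshot errno on entry, before making any other call. */
static int report_on_entry(void)
{
	int saved_errno = errno;

	noisy_library_call();
	fprintf(stderr, "Last errno: %d, %s\n",
		saved_errno, strerror(saved_errno));
	return saved_errno;
}

/* Demonstration: pretend an ioctl has just failed with EIO. */
static void demo(void)
{
	errno = EIO;
	assert(report_late() == EAGAIN);   /* the original error is lost */

	errno = EIO;
	assert(report_on_entry() == EIO);  /* the original error is kept */
}
```

This matters for the report here because the "Last errno" line is the only evidence that set-domain actually returned EIO.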
Comment 12 Mika Kuoppala 2014-02-05 14:35:13 UTC
Please retest with the latest igt and drm-intel-nightly.
Comment 13 lu hua 2014-02-08 06:09:00 UTC
Tested on the latest igt and -nightly kernel.
Running ./gem_reset_stats --run-subtest reset-count-vebox twice no longer hangs.
[root@x-bdw02 tests]# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g971c7db (x86_64) (Linux: 3.13.0_drm-intel-nightly_8f5284_20140208+ x86_64)
Subtest reset-count-vebox: SUCCESS
[root@x-bdw02 tests]# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g971c7db (x86_64) (Linux: 3.13.0_drm-intel-nightly_8f5284_20140208+ x86_64)
Subtest reset-count-vebox: SUCCESS
[root@x-bdw02 tests]# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g971c7db (x86_64) (Linux: 3.13.0_drm-intel-nightly_8f5284_20140208+ x86_64)
Subtest reset-count-vebox: SUCCESS
Comment 14 Chris Wilson 2014-02-08 19:02:21 UTC
If it no longer produces the required fail, let's mark it as fixed until we have more evidence.
Comment 15 lu hua 2014-02-11 02:31:23 UTC
Verified. Fixed.
Comment 16 lu hua 2014-02-11 08:54:05 UTC
Reopening.
The ban-vebox subtest has this issue.
It happens in 1 out of 3 runs.
Run: ./gem_reset_stats --run-subtest ban-vebox
Comment 17 Chris Wilson 2014-02-11 09:16:58 UTC
Please attach the output of the failing command and the associated dmesg.
Comment 18 lu hua 2014-02-12 09:00:31 UTC
Created attachment 93919 [details]
dmesg
Comment 19 Chris Wilson 2014-02-12 20:00:27 UTC
And the output from the command? Does it still report EIO which is what this bug is concerned about?
Comment 20 lu hua 2014-02-13 08:36:04 UTC
(In reply to comment #19)
> And the output from the command? Does it still report EIO which is what this
> bug is concerned about?

output:
IGT-Version: 1.5-gec3b133 (x86_64) (Linux: 3.13.0_drm-intel-nightly_efb4a6_20140213+ x86_64)

The system hangs; I only got the attached dmesg.
Comment 21 Mika Kuoppala 2014-02-14 09:10:39 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > And the output from the command? Does it still report EIO which is what this
> > bug is concerned about?
> 
> output:
> IGT-Version: 1.5-gec3b133 (x86_64) (Linux:
> 3.13.0_drm-intel-nightly_efb4a6_20140213+ x86_64)
> 
> The system hangs; I only got the attached dmesg.

Was this with the same hardware as the first report?

Could you run './gem_reset_stats --run-subtest reset-count-render' just to rule out a more general regression in hang recovery? There seem to be lots of different bugs where BDW can't recover from a reset.
Comment 22 lu hua 2014-02-17 07:44:52 UTC
> Was this with the same hardware as the first report?
> 
Different hardware.

> Could you run './gem_reset_stats --run-subtest reset-count-render' just to
> rule out a more general regression in hang recovery? There seem to be lots
> of different bugs where BDW can't recover from a reset.

./gem_reset_stats --run-subtest reset-count-render also causes a system hang.
Comment 23 Mika Kuoppala 2014-02-18 16:46:15 UTC
Created attachment 94294 [details] [review]
drm/i915: fix forcewake counts for gen8
Comment 24 Mika Kuoppala 2014-02-18 16:47:08 UTC
Could you please test whether the attached patch makes a difference?
Comment 25 Mika Kuoppala 2014-02-18 17:24:42 UTC
Created attachment 94295 [details] [review]
drm/i915: Fix forcewake counts for gen8
Comment 26 lu hua 2014-02-19 06:12:14 UTC
(In reply to comment #25)
> Created attachment 94295 [details] [review]
> drm/i915: Fix forcewake counts for gen8

Tested this patch. The issue goes away.
output:
 ./gem_reset_stats
IGT-Version: 1.5-g9597836 (x86_64) (Linux: 3.13.0_nightly_74007_72631patch_20140219+ x86_64)
Subtest params: SUCCESS
Subtest reset-stats-render: SUCCESS
Subtest reset-stats-ctx-render: SUCCESS
Subtest ban-render: SUCCESS
Subtest ban-ctx-render: SUCCESS
Subtest reset-count-render: SUCCESS
Subtest reset-count-ctx-render: SUCCESS
Subtest unrelated-ctx-render: SUCCESS
Subtest close-pending-render: SUCCESS
Subtest close-pending-ctx-render: SUCCESS
Subtest close-pending-fork-render: SUCCESS
Subtest reset-stats-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-blt: SKIP
Subtest ban-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 9, Bad file descriptor
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-blt: SKIP
Subtest reset-count-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-blt: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-blt: SKIP
Subtest close-pending-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-blt: SKIP
Subtest close-pending-fork-blt: SUCCESS
Subtest reset-stats-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-bsd: SKIP
retrying for ban (9)
Subtest ban-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 9, Bad file descriptor
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-bsd: SKIP
Subtest reset-count-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-bsd: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-bsd: SKIP
Subtest close-pending-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-bsd: SKIP
Subtest close-pending-fork-bsd: SUCCESS
Subtest reset-stats-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-vebox: SKIP
Subtest ban-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 9, Bad file descriptor
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-vebox: SKIP
Subtest reset-count-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-vebox: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-vebox: SKIP
Subtest close-pending-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-vebox: SKIP
Subtest close-pending-fork-vebox: SUCCESS
Comment 27 lu hua 2014-02-19 06:13:18 UTC
Created attachment 94324 [details]
dmesg with patch
Comment 28 Mika Kuoppala 2014-02-19 17:07:23 UTC
Now that the system hang seems to be solved, could you please try to reproduce the original issue, which is gem_set_domain failing, by running

./gem_reset_stats --run-subtest reset-count-vebox

multiple times in a row?
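
For repeated runs like this, a small harness can stop at the first failing cycle. The sketch below is an assumption-laden helper, not part of i-g-t: run_repeatedly() is a hypothetical name, the command is run through the shell with system(), and it relies on the test binary exiting with a non-zero status on failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Run `cmd` through the shell up to `cycles` times and stop at the first
 * run that does not exit with status 0. Returns the number of consecutive
 * successful runs, so a return value equal to `cycles` means every run passed.
 */
static int run_repeatedly(const char *cmd, int cycles)
{
	int passed = 0;

	assert(cmd != NULL && cycles >= 0);

	for (int i = 0; i < cycles; i++) {
		int status = system(cmd);

		/* -1: the shell could not be started; otherwise check how it exited. */
		if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
			break;
		passed++;
	}
	return passed;
}
```

For example, run_repeatedly("./gem_reset_stats --run-subtest reset-count-vebox", 20) would cover a 20-cycle run and report how many cycles passed before the first failure.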
Comment 29 lu hua 2014-02-20 07:39:16 UTC
(In reply to comment #28)
> Now that the system hang seems to be solved, could you please try to
> reproduce the original issue, which is gem_set_domain failing, by running
> 
> ./gem_reset_stats --run-subtest reset-count-vebox
> 
> multiple times in a row?

Do you mean apply the patch and then run it multiple times? I ran 20 cycles and it passes.
Comment 30 Daniel Vetter 2014-03-03 14:13:48 UTC
Assigning to Mika to shepherd his fix.
Comment 31 Daniel Vetter 2014-03-06 07:50:22 UTC
This is now resolved in dinq. Ben has the AR to assemble BDW patches for -fixes/3.14.
Comment 32 lu hua 2014-03-07 07:34:26 UTC
Created attachment 95292 [details]
dmesg(queued)

On the latest -queued, the hang goes away. Bug 75876 tracks the remaining failure.
IGT-Version: 1.5-gcdf74b6 (x86_64) (Linux: 3.14.0-rc5_drm-intel-nightly_2e7638_20140307+ x86_64)
Subtest params: SUCCESS
Subtest reset-stats-render: SUCCESS
Subtest reset-stats-ctx-render: SUCCESS
retrying for ban (9)
retrying for ban (8)
retrying for ban (7)
retrying for ban (6)
retrying for ban (5)
retrying for ban (4)
retrying for ban (3)
retrying for ban (2)
retrying for ban (1)
retrying for ban (0)
Test assertion failure function test_ban, file gem_reset_stats.c:582:
Last errno: 11, Resource temporarily unavailable
Failed assertion: h4 == -EIO
Subtest ban-render: FAIL
Subtest ban-ctx-render: SUCCESS
Subtest reset-count-render: SUCCESS
Subtest reset-count-ctx-render: SUCCESS
Subtest unrelated-ctx-render: SUCCESS
Subtest close-pending-render: SUCCESS
Subtest close-pending-ctx-render: SUCCESS
Subtest close-pending-fork-render: SUCCESS
Subtest reset-stats-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-blt: SKIP
retrying for ban (9)
retrying for ban (8)
retrying for ban (7)
retrying for ban (6)
retrying for ban (5)
retrying for ban (4)
retrying for ban (3)
retrying for ban (2)
retrying for ban (1)
retrying for ban (0)
Test assertion failure function test_ban, file gem_reset_stats.c:582:
Last errno: 11, Resource temporarily unavailable
Failed assertion: h4 == -EIO
Subtest ban-blt: FAIL
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-blt: SKIP
Subtest reset-count-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-blt: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-blt: SKIP
Subtest close-pending-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-blt: SKIP
Subtest close-pending-fork-blt: SUCCESS
Subtest reset-stats-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-bsd: SKIP
retrying for ban (9)
retrying for ban (8)
retrying for ban (7)
retrying for ban (6)
retrying for ban (5)
retrying for ban (4)
retrying for ban (3)
retrying for ban (2)
retrying for ban (1)
retrying for ban (0)
Test assertion failure function test_ban, file gem_reset_stats.c:582:
Last errno: 11, Resource temporarily unavailable
Failed assertion: h4 == -EIO
Subtest ban-bsd: FAIL
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-bsd: SKIP
Subtest reset-count-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-bsd: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-bsd: SKIP
Subtest close-pending-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-bsd: SKIP
Subtest close-pending-fork-bsd: SUCCESS
Subtest reset-stats-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-vebox: SKIP
retrying for ban (9)
retrying for ban (8)
retrying for ban (7)
retrying for ban (6)
retrying for ban (5)
retrying for ban (4)
retrying for ban (3)
retrying for ban (2)
retrying for ban (1)
retrying for ban (0)
Test assertion failure function test_ban, file gem_reset_stats.c:582:
Last errno: 11, Resource temporarily unavailable
Failed assertion: h4 == -EIO
Subtest ban-vebox: FAIL
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-vebox: SKIP
Subtest reset-count-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-vebox: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-vebox: SKIP
Subtest close-pending-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-vebox: SKIP
Subtest close-pending-fork-vebox: SUCCESS
Comment 33 lu hua 2014-03-10 06:34:18 UTC
Verified.
Comment 34 Jari Tahvanainen 2016-10-12 09:04:04 UTC
Closing verified+fixed.

