Created attachment 92707 [details]
dmesg

System Environment:
--------------------------
Platform: Broadwell
Kernel: (drm-intel-nightly) 83ac01f486397fd8d319a3e31d1f95beb05037c5

Bug detailed description:
-------------------------
Running one gem_reset_stats subcase and then another, or running the same subcase twice, hangs the system. gem_reset_stats is a new test case.

Run ./gem_reset_stats --run-subtest reset-count-vebox twice.

output:
# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g1bbb607 (x86_64) (Linux: 3.13.0-rc8_drm-intel-nightly_83ac01_20140123+ x86_64)
Test assertion failure function gem_set_domain, file drmtest.c:535:
Last errno: 5, Input/output error
Failed assertion: drmIoctl((fd), ((((1U) << (((0+8)+8)+14)) | ((('d')) << (0+8)) | (((0x40 + 0x1f)) << 0) | ((((sizeof(struct drm_i915_gem_set_domain)))) << ((0+8)+8)))), (&set_domain)) == 0
Subtest reset-count-vebox: FAIL
# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g1bbb607 (x86_64) (Linux: 3.13.0-rc8_drm-intel-nightly_83ac01_20140123+ x86_64)

Reproduce steps:
-------------------------
1. Run ./gem_reset_stats --run-subtest reset-count-vebox twice.
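For readers puzzled by the long expression in the failed assertion: it appears to be the preprocessor expansion of the ioctl request number for the i915 set-domain call (direction "write", type 'd', command 0x40 + 0x1f, plus the argument size). The sketch below recomputes that number. The struct is a local stand-in that assumes struct drm_i915_gem_set_domain consists of three 32-bit fields, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct drm_i915_gem_set_domain; assumed to be three
 * 32-bit fields (handle, read_domains, write_domain). */
struct set_domain_args {
    uint32_t handle;
    uint32_t read_domains;
    uint32_t write_domain;
};

/* Pack the four fields exactly as the expanded expression in the
 * assertion message does: dir << 30, size << 16, type << 8, nr << 0. */
uint32_t ioctl_request(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    return (dir << (8 + 8 + 14)) | (type << 8) | (nr << 0) | (size << (8 + 8));
}

/* Request number used by the failing drmIoctl() call (hypothetical helper). */
uint32_t set_domain_request(void)
{
    assert(sizeof(struct set_domain_args) == 12);
    return ioctl_request(1U, 'd', 0x40 + 0x1f,
                         (uint32_t)sizeof(struct set_domain_args));
}
```

Under these assumptions set_domain_request() evaluates to 0x400C645F, so the assertion message simply says that the set-domain ioctl returned non-zero with errno 5 (EIO).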
*** This bug has been marked as a duplicate of bug 73652 ***
Repeat after me, the driver is never, ever, allowed to return EIO through set-to-domain (or i915_gem_fault() and friends). If that i-g-t assert is reliable, that is just what happened here.
(In reply to comment #2)
> Repeat after me, the driver is never, ever, allowed to return EIO through
> set-to-domain (or i915_gem_fault() and friends).
>
> If that i-g-t assert is reliable, that is just what happened here.

There are currently multiple problems with resets in nightly. What is happening here is that gem_reset_stats (fixed in the most recent master, by the way) stops all rings after injecting a hang. In nightly, this results in the same context being claimed twice for the same hang, so it gets banned. Then comes the set_domain() for that context and, well, it has been erroneously banned.
I still strongly believe that these interfaces (set-to-domain, pagefault) are special in that they require defense against internal errors. They are userspace's last resort at keeping the machine limping along long enough for people to save their work and report a bug.
(In reply to comment #4)
> I still strongly believe that these interfaces (set-to-domain, pagefault)
> are special in that they require defense against internal errors. They are
> userspaces last resort at keeping the machine limping along long enough for
> people to save their work and report a bug.

Ok. Do you also consider returning -EIO to be an issue if the test has gone wild with the 'stop_rings' debugfs interface?

Another example: userspace sends a hanging batch to multiple rings and then does set_domain(). Is -EIO acceptable in this case?
Yes. i-g-t is not foolproof, and we must avoid shooting ourselves in the foot and the user in the head even in extreme cases.
I don't really see how a banned context results in an -EIO in set_domain. It really shouldn't happen though ...
Is this BDW specific? If so we'd keep [BDW] in title.
(In reply to comment #8)
> Is this BDW specific? If so we'd keep [BDW] in title.

No. The specifics of this bug Mika believes to be bug 73652. I want to address the wider issue of making sure that the driver never returns EIO from memory access functions, as these are required for failsafe.
(In reply to comment #9)
> (In reply to comment #8)
> > Is this BDW specific? If so we'd keep [BDW] in title.
>
> No. The specifics of this bug Mika believes to be bug 73652. I want to
> address the wider issue of making sure that the driver never returns EIO
> from memory access functions - as these are required for failsafe.

I agree with Chris here. My initial analysis was way too hasty: I wrongly assumed the banning had something to do with set_domain. It doesn't and it shouldn't.

I couldn't reproduce this with IVB.
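To make the policy concrete, here is a minimal sketch of the kind of filtering being argued for. It is not taken from the i915 driver: the function name is hypothetical, and treating -EIO as success is only one assumed way of keeping userspace limping along.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical filter for the return value of a memory access path
 * (set-to-domain, page fault). Not actual driver code. */
int memory_access_result(int ret)
{
    /* Kernel-style convention: 0 on success, negative errno on failure. */
    assert(ret <= 0);

    /* A hung GPU is an internal error. Swallow it so that userspace can
     * still reach its buffers, save its work and report a bug. */
    if (ret == -EIO)
        return 0;

    /* Everything else (for example -EINTR) is passed through unchanged. */
    return ret;
}
```

With this sketch, memory_access_result(-EIO) yields 0 while an unrelated error such as -EINTR is still reported to the caller.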
If "./gem_reset_stats --run-subtest reset-count-vebox" [twice] still fails, can you please paste the error message after applying

commit 48ad03ca0c5f078b8d12a64323fd93b3858041af
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 31 16:56:01 2014 +0000

    lib: Capture errno on entry

    When printing the errno, it is important that we capture the user errno
    before we make any library calls - as they may alter the value.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=74007
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
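The idea behind that commit can be illustrated with a small sketch. This is not the i-g-t code itself; the function name and the message layout are assumptions chosen to resemble the "Last errno" line in the test output.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical failure formatter. Returns the errno value that was
 * current when the function was entered. */
int format_failure(char *buf, size_t len, const char *func)
{
    /* Capture errno first: any library call below may overwrite it. */
    int saved_errno = errno;

    assert(buf != NULL && len > 0);

    /* snprintf() and strerror() are allowed to modify errno, so only the
     * saved copy is used from here on. */
    snprintf(buf, len, "Test assertion failure function %s: Last errno: %d, %s",
             func, saved_errno, strerror(saved_errno));
    return saved_errno;
}
```

If errno were read only after the formatting calls, the report could show an unrelated value, which is why the error message is worth re-checking with the commit applied.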
please retest with latest igt and drm-intel-nightly
Tested on latest igt and -nightly kernel. Running ./gem_reset_stats --run-subtest reset-count-vebox twice, the hang goes away.

[root@x-bdw02 tests]# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g971c7db (x86_64) (Linux: 3.13.0_drm-intel-nightly_8f5284_20140208+ x86_64)
Subtest reset-count-vebox: SUCCESS
[root@x-bdw02 tests]# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g971c7db (x86_64) (Linux: 3.13.0_drm-intel-nightly_8f5284_20140208+ x86_64)
Subtest reset-count-vebox: SUCCESS
[root@x-bdw02 tests]# ./gem_reset_stats --run-subtest reset-count-vebox
IGT-Version: 1.5-g971c7db (x86_64) (Linux: 3.13.0_drm-intel-nightly_8f5284_20140208+ x86_64)
Subtest reset-count-vebox: SUCCESS
If it no longer produces the required fail, let's mark it as fixed until we have more evidence.
Verified.Fixed.
Reopening. Subcase ban-vebox has this issue; it happens in about 1 in 3 runs.

run: ./gem_reset_stats --run-subtest ban-vebox
Kindly please attach the output of the failing command and the associated dmesg.
Created attachment 93919 [details] dmesg
And the output from the command? Does it still report EIO which is what this bug is concerned about?
(In reply to comment #19)
> And the output from the command? Does it still report EIO which is what this
> bug is concerned about?

output:
IGT-Version: 1.5-gec3b133 (x86_64) (Linux: 3.13.0_drm-intel-nightly_efb4a6_20140213+ x86_64)

The system hangs, so I only get the attached dmesg.
(In reply to comment #20)
> (In reply to comment #19)
> > And the output from the command? Does it still report EIO which is what this
> > bug is concerned about?
>
> output:
> IGT-Version: 1.5-gec3b133 (x86_64) (Linux:
> 3.13.0_drm-intel-nightly_efb4a6_20140213+ x86_64)
>
> System hangs, only get the attached dmesg.

Was this with the same hardware as the first report?

Could you run ./gem_reset_stats --run-subtest reset-count-render, just to rule out that we have regressed on hang recovery more generally? There seem to be lots of different bugs where BDW can't recover from reset.
> Was this with the same hardware as the first report?

Different hardware.

> Could you run ./gem_reset_stats --run-subtest reset-count-render just to
> rule out that we haven't regressed on the hang recovery part in more
> generally. As there seems to be lots of different bugs where BDW can't
> recover from reset.

./gem_reset_stats --run-subtest reset-count-render also causes a system hang.
Created attachment 94294 [details] [review] drm/i915: fix forcewake counts for gen8
Could you please test with the attached patch if that makes a difference?
Created attachment 94295 [details] [review] drm/i915: Fix forcewake counts for gen8
(In reply to comment #25)
> Created attachment 94295 [details] [review]
> drm/i915: Fix forcewake counts for gen8

Tested this patch. The issue goes away.

output:
./gem_reset_stats
IGT-Version: 1.5-g9597836 (x86_64) (Linux: 3.13.0_nightly_74007_72631patch_20140219+ x86_64)
Subtest params: SUCCESS
Subtest reset-stats-render: SUCCESS
Subtest reset-stats-ctx-render: SUCCESS
Subtest ban-render: SUCCESS
Subtest ban-ctx-render: SUCCESS
Subtest reset-count-render: SUCCESS
Subtest reset-count-ctx-render: SUCCESS
Subtest unrelated-ctx-render: SUCCESS
Subtest close-pending-render: SUCCESS
Subtest close-pending-ctx-render: SUCCESS
Subtest close-pending-fork-render: SUCCESS
Subtest reset-stats-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-blt: SKIP
Subtest ban-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 9, Bad file descriptor
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-blt: SKIP
Subtest reset-count-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-blt: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-blt: SKIP
Subtest close-pending-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-blt: SKIP
Subtest close-pending-fork-blt: SUCCESS
Subtest reset-stats-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-bsd: SKIP
retrying for ban (9)
Subtest ban-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 9, Bad file descriptor
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-bsd: SKIP
Subtest reset-count-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-bsd: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-bsd: SKIP
Subtest close-pending-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-bsd: SKIP
Subtest close-pending-fork-bsd: SUCCESS
Subtest reset-stats-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-vebox: SKIP
Subtest ban-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 9, Bad file descriptor
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-vebox: SKIP
Subtest reset-count-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-vebox: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-vebox: SKIP
Subtest close-pending-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-vebox: SKIP
Subtest close-pending-fork-vebox: SUCCESS
Created attachment 94324 [details] dmesg with patch
Now that the system hang seems to be solved, could you please try to reproduce the original issue, which is gem_set_domain failing, by running:

./gem_reset_stats --run-subtest reset-count-vebox

multiple times in a row?
(In reply to comment #28)
> Now as the system hang issue seems to be solved, could you please try to
> reproduce the original issue which is gem_set_domain failing, by running:
>
> ./gem_reset_stats --run-subtest reset-count-vebox
>
> multiple times in row

Apply the patch and then run multiple times? Ran 20 cycles; it passes.
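For intermittent failures like this one, a small harness that repeats the subtest and counts failing runs can make the "N cycles" check reproducible. The sketch below is illustrative only: the function name is hypothetical, it assumes the test signals failure through a non-zero exit status, and it cannot detect a full system hang.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Run cmd `cycles` times through the shell and return how many runs
 * did not exit normally with status 0. */
int count_failed_runs(const char *cmd, int cycles)
{
    int failures = 0;

    assert(cmd != NULL);

    for (int i = 0; i < cycles; i++) {
        int status = system(cmd);

        /* Count spawn errors, signals and non-zero exit codes alike. */
        if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return failures;
}
```

A 20-cycle run as described above would then be count_failed_runs("./gem_reset_stats --run-subtest reset-count-vebox", 20), with a return value of 0 meaning every cycle passed.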
Assigning to Mika to shepherd his fix.
This is now resolved in dinq. Ben has the AR to assemble BDW patches for -fixes/3.14.
Created attachment 95292 [details]
dmesg(queued)

On latest -queued, the hang goes away. Bug 75876 tracks the remaining failure.

IGT-Version: 1.5-gcdf74b6 (x86_64) (Linux: 3.14.0-rc5_drm-intel-nightly_2e7638_20140307+ x86_64)
Subtest params: SUCCESS
Subtest reset-stats-render: SUCCESS
Subtest reset-stats-ctx-render: SUCCESS
retrying for ban (9)
retrying for ban (8)
retrying for ban (7)
retrying for ban (6)
retrying for ban (5)
retrying for ban (4)
retrying for ban (3)
retrying for ban (2)
retrying for ban (1)
retrying for ban (0)
Test assertion failure function test_ban, file gem_reset_stats.c:582:
Last errno: 11, Resource temporarily unavailable
Failed assertion: h4 == -EIO
Subtest ban-render: FAIL
Subtest ban-ctx-render: SUCCESS
Subtest reset-count-render: SUCCESS
Subtest reset-count-ctx-render: SUCCESS
Subtest unrelated-ctx-render: SUCCESS
Subtest close-pending-render: SUCCESS
Subtest close-pending-ctx-render: SUCCESS
Subtest close-pending-fork-render: SUCCESS
Subtest reset-stats-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-blt: SKIP
retrying for ban (9)
retrying for ban (8)
retrying for ban (7)
retrying for ban (6)
retrying for ban (5)
retrying for ban (4)
retrying for ban (3)
retrying for ban (2)
retrying for ban (1)
retrying for ban (0)
Test assertion failure function test_ban, file gem_reset_stats.c:582:
Last errno: 11, Resource temporarily unavailable
Failed assertion: h4 == -EIO
Subtest ban-blt: FAIL
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-blt: SKIP
Subtest reset-count-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-blt: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-blt: SKIP
Subtest close-pending-blt: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-blt: SKIP
Subtest close-pending-fork-blt: SUCCESS
Subtest reset-stats-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-bsd: SKIP
retrying for ban (9)
retrying for ban (8)
retrying for ban (7)
retrying for ban (6)
retrying for ban (5)
retrying for ban (4)
retrying for ban (3)
retrying for ban (2)
retrying for ban (1)
retrying for ban (0)
Test assertion failure function test_ban, file gem_reset_stats.c:582:
Last errno: 11, Resource temporarily unavailable
Failed assertion: h4 == -EIO
Subtest ban-bsd: FAIL
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-bsd: SKIP
Subtest reset-count-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-bsd: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-bsd: SKIP
Subtest close-pending-bsd: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-bsd: SKIP
Subtest close-pending-fork-bsd: SUCCESS
Subtest reset-stats-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1063:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-stats-ctx-vebox: SKIP
retrying for ban (9)
retrying for ban (8)
retrying for ban (7)
retrying for ban (6)
retrying for ban (5)
retrying for ban (4)
retrying for ban (3)
retrying for ban (2)
retrying for ban (1)
retrying for ban (0)
Test assertion failure function test_ban, file gem_reset_stats.c:582:
Last errno: 11, Resource temporarily unavailable
Failed assertion: h4 == -EIO
Subtest ban-vebox: FAIL
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1069:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest ban-ctx-vebox: SKIP
Subtest reset-count-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1075:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest reset-count-ctx-vebox: SKIP
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1078:
Last errno: 11, Resource temporarily unavailable
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest unrelated-ctx-vebox: SKIP
Subtest close-pending-vebox: SUCCESS
Test requirement not met in function __real_main1025, file gem_reset_stats.c:1086:
Last errno: 2, No such file or directory
Test requirement: (RING_HAS_CONTEXTS == false)
Subtest close-pending-ctx-vebox: SKIP
Subtest close-pending-fork-vebox: SUCCESS
Verified.
Closing verified+fixed.