As part of debugging hangs, we've started looking more closely at the error states generated by i915. Together with Jason, I've noticed that the data produced is incorrect: chunks of 32 bytes appear to just go missing (replaced by 0s).

For example, this bug report: https://bugs.freedesktop.org/show_bug.cgi?id=107586 has two error states in which you can see that the context image has its first 32 bytes as 0s. What we should find at that location looks more like this: https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/intel/tools/gen8_context.h#L27 i.e. an MI_NOOP followed by an MI_LRI.

We've noticed this issue on both Skylake & Kabylake (I believe it affects at least all big cores).

To work around this issue I came up with this patch: https://github.com/djdeath/linux/commit/c18d4e1ee66cf587c484a60bba64f3dc4f35fc2e This is probably wrong, as Chris pointed out on IRC, but it gets us correct data in the error state.

The issue can easily be reproduced by running the IGT drv_hangman test on the render ring and checking the content of the "rcs0 --- HW context" BO in the error state. At offset 8092 we should find the MI_NOOP followed by the MI_LRI mentioned above, but instead we find 32 bytes of 0s.
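Not part of the original report, but the "zeroed chunk" check itself is easy to script. A rough Python sketch (the raw binary dump of the context image, and how you obtain it, are assumptions):

```python
# Sketch: find 32-byte-aligned all-zero chunks in a raw dump of the
# "rcs0 --- HW context" BO. How the dump is produced is left open;
# this only illustrates the check described above.

def zeroed_chunks(data: bytes, chunk: int = 32):
    """Return offsets of chunk-sized, chunk-aligned runs that are all zero."""
    zero = b"\x00" * chunk
    return [off for off in range(0, len(data) - chunk + 1, chunk)
            if data[off:off + chunk] == zero]

# Synthetic example: a healthy context image starts with MI_NOOP / MI_LRI,
# a corrupted one starts with 32 bytes of zeros.
corrupted = b"\x00" * 32 + bytes(range(1, 33))
print(zeroed_chunks(corrupted))   # -> [0]
```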
Reproduced using:

import { gem, gem_ioctls } from '../lib/i915/gem-ioctls.js';
import { drm } from '../lib/drm.js';

const ioctl = igt.os.ioctl;
const print = igt.toy.print;

const i915 = drm.cards()[0];
var fd;

/* Clear any previous error state. */
fd = igt.os.open(drm.sysfs(i915) + '/' + 'error');
igt.os.write(fd, Uint32Array.of(0));
igt.os.close(fd);

const reloc_sz = gem_ioctls.s_execbuf_reloc.__pack_size__;
const reloc = new ArrayBuffer(2 * reloc_sz);

/* Build a self-referencing batch: store a canary dword into its own last
 * dword, then spin forever (MI_BATCH_BUFFER_START back to offset 64). */
const batch = gem.create(i915, 4096);
let cs = new Uint32Array(gem.mmap(i915, batch, 4096, 1));
gem.set_domain(i915, batch, 0x80, 0x80);

cs[0] = 0x20 << 23 | 2; /* MI_STORE_DWORD_IMM */
ioctl.pack(new DataView(reloc, 0, reloc_sz),
           gem_ioctls.s_execbuf_reloc,
           new Map([
                   [ "target_handle", batch ],
                   [ "delta", 4096 - 4 ],
                   [ "offset", 4 ],
           ]));
cs[1] = 4096 - 4;
cs[2] = 0;
cs[3] = 1;

cs[16] = 0x5 << 23;               /* MI_ARB_CHECK */
cs[17] = 0x31 << 23 | 1 << 8 | 1; /* MI_BATCH_BUFFER_START */
ioctl.pack(new DataView(reloc, reloc_sz, reloc_sz),
           gem_ioctls.s_execbuf_reloc,
           new Map([
                   [ "target_handle", batch ],
                   [ "delta", 64 ],
                   [ "offset", 4 * 17 ],
           ]));
cs[18] = 64;
cs[19] = 0;

/* Fill a 256MiB BO with a recognisable pattern: dword = page << 16 | x. */
const handle = gem.create(i915, 4096 * 65536);
let data = new Uint32Array(gem.mmap(i915, handle, 4096 * 65536));
for (let page = 0; page < 65536; page++) {
	for (let x = 0; x < 4096/4; x++)
		data[page * 4096/4 + x] = page << 16 | x;
}

const sz = gem_ioctls.s_execbuf_object.__pack_size__;
const objects = new ArrayBuffer(sz * 2);
ioctl.pack(new DataView(objects, 0, sz),
           gem_ioctls.s_execbuf_object,
           new Map([
                   [ "handle", handle ],
                   [ "flags", 1 << 7 ], /* EXEC_OBJECT_CAPTURE */
           ]));
ioctl.pack(new DataView(objects, sz, sz),
           gem_ioctls.s_execbuf_object,
           new Map([
                   [ "handle", batch ],
                   [ "relocation_count", 2 ],
                   [ "relocs_ptr", reloc ],
           ]));

gem.execbuf(i915, ioctl.pack(undefined,
                             gem_ioctls.s_execbuf,
                             new Map([
                                     [ "buffers_ptr", objects ],
                                     [ "buffer_count", 2 ]
                             ])));
igt.assert(gem.busy(i915, batch));

/* Wait for the spinning batch to write its canary dword. */
while (!cs[1023] && gem.busy(i915, batch))
	;

/* Declare the GPU wedged to force an error-state capture. */
fd = igt.os.open(drm.debugfs(i915) + '/' + 'i915_wedged');
igt.os.write(fd, Uint32Array.of(0xffffffff));
igt.os.close(fd);

/* Read back the captured user BO and verify the fill pattern. */
fd = igt.os.open(drm.sysfs(i915) + '/' + 'error');
const re = /--- user =/;
let line;
do {
	line = igt.os.readLine(fd);
} while (line && !re.test(line));
let error = igt.os.readLine(fd);
igt.os.close(fd);

let out = igt.ascii85.decode(error, error[0] == ':', 1);
for (var page = 0; page < 65536; page++) {
	let a = new Uint32Array(out, 4096*page, 4096/4);
	for (var i = 0; i < 4096/4; i++) {
		if (a[i] != (page << 16 | i)) {
			print(`Error detected on page:${page}, byte:${4*i}`);
			print(new Uint32Array(out, 4096*page + 4*(i&-32), 64));
			exit(1);
		}
	}
}

For science!
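For reference, the decode-and-verify step of the script above can be sketched in standalone Python. This assumes the BO lines in the error state use Adobe-style ascii85 (what Python's base64.a85decode expects) and that a leading ':' marks a zlib-deflated payload, mirroring the igt.ascii85.decode(error, error[0] == ':', 1) call above:

```python
# Assumption-laden sketch: decode one ascii85 line of BO data from the i915
# error state and verify the page << 16 | x fill pattern used above.
import base64
import struct
import zlib

def decode_bo_line(line: str) -> bytes:
    """Decode one ascii85-encoded line of BO data.

    Assumption: a leading ':' marks a zlib-deflated payload, as in the
    JS reproducer's error[0] == ':' check.
    """
    compressed = line.startswith(":")
    payload = base64.a85decode(line[1:] if compressed else line)
    return zlib.decompress(payload) if compressed else payload

def check_pattern(buf: bytes):
    """Return (page, byte-offset-in-page) of dwords deviating from page<<16|x."""
    words = struct.unpack("<%dI" % (len(buf) // 4), buf)
    bad = []
    for i, w in enumerate(words):
        page, x = divmod(i, 1024)  # 4096/4 dwords per page
        if w != (page << 16 | x):
            bad.append((page, 4 * x))
    return bad

# Round-trip demo on one synthetic page of the fill pattern (page 0: dword = x).
raw = struct.pack("<1024I", *range(1024))
line = ":" + base64.a85encode(zlib.compress(raw)).decode()
print(check_pattern(decode_bo_line(line)))   # -> []
```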
As a further bit of information: sometimes the data isn't zero, it's actually other bits of some buffer or other. I've seen several dumps where bits of the first page of the batch end up showing up in the second page of the batch.
The oddity is that the corruption came from the misplaced reloc (overwriting the BB_START itself). When run as a plain tight spin loop, no errors crop up in the capture. At least in this simple case of checking 65,536 pages, the corruption was an artifact of the hang.
Jason, Lionel, are you still seeing the problem? There have been changes in the capture system and we are wondering whether they helped with this issue.
(In reply to Francesco Balestrieri from comment #4)
> Jason, Lionel, are you still seeing the problem? There have been changes in
> the capture system and we are wondering whether they helped with this issue.

I'll have a look. In the meantime, if you want to check, you can run the drv_hangman test and inspect the error state produced.
#assessing

i915_hangman --r error-state-capture-rcs0 --debug 2>&1 | tee ~/tmp/107691.log

With the latest drm-tip and igt, running the above, I do not see an 'rcs0 --- HW context' section in the output/log. I do see data in 'rcs0 --- NULL context' though; it looks like ASCII. Is there a way to interpret the data in this context? Is this a new change in the capture system that shows just the default state?
(In reply to umesh from comment #6)
> #assessing
>
> i915_hangman --r error-state-capture-rcs0 --debug 2>&1 | tee ~/tmp/107691.log
>
> With latest drm-tip and igt, running above, I do not see an 'rcs0 --- HW
> context' section in the output/log. I do see data in 'rcs0 --- NULL context'
> though, looks ascii. Is there a way to interpret the data in this context?
> Is this a new change in the capture system to show just the default state?

I usually look at it with the Mesa aubinator_error_decode tool. Use the --all-bb option to decode the context image (otherwise it just decodes the batches).
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/137.