Bug 107691 - Invalid data in error state
Summary: Invalid data in error state
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: XOrg git
Hardware: Other All
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-26 11:10 UTC by Lionel Landwerlin
Modified: 2019-11-29 17:49 UTC (History)
3 users (show)

See Also:
i915 platform: KBL, SKL
i915 features: GEM/Other


Attachments

Description Lionel Landwerlin 2018-08-26 11:10:12 UTC
As part of debugging hangs we're starting to look more at the error states generated by i915.
Together with Jason we've noticed that the data produced is incorrect.
Chunks of 32 bytes appear to just go missing (replaced by 0s).

For example this bug report : https://bugs.freedesktop.org/show_bug.cgi?id=107586
has 2 error state from which you can see that the context image has its first 32bytes at 0s.
What we should be finding at that location looks more like this :

https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/intel/tools/gen8_context.h#L27

MI_NOOP followed by MI_LRI.

We've noticed this issue on both Skylake & Kabylake (I believe this affects at least all big cores).

To workaround this issue I came up with this patch : https://github.com/djdeath/linux/commit/c18d4e1ee66cf587c484a60bba64f3dc4f35fc2e
This is probably wrong as pointed by Chris on IRC, but it gets us correct data in the error state.

This issue can be easily reproduced by running the IGT drv_hangman test on the render ring and checking the content of the "rcs0 --- HW context" BO in the error state. At offset 8092 we should find the MI_NOOP following by MI_LRI I pointed above, but instead of 32 bytes of 0s.
Comment 1 Chris Wilson 2018-08-26 12:26:04 UTC
Reproduced using

import { gem, gem_ioctls } from '../lib/i915/gem-ioctls.js';
import { drm } from '../lib/drm.js';
const ioctl = igt.os.ioctl;
const print = igt.toy.print;

const i915 = drm.cards()[0];

var fd;
fd = igt.os.open(drm.sysfs(i915) + '/' + 'error');
igt.os.write(fd, Uint32Array.of(0));
igt.os.close(fd);

const reloc_sz = gem_ioctls.s_execbuf_reloc.__pack_size__;
const reloc = new ArrayBuffer(2 * reloc_sz);

const batch = gem.create(i915, 4096);
let cs = new Uint32Array(gem.mmap(i915, batch, 4096, 1));
gem.set_domain(i915, batch, 0x80, 0x80);

cs[0] = 0x20 << 23 | 2;
ioctl.pack(new DataView(reloc, 0, reloc_sz), gem_ioctls.s_execbuf_reloc,
           new Map([
             [ "target_handle", batch ],
             [ "delta", 4096 - 4 ],
             [ "offset", 4 ],
           ]));
cs[1] = 4096 - 4;
cs[2] = 0;
cs[3] = 1;
cs[16] = 0x5 << 23;
cs[17] = 0x31 << 23 | 1 << 8 | 1;
ioctl.pack(new DataView(reloc, reloc_sz, reloc_sz), gem_ioctls.s_execbuf_reloc,
           new Map([
             [ "target_handle", batch ],
             [ "delta", 64 ],
             [ "offset", 4 * 17 ],
           ]));
cs[18] = 64;
cs[19] = 0;

const handle = gem.create(i915, 4096 * 65536);
let data = new Uint32Array(gem.mmap(i915, handle, 4096 * 65536));
for (let page = 0; page < 65536; page++) {
  for (let x = 0; x < 4096/4; x++) {
    data[page * 4096/4 + x] = page << 16 | x;
  }
}

const sz = gem_ioctls.s_execbuf_object.__pack_size__;
const objects = new ArrayBuffer(sz * 2);
ioctl.pack(new DataView(objects, 0, sz),
           gem_ioctls.s_execbuf_object,
           new Map([
             [ "handle", handle ],
             [ "flags", 1 << 7 ],
           ]));
ioctl.pack(new DataView(objects, sz, sz),
           gem_ioctls.s_execbuf_object,
           new Map([
             [ "handle", batch ],
             [ "relocation_count", 2 ],
             [ "relocs_ptr", reloc ],
           ]));
gem.execbuf(i915,
            ioctl.pack(undefined, gem_ioctls.s_execbuf,
                       new Map([
                         [ "buffers_ptr", objects],
                         [ "buffer_count", 2 ]
                       ])));

igt.assert(gem.busy(i915, batch));
while (!cs[1023] && gem.busy(i915, batch))
  ;

fd = igt.os.open(drm.debugfs(i915) + '/' + 'i915_wedged');
igt.os.write(fd, Uint32Array.of(0xffffffff));
igt.os.close(fd);

fd = igt.os.open(drm.sysfs(i915) + '/' + 'error');
const re = /--- user =/;
let line;
do {
  line = igt.os.readLine(fd)
} while (line && !re.test(line));
let error = igt.os.readLine(fd);
igt.os.close(fd);

let out = igt.ascii85.decode(error, error[0] == ':', 1);
for (var page = 0; page < 65536; page++) {
  let a = new Uint32Array(out, 4096*page, 4096/4);
  for (var i = 0; i < 4096/4; i++) {
    if (a[i] != page << 16 | i) {
      print(`Error detected on page:${page}, byte:${4*i}`);
      print(new Uint32Array(out, 4096*page + 4*(i&-32), 64));
      exit(1);
    }
  }
}

For science!
Comment 2 Jason Ekstrand 2018-08-26 12:27:57 UTC
As a further bit of information, sometimes data isn't zero, it's actually other bits of some buffer or other.  I've seen several dumps where bits of the first page of the batch end up showing up in the second page of the batch.
Comment 3 Chris Wilson 2018-08-26 13:48:15 UTC
The oddity is that the corruption came from the misplaced reloc (overwriting the BB_START itself). As a tight spin loop, no errors crop up in the capture. At least in this simple case of checking 65,536 pages, the corruption was an artifact of the hang.
Comment 4 Francesco Balestrieri 2019-06-03 07:36:24 UTC
Jason, Lionel, are you still seeing the problem? There have been changes in the capture system and we are wondering whether they helped with this issue.
Comment 5 Lionel Landwerlin 2019-06-03 07:56:36 UTC
(In reply to Francesco Balestrieri from comment #4)
> Jason, Lionel, are you still seeing the problem? There have been changes in
> the capture system and we are wondering whether they helped with this issue.

I'll have a look. In the meantime if you want to check, you can try with drv_hangman test and check the error state produced.
Comment 6 umesh 2019-11-25 19:59:53 UTC
#assessing

i915_hangman --r error-state-capture-rcs0 --debug 2>&1 | tee ~/tmp/107691.log

With latest drm-tip and igt, running above, I do not see an 'rcs0 --- HW context' section in the output/log. I do see data in 'rcs0 --- NULL context' though, looks ascii. Is there a way to interpret the data in this context? Is this a new change in the capture system to show just the default state?
Comment 7 Lionel Landwerlin 2019-11-25 20:31:47 UTC
(In reply to umesh from comment #6)
> #assessing
> 
> i915_hangman --r error-state-capture-rcs0 --debug 2>&1 | tee ~/tmp/107691.log
> 
> With latest drm-tip and igt, running above, I do not see an 'rcs0 --- HW
> context' section in the output/log. I do see data in 'rcs0 --- NULL context'
> though, looks ascii. Is there a way to interpret the data in this context?
> Is this a new change in the capture system to show just the default state?

I usually look at it with the mesa aubinator_error_decode tool.
Use the --all-bb option to decode the context image (other it just decodes the batches).
Comment 8 Martin Peres 2019-11-29 17:49:19 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/137.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.