Bug 104778

Summary:	intermittent unit test crashes since mesa 17.2
Product:	Mesa	Reporter:	Mark Janes <mark.a.janes>
Component:	Drivers/DRI/i965	Assignee:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Status:	RESOLVED MOVED	QA Contact:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity:	normal
Priority:	high	CC:	clayton.a.craft, greatquux, jmcasanova, julien.isorce, martin.peres, oss.linuxpf, samuel, sergii.romantsov
Version:	git
Hardware:	Other
OS:	All
Attachments:	Debug trace.

Description Mark Janes 2018-01-24 22:47:15 UTC

Mesa CI reports a low error rate (2/700k), but the number of intermittent failures is consistently nonzero.  This is worse than our historical results.
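
To put the 2/700k figure in perspective, the sketch below estimates how many test executions would be needed to see at least one failure with a given probability. It assumes every execution fails independently at the same constant rate, which nothing in this report establishes; `runs_needed` is a hypothetical helper and not part of Mesa CI.

```python
import math


def runs_needed(failure_rate, confidence=0.95):
    """Executions needed to observe >= 1 failure with probability `confidence`.

    Assumption (illustrative only): executions are independent and share
    one constant failure rate.
    """
    if not (0.0 < failure_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    # P(no failure in n runs) = (1 - p)^n, and we want it <= 1 - confidence.
    n = math.log1p(-confidence) / math.log1p(-failure_rate)
    return math.ceil(n)


# Rate quoted in the description: 2 failures per 700k executions.
print(runs_needed(2 / 700_000))
```

Under these assumptions the answer is on the order of a million executions for a 95% chance of a single hit, which is consistent with how hard the regression has been to pinpoint.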

The rarity of the failures makes it difficult to pinpoint the regression, but several errors repeat:

i965: Failed to submit batchbuffer: Bad address
    piglit.spec.!opengl 1_1.copypixels-draw-sync  ivb
    piglit.spec.!opengl 1_3.gl-1_3-texture-env   snb


intel_batchbuffer.c:937: submit_batch: Assertion `entry->handle == batch->batch.bo->gem_handle' failed.
    piglit.spec.!opengl 1_3.gl-1_3-texture-env.bdwm64
    piglit.shaders.glsl-fs-raytrace-bug27060 skl

HSW tessellation failures

deqp-gles31': corrupted double-linked list: 0x0000561e9518be50 ***
    dEQP-GLES31.functional.debug.error_filters.case_0.bdwm64

Comment 1 Kenneth Graunke 2018-01-31 14:59:20 UTC

Do we know what kernel version is running on the machines with failures?  We do slightly different things on v4.13 and later.  Wondering if it's only happening on machines with older kernels, or newer ones, or both.

Comment 2 Mark Janes 2018-01-31 16:42:44 UTC

Unfortunately I saw this recently on 4.14 and 4.11

http://otc-mesa-ci.jf.intel.com/job/Leeroy/1934476/ - ivbgt2-01 4.14
http://otc-mesa-ci.jf.intel.com/job/Leeroy/1934454/ - sklgt2-04 4.11

Comment 3 Lionel Landwerlin 2018-03-23 10:51:40 UTC

Running piglit.shaders.glsl-fs-raytrace-bug27060, I found this valgrind warning : https://patchwork.freedesktop.org/patch/212413/

Comment 4 Lionel Landwerlin 2018-03-23 13:42:29 UTC

The Broadwell failure is interesting as it's clearly a memory corruption issue.
Running the dEQP-GLES31.functional.debug.* tests under valgrind, I can see a few errors from the CTS suite :

Test case 'dEQP-GLES31.functional.debug.negative_coverage.callbacks.state.get_nuniformfv'..

==12081== Use of uninitialised value of size 8
==12081==    at 0x59B505E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==12081==    by 0x59B55A8: std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, long) const (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==12081==    by 0x59C1178: std::ostream& std::ostream::_M_insert<long>(long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==12081==    by 0x71CAF1: std::ostream& tcu::Format::operator<< <int const*>(std::ostream&, tcu::Format::Array<int const*> const&) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xD9F922: std::ostream& tcu::Format::operator<< <int>(std::ostream&, tcu::Format::ArrayPointer<int> const&) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xEE0160: tcu::MessageBuilder& tcu::MessageBuilder::operator<< <tcu::Format::ArrayPointer<int> >(tcu::Format::ArrayPointer<int> const&) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xE9E45F: glu::CallLogWrapper::glGetIntegerv(unsigned int, int*) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xA611D2: deqp::gles31::Functional::NegativeTestShared::get_nuniformfv(deqp::gles31::Functional::NegativeTestShared::NegativeTestContext&) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0x7230DF: deqp::gles31::Functional::(anonymous namespace)::TestFunctionWrapper::call(deqp::gles31::Functional::(anonymous namespace)::DebugMessageTestContext&) const (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0x725DA1: deqp::gles31::Functional::(anonymous namespace)::CallbackErrorCase::iterate() (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0x6DCAD3: deqp::gles31::TestCaseWrapper::iterate(tcu::TestCase*) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xF9E157: tcu::TestSessionExecutor::iterateTestCase(tcu::TestCase*) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)


I'm not sure whether that's related, might be worth fixing though (trying to write some patches).

Comment 5 Samuel Sieb 2018-04-15 02:12:18 UTC

I just had my Gnome desktop crash and the only info in the log was:
i965: Failed to submit batchbuffer: Bad address

This is on Fedora 27, kernel 4.15.9, mesa 17.3.6.

Comment 6 Mark Janes 2018-04-15 17:23:08 UTC

It's clear to me that this bug is not simply "CI ghosts".  We have a bug in Mesa which is hard to trigger, and we hit it very occasionally with the exhaustive CI infrastructure.

What we need is ideas on how to narrow down the failure.  Perhaps one of the branches that performs additional memory verification could help?  I got nothing out of valgrind.

I'm eager to get suggestions on what to do next.

Comment 7 Mike Russo 2018-05-03 18:33:38 UTC

I've also encountered some desktop crashes lately with

May 03 14:15:15 ossy /usr/lib/gdm3/gdm-x-session[5995]: i965: Failed to submit batchbuffer: Cannot allocate memory

but it's intermittent and yeah this sounds like a tough problem to solve. 
Ubuntu 18.04; GNOME 3.28.1; Kernel 4.15.0-20-lowlatency; Intel HD Graphics 630 with modesetting on Xorg 1.19.6  (but might try the old intel driver)

Comment 8 Mark Janes 2018-05-23 05:36:21 UTC

*** Bug 106621 has been marked as a duplicate of this bug. ***

Comment 9 Mark Janes 2018-10-29 15:42:48 UTC

One of the tests that seems to reproduce this more often than others:

dEQP-GLES31.functional.debug.negative_coverage.get_error.vertex_array.draw_arrays_instanced_incomplete_primitive

produces on stderr:
  corrupted size vs. prev_size
or
  corrupted double-linked list

Seen on bxt, bdw, bsw, ivb
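
One way to measure how often a single test hits these messages is to run it in a loop and count the signatures on stderr. This is a minimal sketch, not the Mesa CI harness: `count_failures` is a hypothetical helper, and the command line used to launch the test is left to the caller because the thread does not give one. The signature strings are the ones quoted in this report.

```python
import subprocess
from collections import Counter

# Messages quoted in this bug report.
SIGNATURES = (
    "corrupted size vs. prev_size",
    "corrupted double-linked list",
    "Failed to submit batchbuffer",
)


def count_failures(cmd, runs, signatures=SIGNATURES):
    """Run `cmd` (an argv list) `runs` times; count stderr hits per signature.

    Hypothetical helper: `cmd` is whatever launches the test on your system.
    """
    hits = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
            errors="replace",
        )
        for sig in signatures:
            if sig in proc.stderr:
                hits[sig] += 1
    return hits
```

Calling `count_failures(argv, 1000)` with the argv of the test binary returns a `Counter` keyed by message, so different builds or kernels can be compared by hit rate rather than by a single pass/fail.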

Comment 10 Martin Peres 2018-11-01 15:12:22 UTC

We hit this bug twice in a week, and then nothing since then (5 months and 1 week). I wonder if newer kernels fixed this issue. What is the most up to date kernel that has shown this issue?

Comment 11 Mark Janes 2018-11-01 15:43:52 UTC

4.18.  If you have a suggestion for what to run, I'll update.

Comment 12 Martin Peres 2018-11-01 16:19:17 UTC

(In reply to Mark Janes from comment #11)
> 4.18.  If you have a suggestion for what to run, I'll update.

Our CI last saw it on Linux: 4.17.0-rc6. So I guess we are just lucky...

Comment 13 Lakshmi 2019-02-14 09:37:14 UTC

This issue was last seen on our CI system 8 months, 3 weeks / 4968 runs ago.
Can we close this issue?

Comment 14 Mark Janes 2019-02-14 21:55:16 UTC

The problematic tests have been disabled in Mesa CI since June 2018.  If you think this is fixed, then I can re-enable them.

Mesa CI updated its kernels to 4.19 recently, but otherwise there has been no change to affect this bug.

Comment 15 Mark Janes 2019-02-16 00:53:52 UTC

Mesa CI reproduces these test failures immediately:

https://mesa-ci.01.org/mesa_master/builds/15252/group/63a9f0ea7bb98050796b649e85481845

Builds have fairly recent kernels:

Linux otc-gfxtest-sklgt2-01 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux

Comment 16 Martin Peres 2019-03-08 11:53:07 UTC

(In reply to Mark Janes from comment #15)
> Mesa CI reproduces these test failures immediately:
> 
> https://mesa-ci.01.org/mesa_master/builds/15252/group/
> 63a9f0ea7bb98050796b649e85481845
> 
> Builds have fairly recent kernels:
> 
> Linux otc-gfxtest-sklgt2-01 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1
> (2018-12-22) x86_64 GNU/Linux

Thanks for the info!

I'll treat this as a mesa bug and since we are using your blacklist, we should be safe to just ignore it from our side. I'll close our kernel issue.

Thanks to everyone involved!

Comment 17 CI Bug Log 2019-03-08 11:53:16 UTC

The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

Comment 18 Yoshinori Gento 2019-03-12 05:21:30 UTC

I saw this once.

[Environment]
CPU: SkyLake(core i5 6500TE)
Distribution: debian(customised)
Kernel: 4.14.98
Mesa: 18.3.3
libdrm: 2.4.89

The message on stdout from the drawing module was:
----
i965: Failed to submit batchbuffer: Bad address
----

and the backtrace was as follows:
----
:
:
#5  0x00007f4496240b35 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f44864d1a5d in submit_batch (out_fence_fd=0x0, in_fence_fd=<optimized out>, brw=0x47ee030) at intel_batchbuffer.c:838
#7  _intel_batchbuffer_flush_fence (line=<optimized out>, file=<optimized out>, out_fence_fd=0x0, in_fence_fd=<optimized out>, brw=0x47ee030) at intel_batchbuffer.c:891
#8  _intel_batchbuffer_flush_fence (brw=0x47ee030, in_fence_fd=<optimized out>, out_fence_fd=0x0, file=<optimized out>, line=<optimized out>) at intel_batchbuffer.c:852
#9  0x00007f44864a558a in brw_draw_single_prim (stream=<optimized out>, xfb_obj=0x0, prim_id=0, prim=0x7ffff9aa77d0, ctx=0x47ee030, indirect=<optimized out>) at brw_draw.c:898
#10 brw_draw_prims (ctx=0x47ee030, prims=<optimized out>, nr_prims=1, ib=<optimized out>, index_bounds_valid=<optimized out>, min_index=0, max_index=3, gl_xfb_obj=0x0, stream=0, indirect=0x0) at brw_draw.c:1107
#11 0x00007f448608063c in _mesa_draw_arrays (drawID=0, baseInstance=0, numInstances=1, count=4, start=0, mode=6, ctx=0x47ee030) at main/draw.c:408
#12 _mesa_draw_arrays (ctx=0x47ee030, mode=6, start=0, count=4, numInstances=1, baseInstance=0, drawID=0) at main/draw.c:385
#13 0x00007f4486081344 in _mesa_exec_DrawArrays (mode=6, start=0, count=4) at main/draw.c:565
:
:
----

Comment 19 Yoshinori Gento 2019-03-12 05:27:16 UTC

(In reply to Yoshinori Gento from comment #18)
> I saw this once.
This occurred in our product.

Comment 20 Mark Janes 2019-03-12 15:18:29 UTC

Yoshinori: Mesa i965 team is seeking a way to reproduce this bug, so we can analyze and fix it.

How often does this occur in your product?  If it is reproducible, then perhaps we can use an apitrace to investigate the root cause.

Comment 21 Yoshinori Gento 2019-03-13 03:17:16 UTC

(In reply to Mark Janes from comment #20)
> Yoshinori: Mesa i965 team is seeking a way to reproduce this bug, so we can
> analyze and fix it.
> 
> How often does this occur in your product?  If it is reproducible, then
> perhaps we can use an apitrace to investigate the root cause.

Over about one month of operation on 4 machines,
I saw this problem only once.

So I don't know how to reproduce it.
When it happened, I was running the 'cp' command in an xterm to copy some files. (I don't think that matters.)

I will keep the machines running to find out how often it occurs.

Comment 22 Mark Janes 2019-03-13 15:38:22 UTC

Hmm...`cp` in xterm is a pretty clear indicator that this issue is random and not triggered by a specific workload.

Lionel suggested that it would be good to have feedback from the kernel about what didn't pass validation.

There is a kernel option to generate debug traces for that but you have to recompile your kernel with that option.  Lionel, can you provide some details?

It would be a good data point to see if a much older kernel produces this error (eg 4.9, 4.4).  I can't deploy those kernels in Mesa i965 CI because they lack features needed to run our Vulkan test suites.

Comment 23 Lionel Landwerlin 2019-03-14 11:36:34 UTC

(In reply to Mark Janes from comment #22)
> Hmm...`cp` in xterm is a pretty clear indicator that this issue is random
> and not triggered by a specific workload.
> 
> Lionel suggested that it would be good to have a feedback from the kernel
> about what didn't pass validation.
> 
> There is a kernel option to generate debug traces for that but you have to
> recompile your kernel with that option.  Lionel, can you provide some
> details?
> 
> It would be a good data point to see if a much older kernel produces this
> error (eg 4.9, 4.4).  I can't deploy those kernels in Mesa i965 CI because
> they lack features needed to run our Vulkan test suites.


With the kernel compiled with CONFIG_DRM_I915_DEBUG_GEM and the following command issued as root : 

echo 15 > /sys/module/drm/parameters/debug 

You should be able to get some traces about why the execbuffer failed.

Unfortunately that generates a lot of traces...
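
Because the traces are so verbose, it can help to raise the debug level only while the failing workload runs and restore it afterwards. The sketch below does this with a context manager. It writes the same value (15) to the same sysfs path as the command above; `drm_debug` is a hypothetical helper, it must run as root, and it assumes the parameter can be read back and rewritten as plain text.

```python
from contextlib import contextmanager
from pathlib import Path

# Path taken from the command in this comment.
DRM_DEBUG_PARAM = Path("/sys/module/drm/parameters/debug")


@contextmanager
def drm_debug(level=15, param=DRM_DEBUG_PARAM):
    """Temporarily set the drm debug level, restoring the old value on exit.

    Hypothetical helper; assumes the parameter file reads and writes as text.
    """
    param = Path(param)
    previous = param.read_text().strip()
    param.write_text(f"{level}\n")
    try:
        yield previous
    finally:
        # Restore the original level even if the workload raises.
        param.write_text(f"{previous}\n")


# Usage (as root):
#     with drm_debug(15):
#         ...run the failing workload, then collect the kernel log...
```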

Comment 24 Mike Russo 2019-03-14 12:19:18 UTC

I haven't encountered this issue at all since moving away from modesetting and back to the intel DDX driver.  So whatever extra exercises GLAMOR was doing may be triggering the bug.  I'm sure that doesn't help actually fix it but it might at least help people experiencing it to have a more stable desktop.

Comment 25 Yoshinori Gento 2019-03-15 11:01:08 UTC

I have seen this problem three times since yesterday.
All of them occurred during file sync over LAN with rsync.
I think this problem might be related to load from disk I/O or network I/O.
Unfortunately, I have not yet recompiled the kernel with CONFIG_DRM_I915_DEBUG_GEM.
I will try it next week.

Comment 26 Yoshinori Gento 2019-03-25 06:41:32 UTC

Created attachment 143768 [details]
Debug trace.

I got debug traces. Please see the attached file.
The PID to check is 2602.
Afterwards, this process exited with "i965: Failed to submit batchbuffer: Bad address".
At the time, I was repeatedly copying and deleting files with rsync.

Note: This occurred on a Core i3-6100E. The software versions are the same as above.

Comment 27 Yoshinori Gento 2019-05-07 03:24:19 UTC

I found a way to reproduce this.

Cached memory grows large from reading many files, until free RAM is nearly exhausted.
In this situation (the cache is released and reallocated frequently), the drawing process hits this problem.

Could memory contention cause this problem?
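
The condition described here, a page cache grown by reading many files, can be approximated with a small script that reads every file under a directory tree once. This is only a sketch of the memory-pressure side and not a confirmed reproducer: `read_tree` is a hypothetical helper, and it assumes that buffered reads of a tree larger than free RAM are enough to make the kernel recycle cache pages while a GL workload runs alongside.

```python
import os


def read_tree(root, chunk_size=1 << 20):
    """Read every regular file under `root` once; return total bytes read.

    Hypothetical helper: the reads pull file data into the page cache,
    which is the condition described in this comment.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, device nodes, etc.
            try:
                with open(path, "rb") as f:
                    while chunk := f.read(chunk_size):
                        total += len(chunk)
            except OSError:
                continue  # unreadable file: skip it
    return total
```

Calling `read_tree` in a loop on a directory tree larger than the machine's RAM, while the drawing process is running, would mimic the rsync-heavy load reported earlier in this thread.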

Comment 28 Denis 2019-08-08 15:26:22 UTC

Hello Yoshinori Gento

>I got how to reproduce.
Does this mean that you could provide an apitrace or some kind of reproducer? It would be really helpful.

Comment 29 Yoshinori Gento 2019-08-29 11:55:03 UTC

(In reply to Denis from comment #28)
> Hello Yoshinori Gento
> 
> >I got how to reproduce.
> Does this mean that you could provide an apitrace or somekind of reproducer?
> It would be really helpful.

Hello Denis

I have not produced an apitrace or a reproducer.
I updated the kernel to 4.19.57.
Since then the problem occurs less often, but it still occurs.

Comment 30 GitLab Migration User 2019-09-25 19:07:58 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1680.
