Bug 94727

Summary:	[NV30/NV40] nouveau/pushbuf.c:238: pushbuf_krel: Assertion `bkref` failed.
Product:	Mesa	Reporter:	Severin Pappadeux <pappadeux>
Component:	Drivers/DRI/nouveau	Assignee:	Nouveau Project <nouveau>
Status:	RESOLVED MOVED	QA Contact:	Nouveau Project <nouveau>
Severity:	normal
Priority:	medium	CC:	fourdan, jeffbai
Version:	unspecified
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
Attachments:	Backtrace of Xwayland on a G71

Description Severin Pappadeux 2016-03-28 03:01:19 UTC

Linux, Lubuntu 16.04 beta2, x64 on Intel CPU, NVIDIA Quadro 1500M

Multithreaded OpenGL application (C++, Geant4 toolkit, 3D visualisation)

The application fails to start with the assertion message from the summary. It works in VMware with the same Linux version, running on top of Windows 10.

Comment 1 Ilia Mirkin 2016-03-28 03:07:23 UTC

nouveau doesn't handle multithreaded GL usage (even if it's done by having separate contexts per thread). If the application has a mode which adds an interlock to GL calls, please try enabling it. (The issue isn't with multiple contexts, but with concurrent use of multiple contexts.)
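
A minimal sketch of such an interlock, assuming the application can route all of its GL work through one helper. The names `gl_lock`, `with_gl_lock` and `draw_frame` are hypothetical and come from neither Geant4 nor Mesa; `draw_frame` only stands in for real GL calls.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical process-wide lock shared by every thread that issues GL calls,
 * regardless of which GL context the thread uses. */
static pthread_mutex_t gl_lock = PTHREAD_MUTEX_INITIALIZER;

typedef void (*gl_work_fn)(void *arg);

/* Run one batch of GL work while holding the lock, so that no two threads
 * are inside the driver at the same time. */
static void with_gl_lock(gl_work_fn work, void *arg)
{
    int rc = pthread_mutex_lock(&gl_lock);
    assert(rc == 0);
    (void)rc;

    work(arg);

    pthread_mutex_unlock(&gl_lock);
}

/* Placeholder for real rendering: a real callback would make its context
 * current and issue GL calls here. This one only counts invocations. */
static void draw_frame(void *arg)
{
    int *frames = arg;
    (*frames)++;
}
```

Each rendering thread would call `with_gl_lock(draw_frame, &state)` instead of issuing GL calls directly. This gives up parallel rendering in exchange for serialized access to the driver.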

Comment 2 Mathieu Malaterre 2016-05-25 07:04:48 UTC

Apparently it also affects some KDE applications: https://bugs.debian.org/822220#55

Comment 3 Ian Whyman 2016-09-27 06:17:52 UTC

This crashes any Qt app that uses qtwebengine (which "integrates chromium's fast moving web capabilities into Qt").

It affects me by making KMail unusable.

Comment 4 Mingcong Bai 2016-11-07 20:07:10 UTC

I was also able to reproduce this issue on my PowerMac G5 with a GeForce 6600LE, but the crash was pretty random. In most cases it happened while I was using Linthesia.

Comment 5 Olivier Fourdan 2016-12-02 10:30:30 UTC

This is affecting Xwayland as well, as reported downstream in Fedora 25:

    https://bugzilla.redhat.com/show_bug.cgi?id=1372878

wrt comment 1, Xwayland is not using multithreaded GL, even through glamor, is it?

Comment 6 Olivier Fourdan 2016-12-02 10:44:58 UTC

A full backtrace of Xwayland that led to this abort() is found in the downstream bug here:

   https://bugzilla.redhat.com/attachment.cgi?id=1197379

Comment 7 Ilia Mirkin 2016-12-06 19:22:39 UTC

(In reply to Olivier Fourdan from comment #5)
> This is affecting Xwayland as well, as reported downstream in Fedora 25:
> 
>     https://bugzilla.redhat.com/show_bug.cgi?id=1372878
> 
> wrt comment 1, Xwayland is not using multithreaded GL, even through glamor,
> is it?

Oh dear. That's using glamor on a nv30 (or nv40) board. The nv30 driver is ... erm ... imperfect.

And looking back at the first comments, the Quadro 1500M appears to be a G71, and the 6600LE is obviously some nv4x as well.

In which case these issues are just as likely to not be caused by the multithreading stuff at all.

But you're in (limited) luck - I have a NV34 plugged in. If you can send me an apitrace I can replay to cause this to happen, I can investigate [and the NV34 doesn't support as much stuff as a NV4x, but I can fake it]. I'll also stare at the nv30_resource_copy_region code to see what it's missing.

Comment 8 Ilia Mirkin 2016-12-06 19:42:15 UTC

OK, well staring at the code didn't do me much good.

bkref == NULL means that the pushbuf thing doesn't know about the bo we're trying to reloc. However just a few lines above in nv30_transfer_rect_m2mf we do

nouveau_pushbuf_refn (push, refs, 2)

which should tell it about the src/dst bo's. So either there's concurrency going on, or something's not working as advertised. In the no-concurrency case, I'd need repro steps (that can't start with "install tons of software").
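
To make the invariant concrete, here is a toy model of it, not libdrm's actual code. All `toy_*` names are hypothetical. `toy_refn` plays the role of nouveau_pushbuf_refn, and a NULL result from `toy_kref` plays the role of bkref == NULL. `toy_reset` is an assumption standing in for anything (for instance a concurrent user of the same state) that drops the registrations between the two calls.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REFS 8

/* Stand-in for a buffer object. */
struct toy_bo { int handle; };

/* Stand-in for the push buffer's list of known buffer objects. */
struct toy_push {
    const struct toy_bo *refs[TOY_MAX_REFS];
    int nr_refs;
};

/* Register nr buffer objects with the push buffer. */
static int toy_refn(struct toy_push *push, const struct toy_bo **bos, int nr)
{
    for (int i = 0; i < nr; i++) {
        if (push->nr_refs == TOY_MAX_REFS)
            return -1; /* table full */
        push->refs[push->nr_refs++] = bos[i];
    }
    return 0;
}

/* Look up a buffer object; NULL means the push buffer does not know it. */
static const struct toy_bo *toy_kref(const struct toy_push *push,
                                     const struct toy_bo *bo)
{
    for (int i = 0; i < push->nr_refs; i++)
        if (push->refs[i] == bo)
            return bo;
    return NULL;
}

/* Hypothetical: forget every registration. */
static void toy_reset(struct toy_push *push)
{
    push->nr_refs = 0;
}

/* Emit a relocation; aborts if the buffer object was never registered
 * or its registration has been dropped. */
static void toy_reloc(const struct toy_push *push, const struct toy_bo *bo)
{
    const struct toy_bo *bkref = toy_kref(push, bo);
    assert(bkref);
}
```

In this model, `toy_reloc` succeeds right after `toy_refn`, and fails only for a buffer object that was never registered or after something like `toy_reset` has run in between.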

Comment 9 Olivier Fourdan 2016-12-07 15:26:35 UTC

Created attachment 128368 [details]
Backtrace of Xwayland on a G71

Thanks for pointing out that it was on a G71, I have been able to reproduce using my own old G71 (Dell m1710) with Fedora 25!

Didn't have much luck trying to run Xwayland through apitrace, but I have a corefile of Xwayland, backtrace attached, if that's of any help...

Comment 10 Olivier Fourdan 2016-12-08 10:37:39 UTC

(In reply to Ilia Mirkin from comment #7)
> [...] If you can send me an apitrace I can replay to cause this to happen,
> I can investigate [and the NV34 doesn't support as much stuff as a NV4x,
> but I can fake it]. [...]

Ah! I managed to reproduce with apitrace \o/

   Xwayland: pushbuf.c:238: pushbuf_krel: Assertion `bkref' failed.
   apitrace: warning: caught signal 6
   apitrace: flushing trace due to an exception
   /usr/bin/../lib64/apitrace/wrappers/glxtrace.so+0x224c8c
   [...]
   apitrace: info: taking default action for signal 6

The capture is here:

https://people.freedesktop.org/~ofourdan/xwayland-apitrace.bz2

Please let me know if that helps!

Comment 11 Tomasz Paweł Gajc 2016-12-10 14:47:32 UTC

Looks like this is very similar to #98039

Comment 12 Olivier Fourdan 2016-12-12 09:48:27 UTC

(In reply to Tomasz Paweł Gajc from comment #11)
> Looks like this is very similar to #98039

Possibly. The assertion hit is the same, but that doesn't mean it's the same problem/root cause.

Comment 13 Ilia Mirkin 2017-01-11 03:24:16 UTC

Olivier, can you see if this patch with mesa helps?

https://patchwork.freedesktop.org/patch/132414/

It helped in the repro of bug #99354.

Comment 14 Olivier Fourdan 2017-01-11 09:37:08 UTC

(In reply to Ilia Mirkin from comment #13)
> Olivier, can you see if this patch with mesa helps?
> 
> https://patchwork.freedesktop.org/patch/132414/
> 
> It helped in the repro of bug #99354.

Sure! Thanks Ilia!

The patch did not apply cleanly on top of either 13.0.3 or master, though (in src/gallium/drivers/nouveau/nvc0/nvc0_query_hw.c line 406 [1]), so I changed "nouveau_pushbuf_space(push, 16, 2, 0);" to "nouveau_pushbuf_space(push, 32, 3, 0);" to match the content of your patch. I hope this is right (sorry, I'm not familiar with this code).

But the issue is quite random and rather hard to reproduce at will, so I'll run with this for some time to see if the issue reoccurs.

I shall also prepare a test package for Fedora downstream [2] with the (modified) patch included so that the original reporter can give it a try as well.


[1] https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_query_hw.c#n406
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1372878

Comment 15 Ilia Mirkin 2017-01-11 15:49:47 UTC

(In reply to Olivier Fourdan from comment #14)
> The patch did not apply cleanly on top of 13.0.3 nor master though (in
> src/gallium/drivers/nouveau/nvc0/nvc0_query_hw.c line 406 [1]) so I changed
> "nouveau_pushbuf_space(push, 16, 2, 0);" to "nouveau_pushbuf_space(push, 32,
> 3, 0);" to match the content of your patch, I hope this is right (sorry, I'm
> not familiar with this code).

Ah yeah, that sounds right. I have some local changes trying to fix that code (unsuccessfully). But that in no way affects your situation - the issue there is a functional one, not a crashing one.

Comment 16 Olivier Fourdan 2017-02-28 15:58:38 UTC

Haven't heard back from the original reporter downstream, but FAF (Fedora Analysis Framework) reports thousands (literally) of similar bugs:

https://retrace.fedoraproject.org/faf/problems/2833836/

I see the patch from comment 13 has landed and is included in mesa-13.0.4, for which Fedora has an updated package. Interestingly, I don't see any mention of 13.0.4 in the FAF report, so it could be that this patch indeed fixes the issue with glamor/Xwayland on nv30.

Thanks again for your help, Ilia!

Comment 17 GitLab Migration User 2019-09-18 20:42:19 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1099.