Bug 75761

Summary: weston-launch no output - black screen
Product: xorg Reporter: soundx94
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: nerdopolis1, wayland-bugs
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments (description, flags):
The log of weston after quiting. (flags: none)
weston log with pixman (flags: none)

Description soundx94 2014-03-04 15:57:32 UTC
Created attachment 95103 [details]
The log of weston after quiting.

Hi.

I'm trying to start weston with weston-launch on Arch Linux with the latest packages from the official repositories.
I have an Nvidia card running Nouveau, with nouveau-fw from the Arch User Repository, the latest mesa packages, and KMS enabled.
weston-launch seems to start correctly, but the Weston interface doesn't show up; the output is only a black screen. I can quit Weston using Ctrl+Alt+Backspace, but nothing else works.

When I start plain weston from within an X session everything works fine, but weston-launch is failing somehow.

Weston log attached.
Comment 1 Pekka Paalanen 2014-03-04 18:01:23 UTC
Could you check if weston-simple-shm gets anything to screen?

You can try it first when you run weston under X to see what it is supposed to show, and then when using weston-launch. VT switching should work, but if it does not, you might need to use ssh for remote access, or use a delay to start weston-simple-shm in another VT.

If weston-simple-shm seems to run, but still nothing on screen, then we should start suspecting an issue with Nouveau.

Did you check the kernel log for Nouveau errors?

You are not trying to use weston-launch from an X terminal, right?
Comment 2 soundx94 2014-03-05 08:20:05 UTC
So I tried running weston-simple-shm under weston in an X session and it worked fine. (I'm guessing the multi-colored circle pattern is what is expected to show up.)
After quitting the X session, I ran weston-launch in the first VT and weston-simple-shm in a second VT; the first VT was still black, and there is nothing nasty in the log.
Of course I'm NOT trying to run weston-launch from X.
Comment 3 Pekka Paalanen 2014-03-05 09:02:56 UTC
Ok, it sounds like it would have something to do with Nouveau then. I vaguely recall a few people having such problems with Nouveau in the past. Maybe look into Nouveau's bugzilla, if there are any reports about Weston?
Comment 4 soundx94 2014-03-05 13:44:24 UTC
Okay, thanks for your help :)
If anything turns up or a solution is found, I'll post it here.
Comment 5 Lubosz Sarnecki 2014-04-04 00:45:04 UTC
I can confirm this bug. Also Arch Linux.

I tried downgrading mesa to 10.0.3, as described in this post, but it did not help: http://www.maui-project.org/news/2014/04/02/how-to-fix-mesa-on-arch-linux/

Under X I get the following nouveau error (not on a tty):

libEGL debug: driver does not expose __driDriverGetExtensions_nouveau(): /usr/lib/xorg/modules/dri/nouveau_dri.so: undefined symbol: __driDriverGetExtensions_nouveau
Comment 6 Zach 2014-04-04 11:17:54 UTC
Also can confirm, running Arch with nouveau, on linux 3.14.

I just installed linux-lts (currently at version 3.10.x) and it runs weston fine on a tty, so this is clearly a nouveau kernel issue from a fairly recent change.
Comment 7 Pekka Paalanen 2014-04-04 12:25:43 UTC
Alright, thanks Zach. Anything relevant in kernel logs? I assume linux-lts refers just to the kernel?

Would anyone be up for bisecting the kernel? You can probably limit the bisection to the drivers/gpu/drm/nouveau directory.

We could reassign this bug to Nouveau, but I think they would just ask for a bisection as the first thing, since this is apparently a regression.

Lubosz, the libEGL message is harmless.
Comment 8 Lubosz Sarnecki 2014-04-04 16:36:52 UTC
Kernel 3.10 also works for me. Tested on NVC8 and NV50. I will bisect the kernel.
Comment 9 Kristian Høgsberg 2014-04-07 23:08:17 UTC
Thanks for investigating. Closing as NOTOURBUG for now; please reopen if it turns out to be a Weston issue after all.
Comment 10 Pekka Paalanen 2014-04-09 09:29:27 UTC
Trying to move this to Nouveau, BZ is being difficult...
Comment 11 Pekka Paalanen 2014-04-09 09:30:14 UTC
Reopening as Nouveau bug...
Comment 12 Pekka Paalanen 2014-04-09 09:32:13 UTC
Setting status to NEW, since this is a new Nouveau bug report, moved from Weston.
Said to be a kernel regression after 3.10.
Comment 13 Lubosz Sarnecki 2014-04-10 22:04:39 UTC
My bisection process so far is pretty frustrating, 
since the majority of the commits between 3.12 and 3.13 do not boot. 
I get corrupted output before being able to login. 

The screen looks like so:

http://i.imgur.com/C517qpP.jpg

My current progress:

# v3.12 good
# 0fef9d8a59abcd699761cb054b6c37a2bea9e31a  skip / won't boot
# 48ae0b355f21533145133002854de89a0537408d  skip / won't boot
# c17f5bb529221c7f4c0736e69ceb614da0df2838  bad
# 8df1d0c07f18bd84ea7d8c7bc2cff45ba2b09680  skip / won't boot
# 16c4f227ffc556a4851518092e2b5979da1280c1  skip / won't boot
# 5fa7543041cbc2d3139e8d2178df61a33ac3f9ac  skip / won't boot
# 6d8d163132d7df6ca701efcde7832046ecb2f040  skip / won't boot
# 98706ea99f2da8afdac69686de4ff982aca6a5c7  skip / won't boot
# v3.13 bad
# v3.14 bad
Comment 14 Lubosz Sarnecki 2014-04-10 22:07:12 UTC
I am only bisecting the nouveau directory, like so:

git bisect start v3.14 v3.10 -- drivers/gpu/drm/nouveau
Comment 15 Ilia Mirkin 2014-04-10 22:10:53 UTC
(In reply to comment #13)
> My bisection process so far is pretty frustrating, 
> since the majority of the commits between 3.12 and 3.13 do not boot. 
> I get corrupted output before being able to login. 

What HW are you using?
Comment 16 Lubosz Sarnecki 2014-04-10 22:14:42 UTC
Chipset: GF110 (NVC8)
Comment 17 Ilia Mirkin 2014-04-10 22:16:48 UTC
(In reply to comment #16)
> Chipset: GF110 (NVC8)

There are known issues with NVC8 that go away with blob pgraph fw, perhaps worth trying with that. (See bug #54437.)
Comment 18 Lubosz Sarnecki 2014-04-10 23:39:03 UTC
I was able to boot all commits with the following boot option:
nouveau.config=NvMSI=0

Thanks to xexaxo on IRC for pointing out this option.


f074d733866628973eca0ddb0c534ef4561da9e0 is the first bad commit
commit f074d733866628973eca0ddb0c534ef4561da9e0
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Wed Nov 20 15:14:31 2013 +1000

    drm/nouveau/kms: send timestamp data for correct head in flip completion events
    
    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

:040000 040000 99b978c435e8bb3e7bd91e30daf953940dcfe99c dcff6c807e51e53d4de0064678c46bfd68e0a1ca M	drivers


Canonical broke Weston for me ;)
Comment 19 Lubosz Sarnecki 2014-04-10 23:52:21 UTC
Reverting the patch on current Linux master (4a4389abdd9822fdf3cc2ac6ed87eb811fd43acc) fixes the issue for me.

Apparently s->crtc needs to be hardcoded to -1 for the screen not to stay black when launching weston.
Comment 20 soundx94 2014-04-17 11:54:03 UTC
Thanks for bisecting the kernel. Reverting that patch on the latest kernel also fixes the issue for me.
(Using NVE6: GTX 660)
Comment 21 Pekka Paalanen 2014-04-17 12:33:26 UTC
(In reply to comment #20)
> Thanks for bisecting the kernel. Reverting that patch on the latest kernel
> also fixes the issue for me.
> (Using NVE6: GTX 660)

Hey, wait! Why did you close this bug?

Has upstream merged a fix, or did this problem never occur on upstream (hard to imagine)?

It's not enough that you personally find a hack that fixes it for you alone, we want it fixed for everybody.
Comment 22 soundx94 2014-04-17 12:37:21 UTC
I thought WORKSFORME was the proper status because it works for me at the moment with this modification, and it can get merged upstream. Sorry if I misunderstood the status label; do you want me to reopen it?
Comment 23 Pekka Paalanen 2014-04-17 12:42:39 UTC
(In reply to comment #22)
> I thought WORKSFORME was the proper status because it works for me at the
> moment with this modification, and it can get merged upstream. Sorry if
> I misunderstood the status label; do you want me to reopen it?

WORKSFORME means that maintainers or developers cannot reproduce the bug at all and cannot find anything wrong.

I reopened this for you.
Comment 24 Lubosz Sarnecki 2014-05-28 18:27:46 UTC
Since Weston and Wayland 1.5.0 were released today, I wanted to ask about the status of this bug.

Does the patch which causes the regression have enough legitimacy to stay and break Weston?

Does someone understand the problem on a large scale?
Comment 25 Ilia Mirkin 2014-05-28 18:57:37 UTC
(In reply to comment #24)
> Since Weston and Wayland 1.5.0 were released today, I wanted to ask about
> the status of this bug.
> 
> Does the patch which causes the regression have enough legitimacy to stay
> and break Weston?

It seems like a pretty correct fix, at first blush. And was necessary for something else, IIRC.

> 
> Does someone understand the problem on a large scale?

It'll take someone with KMS API and Wayland experience to work out what's going on. Probably would have to get a Wayland developer involved to figure this one out.
Comment 26 Pekka Paalanen 2014-05-30 09:18:36 UTC
What did the -1 as crtc mean?

I'm not sure how this is weston-specific, though. Weston programs a pageflip on a certain crtc, and then relies on that particular crtc to send back a pageflip event as requested.

So if that's the culprit, is Nouveau on some particular hardware sending back pageflip events on the wrong crtc? Or maybe mixing up crtcs so that no event gets sent?

This would be easy to test by adding a weston_log("got pageflip event\n") to src/compositor-drm.c:745, in function page_flip_handler(), and another at line 723 in vblank_handler(). The vblank_handler should be unused atm, I think.

If Weston never gets the pageflip event, its repaint loop would be stuck, which would explain the symptoms. A side effect would be that clients would never get frame callbacks, which means that e.g. weston-simple-shm would not be consuming any CPU. Normally it should take at least a few percent of one core. Otherwise everything would seem to just work, since Weston does not hang; it merely stops repainting.
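
A minimal sketch of the dependency described above, assuming a compositor that repaints an output only when no pageflip is outstanding. The struct and function names (output, output_repaint, output_flip_done, run_loop) are hypothetical stand-ins for illustration, not Weston's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a compositor output; not Weston's real struct. */
struct output {
    bool flip_pending;   /* a pageflip was queued and has not completed yet */
    unsigned repaints;   /* number of frames actually repainted */
};

/* Repaint only when no flip is outstanding, then queue the next flip. */
bool output_repaint(struct output *o)
{
    assert(o != NULL);
    if (o->flip_pending)
        return false;    /* still waiting for the kernel's pageflip event */
    o->repaints++;
    o->flip_pending = true;
    return true;
}

/* Called when the pageflip event for this output arrives. */
void output_flip_done(struct output *o)
{
    assert(o != NULL);
    o->flip_pending = false;
}

/* Run the loop for a number of turns; events arrive only if the driver
 * actually delivers them for the output that queued the flip. */
unsigned run_loop(bool driver_sends_events, int iterations)
{
    struct output o = { false, 0 };

    for (int i = 0; i < iterations; i++) {
        output_repaint(&o);
        if (driver_sends_events)
            output_flip_done(&o);
    }
    return o.repaints;
}
```

Under these assumptions, run_loop(true, 60) repaints on every turn, while run_loop(false, 60) repaints exactly once and then waits forever, which matches the "does not hang, but never repaints" symptom.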
Comment 27 nerdopolis1 2014-05-31 03:41:53 UTC
I pulled out my old Dell desktop with a Pentium 4 processor and an AGP Nvidia card (NV36), and it does the same thing. Mark45 on Phoronix said this issue affected all 3 of his Nvidia cards when he tried to use my Wayland Live CD.

I added the "got pageflip event" lines as suggested, and it seems that they appear multiple times in the log, and seem to keep going until I tty switch, but the screen is still black. Using a vanilla Ubuntu kernel, and passing nouveau.config=NvMSI=0 did not work for me, or for Mark45 on any of his Nvidia cards.

Weston-desktop-shell starts fine, and I also tried weston-simple-shm, which did not appear, but it did run.  

It can't be a mesa issue, as Weston is using pixman here in this case...

I wish I had tested this sooner, but I wasn't sure if my results on the older desktop would reflect on newer Nvidia hardware.
Comment 28 Pekka Paalanen 2014-05-31 08:23:03 UTC
I wonder if there are two different bugs in play here.

If you do get the vblank events right, as it sounds like nerdopolis does, then does reverting the blamed Nouveau kernel patch make any difference? I would be surprised if it did.

This other bug would be that when Weston changes the video mode, it does not call drmModeCrtcSetGamma. IIRC Michel Dänzer told me that not calling it may lead to black output if the color translation tables are all zero due to not being explicitly set. So the main question here is: does Weston change the video mode, or does it use the mode that is already set for fbcon?

The Weston code base already has a call to drmModeCrtcSetGamma(), but it is used only for color-managed outputs, I think. Yet, it is an example of how to call it, so if someone having this problem could test that, it would be very useful. If missing this call is indeed the problem, I think it should be a new Weston bug report, as the Nouveau patch blamed here should not make a difference AFAIU.
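
For anyone testing the gamma theory, here is a hedged sketch of preparing an identity (linear) lookup table for one color channel. The helper names fill_linear_ramp and ramp_is_all_zero are made up for this example. The libdrm call is shown only in a comment so that the snippet compiles without libdrm; the table size is assumed to come from the gamma_size field of drmModeCrtc:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill one channel of a gamma LUT with an identity ramp from 0 to 0xffff. */
void fill_linear_ramp(uint16_t *ramp, size_t size)
{
    assert(ramp != NULL && size >= 2);
    for (size_t i = 0; i < size; i++)
        ramp[i] = (uint16_t)((i * 0xffffu) / (size - 1));
}

/* An all-zero table maps every input to black. */
int ramp_is_all_zero(const uint16_t *ramp, size_t size)
{
    assert(ramp != NULL);
    for (size_t i = 0; i < size; i++)
        if (ramp[i] != 0)
            return 0;
    return 1;
}

/* With libdrm, the same ramp could be passed for all three channels:
 *
 *     drmModeCrtcSetGamma(fd, crtc_id, size, ramp, ramp, ramp);
 *
 * where fd, crtc_id and size are values the caller already has.
 */
```

Calling this right after the mode set in a test build would show whether an unset table is what keeps the screen black.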

One more note: if you test reverting the blamed Nouveau patch, make sure *both* of your kernels, with and without the patch, are built by you and launched the same way, so that the fbcon video mode is really the same on both. If the fbcon video mode is different, then that could mask the drmModeCrtcSetGamma issue.
Comment 29 Pekka Paalanen 2014-05-31 08:32:23 UTC
Hmm, a third remote possibility came to mind. If the timestamp delivered with the DRM pageflip event does not advance, I suppose Weston's fade-in animation might not advance either and therefore it never fades in. However this is hard for me to believe, as I think it would have caused problems with Xorg, too... wouldn't it?

Or maybe Xorg is only looking at the vblank counters, and you'd actually need a program using GLX_OML_sync_control to even see the timestamps?

One could check the timestamps by:
weston_log("%s: time %u.%06u frame %u\n", __func__, sec, usec, frame);
in compositor-drm.c page_flip_handler() function.

Mm, I'd like to see a weston log with that when you have the black screen problem.
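
A self-contained sketch of that diagnostic, for illustration only. The handler takes the same parameter list as the page_flip_handler callback in libdrm's drmEventContext, but it writes to stderr instead of calling weston_log(), and format_flip_log is a helper invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format one log line using the format string suggested above.
 * Returns the value of snprintf(), i.e. the length of the full line. */
int format_flip_log(char *buf, size_t len, const char *func,
                    unsigned int sec, unsigned int usec, unsigned int frame)
{
    assert(buf != NULL && len > 0 && func != NULL);
    return snprintf(buf, len, "%s: time %u.%06u frame %u\n",
                    func, sec, usec, frame);
}

/* Same parameters as the libdrm pageflip callback; stderr stands in for
 * weston_log() so the sketch builds outside the Weston tree. */
void page_flip_handler(int fd, unsigned int frame,
                       unsigned int sec, unsigned int usec, void *data)
{
    char line[128];

    (void)fd;
    (void)data;
    format_flip_log(line, sizeof line, __func__, sec, usec, frame);
    fputs(line, stderr);
}
```

On a healthy driver, both the time and the frame number should increase from one event to the next.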
Comment 30 nerdopolis1 2014-05-31 11:49:32 UTC
The only interesting thing that I can find in the dmesg logs is this:


[    1.509268] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    1.509269] [drm] No driver support for vblank timestamp query.
Comment 31 nerdopolis1 2014-05-31 12:29:14 UTC
Created attachment 100199 [details]
weston log with pixman
Comment 32 Pekka Paalanen 2014-06-01 10:15:47 UTC
(In reply to comment #31)
> Created attachment 100199 [details]
> weston log with pixman

Something is seriously wrong here:
[09:58:29.402] queueing pageflip failed: Permission denied

But I see you have another weston session afterwards, which seems to run, but:
[12:26:36.371] page_flip_handler: time 0.000000 frame 603
[12:26:36.371] got pageflip event
[12:26:36.388] page_flip_handler: time 0.000000 frame 603
[12:26:36.388] got pageflip event
[12:26:36.404] page_flip_handler: time 0.000000 frame 603
[12:26:36.404] got pageflip event
[12:26:36.421] page_flip_handler: time 0.000000 frame 603
[12:26:36.421] got pageflip event
[12:26:36.437] page_flip_handler: time 0.000000 frame 603
[12:26:36.438] got pageflip event
[12:26:36.454] page_flip_handler: time 0.000000 frame 603

That means that the kernel DRM driver does not provide timestamps, and it does not even advance the vblank counter(?), if I understood right.

So at least for you, the problem is that the kernel DRM does not report proper pageflip timestamps. I would consider this a kernel driver bug, but if the kernel developers insist, I suppose we could also hack around it in Weston by providing fake timestamps, disabling the Presentation extension (somewhat similar to X11 Present, not yet merged) and yelling loudly. But I'd rather the driver got fixed.

As far as I see, bad timestamps are enough to make Weston stay black (fade-in animation not advancing), yet appear to work just fine. So at least we (Weston devs) should detect when the kernel driver is faulty, and abort Weston.

Weston relies, and the intention in Wayland is to rely, on proper pageflip timestamps from the kernel for timing... well, everything visual, so we really would like it to work.
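
To illustrate why constant timestamps can leave the screen black, here is a simplified sketch assuming a linear fade whose progress is computed purely from pageflip timestamps. Weston's real animation code is different; fade_alpha, flip_time_msec and the flip_clock_check detector are hypothetical names used only here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a pageflip timestamp to milliseconds. */
uint32_t flip_time_msec(unsigned int sec, unsigned int usec)
{
    return (uint32_t)sec * 1000u + usec / 1000u;
}

/* Fade-in progress in [0, 1], driven only by the event timestamp. */
double fade_alpha(uint32_t start_msec, uint32_t now_msec,
                  uint32_t duration_msec)
{
    uint32_t elapsed = now_msec - start_msec;

    assert(duration_msec > 0);
    if (elapsed >= duration_msec)
        return 1.0;
    return (double)elapsed / (double)duration_msec;
}

/* Hypothetical detector: counts consecutive events whose timestamp
 * did not change since the previous event. */
struct flip_clock_check {
    uint32_t last_msec;
    unsigned stuck;
    bool have_last;
};

/* Returns true once `limit` consecutive events repeated the timestamp. */
bool flip_clock_is_stuck(struct flip_clock_check *c, uint32_t msec,
                         unsigned limit)
{
    assert(c != NULL);
    if (c->have_last && msec == c->last_msec)
        c->stuck++;
    else
        c->stuck = 0;
    c->last_msec = msec;
    c->have_last = true;
    return c->stuck >= limit;
}
```

With every event reporting time 0.000000, as in the log above, fade_alpha() stays at 0.0 no matter how many events arrive, and a check like flip_clock_is_stuck() is one possible way a compositor could notice the faulty driver and abort.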
Comment 33 Pekka Paalanen 2014-06-01 10:29:04 UTC
I filed bug #79502 for adding the detection to Weston.
Comment 34 nerdopolis1 2014-06-01 12:59:00 UTC
Just an FYI, I think the

[09:58:29.402] queueing pageflip failed: Permission denied

came from when I killed the previous Weston session, to have the waylandloginmanager reload Weston after I recompiled it with the diag lines on my live session. I was in another TTY.

That is the only time I have ever seen that happen so far.
Comment 35 Ilia Mirkin 2014-06-10 01:28:27 UTC
A few pageflip-related commits landed in the repo:

http://cgit.freedesktop.org/~darktama/nouveau/

http://cgit.freedesktop.org/~darktama/nouveau/commit/?id=ff527544b0048d41a56097a66ff954fb9b0a2c75

Seems related to what was going on here. Give it a shot... (you'll need to build it against the latest drm-next tree from http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-next). Or try to apply the patches to whatever kernel tree you're on -- should probably work.
Comment 36 nerdopolis1 2014-06-14 22:24:59 UTC
I'm trying to compile the module... it seems that it's gone from the log now???

How do I disable werror too?
Comment 37 nerdopolis1 2014-06-14 23:45:08 UTC
I found out how to turn off werror, but I guess I need to know what patches I need to apply... Is that the only commit that I need to try the fix?
Comment 38 Ilia Mirkin 2014-06-15 00:03:11 UTC
All the patches are now in drm-next, just use that.
Comment 39 nerdopolis1 2014-06-15 00:16:56 UTC
I would at least like to try applying them against the 3.13 nouveau module first, instead of rebuilding the whole kernel. So is that the only patch I would need?

I'm still on 3.13 (which is also affected), as I have been doing all of my testing on the Weston side of things...
Comment 40 Ilia Mirkin 2014-06-15 18:48:58 UTC
(In reply to comment #39)
> I would at least like to try applying them against the 3.13 nouveau module
> first, instead of rebuilding the whole kernel. So is that the only patch I
> would need?
> 
> I'm still on 3.13 (which is also affected), as I have been doing all of my
> testing on the Weston side of things...

Pretty sure that's the only patch. There were 2 other patches, but they only matter for pre-nv50 hardware (GeForce 7 and earlier), which I assume you're not using.
Comment 41 nerdopolis1 2014-06-17 00:03:53 UTC
I found the second patch too, and I applied that as well. It works now, even though I applied them to 3.13 and built just the Nouveau module.

I have an older Nvidia card, I think it's NV36.
Comment 42 Ilia Mirkin 2014-08-21 22:17:32 UTC
All those patches should have made it to 3.16 (and were cc'd to earlier stable kernels). Does 3.16 work OK for people?
Comment 43 Martin Peres 2014-08-21 22:24:58 UTC
Works for me on 3.16 and 3.17. Finally :)
