Bug 92504 - [NVA5] Corruption in Plasma 5 on resume -- set_domain failures
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/nouveau
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Nouveau Project
Duplicates: 91125 91598 92010 92213 92299 92522
Reported: 2015-10-16 19:19 UTC by Alexander Schlarb
Modified: 2015-11-08 23:11 UTC (History)

Attachments
dmesg output (since boot) (256.31 KB, text/plain)
2015-10-16 19:19 UTC, Alexander Schlarb
Log of plasmashell running (only last suspend before initial corruption) (1.58 MB, text/plain)
2015-10-16 19:20 UTC, Alexander Schlarb
Kernel log when using the debugging patch provided in the last message (XZ compressed) (745.55 KB, application/x-xz)
2015-10-17 14:03 UTC, Alexander Schlarb
Kernel log when debugging only (!domain) cases (91.57 KB, text/plain)
2015-10-17 18:34 UTC, Alexander Schlarb
patch to print bo info (1.78 KB, patch)
2015-10-19 01:06 UTC, Ilia Mirkin
dmesg output (with patch 118967) (95.08 KB, text/plain)
2015-10-19 15:17 UTC, Alexander Schlarb
plasmashell + mesa output (with patch 118967) (481.68 KB, text/plain)
2015-10-19 15:17 UTC, Alexander Schlarb

Description Alexander Schlarb 2015-10-16 19:19:50 UTC
Created attachment 118931
dmesg output (since boot)

Steps to reproduce:
 * Use laptop with "NVIDIA Corporation GT216M [GeForce GT 330M]" or similar
 * Install KDE Plasma 5
 * Reboot
 * Log in to Plasma 5 desktop

 * Open the main menu and do *Quit* -> *Suspend*
 * Wait for the system to suspend
 * Resume the system and unlock the screen
 * Disconnect or connect external monitor (use *Menu* -> *Computer* -> *System settings* -> *Screen* to force screen detection if it doesn't happen automatically)

 * Open the main menu and do *Quit* -> *Suspend*
 * Wait for the system to suspend
 * Resume the system and unlock the screen
 * Open the main menu, network indicator or any other menu from the panel (`plasmashell`)


*What should happen?*

Nothing - it should "just work"

*What happens instead?*

Massive corruption the first time you open any menu

Also, after reproducing this error a few times without rebooting, the corruption becomes worse on every suspend until the system hangs. I'm not sure if this is related though…

*Where was this error produced?*

Tanglu Linux 4.0 ALPHA (Debian based) using:
 * Latest 4.2.0-trunk linux-image
 * MESA git snapshot or MESA 11.0.2
 * LibDRM 2.4.64
 * X11 1.17.2
 * xorg-video-nouveau 1.0.11
Comment 1 Alexander Schlarb 2015-10-16 19:20:57 UTC
Created attachment 118932
Log of plasmashell running (only last suspend before initial corruption)
Comment 2 Alexander Schlarb 2015-10-16 19:22:10 UTC
A plasmashell apitrace is available on request (100MB)
Comment 3 Ilia Mirkin 2015-10-16 19:24:30 UTC
Which mesa git snapshot? A number of resource lifetime issues were addressed in 11.0.3 which I was hoping might fix weirdo issues like this.
Comment 4 Alexander Schlarb 2015-10-16 19:27:01 UTC
My current MESA snapshot is 93267887a06e760b4b20618523df5e8aa4e70307
Comment 5 Ilia Mirkin 2015-10-16 19:29:50 UTC
OK, that has all the fixes I had in mind. Oh well.

Does replaying the apitrace reproduce the issues? If so, I'd be most interested in obtaining it.
Comment 6 Alexander Schlarb 2015-10-16 19:47:47 UTC
Actually corruption was way worse before the latest MESA 11.0.2 update, so thumbs up! :-)

Unfortunately, replaying the apitrace neither displays any corruption nor shows any trace of the pages of nouveau error messages that appear in the `plasmashell` console output after resume.

Another thing: An output very similar to the one dumped by `plasmashell` on resume is also dumped by `kwin_x11` (the compositor/window manager) although I haven't noticed any corruption there yet (that only happens during the second stage when the system becomes unstable).

What now?
Comment 7 Ilia Mirkin 2015-10-16 20:08:50 UTC
Actually the issue is most likely due to

[20351.450000] nouveau E[plasmashell[13570]] fail set_domain
[20351.450006] nouveau E[plasmashell[13570]] validating bo list
[20351.450015] nouveau E[plasmashell[13570]] validate: -22

Which is... odd. This happens when

        uint32_t domains = valid_domains & nvbo->valid_domains &
                (write_domains ? write_domains : read_domains);

        if (!domains)
                return -EINVAL;

Which means that unless nvbo->valid_domains had something funny in it, this is all good. Hmmm....

        nvbo->valid_domains = NOUVEAU_GEM_DOMAIN_VRAM |
                              NOUVEAU_GEM_DOMAIN_GART;
        if (drm->device.info.family >= NV_DEVICE_INFO_V0_TESLA)
                nvbo->valid_domains &= domain;

So if we ever supply valid/read/write domains that aren't the same as the domain we created the bo with, then this will fail. However, after some scouring of the code, I don't see where we'd be messing this up. And unfortunately nothing that libdrm prints will help with this. I'm also unsure of

If you can, could you patch drivers/gpu/drm/nouveau/nouveau_gem.c:nouveau_gem_set_domain to print all 4 inputs to domains, so we can see why it's coming out as 0?
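
The check discussed above can be modelled in user space. This is a minimal sketch, not the kernel code: the domain bit values are assumed to match the nouveau uAPI header (VRAM = 0x2, GART = 0x4), and `bo_valid_domains` and `effective_domains` are hypothetical helper names introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the NOUVEAU_GEM_DOMAIN_* bits. */
#define DOMAIN_VRAM (1u << 1)
#define DOMAIN_GART (1u << 2)

/* Hypothetical helper: which domains a bo accepts after creation.
 * On Tesla (NV50) and newer the mask is narrowed to the creation domain. */
uint32_t bo_valid_domains(uint32_t create_domain, bool tesla_or_newer)
{
    uint32_t valid = DOMAIN_VRAM | DOMAIN_GART;

    if (tesla_or_newer)
        valid &= create_domain;
    return valid;
}

/* Hypothetical helper: the intersection computed in set_domain.
 * A result of 0 corresponds to the -EINVAL ("validate: -22") path. */
uint32_t effective_domains(uint32_t bo_valid, uint32_t req_valid,
                           uint32_t read_domains, uint32_t write_domains)
{
    return req_valid & bo_valid &
           (write_domains ? write_domains : read_domains);
}
```

With these definitions, a bo created in VRAM on a Tesla card yields an empty intersection as soon as a submission names only GART, which is the failure the kernel reports.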
Comment 8 Alexander Schlarb 2015-10-16 20:17:58 UTC
I'll do that and get back to you once I'm done testing.
Comment 9 Alexander Schlarb 2015-10-17 14:02:20 UTC
I now have also noted that the first "fail set_domain" messages already appear during the first suspend (including the plasmashell nouveau dump and even some of the corruption). How could I have not noticed this? :-/

Anyway, here we go:

I patched Linux 4.3.0-rc5-next-20151016+ (cd685d8558) with these lines:

    diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
    index 2c99815..caff0e0 100644
    --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
    +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
    @@ -291,7 +291,10 @@ nouveau_gem_set_domain(struct drm_gem_object *gem, uint32_t read_domains,
            uint32_t domains = valid_domains & nvbo->valid_domains &
                    (write_domains ? write_domains : read_domains);
            uint32_t pref_flags = 0, valid_flags = 0;
    -
    +       printk("nouveau_gem_set_domain - drm_gem_object: 0x%08p\n", gem);
    +       printk("nouveau_gem_set_domain - read_domains:   0x%08x\n", read_domains);
    +       printk("nouveau_gem_set_domain - write_domains:  0x%08x\n", write_domains);
    +       printk("nouveau_gem_set_domain - valid_domains:  0x%08x\n", valid_domains);
            if (!domains)
                    return -EINVAL;

Then I booted the kernel and reproduced the issue.

Some relevant lines:

Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800c5b832e8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800c9b0cae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800c4990ae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: kscreenlocker_g[3146]: fail set_domain
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: kscreenlocker_g[3146]: validating bo list
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: kscreenlocker_g[3146]: validate: -22
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800bf085ae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000002
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000002
--
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8801a383eee8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800c5918ee8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800cab47ae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: plasmashell[2575]: fail set_domain
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: plasmashell[2575]: validating bo list
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: plasmashell[2575]: validate: -22
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800ca083ae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains:   0x00000002
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains:  0x00000002
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains:  0x00000002
Comment 10 Alexander Schlarb 2015-10-17 14:03:09 UTC
Created attachment 118936
Kernel log when using the debugging patch provided in the last message (XZ compressed)
Comment 11 Ilia Mirkin 2015-10-17 17:32:35 UTC
(In reply to Alexander Schlarb from comment #9)
> I now have also noted that the first "fail set_domain" messages already
> appear during the first suspend (including the plasmashell nouveau dump and
> even some of the corruption). How could I have not noticed this? :-/
> 
> Anyway, here we go:
> 
> I patched Linux 4.3.0-rc5-next-20151016+ (cd685d8558) with these lines:
> 
>     diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
>     index 2c99815..caff0e0 100644
>     --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
>     +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
>     @@ -291,7 +291,10 @@ nouveau_gem_set_domain(struct drm_gem_object *gem, uint32_t read_domains,
>             uint32_t domains = valid_domains & nvbo->valid_domains &
>                     (write_domains ? write_domains : read_domains);
>             uint32_t pref_flags = 0, valid_flags = 0;
>     -
>     +       printk("nouveau_gem_set_domain - drm_gem_object: 0x%08p\n", gem);
>     +       printk("nouveau_gem_set_domain - read_domains:   0x%08x\n", read_domains);
>     +       printk("nouveau_gem_set_domain - write_domains:  0x%08x\n", write_domains);
>     +       printk("nouveau_gem_set_domain - valid_domains:  0x%08x\n", valid_domains);
>             if (!domains)
>                     return -EINVAL;

I should have been more explicit. You need to move that into if (!domains), and also print nvbo->valid_domains. So e.g.

if (!domains) {
  printk("bo valid domains: %x\n", nvbo->valid_domains);
  printk("valid domains: %x\n", valid_domains);
  printk("read domains: %x\n", read_domains);
  printk("write domains: %x\n", write_domains);
  return -EINVAL;
}

This should provide the relevant info for the failing gem object.
Comment 12 Alexander Schlarb 2015-10-17 18:34:34 UTC
Created attachment 118938
Kernel log when debugging only (!domain) cases
Comment 13 Alexander Schlarb 2015-10-17 18:34:58 UTC
I hope this is more useful :-)
Comment 14 Ilia Mirkin 2015-10-17 18:52:53 UTC
(In reply to Alexander Schlarb from comment #13)
> I hope this is more useful :-)

*MUCH* more useful.

[   91.919550] bo valid domains: 2
[   91.919557] valid domains:    4
[   91.919559] read domains:     0
[   91.919560] write domains:    4
[   91.919565] nouveau 0000:01:00.0: plasmashell[2578]: fail set_domain
[   91.919569] nouveau 0000:01:00.0: plasmashell[2578]: validating bo list
[   91.919587] nouveau 0000:01:00.0: plasmashell[2578]: validate: -22

So bo->valid_domains == VRAM, but we're trying to write to it via GART. Hmmmm.
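
To read the numbers in these logs without looking up the header each time, a small decoder helps. This is only a sketch: the bit assignments (CPU = 0x1, VRAM = 0x2, GART = 0x4) are assumed to match the nouveau uAPI header, and `domain_names` is a hypothetical helper, not part of any driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror the NOUVEAU_GEM_DOMAIN_* bits. */
#define DOMAIN_CPU  (1u << 0)
#define DOMAIN_VRAM (1u << 1)
#define DOMAIN_GART (1u << 2)

/* Hypothetical helper: writes the names of the bits set in `mask`
 * into `buf`, separated by '|', and returns `buf`. */
const char *domain_names(uint32_t mask, char *buf, size_t len)
{
    static const struct { uint32_t bit; const char *name; } table[] = {
        { DOMAIN_CPU, "CPU" }, { DOMAIN_VRAM, "VRAM" }, { DOMAIN_GART, "GART" },
    };
    size_t used = 0;

    if (len == 0)
        return buf;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (!(mask & table[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? "|" : "", table[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* buffer too small; stop with what fits */
        used += (size_t)n;
    }
    return buf;
}
```

Applied to the log above, "bo valid domains: 2" decodes to VRAM and "write domains: 4" to GART, so the two masks share no bit.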

I'm going to try to come up with a patch that figures out the gpu va of the bo in question... hopefully that will provide further hints in conjunction with the pushbuf dump.
Comment 15 Ilia Mirkin 2015-10-18 14:22:19 UTC
*** Bug 92522 has been marked as a duplicate of this bug. ***
Comment 16 Ilia Mirkin 2015-10-19 01:06:11 UTC
Created attachment 118967
patch to print bo info

OK, this patch should provide the bo va address. Can you provide the output of this along with a log that includes the pushbuf error to stderr from plasmashell?

I'm going to try to match up the bo's va to where it is in the pushbuf and hopefully identify which bo it's going wrong for.
Comment 17 Alexander Schlarb 2015-10-19 15:17:10 UTC
Created attachment 118984
dmesg output (with patch 118967)
Comment 18 Alexander Schlarb 2015-10-19 15:17:51 UTC
Created attachment 118985
plasmashell + mesa output (with patch 118967)
Comment 19 Alexander Schlarb 2015-10-19 15:20:07 UTC
Here we go: The latest round of patched output!

What do you see?
Comment 20 Ilia Mirkin 2015-10-19 15:37:32 UTC
[  210.490188] nouveau 0000:01:00.0: plasmashell[3176]: bo valid domains: 2
[  210.490193] nouveau 0000:01:00.0: plasmashell[3176]: valid domains: 4
[  210.490195] nouveau 0000:01:00.0: plasmashell[3176]: read domains: 0
[  210.490197] nouveau 0000:01:00.0: plasmashell[3176]: write domains: 4
[  210.490199] nouveau 0000:01:00.0: plasmashell[3176]: buf offset: 0040950000
[  210.490202] nouveau 0000:01:00.0: plasmashell[3176]: buf size: 800000
[  210.490204] nouveau 0000:01:00.0: plasmashell[3176]: fail set_domain
[  210.490206] nouveau 0000:01:00.0: plasmashell[3176]: validating bo list
[  210.490210] nouveau 0000:01:00.0: plasmashell[3176]: validate: -22

nouveau: 	0x00146200
nouveau: 	0x00000000
nouveau: 	0x40950000
nouveau: 	0x000000cf
nouveau: 	0x00000040
nouveau: 	0x00000000

Which happens to be

      BEGIN_NV04(push, NV50_3D(RT_ADDRESS_HIGH(i)), 5);
      PUSH_DATAh(push, mt->base.address + sf->offset);
      PUSH_DATA (push, mt->base.address + sf->offset);
      PUSH_DATA (push, nv50_format_table[sf->base.format].rt);
      PUSH_DATA (push, mt->level[sf->base.u.tex.level].tile_mode);
      PUSH_DATA (push, mt->layer_stride >> 2);

And above it inits the fb size to 1920x1080 with one RT. The surface format is BGRA8_UNORM. So somehow the nouveau code thinks that the RT is in GART whereas it's really in VRAM. Very weird.
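
The match between the pushbuf dump and the kernel's "buf offset" can be checked by hand. The sketch below assumes the NV04-style packet header layout (word count in bits 18-28, subchannel in bits 13-15, method offset in bits 2-12); `decode_pkhdr` and `join_address` are hypothetical helpers written for this illustration.

```c
#include <stdint.h>

struct pkhdr {
    uint32_t count;   /* number of data words following the header */
    uint32_t subc;    /* subchannel index */
    uint32_t method;  /* method offset in bytes */
};

/* Hypothetical helper: split a header word using the assumed layout. */
struct pkhdr decode_pkhdr(uint32_t word)
{
    struct pkhdr h;

    h.count  = (word >> 18) & 0x7ff;
    h.subc   = (word >> 13) & 0x7;
    h.method = word & 0x1ffc;
    return h;
}

/* Hypothetical helper: combine the two address words of the method. */
uint64_t join_address(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}
```

For the header 0x00146200 this gives five data words on subchannel 3 at method 0x200, in line with the five-word RT_ADDRESS_HIGH write quoted above, and the address words 0x00000000 and 0x40950000 join to 0x40950000, the offset the kernel printed for the failing bo.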

I think that it's an imported buffer (via nv50_miptree_from_handle) and somehow the GEM_INFO tells us that it's in GART and not VRAM. Which means that the VRAM-allocated buffer somehow ends up in TTM_PL_TT.

Can you try the following patch and see if it fixes everything? (Keep the other patch in place in case it doesn't.)

diff --git a/drm/nouveau/nouveau_gem.c b/drm/nouveau/nouveau_gem.c
index ce74ab1..b4cbc86 100644
--- a/drm/nouveau/nouveau_gem.c
+++ b/drm/nouveau/nouveau_gem.c
@@ -229,11 +229,7 @@ nouveau_gem_info(struct drm_file *file_priv, struct drm_gem_object *gem,
 	struct nouveau_bo *nvbo = nouveau_gem_object(gem);
 	struct nvkm_vma *vma;
 
-	if (nvbo->bo.mem.mem_type == TTM_PL_TT)
-		rep->domain = NOUVEAU_GEM_DOMAIN_GART;
-	else
-		rep->domain = NOUVEAU_GEM_DOMAIN_VRAM;
-
+	rep->domain = nvbo->valid_domains;
 	rep->offset = nvbo->bo.offset;
 	if (cli->vm) {
 		vma = nouveau_bo_vma_find(nvbo, cli->vm);
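
The effect of the change can be illustrated with a small user-space model. It is a sketch under stated assumptions: the bo was created in VRAM, it is currently placed in TT (which is suspected above but not confirmed), and the importer reuses the reported domain as both its valid and write domain, as the logged values suggest. All names here (`model_bo`, `info_domain_old`, `info_domain_patched`, `write_would_validate`) are hypothetical.

```c
#include <stdint.h>

/* Assumed to mirror the NOUVEAU_GEM_DOMAIN_* bits. */
#define DOMAIN_VRAM (1u << 1)
#define DOMAIN_GART (1u << 2)

/* Stand-ins for the TTM placement types. */
enum placement { PLACED_IN_VRAM, PLACED_IN_TT };

struct model_bo {
    uint32_t valid_domains;   /* fixed when the bo is created */
    enum placement placement; /* current location; can change over time */
};

/* Old behaviour: report wherever the bo happens to live right now. */
uint32_t info_domain_old(const struct model_bo *bo)
{
    return bo->placement == PLACED_IN_TT ? DOMAIN_GART : DOMAIN_VRAM;
}

/* Patched behaviour: report the domains the bo may legally use. */
uint32_t info_domain_patched(const struct model_bo *bo)
{
    return bo->valid_domains;
}

/* Nonzero if a write that names `reported` as its valid and write
 * domain would survive the set_domain intersection. */
int write_would_validate(const struct model_bo *bo, uint32_t reported)
{
    return (reported & bo->valid_domains) != 0;
}
```

In this model a VRAM-only bo that is sitting in TT gets reported as GART by the old code, and the following write fails validation; with the patched reporting the same write passes.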
Comment 21 Alexander Schlarb 2015-10-19 17:42:52 UTC
Whoa! I really didn't expect that to actually do anything. :-O 
You're a genius!
I've tried multiple suspend-resume cycles, unplugging and replugging the external monitor several times, and saw no visible glitches and no suspicious kernel or application logs. The only kernel complaints were about the failing EDID requests to the external monitor (which tends to happen when you unplug it).

If I'm not mistaken this is just a workaround however, right? We still need to find out why "nvbo->bo.mem.mem_type == TTM_PL_TT"?

BTW: Thanks for your explanations about the nouveau buf offset! After reading it a few times I'm now pretty sure I understand roughly what you are doing. :-)
Comment 22 Ilia Mirkin 2015-10-19 17:47:44 UTC
(In reply to Alexander Schlarb from comment #21)
> If I'm not mistaken this is just a workaround however, right? We still need
> to find out why "nvbo->bo.mem.mem_type == TTM_PL_TT"?

Not sure... Ben and I will talk about it. I wonder if it's OK to return VRAM | GART here for the pre-nv30 cases.

> 
> BTW: Thanks for your explanations about the nouveau buf offset! After
> reading it a few times I'm now pretty sure I understand roughly what you are
> doing. :-)

No problem. Lots of tricky stuff going on. The GPU has a virtual memory space of its own which further complicates (but also simplifies) things.
Comment 23 Ilia Mirkin 2015-10-20 05:18:41 UTC
FTR I sent a slightly better patch at http://lists.freedesktop.org/archives/nouveau/2015-October/022732.html -- this one should keep pre-nv50 working the same way as it did before.
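
The linked patch is not reproduced in this thread. One plausible way to get the described behaviour, assumed here rather than taken from that patch, is to report `valid_domains` only when it names exactly one domain and to fall back to the current placement otherwise. Since pre-nv50 bos keep both VRAM and GART in their mask, they would take the fallback path. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the NOUVEAU_GEM_DOMAIN_* bits. */
#define DOMAIN_VRAM (1u << 1)
#define DOMAIN_GART (1u << 2)

/* True when exactly one bit is set in `mask`. */
bool is_single_domain(uint32_t mask)
{
    return mask != 0 && (mask & (mask - 1)) == 0;
}

/* Hypothetical refinement: trust valid_domains only when it is
 * unambiguous, otherwise report the current placement as before. */
uint32_t info_domain_refined(uint32_t valid_domains, bool currently_in_tt)
{
    if (is_single_domain(valid_domains))
        return valid_domains;
    return currently_in_tt ? DOMAIN_GART : DOMAIN_VRAM;
}
```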
Comment 24 Ilia Mirkin 2015-10-20 18:35:07 UTC
*** Bug 92213 has been marked as a duplicate of this bug. ***
Comment 25 Ilia Mirkin 2015-10-20 18:35:33 UTC
*** Bug 91598 has been marked as a duplicate of this bug. ***
Comment 26 Ilia Mirkin 2015-10-20 18:37:40 UTC
*** Bug 91125 has been marked as a duplicate of this bug. ***
Comment 27 Marten van Kerkwijk 2015-10-22 01:13:32 UTC
Fantastic to see this resolved -- it has been very painful to have to reboot my office machine shortly after a suspend, or just after a long day's work. If it is clear which version this is going into, please alert here, so I can pick up the new package in debian experimental as soon as it lands there (or compile myself, but slightly hesitant to do that).
Comment 28 Ilia Mirkin 2015-10-22 02:45:15 UTC
(In reply to Marten van Kerkwijk from comment #27)
> If it is clear which version this is going into, please alert here, so I can
> pick up the new package in debian experimental as soon as it lands there (or
> compile myself, but slightly hesitant to do that).

It's already in Dave Airlie's drm-fixes branch, which means it should be on its way to Linus for 4.3-rc7 and then it should get picked into stable kernel trees.
Comment 29 Ilia Mirkin 2015-10-22 03:36:02 UTC
*** Bug 92010 has been marked as a duplicate of this bug. ***
Comment 30 Ilia Mirkin 2015-10-22 09:05:38 UTC
*** Bug 92299 has been marked as a duplicate of this bug. ***
Comment 31 Ilia Mirkin 2015-10-25 18:13:00 UTC
Linux 4.3-rc7 is out now and should contain my fix. Please try it out and confirm that it fixes the set_domain issues.
Comment 32 Carla sella 2015-10-27 21:33:18 UTC
I installed Ubuntu kernel 4.3.0-040300rc7-generic, suspended my PC for about 5 minutes, and when I resumed everything was OK, with no issues.
So my problems are definitely solved (I had reported bug https://bugs.freedesktop.org/show_bug.cgi?id=92522).
Thanks a lot.
Comment 33 Alexander Schlarb 2015-10-29 15:57:34 UTC
I have used the latest (as of 4 days ago) linux-next kernel tree, without any patches applied, and can confirm that this issue is gone now!

Thanks everybody (and especially Ilia Mirkin) for fixing this bug so quickly!!
Comment 34 Marten van Kerkwijk 2015-10-29 18:42:58 UTC
Debian experimental now has -rc7; I tried it too and also found that the issues were resolved. Thanks! -- Marten

