Created attachment 118931 [details]
dmesg output (since boot)

Steps to reproduce:
 * Use laptop with "NVIDIA Corporation GT216M [GeForce GT 330M]" or similar
 * Install KDE Plasma 5
 * Reboot
 * Log in to Plasma 5 desktop
 * Open the main menu and do *Quit* -> *Suspend*
 * Wait for the system to suspend
 * Resume the system and unlock the screen
 * Disconnect or connect external monitor (use *Menu* -> *Computer* -> *System settings* -> *Screen* to force screen detection if it doesn't happen automatically)
 * Open the main menu and do *Quit* -> *Suspend*
 * Wait for the system to suspend
 * Resume the system and unlock the screen
 * Open the main menu, network indicator or any other menu from the panel (`plasmashell`)

*What should happen?*
Nothing - it should "just work"

*What happens instead?*
Massive corruption the first time you open any menu

Also, after reproducing this error a few times without rebooting, the corruption becomes worse on every suspend until the system hangs. I'm not sure if this is related though…

*Where was this error produced?*
Tanglu Linux 4.0 ALPHA (Debian based) using:
 * Latest 4.2.0-trunk linux-image
 * MESA git snapshot or MESA 11.0.2
 * LibDRM 2.4.64
 * X11 1.17.2
 * xorg-video-nouveau 1.0.11
Created attachment 118932 [details] Log of plasmashell running (only last suspend before initial corruption)
A plasmashell apitrace is available on request (100MB)
Which mesa git snapshot? A number of resource lifetime issues were addressed in 11.0.3 which I was hoping might fix weirdo issues like this.
My current MESA snapshot is 93267887a06e760b4b20618523df5e8aa4e70307
OK, that has all the fixes I had in mind. Oh well. Does replaying the apitrace reproduce the issues? If so, I'd be most interested in obtaining it.
Actually corruption was way worse before the latest MESA 11.0.2 update, so thumbs up! :-)

Unfortunately, replaying the apitrace neither displays any corruption nor shows any trace of the pages of nouveau error messages that appear in the `plasmashell` console output after resume.

Another thing: An output very similar to the one dumped by `plasmashell` on resume is also dumped by `kwin_x11` (the compositor/window manager), although I haven't noticed any corruption there yet (that only happens during the second stage when the system becomes unstable).

What now?
Actually the issue is most likely due to

[20351.450000] nouveau E[plasmashell[13570]] fail set_domain
[20351.450006] nouveau E[plasmashell[13570]] validating bo list
[20351.450015] nouveau E[plasmashell[13570]] validate: -22

Which is... odd. This happens when

    uint32_t domains = valid_domains & nvbo->valid_domains &
        (write_domains ? write_domains : read_domains);
    if (!domains)
        return -EINVAL;

Which means that unless nvbo->valid_domains had something funny in it, this is all good. Hmmm....

    nvbo->valid_domains = NOUVEAU_GEM_DOMAIN_VRAM |
                          NOUVEAU_GEM_DOMAIN_GART;
    if (drm->device.info.family >= NV_DEVICE_INFO_V0_TESLA)
        nvbo->valid_domains &= domain;

So if we ever supply valid/read/write domains that are not the same as the domain we created the bo with, then this will fail. However, after some scouring of the code, I don't see where we'd be messing this up. And unfortunately nothing that libdrm prints will help with this. I'm also unsure of

If you can, could you patch drivers/gpu/drm/nouveau/nouveau_gem.c:nouveau_gem_set_domain to have a print of all 4 inputs to domains to see why it's coming out as 0?
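The check described in the comment above can be modelled in user space. The following is only a minimal sketch: `struct model_bo`, `model_bo_valid_domains` and `model_set_domain` are hypothetical names, not kernel symbols, and only the mask logic quoted in the comment is reproduced. The two domain bit values are assumed to match the nouveau UAPI header.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Domain bits, assumed to match the nouveau UAPI header (nouveau_drm.h). */
#define NOUVEAU_GEM_DOMAIN_VRAM (1u << 1)
#define NOUVEAU_GEM_DOMAIN_GART (1u << 2)

/* Hypothetical stand-in for the single nouveau_bo field used by the check. */
struct model_bo {
    uint32_t valid_domains;
};

/* Models the creation-time narrowing: on Tesla and newer, a bo is only
 * valid in the domain(s) it was created with. */
uint32_t model_bo_valid_domains(uint32_t creation_domain, int tesla_or_newer)
{
    uint32_t valid = NOUVEAU_GEM_DOMAIN_VRAM | NOUVEAU_GEM_DOMAIN_GART;

    if (tesla_or_newer)
        valid &= creation_domain;
    return valid;
}

/* Models only the mask check; the real function continues with placement
 * selection, which is omitted here. */
int model_set_domain(const struct model_bo *bo, uint32_t read_domains,
                     uint32_t write_domains, uint32_t valid_domains)
{
    uint32_t domains;

    assert(bo != NULL);
    domains = valid_domains & bo->valid_domains &
              (write_domains ? write_domains : read_domains);
    if (!domains)
        return -EINVAL; /* shows up as "validate: -22" in the kernel log */
    return 0;
}
```

With a bo created in VRAM on a Tesla-class GPU, any request that names only GART yields an empty mask and therefore `-EINVAL`, while a request naming VRAM passes.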
I'll do that and get back to you once I'm done testing.
I have now also noticed that the first "fail set_domain" messages already appear during the first suspend (including the plasmashell nouveau dump and even some of the corruption). How could I have not noticed this? :-/

Anyway, here we go:

I patched Linux 4.3.0-rc5-next-20151016+ (cd685d8558) with these lines:

diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 2c99815..caff0e0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -291,7 +291,10 @@ nouveau_gem_set_domain(struct drm_gem_object *gem, uint32_t read_domains,
     uint32_t domains = valid_domains & nvbo->valid_domains &
         (write_domains ? write_domains : read_domains);
     uint32_t pref_flags = 0, valid_flags = 0;
-
+    printk("nouveau_gem_set_domain - drm_gem_object: 0x%08p\n", gem);
+    printk("nouveau_gem_set_domain - read_domains: 0x%08x\n", read_domains);
+    printk("nouveau_gem_set_domain - write_domains: 0x%08x\n", write_domains);
+    printk("nouveau_gem_set_domain - valid_domains: 0x%08x\n", valid_domains);
     if (!domains)
         return -EINVAL;

Then I booted the kernel and reproduced the issue.
Some relevant lines:

Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800c5b832e8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800c9b0cae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800c4990ae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: kscreenlocker_g[3146]: fail set_domain
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: kscreenlocker_g[3146]: validating bo list
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: kscreenlocker_g[3146]: validate: -22
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800bf085ae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000002
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000002
--
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8801a383eee8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800c5918ee8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800cab47ae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000000
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000004
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: plasmashell[2575]: fail set_domain
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: plasmashell[2575]: validating bo list
Okt 17 15:15:07 Alexander-NB kernel: nouveau 0000:01:00.0: plasmashell[2575]: validate: -22
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - drm_gem_object: 0xffff8800ca083ae8
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - read_domains: 0x00000002
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - write_domains: 0x00000002
Okt 17 15:15:07 Alexander-NB kernel: nouveau_gem_set_domain - valid_domains: 0x00000002
Created attachment 118936 [details] Kernel log when using the debugging patch provided in the last message (XZ compressed)
(In reply to Alexander Schlarb from comment #9)
> I now have also noted that the first "fail set_domain" messages already
> appear during the first suspend (including the plasmashell nouveau dump and
> even some of the corruption). How could I have not noticed this? :-/
>
> Anyway, here we go:
>
> I patched Linux 4.3.0-rc5-next-20151016+ (cd685d8558) with these lines:
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c
> b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index 2c99815..caff0e0 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -291,7 +291,10 @@ nouveau_gem_set_domain(struct drm_gem_object *gem,
> uint32_t read_domains,
>      uint32_t domains = valid_domains & nvbo->valid_domains &
>          (write_domains ? write_domains : read_domains);
>      uint32_t pref_flags = 0, valid_flags = 0;
> -
> +    printk("nouveau_gem_set_domain - drm_gem_object: 0x%08p\n", gem);
> +    printk("nouveau_gem_set_domain - read_domains: 0x%08x\n",
> read_domains);
> +    printk("nouveau_gem_set_domain - write_domains: 0x%08x\n",
> write_domains);
> +    printk("nouveau_gem_set_domain - valid_domains: 0x%08x\n",
> valid_domains);
>      if (!domains)
>          return -EINVAL;

I should have been more explicit. You need to move that into if (!domains), and also print nvbo->valid_domains. So e.g.

    if (!domains) {
        printk("bo valid domains: %x\n", nvbo->valid_domains);
        printk("valid domains: %x\n", valid_domains);
        printk("read domains: %x\n", read_domains);
        printk("write domains: %x\n", write_domains);
        return -EINVAL;
    }

This should provide the relevant info for the failing gem object.
Created attachment 118938 [details] Kernel log when debugging only (!domain) cases
I hope this is more useful :-)
(In reply to Alexander Schlarb from comment #13)
> I hope this is more useful :-)

*MUCH* more useful.

[ 91.919550] bo valid domains: 2
[ 91.919557] valid domains: 4
[ 91.919559] read domains: 0
[ 91.919560] write domains: 4
[ 91.919565] nouveau 0000:01:00.0: plasmashell[2578]: fail set_domain
[ 91.919569] nouveau 0000:01:00.0: plasmashell[2578]: validating bo list
[ 91.919587] nouveau 0000:01:00.0: plasmashell[2578]: validate: -22

So bo->valid_domains == VRAM, but we're trying to write to it via GART. Hmmmm.

I'm going to try to come up with a patch that figures out the gpu va of the bo in question... hopefully that will provide further hints in conjunction with the pushbuf dump.
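For readers decoding the numbers in the log above, a small helper that turns a domain mask into names may be useful. This is a sketch only: `domain_names` is a hypothetical helper, and the bit assignments are assumed to match the nouveau UAPI header (CPU = 1, VRAM = 2, GART = 4, MAPPABLE = 8).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Domain bits, assumed to match the nouveau UAPI header (nouveau_drm.h). */
#define NOUVEAU_GEM_DOMAIN_CPU      (1u << 0)
#define NOUVEAU_GEM_DOMAIN_VRAM     (1u << 1)
#define NOUVEAU_GEM_DOMAIN_GART     (1u << 2)
#define NOUVEAU_GEM_DOMAIN_MAPPABLE (1u << 3)

/* Hypothetical helper: writes a '|'-separated list of the domain names set
 * in mask into buf (truncating if needed) and returns buf. An empty mask
 * is rendered as "none". */
const char *domain_names(uint32_t mask, char *buf, size_t len)
{
    static const struct { uint32_t bit; const char *name; } table[] = {
        { NOUVEAU_GEM_DOMAIN_CPU,      "CPU" },
        { NOUVEAU_GEM_DOMAIN_VRAM,     "VRAM" },
        { NOUVEAU_GEM_DOMAIN_GART,     "GART" },
        { NOUVEAU_GEM_DOMAIN_MAPPABLE, "MAPPABLE" },
    };
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (!(mask & table[i].bit))
            continue;
        if (buf[0] != '\0')
            strncat(buf, "|", len - strlen(buf) - 1);
        strncat(buf, table[i].name, len - strlen(buf) - 1);
    }
    if (buf[0] == '\0')
        strncat(buf, "none", len - 1);
    return buf;
}
```

Applied to the log, "bo valid domains: 2" reads as VRAM and "write domains: 4" as GART; their intersection `4 & 2 & 4` is zero, which is the empty mask that triggers the error.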
*** Bug 92522 has been marked as a duplicate of this bug. ***
Created attachment 118967 [details] [review] patch to print bo info OK, this patch should provide the bo va address. Can you provide the output of this along with a log that includes the pushbuf error to stderr from plasmashell? I'm going to try to match up the bo's va to where it is in the pushbuf and hopefully identify which bo it's going wrong for.
Created attachment 118984 [details] dmesg output (with patch 118967)
Created attachment 118985 [details] plasmashell + mesa output (with patch 118967)
Here we go: The latest round of patched output! What do you see?
[ 210.490188] nouveau 0000:01:00.0: plasmashell[3176]: bo valid domains: 2
[ 210.490193] nouveau 0000:01:00.0: plasmashell[3176]: valid domains: 4
[ 210.490195] nouveau 0000:01:00.0: plasmashell[3176]: read domains: 0
[ 210.490197] nouveau 0000:01:00.0: plasmashell[3176]: write domains: 4
[ 210.490199] nouveau 0000:01:00.0: plasmashell[3176]: buf offset: 0040950000
[ 210.490202] nouveau 0000:01:00.0: plasmashell[3176]: buf size: 800000
[ 210.490204] nouveau 0000:01:00.0: plasmashell[3176]: fail set_domain
[ 210.490206] nouveau 0000:01:00.0: plasmashell[3176]: validating bo list
[ 210.490210] nouveau 0000:01:00.0: plasmashell[3176]: validate: -22

nouveau: 0x00146200
nouveau: 0x00000000
nouveau: 0x40950000
nouveau: 0x000000cf
nouveau: 0x00000040
nouveau: 0x00000000

Which happens to be

    BEGIN_NV04(push, NV50_3D(RT_ADDRESS_HIGH(i)), 5);
    PUSH_DATAh(push, mt->base.address + sf->offset);
    PUSH_DATA (push, mt->base.address + sf->offset);
    PUSH_DATA (push, nv50_format_table[sf->base.format].rt);
    PUSH_DATA (push, mt->level[sf->base.u.tex.level].tile_mode);
    PUSH_DATA (push, mt->layer_stride >> 2);

And above it inits the fb size to 1920x1080 with one RT. The surface format is BGRA8_UNORM.

So somehow the nouveau code thinks that the RT is in GART whereas it's really in VRAM. Very weird. I think that it's an imported buffer (via nv50_miptree_from_handle) and somehow the GEM_INFO tells us that it's in GART and not VRAM. Which means that the VRAM-allocated buffer somehow ends up in TTM_PL_TT.

Can you try the following patch and see if it fixes everything? (Keep the other patch in place in case it doesn't.)
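To make the matching of the pushbuf dump to the bo offset easier to follow, here is a sketch of how the first dumped word could be split into its fields. It assumes the NV04-style "increasing methods" header layout, `(size << 18) | (subc << 13) | mthd`, which is my understanding of what `BEGIN_NV04` emits and is not stated in this thread. `decode_nv04_header` and `rt_address` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an NV04-style increasing-method header (layout is an
 * assumption, see the note above the code). */
struct nv04_method_header {
    uint32_t size; /* number of data words that follow */
    uint32_t subc; /* subchannel */
    uint32_t mthd; /* method offset in bytes */
};

struct nv04_method_header decode_nv04_header(uint32_t word)
{
    struct nv04_method_header h;

    /* Assumed: top three bits and low two bits are zero for this
     * header type; other command types are not handled by this sketch. */
    assert((word >> 29) == 0 && (word & 0x3u) == 0);
    h.size = (word >> 18) & 0x7ffu;
    h.subc = (word >> 13) & 0x7u;
    h.mthd = word & 0x1ffcu;
    return h;
}

/* Recombines the two words pushed by PUSH_DATAh (high half) and
 * PUSH_DATA (low half) into one 64-bit GPU virtual address. */
uint64_t rt_address(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}
```

Under that assumption, 0x00146200 decodes to size 5, subchannel 3, method 0x200, which agrees with the `5` passed to `BEGIN_NV04` in the quoted Mesa code. The next two words, 0x00000000 and 0x40950000, combine to 0x0040950000, the same value the kernel printed as "buf offset".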
diff --git a/drm/nouveau/nouveau_gem.c b/drm/nouveau/nouveau_gem.c
index ce74ab1..b4cbc86 100644
--- a/drm/nouveau/nouveau_gem.c
+++ b/drm/nouveau/nouveau_gem.c
@@ -229,11 +229,7 @@ nouveau_gem_info(struct drm_file *file_priv, struct drm_gem_object *gem,
     struct nouveau_bo *nvbo = nouveau_gem_object(gem);
     struct nvkm_vma *vma;
 
-    if (nvbo->bo.mem.mem_type == TTM_PL_TT)
-        rep->domain = NOUVEAU_GEM_DOMAIN_GART;
-    else
-        rep->domain = NOUVEAU_GEM_DOMAIN_VRAM;
-
+    rep->domain = nvbo->valid_domains;
     rep->offset = nvbo->bo.offset;
     if (cli->vm) {
         vma = nouveau_bo_vma_find(nvbo, cli->vm);
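The effect of the patch above can be illustrated with a small user-space model. This is a sketch under assumptions: `struct model_bo`, `enum model_placement`, `info_domain_before` and `info_domain_after` are hypothetical names, and the placement enum only stands in for the TTM memory types. It shows why reporting the current placement can hand user space a domain the bo was never valid in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Domain bits, assumed to match the nouveau UAPI header (nouveau_drm.h). */
#define NOUVEAU_GEM_DOMAIN_VRAM (1u << 1)
#define NOUVEAU_GEM_DOMAIN_GART (1u << 2)

/* Hypothetical stand-in for the TTM placement of a buffer; the values are
 * illustrative and not the kernel's TTM_PL_* constants. */
enum model_placement { MODEL_PL_TT, MODEL_PL_VRAM };

/* Hypothetical stand-in for the nouveau_bo fields that matter here. */
struct model_bo {
    uint32_t valid_domains;         /* fixed at creation time */
    enum model_placement placement; /* where the data currently lives */
};

/* Behaviour removed by the patch: report wherever the bo is right now. */
uint32_t info_domain_before(const struct model_bo *bo)
{
    assert(bo != NULL);
    if (bo->placement == MODEL_PL_TT)
        return NOUVEAU_GEM_DOMAIN_GART;
    return NOUVEAU_GEM_DOMAIN_VRAM;
}

/* Behaviour added by the patch: report the domains the bo may be used in. */
uint32_t info_domain_after(const struct model_bo *bo)
{
    assert(bo != NULL);
    return bo->valid_domains;
}
```

For a VRAM-only bo that currently sits in the TT placement, the old query reports GART, so a later validation against `valid_domains` yields an empty mask. The new query reports VRAM, and the intersection stays non-zero.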
Whoa! I really didn't expect that to actually do anything. :-O You're a genius! I've tried multiple suspend-resume cycles, with plugging out and in the external monitor several times, and no visible glitches and no suspicious kernel or application logs. Only kernel complaints were about the failing EDID requests to the external monitor (which tends to happen when you unplug it). If I'm not mistaken this is just a workaround however, right? We still need to find out why "nvbo->bo.mem.mem_type == TTM_PL_TT"? BTW: Thanks for your explanations about the nouveau buf offset! After reading it a few times I'm now pretty sure I understand roughly what you are doing. :-)
(In reply to Alexander Schlarb from comment #21)
> If I'm not mistaken this is just a workaround however, right? We still need
> to find out why "nvbo->bo.mem.mem_type == TTM_PL_TT"?

Not sure... Ben and I will talk about it. I wonder if it's OK to return VRAM | GART here for the pre-nv50 cases.

> BTW: Thanks for your explanations about the nouveau buf offset! After
> reading it a few times I'm now pretty sure I understand roughly what you are
> doing. :-)

No problem. Lots of tricky stuff going on. The GPU has a virtual memory space of its own which further complicates (but also simplifies) things.
FTR I sent a slightly better patch at http://lists.freedesktop.org/archives/nouveau/2015-October/022732.html -- this one should keep pre-nv50 working the same way as it did before.
*** Bug 92213 has been marked as a duplicate of this bug. ***
*** Bug 91598 has been marked as a duplicate of this bug. ***
*** Bug 91125 has been marked as a duplicate of this bug. ***
Fantastic to see this resolved -- it has been very painful to have to reboot my office machine shortly after a suspend, or just after a long day's work. If it is clear which version this is going into, please alert here, so I can pick up the new package in debian experimental as soon as it lands there (or compile myself, but slightly hesitant to do that).
(In reply to Marten van Kerkwijk from comment #27)
> If it is clear which version this is going into, please alert here, so I can
> pick up the new package in debian experimental as soon as it lands there (or
> compile myself, but slightly hesitant to do that).

It's already in Dave Airlie's drm-fixes branch, which means it should be on its way to Linus for 4.3-rc7 and then it should get picked into stable kernel trees.
*** Bug 92010 has been marked as a duplicate of this bug. ***
*** Bug 92299 has been marked as a duplicate of this bug. ***
Linux 4.3-rc7 is out now and should contain my fix. Please try it out and confirm that it fixes the set_domain issues.
I installed Ubuntu Kernel 4.3.0-040300rc7-generic, suspended my PC for about 5 minutes, and when I resumed everything was OK, no issues. So my problems are definitely solved (I had reported bug https://bugs.freedesktop.org/show_bug.cgi?id=92522). Thanks a lot.
I have used the latest (as of 4 days ago) linux-next kernel tree, without any patches applied, and can confirm that this issue is gone now! Thanks everybody (and especially Ilia Mirkin) for fixing this bug so quickly!!
Debian experimental now has -rc7 and I tried it too, and also found that the issues were resolved. Thanks! -- Marten