On a hybrid machine with SKL & Topaz running Ubuntu 17.04 (xserver 1.19.3, kernel 4.10, mesa 17.0.3), running apps on the dGPU results in corruption like in this video: https://aaltoset.kapsi.fi/tmp/MOV_2305.mp4. It goes away when the window is resized or made fullscreen. Confirmed with at least glxgears and glmark2.
Forgot to mention that DRI3 is enabled:
    libGL: pci id for fd 5: 1002:6900, driver radeonsi
    libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/tls/radeonsi_dri.so
    libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so
    libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
    libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
    libGL: Using DRI3 for screen 0
    Running synchronized to the vertical refresh. The framerate should be
    approximately the same as the monitor refresh rate.
    305 frames in 5.0 seconds = 60.855 FPS
Updating to git snapshots from a PPA didn't help.
It also happens with both the intel and modesetting X drivers.
The original comment was wrong: it's a KBL (5917) + Topaz (6900). Another config with KBL (5916) + Topaz (6900) does not have issues, so it looks like the bug is somewhere in the diff between GT1.5 and GT2. With 4.12 it was still buggy, and git mesa was tested too.
current nightly tested as well, still the same
Interesting. Mesa apparently doesn't even know 5916 is a production thing. Timo, to work on narrowing down the issue, can you modify mesa to behave as the other part to see if the issue goes away?
    diff --git a/include/pci_ids/i965_pci_ids.h b/include/pci_ids/i965_pci_ids.h
    index 57e70b7aed..f60ac1ba60 100644
    --- a/include/pci_ids/i965_pci_ids.h
    +++ b/include/pci_ids/i965_pci_ids.h
    @@ -153,7 +153,7 @@ CHIPSET(0x5913, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
     CHIPSET(0x5915, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
     CHIPSET(0x5917, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
     CHIPSET(0x5912, kbl_gt2, "Intel(R) HD Graphics 630 (Kaby Lake GT2)")
    -CHIPSET(0x5916, kbl_gt2, "Intel(R) HD Graphics 620 (Kaby Lake GT2)")
    +CHIPSET(0x5916, gt1_5, "Intel(R) HD Graphics 620 (Kaby Lake GT2)")
     CHIPSET(0x591A, kbl_gt2, "Intel(R) HD Graphics P630 (Kaby Lake GT2)")
     CHIPSET(0x591B, kbl_gt2, "Intel(R) HD Graphics 630 (Kaby Lake GT2)")
     CHIPSET(0x591D, kbl_gt2, "Intel(R) HD Graphics P630 (Kaby Lake GT2)")
Oops, I had a typo. should have been kbl_gt1_5
Actually it is the other way around. 0x5917 is the failing one. The spec says it is KBL-R GT2, not gt1_5. So maybe
    -CHIPSET(0x5917, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
    +CHIPSET(0x5917, kbl_gt2, "Intel(R) Kabylake GT2")
helps...
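For anyone wanting to double-check which GT family a given PCI ID currently maps to, the CHIPSET() entries quoted above can be scanned mechanically. This is only a rough sketch, assuming the header keeps the one-macro-per-line CHIPSET(id, family, "name") layout seen in the diff; parse_chipsets and the SAMPLE text are illustrative names, not part of Mesa.

```python
import re

# Matches lines of the form: CHIPSET(0x5917, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
CHIPSET_RE = re.compile(r'CHIPSET\(\s*(0x[0-9A-Fa-f]+)\s*,\s*(\w+)\s*,\s*"([^"]*)"\s*\)')


def parse_chipsets(header_text):
    """Return {pci_id: (family, name)} for every CHIPSET() entry found."""
    table = {}
    for pci_id, family, name in CHIPSET_RE.findall(header_text):
        table[int(pci_id, 16)] = (family, name)
    return table


# Entries copied from the diff context quoted in this thread (unpatched state).
SAMPLE = '''
CHIPSET(0x5917, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
CHIPSET(0x5912, kbl_gt2, "Intel(R) HD Graphics 630 (Kaby Lake GT2)")
CHIPSET(0x5916, kbl_gt2, "Intel(R) HD Graphics 620 (Kaby Lake GT2)")
'''

table = parse_chipsets(SAMPLE)
for pci_id in (0x5916, 0x5917):
    family, name = table[pci_id]
    print(f"{pci_id:#06x}: {family} ({name})")
```

Running the same function over the real header before and after a patch would show whether 0x5917 actually moved from kbl_gt1_5 to kbl_gt2 in the build being tested.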
Changing to NEEDINFO. Please mark as REOPEN once the new information is provided.
The kernel's PCI ID table (include/drm/i915_pciids.h) should be updated too, if it is indeed a GT2. It's currently in the GT1 list. FWIW, our current guess is that the PCI ID corresponded to some internal-only part (a GT1.5, whatever that is), and has now been repurposed as a shipping GT2 SKU.
Still can reproduce this issue with the modified PCI ID table (mesa and kernel).
Hang on, the testing was probably done with wrong packages, need to verify
re-testing proves that changing 5917 to be GT2 fixes this issue, and in fact just updating the kernel was enough to fix it
Testing is still ongoing with mixed results. At this point I'm not sure what's working and what's not, but since the pci-id is in the wrong group, maybe you can start pushing the fix upstream to the kernel/libdrm/mesa..
Timo what does mixed results mean? Are we back to the original statement of one thing works, and one doesn't?
Well, it looks like mesa plays a part after all, and the machine also needs amdgpu dkms, which makes kernel testing a bit more fragile.. but now I have the whole stack with the fixed pci-id built on a PPA, so I hope this will finally be a solid set for testing at least..
Ok, so it turns out the pci-id change does _not_ help.. False positive test results were caused by the fact that the corruption sometimes needs 5-10s to show, or that the renderer was Intel instead of amdgpu (because the dkms failed to build or whatever)..
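Since some of the false positives came from the renderer silently being Intel, a test run could check the renderer string before trusting a PASS. A minimal sketch, assuming the "GL_RENDERER = ..." line format printed by glxgears -info and assuming that an AMD renderer string contains "AMD"; both sample outputs below are made up for illustration.

```python
import re


def gl_renderer(info_output):
    """Extract the GL_RENDERER string from `glxgears -info` style output."""
    match = re.search(r"^GL_RENDERER\s*=\s*(.+)$", info_output, re.MULTILINE)
    return match.group(1).strip() if match else None


def rendered_on_amd(info_output):
    """True only if a renderer line is present and mentions AMD (assumed marker)."""
    renderer = gl_renderer(info_output)
    return renderer is not None and "AMD" in renderer.upper()


# Hypothetical sample outputs, for illustration only.
amd_run = "GL_RENDERER   = AMD ICELAND (DRM 3.9.0, LLVM 6.0.0)\nGL_VENDOR     = X.Org\n"
intel_run = "GL_RENDERER   = Mesa DRI Intel(R) HD Graphics 620\nGL_VENDOR     = Intel Open Source Technology Center\n"

print(rendered_on_amd(amd_run))    # True
print(rendered_on_amd(intel_run))  # False
```

In practice the text would come from the output of "DRI_PRIME=1 glxgears -info"; a run where rendered_on_amd() is False says nothing about the dGPU path.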
@Intel We tested with the new Intel driver on 0x5916 and can reproduce the bug there as well, so it seems the issue was introduced by a driver update. Please help find a solution; we are short on time. Thanks so much!
Turns out that setting amdgpu.dpm=0 fixes the corruption..
(In reply to Timo Aaltonen from comment #18)
> Turns out that setting amdgpu.dpm=0 fixes the corruption..
But that does not explain why the issue cannot be reproduced with the old Intel driver while it can with the new one. I assume you drew this conclusion by testing on the same machine, with the same AMD video card and the same AMD driver. Is that right?
amdgpu.dpm=0 doesn't really work. There are some issues when loading the amdgpu driver with dpm=0: it leads to Intel being used as the OpenGL renderer (per glxinfo) instead of AMD.
Error message:
    Xorg.0.log:[ 24.261] (II) UnloadModule: "amdgpu"
I got two patches from AMD to fix the issue. After applying the patches, the tearing issue is back.
Here are some new findings:
The issue can be reproduced on 5916 and 5917. But I only see the issue on 5916 when running glxgears fullscreen. When unplugging the power adapter and running glxgears with the default window size, the issue on 5917 is gone.
Test Environment:
Device 1 (5916):
    a. Graphics Cards: Intel [8086:5916] + AMD [1002:6900]
    b. Kernel: 4.4.0-45
Device 2 (5917):
    a. Graphics Cards: Intel [8086:5917] + AMD [1002:6900]
    b. Kernel: 4.4.0-73
Test Result:
1. AC Mode: (Power Adapter)
    Device 1 (5916):
        $ DRI_PRIME=1 glxgears: PASS
        $ DRI_PRIME=1 glxgears -fullscreen: tearing
    Device 2 (5917):
        $ DRI_PRIME=1 glxgears: tearing
        $ DRI_PRIME=1 glxgears -fullscreen: tearing
2. DC Mode: (Battery)
    Device 1 (5916):
        $ DRI_PRIME=1 glxgears: PASS
        $ DRI_PRIME=1 glxgears -fullscreen: tearing
    Device 2 (5917):
        $ DRI_PRIME=1 glxgears: PASS
        $ DRI_PRIME=1 glxgears -fullscreen: tearing
Created attachment 133542 [details] tearing on 5916&5917 (video)
Created attachment 133543 [details] tearing on 5916&5917 (photo)
Looks like there are two separate issues here — the corruption shown in Timo's video in the original bug description, and the tearing shown in Ethan's video and photo. They need to be tracked separately.
The tearing issue which I saw when running glxgear with default window size is same as the corruption shown in Timo's video. The photo and video I uploaded were taken when running glxgear with full screen.
(In reply to Ethan Hsieh from comment #24)
> The tearing issue which I saw when running glxgear with default window size
> is same as the corruption shown in Timo's video.
That's corruption, not tearing.
> The photo and video I uploaded were taken when running glxgear with full
> screen.
That's tearing, not corruption. They're two separate issues that need to be tracked separately.
So can anybody still reproduce the corruption as shown in Timo's video or not? I'm NOT asking about the tearing in Ethan's video.
I don't have the hw to test, but what I know by now is that
- the tearing is gone at least with kernel 4.10
- corruption still happens, but _only_ when the charger is attached, meaning that when the system is on battery the corruption is gone, and it appears again when the charger is plugged in..
Hi @Timo Aaltonen, again, would you create another bug to track the tearing issue separately, as Michel said in comment #23?
(In reply to qwang13 from comment #28)
> Again, Would you create another bug to separately track tearing issue as
> Michel comment#23 said?
Per comment 27, the tearing is already fixed with current upstream versions, so there's no point in creating another report here for that.
Created attachment 133710 [details] kern.log (drm.debug=0xe)
Here are the timestamps for kern.log:
--- DC mode ---
1. timestamp: 16:49 $ DRI_PRIME=1 glxgears
2. timestamp: 16:50 Stop glxgears
3. timestamp: 16:51 plug power adaptor in
--- AC mode ---
4. timestamp: 16:52 $ DRI_PRIME=1 glxgears ===> corruption!!! <===
5. timestamp: 16:53 Stop glxgears
6. timestamp: 16:54 $ glxgears
7. timestamp: 16:55 Stop glxgears
Created attachment 133711 [details] Xorg.0.log
Created attachment 133712 [details] syslog
Based on Michel's comment in #23, there are two separate issues here. So the test result in comment #20 should be:
Test Result:
1. AC Mode: (Power Adapter)
    Device 1 (5916):
        $ DRI_PRIME=1 glxgears: no corruption issue
        $ DRI_PRIME=1 glxgears -fullscreen: tearing
    Device 2 (5917):
        $ DRI_PRIME=1 glxgears: corruption (!!!)
        $ DRI_PRIME=1 glxgears -fullscreen: tearing
2. DC Mode: (Battery)
    Device 1 (5916):
        $ DRI_PRIME=1 glxgears: no corruption issue
        $ DRI_PRIME=1 glxgears -fullscreen: tearing
    Device 2 (5917):
        $ DRI_PRIME=1 glxgears: no corruption issue
        $ DRI_PRIME=1 glxgears -fullscreen: tearing
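To make the pattern in this matrix easier to see, the eight results can be written down as data and filtered. This is just a transcription of the results above into a Python dict; the key layout and the configs_with helper are illustrative choices, not anything from the test setup.

```python
# Results transcribed from the corrected matrix in this comment.
# Key: (device PCI ID, power source, window mode) -> observed result.
RESULTS = {
    ("5916", "AC", "windowed"): "ok",
    ("5916", "AC", "fullscreen"): "tearing",
    ("5917", "AC", "windowed"): "corruption",
    ("5917", "AC", "fullscreen"): "tearing",
    ("5916", "DC", "windowed"): "ok",
    ("5916", "DC", "fullscreen"): "tearing",
    ("5917", "DC", "windowed"): "ok",
    ("5917", "DC", "fullscreen"): "tearing",
}


def configs_with(result):
    """Return the sorted list of configurations that produced `result`."""
    return sorted(key for key, value in RESULTS.items() if value == result)


print(configs_with("corruption"))  # [('5917', 'AC', 'windowed')]
```

Laid out this way, the corruption shows up in exactly one cell (5917, AC, windowed), while the tearing follows fullscreen mode regardless of device or power source.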
Just some thoughts... The corruption are x-tiled cachelines, looks like stale ones. I don't know what the vertical lines are. It looks to me like the memory is just misbehaving. I wonder if when you plug in the machine if BIOS tries to crank up DDR, or Graphics voltage. Is there some BIOS setting or update to tweak what happens on AC? Is this desktop environment using sprite planes? If the compositor is using all one plane (which I think is likely), and display was at fault, everything would be corrupt. So please see if there is anything that can be done in BIOS to tell it to not behave differently when on AC.
(In reply to Ben Widawsky from comment #34)
> Just some thoughts...
>
> The corruption are x-tiled cachelines, looks like stale ones. I don't know
> what the vertical lines are. It looks to me like the memory is just
> misbehaving. I wonder if when you plug in the machine if BIOS tries to crank
> up DDR, or Graphics voltage. Is there some BIOS setting or update to tweak
> what happens on AC?
>
> Is this desktop environment using sprite planes? If the compositor is using
> all one plane (which I think is likely), and display was at fault,
> everything would be corrupt.
>
>
> So please see if there is anything that can be done in BIOS to tell it to
> not behave differently when on AC.
Hi Ben, I have passed the message on to the ODM. Is there any item in the vbios concerning power?
Based on AMD's comment (the issue is gone without Compiz), I did some tests. Here are the results:
-----------------------------------
1) Run glxgears without a desktop environment
-----------------------------------
a. Log out
b. Switch to tty1
c. Stop lightdm and then start X
    $ service lightdm stop
    $ startx
d. Run glxgears
AC mode:
    $ DRI_PRIME=1 glxgears -info
=> The corruption is gone, but it comes back when I touch the touchpad.
Video: https://goo.gl/8p26HZ
Besides the corruption, the gears sometimes stopped when I touched the touchpad.
DC mode:
    $ DRI_PRIME=1 glxgears -info
=> Pass!!
-----------------------------------
2) Run glxgears with another window manager
-----------------------------------
a. Install TWM
    $ sudo apt-get install twm
    $ sudo apt-get install afterstep
b. Log out and then log in with twm
c. Run glxgears
AC mode:
    $ DRI_PRIME=1 glxgears -info
=> The corruption is gone, but it comes back when I touch the touchpad.
Video: https://goo.gl/8p26HZ
Besides the corruption, the gears sometimes stopped when I touched the touchpad.
DC mode:
    $ DRI_PRIME=1 glxgears -info
=> Pass!!
I wonder if it could be related to the caching attributes of the system memory pages being shared between the GPUs. AFAICT the amdgpu driver should end up treating them as non-cacheable, and correspondingly program the AMD GPU not to participate in cache coherency protocol for them. How does the i915 driver treat the pages of imported buffer objects?
Created attachment 133861 [details] Xorg.0.log (ubuntu 17.04 + modeset)
Created attachment 133862 [details] Xorg.0.log (ubuntu 17.04 + intel&amdgpu)
I installed Ubuntu 17.04 and can reproduce the corruption issue in both AC and DC mode.
Xorg.0.log:
    [ 21.309] (II) LoadModule: "modesetting"
    [ 21.309] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
    [ 21.310] (II) Module modesetting: vendor="X.Org Foundation"
But the issue in DC mode is gone after I add two config files.
/usr/share/X11/xorg.conf.d/20-intel.conf
    Section "Device"
        Identifier "Intel Graphics"
        Driver "Intel"
        Option "DRI" "3"
    EndSection
/usr/share/X11/xorg.conf.d/30-amdgpu.conf
    Section "Device"
        Identifier "AMDGPU"
        Driver "amdgpu"
        Option "DRI" "3"
    EndSection
Cannot reproduce the corruption issue with SNA_POWERSAVE enabled. (The AC adapter has to be re-plugged for the workaround to take effect.)
xserver-xorg-video-intel-2.99.917+git20160325/src/sna/sna_acpi.c
    @@ -123,10 +123,14 @@ void _sna_acpi_wakeup(struct sna *sna)
     	state = atoi(space + 1);
     	DBG(("%s: ac_adapter event new state=%d\n", __FUNCTION__, state));
    +#if 0
     	if (state)
     		sna->flags &= ~SNA_POWERSAVE;
     	else
     		sna->flags |= SNA_POWERSAVE;
    +#endif
    +	DBG(("%s: enable SNA_POWERSAVE\n", __FUNCTION__));
    +	sna->flags |= SNA_POWERSAVE;
     }
Xorg.0.log:
    $ grep -r -i -e "ac_ad" -e "enable SNA" /var/log/Xorg.0.log
    [ 112.044] _sna_acpi_wakeup: event string [41]: 'ac_adapter ACPI0003:00 00000080 00000000
    [ 112.044] _sna_acpi_wakeup: ac_adapter event new state=0
    [ 112.044] _sna_acpi_wakeup: enable SNA_POWERSAVE
    [ 123.003] _sna_acpi_wakeup: event string [41]: 'ac_adapter ACPI0003:00 00000080 00000001
    [ 123.003] _sna_acpi_wakeup: ac_adapter event new state=1
    [ 123.003] _sna_acpi_wakeup: enable SNA_POWERSAVE
(In reply to Ethan Hsieh from comment #40)
> Cannot reproduce corruption issue with SNA_POWERSAVE enabled.
Interesting find. That can explain why the corruption is only reproducible with the power supply connected with xf86-video-intel, but reproducible regardless with the modesetting driver instead.
At this point, I think we really need someone at Intel to look into this in more detail, reassigning.
SNA_POWERSAVE affects a good amount more than what I initially thought.
1. Potentially extra flushing.
2. Prefer blit ring over render ring.
The corruption itself looks like X-tiled cachelines. And the changes in SNA totally affect cacheline behavior; the blitter uses different caches, extra flushing does extra flushes :)
WRT https://bugs.freedesktop.org/show_bug.cgi?id=101691#c37. Not entirely sure what you're getting at. dma_buf core should be handling the coherency between the two parties and it's up to the driver implementation to handle map/begin_cpu_access etc. to handle appropriately. Assuming we're talking about dma_buf.
(In reply to Ben Widawsky from comment #42)
> WRT https://bugs.freedesktop.org/show_bug.cgi?id=101691#c37. Not entirely
> sure what you're getting at. dma_buf core should be handling the coherency
> between the two parties and it's up to the driver implementation to handle
> map/begin_cpu_access etc. to handle appropriately. Assuming we're talking
> about dma_buf.
The buffers are shared via dma_buf, but there shouldn't be any CPU access to the shared buffers. My question is whether the i915 driver programs the GPU to treat the pages of the buffers imported from amdgpu as cacheable or non-cacheable.
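To make the effect of the workaround easier to follow, the event handling can be modelled in a few lines. This is a rough Python sketch of the logic as it appears in the diff and log excerpt in this comment, not a port of the SNA code: the SNA_POWERSAVE bit value is a placeholder, and ac_adapter_state and update_flags are hypothetical helper names.

```python
SNA_POWERSAVE = 0x1  # placeholder bit value; the real constant lives in the SNA sources


def ac_adapter_state(event):
    """Return the trailing state field of an ac_adapter ACPI event, else None.

    Mirrors the `atoi(space + 1)` idea from the diff: the state is the text
    after the last space in the event string.
    """
    fields = event.strip().strip("'").split()
    if not fields or fields[0] != "ac_adapter":
        return None
    return int(fields[-1])


def update_flags(flags, state, force_powersave=False):
    """Stock behaviour clears POWERSAVE on AC; the workaround always sets it."""
    if force_powersave or not state:
        return flags | SNA_POWERSAVE
    return flags & ~SNA_POWERSAVE


# Event strings taken from the Xorg.0.log excerpt in this comment.
for line in ("ac_adapter ACPI0003:00 00000080 00000000",
             "ac_adapter ACPI0003:00 00000080 00000001"):
    state = ac_adapter_state(line)
    stock = update_flags(0, state)
    patched = update_flags(0, state, force_powersave=True)
    print(state, bool(stock & SNA_POWERSAVE), bool(patched & SNA_POWERSAVE))
```

The second event (state=1, adapter plugged in) is the only case where stock and patched behaviour differ, which matches the observation that the flag only changes when an ac_adapter event arrives, hence the need to re-plug the adapter.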
There is no special handling for how the GPU maps dma-buf imported objects. In other words, it will default to LLC caching on KBL. Provided amdgpu is flushing its contents properly though, the LLC should just follow normal coherency snoop protocol and i915 should see the changed data.
(In reply to Ben Widawsky from comment #44)
> Provided amdgpu is flushing its contents properly though, the LLC should
> just follow normal coherency snoop protocol and i915 should see the changed
> data.
Since the pages are non-cacheable as far as the amdgpu driver is concerned, the AMD GPU does not participate in coherency snoop protocol for them AFAIK. It should flush to them "immediately" though.
(In reply to Ben Widawsky from comment #44)
> There is no special handling for how the GPU maps dma-buf imported objects.
> In other words, it will default to LLC caching on KBL.
>
> Provided amdgpu is flushing its contents properly though, the LLC should
> just follow normal coherency snoop protocol and i915 should see the changed
> data.
Hi Ben,
Reviewing the Intel GEM code, I found four cache levels: I915_CACHE_NONE, LLC, L3_LLC and WT. If we choose NONE when creating a buffer, does that mean every operation on the buffer and its results (including the whole render process and the output) are unaffected by cache policy? I ask because I want to determine whether this issue is related to cache policy.
I have checked the SNA power state flag. What the flag affects is only the path taken to handle buffers in user space, and I believe the cache level is configured by the user, i.e. the layer above the sna_accel user-space driver (please correct me if I am wrong). When AC is plugged in, /sys/class/power_supply/AC/online changes from 0 to 1. The SNA module captures this event and changes the power state, meaning sna->flags changes and so does the buffer handling path. At that point the buffer has already been created with its cache policy by the upper-layer user.
Could you please answer the first question, so that we can confirm whether this has anything to do with caching? If my approach is wrong, could you suggest how to rule out the cache's influence? I may be wrong, but I think this is just a user-space driver issue: something may have been missed for the 5917, since the 5916 does not have this issue. Thanks in advance.
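Since several results in this thread depend on whether the machine was on AC or battery, it may help to record that state alongside each test run. A small sketch, assuming the sysfs path mentioned in this comment; the supply name "AC" is what this machine exposes and may differ elsewhere, and on_ac_power is an illustrative helper name.

```python
from pathlib import Path

# "AC" is the supply name used in this thread; other machines may expose a
# different name (an assumption worth checking under /sys/class/power_supply).
DEFAULT_AC_ONLINE = Path("/sys/class/power_supply/AC/online")


def on_ac_power(online_path=DEFAULT_AC_ONLINE):
    """Return True/False from the sysfs `online` file, or None if unreadable."""
    try:
        return Path(online_path).read_text().strip() == "1"
    except OSError:
        return None


print("AC online:", on_ac_power())
```

Returning None rather than False when the file is missing keeps "unknown" distinct from "on battery", which matters when the result is used to label test runs.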
What should happen is when the buffer is imported it will be cached, but when you pin it for display, it should get mapped uncached in the GGTT (or WT if we have eLLC).
Pinning the buffer for display can only happen while the application's window is fullscreen, but may not happen even then, depending on various factors (and ultimately the Intel GPU's Xorg driver).
Created attachment 134890 [details] [review] AMDGPU fix Hey! So I've been seeing this on one of the machines I've got here and it looks like we've traced the problem down to being related to amdgpu and ttm. The attached patch solves the problem on my machine.
Forgot to add alex deucher to this as well; since this seems to be in amdgpu's ballpark now
btw the SNOOPED diff is from me. I think what amdgpu should do is set that for any dma-buf shared (i.e. dma_buf_attach has been called) or imported buffer. We can't set it for all exported buffers because of dma-buf fd sharing within the same device instance. But I didn't figure out how to wire that through ttm, the caching_mode seems to be on the tt, not the bo itself. Longer-term we might need/want to add a dma_buf_get/set_coherency_mode() function so that we don't have to opt for the most defensive thing available. The set_coherency_mode is required because apparently on kbl the "uncached" mode can pull cachelines into the cache and then reuse them.
Reassigning to amdgpu.
Created attachment 134897 [details] [review] Shotgun attempt to stop pulling external images into the L3 (mesa/i965)
Please see comment 37. We support both snooped and unsnooped access to system memory. When we use unsnooped, we always use uncached memory. Does i915 always assume dma_bufs are cached?
So part of the reason I've reassigned to amdgpu is that apparently someone has also seen these corruptions when everything is rendered on amdgpu, and i915 only displays. And the display engine is definitely only accessing memory directly, bypassing cpu caches.
Changing stuff in the kernel on the i915 side didn't help (on import we treat buffers as uncached and assume they're fully flushed). But we kinda missed/didn't get around to testing the mesa caching mode overrides (which seem to be wrong when texturing). So we might need to move the bug to intel mesa (and who knows what happened with the display issue, I guess we'll shrug that off until confirmed with a video).
Aside: When exporting dma-buf from i915 we leave them in whatever caching mode is preferred by that gpu platform, which on recent big core is fully cached (i.e. requires snooping). We still might need to somehow teach dma-buf importers about whether they need to use snooping reads/writes (through a new flag/function/whatever).
The tearing issue seen from comment #33 is actually a single triangle that is being rendered late. Attached a screen shot from Unigine showing the "Bad" triangle. This issue has been fixed when using a later version of the kernel 4.10 as stated in comment #27.
Created attachment 134914 [details] Tearing during fullscreen is really a bad triangle
(In reply to Daniel Vetter from comment #55)
> So parts of the reasons I've reassigned to amdgpu is that apparently someone
> also seen these corruptions when everything is rendered on amdgpu, and i915
> only displays.
We'd need more information about how exactly that test was performed (i.e. details about how exactly "everything is rendered on amdgpu, and i915 only displays") and what the corruption looks like in that case.
Created attachment 134923 [details] [review] Mesa i965 patch to fix corruption issues
So it looks like this patch for mesa from Daniel actually fixes the issue. A very important note here: make -sure- whatever you are using as the compositor (X, gnome-shell on wayland, etc.) is also using the version of mesa with this patch, because otherwise this will not fix your problem. I did this myself by building and installing mesa into a prefix (/home/lyudess/prefixes/mesa) and running:
    ldconfig -v /home/lyudess/prefixes/mesa/lib
You should see all of the GL libraries get directed to your mesa prefix. After that just reboot and try. You might be able to skip the reboot, but I did it just to be safe :)
Created attachment 134929 [details] [review] Shotgun attempt to stop pulling external images into the L3 (mesa/i965) Reposting Chris's patch for this, since I didn't realize it was a later more refined version of danvet's patch when I obsoleted it (and of course, I tested that it actually works :)
The patch in Comment#60 doesn't work for me. I still can reproduce corruption issue with mesa 17.4 + patch in Comment 60. But, tearing issue is gone. 1:17.4~git171104221700.608af05~x~padoka0 https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa
(In reply to Ethan Hsieh from comment #61)
> The patch in Comment#60 doesn't work for me.
> I still can reproduce corruption issue with mesa 17.4 + patch in Comment 60.
> But, tearing issue is gone.
>
> 1:17.4~git171104221700.608af05~x~padoka0
> https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa
Hi, the most recent patch series is actually https://patchwork.freedesktop.org/series/33162/ now, which fixes the problem completely on my system. Also, can I confirm that you made sure that the X server and whatever applications you're testing this with are both using the patched version of mesa?
Here is the command I used for the test. I tested with patched mesa 17.4.
    $ DRI_PRIME=1 glxgears -info
    GL_RENDERER = AMD ICELAND (DRM 3.9.0 / 4.4.0-98-generic, LLVM 6.0.0)
    GL_VERSION = 3.0 Mesa 17.4.0-devel - padoka PPA
    GL_VENDOR = X.Org
I can still reproduce the issue with mesa 17.4 + the latest patch.
Patch: https://patchwork.freedesktop.org/series/33162/
Mesa 17.4 (1:17.4~git171104221700.608af05~x~padoka0): https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa
Test Result:
1. 4.4.0-98 + mesa 17.4: Failed
2. 4.4.0-98 + patched mesa 17.4: Failed
3. 4.14.0-041400rc8 + patched mesa 17.4: Failed
Test Environment: Ubuntu 16.04 (xenial)
Cannot reproduce the issue with Ubuntu 17.10 + patched Mesa 17.4 + Modeset.
Test Result:
    Mesa 17.4 + Intel DDX: Failed
    Mesa 17.4 + Modeset: Failed
    patched Mesa 17.4 + Intel DDX: Failed
    patched Mesa 17.4 + Modeset: Pass
Test Environment:
    Image: ubuntu-17.10-desktop-amd64.iso
    Patch: https://patchwork.freedesktop.org/series/33162/
    Mesa 17.4 (1:17.4~git171107184800.d002950~a~padoka0): https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa
The mesa side of this should be fixed in the following commit:
    commit d7a19d69ebc032ba7207fc97bc6f10d5bb35bb99
    Author: Jason Ekstrand <jason.ekstrand@intel.com>
    Date: Fri Nov 3 15:26:17 2017 -0700

        i965: Use PTE MOCS for all external buffers

        We were already using PTE for all render targets in case one happened to
        get scanned out. However, this still wasn't 100% correct because there
        are still possibly cases where we may want to texture from an external
        buffer even though we don't know the caching mode. This can happen, for
        instance, on buffers imported from another GPU via prime.

        Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101691
        Cc: "17.3" <mesa-stable@lists.freedesktop.org>
        Tested-by: Lyude Paul <lyude@redhat.com>
        Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
According to comment 64 the Mesa issue is resolved. The patches in question are part of Mesa 17.3.0 (while 17.3.1 has also been released). If it were up to me, I'd close this report and open another against the Intel DDX, since the modesetting one works fine.
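For readers skimming the thread, the policy change described in that commit message can be summarised as a small decision function. This is a toy model only, not the i965 code: the Mocs enum, pick_mocs and its parameters are made-up names, and the assumption that non-render-target buffers were previously treated as write-back cached is inferred from the discussion above rather than taken from the Mesa sources.

```python
from enum import Enum


class Mocs(Enum):
    WB = "write-back cached"
    PTE = "follow the page-table entry (kernel decides)"


def pick_mocs(is_render_target, is_external, patched=True):
    """Toy decision function; not the actual i965 code.

    Before the commit: only render targets used PTE.
    After the commit: every external (imported/exported) buffer uses PTE too.
    """
    if is_render_target:
        return Mocs.PTE
    if patched and is_external:
        return Mocs.PTE
    return Mocs.WB


# Texturing from a buffer imported from the other GPU via PRIME:
print(pick_mocs(is_render_target=False, is_external=True, patched=False).name)  # WB
print(pick_mocs(is_render_target=False, is_external=True, patched=True).name)   # PTE
```

The case that changes is exactly the one in this bug: the compositor texturing from a buffer the AMD GPU rendered, where the Intel side cannot know the caching mode and so should not assume its own cached default.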
yep, closing