101691 – gfx corruption on windowed 3d-apps running on dGPU

Summary: gfx corruption on windowed 3d-apps running on dGPU

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2017-07-04 11:35 UTC by Timo Aaltonen
Modified:	2018-03-01 08:41 UTC
CC List:	11 users

See Also:
i915 platform:	KBL
i915 features:	GEM/Other

Attachments
tearing on 5916&5917 (video) (7.73 MB, video/mp4) 2017-08-16 11:14 UTC, Ethan Hsieh
tearing on 5916&5917 (photo) (319.16 KB, image/jpeg) 2017-08-16 11:14 UTC, Ethan Hsieh
kern.log (drm.debug=0xe) (199.45 KB, text/plain) 2017-08-23 09:02 UTC, Ethan Hsieh
Xorg.0.log (42.34 KB, text/plain) 2017-08-23 09:03 UTC, Ethan Hsieh
syslog (1.28 MB, text/plain) 2017-08-23 09:03 UTC, Ethan Hsieh
Xorg.0.log (ubuntu 17.04 + modeset) (38.45 KB, text/plain) 2017-08-29 08:38 UTC, Ethan Hsieh
Xorg.0.log (ubuntu 17.04 + intel&amdgpu) (28.03 KB, text/plain) 2017-08-29 08:39 UTC, Ethan Hsieh
AMDGPU fix (533 bytes, patch) 2017-10-17 20:44 UTC, Lyude Paul
Shotgun attempt to stop pulling external images into the L3 (mesa/i965) (6.07 KB, patch) 2017-10-17 21:42 UTC, Chris Wilson
Tearing during fullscreen is really a bad triangle (2.32 MB, image/jpeg) 2017-10-18 22:08 UTC, Clinton Taylor
Mesa i965 patch to fix corruption issues (1.03 KB, patch) 2017-10-19 17:34 UTC, Lyude Paul
Shotgun attempt to stop pulling external images into the L3 (mesa/i965) (6.07 KB, patch) 2017-10-19 20:39 UTC, Lyude Paul

Description Timo Aaltonen 2017-07-04 11:35:47 UTC

On a hybrid machine with SKL & Topaz with Ubuntu 17.04 (xserver 1.19.3, kernel 4.10, mesa 17.0.3) running apps on the dGPU results in corruption like on this video:

https://aaltoset.kapsi.fi/tmp/MOV_2305.mp4

it goes away when the window is resized/fullscreened. Tested to happen at least with glxgears and glmark2.

Comment 1 Timo Aaltonen 2017-07-04 11:37:43 UTC

forgot to mention that DRI3 is enabled

libGL: pci id for fd 5: 1002:6900, driver radeonsi
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/tls/radeonsi_dri.so
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: Using DRI3 for screen 0
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
305 frames in 5.0 seconds = 60.855 FPS

and updating to git snapshots from a ppa didn't help.

Comment 2 Timo Aaltonen 2017-07-04 11:38:05 UTC

and happens with both intel and modeset X drivers

Comment 3 Timo Aaltonen 2017-07-05 08:31:09 UTC

the original comment was wrong, it's a KBL (5917)+topaz (6900)

another config with KBL (5916) + topaz (6900) does not have issues, so looks like the bug is somewhere in the diff between GT1.5 and GT2

with 4.12 it was still buggy, and git mesa was tested too

Comment 4 Timo Aaltonen 2017-07-05 08:32:28 UTC

current nightly tested as well, still the same

Comment 5 Ben Widawsky 2017-07-12 16:18:35 UTC

Interesting. Mesa apparently doesn't even know 5916 is a production thing.

Timo, to help narrow down the issue, can you modify mesa to treat this part like the other one and see if the issue goes away?

diff --git a/include/pci_ids/i965_pci_ids.h b/include/pci_ids/i965_pci_ids.h
index 57e70b7aed..f60ac1ba60 100644
--- a/include/pci_ids/i965_pci_ids.h
+++ b/include/pci_ids/i965_pci_ids.h
@@ -153,7 +153,7 @@ CHIPSET(0x5913, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
 CHIPSET(0x5915, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
 CHIPSET(0x5917, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
 CHIPSET(0x5912, kbl_gt2, "Intel(R) HD Graphics 630 (Kaby Lake GT2)")
-CHIPSET(0x5916, kbl_gt2, "Intel(R) HD Graphics 620 (Kaby Lake GT2)")
+CHIPSET(0x5916, gt1_5, "Intel(R) HD Graphics 620 (Kaby Lake GT2)")
 CHIPSET(0x591A, kbl_gt2, "Intel(R) HD Graphics P630 (Kaby Lake GT2)")
 CHIPSET(0x591B, kbl_gt2, "Intel(R) HD Graphics 630 (Kaby Lake GT2)")
 CHIPSET(0x591D, kbl_gt2, "Intel(R) HD Graphics P630 (Kaby Lake GT2)")

Comment 6 Ben Widawsky 2017-07-12 16:22:03 UTC

Oops, I had a typo. should have been kbl_gt1_5

Comment 7 Rodrigo Vivi 2017-07-13 20:12:11 UTC

Actually it is the other way around.

0x5917 is the failing one. Spec tells it is KBL-R GT2, not gt1_5.

So maybe
- CHIPSET(0x5917, kbl_gt1_5, "Intel(R) Kabylake GT1.5")
+ CHIPSET(0x5917, kbl_gt2, "Intel(R) Kabylake GT2")

helps...
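
For readers unfamiliar with the `CHIPSET(id, family, name)` lines in the diffs above: they are entries in an X-macro table, where the same list is expanded several times with different definitions of the macro. The following is only a minimal sketch of that pattern. The three-entry table `DEMO_CHIPSETS` and the `demo_*` helpers are hypothetical names made up for illustration and are not Mesa's actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the PCI ID header: one CHIPSET() entry per device. */
#define DEMO_CHIPSETS(CHIPSET) \
    CHIPSET(0x5912, kbl_gt2, "Intel(R) HD Graphics 630 (Kaby Lake GT2)") \
    CHIPSET(0x5916, kbl_gt2, "Intel(R) HD Graphics 620 (Kaby Lake GT2)") \
    CHIPSET(0x5917, kbl_gt1_5, "Intel(R) Kabylake GT1.5")

/* Expand each entry into a switch case that returns the marketing name. */
static const char *demo_lookup_name(uint16_t pci_id)
{
    switch (pci_id) {
#define AS_NAME_CASE(id, family, name) case id: return name;
    DEMO_CHIPSETS(AS_NAME_CASE)
#undef AS_NAME_CASE
    default:
        return NULL; /* unknown device */
    }
}

/* Expand the same table again, this time returning the family token as text. */
static const char *demo_lookup_family(uint16_t pci_id)
{
    switch (pci_id) {
#define AS_FAMILY_CASE(id, family, name) case id: return #family;
    DEMO_CHIPSETS(AS_FAMILY_CASE)
#undef AS_FAMILY_CASE
    default:
        return NULL;
    }
}

/* True if the table files this PCI ID under the GT2 family. */
static bool demo_is_gt2(uint16_t pci_id)
{
    const char *family = demo_lookup_family(pci_id);
    return family != NULL && strcmp(family, "kbl_gt2") == 0;
}
```

With this layout, moving 0x5917 from `kbl_gt1_5` to `kbl_gt2` is a one-token change in the table, which is why the proposed fixes in this thread are single-line diffs.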

Comment 8 Elizabeth 2017-07-13 22:32:47 UTC

Changing to NEEDINFO. Please mark as REOPEN once the new information is provided.

Comment 9 Matt Turner 2017-07-13 23:08:31 UTC

The kernel's PCI ID table (include/drm/i915_pciids.h) should be updated too, if it is indeed a GT2. It's currently in the GT1 list.

FWIW, our current guess is that the PCI ID corresponded to some internal-only part (a GT1.5, whatever that is), and has now been repurposed as a shipping GT2 SKU.

Comment 10 Ethan Hsieh 2017-07-18 08:15:55 UTC

I can still reproduce this issue with the modified PCI ID table (mesa and kernel).

Comment 11 Timo Aaltonen 2017-07-19 08:27:58 UTC

Hang on, the testing was probably done with wrong packages, need to verify

Comment 12 Timo Aaltonen 2017-07-24 11:14:12 UTC

re-testing proves that changing 5917 to be GT2 fixes this issue, and in fact just updating the kernel was enough to fix it

Comment 13 Timo Aaltonen 2017-07-25 11:44:49 UTC

testing is still ongoing with mixed results; at this point I'm not sure what's working and what's not, but since the pci-id is in the wrong group maybe you can start pushing the fix upstream to the kernel/libdrm/mesa..

Comment 14 Ben Widawsky 2017-07-25 14:59:07 UTC

Timo what does mixed results mean? Are we back to the original statement of one thing works, and one doesn't?

Comment 15 Timo Aaltonen 2017-07-25 19:42:57 UTC

well it looks like mesa plays a part after all, and the machine also needs amdgpu dkms so it makes kernel testing a bit more fragile.. but now I have the whole stack with the fixed pci-id build on a PPA so I hope this will finally be a solid set for testing at least..

Comment 16 Timo Aaltonen 2017-07-26 09:38:41 UTC

Ok, so it turns out the pci-id change does _not_ help..

False positive test results were caused by the fact that the corruption sometimes needs 5-10s to show, or that the renderer was Intel instead of amdgpu (because the dkms failed to build or whatever)..
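
One way to guard against this kind of false positive is to check the reported renderer string before trusting a test run. This is a minimal sketch, assuming the string is taken from the `GL_RENDERER` line printed by `glxgears -info` or `glxinfo`. The function name and the substrings it matches are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if a GL_RENDERER string looks like it came from the AMD dGPU.
 * The matched substrings are assumptions, not an exhaustive list. */
static bool renderer_is_amd(const char *renderer)
{
    if (renderer == NULL)
        return false;
    return strstr(renderer, "AMD") != NULL ||
           strstr(renderer, "Radeon") != NULL;
}
```

A test script could refuse to record a PASS unless this check succeeds, so that a run which silently fell back to the Intel GPU is not counted.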

Comment 17 Alex_Zhang1 2017-07-28 09:34:15 UTC

@Intel
  We verified this issue with the new Intel driver on 0x5916 and can also reproduce the bug there, so it seems to have been introduced by a driver update. Please help find a solution; we are short on time. Thanks so much!

Comment 18 Timo Aaltonen 2017-07-29 06:56:33 UTC

Turns out that setting amdgpu.dpm=0 fixes the corruption..

Comment 19 Alex_Zhang1 2017-07-31 03:20:35 UTC

(In reply to Timo Aaltonen from comment #18)
> Turns out that setting amdgpu.dpm=0 fixes the corruption..

But that does not explain why the issue cannot be reproduced with the old Intel driver while it can with the new one. I assume you drew this conclusion by testing on the same machine, with the same AMD video card and the same AMD video card driver. Is that right?

Comment 20 Ethan Hsieh 2017-08-16 11:03:20 UTC

amdgpu.dpm=0 doesn't really work.
There are some issues when loading amdgpu driver with dpm=0.
It leads to using Intel as OpenGL renderer (glxinfo) instead of AMD.
Error Message: Xorg.0.log:[ 24.261] (II) UnloadModule: "amdgpu"
I got two patches from AMD to fix the issue.
After applying patches, tearing issue is back.

Here are some new findings:
The issue can be reproduced on 5916&5917.
But, I only see the issue on 5916 when running glxgears with fullscreen.
When unplugging power adapter and running glxgears with default window size, the issue on 5917 is gone.

Test Environment:
Device 1 (5916):
a. Graphics Cards: Intel [8086:5916] + AMD [1002:6900]
b. Kernel: 4.4.0-45
Device 2 (5917):
a. Graphics Cards: Intel [8086:5917] + AMD [1002:6900]
b. Kernel: 4.4.0-73

Test Result:
1. AC Mode: (Power Adapter)
Device 1 (5916):
$ DRI_PRIME=1 glxgears: PASS
$ DRI_PRIME=1 glxgears -fullscreen: tearing
Device 2 (5917):
$ DRI_PRIME=1 glxgears: tearing
$ DRI_PRIME=1 glxgears -fullscreen: tearing

2. DC Mode: (Battery)
Device 1 (5916):
$ DRI_PRIME=1 glxgears: PASS
$ DRI_PRIME=1 glxgears -fullscreen: tearing
Device 2 (5917):
$ DRI_PRIME=1 glxgears: PASS
$ DRI_PRIME=1 glxgears -fullscreen: tearing

Comment 21 Ethan Hsieh 2017-08-16 11:14:28 UTC

Created attachment 133542
tearing on 5916&5917 (video)

Comment 22 Ethan Hsieh 2017-08-16 11:14:52 UTC

Created attachment 133543
tearing on 5916&5917 (photo)

Comment 23 Michel Dänzer 2017-08-16 12:58:44 UTC

Looks like there are two separate issues here — the corruption shown in Timo's video in the original bug description, and the tearing shown in Ethan's video and photo. They need to be tracked separately.

Comment 24 Ethan Hsieh 2017-08-17 01:51:26 UTC

The tearing issue which I saw when running glxgears with the default window size is the same as the corruption shown in Timo's video.
The photo and video I uploaded were taken when running glxgears in fullscreen.

Comment 25 Michel Dänzer 2017-08-17 02:00:54 UTC

(In reply to Ethan Hsieh from comment #24)
> The tearing issue which I saw when running glxgear with default window size
> is same as the corruption shown in Timo's video.

That's corruption, not tearing.

> The photo and video I uploaded were taken when running glxgear with full
> screen.

That's tearing, not corruption.

They're two separate issues that need to be tracked separately.

Comment 26 Marek Olšák 2017-08-18 15:48:45 UTC

So can anybody still reproduce the corruption as shown in Timo's video or not?

I'm NOT asking about the tearing in Ethan's video.

Comment 27 Timo Aaltonen 2017-08-18 17:36:59 UTC

I don't have the hw to test, but what I know by now is that

- the tearing is gone at least with kernel 4.10
- corruption happens still, but _only_ when the charger is attached, meaning that when the system is on battery the corruption is gone, and appears again when charger is plugged in..

Comment 28 qwang13 2017-08-23 07:49:34 UTC

hi @Timo Aaltonen

Again, would you create another bug to track the tearing issue separately, as Michel said in comment #23?

Comment 29 Michel Dänzer 2017-08-23 08:37:09 UTC

(In reply to qwang13 from comment #28)
> Again, Would you create another bug to separately track tearing issue as
> Michel comment#23 said?

Per comment 27, the tearing is already fixed with current upstream versions, so there's no point in creating another report here for that.

Comment 30 Ethan Hsieh 2017-08-23 09:02:52 UTC

Created attachment 133710
kern.log (drm.debug=0xe)

Here are the timestamps for kern.log:

--- DC mode ---
1. timestamp: 16:49
$ DRI_PRIME=1 glxgears
2. timestamp: 16:50
Stop glxgears
3. timestamp: 16:51
plug power adaptor in
--- AC mode ---
4. timestamp: 16:52
$ DRI_PRIME=1 glxgears
===> corruption!!! <===
5. timestamp: 16:53
Stop glxgears
6. timestamp: 16:54
$ glxgears
7. timestamp: 16:55
Stop glxgears

Comment 31 Ethan Hsieh 2017-08-23 09:03:12 UTC

Created attachment 133711
Xorg.0.log

Comment 32 Ethan Hsieh 2017-08-23 09:03:34 UTC

Created attachment 133712
syslog

Comment 33 Ethan Hsieh 2017-08-24 03:49:11 UTC

Based on Michel's comment in #23, there are two separate issues here.
So, the test result in comment#20 should be

Test Result:
1. AC Mode: (Power Adapter)
Device 1 (5916):
$ DRI_PRIME=1 glxgears: no corruption issue
$ DRI_PRIME=1 glxgears -fullscreen: tearing
Device 2 (5917):
$ DRI_PRIME=1 glxgears: corruption (!!!)
$ DRI_PRIME=1 glxgears -fullscreen: tearing

2. DC Mode: (Battery)
Device 1 (5916):
$ DRI_PRIME=1 glxgears: no corruption issue
$ DRI_PRIME=1 glxgears -fullscreen: tearing
Device 2 (5917):
$ DRI_PRIME=1 glxgears: no corruption issue
$ DRI_PRIME=1 glxgears -fullscreen: tearing

Comment 34 Ben Widawsky 2017-08-25 02:46:49 UTC

Just some thoughts...

The corruption looks like X-tiled cachelines, apparently stale ones. I don't know what the vertical lines are. It looks to me like the memory is just misbehaving. I wonder whether, when you plug in the machine, the BIOS tries to crank up the DDR or graphics voltage. Is there some BIOS setting or update to tweak what happens on AC?

Is this desktop environment using sprite planes? If the compositor is using all one plane (which I think is likely), and display was at fault, everything would be corrupt.


So please see if there is anything that can be done in BIOS to tell it to not behave differently when on AC.
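
To illustrate why stale cachelines in an X-tiled buffer would show up as short horizontal strips, here is a rough sketch of the address arithmetic. It assumes the commonly described X-tile layout (512 bytes wide, 8 rows high, 4 KiB per tile), 64-byte cachelines, a pitch that is a multiple of the tile width, and no address swizzling. The names are hypothetical and none of this is taken from the driver source.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    XTILE_WIDTH_BYTES = 512, /* bytes per tile row (assumed X-tile layout) */
    XTILE_HEIGHT_ROWS = 8,   /* rows per tile */
    XTILE_SIZE_BYTES  = XTILE_WIDTH_BYTES * XTILE_HEIGHT_ROWS, /* 4096 */
    CACHELINE_BYTES   = 64
};

/* Byte offset of pixel (x, y) in an X-tiled surface with cpp bytes per pixel.
 * pitch_bytes must be a multiple of XTILE_WIDTH_BYTES; swizzling is ignored. */
static size_t xtile_offset(uint32_t x, uint32_t y, uint32_t cpp,
                           uint32_t pitch_bytes)
{
    uint32_t x_bytes = x * cpp;
    uint32_t tiles_per_row = pitch_bytes / XTILE_WIDTH_BYTES;
    size_t tile_index = (size_t)(y / XTILE_HEIGHT_ROWS) * tiles_per_row
                      + x_bytes / XTILE_WIDTH_BYTES;

    return tile_index * XTILE_SIZE_BYTES
         + (size_t)(y % XTILE_HEIGHT_ROWS) * XTILE_WIDTH_BYTES
         + x_bytes % XTILE_WIDTH_BYTES;
}

/* Pixels covered by one cacheline: a horizontal run within a single row. */
static uint32_t pixels_per_cacheline(uint32_t cpp)
{
    return CACHELINE_BYTES / cpp;
}
```

Under these assumptions a single stale cacheline in a 32 bpp buffer covers 16 horizontally adjacent pixels on one scanline, which would match corruption that appears as small horizontal dashes.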

Comment 35 Alex_Zhang1 2017-08-25 08:48:55 UTC

(In reply to Ben Widawsky from comment #34)
> Just some thoughts...
> 
> The corruption are x-tiled cachelines, looks like stale ones. I don't know
> what the vertical lines are. It looks to me like the memory is just
> misbehaving. I wonder if when you plug in the machine if BIOS tries to crank
> up DDR, or Graphics voltage. Is there some BIOS setting or update to tweak
> what happens on AC?
> 
> Is this desktop environment using sprite planes? If the compositor is using
> all one plane (which I think is likely), and display was at fault,
> everything would be corrupt.
> 
> 
> So please see if there is anything that can be done in BIOS to tell it to
> not behave differently when on AC.

Hi Ben,
  I have passed the message on to the ODM. Is there any setting in the VBIOS concerning power?

Comment 36 Ethan Hsieh 2017-08-25 09:33:07 UTC

Based on AMD's comment (The issue is gone without Compiz), I did some tests.

Here is the test result:
-----------------------------------
1) Run glxgears without a desktop environment
-----------------------------------
a. Logout
b. Switch to tty1
c. Stop lightdm and then start x
$ service lightdm stop
$ startx
d. Run glxgears

AC mode:
$ DRI_PRIME=1 glxgears -info
=> corruption issue is gone, but it is back when I touch touchpad.
Video: https://goo.gl/8p26HZ
Besides the corruption issue, sometimes the gear stopped when I touched the touchpad.

DC mode:
$ DRI_PRIME=1 glxgears -info
=> Pass!!
-----------------------------------
2) Run glxgears with other window manager
-----------------------------------
a. Install TWM
$ sudo apt-get install twm
$ sudo apt-get install afterstep
b. Logout and then login with twm
c. Run glxgears

AC mode:
$ DRI_PRIME=1 glxgears -info
=> corruption issue is gone, but it is back when I touch touchpad.
Video: https://goo.gl/8p26HZ
Besides the corruption issue, sometimes the gear stopped when I touched the touchpad.

DC mode:
$ DRI_PRIME=1 glxgears -info
=> Pass!!

Comment 37 Michel Dänzer 2017-08-28 08:41:52 UTC

I wonder if it could be related to the caching attributes of the system memory pages being shared between the GPUs. AFAICT the amdgpu driver should end up treating them as non-cacheable, and correspondingly program the AMD GPU not to participate in cache coherency protocol for them. How does the i915 driver treat the pages of imported buffer objects?

Comment 38 Ethan Hsieh 2017-08-29 08:38:54 UTC

Created attachment 133861
Xorg.0.log (ubuntu 17.04 + modeset)

Comment 39 Ethan Hsieh 2017-08-29 08:39:47 UTC

Created attachment 133862
Xorg.0.log (ubuntu 17.04 + intel&amdgpu)

I installed Ubuntu 17.04 and can reproduce the corruption issue in both AC and DC mode.
Xorg.0.log:
[    21.309] (II) LoadModule: "modesetting"
[    21.309] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[    21.310] (II) Module modesetting: vendor="X.Org Foundation"

But, the issue in DC mode is gone after I add two config files.

/usr/share/X11/xorg.conf.d/20-intel.conf 
Section "Device"
	Identifier "Intel Graphics"
	Driver "Intel"
	Option "DRI" "3"
EndSection

/usr/share/X11/xorg.conf.d/30-amdgpu.conf 
Section "Device"
	Identifier "AMDGPU"
	Driver "amdgpu"
	Option "DRI" "3"
EndSection

Comment 40 Ethan Hsieh 2017-08-30 07:08:29 UTC

Cannot reproduce corruption issue with SNA_POWERSAVE enabled.

(Have to re-plug AC adapter to make the workaround work)
xserver-xorg-video-intel-2.99.917+git20160325/src/sna/sna_acpi.c
@@ -123,10 +123,14 @@ void _sna_acpi_wakeup(struct sna *sna)
     state = atoi(space + 1);

    DBG(("%s: ac_adapter event new state=%d\n", __FUNCTION__, state));
+#if 0
    if (state)
     sna->flags &= ~SNA_POWERSAVE;
    else
     sna->flags |= SNA_POWERSAVE;
+#endif
+	DBG(("%s: enable SNA_POWERSAVE\n", __FUNCTION__));
+	sna->flags |= SNA_POWERSAVE;
   }

Xorg.0.log:
$ grep -r -i -e "ac_ad" -e "enable SNA" /var/log/Xorg.0.log
[ 112.044] _sna_acpi_wakeup: event string [41]: 'ac_adapter ACPI0003:00 00000080 00000000
[ 112.044] _sna_acpi_wakeup: ac_adapter event new state=0
[ 112.044] _sna_acpi_wakeup: enable SNA_POWERSAVE
[ 123.003] _sna_acpi_wakeup: event string [41]: 'ac_adapter ACPI0003:00 00000080 00000001
[ 123.003] _sna_acpi_wakeup: ac_adapter event new state=1
[ 123.003] _sna_acpi_wakeup: enable SNA_POWERSAVE
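
The behaviour change in the diff above can be summarised in a small stand-alone sketch. It assumes the state is the last space-separated field of the event string, as the logged `ac_adapter ACPI0003:00 00000080 00000001` lines and the `atoi(space + 1)` call suggest. `DEMO_POWERSAVE` and the function names are hypothetical; the real `SNA_POWERSAVE` value is not shown in this report.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_POWERSAVE 0x1u /* hypothetical bit, stands in for SNA_POWERSAVE */

/* Return the trailing field of an ACPI event string as an integer
 * (1 = on AC, 0 = on battery), or -1 if it is not an ac_adapter event. */
static int ac_adapter_state(const char *event)
{
    const char *space;

    if (event == NULL || strncmp(event, "ac_adapter", 10) != 0)
        return -1;
    space = strrchr(event, ' ');
    if (space == NULL)
        return -1;
    return atoi(space + 1);
}

/* Original behaviour from the diff: power saving only while on battery. */
static unsigned flags_original(unsigned flags, int state)
{
    return state ? (flags & ~DEMO_POWERSAVE) : (flags | DEMO_POWERSAVE);
}

/* Workaround from the diff: power saving regardless of the AC state. */
static unsigned flags_workaround(unsigned flags, int state)
{
    (void)state; /* the AC state is deliberately ignored */
    return flags | DEMO_POWERSAVE;
}
```

With the workaround, both logged events (state 0 and state 1) leave the power-save bit set, which matches the "enable SNA_POWERSAVE" lines in the log.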

Comment 41 Michel Dänzer 2017-08-30 07:35:02 UTC

(In reply to Ethan Hsieh from comment #40)
> Cannot reproduce corruption issue with SNA_POWERSAVE enabled.

Interesting find. That can explain why the corruption is only reproducible with the power supply connected with xf86-video-intel, but reproducible regardless with the modesetting driver instead.

At this point, I think we really need someone at Intel to look into this in more detail, reassigning.

Comment 42 Ben Widawsky 2017-08-30 20:53:57 UTC

SNA_POWERSAVE affects a good amount more than what I initially thought.
1. Potentially extra flushing.
2. Prefer blit ring over render ring.

Corruption itself looks like X-tiled cachelines. And the changes in SNA totally affect cacheline behavior; blitter uses different caches, extra flushing does extra flushes :)

WRT https://bugs.freedesktop.org/show_bug.cgi?id=101691#c37. Not entirely sure what you're getting at. dma_buf core should be handling the coherency between the two parties and it's up to the driver implementation to handle map/begin_cpu_access etc. appropriately. Assuming we're talking about dma_buf.

Comment 43 Michel Dänzer 2017-08-31 01:10:24 UTC

(In reply to Ben Widawsky from comment #42)
> WRT https://bugs.freedesktop.org/show_bug.cgi?id=101691#c37. Not entirely
> sure what you're getting at. dma_buf core should be handling the coherency
> between the two parties and it's up to the driver implementation to handle
> map/begin_cpu_access etc. to handle appropriately. Assuming we're talking
> about dma_buf.

The buffers are shared via dma_buf, but there shouldn't be any CPU access to the shared buffers.

My question is whether the i915 driver programs the GPU to treat the pages of the buffers imported from amdgpu as cacheable or non-cacheable.

Comment 44 Ben Widawsky 2017-09-07 22:50:24 UTC

There is no special handling for how the GPU maps dma-buf imported objects. In other words, it will default to LLC caching on KBL.

Provided amdgpu is flushing its contents properly though, the LLC should just follow normal coherency snoop protocol and i915 should see the changed data.

Comment 45 Michel Dänzer 2017-09-08 03:52:27 UTC

(In reply to Ben Widawsky from comment #44)
> Provided amdgpu is flushing its contents properly though, the LLC should
> just follow normal coherency snoop protocol and i915 should see the changed
> data.

Since the pages are non-cacheable as far as the amdgpu driver is concerned, the AMD GPU does not participate in coherency snoop protocol for them AFAIK. It should flush to them "immediately" though.

Comment 46 Alex_Zhang1 2017-09-08 12:09:19 UTC

(In reply to Ben Widawsky from comment #44)
> There is no special handling for how the GPU maps dma-buf imported objects.
> In other words, it will default to LLC caching on KBL.
> 
> Provided amdgpu is flushing its contents properly though, the LLC should
> just follow normal coherency snoop protocol and i915 should see the changed
> data.

Hi Ben,
  From reviewing the Intel GEM code, I found there are four cache levels: I915_CACHE_NONE, LLC, L3_LLC, and WT. If we choose NONE when creating a buffer, does that mean every operation on that buffer and its result (including the whole render process and the output) is unaffected by cache policy?
  I ask because I want to establish whether this issue is related to cache policy. I have checked the SNA power state flag; all it affects is the code path used to handle buffers in user space, and I personally believe the cache level is configured by the user, i.e. the layer above the sna_accel user-space driver (please correct me if I am wrong).
  When AC is plugged in, /sys/class/power_supply/AC/online changes from 0 to 1. The SNA module captures this event and changes its power state, meaning sna->flags changes and with it the buffer handling path. At that moment, the buffer has already been created with the proper cache policy by the upper-layer user.
  Could you please answer the first question, so we can confirm whether this has anything to do with caching? If my approach is wrong, could you suggest how to rule out the influence of caching?
  I may be wrong, but I think this is just a user-space driver issue: they might have missed some consideration for the 5917 video card, since the 5916 does not have this issue.
  Thanks in advance.
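
The AC state mentioned above can be read directly from sysfs when reproducing the issue. This is a minimal sketch, assuming the attribute lives at `/sys/class/power_supply/AC/online` as stated in the comment (the supply name may differ on other machines). The function names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Parse the contents of an "online" attribute: a non-zero number means AC
 * power is connected, 0 means battery. Returns -1 if no number is found. */
static int parse_online(const char *text)
{
    int value;

    if (text == NULL || sscanf(text, "%d", &value) != 1)
        return -1;
    return value != 0;
}

/* Read the attribute from a file such as /sys/class/power_supply/AC/online.
 * Returns 1 (AC connected), 0 (on battery), or -1 on error. */
static int read_ac_online(const char *path)
{
    char buf[16];
    int result = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) != NULL)
        result = parse_online(buf);
    fclose(f);
    return result;
}
```

Logging this value next to each test result would make it easier to correlate the corruption with the AC/DC transitions described in this thread.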

Comment 47 Ben Widawsky 2017-09-14 17:28:54 UTC

What should happen is when the buffer is imported it will be cached, but when you pin it for display, it should get mapped uncached in the GGTT (or WT if we have eLLC).

Comment 48 Michel Dänzer 2017-09-15 10:51:18 UTC

Pinning the buffer for display can only happen while the application's window is fullscreen, but may not happen even then, depending on various factors (and ultimately the Intel GPU's Xorg driver).

Comment 49 Lyude Paul 2017-10-17 20:44:21 UTC

Created attachment 134890
AMDGPU fix

Hey! So I've been seeing this on one of the machines I've got here and it looks like we've traced the problem down to being related to amdgpu and ttm. The attached patch solves the problem on my machine.

Comment 50 Lyude Paul 2017-10-17 20:51:22 UTC

Forgot to add alex deucher to this as well; since this seems to be in amdgpu's ballpark now

Comment 51 Daniel Vetter 2017-10-17 20:52:07 UTC

btw the SNOOPED diff is from me. I think what amdgpu should do is set that for any dma-buf that is shared (i.e. dma_buf_attach has been called) or imported. We can't set it for all exported buffers because of dma-buf fd sharing within the same device instance.

But I didn't figure out how to wire that through ttm, the caching_mode seems to be on the tt, not the bo itself.

Longer-term we might need/want to add a dma_buf_get/set_coherency_mode() function so that we don't have to opt for the most defensive thing available. The set_coherency_mode is required because apparently on kbl the "uncached" mode can pull cachelines into the cache and then reuse them.

Comment 52 Daniel Vetter 2017-10-17 20:52:47 UTC

Reassigning to amdgpu.

Comment 53 Chris Wilson 2017-10-17 21:42:38 UTC

Created attachment 134897
Shotgun attempt to stop pulling external images into the L3 (mesa/i965)

Comment 54 Alex Deucher 2017-10-18 01:47:52 UTC

Please see comment 37.  We support both snooped and unsnooped access to system memory.  When we use unsnooped, we always use uncached memory.  Does i915 always assume dma_bufs are cached?

Comment 55 Daniel Vetter 2017-10-18 19:19:56 UTC

So part of the reason I've reassigned to amdgpu is that apparently someone has also seen these corruptions when everything is rendered on amdgpu, and i915 only displays. And the display engine is definitely only accessing memory directly, bypassing cpu caches.

Changing stuff on the kernel on the i915 side didn't help (on import we treat buffers as uncached and assume they're fully flushed). But we kinda missed/didn't get around to testing the mesa caching mode overrides (which seem to be wrong when texturing). So might need to move the bug to intel mesa (and who knows what happened with the display issue, I guess we'll shrug that off until confirmed with a video).

Aside: When exporting dma-buf from i915 we leave them in whatever caching mode is preferred by that gpu platform, which on recent big core is fully cached (i.e. requires snooping). We still might need to somehow teach dma-buf importers about whether they need to use snooping reads/writes (through a new flag/function/whatever).

Comment 56 Clinton Taylor 2017-10-18 22:07:30 UTC

The tearing issue seen from comment #33 is actually a single triangle that is being rendered late. Attached a screen shot from Unigine showing the "Bad" triangle.

This issue is fixed when using a later kernel (4.10), as stated in comment #27.

Comment 57 Clinton Taylor 2017-10-18 22:08:27 UTC

Created attachment 134914
Tearing during fullscreen is really a bad triangle

Comment 58 Michel Dänzer 2017-10-19 08:39:35 UTC

(In reply to Daniel Vetter from comment #55)
> So parts of the reasons I've reassigned to amdgpu is that apparently someone
> also seen these corruptions when everything is rendered on amdgpu, and i915
> only displays.

We'd need more information about how exactly that test was performed (i.e. details about how exactly "everything is rendered on amdgpu, and i915 only displays") and what the corruption looks like in that case.

Comment 59 Lyude Paul 2017-10-19 17:34:32 UTC

Created attachment 134923
Mesa i965 patch to fix corruption issues

So it looks like this patch for mesa from Daniel actually fixes the issue. A very important note here: make -sure- whatever you are using as the compositor (X, gnome-shell on wayland, etc.) is also using the version of mesa with this patch, because otherwise this will not fix your problem.

I did this myself by building and installing mesa into a prefix (/home/lyudess/prefixes/mesa) and running:

ldconfig -v /home/lyudess/prefixes/mesa/lib

You should see all of the GL libraries get directed to your mesa prefix. After that just reboot and try. You might be able to skip the reboot, but I did it just to be safe :)

Comment 60 Lyude Paul 2017-10-19 20:39:22 UTC

Created attachment 134929
Shotgun attempt to stop pulling external images into the L3 (mesa/i965)

Reposting Chris's patch for this, since I didn't realize it was a later more refined version of danvet's patch when I obsoleted it (and of course, I tested that it actually works :)

Comment 61 Ethan Hsieh 2017-11-06 10:53:46 UTC

The patch in Comment#60 doesn't work for me.
I still can reproduce corruption issue with mesa 17.4 + patch in Comment 60.
But, tearing issue is gone.

1:17.4~git171104221700.608af05~x~padoka0
https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa

Comment 62 Lyude Paul 2017-11-06 16:47:41 UTC

(In reply to Ethan Hsieh from comment #61)
> The patch in Comment#60 doesn't work for me.
> I still can reproduce corruption issue with mesa 17.4 + patch in Comment 60.
> But, tearing issue is gone.
> 
> 1:17.4~git171104221700.608af05~x~padoka0
> https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa

Hi, the most recent patch series is actually https://patchwork.freedesktop.org/series/33162/ now, which fixes the problem completely on my system. Also, can I confirm that you made sure that the X server and whatever applications you're testing this with are both using the patched version of mesa?

Comment 63 Ethan Hsieh 2017-11-07 06:32:01 UTC

Here is the command I used for test.
I tested with patched mesa 17.4.

$ DRI_PRIME=1 glxgears -info
GL_RENDERER   = AMD ICELAND (DRM 3.9.0 / 4.4.0-98-generic, LLVM 6.0.0)
GL_VERSION    = 3.0 Mesa 17.4.0-devel - padoka PPA
GL_VENDOR     = X.Org

I still can reproduce the issue with mesa 17.4 + latest patch.
Patch: https://patchwork.freedesktop.org/series/33162/ 
Mesa 17.4 (1:17.4~git171104221700.608af05~x~padoka0):
https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa

Test Result:
1. 4.4.0-98 + mesa 17.4: Failed
2. 4.4.0-98 + patched mesa 17.4: Failed 
3. 4.14.0-041400rc8 + patched mesa 17.4: Failed 

Test Environment:
Ubuntu 16.04 (xenial)

Comment 64 Ethan Hsieh 2017-11-08 07:47:08 UTC

Cannot reproduce the issue with Ubuntu 17.10 + patched Mesa 17.4 + Modeset

Test Result:
Mesa 17.4 + Intel DDX: Failed
Mesa 17.4 + Modeset: Failed
patched Mesa 17.4 + Intel DDX: Failed
patched Mesa 17.4 + Modeset: Pass

Test Environment:
Image: ubuntu-17.10-desktop-amd64.iso
Patch: https://patchwork.freedesktop.org/series/33162/
Mesa 17.4 (1:17.4~git171107184800.d002950~a~padoka0):
https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa

Comment 65 Jason Ekstrand 2017-11-14 21:56:02 UTC

The mesa side of this should be fixed in the following commit:

commit d7a19d69ebc032ba7207fc97bc6f10d5bb35bb99
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Nov 3 15:26:17 2017 -0700

    i965: Use PTE MOCS for all external buffers
    
    We were already using PTE for all render targets in case one happened to
    get scanned out.  However, this still wasn't 100% correct because there
    are still possibly cases where we may want to texture from an external
    buffer even though we don't know the caching mode.  This can happen, for
    instance, on buffers imported from another GPU via prime.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101691
    Cc: "17.3" <mesa-stable@lists.freedesktop.org>
    Tested-by: Lyude Paul <lyude@redhat.com>
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
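
The policy change described in this commit message can be pictured with a small sketch. It is only an illustration of the decision, assuming "PTE" means deferring to the caching mode the kernel set in the page table entries. The enum, struct and function names are hypothetical and are not Mesa's real MOCS encodings or data structures.

```c
#include <stdbool.h>

/* Hypothetical cache-control choices, not real MOCS table values. */
enum demo_mocs {
    DEMO_MOCS_WB, /* fully cached: only safe when the caching mode is known */
    DEMO_MOCS_PTE /* defer to the caching mode set in the page tables */
};

struct demo_bo {
    bool external; /* shared with another process or device, e.g. via prime */
};

/* Earlier policy as described in the message: PTE for render targets only,
 * so texturing from an imported buffer could still use cached accesses. */
static enum demo_mocs demo_pick_mocs_old(const struct demo_bo *bo,
                                         bool is_render_target)
{
    (void)bo; /* buffer origin was not considered */
    return is_render_target ? DEMO_MOCS_PTE : DEMO_MOCS_WB;
}

/* Policy after the commit: any external buffer uses PTE, whatever its use. */
static enum demo_mocs demo_pick_mocs(const struct demo_bo *bo)
{
    return bo->external ? DEMO_MOCS_PTE : DEMO_MOCS_WB;
}
```

The difference shows up in exactly the case from this bug: a buffer imported from amdgpu that the Intel GPU samples as a texture while compositing.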

Comment 66 Emil Velikov 2017-12-22 15:36:13 UTC

According to comment 64 the Mesa issue is resolved. The patches in question are part of Mesa 17.3.0 (while 17.3.1 has also been released).

If it were up to me, I'd close this report and open another against the Intel DDX since the modesetting one works fine.

Comment 67 Timo Aaltonen 2018-03-01 08:41:51 UTC

yep, closing
