Bug 60802 - Corruption with DMA ring on cayman
Status: RESOLVED FIXED
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600
Version: unspecified
Hardware: Other All
Importance: medium normal
Assigned To: Default DRI bug account
Reported: 2013-02-13 16:44 UTC by Alexandre Demers
Modified: 2013-03-18 00:13 UTC (History)

Attachments
Corrupted texture on background (131.23 KB, image/jpeg)
2013-02-14 06:42 UTC, Alexandre Demers
align dma commands to 8 dwords (4.90 KB, patch)
2013-02-27 21:13 UTC, Alex Deucher
Corruption in Trine 2 (1.05 MB, image/png)
2013-03-01 06:51 UTC, Jakob Nixdorf
set non_disp tiling bit for cayman (2.62 KB, patch)
2013-03-15 00:40 UTC, Alex Deucher
take 2 (2.82 KB, patch)
2013-03-15 13:01 UTC, Alex Deucher
take 3 (2.26 KB, patch)
2013-03-15 15:32 UTC, Alex Deucher
use blitter for compressed textures (2.14 KB, patch)
2013-03-15 19:21 UTC, Alex Deucher
take 5 (1.35 KB, patch)
2013-03-15 22:17 UTC, Alex Deucher

Description Alexandre Demers 2013-02-13 16:44:06 UTC
Since the temporary fix for bug 58354, I'm able to test some applications once again, and I've found some rendering issues. In Unigine Tropics, Unigine Heaven and another test that used to be OK (I'll come back with its name soon; I think it was Amnesia's render test), some textures are not rendered correctly with kernel 3.8-rc7 (vanilla), but are OK with kernel 3.7.7 from Arch Linux 64. I'm pretty sure an earlier 3.8-rcX rendered the still-unnamed application correctly.

I'm testing with the latest mesa and drm on an XFX HD6950.

I'll try to bisect it later.
Comment 1 Alexandre Demers 2013-02-14 06:39:17 UTC
It seems I'll have to bisect because even 3.8-rc1 displays the corruption. I'm attaching a screenshot from RendererFeattest as a corrupted reference.
Comment 2 Alexandre Demers 2013-02-14 06:42:01 UTC
Created attachment 74793 [details]
Corrupted texture on background

But the texture on the cubes seems to be ok
Comment 3 Alexandre Demers 2013-02-14 06:42:52 UTC
(In reply to comment #2)
> Created attachment 74793 [details]
> Corrupted texture on background
> 
> But the texture on the cubes seems to be ok

The text is also corrupted
Comment 4 Alexandre Demers 2013-02-18 01:43:59 UTC
bisected and first bad commit is: 8696e33f06b0c52195152cc6a0e3d52233f486c1
commit 8696e33f06b0c52195152cc6a0e3d52233f486c1
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Thu Dec 13 18:57:07 2012 -0500

    drm/radeon: bump version for CS ioctl support for async DMA
    
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Comment 5 Alex Deucher 2013-02-18 13:47:27 UTC
(In reply to comment #4)
> bisected and first bad commit is: 8696e33f06b0c52195152cc6a0e3d52233f486c1
> commit 8696e33f06b0c52195152cc6a0e3d52233f486c1
> Author: Alex Deucher <alexander.deucher@amd.com>
> Date:   Thu Dec 13 18:57:07 2012 -0500
> 
>     drm/radeon: bump version for CS ioctl support for async DMA
>     
>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

That just enables the use of DMA by the 3D driver, so the issue is in mesa.  This looks like a duplicate of bug 60236.
Comment 6 Alexandre Demers 2013-02-18 14:34:50 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > bisected and first bad commit is: 8696e33f06b0c52195152cc6a0e3d52233f486c1
> > commit 8696e33f06b0c52195152cc6a0e3d52233f486c1
> > Author: Alex Deucher <alexander.deucher@amd.com>
> > Date:   Thu Dec 13 18:57:07 2012 -0500
> > 
> >     drm/radeon: bump version for CS ioctl support for async DMA
> >     
> >     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> 
> That just enables the use of DMA by the 3D driver, so the issue is in mesa. 
> This looks like a duplicate of bug 60236.

You are right, that's what I thought about the bad identified commit.

I don't have the same symptoms as in bug 60236, but I'll have a look anyway. Otherwise, I'll use kernel 3.8-rc7 and I'll bisect mesa to track the bug down.
Comment 7 Alexandre Demers 2013-02-19 03:25:36 UTC
So I followed bug 60236. I synced mesa, since the proposed mesa patch had already been pushed. I tested with kernel 3.8.0 and the bug was still there. I then applied the proposed kernel patch over 3.8.0 (after fixing it, because it did not apply cleanly), built it and restarted using this newly built kernel.

Result: the bug is still there and the same. So this bug is not the same as bug 60236.

I'll try to bisect mesa in the next couple of days.

BTW, I bought Bastion on Steam and tested it. The same bug appears (though not exactly as in the demos, depending on the kernel I was using), so it is not limited to some demo cases.
Comment 8 Alexandre Demers 2013-02-19 06:33:31 UTC
Culprit commit in mesa identified as:

325422c49449acdd8df1eb2ca8ed81f7696c38cc is the first bad commit
commit 325422c49449acdd8df1eb2ca8ed81f7696c38cc
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Mon Jan 7 17:45:59 2013 -0500

    r600g: add async for staging buffer upload v2
    
    v2: Add virtual address to dma src/dst offset for cayman
    
    Signed-off-by: Jerome Glisse <jglisse@redhat.com>

:040000 040000 4ef6e784f3acb7f21da0c5e1923810c78917d16d 55f0ce7d9793ce04d392fc7943059545053a9d79 M	src

So, it is the same commit as for bug 60236, but applying patches from it doesn't help.
Comment 9 Jakob Nixdorf 2013-02-19 08:14:09 UTC
I got similar looking corruptions in Counter Strike: Source and Trine 2 (via Steam) since my update from kernel 3.7.9 to 3.8.0 this morning (Arch Linux x86_64, newest mesa from git).

When I get back from the university I will check if I can confirm your commit as the culprit.
Comment 10 Jakob Nixdorf 2013-02-19 22:49:19 UTC
Ok, I tested  it and I can confirm commit 325422c49449acdd8df1eb2ca8ed81f7696c38cc as the culprit for the corruption in Trine 2.

Counter Strike: Source seems to be another problem, I will bisect it tomorrow and open a new bug for it.
Comment 11 Alexandre Demers 2013-02-20 05:10:24 UTC
I also noticed this in dmesg:
[ 3385.870124] radeon 0000:01:00.0: GPU fault detected: 146 0x0005d004
[ 3385.870138] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00002400
[ 3385.870146] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x050D0004
[ 3385.870153] radeon 0000:01:00.0: GPU fault detected: 146 0x0005a004
[ 3385.870160] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3385.870165] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 3385.870172] radeon 0000:01:00.0: GPU fault detected: 146 0x0005e004
[ 3385.870178] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3385.870183] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 3385.870189] radeon 0000:01:00.0: GPU fault detected: 146 0x0025a004
[ 3385.870195] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3385.870200] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 3385.870206] radeon 0000:01:00.0: GPU fault detected: 146 0x00059004
[ 3385.870211] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3385.870217] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 3385.870223] radeon 0000:01:00.0: GPU fault detected: 146 0x0065e004
[ 3385.870228] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3385.870233] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 3385.870239] radeon 0000:01:00.0: GPU fault detected: 146 0x0035a004
[ 3385.870245] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3385.870250] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 3385.870256] radeon 0000:01:00.0: GPU fault detected: 146 0x0075e004
[ 3385.870261] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3385.870267] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[

I don't know exactly how long this has been happening, but it is recent for sure, since it was not there about 2 weeks ago.
Comment 12 Jakob Nixdorf 2013-02-20 06:35:41 UTC
I got this too, but this was already there before I updated to the 3.8.0 kernel. So I think it has nothing to do with this problem.
Comment 13 Alexandre Demers 2013-02-20 06:59:02 UTC
(In reply to comment #12)
> I got this too, but this was already there before I updated to the 3.8.0
> kernel. So I think it has nothing to do with this problem.

Thanks for letting me know. I may know an already opened bug about it...
Comment 14 Alex Deucher 2013-02-20 13:31:37 UTC
(In reply to comment #11)
> I also noticed this in dmesg:
> [ 3385.870124] radeon 0000:01:00.0: GPU fault detected: 146 0x0005d004
> [ 3385.870138] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00002400
> [ 3385.870146] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x050D0004
> [ 3385.870153] radeon 0000:01:00.0: GPU fault detected: 146 0x0005a004
> [ 3385.870160] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00000000
> [ 3385.870165] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x00000000
> [ 3385.870172] radeon 0000:01:00.0: GPU fault detected: 146 0x0005e004
> [ 3385.870178] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00000000
> [ 3385.870183] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x00000000
> [ 3385.870189] radeon 0000:01:00.0: GPU fault detected: 146 0x0025a004
> [ 3385.870195] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00000000
> [ 3385.870200] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x00000000
> [ 3385.870206] radeon 0000:01:00.0: GPU fault detected: 146 0x00059004
> [ 3385.870211] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00000000
> [ 3385.870217] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x00000000
> [ 3385.870223] radeon 0000:01:00.0: GPU fault detected: 146 0x0065e004
> [ 3385.870228] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00000000
> [ 3385.870233] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x00000000
> [ 3385.870239] radeon 0000:01:00.0: GPU fault detected: 146 0x0035a004
> [ 3385.870245] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00000000
> [ 3385.870250] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x00000000
> [ 3385.870256] radeon 0000:01:00.0: GPU fault detected: 146 0x0075e004
> [ 3385.870261] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00000000
> [ 3385.870267] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x00000000
> [
> 
> I don't know exactly how long this has been happening, but it is recent for
> sure, since it was not there about 2 weeks ago.

Can you bisect mesa to see when this started?
Comment 15 Alexandre Demers 2013-02-20 18:28:30 UTC
Yes, I will. Should I begin by looking on the kernel's side or on mesa's side?
Comment 16 Jakob Nixdorf 2013-02-20 18:32:57 UTC
I am currently bisecting mesa to see where the dmesg output starts.
Comment 17 Alex Deucher 2013-02-20 18:53:27 UTC
(In reply to comment #15)
> Yes, I will. Should I begin by looking on the kernel's side or on mesa's
> side?

It's most likely a mesa issue.
Comment 18 Jakob Nixdorf 2013-02-20 19:53:38 UTC
Ok, just finished bisecting, this is the first commit with the dmesg errors:

e110c98cae0ceae47db6cf26c08707505ce92479 is the first bad commit
commit e110c98cae0ceae47db6cf26c08707505ce92479
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Sun Jan 27 22:13:52 2013 -0500

    r600g: don't emit WAIT_UNTIL on cayman/TN (v2)
    
    It shouldn't be needed and older kernels don't support
    it.
    
    v2: Replace with PS partial flush as before.
    
    Fixes:
    https://bugs.freedesktop.org/show_bug.cgi?id=59945
    
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Reviewed-by: Marek Olšák <maraeo@gmail.com>

And if I remember correctly, I first noticed it with kernel 3.7.5 or 3.7.6, but I think this was also around the time this mesa commit was pushed, so it may have nothing to do with these kernel versions.
Comment 19 Alexandre Demers 2013-02-20 22:31:05 UTC
(In reply to comment #18)
> Ok, just finished bisecting, this is the first commit with the dmesg errors:
> 
> e110c98cae0ceae47db6cf26c08707505ce92479 is the first bad commit
> commit e110c98cae0ceae47db6cf26c08707505ce92479
> Author: Alex Deucher <alexander.deucher@amd.com>
> Date:   Sun Jan 27 22:13:52 2013 -0500
> 
>     r600g: don't emit WAIT_UNTIL on cayman/TN (v2)
>     
>     It shouldn't be needed and older kernels don't support
>     it.
>     
>     v2: Replace with PS partial flush as before.
>     
>     Fixes:
>     https://bugs.freedesktop.org/show_bug.cgi?id=59945
>     
>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>     Reviewed-by: Marek Olšák <maraeo@gmail.com>
> 
> And if I remember correctly, I first noticed it with kernel 3.7.5 or 3.7.6,
> but I think this was also around the time this mesa commit was pushed, so it
> may have nothing to do with these kernel versions.

I'll double check if I'm getting the same thing as you. If this is right, we should open a different bug to track it.
Comment 20 Alexandre Demers 2013-02-21 03:14:08 UTC
(In reply to comment #18)
> Ok, just finished bisecting, this is the first commit with the dmesg errors:
> 
> e110c98cae0ceae47db6cf26c08707505ce92479 is the first bad commit
> commit e110c98cae0ceae47db6cf26c08707505ce92479
> Author: Alex Deucher <alexander.deucher@amd.com>
> Date:   Sun Jan 27 22:13:52 2013 -0500
> 
>     r600g: don't emit WAIT_UNTIL on cayman/TN (v2)
>     
>     It shouldn't be needed and older kernels don't support
>     it.
>     
>     v2: Replace with PS partial flush as before.
>     
>     Fixes:
>     https://bugs.freedesktop.org/show_bug.cgi?id=59945
>     
>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>     Reviewed-by: Marek Olšák <maraeo@gmail.com>
> 
> And if I remember correctly, I first noticed it with kernel 3.7.5 or 3.7.6,
> but I think this was also around the time this mesa commit was pushed, so it
> may have nothing to do with these kernel versions.

I ended up with a different commit than yours when bisecting:
325422c49449acdd8df1eb2ca8ed81f7696c38cc is the first bad commit
commit 325422c49449acdd8df1eb2ca8ed81f7696c38cc

It's the one just prior to yours, and it is also the one identified as the culprit of the current bug (comment 8). Could you check a second time?
Comment 21 Jakob Nixdorf 2013-02-21 06:57:04 UTC
(In reply to comment #20)
> I ended up with a different commit than yours when bisecting:
> 325422c49449acdd8df1eb2ca8ed81f7696c38cc is the first bad commit
> commit 325422c49449acdd8df1eb2ca8ed81f7696c38cc
> 
> It's the one just prior to yours, and it is also the one identified as the
> culprit of the current bug (comment 8). Could you check a second time?

I just built 325422c49449acdd8df1eb2ca8ed81f7696c38cc but I can't trigger any dmesg output. What do you use to test for it?
Comment 22 Jakob Nixdorf 2013-02-21 08:18:35 UTC
Ok, I managed to get the dmesg output using piglit and the r600.tests, so I think this confirms 325422c49449acdd8df1eb2ca8ed81f7696c38cc as culprit.

But I think the commit I identified (e110c98cae0ceae47db6cf26c08707505ce92479) is bad too, because there I only needed to start steam to get the errors.
Comment 23 Alex Deucher 2013-02-21 14:57:13 UTC
(In reply to comment #22)
> Ok, I managed to get the dmesg output using piglit and the r600.tests, so I
> think this confirms 325422c49449acdd8df1eb2ca8ed81f7696c38cc as culprit.

Can you identify a specific piglit test that triggers the fault messages and attach the fault messages it generates?
Comment 24 Jakob Nixdorf 2013-02-21 16:36:55 UTC
(In reply to comment #23)
> Can you identify a specific piglit test that triggers the fault messages
> and attach the fault messages it generates?

I won't be at my computer for awhile, but I will try as soon as I come back.
Comment 25 Alexandre Demers 2013-02-22 04:29:47 UTC
For info, I use a now-unavailable demo that was used to test features for Amnesia (RendererFeattest). It is nowhere to be found anymore, I think. It is a quick and progressive test (starting with almost nothing and adding or testing different features on every change). I'd be happy to use piglit tests, but the last time I tried, a couple of weeks ago, it was still locking up my computer.
Comment 26 Jakob Nixdorf 2013-02-25 12:16:20 UTC
Now that I'm back I have a question: is there any simple way to single-step through all tests in the r600.tests, or do I have to do it manually ?
Comment 27 Tom Stellard 2013-02-25 15:13:05 UTC
(In reply to comment #26)
> Now that I'm back I have a question: is there any simple way to single-step
> through all tests in the r600.tests, or do I have to do it manually ?

When I want to identify which test is causing a GPU hang, I pass the -c 0 option to piglit-run.py to disable concurrent tests and then run piglit over ssh.  This way you can still see the terminal output after the GPU hangs.

For example:

ssh tstellar@cayman-box
cd piglit
DISPLAY=:0 ./piglit-run.py -c 0 tests/quick-driver-tests
Comment 28 Jakob Nixdorf 2013-02-25 16:35:39 UTC
The first test to trigger the dmesg output is:

spec/EXT_framebuffer_multisample/accuracy

but it also causes my system to freeze (ssh too), so I have no fail message.
Comment 29 Alexandre Demers 2013-02-27 02:52:04 UTC
I'm sure the rendering corruption comes from a wrongly set address for Cayman (not shifted correctly at some point). It reminds me of bug 38173 and its fixes.
Comment 30 Jakob Nixdorf 2013-02-27 06:59:54 UTC
What do you mean by fixed, this bug or the one you linked?
Because I still get the same corruptions in Trine 2 and some textures in CS:S with the newest git version of mesa.
Comment 31 Alexandre Demers 2013-02-27 13:37:23 UTC
(In reply to comment #30)
> What do you mean by fixed, this bug or the one you linked?
> Because I still get the same corruptions in Trine 2 and some textures in
> CS:S with the newest git version of mesa.

"and its fixes" referred to the bug I linked (bug 38173). I'm also still experiencing corruption with the latest git.
Comment 32 Alex Deucher 2013-02-27 21:13:26 UTC
Created attachment 75658 [details] [review]
align dma commands to 8 dwords

Does this patch help?  You might also try it in combination with this patch:
http://lists.freedesktop.org/archives/mesa-dev/2013-February/035347.html
Comment 33 Jakob Nixdorf 2013-02-27 22:26:05 UTC
No, I still get corruption in Trine 2 and on some CS:S textures with this patch (also in combination with the one you linked).

Should I try all 5 patches in the set you linked or only 3/5 ?
Comment 34 Alexandre Demers 2013-02-28 02:17:08 UTC
(In reply to comment #32)
> Created attachment 75658 [details] [review] [review]
> align dma commands to 8 dwords
> 
> Does this patch help?  You might also try it in combination with this patch:
> http://lists.freedesktop.org/archives/mesa-dev/2013-February/035347.html

Tried proposed patch and the 5 patches from the link (the link and the 4 others) and still the same corruption.

However, I noticed the errors seem to have gone with the latest mesa from git (unrelated to the patches; tested with and without them). I ran steam and RendererFeatTest64 and no errors appeared in dmesg anymore. What about you, Jakob?
Comment 35 Jakob Nixdorf 2013-02-28 06:44:22 UTC
Yes, it seems the dmesg errors are gone. I will try piglit later to confirm this.
Comment 36 Alexandre Demers 2013-03-01 03:32:55 UTC
This commit seems to have fixed the "GPU fault detected" message

62329d77b8065b5fd41179d6013c8adf6d86cfc7

Author Brian Paul<brianp@vmware.com>
Author date 2/25/13 8:00 PM
Parent svga: fix comment typos
Child r600g: synchronize streamout buffers on r6xx too (v3)
Branch master (i965/fs: Put immediate operand as src2) 
Branch origin/master (i965/fs: Put immediate operand as src2) 
Follows snb-magic (graw: Add struct pipe_surface forward declaration.)

    winsys/null: fix var typo templet->templat


Now, let's get back to the main bug: corruption. Is it me or is there a 4X4 pattern in the corruption? 2^4 or << 4
Comment 37 Jakob Nixdorf 2013-03-01 06:51:21 UTC
Created attachment 75725 [details]
Corruption in Trine 2

Yes, it looks like some regular 4x4 pattern. I made a screenshot of the Trine 2 start screen with the corruption in the title and at the borders of the exit dialog.
Comment 38 Christian König 2013-03-01 09:56:13 UTC
(In reply to comment #37)
> Created attachment 75725 [details]
> Corruption in Trine 2
> 
> Yes, it looks like some regular 4x4 pattern. I made a screenshot of the
> Trine 2 start screen with the corruption in the title and at the borders of
> the exit dialog.

Is it just me, or does this corruption start to look like a tiling issue? (Do we already use the DMA for tiling uploads?)

Is the corruption binary repeatable, e.g. does it always look the same? Or does it change a lot from one run to another?

If it's the latter, then it's more likely that it's just random memory you see.

Christian.
Comment 39 Jakob Nixdorf 2013-03-01 12:47:20 UTC
I just started Trine 2 again and took a second picture.
The corruption is definitely exactly the same.
Comment 40 Alexandre Demers 2013-03-01 14:35:08 UTC
(In reply to comment #39)
> I just started Trine 2 again and took a second picture.
> The corruption is definitely exactly the same.

I agree, same thing over here.
Comment 41 Jakob Nixdorf 2013-03-05 16:06:53 UTC
Just as information: I just built 3.9-rc1 and the corruption is still present and looking exactly the same.
Comment 42 Alexandre Demers 2013-03-05 19:27:40 UTC
(In reply to comment #41)
> Just as information: I just built 3.9-rc1 and the corruption is still
> present and looking exactly the same.

Same here, tested yesterday.
Comment 43 Alexandre Demers 2013-03-11 03:53:31 UTC
Still getting this rendering problem with kernel 3.9.0-rc2 and today's mesa code.
Comment 44 vincent 2013-03-13 22:46:36 UTC
I think I have similar issue with Unigine Heaven 3.0 :

http://people.freedesktop.org/~vlj/00002.jpg
http://people.freedesktop.org/~vlj/00003.jpg

A webgl demo that has the issue too is at http://www.findyourwaytooz.com/

It appeared with Kernel 3.8 too. Kernel 3.7.9 did not have this.

With a previous Mesa there was the same kind of issue with Lightmark, but with latest mesa it is gone ; maybe the remaining corruption is related to compressed textures ?
Comment 45 Alex Deucher 2013-03-13 22:50:47 UTC
(In reply to comment #44)
> 
> It appeared with Kernel 3.8 too. Kernel 3.7.9 did not have this.

The DMA rings are only available on 3.8 kernels.
Comment 46 Alexandre Demers 2013-03-14 02:03:53 UTC
(In reply to comment #44)
> I think I have similar issue with Unigine Heaven 3.0 :
> 
> http://people.freedesktop.org/~vlj/00002.jpg
> http://people.freedesktop.org/~vlj/00003.jpg
> 
> A webgl demo that has the issue too is at http://www.findyourwaytooz.com/
> 
> It appeared with Kernel 3.8 too. Kernel 3.7.9 did not have this.
> 
> With a previous Mesa there was the same kind of issue with Lightmark, but
> with latest mesa it is gone ; maybe the remaining corruption is related to
> compressed textures ?

Your screenshots are very similar to what I see with other Unigine's demos. Your webgl demo also seems to exhibit the same corruption issue. I would say you are right about some specific kind of textures not being correctly rendered, while others are. What kind exactly is to be determined.
Comment 47 Alexandre Demers 2013-03-14 03:30:24 UTC
So I played with libtxc (removed it in fact) and tested some demos again. For those that were not completely relying on libtxc, I observed the following:
- textures not related to libtxc were displayed correctly
- textures related to libtxc were not displayed AND were corresponding to the corrupted textures seen when libtxc was available.
Comment 48 Anthony Waters 2013-03-14 03:37:38 UTC
I looked into this a little bit and the issue appears to be within evergreen_dma_copy_tile within evergreen_state.c.  It looks like when bank_h is 0 it causes the texture to appear bad, but bank_h is allowed to be 0 so I believe the lines
cs->buf[cs->cdw++] = (detile << 31) | (array_mode << 27) |
					(lbpp << 24) | (bank_h << 21) |
					(bank_w << 18) | (mt_aspect << 16);
have the incorrect offsets, but I'm not 100% sure at the moment.
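
One way to sanity-check the offsets in that line is to pack the dword and then read each field back. The sketch below is not from the thread: the shifts are the ones quoted above, but the field widths are an assumption inferred from the gap to the next shift, and the helper names (field_put, field_get, pack_tile_dword) are hypothetical. It only shows that bank_h = 0 is representable without disturbing the neighbouring fields; it says nothing about what the hardware expects.

```c
#include <assert.h>
#include <stdint.h>

/* Shifts are taken from the quoted line. Widths are NOT stated in the
 * thread; they are inferred from the distance to the next field and are
 * therefore an assumption. */
enum {
    MT_ASPECT_SHIFT  = 16, MT_ASPECT_BITS  = 2,
    BANK_W_SHIFT     = 18, BANK_W_BITS     = 3,
    BANK_H_SHIFT     = 21, BANK_H_BITS     = 3,
    LBPP_SHIFT       = 24, LBPP_BITS       = 3,
    ARRAY_MODE_SHIFT = 27, ARRAY_MODE_BITS = 4,
    DETILE_SHIFT     = 31, DETILE_BITS     = 1
};

/* Place a value in its field, refusing values that would spill over. */
uint32_t field_put(uint32_t value, unsigned shift, unsigned bits)
{
    assert(bits < 32 && value < (1u << bits));
    return value << shift;
}

/* Read a field back out of a packed dword. */
uint32_t field_get(uint32_t dword, unsigned shift, unsigned bits)
{
    return (dword >> shift) & ((1u << bits) - 1u);
}

/* Same layout as the quoted line, with overflow checks on every field. */
uint32_t pack_tile_dword(uint32_t detile, uint32_t array_mode, uint32_t lbpp,
                         uint32_t bank_h, uint32_t bank_w, uint32_t mt_aspect)
{
    return field_put(detile, DETILE_SHIFT, DETILE_BITS) |
           field_put(array_mode, ARRAY_MODE_SHIFT, ARRAY_MODE_BITS) |
           field_put(lbpp, LBPP_SHIFT, LBPP_BITS) |
           field_put(bank_h, BANK_H_SHIFT, BANK_H_BITS) |
           field_put(bank_w, BANK_W_SHIFT, BANK_W_BITS) |
           field_put(mt_aspect, MT_ASPECT_SHIFT, MT_ASPECT_BITS);
}
```

If a field value were too large for its assumed width, the assert in field_put would fire instead of silently corrupting the next field.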
Comment 49 Anthony Waters 2013-03-14 04:29:42 UTC
Looked into it a bit more and it appears that when bpp is 16 there is a bad texture, I'll see if I can figure it out more later on.
Comment 50 Anthony Waters 2013-03-14 23:33:25 UTC
I believe I have the source of the bug: it appears that there is a special case for Cayman GPUs that isn't handled in the DMA code path. In evergreen_state.c, within the function evergreen_create_sampler_view_custom, there is the chunk of code

	/* 128 bit formats require tile type = 1 */
	if (rscreen->chip_class == CAYMAN) {
		if (util_format_get_blocksize(pipe_format) >= 16)
			non_disp_tiling = 1;
	}

however, within evergreen_dma_copy_tile in the same source file no such code exists.  I tested whether this was the case or not by placing the lines of code

		if (util_format_get_blocksize(dst->format) >= 16) {
			printf("Caymen non disp tiling skipping dma tile\n");
			return FALSE;
		}

before the call to evergreen_dma_copy_tile in evergreen_dma_blit, and the corruption no longer appeared. (Having this check skips the DMA path for this case and goes through the normal path, which would be evergreen_create_sampler_view_custom, I believe.)

I'm not sure which bits in the DMA packet control this setting.
Comment 51 Alex Deucher 2013-03-15 00:40:46 UTC
Created attachment 76544 [details] [review]
set non_disp tiling bit for cayman

Good catch!  I believe the attached patch should fix the issue.
Comment 52 Alexandre Demers 2013-03-15 01:21:02 UTC
(In reply to comment #51)
> Created attachment 76544 [details] [review] [review]
> set non_disp tiling bit for cayman
> 
> Good catch!  I believe the attached patch should fix the issue.

I don't know about others, but it doesn't fix the corruption over here.
Comment 53 Alexandre Demers 2013-03-15 01:32:44 UTC
(In reply to comment #50)
> I believe I have the source of the bug: it appears that there is a special
> case for Cayman GPUs that isn't handled in the DMA code path.  In
> evergreen_state.c within the method evergreen_create_sampler_view_custom
> there is the chunk of code
> 
> 	/* 128 bit formats require tile type = 1 */
> 	if (rscreen->chip_class == CAYMAN) {
> 		if (util_format_get_blocksize(pipe_format) >= 16)
> 			non_disp_tiling = 1;
> 	}
> 
> however, within evergreen_dma_copy_tile in the same source file no such code
> exists.  I tested whether this was the case or not by placing the lines of
> code
> 
> 		if (util_format_get_blocksize(dst->format) >= 16) {
> 			printf("Caymen non disp tiling skipping dma tile\n");
> 			return FALSE;
> 		}
> 
> before the call to evergreen_dma_copy_tile in evergreen_dma_blit, and the
> corruption no longer appeared. (having this checks skips the DMA path for
> this case and goes through the normal path, which would be
> evergreen_create_sampler_view_custom I believe)
> 
> I'm not sure which bits in the DMA packet control this setting.

Your observations are right: using your trick does indeed remove the corruption.
Comment 54 Alexandre Demers 2013-03-15 03:47:25 UTC
(In reply to comment #51)
> Created attachment 76544 [details] [review] [review]
> set non_disp tiling bit for cayman
> 
> Good catch!  I believe the attached patch should fix the issue.

Since this bug is pretty much the same as the one I pointed to in comment 29 (bug 38173), and since the proposed patch is also pretty similar to one of the two patches needed to fix that bug, is it possible that something is still missing that would look like the second patch?

patch 1 from bug 38173 (similar to the one proposed with your attachment 76544 [details] [review]): http://cgit.freedesktop.org/mesa/mesa/commit/?id=5e1495b2d9311fa2b320766a1d299053904bd9c3
patch 2 from bug 38173 (possibly similar to what is missing to complete your proposed patch): http://cgit.freedesktop.org/mesa/mesa/commit/?id=acca690c259824636ef1ff684a10bd1caca4751f

In other words, are we sure we are offsetting the non_disp variable correctly for Cayman? Because according to the second patch, the offset was different.
Comment 55 Alex Deucher 2013-03-15 13:01:27 UTC
Created attachment 76557 [details] [review]
take 2

Try this patch.  The bits of the non_disp field are inverted for compatibility with evergreen.
Comment 56 Anthony Waters 2013-03-15 13:57:50 UTC
Take 2 didn't work for me. I also tried shifts from (non_disp << 28) to (non_disp << 30), and those didn't fix it either.
Comment 57 Alex Deucher 2013-03-15 15:32:08 UTC
Created attachment 76563 [details] [review]
take 3

Only set the non_disp bit for tiled to linear copies.
Comment 58 Anthony Waters 2013-03-15 16:47:21 UTC
That didn't work either, the L2T path is the one that keeps getting executed.  In my case dst_mode is RADEON_SURF_MODE_2D or RADEON_SURF_MODE_1D and src_mode is RADEON_SURF_MODE_LINEAR_ALIGNED.

I'll see if I can get some more specifics about the issue.
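
For readers following along, the L2T/T2L naming used here can be sketched as a small classifier over the source and destination surface modes. This is only an illustration, not the r600g code: the enum below is a hypothetical stand-in for the RADEON_SURF_MODE_* values mentioned above, and classify_copy is a made-up name.

```c
#include <assert.h>

/* Hypothetical stand-ins for the RADEON_SURF_MODE_* values named above. */
enum surf_mode {
    SURF_MODE_LINEAR,
    SURF_MODE_LINEAR_ALIGNED,
    SURF_MODE_1D,
    SURF_MODE_2D
};

enum copy_kind { COPY_L2L, COPY_L2T, COPY_T2L, COPY_T2T };

/* 1D and 2D modes are tiled; both linear modes are not. */
static int is_tiled(enum surf_mode mode)
{
    assert((unsigned)mode <= SURF_MODE_2D);
    return mode == SURF_MODE_1D || mode == SURF_MODE_2D;
}

/* Name the copy after its source and destination layouts. */
enum copy_kind classify_copy(enum surf_mode src, enum surf_mode dst)
{
    if (is_tiled(src))
        return is_tiled(dst) ? COPY_T2T : COPY_T2L;
    return is_tiled(dst) ? COPY_L2T : COPY_L2L;
}
```

With the modes reported in this comment (linear-aligned source, 1D or 2D destination) the classifier returns COPY_L2T, which matches the path that keeps getting executed.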
Comment 59 Alexandre Demers 2013-03-15 17:36:42 UTC
Alex, a couple of things changed in your last attachment. You removed the L2T part, you reverted "non_disp = 2" back to "non_disp = 1", and you changed the bit offset to 28 (instead of 27). Am I right to understand that you set "non_disp" back to 1 because you are shifting by one more bit (according to your comment in v2)?

If that's right, I don't see how v3 is different from v2 except for the L2T change. And since non_disp=1 is not set at all in the main branch, how would that fix the corruption problem? Aren't we back to v0 for the L2T path (which is broken) and to v2 for the T2L path (which is no better)?
Comment 60 Alex Deucher 2013-03-15 19:21:06 UTC
Created attachment 76590 [details] [review]
use blitter for compressed textures

This patch fixes the issue here.
Comment 61 Alex Deucher 2013-03-15 19:25:47 UTC
(In reply to comment #59)
> Alex, a couple of things changed in your last attachment. You removed the
> L2T part, you reverted "non_disp = 2" back to "non_disp = 1", and you changed
> the bit offset to 28 (instead of 27). Am I right to understand that you set
> "non_disp" back to 1 because you are shifting by one more bit (according to
> your comment in v2)?

The non_disp field is two bits on cayman and one bit on evergreen.  Bit 28 in this packet represents bit 0 of the non_disp field and bit 27 represents bit 1 of the non_disp field, so that it's compatible with evergreen.  So just setting bit 28 covers us for both evergreen and cayman, since we only care about non_disp bit 0.
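
The swapped bit order described here can be written out as a small encoder. This is a sketch based only on the description in this comment; encode_non_disp is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a 2-bit non_disp value the way this comment describes it:
 * field bit 0 goes to packet bit 28, field bit 1 goes to packet bit 27.
 * Hypothetical helper, for illustration only. */
uint32_t encode_non_disp(unsigned non_disp)
{
    assert(non_disp <= 3);
    uint32_t bit0 = non_disp & 1u;
    uint32_t bit1 = (non_disp >> 1) & 1u;
    return (bit0 << 28) | (bit1 << 27);
}
```

Note that as plain arithmetic, 2 << 27 and 1 << 28 are the same value (0x10000000), so "non_disp = 2 shifted by 27" and "non_disp = 1 shifted by 28" set the same packet bit.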
Comment 62 Alex Deucher 2013-03-15 19:35:09 UTC
The way to properly handle T2L or L2T conversions of compressed textures with the DMA engine is to hack up the formats like we do in the blitter code so that it looks like a PIPE_FORMAT_R16G16B16A16_UINT or PIPE_FORMAT_R32G32B32A32_UINT T2L or L2T blit.
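
As a rough illustration of that idea (not the blitter code itself): a compressed surface can be described as an uncompressed one whose "pixel" is one compressed block. The sketch assumes 4x4-texel blocks of 8 or 16 bytes, as in S3TC-style formats; the enum and function names are hypothetical stand-ins for the PIPE_FORMAT_* values named above.

```c
#include <assert.h>

/* Hypothetical stand-ins for PIPE_FORMAT_R16G16B16A16_UINT and
 * PIPE_FORMAT_R32G32B32A32_UINT. */
enum uint_format { FMT_R16G16B16A16_UINT, FMT_R32G32B32A32_UINT };

struct blit_view {
    enum uint_format format; /* format with the same size as one block */
    unsigned width;          /* width in blocks */
    unsigned height;         /* height in blocks */
};

/* Describe a compressed surface so that one "pixel" is one block.
 * Assumes 4x4-texel blocks of 8 bytes (64 bits) or 16 bytes (128 bits). */
struct blit_view view_compressed_as_uint(unsigned block_bytes,
                                         unsigned width, unsigned height)
{
    struct blit_view view;

    assert(block_bytes == 8 || block_bytes == 16);
    view.format = (block_bytes == 8) ? FMT_R16G16B16A16_UINT
                                     : FMT_R32G32B32A32_UINT;
    view.width = (width + 3) / 4;   /* round up to whole blocks */
    view.height = (height + 3) / 4;
    return view;
}
```

The 16-byte case is the one that ends up as a 128-bit-per-pixel copy, which is the block size discussed elsewhere in this bug.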
Comment 63 Alex Deucher 2013-03-15 22:17:07 UTC
Created attachment 76592 [details] [review]
take 5

After further investigation this seems to be an alignment issue with large block sizes on the DMA engine on cayman.  For now just use the blitter for large block sizes.
Comment 64 Alexandre Demers 2013-03-16 04:21:52 UTC
(In reply to comment #63)
> Created attachment 76592 [details] [review] [review]
> take 5
> 
> After further investigation this seems to be an alignment issue with large
> block sizes on the DMA engine on cayman.  For now just use the blitter for
> large block sizes.

It does the trick for now.
Comment 65 Alex Deucher 2013-03-17 17:29:32 UTC
After a little more investigation this weekend, I think I found the root cause.  On cayman, 128bpp surfaces require non_disp ordering for hw access to both linear and tiled surfaces.  When we use the 3D engine we can set the non_disp ordering on both the tiled and linear sides (via CB or texture), but when we use the DMA engine, we can only set the non_disp ordering on the tiled side, so after a L2T operation with the DMA engine, the data ends up in the wrong order on the tiled side.
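
The resulting workaround (fall back to the blitter for large block sizes on cayman) boils down to a predicate like the one below. This is a minimal sketch under stated assumptions, not the pushed patch: the chip enum and dma_copy_allowed are hypothetical names, and the 16-byte threshold is the 128bpp case described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical chip classes, for illustration only. */
enum chip_class { CHIP_EVERGREEN, CHIP_CAYMAN };

/* Decide whether a copy may go through the DMA engine.
 * On cayman, 128bpp (16-byte) blocks need non_disp ordering on both the
 * linear and the tiled side, which the DMA packet cannot express, so the
 * caller should fall back to the 3D engine (blitter) when this is false. */
bool dma_copy_allowed(enum chip_class chip, unsigned block_bytes)
{
    assert(block_bytes > 0);
    if (chip == CHIP_CAYMAN && block_bytes >= 16)
        return false;
    return true;
}
```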
Comment 67 Alexandre Demers 2013-03-18 00:13:43 UTC
(In reply to comment #66)
> Pushed:
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=4409758a046a47b09cdd339f97afd22107c68f0c

It seems ok over here. Thank you.