Bug 62959

Summary:	r600g (HD 6950 Cayman) fails piglit tests and hangs system
Product:	Mesa	Reporter:	Alexandre Demers <alexandre.f.demers>
Component:	Drivers/Gallium/r600	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	medium	CC:	frederic.romagne, freedesktop, goeran, otaznik, udovdh
Version:	git
Hardware:	All
OS:	All
Whiteboard:
Attachments:	Full flush on vm flush; Always vm flush

Description Alexandre Demers 2013-03-31 04:00:32 UTC

Using mesa 0967c362bf378b7415c30ca6d9523d3b2a3a7f5d (and for a long time before that), piglit tests fail and hang the system (the ssh session stops responding). This is on an HD 6950 (Cayman).

Running "python2 piglit-run.py --no-concurrency tests/r600.tests results/r600.results/" fails every time. According to the "main" file, the last log entries are:
        "spec/glsl-1.30/execution/built-in-functions/fs-greaterThan-uvec4-uvec4": {
                "info": "Returncode: 0\n\nErrors:\n\n\nOutput:\n", 
                "returncode": 0, 
                "command": "/home/dema1701/projects/display/piglit/framework/../bin/shader_runner tests/../generated_tests/spec/glsl-1.30/execution/built-in-functions/fs-greaterThan-uvec4-uvec4.shader_test -auto", 
                "result": "pass", 
                "time": 0.10875105857849121
            },
        "spec/glsl-1.10/execution/variable-indexing/fs-temp-array-mat4-col-row-wr": {
                "info": "Returncode: 0\n\nErrors:\n\n\nOutput:\n", 
                "returncode": 0, 



Mesa's build options:
baseExec="./autogen.sh --prefix=/usr \
		--enable-debug \
		--enable-shared \
		--enable-osmesa \
		--enable-gbm \
		--enable-xvmc \
		--enable-vdpau \
		--enable-gles1 \
		--enable-gles2 \
		--enable-openvg \
		--enable-xorg \
		--enable-xa \
		--enable-egl \
		--enable-gallium-egl \
		--enable-glx-tls \
		--enable-texture-float \
		--enable-wgl \
		--with-gallium-drivers=r600,swrast,svga \
		--with-egl-platforms=x11,drm"
"$baseExec  --enable-64-bit --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --libdir=/usr/lib/x86_64-linux-gnu --includedir=/usr/include/x86_64-linux-gnu"

Comment 1 Alexandre Demers 2013-03-31 04:08:29 UTC

I'd appreciate it if someone could tell me how to run a single glsl test (or a batch, or a single version) at a time.

Comment 2 Alexandre Demers 2013-03-31 04:23:37 UTC

Never mind, I've found out how to run a single glsl test.

Running /home/dema1701/projects/display/piglit/framework/../bin/shader_runner tests/spec/glsl-1.10/execution/variable-indexing/fs-temp-array-mat4-col-row-wr.shader_test works fine.

Could it be the following test that crashes the system? If so, how can I know which test is next on the list?
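
A rough way to see where a run stopped is to read the last test name recorded in the (possibly truncated) "main" results file. The sketch below assumes the layout shown in the description, where each test name is a quoted key containing "/" followed by an opening brace; the function name is made up for illustration and is not part of piglit. Since the file is cut off mid-write, the last name found is either the hanging test or the one just before it.

```python
import re

# Assumption: test names are JSON keys that contain "/" and open an object,
# as in the excerpt from the "main" file in the description.
TEST_KEY_RE = re.compile(r'"([^"\n]*/[^"\n]*)":\s*\{')


def last_recorded_test(text):
    """Return the last test name found in a results file, or None.

    A regex is used instead of json.loads() because a run that hung
    the machine leaves the file truncated and therefore invalid JSON.
    """
    names = TEST_KEY_RE.findall(text)
    return names[-1] if names else None


# Usage (the path is the one from the piglit command line in the description):
# with open("results/r600.results/main") as f:
#     print(last_recorded_test(f.read()))
```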

Comment 3 Michel Dänzer 2013-04-03 10:30:11 UTC

If you run piglit with --no-concurrency from a remote shell, you should see in the terminal output which test is running when it hangs.

Comment 4 Alexandre Demers 2013-04-03 12:43:18 UTC

(In reply to comment #3)
> If you run piglit with --no-concurrency from a remote shell, you should see
> in the terminal output which test is running when it hangs.

Many tests seem to be skipped when they are run remotely via an ssh session. Is this expected, or should I set a parameter for the display?

Comment 5 Michel Dänzer 2013-04-03 12:49:59 UTC

It's perfectly normal that some tests are skipped, but you obviously do need to set the DISPLAY environment variable appropriately for the piglit run. Something like

DISPLAY=:0 python2 piglit-run.py ...
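
The same idea as a small wrapper, for anyone who wants to script it: set DISPLAY for the child process and echo every output line as soon as it arrives, so that the last line printed before a hang is still visible in the remote terminal. This is only a sketch; run_and_echo is a hypothetical helper, not something piglit provides, and it makes no assumption about what piglit prints.

```python
import os
import subprocess


def run_and_echo(cmd, display=":0"):
    """Run cmd with DISPLAY set, echo its output line by line, return the last line."""
    env = dict(os.environ, DISPLAY=display)
    last_line = None
    proc = subprocess.Popen(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so nothing is lost
        text=True,
    )
    for line in proc.stdout:
        last_line = line.rstrip("\n")
        # flush=True so the line reaches the ssh terminal before a hang
        print(last_line, flush=True)
    proc.wait()
    return last_line


# Usage, with the command line from the description:
# run_and_echo(["python2", "piglit-run.py", "--no-concurrency",
#               "tests/r600.tests", "results/r600.results/"])
```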

Comment 6 Alexandre Demers 2013-04-03 13:36:40 UTC

(In reply to comment #5)
> It's perfectly normal that some tests are skipped, but you obviously do need
> to set the DISPLAY environment variable appropriately for the piglit run.
> Something like
> 
> DISPLAY=:0 python2 piglit-run.py ...

I meant that some tests are skipped when run remotely, but tested when run locally. ;) I'll try your suggestion.

Comment 7 Michel Dänzer 2013-04-03 16:20:46 UTC

(In reply to comment #6)
> > DISPLAY=:0 python2 piglit-run.py ...
> 
> I meant that some tests are skipped when run remotely, but tested when run
> locally. ;)

Something would be wrong then. If you're doing it right, piglit runs exactly as it does locally; only its terminal output is visible elsewhere.

Comment 8 Alexandre Demers 2013-04-03 17:16:27 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > > DISPLAY=:0 python2 piglit-run.py ...
> > 
> > I meant that some tests are skipped when run remotely, but tested when run
> > locally. ;)
> 
> Something would be wrong then. If you're doing it right, piglit is running
> just as locally, only its terminal output is visible elsewhere.

Noted. I'll look at it when I get home tonight.

Comment 9 Alexandre Demers 2013-04-04 18:01:35 UTC

I was able to run it as intended. I was missing the "DISPLAY=:0". I think I have identified which test fails first. I'll double-check and tell you more as soon as I have time in the next couple of days.

Comment 10 Marek Olšák 2013-04-04 18:17:54 UTC

These tests hang if virtual memory is enabled. Some of them may hang randomly:

glean/polygonOffset
glean/pointAtten
security/initialized-texmemory
security/initialized-fbo
ARB_framebuffer_object/fbo-blit-stretch


These tests hang for a different reason:

EXT_transform_feedback/order *
- incorrect UMAD implementation causing an infinite loop, it's been discussed on mesa-dev

Comment 11 Alexandre Demers 2013-04-04 18:49:30 UTC

(In reply to comment #10)
> These tests hang if virtual memory is enabled. Some of them may hang
> randomly:
> 
> glean/polygonOffset
> glean/pointAtten
> security/initialized-texmemory
> security/initialized-fbo
> ARB_framebuffer_object/fbo-blit-stretch
> 
> 
> These tests hang for a different reason:
> 
> EXT_transform_feedback/order *
> - incorrect UMAD implementation causing an infinite loop, it's been
> discussed on mesa-dev

I confirm (since I remember them particularly) at least for:
glean/polygonOffset
glean/pointAtten
security/initialized-fbo
ARB_framebuffer_object/fbo-blit-stretch (which was the last one I was able to start)

And I think I also encountered:
EXT_transform_feedback/order *

In other words, Marek, you pretty much summed up what I saw. Are they already reported? If so, this bug may be a duplicate then.

Comment 12 Marek Olšák 2013-04-04 19:07:25 UTC

No, I just made the list today by watching piglit logs over ssh.

Comment 13 Marek Olšák 2013-04-05 23:45:18 UTC

This kernel patch fixes everything:

diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 70d3824..748a933 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -459,6 +459,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
        if (r) {
                goto out;
        }
+       radeon_fence_wait(vm->fence, false);
        radeon_cs_sync_rings(parser);
        radeon_ib_sync_to(&parser->ib, vm->fence);
        radeon_ib_sync_to(&parser->ib, radeon_vm_grab_id(

It's merely a workaround and it kills performance, but it's now pretty clear there is a synchronization issue in the kernel affecting all NI chips with virtual memory, and it should now be easier to find the bug. I'm not really familiar with the kernel code. I had to do some code reading before I found the right place to put the wait call in.

Comment 14 Jerome Glisse 2013-04-08 16:08:48 UTC

Created attachment 77607
Full flush on vm flush

Can you try this patch ?

Comment 15 Jerome Glisse 2013-04-08 16:09:28 UTC

Created attachment 77608
Always vm flush

Or this one, or both patches together. I am hoping this one is enough.

Comment 16 Alex Deucher 2013-04-08 16:32:36 UTC

We shouldn't need to flush the caches in vm_flush() since that is already handled in fence_ring_emit().  I think attachment 72794 from bug 58354 may actually do the trick.

Comment 17 Jerome Glisse 2013-04-08 17:09:32 UTC

Yes this patch should do the trick

Comment 18 Alexandre Demers 2013-04-09 00:21:22 UTC

Attachment 72794 applied on kernel 3.9-rc6 hangs (2 runs out of 2) at spec/glsl-1.10/execution/built-in-functions/vs-max-vec2-vec2


Applying [...] @@ -459,6 +459,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
        if (r) {
                goto out;
        }
+       radeon_fence_wait(vm->fence, false);
        radeon_cs_sync_rings(parser);

on 3.9-rc5 hangs every time (3 runs out of 3) on spec/EXT_transform_feedback/order arrays triangles


I still have to test Jerome's patches.

Comment 19 Marek Olšák 2013-04-09 00:48:28 UTC

(In reply to comment #18)
> Attachment 72794 [details] applied on kernel 3.9-rc6 hangs (2 on 2) at
> spec/glsl-1.10/execution/built-in-functions/vs-max-vec2-vec2
> 
> 
> Applying [...] @@ -459,6 +459,7 @@ static int radeon_cs_ib_vm_chunk(struct
> radeon_device *rdev,
>         if (r) {
>                 goto out;
>         }
> +       radeon_fence_wait(vm->fence, false);
>         radeon_cs_sync_rings(parser);
> 
> on 3.9-rc5 hangs everytime (3 on 3) on spec/EXT_transform_feedback/order
> arrays triangles

Of course it hangs the "order" test! The test triggers a bug in the shader backend, causing an infinite loop. It has nothing to do with the virtual memory issues. Just skip the test, we already have a fix for it on mesa-dev.

Comment 20 Marek Olšák 2013-04-09 01:16:48 UTC

Sorry I meant to say the "order" test hangs and is unrelated to the other hangs and the patches posted here won't help you with it. Anyway, I have committed the fix for the "order" test now.

Comment 21 Alexandre Demers 2013-04-09 01:34:51 UTC

(In reply to comment #20)
> Sorry I meant to say the "order" test hangs and is unrelated to the other
> hangs and the patches posted here won't help you with it. Anyway, I have
> committed the fix for the "order" test now.

The previous comment was a bit rough around the edges, considering I had no knowledge of that bug already known to the devs, but no harm taken. It was... spontaneous. ;)

Thank you for letting me know about the committed fix, I'll update mesa and relaunch the piglit run.

Comment 22 Alexandre Demers 2013-04-09 04:16:47 UTC

It seems running kernel 3.9-rc6 with attachment 72794 with latest mesa (UMAD fixed on Cayman, thanks to commit pushed by Marek) allowed me to run all r600 piglit tests without any issue.

Comment 23 Alex Deucher 2013-04-09 12:52:03 UTC

(In reply to comment #22)
> It seems running kernel 3.9-rc6 with attachment 72794 [details] [review] [review]
> with latest mesa (UMAD fixed on Cayman, thanks to commit pushed by Marek)
> allowed me to run all r600 piglit tests without any issue.

Great.  I'll add that patch to my queue and also a similar patch for SI.

Comment 24 Marek Olšák 2013-04-09 23:56:00 UTC

Alex, I'm sorry but your patch does not fix the lockups on my Cayman (HD 6950). :( The piglit test "initialized-fbo" can be used to reproduce the lockup.

Comment 25 Alexandre Demers 2013-04-10 00:11:07 UTC

(In reply to comment #24)
> Alex, I'm sorry but your patch does not fix the lockups on my Cayman (HD
> 6950). :( The piglit test "initialized-fbo" can be used to reproduce the
> lockup.

Are all the previously listed tests failing? I'll test them again and I'll run the initialized-fbo.

Comment 26 Marek Olšák 2013-04-10 00:18:20 UTC

It's not important for this bug if the test fails (I think it does), what's important is whether it hangs the machine or not.

Comment 27 Alexandre Demers 2013-04-10 01:20:29 UTC

(In reply to comment #26)
> It's not important for this bug if the test fails (I think it does), what's
> important is whether it hangs the machine or not.

That's what I meant. I'm launching a new run in a moment.

Comment 28 Alexandre Demers 2013-04-10 02:05:07 UTC

Marek, you are right and I must have been "lucky" yesterday when I tested it. I launched two runs, and hit two different hanging tests this time:
glean/polygonOffset (first run)
glean/pointAtten (second run)

Comment 29 Alex Deucher 2013-04-10 13:06:57 UTC

Do either of Jerome's patches help?

Comment 30 Alexandre Demers 2013-04-10 13:21:17 UTC

(In reply to comment #29)
> Do either of Jerome's patches help?

Didn't have time to test them yesterday, I'll try them probably at the end of the day.

Comment 31 Alexandre Demers 2013-04-10 22:54:25 UTC

(In reply to comment #29)
> Do either of Jerome's patches help?

Applied both, ran 2 times r600.test and everything went fine. I'll test with only one patch applied at a time later today.

Comment 32 Marek Olšák 2013-04-11 01:04:05 UTC

Attachment 77608 fixes the lockups, which suggests the DRM driver doesn't actually flush caches when it should.

Comment 33 Alex Deucher 2013-04-11 13:45:05 UTC

(In reply to comment #32)
> Attachment 77608 [details] fixes the lockups, which suggests the DRM driver
> doesn't actually flush caches when it should.

radeon_ring_vm_flush() doesn't actually flush caches per se, it writes the new VM page table base address, so presumably we are not handling last_flush properly somewhere which results in a stale VM page table pointer.

Comment 34 Alex Deucher 2013-04-14 14:28:00 UTC

*** Bug 62997 has been marked as a duplicate of this bug. ***

Comment 35 udo 2013-04-20 07:54:31 UTC

Alex, despite having your 2nd patch in, I found:

Apr 17 16:26:07 o2 kernel: [91224.372170] radeon 0000:00:01.0: GPU fault detected: 146 0x0594260c
Apr 17 16:26:07 o2 kernel: [91224.372175] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000A59
Apr 17 16:26:07 o2 kernel: [91224.372178] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0402600C
Apr 17 16:26:07 o2 kernel: [91224.372181] radeon 0000:00:01.0: GPU fault detected: 146 0x0594260c
Apr 17 16:26:07 o2 kernel: [91224.372184] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Apr 17 16:26:07 o2 kernel: [91224.372187] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

Apr 19 17:19:08 o2 kernel: [132471.330610] radeon 0000:00:01.0: GPU fault detected: 147 0x06d37002
Apr 19 17:19:08 o2 kernel: [132471.330614] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000316D
Apr 19 17:19:08 o2 kernel: [132471.330617] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03070002
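
For comparing fault reports across kernels, the address/status pairs can be pulled out of the log mechanically. The sketch below assumes only the line format shown above; parse_faults is a hypothetical helper, and it deliberately does not decode the STATUS bits. The all-zero pairs that follow each real fault in these logs can be dropped with skip_zero=True.

```python
import re

# Matches the two register lines printed by the radeon driver in the logs above.
FAULT_RE = re.compile(
    r"VM_CONTEXT1_PROTECTION_FAULT_(ADDR|STATUS)\s+0x([0-9A-Fa-f]+)"
)


def parse_faults(log_text, skip_zero=False):
    """Return a list of {"ADDR": int, "STATUS": int} dicts, one per ADDR/STATUS pair."""
    faults = []
    current = {}
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if not match:
            continue
        current[match.group(1)] = int(match.group(2), 16)
        # A STATUS line closes the pair that its preceding ADDR line opened.
        if match.group(1) == "STATUS":
            if not (skip_zero and not any(current.values())):
                faults.append(current)
            current = {}
    return faults


# Usage:
# with open("/var/log/messages") as f:
#     for fault in parse_faults(f.read(), skip_zero=True):
#         print(hex(fault["ADDR"]), hex(fault["STATUS"]))
```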

Comment 36 udo 2013-04-20 14:44:21 UTC

Alex, despite having your 2nd patch in, the box just crashed and rebooted.
I'll revert to your first patch to see if that one still helps.

Comment 37 Alexandre Demers 2013-04-20 14:52:13 UTC

(In reply to comment #36)
> Alex, despite having your 2nd patch in, the box just crashed and rebooted.
> I'll revert to your first patch to see if that one still helps.

By reading bug 62997 you've reported, you may be hitting more than one bug. I had to report a couple of bugs myself about VM and DMA for Cayman.

Could you try kernel 3.9-rc7? I know a couple of patches went in there that could help you.

Comment 38 udo 2013-04-20 15:58:21 UTC

3.9.0-rc7 here.

Just saw this:

[   67.568697] Bluetooth: BNEP socket layer initialized
[ 2144.364903] radeon 0000:00:01.0: GPU fault detected: 147 0x0d422602
[ 2144.364908] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000D6D4
[ 2144.364911] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02026002
[ 2144.364913] radeon 0000:00:01.0: GPU fault detected: 147 0x0d422602
[ 2144.364915] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 2144.364918] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

Comment 39 Alexandre Demers 2013-04-20 16:08:07 UTC

(In reply to comment #38)
> 3.9.0-rc7 here.
> 
> Just saw this:
> 
> [   67.568697] Bluetooth: BNEP socket layer initialized
> [ 2144.364903] radeon 0000:00:01.0: GPU fault detected: 147 0x0d422602
> [ 2144.364908] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x0000D6D4
> [ 2144.364911] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x02026002
> [ 2144.364913] radeon 0000:00:01.0: GPU fault detected: 147 0x0d422602
> [ 2144.364915] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
> 0x00000000
> [ 2144.364918] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x00000000

Are you getting the same result with and without R600_DEBUG=nodma (as per bug 62997)? Also, did you try applying patches on top of 3.9-rc7? Just getting some info here to see if anything is making a difference.

Comment 40 udo 2013-04-20 16:10:42 UTC

In /etc/environment I have had R600_DEBUG=nodma ever since discovering that setting, i.e. R600_DEBUG is set to nodma now as well.

I did not apply any of your patches over the 3.9.0-rc7 kernel.

Comment 41 Alexandre Demers 2013-04-20 16:19:58 UTC

(In reply to comment #40)
> in /etc/environment I have R600_DEBUG=nodma ever since discovering that
> setting, i.e.: also now the R6))_DEBUG is set to nodma.
> 
> I did not apply any of your patches over the 3.9.0-rc7 kernel.

If I were you, I would test without R600_DEBUG=nodma to see if there is any difference between kernels 3.8 and 3.9. I would also try patching 3.9-rc7 with attachment 77608 or attachment.

Comment 42 Alexandre Demers 2013-04-20 16:35:21 UTC

(In reply to comment #40)
> in /etc/environment I have R600_DEBUG=nodma ever since discovering that
> setting, i.e.: also now the R6))_DEBUG is set to nodma.
> 
> I did not apply any of your patches over the 3.9.0-rc7 kernel.

Your log looks similar to the one in bug 58354, which is also related to DMA, or bug 59089, which is related to htile/VM.

The best thing would be to bisect the kernel between a known good version and the first bad one. Do you have a previous kernel version that was working OK as a good reference?

Comment 43 udo 2013-04-20 16:41:10 UTC

3.6.11 was OK but also maybe is kinda 'old'.

Comment 44 Alexandre Demers 2013-04-20 17:00:44 UTC

(In reply to comment #43)
> 3.6.11 was OK but also maybe is kinda 'old'.

Then try a 3.7 kernel if possible to see if a first bug was introduced there or if it all happened in the 3.8 branch. If I remember correctly, there were some changes about VM in 3.7 and some others about DMA in 3.8. Doing so should allow us to work on one bug/change at a time.

Are you using latest mesa, drm and ddx?

Comment 45 udo 2013-04-20 17:08:22 UTC

drm 2.4.44, git for the rest.

Comment 46 udo 2013-04-21 06:24:30 UTC

Booted 3.7.8.

Comment 47 udo 2013-04-21 07:20:12 UTC

And it crashed and rebooted.

Comment 48 Alexandre Demers 2013-04-21 13:45:53 UTC

(In reply to comment #47)
> And it crashed, booted.

Any message in your logs?

And I think you now have a reference to bisect.

Comment 49 udo 2013-04-21 15:36:31 UTC

No messages that I could find in /var/log/messages. Xorg.0.log or so didn't help.
Running 3.7.6 since shortly after that unplanned reboot. So far it's OK.

Comment 50 udo 2013-04-21 16:44:17 UTC

It crashed so we'll go for 3.7.5.

Comment 51 Alexandre Demers 2013-04-21 18:08:23 UTC

(In reply to comment #50)
> It crashed so we'll go for 3.7.5.

I'm sure 3.7.0 will already display the problem. It was probably introduced between 3.6 and 3.7-rc1. Have you ever bisected before? I could help you with it if you want to.

Comment 52 udo 2013-04-22 15:08:34 UTC

W.r.t. bisecting I found the info at http://webchick.net/node/99.
Next week I do not have to go to work so I could give it a try.
Where do I get the sources from?
And which sha1s refer to all the radeon commits in 3.7-rc1?

Comment 53 Alexandre Demers 2013-04-24 02:31:13 UTC

(In reply to comment #52)
> W.r.t. bisecting I found the info at http://webchick.net/node/99.
> Next week I do not have to go to work so I could give it a try.
> Where do I get the sources from?
> And what sha1's refer to all radeon commits in 3.7-rc1??

You will get the source from git.kernel.org. The first sync will take a while. Something like this should allow you to get the whole Linus linux tree into the linux-git folder:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git

Your link about how to use git seems good. However, I would skip to bisecting since you already have a known good and a known bad version. You'll have to sync your tree to a known good or bad version (assuming you didn't change the code or that you are willing to lose your changes):
git reset --hard v3.7.5

You are now ready to bisect:
git bisect start

Tag this known bad version as "bad":
git bisect bad
This tags the currently checked-out version as bad; alternatively, you can use "git bisect bad v3.7.5" to mark a specific tag. You can do the same with a specific commit.

Tag the known good version as "good":
git bisect good v3.6.11

Git will do its work and let you know how many iterations will be needed to find the first bad commit. It will then sync to a new commit. You have to compile this kernel, install it and test it. Each time you are sure a given kernel is good or bad, you have to tell git by using "git bisect good" or "git bisect bad". Git will move to the next iteration and you will have to configure, compile, install and test again until you end up identifying the first bad commit. The nearer you get to the end of the bisection, the faster it will be to configure and compile the kernel (fewer commits, thus fewer changes).

Once you are done and have reported the bad commit here, you'll have to stop bisecting and get back to where you started:
git bisect reset
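
To give a feel for how many kernels have to be built, here is a simplified model of the procedure above. It assumes a linear history in which every commit after the first bad one is also bad; real git bisect works on the full commit graph, so its step estimate can differ slightly. Both function names are made up for this sketch.

```python
def estimated_steps(n_commits):
    """Upper bound on the tests needed to search n_commits candidates (ceil(log2(n)))."""
    return (n_commits - 1).bit_length() if n_commits > 0 else 0


def find_first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    commits is ordered oldest to newest, and the newest one is known bad.
    is_bad(commit) stands for "build, install, boot and test this kernel".
    Returns (first_bad_commit, number_of_tests_run).
    """
    lo, hi = 0, len(commits) - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid       # first bad commit is at mid or earlier
        else:
            lo = mid + 1   # first bad commit is after mid
    return commits[lo], tests


# Usage: roughly ten kernel builds cover about a thousand candidate commits.
# print(estimated_steps(1000))
```

Each step halves the remaining range, which is why a good reproducer matters more than the size of the range: every wrong "good" or "bad" answer sends the search into the wrong half.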

Comment 54 udo 2013-04-24 13:11:23 UTC

This afternoon I found the box, still somewhat alive, with a crashed Xorg, showing a textmode bootup screen.
It was running 3.6.11.
So if we assume the hardware is OK and that 3.6.11 was indeed solid, then either mesa, dri or the radeon video driver is causing this.

Now running 3.7.1 to see if we can get a log message of some sort, as I did not see any with the recent crashes.

Comment 55 udo 2013-04-24 13:21:59 UTC

Also: where are the minor kernel versions in that tree? It goes from 3.7-rcX to 3.6-rcX.

Comment 56 Alex Deucher 2013-04-24 13:30:01 UTC

For stable kernels you need to use the stable branches:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
if you've already checked out Linus' tree, you can add the stable tree as a remote:
git remote add stable git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
git fetch stable

Comment 57 Alexandre Demers 2013-04-24 13:42:59 UTC

(In reply to comment #55)
> Also: where are the minor kernel versions in that tree? It goes from 3.7-rcX
> to 3.6-rcX.

I'm sure you can use Linus' tree (between v3.6.0 and v3.7.0 for example). If you were to use minor versions, you would at best end up pointing at a change backported from a newer version.

Comment 58 udo 2013-04-24 14:51:57 UTC

Ok, just starting the first iteration compile.
Is there a better testcase for the issue(s) we look for than using the PC for a while, watching youtube, etc?

Comment 59 Alexandre Demers 2013-04-24 15:07:08 UTC

(In reply to comment #58)
> Ok, just starting the first iteration compile.
> Is there a better testcase for the issue(s) we look for than using the PC
> for a while, watching youtube, etc?

You could run piglit tests, that would stress the computer. Otherwise, try to trigger the bug in a reproducible way.

Comment 60 udo 2013-04-25 14:04:32 UTC

The first bisect kernel I try gives me youtube videos that are blocks of gibberish. What to do about that?

3.7.1 crashed like previous kernels (showing text boot screen) but no messages in the log.
So is that the same problem we're seeing and searching for?

Comment 61 Alexandre Demers 2013-04-27 15:04:27 UTC

I compiled drm-next yesterday (which should be found in kernel 3.10 any day now). I've been able to run piglit r600.test without any problem 2 times (just in case, I rebooted between each run). Is there anything pushed in there that is expected to help?

Comment 62 Alexandre Demers 2013-04-27 15:09:08 UTC

(In reply to comment #60)
> The first bisect kernel I try gives me youtube videos that are blocks of
> gibberish. What to do about that?
> 
> 3.7.1. crashed like previous kernels (showing text boot screen) but no
> messages in the log.
> So is that the same problem we're seeing and searching for?

Well, I would continue bisecting until you find the first problematic commit that crashes your setup. You may be hitting more than one bug, so keep track (commit and results) of what you see in between in case they are not linked to the same bug.

Comment 63 udo 2013-04-27 15:16:09 UTC

Currently running 3.6.0-02886-gd9a8074 and that one has held up OK so far for 28 hours. How long to continue before declaring this kernel good?
Previous bisect kernel 3.6.0-05487-g24d7b40 was found within 24 hours to have crashed in the bootup textmode screen manner.

The 3.8.10 comment is interesting as the changelog does not mention radeon.

Comment 64 udo 2013-05-09 09:28:19 UTC

Weird thing is that with 3.8.10 the box has been stable for a few days without weird radeon-related errors.
Currently trying 3.9.1.

Git mesa, llvm, libclc, xf86-video-ati, etc.

Comment 65 udo 2013-06-09 07:04:33 UTC

Running kernel 3.9.x (3.9.4 now), git mesa, git ati driver I had no issues with these adjustments:

# cat /etc/environment 
LIBGL_DRIVERS_PATH=/opt/xorg/lib/dri/
R600_DEBUG=nodma

and:

# git diff
diff --git a/src/gallium/winsys/radeon/drm/radeon_drm_winsys.c b/src/gallium/winsys/radeon/drm/radeon_drm_winsys.c
index 15d5d31..5b1d0fb 100644
--- a/src/gallium/winsys/radeon/drm/radeon_drm_winsys.c
+++ b/src/gallium/winsys/radeon/drm/radeon_drm_winsys.c
@@ -399,6 +399,7 @@ static boolean do_winsys_init(struct radeon_drm_winsys *ws)
                                       &ws->info.r600_ib_vm_max_size))
                 ws->info.r600_virtual_address = FALSE;
         }
+        ws->info.r600_virtual_address = FALSE;
     }
 
     /* Get max pipes, this is only needed for compute shaders.  All evergreen+

(maybe not relevant)

No GPU hangs. No weird radeon related stuff in /var/log/messages.
I could even run bfgminer successfully with tstellard's recent Cayman fixes for llvm/OpenCL.

Bisecting got me nowhere. :-(

So what would be a next step?

Comment 66 Alexandre Demers 2013-07-14 01:10:25 UTC

While I'm the one who opened this bug, on my side I've been able to run all piglit tests without any hangs for a while now. I don't even need to run one test at a time anymore. But I'm using the latest git versions of mesa, drm and ddx with a 3.10 kernel, if this can be of any help to those who are adding themselves to the CC list.

Comment 67 Michel Dänzer 2013-07-15 10:36:40 UTC

(In reply to comment #66)
> While I'm the one who opened this bug, on my side I'm able to run all piglit
> tests without any hangs since awhile now.

Even with GPU virtual memory enabled? If so, this report can be resolved as fixed?

Comment 68 udo 2013-07-15 10:44:54 UTC

I still use RADEON_VA=0 to avoid GPU lockups etc.
Was anything changed so there's reason to test with RADEON_VA=1?

Comment 69 Marek Olšák 2013-07-15 12:24:58 UTC

(In reply to comment #67)
> (In reply to comment #66)
> > While I'm the one who opened this bug, on my side I'm able to run all piglit
> > tests without any hangs since awhile now.
> 
> Even with GPU virtual memory enabled? If so, this report can be resolved as
> fixed?

This bug has been fixed by the kernel patch 466476dfdcafbb4286ffa232a3a792731b9dc852 for quite a long time as far as 3D support is concerned.

Some say that OpenCL still locks up, but I think that's a different issue.

Comment 70 udo 2013-07-15 12:30:27 UTC

Tom Stellard advised me not to use virtual memory, first by patch and later with RADEON_VA=0 as OpenCL started to work for Cayman (ARUBA here in A10-5800K) graphics.
Which bug to see for virtual memory and OpenCL?

Comment 71 Alexandre Demers 2013-07-15 13:44:52 UTC

(In reply to comment #67)
> (In reply to comment #66)
> > While I'm the one who opened this bug, on my side I'm able to run all piglit
> > tests without any hangs since awhile now.
> 
> Even with GPU virtual memory enabled? If so, this report can be resolved as
> fixed?

I'll have to double check, but from what I remember, yes even with VM enabled. But I'm not playing with OpenCL.

It's just that I was seeing new CCers being added and this bug is still open because Udo was experiencing a problem that looked like the current bug.

Comment 72 udo 2013-07-15 13:50:08 UTC

I can see if the bug is still present (aside from OpenCL usage) in normal desktop usage (mail, web, youtube, etc).

Comment 73 Michel Dänzer 2013-07-15 14:17:10 UTC

(In reply to comment #71)
> It's just that I was seeing new CCers being added and this bug is still open
> because Udo was experiencing a problem that looked like the current bug.

As you said, you're the reporter of this bug. If the problem you reported here is fixed, please resolve it accordingly. If Udo is still having problems, he should file his own report.

Comment 74 Alexandre Demers 2013-07-15 19:43:21 UTC

(In reply to comment #73)
> (In reply to comment #71)
> > It's just that I was seeing new CCers being added and this bug is still open
> > because Udo was experiencing a problem that looked like the current bug.
> 
> As you said, you're the reporter of this bug. If the problem you reported
> here is fixed, please resolve it accordingly. If Udo is still having
> problems, he should file his own report.

Then fixed and closed it is.
