Using mesa 0967c362bf378b7415c30ca6d9523d3b2a3a7f5d (and for a long time before that), piglit tests fail and hang the system (the ssh session stops responding). The card is an HD 6950 (Cayman).

Running "python2 piglit-run.py --no-concurrency tests/r600.tests results/r600.results/", it fails every time. According to the "main" results file, the last logged entries are:

"spec/glsl-1.30/execution/built-in-functions/fs-greaterThan-uvec4-uvec4": {
    "info": "Returncode: 0\n\nErrors:\n\n\nOutput:\n",
    "returncode": 0,
    "command": "/home/dema1701/projects/display/piglit/framework/../bin/shader_runner tests/../generated_tests/spec/glsl-1.30/execution/built-in-functions/fs-greaterThan-uvec4-uvec4.shader_test -auto",
    "result": "pass",
    "time": 0.10875105857849121
},
"spec/glsl-1.10/execution/variable-indexing/fs-temp-array-mat4-col-row-wr": {
    "info": "Returncode: 0\n\nErrors:\n\n\nOutput:\n",
    "returncode": 0,

Mesa's build options:

baseExec="./autogen.sh --prefix=/usr \
    --enable-debug \
    --enable-shared \
    --enable-osmesa \
    --enable-gbm \
    --enable-xvmc \
    --enable-vdpau \
    --enable-gles1 \
    --enable-gles2 \
    --enable-openvg \
    --enable-xorg \
    --enable-xa \
    --enable-egl \
    --enable-gallium-egl \
    --enable-glx-tls \
    --enable-texture-float \
    --enable-wgl \
    --with-gallium-drivers=r600,swrast,svga \
    --with-egl-platforms=x11,drm"

"$baseExec --enable-64-bit --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --libdir=/usr/lib/x86_64-linux-gnu --includedir=/usr/include/x86_64-linux-gnu"
I'd appreciate it if someone could tell me how to run a single GLSL test (or a batch, or one GLSL version) at a time.
Never mind, I've found out how to run a single GLSL test. Running /home/dema1701/projects/display/piglit/framework/../bin/shader_runner tests/spec/glsl-1.10/execution/variable-indexing/fs-temp-array-mat4-col-row-wr.shader_test works fine. Could it be the following test that crashes the system? If so, how can I know which test is next on the list?
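For reference, piglit itself can also narrow the run down; something like this should work if I read piglit-run.py's -t (include-tests regex) option right — the results directory name here is just an example:

# Run only the tests whose names match a regex instead of the whole r600 profile
python2 piglit-run.py --no-concurrency \
    -t fs-temp-array-mat4-col-row-wr \
    tests/r600.tests results/single.results/

Since --no-concurrency runs the profile in a fixed order, the last entry written to the "main" results file is the test that completed, and whatever the terminal prints next is the suspect.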
If you run piglit with --no-concurrency from a remote shell, you should see in the terminal output which test is running when it hangs.
(In reply to comment #3)
> If you run piglit with --no-concurrency from a remote shell, you should see
> in the terminal output which test is running when it hangs.

Many tests seem to be skipped when they are run remotely via an ssh session. Is this expected, or should I set a parameter for the display?
It's perfectly normal that some tests are skipped, but you obviously do need to set the DISPLAY environment variable appropriately for the piglit run. Something like DISPLAY=:0 python2 piglit-run.py ...
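Concretely, a remote run could look something like this (the host name and log file name are just examples, not from this report):

# From another machine, so the terminal output survives the hang:
ssh user@testbox
# In that ssh session, point piglit at the local X server and keep a log:
DISPLAY=:0 python2 piglit-run.py --no-concurrency \
    tests/r600.tests results/r600.results/ 2>&1 | tee piglit-run.log

The last test name printed in the ssh terminal (or in piglit-run.log) before the machine stops responding is the one to look at.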
(In reply to comment #5) > It's perfectly normal that some tests are skipped, but you obviously do need > to set the DISPLAY environment variable appropriately for the piglit run. > Something like > > DISPLAY=:0 python2 piglit-run.py ... I meant that some tests are skipped when run remotely, but tested when run locally. ;) I'll try your suggestion.
(In reply to comment #6) > > DISPLAY=:0 python2 piglit-run.py ... > > I meant that some tests are skipped when run remotely, but tested when run > locally. ;) Something would be wrong then. If you're doing it right, piglit is running just as locally, only its terminal output is visible elsewhere.
(In reply to comment #7)
> (In reply to comment #6)
> > > DISPLAY=:0 python2 piglit-run.py ...
> >
> > I meant that some tests are skipped when run remotely, but tested when run
> > locally. ;)
>
> Something would be wrong then. If you're doing it right, piglit is running
> just as locally, only its terminal output is visible elsewhere.

Noted. I'll look at it when I get home tonight.
I was able to run it as expected; I was missing the "DISPLAY=:0" part. I think I have identified which test fails first. I'll double-check and tell you more as soon as I have time in the next couple of days.
These tests hang if virtual memory is enabled. Some of them may hang randomly:

glean/polygonOffset
glean/pointAtten
security/initialized-texmemory
security/initialized-fbo
ARB_framebuffer_object/fbo-blit-stretch

These tests hang for a different reason:

EXT_transform_feedback/order *
- incorrect UMAD implementation causing an infinite loop, it's been discussed on mesa-dev
(In reply to comment #10)
> These tests hang if virtual memory is enabled. Some of them may hang
> randomly:
>
> glean/polygonOffset
> glean/pointAtten
> security/initialized-texmemory
> security/initialized-fbo
> ARB_framebuffer_object/fbo-blit-stretch
>
> These tests hang for a different reason:
>
> EXT_transform_feedback/order *
> - incorrect UMAD implementation causing an infinite loop, it's been
> discussed on mesa-dev

I can confirm (since I remember them in particular) at least:

glean/polygonOffset
glean/pointAtten
security/initialized-fbo
ARB_framebuffer_object/fbo-blit-stretch (which was the last one I was able to start)

And I think I also encountered:

EXT_transform_feedback/order *

In other words, Marek, you pretty much summed up what I saw. Are they already reported? If so, this bug may be a duplicate.
No, I just made the list today by watching piglit logs over ssh.
This kernel patch fixes everything:

diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 70d3824..748a933 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -459,6 +459,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
 	if (r) {
 		goto out;
 	}
+	radeon_fence_wait(vm->fence, false);
 	radeon_cs_sync_rings(parser);
 	radeon_ib_sync_to(&parser->ib, vm->fence);
 	radeon_ib_sync_to(&parser->ib, radeon_vm_grab_id(

It's merely a workaround and it kills performance, but it's now pretty clear there is a synchronization issue in the kernel affecting all NI chips with virtual memory, and it should now be easier to find the bug. I'm not really familiar with the kernel code; I had to do some code reading before I found the right place to put the wait call in.
Created attachment 77607 [details] [review]
Full flush on vm flush

Can you try this patch?
Created attachment 77608 [details] [review]
Always vm flush

Or this one, or both patches together. I am hoping this one is enough.
We shouldn't need to flush the caches in vm_flush() since that is already handled in fence_ring_emit(). I think attachment 72794 [details] [review] from bug 58354 may actually do the trick.
Yes, this patch should do the trick.
Attachment 72794 [details] applied on kernel 3.9-rc6 hangs (2 out of 2 runs) at spec/glsl-1.10/execution/built-in-functions/vs-max-vec2-vec2

Applying [...]

@@ -459,6 +459,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
 	if (r) {
 		goto out;
 	}
+	radeon_fence_wait(vm->fence, false);
 	radeon_cs_sync_rings(parser);

on 3.9-rc5 hangs every time (3 out of 3 runs) on spec/EXT_transform_feedback/order arrays triangles

I still have to test Jerome's patches.
(In reply to comment #18)
> Attachment 72794 [details] applied on kernel 3.9-rc6 hangs (2 out of 2 runs)
> at spec/glsl-1.10/execution/built-in-functions/vs-max-vec2-vec2
>
> Applying [...]
> @@ -459,6 +459,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
>  	if (r) {
>  		goto out;
>  	}
> +	radeon_fence_wait(vm->fence, false);
>  	radeon_cs_sync_rings(parser);
>
> on 3.9-rc5 hangs every time (3 out of 3 runs) on
> spec/EXT_transform_feedback/order arrays triangles

Of course the "order" test hangs! That test triggers a bug in the shader backend, causing an infinite loop. It has nothing to do with the virtual memory issues. Just skip the test; we already have a fix for it on mesa-dev.
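For the record, you can leave the known-bad test out of the run from piglit's command line; something like this should do it, assuming the -x (exclude-tests regex) option piglit provided at the time:

# Exclude the transform-feedback "order" tests until the mesa fix lands,
# while still exercising everything else in the r600 profile:
DISPLAY=:0 python2 piglit-run.py --no-concurrency \
    -x 'EXT_transform_feedback/order' \
    tests/r600.tests results/r600.results/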
Sorry, I meant to say that the "order" test hang is unrelated to the other hangs, and the patches posted here won't help you with it. Anyway, I have committed the fix for the "order" test now.
(In reply to comment #20)
> Sorry, I meant to say that the "order" test hang is unrelated to the other
> hangs, and the patches posted here won't help you with it. Anyway, I have
> committed the fix for the "order" test now.

The previous comment was a bit rough around the edges given how little I knew about that bug the devs already knew about, but no harm taken. It was... spontaneous. ;)

Thank you for letting me know about the committed fix; I'll update mesa and relaunch the piglit run.
It seems that running kernel 3.9-rc6 with attachment 72794 [details] [review], together with the latest mesa (UMAD fixed on Cayman, thanks to the commit pushed by Marek), allowed me to run all r600 piglit tests without any issue.
(In reply to comment #22)
> It seems that running kernel 3.9-rc6 with attachment 72794 [details] [review],
> together with the latest mesa (UMAD fixed on Cayman, thanks to the commit
> pushed by Marek), allowed me to run all r600 piglit tests without any issue.

Great. I'll add that patch to my queue and also a similar patch for SI.
Alex, I'm sorry but your patch does not fix the lockups on my Cayman (HD 6950). :( The piglit test "initialized-fbo" can be used to reproduce the lockup.
(In reply to comment #24) > Alex, I'm sorry but your patch does not fix the lockups on my Cayman (HD > 6950). :( The piglit test "initialized-fbo" can be used to reproduce the > lockup. Are all the previously listed tests failing? I'll test them again and I'll run the initialized-fbo.
It's not important for this bug whether the test fails (I think it does); what's important is whether it hangs the machine or not.
(In reply to comment #26)
> It's not important for this bug whether the test fails (I think it does);
> what's important is whether it hangs the machine or not.

That's what I meant. I'm launching a new run in a moment.
Marek, you are right and I must have been "lucky" yesterday when I tested it. I launched two runs, and hit two different hanging tests this time:

glean/polygonOffset (first run)
glean/pointAtten (second run)
Do either of Jerome's patches help?
(In reply to comment #29)
> Do either of Jerome's patches help?

I didn't have time to test them yesterday; I'll probably try them at the end of the day.
(In reply to comment #29)
> Do either of Jerome's patches help?

Applied both, ran r600.tests twice and everything went fine. I'll test with only one patch applied at a time later today.
Attachment 77608 [details] fixes the lockups, which suggests the DRM driver doesn't actually flush caches when it should.
(In reply to comment #32)
> Attachment 77608 [details] fixes the lockups, which suggests the DRM driver
> doesn't actually flush caches when it should.

radeon_ring_vm_flush() doesn't actually flush caches per se; it writes the new VM page table base address. So presumably we are not handling last_flush properly somewhere, which results in a stale VM page table pointer.
*** Bug 62997 has been marked as a duplicate of this bug. ***
Alex, despite having your 2nd patch in, I found:

Apr 17 16:26:07 o2 kernel: [91224.372170] radeon 0000:00:01.0: GPU fault detected: 146 0x0594260c
Apr 17 16:26:07 o2 kernel: [91224.372175] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000A59
Apr 17 16:26:07 o2 kernel: [91224.372178] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0402600C
Apr 17 16:26:07 o2 kernel: [91224.372181] radeon 0000:00:01.0: GPU fault detected: 146 0x0594260c
Apr 17 16:26:07 o2 kernel: [91224.372184] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 17 16:26:07 o2 kernel: [91224.372187] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr 19 17:19:08 o2 kernel: [132471.330610] radeon 0000:00:01.0: GPU fault detected: 147 0x06d37002
Apr 19 17:19:08 o2 kernel: [132471.330614] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000316D
Apr 19 17:19:08 o2 kernel: [132471.330617] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03070002
Alex, despite having your 2nd patch in, the box just crashed and rebooted. I'll revert to your first patch to see if that one still helps.
(In reply to comment #36)
> Alex, despite having your 2nd patch in, the box just crashed and rebooted.
> I'll revert to your first patch to see if that one still helps.

Judging by bug 62997, which you reported, you may be hitting more than one bug. I had to report a couple of bugs myself about VM and DMA on Cayman. Could you try kernel 3.9-rc7? I know a couple of patches went in there that could help you.
3.9.0-rc7 here.

Just saw this:

[   67.568697] Bluetooth: BNEP socket layer initialized
[ 2144.364903] radeon 0000:00:01.0: GPU fault detected: 147 0x0d422602
[ 2144.364908] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000D6D4
[ 2144.364911] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02026002
[ 2144.364913] radeon 0000:00:01.0: GPU fault detected: 147 0x0d422602
[ 2144.364915] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 2144.364918] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
(In reply to comment #38)
> 3.9.0-rc7 here.
>
> Just saw this:
>
> [ 2144.364903] radeon 0000:00:01.0: GPU fault detected: 147 0x0d422602
> [ 2144.364908] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000D6D4
> [ 2144.364911] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02026002
> [ 2144.364913] radeon 0000:00:01.0: GPU fault detected: 147 0x0d422602
> [ 2144.364915] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
> [ 2144.364918] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

Are you getting the same result with and without R600_DEBUG=nodma (as per bug 62997)? Also, did you try applying patches on top of 3.9-rc7? Just getting some info here to see if anything is making a difference.
In /etc/environment I have had R600_DEBUG=nodma ever since discovering that setting, i.e. R600_DEBUG is also set to nodma right now.

I did not apply any of your patches on top of the 3.9.0-rc7 kernel.
(In reply to comment #40)
> In /etc/environment I have had R600_DEBUG=nodma ever since discovering that
> setting, i.e. R600_DEBUG is also set to nodma right now.
>
> I did not apply any of your patches on top of the 3.9.0-rc7 kernel.

If I were you, I would test without R600_DEBUG=nodma to see if there is any difference between kernels 3.8 and 3.9. I would also try patching 3.9-rc7 with attachment 77608 [details] [review] or attachment.
(In reply to comment #40)
> In /etc/environment I have had R600_DEBUG=nodma ever since discovering that
> setting, i.e. R600_DEBUG is also set to nodma right now.
>
> I did not apply any of your patches on top of the 3.9.0-rc7 kernel.

Your log looks similar to the one in bug 58354, which is also related to DMA, or bug 59089, which is related to htile/VM. The best thing would be to bisect the kernel between a known good version and the first bad one. Do you have a previous kernel version that was working OK as a good reference?
3.6.11 was OK but also maybe is kinda 'old'.
(In reply to comment #43) > 3.6.11 was OK but also maybe is kinda 'old'. Then try a 3.7 kernel if possible to see if a first bug was introduced there or if it all happened in the 3.8 branch. If I remember correctly, there were some changes about VM in 3.7 and some others about DMA in 3.8. Doing so should allow us to work on one bug/change at a time. Are you using latest mesa, drm and ddx?
drm 2.4.44, git for the rest.
Booted 3.7.8.
And it crashed and rebooted.
(In reply to comment #47)
> And it crashed and rebooted.

Any messages in your logs? And I think you now have a reference point for bisecting.
No messages that I could find in /var/log/messages; Xorg.0.log and the like didn't help either. I've been running 3.7.6 since shortly after that unplanned reboot. So far it's OK.
It crashed so we'll go for 3.7.5.
(In reply to comment #50) > It crashed so we'll go for 3.7.5. I'm sure 3.7.0 will already display the problem. It was probably introduced between 3.6 and 3.7-rc1. Have you ever bisected before? I could help you with it if you want to.
Regarding bisecting, I found the info at http://webchick.net/node/99. Next week I do not have to go to work, so I could give it a try. Where do I get the sources from? And which sha1s refer to the radeon commits in 3.7-rc1?
(In reply to comment #52)
> Regarding bisecting, I found the info at http://webchick.net/node/99.
> Next week I do not have to go to work, so I could give it a try.
> Where do I get the sources from?
> And which sha1s refer to the radeon commits in 3.7-rc1?

You will get the source from git.kernel.org. The first sync will take a while. Something like this should get you the whole Linus linux tree into the linux-git folder:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git

Your link about how to use git seems good. However, I would skip straight to bisecting, since you already have a known good and a known bad version.

You'll have to sync your tree to a known good or bad version (assuming you didn't change the code, or that you are willing to lose your changes):

git reset --hard v3.7.5

You are now ready to bisect:

git bisect start

Tag this known bad version as "bad":

git bisect bad

This will tag the current version in the git tree as bad, OR you can use "git bisect bad v3.7.5" to tag a specific tag. You can do the same with a specific commit.

Tag the known good version as "good":

git bisect good v3.6.11

Git will do its work and let you know how many iterations will be needed to find the first bad commit. It will then sync to a new commit. You have to compile this kernel, install it and test it.

Each time you are sure a given kernel is good or bad, you will have to tell git by using "git bisect good" or "git bisect bad". Git will move to the next iteration, and you will have to configure, compile, install and test again until you end up identifying the first bad commit. The nearer you get to the end of the bisection, the faster it will be to configure and compile the kernel (fewer commits, thus fewer changes).

Once you're done and you've reported the bad commit here, you'll have to stop bisecting and get back to where you started:

git bisect reset
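To make the per-iteration cycle concrete, the build-and-test loop could look roughly like this; a rough sketch only, assuming a conventional "make install" workflow (adjust the config and install steps to your distribution):

cd linux-git
git bisect start
git bisect bad v3.7.5
git bisect good v3.6.11

# At each commit git bisect checks out:
make oldconfig                        # reuse the existing .config, answer prompts for new options
make -j"$(nproc)"                     # build the kernel image and modules
sudo make modules_install install     # install them (assumes the bootloader picks up the new kernel)
sudo reboot

# After testing the freshly booted kernel:
git bisect good                       # or: git bisect bad
# ...repeat until git prints the first bad commit, then: git bisect reset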
This afternoon I found a box, still somewhat alive, with a crashed Xorg, showing a textmode boot-up screen. It was running 3.6.11. So if we assume the hardware is OK and that 3.6.11 was indeed solid, then either mesa, dri or the radeon video driver is causing this. Now running 3.7.1 to see if we can get a log message of some sort, since I did not see any with the recent crashes.
Also: where are the minor kernel versions in that tree? It goes from 3.7-rcX to 3.6-rcX.
For stable kernels you need to use the stable branches:

git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

If you've already checked out Linus' tree, you can add the stable tree as a remote:

git remote add stable git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
git fetch stable
(In reply to comment #55)
> Also: where are the minor kernel versions in that tree? It goes from 3.7-rcX
> to 3.6-rcX.

I'm sure you can use Linus' tree (between v3.6.0 and v3.7.0, for example). If you were to use minor versions, you would probably end up pointing at a change backported from a newer version in the best case.
Ok, just starting the first iteration compile. Is there a better testcase for the issue(s) we look for than using the PC for a while, watching youtube, etc?
(In reply to comment #58) > Ok, just starting the first iteration compile. > Is there a better testcase for the issue(s) we look for than using the PC > for a while, watching youtube, etc? You could run piglit tests, that would stress the computer. Otherwise, try to trigger the bug in a reproducible way.
The first bisect kernel I tried gives me YouTube videos that are blocks of gibberish. What should I do about that?

3.7.1 crashed like the previous kernels (showing the text boot screen) but left no messages in the log. So is that the same problem we're seeing and searching for?
I compiled drm-next yesterday (which should land in kernel 3.10 any day now). I've been able to run piglit r600.tests twice without any problem (just in case, I rebooted between each run). Is there anything pushed in there that is expected to help?
(In reply to comment #60)
> The first bisect kernel I tried gives me YouTube videos that are blocks of
> gibberish. What should I do about that?
>
> 3.7.1 crashed like the previous kernels (showing the text boot screen) but
> left no messages in the log.
> So is that the same problem we're seeing and searching for?

Well, I would continue bisecting until you find the first problematic commit that crashes your setup. You may be hitting more than one bug, so keep track (commit and results) of what you see in between, in case they are not linked to the same bug.
Currently running 3.6.0-02886-gd9a8074 and that one has held up OK for 28 hours so far. How long should I continue before declaring this kernel good? The previous bisect kernel, 3.6.0-05487-g24d7b40, was found within 24 hours to have crashed in the boot-up textmode-screen manner. The 3.8.10 comment is interesting, as the changelog does not mention radeon.
The weird thing is that with 3.8.10 the box has been stable for a few days without weird radeon-related errors. Currently trying 3.9.1 with git mesa, llvm, libclc, xf86-video-ati, etc.
Running kernel 3.9.x (3.9.4 now), git mesa and the git ati driver, I had no issues with these adjustments:

# cat /etc/environment
LIBGL_DRIVERS_PATH=/opt/xorg/lib/dri/
R600_DEBUG=nodma

and:

# git diff
diff --git a/src/gallium/winsys/radeon/drm/radeon_drm_winsys.c b/src/gallium/winsys/radeon/drm/radeon_drm_winsys.c
index 15d5d31..5b1d0fb 100644
--- a/src/gallium/winsys/radeon/drm/radeon_drm_winsys.c
+++ b/src/gallium/winsys/radeon/drm/radeon_drm_winsys.c
@@ -399,6 +399,7 @@ static boolean do_winsys_init(struct radeon_drm_winsys *ws)
                                   &ws->info.r600_ib_vm_max_size))
             ws->info.r600_virtual_address = FALSE;
     }
+    ws->info.r600_virtual_address = FALSE;
 }

 /* Get max pipes, this is only needed for compute shaders.  All evergreen+

(maybe not relevant)

No GPU hangs. No weird radeon-related stuff in /var/log/messages. I could even run bfgminer successfully with tstellard's recent Cayman fixes for llvm/OpenCL.

Bisecting got me nowhere. :-( So what would be a next step?
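Patching the winsys like that should be roughly the same as disabling GPU virtual memory from the environment (the RADEON_VA=0 setting I come back to below); a rough sketch, with the exact handling of the variable being an assumption on my part:

# Disable GPU virtual memory for the radeon gallium winsys without patching Mesa:
RADEON_VA=0 glxinfo | grep -i "opengl renderer"   # quick per-process sanity check
# or set it system-wide next to R600_DEBUG=nodma in /etc/environment:
# RADEON_VA=0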
While I'm the one who opened this bug, on my side I've been able to run all piglit tests without any hangs for a while now. I don't even need to run one test at a time anymore. But I'm using the latest git versions of mesa, drm and the ddx with a 3.10 kernel, if that can be of any help to those who are adding themselves to the CC list.
(In reply to comment #66) > While I'm the one who opened this bug, on my side I'm able to run all piglit > tests without any hangs since awhile now. Even with GPU virtual memory enabled? If so, this report can be resolved as fixed?
I still use RADEON_VA=0 to avoid GPU lockups etc. Was anything changed so there's reason to test with RADEON_VA=1?
(In reply to comment #67) > (In reply to comment #66) > > While I'm the one who opened this bug, on my side I'm able to run all piglit > > tests without any hangs since awhile now. > > Even with GPU virtual memory enabled? If so, this report can be resolved as > fixed? This bug has been fixed by the kernel patch 466476dfdcafbb4286ffa232a3a792731b9dc852 for quite a long time as far as 3D support is concerned. Some say that OpenCL still locks up, but I think that's a different issue.
Tom Stellard advised me not to use virtual memory, first via a patch and later with RADEON_VA=0, as OpenCL started to work for Cayman-class (ARUBA here, in an A10-5800K) graphics. Which bug should I follow for virtual memory and OpenCL?
(In reply to comment #67) > (In reply to comment #66) > > While I'm the one who opened this bug, on my side I'm able to run all piglit > > tests without any hangs since awhile now. > > Even with GPU virtual memory enabled? If so, this report can be resolved as > fixed? I'll have to double check, but from what I remember, yes even with VM enabled. But I'm not playing with OpenCL. It's just that I was seeing new CCers being added and this bug is still open because Udo was experiencing a problem that looked like the current bug.
I can see if the bug is still present (aside from OpenCL usage) in normal desktop usage (mail, web, youtube, etc).
(In reply to comment #71) > It's just that I was seeing new CCers being added and this bug is still open > because Udo was experiencing a problem that looked like the current bug. As you said, you're the reporter of this bug. If the problem you reported here is fixed, please resolve it accordingly. If Udo is still having problems, he should file his own report.
(In reply to comment #73) > (In reply to comment #71) > > It's just that I was seeing new CCers being added and this bug is still open > > because Udo was experiencing a problem that looked like the current bug. > > As you said, you're the reporter of this bug. If the problem you reported > here is fixed, please resolve it accordingly. If Udo is still having > problems, he should file his own report. Then fixed and closed it is.