I booted a freshly compiled kernel from Linus' git tree (20121006). gdm starts, but the screen is all black, and after a couple of seconds it is all garbage. I VT-switch to 1 and try restarting gdm, and I get the oops. xf86-video-ati git from 20121004, mesa git from 20121004. Using Arch with 3.6 works fine.
Created attachment 68153 [details] dmesg.3.7.0-rc0
Created attachment 68154 [details] dmesg.3.7.0-rc0 with irqpoll
Created attachment 68155 [details] oops pic
Created attachment 68156 [details] Xorg.0.log with 3.7-rc0
Created attachment 68157 [details] Xorg.0.log with 3.6
Created attachment 68159 [details] dmesg with 3.6
Can you bisect to locate the problematic commit?
Here it is:

2a6f1abbb48f1d90f20b8198c4894c0469468405 is the first bad commit
commit 2a6f1abbb48f1d90f20b8198c4894c0469468405
Author: Christian König <deathsimple@vodafone.de>
Date:   Sat Aug 11 15:00:30 2012 +0200

    drm/radeon: make page table updates async v2

    Currently doing the update with the CP.

    v2: Rebased on Jeromes bugfix. Make validity comparison more human readable.

    Signed-off-by: Christian König <deathsimple@vodafone.de>

:040000 040000 3ed3f64bd42f5f1000ab9e957df08f53e81e09d9 c5143cbc30add8e3472366fbdb84756d9cdcd035 M drivers
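For readers unfamiliar with bisecting, the search behind a "first bad commit" result can be sketched as a binary search. This is a simplified illustration only: it assumes a linear history ordered oldest to newest in which every commit after the first bad one is also bad, whereas real git bisect works on the full commit graph. The commit ids and the `is_bad` test below are hypothetical stand-ins for "build this kernel, boot it, and see whether gdm hangs".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, that the last commit is
    bad, and that every commit after the first bad one is also bad.
    """
    if not commits:
        raise ValueError("need at least one commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid + 1  # the first bad commit is after mid
    return commits[lo]


# Hypothetical linear history in which "c5" introduced the regression.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
bad_from = history.index("c5")
print(first_bad(history, lambda c: history.index(c) >= bad_from))
```

Each step halves the remaining range, which is why a bisect between two kernel releases only needs a dozen or so builds rather than one per commit.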
Mhm, interesting. You get a GPU lockup, but not a pagefault. Need to look deeper into it, but this looks rather strange to me.
Created attachment 68515 [details] Possible fix. Could you try the attached patch on top of Alex's latest drm-next-3.7 branch (git://people.freedesktop.org/~agd5f/linux)? I'm not 100% sure that this is the problem, but it might be. Thanks, Christian.
Created attachment 68516 [details] Possible fix rebased on correct branch.
Yes the patch works.
(In reply to comment #12)
> Yes the patch works.

I'm sorry, I spoke too soon. Same problem.
Created attachment 68519 [details] dmesg linus git with patch
I've tried the patch on the drm-next-3.7 branch of git://people.freedesktop.org/~agd5f/linux and it doesn't work. gdm sets the blue background image and freezes, with no top bar or login dialog. I ssh in from another computer and dmesg is clean at this point. When I try to stop gdm, it displays some garbage (a mostly black screen with some vertical purple bars about 4 cm thick, about 2 cm from the top of the screen), then the GPU crash messages show up in the log, and then the console comes back.
Created attachment 68531 [details] dmesg with alex's drm-next-3.7 branch with patch
It works with linus git without the patch, with the Arch packages for mesa 9.0-1 and -ati 6.14.6-2. I tried with -ati git and mesa 9 and it worked. Then I tried with mesa git and it failed. I started to bisect mesa but I got the following:

$ git bisect bad
Bisecting: a merge base must be tested
[2d2f1fd164218eacf2b142bc808be1f25f66e72c] docs: Add some missing features to 9.0 release notes and GL3.txt
$ git bisect bad
The merge base 2d2f1fd164218eacf2b142bc808be1f25f66e72c is bad.
This means the bug has been fixed between 2d2f1fd164218eacf2b142bc808be1f25f66e72c and [e5fdeef1e08b55acd48dc68f0cc8fe213f2820b8].

So I ran git log --graph --oneline --all and checked out the commits between those two one at a time. Commits from 2d2f1fd to de92b7a are bad, and with commit "ef557ea winsys/radeon: disable virtual memory on Cayman" it started working.
Is VM enabled or disabled on your system? I'm experiencing a similar bug with kernel 3.7-rc1, but it is working fine with 3.6. VM is enabled on my system. I'll try to disable it when I get home to see if that helps, and I'll also try to bisect the kernel commit that screwed things up for me.
(In reply to comment #18)
> Is VM enabled or disabled on your system? I'm experiencing a similar bug
> with kernel 3.7-rc1, but it is working fine with 3.6. VM is enabled on my
> system, I'll try to disable it when I'll get home to see if that helps and
> I'll also try to bisect the kernel commit that screwed things for me.

I don't know. How can I check?
mesa-git is working fine on linux 3.6, and mesa-git doesn't have the "ef557ea winsys/radeon: disable virtual memory on Cayman" commit.
(In reply to comment #19)
> (In reply to comment #18)
> > Is VM enabled or disabled on your system? I'm experiencing a similar bug
> > with kernel 3.7-rc1, but it is working fine with 3.6. VM is enabled on my
> > system, I'll try to disable it when I'll get home to see if that helps and
> > I'll also try to bisect the kernel commit that screwed things for me.
>
> I don't know, how can i check?

Use "setenv" in a terminal and look for "RADEON_VA".
(In reply to comment #21)
> (In reply to comment #19)
> > (In reply to comment #18)
> > > Is VM enabled or disabled on your system? I'm experiencing a similar bug
> > > with kernel 3.7-rc1, but it is working fine with 3.6. VM is enabled on my
> > > system, I'll try to disable it when I'll get home to see if that helps and
> > > I'll also try to bisect the kernel commit that screwed things for me.
> >
> > I don't know, how can i check?
>
> Use "setenv" in a terminal and look for "RADEON_VA".

Oh, I have nothing like that in env.
Created attachment 68623 [details] [review] Test patch. VM is definitely enabled, otherwise you wouldn't get that error in the first place. OK, let's try to narrow down that bug a bit more: please apply the attached test patch and see what happens. If the GPU hang vanishes, we do indeed have a syncing issue, but not the PFP sync.
(In reply to comment #23) > Created attachment 68623 [details] [review] [review] > Test patch. > > VM is definitely enabled, otherwise you won't got that error in the first > place. > > Ok let's try to narrow down that bug a bit more, please apply the attached > test patch and see what happens. > > If the GPU hang vanished we indeed have a syncing issue, but not the PFP > sync. The patch resets the gpu constantly, even without X, with both linus git and agd5f drm-next-3.7 branch with mesa git.
Created attachment 68655 [details] dmesg.3.7-rc1 with test patch
(In reply to comment #23)
> Created attachment 68623 [details] [review] [review]
> Test patch.
>
> VM is definitely enabled, otherwise you won't got that error in the first
> place.
>
> Ok let's try to narrow down that bug a bit more, please apply the attached
> test patch and see what happens.
>
> If the GPU hang vanished we indeed have a syncing issue, but not the PFP
> sync.

It is and it is not. What I mean concerns comment 17: "So i did a git log --graph --oneline --all and started to git checkout between those two commits, starting from 2d2f1fd to de92b7a are bad and with commit "ef557ea winsys/radeon: disable virtual memory on Cayman" it started working."

Before commit "ef557ea", VM is always enabled. From that commit on, VM gets disabled if the variable "RADEON_VA" is not set or doesn't exist, so we must be careful. If we want to test after commit "ef557ea" with VM enabled, "RADEON_VA" MUST be set, otherwise VM will be disabled and that will hide the bug.
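A quick way to see whether "RADEON_VA" is present in the environment a program actually inherits is sketched below. It only reports whether the variable is set and what it holds. Which values mesa accepts for it is not stated in this thread, so the sketch makes no assumption about that, and `radeon_va_setting` is a hypothetical helper name.

```python
import os


def radeon_va_setting(env=None):
    """Return the value of RADEON_VA, or None if the variable is not set."""
    env = os.environ if env is None else env
    return env.get("RADEON_VA")


value = radeon_va_setting()
if value is None:
    print("RADEON_VA is not set")
else:
    print(f"RADEON_VA={value}")
```

Note that this reflects the environment of the process running the check. A display manager started at boot may see a different environment than an interactive shell.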
Well that's interesting, according to the logs you are running out of GART memory (which is 512MB in size) just 7 seconds after boot, and that is really odd. Could you please tell me what the heck you're doing to run out of memory? Is there some kind of animated splash screen running or something like that? I think that this problem shows up when you're tight on memory AND try to use VM at the same time. Probably we're missing some return value check or something like this. Anyway, as Alexandre Demers pointed out simply disabling VM should also help. In the meantime I will try to test the VM implementation under memory pressure, maybe that will yield some results. Cheers, Christian.
(In reply to comment #27) > Well that's interesting, according to the logs you are running out of GART > memory (which is 512MB in size) just 7 seconds after boot, and that is > really odd. > > Could you please tell me what the heck you're doing to run out of memory? Is > there some kind of animated splash screen running or something like that? > > I think that this problem shows up when you're tight on memory AND try to > use VM at the same time. Probably we're missing some return value check or > something like this. > > Anyway, as Alexandre Demers pointed out simply disabling VM should also help. > > In the meantime I will try to test the VM implementation under memory > pressure, maybe that will yield some results. > > Cheers, > Christian. I don't have anything graphical running during boot. I have radeon in mkinitcpio MODULES, no plymouth or anything just console, that sets up the mode then straight to gdm.
I haven't had time to dig into it, but just to let you know, I'm pretty much in the same situation as Serkan with a very similar config. I don't think it has to do with something using too much memory, but more with memory not being released or allocated correctly in the first place. Otherwise, why would it work with kernel 3.6 and not 3.7 if the kernel version is the only thing that changed? I should have time to look at it tonight.
Well, the log for comment #25 shows out of memory, which should not happen. It looks like it's the framebuffer that tries to go into GTT, but that doesn't make sense (the fb size is 16M according to the log).
(In reply to comment #29)
> I haven't had time to dig it, but just to let you know I'm pretty much in
> the same situation as Serkan with a very similar config. I don't think it
> has to do with something using too much memory, but more about not
> releasing/attributing it correctly in the first place. Otherwise, why would
> it work with kernel 3.6 and not 3.7 if only kernel version is in the balance?
>
> I should have time to look at it tonight.

I think the GART memory issue is because of my recent update to gnome 3.6; I didn't see that with gdm 3.4. The machine also boots very fast now after the systemd upgrade: from grub to gdm I would say it's about 5~7 seconds. Also, when grub starts, the screen stays at the console login prompt with the mouse cursor available, and it takes about 2~3 seconds until gdm starts doing its fading thing to the login prompt. I will try to revert it and test again when I get home.
Another explanation might be that gdm queues a bunch of animations in the form of big BOs and thus fills up the GART before the first GPU lockup has had a chance to be detected.
(In reply to comment #32) > Other explanation might be that the gdm admin queue a bunch of animation in > form of big bo and thus fill up the gart before the first gpu lockup had a > chance to be detected. I'll try lightdm or straight startx from console too
If Serkan and I are experiencing the same problem, as I suspect, I would say this is unlikely to be related to Gnome 3.6, because I'm still using 3.4 (with both kernel 3.6 and 3.7-rc1). We have the same GPU and we are not using plymouth. We are experiencing similar visual problems (I can't confirm with a remote connection for now) when moving to kernel 3.7-rcX, but not with 3.6. I'll bisect the kernel tonight and keep you updated when I'm done.
I've been playing a bit (booting and restarting with kernel 3.7-rc1) and strangely, what I see is very similar to what I was observing in bug 43655. It was then merged with bug 42373. At the time, attachment 64759 [details] [review] was proposed and a similar patch ended up being committed that fixed bug 43655 for me (but it never fixed bug 42373 on NI CAICOS). I'll try the workaround used at the time to see if it is really related to bug 43655 (comments 8 and 10), and I'll begin bisecting the kernel right after.
(In reply to comment #35) > I've been playing a bit (booting and restarting with kernel 3.7-rc1) and > strangely, what I see is very similar to what I was observing in bug 43655. > It was then merged with bug 42373. At the time, attachment 64759 [details] [review] > [review] was proposed and a similar patch ended up being commited that fixed > bug 43655 for me (but it never fixed bug 42373 on NI CAICOS). > > I'll try the workaround used at the time to see if it is really related to > bug 43655 (comments 8 and 10) and I'll begin bisecting kernel right after. I think what you really want for your caicos is this patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=62444b7462a2b98bc78d68736c03a7c4e66ba7e2
Created attachment 68728 [details] dmesg.3.7-rc1 with testpatch with mesa-git

I removed gdm and installed slim as login manager. I also installed cinnamon as a replacement for gnome, and it worked fine the first round with linus git with the test patch and mesa git. I restarted slim and logged in again and there was some font corruption; I restarted cinnamon and it was gone. I tried google maps with webgl enabled and it was working fine.

After that I edited my .xinitrc to start gnome, restarted slim and logged in, but it failed and I got the error window saying "oh no something has gone wrong" with a log out button. I checked dmesg at that point and saw the ttm gart memory error. I switched back to cinnamon, logged in and got the same font corruption; restarting cinnamon fixed it.
Using the same linus git kernel with test patch and mesa git, I've reverted gnome to 3.4, kept slim as login manager, logged in to gnome, it worked fine, no errors in dmesg. I stopped slim, installed gdm and started it and logged in without any errors. I disabled slim and enabled gdm instead and rebooted the computer. Gdm login came up, i logged in and it worked fine.
(In reply to comment #36) > (In reply to comment #35) > > I've been playing a bit (booting and restarting with kernel 3.7-rc1) and > > strangely, what I see is very similar to what I was observing in bug 43655. > > It was then merged with bug 42373. At the time, attachment 64759 [details] [review] [review] > > [review] was proposed and a similar patch ended up being commited that fixed > > bug 43655 for me (but it never fixed bug 42373 on NI CAICOS). > > > > I'll try the workaround used at the time to see if it is really related to > > bug 43655 (comments 8 and 10) and I'll begin bisecting kernel right after. > > I think what you really want for your caicos is this patch: > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit; > h=62444b7462a2b98bc78d68736c03a7c4e66ba7e2 You misunderstood me. I'm using a 6950 (not CAICOS) and it was working great with commit http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=81ee8fb6b52ec69eeed37fe7943446af1dccecc5 for my Cayman (but not for CAICOS according to reporter of bug 42373) included in kernel 3.6. What I'm saying is that the symptoms I'm now seeing with 3.7-rc1 are similar to what I was seeing at the time, but it was fixed in 3.6. Now, about the patch you propose, it is already included in kernel 3.7-rc1 according to commit history. Since I'm experiencing bug 55692 with 3.7-rc1, the proposed patch can't be the cure. I'm bisecting right now between kernel 3.6 and 3.7-rc1. If it appears to be a different bug than 55692, I'll open a new one.
(In reply to comment #36)
> (In reply to comment #35)
> > I've been playing a bit (booting and restarting with kernel 3.7-rc1) and
> > strangely, what I see is very similar to what I was observing in bug 43655.
> > It was then merged with bug 42373. At the time, attachment 64759 [details] [review] [review]
> > [review] was proposed and a similar patch ended up being commited that fixed
> > bug 43655 for me (but it never fixed bug 42373 on NI CAICOS).
> >
> > I'll try the workaround used at the time to see if it is really related to
> > bug 43655 (comments 8 and 10) and I'll begin bisecting kernel right after.
>
> I think what you really want for your caicos is this patch:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=62444b7462a2b98bc78d68736c03a7c4e66ba7e2

Kernel bisected. Here is the culprit commit from what I see here:

62444b7462a2b98bc78d68736c03a7c4e66ba7e2 is the first bad commit
commit 62444b7462a2b98bc78d68736c03a7c4e66ba7e2
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Wed Aug 15 17:18:42 2012 -0400

    drm/radeon: properly handle mc_stop/mc_resume on evergreen+ (v2)

    - Stop the displays from accessing the FB
    - Block CPU access
    - Turn off MC client access

    This should fix issues some users have seen, especially
    with UEFI, when changing the MC FB location that result
    in hangs or display corruption.

    v2: fix crtc enabled check noticed by Luca Tettamanti

    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 3e0d33c9b4eda29ced814fe9a863efe63e53f14c 4932561607b160734ec1eade927a9fe18c9f3f1b M drivers

So I may not be hitting the same bug as Serkan. Where should I track this faulty commit/bug? In the NI CAICOS bug or in a new one?
(In reply to comment #37)
> Created attachment 68728 [details]
> dmesg.3.7-rc1 with testpatch with mesa-git
>
> I removed gdm and installed slim as login manager. Also installed cinnamon
> as a replacement for gnome and it works fine the first round with linus git
> with the test patch and mesa git. Restarted slim and logged in again and
> there were some font corruptions, i restarted cinnamon and they were gone. I
> tried google maps with webgl enabled and it was working fine.
>
> After that i edited my .xinitrc to startup gnome, restarted slim and logged
> in but it failed and got the error window saying oh no something has gone
> wrong and a log out button. I checked dmesg at that point and saw the ttm
> gart memory error. i switched back to cinnamon logged in and got the same
> font corruptions, restarting cinnamon fixed them.

Thanks a lot for your additional testing. As I suspected, we are really facing two problems here:

1. The new gnome/gdm versions seem to trigger an out of memory situation in the GART memory area. That's probably because of some miscalculation or memory leak or something like this, and should be handled as a separate bug. BTW: You can take a look at the current memory allocations with:

sudo cat /sys/kernel/debug/dri/0/radeon_gtt_mm
and
sudo cat /sys/kernel/debug/dri/0/radeon_vram_mm

2. Properly updating the page table asynchronously somehow fails under high memory pressure.

I will try to look into problem 2 first, since that got added with my patch. But problem number 1 is equally bad. I don't think we just spool up a lot of drawing operations like Jerome suspected, because in that case TTM would just block on previous render operations to complete. It looks more like we are submitting a single draw operation with multiple ~16MB chunks of memory that is so big that it just won't fit into the GART memory altogether.
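To capture both allocation dumps in one go (for example, to attach them to a bug report), something along these lines may help. It is a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and the card is DRI device 0, as in the two commands above. The helper name `read_mm_dumps` is made up for this example, and the script normally needs to run as root.

```python
from pathlib import Path

# Paths taken from the commands above; device index 0 is an assumption.
DEBUGFS_DIR = Path("/sys/kernel/debug/dri/0")
FILES = ("radeon_gtt_mm", "radeon_vram_mm")


def read_mm_dumps(base=DEBUGFS_DIR, names=FILES):
    """Return {name: text} for each dump; text is None if it cannot be read."""
    dumps = {}
    for name in names:
        try:
            dumps[name] = (Path(base) / name).read_text()
        except OSError:
            # Covers a missing file (debugfs not mounted, other driver)
            # as well as a permission error (not running as root).
            dumps[name] = None
    return dumps


for name, text in read_mm_dumps().items():
    print(f"== {name} ==")
    print(text if text is not None else "(not readable: is debugfs mounted, and are you root?)")
```

Taking one dump right after boot and another right after the failure makes it easier to see which allocations filled the GART.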
(In reply to comment #40)
[SNIP]
> kernel bisected. Here is the culprit commit from what I see here:
> 62444b7462a2b98bc78d68736c03a7c4e66ba7e2 is the first bad commit
> commit 62444b7462a2b98bc78d68736c03a7c4e66ba7e2
> Author: Alex Deucher <alexander.deucher@amd.com>
> Date: Wed Aug 15 17:18:42 2012 -0400
>
> drm/radeon: properly handle mc_stop/mc_resume on evergreen+ (v2)
>
> - Stop the displays from accessing the FB
> - Block CPU access
> - Turn off MC client access
>
> This should fix issues some users have seen, especially
> with UEFI, when changing the MC FB location that result
> in hangs or display corruption.
>
> v2: fix crtc enabled check noticed by Luca Tettamanti
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>
> :040000 040000 3e0d33c9b4eda29ced814fe9a863efe63e53f14c
> 4932561607b160734ec1eade927a9fe18c9f3f1b M drivers
>
> So it may not be the same bug I'm hitting as Serkan is. Where should I track
> this faulty commit/bug? In the NI CAICOS bug or in a new one?

That indeed looks like a separate bug to me, so I suggest opening a new bug.
Good news! I figured out what it is (the crash, not the memory problem) and can reproduce it. A patch fixing this shouldn't be too much of a problem any more, but I don't think I will have time to fix it before Monday. So please be patient for a couple more days.
(In reply to comment #43)
> Good news! I figured out what it is (the crash not the memory problem) and
> can reproduce it.
>
> A patch fixing this shouldn't be to much of a problem any more, but I don't
> think I will have time to fix it before Monday.
>
> So please be patient for a couple of more days.

That's cool. I found out what triggers the GART error: I had gtk-redshift in my session startup. After removing that, the TTM error is gone. It redshifts the screen colors so that they are easy on the eyes, and when it starts it applies the redshift gradually.

Also, I have been playing around with the RADEON_VA variable but I can't trigger the GPU stall anymore. I get some graphical corruption and a couple of these instead:

[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!

After a shell restart, the glitches go away.
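As a side note on reading that message: Linux kernel code conventionally returns negated errno values, and on Linux errno 12 is ENOMEM. Nobody in this thread spells out what -12 means here, so treat the interpretation as an assumption, but it would be consistent with the GART memory pressure discussed above. The lookup can be sketched like this (`describe_kernel_error` is a hypothetical helper):

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel-style return code to its errno name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return f"{code}: {name} ({os.strerror(number)})"


print(describe_kernel_error(-12))
```

The errno table is platform specific, so the lookup is only meaningful when run on Linux, matching the kernel that produced the message.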
> Thats cool. I found out what triggers the gart error. I had gtk-redshift on
> session start up. After removing that the ttm error is gone. It redshifts
> the screen colors so that it is easy on the eyes and when its started it
> starts the redshifting gradually.
>

Scratch that. I removed redshift but the GART error happened again. It's not the gdm startup though; it happens during gnome session startup.
Created attachment 68906 [details] [review] Possible fix. OK, please try the attached patch. It should fix the issue with the original "async page table updates" patch. Please note that Alex's current drm-fixes-3.7 branch already contains another patch that also masks this problem, so please test with the original drm-next-3.7 branch. I've submitted a series of patches that should fix and clean up the code.
(In reply to comment #46)
> Created attachment 68906 [details] [review] [review]
> Possible fix.
>
> Ok, please try the attached patch. It should fix the issue with the original
> "async page table updates patch".
>
> Please note that Alex current drm-fixes-3.7 branch already contains another
> patch that is also masquerading this problem, so please test with the
> original drm-next-3.7 branch.
>
> I've submitted a series of patches that should fix and cleanup the code.

Yes, the patch works. I checked out v3.6, merged Alex's drm-next-3.7 branch on top, and tested with mesa-git and ati-git. Because of the gnome update I don't get exactly the same dmesg errors, but the result is the same: the GPU just stalls when you try to log in. After the patch I am able to log in. I still get a couple of relocation errors and some glitches, which disappear after restarting gnome shell.
Created attachment 68932 [details] dmesg-3.6+drm-next-3.7
Created attachment 68933 [details] dmesg-3.6+drm-next-3.7+patch
You'll probably want the updated version of the patch here: http://lists.freedesktop.org/archives/dri-devel/2012-October/029292.html
Since the patch was submitted and applied on kernel 3.7, should this bug be closed?
Yes this is fixed.