Bug 41838 - Kernel Crash/Hanging system in connection between WebKit and Gnome-Shell
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/Radeon
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: xf86-video-ati maintainers
QA Contact: Xorg Project Team
Whiteboard: 2011BRB_Reviewed
 
Reported: 2011-10-16 10:03 UTC by Peter Weber
Modified: 2012-03-15 12:58 UTC

Attachments
dmesg output (37.67 KB, text/plain)
2011-10-17 11:02 UTC, Peter Weber
xorg log (50.09 KB, text/plain)
2011-10-17 11:04 UTC, Peter Weber
lspci output (2.11 KB, text/plain)
2011-10-17 11:05 UTC, Peter Weber
pacman -Q output, list of installed packages (15.65 KB, text/plain)
2011-10-17 11:06 UTC, Peter Weber
xorg.log after crash/hang of system (58.61 KB, text/plain)
2011-10-27 12:37 UTC, Peter Weber
glxinfo output of kernel 3.1 (24.43 KB, text/plain)
2011-11-06 03:06 UTC, Peter Weber
glxinfo output kernel 2.6.37 (24.20 KB, text/plain)
2011-11-06 03:07 UTC, Peter Weber
glxinfo of kernel 2.6.39.3 (24.43 KB, text/plain)
2011-11-06 03:07 UTC, Peter Weber
remove r600_ioctl_wait_idle for evergreen (r800) based cards (436 bytes, patch)
2011-11-06 07:11 UTC, Peter Weber
flush HDP via the ring (5.36 KB, patch)
2011-11-07 06:51 UTC, Alex Deucher
glxinfo of kernel-3.2-rc1 with patch (24.43 KB, text/plain)
2011-11-09 14:50 UTC, Peter Weber
dmesg of kernel-3.2-rc1 with patch (69.85 KB, text/plain)
2011-11-09 14:51 UTC, Peter Weber

Description Peter Weber 2011-10-16 10:03:08 UTC
Hello!

I discovered this issue while browsing with a WebKit-based browser such as Midori or Epiphany within the new Gnome-Shell.

Steps to reproduce:
1. Gnome 3.2 with Gnome-Shell, using the Radeon drivers from Linux/Xorg
2. # sysctl -w kernel.sysrq="1" // WARNING! This will not save your ass!!!
3. $ launch Midori or Epiphany with JavaScript enabled
4. http://piratpix.com/bpt2011.1/index.html // Pictures from the federal convention of the German Pirate Party; note: there is a "javascript void(0)" error on the site
5. Open one or more images with the mouse

Results:
* Crash/Hang of system
* MagicSysRq will not help!

Software/Hardware:
pacman -Q libwebkit libwebkit3 midori epiphany
libwebkit 1.6.1-1
libwebkit3 1.6.1-1
midori 0.4.1-1
epiphany 3.2.0-1
and also:
* Self-compiled vanilla kernel 3.0.3 and stock kernel 3.0.6 from Archlinux
* Graphics card is an AMD Radeon 5650, with open-source drivers

Not affected:
* Gnome 3.2 in Fallback-Mode
* Firefox 7 (also shows the javascript void(0) error but displays everything as usual)
* Fedora 15 with Radeon 4670 and open-source drivers and Midori 0.3.6

Thoughts:
* Maybe the chain of problems starts with the JavaScript error
* Next step is WebKit
* Next step is Gnome-Shell
* But even if there are bugs in all of the above, the operating system should not crash/hang in any case. So I am afraid this is a bug in the kernel/radeon driver.
* It is also interesting that Fedora 15 with another AMD graphics card is not affected.

Thanks

PS: Discussion in the forums of Archlinux - https://bbs.archlinux.org/viewtopic.php?pid=1004227#p1004227
Comment 1 Peter Weber 2011-10-16 14:41:31 UTC
...the operating-system shouldn't crash.

You can keep the other typos :-)
Comment 2 Alex Deucher 2011-10-16 15:03:00 UTC
Please attach your xorg log and dmesg output.
Comment 3 Peter Weber 2011-10-17 00:14:52 UTC
Will post both as soon as possible. Maybe this evening.
Comment 4 Peter Weber 2011-10-17 11:02:48 UTC
Created attachment 52435
dmesg output
Comment 5 Peter Weber 2011-10-17 11:04:54 UTC
Created attachment 52436
xorg log

I have set vblank_mode=0 and SwapbuffersWait to "false", but even without them the system will crash/hang.
Comment 6 Peter Weber 2011-10-17 11:05:23 UTC
Created attachment 52437
lspci output
Comment 7 Peter Weber 2011-10-17 11:06:56 UTC
Created attachment 52438
pacman -Q output, list of installed packages

My self-compiled vanilla kernel 3.0.3 is not listed, but the problem also exists with the stock kernel of Archlinux.
Comment 8 Peter Weber 2011-10-22 06:35:46 UTC
I have done some testing with my laptop, which also has integrated Intel graphics on the Core i5 CPU.

Result:
* If I use the integrated Intel graphics-device I'm not affected by this bug
Comment 9 Peter Weber 2011-10-27 12:37:12 UTC
Created attachment 52831
xorg.log after crash/hang of system

I've upgraded to kernel 3.1 but the bug is still present. The first log I uploaded previously shows the state BEFORE anything happened and does not seem very helpful. My mistake.

So here is a new log file from after the crash/hang of the system; at the end you can see an error message:
[    71.633] (II) RADEON(0): radeon_dri2_flip_event_handler:981 fevent[0x19870d0] width 1366 pitch 5632 (/4 1408)

I hope this is more helpful :-)
Comment 10 Peter Weber 2011-10-27 12:46:16 UTC
Oops. There are a lot of bug reports around Gnome 3.2 and the flip_event_handler message.

https://bbs.archlinux.org/viewtopic.php?id=127506
http://lists.freedesktop.org/archives/dri-devel/2011-October/015112.html
https://bugs.archlinux.org/task/26340


Just the tip of the iceberg :-(
Comment 11 Alex Deucher 2011-10-27 13:01:23 UTC
Possibly a duplicate of bug 41592.
Comment 12 Peter Weber 2011-10-28 09:03:10 UTC
Hmmm.
Would it help you if I tried to use kexec to get a dump file of my crashed kernel?
Comment 13 Peter Weber 2011-10-31 04:49:12 UTC
Seems that this is not the same bug, looking at mirandir's comment.
Comment 14 Peter Weber 2011-10-31 12:52:17 UTC
I tried a git version of xf86-video-ati, but it didn't change anything.
Comment 15 Peter Weber 2011-11-01 03:22:59 UTC
1. I downloaded the beta release of Fedora 16 (Live) and put it on a USB thumb drive
2. Booted up my computer and installed Epiphany (WebKit)
3. Visited piratpix.com
4. Clicked through thumbnails

I've done this on my laptop with a Radeon 5650 Mobility and on my desktop with a Radeon 46??.
The benefit of this test is that I have a 100% identical system on both machines, so we know that this issue is not caused by me or my configuration.

laptop with radeon 5650-> crash!
desktop with radeon 46xx-> runs smooth without problems

Conclusion: It is the graphics card! But why? It runs perfectly in all other situations (Gnome-Shell itself, IOQuake3, Videos, Framebuffer...).
Comment 16 Peter Weber 2011-11-04 01:29:21 UTC
* Upgraded BIOS from 1.13 to 1.19, no effect (Acer Timeline 3820TG)
* xf86-video-ati 6.14.3, no effect


Am I the only user in this world with this bug?!
Damn!
Comment 17 Michel Dänzer 2011-11-04 04:01:05 UTC
(In reply to comment #9)
> So here is a new log file from after the crash/hang of the system; at the end
> you can see an error message:
> [    71.633] (II) RADEON(0): radeon_dri2_flip_event_handler:981
> fevent[0x19870d0] width 1366 pitch 5632 (/4 1408)

Those aren't errors but harmless debug messages. Something is increasing the X server log level from the default for you.


As the problem doesn't happen without gnome-shell, it's most likely a Mesa driver bug. Can you check whether it still happens with current upstream Mesa Git? Please also attach the glxinfo output.
Comment 18 Peter Weber 2011-11-05 08:47:09 UTC
Okay. I've tried it (for more than five hours). The final showstopper was mesa-git, with a compile-error in r300.
I will wait for the next official release of mesa/ati-dri/xf86-video-ati and hope for the best.

By the way, even if the current git repos include a fix, shouldn't the bug itself be in the radeon kernel module?
Comment 19 Peter Weber 2011-11-06 03:06:16 UTC
Created attachment 53212
glxinfo output of kernel 3.1
Comment 20 Peter Weber 2011-11-06 03:07:03 UTC
Created attachment 53213
glxinfo output kernel 2.6.37

not affected
Comment 21 Peter Weber 2011-11-06 03:07:29 UTC
Created attachment 53214
glxinfo of kernel 2.6.39.3
Comment 22 Peter Weber 2011-11-06 03:13:01 UTC
Okay! I've nearly got it!
I've taken a bunch of old stock-kernels from archlinux and tested a lot.

kernel-2.6.37 is not affected
kernel-2.6.38.1 fails to log in to gnome-shell (don't care)
kernel-2.6.38.5-1 is not affected
kernel-2.6.38.7-1 is not affected
kernel-2.6.38.8 is affected and crashes reliably!
kernel-2.6.39.3 is affected and crashes reliably!
kernel-3.0 is affected and crashes reliably!
kernel-3.1 is affected and crashes reliably!

live-cd of fedora 16 is affected (beta)
live-cd of fedora 15 is not affected (kernel-2.6.38-something)


So I think one of the patches between 2.6.38.7 and 2.6.38.8 is the cause!
I hope this helps :-)
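
The narrowing-down above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool that was used in this report: the helper names (`parse_version`, `regression_window`) are made up for the sketch, the package suffixes such as "-1" are dropped, and 2.6.38.1 is left out because it could not log in to gnome-shell at all.

```python
def parse_version(version):
    """Turn '2.6.38.8' into (2, 6, 38, 8) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (last_good, first_bad) from a {version: crashed} mapping.

    Assumes the results are monotonic: once a version crashes, all newer
    tested versions crash too.
    """
    ordered = sorted(results, key=parse_version)
    last_good = None
    for version in ordered:
        if results[version]:
            return last_good, version
        last_good = version
    return last_good, None


# Results copied from the list above; True means "crashes reliably".
RESULTS = {
    "2.6.37": False,
    "2.6.38.5": False,
    "2.6.38.7": False,
    "2.6.38.8": True,
    "2.6.39.3": True,
    "3.0": True,
    "3.1": True,
}

window = regression_window(RESULTS)
print(window)
```

The pair it reports is the window in which to look for the offending change; it says nothing about which individual patch inside that window is responsible.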
Comment 23 Peter Weber 2011-11-06 03:22:28 UTC
I'm just guessing:

Alex Deucher (1):
      drm/radeon/evergreen/btc/fusion: setup hdp to invalidate and flush when asked


Possible?!
Comment 24 Peter Weber 2011-11-06 05:35:55 UTC
Or this? This looks more interesting!
https://lkml.org/lkml/2011/6/1/302

I'm patching...
Comment 25 Peter Weber 2011-11-06 07:11:34 UTC
Created attachment 53218
remove r600_ioctl_wait_idle for evergreen (r800) based cards

It seems my guess was right; if so, did this patch cause the bug?
https://lkml.org/lkml/2011/6/1/302

I took a look at the source of /drivers/gpu/drm/radeon/radeon_asic.c and decided to remove "ioctl_wait_idle = r600_ioctl_wait_idle" from "static struct radeon_asic evergreen_asic". I hope this doesn't cause new problems, but maybe I'm lucky:

/drivers/gpu/drm/radeon/r600.c
/**
 * r600_ioctl_wait_idle - flush host path cache on wait idle ioctl
 * rdev: radeon device structure
 * bo: buffer object struct which userspace is waiting for idle
 *
 * Some R6XX/R7XX doesn't seems to take into account HDP flush performed
 * through ring buffer, this leads to corruption in rendering, see
 * http://bugzilla.kernel.org/show_bug.cgi?id=15186 to avoid this we
 * directly perform HDP flush by writing register through MMIO.
 */
void r600_ioctl_wait_idle(struct radeon_device *rdev, struct radeon_bo *bo)
{
...
}

My affected card is a Radeon 5650 Mobility, an Evergreen part, which is roughly R800. So r600_ioctl_wait_idle() shouldn't be necessary for Evergreen-based cards. I've now tested the change as much as I can within the Gnome-Shell and Framebuffer-Terminals (Suspend to RAM, runs of IOQuake3, glxgears, glchess, Midori, Firefox).

Thanks
Comment 26 Alex Deucher 2011-11-07 06:49:31 UTC
I still think this may be a mesa bug.  Have you tried mesa from git?  The glxinfo outputs you attached are from mesa 7.11.
Comment 27 Alex Deucher 2011-11-07 06:51:54 UTC
Created attachment 53247
flush HDP via the ring

You still need to flush the HDP cache otherwise the CPU may get stale data if it accesses vram after GPU rendering is complete.
Comment 28 Peter Weber 2011-11-07 08:06:56 UTC
Fine! I will apply your patch this evening and report the result as soon as possible. Can you briefly tell me what the HDP cache is? What is it good for? Thanks!


I tried to compile mesa from git, but didn't succeed and gave up in shame! In general I think code from user-space shouldn't be able to trigger (or prevent, in this case) a fatal crash or hang in kernel-space. So I decided to investigate the problem by testing the different versions of the kernel and some poking in the code ;-)
Comment 29 Alex Deucher 2011-11-07 08:35:45 UTC
(In reply to comment #28)
> Fine! I will apply your patch this evening and report the result as soon as
> possible. Can you briefly tell me what the HDP cache is? What is it good for?
> Thanks!
> 

Host Data Path.  It's the interface for accessing vram via the CPU.  E.g., when you access a buffer in vram via the PCI FB BAR, it goes through the HDP on the GPU.  Unfortunately, bugzilla.kernel.org is down so I don't remember exactly what bug the r600_ioctl_wait_idle patch fixed.  It's possible you just aren't hitting the case in your particular desktop scenario, but that removing it would regress other cases.  I'm leery of applying it until I have some confirmation that someone else is hitting this bug or that it doesn't regress any other cases.
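
To picture the stale-data concern that comes with skipping the HDP flush, here is a toy model in Python. It is an illustration under loud assumptions only: the class and function names (`ToyVram`, `ToyHostDataPath`, `gpu_write`, `cpu_read`, `flush`) are invented for this sketch and do not correspond to anything in the radeon driver, nor to how the hardware actually implements the HDP cache.

```python
class ToyVram:
    """Video memory as a simple address -> value mapping."""

    def __init__(self):
        self.cells = {}


class ToyHostDataPath:
    """Hypothetical CPU-side window into VRAM with a small read cache."""

    def __init__(self, vram):
        self.vram = vram
        self.cache = {}

    def cpu_read(self, address):
        # Serve from the cache if possible; otherwise fetch and remember.
        if address not in self.cache:
            self.cache[address] = self.vram.cells.get(address, 0)
        return self.cache[address]

    def flush(self):
        # Drop cached values so the next read goes back to VRAM.
        self.cache.clear()


def gpu_write(vram, address, value):
    """The GPU renders straight into VRAM, bypassing the host data path."""
    vram.cells[address] = value


vram = ToyVram()
hdp = ToyHostDataPath(vram)

gpu_write(vram, 0x10, 1)
first = hdp.cpu_read(0x10)   # CPU sees 1 and caches it

gpu_write(vram, 0x10, 2)     # GPU renders a new value
stale = hdp.cpu_read(0x10)   # without a flush the CPU still sees 1

hdp.flush()
fresh = hdp.cpu_read(0x10)   # after the flush the CPU sees 2

print(first, stale, fresh)
```

In this model the middle read returns the old value because nothing told the host-side cache that VRAM had changed, which is the kind of situation a flush is meant to rule out.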

> 
> I tried to compile mesa from git, but didn't succeed and gave up in shame! In

What sort of problems are you seeing?

> general I think code from user-space shouldn't be able to trigger (or prevent,
> in this case) a fatal crash or hang in kernel-space. So I decided to
> investigate the problem by testing the different versions of the kernel and
> some poking in the code ;-)

That's the nature of complex 3D engines.
Comment 30 Peter Weber 2011-11-07 11:10:52 UTC
(In reply to comment #29)
> Host Data Path.  It's the interface for accessing vram via the CPU.  E.g., when
> you access a buffer in vram via the PCI FB BAR, it goes through the HDP on the
> GPU.  Unfortunately, bugzilla.kernel.org is down so I don't remember exactly
> what bug the r600_ioctl_wait_idle patch fixed.  It's possible you just aren't
> hitting the case in your particular desktop scenario, but that removing it
> would regress other cases.  I'm leery of applying it until I have some
> confirmation that someone else is hitting this bug or that it doesn't regress
> any other cases.

Thanks for your description. I understand that you are being careful.

> What sort of problems are you seeing?

The compiler complained about something in "r300...", but honestly I don't really remember it.

> That's the nature of complex 3D engines.

I understand! But we are lucky to have you and your co-workers at AMD :-)




Okay. I tried to apply the patch, but it doesn't work.

[peter@cupcake linux-3.1]$ patch -p1 --dry-run < ~/flush_hdp_via_the_ring.patch 
patching file drivers/gpu/drm/radeon/evergreen_blit_kms.c
Hunk #1 FAILED at 625.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/radeon/evergreen_blit_kms.c.rej
patching file drivers/gpu/drm/radeon/r600.c
Hunk #1 FAILED at 2331.
Hunk #2 succeeded at 2342 (offset -11 lines).
1 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/radeon/r600.c.rej
patching file drivers/gpu/drm/radeon/r600_blit_kms.c
Hunk #1 FAILED at 503.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/radeon/r600_blit_kms.c.rej
patching file drivers/gpu/drm/radeon/radeon_asic.c


I'm afraid you created the patch against a different kernel? Is it current git?
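
For a patch that touches several files, the dry-run output can be condensed into a per-file summary. The following Python sketch parses the output shown above; the function name `summarize` and the regular expressions are assumptions written for this example and only cover the line shapes that appear in this particular output.

```python
import re

# Output copied from the dry run above.
DRY_RUN_OUTPUT = """\
patching file drivers/gpu/drm/radeon/evergreen_blit_kms.c
Hunk #1 FAILED at 625.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/radeon/evergreen_blit_kms.c.rej
patching file drivers/gpu/drm/radeon/r600.c
Hunk #1 FAILED at 2331.
Hunk #2 succeeded at 2342 (offset -11 lines).
1 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/radeon/r600.c.rej
patching file drivers/gpu/drm/radeon/r600_blit_kms.c
Hunk #1 FAILED at 503.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/radeon/r600_blit_kms.c.rej
patching file drivers/gpu/drm/radeon/radeon_asic.c
"""

PATCHING = re.compile(r"^patching file (\S+)$")
SUMMARY = re.compile(r"^(\d+) out of (\d+) hunks? FAILED")


def summarize(output):
    """Map each patched file to (failed_hunks, total_hunks).

    Files without a 'FAILED' summary line are reported as (0, None)
    because the output does not state a hunk total for them.
    """
    summary = {}
    current = None
    for line in output.splitlines():
        match = PATCHING.match(line)
        if match:
            current = match.group(1)
            summary[current] = (0, None)
            continue
        match = SUMMARY.match(line)
        if match and current is not None:
            summary[current] = (int(match.group(1)), int(match.group(2)))
    return summary


report = summarize(DRY_RUN_OUTPUT)
for path, (failed, total) in report.items():
    print(path, failed, total)
```

Three of the four files have at least one rejected hunk, which fits the suspicion that the patch was made against a different tree.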
Comment 31 Peter Weber 2011-11-07 11:21:55 UTC
Yep, looks like git! A little bit too late for me now. I will give you feedback tomorrow!
Comment 32 Peter Weber 2011-11-09 14:49:36 UTC
Sorry! I'm late!

Because Linus Torvalds released 3.2-rc1, I decided to test the rc1 instead of a clone of git-master.
* kernel-3.2-rc1 without patch   - unstable, crash/hang
* kernel-3.2-rc1 with your patch - stable, works perfectly


During testing I got the feeling that kernel 3.2-rc1 without the patch is itself "more stable". I will try to describe it. In most cases an unpatched kernel will crash/hang after opening 1 to 5 images by mouse click (not slideshow) on piratpix.com (this is the only website known to me that reproduces the issue reliably). The unpatched rc1 seems to survive longer: I was able to open ~20 pictures before the system stopped responding.

The kernel 3.2-rc1 with your patch is completely stable and works fine! Great work!
Thanks

I am attaching a new glxinfo and dmesg from the kernel with the patch!
Comment 33 Peter Weber 2011-11-09 14:50:40 UTC
Created attachment 53353
glxinfo of kernel-3.2-rc1 with patch
Comment 34 Peter Weber 2011-11-09 14:51:19 UTC
Created attachment 53354
dmesg of kernel-3.2-rc1 with patch
Comment 35 Peter Weber 2011-11-09 15:03:53 UTC
Oh no! After my last comment I just left Epiphany open and let it run the slideshow on the website. Meanwhile I wrote some mails and listened to some music (just normal stuff), until my system crashed/hung. The music kept replaying in a sound loop.

Kernel 3.1                - regularly crashes within 1 to 5 pictures
Kernel 3.2-rc1            - regularly crashes after ~20 pictures
Kernel 3.2-rc1 with patch - crashes after a great many pictures


:-(
Comment 36 Alex Deucher 2011-11-09 15:41:34 UTC
Any chance you can try a newer mesa?  There may be test packages available for your distro.
Comment 37 Peter Weber 2011-11-10 04:04:43 UTC
There are currently no mesa packages in the official testing repos of Archlinux, but I know a repo hosted by a user - http://spiralinear.org/perry3d/x86_64/

If these packages work I will remove your patch or downgrade to 3.1 to get a more "unreliable" environment and start testing.
Comment 38 Peter Weber 2011-11-13 11:23:40 UTC
Good news!

* I installed mesa-git and everything depending on it
* kernel 3.2-rc1 and 3.1

The system is stable and doesn't crash! I'm still worried about the complex connections between mesa and kernel, but I'm just glad to have a rock-stable system! Thanks!

Maybe I will get access to the identical laptop of a co-worker with the same hardware (the radeon is just relabeled as 6650) and will try Fedora 16 on it. I want to reproduce the bug to confirm it myself.
After that I will add the result here and close the bug, finally :-)
Comment 39 Peter Weber 2011-11-13 11:24:15 UTC
I didn't use any patch for the kernels!
Comment 40 Michel Dänzer 2011-11-15 08:33:19 UTC
(In reply to comment #38)
> * I installed mesa-git and everything depending on it
[...]
> The system is stable and doesn't crash!

Glad to hear it. It would be great if you could try if the problem still happens with the current Mesa Git 7.11 branch, and if it does, if you could bisect which change from the master branch fixed it. Then we could maybe backport the fix to the 7.11 branch.
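
Bisecting for the change that fixed a problem is a binary search with the roles of "good" and "bad" swapped. The Python sketch below shows the idea on a made-up history; `first_fixed`, the placeholder commit names, and the position of the fix are all hypothetical, and in practice `git bisect` automates this kind of search on the real repository.

```python
def first_fixed(commits, is_fixed):
    """Binary-search an ordered commit list for the first fixed commit.

    Assumes commits[0] still shows the bug, commits[-1] does not, and the
    fix stays in place once it has landed.
    """
    low, high = 0, len(commits) - 1   # low: known broken, high: known fixed
    tests = 0
    while high - low > 1:
        mid = (low + high) // 2
        tests += 1
        if is_fixed(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high], tests


# Hypothetical history: 1000 placeholder commits, fix lands at index 700.
commits = ["commit-%04d" % i for i in range(1000)]
fix_index = 700

found, tests = first_fixed(commits, lambda c: commits.index(c) >= fix_index)
print(found, tests)
```

Each round corresponds to one build-and-test cycle, so even a history of a thousand commits needs only on the order of ten rounds, although every round still costs a full Mesa build and a crash test.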
Comment 41 Peter Weber 2011-11-19 04:14:45 UTC
To be honest, bisecting mesa is too time-consuming for me. So I decided to give the new mesa release 7.11.1 a try, but the commit which fixed the issue on mesa-git is not included. Currently I'm looking forward to Thursday next week, when I should get access to the laptop of my co-worker.

I've taken a look at the changelog and the commit messages in the git log of mesa, but I didn't find anything interesting. Maybe I will take a deeper look later.
Comment 42 Peter Weber 2011-11-25 07:37:15 UTC
My co-worker gave me access to his laptop today (thanks!).

1. I booted up Fedora 16 (Live) and installed Epiphany
2. Visited the website from above
3. After clicking on the second thumbnail the system hung/crashed

It is the same 3820TG from Acer with a Radeon 5600 Series ASIC, and the BIOS version 1.19 is also the same. Looks "reproducible" to me. The next interesting thing would be a general test on other Evergreen-based ASICs.

Should I set this bug to "Resolved"? Or wait for Mesa 8.0?
Comment 43 duzak_87 2011-11-28 10:09:21 UTC
HD5850 owner here,

For many months I suffered from completely random system freezes (no tty, no ssh), ever since I switched to gallium instead of classic. I could not reproduce it and the logs did not contain anything suspicious, probably because I needed to reset the machine. I can't count how many config files got corrupted in this time. It was almost guaranteed to happen within two hours of uptime, and it didn't matter what distribution or desktop I used.

Sadly, I had no time to debug it further and used the classic driver instead.
I tried using mesa-git in the last few days and the problem disappeared. I kept tracking it down until I found this little bug report, just to find out what it was.

Thank you guys, you did me and perhaps a few others a great favour.
Comment 44 Peter Weber 2012-03-15 12:58:49 UTC
The good news: mesa-8.0 is stable
The better news: mesa-8.0.1-2 is stable in archlinux repositories
The best news: it is officially fixed!

The sad news: I still don't know what finally fixed it ;-)

