Summary: SIGBUS in EVERGREENUploadToScreen after hibernation (Linux 3.12.4-tuxonice)
Product: DRI
Reporter: txtoxtox285
Component: DRM/Radeon
Assignee: xf86-video-ati maintainers <xorg-driver-ati>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: normal
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Created attachment 90785 [details]
Xorg.0.log
I forgot to mention: 3.9.9-tuxonice does not have this problem.

I wonder whether this message during resume has anything to do with it (see appended dmsg):

[ 864.574325] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[ 864.574328] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002)
[ 864.574331] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 864.574334] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).

Created attachment 91087 [details]
syslog on modprobe -rv radeon
OK, so I thought about: unload radeon.ko, patch source for printk debugging, compile and load to see where the culprit is, but modprobe -rv radeon rendered my machine basically useless; seems this banana is not ripe yet.
Does it work correctly without tuxonice? SIGBUS is usually a sign that the CPU is trying to access a non-CPU visible region of vram.

(In reply to comment #5)
> Does it work correctly without tuxonice?

No, vanilla 3.12.4 has exactly the same problems: boot => suspend to disk => resume => start X => run firefox => SIGBUS in EVERGREENUploadToScreen

(In case it is related: as on 3.12.4-tuxonice, modprobe -r radeon oopses with a “BUG: unable to handle kernel paging request”, even without ever starting X.)

I admit I didn’t try X before a suspend/resume cycle.

Correct me if I’m wrong, but if I understand the driver code correctly, the SIGBUS is delivered because of the following chain of function calls:

ttm_bo_move_buffer(bo, {fpfn = 0x0, lpfn = 0x10000, num_placement = 1, num_busy_placement = 1}, 0, 0) = -12
ttm_bo_validate
radeon_bo_fault_reserve_notify (as bdev->driver->fault_reserve_notify)
ttm_bo_vm_fault (as ttm_vm_ops->fault)
radeon_ttm_fault

where the return value from ttm_bo_move_buffer (-ENOMEM) is passed up to ttm_bo_vm_fault, which then returns VM_FAULT_SIGBUS.

During the many cycles of “printk, compile, reboot, suspend, resume, crash X” I’ve found that the problem may have to do with the fact that I load radeon.ko during my initramfs before attempting resume by echoing into /sys/power/resume or /sys/power/tuxonice/do_resume.

If I do *not* load radeon.ko into the “booting” kernel (i.e., the one which is then replaced by the “resuming” kernel), I couldn’t reproduce the crash. Furthermore, I do *not* get these four messages from the “resuming” kernel about a lockup:

[ 864.574325] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[ 864.574328] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002)
[ 864.574331] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 864.574334] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
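
Editorial illustration: the call chain described in the comment above (an -ENOMEM from ttm_bo_move_buffer travelling up until the fault handler reports VM_FAULT_SIGBUS) can be pictured with the toy model below. It is a sketch in Python, not kernel code: the functions only borrow the names from the comment, the `space_available` flag and the `FAULT_HANDLED` marker are hypothetical, and the real handler may treat some error codes differently (for example retryable ones), which is not modelled here.

```python
import errno

# Symbolic stand-ins; the kernel's numeric VM_FAULT_* values are
# deliberately not reproduced here.
VM_FAULT_SIGBUS = "VM_FAULT_SIGBUS"
FAULT_HANDLED = "FAULT_HANDLED"  # hypothetical marker for the success path


def ttm_bo_move_buffer(space_available: bool) -> int:
    """Stand-in: 0 on success, -ENOMEM when no placement can be found."""
    return 0 if space_available else -errno.ENOMEM


def ttm_bo_validate(space_available: bool) -> int:
    """Stand-in: passes the result of the move attempt straight through."""
    return ttm_bo_move_buffer(space_available)


def radeon_bo_fault_reserve_notify(space_available: bool) -> int:
    """Stand-in for the driver hook: propagates the validation result."""
    return ttm_bo_validate(space_available)


def ttm_bo_vm_fault(space_available: bool) -> str:
    """Stand-in for the fault handler: a non-zero result becomes SIGBUS."""
    ret = radeon_bo_fault_reserve_notify(space_available)
    if ret != 0:
        return VM_FAULT_SIGBUS
    return FAULT_HANDLED


if __name__ == "__main__":
    print(ttm_bo_vm_fault(space_available=False))
```

The point of the model is only that nothing between the failed buffer move and the fault handler recovers from the error, so the faulting process (here the X server inside memcpy) is the one that receives the signal.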
(In reply to comment #8)
> During the many cycles of “printk, compile, reboot, suspend, resume, crash
> X” I’ve found that the problem may have to do with the fact that I load
> radeon.ko during my initramfs before attempting resume by echoing into
> /sys/power/resume or /sys/power/tuxonice/do_resume.
>
> If I do *not* load radeon.ko into the “booting” kernel (i.e., the one which
> is then replaced by the “resuming” kernel), I couldn’t reproduce the crash.
> Furthermore, I do *not* get these four messages from the “resuming” kernel
> about a lockup:
>
> [ 864.574325] radeon 0000:01:00.0: GPU lockup CP stall for more than
> 10000msec
> [ 864.574328] radeon 0000:01:00.0: GPU lockup (waiting for
> 0x0000000000000004 last fence id 0x0000000000000002)
> [ 864.574331] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed
> (-35).
> [ 864.574334] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB
> on ring 5 (-35).

I think what's happening is that when you get the GPU lockup, the driver can't reset the GPU properly so it has no way to migrate data from non-CPU accessible vram to CPU accessible vram and you end up with the SIGBUS when the CPU tries to access the non-CPU accessible vram.

I'm not quite following what you mean by booting vs. resuming kernel.

> I'm not quite following what you mean by booting vs. resuming kernel.

I resume by echoing into /sys/power/resume from an initramfs. The kernel which is in charge before that is what I called the “booting” kernel. When I echo into /sys/power/resume, my naive understanding is that the old kernel (the one which saved a memory image to swap) takes over control again (the “resuming” kernel).
I suppose that’s not exactly what happens (is the kernel which just booted really kicked out of existence or does it keep on running and take over the memory image from the kernel which was suspended?). But my point was that when radeon.ko is loaded before the initramfs echoes into /sys/power/resume, the lockup and the SIGBUS happen, whereas if it is not loaded, everything’s fine.

(In reply to comment #10)
> I suppose that’s not exactly what happens (is the kernel which just booted
> really kicked out of existence or does it keep on running and take over the
> memory image from the kernel which was suspended?). But my point was that
> when radeon.ko is loaded before the initramfs echoes into /sys/power/resume,
> the lockup and the SIGBUS happen, whereas if it is not loaded, everything’s
> fine.

Does the booting kernel unload the radeon driver prior to switching to the resuming kernel? If not, the resuming kernel may take over before the radeon driver has finished loading in the boot kernel, or in the middle of some operation which leaves the GPU in a bad state when the resuming kernel takes over.

I don't want to tangle this bug completely; but I get something similar, 100% reproducibly; my sin being just to login under XFCE, or to launch an app under gnome-shell ;-)

Program received signal SIGBUS, Bus error.
0xb7274907 in memcpy () from /lib/libc.so.6
(gdb) bt
#0 0xb7274907 in memcpy () from /lib/libc.so.6
#1 0xb6cc0b97 in memcpy (__len=7680, __src=0xb41b6008, __dest=0xb377c000) at /usr/include/bits/string3.h:51
#2 EVERGREENUploadToScreen (pDst=pDst@entry=0xa5689c0, x=0, y=y@entry=0, w=1920, h=1200, src=0xb41b6008 "\020\017\v\377\021\020\f\377\020\016\f\377\016\r\v\377\f\r\n\377\021\021\016"..., src_pitch=7680) at evergreen_exa.c:1721
#3 0xb6c7c506 in exaCopyDirty (migrate=migrate@entry=0xbfdd50a0, pValidDst=pValidDst@entry=0xa568a34, pValidSrc=pValidSrc@entry=0xa568a28, transfer=0xb6cc0820 <EVERGREENUploadToScreen>, fallback_index=fallback_index@entry=0, sync=sync@entry=0x0) at exa_migration_classic.c:220
#4 0xb6c7c9e6 in exaCopyDirtyToFb (migrate=migrate@entry=0xbfdd50a0) at exa_migration_classic.c:303
#5 0xb6c7e92f in exaDoMigration_mixed (pixmaps=0xbfdd5090, npixmaps=2, can_accel=1) at exa_migration_mixed.c:118
#6 0xb6c7b17f in exaDoMigration (pixmaps=pixmaps@entry=0xbfdd5090, npixmaps=npixmaps@entry=2, can_accel=can_accel@entry=1) at exa.c:1134
#7 0xb6c81218 in exaFillRegionTiled (pDrawable=pDrawable@entry=0xa5b9f38, pRegion=pRegion@entry=0xa6cce40, pTile=0xa5689c0, pPatOrg=pPatOrg@entry=0xa207ce8, planemask=4294967295, alu=3, clientClipType=0) at exa_accel.c:1124
#8 0xb6c81d58 in exaPolyFillRect (pDrawable=0xa5b9f38, pGC=0xa207cc0, nrect=1, prect=0xa5daff0) at exa_accel.c:821
#9 0x08163a67 in damagePolyFillRect (pDrawable=0xa5b9f38, pGC=0xa207cc0, nRects=1, pRects=0xa5daff0) at damage.c:1254
#10 0x081bffa8 in miPaintWindow (pWin=<optimized out>, pWin@entry=0xa5b9f38,

I wonder if it could be related; very happy to help debug/test patches etc. As I say, it's very easy to reproduce, no suspend/resume required =)

*** This bug has been marked as a duplicate of bug 44099 ***

(In reply to comment #13)
>
> *** This bug has been marked as a duplicate of bug 44099 ***

I can confirm that with the patch posted there (#91541) applied, the X server no longer crashes. However, if I may be so bold as to ask, does this patch actually solve the problem, or does it simply avoid it?

Created attachment 90784 [details]
dmsg

How to reproduce:
* boot 3.12.4-tuxonice; do not start X
* suspend to disk
* resume
* start KDE 4.10.5
* start Firefox
==> X dies with SIGBUS

Graphics hardware: [AMD/ATI] Cedar [Radeon HD 5000/6000/7350/8350 Series]; VID:PID 1002:68f9, SVID:SPID 1043:03d8

Software:
* Kernel 3.12.4-tuxonice
* Gentoo:
** xorg-x11-7.4-r2
** xorg-server-1.14.3-r2
** xf86-video-ati-7.2.0

GDB:
(gdb) bt
#0 __memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:819
#1 0x00002af85fef84b4 in EVERGREENUploadToScreen (pDst=0x2187f90, x=0, y=0, w=1516, h=43, src=0x21e1728 "", src_pitch=6064) at /usr/include/bits/string3.h:52
#2 0x00002af8603519dc in exaDoPutImage (src_stride=6064, bits=0x21e1728 "", format=2, h=43, w=1516, y=<optimized out>, x=<optimized out>, pGC=0x1f373d0, pDrawable=0x2187f90, depth=<optimized out>) at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/exa/exa_accel.c:212
#3 exaPutImage (pDrawable=0x2187f90, pGC=0x1f373d0, depth=32, x=0, y=0, w=1516, h=43, leftPad=0, format=2, bits=0x21e1728 "") at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/exa/exa_accel.c:233
#4 0x000000000076616d in ProcPutImage (client=<optimized out>) at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/dix/dispatch.c:1966
#5 0x0000000000769556 in Dispatch () at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/dix/dispatch.c:432
#6 0x0000000000757ef3 in main (argc=<optimized out>, argv=0x7fffd57dae58, envp=<optimized out>) at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/dix/main.c:298
(gdb) info locals
pScrn = 0x1818d90
info = 0x1819350
accel_state = 0x2187f90
driver_priv = 0x1818d90 [bogus, should be 0x2052f10]
scratch = <optimized out>
copy_dst = 0x2024720
dst = 0x2af865f19000 <Address 0x2af865f19000 out of bounds>
size = <optimized out>
dst_domain = 4
bpp = <optimized out>
scratch_pitch = <optimized out>
copy_pitch = 6144
ret = <optimized out>
flush = <optimized out>
r = 1
i = <optimized out>
src_obj = {pitch = 3581782816, width = 32767, height = 1141, bpp = 0, domain = 0, bo = 0x1800018a33b0, tiling_flags = 3581782752, surface = 0x2af85fedec79 <RADEONEXAPixmapIsOffscreen+9>}
dst_obj = {pitch = 3581782784, width = 32767, height = 1614088140, bpp = 11000, domain = 3581782784, bo = 0x2af85fedec79 <RADEONEXAPixmapIsOffscreen+9>, tiling_flags = 3581782816, surface = 0x2af8603507cc <exaPixmapHasGpuCopy_mixed+108>}
height = <optimized out>
base_align = <optimized out>
(gdb) p $driver_priv->bo
$1 = (struct radeon_bo *) 0x2024720
(gdb) p *((struct radeon_bo_gem*)copy_dst)
$2 = {base = {ptr = 0x2af865f19000, flags = 0, handle = 265, size = 7028736, alignment = 256, domains = 4, cref = 1, bom = 0x1824130, space_accounted = 0, referenced_in_cs = 0}, name = 0, map_count = 1, reloc_in_cs = {atomic = 0}, priv_ptr = 0x2af865f19000}
(gdb) x/x ((struct radeon_bo_gem*)copy_dst)->priv_ptr
0x2af865f19000: Cannot access memory at address 0x2af865f19000
(gdb) ^Z
[1]+ Stopped gdb -p $(pgrep X)
~ # grep 2af865f19000 /proc/$(pgrep X)/maps
2af865f19000-2af8665cd000 rw-s 10aa4c000 00:05 6534 /dev/dri/card0
---------------------------------------------------

Looks like EVERGREENUploadToScreen wants to memcpy into copy_dst->ptr, which has a value of 0x2af865f19000 and which (according to /proc/$(pgrep X)/maps) *is* mapped and should be writable; however, it isn’t. At this point I lost my wits and would be grateful for a pointer where this memory is mapped, both in user and kernel space.
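
Editorial illustration: the grep against /proc/$(pgrep X)/maps above can be made a little more precise by parsing the matching line and checking the address range, the permissions and the length of the mapping. The sketch below is in Python and uses only the maps line quoted in this report; the helper names `parse_maps_line` and `find_mapping` are made up for this example.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str


def parse_maps_line(line: str) -> Mapping:
    """Parse one line in /proc/<pid>/maps format."""
    fields = line.split(None, 5)
    addr_range, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start_s, end_s = addr_range.split("-")
    return Mapping(int(start_s, 16), int(end_s, 16), perms,
                   int(offset, 16), dev, int(inode), path)


def find_mapping(maps_text: str, addr: int) -> Optional[Mapping]:
    """Return the mapping that contains addr, or None if it is unmapped."""
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        m = parse_maps_line(line)
        if m.start <= addr < m.end:
            return m
    return None


# The line quoted in the report, and the pointer printed by gdb.
MAPS_LINE = "2af865f19000-2af8665cd000 rw-s 10aa4c000 00:05 6534 /dev/dri/card0"
mapping = find_mapping(MAPS_LINE, 0x2AF865F19000)

if __name__ == "__main__":
    print(mapping.perms, mapping.path, mapping.end - mapping.start)
```

For the quoted line, the mapping is shared and writable (rw-s), backed by /dev/dri/card0, and its length (0x2af8665cd000 - 0x2af865f19000 = 7028736 bytes) matches the `size = 7028736` that gdb prints for the buffer object. So the mapping itself looks consistent with the buffer; an entry in maps only shows that the range is set up, and whether an access succeeds still depends on the driver's fault handler being able to provide backing pages.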