Bug 72716 - SIGBUS in EVERGREENUploadToScreen after hibernation (Linux 3.12.4-tuxonice)
Status: RESOLVED DUPLICATE of bug 44099
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: xf86-video-ati maintainers
Reported: 2013-12-14 19:53 UTC by txtoxtox285
Modified: 2014-02-01 13:25 UTC

Attachments
dmsg (80.73 KB, text/plain)
2013-12-14 19:53 UTC, txtoxtox285
Xorg.0.log (42.64 KB, text/plain)
2013-12-14 19:54 UTC, txtoxtox285
syslog on modprobe -rv radeon (12.93 KB, text/plain)
2013-12-21 14:39 UTC, txtoxtox285

Description txtoxtox285 2013-12-14 19:53:14 UTC
Created attachment 90784
dmsg

How to reproduce:
* boot 3.12.4-tuxonice; do not start X
* suspend to disk
* resume
* start KDE 4.10.5
* start Firefox

==> X dies with SIGBUS

Graphics hardware:  [AMD/ATI] Cedar [Radeon HD 5000/6000/7350/8350 Series]; VID:PID 1002:68f9, SVID:SPID 1043:03d8

Software:
* Kernel 3.12.4-tuxonice
* Gentoo:
** xorg-x11-7.4-r2
** xorg-server-1.14.3-r2
** xf86-video-ati-7.2.0

GDB:
(gdb) bt
#0  __memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:819
#1  0x00002af85fef84b4 in EVERGREENUploadToScreen (pDst=0x2187f90, x=0, y=0, w=1516, h=43,
    src=0x21e1728 "", src_pitch=6064) at /usr/include/bits/string3.h:52
#2  0x00002af8603519dc in exaDoPutImage (src_stride=6064, bits=0x21e1728 "", format=2, h=43, w=1516,
    y=<optimized out>, x=<optimized out>, pGC=0x1f373d0, pDrawable=0x2187f90, depth=<optimized out>)
    at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/exa/exa_accel.c:212
#3  exaPutImage (pDrawable=0x2187f90, pGC=0x1f373d0, depth=32, x=0, y=0, w=1516, h=43, leftPad=0,
    format=2, bits=0x21e1728 "")
    at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/exa/exa_accel.c:233
#4  0x000000000076616d in ProcPutImage (client=<optimized out>)
    at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/dix/dispatch.c:1966
#5  0x0000000000769556 in Dispatch ()
    at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/dix/dispatch.c:432
#6  0x0000000000757ef3 in main (argc=<optimized out>, argv=0x7fffd57dae58, envp=<optimized out>)
    at /mnt/var-pub/tmp/portage/x11-base/xorg-server-1.14.3-r2/work/xorg-server-1.14.3/dix/main.c:298

(gdb) info locals
pScrn = 0x1818d90
info = 0x1819350
accel_state = 0x2187f90
driver_priv = 0x1818d90 [bogus, should be 0x2052f10]
scratch = <optimized out>
copy_dst = 0x2024720
dst = 0x2af865f19000 <Address 0x2af865f19000 out of bounds>
size = <optimized out>
dst_domain = 4
bpp = <optimized out>
scratch_pitch = <optimized out>
copy_pitch = 6144
ret = <optimized out>
flush = <optimized out>
r = 1
i = <optimized out>
src_obj = {pitch = 3581782816, width = 32767, height = 1141, bpp = 0, domain = 0, bo = 0x1800018a33b0,
  tiling_flags = 3581782752, surface = 0x2af85fedec79 <RADEONEXAPixmapIsOffscreen+9>}
dst_obj = {pitch = 3581782784, width = 32767, height = 1614088140, bpp = 11000, domain = 3581782784,
  bo = 0x2af85fedec79 <RADEONEXAPixmapIsOffscreen+9>, tiling_flags = 3581782816,
  surface = 0x2af8603507cc <exaPixmapHasGpuCopy_mixed+108>}
height = <optimized out>
base_align = <optimized out>
(gdb) p $driver_priv->bo
$1 = (struct radeon_bo *) 0x2024720
(gdb) p *((struct radeon_bo_gem*)copy_dst)
$2 = {base = {ptr = 0x2af865f19000, flags = 0, handle = 265, size = 7028736, alignment = 256,
    domains = 4, cref = 1, bom = 0x1824130, space_accounted = 0, referenced_in_cs = 0}, name = 0,
  map_count = 1, reloc_in_cs = {atomic = 0}, priv_ptr = 0x2af865f19000}
(gdb) x/x ((struct radeon_bo_gem*)copy_dst)->priv_ptr
0x2af865f19000: Cannot access memory at address 0x2af865f19000
(gdb) ^Z
[1]+  Stopped                 gdb -p $(pgrep X)
~ # grep 2af865f19000 /proc/$(pgrep X)/maps
2af865f19000-2af8665cd000 rw-s 10aa4c000 00:05 6534                      /dev/dri/card0

---------------------------------------------------

Looks like EVERGREENUploadToScreen wants to memcpy into copy_dst->ptr,
which has a value of 0x2af865f19000 and which (according to /proc/$(pgrep X)/maps)
*is* mapped and should be writable; however, it isn’t.

At this point I lost my wits and would be grateful for a pointer to where this memory gets mapped,
both in user and kernel space.
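
The cross-check between the gdb output and /proc/<pid>/maps can be sketched in a few lines of Python. This is only an illustration: the helper name parse_maps_line and the Mapping tuple are made up for this sketch, and the constants are copied from the session above. It shows that the mapping covers the buffer object exactly (its length equals the BO size of 7028736 bytes) and is listed as shared read/write, so the address-space bookkeeping looks sane and the fault has to come from the backing of the mapping.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps: range, perms, offset, dev, inode, path."""
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


# Values copied from the gdb session and the maps output in this report.
MAPS_LINE = "2af865f19000-2af8665cd000 rw-s 10aa4c000 00:05 6534                      /dev/dri/card0"
BO_PTR = 0x2AF865F19000   # base.ptr of the radeon_bo_gem
BO_SIZE = 7028736         # base.size of the radeon_bo_gem

m = parse_maps_line(MAPS_LINE)
covers_bo = m.start <= BO_PTR and BO_PTR + BO_SIZE <= m.end
print(f"covers BO: {covers_bo}, perms: {m.perms}, length: {m.end - m.start}")
```

A mapping that is present and writable in maps only says that the VMA exists; whether an access succeeds is decided by the driver's fault handler when the page is first touched.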
Comment 1 txtoxtox285 2013-12-14 19:54:00 UTC
Created attachment 90785
Xorg.0.log
Comment 2 txtoxtox285 2013-12-15 08:26:00 UTC
I forgot to mention: 3.9.9-tuxonice does not have this problem.
Comment 3 txtoxtox285 2013-12-21 14:35:21 UTC
I wonder whether this message during resume has anything to do with it (see the attached dmsg):

[  864.574325] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[  864.574328] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002)
[  864.574331] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[  864.574334] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
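
For reading such log lines, a small sketch that pulls the function name and error code out of the two *ERROR* lines and names the code. The ERRNO_NAMES table is hand-written for this example and only lists the two generic Linux errno values that appear in this report (12 and 35); the helper decode_errors is hypothetical.

```python
import re

# Generic Linux errno values that occur in this report (hand-written table,
# not taken from the running system).
ERRNO_NAMES = {12: "ENOMEM", 35: "EDEADLK"}

DMESG = """\
[  864.574331] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[  864.574334] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
"""

ERROR_LINE = re.compile(r"\[drm:(\w+)\] \*ERROR\* .*\((-\d+)\)\.")


def decode_errors(log: str) -> list[tuple[str, str]]:
    """Return (function, errno name) for every '[drm:func] *ERROR* ... (-N).' line."""
    decoded = []
    for line in log.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            code = -int(match.group(2))
            decoded.append((match.group(1), ERRNO_NAMES.get(code, f"errno {code}")))
    return decoded


for func, name in decode_errors(DMESG):
    print(f"{func}: -{name}")
```

On Linux, 35 is EDEADLK, so both lines report the same failure: the fence wait in the IB test was aborted rather than completed.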
Comment 4 txtoxtox285 2013-12-21 14:39:34 UTC
Created attachment 91087
syslog on modprobe -rv radeon

OK, so my plan was: unload radeon.ko, patch the source for printk debugging, compile, and reload it to see where the culprit is. However, modprobe -rv radeon rendered my machine basically useless; seems this banana is not ripe yet.
Comment 5 Alex Deucher 2013-12-21 14:55:07 UTC
Does it work correctly without tuxonice?  SIGBUS is usually a sign that the CPU is trying to access a non-CPU-visible region of VRAM.
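
As a userspace analogy only (this is not the radeon code path), the same symptom, a mapping that exists and is writable yet raises SIGBUS on access, can be reproduced with an ordinary file: map it shared, remove its backing store by truncating it, then touch the page. The truncation trick and the helper name touch_truncated_mapping are assumptions chosen for illustration; the access is done in a forked child so the signal does not kill the calling process.

```python
import mmap
import os
import signal
import tempfile


def touch_truncated_mapping() -> int:
    """Return the signal number that killed the child, or 0 if it exited normally."""
    page = mmap.PAGESIZE
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.ftruncate(fd, page)           # give the file one page of backing store
        pid = os.fork()
        if pid == 0:
            try:
                m = mmap.mmap(fd, page)  # shared, read/write mapping
                os.ftruncate(fd, 0)      # backing store disappears, mapping stays
                m[0] = 1                 # page fault with nothing behind it
            finally:
                os._exit(0)              # only reached if no signal was delivered
        _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0


sig = touch_truncated_mapping()
print("child killed by", signal.Signals(sig).name if sig else "no signal")
```

In both cases the mapping itself is valid; the fault handler simply has no page it can provide for the address being touched.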
Comment 6 txtoxtox285 2013-12-22 15:24:34 UTC
(In reply to comment #5)
> Does it work correctly without tuxonice? 

No, vanilla 3.12.4 has exactly the same problems:
boot => suspend to disk => resume => start X => run firefox => SIGBUS in EVERGREENUploadToScreen

(In case it is related: as on 3.12.4-tuxonice, modprobe -r radeon oopses with a “BUG: unable to handle kernel paging request”, even without ever starting X.)

I admit I didn’t try X before a suspend/resume cycle.
Comment 7 txtoxtox285 2013-12-22 21:25:58 UTC
Correct me if I’m wrong, but if I understand the driver code correctly, the SIGBUS is delivered because of the following chain of function calls:

ttm_bo_move_buffer(bo, {fpfn = 0x0, lpfn = 0x10000, num_placement = 1, num_busy_placement = 1}, 0, 0) = -12
ttm_bo_validate
radeon_bo_fault_reserve_notify (as bdev->driver->fault_reserve_notify)
ttm_bo_vm_fault (as ttm_vm_ops->fault)
radeon_ttm_fault

where the return value from ttm_bo_move_buffer (-ENOMEM) is moved up to ttm_bo_vm_fault, which then returns VM_FAULT_SIGBUS.
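
The propagation described above can be modelled as a toy in Python. The function names mirror the call chain from this comment, but the bodies are hypothetical stand-ins and not the kernel's implementation; the real fault handler distinguishes more cases than this sketch. The parameter move_succeeds is invented for the example. The last lines also work out what lpfn = 0x10000 would mean as a placement limit, assuming 4 KiB pages.

```python
# Toy model only: names follow the call chain quoted in this comment, the
# bodies are hypothetical stand-ins for the kernel functions.
ENOMEM = 12  # Linux errno value; the comment reports ttm_bo_move_buffer() = -12


def ttm_bo_move_buffer(move_succeeds: bool) -> int:
    return 0 if move_succeeds else -ENOMEM


def ttm_bo_validate(move_succeeds: bool) -> int:
    return ttm_bo_move_buffer(move_succeeds)


def radeon_bo_fault_reserve_notify(move_succeeds: bool) -> int:
    return ttm_bo_validate(move_succeeds)


def ttm_bo_vm_fault(move_succeeds: bool) -> str:
    ret = radeon_bo_fault_reserve_notify(move_succeeds)
    # In this model any error from the notify hook ends in a bus error.
    return "handled" if ret == 0 else "VM_FAULT_SIGBUS"


print(ttm_bo_vm_fault(move_succeeds=False))

# Placement limit from the trace, assuming 4 KiB pages (an assumption).
LPFN = 0x10000
PAGE_SIZE = 4096
limit_mib = LPFN * PAGE_SIZE // 2**20
print(f"placement limited to the first {limit_mib} MiB")
```

Under that page-size assumption the buffer is being asked to move into the first 256 MiB, which would fit the explanation that it has to be brought into a CPU-accessible region before the fault can be satisfied.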
Comment 8 txtoxtox285 2013-12-22 21:33:25 UTC
During the many cycles of “printk, compile, reboot, suspend, resume, crash X” I’ve found that the problem may have to do with the fact that I load radeon.ko during my initramfs before attempting resume by echoing into /sys/power/resume or /sys/power/tuxonice/do_resume.

If I do *not* load radeon.ko into the “booting” kernel (i.e., the one which is then replaced by the “resuming” kernel), I couldn’t reproduce the crash. Furthermore, I do *not* get these four messages from the “resuming” kernel about a lockup:

[  864.574325] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[  864.574328] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002)
[  864.574331] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[  864.574334] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
Comment 9 Alex Deucher 2013-12-23 14:22:12 UTC
(In reply to comment #8)
> During the many cycles of “printk, compile, reboot, suspend, resume, crash
> X” I’ve found that the problem may have to do with the fact that I load
> radeon.ko during my initramfs before attempting resume by echoing into
> /sys/power/resume or /sys/power/tuxonice/do_resume.
> 
> If I do *not* load radeon.ko into the “booting” kernel (i.e., the one which
> is then replaced by the “resuming” kernel), I couldn’t reproduce the crash.
> Furthermore, I do *not* get these four messages from the “resuming” kernel
> about a lockup:
> 
> [  864.574325] radeon 0000:01:00.0: GPU lockup CP stall for more than
> 10000msec
> [  864.574328] radeon 0000:01:00.0: GPU lockup (waiting for
> 0x0000000000000004 last fence id 0x0000000000000002)
> [  864.574331] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed
> (-35).
> [  864.574334] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB
> on ring 5 (-35).

I think what's happening is that when you get the GPU lockup, the driver can't reset the GPU properly, so it has no way to migrate data from non-CPU-accessible VRAM to CPU-accessible VRAM. You then end up with the SIGBUS when the CPU tries to access the non-CPU-accessible VRAM.

I'm not quite following what you mean by booting vs. resuming kernel.
Comment 10 txtoxtox285 2013-12-23 21:48:43 UTC
> I'm not quite following what you mean by booting vs. resuming kernel.

I resume by echoing into /sys/power/resume from an initramfs. The kernel which is in charge before that is the one I called the “booting” kernel. When I echo into /sys/power/resume, my naive understanding is that the old kernel (the one which saved a memory image to swap) takes over control again (the “resuming” kernel).

I suppose that’s not exactly what happens (is the kernel which just booted really kicked out of existence or does it keep on running and take over the memory image from the kernel which was suspended ?). But my point was that when radeon.ko is loaded before the initramfs echoes into /sys/power/resume, the lockup and the SIGBUS happen, whereas if it is not loaded, everything’s fine.
Comment 11 Alex Deucher 2013-12-24 19:18:07 UTC
(In reply to comment #10)
> I suppose that’s not exactly what happens (is the kernel which just booted
> really kicked out of existence or does it keep on running and take over the
> memory image from the kernel which was suspended ?). But my point was that
> when radeon.ko is loaded before the initramfs echoes into /sys/power/resume,
> the lockup and the SIGBUS happen, whereas if it is not loaded, everything’s
> fine.

Does the booting kernel unload the radeon driver prior to switching to the resuming kernel?  If not, the resuming kernel may take over before the radeon driver has finished loading in the boot kernel, or in the middle of some operation which leaves the GPU in a bad state when the resuming kernel takes over.
Comment 12 Michael Meeks 2014-01-25 20:20:19 UTC
I don't want to tangle this bug completely, but I get something similar, 100% reproducibly; my sin being just to log in under XFCE, or to launch an app under gnome-shell ;-)

Program received signal SIGBUS, Bus error.
0xb7274907 in memcpy () from /lib/libc.so.6
(gdb) bt
#0  0xb7274907 in memcpy () from /lib/libc.so.6
#1  0xb6cc0b97 in memcpy (__len=7680, __src=0xb41b6008, __dest=0xb377c000) at /usr/include/bits/string3.h:51
#2  EVERGREENUploadToScreen (pDst=pDst@entry=0xa5689c0, x=0, y=y@entry=0, w=1920, h=1200,
    src=0xb41b6008 "\020\017\v\377\021\020\f\377\020\016\f\377\016\r\v\377\f\r\n\377\021\021\016"..., src_pitch=7680) at evergreen_exa.c:1721
#3  0xb6c7c506 in exaCopyDirty (migrate=migrate@entry=0xbfdd50a0, pValidDst=pValidDst@entry=0xa568a34, pValidSrc=pValidSrc@entry=0xa568a28,
    transfer=0xb6cc0820 <EVERGREENUploadToScreen>, fallback_index=fallback_index@entry=0, sync=sync@entry=0x0) at exa_migration_classic.c:220
#4  0xb6c7c9e6 in exaCopyDirtyToFb (migrate=migrate@entry=0xbfdd50a0) at exa_migration_classic.c:303
#5  0xb6c7e92f in exaDoMigration_mixed (pixmaps=0xbfdd5090, npixmaps=2, can_accel=1) at exa_migration_mixed.c:118
#6  0xb6c7b17f in exaDoMigration (pixmaps=pixmaps@entry=0xbfdd5090, npixmaps=npixmaps@entry=2, can_accel=can_accel@entry=1) at exa.c:1134
#7  0xb6c81218 in exaFillRegionTiled (pDrawable=pDrawable@entry=0xa5b9f38, pRegion=pRegion@entry=0xa6cce40, pTile=0xa5689c0,
    pPatOrg=pPatOrg@entry=0xa207ce8, planemask=4294967295, alu=3, clientClipType=0) at exa_accel.c:1124
#8  0xb6c81d58 in exaPolyFillRect (pDrawable=0xa5b9f38, pGC=0xa207cc0, nrect=1, prect=0xa5daff0) at exa_accel.c:821
#9  0x08163a67 in damagePolyFillRect (pDrawable=0xa5b9f38, pGC=0xa207cc0, nRects=1, pRects=0xa5daff0) at damage.c:1254
#10 0x081bffa8 in miPaintWindow (pWin=<optimized out>, pWin@entry=0xa5b9f38,

I wonder if it could be related. I'm very happy to help debug/test patches etc.; as I say, it's very easy to reproduce, no suspend/resume required =)
Comment 13 Alex Deucher 2014-01-25 20:23:23 UTC

*** This bug has been marked as a duplicate of bug 44099 ***
Comment 14 txtoxtox285 2014-02-01 13:25:41 UTC
(In reply to comment #13)
> 
> *** This bug has been marked as a duplicate of bug 44099 ***

I can confirm that with the patch posted there (#91541) applied, the X server no longer crashes.

However, if I may be so bold as to ask, does this patch actually solve the problem, or does it simply avoid it?

