Bug 29738

Summary:

SIGBUS after upgrade to 2.6.36-rc1-git4 [full stacktrace]

Product:

DRI

Reporter:

Till Matthiesen <entropy>

Component:

DRM/Radeon

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

QA Contact:

Severity:

normal

Priority:

medium

CC:

alexandref75, bugs.xorg, daniel, jlp.bugs, mikko.cal, tomasz.figa

Version:

XOrg git

Hardware:

Other

OS:

All

Attachments:

Description	Flags
backtrace	none
fix range restriction checks in drm_mm	none
Log from X.org X Server	none
Xorg backtrace	none

Description Till Matthiesen 2010-08-22 14:31:48 UTC

Hi all,

I gave Linux 2.6.36-rc1-git4 a try today coming from 2.6.35.3.

Once in a while X crashes with a SIGBUS in pixman (0.18.4) [see the attached backtrace].
I'm filing this bug here because it seems to be connected to the kernel
version: I was not able to reproduce it with 2.6.35.3.

I cannot provide a definite way to reproduce this issue. It happened during different actions, such as flipping through a PDF document and resizing an mplayer window.

Keep up the excellent work!

-- system --

xorg-server-1.8.2

xf86-video-ati (git 08/22)
libdrm (git 08/22)

AMD Phenom(tm) II X4 940 Processor
AMD Radeon HD 4850

Comment 1 Till Matthiesen 2010-08-22 14:33:40 UTC

Created attachment 38076 [details]
backtrace

Comment 2 Alexandre 2010-08-23 08:54:34 UTC

Same error here. It happens with xorg 1.9.0 and older versions, but only with 2.6.36-rc1. All versions of 2.6.35 (including 2.6.35.2) work well; I could never trigger a bus error with them.

It sometimes happens within pixman, but for me the following backtrace is more common.

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x28) [0x4931a8]
1: /usr/bin/X (0x400000+0x5d979) [0x45d979]
2: /lib/libpthread.so.0 (0x7fe19e948000+0xf410) [0x7fe19e957410]
3: /lib/libc.so.6 (memcpy+0x67) [0x7fe19d931aa7]
4: /usr/lib64/xorg/modules/libfb.so (fbBlt+0x100) [0x7fe19b600350]
5: /usr/lib64/xorg/modules/libfb.so (fbBltStip+0x40) [0x7fe19b601080]
6: /usr/lib64/xorg/modules/libfb.so (fbPutZImage+0x1a2) [0x7fe19b605fd2]
7: /usr/lib64/xorg/modules/libexa.so (0x7fe19b3d6000+0x1245a) [0x7fe19b3e845a]
8: /usr/lib64/xorg/modules/libexa.so (0x7fe19b3d6000+0x960d) [0x7fe19b3df60d]
9: /usr/bin/X (0x400000+0xc3846) [0x4c3846]
10: /usr/bin/X (0x400000+0x381cb) [0x4381cb]
11: /usr/bin/X (0x400000+0x3a4b9) [0x43a4b9]
12: /usr/bin/X (0x400000+0x248ca) [0x4248ca]
13: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7fe19d8cfd2d]
14: /usr/bin/X (0x400000+0x24469) [0x424469]
Bus error at address 0x7fe187df9000


Backtrace with pixman bus error
Backtrace:
0: /usr/bin/X (xorg_backtrace+0x28) [0x493148]
1: /usr/bin/X (0x400000+0x5d8c9) [0x45d8c9]
2: /lib/libpthread.so.0 (0x7fefb50d9000+0xf410) [0x7fefb50e8410]
3: /usr/lib/libpixman-1.so.0 (0x7fefb4e79000+0x51673) [0x7fefb4eca673]
4: /usr/lib/libpixman-1.so.0 (0x7fefb4e79000+0x51b1e) [0x7fefb4ecab1e]
5: /usr/lib/libpixman-1.so.0 (pixman_fill+0x3a) [0x7fefb4eaa9ea]
6: /usr/lib64/xorg/modules/libfb.so (fbFill+0x30c) [0x7fefb1d94ffc]
7: /usr/lib64/xorg/modules/libfb.so (fbPolyFillRect+0x1da) [0x7fefb1d9555a]
8: /usr/bin/X (0x400000+0x168cf0) [0x568cf0]
9: /usr/bin/X (0x400000+0x168eea) [0x568eea]
10: /usr/bin/X (miWideLine+0x1a7) [0x56ac17]
11: /usr/bin/X (miPolySegment+0x3f) [0x5490df]
12: /usr/lib64/xorg/modules/libexa.so (0x7fefb1b67000+0x137a9) [0x7fefb1b7a7a9]
13: /usr/bin/X (0x400000+0xc272d) [0x4c272d]
14: /usr/bin/X (0x400000+0x388a2) [0x4388a2]
15: /usr/bin/X (0x400000+0x3a4b9) [0x43a4b9]
16: /usr/bin/X (0x400000+0x248ca) [0x4248ca]
17: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7fefb4060d2d]
18: /usr/bin/X (0x400000+0x24469) [0x424469]
Bus error at address 0x7fefaace4420

Comment 3 Alex Deucher 2010-08-23 09:00:46 UTC

Can you bisect the kernel to see which commit is causing the problem?

Comment 4 Alexandre 2010-08-23 16:37:19 UTC

I will try to bisect. It will take a while because I still have not found a consistent way to trigger the bug.

Comment 5 Till Matthiesen 2010-08-23 17:17:27 UTC

All I can tell so far is that it looks like this behaviour was introduced somewhere between 2.6.35-git2 and 2.6.35-git3.

As Alexandre said, it is not straightforward to trigger.
I might have missed it in 2.6.35-git2 (though I don't think so), but X crashed with 2.6.35-git3 shortly after a few resizes of the mplayer window.

Comment 6 Till Matthiesen 2010-08-24 13:29:05 UTC

Bisection pointed me to the merged drm-core-next branch, which does not really come as a surprise.

commit fc1caf6eafb30ea185720e29f7f5eccca61ecd60
Merge: 9779714 96576a9
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Aug 5 16:02:01 2010 -0700

    Merge branch 'drm-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6

Can you tell me how to bisect such a branch?

As a side note - Xorg.0.log ends with the following message:
[   232.660] EXA bug: FinishAccess called without PrepareAccess for pixmap 0x0x1ed3b90.

Comment 7 Nick Bowler 2010-08-25 15:36:08 UTC

(In reply to comment #6)
> Bisection pointed me to the merged drm-core-next branch, which does not really
> come as a surprise.
[...]
> Can you tell me how to bisect such a branch?

If the bisection pinpoints a merge, it _usually_ means that you mis-labeled an
earlier commit (although it is possible that the merge itself is actually the
culprit).  Unless you made a typo, all your 'bad' labels are likely correct.

So look at the output of 'git bisect log', and make a guess at which 'good'
commit might be mis-labeled (choose an older commit over a newer one if you're
unsure which to try first).  Check out that commit, and try again.  If it turns
out to be bad, you'll need to issue a 'git bisect bad' and then remove one or
more 'good' labels, see the git-bisect man page for details.
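
A minimal sketch of that relabelling, following the log/reset/replay procedure described in the git-bisect man page. The function name relabel_good_as_bad is made up for illustration, and it only handles verdicts recorded with a plain 'git bisect good', not revisions passed directly to 'git bisect start':

```shell
# Hypothetical helper: turn one 'good' verdict of a running bisection into 'bad'.
# Usage: relabel_good_as_bad <commit>
relabel_good_as_bad() {
    # Resolve the argument to a full commit id, as written in the bisect log.
    suspect=$(git rev-parse --verify "$1^{commit}") || return 1
    log=$(mktemp) || return 1

    # Save every verdict recorded so far.
    git bisect log > "$log" || return 1

    # Drop the line that marked the suspect commit as good.
    grep -v "^git bisect good $suspect" "$log" > "$log.fixed"

    # Start over and replay the remaining verdicts.
    git bisect reset > /dev/null 2>&1
    git bisect replay "$log.fixed" > /dev/null || return 1
    rm -f "$log" "$log.fixed"

    # Record the corrected verdict; git then picks the next commit to test.
    git bisect bad "$suspect"
}
```

Run as 'relabel_good_as_bad <commit>' from inside the tree being bisected, this should leave the bisection running with the corrected labels. If git then reports that some good revisions are not ancestors of the bad one, further 'good' lines probably need to be removed in the same way.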

Comment 8 Till Matthiesen 2010-08-25 17:57:35 UTC

You were perfectly right Nick. Thanks for the hint.

Finally, we have the commit that triggers the issue.

commit 709ea97145c125b3811ff70429e90ebdb0e832e5
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Jul 2 15:02:16 2010 +0100

    drm: implement helper functions for scanning lru list
    
    These helper functions can be used to efficiently scan lru list
    for eviction. Eviction becomes a three stage process:
    1. Scanning through the lru list until a suitable hole has been found.
    2. Scan backwards to restore drm_mm consistency and find out which
       objects fall into the hole.
    3. Evict the objects that fall into the hole.
    
    These helper functions don't allocate any memory (at the price of
    not allowing any other concurrent operations). Hence this can also be
    used for ttm (which does lru scanning under a spinlock).
    
    Evicting objects in this fashion should be more fair than the current
    approach by i915 (scan the lru for a object large enough to contain
    the new object). It's also more efficient than the current approach used
    by ttm (uncoditionally evict objects from the lru until there's enough
    free space).
    
    Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Acked-by: Thomas Hellstrom <thellstrom@vmwgfx.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Dave Airlie <airlied@redhat.com>

Comment 9 Alexandre 2010-08-26 00:33:25 UTC

In my bisect I tried to reduce the number of tests by restricting the commits to the drivers/gpu/drm and include/drm directories.

The commit I got is the one below. I am not 100% confident, since confirming that a commit is good is difficult (confirming that it is bad is easy), and some combinations keep hitting a different bug, which is described below.

7a6b2896f261894dde287d3faefa4b432cddca53 is the first bad commit
commit 7a6b2896f261894dde287d3faefa4b432cddca53
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Jul 2 15:02:15 2010 +0100

    drm_mm: extract check_free_mm_node
    
    There are already two copies of this logic. And the new scanning
    stuff will add some more. So extract it into a small helper
    function.
    
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Acked-by: Thomas Hellstrom <thellstrom@vmwgfx.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Dave Airlie <airlied@redhat.com>

:040000 040000 a1d9ede188db6c1fc721c526a56437758f213c35 fcc9ba1feccf51c5d078301a27e9ec41b0498adf M      drivers
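
For reference, a path-restricted bisection like the one described above can be started by passing the directories after '--'. This is only a sketch: the wrapper name bisect_drm is hypothetical, and the two revision arguments are placeholders for whatever bad and good endpoints are being used.

```shell
# Hypothetical wrapper: bisect only over commits that touch the DRM directories.
# Usage: bisect_drm <bad-revision> <good-revision>
bisect_drm() {
    # Everything after "--" is a path; commits that change nothing under
    # these paths are not offered as test candidates.
    git bisect start "$1" "$2" -- drivers/gpu/drm include/drm
}
```

With the versions mentioned in this report, a call could look like 'bisect_drm v2.6.36-rc1 v2.6.35'. The trade-off is that a culprit outside these two directories cannot be found this way.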

In some cases I keep getting the error below, after which the drm restarts. I do not know whether this error is masking the fatal one. I do not see it in 2.6.35 or 2.6.35.2.

[18090.916037] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:235 0xffffffff8138504a()
[18090.916040] Hardware name: Studio XPS 1640
[18090.916041] GPU lockup (waiting for 0x002D906A last fence id 0x002D9062)
Aug 25 20:06:58 becky3 kernel: [18090.916043] Modules linked in: mmc_block aes_x86_64 aes_generic snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss sco bnep rfcomm l2cap ipv6 acpi_cpufreq mperf btusb bluetooth arc4 snd_hda_codec_atihdmi ecb snd_hda_codec_idt iwlagn snd_hda_intel iwlcore snd_hda_codec snd_hwdep usbhid mac80211 uvcvideo snd_pcm snd_timer snd videodev v4l2_compat_ioctl32 cfg80211 ehci_hcd uhci_hcd usbcore rtc_cmos rtc_core sdhci_pci sdhci mmc_core dell_wmi dell_laptop processor snd_page_alloc wmi ohci1394 video thermal joydev i2c_i801 rtc_lib rfkill led_class pcspkr sg backlight output intel_agp ieee1394 thermal_sys dcdbas ac battery button
[18090.916092] Pid: 5449, comm: kwin Tainted: G        W   2.6.35-rc4+ #3
[18090.916094] Call Trace:
[18090.916098]  [<ffffffff8103a0ca>] 0xffffffff8103a0ca
[18090.916100]  [<ffffffff8103a1a1>] 0xffffffff8103a1a1
[18090.916102]  [<ffffffff8138504a>] 0xffffffff8138504a
[18090.916104]  [<ffffffff81051790>] ? 0xffffffff81051790
[18090.916106]  [<ffffffff8138593c>] 0xffffffff8138593c
[18090.916108]  [<ffffffff81350906>] 0xffffffff81350906
[18090.916110]  [<ffffffff8139aece>] 0xffffffff8139aece
[18090.916112]  [<ffffffff8133c3ea>] 0xffffffff8133c3ea
[18090.916114]  [<ffffffff8139ae40>] ? 0xffffffff8139ae40
[18090.916116]  [<ffffffff810c6e12>] ? 0xffffffff810c6e12
[18090.916119]  [<ffffffff810d5b18>] 0xffffffff810d5b18
[18090.916121]  [<ffffffff810d5cc0>] 0xffffffff810d5cc0
[18090.916123]  [<ffffffff810c7a70>] ? 0xffffffff810c7a70
[18090.916125]  [<ffffffff810d61aa>] 0xffffffff810d61aa
[18090.916127]  [<ffffffff81002ceb>] 0xffffffff81002ceb
[18090.916129] ---[ end trace cc0eb4d6579677d8 ]---
[18090.916136] [drm] Disabling audio support
[18090.917139] radeon 0000:01:00.0: ffff88013ef8ae00 unpin not necessary
[18090.917356] radeon 0000:01:00.0: GPU softreset
[18090.917359] radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA23034E0
[18090.917361] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
[18090.917363] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200010C0
[18090.917372] radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
[18090.932253] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
[18090.948123] radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003030
[18090.948125] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
[18090.948128] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
[18090.949120] radeon 0000:01:00.0: GPU reset succeed
[18090.967247] [drm] Clocks initialized !
[18091.000606] [drm] ring test succeeded in 1 usecs
[18091.000616] [drm] ib test succeeded in 1 usecs
[18091.000619] [drm] Enabling audio support
[18094.508025] radeon 0000:01:00.0: GPU lockup CP stall for more than 1020msec

Comment 10 Alex Deucher 2010-08-26 08:01:25 UTC

*** Bug 29819 has been marked as a duplicate of this bug. ***

Comment 11 Daniel Vetter 2010-08-26 12:37:10 UTC

Created attachment 38195 [details] [review]
fix range restriction checks in drm_mm

This patch should fix the problem. Please test.

Comment 12 Till Matthiesen 2010-08-26 13:49:21 UTC

Thanks, Daniel.

I'm testing against rc2-git4 and it runs fine so far (~30 min),
doing all that 'stuff' that made it easily crash within 2-5 minutes.

Comment 13 Alexandre 2010-08-26 17:41:29 UTC

Me too. Before your patch arrived I was finishing a test of 2.6.36-rc2 with the two previous patches removed, and it worked fine. I am now testing with your patch and it works; I could not crash it. Thanks.

Comment 14 Till Matthiesen 2010-08-27 02:31:59 UTC

I think we can consider this issue fixed.

Thanks again!

Comment 15 Alex Deucher 2010-09-10 08:29:29 UTC

*** Bug 30116 has been marked as a duplicate of this bug. ***

Comment 16 boris64 2010-09-10 08:48:32 UTC

I got this X crash on kernel-2.6.35.4
(-> https://bugs.freedesktop.org/show_bug.cgi?id=30116).
Is there a known patch that applies correctly to this kernel version?

Comment 17 Tomasz Figa 2010-09-14 06:44:23 UTC

I'm afraid that this problem persists. I'm getting random crashes with a SIGBUS signal on kernels from the drm-radeon-testing and drm-fixes branches.

xorg-server git 09/13
xf86-video-ati git 09/13
libdrm git 09/13

Intel Core 2 Quad Q6600 Processor
AMD Radeon HD 5770

Comment 18 Tomasz Figa 2010-09-14 06:46:15 UTC

Created attachment 38690 [details]
Log from X.org X Server

Comment 19 Tomasz Figa 2010-09-14 07:55:09 UTC

After some additional testing, I can reproduce this bug by opening a PDF document in Okular and scrolling back and forth through several pages; the mouse cursor then hangs and the X server crashes.

Moreover, I can reproduce this bug with the 2.6.35.4 kernel (with the 2D corruption patch applied; I haven't tried without it yet).

Comment 20 Tomasz Figa 2010-09-14 08:15:49 UTC

After disabling RenderAccel I can't reproduce the bug, but that turns off acceleration, so it can't be considered a solution to the problem.

Comment 21 Daniel Vetter 2010-09-14 09:26:09 UTC

Tomasz, can you please confirm that you're indeed still hitting this bug by checking that the commit that introduced the problem (709ea97145c125b3811ff70429e90ebdb0e832e5) does not work, but its immediate parent (7a6b2896f261894dde287d3faefa4b432cddca53) does work? If that's not the case, you're hitting a different problem; please create a new bug then, to avoid confusion.

Also please add your dmesg (after having crashed X).
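
A small sketch of how the two revisions for this check could be looked up in a kernel git tree. The helper name suspect_and_parent is hypothetical; it only prints the commit ids and does not build or test anything.

```shell
# Hypothetical helper: print a suspect commit and its first parent,
# i.e. the revision expected to fail and the one expected to work.
# Usage: suspect_and_parent <commit>
suspect_and_parent() {
    bad=$(git rev-parse --verify "$1^{commit}") || return 1
    # "<commit>^" names the first parent of <commit>.
    parent=$(git rev-parse --verify "$bad^") || return 1
    echo "expected bad:  $bad"
    echo "expected good: $parent"
}
```

Going by the comment above, 'suspect_and_parent 709ea97145c125b3811ff70429e90ebdb0e832e5' should report 7a6b2896f261894dde287d3faefa4b432cddca53 as the parent. Each of the two revisions would then be checked out with 'git checkout', built, and tested separately.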

Comment 22 Tomasz Figa 2010-09-14 10:02:11 UTC

(In reply to comment #21)
> Tomasz, can you please confirm that you're indeed still hitting this bug by
> checking that the commit that introduced the problem
> (709ea97145c125b3811ff70429e90ebdb0e832e5) does not work, but its immediate
> parent (7a6b2896f261894dde287d3faefa4b432cddca53) does work? If that's not the
> case, you're hitting a different problem; please create a new bug then, to
> avoid confusion.
> 
> Also please add your dmesg (after having crashed X).

I guess I'm hitting another problem. Commit 7a6b2896f261894dde287d3faefa4b432cddca53 doesn't work either, but it gives some useful information in dmesg.

I'm opening a new bug then.

Comment 23 Maggioni Marcello 2010-10-05 17:42:51 UTC

Created attachment 39199 [details]
Xorg backtrace

Comment 24 Alex Deucher 2010-11-18 10:13:34 UTC

This should be fixed in 2.6.36.
