Summary: | SIGBUS after upgrade to 2.6.36-rc1-git4 [full stacktrace]
---|---
Product: | DRI
Component: | DRM/Radeon
Status: | RESOLVED FIXED
Severity: | normal
Priority: | medium
Version: | XOrg git
Hardware: | Other
OS: | All
Reporter: | Till Matthiesen <entropy>
Assignee: | Default DRI bug account <dri-devel>
CC: | alexandref75, bugs.xorg, daniel, jlp.bugs, mikko.cal, tomasz.figa

Description
Till Matthiesen
2010-08-22 14:31:48 UTC
Created attachment 38076 [details]
backtrace

Same error. It happens with xorg 1.9.0 and older versions, but only with 2.6.36-rc1. All versions of 2.6.35 (including 2.6.35.2) work well; I could never generate a bus error there. It sometimes happens within pixman, but for me it is more common with this backtrace.

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x28) [0x4931a8]
1: /usr/bin/X (0x400000+0x5d979) [0x45d979]
2: /lib/libpthread.so.0 (0x7fe19e948000+0xf410) [0x7fe19e957410]
3: /lib/libc.so.6 (memcpy+0x67) [0x7fe19d931aa7]
4: /usr/lib64/xorg/modules/libfb.so (fbBlt+0x100) [0x7fe19b600350]
5: /usr/lib64/xorg/modules/libfb.so (fbBltStip+0x40) [0x7fe19b601080]
6: /usr/lib64/xorg/modules/libfb.so (fbPutZImage+0x1a2) [0x7fe19b605fd2]
7: /usr/lib64/xorg/modules/libexa.so (0x7fe19b3d6000+0x1245a) [0x7fe19b3e845a]
8: /usr/lib64/xorg/modules/libexa.so (0x7fe19b3d6000+0x960d) [0x7fe19b3df60d]
9: /usr/bin/X (0x400000+0xc3846) [0x4c3846]
10: /usr/bin/X (0x400000+0x381cb) [0x4381cb]
11: /usr/bin/X (0x400000+0x3a4b9) [0x43a4b9]
12: /usr/bin/X (0x400000+0x248ca) [0x4248ca]
13: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7fe19d8cfd2d]
14: /usr/bin/X (0x400000+0x24469) [0x424469]
Bus error at address 0x7fe187df9000

Backtrace with pixman bus error

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x28) [0x493148]
1: /usr/bin/X (0x400000+0x5d8c9) [0x45d8c9]
2: /lib/libpthread.so.0 (0x7fefb50d9000+0xf410) [0x7fefb50e8410]
3: /usr/lib/libpixman-1.so.0 (0x7fefb4e79000+0x51673) [0x7fefb4eca673]
4: /usr/lib/libpixman-1.so.0 (0x7fefb4e79000+0x51b1e) [0x7fefb4ecab1e]
5: /usr/lib/libpixman-1.so.0 (pixman_fill+0x3a) [0x7fefb4eaa9ea]
6: /usr/lib64/xorg/modules/libfb.so (fbFill+0x30c) [0x7fefb1d94ffc]
7: /usr/lib64/xorg/modules/libfb.so (fbPolyFillRect+0x1da) [0x7fefb1d9555a]
8: /usr/bin/X (0x400000+0x168cf0) [0x568cf0]
9: /usr/bin/X (0x400000+0x168eea) [0x568eea]
10: /usr/bin/X (miWideLine+0x1a7) [0x56ac17]
11: /usr/bin/X (miPolySegment+0x3f) [0x5490df]
12: /usr/lib64/xorg/modules/libexa.so (0x7fefb1b67000+0x137a9) [0x7fefb1b7a7a9]
13: /usr/bin/X (0x400000+0xc272d) [0x4c272d]
14: /usr/bin/X (0x400000+0x388a2) [0x4388a2]
15: /usr/bin/X (0x400000+0x3a4b9) [0x43a4b9]
16: /usr/bin/X (0x400000+0x248ca) [0x4248ca]
17: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7fefb4060d2d]
18: /usr/bin/X (0x400000+0x24469) [0x424469]
Bus error at address 0x7fefaace4420

Can you bisect the kernel to see what commit is causing the problem?

I will try to bisect. It will take a while because I still have not found a consistent way to trigger the bug.

All I can tell so far is that this behaviour appears to have been introduced somewhere between 2.6.35-git2 and 2.6.35-git3. As Alexandre said, it is not straightforward to trigger. I might have missed it in 2.6.35-git2 (I don't think so), but X crashed with 2.6.35-git3 shortly after a few resizes of the mplayer window.

Bisection pointed me to the merged drm-core-next branch, which does not really come unexpected.

commit fc1caf6eafb30ea185720e29f7f5eccca61ecd60
Merge: 9779714 96576a9
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu Aug 5 16:02:01 2010 -0700

    Merge branch 'drm-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6

Can you tell me how I bisect such a branch?

As a side note, Xorg.0.log ends with the following message:
[ 232.660] EXA bug: FinishAccess called without PrepareAccess for pixmap 0x0x1ed3b90.

(In reply to comment #6)
> Bisection pointed me to the merged drm-core-next branch, which does not really
> come unexpected.
[...]
> Can you tell me how I bisect such a branch?

If the bisection pinpoints a merge, it _usually_ means that you mis-labeled an earlier commit (although it is possible that the merge itself is actually the culprit). Unless you made a typo, all your 'bad' labels are likely correct. So look at the output of 'git bisect log', and make a guess at which 'good' commit might be mis-labeled (choose an older commit over a newer one if you're unsure which to try first). Check out that commit, and try again.
If it turns out to be bad, you'll need to issue a 'git bisect bad' and then remove one or more 'good' labels; see the git-bisect man page for details.

You were perfectly right, Nick. Thanks for the hint. Finally, we have the commit that triggers the issue.

commit 709ea97145c125b3811ff70429e90ebdb0e832e5
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Fri Jul 2 15:02:16 2010 +0100

    drm: implement helper functions for scanning lru list

    These helper functions can be used to efficiently scan lru list for eviction. Eviction becomes a three stage process:
    1. Scanning through the lru list until a suitable hole has been found.
    2. Scan backwards to restore drm_mm consistency and find out which objects fall into the hole.
    3. Evict the objects that fall into the hole.

    These helper functions don't allocate any memory (at the price of not allowing any other concurrent operations). Hence this can also be used for ttm (which does lru scanning under a spinlock).

    Evicting objects in this fashion should be more fair than the current approach by i915 (scan the lru for a object large enough to contain the new object). It's also more efficient than the current approach used by ttm (uncoditionally evict objects from the lru until there's enough free space).

    Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Acked-by: Thomas Hellstrom <thellstrom@vmwgfx.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Dave Airlie <airlied@redhat.com>

In my bisect I tried to reduce the number of tests by restricting the commits to the drivers/gpu/drm and include/drm directories. The commit I got is the one below. I am not 100% confident, since getting a good result is difficult (a bad one is easy) and some combinations keep hitting a different bug, which is described below.

7a6b2896f261894dde287d3faefa4b432cddca53 is the first bad commit
commit 7a6b2896f261894dde287d3faefa4b432cddca53
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Fri Jul 2 15:02:15 2010 +0100

    drm_mm: extract check_free_mm_node

    There are already two copies of this logic. And the new scanning stuff will add some more. So extract it into a small helper function.

    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Acked-by: Thomas Hellstrom <thellstrom@vmwgfx.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Dave Airlie <airlied@redhat.com>

:040000 040000 a1d9ede188db6c1fc721c526a56437758f213c35 fcc9ba1feccf51c5d078301a27e9ec41b0498adf M drivers

In some cases I keep getting this error. The drm restarts. I do not know if this error is masking the fatal error. I do not see this error in 2.6.35 or 2.6.35.2.

[18090.916037] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:235 0xffffffff8138504a()
[18090.916040] Hardware name: Studio XPS 1640
[18090.916041] GPU lockup (waiting for 0x002D906A last fence id 0x002D9062)
Aug 25 20:06:58 becky3 kernel: [18090.916043] Modules linked in: mmc_block aes_x86_64 aes_generic snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss sco bnep rfcomm l2cap ipv6 acpi_cpufreq mperf btusb bluetooth arc4 snd_hda_codec_atihdmi ecb snd_hda_codec_idt iwlagn snd_hda_intel iwlcore snd_hda_codec snd_hwdep usbhid mac80211 uvcvideo snd_pcm snd_timer snd videodev v4l2_compat_ioctl32 cfg80211 ehci_hcd uhci_hcd usbcore rtc_cmos rtc_core sdhci_pci sdhci mmc_core dell_wmi dell_laptop processor snd_page_alloc wmi ohci1394 video thermal joydev i2c_i801 rtc_lib rfkill led_class pcspkr sg backlight output intel_agp ieee1394 thermal_sys dcdbas ac battery button
[18090.916092] Pid: 5449, comm: kwin Tainted: G W 2.6.35-rc4+ #3
[18090.916094] Call Trace:
[18090.916098] [<ffffffff8103a0ca>] 0xffffffff8103a0ca
[18090.916100] [<ffffffff8103a1a1>] 0xffffffff8103a1a1
[18090.916102] [<ffffffff8138504a>] 0xffffffff8138504a
[18090.916104] [<ffffffff81051790>] ? 0xffffffff81051790
[18090.916106] [<ffffffff8138593c>] 0xffffffff8138593c
[18090.916108] [<ffffffff81350906>] 0xffffffff81350906
[18090.916110] [<ffffffff8139aece>] 0xffffffff8139aece
[18090.916112] [<ffffffff8133c3ea>] 0xffffffff8133c3ea
[18090.916114] [<ffffffff8139ae40>] ? 0xffffffff8139ae40
[18090.916116] [<ffffffff810c6e12>] ? 0xffffffff810c6e12
[18090.916119] [<ffffffff810d5b18>] 0xffffffff810d5b18
[18090.916121] [<ffffffff810d5cc0>] 0xffffffff810d5cc0
[18090.916123] [<ffffffff810c7a70>] ? 0xffffffff810c7a70
[18090.916125] [<ffffffff810d61aa>] 0xffffffff810d61aa
[18090.916127] [<ffffffff81002ceb>] 0xffffffff81002ceb
[18090.916129] ---[ end trace cc0eb4d6579677d8 ]---
[18090.916136] [drm] Disabling audio support
[18090.917139] radeon 0000:01:00.0: ffff88013ef8ae00 unpin not necessary
[18090.917356] radeon 0000:01:00.0: GPU softreset
[18090.917359] radeon 0000:01:00.0: R_008010_GRBM_STATUS=0xA23034E0
[18090.917361] radeon 0000:01:00.0: R_008014_GRBM_STATUS2=0x00000003
[18090.917363] radeon 0000:01:00.0: R_000E50_SRBM_STATUS=0x200010C0
[18090.917372] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
[18090.932253] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
[18090.948123] radeon 0000:01:00.0: R_008010_GRBM_STATUS=0x00003030
[18090.948125] radeon 0000:01:00.0: R_008014_GRBM_STATUS2=0x00000003
[18090.948128] radeon 0000:01:00.0: R_000E50_SRBM_STATUS=0x200000C0
[18090.949120] radeon 0000:01:00.0: GPU reset succeed
[18090.967247] [drm] Clocks initialized !
[18091.000606] [drm] ring test succeeded in 1 usecs
[18091.000616] [drm] ib test succeeded in 1 usecs
[18091.000619] [drm] Enabling audio support
[18094.508025] radeon 0000:01:00.0: GPU lockup CP stall for more than 1020msec

*** Bug 29819 has been marked as a duplicate of this bug. ***

Created attachment 38195 [details] [review]
fix range restriction checks in drm_mm

This patch should fix the problem. Please test.

Thanks, Daniel.
I'm testing against rc2-git4 and it runs fine so far (~30 min), doing all that 'stuff' that made it easily crash within 2-5 minutes.

Me too. Before your patch I was finishing testing with 2.6.36-rc2 with the two previous patches removed, and it worked fine. I am now testing with your patch and it works; I could not crash it. Thanks.

I think we can consider this issue fixed. Thanks again!

*** Bug 30116 has been marked as a duplicate of this bug. ***

I got this X crash on kernel-2.6.35.4 (-> https://bugs.freedesktop.org/show_bug.cgi?id=30116). Is there a known patch that applies correctly to this kernel version?

I'm afraid that this problem persists. I'm getting random crashes with SIGBUS signal on kernels from drm-radeon-testing and drm-fixes branches.

xorg-server git 09/13
xf86-video-ati git 09/13
libdrm git 09/13
Intel Core 2 Quad Q6600 Processor
AMD Radeon HD 5770

Created attachment 38690 [details]
Log from X.org X Server

After some additional testing, I can reproduce this bug by opening a PDF document in Okular and scrolling through several pages back and forth; then the mouse cursor hangs and the X server crashes. Moreover, I can reproduce this bug with the 2.6.35.4 kernel (with the 2D corruption patch applied; I haven't tried without it yet). After disabling RenderAccel I can't reproduce the bug, but that disables the acceleration and therefore can't be considered a solution to the problem.

Tomasz, can you please confirm that you're indeed still hitting this bug by checking that the commit that introduced the problem (709ea97145c125b3811ff70429e90ebdb0e832e5) does not work, but its immediate parent (7a6b2896f261894dde287d3faefa4b432cddca53) does work. If that's not the case, you're hitting a different problem; please then create a new bug to avoid confusion.

Also please add your dmesg (after having crashed X).

(In reply to comment #21)
> Tomasz, can you please confirm that you're indeed still hitting this bug by
> checking that the commit that introduced the problem
> (709ea97145c125b3811ff70429e90ebdb0e832e5) does not work, but its immediate
> parent (7a6b2896f261894dde287d3faefa4b432cddca53) does work. If that's not the
> case, you're hitting a different problem; please then create a new bug to avoid
> confusion.
>
> Also please add your dmesg (after having crashed X).

I guess I'm hitting another problem. Commit 7a6b2896f261894dde287d3faefa4b432cddca53 doesn't work either, but it gives some useful information in dmesg. I'm opening a new bug then.

Created attachment 39199 [details]
Xorg backtrace
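
The bisection advice earlier in this thread, and the request to compare a commit against its immediate parent, both depend on every 'good' label being correct. The sketch below is a toy model of that search over a made-up linear history (the commit names `c0`..`c10`, the `bisect` helper and the flaky tester are all hypothetical, and real `git bisect` works on a commit graph rather than a list). It only illustrates why a crash that is hard to trigger can push the result past the real culprit.

```python
def bisect(commits, is_bad):
    """Binary search for the first commit that is_bad() labels bad.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]


# Hypothetical linear history in which the bug really enters at "c3".
history = [f"c{i}" for i in range(11)]
truly_bad = set(history[3:])

# Reliable tester: every label is correct.
first_bad = bisect(history, lambda c: c in truly_bad)

# Flaky tester: the crash did not show up while testing "c5",
# so a bad commit is wrongly labelled good.
mislabelled = bisect(history, lambda c: c in truly_bad and c != "c5")

print(first_bad, mislabelled)
```

In this model a single wrong 'good' label moves the answer to a commit after the mislabelled one, which is consistent with the suggestion above to re-test suspicious 'good' commits when a bisection ends somewhere unexpected.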

This should be fixed in 2.6.36.
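
For readers unfamiliar with the three stage eviction process described in the commit message quoted earlier in this thread, here is a rough toy model of the idea. It is not the kernel's drm_mm code: the `Obj` type, `find_hole` and `pick_victims` are hypothetical names, and the hole search is a simple linear sweep chosen for clarity rather than efficiency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    start: int
    size: int

    @property
    def end(self):
        return self.start + self.size


def find_hole(total, blocked, need):
    """Return (start, end) of the first free gap of at least `need` units, or None."""
    cursor = 0
    for obj in sorted(blocked, key=lambda o: o.start):
        if obj.start - cursor >= need:
            return cursor, cursor + need
        cursor = max(cursor, obj.end)
    if total - cursor >= need:
        return cursor, cursor + need
    return None


def pick_victims(total, lru, need):
    """Toy three stage scan; `lru` is ordered least recently used first."""
    blocked = set(lru)  # objects still counted as occupying space
    scanned = []
    hole = find_hole(total, blocked, need)
    # Stage 1: treat LRU objects as free, one at a time, until a hole fits.
    for obj in lru:
        if hole is not None:
            break
        blocked.remove(obj)
        scanned.append(obj)
        hole = find_hole(total, blocked, need)
    if hole is None:
        return None, []
    # Stage 2: walk the scanned objects backwards, putting each one back
    # and noting which of them overlap the chosen hole.
    victims = []
    for obj in reversed(scanned):
        blocked.add(obj)
        if obj.start < hole[1] and hole[0] < obj.end:
            victims.append(obj)
    # Stage 3, the eviction itself, is left to the caller.
    return hole, victims


a, b = Obj("A", 0, 30), Obj("B", 30, 10)
c, d = Obj("C", 40, 30), Obj("D", 70, 30)
hole, victims = pick_victims(100, [b, d, a, c], need=35)
print(hole, [v.name for v in victims])
```

In this example D is scanned before a hole is found but does not overlap the hole, so it is not selected for eviction. That is the difference from unconditionally evicting objects from the LRU until enough space is free.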