Bug 34313

Summary: RV770 lock-up with OpenGL
Product: DRI
Reporter: Bob Ham <rah>
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: major
Priority: medium
CC: kenyon, louismariegivel
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
Attachments:
Description                                                   Flags
X server log                                                  none
Kernel log                                                    none
X server configuration file                                   none
Output of glxinfo                                             none
X server log with r600g                                       none
Output of dmesg with r600g                                    none
X server configuration file with r600g                        none
Output of glxinfo with r600g                                  none
Kernel log of GPU lockup followed by (unrecoverable) X crash  none
kernel.log                                                    none
Kernel log with strange time out behaviour following X load   none

Description Bob Ham 2011-02-15 15:38:20 UTC
Created attachment 43405 [details]
X server log

I get the following GPU lock-up report with many different OpenGL programs, but in this instance Stellarium.

------------[ cut here ]------------
WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:244 radeon_fence_wait+0x38f/0x3f0 [radeon]()
Hardware name: P5Q
GPU lockup (waiting for 0x000CAA2E last fence id 0x000CAA28)
Modules linked in: usb_storage cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_ondemand freq_table lirc_serial(C) lirc_dev sco bnep rfcomm l2cap crc16 binfmt_misc ip6t_LOG ip6table_filter ip6_tables fuse nfsd exportfs nls_iso8859_1 nls_cp437 vfat fat coretemp loop dm_crypt snd_hda_codec_hdmi radeon snd_hda_codec_realtek ttm snd_emu10k1_synth snd_hda_intel snd_emux_synth snd_hda_codec snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 snd_ac97_codec ac97_bus snd_util_mem snd_pcm_oss snd_hwdep snd_mixer_oss snd_seq_dummy snd_seq_oss snd_seq_midi snd_pcm snd_rawmidi gspca_pac207 gspca_main videodev v4l1_compat drm_kms_helper drm snd_seq_midi_event v4l2_compat_ioctl32 snd_page_alloc 8250_pnp snd_seq snd_timer snd_seq_device snd btusb pcspkr i2c_i801 asus_atk0110 bluetooth emu10k1_gp soundcore 8250 gameport evdev serial_core dm_mod usbhid hid sg uhci_hcd sr_mod cdrom atl1e firewire_ohci firewire_core crc_itu_t via_rhine mii ehci_hcd [last unloaded: scsi_wait_scan]
Pid: 2992, comm: Xorg Tainted: G        WC  2.6.37-linux-2.6-latest-32 #10
Call Trace:
 [<ffffffff8103c8fb>] ? warn_slowpath_common+0x7b/0xc0
 [<ffffffff8103c9f5>] ? warn_slowpath_fmt+0x45/0x50
 [<ffffffffa0461b1f>] ? radeon_fence_wait+0x38f/0x3f0 [radeon]
 [<ffffffff81057740>] ? autoremove_wake_function+0x0/0x30
 [<ffffffffa00f90c9>] ? ttm_bo_wait+0x119/0x1d0 [ttm]
 [<ffffffffa0478b8e>] ? radeon_gem_wait_idle_ioctl+0x8e/0x110 [radeon]
 [<ffffffffa01fa331>] ? drm_ioctl+0x3c1/0x440 [drm]
 [<ffffffffa0478b00>] ? radeon_gem_wait_idle_ioctl+0x0/0x110 [radeon]
 [<ffffffff8100d772>] ? save_i387_xstate+0x92/0x210
 [<ffffffff810022df>] ? do_signal+0x19f/0x800
 [<ffffffff810e7a5f>] ? do_vfs_ioctl+0x9f/0x550
 [<ffffffff8100d538>] ? restore_i387_xstate+0x148/0x1d0
 [<ffffffff810e7f90>] ? sys_ioctl+0x80/0xa0
 [<ffffffff81002dbb>] ? system_call_fastpath+0x16/0x1b
---[ end trace 37e82fba971a1173 ]---
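
The "GPU lockup" line in a trace like the one above carries two fence IDs: the one being waited for and the last one seen. A minimal Python sketch for pulling both out of a kernel log line and computing the difference is given here. The function name `fence_gap` is hypothetical, and reading the difference as "fences still outstanding" is an assumption (it also ignores counter wraparound), not something the driver documents in this report.

```python
import re

# Matches the radeon fence-wait warning text quoted in this report.
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for 0x([0-9A-Fa-f]+) last fence id 0x([0-9A-Fa-f]+)\)"
)


def fence_gap(line):
    """Return (waited, last, gap) for a lockup line, or None if it does not match.

    Hypothetical helper: 'gap' is simply waited - last, with no wraparound handling.
    """
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    waited = int(match.group(1), 16)
    last = int(match.group(2), 16)
    return waited, last, waited - last


if __name__ == "__main__":
    sample = "GPU lockup (waiting for 0x000CAA2E last fence id 0x000CAA28)"
    print(fence_gap(sample))
```

Feeding each line of an attached kernel log through `fence_gap` would make it easy to compare lockups across runs, for example the classic-driver trace against the r600g one.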
Comment 1 Bob Ham 2011-02-15 15:39:16 UTC
Created attachment 43406 [details]
Kernel log
Comment 2 Bob Ham 2011-02-15 15:40:10 UTC
Created attachment 43407 [details]
X server configuration file
Comment 3 Bob Ham 2011-02-15 15:43:25 UTC
Created attachment 43409 [details]
Output of glxinfo
Comment 4 Alex Deucher 2011-02-15 17:15:11 UTC
I'd recommend using r600g.
Comment 5 Bob Ham 2011-02-16 15:57:34 UTC
With r600g, I still get lockups:

------------[ cut here ]------------
WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:244 radeon_fence_wait+0x38f/0x3f0 [radeon]()
Hardware name: P5Q
GPU lockup (waiting for 0x0013F5D7 last fence id 0x0013F5D6)
Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_ondemand freq_table lirc_serial(C) lirc_dev sco bnep rfcomm l2cap crc16 binfmt_misc ip6t_LOG ip6table_filter ip6_tables fuse nfsd exportfs nls_iso8859_1 nls_cp437 vfat fat coretemp loop dm_crypt snd_hda_codec_hdmi snd_hda_codec_realtek snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_hda_intel snd_seq_midi_emul radeon snd_hda_codec snd_emu10k1 snd_pcm_oss snd_ac97_codec snd_mixer_oss ac97_bus snd_util_mem snd_pcm snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi ttm snd_seq_midi_event usb_storage snd_seq 8250_pnp drm_kms_helper btusb snd_timer bluetooth snd_seq_device 8250 i2c_i801 usbhid snd hid emu10k1_gp soundcore snd_page_alloc gameport pcspkr serial_core asus_atk0110 drm evdev gspca_pac207 gspca_main videodev v4l1_compat v4l2_compat_ioctl32 dm_mod sg atl1e sr_mod cdrom firewire_ohci firewire_core via_rhine crc_itu_t mii uhci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
Pid: 10203, comm: stellarium Tainted: G         C  2.6.37-linux-2.6-latest-32 #10
Call Trace:
 [<ffffffff8103c8fb>] ? warn_slowpath_common+0x7b/0xc0
 [<ffffffff8103c9f5>] ? warn_slowpath_fmt+0x45/0x50
 [<ffffffffa04c1b1f>] ? radeon_fence_wait+0x38f/0x3f0 [radeon]
 [<ffffffff81057740>] ? autoremove_wake_function+0x0/0x30
 [<ffffffffa03200c9>] ? ttm_bo_wait+0x119/0x1d0 [ttm]
 [<ffffffffa04d8b8e>] ? radeon_gem_wait_idle_ioctl+0x8e/0x110 [radeon]
 [<ffffffffa00e0331>] ? drm_ioctl+0x3c1/0x440 [drm]
 [<ffffffffa04d8b00>] ? radeon_gem_wait_idle_ioctl+0x0/0x110 [radeon]
 [<ffffffff810dbef0>] ? cp_new_stat+0xe0/0x100
 [<ffffffff810e7a5f>] ? do_vfs_ioctl+0x9f/0x550
 [<ffffffff810e7f90>] ? sys_ioctl+0x80/0xa0
 [<ffffffff81002dbb>] ? system_call_fastpath+0x16/0x1b
---[ end trace ec1d5bad24ddd54b ]---
Comment 6 Bob Ham 2011-02-16 15:58:14 UTC
Created attachment 43456 [details]
X server log with r600g
Comment 7 Bob Ham 2011-02-16 15:58:39 UTC
Created attachment 43457 [details]
Output of dmesg with r600g
Comment 8 Bob Ham 2011-02-16 15:59:10 UTC
Created attachment 43458 [details]
X server configuration file with r600g
Comment 9 Bob Ham 2011-02-16 15:59:38 UTC
Created attachment 43459 [details]
Output of glxinfo with r600g
Comment 10 Alex Deucher 2011-02-21 10:52:52 UTC
What specific programs cause the lockups?
Comment 11 Bob Ham 2011-02-21 11:02:23 UTC
The above lockups are from Stellarium:

  http://www.stellarium.org/

However, Nexuiz and Doom 3 will also cause GPU lockups (and eventually system lockups.)
Comment 12 Bob Ham 2011-03-31 11:56:56 UTC
This also happens with 2.6.38.
Comment 13 Bob Ham 2011-03-31 12:16:30 UTC
Created attachment 45100 [details]
Kernel log of GPU lockup followed by (unrecoverable) X crash

This is with linux 2.6.38
Comment 14 eric 2011-04-11 06:00:55 UTC
This also happens with r300g + linux-2.6.38.2 + KMS.

I cannot always reproduce this GPU lockup, but working (scrolling, opening a new tab) with big or fullscreen windows seems to trigger it, though not right after a reboot, so it's difficult to reproduce. Hopefully someone finds a better way to reproduce this lockup.

When the lockup starts the display disappears (turns black) for 1 or 2 seconds, then one of the following happens:
1) display will return and everything will work again, but the next lockup will happen soon;
2) display will return but only mouse and keyboard are working, the display will be frozen except the mouse pointer can move around; I have to exit all running applications (alt+f4) blindly because the changes will not be seen, go to console (ctrl+alt+f1) and from there reboot the system.

Switching to console (ctrl+alt+f1) usually works and there the display is working again, but killing and restarting X will not work, the only way to get X back is to reboot.

During the lockup the kernel.log will be full of:
  kernel: [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
  kernel: [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(1).
These lines repeat indefinitely until I kill X.
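
To get a rough measure of how heavy that flood is, the two error messages quoted above can be counted per originating function. This is only a sketch: `count_ib_errors` is a hypothetical helper, and it assumes the log lines keep the `[drm:<function>] *ERROR*` shape shown here.

```python
import re
from collections import Counter

# Matches the two radeon error prefixes quoted in this comment.
IB_ERROR_RE = re.compile(r"\[drm:(radeon_cs_ioctl|radeon_ib_schedule)\] \*ERROR\*")


def count_ib_errors(lines):
    """Count IB scheduling errors per reporting function (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = IB_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !",
        "kernel: [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(1).",
    ]
    print(count_ib_errors(sample))
```

Passing an open log file to `count_ib_errors` works as well, since it only iterates over lines.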

Some lockups looking like mine were solved by turning off 'EnablePageFlip' in xorg.conf or setting 'agpmode' to -1, 1 or 4 in modprobe.conf or exporting 'LIBGL_ALWAYS_INDIRECT=1'... none of these solutions worked for me. Maybe turning off KMS works, but without KMS certain things won't work/are different, so I try to keep using KMS.
Comment 15 eric 2011-04-11 06:03:13 UTC
Created attachment 45475 [details]
kernel.log

kernel.log during GPU lockup
Comment 16 Bob Ham 2011-05-16 08:07:58 UTC
After testing I've discovered that there is no obvious culprit.  Reducing the number of GL features (texture compression, VBOs, GLSL, etc) used by nexuiz simply extends the amount of time it takes for the GPU to lock up.  A lockup will occur even with all optional GL features turned off, it just takes a long time.  Increasing the number of GL features in use just reduces the time it takes for the GPU to lock up.

Repeatedly running the simple mesa demos in a cycle will eventually cause the GPU to lock up.  Unfortunately, not always at the same point.

Disabling writeback using 'radeon.no_wb=1' on the kernel command line does not stop GPU lockups.  Temperature is not a factor; the GPU has locked up at 48 degrees while at other times running fine at over 58 degrees for long periods.  There are no such problems while using the proprietary fglrx driver.
Comment 17 Bob Ham 2011-05-16 08:18:28 UTC
The previous comment is while using the latest drm-fixes kernel tree and latest git for X, ddx, mesa (r600g), etc.
Comment 18 Bob Ham 2011-05-16 15:36:12 UTC
Neither reducing the VRAM size to 32MB nor increasing the GART size to 1GB makes a difference.
Comment 19 Bob Ham 2011-05-21 19:55:50 UTC
I recalled that I had played Warzone 2100 (a GL game) some time ago.  Looking at the time stamps on the save game files, I found which kernel I was using at the time (2.6.36).  From there, I've done a git bisect using Linus' tree.  This is the output:

rah@myrtle:/usr/src/linus$ git bisect good
6f34be50bd1bdd2ff3c955940e033a80d05f248a is the first bad commit
commit 6f34be50bd1bdd2ff3c955940e033a80d05f248a
Author: Alex Deucher <alexdeucher@gmail.com>
Date:   Sun Nov 21 10:59:01 2010 -0500

    drm/radeon/kms: add pageflip ioctl support (v3)
    
    This adds support for dri2 pageflipping.
    
    v2: precision updates from Mario Kleiner.
    v3: Multihead fixes from Mario Kleiner; missing crtc offset
        add note about update pending bit on pre-avivo chips
    
    Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
    Signed-off-by: Mario Kleiner <mario.kleiner@tuebingen.mpg.de>
    Signed-off-by: Dave Airlie <airlied@redhat.com>

:040000 040000 e5e9b7c6860a5ebba78346f3396791520a7842a4 2e9d672e24e579dc56c8f25b0ff883d88c09b7c4 M      drivers
Comment 20 Alex Deucher 2011-05-22 09:09:49 UTC
Does:
Option "EnablePageFlip" "False"
in the device section of your xorg.conf fix the issues?
Comment 21 Bob Ham 2011-05-22 10:01:32 UTC
(In reply to comment #20)
> Does:
> Option "EnablePageFlip" "False"
> in the device section of your xorg.conf fix the issues?

Yes and no.  I believe there are at least three distinct problems causing lockups.  The first is this immediate lockup caused by page flipping (it happens very quickly with GL programs).  The second is a lockup from some other cause that takes longer to show itself but inevitably does.  The last is a more serious lockup that is not detected by the kernel driver and causes a hard lock.  This last lockup is rarer still than the second.

The first, immediate lockup seems to be remedied by disabling page flipping.  The second (and I presume third) is still present, however.  It only takes 5-15 minutes of playing nexuiz for the second type of lockup to assert itself.  Compare this with 0-60 seconds for the first, page-flipping lockup.

So yes, disabling page flipping solves one lockup but there are still others that it doesn't solve.  I will try to do more testing using warzone 2100 instead of nexuiz to see if I can find the cause of the second lockup.
Comment 22 Bob Ham 2011-05-27 09:47:15 UTC
I've done a bisect on GPU lockups with warzone 2100.  During this bisect there were some kernels that displayed bad GL rendering (lots of artifacts, flickering textures, etc.), some of which locked up the GPU and some of which didn't.  I've treated these kernels as 'good' if they do not lock up because bad rendering is not a problem with recent kernels so it's likely to have been a different issue.

The last commit identified by the bisect caused the kernel to behave in a strange way, with various bits timing out (particularly on file system access) as soon as X was loaded, prior to any GL activity.  The system is unusable and the time outs continue until the machine locks hard.  I'll attach a log of the kernel's output.  I'm not sure if this is the bug I'm chasing.  Marking the commit as "skip" during the bisect unfortunately produces the following report:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
b7ae5056c94a8191c1fd0b5697707377516c0c5d
5480f727dc4c049eb46b191bfaeb034067aa6835
We cannot bisect more!


The possibilities are as follows:

Merge branch 'drm-fixes' of /home/airlied/kernel/linux-2.6 into drm-core-next
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b7ae5056c94a8191c1fd0b5697707377516c0c5d

Revert "drm/radeon/kms: remove some pll algo flags"
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5480f727dc4c049eb46b191bfaeb034067aa6835
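
The good/bad/skip classification described in this comment maps onto the exit codes that `git bisect run` understands: 0 marks a commit good, 125 marks it as skipped, and any other code from 1 to 127 marks it bad. A minimal Python sketch of that decision follows. The function name `bisect_exit_code` and the `usable` flag are hypothetical; deciding whether a kernel is usable at all (as opposed to merely rendering badly) is left to whoever runs the test.

```python
GOOD = 0    # git bisect run: commit is good
BAD = 1     # any code 1..127 except 125 means bad
SKIP = 125  # commit cannot be tested


def bisect_exit_code(kernel_log, usable=True):
    """Map one test run to a git-bisect-run exit code (hypothetical helper).

    kernel_log: text of the kernel log captured during the test run.
    usable: False if the kernel was too broken to test (for example the
            time-out behaviour described above), which maps to a skip.
    """
    if not usable:
        return SKIP
    # Bad rendering without a lockup still counts as good, as in the comment.
    if "GPU lockup" in kernel_log:
        return BAD
    return GOOD


if __name__ == "__main__":
    print(bisect_exit_code("GPU lockup (waiting for 0x000CAA2E last fence id 0x000CAA28)"))
```

A wrapper script would call `sys.exit(bisect_exit_code(...))` after building, booting and exercising each kernel, which on real hardware still needs manual steps.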
Comment 23 Bob Ham 2011-05-27 09:50:18 UTC
Created attachment 47235 [details]
Kernel log with strange time out behaviour following X load

Above I've stated that the machine locks hard.  From this log it shows the Magic SysRq key is still working so evidently it isn't a hard lock.
Comment 24 Bob Ham 2011-05-27 16:01:34 UTC
In the previously described bisect, I failed to mention that the tree had been restricted to drivers/gpu/drm/radeon.  Using the log of that bisect as a starting point for a bisect of the whole tree, I arrived at the following, similar report:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
b7ae5056c94a8191c1fd0b5697707377516c0c5d
c9220b0f7cbd1d2272426aa81a72ae2f6582bb71
We cannot bisect more!


The first is the same skipped (weird timing out kernel) commit as the restricted bisect but the second possibility is different and from TTM:

Merge branch 'drm-fixes' of /home/airlied/kernel/linux-2.6 into drm-core-next
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b7ae5056c94a8191c1fd0b5697707377516c0c5d

drm/ttm: add unlocked variant of new manager put node.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c9220b0f7cbd1d2272426aa81a72ae2f6582bb71
Comment 25 Andreas Boll 2012-11-02 14:41:29 UTC
Is this still an issue with a newer driver/kernel?
Reassigning to drm/radeon.
Comment 26 Bob Ham 2012-11-02 14:47:05 UTC
(In reply to comment #25)
> Is this still an issue with a newer driver/kernel?

I'll investigate.
Comment 27 Martin Peres 2019-11-19 08:18:08 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/177.
