Bug 32911

Summary: [RADEON:KMS:MEMCORRUPTION] random memory corruption if using more than 4GiB of RAM on Core i7/P55
Product: DRI
Reporter: Siarhei Siamashka <siarhei.siamashka>
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED INVALID
QA Contact:
Severity: normal    
Priority: highest    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Attachments:
Description     Flags
Xorg.0.log      none
dmesg-4GB.log   none

Description Siarhei Siamashka 2011-01-07 22:57:53 UTC
VGA compatible controller: ATI Technologies Inc RV710 [Radeon HD 4350]
Linux kernel 2.6.37, kms enabled, xf86-video-ati-6.13.2

The following happened during Flash video playback in the Firefox browser:

Jan  8 15:49:01 i7 kernel: general protection fault: 0000 [#1] PREEMPT SMP 
Jan  8 15:49:01 i7 kernel: last sysfs file: /sys/devices/virtual/sound/timer/uevent
Jan  8 15:49:01 i7 kernel: CPU 0 
Jan  8 15:49:01 i7 kernel: Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ipt_addrtype xt_DSCP xt_dscp xt_string xt_NFQUEUE xt_multiport xt_mark xt_hashlimit xt_conntrack xt_connmark nf_conntrack ip_tables x_tables cdc_ether cdc_subset processor r8169 usbnet mii joydev pata_acpi i2c_i801 pl2303 usbserial pcspkr thermal_sys rtc_cmos button libiscsi scsi_transport_iscsi tg3 libphy e1000 fuse nfs jfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log dm_mod scsi_wait_scan sl811_hcd usbhid ohci_hcd ssb uhci_hcd usb_storage ehci_hcd usbcore aic94xx libsas lpfc qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 DAC960 cciss 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc scsi_tgt mptspi mptscsih mptbase atp870u dc395x qla1280 imm parport dmx3191d sym53c8xx gdth advansys initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sg pdc_adma sata_inic162x sata_mv ata_piix ahci libahci sata_qstor sata_vsc sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw sata_sil24 sata_sil sata_promise pata_sl82c105 pata_cs5530 pata_cs5520 pata_via pata_jmicron pata_marvell pata_sis pata_netcell pata_sc1200 pata_pdc202xx_old pata_triflex pata_atiixp pata_opti pata_amd pata_ali pata_it8213 pata_pcmcia pcmcia pcmcia_core pata_ns87415 pata_ns87410 pata_serverworks pata_platform pata_artop pata_it821x pata_optidma pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366 pata_cmd64x pata_efar pata_rz1000 pata_sil680 pata_radisys pata_pdc2027x pata_mpiix libata
Jan  8 15:49:01 i7 kernel: 
Jan  8 15:49:01 i7 kernel: Pid: 16575, comm: X Not tainted 2.6.37-gentoo #1 P7P55D-E/System Product Name
Jan  8 15:49:01 i7 kernel: RIP: 0010:[<ffffffff810ad689>]  [<ffffffff810ad689>] kfree+0x71/0x19b
Jan  8 15:49:01 i7 kernel: RSP: 0018:ffff88021559bbe8  EFLAGS: 00010046
Jan  8 15:49:01 i7 kernel: RAX: 0000000000000000 RBX: 154488021f800500 RCX: ffff88021471d4d0
Jan  8 15:49:01 i7 kernel: RDX: ffff88021559bbd8 RSI: 0000000000000286 RDI: ffff88020f7dec40
Jan  8 15:49:01 i7 kernel: RBP: ffff88021559bc28 R08: 00000000000040bf R09: ffff88021559bc28
Jan  8 15:49:01 i7 kernel: R10: dead000000200200 R11: dead000000100100 R12: 0000000000000286
Jan  8 15:49:01 i7 kernel: R13: ffff88021648c9b0 R14: ffff88020f7dec40 R15: ffff88021648c000
Jan  8 15:49:01 i7 kernel: FS:  00007fe6f13b7880(0000) GS:ffff8800dfc00000(0000) knlGS:0000000000000000
Jan  8 15:49:01 i7 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  8 15:49:01 i7 kernel: CR2: 00007fe6f1327000 CR3: 0000000210e78000 CR4: 00000000000006f0
Jan  8 15:49:01 i7 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan  8 15:49:01 i7 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan  8 15:49:01 i7 kernel: Process X (pid: 16575, threadinfo ffff88021559a000, task ffff88020f592110)
Jan  8 15:49:01 i7 kernel: Stack:
Jan  8 15:49:01 i7 kernel: ffff88021559bc08 0000000000000001 ffff88021559bc18 ffff88020f7dec40
Jan  8 15:49:01 i7 kernel: ffff88020f7dec48 ffff88021648c9b0 ffff88021648c940 ffff88021648c000
Jan  8 15:49:01 i7 kernel: ffff88021559bc48 ffffffff813debe4 ffff88020f7dec48 ffffffff813deb7b
Jan  8 15:49:01 i7 kernel: Call Trace:
Jan  8 15:49:01 i7 kernel: [<ffffffff813debe4>] radeon_fence_destroy+0x69/0x6e
Jan  8 15:49:01 i7 kernel: [<ffffffff813deb7b>] ? radeon_fence_destroy+0x0/0x6e
Jan  8 15:49:01 i7 kernel: [<ffffffff813281ce>] kref_put+0x43/0x4d
Jan  8 15:49:01 i7 kernel: [<ffffffff813deb79>] radeon_fence_unref+0x23/0x25
Jan  8 15:49:01 i7 kernel: [<ffffffff813f1f8f>] radeon_ib_get+0x172/0x1ac
Jan  8 15:49:01 i7 kernel: [<ffffffff813f2ceb>] radeon_cs_ioctl+0x91/0x1a2
Jan  8 15:49:01 i7 kernel: [<ffffffff813a1efb>] drm_ioctl+0x26c/0x32e
Jan  8 15:49:01 i7 kernel: [<ffffffff8109e595>] ? mmap_region+0x3a1/0x4ab
Jan  8 15:49:01 i7 kernel: [<ffffffff813f2c5a>] ? radeon_cs_ioctl+0x0/0x1a2
Jan  8 15:49:01 i7 kernel: [<ffffffff810bfb77>] do_vfs_ioctl+0x43c/0x48b
Jan  8 15:49:01 i7 kernel: [<ffffffff810bfc08>] sys_ioctl+0x42/0x65
Jan  8 15:49:01 i7 kernel: [<ffffffff8100293b>] system_call_fastpath+0x16/0x1b
Jan  8 15:49:01 i7 kernel: Code: 40 10 48 8b 10 66 85 d2 79 04 48 8b 40 10 48 8b 10 84 d2 78 04 0f 0b eb fe 48 8b 58 28 e8 b0 40 28 00 83 3d 51 d3 86 00 01 89 c0 <4c> 8b 2c c3 0f 8e e8 00 00 00 4c 89 f7 4c 89 75 c8 e8 2d 68 f7 
Jan  8 15:49:01 i7 kernel: RIP  [<ffffffff810ad689>] kfree+0x71/0x19b
Jan  8 15:49:01 i7 kernel: RSP <ffff88021559bbe8>
Jan  8 15:49:01 i7 kernel: ---[ end trace ac2f01944e8e8f1a ]---
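
One detail in the register dump can be checked mechanically. The faulting instruction bytes at the <4c> marker (4c 8b 2c c3) decode to a load from [rbx + rax*8], and RBX holds 154488021f800500. On x86-64 with 48-bit virtual addressing, bits 63..47 of an address must all be equal, and dereferencing a non-canonical address raises a general protection fault. The sketch below only illustrates that rule; the helper name is_canonical is made up for this example, and 48-bit addressing is an assumption about this kernel build. The low 48 bits of RBX resemble the other ffff8802... pointers in the dump, which would be consistent with corrupted upper bits but does not prove it.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86-64 virtual address.

    Assumes va_bits-wide virtual addressing (48 here): bits 63..(va_bits-1)
    must be all zeros or all ones.
    """
    top = addr >> (va_bits - 1)               # bits 63..47 for va_bits=48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 set bits for va_bits=48
    return top == 0 or top == all_ones


# Values copied from the register dump above.
rbx = 0x154488021F800500  # base register of the faulting load
rdi = 0xFFFF88020F7DEC40  # pointer passed to kfree()

print(hex(rbx), "canonical:", is_canonical(rbx))
print(hex(rdi), "canonical:", is_canonical(rdi))
```

The check says nothing about where the bad value came from; it only explains why the access faulted immediately instead of reading garbage.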
Comment 1 Siarhei Siamashka 2011-01-08 20:59:34 UTC
OK, some more information. I got a few other similar mild failures (mild in the sense that the box is still accessible via ssh). They have different backtraces, but also seem to be memory allocation related. Right now my guess is that radeon KMS is causing some memory corruption in the kernel.

I tried switching to the SLUB allocator and enabling SLUB_DEBUG. Unfortunately, this change has not helped me catch any problems yet. On a somewhat positive side, the mild reliability problems have also disappeared (for example, it no longer seems to fail easily on browser Flash video playback).

Still, there is a testcase which is guaranteed to kill the system for me. It involves launching the gl-117 game (using llvmpipe for 3D) in one window so that it runs its demo, and starting scaled video playback in mplayer in another window, so that both windows are visible on screen at the same time. In less than 15 minutes, and typically a lot faster, the whole system deadlocks. The box is totally dead, I can't even connect to it with ssh, so I have neither backtraces nor clues about what could have happened.

Anyway, I'm going to keep KMS enabled for a while. Maybe some other, easier to debug testcases will be discovered and the driver bug(s?) can be found.
Comment 2 Siarhei Siamashka 2011-01-22 17:06:19 UTC
It appears that memory just gets randomly corrupted. When running the gl-117 game demo using mesa 7.9 llvmpipe, and also running the memtester program [1] at the same time so that the rest of the available memory gets tested, memtester is typically able to detect memory corruption before the system goes down in a spectacular way.
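
For readers unfamiliar with this kind of check, the idea of a userspace pattern test can be sketched as follows. This is only an illustration of the fill-and-verify principle, not how memtester is implemented: the real tool locks its pages in RAM and runs many more patterns, while this sketch uses an ordinary buffer and two fixed patterns. The function names find_mismatches and pattern_test are made up for the example.

```python
def find_mismatches(buf, expected):
    """Return (offset, expected, actual) for every byte that differs."""
    if buf == bytes([expected]) * len(buf):  # fast path: nothing changed
        return []
    return [(i, expected, b) for i, b in enumerate(buf) if b != expected]


def pattern_test(size_bytes, passes=2):
    """Fill a buffer with alternating bit patterns and read them back."""
    buf = bytearray(size_bytes)
    errors = []
    for _ in range(passes):
        for pattern in (0xAA, 0x55):  # 10101010 and 01010101
            buf[:] = bytes([pattern]) * size_bytes
            errors.extend(find_mismatches(buf, pattern))
    return errors


# Simulated corruption: flip a single bit in an otherwise clean buffer.
demo = bytearray(b"\xaa" * 16)
demo[5] ^= 0x04
print(find_mismatches(demo, 0xAA))

# On healthy hardware a real run is expected to report no errors.
print(len(pattern_test(1 << 20)), "mismatches in 1 MiB")
```

A test like this only covers the memory it manages to allocate, which is why it was run alongside the graphics workload rather than instead of it.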

The system used to have 4 memory sticks installed, 2GiB each. Memory corruption disappears if only one 2GiB stick is left (tested for more than a week without problems) or with a 2GiB+2GiB configuration (just started using this, appears to be stable so far). Installing 6GiB or 8GiB of memory in various ways (trying different slot placements on the motherboard) makes the issue reproducible again.

It could be either a problem in the radeon kernel driver, or just defective hardware (motherboard?, PSU?, CPU?, memory?, graphics card?). However, memtest86+ does not detect any problems, and the system appears to be stable when used "headless", even under intensive CPU/RAM usage. Anyway, unless somebody else manages to reproduce the same problem, there is no definite answer to this question. There is nothing else to add here (other than the dmesg and Xorg logs), so this is probably my last comment unless I somehow manage to narrow down the bug and make a patch. Thanks. The bug can be closed if you want.

1. http://pyropus.ca/software/memtester/
Comment 3 Siarhei Siamashka 2011-01-22 17:07:20 UTC
Created attachment 42325 [details]
Xorg.0.log
Comment 4 Siarhei Siamashka 2011-01-22 17:07:57 UTC
Created attachment 42326 [details]
dmesg-4GB.log
Comment 5 Siarhei Siamashka 2011-03-19 02:56:03 UTC
After all, it looks like this is a hardware problem on my side. Using the 'memmap' option to simulate a 4GiB setup exactly does not help when all 8GiB are installed. Reducing the RAM speed from DDR3-1333 to DDR3-1066 or DDR3-800 makes the problem significantly harder to reproduce, but does not eliminate it completely. And judging by various hardware forums, unreliability with all four DIMM modules installed seems to be a rather common problem.

It appears that llvmpipe is just a very good memory checker: it puts a lot of stress on all CPU cores and the memory controller, exposing problems which don't happen in other use cases.
