Bug 30349 - GA-MA78G-DS3H (F9d) & Radeon 4730 system suddenly, but not quite randomly, softresets
Status: RESOLVED NOTOURBUG
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/Radeon
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: xf86-video-ati maintainers
QA Contact: Xorg Project Team
Reported: 2010-09-23 09:21 UTC by Sergey Kondakov
Modified: 2011-03-08 09:50 UTC

Attachments
Xorg.0.log (46.09 KB, text/plain), 2010-09-23 09:25 UTC, Sergey Kondakov
dmesg (68.28 KB, patch), 2010-09-23 09:26 UTC, Sergey Kondakov
mplayer output for one of those videos (247.68 KB, patch), 2010-09-23 09:42 UTC, Sergey Kondakov
mplayer output for one of those videos (8.61 KB, application/octet-stream), 2010-09-23 09:47 UTC, Sergey Kondakov
smplayer output for one of those videos (60.02 KB, application/octet-stream), 2010-09-23 09:47 UTC, Sergey Kondakov
vlcrc (73.19 KB, text/plain), 2010-12-10 21:02 UTC, Sergey Kondakov

Description Sergey Kondakov 2010-09-23 09:21:06 UTC
This is the strangest glitch I have ever seen: my system completely softresets (it does not panic) if several conditions are met:
1) KMS (GEM, DRI2, r600c) is enabled
2) any of the following:
a) I play "Castlevania: Symphony of the Night" (US) [or "Devil's Castle Dracula X: Nocturne in the Moonlight" (JAP), either of the two versions] on the pcsx-r 1.9.92 emulator.
No other game I have played on that emulator triggers this, nor does playing on another emulator, as far as I have noticed. It is usually triggered after several hours of continuous play and does not depend on gameplay progress, location, etc.
b) I play any of several particular mp4 videos from the Russian game magazine "Igromania" in VLC (and VLC only).
In that case it is triggered within 5-10 seconds of playback, but not at any particular frame or second of the video. Playing it in mplayer or kaffeine does not trigger this.
Interestingly, if mplayer is launched via smplayer to play such a video file, smplayer thinks the video stream is 0x0 in size and hides the video output.
Even if the video is paused right after it starts playing in VLC, the system softresets after approximately the same amount of time.

Really weird but painful: filesystems get damaged because of this.
Otherwise, my system has proved that it can work without issues for weeks in a row, with no crashes or even X freezes.
I have experienced this with the 2.6.{33-35} and 2.6.36-rc2 kernels, both vanilla and "zen".
I don't know about any other kernels, since I was not playing those games/videos back then and bought this card only recently.

Card info via lspci: ATI Technologies Inc RV770 CE [Radeon HD 4710]
For some reason this is wrong (the card is a Radeon 4730).
`cat /proc/sys/kernel/panic` returns '0'
libdrm, xf86-video-ati and mesa are used from git.

Please tell me what info I can provide to make this clearer.
Comment 1 Sergey Kondakov 2010-09-23 09:25:09 UTC
Created attachment 38909
Xorg.0.log
Comment 2 Sergey Kondakov 2010-09-23 09:26:47 UTC
Created attachment 38910
dmesg
Comment 3 Sergey Kondakov 2010-09-23 09:42:20 UTC
Created attachment 38911
mplayer output for one of those videos

via terminal and with default settings
Comment 4 Sergey Kondakov 2010-09-23 09:47:04 UTC
Created attachment 38912
mplayer output for one of those videos

via smplayer
Comment 5 Sergey Kondakov 2010-09-23 09:47:30 UTC
Created attachment 38913
smplayer output for one of those videos
Comment 6 Sergey Kondakov 2010-10-31 03:21:05 UTC
Still happens with 2.6.36 and r600g.
It also looks like it is not triggered if the player window is "minimized" off screen, only if it is actually rendered, even when the video is paused.
Comment 7 Sergey Kondakov 2010-12-10 21:00:50 UTC
Ah, it looks like this happens with pretty much any mp4 video with h264 inside, opened by any VLC version on my system.
I can't tell whether it is just any such video or whether something else matters, or whether it is triggered with the default VLC configuration, since the results of such tests would be devastating and I am avoiding VLC at all costs right now.
Comment 8 Sergey Kondakov 2010-12-10 21:02:45 UTC
Created attachment 41001
vlcrc

my vlc config
Comment 9 Jouko Orava 2010-12-11 23:13:22 UTC
I have the same MB with similar symptoms, but older F2 BIOS.

Disabling virtualization in the BIOS helped but did not eliminate the problem for me. However, with that and a kernel with MCE support disabled (i.e. CONFIG_X86_MCE not set), I've not seen a single unexpected reboot nor shutdown.

Could you check if the problem disappears with virtualization support disabled in the BIOS and/or running a kernel with CONFIG_X86_MCE unset?
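
One way to answer the CONFIG_X86_MCE question above is to inspect the kernel configuration directly. The following is a minimal Python sketch and not part of the original report: the function names `option_state` and `read_kernel_config` are hypothetical, and it assumes the configuration is available either as /proc/config.gz (present only when the kernel was built with CONFIG_IKCONFIG_PROC) or as a plain-text .config file.

```python
import gzip
import re
from pathlib import Path


def option_state(config_text: str, option: str) -> str:
    """Return the value of a kernel config option ('y', 'm', ...) or 'unset'.

    Disabled options appear as '# CONFIG_FOO is not set' comments, so only an
    exact 'CONFIG_FOO=value' line counts as set.
    """
    pattern = re.compile(r"^" + re.escape(option) + r"=(.*)$", re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1).strip() if match else "unset"


def read_kernel_config(path: str = "/proc/config.gz") -> str:
    """Read a kernel config file, transparently handling gzip compression.

    The default path is an assumption; point it at the kernel source tree's
    .config if /proc/config.gz does not exist on the system.
    """
    config_path = Path(path)
    if config_path.suffix == ".gz":
        with gzip.open(config_path, "rt") as handle:
            return handle.read()
    return config_path.read_text()
```

For example, `option_state(read_kernel_config(), "CONFIG_X86_MCE")` would return `unset` on a kernel built without MCE support.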
Comment 10 Sergey Kondakov 2010-12-12 10:33:33 UTC
With virtualization disabled in the BIOS and the same kernel, the reset was triggered a little later than usual (about 10-20 seconds after starting VLC, right when I tried to make it fullscreen, as opposed to the usual 1-5 seconds even while windowed).

With virtualization enabled and a kernel compiled without MCE support, I have been unable to trigger the issue so far. Thank you.
It is, however, troubling that MCE is designed to tell the kernel about, well, exceptions, and this way it cannot. But it does not look like the kernel was resetting the machine either (normally, the kernel should panic). And so far it has only happened for me with KMS enabled.

There is nothing out of the ordinary in dmesg, nor any sign of other issues.
Comment 11 Jouko Orava 2010-12-12 14:18:11 UTC
This is actually very good. It indicates the problem stems from interactions between the radeon driver and the MCE support. Perhaps something funky in the thermal or voltage sensors?

As soon as I have the time, I'll investigate further.
Comment 12 Jouko Orava 2010-12-12 16:38:25 UTC
No, got another reboot in fullscreen mplayer with a 2.6.36.2 kernel sans MCE. Updating to F9E BIOS and investigating further.
Comment 13 Jouko Orava 2010-12-15 16:44:17 UTC
And apparently blew out the capacitors on my motherboard or PSU. No post-mortem yet, but the smell is familiar. ;)

Sergey, I recommend taking appropriate backup measures. It is possible this problem indicates an actual hardware failure, and is not caused by software. (I suspect voltage fluctuations caused by aging/faulty capacitors; a temperature sensor I'd attached at the base of the northbridge heatsink indicated ~ 55'C, a normal temperature for my setup, at the time of the breakdown.)
Comment 14 Sergey Kondakov 2010-12-15 21:16:38 UTC
Oh, don't tell me I'll have to replace a device on Christmas Eve for the fourth time in a row :\ I installed an APC UPS ES 525 with a "surge protection" function at almost the same time I built this setup (the PSU is brand new [FSP Group], not even a year old), and this one is supposed to have those "solid capacitors", the polymer stuff (my Radeon 4730 also has its power lines filled with those, and it's a newer, less power-hungry revision).
If this is not enough for the damn thing to live at least a few years without failure, then I don't know what is.

Sure, with shitty CPU cooling on a hot Athlon 6000+ X2 [Brisbane, 65 nm, 89 W TDP] (60*C idle, ~85*C at full load) and an anomalously hot (~40*C) summer, it has been in a rough situation, but it has never, ever failed under load. This is Gentoo, after all, and a gaming station with an HD LCD.
It only failed under the specific conditions I described, even at near-zero load (what is a puny non-HD video file or a ~15-year-old mixed 2D/3D game to such a machine?)

No, I refuse to admit hardware failure unless the damn thing goes down in flames, or at least starts to BlueScreen(tm) in Windows(r) at 100% CPU & GPU load.
There is something fishy going on. If I get any resets with MCE disabled, I will report.
Comment 15 Jouko Orava 2010-12-17 18:07:26 UTC
> oh, don't tell me i'll have to replace a device at a christmas eve in forth
> time in a row :\

Be prepared. I wish I had prepared better. I think I'm out of warranty, too.

Post-mortem: One of the K3919 M81 voltage regulators -- second one in the outer row, counting in from the edge -- *blew*. It's about 30% expanded, with high-temperature damage also in the motherboard around the regulator legs.

My rig used a 65W TDP Athlon64 X2 4800+ with a huge Noctua NH-U12 heatsink, and three 120mm Noctua fans wired to a front panel controller with four thermal probes, in an Antec Sonata 500 case. The northbridge heatsink did get up to 58'C, but none of the other components *ever* exceeded 36'C. I also monitored my three HDD's temperatures using hddtemp; it was consistently about 1'C higher than the thermal probe mounted on the outside (as one might expect). 

Therefore, I'm confident this was not *thermal* damage, but a faulty component.

This also ties in completely with the symptoms. The motherboard does have hardware protection, shutting down/restarting in case of a voltage problem.

You should see if the motherboard voltages fluctuate when starting mplayer. (lm_sensors and some hardware monitor panel application should do.)
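
A rough way to watch for the voltage fluctuations suggested above is to poll the hwmon sysfs files that the lm_sensors kernel drivers expose. The following is a minimal Python sketch and not part of the original report: the function names are hypothetical, and it assumes a sensor driver is loaded and exposes its `in*_input` files (values in millivolts) directly under /sys/class/hwmon/hwmonN. On some kernels these files may live under hwmonN/device instead, in which case the glob pattern would need adjusting.

```python
import glob
import os
import time


def read_voltages(root: str = "/sys/class/hwmon") -> dict:
    """Return {relative sensor path: volts} for every in*_input file under root."""
    readings = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "in*_input"))):
        try:
            with open(path) as handle:
                millivolts = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # Skip inputs that cannot be read or parsed.
        readings[os.path.relpath(path, root)] = millivolts / 1000.0
    return readings


def update_spread(spread: dict, readings: dict) -> dict:
    """Track the (lowest, highest) value seen so far for each input."""
    for label, volts in readings.items():
        low, high = spread.get(label, (volts, volts))
        spread[label] = (min(low, volts), max(high, volts))
    return spread


def monitor(samples: int, interval: float = 0.5,
            root: str = "/sys/class/hwmon") -> dict:
    """Poll the voltage inputs `samples` times and return the min/max per input."""
    spread = {}
    for _ in range(samples):
        update_spread(spread, read_voltages(root))
        time.sleep(interval)
    return spread
```

For example, calling `monitor(samples=120, interval=0.5)` while starting the player would return the lowest and highest value seen for each input. Polling like this is coarse and will likely miss short transients, so a flat result does not rule out a power problem.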
Comment 16 Sergey Kondakov 2010-12-17 19:05:18 UTC
I wasn't thorough, but all the fluctuations I noticed were within a ~0.05 range. I have also never had a reset while using mplayer, only VLC. The main difference between mplayer and VLC on my system is that mplayer is configured for "gl" output, while VLC appears to be using "xv". I was also thinking that maybe VLC is doing some funky stuff with its memory, judging by how many memory-management options it has.

I must note that I have underclocked and undervolted the CPU (a constant 2618 MHz instead of dynamic changes up to 3100 MHz; a constant 1.25 V instead of ranging over 1.25-1.35 V, or whatever its stock values are). I had to do it because:
1) my heatsink is bad and I'm a scrooge
2) compiling big stuff with 3-5 "jobs" while the CPU runs at full power can push the temperature up to ~95-100*C
3) stupid Windows(r) cooks my CPU right at its boot screen.

But:
1) my resets are triggered by very specific programs doing specific things
2) they are load-agnostic
3) if a hardware protection mechanism triggers the reset in the first place, then disabling MCE in the kernel should not have any effect, yet I haven't had a single reset since (I will test this further and try to post a VLC log from some "offending" video)
4) in both (all) of my cases, GPU-accelerated 2D graphics is involved
5) it doesn't smell bad yet and has been running 24/7 non-stop.
All of this makes me still believe and hope that player-triggered resets and resets triggered by hardware damage are two separate issues.

I mean, what could a video player do to the hardware that everything else on Gentoo and Windows(r) couldn't?
Comment 17 Jouko Orava 2010-12-19 10:05:31 UTC
As to mplayer, you can see if "-vo xv" triggers the problem if MCE is enabled in the kernel; this is how I had mplayer configured. You could also install the proprietary Catalyst driver, and see if you can reproduce the softresets. You might have to try a lot of different graphics workloads, but if you can reliably cause a softreset with Catalyst too, you can be pretty sure it's a hardware problem.

Here's what I think (but remember this is just my humble opinion):

The root cause is a hardware problem, specifically a problematic voltage regulator. In your case, the regulator is just borderline; disabling MCE causes the motherboard to ignore the voltage drop; however, it's not outside the working envelope for the northbridge chip, so the machine keeps working.

In my case, the voltage regulator failed in the span of a couple of weeks (from completely OK to cracked and melted). Initially, disabling MCE avoided the resets. Later, as the regulator degraded further, the voltage fluctuations most likely caused the hardware CPU voltage protection to kick in, causing a reset.

The RS780 chipset is a complex beast. It's quite possible that a specific workload -- not the heaviest one, but a specific type of workload -- stressed the right voltage regulator (by requiring more current than otherwise). It might be something as simple as a workload synchronized to the mains frequency, or something esoteric like a PLL programmed to a specific frequency. I suspect only a Gigabyte engineer can tell for sure.
Comment 18 Sergey Kondakov 2010-12-19 22:27:30 UTC
You may be right. I will watch my voltages (I haven't noticed any fluctuations beyond ~0.05 on some of them) and stress the machine some more (`mplayer -vo xv` on the killer videos didn't trigger a reset), but I will not go for Catalyst, since that thing has never, ever worked for me: even when I was able to get a picture from it, it glitched, crashed graphics apps, froze X, froze the system and whatnot.
I haven't had BSODs(tm) with the ATI driver on Windows(r), but it's just unreliable on Linux.
Comment 19 Sergey Kondakov 2011-01-12 11:11:38 UTC
Ah, it happened again while updating the system (emerge -1u @installed) and watching an offending video in VLC for about half an hour, with kernel 2.6.37 and the latest libdrm, mesa and xf86-video-ati from git.
A friend of mine from a repair shop suggested that the unusual overall system unresponsiveness (bad latency with everything despite all attempts to improve it, especially during I/O operations) is the result of insufficient power from the 450W PSU in my 6 HDD + DVD-RW system with all PCI and almost all USB slots used. Maybe this has something to do with it too. But again, it is very strange.
Comment 20 Jerome Glisse 2011-03-08 09:50:44 UTC
Sounds like hw/power issue

