Bug 81644 - Random crashes on RadeonSI with Chromium.
Summary: Random crashes on RadeonSI with Chromium.
Status: RESOLVED DUPLICATE of bug 85647
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: x86-64 (AMD64) All
Importance: high normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-22 15:50 UTC by Aaron B
Modified: 2015-02-09 12:00 UTC
CC List: 8 users

Attachments
DMesg (103.07 KB, text/plain)
2014-07-25 02:23 UTC, Aaron B
GLXInfo (58.07 KB, text/plain)
2014-07-25 02:24 UTC, Aaron B
XOrg log, with crash+recover. (77.96 KB, text/plain)
2014-07-25 02:25 UTC, Aaron B
Hardware report. (22.49 KB, text/plain)
2014-07-25 02:27 UTC, Aaron B
Xorg crash, no recover. (61.35 KB, text/plain)
2014-07-25 16:09 UTC, Aaron B
Crash+recover (59.32 KB, text/plain)
2014-07-26 16:04 UTC, Aaron B
DMesg of crash+recover. (94.58 KB, text/plain)
2014-07-26 16:07 UTC, Aaron B
Two crashes+recovers. Xorg. (78.79 KB, text/plain)
2014-07-26 19:52 UTC, Aaron B
Two crashes+recovers. Dmesg. (131.53 KB, text/plain)
2014-07-26 19:53 UTC, Aaron B
DMesg of multiple flash crashes. (100.74 KB, text/plain)
2014-08-01 05:06 UTC, Aaron B
X turning black & crash (30 minutes of HTML5 video on youtube, movie trailers), no successful recovery, recovery with Magic SYSRQ + k (386.82 KB, text/plain)
2014-08-02 01:36 UTC, jackdachef
full dmesg output with the 2 crashes included (Magic SYSRQ + k), reiserfs is /boot (storing the info there) (396.68 KB, text/plain)
2014-08-02 01:46 UTC, jackdachef
Xorg.0.log, after several crashes X can't be opened/used anymore, device seems unavailable (20.76 KB, text/plain)
2014-08-02 02:21 UTC, jackdachef
full dmesg output with all the crashes included, X (via xdm/slim or startx) can't be launched anymore (409.66 KB, text/plain)
2014-08-02 02:25 UTC, jackdachef
dmesg: X crashing while playing HTML5 video movie trailers on youtube, LLVM disabled, sb backend (119.34 KB, text/plain)
2014-08-03 03:30 UTC, jackdachef
Steam crash because of Chromium crash. (41.28 KB, text/plain)
2014-08-03 05:50 UTC, Aaron B
dmesg-output after 25 minute hardware-accelerated html5 video crash, no hardlock this time (Magic SYSRQ works), screen corruption (screen subdivided horizontally into ~18 parts) (173.62 KB, text/plain)
2014-08-05 15:47 UTC, jackdachef
dmesg (71.88 KB, text/plain)
2014-08-11 11:02 UTC, arnej
API trace of a crash. (1.82 KB, text/plain)
2014-09-01 18:33 UTC, Aaron B
chromiumapitrace (1.74 KB, text/plain)
2014-09-02 02:58 UTC, Aaron B
DPM Crash log/watch script output. (82.62 KB, text/plain)
2014-09-27 07:05 UTC, Aaron B
Dmesg log (79.38 KB, text/plain)
2014-09-27 09:30 UTC, José Suárez
X.org log (52.73 KB, text/plain)
2014-09-27 09:35 UTC, José Suárez
Xorg log when it froze (115.80 KB, text/plain)
2014-09-30 05:46 UTC, Alexandre Demers

Description Aaron B 2014-07-22 15:50:52 UTC
Chromium randomly crashes with RadeonSI driver when using Chromium. Most usually with Youtube videos. Everything is current git from Oibaf PPA using Mint 17 Cinnamon, although I'm about to move to Arch Linux so I don't know how long I'll be able to provide logs as the problem doesn't exist in Arch's currents AFAIK. But there are logs in Bug #77980 from my crash, which was close sounding, but apparently a different bug.
Comment 1 Michel Dänzer 2014-07-23 03:31:55 UTC
(In reply to comment #1)
> Everything is current git from Oibaf PPA using Mint 17 Cinnamon, although I'm
> about to move to Arch Linux so I don't know how long I'll be able to provide
> logs as

Better attach them here ASAP then. :) Xorg.0.log, dmesg and glxinfo.

> the problem doesn't exist in Arch's currents AFAIK.

What version of Mesa does that use?
Comment 2 Aaron B 2014-07-25 02:23:00 UTC
Created attachment 103414 [details]
DMesg

I have moved to Arch. The bug just happened and is in these logs. I'm currently on Mesa-git. And the claim that the bug isn't present isn't solid: my friend told me it didn't happen, and he has the same tower, but he didn't stay on it long enough for it to happen, so I'm going to take that back for now. I don't remember this bug in Mesa from around January, but then my card was unusable in January. So I'd pin it to somewhere around 2-5 months old tbh. But yes, Chromium has this problem on Arch, and I switched not only OSes but DEs, so it's not a Cinnamon problem.
Comment 3 Aaron B 2014-07-25 02:24:04 UTC
Created attachment 103415 [details]
GLXInfo
Comment 4 Aaron B 2014-07-25 02:25:37 UTC
Created attachment 103416 [details]
XOrg log, with crash+recover.
Comment 5 Aaron B 2014-07-25 02:27:37 UTC
Created attachment 103417 [details]
Hardware report.

Important hardware:

R9 270X - 2GB GDDR5 - ASUS.
FX-8350 - AMD.
M5A99FX PRO 2.0 - ASUS.
Comment 6 Aaron B 2014-07-25 16:09:01 UTC
Created attachment 103455 [details]
Xorg crash, no recover.
Comment 7 Aaron B 2014-07-26 16:04:15 UTC
Created attachment 103504 [details]
Crash+recover
Comment 8 Aaron B 2014-07-26 16:07:00 UTC
Created attachment 103505 [details]
DMesg of crash+recover.
Comment 9 Aaron B 2014-07-26 19:52:54 UTC
Created attachment 103522 [details]
Two crashes+recovers. Xorg.
Comment 10 Aaron B 2014-07-26 19:53:34 UTC
Created attachment 103523 [details]
Two crashes+recovers. Dmesg.
Comment 11 Aaron B 2014-07-27 01:28:16 UTC
(In reply to comment #1)
> (In reply to comment #1)
> > Everything is current git from Oibaf PPA using Mint 17 Cinnamon, although I'm
> > about to move to Arch Linux so I don't know how long I'll be able to provide
> > logs as
> 
> Better attach them here ASAP then. :) Xorg.0.log, dmesg and glxinfo.
> 
> > the problem doesn't exist in Arch's currents AFAIK.
> 
> What version of Mesa does that use?

I'm guessing as of this patch this can be closed? http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2062afb4f804afef61cbe62a30cac9a46e58e067
Comment 12 Michel Dänzer 2014-07-28 02:36:47 UTC
(In reply to comment #11)
> I'm guessing as of this patch this can be closed?
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/
> ?id=2062afb4f804afef61cbe62a30cac9a46e58e067

No, that's unrelated.

BTW, the backtraces in Xorg log files aren't useful for diagnosing this kind of problem, so there's no point in attaching more Xorg log files just for that.
Comment 13 Michel Dänzer 2014-07-28 09:36:53 UTC
(In reply to comment #6)
> Chromium randomly crashes with RadeonSI driver when using Chromium. Most
> usually with Youtube videos.

Using Flash or HTML5 video? Fullscreen or windowed? ...
Comment 14 Aaron B 2014-07-28 16:08:51 UTC
(In reply to comment #13)
> (In reply to comment #6)
> > Chromium randomly crashes with RadeonSI driver when using Chromium. Most
> > usually with Youtube videos.
> 
> Using Flash or HTML5 video? Fullscreen or windowed? ...

I use HTML5 video. But it's a Chromium issue in general, flash video just helps it happen faster. It also happens a lot when switching tabs, clicking on content that adds a new element on the page over top of everything else, or loading more objects. Good examples are opening the comments section on Yahoo, and the mousing over of names on facebook. Think it'd be useful to try to attach a gdb session to Chromium? In the dmesg log, every time the problem happens, Chromium does receive a segfault.
Comment 15 Michel Dänzer 2014-07-29 03:36:42 UTC
(In reply to comment #14)
> I use HTML5 video. But it's a Chromium issue in general, flash video just
> helps it happen faster.

Is it HTML5 or Flash now? :)


> Think it'd be useful to try to attach a gdb session to Chromium? In the dmesg
> log, every time the problem happens, Chromium does receive a segfault.

Yes, backtraces of those crashes might be interesting.
Comment 16 Aaron B 2014-07-29 03:47:50 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > I use HTML5 video. But it's a Chromium issue in general, flash video just
> > helps it happen faster.
> 
> Is it HTML5 or Flash now? :)
> 
> 
> > Think it'd be useful to try to attach a gdb session to Chromium? In the dmesg
> > log, every time the problem happens, Chromium does receive a segfault.
> 
> Yes, backtraces of those crashes might be interesting.

Whoops, I meant the video playing in HTML5 made the glitch happen worse. But it's video in general, Flash video on other sites does crash it too. And okay, if I can get it working I'll hopefully have a good log to show sooner or later.
Comment 17 Aaron B 2014-07-29 06:35:51 UTC
I just got a crash while trying to get some debugging output...but all Chromium would output, and it was just through the terminal, was "GPU process stalled after 10000ms." and that was basically all the information I got from it. I'll try again tomorrow, maybe try valgrind or some different CL arguments this time around. We'll see. Now it's time for sleep, though.
Comment 18 jackdachef 2014-07-31 14:52:07 UTC
so I got a different bug or could this be the same ?

running on drm-next-3.17-rebased-on-fixes

only doing some surfing via current opera-developer (https://bugs.gentoo.org/show_bug.cgi?id=514696)

flash plugin was not working (tested via surfing over to speedtest.net)

switched a few times between KDE4, fluxbox, xfce4 and razor-qt

window managers were compiz-fusion (0.8.8/0.8.6), kwin with opengl compositing (opengl 1.2, 2.0, 3.1 testing)


programs used:

- gnote, tomboy
- opera-developer
- firefox (flash rendered unusable by uninstalling vdpau, no data via env VDPAU_TRACE=1 firefox), noscript, adblock
(these were all if I remember correctly)


first hardlock (no Magic SYSRQ Key recovery possible) was during playing around and switching between Themes in Opera Developer (version 24)


second hardlock was during simply reading a few notes in gnote and surfing via firefox (version 31)
Comment 19 Aaron B 2014-07-31 16:58:54 UTC
It sounds like the same bug, but I'm not 1000% sure. But I had a few crashes in Chromium, the only out of place log message is:

ATTENTION: default value of option force_s3tc_enable overridden by environment.
[8923:8923:0731/125646:ERROR:sandbox_linux.cc(302)] InitializeSandbox() called with multiple threads in process gpu-process

Not sure what to make of it, but it's something more.
Comment 20 Aaron B 2014-07-31 17:06:35 UTC
[7925:7960:0731/125644:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
[7894:7894:0731/125644:ERROR:gpu_process_transport_factory.cc(347)] Lost UI shared context.

Here also is the message about the process hanging, it was in the log too, missed it at first.
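For anyone collecting these messages across several crashes, the following is a minimal Python sketch that splits a Chromium log line into its parts. It assumes the bracketed prefix follows the layout seen in the two comments above (process ID, thread ID, month/day, time, level, source file and line number); the field names and the function name `parse_chromium_log_line` are labels chosen for this sketch, not anything defined by Chromium.

```python
import re

# Prefix layout assumed from the log lines quoted above, e.g.
# [7925:7960:0731/125644:ERROR:gpu_watchdog_thread.cc(253)] message
LOG_LINE = re.compile(
    r"^\[(?P<pid>\d+):(?P<tid>\d+):(?P<date>\d{4})/(?P<time>\d{6})"
    r":(?P<level>[A-Z]+):(?P<source>[^()\]]+)\((?P<line>\d+)\)\]\s*(?P<message>.*)$"
)


def parse_chromium_log_line(line):
    """Return a dict of the prefix fields plus the message.

    Returns None when the line does not start with the bracketed prefix,
    as is the case for the Mesa "ATTENTION: ..." line.
    """
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None
```

Filtering the parsed lines on `level == "ERROR"` and grouping by `source` would make it easier to see whether the GPU watchdog message always precedes the "Lost UI shared context" one.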
Comment 21 Aaron B 2014-07-31 19:11:13 UTC
I'm getting the glitch a lot more, but trying to remember what sets it off most:

1. Youtube. Often, but not as often as #2.

2. Clicking on an element that brings up another element onto the page, basically over it with a higher z-index, like I said facebook does this a LOT.

3. Adding elements to the page at all can also trigger this bug. Like loading ANY comments, from disqus to yahoo, they all have triggered it.

4. Slideshows where the objects slide against each other, like in PCWorld slideshows, just had a crash there.

What all of these have in common is moving pixels, but also adding pixels to the screen. Is there special code for that, with those dirty rectangles being glitchy or something? I have, well, only bare minimal knowledge of how these drivers work from looking at lots of Git commits, but I bet that is a pretty good clue to people who know what they're looking at with what could be wrong. :)
Comment 22 Aaron B 2014-07-31 21:49:39 UTC
Just had a horrible crash that I dropped the ball on. It killed the server, and the outputs on my card couldn't even be reached. I restarted LightDM only to have it fail, and I lost the log of the crash. Steam was also failing to start any games; even starting Steam took a few tries. I'm trying to reproduce. The only error I got out of it before I screwed up the log files was "Active IB in BO." or something of that nature, and it was printed in the tty2 terminal when LightDM died. That's all I could scrounge out of it, but I'm trying to reproduce it now; that error could provide all the information needed if I could have gotten the logs from it.
Comment 23 jackdachef 2014-08-01 01:44:34 UTC
@Aaron:

try reading into Magic SYSRQ key: http://en.wikipedia.org/wiki/Magic_SysRq_key



next time when X seems to freeze - first do a Magic SYSRQ + R, then try to imitate a "secure attention key" 
== Magic SYSRQ + k

whether that unlocks it


or if X doesn't show any change anymore but the system still seems to be alive:

Magic SYSRQ Key + r

then Magic SYSRQ Key + w
(shows which tasks, kernel threads, etc. are stuck)

then the rest of "REISUB" (Magic SYSRQ Key + e, i, s, u, b)


after the next start you can look through the contents of the kernel messages in

/var/log/kern.log

via


tail -n 3500 /var/log/kern.log | less

(keyboard cursor up & down, quit via "q")


if the messages are too much 

try via


tail -n 7500 /var/log/kern.log | less

or higher


sudo may be needed, if not using root privileges

logging may be disabled, so it first might have to be enabled 


hope this helps ...
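As a complement to paging through the whole log with tail and less, here is a small Python sketch that keeps only the lines mentioning the radeon driver. It assumes the kernel log has already been read into a string (for example from /var/log/kern.log, which may need root); the function name `radeon_lines` and the `limit` parameter are made up for this sketch.

```python
def radeon_lines(log_text, limit=None):
    """Return the kernel log lines that mention the radeon driver.

    log_text: the full log as one string.
    limit: if given, keep only the last `limit` matching lines,
           similar in spirit to `tail -n`.
    """
    matches = [line for line in log_text.splitlines() if "radeon" in line]
    if limit is None:
        return matches
    # A slice of [-0:] would return everything, so handle 0 explicitly.
    return matches[-limit:] if limit > 0 else []
```

Calling it with the text of the log and `limit=50` would return the 50 most recent radeon messages, which is usually a more manageable thing to attach or quote than several thousand raw lines.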
Comment 24 Aaron B 2014-08-01 05:06:00 UTC
Created attachment 103786 [details]
DMesg of multiple flash crashes.

I don't think it'll help much, it's more of the same from a dmesg, but this crash was like the other I experienced, pooped me out to a command prompt after some time. It stalled on some new rings other than 0, which is "nice" I guess.

But this was caused by Chromium, just refreshing a page or switching tabs, one of those. When it happened, Counter Strike Source was murdered, so it affected it too in some way. But my Steam log shows nothing, even with the output directed to it. I guess it shut down too quickly to flush anything to it. I don't really know what else to do. I've been reading up on the posts from Agd5f (Drazner?), and I mean, I'm a programmer and do lots of assembly, and it makes sense. I just don't know where to start debugging such a low-level kernel problem. I believe it's too far out of my realm to try anything else besides posting logs. :/
Comment 25 jackdachef 2014-08-02 01:36:34 UTC
Created attachment 103839 [details]
X turning black & crash (30 minutes of HTML5 video on youtube, movie trailers), no successful recovery, recovery with Magic SYSRQ + k

not sure if to post it here or bug #79980

watched movie trailers with chromium (adobe flash player explicitly turned off), after approx. 30 minutes on another trailer (mostly 1080p, 720p) the screen stalled and turned black,

seemingly couldn't recover on its own, sound kept on continuing, so figured that Magic SYSRQ + k could be successful and it was


originally wanted to post the bug info from within chromium but during attempt to surf over to bugs.freedesktop.org the screen turned black again and successfully recovered with a little help of Magic SYSRQ + k (so the trend seems to be that it's getting better but we're not quite there yet) ...

will post full dmesg with info *after* 2nd crash here, too

more looks like it's related to #79980 than being a separate bug (currently posting this from within firefox 31)
Comment 26 jackdachef 2014-08-02 01:46:41 UTC
Created attachment 103841 [details]
full dmesg output with the 2 crashes included (Magic SYSRQ + k), reiserfs is /boot (storing the info there)

summary: 1st crash happened within fluxbox + chromium (html5 video) on youtube, movie trailers



right after Magic SYSRQ + k, xdm (slim) came up again

erroneously got into Xfce4 + compiz-fusion,



then exited and got into fluxbox + chromium again

here the 

2nd crash happened during attempt to open up/loading of bugs.freedesktop.org (via chromium)



now posting this information from fluxbox + firefox 31 (compiled on this system) still on the running system (no system reboot !) - so gpu state got more stable than previous drm-next-3.17-rebased-on-fixes


dmesg output seems to suggest that it's still related & an issue introduced with the recent GPUVM changes


hope this information is valuable and helps in tracking down where the problem lies
Comment 27 jackdachef 2014-08-02 02:21:55 UTC
Created attachment 103843 [details]
Xorg.0.log, after several crashes X can't be opened/used anymore, device seems unavailable

not sure if to post this here or in #79980, 

but when continuing usage after the gpu gets unstable with chromium-usage (so using chromium in browsing in complex sites; chromium with adobe flash player [youtube !]; chromium with HTML5 video [youtube] - seems to mostly accelerate & more easily trigger the common issue with #79980 ?)

the following end-result is visible in the attached Xorg.0.log, the gpu isn't accessible anymore and a system reboot is necessary


usage-pattern was following:

- firefox - browsing the web, viewing wallpapers, gmail, google plus
- gnote - writing notes
- switching between firefox and gnote

desktop environment used was Xfce4 + compiz-fusion; past experience has shown that it doesn't matter whether composited or not (fluxbox, xfce4+compiz-fusion, kde4+kwin, razor-qt + kwin); so no difference since opengl or webgl acceleration is used anyway


(opengl); webgl might be something further to investigate - 

the "issue" is, however that all of this works in 99%-100% with 3.14 kernel for me
Comment 28 jackdachef 2014-08-02 02:25:01 UTC
Created attachment 103844 [details]
full dmesg output with all the crashes included, X (via xdm/slim or startx) can't be launched anymore
Comment 29 jackdachef 2014-08-02 19:42:12 UTC
using the new firmware (http://people.freedesktop.org/~agd5f/radeon_ucode/ucode.tar.gz) didn't seem to make a change in stability:


starting in fluxbox
+ firefox
+ gnote


watching videos on youtube (adobe flash player) for several minutes, writing some notes in gnote, exiting; gmail, google+

starting in Xfce4+ compiz-fusion, exiting


starting in fluxbox
+ firefox
+ gnote


browsing, watching wallpapers, gmail, google+


text corruption, especially in gmail & google search results occurred with:

radeon 0000:01:00.0: Packet0 not allowed!


couldn't pinpoint what triggered it, several occurrences in dmesg



starting up chromium, browsing youtube (adobe flash explicitly disabled), wanted to watch a video (HTML5 video), screen doesn't react anymore, turns black

gpu/screen turns black, hardlock


will attach dmesg later, no suspicious content besides

radeon 0000:01:00.0: Packet0 not allowed!


any useful data in connection with chromium -> hardlock, all data lost
Comment 30 jackdachef 2014-08-02 21:38:24 UTC
@Aaron:

you tried running Chromium without hardware acceleration ? just booted up latest drm-next-3.17-rebased-on-fixes with newest firmware/ucode & running chromium



@Michel:

would it help to run chromium without hardware acceleration to get some meaningful information ?

I'm slowly running out of ideas on what else to try :/
Comment 31 jackdachef 2014-08-02 22:08:26 UTC
related:

https://bugs.freedesktop.org/show_bug.cgi?id=39469

https://bugzilla.kernel.org/show_bug.cgi?id=38792

?

so it's more of a userspace issue ?
Comment 32 jackdachef 2014-08-03 02:28:26 UTC
with hardware acceleration disabled no crashes, lockups, etc. within 4-4.5 hours,

now running chromium with

R600_LLVM=0 R600_DEBUG=sb,sbsafemath,nollvm chromium-browser

== LLVM is disabled


and watching HTML5 videos,



did a little search through bugs.freedesktop.org:

radeonsi isn't the only chipset family affected: intel cards hd2000, hd3000, hd4000; gm45, gm965, i915 (sandybridge, ivybridge, haswell, etc.), or other radeons are also affected in different ways
Comment 33 jackdachef 2014-08-03 03:30:31 UTC
Created attachment 103895 [details]
dmesg: X crashing while playing HTML5 video movie trailers on youtube, LLVM disabled, sb backend

so it at least doesn't seem to be caused by LLVM and/or vdpau,

must be something else in combination with the hardware acceleration
Comment 34 jackdachef 2014-08-03 03:49:20 UTC
(In reply to comment #33)
> Created attachment 103895 [details]
> dmesg: X crashing while playing HTML5 video movie trailers on youtube, LLVM
> disabled, sb backend
> 
> so it at least doesn't seem to be caused by LLVM and/or vdpau,
> 
> must be something else in combination with the hardware acceleration

this was with hardware acceleration enabled
Comment 35 Aaron B 2014-08-03 05:50:27 UTC
Created attachment 103896 [details]
Steam crash because of Chromium crash.

This is a log of steam trying to play CSS with Chromium open, which crashed, so we got a good and new log. Shows the VM dies.

Also, in dmesg got a new error among the GPU resets attempts:

[63446.944129] radeon 0000:01:00.0: still active bo inside vm

This was after a crash that failed to recover and dropped to a TTY screen. Not much, but it's something new for me.
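The bracketed number at the start of a dmesg line is the time in seconds since boot, so it can be used to measure how far apart two events are. Below is a minimal Python sketch under that reading; the names `parse_dmesg_line` and `seconds_between` are invented for illustration.

```python
import re

# Matches lines such as:
# [63446.944129] radeon 0000:01:00.0: still active bo inside vm
DMESG_LINE = re.compile(r"^\[\s*(?P<seconds>\d+\.\d+)\]\s+(?P<message>.*)$")


def parse_dmesg_line(line):
    """Return (seconds_since_boot, message), or None if there is no timestamp."""
    match = DMESG_LINE.match(line)
    if match is None:
        return None
    return float(match.group("seconds")), match.group("message")


def seconds_between(first_line, second_line):
    """Return the elapsed seconds between two timestamped dmesg lines."""
    first = parse_dmesg_line(first_line)
    second = parse_dmesg_line(second_line)
    if first is None or second is None:
        raise ValueError("both lines need a [seconds.micros] timestamp")
    return second[0] - first[0]
```

For the line quoted above, the timestamp 63446.944129 corresponds to about 17.6 hours of uptime. Comparing the timestamps of consecutive GPU reset messages would show whether the crashes come at regular intervals.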
Comment 36 jackdachef 2014-08-03 17:38:19 UTC
Google Chrome 38 disabled lots of stuff (about:gpu) :

Graphics Feature Status
Canvas: Software only. Hardware acceleration disabled
Flash: Hardware accelerated
Flash Stage3D: Software only, hardware acceleration unavailable
Flash Stage3D Baseline profile: Software only, hardware acceleration unavailable
Compositing: Hardware accelerated
Rasterization: Software only, hardware acceleration unavailable
Threaded Rasterization: Enabled
Video Decode: Software only, hardware acceleration unavailable
Video Encode: Hardware accelerated
WebGL: Hardware accelerated
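To compare about:gpu output between machines, as suggested earlier in this thread, the status list can be turned into a dictionary. This is only a sketch in Python that assumes the "Feature: status" layout pasted above; `parse_feature_status` and `software_only` are hypothetical helper names.

```python
def parse_feature_status(text):
    """Map each "Feature: status" line to a dict entry.

    Lines without a colon (such as the "Graphics Feature Status"
    heading) and lines with an empty status are skipped.
    """
    features = {}
    for line in text.splitlines():
        name, separator, status = line.partition(":")
        if separator and status.strip():
            features[name.strip()] = status.strip()
    return features


def software_only(features):
    """Return the feature names whose status starts with "Software only"."""
    return [
        name
        for name, status in features.items()
        if status.startswith("Software only")
    ]
```

Run on the list above, `software_only` would report Canvas, both Flash Stage3D entries, Rasterization and Video Decode, leaving Flash, Compositing, Video Encode and WebGL as the features still marked hardware accelerated.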


had some features explicitly disabled via about:flags:

enable-zero-copy: off
enable-one-copy: off
accelerated overflow scroll: off
disable accelerated 2D Canvas (disabled anyway)
Composite fixed position elements: off
Composite RenderLayers with transitions: off
Composite fixed root backgrounds: off


hardlock occurred with this running on razor-qt with kwin + opengl compositing

after 25 minutes of movie trailers watching in youtube (forgot to explicitly disable adobe-flash, 1 flash video, the others were HTML5 video)

so unfortunately no log


will now try with hardware acceleration disabled & flash player enabled


as mentioned in bug #79980 there, however, also was a crash with firefox browsing on composited Xfce (compiz-fusion) with hardware acceleration disabled


there could be a common issue ... or not
Comment 37 Aaron B 2014-08-05 08:57:58 UTC
VLC also, rarely, has the crashing problem with simple video output acceleration.
Comment 38 Michel Dänzer 2014-08-05 09:48:28 UTC
(In reply to comment #32)
> R600_DEBUG=sb,sbsafemath,nollvm chromium-browser

Those debugging options only have an effect with the r600g driver, not with radeonsi.


> radeonsi isn't the only chipset family affected: intel cards hd2000, hd3000,
> hd4000; gm45, gm965, i915 (sandybridge, ivybridge, haswell, etc.), or other
> radeons are also affected in different ways

r600g issues might be related, but most definitely not Intel GPU ones.
Comment 39 jackdachef 2014-08-05 15:32:00 UTC
(In reply to comment #38)
> (In reply to comment #32)
> > R600_DEBUG=sb,sbsafemath,nollvm chromium-browser
> 
> Those debugging options only have an effect with the r600g driver, not with
> radeonsi.
> 

are there other ways to temporarily disable LLVM for debugging in radeonsi ?

gentoo ebuilds force users to compile mesa & radeonsi with llvm - so that seemingly is the only way of operation with these kind of chipsets/cards ?


> 
> > radeonsi isn't the only chipset family affected: intel cards hd2000, hd3000,
> > hd4000; gm45, gm965, i915 (sandybridge, ivybridge, haswell, etc.), or other
> > radeons are also affected in different ways
> 
> r600g issues might be related, but most definitely not Intel GPU ones.

so things are implemented *that* differently between intel & radeon drivers ?
Comment 40 Alex Deucher 2014-08-05 15:33:42 UTC
(In reply to comment #39)
> 
> are there other ways to temporarily disable LLVM for debugging in radeonsi ?

llvm is required for radeonsi.
Comment 41 jackdachef 2014-08-05 15:47:48 UTC
Created attachment 104076 [details]
dmesg-output after 25 minute hardware-accelerated html5 video crash, no hardlock this time (Magic SYSRQ works), screen corruption (screen subdivided horizontally into ~18 parts)

kernel running with drm-next-3.17-rebased-on-fixes applied on top of 3.16-rc6

latest commit:
author	Christian König <christian.koenig@amd.com>	2014-07-28 11:30:12 (GMT)
committer	Alex Deucher <alexander.deucher@amd.com>	2014-08-04 21:45:53 (GMT)
commit	fa783807977da98da35590fd1d5efdfd4f33fd59 (patch)
tree	0f1573ae770843228930a0f278a82eb5d482a4c5
parent	5fc6854683aad9ae8b711cbe0d824c11b4aad66c (diff)
drm/radeon: allow userptr write access under certain conditions



several hours of pushing and trying to get X/system lockup with firefox (hardware acceleration enabled) and watching & opening up large jpg images - showed that at least that issue was resolved (Bug #81612 )


Then now proceeded to re-test HTML5 video with hardware acceleration (hardware acceleration disabled was seemingly stable so far)

the funny thing: each of the last 3 test attempts after pretty much exactly 25 minutes it tends to lock up X


reproducer: chromium 38.0.2107.3 (previous versions should also work), but this one has more options disabled which should rule out other crash/instability triggers,

youtube.com ,
keywords: movie trailers 2014

watching random movie trailers with preferably 1080p (some only available in 720p)


result: screen content locks up, mouse still movable for a short time & sound continuing, the screen turning black - (box locking up/hardlock - this time *not*) - this time: (in total 2) attempts to salvage via Magic SYSRQ + k

screen flickers, another Magic SYSRQ + k

screen turns on again, mentioned screen corruption (screen subdivided horizontally into ~18 parts) with mostly white and green color in the shape of tiles

took a photo, if needed


so we got a *clear* improvement: the box does *not* hardlock anymore, Magic SYSRQ key works again and screen attempts to recover with Magic SYSRQ + k,

but it's not successful yet


hope the information of dmesg helps with further adding some ideas on how to solve this


added the following patchset (patches 2-7) on top of that kernel https://lkml.org/lkml/2014/8/3/120 ([PATCH 0/7] locking/rwsem: enable reader opt-spinning & writer respin ), not sure if that might increase stability


Cheers
Comment 42 Aaron B 2014-08-06 20:41:26 UTC
I've been able to get lucky and not get a hard lock most of the time. You HAVE to wait 10 seconds for the drm to reset the GPU. The patch didn't improve or change anything, I've seen those results for a while.
Comment 43 jackdachef 2014-08-06 20:53:58 UTC
(In reply to comment #42)
> I've been able to get lucky and not get a hard lock most of the time. You
> HAVE to wait 10 seconds for the drm to reset the GPU. The patch didn't
> improve or change anything, I've seen those results for a while.

any special commands you append to the radeon driver ? defaults (no special settings) ?

is it compiled into the kernel ? initramfs ?


thanks, will wait next time for 10+ seconds and crossing fingers
Comment 44 Maciej 2014-08-07 01:43:39 UTC
Testing Oibaf repo with kernel 3.16 under Ubuntu Gnome 14.04 - no issues after few hours using Chrome Beta (even tried with #ignore-gpu-blacklist flag). At least on my machine (7770) this issue looks to be Ubuntu 14.04 only.
Comment 45 Aaron B 2014-08-07 02:02:24 UTC
I'm on Arch, from Mint, and have the issue on both. If you still don't have a crash for a while, lets start comparing chrome/DE settings. I'm on an R9 270X, which is a 7850 reclocked AFAIK.
Comment 46 Michel Dänzer 2014-08-07 06:04:29 UTC
(In reply to comment #43)
> any special commands you append to the radeon driver ?

If you're still using radeon.hard_reset=1, you might try dropping that.
Comment 47 arnej 2014-08-11 11:02:32 UTC
Created attachment 104430 [details]
dmesg

I also have these problems. I attached my dmesg.

Until now (mesa 10.2.5), my system always hung completely and I had to reboot. Now I'm using mesa git and the screen goes blank, sound continues from the youtube video, and after about 10 seconds the screen is back and usable again.
Comment 48 Aaron B 2014-08-17 18:57:47 UTC
One way to trigger the bug pretty darn often is to run chromium for a while, then start an opengl game. I start TF2 and very, very often get a complete crash. The game usually closes, the browser stays open, though, if it recovers. Does that help any to know we can trigger it pretty often like that?
Comment 49 TheLetterN 2014-08-17 23:01:31 UTC
I may have found a way to reliably trigger this bug. However, this is based on the assumption that this bug is also in Drivers/Gallium/r600 (Arch with mesa version 10.2.5-1), which is what I'm using on my system with a Radeon HD 5870. I'm posting here because the symptoms and error messages are exactly the same, it looks like this bug is present in multiple mesa drivers, and I couldn't find a relevant bug posting in /Drivers/Gallium/r600.

So if anyone is willing to test it, I've found that using certain abilities in Tales of Maj'Eyal 64-bit Linux ( http://te4.org/ ) triggers the GPU lockup. Specifically, using the ability "Death Dance" will trigger it, and I believe something being called by the spinningwinds.lua and spinningwinds.frag scripts within t-engine4-linux64-1.2.3/game/modules/tome-1.2.2-gfx.team//data/gfx/shaders/ is causing it.

Unfortunately you'd have to make a berserker character and get to level 8 and 28 strength to unlock "Death Dance", so I've uploaded a gzipped save file you can extract and put in ~/.t-engine/4.0/tome/save/ That should work for testing if it triggers the GPU lock.

http://s000.tinyupload.com/index.php?file_id=67207270324448074311
Comment 50 Michel Dänzer 2014-08-18 03:55:49 UTC
(In reply to comment #49)
> I may have found a way to reliably trigger this bug. [...]

Please file your own report. In the unlikely case that it is indeed the same bug as this report is about, it'll be easy to merge the reports.
Comment 51 Maciej 2014-08-30 16:29:22 UTC
Any news? Mesa from 10.2 and up is still unusable on 7770.
Comment 52 Aaron B 2014-08-30 20:23:47 UTC
Nope, still having daily crashes with Chromium and VLC. Although I haven't been able to build mesa 32 or 64 in a week because git always hangs at 99 damn percent.
Comment 53 Maciej 2014-08-30 21:25:15 UTC
I upgraded xserver from 1.15 to 1.16 today and it looks like issue is gone. I even tried to reproduce the bug by running multiple flash instances and it didn't hang. Can anyone confirm that this bug doesn't occur on 1.16?
Comment 54 Aaron B 2014-08-31 03:07:57 UTC
I'm on 3.16 and have been for a while, it definitely is still in there.
Comment 55 Aaron B 2014-08-31 06:08:43 UTC
1.16* Whoops, not talking the kernel.
Comment 56 Clément Guérin 2014-08-31 12:07:37 UTC
I had this bug because I forced hw acceleration ON in Chromium (about:flags). Now that everything is set as default it's stable as rock. No crash in two days, before that it was more like 5-6 crashes a day.

Arch Linux with latest mesa-git/llvm-svn, Linux 3.16. HD 7950
Comment 57 Maciej 2014-08-31 16:39:23 UTC
Well, after a day the bug is back, so xserver update didn't help.
@Clément Guérin - I have this bug on default Chrome, turning on HW just makes it more frequent.
Comment 58 Aaron B 2014-08-31 23:35:24 UTC
I would also like to add, if this happens three times in quick succession (Have chrome running with an OpenGL game starting/Level loading in a game) it'll sometimes have 3 quick fail and recovers. If it does happen 3 times in a row, X dies and it throws up a few "Couldn't schedule IB" with possibly a previous "Still active IB in BO." or two to the kernel log. This is pretty rare, but about one out of every 5 crashes when I'm starting a game and Chromium/Mesa fail to start the OpenGL app will fail in such a manner, although it has also happened with just Chromium, but that is much more rare.
Comment 59 Aaron B 2014-09-01 00:27:26 UTC
*"Still active OB in VM"* Just had about 3 crashes in 2 hours. This is a joke of a bug that nobody seems to care about, and it probably affects all SI, and probably R600, users.
Comment 60 Aaron B 2014-09-01 01:34:49 UTC
I mean, if it helps, every crash happens in a situation where new graphics information would be uploaded to the card. That can't be that much of the Mesa code, since I'd imagine the most complex part is all the OpenGL standards compliance.

I run my R9 270X on a PCIe 3.0 port on a M5A99FX Pro 2.0 Evo motherboard. Anyone else on majorly different hardware and have this problem?

(Fun fact, it had an uncovered crashed while I was typing this in, no other screen updates at all.)
Comment 61 Damian Nowak 2014-09-01 01:55:01 UTC
My story: https://bugs.freedesktop.org/show_bug.cgi?id=65963#c4

tl;dr Mesa 10.1.4 is the last version that doesn't cause the problem.
Comment 62 Damian Nowak 2014-09-01 01:56:32 UTC
*** Bug 65963 has been marked as a duplicate of this bug. ***
Comment 63 Michel Dänzer 2014-09-01 06:14:19 UTC
(In reply to comment #61)
> tl;dr Mesa 10.1.4 is the last version that doesn't cause the problem.

If somebody could bisect that regression between Mesa 10.1 and 10.2, that would be very helpful. See https://bugs.freedesktop.org/show_bug.cgi?id=79696#c21 for some hints.
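For anyone who has not bisected before, here is a minimal sketch of the mechanics on a throwaway repository. Everything in it is made up for illustration: the scratch repository, the `known-good` tag, and the `regression.txt` marker file standing in for "the hang appears". For the real case you would run the same `git bisect` commands in a Mesa clone, with the last stable release as the good point and the first broken release as the bad point, and replace the automated check with building, installing and using the desktop for a while before typing `git bisect good` or `git bisect bad` by hand.

```shell
# Toy bisection: ten commits, the "regression" first appears in commit 7.
repo=$(mktemp -d)
git init -q "$repo"

# Helper: run git inside the scratch repo with a throwaway identity.
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid "$@"; }

n=1
while [ "$n" -le 10 ]; do
    echo "$n" > "$repo/version.txt"
    if [ "$n" -ge 7 ]; then echo broken > "$repo/regression.txt"; fi
    g add -A
    g commit -q -m "commit $n"
    if [ "$n" -eq 1 ]; then g tag known-good; fi
    n=$((n + 1))
done

# Bad endpoint first, then the good one.
g bisect start HEAD known-good >/dev/null

# Exit status 0 marks a commit good, 1 marks it bad.
g bisect run sh -c 'test ! -f regression.txt' >/dev/null

# refs/bisect/bad now points at the first bad commit.
first_bad=$(g log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $first_bad"

# Return the working tree to where it was before bisecting.
g bisect reset >/dev/null 2>&1
```

With an intermittent hang the automated `bisect run` step is not realistic; the point is only the start / good / bad / reset cycle.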
Comment 64 Christian König 2014-09-01 10:34:23 UTC
(In reply to comment #58)
> I would also like to add, if this happens three times in quick succession
> (Have chrome running with an OpenGL game starting/Level loading in a game)
> it'll sometimes have 3 quick fail and recovers. If it does happen 3 times in
> a row, X dies and it throws up a few "Couldn't schedule IB" with possibly a
> previous "Still active IB in BO." or two to the kernel log. This is pretty
> rare, but about one out of every 5 crashes when I'm starting a game and
> Chromium/Mesa fail to start the OpenGL app will fail in such a manner,
> although it has also happened with just Chromium, but that is much more rare.

Regarding this, you could try Alex's drm-next-3.18 branch; it contains some reset work from Maarten Lankhorst and me and should address this issue.

But nevertheless, bisecting what causes the crash to appear between Mesa 10.1.4 and 10.2 sounds like a good idea to me.
Comment 65 Aaron B 2014-09-01 17:11:23 UTC
I don't know what changed, if anything, but I did upgrade Chrome about 3 days ago, and for the past 2 days, ever since I saw others mentioning more than 3-4 crashes a day, I've been running into 15-20 crashes a day. I can't do anything with Chrome. Next time I watch a video, we'll see if that went down in stability. Like I said though, I did upgrade Chrome, but I don't remember any regressions in the first day of use. Maybe I didn't close it, so I wasn't running the new version, but still my crash numbers have gone through the roof now. I can't keep Chrome open at all unless it's needed.
Comment 66 Christian König 2014-09-01 17:22:47 UTC
(In reply to comment #65)
> I don't know what changed, if anything, but I did upgrade Chrome about 3
> days ago, and now the past 2 days ever since I saw them mentioning more then
> 3-4 crashes a day, I'm running into 15-20 crashes a day. I can't do anything
> with Chrome. Next time I watch a video, we'll see if that went down in
> stability. LIke I said though, I did upgrade Chrome, but I don't remember
> any regressions in the first day of use. Maybe I didn't close it so I wasn't
> running on the new version, but still my crash numbers have gone through the
> roof now, I can't keep chrome open at all unless it's needed.

Well, that at least makes the issue more reproducible. Could you try grabbing an apitrace log of what's going wrong?

That would really help.
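A rough sketch of how such a trace could be captured, assuming apitrace is installed and the browser binary is called `chromium` (both are assumptions about the reporter's setup). The `trace_app` wrapper is a hypothetical helper, not part of apitrace; it only forwards to `apitrace trace` with an explicit output file so the result is easy to find afterwards.

```shell
# Record the OpenGL calls of a program with apitrace, writing to a named file.
# Usage: trace_app OUTPUT.trace PROGRAM [ARGS...]
trace_app() {
    out=$1
    shift
    if ! command -v apitrace >/dev/null 2>&1; then
        echo "apitrace is not installed" >&2
        return 127
    fi
    # -a selects the API to trace, -o names the trace file to write.
    apitrace trace -a gl -o "$out" "$@"
}

# Example (not run here): trace the browser until the crash, then try to
# replay the recording to see whether the crash reproduces from the trace.
#   trace_app "$HOME/chromium.trace" chromium
#   glretrace "$HOME/chromium.trace"
```

A trace is mainly useful if replaying it with glretrace triggers the same failure on another machine.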
Comment 67 Aaron B 2014-09-01 18:33:10 UTC
Created attachment 105569 [details]
API trace of a crash.

> (In reply to comment #65)
> > I don't know what changed, if anything, but I did upgrade Chrome about 3
> > days ago, and now the past 2 days ever since I saw them mentioning more then
> > 3-4 crashes a day, I'm running into 15-20 crashes a day. I can't do anything
> > with Chrome. Next time I watch a video, we'll see if that went down in
> > stability. LIke I said though, I did upgrade Chrome, but I don't remember
> > any regressions in the first day of use. Maybe I didn't close it so I wasn't
> > running on the new version, but still my crash numbers have gone through the
> > roof now, I can't keep chrome open at all unless it's needed.
> 
> Well that at least makes the issue more reproducible. Could you try grabbing
> an apitrace log what what's going wrong?
> 
> That would really help.

This what you're looking for?
Comment 68 Michel Dänzer 2014-09-02 01:19:11 UTC
(In reply to comment #67)
> This what you're looking for?

No, we're looking for the /home/aaron/chromium.*.trace file generated by apitrace corresponding to the crash.

BTW, can you try if Mesa 10.1 is stable for you as well? And if so, bisect?
Comment 69 Aaron B 2014-09-02 02:58:03 UTC
Created attachment 105581 [details]
chromiumapitrace

(In reply to comment #68)
> (In reply to comment #67)
> > This what you're looking for?
> 
> No, we're looking for the /home/aaron/chromium.*.trace file generated by
> apitrace corresponding to the crash.
> 
> BTW, can you try if Mesa 10.1 is stable for you as well? And if so, bisect?

If I get time I'll compile Mesa 10.1 32 and 64 bit and see if they work. No promises, as git always fails either on or right after receiving 99% of objects, so I have to try about 20 times to even attempt to compile basically anything.

Still, you're in luck: I started an API trace, and as soon as my browser started to open, it killed it, so it's a nice pair of traces.
Comment 70 Aaron B 2014-09-02 03:04:00 UTC
Wrong file. This is the right file:

https://drive.google.com/file/d/0B1laUfqMuZQBeWpHSDYtV2N3RjQ/edit?usp=sharing
Comment 71 Michel Dänzer 2014-09-02 07:57:49 UTC
(In reply to comment #70)
> https://drive.google.com/file/d/0B1laUfqMuZQBeWpHSDYtV2N3RjQ/edit?usp=sharing

Can you reproduce the crash by feeding those traces to glretrace?


(In reply to comment #69)
> If I get time I'll compile Mesa 10.1 32 and 64 bit and see if they work. No
> promises, as git always fails either on or right after receiving 99% of
> objects, so I have to try about 20 times to even attempt compile basically
> anything.

Note that you only really need to clone a Git repository once, after that all the history (including all the commits you might need to test during the bisection) is available locally.
Comment 72 Aaron B 2014-09-02 13:54:52 UTC
(In reply to comment #71)
> (In reply to comment #70)
> > https://drive.google.com/file/d/0B1laUfqMuZQBeWpHSDYtV2N3RjQ/edit?usp=sharing
> 
> Can you reproduce the crash by feeding those traces to glretrace?
> 
> 
> (In reply to comment #69)
> > If I get time I'll compile Mesa 10.1 32 and 64 bit and see if they work. No
> > promises, as git always fails either on or right after receiving 99% of
> > objects, so I have to try about 20 times to even attempt compile basically
> > anything.
> 
> Note that you only really need to clone a Git repository once, after that
> all the history (including all the commits you might need to test during the
> bisection) is available locally.

Yeah, but I build it with yaourt, which likes to keep its Git clones elsewhere and apparently deletes them every time, as you can't keep a clone unless it's in the same session. I would work on getting 2-3 clones and uploading them to my git and copying from that if need be, but it's just tons of work for me. Like I said, I have to try about 20 times to even clone anything; it's a severe pain.

And running glretrace about 10 times on the trace files, no beans on getting another crash.

[aaron@Aaron-Arch Chromiumapitrace]$ glretrace chromium.trace
0 64 glXSwapIntervalMESA(interval = 1) = 0
64: warning: unsupported glXSwapIntervalMESA call
Rendered 150 frames in 2.68355 secs, average of 55.8961 fps
[aaron@Aaron-Arch Chromiumapitrace]$
Comment 73 Aaron B 2014-09-03 06:54:40 UTC
I'm not having any luck bisecting, because I can't even get mesa-10.1.4 to compile. I recompiled LLVM with R600 and everything, but still no dice. Could it be I'm using LLVM 3.6 or should there be no functional difference? I'm getting errors for non-existent members and implicit function declarations. If I could get it to compile I probably could have had it bisected, but looks like I can't do it right now.
Comment 74 Michel Dänzer 2014-09-03 06:58:27 UTC
(In reply to comment #73)
> Could it be I'm using LLVM 3.6 [...]

Yes, only released versions of LLVM can be supported by released versions of Mesa. Even LLVM 3.5 is only being released now, so you might need LLVM 3.4(.y) for Mesa 10.1.
Comment 75 Michel Dänzer 2014-09-03 07:01:58 UTC
(In reply to comment #72)
> Yeah, but I build it with yaourt, which likes to contain it's gits elsewhere
> and apparently deletes them every time as you can't keep a clone unless it's
> in the same session.

Sounds like a severe yaourt limitation, but maybe you can convince it to clone an existing local repository instead of a remote one?
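To illustrate the idea, here is a small sketch showing that a clone made from a filesystem path carries the full history without touching the network. The directory names are placeholders created in a scratch location; how to make yaourt (or the package's build script) use such a path as its source is not shown here and would depend on that package.

```shell
# A clone taken from a local path needs no download and has all commits.
work=$(mktemp -d)
src="$work/mesa-mirror"      # stands in for a clone you fetched once
copy="$work/build-copy"      # stands in for the directory a build helper wants

git init -q "$src"
for n in 1 2 3; do
    echo "$n" > "$src/file.txt"
    git -C "$src" add file.txt
    git -C "$src" -c user.name=demo -c user.email=demo@example.invalid \
        commit -q -m "commit $n"
done

# Cloning from a path instead of a URL.
git clone -q "$src" "$copy"

src_count=$(git -C "$src" rev-list --count HEAD)
copy_count=$(git -C "$copy" rev-list --count HEAD)
echo "source has $src_count commits, copy has $copy_count"
```

Once one full clone exists, every commit needed during a bisection can be checked out from it locally.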


> And running glretrace about 10 times on the trace files, no beans on getting
> another crash.

Then those trace files won't help others reproduce the problem either unfortunately, let alone fix it.
Comment 76 Aaron B 2014-09-03 07:26:19 UTC
Ah, okay. Yeah, dropped to LLVM 3.4 and we're good to go. I'll report any results I find interesting or believe may be problematic soon, along with my bisect paths. I have 4 days off, so I'll probably be posting here a lot if I get anything. Also helps I'm currently working on getting my projects to a publishable state so I'll be in the programming-debugging mood to boot.
Comment 77 Aaron B 2014-09-03 10:56:25 UTC
Took a few hours, but only one shot thankfully.

first bad commit: [0e7b0f2a0ad3818d02907746a86568c264c97701] meta: Refactor binding of renderbuffer as texture image

http://cgit.freedesktop.org/mesa/mesa/commit/?id=0e7b0f2a0ad3818d02907746a86568c264c97701

I think this is all you guys, now.
Comment 78 Marek Olšák 2014-09-03 12:12:12 UTC
(In reply to comment #77)
> Took a few hours, but only one shot thankfully.
> 
> first bad commit: [0e7b0f2a0ad3818d02907746a86568c264c97701] meta: Refactor
> binding of renderbuffer as texture image
> 
> http://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=0e7b0f2a0ad3818d02907746a86568c264c97701
> 
> I think this is all you guys, now.

That commit doesn't have any effect on radeonsi. Back to square one I guess.
Comment 79 Michel Dänzer 2014-09-03 15:21:28 UTC
Aaron, most likely the bisection didn't work because you marked commits as good which would have shown the problem after more testing. I'm afraid you'll have to test longer before declaring a commit as good.
Comment 80 Aaron B 2014-09-03 16:33:17 UTC
(In reply to comment #79)
> Aaron, most likely the bisection didn't work because you marked commits as
> good which would have shown the problem after more testing. I'm afraid
> you'll have to test longer before declaring a commit as good.

Hmm, okay, I'll try again. I'll test for about 8 hours this time. I don't have the best grasp of how Mesa works, but I didn't think I screwed up, as the commit seemed to be related to multisampler blitting, which sounds darn close to what is screwed up.

Starting over.
Comment 81 Michel Dänzer 2014-09-04 00:49:23 UTC
(In reply to comment #80)
> Hmm, okay, I'll try again. I'll test for about 8 hours this time.

Thanks. Basically, you need to test for at least as long as it's ever taken for the problem to appear. A multiple of that would be even better.
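To give a feel for what that costs in total, here is a back-of-the-envelope sketch. A bisection over N candidate commits needs roughly ceil(log2(N)) test rounds; the commit count below is a made-up example, not the real number of commits between the two Mesa releases, and the 8 hours per round is the figure mentioned above.

```shell
# Approximate number of bisect rounds for N candidate commits: ceil(log2(N)).
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

commits=1000          # hypothetical size of the range being bisected
hours_per_step=8      # time spent testing each candidate commit
steps=$(bisect_steps "$commits")
echo "$steps steps, about $((steps * hours_per_step)) hours of testing"
```

Every commit wrongly marked good sends the search into the wrong half of the range, which is why testing each round for too short a time tends to end at an unrelated commit.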


> I don't have the best grasp of the Mesa works, but I didn't think I screwed up
> as it seemed to be something related to multisampler blitting which sounds darn
> close to what is screwed up.

The meta code is only used by classic drivers, not by Gallium based drivers.
Comment 82 Aaron B 2014-09-04 14:47:01 UTC
I'm still bisecting, but I just want to say I suck at it and I'll probably need at least 2 bisects to the same point, if not more. I'm trying to be patient, but on the old Mesa versions the glitch just takes so long to show up, even when I set things up to trigger it.

So, should I skip to bisecting if this DMA patch that was just proposed is the source of our problem, also?

https://bugs.freedesktop.org/show_bug.cgi?id=83500
Comment 83 Aaron B 2014-09-04 20:36:36 UTC
This bisect put me here, which looks like it didn't go as planned again...

http://cgit.freedesktop.org/mesa/mesa/commit/?id=78578b759943cb198d34eedc00b3408c1599f6ec

I'm going to give up for now, maybe when I don't have so many other things going on I'll bisect it over a week or so.
Comment 84 Clément Guérin 2014-09-05 19:43:15 UTC
Awesome work Aaron, don't give up!
Comment 85 Aaron B 2014-09-05 20:09:15 UTC
I won't. I just am the typical programmer...it takes a while to trace it, but why not just look for other broken stuff? Someone else has to have better ability to do it and find what is causing it before I take a week to trace it completely. :)

I've been using the patch I linked to for half a day now, the one disabling 1D tiling in conjunction with RadeonSI DMA. I have had no Chromium crashes. I did have an odd crash in Steam while playing a flash video on a game page in large mode, but I don't believe it to be related. I'll report in later on whether I believe it is a valid workaround; for now it seems good though.
Comment 86 Aaron B 2014-09-06 17:05:06 UTC
1 day since I compiled mesa with the 2nd patch here, all is well and stable.

https://bugs.freedesktop.org/show_bug.cgi?id=83500
Comment 87 Michel Dänzer 2014-09-08 07:53:21 UTC
(In reply to comment #86)
> https://bugs.freedesktop.org/show_bug.cgi?id=83500

I think in the meantime Grigori has seen GPU hangs even with that patch though, with 2D tiling.

However, can you avoid the problem if you start Chromium with the environment variable R600_DEBUG=nodma?
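In case it is unclear how to do that, a minimal sketch follows. The `run_nodma` wrapper is a hypothetical helper name, and `chromium` as the binary name is an assumption. The variable only affects the process it is set for (and that process's children), so it has to be set for the browser itself, not just in some other terminal.

```shell
# Run one command with R600_DEBUG=nodma set for that process only.
run_nodma() {
    R600_DEBUG=nodma "$@"
}

# Show that the variable reaches the child process.
run_nodma sh -c 'echo "child sees R600_DEBUG=$R600_DEBUG"'

# Real use (not run here; assumes the browser binary is called chromium):
#   run_nodma chromium
```

Typing `R600_DEBUG=nodma chromium` directly in a terminal has the same effect as the wrapper.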
Comment 88 Aaron B 2014-09-08 13:15:59 UTC
(In reply to comment #87)
> (In reply to comment #86)
> > https://bugs.freedesktop.org/show_bug.cgi?id=83500
> 
> I think in the meantime Grigori has seen GPU hangs even with that patch
> though, with 2D tiling.
> 
> However, can you avoid the problem if you start Chromium with the
> environment variable R600_DEBUG=nodma?

I was waiting for someone to comment before I updated my status. I have had one crash, one of the ones where X dies because of 3 quick lockups and recoveries. But, considering that was the only crash in 2 days, it's much better than before. I also don't know if the crash was Chromium or not, because I was alt+tabbing into other windows, but I'd bet it was. Let me recompile Mesa fresh and try launching Chromium, and I'll let you know by the end of today if disabling DMA also fixes it.
Comment 89 Maciej 2014-09-10 22:52:37 UTC
Could someone please fix this already? Mesa 10.2.x, 10.3-RC and git are simply unusable on 7770.
Comment 90 Aaron B 2014-09-11 01:42:09 UTC
My testing has been delayed since they pushed LLVM 3.5 to stable on Arch, but they changed one of the function's parameters. Then, 3.6 compiles perfectly for 64-bit, yet when compiling 32-bit, it fails to link. So I've been stuck using repo builds of 10.2.7, which seem better but still crash. Maybe it's a memory/buffer overrun problem, because the versions are either god-awful or somewhat usable with only a rare hiccup, with more recent versions just getting worse and worse. But like I said, I can't compile basically anything right now until everyone gets their stuff straight on function names and parameters.
Comment 91 Michel Dänzer 2014-09-11 02:00:48 UTC
(In reply to comment #89)
> Could someone please fix this already? Mesa 10.2.x, 10.3-RC and git are
> simply unusable on 7770.

I'm sorry to hear that. We're working on it, but since we haven't been able to reproduce these issues, we need your help for testing: Does the environment variable R600_DEBUG=nodma help? If not, can you try if Mesa 10.1 is stable for you, and if so, bisect between 10.1 and 10.2?
Comment 92 Aaron B 2014-09-11 02:08:27 UTC
Like I said, I tried. But it's too hard to reproduce. Unless you do it over a 2 week time span, it's so difficult to find. Maybe he could do it better, since his builds seem to crash more if it's completely unusable. But if you're on Arch and do want to help, stick to 32 and 64-bit LLVM 3.4.2. I can't help for now.
Comment 93 Maciej 2014-09-11 10:25:57 UTC
(In reply to comment #91)
> I'm sorry to hear that. We're working on it, but since we haven't been able
> to reproduce these issues, we need your help for testing: Does the
> environment variable R600_DEBUG=nodma help? If not, can you try if Mesa 10.1
> is stable for you, and if so, bisect between 10.1 and 10.2?

R600_DEBUG=nodma doesn't seem to help, got a hang while typing this comment.

It doesn't happen with Mesa 10.1 from the default Ubuntu 14.04 installation, but I need at least Mesa 10.3 and llvm-3.5 for any enjoyable gaming. My choice is either to use Mesa for gaming, but then I can't use a browser with any sort of hardware acceleration (or flash), or to use fglrx for awesome 12fps desktop performance (cause fglrx is broken on all fronts).

As for bisecting, I have no idea how to do that, nor do I have time to learn - I'm just an Ubuntu user with no technical skills.
Comment 94 Grigori Goronzy 2014-09-11 10:33:39 UTC
Maybe the crash actually happens because of glamor rendering - setting R600_DEBUG won't do anything in that case.

Does this patch to Mesa make any difference?

https://bugs.freedesktop.org/attachment.cgi?id=105745
Comment 95 Maciej 2014-09-12 16:52:42 UTC
(In reply to comment #94)
> Maybe the crash actually happens because of glamor rendering - setting
> R600_DEBUG won't do anything in that case.
> 
> Does this patch to Mesa make any difference?
> 
> https://bugs.freedesktop.org/attachment.cgi?id=105745

I got a feeling that You guys always expect everyone here to have some sort of skills, but a lot of Linux users have no idea how to compile Mesa, just like me. I can test this if it lands in something like Oibaf PPA, but that's it :/
Comment 96 Aaron B 2014-09-12 17:15:44 UTC
Just a while ago, I was the same way, sort of. The main thing is to get away from the not-as-good package manager in Ubuntu, and go with Arch like I did. Package support is amazing because of the AUR, which builds from source for you, and when you have to do it yourself like us, most of the time you can just grab the config from the auto-installer, which should have 99% of the config options in place. Like I said, it's screwed ATM, but usually it's pretty painless to compile anything. You just have to try it out. If you email me, I can get you set up with a Mesa build configuration for RadeonSI. Then all you have to do is learn how to clone, pull, and patch in git. It's all not too bad. I mean, most Linux people are devs of some sort, so if you want to have a good time, you should put a little more time in and learn how the back-end of Linux works and how programs are compiled. I'll keep testing, as soon as I find a way to compile LLVM 3.6.
Comment 97 Maciej 2014-09-12 20:12:34 UTC
(In reply to comment #96)
> Just a while ago, I was the same way, sort. The main thing is to get away
> from the not-as-good package manager in Ubuntu, and go with Arch like I did.
> Package support is amazing because of the AUR, which build source for you,
> and when you have to do it yourself like us, most of the time you can just
> grab the config from the auto-installer, which should have 99% of the config
> options in-place. Like I said, it's screwed ATM, but usually it's pretty
> painless to compile anything. You just have to try it out. If you email me,
> I can get you set up with a Mesa build configure for RadeonSI. Then all you
> have to do is learn how to clone, pull, and patch in git. It's all not too
> bad. I mean, most LInux people are devs of some sort, so if you want to have
> a good time, you should put a little more time in and learn how the back-end
> of Linux and how programs are compiled to have a good time. I'll keep
> testing, as soon as I find a way to compile LLVM 3.6.

I'd rather have my system to just work. Adding PPA takes no effort, setting up something like Arch or compiling Mesa does and I have no time to deal with it. Some of Linux users (I assume mostly those on *buntu distros) are not developers, sysadmins or kernel hackers, we got other jobs and things to do, but we like open source solutions and other benefits of Linux world. I suppose I should just change my GPU to Nvidia and stop giving a fuck, but I prefer open source solutions and AMD is the only choice when it comes to high performance foss drivers.

I'm not bitching or whining here, just clarifying few facts.

I can report bugs, provide logs, test some PPAs, but please don't patronize me with your linux hacker attitude, I'm not one of You, I'm just a user with a different job description who happens to like open source for various reasons (messing with code is not one of them).
Comment 98 Aaron B 2014-09-12 20:37:59 UTC
I'm giving you details on how to help us more. If you can run Linux in any way, and understand even remotely how the command line and packages work, then moving to a better system or learning one little thing like compiling code isn't hard; that is my point. If you don't want to, that is fine. At the same time, even though I'm a programmer, the DRM infrastructure and hardware are out of my league, so I can't help that way. Yes, it's good to get confirmation of bugs, but unless we can apply patches and so on, we're a long shot from finding the actual problem, since, like I said, it's hard to track down because it's so random.

But still, whoever said AMD has the best FOSS drivers was dead wrong, as they're last AFAIK. Intel and Nvidia patches to Mesa even are 10x more. Our guys on here do good work, it's not their fault AMD as a company won't cater to us, though. They don't have much help, it's 4-5 guys, most of them watching this. I mean, if they ditched their proprietary drivers made in Hong Kong, which aren't bad when they compile on 2 year old kernels, and contributed to Mesa/DRM, we'd have had good drivers years ago. But they don't, so we're here now, and we gotta be patient. You should have known and understood this before you went to a newer technology card. I mean, I'm mad too, I also want to go with a GTX 770ti, but I'm sticking with this to help get it stable, as without users it's also harder to get software into a usable state overall.

If you want complete usability on your hardware, you should just go windows then, FWIW. As bad as that sounds, if you need it to work, use what does for your hardware, as new hardware in Linux takes time, especially since there's always new hardware. It works for me, just a few bugs, I'm sorry yours is worse, but there's not much any of us can do.

Also, I haven't had a crash in 3 or so days now since I went to LLVM 3.6, even though it doesn't build Mesa with it on 32-bit. Maybe it's fixed, just have to wait for LLVM to update.
Comment 99 Emil Velikov 2014-09-12 20:55:36 UTC
Maciej

Just a friendly note from a guy not involved with the radeon drivers or the AMD team.

There is software development and distribution. As the latter differs greatly it is somewhat unexpected for developers to know every distro, how they build their software, how its packaged, distributed etc.
If you are willing to help, apart from reporting issues here, file them with the distribution and obviously link the two. This way they can prepare packages for you (and other affected people) to test.

I fear that explaining over and over again that you don't have experience in building and/or bisecting does not really help. If you have the time to learn - great, otherwise seek assistance from your distribution.
Comment 100 Aaron B 2014-09-12 20:57:40 UTC
I also would like to note that, I just had 2 crashes. Chromium. So it still exists, just seems to happen rarer and rarer for me.
Comment 101 Maciej 2014-09-13 13:02:05 UTC
(In reply to comment #98)
> > If you want complete usability on your hardware, you should just go windows
> then, FWIW.

This here is where Linux desktop issues start - You people drive new users away from this platform, so only devs are here who do not make software for human beings, so it's harder for new users to come here, which results in low market share, which results in lack of proper hardware support which results in even lower market share, good job. If not for Canonical or Valve there would be no Linux desktop at all, all You would be running Open Quake on top of fuckin' Gentoo or Arch with some useless (for average joe, not developer) tiling window manager.

I'm using Linu.. wait, let me correct - I'm using Ubuntu because I like open source for various reasons, hacking the code is not one of them (I'm repeating myself I think, obviously You people can't read any form of human language - "it's not C, I don't understand!"). Also my card is from 2012, it's not even remotely new hardware, get your facts straight.

--------

Emil Velikov

Not sure what difference would it make to file a bug at launchpad from where they will point me right here. Aren't all Mesa devs right here anyway? I'm seeking assistance from Mesa developers, which I heard are paid by AMD (those on AMD team) - which makes me their customer.

I suppose this is wrong place to be a normal person, not a dev, so I'll stay away in the future and meanwhile jump ship to Nvidia camp, fuck You very much (Mesa is great, but Your attitude towards non-developers is terrible).
Comment 102 Emil Velikov 2014-09-13 13:50:34 UTC
(In reply to comment #101)
> Emil Velikov
> 
> Not sure what difference would it make to file a bug at launchpad from where
> they will point me right here. Aren't all Mesa devs right here anyway? I'm
> seeking assistance from Mesa developers, which I heard are paid by AMD
> (those on AMD team) - which makes me their customer.
> 
Please try to settle for a moment and read the whole "development" vs "distribution" note.

> I suppose this is wrong place to be a normal person, not a dev, so I'll stay
> away in the future and meanwhile jump ship to Nvidia camp, fuck You very
> much (Mesa is great, but Your attitude towards non-developers is terrible).
>
I never said nor hinted that this is the "wrong place for normal people". I feel that once you realise what I meant with "development vs distribution" you will understand that "our attitude" is not terrible. All developers ask is some clear steps in order to understand and fix the issue.

Thanks
Comment 103 Aaron B 2014-09-15 02:07:04 UTC
It seems the more often the build has this bug, the more it ALSO randomly has "blank" and "garbled" frames. Every so often, the screen will shift, or sometimes seem to just go completely blank. I notice in the builds where Chrome crashes a ton, this happens a ton. When it doesn't, it also doesn't crash as often.

What are the chances our other hardware could help find this? I dunno what you guys on the AMD dev side run, but I did get an FX-8350, and maybe this has something to do with the vblank patches I saw you guys post, as I remember seeing errors I saw with the vblank patches and talk. I don't know if they were scrapped or what, but could this be tied to any external hardware or bus or something just racing something else? Any way to test for the race conditions, successfully? I mean, I'm back on this current build of mesa, and it's just a mess.
Comment 104 Maciej 2014-09-15 23:41:46 UTC
Easy way to reproduce (though takes some time) is watching any video using mplayer with xv. After few minutes (or an hour) it will hang. Same goes for flash videos, fire up 5 or 6 at once and in no time it should hang.

Watching videos in mplayer with vdpau doesn't result in a crash (I watch a lot of movies on this machine with smplayer+vdpau, and it has never hung with it). Also, aside from flash, current Firefox (32) from Ubuntu 14.10 works properly, but if I turn on hardware acceleration (OMTC or something) it behaves just like Chrome (hangs).
Comment 105 Aaron B 2014-09-16 21:17:38 UTC
Nobody reproducing this still? I just had another X dying when booting up/starting to transfer GPU data in CSS for a map and I got a new error in between other usuals:

r600_ring_test: *ERROR* Radeon: ring test failed (scratch(0x850C)=0xCAFEDEAD)

Also, looks like our issue, VERY similar in every way, but was supposedly fixed.

https://www.libreoffice.org/bugzilla/show_bug.cgi?format=multiple&id=67187

Is it a power issue? Is it only triggered because a web browser isn't very hard to run, and is constantly switching states depending on use and how much there is to update, while other OpenGL apps like games and such don't crash because they skip problematic states in the PM? Just throwing out whatever sticks, as if that was their issue, it sounds exactly like my issue, even my primary errors with it.
Comment 106 Aaron B 2014-09-22 06:00:02 UTC
LibreOffice triggers the bug much, much more than Chromium and VLC. Trying to make a few work documents and update my resume, and it's just not doable.
Comment 107 smoki 2014-09-22 06:47:33 UTC
 @Aaron B 

 Can I suggest something to try, in case it helps? This bug is getting very big, is unreproducible for some people, and seems to be going nowhere. So if you build your own Mesa, try building it with -mtune=native for both the 32-bit and 64-bit variants. For me, that (or anything that avoids the default -march=generic) recently made it faster and more stable. See bug 83436.
Comment 108 Aaron B 2014-09-23 21:20:26 UTC
I have seen the comment, once LLVM becomes unbroken, and GCC becomes unbroken, and everything doesn't suck, I'll recompile it all. But right now, LLVM is broken, and even when compiled, doesn't link right with Mesa/Mesa32, so I can't test for certain right now.
Comment 109 Aaron B 2014-09-25 17:19:36 UTC
Crashing still persists with -mtune=native and -march=native on my kernel and mesa 64, and mesa 32 with LLVM 3.5 from Arch's repos. :c
Comment 110 Aaron B 2014-09-25 17:53:49 UTC
I found this post, stating some of the ASUS hardware comes PRE-OVERCLOCKED from the factory. Good chance this is the cause of my instability?

http://www.tomshardware.com/answers/id-2041324/problem-asus-270x-screen-blank-game.html

My GPU:

http://www.newegg.com/Product/Product.aspx?Item=N82E16814121802&nm_mc=KNC-GoogleAdwords&cm_mmc=KNC-GoogleAdwords-_-pla-_-Desktop+Graphics+Cards-_-N82E16814121802&gclid=COjjs6v3_MACFRAR7Aod_2EAqQ

Core clock is listed at 1120 MHz, not 1050 MHz.
Comment 111 Alex Deucher 2014-09-25 18:29:21 UTC
(In reply to comment #110)
> I found this post, stating some of the ASUS hardware comes PRE-OVERCLOCKED
> from the factory. Good chance this is the cause of my instability?
> 
> http://www.tomshardware.com/answers/id-2041324/problem-asus-270x-screen-
> blank-game.html
> 
> My GPU:
> 
> http://www.newegg.com/Product/Product.aspx?Item=N82E16814121802&nm_mc=KNC-
> GoogleAdwords&cm_mmc=KNC-GoogleAdwords-_-pla-_-Desktop+Graphics+Cards-_-
> N82E16814121802&gclid=COjjs6v3_MACFRAR7Aod_2EAqQ
> 
> Core clock is listed at 1120 Mhz, not 1050 Mhz.

Radeon already limits the clocks in certain cases depending on the vbios tables. You can disable dpm, which will keep the chip at the default boot-up levels (usually somewhere in the range of 300e/150m) for testing. Append radeon.dpm=0 to the kernel command line in grub.
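After rebooting, it is worth confirming that the parameter actually reached the kernel. The sketch below checks the running kernel's command line; `cmdline_has` is a hypothetical helper name. On a typical GRUB 2 setup the parameter is added to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and the configuration regenerated, but the exact procedure depends on the distribution, so treat that part as an assumption.

```shell
# Report whether a kernel command line string contains a given parameter.
# Usage: cmdline_has "CMDLINE STRING" "parameter"
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the command line of the running kernel.
if cmdline_has "$(cat /proc/cmdline 2>/dev/null)" "radeon.dpm=0"; then
    echo "radeon.dpm=0 is on the kernel command line"
else
    echo "radeon.dpm=0 is not on the kernel command line"
fi
```

If the parameter is missing after a reboot, the bootloader configuration was probably not regenerated.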
Comment 112 Aaron B 2014-09-25 19:11:12 UTC
Okay, I'm testing it now. I can't game at any good framerate, but that is fine. If I get no crashes here, does that mean the random crashes are from changes in power state? It would make sense, since most of the time it happens on page loading, or when most of the screen's pixels change at the same time. I'm on a 1080p screen via HDMI; maybe that could help you try to replicate it some more.
Comment 113 Alexandre Demers 2014-09-26 04:36:03 UTC
I commented out every "return ret;" of si_dpm_set_power_state() in si_dpm.c. After booting this modified kernel, I can confirm this is the only error reported in si_dpm_set_power_state(): every other verification passes OK and it goes down to the very end.
Comment 114 Alexandre Demers 2014-09-26 04:40:21 UTC
(In reply to comment #113)
> I commented out every "return ret;" of si_dpm_set_power_state() in si_dpm.c.
> After booting this modified kernel, I can confirm this is the only error
> reported in si_dpm_set_power_state(): every other verification passes OK and
> it goes down to the very end.

Oops, I had too many bugs open; I wrote this comment in the wrong one.
Comment 115 Aaron B 2014-09-26 04:41:39 UTC
Haha, was it the SI power state ULV one? I was worried I accidentally posted in that and this was a reply. No crashes so far, I'll report tomorrow if I have any, but it seems stable with DPM off. But we'll continue more tomorrow. :)
Comment 116 Alexandre Demers 2014-09-26 07:10:48 UTC
I've been reading the comments in fast forward, and I'm sure I'm hitting the same bug.

For info, I'm using a 7950 on Arch 64. I'm running Xorg server 1.16.1. Mesa, drm and ddx are all from the latest git repositories. I was not experiencing this bug yesterday while I was still using a 6950 on Glamor.

Now, I have a log that I can push from just after a hang and reset, or from a hang and a badly running Xorg server. There is something about ring 5 not responding. I'll send the file later when I get to my desktop.

To trigger the bug in no time, like you, I just have to watch a movie (either flash or html5) in chrome or firefox and wait. Otherwise, going through websites, apps, etc. will get you to the same point, but it will take more time.

I must warn you though: chrome/chromium has been hanging with both the r600 and radeonsi drivers. It should not be mixed up with the current issue, because they are probably two different bugs. Thus, I prefer to use firefox for now to avoid mixing these two.
Comment 117 Aaron B 2014-09-26 21:27:53 UTC
Well, I'm still on Chromium, and I've forced my GPU off the blacklist, because if you don't then it doesn't use all accel, so it should be okay. Anyways, I turned off DPM and have no crashing at all. Chromium seems stable, LibreOffice seems stable, and also when you turn off DPM, the "Screen Jumping/corruption" goes away. It seems to be proportional to the number of crashes, too. I have seen it go away, and there are no crashes now, so maybe it's related. But without DPM I'd say my GPU is now stable. My R9 270X is ASUS brand from about January, if that would help figure out what hardware it has.
Comment 118 Alexandre Demers 2014-09-26 22:58:48 UTC
(In reply to comment #117)
> Well, I'm still on Chromium, and I've forced my GPU off the blacklist,
> because if you don't then it doesn't use all accel so it should be okay.
> Anyways, I turned off DPM and have no crashing at all, Chromium seems
> stable, LibreOffice seems stable, and also when you turn off DPM, the
> "Screen Jumping/corruption" goes away. It seems to be proportional to number
> of crashes, too. Which I have seen it go away, and no crashes now, so maybe
> it's related. But without DPM I'd say my GPU is now stable. My R9 270X is
> ASUS brand from about January, if that would help figure what hardware it
> has.

Well, it points out a problem with dpm. Maybe the uvd class is problematic?
Comment 119 Alexandre Demers 2014-09-26 23:17:32 UTC
Small question Alex Deucher or Christian may answer: is it normal ring 5 is completely in a different GPU's memory address area?
[    9.353518] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x00000000c0000c00 and cpu addr 0xffff880411a25c00
[    9.353519] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x00000000c0000c04 and cpu addr 0xffff880411a25c04
[    9.353521] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x00000000c0000c08 and cpu addr 0xffff880411a25c08
[    9.353522] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x00000000c0000c0c and cpu addr 0xffff880411a25c0c
[    9.353524] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x00000000c0000c10 and cpu addr 0xffff880411a25c10
[    9.356425] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90015fb5a18

rings 0 to 4 are all in the same gpu address subset, but not ring 5?
Comment 120 Aaron B 2014-09-27 05:35:06 UTC
I don't know how much it'll help, but I have this running in the background to try to see whatever causes it. It doesn't help much, considering I don't have a marker/way to record when the crash starts to happen, but maybe someone has an idea of the best way to do that. Maybe look for messages in dmesg about the rings locking up? Any improvement ideas? You probably will have to adjust the sampling rate and size to fit your computer; my drive is at 50% usage running this, but I can afford it since I have a few drives.

http://ideone.com/V4MkeN
Comment 121 Aaron B 2014-09-27 07:05:04 UTC
Created attachment 106942 [details]
DPM Crash log/watch script output.

First log with that script. No markers needed: you can see in the log when DPM dies, as it goes back to no-DPM mode until it recovers.
Comment 122 José Suárez 2014-09-27 09:30:10 UTC
Created attachment 106946 [details]
Dmesg log
Comment 123 José Suárez 2014-09-27 09:31:46 UTC
(In reply to comment #122)
> Created attachment 106946 [details]
> Dmesg log

I am experiencing the same problem described in comment 116. I only use firefox and I am getting hangs consistently whenever I watch a flash or an html5 youtube video. The screen goes black and I have to reboot.

I attached the system dmesg log above.
Comment 124 José Suárez 2014-09-27 09:34:02 UTC
Just for an easier read:

Radeon HD 7870 GHZ Edition 2 GB
FX 8150 @ 3700 MHz
16 GB RAM

Linux 3.17-rc5, with mesa 10.4~git1409200730.4eb2bb compiled with llvm 3.6~svn217413 on Kubuntu 14.04 with all the latest updates (including firefox).
Comment 125 José Suárez 2014-09-27 09:35:10 UTC
Created attachment 106947 [details]
X.org log
Comment 126 Christian König 2014-09-27 11:14:36 UTC
(In reply to comment #119)
> Small question Alex Deucher or Christian may answer: is it normal ring 5 is
> completely in a different GPU's memory address area?
> [    9.353518] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
> 0x00000000c0000c00 and cpu addr 0xffff880411a25c00
> [    9.353519] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr
> 0x00000000c0000c04 and cpu addr 0xffff880411a25c04
> [    9.353521] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr
> 0x00000000c0000c08 and cpu addr 0xffff880411a25c08
> [    9.353522] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
> 0x00000000c0000c0c and cpu addr 0xffff880411a25c0c
> [    9.353524] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr
> 0x00000000c0000c10 and cpu addr 0xffff880411a25c10
> [    9.356425] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr
> 0x0000000000075a18 and cpu addr 0xffffc90015fb5a18
> 
> rings 0 to 4 are all in the same gpu address subset, but not ring 5?

Yes, that's perfectly normal. Ring 5 is the UVD ring and that needs to have its fence in the first 256MB of VRAM.
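
For anyone who wants to check this on their own card, here is a minimal Python sketch that pulls the fence addresses out of dmesg text and flags which ones fall below 256MB. The regular expression is modelled on the lines quoted in comment 119. Comparing the raw GPU address against 256MB assumes that VRAM starts at GPU address 0, which is an assumption made for illustration and may not hold on every setup.

```python
import re

# Pattern modelled on the radeon fence driver lines quoted in this report.
FENCE_RE = re.compile(r"fence driver on ring (\d+) use gpu addr (0x[0-9a-fA-F]+)")

# The 256MB window mentioned above.
LIMIT = 256 * 1024 * 1024


def fence_addresses(dmesg_text):
    """Map ring index -> GPU fence address parsed from dmesg text."""
    return {int(ring): int(addr, 16) for ring, addr in FENCE_RE.findall(dmesg_text)}


# Two of the lines from comment 119, used as sample input.
sample = """\
[    9.353524] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x00000000c0000c10 and cpu addr 0xffff880411a25c10
[    9.356425] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90015fb5a18
"""

for ring, addr in sorted(fence_addresses(sample).items()):
    # Assumption: VRAM begins at GPU address 0, so the address is the VRAM offset.
    where = "below 256MB" if addr < LIMIT else "above 256MB"
    print(f"ring {ring}: {addr:#x} ({where})")
```

On a real system the input would be the output of dmesg rather than the embedded sample.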
Comment 127 Alexandre Demers 2014-09-27 11:33:30 UTC
(In reply to comment #126)
> (In reply to comment #119)
> > Small question Alex Deucher or Christian may answer: is it normal ring 5 is
> > completely in a different GPU's memory address area?
> > [    9.353518] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
> > 0x00000000c0000c00 and cpu addr 0xffff880411a25c00
> > [    9.353519] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr
> > 0x00000000c0000c04 and cpu addr 0xffff880411a25c04
> > [    9.353521] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr
> > 0x00000000c0000c08 and cpu addr 0xffff880411a25c08
> > [    9.353522] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
> > 0x00000000c0000c0c and cpu addr 0xffff880411a25c0c
> > [    9.353524] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr
> > 0x00000000c0000c10 and cpu addr 0xffff880411a25c10
> > [    9.356425] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr
> > 0x0000000000075a18 and cpu addr 0xffffc90015fb5a18
> > 
> > rings 0 to 4 are all in the same gpu address subset, but not ring 5?
> 
> Yes that's perfectly normal. Ring 5 is the UVD ring and that needs to have
> > its fence in the first 256MB of VRAM.

Thank you Christian for your quick explanation. Can we suspect the problem to be within the UVD code then if:
- crashes are mostly happening when watching videos (flash or html5 in whatever browser);
- logs point to ring 5 not responding for some time?
Comment 128 Christian König 2014-09-27 15:02:59 UTC
(In reply to comment #127)
> Thank you Christian for your quick explanation. Can we suspect the problem
> to be within the UVD code then if:
> - crashes are mostly happening when watching videos (flash or html5 in
> whatever browser);
> - logs points to ring 5 not responding for some time?

Well, that behavior could actually be completely normal; feeding UVD with incorrect information can crash the block.

That can happen, for example, because of a corrupted video stream, or because the userspace driver doesn't work 100% correctly.

Whatever it is, the recommended way to handle it is to just reset the block, and that's exactly what we do. The only problem is that it's really hard to get the reset to be 100% reliable.
Comment 129 Aaron B 2014-09-27 16:45:37 UTC
Didn't mention it last night, but in my system log there were messages about the msc number being wrong, which we've seen before. But I also saw a "Chromium segfault" error which I haven't seen before, and which would explain why it crashes. It's probably Chromium's fault for not bailing gracefully from a bad return, but maybe the return isn't NULL when it should be; anything is possible.

Also interesting in DMesg:

[ 8531.831921] radeon 0000:01:00.0: Packet0 not allowed!

Is it broken in all of radeonsi, and not only hawaii? I haven't seen this before. This was 500 seconds before my screen died.

Still, anything else I should log from DPM and such to try to find what screws up when?
Comment 130 José Suárez 2014-09-27 18:19:16 UTC
Oh, by the way, since I see UVD comments, just to be clear, I don't have GPU acceleration enabled in Flash.
Comment 131 José Suárez 2014-09-27 18:51:16 UTC
I have just played a youtube video (coincidentally the youtube message from AMD's John Byrne regarding Catalyst!) and I have just checked that the following messages show up in the dmesg log:

[ 2053.298531] radeon 0000:05:00.0: Packet0 not allowed!
[ 2058.548486] radeon 0000:05:00.0: Packet0 not allowed!
[ 2166.793537] radeon 0000:05:00.0: Packet0 not allowed!

Given that the video is very short and I haven't played any other youtube video the system hasn't crashed (yet). I presume that if I were to play other videos the system would hang.
Comment 132 Alexandre Demers 2014-09-27 21:18:54 UTC
(In reply to comment #131)
> I have just played a youtube video (coincidentally the youtube message from
> AMD's John Byrne regarding Catalyst!) and I have just checked that the
> following messages show up in the dmesg log:
> 
> [ 2053.298531] radeon 0000:05:00.0: Packet0 not allowed!
> [ 2058.548486] radeon 0000:05:00.0: Packet0 not allowed!
> [ 2166.793537] radeon 0000:05:00.0: Packet0 not allowed!
> 
> Given that the video is very short and I haven't played any other youtube
> video the system hasn't crashed (yet). I presume that if I were to play
> other videos the system would hang.

I also saw that at least once. I'll look in my logs for this when it hangs.
Comment 133 Grigori Goronzy 2014-09-29 08:05:10 UTC
I'm not really sure the failing UVD resets are even related to the hangs. UVD also sometimes fails to reset properly for me after the DMA block hangs. GPU reset just isn't that reliable on SI right now.
Comment 134 Alexandre Demers 2014-09-29 14:37:06 UTC
I'd like to test disabling UVD. I'll dig in the kernel to report UVD as disabled for SI for this test. This should narrow where we have to look for this bug.
Comment 135 Christian König 2014-09-29 14:40:46 UTC
(In reply to comment #134)
> I'd like to test disabling UVD. I'll dig in the kernel to report UVD as
> disabled for SI for this test. This should narrow where we have to look for
> this bug.

Sometimes UVD functions are needed for power management to work correctly, so not enabling UVD at all might not be an option either. But I'm not sure if that applies to your hardware or not.

Easiest way to disable UVD otherwise is to just add a "return -1" to the beginning of radeon_uvd_init in the kernel module.
Comment 136 Alex Deucher 2014-09-29 15:11:05 UTC
(In reply to comment #131)
> I have just played a youtube video (coincidentally the youtube message from
> AMD's John Byrne regarding Catalyst!) and I have just checked that the
> following messages show up in the dmesg log:
> 
> [ 2053.298531] radeon 0000:05:00.0: Packet0 not allowed!
> [ 2058.548486] radeon 0000:05:00.0: Packet0 not allowed!
> [ 2166.793537] radeon 0000:05:00.0: Packet0 not allowed!

These indicate that userspace is sending a broken command stream (e.g., a bad count for a packet most likely).  The kernel will reject the command stream if a bad one is encountered.  There still could be a bug in the command stream setup in userspace that is just masked in other cases due to the layout of the commands.
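
To see how these rejections are spaced out relative to a later hang, the timestamps can be extracted from dmesg. Below is a minimal Python sketch. The function names are made up for illustration, and the pattern only covers the exact message format quoted in comment 131.

```python
import re

# Matches lines such as:
# [ 2053.298531] radeon 0000:05:00.0: Packet0 not allowed!
PACKET0_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+radeon \S+ Packet0 not allowed!", re.MULTILINE
)


def packet0_times(dmesg_text):
    """Return the kernel timestamps (in seconds) of every Packet0 rejection."""
    return [float(t) for t in PACKET0_RE.findall(dmesg_text)]


def gaps(times):
    """Return the time elapsed between consecutive events."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# The three lines from comment 131, used as sample input.
sample = """\
[ 2053.298531] radeon 0000:05:00.0: Packet0 not allowed!
[ 2058.548486] radeon 0000:05:00.0: Packet0 not allowed!
[ 2166.793537] radeon 0000:05:00.0: Packet0 not allowed!
"""

times = packet0_times(sample)
print(len(times), "rejections; gaps in seconds:", [round(g, 3) for g in gaps(times)])
```

The same list of timestamps could be compared against the time of a hang to see whether a rejection tends to precede it, as in the 500 second gap reported in comment 129.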
Comment 137 Aaron B 2014-09-30 04:47:00 UTC
I'm gonna try with more DPM to make sure the combo of DPM and DMA isn't what is causing it. I'm getting many more consistent crashes with Chrome. Also, it seems my bug report here is a dup of this one that was just posted to, from a month before mine:

http://lists.freedesktop.org/archives/dri-devel/2014-September/069303.html

Same everything, so yeah.
Comment 138 Alexandre Demers 2014-09-30 05:45:25 UTC
(In reply to comment #135)
> (In reply to comment #134)
> > I'd like to test disabling UVD. I'll dig in the kernel to report UVD as
> > disabled for SI for this test. This should narrow where we have to look for
> > this bug.
> 
> Sometimes UVD functions are needed for power management to work correctly,
> so not enabling UVD at all might not be an option either. But I'm not sure
> if that applies to your hardware or not.
> 
> Easiest way to disable UVD otherwise is to just add a "return -1" to the
> beginning of radeon_uvd_init in the kernel module.

Well, when disabling UVD as suggested, I had no hang when viewing movies all night long. No ring 5 error either.

I did get some Packet0 errors though, which were "stuttering" the display for short times, but it never crashed.

However, at one point, Xorg just froze (the sound was still playing and the mouse was moving but nothing else was responding), but I think it is a different bug (there is nothing in dmesg, but there are errors in Xorg.0.log). I'll attach the file in a second. If someone is sure this is a different bug, let me know so I'll open a new bug for it.

Overall, I'd be tempted to point at UVD for the current bug.
Comment 139 Alexandre Demers 2014-09-30 05:46:23 UTC
Created attachment 107099 [details]
Xorg log when it froze
Comment 140 Aaron B 2014-09-30 05:55:19 UTC
I dunno, I've had all those bugs, and when I'm stable, none happen to my knowledge. I think they're all the same thing, with different ways of triggering. Or, maybe we do have multiple bugs.

Chromium/Chrome/VLC/Video/Screen blanks: UVD

What bugs should be classified as a different trigger? We should probably ping what is what now, that we have lots more people with the same problems posting. :)

Also, started watching a movie, crashed in 5 minutes, UVD died. Turned off PM, and UVD is fine, an hour or so into the movie, and I'm going to watch 2 more after it, so we'll see. If it's stable after 2 more hours, DPM is definitely causing something, maybe both bugs with UVD and GPU faulting/EQ overflow/MSC Numbers unequal.
Comment 141 Michel Dänzer 2014-09-30 07:03:53 UTC
If someone has a way to reliably reproduce the 'Packet0 not allowed!' error, please file a separate report for that. Or, if you know at least which process triggers it, you can run that process with the environment variable RADEON_DUMP_CS=1, and it should print a dump of the failed command stream(s) on stderr.
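
One way to capture such a dump without flooding the terminal is to launch the process from a small wrapper that sets the variable and redirects stderr to a file. This is only a minimal Python sketch: the function name `run_with_cs_dump` and the log file name are made up for illustration, and the command to run is whatever process is suspected.

```python
import os
import subprocess


def run_with_cs_dump(argv, log_path):
    """Run argv with RADEON_DUMP_CS=1 set and write its stderr to log_path."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, RADEON_DUMP_CS="1")
    with open(log_path, "wb") as log:
        completed = subprocess.run(argv, env=env, stderr=log)
    return completed.returncode
```

For example, `run_with_cs_dump(["firefox"], "cs_dump.log")` would leave anything the process printed on stderr, including any dumped command streams, in cs_dump.log for attaching to a report.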

P.S. I'm afraid this report has turned into a train wreck, it's impossible to keep track of who encountered what issue(s) under what circumstances.
Comment 142 Alexandre Demers 2014-09-30 13:06:05 UTC
(In reply to comment #141)
> If someone has a way to reliably reproduce the 'Packet0 not allowed!' error,
> please file a separate report for that. Or, if you know at least which
> process triggers it, you can run that process with the environment variable
> RADEON_DUMP_CS=1, and it should print a dump of the failed command stream(s)
> on stderr.
> 
> P.S. I'm afraid this report has turned into a train wreck, it's impossible
> to keep track of who encountered what issue(s) under what circumstances.

I opened Bug 84500 just for Packet0.
Comment 143 Aaron B 2014-10-04 17:36:48 UTC
So, what to do with this bug report? Keep it open, and when you guys think it may be fixed with a commit, just ask here? I just re-built Mesa with today's commits, and it's just back to crashing like crazy. :)
Comment 144 Aaron B 2014-10-19 01:35:08 UTC
Is it possible at all that the problem is related to using DDR3-1866 RAM? I mean I'm trying to think of why you guys at AMD can't encounter it, while I can't stay away from it. That is basically the only piece of hardware that would be "overclocked" by any means. But I don't know. You guys still haven't encountered these problems?
Comment 145 Aaron B 2014-10-23 01:55:41 UTC
I'm just going to end my comments here, if someone wants to know more like asking me to test or testing a patch related, let me know.

With radeon.dpm=0, this crash is "fixed" and never happens at all; for the past 4 days it has been running perfectly. About 20 hours of uptime total in a row, which is unheard of for me.
Comment 146 Aaron B 2014-10-23 02:21:18 UTC
Go figure: after saying that, I maximized a youtube video and it died while on radeon.dpm=0. So just...guess this can just die. -_-
Comment 147 Aaron B 2014-10-31 05:44:18 UTC
Marking duplicate to start over.

*** This bug has been marked as a duplicate of bug 85647 ***
Comment 148 Gedalya 2014-11-05 15:10:43 UTC
See https://code.google.com/p/chromium/issues/detail?id=404357

In my case it was definitely crashing with mesa 10.2. In all cases it seemed to be triggered by chromium. Bug 85647 is talking about something that started with mesa 10.3.

