Bug 79980

Summary: Random radeonsi crashes
Product: DRI Reporter: darkbasic <darkbasic>
Component: DRM/Radeon Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: high CC: albandil, alexander, alexandre.f.demers, alexey, ansla80, djdunn.safety, farmboy0+freedesktop, frederik.vogelsang, freedesktop_LIFP, gaknar, gedalya, jacobsvenningsen15, julien.isorce, kozzi11, maaniv, mabo, marti, mmstickman, ooblick+freedesktop, samtygier
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments (Description | Flags):
dmesg | none
gray screen | none
Possible fix | none
Possible fix v2. | none
Kernel errors with Possible fix v2 | none
Fixups for Christian's patch | none
Fixups for Christian's patch v2 | none
Possible fix v3. | none
Crash on V3 | none
Crash on V3. | none
dmesg_3.16-rc7 | none
crash photo | none
crash during watching of a flash-video on youtube | none
dmesg before the crash occured (clean boot) | none
Xorg.0.log with some (hopefully useful) information about the crashes | none
Xorg.0.log from the same session (several lockups and partially successful recovery attempts, finally no X startup possible anymore) | none
this is the Xorg.0.log after X finally crashed after all the recoveries and xdm/slim/startx couldn't be successfully launched up anymore (it stayed in VT) | none
whole dmesg output, at the end ("still active bo inside vm", "couldn't schedule ib") X can't be launched up anymore - it simply stays in VT | none
DVD watching (150 minutes) on fluxbox with only konsole & smplayer with vdpau running | none
output of Xorg.0.log during uvd-test with DVD watching in fluxbox via smplayer | none
Oops on fbcon load drm-next-3.17-rebased-on-fixex | none
dmesg output drm-next-3.17 | none
Xorg.0.log after X freeze | none
double-hang after "failed to get a new IB (-35)" | none
GPU lockup followed by "GPU fault detected: 147" | none
Last 300 lines of dmesg on a Radeon 6970 | none

Description darkbasic 2014-06-13 13:06:44 UTC
Created attachment 100978 [details]
dmesg

Kernel 3.15.0-rc8 + PTE patches
Comment 1 Alex Deucher 2014-06-13 13:47:11 UTC
What specific app were you using that caused the GPU hang?  Also, if this is a regression, can you bisect?
Comment 2 darkbasic 2014-06-13 13:50:44 UTC
No specific app (not counting KDE desktop effects). If the problem is the kernel it's a regression because I didn't have any problem with -rc5. Unfortunately it's not easy to trigger the crash so there is no chance to bisect given how busy I actually am.
Comment 3 Alex Deucher 2014-06-13 13:52:31 UTC
Does it still happen if you drop the PTE patches?
Comment 4 darkbasic 2014-06-13 14:11:32 UTC
Didn't try, but PTE worked flawlessly on -rc5.
Comment 5 Andy Furniss 2014-06-13 14:15:35 UTC
(In reply to comment #3)
> Does it still happen if you drop the PTE patches?

Is that stop poisoning the GART TLB?

Whatever - it could be a separate issue, but I am now getting sort of random crashes on your drm-next-3.16 with my pitcairn.

I am stable on deathsimple 3.15 fixes + hdmi patches.
Comment 6 Andy Furniss 2014-06-13 14:45:29 UTC
(In reply to comment #5)
> (In reply to comment #3)
> > Does it still happen if you drop the PTE patches?
> 
> Is that stop poisoning the GART TLB?

Ok ignore that :-) I didn't spot the rs600
Comment 7 Alex Deucher 2014-06-13 15:35:09 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #3)
> > > Does it still happen if you drop the PTE patches?
> > 
> > Is that stop poisoning the GART TLB?
> 
> Ok ignore that :-) I didn't spot the rs600

It applies to all asics from rs600 forward.
Comment 8 Andy Furniss 2014-06-13 15:47:48 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > (In reply to comment #3)
> > > > Does it still happen if you drop the PTE patches?
> > > 
> > > Is that stop poisoning the GART TLB?
> > 
> > Ok ignore that :-) I didn't spot the rs600
> 
> It applies to all asics from rs600 forward.

Ahh, in the meantime I've now built with 

optimize SI VM handling + use lower_32_bits where appropriate reverted - the latter just so I could revert the former.

I'll see if I am stable over the next couple of days like this.
Comment 9 Andy Furniss 2014-06-15 22:05:30 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > (In reply to comment #5)
> > > > (In reply to comment #3)
> > > > > Does it still happen if you drop the PTE patches?
> > > > 
> > > > Is that stop poisoning the GART TLB?
> > > 
> > > Ok ignore that :-) I didn't spot the rs600
> > 
> > It applies to all asics from rs600 forward.
> 
> Ahh, in the meantime I've now built with 
> 
> optimize SI VM handling + use lower_32_bits where appropriate reverted - the
> latter just so I could revert the former.
> 
> I'll see if I am stable over the next couple of days like this.

I am stable so far with the above reverted.
Comment 10 Andy Furniss 2014-06-16 13:46:44 UTC
(In reply to comment #9)
> (In reply to comment #8)

> > optimize SI VM handling + use lower_32_bits where appropriate reverted - the
> > latter just so I could revert the former.
> > 
> > I'll see if I am stable over the next couple of days like this.
> 
> I am stable so far with the above reverted.

Spoke too soon, I just locked. Wasn't quite the same as before in that the screen stayed on displaying normally rather than off/on + junk.

Wasn't doing anything GPU related (accepting I always am with glamor), was doing a big compile, so memory pressure I guess.

Also just add to the mix, after thinking I was stable yesterday I upgraded gcc and updated llvm and mesa so they were different in several ways, though I haven't rebuilt kernel.
Comment 11 darkbasic 2014-06-17 12:21:10 UTC
Created attachment 101226 [details]
gray screen

This is what I often get, I was simply syncing my portage tree while it happened.
Comment 12 agapito 2014-06-17 13:00:28 UTC
First of all: excuse my bad English.

I have the same problem with my HD 7950: when using Hangouts, playing Left 4 Dead 2, or watching a Flash video, my screen goes crazy with vertical lines or grey fog. It started when I upgraded to the testing repo (Archlinux) and downloaded the newest linux-firmware package, which includes TAHITI_mc2.bin. I hit this bug on kernels 3.14 and 3.15. For now I am using the 3.15.1 kernel with the old Tahiti firmware, and it seems stable.
Comment 13 darkbasic 2014-06-17 13:06:24 UTC
> Wasn't doing anything GPU related (accepting I always am with glamor), was
> doing a big compile, so memory pressure I guess.

You're right, i was compiling too when it crashed. Nothing GPU related anyway.
Comment 14 Andy Furniss 2014-06-17 13:20:24 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #8)
> 
> > > optimize SI VM handling + use lower_32_bits where appropriate reverted - the
> > > latter just so I could revert the former.
> > > 
> > > I'll see if I am stable over the next couple of days like this.
> > 
> > I am stable so far with the above reverted.
> 
> Spoke too soon, I just locked. Wasn't quite the same as before in that
> screen stayed on displaying normal rather than off/on + junk.
> 
> Wasn't doing anything GPU related (accepting I always am with glamor), was
> doing a big compile, so memory pressure I guess.
> 
> Also just add to the mix, after thinking I was stable yesterday I upgraded
> gcc and updated llvm and mesa so they were different in several ways, though
> I haven't rebuilt kernel.

I got another lock last thing, this one was "typical" happened when closing seamonkey, this is the third time closing it has locked. Of course it doesn't do it if I try. I must be using gl someway/sometimes, as the last thing I see is the xterm from where it was started and there is a mesa message about default setting for s3tc being overridden by env (and that's not by me - I don't have drirc anywhere).

I think this is going to be a pain to find - I just tried reset --hard onto 

 add large PTE support for NI, SI and CIK v5

that failed to resume from mem on the first try, though it wasn't locked, just corrupt (mouse cursor a large block of junk, fluxbox desktop black, but toolbar still visible), so maybe a different issue fixed by a later commit. I could SysRq - the log was normal.
Comment 15 agapito 2014-06-18 16:32:11 UTC
This bug is caused by TAHITI_mc2.bin firmware. The old firmware works good.
Comment 16 Andy Furniss 2014-06-18 19:23:08 UTC
(In reply to comment #15)
> This bug is caused by TAHITI_mc2.bin firmware. The old firmware works good.

Well I haven't tried without it, but I have so far failed to reproduce this bug on a slightly older 3.15 drm fixes also using TAHITI_mc2.bin.
Comment 17 Alex Deucher 2014-06-18 19:29:05 UTC
(In reply to comment #15)
> This bug is caused by TAHITI_mc2.bin firmware. The old firmware works good.

Did you test a new kernel with the old firmware or an old kernel without the new firmware patch?  It could be some other change if you did the latter.
Comment 18 Alex Deucher 2014-06-18 21:22:12 UTC
If it's the same problem Marek is seeing it's probably this:
6d2f294 - drm/radeon: use normal BOs for the page tables v4
Comment 19 agapito 2014-06-19 06:46:32 UTC
(In reply to comment #17)
> (In reply to comment #15)
> > This bug is caused by TAHITI_mc2.bin firmware. The old firmware works good.
> 
> Did you test a new kernel with the old firmware or an old kernel without the
> new firmware patch?  It could be some other change if you did the latter.

3.14 or 3.15 + New firmware = Crashes

3.14 or 3.15 + Old firmware = No problems!
Comment 20 agapito 2014-06-21 13:29:29 UTC
OK forget it. It's not a firmware related problem. I had this bug with old firmware on kernel 3.15.1. I resized a flash video window (vdpau accelerated) and lost my screen.
Comment 21 agapito 2014-06-23 10:42:07 UTC
It happened again. In this case with 3.16-rc2, resizing a Firefox window with Flash content (vdpau on).
Comment 22 darkbasic 2014-06-23 11:27:38 UTC
It happened on 3.16-rc1 too while doing a video call with skype.
Comment 23 agapito 2014-06-25 12:18:58 UTC
Kernel 3.10.44 is also affected! I am using my Intel graphics card for now. I hit this bug every 15 minutes while watching Flash content.

My graphic card is HD 7950 using HDMI output.
Comment 24 agapito 2014-07-07 15:13:29 UTC
This bug is still present in 3.16 rc4, and 3.15.4.
Comment 25 Aaron B 2014-07-09 15:42:31 UTC
*** Bug 80141 has been marked as a duplicate of this bug. ***
Comment 26 Aaron B 2014-07-09 15:47:06 UTC
(In reply to comment #24)
> This bug is still present in 3.16 rc4, and 3.15.4.

This sounds exactly like the bug I talk about in Bug #80141. I'll mark my bug as duplicate of it.

Could Mesa commit c8011c1885003b79c9f0c6530e46ae6cb0e69575 have anything to do with what made 370184e813b25b463ad3dc9ca814231c98b95864 need to happen? Do you think that could be re-enabled for our GPUs now or not?

Also, would the geometry shaders have any effect on our GPUs, as Mesa just patched a couple of leaks in those?

These 2 fixes look like good ones for this problem, as this problem was very random and sporadic, and that is the definition of a good, small leak.
Comment 27 darkbasic 2014-07-09 15:49:24 UTC
This bug is so annoying that I switched to Catalyst :-(
Comment 28 Aaron B 2014-07-10 07:29:40 UTC
*** Bug 80141 has been marked as a duplicate of this bug. ***
Comment 29 agapito 2014-07-14 15:24:16 UTC
I can reproduce this bug using the mesa-git repo from Archlinux under kernel-lts 3.14.12. The Unigine Valley engine ALWAYS crashes my display when the 3D scene starts. If I use normal mesa (10.2) I can run Unigine Valley OK! But the bug is always present. As I said in my previous posts, watching Flash content increases the chances that the bug appears.
Comment 30 Marek Olšák 2014-07-14 15:53:42 UTC
(In reply to comment #29)
> I can reproduce this bug using the mesa-git repo from Archlinux under
> kernel-lts 3.14.12. The Unigine Valley engine ALWAYS crashes my display when
> the 3D scene starts. If I use normal mesa (10.2) I can run Unigine Valley OK!
> But the bug is always present. As I said in my previous posts, watching Flash
> content increases the chances that the bug appears.

You're talking about a different bug. See:
https://bugs.freedesktop.org/show_bug.cgi?id=79659
Comment 31 Lukas Kahnert 2014-07-14 16:11:25 UTC
On my machine I have no Flash player installed, but watching HTML5 videos (qtwebkit with gstreamer) increases the chance of this bug too.
Unigine Valley always crashes with a black screen and a GPU hang.
Unigine Heaven works but with a white screen (the FPS counter is visible), but I'm not sure if that has anything to do with this bug.
Using Linux 3.16-rc4 and mesa-git.
Comment 32 Christian König 2014-07-14 16:31:00 UTC
Created attachment 102784 [details] [review]
Possible fix

Please try if the attached patch (based on 3.15.5) fixes the stability issues with 3.15 and 3.16.

Thanks in advance,
Christian.
Comment 33 Lukas Kahnert 2014-07-14 16:50:26 UTC
compile error on 3.16-rc5 with this patch

drivers/gpu/drm/radeon/radeon_gem.c: In function 'radeon_gem_object_close':
drivers/gpu/drm/radeon/radeon_gem.c:183:10: error: 'struct radeon_cs_reloc' has no member named 'domain'
  bo_reloc.domain = RADEON_GEM_DOMAIN_VRAM;
          ^
drivers/gpu/drm/radeon/radeon_gem.c:184:10: error: 'struct radeon_cs_reloc' has no member named 'alt_domain'
  bo_reloc.alt_domain = RADEON_GEM_DOMAIN_VRAM;
          ^
scripts/Makefile.build:257: recipe for target 'drivers/gpu/drm/radeon/radeon_gem.o' failed
Comment 34 Christian König 2014-07-15 18:03:30 UTC
Created attachment 102867 [details] [review]
Possible fix v2.

As noted in the comment the last patch was for 3.15.

Here is an updated patch based on alex drm-fixes-3.16-wip branch.
Comment 35 Andy Furniss 2014-07-16 15:12:33 UTC
(In reply to comment #34)
> Created attachment 102867 [details] [review]
> Possible fix v2.
> 
> As noted in the comment the last patch was for 3.15.
> 
> Here is an updated patch based on alex drm-fixes-3.16-wip branch.

Tried this on alex drm-fixes-3.16-wip with my R9 270X and it didn't go well.

When doing nothing I am getting errors (attached), later I was transcoding some vids so I guess memory pressure, and then I managed to lock the screen by trying to use uvd. I could SysRq OK - so different for me from before in that respect.
Comment 36 Andy Furniss 2014-07-16 15:14:13 UTC
Created attachment 102925 [details]
Kernel errors with Possible fix v2
Comment 37 Aaron B 2014-07-16 16:39:01 UTC
(In reply to comment #34)
> Created attachment 102867 [details] [review]
> Possible fix v2.
> 
> As noted in the comment the last patch was for 3.15.
> 
> Here is an updated patch based on alex drm-fixes-3.16-wip branch.

Applied to 3.16-rc5 and after about 12 hours, most of which I've been watching youtube videos, all is well here. Also gamed fine, lots of screen switches and movement. I'll report any crashes/errors I encounter, though. But this is much more stable for me, no problems at all.
Comment 38 Michel Dänzer 2014-07-17 04:10:32 UTC
Created attachment 102960 [details] [review]
Fixups for Christian's patch

This patch on top of Christian's patch has been working very well for me.
Comment 39 Aaron B 2014-07-17 05:37:33 UTC
(In reply to comment #38)
> Created attachment 102960 [details] [review]
> Fixups for Christian's patch
> 
> This patch on top of Christian's patch has been working very well for me.

Both of these patches together on top of a 3.16-rc5 kernel make an unbootable kernel for me. It has a null pointer dereference somewhere very early along the lines of loading and setting up everything.

Jul 17 01:16:17 aaron-desktop kernel: [    4.084761] Switched to clocksource tsc
Jul 17 01:16:17 aaron-desktop kernel: [    5.000793] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
Jul 17 01:16:17 aaron-desktop kernel: [    5.000822] IP: [<ffffffffc055764d>] radeon_vm_bo_set_addr+0x23d/0x440 [radeon]
Jul 17 01:16:17 aaron-desktop kernel: [    5.000879] PGD 41ee4a067 PUD 41e4de067 PMD 0
Jul 17 01:16:17 aaron-desktop kernel: [    5.000897] Oops: 0000 [#1] SMP
Jul 17 01:16:17 aaron-desktop kernel: [    5.000911] Modules linked in: hid_generic usbhid hid uas usb_storage mxm_wmi radeon i2c_algo_bit psmouse ttm drm_kms_helper r8169 drm mii ahci libahci ohci_pci wmi
Jul 17 01:16:17 aaron-desktop kernel: [    5.000979] CPU: 5 PID: 280 Comm: plymouthd Not tainted 3.16.0-rc5-rc99-RadeonSIFixV2 #1
Jul 17 01:16:17 aaron-desktop kernel: [    5.001003] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99FX PRO R2.0, BIOS 2301 01/06/2014
Jul 17 01:16:17 aaron-desktop kernel: [    5.001032] task: ffff88041d905b20 ti: ffff88041f710000 task.ti: ffff88041f710000
Jul 17 01:16:17 aaron-desktop kernel: [    5.001053] RIP: 0010:[<ffffffffc055764d>]  [<ffffffffc055764d>] radeon_vm_bo_set_addr+0x23d/0x440 [radeon]
Jul 17 01:16:17 aaron-desktop kernel: [    5.001093] RSP: 0018:ffff88041f713b38  EFLAGS: 00010203
Jul 17 01:16:17 aaron-desktop kernel: [    5.001109] RAX: ffff88041d9a0000 RBX: 0000000000000002 RCX: ffff88041e719560
Jul 17 01:16:17 aaron-desktop kernel: [    5.001129] RDX: ffff880424834400 RSI: 0000000000000003 RDI: ffff8800367ca438
Jul 17 01:16:17 aaron-desktop kernel: [    5.001150] RBP: ffff88041f713b80 R08: 0000000000000000 R09: ffff88041e718150
Jul 17 01:16:17 aaron-desktop kernel: [    5.001170] R10: 0000000000000000 R11: ffffffffc0503cc5 R12: 0000000000000000
Jul 17 01:16:17 aaron-desktop kernel: [    5.001190] R13: 0000000000000002 R14: 0000000000000001 R15: ffff88041f9494e0
Jul 17 01:16:17 aaron-desktop kernel: [    5.001211] FS:  00007f5419413740(0000) GS:ffff88043ed40000(0000) knlGS:0000000000000000
Jul 17 01:16:17 aaron-desktop kernel: [    5.001234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 17 01:16:17 aaron-desktop kernel: [    5.001251] CR2: 0000000000000078 CR3: 000000041ef62000 CR4: 00000000000407e0
Jul 17 01:16:17 aaron-desktop kernel: [    5.001271] Stack:
Jul 17 01:16:17 aaron-desktop kernel: [    5.001278]  ffff88041f713b50 ffff88041e718000 ffff8800367ca438 ffff880424834400
Jul 17 01:16:17 aaron-desktop kernel: [    5.001304]  0000000000000000 ffff88041e718000 ffff88041f998c00 ffff8800367ca400
Jul 17 01:16:17 aaron-desktop kernel: [    5.001338]  ffff88041f95c800 ffff88041f713bc8 ffffffffc048b693 ffff880424f30800
Jul 17 01:16:17 aaron-desktop kernel: [    5.001363] Call Trace:
Jul 17 01:16:17 aaron-desktop kernel: [    5.001379]  [<ffffffffc048b693>] radeon_driver_open_kms+0x133/0x230 [radeon]
Jul 17 01:16:17 aaron-desktop kernel: [    5.001408]  [<ffffffffc03c8367>] drm_open+0x1b7/0x4d0 [drm]
Jul 17 01:16:17 aaron-desktop kernel: [    5.001428]  [<ffffffffc03c8725>] drm_stub_open+0xa5/0x100 [drm]
Jul 17 01:16:17 aaron-desktop kernel: [    5.001448]  [<ffffffff811d416f>] chrdev_open+0x9f/0x1d0
Jul 17 01:16:17 aaron-desktop kernel: [    5.001465]  [<ffffffff811cceff>] do_dentry_open+0x1ff/0x350
Jul 17 01:16:17 aaron-desktop kernel: [    5.001482]  [<ffffffff811da7f2>] ? __inode_permission+0x52/0xc0
Jul 17 01:16:17 aaron-desktop kernel: [    5.001500]  [<ffffffff811d40d0>] ? cdev_put+0x30/0x30
Jul 17 01:16:17 aaron-desktop kernel: [    5.001516]  [<ffffffff811cd221>] finish_open+0x31/0x40
Jul 17 01:16:17 aaron-desktop kernel: [    5.001532]  [<ffffffff811de99a>] do_last+0xa7a/0x1210
Jul 17 01:16:17 aaron-desktop kernel: [    5.001548]  [<ffffffff811dad21>] ? link_path_walk+0x71/0x870
Jul 17 01:16:17 aaron-desktop kernel: [    5.001566]  [<ffffffff811b3d56>] ? kmem_cache_alloc_trace+0x1c6/0x1f0
Jul 17 01:16:17 aaron-desktop kernel: [    5.001586]  [<ffffffff81342383>] ? apparmor_file_alloc_security+0x23/0x40
Jul 17 01:16:17 aaron-desktop kernel: [    5.001606]  [<ffffffff811df1eb>] path_openat+0xbb/0x670
Jul 17 01:16:17 aaron-desktop kernel: [    5.001622]  [<ffffffff811da789>] ? putname+0x29/0x40
Jul 17 01:16:17 aaron-desktop kernel: [    5.001637]  [<ffffffff811dfebf>] ? user_path_at_empty+0x5f/0x90
Jul 17 01:16:17 aaron-desktop kernel: [    5.001655]  [<ffffffff811dffaa>] do_filp_open+0x3a/0x90
Jul 17 01:16:17 aaron-desktop kernel: [    5.001672]  [<ffffffff811ecb17>] ? __alloc_fd+0xa7/0x130
Jul 17 01:16:17 aaron-desktop kernel: [    5.001688]  [<ffffffff811ceaa8>] do_sys_open+0x128/0x220
Jul 17 01:16:17 aaron-desktop kernel: [    5.001705]  [<ffffffff81021b15>] ? syscall_trace_enter+0x145/0x250
Jul 17 01:16:17 aaron-desktop kernel: [    5.001724]  [<ffffffff811cebbe>] SyS_open+0x1e/0x20
Jul 17 01:16:17 aaron-desktop kernel: [    5.001739]  [<ffffffff817725ff>] tracesys+0xe1/0xe6
Jul 17 01:16:17 aaron-desktop kernel: [    5.001754] Code: f4 ff 4c 89 ef 41 89 dd e8 a1 91 21 c1 4d 39 ee 0f 83 4f ff ff ff 0f 1f 84 00 00 00 00 00 48 8b 7d c8 e8 47 8f 21 c1 4d 8b 67 48 <41> 8b 44 24 78 4d 8d 6c 24 48 85 c0 0f 84 8b 01 00 00 49 8b bc
Jul 17 01:16:17 aaron-desktop kernel: [    5.001891] RIP  [<ffffffffc055764d>] radeon_vm_bo_set_addr+0x23d/0x440 [radeon]
Jul 17 01:16:17 aaron-desktop kernel: [    5.001924]  RSP <ffff88041f713b38>
Jul 17 01:16:17 aaron-desktop kernel: [    5.001934] CR2: 0000000000000078
Jul 17 01:16:17 aaron-desktop kernel: [    5.001945] ---[ end trace e59240e65015cb90 ]---
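
For anyone triaging a trace like the one above, here is a minimal Python sketch (not part of the original report) that pulls the faulting function and the symbolized call-trace frames out of an oops. It assumes the `[<address>] symbol+0xoffset/0xsize [module]` frame layout seen in this log; the names `FRAME_RE`, `parse_frames` and `faulting_function` are hypothetical helpers, and the sample text is abbreviated from the trace above.

```python
import re

# Frame layout assumed from the trace above, e.g.
#   [<ffffffffc055764d>] radeon_vm_bo_set_addr+0x23d/0x440 [radeon]
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<uncertain>\? )?"
    r"(?P<func>[A-Za-z0-9_.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_frames(log_text):
    """Return one dict per symbolized frame found in the log text."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "func": m.group("func"),
                "module": m.group("module"),  # None for core-kernel symbols
                "offset": int(m.group("off"), 16),
                "uncertain": m.group("uncertain") is not None,
            })
    return frames


def faulting_function(log_text):
    """Return (function, module) from the first 'IP:' line, or None."""
    for line in log_text.splitlines():
        if " IP: " in line:
            m = FRAME_RE.search(line)
            if m:
                return m.group("func"), m.group("module")
    return None


# Abbreviated sample copied from the oops above.
SAMPLE = """\
[    5.000822] IP: [<ffffffffc055764d>] radeon_vm_bo_set_addr+0x23d/0x440 [radeon]
[    5.001379]  [<ffffffffc048b693>] radeon_driver_open_kms+0x133/0x230 [radeon]
[    5.001408]  [<ffffffffc03c8367>] drm_open+0x1b7/0x4d0 [drm]
[    5.001482]  [<ffffffff811da7f2>] ? __inode_permission+0x52/0xc0
"""

if __name__ == "__main__":
    print(faulting_function(SAMPLE))
    for frame in parse_frames(SAMPLE):
        print(frame)
```

Filtering the result on `module == "radeon"` gives a quick view of which driver entry points were on the stack when the NULL dereference happened.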
Comment 40 Michel Dänzer 2014-07-17 08:01:02 UTC
Created attachment 102966 [details] [review]
Fixups for Christian's patch v2

v2: Fix use-after-free and unprotected list manipulations
Comment 41 Aaron B 2014-07-17 16:29:46 UTC
This is with only the 3.16-rc5 patch without fix-ups, which was working okay. But when I clicked on the top-right of facebook to open up an event, it went out just like old times. But if you see from the time, it had a good run this time for sure. Youtube/Video players in general never crashed it once. I have the fixed kernel building now so soon I'll jump on the fixed one, it looks like code related to this has changed (Error message output a little different.) so I'll try it out.

http://pastebin.com/zntHnrxu
Comment 42 Christian König 2014-07-17 16:38:45 UTC
Created attachment 102992 [details] [review]
Possible fix v3.

Updated and largely simplified patch.

I'm running the third piglit test with it now and so far the system seems to be stable.
Comment 43 Aaron B 2014-07-17 18:30:01 UTC
Built, testing. Played youtube videos, chrome, multiple tabs, all while playing Portal 2, and not a single hiccup on the output apart from the occasional VBlank update problems I see you guys working on for 3.17. I did get a crash on the old patch, as said, but we'll give it more time and I'll post any negative results. For now, this is much more stable than before, though.
Comment 44 Aaron B 2014-07-17 18:56:55 UTC
(In reply to comment #42)
> Created attachment 102992 [details] [review]
> Possible fix v3.
> 
> Updated and largely simplified patch.
> 
> I'm running the third piglit test with it now and so far the system seems to
> be stable.

Just had a crash happen, was opening a Yahoo page. Very normal to crash on it TBH from the old version too, but it shows that this patch may only delay the problem, not be an actual fix. I don't really know what to say about it, same old same old. :/

http://pastebin.com/VXAb5k17
Comment 45 Aaron B 2014-07-17 19:02:29 UTC
Also seems, by looking at my xorg log, many problems are happening along the way.

http://pastebin.com/q3b8fEid
Comment 46 Lukas Kahnert 2014-07-17 20:39:01 UTC
I tried to run piglit with all tests and every time (I tried 3 times) I got a black screen and the system hung. I don't know how Piglit is used, so I can't say on which test the GPU hangs.
It looks like the same bug which also appears randomly when watching videos/flash.
I used the latest patch (v3).
Comment 47 Aaron B 2014-07-18 03:54:50 UTC
Created attachment 103013 [details]
Crash on V3

Just posting another crash here. This one was caused, possibly by youtube as it was on in the background, but clicking on a facebook chat, it just lost it. So, there it is.
Comment 48 Aaron B 2014-07-18 04:07:04 UTC
Created attachment 103014 [details]
Crash on V3.

Pulled the wrong part out of the log, this is the correct crash. Times happened to be identical, forgot it was in MT.
Comment 49 agapito 2014-07-18 09:59:08 UTC
I am now using a vanilla 3.16-rc5 kernel, xserver 1.16, llvm 3.4.2 and the latest mesa-git code. I had another crash, but this time I didn't lose my screen; I could run "dmesg" and saw a lot of:

radeon 0000:01:00.0: failed to get a new IB (-35)

Then I could REISUB my computer.


--------------------------------------------------------------------------

Now when I am trying to run unigine-valley I don't lose my screen like before, but i had this error:


LLVM ERROR: Cannot select: 0x6b00970: i32 = truncate 0x6aec240 [ORD=19] [ID=146]
  0x6aec240: i128 = srl 0x6afc660, 0x6b00470 [ORD=19] [ID=126]
    0x6afc660: i128,ch = load 0x6a31ff8, 0x6ae57c0, 0x6ae6bd0<LD16[%30](tbaa=!"const")> [ORD=19] [ID=116]
      0x6ae57c0: i64,ch = CopyFromReg 0x6a31ff8, 0x6ae56c0 [ID=108]
        0x6ae56c0: i64 = Register %vreg100 [ID=2]
      0x6ae6bd0: i64 = undef [ID=8]
    0x6b00470: i32 = Constant<96> [ID=103]
In function: main
Comment 50 agapito 2014-07-18 11:27:35 UTC
mesa-git compiled against llvm-svn = unigine-valley working again :)
Comment 51 Andy Furniss 2014-07-21 23:07:51 UTC
(In reply to comment #42)
> Created attachment 102992 [details] [review]
> Possible fix v3.
> 
> Updated and largely simplified patch.
> 
> I'm running the third piglit test with it now and so far the system seems to
> be stable.

Been running (not piglit) for a few days now without crashing.

I see it and a couple more fixes are now in agd5f drm-fixes-3.16-wip, so will try that.
Comment 52 Aaron B 2014-07-22 04:21:11 UTC
I can confirm that the bug still isn't fixed, but it does seem to take much longer to show up now. I can run youtube for a while, but now Chromium seems to crash it more often in general. Been running for a few days, have had at least 4 crashes, all with about the same fail logs as before. As an aside, there are a few extra VM and IB fixes in the 3.16-wip and 3.16 branches, so I'll wait for those before caring about this problem more. :)
Comment 53 Michel Dänzer 2014-07-22 06:00:07 UTC
(In reply to comment #52)
> I can run youtube for a while, but now Chromium seems to crash it more often
> in general. Been running for a few days, have had at least 4 crashes. All with
> about the same fail logs as before.

Your Youtube / Chromium issue is probably separate and should be tracked somewhere else. This report is about a stability regression in 3.15/6-rc kernels, which seems to be addressed by Christian's fixes.
Comment 54 Christian König 2014-07-22 09:02:41 UTC
(In reply to comment #53)
> (In reply to comment #52)
> > I can run youtube for a while, but now Chromium seems to crash it more often
> > in general. Been running for a few days, have had at least 4 crashes. All with
> > about the same fail logs as before.
> 
> Your Youtube / Chromium issue is probably separate and should be tracked
> somewhere else. This report is about a stability regression in 3.15/6-rc
> kernels, which seems to be addressed by Christian's fixes.

Yeah, agree. Your log doesn't show any VM faults at all.

That looks more like a userspace problem triggered by some Chromium operations.
Comment 55 darkbasic 2014-07-22 10:51:04 UTC
Finally, it's time to purge Catalyst once again :)
Comment 56 Aaron B 2014-07-22 14:41:49 UTC
(In reply to comment #54)
> (In reply to comment #53)
> > (In reply to comment #52)
> > > I can run youtube for a while, but now Chromium seems to crash it more often
> > > in general. Been running for a few days, have had at least 4 crashes. All with
> > > about the same fail logs as before.
> > 
> > Your Youtube / Chromium issue is probably separate and should be tracked
> > somewhere else. This report is about a stability regression in 3.15/6-rc
> > kernels, which seems to be addressed by Christian's fixes.
> 
> Yeah, agree. Your log doesn't show any VM faults at all.
> 
> That looks more like a userspace problem triggered by some Chromium
> operations.

Any idea where I should file the bug report? Would it be the Cinnamon back end, or glamor?
Comment 57 Michel Dänzer 2014-07-22 14:45:19 UTC
(In reply to comment #56)
> Any idea where I should file the bug report? Would it be the Cinnamon back
> end, or glamor?

The first candidate is the Mesa radeonsi driver.
Comment 58 agapito 2014-07-28 19:57:57 UTC
3.16 rc7 solved this bug! but i need more testing.
Comment 59 darkbasic 2014-07-30 10:11:41 UTC
Created attachment 103680 [details]
dmesg_3.16-rc7

Far from being fixed with 3.16-rc7
I simply watched a Facebook flash video full screen.
Comment 60 darkbasic 2014-07-30 10:14:51 UTC
Created attachment 103681 [details]
crash photo
Comment 61 jackdachef 2014-07-30 14:06:51 UTC
thanks for all the fixes !

currently using http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.17-rebased-on-fixes and it looks pretty stable so far (radeonsi, R9 270X)

even with 3.14.14 there were some (pretty seldom) but occasional crashes of X:

vanilla 3.16-rc6 was pretty much unusable (X + box hardlocks)

and especially http://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-3.17-wip (*without* the fixes) had dozens of gpu crashes yesterday evening/night which were recovering (probably thanks to radeon.hard_reset=1) without crashing X and hardlocking the system:


ring 0 stalled for more than 10010msec GPU lockup
waiting for ... last fence id ... on ring 0
GPU softreset ...


[drm] UVD initialized successfully. 
[drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35)
radeon 0000:01:00.0: ib ring test failed (-35) 
couldn't schedule ib 
still active bo inside vm
GPU softreset ...


radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
radeon 0000:01:00.0: GPU lockup (waiting for ... last fence id ... on ring 5)
[drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35)
[drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed


radeon 0000:01:00.0: ring 5 stalled for more than 3438516msec
radeon 0000:01:00.0: GPU lockup (waiting for ... last fence id ... on ring 5)
radeon 0000:01:00.0: ffff8807cfa63000 pin failed
[drm:radeon_crtc_page_flip] *ERROR* failed to pin new rbo buffer before flip 




and messages in Xorg.0.log with 
(WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 89175 < target_msc 89176

and

(WW) RADEON(0): flip queue failed: Invalid argument
(WW) RADEON(0): Page flip failed: Invalid argument



currently still running with radeon.hard_reset=1 - but that probably shouldn't be needed anymore


the best trigger for those occasional crashes was using chromium/chrome, watching videos on youtube, switching apps via alt+tab (using a composited desktop with compiz-fusion 0.8.8) and from time to time opening up gmail (which, if I remember correctly, caused the gpu to crash at least twice out of these countless times yesterday)
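
To make lockup reports like the above easier to compare, here is a minimal Python sketch (not part of the original comment) that tallies "ring N stalled" messages per ring in saved dmesg output. The message wording is taken from the excerpts above; `STALL_RE` and `summarize_stalls` are hypothetical names, and the sample lines are copied from this comment.

```python
import re
from collections import Counter

# Wording taken from the excerpts above:
#   "ring 5 stalled for more than 10000msec"
STALL_RE = re.compile(r"ring (?P<ring>\d+) stalled for more than (?P<ms>\d+)msec")


def summarize_stalls(log_text):
    """Return (stall count per ring, longest reported stall in ms per ring)."""
    counts = Counter()
    longest = {}
    for m in STALL_RE.finditer(log_text):
        ring = int(m.group("ring"))
        ms = int(m.group("ms"))
        counts[ring] += 1
        longest[ring] = max(longest.get(ring, 0), ms)
    return counts, longest


# Sample lines copied from the excerpts above.
SAMPLE = """\
ring 0 stalled for more than 10010msec GPU lockup
radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
radeon 0000:01:00.0: ring 5 stalled for more than 3438516msec
"""

if __name__ == "__main__":
    counts, longest = summarize_stalls(SAMPLE)
    for ring in sorted(counts):
        print(f"ring {ring}: {counts[ring]} stall(s), longest {longest[ring]} ms")
```

A split by ring can hint at whether the hangs cluster on the GFX ring (ring 0 in the IB test message above) or on ring 5, which the log above ties to the UVD IB test.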
Comment 62 jackdachef 2014-07-30 14:26:04 UTC
Created attachment 103685 [details]
crash during watching of a flash-video on youtube

video during crash was: https://www.youtube.com/watch?v=9QIlB3ZVJes

before that: https://www.youtube.com/watch?v=fMKe89zcvHU

and before that: 1 video in 1080p, 3 videos in 720p (mainly slowly moving classical music, not sure if the type of the video's content [slow, fast-moved, etc.] makes any difference)


like mentioned in https://bugs.freedesktop.org/show_bug.cgi?id=81612 will try running with zswap & zram disabled,

but the frequent crashes on drm-next-3.17-wip also occurred with zram and zswap disabled
Comment 63 jackdachef 2014-07-30 14:29:34 UTC
Created attachment 103686 [details]
dmesg before the crash occured (clean boot)

hope this is fixed soon-ish & you guys know why this occurs,

otherwise I'll have to swap this card for production out
Comment 64 jackdachef 2014-07-30 14:57:29 UTC
Created attachment 103687 [details]
Xorg.0.log with some (hopefully useful) information about the crashes
Comment 65 jackdachef 2014-07-30 19:18:03 UTC
Created attachment 103695 [details]
Xorg.0.log from the same session (several lockups and partially successful recovery attempts, finally no X startup possible anymore)
Comment 66 jackdachef 2014-07-30 19:22:32 UTC
Created attachment 103696 [details]
this is the Xorg.0.log after X finally crashed after all the recoveries and xdm/slim/startx couldn't be successfully launched up anymore (it stayed in VT)

it's basically the same behavior as with drm-next-3.17-wip, only this time it took longer until X couldn't be started up anymore
Comment 67 Alex Deucher 2014-07-30 19:26:14 UTC
I think the main issue this bug was for (VM page table stability regression) is fixed at this point.  The remaining issues seem to be due to video playback as that seems to be the common trigger in the last few comments.  You might try not using vdpau for video playback to see if that helps to narrow it down.  Someone should probably open a new bug for the video playback stability as this bug is starting to get unwieldy and has become a dumping ground for anything.
Comment 68 jackdachef 2014-07-30 19:28:00 UTC
Created attachment 103697 [details]
whole dmesg output, at the end ("still active bo inside vm", "couldn't schedule ib") X can't be launched up anymore - it simply stays in VT
Comment 69 jackdachef 2014-07-30 19:33:07 UTC
(In reply to comment #67)
> I think the main issue this bug was for (VM page table stability regression)
> is fixed at this point.  The remaining issues seem to be due to video
> playback as that seems to be the common trigger in the last few comments. 
> You might try not using vdpau for video playback to see if that helps to
> narrow it down.  Someone should probably open a new bug for the videoplay
> back stability as this bug is starting to get unwieldy and has become a
> dumping ground for anything.

even with those last lines from #68 ?

I've suspected for some time, that it must be related to videoplayback and uvd - so it should improve with vdpau disabled, right ?

but those last lines got me thinking that it could be something else

in the end (after those recovery attempts)

triggers (where the screen turned black) were

- switching between programs (alt+tab)
- simply entering text in a note (gnote) [at least twice]
- ... others I currently don't remember
Comment 70 jackdachef 2014-07-30 19:39:05 UTC
not sure but the flash version in chromium/google-chrome isn't using vdpau at all, right ?

can't remember having viewed any video in a video-player accelerated with vdpau during those crashes (not even before)

so it's not related to vdpau, at least in my case
Comment 71 Alex Deucher 2014-07-30 19:44:54 UTC
This bug was originally about a stability regression due to some GPUVM changes in 3.15.  If 3.14 is stable for you but newer kernels are not, then it may be related.  Otherwise, it's probably another issue.  See bug 81644 about stability issues with Chromium specifically.
Comment 72 Andy Furniss 2014-07-30 20:43:40 UTC
(In reply to comment #70)
> not sure but the flash version in chromium/google-chrome isn't using vdpau
> at all, right ?

To be sure start chrome with the env VDPAU_TRACE=1 and play something, there will be lots of debugging to see if it does use it.
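
A minimal sketch of that check in Python, assuming the browser binary is called `chromium` and that the VDPAU trace library is installed (both are assumptions; adjust to your setup). `VDPAU_TRACE` is the variable named above; the helper name `run_with_vdpau_trace` is made up for illustration.

```python
import os
import subprocess


def run_with_vdpau_trace(cmd, level="1"):
    """Run cmd with VDPAU_TRACE set and return everything it printed.

    Hypothetical helper: it only sets the environment variable and
    captures stdout and stderr together so the text can be searched
    for trace output afterwards.
    """
    env = dict(os.environ, VDPAU_TRACE=level)
    result = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

Something like `print(run_with_vdpau_trace(["chromium"]))` after playing a video should then show whether any VDPAU calls were logged; the call returns once the browser exits. No trace output at all would suggest the browser is not using vdpau.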

I've tried quite hard to crash and so far failed, but then I am using seamonkey, and it seems I am also a few commits off head now.
Comment 73 darkbasic 2014-07-31 11:25:10 UTC
Alex, I didn't use vdpau for playback. I also got an X freeze without playing back anything: I was just starting Android Studio, PyCharm and Netbeans (I often get freezes when starting these editors).
Comment 74 jackdachef 2014-07-31 15:05:17 UTC
(In reply to comment #71)
> This bug was originally about a stability regression due to some GPUVM
> changes in 3.15.  If 3.14 is stable for you but newer kernels are not, then
> it may be related.  Otherwise, it's probably another issue.  See bug 81644
> about stability issues with Chromium specifically.

unfortunately 3.14 isn't entirely stable either. It mostly is (99%), but very seldom X simply crashes: the gpu doesn't just turn black and recover, so important data that is being worked on can be lost. I couldn't pinpoint the reason yet - the gpu just "reboots"

no additional error messages dmesg or Xorg.0.log as far as I know

can't say anything about kernel versions prior to that since this card is still a few days old


gcc 4.9 also can't be the cause, at least with 3.16-rc* kernels and drm-next-3.17-rebased-on-fixes, since I've already added the patch manually and the kernel is also *NOT* compiled with -Os (optimize for size)


it seems to be more of a general instability going towards 3.15 and 3.16 - but who am I to say, I'm just an enthusiast user :P


having read about stability issues with the new firmware somewhere

how could I test and revert to the old firmware ? would it still work ?

simply removing the PITCAIRN_mc2.bin (e.g. for the R9 270X) and leaving the other PITCAIRN files in /lib/firmware ?

the graphics driver is loading as a module and not compiled into the kernel or included in initramfs

I'd really like to upgrade to 3.16-rc* due to recent changes (especially in connection with Btrfs)

an option to disable UVD entirely would have been nice (when still using 5850 and during the new introduction of DPM there was a patch which offered a module parameter to turn it manually off) - would that be an option to further troubleshoot this issue and to exclude UVD from the list of potential causes ?


Thanks
Comment 75 agapito 2014-07-31 20:56:34 UTC
Well, still not fixed in 3.16 rc7 :(  

I was using steam (wine) and the bug reappeared: grey garbage on my screen and a hard lockup. I had to reboot by pressing the reset button.
Comment 76 jackdachef 2014-08-01 01:31:08 UTC
Created attachment 103776 [details]
DVD watching (150 minutes) on fluxbox with only konsole & smplayer with vdpau running

so obviously vdpau is *NOT* the problem here: the whole movie played fine, the box didn't lock up, and there were no suspicious messages, data or behavior in dmesg
Comment 77 jackdachef 2014-08-01 01:36:51 UTC
Created attachment 103777 [details]
output of Xorg.0.log during uvd-test with DVD watching in fluxbox via smplayer

note the explicitly set "extra" settings:

[   253.915] (**) RADEON(0): Option "EnablePageFlip" "off"
[   253.915] (**) RADEON(0): Option "ColorTiling" "on"
[   253.915] (**) RADEON(0): Option "AccelMethod" "Glamor"
[   253.915] (**) RADEON(0): Option "EXAVSync" "off"
[   253.915] (**) RADEON(0): Option "EXAPixmaps" "on"
[   253.915] (**) RADEON(0): Option "SwapbuffersWait" "on"

I read about commits from 3.14 to 3.16-rc* that mentioned pageflipping changes, but disabling page flipping didn't make a difference




*before* watching the DVD

I did another test

and accidentally watched a short portion of an HTML video on youtube with chromium (I had only intended to read lkml, but there was a reference link that didn't mention it was a video /facepalm), did some browsing, had gnote running (and some more programs which I currently don't remember)

adobe flash was explicitly turned *off* via about:plugins

and X hardlocked while playing music (an mp3, via audacious) & browsing via chromium


so at least in my case more and more signs seem to lead towards bug 81644 (https://bugs.freedesktop.org/show_bug.cgi?id=81644) and that chromium + html video; or chromium and all sorts of video (vdpau doesn't apply here ?!) triggers instability & (hard)locks
Comment 78 jackdachef 2014-08-01 02:02:39 UTC
just saw that there are new updates in

http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.17-wip &

http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.17-rebased-on-fixes



hopefully the changes make a difference, will try out the new drm-next-3.17-rebased-on-fixes in approximately a day or later

thanks guys !
Comment 79 Andy Furniss 2014-08-01 14:16:24 UTC
Created attachment 103818 [details]
Oops on fbcon load drm-next-3.17-rebased-on-fixex
Comment 80 Andy Furniss 2014-08-01 14:17:17 UTC
(In reply to comment #78)
> just saw that there are new updates in
> 
> http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.17-wip &
> 
> http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.17-rebased-on-
> fixes
> 
> 
> 
> hopefully the changes make a difference, will try out the new
> drm-next-3.17-rebased-on-fixes in approximately a day or later
> 
> thanks guys !

Well I just tried drm-next-3.17-rebased-on-fixes and it died as soon as fbcon loaded. Screen full of junk and an oops logged.

I have got the new firmware (unless there's an even newer one) and have booted into the other 3.17 before - but it eventually crashed due to missing fixes.
Comment 81 Michel Dänzer 2014-08-01 14:19:53 UTC
(In reply to comment #80)
> Well I just tried drm-next-3.17-rebased-on-fixes and it died as soon as
> fbcon loaded. Screen full junk and an oops logged.

Please file a new report for that, it's nothing to do with random crashes in X.
Comment 82 Alex Deucher 2014-08-01 14:26:15 UTC
(In reply to comment #80)
> 
> I have got the new firmware (unless there's an even newer one) and have
> booted into the other 3.17 before - but it eventually crashed due to missing
> fixes.

When did you download it?  The header format changed last week and I uploaded a new version.
Comment 83 Andy Furniss 2014-08-01 14:33:14 UTC
(In reply to comment #82)
> (In reply to comment #80)
> > 
> > I have got the new firmware (unless there's an even newer one) and have
> > booted into the other 3.17 before - but it eventually crashed due to missing
> > fixes.
> 
> When did you download it?  The header format changed last week and I
> uploaded a new version.

Ahh, it was longer ago than that, will try newer.
Comment 84 Andy Furniss 2014-08-01 15:08:46 UTC
(In reply to comment #83)
> (In reply to comment #82)
> > (In reply to comment #80)
> > > 
> > > I have got the new firmware (unless there's an even newer one) and have
> > > booted into the other 3.17 before - but it eventually crashed due to missing
> > > fixes.
> > 
> > When did you download it?  The header format changed last week and I
> > uploaded a new version.
> 
> Ahh, it was longer ago that that, will try newer.

Booting OK with new firmware.
Comment 85 jackdachef 2014-08-02 01:48:47 UTC
(In reply to comment #82)
> (In reply to comment #80)
> > 
> > I have got the new firmware (unless there's an even newer one) and have
> > booted into the other 3.17 before - but it eventually crashed due to missing
> > fixes.
> 
> When did you download it?  The header format changed last week and I
> uploaded a new version.

does this apply to PITCAIRN gpus, too ?

I only see firmware up to April (PITCAIRN_mc2.bin)

kindly link to the new firmware files so that I can update to it, too



update:

posted new information concerning this case (or the issue with chromium) in #81644 (https://bugs.freedesktop.org/show_bug.cgi?id=81644)
Comment 86 Andy Furniss 2014-08-02 11:34:39 UTC
(In reply to comment #85)
> (In reply to comment #82)
> > (In reply to comment #80)
> > > 
> > > I have got the new firmware (unless there's an even newer one) and have
> > > booted into the other 3.17 before - but it eventually crashed due to missing
> > > fixes.
> > 
> > When did you download it?  The header format changed last week and I
> > uploaded a new version.
> 
> does this apply to PITCAIRN gpus, too ?
> 
> I only see firmware up to April (PITCAIRN_mc2.bin)
> 
> kindly link to the new firmware files so that I can update to it, too

http://people.freedesktop.org/~agd5f/radeon_ucode/ucode.tar.gz

The new firmwares have lowercase names so you can keep the old ones in place for kernels < 3.17.
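
To check which set of images is installed, a small Python sketch along these lines may help. It assumes the files live in a directory such as `/lib/firmware/radeon` and that the legacy images start with the upper-case chip name (like `PITCAIRN_mc2.bin`) while the new-format ones start with the lower-case name; the function name is made up for illustration.

```python
import os


def split_firmware_by_case(firmware_dir, chip):
    """Return (legacy, new) firmware file names for one chip.

    Assumption: legacy images are prefixed with the upper-case chip
    name and new-format images with the lower-case chip name.
    """
    legacy, new = [], []
    for name in sorted(os.listdir(firmware_dir)):
        if name.startswith(chip.upper() + "_"):
            legacy.append(name)
        elif name.startswith(chip.lower() + "_"):
            new.append(name)
    return legacy, new
```

Calling `split_firmware_by_case("/lib/firmware/radeon", "pitcairn")` would then list both sets side by side; an empty second list would mean the new-format files have not been installed yet.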
Comment 87 jackdachef 2014-08-02 21:07:21 UTC
(In reply to comment #86)
> (In reply to comment #85)
> > (In reply to comment #82)
> > 
> > does this apply to PITCAIRN gpus, too ?
> > 
> > I only see firmware up to April (PITCAIRN_mc2.bin)
> > 
> > kindly link to the new firmware files so that I can update to it, too
> 
> http://people.freedesktop.org/~agd5f/radeon_ucode/ucode.tar.gz
> 
> The new firmwares have lowercase names so you can keep the old ones in place
> for kernels < 3.17.


thanks a lot

running now with latest firmware, unfortunately

the trigger still seems to be chromium & the whole box hardlocks ...
Comment 88 jackdachef 2014-08-03 16:06:55 UTC
just crashed yesterday with chromium only being started up, 

and browsing some random wallpapers with firefox 31 (not even fully opened, only previews)


both firefox & chromium had hardware acceleration/webgl disabled


Desktop: Xfce4 + compiz-fusion (opengl composited desktop)

unfortunately no dmesg info
Comment 89 agapito 2014-08-05 12:43:37 UTC
Still present in the final 3.16 kernel. This bug is really crazy. I get a lot of hard lockups playing Age of Empires HD using Windows Steam. I can't provide any log or debug info because my machine dies completely.

My hardware is: Gigabyte HD 7950, using HDMI output on Archlinux + KDE + Mesa 10.2.5 + xserver 1.16.
Comment 90 jackdachef 2014-08-05 13:00:24 UTC
(In reply to comment #89)
> Still present in the final 3.16 kernel. This bug is really crazy. I have a
> lot of hard lockups playing Age of Empires HD using windows steam. I can't
> provide any log or debug info because my machine die completely.
> 
> My hardware is: Gigabyte HD 7950, using HDMI output on Archlinux + KDE +
> Mesa 10.2.5 + xserver 1.16.

make sure you try out the very latest & greatest

http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.17-rebased-on-fixes

with userptr support

these changes are supposed to be performance improvements (only ?) but they also seem to raise stability for me ( bug #81612 ):

information on this: http://lists.freedesktop.org/archives/intel-gfx/2014-February/040513.html


I'll go ahead and test other use cases, to see whether those are also stable now
Comment 91 agapito 2014-08-06 10:45:32 UTC
Created attachment 104145 [details]
dmesg output drm-next-3.17

Using drm-next-3.17 my xorg crashes. I don't know if it is the same bug, but here is my dmesg output (sorry about the quality)
Comment 92 Chernovsky Oleg 2014-08-16 12:24:52 UTC
I've successfully reproduced this bug 

I'm on 3.16 kernel, mesa 10.2.5.

I start Qt5 QtCreator (as far as I know it uses OpenGL acceleration for some QML elements), then I start any 3D-demanding app (like one of Valve's game titles).
After that I close the app. When I then close QtCreator, this bug occurs: total system hang, ring 0 freezes and does not wake up on soft reset.

Where should I look to fix it? I assume it hides somewhere in GPUVM? (or is that no longer needed, or is it already fixed?)
Comment 93 farmboy0+freedesktop 2014-08-19 21:24:49 UTC
With 3.17-rc1 my desktop has been mostly stable.

Haven't experienced any more random deadlocks besides one that left this in
Xorg.log:
(EE) Backtrace:
(EE) 0: /usr/bin/X (QueuePointerEvents+0x52) [0x454ad2]
(EE) 1: /usr/lib64/xorg/modules/input/evdev_drv.so (_init+0x2dd4) [0x7f53cf900254]
(EE) 2: /usr/bin/X (DPMSSupported+0xd8) [0x47c738]
(EE) 3: /usr/bin/X (xf86SerialModemClearBits+0x1ca) [0x4a6e6a]
(EE) 4: /lib64/libpthread.so.0 (funlockfile+0x70) [0x7f53daabf73f]
(EE) 5: /lib64/libc.so.6 (ioctl+0x7) [0x7f53d97dc1c7]
(EE) 6: /usr/lib64/libdrm.so.2 (drmIoctl+0x30) [0x7f53da8a5310]
(EE) 7: /usr/lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x7f53da8a799b]
(EE) 8: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x4c3824) [0x7f53d4be7eb5]
(EE) 9: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x4c4573) [0x7f53d4be9810]
(EE) 10: /usr/lib64/dri/radeonsi_dri.so (radeon_drm_winsys_create+0x1b46) [0x7f53d47364f2]
(EE) 11: /usr/lib64/dri/radeonsi_dri.so (radeon_drm_winsys_create+0xc909) [0x7f53d474bbde]
(EE) 12: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x3f3888) [0x7f53d4a47f8d]
(EE) 13: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x1e695b) [0x7f53d462e122]
(EE) 14: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x1e7f7f) [0x7f53d4630be6]
(EE) 15: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x114a46) [0x7f53d448a250]
(EE) 16: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x1164d1) [0x7f53d448d749]
(EE) 17: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x1eb007) [0x7f53d4636764]
(EE) 18: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x116e11) [0x7f53d448e487]
(EE) 19: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_swrast+0x116e6a) [0x7f53d448eb5e]
(EE) 20: /usr/lib64/libglamor.so.0 (glamor_poly_segment_nf+0x2d22) [0x7f53d7f36ea2]
(EE) 21: /usr/lib64/libglamor.so.0 (glamor_poly_segment_nf+0x335b) [0x7f53d7f37abb]
(EE) 22: /usr/lib64/libglamor.so.0 (glamor_add_traps_nf+0x36d) [0x7f53d7f2f22d]
(EE) 23: /usr/bin/X (miFillUniqueSpanGroup+0x1a08) [0x58ce28]
(EE) 24: /usr/bin/X (xf86I2CGetScreenBuses+0x1e3a) [0x4cf17a]
(EE) 25: /usr/bin/X (dixDestroyPixmap+0x19d9) [0x43ae19]
(EE) 26: /usr/bin/X (SendErrorToClient+0x2ff) [0x43c8ff]
(EE) 27: /usr/bin/X (remove_fs_handlers+0x42d) [0x440bcd]
(EE) 28: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x7f53d971add5]
(EE) 29: /usr/bin/X (_start+0x29) [0x42a7f1]
(EE) 30: ? (?+0x29) [0x29]
(EE) 
(EE) [mi] EQ overflow continuing.  1000 events have been dropped.
(EE) [mi] No further overflow reports will be reported until the clog is cleared.

I am using xorg 1.15.1 and glamor from git.

Haven't tried many opengl applications yet besides pale moon (firefox clone) with hardware acceleration and some short wine session.
Comment 94 agapito 2014-08-20 12:47:31 UTC
I had this bug again using 3.17-rc1. Still not fixed.
Comment 95 farmboy0+freedesktop 2014-08-20 21:01:55 UTC
Is there some way to debug this?
Debug messages from the kernel/mesa/whatever to activate to find out what goes on with the radeon driver.
Some /sys files to monitor or something?

Please let us help you fix this bug; for me at least it was always a complete deadlock with nothing in the logs to indicate what went wrong.
Comment 96 Maciej 2014-08-21 00:03:34 UTC
Yes, I would love to help debug this issue. Atm mesa git is completely unusable on radeonsi.
Comment 97 darkbasic 2014-08-21 00:30:31 UTC
I do not even use radeonsi anymore because of this bug; please don't underestimate the effect it has on your userbase.
Comment 98 Christian König 2014-08-21 09:06:30 UTC
*** Bug 82886 has been marked as a duplicate of this bug. ***
Comment 99 Malte Schröder 2014-08-21 15:18:52 UTC
I am not sure if this is related, but what I see _a lot_ recently in my kernel log is this:

[  554.747835] [TTM] Illegal buffer object size
[  554.747837] [TTM] Illegal buffer object size
[  554.747838] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (0, 6, 4096, -22)

This is on X.org 1.16.0, radeon_drv 7.4.99 with glamor enabled, mesa 10.2.5 and drm-next-3.17.
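
The negative numbers in messages like these (-22 here, -35 in the fence wait failures elsewhere in this report) look like negated errno values; that reading is an assumption. A minimal Python sketch for looking them up, with a made-up helper name:

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return value such as -22 to (name, text).

    Assumption: the value is a negated errno. The numbering is
    platform dependent, so run this on the same kind of system
    that produced the log.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)
```

On Linux, `describe_kernel_error(-22)` names `EINVAL`, which would fit the "Illegal buffer object size" lines printed just before it.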
Comment 100 Alex Deucher 2014-08-21 15:21:55 UTC
(In reply to comment #99)
> I am not sure if this is related, but what I see _a lot_ recently in my
> kernel log is this:
> 
> [  554.747835] [TTM] Illegal buffer object size
> [  554.747837] [TTM] Illegal buffer object size
> [  554.747838] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM
> object (0, 6, 4096, -22)

Unrelated.  You are seeing bug 82162 which is already fixed.
Comment 101 Maximilian Böhm 2014-08-22 11:19:19 UTC
Hey, my system is running nearly crash free for three days now with Linux 3.17 RC1. The crashes were related to VLC and VDPAU and something got stuck but my system feels pretty stable now. If you are on Arch Linux compile linux-mainline from AUR and try for yourself (don't forget updating GRUB).
Comment 102 darkbasic 2014-08-22 14:28:02 UTC
3.17-rc1 + drm-fixes seems MUCH more stable than ever. I will let you know if things keep going well.
Comment 103 darkbasic 2014-08-22 14:38:15 UTC
I just saw this in dmesg:
radeon 0000:01:00.0: Packet0 not allowed!
Comment 104 Alex Deucher 2014-08-22 14:41:09 UTC
(In reply to comment #103)
> I just saw this in dmesg:
> radeon 0000:01:00.0: Packet0 not allowed!

Some userspace component is generating an invalid command stream.  Probably a bad packet count somewhere.
Comment 105 Chernovsky Oleg 2014-08-22 16:45:22 UTC
So, is this bug finally fixed? I saw some activity, but was it a real fix or only a workaround?
Comment 106 Alex Deucher 2014-08-22 16:51:55 UTC
(In reply to comment #105)
> So, is this bug finally fixed? I saw some activity, but was it a real fix or
> only a workaround?

I think this bug has become largely useless.  It's become a general dumping ground for any sort of problem with radeonsi.
Comment 107 agapito 2014-08-22 19:04:38 UTC
3.17 rc1 and 3.16.1 are still affected by this bug, but they are more stable than previous kernels.
Comment 108 darkbasic 2014-08-22 20:32:36 UTC
The original bug wasn't fixed in 3.16-rc but it seems fixed in 3.17+drm-fixes. I will let you know after a couple of days, I'm really happy I can use radeonsi once more. With 3.16-rc I had crashes after a few minutes of usage.
Comment 109 Chernovsky Oleg 2014-08-23 09:32:42 UTC
Yep. I tried to repeat previously described crashes (including my own) on 3.17-rc1 with your drm-fixes branch and failed. No gfx ring lockup anymore.
Comment 110 Chernovsky Oleg 2014-08-23 09:34:17 UTC
Alex, I'm still curious, what was the original problem that caused this bug?
Comment 111 Christian König 2014-08-23 09:37:29 UTC
(In reply to comment #110)
> Alex, I'm still curious, what was the original problem that caused this bug?

Well that was the problem: A couple of different things!

We have a long-outstanding issue with TLB poisoning, a couple of bugs related to dynamically allocating page tables and I think one or two userspace issues, all mixed into a single bug report.
Comment 112 Chernovsky Oleg 2014-08-23 10:00:01 UTC
Thanks, Christian, so I assume all those issues were fixed and TLB poisoning has been worked around for now?

I'm asking because I'm currently digging through the source trying to figure out the full picture.
Comment 113 mmstickman 2014-08-23 17:50:31 UTC
I've tried kernel 3.17-rc1 and I still get Xorg crashing with my Radeon HD 7950. After idling in KDE overnight this error happened:

[24726.971466] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
[24726.971476] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000020375d last fence id 0x0000000000203756 on ring 0)
[24727.506525] radeon 0000:01:00.0: Saved 493 dwords of commands on ring 0.
[24727.506572] radeon 0000:01:00.0: GPU softreset: 0x0000006C
[24727.506574] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
[24727.506576] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[24727.506578] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[24727.506580] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000AC0
[24727.506614] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[24727.506616] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[24727.506618] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[24727.506620] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000006
[24727.506622] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80018647
[24727.506624] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483106
[24727.506626] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C84206
[24727.506628] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[24727.506630] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[24727.506634] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c03d6dc0 flags=0x0010]
[24727.506646] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c03d6df0 flags=0x0030]
[24727.506651] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c0000100 flags=0x0030]
[24727.506655] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c03d6c00 flags=0x0010]
[24727.506659] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c03d6c80 flags=0x0010]
[24727.506663] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c03d6c40 flags=0x0010]
[24728.034963] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[24728.035016] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140
[24728.036173] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[24728.036175] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[24728.036177] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[24728.036179] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[24728.036213] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[24728.036215] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[24728.036217] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[24728.036219] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[24728.036221] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[24728.036223] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[24728.036224] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[24728.036322] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[24728.085248] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[24728.085250] [drm] PCIE gen 2 link speeds already enabled
[24728.086416] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[24728.086542] radeon 0000:01:00.0: WB enabled
[24728.086545] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x00000000c0000c00 and cpu addr 0xffff880420943c00
[24728.086547] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x00000000c0000c04 and cpu addr 0xffff880420943c04
[24728.086549] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x00000000c0000c08 and cpu addr 0xffff880420943c08
[24728.086550] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x00000000c0000c0c and cpu addr 0xffff880420943c0c
[24728.086552] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x00000000c0000c10 and cpu addr 0xffff880420943c10
[24728.086936] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90011d35a18
[24728.256690] [drm] ring test on 0 succeeded in 1 usecs
[24728.256694] [drm] ring test on 1 succeeded in 1 usecs
[24728.256698] [drm] ring test on 2 succeeded in 1 usecs
[24728.256758] [drm] ring test on 3 succeeded in 2 usecs
[24728.256765] [drm] ring test on 4 succeeded in 2 usecs
[24728.454199] [drm] ring test on 5 succeeded in 2 usecs
[24728.454204] [drm] UVD initialized successfully.
[24728.454233] radeon 0000:01:00.0: GPU fault detected: 146 0x03ee700c
[24728.454235] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000019F
[24728.454236] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E07000C
[24728.454238] VM fault (0x0c, vmid 7) at page 415, read from CP (112)
[24738.469792] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
[24738.469801] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000203762 last fence id 0x0000000000203756 on ring 0)
[24738.469823] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[24738.469829] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[24738.469833] radeon 0000:01:00.0: ib ring test failed (-35).
[24738.976051] radeon 0000:01:00.0: GPU softreset: 0x00000048
[24738.976054] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
[24738.976056] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[24738.976058] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[24738.976060] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[24738.976094] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[24738.976096] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[24738.976098] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010100
[24738.976100] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000086
[24738.976102] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80018647
[24738.976104] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[24738.976106] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[24738.976108] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[24738.976110] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[24739.475375] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[24739.475428] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[24739.476585] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[24739.476587] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[24739.476588] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[24739.476590] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[24739.476625] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[24739.476627] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[24739.476629] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[24739.476631] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[24739.476632] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[24739.476634] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[24739.476636] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[24739.476720] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[24739.493947] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[24739.493950] [drm] PCIE gen 2 link speeds already enabled
[24739.495109] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[24739.495231] radeon 0000:01:00.0: WB enabled
[24739.495233] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x00000000c0000c00 and cpu addr 0xffff880420943c00
[24739.495235] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x00000000c0000c04 and cpu addr 0xffff880420943c04
[24739.495237] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x00000000c0000c08 and cpu addr 0xffff880420943c08
[24739.495239] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x00000000c0000c0c and cpu addr 0xffff880420943c0c
[24739.495241] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x00000000c0000c10 and cpu addr 0xffff880420943c10
[24739.495655] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90011d35a18
[24739.665510] [drm] ring test on 0 succeeded in 1 usecs
[24739.665515] [drm] ring test on 1 succeeded in 1 usecs
[24739.665519] [drm] ring test on 2 succeeded in 1 usecs
[24739.665578] [drm] ring test on 3 succeeded in 2 usecs
[24739.665586] [drm] ring test on 4 succeeded in 2 usecs
[24739.863021] [drm] ring test on 5 succeeded in 1 usecs
[24739.863027] [drm] UVD initialized successfully.
[24739.863059] [drm] ib test on ring 0 succeeded in 0 usecs
[24739.863077] [drm] ib test on ring 1 succeeded in 0 usecs
[24739.863095] [drm] ib test on ring 2 succeeded in 0 usecs
[24739.863113] [drm] ib test on ring 3 succeeded in 0 usecs
[24739.863156] [drm] ib test on ring 4 succeeded in 0 usecs
[24750.048352] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
[24750.048362] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
[24750.048368] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[24750.048375] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
[24750.048395] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
[25118.213731] radeon 0000:01:00.0: ring 5 stalled for more than 377576msec
[25118.213741] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000003 last fence id 0x0000000000000002 on ring 5)
Comment 114 Chernovsky Oleg 2014-08-23 20:01:58 UTC
(In reply to comment #113)
> I've tried kernel 3.17-rc1 and I still get Xorg crashing with my Radeon HD

Did you try vanilla rc1, or rc1 with Alex's drm-fixes branch applied?
Comment 115 darkbasic 2014-08-23 22:20:12 UTC
Please use drm-fixes because it's what we are talking about.
Comment 116 farmboy0+freedesktop 2014-08-24 10:23:20 UTC
Will the drm-fixes branch be part of 3.17-rc{2,}?
Where is the repo for the branch located?
Comment 117 darkbasic 2014-08-24 11:00:05 UTC
Here is drm-fixes and yes, it will be part of -rc2:
http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-3.17

Unfortunately I was too eager to declare drm-fixes stable, because I just got an X freeze:
[139998.633243] traps: chrome[11324] general protection ip:7f8f8f1790b0 sp:7fffba3d5610 error:0 in .radeonsi_dri.so._portage_merge_.20817 (deleted)[7f8f8ef8c000+5de000]

I was simply running an "emerge --sync" while browsing a couple of simple web pages.

Is there any chance of fixing this? If so, I'm willing to open a new bug for it.
Comment 118 darkbasic 2014-08-25 20:34:51 UTC
[121054.909144] radeon 0000:01:00.0: ring 3 stalled for more than 10000msec
[121054.909150] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000983653 last fence id 0x000000000098364e on ring 3)

I just got another crash with drm-fixes-3.17. Ironically, I was replying to a Phoronix user, telling him that drm-fixes-3.17 is finally stable. LOL
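
When comparing lockups across many pasted logs, it can help to pull the ring number and fence ids out of each "GPU lockup" line. The following is only a minimal sketch: the regular expression is written against the two message variants quoted in this report, and the function name `parse_lockups` is hypothetical, not part of any radeon tooling.

```python
import re

# Matches both variants seen in this report:
#   "GPU lockup (waiting for 0x... last fence id 0x... on ring N)"
#   "GPU lockup (current fence id 0x... last fence id 0x... on ring N)"
LOCKUP_RE = re.compile(
    r"GPU lockup \((?:waiting for|current fence id) (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)

def parse_lockups(dmesg_text):
    """Return (ring, first_id, last_fence_id) for every lockup line found."""
    return [
        (int(ring), int(first, 16), int(last, 16))
        for first, last, ring in LOCKUP_RE.findall(dmesg_text)
    ]

sample = (
    "[121054.909144] radeon 0000:01:00.0: ring 3 stalled for more than 10000msec\n"
    "[121054.909150] radeon 0000:01:00.0: GPU lockup (waiting for "
    "0x0000000000983653 last fence id 0x000000000098364e on ring 3)\n"
)

for ring, first, last in parse_lockups(sample):
    print(f"ring {ring}: first id {first:#x}, last fence id {last:#x}")
```

The two ids are returned unmodified because the two message variants label the first id differently ("waiting for" versus "current fence id"), so the sketch does not try to interpret the gap between them.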
Comment 119 darkbasic 2014-08-26 12:30:07 UTC
I was browsing while compiling and once again it crashed:
[57534.191174] Watchdog[29216]: segfault at 0 ip 00007f402637ae58 sp 00007f40131e1810 error 6 in chrome[7f40228fa000+590d000]
[57544.227442] Watchdog[10690]: segfault at 0 ip 00007f1ffab93e58 sp 00007f1fe79fa810 error 6 in chrome[7f1ff7113000+590d000]

Does it happen with browsers other than Chrome/Chromium? Does anyone use Firefox?
Comment 120 darkbasic 2014-08-26 12:33:07 UTC
Created attachment 105282 [details]
Xorg.0.log after X freeze

I found a backtrace in Xorg.0.log
Comment 121 Andy Furniss 2014-08-27 22:41:33 UTC
(In reply to comment #119)
> I was browsing while compiling and once again it crashed:
> [57534.191174] Watchdog[29216]: segfault at 0 ip 00007f402637ae58 sp
> 00007f40131e1810 error 6 in chrome[7f40228fa000+590d000]
> [57544.227442] Watchdog[10690]: segfault at 0 ip 00007f1ffab93e58 sp
> 00007f1fe79fa810 error 6 in chrome[7f1ff7113000+590d000]
> 
> Does it happen with other browsers than chrome/chromium? Do someone use
> firefox?

I've been stable for some time now using SeaMonkey, but then I use flashblock and don't usually use it for video. That said, I have deliberately tried, and failed, to trigger a crash by playing videos since this bug started.

I think there's another bug for chromium issues.
Comment 122 thomas.lassdiesonnerein 2014-08-31 13:34:35 UTC
(In reply to comment #119)
> I was browsing while compiling and once again it crashed:
> [57534.191174] Watchdog[29216]: segfault at 0 ip 00007f402637ae58 sp
> 00007f40131e1810 error 6 in chrome[7f40228fa000+590d000]
> [57544.227442] Watchdog[10690]: segfault at 0 ip 00007f1ffab93e58 sp
> 00007f1fe79fa810 error 6 in chrome[7f1ff7113000+590d000]
> 
> Does it happen with other browsers than chrome/chromium? Do someone use
> firefox?

Yes, I do. We just spoke on GitHub. I have never had X crashes or lockups.
Comment 123 thomas.lassdiesonnerein 2014-09-01 10:29:50 UTC
> Yes, me. We just spoke at github. Never had x crashes or lockups.

My specs (was in a hurry yesterday):
HD7950
Core2Quad
Kernel 3.16 (openSUSE Factory, x64, KDE+desktopeffects)
Rest from git (mesa, llvm, ati, xserver)

I did not test Chrome, but neither Firefox with its old Flash (11.2) nor Steam games ever gave me X crashes or lockups on radeonsi. Earlier kernel versions (openSUSE Tumbleweed) caused no problems either.

cheers tomtomme
Comment 124 Grigori Goronzy 2014-09-03 05:25:22 UTC
I can very quickly, almost deterministically, hang the GPU (radeonsi, Cape Verde) with the following command:

> LIBGL_ALWAYS_SOFTWARE=1 mpv --fs --vo=opengl:sw /path/to/some_video

This reproduces on both 3.16.0 and 3.17-rc3. Try seeking; it often happens directly after a seek. In most cases the hang is unrecoverable and crashes the kernel after some "atombios stuck in a loop" messages. Very strange indeed: software-rendered glxgears doesn't cause this.

Can anyone verify? A somewhat reliable test case might be a good start to finally fixing this.
Comment 125 Christian König 2014-09-03 09:15:59 UTC
(In reply to comment #124)
> I can very quickly, almost deterministically, hang the GPU (radeonsi, Cape
> Verde) with the following command:
> 
> > LIBGL_ALWAYS_SOFTWARE=1 mpv --fs --vo=opengl:sw /path/to/some_video
> 
> this works on both 3.16.0 and 3.17rc3. Try seeking, it often happens
> directly after a seek. In most cases, the hang is unrecoverable and crashes
> the kernel after some "atombios stuck in a loop" messages. Very strange
> indeed, software rendered glxgears doesn't cause this.
> 
> Can anyone verify? A somewhat reliable test case might be a good start to
> finally fixing this.

Well, this is interesting. So you're saying that using software rendering on the client side can crash the GPU? That leaves glamor and maybe the compositor as the only users of the hardware driver.
Comment 126 darkbasic 2014-09-03 10:44:24 UTC
(In reply to comment #124)
> I can very quickly, almost deterministically, hang the GPU (radeonsi, Cape
> Verde) with the following command:
> 
> > LIBGL_ALWAYS_SOFTWARE=1 mpv --fs --vo=opengl:sw /path/to/some_video

I'm sorry but I can't reproduce it with Tahiti (HD7950).
Comment 127 Andy Furniss 2014-09-03 11:11:41 UTC
(In reply to comment #126)
> (In reply to comment #124)
> > I can very quickly, almost deterministically, hang the GPU (radeonsi, Cape
> > Verde) with the following command:
> > 
> > > LIBGL_ALWAYS_SOFTWARE=1 mpv --fs --vo=opengl:sw /path/to/some_video
> 
> I'm sorry but I can't reproduce it with Tahiti (HD7950).

I also can't reproduce it with Pitcairn (R9 270X).

I'm on agd5f drm-next-3.18-wip, with git mesa, llvm, ddx and glamor. Xorg is a couple of months old.

Tried with mplayer and 2 versions of mpv.

Running fluxbox, so no compositor.

Other differences: I guess mesa/llvmpipe uses a different SSE level for me on an older CPU (Phenom II X4 965).

Screen res? I am testing 1920x1080@60Hz.
Comment 128 Grigori Goronzy 2014-09-04 13:51:07 UTC
You might want to try the patch in
https://bugs.freedesktop.org/show_bug.cgi?id=83500

Maybe some of these issues have a common cause.
Comment 129 AdrianG 2014-09-04 18:23:13 UTC
Radeon 8550G/8670M: it doesn't get past the login screen with 3.17-rc3. At least with rc1 I could get to the desktop, but then it would hang almost immediately. (Distro: Ubuntu 14.04 standard + Gnome 3.2.)

Works like a charm on kernel 3.14*
Comment 130 Marti Raudsepp 2014-09-08 21:31:20 UTC
Created attachment 105926 [details]
double-hang after "failed to get a new IB (-35)"
Comment 131 Marti Raudsepp 2014-09-08 21:32:52 UTC
Created attachment 105927 [details]
GPU lockup followed by "GPU fault detected: 147"
Comment 132 darkbasic 2014-09-12 14:05:12 UTC
This is 100% reproducible: just start 3DMark2003 with a gallium nine enabled wine (and mesa of course) and it will crash your whole system: https://bugs.freedesktop.org/show_bug.cgi?id=83800
Comment 133 darkbasic 2014-09-24 20:56:01 UTC
When radeonsi is in the right mood (I guess something triggers greater instability), Chrome becomes completely unusable: opening simple URLs without videos is enough to trigger segfaults. The system hangs for a few seconds and then keeps working normally until the next segfault.

Kernel is drm-next-3.18, but I had similar behaviour with any kernel >= 3.15

[ 2122.073875] Watchdog[3418]: segfault at 0 ip 00007f6ceb0e90f8 sp 00007f6cd820e810 error 6 in chrome[7f6ce744e000+5c1c000]
[ 3579.978613] Watchdog[7831]: segfault at 0 ip 00007f05578ef0f8 sp 00007f0544a14810 error 6 in chrome[7f0553c54000+5c1c000]
[ 3665.106245] Watchdog[8079]: segfault at 0 ip 00007f18609c70f8 sp 00007f184daec810 error 6 in chrome[7f185cd2c000+5c1c000]
[ 4352.155130] Watchdog[10289]: segfault at 0 ip 00007fcf005000f8 sp 00007fceed625810 error 6 in chrome[7fcefc865000+5c1c000]
[ 4387.874001] Watchdog[26191]: segfault at 0 ip 00007fbcf4cca0f8 sp 00007fbce1def810 error 6 in chrome[7fbcf102f000+5c1c000]
[ 4434.438550] Watchdog[4605]: segfault at 0 ip 00007fa1f585c0f8 sp 00007fa1e2981810 error 6 in chrome[7fa1f1bc1000+5c1c000]
[16362.244095] Watchdog[25058]: segfault at 0 ip 00007f8e3cf6a0f8 sp 00007f8e2a08f810 error 6 in chrome[7f8e392cf000+5c1c000]
[16386.333329] Watchdog[25301]: segfault at 0 ip 00007fd1e34500f8 sp 00007fd1d0575810 error 6 in chrome[7fd1df7b5000+5c1c000]
[16495.014110] Watchdog[25410]: segfault at 0 ip 00007f7b9bc4b0f8 sp 00007f7b88d70810 error 6 in chrome[7f7b97fb0000+5c1c000]
[16581.203809] Watchdog[25675]: segfault at 0 ip 00007f9489d2e0f8 sp 00007f9476e53810 error 6 in chrome[7f9486093000+5c1c000]
[16679.412852] Watchdog[25702]: segfault at 0 ip 00007fa3338c50f8 sp 00007fa3209ea810 error 6 in chrome[7fa32fc2a000+5c1c000]
[16758.080893] Watchdog[25824]: segfault at 0 ip 00007fc4d3d8e0f8 sp 00007fc4c0eb3810 error 6 in chrome[7fc4d00f3000+5c1c000]
[16782.192107] Watchdog[26032]: segfault at 0 ip 00007ffe9d6af0f8 sp 00007ffe8a7d4810 error 6 in chrome[7ffe99a14000+5c1c000]
[16796.275309] Watchdog[26161]: segfault at 0 ip 00007f75faa130f8 sp 00007f75e7b38810 error 6 in chrome[7f75f6d78000+5c1c000]

Next week I will be able to provide remote access if needed.
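
One quick check on a list of segfault lines like the one above is whether they all fault at the same place in the binary. Subtracting the mapping base (the address inside the brackets) from the instruction pointer gives an offset that does not depend on where the binary was loaded. The sketch below is a hedged illustration; the regular expression and the name `fault_offsets` are assumptions made for this example.

```python
import re

# Kernel segfault line, as pasted in this report:
#   "... segfault at 0 ip <ip> sp <sp> error 6 in chrome[<base>+<size>]"
SEGFAULT_RE = re.compile(
    r"segfault at \S+ ip ([0-9a-f]+) sp [0-9a-f]+ error \d+ "
    r"in ([^\[\s]+)\[([0-9a-f]+)\+[0-9a-f]+\]"
)

def fault_offsets(log_text):
    """Return (module, ip - mapping_base) for each segfault line."""
    return [
        (module, int(ip, 16) - int(base, 16))
        for ip, module, base in SEGFAULT_RE.findall(log_text)
    ]

sample = (
    "[ 2122.073875] Watchdog[3418]: segfault at 0 ip 00007f6ceb0e90f8 "
    "sp 00007f6cd820e810 error 6 in chrome[7f6ce744e000+5c1c000]\n"
    "[ 3579.978613] Watchdog[7831]: segfault at 0 ip 00007f05578ef0f8 "
    "sp 00007f0544a14810 error 6 in chrome[7f0553c54000+5c1c000]\n"
)

for module, offset in fault_offsets(sample):
    print(f"{module}+{offset:#x}")
```

For the two sample lines the offset is identical (0x3c9b0f8), which suggests that the same instruction in the chrome binary faults each time rather than random locations.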
Comment 134 Bogar Boris 2014-09-27 09:17:20 UTC
I get several crashes every day.
Right now I'm using kernel 3.17 with mesa git from the ~lcarlier repository (Arch Linux + KDE4) on my Radeon HD7970. Switching to the stable kernel and mesa doesn't change the situation. I don't know if this is the same error, but I'm leaving my dmesg log from the latest crash. The system stops responding and the monitor outputs garbage. Sometimes I can switch to a TTY to reboot. Most of the time I can't.
http://pastebin.com/nkKNQE1f
Comment 135 agapito 2014-09-29 09:00:17 UTC
This bug still happens in kernel 3.17 rc7 with mesa 10.4
Comment 136 agapito 2014-10-02 11:12:37 UTC
With kernel 3.17-rc7 the crashes are a lot more frequent than with kernel 3.16.4.
Comment 137 Jacob 2014-10-07 10:20:02 UTC
About three months ago I stumbled upon this very same bug. The system just locks up totally, and I pretty much have to force-restart the machine. It seems to happen on any kernel newer than 3.13, at least in my case.
I've tried with mesa 10.4, 10.3, 10.2 and even 10.1 and nothing seems to fix the problem. It was only when I started changing the kernel versions, that I finally found 3.13 to be stable.

Been using the Oibaf PPA and Linux 3.13 for the past month and a half, and I've not experienced a single crash, but when I try any new kernel releases it won't be long until I experience yet another system lockup.
Comment 138 darkbasic 2014-10-07 15:24:50 UTC
Oct  7 17:17:30 gentoo-desktop kernel: [11674.296906] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
Oct  7 17:17:30 gentoo-desktop kernel: [11674.296912] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000022e490 last fence id 0x000000000022e49f on ring 0)
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706707] radeon 0000:01:00.0: Saved 657 dwords of commands on ring 0.
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706750] radeon 0000:01:00.0: GPU softreset: 0x000000EC
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706751] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706752] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706753] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706754] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200040C0
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706788] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706789] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706790] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706791] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00400002
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706793] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x84010243
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706794] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x60C83146
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706795] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44E84246
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706796] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11674.706797] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11675.106010] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
Oct  7 17:17:31 gentoo-desktop kernel: [11675.106071] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00108140
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107217] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107218] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107219] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107220] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000AC0
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107254] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107255] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107256] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107257] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107258] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107259] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107260] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Oct  7 17:17:31 gentoo-desktop kernel: [11675.107336] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Oct  7 17:17:31 gentoo-desktop kernel: [11675.135080] Watchdog[12138]: segfault at 0 ip 00007f493c6f60f8 sp 00007f492981b810 error 6 in chrome[7f4938a5b000+5c1c000]
Comment 139 Bogar Boris 2014-10-07 17:44:13 UTC
The other day I noticed that the onboard video (i5-3570K - HD4000) was enabled after a BIOS upgrade. By disabling it the radeon driver crashes have stopped. I write only because it could be of help to someone.
Comment 140 darkbasic 2014-10-07 21:05:18 UTC
Did you disable the onboard video in the BIOS or somewhere else?
Comment 141 Michel Dänzer 2014-10-08 02:18:04 UTC
(In reply to Jacob from comment #137)
> It seems to happen on any kernel newer than 3.13, least in my case.

Can you bisect which change between 3.13 and 3.14 caused the instability for you?
Comment 142 Jacob 2014-10-08 06:27:32 UTC
(In reply to Michel Dänzer from comment #141)
> (In reply to Jacob from comment #137)
> > It seems to happen on any kernel newer than 3.13, least in my case.
> 
> Can you bisect which change between 3.13 and 3.14 caused the instability for
> you?

I've only downloaded kernel images as .deb files from http://kernel.ubuntu.com/~kernel-ppa/mainline/
How would I go about bisecting which change caused the instability?
Comment 143 Bogar Boris 2014-10-08 08:57:58 UTC
(In reply to darkbasic from comment #140)
> Do you disable the onboard video in the bios or somewhere else?

In the bios. In my case (asrock z77 extreme6) I have to disable the entry "IGPU Multi-Monitor"
Comment 144 darkbasic 2014-10-08 10:02:00 UTC
Unfortunately it doesn't help. On the contrary, I just set a new instability world record: it crashed in *KDM*! Yes, really: I didn't even have time to type my username!
I never checked GTX 970's price so often: one day or another I will find a good offer and I will say goodbye to radeoncrash forever.
Comment 145 Daniel Kozak 2014-10-08 16:21:03 UTC
Same issue with my HD 7770 on 3.16 and 3.17 (much more often on 3.17). Even trying to write this comment causes a crash, so I am unable to write more details on this PC because the time to crash is really tiny :(
Comment 146 Daniel Kozak 2014-10-09 13:02:10 UTC
I am able to reproduce it: just start VLC with some video and wait. It happens after some seconds or a few minutes.
Comment 147 mmstickman 2014-10-09 15:39:35 UTC
I'm getting VM Faults within minutes of idling in i3wm now with my 7950 in 3.17. My AMD A4-5000 laptop is unaffected by these bugs, however.
Comment 148 darkbasic 2014-10-09 15:44:59 UTC
I just had to book my flight 3 times with HD 7950 because it loves crashing when I use Chromium (plugins disabled).
Comment 149 mmstickman 2014-10-09 15:51:18 UTC
Perhaps you should revert to 3.14 LTS until this issue is fixed. I don't have any issues running 3.14 with my 7950.
Comment 150 darkbasic 2014-10-09 16:01:32 UTC
3.14 is stable with my HD 7950 but some users reported they are stable only with 3.13 (I don't remember their card)
Comment 151 initzero 2014-10-09 19:32:28 UTC
Same issues with my OLAND card here.
Archlinux, Mesa 10.3, Xorg 1.16.1 and Radeon 7.5.0.

Kernel 3.14.20 is still stable, 3.16.X and 3.17 may run for approx. 1hr and finally crash during normal desktop usage (Gnome 3.14 + Browser + ...). Didn't check 3.15 recently.

Sooner or later someone needs to bisect that sucker! :)
Comment 152 agapito 2014-10-10 17:37:49 UTC
IMPORTANT

This was my first message in this bug report: 

I have the same problem with my HD 7950; when using Hangouts, playing Left 4 Dead 2, or watching a Flash video, my screen goes crazy with vertical lines or grey fog. It started when I upgraded to the testing repo (Archlinux) and downloaded the newest linux-firmware package, which includes TAHITI_mc2.bin. I suffered this bug on kernels 3.14 and 3.15.

--------------------------------------------------------------------------

On Archlinux I was stable with kernel 3.14, and the problem started when I began using the new firmware. I thought that the new firmware was the cause of this bug, but no: I had the same bug using the old firmware. So this bug was caused by one of the radeon commits backported to kernel 3.14.6 (the first kernel using the newest firmware). I am 100% sure.


https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/?id=refs/tags/v3.14.21&ofs=1300
Comment 153 Jacob 2014-10-11 17:13:07 UTC
I've tried different kernel versions over the past few days and failed to trigger the crash with any kernel prior to 3.15-rc3. Today, after a few hours of use, my system just locked up again and my screens went black, forcing me to reboot the machine. That is the same thing I've experienced with 3.16 and 3.17, so I believe this is where the bug originated.
However, when I booted up again, nothing had been written to kern.log, no errors showed up in xorg.log, and dmesg showed me nothing.
I got the image from: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.15-rc3-utopic/ which also shows what changes were made to this release.
The issue might have originated here, but it could very well be that I've missed something.
Comment 154 Jacob 2014-10-12 08:14:19 UTC
(In reply to agapito from comment #152)
> IMPORTANT
> 
> This was my first message in this bug report: 
> 
> I have the same problem with my HD 7950; using hangouts, playing Left for
> Dead 2, or watching a flash video my screen goes crazy with vertical lines
> or grey fog. Started when i upgraded to testing repo (Archlinux) and
> downloaded the newest linux-firmware package, who includes TAHITI_mc2.bin. I
> suffered this bug on kernels 3.14 and 3.15. 
> 
> --------------------------------------------------------------------------
> 
> In Archlinux i was stable with kernel 3.14, and the problem started when i
> was using the new firmware. I thought that the new firmware was the cause of
> this bug, but NO, because i had the same bug using the old firmare, so this
> bug it was caused by one of this radeon commits backported to kernel 3.14.6
> (the first kernel using newest firmware). I am 100% sure.
> 
> 
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/
> ?id=refs/tags/v3.14.21&ofs=1300

I've just compared the git messages from your link with those from http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.15-rc3-utopic/CHANGES and it seems the commits made to drm/radeon are the only commits these two kernel versions have in common.
Only seven of them are part of 3.15-rc3, which crashed on me yesterday, so it would seem the crashes are caused by one of those commits.
Comment 155 agapito 2014-10-12 11:33:38 UTC
With kernel 3.16.5 i have this bug every 2 hours approximately. With kernel 3.17 every 20 minutes.
Comment 156 Malte Schröder 2014-10-12 12:19:44 UTC
Hi, with kernel v3.17 these crashes were much more frequent for me too. Now I've set aspm=0 for the radeon module and the system has been running for some hours straight.
Comment 157 agapito 2014-10-12 16:14:25 UTC
(In reply to Malte Schröder from comment #156)
>> I've set aspm=0 for the radeon module and the system has been running for
> some hours straight.

Not for me.
Comment 158 Malte Schröder 2014-10-13 08:15:52 UTC
(In reply to agapito from comment #157)
> (In reply to Malte Schröder from comment #156)
> >> I've set aspm=0 for the radeon module and the system has been running for
> > some hours straight.
> 
> Not for me.

Yeah, it just crashed on me again, so I was just lucky. I also tried disabling dynclks and dpm, with no effect. What I did differently yesterday was that I used the browser (Debian Iceweasel) very little. Today I had some YouTube running when the crash happened. In fact the crashes happen most of the time when watching stuff on YouTube, i.e. when Iceweasel uses vdpau through gstreamer. I have now removed the mesa vdpau drivers. I will report back if this changes anything.
Comment 159 Malte Schröder 2014-10-13 09:54:20 UTC
VDPAU doesn't make a difference, still crashes.
Comment 160 Michel Dänzer 2014-10-15 06:36:41 UTC
(In reply to Jacob from comment #153)
> I've tried using different kernel versions the past few days and I've failed
> to trigger the crash with any kernel prior to 3.15-rc3.

What was the closest earlier version you tried?
Comment 161 Jacob 2014-10-15 07:50:14 UTC
(In reply to Michel Dänzer from comment #160)
> (In reply to Jacob from comment #153)
> > I've tried using different kernel versions the past few days and I've failed
> > to trigger the crash with any kernel prior to 3.15-rc3.
> 
> What was the closest earlier version you tried?

The last image I tried was 3.15-rc2, which didn't crash on me during 18-hours of uptime
Comment 162 Michel Dänzer 2014-10-15 07:53:28 UTC
(In reply to Jacob from comment #161)
> The last image I tried was 3.15-rc2, which didn't crash on me during
> 18-hours of uptime

Can you try 3.15-rc2 again for even longer, to make sure it wasn't just luck?

If it's consistent, it would be really helpful if you could bisect between 3.15-rc2 and 3.15-rc3.
Comment 163 Jacob 2014-10-15 08:14:06 UTC
(In reply to Michel Dänzer from comment #162)
> (In reply to Jacob from comment #161)
> > The last image I tried was 3.15-rc2, which didn't crash on me during
> > 18-hours of uptime
> 
> Can you try 3.15-rc2 again for even longer, to make sure it wasn't just luck?
> 
> If it's consistent, it would be really helpful if you could bisect between
> 3.15-rc2 and 3.15-rc3.

I'll do that: try rc2 for a couple of days to see if I can get it to crash, then try rc3 again to see if I can crash it again, just for good measure. If rc2 turns out to be stable and rc3 does not, I'll bisect between the two.
Comment 164 Jacob 2014-10-19 18:16:08 UTC
So after testing 3.15-rc2 for about 3 days without any crashes, I decided to once again test 3.15-rc3 to see if it would crash on me again, which it did. The OS just stopped responding to anything, then my monitors went black, just like it has done for me on 3.16 and 3.17 as well.

I ran a bisection between the two releases, and the result was the following:
Bisecting: 120 revisions left to test after this (roughly 7 steps)
[3fe89d2e768792a924d3c1e9310ba0b4448cb78e] Merge tag 'fixes-3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

It seems weird that ARM has anything to do with this issue, but even after running "git bisect bad" until the end, it doesn't pick out anything committed to drm/radeon.
Nonetheless, I compiled the kernel and I'll now test it to see whether it crashes, then run "git bisect bad" until the kernel becomes stable again.

I'll report back if I either manage to compile a kernel which is stable, or if I don't
Comment 165 Marti Raudsepp 2014-10-19 18:24:13 UTC
After upgrading to kernel 3.17.1, these GPU hangs/crashes still occur, but now it doesn't hang the whole machine any more. Sometimes it recovers from the GPU hang completely, sometimes it just drops me into a text console. Thanks, that's an improvement.

I am using Radeon R9 270 on Arch Linux.
Comment 166 Michel Dänzer 2014-10-20 01:45:15 UTC
(In reply to Jacob from comment #164)
> I ran a bisection between the two releases, and the result was the following:
> Bisecting: 120 revisions left to test after this (roughly 7 steps)

That's not the result but just an early step of the bisection. :) As the above says, Git estimates that you'll need to test around 7 more kernels before there is a result. BTW, make sure to only run 'bisect good' after you've tested a kernel long enough to be sure it's not affected by the problem. If you mark a commit as good which is actually bad, the bisection will fail.
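
To give a feel for the "roughly 7 steps" figure: each bisection step about halves the range of candidate commits, so the number of kernels still to be tested grows with the base-2 logarithm of the revisions left. The sketch below is a simplified model, not Git's internal estimate, and the three days of testing per kernel is an assumed figure (taken from how long 3.15-rc2 was tested in comment 164).

```python
import math

def estimated_bisect_steps(revisions_left):
    """Approximate number of further test builds, assuming each step halves the range."""
    return math.ceil(math.log2(revisions_left))

# Figure reported by git in comment 164.
steps = estimated_bisect_steps(120)

# Hypothetical soak time needed before a kernel can be called "good".
days_per_kernel = 3

print(f"about {steps} kernels to test, about {steps * days_per_kernel} days in total")
```

The soak time dominates the total, which is why marking a kernel good too early is tempting and also why it is the most likely way for a bisection like this one to go wrong.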
Comment 167 agapito 2014-10-20 20:10:02 UTC
3.18 rc1 still affected.
Comment 168 DJ Dunn 2014-10-21 03:55:03 UTC
If this helps any: I've been seeing the same error on my Gentoo box. It's near constant on 3.16+ kernels, within 2 or 3 minutes of logging in to X. I've been seeing it only very rarely (once every few hours) on 3.14.22, but I do still see it there, and I have never seen it happen on 3.14.19.

gentoo box with mesa-10.3.1 xorg-server-1.16.1
Comment 169 DJ Dunn 2014-10-21 03:57:27 UTC
My card is an HD7870.
Comment 170 Jacob 2014-10-21 10:58:47 UTC
(In reply to DJ Dunn from comment #168)
> if this helps any, I've been seeing the same error on my gentoo box, its
> near constant on 3.16+ kernels within 2 or 3 min of loging in X but, ive
> been seeing it very rarely (once every few hours) on 3.14.22 but still
> seeing it, and I never seen it happen on 3.14.19
> 
> gentoo box with mesa-10.3.1 xorg-server-1.16.1

Could you test 3.14.20 and 3.14.21 to see if the issue also occurs in either of those, and then bisect between the first version where the crashes started to occur and the version prior to it?
Comment 171 darkbasic 2014-10-21 12:56:02 UTC
On my system with 3.17+ it sometimes takes days to crash, while at other times it crashes after a few minutes. Only 3.15 *always* crashed within a couple of minutes. I remember 3.15-rc1-pre didn't crash, so the bad commit should be somewhere between 3.15-rc0 and 3.15-rc2.
Comment 172 darkbasic 2014-10-21 12:58:38 UTC
Sorry I meant 3.15-rc0 and 3.15-rc*3*.
Comment 173 Jacob 2014-10-21 17:03:48 UTC
(In reply to darkbasic from comment #171)
> On my system with 3.17+ sometimes it takes days to crash while sometimes it
> crashes after a few minutes. Only 3.15 did *always* crash in a couple of
> minutes. I remember 3.15-rc1-pre didn't crash, so the bad commit should be
> somewhere around 3.15-rc0 and 3.15-rc2.

Takes about a couple hours for me to crash 3.15-rc3, "sadly" not minutes. 3.15-rc2 didn't crash on me,  so I'm currently doing a bisection
Comment 174 Jacob 2014-10-23 14:16:42 UTC
Considering the crash occurs within an hour or two on the latest kernel, but only every few hours on 3.15, it makes you wonder if the two issues are even related.
Nonetheless, I'll report the result of the bisection once I have it.
Comment 175 Christian König 2014-10-23 14:21:47 UTC
(In reply to Jacob from comment #174)
> Considering the crash occurs within an hour or two on the last kernel, but
> only occurs every few hours on 3.15, it makes you wonder if the two issues
> are even related.
> Nonetheless, I'll report the result from the bisection once I got it

Kernel 3.15 has some known VM issues which are only fixed in 3.16. Independent of that, I think we do indeed have multiple different issues that seem to be hard to distinguish.

In general you can split the issues into two categories: those with VM faults in the logs and those without.
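
A rough way to sort pasted logs into those two categories is sketched below. Which messages count as a "VM fault" is an assumption here: the sketch treats the "GPU fault detected" message (quoted in the title of one attachment on this bug) and a non-zero VM_CONTEXT1_PROTECTION_FAULT_STATUS register dump as markers, and the function names are hypothetical.

```python
import re

# Register dump line as printed during a GPU soft reset in this report.
FAULT_STATUS_RE = re.compile(
    r"VM_CONTEXT1_PROTECTION_FAULT_STATUS\s+0x([0-9A-Fa-f]+)"
)

def has_vm_fault(dmesg_text):
    """Heuristic: explicit fault message, or a non-zero fault status register."""
    if "GPU fault detected" in dmesg_text:
        return True
    return any(int(value, 16) != 0 for value in FAULT_STATUS_RE.findall(dmesg_text))

def categorize(dmesg_text):
    return "with VM faults" if has_vm_fault(dmesg_text) else "without VM faults"

# Excerpt from comment 138: the fault status register reads zero.
sample = (
    "radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000\n"
    "radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000\n"
)

print(categorize(sample))
```

An empty log, as described for the hard lockups that leave nothing in dmesg, also falls into the "without" category under this heuristic.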
Comment 176 Jacob 2014-10-23 14:37:35 UTC
(In reply to Christian König from comment #175)
> (In reply to Jacob from comment #174)
> > Considering the crash occurs within an hour or two on the last kernel, but
> > only occurs every few hours on 3.15, it makes you wonder if the two issues
> > are even related.
> > Nonetheless, I'll report the result from the bisection once I got it
> 
> Kernel 3.15 has some known VM issues which are only fixed in 3.16.
> Independent of that I think we indeed have multiple different issues that
> seems to be hard to distinct.
> 
> In general you can split the issues into two categories one is with VM
> faults in the logs and the other ones are without.

Seems the one I'm bisecting for now would be the one without then.
Whenever it crashes, I get nothing in the logs at all. Nothing in dmesg, in xorg.log or in the kern.log. Nothing at all
Comment 177 sam tygier 2014-10-26 08:52:47 UTC
Seeing this lock up on Debian Jessie when watching youtube HTML5 videos in firefox.
kernel
Linux oberon 3.16-3-amd64 #1 SMP Debian 3.16.5-1 (2014-10-10) x86_64 GNU/Linux
Mesa 10.2.8-1
libdrm-radeon1 2.4.58-2
xserver-xorg-video-radeon 1:7.5.0-1
On a [AMD/ATI] Cape Verde PRO [Radeon HD 7750 / R7 250E]

I have logs which I can post, but it looks like you already have quite a few.
Comment 178 Jacob 2014-10-27 12:14:19 UTC
The "supposedly" random crashes I encountered with 3.15-rc3 weren't really random at all. I came to a somewhat sad realization that only one application actually crashed with it, so the bisection has mostly been a waste.

I've personally been stable on 3.14.20, and according to DJ Dunn, the issue has now hit 3.14.22, so I'm gonna check 21 and 22 to see if it crashes for me as well, and this time make sure it's application independent.
Comment 179 Maximilian Böhm 2014-10-27 13:18:50 UTC
Just want to remind you that there is a Mesa connection somehow. Either it's a kernel call that only later Mesa versions use, or it's a Mesa issue. I have been stable for *months* now on Linux 3.16/3.17 with these downgraded packages on Arch Linux with a Radeon HD 7770:
ati-dri-10.1.4-1-x86_64.pkg.tar.xz
clang-3.4.1-2-x86_64.pkg.tar.xz
lib32-llvm-libs-3.4.1-1-x86_64.pkg.tar.xz
lib32-mesa-10.1.4-1-x86_64.pkg.tar.xz
lib32-mesa-libgl-10.1.4-1-x86_64.pkg.tar.xz
llvm-3.4.1-2-x86_64.pkg.tar.xz
llvm-libs-3.4.1-2-x86_64.pkg.tar.xz
mesa-10.1.4-1-x86_64.pkg.tar.xz
mesa-demos-8.1.0-2-x86_64.pkg.tar.xz
mesa-libgl-10.1.4-1-x86_64.pkg.tar.xz
Comment 180 agapito 2014-10-27 13:34:43 UTC
I know it's early to say this but 3.18 rc2 solved this bug for me.
Comment 181 agapito 2014-10-27 17:14:37 UTC
After 5 hours i am still stable. I've played L4D2, unigine valley, watched vdpau content, flash videos, Google Earth, chromium with a lot of tabs, kwin effects...

HD7950 under Archlinux with 3.18 rc2 kernel and mesa 10.3.2
Comment 182 Michel Dänzer 2014-10-28 01:13:59 UTC
(In reply to Maximilian Böhm from comment #179)
> Just want to remind you that there is a Mesa connection somehow.

I've seen that mentioned before, but the answer is always the same: Please bisect Mesa.


(In reply to agapito from comment #180)
> I know it's early to say this but 3.18 rc2 solved this bug for me.

Can you bisect which commit in 3.18-rc2 fixed it for you?
Comment 183 Aaron B 2014-10-28 04:49:32 UTC
3.18 is still crashing for me, I doubt it is fixed.
Comment 184 agapito 2014-10-28 09:02:47 UTC
(In reply to Michel Dänzer from comment #182)
> Can you bisect which commit in 3.18-rc2 fixed it for you?

Sorry, I do not know how to do it. But these are the changes between RC1 (still crashing) and RC2 (stable):

 drm/radeon: reduce sparse false positive warnings
 Revert "drm/radeon: drop btc_get_max_clock_from_voltage_dependency_table"
 Revert "drm/radeon/dpm: drop clk/voltage dependency filters for SI"
 drm/radeon: initialize sadb to NULL in the audio code
 drm/radeon: fix speaker allocation setup
 drm/radeon: use gart memory for DMA ring tests
 drm/radeon: fix vm page table block size calculation

I am not an expert, but probably: drm/radeon: use gart memory for DMA ring tests; could be the good commit.

(In reply to Aaron B from comment #183)
> 3.18 is still crashing for me, I doubt it is fixed.

rc2 crashed for you? After 24 hours I am still stable.
Comment 185 Michel Dänzer 2014-10-28 09:13:10 UTC
(In reply to agapito from comment #184)
> > Can you bisect which commit in 3.18-rc2 fixed it for you?
> 
> Sorry, I do not know how to do it.

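Search the web for 'git bisect howto'. One gotcha: since you are looking for the commit that fixed the problem rather than the one that broke it, you'll need to run 'git bisect good' for bad (crashing) kernels and 'git bisect bad' for good (stable) ones, because git bisect can only isolate good -> bad transitions.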


> I am not an expert, but probably: drm/radeon: use gart memory for DMA ring
> tests; could be the good commit.

That should have no effect once the driver is initialized. None of the changes between rc1 and rc2 seem like obvious candidates.
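To make the swapped labels concrete, here is a small Python sketch. The function name and its arguments are made up for illustration and are not part of git; it only maps a test result to the word you would type when bisecting for a fix instead of a regression.

```python
def bisect_verdict(crashed: bool, hunting_fix: bool) -> str:
    """Return the word to pass to `git bisect` for one tested kernel."""
    # Normal regression hunt: a crashing kernel is "bad", a stable one "good".
    verdict = "bad" if crashed else "good"
    if hunting_fix:
        # Looking for the commit that FIXED the crash: swap the labels,
        # so the crashing (older) state counts as "good" for git.
        verdict = "good" if verdict == "bad" else "bad"
    return verdict


if __name__ == "__main__":
    # A kernel that still crashes, while searching for the fixing commit:
    print(bisect_verdict(crashed=True, hunting_fix=True))
```

Newer git releases also accept custom terms via `git bisect start --term-old=<term> --term-new=<term>`, which avoids the mental swap if your git version supports it.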
Comment 186 Jacob 2014-10-28 09:17:40 UTC
(In reply to agapito from comment #184)
> (In reply to Michel Dänzer from comment #182)
> > Can you bisect which commit in 3.18-rc2 fixed it for you?
> 
> Sorry, I do not know how to do it. But these are the changes between RC1
> (still crashing) and RC2 (stable):
> 
>  drm/radeon: reduce sparse false positive warnings
>  Revert "drm/radeon: drop btc_get_max_clock_from_voltage_dependency_table"
>  Revert "drm/radeon/dpm: drop clk/voltage dependency filters for SI"
>  drm/radeon: initialize sadb to NULL in the audio code
>  drm/radeon: fix speaker allocation setup
>  drm/radeon: use gart memory for DMA ring tests
>  drm/radeon: fix vm page table block size calculation
> 
> I am not an expert, but probably: drm/radeon: use gart memory for DMA ring
> tests; could be the good commit.
> 
> (In reply to Aaron B from comment #183)
> > 3.18 is still crashing for me, I doubt it is fixed.
> 
> rc2 crashed for you? After 24 hours I am still stable.

https://wiki.ubuntu.com/Kernel/KernelBisection
This guide helped me. Might help you too.

In short, you just have to clone the Linux repository, run "git bisect start <BAD> <GOOD>", then compile the kernel and test it.
If it crashes, then run "git bisect bad", and recompile.
If you think you've tested it long enough and the version is stable, then run "git bisect good", and recompile.
Continue to do so until no revisions are left to be tested.
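The loop described above is a binary search over the commit history. A minimal Python sketch of the idea follows; the commit names and the `is_bad` callback are hypothetical stand-ins for "compile, boot and test this kernel", and the sketch assumes the first commit is good, the last is bad, and everything after the first bad commit is bad too.

```python
from typing import Callable, Sequence, Tuple


def bisect_first_bad(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of kernels that had to be tested)."""
    lo, hi = 0, len(commits) - 1      # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                    # one compile-and-test cycle
        if is_bad(commits[mid]):
            hi = mid                  # corresponds to "git bisect bad"
        else:
            lo = mid                  # corresponds to "git bisect good"
    return commits[hi], tests


if __name__ == "__main__":
    history = ["c%d" % i for i in range(9)]          # c0 .. c8, made-up names
    culprit, tests = bisect_first_bad(history, lambda c: int(c[1:]) >= 6)
    print(culprit, tests)
```

With a crash that only shows up after hours or days, a "good" verdict is never certain, so a single wrong answer sends the search into the wrong half. That is why testing each step long enough matters more than the number of steps.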
Comment 187 Daniel Kozak 2014-10-28 10:58:08 UTC
After reinstalling my Arch workstation, I am unable to reproduce this issue anymore, even with the same mesa, linux, and linux-firmware versions as before.
Comment 188 Aaron B 2014-10-28 17:36:55 UTC
It's random for a reason: it acts like a buffer overrun or leak or something else that isn't easily reproduced, as how often it happens changes with every installed package. I've had worse luck on 3.18-rc2; my install just seems more prone to it on this build, whereas others haven't crashed yet, but give it time. Make sure you run HTML5 youtube for an hour or so. ;) :)
Comment 189 agapito 2014-10-28 18:54:24 UTC
This is strange because 5 days ago i was starting to use intel graphic card because i had a lot of lock-ups with 3.17.1 and 3.18 rc1 kernels. When 3.18 rc2 was launched i returned to radeon driver and this bug disappeared under 3.18 rc2, but now i am using 3.17.1 and it seems stable... Maybe this bug is a mesa problem, and not a kernel problem. Mesa 10.3.2 arrived to archlinux 3 days ago.

Changes from 10.3.1 to 10.3.2:

Brian Paul (3):
      mesa: fix spurious wglGetProcAddress / GL_INVALID_OPERATION error
      st/wgl: add WINAPI qualifiers on wgl function typedefs
      glsl: fix several use-after-free bugs

Daniel Manjarres (1):
      glx: Fix glxUseXFont for glxWindow and glxPixmaps

Dave Airlie (1):
      mesa: fix GetTexImage for 1D array depth textures

Emil Velikov (3):
      docs: Add sha256 sums for the 10.3.1 release
      Update VERSION to 10.3.2
      Add release notes for the 10.3.2 release

Ilia Mirkin (4):
      gm107/ir: add dnz emission for fmul
      gk110/ir: add dnz flag emission for fmul/fmad
      nouveau: 3d textures are unsupported, limit 3d levels to 1
      st/gbm: fix order of arguments passed to is_format_supported

Kenneth Graunke (3):
      i965: Add a BRW_MOCS_PTE #define.
      i965: Use BDW_MOCS_PTE for renderbuffers.
      i965: Fix register write checks.

Marek Olšák (2):
      st/mesa: use pipe_sampler_view_release for releasing sampler views
      glsl_to_tgsi: fix the value of gl_FrontFacing with native integers

Michel Dänzer (4):
      radeonsi: Clear sampler view flags when binding a buffer
      r600g,radeonsi: Always use GTT again for PIPE_USAGE_STREAM buffers
      winsys/radeon: Use separate caching buffer manager for each set of flags
      r600g: Drop references to destroyed blend state
Comment 190 Daniel Kozak 2014-10-28 20:08:32 UTC
(In reply to agapito from comment #189)
> This is strange because 5 days ago i was starting to use intel graphic card
> because i had a lot of lock-ups with 3.17.1 and 3.18 rc1 kernels. When 3.18
> rc2 was launched i returned to radeon driver and this bug disappeared under
> 3.18 rc2, but now i am using 3.17.1 and it seems stable... Maybe this bug is
> a mesa problem, and not a kernel problem. Mesa 10.3.2 arrived to archlinux 3
> days ago.
> 
> Changes from 10.3.1 to 10.3.2:
> 
> Brian Paul (3):
>       mesa: fix spurious wglGetProcAddress / GL_INVALID_OPERATION error
>       st/wgl: add WINAPI qualifiers on wgl function typedefs
>       glsl: fix several use-after-free bugs
> 
> Daniel Manjarres (1):
>       glx: Fix glxUseXFont for glxWindow and glxPixmaps
> 
> Dave Airlie (1):
>       mesa: fix GetTexImage for 1D array depth textures
> 
> Emil Velikov (3):
>       docs: Add sha256 sums for the 10.3.1 release
>       Update VERSION to 10.3.2
>       Add release notes for the 10.3.2 release
> 
> Ilia Mirkin (4):
>       gm107/ir: add dnz emission for fmul
>       gk110/ir: add dnz flag emission for fmul/fmad
>       nouveau: 3d textures are unsupported, limit 3d levels to 1
>       st/gbm: fix order of arguments passed to is_format_supported
> 
> Kenneth Graunke (3):
>       i965: Add a BRW_MOCS_PTE #define.
>       i965: Use BDW_MOCS_PTE for renderbuffers.
>       i965: Fix register write checks.
> 
> Marek Olšák (2):
>       st/mesa: use pipe_sampler_view_release for releasing sampler views
>       glsl_to_tgsi: fix the value of gl_FrontFacing with native integers
> 
> Michel Dänzer (4):
>       radeonsi: Clear sampler view flags when binding a buffer
>       r600g,radeonsi: Always use GTT again for PIPE_USAGE_STREAM buffers
>       winsys/radeon: Use separate caching buffer manager for each set of
> flags
>       r600g: Drop references to destroyed blend state

I don't think so. I tried to downgrade mesa, linux-firmware and lots of other packages, but even with vdpau vlc, html5 youtube videos or flash videos I am unable to freeze my system again (I really tried hard all day). It must be some HW problem or some weird HW state or something completely different.
Comment 191 agapito 2014-10-29 14:47:35 UTC
3.17.1 is still affected :S  I had a crash just 5 minutes ago.

Well, I will use 3.18 rc2 because I haven't had any crash with it yet.
Comment 192 Aaron B 2014-10-29 16:16:15 UTC
I guess it is possible there are many different crash types. I'm still crashing left and right. Is everyone else still stable? If so, looks like I'll leave you guys here alone to mark your problem fixed....and find which one I need to be living in for bug reports again. :)
Comment 193 Alex Deucher 2014-10-29 16:21:14 UTC
(In reply to Aaron B from comment #192)
> I guess it is possible there are many different crash types. I'm still
> crashing left and right. Is everyone else still stable? If so, looks like
> I'll leave you guys here alone to mark your problem fixed....and find which
> one I need to be living in for bug reports again. :)

Unfortunately, this bug has become a dumping ground for any kind of stability issue with radeonsi so I'm not really sure how useful it is anymore.  I suspect there are actually multiple issues that are now all mixed up.
Comment 194 Marti Raudsepp 2014-10-29 16:30:29 UTC
(In reply to Alex Deucher from comment #193)
> Unfortunately, this bug has become a dumping ground for any kind of
> stability issue with radeonsi so I'm not really sure how useful it is
> anymore.  I suspect there are actually multiple issues that are now all
> mixed up.

What's the way forward? Shouldn't it be up to the developers to try and make sense of the reports and split up the bug entry appropriately?

Should there be one report per affected user, or is there a better way to group them together?

Christian König from comment #175 made one suggestion, is that what we should be doing?
> In general you can split the issues into two categories one is with VM
> faults in the logs and the other ones are without.
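As a rough illustration of that two-way split, a Python sketch that sorts a dmesg dump into "VM fault" versus "no VM fault" could look like this. The marker strings are taken from log lines quoted in this report; the function name and the category labels are made up for the example.

```python
# Substrings seen in the VM fault logs attached to this report.
VM_FAULT_MARKERS = ("VM fault", "GPU fault detected", "PROTECTION_FAULT")


def classify_dmesg(log_text: str) -> str:
    """Return "vm-fault" if any line looks like a VM fault, else "no-vm-fault"."""
    for line in log_text.splitlines():
        if any(marker in line for marker in VM_FAULT_MARKERS):
            return "vm-fault"
    return "no-vm-fault"


if __name__ == "__main__":
    sample = (
        "radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000002258 "
        "last fence id 0x0000000000002255 on ring 0)\n"
        "VM fault (0x04, vmid 1) at page 30481, read from DMA1 (61)\n"
    )
    print(classify_dmesg(sample))
```

This only helps decide which category a report falls into; it says nothing about whether two reports in the same category share a cause.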
Comment 195 Aaron B 2014-10-29 17:18:30 UTC
Assuming my issue is separate and yours were fixed: back when mine was the only issue pertaining to random crashes, mostly triggered by video players/web browsers, it seemed the AMD guys never could reproduce it. After some time with nobody else on the bug report, a few people appeared with the same problem and reported the exact same results. Somewhere after that, though, I think more "random crashes" bugs showed up. I'd bet the ones who joined in a little later have the same bug as me; the more recent random crashes are probably not the same, though.

Maybe we should kill these reports, and make a couple with titles more appropriate, funnel people there, and start over. :)
Comment 196 Alex Deucher 2014-10-29 17:31:15 UTC
There are other bug reports related to stability issues specifically with chrome, firefox, and video playback in certain cases which may not be related.  Those bugs may be better fits depending on the exact nature of the issue you are seeing.
Comment 197 agapito 2014-10-29 18:50:51 UTC
After 2 days, 3.18 rc2 crashed... Arggg this bug is crazy.
Comment 198 agapito 2014-10-29 18:58:36 UTC
2 days without crashes... Now 2 crashes in 5 minutes. I will start using my intel graphic card again.
Comment 199 Jacob 2014-10-29 19:20:39 UTC
(In reply to agapito from comment #198)
> 2 days without crashes... Now 2 crashes in 5 minutes. I will start using my
> intel graphic card again.

Do you know what exactly you're doing when the crashes occur?
Comment 200 darkbasic 2014-10-29 19:36:23 UTC
I had the very same behaviour with any 3.17+ kernel: sometimes it doesn't crash for days, other times it crashes multiple times per minute. It doesn't matter what you do, it just crashes (even starting your desktop environment is enough sometimes).
I will probably buy a Nvidia GTX 970 while waiting for the new unified driver, then I will try the open source path once again: hopefully having the proprietary driver using the very same kernel code will force AMD to take stability into higher consideration.
Comment 201 Aaron B 2014-10-29 20:13:49 UTC
Not that it is very prominent, but I also plan on switching to a 780 Ti or similar. If the AMD guys can show their management how many people are going to not only hurt the company but also support the competition, it might show AMD that it's worth getting you guys more help overall.
Comment 202 farmboy0+freedesktop 2014-10-29 20:34:15 UTC
I am declaring kernel 3.18-rc2 preliminary stable for me again.
My card is an HD 7750 Pro Cape Verde.
I am using 3.18-rc2 with the lower-cased firmware for verde + TAHITI_uvd.
Mesa and Llvm is from recent git.
Comment 203 agapito 2014-10-29 21:16:52 UTC
(In reply to Jacob from comment #199)
> (In reply to agapito from comment #198)
> > 2 days without crashes... Now 2 crashes in 5 minutes. I will start using my
> > intel graphic card again.
> 
> Do you know what exactly you're doing when the crashes occur?

Yeah, on both occasions, I was trying to write a message in a forum. (Chromium)

(In reply to farmboy0+freedesktop from comment #202)
> I am declaring kernel 3.18-rc2 preliminary stable for me again.
> My card is an HD 7750 Pro Cape Verde.
> I am using 3.18-rc2 with the lower-cased firmware for verde + TAHITI_uvd.
> Mesa and Llvm is from recent git.

I thought 3.18-rc2 was stable, but it is not...
Comment 204 Alex Deucher 2014-10-29 21:23:06 UTC
(In reply to agapito from comment #203)
> (In reply to Jacob from comment #199)
> > (In reply to agapito from comment #198)
> > > 2 days without crashes... Now 2 crashes in 5 minutes. I will start using my
> > > intel graphic card again.
> > 
> > Do you know what exactly you're doing when the crashes occur?
> 
> Yeah, on both occasions, I was trying to write a message in a forum.
> (Chromium)

If it happens mostly with chromium, it may be bug 81644.  When you say crash what do you mean?  Segfault?  System hang?  GPU hang?  GPU page fault?  Something else?
Comment 205 agapito 2014-10-29 21:41:58 UTC
(In reply to Alex Deucher from comment #204)
> crash what do you mean?  Segfault?  System hang?  GPU hang?  GPU page fault?
> Something else?

My bug is not Chromium related. 4 months ago, my browser was Firefox and I had the same bug. It is always the same behaviour. Sometimes with videos, or games, flash content... It's totally random.

I think it happens often when I click anywhere or resize a window with vdpau content. Then my system freezes for 5 seconds (I can move the mouse, but the windows or programs are not responding), and after those 5 seconds my screen shows garbage like this:  https://bugs.freedesktop.org/attachment.cgi?id=101226  or my monitor turns off completely. Sometimes I can reboot with reisub, sometimes I need a hard reset. Some months ago I posted a picture of my dmesg output: https://bugs.freedesktop.org/attachment.cgi?id=104145
Comment 206 Aaron B 2014-10-29 21:46:42 UTC
I believe Firefox and Chromium both suffer from the same issue; we've been treating it that way at least, and Firefox users have never reported results that differ from Chromium's with any patches or updates, so I believe we are talking about the same issue. I think it has to do with video being sent to the GPU at all, so with RadeonSI any modern, accelerated browser will probably have the same problems.
Comment 207 Michel Dänzer 2014-10-30 03:01:29 UTC
(In reply to Marti Raudsepp from comment #194)
> Shouldn't it be up to the developers to try and make sense of the reports and
> split up the bug entry appropriately?

We are doing that all the time. However, users tend to focus too much on some symptom(s) they have as well and ignore any differences. It's understandable, but unfortunate.


> Should there be one report per affected user, or is there a better way to
> group them together?

I can't think of anything better than that. In general, it's much better to track things separately. Once several unrelated issues are mixed up in a single report, it's very hard to untangle and keep track of it.
Comment 208 Marti Raudsepp 2014-10-30 08:59:19 UTC
(In reply to Michel Dänzer from comment #207)
> In general, it's much better to
> track things separately. Once several unrelated issues are mixed up in a
> single report, it's very hard to untangle and keep track of it.

So *TELL* users that clearly, to make individual bug reports. Close this bug if necessary. Direct your users instead of going "oh well, users can't report bugs and we can't do anything about it".

When I was beginning to see these issues, I asked around in #radeon whether I should report a new bug, and I was told to see this bug instead. Of course that was another misled user.
Comment 209 Michel Dänzer 2014-10-30 09:16:35 UTC
(In reply to Marti Raudsepp from comment #208)
> So *TELL* users that clearly, to make individual bug reports.

What do you think I'm doing what feels like every day? :}
Comment 210 Michel Dänzer 2014-10-30 09:24:28 UTC
One thing I find interesting is that only Southern Islands seems affected. At least I can't see any mentions of Bonaire, Kaveri, Kabini or Hawaii being affected in this report or other related ones.
Comment 211 Jacob 2014-10-30 09:32:14 UTC
I looked back through kern.log and found the last time I encountered a crash, which happened to be with kernel 3.15.10 from the Ubuntu repo.
That crash seemed to have been caused by dpm:
[drm:si_dpm_set_power_state] *ERROR* si_set_sw_state failed

The 3.15.10 version isn't part of the source, so I instead looked through the list of changes and found that the last set of changes made to drm landed in kernel 3.16-rc6.

So I installed the rc6 image and have now been running it for some time, but ran into yet another issue, which is unrelated to this bug.

[ 6533.114483] alloc_contig_range test_pages_isolated(1bc800, 1bca8c) failed
[ 6533.114492] alloc_contig_range test_pages_isolated(1bc800, 1bca8d) failed
[ 6533.114500] alloc_contig_range test_pages_isolated(1bc800, 1bca8e) failed
[ 6533.114506] alloc_contig_range test_pages_isolated(1bc800, 1bca8f) failed
[ 6533.114511] alloc_contig_range test_pages_isolated(1bc800, 1bca90) failed
[ 6533.114516] alloc_contig_range test_pages_isolated(1bc800, 1bca91) failed
And so on.
It pretty much causes frequent 2-5 second hangs, even while I'm writing this message. Just moving the cursor causes such a hang. When changing workspace, the hang lasts 40 seconds.
In other words, bisecting the dpm issue would be very difficult, since I would more than likely run into this issue as well.

The last changes made to drm/radeon/dpm were merged into 3.16-rc1, and they only affect SI hardware.

Suppose this issue could be moved to another bug entry, if it hasn't already been fixed.
Comment 212 Marti Raudsepp 2014-10-30 09:48:19 UTC
(In reply to Jacob from comment #211)
> I looked back through kern.log and found the last time I encountered a
> crash, which happened to be with kernel 3.15.10 from the Ubuntu repo.
> That crash seemed to have been caused by dpm:
> [drm:si_dpm_set_power_state] *ERROR* si_set_sw_state failed

Jacob, please report a separate bug about your symptoms.

(In reply to Michel Dänzer from comment #210)
> One thing I find interesting is that only Southern Islands seems affected.

And Pitcairn too.

(In reply to Michel Dänzer from comment #207)
> In general, it's much better to
> track things separately.

Should the two bugs resolved as "duplicate" be de-duplicated then? Bug 80141 and Bug 82886.
Comment 213 Michel Dänzer 2014-10-31 03:23:13 UTC
(In reply to Marti Raudsepp from comment #212)
> (In reply to Michel Dänzer from comment #210)
> > One thing I find interesting is that only Southern Islands seems affected.
> 
> And Pitcairn too.

Pitcairn is Southern Islands, just like Cape Verde and Tahiti.


> (In reply to Michel Dänzer from comment #207)
> > In general, it's much better to track things separately.
> 
> Should the two bugs resolved as "duplicate" be de-duplicated then? Bug 80141
> and Bug 82886.

I reopened 82886, but Aaron already has his own report (about Chromium, but it may or may not turn out to be the same problem).
Comment 214 Aaron B 2014-10-31 05:42:52 UTC
I'll duplicate my bugs to Bug #85647 to start over; I KNOW I have that bug at minimum. I'll stay off of other "Random RadeonSI" crash reports until we resolve it there.
Comment 215 initzero 2014-10-31 12:39:18 UTC
For me it's also Southern Islands related.
Up2date Archlinux + Oland: unstable
Up2date Archlinux + Kaveri: stable
Comment 216 Cilyan Olowen 2014-11-02 15:37:03 UTC
Not sure if it is related, but I have the same log in dmesg while playing Minecraft with a Radeon 6970 (Northern Islands, if I'm not mistaken). Linux 3.17.1, temp sensor around 57°C, not critical.
Comment 217 Cilyan Olowen 2014-11-02 15:38:13 UTC
Created attachment 108795
Last 300 lines of dmesg on a Radeon 6970
Comment 218 Sean Rhone 2014-11-06 05:52:13 UTC
Just a bit of feedback, but my 7850 seems relatively stable under Xubuntu 14.10 + 3.18rc3 + Paulo's mesa PPA.

General desktop usage and web browsing over the past week resulted in no crashes or GPU hangs, but I did have a slightly weird issue with fullscreened flash video (while fullscreen once the player OSD disappears, moving the mouse would freeze the video, but double-clicking it to un-fullscreen it was fine and it played back normally).

While watching a fullscreen video through Plex's web interface (I think videos play back with HTML5?), I had a couple of GPU hangs (right term?) and restarts, but they were really quick (black screen for about 2 seconds, then everything was restored as if nothing had happened). If I recall right, there were about 2 or 3 hangs over a 37-minute period.

I'm using Google Chrome (not Chromium) 40.0.2202.3 dev (64-bit) with --ignore-gpu-blacklist enabled. Just checking chrome://gpu, I noticed:

Log Messages
[2523:2523:1104/211347:ERROR:sandbox_linux.cc(301)] : InitializeSandbox() called with multiple threads in process gpu-process
[2523:2523:1104/211636:WARNING:x11_util.cc(1490)] : X error received: serial 59083, error_code 3 (BadWindow (invalid Window parameter)), request_code 4, minor_code 0 (X_DestroyWindow)
[2523:2523:1104/221802:ERROR:gpu_video_decode_accelerator.cc(299)] : Not implemented reached in void content::GpuVideoDecodeAccelerator::Initialize(const media::VideoCodecProfile, IPC::Message *)HW video decode acceleration not available.
[2523:2529:1104/225251:ERROR:gpu_watchdog_thread.cc(253)] : The GPU process hung. Terminating after 10000 ms.
GpuProcessHostUIShim: The GPU process crashed!
GpuProcessHostUIShim: The GPU process crashed!
Comment 219 Michel Dänzer 2014-11-11 09:31:27 UTC
Might be worth trying the Mesa patches I attached to bug 85647.
Comment 220 agapito 2014-12-09 06:22:35 UTC
Since the mesa 10.3.4 update I don't have this bug anymore on Archlinux. I've been "stable" for more than two weeks.

http://www.mesa3d.org/relnotes/10.3.4.html
Comment 221 darkbasic 2014-12-09 08:29:55 UTC
Of course since 'radeonsi: Disable asynchronous DMA except for PIPE_BUFFER' the vast majority of crashes disappeared.
Comment 222 fdb4c415 2015-01-02 16:10:06 UTC
I would like to inform you I got the same problem. My box freezes randomly. Keyboard is almost dead, mouse sometimes working (pointer moves, but cannot click on anything) and usually I can only ssh into that box - to reboot it. Sometimes Ctrl-Alt-F1 works and I can log in, but sometimes not. The monitor switches repeatedly: no signal/black screen/no signal/black screen/...
In the log I can see quite similar messages:
radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000002258 last fence id 0x0000000000002255 on ring 0)
VM fault (0x04, vmid 1) at page 30481, read from DMA1 (61)

I have the latest debian test 64 bit installed:
Linux <host> 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt2-1 (2014-12-08) x86_64 GNU/Linux
and I have:
VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO [Radeon HD 7750 / R7 250E]

Is there any way to fix it?
Comment 223 Michel Dänzer 2015-01-07 08:18:45 UTC
(In reply to fdb4c415 from comment #222)
> VM fault (0x04, vmid 1) at page 30481, read from DMA1 (61)
> 
> I have the latest debian test 64 bit installed:

There's a good chance that a newer upstream version of Mesa would help with your problem, if not fix it completely.

For those still having problems, the kernel patches http://lists.freedesktop.org/archives/dri-devel/2015-January/074968.html and http://lists.freedesktop.org/archives/dri-devel/2015-January/074969.html might be worth a try.
Comment 224 Gedalya 2015-01-07 16:19:18 UTC
Filed debian bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=774784

Might need to file a separate bug for the linux package.
Comment 225 Tilman Sauerbeck 2015-01-10 09:47:29 UTC
(In reply to Michel Dänzer from comment #223)

> For those still having problems, the kernel patches
> http://lists.freedesktop.org/archives/dri-devel/2015-January/074968.html and
> http://lists.freedesktop.org/archives/dri-devel/2015-January/074969.html
> might be worth a try.

I applied http://lists.freedesktop.org/archives/dri-devel/2015-January/074969.html on top of kernel 3.18.4, and got:

radeon 0000:01:00.0: GPU fault detected: 146 0x0008080c
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0800800C
VM fault (0x0c, vmid 4) at page 0, read from 'TC0' (0x54433000) (8)

(followed by an unsuccessful attempt to unwedge the GPU, but I guess the lines above are what's really interesting).

This is with Mesa built from 8d2542fc9d5af4db355b67cc2a1ff2f413685a27 on a bonaire xtx.
Comment 226 Tilman Sauerbeck 2015-01-10 19:09:05 UTC
(In reply to Tilman Sauerbeck from comment #225)
> (In reply to Michel Dänzer from comment #223)
> 
> > For those still having problems, the kernel patches
> > http://lists.freedesktop.org/archives/dri-devel/2015-January/074968.html and
> > http://lists.freedesktop.org/archives/dri-devel/2015-January/074969.html
> > might be worth a try.
> 
> I applied
> http://lists.freedesktop.org/archives/dri-devel/2015-January/074969.html on
> top of kernel 3.18.4, and got:

Oops, I tested with 3.18.2 (the latest stable release as of today).
Comment 227 Andrew 2015-01-15 00:36:56 UTC
The monitor switches repeatedly: no signal/black screen/no signal/black screen/... In wine starcraft 2 with gallium nine 100% repeatability(7 of 7 launches). Sorry for my English.
Comment 228 Michel Dänzer 2015-01-15 01:30:56 UTC
(In reply to Andrew from comment #227)
> The monitor switches repeatedly: no signal/black screen/no signal/black
> screen/... In wine starcraft 2 with gallium nine 100% repeatability(7 of 7
> launches).

That's not a random crash but a reproducible one, probably a Mesa bug. Please file a separate report for that.
Comment 229 Andrew 2015-01-15 07:49:48 UTC
My dmesg output is similar to an attachments to this bug. Do I need to create a new bug in this case?
Comment 230 Liss 2015-01-15 09:33:59 UTC
Looks like I have similar issue with Radeon 8850M. I already filled bug 88364, but I'm not sure should I mark it as duplicate because I'm not sure that it is same problem.
Comment 231 Marti Raudsepp 2015-01-15 09:38:06 UTC
(In reply to Liss from comment #230)
> Looks like I have similar issue with Radeon 8850M. I already filled bug
> 88364, but I'm not sure should I mark it as duplicate

(In reply to Andrew from comment #229)
> My dmesg output is similar to an attachments to this bug. Do I need to
> create a new bug in this case?

Please report all issues you have as separate bugs! This bug mixes together multiple issues and symptoms, so it's almost useless.
Comment 232 Morgan Jones 2015-01-21 07:01:05 UTC
Same symptoms with a Hawaii device (R9 290X).

dmesg:

[20174.016203] Watchdog[15659]: segfault at 0 ip 00007fa1902fbb0b sp 00007fa17a6dd560 error 6 in chromium[7fa18c040000+6497000]

lspci:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT [Radeon R9 290X]
Comment 233 Morgan Jones 2015-01-21 07:08:17 UTC
Also, it's worth noting that my crashes are pretty reproducible when running Chromium without --disable-gpu if compton is running. Disabled compton and haven't had any so far.
Comment 234 Marti Raudsepp 2015-01-21 08:29:53 UTC
(In reply to Morgan Jones from comment #232)
> Same symptoms with a Hawaii device (R9 290X).

There are no "same symptoms" in this bug report, it's a mix of multiple different symptoms and issues. Please report a new bug for your problem.
Comment 235 Tom Guder 2015-03-18 12:40:19 UTC
Hello,

i get random freezes only in dota2. Other OpenGL applications run well. Archlinux, 3.18.6-1-ARCH #1 SMP PREEMPT Sat Feb 7 08:44:05 CET 2015 x86_64 GNU/Linux

[11008.894953] radeon 0000:02:00.0: GPU fault detected: 147 0x000c4402
[11008.894956] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100000
[11008.894958] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C044002
[11008.894959] VM fault (0x02, vmid 6) at page 1048576, read from TC (68)
[11008.894961] radeon 0000:02:00.0: GPU fault detected: 147 0x058c4801
[11008.894962] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000AA85A
[11008.894963] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C0C8002
[11008.894964] VM fault (0x02, vmid 6) at page 698458, read from TC (200)
[11019.062287] radeon 0000:02:00.0: ring 4 stalled for more than 10000msec
[11019.062291] radeon 0000:02:00.0: GPU lockup (current fence id 0x0000000000013609 last fence id 0x000000000001360a on ring 4)
[11019.062312] radeon 0000:02:00.0: failed to get a new IB (-35)
[11019.062315] [drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
[11019.556350] radeon 0000:02:00.0: Saved 780 dwords of commands on ring 0.
.
.
.
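For comparing logs like the one above across reports, the "VM fault" lines can be pulled apart mechanically. The following Python sketch is only an illustration: the regular expression is written against the three line variants quoted in this report, the "write" alternative and the 4 KiB page size used for the byte address are assumptions, and all names are invented for the example.

```python
import re
from typing import NamedTuple, Optional


class VmFault(NamedTuple):
    protections: int   # first hex value in the parentheses
    vmid: int
    page: int
    access: str        # "read" (or "write", assumed to exist as well)
    client: str        # e.g. "TC", "DMA1", "TC0"
    client_id: int     # trailing number in parentheses
    address: int       # page * 4096, assuming 4 KiB GPU pages


VM_FAULT_RE = re.compile(
    r"VM fault \(0x([0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '?(\w+)'?.*\((\d+)\)\s*$"
)


def parse_vm_fault(line: str) -> Optional[VmFault]:
    """Parse one dmesg line; return None if it is not a VM fault line."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    page = int(m.group(3))
    return VmFault(
        protections=int(m.group(1), 16),
        vmid=int(m.group(2)),
        page=page,
        access=m.group(4),
        client=m.group(5),
        client_id=int(m.group(6)),
        address=page * 4096,
    )


if __name__ == "__main__":
    line = "[11008.894959] VM fault (0x02, vmid 6) at page 1048576, read from TC (68)"
    print(parse_vm_fault(line))
```

Note that the page number is simply the decimal form of the VM_CONTEXT1_PROTECTION_FAULT_ADDR value printed two lines earlier (0x00100000 is 1048576, 0x000AA85A is 698458).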
Comment 236 Tom Guder 2015-03-18 12:55:27 UTC
Same with kernel 3.14.35-1-lts. Dota2 crashes every time within one minute of spectating a game, freezing the screen and keyboard. Networking works.

Bests
Tom

[  129.619475] radeon 0000:02:00.0: GPU fault detected: 147 0x000a4401
[  129.619479] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x01000000
[  129.619480] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
[  129.619482] VM fault (0x01, vmid 5) at page 16777216, read from TC (68)
[  129.619484] radeon 0000:02:00.0: GPU fault detected: 146 0x020a440c
[  129.619485] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  129.619486] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
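The VM_CONTEXT1_PROTECTION_FAULT_STATUS values in these logs appear to carry the same fields that the "VM fault" line prints. The Python sketch below uses a bit layout inferred from the log lines in this report (0x0C044002 giving 0x02, vmid 6, client 68; 0x0A044001 giving 0x01, vmid 5, client 68; 0x0800800C giving 0x0c, vmid 4, client 8). The field widths and the read/write bit are assumptions that the logs here cannot confirm, since every fault shown is a read; treat this as a guess, not as the driver's definition.

```python
def decode_fault_status(status: int) -> dict:
    """Split a PROTECTION_FAULT_STATUS value into its apparent fields."""
    return {
        "protections": status & 0xFF,          # low bits, printed as 0x02 etc.
        "client_id": (status >> 12) & 0xFF,    # printed in trailing parentheses
        "write": bool((status >> 24) & 1),     # assumed: 0 = read, 1 = write
        "vmid": (status >> 25) & 0xF,          # printed as "vmid N"
    }


if __name__ == "__main__":
    for value in (0x0C044002, 0x0C0C8002, 0x0A044001, 0x0800800C):
        print(hex(value), decode_fault_status(value))
```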
Comment 237 Marti Raudsepp 2015-03-18 12:58:27 UTC
(In reply to Tom Guder from comment #235)
> i get random freezes only in dota2. Other OpenGL applications run well.

Please report a separate bug for your exact circumstances. This one is being ignored by developers because there are many different causes.
Comment 238 agapito 2015-04-13 06:50:10 UTC
I had this bug again :S 

Using KDE 5, I had 2 crashes when I changed the animation speed in systemsettings5 - screens and monitor - compositor options.

My system is Archlinux 64 bit, kernel 3.19.3 and mesa 10.5.2. I can't report any dmesg or log because my system freezes completely.
Comment 239 Marti Raudsepp 2015-04-13 07:32:13 UTC
(In reply to agapito from comment #238)
> I had this bug again :S 

There is no "this bug". Please report a separate bug for your exact circumstances. This one is being ignored by developers because there are many different causes.
Comment 240 Martin Peres 2019-11-19 08:52:51 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/506.