Bug 63599

Summary:

[r600] SUMO2 GPU lockup CP stall (kernel 3.2.47, 3.4, 3.8, 3.9, 3.10)

Product:

DRI

Reporter:

wojtek <wojtask9>

Component:

DRM/Radeon

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

QA Contact:

Severity:

normal

Priority:

medium

CC:

alpha_one_x86, manowar

Version:

XOrg git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Attachments:

Description	Flags
lspci	none
dmesg	none
Xorg.0.log without crash (but GPU lockup CP stall appears)	none
Xorg.0.log with crash log	none
reg_dump_radeon_kernel39	none
reg_dump_fglrx_kernel39	none
dmesg output, vramlimit=16, Xorg started	none
dmesg output, vramlimit=32, Xorg started	none
dmesg output, vramlimit=128, Xorg started	none
dmesg output from ssh (hyperz enabled)	none
dmesg output from ssh (hyperz disabled)	none
sumo2.patch	none
possible fix	none

Description wojtek 2013-04-16 11:43:40 UTC

Created attachment 78069 [details]
lspci

This always happens with KDM, LightDM, and XDM (xdm-archlinux package). The login screen freezes (but I can still move the mouse pointer).

At first I thought it was a configuration error on my side, but the same issue occurs with the Knoppix live CD (7.0.5, kernel 3.6.10, mesa-9.0.1).

Without a display manager (startx from the console) the result is the same: the screen appears but is frozen (tested with Xfce4 and KDE). Without a DM I don't see any GPU lockup, but Xorg.0.log shows errors.

Setting R600_HYPERZ=0 or R600_DEBUG=nohyperz doesn't help.

System:
kernel 3.8.5 or 3.9.0-rc7
mesa 9.1.1
Other packages are up to date from the Arch repository.

Comment 1 wojtek 2013-04-16 11:44:18 UTC

Created attachment 78070 [details]
dmesg

Comment 2 wojtek 2013-04-16 11:45:26 UTC

Created attachment 78071 [details]
Xorg.0.log without crash (but GPU lockup CP stall appears)

Comment 3 wojtek 2013-04-16 11:46:12 UTC

Created attachment 78072 [details]
Xorg.0.log with crash log

Comment 4 wojtek 2013-05-17 19:18:53 UTC

I've just installed Gentoo. Issue still exists.

radeon-ucode-20130513
mesa-9.2-20130515 (without llvm)
kernel-3.9.2 (UVD disabled because KMS with UVD doesn't work)
libdrm-2.4.44
xf86-video-ati-7.1.0 (without glamor)

Any ideas or patches to test?

Comment 5 Alex Deucher 2013-05-17 19:54:06 UTC

Is this a regression with kernels 3.8, 3.9?  I.e., did 3.7 work ok?

Comment 6 wojtek 2013-05-17 20:04:00 UTC

This is the first Linux installation on that machine, so I cannot confirm whether this is a regression.

When I first installed Arch Linux, the kernel version was 3.7 and the GPU lockup occurred.

Comment 7 wojtek 2013-05-17 21:47:59 UTC

Short summary (Alex's suggestions on IRC):

remove r600g_dri.so -> NOT OK
"ColorTiling2D" "false" -> NOT OK
kernel-3.10-rc1 -> NOT OK

"NoAccel" "true" -> OK

Comment 8 wojtek 2013-06-07 18:33:25 UTC

Created attachment 80492 [details]
reg_dump_radeon_kernel39

Comment 9 wojtek 2013-06-07 18:33:54 UTC

Created attachment 80493 [details]
reg_dump_fglrx_kernel39

Comment 10 Jerome Glisse 2013-06-19 15:29:14 UTC

What are the motherboard and CPU models?

AMD A4-3400 ?

Comment 11 wojtek 2013-06-19 15:43:22 UTC

Motherboard: Gigabyte Technology Co., Ltd. GA-A75M-UD2H/GA-A75M-UD2H, BIOS F5 11/03/2011

CPU: AMD A4-3400 APU with Radeon(tm) HD Graphics (fam: 12, model: 01, stepping: 00)

Comment 12 wojtek 2013-06-27 20:30:04 UTC

Probably a duplicate of
https://bugs.freedesktop.org/show_bug.cgi?id=56081
and almost the same issue as
http://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg40024.html

On my system, with the tree from
http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.11-wip-4
the GPU lockup is still present (tested on X11 and Wayland).

Comment 13 Alex Deucher 2013-07-04 19:51:39 UTC

Try comparing the following registers between fglrx and radeon:

0x98FC
0x98F8
0x8950
0x98F4
0x3F88
0x9B7C
0x3F90
0x9148
0x3F94
0x914C
0x8954
0x2004
0x2008
0x2768
0x8B24
0xA008
0xA020
0xA02C
0x9100
0x913C
0x960c
0x9610
0x88C4

You can use radeonreg (as root):
./radeonreg regmatch 0x98FC

See if changing any of them to what fglrx programs helps; do so before starting X, with tiling disabled.
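
The comparison can also be scripted. The following is a minimal sketch, not part of radeonreg: it assumes each dump is plain text with one register per line in the form "0x98F8 0x02010002 (33619970)", and the names parse_dump and diff_registers are hypothetical helpers invented for this illustration.

```python
import re

# Registers listed in this comment.
REGISTERS = [
    0x98FC, 0x98F8, 0x8950, 0x98F4, 0x3F88, 0x9B7C, 0x3F90, 0x9148,
    0x3F94, 0x914C, 0x8954, 0x2004, 0x2008, 0x2768, 0x8B24, 0xA008,
    0xA020, 0xA02C, 0x9100, 0x913C, 0x960C, 0x9610, 0x88C4,
]

# Assumed dump line format: "<register> <value> (<decimal>)".
LINE_RE = re.compile(r"^\s*(0x[0-9A-Fa-f]+)\s+(0x[0-9A-Fa-f]+)")


def parse_dump(text):
    """Return {register: value} for every line that matches the assumed format."""
    regs = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            regs[int(match.group(1), 16)] = int(match.group(2), 16)
    return regs


def diff_registers(fglrx, radeon, wanted=REGISTERS):
    """List (register, fglrx_value, radeon_value) for wanted registers that differ."""
    result = []
    for reg in wanted:
        a, b = fglrx.get(reg), radeon.get(reg)
        if a is not None and b is not None and a != b:
            result.append((reg, a, b))
    return result


# Small illustrative inputs; real dumps would be read from files.
FGLRX_SAMPLE = "0x98FC 0x00000000 (0)\n0x98F8 0x00000000 (0)\n0x3F90 0xffff0000 (-65536)\n"
RADEON_SAMPLE = "0x98FC 0x00000000 (0)\n0x98F8 0x02010002 (33619970)\n0x3F90 0x00000000 (0)\n"

for reg, a, b in diff_registers(parse_dump(FGLRX_SAMPLE), parse_dump(RADEON_SAMPLE)):
    print(f"0x{reg:04X}: fglrx=0x{a:08x} radeon=0x{b:08x}")
```

Registers missing from either dump are skipped rather than reported, so a partial dump does not produce spurious differences.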

Comment 14 wojtek 2013-07-05 14:15:21 UTC

It doesn't help :/

xorg.conf

ColorTilling "false"
ColorTilling2D "false"

Maybe I'm doing something wrong?

modprobe fglrx
radeonreg reg radeon > fglrx_reg_dump_console.log
start x
radeonreg reg radeon > fglrx_reg_dump_x11.log

restart computer

modprobe radeon
radeonreg reg radeon > radeon_reg_dump_console.log

./diff_and_select select_registers.txt > selected_registers.txt # script that generates a diff and selects only the registers from comment #13

./set_registers selected_registers.txt # script that reads selected_registers.txt and runs radeonreg regset "register" "register_value" for each entry
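
The two helper scripts are not attached to this report. As a rough sketch of what the last step might look like, the snippet below turns a list of selected registers into radeonreg regset command lines. It assumes selected_registers.txt uses the same "register value (decimal)" line format as the dumps, and regset_commands is a hypothetical name. It only builds and prints the commands; it does not run them.

```python
import re

# Assumed line format: "<register> <value> (<decimal>)".
LINE_RE = re.compile(r"^\s*(0x[0-9A-Fa-f]+)\s+(0x[0-9A-Fa-f]+)")


def regset_commands(selected_text):
    """Build one 'radeonreg regset <register> <value>' line per parsed entry."""
    commands = []
    for line in selected_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            commands.append(f"radeonreg regset {match.group(1)} {match.group(2)}")
    return commands


# Illustrative input; the real file would come from the diff step.
SELECTED = """\
0x98F8        0x00000000 (0)
0x3F90        0xffff0000 (-65536)
"""

for command in regset_commands(SELECTED):
    print(command)
```

Printing the commands first makes it possible to review exactly what would be written to the hardware before running anything as root.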

Comment 15 wojtek 2013-07-05 16:04:28 UTC

diff between registers from comment#13 before startx

diff fglrx_registers_c.log radeon_registers_c.log
2c2
< 0x98F8        0x00000000 (0)
---
> 0x98F8        0x02010002 (33619970)
7,8c7,8
< 0x3F90        0xffff0000 (-65536)
< 0x9148        0xffff0000 (-65536)
---
> 0x3F90        0x00000000 (0)
> 0x9148        0x00000000 (0)
15,17c15,17
< 0x8B24        0x00000000 (0)
< 0xA008        0x00030000 (196608)
< 0xA020        0x00158011 (1409041)
---
> 0x8B24        0x00ff0fff (16715775)
> 0xA008        0x00010000 (65536)
> 0xA020        0x00020009 (131081)
20,21c20,21
< 0x913C        0x01000000 (16777216)
< 0x960c        0x76543210 (1985229328)
---
> 0x913C        0x00000004 (4)
> 0x960c        0x54763210 (1417032208)
23c23
< 0x88C4        0x00000000 (0)
---
> 0x88C4        0x000000c1 (193)

Comment 16 Paul Wolneykien 2013-07-17 06:22:49 UTC

  Hi there. I'm not absolutely sure I have the same issue, but my problem is at least very close to the one reported here.

System:
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Sumo [Radeon HD 6480G]
xorg-server-1.14.2-alt1
xorg-drv-radeon-7.1.0-alt2
Linux 3.9.10-std-def-alt1

Could you please test the following? Try:

modprobe radeon vramlimit=32

On my system it helps, but only a little: the X screen doesn't freeze, it is just heavily corrupted, and VTs can be switched. The vramlimit=16 option works the same way, but the other values I've tried (64, 128) don't!

Does this tell us anything we can use to track down the bug?

Comment 17 Paul Wolneykien 2013-07-17 06:27:04 UTC

Created attachment 82523 [details]
dmesg output, vramlimit=16, Xorg started

Attached the output of dmesg | grep 'radeon' after modprobe radeon vramlimit=16 and Xorg start (kdm).

Comment 18 Paul Wolneykien 2013-07-17 06:27:44 UTC

Created attachment 82524 [details]
dmesg output, vramlimit=32, Xorg started

Attached the output of dmesg | grep 'radeon' after modprobe radeon vramlimit=32 and Xorg start (kdm).

Comment 19 Paul Wolneykien 2013-07-17 06:30:31 UTC

Created attachment 82525 [details]
dmesg output, vramlimit=128, Xorg started

Attached the output of dmesg | grep 'radeon' after modprobe radeon vramlimit=128 and Xorg start (kdm).

It freezes, just as with no vramlimit option. I used ...; sleep 10; dmesg | grep 'radeon' >file to capture the output.

Comment 20 wojtek 2013-07-17 12:13:08 UTC

Yeah :)
With vramlimit=16, xeyes works perfectly :).
KDM (4.10.5) is working (without artifacts).

kernel-3.11-rc1 from http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-3.11

I'll do more tests later.

Comment 21 Alex Deucher 2013-07-17 12:59:38 UTC

That just disables the use of vram for exa pixmaps.  You can accomplish the same thing by adding:
Option "EXAPixmaps" "false"
to the device section of your xorg.conf

Comment 22 Paul Wolneykien 2013-07-24 10:18:32 UTC

(In reply to comment #21)
> That just disables the use of vram for exa pixmaps.  You can accomplish the
> same thing by adding:
> Option "EXAPixmaps" "false"
> to the device section of your xorg.conf

  Yes, the option helps a little: I can see the KDM screen without any artifacts. However, it is dead, i.e. frozen. It feels like a GPU lockup, but there are no error messages in the Xorg log or dmesg. How can I get more info?

Comment 23 wojtek 2013-07-25 22:43:04 UTC

(In reply to comment #13)
> Try comparing the following registers between fglrx and radeon:
> 
difference:

left (fglrx) right(rad)

GB_ADDR_CONFIG
-0x98F8 0x02010002 (33619970)
+0x98F8 0x02010001 (33619969)

CGTS_SYS_TCC_DISABLE
-0x3F90 0x00000000 (0)
+0x3F90 0xff000000 (-16777216)

CGTS_TCC_DISABLE
-0x9148 0x00000000 (0)
+0x9148 0xff000000 (-16777216)

CGTS_USER_SYS_TCC_DISABLE
-0x3F94 0x00000000 (0)
+0x3F94 0xff000000 (-16777216)

CGTS_USER_TCC_DISABLE
-0x914C 0x00000000 (0)
+0x914C 0xff000000 (-16777216)

PA_SC_FORCE_EOV_MAX_CNTS
-0x8B24 0x00ff0fff (16715775)
+0x8B24 0x00ff3fff (16728063)

SMX_DC_CTL0
-0xA020 0x00158009 (1409033)
+0xA020 0x00020009 (131081)

SPI_CONFIG_CNTL_1
-0x913C 0x00000004 (4)
+0x913C 0x00000000 (0)

VGT_CACHE_INVALIDATION
-0x88C4 0x000000c1 (193)
+0x88C4 0x000000c2 (194)
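
To see which bits actually differ in a pair of values, XOR the two values and list the set bits. The snippet below is an illustrative sketch using some of the pairs listed above; differing_bits is a hypothetical helper, and since XOR is symmetric the result does not depend on which value came from which driver.

```python
def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    xor = a ^ b
    return [bit for bit in range(32) if (xor >> bit) & 1]


# Value pairs copied from the list above.
PAIRS = {
    "GB_ADDR_CONFIG": (0x02010002, 0x02010001),
    "PA_SC_FORCE_EOV_MAX_CNTS": (0x00FF0FFF, 0x00FF3FFF),
    "SMX_DC_CTL0": (0x00158009, 0x00020009),
    "VGT_CACHE_INVALIDATION": (0x000000C1, 0x000000C2),
}

for name, (a, b) in PAIRS.items():
    print(f"{name}: xor=0x{a ^ b:08x} bits={differing_bits(a, b)}")
```

Mapping the bit positions to named fields still requires the register definitions from the driver headers or hardware documentation.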

Comment 24 wojtek 2013-07-25 22:57:07 UTC

> left (fglrx) right(rad)
> GB_ADDR_CONFIG
> -0x98F8 0x02010002 (33619970)
> +0x98F8 0x02010001 (33619969)

my mistake

- (left - radeon)
+ (right -fglrx)

Comment 25 pablow.1422 2013-08-01 21:30:12 UTC

Hi. I'm having a similar GPU lockup CP stall on a REDWOOD card, but it doesn't happen at login. With kernels 3.9 and 3.10 I can run Xorg and use KDE KWin effects without any issue, but when I start a game like 0 A.D. or Need for Speed Most Wanted under Wine, a lockup happens after some time. With kernel 3.11-rc3 lockups occur on the desktop, shortly after login (with or without radeon.dpm set to 1).

Here are two dmesg attachments (both from kernel 3.10), one with the R600_HYPERZ=0 environment variable set and the other without (as suggested on another bug).

Is there anything else I can do to help?

Comment 26 pablow.1422 2013-08-01 21:32:29 UTC

Created attachment 83488 [details]
dmesg output from ssh (hyperz enabled)

Comment 27 pablow.1422 2013-08-01 21:33:02 UTC

Created attachment 83489 [details]
dmesg output from ssh (hyperz disabled)

Comment 28 Alex Deucher 2013-08-01 21:33:26 UTC

(In reply to comment #25)
> Hi. I'm having a similar GPU lockup CP stall on a REDWOOD card. But it
> doesn't happen at login. With kernels 3.9, 3.10 I can run  Xorg, use KDE
> Kwin effects without any issue; but when I start a game like 0a.d. or Need
> for Speed Most Wanted under wine, after some time a lockup happens. With
> kernel 3.11-rc3 lockups occurs while in desktop, little time after login
> -with or without radeon.dpm set to 1-.
> 
> Here are two attachments (the two running a kernel 3.10) of dmesg, one with
> R600_HYPERZ=0 env var set, and the other not (as suggested on another bug).
> 
> Is there anything else I can help?

Please open a different bug.  Your issues are not related to this one.

Comment 29 wojtek 2013-10-01 23:21:11 UTC

Created attachment 86938 [details]
sumo2.patch

A simple patch that fixes the problem on my system :)

Comment 30 Paul Wolneykien 2013-12-21 00:06:50 UTC

(In reply to comment #13)
> Try comparing the following registers between fglrx and radeon:
> 
> ...
> 
> You can use radeonreg (as root):
> ./radeonreg regmatch 0x98FC
> 

Here is the console mode comparison (radeon installs the framebuffer console, though):

$ diff -u fglrx-console.data radeon-console.data 
--- fglrx-console.data  2013-12-21 01:38:37.952290647 +0400
+++ radeon-console.data 2013-12-21 01:42:45.844348784 +0400
@@ -1,23 +1,23 @@
 0x98FC 0x00000000 (0)
-0x98F8 0x00000000 (0)
+0x98F8 0x02010002 (33619970)
 0x8950 0xfffcf001 (-200703)
 0x98F4 0x00fe0001 (16646145)
 0x3F88 0x00fe0001 (16646145)
 0x9B7C 0x00000000 (0)
-0x3F90 0xffff0000 (-65536)
-0x9148 0xffff0000 (-65536)
+0x3F90 0x00000000 (0)
+0x9148 0x00000000 (0)
 0x3F94 0x00000000 (0)
 0x914C 0x00000000 (0)
 0x8954 0x00000000 (0)
 0x2004 0x00000210 (528)
 0x2008 0x00fac688 (16434824)
 0x2768 0x00007000 (28672)
-0x8B24 0x00000000 (0)
-0xA008 0x00030000 (196608)
-0xA020 0x00158011 (1409041)
+0x8B24 0x00ff0fff (16715775)
+0xA008 0x00010000 (65536)
+0xA020 0x00158009 (1409033)
 0xA02C 0x0000001b (27)
 0x9100 0x00000000 (0)
-0x913C 0x01000000 (16777216)
-0x960c 0x76543210 (1985229328)
+0x913C 0x00000004 (4)
+0x960c 0x54763210 (1417032208)
 0x9610 0x0000ba98 (47768)
-0x88C4 0x00000000 (0)
+0x88C4 0x000000c1 (193)

> See if changing any of them to what fglrx programs before starting X with
> tiling disabled.

  I've set each register to its value from fglrx-console.data and called xinit. It locks up again; nothing changes.

Comment 31 Alex Deucher 2013-12-21 00:39:30 UTC

(In reply to comment #30)
> 
>   I've set each register values from fglrx-console.data and call xinit.
> Lockup again, nothing changes.

Does the patch in comment 29 fix the issue for you?  That patch is upstream now and should be in most stable kernels as well.

Comment 32 Paul Wolneykien 2013-12-21 09:15:53 UTC

(In reply to comment #31)
> Does the patch in comment 29 fix the issue for you?  That patch is upstream
> now and should be in most stable kernels as well.

  Unfortunately, no. I've tested Linux 3.12 with the patch already applied and it changes nothing. My chip is:

$ lspci | grep 'VGA'
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Sumo [Radeon HD 6480G]

  Is it really a SUMO2 chip, or should I open a new bug?

Comment 33 Alex Deucher 2013-12-21 14:32:37 UTC

(In reply to comment #32)
>   To the pity, no. I've tested Linux 3.12 with the patch already applied and
> it changes nothing. My chip is:
> 
> $ lspci | grep 'VGA'
> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Sumo [Radeon HD 6480G]
> 
>   Is it really a SUMO2 chip or I should open a new bug?

What's the numeric pci id (lspci -nn)?  You should also see a line in the dmesg output like this:
[    1.758014] [drm] initializing kernel modesetting (SUMO 0x1002:0x9640 0x1458:0xD000).

It will say SUMO or SUMO2 depending on which one you have.
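
When checking several machines, the family name and IDs can be pulled out of that dmesg line automatically. The sketch below is only an illustration: parse_kms_line is a hypothetical helper, and it assumes the two ID pairs are vendor:device followed by subsystem vendor:subsystem device.

```python
import re

# Matches the "[drm] initializing kernel modesetting (...)" line shown above.
KMS_RE = re.compile(
    r"\[drm\] initializing kernel modesetting "
    r"\((\w+) (0x[0-9A-Fa-f]{4}):(0x[0-9A-Fa-f]{4}) "
    r"(0x[0-9A-Fa-f]{4}):(0x[0-9A-Fa-f]{4})\)"
)


def parse_kms_line(line):
    """Return the chip family and PCI IDs from a dmesg line, or None if no match."""
    match = KMS_RE.search(line)
    if match is None:
        return None
    family, vendor, device, sub_vendor, sub_device = match.groups()
    return {
        "family": family,
        "vendor": int(vendor, 16),
        "device": int(device, 16),
        # Assumed meaning of the second pair: subsystem vendor and device.
        "subsystem_vendor": int(sub_vendor, 16),
        "subsystem_device": int(sub_device, 16),
    }


EXAMPLE = "[    1.758014] [drm] initializing kernel modesetting (SUMO 0x1002:0x9640 0x1458:0xD000)."
info = parse_kms_line(EXAMPLE)
print(info["family"], hex(info["device"]))
```

The function returns None for lines that do not match, so it can be applied to every line of a full dmesg log.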

Comment 34 Paul Wolneykien 2013-12-23 12:28:07 UTC

(In reply to comment #33)
> What's the numeric pci id (lspci -nn)?  You should also see a line in the
> dmesg output like this:
> [    1.758014] [drm] initializing kernel modesetting (SUMO 0x1002:0x9640
> 0x1458:0xD000).
> 
> It will say SUMO or SUMO2 depending on which one you have.


$ lspci -nn | grep 'VGA'
00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Sumo [Radeon HD 6480G] [1002:9649]

[  188.599127] [drm] initializing kernel modesetting (SUMO 0x1002:0x9649 0x17AA:0x21EA).


Thus, it's SUMO, isn't it? Should I try to use "max_hw_contexts = 4" for the SUMO case, or is that unreasonable?

Comment 35 Alex Deucher 2013-12-23 14:34:10 UTC

Created attachment 91154 [details] [review]
possible fix

(In reply to comment #34)
> 
> $ lspci -nn | grep 'VGA'
> 00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc.
> [AMD/ATI] Sumo [Radeon HD 6480G] [1002:9649]
> 
> [  188.599127] [drm] initializing kernel modesetting (SUMO 0x1002:0x9649
> 0x17AA:0x21EA).
> 
> 
> Thus, is's SUMO isn't it? Should I try to use "max_hw_contexts = 4" for the
> SUMO case or that's unreasonable?

Your chip was misclassified.  The attached patch should fix it.

Comment 36 Paul Wolneykien 2013-12-23 20:20:33 UTC

(In reply to comment #35)
> Your chip was misclassified.  The attached patch should fix it.

  Great! Now, it works! :)))

  Thank you so much.

Comment 37 Paul Wolneykien 2013-12-23 21:16:20 UTC

(In reply to comment #36)
> (In reply to comment #35)
> >
> > Your chip was misclassified.  The attached patch should fix it.
> 
>   Great! Now, it works! :)))
> 
>   Thank you so much.

  However, the fun didn't last long: suspend/resume gives a black screen, and VT switching doesn't work.

  There are a number of bugs on that subject, including

https://bugs.freedesktop.org/show_bug.cgi?id=66940
https://bugs.freedesktop.org/show_bug.cgi?id=23103
https://bugs.freedesktop.org/show_bug.cgi?id=40935
https://bugs.freedesktop.org/show_bug.cgi?id=42162
https://bugs.freedesktop.org/show_bug.cgi?id=50805
https://bugs.freedesktop.org/show_bug.cgi?id=72710

  Which one would you advise me to join?
