Bug 78453

Summary: [HAWAII] Get acceleration working
Product: DRI
Reporter: Luzipher <luziphermcleod>
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: fejfighter, jmuehldorfer, kai, nick, peterasplund, portals, serkan
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments (flags: none on every attachment):
    - netconsole when running glxgears on hawaii, test E
    - test E: dmesg output up to and including "Xorg -retro" startup
    - test E: journalctl while glxgears runs
    - test F: dmesg of glxgears without tiling
    - radeonsi: Disable tiling as much as possible
    - test F: dmesg of glxgears without tiling (second patch by Michel Dänzer)
    - test F: new behaviour with latests mesa pathes by Marek - gpu lockup on Xorg start
    - dmesg with 3.14.3
    - dmesg with 3.14.3, mesa 469b42e
    - Dmesg with kernel 3.12+
    - Dmesg with kernel 32f79a8
    - With tiling disabled, kernel 32f79a8a
    - Dmesg with kernel 32f79a8 radeon.dpm/runpm = 0
    - radeon_fence_info before running eglinfo
    - radeon_fence_info with egl first run
    - radeon_fence_info with egl second run
    - dmesg-using-test3
    - dmesg-test-1
    - use old hdp flush
    - dmesg with old-hdp patch (kernel 3.15+)
    - dmesg kernel 3.12+ test=2
    - dmesg of eglgears_screen with drm-fixes 3.15-rc6 and agd5f's use-old-hdp-flush patch
    - Possible fix
    - dmesg kernel 3.15+-disable-fence-wait-patch
    - xorg and then glxinfo with kernel patch from attachment 99839
    - dmesg with nodma + notiling
    - dmesg: lock-up with Xorg -retro
    - Xorg.0.log showing "EQ overflowing"
    - Hack to temporarily fix accel
    - dmesg output after restarting X after lockup from comment #87
    - dmesg with working acceleration: GPU faults
    - dmesg with working acceleration (Marek's patches): WoW GPU reset
    - dmesg with drm-next-3.17-wip
    - dmesg with drm-next-3.17-wip with x -retro and starting xterm
    - xorg log with drm-next-3.17-wip with x -retro and starting xterm

Description Luzipher 2014-05-08 18:49:40 UTC
Hawaii acceleration is currently disabled by default, as it doesn't really work (causes GPU crashes). This bug intends to collect data that hopefully helps to resolve the issues.

agd5f suggested on IRC that Hawaii support worked better when it was first committed than it does now; according to him, glxgears worked back then. He also said it would help to know which component caused the glxgears regression. I tried to get glxgears working with software versions from that time, but I wasn't successful.

The information I collected so far is:

#### A: OLD (mostly from the time when hawaii support was first committed) ####
xf86-video-ati: d571d6af70ef27efd1ed6420eb892bdde963ed7a
glamor: 0.5.1-r1
xorg-server: 1.14.6
mesa: 469b42ee21d6bc530200c76cb0e73b0b461ab6e8
kernel: ??
libdrm: ?? (git, from 2.4.53 series)
----> Result when starting Xorg:
    - screen goes black (~ 1sec)
    - text-mode cursor appears in upper left corner (non blinking)
    - keyboard doesn't respond (caps-lock doesn't change led, ctrl-alt-F# doesn't change to console)
    - short press on power button on computer works (screen stays frozen, but system shuts down after a few seconds)


#### B: Updated glamor ####
xf86-video-ati: d571d6af70ef27efd1ed6420eb892bdde963ed7a
glamor: 0.6.0-r1
xorg-server: 1.14.6
mesa: 469b42ee21d6bc530200c76cb0e73b0b461ab6e8
kernel: 3.13.0-15365-gef64cf9-dirty, 3.15.0-rc2-177836-ge8d0b39-dirty
libdrm: ?? (git, from 2.4.53 series)
----> Result when starting Xorg:
    - screen goes black (~ 1sec)
    - text-mode cursor appears in upper left corner (non blinking)
    - keyboard doesn't respond (caps-lock doesn't change led, ctrl-alt-F# doesn't change to console)
    - short press on power button on computer works (screen stays frozen, but system shuts down after a few seconds)


#### C: Updated glamor, xf86-video-ati ####
xf86-video-ati: 0333f5bda27dc0ec2edc180c7a4dc9a432f13f97
glamor: 0.6.0-r1
xorg-server: 1.14.6
mesa: 469b42ee21d6bc530200c76cb0e73b0b461ab6e8
kernel: 3.15.0-rc2-177836-ge8d0b39-dirty
libdrm: ?? (git, from 2.4.53 series)
----> Result when starting Xorg:
    - screen goes black (~ 1sec)
    - text-mode cursor appears in upper left corner (non blinking)
    - keyboard doesn't respond (caps-lock doesn't change led, ctrl-alt-F# doesn't change to console)
    - short press on power button on computer works (screen stays frozen, but system shuts down after a few seconds)


#### D: Updated glamor, xf86-video-ati, xorg-server ####
xf86-video-ati: 0333f5bda27dc0ec2edc180c7a4dc9a432f13f97
glamor: 0.6.0-r1
xorg-server: 1.15.1
mesa: 469b42ee21d6bc530200c76cb0e73b0b461ab6e8
kernel: 3.13.0-15365-gef64cf9-dirty, 3.15.0-rc2-177836-ge8d0b39-dirty
libdrm: ?? (git, from 2.4.53 series)
----> Result when starting Xorg:
    - screen goes black (~ 1sec)
    - text-mode cursor appears in upper left corner (non blinking)
    - keyboard doesn't respond (caps-lock doesn't change led, ctrl-alt-F# doesn't change to console)
    - short press on power button on computer works (screen stays frozen, but system shuts down after a few seconds)
    - startx falls back to an unresponsive console showing the X startup output; when pressing the power button I can see systemd shutting down the system


#### E: Updated glamor, xf86-video-ati, xorg-server, libdrm, mesa ####
xf86-video-ati: 0333f5bda27dc0ec2edc180c7a4dc9a432f13f97
glamor: 0.6.0-r1
xorg-server: 1.15.1
mesa: git (2014-05-07)
kernel: 3.15.0-rc3-41840-g2a1235e-dirty (airlied drm-fixes git)
libdrm: git (2014-05-07), 2.4.54 series
----> Result when starting "Xorg -retro":
    - X starts as expected (low resolution, pattern and mouse visible)
    - remotely running glxinfo gives output (radeon driver) but crashes X
    - glxgears shows a correctly rendered frame, stalls, gpu crash shortly afterward (black screen). See attached video for details
    - the gpu crashes continue, the screen gets more and more corrupted
    - when killing glxgears (from remote), the computer crashes completely
    - also see attached logs
Comment 1 Luzipher 2014-05-08 19:01:20 UTC
As uploading the video to this bugtracker didn't work (file size), I uploaded it to YouTube: http://youtu.be/oT--dsCdh98
Comment 2 Luzipher 2014-05-08 19:04:08 UTC
Hardware Details:
=====================

Graphics Card: Hawaii XT, Sapphire Radeon R9 290X Tri-X OC (11226-00-50G)
Graphics Chip: HAWAII 0x1002:0x67B0 0x174B:0xE285
Monitors: 3 (HP LP2475w via DVI, Samsung 214T via DVI, Samsung TV via HDMI)
Processor: Core i7-965 (LGA 1366)
Mainboard: Asus P6T Deluxe
RAM: 6GB
Comment 3 Luzipher 2014-05-08 19:08:09 UTC
Created attachment 98702 [details]
netconsole when running glxgears on hawaii, test E
Comment 4 Luzipher 2014-05-08 19:12:57 UTC
Created attachment 98703 [details]
test E: dmesg output up to and including "Xorg -retro" startup
Comment 5 Luzipher 2014-05-08 19:14:53 UTC
Created attachment 98704 [details]
test E: journalctl while glxgears runs
Comment 6 Alex Deucher 2014-05-08 20:17:41 UTC
Try disabling tiling.  Add:

Option "ColorTiling" "false"
Option "ColorTiling2D" "false"

to the device section of your xorg.conf, then apply this patch to mesa:

diff --git a/src/gallium/drivers/radeon/r600_texture.c b/src/gallium/drivers/radeon/r600_texture.c
index e30d933..fe806be 100644
--- a/src/gallium/drivers/radeon/r600_texture.c
+++ b/src/gallium/drivers/radeon/r600_texture.c
@@ -725,6 +725,8 @@ static unsigned r600_choose_tiling(struct r600_common_screen *rscreen,
 {
        const struct util_format_description *desc = util_format_description(templ->format);
 
+       return RADEON_SURF_MODE_LINEAR_ALIGNED;
+
        /* MSAA resources must be 2D tiled. */
        if (templ->nr_samples > 1)
                return RADEON_SURF_MODE_2D;

And try running gears on a bare xserver.
Comment 7 Luzipher 2014-05-08 21:40:13 UTC
Created attachment 98711 [details]
test F: dmesg of glxgears without tiling

#### Test F:

Disabling tiling with xorg.conf and your patch had the following results:
    - not even one image of gears (unlike with tiling, where I get one correctly rendered static image at the beginning), instead just a black rectangle where the gears would be
    - the phase with corruptions seems shorter, the screens go black without coming back sooner (might be chance), but gpu resets continue for a while in dmesg (see attachment)
Comment 8 Michel Dänzer 2014-05-09 03:28:47 UTC
Created attachment 98725 [details] [review]
radeonsi: Disable tiling as much as possible

Please try this Mesa patch instead of the one from comment #6. The hardware doesn't support linear depth/stencil buffers.
Comment 9 Luzipher 2014-05-09 21:51:03 UTC
Created attachment 98783 [details]
test F: dmesg of glxgears without tiling (second patch by Michel Dänzer)

No change at all between the different no-tiling patches. With yours, Michel, I also get only a black rectangle (no gears at all) and the same gpu reset cycles as before with screen corruption. Tested this time with "Xorg -retro" as well as only "Xorg".

By the way, my kernel command line includes:
    radeon.dpm=0 drm.rnodes=1
And in xorg.conf.d in the "Module" section:
    Load "dri2"
    Load "glamoregl"
And in the "Device" section:
    Driver "radeon"
    Option "NoAccel" "false"
    Option "AccelMethod" "glamor"
    Option "ColorTiling" "false"
    Option "ColorTiling2D" "false"
Comment 10 Luzipher 2014-05-11 02:26:19 UTC
Created attachment 98832 [details]
test F: new behaviour with latests mesa pathes by Marek - gpu lockup on Xorg start

With the newest patches on mesa git by Marek Olšák (last commit d9e102b220701c15730329290daa0176751af09a, "radeonsi: prepare depth export registers at compile time"), I get even more gpu lockups. The first lockup now happens when starting "Xorg -retro", but the gpu is reset successfully. When I then start glxgears, I don't get _any_ output (not even a black rectangle in the place where the gears should be). The gpu lockup cycling and screen corruption still occur right after starting glxgears and eventually crash the machine (just as before).

Tested without tiling (patch and xorg.conf options) and with tiling, same results.

Attachment: dmesg with Xorg startup until the retro-pattern is visible, but _not_ glxgears.
Comment 11 Luzipher 2014-05-11 12:02:32 UTC
I bisected the Xorg startup "regression" and as suspected it is caused by commit:
315f3c171d423e13069beb99a6b772726a141865 radeonsi: use DRAW_PREAMBLE on CIK

Before that commit, "Xorg -retro" works without hanging the gpu; with that commit, "Xorg -retro" hangs the gpu. Today I couldn't even get Xorg up at all: it went into a reset loop with corruption in between right away.
Comment 12 Michel Dänzer 2014-05-12 07:09:23 UTC
Note that 2D contents use 3D hardware acceleration as well via glamor. For these tests, it might be best to use as little 2D as possible, e.g. just a bare X server without -retro and glxgears, or even something like es2gears without X at all.
Comment 13 Luzipher 2014-05-13 00:12:56 UTC
I do test with Xorg (no -retro) now and then - that is a "bare X server" I guess ? If not, please tell me how to start even more bare, I'll gladly do that, "-retro" just makes it a little more ... visible - I couldn't see a black rectangle (glxgears) on a black screen (bare X) after all. So far "-retro" never made a difference for the observations. But of course I'll test with bare Xorg if you think that's better - after all this bug is for helping with finding the cause ;-)

If I try to run "es2gears_screen" from textmode console (KMS) I get:
    libEGL warning: DRI2: xcb_connect failed
Is there anything special I have to do to get it working ?

For "DISPLAY=:0.0 es2_info" with "Xorg" running (no -retro), I get some output (see below). If I try to run it again, the command hangs for at least 2min (no output, nothing in dmesg, just black screen, no corruption), but I can Ctrl-C it and its process quits. Also, every other es/egl program and even glxgears just hangs and does nothing. Output on first invocation:
    EGL_VERSION: 1.4 (DRI2)
    EGL_VENDOR: Mesa Project
    EGL_EXTENSIONS:
        EGL_MESA_drm_image, EGL_MESA_configless_context, ...
    EGL_CLIENT_APIS: OpenGL OpenGL_ES OpenGL_ES2 OpenGL_ES3 
    GL_VERSION: OpenGL ES 3.0 Mesa 10.3.0-devel (git-58c6597)
    GL_RENDERER: Gallium 0.4 on AMD HAWAII
    GL_EXTENSIONS:
        GL_EXT_blend_minmax, ...

There is actually a difference here when running it with "-retro": the screen turns off after running es2_info. That doesn't happen on plain Xorg. In either case there is nothing printed in dmesg.


For "DISPLAY=:0.0 es2gears_screen" on plain "Xorg" I get the following output:
    EGL_VERSION = 1.4 (DRI2)
    EGLUT: failed to choose a config

As with es2_info, programs executed afterwards just hang.


For "DISPLAY=:0.0 es2gears_x11" on plain "Xorg" I get identical behaviour as with glxgears - no picture at all, gpu lockups in a cycle, corruption each cycle after the screen lights up again and eventually the machine crashes after a few lockup cycles (ssh stops working). The output is instantaneous and doesn't change anymore:
    EGL_VERSION = 1.4 (DRI2)
    vertex shader info: 
    fragment shader info: 
    info: 


If you want any logs from that, I'd happily provide them, but as far as I can tell they're identical to the glxgears logs already posted.

If I can do anything else to help, please tell me.
Comment 14 Michel Dänzer 2014-05-13 08:38:35 UTC
(In reply to comment #13)
> If I try to run "es2gears_screen" from textmode console (KMS) I get:
>     libEGL warning: DRI2: xcb_connect failed

Make sure the DISPLAY environment variable is not set, and set the environment variable EGL_LOG_LEVEL=debug to get more information.
Comment 15 Luzipher 2014-05-13 16:26:04 UTC
(In reply to comment #14)
> Make sure the DISPLAY environment variable is not set, and set the
> environment variable EGL_LOG_LEVEL=debug to get more information.

DISPLAY wasn't set ("echo $DISPLAY" just prints an empty line) and with "EGL_LOG_LEVEL=debug es2gears_screen" I get:
    libEGL debug: Native platform type: x11 (build-time configuration)
    libEGL debug: EGL search path is /usr/lib64/egl
    libEGL debug: added egl_dri2 to module array
    libEGL warning: DRI2: xcb_connect failed
    libEGL warning: DRI2: xcb_connect failed
    libEGL debug: EGL user error 0x3001 (EGL_NOT_INITIALIZED) in eglInitialize
    
    EGLUT: failed to initialize EGL display


/usr/lib64/egl doesn't exist ... but I'm not sure what should be there or if that is the cause for xcb_connect failing. Any idea what I could do ?
Comment 16 vincent 2014-05-13 16:38:20 UTC
You need to build mesa with the "--with-egl-platforms=x11,drm" and "--enable-gallium-egl" flags, and then start it with the "EGL_LOG_LEVEL=debug EGL_PLATFORM=drm ./es2_gears" command.
Comment 17 vincent 2014-05-13 18:04:10 UTC
Created attachment 98990 [details]
dmesg with 3.14.3

Here, with kernel 3.14.3 (from Fedora 20) and mesa at revision 1646f4d0fb0efec04dce62b6dd4d974206acc8ac (10.3 devel), I see a single frame of es2_gears; if I abort execution, it looks like the gpu hangs and/or resets, as my screen goes into low power mode, but I can bring the tty back again.
Comment 18 Michel Dänzer 2014-05-14 04:10:09 UTC
(In reply to comment #17)
> I see a single frame of es2_gears ; if I abort execution, it looks like the
> gpu hang and/or reset as my screen go in low power mode but I can bring tty
> back again.

I can't see any attempt to reset the GPU in your dmesg. Does that happen if you start es2_gears again (or another GL app)?
Comment 19 vincent 2014-05-15 00:11:20 UTC
Created attachment 99050 [details]
dmesg with 3.14.3, mesa 469b42e

No... it happens at the first time I run es2gears. 
I tried with mesa revision 469b42e (where the Hawaii PCI IDs were added) and your patch, and I had the same visual results (i.e. a single frame of the gears, and an apparent gpu reset), but this time there is a mention of a gpu reset in dmesg (I attached it).

If Hawaii acceleration has ever worked, I suspect the regression is to be found in the kernel module rather than in Mesa (or the issue was fixed and then broken again between November and now, but randomly picking commits around December didn't bring me to a working state). I don't think llvm is the culprit here, as the ISA is shared with Bonaire, which works.

Unfortunately I'm not very good at bisecting kernel regressions, especially when they can span several kernel releases (from 3.12+ to 3.13). I can try, but I don't know which kernel commit is a good candidate to start from.
Comment 20 vincent 2014-05-15 18:02:10 UTC
Created attachment 99112 [details]
Dmesg with kernel 3.12+

With kernel at revision 8d0a2215931f1ffd77aef65cae2c0becc3f5d560

(see http://cgit.freedesktop.org/~airlied/linux/commit/?id=8d0a2215931f1ffd77aef65cae2c0becc3f5d560)

and mesa at revision 469b42ee21d6bc530200c76cb0e73b0b461ab6e8, if I launch es2gears_screen, nothing happens (i.e. the tty is waiting for something). If I kill es2gears_screen and relaunch it, I still have to wait a couple of seconds before the screen turns black and the monitor shuts down; it is then turned on again and shows the tty just before it displays "something" (corrupted green gears) for several seconds, before the monitor is shut down and brought back again.
Comment 21 vincent 2014-05-15 19:10:19 UTC
Created attachment 99123 [details]
Dmesg with kernel 32f79a8

With the same mesa and the kernel at 32f79a8a82b2ff6f1828b258da214869adc2a28c (drm/radeon/cik: Add macrotile mode array query) I have to wait a little less before es2gears_screen starts. But it is still corrupted, and another attempt fails at rendering anything.
Comment 22 Michel Dänzer 2014-05-16 03:21:49 UTC
(In reply to comment #21)
> With same mesa and kernel at 32f79a8a82b2ff6f1828b258da214869adc2a28c
> (drm/radeon/cik: Add macrotile mode array query) I have to wait a little
> less before es2gears_screen starts. But it still corrupted, and another
> attempt fails at rendering anything.

It's probably worth using (something like) my patch to disable tiling as much as possible (but no more) for now, to avoid any 2D tiling related issues.
Comment 23 vincent 2014-05-16 15:01:15 UTC
Created attachment 99163 [details]
With tiling disabled, kernel 32f79a8a

With your mesa patch and kernel at 32f79a8a82b2ff6f1828b258da214869adc2a28c I still have a corrupted screen, and gpu hangs.

Do you think es2gears_screen may be "too advanced" for the Hawaii support from November? I remember it was said to support glxgears, and I use es2gears. I think the shaders/commands are equivalent for both programs, but maybe I'm wrong, as I can't find a revision that works.
Comment 24 Tom Stellard 2014-05-16 15:04:37 UTC
I think that the least complicated test case is probably the hello_world OpenCL example from: http://cgit.freedesktop.org/~tstellar/opencl-example/
Comment 25 vincent 2014-05-16 16:33:36 UTC
When running the OpenCL test script from your repo it works, with mesa 469b42ee21 + Michel's patch and kernel at 32f79a8a82b2ff6f1828b258da214869adc2a28c, as well as with kernel 3.14.3.

It looks like only the graphic commands are affected by the regression.
Comment 26 Michel Dänzer 2014-05-17 03:14:35 UTC
(In reply to comment #23)

FWIW, for the initial radeonsi bringup, I used mesa/demos/src/egl/opengl/egltri_screen.
Comment 27 Marek Olšák 2014-05-17 13:56:05 UTC
You can also use piglit without X using:
$ piglit/piglit-run.py -p gbm
OR
$ PIGLIT_PLATFORM=gbm piglit/bin/test -auto

Alternative to glxinfo without X:

$ PIGLIT_PLATFORM=gbm piglit/bin/glinfo
Comment 28 vincent 2014-05-17 20:36:47 UTC
Created attachment 99246 [details]
Dmesg with kernel 32f79a8 radeon.dpm/runpm = 0

It doesn't work with egltri_screen (or eglgears_screen FWIW)

I have attached dmesg. It looks like disabling dpm and runpm generates a more complete dmesg log.
What does "sa_manager is not empty, clearing anyway" mean ? Can it create gpu lock up ?
Comment 29 vincent 2014-05-17 21:53:08 UTC
piglit/bin/glinfo and gbm platforms also hang the gpu.

Sometimes when I run ./egltri_screen I get a shaded triangle that doesn't move; it looks like the first frame is correctly processed.

I'm trying to check if llvm may have introduced a regression, although it doesn't seem likely as the ISA is shared with Bonaire.
Comment 30 vincent 2014-05-17 22:22:55 UTC
With llvm at bdbcffa4af01cda413690276d8e81b3ab5cea9b6
(R600/SI: Add processor type for Hawaii) gpu still hangs.

The remaining component I didn't test is libdrm, is it likely to change something ?
Comment 31 vincent 2014-05-17 22:35:49 UTC
No luck with libdrm-2.4.48 either (roughly when hawaii pciid was added).

I'm using llvm bdbcffa4af01cda413690276d8e81b3ab5cea9b6,
mesa 469b42ee21d6bc530200c76cb0e73b0b461ab6e8 (+ MD patch)
and kernel 8d0a2215931f1ffd77aef65cae2c0becc3f5d560 with radeon.dpm=0 and radeon.runpm=0.
Comment 32 vincent 2014-05-18 15:44:08 UTC
Is it likely that the firmware may cause this regression ?
I have the firmware from fedora 20, I think they are newer than what was used for initial hawaii bring up and I'm not sure there is any other component that might be involved.
Comment 33 Michel Dänzer 2014-05-19 03:36:18 UTC
(In reply to comment #28)
> What does "sa_manager is not empty, clearing anyway" mean ? Can it create
> gpu lock up ?

No, I think that's just an artifact of the attempt to reset the GPU.


(In reply to comment #29)
> Sometimes when I run ./egltri_screen I have a shaded triangle, that doesnt
> move, it looks like the first frame is correctly processed.

egltri only renders one frame. :)


(In reply to comment #32)
> I have the firmware from fedora 20, I think they are newer than what was
> used for initial hawaii bring up [...]

Trying older firmware is worth a shot I guess, though not very likely to make a difference I'm afraid.
Comment 34 vincent 2014-05-19 15:16:33 UTC
FWIW I also tested with R600_DEBUG=nohyperz and R600_DEBUG=nodma but it doesn't solve the issue (although I'm not sure these env vars are useful on radeonsi).

Unfortunately egltri_screen displays a triangle only sometimes, and in a very broken way: there is always a message about a gpu stall, sometimes it displays garbage, sometimes a triangle but on the right or on the left. The gpu is definitely doing some rendering work correctly, but it fails before swapping buffers.

BTW why does eglinfo also make the gpu stall? As far as I can tell there is no draw operation involved, just EGL initialisation and uninitialisation.
Comment 35 Luzipher 2014-05-19 23:58:18 UTC
(In reply to comment #16)
> You need to build mesa with the "--with-egl-platforms=x11,drm" and
> "--enable-gallium-egl" flags, and then starts with "EGL_LOG_LEVEL=debug
> EGL_PLATFORM=drm ./es2_gears" command.

First, sorry for not responding for so long, I've been out of town. Thanks for the info, vincent, and thanks for picking up the tests where I stopped! I figured out that "--enable-gallium-egl" was missing, because the gentoo ebuild only enables it if "openvg" is enabled as well. Is OpenVG really needed or is it a bug in the ebuild?

On the most recent deathsimple-kernel, mesa, libdrm and llvm from git, with three monitors attached (2 dvi, 1 hdmi) I now get es2gears_screen trying to do something with the following command:
    EGL_LOG_LEVEL=debug EGL_PLATFORM=drm es2gears_screen
On the HDMI screen I get instant corruption while on the DVI screens nothing changes. Then the gpu reset cycles start (the screens only came back once).

Output of es2gears_screen:
    libEGL debug: Native platform type: drm (environment overwrite)
    libEGL debug: EGL search path is /usr/lib64/egl
    libEGL debug: added /usr/lib64/egl/egl_gallium.so to module array
    libEGL debug: added egl_dri2 to module array
    libEGL debug: dlopen(/usr/lib64/egl/egl_gallium.so)
    libEGL info: use DRM for display (nil)
    libEGL debug: the best driver is Gallium
    EGL_VERSION = 1.4 (Gallium)
    Found 16 modes:
      0: 1920 x 1080
      1: 1920 x 1080
      2: 1920 x 1080
      3: 1920 x 1080
      4: 1920 x 1080
      5: 1920 x 1080
      6: 1920 x 1080
      7: 1920 x 1080
      8: 1920 x 1080
      9: 1920 x 1080
     10: 1920 x 1080
     11: 1280 x 720
     12: 1280 x 720
     13: 1280 x 720
     14: 720 x 576
     15: 720 x 480
    Will use screen size: 1920 x 1080
    vertex shader info:
    fragment shader info:
    info:

I retried with only one DVI monitor attached and got a white screen after a short black blanking of about a second (instead of corruption) before the gpu reset cycles started and I got multiple of those cycles. dmesg looks like always.

I also tried egltri_screen and also got a white screen for a few seconds, then it went back to console. No lockup, nothing in dmesg. Last line of output was:
    1 frames in 5.0 seconds =  0.200 FPS
I'm also able to re-run this multiple times with the same result, but I'm not entirely sure if it uses radeon or maybe llvmpipe (eglinfo doesn't print anything useful in this regard). Any idea how I could find out reliably ?


I also wondered why glxinfo, eglinfo and es2_info crash the gpu or at least disable subsequent invocations of anything accelerated.
Comment 36 Michel Dänzer 2014-05-20 09:23:59 UTC
(In reply to comment #34)
> Unfortunatly egltri_screen display a triangle but sometimes and in a very
> broken way : always a message about gpu stall, sometimes it display garbage,
> sometimes a triangle but on the right or on the left. The gpu is
> definitively doing some rendering works correctly but it fails before
> swapping buffer.

Beware that it might display something from a previous run even if it doesn't actually render anything. To rule that out, you may want to use another simple test, e.g. one that only clears the window with glClear(), and alternate between the tests, so you can tell whether they actually render anything.


> BTW why eglinfo also make the gpu stall ? As far as I can tell there is no
> draw operation involved, just egl initialisation and unintialisation.

A command stream might still be submitted to the hardware when the EGL context is unbound/destroyed. Should be easy to verify via a breakpoint in gdb.

If that is the case, it might be useful to try narrowing down which part of that 'empty' command stream hangs the GPU.
Comment 37 vincent 2014-05-20 19:19:19 UTC
In eglinfo it's apparently eglInitialize that makes the gpu hang.

With some investigation using gdb, a (the ?) gpu stall occurs with the following calls :


eglInitialize
-> _eglMatchDriver
--> _eglMatchAndInitialize
---> mod->Driver->API.Initialize(mod->Driver, dpy)

I'm trying to narrow down the issue.
Comment 38 vincent 2014-05-20 19:22:33 UTC
Narrowing further, a hang occurs when calling 

nplat->create_display(dpy->PlatformDisplay, dpy->Options.UseFallback);

in egl_g3d.c:539
Comment 39 vincent 2014-05-20 19:39:01 UTC
Hang occurs in radeonsi_screen_create, src/gallium/drivers/radeonsi/radeonsi_pipe:line 684

"rscreen->b.aux_context = rscreen->b.b.context_create(&rscreen->b.b, NULL);"
Comment 40 vincent 2014-05-20 19:48:49 UTC
Hang occurs in si_get_backend_mask in r600_hw_context.c, line 89:


		results = ctx->b.ws->buffer_map(buffer->cs_buf, ctx->b.rings.gfx.cs, PIPE_TRANSFER_READ);

It looks like the packet being sent does not work for Hawaii.
Comment 41 vincent 2014-05-20 19:56:47 UTC
If I comment out the block that does buffer_map/unmap, eglinfo does not hang the gpu (although I'm not sure if it's safe). However, it doesn't fix egltri_screen/eglgears_screen.
Comment 42 vincent 2014-05-20 20:03:12 UTC
There are 4 calls to radeon_bo_map outside of si_get_backend_mask, corresponding to the dummy shader upload. There is only one buffer_map(READ), in si_get_backend_mask, right after the buffer_map(WRITE) from si_get_backend_mask.
Comment 43 Christian König 2014-05-21 08:55:17 UTC
The hang caused by eglinfo is probably just because we try to clear a buffer on init using hw accel or something like this.

Here are a couple of tips how to further narrow it down:

1. Don't work on the system you are trying to debug. Use another system and access the box with the hardware over the network or even better with a serial cable.

2. Add radeon.lockup_timeout=0 to the kernel commandline. This will disable recovery after the first lockup. It doesn't make sense to try to debug all lockups at the same time, we probably got more than one problem to solve.

3. After the system crashed take a look at the debugfs files under /sys/kernel/debug/dri/0/, especially radeon_fence_info. If you lockup the GPU does this always happen at the same command submitted? E.g. if you try twice do you always get the same fence value? Or is this random?
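
One way to check the last point (whether the lockup always happens at the same fence value) is to save radeon_fence_info after each hang and compare the two dumps. Below is a minimal Python sketch, assuming the dump layout shown in the radeon_fence_info excerpts elsewhere in this report; the function names and the sample numbers are made up for illustration and are not part of any radeon tool.

```python
import re

# Matches lines such as "Last emitted        0x0000000000000009".
FENCE_LINE = re.compile(r"^(Last [a-z0-9 ]+?)\s+0x([0-9a-fA-F]+)$")
RING_HEADER = re.compile(r"^--- ring (\d+) ---$")


def fence_values(text):
    """Map (ring number, field label) to the integer fence value."""
    values = {}
    ring = None
    for raw in text.splitlines():
        line = raw.strip()
        header = RING_HEADER.match(line)
        if header:
            ring = int(header.group(1))
            continue
        field = FENCE_LINE.match(line)
        if field and ring is not None:
            values[(ring, field.group(1))] = int(field.group(2), 16)
    return values


def changed_fences(first_dump, second_dump):
    """Return {(ring, label): (old, new)} for every value that differs."""
    a, b = fence_values(first_dump), fence_values(second_dump)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Hypothetical dumps taken after two separate lockups (numbers are invented).
RUN_A = """\
--- ring 0 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000002
--- ring 3 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000009
"""
RUN_B = RUN_A.replace("0x0000000000000009", "0x000000000000000b")

if __name__ == "__main__":
    # An empty dict would mean both lockups stopped at the same fences.
    print(changed_fences(RUN_A, RUN_B))
```

In practice the two inputs would be the contents of /sys/kernel/debug/dri/0/radeon_fence_info saved after each lockup. An empty result suggests the hang is deterministic; differing values suggest it is random.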
Comment 44 vincent 2014-05-21 14:23:29 UTC
Created attachment 99511 [details]
radeon_fence_info before running eglinfo
Comment 45 vincent 2014-05-21 14:24:16 UTC
Created attachment 99512 [details]
radeon_fence_info with egl first run
Comment 46 vincent 2014-05-21 14:24:32 UTC
Created attachment 99513 [details]
radeon_fence_info with egl second run
Comment 47 vincent 2014-05-21 14:28:15 UTC
When using radeon.lockup_timeout=0, eglinfo (and every other EGL app) never finishes, probably waiting for a fence, but I can still switch tty. That's with kernel 3.12+ (OTOH with 3.14.3, if I try to run an EGL app, I get garbage on the display and my monitor shuts down and is never brought back up; there are probably other issues introduced after the Hawaii commit, as suggested).

The radeon_fence_info contents are the same across several runs.
Comment 48 Michel Dänzer 2014-05-21 14:42:07 UTC
(In reply to comment #42)
> There is only one buffer_map(READ), in si_get_backend_mask, right after the
> buffer_map(WRITE) from si_get_backend_mask.

Please make sure the kernel has the commit(s) necessary to provide the backend mask to userspace, so si_get_backend_mask() doesn't need to fall back to this method.
Comment 49 vincent 2014-05-21 14:56:34 UTC
Do you know which commits I should cherry-pick ?
Comment 50 Christian König 2014-05-21 15:16:21 UTC
(In reply to comment #47)
> When using the radeon.lockup_timeout=0 eglinfo (and every other egl apps)
> never finishes, probably waiting for a fence, but I still can switch tty.

That was expected, it probably just waits forever for some results.

> That's with kernel 3.12+ (otoh with 3.14.3 if I try to run a egl app, I have
> garbage on display and my monitor shutdown, and is never brought up, there
> is probably others issues introduced after hawaii commit as suggested)

That's why I suggested to work over the network, trying to debug gfx hardware while the hardware is in use (e.g. it's your output device for debug messages) is usually pointless.

> The radeon_fence_info are the same between several run.

Yeah, and the values are pretty interesting:

--- ring 0 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000002
...
Last sync to ring 3 0x0000000000000009

--- ring 3 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000009

Ring 0 is the 3D engine, ring 3 is the DMA engine. What we see here is that we submitted some commands to the DMA engine which then got stuck.

The 3D engine is just waiting for the DMA to continue as well.

Try to load the radeon module with radeon.test=3 on the kernel command line (keep in mind that this can take a while and will probably crash as well).
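
The reading of the fence values above (a ring still has work pending when it has emitted fences that were never signaled, and a ring is blocked when it synced to a fence the other ring has not reached) can be sketched in Python. This is only an illustration built on the excerpt quoted in this comment; the helper names are hypothetical and the parser assumes the dump layout shown here.

```python
import re

SAMPLE = """\
--- ring 0 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000002
...
Last sync to ring 3 0x0000000000000009

--- ring 3 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000009
"""


def parse_fence_info(text):
    """Return {ring: {"signaled": int, "emitted": int, "sync": {ring: int}}}."""
    rings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"--- ring (\d+) ---", line)
        if m:
            current = rings.setdefault(int(m.group(1)), {"sync": {}})
            continue
        if current is None:
            continue
        m = re.match(r"Last signaled fence\s+0x([0-9a-fA-F]+)", line)
        if m:
            current["signaled"] = int(m.group(1), 16)
            continue
        m = re.match(r"Last emitted\s+0x([0-9a-fA-F]+)", line)
        if m:
            current["emitted"] = int(m.group(1), 16)
            continue
        m = re.match(r"Last sync to ring (\d+)\s+0x([0-9a-fA-F]+)", line)
        if m:
            current["sync"][int(m.group(1))] = int(m.group(2), 16)
    return rings


def pending_rings(rings):
    """Rings whose last emitted fence has not been signaled yet."""
    return sorted(
        ring for ring, info in rings.items()
        if info.get("emitted", 0) > info.get("signaled", 0)
    )


def blocked_on(rings):
    """(waiter, target) pairs where waiter synced to a fence target has not signaled."""
    pairs = []
    for ring, info in rings.items():
        for target, fence in info["sync"].items():
            if target in rings and fence > rings[target].get("signaled", 0):
                pairs.append((ring, target))
    return sorted(pairs)


if __name__ == "__main__":
    rings = parse_fence_info(SAMPLE)
    print("pending:", pending_rings(rings))
    print("blocked:", blocked_on(rings))
```

For the excerpt above, both ring 0 and ring 3 have pending fences, and ring 0 is reported as waiting on ring 3, which matches the conclusion that the DMA engine is the one that got stuck.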
Comment 51 vincent 2014-05-22 14:52:48 UTC
Unfortunately, with "radeon.test=3" the tty doesn't even appear, and it looks like the sshd server is never started, so I can't access anything :/
Comment 52 Alex Deucher 2014-05-22 14:54:56 UTC
(In reply to comment #51)
> Unfortunatly with "radeon.test=3" the tty doesnt even appear, and it looks
> like sshd server is never run, I cant access anything :/

Try booting into a non-X runlevel and then manually loading radeon from the command line.
Comment 53 vincent 2014-05-22 15:49:11 UTC
Created attachment 99591 [details]
dmesg-using-test3
Comment 54 vincent 2014-05-22 15:52:55 UTC
The content of radeon_fence_info in this case is:

--- ring 0 ---
Last signaled fence 0x0000000000000002
Last emitted        0x0000000000000003
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 1 ---
Last signaled fence 0x0000000000000003
Last emitted        0x0000000000000003
Last sync to ring 0 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 2 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 3 ---
Last signaled fence 0x00000000000007f7
Last emitted        0x00000000000007f7
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 4 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 5 ---
Last signaled fence 0x0000000000000002
Last emitted        0x0000000000000002
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000


I can't run eglinfo in this case so I think it's expected that ring 0 does not attempt to sync with ring 3.
I don't know if the dmesg attached in comment #53 is enough to determine why the DMA engine is stalling.
Comment 55 vincent 2014-05-22 15:56:50 UTC
Do the messages like "[  124.822006] [drm] Tested GTT->VRAM and VRAM->GTT copy for GTT offset 0x3fe98000" and "[  124.822010] [drm] Testing syncing between rings 1 and 0..." mean that the tests actually passed?
Comment 56 Christian König 2014-05-22 16:05:27 UTC
(In reply to comment #55)
> Do the messages like "[  124.822006] [drm] Tested GTT->VRAM and VRAM->GTT
> copy for GTT offset 0x3fe98000" and "[  124.822010] [drm] Testing syncing
> between rings 1 and 0..." mean that the test are actually passed ?

Yes, the DMA actually seems to work when used for buffer moves.

The second message is from the second test which tries semaphore sync between the different engines.

What is the last message logged when you try to load the module?

Also please try radeon.test=1 and radeon.test=2 separately.
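
Finding the last self-test message in a long dmesg can be scripted. The sketch below assumes that the test lines start with "[drm] Tested" or "[drm] Testing", as in the two messages quoted above; other test messages may use different wording, and `last_test_message` is a name chosen for this example.

```python
import re

# Prefixes taken from the two messages quoted in this thread (an assumption
# about the general format, not an exhaustive list).
TEST_LINE = re.compile(r"\[drm\] (Tested|Testing) ")

def last_test_message(dmesg_text):
    """Return the last radeon self-test line found in a dmesg dump, or None."""
    last = None
    for line in dmesg_text.splitlines():
        if TEST_LINE.search(line):
            last = line.strip()
    return last

SAMPLE = """\
[  124.822006] [drm] Tested GTT->VRAM and VRAM->GTT copy for GTT offset 0x3fe98000
[  124.822010] [drm] Testing syncing between rings 1 and 0...
"""

if __name__ == "__main__":
    print(last_test_message(SAMPLE))
```

Feeding it the output of `dmesg` after loading the module shows how far the tests got before anything stalled.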
Comment 57 vincent 2014-05-22 16:30:43 UTC
Created attachment 99592 [details]
dmesg-test-1

radeon_fence_info with test 1

--- ring 0 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 1 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 0 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 2 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 3 ---
Last signaled fence 0x00000000000007f7
Last emitted        0x00000000000007f7
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 4 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 5 ---
Last signaled fence 0x0000000000000002
Last emitted        0x0000000000000002
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000

I didn't get any other message when loading the module.
Comment 58 Alex Deucher 2014-05-22 18:04:18 UTC
Created attachment 99594 [details] [review]
use old hdp flush

Does this kernel patch help?
Comment 59 vincent 2014-05-22 20:16:54 UTC
Created attachment 99597 [details]
dmesg with old-hdp patch (kernel 3.15+)

On top of kernel 3.15, unfortunately not: eglinfo works well but egltri_screen and eglgears_screen still render garbage (though they don't stall the GPU).

I'd like to apply it on 3.12+, commit 32f79a8a, because that's where I ran my ring tests, but the patch doesn't apply cleanly, and the code seems to have changed quite a lot in cik.c.
Comment 60 vincent 2014-05-22 21:31:40 UTC
I checked that eglinfo already worked with kernel 3.14.3 so the situation didn't improve with the patch on kernel 3.15.
Comment 61 vincent 2014-05-23 15:21:46 UTC
Created attachment 99652 [details]
dmesg kernel 3.12+ test=2

Sorry, I forgot the test=2 case. The dmesg is attached; radeon_fence_info is:

--- ring 0 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000003
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 1 ---
Last signaled fence 0x0000000000000003
Last emitted        0x0000000000000003
Last sync to ring 0 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 2 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 3 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 4 ---
Last signaled fence 0x0000000000000001
Last emitted        0x0000000000000001
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 5 0x0000000000000000
--- ring 5 ---
Last signaled fence 0x0000000000000002
Last emitted        0x0000000000000002
Last sync to ring 0 0x0000000000000000
Last sync to ring 1 0x0000000000000000
Last sync to ring 2 0x0000000000000000
Last sync to ring 3 0x0000000000000000
Last sync to ring 4 0x0000000000000000
Comment 62 vincent 2014-05-23 15:25:04 UTC
What are rings 1 and 2? I suspect one of the rings is the constant engine but I'm not sure if it's currently supported or not.

I've read that Hawaii has something like one gfx/compute command queue and 7 compute-only command queues. Do they use the same ring?
Comment 63 Alex Deucher 2014-05-23 15:29:27 UTC
(In reply to comment #62)
> What are rings 1 and rings 2 ? I suspect one of the ring is the constant
> engine but I'm not sure if it's currently supported or not.
> 

1 and 2 are compute rings.  The constant engine is part of the gfx ring.

> I've read that hawaii has something a gfx/compute command queue and 7
> compute only command queue, does they use the same ring ?

The gfx ring can execute gfx or compute work.  There are many compute rings on CI hardware, but we currently only expose two.
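
For reading the fence dumps in this thread, the ring numbers identified so far can be kept in a small lookup table. This is a sketch using only what the comments here state (ring 0 is the gfx/3D ring, rings 1 and 2 are compute, ring 3 is DMA); rings 4 and 5 are not identified in this thread, so they are deliberately left unlabeled, and `ring_label` is a made-up helper name.

```python
# Ring names as stated in this thread. Rings 4 and 5 appear in the dumps
# but are not identified here, so they are intentionally absent.
RING_NAMES = {
    0: "gfx (3D engine, includes the constant engine)",
    1: "compute",
    2: "compute",
    3: "DMA",
}

def ring_label(index):
    """Return a human-readable label for a ring index from radeon_fence_info."""
    name = RING_NAMES.get(index, "not identified in this thread")
    return f"ring {index}: {name}"

if __name__ == "__main__":
    for i in range(6):
        print(ring_label(i))
```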
Comment 64 Luzipher 2014-05-25 23:44:30 UTC
Created attachment 99813 [details]
dmesg of eglgears_screen with drm-fixes 3.15-rc6 and agd5f's use-old-hdp-flush patch

(In reply to comment #58)
> Created attachment 99594 [details] [review] [review]
> use old hdp flush
> 
> Does this kernel patch help?

I don't know if this info helps, but anyway:

With current stuff (llvm, glamor, mesa from git) and airlied's current drm-fixes (commit 77c01bef72a5ce5cb24adae6066ed81a52004d30) with your old-hdp-flush patch and the patch from bug 74250, I get:
* eglinfo works multiple times in a row
* egltri_screen from mesa-demos works, outputs a coloured triangle on grey background for 5s (with 3 monitors attached)
* eglgears_screen garbles the screen, but it only stalls the gpu once or twice (dmesg attached) and then returns to console
* and I can repeatedly run egltri_screen even after eglgears_screen ran

So for me it seems like quite an improvement.

I'd like to help further if possible, but I don't have any experience with gdb unfortunately. If I can do anything else or if there are some detailed instructions on what to do, just tell me and I'll try to find some time.
Comment 65 Luzipher 2014-05-26 00:25:46 UTC
Sorry, I have to correct myself somewhat: I can run egltri_screen exactly *twice* with correct output. And I guess the second time only looks right because it displays what the first run left in memory. Even if I restart the machine after egltri_screen and then run eglgears_screen (!), I get the triangle, not gears.
So I guess it's only the first invocation that works. But it doesn't stall the GPU, even if it outputs garbage (and nothing is printed in dmesg).

Sometimes it even works again after a few invocations with garbage, but the triangle is not centered then, but offset to the left or right. The garbage also looks deterministic (always identical for the third and fourth invocation after a restart).

After some playing around I also got this from egltri_screen:
    Will use screen size: 1920 x 1080
    radeon: Failed to allocate virtual address for buffer:
    radeon:    size      : 4147200 bytes
    radeon:    alignment : 256 bytes
    radeon:    domains   : 4
    radeon:    va        : 0x0000000001820000
    radeon: Failed to allocate virtual address for buffer:
    radeon:    size      : 4147200 bytes
    radeon:    alignment : 256 bytes
    radeon:    domains   : 4
    radeon:    va        : 0x0000000001820000
    Segmentation fault
And the screen where the output would have shown went black (no signal), while the other two stayed "alive".
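
A quick arithmetic check on the failed allocation: the reported size is an exact multiple of the 1920 x 1080 screen size. The sketch below only does that division. Which buffer this actually is cannot be told from the output above, so treat the result as an observation, not a diagnosis.

```python
# Values copied from the egltri_screen output above.
WIDTH, HEIGHT = 1920, 1080
FAILED_SIZE = 4147200  # bytes

# Divide the allocation size by the number of pixels on screen.
bytes_per_pixel, remainder = divmod(FAILED_SIZE, WIDTH * HEIGHT)

print(f"{bytes_per_pixel} bytes per pixel, remainder {remainder}")
# prints "2 bytes per pixel, remainder 0"
```

So the allocation that failed has the size of one full-screen surface at 2 bytes per pixel.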
Comment 66 Christian König 2014-05-26 07:56:28 UTC
Created attachment 99839 [details] [review]
Possible fix

> Sorry I forget test=2 case, dmesg attached, radeon_fence_info is :

That looks like at least semaphores are not working as they should.

Please try the egl* tests with this workaround applied.
Comment 67 vincent 2014-05-26 19:19:41 UTC
Created attachment 99886 [details]
dmesg kernel 3.15+-disable-fence-wait-patch

egltri_screen doesn't work and only displays garbage, and I get a message about a VM protection fault.
There are also a couple of "radeon 0000:01:00.0: failed to sync rings (-35)" messages.
Comment 68 Luzipher 2014-05-26 23:32:45 UTC
Created attachment 99906 [details]
xorg and then glxinfo with kernel patch from attachment 99839 [details] [review]

Sorry, I have to make another correction to my last post - mesa wasn't from git but from commit 58c659703bed86ea004a2e64ee231e3ba99b3d45. I can't get EGL to work with newest mesa, with the following messages when trying to run egltri_screen:
    # EGL_LOG_LEVEL="debug" EGL_PLATFORM="drm" ./egltri_screen

    libEGL debug: Native platform type: drm (environment overwrite)
    libEGL debug: EGL search path is /usr/lib64/egl
    libEGL debug: added /usr/lib64/egl/egl_gallium.so to module array
    libEGL debug: added egl_dri2 to module array
    libEGL debug: dlopen(/usr/lib64/egl/egl_gallium.so)
    libEGL info: use DRM for display (nil)
    libEGL debug: the best driver is Gallium
    EGL_VERSION = 1.4 (Gallium)
    libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
    libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
    libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
    EGLUT: failed to choose a config


I attached a dmesg with patch 99839 from Christian König but NOT patch 99594 from Alex Deucher, with newest mesa, where I did a "Xorg -retro" start and then glxinfo (which garbles the screen). I think there might be some new messages like:
[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing C466 (len 254, WS 0, PS 4) @ 0xC490
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing B9EC (len 145, WS 0, PS 8) @ 0xBA77

And:
radeon 0000:02:00.0: still active bo inside vm

Also glxgears still crashes and causes garbage, but it seems more recoverable (Ctrl-C makes the GPU reset cycle quit sometimes) and it seems to believe it renders more frames (messages like "3 frames in XXX sec" instead of "1 frame").


With older mesa egltri_screen still causes gpu resets and garbage.
Comment 69 Michel Dänzer 2014-05-27 09:30:33 UTC
First of all, let me remind both of you to make sure si_get_backend_mask() gets the information from the kernel, and to use my patch which disables tiling as much as possible for now.

Also, might be worth testing with R600_DEBUG=nodma for now, to prevent the userspace code from using the asynchronous DMA engine.


(In reply to comment #64)
> * egltri_screen from mesa-demos works, outputs a coloured triangle on grey
> background for 5s (with 3 monitors attached)

That's all it's supposed to do, FWIW. :)


(In reply to comment #65)
> Sometimes it even works again after a few invocations with garbage, but the
> triangle is not centered then, but offset to the left or right. The garbage
> also looks deterministic (always identical for the third and fourth
> invocation after a restart).

Still, I'd assume in those cases it doesn't actually render anything but just displays leftovers from previous runs.

To differentiate that, I built a second copy of egltri_screen with all rendering commands except for glClear() disabled (and a glClearColor() call added for a distinctive clear colour). By running both copies alternately, I could be pretty sure whether what I was seeing was rendered by the current run or was just a leftover.


(In reply to comment #68)
> I can't get EGL to work with newest mesa, with the following messages when trying
> to run egltri_screen:
>     # EGL_LOG_LEVEL="debug" EGL_PLATFORM="drm" ./egltri_screen

Does it work if you add EGL_DRIVER=egl_dri2 ?


> I attached a dmesg with patch 99839 from Christian König but NOT patch 99594
> from Alex Deucher, with newest mesa, where I did a "Xorg -retro" start and
> then glxinfo (which garbles the screen).

I'm afraid it's still too early to test in X. Even without -retro, the GPU is probably used for some operations via glamor, e.g. for the cursor image.
Comment 70 vincent 2014-05-27 15:06:12 UTC
Created attachment 99958 [details]
dmesg with nodma + notiling

Oops, sorry, I forgot about the tiling patch. Unfortunately it doesn't help:
with it I can see the egltri_screen triangle but the monitor resets, and eglgears_screen still displays garbage.
Attached is the dmesg with the 2 apps tested.

By the way, does clover use DMA? Clover runs fine, at least in simple cases with input and output verification, and as far back as the commit that introduced Hawaii support.
Comment 71 Robert White 2014-06-11 00:12:32 UTC
I don't know if this will help at all, but I am getting similar errors on a radeon laptop _and_ an i965 laptop. I'm using the latest git as fetched by gentoo layman overlay tools and linux 3.15.0 (and previously 3.14.{0,1,5,6}).

I'm missing some icons under xfce4 and kde is a right mess on the i965.

EGL_LOG_LEVEL=debug  eglgears_x11
libEGL debug: Native platform type: x11 (autodetected)
libEGL debug: EGL search path is /usr/lib64/egl
libEGL debug: added /usr/lib64/egl/egl_gallium.so to module array
libEGL debug: added egl_dri2 to module array
libEGL debug: dlopen(/usr/lib64/egl/egl_gallium.so)
libEGL info: use X11 for display 0x1fef010
libEGL info: created a pipe screen for r600
libEGL debug: the best driver is Gallium
EGL_VERSION = 1.4 (Gallium)
libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
... many duplicates skipped ...
libEGL debug: the value (0x0) of attribute 0x3040 did not meet the criteria (0x8)
EGLUT: failed to choose a config

vs

EGL_LOG_LEVEL=debug eglgears_screen 
libEGL debug: Native platform type: x11 (build-time configuration)
libEGL debug: EGL search path is /usr/lib64/egl
libEGL debug: added egl_dri2 to module array
libEGL debug: DRI2: dlopen(/usr/lib64/dri/i965_dri.so)
libEGL debug: DRI2: found extension `DRI_Core'
libEGL info: DRI2: found extension DRI_Core version 1
libEGL debug: DRI2: found extension `DRI_IMAGE_DRIVER'
libEGL debug: DRI2: found extension `DRI_DRI2'
libEGL info: DRI2: found extension DRI_DRI2 version 4
libEGL debug: DRI2: found extension `DRI_DriverVtable'
libEGL debug: DRI2: found extension `DRI_ConfigOptions'
libEGL debug: DRI2: found extension `DRI_TexBuffer'
libEGL info: DRI2: found extension DRI_TexBuffer version 3
libEGL debug: DRI2: found extension `DRI2_Flush'
libEGL info: DRI2: found extension DRI2_Flush version 4
libEGL debug: DRI2: found extension `DRI_IMAGE'
libEGL info: DRI2: found extension DRI_IMAGE version 8
libEGL debug: DRI2: found extension `DRI_RENDERER_QUERY'
libEGL debug: DRI2: found extension `DRI_CONFIG_QUERY'
libEGL debug: DRI2: found extension `DRI_Robustness'
libEGL debug: the value (0x4) of attribute 0x302f did not meet the criteria (0x5)
... duplicates, though not as many, removed ...
libEGL debug: the value (0x4) of attribute 0x302f did not meet the criteria (0x5)
libEGL debug: the best driver is DRI2
EGL_VERSION = 1.4 (DRI2)
libEGL debug: attribute 0x3033 has an invalid value 0x8
libEGL debug: EGL user error 0x3004 (EGL_BAD_ATTRIBUTE) in eglChooseConfig

EGLUT: failed to choose a config
Comment 72 Robert White 2014-06-11 00:24:14 UTC
P.S. Both of these boxes have worked "well" to "excellent" and have recently been going weird. Games that played recently (q.v. warzone2100) have stopped working on both systems and, as said, the i965 is really suffering some display rot.
Comment 73 Michel Dänzer 2014-06-11 01:52:06 UTC
(In reply to comment #72)
> P.S. Both of these boxes have worked "well" to "excellent" in and have
> recently been going weird.

Your issues are not related to this bug report, though you might find some advice for troubleshooting in here. Please file your own reports for your issues.
Comment 74 Konstantin 2014-06-23 22:02:40 UTC
First, sorry for being away so long.

(In reply to comment #69)
> First of all, let me remind both of you to make sure si_get_backend_mask()
> gets the information from the kernel, and to use my patch which disables
> tiling as much as possible for now.
Ok, but how do I make sure that si_get_backend_mask() gets information from the kernel ?

(In reply to comment #69)
> Also, might be worth testing with R600_DEBUG=nodma for now, to prevent the
> userspace code from using the asynchronous DMA engine.
Will do that.

(In reply to comment #69)
> That's all it's supposed to do, FWIW. :)
I guessed as much :-) I actually was quite excited that it worked.

(In reply to comment #69)
> Does it work if you add EGL_DRIVER=egl_dri2 ?
Unfortunately that didn't work. As output I get (note best driver is DRI2 instead of gallium):
# EGL_LOG_LEVEL=debug EGL_PLATFORM=drm EGL_DRIVER=egl_dri2 ./egltri_screen
    libEGL debug: Native platform type: drm (environment overwrite)
    libEGL debug: EGL search path is /usr/lib64/egl
    libEGL debug: added egl_dri2 to module array
    libEGL debug: the best driver is DRI2
    EGL_VERSION = 1.4 (DRI2)
    libEGL debug: attribute 0x3033 has an invalid value 0x8
    libEGL debug: EGL user error 0x3004 (EGL_BAD_ATTRIBUTE) in eglChooseConfig
    EGLUT: failed to choose a config

But on recent mesa it works again without EGL_DRIVER.

(In reply to comment #69)
> I'm afraid it's still too early to test in X. Even without -retro, the GPU
> is probably used for some operations via glamor, e.g. for the cursor image.
Well, I just tested again on yesterday's mesa (2014-06-22) and vanilla kernel 3.16-rc2 (no patches, because I had problems on rc1). egltri_screen still works once, then I get garbage that now looks different (lots of small dots).

I also tried eglgears_screen, but that segfaults. I guess it's a problem in llvm - and unfortunately llvm refuses to build for about 2 weeks now.
Output is:

LLVM triggered Diagnostic Handler: unsupported call to function llvm.AMDGPU.rsq. in main
[  621.798733] traps: eglgears_screen[431] general protection ip:7f0cb3ae51d8 sp:7fffe17cfa10 error:0 in libLLVM-3.5svn.so[7f0cb2b35000+1af1000]
Segmentation fault

Next I'll try to get the kernel patches for disabling tiling running again and use R600_DEBUG=nodma. Do I have to do anything about si_get_backend_mask() ?
Comment 75 Michel Dänzer 2014-06-24 02:28:05 UTC
(In reply to comment #74)
> Ok, but how do I make sure that si_get_backend_mask() gets information from
> the kernel ?

By tracing its execution flow, either in gdb or by adding debugging printfs.


> > Also, might be worth testing with R600_DEBUG=nodma for now, to prevent the
> > userspace code from using the asynchronous DMA engine.
> Will do that.

Actually, that was a red herring, sorry; there's no asynchronous DMA support yet for CIK.


> LLVM triggered Diagnostic Handler: unsupported call to function
> llvm.AMDGPU.rsq. in main

You need to update your LLVM SVN/Git snapshot.
Comment 76 Luzipher 2014-07-01 20:17:04 UTC
(In reply to comment #75)
> > LLVM triggered Diagnostic Handler: unsupported call to function
> > llvm.AMDGPU.rsq. in main
> 
> You need to update your LLVM SVN/Git snapshot.

Just a quick update if someone else runs into this problem: llvm doesn't build with gcc 4.7 or 4.9, but it works with gcc 4.8.3.

Other than that I didn't get different results from the newest git code (mesa, xf86-video-ati) on a patched kernel 3.16-rc2 - egltri_screen works once, eglgears_screen produces garbage. But it seems somewhat more stable (no unrecoverable crashes as far as I can remember). Xorg on the other hand seems worse (no output anymore), but that might be the new in-server glamor stuff, I guess.

I won't be able to continue investigations for a few days - the printks are still on my todo list (or maybe gdb ...).
Comment 77 Kai 2014-07-17 12:54:53 UTC
Just an additional note: with 3.15.5 I'm seeing eight (one for each ring?) instances of:
> [drm:radeon_atom_get_leakage_vddc_based_on_leakage_params] *ERROR* Unknown table version 3, 1

Probably not important to the larger issue (no acceleration; fallback to llvmpipe) here though.
Comment 78 Kai 2014-07-17 13:29:48 UTC
Ahrg, comment #77 was only half of what I wanted to add... (screwed the C&P up)

In Xorg.0.log I'm seeing
> (EE) Error config_odev_get_int_attribute called for non integer attrib 4
no matter whether I force acceleration on or not. Not sure this is relevant, though.


If I force HAWAII acceleration on (same settings as Luzipher in comment #9), I end up with a black screen and the system is inaccessible directly. Over SSH I found that no error is logged to either Xorg.0.log or dmesg (I checked journalctl as well, just to be sure, but nothing there either). Issuing a reboot command over SSH didn't work either. Would I need to set some kernel variable to get info about a locked-up GPU?


The stack I used was (base is Debian Testing):
GPU: Hawaii PRO [Radeon R9 290] (ChipID = 0x67b1)
Linux: 3.15.5
libdrm: 2.4.54-1
LLVM: SVN:trunk/r213236
libclc: Git:master/0ec7437d9c
Mesa: Git:master/48deb4dbf2
DDX: 1:7.4.0-2
X: 2:1.15.99.904-1 (1.16.0 RC 4)
Comment 79 Kai 2014-07-17 13:56:13 UTC
Created attachment 102983 [details]
dmesg: lock-up with Xorg -retro

Ok, this is weird: just out of curiosity I tried to launch Xorg with "-retro", then I do see errors logged (see attached excerpt from dmesg). If I run Xorg without parameters, I just end up with a black screen, no logged errors and Xorg.0.log looks like everything is fine (except for that "config_odev_get_int_attribute" error):
> [  1724.926] (--) RADEON(0): Chipset: "HAWAII" (ChipID = 0x67b1)
> [  1724.926] (EE) Error config_odev_get_int_attribute called for non integer attrib 4
> [  1724.926] (II) Loading sub module "dri2"
> [  1724.926] (II) LoadModule: "dri2"
> [  1724.926] (II) Module "dri2" already built-in
> [  1724.926] (II) Loading sub module "glamoregl"
> [  1724.926] (II) LoadModule: "glamoregl"
> [  1724.926] (II) Loading /usr/lib/xorg/modules/libglamoregl.so
> [  1724.931] (II) Module glamoregl: vendor="X.Org Foundation"
> [  1724.931] 	compiled for 1.15.99.904, module version = 1.0.0
> [  1724.931] 	ABI class: X.Org ANSI C Emulation, version 0.4
> [  1724.931] (II) glamor: OpenGL accelerated X.org driver based.
> [  1724.958] (II) glamor: EGL version 1.4 (DRI2):
> [  1724.975] (II) RADEON(0): glamor detected, initialising EGL layer.
> [  1724.975] (II) RADEON(0): KMS Color Tiling: disabled
> [  1724.975] (II) RADEON(0): KMS Color Tiling 2D: disabled
> [  1724.975] (II) RADEON(0): KMS Pageflipping: enabled
> [  1724.975] (II) RADEON(0): SwapBuffers wait for vsync: enabled
> [...]

Shouldn't I be seeing the same errors as with "-retro" as well?
Comment 80 Alex Deucher 2014-07-17 16:34:25 UTC
(In reply to comment #79)
> Ok, this is weird: just out of curiosity I tried to launch Xorg with
> "-retro", then I do see errors logged (see attached excerpt from dmesg). If
> I run Xorg without parameters, I just end up with a black screen, no logged
> errors and Xorg.0.log looks like everything is fine (except for that
> "config_odev_get_int_attribute" error):

> Shouldn't I be seeing the same errors as with "-retro" as well?

-retro invokes acceleration while the non-retro case does not.
Comment 82 Kai 2014-07-25 14:51:08 UTC
Created attachment 103450 [details]
Xorg.0.log showing "EQ overflowing"

(In reply to comment #81)
> it's now working more or less.  grab the latest ucode here
> http://people.freedesktop.org/~agd5f/radeon_ucode/ucode.tar.gz and use my
> http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.17-wip kernel
> tree, plus this patch for ddx:
> http://people.freedesktop.org/~agd5f/0001-radeon-enable-hawaii-accel-
> conditionally.patch

I followed these instructions and everything seems to work, until KDM wants to draw the desktop for the first time, then I see the following:
> (EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
> (EE) 
> (EE) Backtrace:
> [...]
> (EE) 
> (EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
> (EE) [mi] mieq is *NOT* the cause.  It is a victim.
> (EE) [mi] EQ overflow continuing.  100 events have been dropped.

The backtrace and "EQ overflow continuing" part repeats two times more. See the attached Xorg.0.log for the backtraces.

The stack I used was (base is Debian Testing):
GPU: Hawaii PRO [Radeon R9 290] (ChipID = 0x67b1)
Linux: ~agd5f/linux:drm-next-3.17-wip (calls itself 3.16-rc4?)
libdrm: 2.4.54-1
LLVM: 3.5 RC1
libclc: Git:master/0ec7437d9c
Mesa: Git:master/bf1247936a
DDX: 1:7.4.0-2 + Patch
X: 2:1.16.0-1 (1.16.0)

Firmware was put into /lib/firmware/updates/$(uname -r)/radeon (which translates to /lib/firmware/updates/3.16.0-rc4-citadel/radeon).
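
The firmware location above can be computed for the running kernel as in the sketch below. It assumes a Linux system where `platform.release()` returns the same string as `uname -r`; `firmware_dir` is a name chosen for this example and only builds the path, it does not check that the files exist.

```python
import platform
from pathlib import PurePosixPath

def firmware_dir(release=None):
    """Build /lib/firmware/updates/<kernel release>/radeon for a kernel release."""
    # Default to the running kernel, i.e. what `uname -r` reports on Linux.
    release = release or platform.release()
    return PurePosixPath("/lib/firmware/updates") / release / "radeon"

if __name__ == "__main__":
    # The release string from this comment, then the running kernel.
    print(firmware_dir("3.16.0-rc4-citadel"))
    print(firmware_dir())
```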

Do I need to do something else? Disable tiling? DPM? Is this a remaining issue? Or did I misunderstand something, and it is expected not to get KDE up and running?
Comment 83 Kai 2014-07-25 14:55:23 UTC
After I killed X the following three lines appeared in dmesg's output:
> [ 1421.048407] konsole[1509]: segfault at e0 ip 00007fa9e928a1dd sp 00007fff046f5160 error 4 in libkdeui.so.5.13.3[7fa9e8f1e000+445000]
> [ 1431.576124] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
> [ 1431.576130] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000000001eb last fence id 0x00000000000001de on ring 0)
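
The lockup line can be decoded to see how far the ring had fallen behind. This is a sketch that assumes the message format shown above; `parse_lockup` is a hypothetical helper, and the `behind` value is simply the difference between the two fence ids in the message.

```python
import re

# Pattern modelled on the dmesg line quoted above.
LOCKUP = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)

def parse_lockup(line):
    """Extract ring number and fence ids from a radeon 'GPU lockup' line."""
    m = LOCKUP.search(line)
    if m is None:
        return None
    waiting_for = int(m.group(1), 16)
    last_fence_id = int(m.group(2), 16)
    return {
        "ring": int(m.group(3)),
        "waiting_for": waiting_for,
        "last_fence_id": last_fence_id,
        "behind": waiting_for - last_fence_id,
    }

SAMPLE = ("[ 1431.576130] radeon 0000:01:00.0: GPU lockup (waiting for "
          "0x00000000000001eb last fence id 0x00000000000001de on ring 0)")

if __name__ == "__main__":
    print(parse_lockup(SAMPLE))
```

For the line above this gives ring 0 with a gap of 13 between the fence being waited for (0x1eb) and the last fence id reported (0x1de).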
Comment 84 Luzipher 2014-07-25 17:47:25 UTC
First: Thanks Alex for looking into this and making progress !

I did as instructed, but couldn't yet get accel to work (llvmpipe is used). A few questions:
1. Do I need to rename the firmware files to uppercase ? (HAWAII_ce.bin instead of hawaii_ce.bin) ?
2. Does the current radeon stuff already work with the glamor that got merged into xorg-server 1.16 or do I need the external glamor ? (does that work with 1.16 ?)
3. As far as I know, the "NoAccel" xorg.conf option was renamed to "Accel". Is that still necessary with your xf86-video-ati patch ?
Comment 85 Kai 2014-07-25 18:15:09 UTC
I think I can answer your questions, since I got acceleration working up to the point where it crashed (see comment #82).

(In reply to comment #84)
> I did as instructed, but couldn't yet get accel to work (llvmpipe is used).
> A few questions:
> 1. Do I need to rename the firmware files to uppercase ? (HAWAII_ce.bin
> instead of hawaii_ce.bin) ?

No. I didn't rename the files and they were correctly loaded.

> 2. Does the current radeon stuff already work with the glamor that got
> merged into xorg-server 1.16 or do I need the external glamor ? (does that
> work with 1.16 ?)

Yes. My 7.4.0 came up with the integrated GLAMOR. And just yesterday I helped a friend with an OLAND to get it set up with the free drivers. His setup uses also a 1.16.0 server and the DDX uses the integrated GLAMOR.

> 3. As far as I know, the "NoAccel" xorg.conf option was renamed to "Accel".
> Is that still necessary with your xf86-video-ati patch ?

My xorg.conf had only a 'Driver "radeon"' line (since I don't have the -ati loader shim around). No "Accel" or other option was set (and yes, it's Accel with 7.4.0).
Comment 86 Jerome Glisse 2014-07-25 21:11:32 UTC
Created attachment 103474 [details] [review]
Hack to temporarily fix accel

If you want a working desktop you can use this patch to disable the packet that is the issue. With that mesa patch I have stable acceleration and a stable desktop.
Comment 87 Kai 2014-07-25 22:10:24 UTC
(In reply to comment #86) 
> If you want working desktop you can use this patch to disable the packet
> that is the issue. With that mesa patch i have stable acceleration and
> desktop.

Thanks Jérôme! The "EQ overflowing" is gone (same stack as detailed in comment #82, except for Mesa, which is now at Git master/5eb11eb192 + the patch from attachment 103474 [details] [review]), but I'm still stuck on the KDE loading screen, just before the desktop is drawn. After I kill X, the three lines I pasted in comment #83 appear again at the end of dmesg's output. So, no luck for me yet.
Comment 88 Alex Deucher 2014-07-25 22:23:34 UTC
Try re-grabbing my drm-next-3.17-wip tree.  I dropped the last commit.
Comment 89 Kai 2014-07-25 22:26:45 UTC
Created attachment 103477 [details]
dmesg output after restarting X after lockup from comment #87

Oh, forgot to add: if I restart X after that, I get the attached additional lines in dmesg.

After that I rebooted, logged into KDE (acceleration disabled), killed Konsole so it wouldn't automatically get started, and tried a login with acceleration again. But still no luck: I'm still stuck on the last icon of the KDE loading screen just before the desktop gets drawn the first time. After killing X I'm seeing the GPU stall (see comment #83) again, just without the segfault in Konsole.

Maybe it has something to do with the composition type and Qt graphic system I've chosen in KDE for desktop effects? The composition type is set to "OpenGL 3.1" and the Qt graphic system is set to "Native".
Comment 90 Kai 2014-07-25 23:10:44 UTC
(In reply to comment #88)
> Try re-grabbing my drm-next-3.17-wip tree.  I dropped the last commit.

I did; no change, still get that GPU stall.
Comment 91 Luzipher 2014-07-26 11:15:32 UTC
Thanks Kai, for your answers. I got it working now as well - in fact I'm typing from an accelerated xorg that's running for over an hour now.

I used:
- agd5f's kernel (drm-next-3.17-wip), including the "handle ASIC_ProfilingInfo v3.1" patch
- ucode from comment #81 copied to "/lib/firmware/updates/$(uname -r)/radeon"
- xf86-video-ati with agd5f's patch from comment #81
- mesa git with glisse's patch from comment #86
- llvm git
- xorg-server 1.16
- glamor 0.6.0 (is that still necessary with xorg 1.16 ?)
- kernel command line parameters: radeon.dpm=0 drm.rnodes=1

I do not use any kind of login manager (gdm, kdm) for login, but plain old text-mode, then typing startx.

Both xfce4 and cinnamon start up and work, but at least in cinnamon I get massive flickering on mouse-overs or animated cursors (terminology). For example the background image occupies the whole text-area where I type this comment, whenever I move the mouse in or out (firefox). The same happens when I type, but only for 2-3 lines of the surrounding text. Using the mouse-wheel to scroll helps and corrects the display.

I am able to run glxgears. And I even tried World of Warcraft via wine, where I could get to the character selection screen (3D character is displayed), but upon entering the world, the machine crashed and I got the screen full of noise as output (similar to what happened before the fix on crashes).
Comment 92 Kai 2014-07-26 11:48:22 UTC
(In reply to comment #91)
> Thanks Kai, for your answers. I got it working now as well - in fact I'm
> typing from an accelerated xorg that's running for over an hour now.

yw; happy I could help.

> - glamor 0.6.0 (is that still necessary with xorg 1.16 ?)

No (as I've indicated in comment #85); my understanding is that only the integrated version is maintained at this point. You might even have overwritten the integrated libglamoregl.so with the external version? You can check which one you loaded by looking in your Xorg.0.log for the following lines:
> (II) Loading sub module "glamoregl"
> (II) LoadModule: "glamoregl"
> (II) Loading /usr/lib/xorg/modules/libglamoregl.so
> (II) Module glamoregl: vendor="X.Org Foundation"
> 	compiled for 1.16.0, module version = 1.0.0
> 	ABI class: X.Org ANSI C Emulation, version 0.4
and check what version you see. The integrated version is 1.0.0 and as you can see from attachment 103450 [details], my X started correctly with that. I just get a GPU stall when the desktop should be drawn/shown the first time. The last image I see is the KDE loading screen with all icons.
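
If you'd rather script that check than eyeball the log, something along these lines should do. This is only a minimal Python sketch: the function name is made up for illustration, and it assumes the version line follows the "Module glamoregl:" line as in the excerpt above.

```python
import re


def glamoregl_version(log_text):
    """Return the module version logged for glamoregl, or None if not found."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if "Module glamoregl:" in line:
            # The version is printed shortly after the vendor line.
            for follow in lines[i + 1:i + 4]:
                match = re.search(r"module version = (\d+(?:\.\d+)*)", follow)
                if match:
                    return match.group(1)
    return None
```

Feed it the contents of your Xorg.0.log (commonly /var/log/Xorg.0.log, but that path is an assumption and depends on the distribution). Going by the above, "1.0.0" would indicate the integrated version.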

> - kernel command line parameters: radeon.dpm=0 drm.rnodes=1

Oh, I didn't try those so far. I'll look into that and see if I can get KDE loading with those.
Comment 93 Luzipher 2014-07-26 11:49:08 UTC
Created attachment 103497 [details]
dmesg with working acceleration: GPU faults

I just looked at my dmesg (with the same running system as described in comment #91) and noticed quite a few "GPU faults" logged there (see attachment).

My Xorg.0.log looks clean though (only the EDID information of one of my monitors was printed out waaay after I started X, at timestamps 1705 and 3420. But I believe that was when powersaving switched off the screens).
Comment 94 Marek Olšák 2014-07-26 12:42:15 UTC
(In reply to comment #86)
> Created attachment 103474 [details] [review]
> Hack to temporarily fix accel
> 
> If you want working desktop you can use this patch to disable the packet
> that is the issue. With that mesa patch i have stable acceleration and
> desktop.

I don't recommend this patch. It disables updates to resource descriptors in some cases, which means you'll get resource bindings from old draw packets or even old IBs, which will cause VM faults or even hangs. I think it also breaks MSAA and might cause hangs there too (CMASK must be cleared).

Hawaii works without the patch very well here. I'm only using Alex's stuff from comment 81 and my Mesa fixes. Nothing else.
Comment 95 Luzipher 2014-07-26 14:20:59 UTC
Thanks again, Kai, indeed my xorg-server used glamor 0.6.0.

Ok, so it's not really trivial to get xorg-server-1.16 working right with the built-in glamor 1.0.0 on gentoo. The ebuilds for xorg-server-1.16, xorg-drivers-1.16 and xf86-video-ati are borked. I'll file a bugreport on their bugtracker later. If anyone here is interested, I could attach the ebuilds I altered here.

But back on topic, I finally got it all working with the built in glamor - and with that I'm seeing exactly what Kai described in comment #82.

(In reply to comment #94)
> Hawaii works without the patch very well here. I'm only using Alex's stuff
> from comment 81 and my Mesa fixes. Nothing else.

Next I'll try your patches, Marek - I guess you mean the ones on the mailing list ? Is there a git-repo where you have them ? I checked your stuff for mesa on freedesktop, but couldn't see anything there.
Comment 96 Kai 2014-07-26 15:11:18 UTC
(In reply to comment #94)
> (In reply to comment #86)
> > Created attachment 103474 [details] [review]
> > Hack to temporarily fix accel
> > 
> > If you want working desktop you can use this patch to disable the packet
> > that is the issue. With that mesa patch i have stable acceleration and
> > desktop.
> 
> I don't recommend this patch. It disables updates to resource descriptors in
> some cases, which means you'll get resource bindings from old draw packets
> or even old IBs, which will cause VM faults or even hangs. I think it also
> breaks MSAA and might cause hangs there too (CMASK must be cleared).
> 
> Hawaii works without the patch very well here. I'm only using Alex's stuff
> from comment 81 and my Mesa fixes. Nothing else.

IT WORKS! Thanks to Alex, Marek, Jérôme and all the others involved!
The stack I used was (base is Debian Testing):
GPU: Hawaii PRO [Radeon R9 290] (ChipID = 0x67b1)
Linux: ~agd5f/linux:drm-next-3.17-wip (calls itself 3.16-rc4?)
libdrm: 2.4.54-1
LLVM: 3.5 RC1
libclc: Git:master/0ec7437d9c
Mesa: Git:master/74e100affc + the three patches Marek named ("radeonsi: fix CMASK and HTILE calculations for Hawaii"; "gallium/util: add a helper for calculating primitive count from vertex count"; "radeonsi: fix a hang with instancing on Hawaii") and can be found on mesa-dev.
DDX: 1:7.4.0-2 + Patch from http://people.freedesktop.org/~agd5f/0001-radeon-enable-hawaii-accel-conditionally.patch
X: 2:1.16.0-1 (1.16.0)
Comment 97 Luzipher 2014-07-26 15:13:00 UTC
Created attachment 103500 [details]
dmesg with working acceleration (Marek's patches): WoW GPU reset

Yay, with your seven patches, Marek, it works with the built-in glamor-1.0.0 ! Thanks a lot !

The following patches were applied:

[Mesa-dev] [PATCH 1/2] r600g, radeonsi: add debug flags which disable tiling
http://lists.freedesktop.org/archives/mesa-dev/2014-July/064127.html

[Mesa-dev] [PATCH 2/2] radeonsi: fix CMASK and HTILE calculations for Hawaii
http://lists.freedesktop.org/archives/mesa-dev/2014-July/064128.html

[Mesa-dev] [PATCH 1/3] gallium/util: add a helper for calculating primitive count from vertex count
http://lists.freedesktop.org/archives/mesa-dev/2014-July/064129.html

[Mesa-dev] [PATCH 2/3] radeonsi: fix a hang with instancing on Hawaii
http://lists.freedesktop.org/archives/mesa-dev/2014-July/064130.html

[Mesa-dev] [PATCH 3/3] radeonsi: fix a hang with streamout on Hawaii
http://lists.freedesktop.org/archives/mesa-dev/2014-July/064131.html

[Mesa-dev] [PATCH] winsys/radeon: fix vram_size overflow with Hawaii
http://lists.freedesktop.org/archives/mesa-dev/2014-July/064137.html

[Mesa-dev] [PATCH] radeonsi: fix occlusion queries on Hawaii
http://lists.freedesktop.org/archives/mesa-dev/2014-July/064138.html


With those the transparency and flickery effects I described in comment #91 are gone !

World of Warcraft still crashes when entering the world, but not fatally anymore (there is a GPU reset and black screens for a while, but it recovers and only the game crashes). I attached the dmesg of the crash, Xorg.0.log doesn't show anything. I'm not sure if the GPU faults / VM faults visible in the log are related to the game, but they probably occurred while I logged in and selected a character.
Comment 98 Kai 2014-07-26 15:36:54 UTC
After some playing around, I do see some (minor) visual issues. See https://imgur.com/a/uswfc for some screenshots and descriptions.
Comment 99 Marek Olšák 2014-07-26 17:01:25 UTC
BTW, MSAA may be broken and I recommend turning it off for now.
Comment 100 Kai 2014-07-26 18:04:43 UTC
(In reply to comment #99)
> BTW, MSAA may be broken and I recommend turning it off for now.

Ok, how do I do that? I'm only aware of the debug flag "msaa". My radeon(4) man page didn't yield anything for MSAA either and a quick grep over the Mesa code pointed only to code actually handling MSAA (or dealing with the debug flag).
Comment 101 Marek Olšák 2014-07-26 18:15:34 UTC
You can either disable MSAA in your apps (graphics settings in games, etc.) or you can apply this libdrm patch ;)
http://lists.freedesktop.org/archives/dri-devel/2014-July/064743.html
Comment 102 Kai 2014-07-26 18:35:36 UTC
(In reply to comment #101)
> You can either disable MSAA in your apps (graphics settings in games, etc.)
> or you can apply this libdrm patch ;)
> http://lists.freedesktop.org/archives/dri-devel/2014-July/064743.html

Ah, ok. I thought there was some DRI option I was unaware of. I built libdrm with the patch, seems the better solution to me. Thanks again!
Comment 103 Serkan Hosca 2014-07-26 23:42:16 UTC
Created attachment 103527 [details]
dmesg with drm-next-3.17-wip

Not working for me. I've updated the firmware files and using agd5f's drm-next-3.17-wip branch. I can't switch to other vt's, it gets stuck with the boot messages on vt1. Haven't tried starting X yet.
Comment 104 Kai 2014-07-27 07:13:13 UTC
(In reply to comment #98)
> After some playing around, I do see some (minor) visual issues. See
> https://imgur.com/a/uswfc for some screenshots and descriptions.

Ignore this, this was most likely a layer 8 problem. I just noticed, that the "Compositing type" for KDE's desktop effects has been XRender instead of "OpenGL 3.1" (not sure how that changed back; maybe I had to do it for fglrx and don't remember). Now that I've changed that, all title bars are rendered correctly. XRender never worked with GLAMOR since I've started using GLAMOR (IIRC it was something close to 0.3).
Comment 105 Luzipher 2014-07-27 12:50:49 UTC
(In reply to comment #103)
> Not working for me. I've updated the firmware files and using agd5f's
> drm-next-3.17-wip branch. I can't switch to other vt's, it gets stuck with
> the boot messages on vt1. Haven't tried starting X yet.

Your dmesg shows that you have an ASIC_ProfilingInfo v3.1 table on your board. You might want to try a patch from bug #73420: attachment #93015.
The issue itself is tracked in bug #74250.

Other than that I can report Civ5, Half Life 2 and even Metro Last Light working on Hawaii now (native via Steam) :-)

Kai: did changing "Composition type" also fix the corrupted glyphs or just the title bars ? I've seen exactly one corrupted glyph in firefox so far (several hours of use) on cinnamon (but I'm not sure what cinnamon uses as I couldn't find any settings related to opengl).
Comment 106 Kai 2014-07-27 13:12:57 UTC
(In reply to comment #105)
> Other than that I can report Civ5, Half Life 2 and even Metro Last Light
> working on Hawaii now (native via Steam) :-)

I can confirm Portal 2 (Source), Jagged Alliance: Flashback (Unity) and XCOM: Enemy Unknown (Unreal Engine 3) as working, though the performance in XCOM is abysmal (9 FPS) and after a while you get GPU stalls (system recovered after a short block and I could continue, but it's of course not much fun).
The other games have room to improve as well. But that was to be expected at this point. Still, having more performance with the free driver would be really nice. ;-)

Playing streams (live and recorded broadcasts) from Twitch with this setup is, however, nigh impossible. With the open driver and GLAMOR acceleration, X uses 80 % CPU time on one core and the video becomes a slideshow. The Flash plugin is at 40 % CPU time (different core), but that's roughly the same as with fglrx. (For reference: my CPU is an Intel Core i7-3770K.)
 
> Kai: did changing "Composition type" also fix the corrupted glyphs or just
> the title bars ? I've seen exactly one corrupted glyph in firefox so far
> (several hours of use) on cinnamon (but I'm not sure what cinnamon uses as I
> couldn't find any settings related to opengl).

Can't say for sure. With XRender I saw a corrupted glyph almost immediately on the first page I visited. Since switching to OpenGL as "compositing type" in KDE's settings, I haven't seen a corrupt glyph. But it could also be coincidence.
Comment 107 Marek Olšák 2014-07-27 14:29:33 UTC
Yeah, it looks like Hawaii is underclocked or something. It's a lot slower than my Bonaire.
Comment 108 Kai 2014-07-27 15:01:09 UTC
Btw, watching a video on Twitch spams my Xorg.0.log with *tons* of:
> [ 21013.158] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2822274 < target_msc 2822275
> [ 21075.557] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2826015 < target_msc 2826016
> [ 21139.338] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2829840 < target_msc 2829841
> [ 21161.360] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2831160 < target_msc 2831161
> [ 21163.677] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2831298 < target_msc 2831299
> [ 21163.811] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2831306 < target_msc 2831307
> [ 21178.130] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2832164 < target_msc 2832165
> [ 21209.838] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2834066 < target_msc 2834067
> [ 21267.601] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2837529 < target_msc 2837530
> [ 21273.836] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2837902 < target_msc 2837903
> [ 21483.986] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2850507 < target_msc 2850508
> [ 21516.393] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2852450 < target_msc 2852451
> [ 21668.162] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2861549 < target_msc 2861550
> [ 21699.370] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2863419 < target_msc 2863420
> [ 21699.837] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2863446 < target_msc 2863447
> [ 21731.427] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2865340 < target_msc 2865341
> [ 21899.500] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2875419 < target_msc 2875420
> [ 21906.569] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2875843 < target_msc 2875844
> [ 21906.835] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2875859 < target_msc 2875860

AFAICT this happens only in fullscreen mode.
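
In case it helps with triage, here is a small Python sketch for pulling the numbers out of those warnings. The helper name is made up for illustration; it only relies on the message text quoted above and returns how far each completed msc was behind its target.

```python
import re

# Matches the tail of the radeon_dri2_flip_event_handler warning quoted above.
PAGEFLIP_WARNING = re.compile(r"impossible msc (\d+) < target_msc (\d+)")


def msc_gaps(log_text):
    """Return target_msc - msc for every pageflip warning found in log_text."""
    return [int(target) - int(msc)
            for msc, target in PAGEFLIP_WARNING.findall(log_text)]
```

The length of the returned list gives the number of warnings, and the values show whether the completion event is consistently off by the same amount.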
Comment 109 Serkan Hosca 2014-07-27 16:32:12 UTC
(In reply to comment #105)
> (In reply to comment #103)
> > Not working for me. I've updated the firmware files and using agd5f's
> > drm-next-3.17-wip branch. I can't switch to other vt's, it gets stuck with
> > the boot messages on vt1. Haven't tried starting X yet.
> 
> Your dmesg shows that you have a ASIC_ProfilingInfo v3.1 table on your
> board. You might want to try a patch from bug #73420: attachement #93015.
> The issue itself is tracked in bug #74250.

Tried attachment #93015 [details] [review] from bug #73420 and i get a hard lock up, can't even ssh the machine.
Comment 110 Serkan Hosca 2014-07-27 16:54:54 UTC
Created attachment 103549 [details]
dmesg with drm-next-3.17-wip with x -retro and starting xterm

(In reply to comment #109)
> (In reply to comment #105)
> > (In reply to comment #103)
> > > Not working for me. I've updated the firmware files and using agd5f's
> > > drm-next-3.17-wip branch. I can't switch to other vt's, it gets stuck with
> > > the boot messages on vt1. Haven't tried starting X yet.
> > 
> > Your dmesg shows that you have a ASIC_ProfilingInfo v3.1 table on your
> > board. You might want to try a patch from bug #73420: attachement #93015.
> > The issue itself is tracked in bug #74250.
> 
> Tried attachment #93015 [details] [review] from bug #73420 and i get a hard
> lock up, can't even ssh the machine

Scratch that, my mistake, the attachment worked and cleared up those messages. I can start X -retro, but when I try to launch xterm the GPU crashes.
Comment 111 Serkan Hosca 2014-07-27 16:56:08 UTC
Created attachment 103550 [details]
xorg log with drm-next-3.17-wip with x -retro and starting xterm
Comment 112 Kai 2014-07-27 16:58:18 UTC
(In reply to comment #111)
> Created attachment 103550 [details]
> xorg log with drm-next-3.17-wip with x -retro and starting xterm

Just FYI: all "working" reports came from people with a stable X.Org 1.16.0 AFAIK. Maybe you want to try that instead of your development version?
Comment 113 Kai 2014-07-27 17:05:18 UTC
Oh, and I needed the three Mesa patches "radeonsi: fix CMASK and HTILE calculations for Hawaii", "gallium/util: add a helper for calculating primitive count from vertex count" and "radeonsi: fix a hang with instancing on Hawaii" to get into my KDE desktop, maybe it's the same for X -retro. The other Mesa patches can't hurt either (I have them applied now).
Comment 114 Michel Dänzer 2014-07-28 04:41:25 UTC
(In reply to comment #108)
> Btw, watching a video on Twitch spams my Xorg.0.log with *tons* of:
> > [ 21013.158] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2822274 < target_msc 2822275

These are fixed in drm-fixes-3.16, but that hasn't been merged into drm-next-3.17-wip yet.
Comment 115 Kai 2014-07-28 14:22:52 UTC
(In reply to comment #106)
> Playing streams (live and recorded broadcasts) from Twitch with this setup
> is, however, neigh impossible. With the open driver and GLAMOR acceleration,
> X uses 80 % CPU time on one core and the video becomes a dia show. The Flash
> plugin is at 40 % CPU time (different core), but that's roughly the same as
> with fglrx. (For reference: my CPU is an Intel Core i7-3770K.)

After some further testing yesterday I could only reproduce the stalling of X with "recorded broadcasts" on Twitch. It doesn't matter what stream quality you pick (if you're patient enough to get that menu open). X starts hogging all resources on one CPU core as soon as you load any recording (I tried four different recordings). Live broadcasts/streams worked normally yesterday afternoon/evening and today.


(In reply to comment #114)
> (In reply to comment #108)
> > Btw, watching a video on Twitch spams my Xorg.0.log with *tons* of:
> > > [ 21013.158] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 2822274 < target_msc 2822275
> 
> These are fixed in drm-fixes-3.16, but that hasn't been merged into
> drm-next-3.17-wip yet.

Ah, ok. I found <http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.16&id=f53f81b2576a9bd3af947e2b1c3a46dfab51c5ef>. Is that the correct commit for Hawaii? Doesn't look like the appropriate commit to me (so much "Evergreen" everywhere), but it was the only one a search for radeon_dri2_flip_event_handler yielded.
Comment 116 Michel Dänzer 2014-07-29 03:20:30 UTC
(In reply to comment #115)
> I found
> <http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.16&id=f53f81b2576a9bd3af947e2b1c3a46dfab51c5ef>.
> Is that the correct commit for Hawaii?

Yes, but you probably want all page-flipping related fixes.

> Doesn't look like the appropriate commit to me (so much "Evergreen"
> everywhere),

The programming of that hardware block hasn't changed since Evergreen.
Comment 117 Kai 2014-07-29 17:04:34 UTC
(In reply to comment #116)
> (In reply to comment #115)
> > I found
> > <http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.16&id=f53f81b2576a9bd3af947e2b1c3a46dfab51c5ef>.
> > Is that the correct commit for Hawaii?
> 
> Yes, but you probably want all page-flipping related fixes.

Ok. I tried to pick all those page-flipping patches from Mario Kleiner over, but I could only get 60c90d98ba00f7f8e8ec55f6b24096372f57e9a4 and 5900fdc42ca3cbbc50bab7133750459a165a5cca to apply. The other two failed and the code looked quite different, so I left them out. If I need either 826484977c29b42c8cb8c42bd41acaa6e152a4bb or 5f87e090a7368adc2290ae17ffd82a070caadd20, then any help in adapting them would be appreciated very much.

> > Doesn't look like the appropriate commit to me (so much "Evergreen"
> > everywhere),
> 
> The programming of that hardware block hasn't changed since Evergreen.

Ok, was just irritated by all the changes happening in evergreen.c and similar "non-CIK" named files. But I didn't check the entire include chain. ;-) Thanks for clearing this up.


With regard to <http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.17-wip&id=505178d8b01dd65567b3445d5aa13e81c3a479c0>: does that mean, that I would need a new version of <http://people.freedesktop.org/~agd5f/0001-radeon-enable-hawaii-accel-conditionally.patch>?
Comment 118 Alex Deucher 2014-07-29 17:10:46 UTC
I'll post a link to a new git tree with the radeon 3.17 changes rebased on drm-fixes.
Comment 119 Luzipher 2014-07-29 18:22:51 UTC
(In reply to comment #88)
> Try re-grabbing my drm-next-3.17-wip tree.  I dropped the last commit.

Alex, are you going to reapply the dropped patch for the ASIC_ProfilingInfo v3.1 table ? Or does it need to be handled differently ? Just so you don't forget ;-)
Comment 120 Alex Deucher 2014-07-29 18:29:31 UTC
(In reply to comment #119)
> (In reply to comment #88)
> > Try re-grabbing my drm-next-3.17-wip tree.  I dropped the last commit.
> 
> Alex, are you going to reapply the dropped patch for the ASIC_ProfilingInfo
> v3.1 table ? Or does it need to be handled differently ? Just so you don't
> forget ;-)

Can someone test it to see if it fixes any issues other than the messages in the log?  I don't have a hawaii board with that table version.
Comment 121 Kai 2014-07-29 19:09:30 UTC
(In reply to comment #120)
> (In reply to comment #119)
> > (In reply to comment #88)
> > > Try re-grabbing my drm-next-3.17-wip tree.  I dropped the last commit.
> > 
> > Alex, are you going to reapply the dropped patch for the ASIC_ProfilingInfo
> > v3.1 table ? Or does it need to be handled differently ? Just so you don't
> > forget ;-)
> 
> Can someone test it to see if it fixes any issues other than the messages in
> the log?  I don't have a hawaii board with that table version.

What kind of issues would I be looking for? (I think I see the relevant error message:
> [drm:radeon_atom_get_leakage_vddc_based_on_leakage_params] *ERROR* Unknown table version 3, 1
but I didn't see any issues besides what I've reported here (mainly missing speed))


And just to ensure this doesn't get overlooked: if we need a new version of the DDX patch (since the kernel is now returning a "2" to signal acceleration), please provide that as well. ;-)
Comment 122 Kai 2014-07-29 20:13:53 UTC
(In reply to comment #106)
> (In reply to comment #105)
> > Kai: did changing "Composition type" also fix the corrupted glyphs or just
> > the title bars ? I've seen exactly one corrupted glyph in firefox so far
> > (several hours of use) on cinnamon (but I'm not sure what cinnamon uses as I
> > couldn't find any settings related to opengl).
> 
> Can't say for sure. With XRender I saw a corrupted glyph almost immediately
> on the first page I visited. Since switching to OpenGL as "compositing type"
> in KDE's settings, I haven't seen a corrupt glyph. But it could also be
> coincidence.

Just to get back to this: I'm still seeing misrendered glyphs from time to time. The way they are misrendered is different each time. Sometimes I get just a black block, sometimes what can be seen in the image I posted, etc. It's also not just limited to the browser, even though I can observe it there most often.
Comment 123 Alex Deucher 2014-07-29 20:35:11 UTC
(In reply to comment #121)
> (In reply to comment #120)
> > (In reply to comment #119)
> > > (In reply to comment #88)
> > > > Try re-grabbing my drm-next-3.17-wip tree.  I dropped the last commit.
> > > 
> > > Alex, are you going to reapply the dropped patch for the ASIC_ProfilingInfo
> > > v3.1 table ? Or does it need to be handled differently ? Just so you don't
> > > forget ;-)
> > 
> > Can someone test it to see if it fixes any issues other than the messages in
> > the log?  I don't have a hawaii board with that table version.
> 
> What kind of issues would I be looking for? (I think I see the relevant
> error message:
> > [drm:radeon_atom_get_leakage_vddc_based_on_leakage_params] *ERROR* Unknown table version 3, 1
> but I didn't see any issues besides what I've reported here (mainly missing
> speed))

That issue is tracked separately in bug 74250.  As for the performance, check
sudo cat /sys/kernel/debug/dri/64/radeon_pm_info and see if the values in there look sane and if they scale up properly with 3D load.  Report any issues related to that on bug 74250.

Here's the 3.17 rebased on fixes branch:
http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.17-rebased-on-fixes
Comment 124 Luzipher 2014-07-29 22:57:38 UTC
(In reply to comment #120)
> (In reply to comment #119)
> > (In reply to comment #88)
> > > Try re-grabbing my drm-next-3.17-wip tree.  I dropped the last commit.
> > 
> > Alex, are you going to reapply the dropped patch for the ASIC_ProfilingInfo
> > v3.1 table ? Or does it need to be handled differently ? Just so you don't
> > forget ;-)
> 
> Can someone test it to see if it fixes any issues other than the messages in
> the log?  I don't have a hawaii board with that table version.

Hm, I'll test if there is any change with your unmodified drm-next-3.17-rebased-on-fixes branch (no v3.1 patch).


Other than that (all still with the v3.1 patch included):
I can reliably cause GPU resets by entering the world in World of Warcraft (see previous comment #97 for a dmesg). I never get a rendered image before the game crashes.

I didn't yet notice more corrupted glyphs, but the single one I saw was a black rectangle.

And for the DPM stuff: dpm doesn't seem to work, the numbers never changed (tested on console without X and Metro Last Light; I dropped my previous "radeon.dpm=0" from the kernel parameters):
# cat /sys/kernel/debug/dri/64/radeon_pm_info
power level avg    sclk: 30000 mclk: 15000

I also updated bug #74250 regarding the ASIC_ProfilingInfo v3.1 table with some new information.
Comment 125 Michel Dänzer 2014-07-30 06:26:25 UTC
I think it's time to resolve this report; any remaining issues should be tracked in separate reports.
Comment 126 Alex Deucher 2014-07-30 13:46:02 UTC
(In reply to comment #124)
> 
> And for the DPM stuff: dpm doesn't seem to work, the numbers never changed
> (tested on console without X and Metro Last Light; I dropped my previous
> "radeon.dpm=0" from the kernel parameters):
> # cat /sys/kernel/debug/dri/64/radeon_pm_info
> power level avg    sclk: 30000 mclk: 15000

You need to print that out while you have a 3D app running.  Those numbers are real-time.  E.g., in the console, there is no 3D acceleration happening so they stay at their low levels.
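
If it's awkward to cat the file by hand while a game is running fullscreen, a small script run over ssh can take the samples for you. The following is just a rough Python sketch: the helper names are made up, the path is the one used in this thread, and it assumes the "sclk: N mclk: N" output format shown above. Reading debugfs normally needs root.

```python
import re
import time

# debugfs path as used in this thread; the DRI minor number may differ.
PM_INFO = "/sys/kernel/debug/dri/64/radeon_pm_info"


def parse_pm_info(text):
    """Return (sclk, mclk) as integers, or None if the line is not recognised."""
    match = re.search(r"sclk:\s*(\d+)\s+mclk:\s*(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def clocks_changed(samples):
    """True if at least two distinct (sclk, mclk) readings were observed."""
    return len({s for s in samples if s is not None}) > 1


def sample_clocks(path=PM_INFO, count=10, delay=1.0):
    """Read the file `count` times, `delay` seconds apart, and return the readings."""
    samples = []
    for _ in range(count):
        with open(path) as pm_info:
            samples.append(parse_pm_info(pm_info.read()))
        time.sleep(delay)
    return samples
```

Start the 3D app first, then call sample_clocks() and pass the result to clocks_changed(); if it stays False under load, the clocks are not scaling.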
Comment 127 Kai 2014-07-30 16:34:07 UTC
(In reply to comment #125)
> I think it's time to resolve this report; any remaining issues should be
> tracked in separate reports.

NAK!

With the latest round of patches and builds, I'm falling back to llvmpipe again. I guess I know the reason: I probably need a different patch for the DDX than <http://people.freedesktop.org/~agd5f/0001-radeon-enable-hawaii-accel-conditionally.patch>, right? I asked multiple times (comment #117 and comment #121) but nobody reacted to that and now I'm seeing no error anywhere, but I do see
> [    48.086] (--) RADEON(0): Chipset: "HAWAII" (ChipID = 0x67b1)
> [    48.086] (II) RADEON(0): GPU accel disabled or not working, using shadowfb for KMS
in Xorg.0.log again.
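
For anyone who wants to test for this fallback automatically after rebuilding the stack, a minimal Python sketch follows. The function name is made up for illustration and it simply looks for the message text quoted above.

```python
# Message the radeon DDX logs when it falls back to shadowfb, as quoted above.
FALLBACK_MARKER = "GPU accel disabled or not working"


def accel_fallback_lines(log_text):
    """Return every log line that reports the shadowfb fallback."""
    return [line.strip() for line in log_text.splitlines()
            if FALLBACK_MARKER in line]
```

An empty list means the message was not logged; it does not by itself prove that acceleration is working.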

As soon as that is resolved, I'm happy to move this to follow-up bugs (like speed in 3D applications).

My stack is:
GPU: Hawaii PRO [Radeon R9 290] (ChipID = 0x67b1)
Linux: Git:~agd5f/linux:drm-next-3.17-rebased-on-fixes:6e07731f71 (calls itself 3.16-rc6)
libdrm: Git:master/libdrm-2.4.56
LLVM: 3.5 RC1
libclc: Git:master/0ec7437d9c
Mesa: Git:master/85109bc507
DDX: 1:7.4.0-2 + Patch from http://people.freedesktop.org/~agd5f/0001-radeon-enable-hawaii-accel-conditionally.patch
X: 2:1.16.0-1 (1.16.0)
Comment 128 Alex Deucher 2014-07-30 16:37:43 UTC
(In reply to comment #127)
> 
> With the latest round of patches and builds, I'm falling back to llvmpipe
> again. I gues I know the reason: I probably need a different patch for the
> DDX than
> <http://people.freedesktop.org/~agd5f/0001-radeon-enable-hawaii-accel-conditionally.patch>,
> right? I asked multiple times (comment #117 and
> comment #121) but nobody reacted to that and now I'm seeing no error
> anywhere, but I do see
> > [    48.086] (--) RADEON(0): Chipset: "HAWAII" (ChipID = 0x67b1)
> > [    48.086] (II) RADEON(0): GPU accel disabled or not working, using shadowfb for KMS
> in Xorg.0.log again.
> 
> As soon as that is resolved, I'm happy to move this to follow-up bugs (like
> speed in 3D applications).

The latest ddx patch is on the mailing list:
http://lists.x.org/archives/xorg-driver-ati/2014-July/026517.html
Comment 129 Kai 2014-07-30 17:34:46 UTC
(In reply to comment #128)
> (In reply to comment #127)
> > 
> > With the latest round of patches and builds, I'm falling back to llvmpipe
> > again. I gues I know the reason: I probably need a different patch for the
> > DDX than
> > <http://people.freedesktop.org/~agd5f/0001-radeon-enable-hawaii-accel-conditionally.patch>,
> > right? I asked multiple times (comment #117 and
> > comment #121) but nobody reacted to that and now I'm seeing no error
> > anywhere, but I do see
> > > [    48.086] (--) RADEON(0): Chipset: "HAWAII" (ChipID = 0x67b1)
> > > [    48.086] (II) RADEON(0): GPU accel disabled or not working, using shadowfb for KMS
> > in Xorg.0.log again.
> > 
> > As soon as that is resolved, I'm happy to move this to follow-up bugs (like
> > speed in 3D applications).
> 
> The latest ddx patch is on the mailing list:
> http://lists.x.org/archives/xorg-driver-ati/2014-July/026517.html

Thank you very much! Now everything is back to working as before. ;-)

As far as I'm concerned this bug can be closed and I can open new ones for the misrendered glyphs, the lacking 3D performance (e.g. 9 FPS in "XCOM: Enemy Unknown", as reported by the Gallium HUD), etc.

Thanks to all who helped get Hawaii support into shape.
Comment 130 Luzipher 2014-07-30 23:09:07 UTC
(In reply to comment #126)
> (In reply to comment #124)
> > 
> > And for the DPM stuff: dpm doesn't seem to work, the numbers never changed
> > (tested on console without X and Metro Last Light; I dropped my previous
> > "radeon.dpm=0" from the kernel parameters):
> > # cat /sys/kernel/debug/dri/64/radeon_pm_info
> > power level avg    sclk: 30000 mclk: 15000
> 
> You need to print that out while you have a 3D app running.  Those numbers
> are real-time.  E.g., in the console, there is no 3D acceleration happening
> so they stay at their low levels.

I did just that :-) Sorry if I wasn't clear enough on this. Actually what I did was starting Metro Last Light and catting over ssh multiple times from my laptop (Metro doesn't allow alt-tab). The values never changed at all.

I chose Metro, because it's the most graphically challenging piece of software I have - and should therefore certainly cause a change of values. glxgears didn't cause a change either, but I thought that might be because it's too simple.

Today, to verify, I also tried it with Half Life 2. Same result. (But MSAA works now with the libdrm patch, thanks Marek !)

All of the above still with the "ASIC_ProfilingInfo v3.1" as I got a NULL pointer dereference with yesterday's drm-next-3.17-rebased-on-fixes kernel.

I guess the new kernel to use is "standard" 3.17-wip, as it now contains all the Hawaii stuff ?


I'm also ok with tracking the remaining issues in separate bugs - do I need to close this bug ? (I'd wait till the xf86-video-ati patch is applied).
Any requests on which issues should get their own bugs now ? Turning off HDMI-0 and Glyphs come to mind ?

Oh and I had a typo in my last comment - of course I updated bug #74250. Is that maybe related to my never changing power states ?
Comment 131 Marek Olšák 2014-07-31 00:07:08 UTC
DPM seems to be working here. If I do "sudo cat /sys/kernel/debug/dri/64/radeon_pm_info" with no 3D app running, I get:

power level avg    sclk: 30047 mclk: 15000

If I do the same while glxgears is running, I get:

power level avg    sclk: 80000 mclk: 112500

That said, the performance seems to be lower than Bonaire. The open source game Torcs has 27 FPS with Bonaire and 18 FPS with Hawaii. radeontop shows 100% GPU usage with Hawaii while Torcs is running.
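
The idle-versus-load comparison above can be automated when collecting several readings. A minimal sketch, assuming each reading has the `power level avg    sclk: N mclk: M` form quoted in this thread; `sclk_values` and `reclocking_seen` are hypothetical helper names:

```python
import re


def sclk_values(samples):
    """Extract the sclk field from each radeon_pm_info line."""
    values = []
    for line in samples:
        match = re.search(r"sclk:\s*(\d+)", line)
        if match:
            values.append(int(match.group(1)))
    return values


def reclocking_seen(samples):
    """Return True if at least two different sclk readings were observed."""
    return len(set(sclk_values(samples))) > 1


# Readings taken from the comments in this thread.
idle = "power level avg    sclk: 30047 mclk: 15000"
load = "power level avg    sclk: 80000 mclk: 112500"
stuck = "power level avg    sclk: 30000 mclk: 15000"

print(reclocking_seen([idle, load]))
print(reclocking_seen([stuck, stuck, stuck]))
```

A result of False only says that the sampled values were identical; it does not say why the clocks stayed put.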
Comment 132 Kai 2014-07-31 15:23:36 UTC
(In reply to comment #130)
> (In reply to comment #126)
> > (In reply to comment #124)
> > > 
> > > And for the DPM stuff: dpm doesn't seem to work, the numbers never changed
> > > (tested on console without X and Metro Last Light; I dropped my previous
> > > "radeon.dpm=0" from the kernel parameters):
> > > # cat /sys/kernel/debug/dri/64/radeon_pm_info
> > > power level avg    sclk: 30000 mclk: 15000
> > 
> > You need to print that out while you have a 3D app running.  Those numbers
> > are real-time.  E.g., in the console, there is no 3D acceleration happening
> > so they stay at their low levels.
> 
> I did just that :-) Sorry if I wasn't clear enough on this. Actually what I
> did was starting Metro Last Light and catting over ssh multiple times from
> my laptop (Metro doesn't allow alt-tab). The values never changed at all.
> 
> I chose Metro, because it's the most graphically challenging piece of
> software I have - and should therefore certainly cause a change of values.
> > glxgears didn't cause a change either, but I thought that might be because
> it's too simple.
> 
> Today, to verify, I also tried it with Half Life 2. Same result. (But MSAA
> works now with the libdrm patch, thanks Marek !)
> 
> All of the above still with the "ASIC_ProfilingInfo v3.1" as I got a NULL
> pointer dereference with yesterday's drm-next-3.17-rebased-on-fixes kernel.
> 
> I guess the new kernel to use is "standard" 3.17-wip, as it now contains all
> the Hawaii stuff ?

Just a thought: you had radeon.dpm=0 set for a long time according to some of your posts. Are you sure you've removed that from your Kernel command line?

I can't tell if the reclocking is working as Marek reported in comment #131 yet. I might have time to fire up a game during the weekend. But should that be a problem, I can just open a new bug. I don't think this possible reclocking issue should keep this bug open.

> I'm also ok with tracking the remaining issues in separate bugs - do I need
> to close this bug ? (I'd wait till the xf86-video-ati patch is applied).

Yes, that sounds reasonable: the person who commits the DDX patch can close this bug, because then all pieces should be at least in Git.

> Any requests on which issues should get their own bugs now ? Turning off
> HDMI-0 and Glyphs come to mind ?


The glyph stuff is already tracked in bug 81930. We still need bugs for the poor performance and for crashing the GPU by running Unigine Heaven (you have to actually reboot, otherwise the screen won't come up again; in fact the screen is not just black, it loses its signal). And possibly other issues I'm not seeing myself or just have not noticed yet.
Comment 133 Luzipher 2014-07-31 17:22:47 UTC
(In reply to comment #132)
> Just a thought: you had radeon.dpm=0 set for a long time according to some
> of your posts. Are you sure you've removed that from your Kernel command
> line?

No, I removed that argument a while ago. But I just rechecked, this is what dmesg prints - so it's really not there anymore:
Command line: BOOT_IMAGE=/kernel-3.16.0-rc4-gd8dacc8 root=/dev/sda7 ro init=/usr/lib/systemd/systemd
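
The same check can be scripted. A minimal sketch, assuming the command line is passed in as a string (for example the dmesg line above); `dpm_disabled` is a hypothetical helper name:

```python
def dpm_disabled(cmdline):
    """Return True if the kernel command line contains radeon.dpm=0."""
    # Parameters are separated by whitespace, so compare whole tokens
    # rather than searching for a substring.
    return "radeon.dpm=0" in cmdline.split()


# The command line quoted from dmesg in this comment.
cmdline = (
    "Command line: BOOT_IMAGE=/kernel-3.16.0-rc4-gd8dacc8 "
    "root=/dev/sda7 ro init=/usr/lib/systemd/systemd"
)
print(dpm_disabled(cmdline))
```

This only inspects the boot parameters; a module option set elsewhere would not show up here.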

> I can't tell if the reclocking is working as Marek reported in comment #131
> yet. I might have time to fire up a game during the weekend. But should that
> be a problem I can just open a new bug. I don't think this possible
> reclocking issue should keep this bug open.

No, probably not. I'll start a new bug if the issue is not related to bug #74250.

(In reply to comment #131)
> DPM seems to be working here.

I guess you have an unmodified reference design card ?
Comment 134 Marek Olšák 2014-07-31 19:29:22 UTC
(In reply to comment #133)
> (In reply to comment #131)
> > DPM seems to be working here.
> 
> I guess you have an unmodified reference design card ?

I guess so. It doesn't look fancy like cards you can buy. It's just a big black brick with an ugly sticker on it saying "HAWAII XT". :)
Comment 135 Luzipher 2014-07-31 20:08:38 UTC
(In reply to comment #134)
> (In reply to comment #133)
> > (In reply to comment #131)
> > > DPM seems to be working here.
> > 
> > I guess you have an unmodified reference design card ?
> 
> I guess so. It doesn't look fancy like cards you can buy. It's just a big
> black brick with an ugly sticker on it saying "HAWAII XT". :)

Well, yes, mine looks quite fancy, with 3 fans and stuff. I bought it for the better cooling solution (no idea why the reference design must be loud and not very efficient).
Thing is, it also has a different BIOS, because it doesn't have "Uber"-Mode, as that mode should be default on my card. AFAIK "Uber"-Mode on reference designs squeezes out more performance but is way louder. It should be activatable by a small switch on the card.
In contrast, my "Sapphire Radeon R9 290X Tri-X OC" switches between normal BIOS and UEFI mode with that switch (I'm using the normal one as my rig is from the ancient times before UEFI).

Maybe it'd be interesting if you tried to switch to "Uber"-Mode and see if anything changes (does dpm still work or does the "Uber"-Mode BIOS also use an ASIC_ProfilingInfo v3.1 table ?).
Comment 136 Marek Olšák 2014-07-31 22:36:49 UTC
Alright, I have tested Uber Mode and it doesn't improve performance, at least not visibly. That's not the biggest problem. Uber Mode makes the card very unstable. I'm getting random geometry corruption and hangs with it. If I test a game, it usually doesn't survive 2 minutes of playing. The only other difference I see is that the maximum shader clock changed from 80000 to 85000 (KHz?). Anyway, the clocks seem too low. I think this card should be able to reach 1 GHz. It's also pretty cold - it only has 50 °C.
Comment 137 Alex Deucher 2014-07-31 22:43:41 UTC
You might try the patches I posted on bug 74250.
Comment 138 Alex Deucher 2014-07-31 22:44:53 UTC
(In reply to comment #136)
> Alright, I have tested Uber Mode and it doesn't improve performance, at
> least not visibly. That's not the biggest problem. Uber Mode makes the card
> very unstable. I'm getting random geometry corruption and hangs with it. If
> I test a game, it usually doesn't survive 2 minutes of playing. The only
> other difference I see is that the maximum shader clock changed from 80000
> to 85000 (KHz?). Anyway, the clocks seem too low. I think this card should
> be able to reach 1 GHz. It's also pretty cold - it only has 50 °C.

80000 is 800 MHz.  The driver stores clocks in 10 kHz units.
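
As a worked example of that unit: one MHz is 100 units of 10 kHz, so dividing the raw value by 100 gives MHz. A minimal sketch; `to_mhz` is a hypothetical helper name, not a driver function:

```python
def to_mhz(raw_clock):
    """Convert a clock value stored in 10 kHz units to MHz."""
    # 1 MHz = 100 * 10 kHz, so divide the raw value by 100.
    return raw_clock / 100


# Raw values reported earlier in this thread.
for raw in (30000, 80000, 85000, 112500):
    print(raw, "->", to_mhz(raw), "MHz")
```

Applied to the numbers above, the maximum shader clock moves from 800 MHz to 850 MHz in Uber Mode.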
Comment 139 Marek Olšák 2014-08-01 12:59:00 UTC
(In reply to comment #137)
> You might try the patches I posted on bug 74250.

Sorry, they don't help.
Comment 140 Marek Olšák 2014-08-05 01:09:19 UTC
(In reply to comment #136)
> Alright, I have tested Uber Mode and it doesn't improve performance, at
> least not visibly. That's not the biggest problem. Uber Mode makes the card
> very unstable. I'm getting random geometry corruption and hangs with it. If
> I test a game, it usually doesn't survive 2 minutes of playing. The only
> other difference I see is that the maximum shader clock changed from 80000
> to 85000 (KHz?). Anyway, the clocks seem too low. I think this card should
> be able to reach 1 GHz. It's also pretty cold - it only has 50 °C.

I think I should correct some things I said above. (BTW if you don't know what Uber/Quiet mode is, there is a tiny on-board switch for switching between the two: http://assets.hardwarezone.com/img/2013/11/R9_BIOS_Switch-600W.jpg)

The Quiet mode is pretty stable. Some apps might hang due to incorrect hardware programming, but other than that, it's solid.

The Uber mode has some graphics corruption in some 3D apps. That might be the reason why it hangs, but the hangs can be rare despite seeing a lot of disappearing/flickering geometry.

I'm using Alex's drm-next-3.17-rebased-on-fixes. The other branch, 3.17-wip, turned out to be very unstable when I first fetched it, so you might want to avoid it (not sure if the wip branch is working now).

I've figured out why Heaven and Valley hang. I have a workaround now and I'll try to find the best way to fix it tomorrow.
Comment 141 Luzipher 2014-08-06 21:35:10 UTC
The two patches fixing Unigine Heaven and Valley you posted today on the mailing list, Marek, also solve the GPU crashes I have with World of Warcraft, described briefly above.

Patches are in this mailing list thread: http://lists.freedesktop.org/archives/mesa-dev/2014-August/064919.html
Comment 142 Luzipher 2014-08-12 21:39:39 UTC
Closing this bug, as with today's commit to xf86-video-ati ( http://cgit.freedesktop.org/xorg/driver/xf86-video-ati/commit/?id=94202cbfbca05a503acdc1cca2f8409d141173af ) all the needed pieces are committed to the development repositories.

You need a development version of kernel 3.17 (currently agd5f's branch drm-fixes-3.17-wip seems a good choice), current mesa from git, at least libdrm-2.4.56, and xf86-video-ati from git, as well as the new microcode (the link is somewhere in this bug report).

Thanks again to everyone involved fixing up Hawaii support ! :-)
