62997 – Crashes on ARUBA unless R600_DEBUG=nodma

Summary: Crashes on ARUBA unless R600_DEBUG=nodma

Status:	RESOLVED DUPLICATE of bug 62959

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/r600
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Default DRI bug account
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-04-01 15:31 UTC by udo
Modified:	2013-04-14 14:28 UTC
CC List:	1 user

Attachments
Xorg.0.log with R600_DEBUG=nodma (57.85 KB, text/plain) 2013-04-01 15:40 UTC, udo
dmesg (88.51 KB, text/plain) 2013-04-01 15:41 UTC, udo

Description udo 2013-04-01 15:31:59 UTC

Ever since booting into kernel.org 3.8.4 on my AMD A10-5800K (ARUBA graphics), running git mesa and git xf86-video-ati, I get short uptimes (15 minutes, around one hour max) due to crashes.
The logs mention stuff like:

[ 1332.480233] radeon 0000:00:01.0: GPU fault detected: 146 0x0134710c
[ 1332.480243] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000813
[ 1332.480250] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0407100C

Watching YouTube seems to help trigger the issue. (This is a correlation; no real causation has been established yet.)
Having R600_DEBUG=nodma in the environment solves the problem.
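
A minimal sketch of applying the workaround to a single process, assuming the variable only has to be present in the environment of the program that loads the r600 driver. The helper names `nodma_env` and `run_with_nodma` are illustrative and do not come from Mesa or from this report.

```python
import os
import subprocess


def nodma_env(base=None):
    """Return a copy of the environment with R600_DEBUG=nodma set.

    Any existing R600_DEBUG value is replaced (a simplification).
    """
    env = dict(os.environ if base is None else base)
    env["R600_DEBUG"] = "nodma"
    return env


def run_with_nodma(argv):
    """Run a command with the workaround applied and return its exit code."""
    return subprocess.run(argv, env=nodma_env()).returncode
```

Which processes need the variable (X server, compositor, individual GL clients) is not stated in the report, so setting it session-wide may be the safer choice.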

Occasionally I see a GPU lockup, if that is related:

    [29648.098135]  disk 0, wo:0, o:1, dev:sda2
    [29648.098140]  disk 1, wo:0, o:1, dev:sdb2
    [29648.098142]  disk 2, wo:0, o:1, dev:sdc2
    [29648.098145]  disk 3, wo:0, o:1, dev:sdd2
    [68707.166021] radeon 0000:00:01.0: GPU fault detected: 146 0x0d4c2604
    [68707.166030] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000008D4
    [68707.166043] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C026004
    [70621.378798] radeon 0000:00:01.0: GPU fault detected: 146 0x013c710c
    [70621.378808] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000813
    [70621.378815] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C07100C
    [70621.378837] radeon 0000:00:01.0: GPU fault detected: 147 0x0f0c7102
    [70621.378843] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
    [70621.378848] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
    [70621.378854] radeon 0000:00:01.0: GPU fault detected: 147 0x0f1c7102
    [70621.378859] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
    [70621.378864] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
    [70631.857918] radeon 0000:00:01.0: GPU lockup CP stall for more than 10000msec
    [70631.857927] radeon 0000:00:01.0: GPU lockup (waiting for 0x00000000007e1fe5 last fence id 0x00000000007e1fe3)
    [70631.858436] radeon 0000:00:01.0: sa_manager is not empty, clearing anyway
    [70631.859755] radeon 0000:00:01.0: Saved 951 dwords of commands on ring 0.
    [70631.859761] radeon 0000:00:01.0: GPU softreset: 0x00000003
    [70631.859766] radeon 0000:00:01.0:   VM_CONTEXT0_PROTECTION_FAULT_ADDR   0x00000000
    [70631.859770] radeon 0000:00:01.0:   VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00000000
    [70631.859774] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
    [70631.859778] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
    [70631.867299] radeon 0000:00:01.0:   GRBM_STATUS               = 0xA2703828
    [70631.867305] radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0x1D000007
    [70631.867309] radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
    [70631.867313] radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000040
    [70631.867317] radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
    [70631.867321] radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x00018000
    [70631.867325] radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00008006
    [70631.867328] radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x80038647
    [70631.867332] radeon 0000:00:01.0:   GRBM_SOFT_RESET=0x0000DF7B
    [70631.867386] radeon 0000:00:01.0:   GRBM_STATUS               = 0x00003828
    [70631.867390] radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0x00000007
    [70631.867393] radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
    [70631.867397] radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000040
    [70631.867400] radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
    [70631.867404] radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
    [70631.867408] radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
    [70631.867411] radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x00000000
    [70631.883681] radeon 0000:00:01.0: GPU reset succeeded, trying to resume
    [70631.916445] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
    [70631.916534] radeon 0000:00:01.0: WB enabled
    [70631.916536] radeon 0000:00:01.0: fence driver on ring 0 use gpu addr 0x0000000030000c00 and cpu addr 0xffff880235891c00
    [70631.916538] radeon 0000:00:01.0: fence driver on ring 1 use gpu addr 0x0000000030000c04 and cpu addr 0xffff880235891c04
    [70631.916540] radeon 0000:00:01.0: fence driver on ring 2 use gpu addr 0x0000000030000c08 and cpu addr 0xffff880235891c08
    [70631.916541] radeon 0000:00:01.0: fence driver on ring 3 use gpu addr 0x0000000030000c0c and cpu addr 0xffff880235891c0c
    [70631.916543] radeon 0000:00:01.0: fence driver on ring 4 use gpu addr 0x0000000030000c10 and cpu addr 0xffff880235891c10
    [70631.935206] [drm] ring test on 0 succeeded in 3 usecs
    [70631.935264] [drm] ring test on 3 succeeded in 2 usecs
    [70631.935271] [drm] ring test on 4 succeeded in 1 usecs
    [70631.949531] [drm] ib test on ring 0 succeeded in 0 usecs
    [70631.950057] [drm] ib test on ring 3 succeeded in 0 usecs
    [70631.950576] [drm] ib test on ring 4 succeeded in 1 usecs
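
To compare faults across logs like the one above, the fault triples can be pulled out mechanically. This is a sketch only: the field names `src_id` and `src_data` are labels chosen here for the two values on the "GPU fault detected" line and are an assumption, not something the report defines.

```python
import re

FAULT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] radeon [0-9a-f:.]+: GPU fault detected: (\d+) (0x[0-9a-fA-F]+)"
)
REG_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_(ADDR|STATUS)\s+(0x[0-9a-fA-F]+)")


def parse_faults(lines):
    """Group each 'GPU fault detected' line with the ADDR/STATUS lines after it."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            faults.append({
                "time": float(m.group(1)),      # seconds since boot
                "src_id": int(m.group(2)),      # 146 or 147 in this report
                "src_data": int(m.group(3), 16),
                "addr": None,
                "status": None,
            })
            continue
        m = REG_RE.search(line)
        if m and faults:
            key = m.group(1).lower()
            # Only fill a field once, so the register dump printed during a
            # later soft reset does not overwrite the values of the fault.
            if faults[-1][key] is None:
                faults[-1][key] = int(m.group(2), 16)
    return faults


# Lines copied from the first excerpt in this report.
sample = [
    "[ 1332.480233] radeon 0000:00:01.0: GPU fault detected: 146 0x0134710c",
    "[ 1332.480243] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000813",
    "[ 1332.480250] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0407100C",
]
for f in parse_faults(sample):
    print(f"{f['time']:.0f}s src_id={f['src_id']} addr={f['addr']:#x} status={f['status']:#x}")
```

Because the patterns are applied with `search`, the same function also accepts the syslog-prefixed lines quoted in later comments.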

Comment 1 udo 2013-04-01 15:40:07 UTC

Created attachment 77277
Xorg.0.log with R600_DEBUG=nodma

Comment 2 udo 2013-04-01 15:41:30 UTC

Created attachment 77278
dmesg

Comment 3 udo 2013-04-05 14:25:57 UTC

With R600_DEBUG=nodma the logs still show some GPU faults, but less often, and they no longer crash the whole PC.

Comment 4 udo 2013-04-05 14:26:32 UTC

Is https://bugs.freedesktop.org/show_bug.cgi?id=58667 a related issue?

Comment 5 udo 2013-04-07 06:24:18 UTC

It does crash, but without a reboot.
The GUI disappears.
A pure text-mode screen showing the first few seconds of boot is displayed.
No network. The kernel is still alive.

Apr  7 07:59:47 surfplank2 dbus[3118]: [system] Rejected send message, 2 matched rules; type="method_return", sender=":1.2" (uid=0 pid=3090 comm="/usr/lib/systemd/systemd-logind ") interface="(unset)" member="(unset)" error name="(unset)" requested_reply="0" destination=":1.34" (uid=500 pid=4127 comm="gnome-session ")
Apr  7 08:11:39 surfplank2 kernel: [406000.278385] radeon 0000:00:01.0: GPU fault detected: 147 0x0f727102
Apr  7 08:11:39 surfplank2 kernel: [406000.278390] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000018F7
Apr  7 08:11:39 surfplank2 kernel: [406000.278393] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02071002
Apr  7 08:11:39 surfplank2 kernel: [406000.278396] radeon 0000:00:01.0: GPU fault detected: 147 0x0f627102
Apr  7 08:11:39 surfplank2 kernel: [406000.278399] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Apr  7 08:11:39 surfplank2 kernel: [406000.278401] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr  7 08:11:39 surfplank2 kernel: [406000.278404] radeon 0000:00:01.0: GPU fault detected: 147 0x07527102
Apr  7 08:11:39 surfplank2 kernel: [406000.278406] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Apr  7 08:11:39 surfplank2 kernel: [406000.278409] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr  7 08:11:39 surfplank2 kernel: [406000.278411] radeon 0000:00:01.0: GPU fault detected: 147 0x07627102
Apr  7 08:11:39 surfplank2 kernel: [406000.278413] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Apr  7 08:11:39 surfplank2 kernel: [406000.278416] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr  7 08:11:39 surfplank2 kernel: [406000.278418] radeon 0000:00:01.0: GPU fault detected: 147 0x00a27102
Apr  7 08:11:39 surfplank2 kernel: [406000.278420] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Apr  7 08:11:39 surfplank2 kernel: [406000.278423] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr  7 08:11:39 surfplank2 kernel: [406000.278426] radeon 0000:00:01.0: GPU fault detected: 147 0x00a27102
Apr  7 08:11:39 surfplank2 kernel: [406000.278428] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000000Apr  7 08:17:11 surfplank2 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Apr  7 08:17:11 surfplank2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="3041" x-info="http://www.rsyslog.com"] start

Comment 6 udo 2013-04-07 10:52:55 UTC

FWIW: Another lockup..

[ 9912.997377] nf_conntrack: automatic helper assignment is deprecated and it will be removed soon. Use the iptables CT target to attach helpers instead.
[16500.596325] radeon 0000:00:01.0: GPU fault detected: 146 0x0eb27104
[16500.596330] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000008EB
[16500.596332] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02071004
[16500.596335] radeon 0000:00:01.0: GPU fault detected: 146 0x0ec27104
[16500.596337] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[16500.596340] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[16500.596342] radeon 0000:00:01.0: GPU fault detected: 147 0x06b27102
[16500.596344] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[16500.596347] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[16500.596349] radeon 0000:00:01.0: GPU fault detected: 147 0x06c27102
[16500.596351] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[16500.596353] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[16511.077533] radeon 0000:00:01.0: GPU lockup CP stall for more than 10000msec
[16511.077537] radeon 0000:00:01.0: GPU lockup (waiting for 0x000000000038b92b last fence id 0x000000000038b928)
[16511.078189] radeon 0000:00:01.0: sa_manager is not empty, clearing anyway
[16511.079467] radeon 0000:00:01.0: Saved 215 dwords of commands on ring 0.
[16511.079470] radeon 0000:00:01.0: GPU softreset: 0x00000003
[16511.079473] radeon 0000:00:01.0:   VM_CONTEXT0_PROTECTION_FAULT_ADDR   0x00000000
[16511.079475] radeon 0000:00:01.0:   VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00000000
[16511.079478] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[16511.079480] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[16511.261445] radeon 0000:00:01.0:   GRBM_STATUS               = 0xE5702828
[16511.261447] radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0xFC000005
[16511.261450] radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
[16511.261451] radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000040
[16511.261454] radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[16511.261456] radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x00018000
[16511.261458] radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00008006
[16511.261461] radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x80038647
[16511.261462] radeon 0000:00:01.0:   GRBM_SOFT_RESET=0x0000DF7B
[16511.261515] radeon 0000:00:01.0:   GRBM_STATUS               = 0x00003828
[16511.261517] radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0x00000007
[16511.261519] radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
[16511.261521] radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000040
[16511.261523] radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[16511.261525] radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[16511.261527] radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[16511.261528] radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x00000000
[16511.274728] radeon 0000:00:01.0: GPU reset succeeded, trying to resume
[16511.463803] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[16511.463892] radeon 0000:00:01.0: WB enabled
[16511.463895] radeon 0000:00:01.0: fence driver on ring 0 use gpu addr 0x0000000030000c00 and cpu addr 0xffff8802331cdc00
[16511.463897] radeon 0000:00:01.0: fence driver on ring 1 use gpu addr 0x0000000030000c04 and cpu addr 0xffff8802331cdc04
[16511.463900] radeon 0000:00:01.0: fence driver on ring 2 use gpu addr 0x0000000030000c08 and cpu addr 0xffff8802331cdc08
[16511.463902] radeon 0000:00:01.0: fence driver on ring 3 use gpu addr 0x0000000030000c0c and cpu addr 0xffff8802331cdc0c
[16511.463903] radeon 0000:00:01.0: fence driver on ring 4 use gpu addr 0x0000000030000c10 and cpu addr 0xffff8802331cdc10
[16511.482550] [drm] ring test on 0 succeeded in 2 usecs
[16511.482609] [drm] ring test on 3 succeeded in 2 usecs
[16511.482617] [drm] ring test on 4 succeeded in 1 usecs
[16511.497231] [drm] ib test on ring 0 succeeded in 0 usecs
[16511.497751] [drm] ib test on ring 3 succeeded in 0 usecs
[16511.498269] [drm] ib test on ring 4 succeeded in 1 usecs
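
The "GPU lockup" lines in these logs report two fence ids. A small sketch for extracting them and their difference follows. Reading that difference as the number of fences still outstanding assumes fence ids increase by one per submission, which the report does not confirm.

```python
import re

LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) last fence id (0x[0-9a-fA-F]+)\)"
)


def fence_gap(line):
    """Return (waiting, last, waiting - last) for a lockup line, else None."""
    m = LOCKUP_RE.search(line)
    if not m:
        return None
    waiting, last = int(m.group(1), 16), int(m.group(2), 16)
    return waiting, last, waiting - last


# Line copied from the log in this comment.
line = ("[16511.077537] radeon 0000:00:01.0: GPU lockup "
        "(waiting for 0x000000000038b92b last fence id 0x000000000038b928)")
print(fence_gap(line))
```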

Comment 7 Alex Deucher 2013-04-09 13:12:28 UTC

This may be related to bug 62959.  Does attachment 72794 (kernel patch) fix the issue?

Comment 8 udo 2013-04-09 13:41:02 UTC

Will start testing on 3.8.6 in a few minutes.

Comment 9 udo 2013-04-09 15:11:50 UTC

3.8.6 with and without the patch had crashes of various kinds (even a hard freeze!).
Now doing 3.8.5 without patch, waiting for the raid check to complete.

Comment 10 udo 2013-04-12 11:11:33 UTC

Despite crashes for other reasons (ARUBA (Cayman) is not yet ready for OpenCL), I have seen no GPU faults etc. in the logs since booting into 3.8.5 with the patch.
I want to give it a few more days without OpenCL disruptions to be sure.

Comment 11 Alex Deucher 2013-04-12 13:09:01 UTC

This is starting to look like a duplicate of bug 62959.  Can you try attachment 77608?  That seems to fix 62959; hopefully it will fix this one as well.

Comment 12 udo 2013-04-12 13:14:40 UTC

So I undo the previous patch and try this new one?
(Or try them combined?)

Comment 13 Alex Deucher 2013-04-12 13:17:27 UTC

(In reply to comment #12)
> So I undo the previous patch and try this new one?
> (Or try them combined?)

Try them separately, not combined.

Comment 14 udo 2013-04-14 07:13:01 UTC

I guess the second patch also fixes the issue.
After 1 day, 15:11 of uptime I have seen no GPU faults, hangs, etc.
Normally they occurred much sooner than that.
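
For a rough sense of scale, the clean uptime can be set against some of the fault timestamps quoted earlier in this report. This is only a sketch: the helper `uptime_to_seconds` is illustrative, and the three timestamps are simply the first fault lines of three earlier excerpts, which may not be the first fault after each boot.

```python
import re


def uptime_to_seconds(text):
    """Convert an uptime string such as '1 day, 15:11' to seconds."""
    m = re.fullmatch(r"\s*(?:(\d+) days?,\s*)?(\d+):(\d+)\s*", text)
    if not m:
        raise ValueError(f"unrecognised uptime: {text!r}")
    days = int(m.group(1) or 0)
    return (days * 24 + int(m.group(2))) * 3600 + int(m.group(3)) * 60


clean_uptime = uptime_to_seconds("1 day, 15:11")

# dmesg timestamps (seconds since boot) of fault lines quoted in this report.
earlier_faults = [1332.480233, 16500.596325, 68707.166021]
for t in earlier_faults:
    print(f"fault at {t / 3600:.1f} h; clean run so far {clean_uptime / 3600:.1f} h")
```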

Comment 15 Alex Deucher 2013-04-14 14:28:00 UTC


*** This bug has been marked as a duplicate of bug 62959 ***
