Bug 84500

Summary:

[radeonsi] radeon 0000:01:00.0: Packet0 not allowed!

Product:

DRI

Reporter:

Alexandre Demers <alexandre.f.demers>

Component:

DRM/Radeon

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED MOVED

QA Contact:

Severity:

normal

Priority:

medium

CC:

daniel, electropura, eseifert, grantipak, john.ettedgui, kitcat490, ooblick+freedesktop

Version:

XOrg git

Hardware:

Other

OS:

All

See Also:

https://bugs.freedesktop.org/show_bug.cgi?id=87278

Whiteboard:

Attachments:

Description	Flags
dump full CS when we hit a packet 0	none
One CS dump	none
Dmesg while hitting a packet0	none
dmesg with packet not allowed error	none
dmesg for M6700	none
dmesg log	none
dmesg log	none

Description Alexandre Demers 2014-09-30 06:00:56 UTC

On a 7950, I keep getting this error from time to time in dmesg:
radeon 0000:01:00.0: Packet0 not allowed!

I have associated this error with playing either html5 or flash videos. It may happen when playing offline movies, but I can't tell since I haven't tested it.

When the error happens, there is a slight "stuttering" (from a fraction of a second to a few seconds). And then it continues.

There is nothing in Xorg.0.log about it, and no other message than "radeon 0000:01:00.0: Packet0 not allowed!" in dmesg.

Comment 1 Alexandre Demers 2014-09-30 06:04:14 UTC

Even when UVD is manually disabled, the error still shows in dmesg.

Comment 2 Michel Dänzer 2014-09-30 07:18:02 UTC

Can you run the browser with the environment variable RADEON_DUMP_CS=1, and attach any command stream dumps it generates on stderr?
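
A minimal sketch of the capture step requested above, assuming the dumps are written to the browser's stderr as stated. Only the `RADEON_DUMP_CS=1` variable comes from this report; the helper name `run_with_cs_dump` and the log file name are hypothetical.

```python
import os
import subprocess


def run_with_cs_dump(cmd, log_path):
    """Run cmd with RADEON_DUMP_CS=1 set and save its stderr to log_path."""
    # Copy the current environment and add the variable named in this report.
    env = dict(os.environ, RADEON_DUMP_CS="1")
    with open(log_path, "wb") as log:
        # Only stderr is redirected; stdout is left untouched.
        return subprocess.run(cmd, env=env, stderr=log).returncode


# Hypothetical usage:
# run_with_cs_dump(["firefox"], "radeon_cs_dump.log")
```

The resulting file can then be attached to the bug as-is.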

Comment 3 Alexandre Demers 2014-09-30 13:07:30 UTC

(In reply to comment #2)
> Can you run the browser with the environment variable RADEON_DUMP_CS=1, and
> attach any command stream dumps that generates on stderr?

I'll run firefox with this env var later today.

Comment 4 Alex Deucher 2014-09-30 13:28:15 UTC

Created attachment 107128 [details] [review]
dump full CS when we hit a packet 0

This kernel patch should make it much easier to debug.  When you hit the error, please attach the full output of the CS.

Comment 5 Alexandre Demers 2014-10-01 06:05:10 UTC

Created attachment 107162 [details]
One CS dump

Got this CS dump when using Firefox while playing a few streams in Flash (it happens often when there is more than one stream playing). I was playing the live stream from radio-canada.ca, a show from tou.tv and another one from telequebec.tv.

Shortly after, I experienced a GPU reset. Obviously, Flash had been killed in the process.

Here is the log from that hang/reset:
[25590.472377] radeon 0000:01:00.0: ring 0 stalled for more than 10020msec
[25590.472383] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000003d58ba last fence id 0x00000000003d58b5 on ring 0)
[25590.488409] radeon 0000:01:00.0: ring 3 stalled for more than 10036msec
[25590.488415] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000014a8e2 last fence id 0x000000000014a8e0 on ring 3)
[25590.979347] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c039ec40 flags=0x0010]
[25590.979352] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c039ec70 flags=0x0030]
[25590.979354] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c0000100 flags=0x0030]
[25590.979355] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c039eb00 flags=0x0010]
[25590.979357] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x00000000c039eb40 flags=0x0010]
[25590.979386] radeon 0000:01:00.0: Saved 321 dwords of commands on ring 0.
[25590.979432] radeon 0000:01:00.0: GPU softreset: 0x0000006C
[25590.979434] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
[25590.979437] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[25590.979439] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[25590.979441] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[25590.979476] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[25590.979478] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[25590.979480] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[25590.979482] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00400002
[25590.979484] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x84010243
[25590.979486] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483106
[25590.979488] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C84246
[25590.979490] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[25590.979492] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[25591.482479] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[25591.482532] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140
[25591.483688] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[25591.483690] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[25591.483692] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[25591.483694] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200002C0
[25591.483728] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[25591.483730] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[25591.483731] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[25591.483734] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[25591.483735] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[25591.483741] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[25591.483743] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[25591.483826] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[25591.524037] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[25591.524039] [drm] PCIE gen 2 link speeds already enabled
[25591.525213] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[25591.525303] radeon 0000:01:00.0: WB enabled
[25591.525306] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x00000000c0000c00 and cpu addr 0xffff8804113f2c00
[25591.525308] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x00000000c0000c04 and cpu addr 0xffff8804113f2c04
[25591.525310] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x00000000c0000c08 and cpu addr 0xffff8804113f2c08
[25591.525311] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x00000000c0000c0c and cpu addr 0xffff8804113f2c0c
[25591.525313] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x00000000c0000c10 and cpu addr 0xffff8804113f2c10
[25591.528260] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90015cb5a18
[25591.693423] [drm] ring test on 0 succeeded in 1 usecs
[25591.693427] [drm] ring test on 1 succeeded in 1 usecs
[25591.693431] [drm] ring test on 2 succeeded in 1 usecs
[25591.693440] [drm] ring test on 3 succeeded in 2 usecs
[25591.693446] [drm] ring test on 4 succeeded in 1 usecs
[25591.693471] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[25591.693475] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[25591.693476] radeon 0000:01:00.0: ib ring test failed (-35).
[25592.186042] radeon 0000:01:00.0: GPU softreset: 0x00000048
[25592.186044] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
[25592.186046] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[25592.186048] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[25592.186050] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[25592.186084] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[25592.186086] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[25592.186088] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[25592.186090] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
[25592.186091] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80010243
[25592.186093] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[25592.186095] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[25592.186097] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[25592.186099] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[25592.439595] Watchdog[906]: segfault at 0 ip 00007f4d1c491c2e sp 00007f4d0a258770 error 6 in chrome[7f4d1833b000+547e000]
[25592.674414] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[25592.674468] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[25592.675624] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[25592.675627] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[25592.675629] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[25592.675631] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[25592.675665] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[25592.675667] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[25592.675669] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[25592.675671] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[25592.675673] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[25592.675675] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[25592.675677] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[25592.675761] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[25592.701553] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[25592.701556] [drm] PCIE gen 2 link speeds already enabled
[25592.702716] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[25592.702806] radeon 0000:01:00.0: WB enabled
[25592.702809] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x00000000c0000c00 and cpu addr 0xffff8804113f2c00
[25592.702811] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x00000000c0000c04 and cpu addr 0xffff8804113f2c04
[25592.702812] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x00000000c0000c08 and cpu addr 0xffff8804113f2c08
[25592.702814] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x00000000c0000c0c and cpu addr 0xffff8804113f2c0c
[25592.702816] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x00000000c0000c10 and cpu addr 0xffff8804113f2c10
[25592.706024] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90015cb5a18
[25592.873022] [drm] ring test on 0 succeeded in 1 usecs
[25592.873026] [drm] ring test on 1 succeeded in 1 usecs
[25592.873030] [drm] ring test on 2 succeeded in 1 usecs
[25592.873039] [drm] ring test on 3 succeeded in 2 usecs
[25592.873045] [drm] ring test on 4 succeeded in 1 usecs
[25592.873066] [drm] ib test on ring 0 succeeded in 0 usecs
[25592.873084] [drm] ib test on ring 1 succeeded in 0 usecs
[25592.873124] [drm] ib test on ring 2 succeeded in 0 usecs
[25592.873141] [drm] ib test on ring 3 succeeded in 0 usecs
[25592.873158] [drm] ib test on ring 4 succeeded in 0 usecs
[25592.873294] switching from power state:
[25592.873295] 	ui class: none
[25592.873297] 	internal class: boot 
[25592.873298] 	caps: 
[25592.873299] 	uvd    vclk: 0 dclk: 0
[25592.873301] 		power level 0    sclk: 50000 mclk: 15000 vddc: 950 vddci: 875 pcie gen: 2
[25592.873302] 	status: c b 
[25592.873303] switching to power state:
[25592.873304] 	ui class: performance
[25592.873305] 	internal class: none
[25592.873306] 	caps: 
[25592.873307] 	uvd    vclk: 0 dclk: 0
[25592.873308] 		power level 0    sclk: 30000 mclk: 15000 vddc: 850 vddci: 875 pcie gen: 2
[25592.873309] 		power level 1    sclk: 50100 mclk: 125000 vddc: 950 vddci: 875 pcie gen: 2
[25592.873310] 		power level 2    sclk: 88000 mclk: 125000 vddc: 1090 vddci: 875 pcie gen: 2
[25592.873311] 	status: r

Comment 6 Alexandre Demers 2014-10-01 15:51:08 UTC

As a side note: besides the attached CS dump, there are others in dmesg, but they did not trigger a GPU reset. The last one I hit, which is the one I attached, seems to have been too much for some reason, and the GPU reset again when I restarted the streams.

Comment 7 Michel Dänzer 2014-10-02 03:29:31 UTC

> [25322.031213] 	0x0000000b

This value is written to the R_028A3C_VGT_GROUP_VECT_1_FMT_CNTL register. However, the driver only ever writes 0 to that register, in si_init_config().

> [25322.031214] 	0x00000000 <---
> [25322.031215] 	0x00000295
> [25322.031215] 	0x00000080
> [25322.031216] 	0x00000040
> [25322.031217] 	0x00000002

The values after the arrow look like the following series of register writes to R_028A54_VGT_GS_PER_ES and the two following registers.

So, it looks like the value for the R_028A3C_VGT_GROUP_VECT_1_FMT_CNTL register and the following PKT3_SET_CONTEXT_REG header were scribbled over with the value 0x0000000b00000000. Looks like memory corruption to me.
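
To illustrate why an overwritten header would surface as a "Packet0" error, here is a small sketch of PM4 header decoding. It assumes the header layout found in the radeon kernel headers (packet type in bits 31:30, count in bits 29:16, type-3 opcode in bits 15:8). The `SET_CONTEXT_REG` opcode value 0x69 and the context register base 0x28000 are likewise assumptions, not taken from this report. Under those assumptions, the 0x00000000 dword at the arrow decodes as a type-0 packet, and the 0x00000295 that follows matches the dword offset of R_028A54.

```python
SET_CONTEXT_REG = 0x69      # assumed PKT3 opcode value
CONTEXT_REG_BASE = 0x28000  # assumed start of the context register range


def decode_pm4_header(header):
    """Split a 32-bit PM4 packet header into its fields (assumed layout)."""
    pkt_type = (header >> 30) & 0x3
    fields = {"type": pkt_type, "count": (header >> 16) & 0x3FFF}
    if pkt_type == 3:
        fields["opcode"] = (header >> 8) & 0xFF
    elif pkt_type == 0:
        # Type-0 packets carry a register dword index in the low 16 bits.
        fields["reg"] = (header & 0xFFFF) << 2
    return fields


def context_reg_offset(reg):
    """Dword offset that SET_CONTEXT_REG would carry for a register address."""
    return (reg - CONTEXT_REG_BASE) >> 2


print(decode_pm4_header(0x00000000))          # the dword at the arrow
print(hex(context_reg_offset(0x28A54)))       # R_028A54_VGT_GS_PER_ES
```

With this layout, an intact header for three consecutive context registers would have type 3 and opcode 0x69, whereas a zeroed dword in its place has type 0, which the kernel rejects.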

Running firefox in valgrind or with something like the GCC / clang address sanitizers might give a clue, but might be painful.

Comment 8 Christian König 2014-10-02 12:32:14 UTC

(In reply to Michel Dänzer from comment #7)
> So, it looks like the value for the R_028A3C_VGT_GROUP_VECT_1_FMT_CNTL
> register and the following PKT3_SET_CONTEXT_REG header were scribbled over
> with the value 0x0000000b00000000. Looks like memory corruption to me.

Yeah, agree that strongly looks like a memory corruption. Which would also explain all the crashes.

Comment 9 José Suárez 2014-10-02 22:21:38 UTC

I have been trying linux 3.16 with the packet0 patch and after some testing I haven't got any Packet0 message in the dmesg log. So I guess it must be related to the 3.17 rc's. I don't remember getting similar crashes just by watching youtube videos in firefox with previous kernels.

I will try to build 3.17 rc6 (which was the version that gave the Packet0 logs and system hangs) with the Packet 0 patch and report back.

@Alexandre: Can you try linux 3.16 and see if it works properly for you?

Comment 10 Alexandre Demers 2014-10-03 06:21:42 UTC

(In reply to José Suárez from comment #9)
> I have been trying linux 3.16 with the packet0 patch and after some testing
> I haven't got any Packet0 message in the dmesg log. So I guess it must be
> related to the 3.17 rc's. I don't remember getting similar crashes just by
> watching youtube videos in firefox with previous kernels.
> 
> I will try to build 3.17 rc6 (which was the version that gave the Packet0
> logs and system hangs) with the Packet 0 patch and report back.
> 
> @Alexandre: Can you try linux 3.16 and see if it works properly for you?

built and testing. I'll report ASAP.

Comment 11 José Suárez 2014-10-03 19:17:59 UTC

Created attachment 107281 [details]
Dmesg while hitting a packet0

Comment 12 José Suárez 2014-10-03 19:21:41 UTC

I compiled 3.17rc7 with the packet0 patch. You can find a dmesg log just above this message.

Running firefox with RADEON_DUMP_CS=1 didn't produce any dump. Is it because I need the mesa dbg packages (not currently installed)? I guess it should appear on the console output, right? Or is it written to a file? (Sorry about those noob questions. First time debugging this kind of problem...)

Comment 13 Alexandre Demers 2014-10-04 04:42:15 UTC

(In reply to José Suárez from comment #12)
> I compiled 3.17rc7 with the packet0 patch. You can find a dmesg log just
> above this message.
> 
> Running firefox with RADEON_DUMP_CS=1 didn't produce any dump. Is it because
> I need the mesa dbg packages (not currently installed)? I guess it should
> appear on the console output, right? Or is it written to a file? (Sorry about
> those noob questions. First time debugging this kind of problem...)

It seems pretty much the same "signature" as the CS dump I had attached.

The CS dump is written in your dmesg log or systemd journal if I remember correctly.

On my side, I've been playing with a 3.16 with the patch applied and I've been unable to get a Packet0 error. So, it seems to have been introduced somewhere between 3.16 and 3.17-rc7. I'll try to bisect as soon as I have time (maybe not before next week).

Comment 14 Andy Furniss 2014-10-04 08:45:56 UTC

(In reply to Alexandre Demers from comment #13)
> (In reply to José Suárez from comment #12)
> > I compiled 3.17rc7 with the packet0 patch. You can find a dmesg log just
> > above this message.
> > 
> > Running firefox with RADEON_DUMP_CS=1 didn't produce any dump. Is it because
> > I need the mesa dbg packages (not currently installed)? I guess it should
> > appear on the console output, right? Or is it written to a file? (Sorry about
> > those noob questions. First time debugging this kind of problem...)
> 
> It seems pretty much the same "signature" as the CS dump I had attached.
> 
> The CS dump is written in your dmesg log or systemd journal if I remember
> correctly.
> 
> On my side, I've been playing with a 3.16 with the patch applied and I've
> been unable to get a Packet0 error. So, it seems to have been introduced
> somewhere between 3.16 and 3.17-rc7. I'll try to bisect as soon as I'll have
> time (maybe not before next week).

FWIW I just grepped my kern.log for Packet0 and have 47 between Jul 4 and now.

Doing grep Packet0 /var/log/kern.log -B 860 | grep Microcode

Only comes up with pitcairn (lowercase = new firmware = 3.17)
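
The grep pipeline above can be approximated in Python when a per-hit breakdown is wanted. This is a sketch only: the function name is hypothetical, and unlike `grep -B`, which merges overlapping context, it inspects the preceding window separately for each "Packet0" line.

```python
def microcode_before_packet0(lines, window=860):
    """For each line containing 'Packet0', list the 'Microcode' lines
    found in the `window` lines that precede it."""
    hits = []
    for i, line in enumerate(lines):
        if "Packet0" in line:
            start = max(0, i - window)
            hits.append([l for l in lines[start:i] if "Microcode" in l])
    return hits


# Usage, with the log path taken from the comment above:
# with open("/var/log/kern.log", errors="replace") as f:
#     for found in microcode_before_packet0(f.read().splitlines()):
#         print(found)
```

An empty inner list means no firmware load message was found within the window for that occurrence.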

Comment 15 Alexandre Demers 2014-10-05 15:16:07 UTC

Slowly bisecting: b401796 would be good (haven't had a Packet0 error since yesterday) and 005f8005 would be bad. Continuing.

Comment 16 Alexandre Demers 2014-10-06 04:27:08 UTC

It may be related to general GPU crashes seen in other bugs: while bisecting, I hit a loop of GPU resets just after logging in until I rebooted.

Comment 17 Alexandre Demers 2014-10-06 04:36:38 UTC

(In reply to Alexandre Demers from comment #16)
> It may be related to general GPU crashes seen in other bugs: while bisecting, I
> hit a loop of GPU resets just after logging in until I rebooted.

Referring to commit 3c2ea70 (for traceability).

Comment 18 Alexandre Demers 2014-10-10 04:33:52 UTC

I had to go back in my bisection because the result didn't make sense. It's taking longer than expected...

Comment 19 Alexandre Demers 2014-10-10 18:25:21 UTC

Hmmm, dummy question but I must ask: isn't a HD 7950 supposed to be a Tahiti GPU? Because when looking in dmesg, it seems to load a Pitcairn ucode... My research seems to say it is indeed a Tahiti GPU... I'm puzzled.

Comment 20 John Bridgman 2014-10-10 21:06:14 UTC

Yes, AFAIK HD 78xx is Pitcairn and HD 79xx is Tahiti.

Comment 21 Alexandre Demers 2014-10-11 01:39:39 UTC

(In reply to John Bridgman from comment #20)
> Yes, AFAIK HD 78xx is Pitcairn and HD 79xx is Tahiti.

Well, I'm sick (literally) and I mixed José's dmesg with mine. Everything is fine with the device ID then.

Comment 22 Christian König 2014-10-11 13:52:53 UTC

Keep in mind that this might actually be a user space problem and that different kernel versions work or don't work only by coincidence.

If you can get me SSH access to the box I could take a look as well. Attaching a debugger to the process in question shouldn't be too hard.

Comment 23 Alexandre Demers 2014-10-11 14:56:51 UTC

(In reply to Christian König from comment #22)
> Keep in mind that this might actually be a user space problem and that
> different kernel versions work or don't work only by coincidence.
> 
> If you can get me SSH access to the box I could take a look as well.
> Attaching a debugger to the process in question shouldn't be too hard.

I've been having a hard time getting the error lately (not encountered in the last two days with a kernel 3.17-rc4). I'll go back to a newer kernel and I'll see if the Packet0 bug still happens as often as before.

About the SSH connection, that could be possible if needed in time.

Comment 24 Alexandre Demers 2014-10-13 05:34:21 UTC

(In reply to Alexandre Demers from comment #23)
> (In reply to Christian König from comment #22)
> > Keep in mind that this might actually be a user space problem and that
> > different kernel versions work or don't work only by coincidence.
> > 
> > If you can get me SSH access to the box I could take a look as well.
> > Attaching a debugger to the process in question shouldn't be too hard.
> 
> I've been having a hard time getting the error lately (not encountered in
> the last two days with a kernel 3.17-rc4). I'll go back to a newer kernel
> and I'll see if the Packet0 bug still happens as often as before.
> 
> About the SSH connection, that could be possible if needed in time.

Well, at last, I've been able to hit the error with a 3.17-rc4 and something.

Comment 25 Zoltán Böszörményi 2014-10-19 11:16:25 UTC

I see "Packet0 not allowed" messages on 3.17.0 / 3.17.1 under Fedora 21.
The video card is R9 270X, also Pitcairn.

Comment 26 Erich Seifert 2014-10-19 12:23:34 UTC

I'm also getting "Packet0 not allowed!" messages with a Radeon HD 7770 (Cape Verde XT) video card on kernel 3.17.0 and 3.17.1.
I experienced several random crashes with 3.17.0, but I'm not sure they are related to this problem yet. I'll apply the patch and report back soon.

Comment 27 Alexandre Demers 2014-10-19 18:28:26 UTC

(In reply to Christian König from comment #22)
> Keep in mind that this might actually be a user space problem and that
> different kernel versions work or don't work only by coincidence.
> 
> If you can get me SSH access to the box I could take a look as well.
> Attaching a debugger to the process in question shouldn't be too hard.

By the way, if I understand correctly, if the bug is in userspace and was introduced around the same time kernel 3.17-rcX went out, would it appear when using a previous kernel version? I'm trying to figure out a way to distinguish one from the other because from where I am in the bisection, I was unable to reproduce the bug with a 3.16 kernel, but it does appear before 3.17-rc1...

Comment 28 Christian König 2014-10-19 20:35:26 UTC

(In reply to Alexandre Demers from comment #27)
> (In reply to Christian König from comment #22)
> > Keep in mind that this might actually be a user space problem and that
> > different kernel versions work or don't work only by coincidence.
> > 
> > If you can get me SSH access to the box I could take a look as well.
> > Attaching a debugger to the process in question shouldn't be too hard.
> 
> By the way, if I understand correctly, if the bug is in userspace and was
> introduced around the same time kernel 3.17-rcX went out, would it appear
> when using a previous kernel version? I'm trying to figure out a way to
> distinguish one from the other because from where I am in the bisection, I
> was unable to reproduce the bug with a 3.16 kernel, but it does appear
> before 3.17-rc1...

It is possible that a new kernel lets this problem surface by coincidence. E.g. a slightly different memory layout or allocation timing, and instead of changing two random pixels on the screen we change the command buffer and the whole box crashes and/or shows this error.

All you can do is to try to figure out when the corruption happens. The kernel copies the command buffer content from userspace to a kernel buffer and then checks the content of the kernel buffer. Might be a good idea to print the content of the userspace buffer as well and compare both?
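
The comparison suggested above could be sketched as follows, assuming both buffers have already been captured as lists of 32-bit dwords. The function name is hypothetical, and how the two dumps are obtained is outside the scope of this sketch.

```python
def diff_dwords(user, kernel):
    """Return (index, user_value, kernel_value) for every position where
    the two command buffers differ. A missing dword is reported as None."""
    diffs = []
    for i in range(max(len(user), len(kernel))):
        u = user[i] if i < len(user) else None
        k = kernel[i] if i < len(kernel) else None
        if u != k:
            diffs.append((i, u, k))
    return diffs
```

An empty result would suggest the bad dwords were already present in the userspace buffer when it was copied. Any difference would point at the copy itself or at something modifying one of the buffers afterwards.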

Comment 29 Michel Dänzer 2014-10-20 07:19:52 UTC

(In reply to Alexandre Demers from comment #27)
> I'm trying to figure out a way to distinguish one from the other because from
> where I am in the bisection, I was unable to reproduce the bug with a 3.16
> kernel, but it does appear before 3.17-rc1...

That's fine. Once you've finished bisecting the kernel, we'll decide where to go from there based on the result.

Comment 30 Maciej 2014-10-24 15:54:02 UTC

Imagine bug like this happening on Windows, customers would go nuts and it would be fixed asap by AMD... But hey, Linux is not second class citizen, right?

Comment 31 José Suárez 2014-10-27 18:18:10 UTC

I've been testing linux 3.18 rc1 for a few days and I've found it to be quite stable with regard to this bug. No hangs for me yet, but the Packet0 messages still show up in dmesg.

Comment 32 Alexandre Demers 2014-10-27 22:14:11 UTC

(In reply to José Suárez from comment #31)
> I've been testing linux 3.18 rc1 for a few days and I've found it to be
> quite stable with regard to this bug. No hangs for me yet, but the Packet0
> messages still show up in dmesg.

Indeed, pretty much the same over here.

I'm still bisecting. Everything points to something introduced between 3.16 and 3.17-rc1. It just takes a while since the problem doesn't appear every time.

Comment 33 Dieter Nützel 2014-10-27 22:19:55 UTC

Hello Alexandre,

maybe you can take a look here to speed things up?
https://bugzilla.kernel.org/show_bug.cgi?id=86891
Comment #3 and #4.

Comment 34 Dieter Nützel 2014-10-27 22:22:57 UTC

(In reply to Dieter Nützel from comment #33)
> Hello Alexandre,
> 
> maybe you can take a look, here to speed things up?
> https://bugzilla.kernel.org/show_bug.cgi?id=86891
> Comment #3 and #4.

I'm testing it on RV730 AGP
with git revert of

59bc1d89d6a4d67c94a9b70fa81bda1d5b04f0cb is the first bad commit
commit 59bc1d89d6a4d67c94a9b70fa81bda1d5b04f0cb
Author: Lauri Kasanen <cand@gmx.com>
Date:   Sun Apr 20 20:29:33 2014 +0300

    drm/radeon: Inline r100_mm_rreg, -wreg, v3

Now.

Comment 35 Michel Dänzer 2014-11-11 09:32:45 UTC

(In reply to Alexandre Demers from comment #32)
> I'm still bisecting.

Did you get somewhere with the bisection? If not (or regardless), might be worth testing the Mesa patches I attached to bug 85647.

Comment 36 Alexandre Demers 2014-11-11 13:51:32 UTC

Almost. I will be testing my last commit tonight (if I made no mistake along the way). I'll have a look at the patch after that.

Comment 37 Alexandre Demers 2014-11-12 16:27:39 UTC

Well, the bisection was not conclusive... A branch's head commit produced the error, but I was unable to reproduce it earlier in that branch... I'll have to dig again in that branch and make sure it is related to that branch only.

Comment 38 Alexandre Demers 2014-11-17 01:37:46 UTC

For the last couple of days, I've been playing with kernel 3.19 drm-next and with some previously problematic 3.18 kernel versions. I was unable to reproduce the problem.

Mesa was updated a couple of times since the beginning of the bisection, as was the DDX driver. I'll keep this bug open for a couple more days, but I may end up closing it if I don't encounter the bug anymore.

Comment 39 drago01 2014-11-27 12:33:15 UTC

I am seeing those messages too here:

radeon 0000:02:00.0: Packet0 not allowed!

on a R9 270X ... no hangs or anything else just the message in the log (3.17.3 / mesa 10.3.3 on F20).

Comment 40 Alexandre Demers 2014-11-27 22:59:19 UTC

(In reply to drago01 from comment #39)
> I am seeing those messages too here:
> 
> radeon 0000:02:00.0: Packet0 not allowed!
> 
> on a R9 270X ... no hangs or anything else just the message in the log
> (3.17.3 / mesa 10.3.3 on F20).

I haven't hit it for a while. But I'm testing a 3.18 kernel with latest mesa from git. This could be a clue.

Comment 41 Alexandre Demers 2015-01-16 04:39:30 UTC

There is an application that still triggers the Packet0 error: Serious Sam 3. I could get an apitrace if someone thinks it could be useful.

Comment 42 Öyvind Saether 2015-02-02 21:56:11 UTC

Happens with 3.19.0-rc6, no idea what triggered it.

Comment 43 Öyvind Saether 2015-02-02 21:57:13 UTC

Created attachment 113076 [details]
dmesg with packet not allowed error

Comment 44 Lorenzo Bona 2015-02-18 07:53:29 UTC

I'm hitting this error too.

Playing Dota2 (my only game) causes this to appear in dmesg.

drm-fixes-3.19, mesa/ddx/xserver/drm from git.
The GPU is a R7-265.
The distribution is debian sid.

Comment 45 Lorenzo Bona 2015-02-19 11:40:15 UTC

Since yesterday I've been testing the latest drm-fixes-3.19 kernel with old radeon firmware, i.e. from before the big update on 24 July.

I've played Dota2 and watched videos on flash and on mpv with vdpau, and I can't reproduce those warnings anymore.

But while I play I can see these:

[10319.747657] radeon 0000:07:00.0: GPU fault detected: 146 0x0b080404
[10319.747665] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00017258
[10319.747670] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08004004
[10319.747675] VM fault (0x04, vmid 4) at page 94808, read from TC (4)
[12134.226711] radeon 0000:07:00.0: GPU fault detected: 146 0x0b084404
[12134.226719] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00017258
[12134.226724] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044004
[12134.226728] VM fault (0x04, vmid 4) at page 94808, read from TC (68)
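
The fault lines above can be decoded by hand. The sketch below assumes a bit layout for VM_CONTEXT1_PROTECTION_FAULT_STATUS (protections in bits 3:0, memory client id in bits 19:12, read/write flag in bit 24, VMID in bits 28:25). That layout is consistent with the decoded "VM fault" lines the kernel printed here, but it has not been checked against register documentation. The function name is hypothetical.

```python
def decode_vm_fault(addr, status):
    """Decode a VM protection fault from its ADDR and STATUS register values
    (bit positions are an assumption, see the note above)."""
    return {
        "page": addr,                       # printed in decimal by the kernel
        "protections": status & 0xF,
        "mc_id": (status >> 12) & 0xFF,     # memory client id
        "write": bool((status >> 24) & 0x1),
        "vmid": (status >> 25) & 0xF,
    }


# Values taken from the two faults logged above.
for status in (0x08004004, 0x08044004):
    print(decode_vm_fault(0x00017258, status))
```

Both faults hit the same page (0x17258 is 94808 in decimal) and differ only in the memory client id, 4 versus 68.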

Comment 46 Alexandre Demers 2015-02-19 15:44:10 UTC

(In reply to Lorenzo Bona from comment #45)
> Since yesterday I've been testing last drm-fixes-3.19 kernel with old radeon
> firmwares. I mean before big upgrade on 24th of July.
> 
> I've played Dota2 and watched videos on flash and on mpv with vdpau, and I
> can't reproduce those warnings anymore.
> 
> But while I play I can see these:
> 
> [10319.747657] radeon 0000:07:00.0: GPU fault detected: 146 0x0b080404
> [10319.747665] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00017258
> [10319.747670] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08004004
> [10319.747675] VM fault (0x04, vmid 4) at page 94808, read from TC (4)
> [12134.226711] radeon 0000:07:00.0: GPU fault detected: 146 0x0b084404
> [12134.226719] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00017258
> [12134.226724] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044004
> [12134.226728] VM fault (0x04, vmid 4) at page 94808, read from TC (68)

Your VM errors may be related to bug 87278, which was also reopened after a reverted commit in LLVM.

Comment 47 Chernovsky Oleg 2015-02-28 15:59:54 UTC

Got this bug today.

How can I tell whether it is a user-space IB assembly corruption or a kernel-space error? I have all the sources at hand; where should I look?

Comment 48 Alex Deucher 2015-02-28 18:57:11 UTC

Dump the IB in userspace before submission and compare it to what gets dumped in the kernel.
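
For the kernel side of that comparison, the dump printed by the debugging patch has to be turned back into dwords first. The sketch below assumes the line format seen in the excerpt quoted in comment 7 (a dmesg timestamp, whitespace, one hex dword, and an optional "<---" marker on the offending dword). The function name is hypothetical.

```python
import re

# Matches dump lines such as "[25322.031214] \t0x00000000 <---".
_DWORD_LINE = re.compile(r"\]\s*(0x[0-9a-fA-F]+)\s*(<---)?\s*$")


def parse_cs_dump(lines):
    """Return (dwords, marked_index) parsed from kernel CS dump lines.

    marked_index is the position of the dword flagged with "<---",
    or None if no line carries the marker.
    """
    dwords, marked = [], None
    for line in lines:
        m = _DWORD_LINE.search(line)
        if not m:
            continue  # not a dump line, skip it
        if m.group(2):
            marked = len(dwords)
        dwords.append(int(m.group(1), 16))
    return dwords, marked


# Lines quoted in comment 7 of this report.
sample = [
    "[25322.031213] \t0x0000000b",
    "[25322.031214] \t0x00000000 <---",
    "[25322.031215] \t0x00000295",
    "[25322.031215] \t0x00000080",
    "[25322.031216] \t0x00000040",
    "[25322.031217] \t0x00000002",
]
dwords, marked = parse_cs_dump(sample)
print([hex(d) for d in dwords], marked)
```

The returned list can then be compared position by position with the dwords dumped in userspace before submission.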

Comment 49 patches 2015-03-17 14:19:03 UTC

Happens with 3.18.9-100.fc20.x86_64

Comment 50 patches 2015-03-17 14:20:17 UTC

Created attachment 114385 [details]
dmesg for M6700

Comment 51 Vladimir Usikov 2015-03-31 10:04:16 UTC

Created attachment 114754 [details]
dmesg log

I got the same issue when playing Dota 2 and Alt+Tabbing to Firefox.
Radeon 7950, kernel 4.0rc4, mesa-git, llvm-svn, KDE 5, Archlinux x86-64.

Comment 52 Chernovsky Oleg 2015-09-28 22:35:49 UTC

Created attachment 118502 [details]
dmesg log

Just got this bug and GPU lockup while trying to play Guild Wars through wine.
Happens when I rotate the camera extensively at the start of the game, forcing Mesa to compile all the shaders at once.

It's repeatable so I can provide some logs here.

Radeon R9 270

Arch x86_64, Linux 4.2.1, brand new Mesa 11.0.1

Comment 53 Christian König 2015-09-29 08:05:19 UTC

(In reply to Chernovsky Oleg from comment #52)
> Created attachment 118502 [details]
> dmesg log
> 
> Just got this bug and GPU lockup while trying to play Guild Wars through
> wine.
> Happens when I rotate camera at the start of the game extensively, forcing
> Mesa to compile all the shaders at once.
> 
> It's repeatable so I can provide some logs here.
> 
> Radeon R9 270
> 
> Arch x86_64, Linux 4.2.1, brand new Mesa 11.0.1

Great! You are the guy who also did the fan control patches, aren't you?

As first step please try to catch an apitrace of it.

If that doesn't work and you still want to get your hands dirty with the code again contact me by mail (christian.koenig@amd.com) and we can discuss how to dig deeper into this issue.

Best regards,
Christian.

Comment 54 Chernovsky Oleg 2015-09-29 22:42:20 UTC

> Great! You are the guy who also did the fan control patches aren't you?

Yep, that's me, thanks!
I also stalked Michel Dänzer for explanations of GTT and VMM at some point :)

> As first step please try to catch an apitrace of it.
> 
> If that doesn't work and you still want to get your hands dirty with the
> code again contact me by mail (christian.koenig@amd.com) and we can discuss
> how to dig deeper into this issue.
> 
> Best regards,
> Christian.

Will do on the weekend and mail you the results.

Comment 55 Martin Peres 2019-11-19 08:57:21 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/540.
