105733 – Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working.

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	XOrg git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	highest blocker
Assignee:	Default DRI bug account

Reported:	2018-03-25 04:47 UTC by Allan
Modified:	2019-09-18 20:36 UTC
CC List:	19 users

Attachments
dmesg, killing pids, shutting down, unloading amdgpu, xorg log (67.04 KB, text/x-matlab) 2018-03-25 04:47 UTC, Allan
amdgpu timeout with iommu enabled (54.77 KB, text/plain) 2018-07-26 08:26 UTC, Suloev Dmitry
amdgpu timeout with iommu disabled (53.59 KB, text/plain) 2018-07-26 08:27 UTC, Suloev Dmitry
startx.log (3.34 KB, text/x-log) 2018-07-26 08:31 UTC, Suloev Dmitry
Memory manager not clean during takedown. (54.77 KB, text/plain) 2018-07-26 08:32 UTC, Suloev Dmitry
amdgpu with dc enabled (92.66 KB, text/plain) 2018-07-26 09:09 UTC, Suloev Dmitry
dmesg after logging into the system from remote machine. (89.17 KB, text/plain) 2018-08-13 02:25 UTC, krutoileshii
attachment-17526-0.html (3.07 KB, text/html) 2018-11-16 14:44 UTC, krutoileshii
attachment-26904-0.html (1.96 KB, text/html) 2018-11-17 15:36 UTC, krutoileshii
dmesg logs for failure (185.62 KB, text/plain) 2018-11-19 05:31 UTC, Kent Ross
AMD wip kernel config with 1000Hz timer (118.42 KB, text/plain) 2018-11-22 19:28 UTC, fin4478
I get these errors when attempting to boot after a normal GPU hang and KMS happens (86.21 KB, text/plain) 2019-01-23 19:44 UTC, las
attachment-12630-0.html (4.00 KB, text/html) 2019-01-23 19:46 UTC, krutoileshii
attachment-2574-0.html (12.95 KB, text/html) 2019-02-05 16:28 UTC, Garry Hurley Jr
attachment-3556-0.html (1.65 KB, text/html) 2019-03-10 10:55 UTC, las
After AMDGPU crashes (153.68 KB, text/plain) 2019-07-16 10:18 UTC, Hadet

Description Allan 2018-03-25 04:47:54 UTC

Created attachment 138344
dmesg, killing pids, shutting down, unloading amdgpu, xorg log

WHAT HAPPENS
- Amdgpu hangs without any clear clue of what is happening.
- The mouse cursor still moves when the system is not completely frozen, but clicks do nothing.
- The keyboard stops responding with Num Lock stuck, and even a PS/2 keyboard does not work.
- The video output freezes.
- Only ssh works, and only when the system is not completely frozen.
- The most irritating part: the system cannot be shut down, no matter what you do:
-- Pressing the power button on the case is the only thing that changes the display: it shows a console indicating that the X server is being stopped. But nothing else happens and the system does not turn off.
-- Nothing works from ssh: "init 0", "poweroff", "shutdown -P 0 -h", "reboot". Each command keeps waiting for something that never happens, and you have to press Ctrl+C to get back to the ssh session. In one attempt it stopped the ssh daemon, but the shutdown itself never happened, even after 30 minutes.
-- It is IMPOSSIBLE to force unload amdgpu using "rmmod -f amdgpu". The command never returns; it only hangs the ssh session.
-- It is IMPOSSIBLE to kill some X-related pids properly. If you try, either nothing happens or the process ends up in a defunct state. Not even "su -c 'kill -9 <pid>'" works.
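
Processes that ignore "kill -9" are usually either in uninterruptible sleep (state D) or already defunct (state Z). A minimal sketch for listing them over ssh, assuming a Linux /proc filesystem; the function names are made up for this illustration:

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def stuck_processes(proc_root="/proc"):
    """List (pid, comm, state) for processes in state D or Z."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state in ("D", "Z"):
            found.append((pid, comm, state))
    return found
```

Calling stuck_processes() during a hang and attaching the result to the report would show which processes are blocked, without needing a working display.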

TIPS
- The crashes that still allow an ssh connection almost always happen when firefox is open and playing a video (netflix, youtube) or running whatsapp web.
- The crashes that simply hang the entire computer may occur at any time.

OBSERVATIONS
- I use a custom kernel (from 4.15). I've tried including the polaris binaries for my card, which gave an improvement (fewer freezes) for a while. But now it is the same again.
- I use an nvidia card in the second PCIe slot for vfio. It is a must, and I disable nouveau as well... It should not be a reason for failing. I also tried with another AMD card, and with no card, in the second slot. The results were the same as far as I remember.

SYSTEM SPECS
- Custom kernel compilation optimized for ryzen (https://wiki.gentoo.org/wiki/Ryzen) and using polaris binaries (https://wiki.gentoo.org/wiki/AMDGPU)
- Chipset X370 (mobo)
- RX480 in first slot
- GTX 1070 on second slot.
- Tried also with a RX 580 on second slot.
- Tried also with nothing on second slot.
- i3wm loading from startx command

Comment 1 Allan 2018-03-25 04:49:51 UTC

Basically it blocks :
- killing pids
- shutting down
- xorg
- quitting xorg

Comment 2 Allan 2018-03-25 16:52:20 UTC

Tried getting all binaries available here https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu .

Even though I included the polaris binaries in the kernel, some binaries were missing (exactly those that were required...).

I've seen that before, but since it sometimes worked I just thought that some other bin was being used instead.

Well... I launched Unigine Valley as a test and now the problem is even worse :

[From dmesg]
```
[  517.630633] amdgpu 0000:0e:00.0: GPU fault detected: 147 0x00004802
[  517.630636] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  517.630638] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08048002
[  517.630640] amdgpu 0000:0e:00.0: VM fault (0x02, vmid 4) at page 0, read from 'TC4' (0x54433400) (72)
[  517.630644] amdgpu 0000:0e:00.0: GPU fault detected: 147 0x00004802
[  517.630645] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  517.630646] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08084002
[  517.630648] amdgpu 0000:0e:00.0: VM fault (0x02, vmid 4) at page 0, read from 'TC7' (0x54433700) (132)
```
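
In the "VM fault" lines, the hex value after the quoted client name is that same name encoded as ASCII bytes (0x54433400 is "TC4" plus a trailing NUL). A small sketch that pulls the fields out of such a line; the dictionary keys are names chosen for this example, and the first hex value and the last number are left uninterpreted:

```python
import re

VM_FAULT_RE = re.compile(
    r"VM fault \(0x[0-9a-fA-F]+, vmid (?P<vmid>\d+)\) at page (?P<page>\d+), "
    r"(?P<rw>read|write) from '(?P<client>[^']*)' "
    r"\(0x(?P<client_hex>[0-9a-fA-F]+)\) \((?P<client_id>\d+)\)"
)


def decode_vm_fault(line):
    """Parse one amdgpu 'VM fault' dmesg line, or return None if it does not match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    # The hex word spells the client name in ASCII, padded with NUL bytes.
    raw = bytes.fromhex(m["client_hex"])
    return {
        "vmid": int(m["vmid"]),
        "page": int(m["page"]),
        "access": m["rw"],
        "client": raw.rstrip(b"\0").decode("ascii"),
        "client_id": int(m["client_id"]),
    }
```

Run over a whole dmesg dump, this makes it easy to count faults per vmid or per client instead of reading the lines by eye.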

The symptoms and reactions are the same as above. I got the output over ssh because only the cursor was moving and nothing else was working.

So ... did my card die or is it a bug?

By the way ... I also have an RX580 and the problem described at first was happening with it too. (I had not tried forcing binaries before)

Comment 3 Allan 2018-03-27 03:02:46 UTC

Updating, now this error appears in dmesg too :

```
[ 1502.683100] Chrome_~dThread[2218]: segfault at 0 ip 00007f53a4452bd3 sp 00007f53a0899ad0 error 6 in libxul.so[7f53a3f3e000+4e2a000]
[ 1502.689186] Chrome_~dThread[2694]: segfault at 0 ip 00007f2ef4552bd3 sp 00007f2ef0999ad0 error 6 in libxul.so[7f2ef403e000+4e2a000]
[ 1502.689275] Chrome_~dThread[2300]: segfault at 0 ip 00007fc55ad52bd3 sp 00007fc557199ad0 error 6 in libxul.so[7fc55a83e000+4e2a000]
[ 1502.689287] Chrome_~dThread[2781]: segfault at 0 ip 00007f2ce4852bd3 sp 00007f2ce0c99ad0 error 6 in libxul.so[7f2ce433e000+4e2a000]
```
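
The segfault lines carry enough information to tell whether the crashes share one code location: subtracting the mapping base (the first number in the brackets) from "ip" gives the offset inside libxul.so, and the "error" value is the x86 page-fault error code (bit 0: page was present, bit 1: write access, bit 2: user mode). A sketch of that arithmetic, with key names invented for this example:

```python
import re

SEGV_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Parse one kernel 'segfault at' line, or return None if it does not match."""
    m = SEGV_RE.search(line)
    if m is None:
        return None
    err = int(m["err"])
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "lib": m["lib"],
        # Offset of the faulting instruction inside the mapped library.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }
```

For the four lines above this yields the same offset, 0x514bd3, in every process, and "error 6" decodes to a user-mode write to an unmapped page at address 0, which is consistent with one null-pointer write at a single place in libxul.so.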

Comment 4 Allan 2018-03-27 11:52:19 UTC

If you set amdgpu.dc=1 as a boot parameter and then try opening pavucontrol, the screen hangs with artifacts (mouse cursor keeps moving) and you get this error :


```
[  125.640254] amdgpu 0000:0e:00.0: GPU fault detected: 147 0x04f00402
[  125.640259] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001C389E
[  125.640262] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04004002
[  125.640264] amdgpu 0000:0e:00.0: VM fault (0x02, vmid 2) at page 1849502, read from 'TC1' (0x54433100) (4)
[  125.640641] amdgpu 0000:0e:00.0: GPU fault detected: 147 0x05004802
[  125.640643] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001C38A0
[  125.640644] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04048002
[  125.640646] amdgpu 0000:0e:00.0: VM fault (0x02, vmid 2) at page 1849504, read from 'TC4' (0x54433400) (72)
```
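
In this log the "page" number in each "VM fault" line is the decimal form of the preceding VM_CONTEXT1_PROTECTION_FAULT_ADDR value (0x001C389E is 1849502). A sketch that pairs the two lines so the faulting pages can be compared across runs; the helper name is hypothetical:

```python
import re

ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")
PAGE_RE = re.compile(r"VM fault \(.*?\) at page (\d+),")


def pair_faults(lines):
    """Yield (fault_addr, page) for each FAULT_ADDR line followed by a 'VM fault' line."""
    pending = None
    for line in lines:
        m = ADDR_RE.search(line)
        if m:
            pending = int(m.group(1), 16)
            continue
        m = PAGE_RE.search(line)
        if m and pending is not None:
            yield pending, int(m.group(1))
            pending = None
```

The two faults above land on pages 1849502 and 1849504, two pages apart, whereas the Unigine Valley faults in comment 2 were all at page 0.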

I'm using kernel 4.15.

Then when you request a poweroff from ssh the call trace appears again and hangs the system, so you have to do a hard reset.

Comment 5 Emil Velikov 2018-03-28 15:34:24 UTC

Hi Allan, just sharing some ideas - I'm not working on the AMD drivers 

Make sure you're not using libdrm* 2.4.90 - it has some nasty bugs.

Afterwards, try to track down exactly what's causing the problem and a simple way to reproduce it.

Comment 6 Allan 2018-04-01 14:15:35 UTC

TL;DR : I don't have any idea of what is happening. The errors aren't clear, I haven't found a reliable way of reproducing it, and I need help.

That's exactly the problem... I'm going crazy over this.

I've been trying to understand what is happening for weeks...

So... I'll give you a brief(long) description :

I had been running an RX 580. Sometimes the system would freeze like this, and I started to think the card was faulty.

Then I got an RX 480, and I was planning to sell the RX580.

I compiled a kernel with the polaris binaries and etc... It was going very well until a system upgrade.

Then "here we go again" ... same problems... and now it seems like RX 480 fails twice as fast as the RX580 fails.

If you are asking yourself "what kind of failures ?" I'll summarize: code 147, code 146, chrome_dthread libxul.so (for both firefox and chromium), and a big call trace saying amdgpu was blocked for more than 120 seconds. All of this appears after the screen freezes and the keyboard and mouse clicks are ignored; the only thing that still works is moving the mouse cursor.

When it happens? After a few minutes running youtube or unigine valley or some random time (from minutes to several hours) using an opencl task for example.

Then I started to think about the other components...
- RAM ? Checked and running.... if the screen hangs, some ssh tests run fine.
- CPU ? Never had a problem about it as far as I remember. Ssh tests run fine.
- MOBO ? I really don't know. That's why :
---- I had been having some sound crackling, indicating that some power management could be faulty.
---- I noticed that disabling IOMMU decreased the number of crashes significantly... but unfortunately, after updating the BIOS/EFI, the option to enable/disable it was simply removed... I'll be contacting the manufacturer. So I can't confirm that it was the cause.
---- I started to think that something nasty was going on with the power supply.
- POWER SUPPLY ? I bet that it is not
---- I have a 5-year-old Aerocool 80 Plus Silver 800W power supply. It had always been a very good PSU... powering an HD7970GHz (290W TDP) most of the time without a single problem.
---- But okay... maybe the capacitors were faulty (as the mobo manufacturer said when I asked about the sound). So I bought an AX860i, and if there is any better PSU than this in the 800W range, I'd like to know. It is 80 Plus Platinum certified, even though the certification system has not been verified for years (almost irrelevant, to be honest). I had a Corsair HX600 before and it was outstanding, and an AX is better than an HX, so only a Titanium unit that costs more than my mobo and CPU together would be better.
---- Guess what? The same problems. Actually, now, it shuts down sometimes.
- KERNEL ? I was thinking that the problem was 4.15 because it is about 5x more likely to fail. But it also occurs with the very stable 4.13. Maybe I'll try other kernels... but the further back we go with kernel versions, the fewer features we have with amdgpu AFAIK.
---- Also, with the RX480 the video output started to fail when I configure the DisplayPort output for 144Hz. My screen can handle 160Hz with adaptive sync, but that never worked with amdgpu.
---- The DisplayPort/HDMI sound with DC/DAL support in 4.15 is a myth: it NEVER works. If I set amdgpu.dc=1, the RX580 simply produces no sound and the RX480 hangs the system when starting pavucontrol. When forcing the output to HDMI/DP there is no sound either way (but pavucontrol shows that something is supposed to be playing).
---- While running in a tty the chance of crashing is very low. But it happens when running an opencl application after some random time, as said before.
---- When using RX580+1070 or RX480+1070 for vfio I noticed that unbinding the nvidia card extended the amount of working time before crashing. (this was also one reason for me to think that the PSU was faulty)

Now the "best" part : running a single GPU leads to the same problems... :/

I'm not sure about anything right now. I'll try only the 1070 for some time to make sure that amdgpu is the only problem here.

I never touched the amdgpu code but it seems to me that either I sell the cards or I fix it by hand. Because I'm not finding anything related.

Comment 7 bernhardu 2018-04-04 20:49:14 UTC

Just a note, that this might be a similar issue as my #104345:
- both using Ryzen CPUs,
- amdgpu kernel module with Polaris GPUs,
- with unkillable processes
- "GPU fault detected" messages followed by
  "task ... blocked for more than 120 seconds"

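The shared signature listed above (a "GPU fault detected" message followed by a hung-task report) can be checked mechanically when comparing logs between the two bugs. A rough sketch, assuming the kernel's usual "task NAME:PID blocked for more than N seconds" wording; the function name is invented here:

```python
import re

GPU_FAULT = re.compile(r"amdgpu .*GPU fault detected")
HUNG_TASK = re.compile(
    r"task (?P<task>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_after_gpu_fault(lines):
    """Return names of tasks reported hung after the first GPU fault line."""
    seen_fault = False
    hung = []
    for line in lines:
        if GPU_FAULT.search(line):
            seen_fault = True
        elif seen_fault:
            m = HUNG_TASK.search(line)
            if m:
                hung.append(m.group("task"))
    return hung
```

An empty result means the log does not show the pattern in that order, which would hint at a different failure.
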
Comment 8 Allan 2018-04-09 01:19:33 UTC

Switching between package versions slightly changes the amount of time before a crash.

(in debian)
For example, forcing the installation of libdrm-amdgpu1-dbg pulls in older packages along with it, and then it mainly crashes while using something inside a docker container.

Upgrading to the newest one (unstable, testing) always results in a crash, sooner or later.

Now I got this error :

[ 1812.460184] Watchdog[3376]: segfault at 0 ip 00000000f5011ce7 sp 00000000ae0f89d0 error 6 in libcef.so[f1d30000+419c000]
[53310.478516] [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* displayport link status failed
[53310.478547] [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* clock recovery failed
[53310.833870] [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* displayport link status failed
[53310.833900] [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* clock recovery failed

If I had to guess here, I'd guess that something concurrent is going on, because sometimes it works like a charm, and sometimes the system becomes hell itself and is unusable.

Comment 9 Allan 2018-04-09 01:22:52 UTC

bernhardu is correct because in ALL cases it is impossible to kill some processes (the one that caused the hang and anything xorg related).

Maybe something related to the chipset ? (X370?)

I don't know if chipsets other than those for ryzen are having these problems.

Comment 10 Peter 2018-04-11 09:31:31 UTC

Dear All,

I have a similar problem,
Kernel 4.15.16, Xeon CPU E3-1505M, Radeon R9 M295X.

The Laptop runs fine, provided I'm not accessing the GPU.

DRI_PRIME=0 glxinfo | grep "OpenGL renderer"
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics P530 (Skylake GT2) 

is fine. However 
DRI_PRIME=1 glxinfo | grep "OpenGL renderer"

doesn't respond, can't be killed and after a while the Laptop freezes completely.
At the beginning of the 4.15 releases it was working fine.
I even got a significantly higher frame rate in supertuxkart using
amdgpu instead of the intel graphic.

But I don't know what else got updated besides the kernel.

Best regards,
Peter

Comment 11 txtsd 2018-04-27 11:26:06 UTC

This happens to me too. I run a Ryzen 2400G on an MSI B350 Tomahawk.

Comment 12 Allan 2018-04-27 12:41:52 UTC

My system started to power down for no reason sometimes, even using the GTX1070 (nvidia|nouveau).
Then I installed a Windows image just to check whether the kernel was the problem.

Well, for now it *SEEMS* that it isn't *ONLY* the driver/kernel :
- The RX480 was freezing in the same way, then I sent it for warranty.
- The RX580 ran with problems: almost always I got a message like "DX11 : device disconnected" or "Mantle : Device lost".
- The GTX1070 ran fine for 1 day, then it became the same as the RX580, and to my bad luck the system started to power down after a random time (roughly 5 min to 2 hours).

For sure the driver/kernel (amdgpu/linux) has its faults here, and here's why:
- On Windows, the only card that hung the system was the RX480, sometimes, because it was really broken.
- In other cases, when a failure happened (with Nvidia or AMD), the system was able to regain control of the device.
- Maybe doing a soft-reset?
- Maybe just killing the driver and starting again?
- Maybe just by stopping the processes that were using the GPU to avoid a big chain of resulting problems?
- Neither the RX580 nor GTX1070 has dual-bios AFAIK. Maybe RX480, but I did not test it.

Then :
- Revised and changed the PCI-Ex power lines : OK.
- Tested power supply (lucky for me AX860i has a self test) : OK.
- Cleaned all slots with a brush : OK.
- Tested again CPU and RAM : OK.

But I must have very bad luck: the problems persisted.

I've sent the Motherboard for warranty. I'm waiting for its diagnostic and solution.

I'll inform here as soon as it becomes possible.

Thoughts for now :
- Not being able to kill the processes *is* a problem that concerns only amdgpu and it is either a problem of the driver itself (most likely to be) or of the kernel.
- The driver is not capable of retaking control of the device.
- It is impossible to kill children pids when something hung using amdgpu.
- Yes, it occurred once or twice using nvidia proprietary too, but it was probably caused because of the faulty motherboard that I'm waiting to be fixed.
- Using nouveau was the happiest path, but unfortunately nouveau cannot reclock Pascal cards yet. It keeps the card at the min clock (300 or 400MHz) and it is not yet possible to increase the speed of the card. So it is not a workable option.

Comment 13 emmanuel.boudreault 2018-05-02 07:51:57 UTC

I seem to have this same issue when opening an e-mail with a certain picture using emacs. I'm using ArchLinux and Wayland (gnome shell). It is very easy to reproduce so let me know if more logs/debugging can help.

AMD Ryzen 5 1600
Radeon RX 560
Kernel: 4.16.5-1-ARCH
amdgpu 18.0.1-1
mesa 18.0.1-1


These are the drm and amd related dmesg logs:

[    3.516074] amdgpu 0000:20:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    3.516140] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    3.516703] amdgpu 0000:20:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[    3.516707] amdgpu 0000:20:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[    3.516716] [drm] Detected VRAM RAM=4096M, BAR=256M
[    3.516718] [drm] RAM width 128bits GDDR5
[    3.516870] [drm] amdgpu: 4096M of VRAM memory ready
[    3.516873] [drm] amdgpu: 4096M of GTT memory ready.
[    3.516891] [drm] GART: num cpu pages 65536, num gpu pages 65536
[    3.517011] [drm] PCIE GART of 256M enabled (table at 0x000000F400040000).
[    3.517987] [drm] Chained IB support enabled!
[    3.524025] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[    3.525784] [drm] Found VCE firmware Version: 52.4 Binary ID: 3
[    3.597334] amdgpu: [powerplay] 
[    3.597356] amdgpu: [powerplay] 
[    3.606949] [drm] DM_PPLIB: values for Engine clock
[    3.606951] [drm] DM_PPLIB:	 21400
[    3.606952] [drm] DM_PPLIB:	 38700
[    3.606953] [drm] DM_PPLIB:	 84300
[    3.606953] [drm] DM_PPLIB:	 99500
[    3.606954] [drm] DM_PPLIB:	 106200
[    3.606955] [drm] DM_PPLIB:	 110800
[    3.606956] [drm] DM_PPLIB:	 114900
[    3.606956] [drm] DM_PPLIB:	 122600
[    3.606957] [drm] DM_PPLIB: Validation clocks:
[    3.606958] [drm] DM_PPLIB:    engine_max_clock: 122600
[    3.606959] [drm] DM_PPLIB:    memory_max_clock: 150000
[    3.606960] [drm] DM_PPLIB:    level           : 0
[    3.606962] [drm] DM_PPLIB: values for Memory clock
[    3.606963] [drm] DM_PPLIB:	 30000
[    3.606964] [drm] DM_PPLIB:	 62500
[    3.606964] [drm] DM_PPLIB:	 150000
[    3.606965] [drm] DM_PPLIB: Validation clocks:
[    3.606966] [drm] DM_PPLIB:    engine_max_clock: 122600
[    3.606967] [drm] DM_PPLIB:    memory_max_clock: 150000
[    3.606967] [drm] DM_PPLIB:    level           : 0
[    3.617049] [drm] Display Core initialized with v3.1.27!
[    3.642678] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    3.642680] [drm] Driver supports precise vblank timestamp query.
[    3.672521] [drm] UVD and UVD ENC initialized successfully.
[    3.773428] [drm] VCE initialized successfully.
[    4.301585] [drm] fb mappable at 0xE056A000
[    4.301587] [drm] vram apper at 0xE0000000
[    4.301588] [drm] size 11059200
[    4.301589] [drm] fb depth is 24
[    4.301590] [drm]    pitch is 10240
[    4.301702] fbcon: amdgpudrmfb (fb0) is primary device
[    4.358740] amdgpu 0000:20:00.0: fb0: amdgpudrmfb frame buffer device
[    4.371876] [drm] Initialized amdgpu 3.23.0 20150101 for 0000:20:00.0 on minor 0
[   55.222527] amdgpu 0000:20:00.0: GPU fault detected: 147 0x04f04802
[   55.222536] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0050309E
[   55.222540] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048002
[   55.222545] amdgpu 0000:20:00.0: VM fault (0x02, vmid 3) at page 5255326, read from 'TC0' (0x54433000) (72)
[   65.330363] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=701, last emitted seq=704
[   65.330378] [drm] IP block:gfx_v8_0 is hung!
[   65.330425] [drm] GPU recovery disabled.
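
The ring timeout line reports two fence sequence numbers. One way to read it (an interpretation, not something stated in this thread) is that the difference between "last emitted" and "last signaled" is the number of submitted jobs that never completed. A short sketch that extracts the numbers, with key names chosen for this example:

```python
import re

TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Parse an amdgpu 'ring ... timeout' line, or return None if it does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m["signaled"])
    emitted = int(m["emitted"])
    return {
        "ring": m["ring"],
        "signaled": signaled,
        "emitted": emitted,
        # Fences emitted but not yet signaled when the timeout fired.
        "outstanding": emitted - signaled,
    }
```

For the line above this gives ring "gfx" with 3 outstanding fences (704 - 701), about ten seconds after the VM fault at 55.22.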

Comment 14 Koz Ross 2018-05-08 07:38:04 UTC

I also seem to be having similar issues - I have given a full report as bug #106434.

Comment 15 alpapad 2018-07-08 12:11:50 UTC

Maybe similar: Fedora 28 with latest updates. RX 550 and monitor on display port. Kernel running with nopti flag.


When I lock the computer (ctrl-l on gnome) and leave it for 30 minutes, the display will not come back. The monitor wakes up and after a few seconds it goes back to sleep with a no signal message. ctrl-alt-fx does not work.


Something else I noticed: When I work for a long time, sometimes I get a message from the monitor saying there is no signal and it will go to sleep. I cancel the message and continue working, but it seems something funny is happening with the driver.

Note: Before I was using an nvidia quadro with no such problems.



[ 2058.885223] kernel BUG at mm/slub.c:296!
[ 2058.885233] invalid opcode: 0000 [#1] SMP NOPTI
[ 2058.885235] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables fuse ip_set nfnetlink bridge stp llc libcrc32c binfmt_misc smsc47b397 intel_powerclamp coretemp kvm_intel kvm hp_wmi sparse_keymap irqbypass iTCO_wdt iTCO_vendor_support rfkill gpio_ich crct10dif_pclmul crc32_pclmul wmi_bmof ghash_clmulni_intel snd_hda_codec_realtek intel_cstate snd_hda_codec_generic intel_uncore snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core
[ 2058.885285]  snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd lpc_ich soundcore i7core_edac shpchp wmi acpi_cpufreq amdkfd amd_iommu_v2 amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper ttm crc32c_intel firewire_ohci serio_raw drm tg3 firewire_core nvme crc_itu_t nvme_core i2c_dev [last unloaded: ip6_tables]
[ 2058.885314] CPU: 3 PID: 6943 Comm: Xorg Tainted: G          I       4.17.3-200.fc28.x86_64 #1
[ 2058.885316] Hardware name: Hewlett-Packard HP Z400 Workstation/0B4Ch, BIOS 786G3 v03.60 02/24/2016
[ 2058.885324] RIP: 0010:kfree+0x165/0x180
[ 2058.885326] RSP: 0018:ffffacf38341faf0 EFLAGS: 00010246
[ 2058.885329] RAX: ffff96d7ea198c00 RBX: ffff96d7ea198c00 RCX: ffff96d7ea198c00
[ 2058.885332] RDX: 00000000000073a0 RSI: ffff96da572e6160 RDI: ffff96da56c06e80
[ 2058.885336] RBP: ffff96d8ef407200 R08: 0000000000000000 R09: ffffffffc05ffbb8
[ 2058.885339] R10: ffffe7044ea86600 R11: 0000000000000a00 R12: ffffffffc05ffbb8
[ 2058.885342] R13: ffff96d7ea19e000 R14: ffff96d7ea19cc00 R15: ffff96da49111000
[ 2058.885346] FS:  00007f34dc902ac0(0000) GS:ffff96da572c0000(0000) knlGS:0000000000000000
[ 2058.885349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2058.885352] CR2: 00007ff5e4000010 CR3: 0000000492ece005 CR4: 00000000000206e0
[ 2058.885354] Call Trace:
[ 2058.885456]  dc_stream_release+0x28/0x50 [amdgpu]
[ 2058.885535]  dm_update_crtcs_state+0x1be/0x4d0 [amdgpu]
[ 2058.885614]  amdgpu_dm_atomic_check+0x1b1/0x3b0 [amdgpu]
[ 2058.885642]  drm_atomic_check_only+0x360/0x4f0 [drm]
[ 2058.885663]  drm_atomic_commit+0x13/0x50 [drm]
[ 2058.885682]  drm_atomic_connector_commit_dpms+0xdb/0x100 [drm]
[ 2058.885701]  drm_mode_obj_set_property_ioctl+0x178/0x280 [drm]
[ 2058.885721]  ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
[ 2058.885739]  drm_mode_connector_property_set_ioctl+0x39/0x60 [drm]
[ 2058.885756]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[ 2058.885773]  drm_ioctl+0x1b3/0x370 [drm]
[ 2058.885792]  ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
[ 2058.885843]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 2058.885849]  do_vfs_ioctl+0xa4/0x610
[ 2058.885853]  ksys_ioctl+0x60/0x90
[ 2058.885857]  __x64_sys_ioctl+0x16/0x20
[ 2058.885863]  do_syscall_64+0x5b/0x160
[ 2058.885870]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2058.885873] RIP: 0033:0x7f34d9b90e17
[ 2058.885876] RSP: 002b:00007ffd08e499a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2058.885879] RAX: ffffffffffffffda RBX: 0000000002872380 RCX: 00007f34d9b90e17
[ 2058.885882] RDX: 00007ffd08e499e0 RSI: 00000000c01064ab RDI: 000000000000000c
[ 2058.885884] RBP: 00007ffd08e499e0 R08: 0000000000000001 R09: 0000000000000000
[ 2058.885887] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000c01064ab
[ 2058.885889] R13: 000000000000000c R14: 00000000028727a0 R15: 0000000000830c01
[ 2058.885892] Code: 74 05 41 0f b6 72 69 5b 4c 89 d7 5d 41 5c e9 c3 bb f8 ff 48 89 d9 48 89 da 41 b8 01 00 00 00 5b 4c 89 d6 5d 41 5c e9 7b f6 ff ff <0f> 0b 0f 0b 49 8b 42 20 a8 01 75 c1 0f 0b 48 8b 3d 36 73 fa 00 
[ 2058.885937] RIP: kfree+0x165/0x180 RSP: ffffacf38341faf0
[ 2058.885955] ---[ end trace 71e7210e68d99a2b ]---
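
To see at a glance which modules a trace like this passes through, the "Call Trace:" frames can be split into symbol and module. A minimal sketch; frames printed with a leading "?" are marked as unreliable, and the label "kernel" for frames without a module tag is a naming choice made here:

```python
import re

FRAME_RE = re.compile(
    r"\]\s+(?P<guess>\? )?(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-f]+)"
    r"/0x(?P<size>[0-9a-f]+)(?: \[(?P<mod>\w+)\])?\s*$"
)


def parse_frames(lines):
    """Return (symbol, module, reliable) for each frame after 'Call Trace:'."""
    frames = []
    in_trace = False
    for line in lines:
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        m = FRAME_RE.search(line)
        if m is None:
            break  # the first non-frame line ends the trace
        frames.append((m["sym"], m["mod"] or "kernel", m["guess"] is None))
    return frames
```

Applied to the trace above, the first frame is dc_stream_release in amdgpu, and the path runs from the ioctl entry through the drm atomic commit code into amdgpu's display manager, which matches the DPMS (screen blanking) scenario described.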

Comment 16 Suloev Dmitry 2018-07-26 08:20:59 UTC

This issue looks pretty similar to one of mine.
But in addition to this I found a few more bugs in the amdgpu+iommu+drm bundle.

Comment 17 Suloev Dmitry 2018-07-26 08:26:04 UTC

Created attachment 140820
amdgpu timeout with iommu enabled

Comment 18 Suloev Dmitry 2018-07-26 08:27:40 UTC

Created attachment 140821
amdgpu timeout with iommu disabled

Comment 19 Suloev Dmitry 2018-07-26 08:31:03 UTC

Created attachment 140822
startx.log

I can even run X with iommu disabled, but when I start firefox, X hangs.
But gpu_recovery tries to reset the gpu and I can get back to the console.

Comment 20 Suloev Dmitry 2018-07-26 08:32:34 UTC

Created attachment 140823
Memory manager not clean during takedown.

But everything changes with iommu enabled!

Comment 21 Suloev Dmitry 2018-07-26 09:09:41 UTC

Created attachment 140825
amdgpu with dc enabled

And a different traceback with amdgpu.dc enabled.

Comment 22 Suloev Dmitry 2018-07-26 09:19:22 UTC

With both iommu and dc enabled, the system can't even boot.

Comment 23 krutoileshii 2018-08-13 02:24:17 UTC

Similar issues. The most reliable way to replicate for me is to use Dota 2. While it's not 100%, it does reproduce in about 1 out of 4 attempts. This also happens with other apps such as chrome when visiting the school library website, or firefox. The system even hangs right after login occasionally.

Comment 24 krutoileshii 2018-08-13 02:25:15 UTC

Created attachment 141053
dmesg after logging into the system from remote machine.

Comment 25 Andrey Grodzovsky 2018-08-14 19:46:52 UTC

(In reply to Allan from comment #12)
> My system started to power down for nothing sometimes, even using the
> GTX1070 (nvidia|nouveau) .
> Then I installed a Windows image just to be sure if the kernel was the
> problem.
> 
> Well, for now it *SEEMS* that isn't *ONLY* the driver/kernel :
> - The RX480 was freezing in the same way, then I sent it for warranty.
> - RX580 run problematically, almost always I got a message like : "DX11 :
> device disconnected" or "Mantle : Device lost".
> - GTX1070 was running fine for 1 day, then it became the same as the RX580
> and for my bad luck the system started to power down after a random time
> (5min to 2 hours +/-).
> 
> For sure the driver/kernel (amdgpu/linux) has its faults here, and here's
> why:
> - At Windows, the only card that stuck the system was RX480 sometimes
> because it was really broken.
> - In other cases, when a failure happened (with Nvidia or AMD), the system
> was able to retake the control over the device.
>  - Maybe doing a soft-reset?
>  - Maybe just killing the driver and starting again?
>  - Maybe just by stopping the process that were using the GPU to avoid a big
> chain of resulting problems?
> - Neither the RX580 nor GTX1070 has dual-bios AFAIK. Maybe RX480, but I did
> not test it.
> 
> Then :
> - Revised and changed the PCI-Ex power lines : OK.
> - Tested power supply (lucky for me AX860i has a self test) : OK.
> - Cleaned all slots with a brush : OK.
> - Tested again CPU and RAM : OK.
> 
> But , I must be in a very bad luck, the problems persisted.
> 
> I've sent the Motherboard for warranty. I'm waiting for its diagnostic and
> solution.
> 
> I'll inform here as soon as it becomes possible.
> 
> Thoughts for the while :
> - Not being able to kill the processes *is* a problem that concerns only
> amdgpu and it is either a problem of the driver itself (most likely to be)
> or of the kernel.

We recently fixed the issue of not being able to kill a process that, like yours, is stuck in kernel mode waiting for a fence to signal.

Can you build the latest kernel (4.18), grab the latest firmware again, and retry?
Links to kernel and firmware:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ 

> - The driver is not capable of retaking control of the device.
> - It is impossible to kill children pids when something hung using amdgpu.
> - Yes, it occurred once or twice using nvidia proprietary too, but it was
> probably caused because of the faulty motherboard that I'm waiting to be
> fixed.
> - Using nouveau was the most happy path , but unfortunately nouveau does not
> support Pascal at all yet. It keeps the card at the min clock (300 or
> 400MHz) and it is not possible yet to increase the speed of the card. So it
> is not a valid working way.

Comment 26 Allan 2018-08-14 20:32:00 UTC

I will do it as soon as possible, but it may take a while (maybe a month) because my motherboard showed many issues and I'm requesting a refund to buy another one.

Comment 27 Suloev Dmitry 2018-08-15 07:11:00 UTC

Looks like all my problems are fixed in the latest kernel. Thx!

Comment 28 Jan Jurzitza 2018-08-23 19:10:32 UTC

(In reply to Andrey Grodzovsky from comment #25)

The same issue still happens here with both projects built from git. Here is one warning which doesn't seem completely related:
Aug 23 20:41:20 archlinux kernel: ------------[ cut here ]------------
Aug 23 20:41:20 archlinux kernel: CPU update of VM recommended only for large BAR system
Aug 23 20:41:20 archlinux kernel: WARNING: CPU: 5 PID: 1092 at drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2606 amdgpu_vm_init+0x477/0x490 [amdgpu]
Aug 23 20:41:20 archlinux kernel: Modules linked in: bnep nct6775 hwmon_vid joydev btusb btrtl btbcm btintel bluetooth snd_usb_audio snd_usbmidi_lib snd_rawmidi input_leds snd_seq_device ecdh_generic mousedev nls_iso8859_1 nls_cp437 vfat fat btrfs zstd_compress libcrc32c zstd_decompress xxhash xor arc4 amdkfd amd_iommu_v2 amdgpu iwlmvm mac80211 edac_mce_amd led_class kvm_amd iwlwifi snd_hda_codec_realtek chash gpu_sched kvm snd_hda_codec_hdmi snd_hda_codec_generic ttm snd_hda_intel drm_kms_helper irqbypass snd_hda_codec cfg80211 morus1280_avx2 drm morus1280_sse2 morus1280_glue morus640_sse2 morus640_glue snd_hda_core aegis256_aesni aegis128l_aesni aegis128_aesni igb snd_hwdep crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm pcbc snd_timer agpgart evdev ccp sp5100_tco aesni_intel snd syscopyarea i2c_algo_bit sysfillrect
Aug 23 20:41:20 archlinux kernel:  aes_x86_64 wmi_bmof mac_hid crypto_simd sysimgblt raid6_pq cryptd glue_helper fb_sys_fops soundcore k10temp i2c_piix4 dca rfkill rng_core wmi button acpi_cpufreq sch_fq_codel vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sr_mod cdrom sd_mod uas usb_storage hid_uclogic hid_generic usbhid hid ahci libahci xhci_pci libata crc32c_intel xhci_hcd usbcore scsi_mod usb_common
Aug 23 20:41:20 archlinux kernel: CPU: 5 PID: 1092 Comm: Xorg.wrap Tainted: G           O      4.18.0-rc1-5024f8dfe478 #1
Aug 23 20:41:20 archlinux kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Gaming-ITX/ac, BIOS P3.40 11/07/2017
Aug 23 20:41:20 archlinux kernel: RIP: 0010:amdgpu_vm_init+0x477/0x490 [amdgpu]
Aug 23 20:41:20 archlinux kernel: Code: b8 08 d8 ff ff e8 79 89 7c e8 e9 ee fe ff ff 41 89 ef e9 e6 fe ff ff 48 c7 c7 08 65 f0 c0 c6 05 41 af 2b 00 01 e8 a3 8f 37 e8 <0f> 0b 0f b6 8b 60 01 00 00 e9 b4 fc ff ff e8 26 8d 37 e8 66 0f 1f 
Aug 23 20:41:20 archlinux kernel: RSP: 0018:ffffacc2c8df7b60 EFLAGS: 00010286
Aug 23 20:41:20 archlinux kernel: RAX: 0000000000000000 RBX: ffff9b10f7bf9000 RCX: 0000000000000006
Aug 23 20:41:20 archlinux kernel: RDX: 0000000000000007 RSI: 0000000000000002 RDI: ffff9b10fe7564d0
Aug 23 20:41:20 archlinux kernel: RBP: ffff9b10f5640000 R08: 0000001856da5330 R09: 0000000000000036
Aug 23 20:41:20 archlinux kernel: R10: 0000000000000424 R11: 000000000006ad48 R12: ffff9b10f7bf90b8
Aug 23 20:41:20 archlinux kernel: R13: 000000000000000a R14: 0000000000000000 R15: 0000000000000000
Aug 23 20:41:20 archlinux kernel: FS:  00007fcf6cc95500(0000) GS:ffff9b10fe740000(0000) knlGS:0000000000000000
Aug 23 20:41:20 archlinux kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 23 20:41:20 archlinux kernel: CR2: 00007fcf6cb1d960 CR3: 00000007e1190000 CR4: 00000000003406e0
Aug 23 20:41:20 archlinux kernel: Call Trace:
Aug 23 20:41:20 archlinux kernel:  ? ida_simple_get+0x91/0xf0
Aug 23 20:41:20 archlinux kernel:  amdgpu_driver_open_kms+0x83/0x1d0 [amdgpu]
Aug 23 20:41:20 archlinux kernel:  drm_open+0x20b/0x440 [drm]
Aug 23 20:41:20 archlinux kernel:  drm_stub_open+0xaf/0xf0 [drm]
Aug 23 20:41:20 archlinux kernel:  chrdev_open+0xa3/0x1b0
Aug 23 20:41:20 archlinux kernel:  ? cdev_put.part.3+0x20/0x20
Aug 23 20:41:20 archlinux kernel:  do_dentry_open+0x1ab/0x2d0
Aug 23 20:41:20 archlinux kernel:  path_openat+0x31b/0x1440
Aug 23 20:41:20 archlinux kernel:  ? alloc_set_pte+0x1fd/0x4e0
Aug 23 20:41:20 archlinux kernel:  do_filp_open+0x93/0x100
Aug 23 20:41:20 archlinux kernel:  ? __check_object_size+0x9c/0x171
Aug 23 20:41:20 archlinux kernel:  do_sys_open+0x186/0x210
Aug 23 20:41:20 archlinux kernel:  do_syscall_64+0x4e/0x100
Aug 23 20:41:20 archlinux kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 23 20:41:20 archlinux kernel: RIP: 0033:0x7fcf6cbbc452
Aug 23 20:41:20 archlinux kernel: Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 f5 70 0d 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 
Aug 23 20:41:20 archlinux kernel: RSP: 002b:00007ffe9a15b0a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Aug 23 20:41:20 archlinux kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcf6cbbc452
Aug 23 20:41:20 archlinux kernel: RDX: 0000000000000002 RSI: 00007ffe9a15b180 RDI: 00000000ffffff9c
Aug 23 20:41:20 archlinux kernel: RBP: 00007ffe9a15b130 R08: 0000000000000000 R09: 0000000000000000
Aug 23 20:41:20 archlinux kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe9a15b180
Aug 23 20:41:20 archlinux kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Aug 23 20:41:20 archlinux kernel: ---[ end trace eb5bc55fd8b7f883 ]---


and then the issue OP posted too:


Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: GPU fault detected: 147 0x00a60401 for process payday2_release pid 6643 thread amdgpu_cs:0 pid 6644
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06ABF814
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x2B004001
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: VM fault (0x01, vmid 5, pasid 32776) at page 111933460, write from 'TC1' (0x54433100) (4)
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: GPU fault detected: 147 0x00a60401 for process payday2_release pid 6643 thread amdgpu_cs:0 pid 6644
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06ABF814
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x2B004001
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: VM fault (0x01, vmid 5, pasid 32776) at page 111933460, write from 'TC1' (0x54433100) (4)
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: GPU fault detected: 147 0x00a60401 for process payday2_release pid 6643 thread amdgpu_cs:0 pid 6644
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06ABF814
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x23004001
Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: VM fault (0x01, vmid 1, pasid 32776) at page 111933460, write from 'TC1' (0x54433100) (4)
Aug 23 19:42:06 archlinux kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=519868, emitted seq=519871
Aug 23 19:42:06 archlinux kernel: [drm] GPU recovery disabled.
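
For anyone cross-checking the fault lines above: the raw register values and the decoded "VM fault" line carry the same information. What follows is a minimal Python sketch, not driver code. The bit positions (protections in bits 0-7, memory client id in bits 12-19, read/write in bit 24, vmid in bits 25-28) and the 4 KiB page size are assumptions, chosen because they reproduce the values printed in this log; the function name decode_vm_fault is made up for illustration.

```python
# Sketch: decode the fields of the VM fault report shown in the log above.
# ASSUMPTION: this bit layout was picked because it reproduces the values the
# driver printed here; verify it against the driver source before relying on it.

PAGE_SHIFT = 12  # ASSUMPTION: 4 KiB GPU pages


def decode_vm_fault(addr_reg: int, status_reg: int, mc_client: int) -> dict:
    """Split the raw fault registers into the fields shown in the log line."""
    return {
        "page": addr_reg,                               # register value is a page number
        "byte_address": addr_reg << PAGE_SHIFT,
        "protections": status_reg & 0xFF,               # bits 0-7
        "memory_client_id": (status_reg >> 12) & 0xFF,  # bits 12-19
        "write": bool((status_reg >> 24) & 0x1),        # bit 24
        "vmid": (status_reg >> 25) & 0xF,               # bits 25-28
        # The client tag is four ASCII bytes packed into one 32-bit word.
        "client": mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii"),
    }


# Values copied from the first fault in the log above.
fault = decode_vm_fault(0x06ABF814, 0x2B004001, 0x54433100)
print(fault)
```

With the status value of the third fault (0x23004001) the same function yields vmid 1, which matches the log. All three faults hit the same page, so a single bad access is being reported repeatedly rather than three different ones.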


This happens after some time in pretty much any application using Vulkan, and in Core OpenGL applications too. It doesn't happen during normal desktop usage with Chrome.

Happens on 4.18.3 and these traces are from 4.18.0-rc1-5024f8dfe478
X370 chipset (like OP)
RX 480 (same as OP)
Ryzen 7 1700x
Mesa 18.1.6
xorg 1.20.1
i3wm

Comment 29 Andrey Grodzovsky 2018-08-23 19:33:44 UTC

(In reply to Jan Jurzitza from comment #28)
> (In reply to Andrey Grodzovsky from comment #25)
> 
> Still same issue happening here on both projects built from git. One issue
> here which doesn't seem completely related:
> Aug 23 20:41:20 archlinux kernel: ------------[ cut here ]------------
> Aug 23 20:41:20 archlinux kernel: CPU update of VM recommended only for
> large BAR system
> Aug 23 20:41:20 archlinux kernel: WARNING: CPU: 5 PID: 1092 at
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2606 amdgpu_vm_init+0x477/0x490
> [amdgpu]
> ...
> 
> 

This is just a warning; it means you are using the CPU to update the GPU page tables. Any reason why? Try passing the kernel parameter amdgpu.vm_update_mode=0 instead.
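
To confirm which amdgpu options a given boot actually used, the kernel command line can be inspected. This is a minimal Python sketch, assuming the options appear as whitespace-separated amdgpu.<name>=<value> tokens. The sample command line and the helper name amdgpu_options are made up for illustration; on a running system the real string can be read from /proc/cmdline.

```python
# Sketch: list the amdgpu.* options present on a kernel command line.
# ASSUMPTION: options are plain whitespace-separated tokens; values that
# contain quoted spaces are not handled by this simple split.


def amdgpu_options(cmdline: str) -> dict:
    """Return {name: value} for every amdgpu.<name>=<value> token."""
    options = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            options[name] = value
    return options


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro amdgpu.vm_update_mode=0 amdgpu.dc=1"
print(amdgpu_options(sample))
```

If vm_update_mode does not show up at all, the driver default is in effect and the warning should not come from a leftover boot option.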

> and then the issue OP posted too:
> 
> 
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: GPU fault detected:
> 147 0x00a60401 for process payday2_release pid 6643 thread amdgpu_cs:0 pid
> 6644
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:  
> VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06ABF814
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:  
> VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x2B004001
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: VM fault (0x01, vmid
> 5, pasid 32776) at page 111933460, write from 'TC1' (0x54433100) (4)
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: GPU fault detected:
> 147 0x00a60401 for process payday2_release pid 6643 thread amdgpu_cs:0 pid
> 6644
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:  
> VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06ABF814
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:  
> VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x2B004001
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: VM fault (0x01, vmid
> 5, pasid 32776) at page 111933460, write from 'TC1' (0x54433100) (4)
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: GPU fault detected:
> 147 0x00a60401 for process payday2_release pid 6643 thread amdgpu_cs:0 pid
> 6644
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:  
> VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06ABF814
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0:  
> VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x23004001
> Aug 23 19:40:06 archlinux kernel: amdgpu 0000:0d:00.0: VM fault (0x01, vmid
> 1, pasid 32776) at page 111933460, write from 'TC1' (0x54433100) (4)
> Aug 23 19:42:06 archlinux kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> ring gfx timeout, signaled seq=519868, emitted seq=519871
> Aug 23 19:42:06 archlinux kernel: [drm] GPU recovery disabled.
> 
> 
> Happens on pretty much any application using Vulkan after some time or Core
> OpenGL applications too. Doesn't happen on normal desktop usage with Chrome.

So is it specific to Vulkan only?
> 
> Happens on 4.18.3 and these traces are from 4.18.0-rc1-5024f8dfe478
> X370 chipset (like OP)
> RX 480 (same as OP)
> Ryzen 7 1700x
> Mesa 18.1.6
> xorg 1.20.1
> i3wm

Comment 30 Jan Jurzitza 2018-08-24 12:35:46 UTC

(In reply to Andrey Grodzovsky from comment #29)
> > ...
> This is just a warning meaning you use CPU to update GPU page tables, any
> reason why ? try passing kernel  
>  amdgpu.vm_update_mode=0 instead.

Yes, I had been experimenting with kernel flags to try to fix it. I had it set to 0 before and the hang happened then too. I have also tried that variation with both amdgpu.dc=0 and amdgpu.dc=1; the one with vm_update_mode=1 I tried only with amdgpu.dc=0.

> > and then the issue OP posted too:
> > 
> > 
> > ...
> > 
> > 
> > Happens on pretty much any application using Vulkan after some time or Core
> > OpenGL applications too. Doesn't happen on normal desktop usage with Chrome.
> 
> So is it only Vulkan specific ?

No, Core OpenGL apps too. It hasn't happened with legacy OpenGL apps yet (or with any Wine DirectX app, actually; I'm not sure whether they use core or legacy), but of course that doesn't mean it couldn't happen there too.

Comment 31 Jan Jurzitza 2018-08-26 11:18:35 UTC

I have found a workaround (amd patched kernel not required):

cat /sys/class/drm/card0/device/pp_dpm_sclk
# insert appropriate index here, I went for 1077Mhz
echo 3 > /sys/class/drm/card0/device/pp_dpm_sclk

This makes the GPU a bit slower for the session (the clock is pinned to 1077 MHz on my card), but applications no longer freeze the system. It may only be delaying the freeze, but it has run for multiple hours without one so far.

Because of the slowdown I don't think this is a good long-term solution, but maybe it is a hint towards one. What I noticed in radeon-profile is that on auto the card can run at the boost frequency (1266 MHz) by default and is not limited to the base frequency the product page specifies (1120 MHz), so I changed it here and that basically fixed it.

Fixes the issue on kernel 4.18.4

Comment 32 dwagner 2018-08-26 22:02:59 UTC

(In reply to Jan Jurzitza from comment #31)
> I have found a workaround (amd patched kernel not required):
> 
> cat /sys/class/drm/card0/device/pp_dpm_sclk
> # insert appropriate index here, I went for 1077Mhz
> echo 3 > /sys/class/drm/card0/device/pp_dpm_sclk
> 
> Makes the GPU a bit slower (changes clock to 1077 Mhz on my card) for the
> session, but at least applications don't freeze the system anymore now (or
> at least this is delaying it so much that it works for multiple hours, but
> it didn't freeze for me yet)

As long as /sys/class/drm/card0/device/power_dpm_force_performance_level is set to "auto", this write to pp_dpm_sclk won't have a lasting effect, as dynamic power management changes this clock setting all the time.

For the symptoms I reported in bug https://bugs.freedesktop.org/show_bug.cgi?id=102322 I found that actually disabling dynamic power management prevents them from happening, but I do need an 

echo manual >power_dpm_force_performance_level

for this (regardless of what values I write to pp_dpm_sclk and pp_dpm_mclk thereafter).

Caveat: every mode change or re-enabling of a screen will silently discard a previous "manual" setting, so it needs to be re-applied afterwards. This is the subject of bug report https://bugs.freedesktop.org/show_bug.cgi?id=107141
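
The two steps discussed here (switch to "manual" first, then write the clock level) could be scripted roughly as follows. This is only a sketch under stated assumptions: the GPU is card0, the script runs as root, and pp_dpm_sclk lists one level per line in the form "<index>: <clock>Mhz". The helper names parse_sclk_levels, pick_level and pin_sclk are made up for illustration.

```python
# Sketch of the manual-clock workaround discussed in this thread.
# ASSUMPTIONS: card0 is the affected GPU, the caller is root, and
# pp_dpm_sclk lists levels as "<index>: <clock>Mhz" (active one marked "*").
import re
from pathlib import Path

DEVICE = Path("/sys/class/drm/card0/device")  # adjust for your card


def parse_sclk_levels(text: str) -> dict:
    """Map level index -> clock in MHz from the contents of pp_dpm_sclk."""
    levels = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+):\s*(\d+)\s*mhz", line, re.IGNORECASE)
        if match:
            levels[int(match.group(1))] = int(match.group(2))
    return levels


def pick_level(levels: dict, max_mhz: int) -> int:
    """Return the index of the fastest level that does not exceed max_mhz."""
    allowed = {index: mhz for index, mhz in levels.items() if mhz <= max_mhz}
    if not allowed:
        raise ValueError(f"no sclk level at or below {max_mhz} MHz")
    return max(allowed, key=allowed.get)


def pin_sclk(max_mhz: int, device: Path = DEVICE) -> int:
    """Switch DPM to manual, then pin sclk to the chosen level."""
    levels = parse_sclk_levels((device / "pp_dpm_sclk").read_text())
    index = pick_level(levels, max_mhz)
    # Order matters: while the level is "auto" the sclk write does not stick.
    (device / "power_dpm_force_performance_level").write_text("manual\n")
    (device / "pp_dpm_sclk").write_text(f"{index}\n")
    return index
```

Calling pin_sclk(1077) as root would select the fastest level at or below 1077 MHz. Because a mode change resets the "manual" setting, the call would have to be repeated after each one.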

Comment 33 Allan 2018-08-27 21:55:45 UTC

(In reply to Jan Jurzitza from comment #31)
> I have found a workaround (amd patched kernel not required):
> 
> cat /sys/class/drm/card0/device/pp_dpm_sclk
> # insert appropriate index here, I went for 1077Mhz
> echo 3 > /sys/class/drm/card0/device/pp_dpm_sclk
> 
> Makes the GPU a bit slower (changes clock to 1077 Mhz on my card) for the
> session, but at least applications don't freeze the system anymore now (or
> at least this is delaying it so much that it works for multiple hours, but
> it didn't freeze for me yet)
> 
> Though because of the slowdown I don't think this is a good solution
> long-term. Maybe a hint towards a solution though maybe? What I noticed in
> radeon-profile is that on auto it is capable of running at the boost
> frequency (1266 Mhz) and not limited to the base frequency the product page
> specifies (1120 Mhz) by default, so I changed it here and it basically fixed
> it.
> 
> Fixes the issue on kernel 4.18.4

Although I didn't mention it, I tried this too.

It worked for me for a while, mostly when I was running OpenCL code rather than actual 3D rendering.

But it never really worked as a workaround, because it only randomized when the errors occurred.

That is exactly why I didn't mention it before.

That said, I still need to test it on kernel 4.18.

###############################################################################

By the way: it seems the warranty process for my motherboard will take a long time to finish.

I borrowed an old PC from my aunt, and I hope it will be enough to compile the kernel and test the GPU. It is going to be fun compiling a kernel on a 1.6 GHz dual core (1C/2T).

Comment 34 markusraat 2018-08-30 20:01:10 UTC

I have exactly the same problem here. YouTube videos make the system crash randomly, and it also happens without video playback. This bug makes the whole system useless. I have run memtests on the system RAM.

[   10.390343] [drm] amdgpu kernel modesetting enabled.
[   10.401936] fb: switching to amdgpudrmfb from EFI VGA
[   10.402439] amdgpu 0000:01:00.0: enabling device (0106 -> 0107)
[   10.402655] amdgpu 0000:01:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[   10.402658] amdgpu 0000:01:00.0: GTT: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
[   10.402898] [drm] amdgpu: 4096M of VRAM memory ready
[   10.402900] [drm] amdgpu: 4096M of GTT memory ready.
[   10.658069] fbcon: amdgpudrmfb (fb0) is primary device
[   10.710971] amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
[   10.744284] [drm] Initialized amdgpu 3.26.0 20150101 for 0000:01:00.0 on minor 0

[   10.390343] [drm] amdgpu kernel modesetting enabled.
[   10.401936] fb: switching to amdgpudrmfb from EFI VGA
[   10.402567] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xCA).
[   10.402578] [drm] register mmio base: 0xFBE00000
[   10.402579] [drm] register mmio size: 262144
[   10.402585] [drm] probing gen 2 caps for device 8086:2f08 = 77a3103/e
[   10.402587] [drm] probing mlw for device 8086:2f08 = 77a3103
[   10.402589] [drm] add ip block number 0 <vi_common>
[   10.402590] [drm] add ip block number 1 <gmc_v8_0>
[   10.402592] [drm] add ip block number 2 <tonga_ih>
[   10.402593] [drm] add ip block number 3 <powerplay>
[   10.402594] [drm] add ip block number 4 <dm>
[   10.402595] [drm] add ip block number 5 <gfx_v8_0>
[   10.402597] [drm] add ip block number 6 <sdma_v3_0>
[   10.402598] [drm] add ip block number 7 <uvd_v6_0>
[   10.402599] [drm] add ip block number 8 <vce_v3_0>
[   10.402606] [drm] UVD is enabled in physical mode
[   10.402607] [drm] VCE enabled in physical mode
[   10.402648] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[   10.402662] [drm] Detected VRAM RAM=4096M, BAR=256M
[   10.402664] [drm] RAM width 512bits HBM
[   10.402898] [drm] amdgpu: 4096M of VRAM memory ready
[   10.402900] [drm] amdgpu: 4096M of GTT memory ready.
[   10.402906] [drm] GART: num cpu pages 262144, num gpu pages 262144
[   10.402941] [drm] PCIE GART of 1024M enabled (table at 0x000000F400300000).
[   10.403779] [drm] Found UVD firmware Version: 1.87 Family ID: 12
[   10.403784] [drm] UVD ENC is disabled
[   10.404378] [drm] Found VCE firmware Version: 53.20 Binary ID: 3
[   10.466617] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[   10.480097] [drm] Display Core initialized with v3.1.44!
[   10.515673] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[   10.515675] [drm] Driver supports precise vblank timestamp query.
[   10.550650] [drm] UVD initialized successfully.
[   10.650581] [drm] VCE initialized successfully.
[   10.658015] [drm] fb mappable at 0xC098C000
[   10.658017] [drm] vram apper at 0xC0000000
[   10.658018] [drm] size 14745600
[   10.658019] [drm] fb depth is 24
[   10.658020] [drm]    pitch is 10240
[   10.658069] fbcon: amdgpudrmfb (fb0) is primary device
[   10.677347] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 241500
[   10.710971] amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
[   10.744284] [drm] Initialized amdgpu 3.26.0 20150101 for 0000:01:00.0 on minor 0

System:    Host: x99 Kernel: 4.18.5-041805-generic x86_64 bits: 64 gcc: 8.2.0
           Desktop: Gnome 3.28.3 (Gtk 3.22.30-1ubuntu1) Distro: Ubuntu 18.04.1 LTS
Machine:   Device: desktop System: ASUS product: All Series serial: N/A
           Mobo: ASUSTeK model: STRIX X99 GAMING v: Rev 1.xx serial: N/A
           UEFI: American Megatrends v: 1902 date: 03/21/2018
CPU:       18 core Intel Xeon E5-2696 v3 (-MT-MCP-) arch: Haswell rev.2 cache: 46080 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 82945
           clock speeds: max: 3800 MHz 1: 1202 MHz 2: 1202 MHz 3: 1202 MHz 4: 1202 MHz 5: 1203 MHz 6: 1202 MHz
           7: 1203 MHz 8: 1284 MHz 9: 1203 MHz 10: 1202 MHz 11: 1203 MHz 12: 1202 MHz 13: 1202 MHz 14: 1355 MHz
           15: 1202 MHz 16: 1202 MHz 17: 1202 MHz 18: 1203 MHz 19: 1206 MHz 20: 1204 MHz 21: 1204 MHz
           22: 1205 MHz 23: 1204 MHz 24: 1204 MHz 25: 1203 MHz 26: 1324 MHz 27: 1203 MHz 28: 1206 MHz
           29: 1205 MHz 30: 1203 MHz 31: 1204 MHz 32: 1697 MHz 33: 1204 MHz 34: 1204 MHz 35: 1204 MHz
           36: 1202 MHz
Graphics:  Card: Advanced Micro Devices [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] bus-ID: 01:00.0
           Display Server: wayland (X.Org 1.19.6 ) driver: amdgpu Resolution: 2560x1440@59.91hz
           OpenGL: renderer: AMD Radeon R9 Fury Series (FIJI, DRM 3.26.0, 4.18.5-041805-generic, LLVM 6.0.0)
           version: 4.5 Mesa 18.1.7 - padoka PPA Direct Render: Yes
Audio:     Card-1 Advanced Micro Devices [AMD/ATI] Fiji HDMI/DP Audio [Radeon R9 Nano / FURY/FURY X]
           driver: snd_hda_intel bus-ID: 01:00.1
           Card-2 Intel C610/X99 series HD Audio Controller driver: snd_hda_intel bus-ID: 00:1b.0
           Sound: Advanced Linux Sound Architecture v: k4.18.5-041805-generic
Network:   Card: Intel Ethernet Connection (2) I218-V driver: e1000e v: 3.2.6-k port: f000 bus-ID: 00:19.0
           IF: eno1 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:    HDD Total Size: 6251.2GB (22.8% used)
           ID-1: /dev/nvme0n1 model: Samsung_SSD_960_EVO_250GB size: 250.1GB
           ID-2: /dev/sda model: WDC_WD6001FZWX size: 6001.2GB temp: 0C

Comment 35 markusraat 2018-08-31 09:04:59 UTC

It might be that the kernel option acpi=ht (or acpi=off) solves the problem. I need more time to see whether the problem reappears. At least it is worth testing, though it is probably not the final solution for this bug.

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.18.5-041805-generic root=UUID=c3df607f-ac6e-11e8-9f6b-3497f638e103 ro acpi=ht
[    0.000000] Malformed early option 'acpi'

Comment 36 markusraat 2018-09-03 08:46:49 UTC

(In reply to markusraat from comment #35)
> It might be that kernel option apci=ht ( also apci=off ) solve the problem.
> It is taking more time to waiting the possible problem appearance. At least
> it worth of testing. But this is not maybe the final solution for this bug?
> 
> [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.18.5-041805-generic
> root=UUID=c3df607f-ac6e-11e8-9f6b-3497f638e103 ro acpi=ht
> [    0.000000] Malformed early option 'acpi'

Okay, "acpi=off" / "acpi=ht" was a miss.

But changing the GPU PCIe speed in the motherboard BIOS from auto to Gen3 is giving very promising results! I also raised the logging level in the GRUB settings to "loglevel=8", but I haven't been able to reproduce the crash. I will reply if this fails again.

Comment 37 markusraat 2018-09-06 13:06:38 UTC

(In reply to markusraat from comment #36)
> (In reply to markusraat from comment #35)
> > It might be that kernel option apci=ht ( also apci=off ) solve the problem.
> > It is taking more time to waiting the possible problem appearance. At least
> > it worth of testing. But this is not maybe the final solution for this bug?
> > 
> > [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.18.5-041805-generic
> > root=UUID=c3df607f-ac6e-11e8-9f6b-3497f638e103 ro acpi=ht
> > [    0.000000] Malformed early option 'acpi'
> 
> Okay, the "acpi=off" or "acpi=ht" was the miss shot.
> 
> But changing from motherboard bios GPU PCIe speed auto > gen3 is giving very
> promissing results! I also rose logging level from grub settings to
> "loglevel=8" but I haven't got regenerated the crash. I will reply if this
> fails again.

Nope,

Sep  6 16:04:31 x99 org.gnome.Shell.desktop[2332]: [Child 18594, MediaPlayback #2] WARNING: Decoder=7fbcd7976d40 Decode error: NS_ERROR_DOM_MEDIA_FATAL_ERR (0x806e0005) - RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> > mozilla::MediaSourceTrackDemuxer::DoGetSamples(int32_t): manager is detached.: file /build/firefox-oscv9o/firefox-61.0.1+build1/dom/media/MediaDecoderStateMachine.cpp, line 3411
Sep  6 16:04:31 x99 org.gnome.Shell.desktop[2332]: [Child 18594, MediaPlayback #1] WARNING: Decoder=7fbcd7976d40 Decode error: NS_ERROR_DOM_MEDIA_FATAL_ERR (0x806e0005) - RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> > mozilla::MediaSourceTrackDemuxer::DoGetSamples(int32_t): manager is detached.: file /build/firefox-oscv9o/firefox-61.0.1+build1/dom/media/MediaDecoderStateMachine.cpp, line 3411
Sep  6 16:04:31 x99 org.gnome.Shell.desktop[2332]: [Child 18594, MediaPlayback #3] WARNING: Decoder=7fbcd7976d40 Decode error: NS_ERROR_DOM_MEDIA_FATAL_ERR (0x806e0005) - RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> > mozilla::MediaSourceTrackDemuxer::DoGetSamples(int32_t): manager is detached.: file /build/firefox-oscv9o/firefox-61.0.1+build1/dom/media/MediaDecoderStateMachine.cpp, line 3411

Comment 38 Jan Jurzitza 2018-09-14 16:33:16 UTC

So the "manual" freq hack for me fixes games and graphics intensive applications (or at least delays it by more than 15 hours). There are still actual crashes (don't know if it's because of GPU or CPU) that occasionally occur with my setup which happen after simply browsing a lot, especially with lots of SVGs and images, but they used to happen before the manual hack as well and don't seem to be related to this issue.

Comment 39 Allan 2018-10-31 12:49:00 UTC

I can't clone the git repo using the command:
 "git clone git://people.freedesktop.org/~agd5f/linux"

At first I got checksum errors; it turned out the processor was faulty, so I had it replaced under warranty. Now I'm getting:

"
Cloning into 'linux'...
remote: Enumerating objects: 6619592, done.
remote: Counting objects: 100% (6619592/6619592), done.
remote: Compressing objects: 100% (989580/989580), done.
remote: Total 6619592 (delta 5587252), reused 6617842 (delta 5585574)   
Receiving objects: 100% (6619592/6619592), 1.18 GiB | 896.00 KiB/s, done.
Resolving deltas: 100% (5587252/5587252), done.
fatal: did not receive expected object 22906b31d43fbb88c62d2f4b18c5bd2d0e3cebc1
fatal: index-pack failed
"

I get this error even when using:

"git clone -b amd-staging-drm-next --single-branch git://people.freedesktop.org/~agd5f/linux"

Any tips? Am I making a mistake?

Comment 40 John W. 2018-11-04 01:19:18 UTC

Is there any resolution or work being done on this issue?
I've tried the frequency hack and it only slightly delayed the issue.
I also tried the latest AMD staging kernel with the latest firmware and XF86 driver, and the same issue still happened, though somewhat less often. Reading my journalctl logs, I found that sometimes when it occurs the driver attempts to recover, but in the process it loses VRAM and the screen freezes, covered in odd colors.
At least when this occurs the machine is otherwise functional, and I can change TTYs and kill X11.
I'm using a 580 and I've added the relevant logs of the attempted recovery.

Nov 02 15:31:26 Towering-DG kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=59193, emitted seq=59194
Nov 02 15:31:27 Towering-DG kernel: amdgpu 0000:01:00.0: GPU reset begin!
Nov 02 15:31:27 Towering-DG kernel: amdgpu 0000:01:00.0: GPU pci config reset
Nov 02 15:31:27 Towering-DG kernel: amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
Nov 02 15:31:27 Towering-DG kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
Nov 02 15:31:27 Towering-DG kernel: [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
Nov 02 15:31:27 Towering-DG kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)

(Note: Usually it's ring SDMA0 instead of SDMA1 and occasionally GFX)
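
To tally which ring times out most often across a long journal, the timeout lines can be counted with a small script. This is a minimal Python sketch, assuming the message format seen in this thread ("ring <name> timeout, signaled seq=N, emitted seq=M"). The helper name ring_timeouts is made up, and reading "emitted minus signaled" as the number of submitted fences that never signaled is my interpretation, not something the log states.

```python
# Sketch: summarise amdgpu ring timeouts found in kernel log text.
# ASSUMPTION: each timeout message sits on a single line in the format below.
import re
from collections import Counter

TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def ring_timeouts(log_text: str) -> list:
    """Return (ring, pending) for every timeout line in log_text."""
    results = []
    for match in TIMEOUT_RE.finditer(log_text):
        # pending = fences emitted but not yet signaled (interpretation).
        pending = int(match["emitted"]) - int(match["signaled"])
        results.append((match["ring"], pending))
    return results


# Two timeout lines copied from comments in this thread.
sample = """\
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=59193, emitted seq=59194
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=519868, emitted seq=519871
"""
hits = ring_timeouts(sample)
print(hits)
print(Counter(ring for ring, _ in hits))
```

Fed with the text of a full journal, the Counter gives a quick per-ring picture (sdma0, sdma1, gfx) that is easier to compare between kernels than individual log excerpts.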

Comment 41 Philipp 2018-11-16 14:28:16 UTC

I can second much of what John W. says. The crashes have become less frequent with recent firmware/kernel versions, but they still happen.
Also, for me the crashes only started on my Vega 64 when I threw out my ancient Intel CPU and replaced it with an AMD Ryzen 5 1600 on a GR-AB350M-Gaming 3 board.
I've done stability tests on that other OS, so I don't think I've got faulty hardware here.

One of my crash logs:

Nov 16 15:18:29 localhorst kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:7 pasid:32776, for process RocketLeague pid 6347 thread RocketLeag:cs0 pid 6400
                                )
Nov 16 15:18:29 localhorst kernel: amdgpu 0000:08:00.0:   at address 0x0000800319593000 from 27
Nov 16 15:18:29 localhorst kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0070053C
Nov 16 15:18:30 localhorst kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:220 vmid:7 pasid:32776, for process RocketLeague pid 6347 thread RocketLeag:cs0 pid 6400
                                )
Nov 16 15:18:30 localhorst kernel: amdgpu 0000:08:00.0:   at address 0x00008201004e0000 from 27
Nov 16 15:18:30 localhorst kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x007013B8
Nov 16 15:18:40 localhorst kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=38153, emitted seq=38155
Nov 16 15:18:40 localhorst kernel: [drm] GPU recovery disabled.
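
For anyone comparing timeout lines like the one above across reports, here is a minimal Python sketch that pulls out the ring name and the number of fences that were emitted but never signaled. The helper name `parse_ring_timeout` is hypothetical, and the regex assumes only the message layout seen in this log; other kernel versions may format the line differently.

```python
import re

# Layout of the "ring <name> timeout" message as it appears in the logs in this thread.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # pending = fences submitted to the ring that never completed
    return m.group("ring"), signaled, emitted, emitted - signaled

line = ("Nov 16 15:18:40 localhorst kernel: [drm:amdgpu_job_timedout [amdgpu]] "
        "*ERROR* ring gfx timeout, signaled seq=38153, emitted seq=38155")
print(parse_ring_timeout(line))
```

For the line quoted here this reports the gfx ring with two fences outstanding.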

Comment 42 krutoileshii 2018-11-16 14:44:11 UTC

Created attachment 142491 [details]
attachment-17526-0.html

What RAM do you have in the machine? I swapped mine for G.Skill modules and the freezes
are completely gone now.

Comment 43 Philipp 2018-11-17 13:28:18 UTC

I've got 2x8GB DDR4 Corsair Vengeance LPX RAM (oh dear, those names).
I have run a few rounds of memtest without any errors so far, but I'll run a few more hours today when I get the chance.

Did you switch your RAM because of memtest error reports or other concerns?

Comment 44 krutoileshii 2018-11-17 15:36:55 UTC

Created attachment 142496 [details]
attachment-26904-0.html

No, mine passes memtest as well, but I was seeing failures on mprime. I was
also on Corsair Vengeance originally. See if you can borrow a stick from
someone to test.

Comment 45 Kent Ross 2018-11-19 05:08:11 UTC

This happens to me as well. I first noticed it occurring when I had a double-GPU setup, but since then I have completely reinstalled with only the AMD gpu (a Vega 64) and it still happens. The failures are similar to those bernhardu notes. I have not had a failure simply using Chrome and desktop applications yet, but it is typically reproducible between 5 and 60 minutes in a 3D game like Dota 2.

I suspected it might be related to memory stability, but the machine it happens on happily passes both memtest and mprime. The lockup still occurs even when the memory is underclocked by 25% (retaining the same timings and voltage, so that's a full 25% overhead for every command).

I have:

- Intel 7980XE cpu
- Ubuntu Cosmic, linux-image-4.18.0-11-generic
- default amdgpu drivers

I have also tried updated amdgpu packages from ppa:oibaf/graphics-drivers; the failure is the same.

Comment 46 Kent Ross 2018-11-19 05:31:54 UTC

Created attachment 142511 [details]
dmesg logs for failure

Other items of potential relevance:

I have two screens, one at 3840x2160 and one at 2560x1600. When I've experienced this failure (I haven't tried a wide variety of applications) it is with games that do not have exclusive control of the screen, running in the desktop compositor.

The second screen also freezes, but other applications that are running on the other screen -- such as a Chrome window playing streaming video -- will have their audio continue uninterrupted.

Comment 47 Allan 2018-11-20 14:15:24 UTC

I have really bad news.

It took me a long time to answer because I literally sent in for warranty or replaced ALL of the components in my PC.

The CPU (R7 1800X), from batch 21, was replaced by AMD itself with a new one from batch 35.

But OK, let's talk about the amdgpu :

(In reply to Andrey Grodzovsky from comment #25)
> (In reply to Allan from comment #12)
> Can you build latest kernel (4.18) and grab again latest firmware and try
> again ?
> Links to kernel and firmware:
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ 

For reasons already explained here I could neither compile nor test it before, so please don't be mad at me :
- Sold my old PC.
- My notebook was completely filled with files.
- Components on warranty. Testing everything else.

So I managed to borrow a PC to test the video cards. So far I have tested only the nvidia one, to prove to AMD that the GPU is working and that it is the pci-controller of the CPU/chipset (a guess of mine) that is broken. I am going to test the RX480 on this PC as soon as possible. My warranties are expiring and I had to set priorities.

I already said it here but, with the 1800X I couldn't even clone the git repository (the checksum always fails, tried many times).

Then I managed to free some space on my notebook and started to build yesterday.
- Included amd-ucode firmware.
- Included polaris10 firmware (for RX480).
- Made some optimizations for Ryzen as described on Gentoo's dedicated page.

Compiled, version 4.20-rc1 as present in the branch. No errors reported.

There are 2 main applications that are easier to test right now to find the problems :
- Metro 2033 Redux through steam.
- Left 4 Dead 2 through steam.

Started Metro 2033, worked for some minutes with no issue, but it was for some reason without any sound. Closed. Turned off the HDMI audio on pavucontrol to use only the default output. Restarted steam.

Started Left 4 Dead 2 this time. Was able to change graphics settings to max without AA and vsync. Played for 15 seconds and got a screen freeze. Waited for a script to properly record the logs and temps. Hard rebooted. This time even my BIOS/EFI screen had a green background, but was still operational. Everything was green except the text. Rebooted again, got back to normal colors.

And here are the logs :

kern.log about Firefox usage :
> Nov 14 05:26:50 desk kernel: [  324.714998] Chrome_~dThread[1788]: segfault at 0 ip 00007fbfee5e3181 sp 00007fbfec2d1ad0 error 6 in libxul.so[7fbfee5cf000+3a2c000]

It suggests that the CPU still either has problematic microcode or is defective.

dmesg about amdgpu screen freeze :
> [ 3323.920795] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000080c for process hl2_linux pid 14648 thread amdgpu_cs:0 pid 14653
> [ 3323.920799] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
> [ 3323.920801] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200800C
> [ 3323.920804] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 1, pasid 32774) at page 0, read from 'TC0' (0x54433000) (8)
> [ 3334.103233] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=274140, emitted seq=274142
> [ 3334.103239] amdgpu 0000:09:00.0: GPU reset begin!
> [ 3344.332607] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:46:crtc-0] hw_done or flip_done timed out
> [ 3504.834097] INFO: task kworker/u32:2:3872 blocked for more than 120 seconds.
> [ 3504.834103]       Not tainted 4.20.0-rc1-amd #2
> [ 3504.834105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3504.834107] kworker/u32:2   D    0  3872      2 0x80000000
> [ 3504.834123] Workqueue: events_unbound commit_work [drm_kms_helper]
> [ 3504.834126] Call Trace:
> [ 3504.834133]  ? __schedule+0x2a0/0x880
> [ 3504.834136]  schedule+0x28/0x80
> [ 3504.834139]  schedule_timeout+0x25d/0x380
> [ 3504.834217]  ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu]
> [ 3504.834292]  ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
> [ 3504.834297]  dma_fence_default_wait+0x23b/0x2a0
> [ 3504.834301]  ? dma_fence_release+0x90/0x90
> [ 3504.834304]  dma_fence_wait_timeout+0xdd/0x100
> [ 3504.834308]  reservation_object_wait_timeout_rcu+0x161/0x270
> [ 3504.834387]  amdgpu_dm_do_flip+0x112/0x370 [amdgpu]
> [ 3504.834468]  amdgpu_dm_atomic_commit_tail+0x68b/0xcd0 [amdgpu]
> [ 3504.834472]  ? __switch_to_asm+0x40/0x70
> [ 3504.834475]  ? wait_for_completion_timeout+0x3b/0x1a0
> [ 3504.834477]  ? __switch_to_asm+0x34/0x70
> [ 3504.834480]  ? __switch_to_asm+0x40/0x70
> [ 3504.834483]  ? __switch_to+0x1ba/0x450
> [ 3504.834492]  commit_tail+0x3d/0x70 [drm_kms_helper]
> [ 3504.834497]  process_one_work+0x1aa/0x3a0
> [ 3504.834500]  worker_thread+0x30/0x3a0
> [ 3504.834503]  ? drain_workqueue+0x130/0x130
> [ 3504.834506]  kthread+0x11d/0x140
> [ 3504.834509]  ? kthread_park+0x80/0x80
> [ 3504.834512]  ret_from_fork+0x22/0x40
> [ 3516.645267] WARNING: CPU: 14 PID: 14694 at kernel/kthread.c:501 kthread_park+0x6c/0x80
> [ 3516.645271] Modules linked in: fuse edac_mce_amd kvm_amd nls_ascii nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec joydev amdgpu snd_hda_core snd_hwdep chash gpu_sched snd_pcm snd_timer ttm drm_kms_helper snd drm i2c_algo_bit sp5100_tco soundcore kvm efi_pstore efivars sg irqbypass evdev wmi_bmof serio_raw pcspkr k10temp ccp tpm_crb pcc_cpufreq tpm_tis tpm_tis_core tpm rng_core acpi_cpufreq button parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto btrfs xor zstd_decompress zstd_compress xxhash raid6_pq libcrc32c crc32c_generic algif_skcipher af_alg dm_crypt dm_mod sd_mod hid_generic usbhid hid uas usb_storage crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ahci xhci_pci aes_x86_64 libahci crypto_simd xhci_hcd cryptd glue_helper libata r8169 i2c_piix4 libphy usbcore scsi_mod thermal wmi gpio_amdpt gpio_generic
> [ 3516.645324] CPU: 14 PID: 14694 Comm: TaskSchedulerFo Not tainted 4.20.0-rc1-amd #2
> [ 3516.645327] Hardware name: BIOSTAR Group X370GT7/X370GT7, BIOS 5.13 08/07/2018
> [ 3516.645330] RIP: 0010:kthread_park+0x6c/0x80
> [ 3516.645333] Code: 18 e8 88 6c 67 00 be 40 00 00 00 48 89 df e8 8b c3 00 00 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 80 00 00 00 00 0f 1f
> [ 3516.645335] RSP: 0018:ffffbafdc3fcfb60 EFLAGS: 00010202
> [ 3516.645338] RAX: 0000000000000004 RBX: ffff9dcd93f140c0 RCX: dead000000000200
> [ 3516.645339] RDX: ffff9dcd92ba7430 RSI: ffff9dcd93f140c0 RDI: ffff9dcd8a9049c0
> [ 3516.645341] RBP: ffff9dcd940a5360 R08: ffff9dcd96da25a8 R09: 0000000000000000
> [ 3516.645343] R10: 0000000000000000 R11: 000000000000019c R12: ffff9dcd92ba27a0
> [ 3516.645344] R13: ffff9dcd76d34200 R14: 0000000000000206 R15: dead000000000100
> [ 3516.645347] FS:  00007efea483e700(0000) GS:ffff9dcd96d80000(0000) knlGS:0000000000000000
> [ 3516.645349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3516.645351] CR2: 00005654fe725e10 CR3: 0000000200d40000 CR4: 00000000003406e0
> [ 3516.645352] Call Trace:
> [ 3516.645362]  drm_sched_entity_fini+0x37/0x190 [gpu_sched]
> [ 3516.645423]  amdgpu_vm_fini+0xad/0x530 [amdgpu]
> [ 3516.645429]  ? idr_destroy+0x78/0xc0
> [ 3516.645481]  amdgpu_driver_postclose_kms+0x151/0x270 [amdgpu]
> [ 3516.645496]  drm_file_free.part.5+0x21f/0x300 [drm]
> [ 3516.645510]  drm_release+0xaa/0x120 [drm]
> [ 3516.645514]  __fput+0xac/0x1e0
> [ 3516.645518]  task_work_run+0x8f/0xb0
> [ 3516.645522]  do_exit+0x2e6/0xb30
> [ 3516.645525]  do_group_exit+0x3a/0xb0
> [ 3516.645528]  get_signal+0x27a/0x5f0
> [ 3516.645532]  do_signal+0x30/0x6d0
> [ 3516.645537]  exit_to_usermode_loop+0x89/0xf0
> [ 3516.645540]  do_syscall_64+0xda/0xe0
> [ 3516.645544]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 3516.645547] RIP: 0033:0x7efeb6b9d19a
> [ 3516.645553] Code: Bad RIP value.
> [ 3516.645555] RSP: 002b:00007efea483d810 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> [ 3516.645557] RAX: fffffffffffffdfc RBX: 00007efea483d958 RCX: 00007efeb6b9d19a
> [ 3516.645559] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007efea483d980
> [ 3516.645560] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffe661d7080
> [ 3516.645562] R10: 00007efea483d860 R11: 0000000000000246 R12: 0000000000000000
> [ 3516.645564] R13: 00007efea483d980 R14: 00007efea483d990 R15: 00007efea483d930
> [ 3516.645566] ---[ end trace 7da35ac4aa65c90d ]---

It is important to note that the most common code that appears while using generic kernels is 147, rather than the 146 shown here.
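
To check which fault code actually dominates over a longer log, the "GPU fault detected" lines can be tallied. The sketch below is only an illustration in Python: the function names are hypothetical, and the regexes assume the exact layout of the dmesg lines quoted above. It also shows that the raw client value printed in the VM fault line is just the ASCII tag next to it.

```python
import re
from collections import Counter

# Layouts copied from the dmesg lines quoted above; other kernels may differ.
FAULT_RE = re.compile(r"GPU fault detected: (\d+) 0x([0-9a-fA-F]+)")
CLIENT_RE = re.compile(r"from '([^']*)' \(0x([0-9a-fA-F]{8})\)")

def count_fault_codes(lines):
    """Count how often each 'GPU fault detected' code occurs."""
    codes = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            codes[int(m.group(1))] += 1
    return codes

def decode_client(raw_hex):
    """Turn the raw 32-bit client value into its ASCII tag (trailing NULs dropped)."""
    return bytes.fromhex(raw_hex).rstrip(b"\x00").decode("ascii")

log = [
    "[ 3323.920795] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000080c "
    "for process hl2_linux pid 14648 thread amdgpu_cs:0 pid 14653",
    "[ 3323.920804] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 1, pasid 32774) "
    "at page 0, read from 'TC0' (0x54433000) (8)",
]
print(count_fault_codes(log))
client = CLIENT_RE.search(log[1])
print(client.group(1), decode_client(client.group(2)))
```

Feeding it a full kern.log (one line per list entry) would give the 146 versus 147 counts directly.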

Xorg.0.log reports nothing.

I said this was bad news because it seems to me that both the CPU and the amdgpu driver are defective.

I noticed that while running kernel 4.18 the gpu is kept at 100% (mclk and sclk) all the time while with this new kernel the GPU tries to scale the performance.

Also, it is important to note that the nvidia GTX 1070 throws a lot of xid error codes ( see https://devtalk.nvidia.com/default/topic/1043483/linux/xid-errors-on-gtx-1070-linux/post/5293440 ). This is why I think the 1800X has a defective pci-controller, and it is also the second part of the "really bad news". Maybe it is happening mostly with ryzen processors? I'll test the RX480 with the other computer ASAP; I need to send information about the CPU to AMD to proceed with the warranty process.

The GTX 1070 works without a single problem outside of this PC. The other cards that I tested before follow the same pattern ( 2 RX480, 1 RX 580, 1 GTX 970, 1 GTX 1070).

Currently I have only 1 RX480 and 1 GTX 1070. Now that I know the cards don't have any problem I'm selling them, and soon I'll have only one or none. The seller told me off for requesting warranty on the RX 480 when I thought it was defective; he sent me a different one, and according to him the one I sent back was working without any issues.

I'm already in a new round of sending the CPU back to AMD, and praying for an end to my endless torment. I think that they'll have to refund me (and then I'll have a loss with the motherboard).

Please tell me any other steps that you would like me to try.

I can also provide a full description of the kernel compilation (parameters) and even provide a link to the generated .deb packages.

Comment 48 Allan 2018-11-20 15:18:55 UTC

Damn, ignore the kern.log report, it is outdated.

Comment 49 Allan 2018-11-20 20:32:41 UTC

Haha, I knew I could count on a genuine, up-to-date, valid segfault (thanks to ryzen) :
> /var/log/kern.log:Nov 20 13:42:20 desk kernel: [ 9940.857175] Cameras IPC[1957]: segfault at 0 ip 000055ea219b1cc2 sp 00007f390e1fe8b0 error 6
> /var/log/kern.log:Nov 20 13:42:20 desk kernel: [ 9940.857184] Cameras IPC[2999]: segfault at 0 ip 00005651d2cf7cc2 sp 00007f95f6fb48b0 error 6
> /var/log/kern.log:Nov 20 13:42:20 desk kernel: [ 9940.857232] Chrome_~dThread[1809]: segfault at 0 ip 00007f3942529181 sp 00007f3940217ad0 error 6 in libxul.so[7f3942515000+3a2c000]
> /var/log/kern.log:Nov 20 13:42:20 desk kernel: [ 9940.857264] Chrome_~dThread[2448]: segfault at 0 ip 00007f963661a181 sp 00007f9634308ad0 error 6 in libxul.so[7f9636606000+3a2c000]
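
For reading these lines: the number after "error" is the x86 page-fault error code, a small bit field. The Python sketch below (the function name `decode_segfault_error` is hypothetical) decodes the low bits as defined by the x86 architecture. It only describes how the process faulted; it does not say whether the cause is software or hardware.

```python
def decode_segfault_error(code):
    """Decode the low bits of an x86 page-fault error code from a kernel segfault line."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read, 1 = write
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
        # bit 4: fault happened on an instruction fetch
        "instruction_fetch": bool(code & 0x10),
    }

print(decode_segfault_error(6))
```

So "segfault at 0 ... error 6" is a user-mode write to an unmapped page at address 0, i.e. a write through a null pointer.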

Comment 50 russianneuromancer 2018-11-21 00:52:03 UTC

>  And this is why I'm thinking that the 1800X has a defective pci-controller. And it is also the second part of the "really bad news". Maybe it is happening mostly with ryzen processors?

Can't tell you about RX480, but I know for sure that at least Vega 64 is totally fine with the 1800X PCI-controller: not a single unsolvable graphics-related issue for a year (so far all issues I had were solved by upgrading kernel and/or Mesa).

Comment 51 Allan 2018-11-22 18:47:58 UTC

Tried to install the RX480 on the other PC : the card is so big that it touches the RAM slots' tabs. Can't install it.

By the way, it seems the errors take a little longer to appear when setting randomize_va_space=0. I was testing it for the CPU and noticed that amdgpu took longer to fail, but it still failed.
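
For anyone repeating this test, the current setting can be confirmed by reading /proc/sys/kernel/randomize_va_space. This is a minimal Python sketch; the helper name `describe_aslr` is hypothetical, and the mode descriptions follow the kernel's sysctl documentation.

```python
from pathlib import Path

ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Meanings as given in the kernel sysctl documentation.
ASLR_MODES = {
    0: "address space randomization disabled",
    1: "mmap base, stack and VDSO randomized",
    2: "mode 1 plus heap (brk) randomization",
}

def describe_aslr(raw):
    """Parse the sysctl file content and return (value, description)."""
    value = int(raw.strip())
    return value, ASLR_MODES.get(value, "unknown mode")

# Only read the live value where the file exists (i.e. on Linux).
if ASLR_PATH.exists():
    print(describe_aslr(ASLR_PATH.read_text()))
```

A result of 0 confirms that the randomize_va_space=0 setting is actually in effect for the running kernel.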

What happened :
- the screen got granulated with pinkish colors as usual
 - desktop extended this behavior
- but I could operate the system
- tty was black and white (normal)
- I could restart x server
- colors got normal after restarting
- tried the same application again
- crashed and froze the system

Main difference : 
- now sometimes I can kill the tasks/restart xserver

I registered the times of each event, here follows:

(Firefox was open in the background while I tried to play Left 4 Dead 2 through steam)

1. Recoverable delay with granulated colors (l4d2 begins 11:48, occurs 11:50 after some delay while loading the game menu)
> [Thu Nov 22 11:48:03 2018] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=11477, emitted seq=11480
> [Thu Nov 22 11:48:03 2018] amdgpu 0000:09:00.0: GPU reset begin!
> [Thu Nov 22 11:48:03 2018] amdgpu 0000:09:00.0: GPU pci config reset
> [Thu Nov 22 11:48:03 2018] amdgpu 0000:09:00.0: GPU reset succeeded, trying to resume
> [Thu Nov 22 11:48:03 2018] [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
> [Thu Nov 22 11:48:03 2018] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
> [Thu Nov 22 11:48:04 2018] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
> [Thu Nov 22 11:48:04 2018] [drm] UVD and UVD ENC initialized successfully.
> [Thu Nov 22 11:48:04 2018] [drm] VCE initialized successfully.
> [Thu Nov 22 11:48:04 2018] [drm] recover vram bo from shadow start
> [Thu Nov 22 11:48:04 2018] [drm] recover vram bo from shadow done
> [Thu Nov 22 11:48:04 2018] [drm] Skip scheduling IBs!
> [Thu Nov 22 11:48:04 2018] [drm] Skip scheduling IBs!
> [Thu Nov 22 11:48:04 2018] amdgpu 0000:09:00.0: GPU reset(1) succeeded!
> [Thu Nov 22 11:48:04 2018] [drm] Skip scheduling IBs!
> [Thu Nov 22 11:48:04 2018] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [Thu Nov 22 11:48:04 2018] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [Thu Nov 22 11:48:04 2018] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [Thu Nov 22 11:48:04 2018] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [Thu Nov 22 11:48:06 2018] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [Thu Nov 22 11:50:46 2018] show_signal_msg: 9 callbacks suppressed
> [Thu Nov 22 11:50:46 2018] Chrome_~dThread[1734]: segfault at 0 ip 00007f7926c4c181 sp 00007f792493aad0 error 6 in libxul.so[7f7926c38000+3a2c000]
> [Thu Nov 22 11:50:46 2018] Code: 15 dc f2 5f 04 48 89 10 c7 04 25 00 00 00 00 7c 09 00 00 e8 21 60 ff ff 90 48 8b 05 f9 2a 9b 05 48 8d 0d 22 f3 5f 04 48 89 08 <c7> 04 25 00 00 00 00 02 0a 00 00 e8 ff 5f ff ff e8 0a f3 ff ff 48
> [Thu Nov 22 11:50:46 2018] Chrome_~dThread[1885]: segfault at 0 ip 00007f7fa150a181 sp 00007f7f9f1f8ad0 error 6 in libxul.so[7f7fa14f6000+3a2c000]
> [Thu Nov 22 11:50:46 2018] Chrome_~dThread[8072]: segfault at 0 ip 00007fffededa181 sp 00007fffebbc8ad0 error 6
> [Thu Nov 22 11:50:46 2018] Code: 15 dc f2 5f 04 48 89 10 c7 04 25 00 00 00 00 7c 09 00 00 e8 21 60 ff ff 90 48 8b 05 f9 2a 9b 05 48 8d 0d 22 f3 5f 04 48 89 08 <c7> 04 25 00 00 00 00 02 0a 00 00 e8 ff 5f ff ff e8 0a f3 ff ff 48
> [Thu Nov 22 11:50:46 2018]  in libxul.so[7fffedec6000+3a2c000]
> [Thu Nov 22 11:50:46 2018] Code: 15 dc f2 5f 04 48 89 10 c7 04 25 00 00 00 00 7c 09 00 00 e8 21 60 ff ff 90 48 8b 05 f9 2a 9b 05 48 8d 0d 22 f3 5f 04 48 89 08 <c7> 04 25 00 00 00 00 02 0a 00 00 e8 ff 5f ff ff e8 0a f3 ff ff 48
> [Thu Nov 22 11:50:46 2018] Chrome_~dThread[1931]: segfault at 0 ip 00007f8dc581f181 sp 00007f8dc350dad0 error 6 in libxul.so[7f8dc580b000+3a2c000]
> [Thu Nov 22 11:50:46 2018] Code: 15 dc f2 5f 04 48 89 10 c7 04 25 00 00 00 00 7c 09 00 00 e8 21 60 ff ff 90 48 8b 05 f9 2a 9b 05 48 8d 0d 22 f3 5f 04 48 89 08 <c7> 04 25 00 00 00 00 02 0a 00 00 e8 ff 5f ff ff e8 0a f3 ff ff 48
kern.log = dmesg

2. Unrecoverable crash (l4d2 begins 12:00, goes well until 12:55 when crashes everything)
dmesg:
> [Thu Nov 22 12:55:04 2018] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1688198, emitted seq=1688200
> [Thu Nov 22 12:55:04 2018] amdgpu 0000:09:00.0: GPU reset begin!
> [Thu Nov 22 12:55:14 2018] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:46:crtc-0] hw_done or flip_done timed out
kern.log = dmesg
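
The gap between the reset starting and the display commit timing out in the second event can be read off the human-readable timestamps. A minimal Python sketch follows; the helper name `stamp` is hypothetical, and the format string assumes the timestamp layout shown in these logs with English day and month names.

```python
import re
from datetime import datetime

# Matches the bracketed human-readable timestamp at the start of each message.
STAMP_RE = re.compile(r"\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\]")

def stamp(line):
    """Extract the first bracketed timestamp from a log line as a datetime."""
    m = STAMP_RE.search(line)
    if m is None:
        raise ValueError("no timestamp in line")
    return datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")

reset_begin = "> [Thu Nov 22 12:55:04 2018] amdgpu 0000:09:00.0: GPU reset begin!"
flip_timeout = ("> [Thu Nov 22 12:55:14 2018] [drm:amdgpu_dm_atomic_check [amdgpu]] "
                "*ERROR* [CRTC:46:crtc-0] hw_done or flip_done timed out")

gap = stamp(flip_timeout) - stamp(reset_begin)
print(gap.total_seconds())
```

For the two lines above the gap is 10 seconds, and no "GPU reset succeeded" message appears in between, unlike in the first, recoverable event.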

Xorg log is not reporting anything useful.


(In reply to russianneuromancer from comment #50)
> Can't tell you about RX480, but I know for sure that at least Vega 64 is
> totally fine with 1800X PCI-controller, no single not-solvable
> graphics-related issue for a year (so far all issues I had was solved by
> upgrading kernel and/or Mesa).

I wish I had this luck.

Comment 52 fin4478 2018-11-22 19:27:18 UTC

To prevent random kernel lockups with Ryzen, fix this in the BIOS: select Typical Current Idle in the Advanced/AMD CBS menu.

Use the latest AMD wip kernel and Oibaf ppa Mesa. Disable display compositing and vsync with Xfce. Use a 300Hz kernel timer.

Working kernel config file for my system as attachment.

Comment 53 fin4478 2018-11-22 19:28:55 UTC

Created attachment 142573 [details]
AMD wip kernel config with 1000Hz timer

Comment 54 OliverHB 2019-01-15 14:18:27 UTC

Did anyone ever try switching to a text console (CTRL-ALT-F[1-6]) and back (usually CTRL-ALT-F7) to the graphical screen? That does the trick for me! However, I wouldn't mind if there is a solution which makes that obsolete...

Comment 55 las 2019-01-17 22:25:19 UTC

I have a very similar problem, although one difference is that my entire screen becomes one single color, which doesn't seem to be entirely random. Sometimes it is grey, other times a blueish tint, but never colors like black.
In addition, num lock etc. are still responsive for a small while, although it seems that a delay in the response time is added rapidly each second, very soon seeming completely unresponsive.

My system:
CPU: AMD Ryzen 5 1600
GPU: Sapphire NITRO+ RX 580 4 GB
Motherboard: ASUS ROG STRIX X470-F
Kernel: 4.20.1
Distribution: NixOS
WM: Sway or i3, happens in both

I am using DVI-D, if that is at all relevant.

Oddly enough, even though the symptoms have stayed exactly the same the entire time, the error messages I get vary widely. At one point I was getting the "GPU fault detected" errors, at other times it would say that an sdma0 ring or gfx ring timed out, and now I have no errors at all when it happens, which seems to have started after I switched from an HDMI display to a DVI-D display (it also seems to have become much more infrequent, oddly enough?).
Another interesting thing is that when I was using 4.18.12 or lower, I could avoid this problem entirely by flipping my VBIOS switch away from the IO ports.
In addition, when it starts happening, if I reboot my system by just turning it off by holding down the power button and then turning it on normally, it will happen soon again after launching my WM. This is seemingly avoidable by completely disconnecting it from power, e.g. by turning my PSU off.

This might actually be a completely unrelated bug, but the symptoms seem to fit enough, that it could be the same bug.
It could perhaps also be a hardware bug, since it is very odd that the errors I get change, or maybe it is multiple bugs that seem to be the same? In addition, I can't find a definite way to reproduce my issue instantly other than just waiting for it to happen, although of course graphics intensive work does accelerate it considerably.

Comment 56 las 2019-01-17 22:32:08 UTC

Also, forgot to mention, but the new GPU recovery thing doesn't work, and it would make the following error in dmesg:

jan 16 16:43:26 las kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2792, emitted seq=2795
jan 16 16:43:26 las kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
jan 16 16:43:26 las kernel: amdgpu 0000:08:00.0: GPU reset begin!
jan 16 16:43:26 las kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=7013, emitted seq=7015
jan 16 16:43:26 las kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sway pid 1328 thread sway:cs0 pid 1329
jan 16 16:43:26 las kernel: amdgpu 0000:08:00.0: GPU reset begin!
jan 16 16:43:26 las kernel: amdgpu: [powerplay] 
                             failed to send message 261 ret is 0 
jan 16 16:43:27 las kernel: amdgpu: [powerplay] 
                             last message was failed ret is 0
jan 16 16:43:28 las kernel: amdgpu: [powerplay] 
                             failed to send message 261 ret is 0 

(Fetched through `journalctl -ekb<boot ID>`)

This however stopped once I switched to DVI-D, since I now get no errors at all.

Comment 57 las 2019-01-23 19:44:24 UTC

Created attachment 143206 [details]
I get these errors when attempting to boot after a normal GPU hang and KMS happens

Recently I've been getting another type of hang somehow. After a normal hang happens, where my screen gets garbled output, I can't even get past KMS in the next couple of boots. I can fix this by flipping my VBIOS switch, which heavily leads me to believe that amdgpu somehow corrupts the GPU's firmware. I have attached the error I get when KMS happens at boot, which happens after I get a hang while using the system normally. The monitor doesn't display anything when this happens, but I can still control caps lock, etc., however I can not shut it off normally. I have not tested whether SSH and such still work.

This honestly makes me doubt whether what I am experiencing is the same bug; is it simply a faulty GPU? I am using a Sapphire RX 580 4GB, which I bought used from a windows user. It *did* work for him, so obviously it isn't entirely broken at least.

Comment 58 krutoileshii 2019-01-23 19:46:34 UTC

Created attachment 143207 [details]
attachment-12630-0.html

The corruption suggestion is interesting. My RX580 does this and now it
won't even boot on windows anymore, just crashes.

Comment 59 dwagner 2019-01-23 20:23:47 UTC

I don't think your observations indicate a hardware defect.

I have also a reproducible "hysteresis"-effect with regards to my RX460 GPU: When I experience the bug scenario I reported in https://bugs.freedesktop.org/show_bug.cgi?id=102322 and then reboot by pressing the reset-button, the BIOS greeting and the GRUB loader are consistently not shown (just a black screen, but with the connected TV indicating a valid HDMI signal), only once Linux sets the console video mode during boot, then the screen lights up again. If at that point, or at any time thereafter I reboot either by typing "reboot" or by pressing the RESET button, then the BIOS greeting and GRUB menu are shown as normal.

I think this is just due to some lack of thorough initialization upon reset, because if I power down the machine by switching off the power supply, and then reboot, the BIOS and GRUB menu always come up. Seems to me that pressing the RESET button just isn't resetting as much as an actual power down does.

Comment 60 las 2019-01-23 21:05:48 UTC

dwagner, my problem persists even if I completely power the system down after shutting it down by holding down the power button and then turning the PSU completely off. I have not tried shutting it off only using the PSU fearing damage to my hardware, although I will try that the next time at least once.

Also, it is reassuring to see that I am not the only one experiencing such odd behavior, but could it not be that we simply all use faulty hardware?

Comment 61 Zheng Luo 2019-01-25 07:36:40 UTC

I experienced similar problems, but mine are much worse. I can't recover from the black screen after a reboot/hard reset unless I drain the builtin battery. However, this problem disappears in 5.0rc3 (in contrast to the buggy 4.20). I strongly suspect some kind of firmware corruption.

Comment 62 las 2019-01-25 08:20:36 UTC

What changes happened in 5.0rc3 that could have fixed this? I will try to see if I still experience problems with 5.0rc3 when I can check.

Also, can you elaborate on what you mean by draining the built-in battery? Are you using a laptop, or are you referring to some other built-in battery? Excuse my ignorance.

(In reply to Zheng Luo from comment #61)
> I experienced similar problems, but mine is much worse. I can't recover from
> black screen after reboot/hard reset unless I drain the builtin battery.
> However this problem disappears in 5.0rc3 (in contrast to the buggy 4.20).
> Strongly suspect there are some kinds of firmware corruption

Comment 63 las 2019-01-26 13:18:47 UTC

Well, my GPU doesn't even work properly on Windows anymore. I do not think the GPU was originally faulty, as it *did* work without problems on Windows before, but now after having used it on Linux, it has the exact same problems on Windows. Hopefully I can get it replaced, but I will not use it on Linux anymore for fear of fucking it up again.

Comment 64 lada.dvorak7 2019-02-02 10:55:21 UTC

I've been facing freezes for many days on Ryzen 1600 + RX560. I have tried BIOS, kernel and Mesa updates and the kernel parameters "processor.nocst=1 iommu=pt amdgpu.vm_update_mode=3", but none of it helped. Finally I tried the kernel parameters ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2, and that did the trick. No more freezing.
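
When trying parameters like these, it is worth confirming that they actually reached the running kernel, since a bootloader config that was not regenerated silently drops them. The Python sketch below compares a command line against a wanted set. The function names and the sample command line are hypothetical, and the simple whitespace split does not handle quoted values containing spaces.

```python
from pathlib import Path

# Parameters taken from the comment above.
WANTED = {"ivrs_ioapic[4]": "00:14.0", "ivrs_ioapic[5]": "00:00.2"}

def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params

def missing_params(cmdline, wanted):
    """Return the wanted parameter names that are absent or set to another value."""
    active = parse_cmdline(cmdline)
    return [k for k, v in wanted.items() if k not in active or active[k] != v]

# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro iommu=pt ivrs_ioapic[4]=00:14.0"
print(missing_params(sample, WANTED))

# On a live Linux system, check the real command line as well.
proc_cmdline = Path("/proc/cmdline")
if proc_cmdline.exists():
    print(missing_params(proc_cmdline.read_text(), WANTED))
```

An empty list means every wanted parameter is present with the expected value.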

Comment 65 jake.hedges 2019-02-04 03:17:14 UTC

Adding my PC to the pile affected by this -

Ryzen 5 1600 
Aorus RX 480
Debian (stretch) 
2x8GB G.Skill DIMM (previously OC, but now everything in BIOS is "optimized default")
ASUS ROG STRIX B350-F with latest BIOS/AGESA 4207

I am a Windows migrant who went cold turkey into Linux. Debian has been kind to me, minus a few hiccups and re-installs. Sadly, my very first few installations were the most stable. I have finally traced my issue to this thread.

/var/log/syslog shows the GPU failure messages right before the crash. The issue seems to occur whether I have linux-firmware installed or not. Most recently, I had a crash simply from opening "Show Applications" in GNOME.

The crash is the same as others have described. The screen blips and goes black. The fans spin up to high speed. I did not test ssh, but the reset button does not work. The machine must be hard powered down in order to recover. Twice now this has corrupted the file system to the point where it would not boot normally. Since I was not versed enough to recover manually, I just re-installed.

As this is my first hard shot at Linux, this is quite a damper on what was a very exciting change.  For now, I have applied the kernel params mentioned above and will report back should I crash again.

Comment 66 jake.hedges 2019-02-04 04:23:48 UTC

It really did not take long to crash, even with the params. I am back to square one. I think I will at least try a few different distros and possibly upgrade some hardware, though I was not disappointed in its performance until I started using Linux. Anyway, I will keep experimenting and report back.

Comment 67 Alex Deucher 2019-02-04 18:38:57 UTC

For those with AMD platforms, does adding idle=nomwait on the kernel command line in grub help?
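
After adding a parameter such as idle=nomwait (or the ivrs_ioapic entries from comment 64) in grub and rebooting, it can be worth confirming that the running kernel actually received it by looking at /proc/cmdline. Below is a minimal Python sketch of that check. It is not part of any tool mentioned in this thread: the helper names parse_cmdline, has_param and read_running_cmdline are made up for illustration, and the parser assumes that no parameter value contains quoted spaces.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Bare flags such as "quiet" map to None. Assumption: values do not
    contain quoted spaces, which this simple split would break apart.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def has_param(cmdline: str, name: str, value: Optional[str] = None) -> bool:
    """Return True if `name` is present (and equals `value`, when given)."""
    params = parse_cmdline(cmdline)
    if name not in params:
        return False
    return value is None or params[name] == value


def read_running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the currently running Linux kernel."""
    return Path(path).read_text()
```

On a Linux machine, `has_param(read_running_cmdline(), "idle", "nomwait")` should return True once the parameter is in effect. If it returns False after a reboot, the grub configuration was probably not regenerated.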

Comment 68 Alex Deucher 2019-02-04 18:42:38 UTC

(In reply to jake.hedges from comment #66)
> It really did not take too long to crash it with even with the params.  I
> back to square one.  Thinking I will at least try a few different distros
> and possibly upgrade some hardware though I am not disappointed in their
> performance until I have used linux.  Anyways, I will keep experimenting and
> report back.

Do the suggestions in comment 64 or comment 67 help?

Comment 69 jake.hedges 2019-02-05 02:18:24 UTC

Hi Alex, comment 64 did not resolve the issue. It did seem to delay the crash, but ultimately did not resolve it. I will add the idle=nomwait param now and begin testing. If I am still stuck, I also have another suggestion to try: limiting the clock speed (MHz) of the GPU itself.

Comment 70 jake.hedges 2019-02-05 13:42:42 UTC

OK, that seems to have stabilized my system. It at least withstood constant use for 4+ hours. I went idle, stressed it, went idle again, and had no crashes.

My current setup is Buster and idle=nomwait. I am going to add idle=nomwait to my boot configuration permanently for now and continue reading about the behavior so I can troubleshoot better going forward. From my cursory glance, this seems to indicate it was a CPU issue rather than a display problem. Is that way off base? Thank you for the suggestion. I will report back if the issue recurs.

Comment 71 Garry Hurley Jr 2019-02-05 16:28:03 UTC

Created attachment 143307 [details]
attachment-2574-0.html

What I want to know is what is calling your machine ‘localhorst’? 


> On Nov 20, 2018, at 9:15 AM, bugzilla-daemon@freedesktop.org wrote:
> 
> Comment # 47 on bug 105733 from Allan
> I have really bad news.
> 
> I'm delaying a lot to answer because I literally sent for warranty or replaced
> ALL of my components in the PC.
> 
> The CPU (R7 1800X) was replaced from a batch 21 to a new by AMD itself batched
> 35.
> 
> But OK, let's talk about the amdgpu :
> 
> (In reply to Andrey Grodzovsky from comment #25)
> > (In reply to Allan from comment #12)
> > Can you build latest kernel (4.18) and grab again latest firmware and try
> > again ?
> > Links to kernel and firmware:
> > https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ 
> 
> For reasons already explained here I couldn't either compile or test it before,
> so please don't be mad with me :
> - Sold my old PC.
> - My notebook was completely filled with files.
> - Components on warranty. Testing everything else.
> 
> So I managed to borrow a PC to test the video cards. I have tested only the
> nvidia one to prove for AMD that the GPU is working and the pci-controller (a
> guess of mine) of the CPU/chipset that is broken. Going to test the RX480 on
> this PC as soon as possible. My warranties are expiring and I had to enumerate
> priorities.
> 
> I already said it here but, with the 1800X I couldn't even clone the git
> repository (the checksum always fails, tried many times).
> 
> Then I managed to free some space on my notebook and started to build
> yesterday.
> - Included amd-ucode firmware.
> - Included polaris10 firmware (for RX480).
> - Made some optimizations for ryzen as descbribed on the gentoo's dedicated
> page.
> 
> Compiled, version 4.20-rc1 as present in the branch. No errors reported.
> 
> There are 2 main applications that are easier to test right now to find the
> problems :
> - Metro 2033 Redux through steam.
> - Left for Dead 2 through steam.
> 
> Started Metro 2033, worked for some minutes with no issue, but it was for some
> reason without any sound. Closed. Turned off the HDMI audio on pavucontrol to
> use only the default output. Restarted steam.
> 
> Started Left for Dead 2 this time. Was able to change graphics settings to max
> without AA and vsync. Played for 15 seconds and got a screen freeze. Waited for
> a script to record properly the logs and temps. Hard rebooted. This time even
> my BIOS/EFI screen had a green background, but still operational. Everything
> was green except the text. Rebooted again, got back to normal colors.
> 
> And here are the logs :
> 
> kern.log about Firefox usage :
> > Nov 14 05:26:50 desk kernel: [  324.714998] Chrome_~dThread[1788]: segfault at 0 ip 00007fbfee5e3181 sp 00007fbfec2d1ad0 error 6 in libxul.so[7fbfee5cf000+3a2c000]
> 
> It points that the CPU stills with either a problematic microcode or is
> defective.
> 
> dmesg about amdgpu screen freeze :
> > [ 3323.920795] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000080c for process hl2_linux pid 14648 thread amdgpu_cs:0 pid 14653
> > [ 3323.920799] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
> > [ 3323.920801] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200800C
> > [ 3323.920804] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 1, pasid 32774) at page 0, read from 'TC0' (0x54433000) (8)
> > [ 3334.103233] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=274140, emitted seq=274142
> > [ 3334.103239] amdgpu 0000:09:00.0: GPU reset begin!
> > [ 3344.332607] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:46:crtc-0] hw_done or flip_done timed out
> > [ 3504.834097] INFO: task kworker/u32:2:3872 blocked for more than 120 seconds.
> > [ 3504.834103]       Not tainted 4.20.0-rc1-amd #2
> > [ 3504.834105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 3504.834107] kworker/u32:2   D    0  3872      2 0x80000000
> > [ 3504.834123] Workqueue: events_unbound commit_work [drm_kms_helper]
> > [ 3504.834126] Call Trace:
> > [ 3504.834133]  ? __schedule+0x2a0/0x880
> > [ 3504.834136]  schedule+0x28/0x80
> > [ 3504.834139]  schedule_timeout+0x25d/0x380
> > [ 3504.834217]  ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu]
> > [ 3504.834292]  ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
> > [ 3504.834297]  dma_fence_default_wait+0x23b/0x2a0
> > [ 3504.834301]  ? dma_fence_release+0x90/0x90
> > [ 3504.834304]  dma_fence_wait_timeout+0xdd/0x100
> > [ 3504.834308]  reservation_object_wait_timeout_rcu+0x161/0x270
> > [ 3504.834387]  amdgpu_dm_do_flip+0x112/0x370 [amdgpu]
> > [ 3504.834468]  amdgpu_dm_atomic_commit_tail+0x68b/0xcd0 [amdgpu]
> > [ 3504.834472]  ? __switch_to_asm+0x40/0x70
> > [ 3504.834475]  ? wait_for_completion_timeout+0x3b/0x1a0
> > [ 3504.834477]  ? __switch_to_asm+0x34/0x70
> > [ 3504.834480]  ? __switch_to_asm+0x40/0x70
> > [ 3504.834483]  ? __switch_to+0x1ba/0x450
> > [ 3504.834492]  commit_tail+0x3d/0x70 [drm_kms_helper]
> > [ 3504.834497]  process_one_work+0x1aa/0x3a0
> > [ 3504.834500]  worker_thread+0x30/0x3a0
> > [ 3504.834503]  ? drain_workqueue+0x130/0x130
> > [ 3504.834506]  kthread+0x11d/0x140
> > [ 3504.834509]  ? kthread_park+0x80/0x80
> > [ 3504.834512]  ret_from_fork+0x22/0x40
> > [ 3516.645267] WARNING: CPU: 14 PID: 14694 at kernel/kthread.c:501 kthread_park+0x6c/0x80
> > [ 3516.645271] Modules linked in: fuse edac_mce_amd kvm_amd nls_ascii nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec joydev amdgpu snd_hda_core snd_hwdep chash gpu_sched snd_pcm snd_timer ttm drm_kms_helper snd drm i2c_algo_bit sp5100_tco soundcore kvm efi_pstore efivars sg irqbypass evdev wmi_bmof serio_raw pcspkr k10temp ccp tpm_crb pcc_cpufreq tpm_tis tpm_tis_core tpm rng_core acpi_cpufreq button parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto btrfs xor zstd_decompress zstd_compress xxhash raid6_pq libcrc32c crc32c_generic algif_skcipher af_alg dm_crypt dm_mod sd_mod hid_generic usbhid hid uas usb_storage crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ahci xhci_pci aes_x86_64 libahci crypto_simd xhci_hcd cryptd glue_helper libata r8169 i2c_piix4 libphy usbcore scsi_mod thermal wmi gpio_amdpt gpio_generic
> > [ 3516.645324] CPU: 14 PID: 14694 Comm: TaskSchedulerFo Not tainted 4.20.0-rc1-amd #2
> > [ 3516.645327] Hardware name: BIOSTAR Group X370GT7/X370GT7, BIOS 5.13 08/07/2018
> > [ 3516.645330] RIP: 0010:kthread_park+0x6c/0x80
> > [ 3516.645333] Code: 18 e8 88 6c 67 00 be 40 00 00 00 48 89 df e8 8b c3 00 00 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 80 00 00 00 00 0f 1f
> > [ 3516.645335] RSP: 0018:ffffbafdc3fcfb60 EFLAGS: 00010202
> > [ 3516.645338] RAX: 0000000000000004 RBX: ffff9dcd93f140c0 RCX: dead000000000200
> > [ 3516.645339] RDX: ffff9dcd92ba7430 RSI: ffff9dcd93f140c0 RDI: ffff9dcd8a9049c0
> > [ 3516.645341] RBP: ffff9dcd940a5360 R08: ffff9dcd96da25a8 R09: 0000000000000000
> > [ 3516.645343] R10: 0000000000000000 R11: 000000000000019c R12: ffff9dcd92ba27a0
> > [ 3516.645344] R13: ffff9dcd76d34200 R14: 0000000000000206 R15: dead000000000100
> > [ 3516.645347] FS:  00007efea483e700(0000) GS:ffff9dcd96d80000(0000) knlGS:0000000000000000
> > [ 3516.645349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 3516.645351] CR2: 00005654fe725e10 CR3: 0000000200d40000 CR4: 00000000003406e0
> > [ 3516.645352] Call Trace:
> > [ 3516.645362]  drm_sched_entity_fini+0x37/0x190 [gpu_sched]
> > [ 3516.645423]  amdgpu_vm_fini+0xad/0x530 [amdgpu]
> > [ 3516.645429]  ? idr_destroy+0x78/0xc0
> > [ 3516.645481]  amdgpu_driver_postclose_kms+0x151/0x270 [amdgpu]
> > [ 3516.645496]  drm_file_free.part.5+0x21f/0x300 [drm]
> > [ 3516.645510]  drm_release+0xaa/0x120 [drm]
> > [ 3516.645514]  __fput+0xac/0x1e0
> > [ 3516.645518]  task_work_run+0x8f/0xb0
> > [ 3516.645522]  do_exit+0x2e6/0xb30
> > [ 3516.645525]  do_group_exit+0x3a/0xb0
> > [ 3516.645528]  get_signal+0x27a/0x5f0
> > [ 3516.645532]  do_signal+0x30/0x6d0
> > [ 3516.645537]  exit_to_usermode_loop+0x89/0xf0
> > [ 3516.645540]  do_syscall_64+0xda/0xe0
> > [ 3516.645544]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 3516.645547] RIP: 0033:0x7efeb6b9d19a
> > [ 3516.645553] Code: Bad RIP value.
> > [ 3516.645555] RSP: 002b:00007efea483d810 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> > [ 3516.645557] RAX: fffffffffffffdfc RBX: 00007efea483d958 RCX: 00007efeb6b9d19a
> > [ 3516.645559] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007efea483d980
> > [ 3516.645560] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffe661d7080
> > [ 3516.645562] R10: 00007efea483d860 R11: 0000000000000246 R12: 0000000000000000
> > [ 3516.645564] R13: 00007efea483d980 R14: 00007efea483d990 R15: 00007efea483d930
> > [ 3516.645566] ---[ end trace 7da35ac4aa65c90d ]---
> 
> It is important to note that the most common code that appears while using
> generic kernels is 147 despite of 146 that is being shown here.
> 
> Xorg.0.log reports nothing.
> 
> I said that these were bad news because seems to me that both CPU and amdgpu
> driver are defective.
> 
> I noticed that while running kernel 4.18 the gpu is kept at 100% (mclk and
> sclk) all the time while with this new kernel the GPU tries to scale the
> performance.
> 
> Also, it is important to note that the nvidia GTX 1070 throws a lot of xid
> error codes ( see
> https://devtalk.nvidia.com/default/topic/1043483/linux/xid-errors-on-gtx-1070-linux/post/5293440
> ). And this is why I'm thinking that the 1800X has a defective pci-controller.
> And it is also the second part of the "really bad news". Maybe it is happening
> mostly with ryzen processors? I'll test the RX480 with the other computer ASAP,
> need to send informations about the CPU for AMD to proceed with the warranty
> process.
> 
> The GTX 1070 works without a single problem outside of this PC. The other cards
> that I had tested before follows the same pattern ( 2 RX480, 1 RX 580, 1 GTX
> 970, 1 GTX 1070).
> 
> Currently I have only 1 RX480 and 1 GTX 1070. Now that I know that the cards
> don't have any problem I'm selling the cards and soon I'll have only one or
> none. The seller told me off because of requesting warranty for the RX 480 when
> I thought it was defective, he sent me another different and the one that I
> sent was working without any issues according to him.
> 
> I'm already in a new stage of re-sending the CPU for AMD, and praying to solve
> my endless torment. I think that they'll have to refund me (and then I'll have
> a loss with the motherboard).
> 
> Please tell me any other step that you may want to be done.
> 
> I can also provide a full description of the kernel compilation (parameters)
> and even provide a link to the generated .deb packages.

Comment 72 castor_fou 2019-02-10 00:13:58 UTC

I tried the suggestion from comment 64: ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2

After 2 days without any hang, I've just had one.
I am desperate about this problem. It has happened only since the 18.04 upgrade. I had no issue for 4 years with 16.04 and previous versions.

My mitigation is to have cron restart display-manager twice a day. What a pitiful solution.
0 12,19 * * * /bin/systemctl restart display-manager
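
For readers less familiar with cron syntax: the entry above has five time fields (minute, hour, day of month, month, day of week) followed by the command, so it fires at minute 0 of hours 12 and 19 every day. The following Python sketch expands such an entry into its daily run times. The function names are hypothetical, and only "*" and comma-separated lists are handled, which is a deliberate simplification of real cron syntax (no ranges, steps, or names).

```python
from typing import List


def expand_field(field: str, lo: int, hi: int) -> List[int]:
    """Expand a cron field that is either '*' or a comma-separated list."""
    if field == "*":
        return list(range(lo, hi + 1))
    values = sorted({int(part) for part in field.split(",")})
    for value in values:
        if not lo <= value <= hi:
            raise ValueError(f"{value} is outside {lo}..{hi}")
    return values


def daily_run_times(entry: str) -> List[str]:
    """Return the HH:MM times at which a run-every-day entry fires."""
    fields = entry.split()
    if len(fields) < 6:
        raise ValueError("expected five time fields followed by a command")
    minute, hour, day_of_month, month, day_of_week = fields[:5]
    # Simplification: refuse entries restricted to certain days or months.
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("this sketch only handles entries that run every day")
    minutes = expand_field(minute, 0, 59)
    hours = expand_field(hour, 0, 23)
    return [f"{h:02d}:{m:02d}" for h in hours for m in minutes]
```

Passing the crontab line above to `daily_run_times` yields the two restart times, 12:00 and 19:00.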

Comment 73 Mauro Gaspari 2019-02-23 12:14:25 UTC

This problem affects me as well. It has for quite some time. 
My setup: 
CPU AMD Ryzen 7 2700X
RAM 64GB DDR4 3200
GPU AMD Vega RX 64

Since this issue has plagued me for quite a while, I even tried installing Windows 10, and I can confirm there are no issues at all there. Having said that, AMD drivers were quite bad at the Vega launch on Windows too.

In my experience the bug comes and goes with the Mesa version being used, or the combination of kernel plus Mesa. I can reproduce the issue easily by playing some games. Some extra tests I ran to make sure it was not a hardware or game issue:
- The same games work fine on Windows on the same hardware, same BIOS settings, etc.
- The same games work fine on my Nvidia+Intel based laptop, running the same Linux distributions and kernels.

For example, Kubuntu 18.04.01 using the AMDGPU open-source drivers was fine for me, without the bug, for a very long time. Then, a couple of weeks ago, a Mesa update came and I started having the freeze again.
I tried upgrading to 18.10 and still had the freeze. I added the oibaf PPA, and the issue was gone. After a few weeks an update came and the issue started happening again. I am now using the padoka PPA but am still having the freeze.
The same problem also happens for me on OpenSUSE Tumbleweed and Arch on the same machine.

I tried disabling the compositor, disabling vsync, changing the compositor on my KDE Plasma, and running the game in windowed mode vs full screen. Nothing helped.

Also please note that before upgrading my CPU and Motherboard, I was running Vega RX64 on an Intel CPU, and I had the same issues.

Some info I saved a while back when running on OpenSUSE Tumbleweed below. If needed I can grab more recent logs and system info and post.
I am also going to try and install kubuntu 18.04.1 with AMDGPU-PRO proprietary drivers to see if there is any difference.


--- First time I noticed the issue:

OS: OpenSUSE tumbleweed x86_64 updated (2018 04 21)
Kernel: 4.16.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.0 Mesa 18.0.0
GPU: AMD Radeon RX Vega 64 8GB

Apr 21 17:08:34 STUDIO kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Apr 21 17:08:34 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?
Apr 21 17:08:44 STUDIO kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=128859, last emitted seq=128861
Apr 21 17:08:44 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?
-- Reboot --


Dmesg lines relative to amdgpu:

[    3.407020] [drm] amdgpu kernel modesetting enabled.
[    3.411462] fb: switching to amdgpudrmfb from VESA VGA
[    3.426163] amdgpu 0000:04:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    3.426261] amdgpu 0000:04:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[    3.426263] amdgpu 0000:04:00.0: GTT: 256M 0x000000F600000000 - 0x000000F60FFFFFFF
[    3.426371] [drm] amdgpu: 8176M of VRAM memory ready
[    3.426372] [drm] amdgpu: 8176M of GTT memory ready.
[    4.031665] fbcon: amdgpudrmfb (fb0) is primary device
[    4.083803] amdgpu 0000:04:00.0: fb0: amdgpudrmfb frame buffer device
[    4.096086] amdgpu 0000:04:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[    4.096088] amdgpu 0000:04:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[    4.096089] amdgpu 0000:04:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[    4.096090] amdgpu 0000:04:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[    4.096091] amdgpu 0000:04:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[    4.096093] amdgpu 0000:04:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[    4.096094] amdgpu 0000:04:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[    4.096095] amdgpu 0000:04:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[    4.096096] amdgpu 0000:04:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[    4.096098] amdgpu 0000:04:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[    4.096099] amdgpu 0000:04:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[    4.096100] amdgpu 0000:04:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[    4.096101] amdgpu 0000:04:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[    4.096103] amdgpu 0000:04:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[    4.096104] amdgpu 0000:04:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[    4.096105] amdgpu 0000:04:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[    4.096107] amdgpu 0000:04:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[    4.096108] amdgpu 0000:04:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[    4.096662] [drm] Initialized amdgpu 3.23.0 20150101 for 0000:04:00.0 on minor 0


---It was identified to be this bug https://bugs.freedesktop.org/show_bug.cgi?id=105317 . After I upgraded Tumbleweed to mesa 18.0.1 the issue was gone.


--- Later on I had the same bug again.
OS: OpenSUSE tumbleweed x86_64 updated (2018 08 10)
Kernel: 4.17.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.1 Mesa 18.1.5
GPU: AMD Radeon RX Vega 64 8GB


Relevant log lines I found during freeze:

2018-08-09T23:16:53.103775+08:00 MGDT-Tumbleweed kernel: [ 6305.852703] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1745163, last emitted seq=1745165
2018-08-09T23:16:53.103795+08:00 MGDT-Tumbleweed kernel: [ 6305.852704] [drm] No hardware hang detected. Did some blocks stall?
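
The "ring gfx timeout" lines in this thread appear in two wordings, with and without the word "last" before "signaled" and "emitted". When comparing hangs across logs it can help to pull the numbers out mechanically. The Python sketch below does that. The names RingTimeout and parse_timeout are invented for this example, and reading the gap between the two sequence numbers as "jobs submitted but not yet completed" is my interpretation of the message, not something stated by the driver developers here.

```python
import re
from typing import NamedTuple, Optional

# Matches both wordings seen in this thread:
#   "ring gfx timeout, signaled seq=274140, emitted seq=274142"
#   "ring gfx timeout, last signaled seq=128859, last emitted seq=128861"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"(?:last )?signaled seq=(?P<signaled>\d+), "
    r"(?:last )?emitted seq=(?P<emitted>\d+)"
)


class RingTimeout(NamedTuple):
    ring: str
    signaled: int
    emitted: int

    @property
    def pending(self) -> int:
        """Emitted minus signaled sequence number (assumed: jobs in flight)."""
        return self.emitted - self.signaled


def parse_timeout(line: str) -> Optional[RingTimeout]:
    """Return the parsed timeout, or None if the line is not a ring timeout."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return RingTimeout(match["ring"], int(match["signaled"]), int(match["emitted"]))
```

Feeding each line of `dmesg` or the journal through `parse_timeout` and keeping the non-None results gives a compact list of hangs that can be compared between kernel and Mesa versions.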


Dmesg lines relative to amdgpu:

[    3.130759] [drm] amdgpu kernel modesetting enabled.
[    3.135770] fb: switching to amdgpudrmfb from EFI VGA
[    3.136106] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    3.136171] amdgpu 0000:03:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[    3.136173] amdgpu 0000:03:00.0: GTT: 512M 0x000000F600000000 - 0x000000F61FFFFFFF
[    3.136494] [drm] amdgpu: 8176M of VRAM memory ready
[    3.136495] [drm] amdgpu: 8176M of GTT memory ready.
[    4.114469] fbcon: amdgpudrmfb (fb0) is primary device
[    4.141179] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
[    4.164072] amdgpu 0000:03:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[    4.164074] amdgpu 0000:03:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[    4.164075] amdgpu 0000:03:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[    4.164075] amdgpu 0000:03:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[    4.164076] amdgpu 0000:03:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[    4.164077] amdgpu 0000:03:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[    4.164078] amdgpu 0000:03:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[    4.164079] amdgpu 0000:03:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[    4.164079] amdgpu 0000:03:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[    4.164080] amdgpu 0000:03:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[    4.164081] amdgpu 0000:03:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[    4.164082] amdgpu 0000:03:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[    4.164083] amdgpu 0000:03:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[    4.164084] amdgpu 0000:03:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[    4.164085] amdgpu 0000:03:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[    4.164085] amdgpu 0000:03:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[    4.164086] amdgpu 0000:03:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[    4.164087] amdgpu 0000:03:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[    4.164553] [drm] Initialized amdgpu 3.25.0 20150101 for 0000:03:00.0 on minor 0

Comment 74 Mauro Gaspari 2019-03-09 10:31:18 UTC

Quick update.
After the latest updates on my Kubuntu 18.10 with the Padoka unstable PPA, I am noticing great improvements. Performance using DXVK with DX11 is greatly improved with LLVM 9.0.0, mesa 19.0.1-devel seems stable, and so far I have had no freezes.

I am currently using: Kubuntu 18.10 with Mesa 19.1.0-devel - padoka PPA, DRM 3.26.0, kernel 4.18.0-16-generic, LLVM 9.0.0
This is the PPA being used: https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa/

Further explanation and installation guides included here: https://github.com/lutris/lutris/wiki/Installing-drivers

I hope this helps.
Cheers
Mauro

Comment 75 Allan 2019-03-10 09:37:56 UTC

Well, after a long time I'm here again to tell what happened:

A very helpful AMD staff member was following up with me about the CPU, and that ended up solving the problems I had with the video card (or so it seems).


1. Regarding the kernel timing
(In reply to fin4478 from comment #52)
> To prevent random kernel lock ups with Ryzen, fix this with bios, set to
> Typical Current Idle  in the bios Advanced/AMD CBS menu.
> 
> Use latest AMD wip kernel and Oibaf ppa Mesa. Disable display composting and
> vsync with Xfce. Use 300Hz kernel timer.
> 
> Working kernel config file for my system as attachment.

Yes, I tried it a lot, believe me: all possible combinations, 300 Hz, 250 Hz, 1000 Hz, your config, linux-firmware drivers. At least 10 attempts with variations of your config, including an otherwise unmodified one that only enables dm-crypt, which is not enabled in yours.

2. Regarding the PSU profile
As suggested by fin4478 and requested by AMD, I asked BIOSTAR for a BIOS that allows changing it. They sent me a beta version to test.

No luck at all; it made no difference.

3. The madness
Nothing worked, but the CPU was already OK. The mobo was already OK, yet the video card was still hanging sometimes, now even on Windows.

So I took a shot in the dark, suspecting some odd incompatibility with the RAM.

And that was it. Even after sending it in for warranty, even after running 100+ tests, the RAM was the issue.

It was a Corsair Vengeance kit: 2x4GB DDR4 CL15, 2133MHz SPD (JEDEC), 3000MHz XMP2.

Even at JEDEC specifications it caused the system to fail.

Even when I loosened the latencies considerably, it still caused failures.

That was what caused the amdgpu driver to fail, along with any heavy application. Since data passes through system RAM before being sent to VRAM, it makes sense that the driver/device ended up processing something unexpected.

A warning to everyone who uses Corsair memory, especially modules not sold as "Ryzen ready": even though there is a standard called JEDEC, in my experience they simply do not implement it very well.

This was why I could sometimes use the system for 1-2 hours, and sometimes not even 5 minutes, before it crashed. There is some kind of instability there.

I sold it to a guy who uses an 8700K or something. I explained the situation and he agreed. So far (more than 2 months) there has not been a single issue related to the memory chips. They must have done something to optimize for Intel beyond the XMP profile and compromised the whole design. Along with 1 year of my life and a bunch of money spent.

But the fixes made to amdgpu over time did prove useful, so it was not only the RAM's fault: using the same RAM, I had far fewer problems than when I first reported this bug.

Now I'm using a G.Skill Trident Z 3200MHz kit @ 2666MHz, which is the speed AMD guarantees the 1800X will work with. It is stable, without a single problem related to it.

4. To confirm that I won the lottery of non-working systems, my RX480 died a month ago, probably because of a BGA problem.

Then I found a label on the card, looked it up, and discovered that the seller had sold me a refurbished product as new.

So I'm deciding whether to sue him or just fix the card.

I mention this because it is why I can't test again until I get another AMD card. I'm using the nvidia card that I couldn't sell in the meantime.

5. The funny part.

The nvidia driver, which seemed much more stable at first, started to fail like hell after I replaced the truly problematic CPU.

And the amdgpu driver became more stable, more than any other driver on Linux or Windows.


Well, I think that this is it. I'll return when I'm able to test amdgpu again.

But the verdict for now is:

I tested the RX480 without a single problem while using amdgpu. I did not use it intensively, just common tests, and I played a little bit of Left for Dead 2 without any issue (a good sign, since it always crashed before).

The card showed the BGA problem when using a variation of the Adrenalin driver for Windows, when I was doing some verifications requested by AMD.

Cheers to all.
Prefer G.Skill over Corsair.

Comment 76 Allan 2019-03-10 10:03:07 UTC

To clarify what kernels to aim for if you are using ryzen+amdgpu :

1 - drm-next-4.21-wip https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.21-wip
2 - drm-next-5.2-wip https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip
3 - amd-staging-drm-next https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

No generic kernel provided by Debian is enough to handle the Ryzen CPU properly yet (Debian is just an example; most distros may follow a similar policy and thus give the same result).
I have tested up to 4.19.0-1-amd64 (Debian 4.19.12-1) from the Debian repos.

There were some fixes still awaiting acceptance of a pull request.

Comment 77 Allan 2019-03-10 10:07:04 UTC

Also, I need instructions on what to do with the status of the bug.

It worked for me, but some users are still discussing it.

I'll wait for a response. Please mention me.

Comment 78 las 2019-03-10 10:55:43 UTC

Created attachment 143609 [details]
attachment-3556-0.html

I think this issue should be closed. I am not yet sure if my issue was the same as yours (doesn't seem likely), but if it wasn't the same, I'll just open a new one if needed. Likewise, those whose issues have not been resolved by this should just open a new issue IMO.


Comment 79 Mauro Gaspari 2019-03-11 06:49:26 UTC

Please go ahead and close it. I will open a new one, no problem.

Cheers 
Mauro

Comment 80 Allan 2019-03-12 08:35:21 UTC

Closing this issue, here is the summary for a quick look:

The problem: amdgpu hangs suddenly and nothing can kill it.

Solution: the driver became more stable over time.

The causes:
1 - The driver itself was less stable at the time.
2 - The kernel did not support Ryzen CPUs properly, leading to segfaults and unexpected behavior. If that is your case, use one of the kernels already listed here.
3 - Corsair RAM did not work well with Ryzen for me, especially modules without some kind of "Ryzen ready" seal. Tuning for the best performance on Intel platforms apparently left them unable to meet the JEDEC standard properly with the SPD profile, even with relaxed latencies.
Thus, bad RAM -> unexpected behavior, including from the driver.

Additional information:
I was able to test for only a few days (a week or so) before the GPU showed BGA problems.
It was working fine.

If I am ever able to test it again and find another scenario where the driver hangs and can't be killed, I'll report it here.

Comment 81 Hadet 2019-07-16 10:18:41 UTC

Created attachment 144797 [details]
After AMDGPU crashes

I am having some similar issues, specifically after closing games running in Wine.

Comment 82 Michel Dänzer 2019-07-17 07:58:45 UTC

(In reply to Hadet from comment #81)
> Having some similar issues. After closing games running in Wine specifically

Please file your own report. The reporter of this one marked it as resolved.
