Bug 106194

Summary: AMDGPU RIP: dm_update_crtcs_state on kernel 4.17rc2
Product: DRI Reporter: Kevin McCormack <harlemsquirrel>
Component: DRM/AMDgpu Assignee: Leo Li <sunpeng.li>
Status: RESOLVED FIXED QA Contact:
Severity: major    
Priority: high CC: bjo, bugs.freedesktop.org, cig, ebiggers3, harry.wentland, jonemilj, levis.kool, sunpeng.li
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
dmesg output running kernel 4.17-rc2 | none
dmesg output linux 4.17 rc2 | none
Partially revert commit introducing the BUG() where the "hang" happens | none
Fix v1 | none
"Before fix" dmesg output | none
"After Fix" dmesg output | none

Description Kevin McCormack 2018-04-23 17:18:13 UTC
Created attachment 139018 [details]
dmesg output running kernel 4.17-rc2

I left my computer on overnight. When I noticed just now that the system was frozen, I rebooted and found these log entries.

5:06:46 AM
RIP: dm_update_crtcs_state+0x347/0x3c0 [amdgpu] RSP: ffffb2560bc37b10

5:06:46 AM
kernel BUG at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:4700!

Software versions:
  OpenGL version string: 3.0 Mesa 18.0.1


CPU hardware:
  x86_64
  AMD Ryzen 7 1800X Eight-Core Processor
  	Max Speed: 4100 MHz
	Current Speed: 3600 MHz


Memory:
	Speed: 3200 MT/s
              total        used        free      shared  buff/cache   available
Mem:           15Gi       3.0Gi        11Gi        67Mi       1.5Gi        12Gi
Swap:         7.8Gi          0B       7.8Gi


GPU hardware:
  OpenGL renderer string: AMD Radeon (TM) R9 Fury Series (FIJI / DRM 3.23.0 / 4.15.15-1-ARCH, LLVM 6.0.0)
  0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] [1002:7300] (rev c8)
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] [1002:67df] (rev e7)


Motherboard:
  ASUSTeK COMPUTER INC.
  CROSSHAIR VI HERO
  BIOS Version: 3008


Storage:
Smart Log for NVME device:nvme0 namespace-id:ffffffff

I've attached the relevant dmesg output.
Comment 1 ebiggers3 2018-04-25 06:34:05 UTC
This happened to me too: the same BUG() with the same stack trace, on 4.17-rc2.  It happened while I was away, so maybe it was triggered when the screen turned off due to inactivity.  Graphics card is a Radeon RX 550, one monitor connected via HDMI.  It didn't happen with 4.16.  I did not test 4.17-rc1.
Comment 2 Levis Raju 2018-04-28 15:13:12 UTC
Created attachment 139197 [details]
dmesg output linux 4.17 rc2
Comment 3 Levis Raju 2018-04-28 15:15:36 UTC
Comment on attachment 139197 [details]
dmesg output linux 4.17 rc2

This log covers one boot to the next. I booted the system at 18:22 and used it until 20:00. I went for a cup of tea and the system froze at 20:08.

Then I booted the system again.

I am using a hand-compiled Linux kernel 4.17-rc2.
Comment 4 Levis Raju 2018-04-28 15:19:55 UTC
I know that I left the system at 20:00

Apr 28 20:08:28 levis-desktop kernel: kernel BUG at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:4700!
Apr 28 20:08:28 levis-desktop kernel: RIP: 0010:dm_update_crtcs_state+0x347/0x3c0 [amdgpu]
Apr 28 20:08:28 levis-desktop kernel:  amdgpu_dm_atomic_check+0x191/0x3e0 [amdgpu]
Apr 28 20:08:28 levis-desktop kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Apr 28 20:08:28 levis-desktop kernel: RIP: dm_update_crtcs_state+0x347/0x3c0 [amdgpu] RSP: ffffa87b814bbb18


System:    Host: levis-desktop Kernel: 4.17.0-rc2-next-20180426-ARCH x86_64 bits: 64 Desktop: KDE Plasma 5.12.4 
           Distro: Arch Linux 
Machine:   Type: Desktop Mobo: Micro-Star model: A320M PRO-VD PLUS (MS-7B38) v: 1.0 serial: N/A 
           UEFI: American Megatrends v: 1.80 date: 03/15/2018 
CPU:       Topology: Quad Core model: AMD Ryzen 5 2400G with Radeon Vega Graphics bits: 64 type: MT MCP 
           L2 cache: 2048 KiB 
           Speed: 1368 MHz min/max: 1600/3600 MHz Core speeds (MHz): 1: 1611 2: 1381 3: 1373 4: 1470 5: 1834 
           6: 2727 7: 1375 8: 1376 
Graphics:  Card-1: AMD Raven Bridge [Radeon Vega Series / Radeon Vega Mobile Series] driver: amdgpu v: kernel 
           Display: x11 server: X.Org 1.19.6 driver: modesetting unloaded: ati,fbdev,vesa 
           resolution: 1600x900~60Hz 
           OpenGL: renderer: AMD RAVEN (DRM 3.25.0 / 4.17.0-rc2-next-20180426-ARCH LLVM 6.0.0) 
           v: 4.5 Mesa 18.0.1
Comment 5 Jon 2018-04-28 20:14:52 UTC
I might be having the same issue, however my stack looks a bit different, perhaps due to different kernel commits and the way it is triggered. Can you guys trigger this hang by locking the window manager session/screen and waiting for the display(s) to go to sleep? That's how I can reliably trigger my similar hang.

april 28 20:16:49 beist.localdomain kernel: kernel BUG at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:4708!                                                                                        
april 28 20:16:49 beist.localdomain kernel: invalid opcode: 0000 [#1] SMP NOPTI 
april 28 20:16:49 beist.localdomain kernel: RIP: 0010:dm_update_crtcs_state+0x397/0x410 [amdgpu] 
...
april 28 20:16:49 beist.localdomain kernel: Call Trace:                                                                                                                                                            
april 28 20:16:49 beist.localdomain kernel:  amdgpu_dm_atomic_check+0x182/0x3b0 [amdgpu]                                                                                                                           
april 28 20:16:49 beist.localdomain kernel:  drm_atomic_check_only+0x33a/0x4f0 [drm]                                                                                                                               
april 28 20:16:49 beist.localdomain kernel:  drm_atomic_commit+0x13/0x50 [drm]                                                                                                                                     
april 28 20:16:49 beist.localdomain kernel:  drm_atomic_connector_commit_dpms+0xe5/0xf0 [drm]                                                                                                                      
april 28 20:16:49 beist.localdomain kernel:  drm_mode_obj_set_property_ioctl+0x174/0x290 [drm]                                                                                                                     
april 28 20:16:49 beist.localdomain kernel:  ? drm_mode_connector_set_obj_prop+0x70/0x70 [drm]                                                                                                                     
april 28 20:16:49 beist.localdomain kernel:  drm_mode_connector_property_set_ioctl+0x3e/0x60 [drm]                                                                                                                 
april 28 20:16:49 beist.localdomain kernel:  drm_ioctl_kernel+0x5b/0xb0 [drm]                                                                                                                                      
april 28 20:16:49 beist.localdomain kernel:  drm_ioctl+0x2c3/0x360 [drm]                                                                                                                                           
april 28 20:16:49 beist.localdomain kernel:  ? drm_mode_connector_set_obj_prop+0x70/0x70 [drm]                                                                                                                     
april 28 20:16:49 beist.localdomain kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]                                                                                                                                   
april 28 20:16:49 beist.localdomain kernel:  do_vfs_ioctl+0xa4/0x620                                                                                                                                               
april 28 20:16:49 beist.localdomain kernel:  ksys_ioctl+0x70/0x80                                                                                                                                                  
april 28 20:16:49 beist.localdomain kernel:  __x64_sys_ioctl+0x16/0x20                                                                                                                                             
april 28 20:16:49 beist.localdomain kernel:  do_syscall_64+0x5b/0x160                                                                                                                                              
april 28 20:16:49 beist.localdomain kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9                                                                                                                              
april 28 20:16:49 beist.localdomain kernel: RIP: 0033:0x7f7d092cc0f7
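
The traces in this thread differ in source line (4700 vs 4708) and in RIP offset, which is expected when the kernels were built from different commits. A minimal sketch for checking whether two log excerpts point at the same crash site, assuming the "kernel BUG at" and "RIP:" line formats shown in the comments above; the names `parse_oops` and `same_crash_site` are hypothetical helpers, not part of any kernel tooling:

```python
import posixpath
import re
from typing import NamedTuple, Optional

# Matches e.g. "kernel BUG at drivers/.../amdgpu_dm.c:4708!"
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
# Matches a kernel-symbol RIP line, with or without the "0010:" segment prefix.
# A userspace RIP such as "RIP: 0033:0x7f7d092cc0f7" has no symbol and is skipped.
RIP_RE = re.compile(
    r"RIP: (?:[0-9a-f]{4}:)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


class OopsSignature(NamedTuple):
    source_file: str          # basename of the file named in the BUG line
    line: int                 # line number named in the BUG line
    symbol: Optional[str]     # function named in the first symbolic RIP line


def parse_oops(log_text: str) -> Optional[OopsSignature]:
    """Return the crash signature found in log_text, or None if no BUG line."""
    bug = BUG_RE.search(log_text)
    if bug is None:
        return None
    rip = RIP_RE.search(log_text)
    return OopsSignature(
        source_file=posixpath.basename(bug.group(1)),
        line=int(bug.group(2)),
        symbol=rip.group(1) if rip else None,
    )


def same_crash_site(a: OopsSignature, b: OopsSignature) -> bool:
    """Compare by file and function only; line numbers and offsets shift between builds."""
    return (a.source_file, a.symbol) == (b.source_file, b.symbol)
```

Matching on file and function is only a heuristic: it suggests that two reports hit a BUG() in the same function, not that they share a root cause.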
Comment 6 Jon 2018-04-28 21:12:16 UTC
Created attachment 139205 [details] [review]
Partially revert commit introducing the BUG() where the "hang" happens

This is the patch I'm currently testing to avoid the hang in my case. It probably does not solve the underlying issue, but it seems to avoid the hang for me. My original problem remains, though: after the displays go to sleep they stay blank until I reboot, even with the patch.
Comment 7 Levis Raju 2018-04-29 05:05:47 UTC
For me too, the issue occurs when the display goes black due to inactivity, followed by a hang on wake.
Comment 8 Shmerl 2018-05-02 16:49:30 UTC
Is this happening in 4.17rc3?
Comment 9 Kevin McCormack 2018-05-07 00:41:46 UTC
Seems to be resolved with 4.17rc3 so far!
Comment 10 ebiggers3 2018-05-08 05:26:14 UTC
This just happened to me again, now on v4.17-rc4.  It's the 'BUG_ON(dm_new_crtc_state->stream == NULL);' at amdgpu_dm.c:4708, and it apparently happened when the screen turned off due to inactivity.  Graphics card is a Radeon RX 550, one monitor connected via HDMI.
Comment 11 Leo Li 2018-05-09 15:41:20 UTC
I'm having trouble reproducing the BUG_ON in Ubuntu 18.04.
For those who can, the below info will be helpful:

- Distribution & version
- Window manager used
- Xorg log
- journalctl log, if on Wayland

Xorg log is at /var/log/Xorg.x.log
Gzip journalctl via `$ journalctl | gzip - > journalctl.gz`
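
For reporters who prefer a script, the two log-gathering steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the Xorg search looks only in the directory it is given (default /var/log, per the path above), and `gzip_journal` needs a systemd machine with `journalctl` on the PATH.

```python
import glob
import gzip
import subprocess
from pathlib import Path


def find_xorg_logs(log_dir: str = "/var/log") -> list:
    """Return the Xorg.<n>.log files in log_dir, sorted by name."""
    return sorted(glob.glob(str(Path(log_dir) / "Xorg.*.log")))


def gzip_journal(dest: str = "journalctl.gz") -> Path:
    """Equivalent in spirit to: journalctl | gzip - > journalctl.gz"""
    result = subprocess.run(
        ["journalctl", "--no-pager"],
        check=True,            # raise if journalctl exits non-zero
        capture_output=True,   # collect stdout as bytes
    )
    with gzip.open(dest, "wb") as fh:
        fh.write(result.stdout)
    return Path(dest)


# Example use (not run on import):
#   print(find_xorg_logs())
#   print(gzip_journal())
```

The journal is read into memory before compression, which is fine for a single bug report but may be slow on machines with a very large journal.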
Comment 12 Leo Li 2018-05-16 20:42:54 UTC
Created attachment 139602 [details] [review]
Fix v1

It seems I've reproduced it by using the default modesetting DDX driver.
Please give the attached patch a shot.
Comment 13 Aaron 2018-05-19 12:36:49 UTC
Confirming I had the same issue on suspend. Patch "Fix v1" resolves the issue for me. Attaching "before" and "after" dmesg outputs.
Comment 14 Aaron 2018-05-19 12:38:51 UTC
Created attachment 139634 [details]
"Before fix" dmesg output
Comment 15 Aaron 2018-05-19 12:41:34 UTC
Created attachment 139635 [details]
"After Fix" dmesg output

Note a "new" error near the end. There is no loss in functionality with the error; the screen behaves normally. I cannot confirm whether the new trace is caused by the patch or not.  The 84-second range is right at the time I blanked and un-blanked the screen. I was using `xset dpms force off`.
Comment 16 Aaron 2018-05-19 12:44:47 UTC
Just confirmed that the error in the "after" dmesg output occurs every time the screen is blanked. Will post a new bug once the patch listed here is mainlined.
Comment 17 Shmerl 2018-05-30 23:23:10 UTC
So which rc version will contain the fix?
Comment 18 Leo Li 2018-06-01 13:18:02 UTC
(In reply to Shmerl from comment #17)
> So which rc version will contain the fix?

It's already in Dave's drm-fixes: https://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes

So it should make it to the 4.17 release.
Comment 19 Vedran Miletić 2018-07-12 13:51:43 UTC
(In reply to Leo Li from comment #18)
> (In reply to Shmerl from comment #17)
> > So which rc version will contain the fix?
> 
> It's already in Dave's drm-fixes:
> https://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes
> 
> So it should make it to the 4.17 release.

I can reproduce this (more precisely bug 104611, a likely duplicate) on 4.17.5.
