Bug 93017

Summary: complete system stalls while changing display resolutions on a hybrid (intel/radeon) system
Product: DRI
Component: DRM/Radeon
Status: RESOLVED MOVED
Severity: major
Priority: medium
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Reporter: Yaroslav Halchenko <debian>
Assignee: Default DRI bug account <dri-devel>
QA Contact:
CC: debian, Rondom
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description: journalctl output (a bit anonymized) showing details of the session with the crash; Flags: none
Description: cut/paste terminal output for the 2nd crash: BUG: unable to handle kernel NULL pointer dereference at 0000000000000042; Flags: none

Description Yaroslav Halchenko 2015-11-19 15:09:36 UTC
Originally reported against DRM/Intel (https://bugs.freedesktop.org/show_bug.cgi?id=92997#c1), where I was advised to seek support from the Radeon people.

The story began a month or so ago when I upgraded my Debian testing/sid installation to a new kernel, 4.3.0 (I can dig out the previous version later if needed): the laptop started to freeze completely, usually when switching displays (to/from an external display) or changing resolution.  My laptop is an HP ZBook 14 with dual GPUs:

00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 0b)
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Mars [Radeon HD 8730M]

It sits on a docking station (I believe the stalls also happen without the dock), which has a display connected via DisplayPort.

Running yesterday's nightly of git://anongit.freedesktop.org/drm-intel, I managed to trigger the stall again: I turned on the 2nd display attached to the docking station (also via DisplayPort), which xrandr doesn't actually see as connected and which usually just shows a clone of the 1st display (I guess the docking station does that internally).  When I turned it off again, X reset the display and had almost come back when it stalled. After reboot, journalctl -b -1 showed:

Nov 18 14:00:49 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) intel(0): resizing framebuffer to 1920x1200
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): EDID vendor "HWP", prod id 9977
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Using hsync ranges from config file
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Using vrefresh ranges from config file
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Printing DDC gathered Modelines:
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1920x1200"x0.0  154.00  1920 1968 2000 2080  1200 1203 1209 1235 +hsync -vsync (74.0 kHz eP)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "800x600"x0.0   40.00  800 840 968 1056  600 601 605 628 +hsync +vsync (37.9 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "640x480"x0.0   31.50  640 656 720 840  480 481 484 500 -hsync -vsync (37.5 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "640x480"x0.0   25.18  640 656 752 800  480 490 492 525 -hsync -vsync (31.5 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "720x400"x0.0   28.32  720 738 846 900  400 412 414 449 -hsync +vsync (31.5 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1280x1024"x0.0  135.00  1280 1296 1440 1688  1024 1025 1028 1066 +hsync +vsync (80.0 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1024x768"x0.0   78.75  1024 1040 1136 1312  768 769 772 800 +hsync +vsync (60.0 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1024x768"x0.0   65.00  1024 1048 1184 1344  768 771 777 806 -hsync -vsync (48.4 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "832x624"x0.0   57.28  832 864 928 1152  624 625 628 667 -hsync -vsync (49.7 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "800x600"x0.0   49.50  800 816 896 1056  600 601 604 625 +hsync +vsync (46.9 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1152x864"x0.0  108.00  1152 1216 1344 1600  864 865 868 900 +hsync +vsync (67.5 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1280x960"x0.0  108.00  1280 1376 1488 1800  960 961 964 1000 +hsync +vsync (60.0 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1280x1024"x0.0  108.00  1280 1328 1440 1688  1024 1025 1028 1066 +hsync +vsync (64.0 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1600x1000"x60.0  133.14  1600 1704 1872 2144  1000 1001 1004 1035 -hsync +vsync (62.1 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1600x1200"x0.0  162.00  1600 1664 1856 2160  1200 1201 1204 1250 +hsync +vsync (75.0 kHz e)
Nov 18 14:00:50 hopa /usr/lib/gdm3/gdm-x-session[2354]: (II) RADEON(G0): Modeline "1680x1050"x0.0  119.00  1680 1728 1760 1840  1050 1053 1059 1080 +hsync -vsync (64.7 kHz e)
Nov 18 14:00:50 hopa kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000041
Nov 18 14:00:50 hopa kernel: IP: [<ffffffffa089abeb>] ttm_bo_wait+0x6b/0x170 [ttm]
Nov 18 14:00:50 hopa kernel: PGD 35dba067 PUD 35dbb067 PMD 0 
Nov 18 14:00:50 hopa kernel: Oops: 0000 [#1] SMP 
Nov 18 14:00:50 hopa kernel: Modules linked in: fuse ctr ccm rfcomm ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c loop bnep pci_stub vboxpci(
Nov 18 14:00:50 hopa kernel:  iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass pcspkr psmouse serio_raw sg i2c_i801 lpc_ich shpchp evdev tpm_infineon mei_me mei snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_
Nov 18 14:00:50 hopa kernel:  sd_mod rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul crc32c_intel jitterentropy_rng sha256_ssse3 sha256_generic hmac drbg ansi_cprng ahci libahci aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_h
Nov 18 14:00:50 hopa kernel: CPU: 2 PID: 4571 Comm: kworker/2:0 Tainted: G        W  O    4.4.0-rc1+ #2
Nov 18 14:00:50 hopa kernel: Hardware name: Hewlett-Packard HP ZBook 14/198F, BIOS L71 Ver. 01.20 07/28/2014
Nov 18 14:00:50 hopa kernel: Workqueue: events ttm_bo_delayed_workqueue [ttm]
Nov 18 14:00:50 hopa kernel: task: ffff8804384d7100 ti: ffff880035e14000 task.ti: ffff880035e14000
Nov 18 14:00:50 hopa kernel: RIP: 0010:[<ffffffffa089abeb>]  [<ffffffffa089abeb>] ttm_bo_wait+0x6b/0x170 [ttm]
Nov 18 14:00:50 hopa kernel: RSP: 0018:ffff880035e17d70  EFLAGS: 00010246
Nov 18 14:00:50 hopa kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
Nov 18 14:00:50 hopa kernel: RDX: 0000000000000ea6 RSI: 0000000000000000 RDI: ffff8800a1626068
Nov 18 14:00:50 hopa kernel: RBP: 0000000000000001 R08: ffff8800a5c5cc78 R09: 0000000000000000
Nov 18 14:00:50 hopa kernel: R10: 0000000000000000 R11: ffff8804300f1dc0 R12: 0000000000000000
Nov 18 14:00:50 hopa kernel: R13: 0000000000000001 R14: ffff8804382d76f8 R15: ffff88031b9b6400
Nov 18 14:00:50 hopa kernel: FS:  0000000000000000(0000) GS:ffff88044ea80000(0000) knlGS:0000000000000000
Nov 18 14:00:50 hopa kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 18 14:00:50 hopa kernel: CR2: 0000000000000041 CR3: 000000043018d000 CR4: 00000000001406e0
Nov 18 14:00:50 hopa kernel: Stack:
Nov 18 14:00:50 hopa kernel:  0000000000000ea6 ffff8800a1626068 ffff88044ea959c0 ffff8800a1626068
Nov 18 14:00:50 hopa kernel:  0000000000000001 0000000000000001 ffff880437fcef40 0000000000000000
Nov 18 14:00:50 hopa kernel:  0000000000000001 ffffffffa089b327 0000000000000000 0000000000000001
Nov 18 14:00:50 hopa kernel: Call Trace:
Nov 18 14:00:50 hopa kernel:  [<ffffffffa089b327>] ? ttm_bo_cleanup_refs_and_unlock+0x27/0x170 [ttm]
Nov 18 14:00:50 hopa kernel:  [<ffffffffa089b52f>] ? ttm_bo_delayed_delete+0xbf/0x200 [ttm]
Nov 18 14:00:50 hopa kernel:  [<ffffffffa089b687>] ? ttm_bo_delayed_workqueue+0x17/0x40 [ttm]
Nov 18 14:00:50 hopa kernel:  [<ffffffff810856ff>] ? process_one_work+0x19f/0x3d0
Nov 18 14:00:50 hopa kernel:  [<ffffffff8108597d>] ? worker_thread+0x4d/0x450
Nov 18 14:00:50 hopa kernel:  [<ffffffff81085930>] ? process_one_work+0x3d0/0x3d0
Nov 18 14:00:50 hopa kernel:  [<ffffffff8108b47d>] ? kthread+0xbd/0xe0
Nov 18 14:00:50 hopa kernel:  [<ffffffff8108b3c0>] ? kthread_create_on_node+0x170/0x170
Nov 18 14:00:50 hopa kernel:  [<ffffffff8155984f>] ? ret_from_fork+0x3f/0x70
Nov 18 14:00:50 hopa kernel:  [<ffffffff8108b3c0>] ? kthread_create_on_node+0x170/0x170
Nov 18 14:00:50 hopa kernel: Code: 85 ff 74 71 41 8b 47 10 ba a6 0e 00 00 85 c0 74 64 31 db eb 0e 83 c3 01 48 85 d2 7e 4f 41 39 5f 10 76 52 48 63 c3 49 8b 6c c7 18 <48> 8b 45 40 a8 01 75 e2 48 8b 45 08 48 8b 40 18 48 85 c0 74 11 
Nov 18 14:00:50 hopa kernel: RIP  [<ffffffffa089abeb>] ttm_bo_wait+0x6b/0x170 [ttm]

Not sure if this particular traceback is associated with all the stalls -- I think that in the majority of cases the system stalls before any log/journal gets written to the drive, so it is usually not available after reboot.
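As a cross-check of the oops above, the faulting address can be reproduced from the register dump and the marked byte in the Code: line. The sketch below is a hand decode, not output from a disassembler: it assumes the four bytes starting at the marker (48 8b 45 40) are a 64-bit MOV that loads from RBP plus an 8-bit displacement, and it handles only that one encoding. The helper names are made up for this example. What data structure sits at that offset is not established here.

```python
import re

# Lines copied from the oops above, with the journal prefix stripped.
OOPS = """\
BUG: unable to handle kernel NULL pointer dereference at 0000000000000041
RBP: 0000000000000001 R08: ffff8800a5c5cc78 R09: 0000000000000000
CR2: 0000000000000041 CR3: 000000043018d000 CR4: 00000000001406e0
Code: 85 ff 74 71 41 8b 47 10 ba a6 0e 00 00 85 c0 74 64 31 db eb 0e 83 c3 01 48 85 d2 7e 4f 41 39 5f 10 76 52 48 63 c3 49 8b 6c c7 18 <48> 8b 45 40 a8 01 75 e2 48 8b 45 08 48 8b 40 18 48 85 c0 74 11
"""


def register(name, text):
    """Return the 64-bit value printed after 'NAME: ' in the register dump."""
    match = re.search(rf"\b{name}: ([0-9a-f]{{16}})", text)
    return int(match.group(1), 16)


def faulting_bytes(text, count=4):
    """Return `count` bytes of the Code: line, starting at the <..> marker."""
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    return [int(b.strip("<>"), 16) for b in code[start:start + count]]


def rbp_displacement(insn):
    """Decode only 'REX.W 8B /r' with mod=01 and rm=RBP; return the disp8."""
    rex, opcode, modrm, disp = insn
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not the MOV r64, r/m64 form assumed here")
    if modrm >> 6 != 0b01 or modrm & 0b111 != 0b101:
        raise ValueError("not an [rbp + disp8] operand")
    return disp - 0x100 if disp >= 0x80 else disp  # disp8 is signed


rbp = register("RBP", OOPS)
disp = rbp_displacement(faulting_bytes(OOPS))
fault_address = rbp + disp
print(f"RBP={rbp:#x} + disp={disp:#x} -> {fault_address:#x}")
print("matches CR2:", fault_address == register("CR2", OOPS))
```

Under that assumption, RBP holds 1 rather than a valid pointer, and the load at offset 0x40 from it lands on the reported address, which suggests a bogus pointer value rather than a plain NULL.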

I also set up xrandr providers to enable external displays connected to the docking station:

xrandr --setprovideroffloadsink radeon Intel
xrandr --setprovideroutputsource radeon Intel

but I think I had stalls prior to doing that (though that was with older kernels etc.)
Comment 1 Alex Deucher 2015-11-19 16:11:21 UTC
Might be similar to bug 92258.
Comment 2 Michel Dänzer 2015-11-20 01:39:34 UTC
Yep, please try Maarten's patch from bug 92258 for additional information.

Also, can you narrow down which kernel version/change introduced the problem, ideally using git bisect?
Comment 3 Yaroslav Halchenko 2015-11-20 02:05:38 UTC
Thank you Michel for your response!

Kernel: according to my IRC log on #intel-gfx, the stalls started with the upgrade to 4.2.0-1-amd64, and according to an old copy of the journal I was running

Oct 25 11:22:54 hopa kernel: Linux version 3.17-1-amd64 (debian-kernel@lists.debian.org) (gcc version 4.8.3 (Debian 4.8.3-13) ) #1 SMP Debian 3.17-1~exp1 (2014-10-14)

before that.  Bisection, I guess, will be a measure of last resort -- this laptop is my main workhorse and the halt is not 100% reproducible.

Patch: applied and rebuilding now.  I will report as soon as it halts again (I will do some forceful, playful interaction with the external displays tomorrow) or if I can't trigger the halt.  Thanks!
Comment 4 Yaroslav Halchenko 2015-11-20 15:15:21 UTC
Created attachment 119987 [details]
journalctl output (a bit anonymized) showing details of the session with the crash
Comment 5 Yaroslav Halchenko 2015-11-20 15:19:22 UTC
Reporting on "success": after installing the new patched kernel and some upgrades (GNOME rather than the kernel kept crashing, so I had to upgrade), I caused the stall again, with a slightly different but overall similar traceback (the full journalctl output for that boot is attached):

Nov 20 10:04:38 hopa kernel: [drm:intel_hdmi_detect] [CONNECTOR:53:HDMI-A-2]
Nov 20 10:04:38 hopa kernel:  ffff88043c187858 0000000000000001 ffffffff81555c51 ffff88043c678080
Nov 20 10:04:38 hopa kernel: Call Trace:
Nov 20 10:04:38 hopa kernel:  [<ffffffff8108678d>] ? wq_worker_sleeping+0xd/0x90
Nov 20 10:04:38 hopa kernel:  [<ffffffff81555835>] ? __schedule+0x505/0x8f0
Nov 20 10:04:38 hopa kernel:  [<ffffffff81555c51>] ? schedule+0x31/0x80
Nov 20 10:04:38 hopa kernel:  [<ffffffff8107209c>] ? do_exit+0x72c/0xa90
Nov 20 10:04:38 hopa kernel:  [<ffffffff810175ec>] ? oops_end+0x9c/0xd0
Nov 20 10:04:38 hopa kernel: [drm:intel_hdmi_detect] Live status not up!
Nov 20 10:04:38 hopa kernel: [drm:drm_helper_probe_single_connector_modes_merge_bits] [CONNECTOR:53:HDMI-A-2] disconnected
Nov 20 10:04:38 hopa kernel:  [<ffffffff8155b5d8>] ? general_protection+0x28/0x30
Nov 20 10:04:38 hopa kernel:  [<ffffffff8140689f>] ? reservation_object_test_signaled_rcu+0xcf/0x220
Nov 20 10:04:38 hopa kernel:  [<ffffffff81406ef9>] ? reservation_object_wait_timeout_rcu+0x219/0x260
Nov 20 10:04:38 hopa kernel:  [<ffffffffa0832b29>] ? ttm_bo_wait+0x29/0x50 [ttm]
Nov 20 10:04:38 hopa kernel:  [<ffffffffa0833207>] ? ttm_bo_cleanup_refs_and_unlock+0x27/0x170 [ttm]
Nov 20 10:04:38 hopa kernel:  [<ffffffffa083340f>] ? ttm_bo_delayed_delete+0xbf/0x200 [ttm]
Nov 20 10:04:38 hopa kernel:  [<ffffffffa0833567>] ? ttm_bo_delayed_workqueue+0x17/0x40 [ttm]
Nov 20 10:04:38 hopa kernel:  [<ffffffff810856ff>] ? process_one_work+0x19f/0x3d0
Nov 20 10:04:38 hopa kernel:  [<ffffffff8108597d>] ? worker_thread+0x4d/0x450
Nov 20 10:04:38 hopa kernel:  [<ffffffff81085930>] ? process_one_work+0x3d0/0x3d0
Nov 20 10:04:38 hopa kernel:  [<ffffffff8108b47d>] ? kthread+0xbd/0xe0
Nov 20 10:04:38 hopa kernel:  [<ffffffff8108b3c0>] ? kthread_create_on_node+0x170/0x170
Nov 20 10:04:38 hopa kernel:  [<ffffffff8155984f>] ? ret_from_fork+0x3f/0x70
Nov 20 10:04:38 hopa kernel:  [<ffffffff8108b3c0>] ? kthread_create_on_node+0x170/0x170
Nov 20 10:04:38 hopa kernel: Code: 48 c7 c7 b2 07 80 81 e8 83 39 fe ff e9 bf fe ff ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 40 04 00 00 <48> 8b 40 d8 c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 
Nov 20 10:04:38 hopa kernel: RIP  [<ffffffff8108ba2c>] kthread_data+0xc/0x20
Nov 20 10:04:38 hopa kernel:  RSP <ffff88043c187b98>
Nov 20 10:04:38 hopa kernel: CR2: ffffffffffffffd8
Nov 20 10:04:38 hopa kernel: ---[ end trace 01c0854cd2e7cf2f ]---
Nov 20 10:04:38 hopa kernel: Fixing recursive fault but reboot is needed!

To stall it, I had both displays connected, with the 2nd one just mirroring the first. Then I turned off the 2nd display, which caused all the mess.

what could be the next step? ;-)

BTW -- with this recent upgrade, the two attached monitors are now also seen as an extended desktop (3840x1200), which never happened before and actually works quite nicely. But that setup also crashed (no traceback was recorded and I didn't have a remote session attached) with the same trick of turning the 2nd display off.
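To compare the call traces from the two crashes, a small script can reduce each one to the list of functions from a given module. This is a minimal sketch: the regular expression only covers the frame format seen in these logs, the function and variable names are made up for this example, and the input lines are copied from the traces above with the journal prefix stripped. In the first oops, ttm_bo_wait appears in the RIP line rather than in the call trace, so it is not part of the first list.

```python
import re

# Matches frames such as: [<ffffffffa089b327>] ? name+0x27/0x170 [ttm]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] \? (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)

FIRST_TRACE = """\
 [<ffffffffa089b327>] ? ttm_bo_cleanup_refs_and_unlock+0x27/0x170 [ttm]
 [<ffffffffa089b52f>] ? ttm_bo_delayed_delete+0xbf/0x200 [ttm]
 [<ffffffffa089b687>] ? ttm_bo_delayed_workqueue+0x17/0x40 [ttm]
 [<ffffffff810856ff>] ? process_one_work+0x19f/0x3d0
"""

SECOND_TRACE = """\
 [<ffffffff81406ef9>] ? reservation_object_wait_timeout_rcu+0x219/0x260
 [<ffffffffa0832b29>] ? ttm_bo_wait+0x29/0x50 [ttm]
 [<ffffffffa0833207>] ? ttm_bo_cleanup_refs_and_unlock+0x27/0x170 [ttm]
 [<ffffffffa083340f>] ? ttm_bo_delayed_delete+0xbf/0x200 [ttm]
 [<ffffffffa0833567>] ? ttm_bo_delayed_workqueue+0x17/0x40 [ttm]
 [<ffffffff810856ff>] ? process_one_work+0x19f/0x3d0
"""


def module_frames(trace, module):
    """Return function names, in trace order, for frames tagged [module]."""
    names = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(2) == module:
            names.append(match.group(1))
    return names


first = module_frames(FIRST_TRACE, "ttm")
second = module_frames(SECOND_TRACE, "ttm")
# Frames present in both traces, kept in the order of the first trace.
shared = [name for name in first if name in second]
print("first: ", first)
print("second:", second)
print("shared:", shared)
```

Both traces pass through the same TTM delayed-delete worker path, which is consistent with the two crashes being related, though it does not prove a common cause.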
Comment 6 Yaroslav Halchenko 2015-11-20 17:35:25 UTC
Created attachment 119998 [details]
cut/paste terminal output for the 2nd crash: BUG: unable to handle kernel NULL pointer dereference at 0000000000000042

The beast crashed again... I don't remember if I had these before -- the screen simply went off due to inactivity (maybe it was also locked -- I was away from the laptop), and when I came back it was stalled.  I had an ssh session open on another box watching journalctl -f (nothing in the logs on the drive after reboot).  The last messages:

Nov 20 12:20:04 hopa kernel: [drm:drm_crtc_helper_set_config] attempting to set mode from userspace
Nov 20 12:20:04 hopa kernel: [drm:drm_mode_debug_printmodeline] Modeline 57:"" 0 296400 3840 3888 3920 4000 1200 1203 1209 1235 0x0 0x5
Nov 20 12:20:04 hopa kernel: [drm:radeon_encoder_set_active_device] setting active device to 00000008 from 00000008 00000008 for encoder 2
Nov 20 12:20:04 hopa kernel: [drm:drm_crtc_helper_set_mode] [CRTC:29]
Nov 20 12:20:04 hopa kernel: [drm:radeon_atom_encoder_dpms] encoder dpms 30 to mode 3, devices 00000080, active_devices 00000000
Nov 20 12:20:04 hopa kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000042
Comment 7 Yaroslav Halchenko 2015-11-20 17:44:30 UTC
Aha -- I think I found what triggered it, since I did it again and it stalled, probably identically (I didn't have a remote console :-/).  I ran

DISPLAY=:0 0install run -c http://gfxmonk.net/dist/0install/shellshape.xml --replace

to try shellshape, and when it finished downloading it did something which triggered the bug, and the screens went blank.  It is probably a different, although possibly related, issue, since during the original stalls I still have something on the screens.  In this case they just go into suspend mode etc.  Do you think I should file a separate report on this one?
Comment 8 polo 2015-11-23 16:42:53 UTC
I see you have the same laptop as me, a ZBook 14. The DP ports on the docking station are connected only to the AMD GPU.

I have a DRI_PRIME=1 issue with newer kernels (probably starting with 3.19); maybe it is related:

https://bugzilla.opensuse.org/show_bug.cgi?id=954783
Comment 9 Yaroslav Halchenko 2015-11-24 13:27:03 UTC
For me with DRI_PRIME=1 it even sometimes does not render at all... first I thought it happens only with external display, but nope -- also happens straight on laptop screen unpredictably.  But no crashes from that so far during my trials
Comment 10 polo 2015-11-26 09:06:34 UTC
https://wiki.archlinux.org/index.php/PRIME

DRI_PRIME=1 needs xrandr compositing, and it crashes with 4.1 (3.19 and up) after a few minutes with multiple glmatrix instances running simultaneously, or with a game...  With 4.3, glmatrix works fine for a couple of hours, but it crashes randomly during gameplay; same with 4.2. Maybe it is a completely different bug affecting the 4.2/4.3 kernels.
Comment 11 Martin Peres 2019-11-19 09:10:05 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/663.
