Created attachment 138464 [details]
Kernel messages with drm.debug=0xe glitch after 24s, before 400s
I was trying to use drm-tip, but that did not boot at all for me. The device is a GPD Pocket using the Atom x7-Z8750. Linux support for this is a community effort, with lots of work by Hans de Goede in his tree at https://github.com/jwrdegoede/linux-sunxi.git . It is a messed up little device, including the tablet screen that needs rotation for a normal desktop experience and audio circuitry that we might get working properly at some time in the future …
I hope the display hardware apart from the rotation is standard enough.
The newest kernel I was able to try is commit ae718bc of https://github.com/jwrdegoede/linux-sunxi.git, which corresponds to 4.16.0-rc5 of sorts. I tried the same config with drm-tip, but the kernel failed to boot (cannot find root, or hangs). The base system is Xubuntu 17.10, installed from a 17.04 image with adaptations to the GPD Pocket.
The issue I have is also discussed at https://github.com/nexus511/gpd-ubuntu-packages/issues/30 at some length. There are notes, reproduction helpers, and screenshots on https://sobukus.de/gpd/displayglitch/ .
I can run the device either with xf86-video-intel (which I prefer because of a working TearFree option) or xf86-video-modesetting. With the latter, I observe abnormal tearing about the end of the first third from the left (top before rotation) of the screen. Maybe that is related. The main issue I have manifests with the xf86-video-intel driver: After some time of use, the screen shifts vertically (horizontally for the non-rotated view) and possibly the colors change (swapping of color channels). This state is reversible: if I continue using the device, more shift can be triggered, but sometimes it also reverts to the normal good state.
The corruption stays there, down to the splash screen when powering down the device. It only changes if the random trigger is successful again or if I force a change like a different resolution or a rotation by 90 degrees (switching to left and then back to right rotation, a simple flip, does not fix things). This reversibility in software is what makes me doubt that simply faulty hardware is to blame.
I devised a complicated scheme of triggering the corruption by remote-controlling evince with a PDF file, but all that is actually needed is the usual simple 60 fps color flip video for demonstrating tearing; more likely, _anything_ that causes enough screen updates increases the chances. Anyhow, you can see some history in https://sobukus.de/gpd/displayglitch/README.txt, along with some statistics that give a hint on how reliably I can reproduce this in a certain time.
It was suggested to me in the intel-gfx IRC channel that this should be a bug in the DRM component. An alternative explanation might be defective hardware, as I seem to be in a small minority of people using the device with this intensity of corruption. But I am not the _only_ one. Maybe there is something in hardware behaviour that varies sufficiently between devices to only trigger this for very few.
The kernel shows no messages during the glitch occurrence with drm.debug=0xe . Anything more I can do? Some state dump before/after glitch?
In the dmesg I am going to attach, the glitch happens between these two lines:
[ 24.506143] IPv6: ADDRCONF(NETDEV_CHANGE): wlp1s0: link becomes ready
[ 459.656003] [drm:drm_helper_probe_single_connector_modes [drm_kms_helper]] [CONNECTOR:79:DSI-1]
The drm message here just relates to me switching modes to fix the corruption.
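For anyone wanting to narrow down a similar window in a saved log, here is a minimal Python sketch that finds the longest silence between consecutive dmesg timestamps. It assumes the default `[ seconds.micros]` timestamp prefix; the function names are made up for illustration and are not part of any kernel tooling.

```python
import re

# Matches the default dmesg prefix, e.g. "[   24.506143] message".
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse_dmesg(lines):
    """Return (timestamp, message) pairs for lines carrying a timestamp."""
    entries = []
    for line in lines:
        match = TS_RE.match(line.strip())
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def largest_gap(entries):
    """Return (gap_seconds, before, after) for the longest silence, or None."""
    best = None
    for before, after in zip(entries, entries[1:]):
        gap = after[0] - before[0]
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best
```

Something like `largest_gap(parse_dmesg(open("dmesg.txt")))` (file name hypothetical) would then report the two messages that bracket the quiet period.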
Created attachment 138465 [details]
photo of the corruption on an Ubuntu Unity desktop
This is from running gpdpocket-20180306-4.16.0-rc3-ubuntu-17.10.1-unity-desktop-amd64.iso from https://gpdpocket.cre.ovh/, a standard Unity desktop with its usual colors, rather distorted now after causing the glitch. A software screenshot shows the correct undistorted image, so this is solely something between driver (DRM) and the display.
Ah, the reason I cannot work with drm-tip is that the Xubuntu initrd was not created. It ran out of space in /boot at around 270M of compressed initrd. The modules for the kernel are rather weighty. The only difference, apart from options missing in your tree, is the new option CONFIG_DRM_I915_DEBUG_GUC, which I set, expecting that any help in debugging would be appreciated. Is that supposed to cause the i915.ko to be 83M in size? Well, nouveau.ko grew to 178M. I've been building kernels since around the turn of the century. This is new to me. What causes these giant modules? Is that normal with drm-tip?
The corruption only covers the DSI1 output. I connected an external monitor to HDMI1 (the GPD has a proper HDMI micro port and USB-C, the video output via USB-C to HDMI2 and DP is not working yet). I played http://sobukus.de/gpd/displayglitch/kenjo_vidtest_60fps.mp4 for a little while, having the mplayer window partly on the DSI1 and partly on HDMI1, and triggered the corruption on the internal display. The picture on HDMI1 is still fine.
Came here via https://bugs.freedesktop.org/show_bug.cgi?id=98876 .
Quick question: What do you need that Sunxi kernel for when your device is having an Intel x86_64 CPU?
(In reply to N. W. from comment #4)
> Quick question: What do you need that Sunxi kernel for when your device is
> having an Intel x86_64 CPU?
The sunxi tree just happens to be where Hans de Goede puts his stuff. The name is misleading, yes. I guess Hans just didn't want to work on two repos for devices he hacks on. The GPD is Intel all the way down (actually, I'm not sure where it stops;-).
(In reply to Thomas Orgis from comment #5)
> (In reply to N. W. from comment #4)
> > Quick question: What do you need that Sunxi kernel for when your device is
> > having an Intel x86_64 CPU?
> The sunxi tree just happens to be where Hans de Goede puts his stuff. The
> name is misleading, yes. I guess Hans just didn't want to work on two repos
> for devices he hacks on. The GPD is Intel all the way down (actually, I'm
> not sure where it stops;-).
Probably another reason to try out Fedora Rawhide Workstation: https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Workstation/x86_64/iso/
1. jwrdegoede seems to be a Red Hat employee working on Fedora.
2. Fedora "Rawhide is the name given to the current development version of Fedora. It consists of a package repository called "rawhide" and contains the latest build of all Fedora packages updated on a daily basis." https://fedoraproject.org/wiki/Releases/Rawhide
3. Fedora by default uses GNOME @ Wayland, which is proven to not tear at all (not even under X11).
I figured out that I need to enable module stripping on install to get a manageable initrd. I can confirm now that the glitch still can be easily triggered with drm-tip 8a51883.
Does 'xset dpms force off; xset dpms force on' fix the display as well?
The corruption apparently happens when just page flipping. That sort of rules out straightforward bugs in the DSI enable sequence.
And the panel seems to be even the most simple type you can have:
[ 4.063972] [drm:intel_dsi_vbt_init [i915]] Mode video
[ 4.064020] [drm:intel_dsi_vbt_init [i915]] Dual link: NONE
[ 4.064093] [drm:intel_dsi_vbt_init [i915]] Pixel Format 0
It seems the DSI encoder gets confused somehow and starts sending data out of phase. I could imagine that a FIFO underrun could perhaps lead to that outcome, but there are none reported in the dmesg.
So unfortunately I don't have any good ideas as to the cause of the bug.
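The absence of underrun reports can be double-checked in a saved log with a plain case-insensitive search. A minimal Python sketch, assuming only that any underrun report from the driver contains the word "underrun" (the exact message wording varies between kernel versions); the helper name is hypothetical:

```python
def find_underruns(lines):
    """Return (line_number, text) for every log line mentioning an underrun."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Case-insensitive so differently capitalized messages are caught too.
        if "underrun" in line.lower():
            hits.append((number, line.rstrip("\n")))
    return hits
```

An empty result from `find_underruns(open("dmesg.txt"))` (file name hypothetical) only shows that nothing was logged, not that no underrun occurred.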
Sorry for the late reply … the notification got lost …
(In reply to Ville Syrjala from comment #8)
> Does 'xset dpms force off; xset dpms force on' fix the display as well?
Indeed it does.
I also went through the pain of installing the Windows 10 image on this device and was unable to produce the corruption there. People suggested that my device is faulty, but I really guess it is within spec, just a bit more prone to this than others. There is of course a significant element of chance involved, as I was also unable to reproduce for quite some time on Xubuntu with the linux-sunxi kernel based on the final 4.16.0. I will test that again after purging Windows from the device …
> It seems the DSI encoder gets confused somehow and starts sending data out
> of phase. I could imagine that a FIFO underrun could perhaps lead to that
> outcome, but there are none reported in the dmesg.
Any debugging information I could acquire? Could one implement a runtime diagnostic that recognizes the confusion?
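A state dump taken before and after the glitch could at least be compared mechanically. A minimal Python sketch, assuming debugfs is mounted and the driver exposes a text dump such as /sys/kernel/debug/dri/0/i915_display_info (path and card index are assumptions and may differ per kernel; reading it needs root); the function names are hypothetical:

```python
import difflib
from pathlib import Path

# Assumed location of a display state dump; adjust for the actual system.
STATE_FILE = Path("/sys/kernel/debug/dri/0/i915_display_info")


def snapshot(path=STATE_FILE):
    """Read the state file and return its lines without trailing newlines."""
    return Path(path).read_text().splitlines()


def diff_states(before, after):
    """Return unified-diff lines between two snapshots (empty if identical)."""
    return list(
        difflib.unified_diff(
            before, after, fromfile="before", tofile="after", lineterm=""
        )
    )
```

The idea would be to call `snapshot()` once in the good state and once after the glitch, then print `diff_states(good, bad)`. Whether the out-of-phase condition shows up in such a dump at all is an open question.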
FYI: I'm trying to get a replacement mainboard for the device, suspecting that there is some damage at least relating to the USB C port.
I will test for the display corruption once I get a new mainboard. If I cannot manage to reproduce it then, this is probably a hardware tolerance issue that may or may not be fixable by the driver (after all, wayland and modesetting don't seem to suffer from the corruption).
Hi, how long do you think we should wait? Or should we close this, and you feel free to re-open if the issue is still there after the motherboard change?
Well, I hope to get the replacement in about two weeks. If you prefer, you can close it for now. But I would like to add a comment in any case, even just to confirm that the bug is invalid.
Sure, just add a comment or re-open as you see fit. Closing now.
I was able to test on another machine: I was unable to reproduce the corruption.
So let's assume that this is some kind of weirdness on the hardware side and keep this closed.