Nouveau seems (seems, it may have been a coincidence, or nouveau failure + hw failure) to have corrupted my laptop's screen EDID info.
So, here is a nice log of what happened:
Now, a bit of background: I've been using almost exclusively Debian on this laptop for about 4 years. I have been using nvidia blob, then nv, back to nvidia blob, and finally nouveau, happily following (not too closely) git versions.
Now, right after Debian Squeeze update, I've decided to stick with Debian packages for kernel and X stuff, so, I'm using Debian's version since then. I've been using nouveau from Debian/testing for two weeks or so without any problem of any sort.
Then, on February 19, I've decided to try 3D, and so, have upgraded nouveau (actually, libdrm-nouveau1a, X and drm stuff) to Debian/unstable versions, and installed both linux-image-2.6.37-1-686 and libgl1-mesa-dri-experimental (that contains the gallium3D part of nouveau).
I was able to play some GL-using games like Teeworlds, rRootage, Minetest, or Inside a Star-filled Sky. I've, well, mostly played for a day or two, and then went back to other occupations.
Then, on February 21, my screen suddenly turned black as I was working on my own game project (something using pygame, that doesn't use GL in any way, or at least in any way I know of).
I was able to safely power off my computer by blindly logging in as root and typing "poweroff".
Then, the real trouble begins. Upon reboot, the backlights are on, but the BIOS doesn't show anything on the screen, nor does GRUB.
I'm then plugging in an external monitor and rebooting again. Still nothing on the laptop's screen, but the external monitor shows the GRUB loading lines, and... nothing.
Then, I've used a Debian Live USB stick, and syslinux booted, although it showed me shitty graphics instead of the fancy boot menu. But anyway, it worked. I then used the installation system present on the LiveUSB to change my GRUB config and disable the whole graphical menu thing. It worked.
Now, one more reboot, still nothing on the laptop's screen, but Windows (XP) booted and I was even able to play Quake Wars (on the external monitor, though). I then tried to use the laptop's screen in Windows. As soon as I enabled it, it showed flickering vertical and horizontal lines on a black background... When powering off, I was able to see the Windows logo a few times, but it stayed for a fraction of a second before being vertically distorted.
Now, back on Debian, it worked quite fine on the external monitor, too. I've then reverted libdrm, X stuff and nouveau to Debian/testing versions and, upon reboot, my screen worked fine.
But it works fine only with nouveau: BIOS, GRUB and Windows still fail to use the laptop's screen.
In addition, the logs show repeated reports of EDID corruptions, and the dumped memory does actually change (see kern.log.unstable).
Created attachment 43640
Kernel log from first boot with nouveau from Debian/unstable to crash
Here are the logs from first boot with nouveau from Debian/unstable to the black screen.
Compressed because of huge size.
Created attachment 43641
Kernel log since first boot after having reverted nouveau to Debian/testing
Created attachment 43642
lspci -vvv output
Created attachment 43734
Kernel log of the working kernel with drm.debug=1
I'm now sure it's an EDID corruption, and nothing else.
The reason why the "old" nouveau works has nothing to do with nouveau, but with drm: as you can see in the kern.log.drm.debug file, the EDID header got corrected by drm.
Extracting the EDID dumps from kern.log.unstable reveals some pattern in the "corruption" (see edid-changes).
I have to add that the EDID information corrected by DRM (as found in the "edid" sysfs node) is the same as the last dump of the kern.log.unstable logs except for the first byte (corrected to right EDID header).
Since the crash occurred at the exact time I started my game, I guess something in pygame's initialisation might have caused the "corruption". It would partially explain the corruption "cycles".
Created attachment 43735
Extracted EDID dumps, with date and number of consecutive reports
Created attachment 43772
Xorg logs with blob
I've tried the nvidia blob, and as expected it kept failing (see Xorg.0.log.blob) until I passed it the "fixed" EDID info using the "CustomEDID" option.
I've made sure the EDID is loaded from I2C (and not VBIOS or anything else) by adding a few debug statements.
I've also figured out exactly why "new" DRM doesn't fix the header: it doesn't check the whole record in one go, and tries to determine whether the current block is the first by checking the first byte.
In my case, the first byte is wrong (0x1a instead of 0x00), so, it doesn't get corrected.
I've modified the code and forced the first byte (of the whole EDID data) to 0 so I can use my laptop's screen with 2.6.37-1.
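To illustrate the difference between the two header checks described above, here is a minimal sketch in Python. It is not the kernel's code: the function names and the `min_score` threshold of 6 are assumptions made for illustration. Only the 8-byte EDID header pattern and the 0x1a first byte come from this report.

```python
# Standard 8-byte EDID header pattern (00 ff ff ff ff ff ff 00).
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def first_byte_says_base_block(block):
    """Hypothetical 'first byte only' test: a wrong byte 0 hides the header."""
    return block[0] == 0x00


def header_score(block):
    """Count how many of the 8 header bytes match the expected pattern."""
    return sum(1 for got, want in zip(block[:8], EDID_HEADER) if got == want)


def repair_header(block, min_score=6):
    """Rewrite the header if enough bytes match (threshold is an assumption)."""
    fixed = bytearray(block)
    if header_score(block) >= min_score:
        fixed[:8] = EDID_HEADER
    return bytes(fixed)


# A block whose first byte is 0x1a instead of 0x00, as in this report.
# The remaining 120 bytes are zero-filled placeholders, not real EDID data.
corrupted = bytes([0x1A]) + EDID_HEADER[1:] + bytes(120)

print(first_byte_says_base_block(corrupted))  # False
print(header_score(corrupted))                # 7
print(repair_header(corrupted)[:8].hex())     # 00ffffffffffff00
```

Comparing the whole header tolerates a single damaged byte, whereas a test on byte 0 alone skips the repair in exactly the case seen here.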
Running linux 2.6.37-1 again, I've had 3 different bad EDID reports, but I haven't found a way to reproduce it yet.
TL;DR (short) version:
I have seen very, very similar symptoms on two Dell (D620, XPS M1710) laptops running nouveau since about the same time. The laptop panels no longer work properly in Linux, in BIOS, or in Windows 7, however in one case (D620), proper fiddling with the BIOS display select hotkey during POST results in partial functionality under certain circumstances.
Like the OP, I get EDID blocks with an invalid first byte (often 0xa9 or 0xa1) and/or invalid leading 0x 00ff ffff ffff ff00 reported in dmesg.
I think this is the same issue described in #4552.
Given the amount of my hardware that I believe to have been affected, I am hereby offering as a bounty a hardware donation of ~200 USD worth of at-least-sort-of relevant hardware of his/her choosing to any developer able to resolve this in such a way that my laptop panels start working again, in the interest of keeping him/her well stocked with hardware to hack. If you're up to your eyeballs in cards to work on, I'll ship to another hacker of your choosing or consider a favorite charity.
I don't remember when the problems began, but it would have been within a week or two of the OP's 19 February date. First the M1710's panel, and then, several days later, the D620's panel began to misbehave.
The M1710 will not drive its panel or any external monitor (DVI or VGA) in BIOS, Linux, or Windows 7 after booting with the panel connected to the motherboard LVDS plug. Booting with the LVDS unplugged allows me to use external DVI and/or VGA monitors as normal, however even reconnecting the LVDS at this point does not restore panel functionality. In either case, the panel backlight seems to turn on at the proper times, but no image is displayed on the screen.
The D620 experienced similar symptoms with the exception that the internal panel could occasionally be coaxed into working in Windows, especially when connected to the Dell panel as described below. Pressing the Fn+F8 output select key combination an appropriate number of times during POST would reliably restore LVDS output as long as at least one other monitor was connected, and LVDS output would persist after the other monitor was disconnected, allowing laptop-style operation until reboot. In either of these cases in which the LVDS panel's operation could be restored, both BIOS and whichever operating system I would subsequently boot would appear confused about the panel's logical size and the display would appear cropped until I manually set the display to 1440x900, the LVDS panel's native mode. To add insult to injury, the D620's backlight burned out about a week after the other problems began, but that is only related inasmuch as it makes further tinkering difficult and unrewarding.
Both laptops were routinely operated in and out of a variety of Dell docking stations around the time of the failures. One dock is connected to an old 17" Dell panel via DVI, and VGA was sometimes connected instead/in addition after symptoms began for diagnostic purposes. Two additional docks are connected by DVI to an IOGEAR GCS1104 4-port DVI/USB/2.1 audio KVM switch, which as best I can tell supports capturing the EDID block from the attached monitor and replaying it to the attached PCs as necessary to simulate its physical connection. This switch is connected to an Acer panel via DVI. I have also tried both systems undocked with only the internal panel plugged in, and with various combinations of the Dell and Acer panels plugged into the DVI and VGA ports on the laptops themselves. Internal panel functionality is similarly impaired across configurations except as noted above.
Both machines have been running in-kernel nouveau since some time before the problems began. Both kernels are patched with Gentoo patches, Grsecurity, and TuxOnIce, however the in-kernel graphics stack is probably very similar to the corresponding vanilla kernel since none of those patches seems likely to much affect it.
The M1710 was running x86_64 188.8.131.52 at the time of the failure on a Dell OEM'd G71 (PCI ID 10de:0297). Its internal panel is 1920x1200.
The D620 was running x86 2.6.37 at the time of the failure on a Dell OEM'd G72M (PCI ID 10de:01d7). Its internal panel is 1440x900.
Both machines have since been upgraded to 184.108.40.206.
Both machines report EDID errors dozens of times in dmesg. Often the error is obviously in the first byte, which is often 0xa8 or 0xa1 instead of 0x00, which results in the checksum being incorrect and the signature invalid. Sometimes the EDID appears to be garbage, and sometimes the reported garbage or the visible defect in the reported EDID changes from message to message in dmesg. Tomorrow I will reboot and attach clean dmesg logs showing representative samples of the error messages. I will also try booting from vanilla 220.127.116.11 tomorrow and see what I get.
Both machines report syntactically correct EDIDs in response to xrandr --verbose, however the reported EDIDs do not match the attached displays but rather correspond to other displays currently or previously attached.
I will attach everything I can think of tomorrow; please let me know what additional information I can provide. As stated above, I think this issue is the same as or at least related to #4552. Finally, I am totally serious about donating hardware to whoever is able to help fix my laptop panels. Just imagine yourself driver hacking on that new card or testing multi-head with that extra monitor =).
(In reply to comment #7)
> TL;DR (short) version:
> I have seen very, very similar symptoms on two Dell (D620, XPS M1710) laptops
> running nouveau since about the same time. The laptop panels no longer work
> properly in Linux, in BIOS, or in Windows 7, however in one case (D620), proper
> fiddling with the BIOS display select hotkey during POST results in partial
> functionality under certain circumstances.
I wouldn't rule out a manufacturing defect; interestingly, both laptops are mentioned in the nVidia GPU litigation.
> Like the OP, I get EDID blocks with an invalid first byte (often 0xa9 or 0xa1)
> and/or invalid leading 0x 00ff ffff ffff ff00 reported in dmesg.
> I think this is the same issue described in #4552.
> Given the amount of my hardware that I believe to have been affected, I am
> hereby offering as a bounty a hardware donation of ~200 USD worth of
> at-least-sort-of relevant hardware of his/her choosing to any developer able to
> resolve this in such a way that my laptop panels start working again, in the
> interest of keeping him/her well stocked with hardware to hack. If you're up
> to your eyeballs in cards to work on, I'll ship to another hacker of your
> choosing or consider a favorite charity.
Have you considered donating/lending out one of the affected laptops instead? It looks like physical access and being able to take the laptop apart would actually be helpful here to diagnose the problem.
Created attachment 45143
D620 dmesg from not long after boot
built-in LVDS panel (no image)
Acer DVI monitor via Dell docking station, IOGEAR GCS1104 EDID-aware DVI KVM
Created attachment 45144
M1710 dmesg from not long after boot
built-in LVDS panel (no image)
Acer DVI monitor via Dell docking station, IOGEAR GCS1104 EDID-aware DVI KVM
Created attachment 45145
D620 lspci -vvv
Created attachment 45146
M1710 lspci -vvv
Created attachment 45147
D620 Xorg log
Same display configuration as described for dmesg above
Created attachment 45148
M1710 Xorg log
Same display configuration as described for dmesg above
I have not figured out how this happened, or whether it is nouveau-related, but I have developed a workaround which has successfully and completely restored two of my affected monitors (the backlight on the third died, so I am waiting until the replacement part arrives to tinker with it). Will the problem happen again? Maybe so, but the resulting breakage is now sufficiently easy for me to repair that I'm not too worried about it.
I was able to restore my monitors by reading the EDID from the EEPROM in the monitor using DDC / I2C (using SMBus commands), fixing the corrupt bytes (which fortunately were only in the header and were therefore easy to fix), and writing the corrected EDID back to the monitor's EEPROM. I did this using a C program and the Linux I2C interfaces exported by the video driver (still nouveau in one case, radeon in the other). My program requires inclusion of a header from the i2c-tools package; I will attach a copy along with the source for posterity's sake.
Compile the program with -std=gnu99 and at least -O due to inlined SMBus functions. Invoke the program with no arguments or read the source (or run strings on the binary) to see usage help. I recommend the following use, which also invokes parse-edid from the read-edid package:
# Find the right i2c device; reading one that's not DDC will probably give ENODEV
edid-tool /dev/i2c-0 read > edid-bad
# If you don't get warnings here about either the header or checksum being bad,
# you probably have some other problem.
edid-tool /dev/i2c-0 fix < edid-bad > edid-fixed
parse-edid < edid-fixed
# parse-edid will read several of the display related fields out of the EDID and
# generate an Xorg.conf Monitor section; CHECK IT TO MAKE SURE IT LOOKS SANE
# BEFORE YOU FLASH IT BACK TO YOUR MONITOR.
edid-tool /dev/i2c-0 write < edid-fixed
If there's something more seriously wrong with the EDID in your monitor's EEPROM than a bad header or checksum, you will need to either go field by field through the EDID standard and your monitor specifications, which hopefully you have somewhere, and generate one, or find an identical or very similar monitor elsewhere and capture its EDID. If you have a monitor with multiple inputs, they likely have distinct EEPROMs; if so, and only one is corrupted, you might well be able to capture the proper EDID from one and flash it to the other.
Please let me know if this works for you or if you suspect there are bugs in the program.
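As a complement to the parse-edid sanity check recommended above, a few fields can be spot-checked by hand before flashing. The sketch below is not part of edid-tool or read-edid; the function names are made up for illustration, and it assumes the usual EDID 1.x base-block layout (manufacturer ID in bytes 8-9, first detailed timing descriptor at byte 54). The sample bytes are copied from the raw EDID dump posted later in this thread; the rest of the block is zero-filled.

```python
def manufacturer_id(edid):
    """Decode the three-letter vendor ID packed as 5-bit letters in bytes 8-9."""
    word = (edid[8] << 8) | edid[9]
    return "".join(
        chr(ord("A") - 1 + ((word >> shift) & 0x1F)) for shift in (10, 5, 0)
    )


def first_detailed_timing(edid):
    """Return (h_active, v_active, pixel_clock_khz) or None if unused."""
    d = edid[54:72]
    pixel_clock_khz = (d[0] | (d[1] << 8)) * 10  # stored in 10 kHz units
    if pixel_clock_khz == 0:
        return None  # descriptor is not a timing descriptor
    h_active = d[2] | ((d[4] & 0xF0) << 4)  # low 8 bits + high nibble
    v_active = d[5] | ((d[7] & 0xF0) << 4)
    return h_active, v_active, pixel_clock_khz


# Only the fields decoded here are filled in; everything else stays zero.
sample = bytearray(128)
sample[8:10] = bytes.fromhex("06 10")
sample[54:72] = bytes.fromhex(
    "7c 2e 90 a0 60 1a 1e 40 30 20 36 00 4b cf 10 00 00 18"
)

print(manufacturer_id(sample))        # APP
print(first_detailed_timing(sample))  # (1680, 1050, 119000)
```

If the decoded vendor or resolution does not match the panel, the EDID probably has damage beyond the header and checksum, and flashing it back would not help.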
Created attachment 45638
C program for reading, writing, and manipulating contents of EDID EEPROMs attached to I2C/DDC interfaces using SMBus
/* Copyright (C) 2011 Andy Getzendanner //
// Hereby licensed AS-IS to anyone who wants it //
// Responsibility and liability for unintended consequences including //
// hardware damage are hereby disclaimed; USE ONLY AT YOUR OWN RISK //
// Note: compile with at least -O due to inlined SMBus functions //
// Note: errors in i2c-dev.h likely result from using the kernel-internal //
// version of that file rather than the userspace-facing one from the //
// i2c-tools package, available from: //
// http://www.lm-sensors.org/wiki/I2CTools */
Created attachment 45639
Userspace-facing I2C interface header from i2c-tools
Available from http://www.lm-sensors.org/wiki/I2CTools; included here in case that site gets nuked.
I've been using 2.6.38 (with a quick hack to force the first EDID byte to its proper value) for a while (several weeks) now, with Gallium3D and all, without noticing any new corruption (no single EDID error in the logs).
Thanks to Andy Getz's program, I have been able to fix my screen's EDID info a few days ago, and I am now running "stock" (without my hack) 2.6.38, and the problem seems to be entirely solved.
So, this bug exists in 2.6.37 (maybe a race condition between two i2c users? Say, nouveau and gspca), but we can safely assume it's fixed now.
> Please let me know if this works for you or if you suspect there are bugs in
> the program.
It doesn't work for me. I tried it on the broken DVI input of my monitor and the corrected EDID data did not look right (it was total rubbish). So I tried it on the working HDMI input of my monitor. I get
:; ./edid-tool /dev/i2c-0 read > edid-bad
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123 4567 89ab cdef
00000000 80 08 08 0e 0a 61 40 00 05 25 40 00 82 08 00 00 |.... .a@. .%@. ....|
00000010 0c 08 38 01 02 00 03 3d 50 50 60 32 1e 32 2d 01 |..8. ...= PP`2 .2-.|
00000020 17 25 05 12 3c 1e 1e 00 36 39 7f 80 14 1e 00 00 |.%.. <... 69.. ....|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12 f9 |.... .... .... ....|
00000040 7f 98 00 00 00 00 00 00 04 32 47 2d 55 44 49 4d |.... .... .2G- UDIM|
00000050 4d 00 00 00 00 00 00 00 00 00 00 00 00 09 25 23 |M... .... .... ..%#|
00000060 cc 4e 64 00 00 00 00 00 00 00 00 00 00 00 00 00 |.Nd. .... .... ....|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |.... .... .... ....|
WARN at 211: Bad header: 0x8008 080e 0a61 4000
WARN at 219: Bad checksum: 0xfe
I'm running this as a normal user having chmoded 666 /dev/i2c-0. All of the other /dev/i2c-? give "No such device or address". I'm running a 2.6.39-1-amd64 kernel and Debian unstable. I had to install the package libi2c-dev to get the right /usr/include/linux/i2c-dev.h. I also had to add "#include <stddef.h>" to get ptrdiff_t.
(In reply to comment #19)
I just booted 2.6.38-2-amd64 and now I don't even have a /dev/i2c-0 any more. I'll try to find out how to get it back. I had one with 2.6.39. Strange.
I got the same problem on my MacBook Pro 6,2
System: debian 64bit, with X backport packages (xorg-server 2:1.10.4-1~bpo60+1)
kernel: 3.2.0-rc2 from nouveau/master
I used the nouveau driver for a long time now, and it works pretty well (xrandr, external display, suspend, hibernate perfect). I also can play games like "sauerbraten" with some minor bugs.
I tried to find out whether it can run StarCraft II with Wine: it can't. The X server died gracefully, and that corrupted the EDID of the LVDS display:
kernel: [ 471.507583] [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 217
kernel: [ 471.507587] Raw EDID:
kernel: [ 471.507590] 00 ff ff ff ff ff ff 00 06 10 bb 9c 00 00 00 00
kernel: [ 471.507594] 00 13 01 03 80 21 15 78 0a 50 c5 98 58 52 8e 00
kernel: [ 471.507597] 25 50 54 00 00 00 01 01 01 01 01 01 01 01 01 01
kernel: [ 471.507600] 01 01 01 01 01 01 7c 2e 90 a0 60 1a 1e 40 30 20
kernel: [ 471.507603] 36 00 4b cf 10 00 00 18 00 00 00 01 00 06 10 30
kernel: [ 471.507606] 00 00 00 00 00 00 00 00 0a 20 00 00 00 fe 00 4c
kernel: [ 471.507609] 50 31 35 34 57 45 33 2d 54 4c 42 31 00 00 00 fe
kernel: [ 471.507612] 00 43 6f 6c 6f 72 20 4c 43 44 0a 20 20 20 00 dd
kernel: [ 471.507619] nouveau 0000:01:00.0: LVDS-1: EDID block 0 invalid.
kernel: [ 471.507624] [drm] nouveau 0000:01:00.0: DDC responded, but no EDID for LVDS-1
On boot, the screen stays black after the boot sound.
Blindly selecting a boot option in rEFIt works:
OS X comes up, but:
* suspend broken
* starcraft does not launch at all
* external display not detected
* seems to run on the nvidia card only, no longer on the intel gpu
linux comes up, but:
* internal LVDS stays black
* external display works
* *ERROR* EDID checksum spam in system log
I found the correct EDID in an older Xorg.log. Byte 0x1f had flipped from 0x27 to 0x00.
I was able to fix this with Andy Getz's program: 1000 Thanks !!!
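The checksum error in the log above can be reproduced from the dump itself. The following Python sketch parses the "Raw EDID" lines, sums the 128 bytes modulo 256, and then restores byte 0x1f to 0x27 as found in the older Xorg.log. The parsing helper is an illustration written for this log format only, not a general dmesg parser.

```python
import re

# Raw EDID lines exactly as reported in the kernel log above.
DMESG = """\
kernel: [ 471.507587] Raw EDID:
kernel: [ 471.507590] 00 ff ff ff ff ff ff 00 06 10 bb 9c 00 00 00 00
kernel: [ 471.507594] 00 13 01 03 80 21 15 78 0a 50 c5 98 58 52 8e 00
kernel: [ 471.507597] 25 50 54 00 00 00 01 01 01 01 01 01 01 01 01 01
kernel: [ 471.507600] 01 01 01 01 01 01 7c 2e 90 a0 60 1a 1e 40 30 20
kernel: [ 471.507603] 36 00 4b cf 10 00 00 18 00 00 00 01 00 06 10 30
kernel: [ 471.507606] 00 00 00 00 00 00 00 00 0a 20 00 00 00 fe 00 4c
kernel: [ 471.507609] 50 31 35 34 57 45 33 2d 54 4c 42 31 00 00 00 fe
kernel: [ 471.507612] 00 43 6f 6c 6f 72 20 4c 43 44 0a 20 20 20 00 dd
"""

# A dump row is exactly 16 two-digit hex values separated by single spaces.
HEX_ROW = re.compile(r"(?:[0-9a-f]{2} ){15}[0-9a-f]{2}")


def parse_raw_edid(log_text):
    """Collect the hex rows that follow the timestamp on each log line."""
    data = bytearray()
    for line in log_text.splitlines():
        payload = line.split("] ", 1)[-1].strip()
        if HEX_ROW.fullmatch(payload):
            data += bytes.fromhex(payload)
    return bytes(data)


def checksum_remainder(block):
    """A valid 128-byte EDID block sums to 0 modulo 256."""
    return sum(block[:128]) % 256


edid = parse_raw_edid(DMESG)
print(len(edid), checksum_remainder(edid))  # 128 217

restored = bytearray(edid)
restored[0x1F] = 0x27  # value found in the older Xorg.log
print(checksum_remainder(restored))  # 0
```

The remainder of 217 matches the kernel message, and restoring the single byte brings the sum back to zero, which is consistent with only byte 0x1f having been altered.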
It appears that this bug report has lain dormant for quite a while. Sorry we haven't gotten to it. Since we fix bugs all the time, chances are pretty good that your issue has been fixed with the latest software. Please give it a shot. (Linux kernel 3.10.7, xf86-video-nouveau 1.0.9, mesa 9.1.6, or their git versions.) If upgrading to the latest isn't an option for you, your distro's bugzilla is probably the right destination for your bug report.
In an effort to clean up our bug list, we're pre-emptively closing all bugs that haven't seen updates since 2011. If the original issue remains, please make sure to provide fresh info, see http://nouveau.freedesktop.org/wiki/Bugs/ for what we need to see, and re-open this one.
The Nouveau Team