Description
Andre
2007-12-11 15:55:36 UTC
Created attachment 13044 [details]
Xorg.0.log
I don't understand what you mean when you say b8770f710729d616b3ac72544aa522161a78f819 changed things... can you clarify? Before that commit S3 worked but only until you did an s2disk? Or did you mean that after that commit S3 fails consistently (i.e. s2disk doesn't fix it)?

I seem to be unable to explain, sorry.
Before the said commit: S3 will not work correctly before I do an s2disk. After resuming from an s2disk, I can S3 just fine. Every time. Until I shut down and reboot. After a "proper" boot: failure; after powering on again from s2disk: all is well.
After the said commit: S3 will not work correctly, no matter what. It no longer depends on the above difference, it just fails unconditionally.

Ok thanks Andre, that clears things up. So it really does seem to be related to 3D state somehow... I could see that restoring the logical context might be an issue if the CCID registers were somehow clobbered (the set context instruction will try to save the current context as well), but in S3 that shouldn't happen. Maybe there's something wrong with the actual 3d state restore; can you try commenting out the call to I830EmitInvarientState to see if that gets us back to the old behavior? That'll at least narrow it down to context save/restore or 3d state load...
Also, you're not using any framebuffer drivers, right (e.g. intelfb, vesafb, uvesafb)? That could potentially cause all sorts of problems...

I have no framebuffer support in the kernel at all, nor the module lying around.
Commenting out the "if" statement calling
# I830EmitInvarientState(pScrn);
in src/i830_driver.c from today's git does not give me back the ability to S3 after an s2disk. But it leaves me with a little less on the screen after resuming from S3: only the dock border, window borders and console fonts (without blanking) are left. So the call does make some difference.

Ok, good to know, thanks for testing.
I'd expect you to lose some of your display w/o the 3d state restore, but it sounds like the real problem is with the context save/restore somehow... If only I could get my hands on one of these laptops.

Would ssh access do any good already?

Zhenyu, is this something you could look at?
Thanks,
Jesse

We have one hp 855GM, on which s3 was fine last time I tried, and one sony vaio which has an s3 issue even without X; some acpi quirks have been tried but it still fails. I'll ask if we can get a thinkpad R50 here.

Zhenyu, any luck on that laptop? Any ideas wrt the 3D context restore?

I have tested on two 855gm machines, one is an hp nx5000, the other a sony vaio. Both machines can resume from s3 nicely in X, although they have other, different problems. But none of them leads to screen corruption after resume. The hp has a problem when switching vt console, which gives a white screen. The sony has a problem with a dim screen after resume; xbacklight has no effect.
And with Eric's mail, it seems we need I830EmitInvarientState for now, and can use hw context instead in future.

ok, /me trying to get up to speed here...
The bugs you describe for the hp and sony do not appear here, those are fine (unless I miss some finer bits about the "dim" screen).
<Quote:>
And with Eric's mail, it seems we need I830EmitInvarientState for now, and can use hw context instead in future.
</Quote>
Here is where I don't get up to speed. The function is defined in the driver, and it gets called conditionally at line 2341 (in current git):
    if (!IS_I965G(pI830)) {
        if (IS_I9XX(pI830))
            I915EmitInvarientState(pScrn);
        else
            I830EmitInvarientState(pScrn);
    }
Is this where I might give calling I830EmitInvarientState(pScrn) unconditionally a try, for testing your clue? Or do I get things all wrong?
(I'm not too bold in trying things I don't grok in the driver, sorry.)

According to the logic, you should be seeing I830EmitInvarientState(pScrn) now, so I don't think there's any need to remove the conditions around it... Maybe Eric has ideas about this one?

Andre, can you get register dumps again with the latest tree, both before suspend and after resume (both from the console)? There's a bit in the CACHE_MODE_0 register that may explain this behavior, I'm curious to see if it changed.

Created attachment 14540 [details]
Regdump from the Console with X running, before S3
Created attachment 14541 [details]
Regdump from the Console with X running, after S3

Here you are. The following is a couple diffs between dumps:

1. The diff of the attached files: regdump before S3 and after S3
# diff regdump_2008-02-24.beforeS3onC regdump_2008-02-24.afterS3onC
164,165c164,165
< (II): CR0e: 0x03
< (II): CR0f: 0xc0
---
> (II): CR0e: 0x04
> (II): CR0f: 0x60

2. Diffs from regdumps before and after S3 from _within_ X
# diff regdump_2008-02-24.beforeS3inX regdump_2008-02-24.afterS3inX
119c119
< (II): SR00: 0x03
---
> (II): SR00: 0x00
164,165c164,165
< (II): CR0e: 0x03
< (II): CR0f: 0xd0
---
> (II): CR0e: 0x00
> (II): CR0f: 0x00

3. Diff of console before starting X and console after starting X (both before S3)
# diff regdump_2008-02-24.beforeXonC regdump_2008-02-24.beforeS3onC
8c8
< (II): RENCLK_GATE_D1: 0x00000000
---
> (II): RENCLK_GATE_D1: 0x00000001
128c128
< (II): ARX: 0x20
---
> (II): ARX: 0x30
164,165c164,165
< (II): CR0e: 0x02
< (II): CR0f: 0x80
---
> (II): CR0e: 0x03
> (II): CR0f: 0xc0

Ok, the regs don't show us anything interesting. Anyway I still suspect some problem with the logical context. I'm putting together a debug patch for you now so we can compare working & broken logical contexts.

Created attachment 15275 [details] [review]
Don't emit 3D state at EnterVT time

Andre, can you confirm that this patch gets you back to the old behavior with the latest driver? Also, can you clarify (ideally with screen shots) the different types of corruption you see with XAA vs. EXA? On re-reading this bug that's one thing that confuses me... If we're missing some 3D state programming in I830EmitInvarientState it seems like that would only affect EXA, not XAA, since the latter just uses software rendering...

Andre, Zhenyu also committed some fixes for 3D state restore after the 2.2.0 release. Can you try 2.2.1 or the git tree?

The problem remained with 2.2.1, and still remains on current git.
I've tried your patch against current git, but I can't see any change with or without the patch.
All of this happens while running exa. xaa comes next, but that may be tomorrow. It's so late it's rather early already...

What exactly a corrupted screen looks like has changed a little again, according to my notes. I cannot possibly tell you why I never tried to take a screenshot of the corrupted screen... but it actually works, to my surprise. I will attach a shot of the normal screen and the corrupt one on exa.

The signs of unresponsiveness:
- No active cursor is indicated in the shell.
- All text is printed but never wiped when deleted or overwritten.
- The mark for the active window (blue header) no longer moves with the active window.
- The fluxbox menu is invisible (but works well :-)
- The active elements of the toolbar (arrows on the left) are gone, and text is not erased there either.

I hope this isn't too abstract, or I shall have to fetch a more colorful theme, heaven forbid :-)

I'm using gentoo's current ~x86 xorg, that is 1.4.0.90.

Created attachment 15283 [details]
This is how it is supposed to look
This is my normal screen.
Created attachment 15285 [details]
This is how it looks after S3
This is the corrupted version.
A finer/further point in difference: The font rendering
of fluxbox fonts has gone awkward, but the
console font is fine.
Ok, thanks for the screenshots. So things still won't work even after an s2disk? If you can get the XAA screenshots eventually that would be nice too.
Thanks.

Oops, accidentally reassigned.

I haven't tested the S3 after s2ram yet, I will have to wait for tomorrow anyway, because in XAA, the screenshot fails:
# xwd -root -out screen_corrupt-xaa.xwd
X Error of failed request: BadValue (integer parameter out of range for operation)
  Major opcode of failed request: 91 (X_QueryColors)
  Value in failed request: 0x30007fe
  Serial number of failed request: 664
  Current serial number in output stream: 664
So I have to find someone with a digicam or cellphone or something :-)

Ok, I did the testing suite
boot -> S3 -> s2disk -> S3 [-> S3]*
again.
1. Vanilla intel git / EXA: S3 always fails.
2. Vanilla intel git / XAA: S3 always fails.
3. Intel git with above patch / EXA: S3 always fails.
4. Intel git with above patch / XAA: You caught the spot. The old behaviour is back when using XAA. In the test suite this means that the second and all following S3s will resume just fine.
Not in EXA, though. That is inconsistent with my above report... :-(
Which is really odd, because I did double-check (and including the git-bisect checks, that adds up). I will give this a testing round on git from back then to find the flaw...
...unless you stop me because it's a perfectly reasonable thing to happen :-)

In my testing run with the patched XAA, another failure did occur: It went back to sleep right after resuming, repeatedly. I will try and reproduce this one as well...
...after repairing sysklogd which is borken since then. Oh well.

Created attachment 15309 [details]
The failed XAA display
This is what the screen looks like when resuming from S3 with XAA.
I added a spot of color to this one :-)
but then forgot to display the menu: The menu entries are displayed,
but not the border.
Created attachment 15310 [details]
Broken XAA after first S3
Resending, because I grabbed the wrong resolution pic.
Resuming from S3 with the patch and XAA failed in yet another way, this time it was many colorful horizontal lines... so I gave up reproducing and shall count myself lucky it worked once...

I did another round of checking everything else for consistent behaviour, and it keeps coming out as described. So I went back to commit 5f92b4c2db9, the last "working" commit before complete S3 failure. Using XAA, the same pattern as with the patched version occurs. Using EXA, I get consistent failures.

This still is at odds with my testing and the initial description. I did the initial description with 2.1.0, so there I will be next...

I am skimming through my notes for any help there... my first terrible hunch of a test failure is that the switch from XAA to EXA as default is only a few commits after the IntelEmitInvariantState addition. My xorg.conf excerpt in the initial description suggests that I used to define XAA but not EXA explicitly. Thereby identifying an XAA bug.

On the other hand I am pretty sure to have double-checked this with Carlos and did quite a few EXA/XAA comparisons myself. Not quite prepared to think I screwed them all wrong all the time while busy grepping logs...

Anyway, from what I see now it looks like I had better go through the cornerstones of the git-bisection again, and prepare to shovel some ashes on my head...

Heh, no problem. These sorts of problems can be hard to nail down, let us know when you've run your tests again...
Thanks.

Ok, I got to check through my notes and test some.
I got it wrong like I feared by not asking for "EXA"
in xorg.conf explicitly. EXA gets the default three
commits away.
I have actually described the suspicious bit, but
did not take the clue... in the original bug I wrote:
> At least commit e784e152a8e84b6e447b55a5c7019e7b47e17621
> (18 minutes after the offender) still shows the old corruption
> pattern (as described for 2.1 versions) while already failing
> constantly on S3.
Oh well.
So the suspicious behaviour I describe is XAA only,
while EXA is consistently just failing to resume
from S3 properly. This behaviour is stable from
5f92b4c (just before the XAA regression) until now.
So there is no regression for EXA, just a plain bug :-)
So I guess I should just file a bug for the
EXA issue while forgetting about the XAA failure.
That said, I did all of the rechecking above
on 2.6.24.3. On current 2.6.25-rc{6,7},
we may forget about the EXA bug as well,
probably... at least for the moment,
as the current kernel does not come back from
suspend at all. No X, no intel driver needed,
no EXA vs. XAA, just plain "won't work". :-)
Let's see to that instead and see what EXA
behaves like when THAT regression is healed.
At least, I know how to bisect now, for my
next task :-)
Anyway, sorry for screwing this one wrong.
If there's any followup you'd like to get,
just ask for it. Thanks.
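
The mix-up described above came from an xorg.conf that did not name the acceleration method explicitly, so the driver default decided between XAA and EXA. A minimal sketch of a check that could be run before each test round follows; the function name accel_method_of is made up for illustration, and the match is a plain case-sensitive pattern for a line of the form Option "AccelMethod" "EXA", so it is an assumption that the file is written that way.

```shell
#!/bin/sh
# Hypothetical helper: print the AccelMethod named in an xorg.conf-style
# file, or "unset" when no uncommented Option "AccelMethod" line exists.
accel_method_of() {
    conf=$1
    # Commented-out lines start with '#', so they never match the
    # pattern, which is anchored at optional leading whitespace.
    method=$(sed -n \
        's/^[[:space:]]*Option[[:space:]]*"AccelMethod"[[:space:]]*"\([^"]*\)".*/\1/p' \
        "$conf" | head -n 1)
    if [ -n "$method" ]; then
        printf '%s\n' "$method"
    else
        printf 'unset\n'
    fi
}
```

Calling accel_method_of /etc/X11/xorg.conf and noting the result next to each test run would make it visible when a run silently depended on the driver default.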
Yeah, 855GM suspend/resume with DRM is broken at this point... still trying to fix it. Thanks for checking everything else though, good to know we haven't regressed EXA at least.

Is there a bug (or twelve) for the linux DRM issue? I did not find anything I could relate to in the 2.6.25 regressions and suspend issues tracker bugs nor on lkml nor here... Anything short of following up on the kernel logs?
Thanks!

One of the 855 failures is being tracked in #15158, the other was an Ubuntu reported bug; I don't think we have an upstream one open for it yet (it affected one platform that was known to be unstable in other ways as well).

Some news here: I got the current kernel git (3925e6fc) to resume from S3 again.

Let me start slightly off-topic on the console...
S3 works when DRM and DRM_I915 (or DRM_I830, no difference) are built into the kernel. It fails when they are built as modules, and fails as well when not built at all. None of the following work:

CONFIG_DRM=y
# CONFIG_DRM_I830 is not set
# CONFIG_DRM_I915 is not set

CONFIG_DRM=y
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=m

CONFIG_DRM=m
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=m

# CONFIG_DRM is not set
# CONFIG_DRM_I830 is not set
# CONFIG_DRM_I915 is not set

Given that the module is loaded by the X server, which is not running, this is somewhat odd. Actually, I more or less stumbled over it. I decided on a bisection with a monolithic kernel after some attempts to bisect it failed miserably...
As I started with a monolithic and very minimal kernel, I can hereby state that this pattern is consistent over a dozen configs, some of which I may post on request...

And here is where I am open to any pointers, and specifically:
Do you want this reported as a kernel bug?
Do I assign it to you or some neighbour of yours? :-)
Wishes for another attempt to bisect? (I failed before, but have more to go on, now.)

=====
Symptoms:
The failures to resume have a very consistent pattern.
On resume, the screen hangs after powering up, with a cursor on top left. (This is replaced with the proper screen after a second when it works, but on failure, it stays right there.) The "invisible" console is fully functional. When shutting down, the machine powers off cleanly, but the screen stays on. The screen turns off only after a hard reset.

Back in X:
Giving X a try, now with DRM built into the kernel, resume is failing in the same way it did before, at first sight. The cursor and some stuff is missing, screen blanking won't work. So, basically, back to where I was, using Gentoo's 2.2.99.901 ebuild.

Back on topic, I did get S3 to restore X fine, repeatedly. Sad thing is, I cannot reproduce it after a reboot. Anyway, it did work after I made my way through the debug states in /sys/power/pm_test, from freezer to core. Now, probably the important note here is that the machine comes back fine from core, so is nearly suspended. The screen is never physically powered down in the process, though.

Anyway, even using the voodoo of repeating my session by restarting the same apps and redoing the tests by
# echo freezer > /sys/power/pm_test && echo mem > /sys/power/state && sleep 5 &&
  echo devices > /sys/power/pm_test && echo mem > /sys/power/state && sleep 5 &&
  echo platform > /sys/power/pm_test && echo mem > /sys/power/state && sleep 5 &&
  echo processors > /sys/power/pm_test && echo mem > /sys/power/state && sleep 5 &&
  echo core > /sys/power/pm_test && echo mem > /sys/power/state && sleep 5 &&
  echo none > /sys/power/pm_test
does not help any. I only did a few compiles the original time around, but I cannot see how any of that may help. So, throwing any hunches at me is very welcome.

While testing all this, I found a new fluke when suspending (and also when suspending to some shallow state using /sys/power/pm_test): after resume, the current mouse selection is pasted to the current console.
This happens with intel driver 2c135ef8a (last week's git tip) and linux 3dc50637 (yesterday's git tip). I don't know yet which one is responsible here, but will post when I know.

===
On the kernel DRM issue, I am afraid another round of going backwards in git to find a working version fails all the way back to .22; I know it did work back then, though. I seem to be more stupid than git.

So, some updates...
I still cannot reproduce the working condition in X. It "just worked" on two occasions, both after several hours uptime. Note that uptime is not the key in itself. It could go after the moon or after your neighbor's dog just as well.
WHEN it decides to just work, it just keeps working reproducibly until I reboot.

In more detail, I can say the following:
- On both occasions when it happened to work, it was after more than six hours uptime, no s2disk till then. This may mean nothing.
- It reproduces when working, I did a handful of cycles for this and the following.
- An s2disk does no harm, S3 works just fine thereafter.
- Suspending from the console works just as well, the running X session is restored just fine.
- After a fresh boot, it fails again. Consistently.
- When suspending again from the corrupted screen, leaving garbled letters in the console, I find them in the console displaying my "blind" moves just right, like it should have. Which prolly just means really everything is fully functional.
- When suspending from the console BEFORE starting X, I can afterwards start a functional X.
- When suspending from the console with X running, X is borken in exactly the same way as when suspending from within X.
- When stopping the corrupted X server and starting it again, X does not come up cleanly, only giving me the menu border and the arrows I actually miss in the corrupted state. The background is corrupted, an xterm "blends in" with the corruption. Again, all is fully functional. Suspending again from there, the arrows are gone, and the corruption seems unchanged. s2disk does not restore this one.

So much for visual appraisal.

Regdumps I made through all of the states (fresh boot, failed state, after resuming properly, when coming back from s2disk) are all identical now. Somebody has obviously nailed the regs properly in the meantime. This unchanging, robust regdump is attached.

The dmesg output also looks very consistent to me. There are some sporadic differences though, so I will attach two from late in the "always works" run, one taken after an S3, one after an s2disk. The third one is taken after a failing resume after reboot. I have plenty more, should the need arise :-)

The first occurrence was on kernel and intel versions like above, the second on kernel git afa26be86b6 and intel git a0ced923. That is to say, both are jolly recent.

====
This seems to be a different bug, but I'm not sure, so...
As marked above, in X the current mouse selection is sent to the active xterm. In the beginning it was preceded by two newlines:
<quote>
leisereiter /home/andre/bug-i810/testing_2.6.26-rc1-3 # echo mem > /sys/power/state
leisereiter /home/andre/bug-i810/testing_2.6.26-rc1-3 #
leisereiter /home/andre/bug-i810/testing_2.6.26-rc1-3 #
leisereiter /home/andre/bug-i810/testing_2.6.26-rc1-3 # echo hallo
leisereiter /home/andre/bug-i810/testing_2.6.26-rc1-3 # echo mem > /sys/power/state
leisereiter /home/andre/bug-i810/testing_2.6.26-rc1-3 #
leisereiter /home/andre/bug-i810/testing_2.6.26-rc1-3 #
leisereiter /home/andre/bug-i810/testing_2.6.26-rc1-3 # echo hallo
</quote>
After the first s2disk, the two newlines no longer occurred, but the "paste" still happened. Now just like the middle mouse button.

Created attachment 16395 [details]
This regdump is universal. It does no more change through any suspends or even reboots.
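
A minimal sketch of how two register dumps could be compared register by register, as an alternative to reading raw diff output. It assumes every register line has the shape "(II): NAME: VALUE" seen in the diffs earlier in this report; the function name regdump_changes is made up for illustration, and the first file is assumed to be non-empty.

```shell
#!/bin/sh
# Hypothetical helper: list registers whose value differs between two
# dumps. Prints nothing when the dumps agree on every shared register.
regdump_changes() {
    awk '
        # First file: remember the value (field 3) per register name
        # (field 2, including its trailing colon).
        NR == FNR { if ($1 == "(II):") before[$2] = $3; next }
        # Second file: report registers whose value changed.
        $1 == "(II):" && ($2 in before) && before[$2] != $3 {
            printf "%s %s -> %s\n", $2, before[$2], $3
        }
    ' "$1" "$2"
}
```

For the first pair of dumps quoted earlier, regdump_changes regdump_2008-02-24.beforeS3onC regdump_2008-02-24.afterS3onC would be expected to list only CR0e and CR0f; for dumps that are identical, as described here, the output is empty.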
Created attachment 16396 [details]
dmesg after a successful resume from S3
Created attachment 16397 [details]
dmesg after resume from s2disk in the successful series
Created attachment 16398 [details]
dmesg after an unsuccessful resume from S3
Created attachment 16399 [details]
Xorg log from the successful run, debugging enabled
Created attachment 16400 [details]
Xorg log from a failed run, debugging enabled
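
The pm_test walk-through quoted earlier in this report (freezer, devices, platform, processors, core, then back to none) can be written as a loop. This is only a sketch: the function name pm_test_cycle and its two optional parameters are made up for illustration, and the directory argument exists so the loop can be tried against a scratch directory instead of the real /sys/power.

```shell
#!/bin/sh
# Hypothetical helper: step through the suspend test levels in the same
# order as the command chain in the report, then switch test mode off.
pm_test_cycle() {
    power_dir=${1:-/sys/power}   # assumption: sysfs power directory
    pause=${2:-5}                # seconds to wait after each cycle
    for level in freezer devices platform processors core; do
        echo "$level" > "$power_dir/pm_test" || return 1
        echo mem > "$power_dir/state" || return 1
        sleep "$pause"
    done
    # Leave test mode so the next "echo mem" is a real suspend again.
    echo none > "$power_dir/pm_test"
}
```

Run as root with no arguments it performs the same writes as the original one-liner; stopping at the first failing write makes it easier to see which level breaks.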
Ok, not true enough, again. I had another successful S3 some hours later, but on the next cycle, resume was broken, with a slightly different pattern than usual. So the thesis of repeatability just hit a snag.

clearing "NEEDINFO"

re-assign to zhenyu, since he has got the TP R50e.

Behaviour of resuming changes for the better in the 2.4 series. But not all is well, though. I do get failures in resume every so often, but I also got a very long series of about 40 successful resumes with intel git b0b0998b5d5 from 30 July and linux-2.27-rc2. I use the gentoo distro driver in the 2.4 versions right now.

While not seeing proper patterns, here are some observations on the 2.4ish behaviour:
- I never got it through a suspend cycle successfully shortly after boot.
- It often works, but I can stress-test it to death in 4 to 40 attempts.
- Together, this gives the impression that resume functionality comes and goes on its own timescale. Something not volatile enough. But that's just how it feels.
- That feeling got support from my attempts at git-bisection of behavioural changes. Some modern patterns I had just reproduced on older driver versions, where I had never seen them before. Rather confusing, and so I left it at that back then...
- I experience two kinds of failures on resume:
  1. The screen is not fully restored, the system is unresponsive, it does another blink (even an endless series of blinks with late 2.3 versions) and is off. Proper shutdown via ACPI.
  2. The screen is not fully restored, but fully functional. More suspend cycles from X or the console work out, but I never saw it restore to correct operation. This failure is the most common one. Once I saw it degrade further on continued suspend cycles.
- When doing another suspend cycle from the console, I have seen it both succeed and fail in restoring the screen. I never saw it "repairing" to the correct state after another resume from WITHIN the X-server. And, by impression, it does not seem to come back to functional if it was not "repaired" at once, i.e. on the next suspend cycle.
- Very rarely, I get a suspend failure (total freeze of the system, and the screen looks like morphing, oily patterns). The most entertaining of failures I know of :-)
- I found the screen brightness not restoring to the previous level, not even to some fixed level, but coming up "somewhere". I still need to inquire into this one.

I have an R50e, with an intel 855GM card. Using intel driver+kernel 2.6.27, resume takes me back to a non-refreshing X. In plain English, you can understand that as all the windows not redrawing at all. Example:
1. suspend
2. resume
3. gnome-screensaver password prompt appears as a solid grey rectangle
4. hit esc
5. g-s password prompt disappears, everything still black
6. input your correct password, hit enter, everything still black
7. switch to another vt, switch back, everything now as "big solid rectangles"; imagine a normal desktop but with the "contents" being solid color areas, like the title bar a solid blue rectangle, this browser having a solid grey square instead of menus, buttons, etc.
Everything works but the screen does not redraw anything, so if you want to see what happened you can switch to a vt and then back, but that won't help much since you will still only see big rectangles. Funny, your wallpaper is perfect :-)
This was working until last week with kernel 2.6.24 (.27 was broken) and ubuntu ibex's last week xorg bundle; it is now broken in .24 and .27, which was already broken.
I'm available for debugging, let me know.

Good news here!
Starting off with the problem of X freezing at once in the 1.5 series of the server with linux-2.6.28-rcX, I worked my way to a functional setup.
Most recent working solution is:
libX11 git head
libdrm git head
mesa git head
xorg-server-1.5.2 (a heavily patched version -r1 from gentoo x11 overlay is fine, too)
xf86-video-intel git head
running on linux-2.6.28-rc4-00322-g58e20d8

The xorg-server from git leaves me with an unresponsive system, but not a frozen one; I can shut down cleanly with my power button. I will try and find out about that failure in the next couple of days.

Now, with this setup, I can reliably suspend to ram and to disk, it seems. From within X, from the consoles with X running, all comes back nicely.

The system announces direct rendering, but Mesa uses the software renderer:
OpenGL renderer string: Software Rasterizer
OpenGL version string: 2.1 Mesa 7.3-devel
So, I cannot be sure this still is the same bug, but it definitely is a fully functional system, minus some known rendering issues in firefox.

As a side note, I gave the new intelfb a largely unsuccessful try. For one thing, the VESA modes will not display at all, and it breaks resume in X. But that definitely is another bug report at another time :-)

Even more good news!
The setup described above survives some stress testing of about 30 cycles of s2ram with interspersed suspends from console and/or to disk.

Upgrading to xorg-server master was not too hard, I needed to amend my xorg.conf to not rely on hal but instead use my mouse and keyboard definitions. I had some corruption issues with master, but no lockups. Switching the intel driver to dri2 got me back in business.

To clarify: When saying "git" for all the components (see last post) I reference the gentoo x11 git overlay, so there are some patches against pure git master.

So, with xorg-server master and the intel driver on the dri2 branch, I get a functional system. Mesa relies on the software rasterizer still, but it is supposed to, if I followed things correctly. A couple of suspends to ram from X and the consoles, and an s2disk all give me my nice, functional system back. Hooray!
I still need to do some more stress testing, and will complain on failure. I attach an Xorg.0.log after a couple of suspends with the dri2 setup. Feel free to ask for more info or testing.

So, this may be called "works for me", as far as I can tell.

Created attachment 20367 [details]
Xorg.0.log after s2ram on xorg-server master and intel-dri2 branch with lots of debug output
reporter says this is fixed with current code