Summary: | [bisected] Graphics corruption related to pageflip ioctl support in 2.6.38-rc*
---|---
Product: | DRI
Component: | DRM/Radeon
Reporter: | Dave Witbrodt <dawitbro>
Assignee: | Default DRI bug account <dri-devel>
Status: | RESOLVED FIXED
Severity: | normal
Priority: | medium
Version: | DRI git
Hardware: | x86-64 (AMD64)
OS: | Linux (All)
Description
Dave Witbrodt
2011-02-03 01:39:49 UTC
I would also like to mention that, other than the 'prboom-plus' all-black melt glitch, everything is working fine. Even 'torcs', which has given me a great deal of problems in the past when testing cutting-edge drivers/software, runs better than I've ever seen it (since I used the 'nvidia' blob with my long-dead GeForce 7950GT). I have never been able to play the "Forza" track in 'torcs' with an open source driver -- the frame rate was 8 fps or below -- until now. Other tracks are mostly playable, though their frame rate was capped at 30 fps or less ever since vline was added to xf86-video-ati; now those rates are 25-35 fps minimum, and frequently reach the 50's (even 60's occasionally).
So, I only discovered the issue with the hangs because I was trying to bisect a minor glitch.
Someone is really doing something right with the open source support, because just since the last week of January the performance I'm seeing on this HD 5750 seems 50-150% faster! I don't know what caused it: I have built newer versions of the kernel, xf86-video-ati, xorg-server, and mesa during that time. Maybe this has something to do with it:
    commit 8c631cfeae29b5236928f759e222aa35e6e4984c
    Author: Marek Olšák <maraeo@gmail.com>
    Date: Fri Jan 28 22:04:09 2011 +0100
        r600g: rework vertex buffer uploads
        Only upload the [min_index, max_index] range instead of [0, userbuf_size].
        This an important optimization.
        Framerate in Lightsmark:
        Before: 22 fps
        After: 75 fps
        The same optimization is already in r300g.
Whoever is actually to blame, I would just like to thank all of you who are working on this stuff!

To determine whether I goofed something up with the kernel I described in my preliminary report, I built kernels from the drm-airlied/drm-fixes git tree corresponding to the important commits in the bisection process:
1. f5a8020903932624cf020dc72455a10a3e005087 drm/kms/radeon: Add support for precise vblank timestamping.
2. 6f34be50bd1bdd2ff3c955940e033a80d05f248a drm/radeon/kms: add pageflip ioctl support (v3)
3. 147666fb3b93b8c484f562da33a37f886ddff768 drm/radeon: Use the ttm execbuf utilities
4. 1e644d6dce366a7bae22484f60133b61ba322911 drm/radeon/kms: re-emit full context state for evergreen blits
Kernel #1 was identified by 'git bisect' as the last "good" kernel, while kernel #2 was the first "bad" kernel (hangs 'prboom-plus'). During the bisection, I found that several commits after kernel #2, the problem with the test program hanging was resolved, but the melt/fade effect when playing again after being killed was now all black; the first commit which caused this change was #3. (Kernel #4 is simply the kernel I tried first, where I first noticed the problem with the black melts/fades.)
Today I built kernels directly from drm-airlied/drm-fixes (instead of the drawn out process I used before). Here is a table comparing the results from my preliminary report and my builds directly from drm-fixes:
    Kernel #   Preliminary   drm-fixes
    --------   -----------   -----------
    1          OK            OK
    2          hangs         hangs
    3          black fades   OK
    4          black fades   black fades
The difference in case #3 is baffling. It is possible that I missed some important commit(s) when trying to cherry-pick the relevant stuff into stable 2.6.37; however, those first 3 kernels (when taken from drm-fixes) are actually based on 2.6.37-rc{2,3}, so the rest of the kernel was very different from the stable 2.6.37 sources I was adding cherry-picks to.
At any rate, it looks like the hangs were resolved in the commits between #2 and #3. This suggests that I can bisect drm-fixes and find the commit that introduced the black fades/melts in prboom-plus. I have not attempted that yet, but I will do so now.
I also discovered that the kernels with the black fades/melts dump messages like this into dmesg (and syslog):
    [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -35!
That looks significant: none of the other kernels exhibit that behavior.
(Off to bisecting again....)

Part 1: Improving on what I've reported so far
OK, I've spent a couple of days trying to understand my situation with this bug. My previous comments here were very preliminary, so I will now try to improve on the quality of information I have provided so far. Everything (kernel DRM, Mesa, X server, Radeon driver) is working so well that it may be a complete fluke that I noticed this glitch at all.
I begin with some details about the glitch I am seeing. I am not proficient enough with screen capture software to grab stills or videos of the glitch on my system, but I have found some material on the net useful for illustrating what I am seeing.
Of all the software I use for testing, only one program (so far) reveals this bug: a locally-built version of 'prboom-plus'
    http://prboom-plus.sourceforge.net
(I use it with the Ultimate DOOM WAD file which I purchased almost 20 years ago!)
DOOM used an animated fading/melting transition when starting a new game and when starting over after being killed. I'm sure you all are familiar with it; it looked like this:
    http://doom.wikia.com/wiki/File:Screen_melt.gif
The 'prboom' and 'prboom-plus' programs attempt to clone such effects. Before my latest kernel upgrades, these fades/melts worked correctly. When I decided to try out the latest Radeon DRM bits (from drm-airlied/drm-fixes) I found everything to be working great -- games, web browsing, no desktop glitches, etc. -- until I tested 'prboom-plus'. That game worked fine, except a specific fade/melt transition (not the one when the game first starts, but the one after your player is killed and you restart on the same level) was all black on the melting part of the screen. This 7-second YouTube clip looks a lot like my bug -- I'm _only_ talking about the first half second of the clip:
    http://www.youtube.com/watch?v=gDaSE8U7oEo
Since only the kernel had changed when I first noticed this all-black animation regression, I blamed it on the kernel. 
I am no longer certain that the kernel is to blame; new information (see later comments) makes me wonder whether it is a bug in xf86-video-ati.
At this point I began bisecting the kernel I was using: I had created a git branch from v2.6.37 and had cherry-picked the new commits that seemed relevant to my hardware, as described above in my original report here, and in Comment 2. I ended Comment 2 as I was about to begin bisecting directly from the drm-fixes tree, in case I had botched my cherry-picks; my next comment here will pick up after that.

Part 2: pre-rc1 and early rc* kernels suck
In Comment 2 above, I built 4 kernels directly from drm-fixes. I was guided in those choices by the results of bisecting the kernel I had made from individual cherry-picks, believing that the order in which commits appear in 'git log' is the same as the order used during 'git bisect'. (That assumption turns out to have been wrong.)
Anyway, I immediately ran into problems bisecting because the kernels you get from drm-fixes at specific SHA1 commits are in widely-varying states of usability. (That's why I was trying to cherry-pick stuff onto a stable release of 2.6.37 in the first place!) Based on my findings with the 4 kernels in Comment 2, I bisected this way:
    git-bisect start 1e644d6d 147666fb
There were 5000+ commits between those 2, and the very first commit chosen for me in between these 2 was something from the post-2.6.37 merge window: that kernel would just hang during boot.
SUGGESTION: this made me wish there was a better way to test new DRM code. For example, make it possible to test against kernels where other parts of the kernel are likely to be in good shape. Wouldn't it be relatively easy to either:
A) have an additional git tree, based on the last stable kernel release, with only new DRM-related patches (and no explosive merge window or rc* detritus)? or
B) provide a series of patches so that people wanting to help debug this stuff can have kernels that are fairly stable except for the new experimental code being tested?
I love git, but this aspect -- using in-flux developer trees for bisecting -- is very bad for user testing. An "end-user-bisect" tree, rebased at each stable kernel release, containing only new DRM-related code patches, might be a great improvement over the current situation. (Unless having end users running 'git bisect', and spamming the f.d.o. Bugzilla, is something you devs are intentionally trying to avoid....)
I know this would mean even more burden on already overtaxed devs, but it's probably something that could be automated if devs who submit patches to D. Airlie bought into supporting it. Making a kernel from individual cherry-picks after gazing at 'git log' changes since the last stable release is much more difficult, but it's something I've been doing for the past year... I'm just a noob! :-)

Created attachment 42968 [details]
Output of 'git bisect log' with drm-airlied/drm-fixes tree
Part 3: My useless bisect of drm-fixes
I started over, after having had my butt kicked by that evil merge-window kernel. This time I ran:
    git-bisect start 1e644d6d 147666fb drivers/gpu/drm/radeon
This posed the potential issue that the commit I was trying to find was not something that touched file(s) in d/g/d/radeon. If that was the result I was going to bisect again without specifying a directory, but it actually turned out fine. The bisect log is attached.
When I originally viewed 'git log' to select cherry picks, I saw an ordering like this:
    [NEWEST]
    ...
    1e644d6d drm/radeon/kms: re-emit full context state for evergreen blits
    ...
    147666fb drm/radeon: Use the ttm execbuf utilities
    ...
    6f34be50 drm/radeon/kms: add pageflip ioctl support (v3)
    f5a80209 drm/kms/radeon: Add support for precise vblank timestamping.
    ...
    [OLDEST]
I applied the individual cherry-picks in order from old to new, assuming later patches would depend on earlier ones to avoid conflicts.
In Comment 2 above, kernel 3 (147666fb) was OK and kernel 4 (1e644d6d) caused the black fades/melts. The 'git bisect' process then surprised me: in spite of the order the commits appear in 'git log', the bisection found f5a80209 and 6f34be50 _between_ 147666fb and 1e644d6d! This explains why I was baffled in Comment 2 above: 147666fb was "good" because it is actually before f5a80209 (the last commit before problems begin) in the git history.
These results correspond exactly to what I had found in my original (preliminary) report here. This seems to point firmly at one commit:
    6f34be50 drm/radeon/kms: add pageflip ioctl support (v3)
However, I have many doubts. Looking at the list of commits I originally used (see my first attachment here) there are two complex series involved:
    6f34be50 drm/radeon/kms: add pageflip ioctl support (v3)
    3e4ea742 drm/kms/radeon: Reorder vblank and pageflip interrupt handling.
    b6724405 drm/kms/radeon: Use high precision timestamps for pageflip completion events.
and
    d6ea8886 drm/ttm: Add a bo list reserve fastpath (v2)
    ecf7ace9 kref: Add a kref_sub function
    2357cbe5 drm/ttm: Use kref_sub instead of repeatedly calling kref_put
    68c4fa31 drm/ttm: Optimize ttm_eu_backoff_reservation
    96726fe5 drm/ttm: Don't deadlock on recursive multi-bo reservations
    702adba2 drm/ttm/radeon/nouveau: Kill the bo lock in favour of a bo device fence_lock
    95762c2b drm/ttm: Improved fencing of buffer object lists
    65705962 drm/ttm/vmwgfx: Have TTM manage the validation sequence.
    eba67093 drm/ttm: Fix up io_mem_reserve / io_mem_free calling
Therefore, I think this bisection was inconclusive: kernels built from commits where the "pageflip" series is incomplete or the "TTM" series is incomplete cause hangs; but what I am _really_ looking for is the first commit that allows me to build a non-hanging kernel and which also exhibits the graphics corruption. (Here, by "hangs" I mean that 'prboom-plus' becomes unresponsive instead of performing the all-black fade/melt routine, and I have to use 'kill -9' via SSH.) For the bisection to succeed, maybe I would have to treat kernels with hangs as "good" (as well as those causing no glitches at all) and only treat _working_ kernels which cause black fades as "bad".
In any event, I suspect now that all of this bisecting after the first one I did (on the 2.6.37 + cherry-picks local branch) has been for nothing. [See next comment.]
It is still possible that the hangs I saw during bisection might have some relevance for bug #33515 and bug #33418, but I think it is more likely that those were merely artifacts of 'git bisect' landing in the middle of incomplete patch series (the "pageflip" and "TTM" series I listed above).

Part 4: Request for guidance
I would like to continue helping track this bug down. It is only causing a minor glitch for me, but it is clearly a regression and may have more of an impact on software I have not tested (or on other people's systems).
The glitch I am observing reminds me of what was reported in bug #33918, specifically the difference between the 2 versions of 'test07.jpg' images. There is also the DRM error msg (see Comment 2) which appears when 'prboom-plus' exhibits the black fade/melt problem:
    [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -35!
My bisecting pointed directly at
    6f34be50 drm/radeon/kms: add pageflip ioctl support (v3)
which first causes hangs -- probably irrelevant, since the hangs cease once the patch series is completed -- and then leaves me with the all-black fades/melts once non-hanging kernels appear again after the TTM series.
The driver issue in #33918 made me wonder: if the pageflip support is actually involved here, what if I go back to an old driver before pageflip support was added, and use the new kernel with the pageflip DRM bits? I have tested that, and here is my current situation:
    $ uname -r
    2.6.37+drm2.6.38-rc3.110201.desktop.kms
    $ apt-cache policy xserver-xorg-video-radeon | grep "Installed"
    Installed: 1:6.13.99+git101203.f9bbb26
I am running my 2.6.37 + cherry-picks kernel with the radeon driver from the last commit before pageflip support was added. I get no glitches in 'prboom-plus', and all my test software works just fine.
I hope these results are meaningful. I don't know how to proceed further, so if I could get some guidance I might be able to help pin the problem down even further.

When you bisect, you want to focus on only the specific bug you are tracking. If you hit some other bug or the status of the commit is indeterminate, don't mark the commit as bad, skip it (git bisect skip). Also, you may need to have a certain patch applied all the time for testing certain things. E.g., if you get hangs without 1e644d6dce366a7bae22484f60133b61ba322911 applied, make a patch of that commit and manually re-apply it before testing each commit in the bisect. E.g., create a patch from the commit:
    git show 1e644d6dce366a7bae22484f60133b61ba322911 > fix.patch
then before each test in the bisect, manually apply the patch:
    patch -p1 -i fix.patch

(In reply to comment #7)
> When you bisect, you want to focus on only the specific bug you are tracking.
> If you hit some other bug or the status of the commit is indeterminate, don't
> mark the commit as bad, skip it (git bisect skip). Also, you may need to
> have a certain patch applied all the time for testing certain things. E.g., if
> you get hangs without 1e644d6dce366a7bae22484f60133b61ba322911 applied, make a
> patch of that commit and manually re-apply it before testing each commit in the
> bisect. E.g., create a patch from the commit:
> git show 1e644d6dce366a7bae22484f60133b61ba322911 > fix.patch
> then before each test in the bisect, manually apply the patch:
> patch -p1 -i fix.patch
Thanks, Alex. Does this mean you think it would be helpful if I did that bisect again? Or is it already clear that after the "pageflip" and "TTM" series are finished (both in my cherry-pick kernel and in the drm-fixes bisect) the kernels work (without locking 'prboom-plus') but cause xf86-video-ati (after f9bbb26) to produce the black melt animation?
It seems like the only thing another bisect would clarify is whether the "pageflip" series or the "TTM" series causes the problem; combined with using an older, pre-pageflip radeon driver, it seems like the problem is narrowed down to pageflip code... either in DRM or in xf86-video-ati.
Anyway, when I made my 2.6.37 + cherry-picks kernel, I did a cherry-pick on 1e644d6d first. (I mentioned that in the original report; it was numbered step 3.) When I bisected drm-fixes, I did have to manually apply the 1e644d6d commit as a patch; indeed, I had tested that patch in another bug report here in January, and it still has the file name you provided for it:
    0001-drm-radeon-kms-re-emit-full-context-state-for-evergr.patch
DW

Looking at your description and the "look-alike bug" youtube video, i wonder if it could be some synchronization bug in mesa, the ddx or a running desktop compositor instead of a kernel bug. Something for which pageflipping is a necessary condition to show up.
This...
<https://www.crowproductions.de/repos/prboom/branches/prboom-plus-24/prboom2/src/gl_wipe.c>
...i believe is the code responsible for the wipe effect which goes wrong. They take a screenshot of the start screen before the transition into a texture and another screenshot of the end screen after the transition. Then they go through a loop where they draw textured quads, first the fullscreen start-screen texture, then little stripes with bits of the end screen texture.
If the wipe_scr_start_tex texture would contain all-black, you'd get the visual artifacts you describe. That could be because they are capturing an all-black framebuffer instead of the proper one, e.g., because mesa fails to synchronize its framebuffer readback into textures properly with the pageflip, or because it reads from the wrong buffer (pre-pageflip vs. post-pageflip).
Without pageflipping (enabled), there isn't any buffer exchange between front- and backbuffer, so a possible bug in mesa would probably stay hidden.
I do remember that we had to fix some such bugs in the mesa classic driver, also for the framebuffer readback path, i don't know about the status of the gallium version.
You could disable page-flipping via the xorg.conf option "EnablePageFlip" "off" (iirc) and see if that "fixes" the problem without removing any patches. Or if disabling desktop composition makes a difference. One could also mess with that file to see if something changes. E.g., check whether adding a glFlush() or glFinish() or some wait for a few hundred msecs before executing the screenshot makes a difference. Or just display the screenshot texture for a few seconds to see if it is indeed a black texture.
-mario

probably a bug in the game, most likely reading from the back buffer after doing a swapbuffers.

(In reply to comment #9)
> Looking at your description and the "look-alike bug" youtube video, i
> wonder if it could be some synchronization bug in mesa, the ddx or a
> running desktop compositor instead of a kernel bug. Something for which
> pageflipping is a necessary condition to show up.
In other words, because pageflipping is a feature that never existed before, pre-existing code may simply fail to produce identical results. There may not be anything "wrong" at all, other than the cosmetic change in behavior.
FTR, I am not using a compositing window manager at the moment. Actually, strike that: Xfce's WM, xfwm4, supports compositing features... but I have always had them disabled.
> This...
>
> <https://www.crowproductions.de/repos/prboom/branches/prboom-plus-24/
> prboom2/src/gl_wipe.c>
>
> ...i believe is the code responsible for the wipe effect which goes
> wrong. They take a screenshot of the start screen before the transition
> into a texture and another screenshot of the end screen after the
> transition. Then they go through a loop where they draw textured quads,
> first the fullscreen start-screen texture, then little stripes with bits
> of the end screen texture.
This file matches the corresponding file of the source I downloaded, built, and installed on my system.
I have not (yet) done any GL programming. Can you tell me whether you see a difference between the way the effect is coded at the beginning of the game and when a killed player is resurrected? The reason I ask is this: the wipes/fades/melts at the beginning of the game and in the built-in demo have always worked, in any combination of kernel, DDX, and Mesa I have tried. Only the wipe/melt effect after hitting the space bar to start over triggers problems: all-black wipe/melt transitions, or game hangs with kernels built at certain commits (where no kernel should probably be built anyway).
If that is the only function used to do the wipe/melt transitions, why does it work on some calls but not on others?
> If the wipe_scr_start_tex texture would contain all-black, you'd get the
> visual artifacts you describe. That could be because they are capturing
> an all-black framebuffer instead of the proper one, e.g., because mesa
> fails to synchronize its framebuffer readback into textures properly
> with the pageflip, or because it reads from the wrong buffer (pre-pageflip
> vs. post-pageflip).
>
> Without pageflipping (enabled), there isn't any buffer exchange between front-
> and backbuffer, so a possible bug in mesa would probably stay hidden.
>
> I do remember that we had to fix some such bugs in the mesa classic driver,
> also for the framebuffer readback path, i don't know about the status of the
> gallium version.
If a Mesa sync problem is actually to blame, instead of a kernel bug or DDX bug, then why is it so deterministic in behavior? What I mean is, the post-death wipe never succeeds, and the game-demo and game-start wipe never fails.
BTW, if you folks discover that this is no kernel or DDX bug, then I'm satisfied to have this classified as a wishlist bug: no other aspect of the game is affected, and, as can be seen in that YouTube clip, other clones have simply implemented the wipe effect in all black anyway. My main concern was that something more serious was going wrong underneath, possibly a clue to other bug reports over the past few days.
> You could disable page-flipping via the xorg.conf option
> "EnablePageFlip" "off" (iirc) and see if that "fixes" the problem without
> removing any patches. Or if disabling desktop composition makes a
> difference. One could also mess with that file to see if something
> changes. E.g., check whether adding a glFlush() or glFinish() or some wait
> for a few hundred msecs before executing the screenshot makes a difference.
> Or just display the screenshot texture for a few seconds to see if it is
> indeed a black texture.
>
> -mario
1. xorg.conf option: I'll give it a try later
2. desktop compositor: disabled throughout
3. experiments with gl_wipe.c: having no experience with OpenGL coding, I am not immediately able to perform these experiments. I could consider this my big chance to start learning... but I'm not sure how quickly I can figure out how to code those experiments. If the code is quick and easy for you to write, I could apply patches and test them!
Allow me to point out again (see Comments 2 and 6) that I am getting a DRM error message which also (at least superficially) points us at kernel commit 6f34be50:
    [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -35!
Admittedly, that message could be unrelated to the glitch I am seeing, but I've got to admit to being tempted to believe there's a connection....

Created attachment 42981 [details]
xorg.conf with Option "EnablePageFlip" "off"
Taking Mario Kleiner's advice, I added this to xorg.conf:
Option "EnablePageFlip" "off"
(I also am attaching the file, as a sanity check; I don't believe I have anything nefarious there, but other eyes may see what mine do not.)
I can report that the all-black wipe/fade issue disappears with the combination of my 2.6.37+cherry-picks kernel and xf86-video-ati 6.14. Also, the DRM error msg
*ERROR* Failed to parse relocation -35!
goes away.
HTH,
Dave W.
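For readers without access to the attachment, here is a minimal sketch of where that option would normally sit in xorg.conf. It assumes a conventional Device section for the radeon driver; the Identifier string is a placeholder, and this is not a copy of the attached file.

```
Section "Device"
    # Placeholder name; any identifier already used in your xorg.conf will do.
    Identifier "Radeon HD 5750"
    Driver     "radeon"
    # The setting tested in this comment: turn off page flipping in the DDX.
    Option     "EnablePageFlip" "off"
EndSection
```

The X server has to be restarted for the change to take effect.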
Here's an observation I missed:
After the black wipe/melt effect, there is a pause -- several hundred millisecs, maybe a half second at the most. This does not happen when the wipe happens correctly.
Sorry about not noticing that sooner.
DW

I did no further testing after the onset of the Japan crisis, but I have just started updating some relevant parts of my system. Updating from kernel 2.6.38-rc8 to 2.6.38.2 did not seem to have an effect, but updating Mesa to 7.11.0-devel (commit a26121f3) changed the observed behavior in prboom-plus. The black melts are gone, replaced by transient graphics corruption when starting a new game. This corruption fades within a span of about 1 second, and causes no further problems.
In short, it looks like the problem was a combination of kernel changes and Mesa support for my Radeon HD 5750. Apparently some changes to Mesa over the past 7 weeks have resolved the problem I was originally reporting here.

possibly a dupe of bug 35452

I have been tracking this bug since April 2011, and it is finally fixed.
The original problem was an all-black "melting" effect at the beginning of a game of DOOM (using the prboom-plus client); as of comment 14, the melt effect was working correctly, but 3D glitches were occurring: at the beginning of the game walls and floors would be invisible or would blink, but the bug would disappear after a second or two into the game. (Actually, any change of the player's "height" in the virtual world would trigger a few moments' worth of the bug.)
The fix happened either in the DRM or Mesa. I went from a 3.4.5 kernel to 3.4.6, with some cherry-picks from drm-airlied in both; and with Mesa 8.1-devel I went from commit e3ff4d4c to commit e2e7b467.
None of the radeon-related cherry picks in drm-next that I used (between 74da01dc - 197bbb3d) for my kernel 3.4.6 update look relevant. Promising candidates from Mesa include:
    commit 018e3f75d69490598d61059ece56d379867f3995
    Author: Marek Olšák <maraeo@gmail.com>
    Date: Sun Jul 15 00:02:42 2012 +0200
        r600g: fix all failing depth-stencil tests for evergreen
    commit ba48f47ebf7f017db0507b92a3ca83e404dc586c
    Author: Marek Olšák <maraeo@gmail.com>
    Date: Sat Jul 14 16:23:42 2012 +0200
        r600g: consolidate code for setting sampler views and fix bugs in the process
    commit 80755ff56317446a8c89e611edc1fdf320d6779b
    Author: Marek Olšák <maraeo@gmail.com>
    Date: Sat Jul 14 17:06:27 2012 +0200
        r600g: properly track which textures are depth
At any rate, this minor annoyance is now completely gone. Thanks to everyone working on the open source Radeon support!
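For anyone repeating the bisection discussed earlier in this report, the advice to skip commits whose status is indeterminate (hangs, kernels that do not boot) and to mark only the black fade as bad can be written down as a small lookup. The sketch below is illustrative only: the function name and the outcome labels are hypothetical, and the numbers are the exit codes that 'git bisect run' interprets as good, bad, skip, and abort.

```shell
#!/bin/sh
# Sketch only: the function name and outcome labels are hypothetical.
# 'git bisect run' treats exit code 0 as good, 125 as "skip this commit",
# any other code from 1 to 127 as bad, and anything above 127 as "abort".
bisect_exit_code() {
    case "$1" in
        ok)            echo 0 ;;    # wipe/melt renders correctly
        black-fade)    echo 1 ;;    # the regression being tracked
        hang|no-boot)  echo 125 ;;  # unrelated failure: this commit cannot be judged
        *)             echo 128 ;;  # unknown label: stop the bisection
    esac
}

# Print the code for each outcome seen in this report.
for outcome in ok black-fade hang no-boot; do
    printf '%s -> %s\n' "$outcome" "$(bisect_exit_code "$outcome")"
done
```

Because each kernel test here needs a reboot, the same mapping would in practice be applied by hand with 'git bisect good', 'git bisect bad', and 'git bisect skip' rather than through an automated run.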