Summary: | Unichrome (K8M800) locks up when working with textures | ||
---|---|---|---|
Product: | Mesa | Reporter: | Valentine Sinitsyn <e_val> |
Component: | Drivers/DRI/Unichrome | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | major | ||
Priority: | high | CC: | alexdeucher, anton, bill-freedesktop.org-bugzilla, eira, gary4gar, libv, m_pupil, mpytasz, redhat.tux, viriketo, xavier |
Version: | unspecified | ||
Hardware: | x86 (IA32) | ||
OS: | Linux (All) | ||
Attachments: | Trivial workaround, Demo program | ||
Description
Valentine Sinitsyn
2005-11-18 20:55:41 UTC
Created attachment 3841 [details] [review]
Trivial workaround

A trivial workaround for the unichrome DRI module which allows you to run Trigger but not other games.

The situation appears to be more complicated than I initially thought. I did some additional debugging and now think that all games suffer from the same driver bug: the symptoms are quite similar. However, there are many ways to "activate" this bug, so every game (or GL app in general) has its own workaround. For instance, in Trigger you should avoid GL_LINEAR_MIPMAP_LINEAR, in Torcs you should disable GL_ALPHA_TEST when rendering multitextures (it is always connected with textures somehow), and so on. Usually the program hangs between a return statement and the next line of code, i.e. in the sample code below:

    int some_func() {
        ...
        printf("BEFORE\n");
        return 1;
    }
    ...
    while (some_func()) {
        printf("AFTER\n");
    }

you will see the "BEFORE" line but not "AFTER".

So I've prepared a very simple demo program (see the attachment below) which hangs my computer. I hope it will help in debugging the driver. More details are in the attachment comments.

Created attachment 4010 [details]
Demo program
This small program suffers from the described bug. It should display two
rotating triangles (blue and white), but it hangs instead. I've made the program
intentionally simple (no real textures, just an autogenerated plain color, and no GLU
calls for mipmapping etc.) so it makes a minimum of GL calls. To make the program work,
one should either change the mipmapping mode or render the texture on both triangles.
You will find comments on which lines to remove or modify inside the file. The
bug is highly reproducible with this program, but in rare cases you may need to
run the program several times before it hangs the computer.
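The attachment itself is not reproduced in this thread, so the following is only a sketch of what "autogenerated plain color, no GLU calls for mipmapping" could look like on the CPU side. It assumes RGBA8 pixels and a hand-built mipmap chain. A mipmapped minification filter such as GL_NEAREST_MIPMAP_LINEAR needs every level down to 1x1 to be defined. The helper names `mip_level_count` and `fill_plain_rgba` are hypothetical and do not come from the demo program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of levels in a complete mipmap chain for a width x height texture.
 * Each level halves both dimensions (rounding down, never below 1) until
 * the 1x1 level is reached. */
static int mip_level_count(int width, int height)
{
    assert(width > 0 && height > 0);
    int levels = 1;
    while (width > 1 || height > 1) {
        if (width > 1)
            width /= 2;
        if (height > 1)
            height /= 2;
        levels++;
    }
    return levels;
}

/* Fill one RGBA8 mipmap level with a single plain colour.
 * The buffer must hold pixel_count * 4 bytes. */
static void fill_plain_rgba(uint8_t *pixels, size_t pixel_count,
                            uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    assert(pixels != NULL);
    for (size_t i = 0; i < pixel_count; i++) {
        pixels[4 * i + 0] = r;
        pixels[4 * i + 1] = g;
        pixels[4 * i + 2] = b;
        pixels[4 * i + 3] = a;
    }
}
```

Each filled level would then be uploaded with one texture upload call per level. That part is left out because it needs a GL context.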
I have the same problem on K8M800. I ran the attached demo program and the computer locks up if this line is not commented out:

    glDisable(GL_TEXTURE_2D);

Commenting out this line or not doesn't seem to make any difference:

    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST_MIPMAP_LINEAR);

I'm running Mandriva Linux 2006 (32bit) with thac&ze xorg-x11-6.9.0+openchrome+mesa-6.5.0+dkms-drm RPMs: http://www.mde.djura.org/2006.0/RPMS/

This issue appears to be hitting me as well. The demo program behaves as described, plus I get a lockup with geartrain and some of the other demos. Note that gears works fine (450 fps).

I'm using Mesa 6.5 (downloaded from mesa3d.org), xorg 7.0 (xorg-server 1.0.2-r3 from gentoo) and the openchrome (http://openchrome.org/) driver. I have a K8M800 chipset and am running on an x86_64 notebook.

I've built mesa 6.5 with debug enabled and set LD_LIBRARY_PATH to point to the libs in my build tree. I see the following on the command line:

    bash$ ./geartrain
    __driCreateNewScreen_20050727 - succeeded
    Mesa warning: couldn't open libtxc_dxtn.so, software DXTn compression/decompression unavailable

The initial geartrain window appears and then the display is locked. Sometimes the whole machine is locked up.

Please advise if there is anything I can do to aid debugging this further.

*** Bug 7456 has been marked as a duplicate of this bug. ***

Ubuntu bug: https://launchpad.net/distros/ubuntu/+source/xserver-xorg-video-via/+bug/43154

This Ubuntu report, which has 3 bugs, has been associated with this bug. It affects: screensavers, Wine (windows non-emulator) with any application, games, and potentially classpath when 3D is enabled.

Another similar freeze with 3D apps on Unichrome has been reported with the KM400, which I assume is this same bug: https://bugs.launchpad.net/ubuntu/+source/mesa/+bug/118163

The user was able to work around it by downgrading from mesa 6.5.2 to 6.5.1.
(In reply to comment #2)
> The situation appears to be more complicated than I initially thought. I did
> some additional debugging and now think that all games suffer from the same
> driver bug: the symptoms are quite similar. However, there are many ways
> to "activate" this bug, so every game (or GL app in general) has its own
> workaround. For instance, in Trigger you should avoid GL_LINEAR_MIPMAP_LINEAR,
> in Torcs you should disable GL_ALPHA_TEST when rendering multitextures (it is
> always connected with textures somehow), and so on. Usually the program hangs
> between a return statement and the next line of code, i.e. in the sample code below:
>
> int some_func() {
>     ...
>     printf("BEFORE\n");
>     return 1;
> }
> ...
> while (some_func()) {
>     printf("AFTER\n");
> }
>
> you will see the "BEFORE" line but not "AFTER".
>
> So I've prepared a very simple demo program (see the attachment below) which hangs
> my computer. I hope it will help in debugging the driver. More details are in the
> attachment comments.

I looked into this a bit more and infer the following. I realise that you looked at the bug in terms of high-level errors in the Mesa DRI code. I, however, went lower than that to look for the exact root cause. My observations may differ from yours, but here goes.

1. Debugging into this using gdb showed a hard lock in the glFlush() portion of the glx code, which in turn goes to __mesa_Flush() in unichrome_dri.so. The lock happens at different points in the code, so I figured that asynchronous, event-driven code is causing it.
2. I finally went into the DRM portion of the code (libdrm), which uses ioctls to run kernel-level code from user space.
3. Adding printk's to the DRM code finally isolated the problem. There is a function in via_irq.c called via_driver_vblank_wait(), which is probably serviced when the VIA_IRQ_VBLANK_PENDING interrupt bit is set. It calls viadrv_acknowledge_irqs().
4. This reads VIA_INTERRUPTS_REG using the VIA_READ macro (which is a readl PCI post) and ORs it with the VIA_IRQ_VBLANK_PENDING bit. QUESTION: If it is interrupt driven, this bit should already be set. Why is it being set during acknowledge? Then it writes VIA_INTERRUPTS_REG back using VIA_WRITE.
5. Looking at the sequence of printk's, I see that VIA_READ and VIA_WRITE happen several times and that at one point VIA_READ simply locks.

Observations:

1. Since this lock happens during an MMIO PCI post, it probably means there are some bus arbitration problems (the memory space must be mapped to agpgart). So is the bug in agpgart? Or is there something in the hardware that says you cannot read and write HW registers using PCI posts continuously, and maybe you should introduce gaps or delays between READs and WRITEs?
2. Since the hw is MMIO, I would imagine that PCI posting (reading and writing together), although non-blocking, would be properly handled by the bus arbitration queue. It would be a great help if we had the manufacturer specs. This is weirder because it happens only on a few VIA chipsets (Unichrome Pro B).
3. I think it must be related to certain HW timing differences between the chipsets. Matters are not helped by the fact that the bug seems to lie in kernel space, where debugging is a lot more difficult. Debugging with Linice seems to be a good way of saving time, but I don't think it is stable enough for the latest 2.6.x kernels.
4. Finally, just adding arbitrary udelays does not seem to solve the problem. They just slow the system down much more, and VIA_READ still hangs. If it is a timing issue, then there is more to it than a simple delay between reading and writing HW registers.
5. I would very much like someone to go further into this and, if possible, get help from the DRI architects, as they may be the best people to deal with this problem, with or without HW specs for the chipset.

Hope this helped in some way.
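One possible answer to the QUESTION in item 4, offered as an assumption rather than a confirmed fact about the VIA hardware: many interrupt status registers use "write 1 to clear" semantics, where writing a 1 to a pending bit acknowledges (clears) it. Under that assumption, ORing the pending bit into the value before writing it back is how the interrupt gets cleared, not set. The sketch below simulates such a register in plain C. The bit positions and the `fake_*` names are hypothetical and are not taken from via_irq.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. These are NOT the
 * real VIA_IRQ_VBLANK_* values from the DRM headers. */
#define FAKE_VBLANK_PENDING (1u << 3)
#define FAKE_VBLANK_ENABLE  (1u << 19)

/* Simulated interrupt register. The pending bit is assumed to be
 * write-1-to-clear; the enable bit is stored as written. */
struct fake_irq_reg {
    uint32_t value;
};

static uint32_t fake_read(const struct fake_irq_reg *reg)
{
    assert(reg != NULL);
    return reg->value;
}

static void fake_write(struct fake_irq_reg *reg, uint32_t v)
{
    assert(reg != NULL);
    uint32_t pending = reg->value & FAKE_VBLANK_PENDING;
    pending &= ~v;  /* a 1 written to the pending bit clears it */
    reg->value = (v & ~FAKE_VBLANK_PENDING) | pending;
}

/* Read-modify-write acknowledge, mirroring the sequence in item 4:
 * read the register, OR in the pending bit, write the result back. */
static void fake_acknowledge(struct fake_irq_reg *reg)
{
    uint32_t status = fake_read(reg);
    fake_write(reg, status | FAKE_VBLANK_PENDING);
}
```

With this model, acknowledging a register that holds enable plus pending leaves only the enable bit set. ORing the bit in unconditionally also covers the case where the interrupt is raised between the read and the write.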
I would love to get comments or corrections on what I have written. Your code flow may turn out to be entirely different. Please let me know if so. Hope this helps.

(In reply to comment #9)
First of all, thank you for the information: very nice job! I suspected it had something to do with the VBLANK IRQ, and now we know this for sure. This lock-up is often connected to timing issues in wikis, so this partially supports your conclusion. Unfortunately, I know little about low-level hardware programming and can't imagine how to fix it, but I'm sure the maintainer of this DRI driver (when it has one) will be able to use your data to fix the problem.

(In reply to comment #10)
> (In reply to comment #9)
> First of all, thank you for the information: very nice job! I suspected it had
> something to do with the VBLANK IRQ, and now we know this for sure. This lock-up
> is often connected to timing issues in wikis, so this partially supports your
> conclusion. Unfortunately, I know little about low-level hardware programming
> and can't imagine how to fix it, but I'm sure the maintainer of this DRI driver
> (when it has one) will be able to use your data to fix the problem.
>

Thank you. It just occurred to me that this could be in some way related to [Bug 8641] New: interrupts not properly handled for VIA K8M00 / UniChrome Pro. This bug has to do with setting and clearing of interrupts not working properly. According to the description, rewriting the status register does not clear the interrupt, and the kernel disables the IRQ too. So just maybe, if someone fixes this, our bug could be fixed too! Just a thought.

> It just occurred to me that this could be in some way related to [Bug
> 8641] New: interrupts not properly handled for VIA K8M00 / UniChrome Pro.
It probably has something to do with it, but what's the bug number and/or Bugzilla where it was reported? Looking for #8641 in this Bugzilla leads to a closed one: "xcb should provide (and use in generated C files) opcode defines."
(In reply to comment #12)
> It probably has something to do with it, but what's the bug number and/or
> Bugzilla where it was reported? Looking for #8641 in this Bugzilla leads to
> closed one: "xcb should provide (and use in generated C files) opcode defines."

I found it - it's in the kernel Bugzilla:
http://www.mail-archive.com/dri-devel@lists.sourceforge.net/msg31026.html

These problems look connected to me. I'm not a kernel hacker (although I've read both Linux Kernel Development, 2nd Edition and Understanding Linux Kernel ;-) but if you need someone with real hardware to help with debugging, I'm ready to do this job.

(In reply to comment #13)
> (In reply to comment #12)
>
> > It probably has something to do with it, but what's the bug number and/or
> > Bugzilla where it was reported? Looking for #8641 in this Bugzilla leads to
> > closed one: "xcb should provide (and use in generated C files) opcode defines."
>
> I found it - it's in the kernel Bugzilla:
> http://www.mail-archive.com/dri-devel@lists.sourceforge.net/msg31026.html
>
> These problems look like being connected for me. I'm not a kernel hacker
> (although I've read both Linux Kernel Development, 2nd Edition and
> Understanding Linux Kernel ;-) but if you'd need someone with real hardware to
> help debugging, I'm ready to do this job.
>

Thanks for your offer. We probably would need the real hardware to trigger the same kind of conditions needed to cause this failure. Your dmesg logs should report the error about IRQs as soon as X is loaded. I'm pretty sure we would all see the error message. I'm not a real kernel hacker either. ;-)

It seems to me that if we approach this problem from both ends (both bugs) we may stand a better chance of solving this quickly, although there is still the possibility that they may be entirely unrelated.
Another thing we can do is put printk's in the kernel in exactly the way I described, run different games and 3D applications, and verify that the same root cause persists. If it is the same, we get added confirmation of the root cause. We can, in the meantime, continue working on this in our spare time and hope for a quick breakthrough.

What seems to be the priority to me is to eliminate libGL and libGLX and pinpoint the bug in the kernel DRM code. All my investigations point to it, but it is always good to be more sure. ;-)

Update: http://bugzilla.kernel.org/show_bug.cgi?id=8641
I have submitted a patch to this bug that deals with the IRQ interrupt bug. It seems to work, but sadly, this doesn't solve the lockup issue we face here. Maybe the two are not related. Can someone confirm the patch works?

> Update: http://bugzilla.kernel.org/show_bug.cgi?id=8641
> I have submitted a patch to this bug that deals with the IRQ interrupt bug. It
> seems to work, but sadly, this doesn't solve the lockup issue we face here.
> Maybe the two are not related. Can someone confirm the patch works?
Thanks for the information, I'll try to put the patch to work. In the meantime, I've seen no traces in the dmesg output of the spurious IRQ you've observed, so these two issues are probably really different (although the IRQ handler definitely plays a role in both cases ;-).
I agree. The spurious interrupt issue was specifically for the K8M800 device, that is, device id 3108. So if you do not have this device, then you probably wouldn't see this issue.

Have an update for this. Please check http://bugzilla.kernel.org/show_bug.cgi?id=8641
(Sometimes the link is not resolved properly. So here it is again without the full url: bugzilla.kernel.org/show_bug.cgi?id=8641)

(In reply to comment #18)
> Have an update for this. Please check
> http://bugzilla.kernel.org/show_bug.cgi?id=8641
> (Sometimes the link is not resolved properly. So here it is again without the
> full url. bugzilla.kernel.org/show_bug.cgi?id=8641
>

Sounds sane to me. Have you seen http://sourceforge.net/docman/display_doc.php?docid=23693&group_id=102048

AFAIK there is no one around with knowledge of the VIA 3D spec, so miss and try is the only way. Isn't it possible to check your conclusions by masking the annoying interrupt on the PIC and enabling it, say, 10 times per second via a dynamic timer? It's a huge hack, but it would show the cause of the starvation (although I'm almost sure it's the IRQ).

> Sound sane to me.
But the question is: why do we observe random lockups (i.e. one game runs fine, a second game locks up) if the IRQ acknowledge code is incorrect?
Good question. ;-) I have no answers to that one. But notice that the bug is not at all random. The bug is fully reproducible each and every time it happens, and in exactly the same way. Therefore, I can only speculate that all the offending code does something that causes a large number of VBLANKs. One such operation would be a glFlush, I think (which, incidentally, seems to be the trigger for the small test case you have written in C). It seems to me that a combination of calls that force a lot of VBLANKs is the immediate trigger. Maybe the way the game uses and clears the texture memory, or the way drawing is done on screen, is responsible for this. In that case, this speculation makes some sense. Each application has different methods of drawing, texture manipulation etc. So a combination of different methods of calling the DRM handler may be the trigger.

But again, your statement is very logical. If the IRQ ACK is not accurate, the interrupts should fire continuously, but it happens only under certain conditions.

Again, note the critical change in behavior: no longer a hard lock, but barely responsive. That means something is eating the processor cycles, maybe a sleeping spinlock in a thread or something similar. We have some work to do! ;-)

Hi.
The IRQ issue on K8M800 has been around for ages. I'm not saying it can't be fixed (sometimes when my K8M800 has been running for quite some time it will work nicely), but trying to initialize it from scratch always fails. The reason is that the K8M800 fires a huge amount of spurious interrupts. If irq debug is turned off, the handler slows down the machine considerably trying to handle the interrupts.

The problem is probably a hardware bug. The same problem occurs on certain variants of KM400. I once asked VIA about it and they claimed that IRQ functionality was never verified on these chips since they didn't do video capture. Hence no use for IRQs in VIA's Windows drivers.

The only sane thing to do is to not enable IRQs for these chips.

Regarding the texture lockup issue, the same code works much better on other chips with the same 3D engine. It might just be a memory timing issue on K8M800, which means that tracking it down using software can be very difficult.

/Thomas

Appreciate your comments. You may have saved us a long time going off in some wrong direction. ;-)

As of now, turning IRQ off in the X driver causes more frequent lockups than before. Even a simple glxinfo is a candidate for this lockup. I know timing problems are a hell of a lot more difficult to find, but we have no choice. Lacking open specs from the manufacturer, we have to do some blind hit and miss to get it.

Another thing: if it is a memory timing issue, I assume this will lock up during a readl, a writel or during writes to some buffers. Maybe, if we can isolate this, we have a better chance of understanding what the issue is.

It's a little more complicated than that. The Unichromes have an AGP command queue which feeds a "Virtual Queue" in vram. If the lockup occurs during a texture read, it's hard to tell exactly what primitive caused it, because it may be in the middle of reading 2MB of data.

The AGP command queue has never been completely stable, so the first thing would be to turn that off. Setting "EnableAGPDMA" to "false" would eliminate AGP-related lockups. Then the Virtual command queue can also be turned off using a 2D driver option, but I can't remember which ATM. After that, all command register writes should stall until the device is ready to accept them, which may help in tracking this issue down.

/Thomas

Appreciate that piece of wisdom. :-)
Will certainly try as you advise. Will keep posting on progress as and when time permits. Thanks again.

Via released a new driver.
Anyone tried it?

> Via released a new driver.
> Anyone tried it?
Which one? AFAIK, Via drivers have nothing to do with DRI - please correct me if I'm wrong.
VIA provides only binaries for DRI modules these days, so they should not be considered in the context of this bug report. We are not even able to find the time to fix bugs in a free DRI driver, so feel free to explain how we could support binary-only drivers.

(In reply to comment #28)
> VIA provides only binaries for DRI modules these days, so they should not be
> considered in the context of this bugreport. We are not even able to find the
> time to fix bugs in a free dri driver, so feel free to explain how we could
> support binary only drivers.
>

[quote]stinke on Ubuntu launchpad[/quote]
Okay, so someone has reported the new release in the DRI Mailing List [1] and it's been commented.
It doesn't look like Valentine Sinitsyn and Luc Verhaegen have really had a look at the sources (yet).
From what I can tell, everything except the libddmpeg.so is provided as source code in the package, contrary to what they are saying.
The only question is whether what is provided can resolve the texture issue for the K8M800.
I've read some very contrary reports in [2] and the Via Arena forum. OTOH I'm not sure those people are using this new code at all, as it shouldn't even start X with the VIA BIOS checksum of the K8M800 (or maybe just mine? Can someone verify?).
I'll see this evening if I can compile the X driver without the checksum check and see what happens.

> It doesn't look like Valentine Sinitsyn and Luc Verhaegen

Just to make it clear: I'm only the original bug reporter, not a Unichrome-DRI/Unichrome/OpenChrome developer.

> have really had a look at the sources (yet).
> From what I can tell everything except the libddmpeg.so
> is provided as source code in the package,
> contrary to what they are saying.

In fact, I've already skimmed through the code VIA released on Dec 13th. The DRI part looks pretty much like what you can find in the current Mesa tree (although I haven't done any thorough comparison, and if there is a one-liner, I would definitely miss it). The archive is 14 Mb and most of it is precompiled .so libraries for different distros (outdated ones, unfortunately). I was not able to figure out whether they are just compiled sources or contain some proprietary code. It would be really nice if they don't, but the release notes are somewhat misleading:

    This software package supports 2D, 3D, TV-Out, hardware video mpeg2/4 and
    hardware video overlay. Aiglx function can be supported on Fedora Core Linux
    6/7, ubuntu Desktop 7.04. Other distributions only support 2D, TV-Out,
    hardware video mpeg2/4 and hardware video overlay.

The distros in the list are among those you can find precompiled .so files for.

Anyway, if the code works, it would be really nice. Unfortunately, I'll hardly have an opportunity to test it in the coming days, but if you have some news on the subject (the links missing from the previous post would be helpful, too), please let me/us know.

> Just to make it clear: I'm only the original bug reporter, not
> Unichrome-DRI/Unichrome/OpenChrome developer.
Right, you're not. It wasn't meant to be offensive. Also not to Luc.
Instead it's good news more people are interested in this release.
Let's hope it's worth it.
Meaning, I'm just about where you are. From the original post on Launchpad:
I used the X-Drivers from one of the binary releases and compiled the drm.ko and via.ko
from the source package.
I commented out the BIOS checksum code but the X-Drivers part of the source
package is impossible to build using the provided build chain. It's a huge mess.
I'm working on it ...
> Right, you're not. It wasn't meant to be offensive. Also not to Luc.

Heh. Being confused with a driver developer looked like a compliment, not an offense. ;-)

> It's a huge mess.

Absolutely!

(In reply to comment #31)
> it's good news more people are interested in this release.
> Lets hope it's worth it.

Nothing is worth it when it doesn't come with source. As a user you may be interested only in having things work; developers are interested only in the source that makes it work.

So please stop talking here on this bug (email each other when you want to discuss things) until you can post a patch that fixes this, or can provide specific information that clarifies the bug.

> Nothing is worth it when it doesn't come with source.
Good god it DOES!
Furthermore, I did not choose to comment on my findings (that's all I've
been up to) on THIS list in the first place.
I'm trying to make sense of what (if anything) was done with the VIA/DRM
kernel modules, Mesa DRI code, and XFree driver, and to find out if they
managed to fix the issue.
Gah! Just RTF comments...
(In reply to comment #29)
> [quote]stinke on Ubuntu lauchpad[/quote
> Okay, so someone has reported the new release in the
> DRI Mailing List [1] and it's been commented.

What release would that be and why do I not see anything in the mesa tree?

> It doesn't look like Valentine Sinitsyn and Luc Verhaegen
> have really had a look at the sources (yet).

Have you?

> From what I can tell everything except the libddmpeg.so
> is provided as source code in the package,
> contrary to what they are saying.

What about: uma_dri.so, libOGL.so, libGL.so.1.2?
See any code for those?

And what about the libS3G.a binaries in the drm directory? Could it also be that you mistake what code there is for drm and agp as fully free source, ignoring the large binary blobs in the process?

If you say out loud that maybe some people should look at some things, maybe you should verify for yourself first. Especially when you're questioning the person who has been tracking VIA movements for the past 4.5 years, and who has kicked VIA up the rear end more than once for crap licenses in their X driver sources, and got them fixed.

Now, since another fable is helped out of this world, I can, for my part, stf and do something useful.

> What release would that be and why do i not see anything in the mesa tree?

That was a reference to this thread, which the person who copied my text didn't include.

> Have you?

Yes, and I can't see the files you are referring to further below in your post. The filename is CLE266CN400CN-CX700CN800XORG40072-kernel-src_20071213d.tar.tgz from 13th December.

> What about: uma_dri.so, libOGL.so, libGL.so.1.2 ?
> See any code for those?

No, and I don't have those files.

> If you say out loud that maybe some people should look at some things, maybe
> you should verify for yourself first... blah blah blah I rule bla bla

Where did I say that? I had assumed you had not looked inside this particular file. Apparently I was right. That's not saying you should. I couldn't care less if you do or don't.

Furthermore, I'm not the person to comment on the quality or value of this new release. So far no one here has, and that's what we're all actually waiting for. But instead I'm being attacked for what exactly?

> No and I don't have those files.

Confirm. I don't have those files in my archive (md5: eed5daf69f0b970aec0a654fdfcb731e) either. The blobs there are standard parts of the Mesa DRI driver (unichrome_dri.so and libGL.so). The only strange part is libglx.so, but according to the vinstall script it's used on FC6/7 only. But whether those blobs can be generated from the sources provided, what the licensing terms for them are, and (last but not least) whether they suffer from bug 5092 is still unknown, so I'd rather agree with #33: let's stop posting here unless we have some specific information regarding the bug.

(In reply to comment #37)
> > No and I don't have those files.
> Confirm. I don't have those files in my archive (md5:
> eed5daf69f0b970aec0a654fdfcb731e) either. Blobs there are standard parts of
> Mesa DRI driver (unichrome_dri.so and libGL.so). The only strange part is
> libglx.so, but according to vinstall script it's used on FC6/7 only. But
> whether those blobs can be generated for the sources provided and what are the
> licensing terms for them (and last but not least does them suffer from the bug
> 5092) is still unknown, so I'd rather agree with #33 - lets stop posting here
> unless we have some specific information regarding the bug.
>

Yes, the driver package contains some binary libraries. I've tried to compile the source on my Debian sid amd64 machine, and found that, regardless of whether it would work or not, all necessary .so files can be generated from the given source. The VIA utils and mpeg decoding binaries didn't come with source, but I don't care about them.

The DRI libraries were written against mesalib 6.5.2 or older and need some modification to compile successfully against the current mesalib 7.0.2+. But the resulting .so still crashes my X when working with my old drm kernel module.

The drm kernel modules were written against kernel 2.6.18 or older (for Debian), which is also a bit out of date and causes a lot of compile warnings. And the resulting .ko refused to work with the existing X. According to xorg.log, AIGLX reports "operation not permitted" while initializing, and glxinfo also reports "operation not permitted". This is really strange: my /etc/dri/card0 was already set to 0666 and I have a correct DRI section in my xorg.conf file.

I also tried to compile the v4l modules, but failed.

The 2D driver compiled, but didn't work in my environment.

Anyone else working on this? Any good news?

(In reply to comment #38)
You appear to know your way around source code; you may wish to join #unichrome on Freenode.net.

Is there anyone who successfully installed the driver? Is the freezing problem still there, or does it "just work"?

More than a year has passed... does anybody know anything new? I tried mesa 7.4, xorg server 1.5.3, libdrm 2.4.7, linux 2.6.28.9, openchrome 0.2.903. And I see the same hanging problem, maybe related to textures, because glxgears works well, but no other GL apps do. I have some self-made GL programs which only draw points (no wireframe, no surfaces, ...), and those also hang everything.

Mesa is no longer support.