Created attachment 73839 [details]
dmesg from my latest custom kernel

Overview:

There is a memory leak in the 3D graphics support for my Radeon Evergreen GPU (HD 5750). I have at least two programs which leak memory if I configure them to run fullscreen, but stop leaking memory if I run them as windows on my desktop. The testing I have done so far seems to indicate a problem in the kernel DRM, but after bisecting and locating the first commit which introduced the leak, I found that attempted reverts of the problem code on top of my local git tree still resulted in kernels that leaked. I may be able to find out more when I can find time to do more testing, but I want to report what I have learned so far in case I am doing something stupid or missing something obvious.

I use a game called 'prboom-plus' as a kind of simple test program to see if the latest DRM changes, or the latest version of Mesa, are working. If 'prboom-plus' runs without corruption (or crashing) after a few minutes, then I (naively) conclude that everything is fine. I occasionally play for a longer period of time, and in Dec. had an instance of the game seeming to freeze, causing the whole system to become almost unresponsive. The game suddenly shut down without any error being logged, and the system was back to normal; I now know that this was symptomatic of the leak and the kernel's OOM killer intervening.

There is also an old DOS game I still like to play, and I use 'dosbox' for that purpose. In Dec., the 'xine' music player I use while playing the DOS game started "skipping" after playing the 'dosbox' game for a while, and the whole system became almost totally unresponsive. I now see that these were the same symptoms as were affecting 'prboom-plus', but at the time I didn't see the connection.

Steps to reproduce:

'prboom-plus' defaults to fullscreen mode.
I run it using a script which sets some audio-related environment variables and then runs the program:

    prboom-plus -iwad doom.wad -vidmode gl -width 1920 -height 1200

(The "-iwad" option is for loading the maps from original DOS DOOM, "-vidmode gl" is to force OpenGL graphics output, and "-width" and "-height" are probably unnecessary but are set to match my monitor's resolution.) The program defaults to fullscreen; to run in a window instead, one can just add the "-window" option to the command.

For 'dosbox', it is possible to toggle from fullscreen to window using the <Alt>+<Enter> hotkey combo. It requires a configuration file, so I have customized it to start my game in fullscreen mode using OpenGL graphics:

    [sdl]
    fullscreen=true
    fulldouble=true
    fullresolution=1920x1200
    windowresolution=1600x1200
    output=opengl
    ...

Actual results:

While trying to track down another issue (which is now resolved) I started using 'top' to check for processes which were hogging the CPU, and accidentally discovered the runaway RAM usage. I was doing this by running 'top' on VT1 while running 'prboom-plus' fullscreen in X. For convenience, I reconfigured the game to run as a window instead of fullscreen... and the leak disappeared! The same behavior is seen with 'dosbox': running in fullscreen leaks memory, but running in a window does not.

Expected results:

These programs should run without leaking memory. They have done so for years, up to and including kernel 3.7.X.

I like to test new Radeon-related code, but I have been burned by Linux -rc kernels. For several years, I have been creating local git branches, starting with a stable release as a branch point and then cherry-picking commits from drm-next and drm-fixes which are either directly relevant to my hardware or which are relevant to all systems.
I do not file bug reports if I am unable to reproduce a bug using the upstream developers' trees; in this case, I have confirmed that the HEAD of drm-fixes exhibits the memory leak as described above.

System info:

GPU: Radeon HD 5750 (Evergreen Juniper)
Linux distribution: Debian unstable + custom X stack
Software versions:
    libdrm: 2.4.40 (plus git commits up to 0980633a of Nov. 27)
    mesa: 9.1-devel (up to 5330c5a2 of Jan. 14)
    xorg-core: 1.13.1
    xf86-video-ati: 7.0.99-devel (up to commit 793e1b0e of Dec. 6)

Additional Information:

I began noticing problems in late Nov. or early Dec. There were corruption problems with 'torcs', but there were also instances of my desktop becoming unresponsive and programs freezing or suddenly closing without leaving any errors in logs. The 'torcs' corruption was solved recently in drm-fixes:

    commit 20707874
    Revert "drm/radeon: do not move bo to different placement at each cs"

I had assumed that the other issues were related, but when they continued to occur I started to investigate more seriously, and discovered the memory leak using 'top'.

[Can a developer rename this bug report for me, using your preferred naming conventions?]
Created attachment 73840 [details] Xorg.0.log
Created attachment 73842 [details]
Output from successive runs of 'top'

Running 'top' revealed the memory leak. This attachment is (clipped) output from 'top'. It shows the 'prboom-plus' game running on a kernel that was not leaking, followed by several snapshots on a kernel that was leaking. Eventually the game was taken out by the OOM killer, and system memory usage was restored to normal.
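For anyone who wants to repeat this check without watching 'top' by hand, here is a minimal sketch in Python that samples a process's resident set size from /proc. It assumes a Linux /proc filesystem and that the growth is visible in the process's VmRSS, which may not hold for every kind of graphics buffer. The function names and the 100 MiB threshold are illustrative only, not taken from this report.

```python
import re
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a running process, in KiB."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kib(status_file.read())


def looks_like_leak(samples_kib, min_growth_kib=100 * 1024):
    """Heuristic: RSS never shrinks between samples and grows a lot overall.

    The 100 MiB default threshold is an arbitrary assumption.
    """
    if len(samples_kib) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples_kib, samples_kib[1:]))
    return never_shrinks and samples_kib[-1] - samples_kib[0] >= min_growth_kib


def watch(pid, count=12, interval_s=5.0):
    """Collect `count` RSS samples, `interval_s` seconds apart."""
    samples = []
    for _ in range(count):
        rss = read_rss_kib(pid)
        if rss is None:  # e.g. a kernel thread, which has no VmRSS line
            break
        samples.append(rss)
        time.sleep(interval_s)
    return samples
```

Calling `looks_like_leak(watch(pid))` with the PID of the fullscreen game, and again with the windowed one, would give a rough comparison of the two cases.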
Bisecting

So far, I have only bisected using my local custom kernel branch. As described above, I use stable kernel releases to avoid the massive breakage that frequently occurs in -rc kernels, and I cherry pick commits from drm-fixes and drm-next. This is a potential source of error (on my part), so I will not be offended if anyone objects to me reporting bugs against such kernels. My justification is that the HEAD of drm-fixes, which is 3.8.0~rc3 at commit 48367432 of Jan. 26, also leaks memory if I build it and use it on my system.

I bisected using my custom branch first only to save time: I add cherry picks in bunches, and keep separate list files of which commits I have used, so manually bisecting using those files rapidly led me to the first commit which caused the kernel to leak memory.

[I will bisect using drm-fixes (or drm-next) if asked, but there are issues with Mesa not being compatible before a certain point in mid-December, so that I would have to use an older version of Mesa or patch some of the older bisection points with later commits in order to use the version of Mesa I currently have installed.]

For this testing, I started with stable kernel 3.7.4 and applied my lists of cherry picks that had built up since 3.7-rc7 or so. I had built up 11 lists of commits as I added new upstream code to my local branch, so I could perform a manual bisection by applying an entire list at a time until I found a kernel which leaked memory. From that, I would know which list introduced the problem, and I could build kernels at each commit in that list until I found the first one that resulted in leaks. (If anyone is interested in those lists I will be glad to post them; I have only withheld them for the sake of brevity.)

The manual bisection led me to a last good commit and a first bad commit.
The cherry picking process causes my branch to have new SHA1 ID numbers, so instead of using the 'git log' info from my branch, I will use the info from the drm-airlied/drm-fixes branch.

The first bad commit was:

commit 4ac0533abaec2b83a7f2c675010eedd55664bc26
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Thu Dec 13 12:08:11 2012 -0500

    drm/radeon: fix htile buffer size computation for command stream checker

    Fix the size computation of the htile buffer.

    Signed-off-by: Jerome Glisse <jglisse@redhat.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

The previous commit, which worked fine, was:

commit 9af20792124850369e764965690b99b20623dfc4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Dec 11 23:42:24 2012 +0100

    drm/radeon: fix fence locking in the pageflip callback

    We need to hold bdev->fence_lock while grabbing a reference to the fence,
    to prevent concurrent clearing/changing of the ttm_bo->sync_obj field.

    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

This was a time-consuming process, but much shorter than it would have been using 'git bisect' in drm-fixes.

Later, I made two attempts to revert the "bad" code on top of the HEAD of my local branch.
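The two-stage search described above (first find the list of cherry picks that introduces the leak, then the commit within that list) can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses a binary search at each stage rather than the one-at-a-time walk described in this comment, it assumes that once a prefix of commits leaks every longer prefix leaks too, and `kernel_leaks` is a hypothetical callback standing in for "build a kernel with these cherry picks, boot it, and check for the leak".

```python
def first_bad_index(count, is_bad):
    """Binary search for the smallest i in [0, count) where is_bad(i) is true.

    Assumes is_bad is monotonic (False ... False, True ... True).
    Returns None if no index is bad.
    """
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo if lo < count else None


def find_first_bad_commit(commit_lists, kernel_leaks):
    """Two-stage search: first the bad list, then the bad commit inside it.

    kernel_leaks(commits) is a hypothetical callback: it receives the
    cherry picks to apply, in order, and returns True if that kernel leaks.
    """
    # Stage 1: first list whose cumulative application produces a leak.
    def lists_bad(i):
        applied = [c for lst in commit_lists[: i + 1] for c in lst]
        return kernel_leaks(applied)

    list_idx = first_bad_index(len(commit_lists), lists_bad)
    if list_idx is None:
        return None

    # Stage 2: first commit inside that list, on top of all earlier lists.
    base = [c for lst in commit_lists[:list_idx] for c in lst]
    suspect = commit_lists[list_idx]

    def commits_bad(j):
        return kernel_leaks(base + suspect[: j + 1])

    commit_idx = first_bad_index(len(suspect), commits_bad)
    return None if commit_idx is None else suspect[commit_idx]
```

A search like this identifies the first commit at which the symptom appears; it does not by itself show that the faulty code lives in that commit.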
Created attachment 73843 [details]
diff of attempt to partially revert problem commit

Naive attempt to revert

My first attempt to revert the patch identified above as the problem did not work. (A diff of the changes is attached.) The problem commit touched evergreen_cs.c and r600_cs.c, and since I have Evergreen Juniper hardware I only reverted the code in evergreen_cs.c. The kernel built using this approach still leaked memory.
Created attachment 73844 [details]
diff of attempt to manually revert entire problem commit

Attempt to revert the entire commit

I then decided to manually revert all of the changes in the *.c files touched by that commit. (A diff of these changes is attached.) The kernel still leaked memory.
I do not know how to interpret these findings. I am not a developer, and work with the few tools that I (sort of) know how to use much more slowly than experienced developers would. I ran out of time over the weekend, and will have to investigate further this coming weekend.

I have been trying to think of possible explanations:
- could a later commit cause a similar (or identical) memory leak to the one caused by 4ac0533a?
- is the problem really located somewhere else besides in the kernel DRM, such as in Mesa, with it being triggered by more than one kind of code change in the DRM?

I thought that the bisection would be conclusive, but the failure of the revert attempts undermined my goal. If no one demands that I bisect using one of the drm-airlied branches, then I will try another custom branch which drops the commit first causing the leak, and see if any later commits also cause a leak. (If a developer demands that I bisect an upstream branch, I'll do that instead.)

Is there anything else I can do to be helpful at this point? Can any of you confirm/reproduce this memory leak on your own systems?
This is indeed more likely an issue in Mesa than in the kernel. The commit you bisected also bumps KMS_DRIVER_MINOR in radeon_drv.c, which may cause the Mesa code to use different code paths. Does current Mesa Git still leak?
(In reply to comment #7)
> This is indeed more likely an issue in Mesa than in the kernel. The commit
> you bisected also bumps KMS_DRIVER_MINOR in radeon_drv.c, which may cause
> the Mesa code to use different code paths.

In my two attempted manual reverts, I didn't know what effect the code in radeon_drv.c actually had. When looking at later DRM changes (after the one identified by the bisection) I saw that the value of KMS_DRIVER_MINOR was incremented further, so I thought I should leave the value unchanged.

> Does current Mesa Git still leak?

After updating/merging to the current HEAD (commit 02b6da1e of Jan. 29), I built new Mesa packages and installed them. I am currently relying on the last kernel from my local tree which does not leak, so I rebooted to the kernel I had built that is most current with the upstream DRM changes: stable 3.7.4 + many commits from drm-next/drm-fixes up to commit 014b3440 of Jan. 21 in drm-fixes.

Yes, it still leaks.

Would it be good for me to bisect Mesa? I would use my most up-to-date kernel, try to find an older version of Mesa which does not leak, and then identify the first patch to Mesa which causes the leaking.

The testing I have done so far was assuming the problem was in the kernel, but I was planning to look at Mesa as well. I decided to report the bug before continuing, hoping for some guidance from the experts. If there really is a memory leak caused by the kernel DRI, then I would like to help get it fixed before 3.8 is released; or, if there is a problem in Mesa I would like to help get it fixed for 9.1.
(In reply to comment #8)
> Would it be good for me to bisect Mesa?

If Mesa commit 6532eb17baff6e61b427f29e076883f8941ae664 (where code depending on DRM minor 26 was first introduced) doesn't leak, yes please.

BTW, does setting the environment variable R600_HYPERZ=0 for the leaking process(es) work around the problem?

> The testing I have done so far was assuming the problem was in the kernel,
> [...]

The fact that the memory is reclaimed on process exit makes that unlikely.
> > The testing I have done so far was assuming the problem was in the kernel,
> > [...]
>
> The fact that the memory is reclaimed on process exit makes that unlikely.

That makes perfect sense. Thank you for that explanation -- I was wondering whether there were ways to distinguish a memory leak in the kernel from one in Mesa.
(In reply to comment #9)
> (In reply to comment #8)
> > Would it be good for me to bisect Mesa?
>
> If Mesa commit 6532eb17baff6e61b427f29e076883f8941ae664 (where code
> depending on DRM minor 26 was first introduced) doesn't leak, yes please.

Because of work, I was going to wait until Friday to check this... but I found some time last night to give it a try.

With my custom 3.7.4 kernel (+ post 3.7 Radeon-related commits up to Jan. 21) I built Mesa at commit 6532eb17 ... and that leaked memory as described before. I then built Mesa at one commit earlier (24b1206a), and the memory leak was gone. That's not the sort of help I hoped to provide....

> BTW, does setting the environment variable R600_HYPERZ=0 for the leaking
> process(es) work around the problem?

With up-to-date kernel and up-to-date Mesa (commit 02b6da1e of Jan. 29), using R600_HYPERZ=0 _does_ eliminate the memory leak.

Of course, I am hoping to be _able_ to use the Hyper Z functionality, so I have to admit disappointment about having to use that workaround; but at least it allows me to use the rest of the DRI and Mesa changes since Dec. 19.

Needless to say, I'm definitely willing to test further changes in the DRI and/or Mesa!
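For anyone else needing the workaround, a launcher along these lines would set R600_HYPERZ=0 for one program only, leaving the rest of the session untouched. It is a minimal Python sketch: the helper names are made up for illustration, and only the variable name and value come from this thread.

```python
import os
import subprocess


def env_with_hyperz_disabled(base_env=None):
    """Return a copy of the environment with R600_HYPERZ set to "0".

    The caller's mapping (or os.environ) is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["R600_HYPERZ"] = "0"
    return env


def run_without_hyperz(command):
    """Run `command` (a list of arguments) with the workaround applied."""
    return subprocess.run(command, env=env_with_hyperz_disabled(), check=False)
```

For example, `run_without_hyperz(["prboom-plus", "-iwad", "doom.wad", "-vidmode", "gl"])` would start the game with the variable set for that process alone.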
Should be fixed by commit 5c86a728d4f688c0fe7fbf9f4b8f88060b65c4ee ('r600g: fix htile buffer leak'). Please resolve as fixed if you can confirm.
The canaries lived! I had just enough time before work to build Mesa (at 5c86a728) and test the programs which easily reproduce the leak. All is well now. Amazing how much difference a single line can make! Many thanks to Marek, Michel, and Alex.