Bug 18967 - Xorg freeze after using xrandr, drm debug error, with bare server
Summary: Xorg freeze after using xrandr, drm debug error, with bare server
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/other
Version: XOrg git
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-12-08 18:32 UTC by peter garrone
Modified: 2009-02-03 22:50 UTC

See Also:
i915 platform:
i915 features:


Attachments
dmesg output with warnings (79.08 KB, text/plain)
2008-12-09 16:01 UTC, peter garrone
lspci output (9.94 KB, text/plain)
2008-12-09 16:01 UTC, peter garrone
gdb log output illustrating the resetting of variable causing loop. (3.82 KB, application/octet-stream)
2008-12-09 21:21 UTC, peter garrone
Code from drm_bufs.c showing DRM_SHM branch. (1018 bytes, text/plain)
2008-12-10 21:00 UTC, peter garrone
Patch with printks, and dmesg output hopefully illustrating the error. (37.33 KB, text/plain)
2008-12-12 13:20 UTC, peter garrone
My shell script for launching problem. Command is sudo stx -min -gdb (2.54 KB, application/octet-stream)
2008-12-12 14:46 UTC, peter garrone
kernel patch that addresses error (742 bytes, text/plain)
2008-12-16 15:08 UTC, peter garrone

Description peter garrone 2008-12-08 18:32:49 UTC
Xorg freezes when I attempt to use xrandr.
My hardware is a GM45 motherboard with VGA and two SDVO/TMDS displays.
"xrandr -q" itself works, but afterwards the server goes into an infinite loop.
This does not happen every time, but it does most of the time.
The build is as close as I can make it to the latest git checkouts.
The kernel is 2.6.28-rc7-pae, from the anholt/drm-intel-next archive.
Kernel version 2.6.28-rc4-pae, with the same user space, does not show the problem.
I have investigated the problem with gdb, and by recompiling and loading the drm module with debugging enabled.

The variable master->lock.hw_lock is NULL in the kernel function drm_lock.

I get the following stack trace when the loop is interrupted.

****************************************************************************
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xb7c123e9 in ioctl () from /lib/libc.so.6
#2  0xb7a8034f in drmIoctl (fd=16, request=1074291754, arg=0xbfb211d8) at xf86drm.c:183
#3  0xb7a81107 in drmGetLock (fd=16, context=1, flags=0) at xf86drm.c:1297
#4  0xb7a8d7d0 in DRILock (pScreen=0x830ef70, flags=0) at dri.c:2201
#5  0xb7a8a587 in DRIScreenInit (pScreen=0x830ef70, pDRIInfo=0x82f2b38, pDRMFD=0x82620b8) at dri.c:525
#6  0xb7a4623e in I830DRIScreenInit (pScreen=0x830ef70) at i830_dri.c:631
#7  0xb7a0f805 in I830ScreenInit (scrnIndex=0, pScreen=0x830ef70, argc=8, argv=0xbfb21574) at i830_driver.c:3106
#8  0x080699e9 in AddScreen (pfnInit=0xb7a0f63d <I830ScreenInit>, argc=8, argv=0xbfb21574) at main.c:688
#9  0x080bc29e in InitOutput (pScreenInfo=0x824e160, argc=8, argv=0xbfb21574) at xf86Init.c:1245
#10 0x08068d34 in main (argc=8, argv=0xbfb21574, envp=0xbfb21598) at main.c:309
****************************************************************************

When the kernel function drm_lock finds the variable
master->lock.hw_lock to be NULL, it returns EINTR. drmIoctl then retries the
ioctl, which fails the same way every time, so the server loops forever.

A nasty side bug is that the DRM_DEBUG line in drm_lock assumes 
master->lock.hw_lock is non-NULL, and causes an OOPS when debugging is enabled.
Comment 1 peter garrone 2008-12-08 20:02:28 UTC
When I loaded the i915 module with modeset set to 1, the problem disappeared.
The kernel configuration option CONFIG_DRM_I915_KMS is not set.
So as long as I load i915 with modeset set to 1, the problem is resolved.
Comment 2 peter garrone 2008-12-09 16:01:04 UTC
Created attachment 20969 [details]
dmesg output with warnings
Comment 3 peter garrone 2008-12-09 16:01:48 UTC
Created attachment 20970 [details]
lspci output
Comment 4 peter garrone 2008-12-09 21:21:27 UTC
Created attachment 20973 [details]
gdb log output illustrating the resetting of variable causing loop.
Comment 5 peter garrone 2008-12-09 21:22:22 UTC
The kernel variable lock.hw_lock, whose reset causes the infinite loop, is reset when the screen is closed as part of an xrandr operation. An error message
"error setting MTRR (base = 0x20000000, size = 0x10000000, type = 1) Invalid argument (22)"
is emitted by both the kernel (in dmesg) and libpciaccess. Following this, there is a close, and in that close operation the variable is reset. The attached gdb log illustrates this sequence. The third PCI region has the failing base address and size.
Comment 6 peter garrone 2008-12-10 21:00:31 UTC
Created attachment 21036 [details]
Code from drm_bufs.c showing DRM_SHM branch.

This is the code in drm/drm_bufs.c (kernel module) for the ioctl DRM_IOCTL_ADD_MAP with map type DRM_SHM. This mapping must be created from user space before the opened DRI device can be locked. It normally sets the master->lock.hw_lock variable. However, if the function drm_find_matching_map returns an existing map, the ioctl returns 0 and master->lock.hw_lock is never set, which results in the infinite loop when user space later attempts to take the DRM lock.
Comment 7 peter garrone 2008-12-11 23:45:18 UTC
Following a comment by Dave Airlie that, if a map has already been created, then that lock should be the primary, I analysed how maps are added to the "dev->maplist" list in the drm code.

Each time xrandr is run, all drm file descriptors are closed and reopened, and the list accumulates map elements. These elements are never removed: although there is code to delete elements from the list, that code is never executed. It would be if the corresponding user space ioctl were invoked, but that does not happen.

The old elements in the list are generally not reused, because they have a "master" field that ties them to the "master" structure that was active in the device when the map was created. These master structures are allocated and freed on each xrandr close/reopen cycle.

However, if by chance the dynamic memory allocator returns the address of an old "master" structure, then one of the stale elements is matched and reused, and the code takes the branch that does not set master->lock.hw_lock. The infinite loop is then entered when a later attempt is made to lock the drm.

It seems problematic to free dynamically allocated master structures without also freeing any elements that rely on a reference to that structure for correct operation.
It also seems problematic to have all these map structures hanging around anyway: I cannot find where they are ever freed, so at the very least they represent a memory leak.
Comment 8 Eric Anholt 2008-12-12 10:18:15 UTC
Are you running a bare server (no other clients running) so that the server regenerates after each xrandr call?
Comment 9 peter garrone 2008-12-12 12:49:31 UTC
(In reply to comment #8)
> Are you running a bare server (no other clients running) so that the server
> regenerates after each xrandr call?
> 
Yes. I am targeting an embedded system. Usually I like to work by remotely logging in using ssh and running xterms on a remote computer, while running a rudimentary display on the target system. However there is a requirement to run a full desktop on the target, for developers.

When I debug the X server with gdb, I likewise run only Xorg (under gdb), with no clients.


Comment 10 peter garrone 2008-12-12 13:20:15 UTC
Created attachment 21101 [details]
Patch with printks, and dmesg output hopefully illustrating the error.

This is the dmesg output produced by the printks added in my patch. At module removal, there are 25 entries in the maplist.
Comment 11 peter garrone 2008-12-12 14:46:58 UTC
Created attachment 21105 [details]
My shell script for launching problem. Command is sudo stx -min -gdb

To reproduce the error, I compiled and installed X11 under the prefix /usr/local/x11prefix, and I run this script with
$ sudo stx -min -gdb
On another terminal, I run xrandr, similar to what is in the script, although xrandr -q alone is sufficient.
Comment 12 Eric Anholt 2008-12-13 16:58:19 UTC
Generally no user environment involves server regens, so it may be in your best interest to avoid running a testing environment involving that code path.  (Basically, run an xlogo or xterm or something before playing with xrandr).  Still a bug.
Comment 13 peter garrone 2008-12-14 13:48:18 UTC
I can confirm that the error does not occur while xlogo is also running, because the "master" is not deallocated.
Comment 14 peter garrone 2008-12-15 17:22:12 UTC
If the /dev/dri/cardN file descriptor is held open by a paused process, this freeze does not occur either (just as when an X application is running). However, during xrandr operations the i915 heap initialisation is invoked again, and since no heap deallocation has taken place, errors of the following kind appear in the dmesg output:

[drm:i915_mem_heap_init] *ERROR* heap already initialised?

(except that this error currently has no newline, so it runs onto the next dmesg message on output)
Comment 15 peter garrone 2008-12-16 15:08:28 UTC
Created attachment 21216 [details]
kernel patch that addresses error

This kernel patch to drm_stub.c removes all maps in dev->maplist that reference the master when the master structure is being freed, after invocation of the device destroy callback. It doesn't appear to introduce any new quirks.
Use at your own risk.
Comment 16 Greg McGee 2008-12-24 01:27:20 UTC
Comment on attachment 21216 [details]
kernel patch that addresses error

What kernel version is the patch against?

I tried it vs 2.6.17.10 and 2.6.18-pre9, no joy, 1 of 1 hunks rejected.
Comment 17 Greg McGee 2008-12-24 01:52:06 UTC
Thank you all for working on this. I suspect this may affect more than just the Intel servers.

What kernel version is the diff against?
I tried it against 2.6.17.10 and 2.6.18-pre9.

