Summary:     Xorg server crashed with AMD Radeon HD 6310

Product:     xorg                  Reporter:    Nemo_bis <federicoleva>
Component:   Driver/Radeon         Assignee:    xf86-video-ati maintainers <xorg-driver-ati>
Status:      RESOLVED INVALID      QA Contact:  Xorg Project Team <xorg-team>
Severity:    normal
Priority:    medium
Version:     unspecified
Hardware:    x86 (IA32)
OS:          Linux (All)
Description
Nemo_bis 2012-12-04 17:12:20 UTC

Michel:
Please attach your dmesg output when X hangs. What version of mesa are you using? Thank you.

Nemo_bis:
I'll post my dmesg output when it happens again; I have rebooted once or twice since, so the current one doesn't contain much of use. I have mesa 8.0.4-1.fc17.

Created attachment 71085 [details]
/var/log/messages
Here's the dmesg for the last 4 days or so.
The period about ten minutes after

Dec 6 12:58:02 e350 kernel: [157010.314171] VFS: file-max limit 371898 reached

and shortly before it seems to be a good example.
$ cat /proc/sys/fs/file-max
371898
$ cat /proc/sys/fs/file-nr
11488 0 371898
$ lsof | wc
41810 411540 4726629
$ ps -ef | wc
177 1602 14372
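As an aside (not part of the original report): the three fields printed by /proc/sys/fs/file-nr are the number of allocated file handles, the number of allocated-but-unused handles (always 0 on modern kernels), and the system-wide maximum (the same value as /proc/sys/fs/file-max). A minimal sketch to read them into named variables:

```shell
# Read /proc/sys/fs/file-nr: fields are <allocated> <unused> <max>.
# On kernels since 2.6 the middle field is always reported as 0.
read -r allocated unused max < /proc/sys/fs/file-nr
echo "allocated=$allocated unused=$unused max=$max"
```

In this report the interesting number is the first one: it starts around 11488 after boot and climbs toward the 371898 limit.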
Michel:
(In reply to comment #3)
> Dec 6 12:58:02 e350 kernel: [157010.314171] VFS: file-max limit 371898 reached

Sounds like something may be leaking file descriptors...

> $ cat /proc/sys/fs/file-nr
> 11488 0 371898

I assume these numbers are from when there's no problem? Can you keep an eye on the first number and, if it grows beyond say 100000, try finding out which process has so many file descriptors open?

Nemo_bis:
Yes, that's right after startup. Now it's only a little higher. How do I check which process is keeping many files open, besides killing them all one by one? I can try to identify it, but everything seems to happen very suddenly and quickly...

Michel:
(In reply to comment #5)
> How do I check which process is keeping many files open, besides killing them all one by one?

E.g. something like

( for i in /proc/[0-9]*/fd; do echo "$(sudo ls $i | wc -l): $i"; done ) | sort -n | tail

does the trick. There might be a simpler way.

Nemo_bis:
Thanks. I didn't manage to use it because this time it happened during the night.
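A hypothetical variant of the one-liner above (not posted in the thread): counting the entries of each /proc/<pid>/fd the same way, but also printing the process name from /proc/<pid>/comm, so the top consumers are identifiable without a second lookup:

```shell
# For each process, count the open descriptors in /proc/<pid>/fd and
# print the count, the fd directory, and the process name; show the
# ten largest. Run as root to see other users' processes.
for i in /proc/[0-9]*/fd; do
    pid=${i#/proc/}; pid=${pid%/fd}
    n=$(ls "$i" 2>/dev/null | wc -l)
    name=$(cat "/proc/$pid/comm" 2>/dev/null)
    echo "$n: $i ($name)"
done | sort -n | tail
```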
Apparently, I have many repetitions of something like this:

Dec 8 06:48:57 e350 dhclient[782]: DHCPREQUEST on em1 to 192.168.1.254 port 67 (xid=0x582009fe)
Dec 8 06:48:57 e350 dhclient[782]: DHCPACK from 192.168.1.254 (xid=0x582009fe)
Dec 8 06:48:58 e350 NetworkManager[645]: <info> (em1): DHCPv4 state changed renew -> renew
Dec 8 06:48:58 e350 NetworkManager[645]: <info>   address 192.168.1.130
Dec 8 06:48:58 e350 NetworkManager[645]: <info>   prefix 24 (255.255.255.0)
Dec 8 06:48:58 e350 NetworkManager[645]: <info>   gateway 192.168.1.254
Dec 8 06:48:58 e350 NetworkManager[645]: <info>   hostname 'e350'
Dec 8 06:48:58 e350 NetworkManager[645]: <info>   nameserver '62.101.93.101'
Dec 8 06:48:58 e350 NetworkManager[645]: <info>   nameserver '83.103.25.250'
Dec 8 06:48:58 e350 NetworkManager[645]: <info>   domain name 'fastwebnet.it'
Dec 8 06:48:58 e350 dbus[706]: [system] Activating service name='org.freedesktop.nm_dispatcher' (using servicehelper)
Dec 8 06:48:58 e350 dhclient[782]: bound to 192.168.1.130 -- renewal in 764 seconds.
Dec 8 06:48:58 e350 dbus-daemon[706]: dbus[706]: [system] Activating service name='org.freedesktop.nm_dispatcher' (using servicehelper)
Dec 8 06:48:58 e350 dbus-daemon[706]: dbus[706]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 8 06:48:58 e350 dbus[706]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'

and then, about half a million times:

Dec 8 06:55:07 e350 kernel: [150257.594126] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (4096, 4, 4096, -12)
Dec 8 06:55:09 e350 kernel: [150259.571894] [TTM] Out of kernel memory
Dec 8 06:55:09 e350 kernel: [150259.571925] [TTM] Out of kernel memory

I'll try to post the log again (if I manage to compress it decently... it's 200 MB).

I placed the log at https://docs.google.com/open?id=0B2Eepd4bNt7FeUc5VXc2ZzBrQVE

Sorry, it all happens too quickly, I failed to track down the offending process again.
See the evolution of open files:

 48672 0 371898   Wed 12 Dec 2012, 21:54:20 CET
 66720 0 371898   Wed 12 Dec 2012, 22:04:20 CET
114304 0 371898   Wed 12 Dec 2012, 22:14:20 CET
163520 0 371898   Wed 12 Dec 2012, 22:24:21 CET
180768 0 371898   Wed 12 Dec 2012, 22:34:21 CET
198112 0 371898   Wed 12 Dec 2012, 22:44:21 CET
226016 0 371898   Wed 12 Dec 2012, 22:54:21 CET
269952 0 371898   Wed 12 Dec 2012, 23:04:21 CET
341312 0 371898   Wed 12 Dec 2012, 23:14:21 CET

Michel:
(In reply to comment #9)
> Sorry, it all happens too quickly, I failed to track down the offending process again.
> See the evolution of open files: [...]

I don't think you need to wait for the limit to be reached. If the number of file descriptors is normally around 50K, then 100K or more could already indicate a problem. Or you could run a script, either in the background or from a cron job, which dumps the total number and the top 10 processes to a log file at regular intervals. Be creative. :)

Nemo_bis:
Yes, this time I won't rely on myself checking things regularly. I'm doing this now:

while true; do
    LOG=$(date -Ihours).txt
    for i in /proc/[0-9]*/fd; do
        echo "$(sudo ls $i | wc -l): $i" >> $LOG
    done
    cat /proc/sys/fs/file-nr >> openfiles.log
    date >> openfiles.log
    sleep 600
done

Created attachment 71551 [details]
List of processes with their opened files
No luck: the list of processes with their open files at the time of the last crash doesn't seem to contain the offender: the total comes to about 6 thousand, nowhere near 370 thousand...
Michel:
(In reply to comment #12)
> the total comes to about 6 thousand, nowhere near 370 thousand...

Hmm, then it seems the leaking file descriptors aren't exposed in those /proc files.

Nemo_bis:
The nasty thing is that the keyboard doesn't work at all when this bug happens. I've seen that using a non-PAE kernel doesn't help, and now I'm trying:

while true; do
    LOG=$(date -Ihours).txt
    for i in /proc/[0-9]*/fd; do
        echo "$(sudo ls $i | wc -l): $i" >> $LOG
    done
    OPEN=$(cat /proc/sys/fs/file-nr)
    echo "$OPEN" >> openfiles.log
    if echo "$OPEN" | grep -qE "^[0-9]{6,}"; then
        tput bel
    fi
    date >> openfiles.log
    sleep 600
done

Created attachment 72330 [details]
ps -fax

I can catch it when it happens, but it doesn't help, because even killing all processes one by one sometimes stops the number from increasing but never decreases it. I closed all programs, though not all system processes (and now I opened the browser, of course), but it didn't help. Open files no longer assigned to any process: does that make sense?

This time I have this in dmesg:

[11870.410593] hrtimer: interrupt took 38748 ns
[12294.716262] Peer 2.236.97.130:4662/44274 unexpectedly shrunk window 4096424630:4096441856 (repaired)
[26270.771253] Peer 2.235.219.243:4662/48855 unexpectedly shrunk window 1702546936:1702564256 (repaired)
[39697.640585] [TTM] Out of kernel memory
[39697.640609] radeon 0000:00:01.0: object_init failed for (4096, 0x00000002)
[39697.640614] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (4096, 2, 4096, -12)

and then endless repetition of the last three lines.

Does it look like https://bugs.mageia.org/show_bug.cgi?id=1471 ? Should I try radeon.modeset=0 ?

Michel:
(In reply to comment #15)
> [...] even killing all processes one by one sometimes stops the number from increasing, but never decreases it.

Does it decrease after killing kwin or Xorg?

Nemo_bis:
> Does it decrease after killing kwin or Xorg?

Michel, how can I check Xorg? I'm not sure about kwin, I'll check again next time.
Usually what happens is that after the night it takes some minutes to get the dialog to unlock the screen; then, when I unlock it, I get the message that kwin was killed because of a segmentation fault, and dmesg has all those errors about "Out of kernel memory" and "Failed to allocate GEM object" as in comment 7.

Michel:
(In reply to comment #17)
> Michel, how can I check Xorg?

Always the same idea: once /proc/sys/fs/file-nr has increased significantly, kill the process and see if /proc/sys/fs/file-nr decreases back to normal after that.

Nemo_bis:
Sorry, it was http://bugs.winehq.org/show_bug.cgi?id=32617
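The kill-and-compare check suggested above can be sketched as a small helper (a hypothetical illustration, not from the thread; SUSPECT_PID stands for the PID of kwin or Xorg and must be filled in by hand):

```shell
# Print the first field (allocated handles) of /proc/sys/fs/file-nr.
file_nr() { awk '{ print $1 }' /proc/sys/fs/file-nr; }

before=$(file_nr)
# kill "$SUSPECT_PID"   # uncomment with the real PID of kwin or Xorg
sleep 2
after=$(file_nr)
echo "allocated handles: before=$before after=$after"
# If "after" falls back toward the post-boot baseline (~11488 in this
# report), the killed process was the one holding the leaked handles.
```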