Bug 17791 - X freezes (and kernel seems to enter in deadlock)
Status: RESOLVED NOTOURBUG
Product: xorg
Classification: Unclassified
Component: Driver/radeonhd
Version: 7.3 (2007.09)
Hardware: Other All
Importance: medium normal
Assigned To: Egbert Eich
QA Contact: Xorg Project Team
Reported: 2008-09-26 05:28 UTC by BERTRAND Joël
Modified: 2008-10-14 02:08 UTC (History)
Description BERTRAND Joël 2008-09-26 05:28:35 UTC
Hello,

I have found a reproducible bug. At first I thought that it came from the Linux kernel itself, but when I try to reproduce it without X, the kernel doesn't hang.

I use a laptop with a graphics adapter that is supported by the radeonhd driver:
cauchy:[~] > lspci | grep VGA
01:00.0 VGA compatible controller: ATI Technologies Inc M76 [Radeon Mobility HD 2600 Series]

I run some complex computations. When I launch 4 or 5 processes at the same time (each process takes 1.6 GB; my computer has 4 GB of RAM), the system swaps and the load is about 10. After a few minutes, disk activity stops and X freezes. There is no information in the log files (and none on the console either).

I use a Debian testing distribution with:
cauchy:[~] > dpkg-query -l | grep xorg
ii  xorg                                                    1:7.3+16                   X.Org X Window System

Regards,

JKB
Comment 1 Egbert Eich 2008-09-29 02:32:19 UTC
(In reply to comment #0)
> Hello,
> 
> I have found a reproductible bug. In a first time, I though that it comes from
> linux kernel itself, but if I try to reproduce it without X, kernel doesn't
> hang.
> I do some complex computations and when I launch at the same time 4 or 5
> process (each process take 1,6 GB, my computer has 4 GB of RAM), system swaps
> and load is about 10. After few minutes, disk activity stops and X freezes.
> There is no information in log files (no information on console too).

We have seen problems running X on mobile systems with more than 2GB of physical RAM. Those problems went away when the physical memory size was reduced to a maximum of 2GB (it was necessary to physically remove the memory, not just set a kernel option). I assume that the problem you are seeing is related to this.
The difference from a system without X is that the GFX driver needs to map video memory and register space through PCI, and all of this needs to fit into the 4GB address space.
We therefore believe that this is a BIOS/kernel issue.
Comment 2 BERTRAND Joël 2008-09-29 03:07:40 UTC
(In reply to comment #1)
> We have seen problems running X on mobile systems with more than 2GB of
> physical RAM. Those problems went away, when the physical memory size was
> reduced to 2GB maximum. (it was necessary to physically remove the memory, not
> just set a kernel option). I assume that the problem you are seeing is related
> to this.
> The difference to a system without X is that the GFX driver needs to map video
> memory and register space thru PCI which all need to fit into the 4GB memory
> space.
> We therefore do believe that this is a BIOS/kernel issue.
> 

I have some news. I have seen that mozilla hangs in a futex (with 2GB or 4GB). So I tried to run my test without X. The system crashes too (but it is more difficult...). I have seen a patch for the Linux kernel, but I haven't tested it.

Regards,

JKB
Comment 3 Egbert Eich 2008-09-29 10:07:31 UTC
BERTRAND: Can you try removing 2GB of RAM and see what happens?
If we wanted to debug this, we would have to find a way to reproduce your problem, so we'd need to know what your program does. You say each of your processes takes 1.6GB; is that the problem, or is the high CPU usage the problem?
Or is it sufficient to allocate 1.6GB of memory and touch each page to make the kernel actually map each page?
Would you be able to boil this down to a simple test application?
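
A simple test application of the kind asked for above could look roughly like the following. This is only a minimal sketch, not the reporter's program: the function name touch_pages is hypothetical, and it assumes that writing one byte per page of an anonymous mapping is enough to make the kernel back every page with real memory.

```python
import mmap


def touch_pages(size_bytes, page_size=mmap.PAGESIZE):
    """Allocate an anonymous mapping and write one byte into every page.

    Hypothetical helper for illustration. Returns the mapping (so the
    memory stays allocated while the caller holds it) and the number of
    pages that were written to.
    """
    buf = mmap.mmap(-1, size_bytes)  # anonymous memory, not backed by a file
    touched = 0
    for offset in range(0, size_bytes, page_size):
        buf[offset] = 1  # the write forces the kernel to map this page
        touched += 1
    return buf, touched
```

Calling touch_pages(1600 * 1024 * 1024) from 4 or 5 processes at once, and keeping the returned mappings open, would approximate the memory pressure described in the report. Whether that alone reproduces the hang is not known.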

Comment 4 BERTRAND Joël 2008-09-30 04:42:55 UTC
(In reply to comment #3)
> BERTRAND: Can you try and remove 2GB of RAM and see what happens?
> If we wanted to debug this we would have to find a way to reproduce your
> problem so we'd need to know what your program does. You say each of your
> processes take 1.6 GB, is this the problem or is the high CPU usage the
> problem?
> Or is it sufficient to allocate 1.6GB of memory and touch each page to make the
> kernel actially map each page?
> Would you be able to boil this down to a simple test application?

I can remove 2GB of RAM, but I don't understand what you want to see. I have run some tests _without_ X (X is not running at all, not even on another virtual console). So my test system is only a text-based Linux system.
If this bug can be observed without X, I don't understand how it can be X related.

The test program is a huge multitasked/multithreaded code that runs some SQL queries. I don't know whether the crash comes from heavy CPU usage or disk usage (or some other component such as the RAID subsystem), but it seems to be kernel related, not X related as I had thought. I have to restart my test program several times to see a crash, and I cannot debug it because the magic SysRq key no longer works.
Comment 5 Egbert Eich 2008-10-14 02:08:06 UTC
(In reply to comment #4)
> I can remove 2GB of RAM, but I don't understand what you want to see. I have
> made some tests _without_ X (X is not running anymore, even on an other virtual
> console). Thus, I only have on my test system a text based linux system.
> If this bug can be observed without X, I don't understand how this bug can be X
> related.
> 
> Test program is a huge multitasked/multithreaded code that does some SQL
> queries. I don't know if crash comes from huge cpu usage or disk usage (or some
> other components like raid subsystem), but it seems to be kernel related, not X
> related as I knew. I have to restart several times my test program to see a
> crash and I cannot debug because sysrq magic key doesn't work anymore.
> 
Bertrand, we've seen cases where reducing the size of RAM miraculously fixed issues with X. The theory was that there is an overlap between the PCIe BAR address ranges and main memory.
Since text mode also uses a framebuffer (and thus VRAM, which is mapped somewhere), I could envision such problems even there. It was just a wild guess.
If reducing the size of RAM helped, it would indicate that the problem is related to the other issues we've been seeing.
However, since you are seeing this issue even without X, it doesn't seem to be related to the RadeonHD driver at all.
I will therefore close it as NOTOURBUG for now.
If you need further information, please feel free to continue the discussion on this ticket.
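
The overlap theory from comment 5 can be illustrated with a small range check. This is a sketch under stated assumptions, not a diagnostic tool: the helper names are hypothetical, the addresses below are made up for illustration and were not taken from the reporter's machine, and real ranges would have to be collected separately (for example from /proc/iomem or the "Memory at" lines of lspci -v).

```python
def overlaps(a, b):
    """Return True if two inclusive (start, end) address ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_conflicts(ram_ranges, bar_ranges):
    """Return every (ram, bar) pair of ranges that overlap."""
    return [(ram, bar)
            for ram in ram_ranges
            for bar in bar_ranges
            if overlaps(ram, bar)]


# Hypothetical address ranges, for illustration only.
ram_ok = [(0x00100000, 0xBFFFFFFF)]   # RAM ends below the BARs
ram_bad = [(0x00100000, 0xDFFFFFFF)]  # RAM reaches into the first BAR
bars = [(0xD0000000, 0xDFFFFFFF),     # e.g. a 256 MB video memory aperture
        (0xFE9F0000, 0xFE9FFFFF)]     # e.g. a 64 KB register window

for ram, bar in find_conflicts(ram_bad, bars):
    print(f"RAM {ram[0]:#x}-{ram[1]:#x} overlaps BAR {bar[0]:#x}-{bar[1]:#x}")
```

With ram_ok the list of conflicts is empty. With ram_bad the first BAR is reported, which is the kind of situation the theory describes.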