Bug 28402 - random radeon/kms/drm related freezes with kernel 2.6.34
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Duplicates: 27525 32107
Reported: 2010-06-05 11:00 UTC by Da Fox
Modified: 2011-02-09 02:25 UTC


Attachments
Kernel log for one day (26.95 KB, application/gzip)
2010-06-11 11:36 UTC, Da Fox
cat /proc/version (145 bytes, text/plain)
2010-08-30 09:33 UTC, Lukas Schneiderbauer
cat /proc/cpuinfo (535 bytes, text/plain)
2010-08-30 09:35 UTC, Lukas Schneiderbauer
cat /proc/modules (1.17 KB, text/plain)
2010-08-30 09:36 UTC, Lukas Schneiderbauer
cat /proc/ioports (1.57 KB, text/plain)
2010-08-30 09:37 UTC, Lukas Schneiderbauer
cat /proc/iomem (1.79 KB, text/plain)
2010-08-30 09:37 UTC, Lukas Schneiderbauer
cat /proc/scsi/scsi (336 bytes, text/plain)
2010-08-30 09:38 UTC, Lukas Schneiderbauer
lspci -vvv (13.93 KB, text/plain)
2010-08-30 09:38 UTC, Lukas Schneiderbauer
kernel .config (68.14 KB, text/plain)
2010-08-30 09:55 UTC, Lukas Schneiderbauer
Output of "radeontool regmatch '*'" on a clean boot with 44ca7478d46aaad488d916f7262253e000ee60f9 (19.74 KB, application/octet-stream)
2010-09-05 15:25 UTC, Da Fox
Output of "radeontool regmatch '*'" on a clean boot with d594e46ace22afa1621254f6f669e65430048153 + 8e36113082821980c60ce89a6c5d45fc9492fc26 (19.76 KB, application/octet-stream)
2010-09-05 15:30 UTC, Da Fox
radeontool regmatch '*' on 44ca7478d46aaad488d916f7262253e000ee60f9 (19.94 KB, text/plain)
2010-09-06 01:53 UTC, Lukas Schneiderbauer
radeontool regmatch '*' on d594e46ace22afa1621254f6f669e65430048153 + 8e36113082821980c60ce89a6c5d45fc9492fc26 (19.97 KB, text/plain)
2010-09-06 01:55 UTC, Lukas Schneiderbauer
possible fix (7.88 KB, patch)
2010-09-07 08:34 UTC, Alex Deucher
vram align patch does not seem to work, now trying this vmembase at 0 patch (496 bytes, patch)
2010-09-08 12:42 UTC, Martin Steigerwald
x11 components version (1.70 KB, text/plain)
2010-09-09 06:08 UTC, Lukas Schneiderbauer
emerge --info (4.74 KB, text/plain)
2010-09-09 06:09 UTC, Lukas Schneiderbauer
Make sure gtt and vram are not directly adjacent (2.00 KB, patch)
2010-10-14 16:59 UTC, Alex Deucher
updated patch (2.08 KB, patch)
2010-10-20 13:31 UTC, Alex Deucher
Make sure MC vram map is >= pci aperture size (1.25 KB, patch)
2010-10-23 06:56 UTC, Alex Deucher

Description Da Fox 2010-06-05 11:00:07 UTC
When using kernel 2.6.34 random lock-ups occur after a while.
I first noticed this with the -rc releases, but it is still
present in the final release. 
Lock-ups occur randomly, but it appears as if they are somewhat
related to CPU usage. For example the system often (but not
always) freezes when starting firefox (e.g. when opening a 
large number of tabs), or when switching desktops or
applications (alt-tab). When the system freezes it is totally
unresponsive to input, and requires a hard-reboot.

I have tried to bisect this, but this is a long, slow process.
The bug is not very deterministic, sometimes the system will
freeze within minutes of booting (e.g. when starting firefox),
sometimes it takes a bit longer, and occasionally it may take
several hours before a freeze occurs. This caused me to at
first mark 2.6.34-rc2 as a stable point (which it is not),
and generally means that it takes a long time before I am
able to mark a commit as 'good'.

I have now traced the problem to somewhere between 2.6.33 and
7a9f0dd9c49425e2b0e39ada4757bc7a38c84873.
More specifically:
7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 - drm: Add generic multipart buffer.   <-- I have experienced a freeze in this commit
d594e46ace22afa1621254f6f669e65430048153 - drm/radeon/kms: simplify memory controller setup V2   <-- This commit causes the PC to reset when booting
44ca7478d46aaad488d916f7262253e000ee60f9 - drm/radeon: Add asic hook for dma copy to r200 cards.   <-- This I am currently testing, but has not caused a freeze yet.
Another interesting commit was 7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 - drm: Add generic multipart buffer. This commit freezes when starting X.

The problem I am facing now is that it is very difficult to test 
this commit (or several earlier commits) because for some reason
these cause the PC to respond very slowly, certainly too slow
for regular use. System load runs nearly 50% constantly, even when 
idle (the %sy value in top).

This is about all the information I have at this point, I hope 
it's of some use.
Comment 1 Jerome Glisse 2010-06-08 13:20:06 UTC
Which GPU is this, please? (lspci -v)
Comment 2 Da Fox 2010-06-08 13:33:48 UTC
I'm sorry, I should have provided that information immediately.

Output of 'lspci -v' for the video controller:
---8<---------
01:00.0 VGA compatible controller: ATI Technologies Inc RV350 [Mobility Radeon 9600 M10] (prog-if 00 [VGA controller])
	Subsystem: IBM Device 0550
	Flags: bus master, fast Back2Back, 66MHz, medium devsel, latency 66, IRQ 11
	Memory at e0000000 (32-bit, prefetchable) [size=128M]
	I/O ports at 3000 [size=256]
	Memory at c0100000 (32-bit, non-prefetchable) [size=64K]
	[virtual] Expansion ROM at c0120000 [disabled] [size=128K]
	Capabilities: [58] AGP version 2.0
	Capabilities: [50] Power Management version 2
	Kernel driver in use: radeon
--->8---------

It seems to say memory size is 128M, but this is a 64M board...
The command was run under kernel 2.6.33.
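
For reference, the [size=...] value that lspci prints is the size of the PCI memory region (the aperture), which is not necessarily the same as the amount of VRAM on the board. A minimal Python sketch (not part of the original report; the name parse_bar is made up for illustration) that turns such a line into an address range:

```python
import re

# Matches the "Memory at ..." lines printed by lspci -v, as quoted above.
BAR_RE = re.compile(
    r"Memory at ([0-9a-f]+) \((?:32|64)-bit, (?:non-)?prefetchable\) "
    r"\[size=(\d+)([KMG])\]"
)
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_bar(line):
    """Return (first_address, last_address, size_in_bytes), or None if no match."""
    m = BAR_RE.search(line)
    if m is None:
        return None
    start = int(m.group(1), 16)
    size = int(m.group(2)) * UNITS[m.group(3)]
    return start, start + size - 1, size


if __name__ == "__main__":
    first, last, size = parse_bar(
        "Memory at e0000000 (32-bit, prefetchable) [size=128M]"
    )
    print(hex(first), hex(last), size // (1 << 20), "MiB")
```

For the line above this gives the range 0xe0000000 to 0xe7ffffff, i.e. a 128 MiB aperture, regardless of how much memory the card actually carries.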
Comment 3 Jerome Glisse 2010-06-11 11:29:02 UTC
Please attach the full dmesg, thanks.
Comment 4 Da Fox 2010-06-11 11:36:50 UTC
Created attachment 36221 [details]
Kernel log for one day
Comment 5 Aidan Marks 2010-06-12 05:19:42 UTC
Also seeing random freezes here on my Thinkpad T60 on 2.6.34 with X1300 mobility radeon.
Comment 6 Da Fox 2010-06-16 05:12:00 UTC
I've recompiled the kernel (this time revision 8e36113082821980c60ce89a6c5d45fc9492fc26) with netconsole enabled, and have 'triggered' the freeze. At that point no errors, warnings or other messages were being printed to netconsole. Logging in remotely via ssh was not possible.
The kernel did not respond even to SysRq-b.
Netconsole output did stop when the system finished booting, but was re-enabled using the command 'dmesg -n 8', issued as root. Netconsole functionality was then tested by disabling and then re-enabling swap space; this caused the swap enabled message to be printed on the netconsole. The kernel was booted with 'debug' specified on the command line.

Is there any way to enable a more verbose output after booting? I've searched for a kernel config option to enable more verbose logging, but I could not find anything which seemed relevant.


(In reply to comment #5)
> Also seeing random freezes here on my Thinkpad T60 on 2.6.34 with X1300
> mobility radeon.

Could you please also specify the exact GPU that you have ? (lspci -v), and
your kernel logs? (perhaps there is something being logged in yours?)
Comment 7 Aidan Marks 2010-06-16 05:22:05 UTC
> (In reply to comment #5)
> > Also seeing random freezes here on my Thinkpad T60 on 2.6.34 with X1300
> > mobility radeon.
> 
> Could you please also specify the exact GPU that you have ? (lspci -v), and
> your kernel logs? (perhaps there is something being logged in yours?)

the easy one to start with:

# uname -a
Linux voyager 2.6.34-gentoo #1 SMP PREEMPT Mon May 24 10:53:46 EST 2010 i686 Intel(R) Core(TM) Duo CPU T2400 @ 1.83GHz GenuineIntel GNU/Linux
# lspci -vv -s 01:0
01:00.0 VGA compatible controller: ATI Technologies Inc M52 [Mobility Radeon X1300] (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device 2005
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at d8000000 (32-bit, prefetchable) [size=128M]
        Region 1: I/O ports at 2000 [size=256]
        Region 2: Memory at ee100000 (32-bit, non-prefetchable) [size=64K]
        [virtual] Expansion ROM at ee120000 [disabled] [size=128K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Kernel driver in use: radeon
        Kernel modules: radeon
Comment 8 Da Fox 2010-06-18 06:44:58 UTC
(In reply to comment #7)
> the easy one to start with:

So how is it coming with the difficult one?
Was there nothing in your logs or did you not have a chance to take a look yet?
Comment 9 mrsteven 2010-07-13 05:26:49 UTC
I seem to have the same problem, see https://bugs.freedesktop.org/show_bug.cgi?id=23660#c23

GPU is the same, I also had the problem with the non-KMS driver earlier, but that magically disappeared with kernel 2.6.33 and KMS. With 2.6.34 I get these freezes again.
Comment 10 Da Fox 2010-08-15 14:21:55 UTC
I tried with kernel 2.6.35.2 over the weekend, and this issue is still present.



However, there is some urgency to get this fixed, because the rest of the
system will start to depend on more recent kernels in a while. For example
the newer udev-151 warns that:

---8<---------
ERROR: setup
  CONFIG_IDE:    should not be set. But it is.


WARN: setup
Please check to make sure these options are set correctly.
Failure to do so may cause unexpected problems.
--->8---------

This is due to the fact that I am using the old (and pretty soon deprecated) ATA drivers in the kernel, and not the newer (and nowadays stable) libata drivers.
So if I want to use the newer udev I would prefer to be able to do so on a newer kernel, given that the libata drivers are still relatively new.

This is of course but one example, but there can be a variety of reasons which necessitate a kernel upgrade. Another good example would be the inter-dependencies of Xorg, video drivers and the kernel.


So please:
Devs: take a look at the commits I pointed out in the initial description, and answer my question from comment #6.
Aidan: upload your kernel logs, I think you should have collected enough data by now.
Comment 11 Aidan Marks 2010-08-16 06:14:57 UTC
(In reply to comment #10)

I can't offer much help anymore on this defect, sorry.  Soon after my comment, my T60 laptop was returned as part of the company lease cycle and my new thinkpad has an intel gfx chip.
Comment 12 Da Fox 2010-08-28 17:16:01 UTC
This is still an issue with 2.6.36-rc2.
Comment 13 Alex Deucher 2010-08-28 17:18:40 UTC
This may be related to connector polling.  Does the patch to disable polling
in bug 29389 help?
Comment 14 Da Fox 2010-08-29 13:39:23 UTC
(In reply to comment #13)
> This may be related to connector polling.  Does the patch to disable polling
> in bug 29389 help?

I'm sorry to report that it does not. 

I applied the patch, but I had to make a small change to it because the last chunk did not apply. It seems delayed_slow_work_enqueue() was renamed to queue_delayed_work() and the patch only seems to add an additional if() before the call to queue_delayed_work() in drm_helper_hpd_irq_event(), so I modified the patch accordingly.
I also modified my kernel command-line to include drm_kms_helper.poll=0. 

I had read that the patch might cause X to refuse to start, but I did not experience this. X started normally on each attempt (I have tried 3x).
Each time the system froze in the same way as before, within a few minutes of booting.
Comment 15 Da Fox 2010-08-30 05:39:40 UTC
I just wanted to point out a thread I just found on the ArchLinux forums which describes the same issue, with many people reportedly experiencing this problem:
https://bbs.archlinux.org/viewtopic.php?id=100843&p=1
One user ('vootey') who has attempted to bisect the kernel appears to have isolated the same region of commits that I have identified: "I'm trying to git bisect the kernel and until now I was able to narrow the bug (for me) down to the higher 2.6.33 area.".

So at least now I know I'm not alone :(
Comment 16 Lukas Schneiderbauer 2010-08-30 09:30:14 UTC
Thank you, Da Fox, for pointing me to this thread. I'm "vootey".

I can confirm this issue. The freezes mostly occur while using firefox.

As Da Fox said, I'm about to finish bisecting (4 steps to go). I marked versions as "good" if there were at least 10 hours of uptime without a freeze. I tried to stress my system more than usual. (If a kernel was bad, the freeze always came within 2 hours of uptime.) I will report when I'm finished.
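
To put rough numbers on this: a bisection over N commits needs about log2(N) build-and-test rounds, so 4 remaining steps correspond to roughly 16 candidate commits. A small Python sketch of that arithmetic (illustrative only; the helper names are made up, and the 10-hour soak time is the figure mentioned above):

```python
import math


def bisect_steps(n_commits):
    """Approximate number of test rounds needed to bisect n_commits candidates."""
    if n_commits <= 1:
        return 0
    return math.ceil(math.log2(n_commits))


def worst_case_hours(steps, hours_per_good):
    """Upper bound on testing time if every round needs the full 'good' soak time."""
    return steps * hours_per_good


if __name__ == "__main__":
    steps = bisect_steps(16)
    print(steps, "steps, at most", worst_case_hours(steps, 10), "hours of uptime")
```

Since bad kernels tend to freeze much sooner than the soak time, the real total is usually lower, but a single wrong "good" verdict invalidates every later round.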

I will attach the usual sys-info files. Please ask, if more is needed.
Comment 17 Lukas Schneiderbauer 2010-08-30 09:33:58 UTC
Created attachment 38307 [details]
cat /proc/version

kernel is from a bisecting git-repo (so ignore the version)
Comment 18 Lukas Schneiderbauer 2010-08-30 09:35:28 UTC
Created attachment 38308 [details]
cat /proc/cpuinfo
Comment 19 Lukas Schneiderbauer 2010-08-30 09:36:12 UTC
Created attachment 38309 [details]
cat /proc/modules
Comment 20 Lukas Schneiderbauer 2010-08-30 09:37:01 UTC
Created attachment 38310 [details]
cat /proc/ioports
Comment 21 Lukas Schneiderbauer 2010-08-30 09:37:24 UTC
Created attachment 38311 [details]
cat /proc/iomem
Comment 22 Lukas Schneiderbauer 2010-08-30 09:38:29 UTC
Created attachment 38312 [details]
cat /proc/scsi/scsi
Comment 23 Lukas Schneiderbauer 2010-08-30 09:38:59 UTC
Created attachment 38313 [details]
lspci -vvv
Comment 24 Lukas Schneiderbauer 2010-08-30 09:55:56 UTC
Created attachment 38314 [details]
kernel .config
Comment 25 Lukas Schneiderbauer 2010-08-30 11:27:58 UTC
I got it.
Since the last kernels were all bad ones, progress was made a bit faster.

#############
32b3c2abaf8c61c80a8b02071c73f05252122ffe is the first bad commit
commit 32b3c2abaf8c61c80a8b02071c73f05252122ffe
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Fri Feb 26 19:14:12 2010 +0000

    drm/radeon/kms: initialize set_surface_reg reg for rs600 asic

    rs600 asic was missing set_surface_reg callback leading to
    oops.

    Signed-off-by: Jerome Glisse <jglisse@redhat.com>
    Signed-off-by: Dave Airlie <airlied@redhat.com>

:040000 040000 f46b151d49ec9023ce01cded50fda4c52db311cb 4e640582f7f3b07ed9994422432580070565692e M      drivers
############
Comment 26 Da Fox 2010-08-31 06:37:02 UTC
(In reply to comment #25)
> I got it.
> Since the last kernels were all bad ones, progress was made a bit faster.
> 
> #############
> 32b3c2abaf8c61c80a8b02071c73f05252122ffe is the first bad commit
> commit 32b3c2abaf8c61c80a8b02071c73f05252122ffe
> Author: Jerome Glisse <jglisse@redhat.com>
> Date:   Fri Feb 26 19:14:12 2010 +0000
> 
>     drm/radeon/kms: initialize set_surface_reg reg for rs600 asic
> 
>     rs600 asic was missing set_surface_reg callback leading to
>     oops.
> 
>     Signed-off-by: Jerome Glisse <jglisse@redhat.com>
>     Signed-off-by: Dave Airlie <airlied@redhat.com>
> 
> :040000 040000 f46b151d49ec9023ce01cded50fda4c52db311cb
> 4e640582f7f3b07ed9994422432580070565692e M      drivers
> ############

Interesting, that is almost the same as what I found, but not exactly.
Which repository did you use for bisecting? I used Dave Airlie's (airlied) drm tree, with what was then the drm-next branch. I don't know if that would make a difference though. I don't know how/if git preserves history across merges in different branches, i.e. if you used Linus's tree, would you see the whole history or only the merge points?
The reason I am wondering is that to me it seems 32b3c2abaf8c61c80a8b02071c73f05252122ffe is just after 2 merge points, one of which includes the commits I pointed out in my initial comment (if I understand the gitk listing correctly at least). Also, to me 32b3c2abaf8c61c80a8b02071c73f05252122ffe seems unlikely as the culprit, since it modifies something in the rs600 code, and hence should not have any effect on our r3xx cards.

Could you possibly see if the following three (consecutive) commits are in your tree too, and test them? 
7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 - drm: Add generic multipart buffer.
d594e46ace22afa1621254f6f669e65430048153 - drm/radeon/kms: simplify memory
controller setup V2
44ca7478d46aaad488d916f7262253e000ee60f9 - drm/radeon: Add asic hook for dma
copy to r200 cards.

For me 44ca7478d46aaad488d916f7262253e000ee60f9 seems to be the last stable commit (I've been testing it again today, and it didn't freeze so far), whereas I've noted 7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 down as freezing. My notes also say I could not test d594e46ace22afa1621254f6f669e65430048153 for some reason.
Comment 27 Lukas Schneiderbauer 2010-08-31 09:21:07 UTC
(In reply to comment #26)
> Which repository did you use for bisecting?
Linus's tree.

> I don't know how/if git preserves history across merges in
> different branches, i.e. if you used Linus's tree, would you see the whole
> history or only the merge points?
I don't know. I'm not very familiar with git.

> Also to me
> 32b3c2abaf8c61c80a8b02071c73f05252122ffe seems unlikely as the culprit, since
> it modifies something in the rs600 code, and hence should not have any effect
> on our r3xx cards.
I completely agree. I will roll back a few versions and try the "good" versions again.

> Could you possibly see if the following three (consecutive) commits are in 
> your tree too, and test them?
I'd like to, as soon as I find out how to check out these specific versions. ;)


What we need is a damn trigger to this bug. Without reproducibility trying to catch the right commit is a pure matter of luck and very frustrating.
Comment 28 Da Fox 2010-08-31 13:04:34 UTC
(In reply to comment #27)
> (In reply to comment #26)

> > Could you possibly see if the following three (consecutive) commits are in 
> > your tree too, and test them?
> I'd like to, as soon as I find out how to check out these specific versions. ;)

Try something like 'git checkout <sha1>' to check out a specific revision. It will complain if the commit does not exist or if there is something else which prevents the checkout (locally modified files for example).
In case they have a different sha1 id in your tree you can try looking for them using 'gitk' (the git repository browser). Since the kernel is so big you may want to limit how far back gitk should show history; try gitk --since=01-01-2010. You can either try to jump directly to the commit by entering a sha1 id into the 'SHA1 ID' box, or by typing a commit message into the 'Find' box. Beware that by default the search is case-sensitive.



> What we need is a damn trigger to this bug. Without reproducibility trying to
> catch the right commit is a pure matter of luck and very frustrating.

I feel your pain, I've also mis-identified an 'unstable' commit as 'stable', which means the whole rest of the bisecting process (which takes a lot of time) is wasted. What triggers the freeze for me most of the time though is opening firefox (with a lot of windows and tabs from my previous browsing session). I do this as soon as I've logged in and my desktop environment has finished loading. 



On a side note, for those earlier commits (such as 44ca7478d46aaad488d916f7262253e000ee60f9, which I've been testing again during the day), do you also find that the system becomes very slow? As in there is a high CPU usage, but not caused by any running program?
I've seen this again today, where top reports all programs using only a small amount of cpu (a grand total of 8% or so), and yet my CPU usage was almost 50%.
Comment 29 Lukas Schneiderbauer 2010-08-31 13:19:44 UTC
Thanks for the help.

I managed to compile and boot the kernel with 7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 as head.
('uname -r' reports "2.6.33-00035-gaa71fa3")

3,5 hours up and so far no freeze.
Comment 30 Da Fox 2010-09-01 04:37:38 UTC
(In reply to comment #29)
> Thanks for the help.
> 
> I managed to compile and boot the kernel with
> 7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 as head.
> ('uname -r' reports "2.6.33-00035-gaa71fa3")
> 
> 3,5 hours up and so far no freeze.

That uname report does not match with the version you've compiled, something must have gone wrong. It should be "2.6.33-00519-7a9f0dd", the first part is the current 'base' version of the kernel (so 2.6.33, since it's before the release of 2.6.34), the second part (00035) I do not know the meaning of (so it could be different for you), and the final part is composed of the letter 'g' (for git?) followed by the first part of the sha1 id. Revision aa71fa3... is "Merge remote branch 'nouveau/for-airlied' into drm-next-stage", which is still a bit before those three commits. So no freeze there is good!
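
For reference: for kernels built from a git tree, this suffix follows the 'git describe' convention, so the middle field is the number of commits since the most recent tag and the last field is 'g' plus an abbreviated sha1. A trailing '-dirty' marks uncommitted local changes. A minimal Python sketch (the function name parse_release is made up for illustration, and it only handles strings of this shape) that splits such a release string:

```python
import re

# base version, commits since the last tag, abbreviated sha1, optional -dirty
RELEASE_RE = re.compile(
    r"^(?P<base>\d+\.\d+\.\d+(?:-rc\d+)?)"
    r"-(?P<commits>\d+)"
    r"-g(?P<sha>[0-9a-f]+)"
    r"(?P<dirty>-dirty)?$"
)


def parse_release(release):
    """Split a 'uname -r' string of the form BASE-NNNNN-gSHA[-dirty]."""
    m = RELEASE_RE.match(release)
    if m is None:
        return None  # e.g. a distribution kernel such as 2.6.34-gentoo
    return {
        "base": m.group("base"),
        "commits_since_tag": int(m.group("commits")),
        "sha_prefix": m.group("sha"),
        "dirty": m.group("dirty") is not None,
    }


if __name__ == "__main__":
    print(parse_release("2.6.33-00035-gaa71fa3"))
```

The sha_prefix can then be compared against the start of the commit id that was supposed to be checked out.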

For me 7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 resets the computer when starting X. This is probably due to a bug in that version, which has been fixed a few commits later, in 8e36113082821980c60ce89a6c5d45fc9492fc26 - drm/radeon/kms: fix R3XX/R4XX memory controller intialization.

I've compiled and tested a kernel based on d594e46ace22afa1621254f6f669e65430048153 with one additional commit, 8e36113082821980c60ce89a6c5d45fc9492fc26. This again froze within a minute of starting firefox. So the offending commit definitely must be d594e46ace22afa1621254f6f669e65430048153 - drm/radeon/kms: simplify memory controller setup V2.

If you want to test this too you can do it like this:
$ git checkout d594e46ace22afa1621254f6f669e65430048153
$ git cherry-pick -n 8e36113082821980c60ce89a6c5d45fc9492fc26
cherry-pick applies a commit on top of the current state. The -n flag makes it apply the changes locally without actually committing anything. You can now compile and test this version. To get rid of the local changes again, run
$ git reset --hard
Comment 31 Lukas Schneiderbauer 2010-09-01 09:42:59 UTC
(In reply to comment #30)
> That uname report does not match with the version you've compiled
hm.. maybe I mixed it up with my bisect session. I'm sorry and thank you for your patience. :)

> For me 7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 resets the computer when
> starting X. This is probably due to a bug in that version, which has been fixed
> a few commits later, in 8e36113082821980c60ce89a6c5d45fc9492fc26 -
> drm/radeon/kms: fix R3XX/R4XX memory controller intialization.
> 
> I've compiled and tested a kernel based on
> d594e46ace22afa1621254f6f669e65430048153 with one additionally commit,
> 8e36113082821980c60ce89a6c5d45fc9492fc26. This again froze within a minute of
> starting firefox. So the offending commit definitely must be
> d594e46ace22afa1621254f6f669e65430048153 - drm/radeon/kms: simplify memory
> controller setup V2.
At the moment I am on 2.6.32-00518-gd594e46-dirty (should be the d594.. commit with the 8e36... patch; I did as you explained) with an uptime of 1.5 hours. Hm.. somehow unusual for a "bad" kernel. To be honest, I hope a crash occurs soon. :D

Do you also know, what that "-dirty" means in the release-version?
Comment 32 Lukas Schneiderbauer 2010-09-02 07:46:15 UTC
d594e46ace22afa1621254f6f669e65430048153 finally caused a freeze.

Now I'm checking 7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 (2.6.32-00519-g7a9f0dd-dirty)
Comment 33 Lukas Schneiderbauer 2010-09-02 08:46:21 UTC
7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 froze for me as well.
Comment 34 Da Fox 2010-09-02 09:14:44 UTC
> Do you also know, what that "-dirty" means in the release-version?
I believe it means that you have local, uncommitted changes to the source tree. This is expected since we instructed cherry-pick not to actually commit any of the changes it made.

> d594e46ace22afa1621254f6f669e65430048153 finally caused a freeze.
> 7a9f0dd9c49425e2b0e39ada4757bc7a38c84873 froze for me as well.
Good, that confirms that the commit causing the freeze issue is indeed prior to the commit you identified at first, 32b3c2abaf8c61c80a8b02071c73f05252122ffe.

If you could now confirm 44ca7478d46aaad488d916f7262253e000ee60f9 as not causing the freeze we have finally isolated the exact commit that is causing the freezes. Hopefully the devs can then at last fix it. Having more people point out the same commit is always more convincing than one person alone, especially in a bug which can be as random as this one (since sometimes it takes quite a bit of time before it happens, and sometimes it almost instantly freezes when launching for example firefox).
Comment 35 Lukas Schneiderbauer 2010-09-02 11:10:09 UTC
(In reply to comment #34)
> If you could now confirm 44ca7478d46aaad488d916f7262253e000ee60f9 as not
> causing the freeze we have finally isolated the exact commit that is causing
> the freezes.
I'm on it.

(In reply to comment #28)
> On a side note, for those earlier commits (such as
> 44ca7478d46aaad488d916f7262253e000ee60f9, which I've been testing again during
> the day), do you also find that the system becomes very slow? As in there is a
> high CPU usage, but not caused by any running program?
> I've seen this again today, where top reports all programs using only a small
> amount of cpu (a grand total of 8% or so), and yet my CPU usage was almost 50%.
Yes, it becomes slower, but the CPU usage is only slightly higher. I only notice it when watching movies, for instance, where I experience lags every 5 seconds.
But I guess that should not bother us, since this issue does not occur (at least for me) in later kernel versions, and I doubt that it has anything to do with our bug. Do you agree?
Comment 36 Lukas Schneiderbauer 2010-09-05 05:11:36 UTC
(In reply to comment #34)
> If you could now confirm 44ca7478d46aaad488d916f7262253e000ee60f9 as not
> causing the freeze we have finally isolated the exact commit that is causing
> the freezes.
Confirmed.
44ca7478d46aaad488d916f7262253e000ee60f9 has been in use now for ~ 3 days and no freeze has occurred. That would support your assumption. And I think it's a very likely one.
Comment 37 Da Fox 2010-09-05 15:25:17 UTC
Created attachment 38458 [details]
Output of "radeontool regmatch '*'" on a clean boot with 44ca7478d46aaad488d916f7262253e000ee60f9
Comment 38 Da Fox 2010-09-05 15:30:10 UTC
Created attachment 38459 [details]
Output of "radeontool regmatch '*'" on a clean boot with d594e46ace22afa1621254f6f669e65430048153 + 8e36113082821980c60ce89a6c5d45fc9492fc26
Comment 39 Da Fox 2010-09-05 15:39:00 UTC
> Yes, it becomes slower, but the CPU usage is only slightly higher. I only
> notice it when watching movies, for instance, where I experience lags every
> 5 seconds.
> But I guess that should not bother us, since this issue does not occur (at
> least for me) in later kernel versions, and I doubt that it has anything to
> do with our bug. Do you agree?
Agreed, 'first things first'.



I've talked to one of the devs on IRC ('airlied') and he requested we post the output of the following command: "radeontool regmatch '*'" for both the last known 'good' commit (44ca7478d46aaad488d916f7262253e000ee60f9) and the 'bad' commit (d594e46ace22afa1621254f6f669e65430048153 + 8e36113082821980c60ce89a6c5d45fc9492fc26). I've already attached my output, could you please do the same?
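
Once both dumps exist, the registers that differ between the 'good' and the 'bad' boot can be listed with a short script. The following Python sketch is only an illustration: it assumes each dump line starts with a register name and contains the value as a 0x... token later on the line (the exact regmatch output format is an assumption here), and the register names in the usage example are hypothetical.

```python
import re

HEX_TOKEN = re.compile(r"0x[0-9a-fA-F]+")


def parse_dump(text):
    """Map register name -> value, assuming 'NAME ... 0xVALUE ...' per line."""
    regs = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        # Use the last 0x... token after the name as the register value.
        values = [t for t in tokens[1:] if HEX_TOKEN.fullmatch(t)]
        if values:
            regs[tokens[0]] = int(values[-1], 16)
    return regs


def diff_dumps(good_text, bad_text):
    """Return {name: (good_value, bad_value)} for registers that differ."""
    good, bad = parse_dump(good_text), parse_dump(bad_text)
    return {
        name: (good.get(name), bad.get(name))
        for name in sorted(set(good) | set(bad))
        if good.get(name) != bad.get(name)
    }


if __name__ == "__main__":
    # REG_A and REG_B are placeholder names, not real radeon registers.
    good = "REG_A (0148)\t0x0000ffff (65535)\nREG_B (014c)\t0x00000001 (1)\n"
    bad = "REG_A (0148)\t0x0000ffff (65535)\nREG_B (014c)\t0x00000003 (3)\n"
    for name, (g, b) in diff_dumps(good, bad).items():
        print(name, hex(g), "->", hex(b))
```

Registers that only exist in one of the two dumps show up with None on the other side.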

p.s.
You can find radeontool here (http://cgit.freedesktop.org/~airlied/radeontool/) if your distro does not provide a package.
Comment 40 Dave Airlie 2010-09-05 16:02:21 UTC
If you boot with a bad kernel and run

radeontool regset 0x130 0x70000000

does it stabilise any?
Comment 41 Lukas Schneiderbauer 2010-09-06 01:53:49 UTC
Created attachment 38468 [details]
radeontool regmatch '*' on 44ca7478d46aaad488d916f7262253e000ee60f9
Comment 42 Lukas Schneiderbauer 2010-09-06 01:55:34 UTC
Created attachment 38469 [details]
radeontool regmatch '*' on d594e46ace22afa1621254f6f669e65430048153 + 8e36113082821980c60ce89a6c5d45fc9492fc26
Comment 43 Lukas Schneiderbauer 2010-09-06 02:03:38 UTC
Since I don't know what's important when calling radeontool, I'd better tell you what I did:
I booted the new kernel, started X (via kdm), switched to a vt and used radeontool.

(In reply to comment #40)
> if you boot with a bad kernel and run
> radeontool regset 0x130 0x70000000
> does it stabilise any?
No.
I booted a 2.6.35-r3 kernel with gentoo-patches, which for me is known to freeze frequently, and did as you said:

$ radeontool regset 0x130 0x70000000
OLD: 0x130 (0130)       0x40800000 (1082130432)
NEW: 0x130 (0130)       0x70000000 (1879048192)

After a few minutes, the freeze appeared again.
Comment 44 Da Fox 2010-09-06 08:07:07 UTC
(In reply to comment #43)
> Since I don't know what's important when calling radeontool, I'd better tell
> you what I did:
> I booted the new kernel, started X (via kdm), switched to a vt and used
> radeontool.
I'm sorry for not elaborating more, but I'm no expert myself :)
However, I don't think it is necessary to switch to a vt; it may in fact not be what is required. My understanding is that radeontool captures the current state of the graphics card (the important bits at least). This is likely different between X and a vt; however, we are interested in the state the card has in X (since that is where it freezes). Could you please also post the results of running radeontool from within X, just to be sure we capture all relevant information?



> (In reply to comment #40)
> > if you boot with a bad kernel and run
> > radeontool regset 0x130 0x70000000
> > does it stabilise any?
> No.
> I booted a 2.6.35-r3 kernel with gentoo-patches, which for me is known to
> freeze frequently, and did as you said:
> 
> $ radeontool regset 0x130 0x70000000
> OLD: 0x130 (0130)       0x40800000 (1082130432)
> NEW: 0x130 (0130)       0x70000000 (1879048192)
> 
> After a few minutes, the freeze appeared again.
I have the same result, using d594e46ace22afa1621254f6f669e65430048153 +
8e36113082821980c60ce89a6c5d45fc9492fc26:

# radeontool regset 0x130 0x70000000
OLD: 0x130 (0130)	0x70800000 (1887436800)
NEW: 0x130 (0130)	0x70000000 (1879048192)

And a freeze soon afterwards.
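
To see exactly what the regset changed on each machine, the OLD and NEW values can be XORed. The Python sketch below (the helper name changed_bits is made up for illustration) lists the bit positions that differ; it makes no assumption about what register 0x130 or its individual bits mean.

```python
def changed_bits(old, new):
    """Return the positions (0 = least significant) of bits that differ."""
    diff = old ^ new
    return [bit for bit in range(32) if (diff >> bit) & 1]


if __name__ == "__main__":
    # OLD/NEW pairs as printed by radeontool in the two reports above.
    for old, new in ((0x40800000, 0x70000000), (0x70800000, 0x70000000)):
        print(hex(old), "->", hex(new), "changed bits:", changed_bits(old, new))
```

On the second machine only bit 23 changes, while on the first one bits 28 and 29 change as well, so the two systems did not start from the same register value.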



I am currently testing d594e46ace22afa1621254f6f669e65430048153 +
8e36113082821980c60ce89a6c5d45fc9492fc26 and the following patch as suggested by Dave Airlie on IRC:

---8<---------
diff --git a/drivers/gpu/drm/radeon/r300.c b/drivers/gpu/drm/radeon/r300.c
index c827738..d1a7803 100644
--- a/drivers/gpu/drm/radeon/r300.c
+++ b/drivers/gpu/drm/radeon/r300.c
@@ -477,7 +477,7 @@ void r300_mc_init(struct radeon_device *rdev)
 	default: rdev->mc.vram_width = 128; break;
 	}
 	r100_vram_init_sizes(rdev);
-	base = rdev->mc.aper_base;
+	base = 0;
 	if (rdev->flags & RADEON_IS_IGP)
 		base = (RREG32(RADEON_NB_TOM) & 0xffff) << 16;
 	radeon_vram_location(rdev, &rdev->mc, base);
--->8---------

This seems to help for me. I'm still testing, but I've been running for a couple of hours already and haven't seen a freeze so far. Could you please test this patch as well?
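
For anyone who would rather not read the diff, here is a minimal Python sketch of the base selection the hunk touches. It only mirrors the lines shown in the patch; `pick_vram_base` and its parameters are names made up for this illustration, not kernel functions, and the `nb_tom` value is whatever RADEON_NB_TOM would read on an IGP part.

```python
def pick_vram_base(aper_base, is_igp, nb_tom, patched):
    """Mirror the base selection from the r300_mc_init hunk (illustration only)."""
    # Unpatched: vram is placed at the PCI aperture base.
    # Patched: vram is placed at address 0 instead.
    base = 0 if patched else aper_base
    if is_igp:
        # IGP parts override the base with the low 16 bits of NB_TOM, shifted up.
        base = (nb_tom & 0xFFFF) << 16
    return base


# 0xE0000000 is the aperture base of a non-IGP card (assumed example value).
print(hex(pick_vram_base(0xE0000000, False, 0, patched=False)))
print(hex(pick_vram_base(0xE0000000, False, 0, patched=True)))
```

Note that the IGP branch is untouched by the patch, so only non-IGP cards end up with vram at 0.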
Comment 45 Lukas Schneiderbauer 2010-09-07 03:48:04 UTC
(In reply to comment #44)
> Could you please test this patch also?
I'm testing the 2.6.36-rc3 kernel with this patch at the moment. Looks promising for me as well. But I still need some hours to really confirm it.

If I may ask, on which IRC-channel/server are you talking?
Comment 46 Da Fox 2010-09-07 05:37:49 UTC
(In reply to comment #45)
> If I may ask, on which IRC-channel/server are you talking?

There is an IRC channel #radeon on irc.freenode.net for radeon users and developers.
Comment 47 Alex Deucher 2010-09-07 08:34:38 UTC
Created attachment 38516 [details] [review]
possible fix

Does this patch help?  It always aligns the MC vram and gtt bases to size.
Comment 48 Da Fox 2010-09-07 12:25:46 UTC
(In reply to comment #47)
> Created an attachment (id=38516) [details]
> possible fix
> 
> Does this patch help?  It always aligns the MC vram and gtt bases to size.

I'm sorry to report that it does not. I've tried with 96576a9e1a0cdb8a43d3af5846be0948f52b4460 (current drm-next in airlied's tree).
This freezes without any patches and seems stable with airlied's patch to put vmem at address 0, but still freezes with your patch.

Lukas, can you confirm that this patch still freezes?
Comment 49 Chris Rankin 2010-09-08 01:45:45 UTC
I've also noticed (rare) random freezes with 2.6.34.x kernels. Basically, I've tried to wake the PCs from "DPMS OFF", only to find them completely unresponsive and needing a reboot instead. However, only one of those PCs has an RV350 card. The other two have rv280 and rv100 cards instead, so Dave Airlie's patch to r300.c cannot possibly help them.
Comment 50 Da Fox 2010-09-08 05:02:14 UTC
(In reply to comment #49)
> I've also noticed (rare) random freezes with 2.6.34.x kernels. Basically, I've
> tried to wake the PCs from "DPMS OFF", only to find them completely
> unresponsive and needing a reboot instead. However, only one of those PCs has
> an RV350 card. The other two have rv280 and rv100 cards instead, so Dave
> Airlie's patch to r300.c cannot possibly help them.

Chris: these freezes do occur during normal operation, i.e. while working with the computer, not only during DPMS. They happen during all kinds of activities: browsing, typing a letter, chatting, alt-tabbing, or even while doing nothing. However, they almost always seem to be triggered by some activity. For me, for example, starting firefox after a fresh boot has a 99% chance of causing a freeze during the 'restore tabs from last time' phase.
Although it is quite possible that the freeze will also occur during DPMS sleep, I have not experienced it yet (mostly because the freeze will occur while working, so the computer didn't get a chance to go into DPMS sleep).

So the first thing to do, to verify that you are indeed experiencing the same issue (and not an unrelated DPMS problem), is to keep using your computer and wait for a freeze to occur during usage. Your best bet would be to try the rv350 card: I have mostly seen only people with r300 and/or rv350 cards describe this problem, and both Lukas and I have an rv350 card (we both have a Mobility Radeon 9600 M10).
Once you have confirmed that the freeze occurs during normal working operations also, you should proceed to verify our git-bisect results and test the patches provided by Dave Airlie and Alex Deucher. Best of luck!

p.s.
Is the rv350 card in a PC or a laptop? I noticed both Lukas and I have a laptop with an rv350 card, so perhaps it has something to do with mobility editions?
Comment 51 Lukas Schneiderbauer 2010-09-08 09:09:41 UTC
(In reply to comment #45)
> I'm testing the 2.6.36-rc3 kernel with airlied's patch at the moment. Looks
> promising for me as well. But I still need some hours to really confirm it.
1.5 days uptime and no freeze. So definitely confirmed.


(In reply to comment #48)
> I'm sorry to report that it does not. I've tried with
> 96576a9e1a0cdb8a43d3af5846be0948f52b4460 (current drm-next in airlied's tree).
> This freezes without any patches, seems stable with airlied's patch to put vmem
> at address 0, but freezes still with your patch.
>
> Lukas, can you confirm that this patch still freezes?
2.6.36-rc3 with alex' patch up for > 3 hours and waiting.. :)
Comment 52 Martin Steigerwald 2010-09-08 09:43:03 UTC
random - possibly Radeon DRM KMS related - freezes
https://bugzilla.kernel.org/show_bug.cgi?id=16376

which I reported seems to be a duplicate of this one.

I am having those freezes on a ThinkPad T42 with

shambhala:~> lspci -nn | grep -i vga
01:00.0 VGA compatible controller [0300]: ATI Technologies Inc RV350 [Mobility Radeon 9600 M10] [1002:4e50]

As per suggestion from Alex I will now test patch from comment #47. Then I will try the patches mentioned in comment #44. Da Fox, are d594e46ace22afa1621254f6f669e65430048153 + 8e36113082821980c60ce89a6c5d45fc9492fc26 in some drm related development branch? Can I apply these with git cherry-pick as well?
Comment 53 Da Fox 2010-09-08 10:42:06 UTC
(In reply to comment #52)
> I am having those freezes on a ThinkPad T42 with
I have the same laptop, I'm glad to see someone else still using an old ThinkPad :)

> Da Fox, are d594e46ace22afa1621254f6f669e65430048153 +
> 8e36113082821980c60ce89a6c5d45fc9492fc26 in some drm related development
> branch? Can I apply these with git cherry-pick as well?
Yes, they're from airlied's tree, in the drm-next branch. I think they are in Linus' tree too, which is what Lukas Schneiderbauer uses.
Comment 54 Martin Steigerwald 2010-09-08 12:42:45 UTC
Created attachment 38564 [details] [review]
vram align patch does not seem to work, now trying this vmembase at 0 patch

Alex, your patch from comment #47 does not work. The kernel froze a few seconds after Plasma from KDE 4.4.5 built up the OpenGL compositing desktop.

Now testing with the vmem-base-0 patch from Dave from comment #44. I am attaching it here, since cutting and pasting it from the comment gives a malformed patch.

I am using 60140c143b5cd04d85fec8085d56a1430a109846 from Nigel's tuxonice-head branch, since I am now pretty sure the freeze is unrelated to TuxOnIce, and when this vmem base 0 thing works, I also have a TuxOnIce kernel without compiling another time. It's 2.6.36-rc3 and seems to contain all the other patches from comment #44 and comment #48 already.
Comment 55 Martin Steigerwald 2010-09-08 14:45:15 UTC
Looks very good so far. I will reboot this kernel several times tomorrow - as a freeze so far only ever happened *before* the first hibernation / snapshot cycle - but I watched some Star Trek Voyager without a freeze with:

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.36> cat /proc/version 
Linux version 2.6.36-rc3-tp42-toi-3.2-rc1-vmembase-0-05032-g60140c1-dirty (martin@shambhala) (gcc version 4.4.5 20100728 (prerelease) (Debian 4.4.4-8) ) #2 PREEMPT Wed Sep 8 21:36:34 CEST 2010

Thanks.
Comment 56 Da Fox 2010-09-09 03:00:22 UTC
(In reply to comment #48)
> (In reply to comment #47)
> > Created an attachment (id=38516) [details]
> > possible fix
> > 
> > Does this patch help?  It always aligns the MC vram and gtt bases to size.
> 
> I'm sorry to report that it does not. I've tried with
> 96576a9e1a0cdb8a43d3af5846be0948f52b4460 (current drm-next in airlied's tree).
> This freezes without any patches, seems stable with airlied's patch to put vmem
> at address 0, but freezes still with your patch.
> 
> Lukas, can you confirm that this patch still freezes?

I've tried this patch again today, this time using vanilla 2.6.36-rc3. Unfortunately it froze again upon launching firefox.
Comment 57 Lukas Schneiderbauer 2010-09-09 04:11:16 UTC
(In reply to comment #56)
> (In reply to comment #48)
> > (In reply to comment #47)
> > > Created an attachment (id=38516) [details]
> > > possible fix
> > > 
> > > Does this patch help?  It always aligns the MC vram and gtt bases to size.
> > 
> > I'm sorry to report that it does not. I've tried with
> > 96576a9e1a0cdb8a43d3af5846be0948f52b4460 (current drm-next in airlied's tree).
> > This freezes without any patches, seems stable with airlied's patch to put vmem
> > at address 0, but freezes still with your patch.
> > 
> > Lukas, can you confirm that this patch still freezes?
> 
> I've tried this patch again today, this time using vanilla 2.6.36-rc3.
> Unfortunately it froze again upon launching firefox.

Hm... damn. My 2.6.36-rc3 with alex' patch didn't give me a freeze for ~ 1 day. And I'm pretty sure, that I applied the patch correctly and didn't mix up any of these patches. (did some checks ...)
However, I did a reset of the whole tree, pulled the newest version and applied alex' patch again.
I'm on 2.6.36-rc3-00185-gd56557a-dirty and testing..
Comment 58 Martin Steigerwald 2010-09-09 04:24:36 UTC
From what I see with

git log | grep -A 4 96576a9e1a0cdb8a43d3af5846be0948f52b4460

this commit titled "agp: intel-agp: do not use PCI resources before pci_enable_device()" is already in 2.6.36-rc3. The vmem base at zero patch that fixes or at least works around the issue is the only difference I have:

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.36/tuxonice-head> git diff | egrep "^(\+|\-)" 
--- a/drivers/gpu/drm/radeon/r300.c
+++ b/drivers/gpu/drm/radeon/r300.c
-       base = rdev->mc.aper_base;
+       base = 0;

So far this kernel works fine. It locked during userspace software suspend initiating a snapshot but that seems to be a different issue. TuxOnIce hibernation worked two cycles already.
Comment 59 Martin Steigerwald 2010-09-09 04:48:18 UTC
(In reply to comment #57)
> (In reply to comment #56)
> > (In reply to comment #48)
> > > (In reply to comment #47)
> > > > Created an attachment (id=38516) [details]
> > > > possible fix
> > > > 
> > > > Does this patch help?  It always aligns the MC vram and gtt bases to size.
> > > 
> > > I'm sorry to report that it does not. I've tried with
> > > 96576a9e1a0cdb8a43d3af5846be0948f52b4460 (current drm-next in airlied's tree).
> > > This freezes without any patches, seems stable with airlied's patch to put vmem
> > > at address 0, but freezes still with your patch.
> > > 
> > > Lukas, can you confirm that this patch still freezes?
> > 
> > I've tried this patch again today, this time using vanilla 2.6.36-rc3.
> > Unfortunately it froze again upon launching firefox.
> 
> Hm... damn. My 2.6.36-rc3 with alex' patch didn't give me a freeze for ~ 1 day.
> And I'm pretty sure, that I applied the patch correctly and didn't mix up any
> of these patches. (did some checks ...)
> However, I did a reset of the whole tree, pulled the newest version and applied
> alex' patch again.
> I'm on 2.6.36-rc3-00185-gd56557a-dirty and testing..

You seem to have the same gfx card, but different surrounding hardware, a Fujitsu-Siemens laptop? Maybe Alex' patch works on your hardware, but does not work on Da Fox' and my ThinkPad T42?

You have:

01:00.0 VGA compatible controller: ATI Technologies Inc RV350 [Mobility Radeon 9600 M10] (prog-if 00 [VGA controller])
	Subsystem: Fujitsu Limited. Device 127f
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B+ DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 66 (2000ns min), Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at c8000000 (32-bit, prefetchable) [size=128M]
	Region 1: I/O ports at 2000 [size=256]
	Region 2: Memory at c0100000 (32-bit, non-prefetchable) [size=64K]
	[virtual] Expansion ROM at c0120000 [disabled] [size=128K]
	Capabilities: [58] AGP version 2.0
		Status: RQ=80 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4
		Command: RQ=32 ArqSz=0 Cal=0 SBA+ AGP+ GART64- 64bit- FW- Rate=x1
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: radeon

I have:

01:00.0 VGA compatible controller: ATI Technologies Inc RV350 [Mobility Radeon 9600 M10] (prog-if 00 [VGA controller])
        Subsystem: IBM Device 0550
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B+ DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 66 (2000ns min), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 11
        Region 0: Memory at e0000000 (32-bit, prefetchable) [size=128M]
        Region 1: I/O ports at 3000 [size=256]
        Region 2: Memory at c0100000 (32-bit, non-prefetchable) [size=64K]
        [virtual] Expansion ROM at c0120000 [disabled] [size=128K]
        Capabilities: [58] AGP version 2.0
                Status: RQ=80 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4
                Command: RQ=32 ArqSz=0 Cal=0 SBA+ AGP+ GART64- 64bit- FW- Rate=x1
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: radeon

Region 0 memory and I/O ports are at different addresses. Maybe that explains it? Apart from that only PMEClk looks slightly different. I don't know what all that means exactly, but maybe it's a hint?

Maybe it also comes from a difference in userspace that triggers, or does not trigger, slightly different code paths? I have Debian Squeeze/Sid/Experimental with:

martin@shambhala:~> apt-show-versions | egrep "(xserver-xorg/|xserver-xorg-core/|xserver-xorg-video-radeon/|libgl1-mesa-dri/|libdrm2/|libdrm-radeon1/|kde-window-manager/|kdelibs5/)"
kde-window-manager/squeeze uptodate 4:4.4.5-3
kdelibs5/squeeze uptodate 4:4.4.5-1
libdrm-radeon1/experimental uptodate 2.4.21-2
libdrm2/experimental uptodate 2.4.21-2
libgl1-mesa-dri/experimental uptodate 7.8.2-2
xserver-xorg/squeeze uptodate 1:7.5+6
xserver-xorg-core/squeeze uptodate 2:1.7.7-4
xserver-xorg-video-radeon/squeeze uptodate 1:6.13.1-2
Comment 60 Lukas Schneiderbauer 2010-09-09 06:08:17 UTC
Created attachment 38576 [details]
x11 components version

(In reply to comment #59)
> You seem to be the same gfx card, but different surrounding hardware, a
> Fujitsu-Siemens laptop? Maybe Alex patch works on your hardware, but does not
> work on Da Fox' and my ThinkPad T42?
Yes, possible.
The 2.6.36-rc3-00185-gd56557a-dirty kernel has been up for 2 hours now. Let's see.

I should mention that, while reviewing my xorg.conf, I discovered an artifact from the early days of the "radeon" driver on this system.
It is a
Option     "AGPMode"                    "4"
and was necessary to stabilize my system (freezes occurred too). I'm sure that it was no longer needed with later kernel and userspace driver versions.
However, I don't know if this influenced the kernel's behaviour during the patch tests, but I will remove this line and see if something changes.


My x11-package versions are attached.
Comment 61 Lukas Schneiderbauer 2010-09-09 06:09:53 UTC
Created attachment 38577 [details]
emerge --info

.. and additional info about my system.
Comment 62 Lukas Schneiderbauer 2010-09-09 06:27:12 UTC
(In reply to comment #60)
> However, I don't know, if this influenced the kernel behaviours during the
> patch tests, but I will remove this line and see, if something changes. 

I removed this line and experienced a sudden freeze after X-restart and firefox-start.
Could you please add this option to your xorg.conf and test this case?
Comment 63 Martin Steigerwald 2010-09-09 06:31:43 UTC
(In reply to comment #62)
> (In reply to comment #60)
> > However, I don't know, if this influenced the kernel behaviours during the
> > patch tests, but I will remove this line and see, if something changes. 
> 
> I removed this line and experienced a sudden freeze after X-restart and
> firefox-start.
> Could you please add this option to your xorg.conf and see and test this case?

I already have this AGP 4x line in my xorg.conf too, and have had it for ages. Not to stabilize anything, but AFAIR because otherwise the driver would only use AGP 2x or even 1x. One can see that in the X.org logs AFAIR.
Comment 64 Lukas Schneiderbauer 2010-09-09 07:11:31 UTC
2.6.36-rc3-00185-gd56557a-dirty (latest git with alex' patch) froze as well for me (even with the AGP 4x option in xorg.conf).

I will fall back to 2.6.36-rc3 and test again. Maybe I'll get a freeze this time. Then we would be in the same state again.
Comment 65 Alex Deucher 2010-09-09 07:36:50 UTC
The AGPMode xorg option isn't used with kms (the AGP mode is set before X starts when the drm loads).  To force a particular AGP mode with kms, use the agpmode module parameter: radeon.agpmode=x where x=-1,1,2,4,8.  -1 disables AGP and uses the on-chip gart mechanism instead.
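
A minimal sketch for checking which value, if any, a kernel command line passes for this parameter. The accepted values are the ones listed above; `agpmode_from_cmdline` is a hypothetical helper name, and the sketch only looks at the command line, not at options set anywhere else.

```python
VALID_AGPMODES = (-1, 1, 2, 4, 8)


def agpmode_from_cmdline(cmdline):
    """Return the radeon.agpmode value on a kernel command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "radeon.agpmode" and sep:
            mode = int(value)
            if mode not in VALID_AGPMODES:
                raise ValueError("unsupported radeon.agpmode: %d" % mode)
            return mode
    return None


# Example command line (made up for illustration).
print(agpmode_from_cmdline("ro root=/dev/sda6 quiet radeon.agpmode=4"))
```

On a running system the same function can be fed the contents of /proc/cmdline.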
Comment 66 Chris Rankin 2010-09-09 14:23:42 UTC
(In reply to comment #50)
> So the first thing to do would be to verify that you indeed are experiencing
> the same issue (and not an unrelated DPMS problem) is to keep using your
> computer and wait for a freeze to occur during usage.

Interesting, because this RV350 machine is my everyday desktop PC. It gets a lot of regular usage, and also a lot of intense CPU activity. And the DPMS-related freezes are the only ones I have seen.

Here are the PCI bus details:
01:00.0 VGA compatible controller: ATI Technologies Inc RV350 AS [Radeon 9550] (prog-if 00 [VGA controller])
	Subsystem: C.P. Technology Co. Ltd Device 2084
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64 (2000ns min), Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at e0000000 (32-bit, prefetchable) [size=256M]
	Region 1: I/O ports at ec00 [size=256]
	Region 2: Memory at ff8f0000 (32-bit, non-prefetchable) [size=64K]
	Expansion ROM at ff800000 [disabled] [size=128K]
	Capabilities: [58] AGP version 3.0
		Status: RQ=256 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3+ Rate=x4,x8
		Command: RQ=32 ArqSz=2 Cal=0 SBA+ AGP+ GART64- 64bit- FW- Rate=x8
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: radeon
	Kernel modules: radeon

01:00.1 Display controller: ATI Technologies Inc RV350 AS [Radeon 9550] (Secondary)
	Subsystem: C.P. Technology Co. Ltd Device 2085
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64 (2000ns min), Cache Line Size: 64 bytes
	Region 0: Memory at d0000000 (32-bit, prefetchable) [size=256M]
	Region 1: Memory at ff8e0000 (32-bit, non-prefetchable) [size=64K]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Comment 67 Lukas Schneiderbauer 2010-09-10 00:33:35 UTC
(In reply to comment #64)
> I will fall back to 2.6.36-rc3 and test again. Maybe I'll get a freeze this
> time. Then we would be on the same state again.
And it froze... :)
Comment 68 Martin Steigerwald 2010-09-14 06:17:18 UTC
I have two questions:

1) Now that we have established that the vmembase at zero patch fixes or works around the problem, while the patch to align vram from comment #47 does not, and now that I have bisected the range of commits down to about 10 and, as far as I understand, Da Fox and Lukas have even bisected down to exactly one commit: What next? Is the vmembase at zero patch the proper fix? Actually to me it seems more like a work-around. Is there another fix you propose? I would love to see a fix in time for 2.6.36, although I still have to figure out how to get a kernel after 2.6.33 that does either userspace software suspend or TuxOnIce stably on my ThinkPad T42 (see bug #18162 regarding userspace software suspend and the tuxonice-devel mailing list for TuxOnIce related stuff).

2) Re Comment #65:

"The AGPMode xorg option isn't used with kms (the AGP mode is set before X
starts when the drm loads).  To force a particular AGP mode with kms, use the
agpmode module parameter: radeon.agpmode=x where x=-1,1,2,4,8.  -1 disables AGP
and uses the on-chip gart mechanism instead."

Is it necessary? How do I find out which AGP mode is used? I'd prefer it to use the best AGP mode (that should be 4x on my ThinkPad T42) automatically.
Comment 69 Da Fox 2010-09-16 07:10:38 UTC
(In reply to comment #68)
> I have two questions:
> 
> 1) Now since we established that the vmembase at zero patch fixes or works
> around the problem - while the patch to align vram from comment #47 does not,
> and now that I bisected the range of commits down to about 10 and as far as I
> understand Da Fox and Lukas even bisected down to exact one commit: What next?
> Is the vmembase at zero patch the proper fix? Actually to me it seems more like
> a work-around. Is there another fix you propose? I would love to see a fix in
> time for 2.6.36, although I still have to figure out on how to get a kernel
> after 2.6.33 that does either userspace software suspend or TuxOnIce stably on
> my ThinkPad T42 (see bug #18162 regarding userspace software suspend and
> tuxonice-devel mailing list for TuxOnIce related stuff).
> 
We are currently testing a variation on this patch as suggested by Dave Airlie on IRC. It involves trying to put vram on memory addresses other than 0, but with some restriction on alignment and overlap with the GTT. Interesting values to test would be 0x10000000, 0x18000000 and 0xf0000000, provided that they don't cause overlap with the GTT area. 
You can see where your GTT area lives by looking at dmesg after boot:
---8<---------
$ dmesg | grep GTT
radeon 0000:01:00.0: GTT: 256M 0xD0000000 - 0xDFFFFFFF
[drm] radeon: 256M of GTT memory ready.
--->8---------
This shows that gtt_start=0xD0000000 and gtt_end=0xDFFFFFFF. You should make sure that either 'base + "size of your vram" <= gtt_start' or 'gtt_end < base' holds, where base is one of 0x10000000, 0x18000000 or 0xf0000000.
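
The condition above can be checked with a few lines of Python before rebuilding a kernel. This is only a sketch of the rule as stated: `vram_clear_of_gtt` is a made-up helper name, and the 64M vram size is an assumption that you should replace with the size of your own vram.

```python
def vram_clear_of_gtt(base, vram_size, gtt_start, gtt_end):
    """True if vram at [base, base + vram_size) stays clear of [gtt_start, gtt_end]."""
    return base + vram_size <= gtt_start or gtt_end < base


GTT_START, GTT_END = 0xD0000000, 0xDFFFFFFF  # from "dmesg | grep GTT"
VRAM_SIZE = 64 * 1024 * 1024                 # assumption: 64M of vram

for base in (0x10000000, 0x18000000, 0xF0000000):
    print(hex(base), vram_clear_of_gtt(base, VRAM_SIZE, GTT_START, GTT_END))
```

With the GTT range shown here and 64M of vram, all three suggested values pass the check.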

I have tested placing vram at 0x10000000, which worked for me for two days without a freeze. I am currently testing vram at 0xf0000000, which thus far has not caused a freeze either.
Please post your results here too.

> 2) Re Comment #65:
> 
> "The AGPMode xorg option isn't used with kms (the AGP mode is set before X
> starts when the drm loads).  To force a particular AGP mode with kms, use the
> agpmode module parameter: radeon.agpmode=x where x=-1,1,2,4,8.  -1 disables AGP
> and uses the on-chip gart mechanism instead."
> 
> Is it necessary? How do I find out with AGP mode is used. I'd prefer when it
> used best AGP mode (that should be 4x on my ThinkPad T42) automatically.
Again you can get this info by looking at your dmesg output:

---8<---------
$ dmesg | grep -i AGP
Linux agpgart interface v0.103
agpgart-intel 0000:00:00.0: Intel 855PM Chipset
agpgart-intel 0000:00:00.0: AGP aperture is 256M @ 0xd0000000
[drm] AGP mode requested: 4
agpgart-intel 0000:00:00.0: AGP 2.0 bridge
agpgart-intel 0000:00:00.0: putting AGP V2 device into 4x mode
radeon 0000:01:00.0: putting AGP V2 device into 4x mode
--->8---------
So currently I am running with AGP mode 4x.
As for whether it is necessary, I don't know, but I can't imagine it really making a difference.
Comment 70 Da Fox 2010-09-21 03:38:13 UTC
Ok here's an update:
I've now tested putting the vram at the following locations (in order):
 - 0x00000000: This is the same location as vram used to be at before the
               identified bad commit and works.
 - 0x10000000: This is before GTT (which starts at 0xd0000000), with some
               room to spare. This works without freezing, tested for two days.
 - 0xf0000000: This is after GTT (which ends at 0xdfffffff), with some room to
               spare. This works, tested for two days.
 - 0xcc000000: This is directly in front of GTT, with no room to spare. This
               works, tested for several days.
 - 0xe0000000: This is directly behind GTT, with no room to spare. This is where
               vram is placed starting with the identified commit, and as
               expected it froze within minutes.
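
The "room to spare" in the list above can be put into numbers with a small sketch. It assumes 64M of vram (which is what makes 0xcc000000 end exactly where GTT starts); `gap_to_gtt` is a made-up helper name, and the outcomes are simply copied from the list.

```python
GTT_START, GTT_END = 0xD0000000, 0xDFFFFFFF  # from "dmesg | grep GTT"
VRAM_SIZE = 0x04000000                       # assumption: 64M of vram


def gap_to_gtt(base):
    """Bytes of unused address space between the vram block and GTT (0 = touching)."""
    vram_end = base + VRAM_SIZE - 1
    if vram_end < GTT_START:
        return GTT_START - vram_end - 1
    if base > GTT_END:
        return base - GTT_END - 1
    raise ValueError("vram overlaps GTT")


# (base, outcome) pairs as reported in the list above.
RESULTS = [(0x00000000, "works"), (0x10000000, "works"), (0xF0000000, "works"),
           (0xCC000000, "works"), (0xE0000000, "freezes")]

for base, outcome in RESULTS:
    print("0x%08x gap=0x%08x %s" % (base, gap_to_gtt(base), outcome))
```

Both 0xcc000000 and 0xe0000000 come out with a gap of 0, yet only the placement directly behind GTT froze in these tests.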
Comment 71 Martin Steigerwald 2010-09-21 05:31:43 UTC
(In reply to comment #70)
> Ok here's an update:
> I've now tested putting the vram at the following locations (in order):
>  - 0x00000000: This is the same location as vram used to be at before the
>                identified bad commit and works.
>  - 0x10000000: This is before GTT (which starts at 0xd0000000), with some
>                room to spare. This works without freezing, tested for two days.
>  - 0xf0000000: This is after GTT (which ends at 0xdfffffff), with some room to
>                spare. This works, tested for two days.
>  - 0xcc000000: This is directly in front of GTT, with no room to spare. This
>                works, tested for several days.
>  - 0xe0000000: This is directly behind GTT, with no room to spare. This where
>                vram is placed starting with the identified commit, and as
>                expected it froze within minutes.

Da Fox, I seem to have the same setup, which is not surprising if you also have a ThinkPad T42:

shambhala:~> dmesg | grep GTT
radeon 0000:01:00.0: GTT: 256M 0xD0000000 - 0xDFFFFFFF
radeon 0000:01:00.0: GTT: 256M 0xD0000000 - 0xDFFFFFFF

Thus I do not think I need to test the same values again. Are there some other values I should test? Maybe we can share this work.

Thanks for your hints regarding AGP. I think it might make sense to use that agpmode option, because:

shambhala:~> lspci | grep AGP
00:01.0 PCI bridge: Intel Corporation 82855PM Processor to AGP Controller (rev 03)
shambhala:~> dmesg | grep -i AGP
[drm] AGP mode requested: 1
agpgart-intel 0000:00:00.0: AGP 2.0 bridge
agpgart-intel 0000:00:00.0: putting AGP V2 device into 1x mode
radeon 0000:01:00.0: putting AGP V2 device into 1x mode
[drm] AGP mode requested: 1
agpgart-intel 0000:00:00.0: AGP 2.0 bridge
agpgart-intel 0000:00:00.0: putting AGP V2 device into 1x mode
radeon 0000:01:00.0: putting AGP V2 device into 1x mode

Did you set the agpmode module parameter for radeon or are you getting 4x set up automatically? If the latter, I wonder why I get 1x automatically.
Comment 72 Michel Dänzer 2010-09-21 05:42:43 UTC
(In reply to comment #71)
> [drm] AGP mode requested: 1

[...]

> [...] I wonder why I get 1x automatically.

The line above means you have (the equivalent of) radeon.agpmode=1 somewhere, either on the kernel command line or maybe in /etc/modprobe.d/ . Without an explicit option, the default is determined by the BIOS and can sometimes be tweaked in the BIOS setup.
Comment 73 Lukas Schneiderbauer 2010-09-21 05:50:21 UTC
(In reply to comment #71)
> Did you set the agpmode module parameter for radeon or are you getting 4x setup
> automatically? If the later I wonder why I get 1x automatically.
I get AGP 1x as default as well. I changed it with the kernel parameter to 4x and it seems to work as well as the old setting.
Comment 74 Martin Steigerwald 2010-09-21 06:02:17 UTC
(In reply to comment #72)
> (In reply to comment #71)
> > [drm] AGP mode requested: 1
> 
> [...]
> 
> > [...] I wonder why I get 1x automatically.
> 
> The line above means you have (the equivalent of) radeon.agpmode=1 somewhere,
> either on the kernel command line or maybe in /etc/modprobe.d/ . Without an
> explicit option, the default is determined by the BIOS and can sometimes be
> tweaked in the BIOS setup.

It does not seem so:

shambhala:~> grep -r agpmode /etc 
shambhala:~#1> grep -r agpmode /boot
shambhala:~#1>
Comment 75 Lukas Schneiderbauer 2010-09-21 06:11:45 UTC
[Friday 17 September 2010] [09:50:58] <vootey> how comes, that my kms-setup with M10 (RV350) gpu defaults to agp 1x mode?
[Friday 17 September 2010] [09:51:38] <airlied> vootey: because we have a quirk table and I guess is was unstable for someone in 4x
[Friday 17 September 2010] [09:52:00] <MrCooper> or due to the BIOS setup
Comment 76 Martin Steigerwald 2010-09-21 06:39:03 UTC
(In reply to comment #75)
> [Friday 17 September 2010] [09:50:58] <vootey> how comes, that my kms-setup
> with M10 (RV350) gpu defaults to agp 1x mode?
> [Friday 17 September 2010] [09:51:38] <airlied> vootey: because we have a quirk
> table and I guess is was unstable for someone in 4x
> [Friday 17 September 2010] [09:52:00] <MrCooper> or due to the BIOS setup

Thanks. I set it to use 4x AGP manually and will see whether it's stable on my ThinkPad T42. It was, back when the Xorg option still worked.
Comment 77 Da Fox 2010-09-21 11:26:57 UTC
(In reply to comment #73)
> (In reply to comment #71)
> > Did you set the agpmode module parameter for radeon or are you getting 4x setup
> > automatically? If the later I wonder why I get 1x automatically.
> I get AGP 1x as default as well. I changed it with the kernel parameter to 4x
> and it seems to work as good as the old setting.

I also set radeon.agpmode=4 on the kernel command line in grub.conf. I don't know what the default is; I could test it if it's important. But agpmode=4 has worked with the older kernels, so I don't think that is the issue.



(In reply to comment #74)
> (In reply to comment #72)
> > (In reply to comment #71)
> > > [drm] AGP mode requested: 1
> > 
> > [...]
> > 
> > > [...] I wonder why I get 1x automatically.
> > 
> > The line above means you have (the equivalent of) radeon.agpmode=1 somewhere,
> > either on the kernel command line or maybe in /etc/modprobe.d/ . Without an
> > explicit option, the default is determined by the BIOS and can sometimes be
> > tweaked in the BIOS setup.
> 
> It does not seem so:
> 
> shambhala:~> grep -r agpmode /etc 
> shambhala:~#1> grep -r agpmode /boot
> shambhala:~#1>
That is odd, grep -r on /boot should match at least System.map:
~ # grep -r agpmode /boot/
/boot/System.map:c15a506d r __param_str_agpmode
/boot/System.map:c1707b90 r __param_agpmode
/boot/System.map:c173fcc0 d radeon_agpmode_quirk_list
/boot/System.map:c17ec984 B radeon_agpmode
/boot/grub/grub.conf:kernel (hd0,5)/boot/vmlinuz ro root=/dev/sda6 quiet splash=silent,theme:gerabellum CONSOLE=/dev/tty1 resume2=file:/dev/sda6:0x103130 radeon.agpmode=4 drm_kms_helper.poll=0
grep: warning: /boot/boot: recursive directory loop
Comment 78 Da Fox 2010-10-12 05:17:12 UTC
(In reply to comment #77)
> I also set radeon.agpmode=4 on the kernel commandline in grub.conf. I don't
> know what the default is, I could test it if it's important. But agpmode=4 has
> works with the older kernels, so I don't think that is the issue
So I've tested with radeon.agpmode=-1 yesterday and today. I've performed three tests, and each time the freeze happened. The first time the freeze happened soon after booting, while starting firefox (although I was also running the 'antspotlight' screensaver in a window). The other two times the freeze took a bit longer to manifest (45 minutes to an hour). The third freeze occurred while rebuilding my kernel to re-include the 'vram at zero' patch. So the freezing issue exists even in PCI mode. Tested with kernel 26bf62e47261142d528a6109fdd671a2e280b4ea - Merge branch 'drm-radeon-next' of ../drm-radeon-next into drm-core-next, with an additional patch to print vram and gtt locations.

dmesg | grep -iE 'radeon|agp' contains the following (lines starting with 'RADEON:' are mine):
---8<---------
...
agpgart-intel 0000:00:00.0: Intel 855PM Chipset
agpgart-intel 0000:00:00.0: AGP aperture is 256M @ 0xd0000000
[drm] radeon defaulting to kernel modesetting.
[drm] radeon kernel modesetting enabled.
radeon 0000:01:00.0: power state changed by ACPI to D0
radeon 0000:01:00.0: power state changed by ACPI to D0
radeon 0000:01:00.0: PCI INT A -> Link[LNKA] -> GSI 11 (level, low) -> IRQ 11
[drm] Forcing AGP to PCI mode
RADEON: base at e0000000, rdev->gtt_start at 0, base would have been at e0000000
RADEON: vram sizes: rdev->mc.mc_vram_size=4000000, rdev->mc.real_vram_size=4000000 rdev->mc.visible_vram_size=8000000
radeon 0000:01:00.0: VRAM: 64M 0xE0000000 - 0xE3FFFFFF (64M used)
radeon 0000:01:00.0: GTT: 512M 0xC0000000 - 0xDFFFFFFF
[drm] radeon: irq initialized.
[drm] radeon: 64M of VRAM memory ready
[drm] radeon: 512M of GTT memory ready.
[drm] radeon: 1 quad pipes, 1 Z pipes initialized.
radeon 0000:01:00.0: WB enabled
[drm] radeon: ring at 0x00000000C0001000
...
--->8---------
This shows that VRAM is placed directly after GTT even in PCI mode.
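The adjacency can also be checked mechanically from the two range lines. The following is a minimal sketch, assuming the "GTT:" and "VRAM:" lines keep the format shown in the log above; `parse_ranges` and `vram_follows_gtt` are hypothetical helper names, not part of any radeon tooling.

```python
import re

# Matches e.g. "VRAM: 64M 0xE0000000 - 0xE3FFFFFF" or "GTT: 512M 0xC0000000 - 0xDFFFFFFF".
RANGE_RE = re.compile(r"(GTT|VRAM): (\d+)M 0x([0-9A-Fa-f]+) - 0x([0-9A-Fa-f]+)")


def parse_ranges(dmesg_text):
    """Map 'GTT' / 'VRAM' to their (start, end) addresses, end inclusive."""
    ranges = {}
    for m in RANGE_RE.finditer(dmesg_text):
        ranges[m.group(1)] = (int(m.group(3), 16), int(m.group(4), 16))
    return ranges


def vram_follows_gtt(ranges):
    """True if VRAM starts at the byte right after the last GTT byte."""
    return ranges["VRAM"][0] == ranges["GTT"][1] + 1


log = """\
radeon 0000:01:00.0: VRAM: 64M 0xE0000000 - 0xE3FFFFFF (64M used)
radeon 0000:01:00.0: GTT: 512M 0xC0000000 - 0xDFFFFFFF
"""
print(vram_follows_gtt(parse_ranges(log)))
```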

Can anyone please confirm these results?
Comment 79 Alex Deucher 2010-10-14 16:59:59 UTC
Created attachment 39455
Make sure gtt and vram are not directly adjacent

Does this patch fix the issue?
Comment 80 Da Fox 2010-10-15 05:05:52 UTC
(In reply to comment #79)
> Created an attachment (id=39455) [details]
> Make sure gtt and vram are not directly adjacent
> 
> Does this patch fix the issue?

It should, since it simply puts vram at 0 when it is detected that
gtt and vram are adjacent. dmesg says:
---8<---------
radeon 0000:01:00.0: GTT: 256M 0xD0000000 - 0xDFFFFFFF
radeon 0000:01:00.0: VRAM: 64M 0x00000000 - 0x03FFFFFF (64M used)
--->8---------
so it has moved vram to 0. I'll test it just the same though :)
One thing I am wondering is whether it is possible to get vram and gtt overlapping with this, since the patch doesn't seem to perform any further checks to prevent that when moving either vram or gtt to 0. Presumably this is already handled elsewhere?
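The overlap concern amounts to an interval intersection test. Here is a rough sketch of such a check, using the sizes and addresses from the dmesg lines above; `ranges_overlap` is a hypothetical helper and does not reflect what the driver actually does.

```python
MiB = 1 << 20


def ranges_overlap(a_start, a_size, b_start, b_size):
    """True if [a_start, a_start + a_size) and [b_start, b_start + b_size) intersect."""
    return a_start < b_start + b_size and b_start < a_start + a_size


# Layout reported with the patch applied: VRAM at 0, GTT at 0xD0000000.
print(ranges_overlap(0x00000000, 64 * MiB, 0xD0000000, 256 * MiB))

# Hypothetical bad layout: both VRAM and GTT moved to 0.
print(ranges_overlap(0x00000000, 64 * MiB, 0x00000000, 256 * MiB))
```

Note that two directly adjacent ranges do not count as overlapping under this test, so adjacency would still need a separate check.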
Comment 81 Martin Steigerwald 2010-10-17 06:15:00 UTC
(In reply to comment #79)
> Created an attachment (id=39455) [details]
> Make sure gtt and vram are not directly adjacent
> 
> Does this patch fix the issue?

Testing your patch in:
martin@shambhala:~> cat /proc/version
Linux version 2.6.36-rc8-tp42-gtt-vram-not-adjacent-00020-g2d01971-dirty (martin@shambhala) (gcc version 4.4.5 (Debian 4.4.5-2) ) #1 PREEMPT Sun Oct 17 13:48:48 CEST 2010

I get vram aligned to zero as well - I also have a ThinkPad T42 like Da Fox - so everything should work:

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.36> dmesg | grep -i radeon
[drm] radeon kernel modesetting enabled.
radeon 0000:01:00.0: power state changed by ACPI to D0
radeon 0000:01:00.0: power state changed by ACPI to D0
radeon 0000:01:00.0: PCI INT A -> Link[LNKA] -> GSI 11 (level, low) -> IRQ 11
radeon 0000:01:00.0: putting AGP V2 device into 4x mode
radeon 0000:01:00.0: GTT: 256M 0xD0000000 - 0xDFFFFFFF
radeon 0000:01:00.0: VRAM: 64M 0x00000000 - 0x03FFFFFF (64M used)
[drm] radeon: irq initialized.
[drm] radeon: 64M of VRAM memory ready
[drm] radeon: 256M of GTT memory ready.
[drm] radeon: 1 quad pipes, 1 Z pipes initialized.
[drm] radeon: ring at 0x00000000D0000000
[drm] radeon: ib pool ready.
[drm] Radeon Display Connectors
[drm] radeon: power management initialized
fb0: radeondrmfb frame buffer device
[drm] Initialized radeon 2.6.0 20080528 for 0000:01:00.0 on minor 0

I will report back after longer term testing.

I am not using radeon.agpmode=-1 but

martin@shambhala:~> cat /etc/modprobe.d/radeon-kms.conf 
options radeon modeset=1 agpmode=4

cause it ran stable for me for

martin@shambhala:~> uprecords -m 20 | grep "2\.6\.35"
    16    12 days, 15:20:19 | Linux 2.6.35.5-tp42-vmem  Mon Oct  4 23:44:41 2010
Comment 82 Alex Deucher 2010-10-20 13:31:53 UTC
Created attachment 39595
updated patch

Updated patch to make sure gtt and vram don't overlap if vram is at 0.
Comment 83 Martin Steigerwald 2010-10-21 10:26:12 UTC
Now testing 2.6.36 + your v2 patch. Three reboots all is well so far - I do not expect any surprises, since mem mapping seems to be the same:

martin@shambhala:~> dmesg | grep "radeon" | egrep -i "(gtt|vram)"
radeon 0000:01:00.0: GTT: 256M 0xD0000000 - 0xDFFFFFFF
radeon 0000:01:00.0: VRAM: 64M 0x00000000 - 0x03FFFFFF (64M used)
[drm] radeon: 64M of VRAM memory ready
[drm] radeon: 256M of GTT memory ready.

I also tried to set up Radeon KMS DRM on my Dell Dimension 5100 at work, but I found quite a few issues: the ttys are blank after the KMS switch, and the machine locks up with a backtrace when enabling XRANDR for 1680x1050 + 1280x1024 or so, which seemed to be a memory issue. It is a 32-bit system with almost 4GB of RAM (no PAE). I hope to be able to try again on this workstation in November and to report some bugs.
Comment 84 mrsteven 2010-10-21 15:56:00 UTC
(In reply to comment #44)
> I am currently testing d594e46ace22afa1621254f6f669e65430048153 +
> 8e36113082821980c60ce89a6c5d45fc9492fc26 and the following patch as suggested
> by Dave Airlie on IRC:
> 
> ---8<---------
> diff --git a/drivers/gpu/drm/radeon/r300.c b/drivers/gpu/drm/radeon/r300.c
> index c827738..d1a7803 100644
> --- a/drivers/gpu/drm/radeon/r300.c
> +++ b/drivers/gpu/drm/radeon/r300.c
> @@ -477,7 +477,7 @@ void r300_mc_init(struct radeon_device *rdev)
> default: rdev->mc.vram_width = 128; break;
> }
> r100_vram_init_sizes(rdev);
> - base = rdev->mc.aper_base;
> + base = 0;
> if (rdev->flags & RADEON_IS_IGP)
> base = (RREG32(RADEON_NB_TOM) & 0xffff) << 16;
> radeon_vram_location(rdev, &rdev->mc, base);

I have the same problem since 2.6.34. The above hack fixes it for me (tested with 2.6.35.7 and a few earlier 2.6.35 releases, in frequent use for more than a month now), while the patch from comment #82 does not help in my case (last freeze was while using the search bar of firefox).

Video chip is a:

01:00.0 VGA compatible controller: ATI Technologies Inc RV350 [Mobility Radeon 9600 M10] (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. Device 1772
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 11
        Memory at d0000000 (32-bit, prefetchable) [size=128M]
        I/O ports at d800 [size=256]
        Memory at ff8f0000 (32-bit, non-prefetchable) [size=64K]
        Expansion ROM at ff8c0000 [disabled] [size=128K]
        Capabilities: [58] AGP version 2.0
        Capabilities: [50] Power Management version 2
        Kernel driver in use: radeon


With the updated patch I get these messages:

Oct 21 16:12:28 [kernel] [drm] initializing kernel modesetting (RV350 0x1002:0x4E50).
Oct 21 16:12:28 [kernel] [drm] register mmio base: 0xFF8F0000
Oct 21 16:12:28 [kernel] [drm] register mmio size: 65536
Oct 21 16:12:28 [kernel] agpgart-intel 0000:00:00.0: AGP 2.0 bridge
Oct 21 16:12:28 [kernel] agpgart-intel 0000:00:00.0: putting AGP V2 device into 4x mode
Oct 21 16:12:28 [kernel] radeon 0000:01:00.0: putting AGP V2 device into 4x mode
Oct 21 16:12:28 [kernel] radeon 0000:01:00.0: GTT: 256M 0xE0000000 - 0xEFFFFFFF
Oct 21 16:12:28 [kernel] [drm] Generation 2 PCI interface, using max accessible memory
Oct 21 16:12:28 [kernel] radeon 0000:01:00.0: VRAM: 64M 0xD0000000 - 0xD3FFFFFF (64M used)
Oct 21 16:12:28 [kernel] [drm] radeon: irq initialized.
Oct 21 16:12:28 [kernel] [drm] Detected VRAM RAM=64M, BAR=128M
Oct 21 16:12:28 [kernel] [drm] RAM width 128bits DDR
Oct 21 16:12:28 [kernel] [TTM] Zone  kernel: Available graphics memory: 442550 kiB.
Oct 21 16:12:28 [kernel] [TTM] Zone highmem: Available graphics memory: 1036090 kiB.
Oct 21 16:12:28 [kernel] [TTM] Initializing pool allocator.
Oct 21 16:12:28 [kernel] [drm] radeon: 64M of VRAM memory ready
Oct 21 16:12:28 [kernel] [drm] radeon: 256M of GTT memory ready.
*SNIP*


While with the hack from comment #44 it looks like this:

Oct 22 00:28:16 [kernel] [drm] radeon kernel modesetting enabled.
Oct 22 00:28:16 [kernel] radeon 0000:01:00.0: PCI INT A -> Link[LNKA] -> GSI 11 (level, low) -> IRQ 11
Oct 22 00:28:16 [kernel] [drm] initializing kernel modesetting (RV350 0x1002:0x4E50).
Oct 22 00:28:16 [kernel] [drm] register mmio base: 0xFF8F0000
Oct 22 00:28:16 [kernel] [drm] register mmio size: 65536
Oct 22 00:28:16 [kernel] agpgart-intel 0000:00:00.0: AGP 2.0 bridge
Oct 22 00:28:16 [kernel] agpgart-intel 0000:00:00.0: putting AGP V2 device into 4x mode
Oct 22 00:28:16 [kernel] radeon 0000:01:00.0: putting AGP V2 device into 4x mode
Oct 22 00:28:16 [kernel] radeon 0000:01:00.0: GTT: 256M 0xE0000000 - 0xEFFFFFFF
Oct 22 00:28:16 [kernel] [drm] Generation 2 PCI interface, using max accessible memory
Oct 22 00:28:16 [kernel] radeon 0000:01:00.0: VRAM: 64M 0x00000000 - 0x03FFFFFF (64M used)
Oct 22 00:28:16 [kernel] [drm] radeon: irq initialized.
Oct 22 00:28:16 [kernel] [drm] Detected VRAM RAM=64M, BAR=128M
Oct 22 00:28:16 [kernel] [drm] RAM width 128bits DDR
Oct 22 00:28:16 [kernel] [TTM] Zone  kernel: Available graphics memory: 442550 kiB.
Oct 22 00:28:16 [kernel] [TTM] Zone highmem: Available graphics memory: 1036090 kiB.
Oct 22 00:28:16 [kernel] [TTM] Initializing pool allocator.
Oct 22 00:28:16 [kernel] [drm] radeon: 64M of VRAM memory ready
Oct 22 00:28:16 [kernel] [drm] radeon: 256M of GTT memory ready.
Comment 85 Alex Deucher 2010-10-23 06:56:27 UTC
Created attachment 39651
Make sure MC vram map is >= pci aperture size

Ok, I think I found the root cause in this bug.  The vram map in the memory controller needs to be >= the pci aperture size.  For the systems here, the vram size is 64 MB and the aperture size is 128 MB so the vram map in the mc needs to be at least 128 MB.  However, it's getting set to 64 MB.
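As a rough illustration of the constraint described here (not the actual patch, which is in the attachment), the size rule can be sketched as follows. `mc_vram_map_size` is a hypothetical name, and taking the larger of the two sizes is an assumption drawn only from the ">=" requirement stated above.

```python
MiB = 1 << 20


def mc_vram_map_size(real_vram_size, pci_aperture_size):
    """Size the MC vram map so it is never smaller than the PCI aperture."""
    return max(real_vram_size, pci_aperture_size)


# Values from the affected systems: 64 MB of VRAM behind a 128 MB aperture.
print(mc_vram_map_size(64 * MiB, 128 * MiB) // MiB)
```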
Comment 86 Manuel Lauss 2010-10-24 05:03:12 UTC
(In reply to comment #85)
> Created an attachment (id=39651) [details]
> Make sure MC vram map is >= pci aperture size
> 
> Ok, I think I found the root cause in this bug.  The vram map in the memory
> controller needs to be >= the pci aperture size.  For the systems here, the
> vram size is 64 MB and the aperture size is 128 MB so the vram map in the mc
> needs to be at least 128 MB.  However, it's getting set to 64 MB.

The patch fixes this long-standing issue for me.  System is rock-solid with
4xAGP and all sorts of firefox and 3d abuse didn't freeze it.  Thank you!
Comment 87 Da Fox 2010-10-24 07:23:01 UTC
(In reply to comment #85)
> Created an attachment (id=39651) [details]
> Make sure MC vram map is >= pci aperture size
> 
> Ok, I think I found the root cause in this bug.  The vram map in the memory
> controller needs to be >= the pci aperture size.  For the systems here, the
> vram size is 64 MB and the aperture size is 128 MB so the vram map in the mc
> needs to be at least 128 MB.  However, it's getting set to 64 MB.

I've tested the patch from Comment 79 for over a week now, without any issues (as expected). The patch from Comment 82 is quite similar so I assume that would work too. I'm going to test this patch now on a fresh 2.6.36.
Comment 88 Robert Y 2010-10-25 10:21:59 UTC
This bug affects my laptop as well. It has the rv350 with 64MB of RAM.
I have been testing the patch in Comment 85 on the Ubuntu 2.6.36 Natty kernel. It seems to have fixed this freezing problem. I have tried the patches in Comment 79 and Comment 82 but the system still locked up.

On my system the lockups occurred quickly (within minutes after boot) while doing regular things like web browsing with gedit and a terminal open. With an unpatched kernel the system freezes every time, within minutes.

I have been using this laptop with a patched kernel for the past few days and it hasn't frozen once. I've rebooted into an unpatched kernel every once in a while, and sure enough it freezes shortly after booting up.
Comment 89 Helmut Jarausch 2010-10-26 04:44:16 UTC
(In reply to comment #85)
> Created an attachment (id=39651) [details]
> Make sure MC vram map is >= pci aperture size
> 
> Ok, I think I found the root cause in this bug.  The vram map in the memory
> controller needs to be >= the pci aperture size.  For the systems here, the
> vram size is 64 MB and the aperture size is 128 MB so the vram map in the mc
> needs to be at least 128 MB.  However, it's getting set to 64 MB.

Could you please tell me which git archive this patch has gone into.
It doesn't seem to be the official linux-next:	next-20101026

Many thanks for a hint (I do need to get a patchset against stock 2.6.36)

Helmut.
Comment 90 Alex Deucher 2010-10-26 09:05:14 UTC
(In reply to comment #89)
> Could you please tell me which git archive this patch has gone into.
> It doesn't seem to be the official linux-next:    next-20101026
> 
> Many thanks for a hint (I do need to get a patchset against stock 2.6.36)

The patch is available here on this bug and Dave pulled it into the drm-next branch of his tree:
http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h=b7d8cce5b558e0c0aa6898c9865356481598b46d

It's not in Linus tree yet however.
Comment 91 Helmut Jarausch 2010-10-26 09:26:07 UTC
(In reply to comment #90)
> (In reply to comment #89)
> > Could you please tell me which git archive this patch has gone into.
> > It doesn't seem to be the official linux-next:    next-20101026
> > 
> > Many thanks for a hint (I do need to get a patchset against stock 2.6.36)
> 
> The patch is available here on this bug and Dave pulled it into the drm-next
> branch of his tree:
> http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h=b7d8cce5b558e0c0aa6898c9865356481598b46d
> 
> It's not in Linus tree yet however.

Many thanks,
Helmut.
Comment 92 Silvano Galliani 2010-10-27 04:50:54 UTC
(In reply to comment #90)
> (In reply to comment #89)
> The patch is available here on this bug and Dave pulled it into the drm-next
> branch of his tree:
> http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h=b7d8cce5b558e0c0aa6898c9865356481598b46d
> 
> It's not in Linus tree yet however.
I'm using drm-next kernel available here http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-next/2010-10-26-maverick/. It seems to be working, no problems so far. Kernels from 2.6.33 were always giving hangs.

I'm using an asus a4500g with mobility radeon 9600 m10:

01:00.0 VGA compatible controller: ATI Technologies Inc RV350 [Mobility Radeon 9600 M10] (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Device 1942
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64 (2000ns min), Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 19
	Region 0: Memory at d0000000 (32-bit, prefetchable) [size=128M]
	Region 1: I/O ports at c800 [size=256]
	Region 2: Memory at dfef0000 (32-bit, non-prefetchable) [size=64K]
	Expansion ROM at dfec0000 [disabled] [size=128K]
	Capabilities: [58] AGP version 3.0
		Status: RQ=256 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3+ Rate=x4,x8
		Command: RQ=32 ArqSz=2 Cal=0 SBA+ AGP+ GART64- 64bit- FW- Rate=x8
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: radeon
	Kernel modules: radeon, radeonfb

I hope the patch will be merged as soon as possible with linus tree.
thanks for your work.
Silvano
Comment 93 Da Fox 2010-10-29 05:55:39 UTC
(In reply to comment #87)
> (In reply to comment #85)
> > Created an attachment (id=39651) [details]
> > Make sure MC vram map is >= pci aperture size
> > 
> > Ok, I think I found the root cause in this bug.  The vram map in the memory
> > controller needs to be >= the pci aperture size.  For the systems here, the
> > vram size is 64 MB and the aperture size is 128 MB so the vram map in the mc
> > needs to be at least 128 MB.  However, it's getting set to 64 MB.
> 
> I've tested the patch from Comment 79 for over a week now, without any issues
> (as expected). The patch from Comment 82 is quite similar so I assume that
> would work too. I'm going to test this patch now on a fresh 2.6.36.

Ok, I've tested the patch from Comment #85 for the better part of a week now, and I haven't experienced a single freeze yet. VRAM is still placed directly after GTT:
---8<---------
Oct 29 08:47:38 localhost kernel: radeon 0000:01:00.0: GTT: 256M 0xD0000000 - 0xDFFFFFFF
Oct 29 08:47:38 localhost kernel: radeon 0000:01:00.0: VRAM: 128M 0xE0000000 - 0xE7FFFFFF (64M used)
Oct 29 08:47:38 localhost kernel: [drm] Detected VRAM RAM=128M, BAR=128M
Oct 29 08:47:38 localhost kernel: [drm] radeon: 64M of VRAM memory ready
Oct 29 08:47:38 localhost kernel: [drm] radeon: 256M of GTT memory ready.
Oct 29 08:47:38 localhost kernel: [drm] vram apper at 0xE0000000
Oct 29 10:42:30 localhost kernel: radeon 0000:01:00.0: GTT: 256M 0xD0000000 - 0xDFFFFFFF
--->8---------
And the VRAM sizes are still a bit confusing (it's listed as 128M with 64M 'used'?)

So I can confirm that this patch indeed fixes the issue. Job well done!
Comment 94 Alex Deucher 2010-10-29 07:10:02 UTC
(In reply to comment #93)
> And the VRAM sizes are still a bit confusing (it's listed as 128M with 64M
> 'used'?)

The vram aperture size in the memory controller has to match or exceed the pci vram aperture.  The pci aperture size is 128 MB, so the MC aperture has to be >= 128 MB.  However, of that 128 MB aperture, only 64 MB is actually usable.
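To make the "128M with 64M used" line less confusing, here is a small sketch of how the reported range follows from the base address and the two sizes. `describe_vram` is a hypothetical helper that merely reproduces the format of the dmesg line quoted in comment #93; it is not driver code.

```python
MiB = 1 << 20


def describe_vram(base, mapped_size, usable_size):
    """Format a VRAM range line: mapped aperture first, usable amount in parentheses."""
    end = base + mapped_size - 1  # last address covered by the MC aperture
    return "VRAM: %dM 0x%08X - 0x%08X (%dM used)" % (
        mapped_size // MiB, base, end, usable_size // MiB)


# 128 MB MC aperture at 0xE0000000, of which 64 MB is real VRAM.
print(describe_vram(0xE0000000, 128 * MiB, 64 * MiB))
```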

Fix is upstream:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b7d8cce5b558e0c0aa6898c9865356481598b46d
Comment 95 Lukas Schneiderbauer 2010-10-29 10:24:19 UTC
Thanks to all of you, guys!
I'm looking forward to seeing the next (patched) kernel release!
Comment 96 madbiologist 2010-11-02 06:53:28 UTC
This fix has been released in kernel 2.6.37-rc1, and queued for stable.
Comment 97 mrsteven 2010-11-03 15:10:44 UTC
Just for the record: The patch in Comment #85 fixes it for me as well. Firefox, compositing (KWin 4.4), Extreme-Tuxracer - everything's just rock solid now.

Thanks for your amazing work! This is what I call great support for Linux!
Comment 98 Alex Deucher 2010-12-05 06:44:00 UTC
*** Bug 32107 has been marked as a duplicate of this bug. ***
Comment 99 Michel Dänzer 2011-02-09 02:25:07 UTC
*** Bug 27525 has been marked as a duplicate of this bug. ***

