Bug 28474 - [r300g] lugaru/etc locks up laptop
Summary: [r300g] lugaru/etc locks up laptop
Status: RESOLVED DUPLICATE of bug 29389
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/r300
Version: git
Hardware: x86 (IA32) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-06-09 14:04 UTC by Tormod Volden
Modified: 2010-08-27 21:21 UTC
CC List: 2 users


Attachments
Xorg log from hung session (29.66 KB, application/x-trash)
2010-06-09 14:04 UTC, Tormod Volden
Bisect log (1.59 KB, text/plain)
2010-07-29 00:24 UTC, Niels Ole Salscheider
dmesg of crash (1.50 KB, text/plain)
2010-08-14 23:33 UTC, Jan Kreuzer
add module parameter to disable polling (1.64 KB, patch)
2010-08-19 07:07 UTC, Giacomo Perale

Description Tormod Volden 2010-06-09 14:04:14 UTC
Created attachment 36187 [details]
Xorg log from hung session

Since upgrading from 20100523 (fa552261) to 20100528 (f4bcd0ca) lugaru locks the machine up solidly, pretty much as soon as the rabbit starts to move. Anyone else seeing this? (I am using lucid + "radeon gallium" PPA on RV410)

Any educated guess on which commit could have caused this? I can do some bisecting, but the full machine reset cycle makes it a pain.

By locked up solidly I mean that I have to power off. sysrq-b does not respond.

Still the same with 20100608 (fccfb7b1).

Last message in syslog, not sure if it is related:
[ 3279.852242] hda-intel: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj.
Comment 1 Marek Olšák 2010-06-09 16:37:54 UTC
I don't have this problem.

I have reviewed all the r300-related commits between your good and bad ones. One of these commits must be the cause:

ebe2b54 r300g: report vertex format support in is_format_supported
3262554 r300g/swtcl: fix WPOS
90e5a37 r300g/swtcl: fix secondary color and back-face color outputs
76034aa r300g: decouple drawing code and two-sided stencil refvalue fallback
1345c5b r300g/swtcl: handle large index count properly
3a6fd21 r300g/swtcl: force vertex prefetching for non-indexed primitives
55a6d37 r300g/swtcl: move emitting AOS to prepare_for_rendering
6ca3f86 r300g/swtcl: do not use u_upload_mgr and do not compute max_index
1bdbc0e r300g: fix fence referencing
f0896e7 r300/compiler: implement SGT+SLE opcodes
e6a8513 r300g: more efficient finish + fix comments
2c072c8 r300g: implement fake but compliant fences
876de34 r300g,util: remove pipe_surface from the util_blitter_copy interface and clean up
59e51d9 r300g,util: remove pipe_surface from the util_blitter_fill interface and clean up

Could you please try these one by one in a bottom-up fashion? Don't forget to type "sync" before you expect a hardlock. Once you find the first bad commit, go one commit back and try again. Then please give a report. Thank you.
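
A minimal sketch of that procedure, assuming a POSIX shell inside a Mesa git checkout. The helper names `oldest_first` and `run_synced` are made up for illustration, and the rebuild step is left out because it depends on how the tree was configured:

```shell
#!/bin/sh

# Print a newest-first commit list (the order `git log` uses) oldest first,
# so the commits can be tried bottom-up.
oldest_first() {
    awk '{ line[NR] = $0 } END { for (i = NR; i > 0; i--) print line[i] }'
}

# Flush filesystem buffers, then run the given command.
# Anything written before the sync is on disk even if the command
# hard-locks the machine.
run_synced() {
    sync
    "$@"
}

# Example (not executed here):
#   git log --format=%h fa552261..f4bcd0ca -- src/gallium/drivers/r300 | oldest_first
#   git checkout <commit>    # then rebuild
#   run_synced lugaru
```

The sync matters because a hard lock gives the kernel no chance to write back dirty data, so notes or logs that were not flushed beforehand may be lost.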
Comment 2 Tormod Volden 2010-06-10 15:38:57 UTC
Thanks, I got almost the same list with "git log --format=oneline fa552261..f4bcd0ca -- src/gallium/drivers/r300" and started with the "monosecting". However it hung on the first commit. Now I reinstalled the same fa552261 packages and still got the hang. So back to square one. I will reinstall the same kernel as I had at the time. I am always using package management so I can find everything in dpkg.log. Just need a rainy weekend day I guess.

BTW, should it be ok to use the stock libdrm from Lucid?
Comment 3 Marek Olšák 2010-06-10 15:43:07 UTC
libdrm shouldn't hardlock no matter what version you have.
Comment 4 Tormod Volden 2010-06-11 14:33:41 UTC
I have no idea any longer. My install log does not match my memory, and I guess that means EBRAIN. It seems I never used 20100523 (fa552261) but in fact upgraded straight from 20100429 (f7cf8b46) to 20100528 (f4bcd0ca). So now I checked out f7cf8b46 from git but it also locked up.

From timestamps I can see that I installed lugaru 20100528 and then did the mesa upgrade to f4bcd0ca some hours later. At the time I installed lugaru (and did not see any crash) I had just installed 2.6.34-996.201005261005 (drm-next from mainline kernel PPA). Hours before I had installed 2.6.34-999.201005231006 (linus tree). So I must have been running one of those kernels.

Both these kernels hang when I try them now with mesa 20100523 fa552261 (installed), and also with 20100429 f7cf8b46 (git checkout, using LD_LIBRARY_PATH and LIBGL_DRIVERS_PATH).
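
For reference, a hedged sketch of running a program against an uninstalled Mesa build with those two variables. `with_mesa_tree` is a hypothetical helper, and both directories are passed as arguments because their location depends on the build:

```shell
#!/bin/sh

# with_mesa_tree LIBDIR DRIVERDIR COMMAND [ARGS...]
#   LIBDIR:    directory containing the freshly built libGL
#   DRIVERDIR: directory containing the freshly built DRI driver
# The variables are set for COMMAND only; the calling shell is unchanged.
with_mesa_tree() {
    libdir=$1
    driverdir=$2
    shift 2
    env LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
        LIBGL_DRIVERS_PATH="$driverdir" \
        "$@"
}

# Example (paths are placeholders, not taken from this report):
#   with_mesa_tree "$HOME/mesa/lib" "$HOME/mesa/lib/gallium" lugaru
```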

So maybe I was just lucky the first few times I tried lugaru. I think I played with it for more than 10 minutes. The way I reproduce the hang now is simply by choosing Tutorial and then doing nothing: I just let the rabbit stand there and jump by itself. After 10-30 seconds the machine locks up. The sound goes into a stuttering (sub-second) loop and I have to power off with the power button.
Comment 5 Marek Olšák 2010-06-11 15:09:04 UTC
This is weird. I think all hardlocks a driver can provoke are more or less instant. I believe something is wrong in the kernel.

IIRC Corbin had some issues with RV410 too.
Comment 6 Marek Olšák 2010-06-26 16:20:21 UTC
1) Does r300c lock up too?

2) Set these environment variables:

RADEON_NO_TCL=1
RADEON_DEBUG=notiling,noimmd,fakeocc

and please let me know whether it helps or not.
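
One way to apply these to a single run without exporting them into the whole session, as a sketch (`with_radeon_debug` is a hypothetical helper name):

```shell
#!/bin/sh

# Run COMMAND with the suggested settings in its environment only;
# the calling shell stays unchanged.
with_radeon_debug() {
    env RADEON_NO_TCL=1 RADEON_DEBUG=notiling,noimmd,fakeocc "$@"
}

# Example (not executed here):
#   with_radeon_debug lugaru
```

Dropping one variable at a time from the `env` line makes it possible to tell which setting, if any, changes the behaviour.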
Comment 7 Tormod Volden 2010-06-27 01:55:42 UTC
1) No, I never could make lugaru lock up with classic.

2) With RADEON_NO_TCL=1 the lugaru menu is rotated or mirrored; moving the cursor to an edge changes the rotation, but never back to normal. When I start the tutorial I see rough, pixelated images (as if the resolution were totally wrong: a letter "W" fills half of the screen) and there is no acceleration.

So I tried with just RADEON_DEBUG=notiling,noimmd,fakeocc and it locked up as usual.
Comment 8 Tormod Volden 2010-06-27 03:02:52 UTC
BTW, also antmaze (from xscreensaver) locks up.
Comment 9 Niels Ole Salscheider 2010-07-29 00:23:19 UTC
I am not sure if this is the same bug, but since I upgraded to 2.6.35 I experience lockups with r300g in games (extreme tux racer, speed dreams) a few seconds after they start.

I tried to bisect the kernel and found that 61cf059325a30995a78c5001db2ed2a8ab1d4c36 is good while 5876dd249e8e47c730cac090bf6edd88e5f04327 is bad. When I try to continue bisecting, my OpenGL applications do not display anything.
Comment 10 Niels Ole Salscheider 2010-07-29 00:24:37 UTC
Created attachment 37427 [details]
Bisect log
Comment 11 Giacomo Perale 2010-08-04 08:43:56 UTC
(In reply to comment #10)
> Created an attachment (id=37427) [details]
> Bisect log

Hi,

this seems to be the issue I reported yesterday in bug 29389 after upgrading to 2.6.35 final. I tried to bisect too and I had the same problem (blank screen/black windows) during the run.

In the end I limited git bisect to the drm directory ('git bisect start -- drivers/gpu/drm/') and used 'git bisect skip' to skip those revisions, so I ended up with a single commit. 
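
A sketch of that approach, assuming a kernel git checkout as the current directory. `bisect_drm` is a hypothetical wrapper, and the example hashes are the ones from comment 9, not necessarily the endpoints used in this run:

```shell
#!/bin/sh

# bisect_drm GOOD BAD
# Start a bisection that only considers commits touching drivers/gpu/drm/.
# Note that `git bisect start` takes the bad revision first.
bisect_drm() {
    git bisect start "$2" "$1" -- drivers/gpu/drm/
}

# Example (not executed here):
#   bisect_drm 61cf059325a30995a78c5001db2ed2a8ab1d4c36 \
#              5876dd249e8e47c730cac090bf6edd88e5f04327
# Then, after building and booting each step:
#   git bisect good    # no lockup
#   git bisect bad     # lockup
#   git bisect skip    # untestable (blank screen/black windows)
```

With skipped revisions, git may be able to narrow the result only to a range of commits rather than a single one.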

However, I'm not sure it's the real culprit. Could you try to bisect again?
Comment 12 Tormod Volden 2010-08-11 14:35:16 UTC
With the gallium drivers installed, I ran antmaze with LIBGL_DRIVERS_PATH pointing to the classic drivers (both from yesterday's git) and it also locked up, although only after running longer than it usually can with gallium. So maybe the bug is something in the kernel that is more easily triggered when using gallium?
Comment 13 Jan Kreuzer 2010-08-14 23:33:06 UTC
Created attachment 37879 [details]
dmesg of crash

Same here. On my X700 Mobility PCIe, 2.6.34 works with r300g, while 2.6.35 locks up (black screen). I have some dmesg logs of the lockup: I attached one from 2.6.35, but I also have some from 2.6.35-rc kernels. I am bisecting the kernel; however, due to other problems I doubt I will be successful.

Greetings Jan
Comment 14 Giacomo Perale 2010-08-19 07:07:08 UTC
Created attachment 37982 [details] [review]
add module parameter to disable polling

Hi,

I'm not sure we're all having the same problem, but since this bug report seems to be more active than the one I started, I'll post here too.

I came back from vacation and found a patch from Chris Wilson on dri-devel that adds a parameter to disable polling. Since my bisection run pointed to the commit that enabled polling as the origin of my problem, I quickly adapted the patch to vanilla 2.6.35 to do some tests.

The patch is attached to this comment. To try it, patch the kernel and boot with drm_kms_helper.poll=0. Be careful: without this parameter X sometimes refused to start here. Since this was only a test I didn't look too hard; I think it's related to the slow work->workqueues conversion.

Anyway, with polling disabled I was able to play with openarena for about 15 minutes, while with polling (without the patch) I had a hardlock in 30-60s.

Since this could be important, I have a VGA/DVI card and I'm using the DVI port, with nothing connected to the VGA port.
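
To confirm which setting a given boot actually used, the kernel command line can be checked. A small sketch (`poll_setting` is a hypothetical helper; it only parses text and does not query the driver):

```shell
#!/bin/sh

# poll_setting CMDLINE
# Print the value of drm_kms_helper.poll found in a kernel command line,
# or print nothing if the parameter is absent.
poll_setting() {
    for arg in $1; do
        case "$arg" in
            drm_kms_helper.poll=*) echo "${arg#drm_kms_helper.poll=}" ;;
        esac
    done
}

# Example (not executed here):
#   poll_setting "$(cat /proc/cmdline)"
```

Empty output means the parameter was not passed at boot, so polling is left at the driver's default.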
Comment 15 Andrew Randrianasulu 2010-08-19 10:55:29 UTC
(In reply to comment #14)
> Created an attachment (id=37982) [details]
> add module parameter to disable polling
> 
> Hi,
> 
> I'm not sure we're all having the same problem, but since this bug report
> seems to be more active than the one I started, I'll post here too.
> 
> I came back from vacation and found a patch from Chris Wilson on dri-devel that
> adds a parameter to disable polling. Since my bisection run pointed to the
> commit that enabled polling as the origin of my problem, I quickly adapted the
> patch to vanilla 2.6.35 to do some tests.
> 
> The patch is attached to this comment. To try it, patch the kernel and boot with
> drm_kms_helper.poll=0. Be careful: without this parameter X sometimes refused to
> start here. Since this was only a test I didn't look too hard; I think it's
> related to the slow work->workqueues conversion.
> 
> Anyway, with polling disabled I was able to play with openarena for about 15
> minutes, while with polling (without the patch) I had a hardlock in 30-60s.
> 
> Since this could be important, I have a VGA/DVI card and I'm using the DVI
> port, with nothing connected to the VGA port.

I tried this patch on nouveau (as included in the drm-radeon-testing tree, based on 2.6.35; yes, odd), but something goes wrong: modinfo shows no additional parameters, and booting with drm_kms_helper.poll=0 just prevents drm/nouveau from loading.

Reason for testing on nouveau: polling TV-out seems to be an expensive operation here, at least on an old single-core AMD K7.
Comment 16 Jan Kreuzer 2010-08-20 07:48:39 UTC
I tried the patch and booted with polling=0; openarena (timedemo via Phoronix Test Suite) no longer crashes the machine when using r300g.
Chipset: X700 Mobility PCIe.

Jan
Comment 17 Tormod Volden 2010-08-26 14:57:18 UTC
I also tried the patch and with drm_kms_helper.poll=0 I can watch antmaze as much as I want and more, and the lugaru bunny just keeps jumping without any issues. If any test users want patched Ubuntu kernels, just tell me.
Comment 18 Marek Olšák 2010-08-27 21:21:55 UTC
Thanks for figuring out the cause and testing.

Please follow bug 29389 instead; it's not cluttered with lots of unrelated info like this one (bug 28474).

*** This bug has been marked as a duplicate of bug 29389 ***

