20341 – [NV31] lockup when using AGP. agpmode=0 fixes it.

Status:	VERIFIED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/nouveau
Version:	git
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	Nouveau Project
QA Contact:	Xorg Project Team

Reported:	2009-02-26 17:51 UTC by Jason Detring
Modified:	2014-02-09 00:23 UTC
CC List:	0 users

Attachments
Xorg.0.log (29.49 KB, text/plain) 2009-02-26 17:51 UTC, Jason Detring
Verbose drm.ko output from dmesg (27.54 KB, text/plain) 2009-02-26 17:51 UTC, Jason Detring
xorg.conf (15.64 KB, text/plain) 2009-02-26 17:59 UTC, Jason Detring
Valgrind mmap trace log (2.23 MB, application/octet-stream) 2013-08-25 03:56 UTC, Jason Detring
firefox corruption (477.67 KB, image/png) 2013-08-25 04:41 UTC, Jason Detring
x11perf corruption at lockup (657.02 KB, image/jpeg) 2013-08-25 04:50 UTC, Jason Detring
quirk patch (3.10 KB, patch) 2013-10-19 15:56 UTC, Ilia Mirkin

Description Jason Detring 2009-02-26 17:51:00 UTC

Created attachment 23346 [details]
Xorg.0.log

I'm seeing a lockup when images are changed on screen.  The display will freeze and all mouse clicks and keyboard input are lost, including keylock toggles.  Curiously, the pointer will continue tracking mouse movement.  Attached X clients also seem to be happy remaining connected.

In practice, this happens quite a bit in Opera when switching tabs or scrolling web pages with large images.  I've found a test case using "x11perf -putimage500" which triggers the same bug (I think?) after one or two seconds of drawing the test pattern.

Connecting to the machine over the network shows X taking 100% CPU time.  Attaching strace shows a rapidly repeating output of:
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])

I can recover to a working system with a "killall -9 X; vbetool post; startx" stanza, but that's really not something I enjoy doing once an hour.  ;-)


My hardware is
> lspci
01:00.0 VGA compatible controller: nVidia Corporation NV31 [GeForce FX 5600 Ultra] (rev a1) 
> lspci -n
01:00.0 0300: 10de:0311 (rev a1)
on a VIA KT133 on AGP.

My software is:
Linux 2.6.27.2 (vanilla)
Xserver 1.6.0
xf86-video-nouveau git master 1c4a284a80ebed9f9d1e01c47b929481801566b5
libdrm + nouveau.ko + drm.ko git master e96fc62e5339e3c8c8944dfe9f5163f769bccbd8
The system is otherwise a Slackware-current installation upgraded with just enough Xorg componentry to build the new X server.

Thanks,
Jason

Comment 1 Jason Detring 2009-02-26 17:51:49 UTC

Created attachment 23347 [details]
Verbose drm.ko output from dmesg

Comment 2 Jason Detring 2009-02-26 17:59:07 UTC

Created attachment 23348 [details]
xorg.conf

Comment 3 Ben Skeggs 2009-03-17 20:31:49 UTC

Does this issue disappear if you add
  Option "CBLocation" "VRAM"
to the device section of your xorg.conf?

Comment 4 Jason Detring 2009-03-17 21:22:55 UTC

Yes, the lockups appear to have stopped under those two test conditions.

However, there seems to be minor screen corruption on anything more complex than core fonts.  Text has a slightly "dirty" look to it, and bitmaps seem to be "echoed" in vertical stripes.  Some images seem to be echoed in RGB-ish stripes.  Others don't follow the "vertical stripe pattern" at all and are smudged (I can still make out what a given icon shape was for example), similar to the text.

Comment 5 Ben Skeggs 2009-03-17 21:36:11 UTC

It's probably still some AGP braindamage causing your corruption.  That option just stops it being used for the command stream we send to the GPU.  Are you able to get me a mmio-trace of how the NVIDIA driver initialises your card?

The instructions at http://nouveau.freedesktop.org/wiki/MmioTrace should help you with that.

Thanks!
Ben.

Comment 6 Jason Detring 2009-03-18 21:01:37 UTC

Here's a copy of the e-mail I just sent.

Regards,
Jason

-----

Howdy,

This dump is in reference to fd.o bugzilla ticket 20341.  According to the submission guidelines, you need the following information:

> uname -a
Linux anduril 2.6.28.8 #2 PREEMPT Wed Mar 18 12:25:17 CDT 2009 i686 AMD Athlon(tm) Processor AuthenticAMD GNU/Linux

> dmesg | grep -A 5 -i nvidia 
nvidia: module license 'NVIDIA' taints kernel.
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 3
PCI: setting IRQ 3 as level-triggered
nvidia 0000:01:00.0: PCI INT A -> Link[LNKA] -> GSI 3 (level, low) -> IRQ 3
NVRM: loading NVIDIA UNIX x86 Kernel Module  173.14.18  Mon Mar  2 12:17:40 PST 2009

The Nvidia board in question has one each of DVI, S-Video, and VGA connectors.  A Sun 24" LCD is attached on the DVI port.  I'm not sure what is meant by "what display mode you used" but I always boot to a VGA (non-framebuffer) console if that's what you want.

A kernel rebuild was required since the old 2.6.27.2 didn't have debugfs included.  I took the opportunity to upgrade to 2.6.28.8.

The testing procedure involved increasing the trace entries to 12000000 before events were no longer dropped.

Enabling MMIO tracing would always lock the system when X was started.  On a freshly booted system, the lockup would happen extremely early in X startup, resulting in no data beyond the initial PCIDEV entries being written to the text file.  I was able to start capturing data by first "priming" the graphics hardware by starting X, exiting, unloading nvidia.ko, then setting up the MMIO trace knobs and restarting X.

NVidia's GLX implementation combined with the running tracer seemed to freeze X midway through startup.  By disabling GLX, I was able to reach a window manager.  X promptly froze afterward due to reasons unknown.

Despite all this, I'm hoping there is enough data to determine how the blob sets up AGP for my system.  If not, I'll be happy to make another attempt.

Thanks,
Jason

Comment 7 Pekka Paalanen 2009-03-19 10:05:45 UTC

Mmiotrace recently got some important fixes that just might help you. If you could try the latest rc kernel (or at least 2.6.29-rc7), I would be very interested to hear how it behaves.

If it still locks up, it would be good to investigate, but I don't know if you have the time or the tools (serial terminal, netconsole, or debugging via firewire). Please contact me personally, if you want to help in debugging mmiotrace.

Thanks.

Comment 8 Jason Detring 2009-03-19 21:54:43 UTC

Another e-mail sent to mmio.dumps:
-----
This is a second attempt at tracing X startup.

> uname -a
Linux anduril 2.6.29-rc8 #1 PREEMPT Thu Mar 19 22:19:24 CDT 2009 i686 AMD Athlon(tm) Processor AuthenticAMD GNU/Linux

I am using the same driver (173.14.18 "legacy") as before.

I experienced no lockups during this trace, even with GLX enabled.  Thanks for the suggestion!  The only default I needed to change was increasing buffer_size_kb to 20000.

Cheers,
Jason

Comment 9 DarkRaven 2010-08-13 00:13:54 UTC

Same problem.
After a kernel update (git 20100807).
But I'm using an NV34 graphics card (FX5200).

Comment 10 DarkRaven 2010-08-14 21:04:54 UTC

No such problem with the nouveau kernel tree;
it seems that nouveau doesn't get along well with the code in upstream.

Comment 11 Albert Pool 2011-10-17 08:50:05 UTC

My NV34 (XFX GeForce FX 5200) suffers from this problem too, but the mouse pointer also turns into a white square with a black blur. So I filed it as a separate bug #41892. Feel free to mark that one as a duplicate if you think it's the same problem.

Comment 12 Ilia Mirkin 2013-08-18 18:09:57 UTC

It appears that this bug report has lain dormant for quite a while. Sorry we haven't gotten to it. Since we fix bugs all the time, chances are pretty good that your issue has been fixed with the latest software. Please give it a shot. (Linux kernel 3.10.7, xf86-video-nouveau 1.0.9, mesa 9.1.6, or their git versions.) If upgrading to the latest isn't an option for you, your distro's bugzilla is probably the right destination for your bug report.

In an effort to clean up our bug list, we're pre-emptively closing all bugs that haven't seen updates since 2011. If the original issue remains, please make sure to provide fresh info, see http://nouveau.freedesktop.org/wiki/Bugs/ for what we need to see, and re-open this one.

Thanks,

The Nouveau Team

Comment 13 Jason Detring 2013-08-23 05:40:42 UTC

Hi Ilia, thanks for the ticket bump.

I pulled this machine out of storage to retest.  The entire graphics stack has now been upgraded.
- Mesa 9.1.6
- Xorg server 1.13.4
- xf86-video-nouveau 1.0.9
- Linux 3.11-rc6

Nouveau now reports at bootup:
[    6.584942] nouveau  [  DEVICE][0000:01:00.0] BOOT0  : 0x031100a1
[    6.585121] nouveau  [  DEVICE][0000:01:00.0] Chipset: NV31 (NV31)
[    6.585212] nouveau  [  DEVICE][0000:01:00.0] Family : NV30
[    6.587705] nouveau  [   VBIOS][0000:01:00.0] checking PRAMIN for image...
[    6.629569] nouveau  [   VBIOS][0000:01:00.0] ... appears to be valid
[    6.629663] nouveau  [   VBIOS][0000:01:00.0] using image from PRAMIN
[    6.629755] nouveau  [   VBIOS][0000:01:00.0] BMP version 5.27
[    6.630214] nouveau  [   VBIOS][0000:01:00.0] version 04.31.20.52.00
[    6.631927] nouveau W[  PTIMER][0000:01:00.0] unknown input clock freq
[    6.632210] nouveau  [     PFB][0000:01:00.0] RAM type: DDR1
[    6.632340] nouveau  [     PFB][0000:01:00.0] RAM size: 128 MiB
[    6.632429] nouveau  [     PFB][0000:01:00.0]    ZCOMP: 262144 tags
[    6.642148] [TTM] Zone  kernel: Available graphics memory: 93186 kiB
[    6.642277] [TTM] Initializing pool allocator
[    6.642535] nouveau  [     DRM] VRAM: 127 MiB
[    6.642661] nouveau  [     DRM] GART: 128 MiB
[    6.642752] nouveau  [     DRM] BMP version 5.39
[    6.642839] nouveau  [     DRM] DCB version 2.2
[    6.642928] nouveau  [     DRM] DCB outp 00: 01000300 00009c40
[    6.643043] nouveau  [     DRM] DCB outp 01: 02010310 00009c40
[    6.643132] nouveau  [     DRM] DCB outp 02: 01010312 00000000
[    6.643220] nouveau  [     DRM] DCB outp 03: 02020321 00000003
[    6.643663] nouveau  [     DRM] Loading NV17 power sequencing microcode
[    6.645661] nouveau  [     DRM] Saving VGA fonts
[    6.690672] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[    6.690779] [drm] No driver support for vblank timestamp query.
[    6.690966] nouveau  [     DRM] 0xE176: Parsing digital output script table
[    6.691764] nouveau  [     DRM] 0 available performance level(s)
[    6.691859] nouveau  [     DRM] c: core 234MHz memory 501MHz voltage 1220mV
[    6.695827] nouveau  [     DRM] MM: using M2MF for buffer copies
[    6.696202] nouveau  [     DRM] Setting dpms mode 3 on TV encoder (output 3)
[    6.777513] nouveau  [     DRM] allocated 1920x1200 fb: 0x9000, bo cbb04600
[    6.777995] fbcon: nouveaufb (fb0) is primary device
[    6.794224] Console: switching to colour frame buffer device 240x75
[    6.798678] nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
[    6.798701] nouveau 0000:01:00.0: registered panic notifier
[    6.798729] [drm] Initialized nouveau 1.1.1 20120801 for 0000:01:00.0 on minor 0


I spent some time ensuring the Nvidia driver (173.14.18) tested earlier in the ticket was completely removed.  glxinfo now yields
   direct rendering: Yes
   ...
   OpenGL vendor string: nouveau
   OpenGL renderer string: Gallium 0.4 on NV31
   OpenGL version string: 1.5 Mesa 9.1.6
as expected.

Nouveau's 3D engine seems to have no lockup problems.  I spent a few minutes working my way through xscreensaver's GL modules with no catastrophic consequences.  It appears only 2D acceleration has issues.

Running "x11perf -putimage500" still locks up the machine.  Symptoms aren't exactly the same as earlier in the ticket, but the end result still is loss of a usable UI.
1. X freezes.  Local input is dropped.  Mouse pointer freezes, keyboard lights do not respond to toggles.
2. Machine is not completely frozen, SSH still works.
3. X continues to run for a few seconds, then crashes.  X dies, but the system does not return to a console.  The keyboard is still locked and the screen is black.
4. dmesg has been spammed as follows:

[ 5096.360378] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 1 [X[1337]] get 0x00037be4 put 0x0001d690 state 0x2000a428 (err: CALL_SUBR_ACTIVE) push 0x00000000
[ 5096.360431] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 1 [X[1337]] get 0x0001d690 put 0x0001d6a0 state 0x80000000 (err: INVALID_CMD) push 0x00000000
[ 5096.360475] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 1 [X[1337]] get 0x0001d6a0 put 0x0001d6b0 state 0x80000000 (err: INVALID_CMD) push 0x00000000
[ 5096.360516] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 1 [X[1337]] get 0x0001d6b0 put 0x0001d6c0 state 0x80000000 (err: INVALID_CMD) push 0x00000000

... lots of above lines repeated ...

[ 5126.421032] nouveau E[ X[1337]] failed to idle channel 0xcccc0000 [X[1337]]
[ 5126.425268] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 1 [X[1337]] subc 0 mthd 0x0000 data 0x00130000
[ 5141.425031] nouveau E[ X[1337]] failed to idle channel 0xcccc0000 [X[1337]]
[ 5141.457144] ------------[ cut here ]------------
[ 5141.457328] WARNING: CPU: 0 PID: 1337 at drivers/gpu/drm/nouveau/nouveau_bo.c:151 nouveau_bo_del_ttm+0x66/0x70 [nouveau]()
[ 5141.457334] Modules linked in: lm90 ipv6 lp fuse hid_generic usbhid hid nouveau mxm_wmi snd_via82xx wmi video ttm drm_kms_helper snd_mpu401_uart snd_rawmidi snd_seq_device snd_ac97_codec snd_pcm snd_page_alloc snd_timer drm i2c_algo_bit via_agp agpgart uhci_hcd snd soundcore via686a ac97_bus gameport mperf i2c_viapro processor ppdev e1000 i2c_core shpchp parport_pc ehci_hcd parport thermal_sys button psmouse serio_raw evdev freq_table hwmon loop [last unloaded: cpuid]
[ 5141.457413] CPU: 0 PID: 1337 Comm: X Not tainted 3.11.0-rc6 #1
[ 5141.457419] Hardware name: Compaq Compaq PC                      /06E4h, BIOS 786K1 07/26/2001
[ 5141.457424]  00000000 00000000 c76a3ca0 c150d8e5 c76a3cd0 c10389fa c16233b0 00000000
[ 5141.457434]  00000539 ccc512f4 00000097 ccc10a16 ccc10a16 c5ca9800 c5ca9824 00006240
[ 5141.457444]  c76a3ce0 c1038ac2 00000009 00000000 c76a3cf8 ccc10a16 c76a3cf8 cc88190f
[ 5141.457454] Call Trace:
[ 5141.457480]  [<c150d8e5>] dump_stack+0x16/0x18
[ 5141.457497]  [<c10389fa>] warn_slowpath_common+0x7a/0xa0
[ 5141.457543]  [<ccc10a16>] ? nouveau_bo_del_ttm+0x66/0x70 [nouveau]
[ 5141.457586]  [<ccc10a16>] ? nouveau_bo_del_ttm+0x66/0x70 [nouveau]
[ 5141.457595]  [<c1038ac2>] warn_slowpath_null+0x22/0x30
[ 5141.457638]  [<ccc10a16>] nouveau_bo_del_ttm+0x66/0x70 [nouveau]
[ 5141.457685]  [<cc88190f>] ? drm_mm_put_block+0x3f/0x50 [drm]
[ 5141.457703]  [<cca385ee>] ttm_bo_release_list+0x6e/0xa0 [ttm]
[ 5141.457714]  [<cca3912e>] ttm_bo_release+0x13e/0x1d0 [ttm]
[ 5141.457725]  [<cca391e5>] ttm_bo_unref+0x25/0x30 [ttm]
[ 5141.457772]  [<ccc13abe>] nouveau_gem_object_del+0x3e/0x60 [nouveau]
[ 5141.457789]  [<cc879ed2>] drm_gem_object_free+0x22/0x30 [drm]
[ 5141.457804]  [<cc87a1e8>] drm_gem_object_release_handle+0x88/0xb0 [drm]
[ 5141.457818]  [<cc87a160>] ? drm_gem_handle_delete+0x110/0x110 [drm]
[ 5141.457840]  [<c1272cb5>] idr_for_each+0xa5/0x100
[ 5141.457854]  [<cc87a160>] ? drm_gem_handle_delete+0x110/0x110 [drm]
[ 5141.457873]  [<cc88748c>] ? drm_fb_release+0x9c/0xb0 [drm]
[ 5141.457889]  [<cc87aa1a>] drm_gem_release+0x1a/0x30 [drm]
[ 5141.457903]  [<cc879469>] drm_release+0x4a9/0x520 [drm]
[ 5141.457917]  [<c110168d>] __fput+0xbd/0x1e0
[ 5141.457924]  [<c11017ed>] ____fput+0xd/0x10
[ 5141.457932]  [<c104fa31>] task_work_run+0x81/0xa0
[ 5141.457941]  [<c10394f8>] do_exit+0x1f8/0x7f0
[ 5141.457957]  [<c1043886>] ? recalc_sigpending+0x16/0x50
[ 5141.457966]  [<c103a81e>] do_group_exit+0x2e/0x70
[ 5141.457976]  [<c1046237>] get_signal_to_deliver+0x157/0x520
[ 5141.457992]  [<c102ce90>] ? vmalloc_sync_all+0xe0/0xe0
[ 5141.458000]  [<c1001829>] do_signal+0x39/0x940
[ 5141.458007]  [<c10ffd40>] ? do_sync_write+0x60/0x90
[ 5141.458045]  [<c11005ad>] ? vfs_write+0x15d/0x1c0
[ 5141.458064]  [<c1072c5b>] ? get_monotonic_coarse+0x6b/0x80
[ 5141.458083]  [<c127e1c8>] ? copy_to_user+0x28/0x40
[ 5141.458093]  [<c1050fe0>] ? posix_get_realtime_coarse+0x20/0x20
[ 5141.458100]  [<c102ce90>] ? vmalloc_sync_all+0xe0/0xe0
[ 5141.458107]  [<c100217d>] do_notify_resume+0x4d/0x80
[ 5141.458118]  [<c1512533>] work_notifysig+0x24/0x31
[ 5141.458124] ---[ end trace 1bc9e065918e3541 ]---

The 'CBLocation' parameter, as mentioned earlier in Comment #3, seems to have been removed, so I was unable to test this.
[    98.210] (WW) NOUVEAU(0): Option "CBLocation" is not used

Ben is probably correct when he suggested the chip has an unstable AGP bus.  Setting the agpmode=0 parameter on the kernel module makes everything nice and stable (but probably slower on the top end where it is needed).  Is there some workaround or hardware manipulation ordering that is in Mesa, but hasn't made its way into xf86-video-nouveau?


Thanks,
Jason
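The DMA_PUSHER lines in the dmesg excerpt above follow a fixed layout, so the "lots of above lines repeated" spam can be tallied by error type. The following is a minimal Python sketch, not part of any nouveau tooling; the regular expression is written only against the lines quoted in this comment, and the names `parse_dma_pusher` and `count_errors` are hypothetical.

```python
import re
from collections import Counter

# Pattern derived only from the DMA_PUSHER lines quoted in this comment;
# other kernel versions may print a different layout.
DMA_PUSHER_RE = re.compile(
    r"DMA_PUSHER - ch (?P<ch>\d+) \[(?P<client>.+?)\] "
    r"get (?P<get>0x[0-9a-f]+) put (?P<put>0x[0-9a-f]+) "
    r"state (?P<state>0x[0-9a-f]+) \(err: (?P<err>\w+)\)"
)

# Two lines copied from the dmesg excerpt above.
SAMPLE_LINES = [
    "[ 5096.360378] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 1 "
    "[X[1337]] get 0x00037be4 put 0x0001d690 state 0x2000a428 "
    "(err: CALL_SUBR_ACTIVE) push 0x00000000",
    "[ 5096.360431] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 1 "
    "[X[1337]] get 0x0001d690 put 0x0001d6a0 state 0x80000000 "
    "(err: INVALID_CMD) push 0x00000000",
]


def parse_dma_pusher(line):
    """Return the fields of one DMA_PUSHER line, or None if it does not match."""
    m = DMA_PUSHER_RE.search(line)
    if m is None:
        return None
    return {
        "channel": int(m.group("ch")),
        "client": m.group("client"),
        "get": int(m.group("get"), 16),
        "put": int(m.group("put"), 16),
        "state": int(m.group("state"), 16),
        "err": m.group("err"),
    }


def count_errors(lines):
    """Count how often each error type appears in an iterable of dmesg lines."""
    parsed = (parse_dma_pusher(line) for line in lines)
    return Counter(p["err"] for p in parsed if p is not None)


if __name__ == "__main__":
    for err, n in count_errors(SAMPLE_LINES).items():
        print(err, n)
```

Feeding it the full `dmesg` output instead of `SAMPLE_LINES` would show whether CALL_SUBR_ACTIVE appears once, followed by a run of INVALID_CMD, as the excerpt suggests.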

Comment 14 Ilia Mirkin 2013-08-23 08:05:52 UTC

Don't worry about the WARN -- that should get fixed in -rc7 (or you can grab nouveau/master).

[ 5096.360378] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 1 [X[1337]] get 0x00037be4 put 0x0001d690 state 0x2000a428 (err: CALL_SUBR_ACTIVE) push 0x00000000

This is bad. Once this happens, I guess the fifo is out of sync, and that is a recipe for bad times.

From https://github.com/envytools/envytools/blob/master/hwdocs/fifo/dma-pusher.txt:

"""
The call command copies dma_get to subr_return, sets subr_active to 1, and
sets dma_get to the target. If subr_active is already set before the call, the
DMA_PUSHER error of type CALL_SUBR_ACTIVE is raised.

The return command copies subr_return to dma_get and clears subr_active. If
subr_active isn't set, it instead raises DMA_PUSHER error of type
RET_SUBR_INACTIVE.
"""

And it looks like we make use of the call functionality in nouveau_gem_ioctl_pushbuf. The whole scheme seems a bit fragile since we do the call to some new pushbuf, and then tell userspace to write a return at the beginning of its next pushbuf. However we might need to do a call in order to get to that pushbuf, thus causing this condition. But... this code is very subtle so perhaps I'm missing why this can't happen.

Another thing to try is nouveau.vram_pushbuf on the kernel cmdline -- this might be what the old CBLocation thing did, not sure.
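The call/return rules quoted from the envytools document can be modelled as a tiny state machine. This is a Python sketch of the quoted text only, not of the hardware or of nouveau's code: the class and method names are hypothetical, and it ignores how `dma_get` advances while commands are fetched.

```python
class DmaPusherError(Exception):
    """Stands in for the DMA_PUSHER error the quoted text says is raised."""


class PusherModel:
    """Toy model of the subr_active / subr_return rules quoted above."""

    def __init__(self, dma_get=0):
        self.dma_get = dma_get
        self.subr_return = 0
        self.subr_active = False

    def call(self, target):
        # "If subr_active is already set before the call, the DMA_PUSHER
        # error of type CALL_SUBR_ACTIVE is raised."
        if self.subr_active:
            raise DmaPusherError("CALL_SUBR_ACTIVE")
        # "copies dma_get to subr_return, sets subr_active to 1, and sets
        # dma_get to the target"
        self.subr_return = self.dma_get
        self.subr_active = True
        self.dma_get = target

    def ret(self):
        # "If subr_active isn't set, it instead raises DMA_PUSHER error of
        # type RET_SUBR_INACTIVE."
        if not self.subr_active:
            raise DmaPusherError("RET_SUBR_INACTIVE")
        # "copies subr_return to dma_get and clears subr_active"
        self.dma_get = self.subr_return
        self.subr_active = False


if __name__ == "__main__":
    pusher = PusherModel(dma_get=0x100)
    pusher.call(0x2000)      # fine: no subroutine active yet
    try:
        pusher.call(0x3000)  # second call without a return in between
    except DmaPusherError as exc:
        print("error:", exc)
```

In this model a second call with no intervening return is the only way to reach CALL_SUBR_ACTIVE. That matches the concern above: either a return is missing, or a call has ended up somewhere it should not be.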

Comment 15 Ilia Mirkin 2013-08-23 15:14:23 UTC

Hm, after more careful review, I take back my comment about the scheme being fragile. (I mean, it's still fragile, but not AS fragile as I thought. I don't see a way for it to fail under normal circumstances.)

However one very simple explanation is that a call command is making it onto the userspace-supplied pushbuf. Let's make sure that this is not the case -- Get valgrind-mmt going (http://nouveau.freedesktop.org/wiki/Valgrind-mmt/), and run X inside of mmt, and run x11perf until it dies, e.g.

valgrind --tool=mmt --mmt-trace-file=/dev/dri/card0 --mmt-trace-nouveau-ioctls xinit x11perf -putimage500 -- :1 >& xorg-mmt.log

or something along those lines. (And you can run it in a separate X server so you don't have to kill your "real" session.)

Comment 16 Jason Detring 2013-08-25 03:56:55 UTC

Created attachment 84575 [details]
Valgrind mmap trace log

Attaching valgrind mmap trace log.

Notes:
- Prior to testing the kernel was upgraded to nouveau/linux-2.6 master @ 3b56bba6abaa70d629fccdcf8490e087ea3a1ab4 "drm/nv04/disp: fix framebuffer pin refcounting" 2013-08-21 01:37:09 (GMT)

- The agpmode=0 nouveau.ko parameter was removed.

- X was especially slow during the test, as expected.  This corresponds to the note on valgrind-mmt's wiki page: "tracing slows down Xorg considerably - ~20x"

- X did not crash during this test.  The test procedure was conducted 5 times to ensure the system didn't have a spurious nonfailure.

- This is a huge logfile.  It will expand to ~220 MB when decompressed.  This log covers only the first test instance.

- Rerunning the test without valgrind caused the usual lockup.

Comment 17 Ilia Mirkin 2013-08-25 04:04:23 UTC

Argh. So if it didn't crash with valgrind-mmt, it's a heisenbug :( Some silly timing thing... and agpmode=0 either slows it down enough or it's something to do with the GART. (I assume agpmode=0 -> 0 GART memory, yes?) In either case, this is out of my (debugging) league. Thanks for doing the tests, and sorry they didn't lead to anything easy. I'll leave this open, perhaps some brave soul will delve into it.

Did you ever give nouveau.vram_pushbuf=1 a shot (without agpmode=0)?

Comment 18 Jason Detring 2013-08-25 04:41:31 UTC

Created attachment 84576 [details]
firefox corruption

Setting vram_pushbuf=1 seems to help a little.  There are no more lockups, but graphics get trashed when things start moving fast (x11perf) or big and complicated (fullscreen web browsers).  Basically, any large volume data transfers.

After the initial corruption occurs due to the mystery trigger, the system never really goes back to normal.  There's broken graphics everywhere, even on unrelated, small, sparsely-updated windows.

Things look roughly like I remember writing about in Comment #4, so this parameter is probably doing the same job as 'CBLocation'.

Comment 19 Ilia Mirkin 2013-08-25 04:46:28 UTC

Yeah, looks like good ol' data corruption. Keeping the pushbuf out of the corruption avoids lockups, but your images are still messed up. Dunno what to say other than "agp is broken". Unfortunately I know next to nothing about its workings. You could try agpmode=1 or 2 (since it goes into 4x for you).

Comment 20 Jason Detring 2013-08-25 04:50:28 UTC

Created attachment 84577 [details]
x11perf corruption at lockup

Also related--if it helps anybody.  When vram_pushbuf and agpmode are unset, lockups will typically (but not always) show vertical banding on the part of the screen x11perf was updating.  It looks like the black lines are pulled apart into red and blue components.

I'm not sure if x11perf is intended to work this way (drawing color components separately and converging to black), but it seems to be a common symptom that shows up when things stop moving.

Comment 21 mypersonalmailbox1 2013-08-25 15:09:36 UTC

Hi,

I believe VIA Technologies KT133's AGP may be the culprit.

http://us.download.nvidia.com/XFree86/Linux-x86/173.14.09/README/chapter-12.html

KT133's (VT8363) AGP is likely to be generationally closer to Apollo Pro133A (VT82C694X) and KX133 (VT8371).
KT133 and KX133 are fairly similar with KT133 correctly supporting AMD Socket A.
KX133 supposedly did not implement a certain type of I/O buffer for EV6 bus to support Socket A, and as a result, it will not be able to support Socket A.
Of course, that issue is not related to AGP, but since KT133 came out fairly quickly after the EV6 bus issue was acknowledged, it is probably a minor updated version of KX133.
Nouveau may have to add special code to detect the chipset, and if it detects certain known VIA Technologies chipsets, it should turn off AGP 4X mode (i.e., run at 2X mode).
I do not know if VIA Technologies really fixed the AGP for these chipsets, or perhaps they fixed it with their newer chipsets.
I hope this information adds to the discussion.

Regards,

fpgahardwareengineer

Comment 22 mypersonalmailbox1 2013-08-25 15:15:53 UTC

Hi,

I guess running the AGP at 2X mode might be a good way to see if this issue is caused by VIA Technologies' bad implementation of AGP 4X mode around that time (1999 to early 2000s; I am sure they fixed the problems on later chipsets).

Regards,

fpgahardwareengineer

Comment 23 Ilia Mirkin 2013-09-26 23:41:34 UTC

Do nouveau.agpmode=1 or agpmode=2 produce the same corruption? I don't remember the KT133 as being the most reliable chipset in the world... even the proprietary driver has the NvAGP setting (or something along those lines). Does the blob work for you btw? Can you play with forcing different AGP speeds on it, and/or perhaps figure out from logs what AGP speed it's defaulting to?

Comment 24 Jason Detring 2013-09-27 05:09:08 UTC

Ah, yeah, sorry about that. This ticket got away from me. I've got a small pile of research and resources I'd promised myself to post Real Soon Now. So I guess I should.

I haven't reinstalled the blob, but from what I recall it did work without problems last time it was loaded. I don't remember if it left a console message describing which AGP mode it was running at, but it likely had the 2x limiter since those drivers were from 2009.

Answering the big question: yes. Setting agpmode=2 makes things stable. Hooray. But it is sort of a hollow victory to say things work great at half-speed. As fpgahardwareengineer said, the problem is likely in the KT133. The linked NVIDIA document points to the AGP signalling drive strength being insufficient.

Following up on "AGP drive strength", I found copies of a document called the "BIOS Optimization Guide" floating around the internet. Parts of it are reposted on various bulletin boards [1] [2] [3] [4] and websites [5] [6]. An older version talks about upping the drive strength from 0xCA or 0xDA to 0xEA or 0xEE when running GeForce boards.

[1] http://forums.pcper.com/showthread.php?99837-A7M266-Infinite-Loop-Troubleshooting-(detailed)&p=681721#post681721
[2] http://www.rage3d.com/board/showpost.php?p=1331276579&postcount=2
[3] http://www.scrigroup.com/calculatoare/tutorials/414/OPTIMIZING-WINDOWS-TIPS64128.php
[4] http://arstechnica.com/civis/viewtopic.php?f=8&t=737578
[5] http://hardwarehell.com/articles/videobios.htm
[6] http://archive.arstechnica.com/guide/building/bios/m-bios-3.html

It also sounds like ATI had a similar problem and did a thorough investigation at one point [7]. Their solution was to force down to 2x mode on a chipset blacklist, and only allow 4x if the "VIA chipset driver" had been installed. The advisory document talks about something called the "AGP Read Synchronization bit" needing to be set at register 0xAC[6]. I'm guessing this chipset driver is some combination of this read synchronization register poke or the abovementioned drive strength register poke.

[7] http://www.rage3d.com/board/showthread.php?p=1331422045#post1331422045

The ATI document is slightly at odds with the KT133(a) programmer's guide [8], which lists 0xAC[6] as "CPU Stall on AGP command FIFO GART Address Request", but I didn't see anything else remotely resembling a "Read synchronization bit".

[8] http://gkernel.sourceforge.net/specs/via/KT133a.pdf.bz2

So, I tried testing these theories. It didn't go so well.

1. My BIOS has a pretty spartan set of options since the PC is a mass-market consumer box. There was no option to adjust AGP drive strength. Darn.

2. I tried to adjust the drive strength while Linux was running. According to the VIA doc, AGP drive strength is set from PCI register 0xB1[7-0]. It looks like this entire register is the P Ctrl and N Ctrl values for drive strength. On the console with nouveau already loaded, I wrote 0xEA. Nothing broke, so I started X. It had the same lockup as usual. Darn. I tried 0xEB through 0xEE as well. No good.

3. I tried to adjust the read synchronization bit. It was already set. So, no use in that.
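The register checks in steps 2 and 3 come down to simple bit arithmetic on bytes of the host bridge's PCI configuration space. Below is a minimal Python sketch that works on a plain byte dump; how the dump is obtained is left out. The helper names are hypothetical, and splitting 0xB1 into a high and a low nibble is an assumption based on the "P Ctrl and N Ctrl" reading above (this thread does not say which nibble is which).

```python
AGP_MISC_REG = 0xAC        # register 0xAC, bit 6 per the ATI advisory
DRIVE_STRENGTH_REG = 0xB1  # register 0xB1[7-0] per the VIA document


def bit_is_set(value, bit):
    """True if the given bit (0 = least significant) is set in value."""
    return bool((value >> bit) & 1)


def split_nibbles(value):
    """Split one byte into (high nibble, low nibble).

    Assumption: the two drive-strength controls occupy one nibble each.
    """
    return (value >> 4) & 0xF, value & 0xF


def describe(config):
    """Summarise the two registers discussed above from a config-space dump."""
    return {
        "read_sync_bit": bit_is_set(config[AGP_MISC_REG], 6),
        "drive_strength": split_nibbles(config[DRIVE_STRENGTH_REG]),
    }


if __name__ == "__main__":
    # Fabricated 256-byte dump for illustration only, not real hardware data.
    dump = bytearray(256)
    dump[AGP_MISC_REG] = 0x40        # bit 6 set
    dump[DRIVE_STRENGTH_REG] = 0xEA  # one of the values tried in step 2
    print(describe(dump))
```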

Presently, I'm stuck. There's not a whole lot else I can think of that might be worth trying. Maybe there's some commit register or reinit procedure that needs to take place after writing the drive strength? I don't know enough about AGP to figure out where to go next.

At this time, I agree with fpgahardwareengineer's assessment. A quirk for those chipsets listed in the ATI doc should be added to allow negotiation up to AGP 2x maximum. It is the safest option in all cases. Maybe at the same time leave a note in the dmesg describing "buggy chipset detected, defaulting to 2x, please read fd.o bug 20341 for more information", and also allow agpmode=4 to override the safe default. I'm not sure whether this quirk should be in nouveau or in via-agp.ko, since it sounds like the problem affects all major vendors.

Comment 25 Ilia Mirkin 2013-10-19 15:56:16 UTC

Created attachment 87854 [details] [review]
quirk patch

Could you try this patch -- it adds a quirk to limit the mode to 2 for your particular configuration. This is similar to the quirks that the radeon driver uses, and was recommended to me as the proper way forward. (Unfortunately, it seems like *combinations* of chipset/card are the issue, not just the chipset on its own.)

I guessed at your chipset ID; if it doesn't trigger, please attach the output of lspci -vnn (I'd like to see it for both the chipset and your card).

You should see a print like

Setting agp speed to 2X. Use agpmode to override.

Comment 26 Jason Detring 2013-10-21 01:58:54 UTC

Works well, just needs the chipset ID swapped.

-       { PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_8363_0, PCI_VENDOR_ID_NVIDIA, 0x0311, 2 },
+       { PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C691_0, PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_GEFORCE_FX_5600_ULTRA, 2 },
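The quirk entry above keys on the combination of host bridge and graphics card. The lookup idea can be sketched in Python as below. This is an illustration of the approach, not the attached kernel patch: the table layout and the `pick_agp_mode` helper are hypothetical, the numeric IDs are the ones shown in the lspci output in this comment, and using `None` to mean "agpmode not given" is an assumption about how the override is signalled.

```python
# (bridge vendor, bridge device, card vendor, card device) -> highest safe AGP mode
AGP_QUIRKS = {
    # VIA [1106:0691] host bridge with NVIDIA NV31 [10de:0311], limited to 2x
    (0x1106, 0x0691, 0x10DE, 0x0311): 2,
}


def pick_agp_mode(bridge, card, negotiated, agpmode=None):
    """Choose the AGP mode to program.

    bridge, card: (vendor, device) ID pairs.
    negotiated:   mode the bus would otherwise run at (1, 2 or 4).
    agpmode:      explicit user override; None means not given.
    """
    if agpmode is not None:
        # The user override always wins, e.g. agpmode=4 to force 4x
        # or agpmode=0 to disable AGP.
        return agpmode
    limit = AGP_QUIRKS.get(bridge + card)
    if limit is not None and negotiated > limit:
        return limit
    return negotiated


if __name__ == "__main__":
    via_bridge = (0x1106, 0x0691)
    nv31 = (0x10DE, 0x0311)
    print(pick_agp_mode(via_bridge, nv31, negotiated=4))             # quirk applies
    print(pick_agp_mode(via_bridge, nv31, negotiated=4, agpmode=4))  # user override
```

Keying on all four IDs follows the remark in Comment 25 that combinations of chipset and card are the issue, not the chipset alone.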



> lspci -vnn
00:00.0 Host bridge [0600]: VIA Technologies, Inc. VT82C693A/694x [Apollo PRO133x] [1106:0691] (rev 03)
        Flags: bus master, medium devsel, latency 0
        Memory at 90000000 (32-bit, prefetchable) [size=64M]
        Capabilities: [a0] AGP version 2.0
        Capabilities: [c0] Power Management version 2
        Kernel driver in use: agpgart-via

00:01.0 PCI bridge [0604]: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP] [1106:8305] (prog-if 00 [Normal decode])
        Flags: bus master, 66MHz, medium devsel, latency 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        Memory behind bridge: 80000000-80ffffff
        Prefetchable memory behind bridge: 88000000-8fffffff
        Capabilities: [80] Power Management version 2

00:05.0 Ethernet controller [0200]: Intel Corporation 82541PI Gigabit Ethernet Controller [8086:107c] (rev 05)
        Subsystem: Intel Corporation PRO/1000 GT Desktop Adapter [8086:1376]
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 5
        Memory at a0000000 (32-bit, non-prefetchable) [size=128K]
        Memory at a0100000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 1480 [size=64]
        [virtual] Expansion ROM at 0c000000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
        Capabilities: [e4] PCI-X non-bridge device
        Kernel driver in use: e1000

00:14.0 ISA bridge [0601]: VIA Technologies, Inc. VT82C686 [Apollo Super South] [1106:0686] (rev 22)
        Flags: bus master, stepping, medium devsel, latency 0
        Kernel driver in use: parport_pc

00:14.1 IDE interface [0101]: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE [1106:0571] (rev 10) (prog-if 8a [Master SecP PriP])
        Flags: bus master, medium devsel, latency 64
        [virtual] Memory at 000001f0 (32-bit, non-prefetchable) [size=8]
        [virtual] Memory at 000003f0 (type 3, non-prefetchable) [size=1]
        [virtual] Memory at 00000170 (32-bit, non-prefetchable) [size=8]
        [virtual] Memory at 00000370 (type 3, non-prefetchable) [size=1]
        I/O ports at 1440 [size=16]
        Capabilities: [c0] Power Management version 2
        Kernel driver in use: pata_via

00:14.2 USB controller [0c03]: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller [1106:3038] (rev 10) (prog-if 00 [UHCI])
        Subsystem: First International Computer, Inc. VA-502 Mainboard [0925:1234]
        Flags: bus master, medium devsel, latency 66, IRQ 11
        I/O ports at 1400 [size=32]
        Capabilities: [80] Power Management version 2
        Kernel driver in use: uhci_hcd

00:14.3 USB controller [0c03]: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller [1106:3038] (rev 10) (prog-if 00 [UHCI])
        Subsystem: First International Computer, Inc. VA-502 Mainboard [0925:1234]
        Flags: bus master, medium devsel, latency 66, IRQ 11
        I/O ports at 1420 [size=32]
        Capabilities: [80] Power Management version 2
        Kernel driver in use: uhci_hcd

00:14.4 Bridge [0680]: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] [1106:3057] (rev 30)
        Flags: medium devsel, IRQ 9
        Capabilities: [68] Power Management version 2

00:14.5 Multimedia audio controller [0401]: VIA Technologies, Inc. VT82C686 AC97 Audio Controller [1106:3058] (rev 20)
        Subsystem: Compaq Computer Corporation Device [0e11:003d]
        Flags: medium devsel, IRQ 10
        I/O ports at 1000 [size=256]
        I/O ports at 1450 [size=4]
        I/O ports at 1454 [size=4]
        Capabilities: [c0] Power Management version 2
        Kernel driver in use: snd_via82xx

01:00.0 VGA compatible controller [0300]: nVidia Corporation NV31 [GeForce FX 5600 Ultra] [10de:0311] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Chaintech Computer Co. Ltd Device [270f:1946]
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 3
        Memory at 80000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 88000000 (32-bit, prefetchable) [size=128M]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [60] Power Management version 2
        Capabilities: [44] AGP version 3.0
        Kernel driver in use: nouveau


> uname -a
Linux anduril 3.12.0-rc6 #2 SMP Sat Oct 19 16:47:44 CDT 2013 i686 AMD Athlon(tm) Processor AuthenticAMD GNU/Linux


> dmesg
...
[ 1786.573538] [drm] hdmi device  not found 1 0 1
[ 1786.573760] nouveau  [  DEVICE][0000:01:00.0] BOOT0  : 0x031100a1
[ 1786.573769] nouveau  [  DEVICE][0000:01:00.0] Chipset: NV31 (NV31)
[ 1786.573776] nouveau  [  DEVICE][0000:01:00.0] Family : NV30
[ 1786.575845] nouveau  [   VBIOS][0000:01:00.0] checking PRAMIN for image...
[ 1786.616891] nouveau  [   VBIOS][0000:01:00.0] ... appears to be valid
[ 1786.616900] nouveau  [   VBIOS][0000:01:00.0] using image from PRAMIN
[ 1786.616909] nouveau  [   VBIOS][0000:01:00.0] BMP version 5.27
[ 1786.617452] nouveau  [   VBIOS][0000:01:00.0] version 04.31.20.52.00
[ 1786.623690] nouveau W[  PTIMER][0000:01:00.0] unknown input clock freq
[ 1786.623718] nouveau  [     PFB][0000:01:00.0] RAM type: DDR1
[ 1786.623724] nouveau  [     PFB][0000:01:00.0] RAM size: 128 MiB
[ 1786.623731] nouveau  [     PFB][0000:01:00.0]    ZCOMP: 262144 tags
[ 1786.636667] nouveau  [  DEVICE][0000:01:00.0] Setting agp speed to 2X. Use agpmode to override.
[ 1786.636683] agpgart-via 0000:00:00.0: AGP 2.0 bridge
[ 1786.636715] agpgart-via 0000:00:00.0: putting AGP V2 device into 2x mode
[ 1786.636766] nouveau 0000:01:00.0: putting AGP V2 device into 2x mode
[ 1786.640057] [TTM] Zone  kernel: Available graphics memory: 92344 kiB
[ 1786.640064] [TTM] Initializing pool allocator
[ 1786.640118] nouveau  [     DRM] VRAM: 127 MiB
[ 1786.640123] nouveau  [     DRM] GART: 64 MiB
[ 1786.640135] nouveau  [     DRM] BMP version 5.39
[ 1786.640145] nouveau  [     DRM] DCB version 2.2
[ 1786.640155] nouveau  [     DRM] DCB outp 00: 01000300 00009c40
[ 1786.640163] nouveau  [     DRM] DCB outp 01: 02010310 00009c40
[ 1786.640169] nouveau  [     DRM] DCB outp 02: 01010312 00000000
[ 1786.640176] nouveau  [     DRM] DCB outp 03: 02020321 00000003
[ 1786.640562] nouveau  [     DRM] Loading NV17 power sequencing microcode
[ 1786.646548] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[ 1786.646560] [drm] No driver support for vblank timestamp query.
[ 1786.646580] nouveau  [     DRM] 0xE1B4: Parsing digital output script table
[ 1786.647246] nouveau  [     DRM] 0 available performance level(s)
[ 1786.647258] nouveau  [     DRM] c: core 234MHz memory 501MHz voltage 1220mV
[ 1786.655499] nouveau  [     DRM] MM: using M2MF for buffer copies
[ 1786.655538] nouveau  [     DRM] Setting dpms mode 3 on TV encoder (output 3)
[ 1786.751345] nouveau  [     DRM] allocated 1920x1200 fb: 0x9000, bo cab83400
[ 1786.751589] fbcon: nouveaufb (fb0) is primary device
[ 1786.781270] Console: switching to colour frame buffer device 240x75
[ 1786.784371] nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
[ 1786.784376] nouveau 0000:01:00.0: registered panic notifier
[ 1786.789898] [drm] Initialized nouveau 1.1.1 20120801 for 0000:01:00.0 on minor 0



I've had x11perf in a while-true loop for the better part of a day.  Seems solid, no issues to report.
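
For anyone repeating this kind of soak test, a bounded loop could be sketched as follows. The `soak` helper and its messages are hypothetical names used for illustration, not something from this report. Note that a GPU lockup would more likely hang the command than make it exit non-zero, so this only catches outright failures.

```shell
# Hypothetical soak-test helper (illustrative only, not from the report):
# rerun a command until it fails or the iteration limit is reached.
soak() {
    limit=$1; shift
    i=0
    while [ "$i" -lt "$limit" ]; do
        # Discard the command's output; stop at the first non-zero exit status.
        "$@" >/dev/null 2>&1 || { echo "failed on run $((i + 1))"; return 1; }
        i=$((i + 1))
    done
    echo "completed $i runs"
}

# Example, assuming x11perf is installed and DISPLAY points at the X server:
# soak 1000 x11perf -putimage500
```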

Comment 27 Ilia Mirkin 2014-02-05 07:20:18 UTC

The quirk went into 3.13-rc1 (commit fd34381b0e2827228cbda45aa2cca4127ff073b2).
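
A rough way to tell whether a given kernel release is new enough to carry the quirk is a version comparison against 3.13. This is only a sketch: `has_quirk_kernel` is a hypothetical helper, it assumes a `sort` that supports `-V` (GNU coreutils does), and it only makes sense for vanilla kernels, since a distribution or locally patched kernel may carry the quirk under an older version number.

```shell
# Hypothetical helper: succeeds when the release string in $1 is 3.13 or newer.
# Assumes a vanilla kernel and a sort(1) that supports -V (version sort).
has_quirk_kernel() {
    base=${1%%-*}   # drop suffixes such as "-rc6"
    # If 3.13 sorts first (or ties), the given release is at least 3.13.
    [ "$(printf '%s\n' 3.13 "$base" | sort -V | head -n 1)" = "3.13" ]
}

# Example: has_quirk_kernel "$(uname -r)" && echo "quirk should be present"
```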

Comment 28 Jason Detring 2014-02-09 00:23:34 UTC

This was tested for 24+ hours against a vanilla 3.13.1.  No problems found, so setting VERIFIED.  I'm happy to see this bug being put to rest.

Thank you for taking care of this issue for me!

Jason


[   18.213336] [drm] Initialized drm 1.1.0 20060810
[   18.463592] wmi: Mapper loaded
[   18.669773] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 3
[   18.669891] PCI: setting IRQ 3 as level-triggered
[   18.670951] [drm] hdmi device  not found 1 0 1
[   18.671233] nouveau  [  DEVICE][0000:01:00.0] BOOT0  : 0x031100a1
[   18.671326] nouveau  [  DEVICE][0000:01:00.0] Chipset: NV31 (NV31)
[   18.671416] nouveau  [  DEVICE][0000:01:00.0] Family : NV30
[   18.673632] nouveau  [   VBIOS][0000:01:00.0] checking PRAMIN for image...
[   18.714842] nouveau  [   VBIOS][0000:01:00.0] ... appears to be valid
[   18.714936] nouveau  [   VBIOS][0000:01:00.0] using image from PRAMIN
[   18.715029] nouveau  [   VBIOS][0000:01:00.0] BMP version 5.27
[   18.715527] nouveau  [   VBIOS][0000:01:00.0] version 04.31.20.52.00
[   18.716448] nouveau W[  PTIMER][0000:01:00.0] unknown input clock freq
[   18.716555] nouveau  [     PFB][0000:01:00.0] RAM type: DDR1
[   18.716645] nouveau  [     PFB][0000:01:00.0] RAM size: 128 MiB
[   18.716794] nouveau  [     PFB][0000:01:00.0]    ZCOMP: 262144 tags
[   18.724684] nouveau  [     CLK][0000:01:00.0] --:   
[   18.724853] nouveau  [  DEVICE][0000:01:00.0] Forcing agp mode to 2X. Use agpmode to override.
[   18.724994] agpgart-via 0000:00:00.0: AGP 2.0 bridge
[   18.725106] agpgart-via 0000:00:00.0: putting AGP V2 device into 2x mode
[   18.725243] nouveau 0000:01:00.0: putting AGP V2 device into 2x mode
[   18.725559] [TTM] Zone  kernel: Available graphics memory: 444174 kiB
[   18.725650] [TTM] Zone highmem: Available graphics memory: 513810 kiB
[   18.725739] [TTM] Initializing pool allocator
[   18.725836] [TTM] Initializing DMA pool allocator
[   18.725962] nouveau  [     DRM] VRAM: 127 MiB
[   18.726048] nouveau  [     DRM] GART: 64 MiB
[   18.726138] nouveau  [     DRM] BMP version 5.39
[   18.726228] nouveau  [     DRM] DCB version 2.2
[   18.726318] nouveau  [     DRM] DCB outp 00: 01000300 00009c40
[   18.726409] nouveau  [     DRM] DCB outp 01: 02010310 00009c40
[   18.726498] nouveau  [     DRM] DCB outp 02: 01010312 00000000
[   18.726588] nouveau  [     DRM] DCB outp 03: 02020321 00000003
[   18.727035] nouveau  [     DRM] Loading NV17 power sequencing microcode
[   18.728729] nouveau  [     DRM] Saving VGA fonts
[   18.773146] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[   18.773253] [drm] No driver support for vblank timestamp query.
[   18.773517] nouveau  [     DRM] 0xE176: Parsing digital output script table
[   18.778990] nouveau  [     DRM] MM: using M2MF for buffer copies
[   18.779120] nouveau  [     DRM] Setting dpms mode 3 on TV encoder (output 3)
[   18.864770] nouveau  [     DRM] allocated 1920x1200 fb: 0x9000, bo f5ae2e00
[   18.865359] fbcon: nouveaufb (fb0) is primary device
[   18.891030] Console: switching to colour frame buffer device 240x75
[   18.893942] nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
[   18.893967] nouveau 0000:01:00.0: registered panic notifier
[   18.894004] [drm] Initialized nouveau 1.1.1 20120801 for 0000:01:00.0 on minor 0
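
To check from a saved log whether the AGP speed limit was actually applied, one could search for the messages seen in the dmesg output in this report. The function name `agp_limit_logged` is hypothetical, and the pattern only covers the two wordings that appear here ("Setting agp speed to" and "Forcing agp mode to"); other kernel versions may print something different.

```shell
# Hypothetical helper: reads dmesg text on stdin and succeeds if nouveau
# logged that it limited the AGP rate. The pattern covers only the two
# message wordings that appear in this report.
agp_limit_logged() {
    grep -Eq 'nouveau .*(Forcing agp mode|Setting agp speed) to [0-9]+X'
}

# Example: dmesg | agp_limit_logged && echo "AGP rate was limited by the driver"
```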
