Bug 30370

Summary: Nouveau module crashes with "divide error"
Product: xorg Reporter: aa_z9
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: critical    
Priority: medium CC: emil.l.velikov
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments:
Description | Flags
dmesg output with drm.debug=14 nouveau.reg_debug=0x0200 | none
debug patch | none
dmesg output with options: drm.debug=14 nouveau.reg_debug=0x0200 | none
Video card BIOS | none

Description aa_z9 2010-09-24 23:01:45 UTC
Created attachment 38947 [details]
dmesg output with drm.debug=14  nouveau.reg_debug=0x0200

The latest version of the nouveau module crashes on my PC with a divide error.

git://anongit.freedesktop.org/nouveau/linux-2.6
Commit: 9c4f93718e0ced5c2b028cb84159a063daa5d576
Date: Fri Sep 24 09:17:02 2010 +1000

From dmesg:
...
[    4.730824] divide error: 0000 [#1] SMP 
...
[    4.737222] Pid: 667, comm: modprobe Not tainted 2.6.36-rc5+ #4 Portable PC/Portable PC
[    4.737342] RIP: 0010:[<ffffffffa0211898>]  [<ffffffffa0211898>] nv50_pm_clock_get+0xa8/0xc0 [nouveau]
[    4.737560] RSP: 0018:ffff880139875a20  EFLAGS: 00010246
[    4.737660] RAX: 0000000000000000 RBX: ffff880138180000 RCX: 0000000000000000
[    4.737762] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffc90004704034
[    4.737864] RBP: ffff880139875a28 R08: 000000000003171d R09: 0000000000000000
[    4.737965] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
[    4.738067] R13: ffff880138180f68 R14: ffff880139875a78 R15: 0000000000000108
[    4.738170] FS:  00007ffdcdf4c700(0000) GS:ffff880001700000(0000) knlGS:0000000000000000
[    4.738290] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    4.738389] CR2: 000000000061d488 CR3: 0000000139130000 CR4: 00000000000006e0
[    4.738491] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.738594] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    4.738696] Process modprobe (pid: 667, threadinfo ffff880139874000, task ffff88013b652920)
[    4.738815] Stack:
[    4.738904]  ffff880139b92120 ffff880139875a58 ffffffffa01d8c71 0000000000000003
[    4.739207] <0> ffff880138180f68 ffff880138180000 ffff880139b92120 ffff880139875bb8
[    4.739669] <0> ffffffffa01d8ed3 ffff880100000028 ffff880139875ac8 342079726f6d656d
[    4.740005] Call Trace:
[    4.740005]  [<ffffffffa01d8c71>] nouveau_pm_perflvl_get+0x91/0xe0 [nouveau]
[    4.740005]  [<ffffffffa01d8ed3>] nouveau_pm_init+0x103/0x3a0 [nouveau]
[    4.740005]  [<ffffffffa01d28e2>] ? nouveau_bios_init+0x1a72/0x2800 [nouveau]
[    4.740005]  [<ffffffff81088a42>] ? __get_free_pages+0x12/0x50
[    4.740005]  [<ffffffff812a35c9>] ? mutex_lock+0x19/0x50
[    4.740005]  [<ffffffffa01ad21b>] nouveau_card_init+0x3bb/0x1390 [nouveau]
[    4.740005]  [<ffffffffa01ae739>] nouveau_load+0x429/0x700 [nouveau]
[    4.740005]  [<ffffffffa0143870>] drm_get_pci_dev+0x180/0x2a0 [drm]
[    4.740005]  [<ffffffffa0212ea0>] nouveau_pci_probe+0x10/0x17 [nouveau]
[    4.740005]  [<ffffffff81168fc2>] local_pci_probe+0x12/0x20
[    4.740005]  [<ffffffff81169fd0>] pci_device_probe+0x80/0xb0
[    4.740005]  [<ffffffff811de03a>] ? driver_sysfs_add+0x7a/0xb0
[    4.740005]  [<ffffffff811de179>] driver_probe_device+0x89/0x1a0
[    4.740005]  [<ffffffff811de323>] __driver_attach+0x93/0xa0
[    4.740005]  [<ffffffff811de290>] ? __driver_attach+0x0/0xa0
[    4.740005]  [<ffffffff811dd638>] bus_for_each_dev+0x68/0x90
[    4.740005]  [<ffffffff811ddfb9>] driver_attach+0x19/0x20
[    4.740005]  [<ffffffff811dd8f8>] bus_add_driver+0xb8/0x260
[    4.740005]  [<ffffffffa0233000>] ? nouveau_init+0x0/0x4a [nouveau]
[    4.740005]  [<ffffffff811de618>] driver_register+0x78/0x140
[    4.740005]  [<ffffffffa0233000>] ? nouveau_init+0x0/0x4a [nouveau]
[    4.740005]  [<ffffffff8116a241>] __pci_register_driver+0x51/0xd0
[    4.740005]  [<ffffffffa0143a5f>] drm_pci_init+0xcf/0xe0 [drm]
[    4.740005]  [<ffffffffa0233000>] ? nouveau_init+0x0/0x4a [nouveau]
[    4.740005]  [<ffffffffa013cba3>] drm_init+0x53/0x70 [drm]
[    4.740005]  [<ffffffffa0233048>] nouveau_init+0x48/0x4a [nouveau]
[    4.740005]  [<ffffffff810001de>] do_one_initcall+0x3e/0x170
[    4.740005]  [<ffffffff8106abaa>] sys_init_module+0xba/0x200
[    4.740005]  [<ffffffff8100246b>] system_call_fastpath+0x16/0x1b

Full dmesg output attached.
Comment 1 Marcin Slusarz 2010-09-25 04:42:04 UTC
Created attachment 38950 [details] [review]
debug patch

Please test this patch and attach the dmesg output you get with it applied.
The patch will prevent the crash and may help us figure out how to fix it properly.
Comment 2 aa_z9 2010-09-25 21:40:29 UTC
Created attachment 38957 [details]
dmesg output with options: drm.debug=14 nouveau.reg_debug=0x0200

Patch applied, new dmesg output attached.

Thanks for your help.
Comment 3 Marcin Slusarz 2010-09-26 05:12:47 UTC
OK, so we can't calculate the clock for PLL_UNK05, whose purpose is... unknown.

I guess we could apply this patch as is?
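
For readers following the thread, here is a minimal sketch of the kind of guard being discussed. It is not the attached debug patch and not the actual nouveau code: the helper name `pll_clock_khz` is hypothetical, and the simplified single-stage formula `clk = (refclk * N / M) >> P` is an assumption made for illustration. The point is only that a zero M divider, such as might be read back from a PLL the driver cannot decode, has to be rejected before the division.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the nouveau tree): compute a PLL output
 * clock in kHz from its coefficients, assuming the simplified form
 * clk = (refclk * N / M) >> P.
 * Returns 0 when the clock cannot be calculated.
 */
static uint32_t pll_clock_khz(uint32_t refclk_khz, uint32_t n,
                              uint32_t m, uint32_t p)
{
    uint64_t clk;

    /* The shift count must stay below the width of the intermediate. */
    assert(p < 64);

    /* Dividing by a zero M would trap with a divide error on x86. */
    if (m == 0)
        return 0;

    /* Use a 64-bit intermediate so refclk * N cannot overflow. */
    clk = (uint64_t)refclk_khz * n / m;
    return (uint32_t)(clk >> p);
}
```

With illustrative values, `pll_clock_khz(27000, 20, 3, 1)` yields 90000, while any call with `m == 0` yields 0 instead of faulting. Returning 0 for "unknown" is a design choice of this sketch; a caller could equally skip the PLL or report an error code.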
Comment 4 Emil Velikov 2010-09-26 10:57:37 UTC
The problem is a combination of two things:

1. The original nv50 and nv98 use a v2.1 pll_limits table, which has a default pll_limits entry (reg == 0).

Note: The same revision of the PLL table is used on a range of nv4x cards, where some PLLs are not present in the table, so the default entry's settings are used instead.

2. Different cards have different UNK PLLs that in some way need to be set up (their exact purpose is currently unknown, but the vbios has values for them that appear to be used).

The above patch is a quick and easy solution, although an actual check of whether the PLL has to be set up on the specific card should be implemented.
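
To illustrate the default-entry behaviour described in this comment, here is a rough sketch of a table lookup that falls back to the `reg == 0` entry. Everything in it is hypothetical: the `pll_entry` layout, the `find_pll_entry` name, and the `allow_default` switch are invented for illustration and do not reflect the real VBIOS table format or the nouveau parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one pll_limits table entry. */
struct pll_entry {
    uint32_t reg;        /* register the entry applies to; 0 marks the default entry */
    uint32_t refclk_khz; /* reference clock for this PLL */
};

/*
 * Look up the entry for `reg`. If there is no exact match, return the
 * default (reg == 0) entry when allow_default is non-zero, else NULL.
 */
static const struct pll_entry *
find_pll_entry(const struct pll_entry *table, size_t count,
               uint32_t reg, int allow_default)
{
    const struct pll_entry *fallback = NULL;

    assert(table != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (reg != 0 && table[i].reg == reg)
            return &table[i];          /* exact match wins */
        if (table[i].reg == 0 && fallback == NULL)
            fallback = &table[i];      /* remember the default entry */
    }
    return allow_default ? fallback : NULL;
}
```

In this sketch, a caller that wants the nv4x-style behaviour passes `allow_default = 1`, while a caller that must know whether a PLL really exists on the card passes 0 and treats NULL as "do not touch this PLL".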
Comment 5 Emil Velikov 2010-10-09 10:04:18 UTC
Another patch solving your issue has already been pushed to master, but it introduced a regression. Can you please test the latest git and tell us whether the issue still occurs? Also, so that we can better understand the PM procedure on your card, can you dump your vbios[1] (and possibly provide a PM dump[2]) and send it to us?

[1] http://nouveau.freedesktop.org/wiki/DumpingVideoBios
[2] http://nouveau.freedesktop.org/wiki/PowerManagementDumps
Comment 6 aa_z9 2010-10-12 03:59:25 UTC
Created attachment 39377 [details]
Video card BIOS

The latest git (4ec13442f9f9dfedb15473fdfc99fa71967ed48e) does not exhibit the issue. This is good news.

Dump of video card BIOS is attached.

I can try to get a PM dump if it will help you, but it will have to wait until I reconfigure the NVIDIA binary driver.
Comment 7 Lucas Stach 2011-02-15 01:58:17 UTC
@Emil: could you please comment on this? Do we still need something or should we close this bug as the reported problem is apparently fixed?
Comment 8 Emil Velikov 2011-02-15 02:21:35 UTC
Currently there is a patch that does "handle" this situation [1]. Unfortunately I was silly enough to assume a perfect PLL limits table, so on some cards it introduces a regression. It is mainly related to understanding/mapping the correct registers on all of the cards (thus my request for a power-management dump).
As the mappings can be quite extensive and are currently not fully known, I would opt out of my solution and prefer the one given by Marcin Slusarz [2].

[1] http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=eadc69cc9054594ff7860d407f855536af13af99
[2] https://bugs.freedesktop.org/attachment.cgi?id=38950
Comment 9 Emil Velikov 2012-11-01 17:19:54 UTC
The "regression" mentioned is that incorrect PLL entries/settings were mapped.
All of that has been resolved with the Linux 3.7 kernel rework.

Marking as Resolved/Fixed
