Bug 90276 - [NVE4] read fault error, bisected
Summary: [NVE4] read fault error, bisected
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
 
Reported: 2015-05-01 19:03 UTC by Arthur Heymans
Modified: 2016-05-29 00:19 UTC (History)


Attachments
Kernel Log (150.00 KB, text/plain)
2015-05-01 19:03 UTC, Arthur Heymans
kernel_log_3.17 (62.73 KB, text/plain)
2015-05-01 21:59 UTC, Arthur Heymans
bisect_log (2.81 KB, text/plain)
2015-05-02 12:45 UTC, Arthur Heymans
vbios.rom using linux v4.0 (156.05 KB, application/octet-stream)
2015-05-02 16:56 UTC, Arthur Heymans

Description Arthur Heymans 2015-05-01 19:03:13 UTC
Created attachment 115504 [details]
Kernel Log

When I launch the 3D game Xonotic or 0ad, I get a reliably reproducible crash: the screen freezes and the system can no longer be shut down over ssh.

How to reproduce:
Launch the game Xonotic


This has happened since at least Linux 3.18.
It does not happen on Linux 3.14.
 
lspci -v:
Flags: bus master, VGA palette snoop, 66MHz, medium devsel, latency 64
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=64
        Memory behind bridge: fe100000-fe1fffff

01:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 660 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 2843
        Flags: bus master, fast devsel, latency 0, IRQ 29
        Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
        Memory at f0000000 (64-bit, prefetchable) [size=128M]
        Memory at f8000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        Expansion ROM at fe000000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: nouveau
        Kernel modules: nouveau

mesa 10.5.4-1
libdrm 2.4.60-2
xf86-video-nouveau 1.0.11-3
linux 4.0.1
Comment 1 Ilia Mirkin 2015-05-01 19:09:19 UTC
The screen freeze is most likely unrelated to the error you quote, but is rather related to the PDISP errors earlier in the log.

Can you try bisecting? (My bet is the display rework in 3.16/17...)
Comment 2 Arthur Heymans 2015-05-01 19:34:01 UTC
Never done that but I'll try... (so might take a while)
Comment 3 Arthur Heymans 2015-05-01 21:59:06 UTC
Created attachment 115506 [details]
kernel_log_3.17

linux 3.17 fails to probe nouveau...
nouveau: probe of 0000:01:00.0 failed with error -12
Comment 4 Ilia Mirkin 2015-05-01 22:03:57 UTC
Probably because of your NvBIOS=PRAMIN thing
Comment 5 Arthur Heymans 2015-05-01 22:45:37 UTC
Every NvBios option gives me the same result
Comment 6 Ilia Mirkin 2015-05-01 22:46:05 UTC
How about removing it? :)
Comment 7 Arthur Heymans 2015-05-01 22:49:06 UTC
Same result :)
Comment 8 Arthur Heymans 2015-05-02 12:45:24 UTC
Created attachment 115513 [details]
bisect_log

found the patch that caused the PDISP errors
Comment 9 Ilia Mirkin 2015-05-02 15:46:47 UTC
Great, thanks for doing that. The commit you landed on certainly *seems* related to the whole DISP thing, which is good:

commit 7a014a872914a6bb5af8b67eba603f8546794ab9
Author: Ben Skeggs <bskeggs@redhat.com>
Date:   Fri May 16 14:36:15 2014 +1000

    drm/nouveau/disp: add internal representaion of output paths and connectors
    
    This will, at some point, be used to replace various bits and pieces of
    code doing direct bios parsing.  For now, it'll just be used for some
    DP improvements.
    
    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

Arthur, can you attach your VBIOS (cat /sys/kernel/debug/dri/0/vbios.rom) and tell us which connectors have monitors connected?

Ben, any ideas?
Comment 10 Arthur Heymans 2015-05-02 16:56:14 UTC
Created attachment 115514 [details]
vbios.rom using linux v4.0

xrandr:
Screen 0: minimum 320 x 200, current 3200 x 1080, maximum 8192 x 8192
DVI-I-1 connected 1280x1024+1920+0 (normal left inverted right x axis y axis) 338mm x 270mm
   1280x1024     60.02*+
   1152x864      75.00  
   1024x768      75.08    75.03    60.00  
   832x624       74.55  
   800x600       75.00    60.32  
   640x480       75.00    60.00  
   720x400       70.08  
DVI-D-1 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 510mm x 287mm
   1920x1080     60.00*+
   1280x1024     75.02    60.02  
   1152x864      75.00  
   1024x768      75.08    60.00  
   800x600       75.00    60.32  
   640x480       75.00    60.00  
   720x400       70.08  
HDMI-1 disconnected (normal left inverted right x axis y axis)
DP-1 disconnected (normal left inverted right x axis y axis)
Comment 11 Ilia Mirkin 2015-05-15 06:08:55 UTC
A suggestion from Ben:

In drivers/gpu/drm/nouveau/nv50_display.c:

-       if (show && nv_crtc->cursor.nvbo)
+       if (show && nv_crtc->base.enabled && nv_crtc->cursor.nvbo)
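
To make the intent of that one-line change concrete, here is a standalone sketch of the old and new conditions. The `mock_crtc` type and both function names are hypothetical stand-ins and not the kernel's definitions; only the field names `base.enabled` and `cursor.nvbo` come from the diff above. The assumption is that the added check keeps the cursor from being shown on a CRTC that is not enabled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-in for the driver's CRTC state (hypothetical, not the kernel struct). */
struct mock_crtc {
    struct { bool enabled; } base;   /* is the CRTC currently enabled? */
    struct { void *nvbo; } cursor;   /* cursor buffer object, NULL if none */
};

/* Condition before the suggested change: ignores whether the CRTC is enabled. */
bool cursor_should_show_old(const struct mock_crtc *crtc, bool show)
{
    return show && crtc->cursor.nvbo != NULL;
}

/* Condition after the suggested change: also requires an enabled CRTC. */
bool cursor_should_show_new(const struct mock_crtc *crtc, bool show)
{
    return show && crtc->base.enabled && crtc->cursor.nvbo != NULL;
}
```

With a cursor buffer present on a disabled CRTC, the old predicate returns true and the new one returns false; in every other case the two agree.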
Comment 12 Arthur Heymans 2015-05-15 10:23:53 UTC
I can confirm that this fixes the PDISP errors!

But I still get the original error (screen freeze) while launching some games:

nouveau E[   PFIFO][0000:01:00.0] read fault at 0x000a940000 [PTE] from CE2/GR_CE on channel 0x007f369000 [unknown]

Using
mesa 10.5.5-1
linux-git with this fix
libdrm 2.4.61-1
xf86-video-nouveau 1.0.11-3

Should I file a new bug for that issue?
Comment 13 Arthur Heymans 2015-05-16 13:58:13 UTC
I tried bisecting the read fault error.

Before commit 3d9e3921f4d77bcaeea913c48b894d1208f0cb06 there are no errors.
After that commit, modesetting fails. This remains the case until commit 13dfe1286d1ea1af4c9330b039c2316d0d92c484, with which modesetting works again.
The read fault error is present in this last version, so the cause is somewhere between those two commits.
Comment 14 Marcin Slusarz 2015-05-16 20:27:13 UTC
You can work around it by applying the contents of these commits:
79456e1a10d5f4e708822287ed0e97af469bf49b
d979ab975ecdb336ed4da77a808be813a293b59e
d7bda18c9102b65078c132fd7d7ffd835058f021
13dfe1286d1ea1af4c9330b039c2316d0d92c484
at each step of the bisection.

Of course it's possible that those 5 patches (4 above and 3d9e3921f4d77bcaeea913c48b894d1208f0cb06) are the culprit, so first check whether 3d9e3921f4d77bcaeea913c48b894d1208f0cb06 + above patches works.
Comment 15 Arthur Heymans 2015-05-16 23:22:27 UTC
3d9e3921f4d77bcaeea913c48b894d1208f0cb06 + those 4 patches has the read fault error!
Comment 16 Timothy Pearson 2015-09-04 05:40:54 UTC
(In reply to Arthur Heymans from comment #15)
> 3d9e3921f4d77bcaeea913c48b894d1208f0cb06 + those 4 patches has the read
> fault error!

Out of all of the changes in the aforementioned patches this one would seem to be most likely to cause the issue:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/diff/drivers/gpu/drm/nouveau/core/core/mm.c?id=d979ab975ecdb336ed4da77a808be813a293b59e

Any chance you can revert that change in isolation and see if the faults disappear?
Comment 17 Arthur Heymans 2015-09-06 22:35:09 UTC
Reverting 3d9e3921f4d77bcaeea913c48b894d1208f0cb06 solves the problem.

However, c39f472e9f14e49a9bc091977ced0ec45fc00c57 changes some names, so I don't know what to do for recent kernels.
Comment 18 Ilia Mirkin 2015-09-06 22:41:22 UTC
(In reply to Arthur Heymans from comment #17)
> Reverting 3d9e3921f4d77bcaeea913c48b894d1208f0cb06 solves the problem.
> 
> However, c39f472e9f14e49a9bc091977ced0ec45fc00c57 changes some names, so I
> don't know what to do for recent kernels.

The code in question is still here:

http://cgit.freedesktop.org/~darktama/nouveau/tree/drm/nouveau/nvkm/subdev/fb/ramgf100.c#n600

It's surprising that reverting that commit helps... it fixed issues for people with funny memory partitioning IIRC.
Comment 19 Arthur Heymans 2015-09-06 23:28:35 UTC
OK, there are still crashes, but they tend to happen less easily and less quickly. Applications that once produced reproducible instant crashes now crash with the same errors after ~30 min or more. (This also works on more recent kernels.)
Comment 20 Emil Velikov 2015-09-09 10:46:44 UTC
(In reply to Arthur Heymans from comment #17)
> Reverting 3d9e3921f4d77bcaeea913c48b894d1208f0cb06 solves the problem.
> 
> However, c39f472e9f14e49a9bc091977ced0ec45fc00c57 changes some names, so I
> don't know what to do for recent kernels.

One possible issue (albeit unlikely) is that due to u32 maths the extra multiplication (combined with << 8) is causing an overflow. Fwiw latest upstream is explicitly using u64 typed variables.
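
The overflow concern can be illustrated in isolation. The sketch below is not the code from ramgf100.c; the function names, parameter names, and operand values are hypothetical, chosen only to show how a multiplication followed by `<< 8` wraps in 32-bit arithmetic while the same expression widened to 64 bits does not.

```c
#include <stdint.h>

/* Hypothetical 32-bit version: the product and the shift are both done in
 * uint32_t, so any bits above bit 31 are silently discarded (wraps mod 2^32). */
uint64_t scaled_size_u32(uint32_t units, uint32_t parts)
{
    uint32_t product = units * parts;   /* may already wrap here */
    uint32_t shifted = product << 8;    /* top 8 bits of product are lost */
    return shifted;
}

/* Hypothetical 64-bit version: widen before multiplying so that neither the
 * product nor the shift can lose bits for 32-bit inputs of this magnitude. */
uint64_t scaled_size_u64(uint32_t units, uint32_t parts)
{
    return ((uint64_t)units * parts) << 8;
}
```

For small inputs the two agree, for example `units = 1`, `parts = 2` gives 512 in both. With `units = 0x00800000` and `parts = 4`, the true result is 2^33; the 32-bit version wraps to 0 while the 64-bit version returns `0x200000000`.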
Comment 21 Arthur Heymans 2015-09-09 15:11:33 UTC
Well, freezes also happen on versions before that particular patch. They are less common and I have not found a way to make them reproducible (similar to reverting that patch on a recent kernel).
Comment 22 Ilia Mirkin 2015-10-22 03:31:01 UTC
(In reply to Arthur Heymans from comment #21)
> Well, freezes also happen on versions before that particular patch. They are
> less common and I have not found a way to make them reproducible (similar to
> reverting that patch on a recent kernel).

Can you see if you still have issues with mesa 11.0.3 and a regular (and recent) upstream kernel? It could be that you were just getting lucky/unlucky with the other kernel changes.
Comment 23 Arthur Heymans 2015-10-22 15:14:03 UTC
Using
linux 4.2.3
mesa 11.0.3
I still frequently get errors like
nouveau E[   PFIFO][0000:03:00.0] read fault at 0x0011990000 [UNSUPPORTED_KIND] from CE2/GR_CE on channel 0x007f121000 [unknown]
which freeze the display.
So I would say nothing really changed.
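
For comparing fault lines across kernels (for example `[PTE]` earlier in this report versus `[UNSUPPORTED_KIND]` here), a small parser may help. This is only a sketch: the line layout is assumed from the messages quoted in this report and may differ on other kernel versions, and `struct fault_info` and `parse_fault_line` are hypothetical names, not part of nouveau.

```c
#include <stdio.h>
#include <string.h>

/* Fields of one PFIFO fault line; names are illustrative, not nouveau's. */
struct fault_info {
    unsigned long long addr;     /* faulting address as printed in the log */
    char reason[32];             /* e.g. "PTE" or "UNSUPPORTED_KIND" */
    char engine[32];             /* e.g. "CE2/GR_CE" */
    unsigned long long channel;  /* channel value as printed in the log */
};

/* Returns 1 on success, 0 if the line does not match the assumed layout. */
int parse_fault_line(const char *line, struct fault_info *out)
{
    const char *p = strstr(line, "fault at ");
    if (p == NULL)
        return 0;

    /* %llx accepts the leading "0x"; %31[^]] reads up to the closing ']'. */
    int n = sscanf(p, "fault at %llx [%31[^]]] from %31s on channel %llx",
                   &out->addr, out->reason, out->engine, &out->channel);
    return n == 4;
}
```

Feeding it each matching line from a saved kernel log gives the address, reason, engine, and channel as separate fields, which makes it easier to see whether the reason or engine changes between runs.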
Comment 24 abi 2015-12-13 20:40:14 UTC
I have exactly the same problem: the screen freezes, the desktop cannot be shut down remotely, and the logs show read fault at 0x00bca00000 [UNSUPPORTED_KIND] from CE2/GR_CE on channel 0x007f6ef000 [unknown].

4.2.5-1-ARCH

mesa 11.0.7-1
libdrm 2.4.65-1
xf86-video-nouveau 1.0.11+31+g1ff13a9-1

04:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 770] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Device 8465
	Physical Slot: 4
	Flags: bus master, fast devsel, latency 0, IRQ 70
	Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
	Memory at f0000000 (64-bit, prefetchable) [size=128M]
	Memory at f8000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at fb000000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Kernel driver in use: nouveau
	Kernel modules: nouveau

The game that triggers the issue is Wasteland 2 (native, x64). This is the only game I have, so I can't extrapolate.
Comment 25 Lucas Ribeiro 2016-05-29 00:19:22 UTC
I suggest trying again with kernel 4.6; all freezes have stopped here on a GTX 660 Ti (NVE4).

