Bug 98016

Summary: [bisected] Fury fails to boot on drm-next-4.9
Product: DRI Reporter: Ernst Sjöstrand <ernstp>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium    
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
Patch that fixes the issue. | none
possible fix | none

Description Ernst Sjöstrand 2016-10-02 13:31:59 UTC
Haven't been able to boot agd5f kernels for a while, did a big bisect today.
It came up with this commit:

commit 6f0359ff73076483902de0c17f9649bf55651e2a
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Wed Aug 24 17:15:33 2016 -0400

    drm/amdgpu/vce3: add support for third vce ring
    
    Not of much use at the moment (we don't really use
    the second ring either), but may be useful later.
    
    Reviewed-by: JimQu <Jim.Qu@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

[    16.822] (--) AMDGPU(0): Chipset: "FIJI" (ChipID = 0x7300)
Comment 1 Ernst Sjöstrand 2016-10-02 13:55:28 UTC
The patch reverts cleanly on drm-next-4.9-wip and makes that branch bootable also.
Comment 2 Ernst Sjöstrand 2016-10-03 20:08:25 UTC
Wait, now I'm not so sure about the bisect.
Could there be a cold/warm boot thing playing tricks with me?
Now I couldn't boot even with the patch reverted. But if I reboot from
a stable kernel quickly I can boot...
Comment 3 Ernst Sjöstrand 2016-10-03 20:51:30 UTC
This seems to be very hard to reproduce. Perhaps it's related to the kernel I'm rebooting from or some complex combination.
Pretty hard to debug when the computer simply stops responding without any text etc.
Comment 4 Andy Furniss 2016-10-04 09:18:47 UTC
(In reply to Ernst Sjöstrand from comment #3)
> This seems to be very hard to reproduce. Perhaps it's related to the kernel
> I'm rebooting from or some complex combination.
> Pretty hard to debug when the computer simply stops responding without any
> text etc.

Yea, it's tricky sometimes.

For my (I guess different) tonga start up issue I had to power down "properly" between bisect steps. By properly I mean shutdown/halt then cut the power to the PSU and wait a couple of minutes before powering up again.
Comment 5 Andy Furniss 2016-10-04 09:30:07 UTC
Also the way 4.9-wip changes and things get merged may not make bisecting a long way back that easy compared to something like mesa.

I notice it's changed again 17 hours ago - it may even work now without you being able to see any sign of a revert or fix in the history.

Maybe testing other branches (after power off of course) would give you a better idea where you are.
Comment 6 Ernst Sjöstrand 2016-10-05 07:08:30 UTC
Do you think I can get both false positives and false negatives?
Have tested a bit more but still behaving in a silly way...
Comment 7 Andy Furniss 2016-10-05 09:18:03 UTC
(In reply to Ernst Sjöstrand from comment #6)
> Do you think I can get both false positives and false negatives?
> Have tested a bit more but still behaving in a silly way...

I guess your issue is not as clear cut as mine then.

If you find that you can sometimes boot and sometimes not on the same kernel after power off, then that's going to be quite a pain and time consuming to find.

It's not really normal to have to power off - I haven't needed to for ages, just that with my recent boot up issue I knew I did need to, as when I first built it, it worked from a reboot, but not the next day from off. I thought your issue may be similar, but apparently not.
Comment 8 Ernst Sjöstrand 2016-10-10 10:35:10 UTC
I have never been able to reproduce the problem with self-compiled 4.8-rc8 from that workspace so I think my local builds are working at least.
The commit "drm/amd/amdgpu: enable clockgating only after late init"
sounded like it could be similar to my problem, but it didn't help.
Comment 9 Ernst Sjöstrand 2016-10-14 11:39:20 UTC
I know what may have been fooling me.
I run sudo make install on Ubuntu. It adds a .old entry in grub.
However, that .old entry doesn't get a .old initrd of its own (the initrd is where the amdgpu kernel module lives); it shares the initrd with the newly installed kernel.
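A minimal sketch of how the shared-initrd situation described above could be spotted, assuming kernel images are named vmlinuz-<version> and initrds initrd.img-<version> in the boot directory. The function name kernels_sharing_initrd is hypothetical, and the naming scheme should be checked against the actual /boot contents before relying on it.

```python
from pathlib import Path


def kernels_sharing_initrd(boot_dir):
    """Return names of '.old' kernel images that have no initrd of their own.

    Assumption: images are named vmlinuz-<version>[.old] and initrds
    initrd.img-<version>[.old]; adjust the patterns if your layout differs.
    """
    boot = Path(boot_dir)
    shared = []
    for kernel in sorted(boot.glob("vmlinuz-*.old")):
        version = kernel.name[len("vmlinuz-"):]            # e.g. "4.9.0.old"
        own_initrd = boot / f"initrd.img-{version}"
        base_initrd = boot / f"initrd.img-{version[:-len('.old')]}"
        # The .old kernel has no initrd of its own, but one exists for the
        # base version, so the two kernels would share modules from it.
        if not own_initrd.exists() and base_initrd.exists():
            shared.append(kernel.name)
    return shared
```

Calling kernels_sharing_initrd("/boot") before a bisect step would list any .old kernels whose amdgpu module actually comes from the newest build, which is one way a "good" or "bad" result could be attributed to the wrong kernel.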
Comment 10 Ernst Sjöstrand 2016-10-14 14:38:43 UTC
Created attachment 127299
Patch that fixes the issue.
Comment 11 Alex Deucher 2016-10-14 14:50:41 UTC
Odd.  We haven't seen any failures internally with Fiji and 3 VCE rings enabled.  Can you provide a log of the failure?  Try manually loading amdgpu after boot.  E.g., append modprobe.blacklist=amdgpu to the kernel command line and boot to a non-X runlevel, then manually modprobe amdgpu and capture the dmesg output. Please attach the dmesg output from a successful boot as well.   If you have remote access via ssh, that would make it easier.  What VCE firmware are you using?  Can you try the latest version from git and see if that helps:
https://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git
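The manual-load procedure requested above could be scripted roughly as follows. This is only a sketch: the helper names is_blacklisted and load_and_capture and the log file name are hypothetical, and it assumes it is run as root from a non-X runlevel. It first checks that the module really is listed in modprobe.blacklist= on the kernel command line, then loads it and saves the dmesg output.

```python
import subprocess


def is_blacklisted(cmdline, module):
    """True if `module` appears in a modprobe.blacklist= kernel parameter."""
    for token in cmdline.split():
        if token.startswith("modprobe.blacklist="):
            names = token.split("=", 1)[1].split(",")
            if module in names:
                return True
    return False


def load_and_capture(module="amdgpu", log_path="dmesg-after-modprobe.txt"):
    """Load `module` by hand and save the kernel log (requires root)."""
    with open("/proc/cmdline") as f:
        if not is_blacklisted(f.read(), module):
            raise RuntimeError(f"{module} is not blacklisted on this boot")
    subprocess.run(["modprobe", module], check=True)
    log = subprocess.run(["dmesg"], check=True,
                         capture_output=True, text=True).stdout
    with open(log_path, "w") as f:
        f.write(log)
    return log_path
```

If the machine hangs hard during modprobe, nothing written afterwards will survive, which is why following the kernel log from a second machine over ssh is the more robust way to get the failure output.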
Comment 12 Ernst Sjöstrand 2016-10-14 23:02:34 UTC
Couldn't get anything after modprobe, the computer really dies.
However updating the firmware fixed it.

Should there be some kind of check or do you consider this fixed?
Comment 13 Ernst Sjöstrand 2016-10-14 23:03:23 UTC
I was running the Ubuntu firmware from Xenial, the firmware bundled in Wily fixed it.
Comment 14 Alex Deucher 2016-10-16 18:00:43 UTC
(In reply to Ernst Sjöstrand from comment #12)
> Couldn't get anything after modprobe, the computer really dies.
> However updating the firmware fixed it.
> 
> Should there be some kind of check or do you consider this fixed?

I'll add a check.  Thanks for confirming.
Comment 15 Alex Deucher 2016-10-25 13:41:02 UTC
Created attachment 127542
possible fix

Does this patch fix the issue?
