Summary: | Display freezes after login with kernel 3.11.0-rc5 on Cayman with dpm=1 | | |
---|---|---|---|
Product: | DRI | Reporter: | Alexandre Demers <alexandre.f.demers> |
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | Severity: | normal |
Priority: | medium | CC: | frederic.romagne, vmerlet |
Version: | XOrg git | Hardware: | Other |
OS: | All | | |
See Also: | https://bugs.freedesktop.org/show_bug.cgi?id=69721, https://bugs.freedesktop.org/show_bug.cgi?id=69723 | | |
Description
Alexandre Demers
2013-08-18 01:33:30 UTC
Created attachment 84186
errors when freeze happens
Errors logged from my last two attempts at booting and logging in with kernel 3.11.0-rc5 when dpm=1 with RADEON_va=1

I began bisecting tonight. rc2 already had this bug. More news to come before the weekend.

Kernel 3.11.0-rc1 was experiencing a bug, but not the one seen in rc2 and beyond. I'll dig into the "fix" that brought us to the state seen since rc2. If nothing can be found, I'll go up the drm-next branch that was included in rc1.

After bisecting in one direction, I've ended up with the following commit:

    f90555cbe629e14c6af1dcec1933a3833ecd321f is the first bad commit
    commit f90555cbe629e14c6af1dcec1933a3833ecd321f
    Author: Alex Deucher <alexander.deucher@amd.com>
    Date:   Wed Jul 17 16:34:12 2013 -0400

        drm/radeon/dpm/atom: fix broken gcc harder

        See bugs:
        https://bugs.freedesktop.org/show_bug.cgi?id=66932
        https://bugs.freedesktop.org/show_bug.cgi?id=66972
        https://bugs.freedesktop.org/show_bug.cgi?id=66945

        Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

    :040000 040000 c32ad9a80c5356236e935eeb5198683727b9d00d eb5aa1083eb33e7b9aebebdb310dda0399152e87 M drivers

Now, I must say this commit actually fixes a visual problem after commit 69e0b57 (which is a good commit over here without any known problem). So, I'll dig in the other direction to find which commit broke the known good state.

You might try this branch in case gcc is having problems with the variable sized arrays used in the driver:
http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.12-wip-gcc-fixes

(In reply to comment #5)
> You might try this branch in case gcc is having problems with the variable
> sized arrays used in the driver:
> http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.12-wip-gcc-fixes

Ok, I'll try it tonight. About the bisection I'm doing in the other direction (to find what broke the display), I should also be able to narrow it down tonight.

Hi Alex. I'm about to test your suggestion.
Meanwhile, I identified the original commit that broke the driver before it was fixed by f90555cbe629e14c6af1dcec1933a3833ecd321f (a fix that still ends with the display hanging, even though I can connect through ssh). So the first bad commit was:

    7ad8d0687bb5030c3328bc7229a3183ce179ab25 is the first bad commit
    commit 7ad8d0687bb5030c3328bc7229a3183ce179ab25
    Author: Alex Deucher <alexander.deucher@amd.com>
    Date:   Mon Jul 1 16:07:18 2013 -0400

        drm/radeon/dpm: re-enable state transitions for Cayman

        Was disabled due to stability issues on certain boards caused by the a bug in the parsing of the atom mc reg tables. That's fixed now so re-enable.

        Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

    :040000 040000 de8dfc2a15d5114e81636811d7e3b39c15fc515b d0e1ee828f10456d39e2ab30cc6598203e50fa6e M drivers

Heading for your suggestion right away with http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.12-wip-gcc-fixes.

Tested with http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.12-wip-gcc-fixes and it does exactly the same thing: it boots fine and shows the login screen. I can even log in if it doesn't hang right away. Then it will hang at some point (either at the login screen or after loading the desktop). It generally displays grey vertical bars.

Still the same with kernel 3.11.0. Tried with VM=0, aspm=0, disconnected my UPS (just in case it was something with a "battery" state or something similar), tried Gnome 3 and XFCE, all the same. The only thing working for now is to set dpm=0 or to force ret=1 in ni_dpm_set_power_state when checking what ni_restrict_performance_levels_before_switch answered. However, I don't know if the problem is with ni_dpm_set_power_state or with something executed after, so I'll play in there.

If ret=1 just after ni_restrict_performance_levels_before_switch(), ni_dpm_set_power_state() doesn't go any further and there is no hang.
So it seems the problem is not with ni_restrict_performance_levels_before_switch() itself but with a combination of some sort.

So, after getting out at different points from ni_dpm_set_power_state(), it seems I can go down to ni_power_control_set_level() without problem. However, if I move to the next call, which is ret = ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO), it hangs. Could it be because we are setting something wrong in auto performance level?

I'll be attaching my vbios just in case.

Created attachment 85157
Cayman 6950 XFX vbios
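The approach described above (returning early from ni_dpm_set_power_state() at successive points until the hang appears) can be sketched generically. This is a minimal standalone illustration, not driver code: `struct step`, `run_steps` and `step_ok` are hypothetical names, and a real hang would of course not return at all; the point is only that raising the limit by one per boot identifies the first step after which the freeze shows up.

```c
#include <stddef.h>

/* Hypothetical step type: returns 0 on success, non-zero on failure. */
typedef int (*step_fn)(void *ctx);

struct step {
    const char *name;   /* label to print when reporting progress */
    step_fn fn;
};

/* Example step used for illustration: counts how often it was called. */
static int step_ok(void *ctx)
{
    (*(int *)ctx)++;
    return 0;
}

/*
 * Run the steps in order but stop once `limit` steps have completed
 * (limit >= count runs the whole sequence).  Returns the number of steps
 * that completed; *failed receives the index of a failing step, or
 * `count` when no step reported an error.
 */
static size_t run_steps(const struct step *steps, size_t count,
                        size_t limit, void *ctx, size_t *failed)
{
    size_t done = 0;

    *failed = count;
    for (size_t i = 0; i < count && i < limit; i++) {
        if (steps[i].fn(ctx) != 0) {
            *failed = i;
            break;
        }
        done++;
    }
    return done;
}
```

Booting once per value of `limit` and noting the first value that freezes the display narrows the problem to a single step, which is how the thread arrives at the ni_dpm_force_performance_level() call.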
Is there anything else I can do to give a better idea of what is happening and why it crashes?

If this can be of any value, my 6950 is the following model: XFX HD-695X-ZNDC (1GB DDR5, 830MHz Core Clock and 5200MHz Memory Clock)

Created attachment 85578
disable various dpm features

I would suggest disabling various dpm features and seeing if you can narrow down which, if any, help. This patch disables just about everything.

ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO) is what actually sets the dynamic performance switching into motion. Prior to that the hw is locked into the low performance level. It sounds like there is some bad parameter that is causing a lock up when the smc enables state switching.

Separate from the patch, can you also try changing the ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO) call in ni_dpm_set_power_state() to low (RADEON_DPM_FORCED_LEVEL_LOW) or high (RADEON_DPM_FORCED_LEVEL_HIGH) rather than auto? See if you still get a lock up.

Another thing worth checking: what is the value of module_index passed to radeon_atom_init_mc_reg_table() in ni_initialize_mc_reg_table() in ni_dpm.c on your system?

(In reply to comment #14)
> Created attachment 85578
> disable various dpm features
>
> I would suggest disabling various dpm features and seeing if you can narrow
> down which, if any, help. This patch disables just about everything.
>
> ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO) is what
> actually sets the dynamic performance switching into motion. Prior to that
> the hw is locked into the low performance level. It sounds like there is
> some bad parameter that is causing a lock up when the smc enables state
> switching.
>
> Separate from the patch, can you also try changing the
> ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO) call in
> ni_dpm_set_power_state() to low (RADEON_DPM_FORCED_LEVEL_LOW) or high
> (RADEON_DPM_FORCED_LEVEL_HIGH) rather than auto? See if you still get a
> lock up.

I'll try it later today.

(In reply to comment #15)
> Another thing worth checking: what is the value of module_index passed to
> radeon_atom_init_mc_reg_table() in ni_initialize_mc_reg_table() in ni_dpm.c
> on your system?

How can I get it? Should I print it in dmesg?

(In reply to comment #17)
> (In reply to comment #15)
> > Another thing worth checking: what is the value of module_index passed to
> > radeon_atom_init_mc_reg_table() in ni_initialize_mc_reg_table() in ni_dpm.c
> > on your system?
>
> How can I get it? Should I print it in dmesg?

Yes, that would be great.

(In reply to comment #16)
> (In reply to comment #14)
> I'll try it later today.

So far I have had time to try forcing RADEON_DPM_FORCED_LEVEL_LOW and RADEON_DPM_FORCED_LEVEL_HIGH. The first one works fine, the second triggers the problem.
I'm about to play with the suggested patch.

Ok, if I apply the whole suggested patch but the following, it hangs:

    @@ -4152,14 +4152,14 @@ int ni_dpm_init(struct radeon_device *rdev)
             }
             ni_pi->mclk_rtt_mode_threshold = eg_pi->mclk_edc_wr_enable_threshold;

    -        pi->voltage_control =
    -                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);
    +        pi->voltage_control = false;
    +//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);

    -        pi->mvdd_control =
    -                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_MVDDC, 0);
    +        pi->mvdd_control = false;
    +//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_MVDDC, 0);

    -        eg_pi->vddci_control =
    -                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDCI, 0);
    +        eg_pi->vddci_control = false;
    +//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDCI, 0);

             rv770_get_engine_memory_ss(rdev);

I'll try to play with that a bit and I'll come back. I also still have to give you the module_index.

Adding printk(KERN_DEBUG "DEBUG: about to pass the following value of module_index to radeon_atom_init_mc_reg_table(): %d", module_index); just before calling radeon_atom_init_mc_reg_table() returns 2.
(In reply to comment #20)
> Ok, if I apply the whole suggested patch but the following, it hangs:
>
>     @@ -4152,14 +4152,14 @@ int ni_dpm_init(struct radeon_device *rdev)
>              }
>              ni_pi->mclk_rtt_mode_threshold = eg_pi->mclk_edc_wr_enable_threshold;
>
>     -        pi->voltage_control =
>     -                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);
>     +        pi->voltage_control = false;
>     +//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);
>
>     -        pi->mvdd_control =
>     -                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_MVDDC, 0);
>     +        pi->mvdd_control = false;
>     +//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_MVDDC, 0);
>
>     -        eg_pi->vddci_control =
>     -                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDCI, 0);
>     +        eg_pi->vddci_control = false;
>     +//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDCI, 0);
>
>              rv770_get_engine_memory_ss(rdev);

So does just applying this portion of the patch by itself fix the hang?

(In reply to comment #21)
> Adding printk(KERN_DEBUG "DEBUG: about to pass the following value of
> module_index to radeon_atom_init_mc_reg_table(): %d", module_index); just
> before calling radeon_atom_init_mc_reg_table() returns 2.

Ok, that looks good.
(In reply to comment #22)
> So does just applying this portion of the patch by itself fix the hang?

Applying just this returns an error when booting: ni_upload_sw_state failed, but obviously the system doesn't hang after that (though it can't change its performance state)

(In reply to comment #22)
> So does just applying this portion of the patch by itself fix the hang?

The only way I don't get a "ni_upload_sw_state failed" is by leaving pi->voltage_control = radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0); unchanged. However, I inevitably end up with a hang either at login or when my session is loading (although switching to a terminal before it hangs prevents any hang from happening as long as I stay in the terminal). If I patch that part of the code, I always get the "ni_upload_sw_state failed" error, thus not hanging but preventing any dpm. I can patch everything else or nothing at all (I tried different combinations) and it doesn't seem to change a thing about the hang.

Can you attach your dmesg with dpm enabled?

(In reply to comment #26)
> Can you attach your dmesg with dpm enabled?

Do you mean with the patch applied (total and/or problematic part left alone)?

(In reply to comment #27)
> (In reply to comment #26)
> > Can you attach your dmesg with dpm enabled?
>
> Do you mean with the patch applied (total and/or problematic part left
> alone)?

Doesn't matter. I just want to see the basic driver output and power state list.

Created attachment 85798
dpm=1 with partial patch applied on 3.11.0
dmesg output when dpm=1 with partial patch applied (deactivation of pretty much everything but one to pass ni_upload_sw_state)
If there were any fixes pushed in kernel 3.12-rc1, none changed anything.

Created attachment 85989
testing patch

Try this patch independent from any other patches. It forces the engine and memory clocks of all performance levels within a power state to the lowest level. If it works, then try commenting out either the sclk part or the mclk part and see if either helps. That should help us narrow down whether it's an mclk problem or an sclk problem.

(In reply to comment #31)
> Created attachment 85989
> testing patch
>
> Try this patch independent from any other patches. It forces the engine and
> memory clocks of all performance levels within a power state to the lowest
> level. If it works, then try commenting out either the sclk part or the
> mclk part and see if either helps. That should help us narrow down whether
> it's an mclk problem or an sclk problem.

Running with the patch works fine over a vanilla kernel 3.12-rc1. The following also works fine:

    // if (pl->sclk > 25000)
    //         pl->sclk = 25000;
    if (pl->mclk > 15000)
            pl->mclk = 15000;

Which means sclk is working properly. However, the opposite results in a blank screen before I can even get to the login screen. It seems mclk is the problematic part.

Created attachment 86111
testing patch - force mclk to high

Try this patch by itself. This patch will force the mclk to the highest for all performance levels. If it works, the issue is probably related to the changing of mclks; if not, then we are probably programming one of the mclk parameters wrong.

Created attachment 86112
testing patch - force mclk to high

Sorry, had some garbage in my tree. Use this one instead.

(In reply to comment #34)
> Created attachment 86112
> testing patch - force mclk to high
>
> Sorry, had some garbage in my tree. Use this one instead.

Tested and the screen ended up blank or frozen somewhere around the time Xorg and gdm are launched (tried twice). Before that, the console was displayed OK.

A test of my own:

    diff --git a/drivers/gpu/drm/radeon/ni_dpm.c b/drivers/gpu/drm/radeon/ni_dpm.c
    index f7b625c..c1875d2 100644
    --- a/drivers/gpu/drm/radeon/ni_dpm.c
    +++ b/drivers/gpu/drm/radeon/ni_dpm.c
    @@ -3952,10 +3952,14 @@ static void ni_parse_pplib_clock_info(struct radeon_device *rdev,
             pl->mclk = le16_to_cpu(clock_info->evergreen.usMemoryClockLow);
             pl->mclk |= clock_info->evergreen.ucMemoryClockHigh << 16;

    +        pl->mclk = 100000;
    +
             pl->vddc = le16_to_cpu(clock_info->evergreen.usVDDC);
             pl->vddci = le16_to_cpu(clock_info->evergreen.usVDDCI);
             pl->flags = le32_to_cpu(clock_info->evergreen.ulFlags);

    +        pl->vddci = 1150;
    +
             /* patch up vddc if necessary */
             if (pl->vddc == 0xff01) {
                     if (radeon_atom_get_max_vddc(rdev, 0, 0, &vddc) == 0)

This works. I haven't pushed higher yet.

Went to pl->mclk = 115000, runs fine.

Running with mclk at 120000. I went under Windows and launched GPU-Z. We should be able to reach 1300MHz. I've read that some Cayman cards were made to use a VDDCI between 1.15 and 1.16. I'm pretty sure I can reach stability at 130000 by raising VDDCI a bit.

Running with mclk at 125000. Should I continue to see what value I can reach?

Created attachment 86147
mclk debugging pll debugging output

Can you attach the dmesg output with this patch applied? I want to make sure the mclk parameters are being properly calculated for the 130000 mclk.

(In reply to comment #41)
> Created attachment 86147
> mclk debugging pll debugging output
>
> Can you attach the dmesg output with this patch applied? I want to make
> sure the mclk parameters are being properly calculated for the 130000 mclk.

I'll try it at home later today.

Created attachment 86168
dmesg with 86147
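For readers following the clock numbers in the experiments above, here is a small standalone sketch, not driver code. It assumes the clock values are expressed in units of 10 kHz, which is consistent with the thread equating an mclk of 130000 with the 1300MHz seen in GPU-Z. The `struct perf_level`, `combine_mclk`, `clk_to_mhz` and `cap_levels` names are hypothetical; `combine_mclk` mirrors how the test diff assembles `pl->mclk` from a 16-bit low part and an 8-bit high part, and `cap_levels` mirrors the sclk/mclk capping used in the testing patch (25000 and 15000 are the caps quoted in the thread).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one performance level; clocks in 10 kHz units. */
struct perf_level {
    uint32_t sclk;  /* engine clock */
    uint32_t mclk;  /* memory clock */
};

/* Rebuild a clock value from its 16-bit low part and 8-bit high part. */
static uint32_t combine_mclk(uint16_t low, uint8_t high)
{
    return (uint32_t)low | ((uint32_t)high << 16);
}

/* Convert a clock in 10 kHz units to MHz (integer division). */
static uint32_t clk_to_mhz(uint32_t clk_10khz)
{
    return clk_10khz / 100;
}

#define CAP_SCLK (1u << 0)
#define CAP_MCLK (1u << 1)

/*
 * Cap the selected clocks of every level.  Passing only CAP_MCLK leaves
 * sclk untouched, which corresponds to commenting out the sclk part of
 * the testing patch.
 */
static void cap_levels(struct perf_level *levels, size_t count,
                       unsigned int which,
                       uint32_t sclk_cap, uint32_t mclk_cap)
{
    for (size_t i = 0; i < count; i++) {
        if ((which & CAP_SCLK) && levels[i].sclk > sclk_cap)
            levels[i].sclk = sclk_cap;
        if ((which & CAP_MCLK) && levels[i].mclk > mclk_cap)
            levels[i].mclk = mclk_cap;
    }
}
```

Capping one clock at a time is what separates the two hypotheses in the thread: with only mclk capped the system was stable, while with only sclk capped the screen went blank, pointing at the memory clock.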
Created attachment 86296
patch 1/2

This patch set works around the issue by limiting the sclk and mclk to the highest levels listed in the clk/voltage dependency tables. I'll need to dig a bit more internally to try and figure out how to handle these clks properly.

Created attachment 86297
patch 2/2

Apply these two patches independent of any others.

It seems to allow the system to work properly. No crash with the patches on 3.11.0 (but another problem with 3.12-rc1, probably a new bug). I added a printk to show what the max values are. Here is what I get:

    [ 3.088984] : Hitting max values... max_sclk_vddc->80000, max_mclk_vddci->125000, max_mclk_vddc->125000

So, as it is, I'm unable to run at top speed (mem) if I understand correctly, right?

(In reply to comment #46)
> It seems to allow the system to work properly. No crash with the patches on
> 3.11.0 (but another problem with 3.12-rc1, probably a new bug). I added a
> printk to show what the max values are. Here is what I get:
> [ 3.088984] : Hitting max values... max_sclk_vddc->80000,
> max_mclk_vddci->125000, max_mclk_vddc->125000
>
> So, as it is, I'm unable to run at top speed (mem) if I understand
> correctly, right?

Right, it will limit you to the fastest clock in the voltage dependency tables until I sort out how I'm supposed to properly handle faster clocks.

OK, then with the last two patches on top of kernel 3.11.0, it works fine and I'm closing this bug. Should I open a new "bug" for the part about the faster clock and vddci?

Also, the bug I saw when testing patches with kernel 3.12-rc1 just happened with 3.11.0. The screen turns white and everything is frozen. I can't connect through ssh (without the patches, when the screen hung, I was able to connect through ssh). I can't find anything in the logs that could help identify what is going on.
I wasn't doing anything special, and I can start a game under Steam, where the GPU's fan speeds up (a sign the card is now running faster), without any problem. The computer can just sit idle and then freeze (with a white screen). I'm tempted to open a different bug, what do you think Alex?

Go ahead and open new bugs for those issues.
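The workaround described for patches 1/2 and 2/2 (limiting sclk and mclk to the highest clock listed in a clock/voltage dependency table) can be illustrated with a short standalone sketch. This is an assumption-laden illustration rather than the actual patch: `struct clk_volt_entry`, `table_max_clk` and `limit_clk` are hypothetical names, and treating an empty table as "no limit" is a choice made here, not something the thread states.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clock/voltage dependency entry; clk in 10 kHz units. */
struct clk_volt_entry {
    uint32_t clk;
    uint16_t v;     /* voltage; the values used with it are illustrative */
};

/* Highest clock listed in the table, or 0 when the table is empty. */
static uint32_t table_max_clk(const struct clk_volt_entry *entries,
                              size_t count)
{
    uint32_t max = 0;

    for (size_t i = 0; i < count; i++) {
        if (entries[i].clk > max)
            max = entries[i].clk;
    }
    return max;
}

/*
 * Limit a requested clock to the table maximum.  A maximum of 0 is
 * treated as "no table available", so the request passes through.
 */
static uint32_t limit_clk(uint32_t requested, uint32_t table_max)
{
    if (table_max != 0 && requested > table_max)
        return table_max;
    return requested;
}
```

With a table whose highest mclk entry is 125000, as in the printk output quoted above, a requested mclk of 130000 would be limited to 125000, which matches the observation that the card cannot run its memory at top speed with the workaround applied.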