Bug 69723

Summary:

GPU lockups with kernel 3.11.0 / 3.12-rc1 when dpm=1 on r600g (Cayman)

Product:

DRI

Reporter:

Alexandre Demers <alexandre.f.demers>

Component:

DRM/Radeon

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

Severity:

critical

Priority:

medium

CC:

alexdeucher, g02maran, klavkalashj, marc, perry3d, vmerlet

Version:

XOrg git

Hardware:

All

OS:

All

See Also:

https://bugs.freedesktop.org/show_bug.cgi?id=68235

Attachments:

Description	Flags
disable various dpm features	none
dmesg with 3.12.0-rc3	none
A small simplification to low state adjustment	none
journalctl from last couple of boot/hang cycles	none
GPU-Z Cayman log playing a Youtube video	none
A second GPU-Z log, this time pushing a bit more	none
possible fix	none
possible fix	none
possible fix	none

Description Alexandre Demers 2013-09-23 16:36:00 UTC

In bug 68235, the display was freezing with kernel 3.11.0 (also tested with 3.12-rc1) either at logon or shortly after. It was possible to access the machine through ssh. With the patches from that bug applied, the max clocks are detected, allowing the computer to run normally.

However, at some point it always ends up frozen, even when it is just sitting idle with nothing being done on the computer. The frozen state is deeper than before, since it is now impossible to access the machine through ssh. I've tested the computer by playing Dota2 and, so far, the hang doesn't seem related to how hard the card is working.

Logs don't show anything for the moment; I'll have to push the debug level a bit further.

Any suggestions on where to start to get a clue about what is triggering the freeze?

Comment 1 Alex Deucher 2013-09-23 23:20:14 UTC

Created attachment 86424
disable various dpm features

Try this patch and see if you can narrow down which, if any, of these features is problematic.

Comment 2 Alexandre Demers 2013-09-25 01:05:27 UTC

It doesn't seem to be one of these, it still happens. While it is not always related to a change in power state, I do experience it more often when I'm launching games or just after I hear the fan accelerate.

Comment 3 Alexandre Demers 2013-09-25 01:57:36 UTC

I'll try playing with patch 85578, which was the same thing as this patch with a bit more. I'm pretty sure we are hitting something wrong again with mclk, but in a different case than the one fixed in bug 68235.

Comment 4 Alexandre Demers 2013-09-26 17:08:21 UTC

Definitely something about sclk or mclk and the voltages, but I haven't had the time to dig deeper for now. I've added some printk calls to be sure everything was being maxed as it is supposed to be. Could it be a combination of frequency and voltage?

Comment 5 Alex Deucher 2013-09-26 17:27:41 UTC

Take a look at ni_apply_state_adjust_rules() to see how the power state is adjusted based on various factors.

Comment 6 Alexandre Demers 2013-09-26 17:47:15 UTC

(In reply to comment #5)
> Take a look at ni_apply_state_adjust_rules() to see how the power state is
> adjusted based on various factors.

OK, but I may have to wait until this weekend or the beginning of next week to do so; I may not have any spare time until then.

Comment 7 Alexandre Demers 2013-09-29 01:29:16 UTC

Alex, using debugfs, should I see the maxed values (sclk and mclk) or the theoretical values from the table? For now, even though I have confirmation in dmesg that the values were maxed out, looking in debugfs gives me:
uvd    vclk: 0 dclk: 0
power level 2    sclk: 83000 mclk: 130000 vddc: 1060 vddci: 1150
(it should be sclk: 80000 mclk: 125000...)

Comment 8 Alex Deucher 2013-09-30 05:37:20 UTC

(In reply to comment #7)
> Alex, using debugfs, should I see the maxed values (sclk and mclk) or the
> theorical value from the table? For now, even if I have confirmation in
> dmesg the values were maxed out, looking in debugfs gives me:
> uvd    vclk: 0 dclk: 0
> power level 2    sclk: 83000 mclk: 130000 vddc: 1060 vddci: 1150
> (it should be sclk: 80000 mclk: 125000...)

IIRC, it references the unpatched power state so you'll see the unpatched state.  The hw just returns an index which is used to look up the state attributes.

Comment 9 Alexandre Demers 2013-10-01 03:09:08 UTC

Created attachment 86889
dmesg with 3.12.0-rc3

Comment 10 Alexandre Demers 2013-10-01 03:15:56 UTC

Just to be sure: vddc is associated only to sclk and vddci to mclk, right?

Also, how are a new frequency and a new voltage applied to the card? Are they applied simultaneously or sequentially? In the second case, we must be sure to raise the voltage before the frequency when pushing performance up, and to lower the frequency before the voltage when slowing down.
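
The ordering rule described here can be sketched in a few lines of C. This is only an illustration of the rule itself, assuming each level's voltage is sufficient for its own clock; the names (`struct level`, `plan_transition`, `SET_VOLTAGE`, `SET_CLOCK`) are hypothetical and not taken from the radeon driver, and it says nothing about how the hardware actually sequences a switch.

```c
/* Hypothetical step kinds; not part of the radeon driver. */
enum step_kind { SET_VOLTAGE, SET_CLOCK };

struct step {
	enum step_kind kind;
	unsigned int value;
};

struct level {
	unsigned int clk;  /* clock, in whatever unit the table uses */
	unsigned int volt; /* voltage, in whatever unit the table uses */
};

/*
 * Fill out[0] and out[1] with the order in which the target level
 * should be applied so that the clock never runs on a voltage lower
 * than the one its level asks for:
 *  - voltage going up (or unchanged): set voltage first, then clock
 *  - voltage going down: set clock first, then voltage
 */
static void plan_transition(const struct level *cur,
			    const struct level *target,
			    struct step out[2])
{
	struct step v = { SET_VOLTAGE, target->volt };
	struct step c = { SET_CLOCK, target->clk };

	if (target->volt >= cur->volt) {
		out[0] = v;
		out[1] = c;
	} else {
		out[0] = c;
		out[1] = v;
	}
}
```

With the low (sclk 50000, vddc 1000) and high (sclk 83000, vddc 1060) values seen in this report, going from low to high yields voltage then clock, and going from high to low yields clock then voltage.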

Comment 11 Alex Deucher 2013-10-01 13:19:25 UTC

(In reply to comment #10)
> Just to be sure: vddc is associated only to sclk and vddci to mclk, right?
> 

Not exactly.  Mclk is tied to vddci (memory interface voltage), but both mclk and sclk (and the core display clock) are tied to vddc (core voltage).

> Also, how are a new freq and a new voltage applied to the card? Are they
> applied simultanously or sequentially? In the second case, we must be sure
> to raise voltage before frequency when pushing the performances up, while we
> should low the frequency before lowering the voltage when we are slowing
> down.

The actual adjustments are done by a microcontroller on the GPU.  You pass a set of structures defining the performance levels within the power state to the microcontroller and the microcontroller handles the switching.  It takes into account all of the ordering and chip state dependencies.

Comment 12 Alexandre Demers 2013-10-01 13:32:26 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > Just to be sure: vddc is associated only to sclk and vddci to mclk, right?
> > 
> 
> Not exactly.  Mclk is tied to vddci (memory interface voltage), but both
> mclk and sclk (and the core display clock) are tied to vddc (core voltage).
> 
> > Also, how are a new freq and a new voltage applied to the card? Are they
> > applied simultanously or sequentially? In the second case, we must be sure
> > to raise voltage before frequency when pushing the performances up, while we
> > should low the frequency before lowering the voltage when we are slowing
> > down.
> 
> The actual adjustments are done by a microcontroller on the GPU.  You pass a
> set of structures defining the performance levels within the power state to
> the microcontroller and the microcontroller handles the switching.  It takes
> into account all of the ordering and chip state dependencies.

I was asking just in case there was manual control over the process and I was in a situation where the card was too close to its limits.

I changed a little something in the code yesterday and I was lucky enough to not have any hangs. I just want to be sure it is because of this little change I've made and not some obscure alignment of the planets. I'll test it further today and I'll let you know.

Comment 13 Alexandre Demers 2013-10-01 17:42:18 UTC

(In reply to comment #12)

Well, it was only luck it seems... I'll send a patch though, since it simplifies a couple of lines.

I'll have to continue digging.

Comment 14 Alexandre Demers 2013-10-02 06:08:36 UTC

Created attachment 86945
A small simplification to low state adjustment

This doesn't solve the problem, but it simplifies the low state adjustment a bit. Please review and commit if good.

Comment 15 Alexandre Demers 2013-10-26 04:32:44 UTC

Still there with 3.12.0-rc6. If I just force the mclk to 125000 and vddci to 1150, I'm running fine. I don't know where to look anymore. Any other suggestions?

Comment 16 Alexandre Demers 2013-10-27 00:13:27 UTC

By the way, even with mclk and vddci set to fixed values, it eventually freezes. It takes longer, though. It happens when scrolling a window or focusing on a new one. The problem is, I can't seem to get any message when it happens.

The more I look at it, the more I think it is freezing somewhere that has nothing to do with the mclk or vddci values. It could be a race condition or we could be trying to use something that was already released. A register or an uninitialized variable maybe. What do you think?

Comment 17 Alexandre Demers 2013-10-27 00:39:24 UTC

Here is the last thing I have before freezing and just after I rebooted (from journal with systemd, no more dmesg). As you'll see, there is nothing there. I'll look into other logs just in case.


Oct 26 19:57:08 Xander kernel: kworker/u16:1 (18455) used greatest stack depth: 3472 bytes left
Oct 26 19:58:13 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 24336
Oct 26 19:58:18 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 24352
Oct 26 19:58:23 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 24400
Oct 26 19:58:33 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 24432
Oct 26 19:58:36 Xander dbus-daemon[2858]: dbus[2858]: [system] Activating via systemd: service name='org.freedesktop.ModemMa
Oct 26 19:58:36 Xander dbus-daemon[2858]: dbus[2858]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.
Oct 26 19:58:36 Xander dbus[2858]: [system] Activating via systemd: service name='org.freedesktop.ModemManager1' unit='dbus-
Oct 26 19:58:36 Xander dbus[2858]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.ModemManager1.servi
Oct 26 19:58:43 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 24448
Oct 26 19:59:18 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 25936
Oct 26 19:59:43 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 26112
Oct 26 20:00:18 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 26624
Oct 26 20:00:36 Xander dbus-daemon[2858]: dbus[2858]: [system] Activating via systemd: service name='org.freedesktop.ModemMa
Oct 26 20:00:36 Xander dbus[2858]: [system] Activating via systemd: service name='org.freedesktop.ModemManager1' unit='dbus-
Oct 26 20:00:36 Xander dbus[2858]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.ModemManager1.servi
Oct 26 20:00:36 Xander dbus-daemon[2858]: dbus[2858]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.
Oct 26 20:00:46 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 26848
Oct 26 20:00:56 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 26944
Oct 26 20:01:01 Xander crond[22879]: pam_unix(crond:session): session opened for user root by (uid=0)
Oct 26 20:01:01 Xander CROND[22880]: (root) CMD (run-parts /etc/cron.hourly)
Oct 26 20:01:01 Xander CROND[22879]: pam_unix(crond:session): session closed for user root
Oct 26 20:01:03 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 27040
Oct 26 20:01:13 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 27120
Oct 26 20:01:33 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 27296
Oct 26 20:01:43 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 27360
Oct 26 20:02:03 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 29760
Oct 26 20:02:36 Xander dbus-daemon[2858]: dbus[2858]: [system] Activating via systemd: service name='org.freedesktop.ModemMa
Oct 26 20:02:36 Xander dbus[2858]: [system] Activating via systemd: service name='org.freedesktop.ModemManager1' unit='dbus-
Oct 26 20:02:36 Xander dbus[2858]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.ModemManager1.servi
Oct 26 20:02:36 Xander dbus-daemon[2858]: dbus[2858]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.
Oct 26 20:02:44 Xander kernel: r8712u 1-1:1.0 wlan0: r8712_got_addbareq_event_callback: mac = 96:0c:6d:f1:2e:b2, seq = 37984
-- Reboot --
Oct 26 20:19:34 Xander systemd-journal[1936]: Runtime journal is using 1.3M (max 395.4M, leaving 593.1M of free 3.8G, curren
Oct 26 20:19:34 Xander systemd-journal[1936]: Runtime journal is using 1.3M (max 395.4M, leaving 593.1M of free 3.8G, curren
Oct 26 20:19:34 Xander kernel: Initializing cgroup subsys cpuset
Oct 26 20:19:34 Xander kernel: Initializing cgroup subsys cpu
Oct 26 20:19:34 Xander kernel: Initializing cgroup subsys cpuacct
Oct 26 20:19:34 Xander kernel: Linux version 3.12.0-rc6-VANILLA-00284-ge6036c0-dirty (dema1701@Xander) (gcc version 4.8.2 (G
Oct 26 20:19:34 Xander kernel: Command line: BOOT_IMAGE=/vmlinuz-3.12.0-rc6-VANILLA-00284-ge6036c0-dirty root=UUID=04e7cc91-
Oct 26 20:19:34 Xander kernel: e820: BIOS-provided physical RAM map:

Comment 18 Alexandre Demers 2013-10-27 05:02:32 UTC

Could the kworker warning be a sign to look at?

Comment 19 Alexandre Demers 2013-11-07 20:23:22 UTC

Official kernel 3.12 still failing with latest mesa and drm. It always ends up with a white screen when it freezes and it is impossible to connect remotely by SSH.

Would it be possible to add some debug/trace option to find where the last recorded call was, or to print some key values/settings when they are changed/set? Maybe that could help narrow down where it hangs.

Comment 20 klavkalashj 2013-11-07 22:07:33 UTC

I was thinking, since DPM is to be enabled by default in 3.13, maybe the severity of this bug should be increased? I mean, lots of computers are going to hard freeze by default. I don't know much about these things, but if there's anything I can do to help solve this, please tell me.

Comment 21 Alex Deucher 2013-11-07 22:10:33 UTC

(In reply to comment #20)
> I was thinking, since DPM is to be enabled by default in 3.13, maybe the
> severity of this bug should be increased? I mean, lots of computers are
> going to hard freeze by default. I don't know much about these things, if
> there's anything I can do to help solving this, please tell.

dpm is only enabled on certain families by default in 3.13.  It is not enabled by default yet on cayman.

Comment 22 Alex Deucher 2013-11-07 22:14:19 UTC

(In reply to comment #19)
> Official kernel 3.12 still failing with latest mesa and drm. It always ends
> up with a white screen when it freezes and it is impossible to connect
> remotely by SSH.
> 
> Would it be possible to add some debug/trace option to find where was the
> last recorded call or to print some key values/settings when they are
> changed/set? Maybe that could help narrow down where it hangs.

The problem is the driver is not really involved except for the initial programming of the states.  The GPU does the reclocking internally once it's enabled.  Best bet would be to adjust the values programmed into the state that's loaded into the SMC to see if you can narrow down what aspects are problematic.

Comment 23 Alexandre Demers 2013-11-08 00:15:58 UTC

(In reply to comment #22)
> (In reply to comment #19)
> > Official kernel 3.12 still failing with latest mesa and drm. It always ends
> > up with a white screen when it freezes and it is impossible to connect
> > remotely by SSH.
> > 
> > Would it be possible to add some debug/trace option to find where was the
> > last recorded call or to print some key values/settings when they are
> > changed/set? Maybe that could help narrow down where it hangs.
> 
> The problem is the driver is not really involved except for the initial
> programming of the states.  The GPU does the reclocking internally once it's
> enabled.  Best bet would be to adjust the values programmed into the state
> that's loaded into the SMC to see if you can narrow down what aspects are
> problematic.

Well, maybe the first thing would be to identify when or what sequence leads to the hang. That's why I was suggesting to trace it. But if I understand you correctly, you are saying the GPU is programmed once and then we just move from one state to another, right?

I'm not sure, but I think I've read somewhere it was possible to set a power state manually, to force dpm to a given state in other words. Am I right?

Comment 24 Alex Deucher 2013-11-08 01:31:29 UTC

(In reply to comment #23)
> Well, maybe the first thing would be to identify when or what sequence leads
> to the hang. That's why I was suggesting to trace it. But if I understand
> you correctly, you are saying the GPU is programmed once and then we just
> move from state to another, right?
> 
> I'm not sure, but I think I've read somewhere it was possible to set a power
> state manually, to force dpm to a given state in other words. Am I right?

A power state consists of several performance levels (generally 3 on cayman; low, medium, and high).  The driver loads a power state and then the GPU dynamically changes between the performance levels within that state based on GPU load.  You can force the GPU to always stay in the low or high state via sysfs.  See:
http://www.botchco.com/agd5f/?p=57
for more info.  That basically tells the GPU to stay in that performance level and to not dynamically transition between performance levels.
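
As a rough sketch of what forcing a level through sysfs looks like from a program: the attribute usually lives at `/sys/class/drm/card0/device/power_dpm_force_performance_level` and accepts `auto`, `low` or `high`. The card index (`card0`) is an assumption and differs between systems, and the helper names below are hypothetical. Writing to the real attribute requires root.

```c
#include <stdio.h>
#include <string.h>

/* Assumed location; the card index (card0) differs between systems. */
#define DPM_FORCE_LEVEL_PATH \
	"/sys/class/drm/card0/device/power_dpm_force_performance_level"

/* Only the three values documented for this attribute are accepted. */
static int dpm_level_is_valid(const char *level)
{
	return strcmp(level, "auto") == 0 ||
	       strcmp(level, "low") == 0 ||
	       strcmp(level, "high") == 0;
}

/*
 * Write the requested level to the given attribute file.
 * Returns 0 on success, -1 on an invalid level or an I/O error.
 */
static int dpm_force_level(const char *path, const char *level)
{
	FILE *f;
	int ok;

	if (!dpm_level_is_valid(level))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%s\n", level) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Calling `dpm_force_level(DPM_FORCE_LEVEL_PATH, "high")` pins the GPU to the high level, and passing `"auto"` hands control back to the dynamic switching. The path is a parameter so the helper can be pointed at a different card.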

Comment 25 Alexandre Demers 2013-11-08 04:13:26 UTC

(In reply to comment #24)
> (In reply to comment #23)
> > Well, maybe the first thing would be to identify when or what sequence leads
> > to the hang. That's why I was suggesting to trace it. But if I understand
> > you correctly, you are saying the GPU is programmed once and then we just
> > move from state to another, right?
> > 
> > I'm not sure, but I think I've read somewhere it was possible to set a power
> > state manually, to force dpm to a given state in other words. Am I right?
> 
> A power state consists of several performance levels (generally 3 on cayman;
> low, medium, and high).  The driver loads a power state and then the GPU
> dynamically changes between the performance levels within that state based
> on GPU load.  You can force the GPU to always stay in the low or high state
> via sysfs.  See:
> http://www.botchco.com/agd5f/?p=57
> for more info.  That basically tells the GPU to stay in that performance
> level and to not dynamically transition between performance levels.

Thanks, I knew I had read it somewhere. You are confirming what I had understood of the performance levels in a given state.

However, I'm thinking we are not looking at the right spot... I've been testing power states and performance levels, running Youtube videos (which hit the general GPU part, not UVD), playing videos with VLC (VDPAU enabled, which uses UVD) and doing both at once. Everything was rock solid. There must be something else that has nothing to do with the power states. All performance levels and transitions from one to another worked fine.

Is there something to your knowledge that could have more chance of hanging when switching performance levels that would be less vulnerable with a fixed mclk and vddci?

Comment 26 Alexandre Demers 2013-11-08 06:32:46 UTC

Alex, in drivers/gpu/drm/radeon/ni_dpm.c, when we limit sclk and mclk to the max speed according to vddc and vddci, aren't we possibly getting something wrong with mclk? I mean, could we be setting an mclk value that is wrong? We have two conditions under which we can max the mclk value, but we are not looking at the lowest one. I think it should be something like this instead, to be sure we are using the most restrictive value:
	/* Select the lowest mclk value according to the most restrictive between vddc and vddci*/
	if (max_mclk_vddc || max_mclk_vddci) {
		max_mclk_vddcx = (max_mclk_vddc > max_mclk_vddci) ? max_mclk_vddci : max_mclk_vddc;
	}

	for (i = 0; i < ps->performance_level_count; i++) {
		if (max_sclk_vddc) {
			if (ps->performance_levels[i].sclk > max_sclk_vddc)
				ps->performance_levels[i].sclk = max_sclk_vddc;
		}
		if (max_mclk_vddcx) {
			if (ps->performance_levels[i].mclk > max_mclk_vddcx)
				ps->performance_levels[i].mclk = max_mclk_vddcx;
		}
	}

I'm also quoting you: "Not exactly. Mclk is tied to vddci (memory interface voltage), but both mclk and sclk (and the core display clock) are tied to vddc (core voltage)." Which means mclk shouldn't run at its max speed if vddc is not at its max value, should it? Otherwise, we may encounter stability problems.
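
For reference, here is a standalone sketch of the "most restrictive limit" idea, assuming (as the `if (max_...)` checks in the snippet above suggest) that a limit of 0 means "no limit". With a plain minimum, one limit being 0 would produce 0 and skip the clamp entirely, so the sketch treats 0 separately. The helper names are hypothetical and this is not the code in ni_dpm.c.

```c
/*
 * Return the stricter of two clock limits.
 * A limit of 0 is treated as "no limit" (assumption).
 */
static unsigned int stricter_limit(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Clamp a clock to a limit; a limit of 0 leaves the clock untouched. */
static unsigned int clamp_clock(unsigned int clk, unsigned int limit)
{
	return (limit != 0 && clk > limit) ? limit : clk;
}
```

For example, `clamp_clock(130000, stricter_limit(0, 125000))` gives 125000, whereas taking a plain minimum of 0 and 125000 would leave mclk at 130000.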

Comment 27 Alexandre Demers 2013-11-08 06:44:16 UTC

(In reply to comment #26)
> Alex, in drivers/gpu/drm/radeon/ni_dpm.c, when we are limiting the sclk and
> mclk to the max speed according to vddc and vddci, aren't we screwing
> possibly something with mclk? I mean, could we be setting a mclk value that
> is wrong? We have two conditions where we can max mclk value, but we are not
> looking at the lowest one. I think it should be something like this instead
> to be sure we are using the most restrictive value:
> 	/* Select the lowest mclk value according to the most restrictive between
> vddc and vddci*/
> 	if (max_mclk_vddc || max_mclk_vddci) {
> 		max_mclk_vddcx = (max_mclk_vddc > max_mclk_vddci) ? max_mclk_vddci :
> max_mclk_vddc;
> 	}
> 
> 	for (i = 0; i < ps->performance_level_count; i++) {
> 		if (max_sclk_vddc) {
> 			if (ps->performance_levels[i].sclk > max_sclk_vddc)
> 				ps->performance_levels[i].sclk = max_sclk_vddc;
> 		}
> 		if (max_mclk_vddcx) {
> 			if (ps->performance_levels[i].mclk > max_mclk_vddcx)
> 				ps->performance_levels[i].mclk = max_mclk_vddcx;
> 		}
> 	}
> 
> I'm also quoting you: "Not exactly. Mclk is tied to vddci (memory interface
> voltage), but both mclk and sclk (and the core display clock) are tied to
> vddc (core voltage)." Which means, mclk shouldn't run at its max speed if
> vddc is not at its max value, isn't it? Otherwise, we may encounter
> stability problem.

Forget this, I'm getting tired and I didn't realize that we were already making sure we were maxing the value if it was smaller.

Comment 28 Alexandre Demers 2013-11-08 07:39:23 UTC

Alex, as I quoted you earlier, mclk depends on both vddc and vddci. How can we know mclk at 125000 is stable when vddc is 1000 and vddci 1150 (here, vddc is not at its max of 1050)?

Comment 29 Alexandre Demers 2013-11-14 05:31:16 UTC

Just in case, I went back in time to bisect between v3.10 and v3.11, beginning at commit 69e0b57a91adca2e3eb56ed4db39ab90f3ae1043, when dpm was implemented on Cayman. I applied the patches from bug 68235 so I could run without the display hanging. I've been running videos, daily tasks and piglit runs without a crash. I'll move forward and see if I can find something that could help us explain the hangs happening since v3.11.0.

Comment 30 Alex Deucher 2013-11-14 14:42:34 UTC

(In reply to comment #29)
> Just in case, I went back in time for bisection between v3.11 and v3.10,
> began at commit 69e0b57a91adca2e3eb56ed4db39ab90f3ae1043 when dpm was
> implemented on Cayman. I applied patches from bug 68235 so I could run
> without display hanging. I've been running videos, daily tasks and piglit
> runs without a crash. I'll move forward and see if I can find something that
> could help us explain the hangs happening since v3.11.0.

As I mentioned in bug 68235, with 69e0b57a91adca2e3eb56ed4db39ab90f3ae1043 the driver did not enable dynamic performance level adjustments since it caused hangs. With that commit the GPU always stays in the lowest performance level which is why it's stable.  The hangs were fixed by 4da18e26e0cc2387597ff57ac33df1bc6369fbed, 7a80c2c9a957b1ab056fac235140ebd6c43d9831, and 779187f2c3e69b8c06488538e0fd9fd02163359e and dynamic performance levels were enabled in commit 7ad8d0687bb5030c3328bc7229a3183ce179ab25.

Comment 31 Alexandre Demers 2013-11-14 15:06:09 UTC

(In reply to comment #30)
> (In reply to comment #29)
> > Just in case, I went back in time for bisection between v3.11 and v3.10,
> > began at commit 69e0b57a91adca2e3eb56ed4db39ab90f3ae1043 when dpm was
> > implemented on Cayman. I applied patches from bug 68235 so I could run
> > without display hanging. I've been running videos, daily tasks and piglit
> > runs without a crash. I'll move forward and see if I can find something that
> > could help us explain the hangs happening since v3.11.0.
> 
> As I mentioned in bug 68235, with 69e0b57a91adca2e3eb56ed4db39ab90f3ae1043
> the driver did not enable dynamic performance level adjustments since it
> caused hangs. With that commit the GPU always stays in the lowest
> performance level which is why it's stable.  The hangs were fixed by
> 4da18e26e0cc2387597ff57ac33df1bc6369fbed,
> 7a80c2c9a957b1ab056fac235140ebd6c43d9831, and
> 779187f2c3e69b8c06488538e0fd9fd02163359e and dynamic performance levels were
> enabled in commit 7ad8d0687bb5030c3328bc7229a3183ce179ab25.

Yes, I know. I just needed to start with something "solid". If running at the lowest performance level is stable, it gives me a reference. I had time to test another commit in the branch, but since I'm not at home, I can't say if all the commits you identified are in. I'll probably continue tonight.

Comment 32 Alexandre Demers 2013-11-15 06:50:30 UTC

Created attachment 89246
journalctl from last couple of boot/hang cycles

This could be interesting:
I've been bisecting. The first usable kernel on Cayman with dpm was between 3.11-rc1 and 3.11-rc2. Before that, the system was unusable (major corruption). So, with the patches from the related bug, I was able to start a session and run applications until it hung.

On Nov 15 00:19:51, I was booting with a safe kernel (before dpm was enabled, a144acb).

On Nov 15 00:36:23, I booted with the kernel exhibiting the behavior I was looking for. Logged in, ran tasks, hung on Nov 15 00:41:03. Segfault in libLLVM-3.3.so and possible recursive locking detected, followed by DEADLOCK.

However, I was unable to get the same result on the following boot.

Then I switched to 3.12-rc7 to send the journal.

Comment 33 Alexandre Demers 2013-11-24 07:06:56 UTC

I made an observation: the hang happens sooner when I play a movie. However, playing a video is not needed to hang the system; it just happens quicker. It also happens during usual tasks (usually when switching windows or when scrolling in a window). I know, from having tracked it, that the performance level changes a lot while a video is being played, depending on what else is going on. So I tried something: I forced the performance level to "high". By doing so, I had no crash (daily tasks, launched a quick piglit run, started TF2). I then forced the performance level to "low" and did pretty much the same thing without any problem.

So it seems the problem is not with a performance level per se, at least not with low and high. I would have to test with medium.

It could also be interesting to test UVD performance levels, although I doubt they are the problem, since I can play videos and hangs happen even when UVD is not being used or no videos are being played.

We could also look at other things. Maybe it is triggered after switching performance levels so many times or when doing some combination. Could we be switching too often or too fast for the card?

Many questions to investigate.

Comment 34 Alexandre Demers 2013-11-24 20:54:45 UTC

I may have to withdraw what I said earlier about stability on high performance. I had fixed it to high to launch a piglit run, after doing the same thing at low performance, and it eventually hung. Now, whether it was related to a dpm problem or to piglit hitting a bug (see bug 71859), I can't tell right now. They may even be related.

I was trying to confirm that I have different results when doing a piglit run while using high performance vs. low performance.

I'll give some updates later.

Comment 35 Alexandre Demers 2013-11-24 20:57:05 UTC

(In reply to comment #34)
> I may have to withdraw what I said earlier about stability on high
> performance. I had fixed it to high to launch a piglit run, after doing the
> same thing at low performance, and it eventually hanged. Now, was it related
> to a dpm problem or to a piglit hitting a bug (see bug 71859), I can't tell
> right now? They may even be related.
> 
> I was trying to confirm that I have different results when doing a piglit
> run while using high performance VS low performance.
> 
> I'll give some updates later.

To be noted: I was using a balanced power state this time, not a performance power state. To be investigated.

Comment 36 Alex Deucher 2013-11-25 18:12:33 UTC

(In reply to comment #35)
> To be noted: I was using a balanced power state this time, not a performance
> power state. To be investigated.

On most cards there are only performance states.  Selecting balanced also selects performance.  You are probably using the same state in both cases.

Comment 37 Alexandre Demers 2013-11-25 19:53:12 UTC

(In reply to comment #36)
> (In reply to comment #35)
> > To be noted: I was using a balanced power state this time, not a performance
> > power state. To be investigated.
> 
> On most cards there are only performance states.  Selecting balanced also
> selects performance.  You are probably using the same state in both cases.

That is also what it looks like. It may be completely unrelated, but I have only had hangs when using the balanced setting.

Comment 38 Alexandre Demers 2013-11-26 22:47:42 UTC

A couple of observations. I've forced power state to performance and performance state to high. Here is the result:
[root@Xander device]# more /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 2    sclk: 83000 mclk: 130000 vddc: 1060 vddci: 1150

Now, keeping this configuration, I launched a video relying on UVD. As a result, the core clock is downclocked from 830 (probably limited to 800 as we know) to 725.
[root@Xander device]# more /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 54000 dclk: 40000
power level 2    sclk: 72500 mclk: 130000 vddc: 1060 vddci: 1150

However, if I don't force this combination of power and performance states (leaving it as balanced and auto, or performance and auto), I have the following:
[root@Xander device]# more /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 54000 dclk: 40000
power level 0    sclk: 50000 mclk: 130000 vddc: 1000 vddci: 1150
[root@Xander device]# more /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 54000 dclk: 40000
power level 2    sclk: 72500 mclk: 130000 vddc: 1060 vddci: 1150

As you can see, it will adapt to the needed performance state.

- So, first question: is it expected to see a lowered sclk when UVD is active?
- Second one: when the performance is changed automatically (auto), could we be triggering a performance state change too quickly?
- Third one: it was previously said mclk is tied to vddci AND vddc. Wouldn't there be a chance we could encounter a problem here if vddc=1000 and not 1060 when running at full speed?
- Last one: is there a way to monitor the GPU temperature and/or the GPU fan speed? (Even at full speed under heavy load, the fan does not run as fast as when dpm=0. I wonder whether I am overheating from time to time.)
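
For anyone who wants to log these readings over time rather than compare them by eye, here is a minimal sketch of a parser for the "power level" lines shown above. The function name and the output field names are made up for illustration. The unit conversion assumes the clocks are printed in units of 10 kHz and the voltages in mV, which is inferred from this thread (83000 corresponds to 830 MHz, 1060 to 1.060 V), not taken from driver documentation.

```python
import re

# Matches lines such as:
#   power level 2    sclk: 72500 mclk: 130000 vddc: 1060 vddci: 1150
LEVEL_RE = re.compile(
    r"power level (\d+)\s+sclk: (\d+) mclk: (\d+) vddc: (\d+) vddci: (\d+)"
)


def parse_pm_info(text):
    """Return one dict per 'power level' line found in text."""
    levels = []
    for match in LEVEL_RE.finditer(text):
        level, sclk, mclk, vddc, vddci = (int(g) for g in match.groups())
        levels.append({
            "level": level,
            # Assumption: clocks are in units of 10 kHz (83000 -> 830 MHz).
            "sclk_mhz": sclk / 100,
            "mclk_mhz": mclk / 100,
            # Assumption: voltages are in millivolts.
            "vddc_mv": vddc,
            "vddci_mv": vddci,
        })
    return levels


sample = """uvd    vclk: 54000 dclk: 40000
power level 2    sclk: 72500 mclk: 130000 vddc: 1060 vddci: 1150
"""
print(parse_pm_info(sample))
```

On a live system the text would come from reading the radeon_pm_info debugfs file as root; the sample string here only mirrors the output pasted above.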

Comment 39 Alex Deucher 2013-11-26 23:13:10 UTC

(In reply to comment #38)
> A couple of observations. I've forced power state to performance and
> performance state to high. Here is the result:
> [root@Xander device]# more /sys/kernel/debug/dri/64/radeon_pm_info
> uvd    vclk: 0 dclk: 0
> power level 2    sclk: 83000 mclk: 130000 vddc: 1060 vddci: 1150
> 
> Now, keeping this configuration, I launched a video relying on UVD. The
> result downclocks the core clock from 830 (probably limited to 800 as we
> know) to 725.
> [root@Xander device]# more /sys/kernel/debug/dri/64/radeon_pm_info
> uvd    vclk: 54000 dclk: 40000
> power level 2    sclk: 72500 mclk: 130000 vddc: 1060 vddci: 1150
> 
> However, if I don't force this power and performance states combination
> (letting it as balanced and auto or performance and auto), I have the
> following:
> [root@Xander device]# more /sys/kernel/debug/dri/64/radeon_pm_info
> uvd    vclk: 54000 dclk: 40000
> power level 0    sclk: 50000 mclk: 130000 vddc: 1000 vddci: 1150
> [root@Xander device]# more /sys/kernel/debug/dri/64/radeon_pm_info
> uvd    vclk: 54000 dclk: 40000
> power level 2    sclk: 72500 mclk: 130000 vddc: 1060 vddci: 1150
> 
> As you can see, it will adapt to the needed performance state.
> 

Not exactly.  Here are the power states defined for your system:

 == power state 0 ==
          ui class: none
          internal class: boot 
          caps: 
          uvd    vclk: 0 dclk: 0
          power level 0    sclk: 25000 mclk: 15000 vddc: 1000 vddci: 1000
          power level 1    sclk: 25000 mclk: 15000 vddc: 1000 vddci: 1000
          power level 2    sclk: 25000 mclk: 15000 vddc: 1000 vddci: 1000
          status: c r b 
 == power state 1 ==
          ui class: performance
          internal class: none
          caps: 
          uvd    vclk: 0 dclk: 0
          power level 0    sclk: 25000 mclk: 15000 vddc: 900 vddci: 950
          power level 1    sclk: 50000 mclk: 130000 vddc: 1000 vddci: 1150
          power level 2    sclk: 83000 mclk: 130000 vddc: 1060 vddci: 1150
          status: 
 == power state 2 ==
          ui class: none
          internal class: uvd 
          caps: video 
          uvd    vclk: 54000 dclk: 40000
          power level 0    sclk: 50000 mclk: 130000 vddc: 1000 vddci: 1150
          power level 1    sclk: 50000 mclk: 130000 vddc: 1000 vddci: 1150
          power level 2    sclk: 72500 mclk: 130000 vddc: 1060 vddci: 1150
          status: 
 == power state 3 ==
          ui class: none
          internal class: uvd_mvc 
          caps: video 
          uvd    vclk: 70000 dclk: 56000
          power level 0    sclk: 50000 mclk: 130000 vddc: 1000 vddci: 1150
          power level 1    sclk: 50000 mclk: 130000 vddc: 1000 vddci: 1150
          power level 2    sclk: 72500 mclk: 130000 vddc: 1060 vddci: 1150
          status: 

When you select performance, battery, or balanced state on your system, power state 1 is used.  When you activate UVD, the driver selects power state 2.  When UVD is not in use, the previously active power state (1 in this case) is selected again.  The driver only selects the power state.  The power levels (0-2) within the power state are either selected automatically by the hw based on GPU load (auto) or forced to power level 0 or 2 (low or high) if you force the performance level.  When you force the performance level to high, it will apply to all power states that you select (both power state 1 and 2) which is why you see the UVD state using 500 vs. 725 MHz.
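
The split between state selection and level selection described above can be sketched as a toy model. This is only an illustration in Python, not driver code: the function names are hypothetical, the clock values are copied from power states 1 and 2 in the table, and the hardware's automatic level choice is deliberately not modelled.

```python
# (sclk, mclk) per power level, copied from the table above.
POWER_STATES = {
    1: [(25000, 15000), (50000, 130000), (83000, 130000)],   # ui class: performance
    2: [(50000, 130000), (50000, 130000), (72500, 130000)],  # internal class: uvd
}


def select_state(uvd_active):
    """The driver picks the power state: 2 while UVD is in use, otherwise 1."""
    return 2 if uvd_active else 1


def forced_level(performance_level):
    """'low' pins level 0 and 'high' pins level 2; 'auto' returns None."""
    return {"low": 0, "high": 2}.get(performance_level)


def sclk_mhz(uvd_active, performance_level):
    """Engine clock in MHz for a forced level, or None when the hw decides."""
    level = forced_level(performance_level)
    if level is None:
        return None  # picked by the hardware from GPU load, not modelled here
    sclk, _mclk = POWER_STATES[select_state(uvd_active)][level]
    return sclk / 100  # assumption: table values are in units of 10 kHz


print(sclk_mhz(False, "high"), sclk_mhz(True, "high"))
```

With the level forced to high, the model gives 830 MHz without UVD and 725 MHz with UVD, which matches the two readings in comment #38.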

> - So, first question: is it expected to see a lowered sclk when UVD is
> active?

Yes since the driver selects a different power state (one tailored to the requirements of the UVD block for smooth video playback).

> - Second one: when the performance is changed automatically (auto), could we
> be triggering a performance state change too quickly?

The driver doesn't trigger performance level changes, the hw does.  The driver just selects the overall power state.  

> - Third one: it was previously said mclk is tied to vddci AND vddc. Wouldn't
> there be a chance we could encounter a problem here if vddc=1000 and not
> 1060 when running at full speed?

Sure.  That's why we have the ni_apply_state_adjust_rules() to make sure the power state is valid based on the current requirements.

> - Last one: is there a way to monitor the GPU temperature and/or the GPU fan
> speed? (even at full speed when highly solicited, the fan is not running as
> fast as when dpm=0. I'm wondering if I'm not overheating from time to time).

You can see the temperature in sysfs.  There should be an entry under /sys/class/hwmon/ for radeon.
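
A minimal sketch of how that hwmon entry could be located and read from a script. It assumes the standard hwmon layout, where a `name` file identifies the driver and `temp1_input` holds the temperature in millidegrees Celsius. Whether those files sit directly in the hwmonN directory or in its `device` subdirectory depends on the kernel version, so the sketch checks both; the function name is made up for illustration.

```python
import glob
import os


def radeon_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir: degrees C} for hwmon entries whose name is 'radeon'."""
    temps = {}
    for hwmon_dir in sorted(glob.glob(os.path.join(hwmon_root, "hwmon*"))):
        # The attribute files may be in the hwmon directory itself or in
        # its 'device' subdirectory, so try both locations.
        for base in (hwmon_dir, os.path.join(hwmon_dir, "device")):
            try:
                with open(os.path.join(base, "name")) as f:
                    if f.read().strip() != "radeon":
                        continue
                with open(os.path.join(base, "temp1_input")) as f:
                    # temp1_input is expressed in millidegrees Celsius.
                    temps[hwmon_dir] = int(f.read().strip()) / 1000.0
                break
            except (OSError, ValueError):
                continue
    return temps


print(radeon_temperatures())
```

On a machine without a radeon hwmon entry this simply prints an empty dict.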

Comment 40 Alexandre Demers 2013-11-27 01:39:35 UTC

(In reply to comment #39)
> > - Third one: it was previously said mclk is tied to vddci AND vddc. Wouldn't
> > there be a chance we could encounter a problem here if vddc=1000 and not
> > 1060 when running at full speed?
> 
> Sure.  That's why we have the ni_apply_state_adjust_rules() to make sure the
> power state is valid based on the current requirements.

Thank you for all the explanations.

Do you think I could try to force to vddc=1060 in ni_apply_state_adjust() instead of 1000 when mclk=130000 to see if the system becomes stable?

Comment 41 Alex Deucher 2013-11-27 02:19:29 UTC

(In reply to comment #40)
> Do you think I could try to force to vddc=1060 in ni_apply_state_adjust()
> instead of 1000 when mclk=130000 to see if the system becomes stable?

The driver will currently limit the mclk to 125000 since that's the max level in the vddc/mclk dep table.  You could try it.  The driver will use 1060 when the sclk is 80000.

Comment 42 Alexandre Demers 2013-11-27 04:16:28 UTC

(In reply to comment #41)
> (In reply to comment #40)
> > Do you think I could try to force to vddc=1060 in ni_apply_state_adjust()
> > instead of 1000 when mclk=130000 to see if the system becomes stable?
> 
> The driver will currently limit the mclk to 125000 since that's the max
> level in the vddc/mclk dep table.  You could try it.  The driver will use
> 1060 when the sclk is 80000.

Well, it isn't that... I'll look at mclk again and try limiting it a bit lower (120000 maybe). I ran a video on Youtube (no UVD) and it hung again. The only way I don't have this problem is by forcing the power level (sorry, I was calling it performance level earlier). Doing so is OK almost every time, while using the auto mode freezes the card.

I also had a look under Windows. I played the same video and I used GPU-Z to get a log. I'm attaching it right away to compare what I get under Linux VS Windows.

Comment 43 Alexandre Demers 2013-11-27 04:30:34 UTC

Created attachment 89884 [details]
GPU-Z Cayman log playing a Youtube video

I used GPU-Z under Windows to monitor mclk, sclk, VDDC and temperature. Temperature was pretty much the same as the one I had under Linux.

However, power levels are different.

First, under Windows, anytime the memory goes above the default mclk speed (150MHz), VDDC is set to 1.063V even if the GPU is not running at full speed. Under Linux, VDDC is set to 1.060V only if both mclk and sclk are running at full speed. Otherwise, VDDC is kept at 1.000V. Why this difference?

Next, there is an intermediate power level where sclk runs at 500MHz and mclk runs at 650MHz with a VDDC at 1.063V. Even when running in 1080p, it never went above that speed. Under Linux, the intermediate power level (power level 1) uses a sclk at 500MHz (same) and mclk at 1300MHz (twice as fast, the same as the maximum power level 2) with only a VDDC of 1.000V.

There was no indication about the VDDCI.

Comment 44 Alexandre Demers 2013-11-27 05:01:37 UTC

Created attachment 89885 [details]
A second GPU-Z log, this time pushing a bit more

In this new log, I pushed the video card a bit more. We can see one power level with an sclk of 500MHz and an mclk of 650MHz (VDDC @ 1.063V), another intermediate power level with an sclk of 500MHz and an mclk of 1300MHz (VDDC @ 1.063V), and finally a top power level with an sclk of 830MHz and an mclk of 1300MHz (also with VDDC @ 1.063V).

So, we are missing a power level under Linux AND we are not using the same VDDC above the lowest power level. If we were not using a high enough VDDC, couldn't that also explain why we had to limit sclk and mclk (800MHz and 1250MHz) on cards that are not using stock speeds (overclocked cards, like mine)?

Comment 45 Alex Deucher 2013-11-27 06:04:58 UTC

The windows driver should operate pretty much the same as the linux driver as far as I know.  I'm not really familiar with how gpu-z reads back the clocks and voltages and that may have something to do with the differences.  Not all of the aspects of the level transition happen at the same time.

You could try setting the vddc to 1060 and changing the mclk of level 1 (mid level) to 650MHz.  Perhaps the jump from 150 to 1300 is too big and the pll is not able to lock properly.

Comment 46 Alexandre Demers 2013-11-27 14:14:05 UTC

(In reply to comment #45)
> The windows driver should operate pretty much the same as the linux driver
> as far as I know.  I'm not really familiar with how gpu-z reads back the
> clocks and voltages and that may have something to do with the differences. 
> Not all of the aspects of the level transition happen at the same time.
> 
> You could try setting the vddc to 1060 and changing the mclk of level 1 (mid
> level) to 650Mhz.  Perhaps the jump from 150 to 1300 is too big and the pll
> is not able to lock properly.

I tried it, but I had the same result: a hang. However, I'd be curious to see the effect on power consumption. I'll keep that for another time.

I'll try to follow Vddci under Windows with something like Afterburner.

Comment 47 Alexandre Demers 2013-11-28 14:50:18 UTC

Latest news: tried to force VDDC to 1100 (some 6950 cards are using 1.1V instead of 1.06V, mostly factory overclocked ones), tried to force VDDCI to 1150 for all power levels (crashed even quicker), tried to force VDDCI to 1100 for high power level just in case (hanged as usual after mostly the same delay).

So I turned my attention to mclk and downclocked it to 120000 (instead of 125000). So far, the auto power level runs fine. I'll tweak it until I find the value at which it begins to hang... and why it hangs mostly on the auto power level and almost never on high or low (even though high is mostly as if dpm was disabled)... and never under Windows.

Comment 48 Alexandre Demers 2013-11-29 01:21:41 UTC

Went to 122500, still no total hang. Ran quick.tests (piglit) while playing a movie. The display went black a couple of times and, according to dmesg/journal, it was related to texelFetch segfaulting in llvm (another bug I've reported). So, for now, we can say it is still stable at this frequency. I'll push it a bit more.

Comment 49 Michel Dänzer 2013-11-29 02:57:07 UTC

(In reply to comment #48)
> The display went black a couple of time and according to dmesg/journal, it was
> related to texelFetch segfaulting in llvm [...]

'The display went black a couple of time' sounds like GPU resets after lockups, which are unlikely to be directly related to segfaulting piglit tests FWIW.

Comment 50 Alexandre Demers 2013-11-29 15:46:52 UTC

(In reply to comment #49)
> (In reply to comment #48)
> > The display went black a couple of time and according to dmesg/journal, it was
> > related to texelFetch segfaulting in llvm [...]
> 
> 'The display went black a couple of time' sounds like GPU resets after
> lockups, which are unlikely to be directly related to segfaulting piglit
> tests FWIW.

Maybe, but I have no other indication. Also, I'm pretty that, when in auto power level, when I run my piglit tests, it locks at there.

Another thing I noticed: if I run the quick piglit tests at a low power level, the total number of passed tests is higher than when I run them at a higher speed. The failing tests explaining this difference are mostly texelFetch related. I'll send the piglit results from both runs once I have set mclk back to its default value.

Comment 51 Alexandre Demers 2013-11-29 15:47:58 UTC

(In reply to comment #50)
> (In reply to comment #49)
> > (In reply to comment #48)
> > > The display went black a couple of time and according to dmesg/journal, it was
> > > related to texelFetch segfaulting in llvm [...]
> > 
> > 'The display went black a couple of time' sounds like GPU resets after
> > lockups, which are unlikely to be directly related to segfaulting piglit
> > tests FWIW.
> 
> Maybe, but I have no other indication. Also, I'm pretty that, when in auto
> power level, when I run my piglit tests, it locks at there.
> 
> Another thing I noticed: if I run the quick piglit tests at a low power
> level, the total number of passed tests is higher than when I run it at an
> higher speed. The failing tests explaining this difference are mostly
> texelFetch related. I'll send the piglit results from both runs once I'll
> have set back mclk to its default value.

Oops, "... I'm pretty that..." -> "... I'm pretty sure that..."

Comment 52 Alexandre Demers 2013-11-29 19:31:18 UTC

I disabled R600_LLVM and ran another piglit at high power level. It crashed anyway. But I got a different message in my journal:

Nov 29 14:02:48 Xander kernel: glx-create-cont[14825]: segfault at 17c ip 00007f474ede915e sp 00007fff555e8ff0 error 6 in r600_dri.so[7f474e81d000+80e000]
Nov 29 14:02:48 Xander systemd-coredump[14834]: Process 14825 (glx-create-cont) dumped core.
Nov 29 14:09:06 Xander kernel: traps: shader_runner[20951] trap int3 ip:7fd47a14782f sp:7fffeb3ab520 error:0
Nov 29 14:09:06 Xander systemd-coredump[20968]: Process 20951 (shader_runner) dumped core.
Nov 29 14:09:21 Xander kernel: traps: shader_runner[22965] trap int3 ip:7f9e6348982f sp:7fff739db970 error:0
Nov 29 14:09:21 Xander systemd-coredump[22983]: Process 22965 (shader_runner) dumped core.
Nov 29 14:11:09 Xander dbus-daemon[3115]: dbus[3115]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
Nov 29 14:11:09 Xander dbus[3115]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
Nov 29 14:11:09 Xander systemd[1]: Starting Hostname Service...
Nov 29 14:11:09 Xander dbus-daemon[3115]: dbus[3115]: [system] Successfully activated service 'org.freedesktop.hostname1'
Nov 29 14:11:09 Xander dbus[3115]: [system] Successfully activated service 'org.freedesktop.hostname1'
Nov 29 14:11:09 Xander systemd[1]: Started Hostname Service.
Nov 29 14:12:19 Xander kernel: radeon_gem_object_create:62 alloc size 1365Mb bigger than 256Mb limit
Nov 29 14:12:19 Xander kernel: radeon_gem_object_create:62 alloc size 1365Mb bigger than 256Mb limit
Nov 29 14:12:41 Xander kernel: radeon_gem_object_create:62 alloc size 1024Mb bigger than 256Mb limit
Nov 29 14:12:41 Xander kernel: radeon_gem_object_create:62 alloc size 1024Mb bigger than 256Mb limit
Nov 29 14:13:47 Xander systemd[1]: Starting Cleanup of Temporary Directories...
-- Reboot --

Still core dumps and a GEM object allocation problem. Anything useful?

Comment 53 Alexandre Demers 2013-12-02 23:33:05 UTC

Possible circular locking dependency and a DEADLOCK. Using the latest 3.13.0-rc2 (with some added printk(), so no difference from rc1) and having set the performance level to low, I've just found the following in my journal from last night. The system never crashed or anything, but this may be a clue pointing at... none other than pm.mclk...


Dec 02 01:01:05 Xander kernel: ======================================================
Dec 02 01:01:05 Xander kernel: [ INFO: possible circular locking dependency detected ]
Dec 02 01:01:05 Xander kernel: 3.13.0-rc2-VANILLA-dirty #170 Tainted: G         C  
Dec 02 01:01:05 Xander kernel: -------------------------------------------------------
Dec 02 01:01:05 Xander kernel: chromium/3786 is trying to acquire lock:
Dec 02 01:01:05 Xander kernel:  (reservation_ww_class_mutex){+.+.+.}, at: [<ffffffffa00047d9>] ttm_bo_wait_unreserved+0x39/0x70 [ttm]
Dec 02 01:01:05 Xander kernel: 
                               but task is already holding lock:
Dec 02 01:01:05 Xander kernel:  (&bo->wu_mutex){+.+...}, at: [<ffffffffa00047bd>] ttm_bo_wait_unreserved+0x1d/0x70 [ttm]
Dec 02 01:01:05 Xander kernel: 
                               which lock already depends on the new lock.
Dec 02 01:01:05 Xander kernel: 
                               the existing dependency chain (in reverse order) is:
Dec 02 01:01:05 Xander kernel: 
                               -> #4 (&bo->wu_mutex){+.+...}:
Dec 02 01:01:05 Xander kernel:        [<ffffffff810a1a62>] lock_acquire+0x72/0xa0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8182d7f8>] mutex_lock_interruptible_nested+0x58/0x550
Dec 02 01:01:05 Xander kernel:        [<ffffffffa00047bd>] ttm_bo_wait_unreserved+0x1d/0x70 [ttm]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa0006079>] ttm_bo_vm_fault+0x389/0x470 [ttm]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa00485c7>] radeon_ttm_fault+0x47/0x60 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffff81125b4c>] __do_fault+0x6c/0x4c0
Dec 02 01:01:05 Xander kernel:        [<ffffffff81129846>] handle_mm_fault+0x2e6/0xc90
Dec 02 01:01:05 Xander kernel:        [<ffffffff81833455>] __do_page_fault+0x165/0x560
Dec 02 01:01:05 Xander kernel:        [<ffffffff81833859>] do_page_fault+0x9/0x10
Dec 02 01:01:05 Xander kernel:        [<ffffffff8182ffc8>] page_fault+0x28/0x30
Dec 02 01:01:05 Xander kernel: 
                               -> #3 (&rdev->pm.mclk_lock){++++++}:
Dec 02 01:01:05 Xander kernel:        [<ffffffff810a1a62>] lock_acquire+0x72/0xa0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8182e181>] down_write+0x31/0x60
Dec 02 01:01:05 Xander kernel:        [<ffffffffa0089b4e>] radeon_pm_compute_clocks+0x2ee/0x790 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa008ab51>] radeon_pm_init+0x7c1/0x960 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa005731f>] radeon_modeset_init+0x40f/0x9a0 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa00310d0>] radeon_driver_load_kms+0xe0/0x210 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffff8146a6ff>] drm_dev_register+0x9f/0x1d0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8146c53d>] drm_get_pci_dev+0x8d/0x140
Dec 02 01:01:05 Xander kernel:        [<ffffffffa002d3ff>] radeon_pci_probe+0x9f/0xd0 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffff8139e7e0>] local_pci_probe+0x40/0xa0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8107633f>] work_for_cpu_fn+0xf/0x20
Dec 02 01:01:05 Xander kernel:        [<ffffffff810796bb>] process_one_work+0x1cb/0x490
Dec 02 01:01:05 Xander kernel:        [<ffffffff8107a478>] worker_thread+0x258/0x3a0
Dec 02 01:01:05 Xander kernel:        [<ffffffff81080907>] kthread+0xf7/0x110
Dec 02 01:01:05 Xander kernel:        [<ffffffff8183654c>] ret_from_fork+0x7c/0xb0
Dec 02 01:01:05 Xander kernel: 
                               -> #2 (&dev->struct_mutex){+.+.+.}:
Dec 02 01:01:05 Xander kernel:        [<ffffffff810a1a62>] lock_acquire+0x72/0xa0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8182d35b>] mutex_lock_nested+0x4b/0x490
Dec 02 01:01:05 Xander kernel:        [<ffffffffa0089b46>] radeon_pm_compute_clocks+0x2e6/0x790 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa008ab51>] radeon_pm_init+0x7c1/0x960 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa005731f>] radeon_modeset_init+0x40f/0x9a0 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa00310d0>] radeon_driver_load_kms+0xe0/0x210 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffff8146a6ff>] drm_dev_register+0x9f/0x1d0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8146c53d>] drm_get_pci_dev+0x8d/0x140
Dec 02 01:01:05 Xander kernel:        [<ffffffffa002d3ff>] radeon_pci_probe+0x9f/0xd0 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffff8139e7e0>] local_pci_probe+0x40/0xa0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8107633f>] work_for_cpu_fn+0xf/0x20
Dec 02 01:01:05 Xander kernel:        [<ffffffff810796bb>] process_one_work+0x1cb/0x490
Dec 02 01:01:05 Xander kernel:        [<ffffffff8107a478>] worker_thread+0x258/0x3a0
Dec 02 01:01:05 Xander kernel:        [<ffffffff81080907>] kthread+0xf7/0x110
Dec 02 01:01:05 Xander kernel:        [<ffffffff8183654c>] ret_from_fork+0x7c/0xb0
Dec 02 01:01:05 Xander kernel: 
                               -> #1 (&rdev->pm.mutex){+.+.+.}:
Dec 02 01:01:05 Xander kernel:        [<ffffffff810a1a62>] lock_acquire+0x72/0xa0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8182d35b>] mutex_lock_nested+0x4b/0x490
Dec 02 01:01:05 Xander kernel:        [<ffffffffa008a069>] radeon_dpm_enable_uvd+0x79/0xc0 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa00b238f>] radeon_uvd_note_usage+0xef/0x110 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa0060610>] radeon_cs_ioctl+0x8f0/0x9f0 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffff81464782>] drm_ioctl+0x502/0x640
Dec 02 01:01:05 Xander kernel:        [<ffffffffa002d049>] radeon_drm_ioctl+0x49/0x80 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffff81174bd0>] do_vfs_ioctl+0x300/0x520
Dec 02 01:01:05 Xander kernel:        [<ffffffff81174e71>] SyS_ioctl+0x81/0xa0
Dec 02 01:01:05 Xander kernel:        [<ffffffff818365fd>] system_call_fastpath+0x1a/0x1f
Dec 02 01:01:05 Xander kernel: 
                               -> #0 (reservation_ww_class_mutex){+.+.+.}:
Dec 02 01:01:05 Xander kernel:        [<ffffffff810a10fa>] __lock_acquire+0x195a/0x1b20
Dec 02 01:01:05 Xander kernel:        [<ffffffff810a1a62>] lock_acquire+0x72/0xa0
Dec 02 01:01:05 Xander kernel:        [<ffffffff8182d7f8>] mutex_lock_interruptible_nested+0x58/0x550
Dec 02 01:01:05 Xander kernel:        [<ffffffffa00047d9>] ttm_bo_wait_unreserved+0x39/0x70 [ttm]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa0006079>] ttm_bo_vm_fault+0x389/0x470 [ttm]
Dec 02 01:01:05 Xander kernel:        [<ffffffffa00485c7>] radeon_ttm_fault+0x47/0x60 [radeon]
Dec 02 01:01:05 Xander kernel:        [<ffffffff81125b4c>] __do_fault+0x6c/0x4c0
Dec 02 01:01:05 Xander kernel:        [<ffffffff81129846>] handle_mm_fault+0x2e6/0xc90
Dec 02 01:01:05 Xander kernel:        [<ffffffff81833455>] __do_page_fault+0x165/0x560
Dec 02 01:01:05 Xander kernel:        [<ffffffff81833859>] do_page_fault+0x9/0x10
Dec 02 01:01:05 Xander kernel:        [<ffffffff8182ffc8>] page_fault+0x28/0x30
Dec 02 01:01:05 Xander kernel: 
                               other info that might help us debug this:
Dec 02 01:01:05 Xander kernel: Chain exists of:
                                 reservation_ww_class_mutex --> &rdev->pm.mclk_lock --> &bo->wu_mutex
Dec 02 01:01:05 Xander kernel:  Possible unsafe locking scenario:
Dec 02 01:01:05 Xander kernel:        CPU0                    CPU1
Dec 02 01:01:05 Xander kernel:        ----                    ----
Dec 02 01:01:05 Xander kernel:   lock(&bo->wu_mutex);
Dec 02 01:01:05 Xander kernel:                                lock(&rdev->pm.mclk_lock);
Dec 02 01:01:05 Xander kernel:                                lock(&bo->wu_mutex);
Dec 02 01:01:05 Xander kernel:   lock(reservation_ww_class_mutex);
Dec 02 01:01:05 Xander kernel: 
                                *** DEADLOCK ***
Dec 02 01:01:05 Xander kernel: 2 locks held by chromium/3786:
Dec 02 01:01:05 Xander kernel:  #0:  (&rdev->pm.mclk_lock){++++++}, at: [<ffffffffa00485b6>] radeon_ttm_fault+0x36/0x60 [radeon]
Dec 02 01:01:05 Xander kernel:  #1:  (&bo->wu_mutex){+.+...}, at: [<ffffffffa00047bd>] ttm_bo_wait_unreserved+0x1d/0x70 [ttm]
Dec 02 01:01:05 Xander kernel: 
                               stack backtrace:
Dec 02 01:01:05 Xander kernel: CPU: 1 PID: 3786 Comm: chromium Tainted: G         C   3.13.0-rc2-VANILLA-dirty #170
Dec 02 01:01:05 Xander kernel: Hardware name: Gigabyte Technology Co., Ltd. GA-990FXA-UD3/GA-990FXA-UD3, BIOS F10a 01/24/2013
Dec 02 01:01:05 Xander kernel:  ffffffff8241ade0 ffff8803e9a03a18 ffffffff81824a57 ffffffff8241af90
Dec 02 01:01:05 Xander kernel:  ffff8803e9a03a58 ffffffff8181f529 ffff8803e9a03ab0 ffff8803e5d7c6d8
Dec 02 01:01:05 Xander kernel:  0000000000000001 ffff8803e5d7c6b0 ffff8803e5d7c080 ffff8803e5d7c6d8
Dec 02 01:01:05 Xander kernel: Call Trace:
Dec 02 01:01:05 Xander kernel:  [<ffffffff81824a57>] dump_stack+0x45/0x56
Dec 02 01:01:05 Xander kernel:  [<ffffffff8181f529>] print_circular_bug+0x1f9/0x208
Dec 02 01:01:05 Xander kernel:  [<ffffffff810a10fa>] __lock_acquire+0x195a/0x1b20
Dec 02 01:01:05 Xander kernel:  [<ffffffff810a1a62>] lock_acquire+0x72/0xa0
Dec 02 01:01:05 Xander kernel:  [<ffffffffa00047d9>] ? ttm_bo_wait_unreserved+0x39/0x70 [ttm]
Dec 02 01:01:05 Xander kernel:  [<ffffffff8182d7f8>] mutex_lock_interruptible_nested+0x58/0x550
Dec 02 01:01:05 Xander kernel:  [<ffffffffa00047d9>] ? ttm_bo_wait_unreserved+0x39/0x70 [ttm]
Dec 02 01:01:05 Xander kernel:  [<ffffffffa00047d9>] ? ttm_bo_wait_unreserved+0x39/0x70 [ttm]
Dec 02 01:01:05 Xander kernel:  [<ffffffffa0006071>] ? ttm_bo_vm_fault+0x381/0x470 [ttm]
Dec 02 01:01:05 Xander kernel:  [<ffffffffa00047d9>] ttm_bo_wait_unreserved+0x39/0x70 [ttm]
Dec 02 01:01:05 Xander kernel:  [<ffffffffa0006079>] ttm_bo_vm_fault+0x389/0x470 [ttm]
Dec 02 01:01:05 Xander kernel:  [<ffffffffa00485b6>] ? radeon_ttm_fault+0x36/0x60 [radeon]
Dec 02 01:01:05 Xander kernel:  [<ffffffffa00485c7>] radeon_ttm_fault+0x47/0x60 [radeon]
Dec 02 01:01:05 Xander kernel:  [<ffffffff81125b4c>] __do_fault+0x6c/0x4c0
Dec 02 01:01:05 Xander kernel:  [<ffffffff81129846>] handle_mm_fault+0x2e6/0xc90
Dec 02 01:01:05 Xander kernel:  [<ffffffff81833455>] __do_page_fault+0x165/0x560
Dec 02 01:01:05 Xander kernel:  [<ffffffff811a0703>] ? fsnotify+0x83/0x340
Dec 02 01:01:05 Xander kernel:  [<ffffffff81836629>] ? sysret_check+0x22/0x5d
Dec 02 01:01:05 Xander kernel:  [<ffffffff8136ef6d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
Dec 02 01:01:05 Xander kernel:  [<ffffffff81833859>] do_page_fault+0x9/0x10
Dec 02 01:01:05 Xander kernel:  [<ffffffff8182ffc8>] page_fault+0x28/0x30
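
Since the report above is long, here is a small sketch that pulls just the lock names out of the "-> #N (lock)" lines so the chain can be read at a glance. It is plain text matching on the format as it appears in this journal excerpt and makes no claim about other lockdep output formats; the function name is made up for illustration.

```python
import re

# Matches the chain entries, e.g. "-> #3 (&rdev->pm.mclk_lock){++++++}:"
CHAIN_RE = re.compile(r"-> #(\d+) \(([^)]+)\)")


def lock_chain(report):
    """Return the lock names ordered by the index shown after '#'."""
    entries = {int(index): name for index, name in CHAIN_RE.findall(report)}
    return [entries[index] for index in sorted(entries)]


excerpt = """
    -> #4 (&bo->wu_mutex){+.+...}:
    -> #3 (&rdev->pm.mclk_lock){++++++}:
    -> #2 (&dev->struct_mutex){+.+.+.}:
    -> #1 (&rdev->pm.mutex){+.+.+.}:
    -> #0 (reservation_ww_class_mutex){+.+.+.}:
"""
for name in lock_chain(excerpt):
    print(name)
```

The excerpt only repeats the five chain headers from the log above; feeding the full journal text through the same function should give the same five names.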

Comment 54 Martin Andersson 2013-12-08 17:37:37 UTC

I have a 6950 and I'm seeing exactly the same thing as Alexandre: random hangs that completely lock up the machine. I can't ssh into it and nothing is printed to the logs; the only thing that works is a power cycle. If I disable dpm the machine is stable.

I have run with dpm since 3.11 and I have had the occasional lockup, maybe one every two weeks. But I started playing some more games recently and noticed that the lockups became much more frequent. So I decided to investigate.

The method I use to trigger a lockup is to run GpuTest in a loop, with a 10-second sleep after each run. I do this to trigger power level switches. The arguments to GpuTest are /test=plot3d /benchmark /benchmark_duration_ms=10000 /no_scorebox. At the same time I run piglit quick.tests in a loop. I later found out that the piglit tests are not essential to get lockups, but I kept running them for consistency. 20 of these tests have resulted in a lockup; the longest ran for 80 minutes and the shortest for 3 minutes, with an average of 23 minutes. The tests that didn't cause lockups had dpm either completely disabled or with only certain features disabled; those features are described below. If I run GpuTest constantly, without the sleep and with a longer benchmark duration, I don't get any lockups (I have done several long runs, the longest being over six hours).
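
For anyone wanting to reproduce this, the burst loop could be sketched as follows. The GpuTest arguments are the ones listed above; the binary path `./GpuTest`, the function name and its parameters are assumptions for illustration, and the `run` and `sleep` hooks exist only so the loop can be exercised without the real benchmark.

```python
import subprocess
import time

# Arguments as listed in the comment above.
GPUTEST_ARGS = [
    "/test=plot3d",
    "/benchmark",
    "/benchmark_duration_ms=10000",
    "/no_scorebox",
]


def burst_loop(runs, binary="./GpuTest", idle_s=10,
               run=subprocess.call, sleep=time.sleep):
    """Run the benchmark `runs` times, idling for idle_s seconds after each run.

    The idle gap gives the GPU a chance to drop to a lower power level
    before the next burst of load.
    """
    for _ in range(runs):
        run([binary] + GPUTEST_ARGS)
        sleep(idle_s)


# Example (not executed here): burst_loop(100)
```

Each iteration is roughly 10 seconds of load followed by 10 seconds of idle, so 100 runs take a little over half an hour.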

I also tried to find a good commit. I started with 7ad8d0687bb5030c3328bc7229a3183ce179ab25 (drm/radeon/dpm: re-enable state transitions for Cayman) + the gcc fixes, but I get lockups on that commit as well. I checked out 3.13-rc2 and started disabling features in ni_dpm_init. I disabled the following things without any improvement. I reenabled each feature after I had tested it and cold booted the machine.

eg_pi->smu_uvd_hs
pi->mvdd_control
eg_pi->vddci_control
pi->gfx_clock_gating
pi->mg_clock_gating
pi->mgcgtssm
pi->dynamic_pcie_gen2
pi->thermal_protection
pi->display_gap
pi->dcodt
pi->ulps
eg_pi->abm
eg_pi->mcls
eg_pi->light_sleep
eg_pi->memory_transition
ni_pi->cac_weights->enable_power_containment_by_default
ni_pi->use_power_boost_limit
pi->sclk_ss

eg_pi->pcie_performance_request, was already false so I didn't test it.

I noticed that pi->mvdd_control wasn't set, is that normal?

I don't get any lockups with pi->voltage_control disabled, but I also don't get any power level switches.

If I set eg_pi->dynamic_ac_timing to false, my machine locks up somewhere in the boot process. I haven't looked into that any deeper.

However if I set pi->dynamic_ss to false the lockups disappear, it also works with dynamic_ss set to true and pi->mclk_ss set to false.

So it seems, at least for me, it has something to do with mclk together with power level switches. I'm not sure what to test next, but one thing might be to try to remove the performance power level 2, so that it could only switch between 0 and 1. But I haven't figured out how to accomplish that yet.

Comment 55 Alexandre Demers 2013-12-08 18:15:39 UTC

I'll try to reproduce your observations in the next couple of days, but I'm pretty sure we are experiencing the same problem. What is your video card model?

About the performance level switch, you could either force level 2 to use the same values as level 1 OR you could catch when the performance level changes and force it to 1.

Comment 56 Martin Andersson 2013-12-08 18:31:07 UTC

(In reply to comment #55)
> I'll try to reproduce your observations in the next couple of days, but I'm
> pretty sure we are experiencing the same problem. What is your video card
> model?

Sapphire Radeon HD 6950

Comment 57 Alexandre Demers 2013-12-09 02:02:14 UTC

(In reply to comment #54)
> However if I set pi->dynamic_ss to false the lockups disappear, it also
> works with dynamic_ss set to true and pi->mclk_ss set to false.
>
So this seems to point to a spread spectrum mischief. I don't know if dynamic_ss automatically applies to mclk but it seems to, since disabling spread spectrum only for mclk solves your problem. We could suspect that at a given frequency, we have a problem restoring the original message / clock (the higher we get, the harder it is) until at some point it becomes unreliable.

I should be able to test it later tonight to confirm if this fixes the bug on my side too.

Comment 58 Alexandre Demers 2013-12-09 07:42:12 UTC

Disabling dynamic_ss seems also to do the trick over here.

Comment 59 Alexandre Demers 2013-12-09 16:56:05 UTC

This morning, I briefly tested the kernel with only mclk_ss disabled and it seems to work correctly. Martin may have put his finger on the problem.

Comment 60 Martin Andersson 2013-12-09 20:01:59 UTC

Instead of triggering the power level switches by running GpuTest in bursts, I put this in a bash script:

for i in {1..3600}
do
   echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level
   sleep 1
   echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
   sleep 5
done

and just let GpuTest, and piglit, run continuously. I found that this also triggers the lockups within minutes.

Then I ran the script by itself, no GpuTest or piglit, and left it running while I was at work. When I came home the machine was still running, so it ran for six hours without any lockup. So it seems the power level switching alone is not sufficient to trigger the lockups, it also needs a load of some sort.
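
As a cross-check on the timing, and for anyone who prefers to drive the switching from Python, here is a rough equivalent of the bash loop above. The sysfs path is the one used in that script; the function names and the injectable `sleep` hook are only for illustration, and writing to the sysfs file needs root.

```python
import time

FORCE_PATH = "/sys/class/drm/card0/device/power_dpm_force_performance_level"


def toggle_levels(iterations=3600, low_s=1, high_s=5,
                  path=FORCE_PATH, sleep=time.sleep):
    """Alternate the forced performance level between 'low' and 'high'."""
    for _ in range(iterations):
        for level, delay in (("low", low_s), ("high", high_s)):
            with open(path, "w") as f:
                f.write(level + "\n")
            sleep(delay)


def total_runtime_hours(iterations=3600, low_s=1, high_s=5):
    """Time spent sleeping over the whole loop, in hours."""
    return iterations * (low_s + high_s) / 3600


print(total_runtime_hours())
# Example (not executed here, needs root): toggle_levels()
```

With the defaults the loop sleeps for 3600 x 6 seconds, i.e. six hours, which is consistent with the run length mentioned above.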

Comment 61 Alexandre Demers 2013-12-09 20:37:32 UTC

(In reply to comment #60)
> Instead of triggering the power level switches by running GpuTest in bursts,
> I put this in a bash script:
> 
> for i in {1..3600}
> do
>    echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level
>    sleep 1
>    echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
>    sleep 5
> done
> 
> and just let GpuTest, and piglit, run continuously. I found that this also
> triggers the lockups within minutes.
> 
> Then I ran the script by itself, no GpuTest or piglit, and left it running
> while I was at work. When I came home the machine was still running, so it
> ran for six hours without any lockup. So it seems the power level switching
> alone is not sufficient to trigger the lockups, it also needs a load of some
> sort.

Do you mean while spread spectrum is still enabled?

Comment 62 Martin Andersson 2013-12-09 20:43:13 UTC

(In reply to comment #61)
> (In reply to comment #60)
> > Instead of triggering the power level switches by running GpuTest in bursts,
> > I put this in a bash script:
> > 
> > for i in {1..3600}
> > do
> >    echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level
> >    sleep 1
> >    echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
> >    sleep 5
> > done
> > 
> > and just let GpuTest, and piglit, run continuously. I found that this also
> > triggers the lockups within minutes.
> > 
> > Then I ran the script by itself, no GpuTest or piglit, and left it running
> > while I was at work. When I came home the machine was still running, so it
> > ran for six hours without any lockup. So it seems the power level switching
> > alone is not sufficient to trigger the lockups, it also needs a load of some
> > sort.
> 
> Do you mean while spread spectrum is still enabled?

Yes

Comment 63 Alex Deucher 2013-12-09 22:59:48 UTC

Created attachment 90542 [details] [review]
possible fix

Thanks for tracking this down.  The attached patch should fix the issue.  With this fixed, it may be worth checking to see if you can reliably use the tweaked clocks on certain oem boards (basically disable the ni code added to fix bug 69723).

Comment 64 Alexandre Demers 2013-12-10 00:01:06 UTC

(In reply to comment #63)
> Created attachment 90542 [details] [review] [review]
> possible fix
> 
> Thanks for tracking this down.  The attached patch should fix the issue. 
> With this fixed, it may be worth checking to see if you can reliably use the
> tweaked clocks on certain oem boards (basically disable the ni code added to
> fix bug 69723).

You certainly meant bug 68235. I'll test it in a couple of minutes.

Comment 65 Alexandre Demers 2013-12-10 01:08:07 UTC

(In reply to comment #63)
> Created attachment 90542 [details] [review] [review]
> possible fix
> 
> Thanks for tracking this down.  The attached patch should fix the issue. 
> With this fixed, it may be worth checking to see if you can reliably use the
> tweaked clocks on certain oem boards (basically disable the ni code added to
> fix bug 69723).

I reverted patches from bug 68235 as suggested, applied the proposed patch to disable mclk spread spectrum and I'm now running this new kernel. First observation: without patches from bug 68235 prior to disabling mclk ss, I was unable to load a session without it hanging in less than a minute. It's now rock solid running GPU clock at 830MHz and memory clock at 1300MHz. I'll run a couple of other tests, but I think we got it!

Comment 66 Alexandre Demers 2013-12-10 02:14:20 UTC

(In reply to comment #65)
> (In reply to comment #63)
> > Created attachment 90542 [details] [review] [review] [review]
> > possible fix
> > 
> > Thanks for tracking this down.  The attached patch should fix the issue. 
> > With this fixed, it may be worth checking to see if you can reliably use the
> > tweaked clocks on certain oem boards (basically disable the ni code added to
> > fix bug 69723).
> 
> I reverted patches from bug 68235 as suggested, applied the proposed patch
> to disable mclk spread spectrum and I'm now running this new kernel. First
> observation: without patches from bug 68235 prior to disabling mclk ss, I
> was unable to load a session without it hanging in less than a minute. It's
> now rock solid running GPU clock at 830MHz and memory clock at 1300MHz. I'll
> run a couple of other tests, but I think we got it!

Sadly, I pushed the card a bit harder with Serious Sam 3 and it hung again. So I'll try two different things:
- keep spread spectrum disabled only for mclk but reapplying both patches from bug 68235;
- disable all spread spectrum without reapplying the other two patches.

Comment 67 Alexandre Demers 2013-12-10 04:31:37 UTC

I may have been lucky until now, but disabling the whole ss seems to be better. I'll continue testing.

Comment 68 Martin Andersson 2013-12-10 09:09:08 UTC

It seems I also spoke too soon, because I tried again with mclk disabled, but this time with the test method where I forced the low and high power levels by using /sys/class/drm/card0/device/power_dpm_force_performance_level, and this time I got a lockup.

So either it was just chance that I didn't hit the lockup with mclk disabled the first time and/or my second test method has a higher chance of triggering the lockup.

I'm currently running a test with dynamic_ss disabled.

Comment 69 Martin Andersson 2013-12-10 19:20:32 UTC

I have now completed a 12+ hour long test run with dynamic_ss disabled (mclk_ss was also disabled) without any problem. So it seems that disabling mclk_ss makes it a little more stable, but disabling dynamic_ss makes it much more stable. I have not had any lockups with dynamic_ss disabled, but I haven't tested it that much. Only this long session and one 6+ hour long session before that.

The next thing I'm gonna test is dynamic_ss disabled and the patches from https://bugs.freedesktop.org/show_bug.cgi?id=68235 reverted.

Comment 70 Alexandre Demers 2013-12-11 03:33:59 UTC

(In reply to comment #67)
> I may have been lucky until now, but disabling the whole ss seems to be
> better. I'll continue testing.

It took me longer, but I hung the GPU again running Phoronix test suite. Nothing in the journal about the latest hang. I'm now testing with mclk_ss disabled on stock 3.13-rc3 (no patch reverted).

Comment 71 Alexandre Demers 2013-12-11 03:36:03 UTC

(In reply to comment #70)
> (In reply to comment #67)
> > I may have been lucky until now, but disabling the whole ss seems to be
> > better. I'll continue testing.
> 
> > It took me longer, but I hung the GPU again running Phoronix test suite.
> Nothing in the journal about the latest hang. I'm now testing with mclk_ss
> disabled on stock 3.13-rc3 (no patch reverted).

Oops, I may have to revisit what I just said. I'll be in touch soon.

Comment 72 Martin Andersson 2013-12-11 19:26:44 UTC

I have now run another 12+ hour test without problems, where I disabled both sclk_ss and mclk_ss and disabled the clock limiting code (even though I didn't see any difference to the clock speeds).

So at least for me it seems very stable with both sclk_ss and mclk_ss disabled. I will continue using it to see if I will run into any problems.

Comment 73 Alexandre Demers 2013-12-11 19:59:55 UTC

(In reply to comment #72)
> I have now run another 12+ hour test without problems, where I disabled both
> sclk_ss and mclk_ss and disabled the clock limiting code (even though I
> didn't see any difference to the clock speeds).
> 
> So at least for me it seems very stable with both sclk_ss and mclk_ss
> disabled. I will continue using it to see if I will run into any problems.

Good to know. I'll be also confirming on my side soon (maybe tomorrow).

Comment 74 Alexandre Demers 2013-12-12 14:07:53 UTC

On my side, disabling spread spectrum but keeping the patches limiting the gpu and mem clocks to reference board ones seems to be OK. I've been running phoronix-test-suite tests for the last day without problem. Still, while it seems stable, it may not correct everything, only time will tell.

I'll have to test again with patches from bug 68235 reverted because I suspect I had selected the wrong kernel when I last posted my results about it.

Comment 75 Alex Deucher 2013-12-12 14:55:37 UTC

Created attachment 90667 [details]
possible fix

Sounds like both are causing problems.

Comment 76 Alexandre Demers 2013-12-13 02:48:48 UTC

(In reply to comment #75)
> Created attachment 90667 [details]
> possible fix
> 
> Sounds like both are causing problems.

Since my last cold boot, I tried with unlimited clocks (reverted previous patches) with hangs almost at start. I tried again with stock 3.13-rc3 with spread spectrum disabled, and I also hit hangs when testing (without having to push the video card with heavy tests). So, for me, disabling spread spectrum may sometimes help, but it is not a real solution to this bug.

Sorry for bringing bad news.

Comment 77 Alex Deucher 2013-12-13 03:03:07 UTC

(In reply to comment #76)
>
> Since my last cold boot, I tried with unlimited clocks (reverted previous
> patches) with hangs almost at start. I tried again with stock 3.13-rc3 with
> spread spectrum disabled, and I also hit hangs when testing (without having
> to push the video card with heavy tests). So, for me, disabling spread
> spectrum may sometimes help, but it is not a real solution to this bug.
> 
> Sorry for bringing bad news.

Make sure you try on a cold boot.

Comment 78 Alexandre Demers 2013-12-13 05:39:45 UTC

(In reply to comment #77)
> (In reply to comment #76)
> >
> > Since my last cold boot, I tried with unlimited clocks (reverted previous
> > patches) with hangs almost at start. I tried again with stock 3.13-rc3 with
> > spread spectrum disabled, and I also hit hangs when testing (without having
> > to push the video card with heavy tests). So, for me, disabling spread
> > spectrum may sometimes help, but it is not a real solution to this bug.
> > 
> > Sorry for bringing bad news.
> 
> Make sure you try on a cold boot.

I had already done it and from what I could see until now, it doesn't change anything if it is a cold or a hot boot. Just in case, after your comment, I did it again and it still ended hung (3.13-rc3 with all spread spectrum disabled). It happened before that my system would not hang for some reason once in many boots, maybe that's what happened.

Comment 79 Martin Andersson 2013-12-13 08:21:32 UTC

Maybe we don't have the same issue then (or our cards have different sensitivity to this issue) because I see a huge improvement with spread spectrum disabled. I saw that you had got hangs while playing Serious Sam 3, so I bought it and tried with spread spectrum enabled and it hung within 5 minutes, but when I disabled spread spectrum I played for 2 hours without problems.

I will continue testing it and report back if I encounter any problems.

Comment 80 Alexandre Demers 2013-12-13 17:12:15 UTC

(In reply to comment #79)
> Maybe we don't have the same issue then (or our cards have different
> sensitivity to this issue) because I see a huge improvement with spread
> spectrum disabled. I saw that you had got hangs while playing serious sam 3,
> so I bought it and tried with spread spectrum enabled and it hung within 5
> minutes, but when I disabled spread spectrum I played for 2 hours without
> problems.
> 
> I will continue testing it and report back if I encounter any problems.

That's why I suggested to push it as a workaround for now since it seems to help both of us on different levels. Only, it's not a total cure on my side.

Comment 81 Alexandre Demers 2013-12-16 06:13:29 UTC

I think I've figured out what's going on (now running rc4 with ss disabled patch + a little modification). I'll confirm it and I'll be back.

Comment 82 Martin Andersson 2013-12-16 20:11:34 UTC

I can now also report that disabling spread spectrum isn't a complete fix for me either. It is much better but I still get hangs.

Comment 83 Alexandre Demers 2013-12-20 01:27:22 UTC

Alex Deucher, would there be any interest in testing "[PATCH 00/18] Rework PM init order" and the following ones?

Comment 84 Alex Deucher 2013-12-20 13:42:38 UTC

(In reply to comment #83)
> Alex Deucher, would there be any interest in testing "[PATCH 00/18] Rework
> PM init order" and the following ones?

Yeah, probably worth a shot.  For convenience, I've pushed the patches to a branch:
http://cgit.freedesktop.org/~agd5f/linux/log/?h=dpm-reorder

Comment 85 Alexandre Demers 2013-12-22 01:46:26 UTC

(In reply to comment #84)
> (In reply to comment #83)
> > Alex Deucher, would there be any interest in testing "[PATCH 00/18] Rework
> > PM init order" and the following ones?
> 
> Yeah, probably worth a shot.  For convenience, I've pushed the patches to a
> branch:
> http://cgit.freedesktop.org/~agd5f/linux/log/?h=dpm-reorder

Nope, not working...

Comment 86 Alexandre Demers 2014-01-12 21:19:54 UTC

Martin, could you try something? I may have been lucky until now, but I've been running with kernel 3.13-rc7 with HyperZ disabled and for now it has been stable (inspired by another bug report, bug 73088).

Comment 87 Martin Andersson 2014-01-12 22:26:55 UTC

(In reply to comment #86)
> Martin, could you try something? I may have been lucky until now, but I've
> been running with kernel 3.13-rc7 with HyperZ disabled and for now it has
> been stable (inspired by another bug report, bug 73088).

I'm running 3.13-rc7 and it has been stable for me so far, but I have only played FTL, since I got bored with Serious Sam 3.

I will try to find a benchmark that triggers the hang and then I can see if disabling HyperZ helps.

Comment 88 Alexandre Demers 2014-01-12 23:30:51 UTC

Forget that... locked as always.

I'll test 3.14 soon, either from Alex's drm-next or when we'll get 3.14-rc1...

Comment 89 Marc 2014-01-15 21:02:45 UTC

I am on 3.12.7 and R600_DEBUG=nohyperz seems to work fine for me. Without it, it crashes after just a couple of minutes (machine rebooting automatically). I have a HD6950. So thank you for the tip.

Comment 90 Alexandre Demers 2014-01-15 21:15:16 UTC

(In reply to comment #89)
> I am on 3.12.7 and the R600_DEBUG=nohyperz seems to work fine for me.
> Without it, it crashes after just a couple of minutes (machine rebooting
> automatically). I have a HD6950. So thank you for the tip.

Are you seeing this problem with HyperZ since the introduction of dpm or is it something new (a regression)? Have you tried a 3.13-rcX kernel to see if this is still happening?

Comment 91 Alex Deucher 2014-01-15 21:17:59 UTC

(In reply to comment #88)
> Forget that... locked as always.
> 
> I'll test 3.14 soon, either from Alex's drm-next or when we'll get
> 3.14-rc1...

Have you tried disabling hyperz globally rather than just for a specific app?  E.g., set env var R600_DEBUG=nohyperz in /etc/environment or however your distro handles global env vars.
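A possible way to apply the global setting suggested above without duplicating the line on repeated runs is sketched below. The variable and the /etc/environment path come from the comment above; the helper name is hypothetical, and it assumes the target file uses plain NAME=value lines and that a new login session is needed before the variable is picked up.

```bash
# Sketch: append NAME=value to an environment file only if that exact line
# is not already present. Helper name is hypothetical.
set_global_env() {
    local env_file="$1"     # e.g. /etc/environment
    local assignment="$2"   # e.g. R600_DEBUG=nohyperz

    # -x: match the whole line, -F: treat the text as a fixed string.
    if grep -qxF "$assignment" "$env_file" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$assignment" >> "$env_file"
}

# Example (as root):
# set_global_env /etc/environment R600_DEBUG=nohyperz
#
# For a single application only, the variable can be set on the command line:
# R600_DEBUG=nohyperz glxgears
```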

Comment 92 Alexandre Demers 2014-01-15 21:33:01 UTC

(In reply to comment #91)
> (In reply to comment #88)
> > Forget that... locked as always.
> > 
> > I'll test 3.14 soon, either from Alex's drm-next or when we'll get
> > 3.14-rc1...
> 
> Have you tried disabling hyperz globally rather than just for a specific
> app?  E.g., set env var R600_DEBUG=nohyperz in /etc/environment or however
> your distro handles global env vars.

That's what I did. I'll test it again, just in case.

Comment 93 Marc 2014-01-16 08:12:59 UTC

I thought R600_DEBUG=nohyperz worked out for me, but it just delayed the crash: instead of a couple of minutes, it took almost 1h.

Comment 94 Alexandre Demers 2014-01-16 20:39:11 UTC

And funny enough on my side, I cold booted with this option last night and was able to play all the time, go online and run some movies without any problem.

However, I must specify some modifications I did recently:
Updated LLVM to 3.4 (this prevents some application crashes I was experiencing)
Switched to GLAMOR (fixed a rendering issue I had when not using it with applications using LLVM 3.4)

However, I've been lucky before from time to time where I had no crashes, so it might just be that. But I suspect cold booting and hot booting may be part of the problem with hyperz disabled. More tests to come tonight.

Comment 95 Alexandre Demers 2014-01-17 15:43:08 UTC

Second night without any problem (Video, Glamor and games were all tested). Now, I'll try to find what seems to be part of the solution by reverting one change at a time until the system begins to hang again.

Comment 96 Alex Deucher 2014-01-17 15:53:26 UTC

(In reply to comment #95)
> Second night without any problem (Video, Glamor and games were all tested).
> Now, I'll try to find what seems to be part of the solution by reverting one
> change at a time until the system begins to hang again.

It may be some sort of synchronization issue between the EXA acceleration code in ddx and the 3D acceleration code in mesa.  When you use glamor, all acceleration uses the code in mesa.

Comment 97 Alexandre Demers 2014-01-17 16:36:41 UTC

(In reply to comment #96)
> (In reply to comment #95)
> > Second night without any problem (Video, Glamor and games were all tested).
> > Now, I'll try to find what seems to be part of the solution by reverting one
> > change at a time until the system begins to hang again.
> 
> It may be some sort of synchronization issue between the EXA acceleration code
> in ddx and the 3D acceleration code in mesa.  When you use glamor, all
> acceleration uses the code in mesa.

Indeed. I'll begin by reverting this change first then.

Comment 98 Alexandre Demers 2014-01-18 03:08:33 UTC

I went back to EXA and my display froze in under 10 minutes. So it begins to look like a trend to me about EXA vs. Glamor. Still, I'm continuing to test (I've just activated Glamor once more).

Comment 99 Marc 2014-01-18 12:48:34 UTC

I tried radeon.dpm=1 and glamor. It took 5h but same issue, my screen turned white and the PC rebooted.
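When comparing reports like this one, it can help to confirm that the running kernel was really booted with radeon.dpm=1. The sketch below only tests for the explicit parameter on the kernel command line; the helper name is hypothetical, and a kernel where the driver may enable DPM by default would not show the parameter at all.

```bash
# Sketch: check whether a kernel command line contains an exact parameter.
# Helper name is hypothetical.
cmdline_has_param() {
    # Pad both sides with spaces so that only whole, space-separated
    # parameters match (radeon.dpm=1 must not match radeon.dpm=10).
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# if cmdline_has_param "$(cat /proc/cmdline)" radeon.dpm=1; then
#     echo "dpm explicitly enabled"
# fi
```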

Comment 100 Alexandre Demers 2014-01-18 17:01:17 UTC

I also experienced a hang last night when quitting a game with my said "more stable" setup. It does seem to greatly help though on my side.

Comment 101 Marc 2014-01-20 13:58:37 UTC

It does greatly help for me as well as it lasted 5h.

But setting the option to glamor isn't so good for me because there were display artifacts in some cases. For example I would open vinagre and the header bar (where there is Remote / View / ... menus and the Connect / Disconnect / ... buttons) wouldn't be drawn, I would see the content of what was there on the screen before I open that window instead. Same with gnome-control-center.

I just noticed a package update in my distrib today: glamor-egl-0.5.1.r258-1-x86_64

I might give it another shot. I'll also update my kernel to 3.13 as I am on 3.12.7 at the moment.

Comment 102 Alexandre Demers 2014-01-23 02:28:38 UTC

Went ahead with drm-next 3.14 and it still hangs from time to time (not better, not worse)

Comment 103 Alex Deucher 2014-01-30 17:37:12 UTC

I don't remember if we've tried this recently, but does disabling power containment help?

diff --git a/drivers/gpu/drm/radeon/ni_dpm.c b/drivers/gpu/drm/radeon/ni_dpm.c
index 22c3391..19b7c68 100644
--- a/drivers/gpu/drm/radeon/ni_dpm.c
+++ b/drivers/gpu/drm/radeon/ni_dpm.c
@@ -4250,7 +4250,7 @@ int ni_dpm_init(struct radeon_device *rdev)
                break;
        }
 
-       if (ni_pi->cac_weights->enable_power_containment_by_default) {
+       if (0/*ni_pi->cac_weights->enable_power_containment_by_default*/) {
                ni_pi->enable_power_containment = true;
                ni_pi->enable_cac = true;
                ni_pi->enable_sq_ramping = true;

Comment 104 Alexandre Demers 2014-01-30 18:30:09 UTC

(In reply to comment #103)
> I don't remember if we've tried this recently, but does disabling power
> containment help?
> 
> diff --git a/drivers/gpu/drm/radeon/ni_dpm.c
> b/drivers/gpu/drm/radeon/ni_dpm.c
> index 22c3391..19b7c68 100644
> --- a/drivers/gpu/drm/radeon/ni_dpm.c
> +++ b/drivers/gpu/drm/radeon/ni_dpm.c
> @@ -4250,7 +4250,7 @@ int ni_dpm_init(struct radeon_device *rdev)
>                 break;
>         }
>  
> -       if (ni_pi->cac_weights->enable_power_containment_by_default) {
> +       if (0/*ni_pi->cac_weights->enable_power_containment_by_default*/) {
>                 ni_pi->enable_power_containment = true;
>                 ni_pi->enable_cac = true;
>                 ni_pi->enable_sq_ramping = true;

I don't remember playing with it lately, so I'll try it either later today (tonight) or tomorrow. I have to complete a report first for a personal project.

Comment 105 Alexandre Demers 2014-02-02 19:47:47 UTC

(In reply to comment #104)
> (In reply to comment #103)
> > I don't remember if we've tried this recently, but does disabling power
> > containment help?
> > 
> > diff --git a/drivers/gpu/drm/radeon/ni_dpm.c
> > b/drivers/gpu/drm/radeon/ni_dpm.c
> > index 22c3391..19b7c68 100644
> > --- a/drivers/gpu/drm/radeon/ni_dpm.c
> > +++ b/drivers/gpu/drm/radeon/ni_dpm.c
> > @@ -4250,7 +4250,7 @@ int ni_dpm_init(struct radeon_device *rdev)
> >                 break;
> >         }
> >  
> > -       if (ni_pi->cac_weights->enable_power_containment_by_default) {
> > +       if (0/*ni_pi->cac_weights->enable_power_containment_by_default*/) {
> >                 ni_pi->enable_power_containment = true;
> >                 ni_pi->enable_cac = true;
> >                 ni_pi->enable_sq_ramping = true;
> 
> I don't remember playing with it lately, so I'll try it either later today
> (tonight) or tomorrow. I have to complete a report first for a personal
> project.

I've been testing it since Friday night (Desktop and games). I had a single lock, but it was while playing a video. Since I suspect this may be related to other issues that I've seen patches for, I'll continue testing it and I'll report in a couple of days if things seem stable (or more stable).

Comment 106 Alexandre Demers 2014-03-07 04:32:33 UTC

Alex Deucher, should we try https://bugzilla.kernel.org/attachment.cgi?id=128321 as proposed in other bugs?

Comment 107 Alex Deucher 2014-03-07 14:12:24 UTC

(In reply to comment #106)
> Alex Deucher, should we try
> https://bugzilla.kernel.org/attachment.cgi?id=128321 as proposed in other
> bugs?

No, that patch only applies to evergreen and BTC parts.

Comment 108 Alexandre Demers 2014-03-11 19:12:50 UTC

Sorry for not giving any news with kernel 3.14 cycle for now: I'm struggling with another bug unrelated to GPU and I haven't had much time to complete bisecting it. Once done and fixed, I'll give some new inputs.

On a different topic, I hit a GPU hang/reset situation this week with a 3.13.6 kernel while dpm was enabled, which is... extremely rare. Usually, it either hangs completely at some point or works for some time without problem. I was able to take a screenshot. I'll look in the logs to see if I can get something out. However, I was wondering if I should push it here or open a new bug. Any suggestion?

Comment 109 Alexandre Demers 2014-04-11 22:19:46 UTC

Can we expect something similar to what was proposed in bug 75992 as a possible fix (new ucode)? Just praying for a new proposition to test over here...

Comment 110 Alex Deucher 2014-04-11 22:23:13 UTC

(In reply to comment #109)
> Can we expect something similar to what was proposed in bug 75992 as a
> possible fix (new ucode)? Just praying for a new proposition to test over
> here...

I already checked.  There's no new mc ucode for cayman.

Comment 111 Alexandre Demers 2014-04-21 16:28:45 UTC

To update this bug a bit: it is still experienced with kernel 3.15-rc1.

Also, the freeze only occurs when something on the display changes (scrolling a window, clicking a link, opening a new window, playing a video, the display going into standby). It will not freeze if there is no modification to the display.

I'm convinced that spread spectrum is not linked to the bug. It just helps stabilize things, but it is not the root cause of the problem.

By any chance, do you know what was modified in the new ucodes for Bonaire and above?

It should be noted that there seems to be no correlation between the power level and the occurrence of the bug. What I mean is, as long as DPM is enabled, it can happen at the lowest power level at any time; the same goes when the power level is at its highest. While I don't hear the profile change (by that I mean the fan usually spins faster or slower when it does), I'm pretty sure it has to do with the memory controller, and I'm thinking more and more that it comes from the ucode. Could it be that the memory controller doesn't wait until it has completed its change to a new state before changing it again? That would explain why it can happen even when doing light work (a very short raise in power level and a return to the previous state wouldn't be heard, since the fan wouldn't have time to accelerate).
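One way to look for such short, inaudible power level changes would be to log the driver's reported state at a fixed interval and line the log up with the time of a hang. This is a rough sketch under assumptions: the debugfs path in the usage line and the helper name are not taken from this report, the file's format is not parsed, and sampling can miss transitions shorter than the interval.

```bash
# Sketch: sample a power-management status file and print one timestamped
# line per sample. Helper name and the debugfs path below are assumptions.
log_pm_state() {
    local pm_file="$1"        # status file to sample
    local samples="${2:-60}"  # number of samples to take
    local interval="${3:-1}"  # seconds between samples

    if [ ! -r "$pm_file" ]; then
        echo "cannot read $pm_file" >&2
        return 1
    fi

    local n=0
    while [ "$n" -lt "$samples" ]; do
        # Collapse the multi-line status into a single log line.
        printf '%s %s\n' "$(date +%s)" "$(tr '\n' ' ' < "$pm_file")"
        sleep "$interval"
        n=$((n + 1))
    done
}

# Example (as root, debugfs mounted; path is an assumption):
# log_pm_state /sys/kernel/debug/dri/0/radeon_pm_info 3600 1 >> pm.log
```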

Comment 112 Alex Deucher 2014-07-01 16:19:12 UTC

Created attachment 102082 [details] [review]
possible fix

Does this patch fix the issues?

Comment 113 Alexandre Demers 2014-07-01 18:32:01 UTC

Thanks Alex, I had seen this patch yesterday and I added it to my list of things to test. I'll test it as soon as possible, but I still need to fix my kernel build setup. I had to reinstall my whole system a couple of weeks ago. I should be able to test it in the next couple of days.

Comment 114 Marc 2014-07-01 20:15:20 UTC

I am trying the patch on kernel 3.15.0 (I modified line 1318). I'll give a feedback in 24h.

Comment 115 Alexandre Demers 2014-07-02 22:27:56 UTC

Testing as of now with some games and videos. Temperature is around 79 Celsius, power level changing as expected.

If everything works as we want, maybe we'll be able to look at bug 69721 (to reach full speed capacity for cards not sticking to reference board)

Comment 116 Alexandre Demers 2014-07-03 02:52:11 UTC

Up until now, no problem encountered. The problem fixed by the patch also points in the same direction as some of my previous tests and suppositions. I think this might really be our culprit.

Which means we should also be in a position to reenable Spread Spectrum if everything continues to run smoothly.

I'll give you some updates tomorrow.

Comment 117 Marc 2014-07-03 07:30:16 UTC

Same here, no freeze with this patch for 24h, so it looks like this bug might be fixed. I'll confirm again at the end of the day.

Comment 118 Alexandre Demers 2014-07-04 21:28:50 UTC

Well, everything seems to still be fine. Suspended, woken up, ran some games (L4D2, FEZ, The Witcher 2, SS3), navigated on the web, watched some movies and still running.

Comment 119 Martin Andersson 2014-07-06 16:42:01 UTC

I have been running 3.15 plus the latest patch for a couple of days now. So far not a single lockup.

Comment 120 Alexandre Demers 2014-07-06 20:27:23 UTC

Alex, I think we can close this bug as soon as the "drm/radeon/dpm: fix vddci setup typo on cayman" patch lands in the kernel tree: I had no problem for the last couple of days since I applied the patch. The same goes for Martin.

Any chance of seeing it included in kernel 3.16 (against which I'm testing the patch)?

Comment 121 Alexandre Demers 2014-07-06 20:33:21 UTC

I just saw the latest RC kernel and the patch was included. Thanks!

Comment 122 Alexandre Demers 2014-07-07 17:13:46 UTC

Tested with kernel 3.16-RC4 (which includes the patch by commit b0880e87c1fd038b84498944f52e52c3e86ebe59) and still working flawlessly. Closing this bug. I'll reopen it if needed.
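To check whether a given kernel tree already carries this commit, something along these lines should work. It is a sketch that assumes a local git clone of the kernel; the helper name, the clone path and the tag in the usage line are placeholders.

```bash
# Sketch: succeed if COMMIT is an ancestor of REF in the repository at REPO.
# Returns non-zero if it is not, or if either object cannot be found.
# Helper name is hypothetical.
kernel_has_commit() {
    local repo="$1"         # path to a local kernel clone
    local commit="$2"       # commit to look for
    local ref="${3:-HEAD}"  # branch, tag or commit to check against

    git -C "$repo" merge-base --is-ancestor "$commit" "$ref"
}

# Example (clone path and tag are assumptions):
# kernel_has_commit ~/src/linux b0880e87c1fd038b84498944f52e52c3e86ebe59 v3.16-rc4 \
#     && echo "fix is included"
```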

Comment 123 Marc 2014-07-11 18:22:35 UTC

I just got 1 lockup. The 1st in 10 days.

Comment 124 Alexandre Demers 2014-07-11 20:12:04 UTC

(In reply to comment #123)
> I just got 1 lockup. The 1st in 10 days.

Do you have exactly the same symptoms as before? On my side, everything is still fine. The only problem I've encountered is related to X.

I've been running and testing everything with kernel 3.16 which has some other fixes related to Cayman, so maybe you encountered a different issue than the one from the current bug.
