Bug 45511

Summary: Services that BindTo another service should be restarted if systemd restarts the target service when it exits.
Product: systemd Reporter: Chris Paulson-Ellis <chris>
Component: generalAssignee: Lennart Poettering <lennart>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: colin, david.ward
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: schedule JOB_RESTART from SERVICE_AUTO_RESTART state
service: actually delay auto-restart if another job is pending
service: fix auto-restart handling in service_stop()
service: fix auto-restart handling in service_start()

Description Chris Paulson-Ellis 2012-02-01 11:33:38 UTC
At present, if a client uses BindTo on a server, then if the server is restarted with systemctl, then the client is restarted too. However, if the server is restarted after exiting due to Restart=always, then the client is not restarted.

See discussion in http://thread.gmane.org/gmane.comp.sysutils.systemd.devel/4283
Comment 1 David Ward 2012-02-12 21:14:28 UTC
Created attachment 56944 [details] [review]
schedule JOB_RESTART from SERVICE_AUTO_RESTART state

I agree. Services that Require a target service continue to run if the target service automatically restarts. It seems appropriate that services which BindTo a target service should be restarted but ultimately continue to run if the target service automatically restarts. Right now the services are stopped but do not start back. In my case I am trying to replace the 'watchquagga' supervisor by using the automatic restart capabilities of systemd, but this is the piece that doesn't work.

The attached patch fixes the behavior for me. When a service is in the 'auto-restart' state, a 'restart' job is now created for it, rather than placing the service into the 'dead' or 'failed' states and then creating a 'start' job for it. As a result, during the dependency checking a 'restart' job will be created for other services that bind to the first one, rather than a 'stop' job. I believe this also means that units specified in OnFailure= will not be launched during an automatic restart, unless the service ultimately gives up on restarting and moves to a 'failed' state.
Comment 2 Lennart Poettering 2012-04-03 05:34:03 UTC
Thanks, patch applied.

And sorry again for letting this rot for so long. We had some other patches queued up in that area and I wanted those right first.
Comment 3 Colin Guthrie 2012-04-08 05:33:37 UTC
This would appear to be the commit responsible for breaking OnFailure= handling....

I had an OnFailure= directive in my prefdm.service (which has Restart=always) but even when prefdm fails (and is reported as such), the unit referred to never gets started.

Reverting this commit solved that.

So while the warning about the OnFailure mentioned here is true, it seems that it is never run at all now..

Note that this is seen when booting the system, and thus starting the service, not restarting it - tho' not sure if that' makes a difference.

Reopening for now, but please feel free to close it again if you feel a separate bug is more appropriate.
Comment 4 Colin Guthrie 2012-04-08 05:48:54 UTC
So looking further, it seems that with this patch applied, after systemd gives up restarting the unit it enters an inactive+dead state:

Description=Display Manager
LoadState=loaded
ActiveState=inactive
SubState=dead


Without the patch applied it is in a failed+failed state:

Description=Display Manager
LoadState=loaded
ActiveState=failed
SubState=failed


I suspect that this difference is what causes the OnFailure to never kick in.

I agree that the OnFailure action should not kick in until the final failure statue is reached.... we just need to ensure it does actually get there :D

Now, I'm not sure if the SubState of failed would mean that systemd will try and restart it again later automatically. IMO we don't want this - i.e. it should require admin interaction.

So would a:
ActiveState=failed
SubState=failed-permanently

make sense here (if such a substate exists... I've seen the phrase fly past but not looked at it properly).
Comment 5 David Ward 2012-04-22 20:10:15 UTC
Colin,

Is this still happening? How do you create a condition where the service fails and gives up on auto-restarting (so I can try to reproduce it)?

David
Comment 6 Colin Guthrie 2012-04-23 00:58:19 UTC
Hi David,

As far as I'm aware, this is still an issue although I've not tried a full git master of late.

Here are a couple of sample units that should reproduce the problem:

[root@jimmy system]# cat my-failing.service
[Unit]
Description=A Failing Service

OnFailure=my-on-failure.service

[Service]
ExecStart=/bin/false
Restart=always
RestartSec=0

[root@jimmy system]# cat my-on-failure.service
[Unit]
Description=An On Failure Service

[Service]
ExecStart=/bin/touch /tmp/it-failed.txt
Type=oneshot
RemainAfterExit=yes


So I dropped thos services in /lib/systemd/systemd/ and did a daemon-reload:

[root@jimmy system]# systemctl status my-failing.service my-on-failure.service
my-failing.service - A Failing Service
	  Loaded: loaded (/lib/systemd/system/my-failing.service; static)
	  Active: inactive (dead)
	  CGroup: name=systemd:/system/my-failing.service


my-on-failure.service - An On Failure Service
	  Loaded: loaded (/lib/systemd/system/my-on-failure.service; static)
	  Active: inactive (dead)
	  CGroup: name=systemd:/system/my-on-failure.service


Both units are loaded, but inactive. Without the patch applied, I start the failing service:

[root@jimmy system]# systemctl start my-failing.service
[root@jimmy system]# systemctl status my-failing.service my-on-failure.service
my-failing.service - A Failing Service
	  Loaded: loaded (/lib/systemd/system/my-failing.service; static)
	  Active: failed (Result: exit-code) since Mon, 23 Apr 2012 08:50:49 +0100; 3min 45s ago
	 Process: 16606 ExecStart=/bin/false (code=exited, status=1/FAILURE)
	  CGroup: name=systemd:/system/my-failing.service


my-on-failure.service - An On Failure Service
	  Loaded: loaded (/lib/systemd/system/my-on-failure.service; static)
	  Active: active (exited) since Mon, 23 Apr 2012 08:50:49 +0100; 3min 45s ago
	 Process: 16593 ExecStart=/bin/touch /tmp/it-failed.txt (code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/system/my-on-failure.service

[root@jimmy system]# ls -l /tmp/it-failed.txt 
-rw-r--r-- 1 root root 0 Apr 23 08:50 /tmp/it-failed.txt



So as you can see, all works as intended and the OnFailure unit is activated. With the patch applied however this failure unit never got started.

Hope that's enough info to reproduce the problem. I've not tried this with the patch applied but these are copied from the setup we used in production so I think it should be valid. I'll try and reapply the patch and retry if you don't get it immediately working.
Comment 7 Chris Paulson-Ellis 2012-04-23 02:00:59 UTC
Hi,

I filed this issue and although I haven't checked recently, I have no reason to believe it's resolved. See the mailing list link in the report for a description of how to reproduce (client.service BindTo -> server.service with Restart=always; kill server - client doesn't restart).

Chris.
Comment 8 Colin Guthrie 2012-04-23 02:42:44 UTC
@Chris: I thought the patch David wrote and which was committed (see comment 2) solved the original problem you reported?

My issue is that it introduces a regression which I've outlined and detailed.
Comment 9 Chris Paulson-Ellis 2012-04-23 02:48:32 UTC
Sorry you're right... I misread what David was asking and was assuming that his patch had been reverted, leaving my original issue unresolved.
Comment 10 David Ward 2012-04-29 07:15:00 UTC
Created attachment 60770 [details] [review]
service: actually delay auto-restart if another job is pending
Comment 11 David Ward 2012-04-29 07:15:37 UTC
Created attachment 60771 [details] [review]
service: fix auto-restart handling in service_stop()
Comment 12 David Ward 2012-04-29 07:19:44 UTC
Created attachment 60772 [details] [review]
service: fix auto-restart handling in service_start()

Colin, please try the three additional patches I am attaching to this bug, which fix your issue for me, as well as some other ones.

Note that the original behavior is restored -- OnFailure units will be triggered after each time that a service fails. There is an RFE for using StartLimitAction to trigger units only after the service gives up trying to restart: https://bugzilla.redhat.com/show_bug.cgi?id=746279
Comment 13 David Ward 2012-04-29 10:55:37 UTC
Also note that these patches can also be applied on top of v44; just change the target from src/core/service.c to src/service.c.
Comment 14 Colin Guthrie 2012-04-30 13:45:24 UTC
Can't easily test right now as we're in freeze and I'm not really keen on playing about too much right now with things! However a quick review of the patches seems to suggest they'd do the job and are all sensible (and suitably small too :)) Cheers
Comment 15 David Ward 2012-05-03 06:59:40 UTC
This bug blocks the removal of the 'watchquagga' daemon supervisor in Fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=817990

A separate tracking bug has been created in Fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=817968

Lennart, please review the attached patches when you get a chance. Thanks!
Comment 16 Lennart Poettering 2012-05-21 16:51:56 UTC
Comment on attachment 56944 [details] [review]
schedule JOB_RESTART from SERVICE_AUTO_RESTART state

This one is already applied.
Comment 17 Lennart Poettering 2012-05-21 16:53:13 UTC
Comment on attachment 60770 [details] [review]
service: actually delay auto-restart if another job is pending

Applied this one.
Comment 18 Lennart Poettering 2012-05-21 16:56:21 UTC
Comment on attachment 60771 [details] [review]
service: fix auto-restart handling in service_stop()

Applied
Comment 19 Lennart Poettering 2012-05-21 16:57:48 UTC
Comment on attachment 60772 [details] [review]
service: fix auto-restart handling in service_start()

Applied.
Comment 20 Lennart Poettering 2012-05-21 16:58:43 UTC
Sorry for not reviewing this more timely!

All patches applied! Thanks for your work!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.