Bug 1106457
| Summary: | Doesn't implement NOTIFY_SOCKET ERRNO | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Miguel Angel Ajo <majopela> | ||||||||||||
| Component: | systemd | Assignee: | Michal Sekletar <msekleta> | ||||||||||||
| Status: | CLOSED ERRATA | QA Contact: | Branislav Blaškovič <bblaskov> | ||||||||||||
| Severity: | high | Docs Contact: | |||||||||||||
| Priority: | high | ||||||||||||||
| Version: | 7.1 | CC: | abeekhof, bblaskov, fdinitto, jscotka, lnykryn, majopela, msekleta, riehecky, systemd-maint-list | ||||||||||||
| Target Milestone: | rc | ||||||||||||||
| Target Release: | --- | ||||||||||||||
| Hardware: | Unspecified | ||||||||||||||
| OS: | Unspecified | ||||||||||||||
| Whiteboard: | |||||||||||||||
| Fixed In Version: | systemd-208-12.el7 | Doc Type: | Bug Fix | ||||||||||||
| Doc Text: |
Cause:
No support for ERRNO= messages incoming on notify socket
Consequence:
Previously if services of type notify couldn't notify systemd about error conditions without actually exiting with some error code.
Fix:
Support for ERRNO= messages now implemented on notify socket.
Result:
Services of type notify are now able to inform systemd about various error conditions during runtime by sending message of type ERRNO on notify socket.
|
Story Points: | --- | ||||||||||||
| Clone Of: | Environment: | ||||||||||||||
| Last Closed: | 2015-03-05 11:10:13 UTC | Type: | Bug | ||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||
| Documentation: | --- | CRM: | |||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
| Embargoed: | |||||||||||||||
| Bug Depends On: | |||||||||||||||
| Bug Blocks: | 1106489 | ||||||||||||||
| Attachments: |
|
||||||||||||||
|
Description
Miguel Angel Ajo
2014-06-09 12:03:38 UTC
Hmm I don't think that mixing errnos and lsb return codes is a good idea. ALsou you should not parse systemctl status, that is a "user" command. For that we have systemctl show option. But anyway can you describe your use-case a little bit more? (In reply to Lukáš Nykrýn from comment #2) > Hmm I don't think that mixing errnos and lsb return codes is a good idea. > ALsou you should not parse systemctl status, that is a "user" command. For > that we have systemctl show option. Hi Lukáš, What I'm missing in systemd is the mechanism to propagate status conditions (numeric, not text) from a running service back to any other service 'looking' for it (pacemaker in this case). You're right that "systemctl show" could make sense here as it's dumping all the status of a daemon as variables. << adding fdinitto here to clarify how pacemaker interacts with systemd >> Fabbio, can we use the 'show' verb from systemctl and parse it to gather our information, or shall we stick to 'status' (via service XXXX status -> systemctl status XXXX) or # example: # systemctl show sshd [.... cutted down ...] ExecMainStartTimestamp=Fri 2014-06-06 13:49:04 CEST ExecMainStartTimestampMonotonic=1384388429820 ExecMainExitTimestampMonotonic=0 ExecMainPID=10217 ExecMainCode=0 ExecMainStatus=0 > > But anyway can you describe your use-case a little bit more? Sure, We're using pacemaker in RHEL-OSP/HA to manage all the running services in HA. As for now (init.d and RHEL6) some of the services, were providing extended LSB status (app codes 150-19x) in certain conditions: * internal failures recoverable within a certain timeframe * high load conditions, ... * temporary lack of messaging bus connectivity... That was returned by /etc/init.d/XXXXX status verb as the Linux init.d LSB codes specify, and parsed by pacemaker to take action as programmed. But the important part is, that the service *doesn't exit* even when providing those status conditions. It's pacemakers ultimate responsibility here to restart, stop, or move the service out to a different host. I'm looking for a way to do the same thing with systemd, and I'm proposing a openstack-wide mechanism upstream to provide this systemd integration. Lukáš, another question: given the systemctl show output, does it make sense to put the ERRNO information as "ExecMainCode", or does it make sense somewhere else?. I have a patch ready for parsing it, and I think I was setting it there, but I was also trying to patch the 'status' verb, which I may roll back... bouncing this one to Andrew. I don´t know the internal implementation of systemd <-> pacemaker bridge. We use the dbus interface. Currently we only look into the ActiveState property returned by the GetAll method but we could look elsewhere too if needed. Given the feedback, Lukáš, would it make sense a patch like this for systemd (I'd propose it upstream), and then using the "MainExecCode" from pacemaker?
> systemd-208/src/core/dbus-service.c: { "ExecMainCode", bus_property_append_int, "i", offsetof(ExecStatus, code) },
The patch would look like:
diff --git a/src/core/service.c b/src/core/service.c
index cefb253..6c27bb9 100644
--- a/src/core/service.c
+++ b/src/core/service.c
@@ -3376,6 +3376,7 @@ static void service_notify_cgroup_empty_event(Unit *u) {
static void service_notify_message(Unit *u, pid_t pid, char **tags) {
Service *s = SERVICE(u);
const char *e;
+ int errno;
assert(u);
@@ -3414,6 +3415,23 @@ static void service_notify_message(Unit *u, pid_t pid, char **tags) {
}
}
+ /* Interpret ERRNO= */
+ if ((e = strv_find_prefix(tags, "ERRNO=")) &&
+ (s->state == SERVICE_START ||
+ s->state == SERVICE_START_POST ||
+ s->state == SERVICE_RUNNING ||
+ s->state == SERVICE_RELOAD)) {
+
+ if (safe_atoi(e + 6, &errno) < 0)
+ log_warning_unit(u->id,
+ "Failed to parse notification message %s", e);
+ else {
+ log_debug_unit(u->id,
+ "%s: got %s", u->id, e);
+ s->main_exec_status.status = errno;
+ }
+ }
+
/* Interpret READY= */
if (s->type == SERVICE_NOTIFY &&
s->state == SERVICE_START &&
I also wonder if it could make sense to extend the protocol here: http://www.freedesktop.org/software/systemd/man/sd_notify.html With "STATUSCODE", to complement "STATUS" message. In the proposed use case, we would be setting the "ERRNO" without actually exiting, which could contradict the sd_notify.html defined protocol, but on the other hand we're setting an status/error condition. Sorry, + s->main_exec_status.status = errno; may be: + s->main_exec_status.code = errno; I don't think that storing ERRNO value to s->main_exec_code makes much sense, because stored value will be overwritten with actual exit code once daemon exits. Most likely very shortly after ERRNO message being received. Also this is sort of racy if ERRNO message is sent just before exit() is called (but that is not likely in HA setup I assume). Looking at the source code I think we don't interpret ERRNO message in any way and that is bug which should be fixed. Why can't you use STATUS= for exposing more state information from services to systemd and subsequently via D-Bus to peacemaker? Even ERRNO is implemented using it for purpose you have in mind seems wrong. Hi Michal, Getting the errno overwritten at exit would be ok. What we need is to expose a temporary error condition with an status code. I believe "STATUS" message is meant for human-readable output, this is why I thought of "ERRNO" or, protocol extension. Please find the attached use-case example, very generic. Some error codes could be: 1) Facing resource exhaustion 2) Subprocesses failing for unknown reason 3) Partial conectivity lost to other seervices in distributed environments... .... Conditions that won't totally stop the service from working in a degraded way (if we don't exit), but which could be propagated to any monitoring solution in a computer-readable format. run example along with the patch of comment6+comment8 [root@controllerNN ~]# systemctl show neutron-test-agent.service |grep ExecMainCode ExecMainCode=96 [root@controllerNN ~]# systemctl show neutron-test-agent.service |grep StatusText StatusText=I'm running, with status 96 Find the .service , patch and .py fake deamon attached. Created attachment 910807 [details]
systemd patch
Created attachment 910809 [details]
fake daemon for systemd
Created attachment 910810 [details]
service definition
Hmm Michal: given the specifications of sd_notify, """ ERRNO=... If a daemon fails, the errno-style error code, formatted as string. Example: "ERRNO=2" for ENOENT. """ would it make sense to store the ERRNO value and expose it individually on dbus, or... to make it override the actual daemon exit code? (In reply to Miguel Angel Ajo from comment #14) > would it make sense to store the ERRNO value and expose it individually on > dbus, or... to make it override the actual daemon exit code? As I've said before, I think former is more appropriate here. Other solution is acceptable in your case when service keeps running in degraded mode, but not for everybody. I assume that in most cases service exits right after sending ERRNO= message. Hence there is very short time window when errno code is accessible before it gets overwritten by actual exit code of service which may or may not be the same, and I deem later is more common. There is one problem with all this. Namely, service is expected to transition to failed state after sending notification with ERRNO set. However I don't think that we have to be so strict here. Protocol specification SHOULD allow service to run in degraded mode rather than exiting with error immediately. Created attachment 913408 [details]
Systemd patch based on upstream tree
Attached a patch for upstream adding an specific errno field. I have to test it yet. posted here for review.
Only difference of our d/s branch vs upstream is the notify_dbus var set to true.
Michal, does the above patch look good? let me know if I understood your comment 15. If it looks good (and my tests here pass) I'll propose it upstream. Looks good to me. Please post upstream. If it gets accepted I can take care of backport to RHEL7. Thanks!
I was trying my patch on f21 before submitting upstream,
and I realized it has a problem.
[root@f21-systemd ~]# systemctl show sshd
Failed to get properties: No such device or address
Michal, any idea of what could that be?
I suppose it's related to dbus communication or parameter listing
because of my changes
strace of the systemctl execution (I removed the non-important-parts):
sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\5\0\0\0\1\0\0\0\237\0\0\0\1\1o\0-\0\0\0/org/freedesktop/systemd1/unit/sshd_2eservice\0\0\0\3\1s\0\6\0\0\0GetAll\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\10\1g\0\1s\0\0", 176}, {"\0\0\0\0\0", 5}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 181
recvmsg(3, 0x7fffd278ef10, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
...
recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"\4\1s\0\22\0\0\0System.Error.ENXIO\0\0\0\0\0\0\10\1g\0\1s\0\0\31\0\0\0No such device or address\0", 70}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 70
writev(2, [{"Failed to get properties: No such device or address", 51}, {"\n", 1}], 2Failed to get properties: No such device or address
) = 52
close(1) = 0
Ah, I think I found the issue, clearing the needinfo for now.
There were changes missing in dbus-execute.c
[majopela@redcylon core]$ git diff
diff --git a/src/core/dbus-execute.c b/src/core/dbus-execute.c
index ecbadd7..df29764 100644
--- a/src/core/dbus-execute.c
+++ b/src/core/dbus-execute.c
@@ -657,7 +657,7 @@ static int append_exec_command(sd_bus_message *reply, ExecCommand *c) {
if (!c->path)
return 0;
- r = sd_bus_message_open_container(reply, 'r', "sasbttttuii");
+ r = sd_bus_message_open_container(reply, 'r', "sasbttttuiii");
if (r < 0)
return r;
@@ -669,7 +669,7 @@ static int append_exec_command(sd_bus_message *reply, ExecCommand *c) {
if (r < 0)
return r;
- r = sd_bus_message_append(reply, "bttttuii",
+ r = sd_bus_message_append(reply, "bttttuiii",
c->ignore,
c->exec_status.start_timestamp.realtime,
c->exec_status.start_timestamp.monotonic,
@@ -677,7 +677,8 @@ static int append_exec_command(sd_bus_message *reply, ExecCommand *c) {
c->exec_status.exit_timestamp.monotonic,
(uint32_t) c->exec_status.pid,
(int32_t) c->exec_status.code,
- (int32_t) c->exec_status.status);
+ (int32_t) c->exec_status.status,
+ (int32_t) c->exec_status.errno_code);
if (r < 0)
return r;
@@ -699,7 +700,7 @@ int bus_property_get_exec_command(
assert(bus);
assert(reply);
- r = sd_bus_message_open_container(reply, 'a', "(sasbttttuii)");
+ r = sd_bus_message_open_container(reply, 'a', "(sasbttttuiii)");
if (r < 0)
return r;
@@ -725,7 +726,7 @@ int bus_property_get_exec_command_list(
assert(bus);
assert(reply);
- r = sd_bus_message_open_container(reply, 'a', "(sasbttttuii)");
+ r = sd_bus_message_open_container(reply, 'a', "(sasbttttuiii)");
if (r < 0)
return r;
Created attachment 914407 [details]
Systemd patch based on upstream tree
Fixed a few missing changes for dbus communication.
(In reply to Miguel Angel Ajo from comment #21) > Created attachment 914407 [details] > Systemd patch based on upstream tree > > Fixed a few missing changes for dbus communication. Yes. I missed those of course, because I didn't compile systemd with patch just made very brief check of general approach. If the current version works on Fedora 21, please make sure it applies cleanly on top of upstream HEAD and send out. Thanks! I started a review here, http://lists.freedesktop.org/archives/systemd-devel/2014-July/020869.html I had a couple of doubts, it works on the current implementation, and applies on top of upstream. Thanks!. Ready for backport. It's accepted upstream with the next commit ids: b4af5a803aa71a57733ca46fef29b7afb20a626c service patch 2040ccf171404b709acb0ecf1d1f17b87c5d05f0 (non negative numbers accepted) b4af5a803aa71a57733ca46fef29b7afb20a626c systemctl status support Miguel thanks for implementing this. I'll cherry-pick the patch, to RHEL-7.1. Thanks a lot! (In reply to Michal Sekletar from comment #26) > http://10.3.11.34/msekleta/systemd-rhel/commit/ > 4323a5406bb40c76076814fd998e09c58b433e7d Michal, There were a couple of commits right after that one, from Lenart Poettering, that was adding support on systemctl to read that, and a couple of extra things. Could you backport those too? So far I have backported only one from Lennart which in turn makes negative values for ERRNO ignored. I will look for more related work. Thanks for reminding me. TestCase passed on systemd-208-18.el7, switching to VERIFIED. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0509.html |