Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1106457

Summary:

Doesn't implement NOTIFY_SOCKET ERRNO

Product:

Red Hat Enterprise Linux 7

Reporter:

Miguel Angel Ajo <majopela>

Component:

systemd

Assignee:

Michal Sekletar <msekleta>

Status:

CLOSED ERRATA

QA Contact:

Branislav Blaškovič <bblaskov>

Severity:

high

Docs Contact:

Priority:

high

Version:

7.1

CC:

abeekhof, bblaskov, fdinitto, jscotka, lnykryn, majopela, msekleta, riehecky, systemd-maint-list

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

systemd-208-12.el7

Doc Type:

Bug Fix

Doc Text:

Cause: No support for ERRNO= messages incoming on notify socket Consequence: Previously if services of type notify couldn't notify systemd about error conditions without actually exiting with some error code. Fix: Support for ERRNO= messages now implemented on notify socket. Result: Services of type notify are now able to inform systemd about various error conditions during runtime by sending message of type ERRNO on notify socket.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-03-05 11:10:13 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1106489

Attachments:

Description	Flags
systemd patch	none
fake daemon for systemd	none
service definition	none
Systemd patch based on upstream tree	none
Systemd patch based on upstream tree	none

Description Miguel Angel Ajo 2014-06-09 12:03:38 UTC

Description of problem:

  NOTIFY_SOCKET specification of systemd [1] describes the ERRNO argument,
that would allow for passing error numbers for 'systemctl status' return.

  It's actually not implemented [3]+[4], which doesn't allow to signal 
error conditions to anything like pacemaker without actually bringing 
down the service.

  I discovered it during the design of this specifications:
http://docs-draft.openstack.org/48/97748/1/check/gate-oslo-specs-docs/c4a464e/doc/build/html/

  Ideally a ERRNO message in the LSB range of 150-199 (see [2]) should
be reflected into a systemctl status query to the service [4].

[1] http://www.freedesktop.org/software/systemd/man/sd_notify.html
[2] http://refspecs.linuxbase.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
[3] http://cgit.freedesktop.org/systemd/systemd/tree/src/core/service.c?id=194bbe33382f5365be3865ed1779147cb#n3201
[4] http://cgit.freedesktop.org/systemd/systemd/tree/src/systemctl/systemctl.c?id=194bbe33382f5365be3865ed1779147cb#n2882


Version-Release number of selected component (if applicable):

systemd-sysv-208-11.el7.x86_64
systemd-libs-208-11.el7.x86_64
systemd-208-11.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. start a daemon that sends a ERRNO=150 sd_notify message
2. systemctl status 'daemon' ; echo $?
3.

Actual results:

returns 0, as it's running

Expected results:

150

Additional info:

Comment 2 Lukáš Nykrýn 2014-06-09 12:44:50 UTC

Hmm I don't think that mixing errnos and lsb return codes is a good idea. ALsou you should not parse systemctl status, that is a "user" command. For that we have systemctl show option.

But anyway can you describe your use-case a little bit more?

Comment 3 Miguel Angel Ajo 2014-06-11 06:47:48 UTC

(In reply to Lukáš Nykrýn from comment #2)
> Hmm I don't think that mixing errnos and lsb return codes is a good idea.
> ALsou you should not parse systemctl status, that is a "user" command. For
> that we have systemctl show option.

Hi Lukáš, 

   What I'm missing in systemd is the mechanism to propagate status conditions 
(numeric, not text) from a running service back to any other service 'looking' 
for it (pacemaker in this case). You're right that "systemctl show" could make sense here as it's dumping all the status of a daemon as variables.

<< adding fdinitto here to clarify how pacemaker interacts with systemd >> 

Fabbio, can we use the 'show' verb from systemctl and parse it to gather our information, or shall we stick to 'status' (via service XXXX status  -> systemctl status XXXX) or 

# example:
# systemctl show sshd
[.... cutted down ...]
ExecMainStartTimestamp=Fri 2014-06-06 13:49:04 CEST
ExecMainStartTimestampMonotonic=1384388429820
ExecMainExitTimestampMonotonic=0
ExecMainPID=10217
ExecMainCode=0
ExecMainStatus=0


> 
> But anyway can you describe your use-case a little bit more?

Sure,

We're using pacemaker in RHEL-OSP/HA to manage all the running services in HA.

As for now (init.d and RHEL6) some of the services, were providing extended LSB status (app codes 150-19x) in certain conditions:

 * internal failures recoverable within a certain timeframe
 * high load conditions, ...
 * temporary lack of messaging bus connectivity...

That was returned by /etc/init.d/XXXXX  status verb as the Linux init.d LSB codes specify, and parsed by pacemaker to take action as programmed.
But the important part is, that the service *doesn't exit* even when providing 
those status conditions. It's pacemakers ultimate responsibility here to
restart, stop, or move the service out to a different host.

I'm looking for a way to do the same thing with systemd,

and I'm proposing a openstack-wide mechanism upstream to provide this systemd integration.



Lukáš, another question: given the systemctl show output, does it make sense to put the ERRNO information as "ExecMainCode", or does it make sense somewhere else?. I have a patch ready for parsing it, and I think I was setting it there, but I was also trying to patch the 'status' verb, which I may roll back...

Comment 4 Fabio Massimo Di Nitto 2014-06-11 06:49:28 UTC

bouncing this one to Andrew. I don´t know the internal implementation of systemd <-> pacemaker bridge.

Comment 5 Andrew Beekhof 2014-06-11 09:27:54 UTC

We use the dbus interface.

Currently we only look into the ActiveState property returned by the GetAll method but we could look elsewhere too if needed.

Comment 6 Miguel Angel Ajo 2014-06-20 12:03:30 UTC

Given the feedback, Lukáš, would it make sense a patch like this for systemd (I'd propose it upstream), and then using the "MainExecCode" from pacemaker?

> systemd-208/src/core/dbus-service.c:        { "ExecMainCode",                   bus_property_append_int,  "i", offsetof(ExecStatus, code)                      },

The patch would look like:


diff --git a/src/core/service.c b/src/core/service.c
index cefb253..6c27bb9 100644
--- a/src/core/service.c
+++ b/src/core/service.c
@@ -3376,6 +3376,7 @@ static void service_notify_cgroup_empty_event(Unit *u) {
 static void service_notify_message(Unit *u, pid_t pid, char **tags) {
         Service *s = SERVICE(u);
         const char *e;
+        int errno;
 
         assert(u);
 
@@ -3414,6 +3415,23 @@ static void service_notify_message(Unit *u, pid_t pid, char **tags) {
                 }
         }
 
+        /* Interpret ERRNO= */
+        if ((e = strv_find_prefix(tags, "ERRNO=")) &&
+            (s->state == SERVICE_START ||
+             s->state == SERVICE_START_POST ||
+             s->state == SERVICE_RUNNING ||
+             s->state == SERVICE_RELOAD)) {
+
+                if (safe_atoi(e + 6, &errno) < 0)
+                        log_warning_unit(u->id,
+                                         "Failed to parse notification message %s", e);
+                else {
+                        log_debug_unit(u->id,
+                                       "%s: got %s", u->id, e);
+                        s->main_exec_status.status = errno;
+                }
+        }
+
         /* Interpret READY= */
         if (s->type == SERVICE_NOTIFY &&
             s->state == SERVICE_START &&

Comment 7 Miguel Angel Ajo 2014-06-20 12:06:32 UTC

I also wonder if it could make sense to extend the protocol here:

http://www.freedesktop.org/software/systemd/man/sd_notify.html

With "STATUSCODE", to complement "STATUS" message.

In the proposed use case, we would be setting the "ERRNO" without
actually exiting, which could contradict the sd_notify.html 
defined protocol, but on the other hand we're setting an 
status/error condition.

Comment 8 Miguel Angel Ajo 2014-06-20 12:12:19 UTC

Sorry, 

+                        s->main_exec_status.status = errno;

may be:

+                        s->main_exec_status.code = errno;

Comment 9 Michal Sekletar 2014-06-20 14:38:03 UTC

I don't think that storing ERRNO value to s->main_exec_code makes much sense, because stored value will be overwritten with actual exit code once daemon exits. Most likely very shortly after ERRNO message being received. Also this is sort of racy if ERRNO message is sent just before exit() is called (but that is not likely in HA setup I assume).

Looking at the source code I think we don't interpret ERRNO message in any way and that is bug which should be fixed.

Why can't you use STATUS= for exposing more state information from services to systemd and subsequently via D-Bus to peacemaker? Even ERRNO is implemented using it for purpose you have in mind seems wrong.

Comment 10 Miguel Angel Ajo 2014-06-20 15:43:42 UTC

Hi Michal,

   Getting the errno overwritten at exit would be ok. What we need is to expose a temporary error condition with an status code.

   I believe "STATUS" message is meant for human-readable output, this is why I thought of "ERRNO" or, protocol extension.

   Please find the attached use-case example, very generic. Some error codes could be:

  1) Facing resource exhaustion
  2) Subprocesses failing for unknown reason
  3) Partial conectivity lost to other seervices in distributed environments...
  ....

  Conditions that won't totally stop the service from working in a degraded way (if we don't exit), but which could be propagated to any monitoring solution in a computer-readable format.


run example along with the patch of comment6+comment8

[root@controllerNN ~]#  systemctl show neutron-test-agent.service |grep ExecMainCode
ExecMainCode=96
[root@controllerNN ~]#  systemctl show neutron-test-agent.service |grep StatusText
StatusText=I'm running, with status 96


  
Find the .service , patch and .py fake deamon attached.

Comment 11 Miguel Angel Ajo 2014-06-20 15:44:37 UTC

Created attachment 910807 [details]
systemd patch

Comment 12 Miguel Angel Ajo 2014-06-20 15:46:07 UTC

Created attachment 910809 [details]
fake daemon for systemd

Comment 13 Miguel Angel Ajo 2014-06-20 15:46:57 UTC

Created attachment 910810 [details]
service definition

Comment 14 Miguel Angel Ajo 2014-06-20 15:52:45 UTC


Hmm Michal: 
   given the specifications of sd_notify, 

"""
ERRNO=...
If a daemon fails, the errno-style error code, formatted as string. Example: "ERRNO=2" for ENOENT.
"""

would it make sense to store the ERRNO value and expose it individually on dbus, or... to make it override the actual daemon exit code?

Comment 15 Michal Sekletar 2014-06-23 08:27:34 UTC

(In reply to Miguel Angel Ajo from comment #14)
> would it make sense to store the ERRNO value and expose it individually on
> dbus, or... to make it override the actual daemon exit code?

As I've said before, I think former is more appropriate here. Other solution is acceptable in your case when service keeps running in degraded mode, but not for everybody. I assume that in most cases service exits right after sending ERRNO= message. Hence there is very short time window when errno code is accessible before it gets overwritten by actual exit code of service which may or may not be the same, and I deem later is more common.

There is one problem with all this. Namely, service is expected to transition to failed state after sending notification with ERRNO set. However I don't think that we have to be so strict here. Protocol specification SHOULD allow service to run in degraded mode rather than exiting with error immediately.

Comment 16 Miguel Angel Ajo 2014-06-30 13:52:43 UTC

Created attachment 913408 [details]
Systemd patch based on upstream tree

Attached a patch for upstream adding an specific errno field. I have to test it yet. posted here for review.

Only difference of our d/s branch vs upstream is the notify_dbus var set to true.

Comment 17 Miguel Angel Ajo 2014-06-30 13:55:03 UTC

Michal, does the above patch look good? let me know if I understood your comment 15.

If it looks good (and my tests here pass) I'll propose it upstream.

Comment 18 Michal Sekletar 2014-06-30 20:26:29 UTC

Looks good to me. Please post upstream. If it gets accepted I can take care of backport to RHEL7. Thanks!

Comment 19 Miguel Angel Ajo 2014-07-03 08:23:55 UTC

I was trying my patch on f21 before submitting upstream,
and I realized it has a problem.

[root@f21-systemd ~]# systemctl show sshd
Failed to get properties: No such device or address

Michal, any idea of what could that be?


I suppose it's related to dbus communication or parameter listing
because of my changes


strace of the systemctl execution (I removed the non-important-parts):

sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\5\0\0\0\1\0\0\0\237\0\0\0\1\1o\0-\0\0\0/org/freedesktop/systemd1/unit/sshd_2eservice\0\0\0\3\1s\0\6\0\0\0GetAll\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\10\1g\0\1s\0\0", 176}, {"\0\0\0\0\0", 5}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 181

recvmsg(3, 0x7fffd278ef10, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)


...

recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"\4\1s\0\22\0\0\0System.Error.ENXIO\0\0\0\0\0\0\10\1g\0\1s\0\0\31\0\0\0No such device or address\0", 70}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 70




writev(2, [{"Failed to get properties: No such device or address", 51}, {"\n", 1}], 2Failed to get properties: No such device or address
) = 52
close(1)                                = 0

Comment 20 Miguel Angel Ajo 2014-07-03 08:36:54 UTC

Ah, I think I found the issue, clearing the needinfo for now.

There were changes missing in dbus-execute.c


[majopela@redcylon core]$ git diff
diff --git a/src/core/dbus-execute.c b/src/core/dbus-execute.c
index ecbadd7..df29764 100644
--- a/src/core/dbus-execute.c
+++ b/src/core/dbus-execute.c
@@ -657,7 +657,7 @@ static int append_exec_command(sd_bus_message *reply, ExecCommand *c) {
         if (!c->path)
                 return 0;
 
-        r = sd_bus_message_open_container(reply, 'r', "sasbttttuii");
+        r = sd_bus_message_open_container(reply, 'r', "sasbttttuiii");
         if (r < 0)
                 return r;
 
@@ -669,7 +669,7 @@ static int append_exec_command(sd_bus_message *reply, ExecCommand *c) {
         if (r < 0)
                 return r;
 
-        r = sd_bus_message_append(reply, "bttttuii",
+        r = sd_bus_message_append(reply, "bttttuiii",
                                   c->ignore,
                                   c->exec_status.start_timestamp.realtime,
                                   c->exec_status.start_timestamp.monotonic,
@@ -677,7 +677,8 @@ static int append_exec_command(sd_bus_message *reply, ExecCommand *c) {
                                   c->exec_status.exit_timestamp.monotonic,
                                   (uint32_t) c->exec_status.pid,
                                   (int32_t) c->exec_status.code,
-                                  (int32_t) c->exec_status.status);
+                                  (int32_t) c->exec_status.status,
+                                  (int32_t) c->exec_status.errno_code);
         if (r < 0)
                 return r;
 
@@ -699,7 +700,7 @@ int bus_property_get_exec_command(
         assert(bus);
         assert(reply);
 
-        r = sd_bus_message_open_container(reply, 'a', "(sasbttttuii)");
+        r = sd_bus_message_open_container(reply, 'a', "(sasbttttuiii)");
         if (r < 0)
                 return r;
 
@@ -725,7 +726,7 @@ int bus_property_get_exec_command_list(
         assert(bus);
         assert(reply);
 
-        r = sd_bus_message_open_container(reply, 'a', "(sasbttttuii)");
+        r = sd_bus_message_open_container(reply, 'a', "(sasbttttuiii)");
         if (r < 0)
                 return r;

Comment 21 Miguel Angel Ajo 2014-07-03 08:48:10 UTC

Created attachment 914407 [details]
Systemd patch based on upstream tree

Fixed a few missing changes for dbus communication.

Comment 22 Michal Sekletar 2014-07-03 08:57:23 UTC

(In reply to Miguel Angel Ajo from comment #21)
> Created attachment 914407 [details]
> Systemd patch based on upstream tree
> 
> Fixed a few missing changes for dbus communication.

Yes. I missed those of course, because I didn't compile systemd with patch just made very brief check of general approach. If the current version works on Fedora 21, please make sure it applies cleanly on top of upstream HEAD and send out. Thanks!

Comment 23 Miguel Angel Ajo 2014-07-03 09:42:33 UTC

I started a review here,
http://lists.freedesktop.org/archives/systemd-devel/2014-July/020869.html

I had a couple of doubts, it works on the current implementation, and 
applies on top of upstream.


Thanks!.

Comment 24 Miguel Angel Ajo 2014-07-08 07:25:00 UTC

Ready for backport.

It's accepted upstream with the next commit ids:

b4af5a803aa71a57733ca46fef29b7afb20a626c service patch
2040ccf171404b709acb0ecf1d1f17b87c5d05f0 (non negative numbers accepted)
b4af5a803aa71a57733ca46fef29b7afb20a626c systemctl status support

Comment 25 Michal Sekletar 2014-07-22 10:49:31 UTC

Miguel thanks for implementing this. I'll cherry-pick the patch, to RHEL-7.1. Thanks a lot!

Comment 26 Michal Sekletar 2014-08-11 08:14:06 UTC

http://10.3.11.34/msekleta/systemd-rhel/commit/4323a5406bb40c76076814fd998e09c58b433e7d

Comment 27 Miguel Angel Ajo 2014-08-13 08:55:36 UTC

(In reply to Michal Sekletar from comment #26)
> http://10.3.11.34/msekleta/systemd-rhel/commit/
> 4323a5406bb40c76076814fd998e09c58b433e7d

Michal, 

 There were a couple of commits right after that one, from Lenart Poettering,
that was adding support on systemctl to read that, and a couple of extra things. 

  Could you backport those too?

Comment 28 Michal Sekletar 2014-08-13 09:31:43 UTC

So far I have backported only one from Lennart which in turn makes negative values for ERRNO ignored. I will look for more related work. Thanks for reminding me.

Comment 32 Branislav Blaškovič 2014-11-25 12:25:05 UTC

TestCase passed on systemd-208-18.el7, switching to VERIFIED.

Comment 34 errata-xmlrpc 2015-03-05 11:10:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0509.html