Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1872670

Summary:	[rhosp-director] epmd@0.0.0.0.socket unit remains in failed state
Product:	Red Hat OpenStack	Reporter:	Ketan Mehta <kmehta>
Component:	rabbitmq-server	Assignee:	John Eckersberg <jeckersb>
Status:	CLOSED NEXTRELEASE	QA Contact:	pkomarov
Severity:	high	Docs Contact:
Priority:	high
Version:	13.0 (Queens)	CC:	apevec, jeckersb, lhh, lmiccini, michele, plemenko, vkoul
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	---
Hardware:	x86_64
OS:	Linux
Last Closed:	2021-05-20 08:16:07 UTC	Type:	Bug

Description Ketan Mehta 2020-08-26 11:18:29 UTC

Description of problem:

Looking at the systemd unit file for this socket, it listens on port 4369 for the epmd process on the undercloud node:

[Socket]
ListenStream=%i:4369
Accept=false

[root@undercloud-0 ~]# netstat -tunpl |grep 4369
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      12392/epmd          
tcp6       0      0 :::4369                 :::*                    LISTEN      12392/epmd       
   
[root@undercloud-0 ~]# ps -ef |grep 12392
root      6839 29013  0 08:35 pts/3    00:00:00 grep --color=auto 12392
root     12392     1  0 Jul10 ?        00:00:08 /usr/lib64/erlang/erts-7.3.1.6/bin/epmd -daemon

However, the unit itself remains in failed state, citing address already in use.

Jul 10 13:40:51 undercloud-0.redhat.local systemd[1]: epmd@0.0.0.0.socket failed to listen on sockets: Address already in use
Jul 10 13:40:51 undercloud-0.redhat.local systemd[1]: Failed to listen on Erlang Port Mapper Daemon Activation Socket.
Jul 10 13:40:51 undercloud-0.redhat.local systemd[1]: Unit epmd@0.0.0.0.socket entered failed state.
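
For illustration, the "Address already in use" failure above can be reproduced outside systemd. This is a minimal Python sketch, not anything systemd or epmd actually runs: it binds one listener, then tries to bind a second socket to the same address. The function name try_second_bind is hypothetical, and the sketch uses a kernel-assigned port on 127.0.0.1 rather than the real port 4369.

```python
import errno
import socket


def try_second_bind(host="127.0.0.1"):
    """Bind one listener, then try to bind a second socket to the same address.

    Returns the errno of the second bind failure, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the kernel for a free port, so the sketch never
        # touches the real epmd port (4369).
        first.bind((host, 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Plays the role of the socket unit trying to listen while
            # another process already holds the address.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    result = try_second_bind()
    print(errno.errorcode.get(result, result))
```

The first socket stands in for the daemonized epmd (pid 12392 above), the second for the socket unit that starts afterwards.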

So, I stopped the unit and killed the epmd process using kill -9 12392 to see if that causes any problem, followed by a systemd restart of rabbitmq-server, which actually has a dependency on this unit.

[Unit]
Description=RabbitMQ broker
After=network.target epmd@0.0.0.0.socket
Wants=network.target epmd@0.0.0.0.socket

Once I do that, the listener on port 4369 is created again, but this time it is controlled by systemd.

[root@undercloud-0 ~]# systemctl restart rabbitmq-server
[root@undercloud-0 ~]# netstat -tunpl |grep 4369
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      1/systemd  
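
The two netstat outputs in this report differ only in the last column (12392/epmd versus 1/systemd). A small Python sketch of how that column could be checked is shown here; the helper names listener_owner and is_systemd_managed are hypothetical, and the parsing assumes the "PID/Program name" form seen above (run as root, program name without spaces). It only tells apart the two states shown in this report.

```python
def listener_owner(netstat_line):
    """Return (pid, program) from one `netstat -tunpl` output line."""
    fields = netstat_line.split()
    # The last column has the form "PID/Program name".
    pid_text, _, program = fields[-1].partition("/")
    return int(pid_text), program


def is_systemd_managed(netstat_line):
    """True if the listening socket is held by systemd itself (pid 1)."""
    pid, program = listener_owner(netstat_line)
    return pid == 1 and program == "systemd"


if __name__ == "__main__":
    before = "tcp  0  0 0.0.0.0:4369  0.0.0.0:*  LISTEN  12392/epmd"
    after = "tcp  0  0 0.0.0.0:4369  0.0.0.0:*  LISTEN  1/systemd"
    print(listener_owner(before), is_systemd_managed(before))
    print(listener_owner(after), is_systemd_managed(after))
```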

Following which, the unit is in active state:

 epmd@0.0.0.0.socket - Erlang Port Mapper Daemon Activation Socket
   Loaded: loaded (/usr/lib/systemd/system/epmd@.socket; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-07-20 08:37:46 EDT; 19min ago
   Listen: 0.0.0.0:4369 (Stream)

Jul 20 08:37:46 undercloud-0.redhat.local systemd[1]: Listening on Erlang Port Mapper Daemon Activation Socket.

Also, I see this was discussed in [1] but I wonder which one is the correct behaviour?

Can we correct the behaviour here so that the unit does not enter the failed state and no manual intervention is needed to bring it to the active state?

Should port 4369 be handled by the epmd process or by systemd? And should the unit be stopped and disabled by default so that rabbitmq can control its stop / start?

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1525369


Version-Release number of selected component (if applicable):

RHOSP13 undercloud, all versions

Comment 5 Peter Lemenkov 2020-12-17 22:07:35 UTC

Hello All!
We do not support systemd integration yet. It is provided but has never been widely used or tested. Customers should use pacemaker for service management. We are going to use system services starting with RHOS18 (*maybe* from RHOS17).

Comment 8 John Eckersberg 2021-05-07 14:02:02 UTC

I just spent the better part of a day on bug 1957638 which is basically the same thing, so this is fresh in my mind and I can speak intelligently to it right now.

On the OSP13 undercloud, epmd should be controlled by systemd via socket activation. The reason for this is subtle and I had to go dig through old bugs from 7 years ago to remind myself because I couldn't remember. So let me write it out here to sum it up. Then in 7 more years I can hopefully just refer to this comment :)

(Disclaimer, some of this may be outdated? But this is how it worked 7 years ago.)

The key is understanding how systemd manages processes in the cgroups for the services it launches. If systemd starts the rabbitmq-server service, then it knows which pid is the main process that it launched. It expects all processes under the service/cgroup to be parented under the main process. The main process *cannot* double fork() to daemonize another service. This action re-parents the daemon to the init process (systemd), and then systemd will kill the newly daemonized process.
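
To make the re-parenting step concrete, here is a minimal Python sketch of a double fork. It is not the Erlang runtime's code, and it shows only the re-parenting, not systemd's reaction to it. The function name double_fork_parent_pid is hypothetical. The new parent is typically pid 1, but it can be another process (for example a subreaper or a container init), so the sketch only reports what it observes.

```python
import os
import time


def double_fork_parent_pid(timeout=5.0):
    """Double fork and report the grandchild's parent PID once it is orphaned.

    Returns (intermediate_pid, grandchild_ppid).
    """
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # First child: start a new session, fork again, then exit at once
        # so that the grandchild is orphaned.
        os.close(read_fd)
        os.setsid()
        intermediate = os.getpid()
        if os.fork() == 0:
            # Grandchild (the "daemon"): wait until the kernel re-parents us.
            deadline = time.monotonic() + timeout
            while os.getppid() == intermediate and time.monotonic() < deadline:
                time.sleep(0.01)
            os.write(write_fd, f"{intermediate} {os.getppid()}".encode())
            os._exit(0)
        os._exit(0)
    # Original process: reap the first child, then read the grandchild's report.
    os.close(write_fd)
    os.waitpid(child, 0)
    with os.fdopen(read_fd) as pipe:
        intermediate, new_parent = pipe.read().split()
    return int(intermediate), int(new_parent)


if __name__ == "__main__":
    intermediate_pid, new_parent_pid = double_fork_parent_pid()
    print("intermediate pid:", intermediate_pid)
    print("daemon's parent after re-parenting:", new_parent_pid)
```

The process that called fork() first is no longer an ancestor of the daemon once the intermediate child exits, which is the situation described above for a service's main process.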

With that, here is how things can go wrong on the OSP13 undercloud if epmd is not managed by systemd:

- rabbitmq-server service is started via systemd

- As part of the rabbitmq-server startup, it tries to connect to epmd (tcp to 4369) to register itself. This connection is refused (no epmd running).

- The erlang runtime will double fork() to daemonize off the epmd service

- The rabbit app registers itself with epmd and finishes booting

- Very shortly afterwards, systemd handles the reparenting of the epmd service and kills it

At this point, the next action taken against the service is likely to fail. For example, if a healthcheck runs and tries to do `rabbitmqctl status` or similar:

- rabbitmqctl boots and tries to connect to epmd (again tcp to 4369). This connection is refused (no epmd running because systemd killed it).

- This erlang runtime will double fork() to daemonize off the epmd service. Depending on where this particular invocation of rabbitmqctl is running, **this epmd instance may persist**. For example, if it happens during a puppet run as root, the process gets daemonized as root as a normal (non-service) process and will run as long as the host is running.

- rabbitmqctl contacts this new epmd and asks for the location of the rabbit app. Since rabbit was only registered in the old epmd and not the new one, this lookup fails so the rabbitmqctl command will also fail.

Eventually the system will reach an equilibrium state. The rabbit app will periodically check that epmd is running and that the rabbit app is registered. It will notice that epmd is up but rabbit is not registered, and then re-register it. From this point forward everything works as expected, but you will be in the situation originally noted in this bug: the epmd service is running as root, as a normal daemonized process instead of under systemd control. And because of this, the systemd epmd@0.0.0.0.socket unit is in failed state. It cannot bind to 0.0.0.0:4369 because the daemonized epmd is already bound there.
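
The lost-registration sequence can be modelled with a toy in-memory registry. This Python sketch is only an illustration of the bookkeeping, not the real epmd protocol: ToyEpmd and ensure_registered are hypothetical names, and the port number 25672 is an illustrative value rather than something taken from this report.

```python
class ToyEpmd:
    """A stand-in for one epmd process: an in-memory name -> port table."""

    def __init__(self):
        self._names = {}

    def register(self, name, port):
        self._names[name] = port

    def lookup(self, name):
        # Returns None when the name is unknown to this instance.
        return self._names.get(name)


def ensure_registered(epmd, name, port):
    """Periodic check: re-register the node if this epmd does not know it."""
    if epmd.lookup(name) is None:
        epmd.register(name, port)
        return True
    return False


# rabbit registers with the first epmd while booting.
old_epmd = ToyEpmd()
old_epmd.register("rabbit", 25672)

# The first epmd is killed; rabbitmqctl spawns a fresh, empty one.
new_epmd = ToyEpmd()
lookup_before = new_epmd.lookup("rabbit")  # rabbitmqctl's lookup fails here

# Later, rabbit's periodic check notices the gap and re-registers.
ensure_registered(new_epmd, "rabbit", 25672)
lookup_after = new_epmd.lookup("rabbit")

print(lookup_before, lookup_after)
```

The registrations live only in the memory of the epmd process that received them, which is why replacing the process loses them until the node registers again.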

Now, under normal operation, I'm not sure how you'd get into the bad state without manual intervention. The epmd@0.0.0.0.socket unit should start prior to any other erlang code in the boot process. Once the very first erlang service runs (be it rabbitmq-server or some random rabbitmqctl invocation or anything else), it will connect to epmd as part of its boot process. Once it connects to the systemd socket, systemd will start the epmd service via socket activation and hand it the file descriptor to accept() the connection. From that point on, any epmd connections should succeed and connect to the systemd-managed service that is running.
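
For readers unfamiliar with socket activation, the hand-off of the listening file descriptor can be sketched in Python. This is not epmd's implementation. It follows the convention documented for sd_listen_fds(): systemd sets LISTEN_PID and LISTEN_FDS in the environment, and passed descriptors start at fd 3. The helper names inherited_listen_fds and serve_one are hypothetical.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first descriptor passed under the systemd convention


def inherited_listen_fds(environ=None, pid=None):
    """Return the list of listening fds passed by the service manager, if any."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    # LISTEN_PID must name this process, otherwise the fds are not for us.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def serve_one(fd):
    """Accept a single connection on an already-listening inherited socket.

    With Accept=false the service receives the listening socket itself and
    calls accept() on it; it does not bind() or listen() again.
    Closes the listener when done, since this sketch is one-shot.
    """
    with socket.socket(fileno=fd) as listener:
        conn, peer = listener.accept()
        conn.close()
        return peer


if __name__ == "__main__":
    fds = inherited_listen_fds()
    if fds:
        print("accepted connection from", serve_one(fds[0]))
    else:
        print("not socket-activated; no inherited listening sockets")
```

Because the socket is already bound by systemd before the service starts, the first client connection is queued rather than refused, which is why the erlang runtime never needs to spawn its own epmd in this setup.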

It's quite possible there is a bug somewhere in the upgrade/update process. I know there are instances where we explicitly kill epmd in tripleo; we may inadvertently kill the systemd-managed process and then spawn a non-systemd service accidentally after that.