Bug 1872670
| Summary: | [rhosp-director] epmd@0.0.0.0.socket unit remains in failed state | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ketan Mehta <kmehta> |
| Component: | rabbitmq-server | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | pkomarov |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 13.0 (Queens) | CC: | apevec, jeckersb, lhh, lmiccini, michele, plemenko, vkoul |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-20 08:16:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Ketan Mehta
2020-08-26 11:18:29 UTC
Hello All!

We do not support systemd integration yet. It is provided but has never been widely used or tested. A customer should use pacemaker for service management. We are going to use system services starting with RHOS18 (*maybe* from RHOS17).

I just spent the better part of a day on bug 1957638, which is basically the same thing, so this is fresh in my mind and I can speak intelligently to it right now.

On the OSP13 undercloud, epmd should be controlled by systemd via socket activation. The reason for this is subtle, and I had to go dig through old bugs from 7 years ago to remind myself because I couldn't remember. So let me write it out here to sum it up. Then in 7 more years I can hopefully just refer to this comment :)

(Disclaimer: some of this may be outdated, but this is how it worked 7 years ago.)

The key is understanding how systemd manages processes in the cgroups for the services it launches. If systemd starts the rabbitmq-server service, then it knows which pid is the main process that it launched. It expects all processes under the service/cgroup to be parented under the main process. The main process *cannot* double fork() to daemonize another service. This action re-parents the daemon to the init process (systemd), and then systemd will kill the newly daemonized process.

With that, here is how things can go wrong on the OSP13 undercloud if epmd is not managed by systemd:

- rabbitmq-server service is started via systemd
- As part of the rabbitmq-server startup, it tries to connect to epmd (tcp to 4369) to register itself. This connection is refused (no epmd running).
- The erlang runtime will double fork() to daemonize off the epmd service
- The rabbit app registers itself with epmd and finishes booting
- At some (very soon) later point, systemd handles the epmd service being reparented and kills it

At this point, the next action taken against the service is likely to fail.
For example, if a healthcheck runs and tries to do `rabbitmqctl status` or similar:

- rabbitmqctl boots and tries to connect to epmd (again tcp to 4369). This connection is refused (no epmd running because systemd killed it).
- This erlang runtime will double fork() to daemonize off the epmd service. Depending on where this particular invocation of rabbitmqctl is running, **this epmd instance may persist**. For example, if it happens during a puppet run as root, the process gets daemonized as root as a normal (non-service) process and will run as long as the host is running.
- rabbitmqctl contacts this new epmd and asks for the location of the rabbit app. Since rabbit was only registered in the old epmd and not the new one, this lookup fails, so the rabbitmqctl command will also fail.

Eventually the system will reach an equilibrium state. The rabbit app will periodically check that epmd is running and that the rabbit app is registered. It will notice epmd is up but rabbit is not registered, and then re-register it. From this point forward everything works as expected, but you will be in the situation originally noted in this bug: the epmd service is running as root, as a normal daemonized process instead of under systemd control. And because of this, the systemd epmd@0.0.0.0.socket unit is in failed state. It cannot bind to 0.0.0.0:4369 because the daemonized epmd is already bound there.

Now, under normal operation, I'm not sure how you'd get into the bad state without manual intervention. The epmd@0.0.0.0.socket unit should start prior to any other erlang code in the boot process. Once the very first erlang service runs (be it rabbitmq-server or some random rabbitmqctl invocation or anything else), it will connect to epmd as part of its boot process. Once it connects to the systemd socket, systemd will start the epmd service via socket activation and hand it the file descriptor to accept() the connection.
From that point on, any epmd connections should succeed and connect to the systemd-managed service that is running.

It's quite possible there is a bug somewhere in the upgrade/update process. I know there are instances where we explicitly kill epmd in tripleo; we may inadvertently kill the systemd-managed process and then accidentally spawn a non-systemd service after that.
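
To make the socket-activation handoff described above more concrete, here is a minimal Python sketch of how an activated service can find the listening descriptor it inherits. It follows the generic systemd convention (the `LISTEN_PID` and `LISTEN_FDS` environment variables, with inherited descriptors starting at fd 3). It is not epmd's actual code, and the function names `inherited_listen_fds` and `accept_one` are made up for illustration.

```python
import os
import socket

# In systemd's socket-activation convention, inherited descriptors start at fd 3.
SD_LISTEN_FDS_START = 3


def inherited_listen_fds(environ=None, pid=None):
    """Return the descriptors passed in by systemd, or [] if not socket-activated.

    `environ` and `pid` default to the real process environment and pid; they
    are parameters only so the logic can be exercised without systemd.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        # LISTEN_PID must name this process, otherwise the fds are not for us.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def accept_one(fd):
    """Wrap an inherited listening descriptor and accept a single connection."""
    listener = socket.socket(fileno=fd)
    conn, peer = listener.accept()
    return conn, peer
```

In a service started this way, the listening socket is already bound by systemd, so the service never calls bind() on port 4369 itself. That is why a stray daemonized epmd holding the port leaves the socket unit unable to bind.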
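
Both failure sequences in this comment turn on whether a TCP connect to port 4369 is accepted or refused. A rough Python sketch of that check follows. The function name `probe_tcp` and its defaults are assumptions for illustration and are not part of any RabbitMQ or Erlang tooling. Note that a successful connect only shows that something is listening on the port, not whether it is the systemd-managed epmd or a stray daemonized one.

```python
import socket


def probe_tcp(host="127.0.0.1", port=4369, timeout=1.0):
    """Classify a TCP connect attempt as "listening", "refused", or "error: ..."."""
    try:
        # create_connection performs the same kind of connect() an Erlang node
        # would attempt against epmd; the connection is closed immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"
    except ConnectionRefusedError:
        # Nothing is bound to the port (for example, epmd was killed).
        return "refused"
    except OSError as exc:
        # Timeouts, unreachable hosts, and similar failures.
        return f"error: {exc}"
```

Calling `probe_tcp()` with no arguments checks the default epmd port on the local host. Telling a systemd-managed epmd from a daemonized one would additionally require inspecting the owning process, which this sketch does not attempt.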