Bug 1441603 - pacemaker_remote should reap any children when run as pid1 in a container
Summary: pacemaker_remote should reap any children when run as pid1 in a container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.4
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-04-12 10:25 UTC by Michele Baldessari
Modified: 2017-08-01 17:54 UTC (History)
CC: 9 users

Fixed In Version: pacemaker-1.1.16-10.el7
Doc Type: No Doc Update
Doc Text:
This is an implementation detail of the new bundle support which will be documented separately.
Clone Of:
Environment:
Last Closed: 2017-08-01 17:54:39 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1862 0 normal SHIPPED_LIVE pacemaker bug fix and enhancement update 2017-08-01 18:04:15 UTC

Description Michele Baldessari 2017-04-12 10:25:17 UTC
Description of problem:
When we run services like rabbitmq in a container via pacemaker_remote, the following happens:
- pacemaker_remote is PID 1
- erlang uses a bunch of helpers that are spawned via double fork()s
- this creates a number of zombies that keep accumulating, because nothing wait()s for them.

This actually becomes a problem with rabbitmq, as we exhaust the process table fairly quickly. Instead of adding a small init program that does the reaping and then calls pacemaker_remote (like dumbinit or sinit), we should just add this feature to pacemaker_remote.
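
The reaping obligation described above can be illustrated with a short sketch (illustrative Python, not pacemaker code; the function name is made up): a non-blocking waitpid() loop collects every exited child, which is exactly what is missing when pacemaker_remote runs as PID 1.

```python
import os
import time

def reap_children():
    """Reap any exited children without blocking, as a PID-1 init must.
    Returns the number of children reaped."""
    reaped = 0
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:   # no children left at all
            break
        if pid == 0:                # children exist, but none have exited yet
            break
        reaped += 1
    return reaped

if __name__ == "__main__":
    # Spawn children that exit immediately; without a wait() they would
    # linger as zombies in our process table.
    for _ in range(3):
        if os.fork() == 0:
            os._exit(0)
    time.sleep(0.2)                 # give the children time to exit
    print(reap_children())
```

A real init would run such a loop from a SIGCHLD handler (or after every wait wake-up) for the lifetime of the container.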

Comment 2 Jan Pokorný [poki] 2017-04-12 12:56:59 UTC
"Add this feature" may also be interpreted as "reinvent the wheel".

There's also "systemd-nspawn --as-pid2", though it may be a goal to make
do without any such dependency.

Just saying.

Comment 3 Ken Gaillot 2017-04-12 17:28:30 UTC
I do think pacemaker_remoted should provide this ability, since we are now promoting it as a potential PID 1 inside a container. It will involve a significant change to Pacemaker's process management model, so it will not be a quick project.

The workaround for now is to use one of the container launch scripts made for this purpose, like dumbinit or sinit mentioned in the Description. Another example is myinit, available at https://github.com/phusion/baseimage-docker/blob/rel-0.9.16/image/bin/my_init

Comment 4 Andrew Beekhof 2017-04-13 02:42:52 UTC
(In reply to Jan Pokorný from comment #2)
> "Add this feature" may also be interpreted as "reinvent the wheel".

More like copying

> 
> There's also "systemd-nspawn --as-pid2", though it may be a goal to make
> do without any such dependency.
> 
> Just saying.

"systemd-nspawn may be used to run a command or OS in a light-weight namespace container. "

So you're proposing a container inside of the real container?

Comment 5 Jan Pokorný [poki] 2017-04-13 12:14:12 UTC
re [comment 4]:

> "systemd-nspawn may be used to run a command or OS in a light-weight
> namespace container. "
> 
> So you're proposing a container inside of the real container?

No, but there are several considerations to take into account
related to that:

- there are more containerization gateways (actually not the very
  backends, as these are mostly in-kernel), such as systemd-nspawn

- some such gateways may spawn a PID 1 of their own (as with the mentioned
  "systemd-nspawn --as-pid2"), so that, at the very least,
  pacemaker_remoted should only care if it is indeed getpid() == 1

As a corollary of the former, we should think twice before setting in
stone the way the "bundle" information is encoded in the CIB; see also

http://oss.clusterlabs.org/pipermail/users/2017-April/005480.html
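
To make the last point concrete, a hedged sketch (Python, purely illustrative; the function name is hypothetical, and pacemaker's actual logic is in C) of guarding the reaper on getpid():

```python
import os
import signal

def install_reaper_if_init() -> bool:
    """Install a SIGCHLD handler that reaps zombies, but only when we are
    actually running as PID 1 (i.e. acting as the init of a container).
    Returns True if the handler was installed."""
    if os.getpid() != 1:
        return False  # a real init (or an --as-pid2 stub) is above us; do nothing

    def reap(signum, frame):
        # Collect every exited child without blocking.
        while True:
            try:
                pid, _ = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return
            if pid == 0:
                return

    signal.signal(signal.SIGCHLD, reap)
    return True
```

Under "systemd-nspawn --as-pid2", the guard simply leaves child reaping to the stub init that nspawn installs as PID 1.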

Comment 6 Andrew Beekhof 2017-04-18 03:46:06 UTC
If one takes even a passing look at systemd/src/nspawn/nspawn-stub-pid1.c it is abundantly clear that it is completely unsuitable for our purposes.

Comment 7 Jan Pokorný [poki] 2017-04-18 17:53:01 UTC
This bug talks about reaping children.
What are the other "our purposes"?

Is it an issue that the children can bring the container down with
a signal?  (They can do the same on bare metal.)

Or is it rather some lack of instrumentation for diagnostics/log/control
purposes?

Comment 8 Ken Gaillot 2017-05-03 22:21:03 UTC
Andrew Beekhof came up with a much simpler implementation that has been merged upstream:

https://github.com/ClusterLabs/pacemaker/commit/8abdd82ba85ee384ab78ce1db617f51b692e9df6
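
The merged change introduces a small init stub (visible as pcmk-init in the process listings further down). The pattern, sketched here as runnable Python rather than the actual C implementation, is: spawn the real daemon, reap every child that gets reparented to the stub, and exit with the daemon's own status. (Uses os.waitstatus_to_exitcode(), available in Python 3.9+.)

```python
import os
import sys

def stub_init(daemon_argv):
    """Minimal init pattern: fork the real daemon, then sit in a wait()
    loop reaping every child that gets reparented to us.  Exit with the
    daemon's own exit status once it terminates."""
    daemon_pid = os.fork()
    if daemon_pid == 0:
        os.execvp(daemon_argv[0], daemon_argv)   # become the daemon
    while True:
        pid, status = os.wait()                  # blocks; reaps any child
        if pid == daemon_pid:                    # the daemon itself exited
            return os.waitstatus_to_exitcode(status)
        # otherwise: a reparented orphan was reaped; keep waiting

if __name__ == "__main__":
    print(stub_init(["true"]))
```

Since orphans are reparented to PID 1, the wait() loop automatically collects any double-forked helpers, while the daemon's exit still terminates the container with the right status.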

Comment 10 Ken Gaillot 2017-05-04 21:47:18 UTC
Test procedure:

1. Configure a Pacemaker cluster of at least two cluster nodes. You'll need about 450MB free disk space on each node. 

2. On every node:
2a. Install, enable and run docker. You can use whatever RH ships, though in my personal testing, I've been using the upstream repo:

# yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# yum install docker-ce
# systemctl enable --now docker

2b. Pull a base image to work with. In my personal testing, I've been using CentOS 7:

# docker pull centos:centos7

2c. Create some infrastructure for the tests:

# mkdir -p /root/bz1441603 /var/log/pacemaker/bundles/zombie-bundle-0

2d. Put a copy of the pacemaker-cli, pacemaker-libs, pacemaker, pacemaker-cluster-libs, and pacemaker-remote RPMs for this BZ into /root/bz1441603.

2e. Create a modified version of ocf:pacemaker:Dummy that will fork a child process when doing the start action, then exit, creating a zombie process:

sed -e $'s#dummy_monitor$#dummy_monitor\\n    /usr/bin/python -c \'import subprocess; import time; child = subprocess.Popen(["/bin/sleep", "5"])\' \\&#' < /usr/lib/ocf/resource.d/pacemaker/Dummy >/root/bz1441603/zombie

2f. Create a Dockerfile for testing (replace centos:centos7 with your base image).

# cat >/root/bz1441603/Dockerfile <<EOF
FROM centos:centos7

COPY pacemaker*.rpm ./
RUN yum update -y
RUN yum install -y ./pacemaker*.rpm python which resource-agents
RUN rm -f pacemaker*.rpm
COPY zombie /usr/lib/ocf/resource.d/pacemaker/zombie
EOF

3. On every node, build a custom image. This step should be repeated if during testing you need to switch out the pacemaker packages or change the Dockerfile:

3a. Build the image:
# cd /root/bz1441603
# docker rmi pcmktest:zombie 2>/dev/null || true
# docker build -t pcmktest:zombie .

3b. If desired, verify that the image was created:

# docker images # output should look something like:
REPOSITORY          TAG                 IMAGE ID            CREATED              SIZE
pcmktest            zombie              aab04ad64ab0        About a minute ago   412 MB
centos              centos7             98d35105a391        12 days ago          192 MB

3c. At least in my testing, building triggers a docker "waiting for lo to become free" bug. Reboot the node to avoid this possibility.

4. From any one node, start the cluster, and configure a bundle using the test image. Replace the IP address with something appropriate. The cib-upgrade is only necessary if you're reusing a configuration from an older version:

# pcs cluster start --all --wait
# pcs cluster cib-upgrade
# cibadmin --modify --allow-create --scope resources -X '<bundle id="zombie-bundle">
  <docker image="pcmktest:zombie" />
  <network ip-range-start="192.168.122.131" host-interface="eth0" host-netmask="24" />
  <primitive class="ocf" id="zombie" provider="pacemaker" type="zombie"/>
</bundle>'

5. Wait about 10 seconds, then on whichever node is running the bundle, list any zombie processes inside the container:

    docker exec zombie-bundle-docker-0 bash -c 'ps -o state,pid,command -e | grep ^Z'

With the fixed packages here, there won't be a python zombie. The 7.3 packages don't support bundles, so if you want to reproduce the issue (to show a python zombie here), you'll have to grab the 1.1.16-8 7.4 build.
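
The same check can be scripted without ps. A sketch (Python, Linux-only, illustrative) that scans /proc for processes in state Z, which is what the `grep ^Z` above matches:

```python
import os
import time

def zombie_pids():
    """List PIDs currently in state 'Z' by scanning /proc/<pid>/stat.
    Field 3 of stat is the state; the comm field before it is
    parenthesised and may contain spaces, so split on the last ')'."""
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue        # process vanished while we were scanning
        state = stat.rsplit(")", 1)[1].split()[0]
        if state == "Z":
            zombies.append(int(entry))
    return zombies

if __name__ == "__main__":
    # Create a zombie: the child exits, but we deliberately do not wait().
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    time.sleep(0.2)
    print(pid in zombie_pids())    # the unreaped child shows up as 'Z'
    os.waitpid(pid, 0)             # reap it
    print(pid in zombie_pids())
```

Run inside the container (e.g. via `docker exec`), an empty result from zombie_pids() corresponds to the expected outcome with the fixed packages.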

Comment 14 Damien Ciabrini 2017-06-21 14:55:31 UTC
Additionally, Michele Baldessari and I have been testing the new features from this build for a month now, so I can say that it's working as expected for us.

We're following different instructions [1] to deploy an OpenStack cluster with containerized ocf resources, and this is the result:

[1] https://github.com/dciabrin/undercloud_ha_containers

[root@rhelz ~]# crm_mon -1
Stack: corosync
Current DC: rhelz (version 1.1.16-11.el7-94ff4df) - partition with quorum
Last updated: Wed Jun 21 09:48:16 2017
Last change: Wed Jun 21 09:29:42 2017 by root via cibadmin on rhelz

4 nodes configured
16 resources configured

Online: [ rhelz ]
GuestOnline: [ galera-bundle-0@rhelz rabbitmq-bundle-0@rhelz redis-bundle-0@rhelz ]

Active resources:

 Docker container: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:2017-06-19.1]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started rhelz
 Docker container: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-06-19.1]
   galera-bundle-0      (ocf::heartbeat:galera):        Master rhelz
 Docker container: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:2017-06-19.1]
   redis-bundle-0       (ocf::heartbeat:redis): Master rhelz
 ip-192.168.122.254     (ocf::heartbeat:IPaddr2):       Started rhelz
 ip-192.168.122.250     (ocf::heartbeat:IPaddr2):       Started rhelz
 ip-192.168.122.249     (ocf::heartbeat:IPaddr2):       Started rhelz
 ip-192.168.122.253     (ocf::heartbeat:IPaddr2):       Started rhelz
 ip-192.168.122.247     (ocf::heartbeat:IPaddr2):       Started rhelz
 ip-192.168.122.248     (ocf::heartbeat:IPaddr2):       Started rhelz
 Docker container: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:2017-06-19.1]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started rhelz


In the containerized resource rabbitmq-bundle-0, the Erlang VM implements blocking IO by forking child processes which daemonize themselves and finish once the blocking operation is done.

When attaching to the container, one can see pacemaker_remote acting as a PID 1 child reaper:

[root@rhelz ~]# docker exec -it rabbitmq-bundle-docker-0 /bin/bash
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
()[root@rhelz /]# export TERM=xterm
()[root@rhelz /]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:26 ?        00:00:00 pcmk-init
root        12     1  0 09:26 ?        00:00:03 /usr/sbin/pacemaker_remoted
rabbitmq   169     1  0 09:26 ?        00:00:00 /usr/lib64/erlang/erts-7.3.1.3/bin/epmd -daemon
root       225     1  0 09:26 ?        00:00:00 sh -c /usr/sbin/rabbitmq-server > /var/log/rabbitmq/startup_log 2> /var/log/rabbitmq/startup_err
root       228   225  0 09:26 ?        00:00:00 /bin/sh /usr/sbin/rabbitmq-server
root       249   228  0 09:26 ?        00:00:00 su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmq-server 
rabbitmq   254   249  0 09:26 ?        00:00:00 /bin/sh -e /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq   448   254  0 09:26 ?        00:00:28 /usr/lib64/erlang/erts-7.3.1.3/bin/beam.smp -W w -A 64 -K true -P 1048576 -K true -B i -- -root /usr/lib64/erlang -progname erl -- -ho

There are no zombie processes left in the container, which validates that pcmk-init (pacemaker_remote's child reaper) is working as expected.
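
The double-fork behaviour described above can be reproduced in miniature (illustrative Python, not RabbitMQ/Erlang code): after the middle process exits, the grandchild no longer belongs to the original parent, so only a reaping PID 1 such as pcmk-init can ever collect it.

```python
import os

def double_fork(target):
    """Erlang-style daemonization: fork twice and let the middle process
    exit, so the grandchild is reparented to PID 1 (or the nearest
    child-subreaper).  Without a reaping init, the grandchild becomes a
    zombie when it exits."""
    pid = os.fork()
    if pid == 0:                    # first (middle) child
        if os.fork() == 0:          # grandchild: the "daemon"
            target()
            os._exit(0)
        os._exit(0)                 # middle process exits immediately
    os.waitpid(pid, 0)              # we can only reap the middle process

if __name__ == "__main__":
    # Have the grandchild report its parent PID back through a pipe.
    r, w = os.pipe()
    double_fork(lambda: os.write(w, str(os.getppid()).encode()))
    os.close(w)
    ppid = int(os.read(r, 32))
    print(ppid != os.getpid())      # the grandchild is no longer our child
```

This is the mechanism by which the Erlang VM's blocking-IO helpers end up as children of pcmk-init in the ps output above.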

Comment 16 errata-xmlrpc 2017-08-01 17:54:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1862

