Description Damien Ciabrini
2017-12-01 12:25:11 UTC
Description of problem:
When running overcloud deploy on an existing containerized HA cloud, one of the operations run on the host kills any spurious epmd that might be running there. The logic relies on lsns to determine whether epmd is running on the host (spurious) or is containerized (from pacemaker-managed rabbitmq).
for pid in $(pgrep epmd); do if [ "$(lsns -o NS -p $pid)" == "$(lsns -o NS -p 1)" ]; then kill $pid; break; fi; done
The problem with that logic is that if lsns errors out for whatever reason, both command substitutions expand to empty strings, so the test always returns true and the containerized epmd is killed in turn. This unexpectedly kills the messaging service and also prevents pacemaker from restarting it properly, for unrelated reasons.
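One way to guard against this failure mode is sketched below. It is only an illustration, not necessarily the fix shipped for this bug: the helper names same_namespaces and kill_host_epmd are hypothetical. The idea is that an empty or failed lsns result must never count as a match.

```shell
# Hypothetical helper: succeed only when both namespace listings are
# non-empty and identical, so a failed lsns never counts as a match.
same_namespaces() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" = "$2" ]
}

# Sketch of the cleanup loop using the guarded comparison.
kill_host_epmd() {
    # Without the namespaces of PID 1 there is nothing safe to compare against.
    init_ns=$(lsns -o NS -p 1 2>/dev/null) || return 1
    for pid in $(pgrep epmd); do
        # Skip this PID if lsns fails, rather than treating it as a match.
        pid_ns=$(lsns -o NS -p "$pid" 2>/dev/null) || continue
        if same_namespaces "$pid_ns" "$init_ns"; then
            kill "$pid"
            break
        fi
    done
}
```

With this guard, a broken or missing lsns results in no epmd being killed, which leaves the containerized rabbitmq untouched.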
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. deploy a stack
2. force run an additional epmd on the host. As root on a controller:
. /etc/rabbitmq/rabbitmq-env.conf
epmd -daemon
3. use the same command as 1 to redeploy on top of the existing stack
Actual results:
all epmd processes are killed
Expected results:
Only the epmd from the host should be killed
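To check which epmd processes belong to the host and which to a container while reproducing, something like the following sketch could be used, run as root on a controller. It assumes that the containerized epmd lives in a different mount namespace than PID 1; the helper names ns_id and classify_epmd are hypothetical.

```shell
# Hypothetical helper: print the mount-namespace identifier of a PID,
# read from /proc (the proc root can be overridden, e.g. for testing).
ns_id() {
    readlink "${2:-/proc}/$1/ns/mnt"
}

# Label every epmd process as "host" or "container" by comparing its
# mount namespace with that of PID 1.
classify_epmd() {
    host_ns=$(ns_id 1) || return 1
    for pid in $(pgrep epmd); do
        # An unreadable namespace yields an empty string, which never
        # equals host_ns, so such a process is not reported as "host".
        if [ "$(ns_id "$pid")" = "$host_ns" ]; then
            echo "$pid host"
        else
            echo "$pid container"
        fi
    done
}
```

After the redeploy, the expected outcome is that only the PIDs reported as "host" beforehand are gone.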
Additional info:
Is this only in the case of controller replacement, or is this issue also reproducible with any config update changes in the overcloud, scaling, and/or other operations after the initial overcloud deployment?
(In reply to Jaromir Coufal from comment #3)
> Is this only in case of controller replacement or is this issue also
> reproducible in any config update changes in overcloud, scaling, and/or
> other operations post initial overcloud deployment?
I think this will not block minor updates, but this will block compute scaling.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2018:0602