Bug 1771566
| Summary: | [IPI][OpenStack] Many haproxy processes hanging around due to monitor not closing connections. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
| Component: | Machine Config Operator | Assignee: | Martin André <m.andre> |
| Status: | CLOSED ERRATA | QA Contact: | weiwei jiang <wjiang> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.0 | CC: | agogala, amurdaca, bperkins, kgarriso, m.andre, rheinzma, wsun, yboaron |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: Because HAProxy timeout values can be sensitive to some applications (e.g. Kuryr), we use long (24-hour) timeout values for the API LB.<br>Consequence: If the HAProxy reload operation is triggered many times in a short period, we may end up with many HAProxy processes hanging around.<br>Fix: Force sending SIGTERM, after a timeout (default: 120 seconds), to old HAProxy processes which haven't terminated.<br>Result: No more long-lived duplicate haproxy processes. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-13 21:52:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
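The Doc Text above pins the cause on the deliberately long (24-hour) client/server timeouts used for the API load balancer. If you want to confirm those values on a running cluster, something along the following lines should do. This is only an illustrative sketch: the node name is a placeholder, the grep pattern is a suggestion, and it assumes the rendered config at /etc/haproxy/haproxy.cfg (the path used by the quoted wrapper script) is visible from the host mount namespace.

```bash
# Illustrative check of the rendered API LB timeouts on one master node.
# <master-node> is a placeholder; adjust the pattern to your config layout.
oc debug nodes/<master-node> -- chroot /host \
  grep -E 'timeout +(client|server)' /etc/haproxy/haproxy.cfg
```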
Description
Ryan Howe
2019-11-12 15:45:14 UTC
This isn't really an MCO bug. Assigning to someone from the Baremetal team.

Thanks for the report, @Ryan. Will check ASAP.

I managed to reproduce this problem in a baremetal environment as well. Since HAProxy timeout values can be sensitive to some applications (e.g. Kuryr), we use long (24-hour) timeout values for the API LB. Because of these timeout values, old HAProxy processes are sometimes terminated only after 24 hours once an HAProxy reload operation has been triggered. So, if the HAProxy reload operation is triggered many times in a short period, we may end up with many HAProxy processes hanging around, as described in this bug.

I wonder what causes the Monitor container to send so many reload requests to the HAProxy container in a very short period. Is it possible for you to add the logs from the Monitor container?

We solved this problem for the baremetal case. The relevant OpenStack file [1] should be updated according to [2].

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/openstack/files/openstack-haproxy.yaml
[2] https://github.com/openshift/machine-config-operator/pull/1274

Pulled in Yossi's fix from https://bugzilla.redhat.com/show_bug.cgi?id=1777204 to OpenStack in https://github.com/openshift/machine-config-operator/pull/1369

Checked with 4.4.0-0.nightly-2020-02-06-041236, and it's fixed:
$ for i in {0..2}; do oc debug nodes/wj431shedup-vwljr-master-${i} -- chroot /host pgrep haproxy -a; done
Starting pod/wj431shedup-vwljr-master-0-debug ...
To use host binaries, run `chroot /host`
44161 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 56 53
44166 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 56 53
Removing debug pod ...
Starting pod/wj431shedup-vwljr-master-1-debug ...
To use host binaries, run `chroot /host`
64980 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 70 67
64985 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 70 67
Removing debug pod ...
Starting pod/wj431shedup-vwljr-master-2-debug ...
To use host binaries, run `chroot /host`
9765 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 26 23 10 9
9807 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 26 23 10 9
Removing debug pod ...
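As an aside, two `haproxy` processes per master is the expected steady state in this output: the wrapper starts HAProxy in master-worker mode (`-W`), so what remains is most likely the master plus a single worker once the old PIDs listed after `-sf` have exited. A rough, non-authoritative way to check that the count is not still growing, reusing the same debug-pod pattern as above (loop count and interval are arbitrary):

```bash
# Sample the haproxy process count on one master a few times, a minute apart.
# A steadily increasing count would indicate the pre-fix lingering behaviour.
for n in {1..5}; do
  oc debug nodes/wj431shedup-vwljr-master-0 -- \
    chroot /host bash -c 'pgrep haproxy | wc -l'
  sleep 60
done
```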
$ oc get pods -n openshift-openstack-infra -l app=openstack-infra-api-lb -o jsonpath='{.items[*].spec.containers[*].command}'
[/bin/bash -c #/bin/bash
verify_old_haproxy_ps_being_deleted()
{
local prev_pids
prev_pids="$1"
sleep $OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT
cur_pids=$(pidof haproxy)
for val in $prev_pids; do
if [[ $cur_pids =~ (^|[[:space:]])"$val"($|[[:space:]]) ]] ; then
kill $val
fi
done
}
reload_haproxy()
{
old_pids=$(pidof haproxy)
if [ -n "$old_pids" ]; then
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf $old_pids &
#There seems to be some cases where HAProxy doesn't drain properly.
#To handle that case, SIGTERM signal being sent to old HAProxy processes which haven't terminated.
verify_old_haproxy_ps_being_deleted "$old_pids" &
else
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid &
fi
}
msg_handler()
{
while read -r line; do
echo "The client send: $line" >&2
# currently only 'reload' msg is supported
if [ "$line" = reload ]; then
reload_haproxy
fi
done
}
set -ex
declare -r haproxy_sock="/var/run/haproxy/haproxy-master.sock"
declare -r haproxy_log_sock="/var/run/haproxy/haproxy-log.sock"
export -f msg_handler
export -f reload_haproxy
export -f verify_old_haproxy_ps_being_deleted
rm -f "$haproxy_sock" "$haproxy_log_sock"
socat UNIX-RECV:${haproxy_log_sock} STDOUT &
if [ -s "/etc/haproxy/haproxy.cfg" ]; then
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid &
fi
socat UNIX-LISTEN:${haproxy_sock},fork system:'bash -c msg_handler'
] [monitor /etc/kubernetes/kubeconfig /config/haproxy.cfg.tmpl /etc/haproxy/haproxy.cfg --api-vip 192.168.0.5] [/bin/bash -c #/bin/bash
verify_old_haproxy_ps_being_deleted()
{
local prev_pids
prev_pids="$1"
sleep $OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT
cur_pids=$(pidof haproxy)
for val in $prev_pids; do
if [[ $cur_pids =~ (^|[[:space:]])"$val"($|[[:space:]]) ]] ; then
kill $val
fi
done
}
reload_haproxy()
{
old_pids=$(pidof haproxy)
if [ -n "$old_pids" ]; then
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf $old_pids &
#There seems to be some cases where HAProxy doesn't drain properly.
#To handle that case, SIGTERM signal being sent to old HAProxy processes which haven't terminated.
verify_old_haproxy_ps_being_deleted "$old_pids" &
else
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid &
fi
}
msg_handler()
{
while read -r line; do
echo "The client send: $line" >&2
# currently only 'reload' msg is supported
if [ "$line" = reload ]; then
reload_haproxy
fi
done
}
set -ex
declare -r haproxy_sock="/var/run/haproxy/haproxy-master.sock"
declare -r haproxy_log_sock="/var/run/haproxy/haproxy-log.sock"
export -f msg_handler
export -f reload_haproxy
export -f verify_old_haproxy_ps_being_deleted
rm -f "$haproxy_sock" "$haproxy_log_sock"
socat UNIX-RECV:${haproxy_log_sock} STDOUT &
if [ -s "/etc/haproxy/haproxy.cfg" ]; then
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid &
fi
socat UNIX-LISTEN:${haproxy_sock},fork system:'bash -c msg_handler'
] [monitor /etc/kubernetes/kubeconfig /config/haproxy.cfg.tmpl /etc/haproxy/haproxy.cfg --api-vip 192.168.0.5] [/bin/bash -c #/bin/bash
verify_old_haproxy_ps_being_deleted()
{
local prev_pids
prev_pids="$1"
sleep $OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT
cur_pids=$(pidof haproxy)
for val in $prev_pids; do
if [[ $cur_pids =~ (^|[[:space:]])"$val"($|[[:space:]]) ]] ; then
kill $val
fi
done
}
reload_haproxy()
{
old_pids=$(pidof haproxy)
if [ -n "$old_pids" ]; then
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf $old_pids &
#There seems to be some cases where HAProxy doesn't drain properly.
#To handle that case, SIGTERM signal being sent to old HAProxy processes which haven't terminated.
verify_old_haproxy_ps_being_deleted "$old_pids" &
else
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid &
fi
}
msg_handler()
{
while read -r line; do
echo "The client send: $line" >&2
# currently only 'reload' msg is supported
if [ "$line" = reload ]; then
reload_haproxy
fi
done
}
set -ex
declare -r haproxy_sock="/var/run/haproxy/haproxy-master.sock"
declare -r haproxy_log_sock="/var/run/haproxy/haproxy-log.sock"
export -f msg_handler
export -f reload_haproxy
export -f verify_old_haproxy_ps_being_deleted
rm -f "$haproxy_sock" "$haproxy_log_sock"
socat UNIX-RECV:${haproxy_log_sock} STDOUT &
if [ -s "/etc/haproxy/haproxy.cfg" ]; then
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid &
fi
socat UNIX-LISTEN:${haproxy_sock},fork system:'bash -c msg_handler'
] [monitor /etc/kubernetes/kubeconfig /config/haproxy.cfg.tmpl /etc/haproxy/haproxy.cfg --api-vip 192.168.0.5]
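For completeness, here is how the reload path in the wrapper script above can be exercised by hand. This is an illustrative sketch, not part of the bug report: it assumes a shell with access to the container's filesystem on a master node, and it relies only on what the quoted script provides, namely the socat listener on the master socket and the `OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT` force-kill window (120 seconds by default, per the Doc Text).

```bash
# Ask the wrapper to reload HAProxy. Its msg_handler only understands the
# literal line "reload", which makes it re-exec haproxy with -sf <old pids>.
echo reload | socat - UNIX-CONNECT:/var/run/haproxy/haproxy-master.sock

# With the fix, any old PID still alive after OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT
# seconds is sent SIGTERM by verify_old_haproxy_ps_being_deleted, so the
# process list should settle back to the master/worker pair.
sleep "${OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT:-120}"
pgrep haproxy -a
```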
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581