Description of problem:
On masters with an IPI OpenStack install, the haproxy static pod (in a non-existent namespace) has many haproxy processes hanging around, apparently because the haproxy monitor script is not closing its connections.

Version-Release number of selected component (if applicable):
4.2.2

How reproducible:
100%

Steps to Reproduce:
1. Install OCP on OpenStack with IPI.

Actual results:
50-150 haproxy processes running on the masters.

# ps aux | grep /sbin/haproxy
1001 14257 0.0 0.0 75752 7132 ? S Nov11 0:00 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1002 1001 998 997 994 993 990 989 986 985 982 981 978 977 974 973 970 969 962 961 954 953 946 945 942 941 938 937 934 933 930 929 922 921 914 913 910 909 906 905 902 901 898 897 890 889 886 885 878 877 874 873 870 869 866 865 862 861 854 853 850 849 846 845 842 841 838 837 834 833
1001 14259 0.0 0.0 76136 5236 ? Ss Nov11 0:01 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1002 1001 998 997 994 993 990 989 986 985 982 981 978 977 974 973 970 969 962 961 954 953 946 945 942 941 938 937 934 933 930 929 922 921 914 913 910 909 906 905 902 901 898 897 890 889 886 885 878 877 874 873 870 869 866 865 862 861 854 853 850 849 846 845 842 841 838 837 834 833
...
1001 129224 0.0 0.0 76176 5192 ? Ss Nov11 0:01 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 994 993 990 989 986 985 982 981 978 977 974 973 970 969 962 961 954 953 946 945 942 941 938 937 934 933 930 929 922 921 914 913 910 909 906 905 902 901 898 897 890 889 886 885 878 877 874 873 870 869 866 865 862 861 854 853 850 849 846 845 842 841 838 837 834 833
1001 129898 0.0 0.0 75752 7304 ? S Nov11 0:00 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 998 997 994 993 990 989 986 985 982 981 978 977 974 973 970 969 962 961 954 953 946 945 942 941 938 937 934 933 930 929 922 921 914 913 910 909 906 905 902 901 898 897 890 889 886 885 878 877 874 873 870 869 866 865 862 861 854 853 850 849 846 845 842 841 838 837 834 833
1001 129900 0.0 0.0 76276 5312 ? Ss Nov11 0:01 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 998 997 994 993 990 989 986 985 982 981 978 977 974 973 970 969 962 961 954 953 946 945 942 941 938 937 934 933 930 929 922 921 914 913 910 909 906 905 902 901 898 897 890 889 886 885 878 877 874 873 870 869 866 865 862 861 854 853 850 849 846 845 842 841 838 837 834 833

http://pastebin.test.redhat.com/813315

Expected results:
Connections to be closed after the monitor is done checking the endpoint.

Additional info:
Might be due to the defer close [1] or due to the 24-hour timeouts in haproxy [2].

[1] https://github.com/openshift/baremetal-runtimecfg/blob/release-4.2/pkg/monitor/monitor.go#L45
[2] /etc/kubernetes/static-pod-resources/haproxy/haproxy.cfg.tmpl:
    timeout http-request 1m
    timeout queue        1m
    timeout connect      10s
    timeout client       86400s
    timeout server       86400s
    timeout tunnel       86400s
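A quick way to see the buildup on a master (hypothetical diagnostic commands for illustration; this output was not captured for this report):

# count the leftover haproxy processes
pgrep -c -f '/usr/sbin/haproxy'
# list the TCP connections still held open by old haproxy workers and/or the
# monitor process; open connections are what keep the old workers alive until
# the 24h timeouts expire
ss -tnp | grep -E 'haproxy|monitor'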
This isn't really an MCO bug. Assigning to someone from the Baremetal team.
Thanks for the report, @Ryan. Will check ASAP.
I managed to reproduce this problem in a Baremetal environment as well.
Since some applications (e.g. Kuryr) are sensitive to the HAProxy timeout values, we use long (24-hour) timeouts for the API LB. Because of these timeouts, old HAProxy processes are sometimes terminated only 24 hours after a reload is triggered, so if the HAProxy reload operation is triggered many times in a short period we may end up with many HAProxy processes hanging around, as described in this bug.

I wonder what causes the Monitor container to send so many reload requests to the HAProxy container in such a short period. Is it possible for you to add the logs from the Monitor container?
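To make the first point concrete, here is a condensed sketch of the reload path used by the HAProxy container's wrapper script (not the full script): the new process takes over the listening sockets via -x and asks the old processes to finish their existing connections via -sf, and with the 86400s client/server/tunnel timeouts an old process can keep draining for up to 24 hours.

# condensed sketch of the reload path
old_pids=$(pidof haproxy)
/usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg \
    -p /var/lib/haproxy/run/haproxy.pid \
    -x /var/lib/haproxy/run/haproxy.sock \
    -sf $old_pids &
# each reload leaves the old PIDs behind until their connections drain,
# so N reloads in a short window can leave up to N old processes running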
We solved this problem for the Baremetal case. The relevant OpenStack file [1] should be updated according to [2].

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/openstack/files/openstack-haproxy.yaml
[2] https://github.com/openshift/machine-config-operator/pull/1274
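In short, the updated script adds a helper that waits for a grace period after a reload and then sends SIGTERM to any old HAProxy PIDs that are still alive, instead of waiting for the 24-hour timeouts to expire (condensed here; see [2] for the actual file):

verify_old_haproxy_ps_being_deleted() {
    local prev_pids
    prev_pids="$1"
    # grace period before force-killing leftover processes
    sleep "$OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT"
    cur_pids=$(pidof haproxy)
    for val in $prev_pids; do
        # SIGTERM any previous PID that is still running
        if [[ $cur_pids =~ (^|[[:space:]])"$val"($|[[:space:]]) ]]; then
            kill "$val"
        fi
    done
}
# called in the background right after the new haproxy is started with -sf:
#   verify_old_haproxy_ps_being_deleted "$old_pids" &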
Pulled in Yossi's fix from https://bugzilla.redhat.com/show_bug.cgi?id=1777204 to OpenStack in https://github.com/openshift/machine-config-operator/pull/1369
Checked with 4.4.0-0.nightly-2020-02-06-041236; it's fixed.

$ for i in {0..2}; do oc debug nodes/wj431shedup-vwljr-master-${i} -- chroot /host pgrep haproxy -a; done
Starting pod/wj431shedup-vwljr-master-0-debug ...
To use host binaries, run `chroot /host`
44161 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 56 53
44166 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 56 53
Removing debug pod ...
Starting pod/wj431shedup-vwljr-master-1-debug ...
To use host binaries, run `chroot /host`
64980 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 70 67
64985 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 70 67
Removing debug pod ...
Starting pod/wj431shedup-vwljr-master-2-debug ...
To use host binaries, run `chroot /host`
9765 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 26 23 10 9
9807 /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 26 23 10 9
Removing debug pod ...

$ oc get pods -n openshift-openstack-infra -l app=openstack-infra-api-lb -o jsonpath='{.items[*].spec.containers[*].command}'

The output lists the haproxy and monitor container commands for each of the three API LB pods; the command pair is identical across the pods, so only one copy is reproduced here:

[/bin/bash -c #/bin/bash
verify_old_haproxy_ps_being_deleted()
{
  local prev_pids
  prev_pids="$1"
  sleep $OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT
  cur_pids=$(pidof haproxy)
  for val in $prev_pids; do
      if [[ $cur_pids =~ (^|[[:space:]])"$val"($|[[:space:]]) ]] ; then
          kill $val
      fi
  done
}

reload_haproxy()
{
  old_pids=$(pidof haproxy)
  if [ -n "$old_pids" ]; then
      /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf $old_pids &
      #There seems to be some cases where HAProxy doesn't drain properly.
      #To handle that case, SIGTERM signal being sent to old HAProxy processes which haven't terminated.
      verify_old_haproxy_ps_being_deleted "$old_pids" &
  else
      /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid &
  fi
}

msg_handler()
{
  while read -r line; do
      echo "The client send: $line" >&2
      # currently only 'reload' msg is supported
      if [ "$line" = reload ]; then
          reload_haproxy
      fi
  done
}

set -ex
declare -r haproxy_sock="/var/run/haproxy/haproxy-master.sock"
declare -r haproxy_log_sock="/var/run/haproxy/haproxy-log.sock"
export -f msg_handler
export -f reload_haproxy
export -f verify_old_haproxy_ps_being_deleted
rm -f "$haproxy_sock" "$haproxy_log_sock"
socat UNIX-RECV:${haproxy_log_sock} STDOUT &
if [ -s "/etc/haproxy/haproxy.cfg" ]; then
    /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid &
fi
socat UNIX-LISTEN:${haproxy_sock},fork system:'bash -c msg_handler'
]
[monitor /etc/kubernetes/kubeconfig /config/haproxy.cfg.tmpl /etc/haproxy/haproxy.cfg --api-vip 192.168.0.5]

(The same [/bin/bash -c ...] and [monitor ...] command pair is repeated verbatim for the other two API LB pods.)
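As an extra spot check (hypothetical commands, not run as part of this verification), one could trigger a reload from inside the haproxy container and confirm that the old PIDs are cleaned up after the grace period. The socket path and the 'reload' message come from the wrapper script above; OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT is the container's environment variable, and the 120s fallback below is only an assumption for illustration.

# note the current PIDs, then ask the wrapper script to reload
before=$(pidof haproxy)
echo reload | socat - UNIX-CONNECT:/var/run/haproxy/haproxy-master.sock
# wait out the force-delete grace period
sleep "${OLD_HAPROXY_PS_FORCE_DEL_TIMEOUT:-120}"
pidof haproxy   # the PIDs recorded in "$before" should no longer be listed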
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581