Bug 2071635 - After upgrading from 4.8.13 to 4.8.33 upgrade was success, but haproxy pod are crashloopback with error "[ALERT] 082/151218 (9) : Starting proxy health_check_http_url: cannot bind socket [:::30936]"
Keywords:
Status: CLOSED DUPLICATE of bug 2069740
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Douglas Schilling Landgraf
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-04 12:22 UTC by Ravi Chaudhary
Modified: 2022-05-24 09:18 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-06 20:11:08 UTC
Target Upstream Version:
Embargoed:



Description Ravi Chaudhary 2022-04-04 12:22:48 UTC
Description of problem:

After upgrading from 4.8.13 to 4.8.33, the upgrade itself succeeded, but the haproxy pods are in CrashLoopBackOff with the error "[ALERT] 082/151218 (9) : Starting proxy health_check_http_url: cannot bind socket [:::30936]".
Later, one container in each pod started, but the others did not:
$ omg get pods -A |grep haproxy
openshift-vsphere-infra                           haproxy-esp01-66qgh-master-0                             1/2    Running    367       22h
openshift-vsphere-infra                           haproxy-esp01-66qgh-master-1                             1/2    Running    355       22h
openshift-vsphere-infra                           haproxy-esp01-66qgh-master-2                             2/2    Running    365       22h

The haproxy pod logs showed:
+ declare -r haproxy_sock=/var/run/haproxy/haproxy-master.sock
+ declare -r haproxy_log_sock=/var/run/haproxy/haproxy-log.sock
+ export -f msg_handler
+ export -f reload_haproxy
+ export -f verify_old_haproxy_ps_being_deleted
+ rm -f /var/run/haproxy/haproxy-master.sock /var/run/haproxy/haproxy-log.sock
+ '[' -s /etc/haproxy/haproxy.cfg ']'
+ socat UNIX-RECV:/var/run/haproxy/haproxy-log.sock STDOUT
+ socat UNIX-LISTEN:/var/run/haproxy/haproxy-master.sock,fork 'system:bash -c msg_handler'
+ /usr/sbin/haproxy -W -db -f /etc/haproxy/haproxy.cfg -p /var/lib/haproxy/run/haproxy.pid
<133>Mar 24 15:12:18 haproxy[9]: Proxy main started.
[NOTICE] 082/151218 (9) : haproxy version is 2.2.13-5f3eb59
[NOTICE] 082/151218 (9) : path to executable is /usr/sbin/haproxy
[ALERT] 082/151218 (9) : Starting proxy health_check_http_url: cannot bind socket [:::30936]
<133>Mar 24 15:12:18 haproxy[9]: Proxy stats started.
<133>Mar 24 15:12:18 haproxy[9]: Proxy masters started.


haproxy-esp01-66qgh-master-2
Warning  ProbeError  10m (x1027 over 21h)  kubelet  (combined from similar events): Liveness probe error: Get http://10.10.30.82:30936/haproxy_ready: read tcp 10.10.30.82:40358->10.10.30.82:30936: read: connection reset by peer

haproxy-esp01-66qgh-master-0
Warning  ProbeError  120m (x895 over 20h)  kubelet  (combined from similar events): Liveness probe error: Get http://10.10.30.81:30936/haproxy_ready: read tcp 10.10.30.81:34222->10.10.30.81:30936: read: connection reset by peer


---
On checking the master nodes, I can see the port is already in use by a NodePort service ("Handle NodePort service elksaas-gd-ls-logstash port 30936"):



[core@esp01-66qgh-master-0 ~]$ sudo netstat -atlpo | grep 30936
tcp6       0      0 [::]:30936              [::]:*                  LISTEN      547545/ovnkube       off (0.00/0/0)


[core@esp01-66qgh-master-1 ~]$ sudo netstat -atlpo | grep 30936
tcp6       0      0 [::]:30936              [::]:*                  LISTEN      2855901/ovnkube      off (0.00/0/0)

[core@esp01-66qgh-master-2 ~]$ sudo netstat -atlpo | grep 30936
tcp6       0      0 [::]:30936              [::]:*                  LISTEN      3975186/ovnkube      off (0.00/0/0)
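For triage, the owning process can be pulled out of that netstat output mechanically. A minimal sketch, using the sample line from master-2 above (on a live node you would pipe `sudo netstat -atlpo | grep 30936` into the same awk instead of the hard-coded sample):

```shell
# Extract the PID/program column from a netstat -atlpo line.
# The sample line below is copied verbatim from the report; on a real
# node you would feed live netstat output through the same filter.
line='tcp6       0      0 [::]:30936              [::]:*                  LISTEN      3975186/ovnkube      off (0.00/0/0)'
owner=$(printf '%s\n' "$line" | awk '{print $7}')
echo "port 30936 is held by: $owner"
```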


---

[scripts]$ oc logs ovnkube-node-gzbqr -n openshift-ovn-kubernetes -c ovnkube-node | grep 30936
I0325 11:35:11.874496 3975186 gateway_iptables.go:45] Adding rule in table: nat, chain: OVN-KUBE-NODEPORT with args: "-p TCP -m addrtype --dst-type LOCAL --dport 30936 -j DNAT --to-destination 172.30.180.193:5035" for protocol: 0 
I0325 11:35:12.225603 3975186 gateway_iptables.go:45] Adding rule in table: nat, chain: OVN-KUBE-NODEPORT with args: "-p TCP -m addrtype --dst-type LOCAL --dport 30936 -j DNAT --to-destination 172.30.180.193:5035" for protocol: 0 
I0325 11:35:12.525202 3975186 port_claim.go:182] Handle NodePort service elksaas-gd-ls-logstash port 30936
I0325 11:35:12.525209 3975186 port_claim.go:40] Opening socket for service: elksaas/elksaas-gd-ls-logstash, port: 30936 and protocol TCP
I0325 11:35:12.525212 3975186 port_claim.go:63] Opening socket for LocalPort "nodePort for elksaas/elksaas-gd-ls-logstash:rpa" (:30936/tcp)
I0325 11:35:12.534890 3975186 gateway_iptables.go:45] Adding rule in table: nat, chain: OVN-KUBE-NODEPORT with args: "-p TCP -m addrtype --dst-type LOCAL --dport 30936 -j DNAT --to-destination 172.30.180.193:5035" for protocol: 0
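The collision follows directly from the port numbers: the haproxy health-check listener binds 30936, which sits inside the default kube-apiserver `--service-node-port-range` of 30000-32767 (an assumption here; verify the range actually configured on this cluster). A minimal sketch of that check:

```shell
# Check whether a static listener port falls inside the NodePort range.
# 30000-32767 is the Kubernetes default --service-node-port-range; the
# cluster may be configured differently, so confirm before relying on it.
port=30936
range_min=30000
range_max=32767
if [ "$port" -ge "$range_min" ] && [ "$port" -le "$range_max" ]; then
  echo "port $port is inside the NodePort range: a NodePort service can claim it"
else
  echo "port $port is outside the NodePort range"
fi
```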

Version-Release number of selected component (if applicable):

4.8.33


How reproducible:

Consistently; the issue appeared immediately after upgrading to 4.8.33.


Steps to Reproduce:
1. Upgrade from 4.8.13 to 4.8.33.
2. The haproxy pods then try to bind port 30936, which is in the NodePort service range.

Actual results:

The haproxy pods try to use port 30936 from the NodePort service range, which is already claimed by a NodePort service, so they crash-loop.

Expected results:

The haproxy pods should not use a port from the NodePort service range.


Additional info:
Haproxy config:

$ oc debug node/esp01-66qgh-master-0
Starting pod/esp01-66qgh-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.30.81
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ll /etc/haproxy/
sh: ll: command not found
sh-4.4# ls -al /etc/haproxy/
total 16
drwxr-xr-x.   2 root root   25 Mar 23 16:52 .
drwxr-xr-x. 102 root root 8192 Mar 26 17:42 ..
-rw-r--r--.   1 root root 1158 Mar 23 17:42 haproxy.cfg
sh-4.4# cat /etc/haproxy/haproxy.cfg 
global
  stats socket /var/lib/haproxy/run/haproxy.sock  mode 600 level admin expose-fd listeners
defaults
  maxconn 20000
  mode    tcp
  log     /var/run/haproxy/haproxy-log.sock local0
  option  dontlognull
  retries 3
  timeout http-request 30s
  timeout queue        1m
  timeout connect      10s
  timeout client       86400s
  timeout server       86400s
  timeout tunnel       86400s
frontend  main
  bind :::9445 v4v6
  default_backend masters
listen health_check_http_url
  bind :::30936 v4v6
  mode http
  monitor-uri /haproxy_ready
  option dontlognull
listen stats
  bind localhost:50000
  mode http
  stats enable
  stats hide-version
  stats uri /haproxy_stats
  stats refresh 30s
  stats auth Username:Password
backend masters
   option  httpchk GET /readyz HTTP/1.0
   option  log-health-checks
   balance roundrobin
   server esp01-66qgh-master-0 10.10.30.81:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
   server esp01-66qgh-master-2 10.10.30.82:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
   server esp01-66qgh-master-1 10.10.30.83:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
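Note that the `health_check_http_url` listener is what binds the conflicting port. The fix would presumably move that bind off the NodePort range; as an illustration only (port 9444 here is hypothetical, not necessarily what the actual patch uses), the change would look along these lines:

```
listen health_check_http_url
  bind :::9444 v4v6
  mode http
  monitor-uri /haproxy_ready
  option dontlognull
```

Any liveness-probe port in the haproxy pod spec pointing at /haproxy_ready would have to change to match.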

Comment 2 Ben Nemec 2022-04-06 20:11:08 UTC
Yeah, I just noticed this conflict a couple of weeks ago. Patches are up for 4.11 to fix it and will need to be backported to all supported releases.

*** This bug has been marked as a duplicate of bug 2069740 ***

