Bug 2022144
| Summary: | 1 of 3 ovnkube-master pods stuck in clbo after ipi bm deployment - dualstack (Intermittent) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Chad Crum <ccrum> |
| Component: | Networking | Assignee: | Mohamed Mahmoud <mmahmoud> |
| Networking sub component: | ovn-kubernetes | QA Contact: | nshidlin <nshidlin> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | ccrum, dcbw, mbooth, rbrattai, surya, yprokule, zzhao |
| Version: | 4.9 | Keywords: | TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:26:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2046476, 2092976 | | |
Description
Chad Crum
2021-11-10 21:27:09 UTC
I've now tried to reproduce this 7 times on different hypervisors (finding this via libvirt) and I cannot reproduce it, although I am sure I have seen it before.

Interesting logs I found:

```
I1109 22:40:19.747784 1 node_tracker.go:162] Processing possible switch / router updates for node mds-master-0-0
I1109 22:40:19.747805 1 node_tracker.go:179] Node mds-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mds-master-0-0"
I1109 22:40:19.747833 1 node_tracker.go:143] Node mds-master-0-0 switch + router changed, syncing services
W1109 22:40:40.308090 1 ovn.go:969] k8s.ovn.org/node-chassis-id annotation not found for node mds-master-0-0
I1109 22:41:07.953097 1 node_tracker.go:179] Node mds-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mds-master-0-0"
```

Here master-0-0 is the pod in CLBO. I do not have the environment now to check the node annotation, but I have asked my team to let me know if they hit it. Any ideas, or is this known?

As a test, on a successful deploy I removed the l3-gateway and node-chassis-id annotations from the master-0-0 node and restarted the ovnkube-master-* pods, but they all started back up successfully. The l3-gateway annotation error showed up in the logs, yet the pods started fine:

```
I1111 14:55:55.919577 1 node_tracker.go:179] Node mdchad-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mdchad-master-0-0"
```

Not sure if the annotation is related or not, although I don't know why the annotation would be missing.

Looking further into the logs for the CLBO pod (ovnkube-master-2r8wq), I see that the nbdb and sbdb containers got into the CrashLoopBackOff state while the other containers were running.

```
# nbdb container has no logs and is in CrashLoopBackOff
nbdb: container is clbo
State: Waiting
Reason: CrashLoopBackOff

oc logs -f ovnkube-master-2r8wq -c nbdb
```

```
# sbdb container is clbo and logs do not have much info
sbdb: container is clbo
State: Waiting
Reason: CrashLoopBackOff
```

sbdb logs:

```
[kni@provisionhost-0-0 ~]$ oc logs -f ovnkube-master-2r8wq -c sbdb
+ [[ -f /env/_master ]]
+ ovn_kubernetes_namespace=openshift-ovn-kubernetes
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ transport=ssl
+ ovn_raft_conn_ip_url_suffix=
+ [[ 192.168.138.15 == *\:* ]]
+ db=sb
+ db_port=9642
+ ovn_db_file=/etc/ovn/ovnsb_db.db
++ bracketify 192.168.138.15
++ case "$1" in
++ echo 192.168.138.15
+ OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=192.168.138.15 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
+ CLUSTER_INITIATOR_IP=192.168.138.16
++ date -Iseconds
+ echo '2021-11-10T14:57:20+00:00 - starting sbdb CLUSTER_INITIATOR_IP=192.168.138.16'
2021-11-10T14:57:20+00:00 - starting sbdb CLUSTER_INITIATOR_IP=192.168.138.16
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnsb_db.db ]]
+ [[ false == \t\r\u\e ]]
+ exec /usr/share/ovn/scripts/ovn-ctl --db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=192.168.138.15 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-sb-log=-vconsole:info -vfile:off' run_sb_ovsdb
```
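For reference, a minimal sketch of how the annotations and container states mentioned above could be inspected. The node and pod names are taken from the logs in this report; any other cluster will differ.

```bash
# Sketch only: node/pod names come from the report above; adjust for the cluster at hand.

# Check whether the OVN annotations the ovnkube-master logs complain about exist on the node.
oc get node mds-master-0-0 -o yaml | grep -E 'k8s.ovn.org/(l3-gateway-config|node-chassis-id)'

# Show per-container state for the CLBO ovnkube-master pod (nbdb/sbdb expected in CrashLoopBackOff).
oc -n openshift-ovn-kubernetes get pod ovnkube-master-2r8wq \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# Previous-attempt logs for the crashing DB containers.
oc -n openshift-ovn-kubernetes logs ovnkube-master-2r8wq -c nbdb --previous
oc -n openshift-ovn-kubernetes logs ovnkube-master-2r8wq -c sbdb --previous
```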
It appears that some of the database containers are unable to make progress; we'd expect many more log messages after the "ovn-ctl ... run_sb_ovsdb" command in the script, with output from the ovsdb-server processes themselves. Because we don't see that, we would need to debug further. One option could be resource contention; the nodes apparently have 4 CPUs and 16 GB RAM.

*** Bug 2013048 has been marked as a duplicate of this bug. ***

With the original dualstack spoke that had ovnkube-master CLBO, both the nbdb and sbdb containers were CLBO. In this case I just see sbdb, with logs:

```
oc logs -p ovnkube-master-bf4s6 -c sbdb
+ [[ -f /env/_master ]]
+ ovn_kubernetes_namespace=openshift-ovn-kubernetes
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ transport=ssl
+ ovn_raft_conn_ip_url_suffix=
+ [[ fd2e:6f44:5dd8:a::11 == *\:* ]]
+ ovn_raft_conn_ip_url_suffix=':[::]'
+ db=sb
+ db_port=9642
+ ovn_db_file=/etc/ovn/ovnsb_db.db
++ bracketify fd2e:6f44:5dd8:a::11
++ case "$1" in
++ echo '[fd2e:6f44:5dd8:a::11]'
+ OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=[fd2e:6f44:5dd8:a::11] --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
+ CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10
++ date -Iseconds
+ echo '2021-11-15T13:47:22+00:00 - starting sbdb CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10'
2021-11-15T13:47:22+00:00 - starting sbdb CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnsb_db.db ]]
+ [[ false == \t\r\u\e ]]
+ exec /usr/share/ovn/scripts/ovn-ctl --db-sb-cluster-local-port=9644 '--db-sb-cluster-local-addr=[fd2e:6f44:5dd8:a::11]' --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-sb-log=-vconsole:info -vfile:off' run_sb_ovsdb
```

Temp workaround: remove the pidfile for the failing container, `rm /var/run/ovn/ovn?b_db.pid` (see the sketch at the end of this report).

*** Bug 2032541 has been marked as a duplicate of this bug. ***

This does not reproduce with 4.10.0-0.nightly-2022-01-23-013716. Moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
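The temporary workaround mentioned above could be applied roughly as follows. This is a sketch only: it assumes the stale pidfile is reachable on the node's host filesystem, which depends on how the OVN DB directories are mounted, so verify the path before removing anything.

```bash
# Sketch only: node name is taken from the report; the pidfile location is an
# assumption and should be confirmed on the affected node first.
NODE=mds-master-0-0

# Look for stale OVN DB pidfiles on the host from a node debug shell.
oc debug node/${NODE} -- chroot /host ls -l /var/run/ovn/

# If a stale ovnnb_db.pid / ovnsb_db.pid is present for the crashing container,
# remove it and let the pod restart (matches rm /var/run/ovn/ovn?b_db.pid above).
oc debug node/${NODE} -- chroot /host sh -c 'rm -f /var/run/ovn/ovn?b_db.pid'
```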