Bug 2022144
| Summary: | 1 of 3 ovnkube-master pods stuck in clbo after ipi bm deployment - dualstack (Intermittent) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Chad Crum <ccrum> |
| Component: | Networking | Assignee: | Mohamed Mahmoud <mmahmoud> |
| Networking sub component: | ovn-kubernetes | QA Contact: | nshidlin <nshidlin> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | ccrum, dcbw, mbooth, rbrattai, surya, yprokule, zzhao |
| Version: | 4.9 | Keywords: | TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:26:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2046476, 2092976 | | |
Description
Chad Crum
2021-11-10 21:27:09 UTC
I've now tried to reproduce this 7 times on different hypervisors (finding this via libvirt) and I cannot reproduce it, although I am sure I have seen it before.

Interesting logs I found:

```
I1109 22:40:19.747784 1 node_tracker.go:162] Processing possible switch / router updates for node mds-master-0-0
I1109 22:40:19.747805 1 node_tracker.go:179] Node mds-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mds-master-0-0"
I1109 22:40:19.747833 1 node_tracker.go:143] Node mds-master-0-0 switch + router changed, syncing services
W1109 22:40:40.308090 1 ovn.go:969] k8s.ovn.org/node-chassis-id annotation not found for node mds-master-0-0
I1109 22:41:07.953097 1 node_tracker.go:179] Node mds-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mds-master-0-0"
```

Here master-0-0 is the pod in CLBO. I do not have the environment now to check the node annotation, but I have asked my team to let me know if they hit it. Any ideas, or is this known?

As a test, on a successful deploy I removed the l3-gateway and node-chassis-id annotations from the master-0-0 node and restarted the ovnkube-master-* pods, but they all started back up successfully. The l3-gateway annotation error showed up in the logs, yet the pods started fine:

```
I1111 14:55:55.919577 1 node_tracker.go:179] Node mdchad-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mdchad-master-0-0"
```

Not sure if the annotation is related or not, although I don't know why the annotation would be missing.

Looking further into the logs for the CLBO pod (ovnkube-master-2r8wq), I see that the nbdb and sbdb containers got into the CrashLoopBackOff state while the other containers were running.

```
# nbdb container has no logs and is in CrashLoopBackOff
nbdb: container is clbo
State: Waiting
Reason: CrashLoopBackOff

oc logs -f ovnkube-master-2r8wq -c nbdb
```

```
# sbdb container is clbo and logs do not have much info
sbdb: container is clbo
State: Waiting
Reason: CrashLoopBackOff
```

sbdb logs:

```
[kni@provisionhost-0-0 ~]$ oc logs -f ovnkube-master-2r8wq -c sbdb
+ [[ -f /env/_master ]]
+ ovn_kubernetes_namespace=openshift-ovn-kubernetes
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ transport=ssl
+ ovn_raft_conn_ip_url_suffix=
+ [[ 192.168.138.15 == *\:* ]]
+ db=sb
+ db_port=9642
+ ovn_db_file=/etc/ovn/ovnsb_db.db
++ bracketify 192.168.138.15
++ case "$1" in
++ echo 192.168.138.15
+ OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=192.168.138.15 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
+ CLUSTER_INITIATOR_IP=192.168.138.16
++ date -Iseconds
+ echo '2021-11-10T14:57:20+00:00 - starting sbdb CLUSTER_INITIATOR_IP=192.168.138.16'
2021-11-10T14:57:20+00:00 - starting sbdb CLUSTER_INITIATOR_IP=192.168.138.16
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnsb_db.db ]]
+ [[ false == \t\r\u\e ]]
+ exec /usr/share/ovn/scripts/ovn-ctl --db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=192.168.138.15 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-sb-log=-vconsole:info -vfile:off' run_sb_ovsdb
```
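For reference, a minimal sketch of how the annotations and container states mentioned above could be inspected. The node and pod names are taken from the logs in this report; any other cluster will differ.

```bash
# Sketch only: node/pod names come from the report above; adjust for the cluster at hand.

# Check whether the OVN annotations the ovnkube-master logs complain about exist on the node.
oc get node mds-master-0-0 -o yaml | grep -E 'k8s.ovn.org/(l3-gateway-config|node-chassis-id)'

# Show per-container state for the CLBO ovnkube-master pod (nbdb/sbdb expected in CrashLoopBackOff).
oc -n openshift-ovn-kubernetes get pod ovnkube-master-2r8wq \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# Previous-attempt logs for the crashing DB containers.
oc -n openshift-ovn-kubernetes logs ovnkube-master-2r8wq -c nbdb --previous
oc -n openshift-ovn-kubernetes logs ovnkube-master-2r8wq -c sbdb --previous
```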
It appears that some of the database containers are unable to make progress; we'd expect many more log messages after the "ovn-ctl ... run_sb_ovsdb" command in the script, with output from the ovsdb-server processes themselves. Because we don't see that, we would need to debug further. One option could be resource contention; the nodes apparently have 4 CPUs and 16 GB RAM.

*** Bug 2013048 has been marked as a duplicate of this bug. ***

With the original dualstack spoke that had ovnkube-master CLBO, both the nbdb and sbdb containers were CLBO. In this case I just see sbdb, with logs:

```
oc logs -p ovnkube-master-bf4s6 -c sbdb
+ [[ -f /env/_master ]]
+ ovn_kubernetes_namespace=openshift-ovn-kubernetes
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ transport=ssl
+ ovn_raft_conn_ip_url_suffix=
+ [[ fd2e:6f44:5dd8:a::11 == *\:* ]]
+ ovn_raft_conn_ip_url_suffix=':[::]'
+ db=sb
+ db_port=9642
+ ovn_db_file=/etc/ovn/ovnsb_db.db
++ bracketify fd2e:6f44:5dd8:a::11
++ case "$1" in
++ echo '[fd2e:6f44:5dd8:a::11]'
+ OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=[fd2e:6f44:5dd8:a::11] --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
+ CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10
++ date -Iseconds
+ echo '2021-11-15T13:47:22+00:00 - starting sbdb CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10'
2021-11-15T13:47:22+00:00 - starting sbdb CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnsb_db.db ]]
+ [[ false == \t\r\u\e ]]
+ exec /usr/share/ovn/scripts/ovn-ctl --db-sb-cluster-local-port=9644 '--db-sb-cluster-local-addr=[fd2e:6f44:5dd8:a::11]' --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-sb-log=-vconsole:info -vfile:off' run_sb_ovsdb
```

Temp workaround: remove the pidfile for the failing container, `rm /var/run/ovn/ovn?b_db.pid` (see the sketch at the end of this report).

*** Bug 2032541 has been marked as a duplicate of this bug. ***

This does not reproduce with 4.10.0-0.nightly-2022-01-23-013716. Moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
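The temporary workaround mentioned above could be applied roughly as follows. This is a sketch only: it assumes the stale pidfile is reachable on the node's host filesystem, which depends on how the OVN DB directories are mounted, so verify the path before removing anything.

```bash
# Sketch only: node name is taken from the report; the pidfile location is an
# assumption and should be confirmed on the affected node first.
NODE=mds-master-0-0

# Look for stale OVN DB pidfiles on the host from a node debug shell.
oc debug node/${NODE} -- chroot /host ls -l /var/run/ovn/

# If a stale ovnnb_db.pid / ovnsb_db.pid is present for the crashing container,
# remove it and let the pod restart (matches rm /var/run/ovn/ovn?b_db.pid above).
oc debug node/${NODE} -- chroot /host sh -c 'rm -f /var/run/ovn/ovn?b_db.pid'
```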