Description of problem:

One of the ovnkube-master pods starts failing.

$ oc get pod -A | grep ovn
openshift-ovn-kubernetes   ovnkube-master-fb97s   6/6   Running            2    23h
openshift-ovn-kubernetes   ovnkube-master-flz6m   6/6   Running            0    23h
openshift-ovn-kubernetes   ovnkube-master-wq4sj   4/6   CrashLoopBackOff   10   19m
openshift-ovn-kubernetes   ovnkube-node-4lrrk     3/3   Running            0    23h
openshift-ovn-kubernetes   ovnkube-node-4vdst     3/3   Running            0    23h
openshift-ovn-kubernetes   ovnkube-node-6xdd4     3/3   Running            0    23h
openshift-ovn-kubernetes   ovnkube-node-8vlls     3/3   Running            0    23h
openshift-ovn-kubernetes   ovnkube-node-94tp9     3/3   Running            0    23h
openshift-ovn-kubernetes   ovnkube-node-g7cg9     3/3   Running            0    23h
openshift-ovn-kubernetes   ovnkube-node-lsvf8     3/3   Running            0    23h

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.13    True        False         62d     Cluster version is 4.7.13

How reproducible:
Not 100%. It failed during a scale test.

Steps to Reproduce:
1. Deploy 3000 pods (Knative services).
2. It starts failing at around 2526 pods.

Actual results:

The pod starts failing in its poststart hook with the error below (truncated excerpt of the hook script and its output as shown in the pod status):

$ oc get pod -A | grep ovn
openshift-ovn-kubernetes   ovnkube-master-wq4sj   4/6   PostStartHookError: command '/bin/bash -c set -x
while ! ovn-nbctl --no-leader-only -t 5 set-connection pssl:9641 -- set connection . inactivity_probe=60000; do
echo "$(date -Iseconds) - ERROR RESTARTING - nbdb - too many failed ovn-nbctl attempts, giving up"
DB_SCHEMA="/usr/share/ovn/ovn-nb.ovsschema"
DB_SERVER="unix:/var/run/ovn/ovnnb_db.sock"
OVN_NB_CTL="ovn-nbctl -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt \
while current_election_timer=$(ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null \
while is_candidate=$(ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null \
is_leader=$(ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null \
if ! ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound ${election_timer}; then
if ! ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound ${max_election_timer}; then
+ ovn-nbctl --no-leader-only -t 5 set-connection pssl:9641 -- set connection . inactivity_probe=60000
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
... the same error keeps repeating ...

The pod then goes into CrashLoopBackOff.

Expected results:

The ovnkube-master pod keeps running.
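For reference, these are the checks I would run against the failing master to see whether nbdb ever comes up and whether the raft cluster still has a leader (just a sketch; the pod name is taken from the output above, and the "nbdb" container name is an assumption based on the ovnkube-master pod spec):

$ oc -n openshift-ovn-kubernetes logs ovnkube-master-wq4sj -c nbdb --previous | tail -n 50
$ oc -n openshift-ovn-kubernetes exec ovnkube-master-wq4sj -c nbdb -- ls -l /var/run/ovn/ovnnb_db.sock
$ oc -n openshift-ovn-kubernetes exec ovnkube-master-wq4sj -c nbdb -- \
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

The "No such file or directory" on /var/run/ovn/ovnnb_db.sock in the poststart hook suggests the northbound ovsdb-server is not running when the hook fires, so cluster/status from the two healthy masters would also show whether this member ever rejoins the raft cluster.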
Not sure if this is related or not, but pods take a long time to start: they stay in "Pending" until the network interface is attached.

$ oc -n cupcake-1000-stage describe pod helloworld-go-00001-deployment-59fb487487-bvt8t
...
Events:
  Type    Reason          Age    From               Message
  ----    ------          ----   ----               -------
  Normal  Scheduled       8m3s   default-scheduler  Successfully assigned cupcake-1000-stage/helloworld-go-00001-deployment-59fb487487-qhmrx to worker002
  Normal  AddedInterface  5m33s  multus             Add eth0 [10.131.0.10/23]
  Normal  Pulled          5m30s  kubelet            Container image "quay.io/wreicher/quarkus-native-hello@sha256:09eb7225f59c6147d6a2acca945ed1b284b3ac317b3e0a51ca7f99136abc711c" already present on machine
  Normal  Created         5m23s  kubelet            Created container user-container
  Normal  Started         5m23s  kubelet            Started container user-container
  Normal  Pulled          5m23s  kubelet            Container image "registry.redhat.io/openshift-serverless-1/serving-queue-rhel8@sha256:c2a97c0868e19f4e5a269d29bfe3b7c6b6ef870e135a5419388047965cc0b19d" already present on machine
  Normal  Created         5m18s  kubelet            Created container queue-proxy
  Normal  Started         5m18s  kubelet            Started container queue-proxy

It also sometimes gets the following error:

$ oc -n cupcake-1000-stage describe pod
Events:
  Type     Reason                  Age    From               Message
  ----     ------                  ----   ----               -------
  Normal   Scheduled               4m45s  default-scheduler  Successfully assigned cupcake-1000-stage/helloworld-go-00001-deployment-59fb487487-m2h7z to worker002
  Warning  ErrorAddingLogicalPort  4m5s   controlplane       failed to add IP "10.131.0.20" to address set "edcc6c70-2415-4719-b32b-1d6841058f0d/cupcake-1000-stage_v4/a5666218552365519683", stderr: "2021-08-17T13:47:43Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n" (OVN command '/usr/bin/ovn-nbctl --timeout=15 add address_set edcc6c70-2415-4719-b32b-1d6841058f0d addresses "10.131.0.20"' failed: signal: alarm clock)

I would appreciate knowing whether these slow starts are related to the ovnkube-master issue or not.
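To see whether the slow starts are simply the northbound DB being overloaded at this scale, one rough check (run inside the failing master pod; the socket path comes from the poststart hook above, and the "nbdb" container name is the same assumption as before) is to time a single northbound query:

$ oc -n openshift-ovn-kubernetes exec ovnkube-master-wq4sj -c nbdb -- \
      bash -c 'time ovn-nbctl --no-leader-only --db=unix:/var/run/ovn/ovnnb_db.sock list address_set > /dev/null'

If that takes anywhere near the 15 seconds that ovnkube-master passes via --timeout=15, the alarm-clock failure in the ErrorAddingLogicalPort event above would just be the northbound DB responding too slowly at ~2500 pods rather than a separate problem.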
This issue is quite similar to bz1952819. I will ask the perf team to upgrade to 4.7.22 (bz1962608 indicates it is fixed there).

*** This bug has been marked as a duplicate of bug 1962608 ***