Bug 1994615

Summary: ovnkube-master starts failing due to "ERROR RESTARTING - nbdb - too many failed ovn-nbctl attempts, giving up"
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.7
Reporter: Kenjiro Nakayama <knakayam>
Assignee: Ben Bennett <bbennett>
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-08-19 00:36:16 UTC
Type: Bug

Description Kenjiro Nakayama 2021-08-17 14:30:24 UTC
Description of problem:

- One of the ovnkube-master pods starts failing.

  $ oc get pod -A |grep ovn
  openshift-ovn-kubernetes                           ovnkube-master-fb97s                                              6/6     Running            2          23h
  openshift-ovn-kubernetes                           ovnkube-master-flz6m                                              6/6     Running            0          23h
  openshift-ovn-kubernetes                           ovnkube-master-wq4sj                                              4/6     CrashLoopBackOff   10         19m
  openshift-ovn-kubernetes                           ovnkube-node-4lrrk                                                3/3     Running            0          23h
  openshift-ovn-kubernetes                           ovnkube-node-4vdst                                                3/3     Running            0          23h
  openshift-ovn-kubernetes                           ovnkube-node-6xdd4                                                3/3     Running            0          23h
  openshift-ovn-kubernetes                           ovnkube-node-8vlls                                                3/3     Running            0          23h
  openshift-ovn-kubernetes                           ovnkube-node-94tp9                                                3/3     Running            0          23h
  openshift-ovn-kubernetes                           ovnkube-node-g7cg9                                                3/3     Running            0          23h
  openshift-ovn-kubernetes                           ovnkube-node-lsvf8                                                3/3     Running            0          23h
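
  A hedged way to pull the crash logs from that pod; the container name "nbdb" is an assumption based on the "ERROR RESTARTING - nbdb" message in the summary, so adjust it if a different container is the one failing:

  $ oc -n openshift-ovn-kubernetes logs ovnkube-master-wq4sj -c nbdb --previous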

Version-Release number of selected component (if applicable):

  $ oc get clusterversion
  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
  version   4.7.13    True        False         62d     Cluster version is 4.7.13

How reproducible:

Not 100% reproducible. It failed during a scale test.

Steps to Reproduce:
1. Deploy 3000 pods (Knative services); a rough reproduction sketch is shown after these steps.
2. The failure starts at around 2526 pods.
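
A minimal sketch of such a deployment loop, assuming the kn CLI and the helloworld-go image that appears in the pod events below; the actual scale-test tooling is not included in this report, and the service/namespace naming here is illustrative only:

  $ for i in $(seq 1 3000); do
      # create one Knative service per iteration (names/namespace are examples)
      kn service create "helloworld-go-$i" -n cupcake-1000-stage \
        --image quay.io/wreicher/quarkus-native-hello@sha256:09eb7225f59c6147d6a2acca945ed1b284b3ac317b3e0a51ca7f99136abc711c
    done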

Actual results:
- The pod starts failing with the following error:

  $ oc get pod -A |grep ovn
  openshift-ovn-kubernetes                           ovnkube-master-wq4sj                                              4/6     PostStartHookError: command '/bin/bash -c set -x
    while ! ovn-nbctl --no-leader-only -t 5 set-connection pssl:9641 -- set connection . inactivity_probe=60000; do
      echo "$(date -Iseconds) - ERROR RESTARTING - nbdb - too many failed ovn-nbctl attempts, giving up"
    DB_SCHEMA="/usr/share/ovn/ovn-nb.ovsschema"
    DB_SERVER="unix:/var/run/ovn/ovnnb_db.sock"
  OVN_NB_CTL="ovn-nbctl -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt \
  while current_election_timer=$(ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null \
    while is_candidate=$(ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null \
    is_leader=$(ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null \
          if ! ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound ${election_timer}; then
          if ! ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound ${max_election_timer}; then
  + ovn-nbctl --no-leader-only -t 5 set-connection pssl:9641 -- set connection . inactivity_probe=60000
  ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
   ... (output continues) ...

- The pod then enters CrashLoopBackOff.
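
For context, a hedged reconstruction of the retry loop the fragments above come from; the retry limit and sleep interval are assumptions, since the actual post-start hook script is only partially visible in the error output:

  # keep retrying the NB connection setup; give up after a bounded number of attempts
  retries=0
  while ! ovn-nbctl --no-leader-only -t 5 set-connection pssl:9641 -- \
          set connection . inactivity_probe=60000; do
    (( retries += 1 ))
    if (( retries > 10 )); then
      echo "$(date -Iseconds) - ERROR RESTARTING - nbdb - too many failed ovn-nbctl attempts, giving up"
      exit 1
    fi
    sleep 2
  done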

Expected results:
- The ovnkube-master pod keeps running.

Comment 1 Kenjiro Nakayama 2021-08-17 14:56:50 UTC
Not sure if this is related, but pods take a long time to start: they stay in "Pending" status until the network interface is attached.

  $ oc  -n cupcake-1000-stage describe pod helloworld-go-00001-deployment-59fb487487-bvt8t
   ...
  Events:
    Type     Reason          Age                   From               Message
    ----     ------          ----                  ----               -------
    Normal   Scheduled       8m3s                  default-scheduler  Successfully assigned cupcake-1000-stage/helloworld-go-00001-deployment-59fb487487-qhmrx to worker002
    Normal   AddedInterface  5m33s                 multus             Add eth0 [10.131.0.10/23]
    Normal   Pulled          5m30s                 kubelet            Container image "quay.io/wreicher/quarkus-native-hello@sha256:09eb7225f59c6147d6a2acca945ed1b284b3ac317b3e0a51ca7f99136abc711c" already present on machine
    Normal   Created         5m23s                 kubelet            Created container user-container
    Normal   Started         5m23s                 kubelet            Started container user-container
    Normal   Pulled          5m23s                 kubelet            Container image "registry.redhat.io/openshift-serverless-1/serving-queue-rhel8@sha256:c2a97c0868e19f4e5a269d29bfe3b7c6b6ef870e135a5419388047965cc0b19d" already present on machine
    Normal   Created         5m18s                 kubelet            Created container queue-proxy
    Normal   Started         5m18s                 kubelet            Started container queue-proxy
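
One hedged way to survey this delay across many pods is to list the AddedInterface events directly (these are standard Kubernetes event fields; the namespace is the one used above):

  $ oc -n cupcake-1000-stage get events --field-selector reason=AddedInterface \
      -o custom-columns=POD:.involvedObject.name,TIME:.lastTimestamp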

Also, it sometimes gets the following error.

  $ oc  -n cupcake-1000-stage describe pod 
  Events:
    Type     Reason                  Age    From               Message
    ----     ------                  ----   ----               -------
    Normal   Scheduled               4m45s  default-scheduler  Successfully assigned cupcake-1000-stage/helloworld-go-00001-deployment-59fb487487-m2h7z to worker002
    Warning  ErrorAddingLogicalPort  4m5s   controlplane       failed to add IP "10.131.0.20" to address set "edcc6c70-2415-4719-b32b-1d6841058f0d/cupcake-1000-stage_v4/a5666218552365519683", stderr: "2021-08-17T13:47:43Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n" (OVN command '/usr/bin/ovn-nbctl --timeout=15 add address_set edcc6c70-2415-4719-b32b-1d6841058f0d addresses "10.131.0.20"' failed: signal: alarm clock)
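
Since that ovn-nbctl call is timing out, a hedged health check of the NB DB raft cluster from the failing master, reusing the ovs-appctl invocation that appears in the hook script above (pod and container names are from this report; adjust as needed):

  $ oc -n openshift-ovn-kubernetes exec ovnkube-master-wq4sj -c nbdb -- \
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound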

I would appreciate knowing whether these slow starts are related to the ovnkube-master issue.

Comment 3 Kenjiro Nakayama 2021-08-19 00:36:16 UTC
This issue is quite similar to bz1952819. I will ask the perf team to upgrade to 4.7.22, where bz1962608 indicates it is fixed.

*** This bug has been marked as a duplicate of bug 1962608 ***