Bug 2096456 - [HyperShift] Election timeouts on OVNKube masters for Hypershift guests post statefulset recreation
Summary: [HyperShift] Election timeouts on OVNKube masters for Hypershift guests post statefulset recreation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.z
Assignee: Patryk Diak
QA Contact: Ross Brattain
URL:
Whiteboard:
Duplicates: 2093057
Depends On: 2103590
Blocks:
 
Reported: 2022-06-13 21:43 UTC by Anurag saxena
Modified: 2023-09-15 01:55 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2103590
Environment:
Last Closed: 2022-09-07 20:49:23 UTC
Target Upstream Version:
Embargoed:




Links:
  Github openshift cluster-network-operator pull 1512 (open): [release-4.11] Bug 2096456: Add init container to ensure that Status.podIP is set before postStart hooks run (last updated 2022-08-16 18:54:55 UTC)
  Red Hat Product Errata RHSA-2022:6287 (last updated 2022-09-07 20:49:54 UTC)

Comment 1 Ross Brattain 2022-06-13 22:13:47 UTC
On all OVN clusters we expect Raft election to converge within 30 seconds of the deletion of all ovnkube-master pods.


The OVN stateful set does not meet this condition (a sketch of the convergence check follows the output below).

      [09:19:00] INFO> Shell Commands: oc delete pod -l app\=ovnkube-master --kubeconfig=ocp4_admin.kubeconfig
      pod "ovnkube-master-0" deleted
      pod "ovnkube-master-1" deleted
      pod "ovnkube-master-2" deleted
      [09:19:03] INFO> Exit Status: 0
      [09:19:12] INFO> Exit Status: 1
      [09:19:17] INFO> Shell Commands: oc exec ovnkube-master-0  --kubeconfig=ocp4_admin.kubeconfig -n clusters-hypershift-ci-15114  --container=northd -i -- ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
      aee7
      Name: OVN_Northbound
      Cluster ID: 1a8d (1a8dd9f1-9866-42cc-9ccd-33953fcc3df7)
      Server ID: aee7 (aee71e75-9e8c-4ea8-868f-9e7abc1be4ce)
      Address: ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643
      Status: cluster member
      Role: follower
      Term: 63
      Leader: unknown
      Vote: self
      
      Election timer: 10000
      Log: [2, 3002]
      Entries not yet committed: 0
      Entries not yet applied: 0
      Connections: (->e451) (->358a)
      Disconnections: 0
      Servers:
          aee7 (aee7 at ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643) (self)
          e451 (e451 at ssl:ovnkube-master-1.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643)
          358a (358a at ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643)
      
      STDERR:
      E0610 05:19:19.141910  950758 v2.go:105] read /dev/stdin: resource temporarily unavailable
      [09:19:19] INFO> Exit Status: 0
      [09:19:24] INFO> Shell Commands: oc exec ovnkube-master-0  --kubeconfig=ocp4_admin.kubeconfig -n clusters-hypershift-ci-15114  --container=northd -i -- ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
      aee7
      Name: OVN_Northbound
      Cluster ID: 1a8d (1a8dd9f1-9866-42cc-9ccd-33953fcc3df7)
      Server ID: aee7 (aee71e75-9e8c-4ea8-868f-9e7abc1be4ce)
      Address: ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643
      Status: disconnected from the cluster (election timeout)
      Role: candidate
      Term: 64
      Leader: unknown
      Vote: self
      
      Last Election started 6120 ms ago, reason: timeout
      Election timer: 10000
      Log: [2, 3002]
      Entries not yet committed: 0
      Entries not yet applied: 0
      Connections: (->e451) (->358a)
      Disconnections: 0
      Servers:
          aee7 (aee7 at ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643) (self) (voted for aee7)
          e451 (e451 at ssl:ovnkube-master-1.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643)
          358a (358a at ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643)
      
      STDERR:
      E0610 05:19:26.148518  950936 v2.go:105] read /dev/stdin: resource temporarily unavailable
      [09:19:26] INFO> Exit Status: 0
      [09:19:31] INFO> Shell Commands: oc exec ovnkube-master-0  --kubeconfig=ocp4_admin.kubeconfig -n clusters-hypershift-ci-15114  --container=northd -i -- ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
      aee7
      Name: OVN_Northbound
      Cluster ID: 1a8d (1a8dd9f1-9866-42cc-9ccd-33953fcc3df7)
      Server ID: aee7 (aee71e75-9e8c-4ea8-868f-9e7abc1be4ce)
      Address: ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643
      Status: disconnected from the cluster (election timeout)
      Role: candidate
      Term: 65
      Leader: unknown
      Vote: self
      
      Last Election started 2856 ms ago, reason: timeout
      Election timer: 10000
      Log: [2, 3002]
      Entries not yet committed: 0
      Entries not yet applied: 0
      Connections: (->e451) (->358a)
      Disconnections: 0
      Servers:
          aee7 (aee7 at ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643) (self) (voted for aee7)
          e451 (e451 at ssl:ovnkube-master-1.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643)
          358a (358a at ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hypershift-ci-15114.svc.cluster.local:9643)
      
      STDERR:
      E0610 05:19:33.424861  951120 v2.go:105] read /dev/stdin: resource temporarily unavailable
      [09:19:33] INFO> Exit Status: 0
      Unknown leader (RuntimeError)
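
For reference, a minimal sketch of the kind of convergence check being run here (this is not the actual CI script; the namespace and the 30-second budget are taken from the run above, and the ovs-appctl invocation is the same one shown in the output):

      NS=clusters-hypershift-ci-15114   # hosted-cluster namespace from this run
      oc delete pod -l app=ovnkube-master -n "$NS"
      for i in $(seq 1 30); do
        # cluster/status prints "Leader: unknown" until a leader is elected;
        # the exec may fail while a pod is still restarting, which is fine here
        leader=$(oc exec ovnkube-master-0 -n "$NS" -c northd -- \
          ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null \
          | awk '/Leader:/ {print $2}')
        if [ -n "$leader" ] && [ "$leader" != "unknown" ]; then
          echo "leader elected after ~${i}s: $leader"
          break
        fi
        sleep 1
      done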

Comment 2 zenghui.shi 2022-06-16 05:43:58 UTC
I tried to reproduce this issue locally; some findings:

1) It takes ~1m for ovnkube-master-0 in the management cluster to get past ContainerCreating.

oc describe pod ovnkube-master-0 doesn't show any event explaining why nbdb takes 40s to start;
looking at the nbdb container log, there is no abnormal behavior: it detects the OVN DB file and starts without delay. I'm not sure whether it needs time to mount the storage PV where the database file is saved (we don't use a PV in a normal OpenShift deployment, only in hypershift); see the events check sketched at the end of this comment.

2) The ovnkube-master containers are crash-looping in both the ovnkube-master-1 and ovnkube-master-2 pods.

### ovnkube-master container log ###

F0616 05:10:26.486395       1 ovnkube.go:133] error when trying to initialize libovsdb SB client: unable to connect to any endpoints: failed to connect to ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9642: failed to open connection: dial tcp: lookup ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local: no such host. failed to connect to ssl:ovnkube-master-1.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9642: endpoint is not leader. failed to connect to ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9642: endpoint is not leader


In the ovnkube-master container log above, the client fails to find the endpoint leader, which causes the ovnkube-master container to crash continuously.
I didn't find any leader-election configmap in the hosted cluster namespace, but there is one in the openshift-ovn-kubernetes namespace created by the management ovnk deployment.
According to the ovn-config-namespace setting in the ovnkube-config configmap, it still uses the openshift-ovn-kubernetes namespace for the hypershift deployment, which is wrong since it is supposed to use the hosted-cluster namespace on the management cluster. We would need to change ovn-config-namespace to the hosted-cluster namespace for it to create its own election lock.

PS: this ovn-config-namespace setting is also used for ovnk metrics, the dbmanager, and the topology version, and could potentially cause other issues.
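
Regarding finding (1): one way to look for slow volume attach/mount is to scan the namespace events around pod start (a sketch; SuccessfulAttachVolume, FailedAttachVolume, and FailedMount are the standard Kubernetes event reasons, and clusters-hostedovn is the hosted-cluster namespace used in this reproduction):

      oc get events -n clusters-hostedovn --sort-by=.lastTimestamp \
        | grep -Ei 'ovnkube-master-0|SuccessfulAttachVolume|FailedAttachVolume|FailedMount'

The gap between the pod's Scheduled/Pulled events and SuccessfulAttachVolume would show whether the PV attach accounts for the ~40s.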

Comment 3 zenghui.shi 2022-06-16 06:51:00 UTC
(In reply to zenghui.shi from comment #2)
> [...]
> According to the ovn-config-namespace setting in the ovnkube-config configmap,
> it still uses the openshift-ovn-kubernetes namespace for the hypershift
> deployment, which is wrong since it is supposed to use the hosted-cluster
> namespace on the management cluster. We would need to change
> ovn-config-namespace to the hosted-cluster namespace for it to create its own
> election lock.

Ignore the above paragraph; the election lock is created in the guest cluster's openshift-ovn-kubernetes namespace and it works fine.

Comment 4 zenghui.shi 2022-06-16 07:23:43 UTC
Log from one of the nbdb containers (ovnkube-master-1):

-> nbdb starts at 06:49:50 

2022-06-16T06:49:50+00:00 - starting nbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local, K8S_NODE_IP=10.0.135.118
+ echo '2022-06-16T06:49:50+00:00 - starting nbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local, K8S_NODE_IP=10.0.135.118'
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnnb_db.db ]]
+ [[ false == \t\r\u\e ]]
+ wait 10
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=ovnkube-master-1.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' run_nb_ovsdb

[...]

ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
2022-06-16T06:49:50.246Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2022-06-16T06:49:50Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-06-16T06:49:50Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-06-16T06:49:50.267Z|00002|dns_resolve|WARN|ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local: failed to resolve
2022-06-16T06:49:50.267Z|00003|raft|INFO|local server ID is 08ee
2022-06-16T06:49:50.284Z|00004|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.17.2

-> failed to resolve addresses of the other 2 members

2022-06-16T06:49:50.288Z|00005|stream_ssl|ERR|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol
2022-06-16T06:49:50.288Z|00006|reconnect|INFO|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connecting...
2022-06-16T06:49:50.288Z|00007|reconnect|INFO|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connection attempt failed (Address family not supported by protocol)
2022-06-16T06:49:50.288Z|00008|stream_ssl|ERR|ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol
2022-06-16T06:49:50.288Z|00009|reconnect|INFO|ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connecting...
2022-06-16T06:49:50.288Z|00010|reconnect|INFO|ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connection attempt failed (Address family not supported by protocol)

-> self connected

2022-06-16T06:49:51Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-06-16T06:49:51Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected

-> continues connecting to the other members in the background

2022-06-16T06:49:57.288Z|00030|reconnect|INFO|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: continuing to reconnect in the background but suppressing further logging
2022-06-16T06:49:57.289Z|00034|reconnect|INFO|ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: continuing to reconnect in the background but suppressing further logging

2022-06-16T06:50:00.292Z|00035|memory|INFO|38604 kB peak resident set size after 10.0 seconds
2022-06-16T06:50:00.292Z|00036|memory|INFO|atoms:3237 cells:3235 monitors:0 raft-connections:2 raft-log:1103 sessions:1 triggers:1 txn-history:55 txn-history-atoms:3200

-> election timer expired several times

2022-06-16T06:50:00.374Z|00037|raft|INFO|term 43: 10107 ms timeout expired, starting election
2022-06-16T06:50:11.144Z|00040|raft|INFO|term 44: 10770 ms timeout expired, starting election
2022-06-16T06:50:13.290Z|00041|stream_ssl|ERR|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol
2022-06-16T06:50:13.290Z|00042|stream_ssl|ERR|ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol

-> alarm 

2022-06-16T06:50:20Z|00005|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 109:    89 Alarm clock             "$@"
Waiting for OVN_Northbound to come up ... failed!

-> continues connecting after the alarm

2022-06-16T06:50:21.290Z|00043|stream_ssl|ERR|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol
2022-06-16T06:50:21.290Z|00044|stream_ssl|ERR|ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol

-> election timer expired several times

2022-06-16T06:50:21.531Z|00045|raft|INFO|term 45: 10386 ms timeout expired, starting election
2022-06-16T06:50:32.517Z|00048|raft|INFO|term 46: 10986 ms timeout expired, starting election
2022-06-16T06:50:42.996Z|00051|raft|INFO|term 47: 10479 ms timeout expired, starting election

-> connected to leader master-2 at 06:50:53, total time spent (06:49:50 ~ 06:50:53 = 63s)

2022-06-16T06:50:53.294Z|00055|dns_resolve|WARN|ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local: failed to resolve
2022-06-16T06:50:53.294Z|00056|stream_ssl|ERR|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol
2022-06-16T06:50:53.299Z|00057|reconnect|INFO|ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connected
2022-06-16T06:50:53.316Z|00058|raft|INFO|ssl:10.129.2.3:50674: learned server ID 2e5d
2022-06-16T06:50:53.316Z|00059|raft|INFO|ssl:10.129.2.3:50674: learned remote address ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643
2022-06-16T06:50:54.475Z|00060|raft|INFO|ssl:10.131.0.3:59782: learned server ID d65b
2022-06-16T06:50:54.475Z|00061|raft|INFO|ssl:10.131.0.3:59782: learned remote address ssl:ovnkube-master-2.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643
2022-06-16T06:50:55.103Z|00062|raft|INFO|server d65b is leader for term 49
2022-06-16T06:51:01.295Z|00063|stream_ssl|ERR|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol
2022-06-16T06:51:09.296Z|00064|stream_ssl|ERR|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connect: Address family not supported by protocol
2022-06-16T06:51:17.303Z|00065|reconnect|INFO|ssl:ovnkube-master-0.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local:9643: connected

-> connected to master-0 at 06:51:17, total time spent (06:49:50 ~ 06:51:17 = 87s)

It took ~1m to resolve and connect to the pod DNS names of the other members, which included one connection alarm and several election-timer expirations.
The "Address family not supported by protocol" messages seem to indicate that the ovnk pod IPs are not resolvable yet.
Given that ovnk pods are not host-networked in hypershift, their IPs change every time they are recreated; I wonder whether the delay is caused by the pod DNS records not being populated immediately.
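
To test that hypothesis, the per-pod records of the headless service could be resolved from inside one of the containers right after the pods are recreated (a sketch; it assumes getent is available in the container image and reuses the namespace and FQDNs from this reproduction):

      for i in 0 1 2; do
        oc exec ovnkube-master-0 -n clusters-hostedovn -c northd -- \
          getent hosts ovnkube-master-$i.ovnkube-master-internal.clusters-hostedovn.svc.cluster.local \
          || echo "ovnkube-master-$i record not resolvable yet"
      done

If the records only appear some time after the pods come up, that would be consistent with the delay above, since the per-pod DNS records cannot exist before the new pod IPs are known.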

Comment 5 zenghui.shi 2022-06-17 09:35:35 UTC
> Given that ovnk pods are not host-networked in hypershift, their IPs change
> every time they are recreated; I wonder whether the delay is caused by the pod
> DNS records not being populated immediately.

The max TTL in the CoreDNS cache config is set to 900 seconds, which may result in long resolution times for the DB headless service names.

# oc get configmap/dns-default -n openshift-dns -o yaml | grep cache -A3

        cache 900 {
            denial 9984 30
        }
        reload

Can we try a shorter cache time and see whether it improves things, with the following steps?

1) Set the DNS operator's managementState to Unmanaged:

# oc patch dns.operator.openshift.io default --type merge --patch '{"spec":{"managementState":"Unmanaged"}}'

2) Edit the CoreDNS configmap to replace 900 with 30, 10, or 5 (a non-interactive sketch follows step 3):

oc edit configmap/dns-default -n openshift-dns

3) Rerun the DB HA tests.
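
For step 2, a non-interactive way to apply the change could look like this (a sketch, not a supported procedure; the sed pattern matches the cache stanza shown in the excerpt above, and 30 is just one of the values to try):

      # after setting the operator to Unmanaged as in step 1:
      oc get configmap/dns-default -n openshift-dns -o yaml \
        | sed 's/cache 900 {/cache 30 {/' \
        | oc apply -f -

The reload plugin visible in the excerpt should make CoreDNS pick up the edited Corefile without restarting the dns-default pods.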

Comment 6 Ross Brattain 2022-06-23 02:33:21 UTC
> In the ovnkube-master container log above, the client fails to find the endpoint leader, which causes the ovnkube-master container to crash continuously.
> I didn't find any leader-election configmap in the hosted cluster namespace, but there is one in the openshift-ovn-kubernetes namespace created by the management ovnk deployment.
> According to the ovn-config-namespace setting in the ovnkube-config configmap, it still uses the openshift-ovn-kubernetes namespace for the hypershift deployment, which is wrong since it is supposed to use the hosted-cluster namespace on the management cluster. We would need to change ovn-config-namespace to the hosted-cluster namespace for it to create its own election lock.

This looks like an additional issue that should be fixed; we have had problems when the leader-election lock is not released properly (BZ 2089807 and BZ 1944180).

Comment 7 Ross Brattain 2022-06-23 03:36:50 UTC
With cache  5 election finished in ~1m06s
With cache 10 election finished in ~1m05s
With cache 60 election finished in ~1m15s

This is better but still not as good as regular OVN.  Not sure what the ultimate consequences will be.

Comment 10 Ross Brattain 2022-07-20 00:22:15 UTC
Tested on 4.11.0-0.ci.test-2022-07-19-205901-ci-ln-rqvyry2-latest

      pod "ovnkube-master-0" deleted
      pod "ovnkube-master-1" deleted
      pod "ovnkube-master-2" deleted
      [00:08:28] INFO> Exit Status: 0
      [00:08:53] INFO> cb.new_north_leader.name = ovnkube-master-1
~ 25 seconds

      pod "ovnkube-master-0" deleted
      pod "ovnkube-master-1" deleted
      pod "ovnkube-master-2" deleted
      [00:12:11] INFO> Exit Status: 0
      [00:12:43] INFO> cb.new_north_leader.name = ovnkube-master-2
~ 32 seconds


Pod election logs.

# rg -e 'server \w+ is leader for term \d+' -e 'local server ID is \w+' -e 'elected leader by' -e 'learned server ID' -e '^\d+[^s]*starting [ns]bdb  CLUSTER_INITIATOR_IP' logs-1/
logs-1/log_ovnkube-master-2
118:2022-07-20T00:08:40+00:00 - starting nbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local, K8S_NODE_IP=10.0.137.178
167:2022-07-20T00:08:40.860Z|00002|raft|INFO|local server ID is 09c9
179:2022-07-20T00:08:47.894Z|00010|raft|INFO|server e67c is leader for term 4
183:2022-07-20T00:08:52.660Z|00013|raft|INFO|ssl:10.128.2.39:59954: learned server ID e67c
185:2022-07-20T00:09:00.204Z|00015|raft|INFO|ssl:10.129.2.43:50862: learned server ID 3924
201:2022-07-20T00:08:49+00:00 - starting sbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local
264:2022-07-20T00:08:49.325Z|00002|raft|INFO|local server ID is 54ef
270:2022-07-20T00:08:49.649Z|00008|raft|INFO|ssl:10.129.2.43:52048: learned server ID bb47
272:2022-07-20T00:08:50.219Z|00010|raft|INFO|ssl:10.128.2.39:42294: learned server ID 5db2
284:2022-07-20T00:09:05.991Z|00020|raft|INFO|term 5: elected leader by 2+ of 3 servers

logs-1/log_ovnkube-master-0
112:2022-07-20T00:08:37+00:00 - starting nbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local, K8S_NODE_IP=10.0.185.106
161:2022-07-20T00:08:37.178Z|00002|raft|INFO|local server ID is 3924
172:2022-07-20T00:08:37.659Z|00011|raft|INFO|ssl:10.128.2.39:37260: learned server ID e67c
192:2022-07-20T00:08:40.882Z|00029|raft|INFO|ssl:10.131.0.54:51198: learned server ID 09c9
206:2022-07-20T00:08:47.894Z|00043|raft|INFO|server e67c is leader for term 4
226:2022-07-20T00:08:49+00:00 - starting sbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local
289:2022-07-20T00:08:49.587Z|00002|raft|INFO|local server ID is bb47
295:2022-07-20T00:08:50.218Z|00008|raft|INFO|ssl:10.128.2.39:44708: learned server ID 5db2
297:2022-07-20T00:08:50.377Z|00010|raft|INFO|ssl:10.131.0.54:38266: learned server ID 54ef
305:2022-07-20T00:09:05.992Z|00016|raft|INFO|server 54ef is leader for term 5

logs-1/log_ovnkube-master-1
125:2022-07-20T00:08:37+00:00 - starting nbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local, K8S_NODE_IP=10.0.207.171
174:2022-07-20T00:08:37.636Z|00002|raft|INFO|local server ID is e67c
194:2022-07-20T00:08:40.884Z|00018|raft|INFO|ssl:10.131.0.54:56182: learned server ID 09c9
204:2022-07-20T00:08:47.893Z|00028|raft|INFO|term 4: elected leader by 2+ of 3 servers
218:2022-07-20T00:08:52.203Z|00041|raft|INFO|ssl:10.129.2.43:45900: learned server ID 3924
235:2022-07-20T00:08:50+00:00 - starting sbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local
298:2022-07-20T00:08:50.157Z|00002|raft|INFO|local server ID is 5db2
304:2022-07-20T00:08:50.378Z|00008|raft|INFO|ssl:10.131.0.54:50894: learned server ID 54ef
306:2022-07-20T00:08:50.672Z|00010|raft|INFO|ssl:10.129.2.43:35428: learned server ID bb47
313:2022-07-20T00:09:05.992Z|00015|raft|INFO|server 54ef is leader for term 5




logs-2/log_ovnkube-master-1
095:2022-07-20T00:12:22+00:00 - starting nbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local, K8S_NODE_IP=10.0.207.171
147:2022-07-20T00:12:22.259Z|00002|raft|INFO|local server ID is e67c
156:2022-07-20T00:12:22.778Z|00011|raft|INFO|ssl:10.129.2.44:32956: learned server ID 3924
176:2022-07-20T00:12:26.632Z|00029|raft|INFO|ssl:10.131.0.55:58540: learned server ID 09c9
192:2022-07-20T00:12:37.478Z|00045|raft|INFO|server 09c9 is leader for term 6
209:2022-07-20T00:12:38+00:00 - starting sbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local
272:2022-07-20T00:12:38.755Z|00002|raft|INFO|local server ID is 5db2
278:2022-07-20T00:12:39.225Z|00008|raft|INFO|ssl:10.129.2.44:58672: learned server ID bb47
280:2022-07-20T00:12:39.262Z|00010|raft|INFO|ssl:10.131.0.55:45126: learned server ID 54ef
293:2022-07-20T00:12:55.301Z|00021|raft|INFO|server bb47 is leader for term 6

logs-2/log_ovnkube-master-0
097:2022-07-20T00:12:22+00:00 - starting nbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local, K8S_NODE_IP=10.0.185.106
146:2022-07-20T00:12:22.755Z|00002|raft|INFO|local server ID is 3924
166:2022-07-20T00:12:26.631Z|00018|raft|INFO|ssl:10.131.0.55:45784: learned server ID 09c9
176:2022-07-20T00:12:37.478Z|00028|raft|INFO|server 09c9 is leader for term 6
179:2022-07-20T00:12:45.283Z|00030|raft|INFO|ssl:10.128.2.40:54150: learned server ID e67c
196:2022-07-20T00:12:39+00:00 - starting sbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local
259:2022-07-20T00:12:39.173Z|00002|raft|INFO|local server ID is bb47
265:2022-07-20T00:12:39.260Z|00008|raft|INFO|ssl:10.131.0.55:51526: learned server ID 54ef
267:2022-07-20T00:12:39.812Z|00010|raft|INFO|ssl:10.128.2.40:54192: learned server ID 5db2
277:2022-07-20T00:12:55.300Z|00018|raft|INFO|term 6: elected leader by 2+ of 3 servers

logs-2/log_ovnkube-master-2
1639:2022-07-20T00:12:26+00:00 - starting nbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local, K8S_NODE_IP=10.0.137.178
1688:2022-07-20T00:12:26.608Z|00002|raft|INFO|local server ID is 09c9
1702:2022-07-20T00:12:37.283Z|00012|raft|INFO|ssl:10.128.2.40:38020: learned server ID e67c
1705:2022-07-20T00:12:37.477Z|00015|raft|INFO|term 6: elected leader by 2+ of 3 servers
1719:2022-07-20T00:12:45.781Z|00028|raft|INFO|ssl:10.129.2.44:41054: learned server ID 3924
1735:2022-07-20T00:12:39+00:00 - starting sbdb  CLUSTER_INITIATOR_IP=ovnkube-master-0.ovnkube-master-internal.clusters-hypershift-ci-25142.svc.cluster.local
1798:2022-07-20T00:12:39.208Z|00002|raft|INFO|local server ID is 54ef
1804:2022-07-20T00:12:39.813Z|00008|raft|INFO|ssl:10.128.2.40:52762: learned server ID 5db2
1808:2022-07-20T00:12:40.225Z|00010|raft|INFO|ssl:10.129.2.44:47806: learned server ID bb47
1813:2022-07-20T00:12:55.301Z|00015|raft|INFO|server bb47 is leader for term 6

Comment 12 Ross Brattain 2022-08-19 03:10:59 UTC
Fix in 4.11.0-0.nightly-2022-08-18-223628

pre-verified.

Comment 16 Patryk Diak 2022-09-07 09:43:07 UTC
*** Bug 2093057 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2022-09-07 20:49:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.11.3 packages and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6287

Comment 19 Red Hat Bugzilla 2023-09-15 01:55:52 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

