Description of problem:

Brought up a cluster on OVNKubernetes which was found to have hit BZ1849736/BZ1825219:

$ oc debug node/geliu0727-8vrp4-master-0 -- chroot /host ip route show cache; oc debug node/geliu0727-8vrp4-master-1 -- chroot /host ip route show cache; oc debug node/geliu0727-8vrp4-master-2 -- chroot /host ip route show cache
Starting pod/geliu0727-8vrp4-master-0-debug ...
To use host binaries, run `chroot /host`
10.0.0.8 dev eth0 cache
10.0.0.7 dev eth0 cache

Removing debug pod ...
Starting pod/geliu0727-8vrp4-master-1-debug ...
To use host binaries, run `chroot /host`
10.0.0.8 dev eth0 cache

Removing debug pod ...
Starting pod/geliu0727-8vrp4-master-2-debug ...
To use host binaries, run `chroot /host`
10.0.0.5 dev eth0 cache expires 441sec mtu 1400 <<<<<<<<<<<<<<<<<<<<
10.0.0.7 dev eth0 cache expires 404sec mtu 1400 <<<<<<<<<<<<<<<<<<<<

(A scripted version of this per-node check is sketched at the end of Additional info below.)

But along with that we also noticed ovnkube-master is intermittently restarting with the following errors in its logs:

Failed to get northd_probe_interval value stderr(ovn-nbctl: no key "northd_probe_interval" in NB_Global record "." column options
) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)

$ oc get pods -n openshift-ovn-kubernetes -l=app=ovnkube-master
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-65b6g   4/4     Running   0          40h
ovnkube-master-6b2sb   4/4     Running   19         40h
ovnkube-master-djqtx   4/4     Running   0          40h

must-gather: http://file.bos.redhat.com/~anusaxen/must-gather-ovn-restart.tar.gz

Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2020-07-25-091217

How reproducible: Rare

Steps to Reproduce:
1. Bring up a cluster with networkType OVNKubernetes
2.
3.

Actual results:
Cluster exhibits the various issues described above.

Expected results:
Cluster should install fine without any errors.

Additional info:

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
cloud-credential                           4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
cluster-autoscaler                         4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
config-operator                            4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
console                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
csi-snapshot-controller                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
dns                                        4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
etcd                                       4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
image-registry                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
ingress                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
insights                                   4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-apiserver                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-controller-manager                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-scheduler                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-storage-version-migrator              4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
machine-api                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
machine-approver                           4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
machine-config                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
marketplace                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
monitoring                                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      9s
network                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
node-tuning                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
openshift-apiserver                        4.6.0-0.nightly-2020-07-25-091217   False       False         False      46s
openshift-controller-manager               4.6.0-0.nightly-2020-07-25-091217   True        False         False      21h
openshift-samples                          4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-07-25-091217   True        False         False      7h46m
service-ca                                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
storage                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h

$ oc get pods -n openshift-ovn-kubernetes -l=app=ovnkube-master
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-65b6g   4/4     Running   0          40h
ovnkube-master-6b2sb   4/4     Running   19         40h
ovnkube-master-djqtx   4/4     Running   0          40h

-bash-4.2$ oc get co openshift-apiserver -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2020-07-27T01:06:18Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-07-27T01:06:18Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: cluster-openshift-apiserver-operator
    operation: Update
    time: "2020-07-28T17:36:11Z"
  name: openshift-apiserver
  resourceVersion: "2652343"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  uid: 2018d749-84e4-4f54-ac0a-f080e356fce8
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-07-27T01:58:15Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-07-27T01:19:15Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-07-28T17:28:51Z"
    message: |-
      APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    reason: APIServices_Error
    status: "False"
    type: Available
  - lastTransitionTime: "2020-07-27T01:13:11Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
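As referenced in the description, the per-node route-cache check can be looped over every node rather than naming each master individually. A minimal sketch, assuming oc debug access to all nodes; only entries carrying a clamped "mtu 1400" indicate the fragmentation issue:

$ for n in $(oc get nodes -o name); do
    echo "== $n =="
    oc debug "$n" -- chroot /host ip route show cache
  done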
It seems this reproduces on normal clusters as well, independent of the fragmentation issue (BZ1849736/BZ1825219). I have a cluster available if anybody wants to take a look.
> But along with that we also noticed ovnkube-master is intermittently restarting with following errors in logs:
>
> Failed to get northd_probe_interval value stderr(ovn-nbctl: no key "northd_probe_interval" in NB_Global record "." column options
> ) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)

This is not a fatal error. It just means that one of the metrics values will be unset. Whatever is causing ovnkube-master to restart is something else.
(We shouldn't be hitting that error at all, and it should be logged as a warning rather than an error if we do log it, but my point is that it is not related to the ovnkube-master restarts.)
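For what it's worth, the lookup could also be made non-fatal at the CLI level. A minimal sketch, assuming the shipped ovn-nbctl supports the standard --if-exists modifier for "get" (a missing key then yields empty output and exit status 0 instead of an error):

$ ovn-nbctl --timeout=5 --if-exists get NB_Global . options:northd_probe_interval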
Ah, that's right Dan, it's not fatal. Yeah, this certainly is not the cause behind the restart. The must-gather might have more details pertaining to the restart issue. Some logs from the restarted master:

2020-07-28T17:03:31.440382882Z ) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)
2020-07-28T17:03:42.408361331Z E0728 17:03:42.408282       1 leaderelection.go:320] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: Get "https://api-int.geliu0727.qe.azure.devcluster.openshift.com:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps/ovn-kubernetes-master": context deadline exceeded
2020-07-28T17:03:42.408361331Z I0728 17:03:42.408337       1 leaderelection.go:277] failed to renew lease openshift-ovn-kubernetes/ovn-kubernetes-master: timed out waiting for the condition
2020-07-28T17:03:42.408418132Z I0728 17:03:42.408359       1 master.go:97] No longer leader; exiting
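Per the leaderelection.go lines above, the lock being renewed is the ovn-kubernetes-master ConfigMap in openshift-ovn-kubernetes: the renewal request to api-int timed out, client-go gave up the lease, and ovnkube-master exited, which is what increments the restart counter. To see which instance currently holds the lease (a sketch, assuming the standard client-go leader-election annotation on the lock object):

$ oc get configmap ovn-kubernetes-master -n openshift-ovn-kubernetes \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'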
Discussion at https://github.com/ovn-org/ovn-kubernetes/issues/1553
This is fixed in the latest 4.6 nightlies as of Monday 08/03.
Not reproducible on 4.6.0-0.nightly-2020-08-06-093209. Moving this to verified. Thanks!
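For anyone re-verifying, the checks from the original description apply unchanged (a sketch): confirm RESTARTS stays at 0 and that no cached routes carry a clamped MTU:

$ oc get pods -n openshift-ovn-kubernetes -l=app=ovnkube-master
$ oc debug node/<node> -- chroot /host ip route show cache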
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196