Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1861484

Summary: ovn master intermittently restarting with failed to get northd_probe_interval value stderr error
Product: OpenShift Container Platform
Reporter: Anurag saxena <anusaxen>
Component: Networking
Assignee: Aniket Bhat <anbhat>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: anbhat, danw, dcbw, huirwang, rbrattai, weliang, zzhao
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:21:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Description Anurag saxena 2020-07-28 17:42:59 UTC
Description of problem: Brought up a cluster on OVNKubernetes, which was found to have encountered BZ1849736/BZ1825219.

$ oc debug node/geliu0727-8vrp4-master-0 -- chroot /host ip route show cache;oc debug node/geliu0727-8vrp4-master-1 -- chroot /host ip route show cache;oc debug node/geliu0727-8vrp4-master-2 -- chroot /host ip route show cache

Starting pod/geliu0727-8vrp4-master-0-debug ...
To use host binaries, run `chroot /host`
10.0.0.8 dev eth0                                   
    cache 
10.0.0.7 dev eth0                                   
    cache 

Removing debug pod ...
Starting pod/geliu0727-8vrp4-master-1-debug ...
To use host binaries, run `chroot /host`
10.0.0.8 dev eth0 
    cache 

Removing debug pod ...
Starting pod/geliu0727-8vrp4-master-2-debug ...
To use host binaries, run `chroot /host`
10.0.0.5 dev eth0 
    cache expires 441sec mtu 1400    <<<<<<<<<<<<<<<<<<<<
10.0.0.7 dev eth0  
    cache expires 404sec mtu 1400    <<<<<<<<<<<<<<<<<<<<

But along with that, we also noticed ovnkube-master is intermittently restarting with the following errors in the logs:

Failed to get northd_probe_interval value stderr(ovn-nbctl: no key "northd_probe_interval" in NB_Global record "." column options
) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)
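As noted later in the comments, the failing `get` is non-fatal. For reference, a sketch of how such a lookup could avoid the hard failure, using `ovn-nbctl`'s standard `--if-exists` flag; the 5000 ms fallback default is an assumption for illustration, not taken from this bug:

```shell
# Read northd_probe_interval from the NB database. With --if-exists,
# ovn-nbctl prints nothing and exits 0 when the key is absent, instead
# of the hard "no key" failure shown above.
get_northd_probe_interval() {
  local val
  val=$(ovn-nbctl --timeout=5 --if-exists get NB_Global . \
        options:northd_probe_interval 2>/dev/null) || true
  val=${val//\"/}   # options values are strings; strip surrounding quotes
  # Fall back to a default when the key is unset (5000 ms is an assumed default)
  echo "${val:-5000}"
}
```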


$ oc get pods -n openshift-ovn-kubernetes -l=app=ovnkube-master
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-65b6g   4/4     Running   0          40h
ovnkube-master-6b2sb   4/4     Running   19         40h
ovnkube-master-djqtx   4/4     Running   0          40h

must-gather: http://file.bos.redhat.com/~anusaxen/must-gather-ovn-restart.tar.gz

Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2020-07-25-091217


How reproducible: Rare


Steps to Reproduce:
1. Bring up a cluster with networkType OVNKubernetes

Actual results: Cluster exhibits the various issues described above.


Expected results: Cluster should be installed fine, without any errors.


Additional info:

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
cloud-credential                           4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
cluster-autoscaler                         4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
config-operator                            4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
console                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
csi-snapshot-controller                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
dns                                        4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
etcd                                       4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
image-registry                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
ingress                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
insights                                   4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-apiserver                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-controller-manager                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-scheduler                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-storage-version-migrator              4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
machine-api                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
machine-approver                           4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
machine-config                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
marketplace                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
monitoring                                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      9s
network                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
node-tuning                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
openshift-apiserver                        4.6.0-0.nightly-2020-07-25-091217   False       False         False      46s
openshift-controller-manager               4.6.0-0.nightly-2020-07-25-091217   True        False         False      21h
openshift-samples                          4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-07-25-091217   True        False         False      7h46m
service-ca                                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
storage                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h

$ oc get pods -n openshift-ovn-kubernetes -l=app=ovnkube-master
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-65b6g   4/4     Running   0          40h
ovnkube-master-6b2sb   4/4     Running   19         40h
ovnkube-master-djqtx   4/4     Running   0          40h
$ oc get co openshift-apiserver -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2020-07-27T01:06:18Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-07-27T01:06:18Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: cluster-openshift-apiserver-operator
    operation: Update
    time: "2020-07-28T17:36:11Z"
  name: openshift-apiserver
  resourceVersion: "2652343"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  uid: 2018d749-84e4-4f54-ac0a-f080e356fce8
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-07-27T01:58:15Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-07-27T01:19:15Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-07-28T17:28:51Z"
    message: |-
      APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    reason: APIServices_Error
    status: "False"
    type: Available
  - lastTransitionTime: "2020-07-27T01:13:11Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable

Comment 1 Anurag saxena 2020-07-29 15:29:36 UTC
Seems like it reproduces in normal clusters as well, independent of the fragmentation issue (BZ1849736/BZ1825219).

I have a cluster available if anybody wants to take a look.

Comment 2 Dan Winship 2020-07-29 15:43:03 UTC
> But along with that we also noticed ovnkube-master is intermittently restarting with following errors in logs:
> 
> Failed to get northd_probe_interval value stderr(ovn-nbctl: no key "northd_probe_interval" in NB_Global record "." column options
> ) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)

This is not a fatal error. It just means that one of the metrics values will be unset. Whatever is causing ovnkube-master to restart is something else.

Comment 3 Dan Winship 2020-07-29 15:43:50 UTC
(We shouldn't be hitting that error, and it should be logged as a warning rather than an error if we do log it, but anyway, my point is that it is not related to any ovnkube-master restarts.)

Comment 4 Anurag saxena 2020-07-29 16:18:28 UTC
Ah, that's right Dan, it's not fatal. Yeah, this certainly is not the cause behind the restart. The must-gather might have more details pertaining to the restart issue. Some logs from the restarted master:

2020-07-28T17:03:31.440382882Z ) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)
2020-07-28T17:03:42.408361331Z E0728 17:03:42.408282       1 leaderelection.go:320] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: Get "https://api-int.geliu0727.qe.azure.devcluster.openshift.com:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps/ovn-kubernetes-master": context deadline exceeded
2020-07-28T17:03:42.408361331Z I0728 17:03:42.408337       1 leaderelection.go:277] failed to renew lease openshift-ovn-kubernetes/ovn-kubernetes-master: timed out waiting for the condition
2020-07-28T17:03:42.408418132Z I0728 17:03:42.408359       1 master.go:97] No longer leader; exiting
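The lease-renewal timeout above is what actually triggers the exit: when the leader cannot reach the apiserver in time, leader election gives up the lock and ovnkube-master exits. A small sketch for filtering a restarted pod's previous logs down to these lines (the grep pattern matches the messages seen in this bug; the container name `ovnkube-master` is an assumption):

```shell
# Keep only the leader-election lines from a log stream; the pattern
# matches the messages seen in this bug's logs.
leader_errors() {
  grep -E 'leaderelection\.go|No longer leader'
}

# On a live cluster, this would be fed from the restarted pod's previous
# container (pod name taken from `oc get pods` above):
#   oc logs --previous -n openshift-ovn-kubernetes -c ovnkube-master \
#       ovnkube-master-6b2sb | leader_errors
```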

Comment 7 Dan Winship 2020-07-29 18:26:53 UTC
Discussion at https://github.com/ovn-org/ovn-kubernetes/issues/1553

Comment 8 Aniket Bhat 2020-08-05 00:36:36 UTC
This is fixed in the latest 4.6 nightlies as of Monday 08/03.

Comment 11 Anurag saxena 2020-08-07 15:19:00 UTC
Not reproducible on 4.6.0-0.nightly-2020-08-06-093209. Moving this to verified! Thanks.

Comment 13 errata-xmlrpc 2020-10-27 16:21:16 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196