Bug 1861484 - ovn master intermittently restarting with failed to get northd_probe_interval value stderr error
Summary: ovn master intermittently restarting with failed to get northd_probe_interval...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Aniket Bhat
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-28 17:42 UTC by Anurag saxena
Modified: 2020-10-27 16:21 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:21:16 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 733 0 None closed Configure northd probe interval during startup 2020-11-10 17:06:17 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:21:40 UTC

Description Anurag saxena 2020-07-28 17:42:59 UTC
Description of problem: Brought up a cluster with OVNKubernetes, which was found to have encountered BZ1849736/BZ1825219.

$ oc debug node/geliu0727-8vrp4-master-0 -- chroot /host ip route show cache;oc debug node/geliu0727-8vrp4-master-1 -- chroot /host ip route show cache;oc debug node/geliu0727-8vrp4-master-2 -- chroot /host ip route show cache

Starting pod/geliu0727-8vrp4-master-0-debug ...
To use host binaries, run `chroot /host`
10.0.0.8 dev eth0                                   
    cache 
10.0.0.7 dev eth0                                   
    cache 

Removing debug pod ...
Starting pod/geliu0727-8vrp4-master-1-debug ...
To use host binaries, run `chroot /host`
10.0.0.8 dev eth0 
    cache 

Removing debug pod ...
Starting pod/geliu0727-8vrp4-master-2-debug ...
To use host binaries, run `chroot /host`
10.0.0.5 dev eth0 
    cache expires 441sec mtu 1400    <<<<<<<<<<<<<<<<<<<<
10.0.0.7 dev eth0  
    cache expires 404sec mtu 1400    <<<<<<<<<<<<<<<<<<<<

Along with that, we also noticed that ovnkube-master is intermittently restarting with the following errors in its logs:

Failed to get northd_probe_interval value stderr(ovn-nbctl: no key "northd_probe_interval" in NB_Global record "." column options
) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)


$ oc get pods -n openshift-ovn-kubernetes -l=app=ovnkube-master
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-65b6g   4/4     Running   0          40h
ovnkube-master-6b2sb   4/4     Running   19         40h
ovnkube-master-djqtx   4/4     Running   0          40h
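
One of the masters has restarted 19 times. To see why, the logs of the previous container instance can be pulled (a standard diagnostic sketch; the container name ovnkube-master is an assumption about the pod layout):

$ oc logs -n openshift-ovn-kubernetes ovnkube-master-6b2sb -c ovnkube-master --previous | tail -50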

must-gather: http://file.bos.redhat.com/~anusaxen/must-gather-ovn-restart.tar.gz

Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2020-07-25-091217


How reproducible: Rare


Steps to Reproduce:
1. Bring up a cluster with networkType OVNKubernetes

Actual results: Cluster exhibits the various issues described above.


Expected results: Cluster should be installed fine without any errors.


Additional info:

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
cloud-credential                           4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
cluster-autoscaler                         4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
config-operator                            4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
console                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
csi-snapshot-controller                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
dns                                        4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
etcd                                       4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
image-registry                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
ingress                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
insights                                   4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-apiserver                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-controller-manager                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-scheduler                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
kube-storage-version-migrator              4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
machine-api                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
machine-approver                           4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
machine-config                             4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
marketplace                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      39h
monitoring                                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      9s
network                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
node-tuning                                4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
openshift-apiserver                        4.6.0-0.nightly-2020-07-25-091217   False       False         False      46s
openshift-controller-manager               4.6.0-0.nightly-2020-07-25-091217   True        False         False      21h
openshift-samples                          4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-07-25-091217   True        False         False      7h46m
service-ca                                 4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h
storage                                    4.6.0-0.nightly-2020-07-25-091217   True        False         False      40h

$ oc get pods -n openshift-ovn-kubernetes -l=app=ovnkube-master
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-65b6g   4/4     Running   0          40h
ovnkube-master-6b2sb   4/4     Running   19         40h
ovnkube-master-djqtx   4/4     Running   0          40h
-bash-4.2$ oc get co openshift-apiserver -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2020-07-27T01:06:18Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-07-27T01:06:18Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: cluster-openshift-apiserver-operator
    operation: Update
    time: "2020-07-28T17:36:11Z"
  name: openshift-apiserver
  resourceVersion: "2652343"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  uid: 2018d749-84e4-4f54-ac0a-f080e356fce8
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-07-27T01:58:15Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-07-27T01:19:15Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-07-28T17:28:51Z"
    message: |-
      APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    reason: APIServices_Error
    status: "False"
    type: Available
  - lastTransitionTime: "2020-07-27T01:13:11Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable

Comment 1 Anurag saxena 2020-07-29 15:29:36 UTC
Seems like it reproduces in normal clusters as well, independent of the fragmentation issue BZ1849736/BZ1825219.

I have a cluster available if anybody wants to take a look.

Comment 2 Dan Winship 2020-07-29 15:43:03 UTC
> But along with that we also noticed ovnkube-master is intermittently restarting with following errors in logs:
> 
> Failed to get northd_probe_interval value stderr(ovn-nbctl: no key "northd_probe_interval" in NB_Global record "." column options
> ) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)

This is not a fatal error. It just means that one of the metrics values will be unset. Whatever is causing ovnkube-master to restart is something else.
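
For reference, the read can be made non-fatal with --if-exists, which returns an empty value instead of erroring when the key is unset (a minimal sketch, assuming ovn-nbctl's get accepts --if-exists as in ovs-vsctl):

$ ovn-nbctl --timeout=5 --if-exists get NB_Global . options:northd_probe_interval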

Comment 3 Dan Winship 2020-07-29 15:43:50 UTC
(We shouldn't be hitting that error, and it should be logged as a warning, not an error, if we do log it; but anyway, my point is that it is not related to any ovnkube-master restarts.)

Comment 4 Anurag saxena 2020-07-29 16:18:28 UTC
Ah, that's right Dan, it's not fatal. Yeah, this certainly is not the cause of the restarts. The must-gather might have more details pertaining to the restart issue. Some logs from the restarted master:

2020-07-28T17:03:31.440382882Z ) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)
2020-07-28T17:03:42.408361331Z E0728 17:03:42.408282       1 leaderelection.go:320] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: Get "https://api-int.geliu0727.qe.azure.devcluster.openshift.com:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps/ovn-kubernetes-master": context deadline exceeded
2020-07-28T17:03:42.408361331Z I0728 17:03:42.408337       1 leaderelection.go:277] failed to renew lease openshift-ovn-kubernetes/ovn-kubernetes-master: timed out waiting for the condition
2020-07-28T17:03:42.408418132Z I0728 17:03:42.408359       1 master.go:97] No longer leader; exiting
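
The actual restart cause is visible above: the leader election client failed to renew its lease against the apiserver and exited by design. The lock is held as an annotation on the ovn-kubernetes-master configmap; a sketch for inspecting the current holder (the annotation key is the client-go configmap-lock convention, assumed here):

$ oc get configmap ovn-kubernetes-master -n openshift-ovn-kubernetes \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'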

Comment 7 Dan Winship 2020-07-29 18:26:53 UTC
Discussion at https://github.com/ovn-org/ovn-kubernetes/issues/1553

Comment 8 Aniket Bhat 2020-08-05 00:36:36 UTC
This is fixed in the latest 4.6 nightlies as of Monday 08/03.
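
Per the linked PR title ("Configure northd probe interval during startup"), the key should now be set explicitly when the NB database comes up, so the get no longer fails. The manual equivalent would be something like this sketch (the 5000 ms value is an illustrative assumption, not necessarily what the fix uses):

$ ovn-nbctl set NB_Global . options:northd_probe_interval=5000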

Comment 11 Anurag saxena 2020-08-07 15:19:00 UTC
Not reproducible on 4.6.0-0.nightly-2020-08-06-093209. Moving this to verified! Thanks

Comment 13 errata-xmlrpc 2020-10-27 16:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

