Bug 1861484
| Summary: | ovn master intermittently restarting with failed to get northd_probe_interval value stderr error | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anurag saxena <anusaxen> |
| Component: | Networking | Assignee: | Aniket Bhat <anbhat> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | anbhat, danw, dcbw, huirwang, rbrattai, weliang, zzhao |
| Version: | 4.6 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-10-27 16:21:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Anurag saxena
2020-07-28 17:42:59 UTC
This seems to reproduce in normal clusters as well, independent of the frag issue (BZ1849736/BZ1825219). I have a cluster if anybody wants to take a look.
> But along with that we also noticed ovnkube-master is intermittently restarting with the following errors in logs:
>
> Failed to get northd_probe_interval value stderr(ovn-nbctl: no key "northd_probe_interval" in NB_Global record "." column options
> ) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)
This is not a fatal error. It just means that one of the metrics values will be unset. Whatever is causing ovnkube-master to restart is something else.
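As a rough illustration of the non-fatal handling described here, the sketch below treats a missing `northd_probe_interval` key as "metric unset" and logs a warning instead of an error. It is written in Python for brevity and is not the ovn-kubernetes code; `get_probe_interval`, `run_nbctl` and the injectable `runner` parameter are hypothetical names, while the command line and the "no key" text are copied from the log message above.

```python
import logging
import subprocess
from typing import Callable, List, Optional, Tuple

# Command line copied from the error message in this report.
NBCTL_CMD = ["/usr/bin/ovn-nbctl", "--timeout=5", "get", "NB_Global", ".",
             "options:northd_probe_interval"]

Runner = Callable[[List[str]], Tuple[int, str, str]]


def run_nbctl(cmd: List[str]) -> Tuple[int, str, str]:
    """Run the command and return (exit_code, stdout, stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def get_probe_interval(runner: Runner = run_nbctl) -> Optional[int]:
    """Return the configured interval, or None when the option is not set.

    Hypothetical helper: a missing key is treated as "metric unset",
    any other failure is raised to the caller.
    """
    code, out, err = runner(NBCTL_CMD)
    if code != 0:
        if "no key" in err:
            # Option absent from NB_Global: warn and leave the metric unset.
            logging.warning("northd_probe_interval is not set: %s", err.strip())
            return None
        raise RuntimeError(f"ovn-nbctl failed ({code}): {err.strip()}")
    # Strip whitespace and any surrounding double quotes before parsing.
    return int(out.strip().strip('"'))
```

Passing a stub `runner` makes the helper easy to exercise without a running OVN northbound database.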
(We shouldn't be hitting that error, and if we do log it, it should be logged as a warning, not an error. But anyway, my point is that it is not related to any ovnkube-master restarts.)

Ah, that's right Dan, it's not fatal. Yes, this certainly is not the cause of the restart. must-gather might have more details pertaining to the restart issue.

Some logs from the restarted master:

2020-07-28T17:03:31.440382882Z ) :(OVN command '/usr/bin/ovn-nbctl --timeout=5 get NB_Global . options:northd_probe_interval' failed: exit status 1)
2020-07-28T17:03:42.408361331Z E0728 17:03:42.408282 1 leaderelection.go:320] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: Get "https://api-int.geliu0727.qe.azure.devcluster.openshift.com:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps/ovn-kubernetes-master": context deadline exceeded
2020-07-28T17:03:42.408361331Z I0728 17:03:42.408337 1 leaderelection.go:277] failed to renew lease openshift-ovn-kubernetes/ovn-kubernetes-master: timed out waiting for the condition
2020-07-28T17:03:42.408418132Z I0728 17:03:42.408359 1 master.go:97] No longer leader; exiting

Discussion at https://github.com/ovn-org/ovn-kubernetes/issues/1553

This is fixed in the latest 4.6 nightlies as of Monday 08/03.

Not reproducible on 4.6.0-0.nightly-2020-08-06-093209. Moving this to verified! Thanks

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
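For anyone triaging similar restarts, the sequence in the master logs quoted in this report (a failed lease renewal followed by `No longer leader; exiting`) can be picked out of a log mechanically. The following is only a sketch under that assumption: `classify` and `exited_after_lost_lease` are hypothetical helper names, and the patterns are simply the message fragments that appear in those logs.

```python
import re
from typing import Iterable, List

# Message fragments taken from the log lines quoted in this report.
PATTERNS = {
    "lock_error": re.compile(r"error retrieving resource lock"),
    "renew_failed": re.compile(r"failed to renew lease"),
    "leader_exit": re.compile(r"No longer leader; exiting"),
}


def classify(lines: Iterable[str]) -> List[str]:
    """Return the names of the matched events, in log order."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append(name)
    return events


def exited_after_lost_lease(lines: Iterable[str]) -> bool:
    """True when a lease renewal failure precedes the leader exit message."""
    events = classify(lines)
    if "renew_failed" not in events or "leader_exit" not in events:
        return False
    return events.index("renew_failed") < events.index("leader_exit")
```

A log that contains only the `northd_probe_interval` message yields no events, which matches the conclusion in this thread that the message is unrelated to the restarts.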