Bug 1936515
| Summary: | sdn-controller is missing some health checks | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lukasz Szaszkiewicz <lszaszki> |
| Component: | Networking | Assignee: | Federico Paolinelli <fpaoline> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aconstan, bbennett, fpaoline, krizza, scuppett, trozet, wking, zzhao |
| Version: | 4.8 | Keywords: | Upgrades |
| Target Milestone: | --- | Flags: | scuppett: needinfo- |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:51:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1962036 | | |

Description Lukasz Szaszkiewicz 2021-03-08 16:33:42 UTC

hi, fpaoline
Following the steps in the description above: when one master node is marked 'NotReady', the desired count of the sdn-controller DaemonSet changes from 3 to 2, but the network operator does not report this. Is that expected?
$ oc get ds -n openshift-sdn
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ovs 6 6 5 6 5 kubernetes.io/os=linux 71m
sdn 6 6 5 6 5 kubernetes.io/os=linux 71m
sdn-controller 2 2 2 2 2 node-role.kubernetes.io/master= 71m
$ oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-146-157.us-east-2.compute.internal Ready worker 57m v1.20.0+5f82cdb
ip-10-0-147-64.us-east-2.compute.internal NotReady master 63m v1.20.0+5f82cdb
ip-10-0-163-226.us-east-2.compute.internal Ready master 63m v1.20.0+5f82cdb
ip-10-0-163-73.us-east-2.compute.internal Ready worker 57m v1.20.0+5f82cdb
ip-10-0-201-132.us-east-2.compute.internal Ready worker 56m v1.20.0+5f82cdb
ip-10-0-203-137.us-east-2.compute.internal Ready master 63m v1.20.0+5f82cdb
$ oc get co network -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    network.operator.openshift.io/last-seen-state: '{"DaemonsetStates":[{"Namespace":"openshift-sdn","Name":"sdn","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.242309949Z"},{"Namespace":"openshift-multus","Name":"multus","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.241749394Z"},{"Namespace":"openshift-sdn","Name":"ovs","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.242059449Z"}],"DeploymentStates":[]}'
  creationTimestamp: "2021-04-14T06:54:26Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:include.release.openshift.io/ibm-cloud-managed: {}
          f:include.release.openshift.io/self-managed-high-availability: {}
          f:include.release.openshift.io/single-node-developer: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2021-04-14T06:54:26Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:network.operator.openshift.io/last-seen-state: {}
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: cluster-network-operator
    operation: Update
    time: "2021-04-14T07:02:34Z"
  name: network
  resourceVersion: "52784"
  uid: 18598777-0af3-4ae0-ad6d-0ce2ff180acd
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-04-14T08:17:11Z"
    message: |-
      DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-04-14T08:05:11Z
      DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2021-04-14T08:05:11Z
      DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2021-04-14T08:05:11Z
    reason: RolloutHung
    status: "True"
    type: Degraded
  - lastTransitionTime: "2021-04-14T07:01:53Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2021-04-14T07:01:53Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-04-14T08:05:11Z"
    message: |-
      DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
      DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
      DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)
      DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)
      DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    reason: Deploying
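(Aside: the Degraded condition in question can be pulled out directly instead of dumping the whole ClusterOperator object; the command below is illustrative and uses standard kubectl/oc jsonpath filtering.)

$ oc get co network -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'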
The sdn-controller DaemonSet changed its number of replicas. I checked when I tested this, and I think it is because of the node.kubernetes.io/unreachable taint; the other DaemonSets carry a toleration with operator: Exists. However, the initial bug was CNO not reporting the degraded state, and that is working now. I did not want to touch more logic than we needed to get the degraded state reported. @zzhao, does my comment above make sense?

ok, thanks for the information, Federico Paolinelli. Moving this bug to verified.

Hey folks, is there a reason the leader election fix wasn't backported? We found the problem during an OpenShift upgrade when we were looking into a similar locking issue in the openshift-marketplace operator: https://bugzilla.redhat.com/show_bug.cgi?id=1958888#c6

The main reason was that the original scenario was not related to upgrades but to a very specific case where a master becomes unreachable and the operator was not reporting the state properly. It's not clear to me whether this issue in CNO is preventing upgrades the way the marketplace-operator one did; I would say it does, given the comments on the other bug, but we weren't aware of such an issue. I'll create the backport right now. Does it make sense to go back to 4.6?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
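As a quick cross-check of the taint/toleration explanation above, the tolerations of the two DaemonSets and the taints on the NotReady master can be compared directly. This is a sketch only: the DaemonSet names, namespace, and node name are taken from the output earlier in this bug, and the jsonpath paths assume the standard pod template layout.

$ oc get ds sdn -n openshift-sdn -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
$ oc get ds sdn-controller -n openshift-sdn -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
$ oc describe node ip-10-0-147-64.us-east-2.compute.internal | grep -A 3 Taints

The first two commands show whether each DaemonSet carries the blanket toleration (operator: Exists) mentioned above; the third shows the node.kubernetes.io/unreachable taint placed on the NotReady master.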