Bug 1936515
| Summary: | sdn-controller is missing some health checks | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lukasz Szaszkiewicz <lszaszki> |
| Component: | Networking | Assignee: | Federico Paolinelli <fpaoline> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aconstan, bbennett, fpaoline, krizza, scuppett, trozet, wking, zzhao |
| Version: | 4.8 | Keywords: | Upgrades |
| Target Milestone: | --- | Flags: | scuppett: needinfo- |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:51:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1962036 | | |
Description
Lukasz Szaszkiewicz
2021-03-08 16:33:42 UTC
Hi fpaoline, I followed the steps in the description above. When one master node is marked 'NotReady', the sdn-controller DaemonSet drops from 3 desired replicas to 2, and the network operator does not report this. Is that expected?

    $ oc get ds -n openshift-sdn
    NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
    ovs              6         6         5       6            5           kubernetes.io/os=linux            71m
    sdn              6         6         5       6            5           kubernetes.io/os=linux            71m
    sdn-controller   2         2         2       2            2           node-role.kubernetes.io/master=   71m

    $ oc get node
    NAME                                         STATUS     ROLES    AGE   VERSION
    ip-10-0-146-157.us-east-2.compute.internal   Ready      worker   57m   v1.20.0+5f82cdb
    ip-10-0-147-64.us-east-2.compute.internal    NotReady   master   63m   v1.20.0+5f82cdb
    ip-10-0-163-226.us-east-2.compute.internal   Ready      master   63m   v1.20.0+5f82cdb
    ip-10-0-163-73.us-east-2.compute.internal    Ready      worker   57m   v1.20.0+5f82cdb
    ip-10-0-201-132.us-east-2.compute.internal   Ready      worker   56m   v1.20.0+5f82cdb
    ip-10-0-203-137.us-east-2.compute.internal   Ready      master   63m   v1.20.0+5f82cdb

    $ oc get co network -o yaml
    apiVersion: config.openshift.io/v1
    kind: ClusterOperator
    metadata:
      annotations:
        include.release.openshift.io/ibm-cloud-managed: "true"
        include.release.openshift.io/self-managed-high-availability: "true"
        include.release.openshift.io/single-node-developer: "true"
        network.operator.openshift.io/last-seen-state: '{"DaemonsetStates":[{"Namespace":"openshift-sdn","Name":"sdn","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.242309949Z"},{"Namespace":"openshift-multus","Name":"multus","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.241749394Z"},{"Namespace":"openshift-sdn","Name":"ovs","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.242059449Z"}],"DeploymentStates":[]}'
      creationTimestamp: "2021-04-14T06:54:26Z"
      generation: 1
      managedFields:
      - apiVersion: config.openshift.io/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:include.release.openshift.io/ibm-cloud-managed: {}
              f:include.release.openshift.io/self-managed-high-availability: {}
              f:include.release.openshift.io/single-node-developer: {}
          f:spec: {}
          f:status:
            .: {}
            f:extension: {}
        manager: cluster-version-operator
        operation: Update
        time: "2021-04-14T06:54:26Z"
      - apiVersion: config.openshift.io/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              f:network.operator.openshift.io/last-seen-state: {}
          f:status:
            f:conditions: {}
            f:relatedObjects: {}
            f:versions: {}
        manager: cluster-network-operator
        operation: Update
        time: "2021-04-14T07:02:34Z"
      name: network
      resourceVersion: "52784"
      uid: 18598777-0af3-4ae0-ad6d-0ce2ff180acd
    spec: {}
    status:
      conditions:
      - lastTransitionTime: "2021-04-14T08:17:11Z"
        message: |-
          DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-04-14T08:05:11Z
          DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2021-04-14T08:05:11Z
          DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2021-04-14T08:05:11Z
        reason: RolloutHung
        status: "True"
        type: Degraded
      - lastTransitionTime: "2021-04-14T07:01:53Z"
        status: "False"
        type: ManagementStateDegraded
      - lastTransitionTime: "2021-04-14T07:01:53Z"
        status: "True"
        type: Upgradeable
      - lastTransitionTime: "2021-04-14T08:05:11Z"
        message: |-
          DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
          DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
          DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)
          DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)
          DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
        reason: Deploying

The sdn-controller DaemonSet changed its number of replicas. I checked when I tested this, and I think it's because of the node.kubernetes.io/unreachable taint; the other DaemonSets have a toleration with operator: Exists. However, the initial bug was CNO not reporting the degraded state, and that is working now. I did not want to touch more logic than we needed in order to get the degraded state reported. @zzhao, does my comment above make sense?
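For context on the toleration mentioned above, here is a minimal sketch of a blanket DaemonSet toleration, assuming standard Kubernetes toleration semantics rather than quoting the actual openshift-sdn manifests:

```yaml
# Sketch only; not copied from the shipped openshift-sdn DaemonSets.
# An empty key with operator Exists tolerates every taint, including
# node.kubernetes.io/unreachable, which fits the output above: sdn and ovs
# keep DESIRED=6 while sdn-controller, lacking such a toleration, drops to 2.
spec:
  template:
    spec:
      tolerations:
      - operator: Exists
```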
"False" type: ManagementStateDegraded - lastTransitionTime: "2021-04-14T07:01:53Z" status: "True" type: Upgradeable - lastTransitionTime: "2021-04-14T08:05:11Z" message: |- DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes) DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes) DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes) DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes) DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes) reason: Deploying The sdn controller daemon changed it's number of replicas, I checked when I tested this and I think it's because of the node.kubernetes.io/unreachable taint. The other ds have toleration: exist However, the initial bug was CNO not reporting the degraded state, and that is working now. I did not want to mess up with more logic than we needed to have the degraded state reported. @zzhao does my comment above make sense? ok, thanks the information. Federico Paolinelli Move this bug to verified. Hey folks, Is there a reason the leader election fix wasn't backported? We found the problem on openshift upgrade when we were looking into a similar locking issue in the openshift-marketplace operator: https://bugzilla.redhat.com/show_bug.cgi?id=1958888#c6 The main reason was the original scenario, which was not related to upgrades but to a very specific case where a master goes not reachable and the operator was not reporting the state in a proper manner. It's not clear to me if this issue in CNO is preventing upgrades like the marketplace-operator did or not, I would say it does given the comments on the other bug but we weren't aware of such issue. I'll create the backport right now, does it make sense to go back to 4.6? Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438 The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days |