Description of problem:

The network operator is not reporting when one replica of sdn-controller is missing.

How reproducible:

Always

Steps to Reproduce:

1. Block outgoing traffic from a node to the Kube API:
   iptables -A OUTPUT -p tcp -m state --state RELATED,ESTABLISHED,NEW -m tcp --dport 6443 -j DROP
2. oc get co

Actual results:

Many operators reported Degraded, but the network operator did not:

❯ oc get co
NAME                                       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.ci-2021-03-07-070446   True        False         True       93m
baremetal                                  4.8.0-0.ci-2021-03-07-070446   True        False         False      6h53m
cloud-credential                           4.8.0-0.ci-2021-03-07-070446   True        False         False      7h3m
cluster-autoscaler                         4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
config-operator                            4.8.0-0.ci-2021-03-07-070446   True        False         False      6h54m
console                                    4.8.0-0.ci-2021-03-07-070446   True        False         False      93m
csi-snapshot-controller                    4.8.0-0.ci-2021-03-07-070446   True        False         False      93m
dns                                        4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
etcd                                       4.8.0-0.ci-2021-03-07-070446   True        False         True       6h52m
image-registry                             4.8.0-0.ci-2021-03-07-070446   True        False         False      6h45m
ingress                                    4.8.0-0.ci-2021-03-07-070446   True        False         False      6h45m
insights                                   4.8.0-0.ci-2021-03-07-070446   True        False         False      6h47m
kube-apiserver                             4.8.0-0.ci-2021-03-07-070446   True        True          True       6h50m
kube-controller-manager                    4.8.0-0.ci-2021-03-07-070446   True        False         True       6h51m
kube-scheduler                             4.8.0-0.ci-2021-03-07-070446   True        False         True       6h52m
kube-storage-version-migrator              4.8.0-0.ci-2021-03-07-070446   True        False         False      6h45m
machine-api                                4.8.0-0.ci-2021-03-07-070446   True        False         False      6h44m
machine-approver                           4.8.0-0.ci-2021-03-07-070446   True        False         False      6h53m
machine-config                             4.8.0-0.ci-2021-03-07-070446   False       False         True       82m
marketplace                                4.8.0-0.ci-2021-03-07-070446   True        False         False      5h38m
monitoring                                 4.8.0-0.ci-2021-03-07-070446   False       True          True       87m
network                                    4.8.0-0.ci-2021-03-07-070446   True        False         False      6h54m
node-tuning                                4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
openshift-apiserver                        4.8.0-0.ci-2021-03-07-070446   True        False         True       6h4m
openshift-controller-manager               4.8.0-0.ci-2021-03-07-070446   True        False         False      6h51m
openshift-samples                          4.8.0-0.ci-2021-03-07-070446   True        False         False      6h46m
operator-lifecycle-manager                 4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
operator-lifecycle-manager-catalog         4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
operator-lifecycle-manager-packageserver   4.8.0-0.ci-2021-03-07-070446   True        False         False      93m
service-ca                                 4.8.0-0.ci-2021-03-07-070446   True        False         False      6h54m
storage                                    4.8.0-0.ci-2021-03-07-070446   True        True          False      87m

Expected results:

Since a master node was marked as NotReady, I would expect the network operator to report that one replica is missing.

❯ oc get nodes -owide
NAME                           STATUS     ROLES    AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-139-25.ec2.internal    Ready      master   5h26m   v1.20.0+aa519d9   10.0.139.25    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-139-92.ec2.internal    Ready      worker   5h17m   v1.20.0+aa519d9   10.0.139.92    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-149-47.ec2.internal    NotReady   master   5h25m   v1.20.0+aa519d9   10.0.149.47    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-159-183.ec2.internal   Ready      worker   5h17m   v1.20.0+aa519d9   10.0.159.183   <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-161-27.ec2.internal    Ready      master   5h25m   v1.20.0+aa519d9   10.0.161.27    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-173-41.ec2.internal    Ready      worker   5h17m   v1.20.0+aa519d9   10.0.173.41    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54

Additional info:

must-gather is available at https://drive.google.com/drive/folders/1nEnWuZCYe6CTNsPWAZJcpGzRE1pPGUed
Hi fpaoline, following the steps above in the description: when one master node is marked NotReady, the sdn-controller DaemonSet drops from 3 desired replicas to 2, and the network operator does not report this. Is that expected?

$ oc get ds -n openshift-sdn
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
ovs              6         6         5       6            5           kubernetes.io/os=linux            71m
sdn              6         6         5       6            5           kubernetes.io/os=linux            71m
sdn-controller   2         2         2       2            2           node-role.kubernetes.io/master=   71m

$ oc get node
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-146-157.us-east-2.compute.internal   Ready      worker   57m   v1.20.0+5f82cdb
ip-10-0-147-64.us-east-2.compute.internal    NotReady   master   63m   v1.20.0+5f82cdb
ip-10-0-163-226.us-east-2.compute.internal   Ready      master   63m   v1.20.0+5f82cdb
ip-10-0-163-73.us-east-2.compute.internal    Ready      worker   57m   v1.20.0+5f82cdb
ip-10-0-201-132.us-east-2.compute.internal   Ready      worker   56m   v1.20.0+5f82cdb
ip-10-0-203-137.us-east-2.compute.internal   Ready      master   63m   v1.20.0+5f82cdb

$ oc get co network -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    network.operator.openshift.io/last-seen-state: '{"DaemonsetStates":[{"Namespace":"openshift-sdn","Name":"sdn","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.242309949Z"},{"Namespace":"openshift-multus","Name":"multus","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.241749394Z"},{"Namespace":"openshift-sdn","Name":"ovs","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.242059449Z"}],"DeploymentStates":[]}'
  creationTimestamp: "2021-04-14T06:54:26Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:include.release.openshift.io/ibm-cloud-managed: {}
          f:include.release.openshift.io/self-managed-high-availability: {}
          f:include.release.openshift.io/single-node-developer: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2021-04-14T06:54:26Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:network.operator.openshift.io/last-seen-state: {}
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: cluster-network-operator
    operation: Update
    time: "2021-04-14T07:02:34Z"
  name: network
  resourceVersion: "52784"
  uid: 18598777-0af3-4ae0-ad6d-0ce2ff180acd
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-04-14T08:17:11Z"
    message: |-
      DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-04-14T08:05:11Z
      DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2021-04-14T08:05:11Z
      DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2021-04-14T08:05:11Z
    reason: RolloutHung
    status: "True"
    type: Degraded
  - lastTransitionTime: "2021-04-14T07:01:53Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2021-04-14T07:01:53Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-04-14T08:05:11Z"
    message: |-
      DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
      DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
      DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)
      DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)
      DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    reason: Deploying
The sdn-controller DaemonSet changed its number of desired replicas. I checked this when I tested, and I think it is because of the node.kubernetes.io/unreachable taint: the other DaemonSets have a toleration with operator: Exists, so they keep targeting the NotReady node, while sdn-controller does not. However, the initial bug was the CNO not reporting the Degraded state, and that is working now. I did not want to touch more logic than was needed to get the Degraded state reported.
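For reference, this is roughly the kind of toleration the other DaemonSets carry (an illustrative pod-spec fragment, not copied from the actual sdn/ovs manifests, which may tolerate more taints than shown):

```yaml
# An Exists toleration for the unreachable taint keeps pods bound to a
# NotReady node, so the DaemonSet's desired count does not shrink the
# way sdn-controller's did.
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
```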
@zzhao does my comment above make sense?
OK, thanks for the information, Federico Paolinelli. Moving this bug to Verified.
Hey folks, is there a reason the leader election fix wasn't backported? We found the problem during an OpenShift upgrade while we were looking into a similar locking issue in the openshift-marketplace operator: https://bugzilla.redhat.com/show_bug.cgi?id=1958888#c6
The main reason was that the original scenario was not related to upgrades but to a very specific case where a master becomes unreachable and the operator does not report the state properly. It's not clear to me whether this issue in the CNO blocks upgrades the way the marketplace-operator one did; given the comments on the other bug I would say it does, but we weren't aware of such an issue. I'll create the backport right now. Does it make sense to go back to 4.6?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days