Bug 1936515 - sdn-controller is missing some health checks
Summary: sdn-controller is missing some health checks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Federico Paolinelli
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1962036
Reported: 2021-03-08 16:33 UTC by Lukasz Szaszkiewicz
Modified: 2021-07-27 22:52 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:51:42 UTC
Target Upstream Version:
bbennett: needinfo? (trozet)
scuppett: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 1052 0 None open Bug 1936515: Use the election mechanism provided by library-go 2021-04-08 09:10:19 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:52:19 UTC

Description Lukasz Szaszkiewicz 2021-03-08 16:33:42 UTC
Description of problem: The network operator is not reporting when one replica of sdn-controller is missing


How reproducible: Always


Steps to Reproduce:
1. Block outgoing traffic from a node to Kube API
 
   iptables -A OUTPUT -p tcp -m state --state RELATED,ESTABLISHED,NEW -m tcp --dport 6443 -j DROP

2. oc get co


Actual results: Many operators reported Degraded, but the network operator did not:

❯ oc get co
NAME                                       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.ci-2021-03-07-070446   True        False         True       93m
baremetal                                  4.8.0-0.ci-2021-03-07-070446   True        False         False      6h53m
cloud-credential                           4.8.0-0.ci-2021-03-07-070446   True        False         False      7h3m
cluster-autoscaler                         4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
config-operator                            4.8.0-0.ci-2021-03-07-070446   True        False         False      6h54m
console                                    4.8.0-0.ci-2021-03-07-070446   True        False         False      93m
csi-snapshot-controller                    4.8.0-0.ci-2021-03-07-070446   True        False         False      93m
dns                                        4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
etcd                                       4.8.0-0.ci-2021-03-07-070446   True        False         True       6h52m
image-registry                             4.8.0-0.ci-2021-03-07-070446   True        False         False      6h45m
ingress                                    4.8.0-0.ci-2021-03-07-070446   True        False         False      6h45m
insights                                   4.8.0-0.ci-2021-03-07-070446   True        False         False      6h47m
kube-apiserver                             4.8.0-0.ci-2021-03-07-070446   True        True          True       6h50m
kube-controller-manager                    4.8.0-0.ci-2021-03-07-070446   True        False         True       6h51m
kube-scheduler                             4.8.0-0.ci-2021-03-07-070446   True        False         True       6h52m
kube-storage-version-migrator              4.8.0-0.ci-2021-03-07-070446   True        False         False      6h45m
machine-api                                4.8.0-0.ci-2021-03-07-070446   True        False         False      6h44m
machine-approver                           4.8.0-0.ci-2021-03-07-070446   True        False         False      6h53m
machine-config                             4.8.0-0.ci-2021-03-07-070446   False       False         True       82m
marketplace                                4.8.0-0.ci-2021-03-07-070446   True        False         False      5h38m
monitoring                                 4.8.0-0.ci-2021-03-07-070446   False       True          True       87m
network                                    4.8.0-0.ci-2021-03-07-070446   True        False         False      6h54m
node-tuning                                4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
openshift-apiserver                        4.8.0-0.ci-2021-03-07-070446   True        False         True       6h4m
openshift-controller-manager               4.8.0-0.ci-2021-03-07-070446   True        False         False      6h51m
openshift-samples                          4.8.0-0.ci-2021-03-07-070446   True        False         False      6h46m
operator-lifecycle-manager                 4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
operator-lifecycle-manager-catalog         4.8.0-0.ci-2021-03-07-070446   True        False         False      6h52m
operator-lifecycle-manager-packageserver   4.8.0-0.ci-2021-03-07-070446   True        False         False      93m
service-ca                                 4.8.0-0.ci-2021-03-07-070446   True        False         False      6h54m
storage                                    4.8.0-0.ci-2021-03-07-070446   True        True          False      87m
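
The table above can be filtered mechanically. The following is a minimal sketch, assuming the plain-text column layout shown in this report; the function name `degraded_operators` and the trimmed `SAMPLE` text are illustrative only and are not part of any OpenShift tooling.

```python
def degraded_operators(oc_get_co_output: str) -> list[str]:
    """Return the names of operators whose DEGRADED column reads 'True'.

    Assumes whitespace-separated columns with a header row, as printed
    by `oc get co` in this report.
    """
    lines = [ln for ln in oc_get_co_output.splitlines() if ln.strip()]
    header = lines[0].split()
    degraded_idx = header.index("DEGRADED")
    names = []
    for ln in lines[1:]:
        cols = ln.split()
        if len(cols) > degraded_idx and cols[degraded_idx] == "True":
            names.append(cols[0])
    return names


# A few rows copied from the output above (trimmed for brevity).
SAMPLE = """\
NAME             VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.ci-2021-03-07-070446   True        False         True       93m
etcd             4.8.0-0.ci-2021-03-07-070446   True        False         True       6h52m
network          4.8.0-0.ci-2021-03-07-070446   True        False         False      6h54m
"""

if __name__ == "__main__":
    print(degraded_operators(SAMPLE))
```

With the rows above, `network` is absent from the result, which is the behavior this bug reports.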


Expected results: Since a master node was marked NotReady, I would expect the network operator to report that one sdn-controller replica is missing.

❯ oc get nodes -owide
NAME                           STATUS     ROLES    AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-139-25.ec2.internal    Ready      master   5h26m   v1.20.0+aa519d9   10.0.139.25    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-139-92.ec2.internal    Ready      worker   5h17m   v1.20.0+aa519d9   10.0.139.92    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-149-47.ec2.internal    NotReady   master   5h25m   v1.20.0+aa519d9   10.0.149.47    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-159-183.ec2.internal   Ready      worker   5h17m   v1.20.0+aa519d9   10.0.159.183   <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-161-27.ec2.internal    Ready      master   5h25m   v1.20.0+aa519d9   10.0.161.27    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54
ip-10-0-173-41.ec2.internal    Ready      worker   5h17m   v1.20.0+aa519d9   10.0.173.41    <none>        Red Hat Enterprise Linux CoreOS 48.83.202103060900-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitb422fc2.el8.54


Additional info: must-gather is available at https://drive.google.com/drive/folders/1nEnWuZCYe6CTNsPWAZJcpGzRE1pPGUed

Comment 2 zhaozhanqi 2021-04-14 08:19:25 UTC
Hi fpaoline@redhat.com,

Following the steps in the description, one master node is marked 'NotReady'. However, the sdn-controller DaemonSet's desired count changed from 3 to 2, and the network operator did not report this. Is that expected?

$ oc get ds -n openshift-sdn
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
ovs              6         6         5       6            5           kubernetes.io/os=linux            71m
sdn              6         6         5       6            5           kubernetes.io/os=linux            71m
sdn-controller   2         2         2       2            2           node-role.kubernetes.io/master=   71m


$ oc get node
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-146-157.us-east-2.compute.internal   Ready      worker   57m   v1.20.0+5f82cdb
ip-10-0-147-64.us-east-2.compute.internal    NotReady   master   63m   v1.20.0+5f82cdb
ip-10-0-163-226.us-east-2.compute.internal   Ready      master   63m   v1.20.0+5f82cdb
ip-10-0-163-73.us-east-2.compute.internal    Ready      worker   57m   v1.20.0+5f82cdb
ip-10-0-201-132.us-east-2.compute.internal   Ready      worker   56m   v1.20.0+5f82cdb
ip-10-0-203-137.us-east-2.compute.internal   Ready      master   63m   v1.20.0+5f82cdb



$ oc get co network -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    network.operator.openshift.io/last-seen-state: '{"DaemonsetStates":[{"Namespace":"openshift-sdn","Name":"sdn","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.242309949Z"},{"Namespace":"openshift-multus","Name":"multus","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.241749394Z"},{"Namespace":"openshift-sdn","Name":"ovs","LastSeenStatus":{"currentNumberScheduled":6,"numberMisscheduled":0,"desiredNumberScheduled":6,"numberReady":5,"observedGeneration":1,"updatedNumberScheduled":6,"numberAvailable":5,"numberUnavailable":1},"LastChangeTime":"2021-04-14T08:05:11.242059449Z"}],"DeploymentStates":[]}'
  creationTimestamp: "2021-04-14T06:54:26Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:include.release.openshift.io/ibm-cloud-managed: {}
          f:include.release.openshift.io/self-managed-high-availability: {}
          f:include.release.openshift.io/single-node-developer: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
    manager: cluster-version-operator
    operation: Update
    time: "2021-04-14T06:54:26Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:network.operator.openshift.io/last-seen-state: {}
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
        f:versions: {}
    manager: cluster-network-operator
    operation: Update
    time: "2021-04-14T07:02:34Z"
  name: network
  resourceVersion: "52784"
  uid: 18598777-0af3-4ae0-ad6d-0ce2ff180acd
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-04-14T08:17:11Z"
    message: |-
      DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-04-14T08:05:11Z
      DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2021-04-14T08:05:11Z
      DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2021-04-14T08:05:11Z
    reason: RolloutHung
    status: "True"
    type: Degraded
  - lastTransitionTime: "2021-04-14T07:01:53Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2021-04-14T07:01:53Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-04-14T08:05:11Z"
    message: |-
      DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
      DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
      DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)
      DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)
      DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    reason: Deploying
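
The last-seen-state annotation in the YAML above is a JSON document, so it can be inspected directly. The following is a minimal sketch, assuming only the field names visible in that annotation; the helper name `unavailable_daemonsets` is hypothetical, and the sample is a trimmed copy that keeps only the fields the sketch reads.

```python
import json

# Trimmed copy of the network.operator.openshift.io/last-seen-state
# annotation shown above; only the fields read below are kept.
LAST_SEEN_STATE = json.dumps({
    "DaemonsetStates": [
        {"Namespace": "openshift-sdn", "Name": "sdn",
         "LastSeenStatus": {"desiredNumberScheduled": 6, "numberReady": 5,
                            "numberUnavailable": 1}},
        {"Namespace": "openshift-multus", "Name": "multus",
         "LastSeenStatus": {"desiredNumberScheduled": 6, "numberReady": 5,
                            "numberUnavailable": 1}},
        {"Namespace": "openshift-sdn", "Name": "ovs",
         "LastSeenStatus": {"desiredNumberScheduled": 6, "numberReady": 5,
                            "numberUnavailable": 1}},
    ],
    "DeploymentStates": [],
})


def unavailable_daemonsets(annotation: str) -> dict[str, int]:
    """Map 'namespace/name' to numberUnavailable for every DaemonSet
    recorded in the annotation with at least one unavailable pod."""
    state = json.loads(annotation)
    result = {}
    for ds in state.get("DaemonsetStates", []):
        unavailable = ds["LastSeenStatus"].get("numberUnavailable", 0)
        if unavailable > 0:
            result[f'{ds["Namespace"]}/{ds["Name"]}'] = unavailable
    return result


if __name__ == "__main__":
    for name, count in unavailable_daemonsets(LAST_SEEN_STATE).items():
        print(f"{name}: {count} unavailable")
```

Note that openshift-sdn/sdn-controller does not appear in the annotation, which is consistent with the `oc get ds` output above showing 2 desired and 2 ready for it.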

Comment 3 Federico Paolinelli 2021-04-14 08:28:58 UTC
The sdn-controller DaemonSet changed its number of replicas. I checked this when I tested the fix, and I think it is because of the node.kubernetes.io/unreachable taint.
The other DaemonSets have a toleration with operator: Exists.

However, the original bug was that CNO did not report the degraded state, and that is working now.
I did not want to add more logic than was needed to get the degraded state reported.
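
To illustrate why a toleration with operator: Exists keeps a DaemonSet scheduled on a tainted node while a narrower toleration does not, here is a simplified sketch of Kubernetes toleration matching. The taint key comes from this comment; the NoSchedule effect on it and the "master only" toleration set are assumptions made for illustration, not the actual sdn-controller manifest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key matches any taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rule."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


def tolerates_all(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """True if every taint is matched by at least one toleration."""
    return all(any(tolerates(t, taint) for t in tolerations) for taint in taints)


# Taints on an unreachable master (effects are assumed for this sketch).
node_taints = [
    Taint("node-role.kubernetes.io/master", "NoSchedule"),
    Taint("node.kubernetes.io/unreachable", "NoSchedule"),
]

# Tolerates everything: an empty key with operator Exists.
tolerate_everything = [Toleration(operator="Exists")]

# Hypothetical narrower set that only tolerates the master taint.
master_only = [
    Toleration(key="node-role.kubernetes.io/master",
               operator="Exists", effect="NoSchedule"),
]

if __name__ == "__main__":
    print(tolerates_all(tolerate_everything, node_taints))
    print(tolerates_all(master_only, node_taints))
```

Under these assumptions, a DaemonSet with the narrower toleration set no longer counts the unreachable node as a scheduling target, which would match the desired count dropping from 3 to 2.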

Comment 4 Federico Paolinelli 2021-04-19 10:59:41 UTC
@zzhao@redhat.com does my comment above make sense?

Comment 5 zhaozhanqi 2021-04-22 11:47:29 UTC
OK, thanks for the information, Federico Paolinelli.

Moving this bug to verified.

Comment 6 Kevin Rizza 2021-05-18 19:20:04 UTC
Hey folks,

Is there a reason the leader election fix wasn't backported? We found the problem during an OpenShift upgrade while looking into a similar locking issue in the openshift-marketplace operator: https://bugzilla.redhat.com/show_bug.cgi?id=1958888#c6

Comment 8 Federico Paolinelli 2021-05-19 08:45:17 UTC
The main reason was the original scenario, which was not related to upgrades but to a very specific case where a master becomes unreachable and the operator does not report the state properly.
It's not clear to me whether this issue in CNO prevents upgrades the way the marketplace-operator issue did. Given the comments on the other bug I would say it does, but we weren't aware of such an issue.

I'll create the backport right now. Does it make sense to go back to 4.6?

Comment 12 errata-xmlrpc 2021-07-27 22:51:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

