Bug 1811748 - Resources not rendered are not removed upon CNO recreation
Summary: Resources not rendered are not removed upon CNO recreation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Maysa Macedo
QA Contact: GenadiC
URL:
Whiteboard:
Depends On:
Blocks: 1811830
 
Reported: 2020-03-09 16:59 UTC by Maysa Macedo
Modified: 2020-07-13 17:19 UTC
CC List: 1 user

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1811830 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:19:06 UTC
Target Upstream Version:
Embargoed:


Links:
- Github openshift cluster-network-operator pull 520 (closed): Bug 1811748: Ensure removal of not rendered resources upon CNO recreation (last updated 2020-09-26 11:33:42 UTC)
- Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:19:41 UTC)

Description Maysa Macedo 2020-03-09 16:59:01 UTC
Description of problem:

If the CNO pod is recreated, the resources that are no longer rendered are not deleted. The relatedObjects field of the ClusterOperator status is wiped out upon CNO recreation, which breaks the deletion of the related objects that are no longer rendered, as there are no objects saved in the status manager.
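
To make the mechanism concrete, here is a minimal sketch (illustrative only, not the actual cluster-network-operator code; the object names are borrowed from the outputs below): cleanup of resources that are no longer rendered can only be computed as a diff against the previously recorded relatedObjects, so when that record comes back empty after the operator pod is recreated, nothing is ever identified for deletion.

// Minimal sketch, not the CNO implementation. An object is considered
// stale (and eligible for deletion) when it is present in the previously
// recorded relatedObjects but absent from the freshly rendered set.
package main

import "fmt"

type objRef struct {
	Group, Resource, Namespace, Name string
}

// staleObjects returns the objects present in previous but absent from rendered.
func staleObjects(previous, rendered []objRef) []objRef {
	kept := make(map[objRef]bool, len(rendered))
	for _, o := range rendered {
		kept[o] = true
	}
	var stale []objRef
	for _, o := range previous {
		if !kept[o] {
			stale = append(stale, o)
		}
	}
	return stale
}

func main() {
	dropped := objRef{Group: "apps", Resource: "daemonsets",
		Namespace: "openshift-kuryr", Name: "kuryr-dns-admission-controller"}

	// Normal case: the DaemonSet was recorded earlier and is no longer
	// rendered, so it is detected as stale and would be deleted.
	fmt.Println(staleObjects([]objRef{dropped}, nil))

	// After CNO recreation the recorded list is wiped, so the diff is
	// empty and the obsolete DaemonSet is never deleted.
	fmt.Println(staleObjects(nil, nil))
}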

In the following outputs, relatedObjects is first shown wiped, then repopulated without the admission controller, yet the admission controller DaemonSet is still present on the cluster.

(shiftstack) [stack@undercloud-0 ~]$ oc get co network -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    network.operator.openshift.io/last-seen-state: '{"DaemonsetStates":[],"DeploymentStates":[]}'
  creationTimestamp: "2020-03-03T20:50:20Z"
  generation: 1
  name: network
  resourceVersion: "2579694"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/network
  uid: 014888b0-a1bf-4c1a-b427-d5467f33ba76
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-03-09T15:57:56Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-03-03T20:50:20Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2020-03-09T16:10:31Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-03-03T20:57:53Z"
    status: "True"
    type: Available
  extension: null
  versions:
  - name: operator
    version: 4.4.0-0.nightly-2020-03-03-110909

(shiftstack) [stack@undercloud-0 ~]$ oc get co network -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    network.operator.openshift.io/last-seen-state: '{"DaemonsetStates":[],"DeploymentStates":[{"Namespace":"openshift-kuryr","Name":"kuryr-controller","LastSeenStatus":{"observedGeneration":26,"replicas":1,"updatedReplicas":1,"unavailableReplicas":1,"conditions":[{"type":"Progressing","status":"True","lastUpdateTime":"2020-03-08T19:37:53Z","lastTransitionTime":"2020-03-03T20:52:57Z","reason":"NewReplicaSetAvailable","message":"ReplicaSet
      \"kuryr-controller-57c7f8d95f\" has successfully progressed."},{"type":"Available","status":"False","lastUpdateTime":"2020-03-09T16:11:33Z","lastTransitionTime":"2020-03-09T16:11:33Z","reason":"MinimumReplicasUnavailable","message":"Deployment
      does not have minimum availability."}]},"LastChangeTime":"2020-03-09T16:12:04.3674935Z"}]}'
  creationTimestamp: "2020-03-03T20:50:20Z"
  generation: 1
  name: network
  resourceVersion: "2579785"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/network
  uid: 014888b0-a1bf-4c1a-b427-d5467f33ba76
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-03-09T15:57:56Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-03-03T20:50:20Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2020-03-09T16:12:04Z"
    message: Deployment "openshift-kuryr/kuryr-controller" is not available (awaiting
      1 nodes)
    reason: Deploying
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-03-03T20:57:53Z"
    status: "True"
    type: Available
  extension: null
  relatedObjects:
  - group: ""
    name: applied-cluster
    namespace: openshift-network-operator
    resource: configmaps
  - group: apiextensions.k8s.io
    name: network-attachment-definitions.k8s.cni.cncf.io
    resource: customresourcedefinitions
  - group: ""
    name: openshift-multus
    resource: namespaces
  - group: rbac.authorization.k8s.io
    name: multus
    resource: clusterroles
  - group: ""
    name: multus
    namespace: openshift-multus
    resource: serviceaccounts
  - group: rbac.authorization.k8s.io
    name: multus
    resource: clusterrolebindings
  - group: apps
    name: multus
    namespace: openshift-multus
    resource: daemonsets
  - group: ""
    name: multus-admission-controller
    namespace: openshift-multus
    resource: services
  - group: rbac.authorization.k8s.io
    name: multus-admission-controller-webhook
    resource: clusterroles
  - group: rbac.authorization.k8s.io
    name: multus-admission-controller-webhook
    resource: clusterrolebindings
  - group: admissionregistration.k8s.io
    name: multus.openshift.io
    resource: validatingwebhookconfigurations
  - group: ""
    name: openshift-service-ca
    namespace: openshift-network-operator
    resource: configmaps
  - group: apps
    name: multus-admission-controller
    namespace: openshift-multus
    resource: daemonsets
  - group: monitoring.coreos.com
    name: monitor-multus-admission-controller
    namespace: openshift-multus
    resource: servicemonitors
  - group: ""
    name: multus-admission-controller-monitor-service
    namespace: openshift-multus
    resource: services
  - group: rbac.authorization.k8s.io
    name: prometheus-k8s
    namespace: openshift-multus
    resource: roles
  - group: rbac.authorization.k8s.io
    name: prometheus-k8s
    namespace: openshift-multus
    resource: rolebindings
  - group: monitoring.coreos.com
    name: prometheus-k8s-rules
    namespace: openshift-multus
    resource: prometheusrules
  - group: ""
    name: openshift-kuryr
    resource: namespaces
  - group: rbac.authorization.k8s.io
    name: kuryr
    resource: clusterroles
  - group: ""
    name: kuryr
    namespace: openshift-kuryr
    resource: serviceaccounts
  - group: rbac.authorization.k8s.io
    name: kuryr
    resource: clusterrolebindings
  - group: apiextensions.k8s.io
    name: kuryrnets.openstack.org
    resource: customresourcedefinitions
  - group: apiextensions.k8s.io
    name: kuryrnetpolicies.openstack.org
    resource: customresourcedefinitions
  - group: ""
    name: kuryr-config
    namespace: openshift-kuryr
    resource: configmaps
  - group: apps
    name: kuryr-cni
    namespace: openshift-kuryr
    resource: daemonsets
  - group: apps
    name: kuryr-controller
    namespace: openshift-kuryr
    resource: deployments
  - group: ""
    name: openshift-network-operator
    resource: namespaces
  versions:
  - name: operator
    version: 4.4.0-0.nightly-2020-03-03-110909

(shiftstack) [stack@undercloud-0 ~]$ oc get po -n openshift-kuryr
NAME                                   READY   STATUS    RESTARTS   AGE
kuryr-cni-4plvz                        1/1     Running   0          4m59s
kuryr-cni-68bkt                        1/1     Running   0          5m58s
kuryr-cni-6k2x2                        1/1     Running   0          6m29s
kuryr-cni-msbtk                        1/1     Running   0          7m2s
kuryr-cni-qlnrk                        1/1     Running   0          4m25s
kuryr-cni-rgl6w                        1/1     Running   0          5m25s
kuryr-controller-59d7fcf5fd-p5n8l      1/1     Running   3          7m6s
kuryr-dns-admission-controller-dzlpl   1/1     Running   0          14m
kuryr-dns-admission-controller-lmx2s   1/1     Running   0          14m
kuryr-dns-admission-controller-w97jb   1/1     Running   0          14m

Version-Release number of selected component (if applicable):

Tested with OCP 4.4, but also applicable to other releases.

How reproducible:


Steps to Reproduce:
1. Recreate the CNO pod with some new configuration
2. Ensure the new config causes a Kubernetes resource to no longer be rendered
3. Notice the resource is still present on the cluster even though it is no longer rendered

Actual results:


Expected results:


Additional info:

Comment 3 Ross Brattain 2020-03-16 18:43:49 UTC

Unable to reproduce orphaned resources on 4.5.0-0.nightly-2020-03-16-101116 with OpenShiftSDN


The behaviour of the clusteroperator network relatedObjects on OpenShiftSDN seems to be slightly different: 'relatedObjects' never seems to be nil.

When I delete the network-operator pod I do not see relatedObjects changing to nil.

The only way I was able to reproduce the original issue on 4.4 SDN was to scale the CNO Deployment to 0, then oc edit and change the network config.


Orphaned resources reproduction steps on 4.4 OpenShiftSDN

1. Add a new multus network
  oc edit networks.operator.openshift.io cluster

  spec:
    additionalNetworks:
    - name: bridge-ipam-dhcp
      namespace: openshift-multus
      rawCNIConfig: '{ "name": "bridge-ipam-dhcp", "cniVersion": "0.3.1", "type": "bridge",
        "master": "ens5", "ipam": { "type": "dhcp" } }'
      type: Raw

2. Verify dhcp-daemon pods are created in the openshift-multus namespace
   oc get -n openshift-multus pods -l app=dhcp-daemon

3. scale the CNO to 0 and verify the pod is deleted
   oc -n openshift-network-operator scale deployment network-operator --replicas=0

4. oc edit networks.operator.openshift.io cluster and delete the additional network we added in step 1

5. oc -n openshift-network-operator scale deployment network-operator --replicas=1

6. verify the dhcp pods are still alive and have not been terminated
   oc get -n openshift-multus pods -l app=dhcp-daemon


With these steps the dhcp-daemon pods are not terminated on 4.4

On 4.5.0-0.nightly-2020-03-16-101116 the dhcp-daemon pods are terminated

This seems to suggest something has been resolved on 4.5

With OpenShiftSDN I have never seen clusteroperator network 'relatedObjects' be nil

@anusaxen reports that with OVNKubernetes he also has not seen 'relatedObjects' be nil


Can the Kuryr team also look and see whether the root cause of the 'relatedObjects' nil state can be identified?

Comment 4 Maysa Macedo 2020-03-17 14:49:15 UTC
The fix was already in place in the 4.5.0-0.nightly-2020-03-16-101116 release image.

I could also see the 'relatedObjects' field missing when using OpenShiftSDN, by continuously watching the field value with 'oc get co network -o yaml -w' (it keeps a record of the changes that happened to the object).

The same issue can be seen with Kuryr, as the population of relatedObjects only happens after the ClusterOperator update has happened. The fix solves the issue with Kuryr as well.
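
As a rough illustration of that ordering problem (a hypothetical sketch with invented names, not the change from the linked pull request): if the operator publishes its status before its in-memory relatedObjects record is populated, for example by seeding it from the existing ClusterOperator, a freshly restarted operator overwrites the list on the cluster with an empty one, and the stale-object diff has nothing to work with.

// Hypothetical sketch of the ordering issue, not the code from the linked
// pull request. The statusManager type and its fields are invented for
// this example.
package main

import "fmt"

type objRef struct{ Group, Resource, Namespace, Name string }

// statusManager tracks what the operator believes it has applied; this is
// the list that ends up in the ClusterOperator relatedObjects field.
type statusManager struct {
	relatedObjects []objRef
}

// publish returns the relatedObjects that would be written to the
// ClusterOperator status at this point in time.
func (s *statusManager) publish() []objRef {
	return s.relatedObjects
}

func main() {
	onCluster := []objRef{{Group: "apps", Resource: "daemonsets",
		Namespace: "openshift-kuryr", Name: "kuryr-dns-admission-controller"}}

	// Buggy ordering: the freshly restarted operator publishes before its
	// record is populated, wiping relatedObjects on the cluster.
	buggy := &statusManager{}
	fmt.Println("published (empty record):", buggy.publish())

	// Seeding the record from the existing ClusterOperator before the
	// first publish preserves the previous list, so objects dropped from
	// the render set can still be detected and removed.
	seeded := &statusManager{relatedObjects: onCluster}
	fmt.Println("published (seeded first):", seeded.publish())
}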

Comment 6 errata-xmlrpc 2020-07-13 17:19:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

