Description of problem:
Using a chaosmonkey-like game CNO got stuck in Progressing=True state

Version-Release number of selected component (if applicable):
4.1.3

How reproducible:
Rare chance to hit this on actual system

Steps to Reproduce:
1. Kill one of the multus pods
2. Kill CNO pod

Actual results:
If the new CNO pod comes back sooner than multus DS refreshes status, CNO would get stuck in Progressing=True

Expected results:
CNO would refresh Progressing state once Multus DS is available

Additional info:
It's not just multus. The problem is that CNO doesn't ensure that the operator status is correct when it starts up. It only updates it when something changes while the CNO is running. So the order of events is:

1. multus pod is killed
2. multus daemonset updates to reflect that we're missing a multus pod
3. CNO sees the daemonset change, updates operator state to Progressing
4. CNO is killed
5. multus pod is restarted, multus daemonset updates to say it's OK
6. CNO is restarted, does nothing
7. (5 minutes later) CNO does a full resync, sees that nothing has changed, does nothing
8. (eventually) another multus pod is killed and comes back, CNO finally fixes operator status
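To make the sequence above concrete, here is a minimal Go sketch that replays steps 1 to 6 against a toy model. It is not CNO code and not the change made in any PR: `Cluster`, `Operator`, `Start`, `OnDaemonSetChange` and `Replay` are all hypothetical names invented for illustration, the replica counts are made up, and the 5-minute resync from step 7 is left out because it changes nothing in this scenario. The only point is to contrast "update status only on change events" with "also reconcile status once at startup".

```go
package main

import "fmt"

// DaemonSetState is a hypothetical, simplified view of a DaemonSet's status.
type DaemonSetState struct {
	Desired, Available int
}

// Cluster stands in for state stored in the API server: it survives
// operator restarts. Progressing models the persisted operator condition.
type Cluster struct {
	Multus      DaemonSetState
	Progressing bool
}

// Operator is a toy model of the status logic (an assumption, not CNO code).
type Operator struct {
	cluster *Cluster
}

// Start models an operator (re)start. With syncOnStartup=false the operator
// trusts whatever status is already persisted, as described in the comment above.
func Start(c *Cluster, syncOnStartup bool) *Operator {
	op := &Operator{cluster: c}
	if syncOnStartup {
		op.reconcile()
	}
	return op
}

// OnDaemonSetChange models the handler that fires when a DaemonSet changes
// while the operator is running.
func (o *Operator) OnDaemonSetChange() {
	o.reconcile()
}

// reconcile recomputes the condition from the currently observed state.
func (o *Operator) reconcile() {
	ds := o.cluster.Multus
	o.cluster.Progressing = ds.Available < ds.Desired
}

// Replay runs steps 1-6 and returns the persisted Progressing value afterwards.
func Replay(syncOnStartup bool) bool {
	c := &Cluster{Multus: DaemonSetState{Desired: 3, Available: 3}}
	op := Start(c, syncOnStartup)

	c.Multus.Available = 2 // steps 1-2: a multus pod is killed, DaemonSet reflects it
	op.OnDaemonSetChange() // step 3: operator sets Progressing=True

	// step 4: operator is killed; nothing is watching from here on
	c.Multus.Available = 3 // step 5: pod is back, but no change event is handled

	Start(c, syncOnStartup) // step 6: operator is restarted
	return c.Progressing
}

func main() {
	fmt.Println("update only on change:", Replay(false))
	fmt.Println("reconcile on startup: ", Replay(true))
}
```

In this model `Replay(false)` leaves Progressing stuck at true even though the DaemonSet is healthy, while `Replay(true)` clears it on restart, because the status is recomputed from observed state rather than from missed events.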
I think Alexander fixed this. Over to him to verify and close.
Yes, this has been fixed with the PR:
https://github.com/openshift/cluster-network-operator/pull/232

I am setting the status to MODIFIED for QA testing.
Tested and verified in v4.2.0-0.ci-2019-07-30-115127; CNO no longer gets stuck in Progressing=True.

[root@dhcp-41-193 ~]# oc get co network
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network                                  False       True          False      43s
[root@dhcp-41-193 ~]# oc get co network
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network                                  False       True          False      63s
[root@dhcp-41-193 ~]# oc get co network
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      2s
[root@dhcp-41-193 ~]# oc get co network
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      14s
[root@dhcp-41-193 ~]# oc get co network
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      95s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922