Description of problem:
The DaemonSet updatedNumberScheduled status value can sometimes be lower than the number of updated pods that have actually been scheduled.

Version-Release number of selected component (if applicable):
4.4.0

How reproducible:
Sometimes

Actual results:
updatedNumberScheduled = 2, even though 3 updated pods have been scheduled and are running.

Expected results:
updatedNumberScheduled = 3

Additional info:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098
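For anyone trying to confirm this independently: updated DaemonSet pods should carry the controller-revision-hash label of the current ControllerRevision, so the reported count can be cross-checked against the actual pod count. A minimal sketch (namespace and DaemonSet name taken from this bug; <hash> must be filled in by hand from the newest revision):

oc get controllerrevisions -n openshift-controller-manager
oc get pods -n openshift-controller-manager -l controller-revision-hash=<hash> --no-headers | wc -l
oc get ds controller-manager -n openshift-controller-manager -o jsonpath='{.status.updatedNumberScheduled}{"\n"}'

If the two counts disagree, the status field really is lagging behind the scheduled pods.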
Is the status up to date? Can you supply the whole yaml? The status might be stale (i.e. reflect a previous version of the object) unless you wait for status.observedGeneration == metadata.generation.
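A quick way to rule out staleness from the command line (a sketch, using the names from this bug; the output is generation followed by observedGeneration):

oc get ds controller-manager -n openshift-controller-manager -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'

Only once the two values are equal does the rest of the status block describe the current spec.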
Order of events:

Feb 18 17:37:18.328: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ci-op-15462z30/release@sha256:5e455fb1aea20108bb7ed9b64b4f120b5ce61cdc0c091fbf31e6bbe4ab331d8f
Feb 18 17:44:25: ocm-o reports Progressing=false <- bug 1; see oas-o https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/pkg/operator/workloadcontroller/workload_controller_openshiftapiserver_v311_00.go#L59-L102 for how this should be handled
Feb 18 17:44:59: first ocm ds pod controller-manager-d82mq reports ready status (see https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ci-op-15462z30-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/namespaces/openshift-controller-manager/core/pods.yaml)
Feb 18 17:45:42: second ocm ds pod controller-manager-9tdwp reports ready status (see ^)
Feb 18 17:46:12: third ocm ds pod controller-manager-47q8b is marked for deletion (see ^^)
Feb 18 17:49:43.365: INFO: cluster upgrade is Progressing: Working towards 0.0.1-2020-02-18-165527: 77% complete, waiting on openshift-controller-manager (still)

The test fails with the DS having the following status (from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ci-op-15462z30-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/namespaces/openshift-controller-manager/apps/daemonsets.yaml):

status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 3
  updatedNumberScheduled: 3

but inside spec you'll see:

updateStrategy:
  rollingUpdate:
    maxUnavailable: 1
  type: RollingUpdate

This accounts for the missing final pod, which hadn't been updated yet.
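Judging from the comment above, the point of the oas-o link is that rollout state has to be derived from the DaemonSet counters as a whole, gated on the observed generation, rather than from a single field. A rough command-line equivalent (a sketch, not the operator's actual logic; names taken from this bug):

gen=$(oc get ds controller-manager -n openshift-controller-manager -o jsonpath='{.metadata.generation}')
obs=$(oc get ds controller-manager -n openshift-controller-manager -o jsonpath='{.status.observedGeneration}')
upd=$(oc get ds controller-manager -n openshift-controller-manager -o jsonpath='{.status.updatedNumberScheduled}')
des=$(oc get ds controller-manager -n openshift-controller-manager -o jsonpath='{.status.desiredNumberScheduled}')
if [ "$gen" = "$obs" ] && [ "$upd" = "$des" ]; then
  echo "rollout complete"
else
  echo "still progressing"
fi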
The ocm-o status bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1804434
The ocm-o 4.4 bug is https://bugzilla.redhat.com/show_bug.cgi?id=1804937 and the improvements for DS status will be tracked in https://issues.redhat.com/browse/WRKLDS-132
https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/142 merged, so moving to QA.
Confirmed with payload: 4.4.0-0.nightly-2020-03-01-215047. The issue can still be reproduced: while the daemonset is rolling out, the openshift-controller-manager operator's status is not right.

[root@dhcp-140-138 ~]# oc get po
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-65mf2   1/1     Running       0          27s
controller-manager-8mcfj   1/1     Terminating   0          56s
controller-manager-drqz4   1/1     Running       0          24s
[root@dhcp-140-138 ~]# oc get daemonset.apps/controller-manager -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "22"
    operator.openshift.io/force: 3e4e22a5-b188-494f-b4ad-bdaf85fe2665
    operator.openshift.io/pull-spec: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ce5997cb44de6c1a5895035287dc12577dc8f3be240ed540c7a86be80063b7
    release.openshift.io/version: 4.4.0-0.nightly-2020-03-01-215047
  creationTimestamp: "2020-03-02T02:03:51Z"
  generation: 22
....
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 3
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  numberUnavailable: 1
  observedGeneration: 22
[root@dhcp-140-138 ~]# oc get co/openshift-controller-manager
NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.4.0-0.nightly-2020-03-01-215047   True        False         False      7h51m
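The contradiction is visible in the captured output itself: observedGeneration already matches generation 22 and numberUnavailable is 1, yet the clusteroperator shows PROGRESSING=False. A check of roughly this shape (a hypothetical sketch, not the operator's code) would have flagged it:

unavail=$(oc get ds controller-manager -n openshift-controller-manager -o jsonpath='{.status.numberUnavailable}')
# numberUnavailable is omitted from the status when it is zero
if [ "${unavail:-0}" -gt 0 ]; then
  echo "operator should report Progressing=True"
fi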
Verified with registry.svc.ci.openshift.org/ocp/release:4.4.0-0.ci-2020-03-03-033811 and this looks correct. There's an important behavior change coming from the linked PR: from now on, the ocm operator will report Progressing until at least one pod is available, and only after that point will it report ready. This is because we lowered the required number of available pods for ocm to just one.
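A simple way to observe the new behavior during the next rollout (a sketch using the names from this bug; run each watch in its own terminal):

oc get co openshift-controller-manager -w
oc get ds controller-manager -n openshift-controller-manager -w

Per the change described above, PROGRESSING should now read True while pods are being replaced (compare the verification output below) instead of flipping to False prematurely.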
Confirmed with payload: 4.4.0-0.nightly-2020-03-02-231151, the issue has been fixed:

[root@dhcp-140-138 ~]# oc get co
NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.4.0-0.nightly-2020-03-02-231151   True        True          False      22h
[root@dhcp-140-138 ~]# oc get po -n openshift-controller-manager
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-dss78   1/1     Running       0          31s
controller-manager-gkv6k   1/1     Running       0          14s
controller-manager-rvd8p   1/1     Terminating   0          44s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581