Bug 1804446
| Summary: | DaemonSet updatedNumberScheduled status not always up to date | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Adam Kaplan <adam.kaplan> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.4 | CC: | aos-bugs, mfojtik, tnozicka |
| Target Milestone: | --- | Flags: | adam.kaplan: needinfo- |
| Target Release: | 4.4.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-04 11:37:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Adam Kaplan
2020-02-18 20:35:23 UTC
Is the status up to date? Can you supply the whole yaml? The status might be stale (reflecting a previous version of the object) unless you wait for status.observedGeneration == metadata.generation.

Order of events:

Feb 18 17:37:18.328: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ci-op-15462z30/release@sha256:5e455fb1aea20108bb7ed9b64b4f120b5ce61cdc0c091fbf31e6bbe4ab331d8f
Feb 18 17:44:25: ocm-o reports Progressing false <- bug 1; see oas-o https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/pkg/operator/workloadcontroller/workload_controller_openshiftapiserver_v311_00.go#L59-L102 for how this should be handled
Feb 18 17:44:59: first ocm DS pod controller-manager-d82mq reports ready status (see https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ci-op-15462z30-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/namespaces/openshift-controller-manager/core/pods.yaml)
Feb 18 17:45:42: second ocm DS pod controller-manager-9tdwp reports ready status (see ^)
Feb 18 17:46:12: third ocm DS pod controller-manager-47q8b is marked for deletion (see ^^)
Feb 18 17:49:43.365: INFO: cluster upgrade is Progressing: Working towards 0.0.1-2020-02-18-165527: 77% complete, waiting on openshift-controller-manager (still)

The test fails with the DS having the following status (from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ci-op-15462z30-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/namespaces/openshift-controller-manager/apps/daemonsets.yaml):

status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 3
  updatedNumberScheduled: 3

but inside the spec you'll see:

updateStrategy:
  rollingUpdate:
    maxUnavailable: 1
  type: RollingUpdate

This is the missing final pod, which wasn't updated yet.

The ocm-o status bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1804434. The ocm-o 4.4 bug is https://bugzilla.redhat.com/show_bug.cgi?id=1804937, and the improvements for DS status will be tracked in https://issues.redhat.com/browse/WRKLDS-132.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/142 merged, so moving to QA.

Confirmed with payload 4.4.0-0.nightly-2020-03-01-215047: the issue can still be reproduced. While the daemonset is rolling out, the openshift-controller-manager operator's status is not right:
[root@dhcp-140-138 ~]# oc get po
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-65mf2   1/1     Running       0          27s
controller-manager-8mcfj   1/1     Terminating   0          56s
controller-manager-drqz4   1/1     Running       0          24s
[root@dhcp-140-138 ~]# oc get daemonset.apps/controller-manager -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "22"
    operator.openshift.io/force: 3e4e22a5-b188-494f-b4ad-bdaf85fe2665
    operator.openshift.io/pull-spec: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ce5997cb44de6c1a5895035287dc12577dc8f3be240ed540c7a86be80063b7
    release.openshift.io/version: 4.4.0-0.nightly-2020-03-01-215047
  creationTimestamp: "2020-03-02T02:03:51Z"
  generation: 22
....
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 3
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  numberUnavailable: 1
  observedGeneration: 22
[root@dhcp-140-138 ~]# oc get co/openshift-controller-manager
NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.4.0-0.nightly-2020-03-01-215047   True        False         False      7h51m
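The stale-status pitfall described above (trusting a DaemonSet status before the controller has observed the latest spec) can be sketched in Go. This is a minimal illustration, not the actual controller code: the structs below are hypothetical stand-ins for the relevant fields of metav1.ObjectMeta and appsv1.DaemonSetStatus, and rolloutComplete is an invented helper name.

```go
package main

import "fmt"

// DaemonSetStatus mirrors only the fields this sketch needs; the real
// type lives in k8s.io/api/apps/v1.
type DaemonSetStatus struct {
	ObservedGeneration     int64
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberAvailable        int32
}

// DaemonSet pairs the object's metadata.generation with its status.
type DaemonSet struct {
	Generation int64
	Status     DaemonSetStatus
}

// rolloutComplete reports whether the status can be trusted and the
// rollout has finished: the controller must have observed the current
// generation, and every scheduled pod must be updated and available.
func rolloutComplete(ds DaemonSet) bool {
	if ds.Status.ObservedGeneration < ds.Generation {
		return false // status is stale, from a previous version of the spec
	}
	return ds.Status.UpdatedNumberScheduled == ds.Status.DesiredNumberScheduled &&
		ds.Status.NumberAvailable == ds.Status.DesiredNumberScheduled
}

func main() {
	// Numbers loosely based on the mid-rollout reproduction above:
	// generation 22 observed, but only 2 of 3 pods available
	// (updatedNumberScheduled is assumed here, not taken from the output).
	ds := DaemonSet{
		Generation: 22,
		Status: DaemonSetStatus{
			ObservedGeneration:     22,
			DesiredNumberScheduled: 3,
			UpdatedNumberScheduled: 3,
			NumberAvailable:        2,
		},
	}
	fmt.Println(rolloutComplete(ds)) // prints "false"
}
```

The same ordering of checks (observedGeneration first, counts second) is what the linked oas-o workload controller and `kubectl rollout status` rely on to avoid acting on a status written for an older spec.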
Verified with registry.svc.ci.openshift.org/ocp/release:4.4.0-0.ci-2020-03-03-033811 and this looks correct.

There's an important change in the process, coming from the linked PR. From now on, the ocm operator will report Progressing until at least one pod is available; after that point the operator will report ready. This is because we lowered the required number of available pods for ocm to just one.

Confirmed with payload 4.4.0-0.nightly-2020-03-02-231151; the issue has been fixed:

[root@dhcp-140-138 ~]# oc get co
NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.4.0-0.nightly-2020-03-02-231151   True        True          False      22h
[root@dhcp-140-138 ~]# oc get po -n openshift-controller-manager
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-dss78   1/1     Running       0          31s
controller-manager-gkv6k   1/1     Running       0          14s
controller-manager-rvd8p   1/1     Terminating   0          44s

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
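The lowered availability threshold described in the verification comment can be illustrated with a small Go sketch. The function and type names here are hypothetical (the real logic lives in cluster-openshift-controller-manager-operator, changed by the linked PR); the sketch only shows the rule: Available once at least one pod is up, Progressing until the whole DaemonSet has rolled out.

```go
package main

import "fmt"

// DaemonSetStatus is a stand-in for the fields of appsv1.DaemonSetStatus
// that the operator inspects in this sketch.
type DaemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberAvailable        int32
}

// operatorConditions is an invented helper name illustrating the rule
// described above: Available as soon as one pod is available, and
// Progressing until every desired pod is both updated and available.
func operatorConditions(s DaemonSetStatus) (available, progressing bool) {
	available = s.NumberAvailable >= 1
	progressing = s.UpdatedNumberScheduled < s.DesiredNumberScheduled ||
		s.NumberAvailable < s.DesiredNumberScheduled
	return available, progressing
}

func main() {
	// Mid-rollout, matching the `oc get co` output in the comment above:
	// AVAILABLE=True, PROGRESSING=True while one pod is still terminating.
	a, p := operatorConditions(DaemonSetStatus{
		DesiredNumberScheduled: 3,
		UpdatedNumberScheduled: 2,
		NumberAvailable:        2,
	})
	fmt.Println(a, p) // prints "true true"
}
```

Under this rule the operator no longer flips Progressing to false just because one updated pod reported ready, which was the premature transition seen in the original upgrade timeline.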