Bug 1804446 - DaemonSet updatedNumberScheduled status not always up to date
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-18 20:35 UTC by Adam Kaplan
Modified: 2020-05-04 11:38 UTC (History)
3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:37:31 UTC
Target Upstream Version:
adam.kaplan: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:38:06 UTC

Description Adam Kaplan 2020-02-18 20:35:23 UTC
Description of problem:

The DaemonSet updatedNumberScheduled status value can sometimes lag behind and not reflect the number of updated pods that have been scheduled.


Version-Release number of selected component (if applicable): 4.4.0


How reproducible: Sometimes

Actual results:

updatedNumberScheduled = 2, even though 3 updated pods have been scheduled and are running.


Expected results:

updatedNumberScheduled = 3


Additional info:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098

Comment 1 Tomáš Nožička 2020-02-19 09:34:22 UTC
is the status up to date? can you supply the whole yaml?

the status might be stale (for previous version of the object) unless you wait for status.observedGeneration == metadata.generation
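The staleness rule above can be sketched as a small check (a minimal illustration in Python, not the controller's actual code; the helper name is hypothetical): a DaemonSet's status only applies to the current spec once status.observedGeneration has caught up with metadata.generation.

```python
def status_is_current(ds: dict) -> bool:
    """Return True only when the reported status reflects the latest spec.

    Until observedGeneration == generation, the status may describe a
    previous version of the object and must not be trusted.
    """
    generation = ds.get("metadata", {}).get("generation", 0)
    observed = ds.get("status", {}).get("observedGeneration", -1)
    return observed == generation

# Example: status written for generation 2 while the spec is at generation 3
stale_ds = {"metadata": {"generation": 3}, "status": {"observedGeneration": 2}}
assert not status_is_current(stale_ds)
```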

Comment 2 Maciej Szulik 2020-02-19 13:21:31 UTC
Order of events:

Feb 18 17:37:18.328: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ci-op-15462z30/release@sha256:5e455fb1aea20108bb7ed9b64b4f120b5ce61cdc0c091fbf31e6bbe4ab331d8f

Feb 18 17:44:25: ocm-o reports Progressing false <- bug 1, see oas-o https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/pkg/operator/workloadcontroller/workload_controller_openshiftapiserver_v311_00.go#L59-L102 for how this should be handled
  
Feb 18 17:44:59: first ocm ds pod controller-manager-d82mq reports ready status (see https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ci-op-15462z30-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/namespaces/openshift-controller-manager/core/pods.yaml)
Feb 18 17:45:42: second ocm ds pod controller-manager-9tdwp reports ready status (see ^)

Feb 18 17:46:12: third ocm ds pod controller-manager-47q8b is marked for deletion (see ^^)

Feb 18 17:49:43.365: INFO: cluster upgrade is Progressing: Working towards 0.0.1-2020-02-18-165527: 77% complete, waiting on openshift-controller-manager (still)

The test fails with DS having following status (from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ci-op-15462z30-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/namespaces/openshift-controller-manager/apps/daemonsets.yaml):

  status:
    currentNumberScheduled: 3
    desiredNumberScheduled: 3
    numberAvailable: 3
    numberMisscheduled: 0
    numberReady: 3
    observedGeneration: 3
    updatedNumberScheduled: 3

but inside spec you'll see:

    updateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate

This accounts for the missing final pod, which had not been updated yet.
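To see why this status is misleading, consider the completeness check a rollout consumer might apply to the counters quoted above (a hypothetical helper for illustration, not kubectl's source): the counters alone say the rollout is done, even though with maxUnavailable: 1 the final pod replacement was still in flight.

```python
def rollout_looks_complete(status: dict) -> bool:
    """Naive completeness check based only on the status counters."""
    desired = status["desiredNumberScheduled"]
    return (status["updatedNumberScheduled"] >= desired
            and status["numberAvailable"] >= desired)

# Status as quoted from the must-gather above
quoted_status = {
    "currentNumberScheduled": 3,
    "desiredNumberScheduled": 3,
    "numberAvailable": 3,
    "numberMisscheduled": 0,
    "numberReady": 3,
    "observedGeneration": 3,
    "updatedNumberScheduled": 3,
}

# The counters pass the check even though a pod marked for deletion was
# still being replaced -- the counters cannot be trusted in isolation here.
assert rollout_looks_complete(quoted_status)
```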

Comment 3 Maciej Szulik 2020-02-19 14:50:01 UTC
The ocm-o status bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1804434

Comment 4 Maciej Szulik 2020-02-24 11:59:39 UTC
The ocm-o 4.4 bug is https://bugzilla.redhat.com/show_bug.cgi?id=1804937 and the improvements for DS status will be tracked in https://issues.redhat.com/browse/WRKLDS-132

Comment 5 Maciej Szulik 2020-02-24 13:56:51 UTC
https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/142 merged, so moving to QA.

Comment 8 zhou ying 2020-03-02 10:05:51 UTC
Confirmed with payload 4.4.0-0.nightly-2020-03-01-215047; the issue can still be reproduced. While the daemonset is rolling out, the openshift-controller-manager operator's status is not correct.

[root@dhcp-140-138 ~]# oc get po 
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-65mf2   1/1     Running       0          27s
controller-manager-8mcfj   1/1     Terminating   0          56s
controller-manager-drqz4   1/1     Running       0          24s
[root@dhcp-140-138 ~]# oc get daemonset.apps/controller-manager -o yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "22"
    operator.openshift.io/force: 3e4e22a5-b188-494f-b4ad-bdaf85fe2665
    operator.openshift.io/pull-spec: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ce5997cb44de6c1a5895035287dc12577dc8f3be240ed540c7a86be80063b7
    release.openshift.io/version: 4.4.0-0.nightly-2020-03-01-215047
  creationTimestamp: "2020-03-02T02:03:51Z"
  generation: 22
....
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 3
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  numberUnavailable: 1
  observedGeneration: 22
[root@dhcp-140-138 ~]# oc get co/openshift-controller-manager
NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.4.0-0.nightly-2020-03-01-215047   True        False         False      7h51m

Comment 9 Maciej Szulik 2020-03-03 10:45:03 UTC
Verified with registry.svc.ci.openshift.org/ocp/release:4.4.0-0.ci-2020-03-03-033811 and this looks correct.
There's an important change in behavior coming from that linked PR. From now on, the ocm operator
will report Progressing until at least one pod is available; after that point the operator will report
ready. This is because we lowered the required number of available pods for ocm to just one.
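The lowered threshold described above can be sketched as follows (an assumed simplification in Python, not the operator's actual code; the constant and function names are hypothetical): a single available daemonset pod is now enough for the operator to stop reporting Progressing.

```python
# Lowered availability threshold described in the comment above: one pod
# is enough for the ocm operator to consider itself available.
MIN_AVAILABLE_FOR_READY = 1

def operator_progressing(number_available: int) -> bool:
    """Report Progressing until the availability threshold is met."""
    return number_available < MIN_AVAILABLE_FOR_READY

# During a rolling update with 2 of 3 pods up, the operator is ready
assert not operator_progressing(2)
# With no pods available yet, the operator still reports Progressing
assert operator_progressing(0)
```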

Comment 11 zhou ying 2020-03-04 05:28:45 UTC
Confirmed with payload 4.4.0-0.nightly-2020-03-02-231151; the issue has been fixed:

[root@dhcp-140-138 ~]# oc get co 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager               4.4.0-0.nightly-2020-03-02-231151   True        True          False      22h

[root@dhcp-140-138 ~]# oc get po -n openshift-controller-manager
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-dss78   1/1     Running       0          31s
controller-manager-gkv6k   1/1     Running       0          14s
controller-manager-rvd8p   1/1     Terminating   0          44s

Comment 13 errata-xmlrpc 2020-05-04 11:37:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

