Bug 1804446 - DaemonSet updatedNumberScheduled status not always up to date
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-18 20:35 UTC by Adam Kaplan
Modified: 2020-05-04 11:38 UTC (History)
3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:37:31 UTC
Target Upstream Version:
adam.kaplan: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:38:06 UTC

Description Adam Kaplan 2020-02-18 20:35:23 UTC
Description of problem:

The DaemonSet updatedNumberScheduled status value can sometimes lag behind and not reflect the number of updated pods that have been scheduled.


Version-Release number of selected component (if applicable): 4.4.0


How reproducible: Sometimes

Actual results:

updatedNumberScheduled = 2, even though 3 updated pods have been scheduled and are running.


Expected results:

updatedNumberScheduled = 3


Additional info:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098

Comment 1 Tomáš Nožička 2020-02-19 09:34:22 UTC
is the status up to date? can you supply the whole yaml?

the status might be stale (for previous version of the object) unless you wait for status.observedGeneration == metadata.generation
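The staleness rule above can be sketched as a small check (a minimal illustration in Python, not the controller's actual code; the helper name is hypothetical): a DaemonSet's status only applies to the current spec once status.observedGeneration has caught up with metadata.generation.

```python
def status_is_current(ds: dict) -> bool:
    """Return True only when the reported status reflects the latest spec.

    Until observedGeneration == generation, the status may describe a
    previous version of the object and must not be trusted.
    """
    generation = ds.get("metadata", {}).get("generation", 0)
    observed = ds.get("status", {}).get("observedGeneration", -1)
    return observed == generation

# Example: status written for generation 2 while the spec is at generation 3
stale_ds = {"metadata": {"generation": 3}, "status": {"observedGeneration": 2}}
assert not status_is_current(stale_ds)
```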

Comment 2 Maciej Szulik 2020-02-19 13:21:31 UTC
Order of events:

Feb 18 17:37:18.328: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ci-op-15462z30/release@sha256:5e455fb1aea20108bb7ed9b64b4f120b5ce61cdc0c091fbf31e6bbe4ab331d8f

Feb 18 17:44:25: ocm-o reports Progressing false <- bug 1, see oas-o https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/pkg/operator/workloadcontroller/workload_controller_openshiftapiserver_v311_00.go#L59-L102 for how this should be handled
  
Feb 18 17:44:59: first ocm ds pod controller-manager-d82mq reports ready status (see https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ci-op-15462z30-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/namespaces/openshift-controller-manager/core/pods.yaml)
Feb 18 17:45:42: second ocm ds pod controller-manager-9tdwp reports ready status (see ^)

Feb 18 17:46:12: third ocm ds pod controller-manager-47q8b is marked for deletion (see ^^)

Feb 18 17:49:43.365: INFO: cluster upgrade is Progressing: Working towards 0.0.1-2020-02-18-165527: 77% complete, waiting on openshift-controller-manager (still)

The test fails with DS having following status (from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1484/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1098/artifacts/e2e-gcp-upgrade/must-gather/registry-svc-ci-openshift-org-ci-op-15462z30-stable-sha256-a273f5ac7f1ad8f7ffab45205ac36c8dff92d9107ef3ae429eeb135fa8057b8b/namespaces/openshift-controller-manager/apps/daemonsets.yaml):

  status:
    currentNumberScheduled: 3
    desiredNumberScheduled: 3
    numberAvailable: 3
    numberMisscheduled: 0
    numberReady: 3
    observedGeneration: 3
    updatedNumberScheduled: 3

but inside spec you'll see:

    updateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate

This accounts for the missing final pod, which had not been updated yet.
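To see why this status is misleading, consider the completeness check a rollout consumer might apply to the counters quoted above (a hypothetical helper for illustration, not kubectl's source): the counters alone say the rollout is done, even though with maxUnavailable: 1 the final pod replacement was still in flight.

```python
def rollout_looks_complete(status: dict) -> bool:
    """Naive completeness check based only on the status counters."""
    desired = status["desiredNumberScheduled"]
    return (status["updatedNumberScheduled"] >= desired
            and status["numberAvailable"] >= desired)

# Status as quoted from the must-gather above
quoted_status = {
    "currentNumberScheduled": 3,
    "desiredNumberScheduled": 3,
    "numberAvailable": 3,
    "numberMisscheduled": 0,
    "numberReady": 3,
    "observedGeneration": 3,
    "updatedNumberScheduled": 3,
}

# The counters pass the check even though a pod marked for deletion was
# still being replaced -- the counters cannot be trusted in isolation here.
assert rollout_looks_complete(quoted_status)
```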

Comment 3 Maciej Szulik 2020-02-19 14:50:01 UTC
The ocm-o status bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1804434

Comment 4 Maciej Szulik 2020-02-24 11:59:39 UTC
The ocm-o 4.4 bug is https://bugzilla.redhat.com/show_bug.cgi?id=1804937 and the improvements for DS status will be tracked in https://issues.redhat.com/browse/WRKLDS-132

Comment 5 Maciej Szulik 2020-02-24 13:56:51 UTC
https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/142 merged, so moving to QA.

Comment 8 zhou ying 2020-03-02 10:05:51 UTC
Confirmed with payload 4.4.0-0.nightly-2020-03-01-215047; the issue can still be reproduced. While the daemonset is rolling out, the openshift-controller-manager operator's status is not correct.

[root@dhcp-140-138 ~]# oc get po 
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-65mf2   1/1     Running       0          27s
controller-manager-8mcfj   1/1     Terminating   0          56s
controller-manager-drqz4   1/1     Running       0          24s
[root@dhcp-140-138 ~]# oc get daemonset.apps/controller-manager -o yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "22"
    operator.openshift.io/force: 3e4e22a5-b188-494f-b4ad-bdaf85fe2665
    operator.openshift.io/pull-spec: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ce5997cb44de6c1a5895035287dc12577dc8f3be240ed540c7a86be80063b7
    release.openshift.io/version: 4.4.0-0.nightly-2020-03-01-215047
  creationTimestamp: "2020-03-02T02:03:51Z"
  generation: 22
....
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 3
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  numberUnavailable: 1
  observedGeneration: 22
[root@dhcp-140-138 ~]# oc get co/openshift-controller-manager
NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.4.0-0.nightly-2020-03-01-215047   True        False         False      7h51m

Comment 9 Maciej Szulik 2020-03-03 10:45:03 UTC
Verified with registry.svc.ci.openshift.org/ocp/release:4.4.0-0.ci-2020-03-03-033811 and this looks correct.
There's an important change in behavior coming from that linked PR. From now on, the ocm operator
will report Progressing until at least one pod is available; after that point the operator will report
ready. This is because we lowered the required number of available pods for ocm to just one.
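The lowered threshold described above can be sketched as follows (an assumed simplification in Python, not the operator's actual code; the constant and function names are hypothetical): a single available daemonset pod is now enough for the operator to stop reporting Progressing.

```python
# Lowered availability threshold described in the comment above: one pod
# is enough for the ocm operator to consider itself available.
MIN_AVAILABLE_FOR_READY = 1

def operator_progressing(number_available: int) -> bool:
    """Report Progressing until the availability threshold is met."""
    return number_available < MIN_AVAILABLE_FOR_READY

# During a rolling update with 2 of 3 pods up, the operator is ready
assert not operator_progressing(2)
# With no pods available yet, the operator still reports Progressing
assert operator_progressing(0)
```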

Comment 11 zhou ying 2020-03-04 05:28:45 UTC
Confirmed with payload 4.4.0-0.nightly-2020-03-02-231151; the issue has been fixed:

[root@dhcp-140-138 ~]# oc get co 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager               4.4.0-0.nightly-2020-03-02-231151   True        True          False      22h

[root@dhcp-140-138 ~]# oc get po -n openshift-controller-manager
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-dss78   1/1     Running       0          31s
controller-manager-gkv6k   1/1     Running       0          14s
controller-manager-rvd8p   1/1     Terminating   0          44s

Comment 13 errata-xmlrpc 2020-05-04 11:37:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

