Bug 1866782 - Deployment replica count is changed incorrectly under some conditions
Summary: Deployment replica count is changed incorrectly under some conditions
Keywords:
Status: CLOSED DUPLICATE of bug 1868750
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Tomáš Nožička
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-06 10:41 UTC by Junqi Zhao
Modified: 2020-08-21 11:49 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-21 11:49:42 UTC
Target Upstream Version:
Embargoed:



Description Junqi Zhao 2020-08-06 10:41:57 UTC
Description of problem:
During an upgrade from 4.5.5 to 4.6.0-0.nightly-2020-08-05-103641, a network issue left the monitoring cluster operator Degraded. The operator reports that it expects 2 prometheus-operator replicas, but only 1 is actually needed; the deployment itself has .spec.replicas: 1, as shown below.
# oc get co/monitoring -oyaml
...
status:
  conditions:
  - lastTransitionTime: "2020-08-06T08:50:08Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus
      Operator failed: reconciling Prometheus Operator Deployment failed: updating
      Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator:
      expected 2 replicas, got 1 updated replicas'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded
...
# oc -n openshift-monitoring get deploy prometheus-operator -oyaml
...
spec:
  progressDeadlineSeconds: 600
  replicas: 1
...
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-08-06T07:17:52Z"
    lastUpdateTime: "2020-08-06T07:17:52Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2020-08-06T08:55:09Z"
    lastUpdateTime: "2020-08-06T08:55:09Z"
    message: ReplicaSet "prometheus-operator-7896ccc77c" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 57
  readyReplicas: 1
  replicas: 2
  unavailableReplicas: 1
  updatedReplicas: 1
...
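
For a quick side-by-side of the desired and observed replica counts without dumping the whole object, a jsonpath query along these lines can be used (a sketch using standard Deployment API fields; output formatting may vary by oc version):
# oc -n openshift-monitoring get deploy prometheus-operator -o jsonpath='spec={.spec.replicas} status={.status.replicas} updated={.status.updatedReplicas}{"\n"}'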
# oc -n openshift-monitoring get pod | grep prometheus-operator
prometheus-operator-7795d56f7-d84bb            2/2     Running             0          94m
prometheus-operator-7896ccc77c-jv2tx           0/2     ContainerCreating   0          89m

# oc -n openshift-monitoring describe pod prometheus-operator-7896ccc77c-jv2tx
  Warning  FailedCreatePodSandBox  <invalid> (x45 over 63m)  kubelet, kasturi-upg1-5hljl-master-2  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(aeb95da994e35dcea7be7a4122ec3ddca7afc909463de399377e2b26d72ef907): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition

# oc -n openshift-monitoring get event | grep prometheus-operator-7795d56f7
118m        Normal    Scheduled                pod/prometheus-operator-7795d56f7-d84bb             Successfully assigned openshift-monitoring/prometheus-operator-7795d56f7-d84bb to kasturi-upg1-5hljl-master-2
118m        Normal    AddedInterface           pod/prometheus-operator-7795d56f7-d84bb             Add eth0 [10.129.0.13/23]
118m        Normal    Pulling                  pod/prometheus-operator-7795d56f7-d84bb             Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:848dfc960804c25aef2ec9f45f0c9d236dc1616879785384d5111f14c70dd52c"
118m        Normal    Pulled                   pod/prometheus-operator-7795d56f7-d84bb             Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:848dfc960804c25aef2ec9f45f0c9d236dc1616879785384d5111f14c70dd52c"
118m        Normal    Created                  pod/prometheus-operator-7795d56f7-d84bb             Created container prometheus-operator
118m        Normal    Started                  pod/prometheus-operator-7795d56f7-d84bb             Started container prometheus-operator
118m        Normal    Pulling                  pod/prometheus-operator-7795d56f7-d84bb             Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c80cd7ddfcd963384b55bf43b801f4bd551bddf560bfa2354c5195552b52f4c"
118m        Normal    Pulled                   pod/prometheus-operator-7795d56f7-d84bb             Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c80cd7ddfcd963384b55bf43b801f4bd551bddf560bfa2354c5195552b52f4c"
118m        Normal    Created                  pod/prometheus-operator-7795d56f7-d84bb             Created container kube-rbac-proxy
118m        Normal    Started                  pod/prometheus-operator-7795d56f7-d84bb             Started container kube-rbac-proxy
118m        Normal    SuccessfulCreate         replicaset/prometheus-operator-7795d56f7            Created pod: prometheus-operator-7795d56f7-d84bb
118m        Normal    ScalingReplicaSet        deployment/prometheus-operator                      Scaled up replica set prometheus-operator-7795d56f7 to 1

# oc -n openshift-monitoring get event | grep  prometheus-operator-7896ccc77c
114m        Normal    Scheduled                pod/prometheus-operator-7896ccc77c-jv2tx            Successfully assigned openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx to kasturi-upg1-5hljl-master-2
112m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(3a36a03fdd25d271bed46b1fac03f370c72535f6410b5d8985a06342ea72c9e8): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
111m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(91cafe671005b7494dbfc30899fdd36e15108a7c22a307b9b65f2bb4185792a2): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
109m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(b5844ffe1d629effe817a5497af382eb53315066e91f625f6692f47f8acb4f7f): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
108m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(de52607f4fe180751cea839bfe337e85db52720b9ddc90141ef6330ff99b10ea): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
106m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(ef67e63a9e0fac2df5615e34ee94a50c7c3d08f6435b620823bf1773bf554aef): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
105m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(451838746e2515be66f900b5a25d7fe31ffa6ab49bd4e0e327b6820ed6f97955): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
103m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(70d15d9acdfca080aa37de9dcff25a878b623260c5c75f1fc776c16c551fa072): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
102m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(606a2733bcdf53687d6703a3e03273fb1b799d60d7706ebce7248cddb280b57b): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
100m        Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(78152b8f85cb85c145720c8149e7e62b8bff70c179447141673f3f8b610982b7): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
2m19s       Warning   FailedCreatePodSandBox   pod/prometheus-operator-7896ccc77c-jv2tx            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-operator-7896ccc77c-jv2tx_openshift-monitoring_a8c431f8-4235-42a4-bd2b-16364ade8fbd_0(b1285ff64718d51f69941edcc53c0f9a61e78a86f956a9a8fb435fda3a200b34): Multus: [openshift-monitoring/prometheus-operator-7896ccc77c-jv2tx]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
114m        Normal    SuccessfulCreate         replicaset/prometheus-operator-7896ccc77c           Created pod: prometheus-operator-7896ccc77c-jv2tx
114m        Normal    ScalingReplicaSet        deployment/prometheus-operator                      Scaled up replica set prometheus-operator-7896ccc77c to 1

Version-Release number of selected component (if applicable):
upgrade from 4.5.5 to 4.6.0-0.nightly-2020-08-05-103641

How reproducible:
not sure

Steps to Reproduce:
1. See the description above.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 RamaKasturi 2020-08-06 15:53:40 UTC
Hit a similar issue while upgrading from 4.5.5-x86_64 -> 4.6.0-0.nightly-2020-08-05-103641 on matrix 23_IPI on OSP13 with FIPS on and OVN. Below are the error details:

[ramakasturinarra@dhcp35-60 ~]$ oc describe co/monitoring
Name:         monitoring
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-08-06T07:08:53Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-08-06T07:08:53Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         operator
    Operation:       Update
    Time:            2020-08-06T09:57:09Z
  Resource Version:  132621
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/monitoring
  UID:               22961c8a-e86f-4300-b128-79a4ad5554d3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-08-06T09:57:09Z
    Message:               Rollout of the monitoring stack is in progress. Please wait until it finishes.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2020-08-06T08:50:08Z
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-08-06T09:57:09Z
    Message:               Rolling out the stack.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-08-06T08:50:08Z
    Message:               Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator: expected 2 replicas, got 1 updated replicas
    Reason:                UpdatingPrometheusOperatorFailed
    Status:                True
    Type:                  Degraded
  Extension:               <nil>
  Related Objects:
    Group:     
    Name:      openshift-monitoring
    Resource:  namespaces
    Group:     monitoring.coreos.com
    Name:      
    Resource:  servicemonitors
    Group:     monitoring.coreos.com
    Name:      
    Resource:  prometheusrules
    Group:     monitoring.coreos.com
    Name:      
    Resource:  alertmanagers
    Group:     monitoring.coreos.com
    Name:      
    Resource:  prometheuses
  Versions:
    Name:     operator
    Version:  4.6.0-0.nightly-2020-08-05-103641
Events:       <none>
[ramakasturinarra@dhcp35-60 ~]$ oc describe co/monitoring
Name:         monitoring
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-08-06T07:08:53Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-08-06T07:08:53Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         operator
    Operation:       Update
    Time:            2020-08-06T15:32:16Z
  Resource Version:  481839
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/monitoring
  UID:               22961c8a-e86f-4300-b128-79a4ad5554d3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-08-06T15:32:16Z
    Message:               Rolling out the stack.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-08-06T08:50:08Z
    Message:               Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator: expected 2 replicas, got 1 updated replicas
    Reason:                UpdatingPrometheusOperatorFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-08-06T15:32:16Z
    Message:               Rollout of the monitoring stack is in progress. Please wait until it finishes.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2020-08-06T08:50:08Z
    Status:                False
    Type:                  Available
  Extension:               <nil>
  Related Objects:
    Group:     
    Name:      openshift-monitoring
    Resource:  namespaces
    Group:     monitoring.coreos.com
    Name:      
    Resource:  servicemonitors
    Group:     monitoring.coreos.com
    Name:      
    Resource:  prometheusrules
    Group:     monitoring.coreos.com
    Name:      
    Resource:  alertmanagers
    Group:     monitoring.coreos.com
    Name:      
    Resource:  prometheuses
  Versions:
    Name:     operator
    Version:  4.6.0-0.nightly-2020-08-05-103641
Events:       <none>

Comment 2 Maciej Szulik 2020-08-11 11:00:58 UTC
It would be good to get the kube-controller-manager (KCM) logs when this happens.
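
For reference, a sketch of how they could be collected (the pod name is a placeholder, and the container name kube-controller-manager is an assumption that may differ by release):
# oc -n openshift-kube-controller-manager get pods
# oc -n openshift-kube-controller-manager logs <kcm-pod-name> -c kube-controller-manager > kcm.log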

Comment 3 Tomáš Nožička 2020-08-21 11:49:42 UTC
KCM gets Unauthorized and doesn't react for some time until KAS lets it proceed. This is being investigated in bug 1868750.
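
A quick way to look for that symptom, assuming the same namespace and container name as in the log-gathering sketch above:
# oc -n openshift-kube-controller-manager logs <kcm-pod-name> -c kube-controller-manager | grep -i unauthorized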

*** This bug has been marked as a duplicate of bug 1868750 ***

