1787334 – Upgrade can not start due to version pod fail to request expected ephemeral-storage on master node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.3.0
Assignee:	W. Trevor King
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1787422
Depends On:	1787424
Blocks:	1786315

Reported:	2020-01-02 12:46 UTC by Wei Sun
Modified:	2020-01-23 11:20 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1786315
Environment:
Last Closed:	2020-01-23 11:19:28 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 290	0	None	closed	Bug 1787334: pkg/cvo/updatepayload: Drop ephemeral-storage request	2020-10-12 17:28:24 UTC
Red Hat Product Errata	RHBA-2020:0062	0	None	None	None	2020-01-23 11:20:00 UTC

Description Wei Sun 2020-01-02 12:46:51 UTC

+++ This bug was initially created as a clone of Bug #1786315 +++

Description of problem:
Upgrade from v4.2 to v4.3 failed.
# oc adm upgrade
info: An upgrade is in progress. Working towards registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c: downloading update

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: currently installed version 4.2.0-0.nightly-2019-12-22-150714 not found in the "stable-4.2" channel

====================================================================
The version pod cannot run; it fails with OutOfephemeral-storage.
# oc project openshift-cluster-version
# oc get pod
NAME                                           READY   STATUS                   RESTARTS   AGE
pod/cluster-version-operator-7447dc7fd-2thsb   1/1     Running                  2          20m
pod/version--vqztg-g68lx                       0/1     OutOfephemeral-storage   0          9s
pod/version--vqztg-ml7pg                       0/1     OutOfephemeral-storage   0          9s

# oc describe pod/version--vqztg-ml7pg
Name:           version--vqztg-ml7pg
Namespace:      openshift-cluster-version
Priority:       0
Node:           control-plane-0/
Start Time:     Tue, 24 Dec 2019 10:19:53 +0000
Labels:         controller-uid=ecc2f2e3-2636-11ea-b92f-0050568b75cf
                job-name=version--vqztg
Annotations:    <none>
Status:         Failed
Reason:         OutOfephemeral-storage
Message:        Pod Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0
IP:             
IPs:            <none>
Controlled By:  Job/version--vqztg
Containers:
  payload:
    Image:      registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      mkdir -p /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g && mv /manifests /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g/manifests && mkdir -p /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g && mv /release-manifests /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g/release-manifests
    Requests:
      cpu:                10m
      ephemeral-storage:  2Mi
      memory:             50Mi
    Environment:          <none>
    Mounts:
      /etc/cvo/updatepayloads from payloads (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-246zg (ro)
Volumes:
  payloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:  
  default-token-246zg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-246zg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age        From                      Message
  ----     ------                  ----       ----                      -------
  Warning  OutOfephemeral-storage  <invalid>  kubelet, control-plane-0  Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0

The master node does not report any ephemeral-storage capacity.
# oc get node control-plane-0 -o json|jq .status.capacity
{
  "cpu": "4",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163844Ki",
  "pods": "250"
}
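
The rejection above can be reproduced arithmetically. The following is a minimal sketch, assuming a simplified model in which a pod is rejected whenever a request exceeds capacity minus usage, and a resource missing from the node's capacity counts as 0. It is not the kubelet's actual code; `parse_bytes` and `admission_failures` are hypothetical names, and CPU is left out because its quantities use a different unit syntax.

```python
# Simplified model of the check that rejected the version pod.
# Assumption: a resource absent from node capacity is treated as 0.
# Function names are hypothetical, not taken from the kubelet.

BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_bytes(quantity):
    """Convert a byte quantity such as '2Mi' or '8163844Ki' to an integer."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count, no suffix


def admission_failures(requests, capacity, used=None):
    """Return one message per requested resource the node cannot satisfy."""
    used = used or {}
    failures = []
    for name, quantity in requests.items():
        requested = parse_bytes(quantity)
        cap = parse_bytes(capacity.get(name, "0"))
        in_use = parse_bytes(used.get(name, "0"))
        if requested > cap - in_use:
            failures.append(
                f"Node didn't have enough resource: {name}, "
                f"requested: {requested}, used: {in_use}, capacity: {cap}"
            )
    return failures


# Values taken from the report: control-plane-0 lists no ephemeral-storage.
node_capacity = {"memory": "8163844Ki"}
pod_requests = {"ephemeral-storage": "2Mi", "memory": "50Mi"}

failures = admission_failures(pod_requests, node_capacity)
for message in failures:
    print(message)
```

With the values from this report, the only failure is the 2Mi (2097152 bytes) ephemeral-storage request against a capacity of 0; the memory request fits.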


Version-Release number of the following components:
4.2.0-0.nightly-2019-12-22-150714

How reproducible:
always

Steps to Reproduce:
1. Run upgrade from 4.2.0-0.nightly-2019-12-22-150714 to 4.3.0-0.nightly-2019-12-24-053745

Actual results:
Upgrade hangs while creating the version pod.

Expected results:
Upgrade succeeds.

Additional info:
Likely related to https://github.com/openshift/cluster-version-operator/pull/286

--- Additional comment from liujia on 2019-12-24 11:07:09 UTC ---

This will block 4.2 upgrades, so adding the testblocker keyword.

--- Additional comment from liujia on 2019-12-30 08:49:03 UTC ---

To be clearer: any upgrade from a 4.2 build with PR 286 merged will hit the issue, for both v4.2 to v4.3 and v4.2 to the latest v4.2.

--- Additional comment from Wei Sun on 2020-01-02 07:38:52 UTC ---

PR 286 was merged in 4.2.0-0.nightly-2019-12-20-184812. Per comment #2, the issue will not happen when upgrading from 4.2.12 to the latest 4.2.z.

Test result: the upgrade from 4.2.12 to 4.2.0-0.nightly-2019-12-23-132554 works.

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      54m
cloud-credential                           4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
cluster-autoscaler                         4.2.0-0.nightly-2019-12-23-132554   True        False         False      67m
console                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      82s
dns                                        4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
image-registry                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      17m
ingress                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      61m
insights                                   4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
kube-apiserver                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      72m
kube-controller-manager                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      72m
kube-scheduler                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      71m
machine-api                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
machine-config                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      54s
marketplace                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      2m8s
monitoring                                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      3m58s
network                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
node-tuning                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      4m46s
openshift-apiserver                        4.2.0-0.nightly-2019-12-23-132554   True        False         False      81s
openshift-controller-manager               4.2.0-0.nightly-2019-12-23-132554   True        False         False      71m
openshift-samples                          4.2.0-0.nightly-2019-12-23-132554   True        False         False      23m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-12-23-132554   True        False         False      73s
service-ca                                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
service-catalog-apiserver                  4.2.0-0.nightly-2019-12-23-132554   True        False         False      70m
service-catalog-controller-manager         4.2.0-0.nightly-2019-12-23-132554   True        False         False      70m
storage                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      24m

Comment 1 Wei Sun 2020-01-02 12:49:12 UTC

Per https://bugzilla.redhat.com/show_bug.cgi?id=1786315#c2, cloning this bug for 4.3.0.

Comment 2 W. Trevor King 2020-01-02 19:49:03 UTC

*** Bug 1787422 has been marked as a duplicate of this bug. ***

Comment 4 liujia 2020-01-03 02:38:29 UTC

The original issue from bz1786315 will not happen during an upgrade from v4.3. But the fix/workaround in bz1786315 may cause an inconsistency between v4.2 and v4.3, which could cause further issues during a continuous upgrade along the 4.2 -> 4.3 -> latest 4.3 path. So this bug is for enhancement and consistency in v4.3. QE will run regression tests against the PR and check that the v4.3 CVO makes no ephemeral-storage request.

Comment 5 W. Trevor King 2020-01-03 04:14:52 UTC

You could also validate it against the interrupted-update flow from [1]; you'd just need to trigger the second 4.3->4.3 update (step 6) before the first 4.2->4.3 update (step 2) got far enough to bump control-plane kubelets.  But I'm fine with more basic regression testing too ;).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1787422#c0

Comment 6 liujia 2020-01-03 07:41:11 UTC

Upgrade from 4.3.0-0.nightly-2020-01-02-214950 to 4.3.0-0.nightly-2020-01-03-005054 succeeded.

The version pod did not request ephemeral-storage, even though the scheduled node had the capacity.
# oc get pod/version--hwwg5-wgpbq -o json -n openshift-cluster-version |jq .spec.containers[].resources
{
  "requests": {
    "cpu": "10m",
    "memory": "50Mi"
  }
}
# oc get node control-plane-1 -o json|jq .status.capacity
{
  "cpu": "4",
  "ephemeral-storage": "30905324Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163844Ki",
  "pods": "250"
}
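
The check above could be automated for regression runs. This is a minimal sketch, assuming the pod JSON has already been saved from `oc get pod ... -o json`; `containers_requesting` is a hypothetical helper. The inline samples are trimmed to the fields shown in this report, and the container name `payload` on the fixed pod is an assumption carried over from the 4.2 `oc describe` output.

```python
import json


def containers_requesting(pod, resource):
    """Return the names of containers whose requests include `resource`."""
    names = []
    for container in pod.get("spec", {}).get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        if resource in requests:
            names.append(container.get("name", "<unnamed>"))
    return names


# Trimmed sample matching the fixed 4.3 version pod from this comment.
fixed_pod = json.loads("""
{"spec": {"containers": [
  {"name": "payload",
   "resources": {"requests": {"cpu": "10m", "memory": "50Mi"}}}
]}}
""")

# Trimmed sample matching the failing 4.2 version pod from the description.
broken_pod = json.loads("""
{"spec": {"containers": [
  {"name": "payload",
   "resources": {"requests": {"cpu": "10m",
                              "ephemeral-storage": "2Mi",
                              "memory": "50Mi"}}}
]}}
""")

fixed_result = containers_requesting(fixed_pod, "ephemeral-storage")
broken_result = containers_requesting(broken_pod, "ephemeral-storage")
print(fixed_result, broken_result)
```

An empty list for the fixed pod corresponds to the result QE verified here; a non-empty list would flag a regression.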

Comment 8 errata-xmlrpc 2020-01-23 11:19:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062
