Bug 1786315 - Upgrade cannot start because the version pod's ephemeral-storage request cannot be satisfied on the master node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.z
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Duplicates: 1786374
Depends On: 1787334 1787422
Blocks:
 
Reported: 2019-12-24 11:00 UTC by liujia
Modified: 2020-01-14 16:46 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1787334 1787422
Environment:
Last Closed: 2020-01-14 16:46:31 UTC
Target Upstream Version:
Embargoed:




Links
  Github openshift cluster-version-operator pull 288 (closed): Bug 1786315: pkg/cvo/updatepayload: Drop ephemeral-storage request (last updated 2021-01-10 23:29:13 UTC)
  Red Hat Product Errata RHBA-2020:0066 (last updated 2020-01-14 16:46:34 UTC)

Description liujia 2019-12-24 11:00:18 UTC
Description of problem:
Upgrade from v4.2 to v4.3 failed.
# oc adm upgrade
info: An upgrade is in progress. Working towards registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c: downloading update

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: currently installed version 4.2.0-0.nightly-2019-12-22-150714 not found in the "stable-4.2" channel

====================================================================
Checked that the version pod cannot run due to OutOfephemeral-storage.
# oc project openshift-cluster-version
# oc get pod
NAME                                           READY   STATUS                   RESTARTS   AGE
pod/cluster-version-operator-7447dc7fd-2thsb   1/1     Running                  2          20m
pod/version--vqztg-g68lx                       0/1     OutOfephemeral-storage   0          9s
pod/version--vqztg-ml7pg                       0/1     OutOfephemeral-storage   0          9s

# oc describe pod/version--vqztg-ml7pg
Name:           version--vqztg-ml7pg
Namespace:      openshift-cluster-version
Priority:       0
Node:           control-plane-0/
Start Time:     Tue, 24 Dec 2019 10:19:53 +0000
Labels:         controller-uid=ecc2f2e3-2636-11ea-b92f-0050568b75cf
                job-name=version--vqztg
Annotations:    <none>
Status:         Failed
Reason:         OutOfephemeral-storage
Message:        Pod Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0
IP:             
IPs:            <none>
Controlled By:  Job/version--vqztg
Containers:
  payload:
    Image:      registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      mkdir -p /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g && mv /manifests /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g/manifests && mkdir -p /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g && mv /release-manifests /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g/release-manifests
    Requests:
      cpu:                10m
      ephemeral-storage:  2Mi
      memory:             50Mi
    Environment:          <none>
    Mounts:
      /etc/cvo/updatepayloads from payloads (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-246zg (ro)
Volumes:
  payloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:  
  default-token-246zg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-246zg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age        From                      Message
  ----     ------                  ----       ----                      -------
  Warning  OutOfephemeral-storage  <invalid>  kubelet, control-plane-0  Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0

Checked that the master node does not report ephemeral-storage capacity.
# oc get node control-plane-0 -o json|jq .status.capacity
{
  "cpu": "4",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163844Ki",
  "pods": "250"
}
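
Note that .status.capacity has no ephemeral-storage entry at all; a resource the node does not report is effectively capacity 0, which matches the "requested: 2097152, used: 0, capacity: 0" message above. A minimal Go sketch of that fit check (illustrative only, not the kubelet's actual code):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Request from the version pod's payload container (2Mi = 2097152 bytes).
	requested := resource.MustParse("2Mi")

	// A node that does not report ephemeral-storage at all behaves as if its
	// capacity for that resource were zero (the zero Quantity).
	var capacity resource.Quantity

	// Illustrative fit check: the request can never fit, so the pod is
	// rejected with OutOfephemeral-storage.
	if requested.Cmp(capacity) > 0 {
		fmt.Println("OutOfephemeral-storage: requested", requested.String(), "but node reports no capacity")
	}
}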


Version-Release number of the following components:
4.2.0-0.nightly-2019-12-22-150714

How reproducible:
always

Steps to Reproduce:
1. Run upgrade from 4.2.0-0.nightly-2019-12-22-150714 to 4.3.0-0.nightly-2019-12-24-053745

Actual results:
Upgrade hangs while creating the version pod.

Expected results:
Upgrade succeeds.

Additional info:
Should be related to https://github.com/openshift/cluster-version-operator/pull/286
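
For reference, the requests carried by the payload container (see the describe output above) have roughly this shape; this is a hypothetical Go reconstruction for illustration, not the actual pkg/cvo/updatepayload source:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Hypothetical reconstruction of the payload container's requests as seen
	// in the describe output above; not the actual CVO code.
	requests := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("10m"),
		corev1.ResourceMemory: resource.MustParse("50Mi"),
		// The kubelet rejects this entry on masters that report no
		// ephemeral-storage capacity.
		corev1.ResourceEphemeralStorage: resource.MustParse("2Mi"),
	}

	q := requests[corev1.ResourceEphemeralStorage]
	fmt.Println("ephemeral-storage request:", q.String()) // 2Mi
}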

Comment 1 liujia 2019-12-24 11:07:09 UTC
This will block 4.2 upgrades, so adding the testblocker keyword.

Comment 2 liujia 2019-12-30 08:49:03 UTC
To be clear, any upgrade from a 4.2 build with PR 286 merged will hit the issue, for both v4.2 to v4.3 and v4.2 to the latest v4.2.

Comment 3 Wei Sun 2020-01-02 07:38:52 UTC
PR 286 was merged into 4.2.0-0.nightly-2019-12-20-184812; per comment #2, the issue will not happen when upgrading from 4.2.12 to the latest 4.2.z.

Per the test result, the upgrade from 4.2.12 to 4.2.0-0.nightly-2019-12-23-132554 works:

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      54m
cloud-credential                           4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
cluster-autoscaler                         4.2.0-0.nightly-2019-12-23-132554   True        False         False      67m
console                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      82s
dns                                        4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
image-registry                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      17m
ingress                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      61m
insights                                   4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
kube-apiserver                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      72m
kube-controller-manager                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      72m
kube-scheduler                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      71m
machine-api                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
machine-config                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      54s
marketplace                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      2m8s
monitoring                                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      3m58s
network                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
node-tuning                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      4m46s
openshift-apiserver                        4.2.0-0.nightly-2019-12-23-132554   True        False         False      81s
openshift-controller-manager               4.2.0-0.nightly-2019-12-23-132554   True        False         False      71m
openshift-samples                          4.2.0-0.nightly-2019-12-23-132554   True        False         False      23m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-12-23-132554   True        False         False      73s
service-ca                                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
service-catalog-apiserver                  4.2.0-0.nightly-2019-12-23-132554   True        False         False      70m
service-catalog-controller-manager         4.2.0-0.nightly-2019-12-23-132554   True        False         False      70m
storage                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      24m

Comment 4 W. Trevor King 2020-01-02 17:11:37 UTC
*** Bug 1786374 has been marked as a duplicate of this bug. ***

Comment 5 W. Trevor King 2020-01-02 18:34:28 UTC
I've filed [1] with a narrow ephemeral-storage revert for 4.2.z.  We're still trying to figure out if we need to do anything about master/4.3.

[1]: https://github.com/openshift/cluster-version-operator/pull/288
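
A minimal sketch of the shape of that revert, assuming the container requests are built as a corev1.ResourceList (illustrative only; see PR 288 for the actual change):

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Sketch of the post-revert requests: cpu and memory stay, the
	// ephemeral-storage entry is dropped so the kubelet no longer rejects the
	// version pod on nodes that report no ephemeral-storage capacity.
	// Illustrative only, not the actual PR 288 diff.
	_ = corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("10m"),
			corev1.ResourceMemory: resource.MustParse("50Mi"),
			// corev1.ResourceEphemeralStorage intentionally not requested.
		},
	}
}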

Comment 6 W. Trevor King 2020-01-02 19:02:28 UTC
[1] is a 4.2.0-0.nightly-2019-12-22-150714 -> 4.2.0-0.nightly-2019-12-23-132554 update failing with this failure mode.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/55

Comment 7 W. Trevor King 2020-01-02 20:50:34 UTC
I spun the lack of capacity reporting out into bug 1787427.

Comment 8 liujia 2020-01-03 09:15:51 UTC
Version: 4.2.0-0.nightly-2020-01-03-055246

Verified the upgrade from 4.2 to 4.3, since no newer 4.2 build was available as a target version.

Ran the upgrade from 4.2.0-0.nightly-2020-01-03-055246 to 4.3.0-0.nightly-2020-01-03-005054 and checked that the version pod did not request the ephemeral-storage resource and that the scheduled node still did not report that capacity.
# oc get pod/version--k28tz-cfw45 -o json -n openshift-cluster-version |jq .spec.containers[].resources
{
  "requests": {
    "cpu": "10m",
    "memory": "50Mi"
  }
}
# oc get node control-plane-0 -o json|jq .status.capacity
{
  "cpu": "4",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163860Ki",
  "pods": "250"
}

Comment 10 errata-xmlrpc 2020-01-14 16:46:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0066

