Bug 1769157
| Summary: | Concurrent builds on 4.3 are frequently evicted with node low on ephemeral storage. AWS instances' root filesystem is only 9.5G in 4.3 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | RHCOS | Assignee: | Colin Walters <walters> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | aos-bugs, bbreard, behoward, dustymabe, imcleod, jligon, miabbott, nstielau, walters, wzheng |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-23 11:11:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Mike Fiedler, 2019-11-06 01:35:08 UTC)
@Mike - can you confirm that the storage available on the node has been the same? Ephemeral-storage resource constraints are unrelated to RAM and vCPU.

Re: comment 2 - previous and current runs were done on m5.xlarge (AWS) and n1-standard-4 (GCP) instances created by the IPI installer. I will boot a 4.2 cluster and do a side-by-side run and comparison of instance disk space usage during the run today. Removing TestBlocker for now until the investigation is complete.

Not sure if this should go to RHCOS, MCO or the installer - please help redirect if needed. The problem is that on 4.3, the root filesystem (where emptyDir volumes are created) has only 9.5 GB available on AWS vs ~120 GB in 4.2. On GCP in 4.3 it is ~120 GB.

4.3 on AWS:

```
sh-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0   120G  0 disk
|-nvme0n1p1 259:1    0   384M  0 part /boot
|-nvme0n1p2 259:2    0   127M  0 part /boot/efi
|-nvme0n1p3 259:3    0     1M  0 part
`-nvme0n1p4 259:4    0 119.5G  0 part /sysroot
sh-4.4# df -h | egrep -v "tmpfs|overlay"
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4   16G  6.1G  9.5G  39% /
/dev/nvme0n1p1  364M  147M  194M  44% /boot
/dev/nvme0n1p2  127M  2.9M  124M   3% /boot/efi
```

4.2 on AWS:

```
sh-4.4# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0  120G  0 disk
|-nvme0n1p1 259:1    0    1M  0 part
|-nvme0n1p2 259:2    0    1G  0 part /boot
`-nvme0n1p3 259:3    0  119G  0 part /sysroot
sh-4.4# df -h | egrep -v "tmpfs|overlay"
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3  119G  5.4G  114G   5% /
/dev/nvme0n1p2  976M  144M  765M  16% /boot
```

4.3 on GCP:

```
sh-4.4# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   128G  0 disk
|-sda1   8:1    0   384M  0 part /boot
|-sda2   8:2    0   127M  0 part /boot/efi
|-sda3   8:3    0     1M  0 part
`-sda4   8:4    0 127.5G  0 part /sysroot
sh-4.4# df -h | egrep -v "tmpfs|overlay"
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda4   128G  7.3G  121G   6% /
/dev/sda1   364M  147M  194M  44% /boot
/dev/sda2   127M  2.9M  124M   3% /boot/efi
```

If someone has an affected cluster up, can you `oc debug node/`, then `chroot /host systemctl status coreos-growpart`?

```
sh-4.4# systemctl status coreos-growpart
● coreos-growpart.service - Resize root partition
   Loaded: loaded (/usr/lib/systemd/system/coreos-growpart.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2019-11-07 14:45:46 UTC; 3h 29min ago
  Process: 1772 ExecStart=/usr/libexec/coreos-growpart / (code=exited, status=1/FAILURE)
 Main PID: 1772 (code=exited, status=1/FAILURE)
      CPU: 44ms

Nov 07 14:45:45 localhost systemd[1]: Starting Resize root partition...
Nov 07 14:45:46 localhost coreos-growpart[1772]: NOCHANGE: partition 4 is size 250607583. it cannot be grown
Nov 07 14:45:46 localhost coreos-growpart[1772]: /usr/libexec/coreos-growpart: line 35: TYPE: unbound variable
Nov 07 14:45:46 localhost systemd[1]: coreos-growpart.service: Main process exited, code=exited, status=1/FAILURE
Nov 07 14:45:46 localhost systemd[1]: coreos-growpart.service: Failed with result 'exit-code'.
Nov 07 14:45:46 localhost systemd[1]: Failed to start Resize root partition.
Nov 07 14:45:46 localhost systemd[1]: coreos-growpart.service: Consumed 44ms CPU time
```
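For anyone hitting this before a fixed bootimage lands, here is a rough, unofficial sketch of a manual recovery on an affected node. It assumes the layout from the 4.3 AWS output above (partition 4 already extended, per the NOCHANGE message) and an XFS root filesystem as RHCOS uses; adjust device names for your node:

```
# From `oc debug node/<node>`, then `chroot /host`.

# Confirm the partition is already full size (119.5G) while the mounted
# filesystem is still small (16G):
lsblk /dev/nvme0n1
df -h /

# The partition needs no growing (growpart reported NOCHANGE), so only the
# filesystem does; grow the mounted XFS root to fill its partition:
xfs_growfs /
```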
> Nov 07 14:45:46 localhost coreos-growpart[1772]: /usr/libexec/coreos-growpart: line 35: TYPE: unbound variable
Right...I think this is a regression from the half-done LUKS work. We will fix this in an updated bootimage.
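To make that failure mode concrete, here is a minimal hypothetical reproduction (the device path is invented for illustration). It assumes, as the later comments suggest, that the unit script runs under `set -euo pipefail` and reads `$TYPE` from an `eval "$(blkid -o export ...)"` that produced no output:

```
#!/bin/bash
set -euo pipefail

# blkid prints nothing for a device it cannot find, so the eval below
# defines no variables at all.
eval "$(blkid -o export /dev/does-not-exist || true)"

# Under `set -u`, this reference aborts the script with
# "TYPE: unbound variable", matching the journal line quoted above.
echo "root filesystem type: ${TYPE}"
```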
This should have already been fixed by https://github.com/openshift/installer/pull/2609/commits/6f7da477e2f3392b3ee9f70df82f68b2db4dd1e2 - please update to a newer installer (hence a newer bootimage) and close this BZ if it works for you! (I just verified with cluster-bot `launch 4.3 aws` that things worked fine.)

Still broken. Nothing has changed in the environment. The rootfs is still the wrong size and the coreos-growpart unit is still failed. I will re-run on tonight's nightly build if that helps and reopen this if it fails.

Still an issue on 4.3.0-0.nightly-2019-11-08-080321. Maybe something is jammed in release/CD/ART?
Installer version: openshift-install-linux-4.3.0-0.nightly-2019-11-08-080321.tar.gz

```
$ ./openshift-install version
./openshift-install v4.3.0
built from commit 9b3ffb0c1f016f3e8874a5448456d9096e6483b4
release image registry.svc.ci.openshift.org/ocp/release@sha256:a5490311723af39a03cafc34e89e1b813d6c15dc9217b3d6c013021e475beef4
```

From a node:

```
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cc496fe238e18b8c40c11e43231a434e77ab526cd8b4014579fbe1c8333ff470
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.201911080455.0 (2019-11-08T05:00:19Z)
  ostree://f323fee503fd47afa77de27ebb3bb8509f1772518149a5c5c3b6b9c3c771fd4e
                   Version: 43.81.201911011153.0 (2019-11-01T11:58:16Z)
```

The AMI for the instance (us-east-2) is:

AMI ID: rhcos-43.81.201911011153.0-hvm (ami-03c40a2479a5e3593)

Let me know what other info I can get. As I mentioned, this is not an issue on GCP with the same build, so maybe it is something in the pipeline for AWS.
Hmm. Are you overriding the instance types? I wonder if we regressed something in growpart here on e.g. nvme devices.
Can you try using `oc debug node` again, then paste this modified growpart script into e.g. `/root/coreos-growpart`:
```
#!/bin/bash
set -euo pipefail
path=$1
shift
majmin=$(findmnt -nvr -o MAJ:MIN "$path")
# Detect if the rootfs is on a LUKS container and map
# it to the underlying partition. This assumes that the
# LUKS volume is housed in a partition.
src=$(findmnt -nvr -o SOURCE "$path")
is_luks=0
if [[ "${src}" =~ /dev/mapper ]]; then
    majmin=$(dmsetup table ${src} | cut -d " " -f7)
fi
devpath=$(realpath "/sys/dev/block/$majmin")
partition=$(cat "$devpath/partition")
parent_path=$(dirname "$devpath")
parent_device=/dev/$(basename "${parent_path}")
echo growpart "${parent_device}" "${partition}" || true
blkid -o export "${parent_device}${partition}"
```
Then `chmod a+x coreos-growpart` and run it: `./coreos-growpart /` (I changed it to just print things and not run commands.)
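One thing that snippet's output should surface on an AWS node is a naming pitfall with nvme devices. A hedged illustration only - the values are taken from the lsblk output earlier in this bug, and this is not confirmed here as the actual root cause:

```
# Hypothetical values matching the 4.3 AWS node above.
parent_device=/dev/nvme0n1
partition=4

# growpart receives the disk and the partition number as separate
# arguments, so this invocation is well-formed:
echo growpart "${parent_device}" "${partition}"   # growpart /dev/nvme0n1 4

# Plain concatenation, however, names a device node that does not exist on
# nvme, where a "p" separates the disk name from the partition number:
echo "${parent_device}${partition}"               # /dev/nvme0n14 (the real partition is /dev/nvme0n1p4)
```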
OK, yep, reproduced with https://github.com/coreos/coreos-assembler/pull/906 - working on a fix.

I think I know - the fix has been included in: https://gitlab.cee.redhat.com/coreos/redhat-coreos/blob/b43c82b7e4ae91d0f6e37caa30d8b6b9b6402dd7/overlay.d/05rhcos/usr/libexec/rhcos-growpart

I've dropped the `eval $(blkid...)` logic since it wasn't being used, and instead am using the `dmsetup` information to identify if a device is LUKS or not.

re: comment 18 - to confirm, I am overriding instance types.

Various PRs/MRs have been merged for over a week; the changes should be present in the latest 4.3 builds. Moving to MODIFIED.

@mfiedler you could probably retest with the latest 4.3 nightly and report how it goes.

@mfiedler Could you retest this and see if the latest builds fix what you saw?

Verified.

```
[root@ip-172-31-53-199 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-03-094421   True        False         56m     Cluster version is 4.3.0-0.nightly-2019-12-03-094421
[root@ip-172-31-53-199 ~]# oc debug node/ip-10-0-140-246.us-west-2.compute.internal
Starting pod/ip-10-0-140-246us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.140.246
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# df -h | egrep -v "tmpfs|overlay"
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/coreos-luks-root  120G  6.4G  114G   6% /
/dev/nvme0n1p1                364M  160M  182M  47% /boot
/dev/nvme0n1p2                127M  3.0M  124M   3% /boot/efi
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062
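Footnote for readers tracing the fix: a rough, unofficial sketch of the dmsetup-based LUKS detection described in the fix comment above. The actual implementation is in the linked rhcos-growpart commit; the field positions here are assumptions about `dmsetup table` output for a crypt target, consistent with the `cut -d " " -f7` in the diagnostic script earlier:

```
#!/bin/bash
set -euo pipefail

path=${1:-/}
src=$(findmnt -nvr -o SOURCE "$path")

if [[ "${src}" =~ ^/dev/mapper/ ]]; then
    table=$(dmsetup table "${src}")
    # For a crypt (LUKS) mapping, field 3 is the target type and field 7
    # is the MAJ:MIN of the underlying block device.
    if [[ "$(echo "${table}" | awk '{print $3}')" == "crypt" ]]; then
        majmin=$(echo "${table}" | awk '{print $7}')
        backing=$(basename "$(realpath "/sys/dev/block/${majmin}")")
        echo "root is a LUKS container backed by /dev/${backing}"
    fi
else
    echo "root is directly on ${src}"
fi
```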