Description of problem:
In 4.2 and 4.1 we could easily run 40-50 concurrent builds on a compute node with 16GB RAM and 4 vCPUs. In 4.3 we are seeing a 50% failure rate for only 3 concurrent builds on a node of this size. The build status is Failed (BuildPodEvicted) and the namespace events show this pattern:

38m  Normal   Created  pod/cakephp-mysql-example-3-build  Created container sti-build
38m  Normal   Started  pod/cakephp-mysql-example-3-build  Started container sti-build
37m  Warning  Evicted  pod/cakephp-mysql-example-3-build  The node was low on resource: ephemeral-storage. Container sti-build was using 212952Ki, which exceeds its request of 0.
37m  Normal   Killing  pod/cakephp-mysql-example-3-build  Stopping container sti-build

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-02-092336

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster with 3 masters and 3 computes (AWS m5.xlarge or GCP n1-standard-4), i.e. 16GB memory and 4 vCPUs.
2. Create 9 projects from the cakephp-mysql template. Optionally (it's the way we run), set the replicas to 0 and remove the build triggers so that all builds and deployments are controlled manually.
3. Start all 9 builds simultaneously.

Actual results:
Across 10 iterations of starting all 9 builds simultaneously, 46 of the 90 builds failed due to pod eviction.

Expected results:
Success rates in line with 4.2.

Additional info:
Will add must-gather in a separate comment.
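The reproduction steps above can be sketched as a small shell loop. The project and BuildConfig names below are illustrative (assumed projects build1..build9, each containing the template's cakephp-mysql-example BuildConfig), and the starter command is injected as a parameter so the loop can be exercised without a cluster:

```shell
#!/bin/bash
# Sketch of step 3: kick off all nine builds in parallel and wait for
# the start commands to return. $1 is the command used to start a build,
# called as: <starter> <buildconfig> <namespace>.
start_all_builds() {
    local starter=$1
    for i in $(seq 1 9); do
        "$starter" cakephp-mysql-example "build$i" &
    done
    wait
}

# Real usage would be something like:
#   oc_starter() { oc start-build "$1" -n "$2"; }
#   start_all_builds oc_starter
```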
@Mike - can you confirm that the storage available on the node is the same? Ephemeral-storage resource constraints are unrelated to RAM and vCPU.
Re: comment 2 - previous and current runs were done on m5.xlarge (AWS) and n1-standard-4 (GCP) instances created by the IPI installer. I will boot a 4.2 cluster today and do a side-by-side comparison of instance disk space usage during the run. Removing TestBlocker for now until the investigation is complete.
Not sure if this should go to RHCOS, MCO or installer - please help redirect if needed.

The problem is that on AWS in 4.3, the root filesystem (where emptyDir volumes are created) is only 16 GB with 9.5 GB available, vs ~120 GB in 4.2. On GCP in 4.3 the root filesystem is ~120 GB as expected.

4.3 on AWS
==============
sh-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0   120G  0 disk
|-nvme0n1p1 259:1    0   384M  0 part /boot
|-nvme0n1p2 259:2    0   127M  0 part /boot/efi
|-nvme0n1p3 259:3    0     1M  0 part
`-nvme0n1p4 259:4    0 119.5G  0 part /sysroot
sh-4.4# df -h | egrep -v "tmpfs|overlay"
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4   16G  6.1G  9.5G  39% /
/dev/nvme0n1p1  364M  147M  194M  44% /boot
/dev/nvme0n1p2  127M  2.9M  124M   3% /boot/efi

4.2 on AWS
==============
sh-4.4# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0 120G  0 disk
|-nvme0n1p1 259:1    0   1M  0 part
|-nvme0n1p2 259:2    0   1G  0 part /boot
`-nvme0n1p3 259:3    0 119G  0 part /sysroot
sh-4.4# df -h | egrep -v "tmpfs|overlay"
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3  119G  5.4G  114G   5% /
/dev/nvme0n1p2  976M  144M  765M  16% /boot

4.3 on GCP
==============
sh-4.4# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   128G  0 disk
|-sda1   8:1    0   384M  0 part /boot
|-sda2   8:2    0   127M  0 part /boot/efi
|-sda3   8:3    0     1M  0 part
`-sda4   8:4    0 127.5G  0 part /sysroot
sh-4.4# df -h | egrep -v "tmpfs|overlay"
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4       128G  7.3G  121G   6% /
/dev/sda1       364M  147M  194M  44% /boot
/dev/sda2       127M  2.9M  124M   3% /boot/efi
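For anyone triaging similar reports, the symptom above (a ~119.5G partition backing a 16G filesystem) can be expressed as a simple size comparison. `needs_growfs` below is a hypothetical helper for illustration, not part of any shipped tooling; sizes would come from `lsblk -b` and `df -k`:

```shell
#!/bin/bash
# Hypothetical helper: given the backing-partition size and the mounted
# filesystem size (both in KiB), report whether the filesystem was never
# grown to fill the partition.
needs_growfs() {
    local part_kib=$1 fs_kib=$2
    # Allow ~5% slack for filesystem metadata overhead.
    [ "$fs_kib" -lt $(( part_kib * 95 / 100 )) ]
}

# The 4.3-on-AWS case above: ~119.5G partition, 16G filesystem.
if needs_growfs 125304832 16777216; then
    echo "filesystem smaller than partition: growpart/growfs did not run"
fi
```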
If someone has an affected cluster up, can you `oc debug node/` then `chroot /host systemctl status coreos-growpart` ?
sh-4.4# systemctl status coreos-growpart
● coreos-growpart.service - Resize root partition
   Loaded: loaded (/usr/lib/systemd/system/coreos-growpart.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2019-11-07 14:45:46 UTC; 3h 29min ago
  Process: 1772 ExecStart=/usr/libexec/coreos-growpart / (code=exited, status=1/FAILURE)
 Main PID: 1772 (code=exited, status=1/FAILURE)
      CPU: 44ms

Nov 07 14:45:45 localhost systemd[1]: Starting Resize root partition...
Nov 07 14:45:46 localhost coreos-growpart[1772]: NOCHANGE: partition 4 is size 250607583. it cannot be grown
Nov 07 14:45:46 localhost coreos-growpart[1772]: /usr/libexec/coreos-growpart: line 35: TYPE: unbound variable
Nov 07 14:45:46 localhost systemd[1]: coreos-growpart.service: Main process exited, code=exited, status=1/FAILURE
Nov 07 14:45:46 localhost systemd[1]: coreos-growpart.service: Failed with result 'exit-code'.
Nov 07 14:45:46 localhost systemd[1]: Failed to start Resize root partition.
Nov 07 14:45:46 localhost systemd[1]: coreos-growpart.service: Consumed 44ms CPU time
> Nov 07 14:45:46 localhost coreos-growpart[1772]: /usr/libexec/coreos-growpart: line 35: TYPE: unbound variable

Right...I think this is a regression from the half-done LUKS work. We will fix this in an updated bootimage.
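To make the failure mode concrete: the growpart script runs under `set -u` (nounset), so expanding a variable that an `eval "$(blkid -o export ...)"` never defined is fatal. A minimal sketch of that behavior (illustrative only, not the actual growpart code):

```shell
#!/bin/bash
# Minimal illustration of "TYPE: unbound variable": under nounset,
# expanding an unset variable aborts the (sub)shell.

# Returns 0 if "$TYPE" expands cleanly under `set -u`, non-zero otherwise.
type_expands_under_nounset() {
    ( set -u; : "$TYPE" ) 2>/dev/null
}

unset TYPE
if ! type_expands_under_nounset; then
    echo "TYPE: unbound variable -- the script would die here"
fi

TYPE=ext4   # e.g. what an `eval "$(blkid -o export ...)"` would have set
if type_expands_under_nounset; then
    echo "with TYPE set, the expansion succeeds"
fi
```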
This should have already been fixed by https://github.com/openshift/installer/pull/2609/commits/6f7da477e2f3392b3ee9f70df82f68b2db4dd1e2 Please update to a newer installer (hence newer bootimage) and close this BZ if it works for you! (I just verified with cluster-bot `launch 4.3 aws` things worked fine)
Still broken. Nothing has changed in the environment. The rootfs is still the wrong size and the coreos-growpart unit is still failed. I will re-run on tonight's nightly build if that helps and reopen this if it fails.
Still an issue on 4.3.0-0.nightly-2019-11-08-080321. Maybe something jammed in release/CD/ART?

Installer version: openshift-install-linux-4.3.0-0.nightly-2019-11-08-080321.tar.gz

$ ./openshift-install version
./openshift-install v4.3.0
built from commit 9b3ffb0c1f016f3e8874a5448456d9096e6483b4
release image registry.svc.ci.openshift.org/ocp/release@sha256:a5490311723af39a03cafc34e89e1b813d6c15dc9217b3d6c013021e475beef4

From a node:
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cc496fe238e18b8c40c11e43231a434e77ab526cd8b4014579fbe1c8333ff470
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.201911080455.0 (2019-11-08T05:00:19Z)

  ostree://f323fee503fd47afa77de27ebb3bb8509f1772518149a5c5c3b6b9c3c771fd4e
                   Version: 43.81.201911011153.0 (2019-11-01T11:58:16Z)

The AMI for the instance (us-east-2) is:
AMI ID rhcos-43.81.201911011153.0-hvm (ami-03c40a2479a5e3593)

Let me know what other info I can get. As I mentioned, this is not an issue for GCP with the same build, so maybe something in the pipeline for AWS.
Hmm. Are you overriding the instance types? I wonder if we regressed something in growpart here on e.g. nvme devices.

Can you try using `oc debug node` again, then paste this modified growpart script into e.g. `/root/coreos-growpart`:

```
#!/bin/bash
set -euo pipefail

path=$1
shift

majmin=$(findmnt -nvr -o MAJ:MIN "$path")

# Detect if the rootfs is on a LUKS container and map
# it to the underlying partition. This assumes that the
# LUKS volume is housed in a partition.
src=$(findmnt -nvr -o SOURCE "$path")
if [[ "${src}" =~ /dev/mapper ]]; then
    majmin=$(dmsetup table ${src} | cut -d " " -f7)
fi

devpath=$(realpath "/sys/dev/block/$majmin")
partition=$(cat "$devpath/partition")
parent_path=$(dirname "$devpath")
parent_device=/dev/$(basename "${parent_path}")

echo growpart "${parent_device}" "${partition}" || true
blkid -o export "${parent_device}${partition}"
```

Then `chmod a+x coreos-growpart` and run it:

./coreos-growpart /

(I changed it to just print things and not run commands)
OK yep, reproduced with https://github.com/coreos/coreos-assembler/pull/906; working on a fix.
https://github.com/coreos/fedora-coreos-config/pull/222
I think I know the cause, and the fix has been included in: https://gitlab.cee.redhat.com/coreos/redhat-coreos/blob/b43c82b7e4ae91d0f6e37caa30d8b6b9b6402dd7/overlay.d/05rhcos/usr/libexec/rhcos-growpart

I've dropped the `eval $(blkid ...)` logic since it wasn't being used, and instead am using the `dmsetup` information to identify whether a device is LUKS or not.
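As a sketch of that dmsetup approach: for a dm-crypt target, `dmsetup table` prints a line of the form `start length crypt cipher key iv_offset device offset`, so field 7 is the MAJ:MIN of the backing partition. The table line below is illustrative, not taken from an affected node; the helper name is invented:

```shell
#!/bin/bash
# Extract the backing device's MAJ:MIN from a dm-crypt `dmsetup table`
# line. This mirrors the `cut -d " " -f7` in the debug script posted
# earlier in this bug.
underlying_majmin() {
    echo "$1" | cut -d " " -f7
}

# Illustrative table line for a LUKS root mapped over partition 259:4:
line="0 250607583 crypt aes-xts-plain64 00000000000000000000000000000000 0 259:4 0"
underlying_majmin "$line"   # -> 259:4
```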
re: comment 18 - to confirm, I am overriding instance types.
https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/701
Various PRs/MRs have been merged for over a week; changes should be present in latest 4.3 builds. Moving to MODIFIED. @mfiedler you could probably retest with the latest 4.3 nightly and report how it goes.
@mfiedler Could you retest this and see if the latest builds fix what you saw?
Verified.

[root@ip-172-31-53-199 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-03-094421   True        False         56m     Cluster version is 4.3.0-0.nightly-2019-12-03-094421

[root@ip-172-31-53-199 ~]# oc debug node/ip-10-0-140-246.us-west-2.compute.internal
Starting pod/ip-10-0-140-246us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.140.246
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# df -h | egrep -v "tmpfs|overlay"
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/coreos-luks-root  120G  6.4G  114G   6% /
/dev/nvme0n1p1                364M  160M  182M  47% /boot
/dev/nvme0n1p2                127M  3.0M  124M   3% /boot/ef
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062