Description of problem:

e2e-metal installation is failing pending on DaemonSet "openshift-multus/multus":

level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/multus\" is not available (awaiting 1 nodes)\nDaemonSet \"openshift-sdn/ovs\" is not available (awaiting 1 nodes)\nDaemonSet \"openshift-sdn/sdn\" is not available (awaiting 1 nodes)"
level=info msg="Cluster operator network Available is False with Startup: The network is starting up"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: monitoring, network"

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1102
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1101
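For anyone triaging a run like this by hand, a minimal sketch of how to surface the stuck operators and the DaemonSets behind them (assuming you have a kubeconfig for the failed install; the namespaces and DaemonSet names come from the messages above):

# Show which cluster operators are still progressing or degraded
$ oc get clusteroperators

# Check rollout status of the DaemonSets the network operator is waiting on
$ oc -n openshift-multus get ds multus
$ oc -n openshift-sdn get ds ovs sdn

# Find the pods that are failing to become ready
$ oc -n openshift-sdn get pods -o wide | grep -v Running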
Courtesy of @dcbw:

Jan 08 09:16:33 worker-2.ci-op-q79n9swn-4c0cd.origin-ci-int-aws.dev.rhcloud.com hyperkube[3850]: E0108 09:16:33.278924 3850 pod_workers.go:191] Error syncing pod 522e70a2-5f97-411b-b65e-f96c3ccacb01 ("ovs-zwgvs_openshift-sdn(522e70a2-5f97-411b-b65e-f96c3ccacb01)"), skipping: failed to "StartContainer" for "openvswitch" with ErrImagePull: "rpc error: code = Unknown desc = Error writing blob: error storing blob to file \"/var/tmp/storage096417626/1\": write /var/tmp/storage096417626/1: no space left on device"

Other installations have failed in a similar manner. All of these hosts should have 120GiB of SSD storage; we need to figure out why certain hosts are seemingly running out of space.

The node artifacts show the kubelet reporting disk pressure and kube-apiserver pods failing:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1099/artifacts/e2e-metal/nodes.json

{
    "apiVersion": "v1",
    "kind": "Node",
    "metadata": {
        "annotations": {
            "volumes.kubernetes.io/controller-managed-attach-detach": "true"
        },
        "creationTimestamp": "2020-01-07T16:12:17Z",
        "labels": {
            "beta.kubernetes.io/arch": "amd64",
            "beta.kubernetes.io/os": "linux",
            "kubernetes.io/arch": "amd64",
            "kubernetes.io/hostname": "master-1.ci-op-qyr2s16v-4c0cd.origin-ci-int-aws.dev.rhcloud.com",
            "kubernetes.io/os": "linux",
            "node-role.kubernetes.io/master": "",
            "node.openshift.io/os_id": "rhcos"
        },
        "name": "master-1.ci-op-qyr2s16v-4c0cd.origin-ci-int-aws.dev.rhcloud.com",
        "resourceVersion": "15185",
        "selfLink": "/api/v1/nodes/master-1.ci-op-qyr2s16v-4c0cd.origin-ci-int-aws.dev.rhcloud.com",
        "uid": "d08cd107-479a-414d-b2c2-45162b17486b"
    },
    "status": {
        "allocatable": {
            "cpu": "7500m",
            "ephemeral-storage": "2391363990",
            "hugepages-1Gi": "0",
            "hugepages-2Mi": "0",
            "memory": "32280716Ki",
            "pods": "250"
        },
        "capacity": {
            "cpu": "8",
            "ephemeral-storage": "2594796Ki",
            "hugepages-1Gi": "0",
            "hugepages-2Mi": "0",
            "memory": "32895116Ki",
            "pods": "250"
        },
        "conditions": [
            {
                "lastHeartbeatTime": "2020-01-07T16:41:38Z",
                "lastTransitionTime": "2020-01-07T16:12:37Z",
                "message": "kubelet has disk pressure",
                "reason": "KubeletHasDiskPressure",
                "status": "True",
                "type": "DiskPressure"
            },

Note the capacity here: an ephemeral-storage value of 2594796Ki is only ~2.5GiB, nowhere near the expected 120GiB.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1099/artifacts/e2e-metal/pods.json

    "status": {
        "message": "The node was low on resource: ephemeral-storage. ",
        "phase": "Failed",
        "reason": "Evicted",
        "startTime": "2020-01-07T16:18:25Z"
    }
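A couple of hedged one-liners for pulling the same signal from a live cluster rather than the CI artifacts (assumes oc and jq are available; the node name is a placeholder):

# Compare each node's reported ephemeral-storage capacity against the expected ~120GiB disk
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.ephemeral-storage}{"\n"}{end}'

# List nodes currently under disk pressure
$ oc get nodes -o json | jq -r '.items[] | select(any(.status.conditions[]; .type=="DiskPressure" and .status=="True")) | .metadata.name'

# On a suspect node, see what the root filesystem actually looks like
$ oc debug node/<node-name> -- chroot /host df -h /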
Moving to the Node component to track down what's consuming ephemeral storage and causing eviction of critical pods during the bootstrap process. I'd start with the logs from this job:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1099
growpart failed due to ENOSPC (ironically), so the root filesystem was never resized and the node was left with only the initial ~2.5GiB of ephemeral storage seen above. I think it's possible we need to add more free space to our metal images. Regardless, RHCOS should chase this down.
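For context, the resize that failed here is conceptually a two-step grow: first the partition, then the filesystem on top of it. The sketch below is illustrative only, with made-up device names; it is not the actual rhcos-growpart script:

# Grow partition 4 of the boot disk to fill the remaining space
# (the resize tooling needs some free space of its own to run, hence the ironic ENOSPC above)
$ growpart /dev/sda 4

# Then grow the XFS filesystem mounted at the root to use the new space
$ xfs_growfs /sysroot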
https://github.com/openshift/installer/pull/2914 includes the fix in its referenced boot images.
Moving to MODIFIED per request.
Verified growpart runs earlier than machine-config-daemon-firstboot.service now.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-01-16-031402   True        False         76m     Cluster version is 4.3.0-0.nightly-2020-01-16-031402

$ oc debug node/ip-10-0-130-38.ec2.internal
Starting pod/ip-10-0-130-38ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c4059f184596157f64d69c4edbea9c9ef600560b7804a482779f513c3e0f40e
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.202001142154.0 (2020-01-14T21:59:51Z)

  ostree://23527ffc123c6e2bedf3479ff7e96f38d92cec88d5a7951fa56e9d0ec75ddd77
                   Version: 43.81.202001142154.0 (2020-01-14T21:59:51Z)

sh-4.4# journalctl | grep -10 rhcos-growpart
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Starting RPC Bind...
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started RPC Bind.
Jan 16 18:04:25 ip-10-0-130-38 rpc.statd[2357]: Version 2.3.3 starting
Jan 16 18:04:25 ip-10-0-130-38 rpc.statd[2357]: Flags: TI-RPC
Jan 16 18:04:25 ip-10-0-130-38 rpc.statd[2357]: Initializing NSM state
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Jan 16 18:04:25 ip-10-0-130-38 afterburn[2222]: Jan 16 18:04:25.299 INFO Fetching http://169.254.169.254/2009-04-04/meta-data/public-keys: Attempt #2
Jan 16 18:04:25 ip-10-0-130-38 afterburn[2222]: Jan 16 18:04:25.359 INFO Fetch failed with 404: resource not found
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started Afterburn (SSH Keys).
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: afterburn-sshkeys: Consumed 33ms CPU time
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: meta-data=/dev/mapper/coreos-luks-root-nocrypt isize=512 agcount=4, agsize=170048 blks
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                  sectsz=512   attr=2, projid32bit=1
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                  crc=1        finobt=1, sparse=1, rmapbt=0
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                  reflink=1
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: data     =                  bsize=4096   blocks=680192, imaxpct=25
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                  sunit=0      swidth=0 blks
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: naming   =version 2         bsize=4096   ascii-ci=0, ftype=1
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: log      =internal log      bsize=4096   blocks=2560, version=2
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                  sectsz=512   sunit=0 blks, lazy-count=1
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: realtime =none              extsz=4096   blocks=0, rtextents=0
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: data blocks changed from 680192 to 31321851
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started Resize root partition.
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Starting Machine Config Daemon Firstboot...
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Starting Machine Config Daemon Initial...
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Starting OpenSSH server daemon...
Jan 16 18:04:25 ip-10-0-130-38 sshd[2370]: Server listening on 0.0.0.0 port 22.
Jan 16 18:04:25 ip-10-0-130-38 sshd[2370]: Server listening on :: port 22.
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started OpenSSH server daemon.
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
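For anyone re-verifying, the ordering can also be checked directly rather than eyeballing the journal. Treat this as a sketch: it assumes the resize unit is named rhcos-growpart.service, matching the journal tag above.

# Show the ordering dependencies declared on the firstboot unit
$ systemctl show -p After machine-config-daemon-firstboot.service | tr ' ' '\n' | grep growpart

# Or interleave both units' journal entries and compare timestamps
$ journalctl -u rhcos-growpart.service -u machine-config-daemon-firstboot.service --no-pager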
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062