Bug 1789014

Summary: e2e-metal: Installation fails to complete pending on DaemonSet "openshift-multus/multus"
Product: OpenShift Container Platform
Component: RHCOS
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.3.0
Reporter: Scott Dodson <sdodson>
Assignee: Jonathan Lebon <jlebon>
QA Contact: Michael Nguyen <mnguyen>
Docs Contact:
CC: aos-bugs, bbreard, dtrainor, dustymabe, imcleod, jlebon, jligon, jokerman, miabbott, mifiedle, nstielau, rphillips, scuppett, smilner, walters
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1791280
Environment:
Type: Bug
Last Closed: 2020-01-23 11:20:00 UTC
Bug Depends On: 1791280    
Bug Blocks:    

Description Scott Dodson 2020-01-08 15:17:50 UTC
Description of problem:
e2e-metal installation is failing pending on DaemonSet "openshift-multus/multus"


level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/multus\" is not available (awaiting 1 nodes)\nDaemonSet \"openshift-sdn/ovs\" is not available (awaiting 1 nodes)\nDaemonSet \"openshift-sdn/sdn\" is not available (awaiting 1 nodes)"
level=info msg="Cluster operator network Available is False with Startup: The network is starting up"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: monitoring, network"

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1102
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1101
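
For reference, a quick way to see which of those DaemonSets are stuck and on which nodes (a minimal sketch, assuming a kubeconfig pointing at the failed cluster):

$ oc -n openshift-multus get daemonset multus
$ oc -n openshift-sdn get daemonset ovs sdn
$ oc -n openshift-multus get pods -o wide | grep -v Running    # pods stuck in non-Running states
$ oc -n openshift-sdn get pods -o wide | grep -v Running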

Comment 1 Scott Dodson 2020-01-08 16:34:44 UTC
Courtesy of @dcbw

Jan 08 09:16:33 worker-2.ci-op-q79n9swn-4c0cd.origin-ci-int-aws.dev.rhcloud.com hyperkube[3850]: E0108 09:16:33.278924    3850 pod_workers.go:191] Error syncing pod 522e70a2-5f97-411b-b65e-f96c3ccacb01 ("ovs-zwgvs_openshift-sdn(522e70a2-5f97-411b-b65e-f96c3ccacb01)"), skipping: failed to "StartContainer" for "openvswitch" with ErrImagePull: "rpc error: code = Unknown desc = Error writing blob: error storing blob to file \"/var/tmp/storage096417626/1\": write /var/tmp/storage096417626/1: no space left on device"

Other installations have failed in a similar manner. All of these hosts should have 120 GiB of SSD storage, so we need to figure out why certain hosts are seemingly running out of space.

The artifacts from job 1099 (linked below) show the kubelet reporting disk pressure and kube-apiserver pods failing:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1099/artifacts/e2e-metal/nodes.json
{
            "apiVersion": "v1",
            "kind": "Node",
            "metadata": {
                "annotations": {
                    "volumes.kubernetes.io/controller-managed-attach-detach": "true"
                },
                "creationTimestamp": "2020-01-07T16:12:17Z",
                "labels": {
                    "beta.kubernetes.io/arch": "amd64",
                    "beta.kubernetes.io/os": "linux",
                    "kubernetes.io/arch": "amd64",
                    "kubernetes.io/hostname": "master-1.ci-op-qyr2s16v-4c0cd.origin-ci-int-aws.dev.rhcloud.com",
                    "kubernetes.io/os": "linux",
                    "node-role.kubernetes.io/master": "",
                    "node.openshift.io/os_id": "rhcos"
                },
                "name": "master-1.ci-op-qyr2s16v-4c0cd.origin-ci-int-aws.dev.rhcloud.com",
                "resourceVersion": "15185",
                "selfLink": "/api/v1/nodes/master-1.ci-op-qyr2s16v-4c0cd.origin-ci-int-aws.dev.rhcloud.com",
                "uid": "d08cd107-479a-414d-b2c2-45162b17486b"
            },
            "status": {
                "allocatable": {
                    "cpu": "7500m",
                    "ephemeral-storage": "2391363990",
                    "hugepages-1Gi": "0",
                    "hugepages-2Mi": "0",
                    "memory": "32280716Ki",
                    "pods": "250"
                },
                "capacity": {
                    "cpu": "8",
                    "ephemeral-storage": "2594796Ki",
                    "hugepages-1Gi": "0",
                    "hugepages-2Mi": "0",
                    "memory": "32895116Ki",
                    "pods": "250"
                },
                "conditions": [
                    {
                        "lastHeartbeatTime": "2020-01-07T16:41:38Z",
                        "lastTransitionTime": "2020-01-07T16:12:37Z",
                        "message": "kubelet has disk pressure",
                        "reason": "KubeletHasDiskPressure",
                        "status": "True",
                        "type": "DiskPressure"
                    },

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1099/artifacts/e2e-metal/pods.json

            "status": {
                "message": "The node was low on resource: ephemeral-storage. ",
                "phase": "Failed",
                "reason": "Evicted",
                "startTime": "2020-01-07T16:18:25Z"
            }
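
Worth noting from those artifacts: the node reports an ephemeral-storage capacity of only 2594796Ki (roughly 2.5 GiB) instead of anything near the expected 120 GiB, which matches the pre-grow XFS size (680192 4 KiB blocks) seen later in comment 19, i.e. a root filesystem that was never grown to fill the disk. A quick way to pull those numbers out of the CI artifact (a sketch, assuming nodes.json is the usual `oc get nodes -o json` List with an .items array):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1099/artifacts/e2e-metal/nodes.json \
    | jq -r '.items[] | "\(.metadata.name)  \(.status.capacity["ephemeral-storage"])"'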

Comment 2 Scott Dodson 2020-01-08 19:16:53 UTC
Moving to the Node component to track down what's consuming ephemeral storage and causing eviction of critical pods during the bootstrap process. I'd start with the logs from this job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1099
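
If one of these clusters can be caught while it's still up, a sketch of where I'd look for the space (paths are the usual kubelet / CRI-O locations on RHCOS, not confirmed from this particular job; substitute a node that is reporting DiskPressure):

$ oc debug node/<affected-node>
sh-4.2# chroot /host
sh-4.4# df -h /sysroot /var                                            # overall root filesystem usage
sh-4.4# du -xsh /var/lib/containers /var/log /var/lib/kubelet 2>/dev/null   # likely ephemeral-storage consumers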

Comment 3 Colin Walters 2020-01-08 20:44:55 UTC
growpart failed due to ENOSPC (ironically). I think it's possible we need to add more free space to our metal images. Regardless, RHCOS should chase this down.
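
For context, the root-grow step on these images roughly amounts to extending the root partition and then growing the XFS filesystem into it. An illustrative sketch only, not the actual rhcos-growpart script; the disk, partition number, and mount point below are made-up example values:

# illustrative only -- device, partition number, and mount point are example values
ROOT_DISK=/dev/sda        # disk that holds the root partition
ROOT_PARTNUM=4            # partition number of the root filesystem
growpart "${ROOT_DISK}" "${ROOT_PARTNUM}"   # extend the partition table entry to fill the disk
xfs_growfs /sysroot                         # grow the XFS filesystem to the new partition size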

Comment 15 Steve Milner 2020-01-14 19:02:17 UTC
https://github.com/openshift/installer/pull/2914 includes the fix in its referenced boot images.

Comment 17 Steve Milner 2020-01-15 15:04:30 UTC
Moving to modified per request.

Comment 19 Michael Nguyen 2020-01-16 19:32:23 UTC
Verified that growpart now runs before machine-config-daemon-firstboot.service; see the journal excerpt below and the ordering check sketched after the debug transcript.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-01-16-031402   True        False         76m     Cluster version is 4.3.0-0.nightly-2020-01-16-031402

$ oc debug node/ip-10-0-130-38.ec2.internal
Starting pod/ip-10-0-130-38ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c4059f184596157f64d69c4edbea9c9ef600560b7804a482779f513c3e0f40e
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.202001142154.0 (2020-01-14T21:59:51Z)

  ostree://23527ffc123c6e2bedf3479ff7e96f38d92cec88d5a7951fa56e9d0ec75ddd77
                   Version: 43.81.202001142154.0 (2020-01-14T21:59:51Z)
sh-4.4# journalctl | grep -10 rhcos-growpart
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Starting RPC Bind...
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started RPC Bind.
Jan 16 18:04:25 ip-10-0-130-38 rpc.statd[2357]: Version 2.3.3 starting
Jan 16 18:04:25 ip-10-0-130-38 rpc.statd[2357]: Flags: TI-RPC
Jan 16 18:04:25 ip-10-0-130-38 rpc.statd[2357]: Initializing NSM state
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Jan 16 18:04:25 ip-10-0-130-38 afterburn[2222]: Jan 16 18:04:25.299 INFO Fetching http://169.254.169.254/2009-04-04/meta-data/public-keys: Attempt #2
Jan 16 18:04:25 ip-10-0-130-38 afterburn[2222]: Jan 16 18:04:25.359 INFO Fetch failed with 404: resource not found
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started Afterburn (SSH Keys).
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: afterburn-sshkeys: Consumed 33ms CPU time
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: meta-data=/dev/mapper/coreos-luks-root-nocrypt isize=512    agcount=4, agsize=170048 blks
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                       sectsz=512   attr=2, projid32bit=1
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                       crc=1        finobt=1, sparse=1, rmapbt=0
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                       reflink=1
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: data     =                       bsize=4096   blocks=680192, imaxpct=25
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                       sunit=0      swidth=0 blks
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: log      =internal log           bsize=4096   blocks=2560, version=2
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]:          =                       sectsz=512   sunit=0 blks, lazy-count=1
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: realtime =none                   extsz=4096   blocks=0, rtextents=0
Jan 16 18:04:25 ip-10-0-130-38 rhcos-growpart[2318]: data blocks changed from 680192 to 31321851
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started Resize root partition.
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Starting Machine Config Daemon Firstboot...
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Starting Machine Config Daemon Initial...
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Starting OpenSSH server daemon...
Jan 16 18:04:25 ip-10-0-130-38 sshd[2370]: Server listening on 0.0.0.0 port 22.
Jan 16 18:04:25 ip-10-0-130-38 sshd[2370]: Server listening on :: port 22.
Jan 16 18:04:25 ip-10-0-130-38 systemd[1]: Started OpenSSH server daemon.
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
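
For the ordering check mentioned above, a couple of extra commands that could be run from the same debug shell (a sketch; it assumes the unit is named rhcos-growpart.service, matching the journal identifier in the excerpt):

sh-4.4# systemctl cat rhcos-growpart.service        # check the unit's Before=/After= ordering against machine-config-daemon-firstboot.service
sh-4.4# journalctl -u rhcos-growpart.service -u machine-config-daemon-firstboot.service -o short-monotonic | head -n 20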

Comment 21 errata-xmlrpc 2020-01-23 11:20:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062