Description of problem:
Deploying a cluster using https://github.com/openshift-kni/install-scripts (which, incidentally, uses the internal ocp/release:4.2 image, see https://github.com/openshift-kni/install-scripts/blob/master/OpenShift/03_create_cluster.sh#L7), the etcd cluster does not seem to form properly: one of the etcd-member pods is in CrashLoopBackOff state. The installer finishes successfully, though.

Version-Release number of selected component (if applicable):
$ oc version
Client Version: openshift-clients-4.2.0-201909081401
Server Version: 4.2.0-0.nightly-2019-09-09-073137
Kubernetes Version: v1.14.6+9d7f0a8

How reproducible:
See https://github.com/openshift-kni/install-scripts/issues/100 (the title seems to be misleading)

Steps to Reproduce:
1. Deploy an OCP4 cluster with install-scripts or dev-scripts (https://github.com/openshift-metal3/dev-scripts)
2. Verify the etcd-member pods are ok
3.

Actual results:
One of the three pods is not:

$ oc get pods -n openshift-etcd
NAME                                                 READY   STATUS             RESTARTS   AGE
etcd-member-kni1-master-0.env.mydomain.example.com   2/2     Running            0          87m
etcd-member-kni1-master-1.env.mydomain.example.com   1/2     CrashLoopBackOff   17         87m
etcd-member-kni1-master-2.env.mydomain.example.com   2/2     Running            0          87m

$ oc logs etcd-member-kni1-master-1.env.mydomain.example.com -n openshift-etcd
Error from server (BadRequest): a container name must be specified for pod etcd-member-kni1-master-1.env.mydomain.example.com, choose one of: [etcd-member etcd-metrics] or one of the init containers: [discovery certs]

$ oc logs etcd-member-kni1-master-1.env.mydomain.example.com -c etcd-member -n openshift-etcd
/bin/sh: line 3: /run/etcd/environment: Permission denied

$ for node in $(oc get nodes -o jsonpath="{.items[*].metadata.name}"); do ssh core@${node} sudo cat /run/etcd/environment; ssh core@${node} sudo ls -lZ /run/etcd/environment; done
export ETCD_DISCOVERY_SRV=kni1.env.mydomain.example.com ETCD_WILDCARD_DNS_NAME=*.kni1.env.mydomain.example.com ETCD_IPV4_ADDRESS=10.19.138.11 ETCD_DNS_NAME=etcd-0.kni1.env.mydomain.example.com
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 205 Sep 9 11:29 /run/etcd/environment
export ETCD_DISCOVERY_SRV=kni1.env.mydomain.example.com ETCD_IPV4_ADDRESS=10.19.138.12 ETCD_DNS_NAME=etcd-1.kni1.env.mydomain.example.com ETCD_WILDCARD_DNS_NAME=*.kni1.env.mydomain.example.com
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 205 Sep 9 11:29 /run/etcd/environment
export ETCD_DISCOVERY_SRV=kni1.env.mydomain.example.com ETCD_WILDCARD_DNS_NAME=*.kni1.env.mydomain.example.com ETCD_IPV4_ADDRESS=10.19.138.13 ETCD_DNS_NAME=etcd-2.kni1.env.mydomain.example.com
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 205 Sep 9 11:28 /run/etcd/environment

Expected results:
3/3 pods up

Additional info:
I've just moved the etcd-member static pod definition out of the manifests directory on the affected host and back again (to simulate an oc delete, but for a static pod) and it seems to fix it...

$ ssh core.mydomain.example.com sudo mv /etc/kubernetes/manifests/etcd-member.yaml /root/
$ ssh core.mydomain.example.com sudo mv /root/etcd-member.yaml /etc/kubernetes/manifests/etcd-member.yaml
$ oc get pods
NAME                                                 READY   STATUS    RESTARTS   AGE
etcd-member-kni1-master-0.env.mydomain.example.com   2/2     Running   2          129m
etcd-member-kni1-master-1.env.mydomain.example.com   2/2     Running   28         23m
etcd-member-kni1-master-2.env.mydomain.example.com   2/2     Running   0          129m
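
For anyone who needs to repeat the workaround, the two mv commands can be wrapped up roughly as follows. This is only a sketch of what was done by hand in this report, not a supported procedure: bounce_manifest is a hypothetical helper name, and the 30 second default pause is an assumption meant to give the kubelet time to notice the manifest is gone before it is put back.

```shell
# bounce_manifest MANIFEST HOLD_DIR [WAIT_SECONDS]
# Move a static pod manifest out of the way, pause, then move it back.
# Hypothetical helper; the default pause of 30 seconds is an assumption.
bounce_manifest() {
  manifest=$1
  hold_dir=$2
  wait_seconds=${3:-30}

  # Refuse to run if either path is missing, so nothing is half-moved.
  [ -f "$manifest" ] || { echo "no such manifest: $manifest" >&2; return 1; }
  [ -d "$hold_dir" ] || { echo "no such directory: $hold_dir" >&2; return 1; }

  parked="$hold_dir/$(basename "$manifest")"

  # Step 1: park the manifest outside the manifests directory.
  mv "$manifest" "$parked" || return 1

  # Step 2: wait before restoring it.
  sleep "$wait_seconds"

  # Step 3: put the manifest back at its original path.
  mv "$parked" "$manifest"
}
```

Run as root on the affected master, the equivalent of the commands above would be: bounce_manifest /etc/kubernetes/manifests/etcd-member.yaml /root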
The discovery init container seems to be the one that should write that file, which is then used by the etcd-metrics one... Maybe it is a caching issue, where the discovery container writes the file and etcd-metrics tries to read it before it is available, or finds it with bad permissions on disk?
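
The on-disk permissions part of this guess can be checked against the ls -lZ output already in the report. A minimal sketch, assuming each input line has the same field layout as that output (mode, link count, owner, group, SELinux context); perm_fingerprints is a hypothetical helper name.

```shell
# perm_fingerprints: read `ls -lZ` lines on stdin and print each distinct
# combination of mode, owner, group and SELinux context exactly once.
# Assumes the field order seen in this report: mode, links, owner, group, context.
perm_fingerprints() {
  awk '{ print $1, $3, $4, $5 }' | sort -u
}
```

Fed the three ls -lZ lines from the report, this prints a single line (-rw-r--r--. root root system_u:object_r:container_var_run_t:s0), so as seen from the host the mode, ownership and label are the same on the failing master as on the healthy ones. That does not say anything about how the file looks from inside the container.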
Hit this during a forced upgrade from 4.2.0-0.nightly-2019-09-08-232045 to 4.2.0-0.nightly-2019-09-09-150607.

$ oc get pods -n openshift-etcd -o wide
NAME                                     READY   STATUS             RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
etcd-member-share-0909d-4lxj2-master-0   2/2     Running            0          20h   192.168.0.14   share-0909d-4lxj2-master-0   <none>           <none>
etcd-member-share-0909d-4lxj2-master-1   2/2     Running            2          68m   192.168.0.28   share-0909d-4lxj2-master-1   <none>           <none>
etcd-member-share-0909d-4lxj2-master-2   1/2     CrashLoopBackOff   6          33m   192.168.0.41   share-0909d-4lxj2-master-2   <none>           <none>

$ oc -n openshift-etcd logs etcd-member-share-0909d-4lxj2-master-2 -c etcd-member
/bin/sh: line 3: /run/etcd/environment: Permission denied

$ oc debug nodes/share-0909d-4lxj2-master-2
Starting pod/share-0909d-4lxj2-master-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.0.41
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# bash
[root@share-0909d-4lxj2-master-2 /]# ls -al /run/etcd/environment -alZ
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 184 Sep 9 09:39 /run/etcd/environment
[root@share-0909d-4lxj2-master-2 /]# cat /run/etcd/environment
export ETCD_DISCOVERY_SRV=share-0909d.qe.rhcloud.com ETCD_DNS_NAME=etcd-2.share-0909d.qe.rhcloud.com ETCD_WILDCARD_DNS_NAME=*.share-0909d.qe.rhcloud.com ETCD_IPV4_ADDRESS=192.168.0.41
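
Since the installer and the upgrade both report success while a member is down, it may help to check the READY column explicitly after each run. A minimal sketch, assuming the default column layout of oc get pods (NAME first, READY second, one header line); not_ready_pods is a hypothetical helper name.

```shell
# not_ready_pods: read `oc get pods` output on stdin and print the name of
# every pod whose READY column (for example 1/2) has fewer ready containers
# than total containers. The first line is treated as the header and skipped.
not_ready_pods() {
  awk 'NR > 1 { split($2, ready, "/"); if (ready[1] != ready[2]) print $1 }'
}
```

Usage: oc get pods -n openshift-etcd | not_ready_pods. It also works with -o wide, since the extra columns come after READY.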
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922