Created attachment 1511758 [details] The inventory file used.

Description of problem: When running the deploy_cluster.yml playbook during an OpenShift install, the installer fails with an error.

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible: 100% reproducible

Steps to Reproduce:
1. Create two RHEL 7.4 VMs, master and node1
2. Set up a named instance on a node outside the OpenShift cluster nodes and verify that DNS functions correctly
3. Follow the Install OpenShift Container Platform instructions to the letter
4. Run the prerequisites.yml playbook (completes successfully)
5. Run the deploy_cluster.yml playbook (fails)

Actual results: The deploy_cluster.yml playbook fails to complete at the point where the control plane pods are supposed to appear, with the error message:

[master.openshift.example.com] (item=api) => {"attempts": 60, "changed": false, "failed": true, "item": "api", "msg": {"cmd": "/usr/bin/oc get pod master-api-master -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server master.openshift.example.com:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}

This is a valid node in my configuration. I'm somewhat surprised that the installation script is asking me whether the node is correct when it's the installation script that would have started this service.

Expected results: The cluster should be correctly deployed.

Additional info: Please attach logs from ansible-playbook with the -vvv flag
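For reference, the playbook invocations behind steps 4 and 5 above (a sketch; the inventory path is assumed to be the default /etc/ansible/hosts, as used in commands later in this report):

    ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml
    ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml -vvv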
Created attachment 1511760 [details] Log from the error
I followed the instructions in https://docs.openshift.com/container-platform/3.10/getting_started/install_openshift.html to the letter. The problem with this documentation is that it provides explicit instructions for setting up a *simple* 2 node OpenShift cluster, right down to the commands to type at each point, except for one, glaring omission: it does not provide the playbook inventory file. Rather, it tells you to look at some examples, all of which involve 4 or more nodes. This in itself is a documentation bug. And it is probably related to the error I am seeing. There are too many configuration options in OpenShift for this to be acceptable: you must provide a concrete 2 node playbook inventory which is guaranteed to work. Otherwise, this product does not work out of the box and customers get frustrated wasting days trying to find a configuration that works.
Created attachment 1511773 [details] Log from Success (3.11)
I re-ran the whole exercise using 3.11 instead of 3.10 and the results were different. This time, the following happened:
- the prerequisites.yml playbook completed
- the deploy_cluster.yml playbook hung when trying to install Docker and had to be Ctrl-C'd and restarted (I thought prerequisites.yml installed all the prerequisite software ...)
- the deploy_cluster.yml playbook completed (see the log Log From Success)
- however, when I tried to log in to the cluster, the login failed:

[root@clusterdev01 ~]# oc login -u system:admin
Server [https://localhost:8443]: https://master.openshift.example.com:8443
error: dial tcp 192.168.0.114:8443: getsockopt: no route to host - verify you have provided the correct host and port and that the server is currently running.

- in fact, there are no processes listening on 8443 on master:

[root@master ~]# netstat -ant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 192.168.0.114:22        192.168.0.114:40376     ESTABLISHED
tcp        0      0 192.168.0.114:48566     192.168.0.124:22        ESTABLISHED
tcp        0      0 192.168.0.114:715       192.168.0.150:2049      ESTABLISHED
tcp        0      0 192.168.0.114:40376     192.168.0.114:22        ESTABLISHED
tcp        0      0 192.168.0.114:22        192.168.0.110:59480     ESTABLISHED
tcp6       0      0 :::111                  :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN

- so the installer indicates success, but the cluster is not successfully deployed. Should I file another bug?
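A few quick checks, run as root on the master, that would confirm whether anything is serving the API port and whether the node service came up at all (a hedged sketch; the service and helper names are the ones that appear elsewhere in this report):

    ss -tlnp | grep 8443                      # is anything listening on the API port?
    systemctl status atomic-openshift-node    # did the node service start?
    master-logs api api                       # dumps the static API pod logs, if the pod was ever created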
For the 3.10 version of the problem:

rpm -q openshift-ansible
openshift-ansible-3.10.73-1.git.0.8b65cea.el7.noarch

rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch

ansible --version
ansible 2.4.6.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Sep 12 2018, 05:31:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
The API server didn't come up. What does `master-logs api api` show?
I've already moved on, deleted the VMs and am trying another configuration. I'll rebuild it again and check what is reported there, as you request.
This is what I see after yet another failed install:

[root@master ~]# master-logs api api
Component api is stopped or not running

I will attach the playbook output (as much as I could retrieve from my screen) as Failed Playbook deploy_cluster Log 3.10. I have also attached the list of processes running at the time of failure (or at least after I terminated the playbook). I'll save the VMs in case there are other files you need.
Created attachment 1511863 [details] Processes running at playbook failure
Created attachment 1511864 [details] Failed Playbook deploy_cluster Log 3.10
I re-ran the 3.10 scripts on a three machine configuration. I have attached the inventory as The Three Node Inventory. This used one master,one infra and one compute node. The installation went smoother - there was no locking up of the docker installation in prerequisites.yml as before, and the deploy_clusteryml completed without errors. However, no deployed cluster. Here are the processes running after this "successful" playbook execution: [root@master ~]# ps -efww UID PID PPID C STIME TTY TIME CMD root 1 0 0 22:08 ? 00:00:01 /usr/lib/systemd/systemd --switched-root --system --deserialize 22 root 2 0 0 22:08 ? 00:00:00 [kthreadd] root 3 2 0 22:08 ? 00:00:00 [ksoftirqd/0] root 5 2 0 22:08 ? 00:00:00 [kworker/0:0H] root 7 2 0 22:08 ? 00:00:00 [migration/0] root 8 2 0 22:08 ? 00:00:00 [rcu_bh] root 9 2 0 22:08 ? 00:00:00 [rcu_sched] root 10 2 0 22:08 ? 00:00:00 [lru-add-drain] root 11 2 0 22:08 ? 00:00:00 [watchdog/0] root 12 2 0 22:08 ? 00:00:00 [watchdog/1] root 13 2 0 22:08 ? 00:00:00 [migration/1] root 14 2 0 22:08 ? 00:00:00 [ksoftirqd/1] root 15 2 0 22:08 ? 00:00:00 [kworker/1:0] root 16 2 0 22:08 ? 00:00:00 [kworker/1:0H] root 17 2 0 22:08 ? 00:00:00 [watchdog/2] root 18 2 0 22:08 ? 00:00:00 [migration/2] root 19 2 0 22:08 ? 00:00:00 [ksoftirqd/2] root 21 2 0 22:08 ? 00:00:00 [kworker/2:0H] root 22 2 0 22:08 ? 00:00:00 [watchdog/3] root 23 2 0 22:08 ? 00:00:00 [migration/3] root 24 2 0 22:08 ? 00:00:00 [ksoftirqd/3] root 26 2 0 22:08 ? 00:00:00 [kworker/3:0H] root 27 2 0 22:08 ? 00:00:00 [watchdog/4] root 28 2 0 22:08 ? 00:00:00 [migration/4] root 29 2 0 22:08 ? 00:00:00 [ksoftirqd/4] root 30 2 0 22:08 ? 00:00:00 [kworker/4:0] root 31 2 0 22:08 ? 00:00:00 [kworker/4:0H] root 32 2 0 22:08 ? 00:00:00 [watchdog/5] root 33 2 0 22:08 ? 00:00:00 [migration/5] root 34 2 0 22:08 ? 00:00:00 [ksoftirqd/5] root 36 2 0 22:08 ? 00:00:00 [kworker/5:0H] root 38 2 0 22:08 ? 00:00:00 [kdevtmpfs] root 39 2 0 22:08 ? 00:00:00 [netns] root 40 2 0 22:08 ? 00:00:00 [khungtaskd] root 41 2 0 22:08 ? 00:00:00 [writeback] root 42 2 0 22:08 ? 00:00:00 [kintegrityd] root 43 2 0 22:08 ? 00:00:00 [bioset] root 44 2 0 22:08 ? 00:00:00 [bioset] root 45 2 0 22:08 ? 00:00:00 [bioset] root 46 2 0 22:08 ? 00:00:00 [kblockd] root 47 2 0 22:08 ? 00:00:00 [md] root 48 2 0 22:08 ? 00:00:00 [edac-poller] root 49 2 0 22:08 ? 00:00:00 [watchdogd] root 55 2 0 22:08 ? 00:00:00 [kswapd0] root 56 2 0 22:08 ? 00:00:00 [ksmd] root 57 2 0 22:08 ? 00:00:00 [khugepaged] root 58 2 0 22:08 ? 00:00:00 [crypto] root 66 2 0 22:08 ? 00:00:00 [kthrotld] root 67 2 0 22:08 ? 00:00:00 [kworker/u12:1] root 68 2 0 22:08 ? 00:00:00 [kmpath_rdacd] root 69 2 0 22:08 ? 00:00:00 [kaluad] root 70 2 0 22:08 ? 00:00:00 [kpsmoused] root 71 2 0 22:08 ? 00:00:00 [ipv6_addrconf] root 84 2 0 22:08 ? 00:00:00 [deferwq] root 85 2 0 22:08 ? 00:00:00 [kworker/1:1] root 120 2 0 22:08 ? 00:00:00 [kauditd] root 202 2 0 22:08 ? 00:00:00 [kworker/0:2] root 235 2 0 22:08 ? 00:00:00 [kworker/3:1] root 815 2 0 22:08 ? 00:00:00 [ata_sff] root 864 2 0 22:08 ? 00:00:00 [scsi_eh_0] root 878 2 0 22:08 ? 00:00:00 [scsi_tmf_0] root 879 2 0 22:08 ? 00:00:00 [scsi_eh_1] root 881 2 0 22:08 ? 00:00:00 [scsi_tmf_1] root 925 2 0 22:08 ? 00:00:00 [kworker/u12:3] root 988 2 0 22:08 ? 00:00:00 [ttm_swap] root 1675 2 0 22:08 ? 00:00:00 [kworker/4:1H] root 1699 2 0 22:08 ? 00:00:00 [kworker/0:3] root 1832 2 0 22:08 ? 00:00:00 [kdmflush] root 1833 2 0 22:08 ? 00:00:00 [bioset] root 1847 2 0 22:08 ? 00:00:00 [kdmflush] root 1849 2 0 22:08 ? 00:00:00 [bioset] root 1867 2 0 22:08 ? 
00:00:00 [bioset] root 1869 2 0 22:08 ? 00:00:00 [xfsalloc] root 1871 2 0 22:08 ? 00:00:00 [xfs_mru_cache] root 1878 2 0 22:08 ? 00:00:00 [xfs-buf/dm-0] root 1879 2 0 22:08 ? 00:00:00 [xfs-data/dm-0] root 1880 2 0 22:08 ? 00:00:00 [xfs-conv/dm-0] root 1886 2 0 22:08 ? 00:00:00 [xfs-cil/dm-0] root 1887 2 0 22:08 ? 00:00:00 [xfs-reclaim/dm-] root 1888 2 0 22:08 ? 00:00:00 [xfs-log/dm-0] root 1890 2 0 22:08 ? 00:00:00 [xfs-eofblocks/d] root 1893 2 0 22:08 ? 00:00:00 [xfsaild/dm-0] root 1894 2 0 22:08 ? 00:00:00 [kworker/5:1H] root 1950 2 0 22:08 ? 00:00:00 [kworker/2:1] root 1967 1 0 22:08 ? 00:00:00 /usr/lib/systemd/systemd-journald root 1988 1 0 22:08 ? 00:00:00 /usr/sbin/lvmetad -f root 2000 1 0 22:08 ? 00:00:00 /usr/lib/systemd/systemd-udevd root 2864 2 0 22:08 ? 00:00:00 [kworker/3:2] root 3127 2 0 22:08 ? 00:00:00 [xfs-buf/vda1] root 3141 2 0 22:08 ? 00:00:00 [xfs-data/vda1] root 3160 2 0 22:08 ? 00:00:00 [xfs-conv/vda1] root 3183 2 0 22:08 ? 00:00:00 [xfs-cil/vda1] root 3203 2 0 22:08 ? 00:00:00 [xfs-reclaim/vda] root 3239 2 0 22:08 ? 00:00:00 [xfs-log/vda1] root 3248 2 0 22:08 ? 00:00:00 [xfs-eofblocks/v] root 3256 2 0 22:08 ? 00:00:00 [xfsaild/vda1] root 3664 2 0 22:08 ? 00:00:00 [kdmflush] root 3665 2 0 22:08 ? 00:00:00 [bioset] root 3676 2 0 22:08 ? 00:00:00 [xfs-buf/dm-2] root 3677 2 0 22:08 ? 00:00:00 [xfs-data/dm-2] root 3678 2 0 22:08 ? 00:00:00 [xfs-conv/dm-2] root 3679 2 0 22:08 ? 00:00:00 [xfs-cil/dm-2] root 3680 2 0 22:08 ? 00:00:00 [xfs-reclaim/dm-] root 3681 2 0 22:08 ? 00:00:00 [xfs-log/dm-2] root 3682 2 0 22:08 ? 00:00:00 [xfs-eofblocks/d] root 3683 2 0 22:08 ? 00:00:00 [xfsaild/dm-2] root 3701 2 0 22:08 ? 00:00:00 [kworker/1:1H] root 3717 1 0 22:08 ? 00:00:00 /sbin/auditd root 3721 2 0 22:08 ? 00:00:00 [rpciod] root 3722 2 0 22:08 ? 00:00:00 [xprtiod] root 3746 2 0 22:08 ? 00:00:00 [kworker/2:1H] root 3750 1 0 22:08 ? 00:00:00 /usr/lib/systemd/systemd-logind root 3751 1 0 22:08 ? 00:00:00 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file-flush,guest-exec,guest-exec-status -F/etc/qemu-ga/fsfreeze-hook dbus 3752 1 0 22:08 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation root 3754 1 0 22:08 ? 00:00:00 /usr/sbin/gssproxy -D chrony 3755 1 0 22:08 ? 00:00:00 /usr/sbin/chronyd root 3766 1 0 22:08 ? 00:00:00 /usr/sbin/NetworkManager --no-daemon polkitd 3767 1 0 22:08 ? 00:00:00 /usr/lib/polkit-1/polkitd --no-debug root 3768 1 0 22:08 ? 00:00:01 /sbin/rngd -f root 3769 1 0 22:08 ? 00:00:00 /usr/sbin/irqbalance --foreground libstor+ 3772 1 0 22:08 ? 00:00:00 /usr/bin/lsmd -d root 3773 1 0 22:08 ? 00:00:00 /usr/sbin/smartd -n -q never root 3774 1 0 22:08 ? 00:00:00 /usr/sbin/abrtd -d -s root 3775 1 0 22:08 ? 00:00:00 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) went negative! 
adness at NETDEV WATCHDOG ysctl table check failed : nobody cared IRQ handler type mismatch Kernel panic - not syncing: Machine Check Exception: Machine check events logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD root 3932 3766 0 22:08 ? 00:00:00 /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-130213eb-f7af-43c3-bc97-0d678529cd8d-eth0.lease -cf /var/lib/NetworkManager/dhclient-eth0.conf eth0 root 4232 1 0 22:08 ? 00:00:00 /usr/bin/python2 -Es /usr/sbin/tuned -l -P root 4234 1 0 22:08 ? 00:00:00 /usr/sbin/sshd -D root 4238 1 0 22:08 ? 00:00:00 /usr/sbin/rsyslogd -n root 4241 1 0 22:08 ? 00:00:00 /usr/bin/rhsmcertd root 4270 2 0 22:08 ? 00:00:00 [nfsiod] root 4306 2 0 22:08 ? 00:00:00 [kworker/0:1H] root 4308 2 0 22:08 ? 00:00:00 [nfsv4.1-svc] root 4327 1 0 22:08 ? 00:00:00 /usr/sbin/crond -n root 4328 1 0 22:08 ? 00:00:00 /usr/sbin/atd -f root 4344 1 0 22:08 ? 00:00:00 rhnsd root 4350 1 0 22:08 ? 00:00:02 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json --selinux-enabled --signature-verification=False --storage-driver overlay2 --mtu=1450 --add-registry registry.access.redhat.com --add-registry registry.access.redhat.com --add-registry docker.io --add-registry registry.fedoraproject.org --add-registry quay.io --add-registry registry.centos.org root 4354 2 0 22:08 ? 00:00:00 [kworker/3:1H] root 4360 1 0 22:08 tty1 00:00:00 /sbin/agetty --noclear tty1 linux root 4422 1 0 22:08 ? 00:00:00 /usr/libexec/postfix/master -w postfix 4423 4422 0 22:08 ? 00:00:00 pickup -l -t unix -u postfix 4424 4422 0 22:08 ? 00:00:00 qmgr -l -t unix -u root 4429 4350 0 22:08 ? 00:00:01 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc --runtime-args --systemd-cgroup=true root 4590 1 0 22:08 ? 00:00:00 /usr/libexec/docker/rhel-push-plugin root 10625 2 0 22:22 ? 00:00:00 [kworker/2:0] root 10637 4234 0 22:26 ? 00:00:00 sshd: root@pts/0 root 10640 10637 0 22:26 pts/0 00:00:00 -bash root 10684 2 0 22:27 ? 00:00:00 [kworker/5:2] root 10720 2 0 22:32 ? 00:00:00 [kworker/5:0] root 10765 2 0 22:34 ? 00:00:00 [kworker/4:1] root 10779 10640 0 22:42 pts/0 00:00:00 ps -efww
I also looked in journalctl on master for any ansible-related errors: nothing that I could see. I'll attach that file as well.
Created attachment 1511902 [details] 3-node-attempt-journalctl
Created attachment 1511903 [details] 3-node-attempt-inventory
Created attachment 1511904 [details] 3-node-attempt-deploy_cluster.log
Created attachment 1511905 [details] 3-node-attempt-prerequisite.log
There were also these warnings at the start of the deploy_cluster playbook:

[root@master ~]# ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml > deploy_cluster.log
[WARNING]: Could not match supplied host pattern, ignoring: oo_lb_to_config
[WARNING]: Could not match supplied host pattern, ignoring: oo_nfs_to_config
[WARNING]: Consider using yum, dnf or zypper module rather than running rpm
[WARNING]: Could not match supplied host pattern, ignoring: oo_hosts_containerized_managed_true

Everything seems to stop when docker is started.
I'll leave the VMs as they are; let me know if you need any other information.
We need logs from stopped api servers. Please on each master find containers with names starting `k8s_api_master-api...` and attach the logs to the issue
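A hedged sketch of how those logs could be collected with the docker CLI (the container-name prefix is the one mentioned above; the output file name is just an example):

    docker ps -a | grep k8s_api_master-api
    docker logs <container-id> &> master-api-container.log   # substitute the ID from the line above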
Please also attach the output of deploy playbook with `ansible-playbook -vvv`
(In reply to Vadim Rutkovsky from comment #19)
> We need logs from stopped api servers. Please on each master find containers
> with names starting `k8s_api_master-api...` and attach the logs to the issue

[root@clusterdev01 ~]# ssh master
Last login: Wed Dec 5 23:22:17 2018 from clusterdev01.lab.eng.brq.redhat.com
[root@master ~]# docker container ls
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
I think I may have made a mistake in running the three node cluster: I seem to have run the prerequisites.yml playbook twice in a row, instead of prerequisites followed by deploy_cluster.yml. Rerunning it now.
Created attachment 1511910 [details] 3-node-attempt-prerequisite.yml-vvv
Created attachment 1511911 [details] 3-node-attempt-deploy_cluster.log-vvv
I have attached the logs for running both playbooks with -vvv Docker is running on master: [root@master ~]# systemctl status docker ● docker.service - Docker Application Container Engine Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/docker.service.d └─custom.conf Active: active (running) since Wed 2018-12-05 22:08:29 CET; 1h 55min ago Docs: http://docs.docker.com Main PID: 4350 (dockerd-current) CGroup: /system.slice/docker.service ├─4350 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-p... └─4429 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/... Dec 05 22:08:27 master dockerd-current[4350]: time="2018-12-05T22:08:27.772635448+01:00" level=info msg="libcontainerd: new containerd process, pid: 4429" Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.833697556+01:00" level=info msg="Graph migration to content-addressability took 0.00 seconds" Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.834622176+01:00" level=info msg="Loading containers: start." Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.864326852+01:00" level=info msg="Firewalld running: false" Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.960846798+01:00" level=info msg="Default bridge (docker0) is assigned with an IP address 17...P address" Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.987040762+01:00" level=info msg="Loading containers: done." Dec 05 22:08:29 master dockerd-current[4350]: time="2018-12-05T22:08:29.022979241+01:00" level=info msg="Daemon has completed initialization" Dec 05 22:08:29 master dockerd-current[4350]: time="2018-12-05T22:08:29.023039877+01:00" level=info msg="Docker daemon" commit="07f3374/1.13.1" graphdriver=overlay...ion=1.13.1 Dec 05 22:08:29 master dockerd-current[4350]: time="2018-12-05T22:08:29.061630267+01:00" level=info msg="API listen on /var/run/docker.sock" Dec 05 22:08:29 master systemd[1]: Started Docker Application Container Engine. Hint: Some lines were ellipsized, use -l to show in full. However, there are no images in the docker repo.
[root@master ~]# skopeo inspect docker://openshift.example.com/openshift3/ose-deployer:v3.10
FATA[0010] pinging docker registry returned: Get https://openshift.example.com/v2/: dial tcp: lookup openshift.example.com on 192.168.0.115:53: no such host

There looks to be something wrong in the configuration...
oreg_url=openshift.example.com/openshift3/ose-${component}:${version}

Are we sure that is correct? Comment it out; that is probably the problem.
Yes, there is no such host. And I didn't write that line. The line is present in all the sample configurations provided by the docs, and does not reflect the name of any other host in the inventory of those configurations. The problem is that we are spending all our time guessing which configuration might work, then spending 2 hours running the install only to find out that it doesn't work, and trying again. As mentioned above, even the detailed 2-node example does not provide a concrete inventory file which is guaranteed to work.
(In reply to Richard Achmatowicz from comment #28)
> Yes, there is no such host. And I didn't write that line. The line is
> present in all the sample configurations provided by the docs, and does not
> reflect the name of any other host in the inventory of those configurations.

`hosts.example` has a description for it:
># Cluster Image Source (registry) configuration
># openshift-enterprise default is 'registry.redhat.io/openshift3/ose-${component}:${version}'
># origin default is 'docker.io/openshift/origin-${component}:${version}'
>#oreg_url=example.com/openshift3/ose-${component}:${version}

If you're not using a different registry for images, you don't need to uncomment it. Does it work with this line commented out?
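A quick, hedged way to locate and comment out that line in the inventory (assuming the inventory is /etc/ansible/hosts, as in the commands earlier in this report):

    grep -n '^oreg_url' /etc/ansible/hosts
    sed -i 's|^oreg_url|#oreg_url|' /etc/ansible/hosts   # comment the line out rather than deleting it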
I'll give it a try by rerunning the deploy_cluster playbook. When products have complex configurations, like OpenShift, rather than expecting the user to sift through the many, many options and come up with a correct configuration, either the default values need to work out of the box, or detailed sample configurations that are known to work need to be provided for the most common use cases. This isn't happening here.
Another piece of feedback which may or may not apply: in addition to running lots of checks for the dependencies required by OpenShift (which also seem to be faulty, judging from all the "turn off the check" issues I have seen), a sanity check for the configuration is in order. If I had a line in the inventory referencing a host that didn't exist or was not reachable, such a check would identify it before I wasted 2 hours running through the install.
I reran with that statement commented out; here is the tail from the log:

INSTALLER STATUS *********************************************************************************************************************
Initialization : Complete (0:00:14)
Health Check : In Progress (0:09:59)
This phase can be restarted by running: playbooks/openshift-checks/pre-install.yml

Failure summary:

1. Hosts: node1.openshift.example.com, node2.openshift.example.com
   Play: OpenShift Health Checks
   Task: Run health checks (install) - EL
   Message: One or more checks failed
   Details: check "docker_image_availability":
            One or more required container images are not available:
            registry.access.redhat.com/openshift3/ose-deployer:v3.10,
            registry.access.redhat.com/openshift3/ose-docker-registry:v3.10,
            registry.access.redhat.com/openshift3/ose-haproxy-router:v3.10,
            registry.access.redhat.com/openshift3/ose-pod:v3.10,
            registry.access.redhat.com/openshift3/registry-console:v3.10
            Checked with: skopeo inspect [--tls-verify=false] [--creds=<user>:<pass>] docker://<registry>/<image>

2. Hosts: master.openshift.example.com
   Play: OpenShift Health Checks
   Task: Run health checks (install) - EL
   Message: One or more checks failed
   Details: check "docker_image_availability":
            One or more required container images are not available:
            registry.access.redhat.com/openshift3/ose-control-plane:v3.10,
            registry.access.redhat.com/openshift3/ose-deployer:v3.10,
            registry.access.redhat.com/openshift3/ose-docker-registry:v3.10,
            registry.access.redhat.com/openshift3/ose-haproxy-router:v3.10,
            registry.access.redhat.com/openshift3/ose-pod:v3.10,
            registry.access.redhat.com/openshift3/registry-console:v3.10,
            registry.access.redhat.com/rhel7/etcd:3.2.22
            Checked with: skopeo inspect [--tls-verify=false] [--creds=<user>:<pass>] docker://<registry>/<image>

The execution of "/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, explicitly disable checks by setting an Ansible variable:
   openshift_disable_check=docker_image_availability
Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation. Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.
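For what it's worth, the failing check can be re-run by hand with the same skopeo invocation it reports, against any of the images listed above, for example:

    skopeo inspect --tls-verify=false docker://registry.access.redhat.com/openshift3/ose-pod:v3.10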
You need a valid subscription to pull these images. Does it work once that is set up? See https://docs.openshift.com/container-platform/3.11/getting_started/install_openshift.html#attach-subscription and the following sections for details
[root@master ~]# subscription-manager list --consumed | grep Container
Red Hat Container Images for IBM Power LE
Red Hat Container Images Beta for IBM Power LE
Red Hat Container Images
Red Hat Container Images Beta
Red Hat Container Images HTB
Red Hat OpenShift Container Platform
Red Hat Container Development Kit

I have all the hosts subscribed to the Employee SKU, which contains OpenShift Container Platform. When following the instructions in the docs to install the two-node cluster, it happened on several occasions that prerequisites.yml would hang (for 2 minutes) when installing Docker. I would Ctrl-C the playbook, run it again, it would pass, and then continue with the other instructions. It looks as though Docker is not being installed correctly, and that installation is done by the playbook.
I meant 20 minutes, not 2.
And the repos are enabled as well: subscription-manager: error: no such option: --enabled [root@master ~]# subscription-manager repos --list-enabled +----------------------------------------------------------+ Available Repositories in /etc/yum.repos.d/redhat.repo +----------------------------------------------------------+ Repo ID: rhel-7-server-ansible-2.4-rpms Repo Name: Red Hat Ansible Engine 2.4 RPMs for Red Hat Enterprise Linux 7 Server Repo URL: https://cdn.redhat.com/content/dist/rhel/server/7/7Server/$basearch/ansible/2.4/os Enabled: 1 Repo ID: rhel-7-server-extras-rpms Repo Name: Red Hat Enterprise Linux 7 Server - Extras (RPMs) Repo URL: https://cdn.redhat.com/content/dist/rhel/server/7/7Server/$basearch/extras/os Enabled: 1 Repo ID: rhel-7-server-rpms Repo Name: Red Hat Enterprise Linux 7 Server (RPMs) Repo URL: https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/os Enabled: 1 Repo ID: rhel-7-server-ose-3.10-rpms Repo Name: Red Hat OpenShift Container Platform 3.10 (RPMs) Repo URL: https://cdn.redhat.com/content/dist/rhel/server/7/7Server/$basearch/ose/3.10/os Enabled: 1
Another note: I seem to remember on one of the first installs that this message was appearing, I looked up the issue on the Internet and found that many people were finding that the check for docker images was not working correctly and they were turning off the check: https://github.com/openshift/openshift-ansible/issues/7721 And it still seems to be a problem: [root@master ~]# skopeo inspect --tls-verify=false docker://docker.io/cockpit/kubernetes:latest { "Name": "docker.io/cockpit/kubernetes", "Digest": "sha256:f38c7b0d2b85989f058bf78c1759bec5b5d633f26651ea74753eac98f9e70c9b", "RepoTags": [ "latest", "wip" ], "Created": "2018-10-12T11:48:09.409602119Z", "DockerVersion": "18.03.1-ee-1-tp5", "Labels": null, "Architecture": "amd64", "Os": "linux", "Layers": [ "sha256:565884f490d9ec697e519c57d55d09e268542ef2c1340fd63262751fa308f047", "sha256:dd655a54c8454c55e559be4c07218ffee8390476583a137b98c0a0592101cfc9", "sha256:b715fff0ceb39ce45574c981aa8b29dbc94be81770735b55743b587501cef41f", "sha256:1dcd53aa9421189eadac2b599717a6b8234563a48257297c6be14d392262eb2e" ] }
Ok, it seems that am back back to where I was when I started, but a little wiser. With regard to the inability of the playbooks to correctly install software, the docker_image_availability indicates that the playbook failed to download the docker images. Jean-Frederic also ran into this problem and suggested I download the images myself. Wrote a script to do this. Execution of the script took 6 minutes to download all of the images from registry.access.redhat.com. Wonder why I can do it and the playbook cannot? Can't say for the moment. Rerunning the deploy_cluster playbook took me back to the same position I was in when trying to install on the two node cluster, which happens to be the subject of this issue: cannot load control plan pods. There is no server called master.openshift.example.com:8443 running when this operation is attempted. So this is the next target. However, this time I did see some relevant processes: root 46423 1 1 19:59 ? 00:00:04 /usr/bin/hyperkube kubelet --v=2 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=5m --authorization-mode=Webhook --authorization-webhook-cache-authorized-ttl=5m --authorization-webhook-cache-unauthorized-ttl=5m --bootstrap-kubeconfig=/etc/origin/node/bootstrap.kubeconfig --cadvisor-port=0 --cert-dir=/etc/origin/node/certificates --cgroup-driver=systemd --client-ca-file=/etc/origin/node/client-ca.crt --cluster-dns=192.168.122.97 --cluster-domain=cluster.local --container-runtime-endpoint=/var/run/dockershim.sock --containerized=false --enable-controller-attach-detach=true --experimental-dockershim-root-directory=/var/lib/dockershim --fail-swap-on=false --feature-gates=RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true --file-check-frequency=0s --healthz-bind-address= --healthz-port=0 --host-ipc-sources=api --host-ipc-sources=file --host-network-sources=api --host-network-sources=file --host-pid-sources=api --host-pid-sources=file --hostname-override= --http-check-frequency=0s --image-service-endpoint=/var/run/dockershim.sock --iptables-masquerade-bit=0 --kubeconfig=/etc/origin/node/node.kubeconfig --max-pods=250 --network-plugin=cni --node-ip= --pod-infra-container-image=registry.access.redhat.com/openshift3/ose-pod:v3.10.72 --pod-manifest-path=/etc/origin/node/pods --port=10250 --read-only-port=0 --register-node=true --root-dir=/var/lib/origin/openshift.local.volumes --rotate-certificates=true --tls-cert-file= --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA --tls-cipher-suites=TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-cipher-suites=TLS_RSA_WITH_AES_256_GCM_SHA384 --tls-cipher-suites=TLS_RSA_WITH_AES_128_CBC_SHA --tls-cipher-suites=TLS_RSA_WITH_AES_256_CBC_SHA --tls-min-version=VersionTLS12 --tls-private-key-file= and root 46904 46866 0 19:59 ? 
00:00:02 openshift start master controllers --config=/etc/origin/master/master-config.yaml --listen=https://0.0.0.0:8444 --loglevel=2
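A minimal sketch of the kind of pre-pull script mentioned in the comment above (the image names and tags are taken from the docker_image_availability failure earlier in this report; adjust as needed):

    #!/bin/bash
    # Pre-pull the OCP 3.10 images that the docker_image_availability check reported as unavailable.
    for image in \
        openshift3/ose-control-plane:v3.10 \
        openshift3/ose-deployer:v3.10 \
        openshift3/ose-docker-registry:v3.10 \
        openshift3/ose-haproxy-router:v3.10 \
        openshift3/ose-pod:v3.10 \
        openshift3/registry-console:v3.10 \
        rhel7/etcd:3.2.22; do
        docker pull "registry.access.redhat.com/${image}"
    done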
[root@master ~]# ps -ef | grep master
root      4405      1  0 Dec06 ?      00:00:00 /usr/libexec/postfix/master -w
root      6067 127735  0 09:56 pts/0  00:00:00 grep --color=auto master
root     46904  46866  0 Dec06 ?      00:05:19 openshift start master controllers --config=/etc/origin/master/master-config.yaml --listen=https://0.0.0.0:8444 --loglevel=2

why 8444?
I have no idea. It must be coming from some default in the master inventory used; my inventory does not contain that port. By the way, I have started re-reading Ansible Up and Running (which I read about a year ago) to understand the playbooks. In executing the examples in that book (which, for example, install nginx on a VM), Ansible *will* report incorrect results from the server: I had one small error in the nginx config file (a missing semicolon), the nginx server did not start correctly, but Ansible reported that all was well. It took hunting through the server logs for that particular module (service) to find the problem. So, with these playbooks that involve literally hundreds of tasks, there is a lot of scope for error. This is probably what is happening here.
This may be the reason for the appearance of 8444: when we have two httpd servers set up on the same host, the first server listens on 8080/8443 and the second listens on 8081/8444.
The master MUST be using 8443; you have to remove whatever is running on that port before starting the installation. Note that putting 8444 in the kube config doesn't help:
+++
[root@master ~]# oc config get-clusters
NAME
master-openshift-example-com:8444
[root@master ~]# oc get nodes
Error from server (InternalError): an error on the server ("Internal Server Error: \"/api/v1/nodes?limit=500\": Post https://master.openshift.example.com:8443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: dial tcp 192.168.0.114:8443: getsockopt: connection refused") has prevented the request from succeeding (get nodes)
[root@master ~]# kube config view
bash: kube: command not found
[root@master ~]# kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://master.openshift.example.com:8444
  name: master-openshift-example-com:8444
+++
Proposed - https://github.com/openshift/openshift-ansible/pull/10884
The stage I'm at now is this:
- the service atomic-openshift-node is getting started
- it is executing the command /usr/bin/hyperkube kubelet with a lot of parameters (many of which are flagged as deprecated and should instead be in a --config file)
- the --pod-manifest-path correctly identifies the directory containing the three configured static pods (apiserver.yaml, controller.yaml and etcd.yaml)
- the images for those pods are available in the host's docker registry (ose-control-plane and etcd):

[root@master pods]# docker image ls
REPOSITORY                                                  TAG      IMAGE ID       CREATED       SIZE
registry.access.redhat.com/openshift3/ose-node              v3.10    8f2fe66622e7   9 days ago    1.28 GB
registry.access.redhat.com/openshift3/ose-haproxy-router    v3.10    26a21dbbf59a   9 days ago    812 MB
registry.access.redhat.com/openshift3/ose-deployer          v3.10    f3007f9bef28   9 days ago    792 MB
registry.access.redhat.com/openshift3/ose-control-plane     v3.10    d4f29f0adefd   9 days ago    792 MB
registry.access.redhat.com/openshift3/ose-docker-registry   v3.10    51b02d4237d5   9 days ago    288 MB
registry.access.redhat.com/openshift3/ose-pod               v3.10    8fbf8b6e3a44   9 days ago    217 MB
registry.access.redhat.com/openshift3/registry-console      v3.10    963d6b3f6e5a   9 days ago    235 MB
registry.access.redhat.com/rhel7/etcd                       3.2.22   635bb36d7fc7   3 weeks ago   259 MB

- so, the API server and the controller should be getting started correctly
- but when the service starts, these pods don't get started at all

I found someone who had a similar issue, and he concluded that the problem was with pods starting very slowly: https://github.com/kubernetes/website/issues/4166 That is a possibility here, given that the image downloads also failed due to slowness. Two options I'll try tomorrow on hyperkube:
--image-pull-progress-deadline: this value is apparently set to 1 minute, so any image pull taking longer than 1 minute is cancelled (but the images are there)
--v=n: increase this to get more information out of hyperkube and what it is doing
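Before changing any kubelet flags, a few hedged checks against the paths and service named above can show whether the static pods are even being attempted:

    ls -l /etc/origin/node/pods/                                   # should list apiserver.yaml, controller.yaml, etcd.yaml
    journalctl -u atomic-openshift-node --no-pager | grep -iE 'static|apiserver|pull' | tail -50
    docker ps -a | grep -E 'apiserver|controllers|etcd'            # were the static pod containers ever created?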
For example, here are the journalctl entries for the failed startup of the service atomic-openshift-node (see journalctl-attempt#2-startup-problems). There are several warning messages which may be related to the startup problem:
- lots of deprecated flags being passed by Ansible to hyperkube; it's unclear which options are being accepted and which are not
- a warning about being unable to update the cni networking configuration
- lack of a cloud provider being specified
I'm not sure what these mean at the moment, but will investigate.
Created attachment 1514386 [details] journalctl messages from startup of failing service atomic-openshift-node
By the way, many people have the same issue: https://github.com/kubernetes/kubernetes/issues/54918 Some of the solutions that did work involved restarting the network.
I spun up two instances on AWS and ran through the installation instructions [1]. The install completed successfully. I noticed the following items in the instructions that could have posed a problem. 1. The following two lines should not be part of a default inventory. I read through the commit history for those lines and found they were added because of customer cases where there was a bug in OSE which caused the creation of improper registry URLs. Please ensure these lines are commented out in a default inventory. #oreg_url=example.com/openshift3/ose-${component}:${version} #openshift_examples_modify_imagestreams=true 2. After performing `yum update`, I rebooted my hosts. It is expected that after running `yum update` hosts will be rebooted to ensure all services are running as expected prior to installing OCP. The reboot is mentioned in the Host Preparation [2] steps. Please reboot any hosts after `yum update` has been performed. 3. Before running the playbooks, change directory to the openshift-ansible directory to ensure the provided ansible.cfg is used when executing playbooks. $ cd /usr/share/ansible/openshift-ansible Please attempt an install with two fresh hosts and follow the items mentioned above to see if this resolves your issues. OCP is a complex platform with many different features and capabilities. It is recommended to read the documentation to fully understand settings and requirements. I will investigate updating the documentation for the items mentioned above. [1] https://docs.openshift.com/container-platform/3.10/getting_started/install_openshift.html [2] https://docs.openshift.com/container-platform/3.10/install/host_preparation.html#installing-base-packages Host file in use: $ cat /etc/ansible/hosts # Create an OSEv3 group that contains the masters, nodes, and etcd groups [OSEv3:children] masters nodes etcd # Set variables common for all OSEv3 hosts [OSEv3:vars] # SSH user, this user should allow ssh based auth without requiring a password ansible_ssh_user=ec2-user # If ansible_ssh_user is not root, ansible_become must be set to true ansible_become=true openshift_deployment_type=openshift-enterprise #oreg_url=example.com/openshift3/ose-${component}:${version} #openshift_examples_modify_imagestreams=true # uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider #openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}] # testing deployment nodes don't meet these requirements openshift_disable_check=disk_availability,memory_availability # host group for masters [masters] host1.compute-1.amazonaws.com # host group for etcd [etcd] host1.compute-1.amazonaws.com # host group for nodes, includes region info [nodes] host1.compute-1.amazonaws.com openshift_node_group_name='node-config-master-infra' host2.compute-1.amazonaws.com openshift_node_group_name='node-config-compute'
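For item 3, the run sequence would look something like this (a sketch; the inventory path is assumed to be /etc/ansible/hosts, matching the host file above):

    cd /usr/share/ansible/openshift-ansible
    ansible-playbook -i /etc/ansible/hosts playbooks/prerequisites.yml
    ansible-playbook -i /etc/ansible/hosts playbooks/deploy_cluster.yml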
Proposed docs update: https://github.com/openshift/openshift-docs/pull/13128
Russell

Thanks for taking the time to test out the install on AWS and pointing out the possible sources of error in running the installer. OK, I have re-run the install using the same hosts, after using the uninstall script (it takes me a while to set up new hosts, so I decided to try with just an uninstall - it's late Friday :-)).

Changes to my usual procedure:
1. Checking my inventory, the line for modifying example image streams was uncommented, so I commented it out.
2. I rebooted all hosts involved in the install.
3. I executed the playbooks from the /usr/share/ansible/openshift-ansible directory.

Observations:
1. I noticed immediately that the scripts were behaving differently. Changing to the openshift-ansible directory picked up a different ansible.cfg file and changed things dramatically (e.g. seeing time and date stamps on each task execution, whereas previously I did not).
2. When the service atomic-openshift-node was started, I immediately saw a lot more processes running, in particular the api and controllers processes, as well as some mux-related things. This is in the attached file.
3. However, still the same problem with not being able to connect to the master.openshift.example.com:8443 server. This is in the attached file.

Remarks:
1. Please make a small note in the docs that changing to that directory is done in order to pick up the correct ansible.cfg file. There are too many people (like me) who will execute the same command from the current directory using a FQPN for simplicity's sake if they are not warned otherwise. Or provide a long-form alternative by specifying the cfg file via a command-line option (as is done with the inventory). The time lost is too great not to mention this simple "mistake".

So, progress, but still not there yet. I'll have to continue next week trying this on fresh hosts.
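On the remark about a long-form alternative: there is no command-line flag for the config file, but ansible-playbook does honor the ANSIBLE_CONFIG environment variable, so an equivalent of changing directory would be something like this (a sketch, assuming the shipped ansible.cfg lives in the openshift-ansible directory):

    ANSIBLE_CONFIG=/usr/share/ansible/openshift-ansible/ansible.cfg \
        ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml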
Hi Russell

Thanks for taking the time to look into this. I really appreciate it.

The clusterdev lab is composed of 5 physical hosts, and we create VMs on each of those physical hosts for running software (testing frameworks, OpenShift clusters, etc). The 5 hosts are connected by a Juniper switch and communicate with each other over a private internal network (192.168.0.1xy) where x = {1..5} and y = {1..5}. Each physical host and each VM also has a public interface to the outside world (these are the 192.168.111.211 interfaces, for example). In my inventory, I have specified hostnames (master, node1, node2) which correspond to interfaces on the private internal network only.

I noticed in the logs that the installer was binding to some of the public interfaces, which might be due to the fact that the public interface on the VMs is usually eth0, the first non-loopback interface, and sometimes network software will use this as a way of choosing an interface that is not otherwise specified. I also noted binding to 0.0.0.0, which is generally good to avoid.
Richard,

The best guidance I have at this time is to make sure the hostname of the host resolves to the first interface. If you continue to have problems with installation, please open a new bug specifically about installation on multi-homed hosts. This bug will be used to track the documentation issues already addressed above.
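A hedged way to verify that mapping on each host before kicking off another install:

    hostname -f
    getent hosts "$(hostname -f)"    # should resolve to the IP of the interface OpenShift should use
    ip -4 -o addr show               # compare against the addresses on each interface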
The PR is not merged to 3.10; https://github.com/openshift/openshift-ansible/pull/10884 needs to be cherry-picked to 3.10.
PR for 3.10 https://github.com/openshift/openshift-ansible/pull/10917
PR merged.
Fixed. openshift-ansible-docs-3.10.98-1.git.0.198012d.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2509