Created attachment 1511758 [details] The inventory file used.

Description of problem: When running the deploy_cluster.yml playbook during an OpenShift install, the installer fails with an error.

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible: 100% reproducible

Steps to Reproduce:
1. Create two RHEL 7.4 VMs, master and node1
2. Set up a named instance on a node outside the OpenShift cluster nodes and verify that DNS functions correctly
3. Follow the Install OpenShift Container Platform instructions to the letter
4. Run the prerequisites.yml playbook (completes successfully)
5. Run the deploy_cluster.yml playbook (fails)

Actual results: The deploy_cluster.yml playbook fails to complete at the point where the control plane pods are supposed to appear, with the error message:

[master.openshift.example.com] (item=api) => {"attempts": 60, "changed": false, "failed": true, "item": "api", "msg": {"cmd": "/usr/bin/oc get pod master-api-master -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server master.openshift.example.com:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}

This is a valid node in my configuration. I'm somewhat surprised that the installation script is asking me whether the node is correct when it's the installation script that would have started this service.

Expected results: The cluster should be correctly deployed.

Additional info: Please attach logs from ansible-playbook with the -vvv flag
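For reference, the playbook invocations behind steps 4 and 5 above (a sketch; the inventory path is assumed to be the default /etc/ansible/hosts, as used in commands later in this report):

    ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml
    ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml -vvv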
Created attachment 1511760 [details] Log from the error
I followed the instructions in https://docs.openshift.com/container-platform/3.10/getting_started/install_openshift.html to the letter. The problem with this documentation is that it provides explicit instructions for setting up a *simple* 2 node OpenShift cluster, right down to the commands to type at each point, except for one, glaring omission: it does not provide the playbook inventory file. Rather, it tells you to look at some examples, all of which involve 4 or more nodes. This in itself is a documentation bug. And it is probably related to the error I am seeing. There are too many configuration options in OpenShift for this to be acceptable: you must provide a concrete 2 node playbook inventory which is guaranteed to work. Otherwise, this product does not work out of the box and customers get frustrated wasting days trying to find a configuration that works.
Created attachment 1511773 [details] Log from Success (3.11)
I re-ran the whole exercise using 3.11 instead of 3.10 and the results were different. This time, the following happened:
- the prerequisites.yml playbook completed
- the deploy_cluster.yml playbook hung when trying to install Docker and had to be Ctrl-C'd and restarted (I thought prerequisites.yml installed all the prerequisite software ...)
- the deploy_cluster.yml playbook completed (see the log Log From Success)
- however, when I tried to log in to the cluster, the login failed:

[root@clusterdev01 ~]# oc login -u system:admin
Server [https://localhost:8443]: https://master.openshift.example.com:8443
error: dial tcp 192.168.0.114:8443: getsockopt: no route to host - verify you have provided the correct host and port and that the server is currently running.

- in fact, there are no processes listening on 8443 on master:

[root@master ~]# netstat -ant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 192.168.0.114:22        192.168.0.114:40376     ESTABLISHED
tcp        0      0 192.168.0.114:48566     192.168.0.124:22        ESTABLISHED
tcp        0      0 192.168.0.114:715       192.168.0.150:2049      ESTABLISHED
tcp        0      0 192.168.0.114:40376     192.168.0.114:22        ESTABLISHED
tcp        0      0 192.168.0.114:22        192.168.0.110:59480     ESTABLISHED
tcp6       0      0 :::111                  :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN

- so the installer indicates success, but the cluster is not successfully deployed. Should I file another bug?
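A few quick checks, run as root on the master, that would confirm whether anything is serving the API port and whether the node service came up at all (a hedged sketch; the service and helper names are the ones that appear elsewhere in this report):

    ss -tlnp | grep 8443                      # is anything listening on the API port?
    systemctl status atomic-openshift-node    # did the node service start?
    master-logs api api                       # dumps the static API pod logs, if the pod was ever created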
For the 3.10 version of the problem:

rpm -q openshift-ansible
openshift-ansible-3.10.73-1.git.0.8b65cea.el7.noarch

rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch

ansible --version
ansible 2.4.6.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Sep 12 2018, 05:31:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
The API server didn't come up. What does `master-logs api api` show?
I've already moved on, deleted the VMs and am trying another configuration. I'll rebuild it again and check what is reported there, as you request.
This is what I see after yet another failed install:

[root@master ~]# master-logs api api
Component api is stopped or not running

I will attach the playbook output (as much as I could retrieve from my screen) as Failed Playbook deploy_cluster Log 3.10. I have also attached the list of processes running at the time of failure (or at least after I terminated the playbook). I'll save the VMs in case there are other files you need.
Created attachment 1511863 [details] Processes running at playbook failure
Created attachment 1511864 [details] Failed Playbook deploy_cluster Log 3.10
I re-ran the 3.10 scripts on a three machine configuration. I have attached the inventory as The Three Node Inventory. This used one master,one infra and one compute node. The installation went smoother - there was no locking up of the docker installation in prerequisites.yml as before, and the deploy_clusteryml completed without errors. However, no deployed cluster. Here are the processes running after this "successful" playbook execution: [root@master ~]# ps -efww UID PID PPID C STIME TTY TIME CMD root 1 0 0 22:08 ? 00:00:01 /usr/lib/systemd/systemd --switched-root --system --deserialize 22 root 2 0 0 22:08 ? 00:00:00 [kthreadd] root 3 2 0 22:08 ? 00:00:00 [ksoftirqd/0] root 5 2 0 22:08 ? 00:00:00 [kworker/0:0H] root 7 2 0 22:08 ? 00:00:00 [migration/0] root 8 2 0 22:08 ? 00:00:00 [rcu_bh] root 9 2 0 22:08 ? 00:00:00 [rcu_sched] root 10 2 0 22:08 ? 00:00:00 [lru-add-drain] root 11 2 0 22:08 ? 00:00:00 [watchdog/0] root 12 2 0 22:08 ? 00:00:00 [watchdog/1] root 13 2 0 22:08 ? 00:00:00 [migration/1] root 14 2 0 22:08 ? 00:00:00 [ksoftirqd/1] root 15 2 0 22:08 ? 00:00:00 [kworker/1:0] root 16 2 0 22:08 ? 00:00:00 [kworker/1:0H] root 17 2 0 22:08 ? 00:00:00 [watchdog/2] root 18 2 0 22:08 ? 00:00:00 [migration/2] root 19 2 0 22:08 ? 00:00:00 [ksoftirqd/2] root 21 2 0 22:08 ? 00:00:00 [kworker/2:0H] root 22 2 0 22:08 ? 00:00:00 [watchdog/3] root 23 2 0 22:08 ? 00:00:00 [migration/3] root 24 2 0 22:08 ? 00:00:00 [ksoftirqd/3] root 26 2 0 22:08 ? 00:00:00 [kworker/3:0H] root 27 2 0 22:08 ? 00:00:00 [watchdog/4] root 28 2 0 22:08 ? 00:00:00 [migration/4] root 29 2 0 22:08 ? 00:00:00 [ksoftirqd/4] root 30 2 0 22:08 ? 00:00:00 [kworker/4:0] root 31 2 0 22:08 ? 00:00:00 [kworker/4:0H] root 32 2 0 22:08 ? 00:00:00 [watchdog/5] root 33 2 0 22:08 ? 00:00:00 [migration/5] root 34 2 0 22:08 ? 00:00:00 [ksoftirqd/5] root 36 2 0 22:08 ? 00:00:00 [kworker/5:0H] root 38 2 0 22:08 ? 00:00:00 [kdevtmpfs] root 39 2 0 22:08 ? 00:00:00 [netns] root 40 2 0 22:08 ? 00:00:00 [khungtaskd] root 41 2 0 22:08 ? 00:00:00 [writeback] root 42 2 0 22:08 ? 00:00:00 [kintegrityd] root 43 2 0 22:08 ? 00:00:00 [bioset] root 44 2 0 22:08 ? 00:00:00 [bioset] root 45 2 0 22:08 ? 00:00:00 [bioset] root 46 2 0 22:08 ? 00:00:00 [kblockd] root 47 2 0 22:08 ? 00:00:00 [md] root 48 2 0 22:08 ? 00:00:00 [edac-poller] root 49 2 0 22:08 ? 00:00:00 [watchdogd] root 55 2 0 22:08 ? 00:00:00 [kswapd0] root 56 2 0 22:08 ? 00:00:00 [ksmd] root 57 2 0 22:08 ? 00:00:00 [khugepaged] root 58 2 0 22:08 ? 00:00:00 [crypto] root 66 2 0 22:08 ? 00:00:00 [kthrotld] root 67 2 0 22:08 ? 00:00:00 [kworker/u12:1] root 68 2 0 22:08 ? 00:00:00 [kmpath_rdacd] root 69 2 0 22:08 ? 00:00:00 [kaluad] root 70 2 0 22:08 ? 00:00:00 [kpsmoused] root 71 2 0 22:08 ? 00:00:00 [ipv6_addrconf] root 84 2 0 22:08 ? 00:00:00 [deferwq] root 85 2 0 22:08 ? 00:00:00 [kworker/1:1] root 120 2 0 22:08 ? 00:00:00 [kauditd] root 202 2 0 22:08 ? 00:00:00 [kworker/0:2] root 235 2 0 22:08 ? 00:00:00 [kworker/3:1] root 815 2 0 22:08 ? 00:00:00 [ata_sff] root 864 2 0 22:08 ? 00:00:00 [scsi_eh_0] root 878 2 0 22:08 ? 00:00:00 [scsi_tmf_0] root 879 2 0 22:08 ? 00:00:00 [scsi_eh_1] root 881 2 0 22:08 ? 00:00:00 [scsi_tmf_1] root 925 2 0 22:08 ? 00:00:00 [kworker/u12:3] root 988 2 0 22:08 ? 00:00:00 [ttm_swap] root 1675 2 0 22:08 ? 00:00:00 [kworker/4:1H] root 1699 2 0 22:08 ? 00:00:00 [kworker/0:3] root 1832 2 0 22:08 ? 00:00:00 [kdmflush] root 1833 2 0 22:08 ? 00:00:00 [bioset] root 1847 2 0 22:08 ? 00:00:00 [kdmflush] root 1849 2 0 22:08 ? 00:00:00 [bioset] root 1867 2 0 22:08 ? 
00:00:00 [bioset] root 1869 2 0 22:08 ? 00:00:00 [xfsalloc] root 1871 2 0 22:08 ? 00:00:00 [xfs_mru_cache] root 1878 2 0 22:08 ? 00:00:00 [xfs-buf/dm-0] root 1879 2 0 22:08 ? 00:00:00 [xfs-data/dm-0] root 1880 2 0 22:08 ? 00:00:00 [xfs-conv/dm-0] root 1886 2 0 22:08 ? 00:00:00 [xfs-cil/dm-0] root 1887 2 0 22:08 ? 00:00:00 [xfs-reclaim/dm-] root 1888 2 0 22:08 ? 00:00:00 [xfs-log/dm-0] root 1890 2 0 22:08 ? 00:00:00 [xfs-eofblocks/d] root 1893 2 0 22:08 ? 00:00:00 [xfsaild/dm-0] root 1894 2 0 22:08 ? 00:00:00 [kworker/5:1H] root 1950 2 0 22:08 ? 00:00:00 [kworker/2:1] root 1967 1 0 22:08 ? 00:00:00 /usr/lib/systemd/systemd-journald root 1988 1 0 22:08 ? 00:00:00 /usr/sbin/lvmetad -f root 2000 1 0 22:08 ? 00:00:00 /usr/lib/systemd/systemd-udevd root 2864 2 0 22:08 ? 00:00:00 [kworker/3:2] root 3127 2 0 22:08 ? 00:00:00 [xfs-buf/vda1] root 3141 2 0 22:08 ? 00:00:00 [xfs-data/vda1] root 3160 2 0 22:08 ? 00:00:00 [xfs-conv/vda1] root 3183 2 0 22:08 ? 00:00:00 [xfs-cil/vda1] root 3203 2 0 22:08 ? 00:00:00 [xfs-reclaim/vda] root 3239 2 0 22:08 ? 00:00:00 [xfs-log/vda1] root 3248 2 0 22:08 ? 00:00:00 [xfs-eofblocks/v] root 3256 2 0 22:08 ? 00:00:00 [xfsaild/vda1] root 3664 2 0 22:08 ? 00:00:00 [kdmflush] root 3665 2 0 22:08 ? 00:00:00 [bioset] root 3676 2 0 22:08 ? 00:00:00 [xfs-buf/dm-2] root 3677 2 0 22:08 ? 00:00:00 [xfs-data/dm-2] root 3678 2 0 22:08 ? 00:00:00 [xfs-conv/dm-2] root 3679 2 0 22:08 ? 00:00:00 [xfs-cil/dm-2] root 3680 2 0 22:08 ? 00:00:00 [xfs-reclaim/dm-] root 3681 2 0 22:08 ? 00:00:00 [xfs-log/dm-2] root 3682 2 0 22:08 ? 00:00:00 [xfs-eofblocks/d] root 3683 2 0 22:08 ? 00:00:00 [xfsaild/dm-2] root 3701 2 0 22:08 ? 00:00:00 [kworker/1:1H] root 3717 1 0 22:08 ? 00:00:00 /sbin/auditd root 3721 2 0 22:08 ? 00:00:00 [rpciod] root 3722 2 0 22:08 ? 00:00:00 [xprtiod] root 3746 2 0 22:08 ? 00:00:00 [kworker/2:1H] root 3750 1 0 22:08 ? 00:00:00 /usr/lib/systemd/systemd-logind root 3751 1 0 22:08 ? 00:00:00 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file-flush,guest-exec,guest-exec-status -F/etc/qemu-ga/fsfreeze-hook dbus 3752 1 0 22:08 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation root 3754 1 0 22:08 ? 00:00:00 /usr/sbin/gssproxy -D chrony 3755 1 0 22:08 ? 00:00:00 /usr/sbin/chronyd root 3766 1 0 22:08 ? 00:00:00 /usr/sbin/NetworkManager --no-daemon polkitd 3767 1 0 22:08 ? 00:00:00 /usr/lib/polkit-1/polkitd --no-debug root 3768 1 0 22:08 ? 00:00:01 /sbin/rngd -f root 3769 1 0 22:08 ? 00:00:00 /usr/sbin/irqbalance --foreground libstor+ 3772 1 0 22:08 ? 00:00:00 /usr/bin/lsmd -d root 3773 1 0 22:08 ? 00:00:00 /usr/sbin/smartd -n -q never root 3774 1 0 22:08 ? 00:00:00 /usr/sbin/abrtd -d -s root 3775 1 0 22:08 ? 00:00:00 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) went negative! 
adness at NETDEV WATCHDOG ysctl table check failed : nobody cared IRQ handler type mismatch Kernel panic - not syncing: Machine Check Exception: Machine check events logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD root 3932 3766 0 22:08 ? 00:00:00 /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-130213eb-f7af-43c3-bc97-0d678529cd8d-eth0.lease -cf /var/lib/NetworkManager/dhclient-eth0.conf eth0 root 4232 1 0 22:08 ? 00:00:00 /usr/bin/python2 -Es /usr/sbin/tuned -l -P root 4234 1 0 22:08 ? 00:00:00 /usr/sbin/sshd -D root 4238 1 0 22:08 ? 00:00:00 /usr/sbin/rsyslogd -n root 4241 1 0 22:08 ? 00:00:00 /usr/bin/rhsmcertd root 4270 2 0 22:08 ? 00:00:00 [nfsiod] root 4306 2 0 22:08 ? 00:00:00 [kworker/0:1H] root 4308 2 0 22:08 ? 00:00:00 [nfsv4.1-svc] root 4327 1 0 22:08 ? 00:00:00 /usr/sbin/crond -n root 4328 1 0 22:08 ? 00:00:00 /usr/sbin/atd -f root 4344 1 0 22:08 ? 00:00:00 rhnsd root 4350 1 0 22:08 ? 00:00:02 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json --selinux-enabled --signature-verification=False --storage-driver overlay2 --mtu=1450 --add-registry registry.access.redhat.com --add-registry registry.access.redhat.com --add-registry docker.io --add-registry registry.fedoraproject.org --add-registry quay.io --add-registry registry.centos.org root 4354 2 0 22:08 ? 00:00:00 [kworker/3:1H] root 4360 1 0 22:08 tty1 00:00:00 /sbin/agetty --noclear tty1 linux root 4422 1 0 22:08 ? 00:00:00 /usr/libexec/postfix/master -w postfix 4423 4422 0 22:08 ? 00:00:00 pickup -l -t unix -u postfix 4424 4422 0 22:08 ? 00:00:00 qmgr -l -t unix -u root 4429 4350 0 22:08 ? 00:00:01 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc --runtime-args --systemd-cgroup=true root 4590 1 0 22:08 ? 00:00:00 /usr/libexec/docker/rhel-push-plugin root 10625 2 0 22:22 ? 00:00:00 [kworker/2:0] root 10637 4234 0 22:26 ? 00:00:00 sshd: root@pts/0 root 10640 10637 0 22:26 pts/0 00:00:00 -bash root 10684 2 0 22:27 ? 00:00:00 [kworker/5:2] root 10720 2 0 22:32 ? 00:00:00 [kworker/5:0] root 10765 2 0 22:34 ? 00:00:00 [kworker/4:1] root 10779 10640 0 22:42 pts/0 00:00:00 ps -efww
I also looked in journalctl on master for any ansible-related errors: nothing that I could see. I'll attach that file as well.
Created attachment 1511902 [details] 3-node-attempt-journalctl
Created attachment 1511903 [details] 3-node-attempt-inventory
Created attachment 1511904 [details] 3-node-attempt-deploy_cluster.log
Created attachment 1511905 [details] 3-node-attempt-prerequisite.log
There were also these warnings at the start of the deploy_cluster playbook:

[root@master ~]# ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml > deploy_cluster.log
[WARNING]: Could not match supplied host pattern, ignoring: oo_lb_to_config
[WARNING]: Could not match supplied host pattern, ignoring: oo_nfs_to_config
[WARNING]: Consider using yum, dnf or zypper module rather than running rpm
[WARNING]: Could not match supplied host pattern, ignoring: oo_hosts_containerized_managed_true

Everything seems to stop when docker is started.
I'll leave the VMs as they are; let me know if you need any other information.
We need logs from stopped api servers. Please on each master find containers with names starting `k8s_api_master-api...` and attach the logs to the issue
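A hedged sketch of how those logs could be collected with the docker CLI (the container-name prefix is the one mentioned above; the output file name is just an example):

    docker ps -a | grep k8s_api_master-api
    docker logs <container-id> &> master-api-container.log   # substitute the ID from the line above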
Please also attach the output of deploy playbook with `ansible-playbook -vvv`
(In reply to Vadim Rutkovsky from comment #19)
> We need logs from stopped api servers. Please on each master find containers
> with names starting `k8s_api_master-api...` and attach the logs to the issue

[root@clusterdev01 ~]# ssh master
Last login: Wed Dec 5 23:22:17 2018 from clusterdev01.lab.eng.brq.redhat.com
[root@master ~]# docker container ls
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
I think I may have made a mistake in running the three node cluster: I seem to have run the prerequisites.yml playbook twice in a row, instead of prerequisites followed by deploy_cluster.yml. Rerunning it now.
Created attachment 1511910 [details] 3-node-attempt-prerequisite.yml-vvv
Created attachment 1511911 [details] 3-node-attempt-deploy_cluster.log-vvv
I have attached the logs for running both playbooks with -vvv Docker is running on master: [root@master ~]# systemctl status docker ● docker.service - Docker Application Container Engine Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/docker.service.d └─custom.conf Active: active (running) since Wed 2018-12-05 22:08:29 CET; 1h 55min ago Docs: http://docs.docker.com Main PID: 4350 (dockerd-current) CGroup: /system.slice/docker.service ├─4350 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-p... └─4429 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/... Dec 05 22:08:27 master dockerd-current[4350]: time="2018-12-05T22:08:27.772635448+01:00" level=info msg="libcontainerd: new containerd process, pid: 4429" Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.833697556+01:00" level=info msg="Graph migration to content-addressability took 0.00 seconds" Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.834622176+01:00" level=info msg="Loading containers: start." Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.864326852+01:00" level=info msg="Firewalld running: false" Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.960846798+01:00" level=info msg="Default bridge (docker0) is assigned with an IP address 17...P address" Dec 05 22:08:28 master dockerd-current[4350]: time="2018-12-05T22:08:28.987040762+01:00" level=info msg="Loading containers: done." Dec 05 22:08:29 master dockerd-current[4350]: time="2018-12-05T22:08:29.022979241+01:00" level=info msg="Daemon has completed initialization" Dec 05 22:08:29 master dockerd-current[4350]: time="2018-12-05T22:08:29.023039877+01:00" level=info msg="Docker daemon" commit="07f3374/1.13.1" graphdriver=overlay...ion=1.13.1 Dec 05 22:08:29 master dockerd-current[4350]: time="2018-12-05T22:08:29.061630267+01:00" level=info msg="API listen on /var/run/docker.sock" Dec 05 22:08:29 master systemd[1]: Started Docker Application Container Engine. Hint: Some lines were ellipsized, use -l to show in full. However, there are no images in the docker repo.
[root@master ~]# skopeo inspect docker://openshift.example.com/openshift3/ose-deployer:v3.10
FATA[0010] pinging docker registry returned: Get https://openshift.example.com/v2/: dial tcp: lookup openshift.example.com on 192.168.0.115:53: no such host

There looks to be something wrong in the configuration...
oreg_url=openshift.example.com/openshift3/ose-${component}:${version}

Are we sure that is correct? Comment it out; that is probably the problem.
Yes, there is no such host. And I didn't write that line. The line is present in all the sample configurations provided by the docs, and does not reflect the name of any other host in the inventory of those configurations. The problem is that we are spending all our time guessing which configuration might work, then spending 2 hours running the install only to find out that it doesn't work, and trying again. As mentioned above, even the detailed 2-node example does not provide a concrete inventory file which is guaranteed to work.
(In reply to Richard Achmatowicz from comment #28)
> Yes, there is no such host. And I didn't write that line. The line is
> present in all the sample configurations provided by the docs, and does not
> reflect the name of any other host in the inventory of those configurations.

`hosts.example` has a description for it:
># Cluster Image Source (registry) configuration
># openshift-enterprise default is 'registry.redhat.io/openshift3/ose-${component}:${version}'
># origin default is 'docker.io/openshift/origin-${component}:${version}'
>#oreg_url=example.com/openshift3/ose-${component}:${version}

If you're not using a different registry for images, you don't need to uncomment it. Does it work with this line commented out?
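A quick, hedged way to locate and comment out that line in the inventory (assuming the inventory is /etc/ansible/hosts, as in the commands earlier in this report):

    grep -n '^oreg_url' /etc/ansible/hosts
    sed -i 's|^oreg_url|#oreg_url|' /etc/ansible/hosts   # comment the line out rather than deleting it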
I'll give it a try by rerunning the deploy_cluster playbook. When products have complex configurations, like OpenShift, rather than expecting the user to sift through the many, many options and come up with a correct configuration, either the default values need to work out of the box, or detailed sample configurations that are known to work need to be provided for the most common use cases. This isn't happening here.
Another piece of feedback which may or may not apply: in addition to running lots of checks for the dependencies required by OpenShift (which also seem to be faulty, judging from all the "turn off the check" issues I have seen), a sanity check for the configuration is in order. If I had a line in the inventory referencing a host that didn't exist or was not reachable, such a check would identify it before I wasted 2 hours running through the install.
I reran with that statement commented out; here is the tail from the log:

INSTALLER STATUS *********************************************************************************************************************
Initialization : Complete (0:00:14)
Health Check : In Progress (0:09:59)
This phase can be restarted by running: playbooks/openshift-checks/pre-install.yml

Failure summary:

1. Hosts: node1.openshift.example.com, node2.openshift.example.com
   Play: OpenShift Health Checks
   Task: Run health checks (install) - EL
   Message: One or more checks failed
   Details: check "docker_image_availability":
            One or more required container images are not available:
            registry.access.redhat.com/openshift3/ose-deployer:v3.10,
            registry.access.redhat.com/openshift3/ose-docker-registry:v3.10,
            registry.access.redhat.com/openshift3/ose-haproxy-router:v3.10,
            registry.access.redhat.com/openshift3/ose-pod:v3.10,
            registry.access.redhat.com/openshift3/registry-console:v3.10
            Checked with: skopeo inspect [--tls-verify=false] [--creds=<user>:<pass>] docker://<registry>/<image>

2. Hosts: master.openshift.example.com
   Play: OpenShift Health Checks
   Task: Run health checks (install) - EL
   Message: One or more checks failed
   Details: check "docker_image_availability":
            One or more required container images are not available:
            registry.access.redhat.com/openshift3/ose-control-plane:v3.10,
            registry.access.redhat.com/openshift3/ose-deployer:v3.10,
            registry.access.redhat.com/openshift3/ose-docker-registry:v3.10,
            registry.access.redhat.com/openshift3/ose-haproxy-router:v3.10,
            registry.access.redhat.com/openshift3/ose-pod:v3.10,
            registry.access.redhat.com/openshift3/registry-console:v3.10,
            registry.access.redhat.com/rhel7/etcd:3.2.22
            Checked with: skopeo inspect [--tls-verify=false] [--creds=<user>:<pass>] docker://<registry>/<image>

The execution of "/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, explicitly disable checks by setting an Ansible variable:
   openshift_disable_check=docker_image_availability
Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation. Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.
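For what it's worth, the failing check can be re-run by hand with the same skopeo invocation it reports, against any of the images listed above, for example:

    skopeo inspect --tls-verify=false docker://registry.access.redhat.com/openshift3/ose-pod:v3.10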
You need a valid subscription to pull these images. Does it work once that is set up? See https://docs.openshift.com/container-platform/3.11/getting_started/install_openshift.html#attach-subscription and the following sections for details
[root@master ~]# subscription-manager list --consumed | grep Container
Red Hat Container Images for IBM Power LE
Red Hat Container Images Beta for IBM Power LE
Red Hat Container Images
Red Hat Container Images Beta
Red Hat Container Images HTB
Red Hat OpenShift Container Platform
Red Hat Container Development Kit

I have all the hosts subscribed to the Employee SKU, which contains OpenShift Container Platform. When following the instructions in the docs to install the two-node cluster, it happened on several occasions that prerequisites.yml would hang (for 2 minutes) when installing Docker. I would Ctrl-C the playbook, run it again, it would pass, and then continue with the other instructions. It looks as though Docker is not being installed correctly, and that installation is done by the playbook.
I meant 20 minutes, not 2.
And the repos are enabled as well: subscription-manager: error: no such option: --enabled [root@master ~]# subscription-manager repos --list-enabled +----------------------------------------------------------+ Available Repositories in /etc/yum.repos.d/redhat.repo +----------------------------------------------------------+ Repo ID: rhel-7-server-ansible-2.4-rpms Repo Name: Red Hat Ansible Engine 2.4 RPMs for Red Hat Enterprise Linux 7 Server Repo URL: https://cdn.redhat.com/content/dist/rhel/server/7/7Server/$basearch/ansible/2.4/os Enabled: 1 Repo ID: rhel-7-server-extras-rpms Repo Name: Red Hat Enterprise Linux 7 Server - Extras (RPMs) Repo URL: https://cdn.redhat.com/content/dist/rhel/server/7/7Server/$basearch/extras/os Enabled: 1 Repo ID: rhel-7-server-rpms Repo Name: Red Hat Enterprise Linux 7 Server (RPMs) Repo URL: https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/os Enabled: 1 Repo ID: rhel-7-server-ose-3.10-rpms Repo Name: Red Hat OpenShift Container Platform 3.10 (RPMs) Repo URL: https://cdn.redhat.com/content/dist/rhel/server/7/7Server/$basearch/ose/3.10/os Enabled: 1
Another note: I seem to remember on one of the first installs that this message was appearing, I looked up the issue on the Internet and found that many people were finding that the check for docker images was not working correctly and they were turning off the check: https://github.com/openshift/openshift-ansible/issues/7721 And it still seems to be a problem: [root@master ~]# skopeo inspect --tls-verify=false docker://docker.io/cockpit/kubernetes:latest { "Name": "docker.io/cockpit/kubernetes", "Digest": "sha256:f38c7b0d2b85989f058bf78c1759bec5b5d633f26651ea74753eac98f9e70c9b", "RepoTags": [ "latest", "wip" ], "Created": "2018-10-12T11:48:09.409602119Z", "DockerVersion": "18.03.1-ee-1-tp5", "Labels": null, "Architecture": "amd64", "Os": "linux", "Layers": [ "sha256:565884f490d9ec697e519c57d55d09e268542ef2c1340fd63262751fa308f047", "sha256:dd655a54c8454c55e559be4c07218ffee8390476583a137b98c0a0592101cfc9", "sha256:b715fff0ceb39ce45574c981aa8b29dbc94be81770735b55743b587501cef41f", "sha256:1dcd53aa9421189eadac2b599717a6b8234563a48257297c6be14d392262eb2e" ] }
Ok, it seems that am back back to where I was when I started, but a little wiser. With regard to the inability of the playbooks to correctly install software, the docker_image_availability indicates that the playbook failed to download the docker images. Jean-Frederic also ran into this problem and suggested I download the images myself. Wrote a script to do this. Execution of the script took 6 minutes to download all of the images from registry.access.redhat.com. Wonder why I can do it and the playbook cannot? Can't say for the moment. Rerunning the deploy_cluster playbook took me back to the same position I was in when trying to install on the two node cluster, which happens to be the subject of this issue: cannot load control plan pods. There is no server called master.openshift.example.com:8443 running when this operation is attempted. So this is the next target. However, this time I did see some relevant processes: root 46423 1 1 19:59 ? 00:00:04 /usr/bin/hyperkube kubelet --v=2 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=5m --authorization-mode=Webhook --authorization-webhook-cache-authorized-ttl=5m --authorization-webhook-cache-unauthorized-ttl=5m --bootstrap-kubeconfig=/etc/origin/node/bootstrap.kubeconfig --cadvisor-port=0 --cert-dir=/etc/origin/node/certificates --cgroup-driver=systemd --client-ca-file=/etc/origin/node/client-ca.crt --cluster-dns=192.168.122.97 --cluster-domain=cluster.local --container-runtime-endpoint=/var/run/dockershim.sock --containerized=false --enable-controller-attach-detach=true --experimental-dockershim-root-directory=/var/lib/dockershim --fail-swap-on=false --feature-gates=RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true --file-check-frequency=0s --healthz-bind-address= --healthz-port=0 --host-ipc-sources=api --host-ipc-sources=file --host-network-sources=api --host-network-sources=file --host-pid-sources=api --host-pid-sources=file --hostname-override= --http-check-frequency=0s --image-service-endpoint=/var/run/dockershim.sock --iptables-masquerade-bit=0 --kubeconfig=/etc/origin/node/node.kubeconfig --max-pods=250 --network-plugin=cni --node-ip= --pod-infra-container-image=registry.access.redhat.com/openshift3/ose-pod:v3.10.72 --pod-manifest-path=/etc/origin/node/pods --port=10250 --read-only-port=0 --register-node=true --root-dir=/var/lib/origin/openshift.local.volumes --rotate-certificates=true --tls-cert-file= --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA --tls-cipher-suites=TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-cipher-suites=TLS_RSA_WITH_AES_256_GCM_SHA384 --tls-cipher-suites=TLS_RSA_WITH_AES_128_CBC_SHA --tls-cipher-suites=TLS_RSA_WITH_AES_256_CBC_SHA --tls-min-version=VersionTLS12 --tls-private-key-file= and root 46904 46866 0 19:59 ? 
00:00:02 openshift start master controllers --config=/etc/origin/master/master-config.yaml --listen=https://0.0.0.0:8444 --loglevel=2
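A minimal sketch of the kind of pre-pull script mentioned in the comment above (the image names and tags are taken from the docker_image_availability failure earlier in this report; adjust as needed):

    #!/bin/bash
    # Pre-pull the OCP 3.10 images that the docker_image_availability check reported as unavailable.
    for image in \
        openshift3/ose-control-plane:v3.10 \
        openshift3/ose-deployer:v3.10 \
        openshift3/ose-docker-registry:v3.10 \
        openshift3/ose-haproxy-router:v3.10 \
        openshift3/ose-pod:v3.10 \
        openshift3/registry-console:v3.10 \
        rhel7/etcd:3.2.22; do
        docker pull "registry.access.redhat.com/${image}"
    done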
[root@master ~]# ps -ef | grep master
root      4405      1  0 Dec06 ?      00:00:00 /usr/libexec/postfix/master -w
root      6067 127735  0 09:56 pts/0  00:00:00 grep --color=auto master
root     46904  46866  0 Dec06 ?      00:05:19 openshift start master controllers --config=/etc/origin/master/master-config.yaml --listen=https://0.0.0.0:8444 --loglevel=2

why 8444?
I have no idea. It must be coming from some default in the master inventory used; my inventory does not contain that port. By the way, I have started re-reading Ansible Up and Running (which I read about a year ago) to understand the playbooks. In executing the examples in that book (which, for example, install nginx on a VM), Ansible *will* report incorrect results from the server: I had one small error in the nginx config file (a missing semicolon), the nginx server did not start correctly, but Ansible reported that all was well. It took hunting through the server logs for that particular module (service) to find the problem. So, with these playbooks that involve literally hundreds of tasks, there is a lot of scope for error. This is probably what is happening here.
This may be the reason for the appearance of 8444: when we have two httpd servers set up on the same host, the first server listens on 8080/8443 and the second listens on 8081/8444.
The master MUST be using 8443; you have to remove whatever is running on that port before starting the installation. Note that putting 8444 in the kube config doesn't help:
+++
[root@master ~]# oc config get-clusters
NAME
master-openshift-example-com:8444
[root@master ~]# oc get nodes
Error from server (InternalError): an error on the server ("Internal Server Error: \"/api/v1/nodes?limit=500\": Post https://master.openshift.example.com:8443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: dial tcp 192.168.0.114:8443: getsockopt: connection refused") has prevented the request from succeeding (get nodes)
[root@master ~]# kube config view
bash: kube: command not found
[root@master ~]# kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://master.openshift.example.com:8444
  name: master-openshift-example-com:8444
+++
Proposed - https://github.com/openshift/openshift-ansible/pull/10884
The stage I'm at now is this:
- the service atomic-openshift-node is getting started
- it is executing the command /usr/bin/hyperkube kubelet with a lot of parameters (many of which are flagged as deprecated and should instead be in a --config file)
- the --pod-manifest-path correctly identifies the directory containing the three configured static pods (apiserver.yaml, controller.yaml and etcd.yaml)
- the images for those pods are available in the host's docker registry (ose-control-plane and etcd):

[root@master pods]# docker image ls
REPOSITORY                                                  TAG      IMAGE ID       CREATED       SIZE
registry.access.redhat.com/openshift3/ose-node              v3.10    8f2fe66622e7   9 days ago    1.28 GB
registry.access.redhat.com/openshift3/ose-haproxy-router    v3.10    26a21dbbf59a   9 days ago    812 MB
registry.access.redhat.com/openshift3/ose-deployer          v3.10    f3007f9bef28   9 days ago    792 MB
registry.access.redhat.com/openshift3/ose-control-plane     v3.10    d4f29f0adefd   9 days ago    792 MB
registry.access.redhat.com/openshift3/ose-docker-registry   v3.10    51b02d4237d5   9 days ago    288 MB
registry.access.redhat.com/openshift3/ose-pod               v3.10    8fbf8b6e3a44   9 days ago    217 MB
registry.access.redhat.com/openshift3/registry-console      v3.10    963d6b3f6e5a   9 days ago    235 MB
registry.access.redhat.com/rhel7/etcd                       3.2.22   635bb36d7fc7   3 weeks ago   259 MB

- so, the API server and the controller should be getting started correctly
- but when the service starts, these pods don't get started at all

I found someone who had a similar issue, and he concluded that the problem was with pods starting very slowly: https://github.com/kubernetes/website/issues/4166 That is a possibility here, given that the image downloads also failed due to slowness. Two options I'll try tomorrow on hyperkube:
--image-pull-progress-deadline: this value is apparently set to 1 minute, so any image pull taking longer than 1 minute is cancelled (but the images are there)
--v=n: increase this to get more information out of hyperkube and what it is doing
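Before changing any kubelet flags, a few hedged checks against the paths and service named above can show whether the static pods are even being attempted:

    ls -l /etc/origin/node/pods/                                   # should list apiserver.yaml, controller.yaml, etcd.yaml
    journalctl -u atomic-openshift-node --no-pager | grep -iE 'static|apiserver|pull' | tail -50
    docker ps -a | grep -E 'apiserver|controllers|etcd'            # were the static pod containers ever created?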
For example, here are the journalctl entries for the failed startup of the service atomic-openshift-node (see journalctl-attempt#2-startup-problems). There are several warning messages which may be related to the startup problem:
- lots of deprecated flags being passed by Ansible to hyperkube; it's unclear which options are being accepted and which are not
- a warning about being unable to update the cni networking configuration
- lack of a cloud provider being specified
I'm not sure what these mean at the moment, but will investigate.
Created attachment 1514386 [details] journalctl messages from startup of failing service atomic-openshift-node
By the way, many people have the same issue: https://github.com/kubernetes/kubernetes/issues/54918 Some of the solutions that did work involved restarting the network.
I spun up two instances on AWS and ran through the installation instructions [1]. The install completed successfully. I noticed the following items in the instructions that could have posed a problem. 1. The following two lines should not be part of a default inventory. I read through the commit history for those lines and found they were added because of customer cases where there was a bug in OSE which caused the creation of improper registry URLs. Please ensure these lines are commented out in a default inventory. #oreg_url=example.com/openshift3/ose-${component}:${version} #openshift_examples_modify_imagestreams=true 2. After performing `yum update`, I rebooted my hosts. It is expected that after running `yum update` hosts will be rebooted to ensure all services are running as expected prior to installing OCP. The reboot is mentioned in the Host Preparation [2] steps. Please reboot any hosts after `yum update` has been performed. 3. Before running the playbooks, change directory to the openshift-ansible directory to ensure the provided ansible.cfg is used when executing playbooks. $ cd /usr/share/ansible/openshift-ansible Please attempt an install with two fresh hosts and follow the items mentioned above to see if this resolves your issues. OCP is a complex platform with many different features and capabilities. It is recommended to read the documentation to fully understand settings and requirements. I will investigate updating the documentation for the items mentioned above. [1] https://docs.openshift.com/container-platform/3.10/getting_started/install_openshift.html [2] https://docs.openshift.com/container-platform/3.10/install/host_preparation.html#installing-base-packages Host file in use: $ cat /etc/ansible/hosts # Create an OSEv3 group that contains the masters, nodes, and etcd groups [OSEv3:children] masters nodes etcd # Set variables common for all OSEv3 hosts [OSEv3:vars] # SSH user, this user should allow ssh based auth without requiring a password ansible_ssh_user=ec2-user # If ansible_ssh_user is not root, ansible_become must be set to true ansible_become=true openshift_deployment_type=openshift-enterprise #oreg_url=example.com/openshift3/ose-${component}:${version} #openshift_examples_modify_imagestreams=true # uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider #openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}] # testing deployment nodes don't meet these requirements openshift_disable_check=disk_availability,memory_availability # host group for masters [masters] host1.compute-1.amazonaws.com # host group for etcd [etcd] host1.compute-1.amazonaws.com # host group for nodes, includes region info [nodes] host1.compute-1.amazonaws.com openshift_node_group_name='node-config-master-infra' host2.compute-1.amazonaws.com openshift_node_group_name='node-config-compute'
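For item 3, the run sequence would look something like this (a sketch; the inventory path is assumed to be /etc/ansible/hosts, matching the host file above):

    cd /usr/share/ansible/openshift-ansible
    ansible-playbook -i /etc/ansible/hosts playbooks/prerequisites.yml
    ansible-playbook -i /etc/ansible/hosts playbooks/deploy_cluster.yml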
Proposed docs update: https://github.com/openshift/openshift-docs/pull/13128
Russell

Thanks for taking the time to test out the install on AWS and pointing out the possible sources of error in running the installer. OK, I have re-run the install using the same hosts, after using the uninstall script (it takes me a while to set up new hosts, so I decided to try with just an uninstall - it's late Friday :-)).

Changes to my usual procedure:
1. Checking my inventory, the line for modifying example image streams was uncommented, so I commented it out.
2. I rebooted all hosts involved in the install.
3. I executed the playbooks from the /usr/share/ansible/openshift-ansible directory.

Observations:
1. I noticed immediately that the scripts were behaving differently. Changing to the openshift-ansible directory picked up a different ansible.cfg file and changed things dramatically (e.g. seeing time and date stamps on each task execution, whereas previously I did not).
2. When the service atomic-openshift-node was started, I immediately saw a lot more processes running, in particular the api and controllers processes, as well as some mux-related things. This is in the attached file.
3. However, still the same problem with not being able to connect to the master.openshift.example.com:8443 server. This is in the attached file.

Remarks:
1. Please make a small note in the docs that changing to that directory is done in order to pick up the correct ansible.cfg file. There are too many people (like me) who will execute the same command from the current directory using a FQPN for simplicity's sake if they are not warned otherwise. Or provide a long-form alternative by specifying the cfg file via a command-line option (as is done with the inventory). The time lost is too great not to mention this simple "mistake".

So, progress, but still not there yet. I'll have to continue next week trying this on fresh hosts.
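On the remark about a long-form alternative: there is no command-line flag for the config file, but ansible-playbook does honor the ANSIBLE_CONFIG environment variable, so an equivalent of changing directory would be something like this (a sketch, assuming the shipped ansible.cfg lives in the openshift-ansible directory):

    ANSIBLE_CONFIG=/usr/share/ansible/openshift-ansible/ansible.cfg \
        ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml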
Hi Russell

Thanks for taking the time to look into this. I really appreciate it.

The clusterdev lab is composed of 5 physical hosts, and we create VMs on each of those physical hosts for running software (testing frameworks, OpenShift clusters, etc). The 5 hosts are connected by a Juniper switch and communicate with each other over a private internal network (192.168.0.1xy) where x = {1..5} and y = {1..5}. Each physical host and each VM also has a public interface to the outside world (these are the 192.168.111.211 interfaces, for example). In my inventory, I have specified hostnames (master, node1, node2) which correspond to interfaces on the private internal network only.

I noticed in the logs that the installer was binding to some of the public interfaces, which might be due to the fact that the public interface on the VMs is usually eth0, the first non-loopback interface, and sometimes network software will use this as a way of choosing an interface that is not otherwise specified. I also noted binding to 0.0.0.0, which is generally good to avoid.
Richard,

The best guidance I have at this time is to make sure the hostname of the host resolves to the first interface. If you continue to have problems with installation, please open a new bug specifically about installation on multi-homed hosts. This bug will be used to track the documentation issues already addressed above.
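A hedged way to verify that mapping on each host before kicking off another install:

    hostname -f
    getent hosts "$(hostname -f)"    # should resolve to the IP of the interface OpenShift should use
    ip -4 -o addr show               # compare against the addresses on each interface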
The PR is not merged to 3.10; https://github.com/openshift/openshift-ansible/pull/10884 needs to be cherry-picked to 3.10.
PR for 3.10 https://github.com/openshift/openshift-ansible/pull/10917
PR merged.
Fixed. openshift-ansible-docs-3.10.98-1.git.0.198012d.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2509