Description of problem:
When instances are created in a VPC whose subnet is 10.128.0.0/24 (or any range that conflicts with the network plugin's network), the node service cannot start, and the installation fails without an obvious error message. This is a really bad user experience.

Version-Release number of selected component (if applicable):
openshift-ansible: 3.6.140-1.git.0.4a02427.el7
openshift: 3.6.140-1.git.0.a7c42e0.el7

How reproducible:
Always

Steps to Reproduce:
1. Create a VPC network.
   gcloud compute networks create vpc-lxia --mode custom
2. Create a subnet in the previously created network.
   gcloud compute networks subnets create sn-vpc-lxia \
       --network vpc-lxia \
       --region "us-central1" \
       --range 10.128.0.0/24
3. Add firewall rules for this network.
   gcloud compute firewall-rules create fw-lxia-allow \
       --network vpc-lxia \
       --allow "tcp:22,icmp,tcp:15441" \
       --source-ranges 0.0.0.0/0
4. Try to set up an OCP cluster with instances in the above network using openshift-ansible.

Actual results:
Installation failed at task "openshift_node : Start and enable node"

Expected results:
At minimum, the installation should fail with a clear error explaining why it failed. A better solution would be for the installer to dynamically select a network that does not conflict with the nodes' network.

Additional info:
===============================================================================
TASK [openshift_node : Start and enable node] **********************************
Wednesday 12 July 2017 09:17:13 +0000 (0:00:00.176) 0:12:32.830 ********
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
fatal: [qe-lxia-master-1.0712-bfj.qe.rhcloud.com]: FAILED! => {
    "attempts": 1,
    "changed": false,
    "failed": true
}

MSG:

Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code.
See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

fatal: [qe-lxia-node-registry-router-1.0712-bfj.qe.rhcloud.com]: FAILED! => {
    "attempts": 1,
    "changed": false,
    "failed": true
}

MSG:

Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code.
See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
===============================================================================
[root@qe-lxia-master-1 ~]# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2017-07-12 05:21:36 EDT; 4s ago
     Docs: https://github.com/openshift/origin
  Process: 21586 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 21584 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 21575 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
  Process: 21573 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 21572 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 21575 (code=exited, status=255)

Jul 12 05:21:36 qe-lxia-master-1 systemd[1]: Failed to start OpenShift Node.
Jul 12 05:21:36 qe-lxia-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jul 12 05:21:36 qe-lxia-master-1 systemd[1]: atomic-openshift-node.service failed.
===============================================================================
[root@qe-lxia-master-1 ~]# journalctl -xe
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: Starting OpenShift Node...
-- Subject: Unit atomic-openshift-node.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-node.service has begun starting up.
Jul 12 05:29:13 qe-lxia-master-1 dnsmasq[19546]: setting upstream servers from DBus
Jul 12 05:29:13 qe-lxia-master-1 dnsmasq[19546]: using nameserver 127.0.0.1#53 for domain cluster.local
Jul 12 05:29:13 qe-lxia-master-1 dnsmasq[19546]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Jul 12 05:29:13 qe-lxia-master-1 dnsmasq[19546]: using nameserver 169.254.169.254#53
Jul 12 05:29:13 qe-lxia-master-1 atomic-openshift-node[22919]: I0712 05:29:13.927750 22919 start_node.go:251] Reading node configuration from /etc/origin/node/node-config.yaml
Jul 12 05:29:13 qe-lxia-master-1 atomic-openshift-master[16253]: I0712 05:29:13.975880 16253 handler.go:146] kube-apiserver: GET "/oapi/v1/clusternetworks/default" satisfied by gorestful with webservice /oapi/v1
Jul 12 05:29:13 qe-lxia-master-1 atomic-openshift-master[16253]: I0712 05:29:13.977156 16253 wrap.go:42] GET /oapi/v1/clusternetworks/default: (1.660894ms) 404 [[openshift/v3.6.140 (linux/amd64) openshift/a7c42e0] 10.128.0.2:50550]
Jul 12 05:29:13 qe-lxia-master-1 atomic-openshift-node[22919]: F0712 05:29:13.978416 22919 start_node.go:140] master has not created a default cluster network, network plugin "redhat/openshift-ovs-subnet" can not start
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit atomic-openshift-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-node.service has failed.
--
-- The result is failed.
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: atomic-openshift-node.service failed.
===============================================================================
[root@qe-lxia-master-1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc pfifo_fast state UP qlen 1000
    link/ether 42:01:0a:80:00:02 brd ff:ff:ff:ff:ff:ff
    inet 10.128.0.2/32 brd 10.128.0.2 scope global dynamic eth0
       valid_lft 84461sec preferred_lft 84461sec
    inet6 fe80::4001:aff:fe80:2/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:fa:bf:1c:66 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
[root@qe-lxia-master-1 ~]# ip r
default via 10.128.0.1 dev eth0 proto static metric 100
10.128.0.1 dev eth0 proto dhcp scope link metric 100
10.128.0.2 dev eth0 proto kernel scope link src 10.128.0.2 metric 100
169.254.169.254 via 10.128.0.1 dev eth0 proto dhcp metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
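
For the "better solution" under Expected results, here is a minimal sketch of how a non-conflicting cluster network might be chosen. It is not openshift-ansible code. The function name pick_cluster_network and the candidate list are hypothetical, and 10.128.0.0/14 is assumed to be the default SDN cluster network. The host subnets are the VPC subnet and the docker0 range from this report.

```python
import ipaddress


def pick_cluster_network(host_subnets, candidates):
    """Return the first candidate CIDR that overlaps none of the host subnets."""
    hosts = [ipaddress.ip_network(s) for s in host_subnets]
    for cidr in candidates:
        net = ipaddress.ip_network(cidr)
        if not any(net.overlaps(h) for h in hosts):
            return net
    raise ValueError("no candidate cluster network is free of conflicts")


# Subnets seen on the hosts in this report (VPC subnet and docker0 bridge).
host_subnets = ["10.128.0.0/24", "172.17.0.0/16"]

# Hypothetical candidates; the first one is the assumed SDN default.
candidates = ["10.128.0.0/14", "10.132.0.0/14", "10.136.0.0/14"]

print(pick_cluster_network(host_subnets, candidates))  # prints 10.132.0.0/14
```

The assumed default is skipped because it contains the VPC subnet, so the next candidate is returned.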
I'm pretty sure we already have an RFE bug for this: the installer should check whether the SDN CIDR or Services CIDR conflicts with the network the hosts are in and abort early if it does. I'll look for that bug. Moving to 3.7.
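
A rough sketch of what such an early check could look like, only to illustrate the idea. The function name find_cidr_conflicts is hypothetical, and the two CIDRs are assumed defaults for the SDN and services networks, not values read from this installation. The host address is the eth0 address from the report.

```python
import ipaddress


def find_cidr_conflicts(host_addresses, reserved_cidrs):
    """Return (address, name, cidr) for every host address inside a reserved CIDR."""
    conflicts = []
    for addr in host_addresses:
        ip = ipaddress.ip_address(addr)
        for name, cidr in reserved_cidrs.items():
            if ip in ipaddress.ip_network(cidr):
                conflicts.append((addr, name, cidr))
    return conflicts


# Assumed defaults; a real check would read them from the inventory.
reserved = {
    "SDN cluster network": "10.128.0.0/14",
    "services network": "172.30.0.0/16",
}

# eth0 address of qe-lxia-master-1 from the "ip a" output in the description.
for addr, name, cidr in find_cidr_conflicts(["10.128.0.2"], reserved):
    print("host address %s conflicts with %s %s" % (addr, name, cidr))
```

An installer could run a check like this before any node task and abort with the conflict message instead of failing later at "Start and enable node".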
There appear to be no active cases related to this bug. As such we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.