Bug 1470037

Summary: Node service cannot start when the instance is in a VPC whose subnet conflicts with the network plugin's network
Product: OpenShift Container Platform Reporter: Liang Xia <lxia>
Component: Installer Assignee: Scott Dodson <sdodson>
Status: CLOSED WONTFIX QA Contact: Johnny Liu <jialiu>
Severity: low Docs Contact:
Priority: medium    
Version: 3.6.0 CC: aos-bugs, bleanhar, jokerman, mmccomas
Target Milestone: ---   
Target Release: 3.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-18 18:16:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Liang Xia 2017-07-12 10:20:08 UTC
Description of problem:
When instances are created in a VPC whose subnet is 10.128.0.0/24
(or any range that conflicts with the network plugin's network),
the node service cannot start, so the installation fails without an obvious error message. This is a really bad user experience.

Version-Release number of selected component (if applicable):
openshift-ansible: 3.6.140-1.git.0.4a02427.el7
openshift: 3.6.140-1.git.0.a7c42e0.el7

How reproducible:
Always

Steps to Reproduce:
1. Create a VPC network.
gcloud compute networks create vpc-lxia --mode custom
2. Create a subnet in the previously created network.
gcloud compute networks subnets create sn-vpc-lxia \
          --network vpc-lxia \
          --region "us-central1" \
          --range 10.128.0.0/24
3. Add firewall rules for this network.
gcloud compute firewall-rules create fw-lxia-allow \
        --network vpc-lxia \
        --allow "tcp:22,icmp,tcp:15441" \
        --source-ranges 0.0.0.0/0
4. Try to set up an OCP cluster with instances in the above network using openshift-ansible.

Actual results:
Installation failed at task "openshift_node : Start and enable node"

Expected results:
At a minimum, the installation should fail with a clear error message explaining why it failed.
A better solution is for the installation to dynamically select a network that does not conflict with the node's network.
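
The "dynamically select a network" idea could look roughly like the sketch below. It is only an illustration, not how openshift-ansible works: the function name pick_cluster_network and the candidate list are hypothetical, and 10.128.0.0/14 is assumed to be the default cluster network for this release. The host subnet is the one from the steps to reproduce.

```python
import ipaddress


def pick_cluster_network(host_cidrs, candidates):
    """Return the first candidate network that overlaps none of the host networks.

    Hypothetical helper for illustration; not part of openshift-ansible.
    """
    hosts = [ipaddress.ip_network(cidr) for cidr in host_cidrs]
    for candidate in candidates:
        net = ipaddress.ip_network(candidate)
        if not any(net.overlaps(host) for host in hosts):
            return net
    raise ValueError("no candidate cluster network is free of conflicts")


# Subnet taken from the reproduction steps.
HOST_SUBNETS = ["10.128.0.0/24"]
# Illustrative candidates; the first is the assumed default cluster network.
CANDIDATES = ["10.128.0.0/14", "10.132.0.0/14", "172.20.0.0/14"]

chosen = pick_cluster_network(HOST_SUBNETS, CANDIDATES)
print(chosen)
```

With the subnet from this report, the first candidate is rejected because 10.128.0.0/24 lies inside 10.128.0.0/14, so the next candidate is returned.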


Additional info:
===============================================================================
TASK [openshift_node : Start and enable node] **********************************
Wednesday 12 July 2017  09:17:13 +0000 (0:00:00.176)       0:12:32.830 ******** 

FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).

FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).

fatal: [qe-lxia-master-1.0712-bfj.qe.rhcloud.com]: FAILED! => {
    "attempts": 1, 
    "changed": false, 
    "failed": true
}

MSG:

Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.


fatal: [qe-lxia-node-registry-router-1.0712-bfj.qe.rhcloud.com]: FAILED! => {
    "attempts": 1, 
    "changed": false, 
    "failed": true
}

MSG:

Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

===============================================================================

[root@qe-lxia-master-1 ~]# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2017-07-12 05:21:36 EDT; 4s ago
     Docs: https://github.com/openshift/origin
  Process: 21586 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 21584 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 21575 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
  Process: 21573 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 21572 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 21575 (code=exited, status=255)

Jul 12 05:21:36 qe-lxia-master-1 systemd[1]: Failed to start OpenShift Node.
Jul 12 05:21:36 qe-lxia-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jul 12 05:21:36 qe-lxia-master-1 systemd[1]: atomic-openshift-node.service failed.

===============================================================================

[root@qe-lxia-master-1 ~]# journalctl -xe
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: Starting OpenShift Node...
-- Subject: Unit atomic-openshift-node.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit atomic-openshift-node.service has begun starting up.
Jul 12 05:29:13 qe-lxia-master-1 dnsmasq[19546]: setting upstream servers from DBus
Jul 12 05:29:13 qe-lxia-master-1 dnsmasq[19546]: using nameserver 127.0.0.1#53 for domain cluster.local
Jul 12 05:29:13 qe-lxia-master-1 dnsmasq[19546]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Jul 12 05:29:13 qe-lxia-master-1 dnsmasq[19546]: using nameserver 169.254.169.254#53
Jul 12 05:29:13 qe-lxia-master-1 atomic-openshift-node[22919]: I0712 05:29:13.927750   22919 start_node.go:251] Reading node configuration from /etc/origin/node/node-config.yaml
Jul 12 05:29:13 qe-lxia-master-1 atomic-openshift-master[16253]: I0712 05:29:13.975880   16253 handler.go:146] kube-apiserver: GET "/oapi/v1/clusternetworks/default" satisfied by gorestful with webservice /oapi/v1
Jul 12 05:29:13 qe-lxia-master-1 atomic-openshift-master[16253]: I0712 05:29:13.977156   16253 wrap.go:42] GET /oapi/v1/clusternetworks/default: (1.660894ms) 404 [[openshift/v3.6.140 (linux/amd64) openshift/a7c42e0] 10.128.0.2:50550]
Jul 12 05:29:13 qe-lxia-master-1 atomic-openshift-node[22919]: F0712 05:29:13.978416   22919 start_node.go:140] master has not created a default cluster network, network plugin "redhat/openshift-ovs-subnet" can not start
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit atomic-openshift-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit atomic-openshift-node.service has failed.
-- 
-- The result is failed.
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jul 12 05:29:13 qe-lxia-master-1 systemd[1]: atomic-openshift-node.service failed.

===============================================================================

[root@qe-lxia-master-1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc pfifo_fast state UP qlen 1000
    link/ether 42:01:0a:80:00:02 brd ff:ff:ff:ff:ff:ff
    inet 10.128.0.2/32 brd 10.128.0.2 scope global dynamic eth0
       valid_lft 84461sec preferred_lft 84461sec
    inet6 fe80::4001:aff:fe80:2/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN 
    link/ether 02:42:fa:bf:1c:66 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

[root@qe-lxia-master-1 ~]# ip r
default via 10.128.0.1 dev eth0  proto static  metric 100 
10.128.0.1 dev eth0  proto dhcp  scope link  metric 100 
10.128.0.2 dev eth0  proto kernel  scope link  src 10.128.0.2  metric 100 
169.254.169.254 via 10.128.0.1 dev eth0  proto dhcp  metric 100 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1

Comment 1 Scott Dodson 2017-07-12 12:34:47 UTC
I'm pretty sure we have a bug for this already where it's an RFE that the installer checks to see if the SDN CIDR or Services CIDR conflicts with the network the hosts are in and aborts early. I'll look for that, moving to 3.7.
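
For reference, the kind of early check described here might be sketched as follows. This is a rough illustration under stated assumptions, not the installer's actual code: find_conflicts is a hypothetical helper, and the two CIDRs are assumed defaults for the SDN and Services networks that should be replaced with whatever the inventory actually sets. The host address is the eth0 address from the "ip a" output in the description.

```python
import ipaddress

# Assumed default values; replace with the CIDRs configured in the inventory.
RESERVED = {
    "SDN CIDR": "10.128.0.0/14",
    "Services CIDR": "172.30.0.0/16",
}


def find_conflicts(host_addrs, reserved):
    """Return (address, name, cidr) for every host address inside a reserved network.

    Hypothetical helper for illustration; not part of openshift-ansible.
    """
    conflicts = []
    for addr in host_addrs:
        ip = ipaddress.ip_address(addr)
        for name, cidr in reserved.items():
            if ip in ipaddress.ip_network(cidr):
                conflicts.append((addr, name, cidr))
    return conflicts


# eth0 address of the master reported in the description.
conflicts = find_conflicts(["10.128.0.2"], RESERVED)
for addr, name, cidr in conflicts:
    print(f"host address {addr} lies inside the {name} ({cidr}); aborting early")
```

An installer could run a check like this before starting any services and stop with the printed message instead of failing later at "Start and enable node".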

Comment 2 Scott Dodson 2019-02-18 18:16:43 UTC
There appear to be no active cases related to this bug. As such we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.