Description of problem:
kuryr-daemon connects to the K8s API through an LB. If the connection fails, it keeps trying to reconnect for the period defined by watch_retry_timeout (60 seconds by default). Sometimes this value is too short because the LB needs more time to become ready and responsive. In that case the watcher thread dies and the connection to the K8s API is never re-established. watch_retry_timeout should be increased for kuryr-cni.

Version-Release number of the following components:
$ rpm -q openshift-ansible
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch

$ rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch

$ ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cloud-user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:
Depends on LB/container creation timing

Steps to Reproduce:
1. Deploy OpenStack (OSP13)
2. Deploy a DNS server and the Ansible host in the overcloud
3. Download OCP rpm and configure:
   - OpenStack (inventory/group_vars/all.yml)
     . Configure Kuryr SDN
   - OpenShift (inventory/group_vars/OSEv3.yml)
     . Configure the Red Hat LDAP identity provider
4. Install OpenShift by running the playbooks for OpenStack (deployed 3 masters, 2 infra and 2 app nodes) and verify the installer succeeds without any errors.
5. Check all pods are Running

Actual results:
No watch_retry_timeout is defined in kuryr-cni.conf

Expected results:
watch_retry_timeout is defined in kuryr-cni.conf under the [kubernetes] section, with a value greater than 60 seconds.
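The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not kuryr's actual watcher code: the function and class names (watch_with_retry, FakeClock, simulate) and the 5-second retry interval are hypothetical, and a fake clock stands in for real time so the example runs instantly. It shows that a retry window of 60 seconds gives up when the LB needs 90 seconds to become ready, while a longer window recovers.

```python
import time


def watch_with_retry(connect, retry_timeout=60, interval=5,
                     clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or retry_timeout seconds have elapsed.

    Hypothetical stand-in for a watcher's reconnect loop. Once the deadline
    passes, the last ConnectionError is re-raised and nothing retries again.
    """
    deadline = clock() + retry_timeout
    while True:
        try:
            return connect()
        except ConnectionError:
            if clock() >= deadline:
                raise  # retry window exhausted: the watcher gives up for good
            sleep(interval)


class FakeClock:
    """Deterministic clock so the simulation does not really wait."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def simulate(retry_timeout, lb_ready_at):
    """Simulate an LB that only starts answering lb_ready_at seconds in."""
    clock = FakeClock()

    def connect():
        if clock.time() < lb_ready_at:
            raise ConnectionError("LB not ready yet")
        return "connected"

    try:
        return watch_with_retry(connect, retry_timeout, interval=5,
                                clock=clock.time, sleep=clock.sleep)
    except ConnectionError:
        return "watcher died"


if __name__ == "__main__":
    print(simulate(retry_timeout=60, lb_ready_at=90))
    print(simulate(retry_timeout=3600, lb_ready_at=90))
```

With the default-sized window the simulated watcher gives up before the LB answers. Raising the window only costs extra waiting in the failure case, which is why a larger watch_retry_timeout is a low-risk fix here.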
https://github.com/openshift/openshift-ansible/pull/8952 release-3.10 backport already merged
Verified in openshift-ansible-3.10.59-1.git.0.f9ba890.el7.noarch on OSP 13 2018-10-02.1 puddle.

The OCP on OSP installation playbooks complete successfully and all pods are in Running status.

$ oc get pods --all-namespaces -o wide
NAMESPACE         NAME                                                READY   STATUS    RESTARTS   AGE   IP              NODE
default           docker-registry-1-gp9qg                             1/1     Running   0          5h    10.11.0.7       infra-node-0.openshift.example.com
default           registry-console-1-pfrcm                            1/1     Running   0          5h    10.11.0.18      master-0.openshift.example.com
default           router-1-hs54c                                      1/1     Running   0          5h    192.168.99.5    infra-node-0.openshift.example.com
kube-system       master-api-master-0.openshift.example.com           1/1     Running   1          5h    192.168.99.14   master-0.openshift.example.com
kube-system       master-controllers-master-0.openshift.example.com   1/1     Running   1          5h    192.168.99.14   master-0.openshift.example.com
kube-system       master-etcd-master-0.openshift.example.com          1/1     Running   1          5h    192.168.99.14   master-0.openshift.example.com
openshift-infra   kuryr-cni-ds-27tvb                                  2/2     Running   0          5h    192.168.99.14   master-0.openshift.example.com
openshift-infra   kuryr-cni-ds-llwgw                                  2/2     Running   0          5h    192.168.99.10   app-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-ngvcz                                  2/2     Running   0          5h    192.168.99.5    infra-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-rs2h4                                  2/2     Running   0          5h    192.168.99.13   app-node-1.openshift.example.com
openshift-infra   kuryr-controller-59fc7f478b-q6bxt                   1/1     Running   0          5h    192.168.99.13   app-node-1.openshift.example.com
openshift-node    sync-8nfc9                                          1/1     Running   0          5h    192.168.99.10   app-node-0.openshift.example.com
openshift-node    sync-qlkx6                                          1/1     Running   0          5h    192.168.99.5    infra-node-0.openshift.example.com
openshift-node    sync-t7c7z                                          1/1     Running   0          5h    192.168.99.14   master-0.openshift.example.com
openshift-node    sync-vrldf                                          1/1     Running   0          5h    192.168.99.13   app-node-1.openshift.example.com

$ oc -n openshift-infra get configmap kuryr-config -o yaml | grep watch_retry
watch_retry_timeout = 3600
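The grep check above could also be scripted. The following is a minimal sketch, assuming the kuryr-cni.conf content has already been extracted from the configmap into a string. The helper names (read_watch_retry_timeout, is_fixed) and the SAMPLE_CONF fragment are hypothetical, and Python's standard configparser is used here as a stand-in for whatever config library kuryr itself uses. Only the [kubernetes] section name, the watch_retry_timeout option and the value 3600 come from this report.

```python
import configparser

# Hypothetical fragment of kuryr-cni.conf; the real file holds more options.
SAMPLE_CONF = """\
[kubernetes]
watch_retry_timeout = 3600
"""

DEFAULT_TIMEOUT = 60  # default mentioned in the bug description, in seconds


def read_watch_retry_timeout(conf_text):
    """Return watch_retry_timeout from [kubernetes], or None if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint("kubernetes", "watch_retry_timeout", fallback=None)


def is_fixed(conf_text):
    """True when the option is explicitly set to more than the 60 s default."""
    value = read_watch_retry_timeout(conf_text)
    return value is not None and value > DEFAULT_TIMEOUT


if __name__ == "__main__":
    print(read_watch_retry_timeout(SAMPLE_CONF), is_fixed(SAMPLE_CONF))
```

A config without the option (the "Actual results" case) makes is_fixed return False, which matches the expected-results criterion of a value greater than 60 seconds.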
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2709