Bug 1594187
| Summary: | Openshift-on-OpenStack playbook increase watch_retry_timeout for kuryr-cni | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jon Uriarte <juriarte> |
| Component: | Installer | Assignee: | Michał Dulko <mdulko> |
| Status: | CLOSED ERRATA | QA Contact: | Jon Uriarte <juriarte> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.10.0 | CC: | aos-bugs, jokerman, juriarte, mmccomas, tsedovic, vlaad |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 3.10.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: Kuryr was only retrying connections to the OpenShift API for 60 seconds.
Consequence: When an OpenShift API outage lasted longer than 60 seconds, the Kuryr pods stopped retrying but did not exit. This left the pods alive but not functional at all.
Fix: Increase the 60-second timeout to 3600 seconds.
Result: Kuryr services now retry connections for an hour, which is virtually forever (if the OpenShift API has an hour-long outage, there is definitely some major issue outside of Kuryr).
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-11-11 16:39:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
https://github.com/openshift/openshift-ansible/pull/8952
release-3.10 backport already merged
Verified in openshift-ansible-3.10.59-1.git.0.f9ba890.el7.noarch on OSP 13 2018-10-02.1 puddle.
The OCP on OSP installation playbooks finish successfully and all the pods are in Running status.
$ oc get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default docker-registry-1-gp9qg 1/1 Running 0 5h 10.11.0.7 infra-node-0.openshift.example.com
default registry-console-1-pfrcm 1/1 Running 0 5h 10.11.0.18 master-0.openshift.example.com
default router-1-hs54c 1/1 Running 0 5h 192.168.99.5 infra-node-0.openshift.example.com
kube-system master-api-master-0.openshift.example.com 1/1 Running 1 5h 192.168.99.14 master-0.openshift.example.com
kube-system master-controllers-master-0.openshift.example.com 1/1 Running 1 5h 192.168.99.14 master-0.openshift.example.com
kube-system master-etcd-master-0.openshift.example.com 1/1 Running 1 5h 192.168.99.14 master-0.openshift.example.com
openshift-infra kuryr-cni-ds-27tvb 2/2 Running 0 5h 192.168.99.14 master-0.openshift.example.com
openshift-infra kuryr-cni-ds-llwgw 2/2 Running 0 5h 192.168.99.10 app-node-0.openshift.example.com
openshift-infra kuryr-cni-ds-ngvcz 2/2 Running 0 5h 192.168.99.5 infra-node-0.openshift.example.com
openshift-infra kuryr-cni-ds-rs2h4 2/2 Running 0 5h 192.168.99.13 app-node-1.openshift.example.com
openshift-infra kuryr-controller-59fc7f478b-q6bxt 1/1 Running 0 5h 192.168.99.13 app-node-1.openshift.example.com
openshift-node sync-8nfc9 1/1 Running 0 5h 192.168.99.10 app-node-0.openshift.example.com
openshift-node sync-qlkx6 1/1 Running 0 5h 192.168.99.5 infra-node-0.openshift.example.com
openshift-node sync-t7c7z 1/1 Running 0 5h 192.168.99.14 master-0.openshift.example.com
openshift-node sync-vrldf 1/1 Running 0 5h 192.168.99.13 app-node-1.openshift.example.com
$ oc -n openshift-infra get configmap kuryr-config -o yaml | grep watch_retry
watch_retry_timeout = 3600
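The grep above confirms the value by eye. A minimal Python sketch of the same check follows, assuming the Kuryr configuration text has already been extracted from the ConfigMap into a string. The helper name watch_retry_timeout() and the SAMPLE text are hypothetical and only illustrate the check; the 60-second fallback mirrors the default mentioned in this report.

```python
import configparser

# Hypothetical sample of the rendered Kuryr configuration; only the
# [kubernetes] section and the option name come from this bug report.
SAMPLE = """
[kubernetes]
watch_retry_timeout = 3600
"""


def watch_retry_timeout(conf_text, default=60):
    """Return watch_retry_timeout from the [kubernetes] section.

    Falls back to `default` (60 seconds, per this report) when the
    section or the option is missing.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getint("kubernetes", "watch_retry_timeout", fallback=default)


if __name__ == "__main__":
    timeout = watch_retry_timeout(SAMPLE)
    print(timeout, "ok" if timeout > 60 else "too short")
```

A result of 60 from this helper means either that the option is unset or that it is explicitly set to the default, so a stricter check would also test for the option's presence.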
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2018:2709
Description of problem:
kuryr-daemon connects to the K8s API through an LB. If the connection fails, it tries to reconnect for the time defined in watch_retry_timeout (60 seconds by default). Sometimes this value is too short because the LB needs more time to become ready and responsive; in that case the watcher thread dies and the connection to the K8s API is never established.
watch_retry_timeout should be increased for kuryr-cni.

Version-Release number of the following components:
$ rpm -q openshift-ansible
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
$ rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch
$ ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cloud-user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible: depends on LB/containers creation timing

Steps to Reproduce:
1. Deploy OpenStack (OSP13)
2. Deploy a DNS server and the Ansible host in the overcloud
3. Download OCP rpm and configure:
   - OpenStack (inventory/group_vars/all.yml)
     . Configure Kuryr SDN
   - OpenShift (inventory/group_vars/OSEv3.yml)
     . Configure the Red Hat LDAP identity provider
4. Install OpenShift by running the playbooks for OpenStack (deployed 3 masters, 2 infra and 2 app nodes) and verify the installer succeeds without any errors.
5. Check all pods are Running

Actual results:
No watch_retry_timeout is defined in kuryr-cni.conf

Expected results:
watch_retry_timeout is defined in kuryr-cni.conf under the [kubernetes] section, with a value greater than 60 seconds.
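
To illustrate why the 60-second default can be too short, here is a minimal Python sketch of a retry-until-deadline loop. It is not Kuryr's actual watcher code. The function retry_until_deadline(), the FakeTime helper, and the 90-second "LB warm-up" figure are all hypothetical and exist only to show the behaviour described in this report: once the deadline passes, retrying stops for good.

```python
import time


def retry_until_deadline(connect, timeout, interval=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or `timeout` seconds have elapsed.

    Re-raises the last ConnectionError once the deadline is reached,
    which models the watcher giving up.
    """
    deadline = clock() + timeout
    while True:
        try:
            return connect()
        except ConnectionError:
            if clock() >= deadline:
                raise
            sleep(interval)


class FakeTime:
    """Simulated clock so the demo runs instantly (illustration only)."""

    def __init__(self):
        self.now = 0.0

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def make_connect(fake, ready_at):
    """Return a connect() that fails until the simulated LB is ready."""
    def connect():
        if fake.now < ready_at:
            raise ConnectionError("LB not ready")
        return "connected"
    return connect


if __name__ == "__main__":
    for timeout in (60, 3600):
        fake = FakeTime()
        connect = make_connect(fake, ready_at=90)  # assumed LB warm-up time
        try:
            result = retry_until_deadline(connect, timeout, interval=10,
                                          clock=fake.clock, sleep=fake.sleep)
        except ConnectionError:
            result = "gave up"
        print(timeout, result)
```

With the assumed 90-second warm-up, the 60-second budget is exhausted before the LB responds, while the 3600-second budget lets the loop reconnect.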