Bug 1594187 - Openshift-on-OpenStack playbook increase watch_retry_timeout for kuryr-cni
Summary: Openshift-on-OpenStack playbook increase watch_retry_timeout for kuryr-cni
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: Michał Dulko
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-22 10:56 UTC by Jon Uriarte
Modified: 2018-11-11 16:40 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Kuryr retried connections to the OpenShift API for only 60 seconds. Consequence: When an OpenShift API outage lasted longer than 60 seconds, the Kuryr pods stopped retrying but did not terminate, which left them alive but not functional. Fix: The timeout was increased from 60 seconds to 3600 seconds. Result: Kuryr services now retry connections for an hour, which is effectively forever (if the OpenShift API has an hour-long outage, there is definitely a major issue outside of Kuryr).
Clone Of:
Environment:
Last Closed: 2018-11-11 16:39:10 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID                                     Private  Priority  Status  Summary  Last Updated
Github                  openshift openshift-ansible pull 8915  0        None      None    None     2018-06-22 11:00:32 UTC
Red Hat Product Errata  RHSA-2018:2709                         0        None      None    None     2018-11-11 16:40:08 UTC

Description Jon Uriarte 2018-06-22 10:56:16 UTC
Description of problem:

kuryr-daemon connects to the K8s API through an LB. If the connection fails, it keeps trying to reconnect for the period defined in watch_retry_timeout (60 seconds by default).
Sometimes this value is too short because the LB needs more time to become ready and responsive. In that case the watcher thread dies and the connection to the K8s API is never re-established.

watch_retry_timeout should be increased for kuryr-cni.
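
The failure mode described above can be illustrated with a minimal Python sketch. This is not Kuryr's actual watcher code: connect_with_retry, its parameters and the fixed retry interval are hypothetical names and choices made for illustration only. It just shows how a deadline-bound retry loop gives up for good when the backend needs longer than the timeout to become reachable.

```python
import time


def connect_with_retry(connect, retry_timeout=60, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or retry_timeout seconds elapse.

    Hypothetical helper, not Kuryr's implementation. Returns a tuple of
    (result of connect(), number of attempts made). Re-raises the last
    ConnectionError once another retry would exceed the deadline.
    """
    deadline = clock() + retry_timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect(), attempts
        except ConnectionError:
            # Give up if waiting one more interval would pass the deadline.
            if clock() + interval > deadline:
                raise
            sleep(interval)
```

With retry_timeout=60, a backend that needs 90 seconds to come up makes the loop raise, and a thread running it would exit. With retry_timeout=3600 the same backend is reached once it is ready.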

Version-Release number of the following components:

$ rpm -q openshift-ansible
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch

$ rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch

$ ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cloud-user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible: depends on the timing of LB and container creation

Steps to Reproduce:

1. Deploy OpenStack (OSP13)
2. Deploy a DNS server and the Ansible host in the overcloud
3. Download OCP rpm and configure:
   - OpenStack (inventory/group_vars/all.yml)
       . Configure Kuryr SDN
   - OpenShift (inventory/group_vars/OSEv3.yml)
       . Configure the Red Hat LDAP identity provider
4. Install OpenShift by running the playbooks for OpenStack (deployed 3 masters, 2 infra and 2 app nodes) and verify the installer succeeds without any errors.
5. Check all pods are Running

Actual results:
No watch_retry_timeout is defined in kuryr-cni.conf

Expected results:
watch_retry_timeout is defined in the [kubernetes] section of kuryr-cni.conf, with a value greater than 60 seconds.
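
The expected result can be checked with a short Python sketch, assuming kuryr-cni.conf is INI-style and readable by the standard configparser module. SAMPLE_CONF is a made-up fragment and watch_retry_timeout() is a hypothetical helper. The fallback of 60 mirrors the default mentioned in the description.

```python
import configparser

# Made-up fragment of what the [kubernetes] section could look like.
SAMPLE_CONF = """\
[kubernetes]
watch_retry_timeout = 3600
"""


def watch_retry_timeout(conf_text, default=60):
    """Return watch_retry_timeout from INI text, or default if it is unset."""
    # Interpolation is disabled so '%' characters in other options are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser.getint("kubernetes", "watch_retry_timeout", fallback=default)


if __name__ == "__main__":
    value = watch_retry_timeout(SAMPLE_CONF)
    print("watch_retry_timeout =", value, "(ok)" if value > 60 else "(too low)")
```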

Comment 1 Scott Dodson 2018-10-05 17:47:20 UTC
https://github.com/openshift/openshift-ansible/pull/8952 release-3.10 backport already merged

Comment 4 Jon Uriarte 2018-10-22 15:00:17 UTC
Verified in openshift-ansible-3.10.59-1.git.0.f9ba890.el7.noarch on OSP 13 2018-10-02.1 puddle.

The OCP-on-OSP installation playbooks complete successfully and all pods are in Running status.

$ oc get pods --all-namespaces -o wide
NAMESPACE         NAME                                                READY     STATUS    RESTARTS   AGE       IP              NODE
default           docker-registry-1-gp9qg                             1/1       Running   0          5h        10.11.0.7       infra-node-0.openshift.example.com
default           registry-console-1-pfrcm                            1/1       Running   0          5h        10.11.0.18      master-0.openshift.example.com
default           router-1-hs54c                                      1/1       Running   0          5h        192.168.99.5    infra-node-0.openshift.example.com
kube-system       master-api-master-0.openshift.example.com           1/1       Running   1          5h        192.168.99.14   master-0.openshift.example.com
kube-system       master-controllers-master-0.openshift.example.com   1/1       Running   1          5h        192.168.99.14   master-0.openshift.example.com
kube-system       master-etcd-master-0.openshift.example.com          1/1       Running   1          5h        192.168.99.14   master-0.openshift.example.com
openshift-infra   kuryr-cni-ds-27tvb                                  2/2       Running   0          5h        192.168.99.14   master-0.openshift.example.com
openshift-infra   kuryr-cni-ds-llwgw                                  2/2       Running   0          5h        192.168.99.10   app-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-ngvcz                                  2/2       Running   0          5h        192.168.99.5    infra-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-rs2h4                                  2/2       Running   0          5h        192.168.99.13   app-node-1.openshift.example.com
openshift-infra   kuryr-controller-59fc7f478b-q6bxt                   1/1       Running   0          5h        192.168.99.13   app-node-1.openshift.example.com
openshift-node    sync-8nfc9                                          1/1       Running   0          5h        192.168.99.10   app-node-0.openshift.example.com
openshift-node    sync-qlkx6                                          1/1       Running   0          5h        192.168.99.5    infra-node-0.openshift.example.com
openshift-node    sync-t7c7z                                          1/1       Running   0          5h        192.168.99.14   master-0.openshift.example.com
openshift-node    sync-vrldf                                          1/1       Running   0          5h        192.168.99.13   app-node-1.openshift.example.com


$ oc -n openshift-infra get configmap kuryr-config -o yaml | grep watch_retry
    watch_retry_timeout = 3600

Comment 6 errata-xmlrpc 2018-11-11 16:39:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2709

