Bug 1594187

Summary: Openshift-on-OpenStack playbook increase watch_retry_timeout for kuryr-cni
Product: OpenShift Container Platform Reporter: Jon Uriarte <juriarte>
Component: Installer Assignee: Michał Dulko <mdulko>
Status: CLOSED ERRATA QA Contact: Jon Uriarte <juriarte>
Severity: high Docs Contact:
Priority: medium    
Version: 3.10.0 CC: aos-bugs, jokerman, juriarte, mmccomas, tsedovic, vlaad
Target Milestone: --- Keywords: Triaged
Target Release: 3.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Kuryr retried connections to the OpenShift API for only 60 seconds. Consequence: When an OpenShift API outage lasted longer than 60 seconds, Kuryr pods stopped retrying but did not actually stop executing. This left the pods alive but not functional at all. Fix: The timeout was increased from 60 seconds to 3600 seconds. Result: Kuryr services now retry connections for an hour, which is virtually forever (if the OpenShift API has an hour-long outage, there is definitely some major issue outside of Kuryr).
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-11-11 16:39:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jon Uriarte 2018-06-22 10:56:16 UTC
Description of problem:

kuryr-daemon connects to the K8s API through an LB. If the connection fails, it keeps trying to reconnect for the period defined in watch_retry_timeout (60 seconds by default).
Sometimes this value is too short because the LB needs more time to become ready and responsive. In that case the watcher thread dies and the connection to the K8s API is never re-established.

watch_retry_timeout should be increased for kuryr-cni.
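
For illustration only, the failure mode above can be sketched in Python. This is not Kuryr's actual code: the function name retry_until_timeout, the backoff parameters, and the injectable clock/sleep arguments are all assumptions made for the example. It shows a generic retry loop that gives up once a deadline passes, which is the point where a watcher thread would die.

```python
import time


def retry_until_timeout(connect, timeout, clock=time.monotonic,
                        sleep=time.sleep, initial_delay=1.0, max_delay=30.0):
    """Call connect() until it succeeds or `timeout` seconds have elapsed.

    Hypothetical helper, not part of Kuryr. Returns whatever connect()
    returns; re-raises the last ConnectionError once the deadline passes.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            remaining = deadline - clock()
            if remaining <= 0:
                # Deadline reached: give up. A watcher built on a loop
                # like this stops here and never reconnects.
                raise
            # Exponential backoff, capped by max_delay and by the
            # time left before the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
```

With a loop like this, an LB that needs 90 seconds to become responsive exhausts a 60-second timeout, while a 3600-second timeout leaves ample room to reconnect.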

Version-Release number of the following components:

$ rpm -q openshift-ansible
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch

$ rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch

$ ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cloud-user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible: depends on LB/container creation timing

Steps to Reproduce:

1. Deploy OpenStack (OSP13)
2. Deploy a DNS server and the Ansible host in the overcloud
3. Download OCP rpm and configure:
   - OpenStack (inventory/group_vars/all.yml)
       . Configure Kuryr SDN
   - OpenShift (inventory/group_vars/OSEv3.yml)
       . Configure the Red Hat LDAP identity provider
4. Install OpenShift by running the playbooks for OpenStack (deployed 3 masters, 2 infra and 2 app nodes) and verify the installer succeeds without any errors.
5. Check all pods are Running

Actual results:
No watch_retry_timeout is defined in kuryr-cni.conf

Expected results:
watch_retry_timeout is defined under the [kubernetes] section of kuryr-cni.conf, with a value greater than 60 seconds.
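
A check for the expected result could look roughly like the Python sketch below. It assumes the configuration is INI-style text that the standard configparser module can read. The helper name watch_retry_timeout and the sample text are hypothetical; the 60-second fallback mirrors the default mentioned in the description.

```python
import configparser


def watch_retry_timeout(conf_text, default=60):
    """Return watch_retry_timeout from the [kubernetes] section.

    Hypothetical helper: falls back to `default` when the section or
    the option is missing from the given INI text.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint("kubernetes", "watch_retry_timeout",
                         fallback=default)


# Sample text for illustration, not a copy of a real kuryr-cni.conf.
sample = """
[kubernetes]
watch_retry_timeout = 3600
"""

print(watch_retry_timeout(sample))  # 3600
```

A returned value of 60 or less means the option is either unset or has not been raised above the default.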

Comment 1 Scott Dodson 2018-10-05 17:47:20 UTC
https://github.com/openshift/openshift-ansible/pull/8952 release-3.10 backport already merged

Comment 4 Jon Uriarte 2018-10-22 15:00:17 UTC
Verified in openshift-ansible-3.10.59-1.git.0.f9ba890.el7.noarch on OSP 13 2018-10-02.1 puddle.

The OCP on OSP installation playbooks complete successfully and all pods are in Running status.

$ oc get pods --all-namespaces -o wide
NAMESPACE         NAME                                                READY     STATUS    RESTARTS   AGE       IP              NODE
default           docker-registry-1-gp9qg                             1/1       Running   0          5h        10.11.0.7       infra-node-0.openshift.example.com
default           registry-console-1-pfrcm                            1/1       Running   0          5h        10.11.0.18      master-0.openshift.example.com
default           router-1-hs54c                                      1/1       Running   0          5h        192.168.99.5    infra-node-0.openshift.example.com
kube-system       master-api-master-0.openshift.example.com           1/1       Running   1          5h        192.168.99.14   master-0.openshift.example.com
kube-system       master-controllers-master-0.openshift.example.com   1/1       Running   1          5h        192.168.99.14   master-0.openshift.example.com
kube-system       master-etcd-master-0.openshift.example.com          1/1       Running   1          5h        192.168.99.14   master-0.openshift.example.com
openshift-infra   kuryr-cni-ds-27tvb                                  2/2       Running   0          5h        192.168.99.14   master-0.openshift.example.com
openshift-infra   kuryr-cni-ds-llwgw                                  2/2       Running   0          5h        192.168.99.10   app-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-ngvcz                                  2/2       Running   0          5h        192.168.99.5    infra-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-rs2h4                                  2/2       Running   0          5h        192.168.99.13   app-node-1.openshift.example.com
openshift-infra   kuryr-controller-59fc7f478b-q6bxt                   1/1       Running   0          5h        192.168.99.13   app-node-1.openshift.example.com
openshift-node    sync-8nfc9                                          1/1       Running   0          5h        192.168.99.10   app-node-0.openshift.example.com
openshift-node    sync-qlkx6                                          1/1       Running   0          5h        192.168.99.5    infra-node-0.openshift.example.com
openshift-node    sync-t7c7z                                          1/1       Running   0          5h        192.168.99.14   master-0.openshift.example.com
openshift-node    sync-vrldf                                          1/1       Running   0          5h        192.168.99.13   app-node-1.openshift.example.com


$ oc -n openshift-infra get configmap kuryr-config -o yaml | grep watch_retry
    watch_retry_timeout = 3600

Comment 6 errata-xmlrpc 2018-11-11 16:39:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2709