Bug 1608950 - Pod deployments failing DeploymentConfig on openshift on openstack provider
Summary: Pod deployments failing DeploymentConfig on openshift on openstack provider
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.10.z
Assignee: Luis Tomas Bolivar
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-26 14:23 UTC by jliberma@redhat.com
Modified: 2019-02-04 08:31 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-04 08:31:20 UTC
Target Upstream Version:
Embargoed:



Description jliberma@redhat.com 2018-07-26 14:23:16 UTC
Description of problem:

The kubernetes service allows pods to access the Kubernetes API. This service is created manually by the installer behind the Octavia load balancer. It appears that some of the pod connections to the kubernetes service are being closed due to Octavia timeouts, resulting in pods ending up in an Error status.
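
To see what sits behind that service and load balancer, something like the following can be used; the `default` namespace and the availability of the `openstack loadbalancer` client subcommand (python-octaviaclient) are assumptions about this environment:

  $ oc get svc kubernetes -n default         # the API service the pods connect to
  $ oc get endpoints kubernetes -n default   # master endpoints backing the service
  $ openstack loadbalancer list              # Octavia load balancer fronting the API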

Version-Release number of the following components:

(shiftstack) [cloud-user@bastion ~]$ rpm -q openshift-ansible
openshift-ansible-3.10.24-1.git.0.3032aec.el7.noarch

(shiftstack) [cloud-user@bastion ~]$ rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch

(shiftstack) [cloud-user@bastion ~]$ ansible --version
ansible 2.4.6.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cloud-user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:

We saw it in 5/5 deployments with Octavia enabled. Different service pods failed:

docker-registry
registry-console

We also saw this with the CakePHP app we launched to test the OCP deployment:

  $ oc login --insecure-skip-tls-verify=true "https://console.openshift.example.com:8443" -u username -p password
  $ oc new-project test-app
  $ oc new-app --template=cakephp-mysql-example
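
To watch the test app hit the same failure, something along these lines can be used; the `test-app` project comes from the commands above, and the DeploymentConfig name `cakephp-mysql-example` is assumed from the template name:

  $ oc get pods -n test-app -w                      # watch the deployer pod go to Error
  $ oc logs dc/cakephp-mysql-example -n test-app    # output of the latest deployment attempt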


Steps to Reproduce:
1. Deploy openshift-ansible with octavia enabled
2. View pod STATUS with oc get pods --all-namespaces -o wide

Actual results:


[openshift@master-0 ~]$ oc get pods docker-registry-1-deploy
 docker-registry-1-deploy   0/1       Error     0          1h

[openshift@master-0 ~]$ oc logs docker-registry-1-deploy
 --> Scaling docker-registry-1 to 1
 error: update acceptor rejected docker-registry-1: watch closed before Until timeout
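
As a manual workaround while the deployer keeps timing out, the failed rollout can be retried with the standard `oc rollout retry` subcommand; the `default` project below assumes a stock registry install, and retrying does not guarantee the deployment finishes before the Octavia timeout hits again:

  $ oc rollout retry dc/docker-registry -n default   # start a new deployer pod
  $ oc get pods -n default -w                        # watch whether it completes this time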

Expected results:

Pods complete their deployments before the connections are dropped (if that is indeed what is happening).

Additional info:

Comment 1 Michał Dulko 2018-07-27 17:06:50 UTC
I've verified the Octavia timeout hypothesis today:

1. When I shrank the `timeout client` and `timeout server` HAProxy options to 20000 on the Amphora VM, the deployment containers started to die earlier.
2. When I extended the `timeout client` and `timeout server` HAProxy options to 500000 on the Amphora VM, the deployment containers completed successfully.

This confirms that the root cause is that those options are not configurable. Unfortunately, this is fixed only in OSP 14 Octavia.
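
For reference, the settings being edited are the standard HAProxy `timeout client` / `timeout server` keywords in the amphora's generated haproxy configuration; values are in milliseconds, 500000 being the value that made the deployments pass above. Where exactly they appear (defaults vs. frontend/backend sections) depends on the Octavia template version, so treat this as a sketch:

  # amphora haproxy configuration (illustrative placement)
  timeout client 500000    # drop idle client connections after 500 s
  timeout server 500000    # drop idle server connections after 500 s

HAProxy inside the amphora has to be reloaded after the edit for the new values to take effect.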

Comment 2 Luis Tomas Bolivar 2019-02-04 08:31:20 UTC
Closing the bug as this is fixed in OCP 3.11 when used on top of OSP 14. For OSP 13, the modifications to Director to include the knobs to increase the load balancer client timeouts have also been merged.
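
For completeness, the timeouts that became configurable are exposed under Octavia's `[haproxy_amphora]` section starting with the Rocky (OSP 14) release; the option names below come from upstream Octavia and the values are only illustrative, so availability and the matching Director parameters should be verified for the OSP release in use:

  # /etc/octavia/octavia.conf (illustrative values, in milliseconds)
  [haproxy_amphora]
  timeout_client_data = 1200000
  timeout_member_data = 1200000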

