Bug 1618685

Summary: *-deploy pods aren't restarting watching OpenShift-API on failure.

Product: OpenShift Container Platform
Component: Master
Version: 3.10.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: medium
Keywords: AutomationBlocker, NeedsTestCase

Reporter: Michał Dulko <mdulko>
Assignee: Tomáš Nožička <tnozicka>
QA Contact: zhou ying <yinzhou>
CC: aos-bugs, jokerman, juriarte, ltomasbo, mfojtik, mmccomas, oblaut, rbost, tnozicka, xxia

Type: Bug
Last Closed: 2019-06-04 10:40:34 UTC

Description Michał Dulko 2018-08-17 10:37:19 UTC
Description of problem:
We ran into this issue when running on OpenStack, where the OpenShift API is configured behind an Octavia load balancer. Octavia drops long-lived connections after 50 seconds, and while that is of course not ideal for the OpenShift API LB, it uncovered a separate issue.

The *-deploy containers do not re-establish their watch of the OpenShift API, which results in logs like this:

> [openshift@master-0 ~]$ oc logs pod/docker-registry-1-deploy
> --> Scaling docker-registry-1 to 1
> error: update acceptor rejected docker-registry-1: watch closed before Until timeout

This means that every brief networking failure breaks such a DeploymentConfig rollout. At almost no cost, the connection could be retried for at least a reasonable amount of time.
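
For illustration, the same class of failure can be seen with any long-lived watch that goes through the LB. This is only a sketch, assuming oc reaches the API through the Octavia listener with its default 50 s idle timeout; the exact timing is environment-specific:

   `time oc get pods -w --request-timeout=0`

With the LB dropping idle connections, the watch terminates after roughly 50 seconds instead of staying open, which is the same thing that happens to the deployer's watch.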

Version-Release number of selected component (if applicable):
We've seen it on OCP 3.10 (and OSP 13).

How reproducible:
Always, whenever the connection is broken while a *-deploy container is running.

Steps to Reproduce:
1. Run a DeploymentConfig.
2. Briefly interrupt the connection to the OpenShift API (e.g. restart openshift-master); see the sketch after these steps.
3. Notice that the DC has now failed, with the aforementioned message in its log.
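
A rough sketch of these steps (the DC name is illustrative and assumes the 3.10 registry DC in the default namespace; the restart command depends on the installation, e.g. master-restart on 3.10 masters where the control plane runs as static pods):

   `oc -n default rollout latest dc/docker-registry`
   `master-restart api`
   `oc -n default get pods | grep deploy`

The newest *-deploy pod ends up in Error with the "watch closed before Until timeout" message in its log.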

Actual results:
DeploymentConfig fails.

Expected results:
The connection is re-established inside the *-deploy container and the DeploymentConfig rollout finishes successfully.

Additional info:
We're tracking OpenStack + Octavia part in this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1608950

Comment 1 Tomáš Nožička 2018-08-27 14:51:01 UTC
Yes, this is a known issue. I'd recommend raising the LB timeout. This goes deep into the Kubernetes apimachinery, which I am fixing for v1.12. I hope to get the scale fix in for v1.12 or v1.13.
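
For reference, Octavia's default idle timeouts are 50000 ms, which matches the 50-second drops described above. One way to raise them on the API listener (a sketch, assuming the openstack CLI with Octavia support; the listener name and the 20-minute value are only examples, and the values are in milliseconds):

   `openstack loadbalancer listener set --timeout-client-data 1200000 --timeout-member-data 1200000 openshift-api-listener`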

Comment 2 Tomáš Nožička 2018-08-29 09:22:55 UTC
I've looked into this particular failure; it is in OpenShift code (the acceptor waiting for available pods), so we will fix it after the 1.12 rebase, when we pick up the new apimachinery.

Comment 3 Luis Tomas Bolivar 2018-09-04 08:32:19 UTC
*** Bug 1623989 has been marked as a duplicate of this bug. ***

Comment 4 Jon Uriarte 2018-10-02 09:00:10 UTC
Hey, this is affecting OCP 3.10 installations on OpenStack: pods in the default namespace remain in Error status and need to be retriggered manually.

We can increase the LB timeout values, but the defaults are much smaller, and doing so adds a new prerequisite.

I see it is targeted for 4.0, but will it be backported to 3.10 and 3.11?

Thank you

Comment 5 Tomáš Nožička 2018-10-02 09:43:19 UTC
I don't think this can technically be backported, since it will use the new machinery in Kubernetes 1.12. There are also dozens of other places in the core that assume a watch can stay open for a reasonable amount of time.

> We can increase the LB timeout values, but the defaults are much smaller, and doing so adds a new prerequisite.

I think your defaults need to be more reasonable. If they are causing the platform to fail under normal conditions, you need to change them.

Comment 10 Jon Uriarte 2018-10-04 09:21:18 UTC
Re-opened [1] to add the required documentation to the OCP 3.10 and 3.11 OpenStack playbooks, reflecting the changes needed in Octavia.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1623989

Comment 11 Michal Fojtik 2019-04-05 10:55:03 UTC
PR: https://github.com/openshift/origin/pull/22100

Tomas will unhold next week after some additional testing.

Comment 12 Tomáš Nožička 2019-04-08 17:54:19 UTC
The PR is now tagged and ready to be merged, but CI is broken by unrelated build issues.

Comment 13 zhou ying 2019-04-16 09:56:34 UTC
Confirmed with OCP 4.0.0-0.nightly-2019-04-10-182914 using these steps:

1) Create a dc with hooks to make sure the deployer pod runs for a while (see the sketch after these steps);
2) Trigger a restart of kube-apiserver and openshift-apiserver with:
   `oc patch kubeapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test33" } ]'`
   `oc patch openshiftapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test33" } ]'`
3) Use `oc get po -w` to watch the deployer pod's status.
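
A hedged sketch of step 1 (the image, dc name, and sleep duration are made up for illustration; any dc whose deployer pod stays up long enough to span the restarts in step 2 will do):

   `oc new-app --name=sleeper docker.io/centos/httpd-24-centos7`
   `oc set deployment-hook dc/sleeper --pre -c sleeper -- /bin/sleep 120`

The pre hook runs in a deployer-managed hook pod before the new pods are scaled up, so the deployer pod stays around well past the API-server restarts; changing the dc also triggers a new rollout on its own.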


Result: The `oc get po -w` command gets interrupted, but the deployer pod keeps running and eventually succeeds, so I will verify this issue.

Comment 15 errata-xmlrpc 2019-06-04 10:40:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758