Bug 1618685 - *-deploy pods aren't restarting watching OpenShift-API on failure.
Summary: *-deploy pods aren't restarting watching OpenShift-API on failure.
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.1.0
Assignee: Tomáš Nožička
QA Contact: zhou ying
Depends On:
Reported: 2018-08-17 10:37 UTC by Michał Dulko
Modified: 2022-03-13 15:25 UTC
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-06-04 10:40:34 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:40:41 UTC

Description Michał Dulko 2018-08-17 10:37:19 UTC
Description of problem:
We ran into this issue when running on OpenStack, where the OpenShift API is configured behind an Octavia load balancer. Octavia drops long-lived connections after 50 seconds. While that timeout is certainly not ideal for the OpenShift API LB, it uncovered a separate issue.

Basically, *-deploy containers do not re-establish their watch on the OpenShift API, which results in logs like this:

> [openshift@master-0 ~]$ oc logs pod/docker-registry-1-deploy
> --> Scaling docker-registry-1 to 1
> error: update acceptor rejected docker-registry-1: watch closed before Until timeout

This means that any brief networking failure will break such a DeploymentConfig rollout. At almost no cost, the connection could be retried for at least a reasonable amount of time.

Version-Release number of selected component (if applicable):
We've seen it on OCP 3.10 (and OSP 13).

How reproducible:
Always, when the connection is broken while the *-deploy container is running.

Steps to Reproduce:
1. Run a DeploymentConfig.
2. Briefly interrupt connection to OpenShift API (e.g. restart openshift-master).
3. Notice that the DC has now failed, with the aforementioned message in the log.

Actual results:
DeploymentConfig fails.

Expected results:
The connection is re-established inside the *-deploy container and the DeploymentConfig finishes successfully.

Additional info:
We're tracking OpenStack + Octavia part in this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1608950

Comment 1 Tomáš Nožička 2018-08-27 14:51:01 UTC
Yes, this is a known issue. I'd recommend raising the LB timeout. This goes deep into the Kubernetes apimachinery, which I am fixing for v1.12. I hope to get the scale fix in for v1.12 or v1.13.

Comment 2 Tomáš Nožička 2018-08-29 09:22:55 UTC
I've looked into the particular failure, which is in OpenShift code (the acceptor waiting for available pods) - so we will fix it after the 1.12 rebase, when we pick up the new apimachinery.

Comment 3 Luis Tomas Bolivar 2018-09-04 08:32:19 UTC
*** Bug 1623989 has been marked as a duplicate of this bug. ***

Comment 4 Jon Uriarte 2018-10-02 09:00:10 UTC
Hey, this is affecting OCP 3.10 on OpenStack installations, as the pods in the default namespace remain in Error status and need to be retriggered manually.

We can increase the LB timeout values, but the default values are considerably smaller, and we would be adding a new prerequisite.

I see it is targeted for 4.0, but will it be backported to 3.10 and 3.11?

Thank you

Comment 5 Tomáš Nožička 2018-10-02 09:43:19 UTC
I don't think this can technically be backported, since it will use the new machinery in Kubernetes 1.12. Also, there are dozens of other places in the core that assume a watch can work for a reasonable amount of time.

> We can increase the LB timeout values but the default values are quite smaller, and we are adding a new prerequisite.

I think your defaults need to be more reasonable. If they aren't, and they cause the platform to fail under normal conditions, you need to change them.
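As a sketch of what raising the LB timeout looks like on the Octavia side (the listener name `api-listener` and the 600000 ms value are illustrative; Octavia's `timeout-client-data` and `timeout-member-data` idle timeouts default to 50000 ms, which matches the 50-second drop described above):

```shell
# Raise the idle timeouts (in milliseconds) on the listener fronting
# the OpenShift API; "api-listener" and 600000 (10 min) are examples.
openstack loadbalancer listener set \
    --timeout-client-data 600000 \
    --timeout-member-data 600000 \
    api-listener
```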

Comment 10 Jon Uriarte 2018-10-04 09:21:18 UTC
Re-opened [1] to add the required documentation to the OCP 3.10 and 3.11 OpenStack playbooks, reflecting the changes in Octavia.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1623989

Comment 11 Michal Fojtik 2019-04-05 10:55:03 UTC
PR: https://github.com/openshift/origin/pull/22100

Tomas will unhold next week after some additional testing.

Comment 12 Tomáš Nožička 2019-04-08 17:54:19 UTC
The PR is now tagged and ready to be merged, but the CI is borked on unrelated build issues.

Comment 13 zhou ying 2019-04-16 09:56:34 UTC
Confirmed with OCP 4.0.0-0.nightly-2019-04-10-182914, with these steps:

1) Create a dc with hooks to make sure the deployer pod runs for a while;
2) Trigger restart for kube-apiserver and openshift-apiserver by command:
   `oc patch kubeapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test33" } ]'`
   `oc patch openshiftapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test33" } ]'`
3) Use command `oc get po -w` to check the deployer pod running status

Result: the `oc get po -w` command gets interrupted, but the deployer pod keeps running and eventually succeeds, so I will verify this issue.

Comment 15 errata-xmlrpc 2019-06-04 10:40:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

