Bug 1618685
| Summary: | *-deploy pods aren't restarting watching OpenShift-API on failure | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michał Dulko <mdulko> |
| Component: | Master | Assignee: | Tomáš Nožička <tnozicka> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.10.0 | CC: | aos-bugs, jokerman, juriarte, ltomasbo, mfojtik, mmccomas, oblaut, rbost, tnozicka, xxia |
| Target Milestone: | --- | Keywords: | AutomationBlocker, NeedsTestCase |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:40:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Michał Dulko
2018-08-17 10:37:19 UTC
Yes, this is a known issue. I'd recommend raising the LB timeout. This goes deep into the Kubernetes apimachinery, which I am fixing for v1.12. I'd hope to get the scale fix in for v1.12 or v1.13. I've looked into the particular failure, which is in OpenShift code (the acceptor waiting for available pods), so we will fix it after the 1.12 rebase when we pick up the new apimachinery.

*** Bug 1623989 has been marked as a duplicate of this bug. ***

Hey, this is affecting OCP 3.10 on OpenStack installations, as the pods in the default namespace remain in Error status and need to be retriggered manually. We can increase the LB timeout values, but the default values are quite a bit smaller, and we would be adding a new prerequisite. I see it is targeted for 4.0, but will it be backported to 3.10 and 3.11? Thank you.

I don't think this can technically be backported, since it will be using the new machinery in Kubernetes 1.12. There are also dozens of other places in the core that assume a watch can work for a reasonable amount of time.
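As an aside, the manual retriggering mentioned above can be done with `oc rollout retry`. A minimal sketch; the dc name is hypothetical, not taken from this bug:

```sh
# Find deployer pods stuck in Error in the default namespace
oc get pods -n default | grep -- '-deploy'

# Retry the last failed rollout of the affected DeploymentConfig
oc rollout retry dc/docker-registry -n default
```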
> We can increase the LB timeout values, but the default values are quite a bit smaller, and we would be adding a new prerequisite.
I think your defaults need to be more reasonable. If they are causing the platform to fail under normal conditions, you need to change them.
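For what it's worth, raising the data timeouts on the Octavia listener in front of the API would look roughly like this with the OpenStack CLI. The listener name and the 30-minute value are assumptions, not values taken from this bug:

```sh
# Raise client- and member-side inactivity timeouts (in milliseconds)
# on the load-balancer listener fronting the OpenShift API.
# "api-lb-listener" is a hypothetical listener name.
openstack loadbalancer listener set \
  --timeout-client-data 1800000 \
  --timeout-member-data 1800000 \
  api-lb-listener
```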
Re-opened [1] for adding the required documentation to the OCP 3.10 and 3.11 OpenStack playbooks to reflect the changes in Octavia.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1623989

PR: https://github.com/openshift/origin/pull/22100

Tomas will unhold next week after some additional testing.

The PR is now tagged and ready to be merged, but the CI is borked on unrelated build issues.

Confirmed with OCP 4.0.0-0.nightly-2019-04-10-182914 using these steps:

1) Create a dc with hooks to make sure the deployer pod runs for a while (see the sketch after this comment);
2) Trigger a restart of kube-apiserver and openshift-apiserver with:
`oc patch kubeapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test33" } ]'`
`oc patch openshiftapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test33" } ]'`
3) Use `oc get po -w` to check the deployer pod's running status.

Result: the `oc get po -w` watch gets interrupted, but the deployer pod keeps running and eventually succeeds, so I will verify this issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
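For step 1 of the verification above, a dc whose deployer pod runs for a while can be created with a deployment pre hook. A minimal sketch under assumptions: the name, image, and sleep durations are illustrative, not the exact reproducer used by QA:

```sh
# A dc with a Recreate-strategy pre hook; the hook keeps the deployer
# pod (sleeper-1-deploy) busy long enough to outlive an apiserver restart.
# All names and the image are hypothetical.
oc apply -f - <<'EOF'
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: sleeper
spec:
  replicas: 1
  selector:
    app: sleeper
  strategy:
    type: Recreate
    recreateParams:
      pre:
        failurePolicy: Abort
        execNewPod:
          containerName: sleeper
          command: ["/bin/sh", "-c", "sleep 120"]
  template:
    metadata:
      labels:
        app: sleeper
    spec:
      containers:
      - name: sleeper
        image: registry.access.redhat.com/ubi8/ubi
        command: ["/bin/sh", "-c", "sleep infinity"]
EOF

# Trigger a rollout so the deployer pod runs the pre hook
oc rollout latest dc/sleeper

# Watch the deployer pod while restarting the apiservers (step 2)
oc get po -w
```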