Bug 1735502
| Summary: | [3.11] Some pods lost default gateway route after restarting docker | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | sfu <sfu> | |
| Component: | Networking | Assignee: | Alexander Constantinescu <aconstan> | |
| Networking sub component: | openshift-sdn | QA Contact: | Weibin Liang <weliang> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | urgent | |||
| Priority: | urgent | CC: | aconstan, aos-bugs, cdc, guachen, huirwang, jinjli, mfuruta, mirollin, nagrawal, nstielau, openshift-bugs-escalate, rhowe, rsandu, scuppett, weliang, yqu | |
| Version: | 3.11.0 | Keywords: | Reopened | |
| Target Milestone: | --- | |||
| Target Release: | 3.11.z | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1748031 1748032 1772981 (view as bug list) | Environment: | ||
| Last Closed: | 2019-09-24 08:08:08 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1772981 | |||
Description
sfu@redhat.com
2019-08-01 01:51:24 UTC

Hi Team,

I've changed the severity and priority to urgent to reflect the current status, and will raise an ACE ticket later so that the EMT team can help track this issue along with us. I've also sent a mail to rhose-prio-list so that it can be highlighted. Could you help push this forward and provide an update by this week?

Below is the latest update from the account team:

The account team visited CMB yesterday, and the customer complained that it has been a long time since the issue was escalated and there is still no fix or workaround available. They are questioning our service value, and the issue has also had a bad impact on migrating their important apps to the OCP environment. The customer is really worried about how long it will take to address this issue. With the root cause still not identified after such a long time, they are unsure whether it's the right choice for them to run more apps on OCP in the future.

Details:
CASE LINK: https://gss--c.na94.visual.force.com/apex/Case_View?srPos=0&srKp=500&id=5002K00000dMt7Z&sfdc.override=1
BZ LINK: https://bugzilla.redhat.com/show_bug.cgi?id=1735502

It's confirmed that this issue can be reproduced in the latest OCP version (v3.11.135). According to this bug, QE also reproduced the issue in an AWS environment.

Let us know if further info is required.

Thanks,
Yunyun

*** Bug 1744077 has been marked as a duplicate of this bug. ***

Hi,

We are continuing to look at this. Given the complexity of this bug, we have not been able to find the root cause yet. But rest assured that the work continues.

Thanks,
Alexander

Update: Making some progress. We've determined that, randomly, the CNI binaries are not running to completion. We're not yet sure why. They're still exiting with return code 0, so the kubelet thinks the network is up and running.

We've also found that the kubelet sometimes randomly sends a SIGTERM and SIGCONT to the CNI plugin binary. If the machine is heavily loaded (e.g. after a docker restart), then the network plugin may not have made sufficient progress before being killed.

Once we've done a bit more analysis, we can probably ship a test binary that blocks SIGTERM.

Cannot verify the bug due to Bug 1752641: Latest v3.11 installation failed on QE rpm-rhel7-s3_registry-aws-cloudprovider-elb-ha

Tested and verified on v3.11.146. No pods lost their default gateway route after restarting docker several times.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2816
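
For readers following the SIGTERM analysis in the update above, here is a minimal Python sketch of the general idea behind "a binary that blocks SIGTERM": ignore the signal while the network setup steps run, so a stray SIGTERM cannot cut the setup short. This is only an illustration under stated assumptions. It is not the openshift-sdn CNI plugin code (which is not written in Python) and not necessarily the fix shipped in the advisory. The names `ignore_sigterm`, `setup_pod_network`, and the step functions are hypothetical.

```python
import os
import signal
from contextlib import contextmanager


@contextmanager
def ignore_sigterm():
    """Ignore SIGTERM inside the block, then restore the previous handler."""
    previous = signal.signal(signal.SIGTERM, signal.SIG_IGN)
    try:
        yield
    finally:
        # signal.signal() returns None if the old handler was not set from
        # Python; in that case there is nothing we can restore.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)


def setup_pod_network(steps):
    """Run each setup step (hypothetical callables) to completion.

    Returns the names of the steps that finished, so a caller can tell
    whether the whole sequence ran instead of trusting the exit code alone.
    """
    completed = []
    with ignore_sigterm():
        for step in steps:
            step()
            completed.append(step.__name__)
    return completed
```

A hypothetical usage, where one step receives a SIGTERM midway and the sequence still finishes:

```python
def add_veth():
    pass  # placeholder for creating the pod interface


def add_default_route():
    # Simulate a SIGTERM arriving during setup.
    os.kill(os.getpid(), signal.SIGTERM)


setup_pod_network([add_veth, add_default_route])
```

Ignoring the signal (rather than blocking it with a signal mask) means a SIGTERM that arrives during setup is discarded, not delivered later. Whether that trade-off is acceptable for a real plugin is a separate design question.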