Description of problem:
Some pods lose their default gateway route after restarting docker, which blocks those pods from communicating with others.

Affected pod:
===================================
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.128.0.0      0.0.0.0         255.252.0.0     U     0      0        0 eth0
===================================

Normal pod:
===================================
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.130.74.1     0.0.0.0         UG    0      0        0 eth0
10.128.0.0      0.0.0.0         255.252.0.0     U     0      0        0 eth0
10.130.74.0     0.0.0.0         255.255.254.0   U     0      0        0 eth0
224.0.0.0       0.0.0.0         240.0.0.0       U     0      0        0 eth0
===================================

How reproducible:
Always.

Steps to Reproduce:
1. Restart the docker service on a node.

Actual results:
At the customer site (200 nodes, 5000+ pods), about 50 pods lose the default route after the docker service restarts and can no longer ping/curl successfully.

Expected results:
All pods keep running well.

Additional info:
Workarounds:
1. Delete and rebuild the affected pod.
2. Manually add the gateway route inside the pod (a sketch follows below).

Versions:
OCP 3.11.43
docker-1.13.1-96.gitb2f74b2.el7.x86_64
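For the second workaround, the missing default route can be restored from inside the pod's network namespace. Below is a minimal, hypothetical Go sketch using the vishvananda/netlink library; the gateway 10.130.74.1 is copied from the normal pod's table above and will differ per node. It is illustrative only, not a supported tool from this bug.

===================================
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// Must run inside the affected pod's network namespace,
	// e.g. entered via nsenter on the node.
	link, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatalf("lookup eth0: %v", err)
	}

	// A route with no Dst is the default route; the gateway address
	// here is an assumption taken from the healthy pod's table above.
	route := &netlink.Route{
		LinkIndex: link.Attrs().Index,
		Gw:        net.ParseIP("10.130.74.1"),
	}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatalf("add default route: %v", err)
	}
}
===================================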
Hi Team,

I've changed the severity and priority to urgent to reflect the current status, and will raise an ACE ticket later so that the EMT team can help us track this issue. I've also sent a mail to rhose-prio-list so that it can be highlighted. Could you help push this forward and provide an update this week?

Below is the latest update from the account team:

The account team visited CMB yesterday, and the customer complained that it has been a long time since the issue was escalated and there is still no fix or workaround available. Our service value is being challenged by them, and it has also had a bad impact on migrating their important apps to the OCP environment. The customer is really worried about how long it will take to address this issue. With the root cause unresolved for so long, they question whether it is the right choice to move more apps onto OCP in the future.

Details:
CASE LINK: https://gss--c.na94.visual.force.com/apex/Case_View?srPos=0&srKp=500&id=5002K00000dMt7Z&sfdc.override=1
BZ LINK: https://bugzilla.redhat.com/show_bug.cgi?id=1735502

It's confirmed that this issue can be reproduced on the latest OCP version (v3.11.135). Per the comments on this bug, QE also reproduced the issue on an AWS env.

Let us know if further info is required.

Thanks,
Yunyun
*** Bug 1744077 has been marked as a duplicate of this bug. ***
Hi,

We are continuing to look at this. Given the complexity of this bug, we have not been able to find the root cause yet, but rest assured that the work continues.

Thanks,
Alexander
Update: Making some progress.

We've determined that, randomly, the CNI binaries are not running to completion. We're not yet sure why. They're still exiting with return code 0, so the kubelet thinks the network is up and running.

We've also found that the kubelet sometimes randomly sends a SIGTERM and SIGCONT to the CNI plugin binary. If the machine is heavily loaded (e.g. after a docker restart), the network plugin may not have made sufficient progress before being killed.

Once we've done a bit more analysis, we can probably ship a test binary that blocks SIGTERM.
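As a rough illustration of that mitigation (not the actual test binary), a Go plugin can mask SIGTERM for the duration of its setup work using the standard os/signal package:

===================================
package main

import (
	"os/signal"
	"syscall"
)

func main() {
	// Ignore SIGTERM for the lifetime of the process so that a signal
	// sent by the kubelet (e.g. while the node is overloaded during a
	// docker restart) cannot kill the plugin before network setup
	// finishes. SIGCONT needs no handling; it only resumes a stopped
	// process.
	signal.Ignore(syscall.SIGTERM)

	// ... the plugin's normal CNI ADD/DEL work would run here ...
}
===================================

Because the kubelet only inspects the exit code, letting the plugin run to completion (or exit non-zero if interrupted) avoids the half-configured state described above, where routes are missing but the network is reported as up.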
Cannot verify the bug due to Bug 1752641: Latest v3.11 installation failed on QE rpm-rhel7-s3_registry-aws-cloudprovider-elb-ha
Tested and verified on v3.11.146. No pods lost the default gateway route after restarting docker several times.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2816