Bug 1735502 - [3.11] Some pods lost default gateway route after restarting docker
Summary: [3.11] Some pods lost default gateway route after restarting docker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Alexander Constantinescu
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks: 1772981
 
Reported: 2019-08-01 01:51 UTC by sfu@redhat.com
Modified: 2023-03-24 15:08 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1748031 1748032 1772981
Environment:
Last Closed: 2019-09-24 08:08:08 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/openshift-ansible pull 11878 (closed): Bug 1735502: compare solely md5 hash for config change (last updated 2021-01-18 11:38:43 UTC)
Red Hat Product Errata RHBA-2019:2816 (last updated 2019-09-24 08:08:18 UTC)

Description sfu@redhat.com 2019-08-01 01:51:24 UTC
Description of problem:
Some pods lost their default gateway route after restarting docker, which prevents the affected pods from communicating with others.

Affected Pod:
===================================
Destination    Gateway        Genmask          Flags   Metric   Ref   Use   Iface
10.128.0.0     0.0.0.0        255.252.0.0      U       0        0     0     eth0
===================================

Normal Pod:
===================================
Destination    Gateway        Genmask          Flags   Metric   Ref   Use   Iface
0.0.0.0        10.130.74.1    0.0.0.0          UG      0        0     0     eth0
10.128.0.0     0.0.0.0        255.252.0.0      U       0        0     0     eth0
10.130.74.0    0.0.0.0        255.255.254.0    U       0        0     0     eth0
224.0.0.0      0.0.0.0        240.0.0.0        U       0        0     0     eth0
===================================
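
For reference, whether a pod is affected can be checked programmatically by looking for a default entry in /proc/net/route inside the pod's network namespace. The following Go sketch is illustrative only (it is not part of the original report) and assumes the standard /proc/net/route column layout, where a hex Destination of 00000000 denotes the default route:

===================================
// checkroute.go: report whether this network namespace has a default route.
// Illustrative sketch; run it inside the pod (e.g. via "oc exec" or nsenter).
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("/proc/net/route")
    if err != nil {
        fmt.Fprintln(os.Stderr, "cannot read routing table:", err)
        os.Exit(1)
    }
    defer f.Close()

    hasDefault := false
    scanner := bufio.NewScanner(f)
    scanner.Scan() // skip the header line (Iface, Destination, Gateway, ...)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        // The Destination column is hex-encoded; 00000000 means 0.0.0.0.
        if len(fields) >= 2 && fields[1] == "00000000" {
            hasDefault = true
            break
        }
    }

    if hasDefault {
        fmt.Println("default route present")
    } else {
        fmt.Println("default route MISSING (pod is affected)")
        os.Exit(2)
    }
}
===================================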


How reproducible:
always

Steps to Reproduce:
1. Restart the docker service on a node.


Actual results:
At the customer site (200 nodes, 5000+ pods), after restarting the docker service, about 50 pods lose their default route and can no longer be reached with ping/curl.

Expected results:
All pods keep their routes and continue running normally after docker is restarted.

Additional info:
Workaround:
1. Delete and rebuild the pod, or
2. Manually add the gateway route inside the pod (see the sketch below).

ocp 3.11.43
docker-1.13.1-96.gitb2f74b2.el7.x86_64
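
For reference, the second workaround amounts to re-adding the default route inside the affected pod's network namespace. Below is a minimal Go sketch using the vishvananda/netlink library; the gateway address 10.130.74.1 and interface eth0 are taken from the "Normal Pod" table above and are only examples, so the actual gateway of the node's pod subnet must be used instead:

===================================
// addroute.go: re-add the default route in an affected pod's network namespace.
// Illustrative sketch only; run it inside the pod's netns (e.g. via nsenter).
package main

import (
    "log"
    "net"

    "github.com/vishvananda/netlink"
)

func main() {
    gw := net.ParseIP("10.130.74.1") // example gateway; use the node's pod-subnet gateway

    link, err := netlink.LinkByName("eth0")
    if err != nil {
        log.Fatalf("lookup eth0: %v", err)
    }

    // A nil Dst means 0.0.0.0/0, i.e. the default route.
    route := &netlink.Route{
        LinkIndex: link.Attrs().Index,
        Gw:        gw,
    }
    if err := netlink.RouteAdd(route); err != nil {
        log.Fatalf("add default route via %s: %v", gw, err)
    }
    log.Printf("default route via %s restored", gw)
}
===================================

The equivalent iproute2 command ("ip route add default via <gateway> dev eth0") achieves the same result, provided it is executed in the pod's network namespace.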

Comment 22 Yunyun Qu 2019-08-23 09:59:25 UTC
Hi Team

I've changed the severity and priority to urgent to reflect the current status, and I will raise an ACE ticket later so that the EMT team can help track this issue along with us. I've also sent a mail to rhose-prio-list so that it can be highlighted. Could you help push this forward and provide an update within this week?

Below is the latest update from account team: 

The account team visited CMB yesterday, and the customer complained that a long time has passed since the issue was escalated and there is still no fix or workaround available. They are questioning the value of our service, and the issue has also delayed the migration of their important applications to the OCP environment. The customer is very worried about how long it will take to address this issue; with the root cause still unidentified after all this time, they doubt whether it is the right choice to move more applications onto OCP in the future.

Details: 
CASE LINK: https://gss--c.na94.visual.force.com/apex/Case_View?srPos=0&srKp=500&id=5002K00000dMt7Z&sfdc.override=1

BZ LINK: https://bugzilla.redhat.com/show_bug.cgi?id=1735502

It is confirmed that this issue can be reproduced on the latest OCP version (v3.11.135). QE also reproduced this issue in an AWS environment, as described in this bug.

Let us know if further info is required. 


Thanks,
Yunyun

Comment 32 Masaki Furuta ( RH ) 2019-08-29 06:43:12 UTC
*** Bug 1744077 has been marked as a duplicate of this bug. ***

Comment 33 Alexander Constantinescu 2019-08-29 16:25:45 UTC
Hi

We are continuing to look at this. Given the complexity of this bug, we have not been able to find the root cause yet.

But rest assured that the work continues.

Thanks,
Alexander

Comment 34 Casey Callendrello 2019-08-30 17:52:32 UTC
Update:
Making some progress. We've determined that, randomly, the CNI binaries are not running to completion. We're not yet sure why. They're still exiting with return code 0, so the kubelet thinks the network is up and running.

We've also found that the kubelet sometimes randomly sends a SIGTERM and SIGCONT to the CNI plugin binary. If the machine is heavily loaded (e.g. after a docker restart), the network plugin may not have made sufficient progress before being killed.

Once we've done a bit more analysis, we can probably ship a test binary that blocks SIGTERM.
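
For context, a "test binary that blocks SIGTERM" would presumably ignore the signal for the duration of the CNI operation, so that a heavily loaded kubelet cannot interrupt network setup halfway. The following Go sketch only illustrates that idea; it is hypothetical and not the actual binary or fix that was shipped, and setupPodNetwork is a placeholder name:

===================================
// Hypothetical sketch of a CNI-plugin-style binary that ignores SIGTERM and
// SIGCONT while it finishes its work, so it cannot be killed mid-setup.
package main

import (
    "log"
    "os/signal"
    "syscall"
)

func main() {
    // Ignore SIGTERM/SIGCONT for the lifetime of the process; the plugin is
    // short-lived and still exits on its own once setup completes.
    signal.Ignore(syscall.SIGTERM, syscall.SIGCONT)

    if err := setupPodNetwork(); err != nil { // placeholder for the real CNI ADD logic
        log.Fatalf("network setup failed: %v", err)
    }
}

// setupPodNetwork stands in for the real work (interface creation, IPAM, routes).
func setupPodNetwork() error {
    // ... real plugin logic would go here ...
    return nil
}
===================================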

Comment 42 Weibin Liang 2019-09-16 19:37:00 UTC
Cannot verify the bug due to Bug 1752641: latest v3.11 installation failed on QE rpm-rhel7-s3_registry-aws-cloudprovider-elb-ha.

Comment 43 Weibin Liang 2019-09-18 15:24:56 UTC
Tested and verified on v3.11.146.
No pods lost their default gateway route after restarting docker several times.

Comment 45 errata-xmlrpc 2019-09-24 08:08:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2816

