Bug 1735502 - [3.11] Some pods lost default gateway route after restarting docker
Summary: [3.11] Some pods lost default gateway route after restarting docker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Alexander Constantinescu
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks: 1772981
 
Reported: 2019-08-01 01:51 UTC by sfu@redhat.com
Modified: 2023-03-24 15:08 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1748031 1748032 1772981
Environment:
Last Closed: 2019-09-24 08:08:08 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/openshift-ansible pull 11878 (closed): Bug 1735502: compare solely md5 hash for config change (last updated 2021-01-18 11:38:43 UTC)
Red Hat Product Errata RHBA-2019:2816 (last updated 2019-09-24 08:08:18 UTC)

Description sfu@redhat.com 2019-08-01 01:51:24 UTC
Description of problem:
Some pods lost their default gateway route after restarting docker, which prevents the affected pods from communicating with others.

Affected Pod:
===================================
Destination    Gateway        Genmask          Flags   Metric   Ref   Use   Iface
10.128.0.0     0.0.0.0        255.252.0.0      U       0        0     0     eth0
===================================

Normal Pod:
===================================
Destination    Gateway        Genmask          Flags   Metric   Ref   Use   Iface
0.0.0.0        10.130.74.1    0.0.0.0          UG      0        0     0     eth0
10.128.0.0     0.0.0.0        255.252.0.0      U       0        0     0     eth0
10.130.74.0    0.0.0.0        255.255.254.0    U       0        0     0     eth0
224.0.0.0      0.0.0.0        240.0.0.0        U       0        0     0     eth0
===================================
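
For reference, whether a pod is affected can be checked programmatically by looking for a default entry in /proc/net/route inside the pod's network namespace. The following Go sketch is illustrative only (it is not part of the original report) and assumes the standard /proc/net/route column layout, where a hex Destination of 00000000 denotes the default route:

===================================
// checkroute.go: report whether this network namespace has a default route.
// Illustrative sketch; run it inside the pod (e.g. via "oc exec" or nsenter).
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("/proc/net/route")
    if err != nil {
        fmt.Fprintln(os.Stderr, "cannot read routing table:", err)
        os.Exit(1)
    }
    defer f.Close()

    hasDefault := false
    scanner := bufio.NewScanner(f)
    scanner.Scan() // skip the header line (Iface, Destination, Gateway, ...)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        // The Destination column is hex-encoded; 00000000 means 0.0.0.0.
        if len(fields) >= 2 && fields[1] == "00000000" {
            hasDefault = true
            break
        }
    }

    if hasDefault {
        fmt.Println("default route present")
    } else {
        fmt.Println("default route MISSING (pod is affected)")
        os.Exit(2)
    }
}
===================================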


How reproducible:
always

Steps to Reproduce:
1. Restart the docker service on a node.


Actual results:
At the customer site (200 nodes, 5000+ pods), after restarting the docker service, about 50 pods lose their default route and can no longer be reached with ping/curl.

Expected results:
All pods keep their routes and continue running normally after docker is restarted.

Additional info:
Workaround:
1. Delete and rebuild the pod, or
2. Manually add the gateway route inside the pod (see the sketch below).

ocp 3.11.43
docker-1.13.1-96.gitb2f74b2.el7.x86_64
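
For reference, the second workaround amounts to re-adding the default route inside the affected pod's network namespace. Below is a minimal Go sketch using the vishvananda/netlink library; the gateway address 10.130.74.1 and interface eth0 are taken from the "Normal Pod" table above and are only examples, so the actual gateway of the node's pod subnet must be used instead:

===================================
// addroute.go: re-add the default route in an affected pod's network namespace.
// Illustrative sketch only; run it inside the pod's netns (e.g. via nsenter).
package main

import (
    "log"
    "net"

    "github.com/vishvananda/netlink"
)

func main() {
    gw := net.ParseIP("10.130.74.1") // example gateway; use the node's pod-subnet gateway

    link, err := netlink.LinkByName("eth0")
    if err != nil {
        log.Fatalf("lookup eth0: %v", err)
    }

    // A nil Dst means 0.0.0.0/0, i.e. the default route.
    route := &netlink.Route{
        LinkIndex: link.Attrs().Index,
        Gw:        gw,
    }
    if err := netlink.RouteAdd(route); err != nil {
        log.Fatalf("add default route via %s: %v", gw, err)
    }
    log.Printf("default route via %s restored", gw)
}
===================================

The equivalent iproute2 command ("ip route add default via <gateway> dev eth0") achieves the same result, provided it is executed in the pod's network namespace.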

Comment 22 Yunyun Qu 2019-08-23 09:59:25 UTC
Hi Team

I've changed the severity and priority to urgent to reflect the current status, and I will raise an ACE ticket later so that the EMT team can help track this issue along with us. I've also sent a mail to rhose-prio-list so that it can be highlighted. Could you help push this forward and provide an update within this week?

Below is the latest update from account team: 

The account team visited CMB yesterday, and the customer complained that a long time has passed since the issue was escalated and there is still no fix or workaround available. They are questioning the value of our service, and the issue has also delayed the migration of their important applications to the OCP environment. The customer is very worried about how long it will take to address this issue; with the root cause still unidentified after all this time, they doubt whether it is the right choice to move more applications onto OCP in the future.

Details: 
CASE LINK: https://gss--c.na94.visual.force.com/apex/Case_View?srPos=0&srKp=500&id=5002K00000dMt7Z&sfdc.override=1

BZ LINK: https://bugzilla.redhat.com/show_bug.cgi?id=1735502

It is confirmed that this issue can be reproduced on the latest OCP version (v3.11.135). QE also reproduced this issue in an AWS environment, as described in this bug.

Let us know if further info is required. 


Thanks,
Yunyun

Comment 32 Masaki Furuta ( RH ) 2019-08-29 06:43:12 UTC
*** Bug 1744077 has been marked as a duplicate of this bug. ***

Comment 33 Alexander Constantinescu 2019-08-29 16:25:45 UTC
Hi

We are continuing to look at this. Given the complexity of this bug, we have not been able to find the root cause yet.

But rest assured that the work continues.

Thanks,
Alexander

Comment 34 Casey Callendrello 2019-08-30 17:52:32 UTC
Update:
Making some progress. We've determined that, randomly, the CNI binaries are not running to completion. We're not yet sure why. They're still exiting with return code 0, so the kubelet thinks the network is up and running.

We've also found that the kubelet sometimes randomly sends a SIGTERM and SIGCONT to the CNI plugin binary. If the machine is heavily loaded (e.g. after a docker restart), the network plugin may not have made sufficient progress before being killed.

Once we've done a bit more analysis, we can probably ship a test binary that blocks SIGTERM.
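
For context, a "test binary that blocks SIGTERM" would presumably ignore the signal for the duration of the CNI operation, so that a heavily loaded kubelet cannot interrupt network setup halfway. The following Go sketch only illustrates that idea; it is hypothetical and not the actual binary or fix that was shipped, and setupPodNetwork is a placeholder name:

===================================
// Hypothetical sketch of a CNI-plugin-style binary that ignores SIGTERM and
// SIGCONT while it finishes its work, so it cannot be killed mid-setup.
package main

import (
    "log"
    "os/signal"
    "syscall"
)

func main() {
    // Ignore SIGTERM/SIGCONT for the lifetime of the process; the plugin is
    // short-lived and still exits on its own once setup completes.
    signal.Ignore(syscall.SIGTERM, syscall.SIGCONT)

    if err := setupPodNetwork(); err != nil { // placeholder for the real CNI ADD logic
        log.Fatalf("network setup failed: %v", err)
    }
}

// setupPodNetwork stands in for the real work (interface creation, IPAM, routes).
func setupPodNetwork() error {
    // ... real plugin logic would go here ...
    return nil
}
===================================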

Comment 42 Weibin Liang 2019-09-16 19:37:00 UTC
Cannot verify the bug due to Bug 1752641: latest v3.11 installation failed on QE rpm-rhel7-s3_registry-aws-cloudprovider-elb-ha.

Comment 43 Weibin Liang 2019-09-18 15:24:56 UTC
Tested and verified on v3.11.146.
No pods lost their default gateway route after restarting docker several times.

Comment 45 errata-xmlrpc 2019-09-24 08:08:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2816

