Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1486914

Summary: PLEG is not healthy error, node marked NotReady
Product: OpenShift Container Platform
Reporter: Jeremy Eder <jeder>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED DUPLICATE
QA Contact: Meng Bo <bmeng>
Severity: urgent
Priority: urgent
Version: 3.6.0
CC: aos-bugs, bbennett, dma, jokerman, jupierce, mifiedle, mmccomas, mwhittin, rromerom, sjenning, sten
Target Milestone: ---
Target Release: ---
Keywords: OpsBlocker
Hardware: x86_64
OS: Linux
Whiteboard: aos-scalability-37
Last Closed: 2017-10-18 19:33:31 UTC
Type: Bug
Attachments:
pod yaml (flags: none)

Comment 8 Seth Jennings 2017-09-12 19:31:09 UTC
*** Bug 1487334 has been marked as a duplicate of this bug. ***

Comment 9 Seth Jennings 2017-09-13 17:33:58 UTC
Vikas, PTAL

Comment 10 Vikas Choudhary 2017-09-15 08:30:50 UTC
Sten, can you please share the pod yaml file that led to this issue?
If possible, please also share the logs from the node.

Comment 11 Sten Turpin 2017-09-29 20:42:08 UTC
Created attachment 1332527 [details]
pod yaml

Comment 12 Justin Pierce 2017-09-29 20:43:41 UTC
Master and node logs showing the PLEG issues.

http://file.rdu.redhat.com/~jupierce/share/pleg_logs.tgz

The node log in this TGZ is at logs-starter-us-east-1-201709292012/nodes/starter-us-east-1-node-compute-8c013/journal/atomic-openshift-node

Comment 15 Mike Fiedler 2017-10-02 12:19:43 UTC
Also, see:  https://bugzilla.redhat.com/show_bug.cgi?id=1451902

Comment 16 Seth Jennings 2017-10-03 15:54:19 UTC
All of the docker_operations_latency_microseconds metrics looked good, yet the overall pleg_relist_latency_microseconds was very high.  Because the latency was so high, we decided to capture a goroutine stack trace of the process and see where the relist code path was blocked.  Here are two goroutines associated with the blocked PLEG relist:

http://pastebin.test.redhat.com/521118

It is blocked on a mutex in GetPodNetworkStatus(), downstream and across the CRI from the updateCache() call in the PLEG relist.  So this is an issue with networking, not with docker operations or IOPS/BurstBalance as first thought.

The mutex is a per-pod mutex and is only taken in NetworkPlugin SetUpPod(), TearDownPod(), and GetPodNetworkStatus().

https://github.com/openshift/origin/blob/release-3.6/vendor/k8s.io/kubernetes/pkg/kubelet/network/plugins.go#L379-L416
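For illustration, a condensed sketch of that locking pattern (abbreviated from the plugins.go linked above; the real PluginManager also reference-counts the per-pod locks so map entries can be released safely). All three operations serialize on the same per-pod mutex, so a TearDownPod() stuck in a CNI exec also stalls the GetPodNetworkStatus() call that the PLEG relist's updateCache() depends on:

package main

import "sync"

// Condensed sketch of the per-pod locking in kubelet's network
// PluginManager; not the verbatim upstream code.
type pluginManager struct {
	mu   sync.Mutex             // guards the pods map itself
	pods map[string]*sync.Mutex // one lock per pod, keyed by "namespace/name"
}

func (pm *pluginManager) podLock(fullPodName string) *sync.Mutex {
	pm.mu.Lock()
	defer pm.mu.Unlock()
	if pm.pods[fullPodName] == nil {
		pm.pods[fullPodName] = &sync.Mutex{}
	}
	return pm.pods[fullPodName]
}

func (pm *pluginManager) tearDownPod(fullPodName string) {
	l := pm.podLock(fullPodName)
	l.Lock()
	defer l.Unlock()
	// ... exec out to the CNI plugin (where the goroutines in the
	// dump are blocked) ...
}

func (pm *pluginManager) getPodNetworkStatus(fullPodName string) {
	l := pm.podLock(fullPodName)
	l.Lock() // waits behind a slow tearDownPod for the same pod
	defer l.Unlock()
	// ... ask the plugin for the pod's network status ...
}

func main() {
	pm := &pluginManager{pods: map[string]*sync.Mutex{}}
	pm.getPodNetworkStatus("default/example")
	pm.tearDownPod("default/example")
}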

Here is the full goroutine dump:

http://file.rdu.redhat.com/~sturpin/goroutine 
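(The comment doesn't say how the dump was captured; for reference, a generic sketch of how a Go process can expose goroutine stacks via net/http/pprof, not the node's actual debug wiring:)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers the /debug/pprof/* handlers
)

func main() {
	// Full goroutine stacks are then available at:
	//   http://localhost:6060/debug/pprof/goroutine?debug=2
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}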

In the dump, SetUpPod() is not running anywhere; however, there are 6 TearDownPod() traces, all blocked on an exec out to the CNI plugin.

Dan looked into the SDN code and found that pod teardown operations are serialized behind some pretty coarse locks, and that goroutines block for minutes behind these locks.  This is consistent with the PLEG relist latency we see.
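As a toy illustration of how that serialization compounds (hypothetical timings, not measurements from this cluster): with one lock serializing every pod's teardown, the Nth queued teardown waits roughly (N-1) times the per-exec cost before its own CNI exec even starts.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy model of teardowns funneled through one coarse lock; the per-exec
// duration below is made up for illustration.
var coarseLock sync.Mutex

func tearDown(pod string, perExec time.Duration) {
	start := time.Now()
	coarseLock.Lock() // every pod's teardown contends here
	waited := time.Since(start)
	time.Sleep(perExec) // stand-in for the exec out to the CNI plugin
	coarseLock.Unlock()
	fmt.Printf("%s waited %v behind the coarse lock\n", pod, waited)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 6; i++ { // the dump showed 6 blocked TearDownPod traces
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			tearDown(fmt.Sprintf("pod-%d", i), 100*time.Millisecond)
		}(i)
	}
	wg.Wait()
}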

Sending to networking to continue investigation and fix.

Comment 17 Ben Bennett 2017-10-11 19:57:45 UTC
PR https://github.com/openshift/origin/pull/16692 changed teardown to skip calling iptables when the pod defines no hostports.

That may have helped slightly with this.
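A minimal sketch of that kind of short-circuit (the shape is assumed for illustration; see the PR for the actual change): if no container in the pod maps a host port, there are no DNAT rules to remove, so the iptables exec can be skipped entirely.

package main

// Hypothetical sketch, not the actual PR diff; the type and function
// names below are illustrative stand-ins.
type portMapping struct {
	HostPort      int32
	ContainerPort int32
}

func teardownPodHostports(mappings []portMapping, syncIptables func([]portMapping) error) error {
	hasHostport := false
	for _, m := range mappings {
		if m.HostPort > 0 {
			hasHostport = true
			break
		}
	}
	if !hasHostport {
		return nil // nothing to clean up; skip the iptables call entirely
	}
	return syncIptables(mappings)
}

func main() {
	// A pod with no hostports returns immediately, without touching iptables.
	_ = teardownPodHostports(nil, func([]portMapping) error { return nil })
}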

Comment 19 Ben Bennett 2017-10-18 19:33:31 UTC

*** This bug has been marked as a duplicate of bug 1451902 ***