Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1486914

Summary: PLEG is not healthy error, node marked NotReady
Product: OpenShift Container Platform
Reporter: Jeremy Eder <jeder>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED DUPLICATE
QA Contact: Meng Bo <bmeng>
Severity: urgent
Priority: urgent
Version: 3.6.0
CC: aos-bugs, bbennett, dma, jokerman, jupierce, mifiedle, mmccomas, mwhittin, rromerom, sjenning, sten
Target Milestone: ---
Target Release: ---
Keywords: OpsBlocker
Hardware: x86_64
OS: Linux
Whiteboard: aos-scalability-37
Last Closed: 2017-10-18 19:33:31 UTC
Type: Bug
Attachments:
pod yaml (flags: none)

Comment 8 Seth Jennings 2017-09-12 19:31:09 UTC
*** Bug 1487334 has been marked as a duplicate of this bug. ***

Comment 9 Seth Jennings 2017-09-13 17:33:58 UTC
Vikas, PTAL

Comment 10 Vikas Choudhary 2017-09-15 08:30:50 UTC
Sten, can you please share the pod yaml file that led to this issue?
If possible, please also share the logs from the node.

Comment 11 Sten Turpin 2017-09-29 20:42:08 UTC
Created attachment 1332527 [details]
pod yaml

Comment 12 Justin Pierce 2017-09-29 20:43:41 UTC
Master and node logs showing the PLEG issues.

http://file.rdu.redhat.com/~jupierce/share/pleg_logs.tgz

The node log in this TGZ is at logs-starter-us-east-1-201709292012/nodes/starter-us-east-1-node-compute-8c013/journal/atomic-openshift-node

Comment 15 Mike Fiedler 2017-10-02 12:19:43 UTC
Also, see:  https://bugzilla.redhat.com/show_bug.cgi?id=1451902

Comment 16 Seth Jennings 2017-10-03 15:54:19 UTC
All of the docker_operations_latency_microseconds metrics looked good, yet the overall pleg_relist_latency_microseconds was very high.  Because the latency was so high, we decided to capture a goroutine stack trace of the process and see where the relist code path was blocked.  Here are two goroutines associated with the blocked PLEG relist:

http://pastebin.test.redhat.com/521118

It is blocked on a mutex in GetPodNetworkStatus(), downstream and across the CRI from the updateCache() call in the PLEG relist.  So this is an issue with networking, not with docker operations or IOPS/BurstBalance as first thought.

The mutex is a per-pod mutex and is only taken in NetworkPlugin SetUpPod(), TearDownPod(), and GetPodNetworkStatus().

https://github.com/openshift/origin/blob/release-3.6/vendor/k8s.io/kubernetes/pkg/kubelet/network/plugins.go#L379-L416
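For illustration, a condensed sketch of that locking pattern (abbreviated from the plugins.go linked above; the real PluginManager also reference-counts the per-pod locks so map entries can be released safely). All three operations serialize on the same per-pod mutex, so a TearDownPod() stuck in a CNI exec also stalls the GetPodNetworkStatus() call that the PLEG relist's updateCache() depends on:

package main

import "sync"

// Condensed sketch of the per-pod locking in kubelet's network
// PluginManager; not the verbatim upstream code.
type pluginManager struct {
	mu   sync.Mutex             // guards the pods map itself
	pods map[string]*sync.Mutex // one lock per pod, keyed by "namespace/name"
}

func (pm *pluginManager) podLock(fullPodName string) *sync.Mutex {
	pm.mu.Lock()
	defer pm.mu.Unlock()
	if pm.pods[fullPodName] == nil {
		pm.pods[fullPodName] = &sync.Mutex{}
	}
	return pm.pods[fullPodName]
}

func (pm *pluginManager) tearDownPod(fullPodName string) {
	l := pm.podLock(fullPodName)
	l.Lock()
	defer l.Unlock()
	// ... exec out to the CNI plugin (where the goroutines in the
	// dump are blocked) ...
}

func (pm *pluginManager) getPodNetworkStatus(fullPodName string) {
	l := pm.podLock(fullPodName)
	l.Lock() // waits behind a slow tearDownPod for the same pod
	defer l.Unlock()
	// ... ask the plugin for the pod's network status ...
}

func main() {
	pm := &pluginManager{pods: map[string]*sync.Mutex{}}
	pm.getPodNetworkStatus("default/example")
	pm.tearDownPod("default/example")
}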

Here is the full goroutine dump:

http://file.rdu.redhat.com/~sturpin/goroutine 
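(The comment doesn't say how the dump was captured; for reference, a generic sketch of how a Go process can expose goroutine stacks via net/http/pprof, not the node's actual debug wiring:)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers the /debug/pprof/* handlers
)

func main() {
	// Full goroutine stacks are then available at:
	//   http://localhost:6060/debug/pprof/goroutine?debug=2
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}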

In the dump, SetUpPod() is not running anywhere; however, there are 6 TearDownPod() traces, all blocked on an exec out to the CNI plugin.

Dan looked into the SDN code and found that pod teardown operations are serialized behind some pretty coarse locks, and that goroutines block for minutes behind these locks.  This is consistent with the PLEG relist latency we see.
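As a toy illustration of how that serialization compounds (hypothetical timings, not measurements from this cluster): with one lock serializing every pod's teardown, the Nth queued teardown waits roughly (N-1) times the per-exec cost before its own CNI exec even starts.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy model of teardowns funneled through one coarse lock; the per-exec
// duration below is made up for illustration.
var coarseLock sync.Mutex

func tearDown(pod string, perExec time.Duration) {
	start := time.Now()
	coarseLock.Lock() // every pod's teardown contends here
	waited := time.Since(start)
	time.Sleep(perExec) // stand-in for the exec out to the CNI plugin
	coarseLock.Unlock()
	fmt.Printf("%s waited %v behind the coarse lock\n", pod, waited)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 6; i++ { // the dump showed 6 blocked TearDownPod traces
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			tearDown(fmt.Sprintf("pod-%d", i), 100*time.Millisecond)
		}(i)
	}
	wg.Wait()
}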

Sending to networking to continue investigation and fix.

Comment 17 Ben Bennett 2017-10-11 19:57:45 UTC
PR https://github.com/openshift/origin/pull/16692 changed teardown to skip calling iptables when the pod defines no hostports.

That may have helped slightly with this.
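A minimal sketch of that kind of short-circuit (the shape is assumed for illustration; see the PR for the actual change): if no container in the pod maps a host port, there are no DNAT rules to remove, so the iptables exec can be skipped entirely.

package main

// Hypothetical sketch, not the actual PR diff; the type and function
// names below are illustrative stand-ins.
type portMapping struct {
	HostPort      int32
	ContainerPort int32
}

func teardownPodHostports(mappings []portMapping, syncIptables func([]portMapping) error) error {
	hasHostport := false
	for _, m := range mappings {
		if m.HostPort > 0 {
			hasHostport = true
			break
		}
	}
	if !hasHostport {
		return nil // nothing to clean up; skip the iptables call entirely
	}
	return syncIptables(mappings)
}

func main() {
	// A pod with no hostports returns immediately, without touching iptables.
	_ = teardownPodHostports(nil, func([]portMapping) error { return nil })
}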

Comment 19 Ben Bennett 2017-10-18 19:33:31 UTC

*** This bug has been marked as a duplicate of bug 1451902 ***