Bug 1486914
| Summary: | PLEG is not healthy error, node marked NotReady | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jeremy Eder <jeder> | ||||
| Component: | Networking | Assignee: | Dan Williams <dcbw> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | Meng Bo <bmeng> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 3.6.0 | CC: | aos-bugs, bbennett, dma, jokerman, jupierce, mifiedle, mmccomas, mwhittin, rromerom, sjenning, sten | ||||
| Target Milestone: | --- | Keywords: | OpsBlocker | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | aos-scalability-37 | ||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2017-10-18 19:33:31 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
Comment 8
Seth Jennings
2017-09-12 19:31:09 UTC
Vikas, PTAL.

Sten, can you please share the pod YAML file that led to this issue? And also, if possible, logs from the node.

Created attachment 1332527 [details]
pod yaml
Master and node logs showing the PLEG issues:
http://file.rdu.redhat.com/~jupierce/share/pleg_logs.tgz

The node log in this TGZ is at logs-starter-us-east-1-201709292012/nodes/starter-us-east-1-node-compute-8c013/journal/atomic-openshift-node

All of the docker_operations_latency_microseconds metrics looked good, yet the overall pleg_relist_latency_microseconds was very high. Because the latency was so high, we decided to get a goroutine stack trace on the process and see where the relist code path was blocked.

Here are two goroutines associated with the blocked PLEG relist:
http://pastebin.test.redhat.com/521118

The relist is blocked on a mutex in GetPodNetworkStatus(), downstream and across the CRI from the updateCache() call in the PLEG relist. So this is an issue with the networking, not docker operations or IOPS/BurstBalance as first thought.

The mutex is a per-pod mutex and is only taken in NetworkPlugin SetUpPod(), TearDownPod(), and GetPodNetworkStatus().
https://github.com/openshift/origin/blob/release-3.6/vendor/k8s.io/kubernetes/pkg/kubelet/network/plugins.go#L379-L416

Here is the full goroutine dump:
http://file.rdu.redhat.com/~sturpin/goroutine

SetUpPod() is not running anywhere; however, there are 6 TearDownPod() traces, all blocked on an exec out to the CNI plugin.

Dan looked into the SDN code and found that pod tear-down operations are serialized behind some pretty coarse locks, and that routines are blocking for minutes behind these locks. This is consistent with the PLEG relist latency we see.

Sending to networking to continue investigation and fix.

PR https://github.com/openshift/origin/pull/16692 made it not call iptables for tear-down when there are no hostports defined in the pod. That may have helped slightly with this.

*** This bug has been marked as a duplicate of bug 1451902 ***