Bug 1493182
| Field | Value |
|---|---|
| Summary | NodeNotReady scaling up to 11.4K pods in 950 projects - NetworkPluginNotReady message: docker: network plugin is not ready: cni config uninitialized |
| Product | OpenShift Container Platform |
| Component | Networking |
| Version | 3.7.0 |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | Mike Fiedler <mifiedle> |
| Assignee | Ben Bennett <bbennett> |
| QA Contact | Mike Fiedler <mifiedle> |
| CC | aos-bugs, bbennett, jmencak, mifiedle, wabouham |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | aos-scalability-37 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2017-10-24 17:40:04 UTC |
Description

Mike Fiedler 2017-09-19 14:27:41 UTC

---

Do you have full node logs from the affected nodes? openshift-sdn only writes out the config file to enable NodeReady once it has finished initial setup. So it appears that setup is taking a while, and we should figure out how long things are really taking. Also, what is the CPU load on the affected nodes during setup?

---

The attached file is the full node log (`journalctl > <output>`). I'll grab pbench data the next time I set this up, possibly tomorrow.

---

There are a lot of weird log messages there. It looks like atomic-openshift-node gets restarted quite a few times by something; I'm not sure whether that is ansible, systemd, or something else. At the 12:21:28 mark, the node has just restarted and the SDN isn't up and running yet; that takes a bit of time. It is still initializing at 12:22:56. Then for some reason openshift-node gets killed by ansible at 12:23:01 and restarted. Then it looks like everything is mostly ready at 12:23:11? So I'm not sure this is a network problem. It looks like something is slowing the node down (in the first service run in the logs, up until 12:20-ish), and then something is restarting openshift-node. Any idea what that is?

---

Reproduced this today and captured system data right after a node went NotReady:

http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-17-233/20170927_from890_a/tools-default/ip-172-31-44-27.us-west-2.compute.internal/

sar shows the system (4 vCPU, 16 GB) getting busy in peaks: it has 244 pods running on it, with 2.5 cores used in spikes and 5 GB of memory in use. I grabbed the logs; there are earlier instances, but if you search up from the bottom for NotReady you will find the instance captured here. Note that the node service restarted immediately after. Maybe this should belong to the Pod component? Will try to repro tomorrow with loglevel 5; this one was only 2.

---

Created attachment 1331567 [details]
Node log/system log (loglevel 2)

---

This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1451902

*** This bug has been marked as a duplicate of bug 1451902 ***
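The readiness mechanism discussed in the thread (the kubelet reports "cni config uninitialized" until the SDN plugin drops its CNI config file) can be sketched as a small check. This is a hedged illustration, not tooling from the bug: the directory `/etc/cni/net.d` and the `check_cni_ready` helper are assumptions based on the standard kubelet CNI config location.

```shell
#!/bin/sh
# Sketch: a node stays NotReady with "cni config uninitialized" until the
# network plugin writes a config file into the kubelet's CNI config directory.
# /etc/cni/net.d is the conventional default path (an assumption here).
check_cni_ready() {
    dir="$1"
    # Treat any *.conf file in the directory as an initialized CNI config.
    if ls "$dir"/*.conf >/dev/null 2>&1; then
        echo "ready"
    else
        echo "cni config uninitialized"
    fi
}

check_cni_ready "${CNI_CONF_DIR:-/etc/cni/net.d}"
```

On an affected node this mirrors what the kubelet reports: while openshift-sdn is still initializing the directory is empty, and only once the plugin writes its config can the node transition to Ready.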