Bug 1474875
| Summary: | [starter-us-east-1] atomic-openshift-node takes a long time to start | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Status: | CLOSED DUPLICATE | QA Contact: | Meng Bo <bmeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.6.0 | CC: | aloughla, aos-bugs, atragler, bbennett, ccoleman, eparis, jupierce, mwoodson, rkhan, sdodson, sukulkar |
| Target Milestone: | --- | Keywords: | DeliveryBlocker |
| Target Release: | 3.6.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-10-26 18:01:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||

Description
Justin Pierce
2017-07-25 14:31:46 UTC

Log of atomic-openshift-node start with loglevel=3: http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow3.log

Suggested workaround for now:

sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' /etc/systemd/system/atomic-openshift-node.service && systemctl daemon-reload

The installer now sets TimeoutStartSec=300 when defining the node unit during install and upgrade in 3.6. I wouldn't consider that to be a complete fix though.

Can't proceed on this until we get another cluster into this state and can debug it.

All of the starter nodes take > 90s to start. What do you need to collect?

We extended the timeout, so it starts eventually (sub 5m) but still always > 90s (the default timeout). This did not happen in 3.5-

Tell me what you need. Ben can get node logs...

(In reply to Eric Paris from comment #5)
> All of the starter nodes take > 90s to start. What do you need to collect?
>
> We extended the timeout, so it starts eventually (sub 5m) but still always > 90s (the default timeout). This did not happen in 3.5-
>
> Tell me what you need. Ben can get node logs...

If we can get node logs with --loglevel=5 then we can find out if it's the iptables-save locking changes or what. We may need to run dcbw's debug RPMs (and then back them out after we get logs with the slowness) to try to isolate why/if iptables is slow, if loglevel 5 is insufficient.

The ball is currently in the kernel's court... iptables can take a long time when the system is busy.

What's the kernel BZ?

The kernel iptables BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1503702

*** This bug has been marked as a duplicate of bug 1451902 ***
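
The sed workaround above edits the node unit file in place. A possible alternative, sketched here only as an illustration, is to set the same TimeoutStartSec=300 value through a systemd drop-in file, which leaves the installed unit file untouched. The function name, the drop-in file name 10-start-timeout.conf, and the optional directory argument are assumptions made for this sketch and do not come from the bug report or the installer.

```shell
#!/usr/bin/env bash
# Sketch: raise the node unit's start timeout with a systemd drop-in
# instead of rewriting the unit file with sed.
# Hypothetical names (not from the bug report): the function name,
# the drop-in file name, and the optional directory argument.

write_start_timeout_dropin() {
    # $1: timeout in seconds (defaults to the 300 used in the workaround)
    # $2: drop-in directory (defaults to the node unit's drop-in path)
    local seconds="${1:-300}"
    local dropin_dir="${2:-/etc/systemd/system/atomic-openshift-node.service.d}"

    mkdir -p "$dropin_dir" || return 1

    # A drop-in only needs the section header and the setting it overrides.
    printf '[Service]\nTimeoutStartSec=%s\n' "$seconds" \
        > "$dropin_dir/10-start-timeout.conf"
}

# On a real node (as root) this would be followed by a reload, e.g.:
#   write_start_timeout_dropin 300 && systemctl daemon-reload
```

As with the sed command, this only lengthens the window systemd allows for startup; it does not address why the node takes more than 90s to start.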