Description of problem:
The default timeout for starting atomic-openshift-node from systemd is far too short. The timeout needed to be set to 300 seconds. This was performed on the starter-us-east-1 master starter-us-east-1-master-25064.

Version-Release number of selected component (if applicable):
3.6.152.0

How reproducible:
100%

Steps to Reproduce:
1. systemctl start atomic-openshift-node on a master

Actual results:
The default timeout of 90 seconds was insufficient for the service to start. We set TimeoutStartSec=300 and tried again. Journal taken shortly after the service reported active, with the 300 second startup timeout in place: http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow.txt

Expected results:
The startup process should be faster, or the default TimeoutStartSec needs to be overridden.

Additional info:
Log of atomic-openshift-node start with loglevel=3: http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow3.log
Suggested workaround for now:

sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' /etc/systemd/system/atomic-openshift-node.service && systemctl daemon-reload
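An alternative not mentioned in the report is a systemd drop-in, which avoids editing the unit file in place and survives the unit being rewritten. A minimal sketch, assuming the unit is installed as /etc/systemd/system/atomic-openshift-node.service:

mkdir -p /etc/systemd/system/atomic-openshift-node.service.d
cat > /etc/systemd/system/atomic-openshift-node.service.d/timeout.conf <<'EOF'
# Drop-in override: allow the node service up to 300s to start
[Service]
TimeoutStartSec=300
EOF
systemctl daemon-reload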
The installer now sets TimeoutStartSec=300 when defining the node unit during install and upgrade in 3.6. I wouldn't consider that to be a complete fix though.
Can't proceed on this until we get another cluster into this state and can debug it.
All of the starter nodes take > 90s to start. What do you need to collect? We extended the timeout, so it starts eventually (sub 5m) but still always > 90s (the default timeout). This did not happen in 3.5. Tell me what you need. Ben can get node logs...
(In reply to Eric Paris from comment #5)
> All of the starter nodes take > 90s to start. What do you need to collect?
>
> We extended the timeout, so it starts eventually (sub 5m) but still always
> > 90s (the default timeout). This did not happen in 3.5.
>
> Tell me what you need. Ben can get node logs...

If we can get node logs with --loglevel=5 then we can find out if it's the iptables-save locking changes or what.
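One way to capture that (a sketch, assuming the node's command-line options live in an OPTIONS variable in /etc/sysconfig/atomic-openshift-node; adjust if they are set elsewhere):

# Raise node verbosity to 5, restart, and grab the journal covering the slow startup
sed -i 's/^OPTIONS=.*/OPTIONS=--loglevel=5/' /etc/sysconfig/atomic-openshift-node
systemctl restart atomic-openshift-node
journalctl -u atomic-openshift-node --since "15 minutes ago" > node-loglevel5.log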
If loglevel 5 is insufficient, we may need to run dcbw's debug RPMs (and then back them out after we capture logs showing the slowness) to isolate whether, and why, iptables is slow.
The ball is currently in the kernel's court... iptables can take a long time when the system is busy.
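For anyone trying to confirm that locally, an illustrative check (not from this report; assumes an iptables build with the -w flag that waits for the xtables lock):

# Time a full dump of the ruleset; on a busy node this can take many seconds
time iptables-save > /tmp/rules.dump
# Time a simple listing while waiting for the xtables lock
time iptables -w -L -n > /dev/null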
What's the kernel BZ?
The kernel iptables BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1503702
*** This bug has been marked as a duplicate of bug 1451902 ***