Bug 1474875

Summary: [starter-us-east-1] atomic-openshift-node takes a long time to start
Product: OpenShift Container Platform Reporter: Justin Pierce <jupierce>
Component: Networking Assignee: Ben Bennett <bbennett>
Status: CLOSED DUPLICATE QA Contact: Meng Bo <bmeng>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.6.0 CC: aloughla, aos-bugs, atragler, bbennett, ccoleman, eparis, jupierce, mwoodson, rkhan, sdodson, sukulkar
Target Milestone: --- Keywords: DeliveryBlocker
Target Release: 3.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-10-26 18:01:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Justin Pierce 2017-07-25 14:31:46 UTC
Description of problem:
The default timeout for starting atomic-openshift-node from systemd is far too short. The timeout needed to be set to 300 seconds. This was observed on the starter-us-east-1 master starter-us-east-1-master-25064.

Version-Release number of selected component (if applicable):
3.6.152.0

How reproducible:
100%

Steps to Reproduce:
1. systemctl start atomic-openshift-node on a master

Actual results:
The default timeout of 90 seconds was insufficient for the service to start. We set TimeoutStartSec=300 and tried again. 

Journal taken shortly after the service reported active with a 300-second startup timeout set: http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow.txt
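
To put a number on how far a given start overshoots the 90-second default, the startup duration can be estimated from systemd's monotonic unit timestamps. The following is only a sketch: `startup_seconds` is a hypothetical helper name, and it assumes the difference between `ActiveEnterTimestampMonotonic` and `InactiveExitTimestampMonotonic` (both reported in microseconds) is a fair proxy for the start time of the most recent start.

```shell
# Hypothetical helper: turn two monotonic timestamps (microseconds)
# into a whole-second startup duration (integer division, rounds down).
startup_seconds() {
    echo $(( ($2 - $1) / 1000000 ))
}

# On a live node the inputs could be read from the unit's properties:
#   start=$(systemctl show -p InactiveExitTimestampMonotonic atomic-openshift-node | cut -d= -f2)
#   ready=$(systemctl show -p ActiveEnterTimestampMonotonic atomic-openshift-node | cut -d= -f2)
#   startup_seconds "$start" "$ready"
```

A result above 90 would mean the start exceeded the default timeout mentioned above.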

Expected results:
The startup process should be faster or the default TimeoutStartSec needs to be overridden.

Additional info:

Comment 1 Justin Pierce 2017-07-25 14:50:18 UTC
Log of atomic-openshift-node start with loglevel=3 : http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow3.log

Comment 2 Scott Dodson 2017-07-25 15:01:45 UTC
Suggested workaround for now

sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' /etc/systemd/system/atomic-openshift-node.service && systemctl daemon-reload
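
An alternative sketch, not part of the suggestion above: the same setting could be applied through a systemd drop-in instead of editing the unit file in place, so the change lives in its own file. The function name `write_timeout_dropin` and the drop-in file name `10-start-timeout.conf` are assumptions made for illustration; only the unit name and the 300-second value come from this bug.

```shell
# Hypothetical helper: write a drop-in that raises the start timeout.
# $1 = systemd unit directory (defaults to /etc/systemd/system; pass a
#      scratch directory to try it without touching the host)
# $2 = timeout in seconds (defaults to 300, the value used in this bug)
write_timeout_dropin() {
    unit_dir="${1:-/etc/systemd/system}"
    timeout="${2:-300}"
    dropin_dir="$unit_dir/atomic-openshift-node.service.d"
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout" \
        > "$dropin_dir/10-start-timeout.conf"
}

# On a real node (as root) this would be followed by a reload:
#   write_timeout_dropin && systemctl daemon-reload
```

The effective value can then be checked with `systemctl show -p TimeoutStartUSec atomic-openshift-node`.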

Comment 3 Scott Dodson 2017-07-26 18:57:19 UTC
The installer now sets TimeoutStartSec=300 when defining the node unit during install and upgrade in 3.6. I wouldn't consider that to be a complete fix though.

Comment 4 Ben Bennett 2017-09-12 18:26:48 UTC
Can't proceed on this until we get another cluster into this state and can debug it.

Comment 5 Eric Paris 2017-09-12 18:29:47 UTC
All of the starter nodes take > 90s to start. What do you need to collect?

We extended the timeout, so it starts eventually (in under 5m) but still always takes > 90s (the default timeout). This did not happen in 3.5-

Tell me what you need. Ben can get node logs...

Comment 6 Dan Williams 2017-09-29 22:09:57 UTC
(In reply to Eric Paris from comment #5)
> All of the starter nodes take > 90s to start. What do you need to collect?
> 
> We extended the timeout, so it starts eventually (sub 5m) but still always >
> 90s (the default timeout).  this did not happen in 3.5-
> 
> Tell me what you need. Ben can get node logs...

If we can get node logs with --loglevel=5, then we can find out whether it's the iptables-save locking changes or something else.

Comment 7 Ben Bennett 2017-10-05 18:20:36 UTC
We may need to run dcbw's debug RPMs (and then back them out after we get logs with the slowness) to try to isolate why/if iptables is slow, if loglevel 5 is insufficient.

Comment 8 Ben Bennett 2017-10-17 15:59:00 UTC
The ball is currently in the kernel's court... iptables can take a long time when the system is busy.

Comment 9 Eric Paris 2017-10-17 17:28:48 UTC
What's the kernel BZ?

Comment 10 Ben Bennett 2017-10-18 19:35:16 UTC
The kernel iptables BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1503702

Comment 11 Ben Bennett 2017-10-26 18:01:38 UTC

*** This bug has been marked as a duplicate of bug 1451902 ***