Bug 1474875

Summary:	[starter-us-east-1] atomic-openshift-node takes a long time to start
Product:	OpenShift Container Platform	Reporter:	Justin Pierce <jupierce>
Component:	Networking	Assignee:	Ben Bennett <bbennett>
Status:	CLOSED DUPLICATE	QA Contact:	Meng Bo <bmeng>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.6.0	CC:	aloughla, aos-bugs, atragler, bbennett, ccoleman, eparis, jupierce, mwoodson, rkhan, sdodson, sukulkar
Target Milestone:	---	Keywords:	DeliveryBlocker
Target Release:	3.6.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-10-26 18:01:38 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Justin Pierce 2017-07-25 14:31:46 UTC

Description of problem:
The default timeout for starting atomic-openshift-node from systemd is far too short. The timeout needed to be set to 300 seconds. This was performed on the starter-us-east-1 master starter-us-east-1-master-25064.

Version-Release number of selected component (if applicable):
3.6.152.0

How reproducible:
100%

Steps to Reproduce:
1. systemctl start atomic-openshift-node on a master

Actual results:
The default timeout of 90 seconds was insufficient for the service to start. We set TimeoutStartSec=300 and tried again. 

Journal taken shortly after the service reported active, with a 300 second startup timeout set: http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow.txt

Expected results:
The startup process should be faster or the default TimeoutStartSec needs to be overridden.
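
One way such an override could be expressed is a systemd drop-in rather than an edit to the unit file itself. This is only a sketch: the report does not say how the 300 second value was applied on this host, and the helper name write_timeout_dropin, the file name timeout.conf, and the scratch directory are all hypothetical.

```shell
# Hypothetical helper: write a drop-in that overrides TimeoutStartSec for one unit.
#   $1: systemd configuration directory (normally /etc/systemd/system)
#   $2: unit name
#   $3: start timeout in seconds
write_timeout_dropin() {
    dropin_dir="$1/$2.d"
    mkdir -p "$dropin_dir"
    printf '[Service]\nTimeoutStartSec=%s\n' "$3" > "$dropin_dir/timeout.conf"
}

# Demonstrated against a scratch directory so nothing on the host changes.
scratch="$(mktemp -d)"
write_timeout_dropin "$scratch" atomic-openshift-node.service 300
cat "$scratch/atomic-openshift-node.service.d/timeout.conf"
```

On a real host the first argument would be /etc/systemd/system, followed by systemctl daemon-reload; systemctl show -p TimeoutStartUSec atomic-openshift-node can then be used to check the value systemd actually applies.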

Additional info:

Comment 1 Justin Pierce 2017-07-25 14:50:18 UTC

Log of atomic-openshift-node start with loglevel=3 : http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow3.log

Comment 2 Scott Dodson 2017-07-25 15:01:45 UTC

Suggested workaround for now

sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' /etc/systemd/system/atomic-openshift-node.service && systemctl daemon-reload
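
To see what that sed expression does before running it against the real unit, it can be tried on a scratch file. The unit content below is made up for illustration (the actual unit file is not shown in this bug), and the \n in the replacement assumes GNU sed.

```shell
# Scratch copy with hypothetical unit content; only the RestartSec line matters here.
unit_file="$(mktemp)"
cat > "$unit_file" <<'EOF'
[Service]
Restart=always
RestartSec=5s
EOF

# Same expression as the workaround: on the line starting with RestartSec,
# append a newline plus TimeoutStartSec=300 at the end of the line (GNU sed).
sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' "$unit_file"

cat "$unit_file"
```

Note that the expression is not idempotent: running it a second time appends another TimeoutStartSec=300 line after RestartSec.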

Comment 3 Scott Dodson 2017-07-26 18:57:19 UTC

The installer now sets TimeoutStartSec=300 when defining the node unit during install and upgrade in 3.6. I wouldn't consider that to be a complete fix though.

Comment 4 Ben Bennett 2017-09-12 18:26:48 UTC

Can't proceed on this until we get another cluster into this state and can debug it.

Comment 5 Eric Paris 2017-09-12 18:29:47 UTC

All of the starter nodes take > 90s to start. What do you need to collect?

We extended the timeout, so it starts eventually (sub 5m) but still always > 90s (the default timeout). This did not happen in 3.5-

Tell me what you need. Ben can get node logs...

Comment 6 Dan Williams 2017-09-29 22:09:57 UTC

(In reply to Eric Paris from comment #5)
> All of the starter nodes take > 90s to start. What do you need to collect?
> 
> We extended the timeout, so it starts eventually (sub 5m) but still always >
> 90s (the default timeout). This did not happen in 3.5-
> 
> Tell me what you need. Ben can get node logs...

If we can get node logs with --loglevel=5 then we can find out if it's the iptables-save locking changes or what.

Comment 7 Ben Bennett 2017-10-05 18:20:36 UTC

We may need to run dcbw's debug RPMs (and then back them out after we get logs with the slowness) to try to isolate why/if iptables is slow, if loglevel 5 is insufficient.

Comment 8 Ben Bennett 2017-10-17 15:59:00 UTC

The ball is currently in the kernel's court... iptables can take a long time when the system is busy.

Comment 9 Eric Paris 2017-10-17 17:28:48 UTC

What's the kernel BZ?

Comment 10 Ben Bennett 2017-10-18 19:35:16 UTC

The kernel iptables BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1503702

Comment 11 Ben Bennett 2017-10-26 18:01:38 UTC


*** This bug has been marked as a duplicate of bug 1451902 ***