Bug 1474875 - [starter-us-east-1] atomic-openshift-node takes a long time to start
Summary: [starter-us-east-1] atomic-openshift-node takes a long time to start
Keywords:
Status: CLOSED DUPLICATE of bug 1451902
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.6.z
Assignee: Ben Bennett
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-25 14:31 UTC by Justin Pierce
Modified: 2017-12-18 14:01 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-26 18:01:38 UTC
Target Upstream Version:
Embargoed:



Description Justin Pierce 2017-07-25 14:31:46 UTC
Description of problem:
The default timeout for starting atomic-openshift-node from systemd is far too short. The timeout needed to be set to 300 seconds. This was performed on the starter-us-east-1 master starter-us-east-1-master-25064.

Version-Release number of selected component (if applicable):
3.6.152.0

How reproducible:
100%

Steps to Reproduce:
1. systemctl start atomic-openshift-node on a master

Actual results:
The default timeout of 90 seconds was insufficient for the service to start. We set TimeoutStartSec=300 and tried again. 

Journal taken shortly after the service reported active, with a 300 second startup timeout set: http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow.txt

Expected results:
The startup process should be faster, or the default TimeoutStartSec needs to be overridden.
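One way to override the default without editing the installed unit file is a systemd drop-in; a minimal sketch (the drop-in filename is illustrative, and `systemctl daemon-reload` must be run after creating it):

```ini
# /etc/systemd/system/atomic-openshift-node.service.d/timeout.conf
# Illustrative drop-in path; raises the start timeout from the 90s default.
[Service]
TimeoutStartSec=300
```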

Additional info:

Comment 1 Justin Pierce 2017-07-25 14:50:18 UTC
Log of atomic-openshift-node start with loglevel=3 : http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow3.log

Comment 2 Scott Dodson 2017-07-25 15:01:45 UTC
Suggested workaround for now

sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' /etc/systemd/system/atomic-openshift-node.service && systemctl daemon-reload
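The one-liner appends a TimeoutStartSec line immediately after the unit's RestartSec line. A quick way to sanity-check the edit on a throwaway file rather than the live unit (file contents here are illustrative):

```shell
# Try the sed edit on a scratch copy instead of the real unit file
tmp=$(mktemp)
printf '[Service]\nRestartSec=5s\n' > "$tmp"
# Same edit as the workaround: append TimeoutStartSec after RestartSec
sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' "$tmp"
# Prints the file with TimeoutStartSec=300 on a new line after RestartSec
cat "$tmp"
```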

Comment 3 Scott Dodson 2017-07-26 18:57:19 UTC
The installer now sets TimeoutStartSec=300 when defining the node unit during install and upgrade in 3.6. I wouldn't consider that to be a complete fix though.

Comment 4 Ben Bennett 2017-09-12 18:26:48 UTC
Can't proceed on this until we get another cluster into this state and can debug it.

Comment 5 Eric Paris 2017-09-12 18:29:47 UTC
All of the starter nodes take > 90s to start. What do you need to collect?

We extended the timeout, so it starts eventually (under 5 minutes) but still always takes more than 90s (the default timeout). This did not happen in 3.5.

Tell me what you need. Ben can get node logs...

Comment 6 Dan Williams 2017-09-29 22:09:57 UTC
(In reply to Eric Paris from comment #5)
> All of the starter nodes take > 90s to start. What do you need to collect?
> 
> We extended the timeout, so it starts eventually (sub 5m) but still always >
> 90s (the default timout).  this did not happen in 3.5-
> 
> Tell me what you need. Ben can get node logs...

If we can get node logs with --loglevel=5 then we can find out if it's the iptables-save locking changes or what.

Comment 7 Ben Bennett 2017-10-05 18:20:36 UTC
If loglevel 5 is insufficient, we may need to run dcbw's debug RPMs (and then back them out after we capture logs showing the slowness) to isolate whether, and why, iptables is slow.

Comment 8 Ben Bennett 2017-10-17 15:59:00 UTC
The ball is currently in the kernel's court... iptables can take a long time when the system is busy.

Comment 9 Eric Paris 2017-10-17 17:28:48 UTC
What's the kernel BZ?

Comment 10 Ben Bennett 2017-10-18 19:35:16 UTC
The kernel iptables BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1503702

Comment 11 Ben Bennett 2017-10-26 18:01:38 UTC

*** This bug has been marked as a duplicate of bug 1451902 ***

