Bug 1474875 - [starter-us-east-1] atomic-openshift-node takes a long time to start
Summary: [starter-us-east-1] atomic-openshift-node takes a long time to start
Keywords:
Status: CLOSED DUPLICATE of bug 1451902
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.6.z
Assignee: Ben Bennett
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-25 14:31 UTC by Justin Pierce
Modified: 2017-12-18 14:01 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-26 18:01:38 UTC
Target Upstream Version:
Embargoed:



Description Justin Pierce 2017-07-25 14:31:46 UTC
Description of problem:
The default timeout for starting atomic-openshift-node from systemd is far too short. The timeout needed to be set to 300 seconds. This was performed on the starter-us-east-1 master starter-us-east-1-master-25064.

Version-Release number of selected component (if applicable):
3.6.152.0

How reproducible:
100%

Steps to Reproduce:
1. systemctl start atomic-openshift-node on a master

Actual results:
The default timeout of 90 seconds was insufficient for the service to start. We set TimeoutStartSec=300 and tried again. 

Journal taken shortly after the service reported active, with a 300 second startup timeout set: http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow.txt

Expected results:
The startup process should be faster, or the default TimeoutStartSec needs to be overridden.
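One way to override the default without editing the installed unit file is a systemd drop-in; a minimal sketch (the drop-in filename is illustrative, and `systemctl daemon-reload` must be run after creating it):

```ini
# /etc/systemd/system/atomic-openshift-node.service.d/timeout.conf
# Illustrative drop-in path; raises the start timeout from the 90s default.
[Service]
TimeoutStartSec=300
```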

Additional info:

Comment 1 Justin Pierce 2017-07-25 14:50:18 UTC
Log of atomic-openshift-node start with loglevel=3 : http://file.rdu.redhat.com/~jupierce/share/atomic-openshift-node.slow3.log

Comment 2 Scott Dodson 2017-07-25 15:01:45 UTC
Suggested workaround for now

sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' /etc/systemd/system/atomic-openshift-node.service && systemctl daemon-reload
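The one-liner appends a TimeoutStartSec line immediately after the unit's RestartSec line. A quick way to sanity-check the edit on a throwaway file rather than the live unit (file contents here are illustrative):

```shell
# Try the sed edit on a scratch copy instead of the real unit file
tmp=$(mktemp)
printf '[Service]\nRestartSec=5s\n' > "$tmp"
# Same edit as the workaround: append TimeoutStartSec after RestartSec
sed -i '/^RestartSec/ s/$/\nTimeoutStartSec=300/' "$tmp"
# Prints the file with TimeoutStartSec=300 on a new line after RestartSec
cat "$tmp"
```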

Comment 3 Scott Dodson 2017-07-26 18:57:19 UTC
The installer now sets TimeoutStartSec=300 when defining the node unit during install and upgrade in 3.6. I wouldn't consider that to be a complete fix though.

Comment 4 Ben Bennett 2017-09-12 18:26:48 UTC
Can't proceed on this until we get another cluster into this state and can debug it.

Comment 5 Eric Paris 2017-09-12 18:29:47 UTC
All of the starter nodes take > 90s to start. What do you need to collect?

We extended the timeout, so it starts eventually (under 5 minutes) but still always takes more than 90s (the default timeout). This did not happen in 3.5.

Tell me what you need. Ben can get node logs...

Comment 6 Dan Williams 2017-09-29 22:09:57 UTC
(In reply to Eric Paris from comment #5)
> All of the starter nodes take > 90s to start. What do you need to collect?
> 
> We extended the timeout, so it starts eventually (sub 5m) but still always >
> 90s (the default timout).  this did not happen in 3.5-
> 
> Tell me what you need. Ben can get node logs...

If we can get node logs with --loglevel=5 then we can find out if it's the iptables-save locking changes or what.

Comment 7 Ben Bennett 2017-10-05 18:20:36 UTC
If loglevel 5 is insufficient, we may need to run dcbw's debug RPMs (and then back them out after we capture logs showing the slowness) to isolate whether, and why, iptables is slow.

Comment 8 Ben Bennett 2017-10-17 15:59:00 UTC
The ball is currently in the kernel's court... iptables can take a long time when the system is busy.

Comment 9 Eric Paris 2017-10-17 17:28:48 UTC
What's the kernel BZ?

Comment 10 Ben Bennett 2017-10-18 19:35:16 UTC
The kernel iptables BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1503702

Comment 11 Ben Bennett 2017-10-26 18:01:38 UTC

*** This bug has been marked as a duplicate of bug 1451902 ***

