1503252 – [starter-ca-central-1] Simple deploy took longer than 600 seconds


Summary: [starter-ca-central-1] Simple deploy took longer than 600 seconds

Keywords:
Status:	CLOSED DUPLICATE of bug 1451902
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Ben Bennett
QA Contact:	Meng Bo
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1505167
Depends On:
Blocks:

Reported:	2017-10-17 16:17 UTC by Justin Pierce
Modified:	2017-10-26 18:02 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-10-26 18:02:05 UTC
Target Upstream Version:
Embargoed:

Attachments:
Events from jenkins pod (361 bytes, text/plain) 2017-10-17 16:17 UTC, Justin Pierce
node-combined.log (27.56 KB, text/plain) 2017-10-20 03:50 UTC, Seth Jennings

Description Justin Pierce 2017-10-17 16:17:28 UTC

Created attachment 1339803 [details]
Events from jenkins pod

Description of problem:

[root@starter-ca-central-1-master-692e9 ~]# oc get pods
NAME               READY     STATUS              RESTARTS   AGE
jenkins-1-57f3m    0/1       ContainerCreating   0          6m
jenkins-1-deploy   1/1       Running             0          6m
[root@starter-ca-central-1-master-692e9 ~]# oc get pods 
NAME               READY     STATUS    RESTARTS   AGE
jenkins-1-deploy   0/1       Error     0          10m
[root@starter-ca-central-1-master-692e9 ~]# oc logs jenkins-1-deploy
--> Scaling jenkins-1 to 1
error: update acceptor rejected jenkins-1: pods for rc 'jmp-test/jenkins-1' took longer than 600 seconds to become available



Version-Release number of selected component (if applicable):
Master: oc v3.7.0-0.143.3
Nodes: oc v3.6 GA



Steps to Reproduce:
1. Instantiate the Jenkins ephemeral template

Comment 1 Seth Jennings 2017-10-17 17:09:27 UTC

$ oc get pod -o wide
NAME               READY     STATUS    RESTARTS   AGE       IP             NODE
jenkins-1-deploy   0/1       Error     0          1h        10.131.34.57   ip-172-31-20-86.ca-central-1.compute.internal

On the node the pod was assigned to:

$ oc describe node ip-172-31-25-45.ca-central-1.compute.internal
...
Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----							-------------	--------	------		-------
  8d		1h		202	kubelet, ip-172-31-25-45.ca-central-1.compute.internal			Normal		NodeNotReady	Node ip-172-31-25-45.ca-central-1.compute.internal status is now: NodeNotReady
  8d		1h		202	kubelet, ip-172-31-25-45.ca-central-1.compute.internal			Normal		NodeReady	Node ip-172-31-25-45.ca-central-1.compute.internal status is now: NodeReady
  2d		41m		10	kubelet, ip-172-31-25-45.ca-central-1.compute.internal			Warning		SystemOOM	System OOM encountered

So this node is not healthy, which explains the timeout.
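
For anyone triaging similar timeouts, here is a minimal Python sketch of the check done by eye above: it totals the Events table from `oc describe node` output by reason and picks out the ones that suggest an unhealthy node. The function names, the tab-separated layout, and the choice of NodeNotReady and SystemOOM as "unhealthy" reasons are assumptions made for illustration; none of this is part of `oc`.

```python
import re

# Reasons treated as signs of an unhealthy node (an assumption for this sketch).
UNHEALTHY_REASONS = {"NodeNotReady", "SystemOOM"}


def summarize_node_events(describe_output):
    """Return {reason: total count} for the Events table in the given text.

    Assumes tab-separated rows shaped like the output pasted above:
    FirstSeen, LastSeen, Count, From, [SubObjectPath], Type, Reason, Message.
    """
    counts = {}
    in_events = False
    for line in describe_output.splitlines():
        if line.startswith("Events:"):
            in_events = True
            continue
        if not in_events:
            continue
        fields = [f for f in re.split(r"\t+", line.strip()) if f]
        # Skip the header, the dashed separator, and anything malformed.
        if len(fields) < 7 or not fields[2].isdigit():
            continue
        reason = fields[-2]  # Reason is the column just before Message
        counts[reason] = counts.get(reason, 0) + int(fields[2])
    return counts


def unhealthy_reasons(counts):
    """Return the sorted subset of reasons that point to node trouble."""
    return sorted(r for r in counts if r in UNHEALTHY_REASONS)
```

Fed the events shown above, this would report NodeNotReady and SystemOOM as the reasons worth a closer look; NodeReady is counted but not flagged.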

Comment 4 Seth Jennings 2017-10-20 03:49:02 UTC

The deploy worked for me:

$ oc project
Using project "sjenning-demo" on server "https://api.starter-ca-central-1.openshift.com:443".

$ oc get pod -o wide
NAME              READY     STATUS    RESTARTS   AGE       IP              NODE
jenkins-1-16hqj   1/1       Running   0          11m       10.131.38.196   ip-172-31-22-176.ca-central-1.compute.internal

However, I was on the edge of timing out.  Sandbox creation kept failing due to the iptables issue:

RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "jenkins-1-deploy_sjenning-demo" network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?

Eventually it did succeed.
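
To illustrate why the pod eventually came up, here is a rough Python sketch of retrying a command that fails on xtables lock contention. It is not how the CNI plugin or the node actually handles this; the function name, the backoff values, and the injectable `runner` and `sleep` parameters are all hypothetical. The only details taken from the error above are exit status 4 and the "xtables lock" message.

```python
import subprocess
import time

XTABLES_LOCK_EXIT = 4  # exit status reported in the error above


def run_with_lock_retry(cmd, stdin_text="", attempts=5, delay=0.2,
                        runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff while the xtables lock is held.

    Returns (result, attempts_used). Any failure other than lock contention
    is returned immediately without retrying.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = runner(cmd, input=stdin_text, capture_output=True, text=True)
        locked = (result.returncode == XTABLES_LOCK_EXIT
                  and "xtables lock" in (result.stderr or ""))
        if not locked:
            return result, attempt
        if attempt < attempts:
            sleep(delay)
            delay *= 2  # back off so competing callers can release the lock
    return result, attempts
```

Retrying only papers over the contention. The error message itself points at the -w option, which makes iptables wait for the lock instead of failing.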

Comment 5 Seth Jennings 2017-10-20 03:50:58 UTC

Created attachment 1341136 [details]
node-combined.log

This is the interleaved log from the node running the deploy pod (jenkins-1-deploy) and the node running the pod to be deployed (jenkins-1-16hqj).

Comment 6 Seth Jennings 2017-10-20 04:05:31 UTC

I imagine there is a bug already tracking this, but I can't find it at the moment.  Sending to Networking for processing.

Comment 7 Ben Bennett 2017-10-24 17:42:35 UTC

*** Bug 1505167 has been marked as a duplicate of this bug. ***

Comment 8 Ben Bennett 2017-10-24 17:44:05 UTC

This is probably the same as https://bugzilla.redhat.com/show_bug.cgi?id=1451902

I'm looking to make sure that we have the patch that reduces the number of calls to iptables, but the real fix will be when the kernel change to fix https://bugzilla.redhat.com/show_bug.cgi?id=1503702 lands.

Comment 9 Ben Bennett 2017-10-26 18:02:05 UTC


*** This bug has been marked as a duplicate of bug 1451902 ***
