Bug 1503252

Summary: [starter-ca-central-1] Simple deploy took longer than 600 seconds
Product: OpenShift Online
Component: Networking
Version: 3.x
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Reporter: Justin Pierce <jupierce>
Assignee: Ben Bennett <bbennett>
QA Contact: Meng Bo <bmeng>
CC: aos-bugs, jokerman, mmccomas, pportant
Last Closed: 2017-10-26 18:02:05 UTC
Type: Bug
Attachments:
  Events from jenkins pod
  node-combined.log

Description Justin Pierce 2017-10-17 16:17:28 UTC
Created attachment 1339803 [details]
Events from jenkins pod

Description of problem:

[root@starter-ca-central-1-master-692e9 ~]# oc get pods
NAME               READY     STATUS              RESTARTS   AGE
jenkins-1-57f3m    0/1       ContainerCreating   0          6m
jenkins-1-deploy   1/1       Running             0          6m
[root@starter-ca-central-1-master-692e9 ~]# oc get pods 
NAME               READY     STATUS    RESTARTS   AGE
jenkins-1-deploy   0/1       Error     0          10m
[root@starter-ca-central-1-master-692e9 ~]# oc logs jenkins-1-deploy
--> Scaling jenkins-1 to 1
error: update acceptor rejected jenkins-1: pods for rc 'jmp-test/jenkins-1' took longer than 600 seconds to become available
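
For context, the 600-second limit here is the deployment strategy's timeoutSeconds, which defaults to 600 in OpenShift. A workaround sketch, assuming the DC uses the Recreate strategy (use rollingParams for a Rolling strategy); it only papers over the underlying node problem:

# Hypothetical mitigation: raise the deployer timeout on this DC to 1200s
$ oc patch dc/jenkins -n jmp-test \
    -p '{"spec":{"strategy":{"recreateParams":{"timeoutSeconds":1200}}}}'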

Version-Release number of selected component (if applicable):
Master: oc v3.7.0-0.143.3
Nodes: oc v3.6 GA

Steps to Reproduce:
1. Instantiate the Jenkins ephemeral template
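
A minimal reproduction sketch (project name illustrative; template from the standard openshift namespace):

$ oc new-project repro-test
$ oc new-app --template=jenkins-ephemeral
$ oc get pods -w    # watch whether the jenkins pod clears ContainerCreating before the 600s deployer limit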

Comment 1 Seth Jennings 2017-10-17 17:09:27 UTC
$ oc get pod -o wide
NAME               READY     STATUS    RESTARTS   AGE       IP             NODE
jenkins-1-deploy   0/1       Error     0          1h        10.131.34.57   ip-172-31-20-86.ca-central-1.compute.internal

on the node the pod was assigned:

$ oc describe node ip-172-31-25-45.ca-central-1.compute.internal
...
Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----							-------------	--------	------		-------
  8d		1h		202	kubelet, ip-172-31-25-45.ca-central-1.compute.internal			Normal		NodeNotReady	Node ip-172-31-25-45.ca-central-1.compute.internal status is now: NodeNotReady
  8d		1h		202	kubelet, ip-172-31-25-45.ca-central-1.compute.internal			Normal		NodeReady	Node ip-172-31-25-45.ca-central-1.compute.internal status is now: NodeReady
  2d		41m		10	kubelet, ip-172-31-25-45.ca-central-1.compute.internal			Warning		SystemOOM	System OOM encountered

So this node is not healthy, which explains the timeout.
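
One way to confirm this state directly, as a sketch against the standard node status fields:

$ oc get node ip-172-31-25-45.ca-central-1.compute.internal \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# A result that flaps between True and False across invocations matches
# the 202 NodeNotReady/NodeReady events above.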

Comment 4 Seth Jennings 2017-10-20 03:49:02 UTC
The deploy worked for me:

$ oc project
Using project "sjenning-demo" on server "https://api.starter-ca-central-1.openshift.com:443".

$ oc get pod -o wide
NAME              READY     STATUS    RESTARTS   AGE       IP              NODE
jenkins-1-16hqj   1/1       Running   0          11m       10.131.38.196   ip-172-31-22-176.ca-central-1.compute.internal

However, it was right on the edge of timing out. Sandbox creation kept failing due to the iptables issue:

RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "jenkins-1-deploy_sjenning-demo" network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?

Eventually it did succeed.
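
For reference, exit status 4 is iptables reporting that another process holds the global xtables lock (on these nodes, typically kube-proxy or the SDN plugin mid-update). The -w flag the message suggests makes the caller wait for the lock rather than fail; a quick illustration using a harmless read:

# -w N waits up to N seconds for the xtables lock instead of exiting
# with status 4 when another caller holds it.
$ iptables -w 5 -t nat -L -n >/dev/null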

Comment 5 Seth Jennings 2017-10-20 03:50:58 UTC
Created attachment 1341136 [details]
node-combined.log

This is the interleaved log from the node running the deploy pod (jenkins-1-deploy) and the node running the pod being deployed (jenkins-1-16hqj).

Comment 6 Seth Jennings 2017-10-20 04:05:31 UTC
I imagine there is a bug already tracking this, but I can't find it at the moment. Sending to Networking for processing.

Comment 7 Ben Bennett 2017-10-24 17:42:35 UTC
*** Bug 1505167 has been marked as a duplicate of this bug. ***

Comment 8 Ben Bennett 2017-10-24 17:44:05 UTC
This is probably the same as https://bugzilla.redhat.com/show_bug.cgi?id=1451902

I'm looking to make sure that we have the patch that reduces the number of calls to iptables, but the real fix will come when the kernel change for https://bugzilla.redhat.com/show_bug.cgi?id=1503702 lands.

Comment 9 Ben Bennett 2017-10-26 18:02:05 UTC
*** This bug has been marked as a duplicate of bug 1451902 ***