Bug 1329399

Summary: OpenShift router pods go into CrashLoopBackOff and restart when scaling an app to ~1000 replicas
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Networking
Assignee: Phil Cameron <pcameron>
Networking sub component: router
QA Contact: Mike Fiedler <mifiedle>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: medium
Priority: medium
CC: aloughla, aos-bugs, atragler, bbennett, dakini, eparis, jmalde, marc.jadoul, mifiedle, mleitner, ramr, rkhan, tstclair
Version: 3.2.0
Keywords: Reopened
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-12 17:06:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- oc get events during router crash
- syslog messages during router crash
- additional syslog messages
- Deployment config for pause pod with service and route defined

Description Mike Fiedler 2016-04-21 20:08:50 UTC
Description of problem:

Routers go into CrashLoopBackOff and constantly restart when scaling the cakephp or django quickstart applications above 1000 replicas (or when building a single django or cakephp quickstart application in each of 1000 projects).

I've hit a couple flavors of this problem that exhibit the same external symptoms.  I'm documenting the smallest cluster and simplest way to hit it.   See Additional Info for other variations.


Version-Release number of selected component (if applicable): 3.2.0.18


How reproducible: Always


Steps to Reproduce:
1. Need a big enough cluster to run 1000+ quick start application pods.  I am using AWS
   3 x master : m4.4xlarge
   3 x router : m4.xlarge
   1 x load balancer: m4.xlarge
  10 x node: m4.2xlarge
2. Create 1 project with 1 quickstart app.  I know this can be hit with cakephp and django.
3. Once the app is running: oc scale dc/cakephp-example --replicas=800
4. It should scale up OK and router pods should be in good health
5. oc scale dc/cakephp-example --replicas=1100 (see the consolidated sketch below)
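For convenience, the scale-up sequence as a shell sketch (the quickstart template name and project name are assumptions; adjust them to whatever you deployed):

# create the quickstart in its own project
oc new-project scale-test
oc new-app cakephp-example
# scale to a known-good level first, then past the threshold
oc scale dc/cakephp-example --replicas=800
oc scale dc/cakephp-example --replicas=1100
# watch the router pods in the default project for restarts / CrashLoopBackOff
oc get pods -n default -w | grep router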


Actual results:

Router pods will all go into CrashLoopBackOff and show lots of restarts

Expected results:

Router pods continue to function.  If additional HAProxy tuning is required, expose the tuning.


Additional info:

I've also hit this on a larger 200-node cluster by running oc new-app in 1000 projects, i.e. one copy in each of 1000 namespaces instead of 1000 replicas in a single namespace.

Comment 1 Mike Fiedler 2016-04-21 20:12:50 UTC
Created attachment 1149563 [details]
oc get events during router crash

Comment 2 Mike Fiedler 2016-04-21 20:13:26 UTC
Created attachment 1149564 [details]
syslog messages during router crash

Comment 3 Ben Bennett 2016-04-22 11:17:27 UTC
Is there anything in the docker logs for the crashed routers?

Comment 4 Mike Fiedler 2016-04-22 13:53:18 UTC
The docker logs just show retries for the health check.   Let me know if there is a higher log level to try.

root@ip-172-31-19-40: ~ # docker logs 935972e95440
I0422 09:49:50.321345       1 router.go:161] Router is including routes in all namespaces
I0422 09:49:55.317371       1 router.go:310] Router reloaded:
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 5 retry attempt(s).
I0422 09:49:57.437498       1 router.go:310] Router reloaded:
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 2 retry attempt(s).
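If more verbose output would help, one way to get it is to raise the --loglevel argument on the router deployment config (a sketch; the command layout matches the dc shown later in comment 13, and the container ID is hypothetical):

oc edit dc/router
#   - command:
#     - /usr/bin/openshift-router
#     - --loglevel=4
# the dc rolls out a new router pod; then re-check its logs:
docker logs <new-router-container-id>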

Comment 5 Timothy St. Clair 2016-04-26 16:33:54 UTC
What's the ulimit on the container?

Comment 6 Mike Fiedler 2016-04-26 17:38:23 UTC
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63390
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
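For reference, a sketch of how such limits can be collected from the running router container (the container ID is the one from comment 4; oc exec against the router pod, with a hypothetical pod name, is equivalent):

docker exec 935972e95440 sh -c 'ulimit -a'
# or via the API server:
oc exec <router-pod-name> -- sh -c 'ulimit -a'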

Comment 7 Mike Fiedler 2016-04-26 20:00:20 UTC
Did some further investigation.

Scaled up to a known good number of pods and then increased slowly while waiting between increases.

There is definitely a magic number involved here.  I could always scale up to 1011 instances and wait for any period of time and things were fine - no router restarts, no error messages in the logs on the nodes running the routers.  As soon as I increased the replicas to 1012, the routers restarted.  Decreasing back to 1011 stopped the restarts.

Config was 3x master + etcd, 2x router, 2x registry, 10x app nodes.

Attaching the system log from a node where the restarts occurred.  A burst of messages always comes in, but I'm not sure how useful they are.
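For anyone repeating the bisection, it can be scripted roughly like this (dc name, range, and wait time are assumptions):

for n in $(seq 1000 1015); do
  oc scale dc/cakephp-example --replicas=$n
  sleep 300   # give the routers time to reload and the liveness probes time to run
  echo "replicas=$n"
  oc get pods -n default | grep router   # note the RESTARTS column and any CrashLoopBackOff
done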

Comment 8 Mike Fiedler 2016-04-26 20:00:56 UTC
Created attachment 1151062 [details]
additional syslog messages

Comment 9 Phil Cameron 2016-05-16 21:00:58 UTC
I have reproduced this (or a similar) problem on a 6 node cluster in the lab. Next will be finding the root cause of the problem.

Comment 10 Phil Cameron 2016-05-17 17:04:33 UTC
# docker ps -a | grep ro
8bd46592eca7        openshift3/ose-haproxy-router:v3.2.0.44   "/usr/bin/openshift-r"   3 minutes ago       Exited (2) About a minute ago                       k8s_router.863168f7_router-1-s0dnn_default_c5cf63fb-1b8f-11e6-8e61-ecf4bbd7ae44_984463d0

# docker logs 8bd46592eca7
.
.
.
I0517 12:56:51.493340       1 router.go:310] Router reloaded:
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 1 retry attempt(s).
I0517 12:56:59.735537       1 router.go:310] Router reloaded:
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).

Comment 11 Phil Cameron 2016-05-18 17:16:10 UTC
Set timeouts in dc/router, periodSeconds=300, timeoutSeconds=40

I let this run overnight. I came in this morning and took a look.
In looking at the docker logs, the router was started at 16:01 and crashed at 01:45.

 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).
I0518 01:45:05.882564       1 reaper.go:24] Signal received: child exited
I0518 01:45:07.825801       1 plugin.go:133] Processing 1 Endpoints for Name: nginx-svc-online (MODIFIED)
I0518 01:45:07.825817       1 plugin.go:136]   Subset 0 : api.EndpointSubset{Addresses:[]api.EndpointAddress(nil), NotReadyAddresses:[]api.EndpointAddress{api.EndpointAddress{IP:"10.1.4.2", TargetRef:(*api.ObjectReference)(0xc2087b07e0)}}, Ports:[]api.EndpointPort{api.EndpointPort{Name:"80-tcp", Port:8080, Protocol:"TCP"}}}
I0518 01:45:07.825857       1 plugin.go:145] Modifying endpoints for conn1002/nginx-svc-online
I0518 01:45:07.825869       1 router.go:501] Ignoring change for conn1002/nginx-svc-online, endpoints are the same

The new router started at: 01:46:58.573031
and is still working this morning.

I took a look at the health check retries and made a table of the number of samples that required N retries:
Total samples: 1104
Retries  samples
   0        677
   1        232
   2        9
   3        18
   4        36
   5        21
   6        47
   7        13
   8        24
   9        25
   10        1
   11        0
   12        0
   13        1
   14        0
It strikes me that when the second curl attempt is needed 21% of the time, and six retries are needed to finally see it work 4% of the time, there is a problem.
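The tally can be reproduced from the container log with something like the following (the container ID is the one from comment 10; the log line format is the "health check ok : N retry attempt(s)" line shown above):

docker logs 8bd46592eca7 2>&1 \
  | grep 'health check ok' \
  | awk '{print $(NF-2)}' \
  | sort -n | uniq -c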

Comment 12 Phil Cameron 2016-05-18 17:21:46 UTC
Came back from lunch:
# curl -H "Host: hello-openshift-v3.mycloud.com" http://172.30.113.75/
curl: (7) Failed connect to 172.30.113.75:80; Connection refused

waited a few seconds:

# curl -H "Host: hello-openshift-v3.mycloud.com" http://172.30.113.75/
Hello OpenShift!

# oc get rc
NAME       DESIRED   CURRENT   AGE
hello-rc   950       950       6d

Comment 13 Phil Cameron 2016-05-20 18:08:07 UTC
I have reproduced the problem on the lab 6 node cluster with a project running 975 replicas of hello-openshift.

During these tests the setup is static, no changes are made to the configuration, no pods are added or deleted.

Aside from infrastructure pods, there are some nginx and a lot of hello-openshift pods. curl to the hello-openshift pods is the only input to the cluster.
 
This setup shows some interesting behaviors.

- I have found that extending the timeouts resolves the problem.
# oc edit dc/router
    spec:
      containers:
      - command:
        - /usr/bin/openshift-router
        - --loglevel=4
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 300
          successThreshold: 1
          timeoutSeconds: 40
The actual reload seems to be quick:
<logs-snippet>
I0517 15:24:38.913860       1 router.go:225] Writing the router state
I0517 15:24:38.918282       1 router.go:230] Writing the router config
I0517 15:24:39.007365       1 router.go:235] Reloading the router
I0517 15:24:39.054372       1 reaper.go:24] Signal received: child exited
I0517 15:24:39.054411       1 reaper.go:32] Reaped process with pid 55
I0517 15:24:40.067301       1 router.go:310] Router reloaded:
  - Checking HAProxy /healthz on port 1936 ...
  - HAProxy port 1936 health check ok : 1 retry attempt(s).
I0517 15:24:40.067329       1 reaper.go:24] Signal received: child exited
</logs-snippet>
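As an aside, the same probe settings can be applied non-interactively; a sketch, assuming the container in dc/router is named "router":

oc patch dc/router -p '{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"initialDelaySeconds":30,"periodSeconds":300,"successThreshold":1,"timeoutSeconds":40}}]}}}}'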

- Why does increasing the timeout work?
This is still unknown...

- On reload the first curl doesn't always work. The retries are 500msec apart.
A histogram of retries for 1104 reloads follows
Retries  samples
   0        677
   1        232
   2        9
   3        18
   4        36
   5        21
   6        47
   7        13
   8        24
   9        25
   10        1
   11        0
   12        0
   13        1
   14        0
It strikes me that when the second curl attempt is needed 21% of the time, and six retries are needed to finally see it work 4% of the time, there is a problem. Could this be the timeout problem?


- 6 second network pause every 1-3 min
(This is not likely related to the 10 minute reload cycle)

I put curl in a 2 sec loop on the master node and printed a line every time the curl failed (returned 7).
It appears that every 1-3 minutes, the curl fails for roughly 6 seconds.
I set up dnsmasq on my laptop and looped calling curl every second and timed how long (wall clock) curl took. Most of the time it is around 140 msec; however, every few minutes it takes 5.3 seconds.
During these pauses ssh
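A minimal sketch of such a failure-detection loop (not the exact script used; the host header and service IP are the ones from comment 12):

while true; do
  curl -s -o /dev/null -H "Host: hello-openshift-v3.mycloud.com" http://172.30.113.75/
  rc=$?
  [ $rc -ne 0 ] && echo "$(date +%T) curl failed, rc=$rc"
  sleep 2
done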

- router pod crashed and was restarted 7 times in 2 days
In looking at the docker logs, the router was started at 16:01 and crashed at 01:45

 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).
I0518 01:45:05.882564       1 reaper.go:24] Signal received: child exited
Log file ended here.

A new log file started here, a minute and 53 seconds later; the new router started at 01:46:58.573031.


- Is iptables changing, or are pod changes changing the configuration?
I captured
oc get all -o wide
before and after a stall and they are the same.

I captured iptables before and after a stall. There are 4061 lines.
The nat table has a minor re-ordering of the router service entries.

Comment 14 Phil Cameron 2016-05-20 20:48:12 UTC
Looking at the docker log for the router container, the time stamps roughly mirror the pauses in processing.

I0518 22:57:09.962036       1 router.go:310] Router reloaded:
I0518 22:57:37.100725       1 router.go:310] Router reloaded:
I0518 23:00:00.071569       1 router.go:310] Router reloaded:
I0518 23:00:06.033517       1 router.go:310] Router reloaded:
I0518 23:01:36.905530       1 router.go:310] Router reloaded:
I0518 23:01:52.080527       1 router.go:310] Router reloaded:
I0518 23:03:27.918080       1 router.go:310] Router reloaded:
I0518 23:03:35.882036       1 router.go:310] Router reloaded:
I0518 23:06:31.943347       1 router.go:310] Router reloaded:
I0518 23:06:35.891345       1 router.go:310] Router reloaded:
I0518 23:07:00.019580       1 router.go:310] Router reloaded:
I0518 23:07:17.913971       1 router.go:310] Router reloaded:
I0518 23:07:27.050100       1 router.go:310] Router reloaded:
I0518 23:10:14.963170       1 router.go:310] Router reloaded:
I0518 23:10:15.896205       1 router.go:310] Router reloaded:
I0518 23:10:57.905800       1 router.go:310] Router reloaded:
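These lines can be pulled out of the container log with a simple filter (container ID hypothetical; cutting the glog prefix down to the minute gives a rough reloads-per-minute count):

docker logs <router-container-id> 2>&1 | grep 'Router reloaded:'
docker logs <router-container-id> 2>&1 | grep 'Router reloaded:' | cut -c1-11 | uniq -c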

Found out why the reload happens so often.
pkg/router/template/router.go AddEndpoints()
calls reflect.DeepEqual(), which does not notice that there are no changes. If the match were noticed, this static setup would not need to reload.

Still don't know why networking stalls during reload. Also, don't know why the change event was signaled (since things haven't changed).

Probably a good idea to find and fix the stall first since the match problem reproduces the problem so often.

Finally, still don't know why the router crashes (which is the actual bug).

Comment 15 Phil Cameron 2016-05-31 14:32:55 UTC
In looking at the docker log for the router, it appears that an haproxy reload is a frequent occurrence. The route set bounces back and forth between a null set and the expected routes. I put in a quick hack to not reload when the null set is seen, and this pretty much eliminates the reloads.

bz1320233 - haproxy is reloaded every 10 minutes N-times for N endpoints
The fix by Maru Newby <marun> (PR8287) reduces the number of haproxy reloads.

My quick hack along with the long liveness timeouts ran over the 3 day weekend and the router did not fail. I have reduced the liveness timeouts to timeoutSeconds 5, periodSeconds 10 as a test to see if the crash restart is still happening. I will let the test run for a while and report results.

Comment 16 Phil Cameron 2016-05-31 14:56:18 UTC
In testing I noticed that networking stalled for 6+ seconds every 2-3 minutes. This seems to be related to the haproxy reload. With the hack mentioned in comment 15 the curl test ran 161000 times over a 3 day weekend without a stall. I reconfigured the liveness timeouts (as mentioned in comment 15) and noticed a single 12 second stall.

The script runs:
curl -H "Host: hello-openshift-v3.mycloud.com" http://172.30.113.75
every 2 seconds.

Comment 17 Phil Cameron 2016-06-03 12:46:37 UTC
Noticed the router set of routes bounced from the configured set to nil every 30 sec or so and every 1-3 minutes networking paused (as shown by curl failures) for 6+ seconds. I put in a quick hack to not reload the router with the nil set. After 2 days the pauses did not re-appear. Reset the timeouts to  periodSeconds: 10, timeoutSeconds: 5. The router pod has not restarted in 3 days of running the curl test. 

It appears that the router crashing frequently enough to enter CrashLoopBackOff is related to the bounce. 

PR8287 (bz 1320233) properly fixes the bounce problem which appears to be the root cause of this bug.

Comment 18 Phil Cameron 2016-06-06 13:31:28 UTC
Ran curl test (to detect network pause) over the weekend. Pause did not occur.

Comment 19 Phil Cameron 2016-06-10 13:33:24 UTC

*** This bug has been marked as a duplicate of bug 1320233 ***

Comment 20 Mike Fiedler 2016-06-24 15:55:37 UTC
I re-tested this with:

registry.qe.openshift.com/openshift3/ose-haproxy-router          v3.2.1.3            fe5e72b03bce

and the problem reported still occurs.   I used the following steps:

1. Need a big enough cluster to run 1000+ quick start application pods.  I am using AWS
   3 x master : m4.4xlarge
   3 x router : m4.xlarge
   1 x load balancer: m4.xlarge
  10 x node: m4.xlarge
2. Create 1 project with 1 application using the django-example template
3. Once the app is running: oc scale dc/django-example --replicas=1000
4. It should scale up OK and router pods should be in good health
5. oc scale dc/django-example --replicas=1100

The routers go into CrashLoopBackOff.   

As in comment 7, there was a "magic number".  This time it was 1008 replicas.

I could scale to 1007 replicas and the routers were stable.  No restarts, no failed health checks.   As soon as I scaled to 1008 replicas the routers began failing their health checks.  OpenShift/Kube then kills and restarts the pod.

Let me know what documentation/information you need.

Comment 23 Phil Cameron 2016-06-28 13:41:10 UTC
Mike, it times out because there is too much to do in the allotted time. 
What are the cakephp or django quickstart applications? How do they behave? Are they starting and stopping, and setting up and deleting routes, a lot?
I am using "hello-openshift" which just comes up and sits there. The routing, once set up, doesn't change. In this mode I cannot force the bug even with very short timeouts. 

The root cause was excessive route changing that required a lot of processing, so timeouts were missed. The bug mentioned in the duplicate started an update by erasing the routes and adding them back in, which causes a lot of processing. The fix builds and fills in a new table and then replaces the old one. Most of the time there are no changes and minimal processing. This was fixed in the "duplicate" bug that I used to close this bug.

I suspect that either your setup is aggressively changing routes and requiring a lot of processing or the fix is not in your version. We can start by exploring the pattern of route changes in your setup. In the end we may find that there is just too much going on and extending the timeouts is the needed change. There is also a chance we may find a different problem.

Are the modified timeouts sufficient to resolve this or do you want to continue exploring this?

Comment 24 Mike Fiedler 2016-06-28 14:50:33 UTC
The quick start applications are close to "do-nothing" applications.   Basically hello-worlds in different languages (PHP and python in this case) which keep track of how many times they have been hit.  No long-running activity.

See:  https://docs.openshift.com/enterprise/3.2/dev_guide/app_tutorials/quickstarts.html

I will re-test with hello-openshift and see what the results are.  It would be nice if someone could tell us whether the fix is in the 3.2.1.3 version of the router Docker image.

Comment 25 Phil Cameron 2016-07-12 12:23:55 UTC
Mike, is there anything left to do here? Can this be closed again?

Comment 26 Phil Cameron 2016-07-20 12:35:56 UTC
No response to #25
Closing this again.

Comment 27 Mike Fiedler 2016-08-09 00:38:44 UTC
Sorry I missed the comments.  I recreated this on 3.3 with hello-openshift pods as well as gcr.io/google_containers/pause-amd64:3.0 pods, which are the slimmest, most do-nothing shim of a container possible.  The secret is to make sure the deployment configuration has a Service and a Route defined.  If you just start the container as a pod, the problem does not occur.

My guess is that the large number of replicas for a single svc/route is what triggers the router restarts.

I'm re-opening so that this additional information can be seen.   Whether 1000+ replicas of a single DC is reasonable is something I don't have the answer to.

I'm attaching the yaml for the DC in case someone wants to recreate it with the pause pod.
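For anyone recreating this without the attached yaml, the important part is simply that the dc is fronted by a Service and a Route; a sketch with hypothetical names (the pause image never serves traffic, which presumably does not matter here since only the endpoints affect the router config):

oc create -f pause-dc.yaml                        # the attached deployment config
oc expose dc/pause --port=8080 --name=pause-svc   # service in front of the replicas
oc expose svc/pause-svc                           # route pointing at the service
oc scale dc/pause --replicas=1100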

Comment 28 Mike Fiedler 2016-08-09 00:39:29 UTC
Created attachment 1188978 [details]
Deployment config for pause pod with service and route defined

Comment 29 Mike Fiedler 2016-08-09 01:09:42 UTC
The configuration for the re-create using hello-openshift and pause was on AWS EC2

1 x m4.xlarge master load balancer
3 x m4.xlarge master/etcd
2 x m4.xlarge HAProxy router nodes
5 x m4.xlarge nodes with maxPods set at 250

OpenShift v3.3.0.16

Comment 31 Ben Bennett 2016-08-26 15:20:46 UTC
*** Bug 1370249 has been marked as a duplicate of this bug. ***

Comment 32 Marc Jadoul 2017-05-10 07:37:34 UTC
Hello,

I see this bug is closed.... (What exactly does CLOSED CURRENTRELEASE mean?)
We are also hit by this issue on OpenShift 3.2.1.30.
Our nodes are not really busy, and we reserved memory and CPU for the router.

Therefore we are looking for the possible root cause of haproxy not responding to the health check.

Our routers have recently been reinstalled and moved to new nodes. The old nodes have been removed.
We have quite a few hypotheses:
* There is apparently some misconfiguration of the network interface. We have bonding on it... but I do not see any reason it would impact the router.
* We see the router crash more often on some nodes. It might be correlated to the number of floating IPs on this particular node.
* As we reinstalled the router its name changed and it is now called router-XX-YY. When doing a diagnostic, we get a warning that the router is not found. This might indicate some misconfiguration and might also be the cause of our problem?

I will log a call to Red Hat support, but any help is welcome.
Marc

Comment 33 Phil Cameron 2017-05-10 12:44:22 UTC
Marc, see comment 30. Does that help? Also, verify maxconn is large enough. The default is 20000 which may not be enough for you.
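A sketch of one way to check and raise it (the config path inside the router pod is an assumption about the image, and the ROUTER_MAX_CONNECTIONS environment variable may not exist in older router templates, which may instead need haproxy-config.template edited):

oc exec <router-pod> -- grep -w maxconn /var/lib/haproxy/conf/haproxy.config
oc exec <router-pod> -- sh -c 'ulimit -n'
oc env dc/router ROUTER_MAX_CONNECTIONS=40000   # 'oc set env' on newer clients; triggers a router redeploy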

Comment 34 Marc Jadoul 2017-05-11 09:19:26 UTC
Hello Phil,
(I do not see comment 30 but support told me what it is)
My first try at increasing the timeout was not conclusive. The router crashes less often but still does not respond for a long time, which causes service interruptions.
Also, our servers are not really busy.... there is plenty of free CPU and memory.
And I reserved memory + CPU for haproxy, but this was also not conclusive.
I will try to increase the 20000 limit, but it already seems large and I do not see what could use 20000 connections. Our service is still in its early days and we do not have a large number of users.
How can I see maxconn / ulimit on the container?
The first thing I would like to clarify: is it haproxy that is malfunctioning (not responding to the health check), or the whole server locking up? The latter does not seem to be the case, as from the node terminal I can wget localhost and see that it is the health check that does not respond.....
So is it haproxy that is malfunctioning....?
Marc

Comment 35 Marc Jadoul 2017-05-17 17:12:56 UTC
So yes, in my case it was indeed a maxconn issue.... It seems useful to mention that the maxconn default value is probably 4096 in OpenShift 3.2.1.31.

Thanks a lot for your help.

Marc