Bug 1329399
| Summary: | OpenShift router pods go into CrashLoopBackOff and restart when scaling an app to ~1000 replicas | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Networking | Assignee: | Phil Cameron <pcameron> |
| Networking sub component: | router | QA Contact: | Mike Fiedler <mifiedle> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aloughla, aos-bugs, atragler, bbennett, dakini, eparis, jmalde, marc.jadoul, mifiedle, mleitner, ramr, rkhan, tstclair |
| Version: | 3.2.0 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-12 17:06:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Mike Fiedler
2016-04-21 20:08:50 UTC
Created attachment 1149563 [details]
oc get events during router crash
Created attachment 1149564 [details]
syslog messages during router crash
Is there anything in the docker logs for the crashed routers?

The docker logs just show retries for the health check. Let me know if there is a higher log level to try.

root@ip-172-31-19-40: ~ # docker logs 935972e95440
I0422 09:49:50.321345 1 router.go:161] Router is including routes in all namespaces
I0422 09:49:55.317371 1 router.go:310] Router reloaded:
- Checking HAProxy /healthz on port 1936 ...
- HAProxy port 1936 health check ok : 5 retry attempt(s).
I0422 09:49:57.437498 1 router.go:310] Router reloaded:
- Checking HAProxy /healthz on port 1936 ...
- HAProxy port 1936 health check ok : 2 retry attempt(s).

What's the ulimit on the container?

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63390
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Did some further investigation. Scaled up to a known good number of pods and then increased slowly while waiting between increases. There is definitely a magic number involved here. I could always scale up to 1011 instances and wait for any period of time and things were fine: no router restarts, no error messages in the logs on the nodes running the routers. As soon as I increased the replicas to 1012, the routers restarted. Decreasing back to 1011 stopped the restarts. Config was 3x master + etcd, 2x router, 2x registry, 10x app nodes.

Attaching the system log from a node where the restarts occurred. A burst of messages always comes in, but I am not sure how useful they are.

Created attachment 1151062 [details]
additional syslog messages
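For reference, the health check those retries refer to can be exercised by hand from the node running the router. A minimal sketch, assuming the stats/health port 1936 from the logs above is reachable on the node and that /healthz requires no credentials:

# Time the same endpoint the liveness probe hits, to compare against the probe's timeoutSeconds
time curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:1936/healthz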
I have reproduced this (or a similar) problem on a 6 node cluster in the lab. Next will be finding the root cause of the problem.

# docker ps -a | grep ro
8bd46592eca7 openshift3/ose-haproxy-router:v3.2.0.44 "/usr/bin/openshift-r" 3 minutes ago Exited (2) About a minute ago k8s_router.863168f7_router-1-s0dnn_default_c5cf63fb-1b8f-11e6-8e61-ecf4bbd7ae44_984463d0

# docker logs 8bd46592eca7
. . .
I0517 12:56:51.493340 1 router.go:310] Router reloaded:
- Checking HAProxy /healthz on port 1936 ...
- HAProxy port 1936 health check ok : 1 retry attempt(s).
I0517 12:56:59.735537 1 router.go:310] Router reloaded:
- Checking HAProxy /healthz on port 1936 ...
- HAProxy port 1936 health check ok : 0 retry attempt(s).

Set timeouts in dc/router: periodSeconds=300, timeoutSeconds=40.
I let this run overnight. I came in this morning and took a look.
In looking at the docker logs, the router was started at 16:01 and crashed at 01:45
- Checking HAProxy /healthz on port 1936 ...
- HAProxy port 1936 health check ok : 0 retry attempt(s).
I0518 01:45:05.882564 1 reaper.go:24] Signal received: child exited
I0518 01:45:07.825801 1 plugin.go:133] Processing 1 Endpoints for Name: nginx-svc-online (MODIFIED)
I0518 01:45:07.825817 1 plugin.go:136] Subset 0 : api.EndpointSubset{Addresses:[]api.EndpointAddress(nil), NotReadyAddresses:[]api.EndpointAddress{api.EndpointAddress{IP:"10.1.4.2", TargetRef:(*api.ObjectReference)(0xc2087b07e0)}}, Ports:[]api.EndpointPort{api.EndpointPort{Name:"80-tcp", Port:8080, Protocol:"TCP"}}}
I0518 01:45:07.825857 1 plugin.go:145] Modifying endpoints for conn1002/nginx-svc-online
I0518 01:45:07.825869 1 router.go:501] Ignoring change for conn1002/nginx-svc-online, endpoints are the same
The new router started at: 01:46:58.573031
and is still working this morning.
I took a look at the health check retries and made a table of the number of samples that required N retries:
Total samples: 1104
Retries samples
0 677
1 232
2 9
3 18
4 36
5 21
6 47
7 13
8 24
9 25
10 1
11 0
12 0
13 1
14 0
It strikes me that when you have to try curl 2 times to see it work 21% of the time, and 6 times to finally see it work 4% of the time, there is a problem.
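A sketch of how a table like this can be pulled out of the router container logs; the container ID is a placeholder to take from docker ps, and the grep pattern assumes the "health check ok : N retry attempt(s)" wording shown in the snippets above:

ROUTER_CID=<router container id from docker ps>
docker logs $ROUTER_CID 2>&1 \
  | grep -o 'health check ok : [0-9]* retry' \
  | awk '{count[$5]++} END {for (r in count) print r, count[r]}' \
  | sort -n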
Came back from lunch:

# curl -H "Host: hello-openshift-v3.mycloud.com" http://172.30.113.75/
curl: (7) Failed connect to 172.30.113.75:80; Connection refused

Waited a few seconds:

# curl -H "Host: hello-openshift-v3.mycloud.com" http://172.30.113.75/
Hello OpenShift!

# oc get rc
NAME       DESIRED   CURRENT   AGE
hello-rc   950       950       6d

I have reproduced the problem on the lab 6 node cluster with a project running 975 replicas of hello-openshift.
During these tests the setup is static, no changes are made to the configuration, no pods are added or deleted.
Aside from infrastructure pods, there are some nginx and a lot of hello-openshift pods. curl to the hello-openshift pods is the only input to the cluster.
This setup shows some interesting behaviors.
- I have found that extending the timeouts resolves the problem.
# oc edit dc/router
spec:
containers:
- command:
- /usr/bin/openshift-router
- --loglevel=4
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 300
successThreshold: 1
timeoutSeconds: 40
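The same probe change can also be applied non-interactively; a sketch using oc patch, assuming the container in dc/router is named "router" (the default in a standard install):

oc patch dc/router -p '{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"initialDelaySeconds":30,"periodSeconds":300,"timeoutSeconds":40}}]}}}'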
The actual reload seems to be quick:
<logs-snippet>
I0517 15:24:38.913860 1 router.go:225] Writing the router state
I0517 15:24:38.918282 1 router.go:230] Writing the router config
I0517 15:24:39.007365 1 router.go:235] Reloading the router
I0517 15:24:39.054372 1 reaper.go:24] Signal received: child exited
I0517 15:24:39.054411 1 reaper.go:32] Reaped process with pid 55
I0517 15:24:40.067301 1 router.go:310] Router reloaded:
- Checking HAProxy /healthz on port 1936 ...
- HAProxy port 1936 health check ok : 1 retry attempt(s).
I0517 15:24:40.067329 1 reaper.go:24] Signal received: child exited
</logs-snippet>
- Why does increasing the timeout work?
This is still unknown...
- On reload the first curl doesn't always work. The retries are 500msec apart.
A histogram of retries for 1104 reloads follows
Retries samples
0 677
1 232
2 9
3 18
4 36
5 21
6 47
7 13
8 24
9 25
10 1
11 0
12 0
13 1
14 0
It strikes me that when you have to try curl 2 times to see it work 21% of the time, and 6 times to finally see it work 4% of the time, there is a problem. Could this be the timeout problem?
- 6 second network pause every 1-3 min
(This is not likely related to the 10 minute reload cycle)
I put curl in a 2 sec loop on the master node and printed out every time the curl failed (returned 7).
It appears that every 1-3 minutes, the curl fails for roughly 6 seconds.
I set up dnsmasq on my laptop and looped calling curl every second and timed how long (wall clock) curl took. Most of the time it is around 140 msec; however, every few minutes it takes 5.3 seconds.
During these pauses ssh
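For reference, a minimal sketch of the 2-second polling loop described above; the service IP and Host header are the ones used elsewhere in this report, and the timestamp makes the pause length and frequency measurable:

while true; do
  curl -s -o /dev/null -H "Host: hello-openshift-v3.mycloud.com" http://172.30.113.75/
  rc=$?
  if [ $rc -ne 0 ]; then
    echo "$(date '+%H:%M:%S') curl failed (exit $rc)"   # exit 7 = connection refused
  fi
  sleep 2
done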
- router pod crashed and was restarted 7 times in 2 days
In looking at the docker logs, the router was started at 16:01 and crashed at 01:45
- Checking HAProxy /healthz on port 1936 ...
- HAProxy port 1936 health check ok : 0 retry attempt(s).
I0518 01:45:05.882564 1 reaper.go:24] Signal received: child exited
Log file ended here.
A new log file started a minute and 53 seconds later; the new router started at 01:46:58.573031.
- Is iptables changing, or are pod changes changing the configuration?
I captured
oc get all -o wide
before and after a stall and they are the same.
I captured iptables before and after a stall. There are 4061 lines.
The nat table has a minor re-ordering of the router service entries.
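A sketch of the before/after capture (file names are arbitrary; iptables-save gives a diff-able dump):

oc get all -o wide > all-before.txt; iptables-save > ipt-before.txt
# ... wait for a stall ...
oc get all -o wide > all-after.txt;  iptables-save > ipt-after.txt
diff all-before.txt all-after.txt
diff ipt-before.txt ipt-after.txt    # the nat-table reordering shows up here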
Looking at the docker log for the router container, the time stamps roughly mirror the pauses in processing.

I0518 22:57:09.962036 1 router.go:310] Router reloaded:
I0518 22:57:37.100725 1 router.go:310] Router reloaded:
I0518 23:00:00.071569 1 router.go:310] Router reloaded:
I0518 23:00:06.033517 1 router.go:310] Router reloaded:
I0518 23:01:36.905530 1 router.go:310] Router reloaded:
I0518 23:01:52.080527 1 router.go:310] Router reloaded:
I0518 23:03:27.918080 1 router.go:310] Router reloaded:
I0518 23:03:35.882036 1 router.go:310] Router reloaded:
I0518 23:06:31.943347 1 router.go:310] Router reloaded:
I0518 23:06:35.891345 1 router.go:310] Router reloaded:
I0518 23:07:00.019580 1 router.go:310] Router reloaded:
I0518 23:07:17.913971 1 router.go:310] Router reloaded:
I0518 23:07:27.050100 1 router.go:310] Router reloaded:
I0518 23:10:14.963170 1 router.go:310] Router reloaded:
I0518 23:10:15.896205 1 router.go:310] Router reloaded:
I0518 23:10:57.905800 1 router.go:310] Router reloaded:

Found out why the reload happens so often: AddEndpoints() in pkg/router/template/router.go calls reflect.DeepEqual(), which does not notice that there are no changes. If the match were noticed, this static setup would not need to reload. Still don't know why networking stalls during reload. Also, don't know why the change event was signaled (since things haven't changed). Probably a good idea to find and fix the stall first, since the match problem reproduces the problem so often. Finally, still don't know why the router crashes (which is the actual bug).

In looking at the docker log for the router, it appears that an haproxy reload is a frequent occurrence. It bounces back and forth between a null set and the expected route. I put in a quick hack to not reload when the null set is seen, and this pretty much eliminates the reloads. bz1320233 - haproxy is reloaded every 10 minutes N-times for N endpoints. The fix by Maru Newby <marun> (PR8287) reduces the number of haproxy reloads.

My quick hack, along with the long liveness timeouts, ran over the 3-day weekend and the router did not fail. I have reduced the liveness timeouts to timeoutSeconds 5, periodSeconds 10 as a test to see if the crash/restart still happens. I will let the test run for a while and report results.

In testing I noticed that networking stalled for 6+ seconds every 2-3 minutes. This seems to be related to the haproxy reload. With the hack mentioned in comment 15, the curl test ran 161000 times over a 3-day weekend without a stall. I reconfigured the liveness timeouts (as mentioned in comment 15) and noticed a single 12-second stall. The script runs: curl -H "Host: hello-openshift-v3.mycloud.com" http://172.30.113.75 every 2 seconds.

Noticed the router's set of routes bounced from the configured set to nil every 30 seconds or so, and every 1-3 minutes networking paused (as shown by curl failures) for 6+ seconds. I put in a quick hack to not reload the router with the nil set. After 2 days the pauses did not re-appear. Reset the timeouts to periodSeconds: 10, timeoutSeconds: 5. The router pod has not restarted in 3 days of running the curl test.

It appears that the router crashing frequently enough to enter CrashLoopBackOff is related to the bounce. PR8287 (bz 1320233) properly fixes the bounce problem, which appears to be the root cause of this bug. Ran the curl test (to detect network pauses) over the weekend. The pause did not occur.
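A sketch of how the reload cadence can be measured from the container log; the container ID is a placeholder, and the awk bucketing simply counts "Router reloaded" lines per hour:minute based on the glog timestamp format shown above:

ROUTER_CID=<router container id from docker ps>
docker logs $ROUTER_CID 2>&1 \
  | grep 'Router reloaded' \
  | awk '{print substr($2,1,5)}' \
  | uniq -c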
*** This bug has been marked as a duplicate of bug 1320233 ***

I re-tested this with registry.qe.openshift.com/openshift3/ose-haproxy-router v3.2.1.3 fe5e72b03bce and the problem reported still occurs. I used the following steps:
1. Need a big enough cluster to run 1000+ quick start application pods. I am using AWS: 3 x master: m4.4xlarge, 3 x router: m4.xlarge, 1 x load balancer: m4.xlarge, 10 x node: m4.xlarge
2. Create 1 project with 1 application using the django-example template
3. Once the app is running: oc scale dc/django-example --replicas 1000
4. It should scale up OK and router pods should be in good health
5. oc scale --replicas=1100
The routers go into CrashLoopBackOff. As in comment 7, there was a "magic number". This time it was 1008 replicas. I could scale to 1007 replicas and the routers were stable: no restarts, no failed health checks. As soon as I scaled to 1008 replicas, the routers began failing their health checks. OpenShift/Kube then kills and restarts the pod. Let me know what documentation/information you need.

Mike, it times out because there is too much to do in the allotted time. What are the cakephp or django quickstart applications? How do they behave? Are they starting and stopping, and setting up and deleting routes a lot? I am using "hello-openshift", which just comes up and sits there. The routing, once set up, doesn't change. In this mode I cannot force the bug even with very short timeouts. The root cause was the excessive route changing that required a lot of processing, so timeouts were missed. The bug mentioned in the duplicate started an update by erasing the routes and adding them back in, which causes a lot of processing. The fix created and filled in a new table and then replaced the old; most of the time there are no changes and minimal processing. This was fixed in the "duplicate" bug that I used to close this bug. I suspect that either your setup is aggressively changing routes and requiring a lot of processing, or the fix is not in your version. We can start by exploring the pattern of route changes in your setup. In the end we may find that there is just too much going on and extending the timeouts is the needed change. There is also a chance we may find a different problem. Are the modified timeouts sufficient to resolve this, or do you want to continue exploring?

The quick start applications are close to "do-nothing" applications: basically hello-worlds in different languages (PHP and Python in this case) which keep track of how many times they have been hit. No long-running activity. See: https://docs.openshift.com/enterprise/3.2/dev_guide/app_tutorials/quickstarts.html
I will re-test with hello-openshift and see what the results are. It would be nice if someone could tell us whether the fix is in the 3.2.1.3 version of the router Docker image.

Mike, is there anything left to do here? Can this be closed again?

No response to #25. Closing this again.

Sorry I missed the comments. I recreated this on 3.3 with hello-openshift pods as well as gcr.io/google_containers/pause-amd64:3.0 pods, which are the slimmest, do-nothing-est shim of a container possible. The secret is to make sure the deployment configuration has a Service and Route defined. If you just start the container as a pod, the problem does not occur. My guess is that the large number of replicas for a single svc/route is what triggers the router restarts. I'm re-opening so that this additional information can be seen.
Whether 1000+ replicas of a single DC is reasonable is something I don't have the answer to. I'm attaching the yaml for the DC in case someone wants to recreate it with the pause pod. Created attachment 1188978 [details]
Deployment config for pause pod with service and route defined
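A sketch of stepping the replica count up to find the "magic number" described in the re-test above; the dc name matches the django-example steps, while the step size, wait time, namespace, and router label are assumptions to adjust:

for n in $(seq 1000 10 1100); do
  oc scale dc/django-example --replicas=$n
  sleep 180    # let the scale-up settle and several probe periods elapse
  echo "replicas=$n"; oc get pods -n default -l deploymentconfig=router   # watch RESTARTS
done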
The configuration for the re-create using hello-openshift and pause was on AWS EC2:
1 x m4.xlarge master load balancer
3 x m4.xlarge master/etcd
2 x m4.xlarge HAProxy router nodes
5 x m4.xlarge nodes with maxPods set at 250
OpenShift v3.3.0.16

*** Bug 1370249 has been marked as a duplicate of this bug. ***

Hello, I see this bug is closed (what does CLOSED CURRENTRELEASE mean exactly?). We are also hit by this issue on OpenShift 3.2.1.30. Our nodes are not really busy, and we reserved memory and CPU for the router. Therefore we are looking for the possible root cause of haproxy not responding to the health check. Our routers have recently been reinstalled and moved to new nodes; the old nodes have been removed. We have quite a few hypotheses:
* There is apparently some misconfiguration of the network interface. We have bonding on it, but I do not see any reason it would impact the router.
* We see the router crash more often on some nodes. It might be correlated to the number of floating IPs on those particular nodes.
* As we reinstalled the router, its name changed and it is now called router-XX-YY. When doing a diagnostic, we get a warning that the router is not found. This might indicate some misconfiguration and might also be the cause of our problem.
I will log a call to Red Hat support, but any help is welcome. Marc

Marc, see comment 30. Does that help? Also, verify maxconn is large enough. The default is 20000, which may not be enough for you.

Hello Phil, (I do not see comment 30, but support told me what it is.) My first try at increasing the timeout was not conclusive. The router crashes less often but still does not respond for a long time, which causes service interruption. Also, our servers are not really busy: there is plenty of free CPU and memory, and I reserved memory + CPU for haproxy, but this is also not conclusive. I will try to increase the 20000 limit, but it seems already large and I do not see what could take 20000 connections. Our service is still in its beginning and we do not have a large number of users. How can I see maxconn / ulimit on the container? The first thing I would like to clarify: is it haproxy that is malfunctioning (not responding to the health check), or is the whole server locking up? The latter does not seem to be the case, as from the node terminal I can wget on localhost and see that it is the health check that does not respond. Thus haproxy is malfunctioning? Marc

So yes, in my case it was effectively a maxconn issue. It seems useful to mention that the maxconn default value is probably 4096 in OpenShift 3.2.1.31. Thanks a lot for your help. Marc
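To answer the "How can I see maxconn / ulimit on the container?" question, a sketch that inspects a running router pod; the pod name is a placeholder to take from oc get pods, and the haproxy.config path assumes the standard 3.x router image layout:

# Show the container's resource limits and the effective HAProxy maxconn
oc rsh router-1-abcde sh -c 'ulimit -a; grep -i maxconn /var/lib/haproxy/conf/haproxy.config'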