Bug 1757916
| Summary: | kuryr-controller stuck in CrashLoopBack and no pod creation possible after running OCP functional automation. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Bennett <bbennett> |
| Component: | Networking | Assignee: | Luis Tomas Bolivar <ltomasbo> |
| Networking sub component: | kuryr | QA Contact: | GenadiC <gcheresh> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | bbennett, gcheresh, juriarte, ltomasbo, mdemaced, mdulko, mifiedle, racedoro |
| Version: | 4.2.0 | Keywords: | TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.2.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1757876 | Environment: | |
| Last Closed: | 2019-10-17 15:41:19 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1757876, 1759095 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Comment 4
Ben Bennett
2019-10-02 19:07:04 UTC
With the increased quota, things seem better - I can at least create new projects and pods. Next up is k8s conformance. However, the kuryr-controller is still periodically flapping and restarting even when the cluster is sitting idle. Is there any way to make this env completely clean from a Kuryr perspective? I've tried deleting all non-openshift projects and the issue still occurs.

Here's oc get pods -o wide -w -n openshift-kuryr for about 15 minutes:

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 496 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 497 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 497 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 CrashLoopBackOff 497 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 498 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 498 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 CrashLoopBackOff 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 500 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>

I'm also seeing occasional restarts of the kuryr-cni pods on worker nodes:

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 CrashLoopBackOff 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 500 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 500 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-cni-4b7hq 0/1 Running 63 2d21h 10.196.0.21 ostest-vqmw9-worker-74hgf <none> <none>
kuryr-cni-4b7hq 1/1 Running 63 2d21h 10.196.0.21 ostest-vqmw9-worker-74hgf <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 501 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>

Created attachment 1622294 [details]
openshift-kuryr pod logs with pods flapping during idle cluster
openshift-kuryr pod logs. kuryr-controller and kuryr-cni on workers are flapping while the cluster is idle
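To quantify the flapping without watching the namespace by hand, something like the following should work (just standard oc output options; the controller pod name is the one from the listing above and will change if the pod is recreated):

# Restart count per container for every pod in the namespace
oc get pods -n openshift-kuryr -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount

# Reason and exit code of the controller container's last termination (replace the pod name as needed)
oc get pod kuryr-controller-69cb8bd84d-kwc4t -n openshift-kuryr -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'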
I started kubernetes conformance (openshift-tests run kubernetes/conformance) and things immediately hung with the same issue of not being able to create new pods and kuryr-controller crash looping. Do we need a fresh install with the increased quotas established from the inception of the cluster?

Mike, thanks for the update on this bugzilla. I would suggest going for a fresh installation before running the next tests.

OCP reinstalled on its latest nightly version: 4.2.0-0.nightly-2019-10-02-150642. I deleted all the OpenStack leftovers from the previous cluster and tests. Let's see how it goes now in a fresh cluster with increased OpenStack quotas.

Executed the kubernetes conformance tests on the new cluster from comment 12 with the increased quotas. OCP QE will focus on this use case (k8s conformance) for this bz unless there are objections; it is currently used by QE to vet clusters on all cloud providers. It usually runs in around 20 minutes with all 204/204 tests passing unless there are bugs present. During this run, 80 of 204 tests failed and the run took 2 hours and 25 minutes. Over the course of the run, kuryr-controller crash looped and restarted several times, and kuryr-cni on the worker nodes restarted several times. I'll attach pod logs - let me know what other info is required. The access to the cluster and the kubeconfig are as detailed in the description (search on titan24).

Reproducer (a consolidated command sketch follows below, after the attachment note):
- extract openshift-tests from the payload (I will include a link in a private comment to follow)
- KUBECONFIG=/path/to/kubeconfig ./openshift-tests run kubernetes/conformance
- oc get pods -n openshift-kuryr -w
- wait a while; I started seeing restarts after ~80 tests were complete

kuryr-cni-4pf76 1/1 Running 1 48m
kuryr-cni-lpmz6 1/1 Running 1 48m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 5 62m
kuryr-controller-745bc55f58-mpqkn 1/1 Running 5 64m
kuryr-cni-lpmz6 0/1 Running 2 53m
kuryr-cni-4pf76 0/1 Running 2 53m
kuryr-cni-bkqt8 0/1 Running 2 52m
kuryr-cni-lpmz6 1/1 Running 2 53m
kuryr-cni-4pf76 1/1 Running 2 53m
kuryr-cni-bkqt8 1/1 Running 2 53m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 6 67m
kuryr-controller-745bc55f58-mpqkn 1/1 Running 6 68m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 7 79m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 8 81m
kuryr-cni-bkqt8 0/1 Running 3 70m
kuryr-cni-lpmz6 0/1 Running 3 70m
kuryr-cni-4pf76 0/1 Running 3 70m
kuryr-cni-bkqt8 1/1 Running 3 70m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 9 83m
kuryr-cni-4pf76 1/1 Running 3 71m
kuryr-cni-lpmz6 1/1 Running 3 71m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 10 85m
kuryr-cni-lpmz6 0/1 Running 4 74m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 11 87m
kuryr-cni-lpmz6 1/1 Running 4 74m
kuryr-cni-bkqt8 0/1 Running 4 74m
kuryr-cni-4pf76 0/1 Running 4 75m
kuryr-cni-bkqt8 1/1 Running 4 74m
kuryr-cni-4pf76 1/1 Running 4 75m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 11 89m
kuryr-cni-lpmz6 0/1 Running 5 78m
kuryr-cni-lpmz6 1/1 Running 5 78m
kuryr-cni-bkqt8 0/1 Running 5 78m
kuryr-cni-4pf76 0/1 Running 5 78m
kuryr-cni-bkqt8 1/1 Running 5 78m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 12 91m
kuryr-cni-4pf76 1/1 Running 5 79m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 12 93m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 13 98m
kuryr-cni-bkqt8 0/1 Running 6 86m
kuryr-cni-bkqt8 1/1 Running 6 86m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 14 100m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 14 102m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 15 107m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 16 109m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 16 111m
kuryr-cni-lpmz6 0/1 Running 6 102m
kuryr-cni-lpmz6 1/1 Running 6 102m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 17 116m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 18 117m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 18 119m
kuryr-cni-4pf76 0/1 Running 6 107m
kuryr-cni-4pf76 1/1 Running 6 108m
kuryr-cni-4pf76 0/1 Running 7 111m
kuryr-cni-4pf76 1/1 Running 7 112m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 19 124m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 20 126m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 20 128m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 21 133m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 22 135m
kuryr-cni-lpmz6 0/1 Running 7 124m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 22 137m
kuryr-cni-lpmz6 1/1 Running 7 124m
kuryr-cni-bkqt8 0/1 Running 7 129m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 23 142m
kuryr-cni-bkqt8 1/1 Running 7 129m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 24 144m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 24 145m

Created attachment 1622380 [details]
openshift-kuryr pod logs
Pod logs for openshift-kuryr namespace after kuryr-controller started crashlooping and kuryr-cni pods were restarting regularly. This was during execution of the kubernetes conformance testsuite.
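Putting the reproducer above in one place (a sketch; the exact release payload pullspec is in the private comment, and oc adm release extract is just one way to obtain the openshift-tests binary):

# Pull the openshift-tests binary out of the release payload (pullspec deliberately left as a placeholder)
oc adm release extract --command=openshift-tests --to=. <release-payload-pullspec>

# Run the kubernetes conformance suite against the cluster
KUBECONFIG=/path/to/kubeconfig ./openshift-tests run kubernetes/conformance

# In a second terminal, watch the kuryr pods; restarts started after roughly 80 tests
KUBECONFIG=/path/to/kubeconfig oc get pods -n openshift-kuryr -w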
For comparison to comment 13, I ran the k8s conformance on an OSP 13 cluster configured with OpenShift SDN (4.2.0-0.nightly-2019-10-02-122541). 204/204 tests passed in 21m25s.

Testing with:
# oc describe deployment -n openshift-kuryr kuryr-controller | grep Image
Image: docker.io/maysamacedo/kuryr-controller:latest
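(The same check without grep, plus the image ID the running pod actually pulled; a sketch using standard oc output options, with the pod name taken from this cluster's listing below:)

# Image configured in the deployment spec
oc get deployment kuryr-controller -n openshift-kuryr -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'

# Image ID the running controller pod actually pulled
oc get pod kuryr-controller-588b9c6bcd-lqgbv -n openshift-kuryr -o jsonpath='{.status.containerStatuses[*].imageID}{"\n"}'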
This run was better but not great. kuryr-controller restarted 17 times and was in CrashLoopBackOff a few times as well:
kuryr-controller-588b9c6bcd-lqgbv 0/1 CrashLoopBackOff 16 3h26m
kuryr-controller-588b9c6bcd-lqgbv 0/1 Running 17 3h31m
The actual k8s conformance test suite did a bit better. The tests ran in 78 minutes with 68 failed and 136 passed (compare to comment 16 for OpenShift SDN and comment 13 for the previous Kuryr). The failures were not networking-specific tests - just general tests failing due to pod creation failures.
I am running again to see if the results are consistent or if we degrade. Will attach detailed test results this time as well.
Seeing the following events firing while the tests are running (see the resolv.conf check sketched below):

Oct 04 19:22:44.243 W ns/openshift-kuryr pod/kuryr-cni-sn4gk Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (230 times)
Oct 04 19:22:47.089 W ns/openshift-dns pod/dns-default-s6h5t Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.34 10.196.0.11 10.196.0.12 (225 times)
Oct 04 19:22:48.027 W ns/openshift-monitoring pod/node-exporter-5d7km Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (218 times)
Oct 04 19:22:48.335 W ns/openshift-kuryr pod/kuryr-cni-nft6c Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.42 10.196.0.11 10.196.0.12 (289 times)
Oct 04 19:22:49.008 W ns/openshift-kuryr pod/kuryr-controller-588b9c6bcd-lqgbv Readiness probe failed: HTTP probe failed with statuscode: 500 (222 times)
Oct 04 19:22:49.028 W ns/openshift-kube-apiserver pod/kube-apiserver-ostest-fgfxk-master-0 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (212 times)
Oct 04 19:22:50.339 W ns/openshift-monitoring pod/node-exporter-ffqwx Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.42 10.196.0.11 10.196.0.12 (217 times)
Oct 04 19:22:53.030 W ns/openshift-multus pod/multus-m5nnt Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.14 10.196.0.11 10.196.0.12 (224 times)
Oct 04 19:23:02.031 W ns/openshift-ingress pod/router-default-5f64bb4978-8dnbw Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.14 10.196.0.11 10.196.0.12 (217 times)

During the second run mentioned in comment 18, the cluster and kuryr seem to be in a degraded state. error: 172 fail, 32 pass, 0 skip (1h27m33s). Same access info as before; let me know if there is anything I should gather.

The cluster was reinstalled today (7-October) and I executed the kubernetes conformance tests again. The results seem to be the same: kuryr-controller and kuryr-cni pods are restarting frequently and the conformance tests have about a 50% pass rate.
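On the "Nameserver limits were exceeded" warnings above: kubelet raises that event when a node's /etc/resolv.conf lists more than three nameservers (only the first three are applied), so it is worth confirming what the nodes actually have. A quick check, assuming nothing beyond standard oc; the node name is taken from the events above:

# More than three nameserver lines here triggers the warning
oc debug node/ostest-fgfxk-master-0 -- chroot /host cat /etc/resolv.conf

# Correlate the readiness probe failures with controller restarts
oc get events -n openshift-kuryr --sort-by=.lastTimestamp | tail -n 30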
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 0.0.1-2019-10-07-143408 True False 79m Cluster version is 0.0.1-2019-10-07-143408
# oc describe deployment kuryr-controller -n openshift-kuryr | grep Image
Image: docker.io/luis5tb/kuryr:latest
kuryr-controller-6b4459584f-hb8qj 0/1 Running 7 53m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 7 55m
kuryr-cni-wh76m 0/1 Running 3 34m
kuryr-cni-wh76m 1/1 Running 3 34m
kuryr-cni-dvc7z 0/1 Running 3 37m
kuryr-cni-cvhdb 0/1 Running 3 37m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 8 58m
kuryr-cni-cvhdb 1/1 Running 3 37m
kuryr-cni-dvc7z 1/1 Running 3 38m
kuryr-cni-wh76m 0/1 Running 4 37m
kuryr-controller-6b4459584f-hb8qj 1/1 Running 8 60m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 8 60m
kuryr-cni-wh76m 1/1 Running 4 38m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 9 65m
kuryr-cni-wh76m 0/1 Running 5 44m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 10 67m
kuryr-cni-wh76m 1/1 Running 5 45m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 10 69m
kuryr-cni-wh76m 0/1 Running 6 49m
kuryr-cni-cvhdb 0/1 Running 4 50m
kuryr-cni-dvc7z 0/1 Running 4 51m
kuryr-cni-wh76m 1/1 Running 6 49m
kuryr-cni-cvhdb 1/1 Running 4 51m
kuryr-cni-dvc7z 1/1 Running 4 51m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 11 74m
kuryr-cni-wh76m 0/1 Running 7 53m
kuryr-cni-dvc7z 0/1 Running 5 54m
kuryr-cni-cvhdb 0/1 Running 5 54m
kuryr-cni-wh76m 1/1 Running 7 53m
kuryr-cni-cvhdb 1/1 Running 5 54m
kuryr-cni-dvc7z 1/1 Running 5 55m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 12 76m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 12 78m
kuryr-cni-wh76m 0/1 CrashLoopBackOff 7 56m
kuryr-cni-cvhdb 0/1 Running 6 58m
kuryr-cni-cvhdb 1/1 Running 6 58m
kuryr-cni-dvc7z 0/1 Running 6 59m
kuryr-cni-dvc7z 1/1 Running 6 60m
kuryr-cni-cvhdb 0/1 Running 7 61m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 13 83m
kuryr-cni-cvhdb 1/1 Running 7 62m
kuryr-cni-dvc7z 0/1 Running 7 63m
kuryr-cni-wh76m 0/1 Running 8 61m
kuryr-cni-dvc7z 1/1 Running 7 63m
kuryr-cni-wh76m 1/1 Running 8 62m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 14 84m
kuryr-cni-cvhdb 0/1 Running 8 65m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 14 86m
kuryr-cni-cvhdb 1/1 Running 8 65m
kuryr-cni-dvc7z 0/1 Running 8 67m
kuryr-cni-dvc7z 1/1 Running 8 67m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 15 91m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 16 93m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 16 95m
kuryr-cni-wh76m 0/1 Running 9 75m
kuryr-cni-wh76m 1/1 Running 9 75m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 17 100m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 18 102m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 18 104m
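If more detail is useful, something like the following should gather the relevant logs (standard oc commands; the pod names are the ones from the listing above and will differ after restarts):

# Traceback from the previous (crashed) controller container
oc logs -n openshift-kuryr kuryr-controller-6b4459584f-hb8qj --previous > kuryr-controller-previous.log

# Logs from one of the flapping kuryr-cni daemonset pods
oc logs -n openshift-kuryr kuryr-cni-wh76m --previous > kuryr-cni-wh76m-previous.log

# Full diagnostic dump if the networking team wants everything
oc adm must-gather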
*** This bug has been marked as a duplicate of bug 1759097 ***