If testing is blocked by this, can you try https://bugzilla.redhat.com/show_bug.cgi?id=1757876#c5 to see if bumping the quota unblocks QE?
With the increased quota, things seem better - I can at least create new projects and pods. Next up is k8s conformance. However, the kuryr-controller is still periodically flapping and restarting even when the cluster is sitting idle. Is there any way to make this env completely clean from a Kuryr perspective? I've tried deleting all non-openshift projects and the issue still occurs. Here's oc get pods -o wide -w -n openshift-kuryr for about 15 minutes:

NAME                                READY   STATUS             RESTARTS   AGE     IP            NODE                    NOMINATED NODE   READINESS GATES
kuryr-controller-69cb8bd84d-kwc4t   1/1     Running            496        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     Running            497        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   1/1     Running            497        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     CrashLoopBackOff   497        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     Running            498        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   1/1     Running            498        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     Running            499        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   1/1     Running            499        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     CrashLoopBackOff   499        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     Running            500        2d21h   10.196.0.40   ostest-vqmw9-master-1   <none>           <none>
I'm also seeing occasional restarts of the kuryr-cni pods on worker nodes:

NAME                                READY   STATUS             RESTARTS   AGE     IP            NODE                        NOMINATED NODE   READINESS GATES
kuryr-controller-69cb8bd84d-kwc4t   1/1     Running            499        2d21h   10.196.0.40   ostest-vqmw9-master-1       <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     CrashLoopBackOff   499        2d21h   10.196.0.40   ostest-vqmw9-master-1       <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     Running            500        2d21h   10.196.0.40   ostest-vqmw9-master-1       <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   1/1     Running            500        2d21h   10.196.0.40   ostest-vqmw9-master-1       <none>           <none>
kuryr-cni-4b7hq                     0/1     Running            63         2d21h   10.196.0.21   ostest-vqmw9-worker-74hgf   <none>           <none>
kuryr-cni-4b7hq                     1/1     Running            63         2d21h   10.196.0.21   ostest-vqmw9-worker-74hgf   <none>           <none>
kuryr-controller-69cb8bd84d-kwc4t   0/1     Running            501        2d21h   10.196.0.40   ostest-vqmw9-master-1       <none>           <none>
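In case it helps quantify the flapping, here is a small throwaway script (hypothetical, not part of any OpenShift tooling) that tallies how much each pod's RESTARTS column climbed during a watch like the one above. It assumes the default `oc get pods -w` column order (NAME READY STATUS RESTARTS AGE ...).

```python
# Tally RESTARTS transitions per pod from `oc get pods -w` output.
# Hypothetical one-off helper; assumes the default column layout:
# NAME READY STATUS RESTARTS AGE [...].

def restart_summary(watch_lines):
    """Return {pod: (first_restarts_seen, last_restarts_seen)}."""
    seen = {}
    for line in watch_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] == "NAME":
            continue  # skip the header row and blank lines
        name, restarts = fields[0], fields[3]
        if not restarts.isdigit():
            continue  # ignore malformed lines
        n = int(restarts)
        first, _ = seen.get(name, (n, n))
        seen[name] = (first, n)
    return seen

# Example with two records from the watch output above:
sample = [
    "kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 496 2d21h",
    "kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 500 2d21h",
]
for pod, (first, last) in restart_summary(sample).items():
    print(f"{pod}: {last - first} new restarts ({first} -> {last})")
```

In practice you would pipe the saved watch output into it; over the 15-minute window above the controller goes from 496 to 500 restarts.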
Created attachment 1622294 [details] openshift-kuryr pod logs with pods flapping during idle cluster openshift-kuryr pod logs. kuryr-controller and kuryr-cni on workers are flapping while the cluster is idle
Pod logs in comment 8 are with the increased quota settings from comment 5
I started kubernetes conformance (openshift-tests run kubernetes/conformance) and things immediately hung: new pods could not be created and kuryr-controller was crash looping. Do we need a fresh install with the increased quotas established from the inception of the cluster?
Mike, thanks for the update on this bugzilla. I would suggest going for a fresh installation before running the next tests.
OCP reinstalled on its latest nightly version: 4.2.0-0.nightly-2019-10-02-150642. I deleted all the OpenStack leftovers from the previous cluster and tests. Let's see how it goes now in a fresh cluster with increased OpenStack quotas.
Executed kubernetes conformance tests on the new cluster in comment 12 with the increased quotas - OCP QE will focus on this use case for this bz on k8s conformance unless there are objections. It is currently used by QE to vet clusters on all cloud providers. It usually runs in around 20 minutes with all 204/204 tests passing unless there are bugs present. During this run, 80 of 204 tests failed and the run took 2 hours and 25 minutes. Over the course of the run, kuryr-controller crash looped and restarted several times, and kuryr-cni on the worker nodes restarted several times. I'll attach pod logs - let me know what other info is required. The access to the cluster and the kubeconfig are as detailed in the description (search on titan24).

Reproducer:
- extract openshift-tests from payload (I will include a link in private comment to follow)
- KUBECONFIG=/path/to/kubeconfig ./openshift-tests run kubernetes/conformance
- oc get pods -n openshift-kuryr -w
- wait a while. I started seeing restarts after ~80 tests were complete.
kuryr-cni-4pf76                     1/1   Running            1    48m
kuryr-cni-lpmz6                     1/1   Running            1    48m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            5    62m
kuryr-controller-745bc55f58-mpqkn   1/1   Running            5    64m
kuryr-cni-lpmz6                     0/1   Running            2    53m
kuryr-cni-4pf76                     0/1   Running            2    53m
kuryr-cni-bkqt8                     0/1   Running            2    52m
kuryr-cni-lpmz6                     1/1   Running            2    53m
kuryr-cni-4pf76                     1/1   Running            2    53m
kuryr-cni-bkqt8                     1/1   Running            2    53m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            6    67m
kuryr-controller-745bc55f58-mpqkn   1/1   Running            6    68m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            7    79m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            8    81m
kuryr-cni-bkqt8                     0/1   Running            3    70m
kuryr-cni-lpmz6                     0/1   Running            3    70m
kuryr-cni-4pf76                     0/1   Running            3    70m
kuryr-cni-bkqt8                     1/1   Running            3    70m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            9    83m
kuryr-cni-4pf76                     1/1   Running            3    71m
kuryr-cni-lpmz6                     1/1   Running            3    71m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            10   85m
kuryr-cni-lpmz6                     0/1   Running            4    74m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            11   87m
kuryr-cni-lpmz6                     1/1   Running            4    74m
kuryr-cni-bkqt8                     0/1   Running            4    74m
kuryr-cni-4pf76                     0/1   Running            4    75m
kuryr-cni-bkqt8                     1/1   Running            4    74m
kuryr-cni-4pf76                     1/1   Running            4    75m
kuryr-controller-745bc55f58-mpqkn   0/1   CrashLoopBackOff   11   89m
kuryr-cni-lpmz6                     0/1   Running            5    78m
kuryr-cni-lpmz6                     1/1   Running            5    78m
kuryr-cni-bkqt8                     0/1   Running            5    78m
kuryr-cni-4pf76                     0/1   Running            5    78m
kuryr-cni-bkqt8                     1/1   Running            5    78m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            12   91m
kuryr-cni-4pf76                     1/1   Running            5    79m
kuryr-controller-745bc55f58-mpqkn   0/1   CrashLoopBackOff   12   93m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            13   98m
kuryr-cni-bkqt8                     0/1   Running            6    86m
kuryr-cni-bkqt8                     1/1   Running            6    86m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            14   100m
kuryr-controller-745bc55f58-mpqkn   0/1   CrashLoopBackOff   14   102m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            15   107m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            16   109m
kuryr-controller-745bc55f58-mpqkn   0/1   CrashLoopBackOff   16   111m
kuryr-cni-lpmz6                     0/1   Running            6    102m
kuryr-cni-lpmz6                     1/1   Running            6    102m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            17   116m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            18   117m
kuryr-controller-745bc55f58-mpqkn   0/1   CrashLoopBackOff   18   119m
kuryr-cni-4pf76                     0/1   Running            6    107m
kuryr-cni-4pf76                     1/1   Running            6    108m
kuryr-cni-4pf76                     0/1   Running            7    111m
kuryr-cni-4pf76                     1/1   Running            7    112m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            19   124m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            20   126m
kuryr-controller-745bc55f58-mpqkn   0/1   CrashLoopBackOff   20   128m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            21   133m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            22   135m
kuryr-cni-lpmz6                     0/1   Running            7    124m
kuryr-controller-745bc55f58-mpqkn   0/1   CrashLoopBackOff   22   137m
kuryr-cni-lpmz6                     1/1   Running            7    124m
kuryr-cni-bkqt8                     0/1   Running            7    129m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            23   142m
kuryr-cni-bkqt8                     1/1   Running            7    129m
kuryr-controller-745bc55f58-mpqkn   0/1   Running            24   144m
kuryr-controller-745bc55f58-mpqkn   0/1   CrashLoopBackOff   24   145m
Created attachment 1622380 [details] openshift-kuryr pod logs Pod logs for openshift-kuryr namespace after kuryr-controller started crashlooping and kuryr-cni pods were restarting regularly. This was during execution of the kubernetes conformance testsuite.
For comparison to comment 13, I ran the k8s conformance on an OSP 13 cluster configured with OpenShift SDN (4.2.0-0.nightly-2019-10-02-122541). 204/204 tests passed in 21m25s.
Testing with:

# oc describe deployment -n openshift-kuryr kuryr-controller | grep Image
    Image:      docker.io/maysamacedo/kuryr-controller:latest

This run was better but not great. kuryr-controller restarted 17 times and was in CrashLoopBackOff a few times as well.

kuryr-controller-588b9c6bcd-lqgbv   0/1   CrashLoopBackOff   16   3h26m
kuryr-controller-588b9c6bcd-lqgbv   0/1   Running            17   3h31m

The actual k8s conformance test suite did a bit better. The tests ran in 78 minutes with 68 failed and 136 passed. Compare to comment 16 for OpenShift SDN and comment 13 for the previous Kuryr. The failures were not networking-specific tests - just general tests failing due to pod creation failures. I am running again to see if the results are consistent or if we degrade. Will attach detailed test results this time as well.
Seeing the following events firing while the tests are running:

Oct 04 19:22:44.243 W ns/openshift-kuryr pod/kuryr-cni-sn4gk Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (230 times)
Oct 04 19:22:47.089 W ns/openshift-dns pod/dns-default-s6h5t Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.34 10.196.0.11 10.196.0.12 (225 times)
Oct 04 19:22:48.027 W ns/openshift-monitoring pod/node-exporter-5d7km Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (218 times)
Oct 04 19:22:48.335 W ns/openshift-kuryr pod/kuryr-cni-nft6c Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.42 10.196.0.11 10.196.0.12 (289 times)
Oct 04 19:22:49.008 W ns/openshift-kuryr pod/kuryr-controller-588b9c6bcd-lqgbv Readiness probe failed: HTTP probe failed with statuscode: 500 (222 times)
Oct 04 19:22:49.028 W ns/openshift-kube-apiserver pod/kube-apiserver-ostest-fgfxk-master-0 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (212 times)
Oct 04 19:22:50.339 W ns/openshift-monitoring pod/node-exporter-ffqwx Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.42 10.196.0.11 10.196.0.12 (217 times)
Oct 04 19:22:53.030 W ns/openshift-multus pod/multus-m5nnt Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.14 10.196.0.11 10.196.0.12 (224 times)
Oct 04 19:23:02.031 W ns/openshift-ingress pod/router-default-5f64bb4978-8dnbw Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.14 10.196.0.11 10.196.0.12 (217 times)
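To see at a glance which pods are worst affected, a small throwaway parser (hypothetical, written against the event text format pasted above) can pull the per-pod "(N times)" repeat counts out of the nameserver-limit warnings:

```python
# Extract the "(N times)" repeat counts from nameserver-limit warning
# events, keyed by pod. Hypothetical one-off; the line format is taken
# from the event paste above, not from any formal API.
import re

EVENT_RE = re.compile(r"pod/(\S+).*\((\d+) times\)")

def nameserver_warning_counts(event_lines):
    """Return {pod_name: repeat_count} for 'Nameserver limits' events only."""
    counts = {}
    for line in event_lines:
        if "Nameserver limits were exceeded" not in line:
            continue  # skip other events, e.g. the readiness probe failures
        m = EVENT_RE.search(line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

sample = [
    "Oct 04 19:22:44.243 W ns/openshift-kuryr pod/kuryr-cni-sn4gk "
    "Nameserver limits were exceeded, some nameservers have been omitted, "
    "the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (230 times)",
]
print(nameserver_warning_counts(sample))  # -> {'kuryr-cni-sn4gk': 230}
```

Running it over the full event list above would show the kuryr-cni pods among the highest repeat counts (230 and 289).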
During the second run mentioned in comment 18, cluster and kuryr seem to be in a degraded state. error: 172 fail, 32 pass, 0 skip (1h27m33s). Same access info as before, let me know if there is anything I should gather.
The cluster was reinstalled today (7 October) and I executed the kubernetes conformance tests again. The results seem to be the same: kuryr-controller and kuryr-cni pods are restarting frequently and the conformance tests have about a 50% pass rate.

# oc get clusterversion
NAME      VERSION                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   0.0.1-2019-10-07-143408   True        False         79m     Cluster version is 0.0.1-2019-10-07-143408

# oc describe deployment kuryr-controller -n openshift-kuryr | grep Image
    Image:      docker.io/luis5tb/kuryr:latest

kuryr-controller-6b4459584f-hb8qj   0/1   Running            7    53m
kuryr-controller-6b4459584f-hb8qj   0/1   CrashLoopBackOff   7    55m
kuryr-cni-wh76m                     0/1   Running            3    34m
kuryr-cni-wh76m                     1/1   Running            3    34m
kuryr-cni-dvc7z                     0/1   Running            3    37m
kuryr-cni-cvhdb                     0/1   Running            3    37m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            8    58m
kuryr-cni-cvhdb                     1/1   Running            3    37m
kuryr-cni-dvc7z                     1/1   Running            3    38m
kuryr-cni-wh76m                     0/1   Running            4    37m
kuryr-controller-6b4459584f-hb8qj   1/1   Running            8    60m
kuryr-controller-6b4459584f-hb8qj   0/1   CrashLoopBackOff   8    60m
kuryr-cni-wh76m                     1/1   Running            4    38m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            9    65m
kuryr-cni-wh76m                     0/1   Running            5    44m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            10   67m
kuryr-cni-wh76m                     1/1   Running            5    45m
kuryr-controller-6b4459584f-hb8qj   0/1   CrashLoopBackOff   10   69m
kuryr-cni-wh76m                     0/1   Running            6    49m
kuryr-cni-cvhdb                     0/1   Running            4    50m
kuryr-cni-dvc7z                     0/1   Running            4    51m
kuryr-cni-wh76m                     1/1   Running            6    49m
kuryr-cni-cvhdb                     1/1   Running            4    51m
kuryr-cni-dvc7z                     1/1   Running            4    51m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            11   74m
kuryr-cni-wh76m                     0/1   Running            7    53m
kuryr-cni-dvc7z                     0/1   Running            5    54m
kuryr-cni-cvhdb                     0/1   Running            5    54m
kuryr-cni-wh76m                     1/1   Running            7    53m
kuryr-cni-cvhdb                     1/1   Running            5    54m
kuryr-cni-dvc7z                     1/1   Running            5    55m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            12   76m
kuryr-controller-6b4459584f-hb8qj   0/1   CrashLoopBackOff   12   78m
kuryr-cni-wh76m                     0/1   CrashLoopBackOff   7    56m
kuryr-cni-cvhdb                     0/1   Running            6    58m
kuryr-cni-cvhdb                     1/1   Running            6    58m
kuryr-cni-dvc7z                     0/1   Running            6    59m
kuryr-cni-dvc7z                     1/1   Running            6    60m
kuryr-cni-cvhdb                     0/1   Running            7    61m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            13   83m
kuryr-cni-cvhdb                     1/1   Running            7    62m
kuryr-cni-dvc7z                     0/1   Running            7    63m
kuryr-cni-wh76m                     0/1   Running            8    61m
kuryr-cni-dvc7z                     1/1   Running            7    63m
kuryr-cni-wh76m                     1/1   Running            8    62m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            14   84m
kuryr-cni-cvhdb                     0/1   Running            8    65m
kuryr-controller-6b4459584f-hb8qj   0/1   CrashLoopBackOff   14   86m
kuryr-cni-cvhdb                     1/1   Running            8    65m
kuryr-cni-dvc7z                     0/1   Running            8    67m
kuryr-cni-dvc7z                     1/1   Running            8    67m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            15   91m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            16   93m
kuryr-controller-6b4459584f-hb8qj   0/1   CrashLoopBackOff   16   95m
kuryr-cni-wh76m                     0/1   Running            9    75m
kuryr-cni-wh76m                     1/1   Running            9    75m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            17   100m
kuryr-controller-6b4459584f-hb8qj   0/1   Running            18   102m
kuryr-controller-6b4459584f-hb8qj   0/1   CrashLoopBackOff   18   104m
*** This bug has been marked as a duplicate of bug 1759097 ***