Bug 1757916
| Summary: | kuryr-controller stuck in CrashLoopBack and no pod creation possible after running OCP functional automation. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Bennett <bbennett> |
| Component: | Networking | Assignee: | Luis Tomas Bolivar <ltomasbo> |
| Networking sub component: | kuryr | QA Contact: | GenadiC <gcheresh> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | bbennett, gcheresh, juriarte, ltomasbo, mdemaced, mdulko, mifiedle, racedoro |
| Version: | 4.2.0 | Keywords: | TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.2.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1757876 | Environment: | |
| Last Closed: | 2019-10-17 15:41:19 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1757876, 1759095 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Comment 4
Ben Bennett
2019-10-02 19:07:04 UTC
With the increased quota, things seem better - I can at least create new projects and pods. Next up is k8s conformance. However, the kuryr-controller is still periodically flapping and restarting even when the cluster is sitting idle. Is there any way to make this env completely clean from a Kuryr perspective? I've tried deleting all non-openshift projects and the issue still occurs.

Here's oc get pods -o wide -w -n openshift-kuryr for about 15 minutes:

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 496 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 497 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 497 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 CrashLoopBackOff 497 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 498 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 498 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 CrashLoopBackOff 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 500 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>

I'm also seeing occasional restarts of the kuryr-cni pods on worker nodes:

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 CrashLoopBackOff 499 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 500 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 1/1 Running 500 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>
kuryr-cni-4b7hq 0/1 Running 63 2d21h 10.196.0.21 ostest-vqmw9-worker-74hgf <none> <none>
kuryr-cni-4b7hq 1/1 Running 63 2d21h 10.196.0.21 ostest-vqmw9-worker-74hgf <none> <none>
kuryr-controller-69cb8bd84d-kwc4t 0/1 Running 501 2d21h 10.196.0.40 ostest-vqmw9-master-1 <none> <none>

Created attachment 1622294 [details]
openshift-kuryr pod logs with pods flapping during idle cluster
openshift-kuryr pod logs. kuryr-controller and kuryr-cni on workers are flapping while the cluster is idle
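To quantify the flapping without watching the namespace by hand, something like the following should work (just standard oc output options; the controller pod name is the one from the listing above and will change if the pod is recreated):

# Restart count per container for every pod in the namespace
oc get pods -n openshift-kuryr -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount

# Reason and exit code of the controller container's last termination (replace the pod name as needed)
oc get pod kuryr-controller-69cb8bd84d-kwc4t -n openshift-kuryr -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'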
I started kubernetes conformance (openshift-tests run kubernetes/conformance) and things immediately hung with the same issue of not being able to create new pods and kuryr-controller crash looping. Do we need a fresh install with the increased quotas established from the inception of the cluster?

Mike, thanks for the update on this bugzilla. I would suggest going for a fresh installation before running the next tests.

OCP reinstalled on its latest nightly version: 4.2.0-0.nightly-2019-10-02-150642. I deleted all the OpenStack leftovers from the previous cluster and tests. Let's see how it goes now in a fresh cluster with increased OpenStack quotas.

Executed the kubernetes conformance tests on the new cluster from comment 12 with the increased quotas. OCP QE will focus on this use case (k8s conformance) for this bz unless there are objections; it is currently used by QE to vet clusters on all cloud providers. It usually runs in around 20 minutes with all 204/204 tests passing unless there are bugs present. During this run, 80 of 204 tests failed and the run took 2 hours and 25 minutes. Over the course of the run, kuryr-controller crash looped and restarted several times, and kuryr-cni on the worker nodes restarted several times. I'll attach pod logs - let me know what other info is required. The access to the cluster and the kubeconfig are as detailed in the description (search on titan24).

Reproducer (a consolidated command sketch follows below, after the attachment note):
- extract openshift-tests from the payload (I will include a link in a private comment to follow)
- KUBECONFIG=/path/to/kubeconfig ./openshift-tests run kubernetes/conformance
- oc get pods -n openshift-kuryr -w
- wait a while; I started seeing restarts after ~80 tests were complete

kuryr-cni-4pf76 1/1 Running 1 48m
kuryr-cni-lpmz6 1/1 Running 1 48m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 5 62m
kuryr-controller-745bc55f58-mpqkn 1/1 Running 5 64m
kuryr-cni-lpmz6 0/1 Running 2 53m
kuryr-cni-4pf76 0/1 Running 2 53m
kuryr-cni-bkqt8 0/1 Running 2 52m
kuryr-cni-lpmz6 1/1 Running 2 53m
kuryr-cni-4pf76 1/1 Running 2 53m
kuryr-cni-bkqt8 1/1 Running 2 53m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 6 67m
kuryr-controller-745bc55f58-mpqkn 1/1 Running 6 68m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 7 79m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 8 81m
kuryr-cni-bkqt8 0/1 Running 3 70m
kuryr-cni-lpmz6 0/1 Running 3 70m
kuryr-cni-4pf76 0/1 Running 3 70m
kuryr-cni-bkqt8 1/1 Running 3 70m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 9 83m
kuryr-cni-4pf76 1/1 Running 3 71m
kuryr-cni-lpmz6 1/1 Running 3 71m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 10 85m
kuryr-cni-lpmz6 0/1 Running 4 74m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 11 87m
kuryr-cni-lpmz6 1/1 Running 4 74m
kuryr-cni-bkqt8 0/1 Running 4 74m
kuryr-cni-4pf76 0/1 Running 4 75m
kuryr-cni-bkqt8 1/1 Running 4 74m
kuryr-cni-4pf76 1/1 Running 4 75m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 11 89m
kuryr-cni-lpmz6 0/1 Running 5 78m
kuryr-cni-lpmz6 1/1 Running 5 78m
kuryr-cni-bkqt8 0/1 Running 5 78m
kuryr-cni-4pf76 0/1 Running 5 78m
kuryr-cni-bkqt8 1/1 Running 5 78m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 12 91m
kuryr-cni-4pf76 1/1 Running 5 79m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 12 93m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 13 98m
kuryr-cni-bkqt8 0/1 Running 6 86m
kuryr-cni-bkqt8 1/1 Running 6 86m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 14 100m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 14 102m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 15 107m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 16 109m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 16 111m
kuryr-cni-lpmz6 0/1 Running 6 102m
kuryr-cni-lpmz6 1/1 Running 6 102m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 17 116m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 18 117m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 18 119m
kuryr-cni-4pf76 0/1 Running 6 107m
kuryr-cni-4pf76 1/1 Running 6 108m
kuryr-cni-4pf76 0/1 Running 7 111m
kuryr-cni-4pf76 1/1 Running 7 112m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 19 124m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 20 126m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 20 128m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 21 133m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 22 135m
kuryr-cni-lpmz6 0/1 Running 7 124m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 22 137m
kuryr-cni-lpmz6 1/1 Running 7 124m
kuryr-cni-bkqt8 0/1 Running 7 129m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 23 142m
kuryr-cni-bkqt8 1/1 Running 7 129m
kuryr-controller-745bc55f58-mpqkn 0/1 Running 24 144m
kuryr-controller-745bc55f58-mpqkn 0/1 CrashLoopBackOff 24 145m

Created attachment 1622380 [details]
openshift-kuryr pod logs
Pod logs for openshift-kuryr namespace after kuryr-controller started crashlooping and kuryr-cni pods were restarting regularly. This was during execution of the kubernetes conformance testsuite.
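Putting the reproducer above in one place (a sketch; the exact release payload pullspec is in the private comment, and oc adm release extract is just one way to obtain the openshift-tests binary):

# Pull the openshift-tests binary out of the release payload (pullspec deliberately left as a placeholder)
oc adm release extract --command=openshift-tests --to=. <release-payload-pullspec>

# Run the kubernetes conformance suite against the cluster
KUBECONFIG=/path/to/kubeconfig ./openshift-tests run kubernetes/conformance

# In a second terminal, watch the kuryr pods; restarts started after roughly 80 tests
KUBECONFIG=/path/to/kubeconfig oc get pods -n openshift-kuryr -w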
For comparison to comment 13, I ran the k8s conformance on an OSP 13 cluster configured with OpenShift SDN (4.2.0-0.nightly-2019-10-02-122541). 204/204 tests passed in 21m25s.

Testing with:
# oc describe deployment -n openshift-kuryr kuryr-controller | grep Image
Image: docker.io/maysamacedo/kuryr-controller:latest
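(The same check without grep, plus the image ID the running pod actually pulled; a sketch using standard oc output options, with the pod name taken from this cluster's listing below:)

# Image configured in the deployment spec
oc get deployment kuryr-controller -n openshift-kuryr -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'

# Image ID the running controller pod actually pulled
oc get pod kuryr-controller-588b9c6bcd-lqgbv -n openshift-kuryr -o jsonpath='{.status.containerStatuses[*].imageID}{"\n"}'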
This run was better but not great. kuryr-controller restarted 17 times and was in CrashLoopBackOff a few times as well:
kuryr-controller-588b9c6bcd-lqgbv 0/1 CrashLoopBackOff 16 3h26m
kuryr-controller-588b9c6bcd-lqgbv 0/1 Running 17 3h31m
The actual k8s conformance test suite did a bit better. The tests ran in 78 minutes with 68 failed and 136 passed (compare to comment 16 for OpenShift SDN and comment 13 for the previous Kuryr). The failures were not networking-specific tests - just general tests failing due to pod creation failures.
I am running again to see if the results are consistent or if we degrade. Will attach detailed test results this time as well.
Seeing the following events firing while the tests are running (see the resolv.conf check sketched below):

Oct 04 19:22:44.243 W ns/openshift-kuryr pod/kuryr-cni-sn4gk Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (230 times)
Oct 04 19:22:47.089 W ns/openshift-dns pod/dns-default-s6h5t Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.34 10.196.0.11 10.196.0.12 (225 times)
Oct 04 19:22:48.027 W ns/openshift-monitoring pod/node-exporter-5d7km Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (218 times)
Oct 04 19:22:48.335 W ns/openshift-kuryr pod/kuryr-cni-nft6c Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.42 10.196.0.11 10.196.0.12 (289 times)
Oct 04 19:22:49.008 W ns/openshift-kuryr pod/kuryr-controller-588b9c6bcd-lqgbv Readiness probe failed: HTTP probe failed with statuscode: 500 (222 times)
Oct 04 19:22:49.028 W ns/openshift-kube-apiserver pod/kube-apiserver-ostest-fgfxk-master-0 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.6 10.196.0.11 10.196.0.12 (212 times)
Oct 04 19:22:50.339 W ns/openshift-monitoring pod/node-exporter-ffqwx Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.42 10.196.0.11 10.196.0.12 (217 times)
Oct 04 19:22:53.030 W ns/openshift-multus pod/multus-m5nnt Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.14 10.196.0.11 10.196.0.12 (224 times)
Oct 04 19:23:02.031 W ns/openshift-ingress pod/router-default-5f64bb4978-8dnbw Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.196.0.14 10.196.0.11 10.196.0.12 (217 times)

During the second run mentioned in comment 18, the cluster and kuryr seem to be in a degraded state. error: 172 fail, 32 pass, 0 skip (1h27m33s). Same access info as before; let me know if there is anything I should gather.

The cluster was reinstalled today (7-October) and I executed the kubernetes conformance tests again. The results seem to be the same: kuryr-controller and kuryr-cni pods are restarting frequently and the conformance tests have about a 50% pass rate.
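On the "Nameserver limits were exceeded" warnings above: kubelet raises that event when a node's /etc/resolv.conf lists more than three nameservers (only the first three are applied), so it is worth confirming what the nodes actually have. A quick check, assuming nothing beyond standard oc; the node name is taken from the events above:

# More than three nameserver lines here triggers the warning
oc debug node/ostest-fgfxk-master-0 -- chroot /host cat /etc/resolv.conf

# Correlate the readiness probe failures with controller restarts
oc get events -n openshift-kuryr --sort-by=.lastTimestamp | tail -n 30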
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 0.0.1-2019-10-07-143408 True False 79m Cluster version is 0.0.1-2019-10-07-143408
# oc describe deployment kuryr-controller -n openshift-kuryr | grep Image
Image: docker.io/luis5tb/kuryr:latest
kuryr-controller-6b4459584f-hb8qj 0/1 Running 7 53m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 7 55m
kuryr-cni-wh76m 0/1 Running 3 34m
kuryr-cni-wh76m 1/1 Running 3 34m
kuryr-cni-dvc7z 0/1 Running 3 37m
kuryr-cni-cvhdb 0/1 Running 3 37m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 8 58m
kuryr-cni-cvhdb 1/1 Running 3 37m
kuryr-cni-dvc7z 1/1 Running 3 38m
kuryr-cni-wh76m 0/1 Running 4 37m
kuryr-controller-6b4459584f-hb8qj 1/1 Running 8 60m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 8 60m
kuryr-cni-wh76m 1/1 Running 4 38m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 9 65m
kuryr-cni-wh76m 0/1 Running 5 44m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 10 67m
kuryr-cni-wh76m 1/1 Running 5 45m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 10 69m
kuryr-cni-wh76m 0/1 Running 6 49m
kuryr-cni-cvhdb 0/1 Running 4 50m
kuryr-cni-dvc7z 0/1 Running 4 51m
kuryr-cni-wh76m 1/1 Running 6 49m
kuryr-cni-cvhdb 1/1 Running 4 51m
kuryr-cni-dvc7z 1/1 Running 4 51m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 11 74m
kuryr-cni-wh76m 0/1 Running 7 53m
kuryr-cni-dvc7z 0/1 Running 5 54m
kuryr-cni-cvhdb 0/1 Running 5 54m
kuryr-cni-wh76m 1/1 Running 7 53m
kuryr-cni-cvhdb 1/1 Running 5 54m
kuryr-cni-dvc7z 1/1 Running 5 55m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 12 76m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 12 78m
kuryr-cni-wh76m 0/1 CrashLoopBackOff 7 56m
kuryr-cni-cvhdb 0/1 Running 6 58m
kuryr-cni-cvhdb 1/1 Running 6 58m
kuryr-cni-dvc7z 0/1 Running 6 59m
kuryr-cni-dvc7z 1/1 Running 6 60m
kuryr-cni-cvhdb 0/1 Running 7 61m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 13 83m
kuryr-cni-cvhdb 1/1 Running 7 62m
kuryr-cni-dvc7z 0/1 Running 7 63m
kuryr-cni-wh76m 0/1 Running 8 61m
kuryr-cni-dvc7z 1/1 Running 7 63m
kuryr-cni-wh76m 1/1 Running 8 62m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 14 84m
kuryr-cni-cvhdb 0/1 Running 8 65m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 14 86m
kuryr-cni-cvhdb 1/1 Running 8 65m
kuryr-cni-dvc7z 0/1 Running 8 67m
kuryr-cni-dvc7z 1/1 Running 8 67m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 15 91m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 16 93m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 16 95m
kuryr-cni-wh76m 0/1 Running 9 75m
kuryr-cni-wh76m 1/1 Running 9 75m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 17 100m
kuryr-controller-6b4459584f-hb8qj 0/1 Running 18 102m
kuryr-controller-6b4459584f-hb8qj 0/1 CrashLoopBackOff 18 104m
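If more detail is useful, something like the following should gather the relevant logs (standard oc commands; the pod names are the ones from the listing above and will differ after restarts):

# Traceback from the previous (crashed) controller container
oc logs -n openshift-kuryr kuryr-controller-6b4459584f-hb8qj --previous > kuryr-controller-previous.log

# Logs from one of the flapping kuryr-cni daemonset pods
oc logs -n openshift-kuryr kuryr-cni-wh76m --previous > kuryr-cni-wh76m-previous.log

# Full diagnostic dump if the networking team wants everything
oc adm must-gather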
*** This bug has been marked as a duplicate of bug 1759097 ***