Bug 1936342 - kuryr-controller restarting after 3 days cluster running - pools without members
Summary: kuryr-controller restarting after 3 days cluster running - pools without members
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Ben Bennett
QA Contact: rlobillo
URL:
Whiteboard:
Depends On:
Blocks: 1949551
 
Reported: 2021-03-08 09:17 UTC by rlobillo
Modified: 2021-07-27 22:52 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:51:42 UTC
Target Upstream Version:
Embargoed:


Attachments
kuryr controller logs (38.45 KB, text/plain)
2021-03-08 09:17 UTC, rlobillo


Links
- Github openshift/kuryr-kubernetes pull 492 (open): "Bug 1936342: kuryr-controller restarting after 3 days cluster running - pools without members" (last updated 2021-03-30 15:32:00 UTC)
- Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:52:19 UTC)

Description rlobillo 2021-03-08 09:17:18 UTC
Created attachment 1761526 [details]
kuryr controller logs

Description of problem: 

After a 4.8.0-0.nightly-2021-03-05-015511 cluster has been running for 3 days, kuryr-controller keeps restarting because some load balancer pools have no members:

$ openstack loadbalancer pool show a8bf9c07-4cf5-4932-9182-829c16537ed9
+----------------------+-------------------------------------------------------------+
| Field                | Value                                                       |
+----------------------+-------------------------------------------------------------+
| admin_state_up       | True                                                        |
| created_at           | 2021-03-05T11:16:10                                         |
| description          |                                                             |
| healthmonitor_id     |                                                             |
| id                   | a8bf9c07-4cf5-4932-9182-829c16537ed9                        |
| lb_algorithm         | ROUND_ROBIN                                                 |
| listeners            | 80708599-b2f3-41a6-8e8b-f9538bdb0a7f                        |
| loadbalancers        | b1a3ee03-d248-4f21-b870-ac9c64546335                        |
| members              |                                                             |
| name                 | openshift-marketplace/marketplace-operator-metrics:TCP:8383 |
| operating_status     | ONLINE                                                      |
| project_id           | cb736c7b6ada44218c8ee2d9e417368f                            |
| protocol             | TCP                                                         |
| provisioning_status  | ACTIVE                                                      |
| session_persistence  | None                                                        |
| updated_at           | 2021-03-05T11:16:17                                         |
| tls_container_ref    | None                                                        |
| ca_tls_container_ref | None                                                        |
| crl_container_ref    | None                                                        |
| tls_enabled          | False                                                       |
+----------------------+-------------------------------------------------------------+

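A quick, illustrative way to detect pools in this state (a sketch, assuming the openstack CLI is configured against the cluster's project; it iterates over all pools and prints those whose member list is empty):

$ for p in $(openstack loadbalancer pool list -c id -f value); do [ -z "$(openstack loadbalancer member list $p -f value)" ] && echo "pool without members: $p"; done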
The issue was resolved by deleting the following two services and letting kuryr-controller recreate them (a sketch of the commands follows the list):

- openshift-console-operator/metrics 
- openshift-marketplace/marketplace-operator-metrics
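
A minimal sketch of the workaround (illustrative commands, not copied from the report; it assumes the owning operators recreate the Service objects after deletion, after which kuryr-controller rebuilds the corresponding Octavia pools and members):

$ oc delete svc metrics -n openshift-console-operator
$ oc delete svc marketplace-operator-metrics -n openshift-marketplace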


Version-Release number of selected component (if applicable): OCP 4.8.0-0.nightly-2021-03-05-015511 on OSP13 (2021-01-20.1) with the Amphora provider.


How reproducible: Unknown

Steps to Reproduce:
Install a cluster and let it run for 2-3 days.
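
One illustrative way to watch for the failure while the cluster soaks (checks the controller pod's restart counter every 5 minutes):

$ watch -n 300 'oc get pods -n openshift-kuryr | grep kuryr-controller'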

Actual results: kuryr-controller restarting.

(shiftstack) [stack@undercloud-0 network]$ oc get pods -n openshift-kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-4d76c                     1/1     Running   1          2d21h
kuryr-cni-4j58w                     1/1     Running   1          2d21h
kuryr-cni-7f6dt                     1/1     Running   0          2d21h
kuryr-cni-qg9wm                     1/1     Running   0          2d21h
kuryr-cni-qwfqw                     1/1     Running   0          2d21h
kuryr-cni-r5ddr                     1/1     Running   1          2d21h
kuryr-controller-78b7bdfdb4-tt95k   1/1     Running   914        2d18h

Expected results: kuryr-controller remains stable (no restarts).


Additional info: kuryr-controller logs + must-gather.
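
These artifacts can be regenerated with standard tooling (illustrative commands; the controller pod name varies per cluster):

$ oc logs -n openshift-kuryr kuryr-controller-78b7bdfdb4-tt95k --previous > kuryr-controller.log
$ oc adm must-gather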

Comment 1 rlobillo 2021-03-08 09:19:17 UTC
must-gather: http://file.rdu.redhat.com/rlobillo/must_gather_BZ1936342.tgz

Comment 2 sscavnic 2021-03-12 16:58:41 UTC
Assignee: mdulko → sscavnic

Comment 5 rlobillo 2021-04-19 09:37:25 UTC
Verified on OCP 4.8.0-0.nightly-2021-04-17-044339 over OSP (RHOS-16.1-RHEL-8-20210323.n.0) with the OVN Octavia provider.

Cluster has been running for 47h without hitting the issue.

$ oc get pods -n openshift-kuryr
NAME                              READY   STATUS    RESTARTS   AGE
kuryr-cni-8gsqw                   1/1     Running   0          46h
kuryr-cni-8pd7z                   1/1     Running   0          46h
kuryr-cni-8s52n                   1/1     Running   0          46h
kuryr-cni-9kf7z                   1/1     Running   0          46h
kuryr-cni-phtt7                   1/1     Running   0          46h
kuryr-cni-zdtms                   1/1     Running   0          46h
kuryr-controller-949d6fd9-4qfrp   1/1     Running   0          45h
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-17-044339   True        False         47h     Cluster version is 4.8.0-0.nightly-2021-04-17-044339

Furthermore, manually removing the members of the pool on OSP and then editing the .status section of the related klb (KuryrLoadBalancer) leads to a full regeneration of the members without any kuryr-controller restarts:


$ openstack loadbalancer member list demo/demo:TCP:80
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| id                                   | name                            | project_id                       | provisioning_status | address       | protocol_port | operating_status | weight |
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| 47445b1d-b1bb-4d5a-bc9f-82dd6a2adda3 | demo/demo-7897db69cc-dbt8v:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.30 |          8080 | NO_MONITOR       |      1 |
| 2e0a7eeb-ea49-429b-9af7-1d73ca5ef81f | demo/demo-7897db69cc-lk6jw:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.53 |          8080 | NO_MONITOR       |      1 |
| ae1d1bc2-fa7e-4ce6-b637-fb2c36e280bc | demo/demo-7897db69cc-k7lts:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.70 |          8080 | NO_MONITOR       |      1 |
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+

$ for i in `openstack loadbalancer member list demo/demo:TCP:80 -c name -f value`; do openstack loadbalancer member delete demo/demo:TCP:80 $i; done


$ oc edit klb/demo -n demo
(Remove the .status section and wait for the members to be recreated.)
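
An equivalent non-interactive way to clear the status (a sketch; it assumes the KuryrLoadBalancer status is not served as a separate subresource, so a plain JSON patch can remove it just like the interactive edit above):

$ oc patch klb/demo -n demo --type=json -p '[{"op": "remove", "path": "/status"}]'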

$ openstack loadbalancer member list demo/demo:TCP:80
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| id                                   | name                            | project_id                       | provisioning_status | address       | protocol_port | operating_status | weight |
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| cf635d96-141d-4975-9373-de4f49108fcf | demo/demo-7897db69cc-dbt8v:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.30 |          8080 | NO_MONITOR       |      1 |
| 5d102934-bc4b-4b5d-ab81-20febbe98a2b | demo/demo-7897db69cc-lk6jw:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.53 |          8080 | NO_MONITOR       |      1 |
| 2084cd10-3b5c-4df7-8c46-353248763d91 | demo/demo-7897db69cc-k7lts:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.70 |          8080 | NO_MONITOR       |      1 |
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+

Comment 8 errata-xmlrpc 2021-07-27 22:51:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

