Created attachment 1761526 [details]
kuryr controller logs

Description of problem:

After a 4.8.0-0.nightly-2021-03-05-015511 cluster has been running for 3 days, kuryr-controller is restarting because there are pools with no members:

$ openstack loadbalancer pool show a8bf9c07-4cf5-4932-9182-829c16537ed9
+----------------------+-------------------------------------------------------------+
| Field                | Value                                                       |
+----------------------+-------------------------------------------------------------+
| admin_state_up       | True                                                        |
| created_at           | 2021-03-05T11:16:10                                         |
| description          |                                                             |
| healthmonitor_id     |                                                             |
| id                   | a8bf9c07-4cf5-4932-9182-829c16537ed9                        |
| lb_algorithm         | ROUND_ROBIN                                                 |
| listeners            | 80708599-b2f3-41a6-8e8b-f9538bdb0a7f                        |
| loadbalancers        | b1a3ee03-d248-4f21-b870-ac9c64546335                        |
| members              |                                                             |
| name                 | openshift-marketplace/marketplace-operator-metrics:TCP:8383 |
| operating_status     | ONLINE                                                      |
| project_id           | cb736c7b6ada44218c8ee2d9e417368f                            |
| protocol             | TCP                                                         |
| provisioning_status  | ACTIVE                                                      |
| session_persistence  | None                                                        |
| updated_at           | 2021-03-05T11:16:17                                         |
| tls_container_ref    | None                                                        |
| ca_tls_container_ref | None                                                        |
| crl_container_ref    | None                                                        |
| tls_enabled          | False                                                       |
+----------------------+-------------------------------------------------------------+

The issue was resolved by removing two services (and letting kuryr-controller recreate them):
- openshift-console-operator/metrics
- openshift-marketplace/marketplace-operator-metrics

Version-Release number of selected component (if applicable):
OCP 4.8.0-0.nightly-2021-03-05-015511 on OSP 13 (2021-01-20.1), Amphora provider.

How reproducible:
Unknown

Steps to Reproduce:
Install a cluster and let it run for 2-3 days.

Actual results:
kuryr-controller keeps restarting:

(shiftstack) [stack@undercloud-0 network]$ oc get pods -n openshift-kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-4d76c                     1/1     Running   1          2d21h
kuryr-cni-4j58w                     1/1     Running   1          2d21h
kuryr-cni-7f6dt                     1/1     Running   0          2d21h
kuryr-cni-qg9wm                     1/1     Running   0          2d21h
kuryr-cni-qwfqw                     1/1     Running   0          2d21h
kuryr-cni-r5ddr                     1/1     Running   1          2d21h
kuryr-controller-78b7bdfdb4-tt95k   1/1     Running   914        2d18h

Expected results:
kuryr-controller stable.

Additional info:
kuryr-controller logs + must-gather.
must-gather: http://file.rdu.redhat.com/rlobillo/must_gather_BZ1936342.tgz
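For anyone hitting this before a fixed build: the affected pools can be spotted by listing every Octavia pool and flagging the ones whose member list is empty. This is a minimal sketch, not part of the original report; it assumes an admin OpenStack RC is sourced, and note that a pool can legitimately be empty while its Service has no ready endpoints, so cross-check before deleting anything:

$ for pool in $(openstack loadbalancer pool list -c id -f value); do
    # An empty member list marks the pool as potentially affected;
    # print its name (Kuryr names pools <namespace>/<service>:<protocol>:<port>).
    if [ -z "$(openstack loadbalancer member list "$pool" -c id -f value)" ]; then
      openstack loadbalancer pool show "$pool" -c name -f value
    fi
  done

Each printed name maps back to a Service; deleting that Service (for instance, oc delete svc marketplace-operator-metrics -n openshift-marketplace) and letting kuryr-controller recreate it is the workaround that was applied above.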
Assignee: mdulko → sscavnic
Verified on OCP 4.8.0-0.nightly-2021-04-17-044339 over OSP (RHOS-16.1-RHEL-8-20210323.n.0) with the Octavia OVN provider. The cluster has been running for 47h without hitting the issue:

$ oc get pods -n openshift-kuryr
NAME                              READY   STATUS    RESTARTS   AGE
kuryr-cni-8gsqw                   1/1     Running   0          46h
kuryr-cni-8pd7z                   1/1     Running   0          46h
kuryr-cni-8s52n                   1/1     Running   0          46h
kuryr-cni-9kf7z                   1/1     Running   0          46h
kuryr-cni-phtt7                   1/1     Running   0          46h
kuryr-cni-zdtms                   1/1     Running   0          46h
kuryr-controller-949d6fd9-4qfrp   1/1     Running   0          45h

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-17-044339   True        False         47h     Cluster version is 4.8.0-0.nightly-2021-04-17-044339

Furthermore, manually removing the members of the pool on OSP and editing the .status section of the related klb leads to a full regeneration of the members without kuryr-controller restarts:

$ openstack loadbalancer member list demo/demo:TCP:80
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| id                                   | name                            | project_id                       | provisioning_status | address       | protocol_port | operating_status | weight |
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| 47445b1d-b1bb-4d5a-bc9f-82dd6a2adda3 | demo/demo-7897db69cc-dbt8v:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.30 | 8080          | NO_MONITOR       | 1      |
| 2e0a7eeb-ea49-429b-9af7-1d73ca5ef81f | demo/demo-7897db69cc-lk6jw:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.53 | 8080          | NO_MONITOR       | 1      |
| ae1d1bc2-fa7e-4ce6-b637-fb2c36e280bc | demo/demo-7897db69cc-k7lts:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.70 | 8080          | NO_MONITOR       | 1      |
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+

$ for i in `openstack loadbalancer member list demo/demo:TCP:80 -c name -f value`; do openstack loadbalancer member delete demo/demo:TCP:80 $i; done

$ oc edit klb/demo -n demo
(Remove the status section and wait for recreation.)

$ openstack loadbalancer member list demo/demo:TCP:80
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| id                                   | name                            | project_id                       | provisioning_status | address       | protocol_port | operating_status | weight |
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| cf635d96-141d-4975-9373-de4f49108fcf | demo/demo-7897db69cc-dbt8v:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.30 | 8080          | NO_MONITOR       | 1      |
| 5d102934-bc4b-4b5d-ab81-20febbe98a2b | demo/demo-7897db69cc-lk6jw:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.53 | 8080          | NO_MONITOR       | 1      |
| 2084cd10-3b5c-4df7-8c46-353248763d91 | demo/demo-7897db69cc-k7lts:8080 | 08749ff8b0c34d349d059668b1b392a2 | ACTIVE              | 10.128.124.70 | 8080          | NO_MONITOR       | 1      |
+--------------------------------------+---------------------------------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
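A note on the verification step above: instead of interactively dropping the status section with oc edit, the same can be done non-interactively. This is a sketch, not part of the original verification, and it assumes the KuryrLoadBalancer CRD keeps status in the main object (no status subresource), so a plain JSON patch can remove it:

$ # Hedged alternative to "oc edit klb/demo -n demo" + deleting .status by hand:
$ oc patch klb demo -n demo --type=json -p '[{"op": "remove", "path": "/status"}]'

kuryr-controller should then rebuild the status and recreate the members, matching the second member list above.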
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438