Bug 2004542

Summary:	[osp][octavia lb] cannot create LoadBalancer type svcs
Product:	OpenShift Container Platform	Reporter:	Jon Uriarte <juriarte>
Component:	Cloud Compute	Assignee:	Pierre Prinetti <pprinett>
Cloud Compute sub component:	OpenStack Provider	QA Contact:	Jon Uriarte <juriarte>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	andbartl, andcosta, aos-bugs, cshepher, emacchi, gferrazs, mabajodu, m.andre, mbooth, mdulko, mfedosin, mfojtik, nagrawal, pprinett
Version:	4.8	Keywords:	Triaged
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Known Issue
Doc Text:	Cause: There is a race condition between the creation of the OpenStack credentials secret and kube-controller-manager startup. Consequence: When the race occurs, the OpenStack cloud provider is not configured with OpenStack credentials, which breaks creation of Octavia load balancers for LoadBalancer services. Workaround (if any): Restart the kube-controller-manager pods. These are static pods, so deleting them through the OpenShift API does not restart them; you have to manipulate the manifests on the master nodes instead. Result: After kube-controller-manager is restarted, the problem should not recur on the cluster.	Story Points:	---
Clone Of:
Clones:	2039373		Environment:
Last Closed:	2022-03-10 16:10:42 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2039373

Description Jon Uriarte 2021-09-15 14:24:33 UTC

Description of problem:

LoadBalancer type services cannot be created due to what seems to be an error getting the openstack-credentials secret (the secret exists, though).

Version-Release number of selected component (if applicable):
OCP 4.9.0-0.nightly-2021-09-14-200602
OSP 16.1.6


How reproducible: always


Steps to Reproduce:
Install OCP 4.9 on OSP

oc new-project test1-ns
oc create deployment test1-dep --image=quay.io/kuryr/demo
oc scale deployments/test1-dep --replicas=2
oc expose deployment test1-dep --name test1-svc --type=LoadBalancer --port 80 --target-port=8080

Actual results:
The LB is not created in OpenStack

Expected results:
LB created in OpenStack

Additional info:

$ oc get cm cloud-provider-config -n openshift-config  -o yaml
...
    [LoadBalancer]
    use-octavia = True

$ oc -n test1-ns describe svc test1-svc
Name:                     test1-svc
Namespace:                test1-ns
Labels:                   app=test1-dep
Annotations:              <none>
Selector:                 app=test1-dep
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.30.153.52
IPs:                      172.30.153.52
Port:                     <unset>  80/TCP
TargetPort:               8080/TCP
NodePort:                 <unset>  31099/TCP
Endpoints:                10.128.2.10:8080,10.131.0.63:8080
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

openshift-kube-controller-manager pod shows:
E0915 11:16:45.097228       1 openstack.go:284] cannot get secret openstack-credentials in namespace kube-system. error: "secret \"openstack-credentials\" not found"
E0915 11:16:45.097305       1 core.go:91] Failed to start service controller: the cloud provider does not support external load balancers
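
A minimal sketch for checking whether a kube-controller-manager log shows this failure, assuming the log text is piped in on stdin. The function name is hypothetical; the two matched strings are taken from the log lines above.

```shell
# Hypothetical helper: exits 0 only if the log on stdin contains BOTH
# error messages quoted above, otherwise exits 1.
kcm_log_shows_missing_credentials() {
    awk '
        /cannot get secret openstack-credentials/ { secret = 1 }
        /the cloud provider does not support external load balancers/ { lb = 1 }
        END { exit !(secret && lb) }
    '
}
```

It could be fed with something like `oc logs -n openshift-kube-controller-manager <pod> -c kube-controller-manager | kcm_log_shows_missing_credentials && echo affected`, where `<pod>` is one of the kube-controller-manager pods and the container name is an assumption. The errors are logged at startup, so the check only helps if the log still covers that period.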

The secret openstack-credentials does exist:
$ oc -n kube-system describe secret openstack-credentials
Name:         openstack-credentials
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Type:  Opaque
                                                                                                                                                                                              
Data
====
clouds.conf:  300 bytes
clouds.yaml:  437 bytes

Comment 1 egarcia 2021-09-15 15:43:05 UTC

This looks like an RBAC bug.

Comment 2 Michał Dulko 2021-09-15 15:59:16 UTC

It seems it's affecting 4.8 as well.

Comment 3 egarcia 2021-09-15 16:52:11 UTC

I believe this also means that self-signed certificates are broken.

Comment 4 Michał Dulko 2021-09-17 08:30:46 UTC

Another data point: the issue seems to be a race condition between kube-controller-manager startup and secret creation. A possible workaround is to restart the kube-controller-manager container (it's a static pod, so deleting it from the API doesn't do the job).

Comment 5 Michał Dulko 2021-09-21 12:33:20 UTC

I can confirm that this is the race condition described above. To work around it you have to restart kube-controller-manager, which isn't trivial because it's a static pod defined in /etc/kubernetes/manifests on the masters.

A simple way to do it is to make a dummy change to the openshift-config/cloud-provider-config ConfigMap. First edit it:

 oc edit cm cloud-provider-config -n openshift-config

Then make a dummy change. I just added a "#foobar" comment under the "config" key like this:

  config: |
    [Global]
    secret-name = openstack-credentials
    secret-namespace = kube-system
    region = regionOne
    ca-file = /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem
    [LoadBalancer]
    #foobar
    use-octavia = True

This forces reconfiguration of all the nodes, which effectively restarts the kube-controller-manager pods. It takes a long time, depending on the size of the cluster. You have to wait until no nodes show "NotReady,SchedulingDisabled" in `oc get nodes`. Once done, you can edit the ConfigMap again to remove the comment, but you'll need to wait for the reconfiguration again.
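
The wait described above could be scripted roughly as follows. This is only a sketch: the function name and the polling interval are made up for illustration, and it assumes the STATUS value is the second column of `oc get nodes --no-headers`.

```shell
# Hypothetical helper: reads `oc get nodes --no-headers` output on stdin and
# prints how many nodes have a STATUS (second column) other than plain "Ready",
# e.g. "NotReady,SchedulingDisabled" or "Ready,SchedulingDisabled".
count_unsettled_nodes() {
    awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Example polling loop (the 30-second interval is an arbitrary choice):
#   until [ "$(oc get nodes --no-headers | count_unsettled_nodes)" -eq 0 ]; do
#       sleep 30
#   done
```

The reconfiguration may not start immediately after the ConfigMap edit, so a check made too early can report zero unsettled nodes before any node has been touched. An `oc` failure also produces empty input and therefore a count of zero, so the loop is best treated as a convenience rather than a guarantee.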

An alternative is to SSH into each of the master nodes and move /etc/kubernetes/manifests/kube-controller-manager-pod.yaml somewhere else and then back to that location, which forces deletion and re-creation of the pod.
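
A rough sketch of that manual alternative, meant to be run as root on each master node in turn. The function name, the holding directory and the wait time are all assumptions; how long the kubelet needs to notice the missing manifest is not stated in this bug.

```shell
# Hypothetical helper.
# Usage: bounce_static_pod_manifest MANIFEST HOLD_DIR WAIT_SECONDS
bounce_static_pod_manifest() {
    manifest=$1
    hold_dir=$2
    wait_seconds=$3
    name=$(basename "$manifest")

    # Move the manifest out of the watched directory so the kubelet removes the pod.
    mv "$manifest" "$hold_dir/$name" || return 1
    # Give the kubelet time to notice; the right duration is an assumption.
    sleep "$wait_seconds"
    # Move the manifest back so the kubelet recreates the pod.
    mv "$hold_dir/$name" "$manifest"
}
```

For example, `bounce_static_pod_manifest /etc/kubernetes/manifests/kube-controller-manager-pod.yaml /tmp 30`, where both `/tmp` and the 30 seconds are arbitrary choices.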

Given that there's a fairly simple workaround I'm setting blocker-.

Comment 8 Martin André 2021-09-30 13:26:43 UTC

I was able to verify this issue does not consistently reproduce on fresh deployments. For example, I didn't see the issue on MOC, on a fresh deployment:

moc-dev ❯ oc -n test1-ns describe svc test1-svc
Name:                     test1-svc
Namespace:                test1-ns
Labels:                   app=test1-dep
Annotations:              <none>
Selector:                 app=test1-dep
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.30.143.54
IPs:                      172.30.143.54
LoadBalancer Ingress:     128.31.26.238
Port:                     <unset>  80/TCP
TargetPort:               8080/TCP
NodePort:                 <unset>  30753/TCP
Endpoints:                10.129.2.11:8080,10.131.0.27:8080
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                Age    From                Message
  ----    ------                ----   ----                -------
  Normal  EnsuringLoadBalancer  4m18s  service-controller  Ensuring load balancer
  Normal  EnsuredLoadBalancer   2m1s   service-controller  Ensured load balancer


moc-dev ❯ openstack loadbalancer show 12e0e9a3-c259-46f7-809f-71c176a02236
+---------------------+--------------------------------------------------------------+
| Field               | Value                                                        |
+---------------------+--------------------------------------------------------------+
| admin_state_up      | True                                                         |
| availability_zone   |                                                              |
| created_at          | 2021-09-30T13:15:28                                          |
| description         | Kubernetes external service a6aebc446d2454b2c9499a6ee4a8a259 |
| flavor_id           | None                                                         |
| id                  | 12e0e9a3-c259-46f7-809f-71c176a02236                         |
| listeners           | 5c068bf4-3792-4587-a032-003d2f12a728                         |
| name                | a6aebc446d2454b2c9499a6ee4a8a259                             |
| operating_status    | ONLINE                                                       |
| pools               | dd5654be-625e-4857-82d2-62909fcc280f                         |
| project_id          | f12f928576ae4d21bdb984da5dd1d3bf                             |
| provider            | amphora                                                      |
| provisioning_status | ACTIVE                                                       |
| updated_at          | 2021-09-30T13:17:35                                          |
| vip_address         | 10.0.131.95                                                  |
| vip_network_id      | 6f0026f1-7f4a-4d34-8552-2662d8514616                         |
| vip_port_id         | 2460bb42-a2c0-48c7-99c7-f8df5b2e615f                         |
| vip_qos_policy_id   | None                                                         |
| vip_subnet_id       | 80af4c30-64a7-431d-b7ca-4cc5c7a9a7f5                         |
+---------------------+--------------------------------------------------------------+

moc-dev ❯ curl http://128.31.26.238
test1-dep-86cd46676d-kjwcp: HELLO! I AM ALIVE!!!

The KCM pods haven't been restarted:

moc-dev ❯ oc get pods -A | grep kube-controller-manager-
openshift-kube-controller-manager-operator         kube-controller-manager-operator-7c94444795-jcxk5           1/1     Running     1 (27m ago)   32m
openshift-kube-controller-manager                  kube-controller-manager-mandre-rrf85-master-0               4/4     Running     0             14m
openshift-kube-controller-manager                  kube-controller-manager-mandre-rrf85-master-1               4/4     Running     0             15m
openshift-kube-controller-manager                  kube-controller-manager-mandre-rrf85-master-2               4/4     Running     0             14m

Comment 11 Emilien Macchi 2021-11-24 16:48:15 UTC

Moving to the KCM team, so they can have a look.
Please let us know if that's something we can help with.

Comment 12 Maciej Szulik 2021-11-29 12:26:01 UTC

Honestly, I don't see a problem with kube-controller-manager here, but rather an integration issue with a specific cloud provider, in this case OpenStack.

Comment 13 Martin André 2021-12-03 16:44:27 UTC

Hi Jon, can you provide us with a must-gather next time you encounter the issue?

Comment 14 Gabriel Stein 2021-12-20 17:04:23 UTC

*** Bug 2033632 has been marked as a duplicate of this bug. ***

Comment 48 Martin André 2022-02-15 13:12:11 UTC

*** Bug 1938188 has been marked as a duplicate of this bug. ***

Comment 50 errata-xmlrpc 2022-03-10 16:10:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056