2001620 – Cluster becomes degraded if it can't talk to Manila

Summary: Cluster becomes degraded if it can't talk to Manila

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Martin André
QA Contact:	Jon Uriarte
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2001958

Reported:	2021-09-06 14:47 UTC by Martin André
Modified:	2024-12-20 20:57 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Error handling when communicating with Manila did not account for all possible failures. Consequence: Unexpected errors when talking to Manila degraded the operator and thus the cluster. Fix: All failures to reach the Manila endpoint are treated as non-fatal errors. Result: The Manila operator is disabled instead of making the cluster degraded.
Clone Of:
Environment:
Last Closed:	2022-03-10 16:07:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift csi-driver-manila-operator pull 120	0	None	None	None	2021-09-06 14:52:24 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:08:05 UTC

Description Martin André 2021-09-06 14:47:46 UTC

Description of problem: For one reason or another, we may be unable to reach the Manila endpoint. In the past we tried to be smart and handled errors differently depending on whether the response was a 404, a 403, or another type of error. The problem with this approach is that it is very easy to miss valid failure cases. A recent example: proxy settings were not correctly propagated to the Manila pod, which degraded the cluster.

We should instead treat all failures to reach the Manila endpoint as non-fatal errors and disable the Manila operator rather than making the cluster degraded.
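
The proposed policy can be illustrated with a minimal sketch. It is not the operator's actual code: the names DriverStatus, evaluate_manila and unreachable are hypothetical, and Python is used only for brevity. The point it shows is that a single broad error path replaces per-status-code handling, so any failure to query Manila disables the driver without reporting Degraded.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DriverStatus:
    """Hypothetical summary of what the operator would report."""
    enabled: bool   # whether the Manila CSI driver should be deployed
    degraded: bool  # whether the operator reports Degraded
    message: str


def evaluate_manila(list_share_types: Callable[[], List[str]]) -> DriverStatus:
    """Treat every failure to query Manila as non-fatal and disable the driver."""
    try:
        share_types = list_share_types()
    except Exception as err:
        # Deliberately broad: no special cases for 403, 404, timeouts, etc.
        return DriverStatus(
            enabled=False,
            degraded=False,
            message=f"CSI driver for Manila is disabled: {err}",
        )
    return DriverStatus(
        enabled=True,
        degraded=False,
        message=f"found {len(share_types)} share type(s)",
    )


def unreachable() -> List[str]:
    """Stand-in for a Manila client call that cannot reach the endpoint."""
    raise TimeoutError("Service Unavailable")


if __name__ == "__main__":
    print(evaluate_manila(unreachable))
    print(evaluate_manila(lambda: ["default"]))
```

With this shape, a newly discovered failure mode (such as a misconfigured proxy) needs no extra branch: it falls into the same non-fatal path.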

Comment 32 rlobillo 2021-09-10 15:46:01 UTC

Verified on OCP 4.10.0-0.nightly-2021-09-09-163608 on top of OSP16.1 (RHOS-16.1-RHEL-8-20210818.n.0) with the OpenShiftSDN network type.

The IPI installation on a restricted network with a proxy finished successfully while the security group rules on the proxy instance were blocking egress traffic to the OSP Manila endpoint:

$ openstack catalog show manila | grep public
|           |   public: https://10.46.44.10:13786/v1/65d84c01ef224b0c9fe8892d43fa804a |


# Egress rules on the instance where the proxy is running:
$ openstack security group rule list --egress installer_host-sg
+--------------------------------------+-------------+-----------+-----------+-------------+-----------------------+
| ID                                   | IP Protocol | Ethertype | IP Range  | Port Range  | Remote Security Group |
+--------------------------------------+-------------+-----------+-----------+-------------+-----------------------+
| 17a7dccc-d005-4f22-8369-bc511b86ff83 | udp         | IPv4      | 0.0.0.0/0 |             | None                  |
| 22c83d2c-33f1-401a-b8fe-319628066615 | tcp         | IPv4      | 0.0.0.0/0 | 13787:65000 | None                  |
| 45d10ca3-9954-49c1-ad47-abfcc63a0d93 | tcp         | IPv4      | 0.0.0.0/0 | 1:13785     | None                  |
| d39faa61-7294-4e7e-8a29-aac757354233 | None        | IPv6      | ::/0      |             | None                  |
+--------------------------------------+-------------+-----------+-----------+-------------+-----------------------+

As a result, the manila-csi-driver-operator times out when trying to reach the Manila API, while the other OpenStack services remain reachable (tested with Keystone):

- The OSP Manila API is not reachable:

$ oc rsh -n openshift-cluster-csi-drivers $(oc get pods -n openshift-cluster-csi-drivers -l name=manila-csi-driver-operator -o name)
sh-4.4$ curl --connect-timeout 5 --proxy-cacert /etc/openstack-ca/ca-bundle.pem --cacert /etc/openstack-ca/ca-bundle.pem https://10.46.44.10:13786/v1/65d84c01ef224b0c9fe8892d43fa804a
curl: (28) Operation timed out after 5002 milliseconds with 0 out of 0 bytes received

- Keystone, however, is reachable:

sh-4.4$ curl --connect-timeout 5 --proxy-cacert /etc/openstack-ca/ca-bundle.pem --cacert /etc/openstack-ca/ca-bundle.pem https://10.46.44.10:13000
{"versions": {"values": [{"id": "v3.13", "status": "stable", "updated": "2019-07-19T00:00:00Z", "links": [{"rel": "self", "href": "https://10.46.44.10:13000/v3/"}], "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v3+json"}]}]}}
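
The egress rule list explains both curl results: the two TCP rules cover ports 1-13785 and 13787-65000, leaving only port 13786 (the Manila public endpoint) uncovered. A small sketch of that check follows. The helper name egress_allowed is hypothetical, and the sketch assumes only the IPv4 TCP rules matter for these HTTPS requests.

```python
# Egress TCP port ranges copied from the security group rule list
# (inclusive bounds, IPv4 only).
ALLOWED_TCP_RANGES = [(13787, 65000), (1, 13785)]


def egress_allowed(port, ranges=ALLOWED_TCP_RANGES):
    """Return True if any rule's inclusive port range covers the port."""
    return any(low <= port <= high for low, high in ranges)


if __name__ == "__main__":
    print(egress_allowed(13786))  # Manila public endpoint port: False
    print(egress_allowed(13000))  # Keystone port: True
```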


Under these conditions, the IPI installation completes successfully:

DEBUG Time elapsed per stage:
DEBUG                   : 1m49s
DEBUG Bootstrap Complete: 24m4s
DEBUG                API: 3m26s
DEBUG  Bootstrap Destroy: 33s
DEBUG  Cluster Operators: 20m14s
INFO Time elapsed: 47m36s


All cluster operators are available:

$ oc get clusteroperators
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE                                                                                                         
authentication                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      23m                                                                                                                     
baremetal                                  4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m                                                                                                                     
cloud-controller-manager                   4.10.0-0.nightly-2021-09-09-163608   True        False         False      55m                                                                                                                     
cloud-credential                           4.10.0-0.nightly-2021-09-09-163608   True        False         False      62m                                                                                                                     
cluster-autoscaler                         4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m                                                                                                                     
config-operator                            4.10.0-0.nightly-2021-09-09-163608   True        False         False      52m                                                                                                                     
console                                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      27m                                                                                                                     
csi-snapshot-controller                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      48m                                                                                                                     
dns                                        4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m                                                                                                                     
etcd                                       4.10.0-0.nightly-2021-09-09-163608   True        False         False      49m                                                                                                                     
image-registry                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      31m                                                                                                                     
ingress                                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      29m                                                                                                                     
insights                                   4.10.0-0.nightly-2021-09-09-163608   True        False         False      45m                                                                                                                     
kube-apiserver                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      46m                                                                                                                     
kube-controller-manager                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      48m                                                                                                                     
kube-scheduler                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      48m                                                                                                                     
kube-storage-version-migrator              4.10.0-0.nightly-2021-09-09-163608   True        False         False      50m                                                                                                                     
machine-api                                4.10.0-0.nightly-2021-09-09-163608   True        False         False      41m                                                                                                                     
machine-approver                           4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m                                                                                                                     
machine-config                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      46m                                                                                                                     
marketplace                                4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m                                                                                                                     
monitoring                                 4.10.0-0.nightly-2021-09-09-163608   True        False         False      27m                                                                                                                     
network                                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      49m                                                                                                                     
node-tuning                                4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m                                                                                                                     
openshift-apiserver                        4.10.0-0.nightly-2021-09-09-163608   True        False         False      42m                                                                                                                     
openshift-controller-manager               4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m                                                                                                                     
openshift-samples                          4.10.0-0.nightly-2021-09-09-163608   True        False         False      44m                                                                                                                     
operator-lifecycle-manager                 4.10.0-0.nightly-2021-09-09-163608   True        False         False      49m                                                                                                                     
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2021-09-09-163608   True        False         False      49m                                                                                                                     
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2021-09-09-163608   True        False         False      44m                                                                                                                     
service-ca                                 4.10.0-0.nightly-2021-09-09-163608   True        False         False      52m                                                                                                                     
storage                                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      43m 


Manila is not deployed, as reported by the storage clusteroperator:

$ oc get clusteroperator storage -o json | jq '.status.conditions[] | select(.type=="Available")'
{
  "lastTransitionTime": "2021-09-10T14:42:33Z",
  "message": "OpenStackCinderCSIDriverOperatorCRAvailable: All is well\nManilaCSIDriverOperatorCRAvailable: CSI driver for Manila is disabled: Unable to retrieve Manila share types: cannot list available share types: Get \"https://10.46.44.10:13786/v2/65d84c01ef224b0c9fe8892d43fa804a/types\": Service Unavailable",
  "reason": "AsExpected",
  "status": "True",
  "type": "Available"
}

$ oc get sc
NAME                 PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
standard (default)   kubernetes.io/cinder       Delete          WaitForFirstConsumer   true                   65m
standard-csi         cinder.csi.openstack.org   Delete          WaitForFirstConsumer   true                   63m

$ oc get pods -A | grep -i manila
openshift-cluster-csi-drivers                      manila-csi-driver-operator-66d4476d74-x9bs2                1/1     Running     0             64m

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-09-163608   True        False         40m     Cluster version is 4.10.0-0.nightly-2021-09-09-163608

Comment 33 ShiftStack Bugwatcher 2021-11-25 16:12:26 UTC

Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 36 errata-xmlrpc 2022-03-10 16:07:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056