Bug 2001620
Summary: | Cluster becomes degraded if it can't talk to Manila | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Martin André <m.andre> |
Component: | Storage | Assignee: | Martin André <m.andre> |
Storage sub component: | OpenStack CSI Drivers | QA Contact: | Jon Uriarte <juriarte> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | aos-bugs, rlobillo |
Version: | 4.6 | ||
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: Error handling when communicating with Manila did not account for all possible failures.
Consequence: Unexpected errors when talking to Manila degraded the operator and, in turn, the cluster.
Fix: Treat all failures to reach the Manila endpoint as non-fatal errors.
Result: The Manila operator is disabled instead of degrading the cluster.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2022-03-10 16:07:38 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2001958 |
Description
Martin André
2021-09-06 14:47:46 UTC
Verified on OCP 4.10.0-0.nightly-2021-09-09-163608 on top of OSP16.1 (RHOS-16.1-RHEL-8-20210818.n.0) with OpenshiftSDN network type.

The IPI installation, performed on a restricted network with a proxy, finished successfully while the SG rules on the proxy instance were blocking egress traffic to the OSP manila endpoint:

$ openstack catalog show manila | grep public
| | public: https://10.46.44.10:13786/v1/65d84c01ef224b0c9fe8892d43fa804a |

# Egress rules on the instance where the proxy is running:
$ openstack security group rule list --egress installer_host-sg
+--------------------------------------+-------------+-----------+-----------+-------------+-----------------------+
| ID                                   | IP Protocol | Ethertype | IP Range  | Port Range  | Remote Security Group |
+--------------------------------------+-------------+-----------+-----------+-------------+-----------------------+
| 17a7dccc-d005-4f22-8369-bc511b86ff83 | udp         | IPv4      | 0.0.0.0/0 |             | None                  |
| 22c83d2c-33f1-401a-b8fe-319628066615 | tcp         | IPv4      | 0.0.0.0/0 | 13787:65000 | None                  |
| 45d10ca3-9954-49c1-ad47-abfcc63a0d93 | tcp         | IPv4      | 0.0.0.0/0 | 1:13785     | None                  |
| d39faa61-7294-4e7e-8a29-aac757354233 | None        | IPv6      | ::/0      |             | None                  |
+--------------------------------------+-------------+-----------+-----------+-------------+-----------------------+

As a result, the manila-csi-driver-operator times out when reaching the manila API, while the other endpoints remain reachable (tested with keystone):

- manila OSP API is not reachable:

$ oc rsh -n openshift-cluster-csi-drivers $(oc get pods -n openshift-cluster-csi-drivers -l name=manila-csi-driver-operator -o name)
sh-4.4$ curl --connect-timeout 5 --proxy-cacert /etc/openstack-ca/ca-bundle.pem --cacert /etc/openstack-ca/ca-bundle.pem https://10.46.44.10:13786/v1/65d84c01ef224b0c9fe8892d43fa804a
curl: (28) Operation timed out after 5002 milliseconds with 0 out of 0 bytes received

- However, keystone is reachable:

sh-4.4$ curl --connect-timeout 5 --proxy-cacert /etc/openstack-ca/ca-bundle.pem --cacert /etc/openstack-ca/ca-bundle.pem https://10.46.44.10:13000
{"versions": {"values": [{"id": "v3.13", "status": "stable", "updated": "2019-07-19T00:00:00Z", "links": [{"rel": "self", "href": "https://10.46.44.10:13000/v3/"}], "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v3+json"}]}]}}sh-4.4$

Under these circumstances, the IPI installation works fine:

DEBUG Time elapsed per stage:
DEBUG : 1m49s
DEBUG Bootstrap Complete: 24m4s
DEBUG API: 3m26s
DEBUG Bootstrap Destroy: 33s
DEBUG Cluster Operators: 20m14s
INFO Time elapsed: 47m36s

All cluster operators are available:

$ oc get clusteroperators
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      23m
baremetal                                  4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m
cloud-controller-manager                   4.10.0-0.nightly-2021-09-09-163608   True        False         False      55m
cloud-credential                           4.10.0-0.nightly-2021-09-09-163608   True        False         False      62m
cluster-autoscaler                         4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m
config-operator                            4.10.0-0.nightly-2021-09-09-163608   True        False         False      52m
console                                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      27m
csi-snapshot-controller                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      48m
dns                                        4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m
etcd                                       4.10.0-0.nightly-2021-09-09-163608   True        False         False      49m
image-registry                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      31m
ingress                                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      29m
insights                                   4.10.0-0.nightly-2021-09-09-163608   True        False         False      45m
kube-apiserver                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      46m
kube-controller-manager                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      48m
kube-scheduler                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      48m
kube-storage-version-migrator              4.10.0-0.nightly-2021-09-09-163608   True        False         False      50m
machine-api                                4.10.0-0.nightly-2021-09-09-163608   True        False         False      41m
machine-approver                           4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m
machine-config                             4.10.0-0.nightly-2021-09-09-163608   True        False         False      46m
marketplace                                4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m
monitoring                                 4.10.0-0.nightly-2021-09-09-163608   True        False         False      27m
network                                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      49m
node-tuning                                4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m
openshift-apiserver                        4.10.0-0.nightly-2021-09-09-163608   True        False         False      42m
openshift-controller-manager               4.10.0-0.nightly-2021-09-09-163608   True        False         False      47m
openshift-samples                          4.10.0-0.nightly-2021-09-09-163608   True        False         False      44m
operator-lifecycle-manager                 4.10.0-0.nightly-2021-09-09-163608   True        False         False      49m
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2021-09-09-163608   True        False         False      49m
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2021-09-09-163608   True        False         False      44m
service-ca                                 4.10.0-0.nightly-2021-09-09-163608   True        False         False      52m
storage                                    4.10.0-0.nightly-2021-09-09-163608   True        False         False      43m

Manila is not deployed, as reported by the storage clusteroperator:

$ oc get clusteroperator storage -o json | jq '.status.conditions[] | select(.type=="Available")'
{
  "lastTransitionTime": "2021-09-10T14:42:33Z",
  "message": "OpenStackCinderCSIDriverOperatorCRAvailable: All is well\nManilaCSIDriverOperatorCRAvailable: CSI driver for Manila is disabled: Unable to retrieve Manila share types: cannot list available share types: Get \"https://10.46.44.10:13786/v2/65d84c01ef224b0c9fe8892d43fa804a/types\": Service Unavailable",
  "reason": "AsExpected",
  "status": "True",
  "type": "Available"
}

$ oc get sc
NAME                 PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
standard (default)   kubernetes.io/cinder       Delete          WaitForFirstConsumer   true                   65m
standard-csi         cinder.csi.openstack.org   Delete          WaitForFirstConsumer   true                   63m

$ oc get pods -A | grep -i manila
openshift-cluster-csi-drivers   manila-csi-driver-operator-66d4476d74-x9bs2   1/1   Running   0   64m

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-09-163608   True        False         40m     Cluster version is 4.10.0-0.nightly-2021-09-09-163608

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056