Bug 1806034
Summary: | Name conflict caused by vsphere-volume storage provisioner makes OCS installation fail | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Martin Bukatovic <mbukatov> |
Component: | Storage | Assignee: | Tomas Smetana <tsmetana> |
Storage sub component: | Kubernetes | QA Contact: | Qin Ping <piqin> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | unspecified | CC: | aos-bugs, chuffman, dahorak, jsafrane, pbalogh |
Version: | 4.2.z | Keywords: | Automation |
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
A race condition in the vSphere cloud provider could cause failures in persistent volume provisioning.
Consequence:
In rare circumstances under heavy load, a vSphere volume might fail to provision dynamically.
Fix:
The race condition has been fixed.
Result:
vSphere volumes should now provision reliably.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-10-27 15:55:19 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1863009 | ||
Bug Blocks: | |||
Attachments: |
Description
Martin Bukatovic
2020-02-21 20:50:13 UTC
Created attachment 1664842 [details]
Output of oc describe pod/rook-ceph-mon-c-canary-58fb966bdd-nh299 -n openshift-storage
Created attachment 1664843 [details]
Output of oc describe pvc/rook-ceph-mon-c -n openshift-storage
Created attachment 1664855 [details]
Screenshot #1: Overview dashboard of OCP Console, with "ocs-storagecluster-cephcluster is not available" error.
Created attachment 1664856 [details]
Screenshot #2: vCenter web UI with a file list of the kubevols directory in the DS-1 datacenter (with the virtual disk for the affected PVC)
Additional info
===============

Master Log
----------
In the must-gather data, there are:
- masters/crio_service.log
- masters/kubelet_service.log

Node Log (of failed PODs)
-------------------------
Not applicable, the ceph pod was not started because of failed PVC provisioning.

PV Dump
-------
Not applicable, the PV was not provisioned:

```
$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS   REASON   AGE
pvc-d6fe39f5-54c9-11ea-9272-005056be10c1   10Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-a   thin                    5h4m
pvc-e63d5b82-54c9-11ea-9272-005056be10c1   10Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-b   thin                    5h4m
```

PVC Dump
--------

```
$ oc get pvc -n openshift-storage
NAME              STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
rook-ceph-mon-a   Bound     pvc-d6fe39f5-54c9-11ea-9272-005056be10c1   10Gi       RWO            thin           5h2m
rook-ceph-mon-b   Bound     pvc-e63d5b82-54c9-11ea-9272-005056be10c1   10Gi       RWO            thin           5h2m
rook-ceph-mon-c   Pending                                                                        thin           5h1m
```

```
$ oc get pvc -n openshift-storage -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
    creationTimestamp: "2020-02-21T16:47:26Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app: rook-ceph-mon
      ceph-version: 14.2.4
      ceph_daemon_id: a
      mon: a
      mon_cluster: openshift-storage
      rook-version: 4.2-292.a26dc84c.ocs_4.2
      rook_cluster: openshift-storage
    name: rook-ceph-mon-a
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 7533e883-54c9-11ea-831c-005056be67e7
    resourceVersion: "27190"
    selfLink: /api/v1/namespaces/openshift-storage/persistentvolumeclaims/rook-ceph-mon-a
    uid: d6fe39f5-54c9-11ea-9272-005056be10c1
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: thin
    volumeMode: Filesystem
    volumeName: pvc-d6fe39f5-54c9-11ea-9272-005056be10c1
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 10Gi
    phase: Bound
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
    creationTimestamp: "2020-02-21T16:47:51Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app: rook-ceph-mon
      ceph-version: 14.2.4
      ceph_daemon_id: b
      mon: b
      mon_cluster: openshift-storage
      rook-version: 4.2-292.a26dc84c.ocs_4.2
      rook_cluster: openshift-storage
    name: rook-ceph-mon-b
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 7533e883-54c9-11ea-831c-005056be67e7
    resourceVersion: "27287"
    selfLink: /api/v1/namespaces/openshift-storage/persistentvolumeclaims/rook-ceph-mon-b
    uid: e63d5b82-54c9-11ea-9272-005056be10c1
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: thin
    volumeMode: Filesystem
    volumeName: pvc-e63d5b82-54c9-11ea-9272-005056be10c1
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 10Gi
    phase: Bound
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
    creationTimestamp: "2020-02-21T16:48:01Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app: rook-ceph-mon
      ceph-version: 14.2.4
      ceph_daemon_id: c
      mon: c
      mon_cluster: openshift-storage
      rook-version: 4.2-292.a26dc84c.ocs_4.2
      rook_cluster: openshift-storage
    name: rook-ceph-mon-c
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 7533e883-54c9-11ea-831c-005056be67e7
    resourceVersion: "27305"
    selfLink: /api/v1/namespaces/openshift-storage/persistentvolumeclaims/rook-ceph-mon-c
    uid: ec53589e-54c9-11ea-9272-005056be10c1
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: thin
    volumeMode: Filesystem
  status:
    phase: Pending
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

StorageClass Dump (if StorageClass used by PV/PVC)
--------------------------------------------------

```
$ oc get sc/thin -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2020-02-21T16:11:58Z"
  name: thin
  ownerReferences:
  - apiVersion: v1
    kind: clusteroperator
    name: storage
    uid: 7451b6dd-54c1-11ea-8c75-005056bec7d8
  resourceVersion: "12214"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/thin
  uid: e31b69ac-54c4-11ea-8cb8-005056be67e7
parameters:
  diskformat: thin
provisioner: kubernetes.io/vsphere-volume
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

I retried deploying the OCP/OCS 4.2 cluster in the same way today and didn't hit the issue, so I'm updating the "How reproducible" stats to 1/2. That said, the openshift-storage.cephfs.csi.ceph.com storage provisioner failed in a similar way instead (even though the storage codebase is, imho, very different in each case). I'm going to report an OCS BZ for that tomorrow and will link it here.

(In reply to Martin Bukatovic from comment #7)
> That said, the openshift-storage.cephfs.csi.ceph.com storage provisioner failed
> in a similar way instead (even though the storage codebase is, imho, very
> different in each case). I'm going to report an OCS BZ for that tomorrow and
> will link it here.

The problem was caused by FailedScheduling due to insufficient CPU and memory on the worker nodes. This is not a bug.

I see you limited the number of CPUs for the worker nodes in the config you used here:
http://git.engineering.redhat.com/git/users/mbukatov/ocs-ci.git/tree/conf/deployment/vsphere/upi_1az_rhcos_3m_3w.mbukatov.yaml?h=vmware_deployment_config

Your cluster has 16 cores, and VMware should be able to share all those cores among the VMs, I think. Have you tried running with the full 16 vCPUs for the workers?

FYI: I've just seen some discussion suggesting that 12 vCPUs should be enough for VMware: https://bugzilla.redhat.com/show_bug.cgi?id=1806921 (mail thread with the subject "OCS 4.2 on VMware constrained by hardware - can we reduce vCPU?" in the sme-storage mailing list).

Petr

(In reply to Petr Balogh from comment #9)
> I see you limited the number of CPUs for the worker nodes in the config you used here:
> http://git.engineering.redhat.com/git/users/mbukatov/ocs-ci.git/tree/conf/deployment/vsphere/upi_1az_rhcos_3m_3w.mbukatov.yaml?h=vmware_deployment_config
>
> Your cluster has 16 cores, and VMware should be able to share all those cores
> among the VMs, I think. Have you tried running with the full 16 vCPUs for the workers?

I'm using the DC2 C1 cluster, which has the following resources available for its virtual machines:

- 128 GB RAM
- 32 vCPUs
- VMFS datastore, 7.9 TB

Since the default cluster config allocated just 4 CPUs per node, I was trying to increase the allocation. I didn't try to overcommit hardware allocations for the virtual machines, as the performance I'm seeing is already terrible.
That said, this is totally unrelated to this bug. The first error in the logs is this one:

quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-485739555a8a5560b70d4b7be3a5a4b234d7dfe56f295c6ca8a257707a911a6b/namespaces/openshift-etcd/pods/etcd-member-control-plane-2/etcd-member/etcd-member/logs/current.log:2020-02-21T16:48:23.441557968Z 2020-02-21 16:48:23.441521 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumes/pvc-ec53589e-54c9-11ea-9272-005056be10c1\" " with result "range_response_count:0 size:6" took too long (272.492327ms) to execute

So I wonder whether this could be a race caused by high latency on etcd (the logs are full of it). Does this occur every time, or is the error only intermittent?

(In reply to Tomas Smetana from comment #11)
> So I wonder whether this could be a race caused by high latency on etcd (the logs
> are full of it). Does this occur every time, or is the error only intermittent?

I agree that this looks like a race condition. Since the moment I reported this bug, I have tried to install OCP/OCS 4.2 three more times and haven't seen this problem again (that said, in one case the installation failed on something else), so the total "How reproducible?" stats are about 1/4.

After reviewing the logs from the must-gather, it looks like the server simply got congested and etcd stopped being reliable -- there are too many errors like these and the one in comment #11:

2020-02-21T16:35:08.696568454Z 2020-02-21 16:35:08.696558 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 1.167541657s)
2020-02-21T16:35:08.696600807Z 2020-02-21 16:35:08.696584 W | etcdserver: server is likely overloaded

Could you please attach the etcd pod logs (or a must-gather with those) from a cluster where things work normally? I'd like to see how frequently etcd complains there (if at all).

There are two disk managers in the vSphere cloud provider. One of them checks for the "the file or folder ... already exists" error and ignores it:
https://github.com/openshift/origin/blob/release-4.2/vendor/k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere/vclib/diskmanagers/vmdm.go#L136
However, the virtualDiskManager-based disk manager treats this error as a failure. Looks like an omission to me. (An illustrative sketch of this error-tolerant pattern is appended at the end of this report.)

Upstream PR: https://github.com/kubernetes/kubernetes/pull/90638

Now vSphere storage provisioning failed on 4.6.

Verified on 4.6.0-0.nightly-2020-08-31-220837.
Launched 5 scripts at the same time, each script creating 30 PVCs, and checked that all 150 PVCs were Bound. Tested 10 times and all PVCs were Bound, so changing the status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
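For illustration only, here is a minimal, self-contained Go sketch of the error-tolerant pattern discussed above: treating vSphere's "file or folder ... already exists" fault during virtual disk creation as success, so a racing provisioning attempt does not fail the PVC. The function names, the string-matching helper, and the example disk path are hypothetical and are not taken from the actual cloud-provider code or the upstream PR.

```
package main

import (
	"fmt"
	"strings"
)

// isAlreadyExists reports whether err looks like vSphere's
// "file or folder ... already exists" fault.
// (Hypothetical helper for this sketch; the real provider inspects the fault differently.)
func isAlreadyExists(err error) bool {
	return err != nil && strings.Contains(err.Error(), "already exists")
}

// createVirtualDisk stands in for the disk manager's Create call.
// Here it simulates the race: a concurrent provisioning attempt has
// already created the requested .vmdk file.
func createVirtualDisk(path string) error {
	return fmt.Errorf("file or folder %s already exists", path)
}

// ensureVirtualDisk creates the disk but treats "already exists" as success --
// the behaviour the vmdm.go disk manager already has and which the upstream
// fix brings to the other code path as well.
func ensureVirtualDisk(path string) error {
	if err := createVirtualDisk(path); err != nil {
		if isAlreadyExists(err) {
			// The disk is already in place; provisioning can proceed
			// instead of failing the PVC.
			return nil
		}
		return fmt.Errorf("failed to create virtual disk %s: %w", path, err)
	}
	return nil
}

func main() {
	if err := ensureVirtualDisk("[datastore] kubevols/example-pvc.vmdk"); err != nil {
		fmt.Println("provisioning failed:", err)
		return
	}
	fmt.Println("disk ensured; provisioning continues")
}
```

With this pattern, a second create call racing against the first becomes effectively idempotent rather than fatal.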