Bug 1806034
Summary: | Name conflict caused by vsphere-volume storage provisioner makes OCS installation fail | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Martin Bukatovic <mbukatov> |
Component: | Storage | Assignee: | Tomas Smetana <tsmetana> |
Storage sub component: | Kubernetes | QA Contact: | Qin Ping <piqin> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | unspecified | CC: | aos-bugs, chuffman, dahorak, jsafrane, pbalogh |
Version: | 4.2.z | Keywords: | Automation |
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
A race condition in the vSphere cloud provider could cause failures in persistent volume provisioning.
Consequence:
In rare circumstances under heavy load, a vSphere volume might fail to provision dynamically.
Fix:
The race condition has been fixed.
Result:
vSphere volumes should now provision reliably.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-10-27 15:55:19 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1863009 | ||
Bug Blocks: | |||
Attachments: |
Description
Martin Bukatovic
2020-02-21 20:50:13 UTC
Created attachment 1664842 [details]
Output of oc describe pod/rook-ceph-mon-c-canary-58fb966bdd-nh299 -n openshift-storage
Created attachment 1664843 [details]
Output of oc describe pvc/rook-ceph-mon-c -n openshift-storage
Created attachment 1664855 [details]
Screenshot #1: Overview dashboard of OCP Console, with "ocs-storagecluster-cephcluster is not available" error.
Created attachment 1664856 [details]
Screenshot #2: vCenter web UI with a file list of the kubevols directory in the DS-1 datacenter (with the virtual disk for the affected PVC)
Additional info
===============

Master Log
----------
In the must-gather data, there are:
- masters/crio_service.log
- masters/kubelet_service.log

Node Log (of failed PODs)
-------------------------
Not applicable, the ceph pod was not started because of failed PVC provisioning.

PV Dump
-------
Not applicable, the PV was not provisioned:

```
$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS   REASON   AGE
pvc-d6fe39f5-54c9-11ea-9272-005056be10c1   10Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-a   thin                    5h4m
pvc-e63d5b82-54c9-11ea-9272-005056be10c1   10Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-b   thin                    5h4m
```

PVC Dump
--------

```
$ oc get pvc -n openshift-storage
NAME              STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
rook-ceph-mon-a   Bound     pvc-d6fe39f5-54c9-11ea-9272-005056be10c1   10Gi       RWO            thin           5h2m
rook-ceph-mon-b   Bound     pvc-e63d5b82-54c9-11ea-9272-005056be10c1   10Gi       RWO            thin           5h2m
rook-ceph-mon-c   Pending                                                                        thin           5h1m
```

```
$ oc get pvc -n openshift-storage -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
    creationTimestamp: "2020-02-21T16:47:26Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app: rook-ceph-mon
      ceph-version: 14.2.4
      ceph_daemon_id: a
      mon: a
      mon_cluster: openshift-storage
      rook-version: 4.2-292.a26dc84c.ocs_4.2
      rook_cluster: openshift-storage
    name: rook-ceph-mon-a
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 7533e883-54c9-11ea-831c-005056be67e7
    resourceVersion: "27190"
    selfLink: /api/v1/namespaces/openshift-storage/persistentvolumeclaims/rook-ceph-mon-a
    uid: d6fe39f5-54c9-11ea-9272-005056be10c1
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: thin
    volumeMode: Filesystem
    volumeName: pvc-d6fe39f5-54c9-11ea-9272-005056be10c1
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 10Gi
    phase: Bound
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
    creationTimestamp: "2020-02-21T16:47:51Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app: rook-ceph-mon
      ceph-version: 14.2.4
      ceph_daemon_id: b
      mon: b
      mon_cluster: openshift-storage
      rook-version: 4.2-292.a26dc84c.ocs_4.2
      rook_cluster: openshift-storage
    name: rook-ceph-mon-b
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 7533e883-54c9-11ea-831c-005056be67e7
    resourceVersion: "27287"
    selfLink: /api/v1/namespaces/openshift-storage/persistentvolumeclaims/rook-ceph-mon-b
    uid: e63d5b82-54c9-11ea-9272-005056be10c1
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: thin
    volumeMode: Filesystem
    volumeName: pvc-e63d5b82-54c9-11ea-9272-005056be10c1
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 10Gi
    phase: Bound
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
    creationTimestamp: "2020-02-21T16:48:01Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app: rook-ceph-mon
      ceph-version: 14.2.4
      ceph_daemon_id: c
      mon: c
      mon_cluster: openshift-storage
      rook-version: 4.2-292.a26dc84c.ocs_4.2
      rook_cluster: openshift-storage
    name: rook-ceph-mon-c
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 7533e883-54c9-11ea-831c-005056be67e7
    resourceVersion: "27305"
    selfLink: /api/v1/namespaces/openshift-storage/persistentvolumeclaims/rook-ceph-mon-c
    uid: ec53589e-54c9-11ea-9272-005056be10c1
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: thin
    volumeMode: Filesystem
  status:
    phase: Pending
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

StorageClass Dump (if StorageClass used by PV/PVC)
--------------------------------------------------

```
$ oc get sc/thin -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2020-02-21T16:11:58Z"
  name: thin
  ownerReferences:
  - apiVersion: v1
    kind: clusteroperator
    name: storage
    uid: 7451b6dd-54c1-11ea-8c75-005056bec7d8
  resourceVersion: "12214"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/thin
  uid: e31b69ac-54c4-11ea-8cb8-005056be67e7
parameters:
  diskformat: thin
provisioner: kubernetes.io/vsphere-volume
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

I retried deploying the OCP/OCS 4.2 cluster in the same way today and didn't hit the issue, so I'm updating the "How reproducible" stats to 1/2. That said, the openshift-storage.cephfs.csi.ceph.com storage provisioner failed in a similar way instead (even though the storage codebase is, imho, very different in each case). I'm going to report an OCS BZ for that tomorrow and will link it here.

(In reply to Martin Bukatovic from comment #7)
> That said, the openshift-storage.cephfs.csi.ceph.com storage provisioner failed
> in a similar way instead (even though the storage codebase is, imho, very
> different in each case). I'm going to report an OCS BZ for that tomorrow and
> will link it here.

The problem was caused by FailedScheduling due to insufficient CPU and memory on the worker nodes. This is not a bug.

I see you limited the number of CPUs for the worker nodes in the config you used here:
http://git.engineering.redhat.com/git/users/mbukatov/ocs-ci.git/tree/conf/deployment/vsphere/upi_1az_rhcos_3m_3w.mbukatov.yaml?h=vmware_deployment_config

Your cluster has 16 cores, and VMware should be able to share all those cores among the VMs, I think. Have you tried running with the full 16 vCPUs for the workers?

FYI: I've just seen some discussion suggesting that 12 vCPUs should be enough for VMware: https://bugzilla.redhat.com/show_bug.cgi?id=1806921 (mail thread with the subject "OCS 4.2 on VMware constrained by hardware - can we reduce vCPU?" in the sme-storage mailing list).

Petr

(In reply to Petr Balogh from comment #9)
> I see you limited the number of CPUs for the worker nodes in the config you used here:
> http://git.engineering.redhat.com/git/users/mbukatov/ocs-ci.git/tree/conf/deployment/vsphere/upi_1az_rhcos_3m_3w.mbukatov.yaml?h=vmware_deployment_config
>
> Your cluster has 16 cores, and VMware should be able to share all those cores
> among the VMs, I think. Have you tried running with the full 16 vCPUs for the workers?

I'm using the DC2 C1 cluster, which has the following resources available for its virtual machines:

- 128 GB RAM
- 32 vCPUs
- VMFS datastore, 7.9 TB

Since the default cluster config allocated just 4 CPUs per node, I was trying to increase the allocation. I didn't try to overcommit hardware allocations for the virtual machines, as the performance I'm seeing is already terrible.
That said, this is totally unrelated to this bug. The first error in the logs is this one:

quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-485739555a8a5560b70d4b7be3a5a4b234d7dfe56f295c6ca8a257707a911a6b/namespaces/openshift-etcd/pods/etcd-member-control-plane-2/etcd-member/etcd-member/logs/current.log:2020-02-21T16:48:23.441557968Z 2020-02-21 16:48:23.441521 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumes/pvc-ec53589e-54c9-11ea-9272-005056be10c1\" " with result "range_response_count:0 size:6" took too long (272.492327ms) to execute

So I wonder whether this could be a race caused by high latency on etcd (the logs are full of it). Does this occur every time, or is the error only intermittent?

(In reply to Tomas Smetana from comment #11)
> So I wonder whether this could be a race caused by high latency on etcd (the logs
> are full of it). Does this occur every time, or is the error only intermittent?

I agree that this looks like a race condition. Since the moment I reported this bug, I have tried to install OCP/OCS 4.2 three more times and haven't seen this problem again (that said, in one case the installation failed on something else), so the total "How reproducible?" stats are about 1/4.

After reviewing the logs from the must-gather, it looks like the server simply got congested and etcd stopped being reliable -- there are too many errors like these and the one in comment #11:

2020-02-21T16:35:08.696568454Z 2020-02-21 16:35:08.696558 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 1.167541657s)
2020-02-21T16:35:08.696600807Z 2020-02-21 16:35:08.696584 W | etcdserver: server is likely overloaded

Could you please attach the etcd pod logs (or a must-gather with those) from a cluster where things work normally? I'd like to see how frequently etcd complains there (if at all).

There are two disk managers in the vSphere cloud provider. One of them checks for the "the file or folder ... already exists" error and ignores it:
https://github.com/openshift/origin/blob/release-4.2/vendor/k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere/vclib/diskmanagers/vmdm.go#L136
However, the virtualDiskManager-based disk manager treats this error as a failure. Looks like an omission to me. (An illustrative sketch of this error-tolerant pattern is appended at the end of this report.)

Upstream PR: https://github.com/kubernetes/kubernetes/pull/90638

Now vSphere storage provisioning failed on 4.6.

Verified on 4.6.0-0.nightly-2020-08-31-220837.
Launched 5 scripts at the same time, each script creating 30 PVCs, and checked that all 150 PVCs were Bound. Tested 10 times and all PVCs were Bound, so changing the status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
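For illustration only, here is a minimal, self-contained Go sketch of the error-tolerant pattern discussed above: treating vSphere's "file or folder ... already exists" fault during virtual disk creation as success, so a racing provisioning attempt does not fail the PVC. The function names, the string-matching helper, and the example disk path are hypothetical and are not taken from the actual cloud-provider code or the upstream PR.

```
package main

import (
	"fmt"
	"strings"
)

// isAlreadyExists reports whether err looks like vSphere's
// "file or folder ... already exists" fault.
// (Hypothetical helper for this sketch; the real provider inspects the fault differently.)
func isAlreadyExists(err error) bool {
	return err != nil && strings.Contains(err.Error(), "already exists")
}

// createVirtualDisk stands in for the disk manager's Create call.
// Here it simulates the race: a concurrent provisioning attempt has
// already created the requested .vmdk file.
func createVirtualDisk(path string) error {
	return fmt.Errorf("file or folder %s already exists", path)
}

// ensureVirtualDisk creates the disk but treats "already exists" as success --
// the behaviour the vmdm.go disk manager already has and which the upstream
// fix brings to the other code path as well.
func ensureVirtualDisk(path string) error {
	if err := createVirtualDisk(path); err != nil {
		if isAlreadyExists(err) {
			// The disk is already in place; provisioning can proceed
			// instead of failing the PVC.
			return nil
		}
		return fmt.Errorf("failed to create virtual disk %s: %w", path, err)
	}
	return nil
}

func main() {
	if err := ensureVirtualDisk("[datastore] kubevols/example-pvc.vmdk"); err != nil {
		fmt.Println("provisioning failed:", err)
		return
	}
	fmt.Println("disk ensured; provisioning continues")
}
```

With this pattern, a second create call racing against the first becomes effectively idempotent rather than fatal.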