
Bug 1760940

Summary: VerifyControllerAttachedVolume in the volume operation executor is slow to report success, delaying pod startup times
Product: OpenShift Container Platform
Component: Storage
Version: 4.2.0
Reporter: Shyamsundar <srangana>
Assignee: Hemant Kumar <hekumar>
QA Contact: Liang Xia <lxia>
CC: aos-bugs, aos-storage-staff, bbennett, ekuric, jespy, jstrunk, mifiedle
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Keywords: Performance
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: ocs-monkey
Type: Bug
Last Closed: 2019-11-26 21:02:35 UTC
Attachments:
- journal logs from node where pod with PVC is deployed
- Patch diff, and deployment artifacts

Description Shyamsundar 2019-10-11 18:56:32 UTC
Created attachment 1624836 [details]
journal logs from node where pod with PVC is deployed

Description of problem:

The volume operation executor takes a varying amount of time because VerifyControllerAttachedVolume has to retry until it can establish that the volume to be mounted is reported as in use by the node attempting the attach.

Volume attach, when tested with CSI-ebs and CSI-ceph-rbd, takes anywhere between 12 and 30 seconds on an AWS setup, with the attach verification causing most of the delay and variance in the time for a pod to reach the Running state.

On checking the journal logs of the node and correlating them with the events in the namespace, it is seen that VerifyControllerAttachedVolume takes a variable amount of time, which adds to the overall attach time of the volume. This variability seems to stem from [1], which explains in its comments the following:

// If the given volume has not yet been added to the list of
// VolumesInUse in the node's volume status, do not proceed, return
// error. Caller will log and retry. The node status is updated
// periodically by kubelet, so it may take as much as 10 seconds
// before this clears.
// Issue #28141 to enable on demand status updates.

[1] VerifyControllerAttachedVolume failures: https://github.com/openshift/origin/blob/bad6c078765ee144437afb9cf33494b2203e1cd4/vendor/k8s.io/kubernetes/pkg/volume/util/operationexecutor/operation_generator.go#L1311-L1319
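
To make the mechanism concrete, the following is a minimal, standalone Go sketch (not the actual Kubernetes source) of the pattern described above: the mount path only proceeds once the volume appears in the node's VolumesInUse status, which the kubelet refreshes periodically, so the operation executor keeps retrying with an exponential backoff in the meantime. The simulated 10s refresh and the 500ms initial retry are taken from the quoted code comment and the journal logs below; everything else is illustrative.

package main

import (
	"errors"
	"fmt"
	"time"
)

// volumeID is an illustrative unique volume name (not one from this report's data).
const volumeID = "kubernetes.io/csi/example.driver^pvc-example"

// volumesInUse stands in for node.Status.VolumesInUse. To mimic the periodic
// kubelet node-status update, the volume only "appears" 10s after attachTime.
func volumesInUse(attachTime time.Time) []string {
	if time.Since(attachTime) >= 10*time.Second {
		return []string{volumeID}
	}
	return nil
}

// verifyControllerAttachedVolume mirrors the check quoted above: if the volume
// is not yet listed as in use by the node, return an error so the caller retries.
func verifyControllerAttachedVolume(attachTime time.Time) error {
	for _, v := range volumesInUse(attachTime) {
		if v == volumeID {
			return nil
		}
	}
	return errors.New("volume has not been added to the list of VolumesInUse in the node's volume status")
}

func main() {
	attachTime := time.Now()
	delay := 500 * time.Millisecond // matches the "durationBeforeRetry 500ms" seen in the journal logs
	for verifyControllerAttachedVolume(attachTime) != nil {
		time.Sleep(delay)
		delay *= 2 // 0.5s, 1s, 2s, 4s, 8s, ... hence the wait clusters around 8-16s
	}
	fmt.Printf("verification succeeded after %v\n", time.Since(attachTime).Round(time.Millisecond))
}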

Version-Release number of selected component (if applicable):

$ oc version
Client Version: v4.2.0-alpha.0-87-gd5465d7
Server Version: 4.2.0-0.ci-2019-10-10-124512
Kubernetes Version: v1.14.6+bad6c07

How reproducible: Always

Steps to Reproduce:

1. Create a PVC using gp2/csi-ebs/csi-rbd as the storage class
  - In the provided data, the PVC is created using the ceph-rbd CSI storage class

2. Start tailing journals of the nodes for the PV name claimed by the PVC (or a single node as stated in (3)) (e.g: [root@ip-10-0-175-29 ~]# journalctl -f | grep pvc-49afc145-ec49-11e9-bfc2-0631cd325e42)

3. Start watching events in the test namespace where a deployment using the PVC is to be started (e.g: oc get events -n sample -o json -w --watch-only)

4. Start a deployment using the above created PVC
  - The deployment is crafted with a specific node selector (in this case ip-10-0-175-29) to ease journal collection and correlation

5. Change the replica count of the deployment to 0, and then later to 1, to repeat the test and log/event collection

6. Note the time taken for VerifyControllerAttachedVolume and correlate it with the total time taken for the deployment to reach the Running state

Actual results:

Logs/events from the tests are provided below ("Log and event analysis:"). The intent of the summary below is to show how the time taken in the VerifyControllerAttachedVolume phase directly translates to the overall time for the pod to reach the Running state.

- There are 2 cases provided in the log and event analysis below: a "Slow example" and a "Fast example"

NOTE: Time analysis uses event and log timestamps
- In the slow example, the time taken
  - to complete VerifyControllerAttachedVolume is 15.6 seconds
  - to publish the volume, by csi-ceph-rbd (NodeStage and NodePublish), is 0.9 seconds
  - between the 2 events of interest is 25 seconds

- In the fast example, the time taken
  - to complete VerifyControllerAttachedVolume is 3.5 seconds
  - to publish the volume, by csi-ceph-rbd (NodeStage and NodePublish), is 5.9 seconds
  - between the 2 events of interest is 17 seconds

Expected results:

The expectation or ask is that VerifyControllerAttachedVolume complete faster, or at least with a lower deviation from its mean time.

The real need/use-case: when a PVC is used by a hosted Che service (provisioned using ebs-gp2 or ceph-rbd), the time to start a pod directly translates to the time a user has to wait in the browser. Hence, optimizations here would reduce that wait time for the user.

Following issue #28141 in Kubernetes leads to the PR https://github.com/kubernetes/kubernetes/pull/45344, which seems to have the right idea for speeding up this phase. Hence the further ask: can this be fixed to watch for the node status change instead of waiting for the node status to be updated periodically?
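
For illustration, a rough sketch of what a watch-based check could look like, reacting to the node status update as it arrives instead of re-reading the node on a backoff. This is not the PR's actual implementation; it assumes a recent client-go (context-taking Watch signature), and NODE_NAME, VOLUME_ID, and KUBECONFIG are hypothetical environment variables used only to keep the sketch self-contained.

package main

import (
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := os.Getenv("NODE_NAME") // e.g. the node the pod is scheduled to
	volumeID := os.Getenv("VOLUME_ID") // the unique volume name being verified

	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch only this node and react to status updates as they happen, instead of
	// polling the node object on an exponential backoff. A production version would
	// first check the current status and then watch from that resourceVersion so an
	// update that already happened is not missed.
	w, err := clientset.CoreV1().Nodes().Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		node, ok := event.Object.(*corev1.Node)
		if !ok {
			continue
		}
		for _, v := range node.Status.VolumesInUse {
			if string(v) == volumeID {
				fmt.Println("volume listed in VolumesInUse; safe to proceed with MountVolume")
				return
			}
		}
	}
}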

Log and event analysis:

- Slow example:

Events from # oc get events -n sample -o json -w --watch-only
Logs interspersed from [root@ip-10-0-175-29 ~]# journalctl -f | grep pvc-49afc145-ec49-11e9-bfc2-0631cd325e42

Event: {
<certain fields snipped for terseness>
    "firstTimestamp": "2019-10-11T17:56:49Z",
    "involvedObject": {
        "kind": "Pod",
        "name": "mypod-deployment-cdf545855-d9m82",
        "namespace": "sample",
    },
    "kind": "Event",
    "lastTimestamp": "2019-10-11T17:56:49Z",
    "message": "AttachVolume.Attach succeeded for volume \"pvc-49afc145-ec49-11e9-bfc2-0631cd325e42\" ",
    "reason": "SuccessfulAttachVolume",
}

Logs:
Oct 11 17:56:50 ip-10-0-175-29 hyperkube[1159]: I1011 17:56:50.074444    1159 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-d9m82" (UID: "7fe475a8-ec50-11e9-bfc2-0631cd325e42")

Oct 11 17:56:50 ip-10-0-175-29 hyperkube[1159]: E1011 17:56:50.074564    1159 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e\"" failed. No retries permitted until 2019-10-11 17:56:50.574519772 +0000 UTC m=+21448.842259794 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-49afc145-ec49-11e9-bfc2-0631cd325e42\" (UniqueName: \"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e\") pod \"mypod-deployment-cdf545855-d9m82\" (UID: \"7fe475a8-ec50-11e9-bfc2-0631cd325e42\") "

<snip repeated logs as above>

Oct 11 17:57:05 ip-10-0-175-29 hyperkube[1159]: I1011 17:57:05.678078    1159 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-d9m82" (UID: "7fe475a8-ec50-11e9-bfc2-0631cd325e42")

Oct 11 17:57:05 ip-10-0-175-29 hyperkube[1159]: I1011 17:57:05.684655    1159 operation_generator.go:1339] Controller attach succeeded for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-d9m82" (UID: "7fe475a8-ec50-11e9-bfc2-0631cd325e42") device path: "csi-3cbdda62d2e74511aa38e58d91e13b1950ce65b0d659744d254944099fa2c424"

<entering CSI Stage and Publish steps>

Oct 11 17:57:05 ip-10-0-175-29 hyperkube[1159]: I1011 17:57:05.778634    1159 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-d9m82" (UID: "7fe475a8-ec50-11e9-bfc2-0631cd325e42")

Oct 11 17:57:06 ip-10-0-175-29 hyperkube[1159]: I1011 17:57:06.629125    1159 operation_generator.go:667] MountVolume.SetUp succeeded for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-d9m82" (UID: "7fe475a8-ec50-11e9-bfc2-0631cd325e42"

<CSI operations done>

Next event: {
<certain fields snipped for terseness>
    "firstTimestamp": "2019-10-11T17:57:14Z",
    "involvedObject": {
        "kind": "Pod",
        "name": "mypod-deployment-cdf545855-d9m82",
        "namespace": "sample",
    },
    "kind": "Event",
    "lastTimestamp": "2019-10-11T17:57:14Z",
    "message": "Container image \"quay.io/shyamsundarr/filecounts:test\" already present on machine",
    "reason": "Pulled",
}

- Fast example:

Events from # oc get events -n sample -o json -w --watch-only
Logs interspersed from [root@ip-10-0-175-29 ~]# journalctl -f | grep pvc-49afc145-ec49-11e9-bfc2-0631cd325e42

Event: {
<certain fields snipped for terseness>
    "firstTimestamp": "2019-10-11T18:07:26Z",
    "involvedObject": {
        "kind": "Pod",
        "name": "mypod-deployment-cdf545855-xh9p2",
        "namespace": "sample",
    },
    "kind": "Event",
    "lastTimestamp": "2019-10-11T18:07:26Z",
    "message": "AttachVolume.Attach succeeded for volume \"pvc-49afc145-ec49-11e9-bfc2-0631cd325e42\" ",
    "reason": "SuccessfulAttachVolume",
}

Logs:
Oct 11 18:07:26 ip-10-0-175-29 hyperkube[1159]: I1011 18:07:26.445344    1159 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-xh9p2" (UID: "fb302196-ec51-11e9-bfc2-0631cd325e42")

Oct 11 18:07:26 ip-10-0-175-29 hyperkube[1159]: E1011 18:07:26.445476    1159 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e\"" failed. No retries permitted until 2019-10-11 18:07:26.945429071 +0000 UTC m=+22085.213169065 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-49afc145-ec49-11e9-bfc2-0631cd325e42\" (UniqueName: \"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e\") pod \"mypod-deployment-cdf545855-xh9p2\" (UID: \"fb302196-ec51-11e9-bfc2-0631cd325e42\") "

<snip repeated logs as above>

Oct 11 18:07:29 ip-10-0-175-29 hyperkube[1159]: I1011 18:07:29.987908    1159 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-xh9p2" (UID: "fb302196-ec51-11e9-bfc2-0631cd325e42")

Oct 11 18:07:29 ip-10-0-175-29 hyperkube[1159]: I1011 18:07:29.994428    1159 operation_generator.go:1339] Controller attach succeeded for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-xh9p2" (UID: "fb302196-ec51-11e9-bfc2-0631cd325e42") device path: "csi-3cbdda62d2e74511aa38e58d91e13b1950ce65b0d659744d254944099fa2c424"

<entering CSI Stage and Publish steps>

Oct 11 18:07:30 ip-10-0-175-29 hyperkube[1159]: I1011 18:07:30.088397    1159 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-xh9p2" (UID: "fb302196-ec51-11e9-bfc2-0631cd325e42")

Oct 11 18:07:35 ip-10-0-175-29 hyperkube[1159]: I1011 18:07:35.969460    1159 operation_generator.go:667] MountVolume.SetUp succeeded for volume "pvc-49afc145-ec49-11e9-bfc2-0631cd325e42" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e") pod "mypod-deployment-cdf545855-xh9p2" (UID: "fb302196-ec51-11e9-bfc2-0631cd325e42")

<CSI operations done>

Next event: {
<certain fields snipped for terseness>
    "firstTimestamp": "2019-10-11T18:07:43Z",
    "involvedObject": {
        "kind": "Pod",
        "name": "mypod-deployment-cdf545855-xh9p2",
        "namespace": "sample",
    },
    "kind": "Event",
    "lastTimestamp": "2019-10-11T18:07:43Z",
    "message": "Container image \"quay.io/shyamsundarr/filecounts:test\" already present on machine",
    "reason": "Pulled",
}

Node Log (of failed PODs):

Relevant logs mentioned in "Log and event analysis" section. Full node log attached.

PV Dump:

$ oc get pv pvc-49afc145-ec49-11e9-bfc2-0631cd325e42 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2019-10-11T17:05:13Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-49afc145-ec49-11e9-bfc2-0631cd325e42
  resourceVersion: "145400"
  selfLink: /api/v1/persistentvolumes/pvc-49afc145-ec49-11e9-bfc2-0631cd325e42
  uid: 4a5a8cf1-ec49-11e9-bfc2-0631cd325e42
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mypvc
    namespace: sample
    resourceVersion: "145380"
    uid: 49afc145-ec49-11e9-bfc2-0631cd325e42
  csi:
    driver: openshift-storage.rbd.csi.ceph.com
    fsType: ext4
    nodeStageSecretRef:
      name: rook-ceph-csi
      namespace: openshift-storage
    volumeAttributes:
      clusterID: openshift-storage
      imageFeatures: layering
      imageFormat: "2"
      pool: openshift-storage-cephblockpool
      storage.kubernetes.io/csiProvisionerIdentity: 1570805008458-8081-openshift-storage.rbd.csi.ceph.com
    volumeHandle: 0001-0011-openshift-storage-0000000000000001-49c88f2e-ec49-11e9-9457-0a580a80020e
  persistentVolumeReclaimPolicy: Delete
  storageClassName: openshift-storage-ceph-rbd
  volumeMode: Filesystem
status:
  phase: Bound

PVC Dump:

$ oc get pvc -n sample -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    creationTimestamp: "2019-10-11T17:05:12Z"
    finalizers:
    - kubernetes.io/pvc-protection
    name: mypvc
    namespace: sample
    resourceVersion: "145402"
    selfLink: /api/v1/namespaces/sample/persistentvolumeclaims/mypvc
    uid: 49afc145-ec49-11e9-bfc2-0631cd325e42
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
    storageClassName: openshift-storage-ceph-rbd
    volumeMode: Filesystem
    volumeName: pvc-49afc145-ec49-11e9-bfc2-0631cd325e42
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 50Gi
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

StorageClass Dump (if StorageClass used by PV/PVC):

$ oc get sc openshift-storage-ceph-rbd -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2019-10-11T14:43:17Z"
  name: openshift-storage-ceph-rbd
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: StorageClusterInitialization
    name: openshift-storage
    uid: 767e1e6c-ec35-11e9-a527-0a9b00bcb2d4
  resourceVersion: "82898"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/openshift-storage-ceph-rbd
  uid: 76ab28b3-ec35-11e9-a527-0a9b00bcb2d4
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/node-stage-secret-name: rook-ceph-csi
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
  csi.storage.k8s.io/provisioner-secret-name: rook-ceph-csi
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  imageFeatures: layering
  imageFormat: "2"
  pool: openshift-storage-cephblockpool
provisioner: openshift-storage.rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

Additional info:

- Deployment used for testing:

$ oc get deployment -n sample mypod-deployment -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "5"
  creationTimestamp: "2019-10-11T17:05:17Z"
  generation: 27
  labels:
    app: mypod
  name: mypod-deployment
  namespace: sample
  resourceVersion: "168419"
  selfLink: /apis/extensions/v1beta1/namespaces/sample/deployments/mypod-deployment
  uid: 4c75d344-ec49-11e9-a527-0a9b00bcb2d4
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mypod
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: mypod
    spec:
      containers:
      - args:
        - -testdir=/mnt
        - -filecount=2
        - -dircount=1
        - -dirdepth=1
        - -tsr=true
        image: quay.io/shyamsundarr/filecounts:test
        imagePullPolicy: IfNotPresent
        name: attach-time-tester
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt
          name: data
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/hostname: ip-10-0-175-29
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 0
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: mypvc
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2019-10-11T17:05:17Z"
    lastUpdateTime: "2019-10-11T17:48:35Z"
    message: ReplicaSet "mypod-deployment-cdf545855" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-10-11T18:00:29Z"
    lastUpdateTime: "2019-10-11T18:00:29Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 27
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Comment 1 Shyamsundar 2019-10-18 16:12:38 UTC
Created attachment 1627230 [details]
Patch diff, and deployment artifacts

Update on experiments with the 2-year-old upstream PR:

PR ref: https://github.com/kubernetes/kubernetes/pull/45344

I modified the PR to port it to the current code base; the patch is attached to this comment (attach-verify.diff).
- The modification includes 2 additional events, to ease data collection from pod events rather than poking around in logs
  - The first event added reports successful VerifyControllerAttachedVolume completion
  - The second event added reports successful MountVolume.SetUp
- The modification includes one further change to the watch resourceVersion: I excluded setting this to 0, to avoid cache returns as per the comment here [1]. Unsure if this is meaningful though.

Benchmarks:
Ran a simple deployment with 3 attached PVCs, scaled the deployment between 0 and 1 replicas, and measured the time reported by the events between AttachVolume.Attach and the newly added VerifyControllerAttachedVolume.Verify. This was run for 10 iterations for each version,
  - Version: no-Fix, default hyperkube, with just the added events
  - Version: with-Fix, hyperkube with the attached patch (attach-verify.diff)
  - The deployment yaml is included in the attachment (pod-deploy.yaml)
  - The PVC yaml is also included in the attachment (pvc.yaml)
  - Tests were run using ceph-rbd only for this benchmark

Results:
- Numbers in seconds
  Version    Min   Max   Median   StdDev        Mean
  no-Fix     1     16    8        4.98998998    7.7
  with-Fix   1     8     4.5      2.859681412   4.2

The results do show an improvement. I believe this is because, without the fix, we depend on an exponential backoff, whereas with the fix this is moved to a watch; comparing iterations, the version without the fix takes 16 seconds to return at times, whereas with the fix the maximum is 8 seconds.

Iterations and latency spread:
- Numbers in seconds
Iteration Verify-time-no-Fix  Verify-time-with-Fix
1         8                   1
2         4                   3
3         16                  2
4         8                   8
5         1                   1
6         8                   7
7         8                   7
8         16                  1
9         4                   6
10        4                   6

I wanted to provide this data, and also to understand whether the current manner of detecting attach verification (based on my code reading, where the updates come from [2]) is the correct one, and whether the patch would introduce unwanted API server load. Or, are there alternatives where the exponential backoff can be controlled [3], to provide a faster retry and reduce the outliers?

[1] Avoid reading from local cache for a watch: https://github.com/openshift/origin/blob/bfbbe79bce6ae1d65059072eebb905a8e74eeb07/vendor/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L351-L358

[2] Kubelet node status updates: https://github.com/openshift/origin/blob/bfbbe79bce6ae1d65059072eebb905a8e74eeb07/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go#L396-L412

[3] Exponential backoff for attach verification:
     - Set for an operation: https://github.com/openshift/origin/blob/bfbbe79bce6ae1d65059072eebb905a8e74eeb07/vendor/k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go#L137
     - Defaults to: https://github.com/openshift/origin/blob/bfbbe79bce6ae1d65059072eebb905a8e74eeb07/vendor/k8s.io/kubernetes/pkg/util/goroutinemap/exponentialbackoff/exponential_backoff.go#L26-L38
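
For intuition on why the no-Fix numbers top out around 16 seconds: with the 500ms initial durationBeforeRetry seen in the journal logs and a doubling backoff (as in the defaults linked in [3]), the cumulative wait grows roughly as in this small sketch. This is illustrative arithmetic, not the actual nestedpendingoperations code.

package main

import (
	"fmt"
	"time"
)

func main() {
	delay := 500 * time.Millisecond // initial durationBeforeRetry, per the journal logs
	var elapsed time.Duration
	for attempt := 1; attempt <= 6; attempt++ {
		elapsed += delay
		fmt.Printf("retry %d fires ~%v after the first failure\n", attempt, elapsed)
		delay *= 2 // doubling backoff
	}
	// Retry 5 fires at ~15.5s and retry 6 at ~31.5s, so a volume that just misses
	// one 10s node-status refresh is typically not rechecked until roughly 15-16s
	// have elapsed, matching the observed outliers.
}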

Comment 2 Shyamsundar 2019-10-31 16:10:57 UTC
Ceph-csi drivers do not support ControllerPublishVolume calls, and do not require the controller to ack a publish to a node. As a result, adding a CSIDriver object with "attachRequired" set to false circumvents the entire AttachVerify step that was slowing down the overall attach times and, in turn, the pod readiness times.

This helped overcome the problem reported in the bug (better than adding a watch per attach, etc.), as updated here: https://bugzilla.redhat.com/show_bug.cgi?id=1764637#c10

As it stands, we (OCS 4.x/Ceph-csi) do not require a fix for this problem.

I will leave the closing of the bug to the OCP Storage team, in case this needs to be handled for other volume plugins, or based on other considerations.
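
For reference, a minimal sketch of the workaround described in this comment. It is hedged: it assumes a recent client-go and the GA storage.k8s.io/v1 CSIDriver API rather than the v1beta1 API available on the cluster in this report, and only the driver name is taken from the StorageClass dump above. Note that comment 3 below explains why attachRequired ultimately has to stay true for this driver.

package main

import (
	"context"
	"os"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// attachRequired=false declares that the driver needs no ControllerPublishVolume,
	// so the attach and attach-verification phase is skipped for its volumes.
	attachRequired := false
	driver := &storagev1.CSIDriver{
		ObjectMeta: metav1.ObjectMeta{Name: "openshift-storage.rbd.csi.ceph.com"},
		Spec:       storagev1.CSIDriverSpec{AttachRequired: &attachRequired},
	}
	if _, err := clientset.StorageV1().CSIDrivers().Create(context.TODO(), driver, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}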

Comment 3 Shyamsundar 2019-11-17 13:54:59 UTC
(In reply to Shyamsundar from comment #2)
> Ceph-csi drivers do not support ControllerPublishVolume calls, and do not
> require the controller to ack a publish to a node. As a result, adding a
> CSIDriver object with "attachRequired" set to false circumvents the entire
> AttachVerify step that was slowing down the overall attach times and, in
> turn, the pod readiness times.

As noted by Hemant here, https://bugzilla.redhat.com/show_bug.cgi?id=1764637#c20, and rectified for the Ceph CSI drivers here, https://github.com/rook/rook/pull/4332, we need to enforce attachRequired as true, because otherwise the Kubernetes AD (attach/detach) controller does not help ensure that RWO volume types are attached to only a single node.

Thus, this line of fixing the problem by circumventing the attach verification is not a viable resolution for the attach delays.

> 
> This helped overcome the problem reported in the bug (better than adding a
> watch per attach, etc.), as updated here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1764637#c10
> 
> As it stands, we (OCS 4.x/Ceph-csi) do not require a fix for this problem.
> 
> I will leave the closing of the bug to the OCP Storage team, in case this
> needs to be handled for other volume plugins, or based on other considerations.

As noted above, we would need to look at this problem from an OCP/Kubernetes perspective, as the determined workaround is not valid for OCS. Basically, we still need help/attention here in order to enable faster attach performance.

Comment 4 Hemant Kumar 2019-11-18 18:30:05 UTC
The thing is, the default interval at which the kubelet updates the node's status is 10s. So, technically, we can't go lower than that for the VerifyControllerAttachedVolume call. The only reason you would sometimes see a value lower than 10s is when VerifyControllerAttachedVolume is called immediately after the node's status is updated.

`NodeStatusUpdateFrequency` is itself configurable, so one can lower that value if one wants the node's status to be updated more frequently. But there is a cost to doing that, and it must be done with caution. I am curious: how was the SLA set for attach/detach, and what would be an acceptable value?

Comment 5 Shyamsundar 2019-11-19 15:42:49 UTC
(In reply to Hemant Kumar from comment #4)
> The thing is, the default interval at which the kubelet updates the node's
> status is 10s. So, technically, we can't go lower than that for the
> VerifyControllerAttachedVolume call. The only reason you would sometimes see
> a value lower than 10s is when VerifyControllerAttachedVolume is called
> immediately after the node's status is updated.

Understood; there is also the exponential wait between checks for updates, and hence at maximum this goes to 16 seconds.

Tests also show a distribution that closely aligns with the exponential waits.

> 
> `NodeStatusUpdateFrequency` is itself configurable, so one can lower that
> value if one wants the node's status to be updated more frequently. But there
> is a cost to doing that, and it must be done with caution. I am curious: how
> was the SLA set for attach/detach, and what would be an acceptable value?

Understood; we possibly do not want to flood the API server at frequencies higher than the current 10s.

The attach/detach SLA comes from hosted Che (hosted CRW: CodeReady Workspaces), where the user's page needs to load "fast", and one of the contributing factors here is the attach rate (the other contributor is the fsGroup and SELinux recursive setting tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1745773#c21).

Acceptable values would be within 20s, as I understand CRW's needs.

The issue/slowness in this phase is present for both cloud-vendor-backed PVs (EBS) and Ceph.

I am not sure where to go from here; in terms of possibilities, what I see is as follows:
- Poll the local node cache more often?
  - This will help reduce the MAX from 16s to closer to 10s (reducing outlier times)
  - It may further speed up the other exponential maxima that we see (IOW, 8s could become anything between 4-8s)
  - Can we be greedy here, as there are no calls outside of the node, only attempts to read the node cache, and hence try every 1s?
- Does a watch for node changes, instead of a get every 10 seconds, help?
  - Totally unsure if this would also flood the API servers at scale, so just a thought