Bug 1760940
| Field | Value |
|---|---|
| Summary | VerifyControllerAttachedVolume in volume execution operator is slow in reporting success, delaying pod startup times |
| Product | OpenShift Container Platform |
| Component | Storage |
| Version | 4.2.0 |
| Target Release | 4.3.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | unspecified |
| Keywords | Performance |
| Whiteboard | ocs-monkey |
| Reporter | Shyamsundar <srangana> |
| Assignee | Hemant Kumar <hekumar> |
| QA Contact | Liang Xia <lxia> |
| CC | aos-bugs, aos-storage-staff, bbennett, ekuric, jespy, jstrunk, mifiedle |
| Type | Bug |
| Last Closed | 2019-11-26 21:02:35 UTC |
**Description (Shyamsundar, 2019-10-11 18:56:32 UTC)**
Created attachment 1627230 [details]: Patch diff, and deployment artifacts

Update on experiments with the two-year-old upstream PR:

PR ref: https://github.com/kubernetes/kubernetes/pull/45344

I ported the PR to the current code base; the resulting patch is attached to this comment (attach-verify.diff).

- The modification adds 2 events, to ease data collection from pod events rather than poking around in logs:
  - the first event reports successful completion of VerifyControllerAttachedVolume;
  - the second event reports successful completion of MountVolume.SetUp.
- The modification also changes the watch resourceVersion: I avoided setting it to 0, so the watch does not serve results from the local cache, per the comment here [1]. I am not sure whether this is meaningful, though.

Benchmarks:

I ran a simple deployment with 3 attached PVCs, scaled the deployment between 0 and 1 replicas, and measured the time reported by the events between AttachVolume.Attach and the newly added VerifyControllerAttachedVolume.Verify. This was run for 10 iterations for each version (a sketch of this kind of setup follows the comment):

- Version no-Fix: default hyperkube, with just the added events
- Version with-Fix: hyperkube with the attached patch (attach-verify.diff)
- The deployment YAML is included in the attachment (pod-deploy.yaml)
- The PVC YAML is also included in the attachment (pvc.yaml)
- Tests were run using ceph-rbd only for this benchmark

Results (numbers in seconds):

| Version | Min | Max | Median | StdDev | Mean |
|---|---|---|---|---|---|
| no-Fix | 1 | 16 | 8 | 4.98998998 | 7.7 |
| with-Fix | 1 | 8 | 4.5 | 2.859681412 | 4.2 |

The results do show an improvement. I believe this is because, without the fix, we depend on an exponential backoff, whereas with the fix verification is driven by a watch: comparing iterations, the version without the fix at times takes 16 seconds to return, whereas with the fix the maximum is 8 seconds.

Iterations and latency spread (numbers in seconds):

| Iteration | Verify time, no-Fix | Verify time, with-Fix |
|---|---|---|
| 1 | 8 | 1 |
| 2 | 4 | 3 |
| 3 | 16 | 2 |
| 4 | 8 | 8 |
| 5 | 1 | 1 |
| 6 | 8 | 7 |
| 7 | 8 | 7 |
| 8 | 16 | 1 |
| 9 | 4 | 6 |
| 10 | 4 | 6 |

I wanted to provide this data, and also to understand whether the current manner of detecting attach verification, which from my code reading relies on the updates in [2], is in fact correct, and whether the patch would introduce unwanted API-server load. Alternatively, are there ways to control the exponential backoff [3] so that retries happen faster and the outliers are reduced?

[1] Avoid reading from the local cache for a watch: https://github.com/openshift/origin/blob/bfbbe79bce6ae1d65059072eebb905a8e74eeb07/vendor/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L351-L358
[2] Kubelet node status updates: https://github.com/openshift/origin/blob/bfbbe79bce6ae1d65059072eebb905a8e74eeb07/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go#L396-L412
[3] Exponential backoff for attach verification:
- Set for an operation: https://github.com/openshift/origin/blob/bfbbe79bce6ae1d65059072eebb905a8e74eeb07/vendor/k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go#L137
- Defaults: https://github.com/openshift/origin/blob/bfbbe79bce6ae1d65059072eebb905a8e74eeb07/vendor/k8s.io/kubernetes/pkg/util/goroutinemap/exponentialbackoff/exponential_backoff.go#L26-L38
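The attached pod-deploy.yaml and pvc.yaml are not reproduced on this page, so the following is only a rough sketch of the kind of objects described above (a deployment with three ReadWriteOnce ceph-rbd PVCs, scaled between 0 and 1 replicas); the names, image, storage class, and sizes are placeholders, not the actual attachment contents.

```yaml
# Illustrative sketch only; names, image, storage class, and sizes are
# placeholders, not the contents of the attached pvc.yaml / pod-deploy.yaml.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: attach-test-pvc-1          # three such claims (attach-test-pvc-1..3) are assumed
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-rbd       # placeholder for an RBD-backed storage class
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: attach-test
spec:
  replicas: 1                      # scaled between 0 and 1 for each iteration
  selector:
    matchLabels:
      app: attach-test
  template:
    metadata:
      labels:
        app: attach-test
    spec:
      containers:
        - name: app
          image: registry.access.redhat.com/ubi8/ubi-minimal   # any small image works
          command: ["sleep", "3600"]
          volumeMounts:
            - { name: data-1, mountPath: /data-1 }
            - { name: data-2, mountPath: /data-2 }
            - { name: data-3, mountPath: /data-3 }
      volumes:
        - name: data-1
          persistentVolumeClaim: { claimName: attach-test-pvc-1 }
        - name: data-2
          persistentVolumeClaim: { claimName: attach-test-pvc-2 }
        - name: data-3
          persistentVolumeClaim: { claimName: attach-test-pvc-3 }
```

Scaling the deployment to 0 and back to 1 forces a fresh detach/attach cycle for each iteration, which is what the added events measure.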
**Comment 2 (Shyamsundar)**

Ceph-csi drivers do not support ControllerPublishVolume calls, and do not require the controller to ack a publish to a node. As a result, adding a CSIDriver object with "attachRequired" set to false circumvents the entire AttachVerify step that was slowing down the overall attach times and, as a result, the pod readiness times.

This helped overcome the problem reported in the bug (and is better than adding a watch per attach, etc.), as updated here: https://bugzilla.redhat.com/show_bug.cgi?id=1764637#c10

As it stands, we (OCS 4.x/Ceph-csi) do not require a fix for this problem. I will leave the closing of the bug to the OCP Storage team, in case this needs to be handled for other volume plugins, or based on other considerations.

**Reply to comment #2**

(In reply to Shyamsundar from comment #2)
> Ceph-csi drivers do not support ControllerPublishVolume calls, and do not
> require the controller to ack a publish to a node. As a result, adding a
> CSIDriver object with "attachRequired" set to false circumvents the entire
> AttachVerify step that was slowing down the overall attach times and, as a
> result, the pod readiness times.

As noted by Hemant here, https://bugzilla.redhat.com/show_bug.cgi?id=1764637#c20, and rectified for the Ceph CSI drivers here, https://github.com/rook/rook/pull/4332, we need to enforce attachRequired set to true, because otherwise the Kubernetes attach/detach controller cannot ensure that RWO volume types are attached to only a single node. Thus, fixing the problem by circumventing attach verification is not a possible resolution for the attach delays.

> This helped overcome the problem reported in the bug (and is better than
> adding a watch per attach, etc.), as updated here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1764637#c10
>
> As it stands, we (OCS 4.x/Ceph-csi) do not require a fix for this problem.
>
> I will leave the closing of the bug to the OCP Storage team, in case this
> needs to be handled for other volume plugins, or based on other
> considerations.

As noted above, we would need to look at this problem from an OCP/Kubernetes perspective, as the determined workaround is not valid for OCS. Basically, we still need help and attention here in order to enable faster attach performance.
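For reference, the CSIDriver object discussed in the two comments above would look roughly like the sketch below; the driver name is illustrative, and the comments summarize the trade-off described in this bug rather than documenting the object actually shipped for Ceph CSI.

```yaml
# Sketch of the CSIDriver object described in comment #2; the driver name is
# illustrative. attachRequired: false tells Kubernetes that no attach step is
# needed for this driver, which skips the AttachVerify wait discussed in this
# bug, but, as the reply above notes, it also gives up the attach/detach
# controller's guarantee that an RWO volume is attached to only one node,
# which is why Ceph CSI ultimately keeps it set to true.
apiVersion: storage.k8s.io/v1beta1   # the storage.k8s.io/v1 CSIDriver API came later
kind: CSIDriver
metadata:
  name: rbd.csi.ceph.com             # illustrative driver name
spec:
  attachRequired: false              # the rejected workaround; true is required for RWO enforcement
```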
**Comment 4 (Hemant Kumar)**

The thing is, the default interval at which the kubelet updates the node's status is 10s. So, technically, we can't go lower than that for the VerifyControllerAttachedVolume call. The only reason you would sometimes see a value lower than 10s is when VerifyControllerAttachedVolume is called immediately after the node's status is updated.

`NodeStatusUpdateFrequency` is itself configurable, so one can lower that value if one wants the node's status to be updated more frequently. But there is a cost to doing that, and it must be done with caution.

I am curious: how was the SLA for attach/detach set, and what would be an acceptable value?

**Reply to comment #4**

(In reply to Hemant Kumar from comment #4)
> The thing is, the default interval at which the kubelet updates the node's
> status is 10s. So, technically, we can't go lower than that for the
> VerifyControllerAttachedVolume call. The only reason you would sometimes see
> a value lower than 10s is when VerifyControllerAttachedVolume is called
> immediately after the node's status is updated.

Understood. There is also the exponential wait between checks for updates, which is why this goes up to 16 seconds at the maximum. The tests also show a distribution that closely aligns with the exponential waits.

> `NodeStatusUpdateFrequency` is itself configurable, so one can lower that
> value if one wants the node's status to be updated more frequently. But
> there is a cost to doing that, and it must be done with caution.
>
> I am curious: how was the SLA for attach/detach set, and what would be an
> acceptable value?

Understood; we probably do not want to flood the API server with updates more frequent than the current 10s.

The attach/detach SLA comes from hosted Che (hosted CRW: CodeReady Workspaces), where the user's page needs to load "fast", and one of the contributing factors is the attach rate (the other contributor is the recursive fsGroup and SELinux relabeling, tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1745773#c21). Acceptable values would be within 20s, as I understand CRW's needs. The slowness in this phase is present for both cloud-vendor-backed PVs (EBS) and Ceph.

I am not sure where to go from here. The possibilities I see are:

- Poll the local node cache more often?
  - This would reduce the maximum from 16s to at most 10s (reducing the outlier times).
  - It may further speed up the other exponential maxima we see (IOW, 8s could become anything between 4s and 8s).
  - Can we be greedy here? Since there are no calls outside the node, only attempts to read the node cache, could we retry every 1s?
- Does a watch for node changes, instead of a GET every 10 seconds, help?
  - I am totally unsure whether this also floods the API servers at scale, so it is just a thought.
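For reference, `NodeStatusUpdateFrequency` from comment #4 is the following knob in the kubelet configuration. This is a minimal sketch: 10s is the default, and any lower value is an assumption to be weighed against the extra API-server load cautioned about above.

```yaml
# Sketch only: the node-status update interval in the kubelet's configuration.
# 10s is the default; comment #4 identifies this interval as the effective
# floor for VerifyControllerAttachedVolume, and lowering it trades faster
# attach verification for more frequent node-status writes to the API server.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"   # default; e.g. "4s" would be a lower, more API-heavy setting
```

On OpenShift 4 this setting would normally be rolled out through a KubeletConfig custom resource handled by the Machine Config Operator rather than by editing the kubelet configuration file directly; either way, the caution from comment #4 applies.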