Description of problem:
Some PVs fail to be deleted when deleting test projects. On the EC2 web console, the volumes are still attached to the instance.

Version-Release number of selected component (if applicable):
root@ip-172-31-0-221: ~ # yum list installed | grep openshift
atomic-openshift.x86_64  3.9.0-0.22.0.git.0.d4658fb.el7

How reproducible:
Very likely. Encountered at least 3 times in a 3.9 cluster.

Steps to Reproduce:
1. Enable scheduling on only one compute node in the cluster. This step is not strictly necessary; we do it here to show the problem more easily.
2. Create 39 projects, each with a PVC dynamically provisioned and attached to a pod, then label those projects with purpose=test (oc label namespace <project_name> purpose=test). Wait until all the pods are running.
3. Delete all test projects: oc delete project -l purpose=test; watch the results of "oc get pv --no-headers". Some PVs stay there forever (in the test, I usually waited for 1 hour).

Actual results:
We can repeat steps 2 and 3 and will see more and more PVs accumulate.

Expected results:
The PVs are deleted and detached from the instance.

Master Log:
Node Log (of failed PODs):
PV Dump:
PVC Dump:
StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
The master log and the node log will be provided. This is the description of one of the PVs:

root@ip-172-31-0-221: ~ # oc describe pv
Name:            pvc-6119afd9-ff91-11e7-ba2c-024c74e10c04
Labels:          failure-domain.beta.kubernetes.io/region=us-west-2
                 failure-domain.beta.kubernetes.io/zone=us-west-2b
Annotations:     kubernetes.io/createdby=aws-ebs-dynamic-provisioner
                 pv.kubernetes.io/bound-by-controller=yes
                 pv.kubernetes.io/provisioned-by=kubernetes.io/aws-ebs
StorageClass:    gp2
Status:          Failed
Claim:           fiotestia8/pvc4hdjm8m76c
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        1Gi
Message:         Error deleting EBS volume "vol-053e38dfab9143549" since volume is currently attached to "i-0f2140c2b4b59222e"
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-west-2b/vol-053e38dfab9143549
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:
  Type     Reason              Age  From                         Message
  ----     ------              ---  ----                         -------
  Warning  VolumeFailedDelete  39m  persistentvolume-controller  Error deleting EBS volume "vol-053e38dfab9143549" since volume is currently attached to "i-0f2140c2b4b59222e"
The master log contains these warnings:

    Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)

Meaning we hit the AWS API call quota...
(In reply to Tomas Smetana from comment #2)
> The master log contains these warnings:
>
> Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
>
> Meaning we hit the AWS API call quota...

From a user's point of view, what is the impact on the cluster?
I'll try to reproduce the problem to see whether this is simply a case of an overloaded system and why exactly the volumes were not detached: i.e., whether OpenShift tried to detach them but could not because of the API quota, or whether the API call storm triggered some bug.
Did you try to wait "long enough"? The detacher tries periodically to detach the volume and the deleter tries to delete it periodically too. It should cure itself in a few minutes. Or dozens of minutes...
Oh, I see, waited for 1 hour. That should be enough. Sorry for the noise.
(In reply to Jan Safranek from comment #6)
> Oh, I see, waited for 1 hour. That should be enough. Sorry for the noise.

Yes I did. Actually, even more. I repeated steps 2 and 3 several times; it lasted over 3 hours. Even though the most recently created PVs were deleted, the first leftovers were still there.
I've tried to reproduce this several times but failed (cleanup of the PVs took quite long -- ~20 minutes -- but they were all eventually detached). I also failed to get the "RequestLimitExceeded" warning, so I guess those two are related. If anybody ever happens to encounter something similar: could you please also capture the controllers log? It might reveal more...
I am in training this week. I will try to reproduce this again early next week with the controllers log.
@Tomas, thanks for the reply. Good to know that we have a potential cause for this bz. Let me know if anything needs to be done from my side.
What I think is going on here (I'll try to describe it so someone can correct me in case I missed something). Once the volume is unmounted, the controller starts detaching it:

-> OperationExecutor.DetachVolume()
-> OperationGenerator.GenerateDetachVolumeFunc() (gets the AWS plugin, calls NewDetacher() to obtain the AWS detacher)
-> detacher.Detach()              // this is the AWS plugin detacher
-> aws.Volumes.DetachDisk()       // calling the cloudprovider
-> Cloud.checkIfAttachedToNode()
-> awsDisk.describeVolume()       (fails with RequestLimitExceeded on ec2::DescribeVolumes, and things start breaking here)

The checkIfAttachedToNode() method returns three values: the disk information obtained by describeVolume, whether the disk is attached to the specified node, and an eventual error. In case the describeVolume call returns an error, it does this:

    if err != nil {
        describeError := fmt.Errorf("Error describing volume %s with %v", diskName, err)
        glog.Warning(describeError)
        awsDiskInfo.volumeState = "unknown"
        return awsDiskInfo, false, describeError
    }

So it sets diskInfo.volumeState to "unknown" but leaves the rest of diskInfo empty, and tells the caller the volume is not attached to the node... The caller (the AWS cloudprovider's DetachDisk) finds no instance information in the diskInfo and concludes the volume is not attached to any instance, so it just returns immediately with no error. This looks like a successful detach to the AWS plugin, hence the "DetachVolume.Detach succeeded". The volume is marked as detached in the controller's ActualStateOfWorld (it will never try to detach it again) and the PV controller starts deleting it. This fails since the volume is still attached to the instance: DetachDisk silently returned without instructing AWS to really detach the volume. The PV controller calls the AWS cloudprovider's DeleteDisk, which again tries to call describeInstance and hits the quota from time to time. Nevertheless, it also checks whether the volume can be deleted and of course finds out it can't...

The PV controller's logic seems to be safe since it would retry when hitting the API quota for describeVolume. We need to make sure the volume really gets detached, though.
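To make that failure mode easier to follow, here is a minimal, self-contained Go sketch of the logic described above. The type and function names are simplified stand-ins for illustration only, not the actual cloudprovider code:

package main

import (
	"errors"
	"fmt"
)

// diskInfo is a simplified stand-in for the cloudprovider's awsDiskInfo.
type diskInfo struct {
	volumeState string
	attachedTo  string // empty means "no instance information"
}

// checkIfAttachedToNode mimics the error path quoted above: on a
// DescribeVolumes failure it returns a mostly empty diskInfo,
// attached == false, and a non-nil error.
func checkIfAttachedToNode(describeErr error) (*diskInfo, bool, error) {
	if describeErr != nil {
		return &diskInfo{volumeState: "unknown"}, false, describeErr
	}
	return &diskInfo{volumeState: "in-use", attachedTo: "i-0f2140c2b4b59222e"}, true, nil
}

// detachDisk mimics the buggy caller: it bails out only when diskInfo is nil,
// so the API-quota error is swallowed and the "detach" reports success.
func detachDisk(describeErr error) error {
	info, attached, err := checkIfAttachedToNode(describeErr)
	if info == nil { // pre-fix condition; the fix in the next comment checks err != nil instead
		return err
	}
	if !attached {
		// No instance information, "not attached" -- looks like a successful
		// detach, although no EC2 DetachVolume call was ever made.
		fmt.Println("volume not attached to node, nothing to do")
		return nil
	}
	fmt.Println("issuing the real EC2 DetachVolume call")
	return nil
}

func main() {
	// Simulate RequestLimitExceeded on ec2::DescribeVolumes: detachDisk reports
	// success, the controller marks the volume detached, and PV deletion then
	// fails forever because the volume is in fact still attached.
	_ = detachDisk(errors.New("RequestLimitExceeded"))
}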
I forgot to add: I think this should be enough to fix the problem:

--- a/vendor/k8s.io/kubernetes/pkg/cloudprovider/providers/aws/aws.go
+++ b/vendor/k8s.io/kubernetes/pkg/cloudprovider/providers/aws/aws.go
@@ -2022,7 +2022,7 @@ func (c *Cloud) AttachDisk(diskName KubernetesVolumeID, nodeName types.NodeName,
 // DetachDisk implements Volumes.DetachDisk
 func (c *Cloud) DetachDisk(diskName KubernetesVolumeID, nodeName types.NodeName) (string, error) {
 	diskInfo, attached, err := c.checkIfAttachedToNode(diskName, nodeName)
-	if diskInfo == nil {
+	if err != nil {
 		return "", err
 	}
I have reproduced the problem by faking describeVolume failures. The fix will probably be slightly more complicated than the one in comment #15 since, as Hemant pointed out, I should check for error 404, in which case it is safe to assume the volume is detached and not try again.
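For illustration only, a check for the "volume not found" case with aws-sdk-go could look roughly like the sketch below. The helper name is hypothetical, and (as the next comment explains) the cloudprovider re-wraps the AWS error, so a check like this would have to live inside describeVolume itself rather than in DetachDisk:

package aws

import (
	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isVolumeNotFound reports whether err is the EC2 "volume does not exist"
// error (InvalidVolume.NotFound). In that case it is safe to treat the
// volume as already gone and stop retrying the detach.
func isVolumeNotFound(err error) bool {
	if awsErr, ok := err.(awserr.Error); ok {
		return awsErr.Code() == "InvalidVolume.NotFound"
	}
	return false
}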
Upstream PR: https://github.com/kubernetes/kubernetes/pull/59569

Note: I used the one-line patch in the end anyway... The cloudprovider re-formats the awsError, so parsing out the error code outside of the describeVolume method is not possible; also, describeVolume is called from several places, so changing its behaviour is much more dangerous. The proposed change aims at this very specific issue.
@Tomas, thanks for the PR. The PR gives k8s more chances to detach volumes, right?
(In reply to Hongkai Liu from comment #18)
> The PR gives k8s more chances to detach volumes, right?

Yes. In case describeVolumes fails (hitting the API call quota limit in our case), the DetachDisk method will propagate the error, so the attach/detach controller will not mark the volume as detached and will try to detach it again. I suppose this was actually quite a common issue on more loaded clusters and we may have solved one of the mysterious AWS problems. We'll see.

The next step would be to create a more "sophisticated" (and complex) fix that would differentiate among the errors and be robust even in weird situations, like someone manually managing to delete the volume while the detach is in progress. It's very unlikely to happen but can't be ruled out... However, that fix would touch more parts, take more time and might bring regressions.
https://github.com/openshift/origin/pull/18544
Cool. Thanks for the reply. @Tomas.
Tested with atomic-openshift.x86_64 3.9.7-1.git.0.e1a30c3.el7

root@ip-172-31-14-145: ~/go/src/github.com/openshift/ose # git tag --contains 15c1c88b6bef9a437fed960822f203f0620b9074
...
v3.9.7-1
...
root@ip-172-31-14-145: ~/go/src/github.com/openshift/ose # git rev-list -n 1 v3.9.7-1
e1a30c3f321a9c6809f53bd4467d6442e92522c4

I can no longer see any PVs left behind. Tried about 10 rounds of creating/deleting; all PVs got cleaned up in my test.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816