Bug 1377486 - error deleting EBS volumes: VolumeInUse
Summary: error deleting EBS volumes: VolumeInUse
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Tomas Smetana
QA Contact: Chao Yang
URL:
Whiteboard:
Duplicates: 1377548
Depends On:
Blocks: OSOPS_V3 1376104
 
Reported: 2016-09-19 21:09 UTC by Stefanie Forrester
Modified: 2019-12-26 06:30 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if a pod that was using a persistent volume was deleted while the controller was down, the volume was never detached from the node: the restarted controller could not find the attached volume and never tried to detach it. The fixed controller now examines the volumes attached to each node on startup, determines which ones need to be detached, and detaches them properly.
Clone Of:
Environment:
Last Closed: 2017-06-15 18:36:22 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs of the failed PVs' state (458.07 KB, text/plain)
2016-10-04 22:21 UTC, Stefanie Forrester


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:1425 0 normal SHIPPED_LIVE OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update 2017-06-15 22:35:53 UTC

Description Stefanie Forrester 2016-09-19 21:09:10 UTC
Description of problem:

Automatic deletion of EBS volumes is failing.

Sep 19 20:46:54 ip-172-31-6-179.ec2.internal atomic-openshift-master-controllers[45543]: I0919 20:46:54.084796   45543 aws_util.go:51] Error deleting EBS Disk volume aws://us-east-1c/vol-1a662bb6: error deleting EBS volumes: VolumeInUse: Volume vol-1a662bb6 is currently attached to i-0e5a5388

This is causing the released volumes to enter the Failed state.

[root@dev-preview-int-master-d41bf origin]# oc get pv|grep Failed
pv-aws-cxxkd   5Gi        RWO           Failed    jjj/mongodb                                     12h
pv-aws-nj8nc   1Gi        RWO           Failed    jjj/jws-app-mongodb-claim                       10h


[root@dev-preview-int-master-d41bf origin]# oc describe pv pv-aws-cxxkd
Name:           pv-aws-cxxkd
Labels:         failure-domain.beta.kubernetes.io/region=us-east-1
                failure-domain.beta.kubernetes.io/zone=us-east-1c
Status:         Failed
Claim:          jjj/mongodb
Reclaim Policy: Delete
Access Modes:   RWO
Capacity:       5Gi
Message:        Delete of volume "pv-aws-cxxkd" failed: error deleting EBS volumes: VolumeInUse: Volume vol-5e0f42f2 is currently attached to i-0e5a5388
                status code: 400, request id: 
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-east-1c/vol-5e0f42f2
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
No events.

The node log shows that the disk was unmounted from the instance.

Sep 19 09:26:06 ip-172-31-14-248.ec2.internal atomic-openshift-node[2881]: I0919 09:26:06.982740    2881 reconciler.go:139] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-5e0f42f2" (spec.Name: "mongodb-data") from pod "bfa6e3bc-7e40-11e6-9392-0a27f5fffde3" (UID: "bfa6e3bc-7e40-11e6-9392-0a27f5fffde3").
Sep 19 09:26:07 ip-172-31-14-248.ec2.internal atomic-openshift-node[2881]: I0919 09:26:06.993698    2881 operation_executor.go:812] UnmountVolume.TearDown succeeded for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-5e0f42f2" (volume.spec.Name: "mongodb-data") pod "bfa6e3bc-7e40-11e6-9392-0a27f5fffde3" (UID: "bfa6e3bc-7e40-11e6-9392-0a27f5fffde3").
Sep 19 09:26:07 ip-172-31-14-248.ec2.internal atomic-openshift-node[2881]: I0919 09:26:07.084966    2881 reconciler.go:284] UnmountDevice operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-5e0f42f2" (spec.Name: "pv-aws-cxxkd")
Sep 19 09:26:07 ip-172-31-14-248.ec2.internal atomic-openshift-node[2881]: I0919 09:26:07.202760    2881 operation_executor.go:890] UnmountDevice succeeded for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-5e0f42f2" (spec.Name: "pv-aws-cxxkd").


But the AWS web console shows the volume is still attached to the instance (state: in-use), so it cannot be deleted.

The controller log shows pages of this message:

Sep 19 16:06:24 ip-172-31-6-179.ec2.internal atomic-openshift-master-controllers[5313]: I0919 16:06:24.392142    5313 controller.go:1079] isVolumeReleased[pv-aws-cxxkd]: volume is released
Sep 19 16:06:54 ip-172-31-6-179.ec2.internal atomic-openshift-master-controllers[5313]: I0919 16:06:54.392306    5313 controller.go:1079] isVolumeReleased[pv-aws-cxxkd]: volume is released
Sep 19 16:07:24 ip-172-31-6-179.ec2.internal atomic-openshift-master-controllers[5313]: I0919 16:07:24.393297    5313 controller.go:1079] isVolumeReleased[pv-aws-cxxkd]: volume is released


Version-Release number of selected component (if applicable):
atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. Add this to node-config.yaml.

kubeletArguments:
  enable-controller-attach-detach:
  - 'true'

2. Create an app containing a pvc. 
'oc new-app cakephp-mysql-example -n <project>'.

3. Delete the app and check the PV status. PV will be "Failed".
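
For reference, one way to inspect the resulting state (the PV name is a placeholder):

oc get pv | grep Failed
oc describe pv <pv-name> | grep -A1 Message: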

Actual results:
The PV enters the Failed state with "error deleting EBS volumes: VolumeInUse" and the EBS volume stays attached (in-use) in AWS.

Expected results:
The EBS volume is detached and deleted, and the PV is removed.

Additional info:

Comment 1 Abhishek Gupta 2016-09-20 23:06:08 UTC
Bumping up the severity as this is a blocker for DevPreview upgrade to 3.3. This is currently being investigated by Mark and Stefanie. We will pull in Paul Morie or folks from the storage team once we gather more data.

Comment 2 zhaliu 2016-09-21 01:51:59 UTC
*** Bug 1377548 has been marked as a duplicate of this bug. ***

Comment 3 Chao Yang 2016-09-21 03:10:30 UTC
I could not reproduce this on OCP:
[root@ip-172-18-13-195 ~]# openshift version
openshift v3.3.0.31
kubernetes v1.3.0+52492b4
etcd 2.3.0+git

1. Create a PVC and a pod; after the pod is running, delete the pod and the PVC.
The PV enters the Failed state but is deleted after some time.
https://bugzilla.redhat.com/show_bug.cgi?id=1371375

2. Create a PVC and a pod; after the pod is running, delete the PVC. The PV becomes Failed and the EBS volume stays "in-use", but after the pod is deleted, the PV is deleted and the EBS volume disappears from the web console.

Comment 4 Jianwei Hou 2016-09-21 06:59:14 UTC
With:
atomic-openshift-node-3.3.0.31-1.git.0.aede597.el7.x86_64
atomic-openshift-master-3.3.0.31-1.git.0.aede597.el7.x86_64

and:
  enable-controller-attach-detach:
  - 'true'

I wasn't able to reproduce this. I followed the same steps from https://bugzilla.redhat.com/show_bug.cgi?id=1377548
1. oc new-app mongodb-persistent
2. oc edit pvc mongodb, add the annotation to make it auto-provision a 5Gi EBS volume
3. Verify pods are ready:
NAME              READY     STATUS    RESTARTS   AGE
mongodb-1-yon7y   1/1       Running   0          3m
4. oc delete pvc --all
5. Immediately the PV became 'FAILED'
6. oc delete project --all
7. oc get pv
pvc-2b1259cd-7fc6-11e6-9027-0ed4ad3f44b5   5Gi        RWO           Failed    jhou/mongodb                    5m
8. Waited a minute and got the PVs again; the PV was already deleted. The volume was also deleted, which I could verify in the EC2 console.

Comment 5 Jan Safranek 2016-09-21 07:29:30 UTC
When you delete a pod and its PVC at the same time, it takes some time for the volume to be detached from the node, and during that period the volume cannot be deleted (hence the error). You may need to wait a few minutes for the volumes to be detached; after that they should be deleted automatically.

There is a "fix" in https://github.com/kubernetes/kubernetes/pull/31978, which does not mark the volumes as Failed and leaves them Released instead, making the process less confusing - it's a normal operation, not an error.

So, be patient and let Kubernetes detach everything; the volume *should* be deleted eventually. Let me know if the volume is still there after ~5 minutes and provide full master logs, preferably with log level 5.
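
To confirm it is only the normal detach delay, watching the PVs for a few minutes is enough, for example:

oc get pv -w

or, polling every 30 seconds:

watch -n 30 "oc get pv | grep -E 'Released|Failed'"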

Comment 6 Stefanie Forrester 2016-09-21 14:55:55 UTC
I just did another batch of 20 PV-backed app-creates and then deleted all 20 of the projects. I saw the controller detach and clean up all the volumes as it should, this time. So it's no longer an urgent issue.

I think the behavior I was seeing previously had to do with bz #1377483, where the controllers were offline intermittently and therefore unable to perform the detach operation.

It seems that if the controllers are offline, we miss our chance to get the volumes detached, because the original Failed PVs are still present after more than a day:

[root@dev-preview-int-master-968f8 ~]# oc get pv |grep Failed
pv-aws-4ln9z   1Gi        RWO           Failed    dakinitest60/mysql                              1d
pv-aws-9kgnp   1Gi        RWO           Failed    ttt/mongodb                                     1d
pv-aws-cxxkd   5Gi        RWO           Failed    jjj/mongodb                                     2d
pv-aws-nj8nc   1Gi        RWO           Failed    jjj/jws-app-mongodb-claim                       2d

'oc describe' shows a similar message for each of them:

Message:        Delete of volume "pv-aws-4ln9z" failed: error deleting EBS volumes: VolumeInUse: Volume vol-4b8ac4e7 is currently attached to i-095a538f

And the controllers attempt to delete the volume, but no longer attempt to detach it:

Sep 21 02:16:28 ip-172-31-6-178.ec2.internal atomic-openshift-master-controllers[86332]: I0921 02:16:28.468300   86332 controller.go:995] deleteVolumeOperation [pv-aws-4ln9z] started
Sep 21 02:16:28 ip-172-31-6-178.ec2.internal atomic-openshift-master-controllers[86332]: I0921 02:16:28.472444   86332 controller.go:1079] isVolumeReleased[pv-aws-4ln9z]: volume is released
Sep 21 02:16:28 ip-172-31-6-178.ec2.internal atomic-openshift-master-controllers[86332]: I0921 02:16:28.472477   86332 controller.go:1086] doDeleteVolume [pv-aws-4ln9z]
Sep 21 02:16:28 ip-172-31-6-178.ec2.internal atomic-openshift-master-controllers[86332]: I0921 02:16:28.615384   86332 controller.go:1017] deletion of volume "pv-aws-4ln9z" failed: Delete of volume "pv-aws-4ln9z" failed: error deleting EBS volumes: VolumeInUse: Volume vol-4b8ac4e7 is currently attached to i-095a538f
Sep 21 02:16:28 ip-172-31-6-178.ec2.internal atomic-openshift-master-controllers[86332]: I0921 02:16:28.615398   86332 controller.go:626] updating updateVolumePhaseWithEvent[pv-aws-4ln9z]: set phase Failed
Sep 21 02:16:28 ip-172-31-6-178.ec2.internal atomic-openshift-master-controllers[86332]: I0921 02:16:28.615415   86332 controller.go:629] updating updateVolumePhaseWithEvent[pv-aws-4ln9z]: phase Failed already set

Comment 7 Jan Safranek 2016-09-21 15:12:49 UTC
A newly started controller manager should catch up and continue where the crashed one stopped. It may take some time to gather all the necessary information, but it should detach and delete everything eventually.

Comment 8 Jan Safranek 2016-09-21 15:15:17 UTC
Anyway, I'm closing this bug, please reopen it if you're able to reproduce it again (and please gather logs!).

Comment 9 Stefanie Forrester 2016-10-04 22:20:08 UTC
This issue has come up again after installing OCP 3.3.0.33 in prod. Again, we're having a problem where the controllers are intermittently inaccessible. However, despite the controllers going offline, I think there should be some kind of retry logic that will attempt to detach the EBS volumes repeatedly when they get stuck like this.

The problem seems to be that if a volume fails to detach initially, the detach operation is never retried. I currently have 54 volumes in prod that are in a Failed state because they're not being detached from their instances.

Please let me know what logs would be useful in debugging this.

[root@preview-master-afbb8 ~]# oc get pv |grep Fail
pv-aws-0cr3v   1Gi        RWO           Failed      teste/postgresql                                          60d
pv-aws-10ri1   1Gi        RWO           Failed      drivenow/mysql                                            29d
pv-aws-15v1s   1Gi        RWO           Failed      development/mysql                                         15d
pv-aws-1vbtv   1Gi        RWO           Failed      vdinh-prod-1/eap-app-mysql-claim                          4d
pv-aws-34ekb   1Gi        RWO           Failed      mmj2/postgresql                                           17d
pv-aws-3rbml   1Gi        RWO           Failed      theblocks-example/mongodb                                 16d
pv-aws-3xj0z   1Gi        RWO           Failed      naveentest/jenkins                                        60d
pv-aws-5n59j   1Gi        RWO           Failed      openshiftnodemongo/mongodb                                27d
pv-aws-6rc24   1Gi        RWO           Failed      dbtudo/postgresql                                         12d
pv-aws-71jl0   1Gi        RWO           Failed      mydb/postgresql                                           29d
pv-aws-83jhj   1Gi        RWO           Failed      my-cloud/pg                                               16d
pv-aws-94cu5   1Gi        RWO           Failed      ganadomas/eap-app-mysql-claim                             7d
pv-aws-9fwva   1Gi        RWO           Failed      jbibrest/mysql                                            101d
pv-aws-b58b1   1Gi        RWO           Failed      mycloud/mysql                                             21d
pv-aws-ba3dt   1Gi        RWO           Failed      azizbw/postgresql                                         14d
pv-aws-c24x0   1Gi        RWO           Failed      letpressknow/jenkins                                      11d
pv-aws-cgsxq   1Gi        RWO           Failed      my-project-my/mongodb                                     61d
pv-aws-etodp   1Gi        RWO           Failed      shift-test/mongodb                                        16d
pv-aws-fe49y   1Gi        RWO           Failed      mineduc/eap-app-mysql-claim                               5d
pv-aws-g1hed   1Gi        RWO           Failed      yledemo/postgresql                                        30d
pv-aws-hsend   1Gi        RWO           Failed      testesomething/jenkins                                    31d
pv-aws-i5b85   1Gi        RWO           Failed      myflights/mongodb                                         98d
pv-aws-j7xqv   1Gi        RWO           Failed      dakinitest2/mysql                                         23h
pv-aws-jrha9   1Gi        RWO           Failed      ivinnikov-example/mongodb                                 26d
pv-aws-k1del   1Gi        RWO           Failed      supertiendas-turpial/mysql                                9d
pv-aws-krqm9   1Gi        RWO           Failed      cfg-invoice/mysql                                         61d
pv-aws-lbpzv   1Gi        RWO           Failed      zenifyxbot/mongodb                                        16d
pv-aws-lhd73   1Gi        RWO           Failed      powelldean-m1/database                                    68d
pv-aws-lhl1n   1Gi        RWO           Failed      for-sale/mongodb                                          30d
pv-aws-m1tbs   1Gi        RWO           Failed      dockertest/mongodb                                        31d
pv-aws-m840v   1Gi        RWO           Failed      shirleycharlin-demo/mysql                                 20d
pv-aws-ma0nx   1Gi        RWO           Failed      fixi/mysql                                                65d
pv-aws-nccjm   1Gi        RWO           Failed      hello-openshift/jenkins                                   11d
pv-aws-ojz2r   1Gi        RWO           Failed      worldportfolio/worldportfolio                             15d
pv-aws-pjt2e   1Gi        RWO           Failed      eigh-tk/jenkins                                           32d
pv-aws-q1tcw   1Gi        RWO           Failed      sasskinn12/mysql                                          25d
pv-aws-qbn0m   1Gi        RWO           Failed      tiendafruta/mysql                                         31d
pv-aws-qxf3m   1Gi        RWO           Failed      epiclegends-musicbot/mysql                                29d
pv-aws-qxyvx   1Gi        RWO           Failed      fclm-test-newgen/mongodb                                  2h
pv-aws-rfset   1Gi        RWO           Failed      rates/ratesjobs-mongodb-claim                             37d
pv-aws-s1p58   1Gi        RWO           Failed      malodmor/jenkins                                          30d
pv-aws-s4d3s   1Gi        RWO           Failed      wemaspace/postgresql                                      14d
pv-aws-sahyb   1Gi        RWO           Failed      project1wordpress/mysql                                   29d
pv-aws-siatf   1Gi        RWO           Failed      jw-001/mongodb                                            30d
pv-aws-sqxpi   1Gi        RWO           Failed      pjm-test/mariadb                                          28d
pv-aws-t2u8l   1Gi        RWO           Failed      staging/mariadb                                           28d
pv-aws-tegp0   1Gi        RWO           Failed      restaurante/mysql                                         29d
pv-aws-v80fv   1Gi        RWO           Failed      examen/mysql                                              62d
pv-aws-wbje7   1Gi        RWO           Failed      job-app-db/mongodb                                        14d
pv-aws-wxdw6   1Gi        RWO           Failed      masco/mysql                                               29d
pv-aws-x5sd5   1Gi        RWO           Failed      djsispan/postgresql                                       31d
pv-aws-xk5n0   1Gi        RWO           Failed      nights-out/mysql                                          34d
pv-aws-yr6kd   1Gi        RWO           Failed      rockettest/mongodb                                        70d
pv-aws-zkc9e   1Gi        RWO           Failed      zhangqi-test/jenkins                                      33d

[root@preview-master-afbb8 ~]# oc get pv |grep Failed -c
54

Comment 10 Stefanie Forrester 2016-10-04 22:21:26 UTC
Created attachment 1207376 [details]
logs of the failed PVs' state

Comment 11 Jan Safranek 2016-10-05 08:12:06 UTC
Detach operation should be retried periodically when the volume is not needed. "When not needed" is important - please make sure that all pods that needed the volume are terminated.

We need controller-manager and kubelet logs (i.e. openshift master + openshift node), with as high log level as possible. Several GB of logs is better than missing an important message.
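
On OCP 3.3 the log level can usually be raised through the sysconfig files (the paths and the existing --loglevel option are an assumption based on a standard RPM install; adjust as needed):

# on each master
sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-master-controllers
systemctl restart atomic-openshift-master-controllers

# on each node
sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
systemctl restart atomic-openshift-node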

Comment 13 Stefanie Forrester 2016-10-05 15:28:54 UTC
I have a list of Failed PVs from yesterday and another list of them today. I took 5 random PVs from the list to use for testing. I made sure they were on today's list and yesterday's list.

These 5 PVs have no associated projects anymore. So certainly no PVC or pod remaining.

Our PV provisioner pod is marking them Released after some time, and then they become Failed again later. So they're constantly cycling between Failed and Released.

[root@preview-master-afbb8 ~]# date; oc get pv pv-aws-zkc9e pv-aws-yr6kd pv-aws-xk5n0 pv-aws-x5sd5 pv-aws-wbje7
Wed Oct  5 15:26:48 UTC 2016
NAME           CAPACITY   ACCESSMODES   STATUS    CLAIM                  REASON    AGE
pv-aws-zkc9e   1Gi        RWO           Failed    zhangqi-test/jenkins             34d
pv-aws-yr6kd   1Gi        RWO           Failed    rockettest/mongodb               70d
pv-aws-xk5n0   1Gi        RWO           Failed    nights-out/mysql                 35d
pv-aws-x5sd5   1Gi        RWO           Failed    djsispan/postgresql              32d
pv-aws-wbje7   1Gi        RWO           Failed    job-app-db/mongodb               15d

[root@preview-master-afbb8 ~]# date; oc get pv pv-aws-zkc9e pv-aws-yr6kd pv-aws-xk5n0 pv-aws-x5sd5 pv-aws-wbje7
Wed Oct  5 15:27:43 UTC 2016
NAME           CAPACITY   ACCESSMODES   STATUS     CLAIM                  REASON    AGE
pv-aws-zkc9e   1Gi        RWO           Released   zhangqi-test/jenkins             34d
pv-aws-yr6kd   1Gi        RWO           Released   rockettest/mongodb               70d
pv-aws-xk5n0   1Gi        RWO           Released   nights-out/mysql                 35d
pv-aws-x5sd5   1Gi        RWO           Released   djsispan/postgresql              32d
pv-aws-wbje7   1Gi        RWO           Released   job-app-db/mongodb               15d


[root@preview-master-afbb8 ~]# oc get project zhangqi-test
Error from server: namespaces "zhangqi-test" not found

[root@preview-master-afbb8 ~]# oc get project rockettest
Error from server: namespaces "rockettest" not found

[root@preview-master-afbb8 ~]# oc get project nights-out
Error from server: namespaces "nights-out" not found

[root@preview-master-afbb8 ~]# oc get project djsispan
Error from server: namespaces "djsispan" not found

[root@preview-master-afbb8 ~]# oc get project job-app-db
Error from server: namespaces "job-app-db" not found


Since the pods don't exist, I don't have a way to look up what node they were running on. So I don't think I can provide node logs.
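
The attachment can still be looked up from the AWS side, though; for example with the AWS CLI (assuming credentials are configured, and with <vol-id> taken from 'oc describe pv'):

aws ec2 describe-volumes --volume-ids <vol-id> \
    --query 'Volumes[].Attachments[].[InstanceId,State,Device]' --output text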

Comment 14 Stefanie Forrester 2016-10-05 15:37:13 UTC
I just talked with Mark Turansky who said he fixed this issue in the latest PV provisioner pod. I'm going to try deploying that in prod and see if it clears up the issue.

Comment 15 Stefanie Forrester 2016-10-05 16:20:08 UTC
With the latest version of the PV provisioner pod (oso-volume-provisioner:v3.3.0.0-4), the PVs are no longer cycling between Failed and Released. They're now consistently in Failed state, 60 of them.

Comment 16 Stefanie Forrester 2016-10-05 17:10:47 UTC
It's been an hour since the latest PV provisioner pod was deployed, and I still see the 60 PVs in Failed state. The vast majority of the projects that used to contain the PVCs for these PVs are gone.

I think that would suggest the controller is failing to detach these.

In our API logs I'm seeing about 30 requests per second attempting to delete the volumes and failing due to "VolumeInUse", but only a handful of DetachVolume attempts in the past 3 hours, all of which succeeded. That means the controller is not attempting to detach the volumes behind the Failed PVs.

[root@preview-master-afbb8 ~]# oc get pv |grep Failed -c
61

[root@preview-master-afbb8 ~]# for project in $(oc get pv |awk '/Failed/ {print $5}'| cut -d/ -f1); do oc get project $project ; done
Error from server: namespaces "teste" not found
Error from server: namespaces "drivenow" not found
Error from server: namespaces "development" not found
Error from server: namespaces "vdinh-prod-1" not found
Error from server: namespaces "mmj2" not found
Error from server: namespaces "hocctnss-project" not found
Error from server: namespaces "theblocks-example" not found
Error from server: namespaces "naveentest" not found
Error from server: namespaces "openshiftnodemongo" not found
Error from server: namespaces "dbtudo" not found
Error from server: namespaces "mydb" not found
NAME       DISPLAY NAME   STATUS
my-cloud   My Cloud       Active
Error from server: namespaces "jenk" not found
Error from server: namespaces "ganadomas" not found
Error from server: namespaces "jbibrest" not found
Error from server: namespaces "mycloud" not found
Error from server: namespaces "azizbw" not found
NAME             DISPLAY NAME     STATUS
pueblosmagicos   PueblosMagicos   Active
Error from server: namespaces "letpressknow" not found
Error from server: namespaces "my-project-my" not found
Error from server: namespaces "deneme" not found
Error from server: namespaces "shift-test" not found
NAME      DISPLAY NAME   STATUS
mineduc   mineduc        Active
Error from server: namespaces "yledemo" not found
Error from server: namespaces "testesomething" not found
Error from server: namespaces "myflights" not found
Error from server: namespaces "dakinitest2" not found
Error from server: namespaces "ivinnikov-example" not found
Error from server: namespaces "supertiendas-turpial" not found
Error from server: namespaces "cfg-invoice" not found
Error from server: namespaces "zenifyxbot" not found
Error from server: namespaces "powelldean-m1" not found
Error from server: namespaces "for-sale" not found
Error from server: namespaces "dockertest" not found
Error from server: namespaces "shirleycharlin-demo" not found
Error from server: namespaces "fixi" not found
Error from server: namespaces "hello-openshift" not found
NAME             DISPLAY NAME    STATUS
worldportfolio   Weltportfolio   Active
Error from server: namespaces "eigh-tk" not found
Error from server: namespaces "sasskinn12" not found
Error from server: namespaces "samwebapp" not found
Error from server: namespaces "tiendafruta" not found
Error from server: namespaces "epiclegends-musicbot" not found
Error from server: namespaces "fclm-test-newgen" not found
Error from server: namespaces "rates" not found
Error from server: namespaces "malodmor" not found
Error from server: namespaces "wemaspace" not found
Error from server: namespaces "project1wordpress" not found
Error from server: namespaces "jw-001" not found
Error from server: namespaces "pjm-test" not found
Error from server: namespaces "staging" not found
Error from server: namespaces "restaurante" not found
Error from server: namespaces "examen" not found
Error from server: namespaces "jy-project" not found
Error from server: namespaces "job-app-db" not found
Error from server: namespaces "masco" not found
Error from server: namespaces "djsispan" not found
Error from server: namespaces "nights-out" not found
NAME       DISPLAY NAME   STATUS
sistcoop                  Active
Error from server: namespaces "rockettest" not found
Error from server: namespaces "zhangqi-test" not found

Comment 17 Jan Safranek 2016-10-06 12:00:29 UTC
Thanks for the logs! They're big but valuable!

It seems there is something wrong between the attach controller and AWS: after a controller manager crash, it does not seem to restore the list of attached volumes correctly. The master crashes every couple of minutes, so it's possible it never even has a chance to restore its state.

I'll dig into this.


If possible, can you try to find a stable reproducer in the meantime? Without gigabytes of logs and with as few steps as possible? /bin/kill might be helpful.
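
For simulating the crash, killing the controllers hard on a master should be close enough, for example (only a suggestion, the exact signal may matter):

systemctl kill --signal=SIGKILL atomic-openshift-master-controllers
# or
pkill -KILL -f atomic-openshift-master-controllers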

Comment 18 Stefanie Forrester 2016-10-06 14:24:19 UTC
Ok, I'll try reproducing in STG. This seems like a fairly rare case to hit, since even Prod only gets about 2 failures per day. But I'll see what I can do.

Comment 19 Jan Safranek 2016-10-06 15:39:33 UTC
I reproduced something with very similar symptoms and filed https://github.com/kubernetes/kubernetes/issues/34242.

Things should be much better if the master did not crash.

Comment 20 Stefanie Forrester 2016-10-06 16:12:26 UTC
Nice! Thanks for that. I have a bug for the master crashes here:

https://bugzilla.redhat.com/show_bug.cgi?id=1381745

and I believe the crashes are happening as a result of all the AWS throttling going on here:

https://bugzilla.redhat.com/show_bug.cgi?id=1367229

I was able to significantly reduce the amount of AWS throttling by manually detaching about 80 EBS volumes this week. So that allowed the DeleteVolume calls to succeed instead of hammering the API many times per second. Also, because of that, we're down from 100+ controller crashes per day to about 15. So the impact of this particular bug isn't as serious as it was yesterday. I think it would be ok to reduce the severity of this bug.

Comment 21 Stefanie Forrester 2016-10-06 16:42:31 UTC
I tried just about every kill signal out there and I wasn't able to reproduce the exact state of the controller restarts we see in prod (specifically, this):

Oct 04 20:51:33 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[110809]: F1004 20:51:33.797506  110809 start_master.go:568] Controller graceful shutdown requested
Oct 04 20:51:34 ip-172-31-10-25.ec2.internal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a

So I couldn't trigger the graceful shutdown. But shutting off the controllers seemed to work just as well.

1. From the master, I created a bunch of PV-backed apps.

for num in {0..20}; do oadm new-project dakinitest${num}; oc new-app cakephp-mysql-example -n dakinitest${num}; sleep 1s; done

2. Made sure they were all running.

for num in {0..20}; do oc get pods -n dakinitest${num}|grep mysql-1|grep -v deploy;  done

3. Shut off the controller service across all 3 masters.

opssh -i -l root -c dev-preview-stg --v3 -t master "systemctl stop atomic-openshift-controllers"

4. Deleted the projects.

for num in {0..20}; do oc delete project dakinitest${num};  done

5. Started the controllers back up.

opssh -i -l root -c dev-preview-stg --v3 -t master "systemctl start atomic-openshift-controllers"

6. Then I looked at which PVs had failed, and why they failed.

oc get pv |awk '/Failed/ {print $1}'|xargs oc describe pv

Any failures always said "error deleting EBS volumes: VolumeInUse". I found I could get more failures when I restarted the controller process a few times instead of leaving it running the whole time after the projects were deleted.

Comment 22 Jan Safranek 2016-10-07 07:41:46 UTC
Thanks for the reproducer!

A newly started Kubernetes controller manager does not detach volumes for pods that were deleted while the controller manager was down. Talking to upstream, there is no easy solution and it will take some time to fix; see the link in comment #19.

The best thing to do is to keep the manager from crashing, and to manually detach and delete the volumes when it does crash.
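
A minimal manual cleanup sketch with the AWS CLI (illustration only; it assumes credentials are configured, the pods are really gone, and <vol-id>/<instance-id> come from the PV's error message):

aws ec2 detach-volume --volume-id <vol-id> --instance-id <instance-id>
aws ec2 wait volume-available --volume-ids <vol-id>
# once detached, the controller's periodic delete retry should succeed and
# remove both the EBS volume and the Failed PV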

Comment 23 Stefanie Forrester 2016-10-25 16:47:08 UTC
I added a workaround here for any Ops people who need to detach the volumes manually.

https://github.com/openshift/openshift-tools/blob/prod/ansible/playbooks/adhoc/detach_failed_volumes.yml

Comment 24 Jan Safranek 2017-01-03 09:26:17 UTC
Tomas is working on a fix.

Comment 32 Wenqi He 2017-04-26 06:42:49 UTC
I have tested this on the version below with an HA OCP; all the PVs can be deleted without showing "Failed" status.

# oc version
openshift v3.5.5.9
kubernetes v1.5.2+43a9be4

# for num in {0..18}; do oadm new-project pvtest${num}; oc new-app mysql-persistent -n pvtest${num}; oc patch pvc mysql -p '{"metadata":{"annotations":{"volume.alpha.kubernetes.io/storage-class":"foo"}}}' -n pvtest${num}; sleep 1s; done

# oc get pods --all-namespaces=true
NAMESPACE      NAME                            READY     STATUS      RESTARTS   AGE
pvtest0        mysql-1-7mn3c                   1/1       Running     0          7m
pvtest1        mysql-1-ghcvw                   1/1       Running     0          7m
pvtest10       mysql-1-2dqtj                   1/1       Running     0          4m
pvtest11       mysql-1-90dvz                   1/1       Running     0          4m
pvtest12       mysql-1-bgldj                   1/1       Running     0          3m
pvtest13       mysql-1-h2vch                   1/1       Running     0          2m
pvtest14       mysql-1-nm6p7                   1/1       Running     0          1m
pvtest15       mysql-1-b9qxb                   1/1       Running     0          1m
pvtest16       mysql-1-lwsxn                   1/1       Running     0          1m
pvtest17       mysql-1-215ws                   1/1       Running     0          1m
pvtest18       mysql-1-dhvw1                   1/1       Running     0          1m
pvtest2        mysql-1-5frfp                   1/1       Running     0          7m
pvtest3        mysql-1-44wc3                   1/1       Running     0          6m
pvtest4        mysql-1-gjcs5                   1/1       Running     0          6m
pvtest5        mysql-1-1jwmx                   1/1       Running     0          6m
pvtest6        mysql-1-rq3tq                   1/1       Running     0          4m
pvtest7        mysql-1-bkxtl                   1/1       Running     0          4m
pvtest8        mysql-1-09s44                   1/1       Running     0          4m
pvtest9        mysql-1-qlqhq                   1/1       Running     0          4m

# systemctl stop atomic-openshift-master-controllers.service (On each master)

# for num in {0..18}; do oc delete project pvtest${num};  done

# oc get pv
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS     CLAIM            REASON    AGE
pvc-2cfac421-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest6/mysql              12m
pvc-2ea2bf46-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest7/mysql              12m
pvc-30526917-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest8/mysql              12m
pvc-31f67d17-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest9/mysql              12m
pvc-33a1d848-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest10/mysql             12m
pvc-355c78ad-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest11/mysql             12m
pvc-36ffc5f2-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest12/mysql             12m
pvc-7bf096bc-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest13/mysql             10m
pvc-7d9829f8-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest14/mysql             10m
pvc-7f4681a0-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest15/mysql             10m
pvc-80e55303-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest16/mysql             10m
pvc-82a5fe68-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest17/mysql             10m
pvc-84453982-2a48-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest18/mysql             10m
pvc-c6a365a4-2a47-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest0/mysql              15m
pvc-c82e4c1f-2a47-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest1/mysql              15m
pvc-c9bd786f-2a47-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest2/mysql              15m
pvc-cb56b8f0-2a47-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest3/mysql              15m
pvc-cd011db3-2a47-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest4/mysql              15m
pvc-cea4b0f4-2a47-11e7-b646-0ea5d1077080   1Gi        RWO           Delete          Released   pvtest5/mysql              15m

#  systemctl start atomic-openshift-master-controllers.service (on each master)

After waiting several minutes, all the PVs were deleted without showing "error deleting EBS volumes: VolumeInUse".

Comment 34 Abhishek Gupta 2017-05-30 15:37:52 UTC
I moved this bug to ON_QA assuming this bug was for Online. Moving it back to MODIFIED for someone to chime in with the proper process for the status transition for OCP bugs.

Comment 37 Wenqi He 2017-06-05 10:14:27 UTC
This bug is fixed and I have already verified it in comment 32. Thanks.

Comment 39 errata-xmlrpc 2017-06-15 18:36:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1425

