Bug 1573304
| Field | Value |
| --- | --- |
| Summary | [Tracker-RHGS-BZ#1631329-BZ#1524336] Can't delete PV - stuck in Failed status |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | rhgs-server-container |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | unspecified |
| Version | rhgs-3.0 |
| Target Milestone | --- |
| Target Release | OCS 3.11.1 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | rhgs-server-rhel7:3.11.1-1 |
| Doc Type | If docs needed, set a value |
| Keywords | ZStream |
| Type | Bug |
| Reporter | Vikas Laad <vlaad> |
| Assignee | Saravanakumar <sarumuga> |
| QA Contact | Neha Berry <nberry> |
| CC | amukherj, aos-bugs, aos-storage-staff, bchilds, ekuric, hchiramm, hongkliu, jarrpa, jmulligan, kramdoss, madam, mifiedle, moagrawa, nigoyal, pprakash, rcyriac, rhs-bugs, rtalur, sankarshan, storage-qa-internal, vinug, vlaad |
| Last Closed | 2019-02-07 04:12:47 UTC |
| Bug Depends On | 1524336, 1631329, 1659815 |
| Bug Blocks | 1641915, 1644154 |

Description (Vikas Laad, 2018-04-30 19:48:16 UTC):

    pvc-d3bf386c-4cac-11e8-a620-02701b2ae108   1Gi   RWO   Delete   Failed   pvclusproject0/pvc1u7oqq4rcl   glusterfs-storage   19m

Created attachment 1428936 [details]: master controller logs
Created attachment 1428937 [details]: master api logs
Created attachment 1428938 [details]: heketi logs

If I try again using "oc delete pv" the PV gets deleted; in the first attempt it was deleted using "oc delete pvc --all -n <proj-name>".

This looks like a case of too many concurrent volume operations hitting heketi. I believe this should be resolved with the latest version. Please see if you can reproduce; I will give it a shot tomorrow.

Hi Jose, just to make sure about the versions:

    rhgs-volmanager-container-3.3.1-8.1527091742
    https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=65121
    rhgs-server-container-3.3.1-10.1527091766
    https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=65120
    rhgs-gluster-block-prov-container-3.3.1-7.1527091787
    https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=65123

Are those the images that I should be using? Thanks.

Hi Jose, another thing to confirm. With commit https://github.com/openshift/openshift-ansible/commit/0be4b2565beb92c064917627863401af7dfb73d3, these two inventory variables:

    openshift_storage_glusterfs_image=registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7
    openshift_storage_glusterfs_version=3.3.1-10.1527091766

have to be collapsed into this single one:

    openshift_storage_glusterfs_image=registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-10.1527091766

and openshift_storage_glusterfs_version is no longer used. The same applies to the heketi and provisioner images too. Is that correct?

Tried again with:

    # yum list installed | grep openshift
    atomic-openshift.x86_64   3.10.0-0.53.0.git.0.f0248f3.el7
    # oc describe node | grep -i run
    Container Runtime Version:   docker://1.13.1
    # oc get pod -n glusterfs -o yaml | grep "image:" | sort -u
        image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-gluster-block-prov-rhel7:3.3.1-7.1527091787
        image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-10.1527091766
        image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7:3.3.1-8.1527091742

The problem is still there:

1. Create 70 pods with glusterfs PVCs in a project.
2. oc delete project <project_name>

The PVs then cannot be cleaned up. There were 2 PVs left:

    # oc get pv | grep Failed
    pvc-ac62399a-65b2-11e8-a1c5-02fa00f301da   1Gi   RWO   Delete   Failed   fioatest0/pvcnsfshfadmv   glusterfs-storage   2h
    pvc-cd1d024a-65b2-11e8-a1c5-02fa00f301da   1Gi   RWO   Delete   Failed   fioatest0/pvcm6cfwtn5wu   glusterfs-storage   2h

    # oc describe pv pvc-cd1d024a-65b2-11e8-a1c5-02fa00f301da
    Name:            pvc-cd1d024a-65b2-11e8-a1c5-02fa00f301da
    Labels:          <none>
    Annotations:     Description=Gluster-Internal: Dynamically provisioned PV
                     gluster.kubernetes.io/heketi-volume-id=707cbc371a6b84baed95009431b5667d
                     gluster.org/type=file
                     kubernetes.io/createdby=heketi-dynamic-provisioner
                     pv.beta.kubernetes.io/gid=2048
                     pv.kubernetes.io/bound-by-controller=yes
                     pv.kubernetes.io/provisioned-by=kubernetes.io/glusterfs
                     volume.beta.kubernetes.io/mount-options=auto_unmount
    Finalizers:      [kubernetes.io/pv-protection]
    StorageClass:    glusterfs-storage
    Status:          Failed
    Claim:           fioatest0/pvcm6cfwtn5wu
    Reclaim Policy:  Delete
    Access Modes:    RWO
    Capacity:        1Gi
    Node Affinity:   <none>
    Message:         Unable to delete volume vol_707cbc371a6b84baed95009431b5667d: Unable to execute command on glusterfs-storage-ggxfv:
    Source:
        Type:           Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
        EndpointsName:  glusterfs-dynamic-pvcm6cfwtn5wu
        Path:           vol_707cbc371a6b84baed95009431b5667d
        ReadOnly:       false
    Events:
        Type     Reason              Age  From                         Message
        ----     ------              ---  ----                         -------
        Warning  VolumeFailedDelete  37m  persistentvolume-controller  Unable to delete volume vol_707cbc371a6b84baed95009431b5667d: Unable to execute command on glusterfs-storage-ggxfv:

This is not

(In reply to Humble Chirammal from comment #18)
> This is not

Sorry for the incomplete comment. This is between heketi -> gluster:

    [kubeexec] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [gluster --mode=script volume delete vol_120a9680caea2b4b45c914a9bd29fbe6] on glusterfs-storage-57dww: Err[command terminated with exit code 1]: Stdout [Error : Request timed out ]: Stderr []
    [cmdexec] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/executors/cmdexec/volume.go:144: Unable to delete volume vol_120a9680caea2b4b45c914a9bd29fbe6: Unable to execute command on glusterfs-storage-57dww:
    [heketi] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:436: Unable to delete volume: Unable to delete volume vol_120a9680caea2b4b45c914a9bd29fbe6: Unable to execute command on glusterfs-storage-57dww:
    [heketi] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:360: Error executing delete volume: Unable to delete volume vol_120a9680caea2b4b45c914a9bd29fbe6: Unable to execute command on glusterfs-storage-57dww:
    [heketi] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:919: Delete Volume Failed: Unable to delete volume vol_120a9680caea2b4b45c914a9bd29fbe6: Unable to execute command on glusterfs-storage-57dww:
    [asynchttp] INFO 2018/04/30 17:29:18 asynchttp.go:129: Completed job b8f6f9faced142f91a3409a7374ff083 in 6m39.204431248s

I am moving this bugzilla to heketi.

Yes, 3 glusterfs pods are up and oc rsh worked.

Throttling support comes in the following two patches:
https://github.com/heketi/heketi/pull/1267
https://github.com/heketi/heketi/pull/1271

I created 800 PVCs and deleted them; I see around 500 PVs in Failed status. The number is going down slowly. I think this is what is expected after reading the comment above.
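
For reference, a minimal scripted version of the reproduction described above: a sketch only, assuming the glusterfs-storage StorageClass from this cluster and the oc CLI. The project name and claim count are placeholders, and bare PVCs are used instead of pods, since claims alone are enough to drive the heketi provision/delete path.

```bash
#!/bin/bash
# Provision many gluster-backed claims in one project, then delete the
# project in one shot and look for PVs left behind in Failed state.
PROJECT=pvtest0   # hypothetical project name
COUNT=70          # 70 reproduced it above; 800 made it far worse

oc new-project "$PROJECT"
for i in $(seq 1 "$COUNT"); do
  oc create -n "$PROJECT" -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc$i
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: glusterfs-storage
  resources:
    requests:
      storage: 1Gi
EOF
done

# Wait until every claim is bound, then tear the project down at once;
# this is what floods heketi with concurrent volume-delete requests.
until [ "$(oc get pvc -n "$PROJECT" --no-headers | grep -c Bound)" -eq "$COUNT" ]; do
  sleep 10
done
oc delete project "$PROJECT"

# Any PV the controller could not reclaim shows up as Failed.
oc get pv | grep Failed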

Here is what I see when I do a describe on the PV:

    Name:            pvc-f6996b4f-8e8b-11e8-9974-02615df9d798
    Labels:          <none>
    Annotations:     Description=Gluster-Internal: Dynamically provisioned PV
                     gluster.kubernetes.io/heketi-volume-id=be1058b743f0ba86d9fb54ecc3f8b8bf
                     gluster.org/type=file
                     kubernetes.io/createdby=heketi-dynamic-provisioner
                     pv.beta.kubernetes.io/gid=2403
                     pv.kubernetes.io/bound-by-controller=yes
                     pv.kubernetes.io/provisioned-by=kubernetes.io/glusterfs
                     volume.beta.kubernetes.io/mount-options=auto_unmount
    Finalizers:      [kubernetes.io/pv-protection]
    StorageClass:    glusterfs-storage
    Status:          Failed
    Claim:           fiotest0/pvcqo2nnqxk7t
    Reclaim Policy:  Delete
    Access Modes:    RWO
    Capacity:        1Gi
    Node Affinity:   <none>
    Message:         Server busy. Retry operation later.
    Source:
        Type:           Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
        EndpointsName:  glusterfs-dynamic-pvcqo2nnqxk7t
        Path:           vol_be1058b743f0ba86d9fb54ecc3f8b8bf
        ReadOnly:       false
    Events:
        Type     Reason              Age  From                         Message
        ----     ------              ---  ----                         -------
        Warning  VolumeFailedDelete  53m  persistentvolume-controller  Server busy. Retry operation later.

@krishnaram, deletion is still stuck when the number of PVCs is 1000. See 2 in comment 41.

Hello Michael, which image should we use to verify the bz? Thanks.

Hi Mohit, let me make sure I understand what I need to do:

1. grep "target is busy" <heketi_log>
2. When it shows up, run ps and then lsof in all 3 glusterfs pods, e.g.:

        # oc project
        Using project "glusterfs" on server "https://ip-172-31-20-26.us-west-2.compute.internal:8443".
        root@ip-172-31-20-26: ~ # oc get pod
        NAME                                          READY   STATUS    RESTARTS   AGE
        glusterblock-storage-provisioner-dc-1-6tz9v   1/1     Running   0          3h
        glusterfs-storage-8l2xx                       1/1     Running   0          3h
        glusterfs-storage-kq75g                       1/1     Running   0          3h
        glusterfs-storage-n65vg                       1/1     Running   0          3h
        heketi-storage-2-qqlt2                        1/1     Running   0          3h
        root@ip-172-31-20-26: ~ # oc rsh glusterfs-storage-8l2xx
        sh-4.2# ps -aef | grep glusterfsd
        root 885 1 0 15:01 ? 00:00:11 /usr/sbin/glusterfsd -s 172.31.44.66 --volfile-id heketidbstorage.172.31.44.66.var-lib-heketi-mounts-vg_639487f84fc76a698a7881b9d0aa1d7a-brick_a8e0ce3d3e7d0cba739f06d040b7b058-brick -p /var/run/gluster/vols/heketidbstorage/172.31.44.66-var-lib-heketi-mounts-vg_639487f84fc76a698a7881b9d0aa1d7a-brick_a8e0ce3d3e7d0cba739f06d040b7b058-brick.pid -S /var/run/gluster/fc4d7e2e07c495800fe01a7c42ab5309.socket --brick-name /var/lib/heketi/mounts/vg_639487f84fc76a698a7881b9d0aa1d7a/brick_a8e0ce3d3e7d0cba739f06d040b7b058/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_639487f84fc76a698a7881b9d0aa1d7a-brick_a8e0ce3d3e7d0cba739f06d040b7b058-brick.log --xlator-option *-posix.glusterd-uuid=41a63ecb-947f-4042-bf29-867df052f47e --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152
        sh-4.2# lsof -p 885
        sh: lsof: command not found
        sh-4.2# yum install lsof
        Loaded plugins: ovl, product-id, search-disabled-repos, subscription-manager
        This system is not receiving updates. You can use subscription-manager on the host to register and assign subscriptions.
        There are no enabled repos.
        Run "yum repolist all" to see the repos you have.
        To enable Red Hat Subscription Management repositories: subscription-manager repos --enable <repo>
        To enable custom repositories: yum-config-manager --enable <repo>

3. Collect the /var/log/glusterfs folders from all 3 glusterfs pods.

A. Please confirm the above description is correct.
B. How do I handle the missing lsof command in the pod?

Thanks.

Yes, the steps are OK. You need to install the lsof tool on the pod.
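
Since yum inside the pod has no enabled repos, one hypothetical way to stage lsof is to download the RPM on the host and copy it in with oc cp, which is also the approach the next comment suggests. This is a sketch only: the pod name and glusterfs namespace come from the transcript above, and yumdownloader is assumed available from yum-utils on the host.

```bash
#!/bin/bash
# On the host: fetch the lsof RPM, copy it into a gluster pod, install it,
# then gather the requested ps/lsof output and gluster logs.
POD=glusterfs-storage-8l2xx   # repeat for each of the 3 glusterfs pods

yumdownloader --destdir=/tmp lsof
RPM=$(ls /tmp/lsof-*.rpm)
oc cp "$RPM" "glusterfs/$POD:/tmp/"
oc rsh -n glusterfs "$POD" rpm -ivh "/tmp/$(basename "$RPM")"

# Note the glusterfsd PID from ps (885 in the transcript above), then lsof it.
oc rsh -n glusterfs "$POD" ps -aef | grep glusterfsd
oc rsh -n glusterfs "$POD" lsof -p 885   # replace 885 with the actual PID

# Pull the gluster logs back to the host for attachment to the bug.
oc cp "glusterfs/$POD:/var/log/glusterfs" "./glusterfs-logs-$POD"
```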

I am not aware of a way to install a specific package on a pod, but I think you can download the lsof package on the host, copy it from the host into the pod, and install the package there.

Thanks,
Mohit Agrawal

*** Bug 1620383 has been marked as a duplicate of this bug. ***

Test results/logs and version info are saved here:
http://file.rdu.redhat.com/~hongkliu/test_result/bz1600160/20180828/file/

Stuck at 424 for over 30 minutes; "target is busy" showed up in the heketi log.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0287
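
For clusters still on an unfixed version: the first comments note that a second "oc delete pv" succeeds once heketi is no longer busy, so a hypothetical cleanup loop for PVs left in Failed state could look like the sketch below (STATUS is column 5 in the oc get pv listings above).

```bash
#!/bin/bash
# Re-issue deletion for every PV stuck in Failed state; each attempt
# re-runs the heketi volume delete, which goes through once heketi is
# no longer saturated.
for pv in $(oc get pv --no-headers | awk '$5 == "Failed" {print $1}'); do
  echo "retrying delete of $pv"
  oc delete pv "$pv"
done

# If a PV object is gone but its backing volume survived, the
# gluster.kubernetes.io/heketi-volume-id annotation shown above maps it
# to a heketi volume that "heketi-cli volume delete <id>" can remove.
```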