Bug 1573304
| Field | Value |
| --- | --- |
| Summary | [Tracker-RHGS-BZ#1631329-BZ#1524336] Can't delete PV - stuck in Failed status |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | rhgs-server-container |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | unspecified |
| Version | rhgs-3.0 |
| Target Milestone | --- |
| Target Release | OCS 3.11.1 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | rhgs-server-rhel7:3.11.1-1 |
| Doc Type | If docs needed, set a value |
| Keywords | ZStream |
| Type | Bug |
| Reporter | Vikas Laad <vlaad> |
| Assignee | Saravanakumar <sarumuga> |
| QA Contact | Neha Berry <nberry> |
| CC | amukherj, aos-bugs, aos-storage-staff, bchilds, ekuric, hchiramm, hongkliu, jarrpa, jmulligan, kramdoss, madam, mifiedle, moagrawa, nigoyal, pprakash, rcyriac, rhs-bugs, rtalur, sankarshan, storage-qa-internal, vinug, vlaad |
| Last Closed | 2019-02-07 04:12:47 UTC |
| Bug Depends On | 1524336, 1631329, 1659815 |
| Bug Blocks | 1641915, 1644154 |

Description (Vikas Laad, 2018-04-30 19:48:16 UTC):

    pvc-d3bf386c-4cac-11e8-a620-02701b2ae108   1Gi   RWO   Delete   Failed   pvclusproject0/pvc1u7oqq4rcl   glusterfs-storage   19m

Created attachment 1428936 [details]: master controller logs
Created attachment 1428937 [details]: master api logs
Created attachment 1428938 [details]: heketi logs

If I try again using "oc delete pv" the PV gets deleted; in the first attempt it was deleted using "oc delete pvc --all -n <proj-name>".

This looks like a case of too many concurrent volume operations hitting heketi. I believe this should be resolved with the latest version. Please see if you can reproduce; I will give it a shot tomorrow.

Hi Jose, just to make sure about the versions:

    rhgs-volmanager-container-3.3.1-8.1527091742
    https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=65121
    rhgs-server-container-3.3.1-10.1527091766
    https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=65120
    rhgs-gluster-block-prov-container-3.3.1-7.1527091787
    https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=65123

Are those the images that I should be using? Thanks.

Hi Jose, another thing to confirm. With commit https://github.com/openshift/openshift-ansible/commit/0be4b2565beb92c064917627863401af7dfb73d3, these two inventory variables:

    openshift_storage_glusterfs_image=registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7
    openshift_storage_glusterfs_version=3.3.1-10.1527091766

have to be collapsed into this single one:

    openshift_storage_glusterfs_image=registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-10.1527091766

and openshift_storage_glusterfs_version is no longer used. The same applies to the heketi and provisioner images too. Is that correct?

Tried again with:

    # yum list installed | grep openshift
    atomic-openshift.x86_64   3.10.0-0.53.0.git.0.f0248f3.el7
    # oc describe node | grep -i run
    Container Runtime Version:   docker://1.13.1
    # oc get pod -n glusterfs -o yaml | grep "image:" | sort -u
        image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-gluster-block-prov-rhel7:3.3.1-7.1527091787
        image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-10.1527091766
        image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7:3.3.1-8.1527091742

The problem is still there:

1. Create 70 pods with glusterfs PVCs in a project.
2. oc delete project <project_name>

The PVs then cannot be cleaned up. There were 2 PVs left:

    # oc get pv | grep Failed
    pvc-ac62399a-65b2-11e8-a1c5-02fa00f301da   1Gi   RWO   Delete   Failed   fioatest0/pvcnsfshfadmv   glusterfs-storage   2h
    pvc-cd1d024a-65b2-11e8-a1c5-02fa00f301da   1Gi   RWO   Delete   Failed   fioatest0/pvcm6cfwtn5wu   glusterfs-storage   2h

    # oc describe pv pvc-cd1d024a-65b2-11e8-a1c5-02fa00f301da
    Name:            pvc-cd1d024a-65b2-11e8-a1c5-02fa00f301da
    Labels:          <none>
    Annotations:     Description=Gluster-Internal: Dynamically provisioned PV
                     gluster.kubernetes.io/heketi-volume-id=707cbc371a6b84baed95009431b5667d
                     gluster.org/type=file
                     kubernetes.io/createdby=heketi-dynamic-provisioner
                     pv.beta.kubernetes.io/gid=2048
                     pv.kubernetes.io/bound-by-controller=yes
                     pv.kubernetes.io/provisioned-by=kubernetes.io/glusterfs
                     volume.beta.kubernetes.io/mount-options=auto_unmount
    Finalizers:      [kubernetes.io/pv-protection]
    StorageClass:    glusterfs-storage
    Status:          Failed
    Claim:           fioatest0/pvcm6cfwtn5wu
    Reclaim Policy:  Delete
    Access Modes:    RWO
    Capacity:        1Gi
    Node Affinity:   <none>
    Message:         Unable to delete volume vol_707cbc371a6b84baed95009431b5667d: Unable to execute command on glusterfs-storage-ggxfv:
    Source:
        Type:           Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
        EndpointsName:  glusterfs-dynamic-pvcm6cfwtn5wu
        Path:           vol_707cbc371a6b84baed95009431b5667d
        ReadOnly:       false
    Events:
        Type     Reason              Age  From                         Message
        ----     ------              ---  ----                         -------
        Warning  VolumeFailedDelete  37m  persistentvolume-controller  Unable to delete volume vol_707cbc371a6b84baed95009431b5667d: Unable to execute command on glusterfs-storage-ggxfv:

This is not

(In reply to Humble Chirammal from comment #18)
> This is not

Sorry for the incomplete comment. This is between heketi -> gluster:

    [kubeexec] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [gluster --mode=script volume delete vol_120a9680caea2b4b45c914a9bd29fbe6] on glusterfs-storage-57dww: Err[command terminated with exit code 1]: Stdout [Error : Request timed out ]: Stderr []
    [cmdexec] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/executors/cmdexec/volume.go:144: Unable to delete volume vol_120a9680caea2b4b45c914a9bd29fbe6: Unable to execute command on glusterfs-storage-57dww:
    [heketi] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:436: Unable to delete volume: Unable to delete volume vol_120a9680caea2b4b45c914a9bd29fbe6: Unable to execute command on glusterfs-storage-57dww:
    [heketi] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:360: Error executing delete volume: Unable to delete volume vol_120a9680caea2b4b45c914a9bd29fbe6: Unable to execute command on glusterfs-storage-57dww:
    [heketi] ERROR 2018/04/30 17:29:18 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:919: Delete Volume Failed: Unable to delete volume vol_120a9680caea2b4b45c914a9bd29fbe6: Unable to execute command on glusterfs-storage-57dww:
    [asynchttp] INFO 2018/04/30 17:29:18 asynchttp.go:129: Completed job b8f6f9faced142f91a3409a7374ff083 in 6m39.204431248s

I am moving this bugzilla to heketi.

Yes, 3 glusterfs pods are up and oc rsh worked.

Throttling support comes in the following two patches:
https://github.com/heketi/heketi/pull/1267
https://github.com/heketi/heketi/pull/1271

I created 800 PVCs and deleted them; I see around 500 PVs in Failed status. The number is going down slowly. I think this is what is expected after reading the comment above.
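
For reference, a minimal scripted version of the reproduction described above: a sketch only, assuming the glusterfs-storage StorageClass from this cluster and the oc CLI. The project name and claim count are placeholders, and bare PVCs are used instead of pods, since claims alone are enough to drive the heketi provision/delete path.

```bash
#!/bin/bash
# Provision many gluster-backed claims in one project, then delete the
# project in one shot and look for PVs left behind in Failed state.
PROJECT=pvtest0   # hypothetical project name
COUNT=70          # 70 reproduced it above; 800 made it far worse

oc new-project "$PROJECT"
for i in $(seq 1 "$COUNT"); do
  oc create -n "$PROJECT" -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc$i
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: glusterfs-storage
  resources:
    requests:
      storage: 1Gi
EOF
done

# Wait until every claim is bound, then tear the project down at once;
# this is what floods heketi with concurrent volume-delete requests.
until [ "$(oc get pvc -n "$PROJECT" --no-headers | grep -c Bound)" -eq "$COUNT" ]; do
  sleep 10
done
oc delete project "$PROJECT"

# Any PV the controller could not reclaim shows up as Failed.
oc get pv | grep Failed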

Here is what I see when I do a describe on the PV:

    Name:            pvc-f6996b4f-8e8b-11e8-9974-02615df9d798
    Labels:          <none>
    Annotations:     Description=Gluster-Internal: Dynamically provisioned PV
                     gluster.kubernetes.io/heketi-volume-id=be1058b743f0ba86d9fb54ecc3f8b8bf
                     gluster.org/type=file
                     kubernetes.io/createdby=heketi-dynamic-provisioner
                     pv.beta.kubernetes.io/gid=2403
                     pv.kubernetes.io/bound-by-controller=yes
                     pv.kubernetes.io/provisioned-by=kubernetes.io/glusterfs
                     volume.beta.kubernetes.io/mount-options=auto_unmount
    Finalizers:      [kubernetes.io/pv-protection]
    StorageClass:    glusterfs-storage
    Status:          Failed
    Claim:           fiotest0/pvcqo2nnqxk7t
    Reclaim Policy:  Delete
    Access Modes:    RWO
    Capacity:        1Gi
    Node Affinity:   <none>
    Message:         Server busy. Retry operation later.
    Source:
        Type:           Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
        EndpointsName:  glusterfs-dynamic-pvcqo2nnqxk7t
        Path:           vol_be1058b743f0ba86d9fb54ecc3f8b8bf
        ReadOnly:       false
    Events:
        Type     Reason              Age  From                         Message
        ----     ------              ---  ----                         -------
        Warning  VolumeFailedDelete  53m  persistentvolume-controller  Server busy. Retry operation later.

@krishnaram, deletion is still stuck when the number of PVCs is 1000. See 2 in comment 41.

Hello Michael, which image should we use to verify the bz? Thanks.

Hi Mohit, let me make sure I understand what I need to do:

1. grep "target is busy" <heketi_log>
2. When it shows up, run ps and then lsof in all 3 glusterfs pods, e.g.:

        # oc project
        Using project "glusterfs" on server "https://ip-172-31-20-26.us-west-2.compute.internal:8443".
        root@ip-172-31-20-26: ~ # oc get pod
        NAME                                          READY   STATUS    RESTARTS   AGE
        glusterblock-storage-provisioner-dc-1-6tz9v   1/1     Running   0          3h
        glusterfs-storage-8l2xx                       1/1     Running   0          3h
        glusterfs-storage-kq75g                       1/1     Running   0          3h
        glusterfs-storage-n65vg                       1/1     Running   0          3h
        heketi-storage-2-qqlt2                        1/1     Running   0          3h
        root@ip-172-31-20-26: ~ # oc rsh glusterfs-storage-8l2xx
        sh-4.2# ps -aef | grep glusterfsd
        root 885 1 0 15:01 ? 00:00:11 /usr/sbin/glusterfsd -s 172.31.44.66 --volfile-id heketidbstorage.172.31.44.66.var-lib-heketi-mounts-vg_639487f84fc76a698a7881b9d0aa1d7a-brick_a8e0ce3d3e7d0cba739f06d040b7b058-brick -p /var/run/gluster/vols/heketidbstorage/172.31.44.66-var-lib-heketi-mounts-vg_639487f84fc76a698a7881b9d0aa1d7a-brick_a8e0ce3d3e7d0cba739f06d040b7b058-brick.pid -S /var/run/gluster/fc4d7e2e07c495800fe01a7c42ab5309.socket --brick-name /var/lib/heketi/mounts/vg_639487f84fc76a698a7881b9d0aa1d7a/brick_a8e0ce3d3e7d0cba739f06d040b7b058/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_639487f84fc76a698a7881b9d0aa1d7a-brick_a8e0ce3d3e7d0cba739f06d040b7b058-brick.log --xlator-option *-posix.glusterd-uuid=41a63ecb-947f-4042-bf29-867df052f47e --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152
        sh-4.2# lsof -p 885
        sh: lsof: command not found
        sh-4.2# yum install lsof
        Loaded plugins: ovl, product-id, search-disabled-repos, subscription-manager
        This system is not receiving updates. You can use subscription-manager on the host to register and assign subscriptions.
        There are no enabled repos.
        Run "yum repolist all" to see the repos you have.
        To enable Red Hat Subscription Management repositories: subscription-manager repos --enable <repo>
        To enable custom repositories: yum-config-manager --enable <repo>

3. Collect the /var/log/glusterfs folders from all 3 glusterfs pods.

A. Please confirm the above description is correct.
B. How do I handle the missing lsof command in the pod?

Thanks.

Yes, the steps are OK. You need to install the lsof tool on the pod.
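
Since yum inside the pod has no enabled repos, one hypothetical way to stage lsof is to download the RPM on the host and copy it in with oc cp, which is also the approach the next comment suggests. This is a sketch only: the pod name and glusterfs namespace come from the transcript above, and yumdownloader is assumed available from yum-utils on the host.

```bash
#!/bin/bash
# On the host: fetch the lsof RPM, copy it into a gluster pod, install it,
# then gather the requested ps/lsof output and gluster logs.
POD=glusterfs-storage-8l2xx   # repeat for each of the 3 glusterfs pods

yumdownloader --destdir=/tmp lsof
RPM=$(ls /tmp/lsof-*.rpm)
oc cp "$RPM" "glusterfs/$POD:/tmp/"
oc rsh -n glusterfs "$POD" rpm -ivh "/tmp/$(basename "$RPM")"

# Note the glusterfsd PID from ps (885 in the transcript above), then lsof it.
oc rsh -n glusterfs "$POD" ps -aef | grep glusterfsd
oc rsh -n glusterfs "$POD" lsof -p 885   # replace 885 with the actual PID

# Pull the gluster logs back to the host for attachment to the bug.
oc cp "glusterfs/$POD:/var/log/glusterfs" "./glusterfs-logs-$POD"
```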

I am not aware of a way to install a specific package on a pod, but I think you can download the lsof package on the host, copy it from the host into the pod, and install the package there.

Thanks,
Mohit Agrawal

*** Bug 1620383 has been marked as a duplicate of this bug. ***

Test results/logs and version info are saved here:
http://file.rdu.redhat.com/~hongkliu/test_result/bz1600160/20180828/file/

Stuck at 424 for over 30 minutes; "target is busy" showed up in the heketi log.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0287
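
For clusters still on an unfixed version: the first comments note that a second "oc delete pv" succeeds once heketi is no longer busy, so a hypothetical cleanup loop for PVs left in Failed state could look like the sketch below (STATUS is column 5 in the oc get pv listings above).

```bash
#!/bin/bash
# Re-issue deletion for every PV stuck in Failed state; each attempt
# re-runs the heketi volume delete, which goes through once heketi is
# no longer saturated.
for pv in $(oc get pv --no-headers | awk '$5 == "Failed" {print $1}'); do
  echo "retrying delete of $pv"
  oc delete pv "$pv"
done

# If a PV object is gone but its backing volume survived, the
# gluster.kubernetes.io/heketi-volume-id annotation shown above maps it
# to a heketi volume that "heketi-cli volume delete <id>" can remove.
```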