Bug 1598748

Summary: Block device deletion returns "SUCCESS" even though deletion has failed on one of the nodes
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rachael <rgeorge>
Component: gluster-block
Assignee: Prasanna Kumar Kalever <prasanna.kalever>
Status: CLOSED ERRATA
QA Contact: Nitin Goyal <nigoyal>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.10
CC: bgoyal, kramdoss, nberry, pkarampu, pprakash, prasanna.kalever, rgeorge, rhs-bugs, sankarshan, vbellur, xiubli
Target Milestone: ---
Target Release: CNS 3.10
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: gluster-block-0.2.1-21.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-12 09:27:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1568862

Description Rachael 2018-07-06 11:33:25 UTC
Description of problem:
On a CNS setup with existing block PVCs, a script was run to delete 10 PVCs while the targetcli process was being killed on one of the pods. The deletion of the PVCs was successful and the block devices were deleted from both heketi and the gluster backend. However, the following message was seen in the heketi logs:

[kubeexec] DEBUG 2018/07/06 09:21:51 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:246: Host: dhcp46-244.lab.eng.blr.redhat.com Pod: glusterfs-storage-w9jcs Command: gluster-block delete vol_37b50be8ac1fb551ad7f1b2985d8b6a7/test-vol_glusterfs_claim45_2d9c2d2b-80cf-11e8-a4e5-0a580a810203 --json
Result: { "FAILED ON": [ "10.70.46.244" ], "SUCCESSFUL ON": [ "10.70.47.60", "10.70.47.95" ], "RESULT": "SUCCESS" }
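
The mismatch in the result above ("FAILED ON" is not empty, yet "RESULT" is "SUCCESS") can be detected on the caller's side. A minimal sketch, assuming the JSON keys are exactly the ones shown in the heketi log; the function name delete_fully_succeeded is hypothetical and is not part of heketi or gluster-block:

```python
import json


def delete_fully_succeeded(raw_output):
    """Return (ok, failed_hosts) for the JSON printed by a delete request.

    Assumes the keys seen in the heketi log: "RESULT", "FAILED ON" and
    "SUCCESSFUL ON". The delete only counts as successful when RESULT is
    "SUCCESS" and no host is listed under "FAILED ON".
    """
    result = json.loads(raw_output)
    failed_hosts = result.get("FAILED ON", [])
    ok = result.get("RESULT") == "SUCCESS" and not failed_hosts
    return ok, failed_hosts


# The result string logged by heketi for this bug.
logged = (
    '{ "FAILED ON": [ "10.70.46.244" ], '
    '"SUCCESSFUL ON": [ "10.70.47.60", "10.70.47.95" ], '
    '"RESULT": "SUCCESS" }'
)
ok, failed_hosts = delete_fully_succeeded(logged)
print(ok, failed_hosts)
```

With a check like this, the logged result would be treated as a failed delete on 10.70.46.244 rather than as an overall success.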

On checking the gluster-blockd logs, the following error was seen:

[2018-07-06 09:21:49.707366] INFO: delete cli request, volume=vol_37b50be8ac1fb551ad7f1b2985d8b6a7 blockname=test-vol_glusterfs_claim45_2d9c2d2b-80cf-11e8-a4e5-0a580a810203 [at block_svc_routines.c+4530 :<block_delete_cli_1_svc_st>]
[2018-07-06 09:21:49.813627] INFO: delete request, blockname=test-vol_glusterfs_claim45_2d9c2d2b-80cf-11e8-a4e5-0a580a810203 filename=31dbec61-2f92-43d2-afae-8d2620585511 [at block_svc_routines.c+4651 :<block_delete_1_svc_st>]
[2018-07-06 09:21:49.835666] ERROR: No target config for block test-vol_glusterfs_claim45_2d9c2d2b-80cf-11e8-a4e5-0a580a810203. [at block_svc_routines.c+4673 :<block_delete_1_svc_st>]
[2018-07-06 09:21:49.948161] ERROR: failed in remote delete for block test-vol_glusterfs_claim45_2d9c2d2b-80cf-11e8-a4e5-0a580a810203 on host 10.70.46.244 volume vol_37b50be8ac1fb551ad7f1b2985d8b6a7 [at block_svc_routines.c+1031 :<glusterBlockDeleteRemote>]
[2018-07-06 09:21:51.208877] ERROR: failed to delete config on 10.70.46.244 No target config for block test-vol_glusterfs_claim45_2d9c2d2b-80cf-11e8-a4e5-0a580a810203.: on volume vol_37b50be8ac1fb551ad7f1b2985d8b6a7 on host 10.70.46.244 [at block_svc_routines.c+1115 :<glusterBlockCollectAttemptSuccess>]

The saveconfig.json file still has the entries for the deleted block devices.
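
One way to look for such leftover entries is to search the parsed saveconfig.json for the block name. A minimal sketch, assuming the file is a JSON document and that block names appear inside string values; the helper name find_stale_refs and the small sample config are hypothetical and only illustrate the idea:

```python
def find_stale_refs(config, block_name, path="$"):
    """Recursively collect (json_path, value) pairs whose string value
    mentions block_name. Walks dicts and lists without assuming a schema."""
    hits = []
    if isinstance(config, dict):
        for key, value in config.items():
            hits.extend(find_stale_refs(value, block_name, f"{path}.{key}"))
    elif isinstance(config, list):
        for index, value in enumerate(config):
            hits.extend(find_stale_refs(value, block_name, f"{path}[{index}]"))
    elif isinstance(config, str) and block_name in config:
        hits.append((path, config))
    return hits


# Simplified, hypothetical fragment; a real saveconfig.json holds much more.
sample = {
    "storage_objects": [{"name": "blk_a"}],
    "targets": [
        {"tpgs": [{"luns": [{"storage_object": "/backstores/user/blk_a"}]}]}
    ],
}
for where, value in find_stale_refs(sample, "blk_a"):
    print(where, value)
```

In practice the config would be loaded with json.load from the saveconfig.json file on each gluster pod, and an empty result for a deleted block name would mean that no stale entries remain on that pod.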



Version-Release number of selected component (if applicable):

# oc version
oc v3.10.0-0.67.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

# rpm -qa|grep heketi
python-heketi-7.0.0-2.el7rhgs.x86_64
heketi-client-7.0.0-2.el7rhgs.x86_64
heketi-7.0.0-2.el7rhgs.x86_64


# rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
gluster-block-0.2.1-20.el7rhgs.x86_64


How reproducible: 3/3


Steps to Reproduce:

1. On one of the gluster pods, run the following loop to keep killing the targetcli process: while(true); do pkill targetcli; done

2. From the master node, delete a block PVC: oc delete pvc <claim_name>


Actual results:
The deletion of the PVC is successful even though it has failed on one of the nodes.

Additional info:
Logs will be attached soon

Comment 6 Pranith Kumar K 2018-07-09 09:26:56 UTC
Rachael,
   Could you provide QE-ack?

Comment 11 Nitin Goyal 2018-07-30 09:26:14 UTC
Hi,

I verified this bug with the RPMs and container images listed below. It is working fine.

Rpm: ->
gluster-block-0.2.1-22.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-22.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64

RHEL 7.5 kernel
3.10.0-862.11.2.el7.x86_64

Container Images: ->
rhgs-server-rhel7:3.3.1-27
rhgs-gluster-block-prov-rhel7:3.3.1-20

What I observed: ->

When I delete a PVC while "while(true); do pkill targetcli; done" is running on one gluster pod, the PVC does not get deleted, but the stale entries are deleted from the other gluster pods.

When I stop the "while(true); do pkill targetcli; done" script, the block device is deleted and the remaining stale entries are deleted as well.


When the PVC was not yet deleted: ->

[root@dhcp46-180 ~]# oc get pvc | grep c126
c126      Bound     pvc-70f7c93a-919e-11e8-b631-005056a53010   1Gi        RWO            block-sc       2d
[root@dhcp46-180 ~]# 
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-r6r6c gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-rfrx6 gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-x94vq gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004
[root@dhcp46-180 ~]# 
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-r6r6c cat /etc/target/saveconfig.json | grep c126
      "name": "blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004", 
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-rfrx6 cat /etc/target/saveconfig.json | grep c126
      "name": "blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004", 
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-x94vq cat /etc/target/saveconfig.json | grep c126
      "name": "blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004", 
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"



Running the script on one gluster pod: ->

sh-4.2# while(true); do pkill targetcli; done

When the PVC delete command was given: ->

[root@dhcp46-180 ~]# oc delete pvc c126
persistentvolumeclaim "c126" deleted
[root@dhcp46-180 ~]# 
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-r6r6c gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-rfrx6 gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-x94vq gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-r6r6c cat /etc/target/saveconfig.json | grep c126
      "name": "blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004", 
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
              "storage_object": "/backstores/user/blk_glusterfs_c126_714eda17-919e-11e8-9b11-0a580a810004"
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-rfrx6 cat /etc/target/saveconfig.json | grep c126
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-x94vq cat /etc/target/saveconfig.json | grep c126
[root@dhcp46-180 ~]# 

Stopping the script on the gluster pod: ->
sh-4.2# while(true); do pkill targetcli; done 
^C
sh-4.2# 

After stopping the script: ->

[root@dhcp46-180 ~]# oc rsh glusterfs-storage-r6r6c gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-rfrx6 gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-x94vq gluster-block list vol_355b430ec4dfc1ee674da2e63f12153b | grep c126
[root@dhcp46-180 ~]# 
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-r6r6c cat /etc/target/saveconfig.json | grep c126
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-rfrx6 cat /etc/target/saveconfig.json | grep c126
[root@dhcp46-180 ~]# oc rsh glusterfs-storage-x94vq cat /etc/target/saveconfig.json | grep c126
[root@dhcp46-180 ~]#

Comment 13 errata-xmlrpc 2018-09-12 09:27:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2691