Bug 1565940 - [Tracker-RHGS-BZ#1618221] "Repeated failure/recovery of glusterd" + "volume create request on cns setup" can lead to a stale volume entry in gluster
Summary: [Tracker-RHGS-BZ#1618221] "Repeated failure/recovery of glusterd" + "volume c...
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhgs-server-container
Version: cns-3.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: ---
Assignee: Saravanakumar
QA Contact: Prasanth
Depends On: 1618221
Blocks: 1573420 OCS-3.11.1-Engineering-Proposed-BZs OCS-3.11.1-devel-triage-done
TreeView+ depends on / blocked
Reported: 2018-04-11 06:19 UTC by krishnaram Karthick
Modified: 2019-04-22 22:48 UTC (History)
18 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
If heketi tries to perform volume create and delete operations in a loop while one of the gluster pod is also restarted in a loop in background, it can lead to a rare race condition where the glusterd instance while coming up on the restarted gluster pod end up with syncing a stale volume which was already deleted resulting into one additional volume entry in the restarted gluster pod. This will eventually lead to a situation where heketi will fail to create the volume with a message "volume already exists". Workaround: The above can be confirmed by looking at the 'gluster v list | wc -l' output from all the gluster pods and see if there's a mismatch in number of volume entries and which ever node there's an extra entry for a volume removing /var/lib/glusterd/vols/<volname in question> from the backend and restarting glusterd instance would fix this issue.
Clone Of:
: 1582402 (view as bug list)
Last Closed: 2019-04-17 06:56:19 UTC
Target Upstream Version:

Attachments (Terms of Use)
heketi_logs (937.98 KB, text/plain)
2018-04-11 06:23 UTC, krishnaram Karthick
no flags Details
gluster_volume_info (9.46 KB, text/plain)
2018-04-11 06:24 UTC, krishnaram Karthick
no flags Details
heketi_logs_apr_16_failedQA (680.00 KB, text/plain)
2018-04-16 14:17 UTC, krishnaram Karthick
no flags Details
db_dump_apr16 (9.77 KB, text/plain)
2018-04-16 14:26 UTC, krishnaram Karthick
no flags Details
glusterd_logs_collated (46.72 KB, text/plain)
2018-04-17 07:58 UTC, Raghavendra Talur
no flags Details
cmd_history_collated.txt (8.66 KB, text/plain)
2018-04-17 08:11 UTC, Raghavendra Talur
no flags Details
all cmd_history separate (40.00 KB, application/x-tar)
2018-04-17 08:23 UTC, Raghavendra Talur
no flags Details
all glusterd logs separate (2.76 MB, application/x-tar)
2018-04-17 08:24 UTC, Raghavendra Talur
no flags Details

Description krishnaram Karthick 2018-04-11 06:19:11 UTC
Description of problem:
On a 5 node CNS gluster cluster, with glusterd being stopped and started on 2 nodes, volume create operations fails occasionally and ends up with a mismatch between number of volumes in heketi and gluster. 

List of pods:

oc get pods -o wide
NAME                                          READY     STATUS    RESTARTS   AGE       IP             NODE
glusterblock-storage-provisioner-dc-1-t9rqz   1/1       Running   0          14h     dhcp47-80.lab.eng.blr.redhat.com
glusterfs-storage-6p7lm                       1/1       Running   1          14h    dhcp46-45.lab.eng.blr.redhat.com
glusterfs-storage-8gqqx                       1/1       Running   0          14h   dhcp47-162.lab.eng.blr.redhat.com
glusterfs-storage-djxwn                       1/1       Running   0          14h    dhcp47-40.lab.eng.blr.redhat.com
glusterfs-storage-g894j                       1/1       Running   1          14h   dhcp46-107.lab.eng.blr.redhat.com
glusterfs-storage-ms5kc                       1/1       Running   0          14h   dhcp47-166.lab.eng.blr.redhat.com
heketi-storage-3-jxkr2                        1/1       Running   0          40m     dhcp47-166.lab.eng.blr.redhat.com

following shell script is run to simulate glusterd failure

while true; do oc rsh glusterfs-storage-6p7lm systemctl stop glusterd; oc rsh glusterfs-storage-g894j systemctl stop glusterd; sleep 10; oc rsh glusterfs-storage-g894j systemctl start glusterd; oc rsh glusterfs-storage-6p7lm systemctl start glusterd; sleep 5; done

Now, from heketi volume create operation is run in a loop.

sh-4.2# for i in {1..10}; do heketi-cli  volume create --size=1; done
Error: Unable to execute command on glusterfs-storage-6p7lm: volume create: vol_4d40ff831898bd72f0d5708e467d032d: failed: Volume vol_4d40ff831898bd72f0d5708e467d032d already exists
Error: Unable to execute command on glusterfs-storage-djxwn: volume create: vol_dadca6d84616d64e9badaea178061f15: failed: Volume vol_dadca6d84616d64e9badaea178061f15 already exists
Name: vol_bf1fb8a0308a945d1d18d07e5152161c
Size: 1
Volume Id: bf1fb8a0308a945d1d18d07e5152161c
Cluster Id: 2553d9c43f8d7b968fd147c2b7da49e8
Mount Options: backup-volfile-servers=,,,
Block: false
Free Size: 0
Block Volumes: []
Durability Type: replicate
Distributed+Replica: 3
Name: vol_63c5abc9d6c3837b40a433ace366f1cd
Size: 1
Volume Id: 63c5abc9d6c3837b40a433ace366f1cd
Cluster Id: 2553d9c43f8d7b968fd147c2b7da49e8
Mount Options: backup-volfile-servers=,,,
Block: false
Free Size: 0
Block Volumes: []
Durability Type: replicate
Distributed+Replica: 3
Error: Unable to execute command on glusterfs-storage-ms5kc: volume create: vol_e465e8d402692afa83d7b89a79adde9a: failed: Volume vol_e465e8d402692afa83d7b89a79adde9a already exists
Name: vol_1ac289e01a4928c28eb5947ae5bad6f3
Size: 1
Volume Id: 1ac289e01a4928c28eb5947ae5bad6f3
Cluster Id: 2553d9c43f8d7b968fd147c2b7da49e8
Mount Options: backup-volfile-servers=,,,
Block: false
Free Size: 0
Block Volumes: []
Durability Type: replicate
Distributed+Replica: 3
Error: Unable to execute command on glusterfs-storage-g894j: volume create: vol_f71551f08f4a2645933d497addfa0653: failed: Volume vol_f71551f08f4a2645933d497addfa0653 already exists
Error: Unable to execute command on glusterfs-storage-ms5kc: volume create: vol_78bdaad1c288b2cc97e35f7b78f49ee4: failed: Volume vol_78bdaad1c288b2cc97e35f7b78f49ee4 already exists
Error: Unable to execute command on glusterfs-storage-ms5kc: volume create: vol_dc25850f2f514ddc8fa55046120d29a1: failed: Volume vol_dc25850f2f514ddc8fa55046120d29a1 already exists
Name: vol_61f3ab252579bd73026c7471cae2ea8a
Size: 1
Volume Id: 61f3ab252579bd73026c7471cae2ea8a
Cluster Id: 2553d9c43f8d7b968fd147c2b7da49e8
Mount Options: backup-volfile-servers=,,,
Block: false
Free Size: 0
Block Volumes: []
Durability Type: replicate
Distributed+Replica: 3

sh-4.2# heketi-cli volume list
Id:015191ee76505771b71a941b9d3525e3    Cluster:2553d9c43f8d7b968fd147c2b7da49e8    Name:vol_015191ee76505771b71a941b9d3525e3
Id:1ac289e01a4928c28eb5947ae5bad6f3    Cluster:2553d9c43f8d7b968fd147c2b7da49e8    Name:vol_1ac289e01a4928c28eb5947ae5bad6f3
Id:1cce9eaeaf8fe817203a983770b45da4    Cluster:2553d9c43f8d7b968fd147c2b7da49e8    Name:heketidbstorage
Id:52f9d1a32b916236076226ec361c31dd    Cluster:2553d9c43f8d7b968fd147c2b7da49e8    Name:vol_52f9d1a32b916236076226ec361c31dd
Id:61f3ab252579bd73026c7471cae2ea8a    Cluster:2553d9c43f8d7b968fd147c2b7da49e8    Name:vol_61f3ab252579bd73026c7471cae2ea8a
Id:63c5abc9d6c3837b40a433ace366f1cd    Cluster:2553d9c43f8d7b968fd147c2b7da49e8    Name:vol_63c5abc9d6c3837b40a433ace366f1cd
Id:b523d2625eb50a48658164657dbff597    Cluster:2553d9c43f8d7b968fd147c2b7da49e8    Name:vol_b523d2625eb50a48658164657dbff597
Id:bf1fb8a0308a945d1d18d07e5152161c    Cluster:2553d9c43f8d7b968fd147c2b7da49e8    Name:vol_bf1fb8a0308a945d1d18d07e5152161c

However, from gluster there are more number of volumes.

[root@dhcp47-116 ~]# oc rsh glusterfs-storage-8gqqx gluster volume list

The additional volumes seen here are in stopped state. (complete list of volume info shall be attached)

Version-Release number of selected component (if applicable):
rpm -qa | grep 'heketi'

How reproducible:
1/1, but should be easily seen

Steps to Reproduce:
covered as part of description

Actual results:
volume creation fails and creates additional volumes in the backend

Expected results:
volume creation should succeed and there should not be any mismatch 

Additional info:
heketi logs shall be attached

Comment 2 krishnaram Karthick 2018-04-11 06:23:44 UTC
Created attachment 1420180 [details]

Comment 3 krishnaram Karthick 2018-04-11 06:24:51 UTC
Created attachment 1420181 [details]

Comment 4 Raghavendra Talur 2018-04-11 15:49:33 UTC
Ok, here is the root cause:

If a volume is created but start fails then the volume create operation is considered as a failure. This triggers a cleanup process which is not strict enough(we don't check if volume is really destroyed). As the retry mechanism uses the same ID as before we end up getting errors from gluster "volume already exists". 

Possible Solution:

a. Split the volume create executor into two parts or more. After each executor call determine the best course of action and process.

b. In case of a cleanup where we are not sure if the cleanup has completed or not, we cannot reuse the IDs and will have to retry at a higher level. Note that we should not delete the PendingOperations of the first volume create. It might have more cleanups to be performed.

Comment 5 John Mulligan 2018-04-11 18:05:01 UTC
Here's my interpretation of the problem:

Based on my reading of the code I doubt this exact condition is new to this build and has probably been lurking around for a while. But I suspect that the retry loop actually makes this easier to hit.

It's important that in this test glusterd is being stopped and started in a loop. According to the logs volume create requests are coming in and most fail because gluster on the node it is trying to use is down. Typically, this failure occurs on a 'gluster volume create' command. However, due to timing it occasionally hits a case where 'gluster volume create' succeeds but 'gluster volume start' fails.

Within the code to create a volume is an "immediate cleanup" of the volume when an error is hit. However, because of the state of glusterd this cleanup also fails. At this point the volume create operation's rollback function is called. Rollback attempts to clean up the bricks (only the bricks, because historically the code assumed the self-cleanup of volume create would succeed) and then the heketi db entries. Since these steps work, rollback succeeds and a new retry is attempted. On this attempt the create fails because the volume is already in use by gluster. Again, rollback fires with a false success removing any trace of the volume from the db.

Dealing with the cleanup when the volume can't reliably be deleted on the gluster side will be challenging and need more design & changes on the heketi side (at the very least). However, in the short term I'm going to look into making the rollback fail reliably when the volume cleanup hasn't happened. This will leave the volume in the db (in a pending state) and the heketi db tools can then be used to view it and help clean it up manually on the gluster side.

Unfortunately, we can't avoid the problem just by cycling IDs because we support user specified names. This would also not avoid the issue of the volume being removed from the db on rollback failures.

Comment 10 krishnaram Karthick 2018-04-16 14:17:01 UTC
Created attachment 1422571 [details]

Comment 11 krishnaram Karthick 2018-04-16 14:26:09 UTC
Created attachment 1422573 [details]

Comment 13 Raghavendra Talur 2018-04-17 06:08:32 UTC
Doc text looks good to me.

Comment 15 Raghavendra Talur 2018-04-17 07:58:41 UTC
Created attachment 1422936 [details]

Comment 17 Raghavendra Talur 2018-04-17 08:11:55 UTC
Created attachment 1422937 [details]

Comment 18 Raghavendra Talur 2018-04-17 08:23:04 UTC
Created attachment 1422941 [details]
all cmd_history separate

Comment 19 Raghavendra Talur 2018-04-17 08:24:46 UTC
Created attachment 1422942 [details]
all glusterd logs separate

Comment 40 Yaniv Kaul 2019-04-17 06:56:19 UTC
I don't see any additional work to do here.

Note You need to log in before you can comment on or make changes to this bug.