Description of problem:
On a 5-node CNS gluster cluster, with glusterd being stopped and started on 2 of the nodes, volume create operations occasionally fail and end up with a mismatch between the number of volumes in heketi and in gluster.

List of pods:

oc get pods -o wide
NAME                                          READY     STATUS    RESTARTS   AGE       IP             NODE
glusterblock-storage-provisioner-dc-1-t9rqz   1/1       Running   0          14h       10.130.2.7     dhcp47-80.lab.eng.blr.redhat.com
glusterfs-storage-6p7lm                       1/1       Running   1          14h       10.70.46.45    dhcp46-45.lab.eng.blr.redhat.com
glusterfs-storage-8gqqx                       1/1       Running   0          14h       10.70.47.162   dhcp47-162.lab.eng.blr.redhat.com
glusterfs-storage-djxwn                       1/1       Running   0          14h       10.70.47.40    dhcp47-40.lab.eng.blr.redhat.com
glusterfs-storage-g894j                       1/1       Running   1          14h       10.70.46.107   dhcp46-107.lab.eng.blr.redhat.com
glusterfs-storage-ms5kc                       1/1       Running   0          14h       10.70.47.166   dhcp47-166.lab.eng.blr.redhat.com
heketi-storage-3-jxkr2                        1/1       Running   0          40m       10.130.0.8     dhcp47-166.lab.eng.blr.redhat.com

The following shell script is run to simulate glusterd failures:

while true; do
    oc rsh glusterfs-storage-6p7lm systemctl stop glusterd
    oc rsh glusterfs-storage-g894j systemctl stop glusterd
    sleep 10
    oc rsh glusterfs-storage-g894j systemctl start glusterd
    oc rsh glusterfs-storage-6p7lm systemctl start glusterd
    sleep 5
done

Now, from heketi, the volume create operation is run in a loop:

sh-4.2# for i in {1..10}; do heketi-cli volume create --size=1; done
Error: Unable to execute command on glusterfs-storage-6p7lm: volume create: vol_4d40ff831898bd72f0d5708e467d032d: failed: Volume vol_4d40ff831898bd72f0d5708e467d032d already exists
Error: Unable to execute command on glusterfs-storage-djxwn: volume create: vol_dadca6d84616d64e9badaea178061f15: failed: Volume vol_dadca6d84616d64e9badaea178061f15 already exists
Name: vol_bf1fb8a0308a945d1d18d07e5152161c
Size: 1
Volume Id: bf1fb8a0308a945d1d18d07e5152161c
Cluster Id: 2553d9c43f8d7b968fd147c2b7da49e8
Mount: 10.70.47.166:vol_bf1fb8a0308a945d1d18d07e5152161c
Mount Options: backup-volfile-servers=10.70.47.162,10.70.47.40,10.70.46.107,10.70.46.45
Block: false
Free Size: 0
Block Volumes: []
Durability Type: replicate
Distributed+Replica: 3
Name: vol_63c5abc9d6c3837b40a433ace366f1cd
Size: 1
Volume Id: 63c5abc9d6c3837b40a433ace366f1cd
Cluster Id: 2553d9c43f8d7b968fd147c2b7da49e8
Mount: 10.70.47.166:vol_63c5abc9d6c3837b40a433ace366f1cd
Mount Options: backup-volfile-servers=10.70.47.162,10.70.47.40,10.70.46.107,10.70.46.45
Block: false
Free Size: 0
Block Volumes: []
Durability Type: replicate
Distributed+Replica: 3
Error: Unable to execute command on glusterfs-storage-ms5kc: volume create: vol_e465e8d402692afa83d7b89a79adde9a: failed: Volume vol_e465e8d402692afa83d7b89a79adde9a already exists
Name: vol_1ac289e01a4928c28eb5947ae5bad6f3
Size: 1
Volume Id: 1ac289e01a4928c28eb5947ae5bad6f3
Cluster Id: 2553d9c43f8d7b968fd147c2b7da49e8
Mount: 10.70.47.166:vol_1ac289e01a4928c28eb5947ae5bad6f3
Mount Options: backup-volfile-servers=10.70.47.162,10.70.47.40,10.70.46.107,10.70.46.45
Block: false
Free Size: 0
Block Volumes: []
Durability Type: replicate
Distributed+Replica: 3
Error: Unable to execute command on glusterfs-storage-g894j: volume create: vol_f71551f08f4a2645933d497addfa0653: failed: Volume vol_f71551f08f4a2645933d497addfa0653 already exists
Error: Unable to execute command on glusterfs-storage-ms5kc: volume create: vol_78bdaad1c288b2cc97e35f7b78f49ee4: failed: Volume vol_78bdaad1c288b2cc97e35f7b78f49ee4 already exists
Error: Unable to execute command on glusterfs-storage-ms5kc: volume create: vol_dc25850f2f514ddc8fa55046120d29a1: failed: Volume vol_dc25850f2f514ddc8fa55046120d29a1 already exists
Name: vol_61f3ab252579bd73026c7471cae2ea8a
Size: 1
Volume Id: 61f3ab252579bd73026c7471cae2ea8a
Cluster Id: 2553d9c43f8d7b968fd147c2b7da49e8
Mount: 10.70.47.166:vol_61f3ab252579bd73026c7471cae2ea8a
Mount Options: backup-volfile-servers=10.70.47.162,10.70.47.40,10.70.46.107,10.70.46.45
Block: false
Free Size: 0
Block Volumes: []
Durability Type: replicate
Distributed+Replica: 3

sh-4.2# heketi-cli volume list
Id:015191ee76505771b71a941b9d3525e3 Cluster:2553d9c43f8d7b968fd147c2b7da49e8 Name:vol_015191ee76505771b71a941b9d3525e3
Id:1ac289e01a4928c28eb5947ae5bad6f3 Cluster:2553d9c43f8d7b968fd147c2b7da49e8 Name:vol_1ac289e01a4928c28eb5947ae5bad6f3
Id:1cce9eaeaf8fe817203a983770b45da4 Cluster:2553d9c43f8d7b968fd147c2b7da49e8 Name:heketidbstorage
Id:52f9d1a32b916236076226ec361c31dd Cluster:2553d9c43f8d7b968fd147c2b7da49e8 Name:vol_52f9d1a32b916236076226ec361c31dd
Id:61f3ab252579bd73026c7471cae2ea8a Cluster:2553d9c43f8d7b968fd147c2b7da49e8 Name:vol_61f3ab252579bd73026c7471cae2ea8a
Id:63c5abc9d6c3837b40a433ace366f1cd Cluster:2553d9c43f8d7b968fd147c2b7da49e8 Name:vol_63c5abc9d6c3837b40a433ace366f1cd
Id:b523d2625eb50a48658164657dbff597 Cluster:2553d9c43f8d7b968fd147c2b7da49e8 Name:vol_b523d2625eb50a48658164657dbff597
Id:bf1fb8a0308a945d1d18d07e5152161c Cluster:2553d9c43f8d7b968fd147c2b7da49e8 Name:vol_bf1fb8a0308a945d1d18d07e5152161c

However, gluster reports more volumes:

[root@dhcp47-116 ~]# oc rsh glusterfs-storage-8gqqx gluster volume list
heketidbstorage
vol_015191ee76505771b71a941b9d3525e3
vol_1ac289e01a4928c28eb5947ae5bad6f3
vol_4d40ff831898bd72f0d5708e467d032d
vol_52f9d1a32b916236076226ec361c31dd
vol_61f3ab252579bd73026c7471cae2ea8a
vol_63c5abc9d6c3837b40a433ace366f1cd
vol_78bdaad1c288b2cc97e35f7b78f49ee4
vol_b523d2625eb50a48658164657dbff597
vol_bf1fb8a0308a945d1d18d07e5152161c
vol_dadca6d84616d64e9badaea178061f15
vol_dc25850f2f514ddc8fa55046120d29a1
vol_e465e8d402692afa83d7b89a79adde9a
vol_f71551f08f4a2645933d497addfa0653

The additional volumes seen here are in the stopped state. (The complete volume info list will be attached.)

Version-Release number of selected component (if applicable):

rpm -qa | grep 'heketi'
heketi-6.0.0-7.2.el7rhgs.x86_64
python-heketi-6.0.0-7.2.el7rhgs.x86_64
heketi-client-6.0.0-7.2.el7rhgs.x86_64

How reproducible:
1/1, but it should be easy to hit.

Steps to Reproduce:
Covered in the description above.

Actual results:
Volume creation fails and leaves additional volumes behind on the gluster backend.

Expected results:
Volume creation should succeed and there should be no mismatch between heketi and gluster.

Additional info:
heketi logs will be attached.
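For anyone reproducing this on a similar setup, a minimal sketch for confirming the mismatch; it assumes heketi-cli is usable through the heketi pod's environment and that glusterfs-storage-8gqqx is any healthy gluster pod (pod names taken from this report):

# list volumes known to gluster and to heketi, then show the difference
oc rsh glusterfs-storage-8gqqx gluster volume list | tr -d '\r' | sort > /tmp/gluster_vols
oc rsh heketi-storage-3-jxkr2 heketi-cli volume list | tr -d '\r' | awk -F'Name:' 'NF>1 {print $2}' | sort > /tmp/heketi_vols
# volumes present in gluster but unknown to heketi (the stale, stopped ones)
comm -23 /tmp/gluster_vols /tmp/heketi_vols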
Created attachment 1420180 [details] heketi_logs
Created attachment 1420181 [details] gluster_volume_info
OK, here is the root cause: if a volume is created but the start fails, the whole volume create operation is considered a failure. This triggers a cleanup process that is not strict enough (we don't check whether the volume was really destroyed). Because the retry mechanism reuses the same ID as before, we end up getting "volume already exists" errors from gluster.

Possible solutions:
a. Split the volume create executor into two or more parts. After each executor call, determine the best course of action and proceed accordingly.
b. In cases where we are not sure whether the cleanup completed, we cannot reuse the IDs and will have to retry at a higher level. Note that we should not delete the PendingOperations of the first volume create; it might have more cleanups to perform.
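To make the "not strict enough" part concrete, here is a minimal shell sketch of the kind of verification meant above. The real fix has to live in heketi's executor/rollback code; the pod and volume names are placeholders taken from this report, and it assumes the gluster CLI reports "does not exist" for a missing volume:

POD=glusterfs-storage-8gqqx
VOL=vol_4d40ff831898bd72f0d5708e467d032d
# try to delete the half-created volume non-interactively
oc rsh "$POD" gluster --mode=script volume delete "$VOL"
# only treat the cleanup as done if gluster explicitly says the volume is gone;
# a plain command failure (e.g. glusterd being down) must NOT count as success
if oc rsh "$POD" gluster volume info "$VOL" 2>&1 | grep -q "does not exist"; then
    echo "cleanup of $VOL confirmed"
else
    echo "cleanup of $VOL unconfirmed: keep the pending operation, do not reuse the ID" >&2
fi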
Here's my interpretation of the problem:

Based on my reading of the code, I doubt this exact condition is new to this build; it has probably been lurking around for a while. But I suspect that the retry loop actually makes it easier to hit. It's important that in this test glusterd is being stopped and started in a loop. According to the logs, volume create requests are coming in and most fail because gluster on the node heketi is trying to use is down. Typically, this failure occurs on the 'gluster volume create' command. However, due to timing, it occasionally hits a case where 'gluster volume create' succeeds but 'gluster volume start' fails.

The volume create code performs an "immediate cleanup" of the volume when an error is hit. However, because of the state of glusterd, this cleanup also fails. At this point the volume create operation's rollback function is called. Rollback attempts to clean up the bricks (only the bricks, because historically the code assumed the self-cleanup of volume create would succeed) and then the heketi db entries. Since these steps work, rollback succeeds and a new retry is attempted. On this attempt the create fails because the volume already exists on the gluster side. Again, rollback fires with a false success, removing any trace of the volume from the db.

Dealing with the cleanup when the volume can't reliably be deleted on the gluster side will be challenging and will need more design and changes on the heketi side (at the very least). However, in the short term I'm going to look into making the rollback fail reliably when the volume cleanup hasn't happened. This will leave the volume in the db (in a pending state), and the heketi db tools can then be used to view it and help clean it up manually on the gluster side (see the sketch below).

Unfortunately, we can't avoid the problem just by cycling IDs, because we support user-specified names. Cycling IDs would also not avoid the issue of the volume being removed from the db on rollback failures.
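On the manual-cleanup point: once the stale volumes have been identified (e.g. with the comparison sketched in the description, which leaves the two sorted lists in /tmp), something along these lines could remove them on the gluster side. This is only a hedged sketch: the pod name is a placeholder, and --mode=script is used to skip the interactive delete prompt:

POD=glusterfs-storage-8gqqx
for vol in $(comm -23 /tmp/gluster_vols /tmp/heketi_vols); do
    # the leftover volumes were reported as already stopped; stop is just a safety net
    oc rsh "$POD" gluster --mode=script volume stop "$vol" || true
    oc rsh "$POD" gluster --mode=script volume delete "$vol"
done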
Created attachment 1422571 [details] heketi_logs_apr_16_failedQA
Created attachment 1422573 [details] db_dump_apr16
Doc text looks good to me.
Created attachment 1422936 [details] glusterd_logs_collated
Created attachment 1422937 [details] cmd_history_collated.txt
Created attachment 1422941 [details] all cmd_history separate
Created attachment 1422942 [details] all glusterd logs separate
I don't see any additional work to do here.