Bug 1583494

Summary: heketi create/delete commands are in queue(hang) state
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Apeksha <akhakhar>
Component: heketi
Assignee: Michael Adam <madam>
Status: CLOSED CANTFIX
QA Contact: Apeksha <akhakhar>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.9
CC: akhakhar, hchiramm, jmulligan, kramdoss, pprakash, rhs-bugs, rtalur, sankarshan, storage-qa-internal, tcarlin
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-23 20:42:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1641915    
Attachments:
  heketi pod logs (flags: none)
  glusterfs log (flags: none)

Description Apeksha 2018-05-29 06:36:20 UTC
Created attachment 1445137 [details]
heketi pod logs

Description of problem:
heketi create/delete commands are in queue state

Version-Release number of selected component (if applicable):
heketi-client-6.0.0-7.4.el7rhgs.x86_64
openshift v3.6.173.0.117

How reproducible:
Twice

Steps to Reproduce:
1. Set up OCP 3.6 + CNS 3.9 Async.
2. Run the dynamic provisioning automation test cases.
   All passed, including the heketi-pod failure test case.
3. Start running the heketi automated cases.
4. The heketi device delete and heketi volume delete commands are stuck in the queue state.


Attached the heketi.log

Comment 2 Apeksha 2018-05-29 06:38:42 UTC
Additional Info: 
Command that hung:  heketi-cli -s http://172.30.11.17:8080 volume delete 3d8325d2095c5196c321cddfd7e3eaea

Comment 3 Apeksha 2018-05-29 08:35:26 UTC
Created attachment 1445247 [details]
glusterfs log

Comment 6 Raghavendra Talur 2018-05-30 19:24:22 UTC
John and I looked at the system. DNS resolution wasn't working on the master, which looked like a known issue we were already aware of, so we tried the following commands; the most important were the restarts of the dbus and dnsmasq services. After that, login started working.

  212  oc login -u system:admin --config /etc/origin/master/admin.kubeconfig
  213  netstat -tnap | grep 53
  214  netstat -tnap | grep 53 | LISTEN
  215  netstat -tnap | grep 53 | grep LISTEN
  216  oc login -u system:admin --config /etc/origin/master/admin.kubeconfig
  217  oc get nodes
  218  ssh root.eng.blr.redhat.com 
  219  ssh root.47.38
  220  clear
  221  ls
  222  vim /etc/resolv.conf 
  223  oc login -u system:admin --config /etc/origin/master/admin.kubeconfig
  224  systemctl restart dnsmasq
  225  oc login -u system:admin --config /etc/origin/master/admin.kubeconfig
  226  netstat -tnap | grep 53 | grep LISTEN
  227  less /var/log/messages
  228  systemctl restart dbus
  229  netstat -tnap | grep 53 | grep LISTEN
  230  oc login -u system:admin --config /etc/origin/master/admin.kubeconfig
  231  history



We suspect the heketi hang you are seeing has the same root cause. Please retest now.

Comment 7 Raghavendra Talur 2018-05-30 20:01:23 UTC
The known issue referred to above is https://bugzilla.redhat.com/show_bug.cgi?id=1550582 .

After fixing the DNS on master, heketi-cli is able to reach master.

The original log attached to the bug shows that the heketi-cli request to delete the volume had reached heketi. The relevant log lines are:

[negroni] Started DELETE /volumes/3d8325d2095c5196c321cddfd7e3eaea
[heketi] INFO 2018/05/29 06:04:34 Loaded simple allocator
[negroni] Completed 202 Accepted in 115.686293ms
[asynchttp] INFO 2018/05/29 06:04:34 asynchttp.go:125: Started job e68581b0d78ae01a413e88bc3fb35595
[heketi] INFO 2018/05/29 06:04:34 Started async operation: Delete Volume


After this point, I suspect that the connection from heketi to the OC master was lost because of the bad DNS. I have verified that the volume delete in question is logged in the db as a pending operation. The thing to check now is whether heketi resumes the operation after such a long break, or stays in the pending state forever. Either way, it does not seem like a new bug.

John, adding a needinfo on you for confirmation of my analysis.

Comment 8 John Mulligan 2018-05-30 21:12:38 UTC
I agree.

After giving Heketi a chance to recover, we can try running 'heketi-cli db dump' and checking whether there are any pending operations in the db. If there are, we are still "stuck" and heketi may not recover. If there are none, try running new commands to see how heketi behaves now that DNS is restored.

Comment 9 Apeksha 2018-05-31 06:34:20 UTC
The delete command is failing, without any error message.

[root@dhcp47-93 ~]# heketi-cli volume delete 3d8325d2095c5196c321cddfd7e3eaea
Error: 

Output of heketi-cli db dump

[root@dhcp47-93 ~]# heketi-cli db dump
{"clusterentries":{"56d32397306dc3aa057320922250a070":{"Info":{"id":"56d32397306dc3aa057320922250a070","nodes":["866bea732a8fead87ff6b1720301a3b6","e3e6b22f48dd409750f46de51accb317","fa2e86a1d3e5c918018753eb154248a2"],"volumes":["305a7947a0af7aeebf43a78dd966070b","3d8325d2095c5196c321cddfd7e3eaea"],"block":true,"file":true,"blockvolumes":[]}}},"volumeentries":{"305a7947a0af7aeebf43a78dd966070b":{"Info":{"size":2,"name":"heketidbstorage","durability":{"type":"replicate","replicate":{"replica":3},"disperse":{}},"snapshot":{"enable":false,"factor":1},"id":"305a7947a0af7aeebf43a78dd966070b","cluster":"56d32397306dc3aa057320922250a070","mount":{"glusterfs":{"hosts":["10.70.47.38","10.70.46.85","10.70.46.238"],"device":"10.70.47.38:heketidbstorage","options":{"backup-volfile-servers":"10.70.46.85,10.70.46.238"}}},"blockinfo":{}},"Bricks":["210971d929f335f169e239d17cc4a12a","7fa6e112aa4cf0354fd9cf15d00c48fe","f02ab5738a7b55c4c4d4d0d857c00681"],"GlusterVolumeOptions":null,"Pending":{"Id":""}},"3d8325d2095c5196c321cddfd7e3eaea":{"Info":{"size":1,"name":"vol_3d8325d2095c5196c321cddfd7e3eaea","durability":{"type":"replicate","replicate":{"replica":3},"disperse":{"data":4,"redundancy":2}},"snapshot":{"enable":false,"factor":1},"id":"3d8325d2095c5196c321cddfd7e3eaea","cluster":"56d32397306dc3aa057320922250a070","mount":{"glusterfs":{"hosts":["10.70.47.38","10.70.46.85","10.70.46.238"],"device":"10.70.47.38:vol_3d8325d2095c5196c321cddfd7e3eaea","options":{"backup-volfile-servers":"10.70.46.85,10.70.46.238"}}},"blockinfo":{}},"Bricks":["2aa308032dc6931168db5393f77a8fe8","4f14ad754a1bc2d2b686532e12bd7a1e","4f96665a9d143e46fa69b9f56ec9d1e4"],"GlusterVolumeOptions":null,"Pending":{"Id":""}}},"brickentries":{"210971d929f335f169e239d17cc4a12a":{"Info":{"id":"210971d929f335f169e239d17cc4a12a","path":"/var/lib/heketi/mounts/vg_44ef3a02b018b9f8855bdb487bda910f/brick_210971d929f335f169e239d17cc4a12a/brick","device":"44ef3a02b018b9f8855bdb487bda910f","node":"866bea732a8fead87ff6b1720301a3b6
","volume":"305a7947a0af7aeebf43a78dd966070b","size":2097152},"TpSize":2097152,"PoolMetadataSize":12288,"Pending":{"Id":""}},"2aa308032dc6931168db5393f77a8fe8":{"Info":{"id":"2aa308032dc6931168db5393f77a8fe8","path":"/var/lib/heketi/mounts/vg_6d499c3d339689185794d4e0859898a0/brick_2aa308032dc6931168db5393f77a8fe8/brick","device":"6d499c3d339689185794d4e0859898a0","node":"e3e6b22f48dd409750f46de51accb317","volume":"3d8325d2095c5196c321cddfd7e3eaea","size":1048576},"TpSize":1048576,"PoolMetadataSize":8192,"Pending":{"Id":"6c067686ad9d9fddbdadf36e394af77c"}},"4f14ad754a1bc2d2b686532e12bd7a1e":{"Info":{"id":"4f14ad754a1bc2d2b686532e12bd7a1e","path":"/var/lib/heketi/mounts/vg_456a280ec02d926fc470e6db5c895737/brick_4f14ad754a1bc2d2b686532e12bd7a1e/brick","device":"456a280ec02d926fc470e6db5c895737","node":"fa2e86a1d3e5c918018753eb154248a2","volume":"3d8325d2095c5196c321cddfd7e3eaea","size":1048576},"TpSize":1048576,"PoolMetadataSize":8192,"Pending":{"Id":"6c067686ad9d9fddbdadf36e394af77c"}},"4f96665a9d143e46fa69b9f56ec9d1e4":{"Info":{"id":"4f96665a9d143e46fa69b9f56ec9d1e4","path":"/var/lib/heketi/mounts/vg_44ef3a02b018b9f8855bdb487bda910f/brick_4f96665a9d143e46fa69b9f56ec9d1e4/brick","device":"44ef3a02b018b9f8855bdb487bda910f","node":"866bea732a8fead87ff6b1720301a3b6","volume":"3d8325d2095c5196c321cddfd7e3eaea","size":1048576},"TpSize":1048576,"PoolMetadataSize":8192,"Pending":{"Id":"6c067686ad9d9fddbdadf36e394af77c"}},"7fa6e112aa4cf0354fd9cf15d00c48fe":{"Info":{"id":"7fa6e112aa4cf0354fd9cf15d00c48fe","path":"/var/lib/heketi/mounts/vg_6d499c3d339689185794d4e0859898a0/brick_7fa6e112aa4cf0354fd9cf15d00c48fe/brick","device":"6d499c3d339689185794d4e0859898a0","node":"e3e6b22f48dd409750f46de51accb317","volume":"305a7947a0af7aeebf43a78dd966070b","size":2097152},"TpSize":2097152,"PoolMetadataSize":12288,"Pending":{"Id":""}},"f02ab5738a7b55c4c4d4d0d857c00681":{"Info":{"id":"f02ab5738a7b55c4c4d4d0d857c00681","path":"/var/lib/heketi/mounts/vg_456a280ec02d926fc470e6db5c895737/brick_f
02ab5738a7b55c4c4d4d0d857c00681/brick","device":"456a280ec02d926fc470e6db5c895737","node":"fa2e86a1d3e5c918018753eb154248a2","volume":"305a7947a0af7aeebf43a78dd966070b","size":2097152},"TpSize":2097152,"PoolMetadataSize":12288,"Pending":{"Id":""}}},"nodeentries":{"866bea732a8fead87ff6b1720301a3b6":{"State":"online","Info":{"zone":1,"hostnames":{"manage":["dhcp47-38.lab.eng.blr.redhat.com"],"storage":["10.70.47.38"]},"cluster":"56d32397306dc3aa057320922250a070","id":"866bea732a8fead87ff6b1720301a3b6"},"Devices":["44ef3a02b018b9f8855bdb487bda910f","e0ad7d31cffddba562fbaaef16cad6ae"]},"e3e6b22f48dd409750f46de51accb317":{"State":"online","Info":{"zone":1,"hostnames":{"manage":["dhcp46-85.lab.eng.blr.redhat.com"],"storage":["10.70.46.85"]},"cluster":"56d32397306dc3aa057320922250a070","id":"e3e6b22f48dd409750f46de51accb317"},"Devices":["2df7febc45419bc0275826cca6842e8a","6d499c3d339689185794d4e0859898a0"]},"fa2e86a1d3e5c918018753eb154248a2":{"State":"online","Info":{"zone":1,"hostnames":{"manage":["dhcp46-238.lab.eng.blr.redhat.com"],"storage":["10.70.46.238"]},"cluster":"56d32397306dc3aa057320922250a070","id":"fa2e86a1d3e5c918018753eb154248a2"},"Devices":["456a280ec02d926fc470e6db5c895737","a65eaec63acb81c24a2f9053ec3afc09"]}},"deviceentries":{"2df7febc45419bc0275826cca6842e8a":{"State":"failed","Info":{"name":"/dev/sdf","storage":{"total":104722432,"free":104722432,"used":0},"id":"2df7febc45419bc0275826cca6842e8a"},"Bricks":[],"NodeId":"e3e6b22f48dd409750f46de51accb317","ExtentSize":4096},"44ef3a02b018b9f8855bdb487bda910f":{"State":"online","Info":{"name":"/dev/sde","storage":{"total":629010432,"free":625844224,"used":3166208},"id":"44ef3a02b018b9f8855bdb487bda910f"},"Bricks":["210971d929f335f169e239d17cc4a12a","4f96665a9d143e46fa69b9f56ec9d1e4"],"NodeId":"866bea732a8fead87ff6b1720301a3b6","ExtentSize":4096},"456a280ec02d926fc470e6db5c895737":{"State":"online","Info":{"name":"/dev/sde","storage":{"total":629010432,"free":625844224,"used":3166208},"id":"456a280ec02d926fc
470e6db5c895737"},"Bricks":["4f14ad754a1bc2d2b686532e12bd7a1e","f02ab5738a7b55c4c4d4d0d857c00681"],"NodeId":"fa2e86a1d3e5c918018753eb154248a2","ExtentSize":4096},"6d499c3d339689185794d4e0859898a0":{"State":"online","Info":{"name":"/dev/sdd","storage":{"total":104722432,"free":101556224,"used":3166208},"id":"6d499c3d339689185794d4e0859898a0"},"Bricks":["2aa308032dc6931168db5393f77a8fe8","7fa6e112aa4cf0354fd9cf15d00c48fe"],"NodeId":"e3e6b22f48dd409750f46de51accb317","ExtentSize":4096},"a65eaec63acb81c24a2f9053ec3afc09":{"State":"online","Info":{"name":"/dev/sdd","storage":{"total":104722432,"free":104722432,"used":0},"id":"a65eaec63acb81c24a2f9053ec3afc09"},"Bricks":[],"NodeId":"fa2e86a1d3e5c918018753eb154248a2","ExtentSize":4096},"e0ad7d31cffddba562fbaaef16cad6ae":{"State":"online","Info":{"name":"/dev/sdd","storage":{"total":104722432,"free":104722432,"used":0},"id":"e0ad7d31cffddba562fbaaef16cad6ae"},"Bricks":[],"NodeId":"866bea732a8fead87ff6b1720301a3b6","ExtentSize":4096}},"blockvolumeentries":{},"dbattributeentries":{"DB_CLUSTER_HAS_FILE_BLOCK_FLAG":{"Key":"DB_CLUSTER_HAS_FILE_BLOCK_FLAG","Value":"yes"},"DB_GENERATION_ID":{"Key":"DB_GENERATION_ID","Value":"acffbac9cd64b181a28a2aa492e3b464"},"DB_HAS_PENDING_OPS_BUCKET":{"Key":"DB_HAS_PENDING_OPS_BUCKET","Value":"yes"}},"pendingoperations":{"6c067686ad9d9fddbdadf36e394af77c":{"Id":"6c067686ad9d9fddbdadf36e394af77c","Timestamp":1527573874,"Type":2,"Actions":[{"Change":3,"Id":"2aa308032dc6931168db5393f77a8fe8","Delta":null},{"Change":3,"Id":"4f14ad754a1bc2d2b686532e12bd7a1e","Delta":null},{"Change":3,"Id":"4f96665a9d143e46fa69b9f56ec9d1e4","Delta":null},{"Change":4,"Id":"3d8325d2095c5196c321cddfd7e3eaea","Delta":null}]}}}
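The check John suggested in comment 8 (looking for pending operations in the dump) can be scripted. A minimal sketch in Python; the helper names are illustrative, not heketi tooling, and the structure assumed is only what the dump above shows:

```python
import json  # used to load a saved `heketi-cli db dump` output

def pending_ops(dump):
    """Return the ids recorded in the 'pendingoperations' bucket of a db dump."""
    return sorted(dump.get("pendingoperations", {}))

def bricks_for_op(dump, op_id):
    """Return the brick ids whose Pending.Id ties them to the given operation."""
    return sorted(
        brick_id
        for brick_id, brick in dump.get("brickentries", {}).items()
        if brick.get("Pending", {}).get("Id") == op_id
    )
```

With the dump saved to a file and loaded via `dump = json.load(open("heketi-db.json"))` (file name hypothetical), `pending_ops(dump)` on the dump above would report the single stuck operation 6c067686ad9d9fddbdadf36e394af77c, and `bricks_for_op` would list the three bricks of vol_3d8325d2095c5196c321cddfd7e3eaea still tied to it.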

Comment 10 John Mulligan 2018-05-31 15:25:53 UTC
(In reply to Apeksha from comment #9)
> The delete command is failing, without any error message.
> 
> [root@dhcp47-93 ~]# heketi-cli volume delete 3d8325d2095c5196c321cddfd7e3eaea
> Error: 

Odd. Is there anything in the log when you run that command? Does the server at least see a request coming in from the client?

> 
> Output of heketi-cli db dump
> 

Thanks!

Comment 11 Apeksha 2018-06-01 06:41:59 UTC
Yes, the heketi logs do show that the server received the volume delete request:

[negroni] Completed 200 OK in 1.573816ms
[negroni] Started GET /volumes/3d8325d2095c5196c321cddfd7e3eaea
[negroni] Completed 200 OK in 872.792µs
[negroni] Started DELETE /volumes/3d8325d2095c5196c321cddfd7e3eaea
[heketi] INFO 2018/06/01 06:33:36 Loaded simple allocator
[negroni] Completed 202 Accepted in 114.870718ms
[asynchttp] INFO 2018/06/01 06:33:36 asynchttp.go:125: Started job bdd8afa25e269fd85ca7b8dc61edeb89
[heketi] INFO 2018/06/01 06:33:36 Started async operation: Delete Volume
[negroni] Started GET /queue/bdd8afa25e269fd85ca7b8dc61edeb89
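The lines above show heketi's async request pattern: the DELETE returns 202 Accepted with a job id, and the client then polls GET /queue/<job-id>. A client hang like the one reported means that polling loop never reaches a terminal state. A rough sketch of the polling decision; the status-code mapping is a conventional async-HTTP assumption, not confirmed against heketi's source:

```python
def next_action(status):
    """Map a /queue/<job-id> poll response to the client's next step.

    Assumes a conventional async-HTTP protocol: the server keeps
    answering 202 while the job runs, signals completion with
    200/204/303, and anything else is a failure. heketi's exact
    codes may differ.
    """
    if status == 202:
        return "poll"            # job still running: sleep and retry
    if status in (200, 204, 303):
        return "done"            # finished (303 redirects to the result)
    return "error"               # surface a failure instead of hanging
```

Any poller built on this should also carry an overall timeout: without one, a lost DNS path like the one diagnosed in comment 6 leaves the loop retrying (or blocked on connect) forever, which matches the observed hang.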

Comment 12 John Mulligan 2018-06-01 14:07:21 UTC
Good. Now we need to track down the server's status for the operation, and perhaps we can figure out why the client didn't display a proper error message.

Comment 15 Apeksha 2018-06-26 08:53:19 UTC
Hit the same issue again:

Steps to Reproduce:
1. Set up OCP 3.6 + CNS 3.9 Async.
2. Run the dynamic provisioning automation test cases.
   All passed, including the heketi-pod failure test case.
3. Start running the heketi automated cases.
4. The heketi device delete command hung.

[root@dhcp47-93 ~]# ps -aux | grep heketi
root      83356  0.0  0.0 113172  1580 ?        Ss   13:43   0:00 bash -c cd /root && heketi-cli -s http://172.30.47.185:8080 device delete 1f3ae05d8b13328d758f49da2bc20ccb   
root      83359  0.0  0.0 292640 16632 ?        Sl   13:43   0:01 heketi-cli -s http://172.30.47.185:8080 device delete 1f3ae05d8b13328d758f49da2bc20ccb
root      86610  0.0  0.0 112704   968 pts/0    S+   14:22   0:00 grep --color=auto heketi

Comment 16 Raghavendra Talur 2018-06-28 12:45:43 UTC
Apeksha,

What were the steps performed to handle the bug https://bugzilla.redhat.com/show_bug.cgi?id=1550582 ? Without handling this, you will keep hitting the bug.

Comment 17 Apeksha 2018-07-02 12:20:29 UTC
(In reply to Raghavendra Talur from comment #16)
> Apeksha,
> 
> What were the steps performed to handle the bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1550582 ? Without handling this,
> you will keep hitting the bug.

I tried restarting the heketi pod, but I was still hitting this issue. Hence I re-created the setup, skipped the heketi device delete test case where I was hitting the issue, and executed the other test cases.