Bug 1583494
Summary: | heketi create/delete commands are in queue(hang) state | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Apeksha <akhakhar> | ||||||
Component: | heketi | Assignee: | Michael Adam <madam> | ||||||
Status: | CLOSED CANTFIX | QA Contact: | Apeksha <akhakhar> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | cns-3.9 | CC: | akhakhar, hchiramm, jmulligan, kramdoss, pprakash, rhs-bugs, rtalur, sankarshan, storage-qa-internal, tcarlin | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2019-01-23 20:42:50 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1641915 | ||||||||
Attachments: |
|
Additional Info: Command hanged - heketi-cli -s http://172.30.11.17:8080 volume delete 3d8325d2095c5196c321cddfd7e3eaea Created attachment 1445247 [details]
glusterfs log
John and I looked at the system. DNS resolution wasn't working on the master. This looked like a known issue that we are aware of, so we tried the following commands. Most important being, restart of dbus and dnsmasq services. Then the login started working. 212 oc login -u system:admin --config /etc/origin/master/admin.kubeconfig 213 netstat -tnap | grep 53 214 netstat -tnap | grep 53 | LISTEN 215 netstat -tnap | grep 53 | grep LISTEN 216 oc login -u system:admin --config /etc/origin/master/admin.kubeconfig 217 oc get nodes 218 ssh root.eng.blr.redhat.com 219 ssh root.47.38 220 clear 221 ls 222 vim /etc/resolv.conf 223 oc login -u system:admin --config /etc/origin/master/admin.kubeconfig 224 systemctl restart dnsmasq 225 oc login -u system:admin --config /etc/origin/master/admin.kubeconfig 226 netstat -tnap | grep 53 | grep LISTEN 227 less /var/log/messages 228 systemctl restart dbus 229 netstat -tnap | grep 53 | grep LISTEN 230 oc login -u system:admin --config /etc/origin/master/admin.kubeconfig 231 history We suspect that it is the same reason that you see heketi bug. Please retest this now. The bug that it is known issue of is https://bugzilla.redhat.com/show_bug.cgi?id=1550582 . After fixing the DNS on master, heketi-cli is able to reach master. The original log attached to the bug shows that the heketi-cli request to delete the volume had reached heketi. Relevant log lines are [negroni] Started DELETE /volumes/3d8325d2095c5196c321cddfd7e3eaea [heketi] INFO 2018/05/29 06:04:34 Loaded simple allocator [negroni] Completed 202 Accepted in 115.686293ms [asynchttp] INFO 2018/05/29 06:04:34 asynchttp.go:125: Started job e68581b0d78ae01a413e88bc3fb35595 [heketi] INFO 2018/05/29 06:04:34 Started async operation: Delete Volume After this point, I suspect that the connection from heketi to OC master was lost because of bad DNS. I have verified that the volume delete in question is logged in db as a pending operation. Now the thing to see is if heketi continues operation after such a long break or does it stay in pending operation forever. Either way it does not seem like a new bug. John, adding a needinfo on you for confirmation of my analysis. I agree. After giving Heketi a chance to recover we can try doing a 'heketi-cli db dump' and checking to see if there are any pending operations in the db. If there are then we are still "stuck" and heketi may not recover. If there are no pending operations try running new commands to see how well heketi is behaving now that the dns is restored. The delete command is failing, without any error message. [root@dhcp47-93 ~]# heketi-cli volume delete 3d8325d2095c5196c321cddfd7e3eaea Error: Output of heketi-cli db dump [root@dhcp47-93 ~]# heketi-cli db dump {"clusterentries":{"56d32397306dc3aa057320922250a070":{"Info":{"id":"56d32397306dc3aa057320922250a070","nodes":["866bea732a8fead87ff6b1720301a3b6","e3e6b22f48dd409750f46de51accb317","fa2e86a1d3e5c918018753eb154248a2"],"volumes":["305a7947a0af7aeebf43a78dd966070b","3d8325d2095c5196c321cddfd7e3eaea"],"block":true,"file":true,"blockvolumes":[]}}},"volumeentries":{"305a7947a0af7aeebf43a78dd966070b":{"Info":{"size":2,"name":"heketidbstorage","durability":{"type":"replicate","replicate":{"replica":3},"disperse":{}},"snapshot":{"enable":false,"factor":1},"id":"305a7947a0af7aeebf43a78dd966070b","cluster":"56d32397306dc3aa057320922250a070","mount":{"glusterfs":{"hosts":["10.70.47.38","10.70.46.85","10.70.46.238"],"device":"10.70.47.38:heketidbstorage","options":{"backup-volfile-servers":"10.70.46.85,10.70.46.238"}}},"blockinfo":{}},"Bricks":["210971d929f335f169e239d17cc4a12a","7fa6e112aa4cf0354fd9cf15d00c48fe","f02ab5738a7b55c4c4d4d0d857c00681"],"GlusterVolumeOptions":null,"Pending":{"Id":""}},"3d8325d2095c5196c321cddfd7e3eaea":{"Info":{"size":1,"name":"vol_3d8325d2095c5196c321cddfd7e3eaea","durability":{"type":"replicate","replicate":{"replica":3},"disperse":{"data":4,"redundancy":2}},"snapshot":{"enable":false,"factor":1},"id":"3d8325d2095c5196c321cddfd7e3eaea","cluster":"56d32397306dc3aa057320922250a070","mount":{"glusterfs":{"hosts":["10.70.47.38","10.70.46.85","10.70.46.238"],"device":"10.70.47.38:vol_3d8325d2095c5196c321cddfd7e3eaea","options":{"backup-volfile-servers":"10.70.46.85,10.70.46.238"}}},"blockinfo":{}},"Bricks":["2aa308032dc6931168db5393f77a8fe8","4f14ad754a1bc2d2b686532e12bd7a1e","4f96665a9d143e46fa69b9f56ec9d1e4"],"GlusterVolumeOptions":null,"Pending":{"Id":""}}},"brickentries":{"210971d929f335f169e239d17cc4a12a":{"Info":{"id":"210971d929f335f169e239d17cc4a12a","path":"/var/lib/heketi/mounts/vg_44ef3a02b018b9f8855bdb487bda910f/brick_210971d929f335f169e239d17cc4a12a/brick","device":"44ef3a02b018b9f8855bdb487bda910f","node":"866bea732a8fead87ff6b1720301a3b6","volume":"305a7947a0af7aeebf43a78dd966070b","size":2097152},"TpSize":2097152,"PoolMetadataSize":12288,"Pending":{"Id":""}},"2aa308032dc6931168db5393f77a8fe8":{"Info":{"id":"2aa308032dc6931168db5393f77a8fe8","path":"/var/lib/heketi/mounts/vg_6d499c3d339689185794d4e0859898a0/brick_2aa308032dc6931168db5393f77a8fe8/brick","device":"6d499c3d339689185794d4e0859898a0","node":"e3e6b22f48dd409750f46de51accb317","volume":"3d8325d2095c5196c321cddfd7e3eaea","size":1048576},"TpSize":1048576,"PoolMetadataSize":8192,"Pending":{"Id":"6c067686ad9d9fddbdadf36e394af77c"}},"4f14ad754a1bc2d2b686532e12bd7a1e":{"Info":{"id":"4f14ad754a1bc2d2b686532e12bd7a1e","path":"/var/lib/heketi/mounts/vg_456a280ec02d926fc470e6db5c895737/brick_4f14ad754a1bc2d2b686532e12bd7a1e/brick","device":"456a280ec02d926fc470e6db5c895737","node":"fa2e86a1d3e5c918018753eb154248a2","volume":"3d8325d2095c5196c321cddfd7e3eaea","size":1048576},"TpSize":1048576,"PoolMetadataSize":8192,"Pending":{"Id":"6c067686ad9d9fddbdadf36e394af77c"}},"4f96665a9d143e46fa69b9f56ec9d1e4":{"Info":{"id":"4f96665a9d143e46fa69b9f56ec9d1e4","path":"/var/lib/heketi/mounts/vg_44ef3a02b018b9f8855bdb487bda910f/brick_4f96665a9d143e46fa69b9f56ec9d1e4/brick","device":"44ef3a02b018b9f8855bdb487bda910f","node":"866bea732a8fead87ff6b1720301a3b6","volume":"3d8325d2095c5196c321cddfd7e3eaea","size":1048576},"TpSize":1048576,"PoolMetadataSize":8192,"Pending":{"Id":"6c067686ad9d9fddbdadf36e394af77c"}},"7fa6e112aa4cf0354fd9cf15d00c48fe":{"Info":{"id":"7fa6e112aa4cf0354fd9cf15d00c48fe","path":"/var/lib/heketi/mounts/vg_6d499c3d339689185794d4e0859898a0/brick_7fa6e112aa4cf0354fd9cf15d00c48fe/brick","device":"6d499c3d339689185794d4e0859898a0","node":"e3e6b22f48dd409750f46de51accb317","volume":"305a7947a0af7aeebf43a78dd966070b","size":2097152},"TpSize":2097152,"PoolMetadataSize":12288,"Pending":{"Id":""}},"f02ab5738a7b55c4c4d4d0d857c00681":{"Info":{"id":"f02ab5738a7b55c4c4d4d0d857c00681","path":"/var/lib/heketi/mounts/vg_456a280ec02d926fc470e6db5c895737/brick_f02ab5738a7b55c4c4d4d0d857c00681/brick","device":"456a280ec02d926fc470e6db5c895737","node":"fa2e86a1d3e5c918018753eb154248a2","volume":"305a7947a0af7aeebf43a78dd966070b","size":2097152},"TpSize":2097152,"PoolMetadataSize":12288,"Pending":{"Id":""}}},"nodeentries":{"866bea732a8fead87ff6b1720301a3b6":{"State":"online","Info":{"zone":1,"hostnames":{"manage":["dhcp47-38.lab.eng.blr.redhat.com"],"storage":["10.70.47.38"]},"cluster":"56d32397306dc3aa057320922250a070","id":"866bea732a8fead87ff6b1720301a3b6"},"Devices":["44ef3a02b018b9f8855bdb487bda910f","e0ad7d31cffddba562fbaaef16cad6ae"]},"e3e6b22f48dd409750f46de51accb317":{"State":"online","Info":{"zone":1,"hostnames":{"manage":["dhcp46-85.lab.eng.blr.redhat.com"],"storage":["10.70.46.85"]},"cluster":"56d32397306dc3aa057320922250a070","id":"e3e6b22f48dd409750f46de51accb317"},"Devices":["2df7febc45419bc0275826cca6842e8a","6d499c3d339689185794d4e0859898a0"]},"fa2e86a1d3e5c918018753eb154248a2":{"State":"online","Info":{"zone":1,"hostnames":{"manage":["dhcp46-238.lab.eng.blr.redhat.com"],"storage":["10.70.46.238"]},"cluster":"56d32397306dc3aa057320922250a070","id":"fa2e86a1d3e5c918018753eb154248a2"},"Devices":["456a280ec02d926fc470e6db5c895737","a65eaec63acb81c24a2f9053ec3afc09"]}},"deviceentries":{"2df7febc45419bc0275826cca6842e8a":{"State":"failed","Info":{"name":"/dev/sdf","storage":{"total":104722432,"free":104722432,"used":0},"id":"2df7febc45419bc0275826cca6842e8a"},"Bricks":[],"NodeId":"e3e6b22f48dd409750f46de51accb317","ExtentSize":4096},"44ef3a02b018b9f8855bdb487bda910f":{"State":"online","Info":{"name":"/dev/sde","storage":{"total":629010432,"free":625844224,"used":3166208},"id":"44ef3a02b018b9f8855bdb487bda910f"},"Bricks":["210971d929f335f169e239d17cc4a12a","4f96665a9d143e46fa69b9f56ec9d1e4"],"NodeId":"866bea732a8fead87ff6b1720301a3b6","ExtentSize":4096},"456a280ec02d926fc470e6db5c895737":{"State":"online","Info":{"name":"/dev/sde","storage":{"total":629010432,"free":625844224,"used":3166208},"id":"456a280ec02d926fc470e6db5c895737"},"Bricks":["4f14ad754a1bc2d2b686532e12bd7a1e","f02ab5738a7b55c4c4d4d0d857c00681"],"NodeId":"fa2e86a1d3e5c918018753eb154248a2","ExtentSize":4096},"6d499c3d339689185794d4e0859898a0":{"State":"online","Info":{"name":"/dev/sdd","storage":{"total":104722432,"free":101556224,"used":3166208},"id":"6d499c3d339689185794d4e0859898a0"},"Bricks":["2aa308032dc6931168db5393f77a8fe8","7fa6e112aa4cf0354fd9cf15d00c48fe"],"NodeId":"e3e6b22f48dd409750f46de51accb317","ExtentSize":4096},"a65eaec63acb81c24a2f9053ec3afc09":{"State":"online","Info":{"name":"/dev/sdd","storage":{"total":104722432,"free":104722432,"used":0},"id":"a65eaec63acb81c24a2f9053ec3afc09"},"Bricks":[],"NodeId":"fa2e86a1d3e5c918018753eb154248a2","ExtentSize":4096},"e0ad7d31cffddba562fbaaef16cad6ae":{"State":"online","Info":{"name":"/dev/sdd","storage":{"total":104722432,"free":104722432,"used":0},"id":"e0ad7d31cffddba562fbaaef16cad6ae"},"Bricks":[],"NodeId":"866bea732a8fead87ff6b1720301a3b6","ExtentSize":4096}},"blockvolumeentries":{},"dbattributeentries":{"DB_CLUSTER_HAS_FILE_BLOCK_FLAG":{"Key":"DB_CLUSTER_HAS_FILE_BLOCK_FLAG","Value":"yes"},"DB_GENERATION_ID":{"Key":"DB_GENERATION_ID","Value":"acffbac9cd64b181a28a2aa492e3b464"},"DB_HAS_PENDING_OPS_BUCKET":{"Key":"DB_HAS_PENDING_OPS_BUCKET","Value":"yes"}},"pendingoperations":{"6c067686ad9d9fddbdadf36e394af77c":{"Id":"6c067686ad9d9fddbdadf36e394af77c","Timestamp":1527573874,"Type":2,"Actions":[{"Change":3,"Id":"2aa308032dc6931168db5393f77a8fe8","Delta":null},{"Change":3,"Id":"4f14ad754a1bc2d2b686532e12bd7a1e","Delta":null},{"Change":3,"Id":"4f96665a9d143e46fa69b9f56ec9d1e4","Delta":null},{"Change":4,"Id":"3d8325d2095c5196c321cddfd7e3eaea","Delta":null}]}}} (In reply to Apeksha from comment #9) > The delete command is failing, without any error message. > > [root@dhcp47-93 ~]# heketi-cli volume delete 3d8325d2095c5196c321cddfd7e3eaea > Error: Odd. Is there anything in the log when you run that command? Does the server at least see a request coming in from the client? > > Output of heketi-cli db dump > Thanks! Yes heketi logs do have the info that the server has received a heketi volume delete command: [negroni] Completed 200 OK in 1.573816ms [negroni] Started GET /volumes/3d8325d2095c5196c321cddfd7e3eaea [negroni] Completed 200 OK in 872.792µs [negroni] Started DELETE /volumes/3d8325d2095c5196c321cddfd7e3eaea [heketi] INFO 2018/06/01 06:33:36 Loaded simple allocator [negroni] Completed 202 Accepted in 114.870718ms [asynchttp] INFO 2018/06/01 06:33:36 asynchttp.go:125: Started job bdd8afa25e269fd85ca7b8dc61edeb89 [heketi] INFO 2018/06/01 06:33:36 Started async operation: Delete Volume [negroni] Started GET /queue/bdd8afa25e269fd85ca7b8dc61edeb89 Good, now we need to track down the server's status for the operation and perhaps we can figure out why the client didn't display a proper error message. Hit the same issue again: Steps to Reproduce: 1. Setup OCP3.6 + CNS3.9 Async 2. Run dynamic provisioing autonation testcases All passed including heketi-pod fialure testcase 3. Start running heketi autonated cases 4. Heketi device delete command hanged. [root@dhcp47-93 ~]# ps -aux | grep heketi root 83356 0.0 0.0 113172 1580 ? Ss 13:43 0:00 bash -c cd /root && heketi-cli -s http://172.30.47.185:8080 device delete 1f3ae05d8b13328d758f49da2bc20ccb root 83359 0.0 0.0 292640 16632 ? Sl 13:43 0:01 heketi-cli -s http://172.30.47.185:8080 device delete 1f3ae05d8b13328d758f49da2bc20ccb root 86610 0.0 0.0 112704 968 pts/0 S+ 14:22 0:00 grep --color=auto heketi Apeksha, What were the steps performed to handle the bug https://bugzilla.redhat.com/show_bug.cgi?id=1550582 ? Without handling this, you will keep hitting the bug. (In reply to Raghavendra Talur from comment #16) > Apeksha, > > What were the steps performed to handle the bug > https://bugzilla.redhat.com/show_bug.cgi?id=1550582 ? Without handling this, > you will keep hitting the bug. I tried restarting heketi-pod, but still i was hitting this issue. Hence I re-created the setup and this time I skipped the heketi device delete testcase, where I was hitting this issue and executed other testcases. |
Created attachment 1445137 [details] heketi pod logs Description of problem: heketi create/delete commands are in queue state Version-Release number of selected component (if applicable): heketi-client-6.0.0-7.4.el7rhgs.x86_64 openshift v3.6.173.0.117 How reproducible: Twice Steps to Reproduce: 1. Setup OCP3.6 + CNS3.9 Async 2. Run dynamic provisioing autonation testcases All passed including heketi-pod fialure testcase 3. Start running heketi autonated cases 4. Heketi device delete and heketi deleet volume command in queue state Attached the heketi.log