Description of problem:
The heketi device delete command hung.

Version-Release number of selected component (if applicable):
CVE fix setup - OCP 3.9
heketi-client-6.0.0-7.4.el7rhgs.x86_64
oc v3.9.30

How reproducible:
Once

Steps to Reproduce:
Run the automated heketi test cases. In one of the test cases, the heketi device delete command hangs:

[root@dhcp46-130 ~]# ps -aux | grep delete
root      16949  0.0  0.0 112704   968 pts/0   S+  11:32  0:00 grep --color=auto delete
root     105257  0.0  0.0 113172  1572 ?       Ss  04:56  0:00 bash -c cd /root && heketi-cli -s http://heketi-storage-glusterfs.router.default.svc.cluster.local device delete d1502e93d1320127383936293e4dcd6c --secret 'adminkey' --user admin
root     105262  0.0  0.0 255916 17108 ?       Sl  04:56  0:19 heketi-cli -s http://heketi-storage-glusterfs.router.default.svc.cluster.local device delete d1502e93d1320127383936293e4dcd6c --secret adminkey --user admin

Errors in the heketi log:

[kubeexec] DEBUG 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:244: Host: dhcp46-61.lab.eng.blr.redhat.com Pod: glusterfs-storage-q25zn Command: gluster --mode=script volume replace-brick heketidbstorage 10.70.46.61:/var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f/brick 10.70.46.61:/var/lib/heketi/mounts/vg_d1502e93d1320127383936293e4dcd6c/brick_de6552f1f5208620b0e3ff7b01bb622a/brick commit force
Result: volume replace-brick: success: replace-brick commit force operation successful
[heketi] INFO 2018/06/11 11:26:02 Deleting brick 284a16bbc0e4bee4a3a124a0db224c8f
[negroni] Started GET /queue/60ad29a6454424d14570e3aa303a7067
[negroni] Completed 200 OK in 167.959µs
[kubeexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [umount /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 32]: Stdout []: Stderr [umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
]
[cmdexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to execute command on glusterfs-storage-q25zn: umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
[kubeexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [lvremove -f vg_e7a2abd8e004058324ce78235ae03927/tp_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 5]: Stdout []: Stderr [  Logical volume vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f contains a filesystem in use.
]
[cmdexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:132: Unable to execute command on glusterfs-storage-q25zn: Logical volume vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f contains a filesystem in use.
[kubeexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [rmdir /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 1]: Stdout []: Stderr [rmdir: failed to remove '/var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f': Device or resource busy
]
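A minimal diagnostic sketch for the "target is busy" failure above, using the brick path and pod name from the log (fuser/lsof are the tools the umount error itself points at; their availability inside the glusterfs image is an assumption):

# Brick mountpoint taken from the umount error above.
BRICK=/var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f

# List the processes still holding the mount open; whatever shows up here
# is what keeps umount, lvremove, and rmdir failing in sequence.
oc exec glusterfs-storage-q25zn -- fuser -vm "$BRICK"
oc exec glusterfs-storage-q25zn -- lsof "$BRICK"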
Created attachment 1450325 [details] attaching heketi-pod logs
Created attachment 1450328 [details] heketi-db-dump
heketi-pod logs at http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/BZ-1590143/
How long did the command hang for? Did it eventually time out or did you kill it?

Can we get a system into this state again? If we can, I'd like to get a core dump of a running heketi to see exactly what operation the server was performing. In the attached db dump there are no pending operations in the db, which implies that either the server completed the operation before the db dump was taken, the CLI never actually communicated the delete operation to the server, or the db was cleaned up manually, and I doubt you would have done that.

The error triggering the log line:

[cmdexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to execute command on glusterfs-storage-q25zn: umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

is an issue in gluster AFAIK around brick multiplexing (bz#1524336 if I'm not mistaken). However, this alone should not cause the command to hang.
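If a system does get back into this state, a sketch of how the server-side state could be captured (the pod name here is hypothetical; assumes heketi is an ordinary Go binary and that pidof/gdb exist in the image):

# Find the heketi server PID inside the heketi pod (pod name illustrative).
oc exec heketi-storage-1-xyz -- pidof heketi

# A Go runtime dumps the stacks of all goroutines and exits on SIGQUIT,
# showing exactly which operation each goroutine is blocked in. Note this
# terminates the server, so do it only once you are done with the system.
oc exec heketi-storage-1-xyz -- kill -QUIT <pid>
oc logs heketi-storage-1-xyz        # the goroutine dump ends up in the pod log

# Alternatively, capture a full core without stopping the process:
oc exec heketi-storage-1-xyz -- gcore -o /tmp/heketi.core <pid>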
(In reply to John Mulligan from comment #6)
> How long did the command hang for? Did it eventually time out or did you
> kill it?

It hung for more than 48 hours, and then I eventually killed it.

> Can we get a system into this state again? If we can I'd like to get a core
> dump of a running heketi to see exactly what operation the server was
> performing. In the attached db dump there are no pending operations in the
> db, which implies that either the server completed the operation before the
> db dump was taken, the cli never actually communicated the delete operation
> to the server, or the db was cleaned up manually and I doubt you would have
> done that.

I don't have the setup now, but I have never manually cleared the db. Note: I haven't seen this issue in CNS 3.10.

> The error triggering the log line:
> [cmdexec] ERROR 2018/06/11 11:26:02
> /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to
> execute command on glusterfs-storage-q25zn: umount:
> /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/
> brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
> (In some cases useful info about processes that use
> the device is found by lsof(8) or fuser(1))
>
> is an issue in gluster AFAIK around brick multiplexing (bz#1524336 if I'm
> not mistaken). However, this alone should not cause the command to hang.
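Until the fix lands, a stopgap sketch for the automation so a server-side hang fails the test instead of blocking the suite for days (uses GNU coreutils timeout; the 600-second limit is an arbitrary illustration):

# Bound the CLI call; timeout exits with status 124 if the limit fires.
timeout 600 heketi-cli -s http://heketi-storage-glusterfs.router.default.svc.cluster.local \
    device delete d1502e93d1320127383936293e4dcd6c \
    --secret 'adminkey' --user admin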
Patch posted at : https://github.com/heketi/heketi/pull/1247
Fixed in version: rhgs-volmanager-rhel7:3.3.1-20
Haven't hit this issue again in build rhgs3/rhgs-volmanager-rhel7:3.3.1-22. Hence marking it as verified.
Updated doc text in the Doc Text field. Please review for technical accuracy.
Have updated the doc text based on feedback.
(In reply to Anjana from comment #22)
> Have updated the doc text based on feedback.

Looks OK to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2686