Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1590143 - heketi device delete command hanged (as a vgremove command was waiting for an interactive prompt response)
heketi device delete command hanged (as a vgremove command was waiting for an...
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: heketi (Show other bugs)
cns-3.10
x86_64 Linux
unspecified Severity high
: ---
: CNS 3.10
Assigned To: John Mulligan
Apeksha
:
Depends On:
Blocks: 1568862 1594716
  Show dependency treegraph
 
Reported: 2018-06-12 02:23 EDT by Apeksha
Modified: 2018-09-12 05:23 EDT (History)
9 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, Heketi executed LVM management operations on the underlying storage nodes such that the commands might prompt for input. Commands that triggered input prompts would never complete without administrative intervention. With this fix Heketi is changed to run the LVM operations for volume delete and will not prompt for input. Heketi operations that manage LVM storage will either succeed or fail .
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-12 05:22:13 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
attaching heketi-pod logs (2.53 MB, text/plain)
2018-06-12 02:26 EDT, Apeksha
no flags Details
heketi-db-dump (9.07 KB, text/plain)
2018-06-12 02:33 EDT, Apeksha
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Github https://github.com/heketi/heketi/issues/1244 None None None 2018-06-29 15:37 EDT
Red Hat Product Errata RHEA-2018:2686 None None None 2018-09-12 05:23 EDT

  None (edit)
Description Apeksha 2018-06-12 02:23:28 EDT
Description of problem:
heketi device delete command hanged


Version-Release number of selected component (if applicable):
CVE fix setup - OCP3.9
heketi-client-6.0.0-7.4.el7rhgs.x86_64
oc v3.9.30

How reproducible: Once


Steps to Reproduce:
Run Automated heketi cases , in one of testcase heketi device command is in hang state

[root@dhcp46-130 ~]# ps -aux | grep delete
root      16949  0.0  0.0 112704   968 pts/0    S+   11:32   0:00 grep --color=auto delete
root     105257  0.0  0.0 113172  1572 ?        Ss   04:56   0:00 bash -c cd /root && heketi-cli -s http://heketi-storage-glusterfs.router.default.svc.cluster.local device delete d1502e93d1320127383936293e4dcd6c  --secret 'adminkey' --user admin
root     105262  0.0  0.0 255916 17108 ?        Sl   04:56   0:19 heketi-cli -s http://heketi-storage-glusterfs.router.default.svc.cluster.local device delete d1502e93d1320127383936293e4dcd6c --secret adminkey --user admin


Error in heketi log:
[kubeexec] DEBUG 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:244: Host: dhcp46-61.lab.eng.blr.redhat.com Pod: glusterfs-storage-q25zn Command: gluster --mode=script volume replace-brick heketidbstorage 10.70.46.61:/var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f/brick 10.70.46.61:/var/lib/heketi/mounts/vg_d1502e93d1320127383936293e4dcd6c/brick_de6552f1f5208620b0e3ff7b01bb622a/brick commit force
Result: volume replace-brick: success: replace-brick commit force operation successful
[heketi] INFO 2018/06/11 11:26:02 Deleting brick 284a16bbc0e4bee4a3a124a0db224c8f
[negroni] Started GET /queue/60ad29a6454424d14570e3aa303a7067
[negroni] Completed 200 OK in 167.959µs
[kubeexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [umount /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 32]: Stdout []: Stderr [umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
]
[cmdexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to execute command on glusterfs-storage-q25zn: umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
[kubeexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [lvremove -f vg_e7a2abd8e004058324ce78235ae03927/tp_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 5]: Stdout []: Stderr [  Logical volume vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f contains a filesystem in use.
]
[cmdexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:132: Unable to execute command on glusterfs-storage-q25zn:   Logical volume vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f contains a filesystem in use.
[kubeexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [rmdir /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 1]: Stdout []: Stderr [rmdir: failed to remove '/var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f': Device or resource busy
Comment 2 Apeksha 2018-06-12 02:26 EDT
Created attachment 1450325 [details]
attaching heketi-pod logs
Comment 3 Apeksha 2018-06-12 02:33 EDT
Created attachment 1450328 [details]
heketi-db-dump
Comment 4 Apeksha 2018-06-12 02:41:30 EDT
heketi-pod logs at http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/BZ-1590143/
Comment 6 John Mulligan 2018-06-27 13:47:16 EDT
How long did the command hang for? Did it eventually time out or did you kill it?

Can we get a system into this state again? If we can I'd like to get a core dump of a running heketi to see exactly what operation the server was performing. In the attached db dump there are no pending operations in the db, which implies that either the server completed the operation before the db dump was taken, the cli never actually communicated the delete operation to the server, or the db was cleaned up manually and I doubt you would have done that.

The error triggering the log line:
[cmdexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to execute command on glusterfs-storage-q25zn: umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

is an issue in gluster AFAIK around brick multiplexing (bz#1524336 if I'm not mistaken). However, this alone should not cause the command to hang.
Comment 8 Apeksha 2018-06-28 01:53:13 EDT
(In reply to John Mulligan from comment #6)
> How long did the command hang for? Did it eventually time out or did you
> kill it?
  It hangs for more than 48 hours and then eventually i killed it.
 
> Can we get a system into this state again? If we can I'd like to get a core
> dump of a running heketi to see exactly what operation the server was
> performing. In the attached db dump there are no pending operations in the
> db, which implies that either the server completed the operation before the
> db dump was taken, the cli never actually communicated the delete operation
> to the server, or the db was cleaned up manually and I doubt you would have
> done that.
   I dont have the setup now, but I have never manually cleared the db. Note i haven't seen this issue in CNS3.10.
  
> The error triggering the log line:
> [cmdexec] ERROR 2018/06/11 11:26:02
> /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to
> execute command on glusterfs-storage-q25zn: umount:
> /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/
> brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
>         (In some cases useful info about processes that use
>          the device is found by lsof(8) or fuser(1))
> 
> is an issue in gluster AFAIK around brick multiplexing (bz#1524336 if I'm
> not mistaken). However, this alone should not cause the command to hang.
Comment 12 Raghavendra Talur 2018-07-09 08:41:15 EDT
Patch posted at : https://github.com/heketi/heketi/pull/1247
Comment 18 Humble Chirammal 2018-07-13 02:34:05 EDT
Fixed in version: rhgs-volmanager-rhel7:3.3.1-20
Comment 19 Apeksha 2018-08-01 06:16:14 EDT
Havent hit this issue again in build -rhgs3/rhgs-volmanager-rhel7:3.3.1-22.
Hence marking it as verified.
Comment 20 Anjana 2018-08-30 19:37:47 EDT
Updated doc text in the Doc Text field. Please review for technical accuracy.
Comment 22 Anjana 2018-09-05 05:40:05 EDT
have updated the doc text based on feedback.
Comment 23 John Mulligan 2018-09-05 15:25:14 EDT
(In reply to Anjana from comment #22)
> have updated the doc text based on feedback.

Looks OK to me.
Comment 25 errata-xmlrpc 2018-09-12 05:22:13 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2686

Note You need to log in before you can comment on or make changes to this bug.