Description of problem:
The heketi device delete command hung.

Version-Release number of selected component (if applicable):
CVE fix setup - OCP 3.9
heketi-client-6.0.0-7.4.el7rhgs.x86_64
oc v3.9.30

How reproducible:
Once

Steps to Reproduce:
Run the automated heketi test cases. In one of the test cases, the heketi device delete command hangs:

[root@dhcp46-130 ~]# ps -aux | grep delete
root      16949  0.0  0.0 112704   968 pts/0   S+  11:32  0:00 grep --color=auto delete
root     105257  0.0  0.0 113172  1572 ?       Ss  04:56  0:00 bash -c cd /root && heketi-cli -s http://heketi-storage-glusterfs.router.default.svc.cluster.local device delete d1502e93d1320127383936293e4dcd6c --secret 'adminkey' --user admin
root     105262  0.0  0.0 255916 17108 ?       Sl  04:56  0:19 heketi-cli -s http://heketi-storage-glusterfs.router.default.svc.cluster.local device delete d1502e93d1320127383936293e4dcd6c --secret adminkey --user admin

Errors in the heketi log:

[kubeexec] DEBUG 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:244: Host: dhcp46-61.lab.eng.blr.redhat.com Pod: glusterfs-storage-q25zn Command: gluster --mode=script volume replace-brick heketidbstorage 10.70.46.61:/var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f/brick 10.70.46.61:/var/lib/heketi/mounts/vg_d1502e93d1320127383936293e4dcd6c/brick_de6552f1f5208620b0e3ff7b01bb622a/brick commit force
Result: volume replace-brick: success: replace-brick commit force operation successful
[heketi] INFO 2018/06/11 11:26:02 Deleting brick 284a16bbc0e4bee4a3a124a0db224c8f
[negroni] Started GET /queue/60ad29a6454424d14570e3aa303a7067
[negroni] Completed 200 OK in 167.959µs
[kubeexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [umount /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 32]: Stdout []: Stderr [umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
]
[cmdexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to execute command on glusterfs-storage-q25zn: umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
[kubeexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [lvremove -f vg_e7a2abd8e004058324ce78235ae03927/tp_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 5]: Stdout []: Stderr [  Logical volume vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f contains a filesystem in use.
]
[cmdexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:132: Unable to execute command on glusterfs-storage-q25zn: Logical volume vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f contains a filesystem in use.
[kubeexec] ERROR 2018/06/11 11:26:03 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [rmdir /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f] on glusterfs-storage-q25zn: Err[command terminated with exit code 1]: Stdout []: Stderr [rmdir: failed to remove '/var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f': Device or resource busy
]
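A minimal diagnostic sketch for the "target is busy" failure above, using the brick path and pod name from the log (fuser/lsof are the tools the umount error itself points at; their availability inside the glusterfs image is an assumption):

# Brick mountpoint taken from the umount error above.
BRICK=/var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f

# List the processes still holding the mount open; whatever shows up here
# is what keeps umount, lvremove, and rmdir failing in sequence.
oc exec glusterfs-storage-q25zn -- fuser -vm "$BRICK"
oc exec glusterfs-storage-q25zn -- lsof "$BRICK"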
Created attachment 1450325 [details] attaching heketi-pod logs
Created attachment 1450328 [details] heketi-db-dump
heketi-pod logs at http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/BZ-1590143/
How long did the command hang for? Did it eventually time out or did you kill it?

Can we get a system into this state again? If we can, I'd like to get a core dump of a running heketi to see exactly what operation the server was performing. In the attached db dump there are no pending operations in the db, which implies that either the server completed the operation before the db dump was taken, the CLI never actually communicated the delete operation to the server, or the db was cleaned up manually, and I doubt you would have done that.

The error triggering the log line:

[cmdexec] ERROR 2018/06/11 11:26:02 /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to execute command on glusterfs-storage-q25zn: umount: /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

is an issue in gluster AFAIK around brick multiplexing (bz#1524336 if I'm not mistaken). However, this alone should not cause the command to hang.
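If a system does get back into this state, a sketch of how the server-side state could be captured (the pod name here is hypothetical; assumes heketi is an ordinary Go binary and that pidof/gdb exist in the image):

# Find the heketi server PID inside the heketi pod (pod name illustrative).
oc exec heketi-storage-1-xyz -- pidof heketi

# A Go runtime dumps the stacks of all goroutines and exits on SIGQUIT,
# showing exactly which operation each goroutine is blocked in. Note this
# terminates the server, so do it only once you are done with the system.
oc exec heketi-storage-1-xyz -- kill -QUIT <pid>
oc logs heketi-storage-1-xyz        # the goroutine dump ends up in the pod log

# Alternatively, capture a full core without stopping the process:
oc exec heketi-storage-1-xyz -- gcore -o /tmp/heketi.core <pid>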
(In reply to John Mulligan from comment #6)
> How long did the command hang for? Did it eventually time out or did you
> kill it?

It hung for more than 48 hours, and then I eventually killed it.

> Can we get a system into this state again? If we can I'd like to get a core
> dump of a running heketi to see exactly what operation the server was
> performing. In the attached db dump there are no pending operations in the
> db, which implies that either the server completed the operation before the
> db dump was taken, the cli never actually communicated the delete operation
> to the server, or the db was cleaned up manually and I doubt you would have
> done that.

I don't have the setup now, but I have never manually cleared the db. Note: I haven't seen this issue in CNS 3.10.

> The error triggering the log line:
> [cmdexec] ERROR 2018/06/11 11:26:02
> /src/github.com/heketi/heketi/executors/cmdexec/brick.go:123: Unable to
> execute command on glusterfs-storage-q25zn: umount:
> /var/lib/heketi/mounts/vg_e7a2abd8e004058324ce78235ae03927/
> brick_284a16bbc0e4bee4a3a124a0db224c8f: target is busy.
> (In some cases useful info about processes that use
> the device is found by lsof(8) or fuser(1))
>
> is an issue in gluster AFAIK around brick multiplexing (bz#1524336 if I'm
> not mistaken). However, this alone should not cause the command to hang.
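Until the fix lands, a stopgap sketch for the automation so a server-side hang fails the test instead of blocking the suite for days (uses GNU coreutils timeout; the 600-second limit is an arbitrary illustration):

# Bound the CLI call; timeout exits with status 124 if the limit fires.
timeout 600 heketi-cli -s http://heketi-storage-glusterfs.router.default.svc.cluster.local \
    device delete d1502e93d1320127383936293e4dcd6c \
    --secret 'adminkey' --user admin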
Patch posted at : https://github.com/heketi/heketi/pull/1247
Fixed in version: rhgs-volmanager-rhel7:3.3.1-20
Haven't hit this issue again in build rhgs3/rhgs-volmanager-rhel7:3.3.1-22. Hence marking it as verified.
Updated doc text in the Doc Text field. Please review for technical accuracy.
Have updated the doc text based on feedback.
(In reply to Anjana from comment #22)
> Have updated the doc text based on feedback.

Looks OK to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2686