Bug 1685935

Summary: disable dmeventd in the rhgs-server container
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Niels de Vos <ndevos>
Component: rhgs-server-container Assignee: Saravanakumar <sarumuga>
Status: CLOSED ERRATA QA Contact: RamaKasturi <knarra>
Severity: high Docs Contact:
Priority: medium    
Version: ocs-3.11 CC: akrishna, jmulligan, knarra, kramdoss, madam, pasik, prajnoha, rhs-bugs, rtalur, sankarshan
Target Milestone: --- Keywords: ZStream
Target Release: OCS 3.11.z Batch Update 3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rhgs3/rhgs-server-rhel7:3.11.3-9 Doc Type: Bug Fix
Doc Text:
Previously, two device-mapper event daemon (dmeventd) services ran on the same system, one in the rhgs-server container and one on the host. While a dmeventd service handles an event, Logical Volume Manager (LVM) commands wait until the event has completed. With two services running, LVM commands could wait on the 'wrong' dmeventd service, the one that was not handling the event, and as a consequence took a very long time to complete. With this fix, the container no longer runs a dmeventd service. Because only one dmeventd service is running, LVM commands can no longer connect to an idle instance, and the delay does not occur.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-13 19:18:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1674475    

Description Niels de Vos 2019-03-06 10:58:56 UTC
Description of problem:
We have seen problems where 'pvscan' in the rhgs-server container got into some sort of 'hung' state. Running strace against a newly started '/usr/sbin/lvm pvscan' command shows a (seemingly) endless loop while reading from /run/dmeventd-client.

Version-Release number of selected component (if applicable):
rhgs-server-container:v3.11.1-15

How reproducible:
Occasionally

Steps to Reproduce:
1. Create many LVs, fill some up to 100%
2. Reboot the system
3. Have the rhgs-server container come up
4. Notice the container starting, but not getting 'ready'
5. Run 'ps ax' in the container, see 'pvscan' processes running

Actual results:
rhgs-server container does not become ready

Expected results:
The rhgs-server container should become ready within a few minutes. Mounting the LVs for all the bricks can still take some time. Mounting is done after pvscan has finished.

Additional info:
These types of 'hangs' in pvscan can probably be prevented by not running dmeventd in the container. There is no (currently known) reason to have access to the dmeventd sockets from the service running on the host.
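
A minimal sketch of one way to keep dmeventd from starting in the container. The report does not show the actual change made to the image; the verification output in comment 14 shows both dm-event units as masked, so this sketch masks them by linking the unit names to /dev/null, which is what 'systemctl mask' does. The function name mask_units and the unit directory argument are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: mask systemd units by linking each unit name to
# /dev/null inside the given unit directory (the same result that
# 'systemctl mask' produces in /etc/systemd/system).
# $1    unit directory, e.g. /etc/systemd/system (assumption)
# $2... unit names to mask
mask_units() {
    unit_dir=$1
    shift
    mkdir -p "$unit_dir" || return 1
    for unit in "$@"; do
        # -f replaces an existing unit file or link of the same name
        ln -sf /dev/null "$unit_dir/$unit" || return 1
    done
}

# Example call (not executed here):
#   mask_units /etc/systemd/system dm-event.service dm-event.socket
```

Masking both the service and the socket matters because the socket unit could otherwise start the service on demand.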

Comment 1 Niels de Vos 2019-03-06 13:26:07 UTC
Peter, can you think of a reason why we would want to have dmeventd running inside the container (and likely on the host)?

If there is no valid reason, we'll continue with disabling the service in the rhgs-server container.

Comment 2 Zdenek Kabelac 2019-03-06 15:20:04 UTC
dmeventd was never designed to be executed inside a 'container', so it makes some assumptions that there is only a single instance of 'dmeventd' running on the whole host system.

So currently I'd not recommend running multiple instances of dmeventd spread across many containers.

Comment 5 RamaKasturi 2019-03-28 17:32:14 UTC
Acking the bug for the 3.11.3 release.

Comment 9 RamaKasturi 2019-05-07 10:01:46 UTC
Moving this bug to failed_qa as I see that the dmeventd service is still running inside the container.

sh-4.2# systemctl status dm-event.service
● dm-event.service - Device-mapper event daemon
   Loaded: loaded (/usr/lib/systemd/system/dm-event.service; static; vendor preset: enabled)
   Active: active (running) since Tue 2019-05-07 05:47:59 UTC; 4h 12min ago
     Docs: man:dmeventd(8)
 Main PID: 45 (dmeventd)
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda0f5ecf2_708b_11e9_a71c_02a517c0cfee.slice/docker-41392d87adf9e0666622c3d0f8083b1fd9e69a432450780aaa28e705a0b60662.scope/system.slice/dm-event.service
           └─45 /usr/sbin/dmeventd -f

sh-4.2# ls -l /root/buildinfo/
total 12
-rw-r--r--. 1 root root 2798 Apr 16 15:34 Dockerfile-rhel7-7.6-252
-rw-r--r--. 1 root root 6582 Apr 24 11:51 Dockerfile-rhgs3-rhgs-server-rhel7-3.11.3-8

sh-4.2# rpm -qa | grep lvm 
lvm2-libs-2.02.180-10.el7_6.7.x86_64
lvm2-2.02.180-10.el7_6.7.x86_64

Comment 14 RamaKasturi 2019-05-15 18:23:59 UTC
Moving the bug to verified state as I do not see a dmeventd process running in the rhgs-server container. Performed the tests below to confirm this.

sh-4.2# systemctl status dm-event.service
● dm-event.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)
sh-4.2# systemctl status dm-event.socket 
● dm-event.socket
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)
sh-4.2# ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  0.4  0.0  46828  6932 ?        Ss   11:19   0:20 /usr/sbin/init
dbus         48  0.0  0.0  58096  2112 ?        Ss   11:20   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root         71  0.0  0.0  91504  1176 ?        Ssl  11:20   0:00 /usr/sbin/gssproxy -D
root         78  0.0  0.0  22696  1540 ?        Ss   11:20   0:00 /usr/sbin/crond -n
root         97  0.0  0.0 112864  4316 ?        Ss   11:20   0:00 /usr/sbin/sshd -D
root       1574  2.8  0.5 594756 167608 ?       Ssl  11:25   1:55 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root       1776  0.0  0.0  11680  1464 ?        Ss   11:26   0:00 /bin/bash /usr/local/bin/check_diskspace.sh
root       1901  1.3  2.8 20568016 924124 ?     Ssl  11:27   0:55 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.
root       1911 12.5  3.2 28059724 1066352 ?    Ssl  11:27   8:20 /usr/sbin/glusterfsd -s 10.70.47.125 --volfile-id heketidbstorage.10.70.47.125.var-lib-heketi-mounts-vg_51f
root       5499  9.4  3.3 28283528 1113192 ?    Ssl  11:27   6:14 /usr/sbin/glusterfsd -s 10.70.47.125 --volfile-id vol_664a7ca9425557dde9cab390704e2921.10.70.47.125.var-lib
root       6704 18.5  0.2 2558404 73856 ?       Ssl  11:28  12:13 /usr/sbin/glusterfsd -s 10.70.47.125 --volfile-id vol_81d773e7ba758e0f4a8f17f88c0eba44.10.70.47.125.var-lib
root       9632  5.9  1.5 13552452 509172 ?     Ssl  11:28   3:53 /usr/sbin/glusterfsd -s 10.70.47.125 --volfile-id vol_d5514028de508428dd17508c142ba3a1.10.70.47.125.var-lib
root      11814  6.8  0.7 219741900 230756 ?    Ssl  11:29   4:24 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block
root      13204  0.0  0.0 276308  1872 ?        Ssl  11:30   0:00 /usr/sbin/gluster-blockd --glfs-lru-count 15 --log-level INFO
root      15543  0.0  0.0  21764   996 ?        Ss   12:01   0:00 /usr/sbin/anacron -s
root      17760  0.0  0.0   4360   352 ?        S    12:32   0:00 sleep 120
root      17761  0.0  0.0  11820  1760 pts/0    Ss   12:32   0:00 /bin/sh
root      17911  0.0  0.0  51744  1748 pts/0    R+   12:34   0:00 ps aux
sh-4.2# ps aux | grep dmeventd
root      17918  0.0  0.0   9092   676 pts/0    R+   12:34   0:00 grep dmeventd


sh-4.2# ls -l /root/buildinfo/
total 12
-rw-r--r--. 1 root root 2798 Apr 16 15:34 Dockerfile-rhel7-7.6-252
-rw-r--r--. 1 root root 6824 May 15 05:02 Dockerfile-rhgs3-rhgs-server-rhel7-3.11.3-11

Add / remove device works fine.

Able to create glusterfile and block volumes, but it took around 2 minutes for the volume to reach the bound state.

Rebooted the server, but I saw an issue while the server booted up. I will raise a separate bug for this issue.

Rebooted the server and added the device; device addition worked fine.
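
The process check from this comment can be sketched as a small function. Note that 'ps aux | grep dmeventd' always matches the grep process itself, as the transcript shows. The version below looks only at the command column of 'ps aux' output read from standard input. The function name has_dmeventd is an assumption made for illustration, and the sketch assumes the usual 'ps aux' layout with the command starting in field 11.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the 'ps aux' listing on stdin contains a
# process whose executable name is exactly "dmeventd", exit 1 otherwise.
# Only field 11 (the command itself, without arguments) is compared, so a
# "grep dmeventd" line does not count as a match.
has_dmeventd() {
    awk '{
        n = split($11, parts, "/")      # strip any leading directory
        if (parts[n] == "dmeventd") found = 1
    }
    END { exit !found }'
}

# Example call (not executed here):
#   if ps aux | has_dmeventd; then echo "dmeventd is still running"; fi
```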

Comment 15 Anjana KD 2019-06-03 12:50:22 UTC
Have updated the doc text. Kindly review it for technical accuracy.

Comment 18 errata-xmlrpc 2019-06-13 19:18:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1406