Bug 1674485

Summary: Cannot install latest version of OCP + OCS in AWS environment.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RamaKasturi <knarra>
Component: cns-ansible
Assignee: John Mulligan <jmulligan>
Status: CLOSED WORKSFORME
QA Contact: Prasanth <pprakash>
Severity: high
Docs Contact:
Priority: unspecified
Version: ocs-3.11
CC: admin, bkunal, bward, dyocum, hchiramm, jarrpa, jocelyn.thode, knarra, kramdoss, madam, mrobson, nchilaka, ndevos, nravinas, pasik, pdwyer, puebele, rcyriac, rhs-bugs, sarumuga
Keywords: ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-20 03:48:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1676466, 1676612, 1684133
Bug Blocks:

Description RamaKasturi 2019-02-11 13:12:24 UTC
Description of problem:

When trying to install the latest version of OCP + OCS on an AWS environment, the "Create heketi DB volume" task fails with the error below.

TASK [openshift_storage_glusterfs : Create heketi DB volume] ****************************************************************************************************************
fatal: [ip-172-16-21-89.ap-south-1.compute.internal]: FAILED! => {"changed": true, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-cBAB7c/admin.kubeconfig", "rsh", 
"--namespace=app-storage", "deploy-heketi-storage-1-647l6", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "lX9PAltTQ+YewW/mOC5GbNjFaPMogGkuCUNo
67VF9E0=", "setup-openshift-heketi-storage", "--image", "registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.11.1", "--listfile", "/tmp/heketi-storage.json"], "delta":
 "0:00:00.332553", "end": "2019-02-11 11:06:30.846805", "msg": "non-zero return code", "rc": 255, "start": "2019-02-11 11:06:30.514252", "stderr": "Error: Failed to allocate
 new volume: No online storage devices in cluster\ncommand terminated with exit code 255", "stderr_lines": ["Error: Failed to allocate new volume: No online storage devices 
in cluster", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

Deploy_heketi logs:
=====================
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 150.95µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 198.975µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 136.402µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 153.753µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 157.145µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 147.729µs
[kubeexec] DEBUG 2019/02/11 11:09:02 heketi/pkg/remoteexec/kube/exec.go:81:kube.ExecCommands: Ran command [pvcreate -qq --metadatasize=128M --dataalignment=256K '/dev/xvdf'] on [pod:glusterfs-storage-g56jg c:glusterfs ns:app-storage (from host:ip-172-16-43-224.ap-south-1.compute.internal selector:glusterfs-node)]: Stdout []: Stderr [  WARNING: Device /dev/xvdf not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/dockervg/dockerlv not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvda2 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvdb1 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvdc not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvdd not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvdf not initialized in udev database even after waiting 10000000 microseconds.
]
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 237.859µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 144.222µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 206.97µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 139.667µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 254.123µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 146.091µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 146.346µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 132.643µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 156.745µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 158.468µs
[heketi] INFO 2019/02/11 11:09:13 Starting Node Health Status refresh
[cmdexec] INFO 2019/02/11 11:09:13 Check Glusterd service status in node ip-172-16-25-44.ap-south-1.compute.internal
[kubeexec] DEBUG 2019/02/11 11:09:13 heketi/pkg/remoteexec/kube/exec.go:81:kube.ExecCommands: Ran command [systemctl status glusterd] on [pod:glusterfs-storage-8j7l5 c:glusterfs ns:app-storage (from host:ip-172-16-25-44.ap-south-1.compute.internal selector:glusterfs-node)]: Stdout [● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-02-11 11:03:46 UTC; 5min ago
  Process: 394 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 398 (glusterd)
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb0f9ae56_2dec_11e9_8a2b_021f0f7e5d7a.slice/docker-f1ebc3766158fc63231de73a67ebdc0adff2a0e8fe28f36d664ef7f2fc91611a.scope/system.slice/glusterd.service
           └─398 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
]: Stderr []
[heketi] INFO 2019/02/11 11:09:13 Periodic health check status: node 6d8e435b0cf0d43ff48f8072ff63a655 up=true
[cmdexec] INFO 2019/02/11 11:09:13 Check Glusterd service status in node ip-172-16-43-224.ap-south-1.compute.internal
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 125.652µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291
[negroni] Completed 200 OK in 158.985µs
[negroni] Started GET /queue/9af1cf9cf58e61d2edaf19783be57291


Version-Release number of selected component (if applicable):
[ec2-user@ip-172-16-25-44 ~]$ oc version
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO


How reproducible:
Easily

Steps to Reproduce:
1. Deploy OCP + OCS 3.11.1 on an AWS environment

Actual results:
Deployment fails at the stage where it tries to create the heketi DB volume.

Expected results:
Deployment should not fail.

Additional info:

(Same "Create heketi DB volume" task failure and deploy-heketi log as in the description above.)

Comment 2 RamaKasturi 2019-02-11 13:13:04 UTC
Will be uploading the logs shortly

Comment 8 Niels de Vos 2019-02-12 10:14:25 UTC
While running LVM commands in the glusterfs pod, there are many references to udev. It is not recommended to run udev in both the host and the pod, so we configure LVM not to use udev.

sh-4.2# pvs -vvv
...
        lvmcache /dev/xvdb1: VG dockervg: set VGID to QMLH4dQ53pkpyho6e7qxf1pOpUIFSbzg.
        Opened /dev/xvda RO O_DIRECT
      /dev/xvda: size is 209715200 sectors
        Closed /dev/xvda
        /dev/xvda: Skipping: Partition table signature found
        dm version   [ opencount flush ]   [16384] (*1)
        dm status   (253:0) [ noopencount noflush ]   [16384] (*1)
        Opened /dev/dockervg/dockerlv RO O_DIRECT
      /dev/dockervg/dockerlv: size is 209707008 sectors
        Closed /dev/dockervg/dockerlv
      /dev/dockervg/dockerlv: using cached size 209707008 sectors
        Device /dev/dockervg/dockerlv not initialized in udev database (1/100, 0 microseconds).
        Device /dev/dockervg/dockerlv not initialized in udev database (2/100, 100000 microseconds).
        Device /dev/dockervg/dockerlv not initialized in udev database (3/100, 200000 microseconds).
        Device /dev/dockervg/dockerlv not initialized in udev database (4/100, 300000 microseconds).
        Device /dev/dockervg/dockerlv not initialized in udev database (5/100, 400000 microseconds).
...

And this all takes a pretty long time.

There is an option in /etc/lvm/lvm.conf that we should probably disable:

        obtain_device_list_from_udev = 1

Even when disabling this option, the messages still occur.
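
For reference, a minimal sketch (not the actual fix) of how this setting can be flipped inside a running glusterfs pod for a quick timing test. The pod name and namespace are taken from the deploy-heketi log above; the exact lvm.conf layout may differ between image versions.

--- %< ---
# Sketch only: disable udev-based device listing in one pod and re-time pvs.
oc rsh -n app-storage glusterfs-storage-g56jg    # pod name from the log above
# ...then, inside the pod:
lvmconfig devices/obtain_device_list_from_udev   # show the current value (1 by default)
sed -i 's/obtain_device_list_from_udev = 1/obtain_device_list_from_udev = 0/' /etc/lvm/lvm.conf
time pvs                                         # compare against the 10-second-per-device udev waits above
--- %< ---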

ltrace shows that libudev is still used a lot:

sh-4.2# ltrace -e '*udev*' pvs
...
libudev.so.1->udev_list_entry_get_next(0, 0x7f554bfed778, 0xffffffff, 0x55f6dcd0e000)                                                              = 0
libudev.so.1->udev_list_entry_get_next(0, 0x7f554bfed778, 0xffffffff, 0x55f6dcd0e000)                                                              = 0
libudev.so.1->udev_list_entry_get_next(0, 0x7f554bfed778, 0xffffffff, 0x55f6dcd0e000)                                                              = 0
<... udev_device_unref resumed> )                                                                                                                  = 0
pvs->udev_device_new_from_devnum(0x55f6dccd5010, 98, 0xfd00, 0x55f6dcd0d830 <unfinished ...>
libudev.so.1->udev_device_new_from_syspath(0x55f6dccd5010, 0x7ffd81a312e0, 0x7ffd81a312f4, 20)                                                     = 0x55f6dcd0d840
<... udev_device_new_from_devnum resumed> )                                                                                                        = 0x55f6dcd0d840
pvs->udev_device_get_is_initialized(0x55f6dcd0d840, 0x55f6dcd0e0c2, 0x7f554bd99b20, 0 <unfinished ...>
libudev.so.1->udev_device_get_subsystem(0x55f6dcd0d840, 0, 0x7f554bd99b20, 0)                                                                      = 0x55f6dcd0e0a0
libudev.so.1->udev_device_get_devnum(0x55f6dcd0d840, 0x55f6dcd0de18, 0, 0)                                                                         = 0xfd00
libudev.so.1->udev_device_get_devnum(0x55f6dcd0d840, 0, 253, 0)                                                                                    = 0xfd00
libudev.so.1->udev_device_get_devnum(0x55f6dcd0d840, 0, 253, 0)                                                                                    = 0xfd00
libudev.so.1->udev_device_get_subsystem(0x55f6dcd0d840, 0, 253, 0)                                                                                 = 0x55f6dcd0e0a0
<... udev_device_get_is_initialized resumed> )                                                                                                     = 0
pvs->udev_device_unref(0x55f6dcd0d840, 0, 0, -1 <unfinished ...>
libudev.so.1->udev_list_entry_get_next(0, 0x7f554bfed768, 0xffffffff, 0x55f6dcd0e500)                                                              = 0
...


Looking at the lvm2 code, there has been a recent change that pretty much describes our problem:
 - https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=3ebce8dbd2d9afc031e0737f8feed796ec7a8df9;hp=d19e3727951853093828b072e254e447f7d61c60

    apply obtain_device_list_from_udev to all libudev usage

    udev_dev_is_md_component and udev_dev_is_mpath_component
    are not used for obtaining the device list, but they still
    use libudev for device info.  When there are problems with
    udev, these functions can get stuck. So, use the existing
    obtain_device_list_from_udev config setting to also control
    whether these "is component" functions are used, which gives
    us a way to avoid using libudev entirely when it's causing
    problems.


I'll run some tests to see if this fixes the problem for us:

1. edit lvm.conf and disable 'obtain_device_list_from_udev'
2. build a patched lvm2 package and install it in a running pod

Comment 9 Niels de Vos 2019-02-12 10:37:42 UTC
I tested in two different pods, the first with the original lvm2 packages and config:

--- %< ---
sh-4.2# time pvs
  WARNING: Device /dev/dockervg/dockerlv not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vg_9beb04a3ac2e50d6ff4ca1e44fde85bc/bz1674485 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvda2 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvdb1 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvdc not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvdd not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/xvdf not initialized in udev database even after waiting 10000000 microseconds.
  PV         VG                                  Fmt  Attr PSize    PFree  
  /dev/xvdb1 dockervg                            lvm2 a--  <100.00g      0 
  /dev/xvdf  vg_9beb04a3ac2e50d6ff4ca1e44fde85bc lvm2 a--   999.87g 997.87g

real	1m10.215s
user	0m0.044s
sys	0m0.066s
--- %< ---

And the 2nd pod with updated lvm2 packages and obtain_device_list_from_udev=0 in lvm.conf:

--- %< ---
sh-4.2# time pvs
  PV         VG                                  Fmt  Attr PSize    PFree  
  /dev/xvdb1 dockervg                            lvm2 a--  <100.00g      0 
  /dev/xvdf  vg_4f47afe74c28587a6f0b4110eae142cd lvm2 a--   999.87g 997.87g

real	0m0.031s
user	0m0.004s
sys	0m0.010s
--- %< ---

The updated packages (+src.rpm) are at https://people.redhat.com/ndevos/bz1674485/

Comment 10 Humble Chirammal 2019-02-12 10:46:23 UTC
Kasturi, thanks for providing these details. It would also be appreciated if you could provide some more details
from the host:

#rpm -qa |egrep 'lvm2|systemd|udev'

Comment 11 RamaKasturi 2019-02-12 10:55:03 UTC
Output from the hosts where the gluster pods are running:
==========================================================

[ec2-user@ip-172-31-25-77 ~]$ ansible -i  inv_3.11 nodes -b -m shell -a "rpm -qa |egrep 'lvm2|systemd|udev'"
 [WARNING]: Consider using the yum, dnf or zypper module rather than running rpm.  If you need to use command because yum, dnf or zypper is insufficient you can add
warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.

ip-172-16-43-224.ap-south-1.compute.internal | SUCCESS | rc=0 >>
python-gudev-147.2-7.el7.x86_64
lvm2-2.02.180-10.el7_6.3.x86_64
systemd-219-62.el7.x86_64
systemd-sysv-219-62.el7.x86_64
python-pyudev-0.15-9.el7.noarch
oci-systemd-hook-0.1.18-3.git8787307.el7_6.x86_64
systemd-libs-219-62.el7.x86_64
libgudev1-219-62.el7.x86_64
lvm2-libs-2.02.180-10.el7_6.3.x86_64

ip-172-16-25-44.ap-south-1.compute.internal | SUCCESS | rc=0 >>
python-gudev-147.2-7.el7.x86_64
lvm2-2.02.180-10.el7_6.3.x86_64
systemd-219-62.el7.x86_64
systemd-sysv-219-62.el7.x86_64
python-pyudev-0.15-9.el7.noarch
oci-systemd-hook-0.1.18-3.git8787307.el7_6.x86_64
systemd-libs-219-62.el7.x86_64
libgudev1-219-62.el7.x86_64
lvm2-libs-2.02.180-10.el7_6.3.x86_64

ip-172-16-30-46.ap-south-1.compute.internal | SUCCESS | rc=0 >>
python-gudev-147.2-7.el7.x86_64
lvm2-2.02.180-10.el7_6.3.x86_64
systemd-219-62.el7.x86_64
systemd-sysv-219-62.el7.x86_64
python-pyudev-0.15-9.el7.noarch
oci-systemd-hook-0.1.18-3.git8787307.el7_6.x86_64
systemd-libs-219-62.el7.x86_64
libgudev1-219-62.el7.x86_64
lvm2-libs-2.02.180-10.el7_6.3.x86_64

Comment 12 Humble Chirammal 2019-02-12 11:06:35 UTC
Niels, we can disable the parameter (obtain_device_list_from_udev) in lvm.conf and commit an image for QE testing. That's the easiest and quickest thing we
can do here to isolate the issue.

Kasturi, were we able to do a successful deployment on another cloud or a bare-metal cluster? I am trying to understand
what is special about the AWS instances.

Comment 13 RamaKasturi 2019-02-12 11:12:56 UTC
(In reply to Humble Chirammal from comment #12)
> Niels, we can disable the param (obtain_device_list_from_udev) in lvm.conf
> and commit an image for QE testing. Thats the easiest or quick thing we
> can do here to isolate the issue.
> 
> kasturi, Were we able to do successful deployment in other cloud or
> baremetal cluster ? 
Our primary test environment has been VMware, and we did not face any deployment issues in that environment.

> I am trying to understand whats special about AWS instance.

Comment 14 RamaKasturi 2019-02-12 11:13:43 UTC
Setting needinfo back on ndevos as it got cleared.

Comment 15 Niels de Vos 2019-02-12 12:00:32 UTC
(In reply to Humble Chirammal from comment #12)
> Niels, we can disable the param (obtain_device_list_from_udev) in lvm.conf
> and commit an image for QE testing. Thats the easiest or quick thing we
> can do here to isolate the issue.

Unfortunately, changing obtain_device_list_from_udev is not sufficient. LVM has a bug where it still tries to use udev for some actions (or some devices?). The patch mentioned in comment #8 is not yet included in any lvm2 build (that is why I did a test build for comment #9). I am checking with the LVM developers whether they already have a BZ for getting that patch included; otherwise we'll open a new one.
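
For anyone testing a patched build, a minimal check (assuming ltrace is available in the pod, as in comment #8) of whether lvm2 still consults libudev after the config change:

--- %< ---
# Sketch only, run inside the glusterfs pod.
lvmconfig devices/obtain_device_list_from_udev   # confirm the setting is 0 after editing lvm.conf
ltrace -e '*udev*' pvs 2>&1 | grep -c libudev    # a fixed lvm2 should make few or no libudev calls here
time pvs                                         # with the fix this returns in well under a second
--- %< ---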

Comment 16 Humble Chirammal 2019-02-12 13:30:36 UTC
(In reply to RamaKasturi from comment #13)
> (In reply to Humble Chirammal from comment #12)
> > Niels, we can disable the param (obtain_device_list_from_udev) in lvm.conf
> > and commit an image for QE testing. Thats the easiest or quick thing we
> > can do here to isolate the issue.
> > 
> > kasturi, Were we able to do successful deployment in other cloud or
> > baremetal cluster ? 
> Our primary test environment has been vmware  and we did not face any issues
> related to deployment on this environment.
> 
> I am trying to understand whats special about AWS instance.

Kasturi, could you please fetch the lvm2, udev, and systemd package versions from a system where we ***don't*** hit this issue?

The root-cause mapping between AWS and our OCS 3.11.1 container has to be established somehow, hence the question.

Comment 17 Matthew Robson 2019-02-12 17:50:52 UTC
*** Bug 1675134 has been marked as a duplicate of this bug. ***

Comment 19 Humble Chirammal 2019-02-13 08:13:06 UTC
I took a look at the VMware setup where QE did not hit the installation issue; however, I still see the udev database
fetch taking a long time on this setup too.


[root@dhcp46-138 ~]# oc get pods -o wide
NAME                                          READY     STATUS    RESTARTS   AGE       IP             NODE                                NOMINATED NODE
glusterblock-storage-provisioner-dc-1-79sn5   1/1       Running   0          4d        10.130.0.2     dhcp47-37.lab.eng.blr.redhat.com    <none>
glusterfs-storage-6vstf                       1/1       Running   1          4d        10.70.46.115   dhcp46-115.lab.eng.blr.redhat.com   <none>
glusterfs-storage-88jxh                       1/1       Running   0          4d        10.70.47.37    dhcp47-37.lab.eng.blr.redhat.com    <none>
glusterfs-storage-zcppk                       1/1       Running   0          4d        10.70.47.2     dhcp47-2.lab.eng.blr.redhat.com     <none>
heketi-storage-1-vzpwb                        1/1       Running   0          4d        10.130.0.7     dhcp47-37.lab.eng.blr.redhat.com    <none>


Inside  'glusterfs-storage-6vstf' pod

[root@dhcp46-115 /]# time pvs
  WARNING: Device /dev/rhel_dhcp47-42/root not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sda1 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sda2 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sda3 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vg_0f0b820a8ce7358d933692c82db220fe/brick_539df0f01872cc2d2250f8344de1d1e3 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/rhel_dhcp47-42/home not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/mapper/docker-8:17-11610-a15d2327feccd20bfbdec749b9ae6a8f6e3e8c84ec6317057fb0d6e470c8f743 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/mapper/docker-8:17-11610-0db4c3fb7c2a280f4e2e7384a7108630c4c71f13b49a00a8d8323cbf47d654fb not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/mapper/docker-8:17-11610-c9d00c30643dc58c88c339e3353dc115af8630fde26272e9b771c4b8d51530f3 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/mapper/docker-8:17-11610-ffa8367b3094fdcc953f823ee6b1f71fa9da5da436f835c8f0f2b15e4586717d not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/mapper/docker-8:17-11610-404e60308514ee2e68bd3460e1b597b5c09af3957ce3263dccb767a329f6b902 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/mapper/docker-8:17-11610-53a6ace3a8119115dfde7716553919bd07cfc9ced0ab81bf3d1c7f91a951ccaf not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/mapper/docker-8:17-11610-b1c63776580ea421898bd35ac5c97137015075d94e1911f49e2783471713e4da not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sdb1 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/mapper/docker-8:17-11610-ae756cef22b62b9b7b810ff3e7b8102251a98592f0d074ae74a832a9ede9f476 not initialized in udev database even after waiting 10000000 microseconds.


[root@dhcp46-115 ~]# uname -r
3.10.0-957.5.1.el7.x86_64

[root@dhcp46-115 ~]# rpm -qa |egrep 'lvm2|udev|systemd'
lvm2-2.02.180-10.el7_6.3.x86_64
python-pyudev-0.15-9.el7.noarch
python-gudev-147.2-7.el7.x86_64
oci-systemd-hook-0.1.18-3.git8787307.el7_6.x86_64
systemd-libs-219-62.el7_6.3.x86_64
systemd-219-62.el7_6.3.x86_64
systemd-sysv-219-62.el7_6.3.x86_64
lvm2-libs-2.02.180-10.el7_6.3.x86_64
libgudev1-219-62.el7_6.3.x86_64

-----------------------------

The pending question is how this setup passed the installation stage.

Kasturi, can you please confirm the Ansible installation is unchanged between this VMware setup and the AWS setup?

Comment 24 RamaKasturi 2019-02-13 13:34:56 UTC
Hello Humble & Niels,

   I have tested the build provided in comment 20, and the installation succeeded on the AWS environment. Below are some of the output details.


INSTALLER STATUS ************************************************************************************************************************************************************
Initialization               : Complete (0:02:29)
Health Check                 : Complete (0:01:13)
Node Bootstrap Preparation   : Complete (0:59:03)
etcd Install                 : Complete (0:13:48)
Master Install               : Complete (0:35:00)
Master Additional Install    : Complete (0:20:59)
Node Join                    : Complete (0:02:33)
GlusterFS Install            : Complete (0:10:41)
Hosted Install               : Complete (0:05:28)
Cluster Monitoring Operator  : Complete (0:02:27)
Web Console Install          : Complete (0:01:53)
Console Install              : Complete (0:01:34)
metrics-server Install       : Complete (0:00:01)
Service Catalog Install      : Complete (0:09:42)


Images used for the test :
================================
[ec2-user@ip-172-16-28-144 ~]$ sudo docker images
REPOSITORY                                                                     TAG                 IMAGE ID            CREATED             SIZE
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7   ocs3.11-testbuild   6e57b47ca4e7        18 hours ago        268 MB
registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7                         v3.11               f4a8b6113476        11 days ago         277 MB
registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7                 v3.11               e74761279746        11 days ago         957 MB


pvs and pvscan output :
========================
[ec2-user@ip-172-16-28-144 ~]$ oc exec -it glusterfs-storage-fhwnx bash
[root@ip-172-16-19-162 /]# pvs
  PV         VG                                  Fmt  Attr PSize    PFree   
  /dev/xvdb1 dockervg                            lvm2 a--  <100.00g       0 
  /dev/xvdf  vg_cd65d867303a3297078a165dce522092 lvm2 a--   999.87g <997.85g
[root@ip-172-16-19-162 /]# pvscan
  PV /dev/xvdb1   VG dockervg                              lvm2 [<100.00 GiB / 0    free]
  PV /dev/xvdf    VG vg_cd65d867303a3297078a165dce522092   lvm2 [999.87 GiB / <997.85 GiB free]
  Total: 2 [1.07 TiB] / in use: 2 [1.07 TiB] / in no VG: 0 [0   ]

lvm version :
======================

[root@ip-172-16-19-162 /]# rpm -qa | grep lvm
lvm2-libs-2.02.180-10.el7_6.2.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64

pvs, pvscan & lvm output from the node hosting the second pod:
==========================================================

[ec2-user@ip-172-16-28-144 ~]$ oc exec -it glusterfs-storage-nbggv bash
[root@ip-172-16-17-223 /]# pvs
  PV         VG                                  Fmt  Attr PSize    PFree   
  /dev/xvdb1 dockervg                            lvm2 a--  <100.00g       0 
  /dev/xvdf  vg_a91d09b1b311fe03a90e4fc5d741ed06 lvm2 a--   999.87g <997.85g
[root@ip-172-16-17-223 /]# pvscan
  PV /dev/xvdf    VG vg_a91d09b1b311fe03a90e4fc5d741ed06   lvm2 [999.87 GiB / <997.85 GiB free]
  PV /dev/xvdb1   VG dockervg                              lvm2 [<100.00 GiB / 0    free]
  Total: 2 [1.07 TiB] / in use: 2 [1.07 TiB] / in no VG: 0 [0   ]
[root@ip-172-16-17-223 /]# rpm -qa | gerep lvm
bash: gerep: command not found
[root@ip-172-16-17-223 /]# rpm -qa | grep lvm
lvm2-libs-2.02.180-10.el7_6.2.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64
[root@ip-172-16-17-223 /]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)
[root@ip-172-16-17-223 /]# cat /etc/redhat-storage-release 
Red Hat Gluster Storage Server 3.4.2(Container)

Third node:
===================
[ec2-user@ip-172-16-28-144 ~]$ oc exec -it glusterfs-storage-vmv2b bash
[root@ip-172-16-38-52 /]# pvs
  PV         VG                                  Fmt  Attr PSize    PFree   
  /dev/xvdb1 dockervg                            lvm2 a--  <100.00g       0 
  /dev/xvdf  vg_933fad2019ed74e294c19b0635ae6d97 lvm2 a--   999.87g <997.85g
[root@ip-172-16-38-52 /]# pvscan
  PV /dev/xvdf    VG vg_933fad2019ed74e294c19b0635ae6d97   lvm2 [999.87 GiB / <997.85 GiB free]
  PV /dev/xvdb1   VG dockervg                              lvm2 [<100.00 GiB / 0    free]
  Total: 2 [1.07 TiB] / in use: 2 [1.07 TiB] / in no VG: 0 [0   ]
[root@ip-172-16-38-52 /]# rpm -qa | grep lvm
lvm2-libs-2.02.180-10.el7_6.2.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64

Created a file PVC and a block PVC on the default gluster storage classes, and both work fine:
========================================================================================

[ec2-user@ip-172-16-28-144 ~]$ oc get pvc
NAME         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
blockvtest   Bound     pvc-7ec61e7f-2f93-11e9-b418-029d9c8ade0e   10Gi       RWO            glusterfs-storage-block   7s
filevtest    Bound     pvc-601cef03-2f93-11e9-b418-029d9c8ade0e   10Gi       RWO            glusterfs-storage         58s
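
For completeness, a minimal sketch of how such PVCs can be created against the default storage classes. The claim name, storage class, and size match the output above; a block PVC is the same manifest with storageClassName: glusterfs-storage-block.

--- %< ---
# Sketch only: create the file PVC and wait for it to bind.
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: filevtest
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
oc get pvc filevtest   # should reach Bound shortly
--- %< ---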

@Humble / Niels, the above are the basic validations I have performed on the AWS setup. Are there any other tests we need to run before we say that the container image works fine?

Comment 25 RamaKasturi 2019-02-13 13:39:30 UTC
I see that a needinfo on Jose has been cleared, so setting it back.

Comment 37 jocelyn.thode 2019-02-18 15:26:49 UTC
Hey Guys,

I also hit this issue on our OKD 3.11 cluster. I have summarized part of my findings here: https://github.com/gluster/gluster-containers/issues/128

I did not encounter this problem on an older gluster-centos image which was running gluster 4.1.6 with the following packages:

sh-4.2# rpm -qa |egrep 'lvm2|udev|systemd'
python-pyudev-0.15-9.el7.noarch
systemd-219-62.el7.x86_64
lvm2-libs-2.02.180-10.el7_6.2.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64
systemd-libs-219-62.el7.x86_64
systemd-sysv-219-62.el7.x86_64

However, when using the latest gluster-centos image, which had 4.1.7 in it, the lvm2 and systemd packages were updated as well, to the following versions:
sh-4.2# rpm -qa |egrep 'lvm2|udev|systemd'
python-pyudev-0.15-9.el7.noarch
systemd-219-62.el7_6.3.x86_64
lvm2-libs-2.02.180-10.el7_6.3.x86_64
lvm2-2.02.180-10.el7_6.3.x86_64
systemd-libs-219-62.el7_6.3.x86_64
systemd-sysv-219-62.el7_6.3.x86_64

With these new versions I can always reproduce this issue on our OKD cluster.

Comment 39 Brian Ward 2019-02-26 14:29:56 UTC
Is this issue resolved with the fix provided by bug 1676921 ?

Comment 40 Brian Ward 2019-02-27 21:22:40 UTC
My customer deployed the latest release successfully and determined that this issue was resolved by the fix provided in bug 1676921.

Comment 41 Niels de Vos 2019-02-28 13:56:17 UTC
Kasturi, Jose, what do you think about closing this as a duplicate of bug 1676921?

Comment 42 RamaKasturi 2019-03-03 10:43:28 UTC
Hello Niels,

    I feel that the other bug was raised to track the lvm2 package downgrade, so this bug cannot be closed as a duplicate of it. It would be good if you could move this to ON_QA and we will verify it.

Thanks
kasturi

Comment 43 Niels de Vos 2019-03-04 09:11:52 UTC
With the current rhgs-server container this problem should not happen anymore.

Comment 44 RamaKasturi 2019-03-04 17:46:22 UTC
I do not see this issue happening anymore with the rhgs-server container build rhgs3/rhgs-server-rhel7:3.11.1-15. The fix was to include a downgraded lvm2 package in the build.


Below are the tests performed to verify the fix.

1) Fresh install on VMware and AWS environments.
2) Upgrade on VMware and AWS environments.

Image used for testing these packages is rhgs3/rhgs-server-rhel7:3.11.1-15

Below are the versions of gluster and lvm2 present in the container:
==================================================================

[ec2-user@ip-172-16-16-77 ~]$ oc rsh glusterfs-storage-mnlkh 
sh-4.2# cat /etc/redhat-storage-release 
Red Hat Gluster Storage Server 3.4.2(Container)
sh-4.2# rpm -qa | grep gluster
glusterfs-libs-3.12.2-32.el7rhgs.x86_64
glusterfs-3.12.2-32.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-32.el7rhgs.x86_64
glusterfs-server-3.12.2-32.el7rhgs.x86_64
gluster-block-0.2.1-30.el7rhgs.x86_64
glusterfs-api-3.12.2-32.el7rhgs.x86_64
glusterfs-cli-3.12.2-32.el7rhgs.x86_64
python2-gluster-3.12.2-32.el7rhgs.x86_64
glusterfs-fuse-3.12.2-32.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-32.el7rhgs.x86_64
sh-4.2# rpm -qa | grep lvm
lvm2-libs-2.02.180-10.el7_6.2.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64

pvs & pvscan on the gluster pod:
======================================

sh-4.2# pvs
  /run/lvm/lvmetad.socket: connect failed: Connection refused
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  PV         VG                                  Fmt  Attr PSize    PFree
  /dev/xvdb1 dockervg                            lvm2 a--  <100.00g    0 
  /dev/xvdf  vg_558b9fb496a0c15f5d9e41bc323a114a lvm2 a--     1.95t 1.80t
sh-4.2# pvscan
  /run/lvm/lvmetad.socket: connect failed: Connection refused
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  PV /dev/xvdf    VG vg_558b9fb496a0c15f5d9e41bc323a114a   lvm2 [1.95 TiB / 1.80 TiB free]
  PV /dev/xvdb1   VG dockervg                              lvm2 [<100.00 GiB / 0    free]
  Total: 2 [2.05 TiB] / in use: 2 [2.05 TiB] / in no VG: 0 [0   ]


pvs & pvscan on the node where the gluster pod is running:
=======================================================
[ec2-user@ip-172-16-37-138 ~]$ sudo pvs
  /run/lvm/lvmetad.socket: connect failed: Connection refused
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  PV         VG                                  Fmt  Attr PSize    PFree
  /dev/xvdb1 dockervg                            lvm2 a--  <100.00g    0 
  /dev/xvdf  vg_558b9fb496a0c15f5d9e41bc323a114a lvm2 a--     1.95t 1.80t
[ec2-user@ip-172-16-37-138 ~]$ sudo pvscan
  /run/lvm/lvmetad.socket: connect failed: Connection refused
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  PV /dev/xvdf    VG vg_558b9fb496a0c15f5d9e41bc323a114a   lvm2 [1.95 TiB / 1.80 TiB free]
  PV /dev/xvdb1   VG dockervg                              lvm2 [<100.00 GiB / 0    free]
  Total: 2 [2.05 TiB] / in use: 2 [2.05 TiB] / in no VG: 0 [0   ]


Logs for the command 'heketi-cli server state examine gluster' are available at the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/cns/311async/

Moving this bug to VERIFIED since no udev-related issues are seen and the container image has lvm2 version lvm2-2.02.180-10.el7_6.2.x86_64.