Bug 1627104 - can't deploy gluster with crio because LVM commands fail
Summary: can't deploy gluster with crio because LVM commands fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhgs-server-container
Version: cns-3.10
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: OCS 3.11.1
Assignee: Saravanakumar
QA Contact: Rachael
URL:
Whiteboard:
Duplicates: 1634454
Depends On: 1536511 1651270
Blocks: OCS-3.11.1-Engineering-Proposed-BZs OCS-3.11.1-devel-triage-done 1642792 1644154
 
Reported: 2018-09-10 13:00 UTC by Karim Boumedhel
Modified: 2022-03-13 15:32 UTC
CC List: 26 users

Fixed In Version: rhgs-server-rhel7:3.11.0-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-07 04:12:47 UTC
Embargoed:


Attachments
inventory (3.57 KB, text/plain)
2018-09-10 13:00 UTC, Karim Boumedhel


Links
Github gluster/gluster-containers pull 104: Do not run udev and lvmetad inside the container (closed, last updated 2021-01-14 08:25:38 UTC)
Red Hat Bugzilla 1536511: Gluster pod with 850 volumes fails to come up after node reboot (CLOSED, last updated 2021-02-22 00:41:40 UTC)
Red Hat Bugzilla 1589277: private bug, no public summary (last updated 2021-09-09 14:29:39 UTC)
Red Hat Bugzilla 1623433: Brick fails to come online after shutting down and restarting a node (CLOSED, last updated 2021-02-22 00:41:40 UTC)
Red Hat Bugzilla 1651270: Cant deploy gluster with crio on OCP-3.11.43 (CLOSED, last updated 2022-03-13 16:07:36 UTC)
Red Hat Knowledge Base (Solution) 3774451 (last updated 2018-12-29 11:09:14 UTC)
Red Hat Product Errata RHEA-2019:0287 (last updated 2019-02-07 04:13:08 UTC)

Internal Links: 1536511 1589277 1623433 1651270

Description Karim Boumedhel 2018-09-10 13:00:11 UTC
Description of problem:
When I deploy container-native storage with OpenShift and crio set to True, the Gluster deployment fails when creating logical volumes within the Gluster pods.
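
For context, a minimal sketch of the inventory settings involved, assuming the standard openshift-ansible variable names (the hostname and device path below are examples, not taken from the attached inventory):

~~~
# Hypothetical inventory excerpt: enable CRI-O and define a GlusterFS node.
# Hostname and device path are placeholders.
[OSEv3:vars]
openshift_use_crio=True

[glusterfs]
node1.example.com glusterfs_devices='[ "/dev/vdb" ]'
~~~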


Version-Release number of selected component (if applicable):


How reproducible:
Deploy OpenShift with openshift-ansible, with Gluster and CRI-O enabled.


Steps to Reproduce:
1. Launch the openshift-ansible playbook
2.
3.

Actual results:


Expected results:


Additional info:
A patch can be included to overcome the issue by disabling udev.
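
As background, the udev-related options the later comments toggle all live in /etc/lvm/lvm.conf inside the Gluster pods. A minimal sketch of those settings, assuming the standard lvm.conf section layout (the values shown are the workaround, not the image defaults):

~~~
# Sketch of /etc/lvm/lvm.conf with udev interaction disabled for the container.
activation {
    udev_sync = 0    # do not wait for udev to settle device nodes
    udev_rules = 0   # let LVM manage /dev nodes itself instead of udev
}
global {
    use_lvmetad = 0  # do not rely on the lvmetad daemon, which is not running in the pod
}
~~~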

Comment 2 Karim Boumedhel 2018-09-10 13:00:50 UTC
Created attachment 1482120 [details]
inventory

Comment 3 Niels de Vos 2018-09-13 11:37:49 UTC
This problem has been found while deploying CNS-3.10 on OCP-3.11 (which uses CRI-O by default). Using docker as the container runtime makes it work again.

Alternatives are to disable udev_rules in /etc/lvm/lvm.conf in the glusterfs pods, or to set the environment variable DM_DISABLE_UDEV to "1" in the glusterfs daemonset.

Even though disabling udev rules in the glusterfs pods is the preferred approach, problems have been reported when this is done (bz#1536511). This needs more investigation into what problems it causes (and whether that is still the case).
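
For the daemonset alternative, a hedged sketch of how the environment variable could be set (the daemonset name glusterfs-storage and the glusterfs namespace are assumptions and vary per deployment):

~~~
# Sketch only: set DM_DISABLE_UDEV=1 on the glusterfs DaemonSet so device-mapper
# skips udev synchronization. DaemonSet name and namespace are assumptions.
oc set env daemonset/glusterfs-storage DM_DISABLE_UDEV=1 -n glusterfs

# Verify the variable is present in the pod template:
oc set env daemonset/glusterfs-storage --list -n glusterfs
~~~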

Comment 7 Jose A. Rivera 2018-10-01 19:48:18 UTC
*** Bug 1634454 has been marked as a duplicate of this bug. ***

Comment 12 Klaas Demter 2018-10-17 15:48:16 UTC
Are there workaround instructions for OpenShift Enterprise 3.11?

Comment 14 Klaas Demter 2018-10-17 16:07:02 UTC
I don't think that's a public registry :) I would really like for this to hit the customer-facing OpenShift registry; this kinda kills my only OpenShift instance :)

Comment 17 Klaas Demter 2018-10-24 06:37:58 UTC
So support informed me: "As we heard as of now crio is not supported with ocs 3.11, As engineering team is already working on the raised bugzilla, we believe it would come in later version of ocs." Shouldn't this maybe make it into the release notes or something like that?

Comment 44 Niels de Vos 2018-12-20 14:02:02 UTC
The changes for this bug have been included in the rhgs-server-rhel7:3.11.0-2 image.

Deploying OCS on an environment with CRI-O should work now. Previously, creating the heketidbstorage volume failed because the glusterfs-server pods could not create the LVM logical volumes for the bricks.
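
A hedged way to confirm which image the running pods actually use (namespace and label selector are assumptions; adjust to the actual deployment):

~~~
# Sketch: list the image used by each glusterfs pod to confirm it is
# rhgs-server-rhel7:3.11.0-2 or newer. Namespace and label are assumptions.
oc get pods -n glusterfs -l glusterfs=storage-pod \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
~~~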

Comment 46 Klaas Demter 2018-12-29 12:34:14 UTC
Does this also mean crio + ocs is supported, or does it mean "it works but using it is at your own risk"?

Comment 47 Niels de Vos 2018-12-29 13:03:11 UTC
(In reply to Klaas Demter from comment #46)
> Does this also mean crio + ocs is supported, or does it mean "it works but
> using it is at your own risk"?

It is currently not supported (nor completely functional). We're working on having it functional first. When the product supports it, it will be mentioned in the announcement.

Comment 49 Sri Vignesh Selvan 2019-01-09 06:42:36 UTC
Deployment with crio has passed; refer to Comment #48.

Moving this to Verified.

Comment 50 Sudarshan Chaudhari 2019-01-25 09:40:54 UTC
Hello, 

IHAC who is facing a similar issue to the one described in Bugzilla [1] https://bugzilla.redhat.com/show_bug.cgi?id=1634763. The setup is OCP 3.10, running on docker, not on CRI-O.

That Bugzilla is marked as a duplicate of [2] https://bugzilla.redhat.com/show_bug.cgi?id=1634454, which represents the similar issue for CRI-O and is itself marked as a duplicate of this bug.

The error message:
~~~
TASK [openshift_storage_glusterfs : Create heketi DB volume] *****************************************************************************************
Wednesday 23 January 2019  19:36:33 +0100 (0:00:12.858)       0:04:00.583 ***** 
fatal: [m1.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-UhoDOI/admin.kubeconfig", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-cdqrf", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "d1lg2npzY2yqxHzEs8JQBeVxPy1SZXqrv6hKtIpSoXY=", "setup-openshift-heketi-storage", "--image", "registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.10", "--listfile", "/tmp/heketi-storage.json"], "delta": "0:01:03.993401", "end": "2019-01-23 19:37:37.693236", "failed": true, "msg": "non-zero return code", "rc": 255, "start": "2019-01-23 19:36:33.699835", "stderr": "Error: WARNING: This metadata update is NOT backed up.\n  /dev/vg_3d08c35c8c2c30ae723cda26647854df/lvol0: not found: device not cleared\n  Aborting. Failed to wipe start of new LV.\ncommand terminated with exit code 255", "stderr_lines": ["Error: WARNING: This metadata update is NOT backed up.", "  /dev/vg_3d08c35c8c2c30ae723cda26647854df/lvol0: not found: device not cleared", "  Aborting. Failed to wipe start of new LV.", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}
~~~

OCP version from sos-report:
~~~
$ cat yum_list_installed | grep openshift
atomic-openshift.x86_64        3.10.89-1.git.0.00d2623.el7 @rhel-7-server-ose-3.10-rpms
atomic-openshift-clients.x86_64
atomic-openshift-docker-excluder.noarch
atomic-openshift-excluder.noarch
atomic-openshift-hyperkube.x86_64
atomic-openshift-node.x86_64   3.10.89-1.git.0.00d2623.el7 @rhel-7-server-ose-3.10-rpms
openshift-ansible.noarch       3.10.89-1.git.0.14ed1cb.el7 @rhel-7-server-ose-3.10-rpms
openshift-ansible-docs.noarch  3.10.89-1.git.0.14ed1cb.el7 @rhel-7-server-ose-3.10-rpms
openshift-ansible-playbooks.noarch
openshift-ansible-roles.noarch 3.10.89-1.git.0.14ed1cb.el7 @rhel-7-server-ose-3.10-rpms
$ cat yum_list_installed | grep docker
atomic-openshift-docker-excluder.noarch
docker.x86_64                  2:1.13.1-88.git07f3374.el7  @rhel-7-server-extras-rpms
docker-client.x86_64           2:1.13.1-88.git07f3374.el7  @rhel-7-server-extras-rpms
docker-common.x86_64           2:1.13.1-88.git07f3374.el7  @rhel-7-server-extras-rpms
~~~

Adding the complete ansible logs to the BZ. Can any of you check whether the issue is fixed, and whether this issue affects both docker and CRI-O in the same way?


Thanks in advance

Comment 56 s.tanke 2019-01-31 08:06:59 UTC
Hi guys,

having the same issue with OCP 3.10 and docker!

Help is appreciated.

Best regards,
Sascha

Comment 57 s.tanke 2019-01-31 08:23:35 UTC
Hi,

Found this: https://github.com/heketi/heketi/issues/810
It seems to solve the issue. 

Applying the following patch to /etc/lvm/lvm.conf worked for OCP 3.10 with docker:
sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules = 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf

So this seems to be an image issue.

Could RedHat change the image tagging for rhgs-server-rhel7:v3.10 to point to a working/fixed one?

Thanks in advance.

best regards,
Sascha

Comment 58 s.tanke 2019-01-31 08:31:04 UTC
The patch for /etc/lvm/lvm.conf can be shortened:
sed -i.save -e "s#udev_rules = 1#udev_rules = 0#" /etc/lvm/lvm.conf

Comment 59 Michael Adam 2019-01-31 12:46:20 UTC
(In reply to s.tanke from comment #57)
> Hi,
> 
> Found this: https://github.com/heketi/heketi/issues/810
> It seems to solve the issue. 
> 
> Applying the following patch to /etc/lvm/lvm.conf worked for OCP 3.10 with
> docker:
> sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules =
> 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf
> 
> So this seems to be an image issue.
> 
> Could RedHat change the image tagging for rhgs-server-rhel7:v3.10 to point
> to a working/fixed one?

Thanks for your comment!

We have fixed various issues in the 3.11 series.

We can check whether we can backport a fix and ship an update to the 3.10 images (which we usually don't do once the next version is out). 

Does the patch fix the issue for you entirely?

Thanks - Michael

Comment 60 s.tanke 2019-01-31 13:41:14 UTC
At least running:

sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules = 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf

on the glusterfs-storage-* pods and then running the playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml worked. Afterwards we restarted the advanced installation via ansible.
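
A hedged sketch of that workaround as a single loop (namespace, pod name pattern, and inventory path are assumptions; adjust to the actual cluster):

~~~
# Sketch: apply the lvm.conf workaround inside every glusterfs-storage pod,
# then re-run the openshift-glusterfs playbook. Names below are assumptions.
for pod in $(oc get pods -n glusterfs -o name | grep glusterfs-storage); do
  oc rsh -n glusterfs "$pod" -- sed -i.save \
    -e "s#udev_sync = 1#udev_sync = 0#" \
    -e "s#udev_rules = 1#udev_rules = 0#" \
    -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf
done

# Re-run the GlusterFS config playbook (inventory path is an example):
ansible-playbook -i /path/to/inventory \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml
~~~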

Comment 64 errata-xmlrpc 2019-02-07 04:12:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0287

