Bug 1627104 - can't deploy gluster with crio because LVM commands fail
Summary: can't deploy gluster with crio because LVM commands fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhgs-server-container
Version: cns-3.10
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: OCS 3.11.1
Assignee: Saravanakumar
QA Contact: Rachael
URL:
Whiteboard:
Duplicates: 1634454
Depends On: 1536511 1651270
Blocks: OCS-3.11.1-Engineering-Proposed-BZs OCS-3.11.1-devel-triage-done 1642792 1644154
 
Reported: 2018-09-10 13:00 UTC by Karim Boumedhel
Modified: 2022-03-13 15:32 UTC
CC List: 26 users

Fixed In Version: rhgs-server-rhel7:3.11.0-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-07 04:12:47 UTC
Embargoed:


Attachments
inventory (3.57 KB, text/plain)
2018-09-10 13:00 UTC, Karim Boumedhel


Links
Github gluster/gluster-containers pull 104: Do not run udev and lvmetad inside the container (closed, last updated 2021-01-14 08:25:38 UTC)
Red Hat Bugzilla 1536511: Gluster pod with 850 volumes fails to come up after node reboot (CLOSED, last updated 2021-02-22 00:41:40 UTC)
Red Hat Bugzilla 1589277: private bug, no public summary (last updated 2021-09-09 14:29:39 UTC)
Red Hat Bugzilla 1623433: Brick fails to come online after shutting down and restarting a node (CLOSED, last updated 2021-02-22 00:41:40 UTC)
Red Hat Bugzilla 1651270: Cant deploy gluster with crio on OCP-3.11.43 (CLOSED, last updated 2022-03-13 16:07:36 UTC)
Red Hat Knowledge Base (Solution) 3774451 (last updated 2018-12-29 11:09:14 UTC)
Red Hat Product Errata RHEA-2019:0287 (last updated 2019-02-07 04:13:08 UTC)

Internal Links: 1536511 1589277 1623433 1651270

Description Karim Boumedhel 2018-09-10 13:00:11 UTC
Description of problem:
When I deploy container-native storage with OpenShift and crio set to True, the Gluster deployment fails when creating logical volumes within the Gluster pods.
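
For context, a minimal sketch of the inventory settings involved, assuming the standard openshift-ansible variable names (the hostname and device path below are examples, not taken from the attached inventory):

~~~
# Hypothetical inventory excerpt: enable CRI-O and define a GlusterFS node.
# Hostname and device path are placeholders.
[OSEv3:vars]
openshift_use_crio=True

[glusterfs]
node1.example.com glusterfs_devices='[ "/dev/vdb" ]'
~~~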


Version-Release number of selected component (if applicable):


How reproducible:
Deploy OpenShift with openshift-ansible, with Gluster and CRI-O enabled.


Steps to Reproduce:
1. Launch the openshift-ansible playbook
2.
3.

Actual results:


Expected results:


Additional info:
A patch can be included to overcome the issue by disabling udev.
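
As background, the udev-related options the later comments toggle all live in /etc/lvm/lvm.conf inside the Gluster pods. A minimal sketch of those settings, assuming the standard lvm.conf section layout (the values shown are the workaround, not the image defaults):

~~~
# Sketch of /etc/lvm/lvm.conf with udev interaction disabled for the container.
activation {
    udev_sync = 0    # do not wait for udev to settle device nodes
    udev_rules = 0   # let LVM manage /dev nodes itself instead of udev
}
global {
    use_lvmetad = 0  # do not rely on the lvmetad daemon, which is not running in the pod
}
~~~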

Comment 2 Karim Boumedhel 2018-09-10 13:00:50 UTC
Created attachment 1482120 [details]
inventory

Comment 3 Niels de Vos 2018-09-13 11:37:49 UTC
This problem has been found while deploying CNS-3.10 on OCP-3.11 (which uses CRI-O by default). Using docker as the container runtime makes it work again.

Alternatives are to disable udev_rules in /etc/lvm/lvm.conf in the glusterfs pods, or to set the environment variable DM_DISABLE_UDEV to "1" in the glusterfs daemonset.

Even though disabling udev rules in the glusterfs pods is the preferred approach, problems have been reported when this is done (bz#1536511). This needs more investigation into what problems it causes (and whether that is still the case).
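
For the daemonset alternative, a hedged sketch of how the environment variable could be set (the daemonset name glusterfs-storage and the glusterfs namespace are assumptions and vary per deployment):

~~~
# Sketch only: set DM_DISABLE_UDEV=1 on the glusterfs DaemonSet so device-mapper
# skips udev synchronization. DaemonSet name and namespace are assumptions.
oc set env daemonset/glusterfs-storage DM_DISABLE_UDEV=1 -n glusterfs

# Verify the variable is present in the pod template:
oc set env daemonset/glusterfs-storage --list -n glusterfs
~~~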

Comment 7 Jose A. Rivera 2018-10-01 19:48:18 UTC
*** Bug 1634454 has been marked as a duplicate of this bug. ***

Comment 12 Klaas Demter 2018-10-17 15:48:16 UTC
Are there workaround instructions for OpenShift Enterprise 3.11?

Comment 14 Klaas Demter 2018-10-17 16:07:02 UTC
I don't think that's a public registry :) I would really like for this to hit the customer-facing OpenShift registry; this kinda kills my only OpenShift instance :)

Comment 17 Klaas Demter 2018-10-24 06:37:58 UTC
So support informed me: "As we heard as of now crio is not supported with ocs 3.11, As engineering team is already working on the raised bugzilla, we believe it would come in later version of ocs." Shouldn't this maybe make it into the release notes or something like that?

Comment 44 Niels de Vos 2018-12-20 14:02:02 UTC
The changes for this bug have been included in the rhgs-server-rhel7:3.11.0-2 image.

Deploying OCS on an environment with CRI-O should work now. Previously, creating the heketidbstorage volume failed because the glusterfs-server pods could not create the LVM logical volumes for the bricks.
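
A hedged way to confirm which image the running pods actually use (namespace and label selector are assumptions; adjust to the actual deployment):

~~~
# Sketch: list the image used by each glusterfs pod to confirm it is
# rhgs-server-rhel7:3.11.0-2 or newer. Namespace and label are assumptions.
oc get pods -n glusterfs -l glusterfs=storage-pod \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
~~~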

Comment 46 Klaas Demter 2018-12-29 12:34:14 UTC
Does this also mean crio + ocs is supported, or does it mean "it works but using it is at your own risk"?

Comment 47 Niels de Vos 2018-12-29 13:03:11 UTC
(In reply to Klaas Demter from comment #46)
> Does this also mean crio + ocs is supported, or does it mean "it works but
> using it is at your own risk"?

It is currently not supported (nor completely functional). We're working on having it functional first. When the product supports it, it will be mentioned in the announcement.

Comment 49 Sri Vignesh Selvan 2019-01-09 06:42:36 UTC
Deployment with crio has passed; refer to Comment #48.

Moving this to Verified.

Comment 50 Sudarshan Chaudhari 2019-01-25 09:40:54 UTC
Hello, 

IHAC who is facing a similar issue to the one described in Bugzilla [1] https://bugzilla.redhat.com/show_bug.cgi?id=1634763. The setup is OCP 3.10, running on docker, not on CRI-O.

That Bugzilla is marked as a duplicate of [2] https://bugzilla.redhat.com/show_bug.cgi?id=1634454, which represents the similar issue for CRI-O and is itself marked as a duplicate of this bug.

The error message:
~~~
TASK [openshift_storage_glusterfs : Create heketi DB volume] *****************************************************************************************
Wednesday 23 January 2019  19:36:33 +0100 (0:00:12.858)       0:04:00.583 ***** 
fatal: [m1.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-UhoDOI/admin.kubeconfig", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-cdqrf", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "d1lg2npzY2yqxHzEs8JQBeVxPy1SZXqrv6hKtIpSoXY=", "setup-openshift-heketi-storage", "--image", "registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.10", "--listfile", "/tmp/heketi-storage.json"], "delta": "0:01:03.993401", "end": "2019-01-23 19:37:37.693236", "failed": true, "msg": "non-zero return code", "rc": 255, "start": "2019-01-23 19:36:33.699835", "stderr": "Error: WARNING: This metadata update is NOT backed up.\n  /dev/vg_3d08c35c8c2c30ae723cda26647854df/lvol0: not found: device not cleared\n  Aborting. Failed to wipe start of new LV.\ncommand terminated with exit code 255", "stderr_lines": ["Error: WARNING: This metadata update is NOT backed up.", "  /dev/vg_3d08c35c8c2c30ae723cda26647854df/lvol0: not found: device not cleared", "  Aborting. Failed to wipe start of new LV.", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}
~~~

OCP version from sos-report:
~~~
$ cat yum_list_installed | grep openshift
atomic-openshift.x86_64        3.10.89-1.git.0.00d2623.el7 @rhel-7-server-ose-3.10-rpms
atomic-openshift-clients.x86_64
atomic-openshift-docker-excluder.noarch
atomic-openshift-excluder.noarch
atomic-openshift-hyperkube.x86_64
atomic-openshift-node.x86_64   3.10.89-1.git.0.00d2623.el7 @rhel-7-server-ose-3.10-rpms
openshift-ansible.noarch       3.10.89-1.git.0.14ed1cb.el7 @rhel-7-server-ose-3.10-rpms
openshift-ansible-docs.noarch  3.10.89-1.git.0.14ed1cb.el7 @rhel-7-server-ose-3.10-rpms
openshift-ansible-playbooks.noarch
openshift-ansible-roles.noarch 3.10.89-1.git.0.14ed1cb.el7 @rhel-7-server-ose-3.10-rpms
$ cat yum_list_installed | grep docker
atomic-openshift-docker-excluder.noarch
docker.x86_64                  2:1.13.1-88.git07f3374.el7  @rhel-7-server-extras-rpms
docker-client.x86_64           2:1.13.1-88.git07f3374.el7  @rhel-7-server-extras-rpms
docker-common.x86_64           2:1.13.1-88.git07f3374.el7  @rhel-7-server-extras-rpms
~~~

Adding the complete ansible logs to the BZ. Can any of you check whether the issue is fixed, and whether this issue affects both docker and CRI-O in the same way?


Thanks in advance

Comment 56 s.tanke 2019-01-31 08:06:59 UTC
Hi guys,

having the same issue with OCP 3.10 and docker!

Help is appreciated.

Best regards,
Sascha

Comment 57 s.tanke 2019-01-31 08:23:35 UTC
Hi,

Found this: https://github.com/heketi/heketi/issues/810
It seems to solve the issue. 

Applying the following patch to /etc/lvm/lvm.conf worked for OCP 3.10 with docker:
sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules = 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf

So this seems to be an image issue.

Could RedHat change the image tagging for rhgs-server-rhel7:v3.10 to point to a working/fixed one?

Thanks in advance.

best regards,
Sascha

Comment 58 s.tanke 2019-01-31 08:31:04 UTC
The patch for /etc/lvm/lvm.conf can be shortened:
sed -i.save -e "s#udev_rules = 1#udev_rules = 0#" /etc/lvm/lvm.conf

Comment 59 Michael Adam 2019-01-31 12:46:20 UTC
(In reply to s.tanke from comment #57)
> Hi,
> 
> Found this: https://github.com/heketi/heketi/issues/810
> It seems to solve the issue. 
> 
> Applying the following patch to /etc/lvm/lvm.conf worked for OCP 3.10 with
> docker:
> sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules =
> 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf
> 
> So this seems to be an image issue.
> 
> Could RedHat change the image tagging for rhgs-server-rhel7:v3.10 to point
> to a working/fixed one?

Thanks for your comment!

We have fixed various issues in the 3.11 series.

We can check whether we can backport a fix and ship an update to the 3.10 images (which we usually don't do once the next version is out). 

Does the patch fix the issue for you entirely?

Thanks - Michael

Comment 60 s.tanke 2019-01-31 13:41:14 UTC
At least running:

sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules = 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf

on the glusterfs-storage-* pods and then running the playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml worked. Afterwards we restarted the advanced installation via ansible.
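
A hedged sketch of that workaround as a single loop (namespace, pod name pattern, and inventory path are assumptions; adjust to the actual cluster):

~~~
# Sketch: apply the lvm.conf workaround inside every glusterfs-storage pod,
# then re-run the openshift-glusterfs playbook. Names below are assumptions.
for pod in $(oc get pods -n glusterfs -o name | grep glusterfs-storage); do
  oc rsh -n glusterfs "$pod" -- sed -i.save \
    -e "s#udev_sync = 1#udev_sync = 0#" \
    -e "s#udev_rules = 1#udev_rules = 0#" \
    -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf
done

# Re-run the GlusterFS config playbook (inventory path is an example):
ansible-playbook -i /path/to/inventory \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml
~~~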

Comment 64 errata-xmlrpc 2019-02-07 04:12:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0287

