Description of problem:
When I deploy container-native storage with OpenShift and CRI-O set to True, the gluster deployment fails on creating logical volumes within the gluster pods.

Version-Release number of selected component (if applicable):

How reproducible:
Deploy OpenShift with openshift-ansible, with gluster and CRI-O enabled.

Steps to Reproduce:
1. Launch the openshift-ansible playbook
2.
3.

Actual results:

Expected results:

Additional info:
A patch can be included to overcome the issue, by disabling udev.
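For anyone trying to reproduce this, a rough sketch of step 1 follows. The playbook path and the inventory variable names are assumptions based on a default openshift-ansible install, not taken from the attached inventory:

~~~
# Inventory is expected to enable CRI-O and GlusterFS, roughly:
#   [OSEv3:vars]
#   openshift_use_crio=True
#   [glusterfs]
#   node1.example.com glusterfs_devices='["/dev/sdb"]'
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml
~~~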
Created attachment 1482120 [details] inventory
This problem has been found while deploying CNS-3.10 on OCP-3.11 (which uses CRI-O by default). Using docker as the container runtime makes it work again. Alternatives are to disable udev_rules in /etc/lvm/lvm.conf in the glusterfs pods, or to set the environment variable DM_DISABLE_UDEV to "1" in the glusterfs daemonset. Even though disabling udev rules in the glusterfs pods is the preferred approach, problems have been reported when this is done (bz#1536511). More investigation is needed into what problems this causes (and whether that is still the case).
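A minimal sketch of the second workaround, assuming the GlusterFS daemonset is named glusterfs-storage and lives in the glusterfs namespace (names vary per install):

~~~
# Inject DM_DISABLE_UDEV=1 into the glusterfs daemonset; the pods pick up
# the variable once they are recreated, so LVM stops waiting on udev.
oc -n glusterfs set env daemonset/glusterfs-storage DM_DISABLE_UDEV=1
~~~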
*** Bug 1634454 has been marked as a duplicate of this bug. ***
Are there workaround instructions for OpenShift Enterprise 3.11?
I don't think that's a public registry :) I would really like for this to hit the customer-facing openshift registry, this kinda kills my only openshift instance :)
So the support informed me "As we heard as of now crio is not supported with ocs 3.11, As engineering team is already working on the raised bugzilla, we believe it would come in later version of ocs." Shouldn't this maybe make it into release notes or something like that?
The changes for this bug have been included in the rhgs-server-rhel7:3.11.0-2 image. Deploying OCS on an environment with CRI-O should work now. Previously, creating the heketidbstorage volume failed because the glusterfs-server pods could not create the LVM LVs for the bricks.
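If you want to check which image your cluster is actually running, a sketch (daemonset name and namespace are assumptions, adjust to your install):

~~~
# Show the rhgs-server image used by the running GlusterFS daemonset
oc -n glusterfs get daemonset glusterfs-storage \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
~~~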
does this also mean crio + ocs is supported or does it mean "it works but using it is your own risk"?
(In reply to Klaas Demter from comment #46)
> does this also mean crio + ocs is supported or does it mean "it works but
> using it is your own risk"?

It is currently not supported (nor completely functional). We're working on having it functional first. When the product supports it, it will be mentioned in the announcement.
Deployment with CRI-O has passed; refer to comment #48. Moving this to VERIFIED.
Hello,

IHAC who is facing a similar issue, as reported in Bugzilla [1] https://bugzilla.redhat.com/show_bug.cgi?id=1634763. The setup is OCP 3.10 running on docker, not on CRI-O. Bugzilla [1] is marked as a duplicate of [2] https://bugzilla.redhat.com/show_bug.cgi?id=1634454, which covers the same issue for CRI-O and which in turn is marked as a duplicate of this bug.

The error message:
~~~
TASK [openshift_storage_glusterfs : Create heketi DB volume] *****************************************************************************************
Wednesday 23 January 2019  19:36:33 +0100 (0:00:12.858)       0:04:00.583 *****
fatal: [m1.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-UhoDOI/admin.kubeconfig", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-cdqrf", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "d1lg2npzY2yqxHzEs8JQBeVxPy1SZXqrv6hKtIpSoXY=", "setup-openshift-heketi-storage", "--image", "registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.10", "--listfile", "/tmp/heketi-storage.json"], "delta": "0:01:03.993401", "end": "2019-01-23 19:37:37.693236", "failed": true, "msg": "non-zero return code", "rc": 255, "start": "2019-01-23 19:36:33.699835", "stderr": "Error: WARNING: This metadata update is NOT backed up.\n /dev/vg_3d08c35c8c2c30ae723cda26647854df/lvol0: not found: device not cleared\n Aborting. Failed to wipe start of new LV.\ncommand terminated with exit code 255", "stderr_lines": ["Error: WARNING: This metadata update is NOT backed up.", " /dev/vg_3d08c35c8c2c30ae723cda26647854df/lvol0: not found: device not cleared", " Aborting. Failed to wipe start of new LV.", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}
~~~

OCP version from sos-report:
~~~
$ cat yum_list_installed | grep openshift
atomic-openshift.x86_64                    3.10.89-1.git.0.00d2623.el7    @rhel-7-server-ose-3.10-rpms
atomic-openshift-clients.x86_64
atomic-openshift-docker-excluder.noarch
atomic-openshift-excluder.noarch
atomic-openshift-hyperkube.x86_64
atomic-openshift-node.x86_64               3.10.89-1.git.0.00d2623.el7    @rhel-7-server-ose-3.10-rpms
openshift-ansible.noarch                   3.10.89-1.git.0.14ed1cb.el7    @rhel-7-server-ose-3.10-rpms
openshift-ansible-docs.noarch              3.10.89-1.git.0.14ed1cb.el7    @rhel-7-server-ose-3.10-rpms
openshift-ansible-playbooks.noarch
openshift-ansible-roles.noarch             3.10.89-1.git.0.14ed1cb.el7    @rhel-7-server-ose-3.10-rpms

$ cat yum_list_installed | grep docker
atomic-openshift-docker-excluder.noarch
docker.x86_64                              2:1.13.1-88.git07f3374.el7     @rhel-7-server-extras-rpms
docker-client.x86_64                       2:1.13.1-88.git07f3374.el7     @rhel-7-server-extras-rpms
docker-common.x86_64                       2:1.13.1-88.git07f3374.el7     @rhel-7-server-extras-rpms
~~~

Adding the complete ansible logs to the BZ.

Can anyone of you check whether the issue is fixed, and whether this issue affects both docker and CRI-O?

Thanks in advance
Hi guys, having the same issue with OCP 3.10 and docker! Help is appreciated. Best regards, Sascha
Hi,

Found this: https://github.com/heketi/heketi/issues/810
It seems to solve the issue.

Applying the following patch to /etc/lvm/lvm.conf worked for OCP 3.10 with docker:

sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules = 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf

So this seems to be an image issue.

Could Red Hat change the image tagging for rhgs-server-rhel7:v3.10 to point to a working/fixed one?

Thanks in advance.

Best regards,
Sascha
The patch for /etc/lvm/lvm.conf can be shortened:

sed -i.save -e "s#udev_rules = 1#udev_rules = 0#" /etc/lvm/lvm.conf
(In reply to s.tanke from comment #57)
> Hi,
>
> Found this: https://github.com/heketi/heketi/issues/810
> It seems to solve the issue.
>
> Applying the following patch to /etc/lvm/lvm.conf worked for OCP 3.10 with
> docker:
> sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules =
> 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf
>
> So this seems to be a image issue.
>
> Could RedHat change the image tagging for rhgs-server-rhel7:v3.10 to point
> to a working/fixed one?

Thanks for your comment!

We have fixed various issues in the 3.11 series. We can check whether we can backport a fix and ship an update to the 3.10 images (which we usually don't do once the next version is out).

Does the patch fix the issue for you entirely?

Thanks - Michael
At least, running

sed -i.save -e "s#udev_sync = 1#udev_sync = 0#" -e "s#udev_rules = 1#udev_rules = 0#" -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf

inside the glusterfs-storage-* pods and then running the playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml worked. Afterwards we restarted the advanced installation via ansible.
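For others hitting this, a sketch of applying the same change to all GlusterFS pods at once; the label selector glusterfs=storage-pod is an assumption, check `oc get pods --show-labels` for your install:

~~~
for pod in $(oc -n glusterfs get pods -l glusterfs=storage-pod -o name); do
  # Disable udev sync/rules and lvmetad in the running container's lvm.conf
  oc -n glusterfs exec "$pod" -- sed -i.save \
    -e "s#udev_sync = 1#udev_sync = 0#" \
    -e "s#udev_rules = 1#udev_rules = 0#" \
    -e "s#use_lvmetad = 1#use_lvmetad = 0#" /etc/lvm/lvm.conf
done
~~~

Note that this edit only lives inside the running containers, so it has to be reapplied if the pods are recreated.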
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0287