Bug 1651270 - Can't deploy gluster with crio on OCP-3.11.43
Summary: Can't deploy gluster with crio on OCP-3.11.43
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: cns-ansible
Version: ocs-3.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: John Mulligan
QA Contact: Prasanth
URL:
Whiteboard:
Depends On: 1707789
Blocks: 1627104 1656897
 
Reported: 2018-11-19 15:04 UTC by Nelly Credi
Modified: 2019-12-04 06:42 UTC (History)
22 users

Fixed In Version: openshift-ansible-3.11.52-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-12 18:02:22 UTC
Target Upstream Version:


Attachments
rhel7 pods with combinations of privileged option and /dev bind-mount (835 bytes, text/plain)
2018-11-19 19:10 UTC, Niels de Vos


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1627104 urgent CLOSED Can't deploy gluster with crio because LVM commands fail 2019-12-03 08:42:44 UTC
Red Hat Knowledge Base (Solution) 3774451 None None None 2018-12-25 13:53:22 UTC

Internal Links: 1627104

Description Nelly Credi 2018-11-19 15:04:19 UTC
Description of problem:
Deployment of OCS with CRI-O fails.
This is blocking CNV.

Version-Release number of selected component (if applicable):
OCP-3.11.43
openshift_storage_glusterfs_image: rhgs-server-rhel7:v3.11.0
openshift_storage_glusterfs_heketi_image: rhgs-volmanager-rhel7:v3.11.0
openshift_storage_glusterfs_block_image: rhgs-gluster-block-prov-rhel7:v3.11.0
openshift_storage_glusterfs_s3_image: rhgs-s3-server-rhel7:v3.11.0

How reproducible:
100%

Steps to Reproduce:
1. Execute the openshift-ansible playbook with OCS & CRI-O enabled

Actual results:

 Warning  Unhealthy  1m (x346 over 1h)  kubelet, cnv-executor-lbednar-node2.example.com  (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = command error: time="2018-11-15T14:27:20Z" level=error msg="open /dev/null: permission denied
"
open /dev/null: permission denied
exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
, stdout: , stderr: , exit code -1

Expected results:
Deployment should work


Additional info:

Niels provided us with a workaround (a change in the
openshift-ansible glusterfs deployment), which we are trying now,

and gave the following info:
- glusterfs containers are running privileged
- it seems this (now?) automatically provides a /dev bind-mount
- the explicit /dev bind-mount in the deployment prevents the container from starting
- removing the (now unneeded) /dev bind-mount from the deployment makes
  it work

We'll need to investigate whether this new automatic /dev bind-mount is
intended, or whether it has been present for longer.
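For illustration, the workaround amounts to something like the following change in the glusterfs pod template. This is only a sketch, not the literal openshift-ansible template; the volume name "glusterfs-dev" and the surrounding spec are assumptions:

```yaml
# Sketch only -- not the exact openshift-ansible template.
# The container stays privileged; the explicit /dev hostPath bind-mount
# is dropped, since a privileged container already sees the host's /dev.
spec:
  containers:
  - name: glusterfs
    image: rhgs3/rhgs-server-rhel7:v3.11.0
    securityContext:
      privileged: true
    volumeMounts:
    # other mounts (config, logs, ...) stay unchanged; only the /dev
    # entry below is removed by the workaround:
    # - name: glusterfs-dev
    #   mountPath: /dev
    - name: glusterfs-run
      mountPath: /run
  volumes:
  # - name: glusterfs-dev       # removed together with its volumeMount
  #   hostPath:
  #     path: /dev
  - name: glusterfs-run
    emptyDir: {}
```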

Comment 2 Yaniv Kaul 2018-11-19 15:55:10 UTC
Dup of bug 1627104 ?

Comment 3 Sudhir 2018-11-19 16:56:18 UTC
Is this change in recent OCP Ansible installer post verification of bug 1627104 (https://bugzilla.redhat.com/show_bug.cgi?id=1627104) at OCP 3.11 + OCS 3.11?

Comment 4 Niels de Vos 2018-11-19 18:28:31 UTC
(In reply to Yaniv Kaul from comment #2)
> Dup of bug 1627104 ?

No.

(In reply to Sudhir from comment #3)
> Is this change in recent OCP Ansible installer post verification of bug
> 1627104 (https://bugzilla.redhat.com/show_bug.cgi?id=1627104) at OCP 3.11 +
> OCS 3.11?

This was not a problem with OCP-3.11.16, but was found with OCP-3.11.43.

Comment 7 Niels de Vos 2018-11-19 19:00:46 UTC
It seems unnecessary to explicitly bind-mount /dev when a container is running in privileged mode. In the few environments where I have been testing, /dev gets automatically bind-mounted in that case.

Documentation is not very clear though.

From https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/#securitycontext-v1-core :

  privileged

  Run container in privileged mode. Processes in privileged containers are 
  essentially equivalent to root on the host. Defaults to false.


From 'man docker-run':

       --privileged=true|false
          Give extended privileges to this container. The default is false.

       By  default,  Docker containers are “unprivileged” (=false) and cannot, 
       for example, run a Docker daemon inside the Docker container. This is 
       because by default a container is not allowed to access any devices. A 
       “privileged” container is given access to all devices.

       When the operator executes docker run --privileged, Docker will enable 
       access to all devices on the host as well as set some configuration in 
       AppArmor to allow the container nearly all the same access to the host as 
       processes running outside of a container on the host.


From https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privileged :

  Privileged - determines if any container in a pod can enable privileged mode. 
  By default a container is not allowed to access any devices on the host, but a 
  “privileged” container is given access to all devices on the host. This allows 
  the container nearly all the same access as processes running on the host. 
  This is useful for containers that want to use linux capabilities like 
  manipulating the network stack and accessing devices.

Comment 8 Niels de Vos 2018-11-19 19:10:29 UTC
Created attachment 1507319 [details]
rhel7 pods with combinations of privileged option and /dev bind-mount

This attached .yaml file has three rhel7 pods:

1. plain pod, not privileged, no /dev bind-mount
2. privileged pod, no /dev bind-mount - but /dev will be populated!
3. privileged pod, with /dev bind-mount - fails on OCP-3.11.43

With OCP-3.11.43 the 3rd pod fails to start. This is how the ocs/rhgs-server:v3.11.0 image is configured through openshift-ansible.

The workaround given to CNV QE removes the /dev bindmount (as mentioned in comment #0). This is a minor change that we should be able to include in the next openshift-ansible update.
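Since the attachment itself is not inlined here, the three combinations can be sketched as plain pods. Pod names and the image reference are illustrative, not the exact attachment contents:

```yaml
# Hedged reconstruction of the three test pods described above.
apiVersion: v1
kind: Pod
metadata:
  name: rhel7-plain           # 1. not privileged, no /dev bind-mount: starts fine
spec:
  containers:
  - name: shell
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: rhel7-privileged      # 2. privileged, no /dev bind-mount: /dev populated anyway
spec:
  containers:
  - name: shell
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "infinity"]
    securityContext:
      privileged: true
---
apiVersion: v1
kind: Pod
metadata:
  name: rhel7-privileged-dev  # 3. privileged + explicit /dev bind-mount: fails on OCP-3.11.43
spec:
  containers:
  - name: shell
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "infinity"]
    securityContext:
      privileged: true
    volumeMounts:
    - name: dev
      mountPath: /dev
  volumes:
  - name: dev
    hostPath:
      path: /dev
```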

Comment 10 Jose A. Rivera 2018-11-19 19:52:05 UTC
The only time I can remember having problems with /dev mounts was when running unprivileged. As long as we verify that there are no issues with the LVM commands when we don't explicitly bind-mount /dev, it should be fine.

Comment 11 Ilkka Tengvall 2018-11-21 20:18:28 UTC
I'm hit by this bug as well on my test cluster. I have an Ansible installer that installs OCP+CNS from scratch; it also creates the machines from scratch to begin with. I repeatedly get this, and can't get past it just by rebuilding everything from scratch. I could let you log in to debug, if you wish. In case you want to remote-debug it, please share your public key here and I will drop it onto the systems.


----------------------------
[cloud-user@ocp-master ~]$ rpm -qa atomic*
atomic-openshift-clients-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-openshift-docker-excluder-3.11.43-1.git.0.647ac05.el7.noarch
atomic-1.22.1-25.git5a342e3.el7.x86_64
atomic-openshift-excluder-3.11.43-1.git.0.647ac05.el7.noarch
atomic-openshift-hyperkube-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-openshift-node-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-registries-1.22.1-25.git5a342e3.el7.x86_64
atomic-openshift-3.11.43-1.git.0.647ac05.el7.x86_64

----------------------------


It gets stuck in this task:
----------------------------
TASK [openshift_storage_glusterfs : Wait for GlusterFS pods] 
----------------------------


Here are the logs of it:

oc get events on master:
-------------------------
open /dev/null: permission denied
exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
, stdout: , stderr: , exit code -1
ocs       0s        6m        34        glusterfs-ocs-5zs78.15693c224fe883c0   Pod       spec.containers{glusterfs}   Warning   Unhealthy   kubelet, ocp-master   (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = command error: time="2018-11-21T20:07:58Z" level=error msg="open /dev/null: permission denied
" 
-------------------------





[cloud-user@ocp-master ~]$ sudo journalctl -xe
-----------------------
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: open /dev/null: permission denied
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: , stdout: , stderr: , exit code -1
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: E1121 21:59:49.076173   24236 remote_runtime.go:332] ExecSync a79912124c84633e5486b8c12154e375b6edc060c304edcde2ba554d38f9d756 '/bin/bash -c if command -v
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: "
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: open /dev/null: permission denied
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: , stdout: , stderr: , exit code -1
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: E1121 21:59:49.107262   24236 remote_runtime.go:332] ExecSync a79912124c84633e5486b8c12154e375b6edc060c304edcde2ba554d38f9d756 '/bin/bash -c if command -v
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: "
----------------------------


and this is how device looks on master:

----------------------------
[cloud-user@ocp-master ~]$ ls -laZ /dev/null 
crw-rw-rw-. root root system_u:object_r:null_device_t:s0 /dev/null
----------------------------


versions:
----------------------------
[cloud-user@bastion ~]$ rpm -qa '*openshift*'
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch

[cloud-user@bastion ~]$ rpm -qa '*ansible*'
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
ansible-2.6.6-1.el7ae.noarch
openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch
----------------------------


Gluster options in hosts file:
----------------------------
#OCS
openshift_storage_glusterfs_namespace=ocs
openshift_storage_glusterfs_name=ocs
openshift_storage_glusterfs_wipe=True
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=true
openshift_storage_glusterfs_image=registry.access.redhat.com/rhgs3/rhgs-server-rhel7
openshift_storage_glusterfs_heketi_image=registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7
openshift_storage_glusterfs_block_deploy=True
openshift_storage_glusterfs_block_host_vol_create=true
openshift_storage_glusterfs_block_host_vol_size=50
openshift_storage_glusterfs_block_storageclass=true
----------------------------

Comment 12 Ilkka Tengvall 2018-11-21 20:39:17 UTC
Ah, sorry about the previous comment. I did not read carefully; this was a duplicate, and there was a link to a bugzilla with a suggested workaround.

Comment 15 Ilkka Tengvall 2018-11-26 19:53:02 UTC
Close but no cigar. With the ansible playbook tweak I get the installation further, but the glusterfs-ocs-nn pods do not finish starting. This is what the status check fails with (oc rsh glusterfs-ocs-nn):

---------------
sh-4.2#  /usr/local/bin/status-probe.sh  readiness                                                      
failed check: systemctl -q is-active gluster-blockd.service
sh-4.2# systemctl is-active gluster-blockd
failed
---------------

None of the three pods reaches fully Ready status:

---------------
[cloud-user@ocp-master ~]$ oc get pods -o wide
NAME                  READY     STATUS    RESTARTS   AGE       IP               NODE        NOMINATED NODE
glusterfs-ocs-dkt6g   0/1       Running   2          1h        192.168.122.30   ocp-apps1   <none>
glusterfs-ocs-szwdc   0/1       Running   2          1h        192.168.122.40   ocp-apps2   <none>
glusterfs-ocs-v9q8h   0/1       Running   2          1h        192.168.122.20   ocp-infra   <none>
---------------

Comment 17 Sudhir 2018-11-27 16:56:56 UTC
We need this for the OCP 3.11 errata for sure. OCS 3.11.1 will be released on December 11th, so whichever aligns will help.

Comment 19 Gal Ben Haim 2018-11-29 05:35:36 UTC
I find the method of not mounting /dev problematic, since "/dev/disk/by-id/*" doesn't exist inside the gluster pods.

The "/dev/disk/by-id/*" paths are very useful for letting gluster know which block devices to use.

A workaround that worked for me was to bind mount only "/dev/disk/by-id".
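That narrower workaround can be expressed as a hostPath volume for just that subtree. A sketch under the assumption that it is added to the glusterfs container/pod spec; the volume name is illustrative:

```yaml
# Sketch: bind-mount only /dev/disk/by-id instead of all of /dev,
# so the stable device symlinks remain visible inside the pod.
# Added to the glusterfs container spec:
volumeMounts:
- name: dev-disk-by-id        # illustrative name
  mountPath: /dev/disk/by-id
# ... and to the pod spec:
volumes:
- name: dev-disk-by-id
  hostPath:
    path: /dev/disk/by-id
```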

Comment 20 Niels de Vos 2018-11-29 10:42:13 UTC
Hmm, that indeed looks problematic. Not many people use "/dev/disk/by-id/*" at the moment, but it is something that we want to support (and possibly even recommend) in the future.

I'll have a look at it and see if "/dev/disk" should be included.

Comment 21 Klaas Demter 2018-11-29 12:22:16 UTC
(In reply to Niels de Vos from comment #20)
> Hmm, that indeed looks problematic. Not many people use "/dev/disk/by-id/*"
> at the moment, but it is something that we want to support (and possibly
> even recommend) in the future.
> 
> I'll have a look at it and see if "/dev/disk" should be included.

As an OSE customer I tried to use them when I started with OpenShift, but heketi didn't support it. It was fixed by https://github.com/heketi/heketi/commit/73c0ef4b8d56183a6011430f679dbbb04b4a2ee0, which made it into 3.10/3.3. I have not yet tested it again though :D

Comment 22 Niels de Vos 2018-11-29 17:47:45 UTC
New PRs that add /dev/disk as a bind-mount:

- https://github.com/gluster/gluster-kubernetes/pull/542
- https://github.com/openshift/openshift-ansible/pull/10793

Reviews, /ok-to-test and similar much appreciated.

Comment 23 Niels de Vos 2018-12-03 13:37:36 UTC
More changes to allow /dev/mapper/* devices...

- https://github.com/openshift/openshift-ansible/pull/10806
- https://github.com/gluster/gluster-kubernetes/pull/544

Comment 24 Niels de Vos 2018-12-04 17:12:58 UTC
Scott, can openshift-ansible PR#10806 be included as well please?

Comment 25 Niels de Vos 2018-12-06 16:06:12 UTC
PR#10806 has been included in (upstream) openshift-ansible-3.11.52-1

Comment 29 Humble Chirammal 2018-12-17 06:54:06 UTC
(In reply to Niels de Vos from comment #25)
> PR#10806 has been included in (upstream) openshift-ansible-3.11.52-1


This version of openshift-ansible has been released and is available; however, I see this bugzilla in MODIFIED state. The changelog only contains 1 PR instead of 3.

Niels, are you sure all the fix PRs are available with v3.11.52-1 ? 

If yes, can you check why this is still in MODIFIED state?

Comment 30 Niels de Vos 2018-12-17 15:27:50 UTC
(In reply to Humble Chirammal from comment #29)
> Niels, are you sure all the fix PRs are available with v3.11.52-1 ? 

Yes.

> If yes, can you check why this is still in MODIFIED state?

That is probably something Jose or Scott can answer.

Comment 31 Jose A. Rivera 2018-12-17 20:13:07 UTC
This is an RHGS bug, not an OpenShift bug, so Scott shouldn't be expected to stay on top of it. Similarly, I have been doing no work on this, so I have not been keeping track of its state. If you or Niels feel this is ready, please move it accordingly.

Comment 33 Sudhir 2019-05-15 20:12:54 UTC
Talur,
We do support it as tech preview, we need to figure out what broke

Comment 34 Niels de Vos 2019-05-16 06:38:19 UTC
(In reply to Sudhir from comment #33)
> Talur,
> We do support it as tech preview, we need to figure out what broke

All changes have been included already; the problem should have been fixed for months now. This is mainly waiting on acceptance from OCS QE to validate deploying with CRI-O. But as CRI-O is a Technology Preview for the OCS product, this does not have a high priority and falls out of the planning every time :-/

Moving this to ON_QA as openshift-ansible-3.11.52-1 in combination with current OCS container images should not have this problem anymore.

Comment 39 Rachael 2019-05-21 11:48:42 UTC
Version tested on:

oc v3.11.82
kubernetes v1.11.0+d4cacc0
openshift-ansible-3.11.82-3.git.0.9718d0a.el7.noarch

cns-deploy-7.0.0-9.el7rhgs.x86_64

IMAGES:

rhgs3/rhgs-server-rhel7:v3.11.1
rhgs3/rhgs-volmanager-rhel7:v3.11.1
rhgs3/rhgs-gluster-block-prov-rhel7:v3.11.1

Comment 40 Raghavendra Talur 2019-09-12 18:02:22 UTC
This works with current release as seen in comment 39. Closing.

