Bug 1651270
| Summary: | Can't deploy gluster with CRI-O on OCP-3.11.43 | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nelly Credi <ncredi> | ||||
| Component: | cns-ansible | Assignee: | John Mulligan <jmulligan> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Prasanth <pprakash> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | ocs-3.11 | CC: | bkunal, fedoraproject, gbenhaim, hchiramm, ikke, jarrpa, klaas, knarra, kramdoss, lbednar, madam, mtaru, ncredi, ndevos, pdwyer, rgeorge, rhs-bugs, rtalur, sarumuga, sdodson, suprasad, tparsons | ||||
| Target Milestone: | --- | Keywords: | ZStream | ||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | openshift-ansible-3.11.52-1 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2019-09-12 18:02:22 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1707789 | ||||||
| Bug Blocks: | 1627104, 1656897 | ||||||
| Attachments: | rhel7 pods with combinations of privileged option and /dev bind-mount (attachment 1507319) | ||||||
Description
Nelly Credi
2018-11-19 15:04:19 UTC
Dup of bug 1627104 ?

Is this change in the recent OCP Ansible installer, post verification of bug 1627104 (https://bugzilla.redhat.com/show_bug.cgi?id=1627104) at OCP 3.11 + OCS 3.11?

(In reply to Yaniv Kaul from comment #2)
> Dup of bug 1627104 ?

No.

(In reply to Sudhir from comment #3)
> Is this change in recent OCP Ansible installer post verification of bug
> 1627104 (https://bugzilla.redhat.com/show_bug.cgi?id=1627104) at OCP 3.11 +
> OCS 3.11?

This was not a problem with OCP-3.11.16, but was found with OCP-3.11.43.

It seems unnecessary to explicitly bind-mount /dev when a container is running in privileged mode. In the few environments where I have been testing, the /dev directory gets bind-mounted automatically in that case. The documentation is not very clear, though.

From https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/#securitycontext-v1-core :

> privileged
> Run container in privileged mode. Processes in privileged containers are essentially equivalent to root on the host. Defaults to false.

From 'man docker-run':

> --privileged=true|false
> Give extended privileges to this container. The default is false.
> By default, Docker containers are "unprivileged" (=false) and cannot, for example, run a Docker daemon inside the Docker container. This is because by default a container is not allowed to access any devices. A "privileged" container is given access to all devices.
> When the operator executes docker run --privileged, Docker will enable access to all devices on the host as well as set some configuration in AppArmor to allow the container nearly all the same access to the host as processes running outside of a container on the host.

From https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privileged :

> Privileged - determines if any container in a pod can enable privileged mode. By default a container is not allowed to access any devices on the host, but a "privileged" container is given access to all devices on the host. This allows the container nearly all the same access as processes running on the host. This is useful for containers that want to use linux capabilities like manipulating the network stack and accessing devices.

Created attachment 1507319 [details]
rhel7 pods with combinations of privileged option and /dev bind-mount

The attached .yaml file has three rhel7 pods:
1. plain pod, not privileged, no /dev bind-mount
2. privileged pod, no /dev bind-mount - but /dev will be populated!
3. privileged pod, with /dev bind-mount - fails on OCP-3.11.43

With OCP-3.11.43 the 3rd pod fails to start. This is how the ocs/rhgs-server:v3.11.0 image is configured through openshift-ansible. The workaround given to CNV QE removes the /dev bind-mount (as mentioned in comment #0). This is a minor change that we should be able to include in the next openshift-ansible update.

The only time I can remember having problems with /dev mounts was when running unprivileged. As long as we verify that there are no issues with the LVM commands when we don't explicitly bind-mount /dev, it should be fine.

I'm hitting this bug as well on my test cluster. I have an ansible installer that installs OCP+CNS from scratch; the installer also creates the machines from scratch to begin with. I repeatedly get this, and can't get past it even by rebuilding everything from scratch. I could let you log in to debug, if you wish. In case you want to remote-debug it, please share your public key here and I will drop it onto the systems.
----------------------------
[cloud-user@ocp-master ~]$ rpm -qa atomic*
atomic-openshift-clients-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-openshift-docker-excluder-3.11.43-1.git.0.647ac05.el7.noarch
atomic-1.22.1-25.git5a342e3.el7.x86_64
atomic-openshift-excluder-3.11.43-1.git.0.647ac05.el7.noarch
atomic-openshift-hyperkube-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-openshift-node-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-registries-1.22.1-25.git5a342e3.el7.x86_64
atomic-openshift-3.11.43-1.git.0.647ac05.el7.x86_64
----------------------------
It gets stuck in this task:
----------------------------
TASK [openshift_storage_glusterfs : Wait for GlusterFS pods]
----------------------------
Here are the logs of it:
oc get events on master:
-------------------------
open /dev/null: permission denied
exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
, stdout: , stderr: , exit code -1
ocs 0s 6m 34 glusterfs-ocs-5zs78.15693c224fe883c0 Pod spec.containers{glusterfs} Warning Unhealthy kubelet, ocp-master (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = command error: time="2018-11-21T20:07:58Z" level=error msg="open /dev/null: permission denied
"
-------------------------
[cloud-user@ocp-master ~]$ sudo journalctl -xe
-----------------------
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: open /dev/null: permission denied
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: , stdout: , stderr: , exit code -1
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: E1121 21:59:49.076173 24236 remote_runtime.go:332] ExecSync a79912124c84633e5486b8c12154e375b6edc060c304edcde2ba554d38f9d756 '/bin/bash -c if command -v
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: "
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: open /dev/null: permission denied
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: , stdout: , stderr: , exit code -1
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: E1121 21:59:49.107262 24236 remote_runtime.go:332] ExecSync a79912124c84633e5486b8c12154e375b6edc060c304edcde2ba554d38f9d756 '/bin/bash -c if command -v
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: "
----------------------------
and this is how the device looks on the master:
----------------------------
[cloud-user@ocp-master ~]$ ls -laZ /dev/null
crw-rw-rw-. root root system_u:object_r:null_device_t:s0 /dev/null
----------------------------
versions:
----------------------------
[cloud-user@bastion ~]$ rpm -qa '*openshift*'
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch
[cloud-user@bastion ~]$ rpm -qa '*ansible*'
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
ansible-2.6.6-1.el7ae.noarch
openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch
----------------------------
Gluster options in hosts file:
----------------------------
#OCS
openshift_storage_glusterfs_namespace=ocs
openshift_storage_glusterfs_name=ocs
openshift_storage_glusterfs_wipe=True
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=true
openshift_storage_glusterfs_image=registry.access.redhat.com/rhgs3/rhgs-server-rhel7
openshift_storage_glusterfs_heketi_image=registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7
openshift_storage_glusterfs_block_deploy=True
openshift_storage_glusterfs_block_host_vol_create=true
openshift_storage_glusterfs_block_host_vol_size=50
openshift_storage_glusterfs_block_storageclass=true
----------------------------
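For context, the failing combination described in the attachment (pod 3: privileged plus an explicit /dev bind-mount) corresponds roughly to the pod spec below. This is a hand-written sketch, not the actual attachment or the openshift-ansible template; the pod and volume names are made up:

```yaml
# Sketch of the failing combination (pod 3 in attachment 1507319): a
# privileged container that also bind-mounts /dev from the host.
# All names here are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: privileged-dev-bindmount   # hypothetical name
spec:
  containers:
  - name: rhel7
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "infinity"]
    securityContext:
      privileged: true             # privileged mode already populates /dev
    volumeMounts:
    - name: dev
      mountPath: /dev              # redundant when privileged, and the part
                                   # that breaks startup on OCP-3.11.43/CRI-O
  volumes:
  - name: dev
    hostPath:
      path: /dev
```

Per the attachment, dropping the `dev` volume and its mount (pod 2) lets the privileged container start, since /dev is populated automatically in privileged mode.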
Ah, sorry about the previous comment. I did not read carefully; this was a duplicate, and there was a link to a bugzilla with a suggested workaround.

2 PRs are under review:
- https://github.com/gluster/gluster-kubernetes/pull/538
- https://github.com/openshift/openshift-ansible/pull/10768

Close but no cigar. With the ansible playbook tweak the installation gets further, but the glusterfs-ocs-nn pods do not finish starting. This is what the status check fails with (oc rsh glusterfs-ocs-nn):
---------------
sh-4.2# /usr/local/bin/status-probe.sh readiness
failed check: systemctl -q is-active gluster-blockd.service
sh-4.2# systemctl is-active gluster-blockd
failed
---------------
None of the three pods reach fully-up status:
---------------
[cloud-user@ocp-master ~]$ oc get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP               NODE        NOMINATED NODE
glusterfs-ocs-dkt6g   0/1     Running   2          1h    192.168.122.30   ocp-apps1   <none>
glusterfs-ocs-szwdc   0/1     Running   2          1h    192.168.122.40   ocp-apps2   <none>
glusterfs-ocs-v9q8h   0/1     Running   2          1h    192.168.122.20   ocp-infra   <none>
---------------

We need this for the OCP 3.11 errata for sure. OCS 3.11.1 will be on December 11th, so whichever aligns will help.

I find the method of not mounting /dev problematic, since "/dev/disk/by-id/*" doesn't exist inside the gluster pods. "/dev/disk/by-id/*" is very useful for letting gluster know which block devices to use. A workaround that worked for me was to bind-mount only "/dev/disk/by-id".

Hmm, that indeed looks problematic. Not many people use "/dev/disk/by-id/*" at the moment, but it is something that we want to support (and possibly even recommend) in the future. I'll have a look at it and see if "/dev/disk" should be included.

(In reply to Niels de Vos from comment #20)
> Hmm, that indeed looks problematic. Not many people use "/dev/disk/by-id/*"
> at the moment, but it is something that we want to support (and possibly
> even recommend) in the future.
>
> I'll have a look at it and see if "/dev/disk" should be included.
As an OSE customer I tried to use them when I started with OpenShift, but heketi didn't support it. It was fixed by https://github.com/heketi/heketi/commit/73c0ef4b8d56183a6011430f679dbbb04b4a2ee0, which made it into 3.10/3.3. I have not yet tested it again though :D

New PRs that add /dev/disk as a bind-mount:
- https://github.com/gluster/gluster-kubernetes/pull/542
- https://github.com/openshift/openshift-ansible/pull/10793
Reviews, /ok-to-test and similar much appreciated.

More changes to allow /dev/mapper/* devices...
- https://github.com/openshift/openshift-ansible/pull/10806
- https://github.com/gluster/gluster-kubernetes/pull/544

Scott, can openshift-ansible PR#10806 be included as well, please?

PR#10806 has been included in (upstream) openshift-ansible-3.11.52-1

(In reply to Niels de Vos from comment #25)
> PR#10806 has been included in (upstream) openshift-ansible-3.11.52-1

This version of ansible got released and is available, however I see this bugzilla in MODIFIED state, and the change log only contains 1 PR instead of 3. Niels, are you sure all the fix PRs are available with v3.11.52-1? If yes, can you check why this is still in MODIFIED state?

(In reply to Humble Chirammal from comment #29)
> Niels, are you sure all the fix PRs are available with v3.11.52-1 ?

Yes.

> If yes, can you check why this is still in MODIFIED state?

That is probably something Jose or Scott can answer.

This is an RHGS bug, not an OpenShift bug, so Scott shouldn't be expected to stay on top of it. Similarly, I have been doing no work on this, so I have not been keeping track of its state. If you or Niels feel this is ready, please move it accordingly.

Talur,
We do support it as tech preview, we need to figure out what broke

(In reply to Sudhir from comment #33)
> Talur,
> We do support it as tech preview, we need to figure out what broke

All changes have been included already; the problem should have been fixed for months now.
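The PRs above replace the blanket /dev bind-mount with narrower /dev/disk and /dev/mapper mounts. A rough sketch of the resulting volume configuration is shown below; the exact structure and the volume names are my assumption, not the actual openshift-ansible template:

```yaml
# Sketch: mount only /dev/disk (for the by-id symlinks) and /dev/mapper
# (for LVM device-mapper nodes) instead of the whole /dev tree.
# Excerpt of a container spec; volume names are illustrative.
    volumeMounts:
    - name: dev-disk
      mountPath: /dev/disk     # keeps /dev/disk/by-id/* usable for device selection
    - name: dev-mapper
      mountPath: /dev/mapper   # needed by the LVM tooling in the pod
  volumes:
  - name: dev-disk
    hostPath:
      path: /dev/disk
  - name: dev-mapper
    hostPath:
      path: /dev/mapper
```

Since the container stays privileged, the rest of /dev is still populated automatically; only these host directories need explicit bind-mounts.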
This is mainly waiting on acceptance from OCS QE to validate deploying with CRI-O. But as CRI-O is a Technology Preview for the OCS product, this does not have a high priority and falls out of the planning every time :-/

Moving this to ON_QA, as openshift-ansible-3.11.52-1 in combination with the current OCS container images should not have this problem anymore.

Version tested:
oc v3.11.82
kubernetes v1.11.0+d4cacc0
openshift-ansible-3.11.82-3.git.0.9718d0a.el7.noarch
cns-deploy-7.0.0-9.el7rhgs.x86_64

IMAGES:
rhgs3/rhgs-server-rhel7:v3.11.1
rhgs3/rhgs-volmanager-rhel7:v3.11.1
rhgs3/rhgs-gluster-block-prov-rhel7:v3.11.1

This works with the current release, as seen in comment 39. Closing.