Description of problem:
Fail to deploy OCS with CRI-O. Blocking CNV.

Version-Release number of selected component (if applicable):
OCP-3.11.43
openshift_storage_glusterfs_image: rhgs-server-rhel7:v3.11.0
openshift_storage_glusterfs_heketi_image: rhgs-volmanager-rhel7:v3.11.0
openshift_storage_glusterfs_block_image: rhgs-gluster-block-prov-rhel7:v3.11.0
openshift_storage_glusterfs_s3_image: rhgs-s3-server-rhel7:v3.11.0

How reproducible:
100%

Steps to Reproduce:
1. Execute the openshift-ansible playbook with OCS & CRI-O enabled

Actual results:
Warning  Unhealthy  1m (x346 over 1h)  kubelet, cnv-executor-lbednar-node2.example.com  (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = command error: time="2018-11-15T14:27:20Z" level=error msg="open /dev/null: permission denied"
open /dev/null: permission denied
exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
, stdout: , stderr: , exit code -1

Expected results:
Deployment should work.

Additional info:
Niels provided us with a workaround (a change in the openshift-ansible glusterfs deployment), which we are trying now, and gave the following info:
- glusterfs containers are running privileged
- it seems this (now?) automatically provides /dev bind-mounted
- the /dev bind-mount in the deployment prevents the container from starting
- removing the (now unneeded) /dev bind-mount from the deployment makes it work

We'll need to investigate whether this new automatic /dev bind-mount is intended, or maybe has been there for longer.
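For context, a minimal sketch of the shape of the workaround, assuming a glusterfs DaemonSet roughly like the one openshift-ansible deploys (object name, labels, and comments here are illustrative, not the actual template): the privileged flag stays, and the explicit hostPath bind-mount of /dev is what gets dropped.

----------------------------
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: glusterfs-storage            # illustrative name, not the real template
spec:
  selector:
    matchLabels:
      glusterfs: pod
  template:
    metadata:
      labels:
        glusterfs: pod
    spec:
      containers:
      - name: glusterfs
        image: rhgs3/rhgs-server-rhel7:v3.11.0
        securityContext:
          privileged: true
        # workaround: no volumeMounts/volumes entry bind-mounting /dev here;
        # the privileged flag already exposes the host's /dev inside the container
----------------------------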
Dup of bug 1627104 ?
Is this change in recent OCP Ansible installer post verification of bug 1627104 (https://bugzilla.redhat.com/show_bug.cgi?id=1627104) at OCP 3.11 + OCS 3.11?
(In reply to Yaniv Kaul from comment #2)
> Dup of bug 1627104 ?

No.

(In reply to Sudhir from comment #3)
> Is this change in recent OCP Ansible installer post verification of bug
> 1627104 (https://bugzilla.redhat.com/show_bug.cgi?id=1627104) at OCP 3.11 +
> OCS 3.11?

This was not a problem with OCP-3.11.16, but was found with OCP-3.11.43.
It seems unneeded to explicitly have a bind-mount for /dev when a container is running in privileged mode. In the few environments where I have been testing, the /dev directory gets automatically bind-mounted in that case. The documentation is not very clear though.

From https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/#securitycontext-v1-core :

  privileged
  Run container in privileged mode. Processes in privileged containers are essentially equivalent to root on the host. Defaults to false.

From 'man docker-run':

  --privileged=true|false
  Give extended privileges to this container. The default is false.
  By default, Docker containers are “unprivileged” (=false) and cannot, for example, run a Docker daemon inside the Docker container. This is because by default a container is not allowed to access any devices. A “privileged” container is given access to all devices.
  When the operator executes docker run --privileged, Docker will enable access to all devices on the host as well as set some configuration in AppArmor to allow the container nearly all the same access to the host as processes running outside of a container on the host.

From https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privileged :

  Privileged - determines if any container in a pod can enable privileged mode. By default a container is not allowed to access any devices on the host, but a “privileged” container is given access to all devices on the host. This allows the container nearly all the same access as processes running on the host. This is useful for containers that want to use linux capabilities like manipulating the network stack and accessing devices.
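To illustrate the behaviour described above, here is a minimal test pod (a sketch; the pod name is illustrative) that runs privileged without any explicit /dev bind-mount. On the environments mentioned, listing /dev inside such a container shows the host's device nodes.

----------------------------
apiVersion: v1
kind: Pod
metadata:
  name: privileged-dev-check         # illustrative name
spec:
  containers:
  - name: check
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "3600"]
    securityContext:
      privileged: true               # no hostPath mount for /dev is defined
----------------------------

Something like `oc exec privileged-dev-check -- ls /dev` can then be used to check whether the host's devices appear inside the container.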
Created attachment 1507319 [details]
rhel7 pods with combinations of privileged option and /dev bind-mount

The attached .yaml file has three rhel7 pods:
1. plain pod, not privileged, no /dev bind-mount
2. privileged pod, no /dev bind-mount - but /dev will be populated!
3. privileged pod, with /dev bind-mount - fails on OCP-3.11.43

With OCP-3.11.43 the 3rd pod fails to start. This is how the ocs/rhgs-server:v3.11.0 image is configured through openshift-ansible. The workaround given to CNV QE removes the /dev bind-mount (as mentioned in comment #0).

This is a minor change that we should be able to include in the next openshift-ansible update.
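For readers without access to the attachment, a hedged sketch of what the third (failing) combination looks like - a privileged pod that additionally bind-mounts /dev from the host (names here are illustrative, not copied from the attachment):

----------------------------
apiVersion: v1
kind: Pod
metadata:
  name: privileged-with-dev          # illustrative name
spec:
  containers:
  - name: rhel7
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "3600"]
    securityContext:
      privileged: true
    volumeMounts:
    - name: host-dev
      mountPath: /dev                # explicit bind-mount on top of the implicit privileged /dev
  volumes:
  - name: host-dev
    hostPath:
      path: /dev
----------------------------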
The only time I can remember having problems with /dev mounts was when running unprivileged. As long as we verify that there are no issues with the LVM commands when we don't explicitly bind-mount /dev, it should be fine.
I'm hit with this bug as well on my test cluster. I have an ansible installer that installs OCP+CNS from scratch; the installer also creates the machines from scratch to begin with. I repeatedly get this, and can't get past it just by rebuilding everything from scratch.

I could let you log in to debug, if you wish. In case you want to remote debug it, please share your public key here and I will drop it onto the systems.

----------------------------
[cloud-user@ocp-master ~]$ rpm -qa atomic*
atomic-openshift-clients-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-openshift-docker-excluder-3.11.43-1.git.0.647ac05.el7.noarch
atomic-1.22.1-25.git5a342e3.el7.x86_64
atomic-openshift-excluder-3.11.43-1.git.0.647ac05.el7.noarch
atomic-openshift-hyperkube-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-openshift-node-3.11.43-1.git.0.647ac05.el7.x86_64
atomic-registries-1.22.1-25.git5a342e3.el7.x86_64
atomic-openshift-3.11.43-1.git.0.647ac05.el7.x86_64
----------------------------

It gets stuck in this task:
----------------------------
TASK [openshift_storage_glusterfs : Wait for GlusterFS pods]
----------------------------

Here are the logs of it.

oc get events on master:
-------------------------
open /dev/null: permission denied
exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
, stdout: , stderr: , exit code -1
ocs  0s  6m  34  glusterfs-ocs-5zs78.15693c224fe883c0  Pod  spec.containers{glusterfs}  Warning  Unhealthy  kubelet, ocp-master  (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = command error: time="2018-11-21T20:07:58Z" level=error msg="open /dev/null: permission denied"
-------------------------

[cloud-user@ocp-master ~]$ sudo journalctl -xe
-----------------------
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: open /dev/null: permission denied
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: , stdout: , stderr: , exit code -1
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: E1121 21:59:49.076173   24236 remote_runtime.go:332] ExecSync a79912124c84633e5486b8c12154e375b6edc060c304edcde2ba554d38f9d756 '/bin/bash -c if command -v
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: "
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: open /dev/null: permission denied
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: exec failed: container_linux.go:336: starting container process caused "read init-p: connection reset by peer"
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: , stdout: , stderr: , exit code -1
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: E1121 21:59:49.107262   24236 remote_runtime.go:332] ExecSync a79912124c84633e5486b8c12154e375b6edc060c304edcde2ba554d38f9d756 '/bin/bash -c if command -v
Nov 21 21:59:49 ocp-master atomic-openshift-node[24236]: "
----------------------------

And this is how the device looks on the master:
----------------------------
[cloud-user@ocp-master ~]$ ls -laZ /dev/null
crw-rw-rw-. root root system_u:object_r:null_device_t:s0 /dev/null
----------------------------

Versions:
----------------------------
[cloud-user@bastion ~]$ rpm -qa '*openshift*'
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch

[cloud-user@bastion ~]$ rpm -qa '*ansible*'
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
ansible-2.6.6-1.el7ae.noarch
openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch
----------------------------

Gluster options in the hosts file:
----------------------------
#OCS
openshift_storage_glusterfs_namespace=ocs
openshift_storage_glusterfs_name=ocs
openshift_storage_glusterfs_wipe=True
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=true
openshift_storage_glusterfs_image=registry.access.redhat.com/rhgs3/rhgs-server-rhel7
openshift_storage_glusterfs_heketi_image=registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7
openshift_storage_glusterfs_block_deploy=True
openshift_storage_glusterfs_block_host_vol_create=true
openshift_storage_glusterfs_block_host_vol_size=50
openshift_storage_glusterfs_block_storageclass=true
----------------------------
Ah, sorry about the previous comment. I did not read carefully; this is a duplicate, and there was a link to the bugzilla with a suggested workaround.
Two PRs are under review:
- https://github.com/gluster/gluster-kubernetes/pull/538
- https://github.com/openshift/openshift-ansible/pull/10768
Close but no cigar. With the ansible playbook tweak I get the installation further, but the glusterfs-ocs-nn pods do not finish starting.

This is what the status check fails with (oc rsh glusterfs-ocs-nn):
---------------
sh-4.2# /usr/local/bin/status-probe.sh readiness
failed check: systemctl -q is-active gluster-blockd.service
sh-4.2# systemctl is-active gluster-blockd
failed
---------------

None of the three pods reach fully-up status:
---------------
[cloud-user@ocp-master ~]$ oc get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP               NODE        NOMINATED NODE
glusterfs-ocs-dkt6g   0/1     Running   2          1h    192.168.122.30   ocp-apps1   <none>
glusterfs-ocs-szwdc   0/1     Running   2          1h    192.168.122.40   ocp-apps2   <none>
glusterfs-ocs-v9q8h   0/1     Running   2          1h    192.168.122.20   ocp-infra   <none>
---------------
We need this for OCP 3.11 errata for sure. OCS 3.11.1 will be on December 11th, so whichever aligns will help.
I find the method of not mounting /dev to be problematic since "/dev/disk/by-id/*" doesn't exist inside the gluster pods. The "/dev/disk/by-id/*" is very useful for letting gluster know which block devices to use. A workaround that worked for me was to bind mount only "/dev/disk/by-id".
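A sketch of that workaround, assuming a privileged container similar to the glusterfs one (pod and volume names are illustrative): bind-mount only the by-id directory so the stable device names stay visible, without re-mounting all of /dev.

----------------------------
apiVersion: v1
kind: Pod
metadata:
  name: glusterfs-by-id-example      # illustrative name
spec:
  containers:
  - name: glusterfs
    image: rhgs3/rhgs-server-rhel7:v3.11.0
    securityContext:
      privileged: true
    volumeMounts:
    - name: dev-disk-by-id
      mountPath: /dev/disk/by-id     # only the stable-name symlinks, not all of /dev
  volumes:
  - name: dev-disk-by-id
    hostPath:
      path: /dev/disk/by-id
----------------------------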
Hmm, that indeed looks problematic. Not many people use "/dev/disk/by-id/*" at the moment, but it is something that we want to support (and possibly even recommend) in the future. I'll have a look at it and see if "/dev/disk" should be included.
(In reply to Niels de Vos from comment #20)
> Hmm, that indeed looks problematic. Not many people use "/dev/disk/by-id/*"
> at the moment, but it is something that we want to support (and possibly
> even recommend) in the future.
>
> I'll have a look at it and see if "/dev/disk" should be included.

As an OSE customer I tried to use them when I started with OpenShift, but heketi didn't support it. It was fixed by https://github.com/heketi/heketi/commit/73c0ef4b8d56183a6011430f679dbbb04b4a2ee0 which made it into 3.10/3.3. I have not yet tested it again though :D
New PRs that add /dev/disk as a bind-mount:
- https://github.com/gluster/gluster-kubernetes/pull/542
- https://github.com/openshift/openshift-ansible/pull/10793

Reviews, /ok-to-test and similar are much appreciated.
More changes to allow /dev/mapper/* devices:
- https://github.com/openshift/openshift-ansible/pull/10806
- https://github.com/gluster/gluster-kubernetes/pull/544
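Roughly, the combined result of these changes would look like the following sketch (the actual templates live in the PRs above; pod and volume names here are illustrative assumptions): the glusterfs container keeps privileged mode and bind-mounts /dev/disk and /dev/mapper instead of all of /dev.

----------------------------
apiVersion: v1
kind: Pod
metadata:
  name: glusterfs-dev-mounts-example   # illustrative name
spec:
  containers:
  - name: glusterfs
    image: rhgs3/rhgs-server-rhel7:v3.11.0
    securityContext:
      privileged: true
    volumeMounts:
    - name: host-dev-disk
      mountPath: /dev/disk              # stable /dev/disk/by-id/* names
    - name: host-dev-mapper
      mountPath: /dev/mapper            # device-mapper/LVM devices
  volumes:
  - name: host-dev-disk
    hostPath:
      path: /dev/disk
  - name: host-dev-mapper
    hostPath:
      path: /dev/mapper
----------------------------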
Scott, can openshift-ansible PR#10806 be included as well please?
PR#10806 has been included in (upstream) openshift-ansible-3.11.52-1
(In reply to Niels de Vos from comment #25)
> PR#10806 has been included in (upstream) openshift-ansible-3.11.52-1

This version of ansible has been released and is available; however, I see this bugzilla in MODIFIED state. The change log only contains 1 PR instead of 3.

Niels, are you sure all the fix PRs are available with v3.11.52-1? If yes, can you check why this is still in MODIFIED state?
(In reply to Humble Chirammal from comment #29)
> Niels, are you sure all the fix PRs are available with v3.11.52-1?

Yes.

> If yes, can you check why this is still in MODIFIED state?

That is probably something Jose or Scott can answer.
This is an RHGS bug, not an OpenShift bug, so Scott shouldn't be expected to stay on top of it. Similarly, I have been doing no work on this, so I have not been keeping track of its state. If you or Niels feel this is ready, please move it accordingly.
Talur,

We do support it as tech preview; we need to figure out what broke.
(In reply to Sudhir from comment #33)
> Talur,
> We do support it as tech preview, we need to figure out what broke

All changes have been included already; the problem should have been fixed for months now. This is mainly waiting on acceptance from OCS QE to validate deploying with CRI-O. But as CRI-O is a Technology Preview for the OCS product, this does not have a high priority and falls out of the planning every time :-/

Moving this to ON_QA, as openshift-ansible-3.11.52-1 in combination with the current OCS container images should not have this problem anymore.
Versions tested on:

oc v3.11.82
kubernetes v1.11.0+d4cacc0
openshift-ansible-3.11.82-3.git.0.9718d0a.el7.noarch
cns-deploy-7.0.0-9.el7rhgs.x86_64

IMAGES:
rhgs3/rhgs-server-rhel7:v3.11.1
rhgs3/rhgs-volmanager-rhel7:v3.11.1
rhgs3/rhgs-gluster-block-prov-rhel7:v3.11.1
This works with the current release, as seen in comment 39. Closing.