Description of problem:
++++++++++++++++++++++++
We had an OCP 3.10 + OCS 3.10 setup with gluster-bits=3.12.2-15 and gluster-block version gluster-block-0.2.1-24.el7rhgs.x86_64. The setup had logging pods configured; the metrics pods could not come up.
Created around 50 block PVCs in two loops and then attached them to app pods.
Loop #1: 101..130, creating app pods bk-101 to bk-130.
Result: all pods were in Running state.
Loop #2: 131..150, creating app pods bk-131 to bk-150.
Result:
++++++++++++
1. None of the new pods came up, and oc describe pod printed the following error message. iscsiadm logins were successful on the initiator nodes, though.
============================================================
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m default-scheduler Successfully assigned bk142-1-fz5gx to dhcp46-65.lab.eng.blr.redhat.com
Normal SuccessfulAttachVolume 7m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-ae117351-a45f-11e8-92b7-005056a52fd4"
Warning FailedMount 6m (x3 over 6m) kubelet, dhcp46-65.lab.eng.blr.redhat.com MountVolume.MountDevice failed for volume "pvc-ae117351-a45f-11e8-92b7-005056a52fd4" : exit status 1
Warning FailedMount 1m (x3 over 5m) kubelet, dhcp46-65.lab.eng.blr.redhat.com Unable to mount volumes for pod "bk142-1-fz5gx_glusterfs(cd399d0f-a460-11e8-92b7-005056a52fd4)": timeout expired waiting for volumes to attach or mount for pod "glusterfs"/"bk142-1-fz5gx". list of unmounted volumes=[foo-vol]. list of unattached volumes=[foo-vol default-token-8fbpq]
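Although MountDevice failed, the iSCSI logins themselves had succeeded on the initiator node. A minimal way to cross-check that (the IQNs and IPs in the sample below are hypothetical, not captured from this setup):

```shell
# On the live initiator node, active sessions are listed with:
#   iscsiadm -m session
# Each gluster-block target appears as a logged-in session, e.g.:
#   tcp: [12] 10.70.46.150:3260,1 iqn.2016-12.org.gluster-block:<gbid> (non-flash)
# Here we count gluster-block logins in a captured sample.
sessions='tcp: [12] 10.70.46.150:3260,1 iqn.2016-12.org.gluster-block:aaaa (non-flash)
tcp: [13] 10.70.46.173:3260,1 iqn.2016-12.org.gluster-block:aaaa (non-flash)'
echo "$sessions" | grep -c 'gluster-block'
```

A session count matching the number of attached block PVs suggests the failure is past the iSCSI login stage.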
2. Two existing RUNNING pods went into CrashLoopBackOff state with the following error messages:
========================================================
oc describe pod bk124-1-zmq99
+++++++++++++++++++++++++++++++
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1h default-scheduler Successfully assigned bk124-1-zmq99 to dhcp46-181.lab.eng.blr.redhat.com
Normal SuccessfulAttachVolume 1h attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-11a2ccad-a439-11e8-b3b0-005056a52fd4"
Normal Started 1h kubelet, dhcp46-181.lab.eng.blr.redhat.com Started container
Warning Unhealthy 16m kubelet, dhcp46-181.lab.eng.blr.redhat.com Liveness probe failed: /dev/sdr on /mnt type xfs (rw,seclabel,relatime,attr2,inode64,noquota)
sh: can't create /mnt/random-data.log: Input/output error
Normal Pulled 14m (x5 over 1h) kubelet, dhcp46-181.lab.eng.blr.redhat.com Container image "cirros" already present on machine
Normal Created 14m (x5 over 1h) kubelet, dhcp46-181.lab.eng.blr.redhat.com Created container
Warning Failed 14m (x4 over 16m) kubelet, dhcp46-181.lab.eng.blr.redhat.com Error: failed to start container "foo": Error response from daemon: error setting label on mount source '/var/lib/origin/openshift.local.volumes/pods/dde3d3d9-a458-11e8-92b7-005056a52fd4/volumes/kubernetes.io~iscsi/pvc-11a2ccad-a439-11e8-b3b0-005056a52fd4': SELinux relabeling of /var/lib/origin/openshift.local.volumes/pods/dde3d3d9-a458-11e8-92b7-005056a52fd4/volumes/kubernetes.io~iscsi/pvc-11a2ccad-a439-11e8-b3b0-005056a52fd4 is not allowed: "input/output error"
Warning BackOff 1m (x57 over 16m) kubelet, dhcp46-181.lab.eng.blr.redhat.com Back-off restarting failed container
oc describe pod bk129-1-qzcs5
++++++++++++++++++++++++++++++++++
Message: error setting label on mount source '/var/lib/origin/openshift.local.volumes/pods/f0f77021-a458-11e8-92b7-005056a52fd4/volumes/kubernetes.io~iscsi/pvc-213178f1-a439-11e8-b3b0-005056a52fd4': SELinux relabeling of /var/lib/origin/openshift.local.volumes/pods/f0f77021-a458-11e8-92b7-005056a52fd4/volumes/kubernetes.io~iscsi/pvc-213178f1-a439-11e8-b3b0-005056a52fd4 is not allowed: "input/output error"
Exit Code: 128
Some info about the setup:
============================
1. It was seen that the brick for heketidbstorage was NOT ONLINE in the gluster pod on 10.70.46.150.
2. Also, on 10.70.46.150, the two block-hosting volumes had two separate PIDs, and the brick for vol_9f93ae4c845f3910f5d1558cc5ae9f0a was NOT ONLINE.
(We will raise a separate bug for the above two issues.)
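The offline bricks were spotted from the gluster volume status output. A sketch of the check (the brick paths in the sample are illustrative, not copied from this setup):

```shell
# On the live setup, brick health per volume is shown by:
#   oc rsh glusterfs-storage-q22cl gluster volume status heketidbstorage
# Columns: Gluster process / TCP Port / RDMA Port / Online / Pid.
# An offline brick shows Online=N and Pid=N/A. Filtering a captured
# sample for offline bricks prints the affected brick:
status='Brick 10.70.46.150:/var/lib/heketi/mounts/vg_x/brick_1/brick N/A N/A N N/A
Brick 10.70.46.173:/var/lib/heketi/mounts/vg_y/brick_2/brick 49152 0 Y 2171'
echo "$status" | awk '$1 == "Brick" && $5 == "N" {print $2}'
```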
Version-Release number of selected component (if applicable):
++++++++++++++++++++++++
[root@dhcp46-137 ~]# oc version
oc v3.10.14
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://dhcp46-137.lab.eng.blr.redhat.com:8443
openshift v3.10.14
kubernetes v1.10.0+b81c8f8
[root@dhcp46-137 ~]#
Gluster 3.4.0
==============
[root@dhcp46-137 ~]# oc rsh glusterfs-storage-q22cl rpm -qa|grep gluster
glusterfs-client-xlators-3.12.2-15.el7rhgs.x86_64
glusterfs-cli-3.12.2-15.el7rhgs.x86_64
python2-gluster-3.12.2-15.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-15.el7rhgs.x86_64
glusterfs-libs-3.12.2-15.el7rhgs.x86_64
glusterfs-3.12.2-15.el7rhgs.x86_64
glusterfs-api-3.12.2-15.el7rhgs.x86_64
glusterfs-fuse-3.12.2-15.el7rhgs.x86_64
glusterfs-server-3.12.2-15.el7rhgs.x86_64
gluster-block-0.2.1-24.el7rhgs.x86_64
[root@dhcp46-137 ~]#
[root@dhcp46-137 ~]# oc rsh heketi-storage-1-px7jd rpm -qa|grep heketi
python-heketi-7.0.0-6.el7rhgs.x86_64
heketi-7.0.0-6.el7rhgs.x86_64
heketi-client-7.0.0-6.el7rhgs.x86_64
[root@dhcp46-137 ~]#
gluster client version
=========================
[root@dhcp46-65 ~]# rpm -qa|grep gluster
glusterfs-libs-3.12.2-15.el7.x86_64
glusterfs-3.12.2-15.el7.x86_64
glusterfs-fuse-3.12.2-15.el7.x86_64
glusterfs-client-xlators-3.12.2-15.el7.x86_64
[root@dhcp46-65 ~]#
How reproducible:
++++++++++++++++++++++++
The issue was seen on one setup. The setup is kept in the same condition.
Steps to Reproduce:
++++++++++++++++++++++++
1. Create an OCP +OCS 3.10 setup.
2. Upgrade the docker version to 1.13.1.74 and also update the gluster client packages. The pods will be restarted as docker is upgraded.
3. Once the setup is up, create block PVCs and then bind them to app pods.
4. Check the pod status and the gluster v status.
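The PVC creation in the steps above can be sketched with a loop like the following (the storage class name glusterfs-storage-block and the 1Gi size are assumptions, not taken from this setup):

```shell
# Generate block PVC manifests bk-101..bk-150, covering the two loops
# (101..130 and 131..150) used in the test run. Each manifest would
# then be applied with: oc create -f <file>
mkdir -p /tmp/bk-pvcs
for i in $(seq 101 130) $(seq 131 150); do
  cat > /tmp/bk-pvcs/bk-$i.yaml <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bk-$i
spec:
  storageClassName: glusterfs-storage-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
done
ls /tmp/bk-pvcs | wc -l   # 50 manifests
```

After applying the manifests and attaching the PVCs to app pods, pod and brick health can be watched with oc get pods -w and gluster volume status inside a gluster pod.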
Actual results:
++++++++++++++++++++++++
The new app pods (bk-131 to bk-150) failed to come up with mount timeouts, and two existing Running pods (bk124, bk129) went into CrashLoopBackOff with SELinux relabeling input/output errors.
Expected results:
++++++++++++++++++++++++
The new pods should have been created successfully, and the old ones should have kept running.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:0285