Bug 1619264

Summary: [Tracking BZ#1632719] [F-QE] App Pods with block pvcs attached went into CrashLoopBackState- input/output error
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Neha Berry <nberry>
Component: gluster-block
Assignee: Prasanna Kumar Kalever <prasanna.kalever>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: medium
Version: cns-3.10
CC: atumball, bgoyal, hchiramm, kramdoss, madam, msaini, nberry, nigoyal, pkarampu, pprakash, prasanna.kalever, rhs-bugs, rtalur, sankarshan, vbellur, vinug, xiubli
Target Milestone: ---
Keywords: ZStream
Target Release: OCS 3.11.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.12.2-20
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-07 03:38:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1623874, 1632719
Bug Blocks: 1641915, 1644154

Description Neha Berry 2018-08-20 13:16:25 UTC
Description of problem:
++++++++++++++++++++++++
We had an OCP 3.10 + OCS 3.10 setup with gluster bits 3.12.2-15 and gluster-block version gluster-block-0.2.1-24.el7rhgs.x86_64. The setup had logging pods configured; the metrics pods could not come up.

Created around 50 block PVCs in two loops and then attached them to app pods:

Loop #1: Created block PVCs 101..130 and app pods bk-101 to bk-130.
Result: All pods were in running state.


Loop #2: Created block PVCs 131..150 and app pods bk-131 to bk-150.

Result: 
++++++++++++

1. None of the new pods came up, and oc describe pod showed the following error events. The iscsiadm logins were successful on the initiator nodes, though (a session check example follows the events below).
============================================================

Events:
  Type     Reason                  Age              From                                       Message
  ----     ------                  ----             ----                                       -------
  Normal   Scheduled               7m               default-scheduler                          Successfully assigned bk142-1-fz5gx to dhcp46-65.lab.eng.blr.redhat.com
  Normal   SuccessfulAttachVolume  7m               attachdetach-controller                    AttachVolume.Attach succeeded for volume "pvc-ae117351-a45f-11e8-92b7-005056a52fd4"
  Warning  FailedMount             6m (x3 over 6m)  kubelet, dhcp46-65.lab.eng.blr.redhat.com  MountVolume.MountDevice failed for volume "pvc-ae117351-a45f-11e8-92b7-005056a52fd4" : exit status 1
  Warning  FailedMount             1m (x3 over 5m)  kubelet, dhcp46-65.lab.eng.blr.redhat.com  Unable to mount volumes for pod "bk142-1-fz5gx_glusterfs(cd399d0f-a460-11e8-92b7-005056a52fd4)": timeout expired waiting for volumes to attach or mount for pod "glusterfs"/"bk142-1-fz5gx". list of unmounted volumes=[foo-vol]. list of unattached volumes=[foo-vol default-token-8fbpq]
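
For reference, the iSCSI session state mentioned above can be confirmed on the initiator node with the standard iscsiadm commands (a verification sketch; the exact output from this run was not captured here):

# on the initiator node (e.g. dhcp46-65), list the active iSCSI sessions
iscsiadm -m session

# print detailed session info, including the attached SCSI disk per target
iscsiadm -m session -P 3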

2. Two existing RUNNING pods started going into CrashLoopBackOff with the following error messages (a sketch of the kind of liveness probe involved follows the pod events below):
========================================================

oc describe pod bk124-1-zmq99
+++++++++++++++++++++++++++++++

Events:
  Type     Reason                  Age   From                                        Message
  ----     ------                  ----  ----                                        -------
  Normal   Scheduled               1h    default-scheduler                           Successfully assigned bk124-1-zmq99 to dhcp46-181.lab.eng.blr.redhat.com
  Normal   SuccessfulAttachVolume  1h    attachdetach-controller                     AttachVolume.Attach succeeded for volume "pvc-11a2ccad-a439-11e8-b3b0-005056a52fd4"
  Normal   Started                 1h    kubelet, dhcp46-181.lab.eng.blr.redhat.com  Started container
  Warning  Unhealthy               16m   kubelet, dhcp46-181.lab.eng.blr.redhat.com  Liveness probe failed: /dev/sdr on /mnt type xfs (rw,seclabel,relatime,attr2,inode64,noquota)
sh: can't create /mnt/random-data.log: Input/output error
  Normal   Pulled   14m (x5 over 1h)   kubelet, dhcp46-181.lab.eng.blr.redhat.com  Container image "cirros" already present on machine
  Normal   Created  14m (x5 over 1h)   kubelet, dhcp46-181.lab.eng.blr.redhat.com  Created container
  Warning  Failed   14m (x4 over 16m)  kubelet, dhcp46-181.lab.eng.blr.redhat.com  Error: failed to start container "foo": Error response from daemon: error setting label on mount source '/var/lib/origin/openshift.local.volumes/pods/dde3d3d9-a458-11e8-92b7-005056a52fd4/volumes/kubernetes.io~iscsi/pvc-11a2ccad-a439-11e8-b3b0-005056a52fd4': SELinux relabeling of /var/lib/origin/openshift.local.volumes/pods/dde3d3d9-a458-11e8-92b7-005056a52fd4/volumes/kubernetes.io~iscsi/pvc-11a2ccad-a439-11e8-b3b0-005056a52fd4 is not allowed: "input/output error"
  Warning  BackOff  1m (x57 over 16m)  kubelet, dhcp46-181.lab.eng.blr.redhat.com  Back-off restarting failed container

oc describe pod bk129-1-qzcs5
++++++++++++++++++++++++++++++++++
     Message:      error setting label on mount source '/var/lib/origin/openshift.local.volumes/pods/f0f77021-a458-11e8-92b7-005056a52fd4/volumes/kubernetes.io~iscsi/pvc-213178f1-a439-11e8-b3b0-005056a52fd4': SELinux relabeling of /var/lib/origin/openshift.local.volumes/pods/f0f77021-a458-11e8-92b7-005056a52fd4/volumes/kubernetes.io~iscsi/pvc-213178f1-a439-11e8-b3b0-005056a52fd4 is not allowed: "input/output error"
      Exit Code:    128
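
For context, the failing liveness probe above is a simple I/O check on the block mount. A minimal sketch of this kind of app pod is shown below; the pod name, PVC name and probe command are illustrative assumptions, not copied from the templates actually used on this setup:

oc create -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: bk-example                        # hypothetical pod name
spec:
  containers:
  - name: foo
    image: cirros
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: foo-vol
      mountPath: /mnt
    livenessProbe:
      exec:
        # the probe writes to the block mount; on this setup it failed with
        # "can't create /mnt/random-data.log: Input/output error"
        command: ["sh", "-c", "mount | grep /mnt && echo ok > /mnt/random-data.log"]
      initialDelaySeconds: 10
      periodSeconds: 30
  volumes:
  - name: foo-vol
    persistentVolumeClaim:
      claimName: bk-pvc-example            # hypothetical PVC name
EOF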



Some info about the setup: 
============================

1. It was seen that the brick for heketidbstorage was NOT ONLINE on the gluster pod on 10.70.46.150.
2. Also, on 10.70.46.150, the 2 block-hosting volumes had 2 separate PIDs and the brick for vol_9f93ae4c845f3910f5d1558cc5ae9f0a was NOT ONLINE.
(We will be raising a separate bug for the above two issues.)
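
For reference, the brick state above can be verified from one of the gluster pods with gluster volume status, along these lines (the pod name is taken from this setup; this is a verification sketch, not the exact command history):

# check the heketidbstorage bricks
oc rsh glusterfs-storage-q22cl gluster volume status heketidbstorage

# check the block-hosting volume that had a brick offline
oc rsh glusterfs-storage-q22cl gluster volume status vol_9f93ae4c845f3910f5d1558cc5ae9f0a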




Version-Release number of selected component (if applicable):
++++++++++++++++++++++++

[root@dhcp46-137 ~]# oc version
oc v3.10.14
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp46-137.lab.eng.blr.redhat.com:8443
openshift v3.10.14
kubernetes v1.10.0+b81c8f8
[root@dhcp46-137 ~]# 


Gluster 3.4.0
==============

[root@dhcp46-137 ~]# oc rsh glusterfs-storage-q22cl rpm -qa|grep gluster
glusterfs-client-xlators-3.12.2-15.el7rhgs.x86_64
glusterfs-cli-3.12.2-15.el7rhgs.x86_64
python2-gluster-3.12.2-15.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-15.el7rhgs.x86_64
glusterfs-libs-3.12.2-15.el7rhgs.x86_64
glusterfs-3.12.2-15.el7rhgs.x86_64
glusterfs-api-3.12.2-15.el7rhgs.x86_64
glusterfs-fuse-3.12.2-15.el7rhgs.x86_64
glusterfs-server-3.12.2-15.el7rhgs.x86_64
gluster-block-0.2.1-24.el7rhgs.x86_64
[root@dhcp46-137 ~]# 


[root@dhcp46-137 ~]# oc rsh heketi-storage-1-px7jd rpm -qa|grep heketi
python-heketi-7.0.0-6.el7rhgs.x86_64
heketi-7.0.0-6.el7rhgs.x86_64
heketi-client-7.0.0-6.el7rhgs.x86_64
[root@dhcp46-137 ~]# 


gluster client version
=========================
[root@dhcp46-65 ~]# rpm -qa|grep gluster
glusterfs-libs-3.12.2-15.el7.x86_64
glusterfs-3.12.2-15.el7.x86_64
glusterfs-fuse-3.12.2-15.el7.x86_64
glusterfs-client-xlators-3.12.2-15.el7.x86_64
[root@dhcp46-65 ~]# 




How reproducible:
++++++++++++++++++++++++
The issue was seen on one setup. The setup is kept in the same condition.

Steps to Reproduce:
++++++++++++++++++++++++
1. Create an OCP 3.10 + OCS 3.10 setup.
2. Upgrade the docker version to 1.13.1.74 and also update the gluster client packages. The pods will be restarted as docker is upgraded.
3. Once the setup is up, create block PVCs and then bind them to app pods (see the sketch after these steps).
4. Check the pod status and the gluster volume status.
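
For step 3, the block PVC creation can be scripted along these lines (a minimal sketch; the storage class name, size and PVC naming are illustrative assumptions, not the exact template used in this run):

# create block PVCs bk-pvc-131 .. bk-pvc-150 against a gluster-block storage class
for i in $(seq 131 150); do
  oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bk-pvc-$i
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: glusterfs-block      # illustrative storage class name
EOF
done

# confirm the PVCs are Bound, then create one app pod per PVC and watch pod status
oc get pvc
oc get pods -o wide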


Actual results:
++++++++++++++++++++++++
The new pods did not come up (MountDevice failed with exit status 1), and two existing running pods went into CrashLoopBackOff with input/output errors on their block mounts.

Expected results:
++++++++++++++++++++++++
The new pods should have been created successfully, and the existing pods should have kept running.

Comment 62 errata-xmlrpc 2019-02-07 03:38:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0285