Bug 1566326
Summary: | PV not getting mounted even though only 1 out of 3 bricks in replica 3 volume is offline | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Shekhar Berry <shberry>
Component: | rhgs-server-container | Assignee: | Saravanakumar <sarumuga>
Status: | CLOSED WORKSFORME | QA Contact: | Prasanth <pprakash>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.0 | CC: | aos-bugs, aos-storage-staff, ekuric, madam, mpillai, psuriset, rcyriac, rgeorge, rhs-bugs, rsussman, rtalur, sankarshan, shberry
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-07-10 18:47:04 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Shekhar Berry
2018-04-12 05:06:19 UTC
Gluster vol status output:

Status of volume: vol_e5a06ef5e3ded904f341092d22589a3f
Gluster process                                                                                  TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.29.212:/var/lib/heketi/mounts/vg_bfc15fe9069ee29bc752b293fdaec6eb/brick_69b77b5800b28f9e6bf4539f0a6ac70a/brick  49153  0  Y  469
Brick 10.16.29.216:/var/lib/heketi/mounts/vg_eb93eb5db356e087f0db1deba22918cc/brick_955d9313c2a653ac009838af375edab6/brick  49153  0  Y  465
Brick 10.16.29.214:/var/lib/heketi/mounts/vg_f1704c4e203e3d0fc86bbc95ee330cc6/brick_6a2541e6138d7a54622f1d12cf1dc1a9/brick  N/A    N/A  N  N/A
Self-heal Daemon on localhost                                                                    N/A  N/A  Y  408
Self-heal Daemon on perf148f.perf.lab.eng.bos.redhat.com                                         N/A  N/A  Y  451
Self-heal Daemon on 10.16.29.214                                                                 N/A  N/A  Y  449

Task Status of Volume vol_e5a06ef5e3ded904f341092d22589a3f
------------------------------------------------------------------------------
There are no active volume tasks

Referring to the logs, I can see the following messages:

connection to 10.16.29.212:24007 failed (Connection refused); disconnecting socket
connection to 10.16.29.214:24007 failed (Connection refused); disconnecting socket

So port 24007 needs to be open in the firewall, and it does not appear to be.

How was the CNS deployment carried out?

I suspect an issue with the firewall setup during CNS deployment.

(In reply to Saravanakumar from comment #6)
> Referring to the logs, I can see the following messages:
> connection to 10.16.29.212:24007 failed (Connection refused); disconnecting socket
> connection to 10.16.29.214:24007 failed (Connection refused); disconnecting socket
>
> So port 24007 needs to be open in the firewall, and it does not appear to be.
>
> How was the CNS deployment carried out?
>
> I suspect an issue with the firewall setup during CNS deployment.

The CNS deployment was done through the ansible deployment.

I have verified I/O using pgbench after bringing down a node (and the corresponding gluster pod). I do not observe any issue with the gluster volumes and can carry out I/O.
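To narrow down the firewall suspicion above, a minimal reachability check plus the corresponding firewalld rules could look like the sketch below. This is only a sketch: it assumes firewalld is the firewall in use on the gluster nodes, the IPs are the ones from the status output above, and the brick port range should be adjusted to whatever the deployment actually uses.

```bash
# From the node that fails to mount the PV, verify glusterd's
# management port (24007) is reachable on each gluster node.
for ip in 10.16.29.212 10.16.29.214 10.16.29.216; do
    if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$ip/24007"; then
        echo "$ip:24007 reachable"
    else
        echo "$ip:24007 NOT reachable"
    fi
done

# If the connections are refused and firewalld is active on the
# gluster nodes, open the glusterd management ports and the brick
# port range, then reload the rules (run on each gluster node).
firewall-cmd --permanent --add-port=24007-24008/tcp
firewall-cmd --permanent --add-port=49152-49664/tcp   # brick ports; adjust to your deployment
firewall-cmd --reload
```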
Complete logs below:

==========================================

# oc version
oc v3.9.0-0.53.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp43-231.lab.eng.blr.redhat.com:8443
openshift v3.9.0-0.53.0
kubernetes v1.9.1+a0ce1bc657

==========================================

Gluster node and corresponding pod down:

[root@dhcp43-231 ~]# oc get nodes
NAME                                STATUS     ROLES     AGE   VERSION
dhcp41-180.lab.eng.blr.redhat.com   Ready      <none>    92d   v1.9.1+a0ce1bc657
dhcp42-122.lab.eng.blr.redhat.com   NotReady   <none>    92d   v1.9.1+a0ce1bc657
dhcp43-135.lab.eng.blr.redhat.com   Ready      <none>    92d   v1.9.1+a0ce1bc657
dhcp43-195.lab.eng.blr.redhat.com   Ready      <none>    92d   v1.9.1+a0ce1bc657
dhcp43-231.lab.eng.blr.redhat.com   Ready      master    92d   v1.9.1+a0ce1bc657

[root@dhcp43-231 ~]# oc get pods
NAME                                          READY   STATUS    RESTARTS   AGE
glusterblock-storage-provisioner-dc-1-j4bwp   1/1     Running   1          89d
glusterfs-storage-bzbqk                       1/1     Running   1          89d
glusterfs-storage-cqh26                       1/1     Unknown   1          89d
glusterfs-storage-m9jjh                       1/1     Running   1          89d
heketi-storage-1-m9tdl                        1/1     Running   1          89d
postgresql-1-ppkvc                            1/1     Running   1          2h

==========================================

Gluster volume status:

[root@dhcp43-231 ~]# oc rsh po/glusterfs-storage-bzbqk
sh-4.2# gluster vol status
Status of volume: heketidbstorage
Gluster process                                                                                  TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.41.180:/var/lib/heketi/mounts/vg_20815059c81012a55f4c4fa14725695f/brick_be8168fc1b53eee4a0d6d5c6ed26a098/brick  49152  0  Y  444
Brick 10.70.43.135:/var/lib/heketi/mounts/vg_14e52f050c99f33a77834d55e248b8b7/brick_5be313c9d73d824819733dc3b7682bcf/brick  49152  0  Y  433
Self-heal Daemon on localhost                                                                    N/A  N/A  Y  14217
Self-heal Daemon on 10.70.43.135                                                                 N/A  N/A  Y  14208

Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vol_2aaf562cfd128cc6a174022f18b683b0
Gluster process                                                                                  TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.43.135:/var/lib/heketi/mounts/vg_c15ed928d624beddf3578195805cbdb5/brick_9771e962d1b1858eb7bfe8741c38358f/brick  49152  0  Y  433
Brick 10.70.41.180:/var/lib/heketi/mounts/vg_3aa9dfed04dfbc76fba92a6256a0d294/brick_f1ed81a1533f0e92e25d94006ec8a8ad/brick  49152  0  Y  444
Self-heal Daemon on localhost                                                                    N/A  N/A  Y  14217
Self-heal Daemon on 10.70.43.135                                                                 N/A  N/A  Y  14208

Task Status of Volume vol_2aaf562cfd128cc6a174022f18b683b0
------------------------------------------------------------------------------
There are no active volume tasks

==========================================

Carrying out I/O on the gluster volume:

sh-4.2$ pgbench -i -s 10
creating tables...
100000 of 1000000 tuples (10%) done (elapsed 0.73 s, remaining 6.57 s)
200000 of 1000000 tuples (20%) done (elapsed 1.65 s, remaining 6.59 s)
300000 of 1000000 tuples (30%) done (elapsed 2.69 s, remaining 6.27 s)
400000 of 1000000 tuples (40%) done (elapsed 3.69 s, remaining 5.54 s)
500000 of 1000000 tuples (50%) done (elapsed 4.69 s, remaining 4.69 s)
600000 of 1000000 tuples (60%) done (elapsed 5.70 s, remaining 3.80 s)
700000 of 1000000 tuples (70%) done (elapsed 6.73 s, remaining 2.88 s)
800000 of 1000000 tuples (80%) done (elapsed 7.75 s, remaining 1.94 s)
900000 of 1000000 tuples (90%) done (elapsed 8.85 s, remaining 0.98 s)
1000000 of 1000000 tuples (100%) done (elapsed 10.09 s, remaining 0.00 s)
vacuum...
set primary keys...
done.

sh-4.2$ df -h
Filesystem                                                                                         Size   Used  Avail  Use%  Mounted on
/dev/mapper/docker-253:3-389005-674752ba6e44c9317ccddba442c488971f53263e3773f970848aa4731b0eab64   10G    349M  9.7G   4%    /
tmpfs                                                                                              1.9G   0     1.9G   0%    /dev
tmpfs                                                                                              1.9G   0     1.9G   0%    /sys/fs/cgroup
/dev/vda3                                                                                          15G    3.1G  12G    21%   /etc/hosts
shm                                                                                                64M    4.0K  64M    1%    /dev/shm
10.70.41.180:vol_2aaf562cfd128cc6a174022f18b683b0                                                  1016M  230M  787M   23%   /var/lib/pgsql/data
tmpfs                                                                                              1.9G   16K   1.9G   1%    /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                              1.9G   0     1.9G   0%    /proc/scsi

sh-4.2$ pwd
/var/lib/pgsql/data/userdata

sh-4.2$ date
Tue May 29 10:40:24 UTC 2018

sh-4.2$ ls -ltr
total 103
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_commit_ts
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_dynshmem
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_serial
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_snapshots
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_twophase
drwx------. 4 1000080000 root  4096 May 29 08:19 pg_multixact
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_replslot
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_tblspc
drwx------. 4 1000080000 root  4096 May 29 08:19 pg_logical
-rw-------. 1 1000080000 root     4 May 29 08:19 PG_VERSION
-rw-------. 1 1000080000 root    88 May 29 08:19 postgresql.auto.conf
-rw-------. 1 1000080000 root  1636 May 29 08:19 pg_ident.conf
drwx------. 3 1000080000 root  4096 May 29 08:19 pg_xlog
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_clog
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_subtrans
-rw-------. 1 1000080000 root  4666 May 29 08:20 pg_hba.conf
drwx------. 2 1000080000 root  4096 May 29 08:20 pg_log
-rw-------. 1 1000080000 root 22424 May 29 08:20 postgresql.conf
drwx------. 6 1000080000 root  4096 May 29 08:20 base
-rw-------. 1 1000080000 root    89 May 29 08:20 postmaster.pid
drwx------. 2 1000080000 root  4096 May 29 08:20 pg_notify
-rw-------. 1 1000080000 root    46 May 29 08:20 postmaster.opts
drwx------. 2 1000080000 root  4096 May 29 08:20 pg_stat
drwx------. 2 1000080000 root  4096 May 29 10:39 global
drwx------. 2 1000080000 root  4096 May 29 10:40 pg_stat_tmp

==========================================

As mentioned in comment #6, this looks like a setup issue.
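For anyone trying to reproduce the original mount failure outside of OpenShift, one way to separate a glusterd reachability problem from the single offline brick is to attempt the same FUSE mount by hand from the node where the pod was scheduled. This is a rough sketch, not the exact command the volume plugin runs: the volume name and server IPs are taken from the first status output in this bug, the mount point is arbitrary, and the client log path assumes the default glusterfs naming of the mount path.

```bash
# Mount the replica 3 volume directly, listing the other two nodes as
# backup volfile servers so a single offline node/brick should not
# block the mount.
mkdir -p /mnt/pvtest
mount -t glusterfs \
      -o backup-volfile-servers=10.16.29.216:10.16.29.214,log-level=DEBUG \
      10.16.29.212:/vol_e5a06ef5e3ded904f341092d22589a3f /mnt/pvtest

# If the mount fails, the client log should show whether the failure is
# "Connection refused" on port 24007 (glusterd unreachable, i.e. a
# setup/firewall issue) rather than anything related to the offline brick.
tail -n 50 /var/log/glusterfs/mnt-pvtest.log
```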