Description of problem:
I am using a GlusterFS-backed PV from CNS to host a database created by Postgres. The Postgres app pod was up and running while I was scaling the DB, but it crashed when the PV got unmounted after one of the bricks in the volume went offline. The replication controller tried to re-spawn the pod, but the pod went into CrashLoopBackOff status because the volume could not be remounted.

The pod description says:

Mounting arguments: --description=Kubernetes transient mount for /var/lib/origin/openshift.local.volumes/pods/ed3608cb-3d4b-11e8-979f-ac1f6b221e7e/volumes/kubernetes.io~glusterfs/pvc-beec041a-3cbc-11e8-979f-ac1f6b221e7e --scope -- mount -t glusterfs -o auto_unmount,log-level=ERROR,log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-beec041a-3cbc-11e8-979f-ac1f6b221e7e/postgresql-1-p9ccg-glusterfs.log,backup-volfile-servers=10.16.29.212:10.16.29.214:10.16.29.216 10.16.29.212:vol_e5a06ef5e3ded904f341092d22589a3f /var/lib/origin/openshift.local.volumes/pods/ed3608cb-3d4b-11e8-979f-ac1f6b221e7e/volumes/kubernetes.io~glusterfs/pvc-beec041a-3cbc-11e8-979f-ac1f6b221e7e
Output: Running scope as unit run-61298.scope.
Mount failed. Please check the log file for more details.

The log for the client-side mount is here:
http://perf1.perf.lab.eng.bos.redhat.com/shberry/mount_issue/client_mnt_log/postgresql-1-p9ccg-glusterfs.log

It says:
[2018-04-11 07:21:15.302977] E [MSGID: 114044] [client-handshake.c:1144:client_setvolume_cbk] 0-vol_e5a06ef5e3ded904f341092d22589a3f-client-2: SETVOLUME on remote-host failed: remote-subvolume "/var/lib/heketi/mounts/vg_f1704c4e203e3d0fc86bbc95ee330cc6/brick_6a2541e6138d7a54622f1d12cf1dc1a9/brick" is not found [No such file or directory]

Version-Release number of selected component (if applicable):

oc version
oc v3.9.19
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://perf148c.perf.lab.eng.bos.redhat.com:8443
openshift v3.9.19
kubernetes v1.9.1+a0ce1bc657

rpm -qa | grep gluster
glusterfs-libs-3.8.4-54.3.el7rhgs.x86_64
glusterfs-3.8.4-54.3.el7rhgs.x86_64
glusterfs-api-3.8.4-54.3.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.3.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.3.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.3.el7rhgs.x86_64
gluster-block-0.2.1-14.1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.3.el7rhgs.x86_64
glusterfs-server-3.8.4-54.3.el7rhgs.x86_64

Additional info:
The GlusterFS logs (/var/log/glusterfs) and /var/lib/heketi/fstab from the 3 CNS nodes have been captured and can be found here:
http://perf1.perf.lab.eng.bos.redhat.com/shberry/mount_issue/
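The SETVOLUME error above indicates that the brick path is missing on 10.16.29.214, which usually means the heketi brick LV is not mounted on that node. A minimal check, assuming shell access to the gluster pod (or node) for 10.16.29.214 and that its mount binary supports --fstab; the brick ID is taken from the error above:

grep brick_6a2541e6138d7a54622f1d12cf1dc1a9 /var/lib/heketi/fstab   # expected mount entry for the missing brick
mount | grep brick_6a2541e6138d7a54622f1d12cf1dc1a9                 # is the brick LV actually mounted?
mount -a --fstab /var/lib/heketi/fstab                              # remount any missing heketi bricks (assumes the LVs are intact)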
Gluster vol status output:

Status of volume: vol_e5a06ef5e3ded904f341092d22589a3f
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.29.212:/var/lib/heketi/mounts/vg_bfc15fe9069ee29bc752b293fdaec6eb/brick_69b77b5800b28f9e6bf4539f0a6ac70a/brick   49153   0     Y   469
Brick 10.16.29.216:/var/lib/heketi/mounts/vg_eb93eb5db356e087f0db1deba22918cc/brick_955d9313c2a653ac009838af375edab6/brick   49153   0     Y   465
Brick 10.16.29.214:/var/lib/heketi/mounts/vg_f1704c4e203e3d0fc86bbc95ee330cc6/brick_6a2541e6138d7a54622f1d12cf1dc1a9/brick   N/A     N/A   N   N/A
Self-heal Daemon on localhost                               N/A   N/A   Y   408
Self-heal Daemon on perf148f.perf.lab.eng.bos.redhat.com    N/A   N/A   Y   451
Self-heal Daemon on 10.16.29.214                            N/A   N/A   Y   449

Task Status of Volume vol_e5a06ef5e3ded904f341092d22589a3f
------------------------------------------------------------------------------
There are no active volume tasks
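The status above shows the brick on 10.16.29.214 offline. Once the underlying brick filesystem is mounted again, one way to restart the offline brick process (a sketch only; it restarts the brick process but does not address whatever unmounted it) is a force start from inside any gluster pod of this cluster:

gluster volume start vol_e5a06ef5e3ded904f341092d22589a3f force   # restarts brick processes that are not running
gluster volume status vol_e5a06ef5e3ded904f341092d22589a3f        # confirm the 10.16.29.214 brick now shows Online: Y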
Referring to the logs, I can see the following messages:

connection to 10.16.29.212:24007 failed (Connection refused); disconnecting socket
connection to 10.16.29.214:24007 failed (Connection refused); disconnecting socket

So, port 24007 needs to be open in the firewall, and it does not seem to be.

How was the CNS deployment carried out?

I suspect an issue with the firewall setup during CNS deployment.
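A quick way to check the firewall theory (assuming nc is available on the client node and firewalld is in use on the CNS nodes):

nc -zv 10.16.29.212 24007                                          # from the node where the mount fails: is glusterd reachable?
nc -zv 10.16.29.214 24007
firewall-cmd --list-ports                                          # on each CNS node: is 24007/tcp allowed?
firewall-cmd --add-port=24007/tcp --permanent && firewall-cmd --reload   # open it if it is missing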
(In reply to Saravanakumar from comment #6)
> Referring to the logs, I can see the following messages:
>
> connection to 10.16.29.212:24007 failed (Connection refused); disconnecting
> socket
>
> connection to 10.16.29.214:24007 failed (Connection refused); disconnecting
> socket
>
> So, port 24007 needs to be open in the firewall, and it does not seem to be.
>
> How was the CNS deployment carried out?
>
> I suspect an issue with the firewall setup during CNS deployment.

The CNS deployment was done using the Ansible-based deployment.
I have verified this by doing I/O using pgbench after bringing down a node (and the corresponding gluster pod). I don't observe any issue with the gluster volumes and can carry out I/O.

Complete logs below:

==========================================
# oc version
oc v3.9.0-0.53.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp43-231.lab.eng.blr.redhat.com:8443
openshift v3.9.0-0.53.0
kubernetes v1.9.1+a0ce1bc657
==========================================
Gluster node and corresponding pod down:

[root@dhcp43-231 ~]# oc get nodes
NAME                                STATUS     ROLES     AGE       VERSION
dhcp41-180.lab.eng.blr.redhat.com   Ready      <none>    92d       v1.9.1+a0ce1bc657
dhcp42-122.lab.eng.blr.redhat.com   NotReady   <none>    92d       v1.9.1+a0ce1bc657
dhcp43-135.lab.eng.blr.redhat.com   Ready      <none>    92d       v1.9.1+a0ce1bc657
dhcp43-195.lab.eng.blr.redhat.com   Ready      <none>    92d       v1.9.1+a0ce1bc657
dhcp43-231.lab.eng.blr.redhat.com   Ready      master    92d       v1.9.1+a0ce1bc657

[root@dhcp43-231 ~]# oc get pods
NAME                                          READY     STATUS    RESTARTS   AGE
glusterblock-storage-provisioner-dc-1-j4bwp   1/1       Running   1          89d
glusterfs-storage-bzbqk                       1/1       Running   1          89d
glusterfs-storage-cqh26                       1/1       Unknown   1          89d
glusterfs-storage-m9jjh                       1/1       Running   1          89d
heketi-storage-1-m9tdl                        1/1       Running   1          89d
postgresql-1-ppkvc                            1/1       Running   1          2h
==========================================
Gluster volume status:

[root@dhcp43-231 ~]# oc rsh po/glusterfs-storage-bzbqk
sh-4.2# gluster vol status
Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.41.180:/var/lib/heketi/mounts/vg_20815059c81012a55f4c4fa14725695f/brick_be8168fc1b53eee4a0d6d5c6ed26a098/brick   49152   0   Y   444
Brick 10.70.43.135:/var/lib/heketi/mounts/vg_14e52f050c99f33a77834d55e248b8b7/brick_5be313c9d73d824819733dc3b7682bcf/brick   49152   0   Y   433
Self-heal Daemon on localhost               N/A       N/A        Y       14217
Self-heal Daemon on 10.70.43.135            N/A       N/A        Y       14208

Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vol_2aaf562cfd128cc6a174022f18b683b0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.43.135:/var/lib/heketi/mounts/vg_c15ed928d624beddf3578195805cbdb5/brick_9771e962d1b1858eb7bfe8741c38358f/brick   49152   0   Y   433
Brick 10.70.41.180:/var/lib/heketi/mounts/vg_3aa9dfed04dfbc76fba92a6256a0d294/brick_f1ed81a1533f0e92e25d94006ec8a8ad/brick   49152   0   Y   444
Self-heal Daemon on localhost               N/A       N/A        Y       14217
Self-heal Daemon on 10.70.43.135            N/A       N/A        Y       14208

Task Status of Volume vol_2aaf562cfd128cc6a174022f18b683b0
------------------------------------------------------------------------------
There are no active volume tasks
==========================================
Carrying out I/O on the gluster volume:

sh-4.2$ pgbench -i -s 10
creating tables...
100000 of 1000000 tuples (10%) done (elapsed 0.73 s, remaining 6.57 s)
200000 of 1000000 tuples (20%) done (elapsed 1.65 s, remaining 6.59 s)
300000 of 1000000 tuples (30%) done (elapsed 2.69 s, remaining 6.27 s)
400000 of 1000000 tuples (40%) done (elapsed 3.69 s, remaining 5.54 s)
500000 of 1000000 tuples (50%) done (elapsed 4.69 s, remaining 4.69 s)
600000 of 1000000 tuples (60%) done (elapsed 5.70 s, remaining 3.80 s)
700000 of 1000000 tuples (70%) done (elapsed 6.73 s, remaining 2.88 s)
800000 of 1000000 tuples (80%) done (elapsed 7.75 s, remaining 1.94 s)
900000 of 1000000 tuples (90%) done (elapsed 8.85 s, remaining 0.98 s)
1000000 of 1000000 tuples (100%) done (elapsed 10.09 s, remaining 0.00 s)
vacuum...
set primary keys...
done.

sh-4.2$ df -h
Filesystem                                                                                         Size   Used  Avail  Use%  Mounted on
/dev/mapper/docker-253:3-389005-674752ba6e44c9317ccddba442c488971f53263e3773f970848aa4731b0eab64   10G    349M  9.7G   4%    /
tmpfs                                                                                              1.9G   0     1.9G   0%    /dev
tmpfs                                                                                              1.9G   0     1.9G   0%    /sys/fs/cgroup
/dev/vda3                                                                                          15G    3.1G  12G    21%   /etc/hosts
shm                                                                                                64M    4.0K  64M    1%    /dev/shm
10.70.41.180:vol_2aaf562cfd128cc6a174022f18b683b0                                                  1016M  230M  787M   23%   /var/lib/pgsql/data
tmpfs                                                                                              1.9G   16K   1.9G   1%    /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                              1.9G   0     1.9G   0%    /proc/scsi

sh-4.2$ pwd
/var/lib/pgsql/data/userdata

sh-4.2$ date
Tue May 29 10:40:24 UTC 2018

sh-4.2$ ls -ltr
total 103
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_commit_ts
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_dynshmem
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_serial
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_snapshots
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_twophase
drwx------. 4 1000080000 root  4096 May 29 08:19 pg_multixact
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_replslot
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_tblspc
drwx------. 4 1000080000 root  4096 May 29 08:19 pg_logical
-rw-------. 1 1000080000 root     4 May 29 08:19 PG_VERSION
-rw-------. 1 1000080000 root    88 May 29 08:19 postgresql.auto.conf
-rw-------. 1 1000080000 root  1636 May 29 08:19 pg_ident.conf
drwx------. 3 1000080000 root  4096 May 29 08:19 pg_xlog
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_clog
drwx------. 2 1000080000 root  4096 May 29 08:19 pg_subtrans
-rw-------. 1 1000080000 root  4666 May 29 08:20 pg_hba.conf
drwx------. 2 1000080000 root  4096 May 29 08:20 pg_log
-rw-------. 1 1000080000 root 22424 May 29 08:20 postgresql.conf
drwx------. 6 1000080000 root  4096 May 29 08:20 base
-rw-------. 1 1000080000 root    89 May 29 08:20 postmaster.pid
drwx------. 2 1000080000 root  4096 May 29 08:20 pg_notify
-rw-------. 1 1000080000 root    46 May 29 08:20 postmaster.opts
drwx------. 2 1000080000 root  4096 May 29 08:20 pg_stat
drwx------. 2 1000080000 root  4096 May 29 10:39 global
drwx------. 2 1000080000 root  4096 May 29 10:40 pg_stat_tmp
==========================================
As mentioned in comment #6, this looks like a setup issue.
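For reference, after the initialization above the same database can be kept under load while the node is down with a standard pgbench run (example invocation only, not the exact workload executed above; the options shown are standard pgbench flags):

pgbench -c 4 -j 2 -T 300   # 4 client connections, 2 worker threads, run for 5 minutes against the default database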
Based on comment 8 and comment 9, we have not been able to reproduce the bug. I will close the bug as not reproducible.