Bug 1473172
Summary: | [Scale Testing] glusterd fails to start in gluster pod when rebooted with around 600 volumes | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | krishnaram Karthick <kramdoss> |
Component: | rhgs-server-container | Assignee: | Ramakrishna Reddy Yekulla <rreddy> |
Status: | CLOSED NOTABUG | QA Contact: | krishnaram Karthick <kramdoss> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | cns-3.6 | CC: | amukherj, asrivast, hchiramm, kramdoss, madam, pprakash, rcyriac, rhs-bugs, rreddy, sankarshan |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-09-18 07:44:52 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
krishnaram Karthick
2017-07-20 07:42:00 UTC
On the rebooted node, pvscan --cache && pvs && vgscan --cache && vgs results in the following messages:

10.70.47.49 14:04:27 up 1:45, 1 user, load average: 0.12, 0.23, 0.52

  PV         VG                                  Fmt  Attr PSize    PFree
  /dev/sda2  rhel_dhcp47-104                     lvm2 a--  <99.00g     4.00m
  /dev/sdb   vg_rhel_dhcp47--104-var             lvm2 a--  <60.00g  1020.00m
  /dev/sdc1  docker-vg                           lvm2 a--  <50.00g    30.00g
  /dev/sdd   vg_161d21b3068c4ea60ecba72c1450aedf lvm2 a--   99.87g    39.39g
  /dev/sde   vg_cd31d2bc8e2d9def002ba6d2c5db7e19 lvm2 a--   99.87g    92.00m
  /dev/sdg   vg_ac2b2a4bb2a522005033a9ba9eb6d8a6 lvm2 a--  511.87g   <59.36g

  Reading volume groups from cache.
  Found volume group "vg_ac2b2a4bb2a522005033a9ba9eb6d8a6" using metadata type lvm2
  Found volume group "docker-vg" using metadata type lvm2
  Found volume group "vg_161d21b3068c4ea60ecba72c1450aedf" using metadata type lvm2
  Found volume group "rhel_dhcp47-104" using metadata type lvm2
  Found volume group "vg_cd31d2bc8e2d9def002ba6d2c5db7e19" using metadata type lvm2
  Found volume group "vg_rhel_dhcp47--104-var" using metadata type lvm2

  VG                                  #PV #LV #SN Attr   VSize    VFree
  docker-vg                             1   1   0 wz--n-  <50.00g   30.00g
  rhel_dhcp47-104                       1   3   0 wz--n-  <99.00g    4.00m
  vg_161d21b3068c4ea60ecba72c1450aedf   1 120   0 wz--n-   99.87g   39.39g
  vg_ac2b2a4bb2a522005033a9ba9eb6d8a6   1 898   0 wz--n-  511.87g  <59.36g
  vg_cd31d2bc8e2d9def002ba6d2c5db7e19   1 196   0 wz--n-   99.87g   92.00m
  vg_rhel_dhcp47--104-var               1   1   0 wz--n-  <60.00g 1020.00m

  WARNING: lvmetad connection failed, cannot reconnect.
  WARNING: lvmetad cannot be used due to error: Connection reset by peer
  WARNING: To avoid corruption, restart lvmetad (or disable with use_lvmetad=0).
  WARNING: lvmetad connection failed, cannot reconnect.
  WARNING: lvmetad cannot be used due to error: Connection reset by peer
  WARNING: To avoid corruption, restart lvmetad (or disable with use_lvmetad=0).
  WARNING: lvmetad connection failed, cannot reconnect.
  WARNING: lvmetad cannot be used due to error: Connection reset by peer
  WARNING: To avoid corruption, restart lvmetad (or disable with use_lvmetad=0).
  WARNING: lvmetad connection failed, cannot reconnect.

[root@dhcp47-49 ~]# pvscan
  PV /dev/sdg    VG vg_ac2b2a4bb2a522005033a9ba9eb6d8a6   lvm2 [511.87 GiB / <59.36 GiB free]
  PV /dev/sdc1   VG docker-vg                             lvm2 [<50.00 GiB / 30.00 GiB free]
  PV /dev/sdd    VG vg_161d21b3068c4ea60ecba72c1450aedf   lvm2 [99.87 GiB / 39.39 GiB free]
  PV /dev/sda2   VG rhel_dhcp47-104                       lvm2 [<99.00 GiB / 4.00 MiB free]
  PV /dev/sde    VG vg_cd31d2bc8e2d9def002ba6d2c5db7e19   lvm2 [99.87 GiB / 92.00 MiB free]
  PV /dev/sdb    VG vg_rhel_dhcp47--104-var               lvm2 [<60.00 GiB / 1020.00 MiB free]
  Total: 6 [920.60 GiB] / in use: 6 [920.60 GiB] / in no VG: 0 [0 ]

[root@dhcp47-49 ~]# vgscan
  Reading volume groups from cache.
  Found volume group "vg_ac2b2a4bb2a522005033a9ba9eb6d8a6" using metadata type lvm2
  Found volume group "docker-vg" using metadata type lvm2
  Found volume group "vg_161d21b3068c4ea60ecba72c1450aedf" using metadata type lvm2
  Found volume group "rhel_dhcp47-104" using metadata type lvm2
  Found volume group "vg_cd31d2bc8e2d9def002ba6d2c5db7e19" using metadata type lvm2
  Found volume group "vg_rhel_dhcp47--104-var" using metadata type lvm2

I cannot say conclusively, but I think this is the explanation: on the VG vg_cd31d2bc8e2d9def002ba6d2c5db7e19 there are 196 LVs, that is 98 thin pools and 98 bricks. This VG is only 99 GB in size and practically full. The metadata for all the LVs is stored on the same VG, and some resource contention is happening that is rendering lvmetad useless.

An easy verification is to use disks with double the required capacity for 600 volumes and try again.
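For anyone reproducing this, the per-VG LV counts and fullness cited above can be confirmed with standard LVM commands on the affected node, and the lvmetad warnings themselves point at the remedy they expect. A minimal sketch, purely illustrative and not part of the original triage; the VG name is copied from the vgs output above:

```
# Show how many LVs each VG carries and how full it is
vgs -o vg_name,lv_count,vg_size,vg_free

# Break down the suspect VG by segment type (thin-pool vs. thin bricks);
# the VG name is taken from the output above
lvs --noheadings -o segtype vg_cd31d2bc8e2d9def002ba6d2c5db7e19 | sort | uniq -c

# The warnings themselves suggest restarting lvmetad
# (or disabling it with use_lvmetad=0 in /etc/lvm/lvm.conf)
systemctl restart lvm2-lvmetad.service
```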
If we don't see this problem when disks aren't full, it is surely an LVM bug and we will have to approach LVM experts about the limits. krk, can we try the above scenario please?

Does the problem occur on all the three machines?
10.70.47.49
10.70.47.72
10.70.46.248

(In reply to Raghavendra Talur from comment #8)
> Does the problem occur on all the three machines?
> 10.70.47.49
> 10.70.47.72
> 10.70.46.248

Yes. This is seen on all three machines.

Test runs with 625 volumes and a node reboot
===========================================
- I rebooted the VM dhcp42-97.lab.eng.blr.redhat.com via virt-manager.
- The test run with 625 volumes and a node reboot was back online in approximately 4 minutes.

Thanks a lot Ramky!! To summarize, two scenarios were performed:
1) Pod restart
2) Node restart

1) Pod restart: the Gluster pod which hosts bricks for 625 volumes came back in 6 minutes in this setup after deleting the pod via the "oc delete pod <glusterfs-*>" command.
2) Node restart: the Gluster pod which hosts bricks for 625 volumes came up in 3 minutes when we performed a node reboot.

--snip--
[root@dhcp42-112 ~]# oc get pods
NAME                                  READY  STATUS   RESTARTS  AGE
glusterblock-provisioner-dc-1-nz9mv   1/1    Running  2         20h
glusterfs-359cp                       1/1    Running  2         20h
glusterfs-3bp0t                       1/1    Running  0         55m
glusterfs-xss4t                       1/1    Running  2         20h
heketi-1-sq557                        1/1    Running  79        18h
storage-project-router-1-4kmsq        1/1    Running  2         23h

[root@dhcp42-112 ~]# oc get nodes
NAME                                STATUS                     AGE  VERSION
dhcp41-193.lab.eng.blr.redhat.com   Ready                      23h  v1.6.1+5115d708d7
dhcp42-112.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   23h  v1.6.1+5115d708d7
dhcp42-97.lab.eng.blr.redhat.com    NotReady                   23h  v1.6.1+5115d708d7
dhcp43-171.lab.eng.blr.redhat.com   Ready                      23h  v1.6.1+5115d708d7

[root@dhcp42-112 ~]# oc get nodes
NAME                                STATUS                     AGE  VERSION
dhcp41-193.lab.eng.blr.redhat.com   Ready                      23h  v1.6.1+5115d708d7
dhcp42-112.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   23h  v1.6.1+5115d708d7
dhcp42-97.lab.eng.blr.redhat.com    Ready                      23h  v1.6.1+5115d708d7
dhcp43-171.lab.eng.blr.redhat.com   Ready                      23h  v1.6.1+5115d708d7

[root@dhcp42-112 ~]# oc get pods
NAME                                  READY  STATUS   RESTARTS  AGE
glusterblock-provisioner-dc-1-nz9mv   1/1    Running  2         20h
glusterfs-359cp                       1/1    Running  2         20h
glusterfs-3bp0t                       1/1    Running  0         56m
glusterfs-xss4t                       0/1    Running  3         20h
heketi-1-sq557                        1/1    Running  79        18h
storage-project-router-1-4kmsq        1/1    Running  2         23h

[root@dhcp42-112 ~]# oc get pods -w
NAME                                  READY  STATUS   RESTARTS  AGE
glusterblock-provisioner-dc-1-nz9mv   1/1    Running  2         20h
glusterfs-359cp                       1/1    Running  2         20h
glusterfs-3bp0t                       1/1    Running  0         56m
glusterfs-xss4t                       0/1    Running  3         20h
heketi-1-sq557                        1/1    Running  79        18h
storage-project-router-1-4kmsq        1/1    Running  2         23h
glusterfs-xss4t                       1/1    Running  3         20h
--/snip--

Based on the test results, I am moving this back to ON_QA.
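For anyone rerunning this scale test, the two scenarios above reduce to a handful of oc commands. A minimal sketch, assuming the glusterfs pods are managed by a DaemonSet as in a typical CNS deployment; the pod name is taken from the listing above and the project name is inferred from the router pod name, so both would differ per setup:

```
# Scenario 1: pod restart. Delete a gluster pod and let its controller
# recreate it, then watch until the replacement reports READY 1/1.
oc project storage-project
oc delete pod glusterfs-3bp0t
oc get pods -w

# Scenario 2: node restart. Reboot the node hosting the pod, then watch
# the node go NotReady and return to Ready, and the pod return to 1/1.
oc get nodes -w
oc get pods -o wide
```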
I'll answer the latter part of the question first. The test mentioned was performed with 625 volumes on a single device.

[root@dhcp42-112 ~]# heketi-cli topology info | grep "/dev"
Id:1bb70c3198d3c4d7d5d7ac529a35ef9b   Name:/dev/vdc   State:online   Size (GiB):1199   Used (GiB):632   Free (GiB):566
Id:2591462d49daa7930d4eb7b770cac00b   Name:/dev/vdc   State:online   Size (GiB):1199   Used (GiB):632   Free (GiB):566
Id:bfa9ce5c8b955141709cdd6658c3c3c5   Name:/dev/vdc   State:online   Size (GiB):1199   Used (GiB):632   Free (GiB):566

===

For the maximum number of volumes per device there are quite a few correlations, chiefly the amount of memory available on the nodes and the amount of disk space. Most of the dev test environments are virtual; in all the tests above the VMs had about 16 GB of RAM. I could create an additional 300 volumes without any errors, but when a test VM had only about 4 GB of memory, volume creation used to fail at 157-185 volumes on a single device. In the present test setup, my gut feeling is that we can go up to another 400 volumes on a single disk. If you want a deterministic number, we need to leave it to the performance team, who can give a conclusive answer, as they test on bare metal and have access to repeatable/reproducible test environments.

I'm closing this bug as NOTABUG, as on a single device I have been able to perform pod restart and node restart with the following builds:

Build Numbers ::
==============================
[root@dhcp42-112 ~]# rpm -qa | grep 'cns-deploy'
cns-deploy-5.0.0-38.el7rhgs.x86_64
[root@dhcp42-112 ~]# rpm -qa | grep 'heketi-client'
heketi-client-5.0.0-11.el7rhgs.x86_64

Docker Images ::
================
[root@dhcp42-112 ~]# docker images
REPOSITORY                                                                                  TAG       IMAGE ID      CREATED      SIZE
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7                3.3.0-21  7a58514538de  4 days ago   407.7 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-volmanager-rhel7            3.3.0-13  f3afaa5a05f7  6 days ago   426.7 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7                3.3.0-20  1d748122341a  6 days ago   407.7 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-gluster-block-prov-docker   3.3.0-10  e571939c4800  2 weeks ago  395.9 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-s3-server-rhel7             3.3.0-7   c684dc64abff  2 weeks ago  447.3 MB

Karthick, thanks for verifying this bug!