Bug 1337495
| Summary: | [Volume Scale] gluster node randomly going to Disconnected state after scaling to more than 290 gluster volumes | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Prasanth <pprakash> |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED ERRATA | QA Contact: | Prasanth <pprakash> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.1 | CC: | annair, asrivast, pousley, pprakash, rcyriac, rhinduja, rhs-bugs, storage-qa-internal, vbellur |
| Target Milestone: | --- | ||
| Target Release: | RHGS 3.2.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | glusterfs-3.8.4-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-03-23 05:32:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1351522 | ||
The RCA for this is the same as for BZ 1336267. http://review.gluster.org/#/c/14849/ fixes this issue too.

Upstream mainline : http://review.gluster.org/14849
Upstream 3.8 : http://review.gluster.org/14860

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.

The reported issue seems to be fixed in glusterfs-3.8.4. I was able to scale gluster volumes using heketi-cli even beyond 300 volumes, and the gluster nodes no longer go into Disconnected state.

######################
# gluster --version
glusterfs 3.8.4 built on Feb 20 2017 03:15:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

# heketi-cli volume list |wc -l
500

# gluster volume list |wc -l
500

# gluster peer status
Number of Peers: 2

Hostname: dhcp46-150.lab.eng.blr.redhat.com
Uuid: c4bdf1ad-04ab-4301-b9fe-f144272079ef
State: Peer in Cluster (Connected)

Hostname: 10.70.47.163
Uuid: fcd44049-a3b9-4f85-851c-79915812cf3f
State: Peer in Cluster (Connected)

# gluster volume info vol291
Volume Name: vol291
Type: Replicate
Volume ID: 48eeed36-e50e-429b-b474-10e4e336ffca
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.46.150:/var/lib/heketi/mounts/vg_1dae8c1c2feb1c16338a0440f64bcfed/brick_6f278e35a09ea7cff70b008784cb99c1/brick
Brick2: 10.70.47.161:/var/lib/heketi/mounts/vg_cf48e4fe475f69149d157bbfae86db75/brick_5cc9403c371cff1f2c4b6504bca5f2e9/brick
Brick3: 10.70.47.163:/var/lib/heketi/mounts/vg_94e77f6c32ac54b0c819ceee0899981f/brick_a6fb045df6a79ee353031f9da40309c5/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

# gluster volume info vol500
Volume Name: vol500
Type: Replicate
Volume ID: 72c9be57-3284-4e12-8449-84889a782c23
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.163:/var/lib/heketi/mounts/vg_814d6fa82429363cf09aaafe5ba7d850/brick_1f6b1df64468e950ee64aebe57f498d9/brick
Brick2: 10.70.46.150:/var/lib/heketi/mounts/vg_60357bad265972bb79fd3155e6d473fa/brick_d14fc99faef482004d606480d405df7d/brick
Brick3: 10.70.47.161:/var/lib/heketi/mounts/vg_893e765b342697a5086bf56d58332501/brick_3f4e9c68a58d6435ab377d85d08f90ff/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

# gluster pool list
UUID                                    Hostname                                State
c4bdf1ad-04ab-4301-b9fe-f144272079ef    dhcp46-150.lab.eng.blr.redhat.com       Connected
fcd44049-a3b9-4f85-851c-79915812cf3f    10.70.47.163                            Connected
a614a3d8-478f-48d4-8542-e6d9c3b526ad    localhost                               Connected
######################

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
Description of problem:
[Volume Scale for Aplo] gluster node randomly going to Disconnected state after scaling to more than 290 gluster volumes. This, in turn, might have affected the creation of subsequent volumes after 290.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-5.el7rhgs.x86_64
glusterfs-server-3.7.9-5.el7rhgs.x86_64

How reproducible:
Mostly

Steps to Reproduce:
1. A gluster cluster of 4 RHGS 3.1.3 nodes having glusterd.service MemoryLimit=32G
2. Using heketi-cli, try to create and start around 300 gluster volumes in a loop:
for i in {1..300}; do heketi-cli volume create --name=vol$i --size=10 --durability="replicate" --replica=3; done
3. Check for command output and heketi logs

Actual results:
Starting vol291 after its creation took a while. During this time, gluster pool list showed one of the nodes in Disconnected state even though glusterd was running on it.

------------
[root@dhcp42-85 ~]# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/glusterd.service.d
           └─50-MemoryLimit.conf
   Active: active (running) since Wed 2016-05-18 19:35:20 IST; 5h 14min ago
  Process: 15144 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 15145 (glusterd)
   Memory: 29.4G (limit: 32.0G)
   CGroup: /system.slice/glusterd.service
------------

########
[root@dhcp43-158 ~]# gluster pool list
UUID                                    Hostname        State
4b494dd7-09e6-4d7d-8834-218534548912    10.70.42.222    Connected
36c62d2f-7baa-4138-8697-9509bf249d47    10.70.42.85     Disconnected
7fb3cbba-b377-4657-9a2a-d21f9c115388    10.70.43.162    Connected
3b6f62bd-4df1-40b1-9ec5-fed8018d7816    localhost       Connected

[root@dhcp43-158 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.42.222
Uuid: 4b494dd7-09e6-4d7d-8834-218534548912
State: Peer in Cluster (Connected)
Other names:
dhcp42-222.lab.eng.blr.redhat.com

Hostname: 10.70.42.85
Uuid: 36c62d2f-7baa-4138-8697-9509bf249d47
State: Peer in Cluster (Disconnected)

Hostname: 10.70.43.162
Uuid: 7fb3cbba-b377-4657-9a2a-d21f9c115388
State: Peer in Cluster (Connected)

[root@dhcp42-85 ~]# gluster pool list
Error : Request timed out

However, the other nodes were showing it as Connected.
########

Expected results:
A gluster node should not go into Disconnected state while the glusterd service is up and running.
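
To pin down the volume count at which a peer first drops, the reproduction loop can check the pool after every create. The following is only a sketch, not part of the original report: the function names disconnected_peers and scale_and_watch are hypothetical, and it assumes the three-column `gluster pool list` layout (UUID, Hostname, State) shown above. The heketi-cli invocation is the one from the steps to reproduce.

```shell
# Print the hostname of every peer whose State column is "Disconnected".
# Reads `gluster pool list` output on stdin, so it can be exercised offline.
# NR > 1 skips the "UUID Hostname State" header line.
disconnected_peers() {
    awk 'NR > 1 && $NF == "Disconnected" { print $2 }'
}

# Hypothetical helper: create vol1..volN with the heketi-cli command from the
# reproduction steps and stop at the first volume after which a peer is down.
scale_and_watch() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        heketi-cli volume create --name="vol$i" --size=10 \
            --durability="replicate" --replica=3 || return 1
        down=$(gluster pool list | disconnected_peers)
        if [ -n "$down" ]; then
            echo "after vol$i, disconnected: $down"
            return 2
        fi
        i=$((i + 1))
    done
}
```

Running `scale_and_watch 300` on a node with both CLIs installed would mirror the original loop. One limitation: if `gluster pool list` itself fails with "Request timed out", as it did on dhcp42-85, the filter sees no peer rows and reports nothing, so the check is best run from a node other than the one suspected of hanging.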