1565729 – volume creation fails - when a 5 node gluster cluster is reduced to 3 node by removing labels on 2 nodes

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	heketi
Version:	cns-3.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	CNS 3.9 Async
Assignee:	Michael Adam
QA Contact:	krishnaram Karthick

Reported:	2018-04-10 15:47 UTC by krishnaram Karthick
Modified:	2018-12-12 09:28 UTC
CC List:	7 users
Fixed In Version:	rhgs-volmanager-container-3.3.1-8.3
Last Closed:	2018-04-19 03:34:39 UTC

Attachments
heketi_logs (118.17 KB, text/plain) 2018-04-10 17:17 UTC, krishnaram Karthick
topology file (5.46 KB, text/plain) 2018-04-10 17:19 UTC, krishnaram Karthick

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:1178	0	None	None	None	2018-04-19 03:34:55 UTC

Description krishnaram Karthick 2018-04-10 15:47:22 UTC

Description of problem:
On a 5-node gluster cluster, the gluster pods on 2 nodes were brought down by removing the 'glusterfs: storage-host' label from those nodes. Effectively, 3 gluster pods are now up and running.

sh-4.2# heketi-cli volume create --size=20
Error: Unable to find a GlusterFS pod on host dhcp46-45.lab.eng.blr.redhat.com with a label key glusterfs-node


As shown above, the heketi volume create operation on this system failed.

Version-Release number of selected component (if applicable):
rpm -qa | grep 'heketi'
heketi-6.0.0-7.2.el7rhgs.x86_64
python-heketi-6.0.0-7.2.el7rhgs.x86_64
heketi-client-6.0.0-7.2.el7rhgs.x86_64


How reproducible:
2/2; this should be consistently reproducible.

Steps to Reproduce:
1. Create a 5-node CNS setup.
2. On 2 of the nodes, remove the label 'glusterfs: storage-host'.
3. Try to create a heketi volume.
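
The steps above can be sketched as a small shell script. This is only an illustration: the node names are hypothetical placeholders, and it assumes the label is removed with 'oc label' (a trailing '-' after the key deletes a label). The script prints the commands by default instead of running them. In the report, heketi-cli was run from a shell inside the heketi pod.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
# Set RUN="" in the environment to run the commands against a real cluster.
RUN="${RUN-echo}"

# Remove the 'glusterfs' label (value 'storage-host') from one node.
# A trailing '-' after the key is how 'oc label' deletes a label.
remove_gluster_label() {
    $RUN oc label node "$1" glusterfs-
}

# Attempt the same volume creation as in the report.
create_volume() {
    $RUN heketi-cli volume create --size=20
}

# Hypothetical node names; substitute two of the five storage nodes.
reproduce() {
    remove_gluster_label node4.example.com
    remove_gluster_label node5.example.com
    create_volume
}
```

Calling 'reproduce' prints the three commands in order, so they can be reviewed before being run for real.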

Actual results:
Volume creation fails.

Expected results:
heketi should pick only the nodes on which gluster pods are up.

Additional info:
heketi logs and topology info will be attached.

Comment 2 krishnaram Karthick 2018-04-10 17:17:28 UTC

Created attachment 1419998
heketi_logs

Comment 3 krishnaram Karthick 2018-04-10 17:19:21 UTC

Created attachment 1419999
topology file

Comment 4 John Mulligan 2018-04-10 20:56:45 UTC

Two items I noticed looking through the logs:

1) The node health monitor thread has not been started. This is probably due to an "old" heketi config that lacks the parameter needed to enable this thread. With the monitor on, the volume create operation will not try to use nodes it knows to be unavailable.

2) The volume create operation retried correctly, but apparently never hit a combination of nodes in which all nodes were up. We may need to increase the number of retries to improve the chances of selecting a working set of nodes.

But before we work on #2, we should retest with #1 working.
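
As a rough illustration of why item 2 depends on the retry count, the sketch below estimates the chance that at least one attempt lands only on nodes that are up. It is a simplified model, not a description of heketi's allocator: it assumes a 3-way replicated volume with one brick per node, that each attempt picks 3 distinct nodes uniformly at random, and that retries are independent. The function names are made up for this example.

```shell
#!/bin/sh
# binom N K: number of ways to choose K items out of N.
binom() {
    awk -v n="$1" -v k="$2" 'BEGIN {
        r = 1
        for (i = 1; i <= k; i++) r = r * (n - k + i) / i
        printf "%d\n", r
    }'
}

# success_after TOTAL UP REPLICA RETRIES
# Chance that at least one of RETRIES independent, uniformly random
# placements of REPLICA bricks uses only the UP nodes out of TOTAL.
# (Simplified model; assumptions are listed in the text above.)
success_after() {
    awk -v good="$(binom "$2" "$3")" -v all="$(binom "$1" "$3")" \
        -v tries="$4" 'BEGIN {
        p = good / all
        printf "%.4f\n", 1 - (1 - p) ^ tries
    }'
}

# 5 nodes, 3 of them up, replica 3:
#   success_after 5 3 3 1    prints 0.1000
#   success_after 5 3 3 10   prints 0.6513
```

Under these assumptions only 1 of the 10 possible node combinations is fully up, so a handful of retries can easily all fail. With the health monitor on, down nodes are excluded before selection and the question does not arise.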

Comment 5 krishnaram Karthick 2018-04-17 03:58:16 UTC

This issue occurred because node health monitoring was not enabled. With rhgs-volmanager-container-3.3.1-8.3, it is enabled by default.

Heketi 6.0.0
[heketi] INFO 2018/04/16 14:34:34 Loaded kubernetes executor
[heketi] ERROR 2018/04/16 14:34:34 /src/github.com/heketi/heketi/apps/glusterfs/app.go:100: invalid log level: 
[heketi] INFO 2018/04/16 14:34:34 Block: Auto Create Block Hosting Volume set to true
[heketi] INFO 2018/04/16 14:34:34 Block: New Block Hosting Volume size 100 GB
[heketi] INFO 2018/04/16 14:34:34 GlusterFS Application Loaded
[heketi] INFO 2018/04/16 14:34:34 Started Node Health Cache Monitor
Listening on port 8080

Verified the bug in rhgs-volmanager-container-3.3.1-8.4.
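
The startup message in the log above can also be checked for mechanically. The helper below is a minimal sketch that assumes the heketi log text is supplied on standard input; the function name is made up for this example.

```shell
#!/bin/sh
# Exit status 0 if the log text on stdin contains the monitor startup
# message seen in the heketi log, non-zero otherwise.
monitor_started() {
    grep -q 'Started Node Health Cache Monitor'
}
```

For example, with a hypothetical pod name: oc logs heketi-1-abcde | monitor_started && echo "node health monitor enabled"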

Comment 8 errata-xmlrpc 2018-04-19 03:34:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1178
