Bug 1127328
Summary: | BVT: Remove-brick operation in top-profile tests is failing complaining "One or more nodes do not support the required op-version" | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Lalatendu Mohanty <lmohanty>
Component: | core | Assignee: | Kaushal <kaushal>
Status: | CLOSED DUPLICATE | QA Contact: | Lalatendu Mohanty <lmohanty>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | rhgs-3.0 | CC: | amukherj, kaushal, kparthas, lmohanty, nsathyan, rhs-bugs, sasundar, storage-qa-internal, vagarwal
Target Milestone: | --- | Keywords: | ZStream
Target Release: | RHGS 3.0.3 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.6.0.31-1 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-11-26 11:41:22 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1109742 | |
Bug Blocks: | | |
Description
Lalatendu Mohanty
2014-08-06 16:36:12 UTC
Here is how I reproduced this issue manually. I am not sure whether this is also how the BVT code hit it:

1. Installed 2 RHS 3.0 nodes (freshly installed from ISO).
2. Installed an RHS 2.1 node.
3. Probed an RHS 3.0 node from the RHS 2.1 U2 node (the cluster op-version now drops to 2).
4. Detached the node.
5. Probed an RHS 3.0 node from the other RHS 3.0 node (the one detached earlier from the RHS 2.1 U2 cluster). Note that the cluster op-version remains 2 here, even though the cluster is capable of 30000.
6. Created a new distributed volume and started it. readdir-ahead was not enabled by default on the volume.
7. Tried remove-brick and ran into the same issue:

[Mon Aug 11 09:41:41 UTC 2014 root@:~ ] # gluster v i
Volume Name: dvol
Type: Distribute
Volume ID: d19723ee-5a31-4e62-af8c-64613d0399f5
Status: Started
Snap Volume: no
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.131:/rhs/brick1/br1
Brick2: 10.70.37.58:/rhs/brick1/br1
Options Reconfigured:
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[Mon Aug 11 09:43:10 UTC 2014 root@:~ ] # gluster volume remove-brick dvol 10.70.37.58:/rhs/brick1/br1 force
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: failed: One or more nodes do not support the required op-version. Cluster op-version must atleast be 30000.

From comment 0, I see that "gluster volume info" does not show readdir-ahead enabled on the volume (I suppose the cluster's op-version was not 3, because if it were 3, readdir-ahead would be enabled on the volume by default). From comment 2, we can suspect that the BVT test probed an RHS 2.1 U2 node at some point during the test, lowering the cluster op-version to 2, and then detached it. The remove-brick code on RHS 3.0 would then fail, as it requires op-version 3.
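A quick way to confirm and correct this state on an affected node is sketched below. This is only an illustrative sequence assuming a glusterfs 3.6-era RHS 3.0 install; the glusterd.info path is the standard location, but the option name and target value should be verified against the installed version before use.

```sh
# Read the op-version this glusterd is currently operating at
# (persisted in glusterd's info file on every node).
grep operating-version /var/lib/glusterd/glusterd.info

# Check which glusterfs build is installed; RHS 3.0 ships glusterfs 3.6.x,
# whose maximum op-version is 30000.
gluster --version

# If every node runs RHS 3.0 bits but the cluster is still stuck at the
# older op-version (2 in this report), raise it explicitly so op-version
# gated behaviour (the remove-brick check, readdir-ahead defaults) applies.
gluster volume set all cluster.op-version 30000
```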
For BVT tests we use freshly installed servers. The servers are provisioned through Beaker using repos and a kickstart file. The repositories include the latest glusterfs package repo and an RHS ISO (in this case the RHS 3.0 ISO), so all the machines involved in the test had the same version of the Gluster RPMs from the beginning. This can be validated by looking at the install.log files of each machine in the following Beaker job. So I am not sure the conclusion drawn in comment #3 is right. However, I cannot tell whether there was any issue with the provisioning of the machines during that particular test, as the logs do not show any errors. https://beaker.engineering.redhat.com/jobs/713334

Just to add, BVT has hit this issue only once in the last 7 runs, so the issue is very rare. Also, the assumption in comment #2 is wrong, as the base ISO was "RHSS-3.0-20140624.n.0", which is the RHS 3.0 ISO and should have op-version "30000". Check the Beaker job mentioned in comment #4 for details.

Lala, could any other RHS 2.1 machine have accidentally probed the machines you were running tests on? Based on what has been uncovered here, this is the most probable cause I can think of. If there was an RHS 2.1 machine which believed that these machines were part of its cluster, it would attempt to connect to the test machines, and this connection attempt can cause the op-version to be lowered. This issue is already being tracked in bug 1109742. Can you confirm that this absolutely could not have happened?

Kaushal, BVT did not have any RHS 2.1 node in the cluster, that is for sure. But I am not sure whether someone explicitly probed one of the machines in the cluster while the BVT test was running. That is very unlikely but not impossible, and I do not have enough information (logs) to confirm it.

Wouldn't the glusterd logs give a hint whether an RHS 2.1 node tried to connect to the cluster?

Atin, the particular test cases (the failed ones) did not upload the gluster logs to Beaker, so we do not have the logs. After the failure I noticed that too and fixed it in the automation, so the next time this reproduces we can check the gluster logs.

https://code.engineering.redhat.com/gerrit/#/c/35665/ fixes this problem and has been merged downstream, hence moving the status to MODIFIED.

I never encountered this issue apart from the failure instance on which this bug was raised, so it looks more like the assumption drawn in comment #7. I think this bug is a duplicate of BZ 1109742. Kp, can you please confirm?

*** This bug has been marked as a duplicate of bug 1109742 ***
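If this does reproduce with logs available, a check along the following lines could help confirm or rule out the stale-peer theory discussed above. This is a sketch only: the log and state-file paths match glusterfs 3.6-era defaults, and the grep patterns are guesses at relevant message fragments rather than exact strings.

```sh
# Peers glusterd currently knows about; an unexpected RHS 2.1 host here
# would support the accidental-probe theory.
gluster peer status

# Peer records persisted on disk, including any stale entries.
ls /var/lib/glusterd/peers/

# The operating op-version recorded by glusterd on this node.
grep operating-version /var/lib/glusterd/glusterd.info

# Search the glusterd log for op-version changes or handshakes with
# unknown/older peers around the time of the failure.
grep -iE 'op-version|handshake|peer' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
```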