Description of problem:
The remove-brick operation in the top-profile tests fails with "One or more nodes do not support the required op-version", even though all nodes are freshly installed and run glusterfs-server-3.6.0.27-1.el6rhs.x86_64.

Error:
volume remove-brick commit force: failed: One or more nodes do not support the required op-version. Cluster op-version must atleast be 30000.

Version-Release number of selected component (if applicable):
glusterfs-server-3.6.0.27-1.el6rhs.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Create a 3x2 dist-rep volume

These are the steps the test runs:
gluster volume profile $volname start
gluster volume profile $volname info
# Create data on the client mount
gluster volume stop $volname
gluster volume start $volname
gluster volume profile $volname info
peer_probe $PEER
add_brick $volname $NUM_ADD_BRICKS
gluster volume rebalance $volname start
gluster volume info $volname
gluster volume profile $volname stop
gluster volume info $volname
gluster volume remove-brick $volname $remove_brick_list force

Actual results:

Expected results:

Additional info:
:: [ 21:07:23 ] :: volinfo before remove-brick:
Volume Name: hosdu
Type: Distributed-Replicate
Volume ID: 69464784-e846-401d-843b-a3aacb2ccc1d
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: rhsauto019.lab.eng.blr.redhat.com:/bricks/hosdu_brick0
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick1
Brick3: rhsauto019.lab.eng.blr.redhat.com:/bricks/hosdu_brick2
Brick4: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick3
Brick5: rhsauto019.lab.eng.blr.redhat.com:/bricks/hosdu_brick4
Brick6: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick5
Brick7: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick6
Brick8: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7
Options Reconfigured:
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

volume remove-brick commit force: failed: One or more nodes do not support the required op-version. Cluster op-version must atleast be 30000.
:: [ FAIL ] :: Running 'gluster volume remove-brick hosdu rhsauto019.lab.eng.blr.redhat.com:/bricks/hosdu_brick0 rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick1 force --mode=script' (Expected 0, got 1)
:: [ PASS ] :: Running 'sleep 5' (Expected 0, got 0)
:: [ 21:07:29 ] :: volinfo after remove-brick:
Volume Name: hosdu
Type: Distributed-Replicate
Volume ID: 69464784-e846-401d-843b-a3aacb2ccc1d
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: rhsauto019.lab.eng.blr.redhat.com:/bricks/hosdu_brick0
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick1
Brick3: rhsauto019.lab.eng.blr.redhat.com:/bricks/hosdu_brick2
Brick4: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick3
Brick5: rhsauto019.lab.eng.blr.redhat.com:/bricks/hosdu_brick4
Brick6: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick5
Brick7: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick6
Brick8: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7
Options Reconfigured:
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
Here is how I reproduced this issue manually (I am not sure whether this is also how the BVT run hit it):
1. Installed 2 RHS 3.0 nodes (freshly installed from ISO).
2. Installed an RHS 2.1 node.
3. Probed an RHS 3.0 node from the RHS 2.1 U2 node (the cluster op-version now becomes 2).
4. Detached the node.
5. Probed an RHS 3.0 node from another RHS 3.0 node (the one detached earlier from the RHS 2.1 U2 cluster). Notice that the cluster op-version remains 2 here, even though the cluster is capable of 30000.
6. Created a new distributed volume and started it. I see readdir-ahead was not enabled by default on the volume.
7. Tried remove-brick and ran into the same issue:

[Mon Aug 11 09:41:41 UTC 2014 root@:~ ] # gluster v i
Volume Name: dvol
Type: Distribute
Volume ID: d19723ee-5a31-4e62-af8c-64613d0399f5
Status: Started
Snap Volume: no
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.131:/rhs/brick1/br1
Brick2: 10.70.37.58:/rhs/brick1/br1
Options Reconfigured:
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
[Mon Aug 11 09:43:10 UTC 2014 root@:~ ] # gluster volume remove-brick dvol 10.70.37.58:/rhs/brick1/br1 force
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: failed: One or more nodes do not support the required op-version. Cluster op-version must atleast be 30000.
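One symptom noted in step 6 above is that readdir-ahead is missing from the "Options Reconfigured" section of "gluster volume info" when the cluster op-version is stuck low. A small sketch of checking for that (the parser function is a hypothetical helper, not part of gluster; the sample text is taken from the reproduction above):

```python
def parse_volume_options(volinfo_text):
    """Parse the 'Options Reconfigured:' section of `gluster volume info`
    output into a dict. Hypothetical helper for illustration only."""
    options = {}
    in_options = False
    for line in volinfo_text.splitlines():
        line = line.strip()
        if line == "Options Reconfigured:":
            in_options = True
            continue
        if in_options and ": " in line:
            key, value = line.split(": ", 1)
            options[key] = value
    return options

# Sample taken from the manual reproduction: readdir-ahead is absent,
# hinting that the cluster op-version never reached 30000.
sample = """Volume Name: dvol
Type: Distribute
Status: Started
Options Reconfigured:
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable"""

opts = parse_volume_options(sample)
print("performance.readdir-ahead" in opts)  # False on the affected volume
```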
From comment 0, I see that "gluster volume info" does not show readdir-ahead enabled on the volume (I suppose the cluster op-version was below 30000, because at 30000 readdir-ahead would be enabled on the volume by default). From comment 2, we can suspect that the BVT test probed an RHS 2.1 U2 node at some point during the run, lowering the cluster op-version to 2, and then detached it. Remove-brick on RHS 3.0 would then fail, since it requires op-version 30000.
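The lower-and-stick behaviour described above can be modelled with a small sketch. This is only an illustration of the observed behaviour (the real negotiation lives in glusterd's C code, and the function name here is hypothetical): the cluster settles on the lowest op-version any member supports, and it is not raised automatically afterwards, even once the limiting peer is detached.

```python
def negotiate_cluster_op_version(current_op_version, peer_max_op_versions):
    """Illustrative model: a peer probe can only lower the cluster
    op-version to accommodate the least-capable member; detaching
    that member later does not raise it back automatically."""
    return min([current_op_version] + peer_max_op_versions)

# An RHS 3.0 cluster (op-version 30000) is probed from an RHS 2.1
# node (max op-version 2): the cluster drops to 2.
lowered = negotiate_cluster_op_version(30000, [2])  # -> 2

# After detaching the 2.1 node, probing another RHS 3.0 node does
# not restore 30000; the cluster stays at 2, so remove-brick
# (which requires 30000) fails.
after_detach = negotiate_cluster_op_version(lowered, [30000])  # -> 2
```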
For BVT tests we use freshly installed servers. The servers are provisioned through Beaker using repos and a kickstart file. The repositories include the latest repo for the glusterfs packages and an RHS ISO (in this case the RHS 3.0 ISO). So all the machines involved in the test had the same version of the Gluster RPMs from the beginning. This can be validated by looking at the install.log files of each machine in the following Beaker job. So I am not sure the conclusion drawn in comment #3 is right. However, I cannot tell whether there was any issue with provisioning the machines during this particular test, as the logs do not show any error. https://beaker.engineering.redhat.com/jobs/713334
Just to add: BVT has hit this issue only once in the last 7 runs, so the issue is very rare.
Also, the assumption in comment #2 is wrong, as the base ISO was "RHSS-3.0-20140624.n.0", which is the RHS 3.0 ISO and should support op-version 30000. Check the Beaker job mentioned in comment #4 for details.
Lala, could any other RHS 2.1 machine have accidentally probed the machines you were running the tests on? Based on what has been uncovered here, this is the most probable cause I can think of. If an RHS 2.1 machine believed these machines were part of its cluster, it would attempt to connect to the test machines, and that connection attempt can cause the op-version to be lowered. This issue is already being tracked in bug 1109742. Can you confirm whether this could have happened?
Kaushal, the BVT cluster did not have any RHS 2.1 node, that's for sure. But I cannot rule out that someone explicitly probed one of the machines in the cluster while the BVT test was running. That's very unlikely but not impossible, and I don't have enough information (logs) to confirm it.
Wouldn't the glusterd logs give a hint as to whether an RHS 2.1 node tried to connect to the cluster?
Atin, the particular test case (the failed one) did not upload the gluster logs to Beaker, so we don't have them. I noticed that after the failure and fixed it in the automation, so if this reproduces again we can check the gluster logs.
https://code.engineering.redhat.com/gerrit/#/c/35665/ fixes this problem and has been merged downstream; hence moving the status to MODIFIED.
I never encountered this issue apart from the failure instance on which this bug was raised, so it looks more like the scenario described in comment #7.
I think this bug is a duplicate of BZ 1109742. KP, can you please confirm?
*** This bug has been marked as a duplicate of bug 1109742 ***