Bug 1127328
Summary: | BVT: Remove-brick operation in top-profile tests is failing complaining "One or more nodes do not support the required op-version" | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Lalatendu Mohanty <lmohanty>
Component: | core | Assignee: | Kaushal <kaushal>
Status: | CLOSED DUPLICATE | QA Contact: | Lalatendu Mohanty <lmohanty>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | rhgs-3.0 | CC: | amukherj, kaushal, kparthas, lmohanty, nsathyan, rhs-bugs, sasundar, storage-qa-internal, vagarwal
Target Milestone: | --- | Keywords: | ZStream
Target Release: | RHGS 3.0.3 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.6.0.31-1 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-11-26 11:41:22 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1109742 | |
Bug Blocks: | | |
Description
Lalatendu Mohanty
2014-08-06 16:36:12 UTC
Here is how I reproduced this issue manually. I am not sure whether this is also how the BVT code hit it:

1. Installed 2 RHS 3.0 nodes (freshly installed from ISO).
2. Installed an RHS 2.1 node.
3. Probed an RHS 3.0 node from the RHS 2.1 U2 node (the cluster op-version now drops to 2).
4. Detached the node.
5. Probed an RHS 3.0 node from the other RHS 3.0 node (the one detached earlier from the RHS 2.1 U2 cluster). Note that the cluster op-version remains 2 here, even though the cluster is capable of 30000.
6. Created a new distributed volume and started it. readdir-ahead was not enabled by default on the volume.
7. Tried remove-brick and ran into the same issue:

[Mon Aug 11 09:41:41 UTC 2014 root@:~ ] # gluster v i
Volume Name: dvol
Type: Distribute
Volume ID: d19723ee-5a31-4e62-af8c-64613d0399f5
Status: Started
Snap Volume: no
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.131:/rhs/brick1/br1
Brick2: 10.70.37.58:/rhs/brick1/br1
Options Reconfigured:
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[Mon Aug 11 09:43:10 UTC 2014 root@:~ ] # gluster volume remove-brick dvol 10.70.37.58:/rhs/brick1/br1 force
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: failed: One or more nodes do not support the required op-version. Cluster op-version must atleast be 30000.

From comment 0, I see that "gluster volume info" does not show readdir-ahead enabled on the volume (I suppose the cluster's op-version was not 3, because if it were 3, readdir-ahead would be enabled on the volume by default). From comment 2, we can suspect that the BVT test probed an RHS 2.1 U2 node at some point during the test, lowering the cluster op-version to 2, and then detached it. The remove-brick code on RHS 3.0 would then fail, as it requires op-version 3.
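A quick way to confirm and correct this state on an affected node is sketched below. This is only an illustrative sequence assuming a glusterfs 3.6-era RHS 3.0 install; the glusterd.info path is the standard location, but the option name and target value should be verified against the installed version before use.

```sh
# Read the op-version this glusterd is currently operating at
# (persisted in glusterd's info file on every node).
grep operating-version /var/lib/glusterd/glusterd.info

# Check which glusterfs build is installed; RHS 3.0 ships glusterfs 3.6.x,
# whose maximum op-version is 30000.
gluster --version

# If every node runs RHS 3.0 bits but the cluster is still stuck at the
# older op-version (2 in this report), raise it explicitly so op-version
# gated behaviour (the remove-brick check, readdir-ahead defaults) applies.
gluster volume set all cluster.op-version 30000
```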
For BVT tests we use freshly installed servers. The servers are provisioned through Beaker using repos and a kickstart file. The repositories include the latest glusterfs package repo and an RHS ISO (in this case the RHS 3.0 ISO), so all the machines involved in the test had the same version of the Gluster RPMs from the beginning. This can be validated by looking at the install.log files of each machine in the following Beaker job. So I am not sure the conclusion drawn in comment #3 is right. However, I cannot tell whether there was any issue with the provisioning of the machines during that particular test, as the logs do not show any errors. https://beaker.engineering.redhat.com/jobs/713334

Just to add, BVT has hit this issue only once in the last 7 runs, so the issue is very rare. Also, the assumption in comment #2 is wrong, as the base ISO was "RHSS-3.0-20140624.n.0", which is the RHS 3.0 ISO and should have op-version "30000". Check the Beaker job mentioned in comment #4 for details.

Lala, could any other RHS 2.1 machine have accidentally probed the machines you were running tests on? Based on what has been uncovered here, this is the most probable cause I can think of. If there was an RHS 2.1 machine which believed that these machines were part of its cluster, it would attempt to connect to the test machines, and this connection attempt can cause the op-version to be lowered. This issue is already being tracked in bug 1109742. Can you confirm that this absolutely could not have happened?

Kaushal, BVT did not have any RHS 2.1 node in the cluster, that is for sure. But I am not sure whether someone explicitly probed one of the machines in the cluster while the BVT test was running. That is very unlikely but not impossible, and I do not have enough information (logs) to confirm it.

Wouldn't the glusterd logs give a hint whether an RHS 2.1 node tried to connect to the cluster?

Atin, the particular test cases (the failed ones) did not upload the gluster logs to Beaker, so we do not have the logs. After the failure I noticed that too and fixed it in the automation, so the next time this reproduces we can check the gluster logs.

https://code.engineering.redhat.com/gerrit/#/c/35665/ fixes this problem and has been merged downstream, hence moving the status to MODIFIED.

I never encountered this issue apart from the failure instance on which this bug was raised, so it looks more like the assumption drawn in comment #7. I think this bug is a duplicate of BZ 1109742. Kp, can you please confirm?

*** This bug has been marked as a duplicate of bug 1109742 ***
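If this does reproduce with logs available, a check along the following lines could help confirm or rule out the stale-peer theory discussed above. This is a sketch only: the log and state-file paths match glusterfs 3.6-era defaults, and the grep patterns are guesses at relevant message fragments rather than exact strings.

```sh
# Peers glusterd currently knows about; an unexpected RHS 2.1 host here
# would support the accidental-probe theory.
gluster peer status

# Peer records persisted on disk, including any stale entries.
ls /var/lib/glusterd/peers/

# The operating op-version recorded by glusterd on this node.
grep operating-version /var/lib/glusterd/glusterd.info

# Search the glusterd log for op-version changes or handshakes with
# unknown/older peers around the time of the failure.
grep -iE 'op-version|handshake|peer' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
```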