Description of problem:
Bring down the first data brick in the arbiter volume and create a VM. Once the VM installation finishes, power off the VM. Bring the first brick back up and start the VM. The VM does not boot and drops to the grub prompt. As suggested by Vijay, I started the VM without bringing the first brick back up, and the VM boots without any issues.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-8.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce (a gluster-side command sketch follows below):
1. Create a 1 x (2 + 1) arbiter volume.
2. Bring down the first data brick in the volume.
3. Create a VM.
4. Once the VM installation finishes, power off the VM.
5. Bring the first brick (which was down) back up and start the VM.

Actual results:
The VM does not boot and drops to the grub prompt.

Expected results:
The VM should boot without any issues.

Additional info:
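A minimal gluster-side sketch of the reproduction flow. Hosts and brick paths are taken from the volume info below; the VM itself is created and powered off from RHEV-M, and <pid-of-brick1> is a placeholder for the brick PID shown by 'gluster volume status':

# Step 1: create the 1 x (2 + 1) arbiter volume.
gluster volume create vmstore replica 3 arbiter 1 \
    10.70.36.82:/rhgs/brick2/vmstore \
    10.70.36.83:/rhgs/brick2/vmstore \
    10.70.36.84:/rhgs/brick2/vmstore
gluster volume start vmstore

# Step 2: bring down the first data brick by killing its brick process.
gluster volume status vmstore      # note the PID listed for Brick1
kill -15 <pid-of-brick1>

# Steps 3-4: install the VM from RHEV-M, then power it off.

# Step 5: bring the downed brick back up, then start the VM.
gluster volume start vmstore force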
gluster volume info on vmstore:
====================================
[root@rhsqa-grafton4 ~]# gluster volume info vmstore

Volume Name: vmstore
Type: Replicate
Volume ID: 3d67c0ad-5084-4190-a4b5-c468994ca084
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/brick2/vmstore
Brick2: 10.70.36.83:/rhgs/brick2/vmstore
Brick3: 10.70.36.84:/rhgs/brick2/vmstore (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
network.ping-timeout: 30
user.cifs: off
performance.strict-o-direct: on
client.ssl: on
server.ssl: on
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
cluster.granular-entry-heal: enable
cluster.use-compound-fops: on
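For reference, each of the reconfigured options above is applied individually with 'gluster volume set'; a few representative examples, with values copied from the output above:

gluster volume set vmstore features.shard on
gluster volume set vmstore features.shard-block-size 512MB
gluster volume set vmstore performance.strict-o-direct on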
sosreports can be found at the link below:
=============================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1405302/
From the errors in rhev-data-center-mnt-glusterSD-10.70.36.82\:_vmstore.log on grafton5, it looks like the same problem as BZ 1404982 (comment #5). I am providing a test build to Kasturi with the same fix applied on top of the latest downstream code (HEAD at tag v3.8.4-9 on origin/rhgs-3.2.0, commit "protocol/client: fix op_errno handling, was unused variable") to see if it fixes the issue.
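To pull just the error-level lines from that mount log, something like the following works (a sketch; the standard /var/log/glusterfs location is assumed, and the quoted filename matches the mount point):

# Gluster log lines carry a single-letter severity field; ' E ' marks errors.
grep ' E ' '/var/log/glusterfs/rhev-data-center-mnt-glusterSD-10.70.36.82:_vmstore.log'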
Upstream mainline patch http://review.gluster.org/#/c/16205/ posted for review.
Hi Ravi,

Tested this issue with glusterfs-3.8.4-5.el7rhgs.x86_64. I tried the steps mentioned in the description thrice, but I was not able to hit the issue.

Thanks,
kasturi
(In reply to RamaKasturi from comment #8)
> Hi Ravi,
>
> Tested this issue with glusterfs-3.8.4-5.el7rhgs.x86_64. I tried the
> steps mentioned in the description thrice, but I was not able to hit the
> issue.
>
> Thanks,
> kasturi

Thanks Kasturi. If we are able to hit the issue with glusterfs-3.8.4-6, then we have a reasonably small number of fixes over which to run a git bisect and find the offending commit. Please give it a try on v3.8.4-6 as well.

Thanks!
Ravi
Just for the record, after comment #9, Kasturi tried a couple of test builds (thanks a lot, Kasturi!) and we were not able to hit the issue with some modifications made to the original patch posted in comment #7.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/93560/
Will verify this bug once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1400057 lands.
Verified and works fine with build glusterfs-3.8.4-11.el7rhgs.x86_64. Brought the first brick down in the volume, created a VM, and installed the OS. Once the VM was installed, powered off the VM, brought the first brick back up, and the VM booted successfully.
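For completeness, the usual way to confirm the restarted brick has caught up before booting the VM is to wait for pending heals to drain (a sketch, not part of the verification log above):

gluster volume heal vmstore info
# Wait until every brick reports "Number of entries: 0" before starting the VM.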
Moving this to verified state.
[root@rhsqa-grafton4 ~]# gluster volume info vmstore

Volume Name: vmstore
Type: Replicate
Volume ID: 2f8938c2-26d3-4912-a6e0-bc12b76146d0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/brick1/vmstore
Brick2: 10.70.36.83:/rhgs/brick1/vmstore
Brick3: 10.70.36.84:/rhgs/brick1/vmstore (arbiter)
Options Reconfigured:
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
server.ssl: on
client.ssl: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html