Bug 862975
| Summary: | Storage domain did not come up after killing a brick of a distribute volume and bringing it back; restarting the VM is also not possible | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | shylesh <shmohan> |
| Component: | vdsm | Assignee: | Ayal Baron <abaron> |
| Status: | CLOSED DUPLICATE | QA Contact: | shylesh <shmohan> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | amarts, barumuga, bazulay, fsimonce, grajaiya, hateya, iheim, lpeer, mlipchuk, rhs-bugs, sabose, scohen, sdharane, sgowda, shaines, shireesh, vbellur, ykaul |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 3.2.0 | | |
| Hardware: | x86_64 | OS: | Linux |
| Whiteboard: | gluster | | |
| Fixed In Version: | glusterfs-3.3.0.5rhs-40 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-04-30 11:49:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Gluster | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Can you please re-try this with write-behind disabled?

Sure, I can try that, but can you give me a workaround to activate the storage domain? The whole data center is down because of this test.

Not able to reproduce the issue; the bug is related to rebalance reporting incorrect stats. Reducing the severity and priority.

Ignore comment 4. It was not meant for this bug. Reverting priority/severity.

I am not able to reproduce this bug. Please re-open if you hit it.

Created attachment 624044 [details]
mnt and vdsm logs

Sorry for the confusion; I seem to have missed a couple of steps in my bug description. Here are the complete steps for reproducing this issue. The following steps lead to exactly the same state as described above. Re-opening, since even though the steps to reproduce differ, the issue remains the same.

Steps:
1. Created a 2-brick distribute volume with default volume tagging and brought down one of the bricks.
   Result: the VM paused, but after restarting the volume the VM resumed.
2. Added one more brick to the volume, did a rebalance, and repeated step 1.
   Result: the VM paused and resumed once the volume was restarted.
3. This time I tried step 1 with eager-lock off and write-behind disabled.
   Result: the VM paused and resumed once the volume started.
4. I re-enabled write-behind and switched eager-lock back on.
   Result: this time I hit the bug; the VM paused, but after bringing back the brick it could not be resumed.

These VMs have preallocated disks with an OS installed and running on them. Attached the mnt and vdsm logs.

Some observations from accessing the setup provided:

The RHEV host to which the master storage volume (the same volume, in this case) was mounted was showing errors. This led to the data center being down, which explains why the VMs cannot be resumed.

Is this scenario reproducible when RHEV-M reports the host to be up? (I know the host is ping-able and ssh-able.)

The data store on the hosts seems to be mounted just fine.
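The write-behind and eager-lock toggles exercised in steps 3-4 above correspond to per-volume gluster options. A minimal sketch, assuming the `vmstore` volume name from this bug and the standard option names (run on a storage node; defined as a function rather than executed here):

```shell
# Toggle the two caching translators mentioned in the repro steps.
# Volume name is an example from this bug; state is "on" or "off".
set_caching() {
  local vol=$1 state=$2
  gluster volume set "$vol" performance.write-behind "$state"
  gluster volume set "$vol" cluster.eager-lock "$state"
}

# Usage: set_caching vmstore off   # as in step 3
#        set_caching vmstore on    # as in step 4
```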
--------------------------------------------------------------
Hypervisor:
rhs-gp-srv9.lab.eng.blr.redhat.com
[root@rhs-gp-srv9 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdc2 503G 2.7G 475G 1% /
tmpfs 24G 0 24G 0% /dev/shm
rhs-gp-srv4.lab.eng.blr.redhat.com:/vmstore
                      1.5T  111G  1.4T   8% /rhev/data-center/mnt/rhs-gp-srv4.lab.eng.blr.redhat.com:_vmstore
-------------------------------------------------------------
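The mount point in the df output above follows vdsm's naming convention for file-based domains: the local directory name is derived from the remote `server:/volume` spec by replacing `/` with `_`. A sketch of that transform, inferred from the paths seen in this bug (not taken from vdsm source):

```python
def mount_point(remote_path, root="/rhev/data-center/mnt"):
    """Derive the local mount directory vdsm-style: '/' in the
    remote spec becomes '_' (convention inferred from the paths
    in this bug report)."""
    return "%s/%s" % (root, remote_path.replace("/", "_"))

print(mount_point("rhs-gp-srv4.lab.eng.blr.redhat.com:/vmstore"))
# -> /rhev/data-center/mnt/rhs-gp-srv4.lab.eng.blr.redhat.com:_vmstore
```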
> Volume Name: vmstore
> Type: Distribute
> Volume ID: 91aa3e01-6330-44b7-acf1-9e5a20570cc8
> Status: Started
> Number of Bricks: 3
> Transport-type: tcp
> Bricks:
> Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/another
> Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/another
> Brick3: rhs-gp-srv12.lab.eng.blr.redhat.com:/another
--------------------------------------------------------------
Moreover, none of the brick logs indicate any error. They show a successful mount. All the permissions on the back end seem to be fine.
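The "kill a brick, then bring it back" primitive used throughout this bug boils down to killing the brick's glusterfsd process and later force-starting the volume. A hedged sketch (volume and brick path are hypothetical; run on the node hosting the brick; defined as functions rather than executed here):

```shell
# Kill the glusterfsd process serving a given brick path.
# The pkill pattern assumes the brick path appears on the
# glusterfsd command line, as it does in glusterfs 3.3.
kill_brick() {
  local brick_path=$1    # e.g. /another
  pkill -9 -f "glusterfsd.*${brick_path}"
}

# Bring downed bricks back: "start ... force" restarts only
# the brick processes that are not running.
restore_bricks() {
  local vol=$1           # e.g. vmstore
  gluster volume start "$vol" force
}
```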
Alternative steps to reproduce this issue (seen it work a couple of times):
1. Create a 4-brick dht volume.
2. Attach it to the data center.
3. Launch 3-4 VMs from this storage.
4. Kill a brick, and observe the available free space being shown as <1G.
5. Wait for the VMs to pause.
6. Start the volume (force option).
7. If the VMs/storage domain are online, repeat step 4 with the next brick.

We can rather fix it using https://code.engineering.redhat.com/gerrit/#/c/122/

Need confirmation on whether the issue gets fixed if we use 'spread-count 1' as an option.

(In reply to comment #13)
> we can rather fix it using
> https://code.engineering.redhat.com/gerrit/#/c/122/
>
> need confirmation on if the issue got fixed if we use 'spread-count 1' as an
> option.

Amar, I tried setting this option but it didn't work:

[root@rhs-client36 core]# gluster v set newdis directory-spread-count 1
option : directory-spread-count does not exist
Did you mean io-thread-count?
Set volume unsuccessful

Shylesh, can you please try with "directory-layout-spread" and not "directory-spread-count"?

After bringing back the brick the storage domain was still down and no operations could be performed, but this has a workaround: after bringing back the brick, unmount the gluster mount from the hypervisor and reactivate the storage domain from the GUI; now everything is back to normal. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=865669#c5 for more details. Moving the bug to the assigned state, since bringing back the brick alone didn't solve the issue.

Please provide the sos reports for the storage nodes, virtualization hosts, and virtualization manager.

As the bug mentioned in comment #17 is closed, I would like to move it to ON_QA again.

(In reply to comment #18)
> Please provide the sos reports for the storage nodes, virtualization hosts,
> virtualization manager.

Refer comment #9 from comment #17:

> After bringing back the brick the storage domain was still down and could not perform any operations but this has got a work around.
> After bringing back the brick unmount the gluster mount from hypervisor and from the GUI reactivate the storage domain now everything is back to normal.

Here, the mount point was active, which shows there were no issues with the filesystem. But the console never considered the mount point back up until we did an umount/mount. I suspect some work is needed in vdsm to recheck the mount point status when maintenance and resume of the VM happen from the console. Assigning it to Bala to check on the vdsm issue, keeping Shireesh in the loop.

I looked into vdsm.log and found a lot of tracebacks about a missing storage domain:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 178, in _monitorDomain
    self.domain = sdCache.produce(self.sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 110, in produce
    dom.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 123, in _realProduce
    dom = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 147, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('ff792265-bcc9-4c53-a65e-7d1c6f09b186',)

I will check with the vdsm guys about these tracebacks.

From what I can see, the domain was never mounted properly (see below), and after connectStoragePool (which succeeds because the master domain is visible) the log ends. In general, the engine should reconnect the host to storage after about 5 minutes. Can you provide a log where it doesn't?

Thread-6858::DEBUG::2012-10-09 15:59:58,738::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6 guido.lab.eng.blr.redhat.com:/iso2 /rhev/data-center/mnt/guido.lab.eng.blr.redhat.com:_iso2' (cwd None)
Thread-6858::ERROR::2012-10-09 16:02:04,453::hsm::1986::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
...
MountError: (32, ';mount.nfs: Connection timed out\n')

Can you attach the engine log?

Created attachment 705360 [details]
engine, vdsm and mount logs
Attaching the engine, vdsm logs and sosreports
DC name: Rebalance
Storage domain name: dist-rebal type: posix
Hypervisor name: rhs-client16.lab.eng.blr.redhat.com (RHEV-H)
[root@rhs-client16 vdsm]# cat /etc/issue
Red Hat Enterprise Virtualization Hypervisor release 6.4 (20130221.0.auto205)
Kernel \r on an \m (\l)
The latest log reports a problem related to locks; hence I suspect the storage domain cannot be started even though the hypervisor mount looks clean.

/var/log/messages
=========
Mar 6 05:26:12 rhs-client16 sanlock[13336]: 2013-03-06 05:26:12+0000 599469 [13344]: r6183 cmd_acquire 8,17,5709 acquire_token -90
Mar 6 05:26:12 rhs-client16 vdsm TaskManager.Task ERROR Task=`ea6f570f-3df8-43e2-a8b7-43d8c561dae2`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 840, in _run#012 File "/usr/share/vdsm/storage/task.py", line 307, in run#012 File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm#012 File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock#012 File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire#012AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"
Mar 6 05:26:21 rhs-client16 ntpd[4730]: synchronized to 10.5.26.10, stratum 1
Mar 6 05:26:23 rhs-client16 sanlock[13336]: 2013-03-06 05:26:23+0000 599480 [13342]: r6184 cmd_acquire 8,17,5709 acquire_token -90
Mar 6 05:26:23 rhs-client16 vdsm TaskManager.Task ERROR Task=`2e88b03b-9427-443e-82bf-0dd914e891b6`::Unexpected error [same startSpm/acquireClusterLock traceback and AcquireLockFailure as above]
Mar 6 05:26:34 rhs-client16 sanlock[13336]: 2013-03-06 05:26:34+0000 599491 [13344]: r6185 cmd_acquire 8,17,5709 acquire_token -90
Mar 6 05:26:34 rhs-client16 vdsm TaskManager.Task ERROR Task=`cfc9bfb8-def6-4c46-a479-10cefb382e90`::Unexpected error [same traceback]
Mar 6 05:26:46 rhs-client16 sanlock[13336]: 2013-03-06 05:26:46+0000 599502 [13342]: r6186 cmd_acquire 8,17,5709 acquire_token -90
Mar 6 05:26:46 rhs-client16 vdsm TaskManager.Task ERROR Task=`03151f80-dd4f-40a9-96bf-56a14dd50b49`::Unexpected error [same traceback]
Mar 6 05:26:57 rhs-client16 sanlock[13336]: 2013-03-06 05:26:57+0000 599513 [13344]: r6187 cmd_acquire 8,17,5709 acquire_token -90
Mar 6 05:26:57 rhs-client16 vdsm TaskManager.Task ERROR Task=`32ab06cb-9874-468d-8136-8afe67ceaa6a`::Unexpected error [same traceback]
Mar 6 05:27:08 rhs-client16 sanlock[13336]: 2013-03-06 05:27:08+0000 599524 [13342]: r6188 cmd_acquire 8,17,5709 acquire_token -90
Mar 6 05:27:08 rhs-client16 vdsm TaskManager.Task ERROR Task=`b751c946-4e84-49c5-93e0-0eef2d5fd2e7`::Unexpected error [same traceback]

vdsm logs
================
VM Channels Listener::DEBUG::2013-03-06 05:28:42,368::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,369::guestIF::95::vm.Vm::(_connect) vmId=`0d5a1d47-448f-491c-a3af-08ab5a590ffb`::Attempting connection to /var/lib/libvirt/qemu/channels/APP3.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:42,370::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 26.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,370::guestIF::95::vm.Vm::(_connect) vmId=`ff0d5b76-b27f-4e8c-b11f-0f6c34e5b97a`::Attempting connection to /var/lib/libvirt/qemu/channels/app4.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:42,370::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 20.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,370::guestIF::95::vm.Vm::(_connect) vmId=`a2d63307-d80b-4c7c-ab6b-388419524150`::Attempting connection to /var/lib/libvirt/qemu/channels/APP1.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:42,371::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 29.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,371::guestIF::95::vm.Vm::(_connect) vmId=`2b73a07c-8318-4200-a9fa-29222630f2a8`::Attempting connection to /var/lib/libvirt/qemu/channels/app5.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:42,371::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 31.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,371::guestIF::95::vm.Vm::(_connect) vmId=`cce07a2c-331f-46eb-8dbb-d5323dc3df3a`::Attempting connection to /var/lib/libvirt/qemu/channels/app2.com.redhat.rhevm.vdsm
[the same sequence of connection attempts repeats at 05:28:43 and then truncates at "vmId=`"]

engine logs
============
2013-03-05 21:28:02,606 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (QuartzScheduler_Worker-55) [701b7916] START, HSMClearTaskVDSCommand(HostName = rhs-client16.lab.eng.blr.redhat.com, HostId = 79532a34-809d-11e2-a5f5-525400632e67, taskId=372062e6-a2a7-45d8-af17-dda0eab710fb), log id: 1ece86c6
2013-03-05 21:28:02,622 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (QuartzScheduler_Worker-55) [701b7916] FINISH, HSMClearTaskVDSCommand, log id: 1ece86c6
2013-03-05 21:28:02,622 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (QuartzScheduler_Worker-55) [701b7916] FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@8f240f7, log id: 23721ba7
2013-03-05 21:28:02,624 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (QuartzScheduler_Worker-55) [58871cb7] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 5a66d2a6-bc44-49b6-8fe7-e1480f49ec44 Type: StoragePool
2013-03-05 21:28:02,630 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-55) [58871cb7] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed

Attached the complete log files.

Created attachment 705778 [details]
vdsm, engine, /var/log/messages logs
From the engine log it seems that a scheduler operation calls GetStoragePoolInfoVDSCommand every few seconds, tries to get the IrsProxy (the in-memory SPM object), finds that there is no SPM, and then tries to register rhs-client16 as the new SPM, but gets "Cannot obtain lock" from the host. It seems the problem to solve is why VDSM cannot obtain a lock on the SPM.

Fede, can you take a quick look and see what's going on? If relevant, please document how you reached the conclusion so that others would know what to look for themselves next time.

(In reply to comment #33)
> Fede, can you take a quick look and see what's going on?
> If relevant, please document how you reached the conclusion so that others
> would know what to look for themselves next time.

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
  File "/usr/share/vdsm/storage/task.py", line 307, in run
  File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm
  File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock
  File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire
AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"

Despite the error string's default translation of 'Message too long', the underlying errno is EMSGSIZE, which signals that the lockspace/resource has an incorrect size (typically the ids/leases files have been damaged by a truncate). I suppose this could be related to bug 859589 comment 24.

(In reply to comment #34)
> Despite the string error that has a default translation of 'Message too
> long', the underlying errno is EMSGSIZE which signals that the
> lockspace/resource has an incorrect size (typically the ids/leases files
> have been damaged with a truncate).
>
> I suppose this could be related to bug 859589 comment 24.

Shylesh, can you take a look and see if this is a duplicate of bug 859589, or if gluster is truncating in another use case? In any event, this seems like a gluster bug?

(In reply to comment #35)
> Can you take a look and see if this is a duplicate of bug 859589? or if
> gLuster is truncating in another use case?
> In any event, this seems like a gluster bug?

As per comment 8, I once hit the bug with rebalance, but I also hit it without rebalance. Bug 859589 involves a gluster rebalance operation that was causing truncation, whereas in this bug, without doing a rebalance, you just bring down a brick and bring it back.

(In reply to comment #36)
> As per comment 8, i once hit the bug with rebalance also but even without
> rebalance also i hit the bug.
> 859589 involves gluster rebalance operation which was causing truncation,
> whereas in this bug without doing rebalance just bring down a brick and
> bring it back.

The debugging suggestions given in bug 859589 comment 23 apply here too. Try to inspect the "leases" file to check if it's corrupted in some way.

*** This bug has been marked as a duplicate of bug 859589 ***
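The suggestion to inspect the sanlock metadata files can be scripted. A hedged sketch (the dom_md path layout is assumed from the mount paths and domain UUID seen in this bug; run on the hypervisor while the domain is mounted; defined as a function rather than executed here):

```shell
# Report the sizes of a file-based storage domain's sanlock files.
# A zero-size or shrunken ids/leases file is consistent with the
# EMSGSIZE (rc=90, 'Message too long') failure seen in this bug.
check_domain_leases() {
  local dom_md=$1  # e.g. /rhev/data-center/mnt/<server:_vol>/<sd-uuid>/dom_md
  stat -c '%n %s bytes' "$dom_md/ids" "$dom_md/leases"
}
```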
Created attachment 621336 [details]
mnt and vdsm logs

Description of problem:
Killing one of the bricks and bringing it back did not bring the storage domain up, and of course the VMs on this domain did not start.

Version-Release number of selected component (if applicable):
[root@rhs-gp-srv4 dist-rebal]# rpm -qa | grep gluster
glusterfs-fuse-3.3.0rhsvirt1-6.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch

How reproducible:

Steps to Reproduce:
1. Created a 2-brick distribute volume which serves as the VM store.
2. Created a couple of VMs from RHEV; they were running fine.
3. Brought down one of the bricks of the volume.
4. One of the VMs went to the "Paused" state.
5. Started the volume with force.
6. Tried to start the VM, but it did not come up.

Actual results:
Trying to start the VM says "Storage domain can't be accessed"; the mnt log says:

[2012-10-04 10:24:41.786152] E [dht-helper.c:827:dht_rebalance_inprogress_task] 5-rebal-dht: (null): failed to get the 'linkto' xattr No data available
[2012-10-04 10:24:41.787327] W [client3_1-fops.c:1183:client3_1_fgetxattr_cbk] 5-rebal-client-1: remote operation failed: No data available
[2012-10-04 10:24:41.787369] E [dht-helper.c:827:dht_rebalance_inprogress_task] 5-rebal-dht: (null): failed to get the 'linkto' xattr No data available
[2012-10-04 10:24:41.789154] W [client3_1-fops.c:1183:client3_1_fgetxattr_cbk] 5-rebal-client-1: remote operation failed: No data available
[2012-10-04 10:24:41.789207] E [dht-helper.c:827:dht_rebalance_inprogress_task] 5-rebal-dht: (null): failed to get the 'linkto' xattr No data available
[2012-10-04 10:24:41.790209] W [client3_1-fops.c:1183:client3_1_fgetxattr_cbk] 5-rebal-client-1: remote operation failed: No data available

Additional info:
Attached the mnt and VDSM logs.
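The "failed to get the 'linkto' xattr" errors above refer to DHT's `trusted.glusterfs.dht.linkto` extended attribute on link files left behind after migration. A hedged sketch for checking it directly on a brick (the brick-side file path is a hypothetical example; defined as a function rather than executed here):

```shell
# Inspect the DHT link-file xattr on a brick's copy of a file.
# DHT link files are normally zero-byte with the sticky bit set
# (---------T) and carry this xattr pointing at the real subvolume.
show_linkto() {
  local file_on_brick=$1   # e.g. /another/<sd-uuid>/images/<img-uuid>/<vol-uuid>
  getfattr -n trusted.glusterfs.dht.linkto -e text "$file_on_brick"
}
```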