Bug 862975

Summary: Storage domain did not come up after killing a brick of a distribute volume and bringing it back; restarting the VM is also not possible
Product: Red Hat Enterprise Virtualization Manager
Reporter: shylesh <shmohan>
Component: vdsm
Assignee: Ayal Baron <abaron>
Status: CLOSED DUPLICATE
QA Contact: shylesh <shmohan>
Severity: high
Docs Contact:
Priority: high
Version: unspecified
CC: amarts, barumuga, bazulay, fsimonce, grajaiya, hateya, iheim, lpeer, mlipchuk, rhs-bugs, sabose, scohen, sdharane, sgowda, shaines, shireesh, vbellur, ykaul
Target Milestone: ---
Keywords: Reopened
Target Release: 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard: gluster
Fixed In Version: glusterfs-3.3.0.5rhs-40
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-04-30 11:49:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                              Flags
mnt and vdsm logs                        none
mnt and vdsm logs                        none
engine, vdsm and mount logs              none
vdsm, engine, /var/log/messages logs     none

Description shylesh 2012-10-04 05:08:24 UTC
Created attachment 621336 [details]
mnt and vdsm logs

Description of problem:
Killing one of the bricks and bringing it back did not bring the storage domain up, and consequently the VMs on this domain did not start.

Version-Release number of selected component (if applicable):
[root@rhs-gp-srv4 dist-rebal]# rpm -qa | grep gluster
glusterfs-fuse-3.3.0rhsvirt1-6.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch


How reproducible:


Steps to Reproduce:
1. Created a 2-brick distribute volume which serves as the VM store
2. Created a couple of VMs from RHEV; they were running fine
3. Brought down one of the bricks of the volume
4. One of the VMs went to the "Paused" state
5. Started the volume with force (commands sketched below)
6. Tried to start the VM, but it did not come up
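
Roughly, the volume setup and brick kill in these steps correspond to commands like the following. This is only a sketch: the volume name, hosts, and brick paths here are illustrative, not the exact ones used.

# steps 1-2: create and start a 2-brick distribute volume used as the VM store
gluster volume create vmstore rhs-gp-srv4:/brick1 rhs-gp-srv11:/brick1
gluster volume start vmstore
# step 3: bring down one brick, e.g. by killing its glusterfsd process on the brick server
# step 5: restart the volume with force to bring the killed brick back online
gluster volume start vmstore force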
  
Actual results:

Trying to start the VM says "Storage domain can't be accessed"; the mnt log says:


[2012-10-04 10:24:41.786152] E [dht-helper.c:827:dht_rebalance_inprogress_task] 5-rebal-dht: (null): failed to get the 'linkto' xattr No data available
[2012-10-04 10:24:41.787327] W [client3_1-fops.c:1183:client3_1_fgetxattr_cbk] 5-rebal-client-1: remote operation failed: No data available
[2012-10-04 10:24:41.787369] E [dht-helper.c:827:dht_rebalance_inprogress_task] 5-rebal-dht: (null): failed to get the 'linkto' xattr No data available
[2012-10-04 10:24:41.789154] W [client3_1-fops.c:1183:client3_1_fgetxattr_cbk] 5-rebal-client-1: remote operation failed: No data available
[2012-10-04 10:24:41.789207] E [dht-helper.c:827:dht_rebalance_inprogress_task] 5-rebal-dht: (null): failed to get the 'linkto' xattr No data available
[2012-10-04 10:24:41.790209] W [client3_1-fops.c:1183:client3_1_fgetxattr_cbk] 5-rebal-client-1: remote operation failed: No data available
 


Additional info:


Attached the mnt and VDSM logs

Comment 2 shishir gowda 2012-10-04 05:43:00 UTC
Can you please re-try this with write-behind disabled?

Comment 3 shylesh 2012-10-04 06:35:24 UTC
Sure, I can try that, but can you give me a workaround to activate the storage domain? The whole data center is down because of this test.

Comment 4 shishir gowda 2012-10-05 07:23:18 UTC
Not able to reproduce the issue; the bug is related to stats being incorrect for rebalance. Reducing the severity and priority.

Comment 5 shishir gowda 2012-10-05 07:24:11 UTC
Ignore comment 4. It was not meant for this bug. Reverting priority/severity.

Comment 6 shishir gowda 2012-10-09 10:34:08 UTC
I am not able to reproduce this bug. Please re-open if you hit it.

Comment 7 shylesh 2012-10-09 11:49:11 UTC
Created attachment 624044 [details]
mnt and vdsm logs

Comment 8 shylesh 2012-10-09 11:50:51 UTC
Sorry for the confusion; I seem to have missed a couple of steps in my bug description. Here are the complete steps for reproducing this issue.

The following steps lead to exactly the same state as described above.
Re-opening, since even though the steps to reproduce differ, the issue remains the same.




steps
======
1. Created a 2-brick distribute volume with default volume tagging and brought down one of the bricks

results
========

The VM paused, but after restarting the volume the VM resumed.



2. Added one more brick to the volume, did a rebalance, and repeated step 1

results
=======
The VM paused and resumed once the volume was restarted.

3. This time I tried with eager-lock off and write-behind disabled, and repeated step 1

results
========
The VM paused and resumed once the volume was started.


4. I re-enabled write-behind and switched eager-lock back on (option commands sketched below)

results
=======
This time I hit the bug: the VM was paused, but after bringing the brick back it could not be resumed.
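
For reference, the option toggles in steps 3 and 4 map to commands like these; this assumes the stock option keys performance.write-behind and cluster.eager-lock, and the volume name is illustrative:

# step 3: disable write-behind and eager-lock
gluster volume set vmstore performance.write-behind off
gluster volume set vmstore cluster.eager-lock off
# step 4: re-enable both before repeating step 1
gluster volume set vmstore performance.write-behind on
gluster volume set vmstore cluster.eager-lock on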



These VMs have preallocated disks, with an OS installed and running on them.


Attached the mnt logs and vdsm logs

Comment 10 shishir gowda 2012-10-09 12:45:21 UTC
Some observations from accessing the setup provided:

The RHEV host on which the master storage volume (the same volume in this case) was mounted was showing errors. This led to the data center being down, which explains why the VMs cannot be resumed.

Is this scenario reproducible when RHEV-M reports the host to be up? (I know the host is pingable and reachable over ssh.)

The data store on the hosts seems to be mounted just fine.
--------------------------------------------------------------
Hypervisor:
rhs-gp-srv9.lab.eng.blr.redhat.com
[root@rhs-gp-srv9 ~]# df -h                                                                             
Filesystem            Size  Used Avail Use% Mounted on                                                  
/dev/sdc2             503G  2.7G  475G   1% /                                                           
tmpfs                  24G     0   24G   0% /dev/shm                                                    
rhs-gp-srv4.lab.eng.blr.redhat.com:/vmstore                                                             
                      1.5T  111G  1.4T   8% /rhev/data-center/mnt/rhs-gp-srv4.lab.eng.blr.redhat.com:_vmstore
-------------------------------------------------------------
> Volume Name: vmstore
> Type: Distribute
> Volume ID: 91aa3e01-6330-44b7-acf1-9e5a20570cc8
> Status: Started
> Number of Bricks: 3
> Transport-type: tcp
> Bricks:
> Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/another
> Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/another
> Brick3: rhs-gp-srv12.lab.eng.blr.redhat.com:/another
--------------------------------------------------------------

Moreover, none of the brick logs indicates any error; they show a successful mount. All the permissions on the back end seem to be fine.

Comment 11 shishir gowda 2012-10-15 10:43:18 UTC
Alternative steps to reproduce this issue (seen to work a couple of times):

1. Create a 4-brick dht volume
2. Attach it to the data center
3. Launch 3-4 VMs from this storage
4. Kill a brick, and observe the available free space being shown as <1G (see the sketch below)
5. Wait for the VMs to pause
6. Start the volume (force option)
7. If the VMs/storage domain are online, repeat step 4 with the next brick
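
Step 4 can be done by finding the brick's PID and killing its glusterfsd process; a sketch, with an illustrative volume name and mount path:

gluster volume status vmstore                   # lists each brick together with its PID
kill <pid-of-one-brick>                         # that brick goes offline
df -h /rhev/data-center/mnt/<server>:_vmstore   # free space now shows <1G
gluster volume start vmstore force              # step 6: bring the killed brick back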

Comment 12 shishir gowda 2012-10-23 09:09:54 UTC
Fix @ https://code.engineering.redhat.com/gerrit/#/c/119/

Comment 13 Amar Tumballi 2012-10-30 07:27:02 UTC
We can rather fix it using https://code.engineering.redhat.com/gerrit/#/c/122/

We need confirmation on whether the issue gets fixed if we use 'spread-count 1' as an option.

Comment 15 shylesh 2012-11-09 07:19:04 UTC
(In reply to comment #13)
> we can rather fix it using
> https://code.engineering.redhat.com/gerrit/#/c/122/
> 
> need confirmation on if the issue got fixed if we use 'spread-count 1' as an
> option.

Amar,
I tried setting this option but it didn't work:

[root@rhs-client36 core]# gluster v set newdis directory-spread-count 1
option : directory-spread-count does not exist
Did you mean io-thread-count?
Set volume unsuccessful

Comment 16 Vijay Bellur 2012-11-15 07:28:45 UTC
Shylesh,

Can you please try with "directory-layout-spread" and not "directory-spread-count"?
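
For the record, the retry with this key would look like the following; whether this option name is accepted depends on the glusterfs build:

gluster volume set newdis directory-layout-spread 1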

Comment 17 shylesh 2013-01-16 12:54:22 UTC
After bringing back the brick, the storage domain was still down and no operations could be performed, but there is a workaround.

After bringing back the brick, unmount the gluster mount on the hypervisor and re-activate the storage domain from the GUI; everything is then back to normal. (A sketch of this follows below.)
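
In this setup the workaround amounts to something like the following; the mount path is the one shown in comment 10, so substitute the actual mount point and storage domain:

# on the hypervisor: unmount the gluster-backed domain
umount /rhev/data-center/mnt/rhs-gp-srv4.lab.eng.blr.redhat.com:_vmstore
# then re-activate the storage domain from the RHEV-M GUI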

Please refer https://bugzilla.redhat.com/show_bug.cgi?id=865669#c5 for more details.

Moving the bug to the assigned state, since bringing back the brick alone did not solve the issue.

Comment 18 shishir gowda 2013-01-22 05:40:11 UTC
Please provide the sos reports for the storage nodes, virtualization hosts, virtualization manager.

Comment 19 Amar Tumballi 2013-01-30 05:34:59 UTC
As the bug mentioned in comment #17 is closed, I would like to move this to ON_QA again.

Comment 20 Gowrishankar Rajaiyan 2013-02-04 04:17:23 UTC
(In reply to comment #18)
> Please provide the sos reports for the storage nodes, virtualization hosts,
> virtualization manager.

Refer comment #9

Comment 21 Amar Tumballi 2013-02-08 05:55:20 UTC
from comment #17,
> 
> After bringing back the brick the storage domain was still down and could not perform any operations but this has got a work around.
> 
> After bringing back the brick unmount the gluster mount from hypervisor and from the GUI reactivate the storage domain now everything is back to normal.
> 

Here, the mount point was active, which shows there were no issues with the filesystem. But the console never considered the mount point to be back up until we did an umount/mount. I suspect some work is needed in vdsm to recheck the mount point status when maintenance and resume of the VM are done from the console.

Assigning it to Bala to check on the vdsm issue, keeping Shireesh in the loop.

Comment 22 Bala.FA 2013-02-08 09:24:26 UTC
I looked into vdsm.log and found a lot of tracebacks about a missing storage domain.

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 178, in _monitorDomain
    self.domain = sdCache.produce(self.sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 110, in produce
    dom.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 123, in _realProduce
    dom = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 147, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('ff792265-bcc9-4c53-a65e-7d1c6f09b186',)

I will check with the vdsm team about these tracebacks.

Comment 24 Ayal Baron 2013-02-17 12:11:24 UTC
From what I can see, the domain was never mounted properly (see below), and after connectStoragePool (which succeeds because the master domain is visible) the log ends. In general, the engine should reconnect the host to storage after about 5 minutes. Can you provide a log where it doesn't?

Thread-6858::DEBUG::2012-10-09 15:59:58,738::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6 guido.lab.eng.blr.redhat.com:/iso2 /rhev/data-center/mnt/guido.lab.eng.blr.redhat.com:_iso2' (cwd None)

Thread-6858::ERROR::2012-10-09 16:02:04,453::hsm::1986::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
...
MountError: (32, ';mount.nfs: Connection timed out\n')

Comment 25 Ayal Baron 2013-02-17 13:12:36 UTC
Can you attach engine log?

Comment 26 shylesh 2013-03-05 11:09:51 UTC
Created attachment 705360 [details]
engine , vdsm and mount logs

Attaching the engine, vdsm logs and sosreports

DC name: Rebalance
Storage domain name: dist-rebal     type: posix
Hypervisor name: rhs-client16.lab.eng.blr.redhat.com (RHEV-H)

[root@rhs-client16 vdsm]# cat /etc/issue
Red Hat Enterprise Virtualization Hypervisor release 6.4 (20130221.0.auto205)
Kernel \r on an \m (\l)

Comment 28 shylesh 2013-03-06 05:42:01 UTC
The latest log shows a problem related to locks; hence I suspect the storage domain cannot be started even though the hypervisor mount looks clean.


/var/log/messages
=========
Mar  6 05:26:12 rhs-client16 sanlock[13336]: 2013-03-06 05:26:12+0000 599469 [13344]: r6183 cmd_acquire 8,17,5709 acquire_token -90
	Mar  6 05:26:12 rhs-client16 vdsm TaskManager.Task ERROR Task=`ea6f570f-3df8-43e2-a8b7-43d8c561dae2`::Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 840, in _run#012  File "/usr/share/vdsm/storage/task.py", line 307, in run#012  File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm#012  File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock#012  File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire#012AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"
	Mar  6 05:26:21 rhs-client16 ntpd[4730]: synchronized to 10.5.26.10, stratum 1
	Mar  6 05:26:23 rhs-client16 sanlock[13336]: 2013-03-06 05:26:23+0000 599480 [13342]: r6184 cmd_acquire 8,17,5709 acquire_token -90
	Mar  6 05:26:23 rhs-client16 vdsm TaskManager.Task ERROR Task=`2e88b03b-9427-443e-82bf-0dd914e891b6`::Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 840, in _run#012  File "/usr/share/vdsm/storage/task.py", line 307, in run#012  File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm#012  File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock#012  File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire#012AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"
	Mar  6 05:26:34 rhs-client16 sanlock[13336]: 2013-03-06 05:26:34+0000 599491 [13344]: r6185 cmd_acquire 8,17,5709 acquire_token -90
	Mar  6 05:26:34 rhs-client16 vdsm TaskManager.Task ERROR Task=`cfc9bfb8-def6-4c46-a479-10cefb382e90`::Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 840, in _run#012  File "/usr/share/vdsm/storage/task.py", line 307, in run#012  File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm#012  File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock#012  File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire#012AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"
	Mar  6 05:26:46 rhs-client16 sanlock[13336]: 2013-03-06 05:26:46+0000 599502 [13342]: r6186 cmd_acquire 8,17,5709 acquire_token -90
	Mar  6 05:26:46 rhs-client16 vdsm TaskManager.Task ERROR Task=`03151f80-dd4f-40a9-96bf-56a14dd50b49`::Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 840, in _run#012  File "/usr/share/vdsm/storage/task.py", line 307, in run#012  File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm#012  File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock#012  File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire#012AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"
	Mar  6 05:26:57 rhs-client16 sanlock[13336]: 2013-03-06 05:26:57+0000 599513 [13344]: r6187 cmd_acquire 8,17,5709 acquire_token -90
	Mar  6 05:26:57 rhs-client16 vdsm TaskManager.Task ERROR Task=`32ab06cb-9874-468d-8136-8afe67ceaa6a`::Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 840, in _run#012  File "/usr/share/vdsm/storage/task.py", line 307, in run#012  File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm#012  File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock#012  File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire#012AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"
	Mar  6 05:27:08 rhs-client16 sanlock[13336]: 2013-03-06 05:27:08+0000 599524 [13342]: r6188 cmd_acquire 8,17,5709 acquire_token -90
	Mar  6 05:27:08 rhs-client16 vdsm TaskManager.Task ERROR Task=`b751c946-4e84-49c5-93e0-0eef2d5fd2e7`::Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 840, in _run#012  File "/usr/share/vdsm/storage/task.py", line 307, in run#012  File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm#012  File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock#012  File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire#012AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"



vdsm logs
================
VM Channels Listener::DEBUG::2013-03-06 05:28:42,368::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,369::guestIF::95::vm.Vm::(_connect) vmId=`0d5a1d47-448f-491c-a3af-08ab5a590ffb`::Attempting connection to /var/lib/libvirt/qemu/channels/APP3.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:42,370::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 26.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,370::guestIF::95::vm.Vm::(_connect) vmId=`ff0d5b76-b27f-4e8c-b11f-0f6c34e5b97a`::Attempting connection to /var/lib/libvirt/qemu/channels/app4.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:42,370::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 20.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,370::guestIF::95::vm.Vm::(_connect) vmId=`a2d63307-d80b-4c7c-ab6b-388419524150`::Attempting connection to /var/lib/libvirt/qemu/channels/APP1.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:42,371::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 29.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,371::guestIF::95::vm.Vm::(_connect) vmId=`2b73a07c-8318-4200-a9fa-29222630f2a8`::Attempting connection to /var/lib/libvirt/qemu/channels/app5.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:42,371::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 31.
VM Channels Listener::DEBUG::2013-03-06 05:28:42,371::guestIF::95::vm.Vm::(_connect) vmId=`cce07a2c-331f-46eb-8dbb-d5323dc3df3a`::Attempting connection to /var/lib/libvirt/qemu/channels/app2.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:43,372::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
VM Channels Listener::DEBUG::2013-03-06 05:28:43,373::guestIF::95::vm.Vm::(_connect) vmId=`0d5a1d47-448f-491c-a3af-08ab5a590ffb`::Attempting connection to /var/lib/libvirt/qemu/channels/APP3.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:43,373::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 26.
VM Channels Listener::DEBUG::2013-03-06 05:28:43,373::guestIF::95::vm.Vm::(_connect) vmId=`ff0d5b76-b27f-4e8c-b11f-0f6c34e5b97a`::Attempting connection to /var/lib/libvirt/qemu/channels/app4.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:43,373::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 20.
VM Channels Listener::DEBUG::2013-03-06 05:28:43,373::guestIF::95::vm.Vm::(_connect) vmId=`a2d63307-d80b-4c7c-ab6b-388419524150`::Attempting connection to /var/lib/libvirt/qemu/channels/APP1.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:43,373::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 29.
VM Channels Listener::DEBUG::2013-03-06 05:28:43,374::guestIF::95::vm.Vm::(_connect) vmId=`2b73a07c-8318-4200-a9fa-29222630f2a8`::Attempting connection to /var/lib/libvirt/qemu/channels/app5.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2013-03-06 05:28:43,374::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 31.
VM Channels Listener::DEBUG::2013-03-06 05:28:43,374::guestIF::95::vm.Vm::(_connect) vmId=`




engine logs
============
2013-03-05 21:28:02,606 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (QuartzScheduler_Worker-55) [701b7916] START, HSMClearTaskVDSCommand(HostName = rhs-client16.lab.eng.blr.redhat.com, HostId = 79532a34-809d-11e2-a5f5-525400632e67, taskId=372062e6-a2a7-45d8-af17-dda0eab710fb), log id: 1ece86c6
2013-03-05 21:28:02,622 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (QuartzScheduler_Worker-55) [701b7916] FINISH, HSMClearTaskVDSCommand, log id: 1ece86c6
2013-03-05 21:28:02,622 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (QuartzScheduler_Worker-55) [701b7916] FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@8f240f7, log id: 23721ba7
2013-03-05 21:28:02,624 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (QuartzScheduler_Worker-55) [58871cb7] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 5a66d2a6-bc44-49b6-8fe7-e1480f49ec44 Type: StoragePool
2013-03-05 21:28:02,630 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-55) [58871cb7] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed




attached the complete log files

Comment 29 shylesh 2013-03-06 05:43:11 UTC
Created attachment 705778 [details]
vdsm, engine , /var/log/messages logs

Comment 32 Maor 2013-03-13 21:37:15 UTC
From the engine log it seems that a scheduler operation, which calls GetStoragePoolInfoVDSCommand every few seconds, tries to get the IrsProxy (the SPM object in memory), finds that there is no SPM, and so tries to register rhs-client16 as the new SPM, but it gets "Cannot obtain lock" from the host.

It seems that the problem which needs to be solved is why VDSM cannot obtain the SPM lock.

Comment 33 Ayal Baron 2013-03-14 09:13:18 UTC
Fede, can you take a quick look and see what's going on?
If relevant, please document how you reached the conclusion so that others would know what to look for themselves next time.

Comment 34 Federico Simoncelli 2013-03-15 16:02:46 UTC
(In reply to comment #33)
> Fede, can you take a quick look and see what's going on?
> If relevant, please document how you reached the conclusion so that others
> would know what to look for themselves next time.

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
  File "/usr/share/vdsm/storage/task.py", line 307, in run
  File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm
  File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock
  File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire
AcquireLockFailure: Cannot obtain lock: "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"

Despite the error string, whose default translation is 'Message too long', the underlying errno is EMSGSIZE, which signals that the lockspace/resource has an incorrect size (typically the ids/leases files have been damaged by a truncate).

I suppose this could be related to bug 859589 comment 24.

Comment 35 Ayal Baron 2013-03-17 12:24:34 UTC
(In reply to comment #34)
> (In reply to comment #33)
> > Fede, can you take a quick look and see what's going on?
> > If relevant, please document how you reached the conclusion so that others
> > would know what to look for themselves next time.
> 
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 840, in _run
>   File "/usr/share/vdsm/storage/task.py", line 307, in run
>   File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm
>   File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock
>   File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire
> AcquireLockFailure: Cannot obtain lock:
> "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster
> lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"
> 
> Despite the string error that has a default translation of 'Message too
> long', the underlying errno is EMSGSIZE which signals that the
> lockspace/resource has an incorrect size (typically the ids/leases files
> have been damaged with a truncate).
> 
> I suppose this could be related to bug 859589 comment 24.

Shylesh,

Can you take a look and see if this is a duplicate of bug 859589? or if gLuster is truncating in another use case?
In any event, this seems like a gluster bug?

Comment 36 shylesh 2013-03-18 05:24:42 UTC
(In reply to comment #35)
> (In reply to comment #34)
> > (In reply to comment #33)
> > > Fede, can you take a quick look and see what's going on?
> > > If relevant, please document how you reached the conclusion so that others
> > > would know what to look for themselves next time.
> > 
> > Traceback (most recent call last):
> >   File "/usr/share/vdsm/storage/task.py", line 840, in _run
> >   File "/usr/share/vdsm/storage/task.py", line 307, in run
> >   File "/usr/share/vdsm/storage/sp.py", line 254, in startSpm
> >   File "/usr/share/vdsm/storage/sd.py", line 465, in acquireClusterLock
> >   File "/usr/share/vdsm/storage/clusterlock.py", line 240, in acquire
> > AcquireLockFailure: Cannot obtain lock:
> > "id=ebb53abf-65f1-454d-b65d-13dd1f466254, rc=90, out=Cannot acquire cluster
> > lock, err=(90, 'Sanlock resource not acquired', 'Message too long')"
> > 
> > Despite the string error that has a default translation of 'Message too
> > long', the underlying errno is EMSGSIZE which signals that the
> > lockspace/resource has an incorrect size (typically the ids/leases files
> > have been damaged with a truncate).
> > 
> > I suppose this could be related to bug 859589 comment 24.
> 
> Shylesh,
> 
> Can you take a look and see if this is a duplicate of bug 859589? or if
> gLuster is truncating in another use case?
> In any event, this seems like a gluster bug?

As per comment 8, I once hit the bug with rebalance, but I also hit it without rebalance.
Bug 859589 involves a gluster rebalance operation that was causing truncation, whereas in this bug no rebalance is done; just bring down a brick and bring it back.

Comment 37 Federico Simoncelli 2013-03-18 10:46:59 UTC
(In reply to comment #36)
> (In reply to comment #35)
> > (In reply to comment #34)
> > Shylesh,
> > 
> > Can you take a look and see if this is a duplicate of bug 859589? or if
> > gLuster is truncating in another use case?
> > In any event, this seems like a gluster bug?
> 
> As per comment 8, i once hit the bug with rebalance also but even without
> rebalance also i hit the bug.
> 859589 involves gluster rebalance operation which was causing truncation,
> whereas in this bug without doing rebalance just bring down a brick and
> bring it back.

The debugging suggestions given in bug 859589 comment 23 apply here too.
Try to inspect the "leases" file to check if it's corrupted in some way.
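
A quick way to check is to look at the sizes of the sanlock files under the storage domain on the gluster mount. This is a sketch: it assumes the usual dom_md/ids and dom_md/leases layout of a file-based vdsm domain, and the placeholders have to be filled in for this setup:

# on the hypervisor
ls -l /rhev/data-center/mnt/<server>:_<volume>/<storage-domain-uuid>/dom_md/ids \
      /rhev/data-center/mnt/<server>:_<volume>/<storage-domain-uuid>/dom_md/leases
# a zero or unexpectedly small size would point to the truncation described in comment 34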

Comment 38 Sahina Bose 2013-04-30 11:49:25 UTC

*** This bug has been marked as a duplicate of bug 859589 ***