Bug 1305364

Summary: Hosted engine (3.5.6) doesn't automatically connect the storage pool
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: ovirt-hosted-engine-ha
Assignee: Martin Sivák <msivak>
Status: CLOSED DUPLICATE
QA Contact: Ilanit Stein <istein>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.5.6
CC: fdeutsch, gklein, lsurette, nashok, stirabos, ylavi
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-17 08:29:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description nijin ashok 2016-02-07 15:54:47 UTC
Description of problem:

When a hosted-engine hypervisor that uses iSCSI storage is restarted, it does not automatically connect to the hosted-engine storage pool. Because of this, ovirt-ha-agent fails to start with the error below.

[root@hosted-engine-host ~]# /etc/init.d/ovirt-ha-agent status
ovirt-ha-agent dead but subsys locked

In the HA agent logs we can see the following error.

MainThread::ERROR::2016-02-07 08:07:57,637::agent::182::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'path to storage domain 0f431756-12e1-410e-8f12-9f0b507956a9 not found in /rhev/data-center/mnt' - trying to restart agent
MainThread::WARNING::2016-02-07 08:08:02,642::agent::185::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::ERROR::2016-02-07 08:08:02,642::agent::187::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Too many errors occurred, giving up. Please review the log and consider filing a bug.
MainThread::INFO::2016-02-07 08:08:02,643::agent::121::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

From the vdsm logs we can see that the connectStorageServer command is issued and succeeds.

Thread-25::INFO::2016-02-07 08:07:24,685::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=3, spUUID='aef75be1-d19d-49c4-ae88-e1aa49c6ff2b', conList=[{'connection': '172.16.1.25', 'iqn': 'iqn.com.example:35hosted-engine', 'portal': '1', 'user': '', 'password': '******', 'id': '7d3bbdd8-75ae-4eb0-82d2-a64eb6e215a2', 'port': '3260'}], options=None)
Thread-25::DEBUG::2016-02-07 08:07:24,686::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.com.example:35hosted-engine -I default -p 172.16.1.25:3260,1 --op=new (cwd None)
Thread-25::DEBUG::2016-02-07 08:07:24,705::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-25::DEBUG::2016-02-07 08:07:24,706::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) /usr/bin/sudo -n /sbin/iscsiadm -m iface -I default (cwd None)
Thread-25::DEBUG::2016-02-07 08:07:24,717::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-25::INFO::2016-02-07 08:07:24,718::iscsi::557::Storage.ISCSI::(setRpFilterIfNeeded) iSCSI iface.net_ifacename not provided. Skipping.
Thread-25::DEBUG::2016-02-07 08:07:24,718::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.com.example:35hosted-engine -I default -p 172.16.1.25:3260,1 -l (cwd None)
Thread-25::DEBUG::2016-02-07 08:07:24,730::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-25::DEBUG::2016-02-07 08:07:24,731::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.com.example:35hosted-engine -I default -p 172.16.1.25:3260,1 -n node.startup -v manual --op=update (cwd None)
Thread-25::DEBUG::2016-02-07 08:07:24,742::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> = ''; <rc> = 0


However, we cannot find any connectStoragePool events.

If we restart the ovirt-ha-agent service, connectStorageServer events keep being generated in the vdsm logs.
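
A quick way to confirm this is to count both verbs in the vdsm log (a minimal check; /var/log/vdsm/vdsm.log is assumed to be the default vdsm log location and may differ per setup):

# count how often each call appears after the reboot
grep -c 'connectStorageServer' /var/log/vdsm/vdsm.log
grep -c 'connectStoragePool' /var/log/vdsm/vdsm.log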

To bring ovirt-ha-agent back up, we have to manually perform the steps below.

hosted-engine --start-pool

Then restart the HA agent service:

/etc/init.d/ovirt-ha-agent restart


Version-Release number of selected component (if applicable):

vdsm-4.16.32-1.el6ev.x86_64
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20160104.0.el6ev)


How reproducible:
100%

Steps to Reproduce:
1. Restart a hosted-engine hypervisor that is configured with iSCSI storage.

2. Check the status of the ovirt-ha-agent service.

3. Check whether the storage domain UUID directory has been created under /rhev/data-center/mnt/blockSD/ (see the example below).
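
For example (a minimal check; the directories under blockSD are assumed to be named after the storage domain UUID, which differs per deployment):

# list the block storage domain directories the host currently exposes
ls -l /rhev/data-center/mnt/blockSD/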

Actual results:

The hosted-engine HA agent service does not start during boot because the storage pool is not connected.

Expected results:

The hosted-engine HA service should start at boot.

Additional info:

A similar discussion can be seen here: https://bugzilla.redhat.com/show_bug.cgi?id=1205225#c5

Comment 1 Simone Tiraboschi 2016-02-10 11:25:41 UTC
No, we don't connect the storage pool by design: that in itself is not a bug.

prepareImage should create the link under '/rhev/data-center/mnt/'; the issue is why it fails to do so.
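
For example, whether that link exists can be checked directly on the host (a minimal check; the storage domain UUID below is the one reported by the agent in the description, and the blockSD layout is the one used elsewhere in this bug, both of which differ per deployment):

# list what is currently available under /rhev/data-center/mnt
ls -l /rhev/data-center/mnt/
# the path the agent error in the description refers to
ls -ld /rhev/data-center/mnt/blockSD/0f431756-12e1-410e-8f12-9f0b507956a9 2>/dev/null || echo "storage domain path not present"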

Comment 2 Simone Tiraboschi 2016-02-10 11:34:20 UTC
The issue is here, in prepareImage:

Thread-25::DEBUG::2016-02-04 14:34:46,123::task::595::Storage.TaskManager.Task::(_updateState) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::moving from state init -> state preparing
Thread-25::INFO::2016-02-04 14:34:46,124::logUtils::44::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='a61e367c-61f0-44de-b4b9-d288a8d94999', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='f1be6242-8776-4309-afbc-525e08ca4f17', leafUUID='33d9949d-590a-4b2f-a62f-e120f6a66ddb')
Thread-25::DEBUG::2016-02-04 14:34:46,129::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.a61e367c-61f0-44de-b4b9-d288a8d94999`ReqID=`d07766d1-632b-4c74-9b36-935993a35a90`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '3230' at 'prepareImage'
Thread-25::DEBUG::2016-02-04 14:34:46,131::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'Storage.a61e367c-61f0-44de-b4b9-d288a8d94999' for lock type 'shared'
Thread-25::DEBUG::2016-02-04 14:34:46,134::resourceManager::601::Storage.ResourceManager::(registerResource) Resource 'Storage.a61e367c-61f0-44de-b4b9-d288a8d94999' is free. Now locking as 'shared' (1 active user)
Thread-25::DEBUG::2016-02-04 14:34:46,136::resourceManager::238::Storage.ResourceManager.Request::(grant) ResName=`Storage.a61e367c-61f0-44de-b4b9-d288a8d94999`ReqID=`d07766d1-632b-4c74-9b36-935993a35a90`::Granted request
Thread-25::DEBUG::2016-02-04 14:34:46,139::task::827::Storage.TaskManager.Task::(resourceAcquired) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::_resourcesAcquired: Storage.a61e367c-61f0-44de-b4b9-d288a8d94999 (shared)
Thread-25::DEBUG::2016-02-04 14:34:46,140::task::993::Storage.TaskManager.Task::(_decref) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::ref 1 aborting False
Thread-25::DEBUG::2016-02-04 14:34:46,142::lvm::428::Storage.OperationMutex::(_reloadlvs) Operation 'lvm reload operation' got the operation mutex
Thread-25::DEBUG::2016-02-04 14:34:46,144::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360050764008101232800000000000004|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags a61e367c-61f0-44de-b4b9-d288a8d94999 (cwd None)
Thread-25::DEBUG::2016-02-04 14:34:46,463::lvm::291::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
Thread-25::DEBUG::2016-02-04 14:34:46,464::lvm::463::Storage.LVM::(_reloadlvs) lvs reloaded
Thread-25::DEBUG::2016-02-04 14:34:46,464::lvm::463::Storage.OperationMutex::(_reloadlvs) Operation 'lvm reload operation' released the operation mutex
Thread-25::DEBUG::2016-02-04 14:34:46,465::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360050764008101232800000000000004|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --available y a61e367c-61f0-44de-b4b9-d288a8d94999/33d9949d-590a-4b2f-a62f-e120f6a66ddb (cwd None)
Thread-25::DEBUG::2016-02-04 14:34:46,833::lvm::291::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
Thread-25::DEBUG::2016-02-04 14:34:46,833::lvm::514::Storage.OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-25::DEBUG::2016-02-04 14:34:46,834::lvm::526::Storage.OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-25::DEBUG::2016-02-04 14:34:46,834::fileUtils::142::Storage.fileUtils::(createdir) Creating directory: /var/run/vdsm/storage/a61e367c-61f0-44de-b4b9-d288a8d94999/f1be6242-8776-4309-afbc-525e08ca4f17
Thread-25::DEBUG::2016-02-04 14:34:46,835::blockSD::1075::Storage.StorageDomain::(createImageLinks) Creating symlink from /dev/a61e367c-61f0-44de-b4b9-d288a8d94999/33d9949d-590a-4b2f-a62f-e120f6a66ddb to /var/run/vdsm/storage/a61e367c-61f0-44de-b4b9-d288a8d94999/f1be6242-8776-4309-afbc-525e08ca4f17/33d9949d-590a-4b2f-a62f-e120f6a66ddb
Thread-25::DEBUG::2016-02-04 14:34:46,836::lvm::428::Storage.OperationMutex::(_reloadlvs) Operation 'lvm reload operation' got the operation mutex
Thread-25::DEBUG::2016-02-04 14:34:46,836::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360050764008101232800000000000004|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags a61e367c-61f0-44de-b4b9-d288a8d94999 (cwd None)
Thread-25::DEBUG::2016-02-04 14:34:47,135::lvm::291::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
Thread-25::DEBUG::2016-02-04 14:34:47,137::lvm::463::Storage.LVM::(_reloadlvs) lvs reloaded
Thread-25::DEBUG::2016-02-04 14:34:47,137::lvm::463::Storage.OperationMutex::(_reloadlvs) Operation 'lvm reload operation' released the operation mutex
Thread-25::ERROR::2016-02-04 14:34:47,139::blockVolume::429::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 427, in validateImagePath
OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/blockSD/a61e367c-61f0-44de-b4b9-d288a8d94999/images/f1be6242-8776-4309-afbc-525e08ca4f17'
Thread-25::ERROR::2016-02-04 14:34:47,139::task::866::Storage.TaskManager.Task::(_setError) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 3255, in prepareImage
  File "/usr/share/vdsm/storage/blockSD.py", line 1106, in getVolumeLease
  File "/usr/share/vdsm/storage/sd.py", line 400, in produceVolume
  File "/usr/share/vdsm/storage/blockVolume.py", line 80, in __init__
  File "/usr/share/vdsm/storage/volume.py", line 144, in __init__
  File "/usr/share/vdsm/storage/blockVolume.py", line 89, in validate
  File "/usr/share/vdsm/storage/volume.py", line 156, in validate
  File "/usr/share/vdsm/storage/blockVolume.py", line 430, in validateImagePath
ImagePathError: Image path does not exist or cannot be accessed/created: ('/rhev/data-center/mnt/blockSD/a61e367c-61f0-44de-b4b9-d288a8d94999/images/f1be6242-8776-4309-afbc-525e08ca4f17',)
Thread-25::DEBUG::2016-02-04 14:34:47,140::task::885::Storage.TaskManager.Task::(_run) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::Task._run: 0aae1b49-2954-4762-be74-d7c6fce2abba ('a61e367c-61f0-44de-b4b9-d288a8d94999', '00000000-0000-0000-0000-000000000000', 'f1be6242-8776-4309-afbc-525e08ca4f17', '33d9949d-590a-4b2f-a62f-e120f6a66ddb') {} failed - stopping task
Thread-25::DEBUG::2016-02-04 14:34:47,140::task::1217::Storage.TaskManager.Task::(stop) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::stopping in state preparing (force False)
Thread-25::DEBUG::2016-02-04 14:34:47,141::task::993::Storage.TaskManager.Task::(_decref) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::ref 1 aborting True
Thread-25::INFO::2016-02-04 14:34:47,141::task::1171::Storage.TaskManager.Task::(prepare) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::aborting: Task is aborted: 'Image path does not exist or cannot be accessed/created' - code 254
Thread-25::DEBUG::2016-02-04 14:34:47,141::task::1176::Storage.TaskManager.Task::(prepare) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::Prepare: aborted: Image path does not exist or cannot be accessed/created
Thread-25::DEBUG::2016-02-04 14:34:47,141::task::993::Storage.TaskManager.Task::(_decref) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::ref 0 aborting True
Thread-25::DEBUG::2016-02-04 14:34:47,142::task::928::Storage.TaskManager.Task::(_doAbort) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::Task._doAbort: force False
Thread-25::DEBUG::2016-02-04 14:34:47,142::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-25::DEBUG::2016-02-04 14:34:47,142::task::595::Storage.TaskManager.Task::(_updateState) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::moving from state preparing -> state aborting
Thread-25::DEBUG::2016-02-04 14:34:47,142::task::550::Storage.TaskManager.Task::(__state_aborting) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::_aborting: recover policy none
Thread-25::DEBUG::2016-02-04 14:34:47,143::task::595::Storage.TaskManager.Task::(_updateState) Task=`0aae1b49-2954-4762-be74-d7c6fce2abba`::moving from state aborting -> state failed
Thread-25::DEBUG::2016-02-04 14:34:47,143::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.a61e367c-61f0-44de-b4b9-d288a8d94999': < ResourceRef 'Storage.a61e367c-61f0-44de-b4b9-d288a8d94999', isValid: 'True' obj: 'None'>}

Can you please check the content and the permissions of '/rhev/data-center/mnt/blockSD/'?

Comment 3 nijin ashok 2016-02-14 14:35:35 UTC
This was easily reproducible at my end. From my test environment:

ls -ld /rhev/data-center/mnt
drwxr-xr-x. 2 vdsm kvm 40 2016-02-14 07:15 /rhev/data-center/mnt

ls -l /rhev/data-center/mnt
total 0

Comment 4 Yaniv Lavi 2016-02-17 08:29:40 UTC

*** This bug has been marked as a duplicate of bug 1306901 ***