Bug 960952

Summary: cannot run VM when vdsm not connected to pool ("Image path does not exist or cannot be accessed/created")
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dafna Ron <dron>
Component: vdsm
Assignee: Eduardo Warszawski <ewarszaw>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.2.0
CC: abaron, acanan, acathrow, amureini, bazulay, cboyle, eedri, ewarszaw, hchiramm, iheim, jkt, jturner, lpeer, mkalinin, pablo.iranzo, scohen, yeylon
Target Milestone: ---
Keywords: Reopened
Target Release: 3.3.0
Flags: scohen: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: is14
Doc Type: Bug Fix
Doc Text:
When connectivity to the storage was restored and the host was recovered, no virtual machine could be run: all of them failed with the error "Image path does not exist or cannot be accessed/created". This was caused by the run VM command being issued before connect storage pool had completed. This was fixed by allowing the virtual machine to run without being connected to the pool.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 16:06:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 961038    
Attachments:
logs (flags: none)

Description Dafna Ron 2013-05-08 11:12:55 UTC
Created attachment 745209 [details]
logs

Description of problem:

I blocked connectivity to the storage on my HSM host and left it blocked for 30 minutes.
After 30 minutes I restored connectivity to the storage from the host, autorecovery reactivated the host, and then I tried running VMs on it.
All VMs failed to run with the error "Image path does not exist or cannot be accessed/created", and the host changed status to Error since it failed to run the VMs.
After I activated the host again, I was able to run the VMs on it.

This issue also happened previously, when I manually removed orphan LVs from one of the hosts: VMs failed to run with the same error on that host but were able to run on the other host, until I ran lvs on the troubled host.
Since both hosts use the same storage, we can only assume that this is a cache issue in vdsm.

Version-Release number of selected component (if applicable):

sf15
vdsm-4.10.2-17.0.el6ev.x86_64

How reproducible:

100%

Steps to Reproduce:
1. On iSCSI storage, in a 2-host cluster, create 3 storage domains
2. Block connectivity to the storage from the HSM host and leave it blocked for 30 minutes (one way to do this is sketched below)
3. Remove the block from the host and wait for autoRecovery to activate the host
4. Run VMs on the HSM host
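
For step 2, the bug does not say how connectivity was blocked; here is a minimal sketch assuming iptables is used to drop traffic to the iSCSI portal (the address 10.35.64.10 and port 3260 are taken from the connectStorageServer log later in this bug, adjust for the environment under test):

#!/usr/bin/env python
# Sketch only: block/unblock iSCSI traffic from the HSM host to the storage
# server with iptables. Portal and port come from the logs in this bug.
import subprocess

PORTAL = "10.35.64.10"  # iSCSI portal from the connectStorageServer log
PORT = "3260"           # iSCSI port from the same log

def set_storage_block(enabled):
    # -A appends the DROP rule (block), -D deletes it (restore connectivity).
    action = "-A" if enabled else "-D"
    subprocess.check_call([
        "iptables", action, "OUTPUT",
        "-p", "tcp", "-d", PORTAL, "--dport", PORT, "-j", "DROP",
    ])

# set_storage_block(True)   # step 2: block connectivity to the storage
# ... wait ~30 minutes ...
# set_storage_block(False)  # step 3: restore connectivity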
  
Actual results:

VMs fail to run with "Image path does not exist or cannot be accessed/created" and are re-run on the second host.
Since the storage was available from both hosts, and I was able to manually migrate the VMs to the host after I changed its status from Error to UP, we can assume that this is a cache issue.

Expected results:

We should be able to run the VMs once connectivity to the storage is restored and the host is recovered.

Additional info: full logs


Thread-1169::DEBUG::2013-05-08 13:54:37,413::vm::676::vm.Vm::(_startUnderlyingVm) vmId=`205497a5-33cc-4f4a-b46e-7cd831615f9b`::_ongoingCreations released
Thread-1169::ERROR::2013-05-08 13:54:37,417::vm::700::vm.Vm::(_startUnderlyingVm) vmId=`205497a5-33cc-4f4a-b46e-7cd831615f9b`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 662, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1447, in _run
    self.preparePaths(devices[vm.DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 721, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 275, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'index': 0, 'iface': 'ide', 'reqsize': '0', 'format': 'cow', 'bootOrder': '1', 'poolID': '7fd33b43-a9f4-4eb7-a885-e9583a929ceb', 'volumeID': '163ee5e3-5dae-4312-9157-0125f6f1074d', 'apparentsize': '1073741824', 'imageID': '7770aad5-3488-4cf5-b8ca-9451b5b12767', 'specParams': {}, 'readonly': 'false', 'domainID': 'd326916a-89b6-41c8-9c14-4b9cf8a1c979', 'optional': 'false', 'deviceId': '7770aad5-3488-4cf5-b8ca-9451b5b12767', 'truesize': '1073741824', 'address': {' controller': '0', ' target': '0', 'unit': '0', ' bus': '0', ' type': 'drive'}, 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'type': 'disk'}

Comment 2 Eduardo Warszawski 2013-05-08 19:32:42 UTC
This is not a caching issue.

connectStoragePool was issued before connectStorageServer, so the pool was not connected; refreshStoragePool then failed, and prepare image failed as well.

Thread-1127::INFO::2013-05-08 13:51:07,311::logUtils::42::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '900d1c89-04ea-4c30-820e-ca3a86efb4e2'}]}
Thread-1128::INFO::2013-05-08 13:51:07,349::logUtils::40::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', hostID=2, scsiKey='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', msdUUID='81ef11d0-4c0c-47b4-8953-d61a6af442d8', masterVersion=410, options=None)
Thread-1140::INFO::2013-05-08 13:53:32,870::logUtils::40::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=3, spUUID='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', conList=[{'connection': '10.35.64.10', 'iqn': 'Dafna-32-02', 'portal': '1', 'user': '', 'password': '******', 'id': '7b34aee1-1f7f-4932-949c-247a406a0a02', 'port': '3260'}], options=None)
Thread-1140::INFO::2013-05-08 13:53:32,950::logUtils::42::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '7b34aee1-1f7f-4932-949c-247a406a0a02'}]}
Thread-1141::INFO::2013-05-08 13:53:33,023::logUtils::40::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', msdUUID='38755249-4bb3-4841-bf5b-05f4a521514d', masterVersion=411, options=None)
Thread-1169::INFO::2013-05-08 13:54:29,940::logUtils::40::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='d326916a-89b6-41c8-9c14-4b9cf8a1c979', spUUID='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', imgUUID='7770aad5-3488-4cf5-b8ca-9451b5b12767', volUUID='163ee5e3-5dae-4312-9157-0125f6f1074d', options=None)
Thread-1169::INFO::2013-05-08 13:54:36,821::logUtils::42::dispatcher::(wrapper) Run and protect: getVolumeSize, Return response: {'truesize': '1073741824', 'apparentsize': '1073741824'}
Thread-1169::INFO::2013-05-08 13:54:36,858::logUtils::40::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='d326916a-89b6-41c8-9c14-4b9cf8a1c979', spUUID='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', imgUUID='7770aad5-3488-4cf5-b8ca-9451b5b12767', volUUID='163ee5e3-5dae-4312-9157-0125f6f1074d')
Thread-1169::ERROR::2013-05-08 13:54:37,321::blockVolume::403::Storage.Volume::(validateImagePath) Unexpected error
Thread-1169::ERROR::2013-05-08 13:54:37,324::task::850::TaskManager.Task::(_setError) Task=`5f2d2e48-e79b-4aa4-98bc-8f00d4667e26`::Unexpected error
Thread-1169::ERROR::2013-05-08 13:54:37,409::dispatcher::66::Storage.Dispatcher.Protect::(run) {'status': {'message': "Image path does not exist or cannot be accessed/created: ('/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/d326916a-89b6-41c8-9c14-4b9cf8a1c979/images/7770aad5-3488-4cf5-b8ca-9451b5b12767',)", 'code': 254}}
Thread-1169::ERROR::2013-05-08 13:54:37,417::vm::700::vm.Vm::(_startUnderlyingVm) vmId=`205497a5-33cc-4f4a-b46e-7cd831615f9b`::The vm start process failed
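
To make the ordering problem concrete, here is a simplified sketch (hypothetical names, not vdsm's actual code): the image path is resolved through the connected pool, so a prepareImage call that arrives before connectStoragePool has completed fails exactly as in the trace above.

# Simplified illustration of the race described in this comment.
# Class and method names are hypothetical, not taken from the vdsm source.
class PoolNotConnected(Exception):
    pass

class HostStorageState:
    def __init__(self):
        self.connected_pool = None  # set only when connectStoragePool finishes

    def connect_storage_pool(self, sp_uuid):
        # In the failing flow this completes only after several retries,
        # because connectStorageServer has not finished yet.
        self.connected_pool = sp_uuid

    def prepare_image(self, sp_uuid, sd_uuid, img_uuid):
        # The image path lives under the pool link, so an unconnected pool
        # means the path cannot be resolved and the VM start fails.
        if self.connected_pool != sp_uuid:
            raise PoolNotConnected(
                "Image path does not exist or cannot be accessed/created")
        return "/rhev/data-center/%s/%s/images/%s" % (sp_uuid, sd_uuid, img_uuid)

# Failing order (as in the logs above): prepare_image runs before the pool connects.
state = HostStorageState()
try:
    state.prepare_image("7fd33b43-...", "d326916a-...", "7770aad5-...")
except PoolNotConnected as err:
    print(err)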

Comment 5 Eduardo Warszawski 2013-05-12 11:28:10 UTC
We can see that this is a transient state: after a few more attempts the host is connected to the pool, without intervention, and the image is prepared:

Thread-1227::INFO::2013-05-08 13:56:07,149::logUtils::40::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', hostID=2, scsiKey='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', msdUUID='38755249-4bb3-4841-bf5b-05f4a521514d', masterVersion=411, options=None)

Thread-1227::ERROR::2013-05-08 13:56:14,345::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 38755249-4bb3-4841-bf5b-05f4a521514d
Thread-1227::ERROR::2013-05-08 13:56:14,345::sdc::153::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 38755249-4bb3-4841-bf5b-05f4a521514d
Thread-1227::DEBUG::2013-05-08 13:56:14,346::lvm::373::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-1227::DEBUG::2013-05-08 13:56:14,350::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1Dafna-32-011366808|1Dafna-32-021366808|1Dafna-32-031366808%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 38755249-4bb3-4841-bf5b-05f4a521514d' (cwd None)
Thread-1227::DEBUG::2013-05-08 13:56:14,674::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0

Thread-1227::DEBUG::2013-05-08 13:56:15,408::persistentDict::234::Storage.PersistentDict::(refresh) read lines (VGTagMetadataRW)=['LOCKPOLICY=', 'VERSION=3', 'TYPE=ISCSI', 'DESCRIPTION=Dafna-32-03', 'LOGBLKSIZE=512', 'VGUUID=4Ceo5k-vMud-sVKL-blqX-JwKo-ETN6-8ao4L9', 'LEASERETRIES=3', 'LEASETIMESEC=5', 'LOCKRENEWALINTERVALSEC=5', u'PV0=pv:1Dafna-32-031366808,uuid:0AYd8Y-R9mE-rLT6-QTOe-q3zo-OqCi-Edx4Uu,pestart:0,pecount:797,mapoffset:0', 'PHYBLKSIZE=512', 'SDUUID=38755249-4bb3-4841-bf5b-05f4a521514d', 'IOOPTIMEOUTSEC=1', 'CLASS=Data', 'POOL_UUID=7fd33b43-a9f4-4eb7-a885-e9583a929ceb', 'ROLE=Master', 'POOL_DESCRIPTION=iSCSI', 'MASTER_VERSION=411', u'POOL_DOMAINS=d326916a-89b6-41c8-9c14-4b9cf8a1c979:Active,81ef11d0-4c0c-47b4-8953-d61a6af442d8:Active,38755249-4bb3-4841-bf5b-05f4a521514d:Active,72ec1321-a114-451f-bee1-6790cbca1bc6:Active', 'POOL_SPM_ID=1', 'POOL_SPM_LVER=0', '_SHA_CKSUM=16f8ee2cdf42df2d4a605f0b5d2b8dd6eb6ab322']
Thread-1227::DEBUG::2013-05-08 13:56:15,410::sp::1577::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `7fd33b43-a9f4-4eb7-a885-e9583a929ceb` started monitoring domain `d326916a-89b6-41c8-9c14-4b9cf8a1c979`

Thread-1227::INFO::2013-05-08 13:56:16,870::logUtils::42::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True

Thread-1252::INFO::2013-05-08 13:56:43,986::logUtils::40::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='d326916a-89b6-41c8-9c14-4b9cf8a1c979', spUUID='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', imgUUID='a49d10a6-38f3-40b5-aea3-5628357b1f2e', volUUID='0dec5c38-e585-4565-bde4-37842a741595')

Thread-1252::INFO::2013-05-08 13:56:44,413::logUtils::42::dispatcher::(wrapper) Run and protect: prepareImage, Return response: {'path': '/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/d326916a-89b6-41c8-9c14-4b9cf8a1c979/images/a49d10a6-38f3-40b5-aea3-5628357b1f2e/0dec5c38-e585-4565-bde4-37842a741595', 'chain': [{'path': '/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/d326916a-89b6-41c8-9c14-4b9cf8a1c979/images/a49d10a6-38f3-40b5-aea3-5628357b1f2e/e886fb6e-f91f-46ae-a488-0f24e47f5731', 'domainID': 'd326916a-89b6-41c8-9c14-4b9cf8a1c979', 'volumeID': 'e886fb6e-f91f-46ae-a488-0f24e47f5731', 'imageID': 'a49d10a6-38f3-40b5-aea3-5628357b1f2e'}, {'path': '/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/d326916a-89b6-41c8-9c14-4b9cf8a1c979/images/a49d10a6-38f3-40b5-aea3-5628357b1f2e/0dec5c38-e585-4565-bde4-37842a741595', 'domainID': 'd326916a-89b6-41c8-9c14-4b9cf8a1c979', 'volumeID': '0dec5c38-e585-4565-bde4-37842a741595', 'imageID': 'a49d10a6-38f3-40b5-aea3-5628357b1f2e'}]}

Comment 6 Eduardo Warszawski 2013-05-12 11:33:03 UTC
The VM was started before the pool connectivity was restored.

In this case vdsm fails to prepare the images (a known issue, i.e. not a regression); this will be fixed in a future version.

After the connectivity was restored the images were prepared.

Comment 7 Haim 2013-05-28 17:08:34 UTC
Reported by a customer; we need to understand the root cause. If a vdsmd restart solves the problem, it means something is broken in the links.
I suggest reconsidering solving this issue.

Comment 8 Haim 2013-05-28 17:08:38 UTC
*** Bug 967676 has been marked as a duplicate of this bug. ***

Comment 9 Eduardo Warszawski 2013-06-11 14:21:03 UTC
(In reply to Haim from comment #7)
> Reported by a customer; we need to understand the root cause. If a vdsmd
> restart solves the problem, it means something is broken in the links.
> I suggest reconsidering solving this issue.

As stated before, the run VM command was issued before connect storage pool had completed. Since in the current code the image is reached through the pool, prepare image fails.

The restart is unrelated; it only matters that the pool is reconnected afterwards.
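
The fix described in the Doc Text (letting the VM run without a connected pool) can be sketched roughly as follows; this is an assumption about the general approach, not the actual vdsm patch, and the path layout and names are illustrative only:

# Rough sketch of the fix direction only; not the actual vdsm change.
import os

def prepare_image_via_pool(sp_uuid, sd_uuid, img_uuid, pool_connected):
    # Old behaviour: the image is reached through the pool link, which is
    # only usable once connectStoragePool has completed.
    if not pool_connected:
        raise RuntimeError(
            "Image path does not exist or cannot be accessed/created")
    return os.path.join("/rhev/data-center", sp_uuid, sd_uuid, "images", img_uuid)

def prepare_image_via_domain(sd_uuid, img_uuid, domain_mount):
    # Sketched new behaviour: resolve the image through the storage domain's
    # own mount point (passed in here), which is available as soon as the
    # storage server is connected, so running a VM no longer depends on the
    # pool connection.
    return os.path.join(domain_mount, sd_uuid, "images", img_uuid)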

Comment 14 Leonid Natapov 2013-11-19 09:59:16 UTC
vdsm-4.13.0-0.8.beta1.el6ev.x86_64. VMs were able to start after blocking connectivity from host to storage for ~30 min.

Comment 15 Charlie 2013-11-28 00:33:25 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 16 errata-xmlrpc 2014-01-21 16:06:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html