Created attachment 1453498 [details]
vsdm.log backup from SPM showing the first occurrence +15 minutes prior to the error

Description of problem:

I had failing OVF updates because of wiped metadata. Please see the following commands:

# lvs -o vg_name,lv_name,tags | grep 3ad1987a-8b7d-426d-9d51-4a78cb0a888f
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f 212d644c-0155-4999-9df9-72bacfc7f141 IU_0ebefe5e-9053-4bf1-bdfd-fdb26579c179,MD_4,PU_00000000-0000-0000-0000-000000000000
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f 94f519de-bc19-4557-82c4-86bbcfc5dd2f IU_60d9eec7-951f-4594-ae99-7d31318ee3a9,MD_5,PU_00000000-0000-0000-0000-000000000000
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f ids
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f inbox
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f leases
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f master
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f metadata
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f outbox
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f xleases

# dd if=/dev/3ad1987a-8b7d-426d-9d51-4a78cb0a888f/metadata bs=512 count=1 skip=4 of=metadata.4
# dd if=/dev/3ad1987a-8b7d-426d-9d51-4a78cb0a888f/metadata bs=512 count=1 skip=5 of=metadata.5

# file metadata.*
metadata.4: data
metadata.5: ASCII text

# od -c metadata.4
0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0001000

# cat metadata.5
DOMAIN=3ad1987a-8b7d-426d-9d51-4a78cb0a888f
CTIME=1520597691
FORMAT=RAW
DISKTYPE=OVFS
LEGALITY=LEGAL
SIZE=262144
VOLTYPE=LEAF
DESCRIPTION={"Updated":true,"Size":4669440,"Last Updated":"Fri Jun 08 09:51:18 CEST 2018","Storage Domains":[{"uuid":"3ad1987a-8b7d-426d-9d51-4a78cb0a888f"}],"Disk Description":"OVF_STORE"}
IMAGE=60d9eec7-951f-4594-ae99-7d31318ee3a9
PUUID=00000000-0000-0000-0000-000000000000
MTIME=0
POOL_UUID=
TYPE=PREALLOCATED
GEN=0
EOF

Metadata for one OVF_STORE volume (slot MD_4) was wiped.

Version-Release number of selected component (if applicable):
oVirt 4.2.3.8-1.el7
vdsm-4.20.27.1-1.el7.centos.x86_64

How reproducible:
Occurred on two of 40 storage domains; no reproduction scenario.

Steps to Reproduce:
1.
2.
3.

Actual results:
Metadata block MD_4 of one OVF_STORE volume contains only zeros, and the OVF update fails.

Expected results:
OVF_STORE volume metadata stays intact and OVF updates succeed.

Additional info:
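A loop like the following makes spotting further wiped slots easier (a sketch only, assuming the 512-byte metadata block size used in the dd commands above; the VG name is the affected domain's):

#!/bin/sh
# Sketch: dump the start of every MD_<n> metadata slot of the domain VG so
# zeroed slots stand out. Assumes 512-byte metadata blocks, as in the dd
# commands above; a wiped slot prints only NULs ("\0").
VG=3ad1987a-8b7d-426d-9d51-4a78cb0a888f
for slot in $(lvs --noheadings -o tags "$VG" | grep -o 'MD_[0-9]*' | cut -d_ -f2); do
    printf 'slot %s: ' "$slot"
    dd if="/dev/$VG/metadata" bs=512 count=1 skip="$slot" 2>/dev/null | od -c | head -n 1
done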
We had another occurrence today on a newer storage domain (metadata block MD_4 wiped). Are there any more logs you would like to see?
Nir, can you have a look please?
(In reply to Tal Nisan from comment #2)
> Nir, can you have a look please?

This is in my queue.
Updating the title since the issue is not a wipe of the OVF metadata. We never wipe the metadata area. If the metadata area had been cleared, it would contain:

NONE=#########################################...

In this case we don't have any metadata at all. Most likely the issue is an OVF volume creation that failed after creating the LV, but before the metadata was written.

We need to:
- check the logs around this OVF volume's creation, to understand if we have some failure in the creation flow.
- check the code, to understand which possible failure modes can lead to missing metadata.
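To illustrate the distinction: a slot vdsm cleared starts with the "NONE=" padding, while the slot reported here reads back as all zeros. A minimal check (a sketch only, using the VG and slot number from comment #0):

#!/bin/sh
# Sketch: classify one metadata slot as cleared by vdsm ("NONE=..."), all
# zeros (the state reported in this bug), or populated. VG and SLOT are the
# values from comment #0; adjust as needed.
VG=3ad1987a-8b7d-426d-9d51-4a78cb0a888f
SLOT=4
block=$(dd if="/dev/$VG/metadata" bs=512 count=1 skip="$SLOT" 2>/dev/null | tr -d '\0')
case "$block" in
    NONE=*) echo "slot $SLOT: cleared" ;;
    "")     echo "slot $SLOT: all zeros - wiped or never written" ;;
    *)      echo "slot $SLOT: populated" ;;
esac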
Oliver,

Can you please supply the logs (engine and VDSM) that contain the OVF_STORE creation? According to the logs attached so far, I couldn't find any reasonable explanation for the failure. Based on Nir's comment, the log that contains the information about the volume creation may shed some light.
Eyal,

logs from the OVF_STORE creation are not available anymore. We had a new occurrence yesterday on a storage domain that already had this problem on July 23rd 2018. oVirt logs are kept for one month, so neither the VDSM logs nor the engine logs are available.

Since June 2018 we have had eight occurrences of this problem on four storage domains (one storage domain with five occurrences, plus three other storage domains) out of a total of about 40 storage domains (and about 40 hosts).

I could extend the VDSM and engine logging retention to capture the next occurrence if this makes sense, for example along the lines of the sketch below.
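(A sketch only; the exact path and directives of vdsm's shipped logrotate configuration vary by version, so this would have to be merged with the existing file rather than replace it.)

# Hypothetical logrotate stanza keeping ~90 days of compressed vdsm logs.
# The shipped vdsm logrotate config (path varies by version) should be
# adjusted rather than overwritten; /var/log/vdsm/vdsm.log is the log shown
# in the attachments.
/var/log/vdsm/vdsm.log {
    daily
    rotate 90
    compress
    missingok
    notifempty
}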
(In reply to oliver.albl from comment #6)
> Eyal,
> 
> logs from the OVF_STORE creation are not available anymore. We had a new
> occurrence yesterday on a storage domain that already had this problem on
> July 23rd 2018. oVirt logs are kept for one month, so neither the VDSM
> logs nor the engine logs are available.
> 
> Since June 2018 we have had eight occurrences of this problem on four
> storage domains (one storage domain with five occurrences, plus three
> other storage domains) out of a total of about 40 storage domains (and
> about 40 hosts).
> 
> I could extend the VDSM and engine logging retention to capture the next
> occurrence if this makes sense.

Can you please add the engine and VDSM logs for those new occurrences?

Also, does the OVF_STORE update fail consistently for a specific disk, or does it sometimes succeed and fail on random disks?
Are VDSM logs from the SPM sufficient, or do you need logs from all hosts?

The empty OVF data is always in the first OVF_STORE volume's slot (dd if=/dev/<id>/metadata bs=512 count=1 skip=4 of=metadata.4).
I would also need the timeframe for the relevant logs (the available VDSM logs from the SPM alone, plus the engine logs, are around 3.5 GB uncompressed).
(In reply to oliver.albl from comment #8)
> Are VDSM logs from the SPM sufficient, or do you need logs from all hosts?
> 
> The empty OVF data is always in the first OVF_STORE volume's slot (dd
> if=/dev/<id>/metadata bs=512 count=1 skip=4 of=metadata.4)

VDSM logs only.

(In reply to oliver.albl from comment #9)
> I would also need the timeframe for the relevant logs (the available VDSM
> logs from the SPM alone, plus the engine logs, are around 3.5 GB
> uncompressed)

The best time frame will be around 5 minutes before and after the last failed OVF_STORE update.
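To cut the large logs down to that window, something like this should work against vdsm's timestamp prefix (a sketch; FROM and TO are placeholders for the window around the failure):

#!/bin/sh
# Sketch: print vdsm.log lines whose "YYYY-MM-DD HH:MM" prefix falls inside
# the window. Relies on the prefix sorting lexicographically; continuation
# lines without a timestamp (e.g. tracebacks) are skipped. FROM/TO are
# placeholder values around the July 23rd occurrence.
FROM="2018-07-23 12:00"
TO="2018-07-23 12:10"
awk -v from="$FROM" -v to="$TO" \
    'substr($0,1,16) >= from && substr($0,1,16) <= to' /var/log/vdsm/vdsm.log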
Created attachment 1480570 [details]
VDSM logs from SPM

Uploaded VDSM logs from the SPM. The OVF update started to fail on 2018-08-31 15:04.

engine message:

2018-08-31 15:04:17.73+02 - Failed to update OVF disks ec65daea-2564-40f9-9664-3c351b94a341, OVF data isn't updated on those OVF stores (Data Center Production, Storage Domain FABAVMHOST_LUN_216).

VDSM log excerpt (fgrep 'ec65daea-2564-40f9-9664-3c351b94a341' vdsm.log):

2018-08-31 15:04:17,335+0200 INFO  (jsonrpc/5) [vdsm.api] START setVolumeDescription(sdUUID=u'a5624e7e-8eab-494d-a2c6-594c75b2ce0a', spUUID=u'5849b030-626e-47cb-ad90-3ce782d831b3', imgUUID=u'ec65daea-2564-40f9-9664-3c351b94a341', volUUID=u'67b76b78-aa64-4c16-9ea4-35c697d4657a', description=u'{"Updated":false,"Last Updated":"Fri Aug 31 14:02:02 CEST 2018","Storage Domains":[{"uuid":"a5624e7e-8eab-494d-a2c6-594c75b2ce0a"}],"Disk Description":"OVF_STORE"}', options=None) from=::ffff:ipaddress,52180, flow_id=49bf0017, task_id=7f214c7a-607c-47ba-860e-5a70410a4db9 (api:46)
2018-08-31 15:04:17,719+0200 WARN  (jsonrpc/5) [storage.ResourceManager] Resource factory failed to create resource '01_img_a5624e7e-8eab-494d-a2c6-594c75b2ce0a.ec65daea-2564-40f9-9664-3c351b94a341'. Canceling request. (resourceManager:543)
2018-08-31 15:04:17,721+0200 WARN  (jsonrpc/5) [storage.ResourceManager.Request] (ResName='01_img_a5624e7e-8eab-494d-a2c6-594c75b2ce0a.ec65daea-2564-40f9-9664-3c351b94a341', ReqID='ca211202-d182-49b8-964e-2c90e5f2bd76') Tried to cancel a processed request (resourceManager:187)
(In reply to oliver.albl from comment #11)
> Created attachment 1480570 [details]
> VDSM logs from SPM
> 
> Uploaded VDSM logs from the SPM. The OVF update started to fail on
> 2018-08-31 15:04.
> 
> engine message:
> 2018-08-31 15:04:17.73+02 - Failed to update OVF disks
> ec65daea-2564-40f9-9664-3c351b94a341, OVF data isn't updated on those OVF
> stores (Data Center Production, Storage Domain FABAVMHOST_LUN_216).

Can you please add the engine log also?

> VDSM log excerpt (fgrep 'ec65daea-2564-40f9-9664-3c351b94a341' vdsm.log):
> 2018-08-31 15:04:17,335+0200 INFO (jsonrpc/5) [vdsm.api] START
> setVolumeDescription(sdUUID=u'a5624e7e-8eab-494d-a2c6-594c75b2ce0a',
> spUUID=u'5849b030-626e-47cb-ad90-3ce782d831b3',
> imgUUID=u'ec65daea-2564-40f9-9664-3c351b94a341',
> volUUID=u'67b76b78-aa64-4c16-9ea4-35c697d4657a',
> description=u'{"Updated":false,"Last Updated":"Fri Aug 31 14:02:02 CEST
> 2018","Storage
> Domains":[{"uuid":"a5624e7e-8eab-494d-a2c6-594c75b2ce0a"}],"Disk
> Description":"OVF_STORE"}', options=None) from=::ffff:ipaddress,52180,
> flow_id=49bf0017, task_id=7f214c7a-607c-47ba-860e-5a70410a4db9 (api:46)
> 2018-08-31 15:04:17,719+0200 WARN (jsonrpc/5) [storage.ResourceManager]
> Resource factory failed to create resource
> '01_img_a5624e7e-8eab-494d-a2c6-594c75b2ce0a.ec65daea-2564-40f9-9664-
> 3c351b94a341'. Canceling request. (resourceManager:543)
> 2018-08-31 15:04:17,721+0200 WARN (jsonrpc/5)
> [storage.ResourceManager.Request]
> (ResName='01_img_a5624e7e-8eab-494d-a2c6-594c75b2ce0a.ec65daea-2564-40f9-
> 9664-3c351b94a341', ReqID='ca211202-d182-49b8-964e-2c90e5f2bd76') Tried to
> cancel a processed request (resourceManager:187)
Created attachment 1480674 [details]
engine log

Uploaded engine log.
It seems that, as Nir said, this specific OVF volume creation may have failed after the LV was created, and the engine fails to update it because of the missing metadata (it failed to acquire the resource). I couldn't find anything in the log that confirms this assumption, or the alternative assumption that the metadata was wiped.

Nir, can you please take a look? Maybe I am missing something here.
Hey Oliver,

Do you know how to reproduce this issue? Also, do you have a workaround for it?
There is no scenario to reproduce; it occurs occasionally, up to two times a month. The workaround is to copy the surviving OVF metadata block and adapt it for the wiped volume (see the sketch below).
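Roughly (a sketch only, based on the layout in comment #0; every volume-specific field must be adapted by hand to match the wiped volume's LV tags, and writing to the metadata LV is risky, so back the block up first):

#!/bin/sh
# Sketch of the "copy and adapt" workaround. VG and slot numbers are the
# values from comment #0; the sed substitution is illustrative only - it
# swaps the surviving volume's IMAGE UUID for the wiped volume's IU_ tag
# value. DESCRIPTION, CTIME etc. may also need manual editing.
VG=3ad1987a-8b7d-426d-9d51-4a78cb0a888f
dd if="/dev/$VG/metadata" of=metadata.4.bak bs=512 count=1 skip=4   # backup of the wiped slot
dd if="/dev/$VG/metadata" of=slot5.txt      bs=512 count=1 skip=5   # surviving copy
sed 's/60d9eec7-951f-4594-ae99-7d31318ee3a9/0ebefe5e-9053-4bf1-bdfd-fdb26579c179/' \
    slot5.txt > slot4.txt
truncate -s 512 slot4.txt   # pad to a full 512-byte block
dd if=slot4.txt of="/dev/$VG/metadata" bs=512 count=1 seek=4 conv=notrunc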
Since we don't have the data required to investigate this issue, but we do have a workaround, I am closing this bug. Please reopen it if you obtain the data needed to investigate the root cause.