Created attachment 1453498 [details]
vsdm.log backup from SPM showing the first occurrence +15 minutes prior to the error

Description of problem:

I had failing OVF updates because of wiped metadata. Please see the following commands:

# lvs -o vg_name,lv_name,tags | grep 3ad1987a-8b7d-426d-9d51-4a78cb0a888f
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f 212d644c-0155-4999-9df9-72bacfc7f141 IU_0ebefe5e-9053-4bf1-bdfd-fdb26579c179,MD_4,PU_00000000-0000-0000-0000-000000000000
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f 94f519de-bc19-4557-82c4-86bbcfc5dd2f IU_60d9eec7-951f-4594-ae99-7d31318ee3a9,MD_5,PU_00000000-0000-0000-0000-000000000000
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f ids
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f inbox
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f leases
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f master
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f metadata
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f outbox
  3ad1987a-8b7d-426d-9d51-4a78cb0a888f xleases

# dd if=/dev/3ad1987a-8b7d-426d-9d51-4a78cb0a888f/metadata bs=512 count=1 skip=4 of=metadata.4
# dd if=/dev/3ad1987a-8b7d-426d-9d51-4a78cb0a888f/metadata bs=512 count=1 skip=5 of=metadata.5

# file metadata.*
metadata.4: data
metadata.5: ASCII text

# od -c metadata.4
0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0001000

# cat metadata.5
DOMAIN=3ad1987a-8b7d-426d-9d51-4a78cb0a888f
CTIME=1520597691
FORMAT=RAW
DISKTYPE=OVFS
LEGALITY=LEGAL
SIZE=262144
VOLTYPE=LEAF
DESCRIPTION={"Updated":true,"Size":4669440,"Last Updated":"Fri Jun 08 09:51:18 CEST 2018","Storage Domains":[{"uuid":"3ad1987a-8b7d-426d-9d51-4a78cb0a888f"}],"Disk Description":"OVF_STORE"}
IMAGE=60d9eec7-951f-4594-ae99-7d31318ee3a9
PUUID=00000000-0000-0000-0000-000000000000
MTIME=0
POOL_UUID=
TYPE=PREALLOCATED
GEN=0
EOF

Metadata for one OVF_STORE volume (slot MD_4) was wiped.

Version-Release number of selected component (if applicable):
oVirt 4.2.3.8-1.el7
vdsm-4.20.27.1-1.el7.centos.x86_64

How reproducible:
Occurred on two of 40 storage domains; no reproduction scenario.

Steps to Reproduce:
1.
2.
3.

Actual results:
Metadata block MD_4 of one OVF_STORE volume contains only zeros, and the OVF update fails.

Expected results:
OVF_STORE volume metadata stays intact and OVF updates succeed.

Additional info:
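A loop like the following makes spotting further wiped slots easier (a sketch only, assuming the 512-byte metadata block size used in the dd commands above; the VG name is the affected domain's):

#!/bin/sh
# Sketch: dump the start of every MD_<n> metadata slot of the domain VG so
# zeroed slots stand out. Assumes 512-byte metadata blocks, as in the dd
# commands above; a wiped slot prints only NULs ("\0").
VG=3ad1987a-8b7d-426d-9d51-4a78cb0a888f
for slot in $(lvs --noheadings -o tags "$VG" | grep -o 'MD_[0-9]*' | cut -d_ -f2); do
    printf 'slot %s: ' "$slot"
    dd if="/dev/$VG/metadata" bs=512 count=1 skip="$slot" 2>/dev/null | od -c | head -n 1
done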
We had another occurrence today on a newer storage domain (metadata block MD_4 wiped). Are there any more logs you would like to see?
Nir, can you have a look please?
(In reply to Tal Nisan from comment #2)
> Nir, can you have a look please?

This is in my queue.
Updating the title since the issue is not a wipe of the OVF metadata. We never wipe the metadata area. If the metadata area had been cleared, it would contain:

NONE=#########################################...

In this case we don't have any metadata at all. Most likely the issue is an OVF volume creation that failed after creating the LV, but before the metadata was written.

We need to:
- check the logs around this OVF volume's creation, to understand if we have some failure in the creation flow.
- check the code, to understand which possible failure modes can lead to missing metadata.
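To illustrate the distinction: a slot vdsm cleared starts with the "NONE=" padding, while the slot reported here reads back as all zeros. A minimal check (a sketch only, using the VG and slot number from comment #0):

#!/bin/sh
# Sketch: classify one metadata slot as cleared by vdsm ("NONE=..."), all
# zeros (the state reported in this bug), or populated. VG and SLOT are the
# values from comment #0; adjust as needed.
VG=3ad1987a-8b7d-426d-9d51-4a78cb0a888f
SLOT=4
block=$(dd if="/dev/$VG/metadata" bs=512 count=1 skip="$SLOT" 2>/dev/null | tr -d '\0')
case "$block" in
    NONE=*) echo "slot $SLOT: cleared" ;;
    "")     echo "slot $SLOT: all zeros - wiped or never written" ;;
    *)      echo "slot $SLOT: populated" ;;
esac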
Oliver,

Can you please supply the logs (engine and VDSM) that contain the OVF_STORE creation? According to the logs attached so far, I couldn't find any reasonable explanation for the failure. Based on Nir's comment, the log that contains the information about the volume creation may shed some light.
Eyal,

logs from the OVF_STORE creation are not available anymore. We had a new occurrence yesterday on a storage domain that already had this problem on July 23rd 2018. oVirt logs are kept for one month, so neither the VDSM logs nor the engine logs are available.

Since June 2018 we have had eight occurrences of this problem on four storage domains (one storage domain with five occurrences, plus three other storage domains) out of a total of about 40 storage domains (and about 40 hosts).

I could extend the VDSM and engine logging retention to capture the next occurrence if this makes sense, for example along the lines of the sketch below.
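(A sketch only; the exact path and directives of vdsm's shipped logrotate configuration vary by version, so this would have to be merged with the existing file rather than replace it.)

# Hypothetical logrotate stanza keeping ~90 days of compressed vdsm logs.
# The shipped vdsm logrotate config (path varies by version) should be
# adjusted rather than overwritten; /var/log/vdsm/vdsm.log is the log shown
# in the attachments.
/var/log/vdsm/vdsm.log {
    daily
    rotate 90
    compress
    missingok
    notifempty
}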
(In reply to oliver.albl from comment #6)
> Eyal,
> 
> logs from the OVF_STORE creation are not available anymore. We had a new
> occurrence yesterday on a storage domain that already had this problem on
> July 23rd 2018. oVirt logs are kept for one month, so neither the VDSM
> logs nor the engine logs are available.
> 
> Since June 2018 we have had eight occurrences of this problem on four
> storage domains (one storage domain with five occurrences, plus three
> other storage domains) out of a total of about 40 storage domains (and
> about 40 hosts).
> 
> I could extend the VDSM and engine logging retention to capture the next
> occurrence if this makes sense.

Can you please add the engine and VDSM logs for those new occurrences?

Also, does the OVF_STORE update fail consistently for a specific disk, or does it sometimes succeed and fail on random disks?
Are VDSM logs from the SPM sufficient, or do you need logs from all hosts?

The empty OVF data is always in the first OVF_STORE volume's slot (dd if=/dev/<id>/metadata bs=512 count=1 skip=4 of=metadata.4).
I would also need the timeframe for the relevant logs (the available VDSM logs from the SPM alone, plus the engine logs, are around 3.5 GB uncompressed).
(In reply to oliver.albl from comment #8)
> Are VDSM logs from the SPM sufficient, or do you need logs from all hosts?
> 
> The empty OVF data is always in the first OVF_STORE volume's slot (dd
> if=/dev/<id>/metadata bs=512 count=1 skip=4 of=metadata.4)

VDSM logs only.

(In reply to oliver.albl from comment #9)
> I would also need the timeframe for the relevant logs (the available VDSM
> logs from the SPM alone, plus the engine logs, are around 3.5 GB
> uncompressed)

The best time frame will be around 5 minutes before and after the last failed OVF_STORE update.
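To cut the large logs down to that window, something like this should work against vdsm's timestamp prefix (a sketch; FROM and TO are placeholders for the window around the failure):

#!/bin/sh
# Sketch: print vdsm.log lines whose "YYYY-MM-DD HH:MM" prefix falls inside
# the window. Relies on the prefix sorting lexicographically; continuation
# lines without a timestamp (e.g. tracebacks) are skipped. FROM/TO are
# placeholder values around the July 23rd occurrence.
FROM="2018-07-23 12:00"
TO="2018-07-23 12:10"
awk -v from="$FROM" -v to="$TO" \
    'substr($0,1,16) >= from && substr($0,1,16) <= to' /var/log/vdsm/vdsm.log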
Created attachment 1480570 [details]
VDSM logs from SPM

Uploaded VDSM logs from the SPM. The OVF update started to fail on 2018-08-31 15:04.

engine message:

2018-08-31 15:04:17.73+02 - Failed to update OVF disks ec65daea-2564-40f9-9664-3c351b94a341, OVF data isn't updated on those OVF stores (Data Center Production, Storage Domain FABAVMHOST_LUN_216).

VDSM log excerpt (fgrep 'ec65daea-2564-40f9-9664-3c351b94a341' vdsm.log):

2018-08-31 15:04:17,335+0200 INFO  (jsonrpc/5) [vdsm.api] START setVolumeDescription(sdUUID=u'a5624e7e-8eab-494d-a2c6-594c75b2ce0a', spUUID=u'5849b030-626e-47cb-ad90-3ce782d831b3', imgUUID=u'ec65daea-2564-40f9-9664-3c351b94a341', volUUID=u'67b76b78-aa64-4c16-9ea4-35c697d4657a', description=u'{"Updated":false,"Last Updated":"Fri Aug 31 14:02:02 CEST 2018","Storage Domains":[{"uuid":"a5624e7e-8eab-494d-a2c6-594c75b2ce0a"}],"Disk Description":"OVF_STORE"}', options=None) from=::ffff:ipaddress,52180, flow_id=49bf0017, task_id=7f214c7a-607c-47ba-860e-5a70410a4db9 (api:46)
2018-08-31 15:04:17,719+0200 WARN  (jsonrpc/5) [storage.ResourceManager] Resource factory failed to create resource '01_img_a5624e7e-8eab-494d-a2c6-594c75b2ce0a.ec65daea-2564-40f9-9664-3c351b94a341'. Canceling request. (resourceManager:543)
2018-08-31 15:04:17,721+0200 WARN  (jsonrpc/5) [storage.ResourceManager.Request] (ResName='01_img_a5624e7e-8eab-494d-a2c6-594c75b2ce0a.ec65daea-2564-40f9-9664-3c351b94a341', ReqID='ca211202-d182-49b8-964e-2c90e5f2bd76') Tried to cancel a processed request (resourceManager:187)
(In reply to oliver.albl from comment #11)
> Created attachment 1480570 [details]
> VDSM logs from SPM
> 
> Uploaded VDSM logs from the SPM. The OVF update started to fail on
> 2018-08-31 15:04.
> 
> engine message:
> 2018-08-31 15:04:17.73+02 - Failed to update OVF disks
> ec65daea-2564-40f9-9664-3c351b94a341, OVF data isn't updated on those OVF
> stores (Data Center Production, Storage Domain FABAVMHOST_LUN_216).

Can you please add the engine log also?

> VDSM log excerpt (fgrep 'ec65daea-2564-40f9-9664-3c351b94a341' vdsm.log):
> 2018-08-31 15:04:17,335+0200 INFO (jsonrpc/5) [vdsm.api] START
> setVolumeDescription(sdUUID=u'a5624e7e-8eab-494d-a2c6-594c75b2ce0a',
> spUUID=u'5849b030-626e-47cb-ad90-3ce782d831b3',
> imgUUID=u'ec65daea-2564-40f9-9664-3c351b94a341',
> volUUID=u'67b76b78-aa64-4c16-9ea4-35c697d4657a',
> description=u'{"Updated":false,"Last Updated":"Fri Aug 31 14:02:02 CEST
> 2018","Storage
> Domains":[{"uuid":"a5624e7e-8eab-494d-a2c6-594c75b2ce0a"}],"Disk
> Description":"OVF_STORE"}', options=None) from=::ffff:ipaddress,52180,
> flow_id=49bf0017, task_id=7f214c7a-607c-47ba-860e-5a70410a4db9 (api:46)
> 2018-08-31 15:04:17,719+0200 WARN (jsonrpc/5) [storage.ResourceManager]
> Resource factory failed to create resource
> '01_img_a5624e7e-8eab-494d-a2c6-594c75b2ce0a.ec65daea-2564-40f9-9664-
> 3c351b94a341'. Canceling request. (resourceManager:543)
> 2018-08-31 15:04:17,721+0200 WARN (jsonrpc/5)
> [storage.ResourceManager.Request]
> (ResName='01_img_a5624e7e-8eab-494d-a2c6-594c75b2ce0a.ec65daea-2564-40f9-
> 9664-3c351b94a341', ReqID='ca211202-d182-49b8-964e-2c90e5f2bd76') Tried to
> cancel a processed request (resourceManager:187)
Created attachment 1480674 [details]
engine log

Uploaded engine log.
It seems that, as Nir said, this specific OVF volume creation may have failed after the LV was created, and the engine fails to update it because of the missing metadata (it failed to acquire the resource). I couldn't find anything in the log that confirms this assumption, or the alternative assumption that the metadata was wiped.

Nir, can you please take a look? Maybe I am missing something here.
Hey Oliver,

Do you know how to reproduce this issue? Also, do you have a workaround for it?
There is no scenario to reproduce; it occurs occasionally, up to two times a month. The workaround is to copy the surviving OVF metadata block and adapt it for the wiped volume (see the sketch below).
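Roughly (a sketch only, based on the layout in comment #0; every volume-specific field must be adapted by hand to match the wiped volume's LV tags, and writing to the metadata LV is risky, so back the block up first):

#!/bin/sh
# Sketch of the "copy and adapt" workaround. VG and slot numbers are the
# values from comment #0; the sed substitution is illustrative only - it
# swaps the surviving volume's IMAGE UUID for the wiped volume's IU_ tag
# value. DESCRIPTION, CTIME etc. may also need manual editing.
VG=3ad1987a-8b7d-426d-9d51-4a78cb0a888f
dd if="/dev/$VG/metadata" of=metadata.4.bak bs=512 count=1 skip=4   # backup of the wiped slot
dd if="/dev/$VG/metadata" of=slot5.txt      bs=512 count=1 skip=5   # surviving copy
sed 's/60d9eec7-951f-4594-ae99-7d31318ee3a9/0ebefe5e-9053-4bf1-bdfd-fdb26579c179/' \
    slot5.txt > slot4.txt
truncate -s 512 slot4.txt   # pad to a full 512-byte block
dd if=slot4.txt of="/dev/$VG/metadata" bs=512 count=1 seek=4 conv=notrunc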
Since we don't have the data required to investigate this issue, but we do have a workaround, I am closing this bug. Please reopen it if you obtain the data needed to investigate the root cause.