Bug 1474656

Summary: Silent Hosted-Engine Auto-Import failure
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine
Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Nikolai Sednev <nsednev>
Severity: high
Docs Contact:
Priority: high
Version: 4.0.7
CC: akrejcir, gveitmic, lsurette, mgoldboi, rbalakri, Rhev-m-bugs, srevivo, ykaul, ylavi
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-11 04:52:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2017-07-25 06:41:57 UTC
Description of problem:

After the Hosted-Engine Storage Domain auto-import is triggered, the engine runs a GetImagesListVDSCommand and then a series of GetImageInfoVDSCommand calls on the images of the HE SD.

These GetImageInfo commands all fail on the engine side with:

2017-07-04 17:54:13,133 WARN  [org.ovirt.engine.core.bll.storage.disk.image.GetUnregisteredDiskQuery] (org.ovirt.thread.pool-6-thread-37) [385bd3ae] Exception while parsing JSON for disk. Exception: '{}': org.codehaus.jackson.JsonParseException: Unexpected character ('h' (code 104)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.StringReader@383484c1; line: 1, column: 2]
        at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433) [jackson-core-asl.jar:1.9.13.redhat-3]
        at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521) [jackson-core-asl.jar:1.9.13.redhat-3]
        at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442) [jackson-core-asl.jar:1.9.13.redhat-3]
        at org.codehaus.jackson.impl.ReaderBasedParser._handleUnexpectedValue(ReaderBasedParser.java:1198) [jackson-core-asl.jar:1.9.13.redhat-3]
        at org.codehaus.jackson.impl.ReaderBasedParser.nextToken(ReaderBasedParser.java:485) [jackson-core-asl.jar:1.9.13.redhat-3]
        at org.codehaus.jackson.map.ObjectMapper._initForReading(ObjectMapper.java:2770) [jackson-mapper-asl.jar:1.9.13.redhat-3]
        at org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2718) [jackson-mapper-asl.jar:1.9.13.redhat-3]
        at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1877) [jackson-mapper-asl.jar:1.9.13.redhat-3]
        at org.ovirt.engine.core.utils.JsonHelper.jsonToMap(JsonHelper.java:41) [utils.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.MetadataDiskDescriptionHandler.enrichDiskByJsonDescription(MetadataDiskDescriptionHandler.java:247) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.GetUnregisteredDiskQuery.executeQueryCommand(GetUnregisteredDiskQuery.java:89) [bll.jar:]
        at org.ovirt.engine.core.bll.QueriesCommandBase.executeCommand(QueriesCommandBase.java:103) [bll.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.bll.Backend.runQueryImpl(Backend.java:558) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runInternalQuery(Backend.java:524) [bll.jar:]

But the HostedEngine VM is imported into the environment and so is the hosted_storage. No errors are shown in the Administration Portal. I've seen it happen on two customers today. Both show symptoms, even though everything apparently went fine from the user's perspective.

1) 4.0.7 engine with 4.19 vdsm
Symptom: Can't deploy additional HE Host because configs passed to host on Hosted-Engine->Deploy are empty, similar to BZ #1414696.

2) 3.6 engine with 4.19 vdsm (customer doing upgrade to RHV 4.0)
Symptom: HE SD not attached to the Storage Pool (VG tag MDT_POOL_UUID= is empty), so any operation on the HE SD fails with "ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()" due to dom.getPools() returning an empty list.
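
A quick way to check for the second symptom is to look at the MDT_POOL_UUID tag on the domain's VG. The following is only a minimal sketch: it assumes the tags were already collected as one comma-separated string (for example from `vgs --noheadings -o vg_tags <vg>`), the helper names are made up for illustration, the sample tag values are hypothetical, and it does not handle any escaping vdsm may apply to tag values.

```python
def parse_md_tags(vg_tags: str) -> dict:
    """Split a comma-separated VG tag string into MDT_* key/value pairs.

    Hypothetical helper, not part of vdsm. Tags without the MDT_ prefix
    or without '=' are ignored.
    """
    metadata = {}
    for tag in vg_tags.strip().split(","):
        if tag.startswith("MDT_") and "=" in tag:
            key, _, value = tag[len("MDT_"):].partition("=")
            metadata[key] = value
    return metadata


def is_attached_to_pool(vg_tags: str) -> bool:
    """Return True only if MDT_POOL_UUID is present and non-empty."""
    return bool(parse_md_tags(vg_tags).get("POOL_UUID"))


# Hypothetical tag string resembling the broken state (empty pool UUID).
sample = "MDT_CLASS=Data,MDT_POOL_UUID=,MDT_ROLE=Regular,RHAT_storage_domain"
print(is_attached_to_pool(sample))
```

For the sample string the check reports that the domain is not attached to any pool, which matches the empty MDT_POOL_UUID= seen on customer 2.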

Unfortunately, both customers have already rotated their vdsm logs. I'm trying to reproduce it.

I'm afraid this might bite in other places like upgrades or disaster recovery.

How reproducible:
Trying to...

Comment 3 Germano Veit Michel 2017-07-25 07:10:23 UTC
Is vdsm sending disk info that the engine fails to parse?

Could it be due to the vdsm version (4.19) being higher than the engine versions (3.6 and 4.0)?

Comment 4 Germano Veit Michel 2017-07-25 07:26:29 UTC
So...

The Storage Domain Metadata for the disks on customer 2 contains this:

DESCRIPTION=HostedEngineConfigurationImage
DESCRIPTION=hosted-engine.lockspace
DESCRIPTION=hosted-engine.metadata
DESCRIPTION=Hosted Engine Image

The engine was expecting it in JSON format, like the OVF_STORE one, right? Similar to this?
DESCRIPTION={"Updated":true,"Size":20480,"Last Updated":"Thu Jun 15 09:17:02 CEST 2017","Storage Domains":[{"uuid":"12166789-fa51-4639-8dc7-91ed4f94dfb7"}],"Disk Description":"OVF_STORE"}

But instead of a '{' it got an 'h'?

Comment 5 Andrej Krejcir 2017-08-02 13:07:05 UTC
It seems that the warning message (with a lot of stack trace) is not related to the problems.

The disk description field can be either a JSON object or a plain string. The JSON description is used when the disk is created by the engine. But the hosted engine VM disk is created by hosted-engine-setup, which sets the description to a plain string instead of JSON.
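
To illustrate the two description formats, here is a minimal Python sketch of tolerant parsing. It is not the engine's actual code (the engine does this in Java, in MetadataDiskDescriptionHandler); the function name is hypothetical, and reusing the "Disk Description" key for the plain-string case is an assumption based on the OVF_STORE example in comment 4.

```python
import json


def parse_disk_description(raw: str) -> dict:
    """Return the description as a dict, whether it is JSON or plain text.

    Hypothetical helper for illustration only. A value that is not a JSON
    object (including one that fails to parse) is treated as a plain
    description string.
    """
    try:
        parsed = json.loads(raw)
    except ValueError:
        # Plain strings such as 'hosted-engine.lockspace' end up here.
        parsed = None
    if isinstance(parsed, dict):
        return parsed
    return {"Disk Description": raw}


# Description values taken from comment 4.
for value in (
    "HostedEngineConfigurationImage",
    "hosted-engine.lockspace",
    "hosted-engine.metadata",
    "Hosted Engine Image",
    '{"Updated":true,"Size":20480,"Disk Description":"OVF_STORE"}',
):
    print(parse_disk_description(value)["Disk Description"])
```

With this kind of fallback, a plain-string description is not an error, which is consistent with the warning being harmless noise rather than the cause of the failures.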

Probably, this behavior has not changed since 3.6.
I'm getting the same warning message on master and a clean HE deployment.

Looking at the engine logs, a lot of the errors come directly from rpc calls to vdsm, so without the vdsm logs, it is hard to know why.

Comment 6 Germano Veit Michel 2017-08-03 02:47:35 UTC
Hi Andrej,

I find it quite intriguing that both cases have problems related to the Hosted-Engine storage domain. These are problems that I have never seen before. I can't make any sense of this.

(In reply to Andrej Krejcir from comment #5)
> I'm getting the same warning message on master and a clean HE deployment.

Are you using ovirt-engine from master or vdsm from master, or both? I can try too.