Bug 1574744
Summary: | Slow prepareImage prevents the HostedEngine from starting | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
Component: | ovirt-hosted-engine-ha | Assignee: | Martin Sivák <msivak> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | meital avital <mavital> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.1.10 | CC: | gveitmic, lsurette, msivak, ykaul |
Target Milestone: | ovirt-4.2.4 | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-05-18 07:53:37 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Germano Veit Michel
2018-05-04 00:08:34 UTC
This is significantly improved in 4.2, we do not refresh the volumes anymore if all is fine.

(In reply to Martin Sivák from comment #1)
> This is significantly improved in 4.2, we do not refresh the volumes anymore
> if all is fine.

In 4.1 I can see that once the images are prepared and all is running fine, it does not refresh the volumes anymore. But here we are seeing a loop of:

1. agent start
2. prepare images
3. timeout (takes more than 30s)
4. goto 1

So after agent restart, it tries to prepareImages again, even if the images are already prepared. Was this changed in 4.2?

Probably not. I guess increasing the timeout is possible, although 30 seconds is a bit extreme.

Hmm I just realized we changed the architecture in 4.2. Is there a way to check whether this is happening in 4.2 as well? The connection command is internal to broker there and no communication timeout should be happening.

(In reply to Martin Sivák from comment #5)
> Hmm I just realized we changed the architecture in 4.2. Is there a way to
> check whether this is happening in 4.2 as well? The connection command is
> internal to broker there and no communication timeout should be happening.

Maybe add some sleeps on vdsm prepareImages code?

Another ticket. We are troubleshooting the slow lvs commands, which shouldn't happen. But it would be nicer if this mechanism was more resilient with slow lvm/storage, especially to bring the HE up.
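The restart loop described above can be modelled in a few lines. This is only a rough sketch, not ovirt-hosted-engine-ha code: `simulate_agent`, its parameters, and the assumption that the 30s timeout applies to the total time spent preparing all images are hypothetical and chosen for illustration.

```python
# Hypothetical model of the 4.1 restart loop; not ovirt-hosted-engine-ha code.
TIMEOUT_S = 30  # timeout mentioned in this bug; assumed to cover the whole prepare step


def simulate_agent(prepare_seconds_per_image, image_count, max_restarts=3):
    """Return one (attempt, outcome, elapsed_seconds) tuple per agent start."""
    outcomes = []
    for attempt in range(1, max_restarts + 1):
        # Every agent start prepares all images again, even if already prepared.
        elapsed = prepare_seconds_per_image * image_count
        if elapsed > TIMEOUT_S:
            outcomes.append((attempt, "timeout", elapsed))
            continue  # step 4: goto 1 (agent restarts)
        outcomes.append((attempt, "running", elapsed))
        break
    return outcomes
```

Under these assumptions, four images at 10 seconds each never get past the prepare step, because each restart repeats the same 40 seconds of work, while four images at 5 seconds each succeed on the first start.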
Hi Martin,

I added a sleep(20) in API.py for Image.prepare for the Hosted_Engine SD:

    def prepare(self, volumeID, allowIllegal=False):
        # Test-only delay, applied to the Hosted Engine storage domain only
        if self._sdUUID == "21ae95db-0f97-4c06-bb60-e3ba541400f0":
            time.sleep(20)
        return self._irs.prepareImage(self._sdUUID, self._spUUID,
                                      self._UUID, volumeID,
                                      allowIllegal=allowIllegal)

Version: ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch

This is the result of every prepareImage call from the broker/agent to the HE SD:

    2018-05-18 10:50:26,624+0530 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Image.prepare succeeded in 20.04 seconds (__init__:573)

Given there are 4 images to prepare, the total should easily go above 30s. And there is no timeout on the ha daemons; all is working fine (but slowly). So the new mechanism is more resilient to storage slowdowns.

If you agree that this test is valid, feel free to close the bug.

Thanks Germano, yes, we improved the behaviour a lot in 4.2. The test seems to be fine; this method is exactly what we call to prepare the images.

Since the bug was reported against 4.1 and we seem to have fixed it in 4.2, I will close it as current release. A backport is unlikely, as it was almost a complete rewrite of the storage part.
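For anyone repeating this kind of test, the delay injection could also be written as a decorator so that the delay, the target SD, and the sleep function are explicit. This is a minimal sketch under stated assumptions: `delay_for`, `FakeImage`, and `slept` are hypothetical names for illustration and are not part of vdsm. The only detail taken from the test above is that the object exposes `_sdUUID`.

```python
import functools
import time

# Hosted Engine SD UUID used in the test in this bug.
HE_SD_UUID = "21ae95db-0f97-4c06-bb60-e3ba541400f0"


def delay_for(sd_uuid, seconds, sleep=time.sleep):
    """Hypothetical decorator: delay a method when self._sdUUID matches sd_uuid."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            if self._sdUUID == sd_uuid:
                sleep(seconds)  # injected storage slowdown
            return method(self, *args, **kwargs)
        return wrapper
    return decorator


slept = []  # records injected delays instead of really sleeping


class FakeImage:
    """Stand-in for the real Image object; only _sdUUID matters here."""

    def __init__(self, sd_uuid):
        self._sdUUID = sd_uuid

    @delay_for(HE_SD_UUID, 20, sleep=slept.append)
    def prepare(self, volumeID, allowIllegal=False):
        # The real method calls into storage; this one just echoes its input.
        return {"volumeID": volumeID, "allowIllegal": allowIllegal}
```

Passing `slept.append` as the sleep function keeps the example fast to run; with the default `time.sleep`, a matching call would block for the full 20 seconds, as in the test above.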