Description of problem:

Some lvm commands take slightly longer than usual (around 3s each). This is not ideal in itself, but the consequences are quite ugly: the HostedEngine won't start and the whole environment is down.

1. The agent calls the broker with set_storage_domain, waits 30s for the response, and times out:

MainThread::ERROR::2018-05-02 14:39:58,973::brokerlink::280::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::INFO::2018-05-02 14:39:58,973::hosted_engine::595::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed set the storage domain: 'Failed to set storage domain VdsmBackend, options {'hosted-engine.lockspace': '7B22696D6167655F75756964223A202235666336663538652D623563662D343065622D626134352D323335353765633166373263222C202270617468223A206E756C6C2C2022766F6C756D655F75756964223A202263323031636136382D393565322D346237612D393539392D653230363037633337393739227D', 'sp_uuid': '00000000-0000-0000-0000-000000000000', 'dom_type': 'fc', 'hosted-engine.metadata': '7B22696D6167655F75756964223A202233633062646266632D343339342D343361612D393661652D643362383534393733663364222C202270617468223A206E756C6C2C2022766F6C756D655F75756964223A202263396439346563322D633838392D343632652D393861622D616564336435356537333263227D', 'sd_uuid': '381e37fb-394c-4511-b088-88501085aed1'}: Connection timed out'. Waiting '5's before the next attempt

2. The broker receives the request:

Thread-64::DEBUG::2018-05-02 14:39:28,943::listener::163::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Input: set-storage-domain VdsmBackend hosted-engine.lockspace=7B22696D6167655F75756964223A202235666336663538652D623563662D343065622D626134352D323335353765633166373263222C202270617468223A206E756C6C2C2022766F6C756D655F75756964223A202263323031636136382D393565322D346237612D393539392D653230363037633337393739227D sp_uuid=00000000-0000-0000-0000-000000000000 dom_type=fc hosted-engine.metadata=7B22696D6167655F75756964223A202233633062646266632D343339342D343361612D393661652D643362383534393733663364222C202270617468223A206E756C6C2C2022766F6C756D655F75756964223A202263396439346563322D633838392D343632652D393861622D616564336435356537333263227D sd_uuid=381e37fb-394c-4511-b088-88501085aed1
Thread-64::DEBUG::2018-05-02 14:39:28,943::listener::238::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(_dispatch) Request type set-storage-domain from 140381265266432
Thread-64::DEBUG::2018-05-02 14:39:28,944::storage_broker::65::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(set_storage_domain) Connecting to <ovirt_hosted_engine_ha.lib.storage_backends.VdsmBackend object at 0x26075d0>

3. The broker contacts VDSM:

Thread-64::DEBUG::2018-05-02 14:39:28,944::storage_backends::395::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting to VDSM
Thread-64::DEBUG::2018-05-02 14:39:31,956::util::209::ovirt_hosted_engine_ha.lib.storage_backends::(connect_vdsm_json_rpc) {'status': {'message': 'Done', 'code': 0}, u'systemProductName': u'Synergy 480 Gen9', u'systemSerialNumber': u'VCGGLCV001', u'systemFamily': u'Synergy', u'systemUUID': u'C21FA933-C275-4016-909E-6C9B0558ECD8', u'systemManufacturer': u'HP'}

4. The broker sends a prepareImage request for hosted-engine.lockspace to VDSM:

Thread-64::DEBUG::2018-05-02 14:39:46,891::storage_backends::246::ovirt_hosted_engine_ha.lib.storage_backends::(_get_volume_path) prepareImage: '{'status': {'message': 'Done', 'code': 0}, u'path': u'/var/run/vdsm/storage/381e37fb-394c-4511-b088-88501085aed1/5fc6f58e-b5cf-40eb-ba45-23557ec1f72c/c201ca68-95e2-4b7a-9599-e20607c37979'}'
Thread-64::DEBUG::2018-05-02 14:39:46,892::storage_backends::248::ovirt_hosted_engine_ha.lib.storage_backends::(_get_volume_path) prepareImage: returned '/var/run/vdsm/storage/381e37fb-394c-4511-b088-88501085aed1/5fc6f58e-b5cf-40eb-ba45-23557ec1f72c/c201ca68-95e2-4b7a-9599-e20607c37979' as path
Thread-64::INFO::2018-05-02 14:39:46,892::storage_backends::119::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/blockSD/381e37fb-394c-4511-b088-88501085aed1/ha_agent/hosted-engine.lockspace'

5. The broker sends a prepareImage request for hosted-engine.metadata to VDSM:

Thread-64::DEBUG::2018-05-02 14:40:01,708::storage_backends::246::ovirt_hosted_engine_ha.lib.storage_backends::(_get_volume_path) prepareImage: '{'status': {'message': 'Done', 'code': 0}, u'path': u'/var/run/vdsm/storage/381e37fb-394c-4511-b088-88501085aed1/3c0bdbfc-4394-43aa-96ae-d3b854973f3d/c9d94ec2-c889-462e-98ab-aed3d55e732c'}'
Thread-64::DEBUG::2018-05-02 14:40:01,708::storage_backends::248::ovirt_hosted_engine_ha.lib.storage_backends::(_get_volume_path) prepareImage: returned '/var/run/vdsm/storage/381e37fb-394c-4511-b088-88501085aed1/3c0bdbfc-4394-43aa-96ae-d3b854973f3d/c9d94ec2-c889-462e-98ab-aed3d55e732c' as path
Thread-64::INFO::2018-05-02 14:40:01,709::storage_backends::119::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/blockSD/381e37fb-394c-4511-b088-88501085aed1/ha_agent/hosted-engine.metadata'

6. The broker returns success:

Thread-64::DEBUG::2018-05-02 14:40:01,711::listener::170::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Response: success 39876048

But this took 32.744s, and the agent only waits 30s (hardcoded):

MESSAGE_READ_TIMEOUT_SEC = 30.0

So from the agent's point of view the call never completes, and the HostedEngine is down and cannot be started.

On the VDSM side, the lvm commands take slightly longer than usual:

2018-05-03 20:33:54,537+0000 DEBUG (jsonrpc/0) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-87 /usr/bin/sudo -n /sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360002ac000000000000000020001f7c1|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --refresh 381e37fb-394c-4511-b088-88501085aed1/leases (cwd None) (commands:69)
...~3 secs (lvm refresh)...
2018-05-03 20:33:58,192+0000 DEBUG (jsonrpc/0) [storage.Misc.excCmd] SUCCESS: <err> = ''; <rc> = 0 (commands:93)

SUMMARY:
* The agent requests set_storage_domain from the broker and waits 30s for a reply.
* The broker completes the request successfully, but it takes 32.744s because each prepareImage call to VDSM takes ~15s.
* 32.744s > 30s, so the agent closes the connection due to the timeout and never sees the successful return, even though everything succeeded.

REQUESTED:
* Should MESSAGE_READ_TIMEOUT_SEC be made configurable, to work around such scenarios and bring the HE up?
* Why call prepareImage every time, even for images that are already prepared?
I remember some fixes around this, maybe this is already improved in 4.2?

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.1.11-1.el7ev.noarch
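To make the timing race concrete, here is a minimal, self-contained sketch using plain Python sockets (it is not the actual agent/broker code): a "broker" that needs ~32s to answer and an "agent" that stops reading after the hardcoded 30s, so the successful reply is never seen.

    # Minimal sketch only, not the real agent/broker code: a server that
    # takes ~32s to reply and a client that gives up after 30s, mirroring
    # the hardcoded MESSAGE_READ_TIMEOUT_SEC described above.
    import socket
    import threading
    import time

    MESSAGE_READ_TIMEOUT_SEC = 30.0   # hardcoded in the 4.1 agent
    BROKER_WORK_SEC = 32.7            # roughly what set-storage-domain took here

    def slow_broker(server):
        conn, _ = server.accept()
        conn.recv(1024)                   # "set-storage-domain VdsmBackend ..."
        time.sleep(BROKER_WORK_SEC)       # prepareImage calls, LV refresh, ...
        try:
            conn.sendall(b"success\n")    # the work itself did succeed
        finally:
            conn.close()

    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    threading.Thread(target=slow_broker, args=(server,), daemon=True).start()

    agent = socket.create_connection(server.getsockname())
    agent.settimeout(MESSAGE_READ_TIMEOUT_SEC)
    agent.sendall(b"set-storage-domain VdsmBackend ...\n")
    try:
        print("agent got:", agent.recv(1024))
    except socket.timeout:
        # This is the reported failure mode: the reply is still coming,
        # but the agent has already given up and will retry from scratch.
        print("agent: Connection timed out, retrying")
    finally:
        agent.close()

Run as-is, this prints the timeout message after 30 seconds even though the broker-side work completed, which is exactly the retry loop the agent gets stuck in.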
This is significantly improved in 4.2, we do not refresh the volumes anymore if all is fine.
(In reply to Martin Sivák from comment #1)
> This is significantly improved in 4.2, we do not refresh the volumes anymore
> if all is fine.

In 4.1 I can see that once the images are prepared and everything is running fine, it does not refresh the volumes anymore. But here we are seeing a loop of:

1. agent starts
2. it prepares the images
3. it times out (the preparation takes more than 30s)
4. goto 1

So after an agent restart, it tries to prepareImage again, even if the images are already prepared. Was this changed in 4.2?
Probably not. I guess increasing the timeout is possible, although 30 seconds is a bit extreme.
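For illustration only: the 4.1 packages have no such option, and the environment variable name below is invented. A workaround along the lines Martin mentions could look like making the constant overridable instead of hardcoded, for example:

    # Hypothetical sketch, not shipped code: allow raising (never lowering)
    # the read timeout via an invented HE_BROKER_READ_TIMEOUT variable.
    import os

    DEFAULT_MESSAGE_READ_TIMEOUT_SEC = 30.0

    def _message_read_timeout():
        raw = os.environ.get("HE_BROKER_READ_TIMEOUT")
        try:
            return max(float(raw), DEFAULT_MESSAGE_READ_TIMEOUT_SEC)
        except (TypeError, ValueError):
            return DEFAULT_MESSAGE_READ_TIMEOUT_SEC

    MESSAGE_READ_TIMEOUT_SEC = _message_read_timeout()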
Hmm I just realized we changed the architecture in 4.2. Is there a way to check whether this is happening in 4.2 as well? The connection command is internal to broker there and no communication timeout should be happening.
(In reply to Martin Sivák from comment #5)
> Hmm I just realized we changed the architecture in 4.2. Is there a way to
> check whether this is happening in 4.2 as well? The connection command is
> internal to broker there and no communication timeout should be happening.

Maybe add some sleeps in the vdsm prepareImage code?
That is being handled in another ticket: we are troubleshooting the slow lvs commands, which shouldn't happen in the first place. But it would be nicer if this mechanism were more resilient to slow lvm/storage, especially for bringing the HE up.
Hi Martin,

I added a sleep(20) in API.py for Image.prepare for the Hosted_Engine SD:

    def prepare(self, volumeID, allowIllegal=False):
        # test patch: delay prepares for the HE SD only (needs "import time" in API.py)
        if self._sdUUID == "21ae95db-0f97-4c06-bb60-e3ba541400f0":
            time.sleep(20)
        return self._irs.prepareImage(self._sdUUID, self._spUUID, self._UUID,
                                      volumeID, allowIllegal=allowIllegal)

Version: ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch

This is the result of every prepareImage call from the broker/agent to the HE SD:

2018-05-18 10:50:26,624+0530 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Image.prepare succeeded in 20.04 seconds (__init__:573)

Given there are 4 images to prepare, this should easily go above 30s, and yet there is no timeout on the ha daemons; everything keeps working, just slowly. So the new mechanism is more resilient to storage slowdowns. If you agree that this test is valid, feel free to close the bug.
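As a quick cross-check of the numbers above, a small helper (a sketch, not part of vdsm or the ha packages) can sum the "Image.prepare succeeded in N seconds" durations from the vdsm log and compare the total against the old 30s agent limit:

    # Sketch: sum Image.prepare durations from a vdsm.log file.
    import re
    import sys

    PATTERN = re.compile(r"Image\.prepare succeeded in ([0-9.]+) seconds")

    def total_prepare_seconds(log_path):
        total = 0.0
        with open(log_path) as log:
            for line in log:
                match = PATTERN.search(line)
                if match:
                    total += float(match.group(1))
        return total

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vdsm/vdsm.log"
        print("total Image.prepare time: %.2fs (old agent limit: 30s)"
              % total_prepare_seconds(path))

With the injected sleep(20) and 4 images the total should come out around 80s, well past the old limit, yet the 4.2 daemons keep going.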
Thanks Germano, yes, we improved the behaviour a lot in 4.2. The test seems fine; this method is exactly what we call to prepare the images. Since the bug was reported against 4.1 and we seem to have fixed it in 4.2, I will close as current release. A backport is unlikely, as it was almost a complete rewrite of the storage part.