Created attachment 1369890 [details]
logs from first ha-host alma03

Description of problem:
There is a difference in /etc/ovirt-hosted-engine/hosted-engine.conf between two ha-hosts:
http://pastebin.test.redhat.com/541959
The ha-score is jumping between 0 and 3400, and the engine reports:
VM HostedEngine is down with error. Exit message: resource busy: Failed to acquire lock: Lease is held by another host. 12/19/17 12:49:28 PM
VM HostedEngine is down with error. Exit message: resource busy: Failed to acquire lock: Lease is held by another host. 12/19/17 12:42:12 PM

Version-Release number of selected component (if applicable):
rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-1.2.0-0.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
vdsm-4.20.9.3-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
mom-0.5.11-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.5.x86_64
ovirt-imageio-daemon-1.2.0-0.el7ev.noarch
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
ovirt-host-4.2.0-1.el7ev.x86_64
sanlock-3.5.0-1.el7.x86_64
ovirt-setup-lib-1.1.4-1.el7ev.noarch
Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017
Linux puma19.scl.lab.tlv.redhat.com 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy SHE over NFS or Gluster.
2. Add one NFS data storage domain.
3. Add one additional ha-host.

Actual results:
The ha-score is jumping between 0 and 3400 on both hosts and is unstable; /etc/ovirt-hosted-engine/hosted-engine.conf is not the same on both hosts.
Expected results:
The ha-score should be stable after adding an additional ha-host.

Additional info:
Logs from both hosts and the engine are attached. A screencast is attached here:
https://drive.google.com/a/redhat.com/file/d/1XHhNOD3-gu71PerJfuM6qKG3RriCMnG5/view?usp=sharing
Created attachment 1369892 [details] logs from additional ha-host alma04
Created attachment 1369893 [details] engine's log
Created attachment 1369895 [details] hosted-engine.conf from alma03
Created attachment 1369896 [details] hosted-engine.conf from alma04
(In reply to Nikolai Sednev from comment #0)
> There is a difference in /etc/ovirt-hosted-engine/hosted-engine.conf between
> two ha-hosts
> http://pastebin.test.redhat.com/541959

host-deploy writes a sorted file; hosted-engine-setup does not. Other than the sorting, the only relevant difference is
< host_id=1
> host_id=2
as expected.
The issue of the flapping HA score, instead, seems real, and is due to:

MainThread::WARNING::2017-12-19 12:49:14,688::ovf_store::138::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Unable to find OVF_STORE
MainThread::ERROR::2017-12-19 12:49:14,689::config::475::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::ERROR::2017-12-19 12:49:14,690::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 428, in start_monitoring
    self._config.refresh_vm_conf()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 519, in refresh_vm_conf
    content = self._get_file_content_from_shared_storage(VM)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 484, in _get_file_content_from_shared_storage
    config_volume_path = self._get_config_volume_path()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 188, in _get_config_volume_path
    conf_vol_uuid
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 358, in get_volume_path
    root=envconst.SD_RUN_DIR,
RuntimeError: Path to volume af333f6f-b1e0-4cdd-ad96-86289302015d not found in /run/vdsm/storage
MainThread::ERROR::2017-12-19 12:49:14,690::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2017-12-19
12:49:14,690::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down MainThread::INFO::2017-12-19 12:49:25,159::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.2 started MainThread::INFO::2017-12-19 12:49:25,227::hosted_engine::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: alma03.qa.lab.tlv.redhat.com MainThread::INFO::2017-12-19 12:49:25,779::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection MainThread::INFO::2017-12-19 12:49:25,782::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.35.72.254'} MainThread::INFO::2017-12-19 12:49:25,785::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id ping MainThread::INFO::2017-12-19 12:49:25,785::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'} MainThread::INFO::2017-12-19 12:49:25,790::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id mgmt-bridge MainThread::INFO::2017-12-19 12:49:25,791::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'} MainThread::INFO::2017-12-19 12:49:25,796::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id mem-free MainThread::INFO::2017-12-19 12:49:25,797::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': '412e1c12-133c-43cd-ba2e-e47e81e47bfb', 'address': '0'} MainThread::INFO::2017-12-19 
12:49:25,800::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id cpu-load-no-engine MainThread::INFO::2017-12-19 12:49:25,800::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': '412e1c12-133c-43cd-ba2e-e47e81e47bfb', 'address': '0'} MainThread::INFO::2017-12-19 12:49:25,803::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id engine-health MainThread::INFO::2017-12-19 12:49:25,803::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor storage-domain, options {'sd_uuid': 'e061a688-ea1a-4da9-8412-6ec8705ad57b'} MainThread::INFO::2017-12-19 12:49:25,805::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id storage-domain MainThread::INFO::2017-12-19 12:49:25,806::hosted_engine::544::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started MainThread::INFO::2017-12-19 12:49:26,101::upgrade::994::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Host configuration is already up-to-date MainThread::INFO::2017-12-19 12:49:26,311::ovf_store::118::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:a59cb8b5-0e77-464c-8c0c-316ef29aa4a0, volUUID:d0428a63-e5ab-4d0f-8e50-06f336732654 MainThread::INFO::2017-12-19 12:49:26,445::ovf_store::118::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:11c50f3b-64b3-455e-8533-b5634a8b45d3, volUUID:0a0ec0b7-48e3-46ff-873b-b6e9beeaf96c MainThread::INFO::2017-12-19 12:49:26,461::ovf_store::149::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/e061a688-ea1a-4da9-8412-6ec8705ad57b/11c50f3b-64b3-455e-8533-b5634a8b45d3/0a0ec0b7-48e3-46ff-873b-b6e9beeaf96c 
MainThread::INFO::2017-12-19 12:49:26,500::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {} MainThread::INFO::2017-12-19 12:49:26,500::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host alma04.qa.lab.tlv.redhat.com (id 2): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=75932 (Tue Dec 19 12:49:16 2017)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=75932 (Tue Dec 19 12:49:16 2017)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n', 'hostname': 'alma04.qa.lab.tlv.redhat.com', 'host-id': 2, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': 'f05626cd', 'local_conf_timestamp': 75932, 'host-ts': 75932} MainThread::INFO::2017-12-19 12:49:26,500::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'health': 'good', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'mem-free': 25251.0, 'maintenance': False, 'cpu-load': None, 'gateway': 1.0, 'storage-domain': True} MainThread::INFO::2017-12-19 12:49:26,503::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (StartState-ReinitializeFSM) sent? None MainThread::INFO::2017-12-19 12:49:26,788::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
After 30 minutes or so, the flapping stops on the ha-hosts and the system becomes stable on its own.
It becomes stable on its own once the OVF is generated. Nikolai, can you please attach the output of tree /run/vdsm/storage?
(In reply to Nikolai Sednev from comment #7) > After 30 minutes or so, flapping stops on ha-hosts and system becomes stable > on its own. Reducing severity since it stabilizes.
(In reply to Martin Sivák from comment #8) > It becomes stable on its own once the OVF is generated. > > Nikolai, can you please attach the output of tree /run/vdsm/storage? New deployment over FC with these components: rhvm-appliance-4.2-20171207.0.el7.noarch ovirt-imageio-daemon-1.2.0-0.el7ev.noarch ovirt-imageio-common-1.2.0-0.el7ev.noarch ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64 ovirt-setup-lib-1.1.4-1.el7ev.noarch ovirt-host-4.2.0-1.el7ev.x86_64 ovirt-vmconsole-1.0.4-1.el7ev.noarch mom-0.5.11-1.el7ev.noarch vdsm-4.20.9.3-1.el7ev.x86_64 ovirt-host-dependencies-4.2.0-1.el7ev.x86_64 ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch ovirt-vmconsole-host-1.0.4-1.el7ev.noarch sanlock-3.5.0-1.el7.x86_64 ovirt-host-deploy-1.7.0-1.el7ev.noarch ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch libvirt-client-3.2.0-14.el7_4.5.x86_64 ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017 Linux purple-vds1.qa.lab.tlv.redhat.com 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.4 (Maipo) Here is the data from pair of purples 1 and 2, freshly deployed over FC: purple-vds1 ~]# tree /run/vdsm/storage /run/vdsm/storage └── e9b15107-0ca7-41d9-9b0d-c826b99afb2e ├── 0a572170-c3eb-4044-a0ff-91847d9cda26 │ └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450 ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af │ └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78 ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97 │ └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a └── b5d24726-7373-48ae-ba02-a2fdd45b32a7 └── 
8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93 [root@purple-vds2 ~]# tree /run/vdsm/storage /run/vdsm/storage └── e9b15107-0ca7-41d9-9b0d-c826b99afb2e ├── 0a572170-c3eb-4044-a0ff-91847d9cda26 │ └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450 ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af │ └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78 ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97 │ └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a └── b5d24726-7373-48ae-ba02-a2fdd45b32a7 └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93 I've seen that already during addition of a new ha-host (purple-vds2) via UI, the score was zeroed on purple-vds1 (first initial ha-host on which SHE deployment was made): [root@purple-vds1 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : purple-vds1.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 0 stopped : True Local maintenance : False crc32 : 90e840f7 local_conf_timestamp : 1951 Host timestamp : 1948 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=1948 (Tue Dec 19 19:11:00 2017) host-id=1 score=0 vm_conf_refresh_time=1951 (Tue Dec 19 19:11:04 2017) conf_on_shared_storage=True maintenance=False state=AgentStopped stopped=True Once purple-vds2 appeared in UI as active, the flapping got started on both hosts: http://pastebin.test.redhat.com/542120
[root@purple-vds1 ~]# tree /run/vdsm/storage /run/vdsm/storage └── e9b15107-0ca7-41d9-9b0d-c826b99afb2e ├── 0a572170-c3eb-4044-a0ff-91847d9cda26 │ └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450 ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af │ └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78 └── 7d00c457-a14a-40e0-bc81-df7f634e6a97 └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a [root@purple-vds2 ~]# tree /run/vdsm/storage /run/vdsm/storage └── e9b15107-0ca7-41d9-9b0d-c826b99afb2e ├── 0a572170-c3eb-4044-a0ff-91847d9cda26 │ └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450 ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af │ └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78 ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97 │ └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a └── b5d24726-7373-48ae-ba02-a2fdd45b32a7 └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93
[root@purple-vds1 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : purple-vds1.qa.lab.tlv.redhat.com Host ID : 1 Engine status : unknown stale-data Score : 0 stopped : True Local maintenance : False crc32 : 90e840f7 local_conf_timestamp : 1951 Host timestamp : 1948 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=1948 (Tue Dec 19 19:11:00 2017) host-id=1 score=0 vm_conf_refresh_time=1951 (Tue Dec 19 19:11:04 2017) conf_on_shared_storage=True maintenance=False state=AgentStopped stopped=True --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : purple-vds2.qa.lab.tlv.redhat.com Host ID : 2 Engine status : unknown stale-data Score : 0 stopped : False Local maintenance : False crc32 : a536f1c7 local_conf_timestamp : 3093 Host timestamp : 3092 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=3092 (Tue Dec 19 19:31:50 2017) host-id=2 score=0 vm_conf_refresh_time=3093 (Tue Dec 19 19:31:51 2017) conf_on_shared_storage=True maintenance=False state=ReinitializeFSM stopped=False [root@purple-vds1 ~]# tree /run/vdsm/storage /run/vdsm/storage └── e9b15107-0ca7-41d9-9b0d-c826b99afb2e ├── 0a572170-c3eb-4044-a0ff-91847d9cda26 │ └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450 ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af │ └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78 └── 7d00c457-a14a-40e0-bc81-df7f634e6a97 └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a [root@purple-vds2 ~]# tree /run/vdsm/storage /run/vdsm/storage └── e9b15107-0ca7-41d9-9b0d-c826b99afb2e ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af │ └── 147865c2-5c0d-4467-9240-a93292552c78 -> 
/dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78 ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97 │ └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a └── b5d24726-7373-48ae-ba02-a2fdd45b32a7 └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93
MainThread::INFO::2017-12-19 11:48:13,279::hosted_engine::924::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start` MainThread::INFO::2017-12-19 11:48:14,622::hosted_engine::930::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: MainThread::INFO::2017-12-19 11:48:14,646::hosted_engine::931::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Command VM.getStats with args {'vmID': '412e1c12-133c-43cd-ba2e-e47e81e47bfb'} failed: (code=1, message=Virtual machine does not exist: {'vmId': u'412e1c12-133c-43cd-ba2e-e47e81e47bfb'}) This is caused by https://bugzilla.redhat.com/1525955 (Virt issue, see comment #10) MainThread::INFO::2017-12-19 11:48:14,646::hosted_engine::943::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost MainThread::INFO::2017-12-19 11:48:14,791::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent MainThread::INFO::2017-12-19 11:48:14,838::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400) ... MainThread::INFO::2017-12-19 11:49:05,707::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400) ... MainThread::INFO::2017-12-19 12:06:01,811::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400) MainThread::ERROR::2017-12-19 12:06:11,010::config::475::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. 
Please ensure you already added your first data domain for regular VMs
MainThread::ERROR::2017-12-19 12:06:11,010::hosted_engine::434::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception
Traceback (most recent call last):
....
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 358, in get_volume_path
    root=envconst.SD_RUN_DIR,
RuntimeError: Path to volume af333f6f-b1e0-4cdd-ad96-86289302015d not found in /run/vdsm/storage

This is really weird. The HE storage domain is e061a688-ea1a-4da9-8412-6ec8705ad57b, but the symlinks that must have been there are suddenly completely missing. And I do not see anything related to this issue in the VDSM log either.

The lease error from the engine log is just annoying; the engine VM is up, and this happens because hosted engine is trying to start it again after the agents lost synchronization.

The situation stabilizes once the OVF store is created, because we call Image.prepare to get it and that prepares the symlinks. This will affect only very new deployments that have no OVF store yet.

Btw Nikolai, pastebin.test.redhat.com is not accessible from outside the VPN, so you should probably not use it in public bugs.
Nir, can you please take a quick look? It seems that all the prepared symlinks suddenly disappeared, and I can't find any reason for that.
(In reply to Martin Sivák from comment #14) > It seems that all the prepared symlinks suddenly disappeared Can you explain what symlinks are missing?
(In reply to Yaniv Kaul from comment #9) > (In reply to Nikolai Sednev from comment #7) > > After 30 minutes or so, flapping stops on ha-hosts and system becomes stable > > on its own. > > Reducing severity since it stabilizes. On my environment over FC I see that even overnight ha-score was not stabilized at all. I still have the issue: [root@purple-vds1 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : purple-vds1.qa.lab.tlv.redhat.com Host ID : 1 Engine status : unknown stale-data Score : 0 stopped : True Local maintenance : False crc32 : 90e840f7 local_conf_timestamp : 1951 Host timestamp : 1948 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=1948 (Tue Dec 19 19:11:00 2017) host-id=1 score=0 vm_conf_refresh_time=1951 (Tue Dec 19 19:11:04 2017) conf_on_shared_storage=True maintenance=False state=AgentStopped stopped=True --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : purple-vds2.qa.lab.tlv.redhat.com Host ID : 2 Engine status : unknown stale-data Score : 0 stopped : False Local maintenance : False crc32 : a04c826a local_conf_timestamp : 52847 Host timestamp : 52846 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=52846 (Wed Dec 20 09:21:04 2017) host-id=2 score=0 vm_conf_refresh_time=52847 (Wed Dec 20 09:21:06 2017) conf_on_shared_storage=True maintenance=False state=ReinitializeFSM stopped=False [root@purple-vds1 ~]# [root@purple-vds1 ~]# [root@purple-vds1 ~]# [root@purple-vds1 ~]# [root@purple-vds1 ~]# [root@purple-vds1 ~]# [root@purple-vds1 ~]# date Wed Dec 20 09:22:30 IST 2017 [root@purple-vds1 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : purple-vds1.qa.lab.tlv.redhat.com Host ID : 1 Engine status : unknown stale-data Score : 0 stopped : True Local maintenance : False 
crc32 : 90e840f7 local_conf_timestamp : 1951 Host timestamp : 1948 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=1948 (Tue Dec 19 19:11:00 2017) host-id=1 score=0 vm_conf_refresh_time=1951 (Tue Dec 19 19:11:04 2017) conf_on_shared_storage=True maintenance=False state=AgentStopped stopped=True --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : purple-vds2.qa.lab.tlv.redhat.com Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : 0967536e local_conf_timestamp : 52922 Host timestamp : 52918 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=52918 (Wed Dec 20 09:22:17 2017) host-id=2 score=3400 vm_conf_refresh_time=52922 (Wed Dec 20 09:22:20 2017) conf_on_shared_storage=True maintenance=False state=EngineStart stopped=False [root@purple-vds1 ~]# tree /run/vdsm/storage /run/vdsm/storage ├── 0bfb70d9-f651-48ed-8bc1-5fac1707ddf7 └── e9b15107-0ca7-41d9-9b0d-c826b99afb2e ├── 0a572170-c3eb-4044-a0ff-91847d9cda26 │ └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450 ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af │ └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78 └── 7d00c457-a14a-40e0-bc81-df7f634e6a97 └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a [root@purple-vds2 ~]# tree /run/vdsm/storage /run/vdsm/storage └── e9b15107-0ca7-41d9-9b0d-c826b99afb2e ├── 0a572170-c3eb-4044-a0ff-91847d9cda26 │ └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450 ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af │ └── 147865c2-5c0d-4467-9240-a93292552c78 -> 
/dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97
    │   └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a
    └── b5d24726-7373-48ae-ba02-a2fdd45b32a7
        └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93

Returning; I think that its severity should be higher.
Nir: The whole /run/vdsm/storage/e061a688-ea1a-4da9-8412-6ec8705ad57b directory disappeared for no apparent reason.
So, the case from comment #16: the symlink to /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93 disappeared from host purple-vds1. This is very weird, since purple-vds1 was the initial host and this happened during the addition of purple-vds2, so nobody should have been touching vds1 at that time.

purple-vds2 is fine, except that it does not know the status of the engine and tries to start it again and again; sanlock prevents it from doing so, as it is supposed to. purple-vds1 is crashing because the expected configuration disk is not accessible once the symlink was removed. No idea why yet.
Nir, the disappearing tree is weird, but the real issue is with disappearing volume link when a second host is being added to the engine.
Symlink issue might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1524119#c8.
Actually, not really. That issue was caused by vdsm restarting and failing to reconstruct the disk structures properly on recovery; it then failed to prepare the symlink as part of migration. The disk in question there was the actual hard drive of the VM. Here it happens on a host nobody is touching, and with a volume no VM uses (it is internal to the hosted engine infrastructure). As I said earlier, the symptoms are similar, but the issue is different. The question is why this reproduces when you add a second host to the engine. It should have no effect on the existing host, but it does.
(In reply to Martin Sivák from comment #17)
> Nir: The whole /run/vdsm/storage/e061a688-ea1a-4da9-8412-6ec8705ad57b
> directory disappeared for no apparent reason.

Can you explain what this tree is and why hosted engine needs it?

Vdsm provides APIs for preparing and tearing down images. You can use the path returned from Image.prepare to access the volume, but you cannot assume that you own the path or that it will remain on storage.

For example, OVF store volumes are owned by engine, and engine updates them once in a while. On each update, engine prepares the volume, writes the new data, and tears it down.

You can assume that a volume will be available only if you are the owner of the volume and no other part of the system touches it.
Well, we prepare the images as well. All of them, actually: both the OVF store disks and our own. I suppose vdsm does not track how many users there are.
(In reply to Martin Sivák from comment #23)
> Well we prepare the images as well. All of them actually - both for OVF and
> our own disks. I suppose vdsm does not track how many users there are..

Vdsm does not track users of prepared volumes; engine is responsible for this. To track users, vdsm would need to persist the data on local storage, since vdsm is likely to be restarted, or killed by sanlock on engine soft fencing.

If you are using a volume that other parts of the system are using, like the OVF volumes, you must be ready for another part of the system tearing down your volume, and prepare it again. Or better, use the vdsm API to access the data, and let vdsm handle this for you.
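The gap Nir describes (nothing counting the users of a prepared volume) is a classic reference-counting problem; a toy sketch of what per-volume tracking would look like, purely illustrative and not vdsm code:

```python
from collections import Counter

class VolumeRefTracker:
    """Toy illustration: tear a volume down only when its last user leaves.

    Vdsm does not do this for prepared volumes, which is why one
    component tearing down a shared volume breaks the other users.
    """

    def __init__(self, prepare, teardown):
        self._refs = Counter()
        self._prepare = prepare      # callback creating the symlink
        self._teardown = teardown    # callback removing the symlink

    def acquire(self, vol_uuid):
        if self._refs[vol_uuid] == 0:
            self._prepare(vol_uuid)  # first user prepares the volume
        self._refs[vol_uuid] += 1

    def release(self, vol_uuid):
        self._refs[vol_uuid] -= 1
        if self._refs[vol_uuid] == 0:
            self._teardown(vol_uuid)  # last user tears it down
```

As Nir notes, real tracking would also have to survive a vdsm restart, so the counter would need to be persisted, which this sketch deliberately omits.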
Well, the engine cannot do that when it is down. And since there is no API for reading the OVF store or our own config volume, we will have to hack around it, again.
*** Bug 1524904 has been marked as a duplicate of this bug. ***
Since I was not able to find any teardown for the link in question, and it is not the OVF store, I am moving this to storage for investigation, as Nir asked.
Martin, please add the info I asked for in comment 22:
- what are the missing volumes?
- when are they prepared?
- is this about NFS/Gluster or FC/iSCSI? Comment 0 suggests the former, comment 16 suggests the latter.
It is the hosted engine configuration volume (you can see its id in the traceback); nobody else uses it. We prepare all volumes when ovirt-ha-broker starts. It happened on both NFS and FC, including OST.
Nikolai, we need to reproduce this with DEBUG level logs in vdsm. At debug level, we have logs for every operation that removes a directory from storage.

If vdsm is removing the tree at /run/vdsm/storage/sd-id, we will see this log:

Removing directory: /run/vdsm/storage/sd-id

To enable debug logs in vdsm, edit /etc/vdsm/logger.conf, change the root and storage loggers to use DEBUG instead of INFO, and restart vdsm.

This should be standard practice for any bug report: if something does not work, we must have debug logs.
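The edit Nir describes amounts to something like the following in /etc/vdsm/logger.conf; the section names follow the stock vdsm file, but verify them against your installed copy and keep all other keys (handlers, qualname, etc.) as they already are:

```ini
; /etc/vdsm/logger.conf (fragment) - only the level= lines change,
; then restart vdsm: systemctl restart vdsmd
[logger_root]
level=DEBUG

[logger_storage]
level=DEBUG
qualname=storage
```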
(In reply to Nir Soffer from comment #30)
> Nikolai, we need to reproduce this with DEBUG level logs in vdsm. In debug
> level, we have logs for every operation that removes a directory from
> storage.
>
> If vdsm is removing the tree at /run/vdsm/storage/sd-id, we will see this
> log:
>
> Removing directory: /run/vdsm/storage/sd-id
>
> To enable debug logs in vdsm, you need to edit /etc/vdsm/logger.conf and
> change the root and storage loggers to use DEBUG instead of INFO, and
> restart vdsm.
>
> This should be standard practice for any bug report. If something does not
> work we must have debug logs.

At the current stage it is not possible due to https://bugzilla.redhat.com/show_bug.cgi?id=1531072.
(In reply to Nikolai Sednev from comment #31)
> (In reply to Nir Soffer from comment #30)
> > Nikolai, we need to reproduce this with DEBUG level logs in vdsm. In debug
> > level, we have logs for every operation that removes a directory from
> > storage.
> >
> > If vdsm is removing the tree at /run/vdsm/storage/sd-id, we will see this
> > log:
> >
> > Removing directory: /run/vdsm/storage/sd-id
> >
> > To enable debug logs in vdsm, you need to edit /etc/vdsm/logger.conf and
> > change the root and storage loggers to use DEBUG instead of INFO, and
> > restart vdsm.
> >
> > This should be standard practice for any bug report. If something does not
> > work we must have debug logs.
>
> At current stage it is not possible due to
> https://bugzilla.redhat.com/show_bug.cgi?id=1531072.

Strange, I can reproduce this easily with O-S-T (at least with the Ansible-based deployment). What I'm not sure about is how to get VDSM into debug mode during the deployment, though. Any ideas?
Yaniv, to enable debug logs, we can install /etc/vdsm/logger.conf with the suggested log levels, or we can issue these commands:

    # Set the root logger to debug
    vdsm-client Host setLogLevel level=DEBUG

    # Set the storage logger to debug
    vdsm-client Host setLogLevel level=DEBUG name=storage
I reproduced it on my local environment. It looks like VDSM doesn't know that HE is using that volume.

2018-01-08 16:10:18,363+0100 INFO (Thread-1) [vdsm.api] START uploadImageToStream(methodArgs={'length': 20480, 'fileObj': <socket._fileobject object at 0x3508250>}, callback=<function setCallback at 0x42b76e0>, startEvent=<threading._Event object at 0x42a47d0>, spUUID='5a538125-0111-0282-0066-00000000033c', sdUUID='7f3c72a0-8b9a-46a2-93aa-0a69337c91fe', imgUUID='d3f97859-8f70-42e4-b0c8-41655a7fd32b', volUUID='3b9f9c09-2795-4c28-acf0-f94552d1c5d6') from=internal, task_id=de55c686-0f5e-4c16-9c2f-8f06c4f32dca (api:46)
2018-01-08 16:10:18,364+0100 DEBUG (Thread-1) [storage.oop] Creating ioprocess Global (outOfProcess:86)
2018-01-08 16:10:18,364+0100 INFO (Thread-1) [IOProcessClient] Starting client ioprocess-53 (__init__:308)
2018-01-08 16:10:18,364+0100 DEBUG (Thread-1) [IOProcessClient] Starting ioprocess for client ioprocess-53 (__init__:320)
2018-01-08 16:10:18,372+0100 DEBUG (Thread-1) [IOProcessClient] Starting communication thread for client ioprocess-53 (__init__:376)
2018-01-08 16:10:18,374+0100 DEBUG (Thread-1) [IOProcessClient] Communication thread for client ioprocess-53 started (__init__:388)
2018-01-08 16:10:18,375+0100 INFO (ioprocess/18400) [IOProcess] Starting ioprocess (__init__:437)
2018-01-08 16:10:18,382+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') _save: orig /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca temp /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca.temp (task:759)
2018-01-08 16:10:18,395+0100 DEBUG (Thread-1) [storage.TaskManager] scheduled job uploadImageToStream for task de55c686-0f5e-4c16-9c2f-8f06c4f32dca (taskManager:79)
2018-01-08 16:10:18,395+0100 INFO (Thread-1) [vdsm.api] FINISH uploadImageToStream return=None from=internal, task_id=de55c686-0f5e-4c16-9c2f-8f06c4f32dca (api:52)
2018-01-08 16:10:18,395+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') Prepare: 1 jobs exist, move to acquiring (task:1193)
2018-01-08 16:10:18,395+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') moving from state preparing -> state acquiring (task:602)
2018-01-08 16:10:18,396+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') _save: orig /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca temp /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca.temp (task:759)
2018-01-08 16:10:18,412+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') moving from state acquiring -> state queued (task:602)
2018-01-08 16:10:18,413+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') _save: orig /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca temp /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca.temp (task:759)
2018-01-08 16:10:18,431+0100 DEBUG (Thread-1) [storage.TaskManager] queuing task: de55c686-0f5e-4c16-9c2f-8f06c4f32dca (taskManager:57)
2018-01-08 16:10:18,431+0100 DEBUG (tasks/2) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60)
2018-01-08 16:10:18,431+0100 INFO (tasks/2) [storage.ThreadPool.WorkerThread] START task de55c686-0f5e-4c16-9c2f-8f06c4f32dca (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x3a02b48>>, args=None) (threadPool:208)
2018-01-08 16:10:18,431+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') committing task: de55c686-0f5e-4c16-9c2f-8f06c4f32dca (task:1210)
2018-01-08 16:10:18,432+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') moving from state queued -> state running (task:602)
2018-01-08 16:10:18,432+0100 DEBUG (Thread-1) [storage.TaskManager] task queued: de55c686-0f5e-4c16-9c2f-8f06c4f32dca (taskManager:65)
2018-01-08 16:10:18,432+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') returning (task:1198)
2018-01-08 16:10:18,432+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') ref 1 aborting False (task:1002)
2018-01-08 16:10:18,434+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') _save: orig /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca temp /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca.temp (task:759)
2018-01-08 16:10:18,451+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') Task.run: running job 0: uploadImageToStream: <bound method StoragePool.uploadImageToStream of <vdsm.storage.sp.StoragePool object at 0x385a3d0>> (args: ({'length': 20480, 'fileObj': <socket._fileobject object at 0x3508250>}, <function setCallback at 0x42b76e0>, <threading._Event object at 0x42a47d0>, '7f3c72a0-8b9a-46a2-93aa-0a69337c91fe', 'd3f97859-8f70-42e4-b0c8-41655a7fd32b', '3b9f9c09-2795-4c28-acf0-f94552d1c5d6') kwargs: {}) (task:914)
2018-01-08 16:10:18,452+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') Job.run: running uploadImageToStream: <bound method StoragePool.uploadImageToStream of <vdsm.storage.sp.StoragePool object at 0x385a3d0>> (args: ({'length': 20480, 'fileObj': <socket._fileobject object at 0x3508250>}, <function setCallback at 0x42b76e0>, <threading._Event object at 0x42a47d0>, '7f3c72a0-8b9a-46a2-93aa-0a69337c91fe', 'd3f97859-8f70-42e4-b0c8-41655a7fd32b', '3b9f9c09-2795-4c28-acf0-f94552d1c5d6') kwargs: {}) callback None (task:333)
2018-01-08 16:10:18,452+0100 DEBUG (tasks/2) [storage.ResourceManager] Trying to register resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' for lock type 'shared' (resourceManager:495)
2018-01-08 16:10:18,455+0100 DEBUG (tasks/2) [storage.VolumeManifest] validate path for 3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileVolume:124)
2018-01-08 16:10:18,458+0100 DEBUG (tasks/2) [storage.VolumeManifest] validate path for 3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileVolume:124)
2018-01-08 16:10:18,461+0100 DEBUG (tasks/2) [storage.VolumeManifest] validate path for 3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileVolume:124)
2018-01-08 16:10:18,465+0100 DEBUG (tasks/2) [storage.ResourceManager] Trying to register resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' for lock type 'shared' (resourceManager:495)
2018-01-08 16:10:18,465+0100 DEBUG (tasks/2) [storage.ResourceManager] Resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' is free. Now locking as 'shared' (1 active user) (resourceManager:552)
2018-01-08 16:10:18,466+0100 DEBUG (tasks/2) [storage.ResourceManager.Request] (ResName='02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6', ReqID='65efa1b0-38a3-4a45-9404-69e0791afca5') Granted request (resourceManager:222)
2018-01-08 16:10:18,466+0100 DEBUG (tasks/2) [storage.ResourceManager] Resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' is free. Now locking as 'shared' (1 active user) (resourceManager:552)
2018-01-08 16:10:18,466+0100 DEBUG (tasks/2) [storage.ResourceManager.Request] (ResName='01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b', ReqID='e11ab9cb-c29a-40c5-b075-2d756f71ed7e') Granted request (resourceManager:222)
2018-01-08 16:10:18,467+0100 DEBUG (tasks/2) [storage.VolumeManifest] validate path for 3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileVolume:124)
2018-01-08 16:10:18,471+0100 INFO (tasks/2) [storage.Image] chain=3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (top) (image:1343)
2018-01-08 16:10:18,472+0100 DEBUG (tasks/2) [storage.StorageDomain] Fixing permissions on /rhev/data-center/mnt/10.37.136.6:_dchaplyg_he/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/images/d3f97859-8f70-42e4-b0c8-41655a7fd32b/3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileSD:620)
2018-01-08 16:10:18,473+0100 DEBUG (tasks/2) [storage.fileUtils] Creating directory: /var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe mode: None (fileUtils:197)
2018-01-08 16:10:18,473+0100 DEBUG (tasks/2) [storage.fileUtils] Using existing directory: /var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe (fileUtils:206)
2018-01-08 16:10:18,473+0100 DEBUG (tasks/2) [storage.StorageDomain] Creating symlink from /rhev/data-center/mnt/10.37.136.6:_dchaplyg_he/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/images/d3f97859-8f70-42e4-b0c8-41655a7fd32b to /var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/d3f97859-8f70-42e4-b0c8-41655a7fd32b (fileSD:577)
2018-01-08 16:10:18,473+0100 DEBUG (tasks/2) [storage.StorageDomain] img run dir already exists: /var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/d3f97859-8f70-42e4-b0c8-41655a7fd32b (fileSD:582)
2018-01-08 16:10:18,474+0100 DEBUG (tasks/2) [root] /usr/bin/taskset --cpu-list 0-7 /usr/bin/dd if=/rhev/data-center/mnt/10.37.136.6:_dchaplyg_he/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/images/d3f97859-8f70-42e4-b0c8-41655a7fd32b/3b9f9c09-2795-4c28-acf0-f94552d1c5d6 bs=1048576 count=1 (cwd None) (commands:65)
2018-01-08 16:10:18,485+0100 DEBUG (tasks/2) [storage.StorageDomain] Removing image rundir link u'/var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/d3f97859-8f70-42e4-b0c8-41655a7fd32b' (fileSD:597)
2018-01-08 16:10:18,486+0100 DEBUG (tasks/2) [storage.ResourceManager] Trying to release resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' (resourceManager:567)
2018-01-08 16:10:18,486+0100 DEBUG (tasks/2) [storage.ResourceManager] Released resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' (0 active users) (resourceManager:585)
2018-01-08 16:10:18,486+0100 DEBUG (tasks/2) [storage.ResourceManager] Resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' is free, finding out if anyone is waiting for it. (resourceManager:591)
2018-01-08 16:10:18,486+0100 DEBUG (tasks/2) [storage.ResourceManager] Trying to release resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' (resourceManager:567)
2018-01-08 16:10:18,487+0100 DEBUG (tasks/2) [storage.ResourceManager] Released resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' (0 active users) (resourceManager:585)
2018-01-08 16:10:18,487+0100 DEBUG (tasks/2) [storage.ResourceManager] Resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' is free, finding out if anyone is waiting for it. (resourceManager:591)
2018-01-08 16:10:18,487+0100 DEBUG (tasks/2) [storage.ResourceManager] No one is waiting for resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6', Clearing records. (resourceManager:599)
2018-01-08 16:10:18,487+0100 DEBUG (tasks/2) [storage.ResourceManager] No one is waiting for resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b', Clearing records. (resourceManager:599)
2018-01-08 16:10:18,488+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') moving from state running -> state finished (task:602)
How can we mark the volume as in use so the cleanup skips it?
There is a log in comment #34. Removing the 'depends on'.
Please see comments #34 and #36 and assign someone to help.
(In reply to Denis Chaplygin from comment #34)
> I reproduced it on my local environment. Looks like VDSM doesn't know, that
> He is using that volume.

This is correct: vdsm is mostly stateless and does not track who prepared an image. Engine manages prepares and teardowns, or it is built into flows like starting/stopping a VM.

> 2018-01-08 16:10:18,485+0100 DEBUG (tasks/2) [storage.StorageDomain]
> Removing image rundir link u'/var/run/.../d3f97859-8f70-42e4-b0c8-41655a7fd32b'
> (fileSD:597)

Can you explain what this image is - is this the OVF_STORE volume? Martin said in comment 29:

> It is a hosted engine configuration volume (you see the id in the traceback).
> Nobody else uses it. We prepare all volumes when ovirt-ha-broker starts.
> Happened on both NFS and FC including OST.

I don't think we have a way to keep an image prepared and have the image rundir symlinks exist.

In block storage, we can prevent teardown of the image by keeping the lv open in the process owning it. For example, open /dev/vg/lv after prepare, and close it before teardown. But even if you do this, the image run links will be removed during teardown (though that can easily be fixed in vdsm).

In file storage, there is nothing to do with current vdsm to keep an image prepared if engine asked to tear it down, or called a flow that prepares and tears down an image.

The basic question is: why does hosted engine need to keep prepared an image that is owned and managed by some other component? We need more info to understand how hosted engine is using this image.
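The block-storage workaround described above (keeping the LV open between prepare and teardown) boils down to holding an open file descriptor on the device node. A minimal sketch, with hypothetical helper names and an illustrative /dev/vg/lv path:

```python
import os

def hold_volume_open(path):
    """Open the prepared volume read-only and return the fd.

    While this fd stays open, deactivating the LV (lvchange -an)
    fails because the device is in use, so the volume cannot be
    torn down underneath the owning process.
    Helper name and path are illustrative, not vdsm API.
    """
    return os.open(path, os.O_RDONLY)

def release_volume(fd):
    """Close the fd just before requesting teardown."""
    os.close(fd)
```

Note that, as stated above, this only protects the device itself: the /var/run/vdsm/storage symlinks would still be removed during teardown.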
> Can you explain what is this image - is this OVF_STORE volume? Martin
> said in comment 29:

No, it is a hosted-engine-specific volume. Engine does not know about it and does not use it for anything.

> The basic question is, why hosted engine need to keep an image prepared
> that is owned and managed by some other component?

It is managed by hosted engine only. Nobody called a teardown as far as I was able to see.

> We need more info to understand how hosted engine is using this image.

It is created during hosted engine setup and contains the shared config files hosted engine needs to behave the same on all hosts. We prepare it when hosted engine starts and (mostly) read from it every time we need something.
Ah! I know where the issue is. The engine does indeed touch the volume: it reads data from it when a new host is being deployed.
The patch fixing this looks fine, but I think we should consider support for shared resources. Martin, would you file an RFE for this?
(In reply to Nir Soffer from comment #41)
> The patch fixing this looks fine, but I think we should consider support for
> shared resources. Martin, would file a RFE for this?

So shouldn't this BZ be on MODIFIED? (Regardless of the RFE, we probably want to continue improving this in the future.)
Based on comment 40 and the attached fix, moving back to integration.
Martin, can you assign this bug to someone in your team?
Works for me on these components:
ovirt-hosted-engine-ha-2.2.3-0.0.master.20171218181916.20171218181911.git4c22b93.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.4-0.0.master.20180109170856.gitb593776.el7.centos.noarch
ovirt-engine-appliance-4.2-20180110.1.el7.centos.noarch

Moving to verified.
Created attachment 1380210 [details] screencast
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.