Bug 1527394 - [HE] - SHE ha-host's score is unstable and hosted-engine.conf is not equal on both ha-hosts.
Summary: [HE] - SHE ha-host's score is unstable and hosted-engine.conf is not equal on both ha-hosts.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Martin Sivák
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1524904 (view as bug list)
Depends On: 1531072
Blocks: 1458711
 
Reported: 2017-12-19 11:02 UTC by Nikolai Sednev
Modified: 2019-04-25 10:41 UTC (History)
CC: 22 users

Fixed In Version: ovirt-hosted-engine-ha-2.2.3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-12 11:48:44 UTC
oVirt Team: SLA
Embargoed:
ylavi: ovirt-4.2+
ylavi: blocker+


Attachments
logs from first ha-host alma03 (10.07 MB, application/x-xz)
2017-12-19 11:02 UTC, Nikolai Sednev
no flags Details
logs from additional ha-host alma04 (9.64 MB, application/x-xz)
2017-12-19 11:04 UTC, Nikolai Sednev
no flags Details
engine's log (9.27 MB, application/x-xz)
2017-12-19 11:05 UTC, Nikolai Sednev
no flags Details
hosted-engine.conf from alma03 (1.03 KB, text/plain)
2017-12-19 11:11 UTC, Nikolai Sednev
no flags Details
hosted-engine.conf from alma04 (1.03 KB, text/plain)
2017-12-19 11:11 UTC, Nikolai Sednev
no flags Details
screencast (9.78 MB, application/octet-stream)
2018-01-11 19:26 UTC, Nikolai Sednev
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1531066 0 medium CLOSED vdsm: use INFO log level for all operation that change system state 2021-02-22 00:41:40 UTC
oVirt gerrit 86182 0 'None' MERGED Prepare a volume symlink again if it disappeared 2020-09-14 15:40:45 UTC
oVirt gerrit 86186 0 'None' MERGED Use new signature of get_volume_path 2020-09-14 15:40:45 UTC

Internal Links: 1531066

Description Nikolai Sednev 2017-12-19 11:02:44 UTC
Created attachment 1369890 [details]
logs from first ha-host alma03

Description of problem:
There is a difference in /etc/ovirt-hosted-engine/hosted-engine.conf between the two ha-hosts:
http://pastebin.test.redhat.com/541959
The ha-score is jumping between 0 and 3400 and the engine reports:
VM HostedEngine is down with error. Exit message: resource busy: Failed to acquire lock: Lease is held by another host.
12/19/17 12:49:28 PM
VM HostedEngine is down with error. Exit message: resource busy: Failed to acquire lock: Lease is held by another host.
12/19/17 12:42:12 PM

Version-Release number of selected component (if applicable):
rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-1.2.0-0.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
vdsm-4.20.9.3-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
mom-0.5.11-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.5.x86_64
ovirt-imageio-daemon-1.2.0-0.el7ev.noarch
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
ovirt-host-4.2.0-1.el7ev.x86_64
sanlock-3.5.0-1.el7.x86_64
ovirt-setup-lib-1.1.4-1.el7ev.noarch
Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017
Linux puma19.scl.lab.tlv.redhat.com 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy SHE over NFS or Gluster.
2. Add one NFS data storage domain.
3. Add one additional ha-host.

Actual results:
The ha-score jumps between 0 and 3400 on both hosts and is unstable; /etc/ovirt-hosted-engine/hosted-engine.conf is not the same on both hosts.

Expected results:
The ha-score should be stable after adding the additional ha-host.

Additional info:
Logs from both hosts and the engine are attached.
A screencast is attached here: https://drive.google.com/a/redhat.com/file/d/1XHhNOD3-gu71PerJfuM6qKG3RriCMnG5/view?usp=sharing.
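For anyone trying to observe the flapping without the screencast, a rough watcher sketch is below; it only assumes the hosted-engine --vm-status command used elsewhere in this report and simply prints the Hostname/Score lines every few seconds (the exact output format may differ between versions):

    #!/usr/bin/env python
    # Sketch: poll `hosted-engine --vm-status` and print the Hostname/Score
    # lines so the 0 <-> 3400 flapping becomes visible over time.
    # Stop with Ctrl-C.
    import subprocess
    import time

    while True:
        out = subprocess.check_output(["hosted-engine", "--vm-status"])
        for line in out.decode("utf-8", "replace").splitlines():
            if "Hostname" in line or "Score" in line:
                print(line.strip())
        print("-" * 40)
        time.sleep(10)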

Comment 1 Nikolai Sednev 2017-12-19 11:04:09 UTC
Created attachment 1369892 [details]
logs from additional ha-host alma04

Comment 2 Nikolai Sednev 2017-12-19 11:05:09 UTC
Created attachment 1369893 [details]
engine's log

Comment 3 Nikolai Sednev 2017-12-19 11:11:36 UTC
Created attachment 1369895 [details]
hosted-engine.conf from alma03

Comment 4 Nikolai Sednev 2017-12-19 11:11:57 UTC
Created attachment 1369896 [details]
hosted-engine.conf from alma04

Comment 5 Simone Tiraboschi 2017-12-19 11:18:26 UTC
(In reply to Nikolai Sednev from comment #0)
> There is a difference in /etc/ovirt-hosted-engine/hosted-engine.conf between
> two ha-hosts
> http://pastebin.test.redhat.com/541959

host-deploy writes a sorted file; hosted-engine-setup does not.
Other than the sorting, the only relevant difference is

< host_id=1
> host_id=2

as expected

Comment 6 Simone Tiraboschi 2017-12-19 11:25:24 UTC
The flapping HA score issue, on the other hand, seems real and is due to:

MainThread::WARNING::2017-12-19 12:49:14,688::ovf_store::138::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Unable to find OVF_STORE
MainThread::ERROR::2017-12-19 12:49:14,689::config::475::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::ERROR::2017-12-19 12:49:14,690::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 428, in start_monitoring
    self._config.refresh_vm_conf()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 519, in refresh_vm_conf
    content = self._get_file_content_from_shared_storage(VM)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 484, in _get_file_content_from_shared_storage
    config_volume_path = self._get_config_volume_path()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 188, in _get_config_volume_path
    conf_vol_uuid
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 358, in get_volume_path
    root=envconst.SD_RUN_DIR,
RuntimeError: Path to volume af333f6f-b1e0-4cdd-ad96-86289302015d not found in /run/vdsm/storage

MainThread::ERROR::2017-12-19 12:49:14,690::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2017-12-19 12:49:14,690::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2017-12-19 12:49:25,159::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.2 started
MainThread::INFO::2017-12-19 12:49:25,227::hosted_engine::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: alma03.qa.lab.tlv.redhat.com
MainThread::INFO::2017-12-19 12:49:25,779::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2017-12-19 12:49:25,782::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.35.72.254'}
MainThread::INFO::2017-12-19 12:49:25,785::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id ping
MainThread::INFO::2017-12-19 12:49:25,785::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}
MainThread::INFO::2017-12-19 12:49:25,790::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id mgmt-bridge
MainThread::INFO::2017-12-19 12:49:25,791::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
MainThread::INFO::2017-12-19 12:49:25,796::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id mem-free
MainThread::INFO::2017-12-19 12:49:25,797::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': '412e1c12-133c-43cd-ba2e-e47e81e47bfb', 'address': '0'}
MainThread::INFO::2017-12-19 12:49:25,800::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id cpu-load-no-engine
MainThread::INFO::2017-12-19 12:49:25,800::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': '412e1c12-133c-43cd-ba2e-e47e81e47bfb', 'address': '0'}
MainThread::INFO::2017-12-19 12:49:25,803::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id engine-health
MainThread::INFO::2017-12-19 12:49:25,803::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor storage-domain, options {'sd_uuid': 'e061a688-ea1a-4da9-8412-6ec8705ad57b'}
MainThread::INFO::2017-12-19 12:49:25,805::brokerlink::85::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id storage-domain
MainThread::INFO::2017-12-19 12:49:25,806::hosted_engine::544::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2017-12-19 12:49:26,101::upgrade::994::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Host configuration is already up-to-date
MainThread::INFO::2017-12-19 12:49:26,311::ovf_store::118::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:a59cb8b5-0e77-464c-8c0c-316ef29aa4a0, volUUID:d0428a63-e5ab-4d0f-8e50-06f336732654
MainThread::INFO::2017-12-19 12:49:26,445::ovf_store::118::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:11c50f3b-64b3-455e-8533-b5634a8b45d3, volUUID:0a0ec0b7-48e3-46ff-873b-b6e9beeaf96c
MainThread::INFO::2017-12-19 12:49:26,461::ovf_store::149::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/e061a688-ea1a-4da9-8412-6ec8705ad57b/11c50f3b-64b3-455e-8533-b5634a8b45d3/0a0ec0b7-48e3-46ff-873b-b6e9beeaf96c
MainThread::INFO::2017-12-19 12:49:26,500::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {}
MainThread::INFO::2017-12-19 12:49:26,500::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host alma04.qa.lab.tlv.redhat.com (id 2): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=75932 (Tue Dec 19 12:49:16 2017)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=75932 (Tue Dec 19 12:49:16 2017)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n', 'hostname': 'alma04.qa.lab.tlv.redhat.com', 'host-id': 2, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': 'f05626cd', 'local_conf_timestamp': 75932, 'host-ts': 75932}
MainThread::INFO::2017-12-19 12:49:26,500::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'health': 'good', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'mem-free': 25251.0, 'maintenance': False, 'cpu-load': None, 'gateway': 1.0, 'storage-domain': True}
MainThread::INFO::2017-12-19 12:49:26,503::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (StartState-ReinitializeFSM) sent? None
MainThread::INFO::2017-12-19 12:49:26,788::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)

Comment 7 Nikolai Sednev 2017-12-19 11:49:17 UTC
After 30 minutes or so, the flapping stops on the ha-hosts and the system becomes stable on its own.

Comment 8 Martin Sivák 2017-12-19 15:49:49 UTC
It becomes stable on its own once the OVF is generated.

Nikolai, can you please attach the output of tree /run/vdsm/storage?

Comment 9 Yaniv Kaul 2017-12-19 17:08:23 UTC
(In reply to Nikolai Sednev from comment #7)
> After 30 minutes or so, flapping stops on ha-hosts and system becomes stable
> on its own.

Reducing severity since it stabilizes.

Comment 10 Nikolai Sednev 2017-12-19 17:20:16 UTC
(In reply to Martin Sivák from comment #8)
> It becomes stable on its own once the OVF is generated.
> 
> Nikolai, can you please attach the output of tree /run/vdsm/storage?
New deployment over FC with these components:
rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-imageio-daemon-1.2.0-0.el7ev.noarch
ovirt-imageio-common-1.2.0-0.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
ovirt-setup-lib-1.1.4-1.el7ev.noarch
ovirt-host-4.2.0-1.el7ev.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.11-1.el7ev.noarch
vdsm-4.20.9.3-1.el7ev.x86_64
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.5.x86_64
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017
Linux purple-vds1.qa.lab.tlv.redhat.com 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)


Here is the data from the pair of purple hosts 1 and 2, freshly deployed over FC:
purple-vds1 ~]# tree /run/vdsm/storage
/run/vdsm/storage
└── e9b15107-0ca7-41d9-9b0d-c826b99afb2e
    ├── 0a572170-c3eb-4044-a0ff-91847d9cda26
    │   └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450
    ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af
    │   └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97
    │   └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a
    └── b5d24726-7373-48ae-ba02-a2fdd45b32a7
        └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93

[root@purple-vds2 ~]# tree /run/vdsm/storage
/run/vdsm/storage
└── e9b15107-0ca7-41d9-9b0d-c826b99afb2e
    ├── 0a572170-c3eb-4044-a0ff-91847d9cda26
    │   └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450
    ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af
    │   └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97
    │   └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a
    └── b5d24726-7373-48ae-ba02-a2fdd45b32a7
        └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93


I've already seen that during the addition of the new ha-host (purple-vds2) via the UI, the score was zeroed on purple-vds1 (the initial ha-host on which the SHE deployment was made):
[root@purple-vds1 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : purple-vds1.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 90e840f7
local_conf_timestamp               : 1951
Host timestamp                     : 1948
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1948 (Tue Dec 19 19:11:00 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=1951 (Tue Dec 19 19:11:04 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True

Once purple-vds2 appeared as active in the UI, the flapping started on both hosts:

http://pastebin.test.redhat.com/542120

Comment 11 Nikolai Sednev 2017-12-19 17:31:28 UTC
[root@purple-vds1 ~]# tree /run/vdsm/storage
/run/vdsm/storage
└── e9b15107-0ca7-41d9-9b0d-c826b99afb2e
    ├── 0a572170-c3eb-4044-a0ff-91847d9cda26
    │   └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450
    ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af
    │   └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    └── 7d00c457-a14a-40e0-bc81-df7f634e6a97
        └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a


[root@purple-vds2 ~]# tree /run/vdsm/storage
/run/vdsm/storage
└── e9b15107-0ca7-41d9-9b0d-c826b99afb2e
    ├── 0a572170-c3eb-4044-a0ff-91847d9cda26
    │   └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450
    ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af
    │   └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97
    │   └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a
    └── b5d24726-7373-48ae-ba02-a2fdd45b32a7
        └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93

Comment 12 Nikolai Sednev 2017-12-19 17:32:46 UTC
[root@purple-vds1 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : purple-vds1.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 90e840f7
local_conf_timestamp               : 1951
Host timestamp                     : 1948
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1948 (Tue Dec 19 19:11:00 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=1951 (Tue Dec 19 19:11:04 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : purple-vds2.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : a536f1c7
local_conf_timestamp               : 3093
Host timestamp                     : 3092
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3092 (Tue Dec 19 19:31:50 2017)
        host-id=2
        score=0
        vm_conf_refresh_time=3093 (Tue Dec 19 19:31:51 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=ReinitializeFSM
        stopped=False

[root@purple-vds1 ~]# tree /run/vdsm/storage
/run/vdsm/storage
└── e9b15107-0ca7-41d9-9b0d-c826b99afb2e
    ├── 0a572170-c3eb-4044-a0ff-91847d9cda26
    │   └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450
    ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af
    │   └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    └── 7d00c457-a14a-40e0-bc81-df7f634e6a97
        └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a


[root@purple-vds2 ~]# tree /run/vdsm/storage
/run/vdsm/storage
└── e9b15107-0ca7-41d9-9b0d-c826b99afb2e
    ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af
    │   └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97
    │   └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a
    └── b5d24726-7373-48ae-ba02-a2fdd45b32a7
        └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93

Comment 13 Martin Sivák 2017-12-19 21:50:13 UTC
MainThread::INFO::2017-12-19 11:48:13,279::hosted_engine::924::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2017-12-19 11:48:14,622::hosted_engine::930::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: 
MainThread::INFO::2017-12-19 11:48:14,646::hosted_engine::931::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Command VM.getStats with args {'vmID': '412e1c12-133c-43cd-ba2e-e47e81e47bfb'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': u'412e1c12-133c-43cd-ba2e-e47e81e47bfb'})

This is caused by https://bugzilla.redhat.com/1525955 (Virt issue, see comment #10)

MainThread::INFO::2017-12-19 11:48:14,646::hosted_engine::943::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2017-12-19 11:48:14,791::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2017-12-19 11:48:14,838::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
...
MainThread::INFO::2017-12-19 11:49:05,707::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
... 
MainThread::INFO::2017-12-19 12:06:01,811::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::ERROR::2017-12-19 12:06:11,010::config::475::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::ERROR::2017-12-19 12:06:11,010::hosted_engine::434::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception
Traceback (most recent call last):
  ....
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 358, in get_volume_path
    root=envconst.SD_RUN_DIR,
RuntimeError: Path to volume af333f6f-b1e0-4cdd-ad96-86289302015d not found in /run/vdsm/storage

This is really weird. The HE storage domain is e061a688-ea1a-4da9-8412-6ec8705ad57b, but the symlinks that must have been there are suddenly completely missing. And I do not see anything related to this issue in the VDSM log either.

The lease error from the engine log is just noise: the engine VM is up, and the error happens because hosted engine tries to start it again after the agents lost synchronization.

The situation stabilizes once the OVF store is created, because we call Image.prepare to get it and that prepares the symlinks. This affects only very new deployments that have no OVF store yet.
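For context on the traceback above, a rough sketch of the kind of lookup that fails is below; the function name and layout are assumptions based on the /run/vdsm/storage/<sd_uuid>/<img_uuid>/<vol_uuid> trees shown later in this bug, not the actual heconflib.get_volume_path code:

    import glob
    import os

    SD_RUN_DIR = "/run/vdsm/storage"

    def find_volume_path(sd_uuid, vol_uuid, root=SD_RUN_DIR):
        # Hypothetical sketch: the runtime layout is
        # /run/vdsm/storage/<sd_uuid>/<img_uuid>/<vol_uuid>, so look for the
        # volume link under any image directory of the given storage domain.
        pattern = os.path.join(root, sd_uuid, "*", vol_uuid)
        matches = glob.glob(pattern)
        if not matches:
            # This is the situation hit above: the symlink is gone, so the
            # agent raises and restarts instead of re-preparing the volume.
            raise RuntimeError(
                "Path to volume %s not found in %s" % (vol_uuid, root))
        return matches[0]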



Btw Nikolai, pastebin.test.redhat.com is not accessible from outside of the VPN, so you should probably not use it in public bugs.

Comment 14 Martin Sivák 2017-12-19 21:55:35 UTC
Nir, can you please take a quick look? It seems that all the prepared symlinks suddenly disappeared and I can't find any reason for that.

Comment 15 Nir Soffer 2017-12-20 00:47:53 UTC
(In reply to Martin Sivák from comment #14)
> It seems that all the prepared symlinks suddenly disappeared
Can you explain what symlinks are missing?

Comment 16 Nikolai Sednev 2017-12-20 07:26:55 UTC
(In reply to Yaniv Kaul from comment #9)
> (In reply to Nikolai Sednev from comment #7)
> > After 30 minutes or so, flapping stops on ha-hosts and system becomes stable
> > on its own.
> 
> Reducing severity since it stabilizes.

In my environment over FC I see that the ha-score did not stabilize at all, even overnight.
I still have the issue:
[root@purple-vds1 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : purple-vds1.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 90e840f7
local_conf_timestamp               : 1951
Host timestamp                     : 1948
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1948 (Tue Dec 19 19:11:00 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=1951 (Tue Dec 19 19:11:04 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : purple-vds2.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : a04c826a
local_conf_timestamp               : 52847
Host timestamp                     : 52846
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=52846 (Wed Dec 20 09:21:04 2017)
        host-id=2
        score=0
        vm_conf_refresh_time=52847 (Wed Dec 20 09:21:06 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=ReinitializeFSM
        stopped=False
[root@purple-vds1 ~]# 
[root@purple-vds1 ~]# 
[root@purple-vds1 ~]# 
[root@purple-vds1 ~]# 
[root@purple-vds1 ~]# 
[root@purple-vds1 ~]# 
[root@purple-vds1 ~]# date
Wed Dec 20 09:22:30 IST 2017
[root@purple-vds1 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : purple-vds1.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 90e840f7
local_conf_timestamp               : 1951
Host timestamp                     : 1948
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1948 (Tue Dec 19 19:11:00 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=1951 (Tue Dec 19 19:11:04 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : purple-vds2.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 0967536e
local_conf_timestamp               : 52922
Host timestamp                     : 52918
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=52918 (Wed Dec 20 09:22:17 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=52922 (Wed Dec 20 09:22:20 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStart
        stopped=False
[root@purple-vds1 ~]# tree /run/vdsm/storage
/run/vdsm/storage
├── 0bfb70d9-f651-48ed-8bc1-5fac1707ddf7
└── e9b15107-0ca7-41d9-9b0d-c826b99afb2e
    ├── 0a572170-c3eb-4044-a0ff-91847d9cda26
    │   └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450
    ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af
    │   └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    └── 7d00c457-a14a-40e0-bc81-df7f634e6a97
        └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a

[root@purple-vds2 ~]# tree /run/vdsm/storage
/run/vdsm/storage
└── e9b15107-0ca7-41d9-9b0d-c826b99afb2e
    ├── 0a572170-c3eb-4044-a0ff-91847d9cda26
    │   └── 27845fda-0009-452f-a1ff-56ae66120450 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/27845fda-0009-452f-a1ff-56ae66120450
    ├── 42c1020f-0dbe-421a-8e58-b87aba8f08af
    │   └── 147865c2-5c0d-4467-9240-a93292552c78 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/147865c2-5c0d-4467-9240-a93292552c78
    ├── 7d00c457-a14a-40e0-bc81-df7f634e6a97
    │   └── 9b73cf56-4abd-435e-96fb-e9d8c6020b4a -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/9b73cf56-4abd-435e-96fb-e9d8c6020b4a
    └── b5d24726-7373-48ae-ba02-a2fdd45b32a7
        └── 8eecd31e-5b7a-4b32-b713-b335f7f1fb93 -> /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93

Returning: I think that its severity should be higher.

Comment 17 Martin Sivák 2017-12-20 09:01:08 UTC
Nir: The whole /run/vdsm/storage/e061a688-ea1a-4da9-8412-6ec8705ad57b directory disappeared for no apparent reason.

Comment 18 Martin Sivák 2017-12-20 09:45:51 UTC
So the case from #16:

The symlink to /dev/e9b15107-0ca7-41d9-9b0d-c826b99afb2e/8eecd31e-5b7a-4b32-b713-b335f7f1fb93 disappeared from host purple-vds1, which is VERY weird, since it was the initial host and this happened during the addition of purple-vds2 (so nobody should have been touching vds1 at that time).

purple-vds2 is fine, except it does not know the status of the engine and tries to start it again and again. Sanlock prevents it from doing so as it is supposed to.

purple-vds1 is crashing because the expected configuration disk is not accessible, as the symlink was removed. No idea why yet.

Comment 19 Martin Sivák 2017-12-20 09:51:35 UTC
Nir, the disappearing tree is weird, but the real issue is the disappearing volume link when a second host is being added to the engine.

Comment 20 Nikolai Sednev 2017-12-20 11:04:21 UTC
Symlink issue might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1524119#c8.

Comment 21 Martin Sivák 2017-12-20 11:47:43 UTC
Actually, not really. That issue was caused by vdsm restarting and failing to reconstruct the disk structures properly on recovery. It then failed to prepare the symlink as part of migration. The disk in question there was the actual hard drive of the VM.

Here it happens on a host nobody is touching and with a volume no VM uses (it is internal to the hosted engine infrastructure). As I said earlier, the symptoms are similar, but the issue is different.

The question is why this reproduces when you add a second host to the engine. It should have no effect on the existing host, but it does.

Comment 22 Nir Soffer 2018-01-02 09:08:29 UTC
(In reply to Martin Sivák from comment #17)
> Nir: The whole /run/vdsm/storage/e061a688-ea1a-4da9-8412-6ec8705ad57b
> directory disappeared for no apparent reason.

Can you explain what this tree is and why hosted engine needs it?

Vdsm provides APIs for preparing and tearing down images. You can use the path
returned from Image.prepare to access the volume, but you cannot assume that you
own the path or that it remains on storage.

For example, OVF store volumes are owned by the engine, and the engine updates them
once in a while. On each update, the engine prepares the volume, writes new data,
and tears it down.

You can assume that a volume will be available only if you are the owner of the
volume and no other part of the system touches it.

Comment 23 Martin Sivák 2018-01-02 09:15:09 UTC
Well, we prepare the images as well, all of them actually, both the OVF and our own disks. I suppose vdsm does not track how many users there are.

Comment 24 Nir Soffer 2018-01-02 09:24:26 UTC
(In reply to Martin Sivák from comment #23)
> Well we prepare the images as well. All of them actually - both for OVF and
> our own disks. I suppose vdsm does not track how many users there are..

Vdsm does not track users of prepared volumes. Engine is responsible for
this. To track users, vdsm would need to persist the data on local storage, since
vdsm is likely to be restarted or killed by sanlock during engine soft fencing.

If you are using a volume that other parts of the system also use, like the OVF
volumes, you must be ready for another part of the system to tear your volume down,
and prepare it again. Or better, use the vdsm API to access the data and let vdsm
handle this for you.
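A rough sketch of that "prepare it again" advice (and of what the linked fix "Prepare a volume symlink again if it disappeared" does conceptually) could look like this; prepare_volume here is a hypothetical callable standing in for however the HA agent asks vdsm to prepare the image again, not a real vdsm API:

    import os

    def read_shared_config(volume_path, prepare_volume):
        """Read a shared-storage config volume, re-preparing it if the
        runtime symlink was torn down by another component (for example
        the engine reading it during host-deploy)."""
        if not os.path.exists(volume_path):
            # Another part of the system tore the volume down; prepare it
            # again instead of crashing the monitoring loop.
            volume_path = prepare_volume()
        with open(volume_path, "rb") as f:
            return f.read()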

Comment 25 Martin Sivák 2018-01-02 09:33:18 UTC
Well, the engine cannot do that when it is down. And since there is no API for reading the OVF or our own config volume, we will have to hack around it... again.

Comment 26 Martin Sivák 2018-01-03 11:08:25 UTC
*** Bug 1524904 has been marked as a duplicate of this bug. ***

Comment 27 Martin Sivák 2018-01-03 12:00:27 UTC
Since I was not able to find any teardown for the link in question, and it is not the OVF, I am moving this to storage for investigation, as asked by Nir.

Comment 28 Nir Soffer 2018-01-04 12:15:11 UTC
Martin, please add the info I asked in comment 22.

- what are the missing volumes
- when are they prepared
- is this about NFS/Gluster or FC/iSCSI? Comment 0 suggests the former, comment
  16 suggests the latter.

Comment 29 Martin Sivák 2018-01-04 12:46:44 UTC
It is a hosted engine configuration volume (you can see the id in the traceback). Nobody else uses it. We prepare all volumes when ovirt-ha-broker starts. It happened on both NFS and FC, including OST.

Comment 30 Nir Soffer 2018-01-04 13:47:56 UTC
Nikolai, we need to reproduce this with DEBUG level logs in vdsm. In debug level,
we have logs for every operation that removes a directory from storage.

If vdsm is removing the tree at /run/vdsm/storage/sd-id, we will see this log:

    Removing directory: /run/vdsm/storage/sd-id

To enable debug logs in vdsm, you need to edit /etc/vdsm/logger.conf and change
the root and storage loggers to use DEBUG instead of INFO, and restart vdsm.

This should be standard practice for any bug report. If something does not work we
must have debug logs.
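For reference, the stanzas to touch in /etc/vdsm/logger.conf are plain Python fileConfig sections; a rough sketch of the change is below (the section names are an assumption about the installed file, and only the level= lines need to change; leave the existing handler and qualname lines as they are):

    [logger_root]
    level=DEBUG
    # keep the existing handlers= line unchanged

    [logger_storage]
    level=DEBUG
    # keep the existing handlers=, qualname= and propagate= lines unchanged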

Comment 31 Nikolai Sednev 2018-01-04 15:30:01 UTC
(In reply to Nir Soffer from comment #30)
> Nikolai, we need to reproduce this with DEBUG level logs in vdsm. In debug
> level,
> we have logs for every operation that removes a directory from storage.
> 
> If vdsm is removing the tree at /run/vdsm/storage/sd-id, we will see this
> log:
> 
>     Removing directory: /run/vdsm/storage/sd-id
> 
> To enable debug logs in vdsm, you need to edit /etc/vdsm/logger.conf and
> change
> the root and storage loggers to use DEBUG instead of INFO, and restart vdsm.
> 
> This should be standard practice for any bug report. If something does not
> work we
> must have debug logs.

At the current stage it is not possible due to https://bugzilla.redhat.com/show_bug.cgi?id=1531072.

Comment 32 Yaniv Kaul 2018-01-07 12:55:59 UTC
(In reply to Nikolai Sednev from comment #31)
> (In reply to Nir Soffer from comment #30)
> > Nikolai, we need to reproduce this with DEBUG level logs in vdsm. In debug
> > level,
> > we have logs for every operation that removes a directory from storage.
> > 
> > If vdsm is removing the tree at /run/vdsm/storage/sd-id, we will see this
> > log:
> > 
> >     Removing directory: /run/vdsm/storage/sd-id
> > 
> > To enable debug logs in vdsm, you need to edit /etc/vdsm/logger.conf and
> > change
> > the root and storage loggers to use DEBUG instead of INFO, and restart vdsm.
> > 
> > This should be standard practice for any bug report. If something does not
> > work we
> > must have debug logs.
> 
> At current stage it is not possible due to
> https://bugzilla.redhat.com/show_bug.cgi?id=1531072.

Strange, I can reproduce this easily with O-S-T (at least with the Ansible-based deployment).
What I'm not sure about is how to get VDSM into debug mode during the deployment, though. Any ideas?

Comment 33 Nir Soffer 2018-01-07 13:12:32 UTC
Yaniv, to enable debug logs, we can install /etc/vdsm/logger.conf with the
suggested log levels, or we can issue these commands:

    # Set root logger to debug
    vdsm-client Host setLogLevel level=DEBUG

    # Set storage logger to debug
    vdsm-client Host setLogLevel level=DEBUG name=storage

Comment 34 Denis Chaplygin 2018-01-08 15:24:55 UTC
I reproduced it in my local environment. It looks like VDSM doesn't know that HE is using that volume.

2018-01-08 16:10:18,363+0100 INFO  (Thread-1) [vdsm.api] START uploadImageToStream(methodArgs={'length': 20480, 'fileObj': <socket._fileobject object at 0x3508250>}, callback=<function setCallback at 0x42b76e0>, startEvent=<threading._Event object at 0x42a47d0>, spUUID='5a538125-0111-0282-0066-00000000033c', sdUUID='7f3c72a0-8b9a-46a2-93aa-0a69337c91fe', imgUUID='d3f97859-8f70-42e4-b0c8-41655a7fd32b', volUUID='3b9f9c09-2795-4c28-acf0-f94552d1c5d6') from=internal, task_id=de55c686-0f5e-4c16-9c2f-8f06c4f32dca (api:46)
2018-01-08 16:10:18,364+0100 DEBUG (Thread-1) [storage.oop] Creating ioprocess Global (outOfProcess:86)
2018-01-08 16:10:18,364+0100 INFO  (Thread-1) [IOProcessClient] Starting client ioprocess-53 (__init__:308)
2018-01-08 16:10:18,364+0100 DEBUG (Thread-1) [IOProcessClient] Starting ioprocess for client ioprocess-53 (__init__:320)
2018-01-08 16:10:18,372+0100 DEBUG (Thread-1) [IOProcessClient] Starting communication thread for client ioprocess-53 (__init__:376)
2018-01-08 16:10:18,374+0100 DEBUG (Thread-1) [IOProcessClient] Communication thread for client ioprocess-53 started (__init__:388)
2018-01-08 16:10:18,375+0100 INFO  (ioprocess/18400) [IOProcess] Starting ioprocess (__init__:437)
2018-01-08 16:10:18,382+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') _save: orig /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca temp /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca.temp (task:759)
2018-01-08 16:10:18,395+0100 DEBUG (Thread-1) [storage.TaskManager] scheduled job uploadImageToStream for task de55c686-0f5e-4c16-9c2f-8f06c4f32dca  (taskManager:79)
2018-01-08 16:10:18,395+0100 INFO  (Thread-1) [vdsm.api] FINISH uploadImageToStream return=None from=internal, task_id=de55c686-0f5e-4c16-9c2f-8f06c4f32dca (api:52)
2018-01-08 16:10:18,395+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') Prepare: 1 jobs exist, move to acquiring (task:1193)
2018-01-08 16:10:18,395+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') moving from state preparing -> state acquiring (task:602)
2018-01-08 16:10:18,396+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') _save: orig /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca temp /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca.temp (task:759)
2018-01-08 16:10:18,412+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') moving from state acquiring -> state queued (task:602)
2018-01-08 16:10:18,413+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') _save: orig /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca temp /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca.temp (task:759)
2018-01-08 16:10:18,431+0100 DEBUG (Thread-1) [storage.TaskManager] queuing task: de55c686-0f5e-4c16-9c2f-8f06c4f32dca (taskManager:57)
2018-01-08 16:10:18,431+0100 DEBUG (tasks/2) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60)
2018-01-08 16:10:18,431+0100 INFO  (tasks/2) [storage.ThreadPool.WorkerThread] START task de55c686-0f5e-4c16-9c2f-8f06c4f32dca (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x3a02b48>>, args=None) (threadPool:208)
2018-01-08 16:10:18,431+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') committing task: de55c686-0f5e-4c16-9c2f-8f06c4f32dca (task:1210)
2018-01-08 16:10:18,432+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') moving from state queued -> state running (task:602)
2018-01-08 16:10:18,432+0100 DEBUG (Thread-1) [storage.TaskManager] task queued: de55c686-0f5e-4c16-9c2f-8f06c4f32dca (taskManager:65)
2018-01-08 16:10:18,432+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') returning (task:1198)
2018-01-08 16:10:18,432+0100 DEBUG (Thread-1) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') ref 1 aborting False (task:1002)
2018-01-08 16:10:18,434+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') _save: orig /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca temp /rhev/data-center/5a538125-0111-0282-0066-00000000033c/mastersd/master/tasks/de55c686-0f5e-4c16-9c2f-8f06c4f32dca.temp (task:759)
2018-01-08 16:10:18,451+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') Task.run: running job 0: uploadImageToStream: <bound method StoragePool.uploadImageToStream of <vdsm.storage.sp.StoragePool object at 0x385a3d0>> (args: ({'length': 20480, 'fileObj': <socket._fileobject object at 0x3508250>}, <function setCallback at 0x42b76e0>, <threading._Event object at 0x42a47d0>, '7f3c72a0-8b9a-46a2-93aa-0a69337c91fe', 'd3f97859-8f70-42e4-b0c8-41655a7fd32b', '3b9f9c09-2795-4c28-acf0-f94552d1c5d6') kwargs: {}) (task:914)
2018-01-08 16:10:18,452+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') Job.run: running uploadImageToStream: <bound method StoragePool.uploadImageToStream of <vdsm.storage.sp.StoragePool object at 0x385a3d0>> (args: ({'length': 20480, 'fileObj': <socket._fileobject object at 0x3508250>}, <function setCallback at 0x42b76e0>, <threading._Event object at 0x42a47d0>, '7f3c72a0-8b9a-46a2-93aa-0a69337c91fe', 'd3f97859-8f70-42e4-b0c8-41655a7fd32b', '3b9f9c09-2795-4c28-acf0-f94552d1c5d6') kwargs: {}) callback None (task:333)
2018-01-08 16:10:18,452+0100 DEBUG (tasks/2) [storage.ResourceManager] Trying to register resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' for lock type 'shared' (resourceManager:495)
2018-01-08 16:10:18,455+0100 DEBUG (tasks/2) [storage.VolumeManifest] validate path for 3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileVolume:124)
2018-01-08 16:10:18,458+0100 DEBUG (tasks/2) [storage.VolumeManifest] validate path for 3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileVolume:124)
2018-01-08 16:10:18,461+0100 DEBUG (tasks/2) [storage.VolumeManifest] validate path for 3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileVolume:124)
2018-01-08 16:10:18,465+0100 DEBUG (tasks/2) [storage.ResourceManager] Trying to register resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' for lock type 'shared' (resourceManager:495)
2018-01-08 16:10:18,465+0100 DEBUG (tasks/2) [storage.ResourceManager] Resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' is free. Now locking as 'shared' (1 active user) (resourceManager:552)
2018-01-08 16:10:18,466+0100 DEBUG (tasks/2) [storage.ResourceManager.Request] (ResName='02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6', ReqID='65efa1b0-38a3-4a45-9404-69e0791afca5') Granted request (resourceManager:222)
2018-01-08 16:10:18,466+0100 DEBUG (tasks/2) [storage.ResourceManager] Resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' is free. Now locking as 'shared' (1 active user) (resourceManager:552)
2018-01-08 16:10:18,466+0100 DEBUG (tasks/2) [storage.ResourceManager.Request] (ResName='01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b', ReqID='e11ab9cb-c29a-40c5-b075-2d756f71ed7e') Granted request (resourceManager:222)
2018-01-08 16:10:18,467+0100 DEBUG (tasks/2) [storage.VolumeManifest] validate path for 3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileVolume:124)
2018-01-08 16:10:18,471+0100 INFO  (tasks/2) [storage.Image] chain=3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (top)  (image:1343)
2018-01-08 16:10:18,472+0100 DEBUG (tasks/2) [storage.StorageDomain] Fixing permissions on /rhev/data-center/mnt/10.37.136.6:_dchaplyg_he/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/images/d3f97859-8f70-42e4-b0c8-41655a7fd32b/3b9f9c09-2795-4c28-acf0-f94552d1c5d6 (fileSD:620)
2018-01-08 16:10:18,473+0100 DEBUG (tasks/2) [storage.fileUtils] Creating directory: /var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe mode: None (fileUtils:197)
2018-01-08 16:10:18,473+0100 DEBUG (tasks/2) [storage.fileUtils] Using existing directory: /var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe (fileUtils:206)
2018-01-08 16:10:18,473+0100 DEBUG (tasks/2) [storage.StorageDomain] Creating symlink from /rhev/data-center/mnt/10.37.136.6:_dchaplyg_he/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/images/d3f97859-8f70-42e4-b0c8-41655a7fd32b to /var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/d3f97859-8f70-42e4-b0c8-41655a7fd32b (fileSD:577)
2018-01-08 16:10:18,473+0100 DEBUG (tasks/2) [storage.StorageDomain] img run dir already exists: /var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/d3f97859-8f70-42e4-b0c8-41655a7fd32b (fileSD:582)
2018-01-08 16:10:18,474+0100 DEBUG (tasks/2) [root] /usr/bin/taskset --cpu-list 0-7 /usr/bin/dd if=/rhev/data-center/mnt/10.37.136.6:_dchaplyg_he/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/images/d3f97859-8f70-42e4-b0c8-41655a7fd32b/3b9f9c09-2795-4c28-acf0-f94552d1c5d6 bs=1048576 count=1 (cwd None) (commands:65)
2018-01-08 16:10:18,485+0100 DEBUG (tasks/2) [storage.StorageDomain] Removing image rundir link u'/var/run/vdsm/storage/7f3c72a0-8b9a-46a2-93aa-0a69337c91fe/d3f97859-8f70-42e4-b0c8-41655a7fd32b' (fileSD:597)
2018-01-08 16:10:18,486+0100 DEBUG (tasks/2) [storage.ResourceManager] Trying to release resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' (resourceManager:567)
2018-01-08 16:10:18,486+0100 DEBUG (tasks/2) [storage.ResourceManager] Released resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' (0 active users) (resourceManager:585)
2018-01-08 16:10:18,486+0100 DEBUG (tasks/2) [storage.ResourceManager] Resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b' is free, finding out if anyone is waiting for it. (resourceManager:591)
2018-01-08 16:10:18,486+0100 DEBUG (tasks/2) [storage.ResourceManager] Trying to release resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' (resourceManager:567)
2018-01-08 16:10:18,487+0100 DEBUG (tasks/2) [storage.ResourceManager] Released resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' (0 active users) (resourceManager:585)
2018-01-08 16:10:18,487+0100 DEBUG (tasks/2) [storage.ResourceManager] Resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6' is free, finding out if anyone is waiting for it. (resourceManager:591)
2018-01-08 16:10:18,487+0100 DEBUG (tasks/2) [storage.ResourceManager] No one is waiting for resource '02_vol_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.3b9f9c09-2795-4c28-acf0-f94552d1c5d6', Clearing records. (resourceManager:599)
2018-01-08 16:10:18,487+0100 DEBUG (tasks/2) [storage.ResourceManager] No one is waiting for resource '01_img_7f3c72a0-8b9a-46a2-93aa-0a69337c91fe.d3f97859-8f70-42e4-b0c8-41655a7fd32b', Clearing records. (resourceManager:599)
2018-01-08 16:10:18,488+0100 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='de55c686-0f5e-4c16-9c2f-8f06c4f32dca') moving from state running -> state finished (task:602)

Comment 35 Martin Sivák 2018-01-09 11:07:22 UTC
How can we mark the volume as in use so the cleanup skips it?

Comment 36 Yaniv Lavi 2018-01-10 13:25:03 UTC
There is a log in comment #34. Removing the 'depends on'.

Comment 37 Yaniv Lavi 2018-01-10 13:27:17 UTC
Please see comments #34 and #36 and assign someone to help.

Comment 38 Nir Soffer 2018-01-10 14:50:26 UTC
(In reply to Denis Chaplygin from comment #34)
> I reproduced it on my local environment. Looks like VDSM doesn't know, that
> He is using that volume.

This is correct; vdsm is mostly stateless and does not track who prepared
an image. The engine manages prepares and teardowns, or it is built in
to flows like starting/stopping a VM.

> 2018-01-08 16:10:18,485+0100 DEBUG (tasks/2) [storage.StorageDomain]
> Removing image rundir link u'/var/run/.../d3f97859-8f70-42e4-b0c8-41655a7fd32b'
> (fileSD:597)

Can you explain what this image is? Is it the OVF_STORE volume? Martin
said in comment 29:

> It is a hosted engine configuration volume (you see the id in the traceback).
> Nobody else uses it. We prepare all volumes when ovirt-ha-broker starts. 
> Happened on both NFS and FC including OST.

I don't think we have a way to keep an image prepared and have the image
rundir symlinks exist.

In block storage, we can prevent teardown of the image by keeping the LV
open in the process owning it; for example, open /dev/vg/lv after prepare
and close it before teardown. But even if you do this, the image run links
will be removed during teardown (though that can easily be fixed in vdsm).
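A minimal sketch of that block-storage workaround, assuming lv_path is the /dev/<vg>/<lv> path of an already prepared volume; holding a read-only file descriptor keeps the device open so a teardown issued elsewhere cannot deactivate the LV (the rundir symlink can still disappear, as noted):

    import os

    class HeldVolume(object):
        """Keep a prepared block volume open so its LV cannot be
        deactivated by a teardown issued by another component.
        Sketch only; lv_path is the device path of the prepared volume."""

        def __init__(self, lv_path):
            self.lv_path = lv_path
            self.fd = None

        def __enter__(self):
            self.fd = os.open(self.lv_path, os.O_RDONLY)
            return self

        def __exit__(self, *exc):
            # Close before asking vdsm to tear the image down.
            if self.fd is not None:
                os.close(self.fd)
                self.fd = None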

In file storage, there is nothing you can do with current vdsm to keep an image
prepared if the engine asked to tear it down, or called a flow that prepares and
tears down an image.

The basic question is: why does hosted engine need to keep an image prepared
that is owned and managed by some other component?

We need more info to understand how hosted engine is using this image.

Comment 39 Martin Sivák 2018-01-10 14:55:54 UTC
> Can you explain what is this image - is this OVF_STORE volume? Martin
> said in comment 29:

No, it is a hosted-engine-specific volume. The engine does not know about it and does not use it for anything.


> The basic question is, why hosted engine need to keep an image prepared
> that is owned and managed by some other component?

It is managed by hosted engine only. Nobody called a teardown as far as I was able to see.

> We need more info to understand how hosted engine is using this image.

It is created during hosted engine setup and contains the shared config files hosted engine needs to behave the same on all hosts.

We prepare it when hosted engine starts and (mostly) read from it every time we need something.

Comment 40 Martin Sivák 2018-01-10 15:01:59 UTC
Ah! I know where the issue is.

The engine does indeed touch the volume: it reads data from it when a new host is being deployed.

Comment 41 Nir Soffer 2018-01-10 16:55:17 UTC
The patch fixing this looks fine, but I think we should consider support for
shared resources. Martin, would you file an RFE for this?
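To make the shared-resources idea concrete, here is a minimal sketch of persisted per-volume reference counting; everything here, including the path, is hypothetical and not existing vdsm behaviour. Prepare would bump the count and teardown would only remove the runtime links once the count drops back to zero:

    import errno
    import json
    import os

    REFCOUNT_DIR = "/run/vdsm/prepared-refcounts"  # hypothetical location

    def _count_path(vol_uuid):
        return os.path.join(REFCOUNT_DIR, vol_uuid + ".json")

    def _read_count(vol_uuid):
        try:
            with open(_count_path(vol_uuid)) as f:
                return json.load(f).get("count", 0)
        except IOError:
            return 0

    def incref(vol_uuid):
        try:
            os.makedirs(REFCOUNT_DIR)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
        count = _read_count(vol_uuid) + 1
        with open(_count_path(vol_uuid), "w") as f:
            json.dump({"count": count}, f)
        return count

    def decref(vol_uuid):
        count = max(_read_count(vol_uuid) - 1, 0)
        with open(_count_path(vol_uuid), "w") as f:
            json.dump({"count": count}, f)
        # the caller tears the volume down only when this reaches 0
        return count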

Comment 42 Allon Mureinik 2018-01-11 12:24:07 UTC
(In reply to Nir Soffer from comment #41)
> The patch fixing this looks fine, but I think we should consider support for 
> shared resources. Martin, would file a RFE for this?

So shouldn't this BZ be on MODIFIED?
(Regardless of the RFE, we probably want to continue improving this in the future.)

Comment 43 Nir Soffer 2018-01-11 13:20:10 UTC
Based on comment 40 and the attached fix, moving back to integration.

Comment 44 Nir Soffer 2018-01-11 13:35:25 UTC
Martin, can you assign this bug to someone in your team?

Comment 45 Nikolai Sednev 2018-01-11 19:25:26 UTC
Works for me on these components:
ovirt-hosted-engine-ha-2.2.3-0.0.master.20171218181916.20171218181911.git4c22b93.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.4-0.0.master.20180109170856.gitb593776.el7.centos.noarch
ovirt-engine-appliance-4.2-20180110.1.el7.centos.noarch
Moving to verified.

Comment 46 Nikolai Sednev 2018-01-11 19:26:43 UTC
Created attachment 1380210 [details]
screencast

Comment 47 Sandro Bonazzola 2018-02-12 11:48:44 UTC
This bugzilla is included in the oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

