Bug 1530605
| Summary: | [downstream clone - 4.1.9] ovirt-ha-agent fails parsing the OVF_STORE due to a change in OVF namespace URI | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ | ||||||||
| Component: | ovirt-hosted-engine-ha | Assignee: | Martin Sivák <msivak> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Nikolai Sednev <nsednev> | ||||||||
| Severity: | urgent | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | unspecified | CC: | bugs, eheftman, lsurette, mavital, michal.skrivanek, msivak, stirabos, ykaul, ylavi | ||||||||
| Target Milestone: | ovirt-4.1.10 | Keywords: | AutomationBlocker, Regression, Triaged, ZStream | ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | 1518887 | Environment: | |||||||||
| Last Closed: | 2018-03-20 16:36:48 UTC | Type: | --- | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | 1518887 | ||||||||||
| Bug Blocks: | |||||||||||
| Attachments: |
Description
rhev-integ
2018-01-03 13:25:43 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
(Originally by rule-engine)

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
(Originally by rule-engine)

Using ovirt-engine-appliance-4.2-20171210.1.el7.centos.noarch and ovirt-hosted-engine-setup-2.2.1-0.0.master.20171206172737.gitd3001c8.el7.centos.noarch, during and after deployment of SHE on a pair of hosts over Gluster, and after adding an NFS data storage domain, I did not reproduce the original issue, hence moving to verified.
(Originally by Nikolai Sednev)

This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
(Originally by Sandro Bonazzola)

This is only happening when 4.1 hosted engine hosts are used with 4.2 engine. That is a supported case and it needs to work.

Reproduction steps performed:
1. I've deployed a 4.1.9.1-0.1.el7 environment with two 4.1.9 RHEL7.5 ha-hosts and SHE over NFS.
2. I've added one NFS data storage domain and waited for SHE to finish the auto-import.
3. I've set the environment into global maintenance.
4. I've upgraded the engine's OS from el7.4 to el7.5, restarted the engine using "reboot" from the engine itself, and then started the engine from the host by running the "hosted-engine --vm-start" command.
5. I tried to update to the latest rhv-release-4.2.2-3-001.noarch, but failed with an rpm dependency issue, which is covered here: https://bugzilla.redhat.com/show_bug.cgi?id=1548843
6. I had to upgrade the engine and release the ha-hosts from global maintenance, but due to step 5, I was unable to continue.
7. I had to add additional CPU cores to the engine, to force an OVF_STORE upgrade, but due to step 5, I was unable to continue.
8. I had to restart the ha-agent on the host that was hosting the engine and check for errors related to OVF_STORE, but due to step 5, I was unable to continue.

Components on hosts:
rhvm-appliance-4.1.20180125.0-1.el7.noarch
ovirt-hosted-engine-ha-2.1.9-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.4.1-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)
Linux 3.10.0-855.el7.x86_64 #1 SMP Tue Feb 20 06:46:45 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Components on engine:
ovirt-engine-setup-4.1.9.1-0.1.el7.noarch
rhv-release-4.2.2-3-001.noarch
Linux 3.10.0-855.el7.x86_64 #1 SMP Tue Feb 20 06:46:45 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

I don't see how the original upgrade scenario is possible at this stage. Please provide your input.

Tried again, but the 4.1.10->4.2.2 engine upgrade failed as described in https://bugzilla.redhat.com/show_bug.cgi?id=1548868.
I did succeed in upgrading from ovirt-engine-setup-4.1.9.1-0.1.el7.noarch to ovirt-engine-setup-4.1.10.1-0.1.el7.noarch. I then restarted (around 20:04:27) both the broker and the agent on the host that was hosting the engine, and I did not see the original errors related to OVF.
MainThread::ERROR::2018-02-25 20:04:27,154::agent::199::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Disconnected from broker 'Not connected to broker' - reinitializing
MainThread::WARNING::2018-02-25 20:04:32,160::agent::209::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '2'
MainThread::INFO::2018-02-25 20:04:32,181::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: alma04.qa.lab.tlv.redhat.com
MainThread::INFO::2018-02-25 20:04:32,183::hosted_engine::604::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2018-02-25 20:12:42,461::config::416::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Trying to get a fresher copy of vm configuration from the OVF_STORE
MainThread::INFO::2018-02-25 20:12:42,462::ovf_store::132::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2018-02-25 20:12:42,462::ovf_store::134::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/028cd962-6c8c-41d0-9db8-db88dc6d2cb1/3134b3ce-ced6-4c20-9eb0-98d0447d7ad7/050c8a39-1d58-4a0a-8be2-86296ae87699
MainThread::INFO::2018-02-25 20:12:42,472::config::435::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Found an OVF for HE VM, trying to convert
MainThread::INFO::2018-02-25 20:12:42,477::config::440::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Got vm.conf from OVF_STORE
MainThread::INFO::2018-02-25 20:12:42,482::hosted_engine::604::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2018-02-25 20:12:44,902::hosted_engine::630::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2018-02-25 20:12:44,902::storage_server::220::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(validate_storage_server) Validating storage server
MainThread::INFO::2018-02-25 20:12:47,280::hosted_engine::639::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Storage domain reported as valid and reconnect is not forced.

It looks like the issue is not reproduced on 4.1.10.1-0.1, but I was unable to check it on 4.2.2, as that is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1548868.
Moving to verified, since this bug was targeted to 4.1.10.1 and it is working fine now.

Components on hosts:
ovirt-hosted-engine-ha-2.1.9-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.4.1-1.el7ev.noarch
Linux 3.10.0-855.el7.x86_64 #1 SMP Tue Feb 20 06:46:45 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

On engine:
ovirt-engine-setup-4.1.10.1-0.1.el7.noarch
Linux 3.10.0-855.el7.x86_64 #1 SMP Tue Feb 20 06:46:45 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

Moving back to ON QA following comment #7. This bug is still blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1548868.

Reproduction steps performed:
1. I've deployed a 4.1.9.1-0.1.el7 environment with two 4.1.9 RHEL7.5 ha-hosts and SHE over NFS.
2. I've added one NFS data storage domain and waited for SHE to finish the auto-import.
3. I've set the environment into global maintenance.
4. I've upgraded the engine's OS from el7.4 to el7.5, restarted the engine using "reboot" from the engine itself, and then started the engine from the host by running the "hosted-engine --vm-start" command.
5. I've upgraded the engine to ovirt-engine-setup-4.1.10.1-0.1.el7.noarch.
6. I tried to update to the latest rhv-release-4.2.2-3-001.noarch, but it failed the first time and I had to use the workaround described here: https://bugzilla.redhat.com/show_bug.cgi?id=1552539#c3. I then re-ran the setup and successfully upgraded the engine to ovirt-engine-setup-4.2.2.2-0.1.el7.noarch.
7. I released the ha-hosts from global maintenance.
8. I've added additional CPU cores to the engine, to force an OVF_STORE upgrade.
9. I've restarted the ha-agent and ha-broker on the host that was hosting the engine, checked for errors related to OVF_STORE, and found the ha-agent in this state:
[root@alma04 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2018-03-07 18:27:14 IST; 4min 25s ago
Main PID: 61900 (ovirt-ha-agent)
Tasks: 2
CGroup: /system.slice/ovirt-ha-agent.service
└─61900 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
Mar 07 18:27:14 alma04.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Mar 07 18:27:14 alma04.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Monitoring Agent...
Mar 07 18:28:07 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[61900]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore ERROR Unable to extract HEVM OVF
Mar 07 18:28:07 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[61900]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf

Components on hosts:
ovirt-hosted-engine-ha-2.1.9-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.4.1-1.el7ev.noarch
3.10.0-858.el7.x86_64
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Components on engine:
ovirt-engine-setup-4.2.2.2-0.1.el7.noarch
3.10.0-858.el7.x86_64
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Moving back to assigned.
See sosreports from both hosts and the engine attached.
Created attachment 1405462 [details]
sosreport from alma03
Created attachment 1405463 [details]
sosreport from alma04
Created attachment 1405464 [details]
engine logs
So the agent was running just fine and you still failed the bug? I do not see the issue from the report happening.

Yes, it was not able to download the OVF store.
- How long did you wait for it?
- Why did you restart the agent / broker?
- Did you check the engine log to see that the OVF was actually written?

As I see it, you did not actually test the described bug, because the OVF was not present yet and the original bug only appears once the OVF is present. It was parsing the OVF that crashed the agent, not the fact that it was missing.

Please try again and this time wait for the OVF to appear. Or open a different bug if it does not appear within a couple of minutes, as that has nothing to do with the failure to parse it.

Following our discussion with Simone: we can't delete an older error message in systemd. Once we have reported an error to systemd, it is kept there, and the only way to clear the message is to restart the service. The status output is just a kind of log that shows the latest error messages, regardless of whether the error was resolved afterwards. That is why the error appeared in "systemctl status ovirt-ha-agent -l" on the host, although the OVF was successfully parsed later:
MainThread::INFO::2018-03-07 18:41:35,887::config::416::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Trying to get a fresher copy of vm configuration from the OVF_STORE
MainThread::INFO::2018-03-07 18:41:35,888::ovf_store::132::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2018-03-07 18:41:35,888::ovf_store::134::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/e760aa8d-4ca0-48c6-8944-5732fa1bf4fb/9fa8d34a-2049-43a4-8c1c-516cb3395dc5/b47ff40d-5dfb-4be8-8a96-5ee07ab15acb
MainThread::INFO::2018-03-07 18:41:35,898::config::435::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Found an OVF for HE VM, trying to convert
MainThread::INFO::2018-03-07 18:41:35,904::config::440::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Got vm.conf from OVF_STORE
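Because the systemd status view keeps showing the earlier error, one way to judge the current state is to look at the most recent OVF extraction result in the agent log instead. The following is only a minimal sketch: the function name `latest_ovf_outcome` is hypothetical, the field layout is inferred from the log lines quoted in this report, and the two message strings it matches are the ones that appear in those lines.

```python
def latest_ovf_outcome(lines):
    """Return (timestamp, "ok" | "failed") for the most recent OVF
    extraction result found in the given agent log lines, or None.

    Assumes the '::'-separated layout seen in this report:
    thread::level::timestamp::module::lineno::logger::(func) message
    """
    outcome = None
    for line in lines:
        parts = line.rstrip("\n").split("::", 6)
        if len(parts) < 7:
            continue  # not an agent log line in the expected layout
        timestamp, message = parts[2], parts[6]
        if "Got vm.conf from OVF_STORE" in message:
            outcome = (timestamp, "ok")
        elif "Failed extracting VM OVF from the OVF_STORE" in message:
            outcome = (timestamp, "failed")
    return outcome


# Two lines taken from the logs quoted in this report.
sample = [
    "MainThread::ERROR::2018-03-07 18:28:07,708::config::449::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::"
    "(_get_vm_conf_content_from_ovf_store) Failed extracting VM OVF from "
    "the OVF_STORE volume, falling back to initial vm.conf",
    "MainThread::INFO::2018-03-07 18:28:14,653::config::440::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::"
    "(_get_vm_conf_content_from_ovf_store) Got vm.conf from OVF_STORE",
]

result = latest_ovf_outcome(sample)
print(result)
```

With these two lines the later success wins, which matches the observation that the error was transient.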
Here is the result after restarting the service again:
[root@alma04 ~]# systemctl restart ovirt-ha-agent
[root@alma04 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2018-03-07 18:59:06 IST; 39s ago
Main PID: 75123 (ovirt-ha-agent)
Tasks: 1
CGroup: /system.slice/ovirt-ha-agent.service
└─75123 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
Mar 07 18:59:06 alma04.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Mar 07 18:59:06 alma04.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Monitoring Agent...
Moving to verified.
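For context on the failure this bug tracks (the agent failing to parse the OVF after the OVF namespace URI changed), the sketch below shows one generic way to make XML lookups tolerant of a namespace URI change: match on the local element name only. It is an illustration under stated assumptions, not the actual fix shipped in ovirt-hosted-engine-ha. The helper names, the example URIs and the tiny document are all hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace-uri}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_by_local_name(root, name):
    """Return every element whose local name matches, whatever its namespace."""
    return [
        el for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == name
    ]


# Hypothetical documents that differ only in the namespace URI.
OLD_DOC = (
    '<ovf:Envelope xmlns:ovf="http://example.invalid/ovf/old">'
    "<ovf:Content><ovf:Name>HostedEngine</ovf:Name></ovf:Content>"
    "</ovf:Envelope>"
)
NEW_DOC = OLD_DOC.replace("/ovf/old", "/ovf/new")

old_root = ET.fromstring(OLD_DOC)
new_root = ET.fromstring(NEW_DOC)

# The fully qualified tags differ, so a lookup hard-coded to one URI
# would miss the element in the other document.
names_old = [el.text for el in find_by_local_name(old_root, "Name")]
names_new = [el.text for el in find_by_local_name(new_root, "Name")]
print(names_old, names_new)
```

The trade-off is that local-name matching ignores namespaces entirely, so it is only appropriate when element names are unambiguous within the document.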
(In reply to Martin Sivák from comment #18)
> So the agent was running just fine and you still failed the bug? I do not
> see the issue from the report happening.
>
> Yes, it was not able to download the ovf store.
> - How long did you wait for it?
> - Why did you restart the agent / broker?
> - Did you check engine log to see the OVF was actually written?
>
> As I see it you actually did not test the described bug as the OVF was not
> present yet and the original bug only appears once the OVF is present. It
> was parsing the OVF that crashed the agent, not the fact it was missing.
>
> Please try again and this time wait for the OVF to appear. Or open a
> different bug if it does not within couple of minutes as that has nothing to
> do with the failure to parse it.

All reproduction steps were discussed and agreed with Simone.
I failed the bug because I saw an old systemd OVF error, which had been resolved after a few minutes but was not cleared from systemd, as explained earlier.
The broker and agent were restarted to test the bug, as discussed with Simone.
The OVF had already been created.

In ovirt-ha-agent we just read the OVF_STORE content without any locking mechanism, so maybe you were unlucky enough to read it while the engine was updating it.
I don't think it's an issue if the next read attempt is OK.
Maybe we can lower the error level on the first error to avoid false positives.
Indeed it was already fine by itself just a few seconds after:
MainThread::INFO::2018-03-07 18:28:07,694::ovf_store::132::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2018-03-07 18:28:07,695::ovf_store::134::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/e760aa8d-4ca0-48c6-8944-5732fa1bf4fb/9fa8d34a-2049-43a4-8c1c-516cb3395dc5/b47ff40d-5dfb-4be8-8a96-5ee07ab15acb
MainThread::ERROR::2018-03-07 18:28:07,707::ovf_store::139::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Unable to extract HEVM OVF
MainThread::ERROR::2018-03-07 18:28:07,708::config::449::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf
MainThread::INFO::2018-03-07 18:28:07,848::config::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_vm_conf) Reloading vm.conf from the shared storage domain
MainThread::INFO::2018-03-07 18:28:07,848::config::416::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Trying to get a fresher copy of vm configuration from the OVF_STORE
MainThread::INFO::2018-03-07 18:28:11,182::ovf_store::109::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:fa57c6f4-6357-403c-955e-f33d97027b3e, volUUID:1d016691-8e09-439c-afa4-982f8087416c
MainThread::INFO::2018-03-07 18:28:12,504::ovf_store::109::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:9fa8d34a-2049-43a4-8c1c-516cb3395dc5, volUUID:b47ff40d-5dfb-4be8-8a96-5ee07ab15acb
MainThread::INFO::2018-03-07 18:28:14,605::ovf_store::132::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2018-03-07 18:28:14,606::ovf_store::134::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/e760aa8d-4ca0-48c6-8944-5732fa1bf4fb/9fa8d34a-2049-43a4-8c1c-516cb3395dc5/b47ff40d-5dfb-4be8-8a96-5ee07ab15acb
MainThread::INFO::2018-03-07 18:28:14,648::config::435::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Found an OVF for HE VM, trying to convert
MainThread::INFO::2018-03-07 18:28:14,653::config::440::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Got vm.conf from OVF_STORE

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:0561

BZ<2>Jira Resync
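The suggestion made earlier in this thread, lowering the error level on the first failed read of the OVF_STORE because the read is unlocked and the next attempt usually succeeds, could look roughly like the sketch below. This is a hedged illustration only, not code from ovirt-hosted-engine-ha: `read_with_retry`, the logger name, the use of `ValueError` as the failure signal and the stand-in reader functions are all assumptions made for the example.

```python
import logging

log = logging.getLogger("ovf_read_sketch")  # hypothetical logger name


def read_with_retry(read_once, attempts=2):
    """Call read_once() up to `attempts` times and return its result.

    Failures before the final attempt are logged at WARNING; only the
    final failure is logged at ERROR. Returns None if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return read_once()
        except ValueError as exc:
            level = logging.ERROR if attempt == attempts else logging.WARNING
            log.log(level, "OVF read attempt %d/%d failed: %s",
                    attempt, attempts, exc)
    return None


def make_flaky_reader():
    """Stand-in reader that fails once (as if caught mid-update), then succeeds."""
    state = {"calls": 0}

    def read_once():
        state["calls"] += 1
        if state["calls"] == 1:
            raise ValueError("content changed while reading")
        return "vm.conf content"

    return read_once


def always_fails():
    """Stand-in reader that never succeeds."""
    raise ValueError("unreadable")


recovered = read_with_retry(make_flaky_reader())
failed = read_with_retry(always_fails)
print(recovered, failed)
```

With this shape a transient failure leaves only a WARNING behind, so an ERROR in the service status would indicate that the retries were exhausted.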