Created attachment 1647318 [details]
journalctl.logs.tar.gz

Description of problem:
[4.4.0-13] Failed to add secondary hosts (HA) with the error "Unable to stop service vdsmd: Job for vdsmd.service canceled".

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.0-1.el8ev.noarch
ovirt-host-deploy-common-1.9.0-0.0.master.20191128124417.gitd2b9fa5.el8ev.noarch
ovirt-host-4.4.0-0.2.alpha.el8ev.x86_64
python3-ovirt-host-deploy-1.9.0-0.0.master.20191128124417.gitd2b9fa5.el8ev.noarch
ovirt-hosted-engine-setup-2.4.0-1.el8ev.noarch
ovirt-host-dependencies-4.4.0-0.2.alpha.el8ev.x86_64
ovirt-ansible-hosted-engine-setup-1.0.34-1.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted engine (in my case over NFS).
2. Add a secondary host (with hosted engine).

Actual results:
When I tried to add secondary hosts (HA), host installation failed with the following error:

Host host_mixed_2 installation failed. Task stop vdsmd failed to execute: . Please check logs for more details: /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log.

When I look at ovirt-host-deploy-ansible-20191223071416-lynx02.lab.eng.tlv2.redhat.com-9658060.log on the host, I can see that it failed in:

2019-12-23 07:30:40 EST - TASK [Executing post tasks defined by user] ************************************
2019-12-23 07:30:40 EST - RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] **************
2019-12-23 07:30:43 EST - fatal: [lynx02.lab.eng.tlv2.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n"}
2019-12-23 07:30:43 EST - {
  "status" : "OK",
  "msg" : "",
  "data" : {
    "event" : "runner_on_failed",
    "uuid" : "0f4c53ba-498f-4f3f-af43-5b40eec32e9e",
    "stdout" : "fatal: [lynx02.lab.eng.tlv2.redhat.com]: FAILED! => {\"changed\": false, \"msg\": \"Unable to stop service vdsmd: Job for vdsmd.service canceled.\\n\"}",
    "counter" : 542,
    "pid" : 3814,
    "created" : "2019-12-23T12:30:43.193815",
    "end_line" : 620,
    "runner_ident" : "befd29cc-257d-11ea-a51d-001a4a168bf7",
    "start_line" : 619,
    "event_data" : {
      "play_pattern" : "all",
      "play" : "all",
      "event_loop" : null,
      "task_args" : "",
      "remote_addr" : "lynx02.lab.eng.tlv2.redhat.com",
      "res" : {
        "invocation" : {
          "module_args" : {
            "no_block" : false,
            "force" : null,
            "name" : "vdsmd",
            "daemon_reexec" : false,
            "enabled" : null,
            "daemon_reload" : false,
            "state" : "stopped",
            "masked" : null,
            "scope" : null,
            "user" : null
          }
        },
        "msg" : "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n",
        "changed" : false,
        "_ansible_no_log" : false
      },
      "pid" : 3814,
      "play_uuid" : "001a4a16-8bf7-c234-d58c-000000000008",
      "task_uuid" : "001a4a16-8bf7-c234-d58c-000000000056",
      "task" : "stop vdsmd",
      "playbook_uuid" : "f3d7b12e-1df3-45e7-8731-d2135ae77265",
      "playbook" : "ovirt-host-deploy.yml",
      "task_action" : "service",
      "host" : "lynx02.lab.eng.tlv2.redhat.com",
      "ignore_errors" : null,
      "role" : "ovirt-host-deploy-vnc-certificates",
      "task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml:1"
    },
    "parent_uuid" : "001a4a16-8bf7-c234-d58c-000000000056"
  }
}

Expected results:
The secondary hosts are added successfully.

Additional info:
Logs attached.

When I tried to reinstall the hosts I got the following error:
Host host_mixed_3 installation failed. Format specifier '%s'.
Created attachment 1647319 [details] engine-hosted-engine-01_ovirt-engine-logs.tar.gz
Created attachment 1647320 [details] hypervisor-lynx02_ovirt-engine-logs.tar.gz
Created attachment 1647321 [details] hypervisor-lynx03_ovirt-engine-logs.tar.gz
We had almost the same issue in bug https://bugzilla.redhat.com/show_bug.cgi?id=1778638, which is in status ON_QA.
Are you sure this is not a duplicate of BZ#1778638?
(In reply to Doron Fediuck from comment #5)
> Are yo usure this is not a duplicate of BZ#1778638?

I wrote in comment #4 that it is almost the same, because the error message in the previous bug was:
"Unable to stop service supervdsmd"

and in this bug it is:
"Unable to stop service vdsmd: Job for vdsmd.service canceled"

Also, the first one is in status ON_QA, which probably means the fix resolved one problem but raised another. If it is the same issue, feel free to close this as a duplicate and reopen the old one.
(In reply to Kobi Hakimi from comment #6)
> (In reply to Doron Fediuck from comment #5)
> > Are yo usure this is not a duplicate of BZ#1778638?
>
> I wrote in comment #4 that it almost the same because of the error message
> in the previous bug was:
> "Unable to stop service supervdsmd"
>
> and in this bug is:
> "Unable to stop service vdsmd: Job for vdsmd.service canceled"

It seems that we really need to investigate how vdsm is restarted during host deploy, because according to Marcin the following is standard systemd behavior:

1. service_1 is being stopped.
2. While service_1 is still being stopped, another process requests that service_1 be started.
3. systemd sends "job canceled" to the process that wanted service_1 to be stopped.
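The three steps above can be sketched as a toy job-queue model: each unit has at most one pending job, and queueing a conflicting job cancels the pending one, which is what the "stop vdsmd" task observed. This is only an illustration of the race under that assumption; the JobQueue class and its method names are hypothetical, not systemd's actual implementation.

```python
# Toy model of the systemd behavior described above: a conflicting job queued
# for the same unit cancels the pending one. JobQueue is hypothetical.

class JobQueue:
    def __init__(self):
        self.pending = {}  # unit name -> most recent job dict

    def enqueue(self, unit, action):
        """Queue a job; a conflicting pending job on the same unit is canceled."""
        old = self.pending.get(unit)
        if old is not None and old["action"] != action:
            old["result"] = "canceled"  # what the stopping process is told
        job = {"unit": unit, "action": action, "result": "pending"}
        self.pending[unit] = job
        return job

queue = JobQueue()

# 1. host-deploy asks systemd to stop vdsmd
stop_job = queue.enqueue("vdsmd.service", "stop")

# 2. while the stop is still in flight, another process asks to start vdsmd
start_job = queue.enqueue("vdsmd.service", "start")

# 3. the original requester sees "Job for vdsmd.service canceled."
print(stop_job["result"])  # canceled
```

In this model, as in the bug, the stop itself did not fail; it was preempted by a competing start request, which is why identifying the process that restarts vdsm matters.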
I hit this issue as well. Exactly the same error in the /var/log/ovirt-engine/host-deploy logs.

What might be interesting is the error in the engine log:

2020-01-07 05:57:55,380-05 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (ForkJoinPool.commonPool-worker-3) [] createCommand failed: WFLYWELD0039: Singleton not set for jdk.internal.loader.ClassLoaders$AppClassLoader@78308db1. This means that you are trying to access a weld deployment with a Thread Context ClassLoader that is not associated with the deployment.

2020-01-07 05:57:55,383-05 ERROR [org.ovirt.engine.core.bll.hostedengine.HostedEngineConfigFetcher] (default task-1) [7cf9f78b] Failed to fetch the hosted engine config disk due to {}: java.util.concurrent.ExecutionException: java.lang.NullPointerException
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostedengine.HostedEngineConfigFetcher.fetch(HostedEngineConfigFetcher.java:95)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostedengine.HostedEngineHelper.createVdsDeployParams(HostedEngineHelper.java:88)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.executeCommand(AddVdsCommand.java:209)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1326)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2005)
	at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164)
	at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1386)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:420)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:433)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runAction(Backend.java:388)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.jboss.as.ee.0.GA-redhat-00001//org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
	at org.jboss.invocation.2.Final-redhat-00001//org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
	at org.jboss.invocation.2.Final-redhat-00001//org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:509)
	...

In this run a 30 second pause was inserted between hosts to avoid a possible race condition. The number of occurrences of this error corresponds to the number of failed hosts. The same error is present in the already uploaded engine.log (engine-hosted-engine-01_ovirt-engine-logs.tar.gz).
Created attachment 1650414 [details] 2nd-run-engine-hosted-engine-01_ovirt-engine-logs
The NPE is a different issue, but the original problem is still there. We have verified that 4.4.0-13 includes https://gerrit.ovirt.org/105707, so there is some other process that attempts to start vdsm after we have stopped it in host-deploy, and systemd reports this stop as canceled.
A quick look at 'journalctl_errors.log' shows that the issue of the constantly-crashing 'ovirt-ha-broker' is still here:

Dec 23 07:51:48 lynx02.lab.eng.tlv2.redhat.com ovirt-ha-broker[31536]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 80, in __init__
    raise_on_none=True),
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config.py", line 87, in get
    "'{0} can't be '{1}'".format(key, val)
ValueError: 'metadata_image_UUID can't be 'None'

Since 'ovirt-ha-agent.service' depends on 'ovirt-ha-broker.service' and 'vdsmd.service', this might be the cause of the failure to stop vdsm.
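The ValueError above comes from a config getter that refuses to return a missing value, so the broker dies at startup while metadata_image_UUID is still unset. A minimal sketch of that guard pattern follows; config_get and the sample he_config dict are hypothetical illustrations, not the actual ovirt_hosted_engine_ha code.

```python
# Sketch of a config getter with a raise_on_none guard, similar in spirit to
# the get() call in the traceback above. config_get is a hypothetical helper.

def config_get(config, key, raise_on_none=False):
    """Return config[key]; optionally refuse a missing/None value."""
    val = config.get(key)
    if raise_on_none and val is None:
        # A required key that has not been written yet aborts startup,
        # which is how the broker ends up in a crash loop.
        raise ValueError("'{0}' can't be '{1}'".format(key, val))
    return val

he_config = {"gateway": "10.0.0.254"}  # metadata_image_UUID not written yet

try:
    config_get(he_config, "metadata_image_UUID", raise_on_none=True)
except ValueError as e:
    print(e)  # 'metadata_image_UUID' can't be 'None'
```

If the broker keeps crashing on this, systemd will keep restarting it, and any unit tied to it (such as ovirt-ha-agent, which also wants vdsmd) can generate the competing start jobs described earlier in the thread.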
This issue was reproduced again in 4.4.0-18.
Worked for me on a fresh and clean environment; I successfully added a secondary ha-host with no issues. Deployment of HE 4.4 was performed on NFS.

Tested on a host with these components:
rhvm-appliance.x86_64 2:4.4-20200326.0.el8ev
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Engine:
ovirt-engine-setup-base-4.4.0-0.26.master.el8ev.noarch
ovirt-engine-4.4.0-0.26.master.el8ev.noarch
openvswitch2.11-2.11.0-48.el8fdp.x86_64
Linux 4.18.0-192.el8.x86_64 #1 SMP Tue Mar 24 14:06:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.