Bug 1786118

Summary: [4.4.0-17] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd: Job for vdsmd.service canceled"
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Kobi Hakimi <khakimi>
Component: Broker
Assignee: Yedidyah Bar David <didi>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: high
Priority: high
Docs Contact:
Version: 2.4.2
CC: aefrat, bugs, dagur, dfediuck, jmacku, lsvaty, mburman, michal.skrivanek, mperina, msobczyk, sbonazzo
Target Milestone: ovirt-4.4.0
Keywords: AutomationBlocker, Triaged
Target Release: ---
Flags: mperina: ovirt-4.4?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine 4.4.0 alpha 551beb5b
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-20 20:03:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  journalctl.logs.tar.gz
  engine-hosted-engine-01_ovirt-engine-logs.tar.gz
  hypervisor-lynx02_ovirt-engine-logs.tar.gz
  hypervisor-lynx03_ovirt-engine-logs.tar.gz
  2nd-run-engine-hosted-engine-01_ovirt-engine-logs

Description Kobi Hakimi 2019-12-23 14:02:34 UTC
Created attachment 1647318 [details]
journalctl.logs.tar.gz

Description of problem:
[4.4.0-13] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd: Job for vdsmd.service canceled"

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.0-1.el8ev.noarch
ovirt-host-deploy-common-1.9.0-0.0.master.20191128124417.gitd2b9fa5.el8ev.noarch
ovirt-host-4.4.0-0.2.alpha.el8ev.x86_64
python3-ovirt-host-deploy-1.9.0-0.0.master.20191128124417.gitd2b9fa5.el8ev.noarch
ovirt-hosted-engine-setup-2.4.0-1.el8ev.noarch
ovirt-host-dependencies-4.4.0-0.2.alpha.el8ev.x86_64
ovirt-ansible-hosted-engine-setup-1.0.34-1.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted engine (in my case over NFS)
2. Add a secondary host (with hosted engine)

Actual results:
When I tried to add the secondary hosts (HA), host installation failed with the following error:
Host host_mixed_2 installation failed. Task stop vdsmd failed to execute: .

The failure message also pointed to the deploy log: "Please check logs for more details: /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log".

When I look at ovirt-host-deploy-ansible-20191223071416-lynx02.lab.eng.tlv2.redhat.com-9658060.log, I can see that it failed at:
2019-12-23 07:30:40 EST - TASK [Executing post tasks defined by user] ************************************
2019-12-23 07:30:40 EST - RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] **************
2019-12-23 07:30:43 EST - fatal: [lynx02.lab.eng.tlv2.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n"}
2019-12-23 07:30:43 EST - {
  "status" : "OK",
  "msg" : "",
  "data" : {
    "event" : "runner_on_failed",
    "uuid" : "0f4c53ba-498f-4f3f-af43-5b40eec32e9e",
    "stdout" : "fatal: [lynx02.lab.eng.tlv2.redhat.com]: FAILED! => {\"changed\": false, \"msg\": \"Unable to stop service vdsmd: Job for vdsmd.service canceled.\\n\"}",
    "counter" : 542,
    "pid" : 3814,
    "created" : "2019-12-23T12:30:43.193815",
    "end_line" : 620,
    "runner_ident" : "befd29cc-257d-11ea-a51d-001a4a168bf7",
    "start_line" : 619,
    "event_data" : {
      "play_pattern" : "all",
      "play" : "all",
      "event_loop" : null,
      "task_args" : "",
      "remote_addr" : "lynx02.lab.eng.tlv2.redhat.com",
      "res" : {
        "invocation" : {
          "module_args" : {
            "no_block" : false,
            "force" : null,
            "name" : "vdsmd",
            "daemon_reexec" : false,
            "enabled" : null,
            "daemon_reload" : false,
            "state" : "stopped",
            "masked" : null,
            "scope" : null,
            "user" : null
          }
        },
        "msg" : "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n",
        "changed" : false,
        "_ansible_no_log" : false
      },
      "pid" : 3814,
      "play_uuid" : "001a4a16-8bf7-c234-d58c-000000000008",
      "task_uuid" : "001a4a16-8bf7-c234-d58c-000000000056",
      "task" : "stop vdsmd",
      "playbook_uuid" : "f3d7b12e-1df3-45e7-8731-d2135ae77265",
      "playbook" : "ovirt-host-deploy.yml",
      "task_action" : "service",
      "host" : "lynx02.lab.eng.tlv2.redhat.com",
      "ignore_errors" : null,
      "role" : "ovirt-host-deploy-vnc-certificates",
      "task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml:1"
    },
    "parent_uuid" : "001a4a16-8bf7-c234-d58c-000000000056"
  }
}
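
For context, the failing handler uses Ansible's service module with state: stopped, which on these hosts ends up asking systemd to stop the unit. The same kind of failure would surface when issuing the stop directly on the host at the racy moment; a rough sketch (not the module's literal implementation):

    # Equivalent of the "stop vdsmd" handler, run manually on the host
    systemctl stop vdsmd.service
    # On the affected hosts this exits non-zero with:
    #   Job for vdsmd.service canceled.
    # which typically means the stop job was canceled (e.g. replaced by a
    # conflicting start request) before it completed.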


Expected results:
The secondary hosts are added successfully.

Additional info:
logs attached

When I tried to reinstall the hosts, I got the following error:
Host host_mixed_3 installation failed. Format specifier '%s'.

Comment 1 Kobi Hakimi 2019-12-23 14:05:02 UTC
Created attachment 1647319 [details]
engine-hosted-engine-01_ovirt-engine-logs.tar.gz

Comment 2 Kobi Hakimi 2019-12-23 14:06:32 UTC
Created attachment 1647320 [details]
hypervisor-lynx02_ovirt-engine-logs.tar.gz

Comment 3 Kobi Hakimi 2019-12-23 14:07:02 UTC
Created attachment 1647321 [details]
hypervisor-lynx03_ovirt-engine-logs.tar.gz

Comment 4 Kobi Hakimi 2019-12-23 14:36:14 UTC
We had almost the same issue in bug https://bugzilla.redhat.com/show_bug.cgi?id=1778638,
which is in status ON_QA.

Comment 5 Doron Fediuck 2019-12-24 16:01:33 UTC
Are you sure this is not a duplicate of BZ#1778638?

Comment 6 Kobi Hakimi 2019-12-25 08:25:33 UTC
(In reply to Doron Fediuck from comment #5)
> Are you sure this is not a duplicate of BZ#1778638?

I wrote in comment #4 that it is almost the same, because the error message in the previous bug was:
"Unable to stop service supervdsmd"

while in this bug it is:
"Unable to stop service vdsmd: Job for vdsmd.service canceled"

Also, the first one is in status ON_QA, which probably means something was fixed but raised another issue.
If it is the same one, feel free to close this as a duplicate and reopen the old one.

Comment 7 Martin Perina 2019-12-29 12:42:51 UTC
(In reply to Kobi Hakimi from comment #6)
> (In reply to Doron Fediuck from comment #5)
> > Are you sure this is not a duplicate of BZ#1778638?
> 
> I wrote in comment #4 that it is almost the same, because the error message
> in the previous bug was:
> "Unable to stop service supervdsmd"
> 
> while in this bug it is:
> "Unable to stop service vdsmd: Job for vdsmd.service canceled"

It seems that we really need to investigate how vdsm is restarted during host deploy, because according to Marcin the following is standard systemd behavior (illustrated below):

1. service_1 is being stopped
2. While service_1 is still being stopped, another process wants to start service_1
3. systemd sends "job cancelled" to the process which wanted service_1 to be stopped
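
This behavior can be reproduced outside of host deploy; a minimal sketch using a hypothetical slow-stopping unit (demo.service is a placeholder, not an oVirt unit):

    # Stop a unit that takes a moment to shut down, then request a start while
    # the stop job is still pending; systemd replaces the pending stop job and
    # the first systemctl call fails with the same message seen in host deploy.
    systemctl stop demo.service &      # stop job queued, still in flight
    systemctl start demo.service       # conflicting start job replaces the pending stop
    wait $!                            # the backgrounded stop exits non-zero with:
                                       #   Job for demo.service canceled.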

Comment 8 Jiri Macku 2020-01-07 14:22:37 UTC
I hit this issue as well. Exactly the same error in /var/log/ovirt-engine/host-deploy logs.
What might be interesting is the error in the engine log:

2020-01-07 05:57:55,380-05 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (ForkJoinPool.commonPool-worker-3) [] createCommand failed: WFLYWELD0039: Singleton not set for jdk.internal.loader.ClassLoaders$AppClassLoader@78308db1. This means that you are trying to access a weld deployment with a Thread Context ClassLoader that is not associated with the deployment.
2020-01-07 05:57:55,383-05 ERROR [org.ovirt.engine.core.bll.hostedengine.HostedEngineConfigFetcher] (default task-1) [7cf9f78b] Failed to fetch the hosted engine config disk due to {}: java.util.concurrent.ExecutionException: java.lang.NullPointerException
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostedengine.HostedEngineConfigFetcher.fetch(HostedEngineConfigFetcher.java:95)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostedengine.HostedEngineHelper.createVdsDeployParams(HostedEngineHelper.java:88)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.executeCommand(AddVdsCommand.java:209)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1326)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2005)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1386)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:420)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:433)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runAction(Backend.java:388)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.jboss.as.ee.0.GA-redhat-00001//org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
        at org.jboss.invocation.2.Final-redhat-00001//org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
        at org.jboss.invocation.2.Final-redhat-00001//org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:509)
...

In this run a 30-second pause was inserted between hosts to avoid a possible race condition.
The number of occurrences of this error corresponds to the number of failed hosts.
The same error is present in the already uploaded engine.log (engine-hosted-engine-01_ovirt-engine-logs.tar.gz).

Comment 9 Jiri Macku 2020-01-07 14:39:03 UTC
Created attachment 1650414 [details]
2nd-run-engine-hosted-engine-01_ovirt-engine-logs

Comment 10 Martin Perina 2020-01-07 15:21:24 UTC
The NPE is a different issue, but the original problem is still there. We have verified that 4.4.0-13 includes https://gerrit.ovirt.org/105707, so there is some other process that attempts to start vdsm after we have stopped it in host-deploy, and systemd reported this stop as cancelled.

Comment 11 Marcin Sobczyk 2020-01-08 10:01:06 UTC
A quick look at 'journalctl_errors.log' shows that the issue of constantly-crashing 'ovirt-ha-broker' is still here:

Dec 23 07:51:48 lynx02.lab.eng.tlv2.redhat.com ovirt-ha-broker[31536]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
        self._storage_broker_instance = self._get_storage_broker()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
        return storage_broker.StorageBroker()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 80, in __init__
        raise_on_none=True),
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config.py", line 87, in get
        "'{0} can't be '{1}'".format(key, val)
    ValueError: 'metadata_image_UUID can't be 'None'

Since 'ovirt-ha-agent.service' depends on 'ovirt-ha-broker.service' and 'vdsmd.service', this might be the cause of the failure to stop vdsmd.
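
If that hypothesis is right, the dependency chain and the crash loop should be visible directly on the affected host; a possible way to check (diagnostic sketch, not part of the original report):

    # Confirm that ovirt-ha-agent pulls in both the broker and vdsmd
    systemctl list-dependencies ovirt-ha-agent.service
    systemctl show -p Requires -p Wants -p Restart ovirt-ha-broker.service ovirt-ha-agent.service

    # Correlate broker crashes with the canceled vdsmd stop job around the failure time
    journalctl -u ovirt-ha-broker -u ovirt-ha-agent -u vdsmd \
        --since "2019-12-23 07:00" --until "2019-12-23 08:00"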

Comment 12 Lukas Svaty 2020-02-03 10:48:42 UTC
This issue was reproduced again in 4.4.0-18.

Comment 15 Nikolai Sednev 2020-03-31 12:08:45 UTC
Works for me on a fresh and clean environment: I successfully added a secondary HA host with no issues. Deployment of HE 4.4 was performed on NFS.

Tested on host with these components:
rhvm-appliance.x86_64 2:4.4-20200326.0.el8ev
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Engine:
ovirt-engine-setup-base-4.4.0-0.26.master.el8ev.noarch
ovirt-engine-4.4.0-0.26.master.el8ev.noarch
openvswitch2.11-2.11.0-48.el8fdp.x86_64
Linux 4.18.0-192.el8.x86_64 #1 SMP Tue Mar 24 14:06:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

Comment 16 Sandro Bonazzola 2020-05-20 20:03:09 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.