Bug 1786118 - [4.4.0-17] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd: Job for vdsmd.service canceled"
Summary: [4.4.0-17] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd: Job for vdsmd.service canceled"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 2.4.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: Yedidyah Bar David
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-23 14:02 UTC by Kobi Hakimi
Modified: 2020-05-20 20:03 UTC
11 users

Fixed In Version: ovirt-engine 4.4.0 alpha 551beb5b
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 20:03:09 UTC
oVirt Team: Integration
Embargoed:
mperina: ovirt-4.4?


Attachments
journalctl.logs.tar.gz (246.52 KB, application/gzip) - 2019-12-23 14:02 UTC, Kobi Hakimi
engine-hosted-engine-01_ovirt-engine-logs.tar.gz (976.08 KB, application/gzip) - 2019-12-23 14:05 UTC, Kobi Hakimi
hypervisor-lynx02_ovirt-engine-logs.tar.gz (209.14 KB, application/gzip) - 2019-12-23 14:06 UTC, Kobi Hakimi
hypervisor-lynx03_ovirt-engine-logs.tar.gz (204.57 KB, application/gzip) - 2019-12-23 14:07 UTC, Kobi Hakimi
2nd-run-engine-hosted-engine-01_ovirt-engine-logs (975.51 KB, application/gzip) - 2020-01-07 14:39 UTC, Jiri Macku


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 105707 0 None MERGED ansible: Don't start libvirtd.service 2020-12-15 20:10:55 UTC
oVirt gerrit 106187 0 master MERGED core: Fetch HE config file from the host where HE VM is running 2020-12-15 20:10:55 UTC
oVirt gerrit 106750 0 master MERGED ansible: change stop&start to restart for vnc certificates deployment 2020-12-15 20:10:55 UTC

Description Kobi Hakimi 2019-12-23 14:02:34 UTC
Created attachment 1647318 [details]
journalctl.logs.tar.gz

Description of problem:
[4.4.0-13] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd: Job for vdsmd.service canceled"

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.0-1.el8ev.noarch
ovirt-host-deploy-common-1.9.0-0.0.master.20191128124417.gitd2b9fa5.el8ev.noarch
ovirt-host-4.4.0-0.2.alpha.el8ev.x86_64
python3-ovirt-host-deploy-1.9.0-0.0.master.20191128124417.gitd2b9fa5.el8ev.noarch
ovirt-hosted-engine-setup-2.4.0-1.el8ev.noarch
ovirt-host-dependencies-4.4.0-0.2.alpha.el8ev.x86_64
ovirt-ansible-hosted-engine-setup-1.0.34-1.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted engine (in my case over NFS)
2. Add a secondary host (with hosted engine)

Actual results:
When I tried to add secondary hosts (HA), the host installation failed with the following error:
Host host_mixed_2 installation failed. Task stop vdsmd failed to execute: .

When I checked on the host, the message pointed me to the log: Please check logs for more details: /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log.

Looking at ovirt-host-deploy-ansible-20191223071416-lynx02.lab.eng.tlv2.redhat.com-9658060.log, I can see that it failed in:
2019-12-23 07:30:40 EST - TASK [Executing post tasks defined by user] ************************************
2019-12-23 07:30:40 EST - RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] **************
2019-12-23 07:30:43 EST - fatal: [lynx02.lab.eng.tlv2.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n"}
2019-12-23 07:30:43 EST - {
  "status" : "OK",
  "msg" : "",
  "data" : {
    "event" : "runner_on_failed",
    "uuid" : "0f4c53ba-498f-4f3f-af43-5b40eec32e9e",
    "stdout" : "fatal: [lynx02.lab.eng.tlv2.redhat.com]: FAILED! => {\"changed\": false, \"msg\": \"Unable to stop service vdsmd: Job for vdsmd.service canceled.\\n\"}",
    "counter" : 542,
    "pid" : 3814,
    "created" : "2019-12-23T12:30:43.193815",
    "end_line" : 620,
    "runner_ident" : "befd29cc-257d-11ea-a51d-001a4a168bf7",
    "start_line" : 619,
    "event_data" : {
      "play_pattern" : "all",
      "play" : "all",
      "event_loop" : null,
      "task_args" : "",
      "remote_addr" : "lynx02.lab.eng.tlv2.redhat.com",
      "res" : {
        "invocation" : {
          "module_args" : {
            "no_block" : false,
            "force" : null,
            "name" : "vdsmd",
            "daemon_reexec" : false,
            "enabled" : null,
            "daemon_reload" : false,
            "state" : "stopped",
            "masked" : null,
            "scope" : null,
            "user" : null
          }
        },
        "msg" : "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n",
        "changed" : false,
        "_ansible_no_log" : false
      },
      "pid" : 3814,
      "play_uuid" : "001a4a16-8bf7-c234-d58c-000000000008",
      "task_uuid" : "001a4a16-8bf7-c234-d58c-000000000056",
      "task" : "stop vdsmd",
      "playbook_uuid" : "f3d7b12e-1df3-45e7-8731-d2135ae77265",
      "playbook" : "ovirt-host-deploy.yml",
      "task_action" : "service",
      "host" : "lynx02.lab.eng.tlv2.redhat.com",
      "ignore_errors" : null,
      "role" : "ovirt-host-deploy-vnc-certificates",
      "task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml:1"
    },
    "parent_uuid" : "001a4a16-8bf7-c234-d58c-000000000056"
  }
}


Expected results:
The secondary hosts are added successfully.

Additional info:
Logs attached.

When I tried to reinstall the hosts, I got the following error:
Host host_mixed_3 installation failed. Format specifier '%s'.

Comment 1 Kobi Hakimi 2019-12-23 14:05:02 UTC
Created attachment 1647319 [details]
engine-hosted-engine-01_ovirt-engine-logs.tar.gz

Comment 2 Kobi Hakimi 2019-12-23 14:06:32 UTC
Created attachment 1647320 [details]
hypervisor-lynx02_ovirt-engine-logs.tar.gz

Comment 3 Kobi Hakimi 2019-12-23 14:07:02 UTC
Created attachment 1647321 [details]
hypervisor-lynx03_ovirt-engine-logs.tar.gz

Comment 4 Kobi Hakimi 2019-12-23 14:36:14 UTC
We had almost the same issue in bug https://bugzilla.redhat.com/show_bug.cgi?id=1778638, which is in status ON_QA.

Comment 5 Doron Fediuck 2019-12-24 16:01:33 UTC
Are you sure this is not a duplicate of BZ#1778638?

Comment 6 Kobi Hakimi 2019-12-25 08:25:33 UTC
(In reply to Doron Fediuck from comment #5)
> Are you sure this is not a duplicate of BZ#1778638?

I wrote in comment #4 that it is almost the same, because the error message in the previous bug was:
"Unable to stop service supervdsmd"

and in this bug it is:
"Unable to stop service vdsmd: Job for vdsmd.service canceled"

Also, the first one is in status ON_QA, which means the fix probably resolved something but raised another issue.
If it's the same one, feel free to close this as a duplicate and reopen the old one.

Comment 7 Martin Perina 2019-12-29 12:42:51 UTC
(In reply to Kobi Hakimi from comment #6)
> (In reply to Doron Fediuck from comment #5)
> > Are you sure this is not a duplicate of BZ#1778638?
> 
> I wrote in comment #4 that it is almost the same, because the error message
> in the previous bug was:
> "Unable to stop service supervdsmd"
> 
> and in this bug is:
> "Unable to stop service vdsmd: Job for vdsmd.service canceled"

It seems that we really need to investigate how vdsm is restarted during host deploy, because according to Marcin the following is standard systemd behavior (see the sketch below the list):

1. service_1 is being stopped
2. While service_1 is still being stopped, another process wants to start service_1
3. systemd sends "job cancelled" to process which wanted service_1 to be stopped
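
For illustration only, that sequence can be sketched with a small, hypothetical Python script (this is not oVirt code; the unit name, root privileges and timing are assumptions, and running it would actually stop/start the unit):

import subprocess
import threading

UNIT = "vdsmd.service"  # assumed unit; any unit with a slow ExecStop shows the same behavior

def stop_unit():
    # The same request the Ansible 'service' module issues for state=stopped.
    result = subprocess.run(["systemctl", "stop", UNIT],
                            capture_output=True, text=True)
    # If a concurrent "start" replaced the pending stop job, systemctl prints
    # "Job for vdsmd.service canceled." on stderr and exits non-zero.
    print("stop rc=%s stderr=%s" % (result.returncode, result.stderr.strip()))

def start_unit():
    # A second actor (step 2 above) asks for a start while the stop job is
    # still queued or running.
    subprocess.run(["systemctl", "start", UNIT])

t = threading.Thread(target=stop_unit)
t.start()
start_unit()  # issued while the stop is in flight; systemd cancels the stop job
t.join()

This matches the failure signature in the description: the Ansible service module only sees the non-zero exit and the "Job for vdsmd.service canceled." message.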

Comment 8 Jiri Macku 2020-01-07 14:22:37 UTC
I hit this issue as well. Exactly the same error in /var/log/ovirt-engine/host-deploy logs.
What might be interesting is the error in the engine log:

2020-01-07 05:57:55,380-05 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (ForkJoinPool.commonPool-worker-3) [] createCommand failed: WFLYWELD0039: Singleton not set for jdk.internal.loader.ClassLoaders$AppClassLoader@78308db1. This means that you are trying to access a weld deployment with a Thread Context ClassLoader that is not associated with the deployment.
2020-01-07 05:57:55,383-05 ERROR [org.ovirt.engine.core.bll.hostedengine.HostedEngineConfigFetcher] (default task-1) [7cf9f78b] Failed to fetch the hosted engine config disk due to {}: java.util.concurrent.ExecutionException: java.lang.NullPointerException
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostedengine.HostedEngineConfigFetcher.fetch(HostedEngineConfigFetcher.java:95)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostedengine.HostedEngineHelper.createVdsDeployParams(HostedEngineHelper.java:88)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.executeCommand(AddVdsCommand.java:209)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1326)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2005)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1386)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:420)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:433)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runAction(Backend.java:388)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.jboss.as.ee.0.GA-redhat-00001//org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
        at org.jboss.invocation.2.Final-redhat-00001//org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
        at org.jboss.invocation.2.Final-redhat-00001//org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:509)
...

In this run, a 30-second pause was inserted between hosts to avoid a possible race condition.
The number of occurrences of this error corresponds to the number of failed hosts.
The same error is present in the already uploaded engine.log (engine-hosted-engine-01_ovirt-engine-logs.tar.gz).

Comment 9 Jiri Macku 2020-01-07 14:39:03 UTC
Created attachment 1650414 [details]
2nd-run-engine-hosted-engine-01_ovirt-engine-logs

Comment 10 Martin Perina 2020-01-07 15:21:24 UTC
The NPE is a different issue, but the original problem is still there. We have verified that 4.4.0-13 includes https://gerrit.ovirt.org/105707, so there is some other process that attempts to start vdsm after we have stopped it in host-deploy, and systemd reported this stop as cancelled.
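
One hedged way to narrow down which actor queued that start (an illustrative sketch, not a verified procedure; the time window below is a placeholder around the failure timestamps from the description) is to read systemd's own journal messages around the failed task:

import subprocess

# Dump systemd's (PID 1) messages around the failed "stop vdsmd" task and keep
# only lines mentioning the units of interest, to see which job queued a start
# of vdsmd while the host-deploy stop was still pending.
window = ["--since", "2019-12-23 07:30:00", "--until", "2019-12-23 07:31:00"]  # placeholder window
log = subprocess.run(
    ["journalctl", "_PID=1", "--no-pager", *window],
    capture_output=True, text=True,
).stdout
for line in log.splitlines():
    if any(unit in line for unit in ("vdsmd", "ovirt-ha", "libvirtd")):
        print(line)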

Comment 11 Marcin Sobczyk 2020-01-08 10:01:06 UTC
A quick look at 'journalctl_errors.log' shows that the issue of constantly-crashing 'ovirt-ha-broker' is still here:

Dec 23 07:51:48 lynx02.lab.eng.tlv2.redhat.com ovirt-ha-broker[31536]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
        self._storage_broker_instance = self._get_storage_broker()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
        return storage_broker.StorageBroker()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 80, in __init__
        raise_on_none=True),
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config.py", line 87, in get
        "'{0} can't be '{1}'".format(key, val)
    ValueError: 'metadata_image_UUID can't be 'None'

Since 'ovirt-ha-agent.service' depends on 'ovirt-ha-broker.service' and 'vdsmd.service', this might be the cause of the failure to stop vdsm.
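
A quick way to check that hypothesis on an affected host could be to dump the relevant dependency and restart properties of the units involved (illustrative sketch only; the unit names come from this comment, everything else is an assumption):

import subprocess

# Print the properties that could make systemd queue a new "start" of vdsmd
# (directly or via a dependent unit) while host-deploy is trying to stop it.
for unit in ("ovirt-ha-agent.service", "ovirt-ha-broker.service", "vdsmd.service"):
    out = subprocess.run(
        ["systemctl", "show", unit,
         "-p", "Requires", "-p", "BindsTo", "-p", "PartOf",
         "-p", "Restart", "-p", "ActiveState"],
        capture_output=True, text=True,
    ).stdout
    print("---- %s" % unit)
    print(out)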

Comment 12 Lukas Svaty 2020-02-03 10:48:42 UTC
This issue was reproduced again in 4.4.0-18.

Comment 15 Nikolai Sednev 2020-03-31 12:08:45 UTC
Worked for me on a fresh and clean environment; I successfully added a secondary ha-host with no issues. Deployment of HE 4.4 was performed on NFS.

Tested on a host with these components:
rhvm-appliance.x86_64 2:4.4-20200326.0.el8ev
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Engine:
ovirt-engine-setup-base-4.4.0-0.26.master.el8ev.noarch
ovirt-engine-4.4.0-0.26.master.el8ev.noarch
openvswitch2.11-2.11.0-48.el8fdp.x86_64
Linux 4.18.0-192.el8.x86_64 #1 SMP Tue Mar 24 14:06:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

Comment 16 Sandro Bonazzola 2020-05-20 20:03:09 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

