Bug 1908773 - HE deployment fails on CentOS Stream when verifying that the engine service is up
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-ansible-collection
Classification: oVirt
Component: hosted-engine-setup
Version: 1.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.4.5
Target Release: ---
Assignee: Asaf Rachmani
QA Contact: meital avital
URL:
Whiteboard:
Depends On: 1908275
Blocks:
 
Reported: 2020-12-17 15:09 UTC by Yedidyah Bar David
Modified: 2023-09-15 00:53 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-21 12:41:29 UTC
oVirt Team: Integration
Embargoed:
pm-rhel: ovirt-4.4+


Attachments
ovirt-system-tests logs (2.72 MB, application/gzip)
2020-12-17 20:51 UTC, Yedidyah Bar David

Description Yedidyah Bar David 2020-12-17 15:09:50 UTC
Description of problem:

$subject. Reproduced on CI [1][2] and locally (using OST):

2020-12-17 14:57:30,137+0100 INFO ansible task start {'status': 'OK', 'ansible_type': 'task', 'ansible_playbook': '/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml', 'ansible_task': 'ovirt.ovirt.engine_setup : Make sure `ovirt-engine` service is running'}
2020-12-17 14:57:30,137+0100 DEBUG ansible on_any args TASK: ovirt.ovirt.engine_setup : Make sure `ovirt-engine` service is running  kwargs is_conditional:False 
2020-12-17 14:57:30,138+0100 DEBUG ansible on_any args localhost TASK: ovirt.ovirt.engine_setup : Make sure `ovirt-engine` service is running  kwargs 
2020-12-17 14:57:33,468+0100 DEBUG var changed: host "localhost" var "ansible_play_hosts" type "<class 'list'>" value: "[]"
2020-12-17 14:57:33,469+0100 DEBUG var changed: host "localhost" var "ansible_play_batch" type "<class 'list'>" value: "[]"
2020-12-17 14:57:33,469+0100 DEBUG var changed: host "localhost" var "play_hosts" type "<class 'list'>" value: "[]"
2020-12-17 14:57:33,469+0100 ERROR ansible failed {
    "ansible_host": "localhost",
    "ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
    "ansible_result": {
        "_ansible_delegated_vars": {
            "ansible_host": "lago-he-basic-suite-master-engine.lago.local",
            "ansible_port": null,
            "ansible_user": "root"
        },
        "_ansible_no_log": false,
        "changed": false,
        "invocation": {
            "module_args": {
                "daemon_reexec": false,
                "daemon_reload": false,
                "enabled": null,
                "force": null,
                "masked": null,
                "name": "ovirt-engine",
                "no_block": false,
                "scope": null,
                "state": "started",
                "user": null
            }
        },
        "msg": "Service is in unknown state",
        "status": {}
    },
    "ansible_task": "Make sure `ovirt-engine` service is running",
    "ansible_type": "task",
    "status": "FAILED",
    "task_duration": 4
}

When I checked a few minutes later on the local run, the service was up and running fine.

I am still not fully certain about the reason for the breakage.

The engine service does take some time between being started and actually being up. See also bug 1075013.

It might be a change in systemd, regarding what it reports and when in this situation, or a change in the ansible code that runs [3]. A debug task like the sketch after the links below could help pin down what systemd itself reports at that point.

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14481/

[2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14481/artifact/check-patch.he-basic_suite_master.el8.x86_64/test_logs/he-basic-suite-master/post-he_deploy/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-20201217144447-0yncwg.log

[3] https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/engine_setup/tasks/engine_setup.yml#L89
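Just for illustration, the debug-only sketch mentioned above (hypothetical, not part of the role; it would need the same delegation to the engine VM as the failing task):

  # Hypothetical debug-only tasks: record what systemd itself reports for the
  # unit, so it can be compared with what ansible's service module sees.
  - name: Capture raw ovirt-engine unit state (debug only)
    command: systemctl show ovirt-engine.service --property=ActiveState,SubState,Result
    register: engine_unit_state
    changed_when: false

  - name: Print raw ovirt-engine unit state (debug only)
    debug:
      var: engine_unit_state.stdout_lines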

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Sandro Bonazzola 2020-12-17 15:19:58 UTC
Rick, are you aware of any change in Ansible's systemd module that touched this part and may require some change in our code?

Comment 2 Sandro Bonazzola 2020-12-17 15:23:09 UTC
jamacku, are you aware of any change in systemd-239-43.el8 that could lead to a change in state reporting?

Comment 3 Yedidyah Bar David 2020-12-17 20:49:08 UTC
I'd like to clarify, more-or-less accurately, the flow:

1. The mentioned ansible script connects to the engine machine and runs 'engine-setup' there.
2. engine-setup configures stuff, and near the end also starts the engine service.
3. This succeeds. You can see that in [1] (the log of engine-setup):

2020-12-17 14:57:22,342+0100 DEBUG otopi.plugins.otopi.services.systemd systemd.state:170 starting service ovirt-engine
2020-12-17 14:57:22,342+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:813 execute: ('/usr/bin/systemctl', 'start', 'ovirt-engine.service'), executable='None', cwd='None', env=None
2020-12-17 14:57:22,509+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:863 execute-result: ('/usr/bin/systemctl', 'start', 'ovirt-engine.service'), rc=0
2020-12-17 14:57:22,509+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:921 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-engine.service') stdout:
2020-12-17 14:57:22,509+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:926 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-engine.service') stderr:

Meaning, 'systemctl start ovirt-engine.service' exited with rc=0 and empty stdout/stderr.
You can see the systemd unit file in [2]. This "service" actually runs a python script, which in turn runs the real (wildfly/java) service.
It's of Type=notify. The code that does the notification is in python, and it runs before starting the java process, but after some other processing. It's in [3][4] (plus some other library code, which I think is less relevant). I have no idea whether the bug is related to the notification; I am just mentioning it. It does work on CentOS 8.3.

4. Once engine-setup finishes, the ansible script runs (see link [3] from comment 0):
  - name: Make sure `ovirt-engine` service is running
    service:
      name: ovirt-engine
      state: started

At this point I am not sure what ansible does (I didn't check its code), but it's quite likely that the notification still hadn't reached systemd, so systemd might tell ansible that the service failed, is down, or is in some other state. In reality, it does start eventually. You can't see this in jenkins, I think because the CI job kills everything rather quickly (this can be patched, if needed), but on my local run it did start [5].
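If this is indeed a startup/notification race, one conceivable mitigation on the role side (only a sketch under that assumption, not a tested change, and not necessarily the right fix) would be to retry the check for a while instead of failing on the first attempt:

  # Sketch only: retry the readiness check; the retries/delay values are arbitrary.
  - name: Make sure `ovirt-engine` service is running
    service:
      name: ovirt-engine
      state: started
    register: engine_service_result
    until: engine_service_result is succeeded
    retries: 12
    delay: 10

Whether this would actually paper over the failure depends on what the service module gets back from systemd on each attempt.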

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14481/artifact/check-patch.he-basic_suite_master.el8.x86_64/test_logs/he-basic-suite-master/post-he_deploy/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-setup/engine-logs-2020-12-17T13%3A58%3A35Z/ovirt-engine/setup/ovirt-engine-setup-20201217145323-za8yqf.log

[2] https://github.com/oVirt/ovirt-engine/blob/master/packaging/services/ovirt-engine/ovirt-engine.systemd.in

[3] https://github.com/oVirt/ovirt-engine/blob/master/packaging/services/ovirt-engine/ovirt-engine.py.in

[4] https://github.com/oVirt/ovirt-engine/blob/master/packaging/pythonlib/ovirt_engine/service.py

[5] Full logs attached. Relevant parts:

engine-setup log:

2020-12-17 14:53:11,283+0100 DEBUG otopi.plugins.otopi.services.systemd systemd.state:170 starting service ovirt-engine
2020-12-17 14:53:11,284+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:813 execute: ('/usr/bin/systemctl', 'start', 'ovirt-engine.service'), executable='None', cwd='None', env=None
2020-12-17 14:53:11,583+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:863 execute-result: ('/usr/bin/systemctl', 'start', 'ovirt-engine.service'), rc=0
2020-12-17 14:53:11,584+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:921 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-engine.service') stdout:
2020-12-17 14:53:11,584+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:926 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-engine.service') stderr:

/var/log/messages:

Dec 17 13:53:11 lago-he-basic-suite-master-engine systemd[1]: Starting oVirt Engine...
Dec 17 13:53:11 lago-he-basic-suite-master-engine systemd[1]: Started oVirt Engine.
...
Dec 17 13:53:14 lago-he-basic-suite-master-engine journal[26755]: 2020-12-17 14:53:14,427+0100 ovirt-engine: INFO _detectJBossVersion:186 Detecting JBoss version. Running: /usr/lib/jvm/jre-11-openjdk-11.0.9.11-3.el8.x86_64/bin/java ['ovirt-engine-version', '--add-modules', 'java.se', '-server', '-XX:+TieredCompilation', '-Xms1024M', '-Xmx1024M', '-Xss1M', '-Djava.awt.headless=true', '-Dsun.rmi.dgc.client.gcInterval=3600000', '-Dsun.rmi.dgc.server.gcInterval=3600000', '-Djsse.enableSNIExtension=false', '-Dresteasy.preferJacksonOverJsonB=true', '-Djackson.deserialization.whitelist.packages=org,com,java,javax', '-XX:+HeapDumpOnOutOfMemoryError', '-XX:HeapDumpPath=/var/log/ovirt-engine/dump', '-Djava.util.logging.manager=org.jboss.logmanager', '-Dlogging.configuration=file:///var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine-logging.properties', '-Dorg.jboss.resolver.warning=true', '-Djboss.modules.system.pkgs=org.jboss.byteman', '-Djboss.server.default.config=ovirt-engine', '-Djboss.home.dir=/usr/share/ovirt-engine-wildfly', '-Djboss.server.base.dir=/usr/share/ovirt-engine', '-Djboss.server.data.dir=/var/lib/ovirt-engine', '-Djboss.server.log.dir=/var/log/ovirt-engine', '-Djboss.server.config.dir=/var/lib/ovirt-engine/jboss_runtime/config', '-Djboss.server.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp', '-Djboss.controller.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp', '-jar', '/usr/share/ovirt-engine-wildfly/jboss-modules.jar', '-mp', '/usr/share/ovirt-engine-wildfly-overlay/modules:/usr/share/ovirt-engine/modules/common:/usr/share/ovirt-engine-extension-aaa-jdbc/modules:/usr/share/ovirt-engine-extension-aaa-ldap/modules:/usr/share/ovirt-engine-extension-aaa-misc/modules:/usr/share/ovirt-engine-extension-logger-log4j/modules:/usr/share/ovirt-engine-wildfly/modules', '-jaxpmodule', 'javax.xml.jaxp-provider', 'org.jboss.as.standalone', '-v']
Dec 17 13:53:21 lago-he-basic-suite-master-engine journal[26755]: 2020-12-17 14:53:21,740+0100 ovirt-engine: INFO _detectJBossVersion:206 Return code: 0,  | stdout: '['WildFly Full 19.1.0.Final (WildFly Core 11.1.1.Final)'],  | stderr: '['WARNING: -jaxpmodule is deprecated and may be removed in a future release']'

The line above is the last one logged by the python code before it starts java. We have some debugging code that makes it emit a bit more, which can be activated if needed.

The java code writes two (relevant) log files, the first of which is server.log; it starts with:

2020-12-17 14:53:46,557+01 INFO  [org.jboss.as.server.deployment.scanner] (MSC service thread 1-4) WFLYDS0013: Started FileSystemDeploymentService for directory /var/lib/ovirt-engine/jboss_runtime/deployments

As you can see (notably in the syslog lines from the python code, which include both the syslog timestamp and the python-logged timestamp), things above are a bit messy around timezones. I am still not sure why, but I do not think this is related to the bug; it is probably an issue in how we run the various VMs in ovirt-system-tests etc.

Comment 4 Yedidyah Bar David 2020-12-17 20:51:38 UTC
Created attachment 1740064 [details]
ovirt-system-tests logs

Comment 5 Yedidyah Bar David 2020-12-20 12:41:39 UTC
(In reply to Sandro Bonazzola from comment #2)
> jamacku are you aware of any change in systemd-239-43.el8 that can lead to
> some change in state reporting?

Running this manually on the engine shows:

# systemctl show ovirt-engine | grep -i active
Failed to parse bus message: Invalid argument

So I suspect the current bug is a duplicate of one of the other bugs mentioning this error, e.g. bug 1908275.

Comment 6 Yedidyah Bar David 2020-12-20 13:04:33 UTC
I confirm that the workaround in ansible [1], mentioned in bug 1908275 comment 1, fixes the problem locally.

[1] https://github.com/ansible/ansible/pull/72337

Comment 7 Yedidyah Bar David 2020-12-21 12:41:09 UTC
I confirm that the current bug is solved by upgrading to ansible-2.9.16-2. Closing. If we have a CentOS Stream ansible bug, we can change the close reason to a duplicate of it.
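For completeness, a hypothetical pre-flight check (not something the role ships; only a sketch) could fail early when the controller still runs an older ansible:

  # Hypothetical guard; it cannot distinguish the -1 from the -2 build of
  # 2.9.16, so it is only a rough check.
  - name: Make sure ansible can parse systemd state on CentOS Stream
    assert:
      that:
        - ansible_version.full is version('2.9.16', '>=')
      fail_msg: "ansible {{ ansible_version.full }} is too old; 2.9.16-2 or later is needed"

In practice the actual fix is simply upgrading to the fixed ansible build, as noted above.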

Comment 8 Red Hat Bugzilla 2023-09-15 00:53:14 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

