Created attachment 1733308 [details]
/var/log from engine+hosts, lago/OST logs

Description of problem:

I ran OST manually with suite he-basic-master, and it failed at a rather late stage (010_local_maintenance_cli; my changes were for 008_restart_he_vm, which passed). It failed because the hosted-engine VM was already on host-1, which wasn't expected. I spent quite some time trying to understand this, and I *think* the root cause lies well before that.

engine.log:

2020-11-25 09:48:50,864+01 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (EE-ManagedThreadFactory-engine-Thread-273) [2e1072c7] Host setup networks finished. Lock released. Monitoring can run now for host 'lago-he-basic-suite-master-host-0.lago.local' from data-center 'Default'
2020-11-25 09:48:50,866+01 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (EE-ManagedThreadFactory-engine-Thread-273) [2e1072c7] Lock freed to object 'EngineLock:{exclusiveLocks='[HOST_NETWORKfba14f41-7ece-46ca-af65-7476e9fd2018=HOST_NETWORK]', sharedLocks=''}'
2020-11-25 09:48:50,923+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-273) [2e1072c7] EVENT_ID: PERSIST_NETWORK_ON_HOST_FINISHED(1,146), (1/2): Successfully applied changes for network(s) Labeled_Network on host lago-he-basic-suite-master-host-0.lago.local. (User: admin@internal-authz)
2020-11-25 09:48:53,883+01 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [2e1072c7] Command 'PropagateNetworksToClusterHosts' id: '6e61f53b-c2b7-4a35-a840-67076e6d66cb' child commands '[fe7f022e-c13b-4924-a6ea-cc56407e6969, ca990d6e-199b-473d-928d-0a7e6d8cf073]' executions were completed, status 'SUCCEEDED'
2020-11-25 09:48:54,891+01 INFO  [org.ovirt.engine.core.bll.network.cluster.PropagateNetworksToClusterHostsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-85) [2e1072c7] Ending command 'org.ovirt.engine.core.bll.network.cluster.PropagateNetworksToClusterHostsCommand' successfully.
2020-11-25 09:49:24,749+01 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] Failure on checkInterfaces on update runtime info for host 'lago-he-basic-suite-master-host-0.lago.local': PreparedStatementCallback; uncategorized SQLException for SQL [select * from getallnetworkbyclusterid(?, ?, ?)]; SQL state [null]; error code [0]; IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842; nested exception is java.sql.SQLException: IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842
2020-11-25 09:49:24,752+01 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] ResourceManager::refreshVdsRunTimeInfo: host=lago-he-basic-suite-master-host-0.lago.local(fba14f41-7ece-46ca-af65-7476e9fd2018): PreparedStatementCallback; uncategorized SQLException for SQL [select * from getvdsdynamicbyvdsid(?)]; SQL state [null]; error code [0]; IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842; nested exception is java.sql.SQLException: IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842
2020-11-25 09:49:24,761+01 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] Timer update runtime info failed. Exception: SQLException: IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842
For the record: a later OST run failed similarly, but without this error in the log, so this error wasn't the cause of the failure. It's probably still worth fixing, like any other ERROR, but it's probably not urgent.
Artur, it looks like the SQL error is just caused by the interruption of the host monitoring thread, but the real question is why the monitoring thread was interrupted ...
(In reply to Martin Perina from comment #2)
> Artur, it looks like the SQL error is just caused by the interruption of
> the host monitoring thread, but the real question is why the monitoring
> thread was interrupted ...

The attached logs are the result of one of the patches around the port to pytest, I think [1][2]. The relevant change there, compared to then-current master, is that (as part of the port to pytest) I changed arbitrary delays of 5 minutes to a loop that checks that the engine VM is not migrating. This likely lowered the delay and perhaps makes us reboot the engine VM too quickly.

The case I changed, 008_restart_he_vm, does this:

- test_set_global_maintenance

host-0/ovirt-hosted-engine-ha/agent.log:

MainThread::INFO::2020-11-25 09:49:05,414::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected

- test_restart_he_vm

engine/messages:

Nov 25 09:49:01 lago-he-basic-suite-master-engine systemd[5171]: Reached target Shutdown.
...
Nov 25 09:49:20 lago-he-basic-suite-master-engine systemd[1]: Stopping LVM event activation on device 252:2...

The line above is the last one there before the reboot. As you can see from the timestamps, the last few lines of engine.log (in comment 0) are from a few seconds later, also part of shutting down the engine.

So, bottom line:

1. The current bug is probably not a bug - it's simply what happens when you stop the engine while it's busy doing something. Feel free to close unless you think something should still be fixed.

2. It's not clear to me what else I should check in OST for "stability" before deciding it's safe to reboot the engine VM. I'd really rather not give up on this and restore the 5-minute delay (both because of the time wasted and because it's likely not enough under certain conditions). Ideas are welcome.

[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/112320/4
[2] https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=commit;h=176723c7d8519080405397914c54293c4c3d388a
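For the record, the shape of the change described above - replacing a fixed five-minute sleep with a bounded polling loop - can be sketched as follows. This is a minimal illustration, not the actual OST code: the function names and the stubbed migration check are hypothetical (the real test would query the engine, e.g. via the oVirt Python SDK, for the HostedEngine VM's status).

```python
import time


def wait_for(condition, timeout=300, poll_interval=5):
    """Poll `condition` until it returns True, or raise after `timeout` seconds.

    A generic replacement for a fixed `time.sleep(300)`: it returns as soon
    as the condition holds, but still bounds the total wait time.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)


def engine_vm_is_not_migrating():
    # Hypothetical check, stubbed out here; in OST this would look up the
    # HostedEngine VM's status through the engine API.
    return True


# Instead of an unconditional 5-minute sleep:
wait_for(engine_vm_is_not_migrating, timeout=300, poll_interval=5)
```

The trade-off, as noted above, is that the loop returns the moment the condition holds, which may be too early if "not migrating" alone doesn't guarantee the engine VM is safe to reboot.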
In the engine server.log I see the following:

================================================================
2020-11-25 09:49:21,270 Server shutdown has been requested via an OS signal
2020-11-25 09:49:24,745 connection pool (java:/ENGINEDataSourceNoJTA) unbound
... stopping other services ...
2020-11-25 09:49:27 WildFly stopped in 5486ms
================================================================

At the same time in engine.log:

================================================================
2020-11-25 09:49:24,749+01 ERROR (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) uncategorized SQLException error code [0]; IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842
================================================================

"SQLException error code [0]; IJ031013: Interrupted attempting lock" has been identified in [1] as related to canceling a job while accessing the DB. So it seems reasonable to conclude that, due to the shutdown, all accesses to the DB are being canceled, with the above exception appearing in the logs. This is normal behaviour.

Around the same time, broker.log on host-0:ovirt-hosted-engine-ha also logs that the engine is not up:

================================================================
Thread-5::WARNING::2020-11-25 09:49:30,025::engine_health::238::engine_health.EngineHealth::(_result_from_stats) bad health status: Hosted Engine is not up!
================================================================

There is nothing in engine:/var/log/messages or in lago.log to suggest why a server shutdown was signaled by the OS.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1665784
(In reply to eraviv from comment #4)
> There is nothing in engine:/var/log/messages or in lago.log to suggest why a
> server shutdown was signaled by the OS.

Indeed, but that's just because there isn't enough logging in the pytest tests. I am working on this. Sorry for the noise.