Created attachment 1733308 [details]
/var/log from engine+hosts, lago/OST logs

Description of problem:

I ran OST manually with suite he-basic-master, and it failed at a rather late stage (010_local_maintenance_cli; my changes were for 008_restart_he_vm, which passed). It failed because the hosted-engine VM was already on host-1, which wasn't expected. I spent quite some time trying to understand this, and I *think* the root cause lies well before that.

engine.log:

2020-11-25 09:48:50,864+01 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (EE-ManagedThreadFactory-engine-Thread-273) [2e1072c7] Host setup networks finished. Lock released. Monitoring can run now for host 'lago-he-basic-suite-master-host-0.lago.local' from data-center 'Default'
2020-11-25 09:48:50,866+01 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (EE-ManagedThreadFactory-engine-Thread-273) [2e1072c7] Lock freed to object 'EngineLock:{exclusiveLocks='[HOST_NETWORKfba14f41-7ece-46ca-af65-7476e9fd2018=HOST_NETWORK]', sharedLocks=''}'
2020-11-25 09:48:50,923+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-273) [2e1072c7] EVENT_ID: PERSIST_NETWORK_ON_HOST_FINISHED(1,146), (1/2): Successfully applied changes for network(s) Labeled_Network on host lago-he-basic-suite-master-host-0.lago.local. (User: admin@internal-authz)
2020-11-25 09:48:53,883+01 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [2e1072c7] Command 'PropagateNetworksToClusterHosts' id: '6e61f53b-c2b7-4a35-a840-67076e6d66cb' child commands '[fe7f022e-c13b-4924-a6ea-cc56407e6969, ca990d6e-199b-473d-928d-0a7e6d8cf073]' executions were completed, status 'SUCCEEDED'
2020-11-25 09:48:54,891+01 INFO  [org.ovirt.engine.core.bll.network.cluster.PropagateNetworksToClusterHostsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-85) [2e1072c7] Ending command 'org.ovirt.engine.core.bll.network.cluster.PropagateNetworksToClusterHostsCommand' successfully.
2020-11-25 09:49:24,749+01 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] Failure on checkInterfaces on update runtime info for host 'lago-he-basic-suite-master-host-0.lago.local': PreparedStatementCallback; uncategorized SQLException for SQL [select * from getallnetworkbyclusterid(?, ?, ?)]; SQL state [null]; error code [0]; IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842; nested exception is java.sql.SQLException: IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842
2020-11-25 09:49:24,752+01 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] ResourceManager::refreshVdsRunTimeInfo: host=lago-he-basic-suite-master-host-0.lago.local(fba14f41-7ece-46ca-af65-7476e9fd2018): PreparedStatementCallback; uncategorized SQLException for SQL [select * from getvdsdynamicbyvdsid(?)]; SQL state [null]; error code [0]; IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842; nested exception is java.sql.SQLException: IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842
2020-11-25 09:49:24,761+01 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [] Timer update runtime info failed. Exception: SQLException: IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842
For the record: a later OST run failed similarly, but without this error in the log, so this error wasn't the cause of the failure. It's probably still worth fixing, like any other ERROR, but it's probably not urgent.
Artur, it looks like the SQL error is just caused by the interruption of the host monitoring thread, but the real question is why the monitoring thread was interrupted ...
(In reply to Martin Perina from comment #2)
> Artur, it looks like the SQL error is just caused by the interruption of
> the host monitoring thread, but the real question is why the monitoring
> thread was interrupted ...

The attached logs are the result of one of the patches around the port to pytest, I think [1][2]. The relevant change there, compared to then-current master, is that (as part of the port to pytest) I changed arbitrary delays of 5 minutes to a loop that checks that the engine VM is not migrating. This likely lowered the delay and perhaps makes us reboot the engine VM too quickly.

The case I changed, 008_restart_he_vm, does this:

- test_set_global_maintenance

host-0/ovirt-hosted-engine-ha/agent.log:

MainThread::INFO::2020-11-25 09:49:05,414::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected

- test_restart_he_vm

engine/messages:

Nov 25 09:49:01 lago-he-basic-suite-master-engine systemd[5171]: Reached target Shutdown.
...
Nov 25 09:49:20 lago-he-basic-suite-master-engine systemd[1]: Stopping LVM event activation on device 252:2...

The line above is the last one there before the reboot. As you can see from the timestamps, the last few lines of engine.log (in comment 0) are from a few seconds later, also part of shutting down the engine.

So, bottom line:

1. The current bug is probably not a bug - it's simply what happens when you stop the engine while it's busy doing something. Feel free to close unless you think something should still be fixed.

2. It's not clear to me what else I should check in OST for "stability" before deciding it's safe to reboot the engine VM. I'd really rather not give up on this and restore the 5-minute delay (both because of the time wasted and because it's likely not enough under certain conditions). Ideas are welcome.

[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/112320/4
[2] https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=commit;h=176723c7d8519080405397914c54293c4c3d388a
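For the record, the shape of the change described above - replacing a fixed five-minute sleep with a bounded polling loop - can be sketched as follows. This is a minimal illustration, not the actual OST code: the function names and the stubbed migration check are hypothetical (the real test would query the engine, e.g. via the oVirt Python SDK, for the HostedEngine VM's status).

```python
import time


def wait_for(condition, timeout=300, poll_interval=5):
    """Poll `condition` until it returns True, or raise after `timeout` seconds.

    A generic replacement for a fixed `time.sleep(300)`: it returns as soon
    as the condition holds, but still bounds the total wait time.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)


def engine_vm_is_not_migrating():
    # Hypothetical check, stubbed out here; in OST this would look up the
    # HostedEngine VM's status through the engine API.
    return True


# Instead of an unconditional 5-minute sleep:
wait_for(engine_vm_is_not_migrating, timeout=300, poll_interval=5)
```

The trade-off, as noted above, is that the loop returns the moment the condition holds, which may be too early if "not migrating" alone doesn't guarantee the engine VM is safe to reboot.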
In the engine server.log I see the following:

================================================================
2020-11-25 09:49:21,270 Server shutdown has been requested via an OS signal
2020-11-25 09:49:24,745 connection pool (java:/ENGINEDataSourceNoJTA) unbound
... stopping other services ...
2020-11-25 09:49:27 WildFly stopped in 5486ms
================================================================

At the same time in engine.log:

================================================================
2020-11-25 09:49:24,749+01 ERROR (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) uncategorized SQLException error code [0]; IJ031013: Interrupted attempting lock: org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@7d88842
================================================================

"SQLException error code [0]; IJ031013: Interrupted attempting lock" has been identified in [1] as related to canceling a job while accessing the DB. So it seems reasonable to conclude that, due to the shutdown, all accesses to the DB are being canceled, with the above exception appearing in the logs. This is normal behaviour.

Around the same time, broker.log on host-0:ovirt-hosted-engine-ha also logs that the engine is not up:

================================================================
Thread-5::WARNING::2020-11-25 09:49:30,025::engine_health::238::engine_health.EngineHealth::(_result_from_stats) bad health status: Hosted Engine is not up!
================================================================

There is nothing in engine:/var/log/messages or in lago.log to suggest why a server shutdown was signaled by the OS.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1665784
(In reply to eraviv from comment #4)
> There is nothing in engine:/var/log/messages or in lago.log to suggest why a
> server shutdown was signaled by the OS.

Indeed, but that's just because there isn't enough logging in the pytest tests. I am working on this. Sorry for the noise.