Bug 1527362 - SQL Deadlock ERROR on DisplayAllAuditLogEventsCommand - under scaled topology
Summary: SQL Deadlock ERROR on DisplayAllAuditLogEventsCommand - under scaled topology
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: future
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.2
Target Release: ---
Assignee: Eli Mesika
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks: 1528812
 
Reported: 2017-12-19 09:45 UTC by Ilan Zuckerman
Modified: 2018-03-29 11:02 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1528812 (view as bug list)
Environment:
Last Closed: 2018-03-29 11:02:40 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+
lsvaty: testing_ack+


Attachments
Engine, Postgres logs showing exception related to deadlock in audit table on clearing (309.37 KB, application/zip)
2018-02-28 15:00 UTC, mlehrer


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 85708 0 master MERGED db: resolve deadlock in audit log 2018-01-05 13:35:41 UTC
oVirt gerrit 88238 0 None None None 2018-03-01 23:40:01 UTC
oVirt gerrit 88239 0 None None None 2018-03-02 09:33:57 UTC

Description Ilan Zuckerman 2017-12-19 09:45:23 UTC
Description of problem:

When the system is idle, we see the following error in engine.log (the full error is in the attached log):

2017-12-18 12:14:03,408+02 ERROR [org.ovirt.engine.core.bll.DisplayAllAuditLogEventsCommand] (default task-3) [9b551e4d-82e2-4b80-a7c3-eb8b42502abc] Exception: org.springframework.dao.DeadlockLoserDataAccessException: CallableStatementCallback; SQL [{call displayallauditlogevents(?)}]; ERROR: deadlock detected
PL/pgSQL function displayallauditlogevents(integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected


Version-Release number of selected component (if applicable):

[root@vega10 ovirt-engine]# rpm -qa | grep ovirt-engine
ovirt-engine-websocket-proxy-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-vmconsole-proxy-helper-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-webadmin-portal-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-extensions-api-impl-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-metrics-1.1.2-0.0.master.20171204093359.el7.centos.noarch
ovirt-engine-extension-aaa-jdbc-1.1.7-0.0.master.gitbbb2931.el7.centos.noarch
ovirt-engine-tools-backup-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-lib-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-cli-3.6.9.3-0.1.20170116.gitad0561a.el7.centos.noarch
ovirt-engine-wildfly-overlay-11.0.1-1.el7.centos.noarch
ovirt-engine-dwh-4.2.2-0.0.master.20171129124832.el7.centos.noarch
ovirt-engine-restapi-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-tools-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.2-0.1.20161204.gite99bbd1.el7.centos.noarch
ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64
ovirt-engine-setup-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-dbscripts-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-dashboard-1.2.0-1.20171206git2170060.el7.centos.noarch
ovirt-engine-api-explorer-0.0.3-0.alpha.1.20171205git87608c5.el7.centos.noarch
ovirt-engine-dwh-setup-4.2.2-0.0.master.20171129124832.el7.centos.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-backend-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.2-2.el7.centos.x86_64
ovirt-engine-setup-base-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.1-0.0.master.20171206161426.git88e9120.el7.centos.noarch

# cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.4 (Maipo)"


How reproducible:
This was noticed only once, during monitoring of our system which is used for scale tests. The system was idle at that time.


Steps to Reproduce:

We made the following setup and monitored it during the night:

Physical hosts: 6
Nested hosts: 400
(these are basically VMs which act as hosts, with the vdsm package installed)
RHEVM engine: 1 (vega10)
VMs: 1200
400 of them reserved for nested hosts
Storage: ISCSI


Actual results:
See above ERROR from engine.log

Expected results:
No errors

Additional info:
Attached engine.log

Comment 2 Yaniv Kaul 2017-12-19 21:53:28 UTC
- severity?
- title misses the fact it happens in scale?
- title misses where the deadlock occurred (added).
- was that before or after the DWH change?
- have we seen it in lower scale?

Comment 3 Eli Mesika 2017-12-20 10:34:07 UTC
Please add the relevant PG server log; we need it to get the details about the deadlock from the database log.

Comment 4 eberman 2017-12-21 07:26:17 UTC
(In reply to Yaniv Kaul from comment #2)
> - severity?
> - title misses the fact it happens in scale?
> - title misses where the deadlock occurred (added).
> - was that before or after the DWH change?
> - have we seen it in lower scale?


> - severity? --> Still investigating the impact
> - title misses the fact it happens in scale? --> changed
> - title misses where the deadlock occurred (added).
> - was that before or after the DWH change? --> before
> - have we seen it in lower scale? --> no

Comment 6 Yaniv Kaul 2017-12-24 09:24:26 UTC
Severity?

Comment 8 Moti Asayag 2017-12-24 11:15:26 UTC
Eli, note that the engine.log also contains the following error:

 2017-12-18 11:56:08,918+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.MessageBundler] (default task-35) [09f17fff-ccfe-4105-99f1-02507a268e76] Key 'USER_DISPLAY_ALL_AUDIT_LOG' is not translated in 'bundles/AuditLogMessages'


and by looking at backend/manager/modules/dal/src/main/resources/bundles/AuditLogMessages.properties, I can't find any message to be printed for the 'USER_DISPLAY_ALL_AUDIT_LOG' event, nor for 'USER_DISPLAY_ALL_AUDIT_LOG_FAILED'.

Comment 9 Petr Matyáš 2018-01-09 11:18:13 UTC
Verified on ovirt-engine-4.2.2-0.0.master.20180107151411.git59e7855.el7.centos.noarch

I'm not seeing any such error in the log in my env.
If you are able to reproduce this, please reopen this bug.

Comment 10 Sandro Bonazzola 2018-02-12 11:55:01 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 11 mlehrer 2018-02-28 15:00:38 UTC
Created attachment 1401912 [details]
Engine, Postgres logs showing exception related to deadlock in audit table on clearing

Comment 12 mlehrer 2018-02-28 15:01:12 UTC
#Reproducing:

Able to reproduce this by pressing 'Clear All' on alerts, then 'Clear All' on events in the drop-down tab in the webadmin UI.
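
At the database level these two UI actions correspond to the stored-procedure calls seen in the postgres log below; a minimal sketch of triggering the same race from two concurrent psql sessions (the argument values, and the type of the second setallauditlogalerts argument, are assumptions for illustration):

    -- Session A: mark as deleted every event whose severity differs from the given value
    SELECT * FROM clearallauditlogevents(10) AS result;

    -- Session B, started at the same time (argument values/types are illustrative)
    SELECT * FROM setallauditlogalerts(10, true) AS result;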

#Environment
Running 4.2.2.1-0.1.el7 on RHEL 7.5 (3.10.0-855.el7.x86_64)
401 Hosts, and 1542 VMs

The Postgres engine DB is on the same tier as the engine application.
The DWH Postgres is on a remote tier.
Both pg_data directories are on mounted LUNs.

#Attachment shows:
The UI exception, the engine.log exception, and the corresponding postgres.log from the time of the exception.


<snip>

2018-02-28 11:44:57.881 UTC DETAIL:  Process 27691 waits for ShareLock on transaction 69491033; blocked by process 27707.
	Process 27707 waits for ExclusiveLock on tuple (22721,5) of relation 16465 of database 16385; blocked by process 22313.
	Process 22313 waits for ShareLock on transaction 69491702; blocked by process 27691.
	Process 27691: select * from clearallauditlogevents($1) as result
	Process 27707: select * from setallauditlogalerts($1, $2) as result
	Process 22313: select * from clearallauditlogevents($1) as result
2018-02-28 11:44:57.881 UTC HINT:  See server log for query details.
2018-02-28 11:44:57.881 UTC CONTEXT:  while updating tuple (22368,1) in relation "audit_log"
	SQL statement "UPDATE audit_log
	    SET deleted = true
	    FROM ( SELECT * FROM audit_log
	           WHERE severity != v_severity
	               AND NOT deleted
	           FOR UPDATE) AS s"
	PL/pgSQL function clearallauditlogevents(integer) line 4 at SQL statement
2018-02-28 11:44:57.881 UTC STATEMENT:  select * from clearallauditlogevents($1) as result
</snip>
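
Each of the three processes above holds row locks on audit_log while waiting for a lock held by another. The minimal form of the same deadlock needs only two transactions touching overlapping rows in opposite order; a simplified sketch (the audit_log_id column is assumed to be the primary key, and the literal ids are illustrative):

    -- Transaction 1
    BEGIN;
    UPDATE audit_log SET deleted = true WHERE audit_log_id = 1;  -- locks row 1

    -- Transaction 2 (concurrently)
    BEGIN;
    UPDATE audit_log SET deleted = true WHERE audit_log_id = 2;  -- locks row 2
    UPDATE audit_log SET deleted = true WHERE audit_log_id = 1;  -- blocks, waiting for transaction 1

    -- Transaction 1 again
    UPDATE audit_log SET deleted = true WHERE audit_log_id = 2;  -- blocks on transaction 2: lock cycle,
                                                                  -- PostgreSQL aborts one transaction with "deadlock detected"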

Re-opening this bug.

Comment 13 Martin Perina 2018-03-01 12:30:55 UTC
Eli, might this issue be fixed by the patches which were created for BZ1535864?

Comment 14 Eli Mesika 2018-03-01 23:38:10 UTC
(In reply to Martin Perina from comment #13)
> Eli, might this issue be fixed by the patches which were created for BZ1535864?

Probably, because in the current code the whole table is updated and the SELECT ... FOR UPDATE protects only part of the records.
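
One common way to avoid this class of deadlock (a sketch only, not necessarily what the BZ1535864 patches do) is to restrict the UPDATE to exactly the rows locked by the subquery and to lock them in a deterministic order, so that concurrent callers acquire the row locks in the same sequence:

    UPDATE audit_log
        SET deleted = true
        FROM ( SELECT audit_log_id FROM audit_log    -- audit_log_id: presumed primary key
               WHERE severity != v_severity
                   AND NOT deleted
               ORDER BY audit_log_id                  -- deterministic lock acquisition order
               FOR UPDATE) AS s
        WHERE audit_log.audit_log_id = s.audit_log_id;  -- update only the locked rows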

Comment 15 Martin Perina 2018-03-02 19:25:46 UTC
Moving to ON_QA, as this should be resolved by the same fix as BZ1535864.

Comment 16 Ilan Zuckerman 2018-03-27 13:43:34 UTC
Validated on 4.2.2.4-0.1.el7 on our scale setup.
Watched both the engine log and the pg log while executing the steps described by Mordechai in comment 12.
Also watched those logs while the system was idle; no deadlock error was noticed.

Comment 17 Sandro Bonazzola 2018-03-29 11:02:40 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

