Description of problem: While trying to remove user from engine, NPE appeares: Caused by: java.lang.NullPointerException Version-Release number of selected component (if applicable): ovirt-engine-4.0.2.7-0.1.el7ev.noarch How reproducible: seems to be connected to some kind of permissions, was not able to reproduce manually Steps to Reproduce: 1. 2016-08-20 15:37:28,627 - MainThread - users - DEBUG - DELETE request content is -- url:/ovirt-engine/api/users/99609e2a-14ef-4481-9242-caed7a389fb4 Actual results: NPE Expected results: Should succeed Additional info: 2016-08-20 15:37:28,994 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (default task-11) [4585010c] Validation of action 'RestoreStatelessVm' failed for user admin@internal-authz. Reasons: ACTION_TYPE_FAILED_OBJECT_LOCKED 2016-08-20 15:37:29,076 ERROR [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [4585010c] Command 'org.ovirt.engine.core.bll.RemovePermissionCommand' failed: null 2016-08-20 15:37:29,076 ERROR [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [4585010c] Exception: javax.ejb.EJBTransactionRolledbackException 2016-08-20 15:37:29,158 INFO [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [6f9b8f18] Transaction was aborted in 'org.ovirt.engine.core.bll.RemovePermissionCommand' 2016-08-20 15:37:29,173 INFO [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [1493fbc9] Transaction was aborted in 'org.ovirt.engine.core.bll.RemovePermissionCommand' 2016-08-20 15:37:29,187 INFO [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [2d36ee90] Transaction was aborted in 'org.ovirt.engine.core.bll.RemovePermissionCommand' 2016-08-20 15:37:29,200 ERROR [org.ovirt.engine.core.bll.aaa.RemoveUserCommand] (default task-11) [2d36ee90] Command 'org.ovirt.engine.core.bll.aaa.RemoveUserCommand' failed: CallableStatementCallback; uncategorized SQLException for SQL [{call deleteuser(?)}]; SQL state [null]; error code [0]; IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK 2016-08-20 15:37:29,200 ERROR [org.ovirt.engine.core.bll.aaa.RemoveUserCommand] (default task-11) [2d36ee90] Exception: org.springframework.jdbc.UncategorizedSQLException: CallableStatementCallback; uncategorized SQLException for SQL [{call deleteuser(?)}]; SQL state [null]; error code [0]; IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK 2016-08-20 15:37:29,270 ERROR [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (default task-11) [2d36ee90] Transaction rolled-back for command 'org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand'. 2016-08-20 15:37:29,270 ERROR [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [2d36ee90] Transaction rolled-back for command 'org.ovirt.engine.core.bll.RemovePermissionCommand'. 2016-08-20 15:37:29,270 ERROR [org.ovirt.engine.core.bll.aaa.RemoveUserCommand] (default task-11) [2d36ee90] Transaction rolled-back for command 'org.ovirt.engine.core.bll.aaa.RemoveUserCommand'. 2016-08-20 15:37:29,280 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-11) [2d36ee90] Correlation ID: users_delete_9cc09e1e-8573-4e55, Call Stack: null, Custom Event ID: -1, Message: Failed to remove User auto_user_dc (User: admin@internal-authz). 2016-08-20 15:37:29,282 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-11) [] Operation Failed: [Internal Engine Error]
Created attachment 1193295 [details] Relevant part of engine log
Lukas - can you give Eli access to your environment? (In case he won't be able to reproduce).
The problem here is that we don;t have a environment for it, this happens within automation tests and we are unable to reproduce it manually, after this issue when tests are stopped on automation environment, still unable to reproduce manually. If you will in the worst case scenario need an environment we can start test run to reproduce this issue.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
please attach relevant postgres server log It is under /var/lib/pgsql/data/pg_log I would like to see if this is a result of a DB deadlock, in that case we will have the exact information in the PG server log
nothing interesting there... some logs from previous todays run.. no error during test execution LOG: using stale statistics instead of current ones because stats collector is not responding ERROR: canceling autovacuum task CONTEXT: automatic vacuum of table "engine.public.vds_dynamic"
after reproduction add debug logs of engine and pg_sql
Can we had logs as described in Comment 5?
Created attachment 1198634 [details] Debug logs of postgresql and engine 13:23:58 2016-09-07 13:23:58,563 ERROR ERR: Failed to delete element NOT as expected: 13:23:58 Status: 400 13:23:58 Reason: Bad Request 13:23:58 Detail: [Internal Engine Error]
Well, the flow is as follows : RemoveGroupCommand=>RemovePermissionCommand=>DetachUserFromVmFromPoolCommand=>RestoreStatelessVmCommand IMO this BZ should be moved to the virt team for analysis
Tomas, looks like some issue when removing group with permissions for VM operations, could you please investigate?
The issue seems to be that the VM is locked while trying to restore the snapshots as a reaction to returning the VM to pool. For this reason the compensation tried to kick in and failed on NPE (which is an issue by itself but I believe it is there for long time). @Lukas: could you please provide bigger chunk of the logs to see who have locked the VM?
Created attachment 1203277 [details] engine.log up to NPE
So, it seems that there is a VM in pool which has some disk(s). On this VM the attach user and detach user is racing. TL;DR: it seems that the only issue here is a mistake in the rollback part of the RemovePermissionCommand which does not cause any harm. @Lukas: is there some actual harm which happens after this? Because I don't see anything else but an ugly stacktrace which in case it will be fixed will transform to a warning message, but the issue in your tests would persist. To fix the automated tests, you could after adding a user to a VM from a pool to wait until the VM is not locked and than removing it. In more detail: attach user part: - 15:13:19,357: AttachUserToVmFromPoolAndRunCommand is ran which locks the VM - since this VM is down, the RunVmCommand is called on it - since the VM is stateless, the stateless snapshots are started to be created detach user part: - 15:13:21,460: RemovePermissionCommand is called on this VM which is still locked - this leads to RestoreStatelessVmCommand, but since the VM is locked, it fails while trying to lock it - as a reaction, the RemovePermissionCommand tries to roll back and fails on NPE end: - 15:13:39,200: the lock on the VM is freed
Yes, as user is not removed cant be added on next test runs. However as you suggested it should be fixed with waiting for unlock of the VM. Still the NPE in logs is an issue.
Yes, but basically a cosmetic issue because the user would not be removed even after fixing the NPE. Can you please try to wait for the unlock and check and make sure it helps?
looks like timing issue rather than a regression Pending update after fixing the automated tests, but for now moving out of 4.0.5
ACK, indeed waiting for unlocked VM fixed the problem, because of this it seems not to be regression nor AutomationBlocker anymore
Based on comment 18 it is only an issue with automation where the requests are sent very fast. Also in this case the issue does not cause any harm, only the stack trace is ugly. The correct fix would replace this stack trace by an error message. Pushing out from z-stream.
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.