Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1369407

Summary: NPE while trying to remove user
Product: [oVirt] ovirt-engine Reporter: Lukas Svaty <lsvaty>
Component: BLL.InfraAssignee: Shmuel Melamud <smelamud>
Status: CLOSED CURRENTRELEASE QA Contact: Lukas Svaty <lsvaty>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0.2.7CC: bugs, lsvaty, mgoldboi, michal.skrivanek, mperina, tjelinek
Target Milestone: ovirt-4.3.0Flags: rule-engine: ovirt-4.3+
mgoldboi: planning_ack+
oourfali: devel_ack+
lsvaty: testing_ack+
Target Release: 4.3.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.3.0_alpha Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-13 07:46:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Relevant part of engine log
none
Debug logs of postgresql and engine
none
engine.log up to NPE none

Description Lukas Svaty 2016-08-23 10:30:53 UTC
Description of problem:
While trying to remove user from engine, NPE appeares:

Caused by: java.lang.NullPointerException


Version-Release number of selected component (if applicable):
ovirt-engine-4.0.2.7-0.1.el7ev.noarch

How reproducible:
seems to be connected to some kind of permissions,
was not able to reproduce manually

Steps to Reproduce:
1. 2016-08-20 15:37:28,627 - MainThread - users - DEBUG - DELETE request content is --  url:/ovirt-engine/api/users/99609e2a-14ef-4481-9242-caed7a389fb4

Actual results:
NPE

Expected results:
Should succeed

Additional info:
2016-08-20 15:37:28,994 WARN  [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (default task-11) [4585010c] Validation of action 'RestoreStatelessVm' failed for user admin@internal-authz. Reasons: ACTION_TYPE_FAILED_OBJECT_LOCKED
2016-08-20 15:37:29,076 ERROR [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [4585010c] Command 'org.ovirt.engine.core.bll.RemovePermissionCommand' failed: null
2016-08-20 15:37:29,076 ERROR [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [4585010c] Exception: javax.ejb.EJBTransactionRolledbackException
2016-08-20 15:37:29,158 INFO  [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [6f9b8f18] Transaction was aborted in 'org.ovirt.engine.core.bll.RemovePermissionCommand'
2016-08-20 15:37:29,173 INFO  [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [1493fbc9] Transaction was aborted in 'org.ovirt.engine.core.bll.RemovePermissionCommand'
2016-08-20 15:37:29,187 INFO  [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [2d36ee90] Transaction was aborted in 'org.ovirt.engine.core.bll.RemovePermissionCommand'
2016-08-20 15:37:29,200 ERROR [org.ovirt.engine.core.bll.aaa.RemoveUserCommand] (default task-11) [2d36ee90] Command 'org.ovirt.engine.core.bll.aaa.RemoveUserCommand' failed: CallableStatementCallback; uncategorized SQLException for SQL [{call deleteuser(?)}]; SQL state [null]; error code [0]; IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK
2016-08-20 15:37:29,200 ERROR [org.ovirt.engine.core.bll.aaa.RemoveUserCommand] (default task-11) [2d36ee90] Exception: org.springframework.jdbc.UncategorizedSQLException: CallableStatementCallback; uncategorized SQLException for SQL [{call deleteuser(?)}]; SQL state [null]; error code [0]; IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK
2016-08-20 15:37:29,270 ERROR [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (default task-11) [2d36ee90] Transaction rolled-back for command 'org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand'.
2016-08-20 15:37:29,270 ERROR [org.ovirt.engine.core.bll.RemovePermissionCommand] (default task-11) [2d36ee90] Transaction rolled-back for command 'org.ovirt.engine.core.bll.RemovePermissionCommand'.
2016-08-20 15:37:29,270 ERROR [org.ovirt.engine.core.bll.aaa.RemoveUserCommand] (default task-11) [2d36ee90] Transaction rolled-back for command 'org.ovirt.engine.core.bll.aaa.RemoveUserCommand'.
2016-08-20 15:37:29,280 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-11) [2d36ee90] Correlation ID: users_delete_9cc09e1e-8573-4e55, Call Stack: null, Custom Event ID: -1, Message: Failed to remove User auto_user_dc (User: admin@internal-authz).
2016-08-20 15:37:29,282 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-11) [] Operation Failed: [Internal Engine Error]

Comment 1 Lukas Svaty 2016-08-23 10:32:45 UTC
Created attachment 1193295 [details]
Relevant part of engine log

Comment 2 Oved Ourfali 2016-08-23 10:34:55 UTC
Lukas - can you give Eli access to your environment? (In case he won't be able to reproduce).

Comment 3 Lukas Svaty 2016-08-23 11:00:53 UTC
The problem here is that we don;t have a environment for it, this happens within automation tests and we are unable to reproduce it manually, after this issue when tests are stopped on automation environment, still unable to reproduce manually. If you will in the worst case scenario need an environment we can start test run to reproduce this issue.

Comment 4 Red Hat Bugzilla Rules Engine 2016-08-23 11:01:20 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 5 Eli Mesika 2016-08-23 14:44:59 UTC
please attach relevant postgres server log 
It is under /var/lib/pgsql/data/pg_log

I would like to see if this is a result of a DB deadlock, in that case we will have the exact information in the PG server log

Comment 6 Lukas Svaty 2016-08-23 15:21:59 UTC
nothing interesting there... some logs from previous todays run.. no error during test execution

LOG:  using stale statistics instead of current ones because stats collector is not responding
ERROR:  canceling autovacuum task
CONTEXT:  automatic vacuum of table "engine.public.vds_dynamic"

Comment 7 Lukas Svaty 2016-08-24 08:17:07 UTC
after reproduction add debug logs of engine and pg_sql

Comment 8 Martin Perina 2016-08-30 08:08:17 UTC
Can we had logs as described in Comment 5?

Comment 9 Lukas Svaty 2016-09-07 10:39:43 UTC
Created attachment 1198634 [details]
Debug logs of postgresql and engine

13:23:58 2016-09-07 13:23:58,563 ERROR  ERR: Failed to delete element NOT as expected:
13:23:58 	Status: 400
13:23:58 	Reason: Bad Request
13:23:58 	Detail: [Internal Engine Error]

Comment 10 Eli Mesika 2016-09-15 13:38:13 UTC
Well, the flow is as follows :

RemoveGroupCommand=>RemovePermissionCommand=>DetachUserFromVmFromPoolCommand=>RestoreStatelessVmCommand


IMO this BZ should be moved to the virt team for analysis

Comment 11 Martin Perina 2016-09-15 13:57:48 UTC
Tomas, looks like some issue when removing group with permissions for VM operations, could you please investigate?

Comment 12 Tomas Jelinek 2016-09-16 07:33:36 UTC
The issue seems to be that the VM is locked while trying to restore the snapshots as a reaction to returning the VM to pool. For this reason the compensation tried to kick in and failed on NPE (which is an issue by itself but I believe it is there for long time).

@Lukas: could you please provide bigger chunk of the logs to see who have locked the VM?

Comment 13 Lukas Svaty 2016-09-21 12:29:13 UTC
Created attachment 1203277 [details]
engine.log up to NPE

Comment 14 Tomas Jelinek 2016-09-21 13:39:58 UTC
So, it seems that there is a VM in pool which has some disk(s). On this VM the attach user and detach user is racing. 

TL;DR: it seems that the only issue here is a mistake in the rollback part of the RemovePermissionCommand which does not cause any harm.

@Lukas: is there some actual harm which happens after this? Because I don't see anything else but an ugly stacktrace which in case it will be fixed will transform to a warning message, but the issue in your tests would persist.

To fix the automated tests, you could after adding a user to a VM from a pool to wait until the VM is not locked and than removing it.

In more detail:

attach user part:
- 15:13:19,357: AttachUserToVmFromPoolAndRunCommand is ran which locks the VM
- since this VM is down, the RunVmCommand is called on it
- since the VM is stateless, the stateless snapshots are started to be created

detach user part:
- 15:13:21,460: RemovePermissionCommand is called on this VM which is still locked
- this leads to RestoreStatelessVmCommand, but since the VM is locked, it fails while trying to lock it
- as a reaction, the RemovePermissionCommand tries to roll back and fails on NPE

end:
- 15:13:39,200: the lock on the VM is freed

Comment 15 Lukas Svaty 2016-09-21 14:14:35 UTC
Yes, as user is not removed cant be added on next test runs. However as you suggested it should be fixed with waiting for unlock of the VM.

Still the NPE in logs is an issue.

Comment 16 Tomas Jelinek 2016-09-21 14:27:15 UTC
Yes, but basically a cosmetic issue because the user would not be removed even after fixing the NPE.

Can you please try to wait for the unlock and check and make sure it helps?

Comment 17 Michal Skrivanek 2016-09-23 10:45:31 UTC
looks like timing issue rather than a regression
Pending update after fixing the automated tests, but for now moving out of 4.0.5

Comment 18 Lukas Svaty 2016-09-26 09:03:32 UTC
ACK, indeed waiting for unlocked VM fixed the problem, because of this it seems not to be regression nor AutomationBlocker anymore

Comment 19 Tomas Jelinek 2016-11-01 08:53:41 UTC
Based on comment 18 it is only an issue with automation where the requests are sent very fast. Also in this case the issue does not cause any harm, only the stack trace is ugly. The correct fix would replace this stack trace by an error message. Pushing out from z-stream.

Comment 20 Sandro Bonazzola 2019-02-13 07:46:43 UTC
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.