Bug 1102729 - [engine-backend] Concurrent host-related operations cause SQL exceptions
Summary: [engine-backend] Concurrent host-related operations cause SQL exceptions
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.5
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Eli Mesika
QA Contact: Pavel Stehlik
URL:
Whiteboard: infra
Depends On:
Blocks:
Reported: 2014-05-29 12:52 UTC by Elad
Modified: 2016-02-10 19:34 UTC (History)

Fixed In Version: ovirt-3.5.0_rc3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-17 12:22:25 UTC
oVirt Team: Infra
Embargoed:


Attachments
engine and vdsm logs (829.37 KB, application/x-gzip)
2014-05-29 12:52 UTC, Elad


Links
System        ID     Private  Priority          Status  Summary                                         Last Updated
oVirt gerrit  32958  0        master            MERGED  core: Change cluster command should lock host   Never
oVirt gerrit  32961  0        ovirt-engine-3.5  MERGED  core: Change cluster command should lock host   Never

Description Elad 2014-05-29 12:52:18 UTC
Created attachment 900345
engine and vdsm logs

Description of problem:
I tried to put the vdsm host into maintenance and it failed with an internal engine error.
This happened after a specific scenario; I am not sure whether it is related to those specific steps.

Version-Release number of selected component (if applicable):
ovirt-alpha-1
ovirt-engine-3.5.0-0.0.master.20140519181229.gitc6324d4.el6.noarch
vdsm-4.14.1-340.gitedb02ba.el6.x86_64

How reproducible:
Unknown, happened to me after the scenario described in "Steps to Reproduce".

Steps to Reproduce:
On a shared DC with an active host connected to the storage server by iSCSI and FC. Make sure the host is exposed to more than one LUN over FC:
1. Have 2 storage domains: FC and iSCSI
2. Block connectivity to the iSCSI domain and wait for it to become inactive and for the FC domain to take master
3. When the FC domain becomes active and master, remove one of the FC-connected LUNs from LUN masking on the storage server. Click on 'edit' on the FC domain
4. Engine sends getDeviceList to vdsm and fails to get a response after 3 minutes:

2014-05-29 14:42:04,333 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp--127.0.0.1-8702-1) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022) : org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)

5. The FC domain becomes inactive. The engine tries to perform a reconstruct and fails because the other domain (iSCSI) is also inactive:

2014-05-29 14:51:18,204 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-6-thread-3) [237029dd] Command ReconstructMasterVDSCommand(HostName = green-vdsa, HostId = f397fb84-ac4e-4672-b11e-f0e4b24b3d65, vdsSpmId = 1, storagePoolId = a6aab135-2d4c-46b4-9062-7d0c6698fac0, storagePoolName = DC1, masterDomainId = d9520218-a361-4edf-b452-daf316b571d6, masterVersion = 8, domainsList = [{ domainId: d9520218-a361-4edf-b452-daf316b571d6, status: Unknown };{ domainId: aa3c9195-594c-4090-9050-74c034b9535d, status: Active };]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

6. Tried to force-remove the DC and failed with a CDA (CanDoAction) failure because the host was active. Then tried to put the host into maintenance.



Actual results:
Failed to put the host into maintenance:
2014-05-29 14:57:06,190 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (org.ovirt.thread.pool-6-thread-10) [7a5b92a2] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected :  ID: f397fb84-ac4e-4672-b11e-f0e4b24b3d65 Type: VDS
2014-05-29 14:57:06,202 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (org.ovirt.thread.pool-6-thread-10) [7a5b92a2] START, SetVdsStatusVDSCommand(HostName = green-vdsa, HostId = f397fb84-ac4e-4672-b11e-f0e4b24b3d65, status=PreparingForMaintenance, nonOperationalReason=NONE, stopSpmFailureLogged=true), log id: 1d25727
2014-05-29 14:57:19,050 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-6-thread-49) Command ConnectStoragePoolVDSCommand(HostName = green-vdsa, HostId = f397fb84-ac4e-4672-b11e-f0e4b24b3d65, storagePoolId = a6aab135-2d4c-46b4-9062-7d0c6698fac0, vds_spm_id = 1, masterDomainId = d9520218-a361-4edf-b452-daf316b571d6, masterVersion = 8) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-29 14:57:19,051 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-6-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 1a7c1bd4
2014-05-29 14:57:19,052 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-6-thread-49) IrsBroker::Failed::SPMGetAllTasksInfoVDS due to: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Could not connect host to Data Center(Storage issue)
2014-05-29 14:57:19,059 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMGetAllTasksInfoVDSCommand] (org.ovirt.thread.pool-6-thread-49) FINISH, SPMGetAllTasksInfoVDSCommand, log id: 53e312f
2014-05-29 14:57:19,060 ERROR [org.ovirt.engine.core.bll.AsyncTaskManager] (org.ovirt.thread.pool-6-thread-49) Getting existing tasks on Storage Pool DC1 failed.: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Could not connect host to Data Center(Storage issue) (Failed with error ENGINE and code 5001)


I removed the host from the setup and tried to add it again. Adding it failed with the following SQL exception:

2014-05-29 14:59:12,363 ERROR [org.ovirt.engine.core.bll.AddVdsSpmIdCommand] (ajp--127.0.0.1-8702-11) [14098a8e] Command org.ovirt.engine.core.bll.AddVdsSpmIdCommand throw exception: org.springframework.dao.DataIntegrityViolationException: CallableStatementCallback; SQL [{call insertvds_spm_id_map(?, ?, ?)}]; ERROR: insert or update on table "vds_spm_id_map" violates foreign key constraint "fk_vds_spm_id_map_vds_id"
  Detail: Key (vds_id)=(f397fb84-ac4e-4672-b11e-f0e4b24b3d65) is not present in table "vds_static".
  Where: SQL statement "INSERT INTO vds_spm_id_map(storage_pool_id, vds_id, vds_spm_id) VALUES( $1 ,  $2 ,  $3 )"
PL/pgSQL function "insertvds_spm_id_map" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "vds_spm_id_map" violates foreign key constraint "fk_vds_spm_id_map_vds_id"
  Detail: Key (vds_id)=(f397fb84-ac4e-4672-b11e-f0e4b24b3d65) is not present in table "vds_static".
  Where: SQL statement "INSERT INTO vds_spm_id_map(storage_pool_id, vds_id, vds_spm_id) VALUES( $1 ,  $2 ,  $3 )"
PL/pgSQL function "insertvds_spm_id_map" line 2 at SQL statement
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:245) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:1030) [spring-jdbc.jar:3.1.1.RELEASE]
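
The foreign-key failure in the trace above can be illustrated in isolation. The following is a minimal sketch, not the engine's code: it uses SQLite from Python's standard library instead of PostgreSQL, and the two-table schema is a simplified assumption that keeps only the table and column names visible in the error message. The helpers add_spm_id and demo are hypothetical names introduced for the illustration.

```python
import sqlite3

# IDs copied from the log above; the schema below is a simplified assumption.
HOST_ID = "f397fb84-ac4e-4672-b11e-f0e4b24b3d65"
POOL_ID = "a6aab135-2d4c-46b4-9062-7d0c6698fac0"


def add_spm_id(conn, pool_id, vds_id, spm_id):
    """Insert a mapping row. Return True on success, False on a constraint violation."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute(
                "INSERT INTO vds_spm_id_map(storage_pool_id, vds_id, vds_spm_id) "
                "VALUES (?, ?, ?)",
                (pool_id, vds_id, spm_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE vds_static (vds_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE vds_spm_id_map ("
        " storage_pool_id TEXT,"
        " vds_id TEXT REFERENCES vds_static(vds_id),"
        " vds_spm_id INTEGER)"
    )

    # Host row exists: the mapping insert is accepted.
    with conn:
        conn.execute("INSERT INTO vds_static(vds_id) VALUES (?)", (HOST_ID,))
    with_host = add_spm_id(conn, POOL_ID, HOST_ID, 1)

    # Simulate another operation removing the host in between.
    with conn:
        conn.execute("DELETE FROM vds_spm_id_map WHERE vds_id = ?", (HOST_ID,))
        conn.execute("DELETE FROM vds_static WHERE vds_id = ?", (HOST_ID,))

    # The parent row is gone, so the same insert now violates the foreign key.
    without_host = add_spm_id(conn, POOL_ID, HOST_ID, 1)
    conn.close()
    return with_host, without_host
```

Calling demo() runs both inserts. The second one is rejected for the same reason as in the trace: the referenced vds_id is no longer present in vds_static when the mapping row is written.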



Expected results:
Putting the host into maintenance should succeed.

Additional info: engine and vdsm logs

Comment 1 Elad 2014-05-29 13:03:56 UTC
Restarting the engine service didn't solve the issue.

Comment 2 Nir Soffer 2014-05-29 15:36:05 UTC
According to Elad, the only way to use this system after this is to clean up the database.

Comment 3 Liron Aravot 2014-09-14 10:51:37 UTC
The issue here seems to be related to the host lifecycle. Looking at the logs, I noticed a few issues:
1. The host's cluster can be changed before the host is in maintenance.
2. Different DML operations run concurrently on the host-related tables, leaving the DB in a problematic state that leads to SQL exceptions.
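
One way to picture the second point is a per-host exclusive lock that rejects an operation while another one is working on the same host. The linked gerrit patches are titled "Change cluster command should lock host"; the sketch below only illustrates that general idea and is not the engine's locking code. HostLocks, HostBusyError, and demo_locks are hypothetical names, and the in-memory registry is an assumption made for the example.

```python
import threading
from contextlib import contextmanager


class HostBusyError(RuntimeError):
    """Raised when another operation already holds the host's lock."""


class HostLocks:
    """In-memory registry of hosts that currently have an operation in flight."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the _held set itself
        self._held = set()

    @contextmanager
    def exclusive(self, host_id):
        # Fail fast instead of waiting: a second operation on the same host
        # is rejected rather than interleaved with the first one.
        with self._guard:
            if host_id in self._held:
                raise HostBusyError(f"host {host_id} is locked by another operation")
            self._held.add(host_id)
        try:
            yield
        finally:
            with self._guard:
                self._held.discard(host_id)


def demo_locks():
    locks = HostLocks()
    with locks.exclusive("host-a"):  # e.g. a host removal in progress
        try:
            with locks.exclusive("host-a"):  # e.g. a cluster change on the same host
                same_host = "acquired"
        except HostBusyError:
            same_host = "rejected"
        with locks.exclusive("host-b"):  # a different host is unaffected
            other_host = "acquired"
    with locks.exclusive("host-a"):  # lock is free again after release
        after_release = "acquired"
    return same_host, other_host, after_release
```

With such a lock around each host lifecycle command, the DML statements of two commands touching the same host cannot interleave. Rejecting the second command instead of blocking is a design choice of this sketch.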

relevant log starts from: 2014-05-29 14:58:03,258 INFO  

As this is related to the host lifecycle, moving to infra. Oved, please take a look at this.

thanks.

Comment 4 Sandro Bonazzola 2014-10-17 12:22:25 UTC
oVirt 3.5 has been released and should include the fix for this issue.

