Created attachment 900345
engine and vdsm logs

Description of problem:
I tried to put the vdsm host into maintenance and it failed with an internal engine error.
This happened to me after a specific scenario; I'm not sure whether it is related to those specific steps.

Version-Release number of selected component (if applicable):
ovirt-alpha-1
ovirt-engine-3.5.0-0.0.master.20140519181229.gitc6324d4.el6.noarch
vdsm-4.14.1-340.gitedb02ba.el6.x86_64

How reproducible:
Unknown, happened to me after the scenario described in "Steps to Reproduce".

Steps to Reproduce:
On a shared DC with an active host connected to the storage server by iSCSI and FC. Make sure the host is exposed to more than 1 LUN by FC:
1. Have 2 storage domains: FC and iSCSI
2. Block connectivity to the iSCSI domain and wait for it to become inactive and for the FC domain to take master
3. When the FC domain becomes active and master, remove 1 of the FC-connected LUNs from LUN masking in the storage server. Click on 'edit' on the FC domain
4. Engine sends getDeviceList to vdsm and fails to get a response after 3 minutes:

2014-05-29 14:42:04,333 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp--127.0.0.1-8702-1) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022) : org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)

5. FC domain becomes inactive. Engine tries to perform reconstruct and fails because the other domain (iSCSI) is also inactive:

2014-05-29 14:51:18,204 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-6-thread-3) [237029dd] Command ReconstructMasterVDSCommand(HostName = green-vdsa, HostId = f397fb84-ac4e-4672-b11e-f0e4b24b3d65, vdsSpmId = 1, storagePoolId = a6aab135-2d4c-46b4-9062-7d0c6698fac0, storagePoolName = DC1, masterDomainId = d9520218-a361-4edf-b452-daf316b571d6, masterVersion = 8, domainsList = [{ domainId: d9520218-a361-4edf-b452-daf316b571d6, status: Unknown };{ domainId: aa3c9195-594c-4090-9050-74c034b9535d, status: Active };]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

6. Tried to force remove the DC and failed with a CDA because the host was active. Tried to put the host into maintenance.

Actual results:
Failed to put the host into maintenance:

2014-05-29 14:57:06,190 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (org.ovirt.thread.pool-6-thread-10) [7a5b92a2] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected : ID: f397fb84-ac4e-4672-b11e-f0e4b24b3d65 Type: VDS
2014-05-29 14:57:06,202 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (org.ovirt.thread.pool-6-thread-10) [7a5b92a2] START, SetVdsStatusVDSCommand(HostName = green-vdsa, HostId = f397fb84-ac4e-4672-b11e-f0e4b24b3d65, status=PreparingForMaintenance, nonOperationalReason=NONE, stopSpmFailureLogged=true), log id: 1d25727
2014-05-29 14:57:19,050 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-6-thread-49) Command ConnectStoragePoolVDSCommand(HostName = green-vdsa, HostId = f397fb84-ac4e-4672-b11e-f0e4b24b3d65, storagePoolId = a6aab135-2d4c-46b4-9062-7d0c6698fac0, vds_spm_id = 1, masterDomainId = d9520218-a361-4edf-b452-daf316b571d6, masterVersion = 8) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-29 14:57:19,051 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-6-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 1a7c1bd4
2014-05-29 14:57:19,052 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-6-thread-49) IrsBroker::Failed::SPMGetAllTasksInfoVDS due to: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Could not connect host to Data Center(Storage issue)
2014-05-29 14:57:19,059 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMGetAllTasksInfoVDSCommand] (org.ovirt.thread.pool-6-thread-49) FINISH, SPMGetAllTasksInfoVDSCommand, log id: 53e312f
2014-05-29 14:57:19,060 ERROR [org.ovirt.engine.core.bll.AsyncTaskManager] (org.ovirt.thread.pool-6-thread-49) Getting existing tasks on Storage Pool DC1 failed.: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Could not connect host to Data Center(Storage issue) (Failed with error ENGINE and code 5001)

I removed the host from the setup and tried to add it again. This failed with the following SQL exception:

2014-05-29 14:59:12,363 ERROR [org.ovirt.engine.core.bll.AddVdsSpmIdCommand] (ajp--127.0.0.1-8702-11) [14098a8e] Command org.ovirt.engine.core.bll.AddVdsSpmIdCommand throw exception: org.springframework.dao.DataIntegrityViolationException: CallableStatementCallback; SQL [{call insertvds_spm_id_map(?, ?, ?)}]; ERROR: insert or update on table "vds_spm_id_map" violates foreign key constraint "fk_vds_spm_id_map_vds_id"
  Detail: Key (vds_id)=(f397fb84-ac4e-4672-b11e-f0e4b24b3d65) is not present in table "vds_static".
  Where: SQL statement "INSERT INTO vds_spm_id_map(storage_pool_id, vds_id, vds_spm_id) VALUES( $1 , $2 , $3 )"
PL/pgSQL function "insertvds_spm_id_map" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "vds_spm_id_map" violates foreign key constraint "fk_vds_spm_id_map_vds_id"
  Detail: Key (vds_id)=(f397fb84-ac4e-4672-b11e-f0e4b24b3d65) is not present in table "vds_static".
  Where: SQL statement "INSERT INTO vds_spm_id_map(storage_pool_id, vds_id, vds_spm_id) VALUES( $1 , $2 , $3 )"
PL/pgSQL function "insertvds_spm_id_map" line 2 at SQL statement
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:245) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:1030) [spring-jdbc.jar:3.1.1.RELEASE]

Expected results:
Putting the host into maintenance should work.

Additional info:
engine and vdsm logs
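
For anyone going through the attached logs, a minimal Python sketch for pulling the ERROR entries out of an engine log from a given timestamp onward. It assumes every entry starts with a "YYYY-MM-DD HH:MM:SS,mmm LEVEL [logger]" prefix, as in the excerpts above; the names parse_line and errors_since are made up for this illustration and are not part of oVirt.

```python
import re
from datetime import datetime

# Assumed entry prefix, taken from the excerpts in this report:
#   2014-05-29 14:42:04,333 ERROR [org.ovirt...GetDeviceListQuery] (thread) message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\]\s+(?P<rest>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, logger, rest), or None for continuation lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # e.g. a stack trace line such as "at org.springframework..."
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("level"), m.group("logger"), m.group("rest")


def errors_since(lines, start):
    """Collect parsed ERROR entries whose timestamp is at or after `start`."""
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[1] == "ERROR" and parsed[0] >= start:
            found.append(parsed)
    return found
```

For example, errors_since(open(path), datetime(2014, 5, 29, 14, 57)) would list the errors from the maintenance attempt onward, where path points at the engine log file.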
Restarting the engine service didn't solve the issue.
According to Elad, the only way to use this system after this is to clean up the database.
The issue here seems to be related to the host lifecycle. Looking at the logs, I noticed a few issues:
1. The host's cluster can be changed before the host is in maintenance.
2. Different DML operations run concurrently on the host-related tables, leaving the DB in a problematic state that leads to SQL exceptions.

The relevant log starts from: 2014-05-29 14:58:03,258 INFO

As this is related to the host life cycle, moving to infra. Oved, please take a look at this. Thanks.
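
To illustrate the second point, here is a minimal, self-contained Python sketch of how interleaved DML on the host tables can end in the foreign key violation from the report. It is only an approximation under stated assumptions: it uses an in-memory SQLite database instead of the engine's PostgreSQL, a heavily simplified two-table schema that borrows only the table and column names from the exception text, and it simulates the interleaving sequentially rather than with real threads. The function name demo_fk_violation is made up for this example.

```python
import sqlite3


def demo_fk_violation():
    """Insert a SPM id mapping for a host row that another flow already deleted."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

    # Simplified stand-ins for the engine tables (assumed schema, names from the log).
    conn.execute("CREATE TABLE vds_static (vds_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE vds_spm_id_map ("
        " storage_pool_id TEXT,"
        " vds_id TEXT REFERENCES vds_static(vds_id),"
        " vds_spm_id INTEGER)"
    )

    host = "f397fb84-ac4e-4672-b11e-f0e4b24b3d65"
    pool = "a6aab135-2d4c-46b4-9062-7d0c6698fac0"

    # Flow A: the host exists, then a remove-host operation deletes its row.
    conn.execute("INSERT INTO vds_static VALUES (?)", (host,))
    conn.execute("DELETE FROM vds_static WHERE vds_id = ?", (host,))

    # Flow B: still working with the old host id, tries to map an SPM id to it.
    try:
        conn.execute(
            "INSERT INTO vds_spm_id_map VALUES (?, ?, ?)", (pool, host, 1)
        )
    except sqlite3.IntegrityError as exc:
        return str(exc)  # the child row references a parent that is gone
    finally:
        conn.close()
    return None
```

Calling demo_fk_violation() returns the constraint error message instead of inserting the row, which mirrors the "is not present in table vds_static" failure above. Serializing the host lifecycle operations, or re-checking that the host row exists inside the same transaction, would avoid this kind of interleaving.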
oVirt 3.5 has been released and should include the fix for this issue.