Bug 1293928 - Failed activating hosted engine SD during auto-import on iSCSI/FC
Failed activating hosted engine SD during auto-import on iSCSI/FC
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: BLL.HostedEngine
3.6.1.3
x86_64 Linux
Priority: high, Severity: high
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2.6
Assigned To: Roy Golan
QA Contact: Elad
alma03 host after trying to attach HE...
Depends On:
Blocks: 1246862 1269768 1291634 1293892
 
Reported: 2015-12-23 09:59 EST by Nikolai Sednev
Modified: 2016-02-23 04:26 EST (History)
CC: 13 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: importing the NFS domain could interleave with the engine restart that happens as part of installation. Consequence: the hosted engine domain was not imported or activated during engine setup, and the hosted engine VM was not imported or displayed in the webadmin or REST. Fix: exclusively lock the import attempt so that interleaving of two imports is impossible, and prevent starting the import unless there is a DATA domain and an ACTIVE DC. Note: this means that in order to see the engine VM a user must first import a DATA domain and activate the DC. Result: the hosted engine domain is imported and activated, and the hosted engine VM is imported.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-23 04:26:44 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.z+
rule-engine: blocker+
bmcclain: planning_ack+
rgolan: devel_ack+
mavital: testing_ack+


Attachments
logsafter autoimport on clean env.tar.gz (15.67 MB, application/x-gzip)
2015-12-23 10:02 EST, Nikolai Sednev
fc deployment (1.20 MB, application/x-gzip)
2015-12-28 09:19 EST, Elad
sosreport from the engine (6.92 MB, application/x-xz)
2016-01-25 10:26 EST, Nikolai Sednev
sosreport from HE-host (6.99 MB, application/x-xz)
2016-01-25 10:28 EST, Nikolai Sednev


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 51033 master MERGED core: hosted-engine: Lock the sd import exclusively 2016-01-12 05:29 EST
oVirt gerrit 51126 master MERGED core: hosted-engine: import the domain only when DC is UP 2016-01-12 08:21 EST
oVirt gerrit 51208 master MERGED core: hosted-engine: Add connection details explicitly for NFS 2016-01-12 18:46 EST
oVirt gerrit 51414 ovirt-engine-3.6.2 MERGED core: hosted-engine: Lock the sd import exclusively 2016-01-14 04:16 EST
oVirt gerrit 51415 ovirt-engine-3.6.2 MERGED core: hosted-engine: import the domain only when DC is UP 2016-01-14 04:17 EST
oVirt gerrit 51416 ovirt-engine-3.6.2 MERGED core: hosted-engine: Add connection details explicitly for NFS 2016-01-14 04:18 EST
oVirt gerrit 51417 ovirt-engine-3.6 MERGED core: hosted-engine: Lock the sd import exclusively 2016-01-13 10:13 EST
oVirt gerrit 51418 ovirt-engine-3.6 MERGED core: hosted-engine: import the domain only when DC is UP 2016-01-13 10:15 EST
oVirt gerrit 51419 ovirt-engine-3.6 MERGED core: hosted-engine: Add connection details explicitly for NFS 2016-01-13 10:19 EST
oVirt gerrit 51457 ovirt-engine-3.6 MERGED core: storage: make storagServerConnectin compensatable 2016-01-13 10:17 EST
oVirt gerrit 51478 ovirt-engine-3.6.2 MERGED core: storage: make storagServerConnectin compensatable 2016-01-14 04:17 EST
oVirt gerrit 51488 master MERGED core: storage: make storagServerConnectin compensatable 2016-01-12 18:46 EST

Description Nikolai Sednev 2015-12-23 09:59:23 EST
Description of problem:
Activating the hosted engine storage domain failed during auto-import on iSCSI. After deploying HE on two hosts over iSCSI, the hosted engine SD failed to activate during auto-import.

Version-Release number of selected component (if applicable):
engine:
ovirt-engine-extension-aaa-jdbc-1.0.4-1.el6ev.noarch
ovirt-vmconsole-1.0.0-1.el6ev.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
rhevm-3.6.1.3-0.1.el6.noarch
rhevm-dwh-3.6.1-1.el6ev.noarch
rhevm-dwh-setup-3.6.1-1.el6ev.noarch
Linux version 2.6.32-573.8.1.el6.x86_64 (mockbuild@x86-033.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Fri Sep 25 19:24:22 EDT 2015

Couple of hosts:
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64
ovirt-vmconsole-1.0.0-1.el7ev.noarch
vdsm-4.17.13-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
mom-0.5.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.5-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.1.3-1.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
Linux version 3.10.0-327.4.4.el7.x86_64 (mockbuild@x86-019.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Dec 17 15:51:24 EST 2015

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE with DWH reports and with ovirt-vmconsole-proxy on a pair of RHEL 7.2 hosts.
2. Once the engine is running, try to attach the iSCSI HE-SD and activate it.

Actual results:

Dec 23, 2015 1:10:42 PM
Failed to attach Storage Domain hosted_storage to Data Center Default. (User: admin@internal)

Dec 23, 2015 1:10:42 PM
VDSM command failed: Cannot obtain lock: "id=df2356f7-8272-401a-97f7-63c14f37ec7a, rc=-243, out=Cannot acquire cluster lock, err=(-243, 'Sanlock resource not acquired', 'Sanlock exception')"

Expected results:
HE-SD should be auto-imported on clean iSCSI environment.

Additional info:
Attached logs are sosreports from the engine and hosts taken right after deployment and before the auto-import attachment to the DC, followed by sosreports from the engine and hosts taken right after HE-SD attachment to the DC.
Comment 1 Nikolai Sednev 2015-12-23 10:02 EST
Created attachment 1108975 [details]
logsafter autoimport on clean env.tar.gz
Comment 2 Nikolai Sednev 2015-12-23 10:19:16 EST
As the second sosreport file is a bit large, I attached it via Google Drive: https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88Y2MxSk1IRzRUNFk/view?usp=sharing
Comment 3 Roy Golan 2015-12-23 14:59:43 EST
I see that in the middle of the import the engine was restarting, so we had part of the data inside the engine. The domain is there; it just isn't activated. So this is not a bug.

But what I do see wrong is that two auto-import processes started together, which is why the later one failed with an error that the domain already exists.

I should take an engineLock on the whole process to prevent that.

From the logs:

-- the first import 
2015-12-22 21:07:46,034 INFO  [org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-7-thread-6) [] Running command: ImportHostedEngineStorageDomainCommand internal: true.

-- the second one comes just 12 seconds later
2015-12-22 21:07:58,901 INFO  [org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-7-thread-9) [] Running command: ImportHostedEngineStorageDomainCommand internal: true.


-- engine shuts down (the DB resource is shutting down and throws exception)
2015-12-22 21:08:36,629 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-37) [] Timer update runtime info failed. Exception:: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000451: The connection manager is shutdown: java:/ENGINEDataSource


The first added the domain
Comment 4 Roy Golan 2015-12-23 15:44:08 EST
Gil, Niko, this is not a blocker, because the engine shut down in the middle of the process. Niko, please confirm this finding.
Comment 5 Nikolai Sednev 2015-12-27 08:22:39 EST
(In reply to Roy Golan from comment #4)
> Gil, Niko this is not a blocker due to the fact of the engine shutdown in
> the middle of the process. Niko please confirm this finding.

I actually tried to attach the HE-SD manually (via the web UI) on Dec 23, 2015 1:10:42 PM; what happened during deployment followed the regular hosted-engine --deploy procedure. I believe that what you're seeing around the engine being restarted actually corresponds to the step where the engine has to be powered off so deployment can finish, right after the host is added to the engine. There is also another point during deployment when the engine is rebooted, where the host is not yet added to the engine but the engine VM finishes its engine-setup; you then report to hosted-engine --deploy that the engine installation is complete, and it automatically restarts the VM.

This explains why the HE-VM is rebooted twice. Your auto-import tries twice to add the HE-SD to the engine even though HE deployment has not actually finished, i.e. during the deployment process, which is an improper flow for auto-import since deployment is not yet accomplished.

To conclude:
We have two different issues here: first, auto-import is executed while HE deployment is not yet completed; second, auto-import somehow gets run twice.

It's a blocker IMHO, as I can't find any workaround.
Comment 6 Roy Golan 2015-12-28 03:23:13 EST
The engine does get rebooted by the setup, so auto-import might get interrupted while running.

1. Make the auto-import able to clean up completely if interrupted/aborted.

2. Prevent interleaving imports; this fix is already posted to this bug.

3. Prevent auto-import if a master domain doesn't exist. That will hold off the auto-import until we have a data domain up.

Workaround:

Prevent the import from running using a config value: since we know the setup is going to reboot the engine initially, we can use the config value I added to auto-import. But that would mean turning it on again after the engine VM boots, and restarting the engine.
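
Point 3 above can be sketched as a simple guard. This is a hedged illustration with hypothetical names (the engine's real status enums and domain model differ): auto-import only starts once the DC is Up and an active DATA (master-capable) storage domain exists.

```python
# Illustrative guard for "prevent auto import if a master domain
# doesn't exist"; function and field names are assumptions, not the
# engine's actual API.

def should_start_auto_import(dc_status, storage_domains):
    # Hold off until the data center itself is up...
    if dc_status != "Up":
        return False
    # ...and until at least one DATA domain is active.
    return any(d["type"] == "DATA" and d["status"] == "Active"
               for d in storage_domains)
```

With this check in place the import attempt is simply deferred on a clean deployment, rather than started and then interrupted by the engine restart.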
Comment 7 Nikolai Sednev 2015-12-28 06:34:58 EST
(In reply to Roy Golan from comment #6)
> The engine does get rebooted by the setup. So this means that auto import
> might get interrupted while running. 
> 
> 1. Need to make the auto import to be able to cleanup totally if
> interrupted/aborted
> 
> 2. Prevent interleaving imports and this fixes is already posted to this bug.
> 
> 3. Prevent auto import if a master domain doesn't exist. That will keep the
> auto import till we have data domain up.
> 
> Workaround :
>    
>  prevent to import from running using config val -
>  while we know the setup is going to reboot the engine initially we can use
> the config value I added to auto import. But that would mean to turn it on
> again after engine vm boots and restart the engine.

How can you enable auto-import on a clean deployment while there is no active SD and the DC is down, since no master SD exists yet?
Comment 8 Roy Golan 2015-12-28 08:35:44 EST
(In reply to Nikolai Sednev from comment #7)
> (In reply to Roy Golan from comment #6)
> > The engine does get rebooted by the setup. So this means that auto import
> > might get interrupted while running. 
> > 
> > 1. Need to make the auto import to be able to cleanup totally if
> > interrupted/aborted
> > 
> > 2. Prevent interleaving imports and this fixes is already posted to this bug.
> > 
> > 3. Prevent auto import if a master domain doesn't exist. That will keep the
> > auto import till we have data domain up.
> > 
> > Workaround :
> >    
> >  prevent to import from running using config val -
> >  while we know the setup is going to reboot the engine initially we can use
> > the config value I added to auto import. But that would mean to turn it on
> > again after engine vm boots and restart the engine.
> 
> How can you enable auto-import on clean deployment, while there is not
> active SD and DC is down, as there is no master SD exists yet?

Apparently the import of the domain CAN start without an active master (this should be avoided). The import of the VM will be blocked until we have an active DC.

So my proposed workaround is probably a no-go, because we won't be able to set the config value before the engine starts up.
Comment 9 Elad 2015-12-28 09:19 EST
Created attachment 1110007 [details]
fc deployment

Deployed over FC. The HE SD is imported and in status Unattached. In engine.log I saw the following exception; attaching full logs.


2015-12-24 20:16:39,775 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 46) [] Failed to run compensation on startup for Command 'org.ovirt.engine.core.bll.storage.AddExistingBlockStorageDomainCommand', Command Id '6e50e550-9c6b-4b91-a8f8-ee1cff73f891': CallableStatementCallback; SQL [{call insertstorage_domain_dynamic(?, ?, ?)}]; ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
  Where: SQL statement "INSERT INTO storage_domain_dynamic(available_disk_size, id, used_disk_size) VALUES( $1 ,  $2 ,  $3 )"
PL/pgSQL function "insertstorage_domain_dynamic" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
  Where: SQL statement "INSERT INTO storage_domain_dynamic(available_disk_size, id, used_disk_size) VALUES( $1 ,  $2 ,  $3 )"
PL/pgSQL function "insertstorage_domain_dynamic" line 2 at SQL statement
2015-12-24 20:16:39,775 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 46) [] Exception: org.springframework.dao.DataIntegrityViolationException: CallableStatementCallback; SQL [{call insertstorage_domain_dynamic(?, ?, ?)}]; ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
  Where: SQL statement "INSERT INTO storage_domain_dynamic(available_disk_size, id, used_disk_size) VALUES( $1 ,  $2 ,  $3 )"
PL/pgSQL function "insertstorage_domain_dynamic" line 2 at SQL statement
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:245) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:1030) [spring-jdbc.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.call(JdbcTemplate.java:1064) [spring-jdbc.jar:3.1.1.RELEASE]
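
The class of failure in the trace above can be reproduced in miniature: the compensation path inserts a storage_domain_dynamic row whose id (the all-zero UUID) has no matching storage_domain_static row, so the foreign-key constraint rejects it. The schema below is a toy sqlite sketch, not the real engine schema.

```python
import sqlite3

# Toy schema mimicking the dynamic -> static foreign key from the log.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs FKs enabled explicitly
conn.execute("CREATE TABLE storage_domain_static (id TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE storage_domain_dynamic (
    id TEXT PRIMARY KEY REFERENCES storage_domain_static(id),
    available_disk_size INTEGER,
    used_disk_size INTEGER)""")

# Insert a dynamic row whose id is the zero UUID, which has no parent
# row in storage_domain_static -- exactly the situation in the trace.
try:
    conn.execute(
        "INSERT INTO storage_domain_dynamic VALUES (?, ?, ?)",
        ("00000000-0000-0000-0000-000000000000", 0, 0))
    outcome = "inserted"
except sqlite3.IntegrityError as e:
    outcome = str(e)
```

The insert fails with a foreign-key integrity error, just as the PostgreSQL call in the engine did.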
Comment 10 Nikolai Sednev 2015-12-30 08:20:59 EST
Following comment #9 from https://bugzilla.redhat.com/show_bug.cgi?id=1290518, I set the iSCSI HE-SD into maintenance and then destroyed it; after some time it was auto-imported and got attached and activated, OK. But the HE-VM was not imported, per https://bugzilla.redhat.com/show_bug.cgi?id=1294457.
Comment 11 Nikolai Sednev 2016-01-21 08:22:03 EST
FC is working fine, per comment 54 from https://bugzilla.redhat.com/show_bug.cgi?id=1269768.
Comment 12 Nikolai Sednev 2016-01-25 10:25:23 EST
This did not work for me on an iSCSI deployment; auto-import did not work.
I got the HE-VM installed on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20160113.0.el7ev) with rhevm-appliance-20160120.0-1, taken manually during the TUI HE deployment. The iSCSI HE SD was not imported, but a data SD was successfully added and I created a working guest VM for further tasks.

Engine:
rhevm-3.6.2.6-0.1.el6.noarch

Host:
rhevm-sdk-python-3.6.2.1-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
libvirt-client-1.2.17-13.el7_2.2.x86_64
mom-0.5.1-1.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
Comment 13 Nikolai Sednev 2016-01-25 10:26 EST
Created attachment 1118066 [details]
sosreport from the engine
Comment 14 Nikolai Sednev 2016-01-25 10:28 EST
Created attachment 1118067 [details]
sosreport from HE-host
Comment 15 Nikolai Sednev 2016-01-25 10:46:40 EST
vdsClient -s 0 getConnectedStoragePoolsList
00000001-0001-0001-0001-000000000160

vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000160
4cd2ab12-6564-44c9-a0fb-41b8d6523293
Comment 16 Red Hat Bugzilla Rules Engine 2016-01-25 10:46:41 EST
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Comment 17 Roy Golan 2016-01-26 15:48:34 EST

From the vdsm logs I see you decided to call your domain hosted_storage2 (id 5c238), but you probably didn't change that in the engine config value, and that's why the whole process doesn't start.

From engine.log,

 - get all data domains 
(org.ovirt.thread.pool-7-thread-28) [] FINISH, HSMGetStorageDomainsListVDSCommand, return: [5c248582-48fd-4334-9adb-716f771d1422, 4cd2ab12-6564-44c9-a0fb-41b8d6523293]

 - 5c248 exists, but its name doesn't match so engine says:
CanDoAction of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons:
      VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_NOT_EXIST


Regarding comment 15, try the getStorageDomainsList without specifying the pool - because the domain doesn't belong to the pool yet.
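
The name-mismatch failure described above can be sketched as follows. This is a hedged illustration (the config key and function names are assumptions, not the engine's actual code): the engine identifies the hosted-engine SD purely by a configured name, so a domain deployed as "hosted_storage2" is never matched and CanDoAction fails with ACTION_TYPE_FAILED_STORAGE_DOMAIN_NOT_EXIST.

```python
# Illustrative version of the engine-side lookup; the default config
# value is "hosted_storage" and is not updated when the user picks a
# custom SD name during deployment.
HOSTED_ENGINE_SD_NAME = "hosted_storage"

def find_hosted_engine_sd(domain_names):
    # Returns the matching domain name, or None, in which case the
    # import's CanDoAction fails with STORAGE_DOMAIN_NOT_EXIST.
    for name in domain_names:
        if name == HOSTED_ENGINE_SD_NAME:
            return name
    return None
```

This is also why comment 21 below calls the custom-name option a bug of its own: the setup accepts any name, but the lookup only ever matches the default.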
Comment 18 Red Hat Bugzilla Rules Engine 2016-01-26 15:48:37 EST
Bug tickets that are moved to testing must have a target release set so the tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 19 Nikolai Sednev 2016-01-27 04:42:05 EST
(In reply to Roy Golan from comment #17)
> 
> From vdsm logs I see you decided to call you domain hosted_storage2 (id
> 5c238) but you probably didn't change that in the engine config value and
> that's why the whole process doesn't start. 
> 
> From engine.log,
> 
>  - get all data domains 
> (org.ovirt.thread.pool-7-thread-28) [] FINISH,
> HSMGetStorageDomainsListVDSCommand, return: [5c248582-48fd-4334-9adb-716f771
>      d1422, 4cd2ab12-6564-44c9-a0fb-41b8d6523293]
> 
>  - 5c248 exists, but its name doesn't match so engine says:
> CanDoAction of action 'ImportHostedEngineStorageDomain' failed for user
> SYSTEM. Reasons:
>      
> VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,
> ACTION_TYPE_FAILED_STORAGE_DOMAIN_NOT_EXIST
> 
> 
> Regarding comment 15, try the getStorageDomainsList without specifying the
> pool - because the domain doesn't belong to the pool yet.

Probably I did change the default "hosted_storage" name to "hosted_storage2", per your comment #14 from https://bugzilla.redhat.com/show_bug.cgi?id=1294457. Because I used the appliance engine setup, which was executed by the HE deployment, I was not asked about hosted_storage2.
Comment 20 Elad 2016-01-27 05:15:38 EST
Deployed HE over iSCSI using a clean host (OS installed right before deployment), clean storage (only one empty LUN exposed to the host).

Deployment succeeded; the hosted-engine storage domain was imported successfully into the first initialized DC, and activation of both the master and HE storage domains went well.

Used the following:
Red Hat Enterprise Linux Server release 7.2 (Maipo)                                     

ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.3-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
vdsm-cli-4.17.18-0.el7ev.noarch
vdsm-4.17.18-0.el7ev.noarch
vdsm-infra-4.17.18-0.el7ev.noarch
vdsm-yajsonrpc-4.17.18-0.el7ev.noarch
vdsm-xmlrpc-4.17.18-0.el7ev.noarch
vdsm-python-4.17.18-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.17.18-0.el7ev.noarch
vdsm-jsonrpc-4.17.18-0.el7ev.noarch

rhevm-3.6.2.6-0.1.el6.noarch
Comment 21 Simone Tiraboschi 2016-01-27 06:46:50 EST
(In reply to Nikolai Sednev from comment #19)
> Probably, I did changed default "hosted_storage" name to "hosted_storage2",
> this is forth to your comment #14 from
> https://bugzilla.redhat.com/show_bug.cgi?id=1294457. Due to the fact that
> I've used appliance engine setup, which was executed by the HE deployment, I
> was not asked about hosted_storage2.

This is probably a bug in its own right: hosted-engine-setup lets the user enter a custom name for the hosted-engine SD, but the engine only looks for "hosted_storage" to identify it.
Comment 22 Elad 2016-02-22 04:54:29 EST
According to comment #20 and https://bugzilla.redhat.com/show_bug.cgi?id=1291634#c2 (iSCSI and FC deployments), moving this bug to VERIFIED.
