Bug 796371 - [RFE]: rhevm resource status needs to be considered on deployment
Summary: [RFE]: rhevm resource status needs to be considered on deployment
Keywords:
Status: CLOSED EOL
Alias: None
Product: CloudForms Cloud Engine
Classification: Retired
Component: aeolus-conductor
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Jan Provaznik
QA Contact: Rehana
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-02-22 18:55 UTC by Dave Johnson
Modified: 2017-01-09 07:53 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-09 07:53:36 UTC



Description Dave Johnson 2012-02-22 18:55:03 UTC
Description of problem:
===========================
I can't say I fully understand everything going on behind the scenes with DC, but when deploying, DC will search rhevm clusters for an image's location.  I think we need to take this a step further: when we find a matching image, also check whether the place where we found it is online.  If resources are down or unavailable, we should continue searching for a good (online) location.

storage domain status:  /api/datacenters/uuid/storagedomains/uuid -> status
hypervisor host status: /api/hosts/uuid -> status

Comment 1 Michal Fojtik 2012-02-24 13:45:20 UTC
This bug is not really DC related. Deltacloud does not perform any kind of 'checking' if the datacenter is UP or if host is UP.

However we have this:

GET /api/realms/3c8af388-cff6-11e0-9267-52540013f702

<realm href='http://localhost:3001/api/realms/3c8af388-cff6-11e0-9267-52540013f702' id='3c8af388-cff6-11e0-9267-52540013f702'>
  <name>engops-nfs</name>
  <state>AVAILABLE</state>
</realm>

(Note the <state> above.) The 'realm' in Deltacloud represents a 'Cluster' in RHEV-M. However, the 'status' in the Realm model is the status of the Datacenter where the Cluster is located, so if this status is reported as 'DOWN', the whole datacenter is down.

We don't have any collection that checks 'Host'; however, if the 'Host' is down, then the Datacenter is DOWN as well, and this information should be available in the Realm details.

My suggestion is that, before deployment, Conductor should call Deltacloud, request the Realm where the Deployment will be placed, and show a decent error to the user when the Realm is down, instead of showing a 'failed' deployment. It will also help admins figure out what is happening.
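
A minimal sketch of the pre-deployment check suggested above, assuming the realm XML has the shape shown in this comment. The names `parse_realm`, `ensure_realm_available`, and `RealmUnavailableError` are hypothetical, not part of Conductor or Deltacloud, and treating every state other than 'AVAILABLE' as down is an assumption:

```python
import xml.etree.ElementTree as ET

# Realm document copied from the GET /api/realms/... response above.
REALM_XML = """<realm href='http://localhost:3001/api/realms/3c8af388-cff6-11e0-9267-52540013f702' id='3c8af388-cff6-11e0-9267-52540013f702'>
  <name>engops-nfs</name>
  <state>AVAILABLE</state>
</realm>"""


class RealmUnavailableError(Exception):
    """Hypothetical error raised before launch when the realm is not usable."""


def parse_realm(xml_text):
    """Extract id, name and state from a Deltacloud realm document."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "state": (root.findtext("state") or "").strip().upper(),
    }


def ensure_realm_available(xml_text):
    """Return the parsed realm, or raise a readable error if it is not AVAILABLE.

    Assumption: any state other than AVAILABLE means the realm cannot be used.
    """
    realm = parse_realm(xml_text)
    if realm["state"] != "AVAILABLE":
        raise RealmUnavailableError(
            "Realm %s (%s) is %s; deployment not started"
            % (realm["name"], realm["id"], realm["state"] or "in an unknown state")
        )
    return realm
```

Calling `ensure_realm_available` on the realm document before launching would let the user see which realm is down, rather than only a failed deployment.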

Comment 2 Ronelle Landy 2012-02-24 13:54:32 UTC
Reassigning to aeolus-conductor, as per Michal's comments above and agreement in chat conversation:

<athomas> mfojtik, Looks like it needs to be reassigned within the conductor team

Comment 3 Angus Thomas 2012-02-24 13:55:29 UTC
Can we add a check to conductor, immediately prior to launch, which checks the state of the target host/data centre and throws an error, or attempts to select an alternative if host/data centre is unavailable?

Comment 4 Michal Fojtik 2012-02-24 14:06:13 UTC
@Angus: Yes, it's possible to use the 'next-available' realm in DC and deploy the instance there. By default, when DC does not receive any information about where the instance should be deployed (I mean 'realm_id' here), DC will try to deploy the image in the same datacenter where the Template is located/was registered.

Note that DC will not check whether the realm is UP or DOWN when starting an instance. If the realm is DOWN, DC will forward the error from the RHEV-M API.
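
Since DC does not check realm state itself, the caller would have to pick the 'realm_id'. A rough sketch of such a selection, assuming realms are already fetched as dictionaries with 'id' and 'state' keys; `pick_realm_id` is a hypothetical helper, and 'AVAILABLE' as the only usable state is an assumption:

```python
def pick_realm_id(realms, preferred_id=None):
    """Choose a realm_id to pass along with the launch request.

    realms: list of dicts with 'id' and 'state' keys (assumed shape).
    preferred_id: the realm the deployment was originally targeted at, if any.
    Returns the preferred realm when it is AVAILABLE, otherwise the first
    AVAILABLE realm in list order, or None when nothing is usable.
    """
    available = [r["id"] for r in realms if r.get("state") == "AVAILABLE"]
    if preferred_id in available:
        return preferred_id
    return available[0] if available else None
```

Returning None leaves the caller free to report an error instead of sending a request that RHEV-M would reject.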

Comment 5 Angus Thomas 2012-02-27 16:16:43 UTC
For 1.0, we need to give a useful error report to the user when a launch fails because a datacentre is uncontactable.

Logic for selecting an alternative provider to launch on, etc., will come in after 1.0.

Comment 6 Scott Seago 2012-02-29 19:18:52 UTC
A datacenter is a provider, not a realm -- so if the datacenter is unavailable, the provider is down. Doesn't our 'provider availability' checking handle that?

For realms, we should be tracking provider realm (which is a RHEV cluster) availability as well.

What we're _not_ checking currently are storage domains or hypervisor status -- are those even exposed via deltacloud? Conductor only uses deltacloud to contact RHEV, so if it's not exposed via deltacloud we won't know about it.

Comment 7 Angus Thomas 2012-03-06 17:25:05 UTC
The underlying issue here needs to be resolved through a combination of adding launch retries to conductor when a single launch invocation for a realm/provider fails and, potentially, by raising feature requests for enhanced state reporting by deltacloud when the API returns errors.

Comment 8 Jan Provaznik 2012-08-23 14:58:37 UTC
This issue is partially solved: if a datacenter is not available and the launch request fails, rollback+relaunch will be done and another datacenter will be used, so a user will not end up with a create_failed instance.

Though we can improve the process by:
1) we periodically check if providers are accessible; if a provider is not accessible, it's marked as unavailable and is ignored when launching deployments.
But this test checks the connection only to dc-core, not to the cloud provider which sits behind dc-core (for example a RHEVM datacenter). Once there is a method on the dc-core side which we can use to test the end-cloud-provider connection, we can use it in the periodic checking on the conductor side. A request for this method is filed here: https://issues.apache.org/jira/browse/DTACLOUD-307.

2) better error reporting as described in Comment 7, so that if a launch request fails, we can easily identify the error category (inaccessible provider, wrong realm, wrong hwp,...) and consider this error when choosing the next match in the rollback+relaunch process.
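
To illustrate point 2, here is a sketch of how an error category could steer the choice of the next match. Everything in it is hypothetical: `Match`, `LaunchError`, `launch_with_relaunch`, and the category strings are illustrative names, not Conductor code, and deltacloud did not report such categories at the time of this comment:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """A candidate launch target (hypothetical shape)."""
    provider: str
    realm: str


class LaunchError(Exception):
    """Launch failure carrying an assumed error category."""

    def __init__(self, category, attempts=None):
        super().__init__(category)
        self.category = category
        self.attempts = attempts or []


def launch_with_relaunch(matches, launch):
    """Try matches in order, skipping ones ruled out by earlier errors.

    launch: callable taking a Match and returning an instance id, or
    raising LaunchError. Returns (instance_id, attempts), where attempts
    lists the (match, category) pairs that failed before the success.
    """
    bad_providers = set()
    bad_realms = set()
    attempts = []
    for match in matches:
        if match.provider in bad_providers:
            continue  # whole provider was reported inaccessible
        if (match.provider, match.realm) in bad_realms:
            continue  # this realm was already rejected
        try:
            return launch(match), attempts
        except LaunchError as err:
            attempts.append((match, err.category))
            if err.category == "inaccessible_provider":
                bad_providers.add(match.provider)
            elif err.category == "wrong_realm":
                bad_realms.add((match.provider, match.realm))
    raise LaunchError("no_match_left", attempts)
```

With categories available, one failure against an unreachable provider rules out all of its realms at once, so the relaunch does not repeat requests that are known to fail.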

Comment 9 Angus Thomas 2012-08-30 11:39:30 UTC
The two actions referred to in Comment 8 both require new features in deltacloud, which puts them beyond the scope of 1.1.

Comment 10 Michal Fojtik 2012-08-30 12:25:10 UTC
2)

I created an RFE in Deltacloud JIRA to collect the generic errors that will be useful for Conductor:

https://issues.apache.org/jira/browse/DTACLOUD-309

