Description of problem:
===========================
Can't say I fully understand everything going on behind the scenes with DC, but when deploying, DC will search the RHEV-M clusters for an image's location. I think we need to take this a step further: when we find a matching image, we should also check whether the place where we found it is online. If resources are down or unavailable, we should continue searching for a good (online) location.

storage domain status: /api/datacenters/uuid/storagedomains/uuid -> status
hypervisor host status: /api/hosts/uuid -> status
This bug is not really DC related. Deltacloud does not perform any kind of check on whether the datacenter or the host is UP. However, we have this:

GET /api/realms/3c8af388-cff6-11e0-9267-52540013f702

<realm href='http://localhost:3001/api/realms/3c8af388-cff6-11e0-9267-52540013f702' id='3c8af388-cff6-11e0-9267-52540013f702'>
  <name>engops-nfs</name>
  <state>AVAILABLE</state>
</realm>

Note the <state> above. A 'realm' in Deltacloud represents a 'Cluster' in RHEV-M. However, the 'status' in the Realm model is the status of the Datacenter where the Cluster is located, so if this status is reported as 'DOWN', the whole datacenter is down. We don't have any collection that checks the 'Host'; however, if the 'Host' is down, then the Datacenter is DOWN as well, and this information should be available in the Realm details.

My suggestion is that Conductor, before deployment, should call Deltacloud, request the Realm where the Deployment will be placed, and show the user a decent error when the Realm is down, instead of showing a 'failed' deployment. It will also help admins figure out what is happening.
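The pre-launch check suggested above could look roughly like the following. This is only a sketch, written in Python purely for illustration: it parses a realm document shaped like the example response and raises a readable error when the state is not AVAILABLE. The names `parse_realm`, `ensure_realm_available` and `RealmUnavailableError` are hypothetical, and treating every state other than AVAILABLE as "down" is an assumption, not something Deltacloud guarantees.

```python
import xml.etree.ElementTree as ET


class RealmUnavailableError(Exception):
    """Hypothetical error raised when a realm is not AVAILABLE."""


def parse_realm(xml_text):
    """Extract id, name and state from a Deltacloud-style <realm> document."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "state": root.findtext("state"),
    }


def ensure_realm_available(xml_text):
    """Return the parsed realm, or raise if its state is not AVAILABLE.

    Assumption: any state other than AVAILABLE means the realm
    (and the datacenter behind it) cannot be used for a launch.
    """
    realm = parse_realm(xml_text)
    if realm["state"] != "AVAILABLE":
        raise RealmUnavailableError(
            "Realm %s (%s) is %s; launch aborted"
            % (realm["name"], realm["id"], realm["state"])
        )
    return realm


# Sample body copied from the response shown in this comment.
SAMPLE = """<realm href='http://localhost:3001/api/realms/3c8af388-cff6-11e0-9267-52540013f702' id='3c8af388-cff6-11e0-9267-52540013f702'>
  <name>engops-nfs</name>
  <state>AVAILABLE</state>
</realm>"""
```

In practice the XML would come from the GET request on the realm URL rather than from a constant, and the error message would be surfaced to the user in place of a bare 'failed' deployment.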
Reassigning to aeolus-conductor, as per Michal's comments above and the agreement in the chat conversation: <athomas> mfojtik, Looks like it needs to be reassigned within the conductor team
Can we add a check to conductor, immediately prior to launch, which checks the state of the target host/data centre and throws an error, or attempts to select an alternative if host/data centre is unavailable?
@Angus: Yes, it's possible to use the 'next-available' realm in DC and deploy the instance there. By default, when DC does not receive any information about where the instance should be deployed (I mean 'realm_id' here), DC will try to deploy the image in the same datacenter where the Template is located/was registered. Note that DC will not check whether the realm is UP or DOWN when starting an instance. If the realm is DOWN, DC will forward the error from the RHEV-M API.
For 1.0, we need to give a useful error report to the user when a launch fails because a datacentre is uncontactable. Logic for selecting an alternate provider to launch on, etc., will come in after 1.0.
A datacenter is a provider, not a realm -- so if the datacenter is unavailable, the provider is down. Doesn't our 'provider availability' checking handle that? For realms, we should be tracking provider realm (which is a RHEV cluster) availability as well. What we're _not_ checking currently are storage domains or hypervisor status -- are those even exposed via deltacloud? Conductor only uses deltacloud to contact RHEV, so if it's not exposed via deltacloud we won't know about it.
The underlying issue here needs to be resolved through a combination of adding launch retries to conductor when a single launch invocation for a realm/provider fails and, potentially, raising feature requests for enhanced state reporting by deltacloud when the API returns errors.
This issue is partially solved: if a datacenter is not available and the launch request fails, rollback+relaunch will be done and another datacenter will be used, so a user will not end up with a create_failed instance. We can still improve the process in two ways:

1) We periodically check whether providers are accessible; if a provider is not accessible, it is marked as unavailable and is ignored when launching deployments. But this test only checks the connection to dc-core, not to the cloud provider that sits behind dc-core (for example a RHEVM datacenter). Once there is a method on the dc-core side that we can use to test the connection to the end cloud provider, we can use it in the periodic checking on the conductor side. A request for this method is filed here: https://issues.apache.org/jira/browse/DTACLOUD-307.

2) Better error reporting, as described in Comment 7, so that if a launch request fails we can easily identify the error category (inaccessible provider, wrong realm, wrong hwp, ...) and take this error into account when choosing the next match in the rollback+relaunch process.
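To make the rollback+relaunch idea with error categories more concrete, here is a rough sketch in Python (for illustration only, not Conductor code). Everything in it is hypothetical: the `LaunchError` class, the category strings, the `launch` callable and the shape of `matches` as (provider, realm) pairs. It assumes that an "inaccessible provider" error makes it pointless to try other realms on the same provider, while other categories only rule out the single match that failed.

```python
class LaunchError(Exception):
    """Hypothetical launch failure carrying an error category."""

    def __init__(self, category, message):
        super().__init__(message)
        self.category = category


def launch_with_fallback(matches, launch, unavailable=frozenset()):
    """Try candidate (provider, realm) matches in order until one launches.

    matches     -- ordered list of (provider, realm) candidates
    launch      -- callable(provider, realm) returning an instance,
                   or raising LaunchError
    unavailable -- providers already marked unavailable by periodic checks

    Returns (instance, errors), where errors lists the failed attempts.
    """
    errors = []
    skipped = set(unavailable)
    for provider, realm in matches:
        if provider in skipped:
            continue  # provider known to be down, do not retry it
        try:
            return launch(provider, realm), errors
        except LaunchError as err:
            errors.append((provider, realm, err.category))
            # Assumption: an unreachable provider fails for all its realms.
            if err.category == "inaccessible_provider":
                skipped.add(provider)
    raise LaunchError("no_match", "all candidate matches failed: %r" % (errors,))
```

The collected `errors` list is what would let the user (or an admin) see why earlier matches were rejected, instead of only seeing the final outcome.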
The two actions referred to in Comment 8 both require new features in deltacloud, which puts them beyond the scope of 1.1
2) I created an RFE in the Deltacloud JIRA to collect the generic errors that will be useful for Conductor: https://issues.apache.org/jira/browse/DTACLOUD-309