| Summary: | Automatic nova evacuate not working | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jeremy <jmelvin> | |
| Component: | fence-agents | Assignee: | Andrew Beekhof <abeekhof> | |
| Status: | CLOSED ERRATA | QA Contact: | Asaf Hirshberg <ahirshbe> | |
| Severity: | urgent | Docs Contact: | Milan Navratil <mnavrati> | |
| Priority: | urgent | |||
| Version: | 7.0 | CC: | abeekhof, ahirshbe, arkady_kanevsky, berrange, cdevine, cfeist, christopher_dearborn, cluster-maint, dasmith, dbellant, eglynn, fdinitto, gael_rehault, jmelvin, John_walsh, j_t_williams, kchamart, kurt_hey, mburns, mgrac, michele, mkolaja, mkrcmari, mnavrati, morazi, oalbrigt, oblaut, randy_perryman, rbryant, royoung, rscarazz, rsussman, sbauza, sferdjao, sgordon, sreichar, srevivo, ushkalim, vromanso, wayne_allen | |
| Target Milestone: | pre-dev-freeze | Keywords: | Unconfirmed, ZStream | |
| Target Release: | 7.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | fence-agents-4.0.11-34.el7 | Doc Type: | Bug Fix | |
| Doc Text: |
High Availability instances created by non-admin users are now evacuated when a compute instance is turned off
Previously, the `fence_compute` agent searched only for compute instances created by admin users. As a consequence, instances created by non-admin users were not evacuated when a compute instance was turned off. This update makes sure that `fence_compute` searches for instances run as any user, and compute instances are evacuated to new compute nodes as expected.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1322702 (view as bug list) | Environment: | ||
| Last Closed: | 2016-11-04 04:49:35 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1310828, 1322702 | |||
|
Description
Jeremy
2016-03-01 22:22:14 UTC
Jeremy, I think I found the problem. If you look at the status of the resources, we've got a problem here:

Mar 01 08:01:13 [26035] overcloud-controller-0.localdomain cib: info: cib_perform_op: ++ /cib/status/node_state[@id='overcloud-compute-0']/lrm[@id='overcloud-compute-0']/lrm_resources: <lrm_resource id="nova-compute" type="NovaCompute" class="ocf" provider="openstack"/>

But NovaCompute is not present in the version of the package you are using on the system (resource-agents-3.9.5-54.el7_2.6.x86_64), which changes the way instance-ha works. Modifications to the KB are ongoing, but if you want to apply a quick solution you can try this (from a controller, as root):

source ./overcloudrc
pcs resource delete nova-compute
pcs resource cleanup nova-compute
pcs resource create nova-compute-checkevacuate ocf:openstack:nova-compute-wait auth_url=$OS_AUTH_URL username=$OS_USERNAME password=$OS_PASSWORD tenant_name=$OS_TENANT_NAME domain=localdomain no_shared_storage=1 op start timeout=300 --clone interleave=true --disabled --force
pcs constraint location nova-compute-checkevacuate-clone rule resource-discovery=exclusive score=0 osprole eq compute
pcs constraint order start openstack-nova-conductor-clone then nova-compute-checkevacuate-clone require-all=false
pcs resource create nova-compute systemd:openstack-nova-compute --clone interleave=true --disabled --force
pcs constraint location nova-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute
pcs constraint order start nova-compute-checkevacuate-clone then nova-compute-clone require-all=true
pcs constraint order start nova-compute-clone then nova-evacuate require-all=false
pcs constraint order start libvirtd-compute-clone then nova-compute-clone
sudo pcs constraint colocation add nova-compute-clone with libvirtd-compute-clone

So, in short, you need to remove the old nova-compute, introduce nova-compute-checkevacuate, and redeclare nova-compute with the new resource agent nova-compute-wait. After this, clean up the environment and try a new instance-ha test.

Raoul Scarazzini: He did what you suggested and now he is getting:

Mar 3 15:41:39 overcloud-controller-0 lrmd[20726]: notice: nova-evacuate_monitor_10000:24793:stderr [ Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down' ]

Fresh sosreports are on the way soon.

Ok, the updated knowledge base article is online here [1], so first of all let's check whether the versions of the packages meet the requirements.

[1] https://access.redhat.com/articles/1544823

Based on the logs, we're getting all the way to:

Mar 09 13:58:54 overcloud-controller-2.localdomain NovaEvacuate(nova-evacuate)[1975]: NOTICE: Initiating evacuation of overcloud-compute-0.localdomain

The evacuation even claims to have completed successfully:

Mar 09 13:58:57 overcloud-controller-2.localdomain NovaEvacuate(nova-evacuate)[2233]: NOTICE: Completed evacuation of overcloud-compute-0.localdomain

The question to be answered is which instances, if any, were attempted to be moved and what got in the way. Happy to take part in a live debug; I can be found now on #rhos-pidone.

Can you confirm whether instances are being created with the admin credentials?
If not, please retest with the following patch:
diff --git a/sbin/fence_compute b/sbin/fence_compute
index 4538beb..fb28b9d 100644
--- a/sbin/fence_compute
+++ b/sbin/fence_compute
@@ -103,7 +103,7 @@ def _get_evacuable_images():
def _host_evacuate(options):
result = True
- servers = nova.servers.list(search_opts={'host': options["--plug"]})
+        servers = nova.servers.list(search_opts={'host': options["--plug"], 'all_tenants': 1})
if options["--instance-filtering"] == "False":
evacuables = servers
else:
Background: so far our testing had created instances as the admin user, and in such cases evacuation works. However, it is now understood that this is not the typical usage.
The above patch allows the agent to find instances created by tenants other than 'admin' and therefore makes it possible to evacuate them.
The patch successfully resolved evacuation issues on another installation, so there is strong reason to believe it will work here too. Please confirm whether the fix works for you and we'll get it into official packages.
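The effect of the one-line change can be sketched in plain Python. The `FakeServer` objects and `list_servers` helper below are hypothetical stand-ins for the novaclient API (the real agent calls `nova.servers.list(search_opts=...)`); they only model the visibility rule that matters here:

```python
# Sketch of why 'all_tenants' matters: without it, a listing made with
# admin credentials returns only the admin tenant's own instances.

class FakeServer:
    def __init__(self, name, host, tenant):
        self.name = name
        self.host = host
        self.tenant = tenant

SERVERS = [
    FakeServer("vm01", "compute-0", "admin"),
    FakeServer("vm02", "compute-0", "demo"),
    FakeServer("vm03", "compute-1", "demo"),
]

def list_servers(search_opts, caller_tenant="admin"):
    """Mimic nova.servers.list(): unless all_tenants is set, only the
    caller's own instances are visible; then filter by host."""
    servers = SERVERS
    if not search_opts.get("all_tenants"):
        servers = [s for s in servers if s.tenant == caller_tenant]
    host = search_opts.get("host")
    if host:
        servers = [s for s in servers if s.host == host]
    return servers

# Before the patch: only admin-owned instances on the fenced host are seen.
before = list_servers({"host": "compute-0"})
# After the patch: instances from every tenant on that host are seen.
after = list_servers({"host": "compute-0", "all_tenants": 1})

print([s.name for s in before])  # ['vm01']
print([s.name for s in after])   # ['vm01', 'vm02']
```

In other words, the pre-patch agent silently skipped vm02 during evacuation because the listing never returned it.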
Upstream patches: https://github.com/ClusterLabs/fence-agents/commit/64f086d https://github.com/ClusterLabs/fence-agents/commit/785a381 From the customer:
My workaround for this is to have the following scripts run every minute. Since I only have two compute nodes, this is an acceptable workaround.
evacuate-compute-0.sh:
((count = 5)) # Maximum number to try.
while [[ $count -ne 0 ]] ; do
ping -c 1 172.22.11.13 # Try once.
rc=$?
if [[ $rc -eq 0 ]] ; then
((count = 1)) # If okay, flag to exit loop.
fi
((count = count - 1)) # So we don't go forever.
done
if [[ $rc -eq 0 ]] ; then # Make final determination.
echo "overcloud-compute-0 is up."
else
source /home/heat-admin/overcloudrc
comp0vms=$( nova list --all --fields host | grep compute-0 | awk '{print $2}')
for comp0vm in ${comp0vms}; do nova evacuate ${comp0vm} overcloud-compute-1.localdomain ; done
fi
evacuate-compute-1.sh:
((count = 5)) # Maximum number to try.
while [[ $count -ne 0 ]] ; do
ping -c 1 172.22.11.12 # Try once.
rc=$?
if [[ $rc -eq 0 ]] ; then
((count = 1)) # If okay, flag to exit loop.
fi
((count = count - 1)) # So we don't go forever.
done
if [[ $rc -eq 0 ]] ; then # Make final determination.
echo "overcloud-compute-1 is up."
else
source /home/heat-admin/overcloudrc
comp1vms=$( nova list --all --fields host | grep compute-1 | awk '{print $2}')
for comp1vm in ${comp1vms}; do nova evacuate ${comp1vm} overcloud-compute-0.localdomain ; done
fi
Does this help to diagnose why it was failing?
Update: after re-deploying using 7.3 images and the latest documentation it works. However, it is now happening again: the instances do not evacuate off of the failed compute node.

Findings: nova-evacuate worked when admin was the only project/user and only two networks were created, with two instances. After multiple instances were launched and more networks created, nova-evacuate no longer works. The previously mentioned script in comment 16 works around this issue.

From the customer: The manual nova evacuate command works no matter what tenant the instance was created under. The automatic nova-evacuate does not work on any instances at this point. So he is using the script to manually evacuate if there is no ping response from a compute node.

Did you test adding the patches from comment #15?

No, the customer has not tested with those patches.

(In reply to Jeremy from comment #20)
> No the customer has not tested with those patches

That's probably why it's not working. Can you build a package for them to test, or should we get you one?
Fabio

I would greatly appreciate it if you guys could build a test package, please.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Is there an analogous BZ for OSP8?

(In reply to arkady kanevsky from comment #28)
> Is there an analog BZ for OSP8?

The fix landed in the fence-agents package, which is distributed in the RHEL 7 channels. Once the package with the fix is released, updating the fence-agents package on RHEL 7 based compute nodes should resolve the issue on any release of OSP.

Thanks Marian. Mike, is this going to land in OSP8 GA, and if yes, in which beta/RC?
Thanks,
Arkady

(In reply to arkady kanevsky from comment #30)
> Thanks Marian.
> Mike, does this going to land in OSP8 GA and if yes, which beta/RC?

Arkady, the fence-agents package is not shipped as part of OSP but from the RHEL HA channel; we will make sure that the package is available by OSP8 GA.
Fabio

(In reply to Jeremy from comment #3)
> Raoul Scarazzini: He did what you suggested and now he is getting :
>
> Mar 3 15:41:39 overcloud-controller-0 lrmd[20726]: notice:
> nova-evacuate_monitor_10000:24793:stderr [ Exception from attempt to force
> host down via nova API: AttributeError: 'ServiceManager' object has no
> attribute 'force_down' ]
>
> fresh sosreports on the way soon.

I have the same problem; I applied both upstream patches but the problem persists.

Logs:

apr 14 09:54:01 overcloud-controller-1.localdomain fence_compute[14108]: Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down'

apr 14 09:54:22 overcloud-controller-1.localdomain lrmd[4322]: notice: nova-evacuate_monitor_10000:13997:stderr [ Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down' ]

> apr 14 09:54:01 overcloud-controller-1.localdomain fence_compute[14108]:
> Exception from attempt to force host down via nova API: AttributeError:
> 'ServiceManager' object has no attribute 'force_down
>
> apr 14 09:54:22 overcloud-controller-1.localdomain lrmd[4322]: notice:
> nova-evacuate_monitor_10000:13997:stderr [ Exception from attempt to force
> host down via nova API: AttributeError: 'ServiceManager' object has no
> attribute 'force_down' ]
On their own, these messages only imply that evacuations will take longer than ideal; you cannot infer from them alone that evacuation is not working.
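The `force_down` AttributeError just means the installed python-novaclient predates the Nova `os-services/force-down` API, so the agent cannot mark the service down immediately and instead waits for Nova's own heartbeat timeout. A defensive pattern for that fallback looks roughly like this (a sketch, not the exact fence_compute code; `OldServiceManager` is a hypothetical stand-in for a pre-force_down client):

```python
def set_host_down(services, host):
    """Try the fast force-down call; degrade gracefully when the
    installed client library is too old to expose it."""
    try:
        services.force_down(host, "nova-compute", force_down=True)
        return True
    except AttributeError as err:
        # Older novaclient: no force_down(). Nova will still mark the
        # service down on its own once its heartbeat times out, so
        # evacuation proceeds, just more slowly.
        print("Exception from attempt to force host down via nova API:", err)
        return False

class OldServiceManager:
    pass  # hypothetical: exposes no force_down() method

assert set_host_down(OldServiceManager(), "compute-0") is False
```

This matches the observation above: the error makes the failover slower, not impossible.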
To remove needinfo flag.

Verified on RHEL-OSP director 9.0 puddle - 2016-06-03.1

1)

+--------------------------------------+-------------------------------------+------+
| ID                                   | Host                                | Name |
+--------------------------------------+-------------------------------------+------+
| bf857b44-959c-4513-a6ca-f91ce57a52e6 | overcloud-novacompute-0.localdomain | vm01 |
| 88ac2cc4-9170-49a1-bf16-56d65312286c | overcloud-novacompute-1.localdomain | vm02 |
+--------------------------------------+-------------------------------------+------+

2) [root@overcloud-controller-0 ~]# pcs stonith fence overcloud-novacompute-0

3) [stack@puma33 ~]$ nova list --fields host,name

+--------------------------------------+-------------------------------------+------+
| ID                                   | Host                                | Name |
+--------------------------------------+-------------------------------------+------+
| 88ac2cc4-9170-49a1-bf16-56d65312286c | overcloud-novacompute-1.localdomain | vm02 |
| bf857b44-959c-4513-a6ca-f91ce57a52e6 | overcloud-novacompute-1.localdomain | vm01 |
+--------------------------------------+-------------------------------------+------+

rpm:
pacemaker-debuginfo-1.1.13-10.el7_2.2.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
pacemaker-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-cluster-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-cli-1.1.13-10.el7_2.2.x86_64
pacemaker-nagios-plugins-metadata-1.1.13-10.el7_2.2.x86_64
pacemaker-doc-1.1.13-10.el7_2.2.x86_64
pacemaker-remote-1.1.13-10.el7_2.2.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2373.html