Bug 1313561 - Automatic nova evacuate not working
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: fence-agents
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: pre-dev-freeze
Target Release: 7.0
Assigned To: Andrew Beekhof
QA Contact: Asaf Hirshberg
Docs Contact: Milan Navratil
Keywords: Unconfirmed, ZStream
Depends On:
Blocks: 1310828 1322702
Reported: 2016-03-01 17:22 EST by Jeremy
Modified: 2016-11-04 00:49 EDT
CC: 40 users

See Also:
Fixed In Version: fence-agents-4.0.11-34.el7
Doc Type: Bug Fix
Doc Text:
High Availability instances created by non-admin users are now evacuated when a compute node is turned off. Previously, the `fence_compute` agent searched only for compute instances created by admin users. As a consequence, instances created by non-admin users were not evacuated when a compute node was turned off. This update makes sure that `fence_compute` searches for instances run as any user, and compute instances are evacuated to new compute nodes as expected.
Story Points: ---
Clone Of:
: 1322702 (view as bug list)
Environment:
Last Closed: 2016-11-04 00:49:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Jeremy 2016-03-01 17:22:14 EST
Description of problem: Automatic nova evacuate is not working when using pacemaker remote. See https://access.redhat.com/articles/1544823


Version-Release number of selected component (if applicable):
openstack-nova-api-2015.1.2-18.el7ost.noarch 

How reproducible:
100%

Steps to Reproduce:
1. Set up HA instances using https://access.redhat.com/articles/1544823
2. Check which compute nodes the instances are located on.
3. Turn off one compute node.
4. Note that the instances running on that compute node are not evacuated to another compute node.

Actual results:
instances are not evacuated

Expected results:
instances evacuated to new compute node


Additional info:

[heat-admin@overcloud-controller-1 ~]$ nova list --fields name,status,host --all
+--------------------------------------+---------------+--------+---------------------------------+
| ID                                   | Name          | Status | Host                            |
+--------------------------------------+---------------+--------+---------------------------------+
| 21e14c4b-fa5c-4c1b-a313-ff4cd79f2469 | asa1          | ACTIVE | overcloud-compute-1.localdomain |
| fc10912c-f33c-4a87-ba29-9a5deab4e688 | asa2          | ACTIVE | overcloud-compute-0.localdomain |
| 862c394c-a3b3-4b4e-a589-7f13981bfcbe | mgmt.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| ecf831a0-7ae8-46a0-9bb4-648b7b25a197 | ns12.fltg.com | ACTIVE | overcloud-compute-0.localdomain |
| 6f3befc9-cadb-4a3d-a03b-62f180d8b2eb | ns13.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| b282a48c-7a0d-4aec-973d-b51924224c03 | test          | ACTIVE | overcloud-compute-0.localdomain |
| e760d8ed-38b9-4ed3-aaa6-768d3a9e0867 | web1.fltg.com | ACTIVE | overcloud-compute-0.localdomain |
| 23b9e733-258d-464e-9662-92a1a5deb374 | web2.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| a8392d16-5b9b-40db-969c-45746708602b | www.fltg.com  | ACTIVE | overcloud-compute-0.localdomain |
+--------------------------------------+---------------+--------+---------------------------------+

[heat-admin@overcloud-controller-1 ~]$ nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-02-26T18:16:08.000000 | -               |
| 6  | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-02-26T18:15:47.000000 | -               |
| 9  | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-02-26T18:15:57.000000 | -               |
| 12 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-02-26T18:16:00.000000 | -               |
| 15 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-02-26T18:15:49.000000 | -               |
| 18 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-02-26T18:16:14.000000 | -               |
| 21 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-02-26T18:15:51.000000 | -               |
| 24 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-02-26T18:15:55.000000 | -               |
| 33 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-02-26T18:16:16.000000 | -               |
| 51 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2016-02-26T18:15:52.000000 | -               |
| 54 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | down  | 2016-02-26T18:13:55.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

[heat-admin@overcloud-controller-1 ~]$ nova list --fields name,status,host --all
+--------------------------------------+---------------+--------+---------------------------------+
| ID                                   | Name          | Status | Host                            |
+--------------------------------------+---------------+--------+---------------------------------+
| 21e14c4b-fa5c-4c1b-a313-ff4cd79f2469 | asa1          | ACTIVE | overcloud-compute-1.localdomain |
| fc10912c-f33c-4a87-ba29-9a5deab4e688 | asa2          | ACTIVE | overcloud-compute-0.localdomain |
| 862c394c-a3b3-4b4e-a589-7f13981bfcbe | mgmt.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| ecf831a0-7ae8-46a0-9bb4-648b7b25a197 | ns12.fltg.com | ACTIVE | overcloud-compute-0.localdomain |
| 6f3befc9-cadb-4a3d-a03b-62f180d8b2eb | ns13.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| b282a48c-7a0d-4aec-973d-b51924224c03 | test          | ACTIVE | overcloud-compute-0.localdomain |
| e760d8ed-38b9-4ed3-aaa6-768d3a9e0867 | web1.fltg.com | ACTIVE | overcloud-compute-0.localdomain |
| 23b9e733-258d-464e-9662-92a1a5deb374 | web2.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| a8392d16-5b9b-40db-969c-45746708602b | www.fltg.com  | ACTIVE | overcloud-compute-0.localdomain |
+--------------------------------------+---------------+--------+---------------------------------+

Compute-0 is turned off, yet the instances are not evacuated to compute-1. After manually bringing compute-0 back up, the instances show in an ERROR state.
Comment 2 Raoul Scarazzini 2016-03-03 05:44:58 EST
Jeremy, I think I found the problem. If you look at the status of the resources, we've got a problem here:

Mar 01 08:01:13 [26035] overcloud-controller-0.localdomain        cib:     info: cib_perform_op:        ++ /cib/status/node_state[@id='overcloud-compute-0']/lrm[@id='overcloud-compute-0']/lrm_resources:  <lrm_resource id="nova-compute" type="NovaCompute" class="ocf" provider="openstack"/>

But NovaCompute is not present in the version of the resource-agents package you are using on the system (resource-agents-3.9.5-54.el7_2.6.x86_64), which changes the way instance HA works.
Modifications to the KB article are ongoing, but if you want to apply a quick fix, you can try the following (from a controller, as root):

source ./overcloudrc
pcs resource delete nova-compute
pcs resource cleanup nova-compute
pcs resource create nova-compute-checkevacuate ocf:openstack:nova-compute-wait auth_url=$OS_AUTH_URL username=$OS_USERNAME password=$OS_PASSWORD tenant_name=$OS_TENANT_NAME domain=localdomain no_shared_storage=1 op start timeout=300 --clone interleave=true --disabled --force
pcs constraint location nova-compute-checkevacuate-clone rule resource-discovery=exclusive score=0 osprole eq compute
pcs constraint order start openstack-nova-conductor-clone then nova-compute-checkevacuate-clone require-all=false
pcs resource create nova-compute systemd:openstack-nova-compute --clone interleave=true --disabled --force
pcs constraint location nova-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute
pcs constraint order start nova-compute-checkevacuate-clone then nova-compute-clone require-all=true
pcs constraint order start nova-compute-clone then nova-evacuate require-all=false
pcs constraint order start libvirtd-compute-clone then nova-compute-clone
pcs constraint colocation add nova-compute-clone with libvirtd-compute-clone

So, in short, you need to remove the old nova-compute resource, introduce nova-compute-checkevacuate (which uses the new nova-compute-wait resource agent), and redeclare nova-compute as a plain systemd resource.

After this, clean up the environment and try a new instance-ha test.
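Before re-testing, it is also worth double-checking that the new resources and constraints actually landed. A minimal sketch, assuming the resource names used in the commands above:

pcs status | grep -i nova
pcs resource show nova-compute-checkevacuate-clone
pcs constraint order show | grep -i nova
pcs constraint location show | grep -i nova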
Comment 3 Jeremy 2016-03-03 17:33:15 EST
Raoul Scarazzini: The customer did what you suggested and now he is getting:

Mar  3 15:41:39 overcloud-controller-0 lrmd[20726]:  notice: nova-evacuate_monitor_10000:24793:stderr [ Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down' ]

fresh sosreports on the way soon.
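For reference, that traceback comes from fence_compute calling the nova services API's force_down method, which needs a python-novaclient new enough to expose it (the call was added along with Nova API microversion 2.11). A quick check from a controller, as a sketch (package name assumed to be python-novaclient):

rpm -q python-novaclient
python -c 'from novaclient.v2 import services; print(hasattr(services.ServiceManager, "force_down"))'
# False here would explain the AttributeError above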
Comment 4 Raoul Scarazzini 2016-03-04 09:27:31 EST
OK, the updated knowledge base article is online here [1], so first of all let's check whether the package versions meet the requirements.

[1] https://access.redhat.com/articles/1544823
Comment 10 Andrew Beekhof 2016-03-10 18:45:01 EST
Based on the logs, we're getting all the way to:

Mar 09 13:58:54 overcloud-controller-2.localdomain NovaEvacuate(nova-evacuate)[1975]: NOTICE: Initiating evacuation of overcloud-compute-0.localdomain

The evacuation even claims to have completed successfully:

Mar 09 13:58:57 overcloud-controller-2.localdomain NovaEvacuate(nova-evacuate)[2233]: NOTICE: Completed evacuation of overcloud-compute-0.localdomain

The question to be answered is which instances, if any, the agent attempted to move, and what got in the way.

Happy to take part in a live debug; I can be found on #rhos-pidone now.
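If it helps the live debug, one way to see what nova actually tried for a given instance is its action log; a sketch, using one of the instance IDs from the listings above:

source ./overcloudrc
nova instance-action-list fc10912c-f33c-4a87-ba29-9a5deab4e688   # asa2, which was on compute-0; an 'evacuate' action should appear if one was requested
grep -iE 'fence_compute|NovaEvacuate' /var/log/messages           # agent-side view on the controllers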
Comment 14 Andrew Beekhof 2016-03-15 22:30:24 EDT
Can you confirm if instances are being created with the admin credentials?
If not, please retest with the following patch:

diff --git a/sbin/fence_compute b/sbin/fence_compute
index 4538beb..fb28b9d 100644
--- a/sbin/fence_compute
+++ b/sbin/fence_compute
@@ -103,7 +103,7 @@ def _get_evacuable_images():
 
 def _host_evacuate(options):
        result = True
-       servers = nova.servers.list(search_opts={'host': options["--plug"]})
+       servers = nova.servers.list(search_opts={'host': options["--plug"], 'all_tenants': 1})
        if options["--instance-filtering"] == "False":
                evacuables = servers
        else:


Background: so far our testing had created instances as the admin user, and in such cases evacuation works. However, it is now understood that this is not typical.

The above patch allows the agent to find instances created by tenants other than 'admin' and therefore makes it possible to evacuate them.

The patch successfully resolved evacuation issues on another installation, so there is strong reason to believe it will work here too. Please confirm whether the fix works for you and we'll get it into official packages.
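The effect of the patch can also be illustrated from the CLI: as admin, listing without --all-tenants returns only the admin tenant's own servers, which is effectively what the unpatched agent saw. A sketch, using the overcloudrc credentials from this deployment:

source ./overcloudrc
nova list --host overcloud-compute-0.localdomain                 # admin-owned servers only
nova list --all-tenants --host overcloud-compute-0.localdomain   # servers from every tenant, as the patched agent sees them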
Comment 16 Jeremy 2016-03-21 10:30:30 EDT
From the customer:
My workaround for this is to have the following scripts run every minute (see the cron sketch after the scripts). Since I only have two compute nodes, this is an acceptable workaround.

evacuate-compute-0.sh:

((count = 5))                            # Maximum number to try.
while [[ $count -ne 0 ]] ; do
    ping -c 1 172.22.11.13                      # Try once.
    rc=$?
    if [[ $rc -eq 0 ]] ; then
        ((count = 1))                      # If okay, flag to exit loop.
    fi
    ((count = count - 1))                  # So we don't go forever.
done

if [[ $rc -eq 0 ]] ; then                  # Make final determination.
    echo "overcloud-compute-0 is up."
else
    source /home/heat-admin/overcloudrc
    comp0vms=$( nova list --all --fields host | grep compute-0 | awk '{print $2}')
    for comp0vm in ${comp0vms}; do nova evacuate ${comp0vm} overcloud-compute-1.localdomain ; done
fi

evacuate-compute-1.sh:

((count = 5))                            # Maximum number to try.
while [[ $count -ne 0 ]] ; do
    ping -c 1 172.22.11.12                      # Try once.
    rc=$?
    if [[ $rc -eq 0 ]] ; then
        ((count = 1))                      # If okay, flag to exit loop.
    fi
    ((count = count - 1))                  # So we don't go forever.
done

if [[ $rc -eq 0 ]] ; then                  # Make final determination.
    echo "overcloud-compute-1 is up."
else
    source /home/heat-admin/overcloudrc
    comp1vms=$( nova list --all --fields host | grep compute-1 | awk '{print $2}')
    for comp1vm in ${comp1vms}; do nova evacuate ${comp1vm} overcloud-compute-0.localdomain ; done
fi
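
For completeness, "run every minute" presumably translates to crontab entries along these lines; the script paths and log locations are assumptions:

* * * * * /home/heat-admin/evacuate-compute-0.sh >> /home/heat-admin/evacuate-compute-0.log 2>&1
* * * * * /home/heat-admin/evacuate-compute-1.sh >> /home/heat-admin/evacuate-compute-1.log 2>&1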


Does this help to diagnose why it was failing?
Comment 17 Jeremy 2016-03-21 10:34:29 EDT
update:

After re-deploying with the 7.3 images and the latest documentation it works. However:

It is now happening again: the instances do not evacuate off the failed compute node.

Findings:

nova-evacuate worked when admin was the only project/user and only two networks with two instances had been created.

After multiple instances were launched and more networks were created, nova-evacuate no longer works.

The script mentioned in comment 16 works around the issue.
Comment 18 Jeremy 2016-03-21 14:12:42 EDT
From the customer:
The manual nova evacuate command works no matter what tenant the instance was created under. The automatic nova-evacuate does not work on any instances at this point.


So he is using the script to manually evacuate when a compute node stops responding to ping.
Comment 19 Fabio Massimo Di Nitto 2016-03-21 14:19:39 EDT
Did you test adding the patches from comment #15?
Comment 20 Jeremy 2016-03-21 14:50:20 EDT
No the customer has not tested with those patches
Comment 21 Fabio Massimo Di Nitto 2016-03-21 15:13:32 EDT
(In reply to Jeremy from comment #20)
> No the customer has not tested with those patches

That's probably why it's not working. Can you build a package for them to test, or should we get you one?
Comment 22 Jeremy 2016-03-22 11:03:54 EDT
Fabio, I would greatly appreciate it if you could build a test package, please.
Comment 27 Mike McCune 2016-03-28 17:59:03 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
Comment 28 arkady kanevsky 2016-03-28 18:12:07 EDT
Is there an analog BZ for OSP8?
Comment 29 Marian Krcmarik 2016-03-28 18:29:59 EDT
(In reply to arkady kanevsky from comment #28)
> Is there an analog BZ for OSP8?

The fix landed in the fence-agents package, which is distributed through the RHEL 7 channels. Once the package with the fix is released, updating the fence-agents package on the RHEL 7 based compute nodes should resolve the issue on any release of OSP.
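In practice, once the erratum ships this means updating the agent on the compute nodes, roughly as follows (a sketch; it assumes fence_compute is shipped in the fence-agents-compute subpackage):

yum update 'fence-agents-*'        # on each RHEL 7 compute node
rpm -q fence-agents-compute        # should report 4.0.11-34.el7 or later, per the Fixed In Version above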
Comment 30 arkady kanevsky 2016-03-29 09:49:18 EDT
Thanks Marian.
Mike, is this going to land in OSP8 GA, and if so, in which beta/RC?
Thanks,
Arkady
Comment 31 Fabio Massimo Di Nitto 2016-03-29 09:52:24 EDT
(In reply to arkady kanevsky from comment #30)
> Thanks Marian.
> Mike, is this going to land in OSP8 GA, and if so, in which beta/RC?
> Thanks,
> Arkady

Arkady, the fence-agents package is not shipped as part of OSP but from the RHEL HA channel. We will make sure that the package is available by OSP8 GA.

Fabio
Comment 37 Daniel Bellantuono 2016-04-14 06:02:49 EDT
(In reply to Jeremy from comment #3)
> Raoul Scarazzini: He did what you suggested and now he is getting :
> 
> Mar  3 15:41:39 overcloud-controller-0 lrmd[20726]:  notice:
> nova-evacuate_monitor_10000:24793:stderr [ Exception from attempt to force
> host down via nova API: AttributeError: 'ServiceManager' object has no
> attribute 'force_down' ]
> 
> fresh sosreports on the way soon.

I have the same problem. I applied both upstream patches, but the problem persists.

logs:
apr 14 09:54:01 overcloud-controller-1.localdomain fence_compute[14108]: Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down

apr 14 09:54:22 overcloud-controller-1.localdomain lrmd[4322]:   notice: nova-evacuate_monitor_10000:13997:stderr [ Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down' ]
Comment 38 Andrew Beekhof 2016-04-14 23:37:05 EDT
> apr 14 09:54:01 overcloud-controller-1.localdomain fence_compute[14108]:
> Exception from attempt to force host down via nova API: AttributeError:
> 'ServiceManager' object has no attribute 'force_down
> 
> apr 14 09:54:22 overcloud-controller-1.localdomain lrmd[4322]:   notice:
> nova-evacuate_monitor_10000:13997:stderr [ Exception from attempt to force
> host down via nova API: AttributeError: 'ServiceManager' object has no
> attribute 'force_down' ]

On their own, these messages only imply that evacuations will take longer than ideal; you can't infer from these alone that evacuation isn't working.
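
The reason they only mean "slower" is that without force_down the agent has to wait for nova itself to mark the compute service as down, which is governed by nova's service_down_time, before evacuation can proceed. A sketch of how to check that setting, assuming crudini is available on the controllers:

crudini --get /etc/nova/nova.conf DEFAULT service_down_time   # errors if unset; the nova default is 60 seconds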
Comment 39 Rob Young 2016-05-23 12:34:45 EDT
To remove needinfo flag.
Comment 41 Asaf Hirshberg 2016-06-06 07:05:23 EDT
Verified on RHEL-OSP director 9.0 puddle - 2016-06-03.1

1) Instances before fencing:
+--------------------------------------+-------------------------------------+------+
| ID                                   | Host                                | Name |
+--------------------------------------+-------------------------------------+------+
| bf857b44-959c-4513-a6ca-f91ce57a52e6 | overcloud-novacompute-0.localdomain | vm01 |
| 88ac2cc4-9170-49a1-bf16-56d65312286c | overcloud-novacompute-1.localdomain | vm02 |
+--------------------------------------+-------------------------------------+------+
2) Fence overcloud-novacompute-0:
[root@overcloud-controller-0 ~]# pcs stonith fence overcloud-novacompute-0

3) Instances after fencing (both VMs now on overcloud-novacompute-1):
[stack@puma33 ~]$ nova list --fields host,name
+--------------------------------------+-------------------------------------+------+
| ID                                   | Host                                | Name |
+--------------------------------------+-------------------------------------+------+
| 88ac2cc4-9170-49a1-bf16-56d65312286c | overcloud-novacompute-1.localdomain | vm02 |
| bf857b44-959c-4513-a6ca-f91ce57a52e6 | overcloud-novacompute-1.localdomain | vm01 |
+--------------------------------------+-------------------------------------+------+



rpm:
pacemaker-debuginfo-1.1.13-10.el7_2.2.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
pacemaker-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-cluster-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-cli-1.1.13-10.el7_2.2.x86_64
pacemaker-nagios-plugins-metadata-1.1.13-10.el7_2.2.x86_64
pacemaker-doc-1.1.13-10.el7_2.2.x86_64
pacemaker-remote-1.1.13-10.el7_2.2.x86_64
Comment 43 errata-xmlrpc 2016-11-04 00:49:35 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2373.html
