Bug 1313561 - Automatic nova evacuate not working
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: fence-agents
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: pre-dev-freeze
Target Release: 7.0
Assigned To: Andrew Beekhof
QA Contact: Asaf Hirshberg
Docs Contact: Milan Navratil
Keywords: Unconfirmed, ZStream
Depends On:
Blocks: 1310828 1322702
Reported: 2016-03-01 17:22 EST by Jeremy
Modified: 2016-11-04 00:49 EDT
CC: 40 users

See Also:
Fixed In Version: fence-agents-4.0.11-34.el7
Doc Type: Bug Fix
Doc Text:
High Availability instances created by non-admin users are now evacuated when a compute node is turned off. Previously, the `fence_compute` agent searched only for compute instances created by admin users. As a consequence, instances created by non-admin users were not evacuated when a compute node was turned off. This update makes sure that `fence_compute` searches for instances run as any user, and compute instances are evacuated to new compute nodes as expected.
Story Points: ---
Clone Of:
: 1322702 (view as bug list)
Environment:
Last Closed: 2016-11-04 00:49:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Jeremy 2016-03-01 17:22:14 EST
Description of problem: Automatic nova evacuate is not working when using pacemaker remote. See https://access.redhat.com/articles/1544823


Version-Release number of selected component (if applicable):
openstack-nova-api-2015.1.2-18.el7ost.noarch 

How reproducible:
100%

Steps to Reproduce:
1. Set up HA instances using https://access.redhat.com/articles/1544823
2. Check which compute nodes the instances are located on.
3. Turn off one compute node.
4. Note that the instances running on that compute node are not evacuated to another compute node.

Actual results:
instances are not evacuated

Expected results:
instances evacuated to new compute node


Additional info:

[heat-admin@overcloud-controller-1 ~]$ nova list --fields name,status,host --all
+--------------------------------------+---------------+--------+---------------------------------+
| ID                                   | Name          | Status | Host                            |
+--------------------------------------+---------------+--------+---------------------------------+
| 21e14c4b-fa5c-4c1b-a313-ff4cd79f2469 | asa1          | ACTIVE | overcloud-compute-1.localdomain |
| fc10912c-f33c-4a87-ba29-9a5deab4e688 | asa2          | ACTIVE | overcloud-compute-0.localdomain |
| 862c394c-a3b3-4b4e-a589-7f13981bfcbe | mgmt.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| ecf831a0-7ae8-46a0-9bb4-648b7b25a197 | ns12.fltg.com | ACTIVE | overcloud-compute-0.localdomain |
| 6f3befc9-cadb-4a3d-a03b-62f180d8b2eb | ns13.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| b282a48c-7a0d-4aec-973d-b51924224c03 | test          | ACTIVE | overcloud-compute-0.localdomain |
| e760d8ed-38b9-4ed3-aaa6-768d3a9e0867 | web1.fltg.com | ACTIVE | overcloud-compute-0.localdomain |
| 23b9e733-258d-464e-9662-92a1a5deb374 | web2.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| a8392d16-5b9b-40db-969c-45746708602b | www.fltg.com  | ACTIVE | overcloud-compute-0.localdomain |
+--------------------------------------+---------------+--------+---------------------------------+

[heat-admin@overcloud-controller-1 ~]$ nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-02-26T18:16:08.000000 | -               |
| 6  | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-02-26T18:15:47.000000 | -               |
| 9  | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-02-26T18:15:57.000000 | -               |
| 12 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-02-26T18:16:00.000000 | -               |
| 15 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-02-26T18:15:49.000000 | -               |
| 18 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-02-26T18:16:14.000000 | -               |
| 21 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-02-26T18:15:51.000000 | -               |
| 24 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-02-26T18:15:55.000000 | -               |
| 33 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-02-26T18:16:16.000000 | -               |
| 51 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2016-02-26T18:15:52.000000 | -               |
| 54 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | down  | 2016-02-26T18:13:55.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

[heat-admin@overcloud-controller-1 ~]$ nova list --fields name,status,host --all
+--------------------------------------+---------------+--------+---------------------------------+
| ID                                   | Name          | Status | Host                            |
+--------------------------------------+---------------+--------+---------------------------------+
| 21e14c4b-fa5c-4c1b-a313-ff4cd79f2469 | asa1          | ACTIVE | overcloud-compute-1.localdomain |
| fc10912c-f33c-4a87-ba29-9a5deab4e688 | asa2          | ACTIVE | overcloud-compute-0.localdomain |
| 862c394c-a3b3-4b4e-a589-7f13981bfcbe | mgmt.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| ecf831a0-7ae8-46a0-9bb4-648b7b25a197 | ns12.fltg.com | ACTIVE | overcloud-compute-0.localdomain |
| 6f3befc9-cadb-4a3d-a03b-62f180d8b2eb | ns13.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| b282a48c-7a0d-4aec-973d-b51924224c03 | test          | ACTIVE | overcloud-compute-0.localdomain |
| e760d8ed-38b9-4ed3-aaa6-768d3a9e0867 | web1.fltg.com | ACTIVE | overcloud-compute-0.localdomain |
| 23b9e733-258d-464e-9662-92a1a5deb374 | web2.fltg.com | ACTIVE | overcloud-compute-1.localdomain |
| a8392d16-5b9b-40db-969c-45746708602b | www.fltg.com  | ACTIVE | overcloud-compute-0.localdomain |
+--------------------------------------+---------------+--------+---------------------------------+

Compute-0 is turned off, yet the instances are not evacuated to compute-1. After manually bringing compute-0 back up, the instances show in an ERROR state.
Comment 2 Raoul Scarazzini 2016-03-03 05:44:58 EST
Jeremy, I think I found the problem. If you look at the status of the resources, we've got a problem here:

Mar 01 08:01:13 [26035] overcloud-controller-0.localdomain        cib:     info: cib_perform_op:        ++ /cib/status/node_state[@id='overcloud-compute-0']/lrm[@id='overcloud-compute-0']/lrm_resources:  <lrm_resource id="nova-compute" type="NovaCompute" class="ocf" provider="openstack"/>

But NovaCompute is not present in the version of the resource-agents package you are using on the system (resource-agents-3.9.5-54.el7_2.6.x86_64), which changes the way instance HA works.
Modifications to the KB article are ongoing, but if you want to apply a quick fix, you can try the following (from a controller, as root):

source ./overcloudrc
pcs resource delete nova-compute
pcs resource cleanup nova-compute
pcs resource create nova-compute-checkevacuate ocf:openstack:nova-compute-wait auth_url=$OS_AUTH_URL username=$OS_USERNAME password=$OS_PASSWORD tenant_name=$OS_TENANT_NAME domain=localdomain no_shared_storage=1 op start timeout=300 --clone interleave=true --disabled --force
pcs constraint location nova-compute-checkevacuate-clone rule resource-discovery=exclusive score=0 osprole eq compute
pcs constraint order start openstack-nova-conductor-clone then nova-compute-checkevacuate-clone require-all=false
pcs resource create nova-compute systemd:openstack-nova-compute --clone interleave=true --disabled --force
pcs constraint location nova-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute
pcs constraint order start nova-compute-checkevacuate-clone then nova-compute-clone require-all=true
pcs constraint order start nova-compute-clone then nova-evacuate require-all=false
pcs constraint order start libvirtd-compute-clone then nova-compute-clone
pcs constraint colocation add nova-compute-clone with libvirtd-compute-clone

So, in short, you need to remove the old nova-compute resource, introduce nova-compute-checkevacuate (which uses the new nova-compute-wait resource agent), and redeclare nova-compute as a plain systemd resource.

After this, clean up the environment and try a new instance-ha test.
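Before re-testing, it is also worth double-checking that the new resources and constraints actually landed. A minimal sketch, assuming the resource names used in the commands above:

pcs status | grep -i nova
pcs resource show nova-compute-checkevacuate-clone
pcs constraint order show | grep -i nova
pcs constraint location show | grep -i nova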
Comment 3 Jeremy 2016-03-03 17:33:15 EST
Raoul Scarazzini: The customer did what you suggested and now he is getting:

Mar  3 15:41:39 overcloud-controller-0 lrmd[20726]:  notice: nova-evacuate_monitor_10000:24793:stderr [ Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down' ]

fresh sosreports on the way soon.
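For reference, that traceback comes from fence_compute calling the nova services API's force_down method, which needs a python-novaclient new enough to expose it (the call was added along with Nova API microversion 2.11). A quick check from a controller, as a sketch (package name assumed to be python-novaclient):

rpm -q python-novaclient
python -c 'from novaclient.v2 import services; print(hasattr(services.ServiceManager, "force_down"))'
# False here would explain the AttributeError above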
Comment 4 Raoul Scarazzini 2016-03-04 09:27:31 EST
OK, the updated knowledge base article is online here [1], so first of all let's check whether the package versions meet the requirements.

[1] https://access.redhat.com/articles/1544823
Comment 10 Andrew Beekhof 2016-03-10 18:45:01 EST
Based on the logs, we're getting all the way to:

Mar 09 13:58:54 overcloud-controller-2.localdomain NovaEvacuate(nova-evacuate)[1975]: NOTICE: Initiating evacuation of overcloud-compute-0.localdomain

The evacuation even claims to have completed successfully:

Mar 09 13:58:57 overcloud-controller-2.localdomain NovaEvacuate(nova-evacuate)[2233]: NOTICE: Completed evacuation of overcloud-compute-0.localdomain

The question to be answered is which instances, if any, the agent attempted to move, and what got in the way.

Happy to take part in a live debug; I can be found on #rhos-pidone now.
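If it helps the live debug, one way to see what nova actually tried for a given instance is its action log; a sketch, using one of the instance IDs from the listings above:

source ./overcloudrc
nova instance-action-list fc10912c-f33c-4a87-ba29-9a5deab4e688   # asa2, which was on compute-0; an 'evacuate' action should appear if one was requested
grep -iE 'fence_compute|NovaEvacuate' /var/log/messages           # agent-side view on the controllers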
Comment 14 Andrew Beekhof 2016-03-15 22:30:24 EDT
Can you confirm if instances are being created with the admin credentials?
If not, please retest with the following patch:

diff --git a/sbin/fence_compute b/sbin/fence_compute
index 4538beb..fb28b9d 100644
--- a/sbin/fence_compute
+++ b/sbin/fence_compute
@@ -103,7 +103,7 @@ def _get_evacuable_images():
 
 def _host_evacuate(options):
        result = True
-       servers = nova.servers.list(search_opts={'host': options["--plug"]})
+       servers = nova.servers.list(search_opts={'host': options["--plug"], 'all_tenants': 1})
        if options["--instance-filtering"] == "False":
                evacuables = servers
        else:


Background: so far our testing had created instances as the admin user, and in such cases evacuation works. However, it is now understood that this is not typical.

The above patch allows the agent to find instances created by tenants other than 'admin' and therefore makes it possible to evacuate them.

The patch successfully resolved evacuation issues on another installation, so there is strong reason to believe it will work here too. Please confirm whether the fix works for you and we'll get it into official packages.
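The effect of the patch can also be illustrated from the CLI: as admin, listing without --all-tenants returns only the admin tenant's own servers, which is effectively what the unpatched agent saw. A sketch, using the overcloudrc credentials from this deployment:

source ./overcloudrc
nova list --host overcloud-compute-0.localdomain                 # admin-owned servers only
nova list --all-tenants --host overcloud-compute-0.localdomain   # servers from every tenant, as the patched agent sees them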
Comment 16 Jeremy 2016-03-21 10:30:30 EDT
From the customer:
My workaround for this is to have the following scripts run every minute (see the cron sketch after the scripts). Since I only have two compute nodes, this is an acceptable workaround.

evacuate-compute-0.sh:

((count = 5))                            # Maximum number to try.
while [[ $count -ne 0 ]] ; do
    ping -c 1 172.22.11.13                      # Try once.
    rc=$?
    if [[ $rc -eq 0 ]] ; then
        ((count = 1))                      # If okay, flag to exit loop.
    fi
    ((count = count - 1))                  # So we don't go forever.
done

if [[ $rc -eq 0 ]] ; then                  # Make final determination.
    echo "overcloud-compute-0 is up."
else
    source /home/heat-admin/overcloudrc
    comp0vms=$( nova list --all --fields host | grep compute-0 | awk '{print $2}')
    for comp0vm in ${comp0vms}; do nova evacuate ${comp0vm} overcloud-compute-1.localdomain ; done
fi

evacuate-compute-1.sh:

((count = 5))                            # Maximum number to try.
while [[ $count -ne 0 ]] ; do
    ping -c 1 172.22.11.12                      # Try once.
    rc=$?
    if [[ $rc -eq 0 ]] ; then
        ((count = 1))                      # If okay, flag to exit loop.
    fi
    ((count = count - 1))                  # So we don't go forever.
done

if [[ $rc -eq 0 ]] ; then                  # Make final determination.
    echo "overcloud-compute-1 is up."
else
    source /home/heat-admin/overcloudrc
    comp1vms=$( nova list --all --fields host | grep compute-1 | awk '{print $2}')
    for comp1vm in ${comp1vms}; do nova evacuate ${comp1vm} overcloud-compute-0.localdomain ; done
fi
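
For completeness, "run every minute" presumably translates to crontab entries along these lines; the script paths and log locations are assumptions:

* * * * * /home/heat-admin/evacuate-compute-0.sh >> /home/heat-admin/evacuate-compute-0.log 2>&1
* * * * * /home/heat-admin/evacuate-compute-1.sh >> /home/heat-admin/evacuate-compute-1.log 2>&1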


Does this help to diagnose why it was failing?
Comment 17 Jeremy 2016-03-21 10:34:29 EDT
update:

After re-deploying with the 7.3 images and the latest documentation it works. However:

It is now happening again: the instances do not evacuate off the failed compute node.

Findings:

nova-evacuate worked when admin was the only project/user and only two networks with two instances had been created.

After multiple instances were launched and more networks were created, nova-evacuate no longer works.

The script mentioned in comment 16 works around the issue.
Comment 18 Jeremy 2016-03-21 14:12:42 EDT
From the customer:
The manual nova evacuate command works no matter what tenant the instance was created under. The automatic nova-evacuate does not work on any instances at this point.


So he is using the script to manually evacuate when a compute node stops responding to ping.
Comment 19 Fabio Massimo Di Nitto 2016-03-21 14:19:39 EDT
Did you test adding the patches from comment #15?
Comment 20 Jeremy 2016-03-21 14:50:20 EDT
No the customer has not tested with those patches
Comment 21 Fabio Massimo Di Nitto 2016-03-21 15:13:32 EDT
(In reply to Jeremy from comment #20)
> No the customer has not tested with those patches

That's probably why it's not working. Can you build a package for them to test, or should we get you one?
Comment 22 Jeremy 2016-03-22 11:03:54 EDT
Fabio, I would greatly appreciate it if you could build a test package, please.
Comment 27 Mike McCune 2016-03-28 17:59:03 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
Comment 28 arkady kanevsky 2016-03-28 18:12:07 EDT
Is there an analog BZ for OSP8?
Comment 29 Marian Krcmarik 2016-03-28 18:29:59 EDT
(In reply to arkady kanevsky from comment #28)
> Is there an analog BZ for OSP8?

The fix landed in the fence-agents package, which is distributed through the RHEL 7 channels. Once the package with the fix is released, updating the fence-agents package on the RHEL 7 based compute nodes should resolve the issue on any release of OSP.
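In practice, once the erratum ships this means updating the agent on the compute nodes, roughly as follows (a sketch; it assumes fence_compute is shipped in the fence-agents-compute subpackage):

yum update 'fence-agents-*'        # on each RHEL 7 compute node
rpm -q fence-agents-compute        # should report 4.0.11-34.el7 or later, per the Fixed In Version above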
Comment 30 arkady kanevsky 2016-03-29 09:49:18 EDT
Thanks Marian.
Mike, is this going to land in OSP8 GA, and if so, in which beta/RC?
Thanks,
Arkady
Comment 31 Fabio Massimo Di Nitto 2016-03-29 09:52:24 EDT
(In reply to arkady kanevsky from comment #30)
> Thanks Marian.
> Mike, is this going to land in OSP8 GA, and if so, in which beta/RC?
> Thanks,
> Arkady

Arkady, the fence-agents package is not shipped as part of OSP but from the RHEL HA channel. We will make sure that the package is available by OSP8 GA.

Fabio
Comment 37 Daniel Bellantuono 2016-04-14 06:02:49 EDT
(In reply to Jeremy from comment #3)
> Raoul Scarazzini: He did what you suggested and now he is getting :
> 
> Mar  3 15:41:39 overcloud-controller-0 lrmd[20726]:  notice:
> nova-evacuate_monitor_10000:24793:stderr [ Exception from attempt to force
> host down via nova API: AttributeError: 'ServiceManager' object has no
> attribute 'force_down' ]
> 
> fresh sosreports on the way soon.

I have the same problem. I applied both upstream patches, but the problem persists.

logs:
apr 14 09:54:01 overcloud-controller-1.localdomain fence_compute[14108]: Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down

apr 14 09:54:22 overcloud-controller-1.localdomain lrmd[4322]:   notice: nova-evacuate_monitor_10000:13997:stderr [ Exception from attempt to force host down via nova API: AttributeError: 'ServiceManager' object has no attribute 'force_down' ]
Comment 38 Andrew Beekhof 2016-04-14 23:37:05 EDT
> apr 14 09:54:01 overcloud-controller-1.localdomain fence_compute[14108]:
> Exception from attempt to force host down via nova API: AttributeError:
> 'ServiceManager' object has no attribute 'force_down
> 
> apr 14 09:54:22 overcloud-controller-1.localdomain lrmd[4322]:   notice:
> nova-evacuate_monitor_10000:13997:stderr [ Exception from attempt to force
> host down via nova API: AttributeError: 'ServiceManager' object has no
> attribute 'force_down' ]

On their own, these messages only imply that evacuations will take longer than ideal; you can't infer from these alone that evacuation isn't working.
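
The reason they only mean "slower" is that without force_down the agent has to wait for nova itself to mark the compute service as down, which is governed by nova's service_down_time, before evacuation can proceed. A sketch of how to check that setting, assuming crudini is available on the controllers:

crudini --get /etc/nova/nova.conf DEFAULT service_down_time   # errors if unset; the nova default is 60 seconds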
Comment 39 Rob Young 2016-05-23 12:34:45 EDT
To remove needinfo flag.
Comment 41 Asaf Hirshberg 2016-06-06 07:05:23 EDT
Verified on RHEL-OSP director 9.0 puddle - 2016-06-03.1

1) Instances before fencing:
+--------------------------------------+-------------------------------------+------+
| ID                                   | Host                                | Name |
+--------------------------------------+-------------------------------------+------+
| bf857b44-959c-4513-a6ca-f91ce57a52e6 | overcloud-novacompute-0.localdomain | vm01 |
| 88ac2cc4-9170-49a1-bf16-56d65312286c | overcloud-novacompute-1.localdomain | vm02 |
+--------------------------------------+-------------------------------------+------+
2) Fence overcloud-novacompute-0:
[root@overcloud-controller-0 ~]# pcs stonith fence overcloud-novacompute-0

3) Instances after fencing (both VMs now on overcloud-novacompute-1):
[stack@puma33 ~]$ nova list --fields host,name
+--------------------------------------+-------------------------------------+------+
| ID                                   | Host                                | Name |
+--------------------------------------+-------------------------------------+------+
| 88ac2cc4-9170-49a1-bf16-56d65312286c | overcloud-novacompute-1.localdomain | vm02 |
| bf857b44-959c-4513-a6ca-f91ce57a52e6 | overcloud-novacompute-1.localdomain | vm01 |
+--------------------------------------+-------------------------------------+------+



rpm:
pacemaker-debuginfo-1.1.13-10.el7_2.2.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
pacemaker-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-cluster-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-cli-1.1.13-10.el7_2.2.x86_64
pacemaker-nagios-plugins-metadata-1.1.13-10.el7_2.2.x86_64
pacemaker-doc-1.1.13-10.el7_2.2.x86_64
pacemaker-remote-1.1.13-10.el7_2.2.x86_64
Comment 43 errata-xmlrpc 2016-11-04 00:49:35 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2373.html
