Bug 1403728 - Unexpected "Instance not resizing, skipping migration" caused by nova evacuation
Summary: Unexpected "Instance not resizing, skipping migration" caused by nova evacuation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: Rajesh Tailor
QA Contact: Prasanth Anbalagan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-12 09:43 UTC by Chen
Modified: 2020-12-14 07:56 UTC
CC List: 16 users

Fixed In Version: openstack-nova-12.0.6-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-08 17:47:14 UTC
Target Upstream Version:
Embargoed:


Links
  OpenStack gerrit 318832 (MERGED): Completed migrations are not "in progress" (last updated 2021-02-08 07:17:26 UTC)
  Red Hat Product Errata RHBA-2017:0465 (SHIPPED_LIVE): openstack-nova bug fix advisory (last updated 2017-03-08 22:44:58 UTC)

Internal Links: 1669335

Description Chen 2016-12-12 09:43:54 UTC
Description of problem:

An unexpected "Instance not resizing, skipping migration" warning is logged by the resource tracker:

2016-12-02 20:31:16.514 4580 WARNING nova.compute.resource_tracker [req-c376a2a1-b82b-4741-bde0-585758d7de42 - - - - -] [instance: 60dd704d-c5e8-41c6-843e-4d7e5e4d3cf0] Instance not resizing, skipping migration.

# mysql -u root -e "select * from migrations" nova
+---------------------+---------------------+------------+-----+---------------------------+---------------------------+------------+-----------+--------------------------------------+----------------------+----------------------+---------------------------+---------------------------+---------+----------------+--------+
| created_at          | updated_at          | deleted_at | id  | source_compute            | dest_compute              | dest_host  | status    | instance_uuid                        | old_instance_type_id | new_instance_type_id | source_node               | dest_node                 | deleted | migration_type | hidden |
+---------------------+---------------------+------------+-----+---------------------------+---------------------------+------------+-----------+--------------------------------------+----------------------+----------------------+---------------------------+---------------------------+---------+----------------+--------+
| 2016-11-30 09:42:18 | 2016-11-30 09:46:53 | NULL       |  20 | adc-r3-az2-compute002.bky | adc-r3-az2-compute014.bky | 10.1.17.54 | completed | 60dd704d-c5e8-41c6-843e-4d7e5e4d3cf0 |                 NULL |                 NULL | adc-r3-az2-compute002.bky | adc-r3-az2-compute014.bky |       0 | evacuation     |      0 |


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Chen 2016-12-12 09:45:19 UTC
I can see that the status of this migration record is "completed" while its type is "evacuation".

compute/resource_tracker.py

    778         for migration in migrations:
    779             uuid = migration.instance_uuid
    780 
    781             try:
    782                 if uuid not in instances:
    783                     instances[uuid] = migration.instance
    784             except exception.InstanceNotFound as e:
    785                 # migration referencing deleted instance
    786                 LOG.debug('Migration instance not found: %s', e)
    787                 continue
    788 
    789             # skip migration if instance isn't in a resize state:
    790             if not _instance_in_resize_state(instances[uuid]):
    791                 LOG.warning(_LW("Instance not resizing, skipping migration."),
    792                             instance_uuid=uuid)
    793                 continue

    517         migrations = objects.MigrationList.get_in_progress_by_host_and_node(
    518                 context, self.host, self.nodename)

The "migrations" should get from a SQL query.

db/sqlalchemy/api.py

   4478 def migration_get_in_progress_by_host_and_node(context, host, node):
   4479 
   4480     return model_query(context, models.Migration).\
   4481             filter(or_(and_(models.Migration.source_compute == host,
   4482                             models.Migration.source_node == node),
   4483                        and_(models.Migration.dest_compute == host,
   4484                             models.Migration.dest_node == node))).\
   4485             filter(~models.Migration.status.in_(['accepted', 'confirmed',
   4486                                                  'reverted', 'error',
   4487                                                  'failed'])).\
   4488             options(joinedload_all('instance.system_metadata')).\
   4489             all()

Here is the problem: the query does not filter out the completed evacuation migration for instance 60dd704d-c5e8-41c6-843e-4d7e5e4d3cf0, so the warning keeps appearing in the logs.
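
To make the failure mode concrete, here is a minimal Python sketch (not nova code; the excluded statuses are copied from the query above) of why the completed evacuation record is still returned as "in progress":

    # Minimal sketch: the OSP8 exclusion list does not contain 'completed',
    # so the finished evacuation still counts as an in-progress migration.
    migration = {'status': 'completed', 'migration_type': 'evacuation',
                 'instance_uuid': '60dd704d-c5e8-41c6-843e-4d7e5e4d3cf0'}

    excluded = ('accepted', 'confirmed', 'reverted', 'error', 'failed')

    if migration['status'] not in excluded:
        # The record slips through; the instance itself is ACTIVE and not in
        # a resize state, so the resource tracker logs "Instance not resizing,
        # skipping migration." on every periodic run.
        print('still treated as in progress:', migration)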

So I am wondering whether we can safely remove those entries from the database by hand?

Comment 2 Chen 2016-12-12 14:28:57 UTC
How to reproduce the issue:

1. Create instances like below.

[root@hp-dl160g8-7 nova(keystone_admin)]# nova list --field host,status,name
+--------------------------------------+-------------------------------------+--------+------------+
| ID                                   | Host                                | Status | Name       |
+--------------------------------------+-------------------------------------+--------+------------+
| d46f455b-8b6c-4b3f-9c5d-69fb446f028b | hp-dl160g8-7.gsslab.pek2.redhat.com | ACTIVE | instance-1 |
| 4c1036e2-4efd-4f43-9701-35c5f1351cc6 | hp-dl160g8-8.gsslab.pek2.redhat.com | ACTIVE | instance-2 |
+--------------------------------------+-------------------------------------+--------+------------+

2. Stop the nova-compute on hp-dl160g8-8.gsslab.pek2.redhat.com.

3. Wait for nova hypervisor-list to show it is down.

[root@hp-dl160g8-7 nova(keystone_admin)]# nova hypervisor-list
+----+-------------------------------------+-------+---------+
| ID | Hypervisor hostname                 | State | Status  |
+----+-------------------------------------+-------+---------+
| 1  | hp-dl160g8-7.gsslab.pek2.redhat.com | up    | enabled |
| 2  | hp-dl160g8-8.gsslab.pek2.redhat.com | down  | enabled |
+----+-------------------------------------+-------+---------+

4. Evacuate instance-2.

[root@hp-dl160g8-7 nova(keystone_admin)]# nova evacuate instance-2
+-----------+--------------+
| Property  | Value        |
+-----------+--------------+
| adminPass | pWU6AdpFmX7d |
+-----------+--------------+

5. Confirm instance-2 becomes active

[root@hp-dl160g8-7 nova(keystone_admin)]# nova list
+--------------------------------------+------------+--------+------------+-------------+-----------------------+
| ID                                   | Name       | Status | Task State | Power State | Networks              |
+--------------------------------------+------------+--------+------------+-------------+-----------------------+
| d46f455b-8b6c-4b3f-9c5d-69fb446f028b | instance-1 | ACTIVE | -          | Running     | private-1=192.168.1.2 |
| 4c1036e2-4efd-4f43-9701-35c5f1351cc6 | instance-2 | ACTIVE | -          | Running     | private-1=192.168.1.4 |
+--------------------------------------+------------+--------+------------+-------------+-----------------------+


6. Check the log

[root@hp-dl160g8-7 nova(keystone_admin)]# tailf /var/log/nova/nova-compute.log | grep d46f455b-8b6c-4b3f-9c5d-69fb446f028b
2016-12-12 09:24:23.013 16316 WARNING nova.compute.resource_tracker [req-17679e9c-86e4-4a01-84cb-9d0fb97e7ccf - - - - -] [instance: d46f455b-8b6c-4b3f-9c5d-69fb446f028b] Instance not resizing, skipping migration.
2016-12-12 09:25:22.765 16316 WARNING nova.compute.resource_tracker [req-17679e9c-86e4-4a01-84cb-9d0fb97e7ccf - - - - -] [instance: d46f455b-8b6c-4b3f-9c5d-69fb446f028b] Instance not resizing, skipping migration.
2016-12-12 09:26:25.077 16316 WARNING nova.compute.resource_tracker [req-17679e9c-86e4-4a01-84cb-9d0fb97e7ccf - - - - -] [instance: d46f455b-8b6c-4b3f-9c5d-69fb446f028b] Instance not resizing, skipping migration.
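
For reference, the resize-state check that trips the warning can be approximated as follows (a rough sketch, not the actual nova code; looks_like_resize is a hypothetical helper): the instances in the listing above are ACTIVE with no task state, so the check fails and the warning is logged on every resource tracker run.

    # Rough approximation (not the actual nova code) of the resize-state check
    # applied to the instance of each "in progress" migration.
    RESIZE_TASK_STATES = ('resize_migrating', 'resize_migrated', 'resize_finish')

    def looks_like_resize(vm_state, task_state):
        # Hypothetical helper: true only when the instance is mid-resize.
        return vm_state == 'resized' or task_state in RESIZE_TASK_STATES

    # The instances in the nova list above are ACTIVE with no task state:
    print(looks_like_resize('active', None))   # False -> warning is logged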

Comment 3 Masaki Furuta ( RH ) 2016-12-13 02:34:09 UTC
(In reply to Chen from comment #2)

Although we have many fixes around migration, I also found that the 'completed' status was added to this filter upstream.

Egohan, can we take a look at this?


Diff: filter on OSP8 vs OSP9

  [root@el72-osp8-all-virbr1 ~]# rpm -qf /usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py
  python-nova-12.0.4-4.el7ost.noarch
  
  [root@el72-osp9-all-virbr1 ~]# rpm -qf /usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py
  python-nova-13.1.1-2.el7ost.noarch
  
  
  # diff -up osp8/api.py /usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py
  ...
  +@pick_context_manager_reader
   def migration_get_in_progress_by_host_and_node(context, host, node):
   
       return model_query(context, models.Migration).\
  @@ -4484,16 +4771,34 @@ def migration_get_in_progress_by_host_an
                               models.Migration.dest_node == node))).\
               filter(~models.Migration.status.in_(['accepted', 'confirmed',
                                                    'reverted', 'error',
  -                                                 'failed'])).\
  +                                                 'failed', 'completed'])).\
               options(joinedload_all('instance.system_metadata')).\
               all()
   
   
  +@pick_context_manager_reader
  +def migration_get_in_progress_by_instance(context, instance_uuid,
  +                                          migration_type=None):
  +    # TODO(Shaohe Feng) we should share the in-progress list.
  +    # TODO(Shaohe Feng) will also summarize all status to a new
  +    # MigrationStatus class.
  +    query = model_query(context, models.Migration).\
  +            filter_by(instance_uuid=instance_uuid).\
  +            filter(models.Migration.status.in_(['queued', 'preparing',
  +                                                'running',
  +                                                'post-migrating']))
  +    if migration_type:
  +        query = query.filter(models.Migration.migration_type == migration_type)
  +
  +    return query.all()
  +
  +
  +@pick_context_manager_reader
  ...

...

Comment 4 Masaki Furuta ( RH ) 2016-12-13 02:43:54 UTC
(In reply to Masaki Furuta from comment #3)
> (In reply to Chen from comment #2)
> 
> Although we have many fixes around migration, I also found that the
> 'completed' status was added to this filter upstream.
> 
> Egohan, can we take a look at this?

I think this is the upstream fix that added the 'completed' status:

- "Migration instance not found" is logged repeatedly to nova-compute.log after an instance was deleted
  https://code.launchpad.net/bugs/1583720
  https://git.openstack.org/cgit/openstack/nova/commit/?id=e5b8060c08bb972a1960575749f240d7160bc114

Comment 5 Rajesh Tailor 2017-01-16 07:18:17 UTC
I was assigned to backport patch [1] to OSP8 to resolve this issue.

After creating a test package for OSP8 with the patch [1] changes, I found that it only partially solves this issue.

For example:
i) Say you evacuate an instance from host 'A' (which is down) to host 'B'.
ii) After the evacuation completes, the migration status is marked as "done" in the migrations table, and it stays in that status as long as host 'A' is down.
iii) When host 'A' is started again and nova-compute is up and running on it, the "_destroy_evacuated_instances" method is called by init_host, which changes the migration status from "done" to "completed".

Since patch [1] only adds the "completed" status to the filtering criterion, we keep getting the "Instance not resizing, skipping migration." warning messages in the nova-compute logs until step (iii) happens.

In other words, as long as the host from which the instance was evacuated stays down, the migration status never reaches 'completed'.

I have submitted the backport for review to OSP8 [2].

[1] https://review.openstack.org/#/q/I7aafab9abdbfafe9479846f06068ba8a963d290a,n,z
[2] https://code.engineering.redhat.com/gerrit/#/c/95329/
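
To illustrate the remaining gap, here is a small Python sketch (not nova code; the statuses are taken from the diff in comment 3 and the steps above):

    # Minimal sketch of the gap left by patch [1]: 'done' is not excluded.
    excluded_by_patch_1 = ('accepted', 'confirmed', 'reverted', 'error',
                           'failed', 'completed')

    # Evacuation record status while the source host is still down, and after
    # its nova-compute restarts and _destroy_evacuated_instances() has run:
    for status in ('done', 'completed'):
        print(status, '-> treated as in progress:',
              status not in excluded_by_patch_1)
    # done      -> True  (warning keeps appearing while the source host is down)
    # completed -> False (warning stops)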

Comment 9 Prasanth Anbalagan 2017-03-06 15:22:40 UTC
The resource tracker outputs the warning messages after the evacuation, but once the source compute node is started again the messages stop.

*********
VERSION
*********

1:12.0.6-9.el7ost

******
LOGS
******

# nova list --field host,status,name
+--------------------------------------+------------------------------+--------+------+
| ID                                   | Host                         | Status | Name |
+--------------------------------------+------------------------------+--------+------+
| ea6af1b0-f0ed-4a7a-8569-c07f18ea54be | compute-0 | ACTIVE | vm2  |
| 274b50e1-4a31-45a9-b7c5-917992c39eca | compute-1 | ACTIVE | vm1  |
+--------------------------------------+------------------------------+--------+------+

# nova hypervisor-list 
+----+------------------------------+-------+---------+
| ID | Hypervisor hostname          | State | Status  |
+----+------------------------------+-------+---------+
| 1  | compute-1 | down  | enabled |
| 2  | compute-0 | up    | enabled |
+----+------------------------------+-------+---------+

# nova evacuate vm1
+-----------+--------------+
| Property  | Value        |
+-----------+--------------+
| adminPass | EcNjjnfK7VzS |
+-----------+--------------+

# nova list --field host,status,name
+--------------------------------------+------------------------------+--------+------+
| ID                                   | Host                         | Status | Name |
+--------------------------------------+------------------------------+--------+------+
| 274b50e1-4a31-45a9-b7c5-917992c39eca | compute-0 | ACTIVE | vm1  |
| ea6af1b0-f0ed-4a7a-8569-c07f18ea54be | compute-0 | ACTIVE | vm2  |
+--------------------------------------+------------------------------+--------+------+


# tail -f /var/log/nova/nova-compute.log | grep 274b50e1-4a31-45a9-b7c5-917992c39eca
2017-03-06 17:13:50.521 26086 INFO nova.compute.manager [req-c4ed3458-103a-42a6-9a07-6ac52266310e - - - - -] [instance: 274b50e1-4a31-45a9-b7c5-917992c39eca] VM Resumed (Lifecycle Event)
2017-03-06 17:13:50.558 26086 INFO nova.virt.libvirt.driver [-] [instance: 274b50e1-4a31-45a9-b7c5-917992c39eca] Instance spawned successfully.
2017-03-06 17:13:50.679 26086 INFO nova.compute.manager [req-c4ed3458-103a-42a6-9a07-6ac52266310e - - - - -] [instance: 274b50e1-4a31-45a9-b7c5-917992c39eca] During the sync_power process the instance has moved from host compute-1 to host compute-0
2017-03-06 17:13:50.680 26086 INFO nova.compute.manager [req-c4ed3458-103a-42a6-9a07-6ac52266310e - - - - -] [instance: 274b50e1-4a31-45a9-b7c5-917992c39eca] VM Started (Lifecycle Event)
2017-03-06 17:13:50.948 26086 INFO nova.compute.manager [req-c4ed3458-103a-42a6-9a07-6ac52266310e - - - - -] [instance: 274b50e1-4a31-45a9-b7c5-917992c39eca] During the sync_power process the instance has moved from host compute-1 to host compute-0
2017-03-06 17:14:05.196 26086 WARNING nova.compute.resource_tracker [req-6c717967-4c30-4d25-839c-670ebedc9749 - - - - -] [instance: 274b50e1-4a31-45a9-b7c5-917992c39eca] Instance not resizing, skipping migration.
2017-03-06 17:15:07.190 26086 WARNING nova.compute.resource_tracker [req-6c717967-4c30-4d25-839c-670ebedc9749 - - - - -] [instance: 274b50e1-4a31-45a9-b7c5-917992c39eca] Instance not resizing, skipping migration.
2017-03-06 17:16:09.185 26086 WARNING nova.compute.resource_tracker [req-6c717967-4c30-4d25-839c-670ebedc9749 - - - - -] [instance: 274b50e1-4a31-45a9-b7c5-917992c39eca] Instance not resizing, skipping migration.

Comment 11 errata-xmlrpc 2017-03-08 17:47:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0465.html

