Bug 2312197

Summary: Conflict with default values for live_migration_permit_post_copy and live_migration_permit_auto_converge
Product: Red Hat OpenStack
Reporter: Eric Nothen <enothen>
Component: documentation
Assignee: Joanne O'Flynn <joflynn>
Status: CLOSED MIGRATED
QA Contact: RHOS Documentation Team <rhos-docs>
Severity: high
Docs Contact:
Priority: medium
Version: 17.1 (Wallaby)
CC: astupnik, bgibizer, joflynn, jslagle, mariel, mburns, smooney
Target Milestone: ---
Keywords: Reopened, Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2025-01-21 16:37:46 UTC
Type: Bug

Description Eric Nothen 2024-09-13 14:02:19 UTC
Description of problem:

The default values that tripleo-heat-templates (THT) sets for live_migration_permit_post_copy and live_migration_permit_auto_converge conflict with each other.

Version-Release number of selected component (if applicable):
17.1
16.2

How reproducible:
Default behavior

Steps to Reproduce:
1. Deploy RHOSP
2. Check the values of both parameters on any compute host:
~~~
$ sudo egrep "^\[libvirt|^live_migration_permit" /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
[libvirt]
live_migration_permit_post_copy=True
live_migration_permit_auto_converge=True
~~~
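
The same check can be scripted. This is a minimal sketch, assuming the generated nova.conf parses as plain INI (an assumption; oslo.config files usually do). The helper name `libvirt_migration_flags` is hypothetical and is not part of nova or TripleO:

```python
import configparser


# Hypothetical helper for illustration; not part of nova or TripleO.
def libvirt_migration_flags(conf_text):
    """Return the live_migration_permit_* options of the [libvirt] section."""
    # strict=False tolerates repeated options; interpolation=None leaves
    # any '%' characters in other values untouched.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("libvirt"):
        return {}
    return {
        key: parser.getboolean("libvirt", key)
        for key in parser.options("libvirt")
        if key.startswith("live_migration_permit")
    }


# Sample input copied from the output above.
sample = """
[libvirt]
live_migration_permit_post_copy=True
live_migration_permit_auto_converge=True
"""
flags = libvirt_migration_flags(sample)
print(flags)
```

On a compute host, the text passed to the helper would be the contents of the nova.conf path used in the egrep command above, which requires root to read.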


Actual results:
Both parameters are set to True by default:

~~~
$ egrep -nA9 "NovaLiveMigrationPermitPostCopy:$|NovaLiveMigrationPermitAutoConverge:$" /usr/share/openstack-tripleo-heat-templates/deployment/nova/nova-compute-container-puppet.yaml
347:  NovaLiveMigrationPermitPostCopy:
348-    description: >
349-      If "True" activates the instance on the destination node before migration is complete,
350-      and to set an upper bound on the memory that needs to be transferred. Post copy
351-      gets enabled per default if the compute roles is not a realtime role or disabled
352-      by this parameter.
353-    default: true
354-    type: boolean
355-    tags:
356-      - role_specific
357:  NovaLiveMigrationPermitAutoConverge:
358-    description: >
359-        Defaults to "True" to slow down the instance CPU until the memory copy process is faster than
360-        the instance's memory writes when the migration performance is slow and might not complete.
361-        Auto converge will only be used if this flag is set to True and post copy is not permitted
362-        or post copy is unavailable due to the version of libvirt and QEMU.
363-    default: true
364-    type: boolean
365-    tags:
366-      - role_specific
~~~

Note, however, that auto_converge is only used if post copy is not permitted or is unavailable, which is not the case in our default configuration.
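
To illustrate the precedence described in the parameter descriptions above, here is a simplified model. It is a sketch only, not nova's actual code; the function name `effective_mechanism` and the `post_copy_available` argument (standing in for the libvirt/QEMU version check) are assumptions made for this example:

```python
# Simplified model of the documented precedence; not nova's implementation.
def effective_mechanism(permit_post_copy, permit_auto_converge,
                        post_copy_available=True):
    """Return which mechanism a live migration would be allowed to use."""
    if permit_post_copy and post_copy_available:
        # Post copy takes precedence; auto converge is permitted but unused.
        return "post_copy"
    if permit_auto_converge:
        # Only reached when post copy is not permitted or is unavailable.
        return "auto_converge"
    return "none"


# Default THT values: both flags are True.
print(effective_mechanism(True, True))
# Post copy disabled: auto converge becomes the mechanism in use.
print(effective_mechanism(False, True))
```

With both defaults set to True, the auto converge branch is never reached unless post copy is unavailable on the host.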

Expected results:

Either post_copy or auto_converge should be enabled, not both at the same time. Based on the findings in the related BZ#2312196, I'm inclined to think that post_copy should be disabled by default, because auto_converge performs much better, especially on workloads that are under heavy memory pressure.


Additional info:

https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_permit_auto_converge
https://github.com/openstack/nova/blob/8a24acd9240f2a2705ccd979577e0e2338a238ef/nova/virt/libvirt/driver.py#L1022-L1029

Comment 1 smooney 2024-09-18 15:10:23 UTC
this is not a bug
it was done this way for upgrade reasons.
we expect both to be set to true and for post copy to take precedence

NovaLiveMigrationPermitAutoConverge does not mean it will be used, just that nova is allowed to use it.

Comment 2 Eric Nothen 2024-09-19 11:13:47 UTC
(In reply to smooney from comment #1)
> this is not a bug
> it was done this way for upgrade reasons.

Can you elaborate on this please? Which upgrade are these settings allowing, and how does disabling post copy affect upgrades?

I have recommended that my customer test live migration with post copy disabled. They report that a) time to completion has dropped from hours to a handful of minutes, and b) they no longer have to force completion. They are therefore making this configuration persistent in their compute.yaml, so it will remain in place during and after the upgrade to 17.1.


> we expect both to be set to true and for post copy to take precedence
> 
> NovaLiveMigrationPermitAutoConverge does not mean it will be used, just that
> nova is allowed to use it.

Yes, Melanie has also clarified the intended behavior in BZ#2312196.

Comment 7 Joanne O'Flynn 2024-10-07 13:58:04 UTC
Triage meeting:

Notes:
Add more info on setting the action to force it to trigger, to guarantee that it will trigger
Add a small explanation of what auto_converge does
Do the same for post_copy

Just update 17 docs

Update section 15.8.2

Priority: Medium

Resources from Sean:

https://github.com/openstack-k8s-operators/nova-operator/blob/main/templates/nova.conf#L202-L204
live_migration_permit_post_copy=true
live_migration_permit_auto_converge=true
live_migration_timeout_action=force_complete
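
How the three settings above interact once the completion timeout expires can be sketched as follows. This reflects my reading of the upstream nova configuration reference for live_migration_timeout_action and is a simplified model, not nova's code; the function name `on_completion_timeout` and the `post_copy_available` argument are assumptions made for this example:

```python
# Simplified model of the timeout action; not nova's implementation.
def on_completion_timeout(timeout_action, permit_post_copy,
                          post_copy_available=True):
    """Return what happens when the live migration completion timeout expires."""
    if timeout_action == "abort":
        return "abort migration"
    if timeout_action == "force_complete":
        if permit_post_copy and post_copy_available:
            # Post copy is only triggered at this point, not from the start.
            return "switch to post copy"
        # Without post copy, the instance is paused so the copy can finish.
        return "pause instance"
    raise ValueError(f"unknown timeout action: {timeout_action}")


# Settings from the nova-operator template linked above.
print(on_completion_timeout("force_complete", permit_post_copy=True))
```

Under this reading, permitting post copy only makes it available; force_complete is what guarantees that it triggers when a migration does not converge in time.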

Comment 9 Red Hat Bugzilla 2025-05-22 04:25:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days