Bug 2216798

Summary: Ping latency temporarily increased at the end of live migration when multichassis port bindings are used (OSP 17.1+)
Product: Red Hat OpenStack Reporter: Eric Nothen <enothen>
Component: python-networking-ovn    Assignee: Ihar Hrachyshka <ihrachys>
Status: ASSIGNED    QA Contact: Eran Kuris <ekuris>
Severity: medium    Docs Contact:
Priority: medium
Version: 17.1 (Wallaby)    CC: apevec, ihrachys, jasowang, kchamart, lhh, lmartins, majopela, peterx, quintela, ralonsoh, scohen
Target Milestone: ---    Keywords: Triaged
Target Release: ---    Flags: needinfo? (ihrachys)
    ihrachys: needinfo? (enothen)
Hardware: Unspecified   
OS: Unspecified   

Description Eric Nothen 2023-06-22 15:54:41 UTC
Description of problem:

The customer requested an analysis of the performance of a live migration performed on 17.1-beta, to find out whether there is further room for improvement or tuning to reduce the increased latency observed at the end of the migration process.

Version-Release number of selected component (if applicable):
17.1-beta

How reproducible:
Reproducible when migrating a VM with a relatively large amount of RAM (128 GB) that is actively under memory stress.

Additional info:
Adding reference to attached files in a private comment.

Comment 3 Ihar Hrachyshka 2023-06-23 14:34:47 UTC
Thanks a lot for the detailed info attached here and in the customer case. I won't get to this immediately, but I've checked the attached sos-reports. I think they are not as complete as I was hoping for.

- I don't see DEBUG messages in ovn-controller logs.
- I don't see DEBUG messages in ovs-vswitchd logs.

These should be enabled with ovn-appctl / ovs-appctl (for each of the processes, respectively). See: https://access.redhat.com/solutions/4270652 (that covers ovn-controller, but you'll need to do the same for ovs-vswitchd).

Neutron and Nova logs are enabled for DEBUG, so that's fine.

===

I understand this is a slightly different scenario, but it could help if we could do the same latency measurements on the environment with
- the same VM but not loaded / swapping
- a small VM (cirros m1.tiny flavored?) with no load

Just to get a baseline and see whether the majority of the latency originates from the flavor and workload, or whether it's inherent to the OVN / Nova port binding procedures, which are not primarily influenced by the kind of VM.

===

I also wonder if the same latency pattern exists when you control the location of the VMs (the one that is being migrated and the one that runs ping), e.g.

- when source chassis of the migrating-VM is the same as chassis-of-ping-VM
- when destination chassis of the migrating-VM is the same as chassis-of-ping-VM
- when chassis-of-ping-VM is different from either source or destination chassis of the migrating-VM

There may be differences in how traffic is steered by OVN, depending on whether ports are co-located, and this information may help to identify the probable sources of the latency.

===

It may also help to debug the situation if we understand the source of the latency. One source could simply be that the migrating-VM doesn't do any computation (whether local or network related) during the pause-unpause cycle driven by Nova and libvirt. Obviously, if the VM doesn't do any computation for an extended period of time, it won't do networking-related computation either. To isolate this global freeze (pause) effect from any effect tied to the Neutron/OVN implementation of multichassis port bindings, we could run a script inside the migrating-VM that, during live migration, dumps the time as seen by its clock to a file in a loop, with high granularity. We could then examine the file and check for significant gaps in the timestamps, which would indicate an extended freeze of computation in the VM.

This is not a complete script and it will need some testing and adjustment (sorry for not polishing it), but something along the lines of:

#!/bin/bash
# append one timestamp per loop iteration, with millisecond precision
while true; do
  date '+%H:%M:%S.%3N' >> timestamps-during-migration.txt
  sleep 0.01
done

(`%3N` is a GNU date extension that prints milliseconds in addition to seconds; a busybox-based `date` may not support it)

The script would run inside the migrating-VM.

===

I've talked to Sean Mooney from Compute about the matter too, and I think we have some ideas on what to try next from an engineering perspective. Dumping the list here in case something happens and I can't follow up on it myself.

- Disable RARP activation and see if it improves latency (perhaps at the cost of observing some duplicate replies to ping)
- Add a sleep before, or comment out, the neutron port activate call done inside the post-migrate handler of nova and see if the latency is still there (the idea here is to isolate whether the latency is the result of neutron port activate handling or something more low level, e.g. libvirt extending the VM pause state / buffering packets)
- Is there maybe an issue with async updates to flow tables on the multi->single transition between nodes (hosting ping and src/dest)? Is it safe to assume that the order of flow table updates on all participating chassis is irrelevant? (Need to think about it for a bit, though at first sight it should be irrelevant.)
- Try the nova reordering patch to do neutron activate earlier: https://review.opendev.org/c/openstack/nova/+/883682 (though we believe it shouldn't affect connectivity, because nova assumes that neutron already configured networking on the destination in the pre-migrate phase)

Comment 4 Ihar Hrachyshka 2023-06-23 14:51:04 UTC
Please enable the following debug modules for ovn-controller:
binding
chassis
encaps
if_status
inc_proc_eng
lport
main
ofctrl
physical
pinctrl

For ovs-vswitchd, please enable the following (this is done via the ovs-appctl tool, not ovn-appctl):
bridge
flow
vswitchd

Once you are done with the tests and have captured these logs, consider reverting each of them back to the original settings so as not to pollute the disk with debug logs.
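
For reference, a sketch of what the above likely maps to on the command line (the vlog/set syntax is standard ovs/ovn tooling, but in an OSP deployment these daemons typically run in containers, so the commands may need to be executed inside the respective containers):

```
# enable debug logging for the requested ovn-controller modules
for module in binding chassis encaps if_status inc_proc_eng lport main ofctrl physical pinctrl; do
  ovn-appctl -t ovn-controller vlog/set "${module}:file:dbg"
done

# enable debug logging for the requested ovs-vswitchd modules
for module in bridge flow vswitchd; do
  ovs-appctl -t ovs-vswitchd vlog/set "${module}:file:dbg"
done

# once done testing, revert everything back to the default level
ovn-appctl -t ovn-controller vlog/set file:info
ovs-appctl -t ovs-vswitchd vlog/set file:info
```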

Comment 7 Ihar Hrachyshka 2023-07-25 16:47:58 UTC
I did a number of tests in my local environment.

0. I am running the test on a devstack 2-node setup; the VM under test is cirros using the m1.tiny flavor, and nothing is running inside the VM except the timestamp script (see below). The VM running ping runs on one of the nodes and is part of the same private (geneve) network.

1. I see increased latency in ping output around the time when the VM is paused / unpaused. The bump in latency is not as prominent as you experienced with your large VM, but still prominent enough. (I believe the size of the VM RAM + whether it swaps / runs intensive processes affects the speed of the host switch by libvirt.)

Example of latency bump:

```
64 bytes from 10.0.0.30: icmp_seq=105 ttl=64 time=3.92 ms
64 bytes from 10.0.0.30: icmp_seq=106 ttl=64 time=4.22 ms
64 bytes from 10.0.0.30: icmp_seq=107 ttl=64 time=22.5 ms
64 bytes from 10.0.0.30: icmp_seq=108 ttl=64 time=5.64 ms
64 bytes from 10.0.0.30: icmp_seq=109 ttl=64 time=29.1 ms
64 bytes from 10.0.0.30: icmp_seq=110 ttl=64 time=43.9 ms
64 bytes from 10.0.0.30: icmp_seq=111 ttl=64 time=15.8 ms
64 bytes from 10.0.0.30: icmp_seq=112 ttl=64 time=27.8 ms
64 bytes from 10.0.0.30: icmp_seq=113 ttl=64 time=42.8 ms
64 bytes from 10.0.0.30: icmp_seq=114 ttl=64 time=74.9 ms
64 bytes from 10.0.0.30: icmp_seq=115 ttl=64 time=158 ms 
64 bytes from 10.0.0.30: icmp_seq=116 ttl=64 time=48.6 ms
64 bytes from 10.0.0.30: icmp_seq=117 ttl=64 time=13.9 ms
64 bytes from 10.0.0.30: icmp_seq=118 ttl=64 time=2.96 ms
64 bytes from 10.0.0.30: icmp_seq=119 ttl=64 time=8.91 ms
```

(the regular values are ~5ms when live migration is not running.)

2. I've tried the same with Artom's patch to nova that re-orders the storage / network activation steps in the post-migrate phase (https://review.opendev.org/c/openstack/nova/+/883682). It didn't change anything (as we expected).

3. I've tried adding a time.sleep(60) before network activation happens on the nova side, to isolate potential delays due to the libvirt host switch from any effect of the OVN requested-chassis switch from src,dst to dst happening in post-migrate. Adding the sleep didn't change the latency pattern, which makes me believe that it's not OVN dataplane changes that introduce the latency.

4. I then ran a script inside the cirros VM that constantly dumps the current clock timestamp (with nanosecond precision) into a file under /tmp, and then ran a live migration to see if there's a visible delay in the timestamps around the time when we see the network latency. Indeed:

The script (I had to pull timestamps from /proc because the libc / date utility used in cirros doesn't give me nanosecond precision with %N):

```
#!/bin/sh
echo > time.txt
while true; do
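  # the third line of /proc/timer_list is the "now at <N> nsecs" monotonic clock reading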
  sudo cat /proc/timer_list | head -n 3 | tail -n 1 >> time.txt
  #sleep 0.01
done
```

Output:

```
now at 1437443317305 nsecs
now at 1437792103249 nsecs
now at 1438203851566 nsecs
now at 1438579200107 nsecs
now at 1438925215601 nsecs
now at 1439310798943 nsecs
now at 1439707246297 nsecs
now at 1440120358926 nsecs
now at 1440494016110 nsecs
now at 1440831918200 nsecs
now at 1441225121595 nsecs
now at 1441661974233 nsecs
now at 1441959671940 nsecs
now at 1442244247463 nsecs
now at 1442654619997 nsecs
now at 1443152651983 nsecs
now at 1443529384693 nsecs
now at 1443825461744 nsecs
now at 1444677407216 nsecs
now at 1445302003744 nsecs
now at 1445814026209 nsecs
now at 1446400892278 nsecs
now at 1446995647303 nsecs
**
now at 1447607051015 nsecs
now at 1449154766140 nsecs
**
now at 1449549856250 nsecs
now at 1449857977640 nsecs
now at 1450220032010 nsecs
```

I've emphasized the important lines. You can see that there's a ~1.5 second gap between the two timestamps: an iteration that took ~0.3s before the bump took 1s+ longer for the vCPU to process during the live migration switch. Which makes me think that the primary contributor to the latency is simply the VM under test being slow during the unpause, so its networking stack can't respond as quickly to incoming ICMP requests. Once the unpause completes and the VM state settles, latency (and vCPU performance) returns to the normal level.
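
A quick way to spot such gaps in the collected file (a sketch, assuming the `now at <nsecs>` line format shown above; the 1-second threshold is arbitrary):

```
awk '$3 ~ /^[0-9]+$/ { if (prev && $3 - prev > 1e9) printf "gap of %.2f s before %s\n", ($3 - prev) / 1e9, $3; prev = $3 }' time.txt
```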

If it's indeed the case that freshly unpaused VMs are initially slow, then - if we want to tackle the latency - we may have to start looking at ways to speed up the unpause / migration process in libvirt, and not at the openstack services / OVN interaction.

I'm not closing / moving this bz just yet because I want to take a better look at the scenarios on more beefy clusters. But I'd advise engaging libvirt / qemu folks in this discussion already, so that they can check whether my notes and approach to isolating networking from overall delays in VMs make sense.

Comment 8 Kashyap Chamarthy 2023-07-31 09:47:23 UTC
@Jason / Peter: Pinging you as a starting point, feel free to redirect it as you see fit.

Do you know if there's a reasonable way to go about isolating the above network-latency issue during live migration? Ihar seems to think (see his comment #7) that the issue might be at the libvirt/QEMU level rather than in the OVN interaction. Admittedly, there are several moving parts here, so I'm not sure how tractable the issue is.

Comment 9 Juan Quintela 2023-07-31 12:51:09 UTC
Several questions:
- Network topology: are you sharing the same network link for your measurements and the migration?
- What specs do you have?
  Network hardware speed, and how big is the downtime that you are setting for the migration?
- Can you measure how much bandwidth is being used during the migration?

About the network latency, the only thing that I can think of without more data is that you are saturating the network link with migration traffic and then nothing else gets through.
About the timer change: as it only happens "twice", I think the problem is in the last stage of migration, i.e. how much downtime you allow.

Once these questions are answered, we can continue looking at other issues.

Later, Juan.

Comment 10 Peter Xu 2023-07-31 15:18:27 UTC
(In reply to Ihar Hrachyshka from comment #7)
> **
> now at 1447607051015 nsecs
> now at 1449154766140 nsecs
> **

I don't really know the detailed config of m1.tiny, but for a small VM a 1.5s downtime is definitely not natural.

I tried just now with a simplified QEMU cmdline (8 vcpus, 4G mem) and I cannot trigger this issue, at least with an upstream qemu.  I did uncomment the "sleep", though, with a 0.1 interval, assuming that shouldn't matter here.

#!/bin/sh
echo > time.txt
while true; do
  sudo cat /proc/timer_list | head -n 3 | tail -n 1 | tee -a time.txt
  sleep 0.1
done

...
now at 33714504765 nsecs
now at 33948119077 nsecs
now at 34180928559 nsecs
now at 34414336505 nsecs
now at 34644574167 nsecs
now at 34812645201 nsecs
--- switchover happened here ---
now at 34975213118 nsecs
now at 35136053289 nsecs
now at 35296922832 nsecs
now at 35458301301 nsecs
now at 35619638850 nsecs
...

$ qemu-system-x86_64 -M q35 -smp 8 -m 4G -msg timestamp=on -name peter-vm,debug-threads=on -cpu Nehalem -accel kvm -global migration.x-max-bandwidth=0 -qmp unix:/tmp/peter.qmp,server,nowait -nographic -nodefaults -serial stdio -monitor telnet::6666,server,nowait -netdev user,id=net0,hostfwd=tcp::5555-:22 -device virtio-net-pci,netdev=net0 -incoming tcp:0.0.0.0:12345 /home/peterx/remote/images/default.qcow2

Ihar, does your result reproduce reliably?  If so, it's worth trying with a simpler qemu cmdline to see whether that changes anything (e.g., some qemu parameter may cause a deterministically long downtime, but I'm not sure).

Even if so, it may or may not be directly related to the comment 0 issue - that one has a much more drastic 16 sec downtime.  I'm not sure whether Jason can come up with anything that might be related to 16 sec, e.g., what the timeout would be if the ARP packets just get lost.

It'll also be interesting to know whether port binding is a must to trigger that 16 sec delay.

Comment 11 jason wang 2023-08-01 03:17:30 UTC
If pause is used, this is somewhat expected, as QEMU (or the virtio-net driver) doesn't generate GARP packets in this case.

The management layer can manually ask QEMU (or the virtio-net driver) to generate those packets when necessary:

https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#qapidoc-1476
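
(The linked anchor appears to describe the `announce-self` QMP command. A minimal sketch of triggering it by hand through libvirt - the domain name and timing parameters below are illustrative, not taken from this environment:)

```
# ask the migrated guest's QEMU to emit self-announcement packets (RARP/GARP)
virsh qemu-monitor-command instance-00000001 \
  '{"execute": "announce-self",
    "arguments": {"initial": 50, "max": 550, "rounds": 5, "step": 100}}'
```

(This is essentially what Nova's `[workarounds]/enable_qemu_monitor_announce_self` option is meant to do automatically after a live migration.)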

If the GARP is sent in time, we probably need to make sure it is not dropped unexpectedly.

Thanks

Comment 12 Ihar Hrachyshka 2023-08-01 13:46:25 UTC
Both VMs are on geneve network and don't rely on GARP advertisements to ensure delivery of packets to destination. (Actually, OVN duplicates all packets to both locations during the live migration procedure.)

Comment 13 jason wang 2023-08-02 02:40:46 UTC
(In reply to Ihar Hrachyshka from comment #12)
> Both VMs are on geneve network and don't rely on GARP advertisements to
> ensure delivery of packets to destination. (Actually, OVN duplicates all
> packets to both locations during the live migration procedure.)

If I understand correctly, this is how things are expected to work, but to isolate the issue I'd suggest instrumenting QEMU to send a GARP after live migration to see if anything improves or not.

Thanks

Comment 14 jason wang 2023-08-02 06:13:39 UTC
(In reply to jason wang from comment #13)
> (In reply to Ihar Hrachyshka from comment #12)
> > Both VMs are on geneve network and don't rely on GARP advertisements to
> > ensure delivery of packets to destination. (Actually, OVN duplicates all
> > packets to both locations during the live migration procedure.)
> 
> If I understand correctly, this is how things are expected to work but to
> isolate the issue, I'd suggest to instrument the Qemu to send gARP after
> live migration to see if anything is improved or not.
> 
> Thanks


One more question: is the latency measured in the guest or not? (Looks like yes, according to comment #7.) If yes, how about measuring it from the host? If we see a difference, it could be a bug in the gtod (kvmclock).

Thanks

Comment 15 Ihar Hrachyshka 2023-08-02 15:41:24 UTC
I will try to answer the questions listed above. Where information is not yet available, I'm also listing the questions for completeness to clarify that they are not lost; I will report back later with answers.

---

- Network topology: Are you sharing the same network link for your measurements and migration?
	- Yes. This is a virtual environment (multinode vagrant).

- What specs do you have?
	- This is a fully virtualized environment. No hardware is involved.

- How big is the downtime that you are setting for migration?
	- This is OpenStack; the Nova service drives the downtime settings for libvirt. I believe [this comment and function](https://github.com/openstack/nova/blob/bee1313240d63c26620bb091766232b7d7a97539/nova/virt/libvirt/migration.py#L665-L690) in the Nova code base explains how the value is picked.

```python
def downtime_steps(data_gb):
    '''Calculate downtime value steps and time between increases.

    :param data_gb: total GB of RAM and disk to transfer

    This looks at the total downtime steps and upper bound
    downtime value and uses a linear function.

    For example, with 10 steps, 30 second step delay, 3 GB
    of RAM and 400ms target maximum downtime, the downtime will
    be increased every 90 seconds in the following progression:

    -   0 seconds -> set downtime to  40ms
    -  90 seconds -> set downtime to  76ms
    - 180 seconds -> set downtime to 112ms
    - 270 seconds -> set downtime to 148ms
    - 360 seconds -> set downtime to 184ms
    - 450 seconds -> set downtime to 220ms
    - 540 seconds -> set downtime to 256ms
    - 630 seconds -> set downtime to 292ms
    - 720 seconds -> set downtime to 328ms
    - 810 seconds -> set downtime to 364ms
    - 900 seconds -> set downtime to 400ms

    This allows the guest a good chance to complete migration
    with a small downtime value.
    '''
    downtime = CONF.libvirt.live_migration_downtime
    steps = CONF.libvirt.live_migration_downtime_steps
    delay = CONF.libvirt.live_migration_downtime_delay

    delay = int(delay * data_gb)

    base = downtime / steps
    offset = (downtime - base) / steps

    for i in range(steps + 1):
        yield (int(delay * i), int(base + offset * i))
```

- Can you measure how much bandwidth is being used during migration?
	- [ ] I will have to do some measurements to answer this question (see the command sketch after this list).

- Does your result reproduce reliably?
	- [ ] I should run more tests to confirm it.

- It'll also be interesting to know whether port binding is a must to trigger that 16 sec delay.
	- No. I completely commented out the code that activates / rebinds the ports on the source and destination at the end of live migration, and this doesn't change the latency pattern at all. Which is why I started suspecting a general slowdown of the VM, not something specific to networking.

- To isolate the issue, I'd suggest instrumenting QEMU to send a GARP after live migration to see if anything improves.
	- The OpenStack ML2/OVN implementation already uses `activation-strategy=rarp` for VM ports, which makes OVN block any traffic for the port until a RARP is received from the VM port, at which point OVN removes the blocking flows. So this shouldn't be a problem.
	- [ ] But for completeness, I will try to enable Nova's `[workarounds]/enable_qemu_monitor_announce_self` option and see if it changes anything.

- Is the latency measured in the guest or not?
	- Latency is measured on a different guest that is running in the same cluster and belongs to the same virtual network as the VM-under-test.
	- "how about measure it from the host"
		- In OpenStack context, it will require the VM to be connected to an external network that provides access to the VM from the hypervisor. I am trying to isolate the effect of NAT, among other moving parts.
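
Regarding the bandwidth question above, a possible way to measure it during the next test run (a sketch; the domain name is illustrative, and `sar` needs the sysstat package):

```
# on the source hypervisor, while the migration is in progress:
# domjobinfo reports data processed / remaining and the memory bandwidth of the migration
watch -n 1 virsh domjobinfo instance-00000001

# and/or sample per-interface throughput on the link carrying the migration traffic
sar -n DEV 1
```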

Comment 17 Ihar Hrachyshka 2023-08-15 12:46:48 UTC
As discussed with @ekuris, re-targeting this bz to 18 since we still don't know whether there's anything we can do for the latency on the neutron / ovn side (if not, we may have to punt it down the stack to qemu/libvirt/kernel and see if anything can be done there). If / when a fix for the latency behavior is implemented in a later release, we may backport it to earlier releases, e.g. 17. Until then, there's little reason to track this request for 17.

Comment 18 Ihar Hrachyshka 2023-08-15 12:48:54 UTC
Is it possible for the customer to repeat the same test with ML2/OVS? I don't think we tested the same scenario there (a 128 GB RAM VM under memory stress), and since I suspect the latency has little to do with port binding and hence ML2/OVN, having a data point for the "baseline" ML2/OVS driver could be of real help in isolating the source of the latency.

Please let me know if the customer is willing - and has the lab - to do this testing. Thanks.