Bug 1590876 - test_create_list_show_delete_interfaces frequently failing
Summary: test_create_list_show_delete_interfaces frequently failing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tempest
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Chandan Kumar
QA Contact: Martin Kopec
URL:
Whiteboard:
Depends On:
Blocks: 1647954 1648021
 
Reported: 2018-06-13 14:51 UTC by Attila Fazekas
Modified: 2020-12-21 19:18 UTC
CC: 22 users

Fixed In Version: openstack-tempest-13.0.0-20.bafe630git.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1647954 1648021
Environment:
Last Closed: 2019-01-16 17:09:07 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1696125 0 None None None 2018-06-15 11:19:13 UTC
OpenStack gerrit 600046 0 None MERGED Interface tests: wait for server to become SSHable 2020-12-14 07:20:01 UTC
OpenStack gerrit 602206 0 None MERGED Default enable run_validation 2020-12-14 07:20:01 UTC
Red Hat Product Errata RHBA-2019:0075 0 None None None 2019-01-16 17:09:28 UTC

Comment 6 Attila Fazekas 2018-06-15 12:54:38 UTC
I wonder whether it is worth retrying on DeviceDetachFailed for a non-persistent domain?
Is the domain persistent in the cases where it works?

Comment 7 Attila Fazekas 2018-06-18 11:55:45 UTC
You can reproduce it by running tempest;
I can give you a VM if you need one.

while ostestr --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'   --concurrency 8; do echo retry ; done

It should reproduce the issue within 90 minutes.
But if you just run the interface test in another terminal in a loop, it might be even faster.

Comment 12 Attila Fazekas 2018-06-26 10:58:48 UTC
So far I have not reproduced the issue in an environment with virt-type=qemu.
Our environment is nested-guest capable and behaves more like bare metal would.

The upstream gates do not have this capability, so they exercise qemu instead of kvm; if the issue is kvm-specific, that could explain why it is rare in upstream jobs.

Is the issue related to the kvm path, or is it just random dice-roll noise?

packstack --allinone --os-debug-mode y --provision-tempest-floatrange 172.24.4.0/24 --provision-demo-floatrange 172.24.4.0/24 --nova-libvirt-virt-type kvm
tempest init tempest-dir
cd tempest-dir
(source /root/keystonerc_admin;  discover-tempest-config --debug --create identity.uri $OS_AUTH_URL identity.admin_password $OS_PASSWORD)
while ostestr --regex '(?!.*\[.*\bslow\b.*\])(?!.*encrypted.*)(?!.*implied_roles.*)(?!.*inherit.*)(^tempest\.(api|scenario))'   --concurrency 8; do echo retry ; done

Comment 13 Attila Fazekas 2018-06-26 13:00:16 UTC
With OSP 13, libvirt receives multiple (7) requests for:
2018-06-26 10:30:47.281+0000: 10195: debug : virDomainDetachDeviceFlags:8270 : dom=0x7f6fe00273b0, (VM: name=instance-00000014, uuid=c4244e1c-5fa0-45da-80f9-13b8a2ac9bea), xml=<interface type="bridge">
  <mac address="fa:16:3e:c6:d9:83"/>
  <model type="virtio"/>
  <source bridge="qbr8ecde647-26"/>
  <target dev="tap8ecde647-26"/>
</interface>
, flags=0x1

It sends it to the qemu monitor:
2018-06-26 10:30:47.283+0000: 10195: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7f6fd4024b10 msg={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-37"}
 fd=-1
2018-06-26 10:30:47.283+0000: 10194: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f6fd4024b10 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-37"}
 len=70 ret=70 errno=0
2018-06-26 10:30:47.283+0000: 10194: info : qemuMonitorIOProcess:447 : QEMU_MONITOR_IO_PROCESS: mon=0x7f6fd4024b10 buf={"return": {}, "id": "libvirt-37"}
 len=36
2018-06-26 10:30:47.283+0000: 10194: debug : qemuMonitorJSONIOProcessLine:193 : Line [{"return": {}, "id": "libvirt-37"}]
2018-06-26 10:30:47.283+0000: 10194: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6fd4024b10 reply={"return": {}, "id": "libvirt-37"}


Based on later communication, net1 is still there:
2018-06-26 10:31:56.691+0000: 10197: debug : qemuMonitorJSONCommandWithFd:298 : Send command '{"execute":"qom-list","arguments":{"path":"/machine/peripheral"},"id":"libvirt-46"}' for write with FD -1
2018-06-26 10:31:56.691+0000: 10197: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7f6fd4024b10 msg={"execute":"qom-list","arguments":{"path":"/machine/peripheral"},"id":"libvirt-46"}
 fd=-1
2018-06-26 10:31:56.692+0000: 10194: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f6fd4024b10 buf={"execute":"qom-list","arguments":{"path":"/machine/peripheral"},"id":"libvirt-46"}
 len=85 ret=85 errno=0
2018-06-26 10:31:56.694+0000: 10194: info : qemuMonitorIOProcess:447 : QEMU_MONITOR_IO_PROCESS: mon=0x7f6fd4024b10 buf={"return": [{"name": "virtio-disk0", "type": "child<virtio-blk-pci>"}, {"name": "video0", "type": "child<cirrus-vga>"}, {"name": "serial0", "type": "child<isa-serial>"}, {"name": "net2", "type": "child<virtio-net-pci>"}, {"name": "net1", "type": "child<virtio-net-pci>"}, {"name": "net0", "type": "child<virtio-net-pci>"}, {"name": "balloon0", "type": "child<virtio-balloon-pci>"}, {"name": "input0", "type": "child<usb-tablet>"}, {"name": "usb", "type": "child<piix3-usb-uhci>"}, {"name": "type", "type": "string"}], "id": "libvirt-46"}
 len=539
2018-06-26 10:31:56.694+0000: 10194: debug : qemuMonitorJSONIOProcessLine:193 : Line [{"return": [{"name": "virtio-disk0", "type": "child<virtio-blk-pci>"}, {"name": "video0", "type": "child<cirrus-vga>"}, {"name": "serial0", "type": "child<isa-serial>"}, {"name": "net2", "type": "child<virtio-net-pci>"}, {"name": "net1", "type": "child<virtio-net-pci>"}, {"name": "net0", "type": "child<virtio-net-pci>"}, {"name": "balloon0", "type": "child<virtio-balloon-pci>"}, {"name": "input0", "type": "child<usb-tablet>"}, {"name": "usb", "type": "child<piix3-usb-uhci>"}, {"name": "type", "type": "string"}], "id": "libvirt-46"}]
2018-06-26 10:31:56.694+0000: 10194: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6fd4024b10 reply={"return": [{"name": "virtio-disk0", "type": "child<virtio-blk-pci>"}, {"name": "video0", "type": "child<cirrus-vga>"}, {"name": "serial0", "type": "child<isa-serial>"}, {"name": "net2", "type": "child<virtio-net-pci>"}, {"name": "net1", "type": "child<virtio-net-pci>"}, {"name": "net0", "type": "child<virtio-net-pci>"}, {"name": "balloon0", "type": "child<virtio-balloon-pci>"}, {"name": "input0", "type": "child<usb-tablet>"}, {"name": "usb", "type": "child<piix3-usb-uhci>"}, {"name": "type", "type": "string"}], "id": "libvirt-46"}


Looks like kvm ignored 7 delete requests (over 96 sec) without an actual error message.
No block operation is expected to be in progress,
and the domain is expected to be running.
Whether the guest machine crashed is unknown; AFAIK the guest does not have the qemu agent.
Does the guest really need to cooperate at all?

The test is API-only and does not check guest aliveness;
maybe adding a serial console log check to the test's exception handler could help.
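
A minimal sketch of what such a console-log check could look like, assuming the compute servers client's get_console_output() call; the helper name and wiring are illustrative, not an actual tempest patch:

# Hypothetical sketch only: wrap the existing "wait for the port to detach"
# step and dump the guest serial console when it times out. The helper name
# and wiring are made up; get_console_output() is assumed to behave like the
# tempest compute servers client call of the same name.
import logging

LOG = logging.getLogger(__name__)


def wait_for_detach_or_dump_console(servers_client, server_id, port_id, wait_fn):
    try:
        wait_fn(port_id)  # the test's existing polling loop
    except Exception:
        try:
            output = servers_client.get_console_output(server_id)['output']
            LOG.error("Port %s did not detach from server %s; console log:\n%s",
                      port_id, server_id, output)
        except Exception:
            LOG.exception("Could not fetch console log for server %s", server_id)
        raise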

Comment 15 Attila Fazekas 2018-06-27 11:00:38 UTC
#  while true ; do echo -ne "\0"; done | xargs -0 -I foobar -P 8 sh -c 'ostestr -r AttachInterfacesTestJSON || exit 255'


The above (not fully correct) attempt did not trigger the issue. In the original report the concurrent tests were mostly volume tests, which may involve high disk I/O (secure delete, 1 GB of zero writes) and volume attach to other domains as well.

Some interface deletions happen against a persistent domain, some against a transient domain. When does nova change the domain's transient/persistent state? Can that happen in parallel with the interface deletion?

Comment 16 Attila Fazekas 2018-06-27 16:44:15 UTC
The guest is alive, but it boots much slower than I expected under the given load.
Looks like it did not receive slot 6 / net1 as a hotplug, but it listed it as a PCI device in the normal boot log.

Comment 17 Attila Fazekas 2018-06-27 19:08:10 UTC
Tomorrow I will try to compile qemu with some extra debugging.

Theory: the PCI device hotplug removal was requested before the handler for hotplug interrupts was registered, and the repeated device_del does not lead to a new interrupt, so the guest never tries to remove the device. (old cirros (0.3.5))
I will also try it with newer cirros (0.4.0).
I'll also try to increase the vcpu count in the test machines.

That cirros (likely also vanilla Linux) does not process all hotplug events at early boot time was a suspected issue a long time ago, but it was never fully proven.

tempest.api tests usually do not ssh into the vm, so we are not ensuring the guest has fully booted before any operation. Even if the tests had ssh capability, it would require the non-default `instance validation` option. (Not enabled on the upstream gate because without kvm it would add too much to the gate time.)
The scenario version does connect to the vm and has not been seen randomly failing with a similar issue. (tempest/scenario/test_network_basic_ops.py)

Comment 18 Attila Fazekas 2018-06-27 21:58:54 UTC
A cloud-config script which prints /sys/firmware/acpi/interrupts/gpe01 and/or /proc/interrupts could also help to see whether the guest sees the interrupt. I did not find anything so far on the qemu/master source side which would suppress it, but we will see.
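
A rough sketch of the kind of in-guest check meant here; the paths are the standard Linux sysfs/procfs locations, and whether the cirros image exposes them (and has Python available) is an assumption:

# Rough in-guest sketch: print the ACPI GPE counters and the acpi line of
# /proc/interrupts, so repeated runs show whether the hotplug interrupt
# ever arrived. Paths are standard Linux sysfs/procfs locations.
import glob
import time


def dump_acpi_interrupt_counters():
    stamp = time.strftime("%H:%M:%S")
    for path in sorted(glob.glob("/sys/firmware/acpi/interrupts/gpe*")):
        try:
            with open(path) as f:
                print("%s %s: %s" % (stamp, path, f.read().strip()))
        except IOError:
            pass
    with open("/proc/interrupts") as f:
        for line in f:
            if "acpi" in line:
                print("%s %s" % (stamp, line.rstrip()))


if __name__ == "__main__":
    dump_acpi_interrupt_counters()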

Comment 19 Attila Fazekas 2018-06-28 10:23:31 UTC
cirros 0.4.0 does not seem to help; the machine size increase actually led to more random test failures.

Comment 20 Attila Fazekas 2018-06-28 16:44:32 UTC
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=707900 seems to solve the issue, unless my other change healed the machine unexpectedly (I rebuilt it with debug patches; I did not rebuild the older version).

The L2 guests are generally faster.

device_del always sends the acpi event.

A very early device_add/device_del can lead to the device not being removed, but a repeated interrupt `removes` all pending devices.

IMHO it is also possible (it does not have a bad effect on this test) that
a device_add is not noticed by the guest when it happens after PCI enumeration but before the hotplug events are handled. The hotplug-related registers are never read by the guest without an event/irq.

I'll run several tests with qemu-kvm-rhev-2.10.0-21.el7_5.4 in a cleaner env, to make sure it really helps.

We previously had qemu-kvm-rhev-2.10.0-21.el7_5.3 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=692240).

Comment 21 Attila Fazekas 2018-06-28 17:51:36 UTC
;-( Still happening in a cleaner env even with the newer qemu from brew.

(gnocchi wsgi may be leaking memory)

Comment 22 Attila Fazekas 2018-06-28 23:26:28 UTC
In case I get hit by a tram ;-)

The working environment had selinux in permissive mode.
Another possible difference is compile flags / gcc version.

I also increased the retry attempts, but that does not explain why the vms were faster.

Comment 23 Attila Fazekas 2018-06-29 09:07:30 UTC
My qemu-kvm build does not fix anything on its own.
selinux=permissive looks better; I had some I/O issue which may have caused the test to blow up once, nothing interesting in the audit log, and no idea so far why permissive could help here.

Using cirros-0.4.0.

Guest performance was judged from its console log, but maybe I observed it when fewer tests were running. Neither selinux=permissive nor my kvm provides significantly better performance in a consistent way;
I still have some cases of relatively slow boots under load, >>30 sec (without load, 2 sec).

The new qemu has some iscsi-related changes, but none of them seems related to the I/O issue.

Comment 24 Attila Fazekas 2018-06-29 14:30:37 UTC
I consider my success with the first kvm build a false positive; I was not able to replicate it even on the same machine.

Tempest could fail faster when the instance action is reported as an error,
and the serial console logging should also be added to the upstream version.

Nova might consider increasing the retry attempts:
the hyperv driver has 10 attempts with a 5 sec wait, while nova/libvirt has 7 retry attempts with a 2 sec sleep increment for each; it is possible nova raises the exception even if the last attempt leads to a successful device eject.
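
To make the gap concrete, a back-of-the-envelope comparison of the two retry budgets, using only the counts and sleep values quoted above (not values read from the driver code):

# Back-of-the-envelope comparison of the two retry budgets quoted above.
# The counts and sleep values are the ones stated in this comment, not
# values read from the actual driver code.

hyperv_budget = 10 * 5                              # 10 attempts, flat 5 s wait -> 50 s
libvirt_budget = sum(2 * i for i in range(1, 8))    # 7 attempts, 2 s increments -> 56 s

print("hyperv detach budget:       ~%d s" % hyperv_budget)
print("nova/libvirt detach budget: ~%d s" % libvirt_budget)
# Either budget is well below the ~96 s the guest needed in comment 13 and the
# 300 s tempest-side timeout mentioned later, which is the gap being argued here.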

There is no evidence that either libvirt or qemu-kvm does anything wrong.

I wonder: if the device removal is reported as failed in the instance action, is nova expected to try to add it back to the domain?

I'll try to sabotage the guest and see whether it is possible to reach an inconsistent state.


It is time to check how the bigger environment failed (8->16 vcpu, 16G->32G memory, api workers 4->8, tempest workers 4).

The regular envs do not report huge cpu steal time; lvm is thin-provisioned, so only the cinder backup does significant I/O, which should not be significant to the host.
Our luck may depend on the host L1 cloud load, but nothing has clearly indicated I/O or cpu steal so far.

I'll also check whether the significant vm boot time increase is really just due to a `valid` load from other things around.

Comment 25 Attila Fazekas 2018-06-30 19:29:00 UTC
Looks like the test jobs pass when using Skylake (2.1GHz) CPUs and fail when using Haswell (2.6GHz) CPUs.
Nested guest issue, or just luck?

The 16 vcpu Haswell nodes produced more issues than the 8 vcpu ones.

Comment 26 Attila Fazekas 2018-07-01 09:47:48 UTC
Skylake (2.1GHz) hosts have better I/O (randwrite 4k, depth 16: 10.3k IOPS vs 2.5k IOPS; 99% of latencies below 54ms vs below 242ms; 50% at 26ms vs 103ms).
Despite the lower reported frequency, stress-ng --cpu 16 --cpu-method matrixprod --metrics-brief --perf -t 60 gives 4163.36 ops/sec vs 3016.73 (Haswell 2.6GHz) (openstack was not stopped, but it is `idle`).

Comment 27 Attila Fazekas 2018-07-01 16:56:58 UTC
I created a bigger rhel L2 vm on both hosts.
When I just run cpu-intensive or I/O-intensive tests on the L1, the load's effect on the sample VM's performance does not seem abnormal,
but a tempest run (maybe the other qemu-kvm activity at boot) makes the test vm 1000 times slower and multiple soft lockups are reported:

[14968.148616] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [pickup:17742]
[14968.149606] Modules linked in: virtio_console cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ppdev i2c_piix4 parport_pc pcspkr joydev i2c_core virtio_balloon parport ip_tables xfs libcrc32c virtio_net virtio_blk ata_generic pata_acpi ata_piix libata virtio_pci virtio_ring serio_raw floppy virtio sunrpc
[14968.149606] CPU: 0 PID: 17742 Comm: pickup Kdump: loaded Tainted: G             L ------------   3.10.0-862.el7.x86_64 #1
[14968.149606] Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
[14968.149606] task: ffff931e39b71fa0 ti: ffff931e3a134000 task.ti: ffff931e3a134000
[14968.149606] RIP: 0010:[<ffffffffba09a8ea>]  [<ffffffffba09a8ea>] __do_softirq+0x9a/0x280
[14968.149606] RSP: 0000:ffff931e3fc03f20  EFLAGS: 00000206
[14968.149606] RAX: ffff931e3a137fd8 RBX: ffff931e3fc13ad8 RCX: 0000000000000000
[14968.149606] RDX: 0000000100dfd16f RSI: 00000000000096e9 RDI: ffff931e39b71fa0
[14968.149606] RBP: ffff931e3fc03f80 R08: 00000000f463f629 R09: 0000000000000001
[14968.149606] R10: 0000000000000000 R11: ffff931e3fc03de8 R12: ffff931e3fc03e98
[14968.149606] R13: ffffffffba7207f2 R14: ffff931e3fc03f80 R15: ffff931e3a137fd8
[14968.149606] FS:  00007f4c5762b880(0000) GS:ffff931e3fc00000(0000) knlGS:0000000000000000
[14968.149606] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14968.149606] CR2: 00007f4c52bb7144 CR3: 000000003512e000 CR4: 00000000000006f0
[14968.149606] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[14968.149606] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[14968.149606] Call Trace:
[14968.149606]  <IRQ> 
[14968.149606]  [<ffffffffba722d2c>] call_softirq+0x1c/0x30
[14968.149606]  [<ffffffffba02d625>] do_softirq+0x65/0xa0
[14968.149606]  [<ffffffffba09acc5>] irq_exit+0x105/0x110
[14968.149606]  [<ffffffffba7240a8>] smp_apic_timer_interrupt+0x48/0x60
[14968.149606]  [<ffffffffba7207f2>] apic_timer_interrupt+0x162/0x170
[14968.149606]  <EOI> 
[14968.149606]  [<ffffffffba20f0f0>] ? mem_cgroup_update_page_stat+0x20/0x50
[14968.149606]  [<ffffffffba20f0e6>] ? mem_cgroup_update_page_stat+0x16/0x50
[14968.149606]  [<ffffffffba1d219c>] page_add_file_rmap+0x8c/0xc0
[14968.149606]  [<ffffffffba1c0831>] do_set_pte+0x91/0xe0
[14968.149606]  [<ffffffffba1c0c1e>] do_read_fault.isra.60+0x10e/0x1a0
[14968.149606]  [<ffffffffba1c52dc>] handle_pte_fault+0x2dc/0xc30
[14968.149606]  [<ffffffffba1c747d>] handle_mm_fault+0x39d/0x9b0
[14968.149606]  [<ffffffffba71a587>] __do_page_fault+0x197/0x4f0
[14968.149606]  [<ffffffffba71a9c6>] trace_do_page_fault+0x56/0x150
[14968.149606]  [<ffffffffba719f52>] do_async_page_fault+0x22/0xf0
[14968.149606]  [<ffffffffba716798>] async_page_fault+0x28/0x30
[14968.149606] Code: 59 37 f7 45 c7 45 a4 0a 00 00 00 89 4d d0 48 89 45 c0 48 89 45 c8 0f 1f 00 65 c7 05 5d d8 f7 45 00 00 00 00 fb 66 0f 1f 44 00 00 <49> c7 c4 c0 70 c0 ba eb 0e 0f 1f 44 00 00 49 83 c4 08 41 d1 ef 

perf reported vmx_handle_external_intr as a frequent item for CPU 0/KVM (only CPU 0):

Samples: 1M of event 'cpu-clock', Event count (approx.): 317452750000
Overhead  Command          Shared Object                     Symbol
  22.83%  CPU 0/KVM        [kernel.kallsyms]                 [k] vmx_handle_external_intr
  16.08%  swapper          [kernel.kallsyms]                 [k] native_safe_halt
   3.03%  qemu-kvm         [kernel.kallsyms]                 [k] vmx_handle_external_intr
   2.78%  httpd            _bcrypt.so                        [.] Blowfish_encipher
   1.72%  neutron-server   libpython2.7.so.1.0               [.] PyEval_EvalFrameEx
   1.44%  httpd            libpython2.7.so.1.0               [.] PyEval_EvalFrameEx
   0.92%  swift-container  [kernel.kallsyms]                 [k] copy_page_rep
   0.64%  gnocchi-metricd  libpython2.7.so.1.0               [.] PyEval_EvalFrameEx


After tempest finishes, the VM works again;
the reported load on the L1 guest is just 30~40.


The effect is visible on both Skylake and Haswell.

Comment 29 Attila Fazekas 2018-07-02 07:59:44 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1597159 
The cpu_model change helps a little, but does not solve the issue.

Comment 31 Attila Fazekas 2018-07-03 07:53:38 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1597539 likely also leads to fewer libguestfs (qemu-kvm) invocations.

Comment 32 Attila Fazekas 2018-07-03 17:53:33 UTC
The traces led to blaming L0. The environment where we run the packstack jobs is not controlled by us; we will contact the hosting group.

The fleet we use to run tripleo jobs is still on old rhel 7.3, which does not seem to be affected.

Linux sealusa4.mobius.lab.eng.rdu2.redhat.com 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.9.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64

/sys/module/kvm_intel/parameters/:
emulate_invalid_guest_state:Y
enable_apicv:Y
enable_shadow_vmcs:Y
ept:Y
eptad:Y
fasteoi:Y
flexpriority:Y
nested:Y
ple_gap:128
ple_window:4096
ple_window_grow:2
ple_window_max:1073741823
ple_window_shrink:0
pml:Y
unrestricted_guest:Y
vmm_exclusive:Y
vpid:Y

Broadwell

Comment 33 Attila Fazekas 2018-07-04 09:37:47 UTC
L0 is likely using host-model instead of host-passthrough;
vmx_handle_external_intr started to show up in the perf report in my env when I switched to host-model. It also explains why the exact cpu model is missing from the L1 /proc/cpuinfo.

Comment 34 Daniel Berrangé 2018-07-04 09:46:10 UTC
Having read comments #12 onwards, I very much believe you are seeing a race condition here.

Device unplug in KVM is asynchronous and requires guest OS co-operation to complete the removal.  IOW, if you trigger a device unplug while the guest is in early bootup phase it is quite likely to be lost, or at the very least delayed.

I could certainly imagine that switching from QEMU to KVM has changed the boot up time in a way that causes the latent bug in the tests to be exposed. Ideally you want some mechanism by which you can detect that the guest has completed booting up, before trying the detach.

Comment 35 Attila Fazekas 2018-07-04 18:43:44 UTC
https://review.openstack.org/#/c/62886/6/tempest/scenario/test_stamp_pattern.py shows it is not always easy to land these kinds of changes in tempest.

Nova actually signals the vm multiple times; the first one is likely lost, and the last one is likely not waited on long enough. The test VM should be able to install the acpi handler within 2 sec, under load maybe 10 sec, but we have boot times above a minute.

Because we have cases where even a booted VM is delayed on the host side for more than 23 sec (lockup reports; the actual length is unknown, but enough to lose an ssh session), we need to solve the performance (and mini-freeze) issue in all cases, to remove it from the list of possible reasons.

I am not sure it is the only reason, and it is not just this test case that has detach issues; at higher load, the tests which ssh into the vm first can have (volume) detach issues as well.

In this particular test case, adding a hard reboot step might be an option, if it passes review. A hard reboot makes guest cooperation unnecessary.
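
If the hard-reboot idea were tried in tempest, it could look roughly like the following sketch (the client calls mirror the usual tempest compute clients, but this is illustrative, not a proposed patch):

# Illustrative sketch only, not a proposed patch: hard-reboot the server before
# verifying the detach, so removing the interface no longer depends on guest
# cooperation. Assumes the usual tempest compute clients and waiters.
from tempest.common import waiters


def detach_via_hard_reboot(servers_client, interfaces_client, server_id, port_id):
    interfaces_client.delete_interface(server_id, port_id)
    # A hard reboot recreates the domain, which should drop the detached NIC
    # even if the guest never acted on the ACPI unplug event.
    servers_client.reboot_server(server_id, type='HARD')
    waiters.wait_for_server_status(servers_client, server_id, 'ACTIVE')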

Comment 37 Attila Fazekas 2018-07-25 13:16:35 UTC
I found another log in one of the failed jobs; it is not related to guest cooperation, and it failed the detach.

2018-07-25 04:47:54.664 7256 DEBUG nova.objects.instance [req-b9344ef8-94ab-469b-aec7-ab7dd0d2ed1e 6b2bb6c8029349a1b3c78a2fd342eb1f 28e30556001b4b02abcb711c4b4172c3 - default default] Lazy-loading 'flavor' on Instance uuid 60b5b348-9144-4bfc-9315-ff67c3a93149 obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:1052
2018-07-25 04:47:54.745 7256 DEBUG nova.notifications.objects.base [req-b9344ef8-94ab-469b-aec7-ab7dd0d2ed1e 6b2bb6c8029349a1b3c78a2fd342eb1f 28e30556001b4b02abcb711c4b4172c3 - default default] Defaulting the value of the field 'projects' to None in FlavorPayload due to 'Cannot call _load_projects on orphaned Flavor object' populate_schema /usr/lib/python2.7/site-packages/nova/notifications/objects/base.py:130
2018-07-25 04:47:54.753 7256 DEBUG nova.network.neutronv2.api [req-b9344ef8-94ab-469b-aec7-ab7dd0d2ed1e 6b2bb6c8029349a1b3c78a2fd342eb1f 28e30556001b4b02abcb711c4b4172c3 - default default] [instance: 60b5b348-9144-4bfc-9315-ff67c3a93149] allocate_for_instance() allocate_for_instance /usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py:897
2018-07-25 04:47:55.035 7256 DEBUG nova.policy [req-b9344ef8-94ab-469b-aec7-ab7dd0d2ed1e 6b2bb6c8029349a1b3c78a2fd342eb1f 28e30556001b4b02abcb711c4b4172c3 - default default] Policy check for network:attach_external_network failed with credentials {'service_roles': [], 'user_id': u'6b2bb6c8029349a1b3c78a2fd342eb1f', 'roles': [u'_member_'], 'user_domain_id': u'default', 'service_project_id': None, 'service_user_id': None, 'service_user_domain_id': None, 'service_project_domain_id': None, 'is_admin_project': True, 'is_admin': False, 'project_id': u'28e30556001b4b02abcb711c4b4172c3', 'project_domain_id': u'default'} authorize /usr/lib/python2.7/site-packages/nova/policy.py:168


Looks like nova-compute refused to do the operation at the policy check;
the server id and project/user id seem correct.

Comment 39 Attila Fazekas 2018-07-26 07:34:39 UTC
network:attach_external_network is an admin-only permission; the calling user is expected to be non-admin. The network is not external.

It was not the policy check failure that prevented the interface detach, but there is no further log with the original request id.

The policy is checked and logged before even testing what type of network we have in context; that is an issue, but not this CI issue.

Comment 40 Attila Fazekas 2018-07-26 07:44:06 UTC
The request tracking was wrong; there were attempts to remove the interface
(https://bugzilla.redhat.com/show_bug.cgi?id=1590876).

Comment 41 Artom Lifshitz 2018-07-27 01:01:57 UTC
Looking at [1], the interface that failed to detach in time is 73258ac6-55b7-4a34-b5cf-86c5a0f5c3af. Here's when the DELETE is done:


2018-07-25 04:48:00,035 12252 INFO     [tempest.lib.common.rest_client] Request (AttachInterfacesTestJSON:test_reassign_port_between_servers): 202 DELETE http://172.16.1.21:8774/v2.1/28e30556001b4b02abcb711c4b4172c3/servers/60b5b348-9144-4bfc-9315-ff67c3a93149/os-interface/73258ac6-55b7-4a34-b5cf-86c5a0f5c3af 0.116s
2018-07-25 04:48:00,036 12252 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'status': '202', u'content-length': '0', 'content-location': 'http://172.16.1.21:8774/v2.1/28e30556001b4b02abcb711c4b4172c3/servers/60b5b348-9144-4bfc-9315-ff67c3a93149/os-interface/73258ac6-55b7-4a34-b5cf-86c5a0f5c3af', u'x-compute-request-id': 'req-90e613b0-9cdd-4982-8bf4-976e8ae3a1b8', u'vary': 'X-OpenStack-Nova-API-Version', u'openstack-api-version': 'compute 2.1', u'connection': 'close', u'x-openstack-nova-api-version': '2.1', u'date': 'Wed, 25 Jul 2018 08:48:00 GMT', u'content-type': 'application/json', u'x-openstack-request-id': 'req-90e613b0-9cdd-4982-8bf4-976e8ae3a1b8'}
        Body: 


After that follow dozens of GET requests to see whether it detaches:


200 GET http://172.16.1.21:9696/v2.0/ports/73258ac6-55b7-4a34-b5cf-86c5a0f5c3af


Once all of those fail to detect a detach, tempest runs cleanups and fails the test. I can't find the original DELETE request ID anywhere in the logs (6cffeafb32d8616412978e0758913530-aio-0.tar.gz). Since you mentioned request ID tracking might be broken, I tried looking for the port ID (73258ac6-55b7-4a34-b5cf-86c5a0f5c3af), but all I could find was a 'ovs-vsctl' line in var/log/messages.


[1] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/ReleaseDelivery/view/OSP13/job/phase2-13-rhel-7.5-openstack-all-in-one-neutron-rabbitmq/203/testReport/junit/tempest.api.compute.servers.test_attach_interfaces/AttachInterfacesTestJSON/test_reassign_port_between_servers_id_2f3a0127_95c7_4977_92d2_bc5aec602fb4_/

Comment 42 Artom Lifshitz 2018-08-07 12:19:01 UTC
After initially being able to reproduce this at will on the env that Attila gave me (10.8.247.245), I enabled libvirt debugging and stopped being able to reproduce it. Lately, I'd been running into resource issues that manifested themselves as timeouts, no valid host errors, etc. In an attempt to fix those, I re-deployed with packstack with all metrics disabled (by setting CONFIG_CEILOMETER/PANKO/AODH/_INSTALL=n in /root/packstack_config.txt). Since then, the resource issues have gone away, but I'm still unable to reproduce this. I'm wondering if re-running packstack has unintentionally fixed this, somehow.

I believe that while this bug manifests (manifested?) itself in Nova, the underlying cause is lower down in the stack. In case this becomes reproducible in another env, we can attempt to enable libvirt debugging in that env, and submit the resulting logs to the virt team for review. The following commands do so for the running libvirtd daemon:

# virt-admin daemon-log-filters         "1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
# virt-admin daemon-log-outputs "1:file:/var/log/libvirt/libvirtd.log"

For now I'd like to close this bug, since there's no longer anything actionable here. I understand CI is still experiencing issues, but I'd really like to keep this BZ scoped to this single "detaching an interface fails in specific conditions" bug. If any other Nova issues arise, by all means open a new bug or hit us up on IRC and we'll do our best to help.

Cheers!

Comment 45 Artom Lifshitz 2018-08-09 14:35:49 UTC
Talking with Daniel Berrange on IRC, he reiterated that we need guest cooperation to detach any PCI device. QEMU sends an ACPI event, it's up to the guest OS to act on it. Clearly in this case this is not happening. So I'm keeping this CLOSED, but changing to NOTABUG, because it's not a bug in Nova.

What happens now is less clear. One question that arose is whether calling the libvirt interface detach API again and again causes QEMU to send more than one event. If only the initial event is sent but is ignored by the guest, there's no point in retrying the detach. If QEMU resends the event every time, it could be worth trying to at least work around this in tempest by retrying the detach multiple times. There's already a retry loop at the Nova level, but maybe that's not enough.

Short of looking at the libvirt and/or QEMU code (my C isn't up to par for that, sorry), we could try artificially reproducing this by stopping acpid in a guest in order to "ignore" the event(s) sent by QEMU, and seeing whether retrying the detach in libvirt causes multiple events to be sent.

We could also straight up hack tempest to retry the detach multiple times and hope it changes something.

Another option would be to ask the KVM folks to look at this as a nested KVM guest performance bug.

Comment 46 Artom Lifshitz 2018-08-09 16:38:13 UTC
Conversing further with Dan Berrange, it may be that when *attaching* the interface, it's not being fully initialised by the guest (or we're detaching it too quickly), preventing it from being subsequently detached. Indeed it looks like the detach failure is only happening with interfaces that were attached *after* instance boot, as opposed to interfaces the instance was booted with from the beginning. So another option would be for tempest to wait a bit between interface attach and the subsequent detach. If the interface has a floating IP, we could even wait until it starts responding to ping to make sure the guest has initialised it properly.
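
A sketch of the "wait until the hot-plugged interface answers ping" idea; it assumes a floating IP has already been associated with the new port, and the helper name is made up:

# Illustrative sketch: after attaching the port, wait until its floating IP
# answers ping before attempting the detach. Assumes a floating IP has already
# been associated with the hot-plugged port.
import subprocess
import time


def wait_until_pingable(ip, timeout=120, interval=5):
    deadline = time.time() + timeout
    while time.time() < deadline:
        # -c 1: single probe, -W 2: per-probe timeout in seconds (Linux ping)
        if subprocess.call(['ping', '-c', '1', '-W', '2', ip]) == 0:
            return
        time.sleep(interval)
    raise RuntimeError('%s did not answer ping within %s seconds' % (ip, timeout))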

Comment 47 Attila Fazekas 2018-08-10 07:52:24 UTC
There was a bug about qemu missing the event with a different chipset (Q35);
I would like to make sure we did not run into a similar issue with i440 before we start adding random things to tempest.

An unconditional sleep will get a -2 in review.

ssh usage in an api test is usually allowed only when a non-default tempest flag is enabled, so it might get rejected, but it may be worth a try.

IMHO we can also try to hard reboot the machine; if I understood the documentation correctly, it should remove the interface without guest cooperation. (Only when the reboot capability is reported to the hypervisor; for kvm it is, and it is enabled by default.)

Yes, if you froze the vm the failure would look similar.
In some cases the interface remove event was received around the time the tempest cleanup kicked in.

I also patched tempest to print the console log of the guest; nothing conclusive.

The libvirt log confirms that multiple requests were sent.
I did not try just disabling acpi, but I also assume the acpi irq was sent.

I just enabled this last time:
https://git.qemu.org/gitweb.cgi?p=qemu.git;a=blob;f=hw/acpi/pcihp.c;h=2b0f3e1bfb414d2fd0aa3a2395104ea76172957d;hb=df8ad9f128c15aa0a0ebc7b24e9a22c9775b67af#l42

I am planning to add a timestamp next time,
and maybe another debug print around the confirmation event to libvirt.

I also made an interface attach faster than nova/libvirt could, over the monitor interface, right after boot.
The vm did not see the hotplug event and missed some unplug events, but the unplug is expected to work once the guest actually starts listening at a later time.

Comment 48 Attila Fazekas 2018-08-10 13:45:56 UTC
Domain start:
2018-08-10 12:39:08.679+0000: 4100: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f22c0029ca0 event={"timestamp": {"seconds": 1533904748, "microseconds": 679310}, "event": "RESUME"}


ACPI REQ not processed by the VM (IRQ not logged):
2018-08-10 12:39:13.022+0000: 4100: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-24"}
2018-08-10 12:39:18.033+0000: 4100: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-26"}
2018-08-10 12:39:25.043+0000: 4100: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-28"}
2018-08-10 12:39:34.058+0000: 4100: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-30"}
2018-08-10 12:39:45.070+0000: 4100: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-42"}
2018-08-10 12:39:58.101+0000: 4100: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-44"}
2018-08-10 12:40:13.112+0000: 4100: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-46"}


The first request whose event the vm actually got:
2018-08-10 12:40:30.139+0000: 4100: info : qemuMonitorIOWrite:552 : 
QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-48"}


The VM sees both the add and removal requests, but it does not remove the dev.
1533904830.922498775 pcisel write 10 <== 0
1533904830.923318329 pci_up_read 64
1533904830.925539634 pci_down_read 64

(qemu could have removed the add request)

The last removal request the VM obeyed; nova likely processed the data before the event arrived and concluded failure:

2018-08-10 12:40:49.153+0000: 4100: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f22c0029ca0 buf={"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-50"}

1533904849.159214033 pcisel write 10 <== 0
1533904849.159912083 pci_up_read 0
1533904849.160179216 pci_down_read 64
1533904849.175131090 pcisel write 10 <== 0
1533904849.176430266 pciej write 8 <== 64

2018-08-10 12:40:49.176+0000: 4100: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f22c0029ca0 event={"timestamp": {"seconds": 1533904849, "microseconds": 176388}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/net1/virtio-backend"}}


2018-08-10 08:40:49.208 31417 ERROR oslo.service.loopingcall DeviceDetachFailed: Device detach failed for tap54e4e6c9-4f: Unable to detach from guest transient domain.

On this system the VM did not process an acpi irq in the first 50 sec of its life, which is definitely not what we expected. We had some changes on the system, so I will measure parallel boot (acpi-ready) times again. Previously, concurrent L2 guest operations were shown to be much slower than expected from the cpu count and load.

Landing changes in tempest to solve this issue MAY run into difficulties in review.
I wonder whether it would be possible to change the nova retry count, or make it configurable like for example hyperv has.

On the tempest side, more than just this would need to be changed to eliminate similar issues. It might be a topic for the PTG.

Tempest by default waits 300 sec for most changes, but the nova internal timeout is lower.
Defaulting to ~10 retries and making it configurable on the nova side (also for volumes and other pci devices) would be a great help.

Comment 49 Attila Fazekas 2018-08-10 17:49:34 UTC
IMHO I figured out a solution which may be easier to sell, without any new config option in either nova or tempest.

As I mentioned before, tempest could fail earlier if it consumed the `server actions` history, but it could also retry the operation as long as it does not reach the original (default 300 sec) timeout.
As a bonus, it could print the console log of the related server in case of failure (when the console log is enabled). If initially we touch only the most flaky test case, it might be small enough to backport easily.
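
A minimal sketch of that proposal, with illustrative client wiring rather than an actual tempest change; it leans on the observation in comment 20 that device_del always re-sends the ACPI event, so repeating the detach is not pointless:

# Minimal sketch of the proposal above: keep re-issuing the detach until the
# port is gone or the usual 300 s budget is spent, then attach the console log
# to the failure. Client wiring is illustrative, not an actual tempest patch.
import time


def detach_with_retry(interfaces_client, servers_client, server_id, port_id,
                      timeout=300, interval=10):
    deadline = time.time() + timeout
    while time.time() < deadline:
        body = interfaces_client.list_interfaces(server_id)
        attached = [i['port_id'] for i in body['interfaceAttachments']]
        if port_id not in attached:
            return
        # Ask nova again; per comment 20, device_del re-sends the ACPI event.
        interfaces_client.delete_interface(server_id, port_id)
        time.sleep(interval)
    console = servers_client.get_console_output(server_id)['output']
    raise AssertionError('port %s still attached after %s s; console log:\n%s'
                         % (port_id, timeout, console))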

Comment 50 smooney 2018-08-31 13:51:32 UTC
hi  i think i may have a less invasive workaround.

my speculation is that the guest does not respond to the interrupts because the guest os/kernel is not fully running for the first X seconds after the vm is active

if we change 
http://git.openstack.org/cgit/openstack/tempest/tree/tempest/api/compute/servers/test_attach_interfaces.py#n55
from 
self.create_test_server(wait_until='ACTIVE')
to 
servers.create_test_server(
        sshable=True, wait_until='SSHABLE')

per the example here
https://specs.openstack.org/openstack/qa-specs/specs/tempest/ssh-auth-strategy.html

this will make sure we wait until the os is fully running and we can log in to the machine before we attach/remove interfaces.

i will submit a patch upstream to do this.

regards
sean.

Comment 51 smooney 2018-08-31 14:42:18 UTC
this is an upstream review
https://review.openstack.org/598983
to make that change. i don't know if there is a way to test this as it's intermittent, but i think that might resolve this issue.

regards
sean

Comment 52 Artom Lifshitz 2018-09-05 13:44:45 UTC
(In reply to smooney from comment #50)

> per the example here
> https://specs.openstack.org/openstack/qa-specs/specs/tempest/ssh-auth-
> strategy.html

That spec looks to have never been implemented, as evidenced by, for example, the comment "TODO(jlanoux) add support of wait_until PINGABLE/SSHABLE" in the create_test_server method in tempest/common/compute.py. We could try doing it by hand by just calling SSH with a "heartbeat" command like "true" to make sure the guest is up and running. I've proposed a patch to that effect [1]. We could try to backport the in-flight patch to our downstream tempest to see if it has an effect on the failing tests?

[1] https://review.openstack.org/600046
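
The "heartbeat" idea in a few lines, assuming linux_client is an already-configured remote client (for example tempest's RemoteClient built from the server's floating IP and validation credentials); the real patch may wire this differently:

# Sketch of the "heartbeat" check described above: keep running a trivial
# command over SSH until it succeeds, proving the guest OS is fully up.
import time


def wait_until_sshable(linux_client, timeout=300, interval=10):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            linux_client.exec_command('true')   # trivial command as heartbeat
            return
        except Exception:
            time.sleep(interval)
    raise RuntimeError('guest did not become SSHable within %s seconds' % timeout)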

Comment 53 Attila Fazekas 2018-09-12 20:07:50 UTC
I am at the PTG; most likely we will add an ssh step to the relevant api tests and make the run_validation option default to true.

Comment 54 Artom Lifshitz 2018-09-13 07:30:16 UTC
(In reply to Attila Fazekas from comment #53)
> I am on the PTG, most likely we will add an ssh-step to the relevant api
> tests and make the run_validation option default true.

For what it's worth, I've proposed a patch that I believe does exactly that [1].

[1] https://review.openstack.org/#/c/600046/

Comment 55 Attila Fazekas 2018-10-03 14:39:30 UTC
Please backport both changes:

https://review.openstack.org/#/c/600046/
https://review.openstack.org/#/c/602206/

Other patches will come, but they are out of scope of this BZ.

Comment 56 Artom Lifshitz 2018-10-05 09:55:50 UTC
(In reply to Attila Fazekas from comment #55)
> Please backport both changes:
> 
> https://review.openstack.org/#/c/600046/
> https://review.openstack.org/#/c/602206/
> 
> Other patches will come, but they are out of scope of this BZ.

Not massively familiar with how downstream tempest works - by backport you mean downstream, all the way down to OSP10?

Comment 57 Attila Fazekas 2018-11-08 11:31:29 UTC
Yes, I meant all the way down, start with 13 and see how far we can go.
 
With 13/12 the issue happens more frequently, in test_reassign_port_between_servers.

10 is not the most problematic in a typical test run (before the performance change we did not have issues).

11 is not supported anymore (skip it),
and 12's life cycle is also short (EOL 2018-12-13).

13 is a long-term support release, so it is a must-have change there.

10 still has a very long remaining life cycle, so we should avoid
this kind of test-failure noise as much as possible.

So start with 13, and clone to the other still supported ones.

Comment 81 Martin Kopec 2018-12-19 12:47:36 UTC
The change which seems to fix the issue (during testing we have not hit it again) is present in the openstack-tempest-13.0.0-20.bafe630git.el7ost package, which is available in the latest puddles.

Comment 83 errata-xmlrpc 2019-01-16 17:09:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0075

