Bug 1821277 - virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Summary: virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-containers
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dan Prince
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-06 13:23 UTC by Ketan Mehta
Modified: 2024-10-01 16:32 UTC (History)
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-28 14:22:00 UTC
Target Upstream Version:
Embargoed:
jhardee: needinfo-


Attachments
VM xml - g2.xml (5.58 KB, text/plain) - 2020-10-26 11:02 UTC, chhu
libvirtd.log (11.98 MB, text/plain) - 2020-10-26 11:03 UTC, chhu
client_libvirtd.log (1.03 MB, application/gzip) - 2020-10-26 12:42 UTC, chhu
client_hang_libvirtd.log (3.39 MB, application/gzip) - 2020-10-26 13:21 UTC, chhu


Links
Red Hat Issue Tracker OSP-3250 - last updated 2022-07-08 14:10:09 UTC

Description Ketan Mehta 2020-04-06 13:23:36 UTC
Description of problem:

The errors shown below are noticed on all controller and compute nodes of a RHOSP13 setup.

systemctl status docker

~~~
Mar 24 08:47:20 host8..com dockerd-current[7497]: 2020-03-24 08:47:20.804+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:47:34 host8..com dockerd-current[7497]: 2020-03-24 08:47:34.668+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:48:20 host8..com dockerd-current[7497]: 2020-03-24 08:48:20.835+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:48:35 host8..com dockerd-current[7497]: 2020-03-24 08:48:35.015+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:49:20 host8..com dockerd-current[7497]: 2020-03-24 08:49:20.765+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:49:34 host8..com dockerd-current[7497]: 2020-03-24 08:49:34.738+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:50:21 host8..com dockerd-current[7497]: 2020-03-24 08:50:21.006+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:50:34 host8..com dockerd-current[7497]: 2020-03-24 08:50:34.784+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:51:20 host8..com dockerd-current[7497]: 2020-03-24 08:51:20.844+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
~~~

To add context about the environment in which these errors are noticed: there is a filebeat container running on those nodes (as part of an ELK stack).

However, stopping the filebeat container does not prevent these log messages from occurring.
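
For reference, to check which clients actually hold connections to the libvirt socket on a node (a rough check, assuming the default socket path):

~~~
# Show which processes hold connections to the libvirtd UNIX socket (default path assumed)
ss -xp | grep libvirt-sock
# or equivalently:
lsof /var/run/libvirt/libvirt-sock
~~~

Whichever client shows up there and then goes away uncleanly should be the one the daemon is complaining about.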

Version-Release number of selected component (if applicable):

docker-1.13.1-109.gitcccb291.el7_7.x86_64                   Sun Feb 23 19:02:51 2020
docker-client-1.13.1-109.gitcccb291.el7_7.x86_64            Sun Feb 23 19:02:48 2020
docker-common-1.13.1-109.gitcccb291.el7_7.x86_64            Sun Feb 23 19:02:47 2020
docker-rhel-push-plugin-1.13.1-109.gitcccb291.el7_7.x86_64  Sun Feb 23 19:02:47 2020
python-docker-2.4.2-2.el7.noarch                            Fri Jun 28 01:59:48 2019
python-docker-pycreds-0.3.0-11.el7.noarch                   Sun Feb 23 19:43:45 2020
python-obfuscateduser40-agent-docker-cmd-1.5.4-1.el7ost.noarch          Fri Jun 28 02:03:27 2019


How reproducible:

Not sure at the moment; these errors are not noticed in every RHOSP13 environment.


Actual results:


Expected results:


Additional info:
There is no functional impact, but we would like to know the cause of these messages.

Comment 9 jhardee 2020-07-27 19:50:53 UTC
Hi,

I have a customer who is showing this message as well.

2020-07-15 13:00:16.389 1 ERROR nova.virt.libvirt.driver [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] [instance: e0f46da7-9105-4b2e-8ef0-58928e1ad325] Failed to start libvirt guest: libvirtError: Cannot recv data: Connection reset by peer
2020-07-15 13:00:16.391 1 INFO os_vif [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] Successfully unplugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:86:ea:af,bridge_name='br-int',has_traffic_filtering=True,id=77ee2aaa-f1d4-4d8d-a99f-878ff0842c52,network=Network(c14dce0c-0b08-4b65-b5e1-f70c426e246f),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap77ee2aaa-f1')
2020-07-15 13:00:16.395 1 ERROR nova.compute.manager [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] [instance: e0f46da7-9105-4b2e-8ef0-58928e1ad325] Cannot reboot instance: Cannot recv data: Connection reset by peer: libvirtError: Cannot recv data: Connection reset by peer
2020-07-15 13:00:17.752 1 INFO nova.compute.manager [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] [instance: e0f46da7-9105-4b2e-8ef0-58928e1ad325] Successfully reverted task state from reboot_started_hard on failure for instance.
2020-07-15 13:00:17.759 1 ERROR oslo_messaging.rpc.server [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] Exception during message handling: libvirtError: Cannot recv data: Connection reset by peer


2020-07-15 13:00:17.759 1 ERROR oslo_messaging.rpc.server libvirtError: Cannot recv data: Connection reset by peer
2020-07-15 13:00:17.759 1 ERROR oslo_messaging.rpc.server
2020-07-15 13:00:22.497 1 INFO nova.compute.manager [req-d48db47e-a5cd-476e-b415-539f307687c6 - - - - -] Running instance usage audit for host compute02.localdomain from 2020-07-15 12:00:00 to 2020-07-15 13:00:00. 7 instances.
2020-07-15 13:00:30.940 1 INFO nova.compute.resource_tracker [req-d48db47e-a5cd-476e-b415-539f307687c6 - - - - -] Final resource view: name=compute02.localdomain phys_ram=64163MB used_ram=58368MB phys_disk=8371GB used_disk=540GB total_vcpus=64 used_vcpus=24 pci_stats=[]
2020-07-15 13:01:31.947 1 INFO nova.compute.resource_tracker [req-d48db47e-a5cd-476e-b415-539f307687c6 - - - - -] Final resource view: name=compute02.localdomain phys_ram=64163MB used_ram=58368MB phys_disk=8371GB used_disk=540GB total_vcpus=64 used_vcpus=24 pci_stats=[]
2020-07-15 13:02:31.952 1 INFO nova.compute.resource_tracker [req-d48db47e-a5cd-476e-b415-539f307687c6 - - - - -] Final resource view: name=compute02.localdomain phys_ram=64163MB used_ram=58368MB phys_disk=8371GB used_disk=540GB total_vcpus=64 used_vcpus=24 pci_stats=[]
2020-07-15 13:02:56.420 1 INFO nova.compute.manager [-] [instance: e0f46da7-9105-4b2e-8ef0-58928e1ad325] During _sync_instance_power_state the DB power_state (4) does not match the vm_power_state from the hypervisor (0). Updating power_state in the DB to match the hypervisor.

* snapshot of the virNetSocketReadWire:1806 ERROR
https://access.redhat.com/hydra/rest/cases/02702586/attachments/52410106-1526-4180-89d0-54d0be99cf41?usePresignedUrl=true

Comment 10 Emma Foley 2020-07-29 15:33:08 UTC
@Kevin, This doesn't look relevant to CloudOps, perhaps the compute DFG should take a look at it.

Comment 16 Kashyap Chamarthy 2020-10-23 11:59:54 UTC
[My comment from last week didn't register because the browser crashed.]

From a chat with libvirt folks: that error is usually seen when libvirtd dies and the client loses its connection (as expected).

That said, if this is reproducible, more fine-grained libvirt logs[1] might tell us more.

@Michal: Does this ring any bells for you?

[1] https://kashyapc.fedorapeople.org/virt/openstack/request-nova-libvirt-qemu-debug-logs.txt
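
For convenience, the gist of [1] is roughly the following (a sketch; file paths need adjusting for containerized deployments):

~~~
# /etc/libvirt/libvirtd.conf (roughly; see [1] for the full details)
log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
# then restart libvirtd (or the nova_libvirt container on OSP nodes)
~~~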

Comment 20 chhu 2020-10-26 11:02:25 UTC
Created attachment 1724139 [details]
VM xml - g2.xml

Comment 21 chhu 2020-10-26 11:03:36 UTC
Created attachment 1724140 [details]
libvirtd.log

Comment 22 Michal Privoznik 2020-10-26 11:14:15 UTC
(In reply to chhu from comment #21)
> Created attachment 1724140 [details]
> libvirtd.log

This is the daemon log; can you please provide the client log? Per my analysis in comment 17, it's the client that is disconnecting, and I'm wondering why.
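
(If it helps: client-side debug logs can be captured by running the client with the libvirt logging environment variables set; a rough sketch, the output path is only an example:)

~~~
# Capture client-side (virsh) debug logs; /tmp/client_libvirtd.log is just an example path
LIBVIRT_DEBUG=1 \
LIBVIRT_LOG_OUTPUTS="1:file:/tmp/client_libvirtd.log" \
virsh start g2
~~~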

Comment 23 chhu 2020-10-26 12:42:23 UTC
Created attachment 1724151 [details]
client_libvirtd.log

Comment 24 chhu 2020-10-26 13:21:20 UTC
Created attachment 1724156 [details]
client_hang_libvirtd.log

Comment 25 Michal Privoznik 2020-10-27 08:49:21 UTC
Okay, so from the client logs I see some weirdness happening. These are my findings from investigating the problematic run:

Here, the client (virsh) starts up and connects to the daemon.
client:
2020-10-26 12:36:22.226+0000: 14506: info : libvirt version: 6.0.0, package: 25.module+el8.2.1+7154+47ffd890 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-06-25-15:13:13, )
2020-10-26 12:36:22.226+0000: 14506: info : hostname: dell-per740-05.rhts.eng.pek2.redhat.com
2020-10-26 12:36:22.226+0000: 14506: debug : virGlobalInit:244 : register drivers
...
2020-10-26 12:36:22.226+0000: 14507: debug : virThreadJobSet:97 : Thread 14507 is now running job vshEventLoop
2020-10-26 12:36:22.226+0000: 14506: debug : doRemoteOpen:1094 : proceeding with name = 
2020-10-26 12:36:22.226+0000: 14507: debug : virEventRunDefaultImpl:350 : running default event implementation
2020-10-26 12:36:22.226+0000: 14506: debug : doRemoteOpen:1103 : Connecting with transport 1
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollCleanupTimeouts:520 : Cleanup 1
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollCleanupHandles:569 : Cleanup 1
2020-10-26 12:36:22.226+0000: 14506: debug : remoteGetUNIXSocketHelper:799 : Built UNIX sockname /var/run/libvirt/libvirt-sock for transport unix prefix libvirt flags=0x0
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollMakePollFDs:396 : Prepare n=0 w=1, f=4 e=1 d=0
2020-10-26 12:36:22.226+0000: 14506: debug : remoteGetUNIXSocket:885 : Chosen UNIX sockname /var/run/libvirt/libvirt-sock daemon libvirtd for mode legacy transport unix flags=0x0
2020-10-26 12:36:22.226+0000: 14506: debug : doRemoteOpen:1127 : Chosen UNIX socket /var/run/libvirt/libvirt-sock
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollCalculateTimeout:333 : Calculate expiry of 1 timers
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollCalculateTimeout:366 : No timeout is pending
2020-10-26 12:36:22.226+0000: 14507: info : virEventPollRunOnce:635 : EVENT_POLL_RUN: nhandles=1 timeout=-1


The daemon accepts the connection request and opens the QEMU driver.
server:
2020-10-26 12:36:22.228+0000: 10048: debug : virThreadJobSet:94 : Thread 10048 (virNetServerHandleJob) is now running job remoteDispatchConnectOpen
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpen:1074 : name=
2020-10-26 12:36:22.228+0000: 10048: debug : virConfLoadConfig:1533 : Loading config file '/etc/libvirt/libvirt.conf'
2020-10-26 12:36:22.228+0000: 10048: debug : virConfReadFile:737 : filename=/etc/libvirt/libvirt.conf
2020-10-26 12:36:22.228+0000: 10048: debug : virConfGetValueString:882 : Get value string (nil) 0
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:859 : Trying to probe for default URI
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:866 : QEMU driver URI probe returned 'qemu:///system'
2020-10-26 12:36:22.228+0000: 10048: debug : virConfGetValueStringList:930 : Get value string list (nil) 0
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:903 : Split "qemu:///system" to URI components:
  scheme qemu
  server <null>
  user <null>
  port 0
  path /system

  2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:954 : trying driver 9 (QEMU) ...
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:969 : Checking for supported URI schemes
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:972 : Matched URI scheme 'qemu'
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:1005 : driver 9 QEMU returned SUCCESS
2020-10-26 12:36:22.228+0000: 10048: debug : virThreadJobClear:119 : Thread 10048 (virNetServerHandleJob) finished job remoteDispatchConnectOpen with ret=0


The client proceeds with sending the virDomainCreate() request for the g2 domain.
client:
2020-10-26 12:36:22.231+0000: 14506: debug : virDomainCreate:6525 : dom=0x5623e67c0e80, (VM: name=g2, uuid=743eee8c-16f1-11eb-a35f-20040febab30)
2020-10-26 12:36:22.231+0000: 14506: debug : virNetMessageNew:45 : msg=0x5623e67c47d0 tracked=0
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollDispatchTimeouts:427 : Dispatch 2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollDispatchHandles:473 : Dispatch 2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollDispatchHandles:487 : i=0 w=1
2020-10-26 12:36:22.231+0000: 14506: debug : virNetMessageEncodePayload:388 : Encode length as 56
2020-10-26 12:36:22.231+0000: 14507: info : virEventPollDispatchHandles:501 : EVENT_POLL_DISPATCH_HANDLE: watch=1 events=1
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollDispatchHandles:487 : i=1 w=2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCleanupTimeouts:520 : Cleanup 2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCleanupHandles:569 : Cleanup 2
2020-10-26 12:36:22.231+0000: 14506: info : virNetClientSendInternal:2085 : RPC_CLIENT_MSG_TX_QUEUE: client=0x5623e67c3f20 len=56 prog=536903814 vers=1 proc=9 type=0 status=0 serial=8
2020-10-26 12:36:22.231+0000: 14507: debug : virEventRunDefaultImpl:350 : running default event implementation
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCleanupTimeouts:520 : Cleanup 2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCleanupHandles:569 : Cleanup 2
2020-10-26 12:36:22.231+0000: 14506: debug : virNetClientCallNew:2038 : New call 0x5623e67c32e0: msg=0x5623e67c47d0, expectReply=1, nonBlock=0
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollMakePollFDs:396 : Prepare n=0 w=1, f=4 e=1 d=0
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollMakePollFDs:396 : Prepare n=1 w=2, f=6 e=1 d=0
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCalculateTimeout:333 : Calculate expiry of 2 timers


The daemon performs the startup.
server:
2020-10-26 12:36:22.231+0000: 10050: debug : virThreadJobSet:94 : Thread 10050 (virNetServerHandleJob) is now running job remoteDispatchDomainCreate
2020-10-26 12:36:22.231+0000: 10050: debug : virDomainCreate:6525 : dom=0x7f90e802e670, (VM: name=g2, uuid=743eee8c-16f1-11eb-a35f-20040febab30)
2020-10-26 12:36:22.231+0000: 10050: debug : qemuDomainObjBeginJobInternal:9784 : Starting job: job=none agentJob=none asyncJob=start (vm=0x7f90a0106e10 name=g2, current job=none agentJob=none async=none)
2020-10-26 12:36:22.231+0000: 10050: debug : qemuDomainObjBeginJobInternal:9841 : Started async job: start (vm=0x7f90a0106e10 name=g2)
2020-10-26 12:36:22.231+0000: 10050: debug : qemuProcessStart:7077 : conn=0x7f90f4019b10 driver=0x7f90a00f7f60 vm=0x7f90a0106e10 name=g2 id=-1 asyncJob=start migrateFrom=<null> migrateFd=-1 migratePath=<null> snapshot=(nil) vmop=0 flags=0x1


While the daemon is performing the startup, the client waits for a reply to the issued RPC, but also sets a timer to wake up in 5 seconds so that a keepalive message can be sent:
client:
2020-10-26 12:36:22.231+0000: 14507: info : virEventPollRunOnce:635 : EVENT_POLL_RUN: nhandles=1 timeout=5000
2020-10-26 12:36:27.571+0000: 14572: info : libvirt version: 6.0.0, package: 25.module+el8.2.1+7154+47ffd890 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-06-25-15:13:13, )


But the last message logged is entering the poll(). The next one is the greeting message from another run of virsh (after 5 seconds, because there's a "sleep 5" in the loop in comment 19). Therefore, I think virsh crashed when entering poll(). Chenli, is there any record of this? Something in the system log, perhaps even a coredump?
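
Something along these lines might show it (a generic sketch, assuming a systemd host):

~~~
# Any recorded core dumps for virsh?
coredumpctl list virsh
# Anything suspicious in the journal around the failing run?
journalctl --since "2020-10-26 12:30" --until "2020-10-26 12:40" | grep -iE 'virsh|segfault|core dump'
~~~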

Comment 28 chhu 2020-10-29 11:55:53 UTC
Thanks for Michal Privoznik's debugging!

After further debugging and testing on the environment, we found that the `virsh start` hang is a qemu bug; I'll file a separate qemu bug for the VM hang issue.
The libvirtd.log error message below happens when the VM hangs and the `virsh start` terminal is exited manually:
- "error : virNetSocketReadWire:1832 : End of file while reading data: Input/output error"

Comment 29 Michal Privoznik 2020-10-29 12:37:15 UTC
Right. So per my investigation in comment 25, virsh did not crash but was killed manually by the user. Therefore, we don't have a reproducer. If somebody who is able to reproduce this could provide logs, including client logs, that would be great.

Just for the record, the error does not necessarily mean there's a problem. It is reported whenever a client disconnects without closing the connection properly. It is only a problem if the client was still doing, or wanted to do, something more.
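
For illustration, the message can be produced harmlessly by killing a connected client; a rough sketch, assuming the default daemon log location:

~~~
# Hold a libvirt connection open, then kill the client without a clean close
virsh event --all --loop &   # waits for domain events, keeps the connection open
kill -9 $!                   # client disappears without calling virConnectClose()
# libvirtd will then typically log something like:
#   error : virNetSocketReadWire:... : End of file while reading data: Input/output error
grep virNetSocketReadWire /var/log/libvirt/libvirtd.log | tail -n 2
~~~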

Comment 30 chhu 2020-10-30 03:03:47 UTC
(In reply to chhu from comment #28)
> Thanks for Michal Privoznik's debugging!
> 
> After further debugging and testing on the environment, we found that the
> `virsh start` hang is a qemu bug; I'll file a separate qemu bug for the VM
> hang issue.
Tracked in Bug 1892669 - VM start hang - qemu stuck in query-balloon

Comment 31 chhu 2020-10-30 06:32:26 UTC
(In reply to jhardee from comment #9)
> Hi,
> 
> I have a customer who is showing this message as well.
> 
> 2020-07-15 13:00:16.389 1 ERROR nova.virt.libvirt.driver
> [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05
> d6a5f6c4f88e41b499f823dd147a24d2 - default default] [instance:
> e0f46da7-9105-4b2e-8ef0-58928e1ad325] Failed to start libvirt guest:
> libvirtError: Cannot recv data: Connection reset by peer
> ....

Hi, Ketan

The issue in comments 14 and 19 is a separate one, tracked in Bug 1892669.

As Michal said in comment 29:
"the error (virNetSocketReadWire:1832 : End of file while reading data: Input/output error) does not necessarily mean there's a problem. 
It is reported if client disconnects without closing the connection properly. 
It is a problem if the client was doing something/wanted to do something more."

According to the nova log in comment 9, the customer hit VM start/reboot failures,
and we need to know what happened to the VM before the failures. Will you please help provide the information below? Thank you!
--------------------------------------------------------
Tested on packages:
 - OSP version: OSP 13 (compute node with rhel7.9) ?
 - Compute node: realtime kernel ? libvirt, qemu-kvm version ?

Test env:
 - VM xml
 - Env with Realtime ? with SR-IOV PF, VF ?
 - VM storage: nfs ? ceph ? iscsi ?

Reproduction steps:
1. 2. 3. ....

Related logs:
 - sos report, or:
   - compute node:
     - /var/log/containers/nova/nova-compute.log
     - /var/log/containers/libvirt/libvirtd.log
     - /var/log/libvirt/qemu/instance-0000000*.log
   - controller node:
     - /var/log/containers/nova/nova-conductor.log
     - /var/log/containers/nova/nova-scheduler.log

Please apply these settings for the libvirtd log before reproducing the bug:
- On compute node:
# sudo vi /var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/libvirtd.conf
  log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"
# sudo podman restart nova_libvirt
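
After the restart, it is worth confirming that debug messages are actually being written (a rough check; the host-side log path can vary by deployment):

~~~
# On the compute node, after restarting nova_libvirt
sudo podman ps --filter name=nova_libvirt              # container came back up?
sudo tail -f /var/log/containers/libvirt/libvirtd.log  # debug lines should now appear
~~~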

Comment 32 Kashyap Chamarthy 2020-11-20 15:01:10 UTC
(In reply to chhu from comment #31)
> (In reply to jhardee from comment #9)
> > I have a customer who is showing this message as well.
> > [...]
> 
> According to the nova log in comment 9, the customer hit VM start/reboot failures,
> and we need to know what happened to the VM before the failures. Will you please
> help provide the information below? Thank you!
> [...]


Ketan, any update on the above?

Comment 37 Han Han 2021-01-20 09:57:42 UTC
FYI, a new bug was found in libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=1918211
It has a similar error message to this bug.

Comment 38 Kashyap Chamarthy 2021-01-22 12:04:06 UTC
(In reply to Han Han from comment #37)
> FYI, a new bug was found in libvirt
> https://bugzilla.redhat.com/show_bug.cgi?id=1918211
> It has similar error msg to this bug

Right, but Michal confirms in your new bug that it is not the same cause: https://bugzilla.redhat.com/show_bug.cgi?id=1918211#c4

(Also, your new bug seems to be reproducible only in a nested KVM setup, not with VMs on physical hosts. And in _this_ bug, I think, the setup is not nested.)

Comment 39 Kashyap Chamarthy 2021-01-29 11:40:10 UTC
A solid reproducer seems to be completely elusive here; without one, we can't do much.

So I'm closing this bug for now.

If a clear reproducer emerges, please feel free to reopen this one with appropriate details.

