Description of problem:

The following errors are seen on all controller and compute nodes of a RHOSP13 setup.

systemctl status docker
~~~
Mar 24 08:47:20 host8..com dockerd-current[7497]: 2020-03-24 08:47:20.804+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:47:34 host8..com dockerd-current[7497]: 2020-03-24 08:47:34.668+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:48:20 host8..com dockerd-current[7497]: 2020-03-24 08:48:20.835+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:48:35 host8..com dockerd-current[7497]: 2020-03-24 08:48:35.015+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:49:20 host8..com dockerd-current[7497]: 2020-03-24 08:49:20.765+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:49:34 host8..com dockerd-current[7497]: 2020-03-24 08:49:34.738+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:50:21 host8..com dockerd-current[7497]: 2020-03-24 08:50:21.006+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:50:34 host8..com dockerd-current[7497]: 2020-03-24 08:50:34.784+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Mar 24 08:51:20 host8..com dockerd-current[7497]: 2020-03-24 08:51:20.844+0000: 8707: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
~~~

For context on the environment in which these errors are seen: there is a filebeat container running on those nodes (as part of an ELK stack), but stopping the filebeat container does not stop these log messages from appearing.

Version-Release number of selected component (if applicable):
docker-1.13.1-109.gitcccb291.el7_7.x86_64                   Sun Feb 23 19:02:51 2020
docker-client-1.13.1-109.gitcccb291.el7_7.x86_64            Sun Feb 23 19:02:48 2020
docker-common-1.13.1-109.gitcccb291.el7_7.x86_64            Sun Feb 23 19:02:47 2020
docker-rhel-push-plugin-1.13.1-109.gitcccb291.el7_7.x86_64  Sun Feb 23 19:02:47 2020
python-docker-2.4.2-2.el7.noarch                            Fri Jun 28 01:59:48 2019
python-docker-pycreds-0.3.0-11.el7.noarch                   Sun Feb 23 19:43:45 2020
python-obfuscateduser40-agent-docker-cmd-1.5.4-1.el7ost.noarch  Fri Jun 28 02:03:27 2019

How reproducible:
Not sure as of now; these errors are not seen in every RHOSP13 environment.

Actual results:

Expected results:

Additional info:
There is no impact, but we would like to know the cause of this.
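The timestamps above recur at fixed offsets within each minute, which suggests one or more periodic clients polling libvirtd. A quick way to gauge frequency and timing on an affected node is sketched below; it assumes the messages reach journald via the docker unit, as in the `systemctl status docker` output above.

~~~
# Count recent occurrences of the error on this node:
journalctl -u docker --since "1 hour ago" | grep -c 'virNetSocketReadWire'

# Print them with timestamps to see whether they follow a fixed polling
# interval (which would point at a periodic client of libvirtd, e.g. a
# monitoring or collection agent):
journalctl -u docker --since "1 hour ago" | grep 'virNetSocketReadWire'
~~~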
Hi,

I have a customer who is seeing this message as well.

2020-07-15 13:00:16.389 1 ERROR nova.virt.libvirt.driver [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] [instance: e0f46da7-9105-4b2e-8ef0-58928e1ad325] Failed to start libvirt guest: libvirtError: Cannot recv data: Connection reset by peer
2020-07-15 13:00:16.391 1 INFO os_vif [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] Successfully unplugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:86:ea:af,bridge_name='br-int',has_traffic_filtering=True,id=77ee2aaa-f1d4-4d8d-a99f-878ff0842c52,network=Network(c14dce0c-0b08-4b65-b5e1-f70c426e246f),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap77ee2aaa-f1')
2020-07-15 13:00:16.395 1 ERROR nova.compute.manager [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] [instance: e0f46da7-9105-4b2e-8ef0-58928e1ad325] Cannot reboot instance: Cannot recv data: Connection reset by peer: libvirtError: Cannot recv data: Connection reset by peer
2020-07-15 13:00:17.752 1 INFO nova.compute.manager [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] [instance: e0f46da7-9105-4b2e-8ef0-58928e1ad325] Successfully reverted task state from reboot_started_hard on failure for instance.
2020-07-15 13:00:17.759 1 ERROR oslo_messaging.rpc.server [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05 d6a5f6c4f88e41b499f823dd147a24d2 - default default] Exception during message handling: libvirtError: Cannot recv data: Connection reset by peer
2020-07-15 13:00:17.759 1 ERROR oslo_messaging.rpc.server libvirtError: Cannot recv data: Connection reset by peer
2020-07-15 13:00:17.759 1 ERROR oslo_messaging.rpc.server
2020-07-15 13:00:22.497 1 INFO nova.compute.manager [req-d48db47e-a5cd-476e-b415-539f307687c6 - - - - -] Running instance usage audit for host compute02.localdomain from 2020-07-15 12:00:00 to 2020-07-15 13:00:00. 7 instances.
2020-07-15 13:00:30.940 1 INFO nova.compute.resource_tracker [req-d48db47e-a5cd-476e-b415-539f307687c6 - - - - -] Final resource view: name=compute02.localdomain phys_ram=64163MB used_ram=58368MB phys_disk=8371GB used_disk=540GB total_vcpus=64 used_vcpus=24 pci_stats=[]
2020-07-15 13:01:31.947 1 INFO nova.compute.resource_tracker [req-d48db47e-a5cd-476e-b415-539f307687c6 - - - - -] Final resource view: name=compute02.localdomain phys_ram=64163MB used_ram=58368MB phys_disk=8371GB used_disk=540GB total_vcpus=64 used_vcpus=24 pci_stats=[]
2020-07-15 13:02:31.952 1 INFO nova.compute.resource_tracker [req-d48db47e-a5cd-476e-b415-539f307687c6 - - - - -] Final resource view: name=compute02.localdomain phys_ram=64163MB used_ram=58368MB phys_disk=8371GB used_disk=540GB total_vcpus=64 used_vcpus=24 pci_stats=[]
2020-07-15 13:02:56.420 1 INFO nova.compute.manager [-] [instance: e0f46da7-9105-4b2e-8ef0-58928e1ad325] During _sync_instance_power_state the DB power_state (4) does not match the vm_power_state from the hypervisor (0). Updating power_state in the DB to match the hypervisor.

* Snapshot of the virNetSocketReadWire:1806 error: https://access.redhat.com/hydra/rest/cases/02702586/attachments/52410106-1526-4180-89d0-54d0be99cf41?usePresignedUrl=true
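Since "Cannot recv data: Connection reset by peer" usually means the libvirtd end of the socket went away mid-RPC, a first step is to check whether libvirtd on the compute node died or was restarted around the failure time. A minimal sketch; the container name nova_libvirt and the docker runtime are assumptions about the default OSP 13 layout:

~~~
# Has the libvirt container restarted recently?
sudo docker ps -a --filter name=nova_libvirt

# How long has libvirtd itself been running inside the container?
sudo docker exec nova_libvirt ps -o pid,etime,args -C libvirtd

# Any daemon-side errors around the timestamps seen in nova-compute.log?
sudo docker logs --since 1h nova_libvirt 2>&1 | tail -n 50
~~~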
@Kevin, this doesn't look relevant to CloudOps; perhaps the compute DFG should take a look at it.
[My comment from last week didn't register as the browser crashed.]

From a chat with libvirt folks, that error is usually seen when libvirtd dies and the client loses the connection (as expected). That said, if this is reproducible, then more fine-grained libvirt logs[1] might tell us more.

@Michal: does this ring any bells for you?

[1] https://kashyapc.fedorapeople.org/virt/openstack/request-nova-libvirt-qemu-debug-logs.txt
Created attachment 1724139 [details] VM xml - g2.xml
Created attachment 1724140 [details] libvirtd.log
(In reply to chhu from comment #21)
> Created attachment 1724140 [details]
> libvirtd.log

This is the daemon log; can you please provide the client log? Per my analysis in comment 17 it's the client that is disconnecting, and I'm wondering why.
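For reference, a client-side log for a single virsh invocation can typically be captured via libvirt's client logging environment variables, along the lines of the sketch below. The filter set and output path are assumptions modelled on the daemon-side settings referenced earlier; adjust as needed.

~~~
# Capture a client-side libvirt debug log for one virsh run:
LIBVIRT_LOG_FILTERS="1:libvirt 1:rpc 1:util" \
LIBVIRT_LOG_OUTPUTS="1:file:/tmp/client_libvirtd.log" \
virsh start g2
~~~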
Created attachment 1724151 [details] client_libvirtd.log
Created attachment 1724156 [details] client_hang_libvirtd.log
Okay, so from the client logs I see some weirdness happening. These are my findings from investigating the problematic run.

Here, the client (virsh) starts up and connects to the daemon.

client:

2020-10-26 12:36:22.226+0000: 14506: info : libvirt version: 6.0.0, package: 25.module+el8.2.1+7154+47ffd890 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-06-25-15:13:13, )
2020-10-26 12:36:22.226+0000: 14506: info : hostname: dell-per740-05.rhts.eng.pek2.redhat.com
2020-10-26 12:36:22.226+0000: 14506: debug : virGlobalInit:244 : register drivers ...
2020-10-26 12:36:22.226+0000: 14507: debug : virThreadJobSet:97 : Thread 14507 is now running job vshEventLoop
2020-10-26 12:36:22.226+0000: 14506: debug : doRemoteOpen:1094 : proceeding with name =
2020-10-26 12:36:22.226+0000: 14507: debug : virEventRunDefaultImpl:350 : running default event implementation
2020-10-26 12:36:22.226+0000: 14506: debug : doRemoteOpen:1103 : Connecting with transport 1
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollCleanupTimeouts:520 : Cleanup 1
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollCleanupHandles:569 : Cleanup 1
2020-10-26 12:36:22.226+0000: 14506: debug : remoteGetUNIXSocketHelper:799 : Built UNIX sockname /var/run/libvirt/libvirt-sock for transport unix prefix libvirt flags=0x0
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollMakePollFDs:396 : Prepare n=0 w=1, f=4 e=1 d=0
2020-10-26 12:36:22.226+0000: 14506: debug : remoteGetUNIXSocket:885 : Chosen UNIX sockname /var/run/libvirt/libvirt-sock daemon libvirtd for mode legacy transport unix flags=0x0
2020-10-26 12:36:22.226+0000: 14506: debug : doRemoteOpen:1127 : Chosen UNIX socket /var/run/libvirt/libvirt-sock
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollCalculateTimeout:333 : Calculate expiry of 1 timers
2020-10-26 12:36:22.226+0000: 14507: debug : virEventPollCalculateTimeout:366 : No timeout is pending
2020-10-26 12:36:22.226+0000: 14507: info : virEventPollRunOnce:635 : EVENT_POLL_RUN: nhandles=1 timeout=-1

The daemon accepts the connection request and opens the QEMU driver.

server:

2020-10-26 12:36:22.228+0000: 10048: debug : virThreadJobSet:94 : Thread 10048 (virNetServerHandleJob) is now running job remoteDispatchConnectOpen
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpen:1074 : name=
2020-10-26 12:36:22.228+0000: 10048: debug : virConfLoadConfig:1533 : Loading config file '/etc/libvirt/libvirt.conf'
2020-10-26 12:36:22.228+0000: 10048: debug : virConfReadFile:737 : filename=/etc/libvirt/libvirt.conf
2020-10-26 12:36:22.228+0000: 10048: debug : virConfGetValueString:882 : Get value string (nil) 0
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:859 : Trying to probe for default URI
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:866 : QEMU driver URI probe returned 'qemu:///system'
2020-10-26 12:36:22.228+0000: 10048: debug : virConfGetValueStringList:930 : Get value string list (nil) 0
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:903 : Split "qemu:///system" to URI components: scheme qemu server <null> user <null> port 0 path /system
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:954 : trying driver 9 (QEMU) ...
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:969 : Checking for supported URI schemes
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:972 : Matched URI scheme 'qemu'
2020-10-26 12:36:22.228+0000: 10048: debug : virConnectOpenInternal:1005 : driver 9 QEMU returned SUCCESS
2020-10-26 12:36:22.228+0000: 10048: debug : virThreadJobClear:119 : Thread 10048 (virNetServerHandleJob) finished job remoteDispatchConnectOpen with ret=0

The client proceeds with sending the virDomainCreate() request for the g2 domain.

client:

2020-10-26 12:36:22.231+0000: 14506: debug : virDomainCreate:6525 : dom=0x5623e67c0e80, (VM: name=g2, uuid=743eee8c-16f1-11eb-a35f-20040febab30)
2020-10-26 12:36:22.231+0000: 14506: debug : virNetMessageNew:45 : msg=0x5623e67c47d0 tracked=0
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollDispatchTimeouts:427 : Dispatch 2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollDispatchHandles:473 : Dispatch 2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollDispatchHandles:487 : i=0 w=1
2020-10-26 12:36:22.231+0000: 14506: debug : virNetMessageEncodePayload:388 : Encode length as 56
2020-10-26 12:36:22.231+0000: 14507: info : virEventPollDispatchHandles:501 : EVENT_POLL_DISPATCH_HANDLE: watch=1 events=1
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollDispatchHandles:487 : i=1 w=2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCleanupTimeouts:520 : Cleanup 2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCleanupHandles:569 : Cleanup 2
2020-10-26 12:36:22.231+0000: 14506: info : virNetClientSendInternal:2085 : RPC_CLIENT_MSG_TX_QUEUE: client=0x5623e67c3f20 len=56 prog=536903814 vers=1 proc=9 type=0 status=0 serial=8
2020-10-26 12:36:22.231+0000: 14507: debug : virEventRunDefaultImpl:350 : running default event implementation
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCleanupTimeouts:520 : Cleanup 2
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCleanupHandles:569 : Cleanup 2
2020-10-26 12:36:22.231+0000: 14506: debug : virNetClientCallNew:2038 : New call 0x5623e67c32e0: msg=0x5623e67c47d0, expectReply=1, nonBlock=0
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollMakePollFDs:396 : Prepare n=0 w=1, f=4 e=1 d=0
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollMakePollFDs:396 : Prepare n=1 w=2, f=6 e=1 d=0
2020-10-26 12:36:22.231+0000: 14507: debug : virEventPollCalculateTimeout:333 : Calculate expiry of 2 timers

The daemon performs the startup.
server:

2020-10-26 12:36:22.231+0000: 10050: debug : virThreadJobSet:94 : Thread 10050 (virNetServerHandleJob) is now running job remoteDispatchDomainCreate
2020-10-26 12:36:22.231+0000: 10050: debug : virDomainCreate:6525 : dom=0x7f90e802e670, (VM: name=g2, uuid=743eee8c-16f1-11eb-a35f-20040febab30)
2020-10-26 12:36:22.231+0000: 10050: debug : qemuDomainObjBeginJobInternal:9784 : Starting job: job=none agentJob=none asyncJob=start (vm=0x7f90a0106e10 name=g2, current job=none agentJob=none async=none)
2020-10-26 12:36:22.231+0000: 10050: debug : qemuDomainObjBeginJobInternal:9841 : Started async job: start (vm=0x7f90a0106e10 name=g2)
2020-10-26 12:36:22.231+0000: 10050: debug : qemuProcessStart:7077 : conn=0x7f90f4019b10 driver=0x7f90a00f7f60 vm=0x7f90a0106e10 name=g2 id=-1 asyncJob=start migrateFrom=<null> migrateFd=-1 migratePath=<null> snapshot=(nil) vmop=0 flags=0x1

While the daemon is performing the startup, the client waits for the reply to the issued RPC, but also sets a timer to wake up in 5 seconds so that a keepalive message can be sent:

client:

2020-10-26 12:36:22.231+0000: 14507: info : virEventPollRunOnce:635 : EVENT_POLL_RUN: nhandles=1 timeout=5000
2020-10-26 12:36:27.571+0000: 14572: info : libvirt version: 6.0.0, package: 25.module+el8.2.1+7154+47ffd890 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-06-25-15:13:13, )

But the last message logged is entering the poll(). The next one is the greeting message from another run of virsh (after 5 seconds, because there's a "sleep 5" in the loop in comment 19). Therefore, I think virsh crashed when entering poll().

Chenli, is there any record of this? Something in the system log, perhaps even a coredump?
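As an aside, the 5-second wakeup visible above is the client keepalive timer. When chasing unexplained disconnects it can help to tune or disable it per virsh invocation, to rule keepalive traffic in or out; a sketch follows (the interval and count values are arbitrary):

~~~
# Send a keepalive probe every 10 seconds and give up after 3 unanswered
# probes:
virsh --keepalive-interval 10 --keepalive-count 3 start g2

# Disable the client keepalive mechanism entirely for one run:
virsh --keepalive-interval 0 start g2
~~~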
Thanks for Michal Privoznik's debugging!

After further debugging and testing on the env, we found the `virsh start` hang is a qemu bug; I'll file another qemu bug for the VM hang issue.

The libvirtd.log error message below happens when the VM hangs and the `virsh start` terminal is exited manually:
- "error : virNetSocketReadWire:1832 : End of file while reading data: Input/output error"
Right. So in my investigation in comment 25 the virsh did not crash, but was killed manually by the user. Therefore, we don't have a reproducer. If somebody who is able to reproduce this could provide logs, including client logs, that would be great.

Just for the record, the error does not necessarily mean there is a problem. It is reported whenever a client disconnects without closing the connection properly. It is only a problem if the client was doing, or wanted to do, something more.
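To illustrate the point, the daemon-side message can be provoked harmlessly by making a client go away without a clean close. A sketch, assuming libvirtd runs as a host service and logs to the journal:

~~~
# Keep a libvirt connection open, then kill the client before it can close
# the socket cleanly:
virsh event --all --loop &
VIRSH_PID=$!
sleep 2
kill -9 "$VIRSH_PID"

# The daemon should now log the (in this case harmless) EOF error:
journalctl -u libvirtd --since "1 minute ago" | grep 'virNetSocketReadWire'
~~~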
(In reply to chhu from comment #28)
> Thanks for Michal Privoznik's debugging!
>
> After further debugging and testing on the env, we found the `virsh start`
> hang is a qemu bug; I'll file another qemu bug for the VM hang issue.

Tracked in Bug 1892669 - VM start hang - qemu stuck in query-balloon
(In reply to jhardee from comment #9)
> Hi,
>
> I have a customer who is seeing this message as well.
>
> 2020-07-15 13:00:16.389 1 ERROR nova.virt.libvirt.driver
> [req-779c9ef4-4d2d-4568-bd85-99f80acb67f4 485fe140b1dd426b92cb0744ee3ddf05
> d6a5f6c4f88e41b499f823dd147a24d2 - default default] [instance:
> e0f46da7-9105-4b2e-8ef0-58928e1ad325] Failed to start libvirt guest:
> libvirtError: Cannot recv data: Connection reset by peer
> ....

Hi Ketan,

The issue in comments 14 and 19 is a separate bug, Bug 1892669.

As Michal said in comment 29:
"the error (virNetSocketReadWire:1832 : End of file while reading data: Input/output error) does not necessarily mean there's a problem. It is reported if a client disconnects without closing the connection properly. It is a problem if the client was doing something/wanted to do something more."

According to the nova log in comment 9, the customer hit VM start/reboot failures; we need to know what happened to the VM before the failures. Could you please help provide the information below? Thank you!
--------------------------------------------------------
Tested packages:
- OSP version: OSP 13 (compute node with RHEL 7.9)?
- Compute node: realtime kernel? libvirt and qemu-kvm versions?

Test env:
- VM XML
- Env with realtime? With SR-IOV PF, VF?
- VM storage: NFS? Ceph? iSCSI?

Reproduction steps:
1. 2. 3. ....

Related logs:
- sos report, or
- compute node:
  - /var/log/containers/nova/nova-compute.log
  - /var/log/containers/libvirt/libvirtd.log
  - /var/log/libvirt/qemu/instance-0000000*.log
- controller node:
  - /var/log/containers/nova/nova-conductor.log
  - /var/log/containers/nova/nova-scheduler.log

Please apply these settings for the libvirtd log before reproducing the bug:
- On the compute node:
  # sudo vi /var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/libvirtd.conf
    log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"
  # sudo podman restart nova_libvirt
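Once reproduced, a quick way to line up the client-visible failure with the daemon-side disconnect across the collected files is sketched below (paths as listed above; adjust if the layout differs in the actual environment):

~~~
# On the compute node, correlate by timestamp:
sudo grep -n 'virNetSocketReadWire' /var/log/containers/libvirt/libvirtd.log
sudo grep -n 'Connection reset by peer' /var/log/containers/nova/nova-compute.log
sudo tail -n 50 /var/log/libvirt/qemu/instance-0000000*.log
~~~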
(In reply to chhu from comment #31)
> According to the nova log in comment 9, the customer hit VM start/reboot
> failures; we need to know what happened to the VM before the failures.
> Could you please help provide the information below? Thank you!
> [...]

Ketan, any update on the above?
FYI, a new bug was found in libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=1918211. It has a similar error message to this bug.
(In reply to Han Han from comment #37)
> FYI, a new bug was found in libvirt:
> https://bugzilla.redhat.com/show_bug.cgi?id=1918211. It has a similar error
> message to this bug.

Right, but Michal confirms in your new bug that it is not the same cause: https://bugzilla.redhat.com/show_bug.cgi?id=1918211#c4

(Also, your new bug seems to be reproducible only in a nested KVM setup, not with VMs on physical hosts. And in _this_ bug, I think, the setup is not nested.)
A solid reproducer seems to be completely elusive here; without one, we can't do much. So, closing this bug for now. If a clear reproducer emerges, please feel free to reopen with appropriate details.