Description of problem:
I have a server that is experiencing a large number of NFS retransmission errors, which leads to poor VM performance:

calls       retrans    authrefrsh
518811124   512798     0

However, nothing is propagated to the admin, as we don't even monitor that. Since it's NFS, no errors will be issued, but performance will be pathetic.
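A minimal monitoring sketch (illustrative, not part of the original report): the retrans figure nfsstat reports is the second number on the "rpc" line of /proc/net/rpc/nfs, so a small loop can watch its growth and alert. The 60-second interval and the 100-retransmission threshold are arbitrary illustrative values.

#!/bin/bash
# Alert when the NFS client retransmission counter grows quickly.
prev=$(awk '/^rpc /{print $3}' /proc/net/rpc/nfs)
while sleep 60; do
    cur=$(awk '/^rpc /{print $3}' /proc/net/rpc/nfs)
    # 100 retransmissions per minute is an arbitrary alert threshold.
    if [ $((cur - prev)) -gt 100 ]; then
        logger -p daemon.warning "NFS retrans grew by $((cur - prev)) in 60s"
    fi
    prev=$cur
done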
Closing old bugs. If this issue is still relevant/important in the current version, please re-open the bug.
Still relevant.
We need more data to make progress:
- Do we see slower readDelay values on such a system?
- Do you know how to reproduce or simulate this issue?
(In reply to Nir Soffer from comment #6)
> We need more data to make progress:
>
> - Do we see slower readDelay values on such a system?

Yes, we did.

> - Do you know how to reproduce or simulate this issue?

Using tc and dropping packets. A 1% packet drop should suffice.
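A sketch of that reproducer, assuming eth0 is the interface carrying the NFS traffic (the device name is illustrative):

# Inject 1% packet loss, watch retrans climb, then clean up.
tc qdisc add dev eth0 root netem loss 1%
nfsstat -c                        # the retrans counter should start growing
tc qdisc del dev eth0 root netem  # remove the rule when done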
Yaniv, can we reproduce this with current code? See comment 7 for tips.
(In reply to Nir Soffer from comment #8)
> Yaniv, can we reproduce this with current code?

Aharon is leading RHEV storage QE, moving needinfo to him.

> See comment 7 for tips.
(In reply to Nir Soffer from comment #8)
> Yaniv, can we reproduce this with current code?
>
> See comment 7 for tips.

Reproduced using the latest code (vdsm-4.17.10.1-0.el7ev.noarch). While vdsm hits NFS retransmission errors, they are not propagated to RHEVM.

Used tc to simulate 10% packet loss:

[root@green-vdsb ~]# tc qdisc add dev ovirtmgmt root netem loss 10%
[root@green-vdsb ~]# nfsstat -c
...
Client rpc stats:
calls      retrans    authrefrsh
61624      94         61564
...

Not sure if the following error in vdsm.log is relevant, but this is what I got while 10% packet loss was being simulated by tc:

Thread-1912::ERROR::2015-11-19 14:38:41,999::monitor::250::Storage.Monitor::(_monitorDomain) Error monitoring domain ad1ecd8f-8822-4323-9ba4-4864e4d97297
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 246, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 774, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 313, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 624, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 274, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 418, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Created attachment 1096718 [details]
engine, vdsm and messages logs
Yaniv, not sure why it was moved again to future. It should be used (and replace the useless readDelay, at least for NFS). I think it's a mistake: we keep seeing such issues at customers, but we are not looking in the right place. NFS retrans is always a good indication of something wrong or of too much load. Anyway, if it hasn't been important enough to implement in more than 5 years, closing.
I think we should consider this for 4.0.
We need to repeat the test in comment 10 with different values of packet loss (or other network conditions that trigger re-transmissions), and check whether read delay reflects the issue.
(In reply to Nir Soffer from comment #14)
> We need to repeat the test in comment 10 with different values of packet
> loss (or other network conditions that trigger re-transmissions), and check
> whether read delay reflects the issue.

Aharon, can you do it?
- If read delay is not reflecting the issue (with drops of 0.5, 1, 5 percent, for example), then we need to fix read delay.
- If it does, we can probably close this.

TIA,
Y.
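A sketch of the sweep requested above, assuming tc/netem on the NFS-facing interface; eth0 and the 5-minute soak per rate are illustrative:

#!/bin/bash
# Repeat the comment 10 test at 0.5%, 1% and 5% packet loss, recording
# the retrans counter while read delay is sampled at each rate.
for loss in 0.5 1 5; do
    tc qdisc add dev eth0 root netem loss ${loss}%
    sleep 300                         # let re-transmissions accumulate
    nfsstat -c                        # record retrans at this loss rate
    tc qdisc del dev eth0 root netem
done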
Should be part of the new metrics store, via the nfs collectd plugin.
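A minimal sketch of wiring that up, assuming a stock collectd installation. The nfs plugin parses /proc/net/rpc/nfs and takes no configuration options; whether the stock plugin exposes the retrans counter (rather than only per-procedure call counts) is worth verifying against the upstream issue linked at the end of this bug.

# Enable the collectd nfs plugin and restart the daemon.
cat >> /etc/collectd.conf <<'EOF'
LoadPlugin nfs
EOF
systemctl restart collectd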
(In reply to Yaniv Kaul from comment #15)
> (In reply to Nir Soffer from comment #14)
> > We need to repeat the test in comment 10 with different values of packet
> > loss (or other network conditions that trigger re-transmissions), and check
> > whether read delay reflects the issue.
>
> Aharon, can you do it?
> - If read delay is not reflecting the issue (with drops of 0.5, 1, 5
>   percent, for example), then we need to fix read delay.
> - If it does, we can probably close this.
>
> TIA,
> Y.

Elad, can you please test it again (comment #10) with the relevant values?
Created attachment 1221504 [details]
logs-17.11.16

Used:
vdsm-4.18.999-894.git5c13e49.el7.centos.x86_64
ovirt-engine-4.0.5.4-0.1.el7ev.noarch
libvirt-2.0.0-10.el7.x86_64
qemu-kvm-rhev-2.6.0-27.el7.x86_64

=====================================================================

Setup: one host, one NFS domain. The host has 1 NIC, with the default ovirtmgmt network configured as a bridge, and a running RHEL guest with 2 disks on the NFS domain.

Applied tc rules for various packet loss rates; each rule was applied for 5-10 minutes.

=====================================================================

5% packet loss applied:

[root@storage-ge2-vdsm4 ~]# tc qdisc add dev ovirtmgmt root netem loss 5%
[root@storage-ge2-vdsm4 ~]# nfsstat -c
Client rpc stats:
calls      retrans    authrefrsh
13372      0          13372

=======================

10% packet loss applied:

[root@storage-ge2-vdsm4 ~]# tc qdisc add dev ovirtmgmt root netem loss 10%
[root@storage-ge2-vdsm4 ~]# tc -s qdisc ls dev ovirtmgmt
qdisc netem 8004: root refcnt 2 limit 1000 loss 10%
 Sent 32711015 bytes 34760 pkt (dropped 2734, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
[root@storage-ge2-vdsm4 ~]# nfsstat -c
Client rpc stats:
calls      retrans    authrefrsh
16084      0          16084

=======================

20% packet loss applied:

[root@storage-ge2-vdsm4 ~]# tc qdisc add dev ovirtmgmt root netem loss 20%
[root@storage-ge2-vdsm4 ~]# nfsstat -c
Client rpc stats:
calls      retrans    authrefrsh
21659      0          21659
[root@storage-ge2-vdsm4 ~]# tc -s qdisc ls dev ovirtmgmt
qdisc netem 8005: root refcnt 2 limit 1000 loss 20%
 Sent 12213810 bytes 36381 pkt (dropped 8607, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

=====================================================================

Results: while the packet loss rules were applied, I observed lagging in the guest while connected through the console. No relevant messages were seen in vdsm.log and no warning reached the admin in RHEVM.

I added a marker line to vdsm.log for each tc rule applied, for example:

----------------------------------------5% packet loss applied----------------------------------------------

Logs attached.
*Engine: ovirt-engine-4.1.0-0.0.master.20161112231308.git672bd31.el7.centos.noarch
Elad - shouldn't the tc be on the physical interface itself, and not on the bridge?
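A sketch of what the comment above suggests, assuming eth0 is the physical NIC enslaved to the ovirtmgmt bridge (illustrative; the following comments confirm the topology):

# Move the netem rule from the bridge to the physical NIC.
tc qdisc del dev ovirtmgmt root netem
tc qdisc add dev eth0 root netem loss 20%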
The IP configuration is on the bridge, not on the physical interface:

[root@storage-ge2-vdsm4 ~]# ifconfig ovirtmgmt
ovirtmgmt: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.35.65.41  netmask 255.255.255.0  broadcast 10.35.65.255
        inet6 2620:52:0:2340:21a:4aff:fecc:ed0c  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::21a:4aff:fecc:ed0c  prefixlen 64  scopeid 0x20<link>
        ether 00:1a:4a:cc:ed:0c  txqueuelen 1000  (Ethernet)
        RX packets 46831491  bytes 241288586957 (224.7 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 46857952  bytes 4944680339 (4.6 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@storage-ge2-vdsm4 ~]# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 00:1a:4a:cc:ed:0c  txqueuelen 1000  (Ethernet)
        RX packets 64013875  bytes 278285054789 (259.1 GiB)
        RX errors 0  dropped 9  overruns 0  frame 0
        TX packets 46928098  bytes 4959421598 (4.6 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
(In reply to Elad from comment #21)
> The IP configuration is on the bridge, not on the physical interface:
>
> [root@storage-ge2-vdsm4 ~]# ifconfig ovirtmgmt
> ovirtmgmt: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.35.65.41  netmask 255.255.255.0  broadcast 10.35.65.255
> [...]
>
> [root@storage-ge2-vdsm4 ~]# ifconfig eth0
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         ether 00:1a:4a:cc:ed:0c  txqueuelen 1000  (Ethernet)
> [...]

The fact that the IP is on an interface doesn't mean (confusingly) that it is actually being used for the connection... Can you try? I would like to understand why you did not get any NFS retransmissions even though you set packet loss (we see packets being dropped on the bridge, and you witnessed the delay in the console yourself).
[root@storage-ge2-vdsm4 ~]# tc -s qdisc ls dev eth0
qdisc netem 8006: root refcnt 2 limit 1000 loss 20%
 Sent 160246 bytes 1607 pkt (dropped 466, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

[root@storage-ge2-vdsm4 ~]# nfsstat -c
Client rpc stats:
calls      retrans    authrefrsh
28581      0          28581
Closing this in favor of collectd (upstream issue below).
https://github.com/collectd/collectd/issues/2325