Created attachment 1187860 [details]
Host page: Network utilization vs Network Throughput

Description of problem:

The Network Utilization chart does not seem to work properly. I have a HW cluster with 3 MON and 4 OSD nodes and two configured networks (1G and 10G). I utilize the network via iperf (`iperf -s` on the first OSD node and `iperf -c 192.168.100.101 --time 36000` on the second OSD node; 192.168.100.101 is the IP address of the 10G network interface on the first OSD node).

Command `nload` on the interface p2p1 (192.168.100.102) shows the following values:

Curr: 4.65 GBit/s
Avg: 4.65 GBit/s
Min: 4.64 GBit/s
Max: 4.65 GBit/s
Ttl: 11045.28 GByte

This traffic is also visible in the Host -> Performance -> Network Throughput chart in USM, but the Network Utilization section shows zero for all values.

Version-Release number of selected component (if applicable):

USM Server (RHEL 7.2):
ceph-installer-1.0.14-1.el7scon.noarch
libcollection-0.6.2-25.el7.x86_64
ceph-ansible-1.0.5-32.el7scon.noarch
rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-ui-0.0.51-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64

Ceph OSD/MON node (RHEL 7.2):
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-33.el7cp.x86_64
ceph-common-10.2.2-33.el7cp.x86_64
ceph-mon-10.2.2-33.el7cp.x86_64
ceph-osd-10.2.2-33.el7cp.x86_64
ceph-selinux-10.2.2-33.el7cp.x86_64
collectd-ping-5.5.1-1.1.el7.x86_64
collectd-5.5.1-1.1.el7.x86_64
libcephfs1-10.2.2-33.el7cp.x86_64
libcollection-0.6.2-25.el7.x86_64
python-cephfs-10.2.2-33.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch

How reproducible:
100%

Steps to Reproduce:
1. Utilize the network by running `iperf -s` on one node and `iperf -c 192.168.100.101 --time 36000` on a second node.

Actual results:
The Network Utilization chart shows zeros.

Expected results:
The Network Utilization chart shows meaningful data.

Additional info:
See the attached screenshots.
Created attachment 1187861 [details]
Network traffic measured by `nload`
Tested with:

server:
ceph-ansible-1.0.5-32.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.40-1.el7scon.x86_64
rhscon-core-0.0.41-1.el7scon.x86_64
rhscon-core-selinux-0.0.41-1.el7scon.noarch
rhscon-ui-0.0.52-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-master-2015.5.5-1.el7.noarch
salt-selinux-0.0.41-1.el7scon.noarch

node:
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-36.el7cp.x86_64
ceph-common-10.2.2-36.el7cp.x86_64
ceph-mon-10.2.2-36.el7cp.x86_64
ceph-selinux-10.2.2-36.el7cp.x86_64
libcephfs1-10.2.2-36.el7cp.x86_64
python-cephfs-10.2.2-36.el7cp.x86_64
rhscon-agent-0.0.18-1.el7scon.noarch
rhscon-core-selinux-0.0.41-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.41-1.el7scon.noarch

and there are these issues:

1) The Host dashboard shows Performance -> Throughput in the wrong units. It currently reads "312112629.0 KB/s", but it should be "312112629.0 packets/s", because the value is interface-rx_tx from Graphite.

2) The Network -> Utilization units are wrong. It currently shows "GB", but it should be "GB/s".
Network throughput is calculated as the sum of the average interface rx and the average interface tx across all interfaces of the node. Network utilization is calculated as the sum of rx and tx across all interfaces of the node, divided by the sum of the bandwidths of all interfaces, and the result is multiplied by 100 to get a percentage.
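A minimal sketch of the two calculations described above, assuming per-interface rx/tx rates and bandwidths (all in bytes/s) are already available; the names below are illustrative, not the actual rhscon code:

def network_throughput(interfaces):
    # sum of average rx and average tx rates across all interfaces (bytes/s)
    return sum(iface["rx_avg"] + iface["tx_avg"] for iface in interfaces)

def network_utilization_percent(interfaces):
    # total rx+tx rate divided by total interface bandwidth, as a percentage
    total_rate = sum(iface["rx"] + iface["tx"] for iface in interfaces)
    total_bandwidth = sum(iface["bandwidth"] for iface in interfaces)
    return 100.0 * total_rate / total_bandwidth if total_bandwidth else 0.0

# Example: one 10G interface (~1.25e9 bytes/s) carrying ~581 MB/s of traffic
interfaces = [
    {"rx": 580e6, "tx": 1e6, "rx_avg": 580e6, "tx_avg": 1e6, "bandwidth": 1.25e9},
]
print(network_throughput(interfaces))           # ~5.81e8 bytes/s
print(network_utilization_percent(interfaces))  # ~46.5 (percent)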
Created attachment 1189267 [details]
network throughput

Could you please explain how these two numbers are related and how throughput is calculated?
(In reply to anmol babu from comment #7)

"average of interface rx and average of interface tx across all interfaces of node" in other words means "packets/s", not KB/s.

For example, RX and TX from the ifconfig output of this node:

em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.16.157.12  netmask 255.255.248.0  broadcast 10.16.159.255
        inet6 fe80::d6be:d9ff:feb3:8ef0  prefixlen 64  scopeid 0x20<link>
        ether d4:be:d9:b3:8e:f0  txqueuelen 1000  (Ethernet)
---->   RX packets 964147750  bytes 1076760752912 (1002.8 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
---->   TX packets 2685424374  bytes 3982185569935 (3.6 TiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

em2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether d4:be:d9:b3:8e:f2  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

em3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether d4:be:d9:b3:8e:f4  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

em4: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether d4:be:d9:b3:8e:f6  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 2608699  bytes 2085705817 (1.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2608699  bytes 2085705817 (1.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

p1p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.100.105  netmask 255.255.255.0  broadcast 192.168.100.255
        inet6 fe80::92e2:baff:fe04:7e80  prefixlen 64  scopeid 0x20<link>
        ether 90:e2:ba:04:7e:80  txqueuelen 1000  (Ethernet)
---->   RX packets 1780672903  bytes 25882228119558 (23.5 TiB)
        RX errors 0  dropped 132320  overruns 0  frame 0
---->   TX packets 2928330198  bytes 18071710203275 (16.4 TiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

p1p2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 90:e2:ba:04:7e:81  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

So please change the unit KB/s to packets/s in the right graph.
As anmol explained in the call, we are taking octets/sec (an octet is nothing but a byte), not packets/s, from collectd. So if you change the UI to packets per sec it won't be correct.

As discussed in the call we will make two changes in the UI:

1) Network utilization: KB/MB/GB per sec
2) Network throughput: KB/MB/GB per sec
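For reference, one way to cross-check the octets/sec figure on a node independently of collectd/Graphite is to sample the kernel byte counters twice and take the delta. This is only a verification sketch, and the interface name is just an example:

import time

def read_bytes(iface):
    # /sys/class/net/<iface>/statistics holds cumulative byte counters
    base = "/sys/class/net/{}/statistics".format(iface)
    with open(base + "/rx_bytes") as rx, open(base + "/tx_bytes") as tx:
        return int(rx.read()), int(tx.read())

iface = "p1p1"
rx1, tx1 = read_bytes(iface)
time.sleep(10)
rx2, tx2 = read_bytes(iface)
rate = ((rx2 - rx1) + (tx2 - tx1)) / 10.0                 # bytes per second
print("{}: {:.2f} Gbit/s".format(iface, rate * 8 / 1e9))  # compare with nload / the chart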
(In reply to Nishanth Thomas from comment #10)
> As anmol explained in the call, we are taking octets/sec (an octet is
> nothing but a byte), not packets/s, from collectd. So if you change the UI
> to packets per sec it won't be correct.
>
> As discussed in the call we will make two changes in the UI:
>
> 1) Network utilization: KB/MB/GB per sec
> 2) Network throughput: KB/MB/GB per sec

Network throughput is time-series data; we cannot convert it to KB/MB/GB per sec, so it will be plotted as B/s.
(In reply to Karnan from comment #11)
> (In reply to Nishanth Thomas from comment #10)
> > As anmol explained in the call, we are taking octets/sec (an octet is
> > nothing but a byte), not packets/s, from collectd. So if you change the
> > UI to packets per sec it won't be correct.
> >
> > As discussed in the call we will make two changes in the UI:
> >
> > 1) Network utilization: KB/MB/GB per sec
> > 2) Network throughput: KB/MB/GB per sec
>
> Network throughput is time-series data; we cannot convert it to KB/MB/GB
> per sec, so it will be plotted as B/s.

Thresholds come as time-series data in bytes. At one point the value can be in KB and the next moment it can be in GB, so dynamically switching units for the whole data set is not feasible. So we are sticking with B/s.
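To illustrate the point, here is a small sketch (the helper below is hypothetical, not from the rhscon-ui code): picking the "nicest" unit per sample yields mixed units within one series, so a single chart axis has to be scaled by one fixed unit, and B/s is the safe default.

UNITS = [("B/s", 1.0), ("KB/s", 1e3), ("MB/s", 1e6), ("GB/s", 1e9)]

def best_unit(value):
    # largest unit that still leaves the sample >= 1
    chosen = UNITS[0]
    for name, factor in UNITS:
        if value >= factor:
            chosen = (name, factor)
    return chosen

series = [512.0, 48e3, 3.2e6, 4.6e9]   # one node's rx+tx samples in bytes/s
print(["{:.1f} {}".format(v / best_unit(v)[1], best_unit(v)[0]) for v in series])
# ['512.0 B/s', '48.0 KB/s', '3.2 MB/s', '4.6 GB/s'] -- mixed units on one axis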
Moving to ON_QA. Fixed in version: rhscon-ui-0.0.53-1.el7scon
Tested with rhscon-ui-0.0.53-1.el7scon.noarch.rpm and the units are correct now.

In the "network throughput" graph, values are shown in B/s as large numbers because the units cannot be dynamically changed; see bug 1365995.

There is also a request to document how network throughput is calculated:
https://bugzilla.redhat.com/show_bug.cgi?id=1338692#c5

There is also bug 1365989 for calculating network throughput only from interfaces related to Ceph.
*** Bug 1366083 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1754