The VM network usage information in the Admin Portal has some gaps, causing the information to be outdated/inaccurate:

- Every 5 seconds, vdsm calculates the percentage of network used (considering the vnic speed) within that period.
- The RHEV backend consults vdsm every 15 seconds, getting the current vdsm value for network usage and updating the database.
  ** Here is the first gap: the backend skips 2/3 of the vdsm checks, and the data is not cumulative.
- The Admin Portal updates the Network usage bar with the database information every 5 seconds (default) if the browser has focus, OR every 60 seconds if it doesn't.
  ** Here is the second gap: the Admin Portal, if in focus, gets the same information from the database 3 times, while the actual information has already changed on the host.
  ** And also the third gap: the Admin Portal, if not in focus, takes 60 seconds to check the database information.

The worst case is the Admin Portal displaying information that is 80 seconds old (vdsm[5] + backend[15] + Admin Portal[60]).

Notice, in the attached image, the Admin Portal (without focus) always showing 0%, even with the high network load during a certain period.

Please consider having vdsm report the total of transferred bytes (and its timestamp), recording it to the database, and changing the frontend/restapi to calculate the network usage percentage between its own checks using the new database information.
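To illustrate the first gap, here is a rough sketch (illustrative only, not engine or vdsm code), assuming the backend simply takes the most recent vdsm value at each poll:

    # Rough sketch: vdsm produces a usage value every 5 seconds, the backend reads
    # only the most recent value every 15 seconds, so 2/3 of the vdsm checks never
    # reach the database.

    vdsm_ticks = list(range(0, 60, 5))       # seconds at which vdsm computes a value
    backend_ticks = list(range(0, 60, 15))   # seconds at which the backend polls vdsm

    stored = {max(t for t in vdsm_ticks if t <= poll) for poll in backend_ticks}
    dropped = [t for t in vdsm_ticks if t not in stored]

    print(f"stored {len(stored)} of {len(vdsm_ticks)} samples; dropped {len(dropped)}")
    # -> stored 4 of 12 samples; dropped 8 (i.e. 2 out of 3 vdsm checks are skipped)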
Created attachment 907672 [details] network usage stats propagation
Amador, I do not believe that we can increase the vdsm polling rate - we may choke the line and the Engine itself. It does make sense to lower the Vdsm sampling rate, so that it does not gather data that is bound to be dropped.
(In reply to Dan Kenigsberg from comment #2)
> Amador, I do not believe that we can increase the vdsm polling rate - we may
> choke the line and the Engine itself. It does make sense to lower the Vdsm
> sampling rate, so that it does not gather data that is bound to be dropped.

Dan, I agree. I don't intend to increase the vdsm polling rate. I think vdsm should report the total of transmitted/received bytes and let the engine calculate the percentage by itself. This way, regardless of the engine polling rate, it will always be able to represent all the data at some point.
Hi Amador,

I am still not clear on the issue. Two concrete questions:

1. What do you mean by the admin portal, when not in focus, being updated only once every 60 seconds? I think this is related either to your browser or to your OS; I don't think oVirt is aware in any way of whether the focus is on its dialog.

2. What is the basic problem here? To my understanding, what you described means that the engine will update the data once every 15 seconds, and therefore the reported data could be 15 seconds old. Is this the problem? Why does it matter, in the context of this bug, whether the engine calculates the rate or displays what's reported by vdsm?
(In reply to Lior Vernia from comment #5)
> Hi Amador,
>
> I am still not clear on the issue. Two concrete questions:
>
> 1. What do you mean by the admin portal, when not in focus, being updated
> only once every 60 seconds? I think this is related either to your browser
> or to your OS; I don't think oVirt is aware in any way of whether the focus
> is on its dialog.

Makes sense, but this is not the issue here.

> 2. What is the basic problem here? To my understanding, what you described
> means that the engine will update the data once every 15 seconds, and
> therefore the reported data could be 15 seconds old. Is this the problem?
> Why does it matter, in the context of this bug, whether the engine
> calculates the rate or displays what's reported by vdsm?

The main issue is the high probability (2/3) of network traffic not being reported anywhere. vdsm calculates the network load average every 5 seconds. Since the backend collects this information only every 15 seconds, 2 in 3 vdsm checks are simply dropped.

I see 3 possible solutions:

#1 Change vdsm to check the network load average every 15 seconds.
#2 Change the backend to collect from vdsm every 5 seconds.
#3 Report total_bytes and a timestamp, and let the frontend calculate the average by itself.

#3 is the most accurate, since #1 and #2 will only transfer the "drops" from vdsm<->backend to backend<->frontend (whenever the interval between frontend checks is greater than 15 seconds for #1 and 5 seconds for #2). With #3, the frontend can calculate the average between its own checks.
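To illustrate option #3, a minimal sketch (a hypothetical helper, not actual vdsm/engine code) of how the usage percentage could be derived from two cumulative (timestamp, total_bytes) samples:

    # Hypothetical helper: derive the per-vnic usage percentage from two
    # cumulative samples, as proposed in option #3.

    def usage_percent(prev, curr, speed_mbps):
        """prev and curr are (timestamp_seconds, total_bytes) tuples for one vnic."""
        elapsed = curr[0] - prev[0]
        if elapsed <= 0:
            return 0.0
        bits_per_second = (curr[1] - prev[1]) * 8 / elapsed
        capacity = speed_mbps * 10**6              # vnic speed in bits per second
        return min(100.0, 100.0 * bits_per_second / capacity)

    # Example: 30 MB transferred in 15 seconds on a 1 Gbps vnic -> 1.6%
    print(usage_percent((0, 0), (15, 30 * 10**6), 1000))

With this approach the frontend's own polling interval only changes how coarse the average is; no transferred bytes are ever lost between checks.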
Amador, we have already implemented your option #1 in http://gerrit.ovirt.org/21257, which is part of ovirt-3.4.

Could you suggest that the customer set vm_sample_net_interval=15 in his /etc/vdsm/vdsm.conf (and restart vdsm), to verify whether it fixes his issue?
(In reply to Dan Kenigsberg from comment #7)
> Amador, we have already implemented your option #1 in
> http://gerrit.ovirt.org/21257, which is part of ovirt-3.4.
>
> Could you suggest that the customer set vm_sample_net_interval=15 in his
> /etc/vdsm/vdsm.conf (and restart vdsm), to verify whether it fixes his issue?

Hello Dan,

Thanks for your update. I have instructed the customer to try it, but the issue is still there; please see the feedback from the customer:

The results of this test are unsatisfactory. I find that if I stream a large (5GB) file I do indeed now see some traffic showing up in the console display. However, for "real world" usage it still shows 0%.

Investigating further, I find that the streaming test shows results for all hypervisors, regardless of whether or not the vm_sample_net_interval variable has been set. I will mention, however, that all the hypervisors have been upgraded since the last time I tried the same test, so something has been improved in more recent versions.

Asking the blindingly obvious: the CPU and memory displays show real data. Why is the same mechanism used to obtain that data not being used for the network interface?

What is your suggestion? Thanks!
Hello Liu,

It's not actually the same mechanism for each of these measures. But it doesn't really matter, as the one used for network usage should work independently of the others :)

The difference between "real world" usage and large file streaming leads me to suspect that perhaps the speed of the VM interface is defined to be quite large, so that "real world" usage turns out to be negligible when computing the percentage. Could you please corroborate/refute?

Could you also attach the output of running either "vdsClient -s 0 getAllVmStats" or "vdsClient -s 0 getVdsStats" while the VM has some "real world" usage but the engine shows nothing? This is to see if vdsm is indeed reporting some traffic that the engine is missing.

Lior.
(In reply to Lior Vernia from comment #9)
> Hello Liu,
>
> It's not actually the same mechanism for each of these measures. But it
> doesn't really matter, as the one used for network usage should work
> independently of the others :)
>
> The difference between "real world" usage and large file streaming leads me
> to suspect that perhaps the speed of the VM interface is defined to be quite
> large, so that "real world" usage turns out to be negligible when computing
> the percentage. Could you please corroborate/refute?
>
> Could you also attach the output of running either "vdsClient -s 0
> getAllVmStats" or "vdsClient -s 0 getVdsStats" while the VM has some "real
> world" usage but the engine shows nothing? This is to see if vdsm is indeed
> reporting some traffic that the engine is missing.
>
> Lior.

Hello Lior,

Thanks for your update. I have asked the same and got this feedback from the customer:

The speed of the interface is set to 1Gb. This is hardly "quite large". The reality of the matter is that the VM displays 0%, but when it has significant traffic the corresponding hypervisor, which has a 10Gb interface, *does* display traffic. The same issue has been asked about on the oVirt Users mailing list, so it's certainly not a problem unique to us.

Running your command may be great in theory, but it fails in practice. I am not about to monitor network traffic on the VMs in the hope of getting the timing for such a command just right.

Your opening sentence bothers me. Why isn't the same mechanism used? The other displays work as expected. The network display, using a different mechanism, doesn't work correctly. Surely that's a clue pointing to a solution?

An update on this one: we have just created a new VM, which is functioning as a log collector. In that role there is very significant network traffic initially. I was able to capture the attached screenshots, which very clearly show that the network traffic shown for the VM is nothing like that shown for the host. Both have a 1Gb interface. Although the traffic shown for the host was considerably higher seconds earlier, it had dropped down to only 10% by the time I captured the screenshot, which was taken just after capturing the one for the VM. Please note that the VM in question is the only one on that host, so all traffic, other than the completely insignificant management overhead, is generated by the VM.

I think the fundamental issue is this: the CPU and memory displays show real data. Why is the same mechanism used to obtain that data not being used for the network interface?

Please look into the issue and help it move forward. Thanks!

Leo
Please help us further debug the issue by supplying the output of "vdsClient -s 0 getVdsStats" run repeatedly while the VM generates traffic. I'd like to see whether Vdsm fails to report the vnic-specific data.
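One possible way to collect that output repeatedly, a sketch only: it assumes vdsClient is on the host's PATH and that the per-vnic counters appear as rxRate/txRate entries in getAllVmStats output (the command mentioned in comment #9; getVdsStats can be substituted for host-level stats):

    # Sketch: run vdsClient every 5 seconds for about a minute and print only the
    # lines that look like per-vnic rate counters, with a timestamp, so the output
    # can be compared with what the engine shows. Adjust the filter as needed.
    import subprocess
    import time

    for _ in range(12):
        out = subprocess.run(["vdsClient", "-s", "0", "getAllVmStats"],
                             capture_output=True, text=True).stdout
        stamp = time.strftime("%H:%M:%S")
        for line in out.splitlines():
            if "rxRate" in line or "txRate" in line or "network" in line:
                print(stamp, line.strip())
        time.sleep(5)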
1. I'm still unsure why this is on DWH.

2. I'm still unsure why the customer saw '0' in his tests. It makes little sense to me - unless there was really low bandwidth.

3. As for missing 2/3 of the stats - I agree we need to think of a solution here, as we clearly lose some granularity, which may or may not be required. If it isn't, let's move to 15 seconds on the VDSM side. If it is, I suggest that VDSM either averages the last 3 samples (of 5 seconds each) or sends 3 samples back to the engine every 15 seconds.

Either way, I'm not sure where the bug is. Dan - can you look into this?
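A tiny sketch of the two alternatives in point 3 (illustrative only, with made-up per-vnic percentages):

    # Illustrative only: vdsm keeps the three 5-second samples gathered during one
    # 15-second engine period and either averages them or reports them all, so no
    # sample is silently dropped.
    pending = [12.0, 3.5, 0.8]                     # hypothetical usage %, one per 5 s

    averaged = sum(pending) / len(pending)         # alternative A: one averaged value
    batched = list(pending)                        # alternative B: all three samples
    pending.clear()                                # reset for the next 15-second period

    print(f"averaged={averaged:.1f}%  batched={batched}")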
We (Lior, Yaniv, Amador, me) spent quite a lot of time trying to fully understand this issue, and I am not sure we have. But please note that with bug 1066570 verified in rhev-3.6 we have a much more reliable API: with cumulative reporting, a missed sample only means lost granularity. I believe that DWH should be modified to consume the new API and compute the tx/rx rate as a derivative of the cumulative data.
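A small sketch of that point, using hypothetical sample values: with cumulative counters, dropping intermediate samples only coarsens the computed rate, while the total transferred bytes are never lost.

    # Hypothetical (timestamp_seconds, total_rx_bytes) samples for one vnic.
    samples = [(0, 0), (15, 30_000_000), (30, 90_000_000),
               (45, 90_500_000), (60, 120_000_000)]

    def rates(points):
        """Per-interval byte rates, i.e. the derivative of the cumulative counter."""
        return [(b1 - b0) / (t1 - t0)
                for (t0, b0), (t1, b1) in zip(points, points[1:])]

    print(rates(samples))        # full 15-second granularity
    print(rates(samples[::2]))   # every other sample missed: coarser, same total bytes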
We already added the collection of total RX/TX byte statistics to dwh and reports in 3.6:

https://bugzilla.redhat.com/show_bug.cgi?id=1215587
https://bugzilla.redhat.com/show_bug.cgi?id=1228991

In reports we added the fields to the ad hoc domain, and the user can create a calculated column based on those fields to calculate the amount of network used.

So this might be a duplicate. Can we close it, or is there anything else missing?
(In reply to Shirly Radco from comment #21)
> We already added the collection of total RX/TX byte statistics to dwh and
> reports in 3.6:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1215587
> https://bugzilla.redhat.com/show_bug.cgi?id=1228991
>
> In reports we added the fields to the ad hoc domain, and the user can create
> a calculated column based on those fields to calculate the amount of network
> used.
>
> So this might be a duplicate.
> Can we close it, or is there anything else missing?

This is about making the collection more accurate; we will be looking to resolve this by providing a metrics store in RHEV 4.
*** This bug has been marked as a duplicate of bug 1349309 ***