Description of problem:

Scenario (scale):
- 50+ Storage Domains in the DC
- 25 Hosts

Action:
- Adding another SD to the Pool

Problems:
- Unexpected temporary load spikes, causing sub-optimal service for already running VMs.
- VM monitors unresponsive for a short time.

In the VDSM logs:

1. ConnectStorageServer:
2018-12-12 13:42:54,474+0100 DEBUG (jsonrpc/0) [jsonrpc.JsonRpcServer] Calling 'StoragePool.connectStorageServer'....

2. HBA scan:
2018-12-12 13:42:54,565+0100 DEBUG (jsonrpc/0) [storage.HBA] Starting scan (hba:58)

3. vgs to find the new SD:
2018-12-12 13:43:00,528+0100 DEBUG (monitor/6d637c9) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-47 /usr/bin/sudo -n /sbin/lvm vgs ...
2018-12-12 13:43:00,653+0100 DEBUG (jsonrpc/0) [storage.HSM] Found SD uuids: ('.....') (hsm:2350)

4. ConnectStorageServer returns:
2018-12-12 13:43:00,658+0100 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 6.18 seconds (__init__:539)

Up to this point everything looks fine, and even the lvm vgs is quick. But from here on we start seeing load spikes as the existing SDs are refreshed and monitoring of the new SD starts.

Some things I noticed:

a) Before the new SD was added we had 3-4 LVM commands per second, but while the new SD is being added there are tons. I understand vdsm dropped its caches and is refreshing everything, including the SDs that already existed. See the number of lvm commands per minute (a rough way to tally these from vdsm.log is sketched at the end of this comment):

  36 2018-12-12 13:41
  52 2018-12-12 13:42
 443 2018-12-12 13:43  <--- 1 SD added, going from 56 to 57
 283 2018-12-12 13:44
 438 2018-12-12 13:45
 328 2018-12-12 13:46
  48 2018-12-12 13:47
  24 2018-12-12 13:48

Multiply this by 25 hosts and we should have ~10000 lvm commands per minute, which may even impact the storage itself.

b) It seems each SD monitor runs its own vgs when refreshing, instead of a common vgs for all of them:
2018-12-12 13:43:11,014+0100 (monitor/3ff1fae) /sbin/lvm vgs
2018-12-12 13:43:11,031+0100 (monitor/093830f) /sbin/lvm vgs
2018-12-12 13:43:11,047+0100 (monitor/e36c27b) /sbin/lvm vgs
2018-12-12 13:43:11,064+0100 (monitor/ec6f1c2) /sbin/lvm vgs
2018-12-12 13:43:11,081+0100 (monitor/8f6811b) /sbin/lvm vgs
2018-12-12 13:43:11,099+0100 (monitor/0e39d15) /sbin/lvm vgs
2018-12-12 13:43:11,115+0100 (monitor/f6c271d) /sbin/lvm vgs
2018-12-12 13:43:11,131+0100 (monitor/4b8d054) /sbin/lvm vgs
2018-12-12 13:43:11,146+0100 (monitor/6d637c9) /sbin/lvm vgs
2018-12-12 13:43:11,168+0100 (monitor/3bba612) /sbin/lvm vgs
2018-12-12 13:43:11,190+0100 (monitor/dc425d0) /sbin/lvm vgs
2018-12-12 13:43:11,205+0100 (monitor/894d769) /sbin/lvm vgs
2018-12-12 13:43:11,222+0100 (monitor/8a49441) /sbin/lvm vgs
2018-12-12 13:43:11,237+0100 (monitor/0b380e6) /sbin/lvm vgs
2018-12-12 13:43:11,253+0100 (monitor/f3d9d4f) /sbin/lvm vgs
2018-12-12 13:43:11,268+0100 (monitor/9c14d90) /sbin/lvm vgs
2018-12-12 13:43:11,285+0100 (monitor/0455c44) /sbin/lvm vgs
2018-12-12 13:43:11,301+0100 (monitor/c24021c) /sbin/lvm vgs
2018-12-12 13:43:11,317+0100 (monitor/a2050d0) /sbin/lvm vgs
2018-12-12 13:43:11,332+0100 (monitor/62162e6) /sbin/lvm vgs
2018-12-12 13:43:11,347+0100 (monitor/330d343) /sbin/lvm vgs
2018-12-12 13:43:11,363+0100 (monitor/7c5f77f) /sbin/lvm vgs
2018-12-12 13:43:11,378+0100 (monitor/731c37b) /sbin/lvm vgs
2018-12-12 13:43:11,393+0100 (monitor/59a2d31) /sbin/lvm vgs
2018-12-12 13:43:11,409+0100 (monitor/b75c534) /sbin/lvm vgs
2018-12-12 13:43:11,424+0100 (monitor/8114b42) /sbin/lvm vgs
2018-12-12 13:43:11,440+0100 (monitor/aa26ebd) /sbin/lvm vgs
2018-12-12 13:43:11,456+0100 (monitor/4f349c7) /sbin/lvm vgs
2018-12-12 13:43:11,472+0100 (monitor/0260715) /sbin/lvm vgs

c) The load only seems to go back to normal levels once all SDs are produced and monitored.

2018-12-12 13:46:30,075+0100 INFO (monitor/012101a) [storage.Monitor] Domain 012101ae-e683-4393-bba6-cee062d67a2a became VALID (monitor:471)

d) Context switches also spike, I assume due to the large number of new processes being spawned (the lvm commands):

10:41:35 PM     proc/s   cswch/s
10:42:35 PM      33.78  33366.76
10:43:36 PM      52.67  62432.96
10:44:36 PM      64.65  62035.40
10:45:36 PM      80.33  85906.61
10:46:36 PM      84.95  87470.62
10:47:36 PM      40.98  36692.07

e) Run queue and load average also jump:

10:41:35 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
10:42:35 PM         4      2452      1.99      2.65      2.79         0
10:43:36 PM        40      2574     17.38      6.12      3.92         4
10:44:36 PM         7      2473     17.23      8.94      5.08         0
10:45:36 PM         2      2481     29.93     14.19      7.13         0
10:46:36 PM         3      2515     35.19     18.42      9.01         0
10:47:36 PM         1      2495     19.31     16.78      9.03         0

f) CPU usage - note the amount of time spent in kernel space (lvm syscalls?):

10:41:35 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:42:35 PM     all      6.11      0.00      1.90      0.14      0.00     91.84
10:43:36 PM     all      8.12      0.00     23.80      0.48      0.00     67.61
10:44:36 PM     all      7.84      0.00     22.88      0.57      0.00     68.71
10:45:36 PM     all      9.41      0.00     40.94      0.75      0.00     48.91
10:46:36 PM     all      9.58      0.00     43.07      0.75      0.00     46.60
10:47:36 PM     all      6.28      0.00      2.93      0.15      0.00     90.64

g) IO activity noticeably spikes too:

10:41:35 PM       tps      rtps      wtps   bread/s   bwrtn/s
10:42:35 PM    331.78    231.36    100.42   9887.35   2938.26
10:43:36 PM   1432.41   1327.85    104.57  28176.10   2975.54
10:44:36 PM   1772.15   1657.40    114.74  36133.27   4514.01
10:45:36 PM   2464.68   2355.14    109.54  48581.85   3271.79
10:46:36 PM   2520.50   2410.22    110.27  51025.23   3544.96
10:47:36 PM    494.47    395.87     98.60  13853.77   3006.62

h) Finally, VM monitors go unresponsive at 13:46:24:
2018-12-12 13:46:24,303+0100 WARN (jsonrpc/7) [virt.vm] (vmId='fafa7e5d-cbd7-4f46-ba65-0f7254dd91a6') monitor became unresponsive (command timeout, age=63.5500000007) (vm:5132)

I'm not sure, but this looks like the same problem as BZ1598266, which is about GetVolumeSize apparently not being able to run while the SDs are being refreshed/produced, so UpdateVolumes is blocked for each VM. I don't fully understand that area (far from it), so the above may contain some misunderstandings of how this part works.

Version-Release number of selected component (if applicable):
vdsm-4.19.45-1.el7ev.x86_64

How reproducible:
Happens constantly on the customer environment (scale).

Steps to Reproduce:
1. Add a Storage Domain to a Pool with 50+ SDs.

Actual results:
Higher than expected load.

Expected results:
Some resource usage, but nothing this significant.
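For reference, here is a minimal sketch of how the per-minute lvm command counts in item (a) can be tallied from vdsm.log. It assumes the log format shown in the excerpts above (timestamp at the start of each line, lvm invocations containing '/sbin/lvm'); the script name is just an example, and the counts are approximate since a single invocation may be logged on more than one line.

# count_lvm_per_minute.py - rough sketch, assumes vdsm.log lines start with
# "YYYY-MM-DD HH:MM:SS,mmm+ZZZZ" and that lvm invocations contain "/sbin/lvm".
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as log:      # e.g. /var/log/vdsm/vdsm.log
    for line in log:
        if "/sbin/lvm" in line:
            # "2018-12-12 13:43" -> one bucket per minute
            counts[line[:16]] += 1

for minute in sorted(counts):
    print(counts[minute], minute)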
Germano, can you look at https://bugzilla.redhat.com/show_bug.cgi?id=1488892 - it reduced the number of full refreshes and landed in 4.2.1.
(In reply to Roy Golan from comment #6)
> Germano, can you look at
> https://bugzilla.redhat.com/show_bug.cgi?id=1488892 - it reduced the
> number of full refreshes and landed in 4.2.1.

Hi Roy,

Nice. But this case has just 2 active and 2 enabled paths, and the load spikes are *very* high. There seems to be a lot of contention between LVM commands because there are 50+ SDs in total, so dropping all the caches and refreshing everything fires a lot of lvm commands. Even if each SD is refreshed just once, I think the problem may still be perceptible. So I'm not sure that patch is enough to fully solve this. What do you think?
Vdsm does not limit the number of lvm commands, and we refresh the lvm cache for every domain when performing operations like adding a storage domain. With a large number of storage domains you get a large number of concurrent lvm commands in such a case. We discussed this possible issue in the past, but never saw a real case where it was problematic.

I think this patch can help:
https://gerrit.ovirt.org/c/66932/

I suggest first reproducing this issue in a scale environment, and then seeing if the patch makes a difference.
Patch https://gerrit.ovirt.org/96422 is only for discussion; it eliminates unneeded lvm refreshes. I think we can work on this for 4.4, since it requires major changes in the system.
Patch https://gerrit.ovirt.org/96436 is also only for discussion. I think this is the direction we should take in 4.4 if we want to improve scalability.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
We now limit the number of concurrent lvm calls to 10 per host. This may help reduce the load.

We also have experimental patches that may improve caching further, but they require much more work. We will continue working on this after we have test results for the current code.

Daniel, can you reproduce with vdsm 4.30.8, and test with vdsm 4.30.9?
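For illustration, a minimal sketch of the throttling idea, assuming a simple semaphore around lvm command execution. This is not the actual vdsm code; the function and variable names are made up, and only the limit of 10 comes from the change described above.

# Sketch only: cap the number of concurrently running lvm commands at 10.
import subprocess
import threading

MAX_CONCURRENT_LVM = 10
_lvm_slots = threading.BoundedSemaphore(MAX_CONCURRENT_LVM)

def run_lvm(args):
    # Callers block here once 10 lvm commands are already running, so a
    # refresh of 50+ storage domains no longer spawns 50+ lvm processes
    # at once; the same work is spread over a longer period.
    with _lvm_slots:
        return subprocess.run(
            ["/sbin/lvm"] + list(args),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )

# Example: 50 monitor threads asking for a vgs refresh at the same time will
# keep at most 10 lvm processes running concurrently.
monitors = [threading.Thread(target=run_lvm, args=(["vgs"],)) for _ in range(50)]
for t in monitors:
    t.start()
for t in monitors:
    t.join()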
As I see the bug is now in MODIFIED state, we will receive the fix in the coming build. We have added it to our sprint to validate it in that build. Regarding your request about VDSM: which VDSM version will be part of the RHV 4.3 Beta 2 that will be built today, Feb 21?
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ] For more info please contact: rhv-devops
I have noticed that the QA ack was missing (acked it now). Is this part of the 4.3.1-2 drop?
Germano, can you define what "significant load" is, and how we can measure it? I think we should first reproduce this issue with vdsm 4.30.8, then try the same with vdsm-4.30.9, and compare the results.
Nir, can you please explain how we can verify your latest change limiting concurrent lvm calls to 10 per host?
(In reply to Daniel Gur from comment #18)
> Nir, can you please explain how we can verify your latest change limiting
> concurrent lvm calls to 10 per host?

We can watch the number of lvm processes run by vdsm, and check that we never see more than 10 processes while working with 50 storage domains. In previous versions we could see up to 50 lvm processes (one per domain).
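A rough way to watch this during the test, assuming lvm is invoked as /sbin/lvm as in the logs above (a sketch, not a vdsm tool):

# Sketch: print the number of running lvm processes once per second by
# scanning /proc; during the test the count should stay at or below 10.
import glob
import time

def count_lvm_processes():
    count = 0
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path, "rb") as f:
                cmdline = f.read()
        except OSError:      # the process exited while we were scanning
            continue
        if b"/sbin/lvm" in cmdline:
            count += 1
    return count

while True:
    print(time.strftime("%H:%M:%S"), count_lvm_processes())
    time.sleep(1)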
(In reply to Nir Soffer from comment #17)
> Germano, can you define what "significant load" is, and how we can measure
> it?
>
> I think we should first reproduce this issue with vdsm 4.30.8, then
> try the same with vdsm-4.30.9, and compare the results.

Hi Nir,

I added some examples of what the "significant load" produces in comment #0. See items 'd' to 'g', which are easier to see from a single host. On the actual customer environment, which has dozens of hosts, all hosts doing this at the same time impacts the storage performance for the VMs, so they get reduced (IO) performance too.

If I understand your patch correctly, we will spread this load over more time to generate less impact by limiting concurrent lvm commands. But it is not doing N SDs at a time and starting another SD only when one finishes; it will still do all SDs at the same time, just limiting the number of lvm commands. So each SD will take much longer to produce than if N SDs were done at a time. I'm afraid this will make things like GetVolumeSize block for much longer, as producing a SD will now take longer, triggering things like BZ1598266 even more?
Scale Environment:
- Hosted-Engine
- 2 Hosts - SuperMicro servers with ~200 VMs each
- rhv-release-4.3.1-1-001
- vdsm-4.30.9-1.el7ev.x86_64

Test Preparation:
* Create 60 LUNs on NetApp (FC)

Two tests were run:
1. Add 50 Storage Domains - Duration: 2:05 hrs (2019-02-24 17:00-19:00 Israel Time)
2. Add 10 more Storage Domains - Duration: ~40 min (2019-02-25 17:15-17:55 Israel Time)

During the tests, resource usage was collected using 'nmon' (link below).

Results:
1. Hosts CPU - ~12% (Sys - 2%, IO wait - 0%)
2. Hosts Disk Busy - 4-5%

VDSM log: no 'monitor became unresponsive' messages.

Link for 'nmon' results:
https://drive.google.com/open?id=1-I8JsjrZLJaE29qndwQmUR5phA6cCwpp
(In reply to Germano Veit Michel from comment #20)
> (In reply to Nir Soffer from comment #17)
> If I understand your patch correctly, we will spread this load over more
> time to generate less impact by limiting concurrent lvm commands.

Right. Before this patch, we could have up to 50 lvm commands running at the same time on every host, reading from remote storage. With the patch, we can have only 10 lvm commands running at the same time.

> But it is not doing N SDs at a time and starting another SD only when one
> finishes; it will still do all SDs at the same time, just limiting the
> number of lvm commands. So each SD will take much longer to produce than if
> N SDs were done at a time.

In my tests, running many lvm commands just makes every command slower. A few years ago we saw single lvm commands take up to 80 seconds. So I don't think limiting the number of commands will slow anything down.

> I'm afraid this will make things like GetVolumeSize block for much longer,
> as producing a SD will now take longer, triggering things like BZ1598266
> even more?

I don't think it will affect this, but we need to test it. Basically, any storage domain function can be blocked for an unlimited time because of the shared domain cache if the iSCSI/FC rescan is very slow. The real issue is slow rescans, not slow lvm commands.
We could not reproduce the issue in 4.3, and we did not test with an earlier version, so I'm not sure we understand this issue. It is possible that this was improved by limiting the number of concurrent lvm commands in vdsm, but without reproducing the issue with a version that does not include this change we cannot be sure. Therefore I don't think we can have meaningful doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1077