Description of problem:

We have a customer with strange behaviour on one of the hypervisors (Heartbeat exceeded errors):

Nov 22, 2017 5:51:19 AM  VDSM test command GetStatsVDS failed: Heartbeat exceeded

This has been happening regularly, on a weekly basis, for the last 2 months. Once we restart the engine, or the host is rebooted, the behaviour is gone for a week; after that the Heartbeat errors are back on a daily basis. All hypervisors were updated to the latest version during this period, and the problem still occurs.

The behaviour reminds me of a similar issue: https://bugzilla.redhat.com/show_bug.cgi?id=1496517

Version-Release number of selected component (if applicable):
RHV 4.1.6

How reproducible:
100%

Actual results:
Once the engine is restarted the Heartbeat errors stop; after one week they are back on a daily basis.

Expected results:
Help to identify the root cause.

Additional info:
Will add debug logs from vdsm/engine and a gcore from vdsm in the next update.

Thanks
The BZ that you mention could potentially cause heartbeat exceeded errors. Have you tried increasing the heartbeat interval?
Severity?
As it's happening on a weekly basis and no root cause has been found so far: High.
Thx
Heartbeat exceeded issues are usually caused by the host being over-utilized or by a networking issue. Do we have any information about how the failing host is utilized? Have you tried increasing the heartbeat interval?
Hello,

we did not increase the heartbeat interval; when the engine gets heartbeat exceeded, the host goes to a non-responsive state after 1 second. However, I will ask the customer to set it to 60 seconds to see if it helps. I am also attaching 2 coredump files from the vdsm daemon that were generated when the issue last appeared.
(In reply to Marian Jankular from comment #7)
> Hello,
>
> we did not increase the heartbeat interval; when the engine gets heartbeat
> exceeded, the host goes to a non-responsive state after 1 second. However,
> I will ask the customer to

This is how it works. Heartbeats cause the non-responsive state.

> set it to 60 seconds to see if it helps. I am also attaching 2 coredump
> files from the vdsm daemon that were generated when the issue last appeared.

Please let us know about the results of increasing the interval.
Hi Piotr,
The host failed again; the behaviour is the same as before increasing the heartbeat interval. I'm collecting the logs from the customer.
Thx
Jaroslav, do you know what the new value of the heartbeat interval was?
Hi Piotr,
We doubled it to 60 seconds, but the behaviour is still the same. I finally received the logs; I'm parsing them and will upload them to the BZ.
Thanks
(In reply to Jaroslav Spanko from comment #0)
> The behaviour reminds me of a similar issue:
> https://bugzilla.redhat.com/show_bug.cgi?id=1496517

Are you running with the fixed libvirt-python?
Delayed calls in vdsm.log.10.xz:

2017-12-05 05:06:30,290+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4599966.17 at 0x1d9bad0>> delayed by 0.69 seconds (asyncutils:138)
2017-12-05 05:09:20,479+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4600136.17 at 0x1d9bad0>> delayed by 0.88 seconds (asyncutils:138)
2017-12-05 05:29:10,418+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4601326.17 at 0x1d9bad0>> delayed by 0.82 seconds (asyncutils:138)
2017-12-05 05:29:11,830+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4601327.51 at 0x3606110>> delayed by 0.89 seconds (asyncutils:138)
2017-12-05 05:39:31,540+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4601947.51 at 0x3606110>> delayed by 0.60 seconds (asyncutils:138)
2017-12-05 05:50:31,560+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_iso/f818e07a-977c-404d-8430-799cf652c5af/dom_md/metadata running next_check=4602607.45 at 0x1dac9d0>> delayed by 0.68 seconds (asyncutils:138)
2017-12-05 05:50:36,730+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4602607.51 at 0x3606110>> delayed by 5.79 seconds (asyncutils:138)
2017-12-05 05:50:47,069+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4602616.17 at 0x1d9bad0>> delayed by 7.47 seconds (asyncutils:138)
2017-12-05 05:51:00,940+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata running next_check=4602617.25 at 0x1d9bfd0>> delayed by 20.26 seconds (asyncutils:138)
2017-12-05 05:51:08,240+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_iso/f818e07a-977c-404d-8430-799cf652c5af/dom_md/metadata running next_check=4602617.45 at 0x1dac9d0>> delayed by 27.36 seconds (asyncutils:138)
2017-12-05 05:51:21,099+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4602627.51 at 0x3606110>> delayed by 30.16 seconds (asyncutils:138)
2017-12-05 05:51:35,950+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4602646.17 at 0x1d9bad0>> delayed by 26.35 seconds (asyncutils:138)
2017-12-05 05:51:43,889+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata running next_check=4602647.25 at 0x1d9bfd0>> delayed by 33.21 seconds (asyncutils:138)
2017-12-05 05:51:52,180+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_iso/f818e07a-977c-404d-8430-799cf652c5af/dom_md/metadata running next_check=4602657.45 at 0x1dac9d0>> delayed by 31.30 seconds (asyncutils:138)
2017-12-05 05:52:13,499+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4602677.51 at 0x3606110>> delayed by 32.56 seconds (asyncutils:138)
2017-12-05 05:52:22,239+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4602686.17 at 0x1d9bad0>> delayed by 32.64 seconds (asyncutils:138)
2017-12-05 05:52:24,438+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata running next_check=4602687.25 at 0x1d9bfd0>> delayed by 33.76 seconds (asyncutils:138)
2017-12-05 05:52:24,830+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_iso/f818e07a-977c-404d-8430-799cf652c5af/dom_md/metadata running next_check=4602707.45 at 0x1dac9d0>> delayed by 13.95 seconds (asyncutils:138)
Jaroslav, can you attach the output of rpm -qa to the bug?
Added, thx
Nir, please take a look at the output
Looking at the output of rpm -qa (attachment 1366527 [details]):

libvirt-python-3.2.0-3.el7_4.1.x86_64    Tue Oct 24 11:19:22 2017

The libvirt-python bug 1497197 was fixed in libvirt-python-3.2.0-3.el7_4.1, so this is not the same issue.
How can we diagnose what the issue is?
I checked the engine log and found:

2017-12-16 16:28:23,253+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [5d8e858e] Command 'GetAllVmStatsVDSCommand(HostName = brubl535, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='15eb866b-c0a9-4ec7-a35b-c3117f6bec9c'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

Around the same time I see blocked tasks in the vdsm log:

2017-12-16 16:28:08,240+0100 DEBUG (vdsm.Scheduler) [Executor] executor state: count=5 workers=set([<Worker name=periodic/234 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x1cb6cd0> at 0x1cb6d10> timeout=1.0, duration=85 at 0x31a3710> discarded task#=33099 at 0x7f528408c210>, <Worker name=periodic/238 waiting task#=10691 at 0x34ea650>, <Worker name=periodic/243 waiting task#=0 at 0x1cea7d0>, <Worker name=periodic/241 waiting task#=8668 at 0x2fe2f10>, <Worker name=periodic/242 waiting task#=8713 at 0x7f53241947d0>]) (executor:205)
2017-12-16 16:28:14,070+0100 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/234 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x1cb6cd0> at 0x1cb6d10> timeout=1.0, duration=85 at 0x31a3710> discarded task#=33099 at 0x7f528408c210> (executor:348)
2017-12-16 16:28:10,089+0100 WARN  (jsonrpc/0) [virt.vm] (vmId='e95c22b3-e3d6-4674-a2d2-35afe377ce02') monitor became unresponsive (command timeout, age=76.4800000004) (vm:5132)
2017-12-16 16:28:10,149+0100 WARN  (jsonrpc/5) [virt.vm] (vmId='3e1d3243-0549-40a0-bb69-7e3645f22652') monitor became unresponsive (command timeout, age=94.71) (vm:5132)
2017-12-16 16:28:11,549+0100 WARN  (jsonrpc/2) [virt.vm] (vmId='e37cb5f9-b1f7-4c66-a783-15a5a05b1ce5') monitor became unresponsive (command timeout, age=93.0100000007) (vm:5132)
2017-12-16 16:28:08,150+0100 DEBUG (periodic/243) [Executor] Worker started (executor:291)
2017-12-16 16:28:15,558+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=5591205.82 at 0x1ceec90>> delayed by 66.31 seconds (asyncutils:138)
2017-12-16 16:28:25,769+0100 WARN  (jsonrpc/2) [virt.vm] (vmId='eb4eccb4-55a8-4ecd-9e0c-2f8ee37cadde') monitor became unresponsive (command timeout, age=102.48) (vm:5132)
2017-12-16 16:28:26,729+0100 INFO  (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:51446 (protocoldetector:127)
2017-12-16 16:28:25,630+0100 WARN  (jsonrpc/0) [virt.vm] (vmId='e37cb5f9-b1f7-4c66-a783-15a5a05b1ce5') monitor became unresponsive (command timeout, age=110.19) (vm:5132)
2017-12-16 16:28:29,910+0100 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/1 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getStats', 'id': u'90c1cd4a-e729-4fa8-b49c-5cbfab3995a1'} at 0x355eb90> timeout=60, duration=126 at 0x355e950> task#=50810 at 0x2eb0ad0> (executor:351)
2017-12-16 16:28:33,509+0100 DEBUG (periodic/234) [Executor] Worker was discarded (executor:298)
2017-12-16 16:28:34,569+0100 WARN  (check/loop) [storage.check] Checker '/dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata' is blocked for 105.06 seconds (check:279)
2017-12-16 16:28:54,819+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=5591315.82 at 0x1ceec90>> missed 10 deadlines, scheduling next call at 5591315.82 (asyncutils:161)

Nir, please take a look.
(In reply to Piotr Kliczewski from comment #43)

In your comment we see that we have:

1. A blocked drive monitor - libvirt is not responsive - typically caused by non-responsive storage:

2017-12-16 16:28:14,070+0100 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/234 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x1cb6cd0> at 0x1cb6d10> timeout=1.0, duration=85 at 0x31a3710> discarded task#=33099 at 0x7f528408c210> (executor:348)

2. A blocked checker - this means storage is not responsive:

2017-12-16 16:28:34,569+0100 WARN  (check/loop) [storage.check] Checker '/dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata' is blocked for 105.06 seconds (check:279)

But this is block storage; it should never block for 105 seconds with our multipath configuration.

3. A delayed checker - smells like a vdsm hang:

2017-12-16 16:28:15,558+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=5591205.82 at 0x1ceec90>> delayed by 66.31 seconds (asyncutils:138)

Jaroslav,
1. Do we have /var/log/messages and sanlock.log from the same time as the log messages above?
2. Please attach /etc/multipath.conf from this host, ideally from all hosts in this cluster.
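(Editor's note: the checker in these warnings reads a small block from the domain metadata with direct I/O. A minimal standalone sketch of that kind of probe, useful to decide whether the problem is below vdsm, with illustrative path and timeout rather than values from this host:)

import subprocess
import time

def check_path(path, timeout=10):
    """Time a single 4 KiB direct-I/O read from `path`.

    Roughly the kind of probe the storage checker performs; if this blocks
    for tens of seconds, the problem is in the multipath/SAN layer, not in
    vdsm itself.
    """
    start = time.monotonic()
    subprocess.run(
        ["dd", "if=" + path, "of=/dev/null",
         "bs=4096", "count=1", "iflag=direct"],
        check=True, timeout=timeout,
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return time.monotonic() - start

# Illustrative path taken from the warnings above:
# print("%.2f s" % check_path("/dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata"))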
Jaroslav, can you explain why multipath.conf is using this configuration?

device {
    vendor "3PARdata"
    product "VV"
    path_grouping_policy group_by_prio
    path_selector "round-robin 0"
    path_checker tur
    features "0"

I don't think this should be included in a device configuration. The features string should be updated automatically by multipath based on other configuration values.

    hardware_handler "1 alua"
    prio alua
    failback immediate
    rr_weight uniform
    no_path_retry 18

This will retry 18 times (up to 90 seconds after all paths have failed) before failing the I/O. This can cause timeouts in vdsm operations and in engine operations. These timeouts may lead to failures in storage flows, and I think the engine may treat these failures as network errors.

    rr_min_io_rq 1
    detect_prio yes
    fast_io_fail_tmo 10

This will delay the time for failing over to another path. The vdsm default is 5. I recommend using the default vdsm value unless the storage vendor can explain why 10 is better.

    dev_loss_tmo 14

This looks strange; I don't know if this can cause trouble. Again, the storage vendor should explain why this value should be used.

}

Based on this configuration, we need to check if there are timeouts caused by failure of all paths while I/O is queued under no_path_retry 18.

In 4.2 we added support for multipath alerts, which will make it much easier to detect issues in multipath devices, see:
https://ovirt.org/develop/release-management/features/storage/multipath-events/

Moving the bug to storage to complete this investigation.
Created attachment 1382125 [details]
Multipath configuration from 4.2

Jaroslav, I'm attaching the multipath configuration from 4.2. There is no change in the configuration, but the file is fully documented. This should help the customer and the storage vendor tune multipath.conf so it works best with both RHV and the storage.
Martin, we have unexplained hangs in vdsm (see vdsm.log.44). The only thing I found in the log is unexplained hosted-engine logging like:

2017-12-16 16:26:40,489+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successful response from socket (brokerlink:256)
2017-12-16 16:26:40,500+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Closing connection to ha-broker (brokerlink:86)
2017-12-16 16:26:40,520+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Connecting to ha-broker (brokerlink:51)
2017-12-16 16:26:40,529+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successfully connected (brokerlink:80)
2017-12-16 16:26:40,529+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Sending request: set-storage-domain FilesystemBackend dom_type=fc sd_uuid=37c4f11e-c643-4263-ab81-7b48164e762d (brokerlink:274)
2017-12-16 16:26:40,529+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] socket_readline with 30.0 seconds timeout (util:80)
2017-12-16 16:26:40,540+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Full response: success 139768642954576 (brokerlink:283)
2017-12-16 16:26:40,549+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successful response from socket (brokerlink:256)
2017-12-16 16:26:40,549+0100 INFO  (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Success, id 139768642954576 (brokerlink:179)
2017-12-16 16:26:40,549+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Sending request: is-host-alive service_type=hosted-engine.metadata (brokerlink:274)
2017-12-16 16:26:40,550+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] socket_readline with 30.0 seconds timeout (util:80)
2017-12-16 16:26:40,556+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Full response: success 337C317C327C367C377C38 (brokerlink:283)
2017-12-16 16:26:40,557+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successful response from socket (brokerlink:256)
2017-12-16 16:26:40,557+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Alive hosts '[3, 1, 2, 6, 7, 8]' (brokerlink:221)
2017-12-16 16:26:40,557+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Is host '1' alive -> 'True' (brokerlink:223)
2017-12-16 16:26:40,558+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.client.client.HAClient] Is host '1' alive? -> 'True' (client:83)
2017-12-16 16:26:40,558+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Closing connection to ha-broker (brokerlink:86)
2017-12-16 16:26:40,558+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Connecting to ha-broker (brokerlink:51)
2017-12-16 16:26:40,558+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successfully connected (brokerlink:80)
2017-12-16 16:26:40,559+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Sending request: set-storage-domain FilesystemBackend dom_type=fc sd_uuid=37c4f11e-c643-4263-ab81-7b48164e762d (brokerlink:274)

Are we importing modules from hosted engine inside vdsm? Are we communicating with ha-broker from within vdsm?
Ok, I have the first smoking gun: someone is calling Host.getHardwareInfo 3 times in the same second:

2017-12-16 16:27:05,989+0100 DEBUG (jsonrpc/6) [jsonrpc.JsonRpcServer] Calling 'Host.getHardwareInfo' in bridge with {} (__init__:556)
2017-12-16 16:27:06,174+0100 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmStats' in bridge with {} (__init__:556)
2017-12-16 16:27:06,175+0100 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Calling 'Host.getHardwareInfo' in bridge with {} (__init__:556)
2017-12-16 16:27:06,176+0100 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Calling 'Host.getHardwareInfo' in bridge with {} (__init__:556)
2017-12-16 16:27:06,177+0100 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'Host.getHardwareInfo' in bridge with {} (__init__:556)

Here a monitor complains about a missed deadline. The timestamp of the log message is 3 seconds before the Host.getHardwareInfo calls! I don't have any idea how this can happen.

2017-12-16 16:27:03,070+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=5591205.82 at 0x1ceec90>> missed 1 deadlines, scheduling next call at 5591205.82 (asyncutils:161)
2017-12-16 16:27:13,550+0100 INFO  (monitor/37c4f11) [storage.LVM] Refreshing lvs: vg=37c4f11e-c643-4263-ab81-7b48164e762d lvs=['metadata'] (lvm:1291)

The monitor complains about a 24.89 second delay - this probably means vdsm was hung:

2017-12-16 16:27:14,820+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata running next_check=5591186.50 at 0x3406890>> delayed by 24.89 seconds (asyncutils:138)

We get responses from Host.getHardwareInfo...

2017-12-16 16:27:14,850+0100 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Return 'Host.getHardwareInfo' in bridge with {'systemProductName': 'ProLiant BL460c Gen9', 'systemSerialNumber': 'CZ262705X0', 'systemFamily': 'ProLiant', 'systemVersion': '', 'systemUUID': '31333138-3839-5A43-3236-323730355830', 'systemManufacturer': 'HP'} (__init__:582)
2017-12-16 16:27:15,630+0100 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Return 'Host.getHardwareInfo' in bridge with {'systemProductName': 'ProLiant BL460c Gen9', 'systemSerialNumber': 'CZ262705X0', 'systemFamily': 'ProLiant', 'systemVersion': '', 'systemUUID': '31333138-3839-5A43-3236-323730355830', 'systemManufacturer': 'HP'} (__init__:582)
2017-12-16 16:27:18,850+0100 DEBUG (check/loop) [storage.check] START check '/dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata' (delay=28.92) (check:283)
2017-12-16 16:27:19,729+0100 INFO  (monitor/37c4f11) [storage.LVM] Refreshing LVs (vg=37c4f11e-c643-4263-ab81-7b48164e762d, lvs=['metadata']) (lvm:1319)
2017-12-16 16:27:20,079+0100 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Return 'Host.getHardwareInfo' in bridge with {'systemProductName': 'ProLiant BL460c Gen9', 'systemSerialNumber': 'CZ262705X0', 'systemFamily': 'ProLiant', 'systemVersion': '', 'systemUUID': '31333138-3839-5A43-3236-323730355830', 'systemManufacturer': 'HP'} (__init__:582)
2017-12-16 16:27:26,660+0100 DEBUG (jsonrpc/6) [jsonrpc.JsonRpcServer] Return 'Host.getHardwareInfo' in bridge with {'systemProductName': 'ProLiant BL460c Gen9', 'systemSerialNumber': 'CZ262705X0', 'systemFamily': 'ProLiant', 'systemVersion': '', 'systemUUID': '31333138-3839-5A43-3236-323730355830', 'systemManufacturer': 'HP'} (__init__:582)

The Host.getHardwareInfo calls took 20-40 seconds:

2017-12-16 16:27:27,670+0100 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 21.41 seconds (__init__:539)
2017-12-16 16:27:47,379+0100 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 41.20 seconds (__init__:539)
2017-12-16 16:27:26,120+0100 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 19.94 seconds (__init__:539)

Again we see crazy time values in the logs. How can a log line from 2017-12-16 16:27:47 come before one from 2017-12-16 16:27:26?

getHardwareInfo calls dmidecode - a C extension. If this extension is not releasing the GIL properly, it can hang the Python process using it.

Next step: check whether dmidecode releases the GIL properly.
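(Editor's note: one way to check the "dmidecode holds the GIL" theory without reading the C code is to time a background ticker thread while the suspect call runs. A minimal sketch; the dmidecode call at the bottom is hypothetical and only illustrates the intended usage:)

import threading
import time

def gil_probe(func, *args, interval=0.5, **kwargs):
    """Run func() while a background thread ticks every `interval` seconds.

    If the worst gap between ticks is close to the call duration, the call
    never releases the GIL (or hangs the whole process); if ticks keep
    arriving, the GIL is released while the call runs.
    """
    ticks = []
    stop = threading.Event()

    def ticker():
        while not stop.is_set():
            ticks.append(time.monotonic())
            stop.wait(interval)

    t = threading.Thread(target=ticker)
    t.start()
    start = time.monotonic()
    try:
        func(*args, **kwargs)
    finally:
        duration = time.monotonic() - start
        stop.set()
        t.join()

    gaps = [b - a for a, b in zip(ticks, ticks[1:])]
    worst = max(gaps) if gaps else duration
    print("call took %.2fs, worst tick gap %.2fs" % (duration, worst))

# Hypothetical usage against the suspect extension:
#   import dmidecode
#   gil_probe(dmidecode.system)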
Jaroslav, sanlock.log from attachment 1375600 [details] ends at 2017-12-13, but the issue is seen at 2017-12-16. It is possible that there were no errors in sanlock.log during this time, but it is unlikely, since we see this in vdsm:

2017-12-16 16:28:34,569+0100 WARN  (check/loop) [storage.check] Checker '/dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata' is blocked for 105.06 seconds (check:279)

If a checker is blocked for 105 seconds, sanlock is likely to have issues with the block storage domain e1bc39a3-69d2-4693-a6ae-d7cdf4331eba. Maybe sanlock.log is not from the correct host, or maybe the part from the relevant time frame is missing? Maybe it is in another sosreport?
Nir: the answers to all your questions are yes, we are:
- vdsm talks to the broker using our brokerlink protocol
- the library for that is part of hosted engine

Moreover, older hosted-engine tools used getHardwareInfo, but the logs indicate there are three different connections, and hosted engine should be just one of those. We removed that 7 months ago and use the new jsonrpc retry mechanisms now that those are actually working.

I suggest upgrading the hosted-engine packages to at least what we have in 4.1.8 and seeing what happens.
The issue seems to be slow writes to /var/log/vdsm/vdsm.log.

We can see in the logs that the log timestamps do not match the log order:

2017-12-16 16:27:27,670+0100 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 21.41 seconds (__init__:539)
2017-12-16 16:27:47,379+0100 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 41.20 seconds (__init__:539)
2017-12-16 16:27:26,120+0100 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 19.94 seconds (__init__:539)

Message 2 was logged at 16:27:47, but it was written to the log before message 3, which was logged at 16:27:26. We have many instances of this in the logs.

How can this happen? When logging a message, the logging system:
1. creates a log record, keeping the time of the log message
2. passes the log record to all configured handlers
3. the handler takes the handler lock
4. the handler writes the log line to the file
5. the handler releases the handler lock

If writing the log message to the log file is slow, we can have multiple threads waiting on the handler lock. When the current thread releases the lock, one of the waiting threads wakes up and takes the lock. If that thread is trying to log a newer log record, log records will be written out of order.

Since all vdsm threads log through the same handler, even a small delay writing to the log causes a huge delay in all vdsm threads. The effect is much bigger when using DEBUG log level.

I could easily reproduce this issue by adding a 1 second sleep in the vdsm logfile handler. Here is a fragment of the log showing out-of-order log lines, and a storage checker delayed by many seconds:

2018-01-23 21:36:06,602+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 7.04 seconds (__init__:573)
2018-01-23 21:36:03,603+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/373e8c55-283f-41d4-8433-95c1ef1bbd1a/metadata running next_check=4711309.72 at 0x4350b90>> missed 5 deadlines, scheduling next call at 4711309.72 (asyncutils:161)
2018-01-23 21:36:08,605+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/metadata running next_check=4711249.75 at 0x408ae50>> delayed by 59.27 seconds (asyncutils:138)
2018-01-23 21:36:04,606+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.hostdevListByCaps succeeded in 19.04 seconds (__init__:573)

I also reproduced a "Heartbeat exceeded" error on the engine side:

2018-01-23 15:26:53,365-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler2) [64a6a831] Command 'GetAllVmStatsVDSCommand(HostName = voodoo6, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='d431535f-bfca-4a58-8e19-df8db015d072'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

I guess the delayed heartbeat was caused by the heartbeat thread trying to log, assuming that logging cannot block.

Jaroslav, can you check if /var/log/vdsm is located on an NFS mount, or if the host is booted from SAN? Maybe the weekly incident is triggered by a weekly backup procedure overloading the I/O on the host or the network connection to the /var/log mount?

We recommend having /var/log/vdsm on a fast local disk to avoid such issues.
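(Editor's note: the mechanism above can be reproduced outside vdsm with a few lines of Python. This sketch wraps a FileHandler so every write sleeps, then logs from several threads; the path and counts are illustrative:)

import logging
import threading
import time

class SlowFileHandler(logging.FileHandler):
    """Simulates slow log storage by sleeping inside emit().

    emit() runs while the shared handler lock is held, so every sleeping
    write blocks all other threads that try to log.
    """
    def emit(self, record):
        time.sleep(1.0)                       # simulated slow write
        logging.FileHandler.emit(self, record)

handler = SlowFileHandler("/tmp/slow.log")    # illustrative path
handler.setFormatter(logging.Formatter("%(asctime)s (%(threadName)s) %(message)s"))
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.DEBUG)

def worker():
    for i in range(3):
        # The record timestamp is taken here; the write happens whenever
        # this thread wins the handler lock, possibly after newer records.
        logging.info("message %d", i)

threads = [threading.Thread(target=worker, name="t%d" % n) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()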
For 4.2 we can fix this issue by writing log messages to storage in another thread, queuing log messages to mitigate temporary slowdowns when writing logs to storage.
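(Editor's note: the standard-library equivalent of that approach is logging.handlers.QueueHandler plus QueueListener. This is only a sketch of the idea, not the actual vdsm patch, which additionally bounds the queue and drops records when the writer falls behind; the log path is illustrative:)

import logging
import logging.handlers
import queue

log_queue = queue.Queue()

file_handler = logging.FileHandler("/var/log/vdsm/vdsm.log")   # illustrative
file_handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s (%(threadName)s) %(message)s"))

# Application threads only append records to the in-memory queue...
logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))
logging.getLogger().setLevel(logging.DEBUG)

# ...while one background thread drains the queue and does the slow writes.
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logging.info("this call returns immediately even if the log storage is slow")

listener.stop()   # flush the queue and join the writer thread on shutdown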
I did initial testing of the attached patches, and vdsm can now work correctly with storage that takes 1 second for each write. So it seems that the new logger works as expected.

However, vdsm verbs are still affected by a slow /var/log/vdsm mount, although they should not be. For example, this normally takes 5 seconds on my test system:

vdsm-client Host getDeviceList

But if we slow down access to /var/log/vdsm/ so that each write takes 1 second, it now takes 55 seconds. If we change the supervdsmd log level to ERROR, so we have no logs from supervdsmd, the same call takes 8 seconds.

When using a 1 second write delay, we can also see storage checkers delayed by a few seconds (1-5). Since logging cannot block now, this looks like a hang of the entire Python process for a few seconds.

If we slow down access to /var/log/messages so that a write takes 30 seconds, we get heartbeat exceeded issues and vdsm is soft-fenced.

So there is another issue that we don't understand yet. It may be some syscall wrapper in a Python C extension that is not releasing the GIL. We had such issues in the past.
Finally found the issue causing delays even with the new threaded logging: logging to NFS.

When vdsm logs to NFS, there is an open file descriptor to /var/log/vdsm/vdsm.log. This file descriptor is inherited by any child process started by vdsm. After the fork(), the new child tries to close all file descriptors that should not be inherited. If we slow down NFS storage writes, this also slows down close().

Here is an example strace from supervdsmd when using a write delay of 0.1 second:

[pid 20135] 02:56:18.070321 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7efdd3dd29d0) = 27265 <0.000792>
strace: Process 27265 attached
[pid 20135] 02:56:18.071283 close(34) = 0 <0.000030>
[pid 27265] 02:56:18.071327 set_robust_list(0x7efdd3dd29e0, 24) = 0 <0.000031>
[pid 20135] 02:56:18.071409 close(27) = 0 <0.000027>
[pid 27265] 02:56:18.071464 close(28 <unfinished ...>
[pid 20135] 02:56:18.071487 close(30 <unfinished ...>
[pid 27265] 02:56:18.071516 <... close resumed> ) = 0 <0.000033>
[pid 20135] 02:56:18.071529 <... close resumed> ) = 0 <0.000036>
[pid 27265] 02:56:18.071560 close(29 <unfinished ...>
[pid 20135] 02:56:18.071571 close(32) = 0 <0.000019>
[pid 27265] 02:56:18.071617 <... close resumed> ) = 0 <0.000049>
[pid 27265] 02:56:18.071646 close(31) = 0 <0.000016>
[pid 20135] 02:56:18.071675 read(33, <unfinished ...>
[pid 27265] 02:56:18.071698 close(33) = 0 <0.000012>
[pid 27265] 02:56:18.071735 dup2(27, 0) = 0 <0.000013>
[pid 27265] 02:56:18.071771 dup2(30, 1) = 1 <0.000012>
[pid 27265] 02:56:18.071806 dup2(32, 2) = 2 <0.000012>
[pid 27265] 02:56:18.071841 close(27) = 0 <0.000012>
[pid 27265] 02:56:18.071888 close(30) = 0 <0.000021>
[pid 27265] 02:56:18.071951 close(32) = 0 <0.000021>
[pid 27265] 02:56:18.072026 open("/proc/self/fd", O_RDONLY|O_CLOEXEC) = 14 <0.000046>
[pid 27265] 02:56:18.072130 getdents64(14, /* 11 entries */, 280) = 264 <0.000037>
[pid 27265] 02:56:18.072234 close(3) = 0 <0.000024>
[pid 27265] 02:56:18.072302 close(4) = 0 <0.116676>

Note: this close() took over 100,000 usec; all the other close() calls took about 20 usec.
[pid 27265] 02:56:18.189074 close(5) = 0 <0.000026>
[pid 27265] 02:56:18.189166 close(6) = 0 <0.000023>
[pid 27265] 02:56:18.189263 close(7) = 0 <0.000024>
[pid 27265] 02:56:18.189370 close(8) = 0 <0.000030>
[pid 27265] 02:56:18.189454 getdents64(14, /* 11 entries */, 280) = 264 <0.000067>
[pid 27265] 02:56:18.189584 close(9) = 0 <0.000024>
[pid 27265] 02:56:18.189709 close(10) = 0 <0.000043>
[pid 27265] 02:56:18.189805 close(11) = 0 <0.000032>
[pid 27265] 02:56:18.189896 close(12) = 0 <0.000026>
[pid 27265] 02:56:18.189975 close(13) = 0 <0.000025>
[pid 27265] 02:56:18.190079 close(15) = 0 <0.000044>
[pid 27265] 02:56:18.190198 close(16) = 0 <0.000023>
[pid 27265] 02:56:18.190271 close(17) = 0 <0.000029>
[pid 27265] 02:56:18.190351 close(18) = 0 <0.000035>
[pid 27265] 02:56:18.190442 close(19) = 0 <0.000029>
[pid 27265] 02:56:18.190526 getdents64(14, /* 8 entries */, 280) = 192 <0.000063>
[pid 27265] 02:56:18.190659 close(20) = 0 <0.000025>
[pid 27265] 02:56:18.190743 close(21) = 0 <0.000029>
[pid 27265] 02:56:18.190827 close(22) = 0 <0.000029>
[pid 27265] 02:56:18.190911 close(23) = 0 <0.000029>
[pid 27265] 02:56:18.191005 close(24) = 0 <0.000030>
[pid 27265] 02:56:18.191093 close(25) = 0 <0.000029>
[pid 27265] 02:56:18.191178 close(26) = 0 <0.000029>
[pid 27265] 02:56:18.191261 getdents64(14, /* 0 entries */, 280) = 0 <0.000030>
[pid 27265] 02:56:18.191346 close(14) = 0 <0.000031>
[pid 27265] 02:56:18.191433 execve("/usr/bin/taskset", ["/usr/bin/taskset", "--cpu-list", "0-7", "/usr/lib/udev/scsi_id", "--page=0x80", "--whitelisted", "--export", "--replace-whitespace", "--device=/dev/dm-44"], [/* 7 vars */]strace: Process 27265 detached
 <detached ...>

This issue slows down any vdsm operation that runs a child process. One example is getDeviceList, which runs /usr/lib/udev/scsi_id for every multipath device. Another example is the storage checkers running a dd process to read from storage. Since all storage checkers start dd from the same thread, a delay in starting one dd will delay the next checker.
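(Editor's note: a quick way to confirm this on a suspect host is to time close() on a freshly written file on the log mount versus a local filesystem; on NFS, close() flushes dirty data because of close-to-open consistency, which is one plausible reason the child's close() storm stalls. A small sketch with illustrative paths:)

import os
import time

def time_close(path):
    """Return how long close() takes after writing 4 KiB to `path`."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.write(fd, b"x" * 4096)        # leave dirty data for close() to flush
    start = time.monotonic()
    os.close(fd)
    return time.monotonic() - start

# Illustrative paths: a local filesystem vs. the suspected NFS log mount.
# print("local: %.4f s" % time_close("/var/tmp/close-test"))
# print("log:   %.4f s" % time_close("/var/log/vdsm/close-test"))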
Here are testing results with various storage backends for /var/log/vdsm and various write delays, simulating slow logging.

To simulate slow storage, I used:

- slowfs over nfs (slowfs-nfs): create a slowfs mount on another server and export the mount over NFS. On the host, mount the slowfs share at /var/log/vdsm.
- local slowfs (slowfs): create a slowfs mount at /var/log/vdsm.

Note: requires the latest slowfs from https://github.com/nirs/slowfs. To mount local slowfs you need to use:

% slowfs --allow-other /var/log/vdsm.real /var/log/vdsm

In both cases I simulate slow storage by adding a write delay:

$ slowfsctl set write N

All tests run with log level DEBUG for root, vds, storage, and virt, set in /etc/vdsm/logger.conf.

The test includes:
- activate the host (SPM)
- start 2 HA VMs with a lease
- create a new HA VM with a lease from a template
- remove the new VM
- deactivate an unused storage domain
- activate the storage domain again
- open the manage domain dialog for the iSCSI storage domain (invokes getDeviceList() 4 times)
- let the system run for 1 hour
- stop the VMs
- deactivate the host

I show 2 metrics:
- vdsm verb timings: affect user experience on the engine side
- delays in storage checkers: may cause a host to become non-operational

slowfs over nfs
===============

master, write delay 0.1 second
------------------------------

With this setup the system is barely usable, and the host usually gets fenced due to heartbeat exceeded during activation.

Slowest verbs

verb                                calls   mean     stdev
-----------------------------------------------------------------
Image.prepare                           1   71.220    0.000
StoragePool.connect                     4   62.780   16.305
Host.getDeviceList                      5   49.806    5.775
StorageDomain.activate                  1   41.870    0.000
Volume.setDescription                   8   32.252   14.932
VM.create                               2   28.270    0.360
StorageDomain.deactivate                1   26.650    0.000
Volume.create                           1   26.390    0.000
StoragePool.disconnect                  1   24.700    0.000
StorageDomain.getInfo                   2   23.830   13.320

Most called verbs

verb                                calls   mean     stdev
-----------------------------------------------------------------
Host.getAllVmStats                    514    3.280    2.714
Host.getStats                         152   11.329    5.081
StoragePool.getSpmStatus              124    6.332    5.191
StoragePool.getInfo                   123    8.124    9.277
Host.getAllVmIoTunePolicies           121    3.061    2.586
Task.getStatus                         53    4.098    2.784
Host.getAllTasksStatuses               12    3.896    1.265
StoragePool.connectStorageServer       11   17.737   12.729
Task.clear                             10   14.075   11.347
Volume.setDescription                   8   32.252   14.932

storage checkers delay

delays   max      mean
------------------------
1469     64.640   17.884

threaded-logging, write delay 0.1 second
----------------------------------------

With threaded logging we are not affected by slow logging, but we experience a slowdown because of slow close() when starting child processes, see comment 72.
Slowest verbs

verb                                calls   mean    stdev
-----------------------------------------------------------------
Host.getDeviceList                      5   8.692   0.741
StoragePool.connect                     4   4.912   0.845
StoragePool.disconnect                  1   3.790   0.000
StorageDomain.deactivate                1   3.010   0.000
StorageDomain.getInfo                   2   2.015   0.635
StoragePool.disconnectStorageServer     4   1.965   0.524
StoragePool.connectStorageServer        4   1.795   1.003
StorageDomain.activate                  1   1.640   0.000
Host.getCapabilities                    1   1.340   0.000
Volume.getQemuImageInfo                 1   0.880   0.000

Most called verbs

verb                                calls   mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                    576   0.006   0.005
StoragePool.getSpmStatus              416   0.021   0.068
StoragePool.getInfo                   415   0.055   0.207
Host.getStats                         288   0.188   0.064
Host.getAllVmIoTunePolicies           277   0.001   0.003
Task.getStatus                         23   0.003   0.005
Task.clear                              9   0.009   0.003
Volume.setDescription                   8   0.118   0.052
Host.getAllTasksStatuses                6   0.003   0.005
Host.getDeviceList                      5   8.692   0.741

storage checkers delay

delays   max     mean
------------------------
2        0.540   0.525

threaded-logging, no write delay
--------------------------------

We can compare this with the 0.1 second write delay to see how the delay affects timings.

Slowest verbs

verb                                calls   mean    stdev
-----------------------------------------------------------------
StoragePool.disconnect                  1   3.820   0.000
StoragePool.connect                     4   2.913   0.188
Host.getDeviceList                      5   2.316   0.142
StorageDomain.deactivate                1   2.010   0.000
StoragePool.connectStorageServer        4   1.163   0.771
StoragePool.disconnectStorageServer     4   1.098   0.209
StorageDomain.activate                  1   0.960   0.000
Host.getCapabilities                    1   0.940   0.000
StorageDomain.getInfo                   2   0.770   0.010
Volume.create                           1   0.460   0.000

Most called verbs

verb                                calls   mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                    511   0.005   0.005
StoragePool.getSpmStatus              336   0.019   0.012
StoragePool.getInfo                   335   0.038   0.112
Host.getStats                         232   0.082   0.048
Host.getAllVmIoTunePolicies           223   0.001   0.003
Task.getStatus                         14   0.001   0.003
Task.clear                              9   0.009   0.006
Volume.setDescription                   8   0.120   0.029
Host.getAllTasksStatuses                6   0.002   0.004
Host.getDeviceList                      5   2.316   0.142

storage checkers delay

delays   max     mean
------------------------
3        0.580   0.543

local slowfs
============

master, write delay 0.1 second
------------------------------

The results are better than slowfs over nfs, but the system is still barely usable.

Slowest verbs

verb                                calls   mean     stdev
-----------------------------------------------------------------
StorageDomain.activate                  1   62.380    0.000
StoragePool.connect                     5   54.154   10.357
Host.getDeviceList                      4   42.935    2.853
StoragePool.spmStart                    1   32.680    0.000
Volume.setDescription                   8   20.725   16.085
Image.prepare                           1   20.560    0.000
StoragePool.disconnect                  1   18.330    0.000
Image.delete                            1   18.100    0.000
StorageDomain.deactivate                1   16.480    0.000
StoragePool.connectStorageServer        7   15.606    7.453

Most called verbs

verb                                calls   mean     stdev
-----------------------------------------------------------------
Host.getAllVmStats                    524    2.034    2.231
Host.getAllVmIoTunePolicies           158    1.863    1.573
Host.getStats                         157    7.596    3.105
StoragePool.getSpmStatus              155    3.843    3.760
StoragePool.getInfo                   154    5.554    4.184
Task.getStatus                         14    3.218    1.528
Host.getAllTasksStatuses               10    3.244    1.673
Task.clear                              9   10.351    8.551
Volume.setDescription                   8   20.725   16.085
StoragePool.connectStorageServer        7   15.606    7.453

storage checkers delay

delays   max      mean
------------------------
1995     44.090   7.919

threaded-logging, write delay 0.4 second
----------------------------------------

We are not affected by the slow storage, and we don't drop any messages.
Slowest verbs

verb                                calls   mean    stdev
-----------------------------------------------------------------
StoragePool.disconnect                  1   4.180   0.000
StoragePool.connect                     4   2.910   0.179
StorageDomain.deactivate                1   2.010   0.000
Host.getDeviceList                      5   1.810   0.151
Lease.create                            1   1.780   0.000
StoragePool.connectStorageServer        4   1.020   0.688
Host.getCapabilities                    1   1.000   0.000
StoragePool.disconnectStorageServer     4   0.995   0.221
StorageDomain.activate                  1   0.970   0.000
Host.getStats                         252   0.888   0.145

Most called verbs

verb                                calls   mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                    506   0.005   0.005
StoragePool.getSpmStatus              365   0.017   0.005
StoragePool.getInfo                   364   0.036   0.092
Host.getStats                         252   0.888   0.145
Host.getAllVmIoTunePolicies           236   0.002   0.004
Task.getStatus                         11   0.001   0.003
Task.clear                              9   0.006   0.005
Volume.setDescription                   8   0.124   0.056
Host.getDeviceList                      5   1.810   0.151
Host.getAllTasksStatuses                5   0.002   0.004

storage checkers delay

(no delays)

log level distribution

Total: 43967
DEBUG: 35550 (80.86%)
INFO:   8323 (18.93%)
WARN:     87 (0.20%)
ERROR:     7 (0.02%)
CRIT:      0 (0.00%)

threaded-logging, write delay 1 second
--------------------------------------

We are not affected by the slow storage, but we drop some DEBUG messages.

Slowest verbs

verb                                calls   mean    stdev
-----------------------------------------------------------------
Host.getStats                         351   3.077   0.739
StoragePool.connect                     4   2.665   0.124
Host.setMOMPolicyParameters             1   2.060   0.000
StorageDomain.deactivate                1   2.010   0.000
Host.getDeviceList                      9   1.694   0.114
Lease.create                            1   1.330   0.000
StoragePool.connectStorageServer        4   1.005   0.711
Host.getCapabilities                    1   0.920   0.000
StorageDomain.getInfo                   2   0.830   0.120
StorageDomain.activate                  1   0.810   0.000

Most called verbs

verb                                calls   mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                    975   0.006   0.005
StoragePool.getSpmStatus              581   0.018   0.012
StoragePool.getInfo                   580   0.030   0.065
Host.getAllVmIoTunePolicies           385   0.002   0.004
Host.getStats                         351   3.077   0.739
Host.getDeviceList                      9   1.694   0.114
Task.clear                              9   0.008   0.004
Volume.setDescription                   8   0.076   0.015
Task.getStatus                          6   0.007   0.005
StoragePool.connectStorageServer        4   1.005   0.711

storage checkers delay

(no delays)

log level distribution

Total: 33808
DEBUG: 20811 (61.56%)
INFO:  12721 (37.63%)
WARN:    177 (0.52%)
ERROR:     0 (0.00%)
CRIT:     99 (0.29%)

$ grep CRIT threaded-logging-slowfs-1.0/vdsm.log | head
2018-03-11 01:24:56,393+0200 CRIT (mailbox-spm) [root] ThreadedHandler stats for last 61 seconds (dropped records: 343, max pending: 1228) (logutils:312)
2018-03-11 01:25:57,002+0200 CRIT (mailbox-spm) [root] ThreadedHandler stats for last 60 seconds (dropped records: 161, max pending: 1215) (logutils:312)
2018-03-11 01:26:57,427+0200 CRIT (jsonrpc/4) [root] ThreadedHandler stats for last 60 seconds (dropped records: 424, max pending: 1228) (logutils:312)
2018-03-11 01:27:57,772+0200 CRIT (jsonrpc/7) [root] ThreadedHandler stats for last 60 seconds (dropped records: 349, max pending: 1210) (logutils:312)
2018-03-11 01:28:57,851+0200 CRIT (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 624, max pending: 1229) (logutils:312)
2018-03-11 01:29:57,884+0200 CRIT (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 274, max pending: 1212) (logutils:312)
2018-03-11 01:30:57,887+0200 CRIT (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 869, max pending: 1300) (logutils:312)
2018-03-11 01:31:58,082+0200 CRIT (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 380, max pending: 1302) (logutils:312)
2018-03-11 01:32:58,116+0200 CRIT (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 256, max pending: 1211) (logutils:312)
2018-03-11 01:33:58,117+0200 CRIT (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 414, max pending: 1219) (logutils:312)

Notes
=====

- Host.getStats is slowed down by slow storage, even with threaded logging. This is caused by accessing mom in this verb, which uses the standard logger that blocks on storage.
- Logging to NFS may cause vdsm to get stuck in D state; this is a very bad idea with the current system. If we want to support this we need to log to a separate process (e.g. via systemd journald).
Created attachment 1406901 [details]
Logs and testing scripts for comment 73
(In reply to Nir Soffer from comment #72)
> Finally found the issue causing delays even with the new threaded logging:
> logging to NFS.
>
> When vdsm logs to NFS, there is an open file descriptor to
> /var/log/vdsm/vdsm.log. This file descriptor is inherited by any child
> process started by vdsm. After the fork(), the new child tries to close all
> file descriptors that should not be inherited. If we slow down NFS storage
> writes, this also slows down close().

Idea for future improvement: do we want to add helper process(es) in charge only of spawning helper programs/external tools, so the main vdsm process never fork()s? The same goes for supervdsm.
(In reply to Francesco Romani from comment #75)
> Idea for future improvement:
> Do we want to add helper process(es) in charge only of spawning helper
> programs/external tools, so the main vdsm process never fork()s?

How do you run the helper process without forking? Maybe use systemd-run?

I think the solution is moving the actual logging to systemd-journald. We will probably have to keep the threaded handler since writing to /dev/log may block.
(In reply to Nir Soffer from comment #76)
> (In reply to Francesco Romani from comment #75)
> > Idea for future improvement:
> > Do we want to add helper process(es) in charge only of spawning helper
> > programs/external tools, so the main vdsm process never fork()s?
>
> How do you run the helper process without forking? Maybe use systemd-run?
>
> I think the solution is moving the actual logging to systemd-journald. We
> will probably have to keep the threaded handler since writing to /dev/log
> may block.

We can just depend on it to have a fully functional host, kind of like we do with mom or supervdsm, and we can communicate with it using json-rpc or any other RPC form.

BTW, from my side using systemd-journald would also be a good idea, and I see no issues in keeping the threaded logger.
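(Editor's note: a hedged sketch of the journald direction being discussed, assuming the python-systemd package is available; the in-process queue from the threaded handler could still sit in front of it, since the socket write may itself block:)

import logging
from systemd import journal   # python-systemd package (assumption)

handler = journal.JournalHandler(SYSLOG_IDENTIFIER="vdsm")
handler.setFormatter(logging.Formatter("%(levelname)s (%(threadName)s) %(message)s"))

log = logging.getLogger("vds")
log.addHandler(handler)
log.setLevel(logging.DEBUG)

# journald owns the disk I/O (and rotation/rate limiting), so the process
# never blocks on /var/log; messages can be retrieved with `journalctl -t vdsm`.
log.info("storage check delayed by %.2f seconds", 1.23)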
Nir, could you please provide verification steps?
(In reply to Pavol Brilla from comment #79)
> Nir, could you please provide verification steps?

Try logging (the /var/log/vdsm log files) to a slow device, for example over NFS over a WAN (or use the 'tc' command to slow it down). See comment 72 and comment 73.
Verified on vdsm-4.20.27.1-1.el7ev.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489