Bug 1516831 - Host fails with Heartbeat periodically
Summary: Host fails with Heartbeat periodically
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.3
Target Release: ---
Assignee: Nir Soffer
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-23 12:13 UTC by Jaroslav Spanko
Modified: 2022-03-13 14:58 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-15 17:52:46 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
Multipath configuration from 4.2 (4.31 KB, text/plain), 2018-01-16 18:13 UTC, Nir Soffer
Logs and testing scripts for comment 73 (3.69 MB, application/x-xz), 2018-03-11 17:33 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1552949 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Issue Tracker RHV-45183 0 None Waiting on Customer Lack of documentation for CephDashboardNetwork 2022-04-19 14:29:51 UTC
Red Hat Product Errata RHEA-2018:1489 0 None None None 2018-05-15 17:53:52 UTC
oVirt gerrit 86805 0 'None' MERGED logUtils: Add threaded logging handler 2020-11-11 11:27:32 UTC
oVirt gerrit 86806 0 'None' MERGED logger.conf: Add threaded logging configuration 2020-11-11 11:27:32 UTC
oVirt gerrit 86834 0 'None' ABANDONED logUtils: Add threaded logging handler 2020-11-11 11:27:32 UTC
oVirt gerrit 87180 0 'None' MERGED logger.conf: Make threaded logging the default 2020-11-11 11:27:31 UTC
oVirt gerrit 87479 0 'None' MERGED logutils: Adaptive log level 2020-11-11 11:27:32 UTC
oVirt gerrit 87639 0 'None' MERGED logutils: Warn about dropping messages 2020-11-11 11:27:33 UTC
oVirt gerrit 88429 0 'None' MERGED logutils: Defer flushing while handling records 2020-11-11 11:27:33 UTC
oVirt gerrit 88751 0 'None' MERGED logutils: Log threaded handler stats 2020-11-11 11:27:34 UTC
oVirt gerrit 89127 0 'None' MERGED tests: Fix flaky time based test 2020-11-11 11:27:33 UTC
oVirt gerrit 89179 0 'None' MERGED logUtils: Add threaded logging handler 2020-11-11 11:27:34 UTC
oVirt gerrit 89180 0 'None' MERGED logutils: Adaptive log level 2020-11-11 11:27:34 UTC
oVirt gerrit 89181 0 'None' MERGED logutils: Warn about dropping messages 2020-11-11 11:27:33 UTC
oVirt gerrit 89182 0 'None' MERGED logutils: Defer flushing while handling records 2020-11-11 11:27:33 UTC
oVirt gerrit 89183 0 'None' MERGED tests: Fix flaky time based test 2020-11-11 11:27:34 UTC
oVirt gerrit 89184 0 'None' MERGED logger.conf: Add threaded logging configuration 2020-11-11 11:27:34 UTC
oVirt gerrit 89185 0 'None' MERGED logger.conf: Make threaded logging the default 2020-11-11 11:27:54 UTC
oVirt gerrit 89677 0 'None' MERGED tests: drop test_slow_handler_sync 2020-11-11 11:27:35 UTC
oVirt gerrit 89731 0 'None' MERGED logutils: Log threaded handler stats 2020-11-11 11:27:35 UTC

Internal Links: 1552949

Description Jaroslav Spanko 2017-11-23 12:13:01 UTC
Description of problem:

We have a customer seeing strange behaviour on one of their hypervisors (Heartbeat exceeded errors):
Nov 22, 2017 5:51:19 AM VDSM test command GetStatsVDS failed: Heartbeat exceeded

This has been happening regularly, on a weekly basis, for the last 2 months.
Once we restart the engine or the host is rebooted, the behaviour is gone for a week; after that the Heartbeat errors are back on a daily basis.

All hypervisors were updated to the latest version during this period; the problem still occurs.

The behaviour reminds me of a similar issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1496517

Version-Release number of selected component (if applicable):
RHV 4.1.6

How reproducible:
100% 

Actual results:
Once the engine is restarted the Heartbeat error is gone; after one week it is back on a daily basis.

Expected results:
Help to identify the root cause.

Additional info:
Will add debug logs from vdsm/engine and a gcore of vdsm in the next update.

Thanks

Comment 2 Piotr Kliczewski 2017-11-23 12:30:27 UTC
The BZ that you mention could potentially cause heartbeat exceeded. Have you tried to increase the heartbeat interval?

Comment 4 Yaniv Kaul 2017-11-23 13:15:12 UTC
Severity?

Comment 5 Jaroslav Spanko 2017-11-23 14:14:50 UTC
As it's happening on a weekly basis without a root cause found so far: High.
Thx

Comment 6 Piotr Kliczewski 2017-11-24 09:02:37 UTC
Usually heartbeat exceeded issues are caused by the host being over-utilized or by a networking issue. Do we have some information about how the failing host is utilized? Have you tried to increase the heartbeat interval?

Comment 7 Marian Jankular 2017-11-24 09:33:42 UTC
Hello,

we did not increase the heartbeat interval; when the engine gets heartbeat exceeded, the host goes to a non-responsive state after 1 second. However, I will ask the customer to set it to 60 seconds to see if it helps. I am also attaching 2 coredump files from the vdsm daemon that were generated when the issue appeared last time.

Comment 10 Piotr Kliczewski 2017-11-27 12:50:53 UTC
(In reply to Marian Jankular from comment #7)
> Hello,
> 
> we did not increase the heartbeat interval; when the engine gets heartbeat
> exceeded, the host goes to a non-responsive state after 1 second. However, I
> will ask the customer to

This is how it works: missed heartbeats cause the non-responsive state.

> set it to 60 seconds to see if it helps. I am also attaching 2 coredump files
> from the vdsm daemon that were generated when the issue appeared last time.

Please let us know about the results of increasing the interval.

Comment 11 Jaroslav Spanko 2017-12-05 14:30:18 UTC
Hi Piotr

The host failed again; the behaviour is the same as before the heartbeat increase.
I'm collecting the logs from the customer.
Thx

Comment 12 Piotr Kliczewski 2017-12-05 15:03:21 UTC
Jaroslav, do you know what the new value of the heartbeat interval was?

Comment 13 Jaroslav Spanko 2017-12-07 12:20:10 UTC
Hi Piotr
We doubled it to 60 sec, but still the same behaviour.
I finally received the logs; I am parsing them and will upload them to the BZ.
Thanks

Comment 20 Nir Soffer 2017-12-11 16:13:37 UTC
(In reply to Jaroslav Spanko from comment #0)
> The behaviour reminds me of a similar issue:
> https://bugzilla.redhat.com/show_bug.cgi?id=1496517

Are you running with the fixed libvirt-python?

Comment 24 Nir Soffer 2017-12-11 21:39:20 UTC
Delayed calls in vdsm.log.10.xz:

2017-12-05 05:06:30,290+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4599966.17 at 0x1d9bad0>> delayed by 0.69 seconds (asyncutils:138)
2017-12-05 05:09:20,479+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4600136.17 at 0x1d9bad0>> delayed by 0.88 seconds (asyncutils:138)
2017-12-05 05:29:10,418+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4601326.17 at 0x1d9bad0>> delayed by 0.82 seconds (asyncutils:138)
2017-12-05 05:29:11,830+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4601327.51 at 0x3606110>> delayed by 0.89 seconds (asyncutils:138)
2017-12-05 05:39:31,540+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4601947.51 at 0x3606110>> delayed by 0.60 seconds (asyncutils:138)
2017-12-05 05:50:31,560+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_iso/f818e07a-977c-404d-8430-799cf652c5af/dom_md/metadata running next_check=4602607.45 at 0x1dac9d0>> delayed by 0.68 seconds (asyncutils:138)
2017-12-05 05:50:36,730+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4602607.51 at 0x3606110>> delayed by 5.79 seconds (asyncutils:138)
2017-12-05 05:50:47,069+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4602616.17 at 0x1d9bad0>> delayed by 7.47 seconds (asyncutils:138)
2017-12-05 05:51:00,940+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata running next_check=4602617.25 at 0x1d9bfd0>> delayed by 20.26 seconds (asyncutils:138)
2017-12-05 05:51:08,240+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_iso/f818e07a-977c-404d-8430-799cf652c5af/dom_md/metadata running next_check=4602617.45 at 0x1dac9d0>> delayed by 27.36 seconds (asyncutils:138)
2017-12-05 05:51:21,099+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4602627.51 at 0x3606110>> delayed by 30.16 seconds (asyncutils:138)
2017-12-05 05:51:35,950+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4602646.17 at 0x1d9bad0>> delayed by 26.35 seconds (asyncutils:138)
2017-12-05 05:51:43,889+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata running next_check=4602647.25 at 0x1d9bfd0>> delayed by 33.21 seconds (asyncutils:138)
2017-12-05 05:51:52,180+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_iso/f818e07a-977c-404d-8430-799cf652c5af/dom_md/metadata running next_check=4602657.45 at 0x1dac9d0>> delayed by 31.30 seconds (asyncutils:138)
2017-12-05 05:52:13,499+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_export/6614bf7a-a345-43e2-9fb9-3958036d7b21/dom_md/metadata running next_check=4602677.51 at 0x3606110>> delayed by 32.56 seconds (asyncutils:138)
2017-12-05 05:52:22,239+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=4602686.17 at 0x1d9bad0>> delayed by 32.64 seconds (asyncutils:138)
2017-12-05 05:52:24,438+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata running next_check=4602687.25 at 0x1d9bfd0>> delayed by 33.76 seconds (asyncutils:138)
2017-12-05 05:52:24,830+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /rhev/data-center/mnt/brubl055.bru-hub.dhl.com:_iso/f818e07a-977c-404d-8430-799cf652c5af/dom_md/metadata running next_check=4602707.45 at 0x1dac9d0>> delayed by 13.95 seconds (asyncutils:138)
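
For readers unfamiliar with these warnings: the check loop schedules each DirectioChecker on a fixed interval and warns when a call starts later than its deadline. A rough sketch of the scheduling idea in Python (my illustration, not vdsm's actual storage.asyncutils code):

    import logging
    import time

    log = logging.getLogger("storage.asyncutils")

    def run_every(check, interval=10.0, grace=0.5):
        # Run check() on a fixed monotonic schedule, warning when a call
        # starts late, as in the vdsm warnings quoted above.
        deadline = time.monotonic() + interval
        while True:
            time.sleep(max(0.0, deadline - time.monotonic()))
            delay = time.monotonic() - deadline
            if delay > grace:
                log.warning("Call %r delayed by %.2f seconds", check, delay)
            check()
            # Skip deadlines that already passed instead of bunching calls;
            # this produces the "missed N deadlines" warning seen elsewhere
            # in the log.
            missed = 0
            deadline += interval
            while deadline <= time.monotonic():
                deadline += interval
                missed += 1
            if missed:
                log.warning("Call %r missed %d deadlines", check, missed)

The key point is that the check loop is a single thread: anything that delays it also delays every checker scheduled after it, which matches the escalating delays across different checkers above.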

Comment 25 Nir Soffer 2017-12-11 21:43:21 UTC
Jaroslav, can you attach the output of rpm -qa to the bug?

Comment 27 Jaroslav Spanko 2017-12-12 10:28:30 UTC
Added, thx

Comment 28 Piotr Kliczewski 2017-12-12 11:14:27 UTC
Nir, please take a look at the output

Comment 29 Nir Soffer 2017-12-12 11:15:36 UTC
Looking at the output of rpm -qa (attachment 1366527 [details]):

    libvirt-python-3.2.0-3.el7_4.1.x86_64  Tue Oct 24 11:19:22 2017

The libvirt-python bug 1497197 was fixed in libvirt-python-3.2.0-3.el7_4.1, so this is
not the same issue.

Comment 30 Piotr Kliczewski 2017-12-13 14:00:54 UTC
How can we diagnose what the issue is?

Comment 43 Piotr Kliczewski 2017-12-22 10:00:59 UTC
I checked the engine log and found:

2017-12-16 16:28:23,253+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [5d8e858e] Command 'GetAllVmStatsVDSCommand(HostName = brubl535, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='15eb866b-c0a9-4ec7-a35b-c3117f6bec9c'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

Around the same time I see blocked tasks in the vdsm log:

2017-12-16 16:28:08,240+0100 DEBUG (vdsm.Scheduler) [Executor] executor state: count=5 workers=set([<Worker name=periodic/234 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x1cb6cd0> at 0x1cb6d10> timeout=1.0, duration=85 at 0x31a3710> discarded task#=33099 at 0x7f528408c210>, <Worker name=periodic/238 waiting task#=10691 at 0x34ea650>, <Worker name=periodic/243 waiting task#=0 at 0x1cea7d0>, <Worker name=periodic/241 waiting task#=8668 at 0x2fe2f10>, <Worker name=periodic/242 waiting task#=8713 at 0x7f53241947d0>]) (executor:205)
2017-12-16 16:28:14,070+0100 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/234 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x1cb6cd0> at 0x1cb6d10> timeout=1.0, duration=85 at 0x31a3710> discarded task#=33099 at 0x7f528408c210> (executor:348)
2017-12-16 16:28:10,089+0100 WARN  (jsonrpc/0) [virt.vm] (vmId='e95c22b3-e3d6-4674-a2d2-35afe377ce02') monitor became unresponsive (command timeout, age=76.4800000004) (vm:5132)
2017-12-16 16:28:10,149+0100 WARN  (jsonrpc/5) [virt.vm] (vmId='3e1d3243-0549-40a0-bb69-7e3645f22652') monitor became unresponsive (command timeout, age=94.71) (vm:5132)
2017-12-16 16:28:11,549+0100 WARN  (jsonrpc/2) [virt.vm] (vmId='e37cb5f9-b1f7-4c66-a783-15a5a05b1ce5') monitor became unresponsive (command timeout, age=93.0100000007) (vm:5132)
2017-12-16 16:28:08,150+0100 DEBUG (periodic/243) [Executor] Worker started (executor:291)
2017-12-16 16:28:15,558+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=5591205.82 at 0x1ceec90>> delayed by 66.31 seconds (asyncutils:138)
2017-12-16 16:28:25,769+0100 WARN  (jsonrpc/2) [virt.vm] (vmId='eb4eccb4-55a8-4ecd-9e0c-2f8ee37cadde') monitor became unresponsive (command timeout, age=102.48) (vm:5132)
2017-12-16 16:28:26,729+0100 INFO  (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:51446 (protocoldetector:127)
2017-12-16 16:28:25,630+0100 WARN  (jsonrpc/0) [virt.vm] (vmId='e37cb5f9-b1f7-4c66-a783-15a5a05b1ce5') monitor became unresponsive (command timeout, age=110.19) (vm:5132)
2017-12-16 16:28:29,910+0100 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/1 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getStats', 'id': u'90c1cd4a-e729-4fa8-b49c-5cbfab3995a1'} at 0x355eb90> timeout=60, duration=126 at 0x355e950> task#=50810 at 0x2eb0ad0> (executor:351)
2017-12-16 16:28:33,509+0100 DEBUG (periodic/234) [Executor] Worker was discarded (executor:298)
2017-12-16 16:28:34,569+0100 WARN  (check/loop) [storage.check] Checker '/dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata' is blocked for 105.06 seconds (check:279)
2017-12-16 16:28:54,819+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=5591315.82 at 0x1ceec90>> missed 10 deadlines, scheduling next call at 5591315.82 (asyncutils:161)
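
For context on the "Worker discarded" lines above: vdsm's executor runs periodic operations on worker threads with a timeout, and when a discardable task overruns its timeout the executor abandons that worker and replaces it instead of blocking the schedule. A minimal sketch of the idea (an illustration, not vdsm's actual executor code):

    import threading
    import time

    def run_discardable(task, timeout):
        # Run task() on a worker thread; if it overruns the timeout,
        # abandon the worker and report it, like "Worker discarded".
        done = threading.Event()

        def worker():
            task()
            done.set()

        start = time.monotonic()
        threading.Thread(target=worker, daemon=True).start()
        if not done.wait(timeout):
            # The blocked thread cannot be killed; it is left behind and
            # a fresh worker takes over the periodic schedule.
            print("worker discarded after %.1f seconds"
                  % (time.monotonic() - start))
            return False
        return True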

Nir please take a look.

Comment 44 Nir Soffer 2017-12-26 11:29:39 UTC
(In reply to Piotr Kliczewski from comment #43)
In your comment we see that we have:

1. blocked drive monitor - libvirt is not responsive - typically caused by
  non-responsive storage.

2017-12-16 16:28:14,070+0100 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/234 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x1cb6cd0> at 0x1cb6d10> timeout=1.0, duration=85 at 0x31a3710> discarded task#=33099 at 0x7f528408c210> (executor:348)

2. blocked checker - this means storage is not responsive

2017-12-16 16:28:34,569+0100 WARN  (check/loop) [storage.check] Checker '/dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata' is blocked for 105.06 seconds (check:279)

But this is block storage; it should never block for 105 seconds with our
multipath configuration.

3. delayed checker - smells like vdsm hang

2017-12-16 16:28:15,558+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running next_check=5591205.82 at 0x1ceec90>> delayed by 66.31 seconds (asyncutils:138)

Jaroslav,

1. Do we have /var/log/messages and sanlock.log from the same time as the
log messages above?
2. Please attach /etc/multipath.conf from this host, ideally from all hosts in this
cluster.

Comment 50 Nir Soffer 2018-01-16 18:09:58 UTC
Jaroslav, can you explain why the multipath conf is using this configuration?

device {
    vendor "3PARdata"
    product "VV"
    path_grouping_policy group_by_prio
    path_selector "round-robin 0"
    path_checker tur 
    features "0" 

I don't think this should be included in a device configuration. The feature
string should be updated automatically by multipath based on other configuration
values.

    hardware_handler "1 alua"
    prio alua
    failback immediate
    rr_weight uniform

    no_path_retry 18

This will retry for 18 polling intervals after all paths have failed before
failing the I/O; with the default 5 second polling interval that is up to 90
seconds. This can cause timeouts in vdsm operations and in engine operations.
These timeouts may lead to failures in storage flows, and I think engine may
treat these failures as network errors.

    rr_min_io_rq 1
    detect_prio yes 
    fast_io_fail_tmo 10

This will delay failing over to another path. The vdsm default is 5.
I recommend using the default vdsm value unless the storage vendor can explain
why 10 is better.

    dev_loss_tmo 14

This looks strange; I don't know if this can cause trouble. Again, the storage
vendor should explain why this value should be used.

}

Based on this configuration, we need to check whether there are timeouts caused by
failure of all paths combined with no_path_retry 18.

In 4.2 we added support for multipath alerts, which will make it much easier to
detect issues in multipath devices, see:
https://ovirt.org/develop/release-management/features/storage/multipath-events/

Moving the bug to storage to complete this investigation.

Comment 51 Nir Soffer 2018-01-16 18:13:23 UTC
Created attachment 1382125 [details]
Multipath configuration from 4.2

Jaroslav, I'm attaching the multipath configuration from 4.2. There is no change in
the configuration, but the file is fully documented. This should help the customer
and the storage vendor tune the multipath conf so it works best with both RHV
and the storage.

Comment 58 Nir Soffer 2018-01-21 23:30:33 UTC
Martin, we have unexplained hangs in vdsm (see vdsm.log.44).

The only thing I found in the log is unexplained hosted engine logs like:

2017-12-16 16:26:40,489+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successful response from socket (brokerlink:256)
2017-12-16 16:26:40,500+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Closing connection to ha-broker (brokerlink:86)
2017-12-16 16:26:40,520+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Connecting to ha-broker (brokerlink:51)
2017-12-16 16:26:40,529+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successfully connected (brokerlink:80)
2017-12-16 16:26:40,529+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Sending request: set-storage-domain FilesystemBackend dom_type=fc sd_uuid=37c4f11e-c643-4263-ab81-7b48164e762d (brokerlink:274)
2017-12-16 16:26:40,529+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] socket_readline with 30.0 seconds timeout (util:80)
2017-12-16 16:26:40,540+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Full response: success 139768642954576 (brokerlink:283)
2017-12-16 16:26:40,549+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successful response from socket (brokerlink:256)
2017-12-16 16:26:40,549+0100 INFO  (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Success, id 139768642954576 (brokerlink:179)
2017-12-16 16:26:40,549+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Sending request: is-host-alive service_type=hosted-engine.metadata (brokerlink:274)
2017-12-16 16:26:40,550+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] socket_readline with 30.0 seconds timeout (util:80)
2017-12-16 16:26:40,556+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Full response: success 337C317C327C367C377C38 (brokerlink:283)
2017-12-16 16:26:40,557+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successful response from socket (brokerlink:256)
2017-12-16 16:26:40,557+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Alive hosts '[3, 1, 2, 6, 7, 8]' (brokerlink:221)
2017-12-16 16:26:40,557+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Is host '1' alive -> 'True' (brokerlink:223)
2017-12-16 16:26:40,558+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.client.client.HAClient] Is host '1' alive? -> 'True' (client:83)
2017-12-16 16:26:40,558+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Closing connection to ha-broker (brokerlink:86)
2017-12-16 16:26:40,558+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Connecting to ha-broker (brokerlink:51)
2017-12-16 16:26:40,558+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Successfully connected (brokerlink:80)
2017-12-16 16:26:40,559+0100 DEBUG (jsonrpc/2) [ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink] Sending request: set-storage-domain FilesystemBackend dom_type=fc sd_uuid=37c4f11e-c643-4263-ab81-7b48164e762d (brokerlink:274)

Are we importing modules from hosted engine inside vdsm? Are we communicating
with ha-broker from within vdsm?

Comment 59 Nir Soffer 2018-01-22 00:16:17 UTC
Ok, I have the first smoking gun:

Someone is calling Host.getHardwareInfo 3 times in the same second:

2017-12-16 16:27:05,989+0100 DEBUG (jsonrpc/6) [jsonrpc.JsonRpcServer] Calling 'Host.getHardwareInfo' in bridge with {} (__init__:556)
2017-12-16 16:27:06,174+0100 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmStats' in bridge with {} (__init__:556)
2017-12-16 16:27:06,175+0100 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Calling 'Host.getHardwareInfo' in bridge with {} (__init__:556)
2017-12-16 16:27:06,176+0100 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Calling 'Host.getHardwareInfo' in bridge with {} (__init__:556)
2017-12-16 16:27:06,177+0100 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'Host.getHardwareInfo' in bridge with {} (__init__:556)

Here a monitor complains about a missed deadline. The timestamp
of the log message is 3 seconds before the Host.getHardwareInfo
calls, yet it appears after them in the log! I don't have any idea how this can happen.

2017-12-16 16:27:03,070+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata running 
next_check=5591205.82 at 0x1ceec90>> missed 1 deadlines, scheduling next call at 5591205.82 (asyncutils:161)


2017-12-16 16:27:13,550+0100 INFO  (monitor/37c4f11) [storage.LVM] Refreshing lvs: vg=37c4f11e-c643-4263-ab81-7b48164e762d lvs=['metadata'] (lvm:1291)

The monitor complains about a 24.89 second delay - this probably means
vdsm was hung.

2017-12-16 16:27:14,820+0100 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata running next_check=5591186.50 at 0x3406890>> delayed by 24.89 seconds (asyncutils:138)

We get a response from Host.getHardwareInfo...

2017-12-16 16:27:14,850+0100 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Return 'Host.getHardwareInfo' in bridge with {'systemProductName': 'ProLiant BL460c Gen9', 'systemSerialNumber': 'CZ262705X0', 'systemFamily': 'ProLiant', 'systemVersion': '', 'systemUUID': '31333138-3839-5A43-3236-323730355830', 'systemManufacturer': 'HP'} (__init__:582)
2017-12-16 16:27:15,630+0100 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Return 'Host.getHardwareInfo' in bridge with {'systemProductName': 'ProLiant BL460c Gen9', 'systemSerialNumber': 'CZ262705X0', 'systemFamily': 'ProLiant', 'systemVersion': '', 'systemUUID': '31333138-3839-5A43-3236-323730355830', 'systemManufacturer': 'HP'} (__init__:582)
2017-12-16 16:27:18,850+0100 DEBUG (check/loop) [storage.check] START check '/dev/37c4f11e-c643-4263-ab81-7b48164e762d/metadata' (delay=28.92) (check:283)
2017-12-16 16:27:19,729+0100 INFO  (monitor/37c4f11) [storage.LVM] Refreshing LVs (vg=37c4f11e-c643-4263-ab81-7b48164e762d, lvs=['metadata']) (lvm:1319)
2017-12-16 16:27:20,079+0100 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Return 'Host.getHardwareInfo' in bridge with {'systemProductName': 'ProLiant BL460c Gen9', 'systemSerialNumber': 'CZ262705X0', 'systemFamily': 'ProLiant', 'systemVersion': '', 'systemUUID': '31333138-3839-5A43-3236-323730355830', 'systemManufacturer': 'HP'} (__init__:582)
2017-12-16 16:27:26,660+0100 DEBUG (jsonrpc/6) [jsonrpc.JsonRpcServer] Return 'Host.getHardwareInfo' in bridge with {'systemProductName': 'ProLiant BL460c Gen9', 'systemSerialNumber': 'CZ262705X0', 'systemFamily': 'ProLiant', 'systemVersion': '', 'systemUUID': '31333138-3839-5A43-3236-323730355830', 'systemManufacturer': 'HP'} (__init__:582)


The Host.getHardwareInfo calls took 20-41 seconds.

2017-12-16 16:27:27,670+0100 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 21.41 seconds (__init__:539)
2017-12-16 16:27:47,379+0100 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 41.20 seconds (__init__:539)
2017-12-16 16:27:26,120+0100 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 19.94 seconds (__init__:539)

Again we see crazy time values in the logs. How can a log line from
2017-12-16 16:27:47 come before one from 2017-12-16 16:27:26?

getHardwareInfo calls dmidecode - a C extension. If this extension
is not releasing the GIL properly, it can hang the Python process
using it.

Next step: check if dmidecode is releasing the GIL properly.
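
The GIL theory is easy to try out with ctypes, which offers both calling conventions: CDLL releases the GIL around a foreign call, PyDLL does not. A hypothetical demonstration using libc's usleep as a stand-in for a slow dmidecode call (not the actual vdsm code path):

    import ctypes
    import threading
    import time

    libc_nogil = ctypes.CDLL("libc.so.6")   # releases the GIL during calls
    libc_gil = ctypes.PyDLL("libc.so.6")    # holds the GIL during calls

    def heartbeat():
        # Stands in for any other vdsm thread, e.g. the check/loop thread.
        for _ in range(10):
            print("heartbeat at", time.strftime("%H:%M:%S"))
            time.sleep(0.5)

    threading.Thread(target=heartbeat).start()

    libc_nogil.usleep(2000000)  # GIL released: heartbeats keep ticking
    libc_gil.usleep(2000000)    # GIL held: heartbeats freeze for 2 seconds,
                                # the way vdsm would freeze inside a C
                                # extension that does not release the GIL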

Comment 60 Nir Soffer 2018-01-22 00:56:28 UTC
Jaroslav, sanlock.log from attachment 1375600 [details] ends at 2017-12-13, but the issue
is seen at 2017-12-16. It is possible that there were no errors in sanlock.log
during this time, but it is unlikely, since we see this in vdsm:

2017-12-16 16:28:34,569+0100 WARN  (check/loop) [storage.check] Checker '/dev/e1bc39a3-69d2-4693-a6ae-d7cdf4331eba/metadata' is blocked for 105.06 seconds (check:279)

If a checker is blocked for 105 seconds, sanlock is likely to have issues with 
the block storage domain e1bc39a3-69d2-4693-a6ae-d7cdf4331eba.

Maybe sanlock.log is not from the correct host, or maybe the part from the same
time frame is missing? Maybe it is in another sosreport?

Comment 61 Martin Sivák 2018-01-22 08:41:24 UTC
Nir: the answers to all your questions are yes, we are

- vdsm talks to broker using our brokerlink protocol
- the library for that is part of hosted engine

Moreover, older hosted engine tools used getHardwareInfo, but the logs indicate there are three different connections and hosted engine should be just one of those.

We removed that 7 months ago and use the new jsonrpc retry mechanisms now that those are actually working.

I suggest upgrading hosted engine packages to at least what we have in 4.1.8 and see what happens.

Comment 63 Nir Soffer 2018-01-23 20:43:15 UTC
The issue seems to be slow writes to /var/log/vdsm/vdsm.log.

We can see in these logs that the log timestamps do not match the log order:

2017-12-16 16:27:27,670+0100 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 21.41 seconds (__init__:539)

2017-12-16 16:27:47,379+0100 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 41.20 seconds (__init__:539)

2017-12-16 16:27:26,120+0100 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 19.94 seconds (__init__:539)

The second message was logged at 16:27:47, but it was written to the log before the
third message, which was logged at 16:27:26. We have many instances of this issue in the logs.

How can this happen? When logging a message, the logging system:

1. creates a log record, keeping the time of the log message
2. passes the log record to all configured handlers
3. the handler takes the handler lock
4. the handler writes the log line to the file
5. the handler releases the handler lock

If writing the log message to the log file is slow, we can have multiple threads
waiting on the handler lock. When the current thread releases the lock, one of
the waiting threads will wake up and take the lock. If this thread is trying to
log a newer log record, log records will be logged out of order.

Since all vdsm threads log to the same handler, even a small delay writing
to the log will cause a huge delay in all vdsm threads. The effect is much
bigger when using the DEBUG log level.
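
A standalone sketch of that mechanism, assuming nothing about vdsm itself: a handler whose emit() sleeps while holding the handler lock, with several threads logging through it. The asctime values can come out of order exactly as in the vdsm logs:

    import logging
    import threading
    import time

    class SlowHandler(logging.StreamHandler):
        def emit(self, record):
            time.sleep(1)  # simulate a slow write to vdsm.log; the
            super().emit(record)  # handler lock is held during the sleep

    logging.basicConfig(
        handlers=[SlowHandler()],
        level=logging.INFO,
        format="%(asctime)s %(threadName)s %(message)s",
    )

    def worker():
        for i in range(3):
            logging.info("message %d", i)

    threads = [threading.Thread(target=worker) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Records are timestamped when created, but written whenever their
    # thread happens to win the handler lock, so asctime values can
    # appear out of order.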

I could easily reproduce this issue by adding a 1 second sleep in the vdsm logfile
handler. Here is a fragment of the log showing out-of-order log lines, and a storage
checker delayed by many seconds:


2018-01-23 21:36:06,602+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 7.04 seconds (__init__:573)

2018-01-23 21:36:03,603+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/373e8c55-283f-41d4-8433-95c1ef1bbd1a/metadata running next_check=4711309.72 at 0x4350b90>> missed 5 deadlines, scheduling next call at 4711309.72 (asyncutils:161)

2018-01-23 21:36:08,605+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/metadata running next_check=4711249.75 at 0x408ae50>> delayed by 59.27 seconds (asyncutils:138)

2018-01-23 21:36:04,606+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.hostdevListByCaps succeeded in 19.04 seconds (__init__:573)


I also reproduced a "Heartbeat exceeded" error in engine side:

2018-01-23 15:26:53,365-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler2) [64a6a831] Command 'GetAllVmStatsVDSCommand(HostName = voodoo6, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='d431535f-bfca-4a58-8e19-df8db015d072'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded


I guess the delayed heartbeat was caused by the heartbeat thread trying to log,
assuming that logging cannot block.


Jaroslav, can you check if /var/log/vdsm is located on an NFS mount, or if the
host is booted from SAN?

Maybe the weekly incident is triggered by a weekly backup procedure, overloading
the I/O on the host or the network connection to the /var/log mount?

We recommend having /var/log/vdsm on a fast local disk to avoid such issues.

For 4.2 we can fix this issue by writing log messages to storage in another
thread, and by queuing log messages to mitigate temporary slowdowns in writing
logs to storage.
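
The standard library already has the building blocks for that design. vdsm's actual fix is a custom ThreadedHandler (see the gerrit patches linked above), but a minimal sketch of the same idea using logging.handlers.QueueHandler and QueueListener:

    import logging
    import logging.handlers
    import queue

    # Unbounded queue here; vdsm's handler bounds the queue and starts
    # dropping DEBUG records when the writer cannot keep up (comment 73).
    q = queue.Queue()

    file_handler = logging.FileHandler("vdsm.log")  # the possibly slow sink
    listener = logging.handlers.QueueListener(q, file_handler)
    listener.start()

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    root.addHandler(logging.handlers.QueueHandler(q))

    # Callers now pay only for an enqueue; the listener thread alone
    # blocks on the slow storage.
    root.info("this returns immediately even if the log device is slow")
    listener.stop()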

Comment 70 Nir Soffer 2018-02-21 16:39:06 UTC
I did initial testing of the attached patches, and vdsm can now work correctly
with storage that takes 1 second for each write. So it seems that the new logger
works as expected.

However, vdsm verbs are still affected by a slow /var/log/vdsm mount, although they
should not be. For example, this normally takes 5 seconds on my test system:

    vdsm-client Host getDeviceList

But if we slow down access to /var/log/vdsm/ so each write takes 1 second, this
now takes 55 seconds.

If we change the supervdsmd log level to ERROR - so we have no logs from supervdsmd -
the same call takes 8 seconds.

When using a 1 second write delay, we can also see storage checkers delayed by a few
seconds (1-5). Since logging cannot block now, this looks like a hang of the
entire Python process for a few seconds.

If we slow down access to /var/log/messages so a write takes 30 seconds, we get
heartbeat exceeded issues and vdsm is soft-fenced.

So there is another issue that we don't understand yet. It may be some syscall
wrapper in a Python C extension that is not releasing the GIL. We had such issues
in the past.

Comment 72 Nir Soffer 2018-03-11 16:33:07 UTC
Finally found the issue causing delays even with the new threaded logging:
logging to NFS.

When vdsm logs to NFS, there is an open file descriptor to /var/log/vdsm/vdsm.log.
This file descriptor is inherited by any child process started by vdsm. After the
fork(), the new child tries to close all file descriptors that should not be
inherited by the child. If we slow down NFS storage writes, we also slow down
close().
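
A small experiment shows the effect without vdsm; it assumes a deliberately slowed filesystem (as in the slowfs tests in comment 73) is mounted at the hypothetical path used below:

    import os
    import time

    # Assumption: "slow/log" lives on the slowed NFS/slowfs mount.
    fd = os.open("slow/log", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    os.write(fd, b"buffered log data\n")

    pid = os.fork()
    if pid == 0:
        # Child: subprocess does the same for every inherited fd before
        # exec(); on NFS, close() may have to flush, and therefore block.
        start = time.monotonic()
        os.close(fd)
        print("child close() took %.3f seconds" % (time.monotonic() - start))
        os._exit(0)
    os.waitpid(pid, 0)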

Here is an example strace from supervdsmd when using a write delay of 0.1 second:

[pid 20135] 02:56:18.070321 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7efdd3dd29d0) = 27265 <0.000792>
strace: Process 27265 attached
[pid 20135] 02:56:18.071283 close(34)   = 0 <0.000030>
[pid 27265] 02:56:18.071327 set_robust_list(0x7efdd3dd29e0, 24) = 0 <0.000031>
[pid 20135] 02:56:18.071409 close(27)   = 0 <0.000027>
[pid 27265] 02:56:18.071464 close(28 <unfinished ...>
[pid 20135] 02:56:18.071487 close(30 <unfinished ...>
[pid 27265] 02:56:18.071516 <... close resumed> ) = 0 <0.000033>
[pid 20135] 02:56:18.071529 <... close resumed> ) = 0 <0.000036>
[pid 27265] 02:56:18.071560 close(29 <unfinished ...>
[pid 20135] 02:56:18.071571 close(32)   = 0 <0.000019>
[pid 27265] 02:56:18.071617 <... close resumed> ) = 0 <0.000049>
[pid 27265] 02:56:18.071646 close(31)   = 0 <0.000016>
[pid 20135] 02:56:18.071675 read(33,  <unfinished ...>
[pid 27265] 02:56:18.071698 close(33)   = 0 <0.000012>
[pid 27265] 02:56:18.071735 dup2(27, 0) = 0 <0.000013>
[pid 27265] 02:56:18.071771 dup2(30, 1) = 1 <0.000012>
[pid 27265] 02:56:18.071806 dup2(32, 2) = 2 <0.000012>
[pid 27265] 02:56:18.071841 close(27)   = 0 <0.000012>
[pid 27265] 02:56:18.071888 close(30)   = 0 <0.000021>
[pid 27265] 02:56:18.071951 close(32)   = 0 <0.000021>
[pid 27265] 02:56:18.072026 open("/proc/self/fd", O_RDONLY|O_CLOEXEC) = 14 <0.000046>
[pid 27265] 02:56:18.072130 getdents64(14, /* 11 entries */, 280) = 264 <0.000037>
[pid 27265] 02:56:18.072234 close(3)    = 0 <0.000024>
[pid 27265] 02:56:18.072302 close(4)    = 0 <0.116676>

    Note: this close() took over 100,000 usec; all the other close() calls took tens of usec.

[pid 27265] 02:56:18.189074 close(5)    = 0 <0.000026>
[pid 27265] 02:56:18.189166 close(6)    = 0 <0.000023>
[pid 27265] 02:56:18.189263 close(7)    = 0 <0.000024>
[pid 27265] 02:56:18.189370 close(8)    = 0 <0.000030>
[pid 27265] 02:56:18.189454 getdents64(14, /* 11 entries */, 280) = 264 <0.000067>
[pid 27265] 02:56:18.189584 close(9)    = 0 <0.000024>
[pid 27265] 02:56:18.189709 close(10)   = 0 <0.000043>
[pid 27265] 02:56:18.189805 close(11)   = 0 <0.000032>
[pid 27265] 02:56:18.189896 close(12)   = 0 <0.000026>
[pid 27265] 02:56:18.189975 close(13)   = 0 <0.000025>
[pid 27265] 02:56:18.190079 close(15)   = 0 <0.000044>
[pid 27265] 02:56:18.190198 close(16)   = 0 <0.000023>
[pid 27265] 02:56:18.190271 close(17)   = 0 <0.000029>
[pid 27265] 02:56:18.190351 close(18)   = 0 <0.000035>
[pid 27265] 02:56:18.190442 close(19)   = 0 <0.000029>
[pid 27265] 02:56:18.190526 getdents64(14, /* 8 entries */, 280) = 192 <0.000063>
[pid 27265] 02:56:18.190659 close(20)   = 0 <0.000025>
[pid 27265] 02:56:18.190743 close(21)   = 0 <0.000029>
[pid 27265] 02:56:18.190827 close(22)   = 0 <0.000029>
[pid 27265] 02:56:18.190911 close(23)   = 0 <0.000029>
[pid 27265] 02:56:18.191005 close(24)   = 0 <0.000030>
[pid 27265] 02:56:18.191093 close(25)   = 0 <0.000029>
[pid 27265] 02:56:18.191178 close(26)   = 0 <0.000029>
[pid 27265] 02:56:18.191261 getdents64(14, /* 0 entries */, 280) = 0 <0.000030>
[pid 27265] 02:56:18.191346 close(14)   = 0 <0.000031>
[pid 27265] 02:56:18.191433 execve("/usr/bin/taskset", ["/usr/bin/taskset", "--cpu-list", "0-7", "/usr/lib/udev/scsi_id", "--page=0x80", "--whitelisted", "--export", "--replace-whitespace", "--device=/dev/dm-44"], [/* 7 vars */]strace: Process 27265 detached
 <detached ...>

This issue slows down any vdsm operation that runs a child process. One example is
getDeviceList, which runs /usr/lib/udev/scsi_id for every multipath device.

Another example is the storage checkers, which run a dd process to read from storage.
Since all storage checkers start dd in the same thread, a delay in starting dd will
delay the next checker.

Comment 73 Nir Soffer 2018-03-11 17:21:28 UTC
Here are testing results with various storage backends for /var/log/vdsm and
various write delays, simulating slow logging.

To simulate slow storage, I used:

- slowfs over nfs (slowfs-nfs) - create a slowfs mount on another server, and export
  the mount over NFS. On the host, mount the slowfs share at /var/log/vdsm.

- local slowfs (slowfs) - create a slowfs mount at /var/log/vdsm.
  note: requires latest slowfs from https://github.com/nirs/slowfs.
  To mount local slowfs you need to use:

      % slowfs --allow-other /var/log/vdsm.real /var/log/vdsm

In both cases I simulate slow storage by adding a write delay:

    $ slowfsctl set write N

All tests run with log level DEBUG for root, vds, storage, and virt, set in 
/etc/vdsm/logger.conf.

The test includes:
- activate host (spm)
- start 2 ha vms with a lease
- create new ha vm with a lease from template
- remove the new vm
- deactivate unused storage domain
- activate the storage domain again
- open the manage domain dialog for the iscsi storage domain
  (invokes getDeviceList() 4 times)
- let the system run for 1 hour
- stop the vms
- deactivate host

I show 2 metrics:
- Vdsm verb timings - these affect user experience on the engine side (a sketch
  of how they can be extracted from the logs follows below)
- Delays in storage checkers - these may cause a host to become non-operational
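
The verb timing tables below can be produced by a small parser over the "RPC call ... succeeded in N seconds" log lines. A rough reconstruction of such a script (not the actual test script from the attachment), reading vdsm.log on stdin:

    import re
    import statistics
    import sys
    from collections import defaultdict

    # Matches lines like:
    #   ... RPC call Host.getDeviceList succeeded in 8.69 seconds (__init__:539)
    PATTERN = re.compile(r"RPC call (\S+) succeeded in ([\d.]+) seconds")

    timings = defaultdict(list)
    for line in sys.stdin:
        m = PATTERN.search(line)
        if m:
            timings[m.group(1)].append(float(m.group(2)))

    print("%-40s %6s %8s %8s" % ("verb", "calls", "mean", "stdev"))
    for verb, values in sorted(timings.items(),
                               key=lambda kv: -statistics.mean(kv[1])):
        # pstdev, so a verb called once gets stdev 0.000 as in the tables
        print("%-40s %6d %8.3f %8.3f" % (verb, len(values),
                                         statistics.mean(values),
                                         statistics.pstdev(values)))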


slowfs over nfs
===============


master, write delay 0.1 second
------------------------------

With this setup the system is barely usable, and the host usually gets fenced
due to heartbeat exceeded during activation.

Slowest verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Image.prepare                                 1   71.220    0.000
StoragePool.connect                           4   62.780   16.305
Host.getDeviceList                            5   49.806    5.775
StorageDomain.activate                        1   41.870    0.000
Volume.setDescription                         8   32.252   14.932
VM.create                                     2   28.270    0.360
StorageDomain.deactivate                      1   26.650    0.000
Volume.create                                 1   26.390    0.000
StoragePool.disconnect                        1   24.700    0.000
StorageDomain.getInfo                         2   23.830   13.320

Most called verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                          514    3.280    2.714
Host.getStats                               152   11.329    5.081
StoragePool.getSpmStatus                    124    6.332    5.191
StoragePool.getInfo                         123    8.124    9.277
Host.getAllVmIoTunePolicies                 121    3.061    2.586
Task.getStatus                               53    4.098    2.784
Host.getAllTasksStatuses                     12    3.896    1.265
StoragePool.connectStorageServer             11   17.737   12.729
Task.clear                                   10   14.075   11.347
Volume.setDescription                         8   32.252   14.932

storage checkers delay
delays      max     mean
------------------------
  1469   64.640   17.884


threaded-logging, write delay 0.1 second
----------------------------------------

With threaded logging we are not affected by slow logging, but we experience a
slowdown because of slow close() when starting child processes; see comment 72.

Slowest verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Host.getDeviceList                            5    8.692    0.741
StoragePool.connect                           4    4.912    0.845
StoragePool.disconnect                        1    3.790    0.000
StorageDomain.deactivate                      1    3.010    0.000
StorageDomain.getInfo                         2    2.015    0.635
StoragePool.disconnectStorageServer           4    1.965    0.524
StoragePool.connectStorageServer              4    1.795    1.003
StorageDomain.activate                        1    1.640    0.000
Host.getCapabilities                          1    1.340    0.000
Volume.getQemuImageInfo                       1    0.880    0.000

Most called verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                          576    0.006    0.005
StoragePool.getSpmStatus                    416    0.021    0.068
StoragePool.getInfo                         415    0.055    0.207
Host.getStats                               288    0.188    0.064
Host.getAllVmIoTunePolicies                 277    0.001    0.003
Task.getStatus                               23    0.003    0.005
Task.clear                                    9    0.009    0.003
Volume.setDescription                         8    0.118    0.052
Host.getAllTasksStatuses                      6    0.003    0.005
Host.getDeviceList                            5    8.692    0.741

storage checkers delay
delays      max     mean
------------------------
     2    0.540    0.525


threaded-logging, no write delay
--------------------------------

We can compare this with the 0.1 second write delay to see how the delay affects
timings.

Slowest verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
StoragePool.disconnect                        1    3.820    0.000
StoragePool.connect                           4    2.913    0.188
Host.getDeviceList                            5    2.316    0.142
StorageDomain.deactivate                      1    2.010    0.000
StoragePool.connectStorageServer              4    1.163    0.771
StoragePool.disconnectStorageServer           4    1.098    0.209
StorageDomain.activate                        1    0.960    0.000
Host.getCapabilities                          1    0.940    0.000
StorageDomain.getInfo                         2    0.770    0.010
Volume.create                                 1    0.460    0.000

Most called verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                          511    0.005    0.005
StoragePool.getSpmStatus                    336    0.019    0.012
StoragePool.getInfo                         335    0.038    0.112
Host.getStats                               232    0.082    0.048
Host.getAllVmIoTunePolicies                 223    0.001    0.003
Task.getStatus                               14    0.001    0.003
Task.clear                                    9    0.009    0.006
Volume.setDescription                         8    0.120    0.029
Host.getAllTasksStatuses                      6    0.002    0.004
Host.getDeviceList                            5    2.316    0.142

storage checkers delay
delays      max     mean
------------------------
     3    0.580    0.543


local slowfs
============

master, write delay 0.1 second
------------------------------

The results are better than slowfs over nfs, but the system is still barely usable.

Slowest verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
StorageDomain.activate                        1   62.380    0.000
StoragePool.connect                           5   54.154   10.357
Host.getDeviceList                            4   42.935    2.853
StoragePool.spmStart                          1   32.680    0.000
Volume.setDescription                         8   20.725   16.085
Image.prepare                                 1   20.560    0.000
StoragePool.disconnect                        1   18.330    0.000
Image.delete                                  1   18.100    0.000
StorageDomain.deactivate                      1   16.480    0.000
StoragePool.connectStorageServer              7   15.606    7.453

Most called verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                          524    2.034    2.231
Host.getAllVmIoTunePolicies                 158    1.863    1.573
Host.getStats                               157    7.596    3.105
StoragePool.getSpmStatus                    155    3.843    3.760
StoragePool.getInfo                         154    5.554    4.184
Task.getStatus                               14    3.218    1.528
Host.getAllTasksStatuses                     10    3.244    1.673
Task.clear                                    9   10.351    8.551
Volume.setDescription                         8   20.725   16.085
StoragePool.connectStorageServer              7   15.606    7.453

storage checkers delay
delays      max     mean
------------------------
  1995   44.090    7.919


threaded-logging, write delay 0.4 second
----------------------------------------

We are not affected by the slow storage, and we don't drop any messages.

Slowest verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
StoragePool.disconnect                        1    4.180    0.000
StoragePool.connect                           4    2.910    0.179
StorageDomain.deactivate                      1    2.010    0.000
Host.getDeviceList                            5    1.810    0.151
Lease.create                                  1    1.780    0.000
StoragePool.connectStorageServer              4    1.020    0.688
Host.getCapabilities                          1    1.000    0.000
StoragePool.disconnectStorageServer           4    0.995    0.221
StorageDomain.activate                        1    0.970    0.000
Host.getStats                               252    0.888    0.145

Most called verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                          506    0.005    0.005
StoragePool.getSpmStatus                    365    0.017    0.005
StoragePool.getInfo                         364    0.036    0.092
Host.getStats                               252    0.888    0.145
Host.getAllVmIoTunePolicies                 236    0.002    0.004
Task.getStatus                               11    0.001    0.003
Task.clear                                    9    0.006    0.005
Volume.setDescription                         8    0.124    0.056
Host.getDeviceList                            5    1.810    0.151
Host.getAllTasksStatuses                      5    0.002    0.004

storage checkers delay
(no delays)

log level distribution
Total: 43967
DEBUG: 35550 (80.86%)
 INFO: 8323 (18.93%)
 WARN: 87 (0.20%)
ERROR: 7 (0.02%)
 CRIT: 0 (0.00%)


threaded-logging, write delay 1 second
--------------------------------------

We are not affected by the slow storage, but we drop some DEBUG
messages.

Slowest verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Host.getStats                               351    3.077    0.739
StoragePool.connect                           4    2.665    0.124
Host.setMOMPolicyParameters                   1    2.060    0.000
StorageDomain.deactivate                      1    2.010    0.000
Host.getDeviceList                            9    1.694    0.114
Lease.create                                  1    1.330    0.000
StoragePool.connectStorageServer              4    1.005    0.711
Host.getCapabilities                          1    0.920    0.000
StorageDomain.getInfo                         2    0.830    0.120
StorageDomain.activate                        1    0.810    0.000

Most called verbs
verb                                      calls     mean    stdev
-----------------------------------------------------------------
Host.getAllVmStats                          975    0.006    0.005
StoragePool.getSpmStatus                    581    0.018    0.012
StoragePool.getInfo                         580    0.030    0.065
Host.getAllVmIoTunePolicies                 385    0.002    0.004
Host.getStats                               351    3.077    0.739
Host.getDeviceList                            9    1.694    0.114
Task.clear                                    9    0.008    0.004
Volume.setDescription                         8    0.076    0.015
Task.getStatus                                6    0.007    0.005
StoragePool.connectStorageServer              4    1.005    0.711

storage checkers delay
(no delays)

log level distribution
Total: 33808
DEBUG: 20811 (61.56%)
 INFO: 12721 (37.63%)
 WARN: 177 (0.52%)
ERROR: 0 (0.00%)
 CRIT: 99 (0.29%)

$ grep CRIT threaded-logging-slowfs-1.0/vdsm.log | head
2018-03-11 01:24:56,393+0200 CRIT  (mailbox-spm) [root] ThreadedHandler stats for last 61 seconds (dropped records: 343, max pending: 1228) (logutils:312)
2018-03-11 01:25:57,002+0200 CRIT  (mailbox-spm) [root] ThreadedHandler stats for last 60 seconds (dropped records: 161, max pending: 1215) (logutils:312)
2018-03-11 01:26:57,427+0200 CRIT  (jsonrpc/4) [root] ThreadedHandler stats for last 60 seconds (dropped records: 424, max pending: 1228) (logutils:312)
2018-03-11 01:27:57,772+0200 CRIT  (jsonrpc/7) [root] ThreadedHandler stats for last 60 seconds (dropped records: 349, max pending: 1210) (logutils:312)
2018-03-11 01:28:57,851+0200 CRIT  (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 624, max pending: 1229) (logutils:312)
2018-03-11 01:29:57,884+0200 CRIT  (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 274, max pending: 1212) (logutils:312)
2018-03-11 01:30:57,887+0200 CRIT  (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 869, max pending: 1300) (logutils:312)
2018-03-11 01:31:58,082+0200 CRIT  (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 380, max pending: 1302) (logutils:312)
2018-03-11 01:32:58,116+0200 CRIT  (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 256, max pending: 1211) (logutils:312)
2018-03-11 01:33:58,117+0200 CRIT  (check/loop) [root] ThreadedHandler stats for last 60 seconds (dropped records: 414, max pending: 1219) (logutils:312)


Notes
=====

- Host.getStats is slowed down by slow storage, even with threaded logging.
  This is caused by accessing mom in this verb, which uses the standard
  logger that blocks on slow storage.

- Logging to NFS may cause vdsm to get stuck in D state; this is a very
  bad idea with the current system. If we want to support this we need
  to log via a separate process (e.g. via systemd-journald).

Comment 74 Nir Soffer 2018-03-11 17:33:59 UTC
Created attachment 1406901 [details]
Logs and testing scripts for comment 73

Comment 75 Francesco Romani 2018-03-12 09:10:07 UTC
(In reply to Nir Soffer from comment #72)
> Finally found the issue causing delays even with the new threaded logging:
> logging to NFS.
> 
> When vdsm logs to NFS, there is an open file descriptor to
> /var/log/vdsm/vdsm.log. This file descriptor is inherited by any child
> process started by vdsm. After the fork(), the new child tries to close all
> file descriptors that should not be inherited by the child. If we slow down
> NFS storage writes, we also slow down close().

Idea for future improvement:
Do we want to add helper process(es) in charge of only spawning helper programs/external tools, so the main vdsm process never fork()s? The same for supervdsm.

Comment 76 Nir Soffer 2018-03-12 10:30:41 UTC
(In reply to Francesco Romani from comment #75)
> Idea for future improvement:
> Do we want to add helper process(es) in charge of only spawning helper
> programs/external tools, so the main vdsm process never fork()s?
How do you run the helper process without forking? Maybe use systemd-run?

I think the solution is moving actual logging to systemd-journald. We will probably
have to keep the threaded handler since writing to /dev/log may block.

Comment 77 Francesco Romani 2018-03-12 10:45:27 UTC
(In reply to Nir Soffer from comment #76)
> (In reply to Francesco Romani from comment #75)
> > Idea for future improvement:
> > Do we want to add helper process(es) in charge of only spawning helper
> > programs/external tools, so the main vdsm process never fork()s?
> How do you run the helper process without forking? Maybe use systemd-run?
> 
> I think the solution is moving actual logging to systemd-journald. We will
> probably have to keep the threaded handler since writing to /dev/log may
> block.

We can just depend on it to have a fully functional host, kinda like we do with mom or supervdsm. And we can communicate with it using json-rpc or any other RPC form.

BTW, from my side using systemd-journald could also be a good idea, and I see no issues in keeping the threaded logger.
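
A sketch of what that journald direction could look like, assuming the python-systemd binding is available, and keeping a queue in front since even journal submission may block:

    import logging
    import logging.handlers
    import queue

    from systemd.journal import JournalHandler  # python-systemd binding

    q = queue.Queue()
    listener = logging.handlers.QueueListener(
        q, JournalHandler(SYSLOG_IDENTIFIER="vdsm"))
    listener.start()

    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(logging.handlers.QueueHandler(q))

    # Records reach journald from the listener thread, so a blocking
    # journal socket never stalls the caller's own threads.
    root.info("log record delivered to journald off the caller's thread")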

Comment 79 Pavol Brilla 2018-05-04 10:53:22 UTC
Nir, could you please provide verification steps?

Comment 80 Yaniv Kaul 2018-05-13 09:01:38 UTC
(In reply to Pavol Brilla from comment #79)
> Nir could you please provide verification steps?

Try logging (/var/log/vdsm log files) to a slow device, for example over NFS over a WAN (or slowed down with the 'tc' command).

(see comment 72 and comment 73 )

Comment 81 Petr Matyáš 2018-05-14 11:38:56 UTC
Verified on vdsm-4.20.27.1-1.el7ev.x86_64

Comment 84 errata-xmlrpc 2018-05-15 17:52:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 85 Franta Kust 2019-05-16 13:04:26 UTC
BZ<2>Jira Resync

