Bug 1337914 - Hosted engine issues too many lvm operations
Summary: Hosted engine issues too many lvm operations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.6.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Denis Chaplygin
QA Contact: Artyom
URL:
Whiteboard:
Depends On: 1393902
Blocks: 1458745 1479768
 
Reported: 2016-05-20 12:38 UTC by Roman Hodain
Modified: 2021-06-10 11:30 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This update ensures that expensive lvm operations in the Self-hosted Engine monitoring loop are only executed during the start-up or recovery phases. If the storage is not in a valid state, the agent will now be restarted by VDSM.
Clone Of:
Clones: 1479768
Environment:
Last Closed: 2018-05-15 17:32:29 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3191792 0 None None None 2017-10-24 05:55:31 UTC
Red Hat Product Errata RHBA-2018:1472 0 None None None 2018-05-15 17:33:55 UTC
oVirt gerrit 80454 0 master MERGED storage: Added call to repoStats before storage recovery 2020-05-14 10:24:58 UTC
oVirt gerrit 80458 0 master MERGED Made default loop delay value a constant 2020-05-14 10:24:58 UTC

Description Roman Hodain 2016-05-20 12:38:26 UTC
Description of problem:
Hosted engine monitoring issues too many lvm operations, which makes it vulnerable
to delays on the storage side. Especially in environments with a high number of
LUNs connected and a higher number of LVM volume groups, the monitoring can suffer
from poor lvm performance.

Version-Release number of selected component (if applicable):
    ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch

How reproducible:
    100%

Steps to Reproduce:
1. See what hosted engine is doing during one monitoring cycle.

Actual results:
The monitoring issues too many LVM operations for every monitoring cycle, which can
cause delays in the monitoring.

Expected results:
Hosted engine monitoring is more efficient.

Comment 4 Roman Hodain 2016-05-20 13:02:42 UTC
Here is an example of an environment suffering from lvm performance.

The monitoring cycle takes ~ one minute, which often causes VM start failures, as
the other hosts assume that the host running the HE is down. The reason is that it
does not refresh its status fast enough.

$ grep 'Sleeping 10 seconds' prvh2.ldipbm.int/var/log/ovirt-hosted-engine-ha/agent.log  | tail 
MainThread::DEBUG::2016-04-19 12:39:30,636::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:40:44,002::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:41:52,113::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:42:57,744::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:44:04,363::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:45:15,459::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:46:16,765::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:47:19,371::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:48:28,572::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:49:50,712::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
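
The gaps between the "Sleeping 10 seconds" lines above are 60-80 seconds instead
of the expected ~10. Here is a small illustrative helper (not from the case; it
only assumes the log path and line format shown above) to compute the real cycle
length:

    # Illustrative: measure the monitoring-cycle length by diffing the
    # timestamps of consecutive "Sleeping 10 seconds" lines in agent.log.
    import re
    from datetime import datetime

    PATTERN = re.compile(
        r"^MainThread::DEBUG::(\S+ \S+?),\d+::.*Sleeping 10 seconds")

    def cycle_lengths(path="/var/log/ovirt-hosted-engine-ha/agent.log"):
        stamps = []
        with open(path) as log:
            for line in log:
                m = PATTERN.match(line)
                if m:
                    stamps.append(
                        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
        # seconds between consecutive "Sleeping" lines, i.e. real cycle length
        return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

Run against the excerpt above, this reports cycles of roughly 63-82 seconds.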

The sar data:
Storage (average for the LUNs connected to the system; 253-2 is the HE SD):
                  DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:     dev253-2     40.74    640.08      0.25     15.72      0.13      3.20      3.09     12.59
Average:     dev253-3    159.84   5474.46   2426.18     49.43      0.51      3.20      1.96     31.31
Average:     dev253-4    124.19  61754.03    330.37    499.91      0.52      4.22      1.82     22.60
Average:     dev253-5    171.48  11165.72   2135.06     77.56      3.62     21.13      3.80     65.17
Average:     dev253-6      7.05     56.38      0.00      8.00      0.02      3.15      3.08      2.17
Average:     dev253-7      7.05     56.38      0.00      8.00      0.02      2.64      2.58      1.82
Average:     dev253-8      7.05     56.38      0.00      8.00      0.02      2.34      2.30      1.62
Average:     dev253-9      7.05     56.38      0.00      8.00      0.04      6.18      6.05      4.26
Average:    dev253-10      7.05     56.38      0.00      8.00      0.02      2.25      2.20      1.55
Average:    dev253-13      3.23     25.84      0.00      8.00      0.02      5.73      5.64      1.82
Average:    dev253-16      3.23     25.84      0.00      8.00      0.01      2.16      2.14      0.69
Average:    dev253-23     13.27    468.14      0.00     35.27      0.03      2.52      2.40      3.19

CPU:
                CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
Average:        all      0.74      0.00      1.99      2.52      0.00      0.00      0.01      4.18      0.00     90.55
Average:          0      0.53      0.00      2.96      0.02      0.00      0.00      0.01     13.20      0.00     83.29
Average:          1     10.52      0.00      7.46      4.65      0.00      0.00      0.20      2.40      0.00     74.77
Average:          2      0.89      0.00      2.86      0.20      0.00      0.00      0.02      9.63      0.00     86.42
Average:          3      0.86      0.00      2.30      0.48      0.00      0.00      0.00      8.24      0.00     88.11
Average:          4      0.85      0.00      2.55      0.03      0.00      0.00      0.01      8.50      0.00     88.06
Average:          5      0.72      0.00      2.54      0.40      0.00      0.00      0.08     10.73      0.00     85.54
Average:          6      0.66      0.00      2.62      0.01      0.00      0.00      0.02     10.51      0.00     86.19
Average:          7      0.70      0.00      2.71      0.30      0.00      0.00      0.00      9.00      0.00     87.29
Average:          8      0.65      0.00      2.45      2.59      0.00      0.00      0.03      9.11      0.00     85.17
Average:          9      0.59      0.00      2.43      0.07      0.00      0.00      0.01      9.50      0.00     87.39
Average:         10      0.84      0.00      3.21      3.40      0.00      0.00      0.01      5.54      0.00     87.00
Average:         11      0.83      0.00      3.16      3.06      0.00      0.00      0.01      5.59      0.00     87.35
Average:         12      0.84      0.00      3.13      4.83      0.00      0.00      0.01      5.52      0.00     85.67
Average:         13      0.87      0.00      3.14      4.24      0.00      0.00      0.01      5.39      0.00     86.36
Average:         14      0.85      0.00      3.15      4.36      0.00      0.00      0.01      5.29      0.00     86.34
Average:         15      0.86      0.00      3.13      4.25      0.00      0.00      0.01      5.36      0.00     86.40
Average:         16      0.86      0.00      3.13      4.04      0.00      0.00      0.01      5.18      0.00     86.79
Average:         17      0.85      0.00      3.12      4.14      0.00      0.00      0.01      5.06      0.00     86.84
Average:         18      0.85      0.00      3.15      3.92      0.00      0.00      0.01      5.04      0.00     87.04
Average:         19      0.85      0.00      3.13      3.55      0.00      0.00      0.01      5.06      0.00     87.41
Average:         20      0.07      0.00      0.87      0.00      0.00      0.00      0.00      0.95      0.00     98.11
Average:         21      0.70      0.00      1.15      0.09      0.00      0.00      0.00      5.61      0.00     92.44
Average:         22      0.20      0.00      1.32      2.10      0.00      0.00      0.01      0.69      0.00     95.68
Average:         23      0.16      0.00      1.59      0.00      0.00      0.00      0.00      1.31      0.00     96.94
Average:         24      0.12      0.00      1.05      0.00      0.00      0.00      0.02      0.45      0.00     98.35
Average:         25      0.10      0.00      0.98      0.00      0.00      0.00      0.00      0.86      0.00     98.06
Average:         26      0.48      0.00      1.40      6.18      0.00      0.00      0.09      0.53      0.00     91.33
Average:         27      0.06      0.00      0.93      0.00      0.00      0.00      0.00      0.77      0.00     98.24
Average:         28      0.04      0.00      0.90      0.00      0.00      0.00      0.00      0.93      0.00     98.12
Average:         29      0.06      0.00      0.87      0.00      0.00      0.00      0.00      2.27      0.00     96.80
Average:         30      0.07      0.00      0.38      0.08      0.00      0.00      0.00      0.89      0.00     98.57
Average:         31      0.73      0.00      1.60     19.02      0.00      0.00      0.01      0.45      0.00     78.19
Average:         32      0.21      0.00      0.53      6.17      0.00      0.00      0.00      0.46      0.00     92.62
Average:         33      0.24      0.00      0.60     11.11      0.00      0.00      0.01      0.58      0.00     87.46
Average:         34      0.21      0.00      0.44      4.83      0.00      0.00      0.01      0.65      0.00     93.86
Average:         35      0.14      0.00      0.42      1.58      0.00      0.00      0.00      0.83      0.00     97.04
Average:         36      0.09      0.00      0.38      0.83      0.00      0.00      0.00      0.69      0.00     98.01
Average:         37      0.11      0.00      0.51      0.23      0.00      0.00      0.00      0.96      0.00     98.19
Average:         38      0.10      0.00      0.52      0.14      0.00      0.00      0.00      1.22      0.00     98.01
Average:         39      0.18      0.00      0.98      0.07      0.00      0.00      0.00      2.61      0.00     96.16

Memory:

            kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
Average:     47021926  84769194     64.32     11051    803221 103119214     27.82  80827514    284451       169


Here is basically what hosted-engine-ha does with the storage, plus the LVM
operations which are part of those calls:

prvh2.ldipbm.int/var/log/vdsm/vdsm.log


Thread-890729::INFO::2016-04-19 12:52:40,164::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=3, spUUID='07a8ed20-7b52-42f7-ad13-b3bc93ac3a3f', conList=[{'id': 'd2d3b2a6-d3b1-4e10-ad13-a1d4752d6eb6', 'connection': '10.2.2.13', 'iqn': 'iqn.1992-04.com.emc:cx.apm00143854803.a4', 'portal': '1', 'user': '', 'password': '********', 'port': '3260'}], options=None)

Thread-890729::DEBUG::2016-04-19 12:52:41,634::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None)
Thread-890729::DEBUG::2016-04-19 12:52:46,046::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0



Thread-890733::INFO::2016-04-19 12:52:46,063::logUtils::48::dispatcher::(wrapper) Run and protect: getStorageDomainStats(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', options=None)
Thread-890733::DEBUG::2016-04-19 12:52:48,351::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None)
Thread-890733::DEBUG::2016-04-19 12:52:54,697::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0

Thread-890733::DEBUG::2016-04-19 12:52:56,759::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None)
Thread-890733::DEBUG::2016-04-19 12:53:01,532::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0

Thread-890733::DEBUG::2016-04-19 12:53:01,584::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None)
Thread-890733::DEBUG::2016-04-19 12:53:03,131::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0



Thread-890740::INFO::2016-04-19 12:53:03,232::logUtils::48::dispatcher::(wrapper) Run and protect: getImagesList(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', options=None)
Thread-890740::DEBUG::2016-04-19 12:53:04,008::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060160411037008194fb521ff4e411|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None)
Thread-890740::DEBUG::2016-04-19 12:53:04,388::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0


Thread-890742::INFO::2016-04-19 12:53:04,404::logUtils::48::dispatcher::(wrapper) Run and protect: getVolumesList(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='0f3378bb-d5c2-4915-b966-c6ec70327ac0', options=None)
Thread-890742::DEBUG::2016-04-19 12:53:05,436::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None)
Thread-890742::DEBUG::2016-04-19 12:53:06,623::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0



Thread-890743::INFO::2016-04-19 12:53:06,661::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='0f3378bb-d5c2-4915-b966-c6ec70327ac0', leafUUID='18e41e18-1e7f-45ac-806f-a3a58a450bcf')
Thread-890743::DEBUG::2016-04-19 12:53:06,663::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060160411037008194fb521ff4e411|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None)
Thread-890743::DEBUG::2016-04-19 12:53:08,013::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0

Thread-890745::INFO::2016-04-19 12:53:08,031::logUtils::48::dispatcher::(wrapper) Run and protect: getVolumesList(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='0d34df52-84ab-4251-bfac-3297850402a5', options=None)
Thread-890745::DEBUG::2016-04-19 12:53:08,033::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060160411037008194fb521ff4e411|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None)
Thread-890745::DEBUG::2016-04-19 12:53:09,575::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0

The prepareImage call then repeats for all of the images:

Thread-890746::INFO::2016-04-19 12:53:09,591::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='0d34df52-84ab-4251-bfac-3297850402a5', leafUUID='4c250af8-6bcf-4bdf-9590-72dd54fcbebd')
Thread-890751::INFO::2016-04-19 12:53:16,726::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='392c2dc0-19be-44e9-8997-aed1baeeeeff', leafUUID='af5fcd22-e481-4394-b114-d35a4ce66339')
Thread-890753::INFO::2016-04-19 12:53:22,544::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='6bfc9d3d-eac0-4d98-b43a-c799bebebfc0', leafUUID='d74512e3-014c-4230-a23a-b4ade9e72ffd')
Thread-890759::INFO::2016-04-19 12:53:29,465::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='b0bf6072-d049-4d57-ab0a-3098015118a2', leafUUID='0e25b559-f986-4561-b4ed-f3af77690652')

Now, do we have to run all of these every time we run the monitoring? To me it
seems that all of this should only be run conditionally, for example during the
first start and if there is a storage failure. This is the same as for any other
VM images: we do not periodically prepare the images or connect to the
storage.
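
A minimal sketch of the conditional flow proposed here (setup_storage() and
storage_is_healthy() are hypothetical placeholders, not the real
ovirt-hosted-engine-ha API):

    import time

    LOOP_DELAY = 10  # seconds between monitoring cycles

    def setup_storage():
        """Expensive path: connectStorageServer + prepareImage for all images."""
        print("connecting storage and preparing images")

    def storage_is_healthy():
        """Cheap per-cycle check (e.g. based on the repoStats verb)."""
        return True

    def monitoring_loop():
        setup_storage()  # run the expensive calls once, at startup
        while True:
            if not storage_is_healthy():
                setup_storage()  # ...and again only when recovery is needed
            # regular monitoring work goes here; no lvm-heavy verbs per cycle
            time.sleep(LOOP_DELAY)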

Comment 6 Yaniv Kaul 2016-05-20 17:40:04 UTC
Nir, IIRC your storage monitoring enhancements are not yet in 4.0? Will they affect this flow as well?

Comment 9 Roy Golan 2016-06-29 08:58:14 UTC
Nir can you refer to comment 7?

Comment 12 Martin Sivák 2016-07-13 09:17:55 UTC
We call connectStorageServer and prepareImages every time to make sure we always have access to the images and symlinks we use for sanlock and the synchronization whiteboard.

It was assumed in the past that prepareImages would do nothing when the storage is already connected, and we wanted that to be a single atomic call.

Comment 13 Nir Soffer 2016-07-13 09:38:25 UTC
(In reply to Martin Sivák from comment #12)
> We call connectStorageServer and prepareImages every time to make sure we
> always have access to the images and symlinks we use for sanlock and the
> synchronization whiteboard.
> 
> It was assumed in the past that prepareImages would do nothing when the
> storage is already connected, and we wanted that to be a single atomic call.

connectStorageServer is not something you want to call regularly.

Every call does the following:
- drops the entire lvm cache, forcing lvm commands to run again for each
  lvm-related call
- performs a scsi and fc scan (each can block for up to 30 seconds)
- runs multipath to reload devices
- searches all devices and resizes multipath maps for resized devices
- finally, marks the storage cache as stale, possibly triggering another
  refresh the next time a storage domain is accessed

You should avoid calling connectStorageServer unless you think the storage needs
to be connected; this can be used in a recovery process, not for monitoring.

prepareImage does the following every time:
- lists all lvs in the vg, possibly running lvm commands if the cache was
  invalidated. If you called connectStorageServer before, it will probably
  need to refresh the lvm cache.
- recreates links in the repository
- activates lvs if needed, though there is never a need to while the VM is
  running

prepareImage is not related to monitoring and should not be called more than
once before you start the engine VM.

You can use it as well as part of a recovery process, not for monitoring.

For monitoring, you should use repoStats and the verbs for getting information
about a storage domain and its volumes. Please check how the engine monitors
storage.
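
For illustration, a sketch of such a repoStats-based check. It assumes the
vdsm.client JSON-RPC API and the Host.getStorageRepoStats verb name from the
vdsm schema; verify both against the installed vdsm version:

    # Assumed API: vdsm.client and Host.getStorageRepoStats. The SD UUID is
    # the hosted-engine storage domain seen in the vdsm.log excerpts above.
    from vdsm import client

    HE_SD_UUID = "69837360-ab4b-48be-9285-c3882b3b1573"

    def storage_domain_is_valid(host="localhost", port=54321):
        cli = client.connect(host, port)
        stats = cli.Host.getStorageRepoStats(domains=[HE_SD_UUID])
        domain = stats.get(HE_SD_UUID, {})
        # repoStats reports, per domain, fields such as 'valid' and 'delay'
        return bool(domain.get("valid"))

Unlike connectStorageServer, this is a read-only status query, so it is cheap
enough to run every monitoring cycle.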

Comment 16 Martin Sivák 2016-07-13 12:41:30 UTC
Nir, we can call it during recovery only, but first we need to detect that a recovery is needed. We would actually like to hear from the storage team how the VDSM API is supposed to be used. Trying to tie all the loose ends from the engine code is error prone, as we have repeatedly seen...

Can you please give us a proper call sequence to check that the storage is still properly connected, all images are available and symlinks are ready?

Comment 17 Roy Golan 2016-08-14 07:44:46 UTC
Nir?

Comment 18 Nir Soffer 2016-09-11 13:01:06 UTC
(In reply to Martin Sivák from comment #16)
> Nir, we can call it during recovery only, but first we need to detect that a
> recovery is needed. We would actually like to hear from the storage team how
> the VDSM API is supposed to be used. Trying to tie all the loose ends from
> the engine code is error prone, as we have repeatedly seen...

There is no documentation about how vdsm was supposed to be used, and the
original authors of vdsm have not been available for years.

The best way, when writing new flows with vdsm, is to check how the engine uses
vdsm, since vdsm was designed to provide services for the engine.

> Can you please give us a proper call sequence to check that the storage is
> still properly connected, all images are available and symlinks are ready?

For this I will have to check how the engine does it; I don't have any
magic answers.

General rules:

- storage domains - repoStats should be your main source. If you need more info,
  you can use other verbs returning info on a storage domain.

- images - you don't need to check that they are available; once you have
  prepared them, nobody tears them down, so there is nothing to check.

- symlinks - these are created when you connect storage domains and prepare
  images; you don't need to check them. This is an implementation detail of vdsm.

Comment 19 Yaniv Kaul 2016-09-21 11:08:39 UTC
Do we have, in 4.0, such an issue?

Comment 20 Martin Sivák 2016-11-23 13:43:42 UTC
Moving to 4.2 as we need to cooperate with the storage team on this.

Comment 26 Artyom 2017-11-19 14:38:01 UTC
Denis, can you help me with verification steps? I tried to verify it on master, but I do not see the "validate_storage_server" method used anywhere (master branch).

Comment 27 Denis Chaplygin 2017-11-20 08:32:19 UTC
Hosted Engine was changed drastically in the last three months, so we don't have that call anymore. All storage handling code has moved to the broker and runs in separate threads.

The biggest part of that code was moved to ovirt_hosted_engine_ha/broker/submonitors/storage_domain.py

The other part is in HostedEngine._validate_storage_images
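
For orientation, a hypothetical sketch of the submonitor shape described above,
loosely modeled on the broker's submonitor_base module; the hook names (setup,
action, update_result) and the register() convention are assumptions to verify
against the actual code:

    def register():
        # the broker looks up submonitors by this name (assumed convention)
        return "storage-domain"

    class Submonitor(object):  # would subclass submonitor_base.SubmonitorBase
        def setup(self, options):
            self._sd_uuid = options.get("sd_uuid")

        def action(self, options):
            # runs periodically in its own broker thread, so a slow storage
            # check no longer stalls the agent's main monitoring loop
            self.update_result(self._check_domain(self._sd_uuid))

        def _check_domain(self, sd_uuid):
            return True  # placeholder for the real repoStats-based check

        def update_result(self, value):
            # provided by SubmonitorBase in the real code (assumed)
            print("storage-domain status:", value)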

Comment 28 Artyom 2017-11-20 13:55:15 UTC
Verified on ovirt-engine-4.2.0-0.0.master.20171116212005.git61ffb5f.el7.centos.noarch

MainThread::DEBUG::2017-11-20 15:53:14,087::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status {"health": "good", "vm": "up", "detail": "Up"}
MainThread::DEBUG::2017-11-20 15:53:14,088::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status True
MainThread::DEBUG::2017-11-20 15:53:14,088::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status 10638
MainThread::DEBUG::2017-11-20 15:53:14,089::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status 0.0084
MainThread::DEBUG::2017-11-20 15:53:14,090::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status 1.0
MainThread::DEBUG::2017-11-20 15:53:14,090::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status True

Comment 31 errata-xmlrpc 2018-05-15 17:32:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1472

Comment 32 Franta Kust 2019-05-16 13:05:34 UTC
BZ<2>Jira Resync

