Description of problem: Hosted engine monitoring issues too many LVM operations, which makes it vulnerable to delays on the storage side. Especially in environments with a high number of LUNs connected and a higher number of LVM volume groups, the monitoring can suffer from LVM performance.

Version-Release number of selected component (if applicable): ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch

How reproducible: 100%

Steps to Reproduce:
1. See what hosted engine is doing during one monitoring cycle.

Actual results: The monitoring runs too many LVM operations in every monitoring cycle, which can cause delays in the monitoring.

Expected results: Hosted engine monitoring is more efficient.
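One way to quantify the reproduction step is to count how many LVM commands vdsm issues during a monitoring cycle. The helper below is a hypothetical sketch, not part of hosted-engine-ha; it only assumes the `Storage.Misc.excCmd` log format visible in the vdsm.log excerpts later in this report.

```python
# Hypothetical helper: count /usr/sbin/lvm invocations per verb in a
# slice of vdsm.log covering one monitoring cycle.
def count_lvm_calls(lines):
    counts = {}
    for line in lines:
        if "Storage.Misc.excCmd" in line and "/usr/sbin/lvm " in line:
            # The LVM verb (vgs, lvs, ...) is the first token after the binary.
            verb = line.split("/usr/sbin/lvm ", 1)[1].split(None, 1)[0]
            counts[verb] = counts.get(verb, 0) + 1
    return counts

sample = [
    "Thread-1::DEBUG::...::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm vgs --config ...",
    "Thread-2::DEBUG::...::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvs --config ...",
    "Thread-3::DEBUG::...::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvs --config ...",
]
print(count_lvm_calls(sample))  # {'vgs': 1, 'lvs': 2}
```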
Here is an example of an environment suffering from LVM performance. The monitoring cycle takes about one minute, which often causes VM start failures because the other hosts assume that the host running the HE is down. The reason is that it does not refresh its status fast enough.

$ grep 'Sleeping 10 seconds' prvh2.ldipbm.int/var/log/ovirt-hosted-engine-ha/agent.log | tail
MainThread::DEBUG::2016-04-19 12:39:30,636::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:40:44,002::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:41:52,113::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:42:57,744::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:44:04,363::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:45:15,459::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:46:16,765::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:47:19,371::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:48:28,572::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds
MainThread::DEBUG::2016-04-19 12:49:50,712::hosted_engine::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 10 seconds

The sar data: Storage (average
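The per-cycle delay can be read off the gaps between consecutive "Sleeping 10 seconds" entries. This hypothetical snippet (not part of the product) computes those gaps; with a 10-second sleep, anything much above ~15 s per cycle means the monitoring work itself is slow.

```python
import re
from datetime import datetime

# Match the timestamp in agent.log entries like
# MainThread::DEBUG::2016-04-19 12:39:30,636::hosted_engine::...
TS_RE = re.compile(r"::(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::")

def cycle_durations(lines):
    """Seconds between consecutive 'Sleeping 10 seconds' log entries."""
    stamps = []
    for line in lines:
        if "Sleeping 10 seconds" not in line:
            continue
        m = TS_RE.search(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

sample = [
    "MainThread::DEBUG::2016-04-19 12:39:30,636::hosted_engine::497::...::(start_monitoring) Sleeping 10 seconds",
    "MainThread::DEBUG::2016-04-19 12:40:44,002::hosted_engine::497::...::(start_monitoring) Sleeping 10 seconds",
]
print(cycle_durations(sample))  # [73.366]
```

On the log excerpt above, every gap is in the 60-80 second range, which matches the ~one minute cycle described here.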
for the LUNs connected to the system; 253-2 is the HE SD):

DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
Average: dev253-2 40.74 640.08 0.25 15.72 0.13 3.20 3.09 12.59
Average: dev253-3 159.84 5474.46 2426.18 49.43 0.51 3.20 1.96 31.31
Average: dev253-4 124.19 61754.03 330.37 499.91 0.52 4.22 1.82 22.60
Average: dev253-5 171.48 11165.72 2135.06 77.56 3.62 21.13 3.80 65.17
Average: dev253-6 7.05 56.38 0.00 8.00 0.02 3.15 3.08 2.17
Average: dev253-7 7.05 56.38 0.00 8.00 0.02 2.64 2.58 1.82
Average: dev253-8 7.05 56.38 0.00 8.00 0.02 2.34 2.30 1.62
Average: dev253-9 7.05 56.38 0.00 8.00 0.04 6.18 6.05 4.26
Average: dev253-10 7.05 56.38 0.00 8.00 0.02 2.25 2.20 1.55
Average: dev253-13 3.23 25.84 0.00 8.00 0.02 5.73 5.64 1.82
Average: dev253-16 3.23 25.84 0.00 8.00 0.01 2.16 2.14 0.69
Average: dev253-23 13.27 468.14 0.00 35.27 0.03 2.52 2.40 3.19

CPU:
CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
Average: all 0.74 0.00 1.99 2.52 0.00 0.00 0.01 4.18 0.00 90.55
Average: 0 0.53 0.00 2.96 0.02 0.00 0.00 0.01 13.20 0.00 83.29
Average: 1 10.52 0.00 7.46 4.65 0.00 0.00 0.20 2.40 0.00 74.77
Average: 2 0.89 0.00 2.86 0.20 0.00 0.00 0.02 9.63 0.00 86.42
Average: 3 0.86 0.00 2.30 0.48 0.00 0.00 0.00 8.24 0.00 88.11
Average: 4 0.85 0.00 2.55 0.03 0.00 0.00 0.01 8.50 0.00 88.06
Average: 5 0.72 0.00 2.54 0.40 0.00 0.00 0.08 10.73 0.00 85.54
Average: 6 0.66 0.00 2.62 0.01 0.00 0.00 0.02 10.51 0.00 86.19
Average: 7 0.70 0.00 2.71 0.30 0.00 0.00 0.00 9.00 0.00 87.29
Average: 8 0.65 0.00 2.45 2.59 0.00 0.00 0.03 9.11 0.00 85.17
Average: 9 0.59 0.00 2.43 0.07 0.00 0.00 0.01 9.50 0.00 87.39
Average: 10 0.84 0.00 3.21 3.40 0.00 0.00 0.01 5.54 0.00 87.00
Average: 11 0.83 0.00 3.16 3.06 0.00 0.00 0.01 5.59 0.00 87.35
Average: 12 0.84 0.00 3.13 4.83 0.00 0.00 0.01 5.52 0.00 85.67
Average: 13 0.87 0.00 3.14 4.24 0.00 0.00 0.01 5.39 0.00 86.36
Average: 14 0.85 0.00 3.15 4.36 0.00 0.00 0.01 5.29 0.00 86.34
Average: 15 0.86 0.00 3.13 4.25 0.00 0.00 0.01 5.36 0.00 86.40
Average: 16 0.86 0.00 3.13 4.04 0.00 0.00 0.01 5.18 0.00 86.79
Average: 17 0.85 0.00 3.12 4.14 0.00 0.00 0.01 5.06 0.00 86.84
Average: 18 0.85 0.00 3.15 3.92 0.00 0.00 0.01 5.04 0.00 87.04
Average: 19 0.85 0.00 3.13 3.55 0.00 0.00 0.01 5.06 0.00 87.41
Average: 20 0.07 0.00 0.87 0.00 0.00 0.00 0.00 0.95 0.00 98.11
Average: 21 0.70 0.00 1.15 0.09 0.00 0.00 0.00 5.61 0.00 92.44
Average: 22 0.20 0.00 1.32 2.10 0.00 0.00 0.01 0.69 0.00 95.68
Average: 23 0.16 0.00 1.59 0.00 0.00 0.00 0.00 1.31 0.00 96.94
Average: 24 0.12 0.00 1.05 0.00 0.00 0.00 0.02 0.45 0.00 98.35
Average: 25 0.10 0.00 0.98 0.00 0.00 0.00 0.00 0.86 0.00 98.06
Average: 26 0.48 0.00 1.40 6.18 0.00 0.00 0.09 0.53 0.00 91.33
Average: 27 0.06 0.00 0.93 0.00 0.00 0.00 0.00 0.77 0.00 98.24
Average: 28 0.04 0.00 0.90 0.00 0.00 0.00 0.00 0.93 0.00 98.12
Average: 29 0.06 0.00 0.87 0.00 0.00 0.00 0.00 2.27 0.00 96.80
Average: 30 0.07 0.00 0.38 0.08 0.00 0.00 0.00 0.89 0.00 98.57
Average: 31 0.73 0.00 1.60 19.02 0.00 0.00 0.01 0.45 0.00 78.19
Average: 32 0.21 0.00 0.53 6.17 0.00 0.00 0.00 0.46 0.00 92.62
Average: 33 0.24 0.00 0.60 11.11 0.00 0.00 0.01 0.58 0.00 87.46
Average: 34 0.21 0.00 0.44 4.83 0.00 0.00 0.01 0.65 0.00 93.86
Average: 35 0.14 0.00 0.42 1.58 0.00 0.00 0.00 0.83 0.00 97.04
Average: 36 0.09 0.00 0.38 0.83 0.00 0.00 0.00 0.69 0.00 98.01
Average: 37 0.11 0.00 0.51 0.23 0.00 0.00 0.00 0.96 0.00 98.19
Average: 38 0.10 0.00 0.52 0.14 0.00 0.00 0.00 1.22 0.00 98.01
Average: 39 0.18 0.00 0.98 0.07 0.00 0.00 0.00 2.61 0.00 96.16

Memory:
kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
Average: 47021926 84769194 64.32 11051 803221 103119214 27.82 80827514 284451 169

Here is basically what hosted-engine-ha does with the storage, plus the LVM operations which are part of those calls:

prvh2.ldipbm.int/var/log/vdsm/vdsm.log

Thread-890729::INFO::2016-04-19 12:52:40,164::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=3,
spUUID='07a8ed20-7b52-42f7-ad13-b3bc93ac3a3f', conList=[{'id': 'd2d3b2a6-d3b1-4e10-ad13-a1d4752d6eb6', 'connection': '10.2.2.13', 'iqn': 'iqn.1992-04.com.emc:cx.apm00143854803.a4', 'portal': '1', 'user': '', 'password': '********', 'port': '3260'}], options=None) Thread-890729::DEBUG::2016-04-19 12:52:41,634::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) Thread-890729::DEBUG::2016-04-19 12:52:46,046::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. 
Restart lvmetad before enabling it!\n'; <rc> = 0 Thread-890733::INFO::2016-04-19 12:52:46,063::logUtils::48::dispatcher::(wrapper) Run and protect: getStorageDomainStats(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', options=None) Thread-890733::DEBUG::2016-04-19 12:52:48,351::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None) Thread-890733::DEBUG::2016-04-19 12:52:54,697::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. 
Restart lvmetad before enabling it!\n'; <rc> = 0 Thread-890733::DEBUG::2016-04-19 12:52:56,759::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None) Thread-890733::DEBUG::2016-04-19 12:53:01,532::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. 
Restart lvmetad before enabling it!\n'; <rc> = 0 Thread-890733::DEBUG::2016-04-19 12:53:01,584::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None) Thread-890733::DEBUG::2016-04-19 12:53:03,131::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. 
Restart lvmetad before enabling it!\n'; <rc> = 0 Thread-890740::INFO::2016-04-19 12:53:03,232::logUtils::48::dispatcher::(wrapper) Run and protect: getImagesList(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', options=None) Thread-890740::DEBUG::2016-04-19 12:53:04,008::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060160411037008194fb521ff4e411|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None) Thread-890740::DEBUG::2016-04-19 12:53:04,388::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. 
Restart lvmetad before enabling it!\n'; <rc> = 0 Thread-890742::INFO::2016-04-19 12:53:04,404::logUtils::48::dispatcher::(wrapper) Run and protect: getVolumesList(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='0f3378bb-d5c2-4915-b966-c6ec70327ac0', options=None) Thread-890742::DEBUG::2016-04-19 12:53:05,436::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600601604110370001e6f6eeb9f5e411|/dev/mapper/36006016041103700437eb2849f79e511|/dev/mapper/3600601604110370073c8a8a80779e511|/dev/mapper/360060160411037008194fb521ff4e411|/dev/mapper/3600601604110370084d7340be6f4e411|/dev/mapper/360060160411037009231d9677873e511|/dev/mapper/36006016041103700994e4d130879e511|/dev/mapper/360060160411037009ba65f6a9f79e511|/dev/mapper/36006016041103700d21c652ae7f4e411|/dev/mapper/36006016041103700d3fdba270879e511|/dev/mapper/36006016041103700e221e8912671e511|/dev/mapper/36006016041103700e41717cd0779e511|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None) Thread-890742::DEBUG::2016-04-19 12:53:06,623::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. 
Restart lvmetad before enabling it!\n'; <rc> = 0 Thread-890743::INFO::2016-04-19 12:53:06,661::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='0f3378bb-d5c2-4915-b966-c6ec70327ac0', leafUUID='18e41e18-1e7f-45ac-806f-a3a58a450bcf') Thread-890743::DEBUG::2016-04-19 12:53:06,663::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060160411037008194fb521ff4e411|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None) Thread-890743::DEBUG::2016-04-19 12:53:08,013::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. 
Restart lvmetad before enabling it!\n'; <rc> = 0 Thread-890745::INFO::2016-04-19 12:53:08,031::logUtils::48::dispatcher::(wrapper) Run and protect: getVolumesList(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='0d34df52-84ab-4251-bfac-3297850402a5', options=None) Thread-890745::DEBUG::2016-04-19 12:53:08,033::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-39 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060160411037008194fb521ff4e411|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 69837360-ab4b-48be-9285-c3882b3b1573 (cwd None) Thread-890745::DEBUG::2016-04-19 12:53:09,575::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. 
Restart lvmetad before enabling it!\n'; <rc> = 0

The prepareImage calls then repeat for all of the images:

Thread-890746::INFO::2016-04-19 12:53:09,591::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='0d34df52-84ab-4251-bfac-3297850402a5', leafUUID='4c250af8-6bcf-4bdf-9590-72dd54fcbebd')
Thread-890751::INFO::2016-04-19 12:53:16,726::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='392c2dc0-19be-44e9-8997-aed1baeeeeff', leafUUID='af5fcd22-e481-4394-b114-d35a4ce66339')
Thread-890753::INFO::2016-04-19 12:53:22,544::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='6bfc9d3d-eac0-4d98-b43a-c799bebebfc0', leafUUID='d74512e3-014c-4230-a23a-b4ade9e72ffd')
Thread-890759::INFO::2016-04-19 12:53:29,465::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='69837360-ab4b-48be-9285-c3882b3b1573', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='b0bf6072-d049-4d57-ab0a-3098015118a2', leafUUID='0e25b559-f986-4561-b4ed-f3af77690652')

Now, do we have to run all of these every time we run the monitoring? To me it seems all of this should be run only conditionally, for example during the first start and if there is a storage failure. This is the same as for any other VM's images: we do not periodically prepare the images or reconnect to the storage.
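The conditional approach suggested here can be sketched as a small state machine. This is a hypothetical illustration, not hosted-engine-ha code; the callables are stand-ins for whatever wraps connectStorageServer, prepareImage, and a cheap health probe.

```python
# Hypothetical sketch: connect/prepare only on first start or after a
# detected failure, instead of on every monitoring cycle.
class StorageState:
    def __init__(self, connect, prepare, check):
        self._connect = connect   # stand-in for connectStorageServer
        self._prepare = prepare   # stand-in for prepareImage calls
        self._check = check       # cheap per-cycle health probe
        self._ready = False

    def ensure_ready(self):
        # Fast path: storage was set up earlier and still looks healthy,
        # so the monitoring cycle does no expensive LVM work at all.
        if self._ready and self._check():
            return
        # Slow path: first start, or recovery after a failed probe.
        self._connect()
        self._prepare()
        self._ready = True
```

With a healthy probe, repeated `ensure_ready()` calls perform the expensive setup exactly once; a failing probe triggers the recovery path on the next cycle.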
Nir, IIRC your storage monitoring enhancements are not yet in 4.0? Will they affect this flow as well?
Nir, can you refer to comment 7?
We call connectStorageServer and prepareImages every time to make sure we always have access to the images and symlinks we use for sanlock and the synchronization whiteboard. It was assumed in the past that prepareImages would do nothing when the storage was already connected, and we wanted that to be a single atomic call.
(In reply to Martin Sivák from comment #12)
> We call connectStorageServer and prepareImages every time to make sure we
> always have access to the images and symlinks we use for sanlock and the
> synchronization whiteboard.
>
> It was assumed in the past that prepareImages would do nothing when the
> storage is already connected and we wanted that be a single atomic call.

connectStorageServer is not something you want to call regularly. Every call does the following:

- drop the entire LVM cache, forcing LVM commands to run for each LVM-related call
- perform SCSI and FC scans (each can block for up to 30 seconds)
- run multipath to reload devices
- search all devices and resize multipath maps for resized devices
- finally, mark the storage cache as stale, possibly triggering another refresh the next time the storage domain is accessed

You should avoid calling connectStorageServer unless you think the storage needs to be connected; it can be used in a recovery process, not for monitoring.

prepareImage will, every time:

- list all LVs in the VG, possibly running LVM commands if the cache was invalidated. If you called connectStorageServer before, then it will probably need to refresh the LVM cache.
- recreate links in the repository
- activate LVs if needed, although there is never a need for this while the VM is running

prepareImage is not related to monitoring and should not be called more than once before you start the engine VM. You can use it as well as part of a recovery process, not for monitoring.

For monitoring, you should use repoStats and the verbs for getting information about storage domains and volumes. Please check how engine monitors storage.
Nir, we can call it during recovery only, but first we need to detect that a recovery is needed. We would actually like to hear from the storage team how the VDSM API is supposed to be used. Trying to tie up all the loose ends from the engine code is error prone, as we have repeatedly seen... Can you please give us a proper call sequence to check that the storage is still properly connected, all images are available and the symlinks are ready?
Nir?
(In reply to Martin Sivák from comment #16)
> Nir, we can call it during recovery only, but first we need to detect that a
> recovery is needed. We would actually like to hear how the VDSM API is
> supposed to be used from the storage team. Trying to tie all the lose ends
> from the engine code is error prone as we have repeatedly seen...

There is no documentation about how vdsm was supposed to be used, and the original authors of vdsm have not been available for years. The best approach when writing new flows with vdsm is to check how engine uses vdsm, since vdsm was designed to provide services for engine.

> Can you please give us a proper call sequence to check that the storage is
> still properly connected, all images are available and symlinks are ready?

For this I will have to check how engine does it; I don't have any magic answers.

General rules:

- Storage domains: repoStats should be your main source. If you need more info, you can use the other verbs returning info on storage domains.
- Images: you don't need to check that they are available; once you have prepared them, nobody tears them down, so there is nothing to check.
- Symlinks: these are created when you connect storage domains and prepare images; you don't need to check them. This is an implementation detail of vdsm.
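A monitoring check built on repoStats might look like the sketch below. This is hypothetical illustration only: the field names (`valid`, `delay`, `lastCheck`) and the string-typed delay are assumptions about the repoStats response shape and should be verified against the vdsm API, and the thresholds are arbitrary examples.

```python
# Hypothetical health check over an assumed repoStats-style response:
# a dict keyed by storage-domain UUID.
def domain_healthy(repo_stats, sd_uuid, max_delay=5.0, max_last_check=30.0):
    stats = repo_stats.get(sd_uuid)
    if stats is None:
        return False  # domain not monitored at all
    return (stats.get("valid", False)
            and float(stats.get("delay", "inf")) <= max_delay
            and float(stats.get("lastCheck", "inf")) <= max_last_check)

sample = {
    "69837360-ab4b-48be-9285-c3882b3b1573": {
        "valid": True, "delay": "0.0084", "lastCheck": "1.0",
    },
}
print(domain_healthy(sample, "69837360-ab4b-48be-9285-c3882b3b1573"))  # True
```

The point of the design is that this reads cached monitoring results rather than issuing new LVM commands, so calling it every cycle is cheap.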
Do we have such an issue in 4.0?
Moving to 4.2 as we need to cooperate with the storage team on this.
Denis, can you help me with verification steps? I tried to verify it on master, but I do not see the "validate_storage_server" method used anywhere (master branch).
Hosted Engine was changed drastically in the last three months, so we do not have that call anymore. All storage handling code was moved to the broker and runs in separate threads. The biggest part of that code was moved to ovirt_hosted_engine_ha/broker/submonitors/storage_domain.py; the other part is in HostedEngine._validate_storage_images.
Verified on ovirt-engine-4.2.0-0.0.master.20171116212005.git61ffb5f.el7.centos.noarch

MainThread::DEBUG::2017-11-20 15:53:14,087::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status {"health": "good", "vm": "up", "detail": "Up"}
MainThread::DEBUG::2017-11-20 15:53:14,088::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status True
MainThread::DEBUG::2017-11-20 15:53:14,088::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status 10638
MainThread::DEBUG::2017-11-20 15:53:14,089::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status 0.0084
MainThread::DEBUG::2017-11-20 15:53:14,090::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status 1.0
MainThread::DEBUG::2017-11-20 15:53:14,090::brokerlink::95::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status True
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1472