Bug 1765684 - Log important state changes and time spent in slow critical operations
Summary: Log important state changes and time spent in slow critical operations
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.0
Assignee: Nir Soffer
QA Contact: Yosi Ben Shimon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-25 18:00 UTC by Nir Soffer
Modified: 2020-05-20 21:32 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 20:00:42 UTC
oVirt Team: Storage
Embargoed:
tnisan: ovirt-4.4?




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 104244 0 'None' MERGED sdc: Log time for slow operations and state changes 2020-05-20 21:22:31 UTC
oVirt gerrit 104245 0 'None' MERGED hba: Log time for scanning FC devices 2020-05-20 21:22:31 UTC
oVirt gerrit 104246 0 'None' MERGED iscsi: Log time for scanning iSCSI devices 2020-05-20 21:22:31 UTC
oVirt gerrit 104247 0 'None' MERGED multipath: Log time for resizing devices 2020-05-20 21:22:32 UTC

Description Nir Soffer 2019-10-25 18:00:33 UTC
Description of problem:

Debugging issues such as bug 1598266 is very time-consuming and requires too much effort and skill.

See https://bugzilla.redhat.com/show_bug.cgi?id=1598266#c34 for the effort
required to get useful info from vdsm logs.

When we have issues related to slow operations, a simple grep should be enough
to reveal the problem.

- Important state changes, like invalidating caches, must be logged at the
  default log level
- Critical slow operations, like an FC scan, must log the time spent at the
  default log level (see the sketch below).
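
A minimal sketch of how such timing logging could be implemented. The helper name
logged_operation and its signature are illustrative only; the actual vdsm helper
(apparently living in the utils module, judging by the "(utils:390)" suffix in the
log samples later in this bug) may differ.

    import logging
    import time
    from contextlib import contextmanager

    log = logging.getLogger("storage.Example")

    @contextmanager
    def logged_operation(name, level=logging.INFO):
        """Log the start of an operation and the elapsed time when it ends."""
        log.log(level, "%s", name)
        start = time.monotonic()
        try:
            yield
        finally:
            log.log(level, "%s: %.2f seconds", name, time.monotonic() - start)

    # Usage, producing a start/stop pair like the samples in comment 9:
    #
    #     with logged_operation("Scanning FC devices"):
    #         scan_fc_devices()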

Version-Release number of selected component (if applicable):
Any

How reproducible:
Always

Steps to Reproduce:
1. Try to debug issues such as bug 1598266
2. Watch your time wasted dealing with bad logs

Actual results:
Understanding the issue takes more than a year.

Expected results:
Understanding the issue takes minutes.

Fixing this issue is easy, and it should be fixed in 4.3 to make the system easier
to support.

Comment 1 Nir Soffer 2019-10-25 20:35:27 UTC
Some stats from new logs, obtained from one OST run:
https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5854/

refreshStorage:

$ grep 'Refreshing storage domain cache:' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | awk 'BEGIN{min=99999; max=0} {if ($10 > max) max=$10; if ($10 < min) min=$10; sum+=$10} END{print "min=" min, "max=" max, "avg=" sum / NR}'
min=0.26 max=1.41 avg=0.747534

fc scan:

$ grep 'Scanning FC devices:' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | awk 'BEGIN{min=99999; max=0} {if ($9 > max) max=$9; if ($9 < min) min=$9; sum+=$9} END{print "min=" min, "max=" max, "avg=" sum / NR}'
min=0.12 max=0.42 avg=0.257213

iscsi scan:

$ grep 'Scanning iSCSI devices:' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | awk 'BEGIN{min=99999; max=0} {if ($9 > max) max=$9; if ($9 < min) min=$9; sum+=$9} END{print "min=" min, "max=" max, "avg=" sum / NR}'
min=0.12 max=0.84 avg=0.425612

Resize multipath devices:

$ grep 'Resizing multipath devices:' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | awk 'BEGIN{min=99999; max=0} {if ($9 > max) max=$9; if ($9 < min) min=$9; sum+=$9} END{print "min=" min, "max=" max, "avg=" sum / NR}'
min=0.00 max=0.08 avg=0.0141654

Looking up domains:

$ egrep 'Looking up domain [a-fA-F0-9-]+: ' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | awk 'BEGIN{min=99999; max=0} {if ($10 > max) max=$10; if ($10 < min) min=$10; sum+=$10} END{print "min=" min, "max=" max, "avg=" sum / NR}'
min=0.02 max=0.49 avg=0.232553

Number of cache invalidations per OST run:

$ grep 'Invalidating storage domain cache' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | wc -l
494

Number of refreshStorage calls per OST run:

$ grep 'Refreshing storage domain cache:' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | wc -l
653

$ grep 'Scanning FC devices:' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | wc -l
653

$ grep 'Scanning iSCSI devices:' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | wc -l
654

Number of cache clears per OST run:

$ grep 'Clearing storage domain cache' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | wc -l
98

$ grep 'START getDeviceList' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | wc -l
61

Everything is fast in OST because it is a toy setup.

We need to check why we have so many refreshes and invalidations. Refreshes and scans
make sense mainly in the context of getDeviceList.

Comment 2 Germano Veit Michel 2019-10-27 22:56:18 UTC
Thank you Nir!

These patches will make debugging these problems A LOT easier and clearer.

The only thing I'm concerned about is the amount of logging this may generate on bigger setups (e.g. 50 SDs).
Would it be possible to log only every N refreshes, suppressing these messages on most invalidations/refreshes?
Alternatively, I don't think it will be a big problem if these are mostly at DEBUG level; we can enable it
when necessary.

Comment 3 Nir Soffer 2019-10-27 23:40:26 UTC
(In reply to Germano Veit Michel from comment #2)
> Thank you Nir!
> 
> These patches will make debugging these problems A LOT easier and clearer.
> 
> The only thing I'm concerned about is the amount of logging this may generate
> on bigger setups (e.g. 50 SDs).
> Would it be possible to log only every N refreshes, suppressing these messages
> on most invalidations/refreshes?
> Alternatively, I don't think it will be a big problem if these are mostly at
> DEBUG level; we can enable it when necessary.

I'm afraid we cannot have our cake and eat it too. We need these
logs at the default log level so that we have a good understanding of the system's
behavior when customers have issues, without needing to enable special logs
and try to reproduce the issue.

These changes only add one line per state change, and two lines per slow
operation (a start line and a stop line with the elapsed time).

Let's see how much log volume is added:

$ cat post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | wc -l -c
 241210 68359640

$ egrep 'Refreshing storage domain cache|Scanning FC devices|Scanning iSCSI devices|Resizing multipath devices|Looking up domain|Invalidating storage domain cache|Clearing storage domain cache|Adding domain .+ to storage domain cache|Removing domain .+ from storage domain cache' post-0*/lago-basic-suite-master-host-*/_var_log/vdsm/vdsm.log | wc -l -c
   6617 1326046

Which is 6617/241210 ≈ 2.7% more lines, and 1326046/68359640 ≈ 1.9% more bytes.

OST uses the DEBUG log level for many subsystems (but not storage), so in
a real system the relative difference will be bigger. We can try to tweak the OST logs
back to INFO level and measure again to get a good estimate for a real setup.

But the goal is not to eliminate the invalidation/refresh logs, but to
eliminate the unneeded invalidations and refreshes. I started to work on
that here:
https://gerrit.ovirt.org/c/103424/

Comment 4 Germano Veit Michel 2019-10-27 23:52:02 UTC
(In reply to Nir Soffer from comment #3)
> I'm afraid we cannot have the cake and eat it at the same time. We need these
> logs in the default log level so we have good understanding of the system
> behavior when customers have issues, and we don't need to enable special
> logs and try to reproduce the issue.

So far all these problems have been reproducible. But fair enough.

> But the goal is not to eliminate the invalidation/refresh logs, but to
> eliminate the unneeded invalidations and refreshes. I started to work on
> that here:
> https://gerrit.ovirt.org/c/103424/

Right, I forgot you were also working on that, which will make a real difference.
With that optimization we may even have fewer logs than today, and the ones we do
have will be much more useful.

So all good. Thanks again.

Comment 8 Avihai 2019-12-09 13:53:23 UTC
Nir, please provide a clear and short scenario for verification.

We do not have the capacity for sage scenarios.

Comment 9 Nir Soffer 2019-12-11 17:38:38 UTC
(In reply to Avihai from comment #8)
I'm not sure what "sage" scenarios are, but this can be verified
by performing operations that cause a storage refresh, such as creating a
new domain or managing an existing iSCSI/FC domain, and looking at the vdsm log,
where you will find the new log lines:

2019-10-25 14:24:08,813-0400 INFO  (jsonrpc/6) [storage.StorageDomainCache] Refreshing storage domain cache (resize=True) (sdc:80)
2019-10-25 14:24:08,813-0400 INFO  (jsonrpc/6) [storage.ISCSI] Scanning iSCSI devices (iscsi:442)
2019-10-25 14:24:09,058-0400 INFO  (jsonrpc/6) [storage.ISCSI] Scanning iSCSI devices: 0.24 seconds (utils:390)
2019-10-25 14:24:09,059-0400 INFO  (jsonrpc/6) [storage.HBA] Scanning FC devices (hba:60)
2019-10-25 14:24:09,316-0400 INFO  (jsonrpc/6) [storage.HBA] Scanning FC devices: 0.26 seconds (utils:390)
2019-10-25 14:24:09,357-0400 INFO  (jsonrpc/6) [storage.Multipath] Resizing multipath devices (multipath:104)
2019-10-25 14:24:09,359-0400 INFO  (jsonrpc/6) [storage.Multipath] Resizing multipath devices: 0.00 seconds (utils:390)
2019-10-25 14:24:09,360-0400 INFO  (jsonrpc/6) [storage.StorageDomainCache] Refreshing storage domain cache: 0.54 seconds (utils:390)

These logs are mostly useful on large setups with many LUNs and storage
domains. In pathological cases an iSCSI scan, FC scan, or multipath resize
can take a lot of time, blocking other operations, so we want to make
it easy to find such issues.
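
As an illustration, a small script along these lines can flag slow operations in a
large log. This is a sketch: the 1-second threshold and the default log path are
arbitrary choices, and the regex assumes only the elapsed-time line format shown
above.

    import re
    import sys

    # Matches the elapsed-time lines shown above, e.g.
    # "... [storage.HBA] Scanning FC devices: 0.26 seconds (utils:390)"
    PATTERN = re.compile(r"\[storage\.\S+\] (.+): ([\d.]+) seconds")

    THRESHOLD = 1.0  # seconds; pick whatever counts as "slow" for the setup

    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vdsm/vdsm.log"
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m and float(m.group(2)) > THRESHOLD:
                print(line.rstrip())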

Comment 10 Yosi Ben Shimon 2019-12-22 15:41:45 UTC
Tested on:
ovirt-engine-4.4.0-0.13.master.el7.noarch
vdsm-4.40.0-164.git38a19bb.el8ev.x86_64

Actual result:
Created a new NFS storage domain and got these in the VDSM log:

2019-12-22 10:37:53,683-0500 INFO  (jsonrpc/6) [storage.StorageDomainCache] Refreshing storage domain cache (resize=False) (sdc:80)
2019-12-22 10:37:53,684-0500 INFO  (jsonrpc/6) [storage.ISCSI] Scanning iSCSI devices (iscsi:442)
2019-12-22 10:37:53,896-0500 INFO  (jsonrpc/6) [storage.ISCSI] Scanning iSCSI devices: 0.21 seconds (utils:390)
2019-12-22 10:37:53,897-0500 INFO  (jsonrpc/6) [storage.HBA] Scanning FC devices (hba:60)
2019-12-22 10:37:54,208-0500 INFO  (jsonrpc/6) [storage.HBA] Scanning FC devices: 0.31 seconds (utils:390)
2019-12-22 10:37:54,258-0500 INFO  (jsonrpc/6) [storage.StorageDomainCache] Refreshing storage domain cache: 0.57 seconds (utils:390)


VERIFIED

Comment 11 Sandro Bonazzola 2020-05-20 20:00:42 UTC
This bug is included in the oVirt 4.4.0 release, published on May 20th, 2020.

Since the problem described in this bug report should be
resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 12 Nir Soffer 2020-05-20 21:32:24 UTC
Removing stale need infos.

