Description of problem:
Collecting a sosreport on a RHEL-7.4 hypervisor throws an exception during the plugin setup routine.

Version-Release number of selected component (if applicable):
vdsm-4.19.31-1.el7ev.x86_64

How reproducible:

Steps to Reproduce:
** I was not able to reproduce this in any way. It may depend on the RHV environment.

Actual results:

Setting up archive ...
Setting up plugins ...
 caught exception in plugin method "vdsm.setup()"   <<======
 writing traceback to sos_logs/vdsm-plugin-errors.txt
Running plugins. Please wait ...

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1252, in setup
    plug.setup()
  File "/usr/lib/python2.7/site-packages/sos/plugins/vdsm.py", line 159, in setup
    sd_uuids = cli.Host.getStorageDomains()
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 252, in _call
    raise TimeoutError(method, kwargs, timeout)
TimeoutError: Request Host.getStorageDomains with args {} timed out after 60 seconds

Expected results:
No exception.

Additional info:
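For reference, the failing call goes through the vdsm.client API visible in the traceback. Below is a minimal sketch of that call pattern, not taken from the plugin itself: connect()'s timeout argument is an assumption suggested by the 60-second default shown in the error message, and "localhost" is a placeholder.

    from vdsm import client

    # Connect to the local vdsm. The timeout argument is an assumption
    # based on the "timed out after 60 seconds" default in the traceback.
    cli = client.connect("localhost", 54321, timeout=180)

    try:
        # The request that timed out in the traceback above.
        sd_uuids = cli.Host.getStorageDomains()
        print(sd_uuids)
    except client.TimeoutError as e:
        print("vdsm request timed out: %s" % e)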
It would be helpful if you could attach the vdsm.log from the time of the failed getStorageDomains command.
Steffen, if collecting info from vdsm timed out, what do you expect to see in the sosreport instead of the traceback?
(In reply to Nir Soffer from comment #11)
> Steffen, if collecting info from vdsm timed out, what do you expect to see
> in the sosreport instead of the traceback?

I would expect not to see this error, as I would like to have the expected information inside the sosreport.
If this error occurs every time, it would be possible to miss some data needed for analysis.
(In reply to Steffen Froemer from comment #13)
> (In reply to Nir Soffer from comment #11)
> > Steffen, if collecting info from vdsm timed out, what do you expect to
> > see in the sosreport instead of the traceback?
>
> I would expect not to see this error, as I would like to have the expected
> information inside the sosreport.
> If this error occurs every time, it would be possible to miss some data
> needed for analysis.

sosreport cannot guarantee that the information will be in the sosreport. If vdsm is not responsive, information from vdsm cannot be in the sosreport.

I think we have multiple issues:

1. sosreport is using an incorrect timeout for requests that can take a lot of time.
   We should use different timeouts for different requests, so we can get results on a system with a lot of LUNs.

2. sosreport is using getDeviceList incorrectly:

       self.collectVdsmCommand(
           "Host.getDeviceList", cli.Host.getDeviceList)

   getDeviceList must be called with checkStatus=False. Otherwise it will try to check the status of every LUN, which can take many minutes with hundreds of LUNs.

3. sosreport is collecting data in the setup phase.
   It should collect data in the collection phase. I am not sure what the correct way to implement this with sosreport is.

4. sosreport is failing after the first timeout.
   It should continue with the next request. In the worst case, some requests will never complete and we will not have the data for them.

I suggest opening a new bug for each item. A sketch of possible fixes for issues 2 and 4 follows below.
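Not the actual patch, but a minimal sketch of what fixes for issues 2 and 4 could look like inside the plugin, assuming collectVdsmCommand invokes the given callable with no arguments (as the snippet quoted above suggests); collect_vdsm_info and the logging call are hypothetical.

    import logging
    from functools import partial

    from vdsm import client

    def collect_vdsm_info(plugin, cli):
        # Hypothetical collection loop; only collectVdsmCommand and the
        # Host.getDeviceList call mirror the plugin code quoted above.
        requests = [
            ("Host.getStorageDomains", cli.Host.getStorageDomains),
            # Issue 2: checkStatus=False skips probing the status of every
            # LUN, which can take many minutes with hundreds of LUNs.
            ("Host.getDeviceList",
             partial(cli.Host.getDeviceList, checkStatus=False)),
        ]
        for name, func in requests:
            try:
                # Issue 4: a timeout in one request must not abort the rest.
                plugin.collectVdsmCommand(name, func)
            except client.TimeoutError:
                logging.warning("request %s timed out, skipping", name)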
Ala, the attached patch fixes only issue 2. What about the other issues? I think we need a new bug for each issue, or an explanation of how they are resolved.
The original bug is about the error that is fixed in the referenced patch. I will ask Steffen to open new bugs for the other issues.
Ala,

Please provide steps to reproduce when you have them.

Thanks
Nir and Ala,

fixing issue 2 is fine for me. I can't say whether we hit the other issues as well.
As a first step, I would use the patched vdsm module and ask the customer to test it. If they see further issues, I will open a new bugzilla for them. Otherwise we're fine.

Is the patch available somewhere? I would like to use a test version in the customer environment.

Thanks,
Steffen
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{}'] For more info please contact: rhv-devops
(In reply to Steffen Froemer from comment #18)
> Nir and Ala,
>
> fixing issue 2 is fine for me. I can't say whether we hit the other issues
> as well.
> As a first step, I would use the patched vdsm module and ask the customer
> to test it. If they see further issues, I will open a new bugzilla for
> them. Otherwise we're fine.
>
> Is the patch available somewhere? I would like to use a test version in
> the customer environment.

The patch is available in Vdsm 4.20.13.

> Thanks,
> Steffen
(In reply to Raz Tamir from comment #17)
> Ala,
>
> Please provide steps to reproduce when you have them.
>
> Thanks

Add as many devices as you can (30 or more), and generate the sos report on the host by executing the `sosreport` command. No timeout error should be raised during the report generation.

You can also verify that when the storage server is down, there is a timeout but the report is still generated.
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}'] For more info please contact: rhv-devops
Verified with the following code:
-------------------------------------------------
ovirt-engine-4.2.1.3-0.1.el7.noarch
vdsm-4.20.17-11.gite2d6775.el7.centos.x86_64

Verified with the following scenario:
-------------------------------------------------
1. Create a system with more than 30 storage domains
2. Run 'ovirt-log-collector' on the engine

Report is generated. No exceptions thrown.

Moving to VERIFIED
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Found non-acked flags: '{}'] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489