Description of problem:
Collecting a sosreport on a RHEL-7.4 hypervisor throws an exception during the plugin setup routine.

Version-Release number of selected component (if applicable):
vdsm-4.19.31-1.el7ev.x86_64

How reproducible:

Steps to Reproduce:
** I was not able to reproduce this in any way. It may depend on the RHV environment.

Actual results:

Setting up archive ...
Setting up plugins ...
 caught exception in plugin method "vdsm.setup()"   <<======
 writing traceback to sos_logs/vdsm-plugin-errors.txt
Running plugins. Please wait ...

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1252, in setup
    plug.setup()
  File "/usr/lib/python2.7/site-packages/sos/plugins/vdsm.py", line 159, in setup
    sd_uuids = cli.Host.getStorageDomains()
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 252, in _call
    raise TimeoutError(method, kwargs, timeout)
TimeoutError: Request Host.getStorageDomains with args {} timed out after 60 seconds

Expected results:
No exception.

Additional info:
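For reference, the failing call goes through the vdsm.client API visible in the traceback. Below is a minimal sketch of that call pattern, not taken from the plugin itself: connect()'s timeout argument is an assumption suggested by the 60-second default shown in the error message, and "localhost" is a placeholder.

    from vdsm import client

    # Connect to the local vdsm. The timeout argument is an assumption
    # based on the "timed out after 60 seconds" default in the traceback.
    cli = client.connect("localhost", 54321, timeout=180)

    try:
        # The request that timed out in the traceback above.
        sd_uuids = cli.Host.getStorageDomains()
        print(sd_uuids)
    except client.TimeoutError as e:
        print("vdsm request timed out: %s" % e)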
It would be helpful if you could attach the vdsm.log from the time of the failed getStorageDomains command.
Steffen, if collecting info from vdsm timed out, what do you expect to see in the sosreport instead of the traceback?
(In reply to Nir Soffer from comment #11)
> Steffen, if collecting info from vdsm timed out, what do you expect to see
> in the sosreport instead of the traceback?

I would expect not to see this error, as I would like to have the expected information inside the sosreport.
If this error occurs every time, it would be possible to miss some data needed for analysis.
(In reply to Steffen Froemer from comment #13)
> (In reply to Nir Soffer from comment #11)
> > Steffen, if collecting info from vdsm timed out, what do you expect to
> > see in the sosreport instead of the traceback?
>
> I would expect not to see this error, as I would like to have the expected
> information inside the sosreport.
> If this error occurs every time, it would be possible to miss some data
> needed for analysis.

sosreport cannot guarantee that the information will be in the sosreport. If vdsm is not responsive, information from vdsm cannot be in the sosreport.

I think we have multiple issues:

1. sosreport is using an incorrect timeout for requests that can take a lot of time.
   We should use different timeouts for different requests, so we can get results on a system with a lot of LUNs.

2. sosreport is using getDeviceList incorrectly:

       self.collectVdsmCommand(
           "Host.getDeviceList", cli.Host.getDeviceList)

   getDeviceList must be called with checkStatus=False. Otherwise it will try to check the status of every LUN, which can take many minutes with hundreds of LUNs.

3. sosreport is collecting data in the setup phase.
   It should collect data in the collection phase. I am not sure what the correct way to implement this with sosreport is.

4. sosreport is failing after the first timeout.
   It should continue with the next request. In the worst case, some requests will never complete and we will not have the data for them.

I suggest opening a new bug for each item. A sketch of possible fixes for issues 2 and 4 follows below.
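Not the actual patch, but a minimal sketch of what fixes for issues 2 and 4 could look like inside the plugin, assuming collectVdsmCommand invokes the given callable with no arguments (as the snippet quoted above suggests); collect_vdsm_info and the logging call are hypothetical.

    import logging
    from functools import partial

    from vdsm import client

    def collect_vdsm_info(plugin, cli):
        # Hypothetical collection loop; only collectVdsmCommand and the
        # Host.getDeviceList call mirror the plugin code quoted above.
        requests = [
            ("Host.getStorageDomains", cli.Host.getStorageDomains),
            # Issue 2: checkStatus=False skips probing the status of every
            # LUN, which can take many minutes with hundreds of LUNs.
            ("Host.getDeviceList",
             partial(cli.Host.getDeviceList, checkStatus=False)),
        ]
        for name, func in requests:
            try:
                # Issue 4: a timeout in one request must not abort the rest.
                plugin.collectVdsmCommand(name, func)
            except client.TimeoutError:
                logging.warning("request %s timed out, skipping", name)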
Ala, the attached patch fixes only issue 2. What about the other issues? I think we need a new bug for each issue, or an explanation of how they are resolved.
The original bug is about the error that is fixed in the referenced patch. I will ask Steffen to open new bugs for the other issues.
Ala,

Please provide steps to reproduce when you have them.

Thanks
Nir and Ala,

fixing issue 2 is fine for me. I can't say whether we hit the other issues as well.
As a first step, I would use the patched vdsm module and ask the customer to test it. If they see further issues, I will open a new bugzilla for them. Otherwise we're fine.

Is the patch available somewhere? I would like to use a test version in the customer environment.

Thanks,
Steffen
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{}'] For more info please contact: rhv-devops
(In reply to Steffen Froemer from comment #18)
> Nir and Ala,
>
> fixing issue 2 is fine for me. I can't say whether we hit the other issues
> as well.
> As a first step, I would use the patched vdsm module and ask the customer
> to test it. If they see further issues, I will open a new bugzilla for
> them. Otherwise we're fine.
>
> Is the patch available somewhere? I would like to use a test version in
> the customer environment.

The patch is available in Vdsm 4.20.13.

> Thanks,
> Steffen
(In reply to Raz Tamir from comment #17)
> Ala,
>
> Please provide steps to reproduce when you have them.
>
> Thanks

Add as many devices as you can (30 or more), and generate the sos report on the host by executing the `sosreport` command. No timeout error should be raised during the report generation.

You can also verify that when the storage server is down, there is a timeout but the report is still generated.
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}'] For more info please contact: rhv-devops
Verified with the following code:
-------------------------------------------------
ovirt-engine-4.2.1.3-0.1.el7.noarch
vdsm-4.20.17-11.gite2d6775.el7.centos.x86_64

Verified with the following scenario:
-------------------------------------------------
1. Create a system with more than 30 storage domains
2. Run 'ovirt-log-collector' on the engine

Report is generated. No exceptions thrown.

Moving to VERIFIED
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Found non-acked flags: '{}'] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489