Bug 962677 - sosreport vdsm plugin takes a long time, causing the ssh server to time out the client session while generating sosreport for logcollector
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-log-collector
Version: 3.3.0
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Sandro Bonazzola
QA Contact: Ilanit Stein
Whiteboard: integration
Keywords: Regression, Triaged
Depends On: 1005066
Blocks:
Reported: 2013-05-14 04:43 EDT by Jaison Raju
Modified: 2015-09-22 09 EDT

See Also:
Fixed In Version: rhevm-log-collector-3.3.0-0.1.master.el6ev
Doc Type: Bug Fix
Doc Text:
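On some hypervisors, sosreport collection could take more than 900 seconds, which is the idle limit the host's sshd applies to client sessions (ClientAliveInterval 900), so the connection was closed before collection finished. The log collector now calls ssh with the -oServerAliveInterval=600 parameter, so the client sends a keepalive message every 600 seconds and the session no longer times out. The client gives up only after the number of consecutive unanswered keepalives set by ServerAliveCountMax (default is 3).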
Last Closed: 2013-09-10 04:42:09 EDT
Type: Bug


Attachments
manual_sos_report (11.05 MB, application/x-xz)
2013-09-05 08:00 EDT, Martin Pavlik
sos_report_output_to_console.txt (3.04 KB, text/plain)
2013-09-05 08:02 EDT, Martin Pavlik
log_collector_verbose (7.02 KB, text/plain)
2013-09-05 08:43 EDT, Martin Pavlik


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 16293 None None None Never

Description Jaison Raju 2013-05-14 04:43:57 EDT
Description of problem:
rhevm-log-collector fails due to an ssh client timeout.

The vdsm plugin in sosreport takes a long time, and
sosreport generation takes more than 20 minutes.

But the RHEV host limits ssh client sessions to 15 minutes.
/etc/ssh/sshd_config:
<snip>
ClientAliveInterval 900
ClientAliveCountMax 0
</snip>

Hence, while the sosreport is being generated by rhevm-log-collector on the RHEV host,
the ssh session times out due to inactivity and rhevm-log-collector fails to
collect that host's logs.
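
A minimal sketch of how the limit quoted above could be read back from a host's configuration file. The helper name sshd_setting is made up for illustration, and the sketch assumes plain "Keyword value" lines with no Match blocks or Include directives.

```shell
#!/bin/sh
# Hypothetical helper (not part of rhevm-log-collector): print the first
# value set for KEYWORD in an sshd_config-style FILE. sshd keeps the first
# value it obtains for a keyword, and keywords are case-insensitive.
# Assumption: "Keyword value" lines only; no Match blocks, no Include.
sshd_setting() {
    awk -v key="$2" 'tolower($1) == tolower(key) { print $2; exit }' "$1"
}

# Example: sshd_setting /etc/ssh/sshd_config ClientAliveInterval
```

On a live host, running sshd -T as root prints the effective configuration, which also covers defaults that the file leaves unset.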


Version-Release number of selected component (if applicable):


How reproducible:
On the customer's RHEV host: always.

Steps to Reproduce:
Run the following on RHEVM
# rhevm-log-collector collect

Actual results:

[root@rhevm ~]# rhevm-log-collector collect
Please provide the REST API password for the admin@internal RHEV-M user (CTRL+D
About to collect information from 11 hypervisors. Continue? (Y/n): y

ERROR: Failed to collect logs from: 10.1.2.185; Connection to 10.1.2.185 closed  by remote host.

INFO: finished collecting information from 10.1.2.185
Please provide the password for the PostgreSQL user, postgres, to dump the engine PostgreSQL database instance (CTRL+D to skip):
INFO: Gathering PostgreSQL the RHEV-M database and log files from localhost...
INFO: Gathering RHEV-M information...

Expected results:
[root@rhevm ~]# rhevm-log-collector collect
Please provide the REST API password for the admin@internal RHEV-M user (CTRL+D
About to collect information from 11 hypervisors. Continue? (Y/n): y

INFO: finished collecting information from 10.1.2.185
Please provide the password for the PostgreSQL user, postgres, to dump the engine PostgreSQL database instance (CTRL+D to skip):
INFO: Gathering PostgreSQL the RHEV-M database and log files from localhost...
INFO: Gathering RHEV-M information...

Additional info:
Increasing ClientAliveInterval to 2000 resolves the issue.
Can we increase this value to, say, 2000 to avoid such issues?
Comment 1 Moran Goldboim 2013-06-12 17:02:30 EDT
Can you please check what is consuming the time in this sos generation? You can try running sosreport manually with -v.

I'm afraid that playing with the timeout may be a good workaround, but we need to understand the source of the problem.
Comment 2 Jaison Raju 2013-06-20 06:10:07 EDT
Hello,

The sosreport generation took almost 23 minutes on the
RHEV host.

However, I do not have the sosreport -v output now.
Is the sosreport -v output required?


Jaison R
Comment 3 Sandro Bonazzola 2013-06-25 04:00:51 EDT
(In reply to Jaison Raju from comment #2)
> Hello,
> 
> The sosreport generation took almost 23 minutes on the
> RHEV host.
> 
> However, I do not have the sosreport -v output now.
> Is the sosreport -v output required?
> 
> 
> Jaison R

Yes please. We need to understand the source of the problem.
Comment 4 Sandro Bonazzola 2013-07-01 05:23:41 EDT
(In reply to Jaison Raju from comment #0)

> Additional info:
> Increasing ClientAliveInterval to 2000 resolves the issue.
> Can we increase this value to, say, 2000 to avoid such issues?

I think we can avoid the configuration change by calling ssh with -o ServerAliveInterval=600 (default=0), informing the ssh server we're still alive.
This will lead to the connection being closed at 1800 seconds if we don't also change ServerAliveCountMax (default=3).

However, I would like to understand why sosreport is taking so long.
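
A minimal sketch of the workaround proposed above, assuming it is applied as plain ssh command-line options. The function names are made up for illustration and this is not the log collector's actual code. Per ssh_config(5), ServerAliveCountMax counts keepalives that go unanswered, so the 1800 seconds is how long an unresponsive server would be tolerated; a busy but reachable host keeps answering and stays connected.

```shell
#!/bin/sh
# Hypothetical helper: seconds of server silence after which the ssh client
# gives up, i.e. ServerAliveInterval * ServerAliveCountMax.
keepalive_budget() {
    echo $(( $1 * $2 ))
}

# Hypothetical wrapper: run a remote command with client-side keepalives.
# The option values are the ones discussed in this comment.
ssh_keepalive() {
    host=$1
    shift
    ssh -o ServerAliveInterval=600 -o ServerAliveCountMax=3 "$host" "$@"
}

# Example: keepalive_budget 600 3    (prints 1800)
# Example: ssh_keepalive root@10.1.2.185 /usr/sbin/sosreport --batch
```

The keepalives travel from client to server inside the encrypted channel, so they should count as client activity for the host's ClientAliveInterval check without touching sshd_config on the hypervisors.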
Comment 5 Sandro Bonazzola 2013-07-01 08:45:39 EDT
Pushed a patch to upstream master adding -oServerAliveInterval=600 to the ssh command line as a workaround.
Still waiting for the sosreport -v output for further investigation.
Comment 6 Sandro Bonazzola 2013-07-01 09:18:00 EDT
More precisely:

 /usr/sbin/sosreport -v --batch -k general.all_logs=True -o libvirt,vdsm,general,networking,hardware,process,yum,filesys,devicemapper,selinux,kernel,memory,rpm,gluster --profile

and we need sosprofile.log and sos.log from the sos_logs directory.

It would be great to have the whole sosreport.
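
To put a number on how long the run takes, the command above could be wrapped in a small timer. This is only a sketch: the function name timed is made up here, and it assumes a date that supports +%s (GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical helper: run a command, report wall-clock seconds on stderr,
# and pass the command's exit status through unchanged.
timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $(( end - start ))s (exit status $status)" >&2
    return "$status"
}

# Example, using the command line from this comment:
#   timed /usr/sbin/sosreport -v --batch -k general.all_logs=True \
#       -o libvirt,vdsm,general,networking,hardware,process,yum,filesys,devicemapper,selinux,kernel,memory,rpm,gluster \
#       --profile
```

An elapsed time above 900 seconds would be consistent with the timeout in the description, although it would not say which plugin is slow; that is what sos.log and sosprofile.log are for.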
Comment 7 Sandro Bonazzola 2013-07-10 03:08:35 EDT
Jaison, have you got any news?
Comment 8 Jaison Raju 2013-07-12 01:55:48 EDT
Hello,

Actually, the customer has upgraded their RHEV environment and
also upgraded/installed their new hypervisors.

I will provide you the output by Monday.
I doubt the same behaviour will be seen on the newly installed
hosts.
I will update the bugzilla as soon as I get the output from the customer.

Regards,
Jaison R
Comment 9 Sandro Bonazzola 2013-07-16 02:31:52 EDT
Jaison, have you got any news?
Comment 10 Sandro Bonazzola 2013-07-17 03:07:58 EDT
The patch adding -oServerAliveInterval=600 to the ssh command line as a workaround has been merged into upstream master for 3.3.0.
When new info is available, we can see whether something better can be done.
Comment 11 Sandro Bonazzola 2013-07-18 03:40:24 EDT
Changed the component to log collector, since the workaround is done there.
Comment 16 Martin Pavlik 2013-09-05 07:58:51 EDT
It seems that I've hit the same BZ on is13. For me, collecting logs from the hypervisor never finishes; I let it run for 1 hour.

[root@mp-rhevm33 ~]# rpm -qa | grep collector
rhevm-log-collector-3.3.0-0.2.rc.el6ev.noarch

I am attaching the console output of the command from comment 6 and also the tar file with the logs it produced.

[root@dell-r210ii-05 ~]# /usr/sbin/sosreport -v --batch -k general.all_logs=True -o libvirt,vdsm,general,networking,hardware,process,yum,filesys,devicemapper,selinux,kernel,memory,rpm,gluster --profile
Comment 17 Martin Pavlik 2013-09-05 08:00:24 EDT
Created attachment 794235
manual_sos_report
Comment 18 Martin Pavlik 2013-09-05 08:02:50 EDT
Created attachment 794237
sos_report_output_to_console.txt
Comment 22 Martin Pavlik 2013-09-05 08:43:24 EDT
Created attachment 794266
log_collector_verbose
Comment 27 Ilanit Stein 2013-09-09 10:22:46 EDT
Verification trials on is11:

1) Stopped the sshd service on the host while the log collector's sos run was in progress -
got a different error after several minutes: "ERROR: Failed to collect logs from: 10.35.109.14; ssh: connect to host 10.35.109.14 port 22: Connection refused".

2) Stopped the sos command on the hypervisor, but it had no effect: after 50 min the log collector was still waiting. Continuing sos just let the log collector finish successfully.

3) iptables drop on rhevm: 'iptables -A OUTPUT -d <hypervisor ip address> -j DROP', but it didn't affect the log collector; after 40 min the log collector was still waiting. Cancelling the drop let the log collector continue successfully.

4) Added ServerAliveInterval = 60 and ServerAliveCountMax = 1 to /etc/ssh/ssh_config.
Ran the log collector and powered down the hypervisor in the middle of the run - the log collector was still running after 5 min.

None of the above produced the error mentioned in the description after 30 min.
Comment 28 Sandro Bonazzola 2013-09-09 10:59:46 EDT
(In reply to Ilanit Stein from comment #27)

> All these above options didn't generate the error mentioned in the
> description after 30 min.

So I think you can say that the solution is verified; the timeout no longer affects the ssh connection.
Comment 29 Ilanit Stein 2013-09-10 04:42:09 EDT
Also tried adding 'iptables -A OUTPUT -d <rhevm ip> -j DROP' on the host,
and the log collector failed with "DEBUG: STDERR(Timeout, server not responding.
) ERROR: Failed to collect logs from: 10.35.109.14; Timeout, server not responding."

The error mentioned in the bug description didn't reproduce.
Closing the bug as it works for me.
Comment 30 Sandro Bonazzola 2013-09-10 05:17:04 EDT
(In reply to Ilanit Stein from comment #29)

> Error mentioned in the bug description didn't reproduce. 
> Closing bug as it works for me.

I would have moved it to verified. A change has been made to ensure the ssh connection stays alive. With a 3.2.z log collector you should be able to reproduce this by stopping the sos process on the hypervisor.
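
The reproduction suggested above could be sketched with job-control signals. This assumes "stopping" means SIGSTOP/SIGCONT rather than killing the process. The helper name and the 1000-second pause (anything above the host's 900-second limit) are illustrative choices, not taken from the thread.

```shell
#!/bin/sh
# Hypothetical helper: freeze process PID for SECONDS, then let it resume.
pause_for() {
    kill -s STOP "$1" || return 1
    sleep "$2"
    kill -s CONT "$1"
}

# Example on the hypervisor while the log collector is waiting
# (pgrep -o picks the oldest matching process):
#   pause_for "$(pgrep -o -f sosreport)" 1000
```

With a collector that lacks the keepalive option, the ssh session would be expected to sit idle past the limit during the pause; with the patched collector it should survive and finish once sos resumes.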
Comment 31 Charlie 2013-11-27 20:16:29 EST
This bug is currently attached to errata RHBA-2013:15255. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.
