Bug 962677
| Summary: | sosreport vdsm plugin takes a long time, causing the ssh server to time out the client session while generating sosreport for logcollector | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jaison Raju <jraju> | ||||||||
| Component: | ovirt-engine-log-collector | Assignee: | Sandro Bonazzola <sbonazzo> | ||||||||
| Status: | CLOSED WORKSFORME | QA Contact: | Ilanit Stein <istein> | ||||||||
| Severity: | medium | Docs Contact: | |||||||||
| Priority: | urgent | ||||||||||
| Version: | 3.3.0 | CC: | acathrow, bazulay, hateya, iheim, jkt, jraju, knesenko, kroberts, lpeer, lyarwood, mgoldboi, mpavlik, pstehlik, Rhev-m-bugs, sbonazzo, sputhenp | ||||||||
| Target Milestone: | --- | Keywords: | Regression, Triaged | ||||||||
| Target Release: | 3.3.0 | ||||||||||
| Hardware: | All | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | integration | ||||||||||
| Fixed In Version: | rhevm-log-collector-3.3.0-0.1.master.el6ev | Doc Type: | Bug Fix | ||||||||
| Doc Text: |
On some hypervisors the sosreport collection could take more than 900 seconds, which is the default duration to keep the ssh client connection alive. The log collector now calls ssh with the -oServerAliveInterval=600 parameter, so the client sends a keepalive message after every 600 seconds of inactivity, and the connection is dropped only if the number of consecutive unanswered messages reaches ServerAliveCountMax (default is 3).
|
Story Points: | --- | ||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2013-09-10 08:42:09 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | 1005066 | ||||||||||
| Bug Blocks: | |||||||||||
| Attachments: |
|
||||||||||
|
Description
Jaison Raju
2013-05-14 08:43:57 UTC
Can you please check what is consuming the time in this sos generation? You can try running sosreport manually with -v. I'm afraid that playing with the timeout may be a good workaround, but we need to understand the source of the problem.

Hello,
The sosreport generation took almost 23 minutes on the RHEV host. I do not have the sosreport -v output at the moment. Is the sosreport -v output required?
Jaison R

(In reply to Jaison Raju from comment #2)
> Hello ,
>
> The sosreport generation took almost 23 mins from the
> rhev host.
>
> Although i do not have sosreport -v output now .
> Is sosreport -v output required ?
>
>
> Jaison R

Yes please. We need to understand the source of the problem.

(In reply to Jaison Raju from comment #0)
> Additional info:
> Increasing ClientAliveInterval to 2000 resolve the issue .
> Can we increase this value to say 2000 to avoid such issues ?

I think we can avoid the configuration change by calling ssh with -o ServerAliveInterval=600 (default=0), informing the ssh server that we're still alive. This will lead to the connection being closed at 1800 seconds if we don't also change ServerAliveCountMax (default=3). However, I would like to understand why sosreport is taking so long.

Pushed a patch to upstream master adding -oServerAliveInterval=600 to the ssh command line as a workaround. Still waiting for the sosreport -v output for further investigation. More precisely:

/usr/sbin/sosreport -v --batch -k general.all_logs=True -o libvirt,vdsm,general,networking,hardware,process,yum,filesys,devicemapper,selinux,kernel,memory,rpm,gluster --profile

We also need sosprofile.log and sos.log from the sos_logs directory. It would be great to have the whole sosreport.

Jaison, have you got any news?

Hello,
Actually, the customer has upgraded their RHEV environment and also upgraded/installed their new hypervisors. I will provide you the output by Monday. I doubt the same behaviour will be seen on the newly installed hosts. I will update the Bugzilla as soon as I get the output from the customer.
Regards,
Jaison R

Jaison, have you got any news?

The patch adding -oServerAliveInterval=600 to the ssh command line as a workaround has been merged into upstream master for 3.3.0. When new information becomes available we can see whether something better can be done.

Changed component to log collector since the workaround is done there.

It seems that I've hit the same BZ on is13. For me, collecting logs from the hypervisor never finishes. I've let it run for 1 hour.

[root@mp-rhevm33 ~]# rpm -qa | grep collector
rhevm-log-collector-3.3.0-0.2.rc.el6ev.noarch

I am attaching the console output of the command from comment 6 and also the tar file with the logs it produced.

[root@dell-r210ii-05 ~]# /usr/sbin/sosreport -v --batch -k general.all_logs=True -o libvirt,vdsm,general,networking,hardware,process,yum,filesys,devicemapper,selinux,kernel,memory,rpm,gluster --profile

Created attachment 794235 [details]
manual_sos_report
Created attachment 794237 [details]
sos_report_output_to_console.txt
Created attachment 794266 [details]
log_collector_verbose
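
The manual run requested in the comments above can be scripted so that the elapsed time is recorded together with the exit status. The following is a minimal Python sketch and not part of the log collector: the helper names `build_sosreport_cmd` and `timed_run` are hypothetical, and only the sosreport command line itself is taken from this report.

```python
import shlex
import subprocess
import time

# Plugin list copied from the sosreport command quoted in this report.
PLUGINS = (
    "libvirt", "vdsm", "general", "networking", "hardware", "process", "yum",
    "filesys", "devicemapper", "selinux", "kernel", "memory", "rpm", "gluster",
)


def build_sosreport_cmd(plugins=PLUGINS):
    """Return the argv list for the verbose, profiled sosreport run.

    Hypothetical helper: it only reassembles the command shown in the thread.
    """
    return [
        "/usr/sbin/sosreport", "-v", "--batch",
        "-k", "general.all_logs=True",
        "-o", ",".join(plugins),
        "--profile",
    ]


def timed_run(argv, runner=subprocess.run, clock=time.monotonic):
    """Run argv and return (exit status, elapsed seconds).

    `runner` and `clock` are injectable so the function can be exercised
    without actually invoking sosreport.
    """
    start = clock()
    result = runner(argv)
    return result.returncode, clock() - start


if __name__ == "__main__":
    cmd = build_sosreport_cmd()
    print("Running:", shlex.join(cmd))
    status, elapsed = timed_run(cmd)
    print("exit status %d after %.0f seconds" % (status, elapsed))
```

Comparing the elapsed time against the 900-second figure from the Doc Text would show whether a given host is at risk of hitting the original timeout.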
Verification trials on is11:

1) Stopped the sshd service on the host while the log collector's sos run was in progress. Got a different error after several minutes: "ERROR: Failed to collect logs from: 10.35.109.14; ssh: connect to host 10.35.109.14 port 22: Connection refused".
2) Stopped the sos command on the hypervisor, but it had no effect: after 50 min the log collector was still in wait status. Continuing the sos command simply let the log collector finish successfully.
3) Added an iptables drop rule on rhevm: 'iptables -A OUTPUT -d <hypervisor ip address> -j DROP'. It didn't affect the log collector: after 40 min the log collector was still waiting. Cancelling the drop let the log collector continue successfully.
4) Added ServerAliveInterval = 60 and ServerAliveCountMax = 1 to /etc/ssh/ssh_config, ran the log collector, and powered down the hypervisor in the middle of the run. The log collector was still running after 5 min.

None of the options above generated the error mentioned in the description after 30 min.

(In reply to Ilanit Stein from comment #27)
> All these above options didn't generate the error mentioned in the
> description after 30 min.

So I think you can say that the solution is verified: the timeout no longer affects the ssh connection.

Also tried adding 'iptables -A OUTPUT -d <rhevm ip> -j DROP' on the host, and the log collector failed with "DEBUG: STDERR(Timeout, server not responding. ) ERROR: Failed to collect logs from: 10.35.109.14; Timeout, server not responding."

The error mentioned in the bug description didn't reproduce. Closing the bug as it works for me.

(In reply to Ilanit Stein from comment #29)
> Error mentioned in the bug description didn't reproduce.
> Closing bug as it works for me.

I would have moved it to verified. A change has been made to ensure the ssh connection stays alive. With a 3.2.z log collector you should be able to reproduce this by stopping the sos process on the hypervisor.

This bug is currently attached to errata RHBA-2013:15255. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug. For further details on the Cause, Consequence, Fix, Result format please refer to: https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance. |
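
The figures that recur in this report (600 seconds, 3 probes, 1800 seconds) follow from how the OpenSSH client combines ServerAliveInterval and ServerAliveCountMax: with an unresponsive server, the client gives up after roughly interval × count seconds. The sketch below only illustrates that arithmetic. The function names `keepalive_args` and `unresponsive_timeout` are hypothetical and are not taken from the log collector source.

```python
def keepalive_args(interval=600, count_max=None):
    """Build ssh keepalive options in the -oName=value form used in this report.

    Hypothetical helper; count_max is omitted unless given, so ssh falls back
    to its own default (3) or to a value from ssh_config.
    """
    args = ["-oServerAliveInterval=%d" % interval]
    if count_max is not None:
        args.append("-oServerAliveCountMax=%d" % count_max)
    return args


def unresponsive_timeout(interval, count_max=3):
    """Approximate seconds before ssh disconnects from an unresponsive server.

    An interval of 0 is the ssh default and disables keepalive messages,
    so no client-side timeout applies and None is returned.
    """
    if interval <= 0:
        return None
    return interval * count_max


if __name__ == "__main__":
    print(keepalive_args())                # ['-oServerAliveInterval=600']
    print(unresponsive_timeout(600))       # 1800
    print(unresponsive_timeout(60, 1))     # 60
```

One possible reading of trial 4, offered as an assumption rather than a confirmed diagnosis: ssh gives command-line options precedence over ssh_config, so if the tested build already passed -oServerAliveInterval=600, the effective timeout with ServerAliveCountMax = 1 would be about 600 seconds, longer than the 5 minutes observed.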