
Bug 1683904

Summary: sosreport executes almost 5000 commands and timed out in networking plugin
Product: Red Hat Enterprise Linux 7
Reporter: Masahiro Matsuya <mmatsuya>
Component: sos
Assignee: Pavel Moravec <pmoravec>
Status: CLOSED ERRATA
QA Contact: Miroslav Hradílek <mhradile>
Severity: medium
Docs Contact:
Priority: urgent
Version: 7.6
CC: agk, astupnik, bmr, cww, fkrska, jjansky, jmaxwell, jraju, mhradile, pamadio, plambri, pmoravec, sbradley
Target Milestone: rc
Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: sos-3.9-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1805079 1805080
Environment:
Last Closed: 2020-09-29 20:55:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1784466, 1805079, 1805080    

Comment 3 Pavel Moravec 2019-03-10 11:12:51 UTC
> "2. Confirm that many network namespaces exists"

There will always be a trade-off between scalability and requirements, e.g.:

- having thousands of namespaces,
- the general requirement to "collect these many commands for each namespace",
- some time needed to execute each command.

Multiplying those values, we can easily exceed the plugin timeout limit.
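
As a rough illustration of that multiplication, here is a minimal sketch. All numbers in it (1000 namespaces, 10 commands per namespace, 100 ms per command, a 600 s timeout) are made-up example values, not measurements from this case, and `estimate_runtime` is a hypothetical helper, not part of sos.

```python
def estimate_runtime(namespaces, commands_per_namespace, ms_per_command, timeout_s):
    """Return (total_ms, exceeds_timeout) for a simple multiplicative estimate."""
    total_ms = namespaces * commands_per_namespace * ms_per_command
    return total_ms, total_ms > timeout_s * 1000


# Hypothetical inputs, chosen only to show the effect of the multiplication.
total_ms, too_slow = estimate_runtime(
    namespaces=1000,
    commands_per_namespace=10,
    ms_per_command=100,
    timeout_s=600,
)
print(f"estimated {total_ms / 1000:.0f} s, exceeds timeout: {too_slow}")
```

Even with a modest per-command cost, the estimate grows linearly in both the number of namespaces and the number of commands per namespace, so either factor alone can push the plugin past its limit.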


What can we change or improve here?

- Limit the number of namespaces (to traverse)? Well, then we would have to decide which names are important and which can be ignored. I don't want to skip some generic ones in favour of some specific ones, and I doubt we can figure out a golden rule to identify them.
- Collect fewer commands for each namespace? Again, a similar problem to the above.
- Improve the time to collect one command? That is rather a python2 issue :-/
- Increase the plugin timeout to let all required data be collected? Though it can take hours for so many combinations of namespaces and cmds.

I.e. I don't see how sosreport can be improved, or in what direction, without breaking more basic requirements (esp. "collect this and that for the few important namespaces we have").


In this context, I raised

https://github.com/sosreport/sos/issues/1585

which can improve things here to some extent: not all such "for each .., collect .." commands will be collected, but *all the rest* will be, and that can be sufficient in many cases, I guess.


Anyway, any particular idea how sosreport could behave better is welcome.

Comment 5 Pavel Moravec 2019-03-29 11:35:21 UTC
Scope of 7.7 closed, rescheduling for potential inclusion in 7.8.


Any idea how in particular to improve sosreport behaviour is welcomed - several contradicting requirements are here.

Comment 6 Jonathan Maxwell 2019-05-24 01:18:42 UTC
(In reply to Pavel Moravec from comment #5)
> Scope of 7.7 closed, rescheduling for potential inclusion in 7.8.
> 
> 
> Any idea how in particular to improve sosreport behaviour is welcomed -
> several contradicting requirements are here.

It could be the "-p" option that is adding to the delay. It means that the command needs to query the process information of that socket and that can increase the execution time. I am not suggesting that we omit the "-p" option because it provides valuable insight. 

Matsuya,

If you change:
                    ns_cmd_prefix + "ss -peaonmi",
                    ns_cmd_prefix + "netstat %s -neopa" % self.ns_wide,

to:

                    ns_cmd_prefix + "ss -eaonmi",
                    ns_cmd_prefix + "netstat %s -neoa" % self.ns_wide,

Does that decrease the time it takes the networking plugin to complete?
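
One way to check this outside of sosreport is to time both variants directly. The sketch below is only an assumption about how such a comparison could be done; `time_command` is a hypothetical helper, and the commands would need the `ip netns exec <namespace>` prefix to be measured inside a given namespace.

```python
import shutil
import subprocess
import time


def time_command(argv, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs,
    or None if the binary is not installed on this host."""
    if shutil.which(argv[0]) is None:
        return None
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)
    return best


# The two ss variants from this comment: with and without "-p".
for variant in (["ss", "-peaonmi"], ["ss", "-eaonmi"]):
    print(" ".join(variant), time_command(variant))
```

Taking the best of several runs reduces noise from caching and scheduling, but a single host with few sockets may not reproduce the delay seen on the affected system.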

Regards

Jon

Comment 16 Pavel Moravec 2019-06-11 11:54:55 UTC
I don't think that tweaking a command option helps sufficiently here, since in the original sosreport I see thousands of commands like:

2019-02-15 13:36:36,353 INFO: [plugin:networking] collecting output of 'ip netns exec qrouter-55acb798-3698-419e-a7b2-00509771c2dc ethtool qg-00431182-a9'
2019-02-15 13:36:36,412 INFO: [plugin:networking] collecting output of 'ip netns exec qrouter-55acb798-3698-419e-a7b2-00509771c2dc ethtool -i qg-00431182-a9'
2019-02-15 13:36:36,469 INFO: [plugin:networking] collecting output of 'ip netns exec qrouter-55acb798-3698-419e-a7b2-00509771c2dc ethtool -k qg-00431182-a9'
2019-02-15 13:36:36,531 INFO: [plugin:networking] collecting output of 'ip netns exec qrouter-55acb798-3698-419e-a7b2-00509771c2dc ethtool -S qg-00431182-a9'

or

2019-02-15 13:37:59,315 INFO: [plugin:networking] collecting output of 'ip netns exec qdhcp-03b8ebf5-6b9e-4520-a61e-4f033d9ffc0c ethtool tap0177f4b3-a1'
2019-02-15 13:37:59,381 INFO: [plugin:networking] collecting output of 'ip netns exec qdhcp-03b8ebf5-6b9e-4520-a61e-4f033d9ffc0c ethtool -i tap0177f4b3-a1'
2019-02-15 13:37:59,443 INFO: [plugin:networking] collecting output of 'ip netns exec qdhcp-03b8ebf5-6b9e-4520-a61e-4f033d9ffc0c ethtool -k tap0177f4b3-a1'
2019-02-15 13:37:59,510 INFO: [plugin:networking] collecting output of 'ip netns exec qdhcp-03b8ebf5-6b9e-4520-a61e-4f033d9ffc0c ethtool -S tap0177f4b3-a1'

being collected.

Some stats from those commands in the sosreport attached to the support case:

command type                          avg (ms)  max (ms)  sum (ms)  count
-------------------------------------------------------------------------
_count_durations.ethtool_HQ.txt             63       103     27754    438
_count_durations.ethtool_-i.txt             63        98     39476    625
_count_durations.ethtool_-k.txt             63        99     39531    625
_count_durations.ethtool_-S.txt             63        86     39561    625
_count_durations.ethtool_tap.txt            63        83     11811    187
_count_durations.ip_addr.txt                71       266     24080    335
_count_durations.ip_rout.txt                70       282     23844    336
_count_durations.iptables-save'_.txt        71       346     24063    336
_count_durations.netstat_-s'.txt           486       719    163499    336
_count_durations.netstat_-W.-.txt          261       601    175809    672
_count_durations.ss_-pea.txt                74      1258     25068    336

avg = average time in ms per command execution
max = maximum time in ms of a single execution
sum = total time in ms of all executions of that command
count = number of such commands executed by the sosreport
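
The comment does not say how these durations were derived. One plausible approach, assumed here, is to treat the gap between consecutive "collecting output of" timestamps in sos.log as the duration of the earlier command. A minimal sketch of that idea, using log lines quoted above as sample input (`command_durations` and `LINE_RE` are hypothetical names):

```python
import re
from datetime import datetime

# Matches lines such as:
# 2019-02-15 13:36:36,353 INFO: [plugin:networking] collecting output of '...'
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO: "
    r"\[plugin:(\w+)\] collecting output of '(.*)'$"
)


def command_durations(log_lines):
    """Yield (command, milliseconds) pairs; the duration of a command is
    approximated by the time until the next matching log line."""
    previous = None
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None:
            prev_stamp, prev_cmd = previous
            yield prev_cmd, round((stamp - prev_stamp).total_seconds() * 1000)
        previous = (stamp, match.group(3))


NS = "qrouter-55acb798-3698-419e-a7b2-00509771c2dc"
SAMPLE = [
    f"2019-02-15 13:36:36,353 INFO: [plugin:networking] collecting output of 'ip netns exec {NS} ethtool qg-00431182-a9'",
    f"2019-02-15 13:36:36,412 INFO: [plugin:networking] collecting output of 'ip netns exec {NS} ethtool -i qg-00431182-a9'",
    f"2019-02-15 13:36:36,469 INFO: [plugin:networking] collecting output of 'ip netns exec {NS} ethtool -k qg-00431182-a9'",
    f"2019-02-15 13:36:36,531 INFO: [plugin:networking] collecting output of 'ip netns exec {NS} ethtool -S qg-00431182-a9'",
]

for cmd, ms in command_durations(SAMPLE):
    print(ms, cmd)
```

The last command in a log has no following line and therefore gets no duration under this approximation.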

So if we want to decrease the sosreport execution time, we should focus on commands like:

2019-02-15 13:28:08,207 INFO: [plugin:networking] collecting output of 'ip netns exec qrouter-d1c111b0-80d0-46b0-97b4-74f72e9f4220 netstat -W -neopa'
2019-02-15 13:28:08,691 INFO: [plugin:networking] collecting output of 'ip netns exec qrouter-d1c111b0-80d0-46b0-97b4-74f72e9f4220 netstat -s'
2019-02-15 13:28:08,765 INFO: [plugin:networking] collecting output of 'ip netns exec qrouter-d1c111b0-80d0-46b0-97b4-74f72e9f4220 netstat -W -agn'
2019-02-15 13:28:09,458 INFO: [plugin:networking] collecting output of 'ip netns exec qdhcp-1f64034d-8be8-4cc4-b6a2-8ef86473efa7 netstat -W -neopa'
..

Those netstat commands alone took 339 seconds. The networking plugin was stopped after 10 minutes, so more than half of the time was spent in those netstat commands.
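
The totals can be re-derived from the stats table above. The sketch below simply re-adds the sum and count columns; the short labels are abbreviations of the file names in the table, and only the command types listed there are covered.

```python
# (label, avg_ms, max_ms, sum_ms, count), copied from the stats table above.
STATS = [
    ("ethtool_HQ", 63, 103, 27754, 438),
    ("ethtool_-i", 63, 98, 39476, 625),
    ("ethtool_-k", 63, 99, 39531, 625),
    ("ethtool_-S", 63, 86, 39561, 625),
    ("ethtool_tap", 63, 83, 11811, 187),
    ("ip_addr", 71, 266, 24080, 335),
    ("ip_rout", 70, 282, 23844, 336),
    ("iptables-save", 71, 346, 24063, 336),
    ("netstat_-s", 486, 719, 163499, 336),
    ("netstat_-W", 261, 601, 175809, 672),
    ("ss_-pea", 74, 1258, 25068, 336),
]

total_ms = sum(row[3] for row in STATS)
total_count = sum(row[4] for row in STATS)
netstat_ms = sum(row[3] for row in STATS if row[0].startswith("netstat"))

print(f"commands run:  {total_count}")
print(f"total time:    {total_ms / 1000:.0f} s")
print(f"netstat share: {netstat_ms / 1000:.0f} s "
      f"({100 * netstat_ms / total_ms:.0f}% of the listed commands)")
```

This gives 4851 commands and about 594 s in total, of which the two netstat rows account for about 339 s, consistent with the "almost 5000 commands" in the summary and the 10-minute plugin timeout.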


Anyway, how reasonable is it to iterate over all of the several hundred network namespaces and collect the output of several cmds for each of them? Is all that data useful?

Comment 30 Pavel Moravec 2019-08-22 14:54:16 UTC
Rescheduling from 7.8 scope for potential inclusion in 7.9. Needinfo (still) pending.

Comment 37 Pavel Moravec 2020-01-13 11:16:31 UTC
Upstream discussion kicked off in

https://github.com/sosreport/sos/issues/1916

My proposal is described there as well.

Comment 38 Pavel Moravec 2020-02-08 08:09:18 UTC
Upstream PR merged via [1]; the code fix will appear in 7.9 and 8.3.

[1] https://github.com/sosreport/sos/commit/c20bd8d489c45401db55cf89bf7d4d0f7623a4fe

Comment 40 Pavel Moravec 2020-02-19 08:45:40 UTC
This will appear in 7.9 for sure (due to rebase to sos 3.9 that contains the fix).

Comment 51 errata-xmlrpc 2020-09-29 20:55:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sos bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4034