Red Hat Bugzilla – Bug 1463509
sosreport should not run oc adm diagnostics by default
Last modified: 2018-04-10 14:03:36 EDT
Description of problem:
When running sosreport on an OpenShift cluster master node (with oc whoami reporting system:admin), sosreport will also run the origin plugin, which in turn runs oc adm diagnostics. This times out after 5 minutes, at least in disconnected installations, due to:
https://bugzilla.redhat.com/show_bug.cgi?id=1439142
and in some setups also due to:
https://bugzilla.redhat.com/show_bug.cgi?id=1431588

Another notable issue is that sosreport should make only a minimal effort to collect existing information; oc adm diagnostics creates lots of namespaces and pods in the OCP cluster, and during a time of crisis that is undesirable. Please avoid running oc adm diagnostics from sosreport.

Version-Release number of selected component (if applicable):
sos-3.3-5.el7_3.noarch
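As a stop-gap while this is being discussed (an editorial suggestion, not part of the original report), the origin plugin can simply be excluded from a given sosreport run, assuming the standard sos plugin-selection option:

# Workaround sketch: skip the origin plugin (and its 'oc adm diagnostics' run)
# for this sosreport invocation; all other plugins still run normally.
sosreport -n origin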
This command is from a commit from Pep, who maintains the OpenShift components in sos:

commit dc37f9f720863cc981447e491fb470260e6636e0
Author: Pep Turró Mauri <pep@redhat.com>
Date:   Tue Jun 28 23:12:45 2016 +0200

    [origin] New plugin for OpenShift Origin / OSE 3.x

    Add a new plugin to collect information about OpenShift Origin:

      https://github.com/openshift/origin
      https://www.openshift.org

    and related products like Atomic Platform / OpenShift Enterprise 3+.

    Fixes: #612
    Fixes: RHBZ 1246423

    Signed-off-by: Pep Turró Mauri <pep@redhat.com>

So I'd really like a bit more confirmation that this is both not useful and also potentially harmful before we rip that out. I note that the comment in the plugin describes it as experimental - if there are bugs affecting some versions then those should be fixed, worked around, or, if that is not possible, handled specially. It seems that the command is intended to generate useful diagnostic information, so:

> One notable issue is also that sosreport should do the minimal effort to
> collect existing information,

Clearly the maintainers of sos cannot be familiar with every single component on which we report - we have well over two hundred plugins collecting thousands of individual data items today - so we rely on guidance from our subsystem contributors and maintainers, and they have the greatest influence over what goes in and what is left out.

If you can discuss your concerns and suggestions with Pep (either use this bug, or open an issue upstream), I am sure we can come to a solution that will work well for all users.
(In reply to Bryn M. Reeves from comment #2)
> If you can discuss your concerns and suggestions with Pep (either use this
> bug, or open an issue upstream), I am sure we can come to a solution that
> will work well for all users.

I'm happy to participate in the discussion, e.g. here. Thanks.
Adding Pep on CC.
Some background about why the plugin calls 'oadm diagnostics':

a) well, it should provide useful diagnostics information
b) OpenShift evolves very quickly and it's likely that changes specific to certain versions are better captured by corresponding updates to 'oadm diagnostics' than via the sos plugin
c) there are some checks that are not easy to do from within sosreport and that 'oadm diagnostics' can handle much better (for example the network connectivity checks across pods that you mention)

I believe that (a) still holds true; if 'oadm diagnostics' is not providing useful information please let us know - probably via a bug in the OpenShift product (CLI component). And with that assumption, it would still make sense to try to collect the information it generates...

The bugs that you mention should definitely be fixed (one is marked as an RFE, which I think we could argue about), but these issues are not specific to the sos plugin... do we want to avoid invoking command X from a sos plugin when X has a bug, then enable it back when it's fixed?

Having said that, one thing we could do is to customize how the command gets invoked. For example: it has a --prevent-modifications option that would prevent any attempt to make changes, like deploying pods for testing. We could set that on by default maybe?

Personally I believe that the "intrusive" tests that the tool performs are simple and safe enough (it's not "lots of projects", it's 2, the minimum to validate namespace isolation) and do provide significant value: several support requests are related to networking within the SDN, and having a diagnostic of that is valuable... but I'd love more thoughts on that. I'll ask the support team to step in and provide feedback too.

I'll assign the bug to myself to follow up with any changes needed; feel free to change that if preferred.
Just if I got it right:

(In reply to Marko Myllynen from comment #0)
> Description of problem:
> When running sosreport on an OpenShift cluster master node (and oc whoami
> saying system:admin) sosreport will run also the origin plugin which in turn
> will run oc adm diagnostics. This will timeout after 5 minutes at least in
> disconnected installations due to:

Is it supposed/intended/welcomed/required/beneficial to run that command on an OpenShift cluster master node? If we get some useful data from it, sosreport shall call it (and the tool should be robust enough not to time out under specific situations). If calling the cmd on that node (and on a given setup like a disconnected installation) does not bring anything useful, let's formally describe how to identify such a condition and sosreport will stop calling the cmd in that situation.

I expect the 1st option is valid, am I right?
Thanks for looking into this. I'm all for providing automated tools to collect all the needed information for troubleshooting in one go (see related bug 1456380), but here the tool at least currently gets stuck for a long time without providing any additional data, and even if this were a connected installation someone could kill it, leaving extra namespaces behind in the cluster, which might lead to confusion later on.

I think the above suggestion to make this configurable is a good compromise, perhaps disabled by default until the above issues are fixed and then flipped to enabled by default? This way also in the long run those who want to avoid this (due to the current situation with the environment, policies, or such) can opt out, but in the general case the data will be available to assist with troubleshooting.

Thanks.
(In reply to Marko Myllynen from comment #7)
> Thanks for looking into this.

Thanks for your report and feedback!

> I'm all for providing automated tools to collect all the needed information
> for troubleshooting at one go (see related bug 1456380) but here the tool at
> least currently gets stuck for a long time without providing any additional
> data

So, that's a bug in "adm diagnostics" that should get fixed. In fact, it's more than likely that the fix for "oc" will be released much sooner than a change in the sos plugin that calls it.

> I think the above suggestion to make this configurable is a good compromise,
> perhaps disabled by default

The problem with disabled by default is that experience says people will just not enable it. I'd vote for enabled by default here... having the option to disable that specific check, or tune it (e.g. disable network checks) in case the default behaviour is not desirable.

> until the above issues are fixed and then flipped enabled by default?

We can only change the behaviour of the plugin every so often - and in this case, as mentioned above: if we were to disable that check due to the referenced bug, we would release the change after its motivation had already gone away...

> This way also in the long run those who want to avoid this (due to current
> situation with the environment, policies, or such) can opt-out but in the
> general case the data will be available to assist with troubleshooting.

I think an opt-out option should be the way to go here, i.e.: keep the current behaviour as default, and add an option to tweak the call to "adm diagnostics" with 3 choices: run with default options (enabled by default), run without diag pods, or disable completely. Then if you hit an issue or want to disable that check for any reason you can do so.

Does that sound like a good plan?

Support collects a fair number of sosreports from OpenShift systems regularly and AFAIK this is the first report of this check causing trouble due to a combination of bugs and environment. This is why I believe we should keep the current behaviour by default. Maybe I'm missing details about the impact though, so if there's a stronger need to disable this please let me know.
(In reply to Josep 'Pep' Turro Mauri from comment #8)
> (In reply to Marko Myllynen from comment #7)
>
> > I think the above suggestion to make this configurable is a good compromise,
> > perhaps disabled by default
>
> The problem with disabled by default is that experience says that people
> will just not enable it. I'd vote for enabled by default here... having the
> option to disable that specific check, or tune it (e.g. disable network
> checks) in case the default behaviour is not desirable.
>
> > until the above issues are fixed and then flipped enabled by default?
>
> We can only change the behaviour of the plugin every so often - and in this
> case, as mentioned above: if we were to disable that check due to the
> referenced bug, we would release the change after its motivation would have
> been gone...

I see, this is a fair point.

> > This way also in the long run those who want to avoid this (due to current
> > situation with the environment, policies, or such) can opt-out but in the
> > general case the data will be available to assist with troubleshooting.
>
> I think an opt-out option should be the way to go here, i.e: keep the
> current behaviour as default, and add an option to tweak the call to "adm
> diagnostics" with 3 choices: run with default options (enabled by default),
> run without diag pods, or disable completely. Then if you hit an issue or
> want to disable that check for any reason you can do so.
>
> Does that sound like a good plan?

Yes, I think this is a solid plan; these choices should provide enough options for everyone around this.

Thanks.
This is what I see with latest 3.5 in a disconnected installation:

network-diag-global-ns-bbps6   network-diag-test-pod-3wxh2   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-4vggr   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-7w8r6   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-gbplv   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-pc6sh   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-rhzgr   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-s6w3k   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-sk1wf   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-t4tvx   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-w8h6s   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-0lc62   0/1   ImagePullBackOff   0   5m
network-diag-global-ns-ltn0f   network-diag-test-pod-0vntf   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-5lgrv   0/1   ErrImagePull       0   5m
network-diag-global-ns-ltn0f   network-diag-test-pod-l34vn   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-qb75t   0/1   ErrImagePull       0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-qr2qv   0/1   ErrImagePull       0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-s5ltv   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-tg2gf   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-vgg21   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-z8dl9   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-02wwc   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-2440q   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-7wb6v   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-jf19j   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-nb3ns   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-t9zcf   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-v8vjf   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-wbrg9   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-x7thr   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-xz3k7   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-09ccx   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-1s5st   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-6kwth   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-75rcs   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-bklhs   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-ck0x8   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-h953b   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-qn6jm   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-t3s6j   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-v2nzq   0/1   ImagePullBackOff   0   6m

These are cleaned up later if nobody kills sosreport. Thanks.
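As a side note (editorial, not from the original comment): if sosreport is killed before the diagnostics finish and the namespaces are left behind, they can be removed by hand; a rough sketch, assuming the network-diag-* naming seen above:

# List the diagnostic namespaces left behind by the network check
oc get namespaces | grep network-diag

# Delete them once confirmed they contain only the stuck test pods
# (namespace names here are the examples from the listing above)
oc delete namespace network-diag-global-ns-bbps6 network-diag-ns-38v11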
(In reply to Marko Myllynen from comment #9)
> (In reply to Josep 'Pep' Turro Mauri from comment #8)
> > I think an opt-out option should be the way to go here, i.e: keep the
> > current behaviour as default, and add an option to tweak the call to "adm
> > diagnostics" with 3 choices: run with default options (enabled by default),
> > run without diag pods, or disable completely. Then if you hit an issue or
> > want to disable that check for any reason you can do so.
> >
> > Does that sound like a good plan?
>
> Yes, I think this is a solid plan, these choices should provide enough
> options for everyone around this.
>
> Thanks.

Pep, could you please file an upstream PR for this (esp. if you want to have it in RHEL 7.5)? I don't see where "oc adm diagnostics" is called for individual pods (or do I misunderstand which cmd shall not be called for the pods via that option?).
Sorry for the delay, here's the PR: https://github.com/sosreport/sos/pull/1098

(In reply to Josep 'Pep' Turro Mauri from comment #8)
> I think an opt-out option should be the way to go here, i.e: keep the
> current behaviour as default, and add an option to tweak the call to "adm
> diagnostics" with 3 choices: run with default options (enabled by default),
> run without diag pods, or disable completely. Then if you hit an issue or
> want to disable that check for any reason you can do so.

The suggested implementation is 2 options (boolean):

- diag (default true): run "oc adm diagnostics"
- diag-prevent (default false): set "--prevent-modifications" when invoking diag

The --prevent-modifications option would prevent "oc adm diagnostics" from attempting pod deployments, including the ones in the network check that were affected by the bug mentioned earlier.

Does that sound ok? Feel free to comment here or in the PR.
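To make the mapping concrete (an illustrative sketch, not text from the PR), the two options translate into roughly the following behaviour; the exact command line is whatever the plugin builds, and the flag spelling here follows the PR text (as found during verification further down, oc itself expects the singular --prevent-modification):

# origin.diag=on (default): the plugin captures the diagnostics output, roughly
oc --config /etc/origin/master/admin.kubeconfig adm diagnostics -l 0

# origin.diag-prevent=on: the prevent-modifications flag is appended, so no
# test projects/pods are created during collection
oc --config /etc/origin/master/admin.kubeconfig adm diagnostics -l 0 --prevent-modifications true

# origin.diag=off: the command is not run at all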
(In reply to Josep 'Pep' Turro Mauri from comment #13)
> The suggested implementation is 2 options (boolean):
>
> - diag (default true): run "oc adm diagnostics"
> - diag-prevent (default false): set "--prevent-modifications" when invoking
>   diag
>
> The --prevent-modifications option would prevent "oc adm diagnostics" to
> attempt pod deployments, including the ones in the network check that were
> affected by the bug mentioned earlier.
>
> Does that sound ok? Feel free to comment here or in the PR.

Yes, sounds perfect, thanks!
Marko, could you please provide verification steps or verify the BZ yourself? (A brew build with the sos package follows.)
(asking Pep the same for verification steps)
Fixed via sos 3.5 rebase.
(In reply to Pavel Moravec from comment #16)
> Marko, could you please provide verification steps or verify the BZ by
> yourself (brew build with sos package follows)?

There are three different ways to run sosreport now:

# Disable oc adm diagnostics completely
sosreport -o origin -k origin.diag=off

# Disable oc adm diagnostics initiated modifications
sosreport -o origin -k origin.diag-prevent=on

# Run full oc adm diagnostics
sosreport -o origin

It seems that there's a typo causing the second alternative to fail:

# cat sos_commands/origin/oc_--config_.etc.origin.master.admin.kubeconfig_adm_diagnostics_-l_0_--prevent-modifications_true
Error: unknown flag: --prevent-modifications

Usage:
  oc adm diagnostics [options]

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --host=false: If true, look for systemd and journald units even without master/node config
      --images='registry.access.redhat.com/openshift3/ose-${component}:${version}': Image template for DiagnosticPod to use in creating a pod
      --latest-images=false: If true, when expanding the image template, use latest version, not release version
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --network-logdir='/tmp/openshift/': Path to store network diagnostic results in case of errors
      --network-pod-image='registry.access.redhat.com/openshift3/ose': Image to use for network diagnostic pod
      --network-test-pod-image='registry.access.redhat.com/openshift3/ose-deployer': Image to use for network diagnostic test pod
      --network-test-pod-port=8080: Serving port on the network diagnostic test pod
      --network-test-pod-protocol='TCP': Protocol used to connect to network diagnostic test pod
      --node-config='': Path to node config file (implies --host)
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API

Use "oc adm options" for a list of global command-line options (applies to all commands).

Thanks.
Thanks for spotting it, sos PR 1150 filed.
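For reference, a sketch of what the corrected capture should look like once the flag name is fixed (based on the oc help output above; the actual change lives in the referenced PR):

# The plugin should pass the flag oc actually accepts (singular form)
oc --config /etc/origin/master/admin.kubeconfig adm diagnostics -l 0 --prevent-modification=true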
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0963