Bug 1463509 - sosreport should not run oc adm diagnostics by default
sosreport should not run oc adm diagnostics by default
Status: ON_QA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sos (Show other bugs)
7.3
Unspecified Unspecified
unspecified Severity unspecified
: rc
: ---
Assigned To: Josep 'Pep' Turro Mauri
Miroslav Hradílek
https://github.com/sosreport/sos/pull...
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-06-21 03:24 EDT by Marko Myllynen
Modified: 2017-12-09 18:14 EST (History)
10 users (show)

See Also:
Fixed In Version: sos-3.5-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Github sosreport/sos/pull/1150 None None None 2017-11-16 10:54 EST

  None (edit)
Description Marko Myllynen 2017-06-21 03:24:30 EDT
Description of problem:
When running sosreport on an OpenShift cluster master node (and oc whoami saying system:admin) sosreport will run also the origin plugin which in turn will run oc adm diagnostics. This will timeout after 5 minutes at least in disconnected installations due to:

https://bugzilla.redhat.com/show_bug.cgi?id=1439142

And in some setups also due to:

https://bugzilla.redhat.com/show_bug.cgi?id=1431588

One notable issue is also that sosreport should do the minimal effort to collect existing information, oc adm diagnostics creates lots of namespaces and pods to the OCP cluster and during the time of crisis that is undesirable.

Please avoid running oc adm diagnostics from sosreport.

Version-Release number of selected component (if applicable):
sos-3.3-5.el7_3.noarch
Comment 2 Bryn M. Reeves 2017-06-21 07:06:19 EDT
This command is from a commit from Pep, who maintains the OpenShift components in sos:

commit dc37f9f720863cc981447e491fb470260e6636e0
Author: Pep Turró Mauri <pep@redhat.com>
Date:   Tue Jun 28 23:12:45 2016 +0200

    [origin] New plugin for OpenShift Origin / OSE 3.x
    
    Add a new plugin to collect information about OpenShift Origin:
    
      https://github.com/openshift/origin
      https://www.openshift.org
    
    and related products like Atomic Platform / OpenShift Enterprise 3+.
    
    Fixes: #612
    Fixes: RHBZ 1246423
    
    Signed-off-by: Pep Turró Mauri <pep@redhat.com>

So I'd really like a bit more confirmation that this is both not useful, and also potentially harmful, before we rip that out.

I note that the comment in the plugin describes it as experimental - if there are bugs affecting some versions then those should be fixed, worked around, or if that that is not possible, handled specially.

It seems that the command is intended to generate useful diagnostic information so

> One notable issue is also that sosreport should do the minimal effort to 
> collect existing information,

Clearly the maintainers of sos cannot be familiar with every single component on which we report - we have well over two hundred plugins collecting thousands of individual data items today - we rely for this on guidance from our subsystem contributors and maintainers, and they have the greatest influence over what goes in and what is left out.

If you can discuss your concerns and suggestions with Pep (either use this bug, or open an issue upstream), I am sure we can come to a solution that will work well for all users.
Comment 3 Marko Myllynen 2017-06-21 07:13:38 EDT
(In reply to Bryn M. Reeves from comment #2)
> 
> If you can discuss your concerns and suggestions with Pep (either use this
> bug, or open an issue upstream), I am sure we can come to a solution that
> will work well for all users.

I'm happy to participate the discussion e.g. here.

Thanks.
Comment 4 Bryn M. Reeves 2017-06-21 07:17:18 EDT
Adding Pep on CC.
Comment 5 Josep 'Pep' Turro Mauri 2017-06-22 09:16:06 EDT
Some background about why the plugin calls 'oadm diagnostics':

  a) well, it should provide useful diagnostics information

  b) OpenShift evolves very quickly and it's likely that changes specific to certain versions are better captured by corresponding updates to 'oadm diagnostics' than via the sos plugin

  c) there are some checks that are not easy do from within sosreport and 'oadm diagnostics' can handle much better (for example the network connectivity checks across pods that you mention)

I belive that (a) still holds true; if 'oadm diagnostics' is not providing useful information please let us know - probably via a bug in the OpenShift product (CLI component). And with that assumption, it would still make sense to try to collect the information it generates...

The bugs that you mention should definitely be fixed (one is marked as an RFE, which I think we could argue about). but these issues are not specific to the sos plugin... do we want to avoid invoking command X from a sos plugin when X has a bug / enable it back when it's fixed?

Having said that, one thing we could do is to customize how the command gets invoked. For example: it has a --prevent-modifications option that would prevent any attempt to make any changes, like deploying pods for testing. We could set that on by default maybe?

Personally I believe that the "intrusive" tests that the tool performs are simple and safe enouh (it's not "lots of projects", it's 2, the minimum to validate namespace isolation) and do provide significant value: several support requests are related to networking within the SDN, and having a diagnostic of that is valuable... but I'd love more thoughts on that. I'll ask the support team to step in and provide feedback too.

I'll assign the bug to myself to follow-up with any changes needed, feel free to change that if preferred.
Comment 6 Pavel Moravec 2017-06-23 03:01:49 EDT
Just if I got it right:

(In reply to Marko Myllynen from comment #0)
> Description of problem:
> When running sosreport on an OpenShift cluster master node (and oc whoami
> saying system:admin) sosreport will run also the origin plugin which in turn
> will run oc adm diagnostics. This will timeout after 5 minutes at least in
> disconnected installations due to:

Is it supposed/intended/welcomed/required/beneficial to run that command on OpenShift cluster master node? If we get some usefull data from it, sosreport shall call it (and the tool should be robust enough not to timeout under specific situations).

If calling the cmd on that node (and on given setup like disconnected installation) does not bring anything useful, let formally describe how to identify such condition and sosreport will stop calling the cmd in that situation.

I expect the 1st option is valid, am I right?
Comment 7 Marko Myllynen 2017-06-26 13:58:50 EDT
Thanks for looking into this.

I'm all for providing automated tools to collect all the needed information for troubleshooting at one go (see related bug 1456380) but here the tool at least currently gets stuck for a long time without providing any additional data and even if this would be a connected installation someone could kill it leaving extra namespaces behind to the cluster which might lead to confusion later on.

I think the above suggestion to make this configurable is a good compromise, perhaps disabled by default until the above issues are fixed and then flipped enabled by default? This way also in the long run those who want to avoid this (due to current situation with the environment, policies, or such) can opt-out but in the general case the data will be available to assist with troubleshooting.

Thanks.
Comment 8 Josep 'Pep' Turro Mauri 2017-06-27 05:05:34 EDT
(In reply to Marko Myllynen from comment #7)
> Thanks for looking into this.

Thanks for your report and feedback!

> I'm all for providing automated tools to collect all the needed information
> for troubleshooting at one go (see related bug 1456380) but here the tool at
> least currently gets stuck for a long time without providing any additional
> data

So, that's a bug in "adm diagnostics" that should get fixed. In fact, it's more than likely that the fix for "oc" will be released much sooner than a change in the sos plugin that calls it.

> I think the above suggestion to make this configurable is a good compromise,
> perhaps disabled by default

The problem with disabled by default is that experience says that people will just not enable it. I'd vote for enabled by default here... having the option to disable that specific check, or tune it (e.g. disable network checks) in case the default behaviour is not desirable.

> until the above issues are fixed and then flipped enabled by default?

We can only change the behaviour of the plugin every so often - and in this case, as mentioned above: if we were to disable that check due to the referenced bug, we would release the change after its motivation would have been gone...

> This way also in the long run those who want to avoid this (due to current
> situation with the environment, policies, or such) can opt-out but in the
> general case the data will be available to assist with troubleshooting.

I think an opt-out option should be the way to go here, i.e: keep the current behaviour as default, and add an option to tweak the call to "adm diagnostics" with 3 choices: run with default options (enabled by default), run without diag pods, or disable completely. Then if you hit an issue or want to disable that check for any reason you can do so.

Does that sound like a good plan?

Support collects a fair amount of sosreports from OpenShift systems regularly and AFAIK this is the first report of this check causing trouble due to a combination of bugs and environment. This is why I believe we should keep the current behaviour by default.

Maybe I'm missing details about the impact though, so if there's a stronger need to disable this please let me know.
Comment 9 Marko Myllynen 2017-06-27 10:55:43 EDT
(In reply to Josep 'Pep' Turro Mauri from comment #8)
> (In reply to Marko Myllynen from comment #7)
> 
> > I think the above suggestion to make this configurable is a good compromise,
> > perhaps disabled by default
> 
> The problem with disabled by default is that experience says that people
> will just not enable it. I'd vote for enabled by default here... having the
> option to disable that specific check, or tune it (e.g. disable network
> checks) in case the default behaviour is not desirable.
> 
> > until the above issues are fixed and then flipped enabled by default?
> 
> We can only change the behaviour of the plugin every so often - and in this
> case, as mentioned above: if we were to disable that check due to the
> referenced bug, we would release the change after its motivation would have
> been gone...

I see, this is a fair point.

> > This way also in the long run those who want to avoid this (due to current
> > situation with the environment, policies, or such) can opt-out but in the
> > general case the data will be available to assist with troubleshooting.
> 
> I think an opt-out option should be the way to go here, i.e: keep the
> current behaviour as default, and add an option to tweak the call to "adm
> diagnostics" with 3 choices: run with default options (enabled by default),
> run without diag pods, or disable completely. Then if you hit an issue or
> want to disable that check for any reason you can do so.
> 
> Does that sound like a good plan?

Yes, I think this is a solid plan, these choices should provide enough options for everyone around this.

Thanks.
Comment 11 Marko Myllynen 2017-06-29 10:57:24 EDT
This is what I see with latest 3.5 in a disconnected installation:

network-diag-global-ns-bbps6   network-diag-test-pod-3wxh2   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-4vggr   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-7w8r6   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-gbplv   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-pc6sh   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-rhzgr   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-s6w3k   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-sk1wf   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-t4tvx   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-bbps6   network-diag-test-pod-w8h6s   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-ltn0f   network-diag-test-pod-0lc62   0/1       ImagePullBackOff   0          5m
network-diag-global-ns-ltn0f   network-diag-test-pod-0vntf   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-ltn0f   network-diag-test-pod-5lgrv   0/1       ErrImagePull       0          5m
network-diag-global-ns-ltn0f   network-diag-test-pod-l34vn   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-ltn0f   network-diag-test-pod-qb75t   0/1       ErrImagePull       0          6m
network-diag-global-ns-ltn0f   network-diag-test-pod-qr2qv   0/1       ErrImagePull       0          6m
network-diag-global-ns-ltn0f   network-diag-test-pod-s5ltv   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-ltn0f   network-diag-test-pod-tg2gf   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-ltn0f   network-diag-test-pod-vgg21   0/1       ImagePullBackOff   0          6m
network-diag-global-ns-ltn0f   network-diag-test-pod-z8dl9   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-02wwc   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-2440q   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-7wb6v   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-jf19j   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-nb3ns   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-t9zcf   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-v8vjf   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-wbrg9   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-x7thr   0/1       ImagePullBackOff   0          6m
network-diag-ns-38v11          network-diag-test-pod-xz3k7   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-09ccx   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-1s5st   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-6kwth   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-75rcs   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-bklhs   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-ck0x8   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-h953b   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-qn6jm   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-t3s6j   0/1       ImagePullBackOff   0          6m
network-diag-ns-5zmwc          network-diag-test-pod-v2nzq   0/1       ImagePullBackOff   0          6m

These are cleaned up later if nobody kills sosreport.

Thanks.
Comment 12 Pavel Moravec 2017-08-30 04:21:34 EDT
(In reply to Marko Myllynen from comment #9)
> (In reply to Josep 'Pep' Turro Mauri from comment #8)
> > I think an opt-out option should be the way to go here, i.e: keep the
> > current behaviour as default, and add an option to tweak the call to "adm
> > diagnostics" with 3 choices: run with default options (enabled by default),
> > run without diag pods, or disable completely. Then if you hit an issue or
> > want to disable that check for any reason you can do so.
> > 
> > Does that sound like a good plan?
> 
> Yes, I think this is a solid plan, these choices should provide enough
> options for everyone around this.
> 
> Thanks.

Pep, could you please file upstream PR for this (esp. if you want to have it in RHEL7.5)? I dont see where "oc adm diagnostics" is called for individual pods (or I misunderstand what cmd shall not be called for the pods via that option?).
Comment 13 Josep 'Pep' Turro Mauri 2017-09-07 07:31:35 EDT
Sorry for the delay, here's the PR:

https://github.com/sosreport/sos/pull/1098

(In reply to Josep 'Pep' Turro Mauri from comment #8)
> I think an opt-out option should be the way to go here, i.e: keep the
> current behaviour as default, and add an option to tweak the call to "adm
> diagnostics" with 3 choices: run with default options (enabled by default),
> run without diag pods, or disable completely. Then if you hit an issue or
> want to disable that check for any reason you can do so.

The suggested implementation is 2 options (boolean):

 - diag (default true): run "oc adm diagnostics"
 - diag-prevent (default false): set "--prevent-modifications" when invoking diag

The --prevent-modifications option would prevent "oc adm diagnostics" to attempt pod deployments, including the ones in the network check that were affected by the bug mentioned earlier.

Does that sound ok? Feel free to comment here or in the PR.
Comment 14 Marko Myllynen 2017-09-07 07:46:18 EDT
(In reply to Josep 'Pep' Turro Mauri from comment #13)
> 
> The suggested implementation is 2 options (boolean):
> 
>  - diag (default true): run "oc adm diagnostics"
>  - diag-prevent (default false): set "--prevent-modifications" when invoking
> diag
> 
> The --prevent-modifications option would prevent "oc adm diagnostics" to
> attempt pod deployments, including the ones in the network check that were
> affected by the bug mentioned earlier.
> 
> Does that sound ok? Feel free to comment here or in the PR.

Yes, sounds perfect, thanks!
Comment 16 Pavel Moravec 2017-11-02 10:55:20 EDT
Marko, could you please provide verification steps or verify the BZ by yourself (brew build with sos package follows)?
Comment 17 Pavel Moravec 2017-11-02 10:56:08 EDT
(asking Pep the same for verification steps)
Comment 19 Pavel Moravec 2017-11-02 11:15:16 EDT
Fixed via sos 3.5 rebase.
Comment 22 Marko Myllynen 2017-11-16 10:30:53 EST
(In reply to Pavel Moravec from comment #16)
> Marko, could you please provide verification steps or verify the BZ by
> yourself (brew build with sos package follows)?

There are three different ways to run sosreport now:

# Disable oc adm diagnostics completely
sosreport -o origin -k origin.diag=off

# Disable oc adm diagnostics initiated modifications
sosreport -o origin -k origin.diag-prevent=on

# Run full oc adm diagnostics
sosreport -o origin

It seems that there's a typo causing the second alternative to fail:

# cat sos_commands/origin/oc_--config_.etc.origin.master.admin.kubeconfig_adm_diagnostics_-l_0_--prevent-modifications_true
Error: unknown flag: --prevent-modifications


Usage:
  oc adm diagnostics [options]

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --host=false: If true, look for systemd and journald units even without master/node config
      --images='registry.access.redhat.com/openshift3/ose-${component}:${version}': Image template for DiagnosticPod to use in creating a pod
      --latest-images=false: If true, when expanding the image template, use latest version, not release version
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --network-logdir='/tmp/openshift/': Path to store network diagnostic results in case of errors
      --network-pod-image='registry.access.redhat.com/openshift3/ose': Image to use for network diagnostic pod
      --network-test-pod-image='registry.access.redhat.com/openshift3/ose-deployer': Image to use for network diagnostic test pod
      --network-test-pod-port=8080: Serving port on the network diagnostic test pod
      --network-test-pod-protocol='TCP': Protocol used to connect to network diagnostic test pod
      --node-config='': Path to node config file (implies --host)
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API

Use "oc adm options" for a list of global command-line options (applies to all commands).


Thanks.
Comment 23 Pavel Moravec 2017-11-16 10:54:56 EST
Thanks for spotting it, sos PR 1150 filed.

Note You need to log in before you can comment on or make changes to this bug.