Red Hat Bugzilla – Bug 1463509
sosreport should not run oc adm diagnostics by default
Last modified: 2018-04-10 14:03:36 EDT
Description of problem:
When running sosreport on an OpenShift cluster master node (with oc whoami reporting system:admin), sosreport will also run the origin plugin, which in turn runs oc adm diagnostics. This times out after 5 minutes, at least in disconnected installations, due to:
https://bugzilla.redhat.com/show_bug.cgi?id=1439142
and in some setups also due to:
https://bugzilla.redhat.com/show_bug.cgi?id=1431588

Another notable issue is that sosreport should make only a minimal effort to collect existing information; oc adm diagnostics creates lots of namespaces and pods in the OCP cluster, and during a time of crisis that is undesirable. Please avoid running oc adm diagnostics from sosreport.

Version-Release number of selected component (if applicable):
sos-3.3-5.el7_3.noarch
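As a stop-gap while this is being discussed (an editorial suggestion, not part of the original report), the origin plugin can simply be excluded from a given sosreport run, assuming the standard sos plugin-selection option:

# Workaround sketch: skip the origin plugin (and its 'oc adm diagnostics' run)
# for this sosreport invocation; all other plugins still run normally.
sosreport -n origin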
This command is from a commit from Pep, who maintains the OpenShift components in sos:

commit dc37f9f720863cc981447e491fb470260e6636e0
Author: Pep Turró Mauri <pep@redhat.com>
Date:   Tue Jun 28 23:12:45 2016 +0200

    [origin] New plugin for OpenShift Origin / OSE 3.x

    Add a new plugin to collect information about OpenShift Origin:

      https://github.com/openshift/origin
      https://www.openshift.org

    and related products like Atomic Platform / OpenShift Enterprise 3+.

    Fixes: #612
    Fixes: RHBZ 1246423

    Signed-off-by: Pep Turró Mauri <pep@redhat.com>

So I'd really like a bit more confirmation that this is both not useful and also potentially harmful before we rip that out. I note that the comment in the plugin describes it as experimental - if there are bugs affecting some versions then those should be fixed, worked around, or, if that is not possible, handled specially. It seems that the command is intended to generate useful diagnostic information, so:

> One notable issue is also that sosreport should do the minimal effort to
> collect existing information,

Clearly the maintainers of sos cannot be familiar with every single component on which we report - we have well over two hundred plugins collecting thousands of individual data items today - so we rely on guidance from our subsystem contributors and maintainers, and they have the greatest influence over what goes in and what is left out.

If you can discuss your concerns and suggestions with Pep (either use this bug, or open an issue upstream), I am sure we can come to a solution that will work well for all users.
(In reply to Bryn M. Reeves from comment #2)
> If you can discuss your concerns and suggestions with Pep (either use this
> bug, or open an issue upstream), I am sure we can come to a solution that
> will work well for all users.

I'm happy to participate in the discussion, e.g. here. Thanks.
Adding Pep on CC.
Some background about why the plugin calls 'oadm diagnostics':

a) well, it should provide useful diagnostics information
b) OpenShift evolves very quickly and it's likely that changes specific to certain versions are better captured by corresponding updates to 'oadm diagnostics' than via the sos plugin
c) there are some checks that are not easy to do from within sosreport and that 'oadm diagnostics' can handle much better (for example the network connectivity checks across pods that you mention)

I believe that (a) still holds true; if 'oadm diagnostics' is not providing useful information please let us know - probably via a bug in the OpenShift product (CLI component). And with that assumption, it would still make sense to try to collect the information it generates...

The bugs that you mention should definitely be fixed (one is marked as an RFE, which I think we could argue about), but these issues are not specific to the sos plugin... do we want to avoid invoking command X from a sos plugin when X has a bug, then enable it back when it's fixed?

Having said that, one thing we could do is to customize how the command gets invoked. For example: it has a --prevent-modifications option that would prevent any attempt to make changes, like deploying pods for testing. We could set that on by default maybe?

Personally I believe that the "intrusive" tests that the tool performs are simple and safe enough (it's not "lots of projects", it's 2, the minimum to validate namespace isolation) and do provide significant value: several support requests are related to networking within the SDN, and having a diagnostic of that is valuable... but I'd love more thoughts on that. I'll ask the support team to step in and provide feedback too.

I'll assign the bug to myself to follow up with any changes needed; feel free to change that if preferred.
Just if I got it right:

(In reply to Marko Myllynen from comment #0)
> Description of problem:
> When running sosreport on an OpenShift cluster master node (and oc whoami
> saying system:admin) sosreport will run also the origin plugin which in turn
> will run oc adm diagnostics. This will timeout after 5 minutes at least in
> disconnected installations due to:

Is it supposed/intended/welcomed/required/beneficial to run that command on an OpenShift cluster master node? If we get some useful data from it, sosreport shall call it (and the tool should be robust enough not to time out under specific situations). If calling the cmd on that node (and on a given setup like a disconnected installation) does not bring anything useful, let's formally describe how to identify such a condition and sosreport will stop calling the cmd in that situation.

I expect the 1st option is valid, am I right?
Thanks for looking into this. I'm all for providing automated tools to collect all the needed information for troubleshooting in one go (see related bug 1456380), but here the tool at least currently gets stuck for a long time without providing any additional data, and even if this were a connected installation someone could kill it, leaving extra namespaces behind in the cluster, which might lead to confusion later on.

I think the above suggestion to make this configurable is a good compromise, perhaps disabled by default until the above issues are fixed and then flipped to enabled by default? This way also in the long run those who want to avoid this (due to the current situation with the environment, policies, or such) can opt out, but in the general case the data will be available to assist with troubleshooting.

Thanks.
(In reply to Marko Myllynen from comment #7)
> Thanks for looking into this.

Thanks for your report and feedback!

> I'm all for providing automated tools to collect all the needed information
> for troubleshooting at one go (see related bug 1456380) but here the tool at
> least currently gets stuck for a long time without providing any additional
> data

So, that's a bug in "adm diagnostics" that should get fixed. In fact, it's more than likely that the fix for "oc" will be released much sooner than a change in the sos plugin that calls it.

> I think the above suggestion to make this configurable is a good compromise,
> perhaps disabled by default

The problem with disabled by default is that experience says people will just not enable it. I'd vote for enabled by default here... having the option to disable that specific check, or tune it (e.g. disable network checks) in case the default behaviour is not desirable.

> until the above issues are fixed and then flipped enabled by default?

We can only change the behaviour of the plugin every so often - and in this case, as mentioned above: if we were to disable that check due to the referenced bug, we would release the change after its motivation had already gone away...

> This way also in the long run those who want to avoid this (due to current
> situation with the environment, policies, or such) can opt-out but in the
> general case the data will be available to assist with troubleshooting.

I think an opt-out option should be the way to go here, i.e.: keep the current behaviour as default, and add an option to tweak the call to "adm diagnostics" with 3 choices: run with default options (enabled by default), run without diag pods, or disable completely. Then if you hit an issue or want to disable that check for any reason you can do so.

Does that sound like a good plan?

Support collects a fair number of sosreports from OpenShift systems regularly and AFAIK this is the first report of this check causing trouble due to a combination of bugs and environment. This is why I believe we should keep the current behaviour by default. Maybe I'm missing details about the impact though, so if there's a stronger need to disable this please let me know.
(In reply to Josep 'Pep' Turro Mauri from comment #8)
> (In reply to Marko Myllynen from comment #7)
>
> > I think the above suggestion to make this configurable is a good compromise,
> > perhaps disabled by default
>
> The problem with disabled by default is that experience says that people
> will just not enable it. I'd vote for enabled by default here... having the
> option to disable that specific check, or tune it (e.g. disable network
> checks) in case the default behaviour is not desirable.
>
> > until the above issues are fixed and then flipped enabled by default?
>
> We can only change the behaviour of the plugin every so often - and in this
> case, as mentioned above: if we were to disable that check due to the
> referenced bug, we would release the change after its motivation would have
> been gone...

I see, this is a fair point.

> > This way also in the long run those who want to avoid this (due to current
> > situation with the environment, policies, or such) can opt-out but in the
> > general case the data will be available to assist with troubleshooting.
>
> I think an opt-out option should be the way to go here, i.e: keep the
> current behaviour as default, and add an option to tweak the call to "adm
> diagnostics" with 3 choices: run with default options (enabled by default),
> run without diag pods, or disable completely. Then if you hit an issue or
> want to disable that check for any reason you can do so.
>
> Does that sound like a good plan?

Yes, I think this is a solid plan; these choices should provide enough options for everyone around this.

Thanks.
This is what I see with latest 3.5 in a disconnected installation:

network-diag-global-ns-bbps6   network-diag-test-pod-3wxh2   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-4vggr   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-7w8r6   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-gbplv   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-pc6sh   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-rhzgr   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-s6w3k   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-sk1wf   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-t4tvx   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-bbps6   network-diag-test-pod-w8h6s   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-0lc62   0/1   ImagePullBackOff   0   5m
network-diag-global-ns-ltn0f   network-diag-test-pod-0vntf   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-5lgrv   0/1   ErrImagePull       0   5m
network-diag-global-ns-ltn0f   network-diag-test-pod-l34vn   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-qb75t   0/1   ErrImagePull       0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-qr2qv   0/1   ErrImagePull       0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-s5ltv   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-tg2gf   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-vgg21   0/1   ImagePullBackOff   0   6m
network-diag-global-ns-ltn0f   network-diag-test-pod-z8dl9   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-02wwc   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-2440q   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-7wb6v   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-jf19j   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-nb3ns   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-t9zcf   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-v8vjf   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-wbrg9   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-x7thr   0/1   ImagePullBackOff   0   6m
network-diag-ns-38v11          network-diag-test-pod-xz3k7   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-09ccx   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-1s5st   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-6kwth   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-75rcs   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-bklhs   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-ck0x8   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-h953b   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-qn6jm   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-t3s6j   0/1   ImagePullBackOff   0   6m
network-diag-ns-5zmwc          network-diag-test-pod-v2nzq   0/1   ImagePullBackOff   0   6m

These are cleaned up later if nobody kills sosreport. Thanks.
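As a side note (editorial, not from the original comment): if sosreport is killed before the diagnostics finish and the namespaces are left behind, they can be removed by hand; a rough sketch, assuming the network-diag-* naming seen above:

# List the diagnostic namespaces left behind by the network check
oc get namespaces | grep network-diag

# Delete them once confirmed they contain only the stuck test pods
# (namespace names here are the examples from the listing above)
oc delete namespace network-diag-global-ns-bbps6 network-diag-ns-38v11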
(In reply to Marko Myllynen from comment #9)
> (In reply to Josep 'Pep' Turro Mauri from comment #8)
> > I think an opt-out option should be the way to go here, i.e: keep the
> > current behaviour as default, and add an option to tweak the call to "adm
> > diagnostics" with 3 choices: run with default options (enabled by default),
> > run without diag pods, or disable completely. Then if you hit an issue or
> > want to disable that check for any reason you can do so.
> >
> > Does that sound like a good plan?
>
> Yes, I think this is a solid plan, these choices should provide enough
> options for everyone around this.
>
> Thanks.

Pep, could you please file an upstream PR for this (esp. if you want to have it in RHEL 7.5)? I don't see where "oc adm diagnostics" is called for individual pods (or do I misunderstand which cmd shall not be called for the pods via that option?).
Sorry for the delay, here's the PR: https://github.com/sosreport/sos/pull/1098

(In reply to Josep 'Pep' Turro Mauri from comment #8)
> I think an opt-out option should be the way to go here, i.e: keep the
> current behaviour as default, and add an option to tweak the call to "adm
> diagnostics" with 3 choices: run with default options (enabled by default),
> run without diag pods, or disable completely. Then if you hit an issue or
> want to disable that check for any reason you can do so.

The suggested implementation is 2 options (boolean):

- diag (default true): run "oc adm diagnostics"
- diag-prevent (default false): set "--prevent-modifications" when invoking diag

The --prevent-modifications option would prevent "oc adm diagnostics" from attempting pod deployments, including the ones in the network check that were affected by the bug mentioned earlier.

Does that sound ok? Feel free to comment here or in the PR.
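To make the mapping concrete (an illustrative sketch, not text from the PR), the two options translate into roughly the following behaviour; the exact command line is whatever the plugin builds, and the flag spelling here follows the PR text (as found during verification further down, oc itself expects the singular --prevent-modification):

# origin.diag=on (default): the plugin captures the diagnostics output, roughly
oc --config /etc/origin/master/admin.kubeconfig adm diagnostics -l 0

# origin.diag-prevent=on: the prevent-modifications flag is appended, so no
# test projects/pods are created during collection
oc --config /etc/origin/master/admin.kubeconfig adm diagnostics -l 0 --prevent-modifications true

# origin.diag=off: the command is not run at all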
(In reply to Josep 'Pep' Turro Mauri from comment #13)
> The suggested implementation is 2 options (boolean):
>
> - diag (default true): run "oc adm diagnostics"
> - diag-prevent (default false): set "--prevent-modifications" when invoking
>   diag
>
> The --prevent-modifications option would prevent "oc adm diagnostics" to
> attempt pod deployments, including the ones in the network check that were
> affected by the bug mentioned earlier.
>
> Does that sound ok? Feel free to comment here or in the PR.

Yes, sounds perfect, thanks!
Marko, could you please provide verification steps or verify the BZ yourself? (A brew build with the sos package follows.)
(asking Pep the same for verification steps)
Fixed via sos 3.5 rebase.
(In reply to Pavel Moravec from comment #16)
> Marko, could you please provide verification steps or verify the BZ by
> yourself (brew build with sos package follows)?

There are three different ways to run sosreport now:

# Disable oc adm diagnostics completely
sosreport -o origin -k origin.diag=off

# Disable oc adm diagnostics initiated modifications
sosreport -o origin -k origin.diag-prevent=on

# Run full oc adm diagnostics
sosreport -o origin

It seems that there's a typo causing the second alternative to fail:

# cat sos_commands/origin/oc_--config_.etc.origin.master.admin.kubeconfig_adm_diagnostics_-l_0_--prevent-modifications_true
Error: unknown flag: --prevent-modifications

Usage:
  oc adm diagnostics [options]

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --host=false: If true, look for systemd and journald units even without master/node config
      --images='registry.access.redhat.com/openshift3/ose-${component}:${version}': Image template for DiagnosticPod to use in creating a pod
      --latest-images=false: If true, when expanding the image template, use latest version, not release version
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --network-logdir='/tmp/openshift/': Path to store network diagnostic results in case of errors
      --network-pod-image='registry.access.redhat.com/openshift3/ose': Image to use for network diagnostic pod
      --network-test-pod-image='registry.access.redhat.com/openshift3/ose-deployer': Image to use for network diagnostic test pod
      --network-test-pod-port=8080: Serving port on the network diagnostic test pod
      --network-test-pod-protocol='TCP': Protocol used to connect to network diagnostic test pod
      --node-config='': Path to node config file (implies --host)
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API

Use "oc adm options" for a list of global command-line options (applies to all commands).

Thanks.
Thanks for spotting it, sos PR 1150 filed.
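For reference, a sketch of what the corrected capture should look like once the flag name is fixed (based on the oc help output above; the actual change lives in the referenced PR):

# The plugin should pass the flag oc actually accepts (singular form)
oc --config /etc/origin/master/admin.kubeconfig adm diagnostics -l 0 --prevent-modification=true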
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0963