Bug 1812813 - `oc debug` only works against our worker nodes. Running against infra or master nodes gives error [NEEDINFO]
Summary: `oc debug` only works against our worker nodes. Running against infra or master nodes gives error
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-12 09:05 UTC by kedar
Modified: 2023-12-15 17:29 UTC
CC List: 25 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-25 22:12:15 UTC
Target Upstream Version:
Embargoed:
jnordell: needinfo-
vlaad: needinfo? (shpawar)




Links:
- GitHub openshift/oc pull 546 (closed): Bug 1812813: oc adm must-gather: add empty node-selector annotation to namespace (last updated 2021-02-15 15:26:02 UTC)
- GitHub openshift/oc pull 550 (closed): Bug 1812813: oc debug node: create debug namespace with empty node-selector annotation (last updated 2021-02-15 15:26:02 UTC)
- GitHub openshift/oc pull 668 (closed): REVERT: Bug 1812813: oc debug node: create debug namespace with empty node-selector annotation #550 (last updated 2021-02-15 15:26:03 UTC)
- Red Hat Knowledge Base (Solution) 4982331 (last updated 2020-05-22 14:42:14 UTC)
- Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 15:57:12 UTC)

Comment 1 Maciej Szulik 2020-03-17 11:12:53 UTC
Can you include the debug pod YAML (by passing -o yaml to the command) and your node YAMLs as well, so that I can verify what's wrong?

Comment 4 kedar 2020-03-25 11:30:47 UTC
Hello Maciej,

Added the attachment requested.

Comment 5 Masaki Hatada 2020-04-09 06:55:51 UTC
I encountered the same problem, and then avoided it with the following commands.

$ oc adm new-project debug --node-selector=""
$ oc debug node/<node> -n debug
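(Presumably the temporary project can be removed afterwards with something like the following; this cleanup step is my addition, not part of the original workaround.)

$ oc delete project debug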

Comment 6 Raif Ahmed 2020-04-09 08:01:06 UTC
I encountered the same problem, and I found the issue was that the scheduler was configured with defaultNodeSelector: node-role.kubernetes.io/worker=

The workaround mentioned by Masaki solves the issue, although I still think that it is a bug.
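(For reference, and assuming a standard 4.x cluster, the cluster-wide setting can be inspected with something like the following; this is an illustrative addition, not part of the original comment.)

$ oc get scheduler cluster -o jsonpath='{.spec.defaultNodeSelector}'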

Comment 11 Maciej Szulik 2020-05-05 20:20:27 UTC
This problem is caused by either a cluster-wide or a project-wide defaultNodeSelector, since the selector is applied by an admission plugin
shortly before the resource is submitted to the storage layer (see https://speakerdeck.com/sttts/sig-api-machinery-deep-dive-kubecon-na-2018?slide=10).
Thus, the only possible solutions to this problem are:
1. with a project-wide defaultNodeSelector, use a namespace that does not have that selector set
2. with a cluster-wide defaultNodeSelector, use the default or kube-system namespace.

There's no other way around this, and we're not planning to address it otherwise.

Does this answer the CU question?
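(Illustrative check, not part of the original comment: a project-wide selector, if present, shows up as an annotation on the namespace; <namespace> is a placeholder.)

$ oc get namespace <namespace> -o jsonpath='{.metadata.annotations.openshift\.io/node-selector}'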

Comment 12 Olimp Bockowski 2020-05-12 11:40:28 UTC
OK, so it means that the problem occurs due to an incorrect customization of the scheduler or project (with defaultNodeSelector).
That means it's definitely not a bug; indeed, a default deployment (without customization) doesn't suffer from this problem.

Comment 14 Sushil 2020-05-12 17:20:43 UTC
Setting a defaultNodeSelector for your Scheduler is a perfectly normal customization for a cluster.

I do not expect a fundamental tool like

oc debug node/<node-id>

to break just because you have a defaultNodeSelector in place for your Scheduler.

I understand why it is failing: oc debug node will try to spin up a pod that needs to be scheduled on the node in question that we are trying to debug.

However, the pod also inherits the defaultNodeSelector, and in most use cases the defaultNodeSelector is set to select the nodes with the role/worker label.

A use case where this fails is, for example,
if you are trying to oc debug node a master node.

Your debug pod will have a node selector of the master node's hostname.
However, the Scheduler will slap on an additional node selector of role/worker.
Now this pod has a node selector requirement that can never be satisfied, because the master nodes do not have the role/worker label.


Is there any way we can bypass the defaultNodeSelector that gets inherited from the scheduler configuration when we run oc debug node?
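(For illustration only, not from the original comment: assuming the debug pod is stuck in Pending, its merged node selector, including the selector added by admission, can be inspected with something like the following; <debug-pod> and <namespace> are placeholders.)

$ oc get pod <debug-pod> -n <namespace> -o jsonpath='{.spec.nodeSelector}'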

Comment 15 Maciej Szulik 2020-05-13 10:48:52 UTC
(In reply to Sushil from comment #14)
> 
> Is there anyway we can bypass the defaultNodeSelector that is inherited by
> the scheduler ? when we run oc debug node ?
>

Using the default namespace is the simplest and will always work; in other words, oc debug node/x -n default.

Comment 18 Pablo Alonso Rodriguez 2020-05-22 14:55:05 UTC
@Maciej, using the default namespace won't work if this issue is triggered by a cluster-wide node selector.

On my freshly installed 4.4.4 cluster, default namespace looks like this:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c21,c10
    openshift.io/sa.scc.supplemental-groups: 1000440000/10000
    openshift.io/sa.scc.uid-range: 1000440000/10000
  creationTimestamp: "2020-05-16T10:32:11Z"
  name: default
  resourceVersion: "8542"
  selfLink: /api/v1/namespaces/default
  uid: 806e1225-e67b-4232-8cf7-318a8d197068
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

As you can see, it does not have an annotation with an empty node-selector. If I try to use the default namespace, it fails (the default cluster-wide node selector is applied).

Possible alternatives I see:

- Temporary namespace: create an ad-hoc temporary namespace in the very same way "oc adm must-gather" does (see the sketch at the end of this comment). This way, you don't interfere with other namespaces at all. In my honest opinion, this is the simplest and most robust option (as per experience with must-gather).
- Add an empty node selector annotation to the default project: not sure if feasible. If it is, a random string needs to be added to the debug pod name to prevent collisions (in case two people run oc debug node without knowing about each other).
- Use another namespace: this may be a bit risky. I have found many namespaces with an empty node selector on my freshly installed cluster, but those are managed by other components, so you may not be able to rely on them still having the empty node selector, or even on their existence, in future versions. You would also have the same pod name collision problem as when using the default namespace.

What do you think?

Thanks and regards.
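As a sketch of the first alternative above (my illustration, not tested code from this report), a temporary namespace with an explicitly empty node selector would look roughly like this; the name debug-temp is arbitrary:

apiVersion: v1
kind: Namespace
metadata:
  name: debug-temp
  annotations:
    openshift.io/node-selector: ""

It could be created with oc create -f <file>, or equivalently with oc adm new-project debug-temp --node-selector="" as in the earlier workaround.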

Comment 19 Maciej Szulik 2020-05-25 11:29:14 UTC
The way the project- or cluster-wide selectors work is controlled by an admission plugin which explicitly omits
the few namespaces I've mentioned earlier. On top of that, none of the alternatives you provided is capable
of bypassing the admission chain, which is why it's not possible to change the behavior of these selectors.

Comment 20 Pablo Alonso Rodriguez 2020-05-25 12:25:25 UTC
Not sure if there has been some misunderstanding or you have not read all the details of my previous comment, but:

- I HAVE tested with the default project on 4.4 and it DOES NOT WORK. Not sure about the other kube-XXX projects (I have not tested them).

- Creating a temporary project with an empty node selector DOES WORK, as this is the workaround offered to customers and it worked. It may not be a matter of bypassing the admission controller entirely, but of having it not add a node selector to the pod so that scheduling is possible.

Comment 21 Maciej Szulik 2020-05-28 08:21:41 UTC
(In reply to Pablo Alonso Rodriguez from comment #20)

> - I HAVE tested with default project on 4.4 and DOES NOT WORK. Not sure
> about the other kube-XXX projects (I have not tested them)

I've seen this and I need to double check the code, since I'm pretty sure it should work.

> - Creating a temporary project with default selector DOES WORK, as this is
> the workaround offered to customers and it worked. It may not be a matter of
> bypassing the admission controller entirely but to have it not add a node
> selector to the pod so scheduling is possible.

That's a good workaround for now.

Comment 22 Pablo Alonso Rodriguez 2020-05-28 08:24:37 UTC
OK. Note that I tested on 4.4; maybe it is different in the latest development versions.

Comment 23 Maciej Szulik 2020-05-28 08:30:59 UTC
I'll keep that in mind, although that piece of code hasn't changed much in recent releases.

Comment 24 Maciej Szulik 2020-06-18 10:06:53 UTC
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 26 Sally 2020-07-10 19:36:23 UTC
I recently inherited this bug. I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 27 Sally 2020-07-30 21:39:23 UTC
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 28 Michal Fojtik 2020-08-20 11:47:25 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 29 Sally 2020-08-20 22:03:50 UTC
There is a workaround for this (https://bugzilla.redhat.com/show_bug.cgi?id=1812813#c17), and this hasn't had any updates other than my hopeful UpcomingSprint comments, so I'm closing this as WONTFIX. Please re-open if any new cases come up.

Comment 33 Michal Fojtik 2020-08-24 08:32:28 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 47 RamaKasturi 2020-09-10 13:49:26 UTC
Verified the bug with the payload below and I see that it works fine without any issues.
[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-10-031249]$ ./oc version
Client Version: 4.6.0-0.nightly-2020-09-10-031249
Server Version: 4.6.0-0.nightly-2020-09-10-031249
Kubernetes Version: v1.19.0-rc.2+068702d
[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-10-031249]$ ./oc version -o yaml
clientVersion:
  buildDate: "2020-09-04T19:54:51Z"
  compiler: gc
  gitCommit: f2a4a0375cc7b0eacb5467a0214303983e1151d6
  gitTreeState: clean
  gitVersion: openshift-clients-4.6.0-202006250705.p0-112-gf2a4a0375
  goVersion: go1.14.4
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.6.0-0.nightly-2020-09-10-031249
releaseClientVersion: 4.6.0-0.nightly-2020-09-10-031249
serverVersion:
  buildDate: "2020-09-04T01:56:40Z"
  compiler: gc
  gitCommit: 068702de7d48739e835ea41b7ca959b5252de432
  gitTreeState: clean
  gitVersion: v1.19.0-rc.2+068702d
  goVersion: go1.14.4
  major: "1"
  minor: 19+
  platform: linux/amd64


Scenario 1:
===============
1) Verify that a debug namespace is created and that the project has the annotation openshift.io/node-selector: "" when oc debug is run against any of the master, infra, or worker nodes.

Scenario 2:
=================
1) Set defaultNodeSelector in the scheduler cluster config to defaultNodeSelector: type=foo,region=foo, wait for the openshift-kube-apiserver pods to redeploy, and verify that a debug namespace is created and that the project has the annotation openshift.io/node-selector: "" when oc debug is run against any of the master, worker, or infra nodes.

Based on the above, moving the bug to the verified state.
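(For reference, a sketch of how the Scenario 2 setting could be applied and checked; these commands are my illustration of the setup, not part of the original verification steps. <debug-namespace> is a placeholder for the namespace oc debug creates.)

$ oc patch scheduler cluster --type merge -p '{"spec":{"defaultNodeSelector":"type=foo,region=foo"}}'
$ oc get namespace <debug-namespace> -o yaml | grep node-selector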

Comment 54 errata-xmlrpc 2020-10-27 15:57:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 58 Maciej Szulik 2020-12-11 12:35:45 UTC
Sally, make sure to backport the revert to 4.6 too.

Comment 60 Sally 2020-12-17 19:21:44 UTC
Moving this back to POST, as there is a follow-up PR to the revert to provide a better error message for a failed debug pod:
https://github.com/openshift/oc/pull/675

Comment 61 Sally 2021-01-14 23:33:36 UTC
Will be pushing through the follow-up PR in the upcoming sprint.

Comment 64 Maciej Szulik 2021-11-22 16:46:41 UTC
This was fixed in all 4.6+ releases, so moving to MODIFIED.

