Can you include the debug pod YAML (by passing -o yaml to the command) and your node YAMLs as well, so that I can verify what's wrong?
Hello Maciej, I've added the requested attachment.
I encountered the same problem, and avoided it with the following commands:
$ oc adm new-project debug --node-selector=""
$ oc debug node/<node> -n debug
I encountered the same problem, and I found the cause was that the scheduler was configured with defaultNodeSelector: node-role.kubernetes.io/worker=. The workaround mentioned by Masaki solves the issue, although I still think this is a bug.
This problem is caused by either a cluster-wide or a project-wide defaultNodeSelector, since the selector is applied by an admission plugin shortly before the resource is submitted to the storage layer (see https://speakerdeck.com/sttts/sig-api-machinery-deep-dive-kubecon-na-2018?slide=10). Thus, the only possible solutions to this problem are:
1. with a project-wide defaultNodeSelector, use a namespace that does not have that selector set;
2. with a cluster-wide defaultNodeSelector, use one of the default or kube-system namespaces.
There's no other way around this, and we're not planning to address it otherwise. Does this answer the CU question?
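For option 1, such a namespace can be sketched as follows (the namespace name "debug" is just an example; the annotation key is the standard openshift.io/node-selector):

```yaml
# Example namespace (name is arbitrary) with an explicitly empty
# node selector, so the project-wide default selector does not apply
# to pods created in it.
apiVersion: v1
kind: Namespace
metadata:
  name: debug
  annotations:
    openshift.io/node-selector: ""
```

This is equivalent to what `oc adm new-project debug --node-selector=""` produces.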
OK, so the problem occurs due to a customization of the scheduler or project (with defaultNodeSelector). That means it's definitely not a bug; indeed, a default deployment (without customization) doesn't suffer from this problem.
Setting a defaultNodeSelector for your Scheduler is a perfectly normal customization for a cluster. I do not expect a fundamental tool like oc debug node/<node-id> to break just because a defaultNodeSelector is in place for the Scheduler. I understand why it fails: oc debug node spins up a pod that needs to be scheduled on the node we are trying to debug, but the pod also inherits the defaultNodeSelector, and in most use cases defaultNodeSelector is set to the nodes with the worker role. A case where this fails is, for example, running oc debug node against a master node. The debug pod will have a node selector for the master node's hostname, but the scheduler will slap on an additional node selector for the worker role. Now the pod has a node selector requirement that can never be satisfied, because master nodes do not have the worker label. Is there any way we can bypass the defaultNodeSelector inherited from the scheduler when we run oc debug node?
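To illustrate the conflict described above, a hypothetical sketch of the debug pod's effective node selector after admission (the hostname and label values are examples, not taken from this report):

```yaml
# Hypothetical debug pod spec fragment after admission.
# oc debug targets the node via its hostname selector, while the
# admission plugin merges in the cluster-wide defaultNodeSelector.
spec:
  nodeSelector:
    kubernetes.io/hostname: master-0.example.com    # set by oc debug
    node-role.kubernetes.io/worker: ""               # merged from defaultNodeSelector
```

No master node carries the worker role label, so no node can ever match both requirements and the pod stays unschedulable.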
(In reply to Sushil from comment #14)
> Is there anyway we can bypass the defaultNodeSelector that is inherited by the scheduler ? when we run oc debug node ?

Using the default namespace is the simplest option and will always work; in other words, oc debug node/x -n default.
@Maciej using the default namespace won't work if this issue is triggered by a cluster-wide node selector. On my freshly installed 4.4.4 cluster, the default namespace looks like this:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c21,c10
    openshift.io/sa.scc.supplemental-groups: 1000440000/10000
    openshift.io/sa.scc.uid-range: 1000440000/10000
  creationTimestamp: "2020-05-16T10:32:11Z"
  name: default
  resourceVersion: "8542"
  selfLink: /api/v1/namespaces/default
  uid: 806e1225-e67b-4232-8cf7-318a8d197068
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

As you can see, it does not have an annotation with an empty node selector. If I try to use the default namespace, it fails (the default cluster-wide node selector is applied). Possible alternatives I see:
- Temporary namespace: create an ad-hoc temporary namespace the same way "oc adm must-gather" does. This way, you don't interfere with other namespaces at all. In my honest opinion, this is the simplest and most robust option (as per experience with must-gather).
- Add an empty node selector annotation to the default project: not sure if feasible. If it is, a random string needs to be added to the debug pod name to prevent collisions (in case two people run oc debug node without knowing about each other).
- Use another namespace: this may be a bit risky. I have found many namespaces with an empty node selector on my freshly installed cluster, but those are managed by other components, so you may not be able to rely on them keeping the empty node selector, or even on their existence in future versions. You would also have the same pod name collision problem as with the default namespace.
What do you think? Thanks and regards.
The way the project- or cluster-wide selectors work is controlled by an admission plugin which explicitly exempts the few namespaces I mentioned earlier. On top of that, none of the alternatives you provided is capable of bypassing the admission chain; that's why it's not possible to change the behavior of these selectors.
Not sure if there has been some misunderstanding or you have not read all the details of my previous comment, but:
- I HAVE tested with the default project on 4.4 and it DOES NOT WORK. Not sure about the other kube-XXX projects (I have not tested them).
- Creating a temporary project with a default selector DOES WORK, as this is the workaround offered to customers and it worked.
It may not be a matter of bypassing the admission controller entirely, but of having it not add a node selector to the pod so that scheduling is possible.
(In reply to Pablo Alonso Rodriguez from comment #20)
> - I HAVE tested with default project on 4.4 and DOES NOT WORK. Not sure about the other kube-XXX projects (I have not tested them)

I've seen this and I need to double-check the code, since I'm pretty sure it should work.

> - Creating a temporary project with default selector DOES WORK, as this is the workaround offered to customers and it worked. It may not be a matter of bypassing the admission controller entirely but to have it not add a node selector to the pod so scheduling is possible.

That's a good workaround for now.
OK. Note that I tested on 4.4; it may be different in the latest development versions.
I'll keep that in mind, although that piece of code hasn't changed much in recent releases.
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
I recently inherited this bug. I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
There is a workaround for this (https://bugzilla.redhat.com/show_bug.cgi?id=1812813#c17), and the bug hasn't had any updates other than my hopeful UpcomingSprint comments, so I'm closing this as WONTFIX. Please re-open if any new cases come up.
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
Verified the bug with the payload below, and I see that it works fine without any issues.

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-10-031249]$ ./oc version
Client Version: 4.6.0-0.nightly-2020-09-10-031249
Server Version: 4.6.0-0.nightly-2020-09-10-031249
Kubernetes Version: v1.19.0-rc.2+068702d

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-10-031249]$ ./oc version -o yaml
clientVersion:
  buildDate: "2020-09-04T19:54:51Z"
  compiler: gc
  gitCommit: f2a4a0375cc7b0eacb5467a0214303983e1151d6
  gitTreeState: clean
  gitVersion: openshift-clients-4.6.0-202006250705.p0-112-gf2a4a0375
  goVersion: go1.14.4
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.6.0-0.nightly-2020-09-10-031249
releaseClientVersion: 4.6.0-0.nightly-2020-09-10-031249
serverVersion:
  buildDate: "2020-09-04T01:56:40Z"
  compiler: gc
  gitCommit: 068702de7d48739e835ea41b7ca959b5252de432
  gitTreeState: clean
  gitVersion: v1.19.0-rc.2+068702d
  goVersion: go1.14.4
  major: "1"
  minor: 19+
  platform: linux/amd64

Scenario 1:
1) Make sure that a debug namespace is created, and that the project has the annotation openshift.io/node-selector: "" when oc debug is run against any master, infra, or worker node.

Scenario 2:
1) Set defaultNodeSelector in the Scheduler cluster object to defaultNodeSelector: type=foo,region=foo, wait for the openshift-kube-apiserver pods to redeploy, and verify that a debug namespace is created and the project has the annotation openshift.io/node-selector: "" when oc debug is run against any master, worker, or infra node.

Based on the above, moving the bug to the verified state.
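For reference, the scheduler setting used in scenario 2 can be applied by editing the cluster-scoped Scheduler resource; a minimal sketch (the selector value matches the scenario above):

```yaml
# Cluster-wide Scheduler object carrying the defaultNodeSelector
# from scenario 2; it is applied to pods by the admission plugin.
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  defaultNodeSelector: type=foo,region=foo
```

The same can be done in place with, for example, `oc edit scheduler cluster`.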
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
Sally, make sure to backport the revert to 4.6 too.
Moving this back to POST, as there is a follow-up PR to the revert that provides a better error message for a failed debug pod: https://github.com/openshift/oc/pull/675
Will be pushing through the follow-up PR in the upcoming sprint.
This was fixed in all 4.6+ releases, so moving to MODIFIED.