2008733 – kube-scheduler: exposed /debug/pprof port

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-scheduler
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Jan Chaloupka
QA Contact:	RamaKasturi
Docs Contact:
URL:
Whiteboard:	LifecycleReset
Depends On:	1889488
Blocks:

Reported:	2021-09-29 02:58 UTC by Mark Cooper
Modified:	2022-03-10 16:14 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-10 16:13:56 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift kubernetes pull 1087	0	None	Merged	Bug 2033751: Kube 1.23.0 rebase	2022-02-02 08:57:35 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:14:21 UTC

Description Mark Cooper 2021-09-29 02:58:38 UTC

Description of problem:

We found that, for some reason, the Go net/http/pprof debug endpoint is exposed on the master nodes of an OCP cluster.

For example, get the IP of a node in OCP and then you can query the debug path:

   curl <node ip>:10251/debug/pprof/goroutine?debug=1

This endpoint can potentially be called from arbitrary points in the cluster, although that depends on the environment and on whether worker pods can reach the masters.
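
As a rough illustration of the check above, the following Python sketch probes a base URL for the unauthenticated goroutine dump and classifies the answer. The function name `probe_pprof`, the result labels, and the timeout value are assumptions made for this example and are not part of any OpenShift or Kubernetes tooling. It is meant for the plain-HTTP port only and does no TLS handling.

```python
import urllib.error
import urllib.request


def probe_pprof(base_url: str, timeout: float = 3.0) -> str:
    """Classify how base_url answers an unauthenticated pprof request.

    Returns "exposed" for HTTP 200, "protected" for 401/403,
    "unreachable" when no connection can be made, and "other:<code>"
    for any other HTTP status. All labels are hypothetical.
    """
    url = base_url.rstrip("/") + "/debug/pprof/goroutine?debug=1"
    # Ignore any configured proxy so the node is contacted directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return "exposed" if resp.status == 200 else f"other:{resp.status}"
    except urllib.error.HTTPError as err:
        # HTTPError must be handled before URLError, which it subclasses.
        return "protected" if err.code in (401, 403) else f"other:{err.code}"
    except (urllib.error.URLError, OSError):
        return "unreachable"
```

For instance, `probe_pprof("http://<node ip>:10251")` would be expected to return "exposed" on an affected node, assuming the node is reachable from where the script runs.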


After listing the listening ports and their processes on a node, we believe we've narrowed it down to the `kube-scheduler` process:
https://github.com/kubernetes/kubernetes/blob/ea0764452222146c47ec826977f49d7001b0ea8c/staging/src/k8s.io/apiserver/pkg/server/routes/profiling.go#L30

And it is enabled by default here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/apis/config/v1beta1/defaults.go#L168

We believe it only affects OCP, as minikube does not seem to behave the same way in testing:

  $ curl -k https://localhost:10259/debug/pprof/goroutine
  {
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {
    },
    "status": "Failure",
    "message": "forbidden: User \"system:anonymous\" cannot get path \"/debug/pprof/goroutine\"",
    "reason": "Forbidden",
    "details": {
    },
    "code": 403
  }

I'm assuming that endpoint is protected via kube-rbac or something.


But it looks like a bug similar to the one affecting the kubelet, https://github.com/kubernetes/kubernetes/issues/81023, and to the one described in the article https://mmcloughlin.com/posts/your-pprof-is-showing

Expected results:

Although you still need local access to the cluster first and the endpoint has limited use, we feel it should be closed down if possible, or protected by something like RBAC to limit information gathering. We would expect behavior similar to k8s.

Comment 1 Jan Chaloupka 2021-10-07 09:01:06 UTC

https://github.com/kubernetes/kubernetes/pull/96345/files#diff-d2ca723d710873ea0fa67bb8b79cbe4f6921355fb07368c842819499051c7c53L36 removes insecure bits from the kube-scheduler. We just need to wait until the change gets into 4.10 through a rebase. Once done, we can backport it to 4.9 as well. See https://bugzilla.redhat.com/show_bug.cgi?id=1889488 for more details.

Comment 2 Michal Fojtik 2021-11-24 13:09:30 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 3 Jan Chaloupka 2021-11-25 12:10:25 UTC

As the rule here is to first make the changes to the master branch and only then backport, I am still waiting for the next 4.10 rebase. Once done, I will backport the necessary changes to 4.9.

Comment 4 Michal Fojtik 2021-11-25 12:59:01 UTC

The LifecycleStale keyword was removed because the needinfo? flag was reset.
The bug assignee was notified.

Comment 5 Michal Fojtik 2021-12-25 13:22:28 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 6 Jan Chaloupka 2022-01-07 11:48:47 UTC

I am waiting for the v1.23 kubernetes rebase in https://github.com/openshift/kubernetes/pull/1087.

Comment 7 Jan Chaloupka 2022-02-02 08:57:35 UTC

The insecure port 10251 got removed.

Comment 8 Michal Fojtik 2022-02-02 09:02:10 UTC

The LifecycleStale keyword was removed because the bug moved to QE.
The bug assignee was notified.

Comment 10 RamaKasturi 2022-02-03 09:30:50 UTC

Tested with the build below. I see that kube-scheduler no longer serves on port 10251; it now serves on the secure port, 10259.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         106m    Cluster version is 4.10.0-rc.0



    name: kube-scheduler
    ports:
    - containerPort: 10259
      hostPort: 10259
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: healthz
        port: 10259
        scheme: HTTPS

However, I still see port 10251 when I run `oc get pod <pod_name> -n openshift-kube-scheduler`. When I checked with dev, this is what they suggested:
        - It is only a simple bash check. Port 10251 is no longer used anywhere. I will remove it from the spec in 4.11. No need for a bug report.
        - A PR has already been filed to remove it in 4.11: https://github.com/openshift/cluster-kube-scheduler-operator/pull/41

Based on the above, moving the bug to the verified state.
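
The port check described in this comment could also be scripted. The sketch below is only an illustration in Python: the helper names `port_is_open` and `check_scheduler_ports` and the timeout are assumptions for this example, and a plain TCP connect says nothing about what is served on the port or whether it requires authentication.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_scheduler_ports(host: str,
                          insecure_port: int = 10251,
                          secure_port: int = 10259) -> dict:
    """Report whether the insecure port is closed and the secure port open."""
    return {
        "insecure_closed": not port_is_open(host, insecure_port),
        "secure_open": port_is_open(host, secure_port),
    }
```

On a fixed cluster, `check_scheduler_ports("<node ip>")` would be expected to report both values as `True`, assuming the masters are reachable from where the script runs.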

Comment 13 errata-xmlrpc 2022-03-10 16:13:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
