Bug 1572919 - OCP 3.10: pprof profiling endpoint not starting on compute node
Summary: OCP 3.10: pprof profiling endpoint not starting on compute node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: ravig
QA Contact: Walid A.
URL:
Whiteboard: aos-scalability-310
Depends On:
Blocks:
 
Reported: 2018-04-28 18:01 UTC by Walid A.
Modified: 2018-07-30 19:15 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:14:34 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2018:1816 (Last Updated: 2018-07-30 19:15:04 UTC)

Description Walid A. 2018-04-28 18:01:03 UTC
Description of problem:
On a 4-node OCP 3.10.0-0.29.0 AWS environment (1 master/etcd, 1 infra, and 2 compute nodes), the pprof profiling endpoint fails to start on the compute nodes after adding OPENSHIFT_PROFILING=web to the /etc/sysconfig/atomic-openshift-node environment file and restarting the node service with the systemctl restart atomic-openshift-node command. The OCP cluster was installed with openshift_node_bootstrap=true (the default value).

Compute node journal logs show it is starting the profiling endpoint:

atomic-openshift-node[10357]: I0427 12:36:49.651796   10358 profiler.go:19] Starting profiling endpoint at http://127.0.0.1:6060/debug/pprof/

But no server is listening on port 6060 on the compute node:
# netstat -tunapl | grep 6060
(no output)

Trying to access the pprof data from the profiling endpoint fails:
# go tool pprof --text http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
http://localhost:6060/debug/pprof/heap: Get http://localhost:6060/debug/pprof/heap: dial tcp [::1]:6060: getsockopt: connection refused
failed to fetch any profiles

This method of starting the profiling endpoint worked on OCP 3.9 compute and master/etcd nodes. On OCP 3.10, it does not work on the compute node; it only works on the master/etcd node with the master static pods, when "OPENSHIFT_PROFILING=web" is appended to the /etc/origin/master/master.env file and the master-api and controller pods are restarted with the provided restart shims in /usr/local/bin.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.10.0-0.29.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

# docker version
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-62.gitc6c9b51.el7.x86_64
 Go version:      go1.9.2


How reproducible:
Always

Steps to Reproduce:
1. Deploy an OCP 3.10.0-0.29.0 environment with the openshift-ansible/playbooks/deploy_cluster.yml playbook: 1 master/etcd, 1 infra, and 2 compute nodes (m4.xlarge), with the default value for the variable "openshift_node_bootstrap" (true).
2. Append OPENSHIFT_PROFILE=web to /etc/sysconfig/atomic-openshift-node on the compute node instance (see the sketch after these steps)
3. Restart atomic-openshift-node on the compute node: systemctl restart atomic-openshift-node
4. Collect pprof data on compute node:
- go tool pprof -text -seconds=10 -output /root/$(date +%Y-%m-%d:%H:%M:%S).txt http://localhost:6060/debug/pprof/heap
- go tool pprof -text -seconds=10 -output /root/$(date +%Y-%m-%d:%H:%M:%S).txt http://localhost:6060/debug/pprof/profile
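A minimal sketch of steps 2 and 3 run directly on the compute node (using echo to append is an assumption; any editor works):

echo 'OPENSHIFT_PROFILE=web' >> /etc/sysconfig/atomic-openshift-node
systemctl restart atomic-openshift-node
# check whether the profiling endpoint came up
netstat -tunapl | grep 6060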

Actual results:
The "failed to fetch any profiles" error shown in the description above.

Expected results:
The profiling endpoint should start and listen on port 6060, and pprof data should be collected in the specified text file.

Additional info:
Logs from the master and compute node will be in the next attachment.

Comment 2 Mo 2018-04-30 17:45:18 UTC
Opened https://github.com/openshift/origin/pull/19567 as a strawman.

Comment 3 Walid A. 2018-04-30 19:04:04 UTC
There was a typo in the description section. The line appended to the /etc/sysconfig/atomic-openshift-node file and to /etc/origin/master/master.env was:

OPENSHIFT_PROFILE=web 

(not OPENSHIFT_PROFILING=web).  It is correct in the steps to reproduce.

Comment 4 Seth Jennings 2018-05-07 17:18:59 UTC
For me, the weirdness is this:

$ oc get --raw=/api/v1/nodes/node1.lab.variantweb.net/proxy/stats/summary | head -n 5
{
  "node": {
   "nodeName": "node1.lab.variantweb.net",
   "systemContainers": [
    {

$ oc get --raw=/api/v1/nodes/node1.lab.variantweb.net/proxy/debug/pprof/
Error from server (NotFound): the server could not find the requested resource

Yet on the local node:

$ curl --insecure https://localhost:10250/stats/summary
Forbidden (user=system:anonymous, verb=get, resource=nodes, subresource=stats)

$ curl --insecure https://localhost:10250/debug/pprof/
Forbidden (user=system:anonymous, verb=get, resource=nodes, subresource=proxy)

Both endpoints seem to be there on the kubelet, but the stats/summary is reachable via the proxy and debug/pprof is not.

Comment 5 Mo 2018-05-07 20:19:29 UTC
Seth, are you sure you are not just getting a Forbidden error because of the unauthenticated curl?  auth checks occur before "not found" checks IIRC.

Comment 6 ravig 2018-05-08 16:28:18 UTC
I was able to reproduce this problem. Following is my take on the situation:


- Since OPENSHIFT_PROFILE=web is dead as per https://github.com/openshift/origin/pull/19567#issuecomment-385839891, we cannot use it anymore.

- The URL to collect pprof data on a node is now https://<node_name>:10250/debug/pprof/profile (or heap, or block; see https://golang.org/pkg/net/http/pprof/)


Working solution:

- To collect data from a node, we can curl it from the master using:

curl --cert /tmp/cert.pem --key /tmp/cert.key https://ip-172-31-6-230.us-west-2.compute.internal:10250/debug/pprof/profile  -k --output /tmp/sample.txt -L

and then use 

go tool pprof /tmp/sample.txt

The cert.pem and cert.key should belong to a user with the cluster-admin role, or else we will get a 403 (authz error). On a 3.10 cluster (puddle: v3.10.0-0.32.0) I was able to find admin.key and admin.cert in /etc/origin/master, but when I used them I got a 403 again (with system:anonymous instead of system:admin), so I created cert.pem and cert.key using:

awk '/client-certificate-data:/ { print $2 }'  ~/.kube/config | base64 -d > /tmp/cert.pem
awk '/client-key-data:/ { print $2 }'  ~/.kube/config | base64 -d > /tmp/cert.key
awk '/certificate-authority-data:/ { print $2 }' ~/.kube/config | base64 -d > /tmp/ca.pem
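As a hedged variant of the curl above, the extracted ca.pem can replace -k, assuming the CA in the kubeconfig also signs the kubelet's serving certificate (which may not hold on every install):

curl --cacert /tmp/ca.pem --cert /tmp/cert.pem --key /tmp/cert.key https://ip-172-31-6-230.us-west-2.compute.internal:10250/debug/pprof/profile --output /tmp/sample.txt -L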

Other options explored:

- I believe we cannot use oc get --raw=/api/v1/nodes/ip-172-31-29-58.us-west-2.compute.internal/proxy/debug/pprof/profile because we cannot redirect its output to a file that go tool pprof understands.

- We also cannot point go tool pprof directly at https, as pprof currently cannot accept https with a cacert and key (https://github.com/google/pprof/pull/261).


I also think we shouldn't start the profiler, as it doesn't make much sense in the 3.10 case. I will create a PR to remove that.

Comment 7 Mo 2018-05-08 23:31:44 UTC
Why does redirection not work with oc get --raw?

Comment 8 ravig 2018-05-08 23:46:30 UTC
To make my earlier statement clear, I tried  


oc get --raw=/api/v1/nodes/ip-172-31-29-58.us-west-2.compute.internal/proxy/debug/pprof/trace?seconds=30 > /tmp/sample.txt

After that I got:

go tool pprof /tmp/sample.txt 
/tmp/sample.txt: parsing profile: unrecognized profile format
failed to fetch any profiles.

So it seems the execution trace doesn't work, but I am not sure why at this point.

But there is a typo in my earlier statement; it works for /profile (cpu) and /heap (memory).

Comment 9 Walid A. 2018-05-14 20:38:34 UTC
I was able to verify that we can pull pprof data (profile or heap) from all the nodes (master, compute, and infra) on this OCP 3.10.0-0.32.0 cluster using ravig's curl command in Comment 6, and the cert and key he generated:

curl --cert /tmp/cert.pem --key /tmp/cert.key https://<node_name>:10250/debug/pprof/profile -k --output /tmp/pprof_profile_sample.txt -L

curl --cert /tmp/cert.pem --key /tmp/cert.key https://<node_name>:10250/debug/pprof/heap  -k --output /tmp/pprof_heap_sample.txt -L


and then run

go tool pprof -text -seconds=10 -output /tmp/$(date +%Y-%m-%d:%H:%M:%S)_profile.txt /tmp/pprof_profile_sample.txt 
go tool pprof -text -seconds=10 -output /tmp/$(date +%Y-%m-%d:%H:%M:%S)_heap.txt /tmp/pprof_heap_sample.txt 


Will the pprof profile, heap, and block data always be available at port 10250 for all nodes without any additional configuration to enable them?

We would need to update the docs with this new supported method for accessing pprof data, since we have some existing docs referring to the old method, e.g. https://access.redhat.com/solutions/2155121

Comment 10 ravig 2018-05-14 21:26:55 UTC
> Will the pprof profile, heap, and block data always be available at port 10250 for all nodes without any additional configuration to enable them?

They are already available; you just need to request /heap or /block there instead of /profile.
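For example, a sketch reusing the cert/key from Comment 6 (the node hostname is just a placeholder):

curl --cert /tmp/cert.pem --key /tmp/cert.key https://<node_name>:10250/debug/pprof/heap -k -L --output /tmp/pprof_heap.raw
curl --cert /tmp/cert.pem --key /tmp/cert.key https://<node_name>:10250/debug/pprof/block -k -L --output /tmp/pprof_block.raw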

> We would need to update the docs with this new supported method for accessing pprof data, since we have some existing docs referring to the old method, e.g. https://access.redhat.com/solutions/2155121

You are right, but I am not sure how/whom to contact for updating. Seth, any pointers?

Comment 11 Walid A. 2018-05-14 21:44:34 UTC
I was also able to pull pprof profile and heap data from a newer OCP 3.10 cluster, v3.10.0-0.38.0, as described in Comment 9, without any additional post-install configuration, other than using the certs and keys generated from the .kube/config file on the master (the 3 awk commands in Comment 6).

Comment 14 ravig 2018-05-17 00:36:57 UTC
Walid,

As discussed offline, you can contact the installer team to get information on the location of the certificates. In the meantime, you could also use the 2nd option suggested in the earlier comment:

$ oc proxy &

$ oc get --raw=/api/v1/nodes/ip-172-31-29-58.us-west-2.compute.internal/proxy/debug/pprof/profile > /tmp/sample.txt

Clayton suggested an even easier way, which can be done in a single step (without using oc proxy):

$ oc get --raw /debug/pprof/profile --server https://ip-172-31-29-58.us-west-2.compute.internal:10250 


Mo,

I looked more into the execution trace and it seems the issue is not with go tool pprof.
- From the generated execution trace file, we need to generate a pprof file using go tool trace.

For example,

$ oc get --raw /debug/pprof/trace?seconds=30 --server https://ip-172-31-29-58.us-west-2.compute.internal:10250 > /tmp/sample.trace


go tool trace -pprof=TYPE sample.trace > TYPE.pprof

Here TYPE could be any of:
- net: network blocking profile
- sync: synchronization blocking profile
- syscall: syscall blocking profile
- sched: scheduler latency profile

Ref: https://golang.org/cmd/trace/
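
For example, a hedged sketch for the scheduler latency profile (file paths are placeholders):

go tool trace -pprof=sched /tmp/sample.trace > /tmp/sched.pprof
go tool pprof -text /tmp/sched.pprof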

PR for origin changes:

https://github.com/openshift/origin/pull/19744/files

Comment 15 Sebastian Jug 2018-05-17 18:33:32 UTC
We've already had a fix (https://github.com/distributed-system-analysis/pbench/pull/676) in progress for the pbench pprof tool since the endpoint went behind RBAC. However, the pprof endpoint was intermittently available on some builds and not on others, so we kept using the 6060 ports...

The problem with using the `oc get --raw` command is that the heap and profile endpoints don't have an interval period, so we only get an instantaneous snapshot. To make pprof work correctly with the new RBAC endpoints I also put in a PR for pprof itself (https://github.com/google/pprof/pull/261)...

So it's not so much about needing "a PR against pbench pprof tool to fix this." Rather, we need a way to capture the same interval data; then we can add it to the PR.

Comment 16 ravig 2018-05-17 19:34:43 UTC
> However, the pprof endpoint was intermittently available on some builds and not on others, so we kept using the 6060 ports...

Going forward (3.10+), using 6060 is not possible, so we have to stick to 10250; using `oc get --raw` or curl is the way to go, unless the pprof PR that you created some time ago gets merged.



> The problem with using the `oc get --raw` command is that the heap and profile endpoints don't have an interval period,

You can use /debug/pprof/profile?seconds=30 with `oc get --raw`, as shown for trace in the earlier comment.
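
For example, a sketch against the node used in earlier comments (the quotes just keep the shell from interpreting the ?):

oc get --raw "/debug/pprof/profile?seconds=30" --server https://ip-172-31-29-58.us-west-2.compute.internal:10250 > /tmp/profile_30s.raw
go tool pprof -text /tmp/profile_30s.raw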

Comment 17 Sebastian Jug 2018-05-17 19:40:29 UTC
Well, it's good to see one way or another that the endpoint will definitively be gone. If it "reappears" again, is that considered a bug?

Agreed, we can use `oc get --raw /debug/pprof/heap?seconds=N`, but we currently use profiling periods for the heap endpoint as well, which doesn't seem to honor the seconds option.

Comment 18 Sebastian Jug 2018-05-17 19:41:43 UTC
Typo:
`oc get --raw /debug/pprof/profile?seconds=N` works, not heap

Comment 19 ravig 2018-05-17 20:26:02 UTC
> If it "reappears" again is that considered a bug?

That is correct for openshift-node. For the master, we still need to use 6060.


> `oc get --raw /debug/pprof/profile?seconds=N` works, not heap

It should not, since heap collection happens instantaneously. Did it work earlier, even when hitting the http server?

Comment 20 Mike Fiedler 2018-05-17 20:31:39 UTC
Heap was always point in time.  We collect heap on the same interval as cpu, but the data itself was instantaneous at that moment.

Comment 21 Walid A. 2018-05-18 00:46:22 UTC
Verified on OCP v3.10.0-0.46.0 (cri-o runtime) that we can collect openshift-node/kubelet pprof profile and heap data from all the nodes in the cluster using the following commands executed on the master/etcd host:

For heap:
---------
oc get --raw /debug/pprof/heap --server https://<node_ip_or_hostname>:10250 > /tmp/pprof_heap.raw

go tool pprof -text -seconds=10 -output /tmp/$(date +%Y-%m-%d:%H:%M:%S)_heap.txt /tmp/pprof_heap.raw


For profile:
------------
oc get --raw /debug/pprof/profile --server https://<node_ip_or_hostname>:10250 > /tmp/pprof_profile.raw

go tool pprof -text -seconds=10 -output /tmp/$(date +%Y-%m-%d:%H:%M:%S)_profile.txt /tmp/pprof_profile.raw

Comment 23 errata-xmlrpc 2018-07-30 19:14:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

