Description of problem:

On a 4 node OCP 3.10.0-0.29.0 AWS environment (1 master/etcd, 1 infra and 2 compute nodes), the pprof profiling endpoint fails to start on the compute nodes after adding OPENSHIFT_PROFILING=web to the /etc/sysconfig/atomic-openshift-node environment file and restarting the node service with systemctl restart atomic-openshift-node. The OCP cluster was installed with openshift_node_bootstrap=true (default value).

Compute node journal logs show it is starting the profiling endpoint:

atomic-openshift-node[10357]: I0427 12:36:49.651796   10358 profiler.go:19] Starting profiling endpoint at http://127.0.0.1:6060/debug/pprof/

But no server is listening on port 6060 on the compute node:

# netstat -tunapl | grep 6060

Trying to access the pprof data from the profiling endpoint fails:

# go tool pprof --text http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
http://localhost:6060/debug/pprof/heap: Get http://localhost:6060/debug/pprof/heap: dial tcp [::1]:6060: getsockopt: connection refused
failed to fetch any profiles

This method of starting the profiling endpoint worked on OCP 3.9 compute and master/etcd nodes. On OCP 3.10, this method does not work on the compute node; it only works on the master/etcd node with the master static pods, when "OPENSHIFT_PROFILING=web" is appended to the /etc/origin/master/master.env file and the master-api and controller pods are restarted with the provided restart shims in /usr/local/bin.

Version-Release number of selected component (if applicable):

# openshift version
openshift v3.10.0-0.29.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

# docker version
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-62.gitc6c9b51.el7.x86_64
 Go version:      go1.9.2

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OCP 3.10.0-0.29.0 environment with the openshift-ansible/playbooks/deploy_cluster.yml playbook: 1 master/etcd, 1 infra, and 2 compute nodes (m4.xlarge), with the default value for the variable "openshift_node_bootstrap" (true).
2. Append OPENSHIFT_PROFILE=web to /etc/sysconfig/atomic-openshift-node on a compute node instance.
3. Restart atomic-openshift-node on the compute node: systemctl restart atomic-openshift-node
4. Collect pprof data on the compute node:
   - go tool pprof -text -seconds=10 -output /root/$(date +%Y-%m-%d:%H:%M:%S).txt http://localhost:6060/debug/pprof/heap
   - go tool pprof -text -seconds=10 -output /root/$(date +%Y-%m-%d:%H:%M:%S).txt http://localhost:6060/debug/pprof/profile

Actual results:
Getting the "failed to fetch any profiles" error shown in the description above.

Expected results:
Profiling endpoint starts and listens on port 6060. Pprof data is collected in the specified text file.

Additional info:
Logs from the master and compute node will be in the next attachment.
Opened https://github.com/openshift/origin/pull/19567 as a strawman.
Had a typo in the description section. The line that was appended to /etc/sysconfig/atomic-openshift-node file and to file /etc/origin/master/master.env was: OPENSHIFT_PROFILE=web (not OPENSHIFT_PROFILING=web). It is correct in the steps to reproduce.
For me, the weirdness is this:

$ oc get --raw=/api/v1/nodes/node1.lab.variantweb.net/proxy/stats/summary | head -n 5
{
 "node": {
  "nodeName": "node1.lab.variantweb.net",
  "systemContainers": [
   {

$ oc get --raw=/api/v1/nodes/node1.lab.variantweb.net/proxy/debug/pprof/
Error from server (NotFound): the server could not find the requested resource

Yet on the local node:

$ curl --insecure https://localhost:10250/stats/summary
Forbidden (user=system:anonymous, verb=get, resource=nodes, subresource=stats)

$ curl --insecure https://localhost:10250/debug/pprof/
Forbidden (user=system:anonymous, verb=get, resource=nodes, subresource=proxy)

Both endpoints seem to be there on the kubelet, but stats/summary is reachable via the proxy and debug/pprof is not.
Seth, are you sure you are not just getting a Forbidden error because of the unauthenticated curl? Auth checks occur before "not found" checks, IIRC.
I was able to reproduce this problem. Following is my take on the situation:

- Since OPENSHIFT_PROFILE=web is dead as per https://github.com/openshift/origin/pull/19567#issuecomment-385839891, we cannot use it anymore.
- The URL to collect pprof data from a node is now https://<node_name>:10250/debug/pprof/profile (or heap, or block; see https://golang.org/pkg/net/http/pprof/).

Working solution:

- To collect data from a node we can curl it from the master using:

curl --cert /tmp/cert.pem --key /tmp/cert.key https://ip-172-31-6-230.us-west-2.compute.internal:10250/debug/pprof/profile -k --output /tmp/sample.txt -L

and then use:

go tool pprof /tmp/sample.txt

The cert.pem and cert.key should belong to a user with the cluster-admin role, or else we will get a 403 (authz error). On a 3.10 cluster (puddle: v3.10.0-0.32.0) I was able to find admin.key and admin.cert in /etc/origin/master, but when I used them I got a 403 again (with system:anonymous instead of system:admin), so I created cert.pem and cert.key using:

awk '/client-certificate-data:/ { print $2 }' ~/.kube/config | base64 -d > /tmp/cert.pem
awk '/client-key-data:/ { print $2 }' ~/.kube/config | base64 -d > /tmp/cert.key
awk '/certificate-authority-data:/ { print $2 }' ~/.kube/config | base64 -d > /tmp/ca.pem

Other options explored:

- I believe we cannot use oc get --raw=/api/v1/nodes/ip-172-31-29-58.us-west-2.compute.internal/proxy/debug/pprof/profile because we cannot redirect it to a file that go tool pprof understands.
- We also cannot use go tool pprof with https, as pprof currently cannot accept https, cacert and key (https://github.com/google/pprof/pull/261).

I also think we shouldn't start the profiler, as it doesn't make much sense in the 3.10 case. I will create a PR to remove that.
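Putting the pieces of the working solution together, an end-to-end sketch of the sequence (illustrative only; it assumes the kubeconfig on the master embeds cluster-admin client certificate data, the kubelet listens on 10250, and the node hostname is a placeholder) would be:

# Extract a cluster-admin client cert/key from the kubeconfig (assumes embedded data)
awk '/client-certificate-data:/ { print $2 }' ~/.kube/config | base64 -d > /tmp/cert.pem
awk '/client-key-data:/ { print $2 }' ~/.kube/config | base64 -d > /tmp/cert.key

# Fetch a CPU profile from the kubelet on the target node
curl --cert /tmp/cert.pem --key /tmp/cert.key -k -L \
  --output /tmp/sample.txt \
  https://<node_name>:10250/debug/pprof/profile

# Analyze the downloaded profile locally
go tool pprof -text /tmp/sample.txt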
Why does redirection not work with oc get --raw?
To make my earlier statement clear, I tried:

oc get --raw=/api/v1/nodes/ip-172-31-29-58.us-west-2.compute.internal/proxy/debug/pprof/trace?seconds=30 > /tmp/sample.txt

After that I got:

go tool pprof /tmp/sample.txt
/tmp/sample.txt: parsing profile: unrecognized profile format
failed to fetch any profiles

So it seems the execution trace doesn't work, but I am not sure why at this point. There is also a typo in my earlier statement: it works for /profile (cpu) and /heap (memory).
I was able to test that we can pull pprof data from all the nodes (master, compute, and infra) on this OCP 3.10.0-0.32.0 cluster (profile or heap) using ravig's curl command in Comment 6, with the cert and key he generated:

curl --cert /tmp/cert.pem --key /tmp/cert.key https://<node_name>:10250/debug/pprof/profile -k --output /tmp/pprof_profile_sample.txt -L
curl --cert /tmp/cert.pem --key /tmp/cert.key https://<node_name>:10250/debug/pprof/heap -k --output /tmp/pprof_heap_sample.txt -L

and then run:

go tool pprof -text -seconds=10 -output /tmp/$(date +%Y-%m-%d:%H:%M:%S)_profile.txt /tmp/pprof_profile_sample.txt
go tool pprof -text -seconds=10 -output /tmp/$(date +%Y-%m-%d:%H:%M:%S)_heap.txt /tmp/pprof_heap_sample.txt

Will the pprof profile, heap and block data always be available at port 10250 for all nodes without any additional configuration to enable them?

We would need to update the docs with this new supported method for accessing pprof data, since we have some existing docs referring to the old method, e.g. https://access.redhat.com/solutions/2155121
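To repeat this for every node in the cluster, a small loop along these lines should work (an illustrative sketch only; it assumes the node names returned by oc get nodes are resolvable from the master and reuses /tmp/cert.pem and /tmp/cert.key from Comment 6):

for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  # Heap snapshot from the kubelet on each node
  curl --cert /tmp/cert.pem --key /tmp/cert.key -k -L \
    --output /tmp/pprof_heap_${node}.raw \
    https://${node}:10250/debug/pprof/heap
  # CPU profile from the kubelet on each node
  curl --cert /tmp/cert.pem --key /tmp/cert.key -k -L \
    --output /tmp/pprof_profile_${node}.raw \
    https://${node}:10250/debug/pprof/profile
done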
> Will the pprof profile, heap and block data always be available at port 10250 for all nodes without any additional configuration to enable them?

They are already available; you just need to request /heap or /block there instead of /profile.

> We would need to update the docs with this new supported method for accessing pprof data, since we have some existing docs referring to the old method, e.g. https://access.redhat.com/solutions/2155121

You are right, but I am not sure how/whom to contact for the update. Seth, any pointers?
I was also able to pull pprof profile and heap data from a newer OCP 3.10 cluster, v3.10.0-0.38.0, as described in Comment 9, without any additional post-install configuration, other than using the certs and keys generated from the .kube/config file on the master (the 3 awk commands in Comment 6).
Walid,

As discussed offline, you can contact the installer team to get information on the location of the certificates. In the meantime, you could also use the second option suggested in the earlier comment:

$ oc proxy &
$ oc get --raw=/api/v1/nodes/ip-172-31-29-58.us-west-2.compute.internal/proxy/debug/pprof/profile > /tmp/sample.txt

Clayton suggested an even easier way, which can be done in a single step (without using oc proxy):

$ oc get --raw /debug/pprof/profile --server https://ip-172-31-29-58.us-west-2.compute.internal:10250

Mo,

I looked more into the execution trace and it seems the issue is not with go tool pprof. From the execution trace file generated, we need to generate a pprof file again using go tool trace. For example:

$ oc get --raw /debug/pprof/trace?seconds=30 --server https://ip-172-31-29-58.us-west-2.compute.internal:10250 > /tmp/sample.trace
$ go tool trace -pprof=TYPE sample.trace > TYPE.pprof

Here TYPE can be any of:
- net: network blocking profile
- sync: synchronization blocking profile
- syscall: syscall blocking profile
- sched: scheduler latency profile

Ref: https://golang.org/cmd/trace/

PR for origin changes: https://github.com/openshift/origin/pull/19744/files
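As a concrete illustration of that trace workflow (a sketch only; the node hostname is a placeholder and the output paths are arbitrary), converting a captured trace into a scheduler-latency profile and inspecting it would look like:

# Capture a 30-second execution trace from the kubelet on the target node
oc get --raw "/debug/pprof/trace?seconds=30" --server https://<node_name>:10250 > /tmp/sample.trace

# Convert the trace into a pprof-format scheduler latency profile
go tool trace -pprof=sched /tmp/sample.trace > /tmp/sched.pprof

# Inspect the converted profile
go tool pprof -text /tmp/sched.pprof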
We've already had a fix (https://github.com/distributed-system-analysis/pbench/pull/676) in progress for the pbench pprof tool since the endpoint went behind RBAC. However, the pprof endpoint was intermittently available on some builds and not on others, so we kept using the 6060 ports...

The problem with using the `oc get --raw` command is that the heap and profile endpoints don't have an interval period, so we only get an instantaneous snapshot. To make pprof work correctly with the new RBAC endpoints I also put in a PR for pprof itself (https://github.com/google/pprof/pull/261)...

So it's not so much about needing "a PR against the pbench pprof tool to fix this." Rather, we need a way to capture the same interval data; then we can add it to the PR.
> However, the pprof endpoint was intermittently available on some builds and not on others, so we kept using the 6060 ports...

Going forward (3.10+), using 6060 is not possible, so we have to stick to 10250, and using `oc get --raw` or curl is the way to go unless the pprof PR that you created some time ago gets merged.

> The problem with using the `oc get --raw` command is that the heap and profile endpoints don't have an interval period,

You can use /debug/pprof/profile?seconds=30 with `oc get --raw`, as shown for trace in the earlier comment.
Well, it's good to see one way or another that the endpoint will definitively be gone. If it "reappears" again, is that considered a bug?

Agreed, we can use `oc get --raw /debug/pprof/heap?seconds=N`, but we currently use profiling periods for the heap endpoint as well, which doesn't seem to honor the seconds option.
Typo: `oc get --raw /debug/pprof/profile?seconds=N` works, not heap.
> If it "reappears" again, is that considered a bug?

That is correct for openshift-node. For the master, we still need to use 6060.

> `oc get --raw /debug/pprof/profile?seconds=N` works, not heap.

It should not, since heap collection happens instantaneously. Did it work earlier, even when hitting the HTTP server?
Heap was always point-in-time. We collect heap on the same interval as CPU, but the data itself was instantaneous at that moment.
Verified on OCP v3.10.0-0.46.0 (cri-o runtime) that we can collect openshift-node/kubelet pprof profile and heap data from all the nodes in the cluster using these two commands executed on the master/etcd host:

For heap:
---------
oc get --raw /debug/pprof/heap --server https://<node_ip_or_hostname>:10250 > /tmp/pprof_heap.raw
go tool pprof -text -seconds=10 -output /tmp/$(date +%Y-%m-%d:%H:%M:%S)_heap.txt /tmp/pprof_heap.raw

For profile:
------------
oc get --raw /debug/pprof/profile --server https://<node_ip_or_hostname>:10250 > /tmp/pprof_profile.raw
go tool pprof -text -seconds=10 -output /tmp/$(date +%Y-%m-%d:%H:%M:%S)_profile.txt /tmp/pprof_profile.raw
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816