Bug 1415293

Summary: pcp stuck in non-functional state (requiring reboot of RHEL)
Product: Red Hat Enterprise Linux 7
Reporter: Joel Diaz <jdiaz>
Component: pcp
Assignee: Mark Goodwin <mgoodwin>
Status: CLOSED NOTABUG
QA Contact: qe-baseos-tools-bugs
Severity: unspecified
Priority: unspecified
Version: 7.3
CC: brolley, dranders, fche, jdiaz, lberk, mgoodwin, nathans
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-02-01 00:36:57 UTC
Type: Bug
Attachments:
  /var/log/pcp/* (from comment 2)

Description Joel Diaz 2017-01-20 19:19:45 UTC
Description of problem:
We're running pcp inside a container with (what we believe is) appropriate volumes mounted into the container:
--privileged --pid=host --net=host --ipc=host -v /etc/localtime:/etc/localtime -v /sys:/sys:ro
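
(For reference, a complete invocation along these lines looks roughly like the one below; the image name is the one used in the commands later in this bug, and in practice the container is started by a systemd unit rather than by hand:

docker run --name oso-rhel7-host-monitoring --privileged --pid=host --net=host --ipc=host -v /etc/localtime:/etc/localtime -v /sys:/sys:ro registry.ops.openshift.com/ops/oso-rhel7-host-monitoring:stg )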

We've seen a subset of the machines we are monitoring get into a state where running 'pminfo' from inside the container returns:

pminfo: Cannot connect to PMCD on host "local:": No permission to perform requested operation

Restarting the container does not resolve the problem. The only fix we have found is to reboot RHEL.


Version-Release number of selected component (if applicable):
pcp-3.11.3-4.el7.x86_64

How reproducible:
Once a system is in this state, it will never run pminfo successfully until the whole machine is restarted.


Steps to Reproduce:
1. ??? wait for system to enter failing state
2. run pminfo: docker exec oso-rhel7-host-monitoring pminfo
3.

Actual results:
pminfo: Cannot connect to PMCD on host "local:": No permission to perform requested operation



Expected results:
Usual pminfo output.

Additional info:
When the container has entered the failed state, there is no /var/run/pcp directory (which of course means no /var/run/pcp/pmcd.socket).
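
(A quick way to confirm this from the host, using the same docker exec approach as in the reproduce steps:

docker exec oso-rhel7-host-monitoring ls -ld /var/run/pcp /var/run/pcp/pmcd.socket

which should fail with "No such file or directory" while the container is in this state.)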

Comment 1 Frank Ch. Eigler 2017-01-20 20:01:55 UTC
Can you gather up the /var/log/pcp directory as a tarball, from a container in the failed state?

Comment 2 Joel Diaz 2017-01-20 21:47:28 UTC
Created attachment 1243017 [details]
/var/log/pcp/*

Comment 3 Frank Ch. Eigler 2017-01-20 22:47:47 UTC
(In reply to Joel Diaz from comment #2)

That log tarball has no content under /var/log/pcp/pmcd etc., which it should have if pmcd had ever even tried to start.

Is this snapshot from a post-dead state, or a freshly-recreated state?  Can you pass on the Dockerfile (if any) for your container, and how exactly you started the image this time?  (Is this from https://github.com/openshift/openshift-tools/tree/prod/docker/oso-host-monitoring ?)

Comment 4 Joel Diaz 2017-01-21 00:56:58 UTC
(In reply to Frank Ch. Eigler from comment #3)

Right. There is no /var/log/pcp/pmcd. This is while the container is in the failed state. I can restart the container, and there is still no /var/log/pcp/pmcd. I tried bypassing the usual container startup and manually creating /var/log/pcp/pmcd before starting pmcd, and still nothing was in the directory.

Yes, this is exactly that oso-host-monitoring container (the RHEL7 version). You can grab the CentOS version from here (if that helps): https://hub.docker.com/r/openshifttools/oso-centos7-host-monitoring/

Comment 6 Mark Goodwin 2017-01-24 00:36:24 UTC
I don't seem to be able to reproduce this on my F25 host (which has PCP installed with pmcd running), e.g. :

$ docker run --rm -it --privileged --pid=host --net=host --ipc=host -v /etc/localtime:/etc/localtime -v /sys:/sys:ro docker.io/openshifttools/oso-centos7-host-monitoring pminfo -f hinv.ncpu

hinv.ncpu
    value 4

Is your host running RHEL atomic? Or which version of RHEL?

For the PCP containers that we build in the PCP project (see https://bintray.com/pcp/containers), there are a couple of extra bind mounts needed (in particular /run, /dev/log and /var/lib/docker). Can you try the following?

# docker run --rm -it --privileged --net=host --pid=host --ipc=host -v /sys:/sys:ro -v /etc/localtime:/etc/localtime:ro -v /var/lib/docker:/var/lib/docker:ro -v /run:/run -v /var/log:/var/log -v /dev/log:/dev/log docker.io/openshifttools/oso-centos7-host-monitoring pminfo -f hinv.ncpu


If that doesn't work, please reproduce the issue (without using --rm), then package up the entire container instance with the 'docker export' command. Be sure to gzip the resulting tarball and make it available to me somehow (e.g. upload to Red Hat's dropbox, see https://access.redhat.com/solutions/2112 for instructions).
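
For example, something along these lines, assuming the container is named oso-rhel7-host-monitoring as in the earlier docker exec commands:

docker export -o pcp-container.tar oso-rhel7-host-monitoring
gzip pcp-container.tar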

Regards
-- Mark Goodwin

Comment 7 Joel Diaz 2017-01-24 20:50:52 UTC
Not running RHEL atomic, just regular RHEL 7.3.

On a system where the monitoring container is currently showing the failing pcp commands, I ran:

docker run --rm -ti --privileged --net=host --pid=host --ipc=host -v /sys:/sys:ro -v /etc/localtime:/etc/localtime:ro -v /var/lib/docker:/var/lib/docker:ro -v /run:/run -v /var/log:/var/log -v /dev/log:/dev/log registry.ops.openshift.com/ops/oso-rhel7-host-monitoring:stg pminfo -f hinv.cpu

and got:
pminfo: Cannot connect to PMCD on host "local:": No permission to perform requested operation


We don't actually run the container on our clusters with '--rm'; the systemd service file that manages this container takes care of 'docker rm' before starting a new container. Only the example scripts for starting the container locally use --rm, for convenience.

I stopped the container, exported it, edited out the sensitive bits and re-tar/zipped it up: http://file.rdu.redhat.com/~jdiaz/1415293/pcp-container.tar.xz

Comment 8 Frank Ch. Eigler 2017-01-24 20:54:27 UTC
Just a hunch, but what about selinux on or off?

Comment 9 Joel Diaz 2017-01-24 21:35:45 UTC
(In reply to Frank Ch. Eigler from comment #8)
> Just a hunch, but what about selinux on or off?

Didn't help...

[root@ded-stage-gcp-node-compute-vmgx7 ~]# setenforce 0
[root@ded-stage-gcp-node-compute-vmgx7 ~]# systemctl restart oso-rhel7-host-monitoring.service 
[root@ded-stage-gcp-node-compute-vmgx7 ~]# docker exec -ti oso-rhel7-host-monitoring pminfo
pminfo: Cannot connect to PMCD on host "local:": No permission to perform requested operation

Comment 10 Mark Goodwin 2017-01-24 22:14:22 UTC
# docker exec -ti oso-rhel7-host-monitoring pminfo

The above uses "local:" as the host specification by default, which tries a unix domain socket if available and falls back to a regular TCP/inet connection to 'localhost'. It looks like both are failing.
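
For what it's worth, the two paths can presumably be exercised separately with explicit host specifications (assuming this PCP version supports the unix: host spec):

# unix domain socket only (default socket path)
docker exec -ti oso-rhel7-host-monitoring pminfo -h unix: -f hinv.ncpu
# inet connection to localhost only
docker exec -ti oso-rhel7-host-monitoring pminfo -h localhost -f hinv.ncpu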

Can you try using :

# docker exec -ti oso-rhel7-host-monitoring pminfo -L -f hinv.ncpu

The above (with -L) will not use a network socket at all - it will call the Linux PMDA directly via dlopen. If that works, then we can narrow down the problem somewhat.

meanwhile, I'm downloading your container image ...

Thanks
-- Mark

Comment 11 Joel Diaz 2017-01-24 22:22:56 UTC
When the system gets into the "pcp not working" state, the container comes up without /var/run/pcp. If memory serves me correctly, it tries the socket, then tcp, then fails.

Trying with the -L does work:

[root@ded-stage-gcp-node-compute-vmgx7 ~]# docker exec -ti oso-rhel7-host-monitoring pminfo -L -f hinv.ncpu

hinv.ncpu
    value 4

Comment 12 Frank Ch. Eigler 2017-01-24 22:27:34 UTC
Have you tried without bind-mounting the host's [/var]/run into the container?

Comment 13 Joel Diaz 2017-01-25 14:39:25 UTC
Taking out the /var/* bind mounts didn't seem to change anything:

[root@ded-stage-gcp-node-compute-vmgx7 ~]# docker run -ti --name oso-rhel7-host-monitoring --privileged --pid=host --net=host --ipc=host -v /etc/localtime:/etc/localtime -v /sys:/sys:ro -v /etc/origin/node:/etc/origin/node -v /usr/bin/oc:/usr/bin/oc -v /usr/bin/oadm:/usr/bin/oadm -v /etc/openshift_tools:/container_setup:ro registry.ops.openshift.com/ops/oso-rhel7-host-monitoring:stg bash
[CTR][root@4fde94f4ab23 ~]$ /usr/share/pcp/lib/pmcd start
PMCD process ... 50620
/usr/share/pcp/lib/pmcd:
Warning: found no /var/run/pcp/pmcd.pid
         and no /var/log/pcp/pmcd/pmcd.log.
         Assuming an uninstall from a chroot: pmcd not killed.
         If this is incorrect, "pmsignal -s TERM 50620"  can be used.
[CTR][root@4fde94f4ab23 ~]$ pminfo -f hinv.ncpu
pminfo: Cannot connect to PMCD on host "local:": No permission to perform requested operation

Comment 14 Frank Ch. Eigler 2017-01-25 14:46:44 UTC
At last, the console error messages!
Any idea what that process 50620 is?
Is the container running somewhere we could log into now?

Comment 15 Frank Ch. Eigler 2017-01-25 14:51:59 UTC
Looking at the /usr/share/pcp/lib/pmcd script, it appears that if a process named "pmcd" is running anywhere on the system, it will block the real PCP pmcd from even trying to start.
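
The guard amounts to something like this (an illustrative sketch only, not the actual script, assuming a pidof-style name match):

# illustrative sketch only -- not the actual /usr/share/pcp/lib/pmcd code
if pid=$(pidof pmcd); then
    echo "PMCD process ... $pid"
    # any pmcd visible in the pid namespace means the real startup is skipped
    exit 0
fi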

It would be nice to move closer to a pure systemd startup and stop using and/or dramatically simplify this script.

Comment 16 Joel Diaz 2017-01-25 16:41:32 UTC
It shows a running process, but I can't make heads or tails of where the process is coming from (the container host does not have pcp RPMs installed):

[root@ded-stage-gcp-node-compute-vmgx7 ~]# ps aux | grep 50620
root       9178  0.0  0.0 112648   964 pts/1    S+   16:39   0:00 grep --color=auto 50620
root      50620  0.0  0.0  58580  2000 ?        Ss   Jan03  10:05 /usr/libexec/pcp/bin/pmcd -A
[root@ded-stage-gcp-node-compute-vmgx7 ~]# cat /proc/50620/cgroup 
11:blkio:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
10:freezer:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
9:pids:/
8:hugetlb:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
7:net_prio,net_cls:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
6:cpuset:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
5:devices:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
4:perf_event:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
3:cpuacct,cpu:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
2:memory:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
1:name=systemd:/system.slice/docker-503caeedd393f145024900c3763d3d680550a224e0aa587cc6a99cdd33213825.scope
[root@ded-stage-gcp-node-compute-vmgx7 ~]# docker ps
CONTAINER ID        IMAGE                                                         COMMAND             CREATED             STATUS              PORTS               NAMES
e1f1c75d270a        registry.access.redhat.com/openshift3/logging-fluentd:3.3.1   "sh run.sh"         19 hours ago        Up 19 hours                             k8s_fluentd-elasticsearch.11d4af99_logging-fluentd-x0ete_logging_449ebe9b-e27d-11e6-bb7b-4201ac19000a_f12fba48
e6b5f90aca88        registry.ops.openshift.com/openshift3/ose-pod:v3.3.1.5        "/pod"              19 hours ago        Up 19 hours                             k8s_POD.868ae68a_logging-fluentd-x0ete_logging_449ebe9b-e27d-11e6-bb7b-4201ac19000a_1027470d

Comment 17 Frank Ch. Eigler 2017-01-25 17:31:54 UTC
Ah, things are starting to make sense - so another container on the same host is already running pmcd.  Because your new container is run with docker --pid=host, the other pmcd is visible to this one and found by the goofy initscript logic.

Try dropping the --pid=host just for giggles. It may hide other processes from pmcd's procfs.* metrics, but pmcd may get closer to starting up.

Comment 18 Joel Diaz 2017-01-25 20:22:12 UTC
It looks like what is going on is that a previous version of our oso-rhel7-host-monitoring container exited, its pmcd process was never properly stopped or killed, and that pmcd is now a child of PID 1.
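
(The reparenting can be confirmed with something like:

ps -o ppid= -p 50620

which prints 1 once the process has been adopted by init/systemd.)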

The /proc/50620/cgroup output does show cgroup paths containing a docker container id, but there is no corresponding docker container with that id (as far as 'docker ps -a' output goes).

We happen to have a few nodes in this state, so I hopped onto another one, manually killed the stray pmcd process, restarted our oso-rhel7-host-monitoring container, and pcp came up in a working state.
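
(For reference, the manual recovery amounted to something along these lines, run on the affected host, with the stray pmcd pid found via ps as in comment 16; the host has no PCP tools installed, hence a plain kill rather than pmsignal:

kill 50620
systemctl restart oso-rhel7-host-monitoring.service
docker exec oso-rhel7-host-monitoring pminfo -f hinv.ncpu )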

With all that being said, we have previously seen cases where docker/container processes became orphaned and were reparented to PID 1. I don't think we're looking at a pcp bug here, but a 'processes are escaping the docker process hierarchy' issue.

FWIW, I did try to run the container without '--pid=host', and pmcd does start up and populate /var/run/pcp/*, and everything seems to be working.

[root@ded-stage-gcp-node-compute-vmgx7 ~]# docker run -ti --name oso-rhel7-host-monitoring --privileged --net=host --ipc=host -v /etc/localtime:/etc/localtime -v /sys:/sys:ro -v /etc/origin/node:/etc/origin/node -v /usr/bin/oc:/usr/bin/oc -v /usr/bin/oadm:/usr/bin/oadm -v /etc/openshift_tools:/container_setup:ro registry.ops.openshift.com/ops/oso-rhel7-host-monitoring:stg bash
[CTR][root@54677b0e120e ~]$ /usr/share/pcp/lib/pmcd start &
[1] 27
[CTR][root@54677b0e120e ~]$ Starting pmcd ... [Wed Jan 25 20:18:18] pmcd(74) Error: gethostname failure
[Wed Jan 25 20:18:18] pmcd(74) Error: Can't get host name/IP address, giving up
Warning: the following access control specification will be ignored
pmcd config[line 16]: Warning: access control error for host 'local:*': Host is down
[Wed Jan 25 20:18:18] pmcd(76) Error: OpenRequestSocket(44321, INADDR_ANY, inet) __pmBind: Address already in use
[Wed Jan 25 20:18:18] pmcd(76) Error: pmcd may already be running
[Wed Jan 25 20:18:18] pmcd(76) Error: OpenRequestSocket(44321, INADDR_ANY, ipv6) __pmBind: Address already in use
[Wed Jan 25 20:18:18] pmcd(76) Error: pmcd may already be running


[CTR][root@54677b0e120e ~]$ pminfo -f hinv.ncpu

hinv.ncpu
    value 4


I think this bug can be closed, and we'll see about gathering some info to open a docker bug for this "escaped process" issue.

Comment 19 Nathan Scott 2017-02-01 00:36:57 UTC
Closing as per Joel's note in #c18