Problem statement:
On a rhel8 os + rhel8 containers I can reliably get mistral-server to lock up using 100% cpu.
This seems to happen as soon as an ansible task finishes (either successfully or unsuccessfully).

So I will get this in top:
PID    USER  PR NI VIRT    RES    SHR   S %CPU  %MEM TIME+    COMMAND
319045 42430 20 0  1068584 251084 24316 R 100.0 1.2  18:03.18 mistral-server

It is the mistral_executor container that is stuck:
[root@undercloud-0 mistral]# for i in mistral_engine mistral_executor mistral_api mistral_event_engine; do echo $i; podman inspect $i --format '{{.State.Pid}}' | grep 319045; done
mistral_engine
mistral_executor
319045
mistral_api
mistral_event_engine

There are no particular errors in the executor.log (as a matter of fact it seems it stops logging once it hangs).

sosreport from the undercloud:
http://file.rdu.redhat.com/~mbaldess/mistral-bz/sosreport-undercloud-0-2019-03-07-bkgdcaw.tar.xz

mistral logs are here:
http://file.rdu.redhat.com/~mbaldess/mistral-bz/mistral-logs.tgz

Versions inside the container:
[root@undercloud-0 containers]# for i in mistral_executor; do echo $i; podman exec -it -u root $i sh -c 'rpm -qa |grep mistral'; done
mistral_executor
puppet-mistral-14.2.1-0.20190226101400.b76cf93.el8ost.noarch
python3-mistral-lib-1.0.0-0.20190117093849.d1ccfd0.el8ost.noarch
python3-mistral-8.0.0-0.20190228185014.608367f.el8ost.noarch
openstack-mistral-common-8.0.0-0.20190228185014.608367f.el8ost.noarch
python3-mistralclient-3.7.0-0.20190110194247.f0ee48f.el8ost.noarch
openstack-mistral-executor-8.0.0-0.20190228185014.608367f.el8ost.noarch
It seems to be stuck in a loop reading from a pipe that is constantly returning zero.

[root@undercloud-0 containers]# head -n10 /tmp/strace
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0

I guess the other side died since there is only one process with a descriptor open on the pipe?

[root@undercloud-0 containers]# lsof -n 2>&1|grep -w 3258466
mistral-s 319045 42430 9r FIFO 0,12 0t0 3258466 pipe
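The zero-length reads in the strace are what a pipe reader sees once every write end has been closed: read() reports end-of-file immediately instead of blocking, so a loop that does not treat that as a stop condition will spin. A minimal sketch of that behaviour, using a throwaway pipe rather than the mistral process (the function name is made up for illustration):

```python
import os


def read_after_writer_closed():
    """Return what os.read() yields once the pipe's only writer is closed."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)  # no writers remain, like a child process that exited
    try:
        # Each call returns at once with zero bytes (EOF) instead of blocking.
        return [os.read(read_fd, 8192) for _ in range(3)]
    finally:
        os.close(read_fd)


print(read_after_writer_closed())  # [b'', b'', b'']
```

This only illustrates the kernel-side behaviour; it does not by itself show which loop in the executor keeps calling read().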
Here is a small reproducer:

[root@undercloud-0 ~]# more reproducer.py
import subprocess
import time

command = ['ansible', '-m', 'ping', 'localhost']
# that processutils has, do we need to replicate that somehow?
process = subprocess.Popen(command, stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT, shell=False,
                           bufsize=1, universal_newlines=True)
start = time.time()
stdout = []
lines = []
for line in iter(process.stdout.readline, b''):
    lines.append(line)
print(lines)

(mostly copied from https://github.com/openstack/tripleo-common/blob/master/tripleo_common/actions/ansible.py#L570)
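One plausible reading of the reproducer, not confirmed in this report or by the advisory: with universal_newlines=True on Python 3, readline() returns str, so at end-of-file it returns '' and never the bytes sentinel b''. Since '' != b'' on Python 3, iter() never stops and the loop keeps reading an exhausted pipe. A minimal sketch with a str sentinel follows; read_lines and the stand-in child command are assumptions for illustration, not code from tripleo-common:

```python
import subprocess
import sys


def read_lines(command):
    """Collect a child's combined stdout/stderr lines until EOF."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, shell=False,
                               bufsize=1, universal_newlines=True)
    lines = []
    # Text mode: readline() returns '' (str) at EOF, so the sentinel is ''.
    for line in iter(process.stdout.readline, ''):
        lines.append(line)
    process.stdout.close()
    process.wait()
    return lines


# Stand-in child process so the sketch runs without ansible installed.
demo = [sys.executable, '-c', "print('ping'); print('pong')"]
print(read_lines(demo))
```

To try it against the original case, pass ['ansible', '-m', 'ping', 'localhost'] instead of the stand-in command. Iterating directly with "for line in process.stdout:" would avoid the sentinel question altogether.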
It's not better by much, but it doesn't seem to block anything now, and mistral only stresses itself during operations.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:2811