Bug 1686397 - [osp15] mistral-executor is stuck with 100% cpu
Summary: [osp15] mistral-executor is stuck with 100% cpu
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-mistral
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: beta
: 15.0 (Stein)
Assignee: Adriano Petrich
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-03-07 11:26 UTC by Michele Baldessari
Modified: 2019-09-26 10:48 UTC (History)
8 users (show)

Fixed In Version: openstack-mistral-8.0.1-0.20190606070407.6ff82c3.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-21 11:20:32 UTC
Target Upstream Version:
Embargoed:
apetrich: needinfo-
apetrich: needinfo-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 641667 0 None None None 2019-03-07 14:14:56 UTC
Red Hat Product Errata RHEA-2019:2811 0 None None None 2019-09-21 11:20:47 UTC

Description Michele Baldessari 2019-03-07 11:26:43 UTC
Problem statement:
On a rhel8 os + rhel8 containers I can reliably get mistral-server to lock up using 100% cpu

This seems to happen as soon as an ansible task finishes (either successfully or unsuccessfully)

So I will get this in top :
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 319045 42430     20   0 1068584 251084  24316 R 100.0   1.2  18:03.18 mistral-server

It is the mistral_executor container that is stuck:
 [root@undercloud-0 mistral]# for i in mistral_engine mistral_executor mistral_api mistral_event_engine; do echo $i; podman inspect $i --format '{{.Sta
te.Pid}}' | grep 319045; done
mistral_engine
mistral_executor
319045
mistral_api
mistral_event_engine

There are no particular errors in the executor.log (as a matter of fact it seems it stops logging once it hangs)

sosreport from the undercloud: http://file.rdu.redhat.com/~mbaldess/mistral-bz/sosreport-undercloud-0-2019-03-07-bkgdcaw.tar.xz
mistral logs are here: http://file.rdu.redhat.com/~mbaldess/mistral-bz/mistral-logs.tgz

Versions inside the container:
 [root@undercloud-0 containers]# for i in mistral_executor; do echo $i; podman exec -it -u root $i sh -c 'rpm -qa |grep mistral'; done
mistral_executor
puppet-mistral-14.2.1-0.20190226101400.b76cf93.el8ost.noarch
python3-mistral-lib-1.0.0-0.20190117093849.d1ccfd0.el8ost.noarch
python3-mistral-8.0.0-0.20190228185014.608367f.el8ost.noarch
openstack-mistral-common-8.0.0-0.20190228185014.608367f.el8ost.noarch
python3-mistralclient-3.7.0-0.20190110194247.f0ee48f.el8ost.noarch
openstack-mistral-executor-8.0.0-0.20190228185014.608367f.el8ost.noarch

Comment 1 Michele Baldessari 2019-03-07 11:40:46 UTC
It seems to be stuck in a loop reading from a pipe that is constantly returning zero.
 [root@undercloud-0 containers]# head -n10 /tmp/strace
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0
319045 read(9<pipe:[3258466]>, "", 8192) = 0

I guess the other side died since there is only one process with a descriptor open on the pipe?
 [root@undercloud-0 containers]# lsof -n 2>&1|grep -w 3258466
mistral-s 319045                           42430    9r     FIFO               0,12       0t0    3258466 pipe

Comment 2 Michele Baldessari 2019-03-07 12:07:59 UTC
Here is a small reproducer:
 [root@undercloud-0 ~]# more reproducer.py                    
import subprocess                                             
import time                                                   
                                                              
command = ['ansible', '-m', 'ping', 'localhost']              
# that processutils has, do we need to replicate that somehow?
process = subprocess.Popen(command, stdout=subprocess.PIPE,   
                           stderr=subprocess.STDOUT,          
                           shell=False, bufsize=1,            
                           universal_newlines=True)           
start = time.time()                                           
stdout = []                                                   
lines = []                                                    
for line in iter(process.stdout.readline, b''):               
    lines.append(line)                                        
                                                              
print(lines)                                                  

(mostly copied from https://github.com/openstack/tripleo-common/blob/master/tripleo_common/actions/ansible.py#L570)

Comment 6 Amit Ugol 2019-07-21 06:20:30 UTC
Its not better by much but it doesn't seem to block anything now and mistral only stresses itself during operations.

Comment 10 errata-xmlrpc 2019-09-21 11:20:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811


Note You need to log in before you can comment on or make changes to this bug.