Bug 1258985 - when a smartstate worker times out and is killed, any child processes (eg,vixdisklibserver.rb processes) are not killed with their parents leaving them running with PID 1 as the adopted parent process
when a smartstate worker times out and is killed, any child processes (eg,vix...
Status: CLOSED ERRATA
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: SmartState Analysis (Show other bugs)
5.3.0
All All
high Severity medium
: GA
: 5.5.0
Assigned To: Jerry Keselman
Ramesh A
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-09-01 12:34 EDT by Felix Dewaleyne
Modified: 2015-12-08 08:29 EST (History)
9 users (show)

See Also:
Fixed In Version: 5.5.0.6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-08 08:29:14 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs for pid 13209 (22.58 KB, text/plain)
2015-09-01 12:34 EDT, Felix Dewaleyne
no flags Details
second smartstate worker terminated where child process survives and loops (256.87 KB, text/plain)
2015-09-01 16:06 EDT, Thomas Hennessy
no flags Details
top reports showing 100% CPU utilzaiton until appliance reboot restores order (975.18 KB, application/pdf)
2015-09-01 16:07 EDT, Thomas Hennessy
no flags Details
spreadsheet of smartstate worker processes with problem processes highlighted by yellow backgrounds (16.71 KB, application/zip)
2015-09-01 16:08 EDT, Thomas Hennessy
no flags Details
tgz of several days of vim.log files from logrotate (172.90 KB, application/x-gzip)
2015-09-01 16:46 EDT, Thomas Hennessy
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:2551 normal SHIPPED_LIVE Moderate: CFME 5.5.0 bug fixes and enhancement update 2015-12-08 12:58:09 EST

  None (edit)
Description Felix Dewaleyne 2015-09-01 12:34:09 EDT
Created attachment 1069083 [details]
logs for pid 13209

Description of problem:
when a worker times out, vixdisklibserver.rb processes are not killed with their parents

Version-Release number of selected component (if applicable):
cfme-5.3.5.4

How reproducible:
not reproduced yet

Steps to Reproduce:
1. configure cfme 5.3.5.4 to use vmware
2. generate a timeout for a worker that has for child process vixdisklib

Actual results:
top - 03:15:18 up 24 days, 14:13,  0 users,  load average: 5.55, 5.75, 5.64
Tasks: 174 total,   6 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.1%us, 31.6%sy, 66.9%ni,  0.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   5992796k total,  4290748k used,  1702048k free,   312640k buffers
Swap:  9957372k total,     9968k used,  9947404k free,   571756k cached

  PID  PPID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                     
22701     1 root      30  10  189m  28m  10m R 77.9  0.5  27239:54 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
21970     1 root      30  10  189m  23m  10m R 76.0  0.4  27429:08 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
 3087     1 root      30  10  190m  29m  10m R 75.5  0.5  19339:19 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
 4862     1 root      30  10  189m  28m  10m R 75.2  0.5  19315:53 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
10112     1 root      30  10  204m  39m  10m R 69.9  0.7  19359:08 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb 

Expected results:

the vixdisklibserver.rb processes are killed along with their parents when the worker is restarted due to a timeout

Additional info:

original cause for the timeout is not known yet.
the activity of one of the processes was extracted from the logs - you can see it in the attachments.
Comment 2 Thomas Hennessy 2015-09-01 16:03:59 EDT
The architecture of the smartstate worker was changed earlier this year when the process was altered to initiate a child process for the VixDisk activity rather than performing the work within the smartstate worker itself.

When it happens that the smartstate worker, for whatever reason is killed while the child process is still active (perhaps in a loop, suggesting a bug in the VMware code) the child process is *not* killed, leaving the child process as an orphan that becomes parented by the 'init' PROCESS 1, while the process continues to run.

In the logs provided by the customer, top reports 0% idle cpu and shows 5 of these child processes active and consuming > about 70% of a CPU each. The customer original complaint is that provisioning is no longer proceeding as expected.

The top data for the two weeks provided shows several instances of the transition of these child processes from being owned by a smartstate worker, and after the worker is killed, these child processes continuing to execute and consume resources while now being parented by the 'init' process (#1). (see logs for 13209 and other provided logs).  A spreadsheet is also provided where the transition points of the child process being re-parented to process 1 is also provided.  The points of transition are highlighted by yellow backgrounds with the parent process being '1' and the line above showing the same PID but with the parent process being a smartstate worker process that has been terminated.

The requested remedy for this bug is for CFME to ensure that all child processes of the worker about to be terminated are terminated before the worker process is terminated.
Comment 3 Thomas Hennessy 2015-09-01 16:06:37 EDT
Created attachment 1069140 [details]
second smartstate worker terminated where child process survives and loops
Comment 4 Thomas Hennessy 2015-09-01 16:07:36 EDT
Created attachment 1069142 [details]
top reports showing 100% CPU utilzaiton until appliance reboot restores order
Comment 5 Thomas Hennessy 2015-09-01 16:08:44 EDT
Created attachment 1069144 [details]
spreadsheet of smartstate worker processes with problem processes highlighted by yellow backgrounds
Comment 6 Rich Oliveri 2015-09-01 16:09:47 EDT
These process are designed to timeout after a fairly short period of time if not accessed. Unless that timeout mechanism isn't working, this is a non-issue.
Comment 7 Jerry Keselman 2015-09-01 16:25:43 EDT
Can you tell me what version or versions of VDDK exhibit this behavior?  Thanks.
Comment 8 Jerry Keselman 2015-09-01 16:30:02 EDT
If you can also provide a vim.log for the period in question that would be helpful. Thanks.
Comment 9 Thomas Hennessy 2015-09-01 16:45:32 EDT
Jerry,
there are 14 days of archive logs and the current log set provided for this appliance by the customer. Not every day had a non-null vim.log but I have aggregated together all of the vim log files collected by logrotate and uploaded them into the attachment 'vim_logs.tgz'

As for what version of the vddk is in use, we will have to ask the customer.  Please provide whatever command you would want the customer to execute to provide you the most useable information to determine what level of vddk code is in use.

Tom Hennessy
Comment 10 Thomas Hennessy 2015-09-01 16:46:19 EDT
Created attachment 1069197 [details]
tgz of several days of vim.log files from logrotate
Comment 11 Jerry Keselman 2015-09-02 12:01:32 EDT
Would it be possible to get the vpxa.log files from the customer for the same time period as the vim.logs you have provided?  If the ESX hosts are 4.x then the log files will be in /var/log/vmware/vpx.  For 5.x ESX hosts the log files will be in /var/log.  Thanks.
Comment 12 Thomas Hennessy 2015-09-02 16:31:52 EDT
forwarding request for additional information to  Felix Dewaleyne who opened the case and is in proximity to the customer.
Comment 13 Jerry Keselman 2015-09-09 09:29:23 EDT
One change to be added here will be that the SmartProxyWorker will now trap TERM signals and hand them off to the VixDiskLibServer prior to exiting.  This way attempts to kill the SmartProxyWorker will also result in the VixDiskLibServer exiting.
Comment 14 Oleg Barenboim 2015-09-09 09:48:20 EDT
PR https://github.com/ManageIQ/manageiq/pull/4277
Comment 15 Jerry Keselman 2015-09-17 15:30:45 EDT
The above PR will be closed in lieu of the PR https://github.com/ManageIQ/manageiq/pull/4411
Comment 17 Jerry Keselman 2015-09-28 08:23:50 EDT
Felix lets hold off on asking for the logs at this point.  Since this BZ deals with making sure the child processes go away when the worker times out, that is what the above PR fixes.  It doesn't really deal with the cause of the timeout - which is that there is some hang in the VDDK (I/O possibly).  We will deal with that as it becomes necessary.
Comment 18 Ramesh A 2015-10-28 13:17:57 EDT
Verification blocked due to BZ#1276087
Comment 19 Jerry Keselman 2015-11-12 12:56:38 EST
Now that the aforementioned BZ has been closed can verification go through again?
Comment 20 Ramesh A 2015-11-18 08:04:14 EST
Hi Jerry,

Is there a way to generate a timeout for a worker that has for child process vixdisklib??

As far the QE testing is considered, followed the below mentioned approaches.  Verified in 5.5.0.10-beta2.1.20151110134042_d6f5459.
1. Fleecing an vm
2. Fleecing a vm whose RAM usage was high
3. Fleecing a vm whose Memory usage was high
4. Fleecing more than 4 vm's at once
5. Fleecing all the hosts (QE environment has 4 hosts attached to Vsphere55 provider which has 223 vm's in it)
6. Disabling / Enabling SmartProxy role while Fleecing or after issuing the fleecing option

Was not able to see the time out exception related error as mentioned in the bug.  However came across BZ#1267565.

Let me know if this is sufficient to to validate this BZ??  If this is sufficient then I can move this BZ to verified state.

Thanks,
Ramesh
Comment 21 Jerry Keselman 2015-11-18 08:45:47 EST
Ramesh,

The only way you can test this is to actually send a signal to the smart state worker to kill it, and then check to see if the child vixdisklibserver process also is shut down.  The issue was that the worker timed out and was killed, but the vixdisklibserver was not stopped.  You would have to figure out which process was the smart state worker by checking the miq_workers table in the database.  The pid is in there.
Comment 22 Ramesh A 2015-11-18 12:46:41 EST
Good to go. Verified and working fine in 5.5.0.10-beta2.1.20151110134042_d6f5459

Tested using the below mentioned procedure
1. Identified the process id of smartproxy using the query "SELECT pid FROM miq_workers WHERE queue_name='smartproxy';"
2. Started SSA on multiple vm's
3. Waited for vixdisklibserver process to appear in top and in vim.log file
4. Terminated the smartproxy pid using the kill command
6. vixdisklibserver process id got deleted in top command
7. The Scan process which was struck, got timedout - "job timed out after 3051.924210029 seconds of inactivity. Inactivity threshold [3000 seconds]"

PS:
==
For the next set of SSA's which was in queued status got new smartproxy pid after the following error in log file, and SSA was getting executed normally

[----] E, [2015-11-18T11:30:03.707721 #14043:123598c] ERROR -- : MIQ(MiqServer#validate_worker) Worker [MiqSmartProxyWorker] with ID: [1514], PID: [24440], GUID: [f7cae52e-8e09-11e5-94fb-fa163ea62a09] has not responded in 127.258668925 seconds, restarting worker
Comment 24 errata-xmlrpc 2015-12-08 08:29:14 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551

Note You need to log in before you can comment on or make changes to this bug.