Created attachment 1069083 [details]
logs for pid 13209
Description of problem:
When a worker times out, VixDiskLibServer.rb processes are not killed along with their parents.
Version-Release number of selected component (if applicable):
not reproduced yet
Steps to Reproduce:
1. configure CFME 184.108.40.206 to use VMware
2. generate a timeout for a worker that has vixdisklib as a child process
top - 03:15:18 up 24 days, 14:13, 0 users, load average: 5.55, 5.75, 5.64
Tasks: 174 total, 6 running, 168 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.1%us, 31.6%sy, 66.9%ni, 0.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 5992796k total, 4290748k used, 1702048k free, 312640k buffers
Swap: 9957372k total, 9968k used, 9947404k free, 571756k cached
PID PPID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22701 1 root 30 10 189m 28m 10m R 77.9 0.5 27239:54 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
21970 1 root 30 10 189m 23m 10m R 76.0 0.4 27429:08 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
3087 1 root 30 10 190m 29m 10m R 75.5 0.5 19339:19 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
4862 1 root 30 10 189m 28m 10m R 75.2 0.5 19315:53 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
10112 1 root 30 10 204m 39m 10m R 69.9 0.7 19359:08 ruby /var/www/miq/lib/VixDiskLib/VixDiskLibServer.rb
Expected results:
The VixDiskLibServer.rb processes are killed along with their parents when the worker is restarted due to a timeout.
The original cause of the timeout is not yet known.
The activity of one of the processes was extracted from the logs; see the attachments.
The architecture of the smartstate worker was changed earlier this year when the process was altered to initiate a child process for the VixDisk activity rather than performing the work within the smartstate worker itself.
When the smartstate worker is killed, for whatever reason, while the child process is still active (perhaps stuck in a loop, suggesting a bug in the VMware code), the child process is *not* killed. It is left an orphan, re-parented to 'init' (PID 1), and continues to run.
In the logs provided by the customer, top reports 0% idle CPU and shows 5 of these child processes active, each consuming roughly 70% of a CPU. The customer's original complaint is that provisioning is no longer proceeding as expected.
The two weeks of top data provided show several instances of these child processes transitioning: first owned by a smartstate worker, then, after the worker is killed, continuing to execute and consume resources while parented by 'init' (PID 1). (See the logs for PID 13209 and the other provided logs.) A spreadsheet is also attached marking the points where a child process is re-parented to PID 1. Each transition is highlighted with a yellow background, with the parent process shown as '1' and the line above showing the same PID parented by a smartstate worker process that has since been terminated.
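The re-parenting described above can be reproduced in a few lines of Ruby (illustrative only; the sleeping grandchild stands in for VixDiskLibServer.rb):

```ruby
# Show that a child whose parent dies without cleanup is re-parented.
r, w = IO.pipe
worker = fork do          # stands in for the smartstate worker
  fork do                 # stands in for VixDiskLibServer.rb
    r.close
    sleep 1               # outlive the worker
    w.puts Process.ppid   # report who our parent is now
    w.close
    exit 0
  end
  exit 0                  # the "worker" dies without cleaning up its child
end
w.close
Process.wait(worker)
new_ppid = r.gets.to_i
# Typically 1 (init); on systems with a subreaper it may be another PID,
# but it is never the dead worker's PID.
puts "orphan re-parented to PID #{new_ppid}"
```

This is exactly the transition visible in the attached top data: same child PID, parent changing from the worker's PID to 1.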
The requested remedy for this bug is for CFME to ensure that all child processes of a worker about to be terminated are themselves terminated first.
Created attachment 1069140 [details]
second smartstate worker terminated where child process survives and loops
Created attachment 1069142 [details]
top reports showing 100% CPU utilization until an appliance reboot restores order
Created attachment 1069144 [details]
spreadsheet of smartstate worker processes with problem processes highlighted by yellow backgrounds
These processes are designed to time out after a fairly short period of time if not accessed. Unless that timeout mechanism isn't working, this is a non-issue.
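A minimal sketch of the kind of inactivity timeout described here (illustrative only; the interval, and how the real VixDiskLibServer tracks access, are assumptions):

```ruby
# Illustrative inactivity watchdog; the real timeout interval and access
# tracking in VixDiskLibServer may differ.
last_access = Time.now
timeout     = 0.5   # seconds; the real value is much longer
expired     = false

watchdog = Thread.new do
  until expired
    sleep 0.05
    expired = true if Time.now - last_access > timeout
  end
end

# Simulated requests keep the process alive...
3.times { sleep 0.1; last_access = Time.now }
raise "expired while still being accessed" if expired

# ...then access stops and the watchdog trips.
sleep 1.0
watchdog.join
puts "timed out: #{expired}"
```

The bug symptoms (child processes spinning for weeks at ~70% CPU) suggest the processes were stuck in a state where this kind of watchdog never fired.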
Can you tell me what version or versions of VDDK exhibit this behavior? Thanks.
If you can also provide a vim.log for the period in question that would be helpful. Thanks.
There are 14 days of archived logs plus the current log set provided for this appliance by the customer. Not every day had a non-null vim.log, but I have aggregated all of the vim.log files collected by logrotate and uploaded them as the attachment 'vim_logs.tgz'.
As for which version of the VDDK is in use, we will have to ask the customer. Please provide whatever command you would like the customer to execute to give you the most usable information for determining which level of VDDK code is in use.
Created attachment 1069197 [details]
tgz of several days of vim.log files from logrotate
Would it be possible to get the vpxa.log files from the customer for the same time period as the vim.logs you have provided? If the ESX hosts are 4.x then the log files will be in /var/log/vmware/vpx. For 5.x ESX hosts the log files will be in /var/log. Thanks.
Forwarding the request for additional information to Felix Dewaleyne, who opened the case and is in proximity to the customer.
One change to be added here is that the SmartProxyWorker will now trap TERM signals and hand them off to the VixDiskLibServer before exiting. This way, attempts to kill the SmartProxyWorker will also cause the VixDiskLibServer to exit.
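The described change can be sketched in plain Ruby (an illustrative stand-in, not the actual ManageIQ code):

```ruby
# The parent traps TERM and hands it off to the child before it would exit.
child_pid = fork do
  trap("TERM") { exit 0 }   # stands in for VixDiskLibServer.rb
  sleep
end

sleep 0.2                   # let the child install its trap

term_handled = false
trap("TERM") do
  Process.kill("TERM", child_pid)  # hand the signal off to the child
  term_handled = true              # (the real worker would then exit)
end

Process.kill("TERM", Process.pid)  # simulate the worker being killed
sleep 0.05 until term_handled
_, status = Process.wait2(child_pid)
puts "child exited with status #{status.exitstatus}"
```

Without the trap, the kill would take down only the parent and leave the child running, which is exactly the orphaning this bug describes.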
The above PR will be closed in favor of https://github.com/ManageIQ/manageiq/pull/4411
Felix, let's hold off on asking for the logs at this point. This BZ deals with making sure the child processes go away when the worker times out, and that is what the above PR fixes. It doesn't address the cause of the timeout, which is some hang in the VDDK (possibly I/O). We will deal with that as it becomes necessary.
Verification blocked due to BZ#1276087
Now that the aforementioned BZ has been closed can verification go through again?
Is there a way to generate a timeout for a worker that has vixdisklib as a child process?
As far as QE testing is concerned, the approaches below were followed. Verified in 220.127.116.11-beta2.1.20151110134042_d6f5459.
1. Fleecing a VM
2. Fleecing a VM whose RAM usage was high
3. Fleecing a VM whose memory usage was high
4. Fleecing more than 4 VMs at once
5. Fleecing all the hosts (the QE environment has 4 hosts attached to a vSphere 5.5 provider with 223 VMs)
6. Disabling/enabling the SmartProxy role while fleecing or after issuing the fleecing option
I was not able to see the timeout exception error mentioned in the bug; however, I came across BZ#1267565.
Let me know if this is sufficient to validate this BZ. If so, I can move it to the verified state.
The only way you can test this is to actually send a signal to the smart state worker to kill it, and then check whether the child VixDiskLibServer process is also shut down. The issue was that the worker timed out and was killed, but the VixDiskLibServer was not stopped. You would have to figure out which process is the smart state worker by checking the miq_workers table in the database; the PID is in there.
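The liveness check this test needs can be done with signal 0 (the `alive?` helper and the throwaway `sleep` process below are illustrative, not part of the product):

```ruby
# Signal 0 delivers nothing but verifies the PID exists, so it can
# confirm whether the child survived the worker's death.
def alive?(pid)
  Process.kill(0, pid)   # existence check only
  true
rescue Errno::ESRCH
  false                  # no such process
end

# Demo against a throwaway process standing in for VixDiskLibServer.rb:
pid = spawn("sleep", "60")
puts alive?(pid)          # true while it runs
Process.kill("TERM", pid)
Process.wait(pid)
puts alive?(pid)          # false once it is gone and reaped
```

Note that Errno::EPERM from signal 0 would mean the process exists but belongs to another user; on the appliance the workers run as root, so that case should not arise.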
Good to go. Verified and working fine in 18.104.22.168-beta2.1.20151110134042_d6f5459
Tested using the procedure below:
1. Identified the process ID of the smartproxy worker using the query "SELECT pid FROM miq_workers WHERE queue_name='smartproxy';"
2. Started SSA on multiple VMs
3. Waited for the vixdisklibserver process to appear in top and in the vim.log file
4. Terminated the smartproxy PID using the kill command
5. The vixdisklibserver process disappeared from top
6. The scan job that was stuck timed out - "job timed out after 3051.924210029 seconds of inactivity. Inactivity threshold [3000 seconds]"
The next set of SSAs, which were queued, got a new smartproxy PID after the following error in the log file, and SSA then proceeded normally:
[----] E, [2015-11-18T11:30:03.707721 #14043:123598c] ERROR -- : MIQ(MiqServer#validate_worker) Worker [MiqSmartProxyWorker] with ID: , PID: , GUID: [f7cae52e-8e09-11e5-94fb-fa163ea62a09] has not responded in 127.258668925 seconds, restarting worker
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.