1203442 – miq-system.rb causing top_output.log to fill the log filesystem

Bug 1203442 - miq-system.rb causing top_output.log to fill the log filesystem

Summary: miq-system.rb causing top_output.log to fill the log filesystem

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.3.0
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.4.0
Assignee:	Joe Rafaniello
QA Contact:	Dave Johnson
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1207860
TreeView+	depends on / blocked

Reported:	2015-03-18 20:31 UTC by Josh Carter
Modified:	2019-07-11 08:48 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1207860 (view as bug list)
Environment:
Last Closed:	2015-06-16 12:54:07 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2015:1100	0	normal	SHIPPED_LIVE	CFME 5.4.0 bug fixes, and enhancement update	2015-06-16 16:28:42 UTC

Comment 4 Joe Rafaniello 2015-03-18 23:12:26 UTC

We log the server stats on a schedule initiated by a queue message created when the  rufus schedule is fired within the schedule worker process.  This queue message is only created if a prior one doesn't already exist for that server.

We need to clarify what the process count spike means, what processes were added?  Was there something that happened in terms of new/added processes prior to the large number of "df" counts.  Also, it would be useful to know if those df commands were failing because of something.

Looking at the code, nothing looks obvious to me so it might be the 'df' command we're in combination with something that could be happening on storage.


The 5.minutes frequency comes from this setting:
https://github.com/ManageIQ/manageiq/blob/00af6164bf983bed2f86cfa5244d0b9c2c9fcecf/vmdb/config/vmdb.tmpl.yml#L499

The schedule worker's rufus thread throws the "we need to queue up a log server stats" thing into the schedule workers internal queue here:
https://github.com/ManageIQ/manageiq/blob/00af6164bf983bed2f86cfa5244d0b9c2c9fcecf/vmdb/lib/workers/schedule_worker.rb#L113

When the schedule worker goes through it's internal queue, it calls this miq_server_worker_log_status method:
https://github.com/ManageIQ/manageiq/blob/00af6164bf983bed2f86cfa5244d0b9c2c9fcecf/vmdb/lib/workers/schedule_worker/jobs.rb#L16


Note, it's eventually adds it to the miq_queue table by doing MiqQueue.put_unless_exists:
https://github.com/ManageIQ/manageiq/blob/00af6164bf983bed2f86cfa5244d0b9c2c9fcecf/vmdb/lib/workers/schedule_worker/jobs.rb#L225

Which means, it won't queue up the same task twice if there's already a miq_queue row with these values:
:class_name  => "MiqServer", :method_name => "log_status",     :task_id => "log_status", :server_guid => MiqServer.my_guid, :priority => MiqQueue::HIGH_PRIORITY


This last point implies we don't have a bunch of miq_queue records all asking to do the log_status.

Comment 5 Jason Frey 2015-03-19 14:13:49 UTC

Another question is why this log is not rotating.  Is this related to the upstream bug where the logs were not rotating?

Comment 6 Thom Carlin 2015-03-19 14:21:10 UTC

Could this be a NFS mount issue?

Comment 9 Joe Rafaniello 2015-03-19 20:38:43 UTC

to Jason's comment: we rotate logs on a daily schedule, not by size:
https://github.com/jrafanie/manageiq/blob/ecda2a8c75225a3248adc011f7714d186c37cb0a/system/COPY/etc/logrotate.d/miq_logs.conf#L2

Following up on Tom's comment, a google search it's looks clear that df is known to hang when mount volumes are inaccessible.

http://unix.stackexchange.com/questions/21199/why-is-df-hanging
https://bugs.centos.org/view.php?id=6414
https://community.oracle.com/thread/2515836
https://darktraining.com/linux/67/

Many of the suggestions online are to limit our df command to only local volumes by passing -l:  df -l

Note, this bug seems to indicate the local option may not work with fedora20
https://bugzilla.redhat.com/show_bug.cgi?id=1199679

We'll have to recreate this in house and see if we can verify it on rhel 6.6.

Comment 10 Joe Rafaniello 2015-03-19 22:01:49 UTC

Dave, from the provided links, would you be able to try to recreate this?
According to the googling I did, it seems like just mounting a nfs share on an appliance, shutting down the nfs server or disabling the nfs, and trying 'df -T -P' seems to do it.

From Tom's log reviews, it looks like the top output was showing the SAME df command taking MINUTES to run since it was reported 3 or more times in top_output.log.

This is the code we'd change to pass -l:

https://github.com/jrafanie/manageiq/blob/19149b1e903a27e241e79dd174a82234344bfbe3/lib/util/miq-system.rb#L405

Comment 11 Dave Johnson 2015-03-20 03:57:22 UTC

Jan, can you give this a shot and see if you can reproduce?  Thanks!

Comment 12 Jan Krocil 2015-03-20 10:35:04 UTC

Dave,
I was able to reproduce this in 5.4.0.0.12.

After blocking access to a mounted NFS server, the "df" processes began stacking up as seen here (same as what Josh posted above):
[root@my_host ~]# ps aux | grep df
root     28588  0.0  0.0 108208  1280 ?        SN   06:01   0:00 sh -c df -T -P  2>&1
root     28589  0.0  0.0 100980   800 ?        DN   06:01   0:00 df -T -P
root     29024  0.0  0.0 108208  1284 ?        SN   06:11   0:00 sh -c df -T -P  2>&1
root     29025  0.0  0.0 100980   800 ?        DN   06:11   0:00 df -T -P
root     29443  0.0  0.0 108208  1284 ?        SN   06:21   0:00 sh -c df -T -P  2>&1
root     29444  0.0  0.0 100980   800 ?        DN   06:21   0:00 df -T -P
root     29846  0.0  0.0 108208  1284 ?        SN   06:31   0:00 sh -c df -T -P  2>&1
root     29847  0.0  0.0 100980   800 ?        DN   06:31   0:00 df -T -P

When I manually ran "df -T -P" just to check it, it did indeed hang as well.
Running "df -T -P -l" gave me results right away though - it doesn't hang.

So whatever df-related issue may be (have been) present in fedora, it doesn't seem to affect up-to-date rhel6.6-based appliances.

[root@dhcp-8-59-242 ~]# rpm -qf /bin/df
coreutils-8.4-37.el6.x86_64

Comment 13 Joe Rafaniello 2015-03-20 13:09:55 UTC

Awesome Jan!  Thanks.  I'll prepare a fix.  

If you still have the appliance in this state, can you confirm if df/du need the -l option when passed a specific file on a local storage?  In other words, does it bypass remote mounts when you ask for df/du for the storage a file resides?

For example:
logger_file=/var/www/miq/vmdb/log/appliance_console.log
logvol_free_space=`df -k $file | awk '{ print $3 }' | tail -n 1`
existing_log_size=`du -k $file | awk '{ print $1 }' | tail -n 1`

https://github.com/jrafanie/manageiq/blob/c3418fd900ca1a1fd6c44852045291e907c53a58/system/logrotate_free_space_check.sh#L4-5

Comment 14 Joe Rafaniello 2015-03-20 13:41:14 UTC

note, du, doesn't have a local option.  -l is the same as --count-links.

Comment 15 Joe Rafaniello 2015-03-20 14:25:02 UTC

Upstream Pull request:
https://github.com/ManageIQ/manageiq/pull/2277

Limits df to local filesystem even when we specify a local file path (paranoid???).  Jan, I'm still curious if 'df -T -P /etc/hosts' also hangs when NFS is down and 'df -T -P' hangs (without a local file path).

Comment 16 Joe Rafaniello 2015-03-20 14:26:32 UTC

Thanks Tom, Brandon, Jan, Dave, and Josh... I wasn't sure we'd be able to recreate this easily.  I'm much more confident we can verify this fix now.

Comment 17 Jan Krocil 2015-03-20 14:32:45 UTC

Joe,
both "du" and "df" ran against local files (e.g. /etc/hosts) complete successfully.

Comment 18 CFME Bot 2015-03-24 21:31:26 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/0e8c4373503fcd2bca48ae46bccd670657fc5632

commit 0e8c4373503fcd2bca48ae46bccd670657fc5632
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 20 10:00:14 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 20 18:08:01 2015 -0400

    To avoid a hung process, limit df to local fs if no file is provided
    
    Mounted NFS volumes that go bad/down can hang df so let's limit df to only the
    local filesystems we care about.
    
    Existing usage of MiqSystem.disk_usage has been:
    * where we pass no argument and log the output for server statistics from local filesystems
    * where we pass the Postgresql database disk location
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1203442

 lib/spec/util/miq-system_spec.rb | 30 ++++++++++++++++++++++++++----
 lib/util/miq-system.rb           | 10 ++++++++--
 2 files changed, 34 insertions(+), 6 deletions(-)

Comment 19 CFME Bot 2015-03-24 21:31:31 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/9b4ab5a77e469b7e7ca8bf95a87e031b0e5db4f1

commit 9b4ab5a77e469b7e7ca8bf95a87e031b0e5db4f1
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 20 10:16:59 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 20 18:08:04 2015 -0400

    Force df to local filesystems even when a local path is provided.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1203442

 lib/fs/MiqFS/modules/RealFS.rb       | 2 +-
 system/logrotate_free_space_check.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comment 21 Joe Rafaniello 2015-03-31 21:48:46 UTC

Yes, Dave, will be backported as part of https://bugzilla.redhat.com/show_bug.cgi?id=1207860.

Comment 22 CFME Bot 2015-04-02 14:00:53 UTC

New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=f566d49712e2b63da35d4a076a3e17ea758bda66

commit f566d49712e2b63da35d4a076a3e17ea758bda66
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 20 10:00:14 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Mar 31 17:38:11 2015 -0400

    To avoid a hung process, limit df to local fs if no file is provided
    
    Mounted NFS volumes that go bad/down can hang df so let's limit df to only the
    local filesystems we care about.
    
    Existing usage of MiqSystem.disk_usage has been:
    * where we pass no argument and log the output for server statistics from local filesystems
    * where we pass the Postgresql database disk location
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1203442
    
    Conflicts:
    	lib/util/miq-system.rb

 lib/spec/util/miq-system_spec.rb | 30 ++++++++++++++++++++++++++----
 lib/util/miq-system.rb           | 10 ++++++++--
 2 files changed, 34 insertions(+), 6 deletions(-)

Comment 23 CFME Bot 2015-04-02 14:00:58 UTC

New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=8e7ee5d8cf5478c3325e27ec66b1ce87a8eeac09

commit 8e7ee5d8cf5478c3325e27ec66b1ce87a8eeac09
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 20 10:16:59 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Mar 31 17:41:14 2015 -0400

    Force df to local filesystems even when a local path is provided.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1203442

 lib/fs/MiqFS/modules/RealFS.rb       | 2 +-
 system/logrotate_free_space_check.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comment 24 CFME Bot 2015-04-02 14:01:02 UTC

New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=6d017ab52f2e0779c25f17fc807b65f20fd170b0

commit 6d017ab52f2e0779c25f17fc807b65f20fd170b0
Merge: dede0f6 8e7ee5d
Author:     Brandon Dunne <bdunne>
AuthorDate: Thu Apr 2 09:58:18 2015 -0400
Commit:     Brandon Dunne <bdunne>
CommitDate: Thu Apr 2 09:58:18 2015 -0400

    Merge branch '53_df_hang_fix' into '5.3.z'
    
    53 df hang fix
    
    * To avoid a hung process, limit df to local fs if no file is provided …
      Mounted NFS volumes that go bad/down can hang df so let's limit df to only the
      local filesystems we care about.
    
      Existing usage of MiqSystem.disk_usage has been:
      * where we pass no argument and log the output for server statistics from local filesystems
      * where we pass the Postgresql database disk location
    
    * Force df to local filesystems even when a local path is provided.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1203442
    
    See merge request !17

 lib/fs/MiqFS/modules/RealFS.rb       |  2 +-
 lib/spec/util/miq-system_spec.rb     | 30 ++++++++++++++++++++++++++----
 lib/util/miq-system.rb               | 10 ++++++++--
 system/logrotate_free_space_check.sh |  2 +-
 4 files changed, 36 insertions(+), 8 deletions(-)

Comment 25 CFME Bot 2015-04-07 21:35:22 UTC

New commit detected on cfme/5.3.4:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=1cb8dc973655ab3b7c8f222091d92a7ffb3cd949

commit 1cb8dc973655ab3b7c8f222091d92a7ffb3cd949
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 20 10:00:14 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Apr 2 10:20:32 2015 -0400

    To avoid a hung process, limit df to local fs if no file is provided
    
    Mounted NFS volumes that go bad/down can hang df so let's limit df to only the
    local filesystems we care about.
    
    Existing usage of MiqSystem.disk_usage has been:
    * where we pass no argument and log the output for server statistics from local filesystems
    * where we pass the Postgresql database disk location
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1203442
    
    Conflicts:
    	lib/util/miq-system.rb

 lib/spec/util/miq-system_spec.rb | 30 ++++++++++++++++++++++++++----
 lib/util/miq-system.rb           | 10 ++++++++--
 2 files changed, 34 insertions(+), 6 deletions(-)

Comment 26 CFME Bot 2015-04-07 21:35:26 UTC

New commit detected on cfme/5.3.4:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=3e0d55ebbf17a418342fe9e9e6473cdf9e0d7b92

commit 3e0d55ebbf17a418342fe9e9e6473cdf9e0d7b92
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 20 10:16:59 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Apr 2 10:20:36 2015 -0400

    Force df to local filesystems even when a local path is provided.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1203442

 lib/fs/MiqFS/modules/RealFS.rb       | 2 +-
 system/logrotate_free_space_check.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comment 28 Sean Myers 2015-05-01 01:19:06 UTC

Verified on 5.4.0.0.24

Comment 31 errata-xmlrpc 2015-06-16 12:54:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1100.html

Note You need to log in before you can comment on or make changes to this bug.