We log the server stats on a schedule initiated by a queue message created when the rufus schedule is fired within the schedule worker process. This queue message is only created if a prior one doesn't already exist for that server. We need to clarify what the process count spike means, what processes were added? Was there something that happened in terms of new/added processes prior to the large number of "df" counts. Also, it would be useful to know if those df commands were failing because of something. Looking at the code, nothing looks obvious to me so it might be the 'df' command we're in combination with something that could be happening on storage. The 5.minutes frequency comes from this setting: https://github.com/ManageIQ/manageiq/blob/00af6164bf983bed2f86cfa5244d0b9c2c9fcecf/vmdb/config/vmdb.tmpl.yml#L499 The schedule worker's rufus thread throws the "we need to queue up a log server stats" thing into the schedule workers internal queue here: https://github.com/ManageIQ/manageiq/blob/00af6164bf983bed2f86cfa5244d0b9c2c9fcecf/vmdb/lib/workers/schedule_worker.rb#L113 When the schedule worker goes through it's internal queue, it calls this miq_server_worker_log_status method: https://github.com/ManageIQ/manageiq/blob/00af6164bf983bed2f86cfa5244d0b9c2c9fcecf/vmdb/lib/workers/schedule_worker/jobs.rb#L16 Note, it's eventually adds it to the miq_queue table by doing MiqQueue.put_unless_exists: https://github.com/ManageIQ/manageiq/blob/00af6164bf983bed2f86cfa5244d0b9c2c9fcecf/vmdb/lib/workers/schedule_worker/jobs.rb#L225 Which means, it won't queue up the same task twice if there's already a miq_queue row with these values: :class_name => "MiqServer", :method_name => "log_status", :task_id => "log_status", :server_guid => MiqServer.my_guid, :priority => MiqQueue::HIGH_PRIORITY This last point implies we don't have a bunch of miq_queue records all asking to do the log_status.
Another question is why this log is not rotating. Is this related to the upstream bug where the logs were not rotating?
Could this be a NFS mount issue?
to Jason's comment: we rotate logs on a daily schedule, not by size: https://github.com/jrafanie/manageiq/blob/ecda2a8c75225a3248adc011f7714d186c37cb0a/system/COPY/etc/logrotate.d/miq_logs.conf#L2 Following up on Tom's comment, a google search it's looks clear that df is known to hang when mount volumes are inaccessible. http://unix.stackexchange.com/questions/21199/why-is-df-hanging https://bugs.centos.org/view.php?id=6414 https://community.oracle.com/thread/2515836 https://darktraining.com/linux/67/ Many of the suggestions online are to limit our df command to only local volumes by passing -l: df -l Note, this bug seems to indicate the local option may not work with fedora20 https://bugzilla.redhat.com/show_bug.cgi?id=1199679 We'll have to recreate this in house and see if we can verify it on rhel 6.6.
Dave, from the provided links, would you be able to try to recreate this? According to the googling I did, it seems like just mounting a nfs share on an appliance, shutting down the nfs server or disabling the nfs, and trying 'df -T -P' seems to do it. From Tom's log reviews, it looks like the top output was showing the SAME df command taking MINUTES to run since it was reported 3 or more times in top_output.log. This is the code we'd change to pass -l: https://github.com/jrafanie/manageiq/blob/19149b1e903a27e241e79dd174a82234344bfbe3/lib/util/miq-system.rb#L405
Jan, can you give this a shot and see if you can reproduce? Thanks!
Dave, I was able to reproduce this in 5.4.0.0.12. After blocking access to a mounted NFS server, the "df" processes began stacking up as seen here (same as what Josh posted above): [root@my_host ~]# ps aux | grep df root 28588 0.0 0.0 108208 1280 ? SN 06:01 0:00 sh -c df -T -P 2>&1 root 28589 0.0 0.0 100980 800 ? DN 06:01 0:00 df -T -P root 29024 0.0 0.0 108208 1284 ? SN 06:11 0:00 sh -c df -T -P 2>&1 root 29025 0.0 0.0 100980 800 ? DN 06:11 0:00 df -T -P root 29443 0.0 0.0 108208 1284 ? SN 06:21 0:00 sh -c df -T -P 2>&1 root 29444 0.0 0.0 100980 800 ? DN 06:21 0:00 df -T -P root 29846 0.0 0.0 108208 1284 ? SN 06:31 0:00 sh -c df -T -P 2>&1 root 29847 0.0 0.0 100980 800 ? DN 06:31 0:00 df -T -P When I manually ran "df -T -P" just to check it, it did indeed hang as well. Running "df -T -P -l" gave me results right away though - it doesn't hang. So whatever df-related issue may be (have been) present in fedora, it doesn't seem to affect up-to-date rhel6.6-based appliances. [root@dhcp-8-59-242 ~]# rpm -qf /bin/df coreutils-8.4-37.el6.x86_64
Awesome Jan! Thanks. I'll prepare a fix. If you still have the appliance in this state, can you confirm if df/du need the -l option when passed a specific file on a local storage? In other words, does it bypass remote mounts when you ask for df/du for the storage a file resides? For example: logger_file=/var/www/miq/vmdb/log/appliance_console.log logvol_free_space=`df -k $file | awk '{ print $3 }' | tail -n 1` existing_log_size=`du -k $file | awk '{ print $1 }' | tail -n 1` https://github.com/jrafanie/manageiq/blob/c3418fd900ca1a1fd6c44852045291e907c53a58/system/logrotate_free_space_check.sh#L4-5
note, du, doesn't have a local option. -l is the same as --count-links.
Upstream Pull request: https://github.com/ManageIQ/manageiq/pull/2277 Limits df to local filesystem even when we specify a local file path (paranoid???). Jan, I'm still curious if 'df -T -P /etc/hosts' also hangs when NFS is down and 'df -T -P' hangs (without a local file path).
Thanks Tom, Brandon, Jan, Dave, and Josh... I wasn't sure we'd be able to recreate this easily. I'm much more confident we can verify this fix now.
Joe, both "du" and "df" ran against local files (e.g. /etc/hosts) complete successfully.
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/0e8c4373503fcd2bca48ae46bccd670657fc5632 commit 0e8c4373503fcd2bca48ae46bccd670657fc5632 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 20 10:00:14 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 20 18:08:01 2015 -0400 To avoid a hung process, limit df to local fs if no file is provided Mounted NFS volumes that go bad/down can hang df so let's limit df to only the local filesystems we care about. Existing usage of MiqSystem.disk_usage has been: * where we pass no argument and log the output for server statistics from local filesystems * where we pass the Postgresql database disk location https://bugzilla.redhat.com/show_bug.cgi?id=1203442 lib/spec/util/miq-system_spec.rb | 30 ++++++++++++++++++++++++++---- lib/util/miq-system.rb | 10 ++++++++-- 2 files changed, 34 insertions(+), 6 deletions(-)
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/9b4ab5a77e469b7e7ca8bf95a87e031b0e5db4f1 commit 9b4ab5a77e469b7e7ca8bf95a87e031b0e5db4f1 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 20 10:16:59 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 20 18:08:04 2015 -0400 Force df to local filesystems even when a local path is provided. https://bugzilla.redhat.com/show_bug.cgi?id=1203442 lib/fs/MiqFS/modules/RealFS.rb | 2 +- system/logrotate_free_space_check.sh | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
Yes, Dave, will be backported as part of https://bugzilla.redhat.com/show_bug.cgi?id=1207860.
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=f566d49712e2b63da35d4a076a3e17ea758bda66 commit f566d49712e2b63da35d4a076a3e17ea758bda66 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 20 10:00:14 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Tue Mar 31 17:38:11 2015 -0400 To avoid a hung process, limit df to local fs if no file is provided Mounted NFS volumes that go bad/down can hang df so let's limit df to only the local filesystems we care about. Existing usage of MiqSystem.disk_usage has been: * where we pass no argument and log the output for server statistics from local filesystems * where we pass the Postgresql database disk location https://bugzilla.redhat.com/show_bug.cgi?id=1203442 Conflicts: lib/util/miq-system.rb lib/spec/util/miq-system_spec.rb | 30 ++++++++++++++++++++++++++---- lib/util/miq-system.rb | 10 ++++++++-- 2 files changed, 34 insertions(+), 6 deletions(-)
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=8e7ee5d8cf5478c3325e27ec66b1ce87a8eeac09 commit 8e7ee5d8cf5478c3325e27ec66b1ce87a8eeac09 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 20 10:16:59 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Tue Mar 31 17:41:14 2015 -0400 Force df to local filesystems even when a local path is provided. https://bugzilla.redhat.com/show_bug.cgi?id=1203442 lib/fs/MiqFS/modules/RealFS.rb | 2 +- system/logrotate_free_space_check.sh | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=6d017ab52f2e0779c25f17fc807b65f20fd170b0 commit 6d017ab52f2e0779c25f17fc807b65f20fd170b0 Merge: dede0f6 8e7ee5d Author: Brandon Dunne <bdunne> AuthorDate: Thu Apr 2 09:58:18 2015 -0400 Commit: Brandon Dunne <bdunne> CommitDate: Thu Apr 2 09:58:18 2015 -0400 Merge branch '53_df_hang_fix' into '5.3.z' 53 df hang fix * To avoid a hung process, limit df to local fs if no file is provided … Mounted NFS volumes that go bad/down can hang df so let's limit df to only the local filesystems we care about. Existing usage of MiqSystem.disk_usage has been: * where we pass no argument and log the output for server statistics from local filesystems * where we pass the Postgresql database disk location * Force df to local filesystems even when a local path is provided. https://bugzilla.redhat.com/show_bug.cgi?id=1203442 See merge request !17 lib/fs/MiqFS/modules/RealFS.rb | 2 +- lib/spec/util/miq-system_spec.rb | 30 ++++++++++++++++++++++++++---- lib/util/miq-system.rb | 10 ++++++++-- system/logrotate_free_space_check.sh | 2 +- 4 files changed, 36 insertions(+), 8 deletions(-)
New commit detected on cfme/5.3.4: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=1cb8dc973655ab3b7c8f222091d92a7ffb3cd949 commit 1cb8dc973655ab3b7c8f222091d92a7ffb3cd949 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 20 10:00:14 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Thu Apr 2 10:20:32 2015 -0400 To avoid a hung process, limit df to local fs if no file is provided Mounted NFS volumes that go bad/down can hang df so let's limit df to only the local filesystems we care about. Existing usage of MiqSystem.disk_usage has been: * where we pass no argument and log the output for server statistics from local filesystems * where we pass the Postgresql database disk location https://bugzilla.redhat.com/show_bug.cgi?id=1203442 Conflicts: lib/util/miq-system.rb lib/spec/util/miq-system_spec.rb | 30 ++++++++++++++++++++++++++---- lib/util/miq-system.rb | 10 ++++++++-- 2 files changed, 34 insertions(+), 6 deletions(-)
New commit detected on cfme/5.3.4: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=3e0d55ebbf17a418342fe9e9e6473cdf9e0d7b92 commit 3e0d55ebbf17a418342fe9e9e6473cdf9e0d7b92 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 20 10:16:59 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Thu Apr 2 10:20:36 2015 -0400 Force df to local filesystems even when a local path is provided. https://bugzilla.redhat.com/show_bug.cgi?id=1203442 lib/fs/MiqFS/modules/RealFS.rb | 2 +- system/logrotate_free_space_check.sh | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
Verified on 5.4.0.0.24
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1100.html