1205407 – "Log writing failed. can't be called from trap context" in evm.log

Bug 1205407 - "Log writing failed. can't be called from trap context" in evm.log

Summary: "Log writing failed. can't be called from trap context" in evm.log

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	GA
Target Release:	5.4.0
Assignee:	Joe Rafaniello
QA Contact:	Milan Falešník
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2015-03-24 21:09 UTC by Thom Carlin
Modified:	2015-06-16 12:54 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-06-16 12:54:42 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
evm.log (1.72 MB, text/plain) 2015-03-24 22:52 UTC, Thom Carlin	no flags	Details
workersig.rb (1.97 KB, text/plain) 2015-04-23 09:00 UTC, Milan Falešník	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2015:1100	0	normal	SHIPPED_LIVE	CFME 5.4.0 bug fixes, and enhancement update	2015-06-16 16:28:42 UTC

Description Thom Carlin 2015-03-24 21:09:43 UTC

Description of problem:

Unexpected errors in newly provisioned CFME appliance

Version-Release number of selected component (if applicable):

5.4.0.0.13.20150319164909_f011a66

How reproducible:

100%

Steps to Reproduce:
1. Deploy appliance
2. Check evm.log
3.

Actual results:

log writing failed. can't be called from trap context [multiple times]
FATAL:  terminating connection due to administrator command
log writing failed. can't be called from trap context [multiple times]
[----] E, [2015-03-24T05:45:55.271621 #2952:108dea4] ERROR -- : MIQ(abstract_adapter) Name: [MiqServer Load], Message: [PGError: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
...: SELECT  "miq_servers".* FROM "miq_servers"  WHERE "miq_servers"."guid" = 'c48e493e-d208-11e4-a744-001a4a2dd78a' LIMIT 1...]
Tue Mar 24 09:47:49 UTC 2015 EVMINIT   EVM Appliance Booted

Expected results:

Booting without errors

Additional info:

Comment 2 Joe Rafaniello 2015-03-24 21:26:19 UTC

Please provide the full log.  There might be a bug trying to trap specific signals but it's not clear what's going on to start triaging.

Comment 3 Thom Carlin 2015-03-24 22:52:17 UTC

Created attachment 1006043 [details]
evm.log

Comment 4 Joe Rafaniello 2015-03-25 18:16:59 UTC

Thanks!  This was caused when we upgraded to ruby 2.0.

See: https://bugs.ruby-lang.org/issues/7917

Comment 5 Joe Rafaniello 2015-03-25 20:26:30 UTC

Note, this would cause some processes (server/worker) to fail to log that they're going away, specifically, when the server/worker receives a SIGINT, SIGTERM, SIGUSR1, SIGUSR2.

Im working on removing the traps in favor of regular exception handling.

Comment 6 Joe Rafaniello 2015-03-26 23:51:25 UTC

Upstream [WIP] PR:
https://github.com/ManageIQ/manageiq/pull/2386

Bumping priority... this is pretty bad to have workers/server process not properly process signals and disappear.

Comment 7 Joe Rafaniello 2015-03-27 22:03:18 UTC

Note, this upstream PR also fixes an issue where the UI/Webservice workers don't gracefully log their "exit" and cleanly restart if they encounter a SIGINT/SIGTERM.

Comment 8 CFME Bot 2015-03-30 15:30:53 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/b125f4532829b87a23d1f333358c75d039b1454d

commit b125f4532829b87a23d1f333358c75d039b1454d
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Mar 26 18:24:18 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 27 16:21:49 2015 -0400

    Add a EvmServer binstub so EvmServer can be required directly.
    
    This makes testing the EvmServer class possible.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/tasks/evm_application.rb  | 2 +-
 vmdb/lib/workers/bin/evm_server.rb | 4 ++++
 vmdb/lib/workers/evm_server.rb     | 2 --
 3 files changed, 5 insertions(+), 3 deletions(-)
 create mode 100644 vmdb/lib/workers/bin/evm_server.rb

Comment 9 CFME Bot 2015-03-30 15:30:57 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/400f9d9dea484329eba1732c6a849032c621cf08

commit 400f9d9dea484329eba1732c6a849032c621cf08
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Wed Mar 25 17:28:27 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 27 16:21:52 2015 -0400

    Don't try to trap SIGKILL since it can't be trapped/ignored.
    
    See #2313
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/workers/evm_server.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 10 CFME Bot 2015-03-30 15:31:01 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/2222b202be7985f4411ca9bf49a5f1883f043ec7

commit 2222b202be7985f4411ca9bf49a5f1883f043ec7
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 27 10:24:51 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 27 16:21:52 2015 -0400

    Cleanup WorkerBase initialization to simplify test setup.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/workers/worker_base.rb           | 11 +++++++----
 vmdb/spec/lib/workers/worker_base_spec.rb | 15 +++------------
 2 files changed, 10 insertions(+), 16 deletions(-)

Comment 11 CFME Bot 2015-03-30 15:31:04 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/859b8a77d3885a2910425aa1b2ad031c82b4a0a2

commit 859b8a77d3885a2910425aa1b2ad031c82b4a0a2
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Wed Mar 25 17:38:27 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 27 16:21:52 2015 -0400

    Convert server signal 'trap' to normal SignalException handling.
    
    Among other things, you can't log in the trap signal handler in ruby 2.0.
    https://bugs.ruby-lang.org/issues/7917
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/workers/evm_server.rb           | 12 +++++++-----
 vmdb/spec/lib/workers/evm_server_spec.rb | 31 +++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 5 deletions(-)
 create mode 100644 vmdb/spec/lib/workers/evm_server_spec.rb

Comment 12 CFME Bot 2015-03-30 15:31:08 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/5d67a7bbbc58f8ed81ce1f8bcdea8cdd21f0f08b

commit 5d67a7bbbc58f8ed81ce1f8bcdea8cdd21f0f08b
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Mar 26 10:36:31 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 27 16:21:53 2015 -0400

    Convert worker 'trap' to normal SignalException handling.
    
    Among other things, you can't log in the trap signal handler in ruby 2.0.
    https://bugs.ruby-lang.org/issues/7917
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/workers/worker_base.rb           | 15 +++++++++------
 vmdb/spec/lib/workers/worker_base_spec.rb | 24 ++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 6 deletions(-)

Comment 13 CFME Bot 2015-03-30 15:31:11 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/180afbe2bd98bc61d0976aedea9049ec09037ca4

commit 180afbe2bd98bc61d0976aedea9049ec09037ca4
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 27 15:43:15 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 27 16:21:53 2015 -0400

    Add at_exit for thin based workers since thin traps interrupts.
    
    Because thin traps interrupts, SignalException/Interrupt exceptions are not raised
    to the UI and Web Service workers, causing them to exit without logging and updating
    it's worker status.  The latter causes the server to not restart the worker as quickly.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/workers/ui_worker.rb          | 8 +++-----
 vmdb/lib/workers/web_service_worker.rb | 8 +++-----
 2 files changed, 6 insertions(+), 10 deletions(-)

Comment 14 CFME Bot 2015-03-30 15:31:15 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/1d5ecf9154c57e6900a013bd8911459cb602573f

commit 1d5ecf9154c57e6900a013bd8911459cb602573f
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Mar 27 16:01:17 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 27 16:21:54 2015 -0400

    Extract all of the commmon WS/UI worker script code to a mixin.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/workers/mixins/web_server_worker_mixin.rb | 15 +++++++++++++++
 vmdb/lib/workers/ui_worker.rb                      | 13 ++-----------
 vmdb/lib/workers/web_service_worker.rb             | 13 ++-----------
 3 files changed, 19 insertions(+), 22 deletions(-)
 create mode 100644 vmdb/lib/workers/mixins/web_server_worker_mixin.rb

Comment 15 CFME Bot 2015-03-30 15:31:19 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/de72d60471394ecc3f9bef456f0eda3fc1bb8d00

commit de72d60471394ecc3f9bef456f0eda3fc1bb8d00
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Mar 30 10:02:22 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Mon Mar 30 10:02:22 2015 -0400

    These shebang lines are wrong and not used so remove them.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/workers/bin/evm_server.rb | 2 --
 vmdb/lib/workers/bin/worker.rb     | 2 --
 2 files changed, 4 deletions(-)

Comment 16 Joe Rafaniello 2015-03-30 15:54:59 UTC

QE, note, the easiest way to verify this is to try send kill pid, kill -TERM pid (same as kill), kill -INT pid on each worker process to ensure they properly log their "exit" and update their worker status.

Previously, the trapping of signals failed on ruby 2.0 because we were trying to log using a mutex (which prints the reported error).  Additionally, the UI worker and Web Service Workers were both NOT trapping anything (on ruby 1.9.3 or 2.0) as the "thin" rails web server was trapping the signals first.  This made it impossible to get a UI or WS worker to log their exit when sending them a SIGINT/SIGTERM.

Now, the workers should exit properly on interrupts (SIGINT/SIGTERM), log their message, and update their status so they can be immediately be restarted.  Previously, the server would have to "timeout" the already exited worker before it would get restarted.  Workers getting restarted via SIGINT/SIGTERM should now take around 15 seconds instead of the nearly 2 minute wait for a interrupted worker to be timed out and restarted.

Finally, the evm server process is also now responding to interrupts although I found other issues that I will be fixing.  For now, it has been fixed to properly log it's message and initiate shutdown when it receives (SIGINT/SIGTERM).

Comment 17 CFME Bot 2015-03-31 13:31:11 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/3ded28b412895c4a9cbe1b487c90633e1f19eb14

commit 3ded28b412895c4a9cbe1b487c90633e1f19eb14
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Mar 30 17:18:29 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Mon Mar 30 17:33:15 2015 -0400

    Rename worker model mixin to be consistent with worker models.
    
    Follow up to: #2386
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/app/models/miq_ui_worker.rb                   |   2 +-
 vmdb/app/models/miq_web_service_worker.rb          |   2 +-
 .../models/mixins/miq_web_server_worker_mixin.rb   | 228 +++++++++++++++++++++
 vmdb/app/models/mixins/web_server_worker_mixin.rb  | 228 ---------------------
 .../mixins/miq_web_server_worker_mixin_spec.rb     |  12 ++
 .../models/mixins/web_server_worker_mixin_spec.rb  |  12 --
 6 files changed, 242 insertions(+), 242 deletions(-)
 create mode 100644 vmdb/app/models/mixins/miq_web_server_worker_mixin.rb
 delete mode 100644 vmdb/app/models/mixins/web_server_worker_mixin.rb
 create mode 100644 vmdb/spec/models/mixins/miq_web_server_worker_mixin_spec.rb
 delete mode 100644 vmdb/spec/models/mixins/web_server_worker_mixin_spec.rb

Comment 18 CFME Bot 2015-03-31 13:31:15 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/71500935772d8123bbbc2be6ab46e3fd16469af7

commit 71500935772d8123bbbc2be6ab46e3fd16469af7
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Mar 30 16:25:54 2015 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Mon Mar 30 17:33:20 2015 -0400

    The module name was accidentally duplicated so the Concern was missed.
    
    Add `extend ActiveSupport::Concern` to fix `wrong number of arguments (0 for 1) (ArgumentError)`
    
    Follow up to: #2386
    https://bugzilla.redhat.com/show_bug.cgi?id=1205407

 vmdb/lib/workers/mixins/web_server_worker_mixin.rb | 2 ++
 vmdb/spec/lib/workers/ui_worker_spec.rb            | 9 +++++++++
 2 files changed, 11 insertions(+)
 create mode 100644 vmdb/spec/lib/workers/ui_worker_spec.rb

Comment 20 Milan Falešník 2015-04-23 09:00:47 UTC

Created attachment 1017820 [details]
workersig.rb

Helper script for testing this bug on an appliance. Copy this file to /var/www/miq/vmdb and run it:

bin/rails runner workersig.rb <signal>

so you can do:

bin/rails runner workersig.rb INT

Comment 21 Milan Falešník 2015-04-23 09:17:09 UTC

So, I checked this in 5.4.0.0.22 using the script I attached, all workers recognize the traps according to the log, the error message is not present. However, some of the workers do not want to end (like MiqEmsRefreshWorkerVmware) or it sometimes takes very long, like once what I saw for Generic worker. But killed workers are replaced quite fast. Do you think this is sufficient for verification?

Comment 22 Joe Rafaniello 2015-04-24 19:26:08 UTC

Yes, Milan, if they log their interrupts and that they are exiting, that's enough.

That is what was fixed.  If the workers themselves don't exit within a reasonable amount of time, then, that's worth investigating to see if there's another bug in that worker.

Comment 24 errata-xmlrpc 2015-06-16 12:54:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1100.html

Note You need to log in before you can comment on or make changes to this bug.