Description of problem: Unexpected errors in newly provisioned CFME appliance Version-Release number of selected component (if applicable): 5.4.0.0.13.20150319164909_f011a66 How reproducible: 100% Steps to Reproduce: 1. Deploy appliance 2. Check evm.log 3. Actual results: log writing failed. can't be called from trap context [multiple times] FATAL: terminating connection due to administrator command log writing failed. can't be called from trap context [multiple times] [----] E, [2015-03-24T05:45:55.271621 #2952:108dea4] ERROR -- : MIQ(abstract_adapter) Name: [MiqServer Load], Message: [PGError: could not connect to server: No such file or directory Is the server running locally and accepting connections on Unix domain socket "/tmp/.s.PGSQL.5432"? ...: SELECT "miq_servers".* FROM "miq_servers" WHERE "miq_servers"."guid" = 'c48e493e-d208-11e4-a744-001a4a2dd78a' LIMIT 1...] Tue Mar 24 09:47:49 UTC 2015 EVMINIT EVM Appliance Booted Expected results: Booting without errors Additional info:
Please provide the full log. There might be a bug trying to trap specific signals but it's not clear what's going on to start triaging.
Created attachment 1006043 [details] evm.log
Thanks! This was caused when we upgraded to ruby 2.0. See: https://bugs.ruby-lang.org/issues/7917
Note, this would cause some processes (server/worker) to fail to log that they're going away, specifically, when the server/worker receives a SIGINT, SIGTERM, SIGUSR1, SIGUSR2. Im working on removing the traps in favor of regular exception handling.
Upstream [WIP] PR: https://github.com/ManageIQ/manageiq/pull/2386 Bumping priority... this is pretty bad to have workers/server process not properly process signals and disappear.
Note, this upstream PR also fixes an issue where the UI/Webservice workers don't gracefully log their "exit" and cleanly restart if they encounter a SIGINT/SIGTERM.
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/b125f4532829b87a23d1f333358c75d039b1454d commit b125f4532829b87a23d1f333358c75d039b1454d Author: Joe Rafaniello <jrafanie> AuthorDate: Thu Mar 26 18:24:18 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 27 16:21:49 2015 -0400 Add a EvmServer binstub so EvmServer can be required directly. This makes testing the EvmServer class possible. https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/tasks/evm_application.rb | 2 +- vmdb/lib/workers/bin/evm_server.rb | 4 ++++ vmdb/lib/workers/evm_server.rb | 2 -- 3 files changed, 5 insertions(+), 3 deletions(-) create mode 100644 vmdb/lib/workers/bin/evm_server.rb
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/400f9d9dea484329eba1732c6a849032c621cf08 commit 400f9d9dea484329eba1732c6a849032c621cf08 Author: Joe Rafaniello <jrafanie> AuthorDate: Wed Mar 25 17:28:27 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 27 16:21:52 2015 -0400 Don't try to trap SIGKILL since it can't be trapped/ignored. See #2313 https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/workers/evm_server.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/2222b202be7985f4411ca9bf49a5f1883f043ec7 commit 2222b202be7985f4411ca9bf49a5f1883f043ec7 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 27 10:24:51 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 27 16:21:52 2015 -0400 Cleanup WorkerBase initialization to simplify test setup. https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/workers/worker_base.rb | 11 +++++++---- vmdb/spec/lib/workers/worker_base_spec.rb | 15 +++------------ 2 files changed, 10 insertions(+), 16 deletions(-)
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/859b8a77d3885a2910425aa1b2ad031c82b4a0a2 commit 859b8a77d3885a2910425aa1b2ad031c82b4a0a2 Author: Joe Rafaniello <jrafanie> AuthorDate: Wed Mar 25 17:38:27 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 27 16:21:52 2015 -0400 Convert server signal 'trap' to normal SignalException handling. Among other things, you can't log in the trap signal handler in ruby 2.0. https://bugs.ruby-lang.org/issues/7917 https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/workers/evm_server.rb | 12 +++++++----- vmdb/spec/lib/workers/evm_server_spec.rb | 31 +++++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+), 5 deletions(-) create mode 100644 vmdb/spec/lib/workers/evm_server_spec.rb
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/5d67a7bbbc58f8ed81ce1f8bcdea8cdd21f0f08b commit 5d67a7bbbc58f8ed81ce1f8bcdea8cdd21f0f08b Author: Joe Rafaniello <jrafanie> AuthorDate: Thu Mar 26 10:36:31 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 27 16:21:53 2015 -0400 Convert worker 'trap' to normal SignalException handling. Among other things, you can't log in the trap signal handler in ruby 2.0. https://bugs.ruby-lang.org/issues/7917 https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/workers/worker_base.rb | 15 +++++++++------ vmdb/spec/lib/workers/worker_base_spec.rb | 24 ++++++++++++++++++++++++ 2 files changed, 33 insertions(+), 6 deletions(-)
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/180afbe2bd98bc61d0976aedea9049ec09037ca4 commit 180afbe2bd98bc61d0976aedea9049ec09037ca4 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 27 15:43:15 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 27 16:21:53 2015 -0400 Add at_exit for thin based workers since thin traps interrupts. Because thin traps interrupts, SignalException/Interrupt exceptions are not raised to the UI and Web Service workers, causing them to exit without logging and updating it's worker status. The latter causes the server to not restart the worker as quickly. https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/workers/ui_worker.rb | 8 +++----- vmdb/lib/workers/web_service_worker.rb | 8 +++----- 2 files changed, 6 insertions(+), 10 deletions(-)
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/1d5ecf9154c57e6900a013bd8911459cb602573f commit 1d5ecf9154c57e6900a013bd8911459cb602573f Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Mar 27 16:01:17 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Mar 27 16:21:54 2015 -0400 Extract all of the commmon WS/UI worker script code to a mixin. https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/workers/mixins/web_server_worker_mixin.rb | 15 +++++++++++++++ vmdb/lib/workers/ui_worker.rb | 13 ++----------- vmdb/lib/workers/web_service_worker.rb | 13 ++----------- 3 files changed, 19 insertions(+), 22 deletions(-) create mode 100644 vmdb/lib/workers/mixins/web_server_worker_mixin.rb
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/de72d60471394ecc3f9bef456f0eda3fc1bb8d00 commit de72d60471394ecc3f9bef456f0eda3fc1bb8d00 Author: Joe Rafaniello <jrafanie> AuthorDate: Mon Mar 30 10:02:22 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Mon Mar 30 10:02:22 2015 -0400 These shebang lines are wrong and not used so remove them. https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/workers/bin/evm_server.rb | 2 -- vmdb/lib/workers/bin/worker.rb | 2 -- 2 files changed, 4 deletions(-)
QE, note, the easiest way to verify this is to try send kill pid, kill -TERM pid (same as kill), kill -INT pid on each worker process to ensure they properly log their "exit" and update their worker status. Previously, the trapping of signals failed on ruby 2.0 because we were trying to log using a mutex (which prints the reported error). Additionally, the UI worker and Web Service Workers were both NOT trapping anything (on ruby 1.9.3 or 2.0) as the "thin" rails web server was trapping the signals first. This made it impossible to get a UI or WS worker to log their exit when sending them a SIGINT/SIGTERM. Now, the workers should exit properly on interrupts (SIGINT/SIGTERM), log their message, and update their status so they can be immediately be restarted. Previously, the server would have to "timeout" the already exited worker before it would get restarted. Workers getting restarted via SIGINT/SIGTERM should now take around 15 seconds instead of the nearly 2 minute wait for a interrupted worker to be timed out and restarted. Finally, the evm server process is also now responding to interrupts although I found other issues that I will be fixing. For now, it has been fixed to properly log it's message and initiate shutdown when it receives (SIGINT/SIGTERM).
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/3ded28b412895c4a9cbe1b487c90633e1f19eb14 commit 3ded28b412895c4a9cbe1b487c90633e1f19eb14 Author: Joe Rafaniello <jrafanie> AuthorDate: Mon Mar 30 17:18:29 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Mon Mar 30 17:33:15 2015 -0400 Rename worker model mixin to be consistent with worker models. Follow up to: #2386 https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/app/models/miq_ui_worker.rb | 2 +- vmdb/app/models/miq_web_service_worker.rb | 2 +- .../models/mixins/miq_web_server_worker_mixin.rb | 228 +++++++++++++++++++++ vmdb/app/models/mixins/web_server_worker_mixin.rb | 228 --------------------- .../mixins/miq_web_server_worker_mixin_spec.rb | 12 ++ .../models/mixins/web_server_worker_mixin_spec.rb | 12 -- 6 files changed, 242 insertions(+), 242 deletions(-) create mode 100644 vmdb/app/models/mixins/miq_web_server_worker_mixin.rb delete mode 100644 vmdb/app/models/mixins/web_server_worker_mixin.rb create mode 100644 vmdb/spec/models/mixins/miq_web_server_worker_mixin_spec.rb delete mode 100644 vmdb/spec/models/mixins/web_server_worker_mixin_spec.rb
New commit detected on manageiq/master: https://github.com/ManageIQ/manageiq/commit/71500935772d8123bbbc2be6ab46e3fd16469af7 commit 71500935772d8123bbbc2be6ab46e3fd16469af7 Author: Joe Rafaniello <jrafanie> AuthorDate: Mon Mar 30 16:25:54 2015 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Mon Mar 30 17:33:20 2015 -0400 The module name was accidentally duplicated so the Concern was missed. Add `extend ActiveSupport::Concern` to fix `wrong number of arguments (0 for 1) (ArgumentError)` Follow up to: #2386 https://bugzilla.redhat.com/show_bug.cgi?id=1205407 vmdb/lib/workers/mixins/web_server_worker_mixin.rb | 2 ++ vmdb/spec/lib/workers/ui_worker_spec.rb | 9 +++++++++ 2 files changed, 11 insertions(+) create mode 100644 vmdb/spec/lib/workers/ui_worker_spec.rb
Created attachment 1017820 [details] workersig.rb Helper script for testing this bug on an appliance. Copy this file to /var/www/miq/vmdb and run it: bin/rails runner workersig.rb <signal> so you can do: bin/rails runner workersig.rb INT
So, I checked this in 5.4.0.0.22 using the script I attached, all workers recognize the traps according to the log, the error message is not present. However, some of the workers do not want to end (like MiqEmsRefreshWorkerVmware) or it sometimes takes very long, like once what I saw for Generic worker. But killed workers are replaced quite fast. Do you think this is sufficient for verification?
Yes, Milan, if they log their interrupts and that they are exiting, that's enough. That is what was fixed. If the workers themselves don't exit within a reasonable amount of time, then, that's worth investigating to see if there's another bug in that worker.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1100.html