Bug 1420431
Summary: | Middleware - Add/Delete Datasources causes CFME temporary unavailable | ||
---|---|---|---|
Product: | Red Hat CloudForms Management Engine | Reporter: | Hayk Hovsepyan <hhovsepy> |
Component: | Appliance | Assignee: | Cainã Costa <cacosta> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Dave Johnson <dajohnso> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | unspecified | CC: | abellott, abonas, cacosta, dajohnso, dclarizi, hhovsepy, hkataria, hrupp, jfrey, jhardy, jrafanie, mfeifer, mmahoney, mpovolny, obarenbo, pyadav, simaishi |
Target Milestone: | GA | Keywords: | TestOnly, Triaged |
Target Release: | 5.9.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | 5.9.0.4 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-03-06 14:52:27 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | CFME Core | Target Upstream Version: | |
Embargoed: |
Description
Hayk Hovsepyan
2017-02-08 15:49:29 UTC
It is currently being developed. PR created: https://github.com/ManageIQ/manageiq/pull/14116

This issue is infra/core and is not middleware-specific (there is a discussion about it with further details in the PR on GitHub), so I'm reassigning this to MIQ core.

Encountered the same issue while deploying an application to the middleware server group.

Assigning over to appliance after reading through the discussion on the linked PR: https://github.com/ManageIQ/manageiq/pull/14116. The underlying problem appears to be an issue with the number of connections being used and has nothing to do with how the provider is implemented.

In case it was missed, I again asked for Rails debug logs (level_rails in advanced settings, then restart) from a production-mode appliance with the issue recreated, so we can see which controller actions check out database connections but don't release them: https://github.com/ManageIQ/manageiq/issues/10333#issuecomment-305614002. There's not much we can do without logs or an environment to recreate this.

Reassigning, since ActionCable threads are consuming connections and not releasing them: https://github.com/ManageIQ/manageiq/issues/10333#issuecomment-307150412

David, I'm not sure if or how ActionCable is causing this, but can you look at it and let me know if it's something we can fix on the UI side? Thx, Dan

Hayk, can you provide the logs with ActionCable/websocket connections disabled, in the same way they were provided here: https://github.com/ManageIQ/manageiq/issues/10333#issuecomment-307045215? We need to see what is consuming those database connections, or whether something else is occurring. Alternatively, can you provide me the IP of an appliance after you applied the configuration changes to disable ActionCable and recreated the issue? Thanks, Joe

Hi Heiko, I spent several hours looking at the logs provided by Hayk, and I'm not at all familiar with the Hawkular client or the provider. Could you have someone look at this?
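The failure mode being chased here, a thread that checks a connection out of a fixed-size pool and never checks it back in, can be sketched without Rails. The following is an illustrative toy (`TinyPool` is invented for this sketch, not ManageIQ or ActiveRecord code); its `with_connection` mirrors the checkout/checkin pairing that ActiveRecord's `ConnectionPool#with_connection` guarantees via `ensure`.

```ruby
require "thread"

# Minimal illustrative stand-in for a fixed-size DB connection pool -- NOT
# ActiveRecord's implementation. Connections live in a queue; a checkout
# removes one and a checkin returns it.
class TinyPool
  def initialize(size)
    @queue = Queue.new
    size.times { |i| @queue << "conn-#{i}" }
  end

  def checkout
    @queue.pop(true) # non-blocking; raises ThreadError when exhausted
  end

  def checkin(conn)
    @queue << conn
  end

  def available
    @queue.size
  end

  # Safe pattern: the connection is always returned, even if the block
  # raises -- analogous to ActiveRecord's ConnectionPool#with_connection.
  def with_connection
    conn = checkout
    yield conn
  ensure
    checkin(conn) if conn
  end
end

# Leaky pattern (what the logs below show): worker threads check out a
# connection and exit without checking it back in, draining the pool.
leaky = TinyPool.new(2)
2.times { Thread.new { leaky.checkout }.join }
puts leaky.available # 0 -- pool exhausted; further checkouts would fail

# Paired pattern: every checkout is matched by a checkin.
safe = TinyPool.new(2)
2.times { Thread.new { safe.with_connection { |c| c } }.join }
puts safe.available # 2 -- pool fully restored
```

With a Puma worker's pool size of 8, a handful of such leaked threads is enough to exhaust the pool, which matches the "temporarily unavailable" symptom in the summary.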
It seems we have threads created when adding a datasource and adding a JDBC driver in the provider. These threads check out a database connection but never check it back into the ActiveRecord connection pool. Note that the DB connections are used to insert notifications for success/failure operations. Also note that the first thread that fails to check in a connection was on a failure operation. Perhaps the Hawkular client code is creating a thread which reserves the ActiveRecord connection, but the thread doesn't properly quit on the error condition? I have no idea.

If you grep the offending process id for the connection pool checkin and checkout lines for threads that don't check in connections, we get this (note, I have added new lines so it's easier to see the checkout/checkin for each thread):

```
03:18:18 ~/Downloads/issues/1420431 (2.4.1) + grep -E "13038.+log_after" error_production.log | tail -n 30
[----] D, [2017-08-30T03:44:35.616534 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 3, in use: 3, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:36.932575 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 3, in use: 2, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:37.414738 #13038:8f807e8] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 3, in use: 3, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:47.371737 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 4, in use: 4, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:47.959437 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 4, in use: 3, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:48.918313 #13038:63197cc] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 4, in use: 4, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:51.264456 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 5, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:51.555935 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 5, in use: 4, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:00.153862 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 5, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:00.420890 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 5, in use: 4, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:02.690869 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 5, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:03.270389 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 5, in use: 4, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:03.504891 #13038:66136bc] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 5, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:06.377768 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 6, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:06.666113 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 6, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:15.607613 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 6, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:15.873705 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 6, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:17.262422 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 6, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:17.843979 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 6, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:18.034001 #13038:ae26e04] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 6, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:57.710916 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 7, in use: 7, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:57.974742 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 7, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:59.492422 #13038:127ff00] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 7, in use: 7, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:00.069919 #13038:127ff00] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 7, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:00.392002 #13038:be6ae14] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 7, in use: 7, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:02.007128 #13038:127eec0] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 8, in use: 8, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:02.276243 #13038:127eec0] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 8, in use: 7, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:11.792330 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 8, in use: 8, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:12.369336 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 8, in use: 7, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:12.883126 #13038:3bd5328] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 8, in use: 8, waiting_in_queue: 0
```

From there, we have the following thread ids that only CHECKOUT but never check in, so we can grep the logs for them:

```
grep -E "8f807e8|63197cc|66136bc|ae26e04|be6ae14|3bd5328" error_production.log error_evm.log
```

Most of them look like this:

1) check out a DB connection
2) WHERE query by ems_ref on the middleware_servers table
3) query on the notification_types table
4) users query
5) INSERT INTO notifications
6) INSERT INTO notification_recipients

They then do nothing else; they do not check in the database connection.

```
error_production.log:[----] D, [2017-08-30T03:44:37.414738 #13038:8f807e8] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 3, in use: 3, waiting_in_queue: 0
error_production.log:[----] D, [2017-08-30T03:44:37.417504 #13038:8f807e8] DEBUG -- : ManageIQ::Providers::Hawkular::MiddlewareManager::MiddlewareServer Load (0.5ms)  SELECT "middleware_servers".* FROM "middleware_servers" WHERE "middleware_servers"."ems_ref" = $1 LIMIT $2  [["ems_ref", "/t;hawkular/f;63ac497e-1b91-4507-a130-b217c06f2350/r;EAP7~~"], ["LIMIT", 1]]
error_production.log:[----] D, [2017-08-30T03:44:37.447651 #13038:8f807e8] DEBUG -- : ManageIQ::Providers::Hawkular::MiddlewareManager::MiddlewareServer Inst Including Associations (29.5ms - 1rows)
error_production.log:[----] D, [2017-08-30T03:44:37.479815 #13038:8f807e8] DEBUG -- : NotificationType Load (0.7ms)  SELECT "notification_types".* FROM "notification_types" WHERE "notification_types"."name" = $1 LIMIT $2  [["name", "mw_op_failure"], ["LIMIT", 1]]
error_production.log:[----] D, [2017-08-30T03:44:37.480485 #13038:8f807e8] DEBUG -- : NotificationType Inst Including Associations (0.2ms - 1rows)
error_production.log:[----] D, [2017-08-30T03:44:37.484603 #13038:8f807e8] DEBUG -- : (0.3ms)  BEGIN
```
```
error_production.log:[----] D, [2017-08-30T03:44:37.487193 #13038:8f807e8] DEBUG -- : (0.5ms)  SELECT "users"."id" FROM "users"
error_production.log:[----] D, [2017-08-30T03:44:37.547815 #13038:8f807e8] DEBUG -- : SQL (26.2ms)  INSERT INTO "notifications" ("notification_type_id", "created_at", "updated_at", "options") VALUES ($1, $2, $3, $4) RETURNING "id"  [["notification_type_id", 23], ["created_at", "2017-08-30 07:44:37.485423"], ["updated_at", "2017-08-30 07:44:37.485423"], ["options", "---\n:op_name: Add JDBC Driver\n:op_arg: sqlserver\n:mw_server: EAP7 (63ac497e-1b91-4507-a130-b217c06f2350)\n"]]
error_production.log:[----] D, [2017-08-30T03:44:37.551589 #13038:8f807e8] DEBUG -- : SQL (1.6ms)  INSERT INTO "notification_recipients" ("notification_id", "user_id", "seen") VALUES ($1, $2, $3) RETURNING "id"  [["notification_id", 1], ["user_id", 2], ["seen", "f"]]
error_production.log:[----] D, [2017-08-30T03:44:37.561188 #13038:8f807e8] DEBUG -- : (8.0ms)  COMMIT
```

Note that two of these threads, including the first add_middleware_jdbc_driver, failed with the errors below in evm.log:

```
error_evm.log:[----] E, [2017-08-30T03:44:37.412895 #13038:8f807e8] ERROR -- : MIQ(ManageIQ::Providers::Hawkular::MiddlewareManager#add_middleware_jdbc_driver) error callback was called, reason: Could not perform [Add] on a [JDBC Driver] given by inventory path [/t;hawkular/f;63ac497e-1b91-4507-a130-b217c06f2350/r;EAP7~~] requested on [2017-08-30T07:44Z]: java.lang.IllegalArgumentException: [com.microsoft] already exists at [/home/hudson/jboss-eap7-standalone/modules/com/microsoft/main]
error_evm.log:[----] E, [2017-08-30T03:46:00.391072 #13038:be6ae14] ERROR -- : MIQ(ManageIQ::Providers::Hawkular::MiddlewareManager#add_middleware_datasource) error callback was called, reason: Could not perform [Add] on a [Datasource] given by inventory path [/t;hawkular/f;63ac497e-1b91-4507-a130-b217c06f2350/r;EAP7~~] requested on [2017-08-30T07:46Z]: org.hawkular.dmr.api.OperationFailureException: Could not perform operation [composite]: {"WFLYCTL0062: Composite operation failed and was rolled back. Steps that failed:" => {"Operation step-1" => {"WFLYCTL0180: Services with missing/unavailable dependencies" => ["jboss.driver-demander.java:/XAOracleDS2 is missing [jboss.jdbc-driver.oracle]","org.wildfly.data-source.XAOracleDS2 is missing [jboss.jdbc-driver.oracle]"]}}}
```

I am not familiar with the code, but if it's being run asynchronously in threads, many concurrent requests to add_middleware_datasource and add_middleware_jdbc_driver could exhaust our database connection pool. One problem I see is that we're doing the notification creation (and checking out and checking in database connections) in a block when performing provider operations: https://github.com/ManageIQ/manageiq-providers-hawkular/blob/128237e59c2e42f90ba7e8f453b7e70880b487b5/app/models/manageiq/providers/hawkular/middleware_manager.rb#L378

I might suggest that this code:

```ruby
connection.operations(true).add_jdbc_driver(driver_data) do |on|
  on.success do |data|
    _log.debug "Success on websocket-operation #{data}"
    emit_middleware_notification(:mw_op_success, 'Add JDBC Driver', driver_data[:driver_name], ems_ref, MiddlewareServer)
  end
  on.failure do |error|
    _log.error 'error callback was called, reason: ' + error.to_s
    emit_middleware_notification(:mw_op_failure, 'Add JDBC Driver', driver_data[:driver_name], ems_ref, MiddlewareServer)
  end
end
```

be changed to move the emit_middleware_notification outside of the connection.operations block, like this:

```ruby
op_result = nil
connection.operations(true).add_jdbc_driver(driver_data) do |on|
  on.success do |data|
    op_result = :mw_op_success
    _log.debug "Success on websocket-operation #{data}"
  end
  on.failure do |error|
    _log.error 'error callback was called, reason: ' + error.to_s
    op_result = :mw_op_failure
  end
end
emit_middleware_notification(op_result, 'Add JDBC Driver', driver_data[:driver_name], ems_ref, MiddlewareServer)
```

I still don't understand why the threads aren't checking in the database connection. Perhaps the first error condition isn't being handled properly and we're not checking in the connection. Perhaps the threads aren't properly exiting due to an error or some other condition, and therefore aren't checking in their connections. Do you have any thoughts?

Heiko, please respond to Joe's NeedInfo per comment 24.

Sorry I missed the question earlier. Cainã will look at it.

Hello Joe, great investigation! I did some research on this part of the code, and it is indeed running in a different thread because of the WebSocket connection that we use to fetch information. It is not just a matter of moving the notification creation outside of the block, because the result is asynchronous (that is, the method returns immediately and we create the notification as soon as the data is ready). I'm currently investigating what needs to be done, but the idea I have in mind right now is to check out and check in the connection only when the notification is actually created. That should at least make the problem less likely to occur, and hopefully it will fix it. Will update as soon as I have a patch.

This should fix the problem: https://github.com/ManageIQ/manageiq-providers-hawkular/pull/76 Hayk, can you check if it works?

Fix is on master.

Verified on 5.9.0.4.20171024163837_ef71ea6. Unable to reproduce that error anymore.
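The fix direction described above, creating the notification from the async callback thread but holding a pooled connection only for the moment the notification is written, can be sketched as follows. All names here are invented for illustration (this is not the code from the linked PR), and the pool is the same toy queue idea rather than ActiveRecord's, so the example stays self-contained.

```ruby
require "thread"

# Hypothetical sketch: the operation result arrives asynchronously on a
# websocket callback thread, so the notification must still be created from
# that thread -- but a connection is borrowed only while the notification
# row is written, not for the lifetime of the callback thread.

POOL = Queue.new
2.times { |i| POOL << "conn-#{i}" }

# Toy equivalent of ActiveRecord's ConnectionPool#with_connection: the
# ensure clause guarantees the checkin even if the block raises.
def with_connection
  conn = POOL.pop(true)
  yield conn
ensure
  POOL << conn if conn
end

# Stand-in for emit_middleware_notification: holds a connection only for
# the duration of the (simulated) insert.
def emit_middleware_notification(op_result, op_name)
  with_connection do |_conn|
    { result: op_result, op: op_name } # the real code would create a Notification row here
  end
end

result = nil
callback_thread = Thread.new do
  # Simulated on.success callback from the async websocket operation.
  result = emit_middleware_notification(:mw_op_success, "Add JDBC Driver")
end
callback_thread.join

puts result[:result] # mw_op_success
puts POOL.size       # 2 -- the callback thread retained no connection
```

The key property is that the callback thread never owns a connection between operations: even if many add_datasource/add_jdbc_driver callbacks are alive at once, each one touches the pool only briefly, which is what makes exhaustion far less likely.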