Bug 1420431

Summary: Middleware - Add/Delete Datasources causes CFME to be temporarily unavailable

Product: Red Hat CloudForms Management Engine
Component: Appliance
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: medium
Status: CLOSED CURRENTRELEASE
Keywords: TestOnly, Triaged
Target Milestone: GA
Target Release: 5.9.0
Fixed In Version: 5.9.0.4
Last Closed: 2018-03-06 14:52:27 UTC
Type: Bug
Reporter: Hayk Hovsepyan <hhovsepy>
Assignee: Cainã Costa <cacosta>
QA Contact: Dave Johnson <dajohnso>
Cloudforms Team: CFME Core
CC: abellott, abonas, cacosta, dajohnso, dclarizi, hhovsepy, hkataria, hrupp, jfrey, jhardy, jrafanie, mfeifer, mmahoney, mpovolny, obarenbo, pyadav, simaishi

Description Hayk Hovsepyan 2017-02-08 15:49:29 UTC
Description of problem:
After adding several Datasources to a Middleware Provider's EAP server and then deleting them, CFME becomes unavailable with a "We're sorry, but something went wrong." message.

Version-Release number of selected component (if applicable):
5.8.0

How reproducible:
After several Datasources have been created and deleted, the error starts happening consistently with each subsequent Add/Delete.

Steps to Reproduce:
1. Add a Middleware Provider into CFME.
2. Add several JDBC Drivers, for different Database types, into EAP7 Server from Middleware Provider.
3. Create several Datasources for all Database types for which JDBC Drivers were added.
4. Delete some of those Datasources.
5. Refresh the CFME page. You will see the error page "We're sorry, but something went wrong." for a few minutes.

Additional info:
The same problem was logged earlier on ManageIQ upstream: https://github.com/ManageIQ/manageiq/issues/10333

Comment 3 Cainã Costa 2017-03-01 13:33:20 UTC
A fix is currently being developed.

Comment 4 Cainã Costa 2017-03-01 15:31:43 UTC
PR Created: https://github.com/ManageIQ/manageiq/pull/14116

Comment 7 Alissa 2017-04-06 13:14:27 UTC
This issue is infra/core and not middleware specific (there is a discussion with further details in the GitHub PR), so I'm reassigning this to MIQ core.

Comment 8 Prachi 2017-04-17 15:15:15 UTC
Encountered the same issue while deploying an application to the middleware server group.

Comment 9 Greg Blomquist 2017-05-22 20:56:12 UTC
Assigning over to appliance after reading through the discussion on the linked PR:  https://github.com/ManageIQ/manageiq/pull/14116

The underlying problem appears to be an issue with the number of connections being used and has nothing to do with how the provider is implemented.

Comment 11 Joe Rafaniello 2017-06-01 20:48:17 UTC
In case it was missed, I again asked for Rails debug logs (set level_rails in the advanced settings and restart) from a production-mode appliance with the issue recreated, so we can see which controller actions check out database connections but don't release them.

https://github.com/ManageIQ/manageiq/issues/10333#issuecomment-305614002
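
For reference, the relevant piece of the appliance's advanced settings looks roughly like this (a sketch; key names assume the standard ManageIQ log section):

    # Advanced settings, log section (restart the appliance afterwards):
    log:
      level: info
      level_rails: debug    # raise Rails logging to debug while recreating the issue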

There's not much we can do without logs or an environment to recreate this.

Comment 12 Joe Rafaniello 2017-07-17 16:06:17 UTC
Reassigning since actioncable threads are consuming connections and not releasing them:

https://github.com/ManageIQ/manageiq/issues/10333#issuecomment-307150412

Comment 13 Dan Clarizio 2017-07-17 21:21:58 UTC
David, not sure if/how actioncable is causing this, but can you look at it and let me know if it's something we can fix on the UI side?  Thx, Dan

Comment 20 Joe Rafaniello 2017-08-29 19:37:48 UTC
Hayk,

Can you provide logs with the actioncable / websocket connections disabled, in the same way they were provided here:

https://github.com/ManageIQ/manageiq/issues/10333#issuecomment-307045215

We need to see what is consuming those database connections or if something else is occurring.

Or, can you provide me the IP of an appliance after you applied the configuration changes to disable actioncable and recreated the issue?

Thanks,
Joe

Comment 24 Joe Rafaniello 2017-08-30 20:03:55 UTC
Hi Heiko,

I spent several hours looking at the logs provided by Hayk, but I'm not at all familiar with the hawkular client or the provider.  Could you have someone look at this?

It seems like we have Threads created when adding a datasource and adding a JDBC driver in the provider.  These threads check out a database connection but never check it back into the ActiveRecord connection pool.  Note, the DB connections are used to insert notifications for success/failure operations.  Also note, the first Thread that failed to check in a connection was on a failure operation.  Perhaps the hawkular client code is creating a Thread which reserves the ActiveRecord connection, but the Thread doesn't properly exit on the error condition?  I have no idea.
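
For illustration, the suspected leak pattern looks roughly like this (a sketch with hypothetical names; in Rails, any ActiveRecord call made on a thread implicitly checks a connection out of the pool for that thread):

    # Hypothetical long-lived callback thread, similar to what the hawkular
    # client might create to deliver websocket operation results:
    Thread.new do
      loop do
        result = results_queue.pop      # block until the next operation result
        # This implicitly checks a pool connection out for this thread:
        Notification.create!(:notification_type => note_type, :options => result)
        # Nothing returns the connection, so it stays reserved by this thread;
        # with a pool size of 8, a handful of such threads exhausts the pool.
      end
    end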

If you grep the offending process ID for the connection pool checkin and checkout lines for Threads that don't check in connections, we get this... note, I have added blank lines so it's easier to see the checkout/checkin for each thread:


03:18:18 ~/Downloads/issues/1420431 (2.4.1) + grep -E "13038.+log_after" error_production.log | tail -n 30
[----] D, [2017-08-30T03:44:35.616534 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 3, in use: 3, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:36.932575 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 3, in use: 2, waiting_in_queue: 0

[----] D, [2017-08-30T03:44:37.414738 #13038:8f807e8] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 3, in use: 3, waiting_in_queue: 0

[----] D, [2017-08-30T03:44:47.371737 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 4, in use: 4, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:47.959437 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 4, in use: 3, waiting_in_queue: 0

[----] D, [2017-08-30T03:44:48.918313 #13038:63197cc] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 4, in use: 4, waiting_in_queue: 0

[----] D, [2017-08-30T03:44:51.264456 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 5, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:44:51.555935 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 5, in use: 4, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:00.153862 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 5, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:00.420890 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 5, in use: 4, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:02.690869 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 5, in use: 5, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:03.270389 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 5, in use: 4, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:03.504891 #13038:66136bc] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 5, in use: 5, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:06.377768 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 6, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:06.666113 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 6, in use: 5, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:15.607613 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 6, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:15.873705 #13038:127f5f0] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 6, in use: 5, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:17.262422 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 6, in use: 6, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:17.843979 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 6, in use: 5, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:18.034001 #13038:ae26e04] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 6, in use: 6, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:57.710916 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 7, in use: 7, waiting_in_queue: 0
[----] D, [2017-08-30T03:45:57.974742 #13038:127c800] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 7, in use: 6, waiting_in_queue: 0

[----] D, [2017-08-30T03:45:59.492422 #13038:127ff00] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 7, in use: 7, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:00.069919 #13038:127ff00] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 7, in use: 6, waiting_in_queue: 0

[----] D, [2017-08-30T03:46:00.392002 #13038:be6ae14] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 7, in use: 7, waiting_in_queue: 0

[----] D, [2017-08-30T03:46:02.007128 #13038:127eec0] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 8, in use: 8, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:02.276243 #13038:127eec0] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 8, in use: 7, waiting_in_queue: 0

[----] D, [2017-08-30T03:46:11.792330 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 8, in use: 8, waiting_in_queue: 0
[----] D, [2017-08-30T03:46:12.369336 #13038:127fb7c] DEBUG -- : PostgreSQLAdapter#log_after_checkin, connection_pool: size: 8, connections: 8, in use: 7, waiting_in_queue: 0

[----] D, [2017-08-30T03:46:12.883126 #13038:3bd5328] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 8, in use: 8, waiting_in_queue: 0


From there, we have the following thread ids that only CHECKOUT but never checkin, so we can grep the logs for them:
grep -E "8f807e8|63197cc|66136bc|ae26e04|be6ae14|3bd5328" error_production.log error_evm.log


Most of them look like this:

1) check out a DB connection
2) WHERE query by ems_ref on middleware_servers
3) query on notification_types
4) users query
5) INSERT into notifications
6) INSERT into notification_recipients

They then do nothing else; they never check the database connection back in.
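
For contrast, a well-behaved thread would end that same sequence by returning its connection, e.g. with the standard ActiveRecord pool API (a sketch, not taken from the provider code):

    # After the notification inserts commit, return this thread's connection:
    ActiveRecord::Base.connection_pool.release_connection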



error_production.log:[----] D, [2017-08-30T03:44:37.414738 #13038:8f807e8] DEBUG -- : PostgreSQLAdapter#log_after_checkout, connection_pool: size: 8, connections: 3, in use: 3, waiting_in_queue: 0
error_production.log:[----] D, [2017-08-30T03:44:37.417504 #13038:8f807e8] DEBUG -- :   ManageIQ::Providers::Hawkular::MiddlewareManager::MiddlewareServer Load (0.5ms)  SELECT  "middleware_servers".* FROM "middleware_servers" WHERE "middleware_servers"."ems_ref" = $1 LIMIT $2  [["ems_ref", "/t;hawkular/f;63ac497e-1b91-4507-a130-b217c06f2350/r;EAP7~~"], ["LIMIT", 1]]
error_production.log:[----] D, [2017-08-30T03:44:37.447651 #13038:8f807e8] DEBUG -- :   ManageIQ::Providers::Hawkular::MiddlewareManager::MiddlewareServer Inst Including Associations (29.5ms - 1rows)
error_production.log:[----] D, [2017-08-30T03:44:37.479815 #13038:8f807e8] DEBUG -- :   NotificationType Load (0.7ms)  SELECT  "notification_types".* FROM "notification_types" WHERE "notification_types"."name" = $1 LIMIT $2  [["name", "mw_op_failure"], ["LIMIT", 1]]
error_production.log:[----] D, [2017-08-30T03:44:37.480485 #13038:8f807e8] DEBUG -- :   NotificationType Inst Including Associations (0.2ms - 1rows)
error_production.log:[----] D, [2017-08-30T03:44:37.484603 #13038:8f807e8] DEBUG -- :    (0.3ms)  BEGIN
error_production.log:[----] D, [2017-08-30T03:44:37.487193 #13038:8f807e8] DEBUG -- :    (0.5ms)  SELECT "users"."id" FROM "users"
error_production.log:[----] D, [2017-08-30T03:44:37.547815 #13038:8f807e8] DEBUG -- :   SQL (26.2ms)  INSERT INTO "notifications" ("notification_type_id", "created_at", "updated_at", "options") VALUES ($1, $2, $3, $4) RETURNING "id"  [["notification_type_id", 23], ["created_at", "2017-08-30 07:44:37.485423"], ["updated_at", "2017-08-30 07:44:37.485423"], ["options", "---\n:op_name: Add JDBC Driver\n:op_arg: sqlserver\n:mw_server: EAP7 (63ac497e-1b91-4507-a130-b217c06f2350)\n"]]
error_production.log:[----] D, [2017-08-30T03:44:37.551589 #13038:8f807e8] DEBUG -- :   SQL (1.6ms)  INSERT INTO "notification_recipients" ("notification_id", "user_id", "seen") VALUES ($1, $2, $3) RETURNING "id"  [["notification_id", 1], ["user_id", 2], ["seen", "f"]]
error_production.log:[----] D, [2017-08-30T03:44:37.561188 #13038:8f807e8] DEBUG -- :    (8.0ms)  COMMIT


Note, two of these threads, including the first add_middleware_jdbc_driver, failed with the errors below in the evm.log:

error_evm.log:[----] E, [2017-08-30T03:44:37.412895 #13038:8f807e8] ERROR -- : MIQ(ManageIQ::Providers::Hawkular::MiddlewareManager#add_middleware_jdbc_driver) error callback was called, reason: Could not perform [Add] on a [JDBC Driver] given by inventory path [/t;hawkular/f;63ac497e-1b91-4507-a130-b217c06f2350/r;EAP7~~] requested on [2017-08-30T07:44Z]: java.lang.IllegalArgumentException: [com.microsoft] already exists at [/home/hudson/jboss-eap7-standalone/modules/com/microsoft/main]
error_evm.log:[----] E, [2017-08-30T03:46:00.391072 #13038:be6ae14] ERROR -- : MIQ(ManageIQ::Providers::Hawkular::MiddlewareManager#add_middleware_datasource) error callback was called, reason: Could not perform [Add] on a [Datasource] given by inventory path [/t;hawkular/f;63ac497e-1b91-4507-a130-b217c06f2350/r;EAP7~~] requested on [2017-08-30T07:46Z]: org.hawkular.dmr.api.OperationFailureException: Could not perform operation [composite]: {"WFLYCTL0062: Composite operation failed and was rolled back. Steps that failed:" => {"Operation step-1" => {"WFLYCTL0180: Services with missing/unavailable dependencies" => ["jboss.driver-demander.java:/XAOracleDS2 is missing [jboss.jdbc-driver.oracle]","org.wildfly.data-source.XAOracleDS2 is missing [jboss.jdbc-driver.oracle]"]}}}


I am not familiar with the code but if it's being run asynchronously in Threads, many concurrent requests to add_middleware_datasource and add_middleware_jdbc_driver could exhaust our database connection pool size.  

One problem I see is that we're doing the notification creation (which checks database connections out and in) inside a callback block when performing provider operations:

https://github.com/ManageIQ/manageiq-providers-hawkular/blob/128237e59c2e42f90ba7e8f453b7e70880b487b5/app/models/manageiq/providers/hawkular/middleware_manager.rb#L378


I might suggest that this code:

        connection.operations(true).add_jdbc_driver(driver_data) do |on|
          on.success do |data|
            _log.debug "Success on websocket-operation #{data}"
            emit_middleware_notification(:mw_op_success, 'Add JDBC Driver', driver_data[:driver_name], ems_ref,
                                         MiddlewareServer)
          end
          on.failure do |error|
            _log.error 'error callback was called, reason: ' + error.to_s
            emit_middleware_notification(:mw_op_failure, 'Add JDBC Driver', driver_data[:driver_name], ems_ref,
                                         MiddlewareServer)
          end
        end
      end

be changed to move the emit_middleware_notification call outside of the connection.operations block, like this:

        op_result = nil
        connection.operations(true).add_jdbc_driver(driver_data) do |on|
          on.success do |data|
            op_result = :mw_op_success
            _log.debug "Success on websocket-operation #{data}"
          end
          on.failure do |error|
            _log.error 'error callback was called, reason: ' + error.to_s
            op_result = :mw_op_failure
          end
        end

        emit_middleware_notification(op_result, 'Add JDBC Driver', driver_data[:driver_name], ems_ref,
                                     MiddlewareServer)


      end


I still don't understand why the Threads aren't checking in the database connection. Perhaps the first error condition isn't being handled properly and we're not checking in the connection.  Perhaps the Threads aren't properly exiting due to an error or some other condition, and therefore aren't checking in their connections.
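
One way to watch this while recreating the issue (a sketch; ConnectionPool#stat requires Rails 5.1 or later, and _log stands in for whatever logger is handy):

    # Log pool statistics periodically; a :busy count that only ever grows is
    # the signature of threads holding connections they never return.
    Thread.new do
      loop do
        _log.info("pool stats: #{ActiveRecord::Base.connection_pool.stat.inspect}")
        sleep 30
      end
    end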

Do you have any thoughts?

Comment 26 Dave Johnson 2017-10-09 03:04:50 UTC
Heiko, please respond to Joe's NeedInfo per comment 24

Comment 27 Heiko W. Rupp 2017-10-10 15:23:32 UTC
Sorry, I missed the question earlier. Cainã will look at it.

Comment 28 Cainã Costa 2017-10-10 23:46:16 UTC
Hello Joe, great investigation! I did some research on this part of the code, and it is indeed running in a different thread because of the WebSocket connection that we use to fetch information. It is not just a matter of moving the notification creation outside of the block, because the result is async (we return from the method immediately and create the notification as soon as the data is ready).

I'm currently investigating what needs to be done, but the idea I have in mind right now is to check out and check in the connection only when the notification is actually created. That should at least make the problem less likely to occur, and hopefully it will fix it. I will update as soon as I have a patch.
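
A sketch of that idea applied to the callback quoted in comment 24, using ActiveRecord's standard with_connection helper, which leases a pool connection only for the duration of the block (illustrative; not necessarily the final patch):

    connection.operations(true).add_jdbc_driver(driver_data) do |on|
      on.success do |data|
        _log.debug "Success on websocket-operation #{data}"
        # Lease a connection just for the insert and return it immediately,
        # even though this callback runs on the client's long-lived thread:
        ActiveRecord::Base.connection_pool.with_connection do
          emit_middleware_notification(:mw_op_success, 'Add JDBC Driver',
                                       driver_data[:driver_name], ems_ref, MiddlewareServer)
        end
      end
      on.failure do |error|
        _log.error 'error callback was called, reason: ' + error.to_s
        ActiveRecord::Base.connection_pool.with_connection do
          emit_middleware_notification(:mw_op_failure, 'Add JDBC Driver',
                                       driver_data[:driver_name], ems_ref, MiddlewareServer)
        end
      end
    end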

Comment 29 Cainã Costa 2017-10-13 19:40:53 UTC
This should fix the problem: https://github.com/ManageIQ/manageiq-providers-hawkular/pull/76

Hayk, can you check if it works?

Comment 31 Cainã Costa 2017-10-18 12:45:20 UTC
Fix is on master.

Comment 33 Hayk Hovsepyan 2017-10-30 10:06:42 UTC
Verified on 5.9.0.4.20171024163837_ef71ea6

Unable to reproduce that error anymore.