1626005 – Appliance won't start with database connection failure at seeding

Bug 1626005 - Appliance won't start with database connection failure at seeding

Summary: Appliance won't start with database connection failure at seeding

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.9.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.10.0
Assignee:	Joe Rafaniello
QA Contact:	Tasos Papaioannou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-09-06 11:16 UTC by Niladri Roy
Modified:	2022-03-13 15:31 UTC (History)
CC List:	10 users (show)
Fixed In Version:	5.10.0.17
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-02-07 23:03:43 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2019:0212	0	None	None	None	2019-02-07 23:03:50 UTC

Description Niladri Roy 2018-09-06 11:16:01 UTC

Description of problem:
Evmserverd fails to start on one of the worker appliance

[----] I, [2018-09-06T05:19:23.816884 #5422:a23104]  INFO -- : MIQ(EvmDatabase.seed) Seeding MiqProductFeature
[----] I, [2018-09-06T05:19:24.219191 #5422:a23104]  INFO -- : MIQ(EvmDatabase.seed) Seeding MiqUserRole
/opt/rh/cfme-gemset/gems/activerecord-5.0.6/lib/active_record/connection_adapters/postgresql/database_statements.rb:98:in `async_exec': PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly (ActiveRecord::StatementInvalid)
        This probably means the server terminated abnormally
        before or while processing the request.
: ROLLBACK
        from /opt/rh/cfme-gemset/gems/activerecord-5.0.6/lib/active_record/connection_adapters/postgresql/database_statements.rb:98:in `block in execute'
        from /opt/rh/cfme-gemset/gems/activerecord-5.0.6/lib/active_record/connection_adapters/abstract_adapter.rb:590:in `block in log'


Version-Release number of selected component (if applicable):
5.9.3.4

How reproducible:
twice at Cu environment

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 11 Joe Rafaniello 2018-09-17 11:49:26 UTC

Without providing more information, there's not much we can do.  We recreated the problem in our environment by restarting postgresql and immediately trying to issue queries, causing a few existing cfme processes to try to connect with non-ssl connections and failing since the server was configured for ssl only.  Note, we couldn't recreate it every time as the reconnect logic worked most of the time. 
When we did recreate it and upon restarting cfme, each new process was able to connect to postgresql using ssl and had no issue.

Note, in my testing, I was able to establish a connection to a restarting postgresql before it logged "database system is ready to accept connections" message, which may indicate postgresql is wrongly accepting connections before everything is setup and ready.

Perhaps their postgresql is taking a significant amount of time to restart and therefore is accepting connections before it's truly ready? Without more information and logs, it's impossible to diagnose.  Also, since SSL relies on hostnames, it's possible that if DNS problems are occurring, it may cause SSL certificate verification to fail.

I'd want to know:
1) Are there other appliances currently connected and working with the same configuration against this postgresql?
2) Did the restart of the cfme appliance show this appliances to connect with ssl or non-ssl (postgresql.log) OR did some other error occur?
3) Are there any issues resolving hostnames with DNS?

Comment 12 Joe Rafaniello 2018-09-18 16:23:41 UTC

Here's what I found:

Postgresql adapter's reconnect! method can have issues with reconnecting
a SSL connection to postgresql. It's unclear if it's the underlying pg
gem, a wrapper around libpq, libpq itself, or openssl.

We use reconnect! in several locations to reconnect to the database
after errors.  This becomes a problem with a SSL connection when
postgresql restarts while we're trying to reconnect.  The symptoms are
we try to reconnect, encounter an error during SSL handshaking [1] (only
seen in postgrsql debug logging), and then never try to do SSL
connections from that client process again. If the server is configured for
hostssl only, each new connection attempt will be a non-SSL connection and
fail [2].

To demonstrate this, I recreated this in rails console, and confirmed a
workaround:

I run this:
irb(main):40:1> conn = ActiveRecord::Base.connection; 500.times { begin; puts conn.select_value("select count(*) from users;"); rescue PG::Error, ActiveRecord::StatementInvalid; begin; conn.reconnect!; rescue; retry; end; end; sleep 0.001}; conn.reconnect

Switch terminals and run this to restart postgresql at nearly the same
time:

systemctl restart $APPLIANCE_PG_SERVICE

If it successfully recreated the problem, the original terminal that is
trying to query the users table 500 times in a retry loop will hang in
this loop. Meanwhile, postgresql is continually logging errors for
client connections trying to use non-SSL seen in [2].

If I try to make a connection from a different rails console process (on
the same machine or different host), it works.

To fix the original console, I need to do the following:

spec = ActiveRecord::Base.remove_connection("primary")
ActiveRecord::Base.establish_connection(spec)

I can then issue queries on the first console.

We still need to dig down into the low level code but I think we can do
the code above instead of issuing a reconnect! in the places we do this.

[1]
2018-09-17 20:55:54 EDT:::1(45390):5ba04d1a.2552:[unknown]@[unknown]:[9554]:DEBUG:  SSL connection from "(anonymous)"
2018-09-17 20:55:54 EDT:::1(45390):5ba04d1a.2552:root@vmdb_production:[9554]:FATAL:  the database system is shutting down
...
2018-09-17 20:55:54 EDT:::1(45390):5ba04d1a.2552:root@vmdb_production:[9554]:DEBUG:  SSL: write alert (0x0100)


[2]
2018-09-17 20:55:55 EDT:::1(38486):5ba04d1b.256a:root@vmdb_production:[9578]:FATAL:  no pg_hba.conf entry for host "::1", user "root", database "vmdb_production", SSL off

Comment 13 CFME Bot 2018-09-21 19:36:03 UTC

https://github.com/ManageIQ/manageiq/pull/18010

Comment 14 CFME Bot 2018-09-21 21:56:28 UTC

New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/2f236f684f4d76a884d307e03fc3e80a0beb2bed
commit 2f236f684f4d76a884d307e03fc3e80a0beb2bed
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Sep 18 16:04:52 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Sep 18 16:04:52 2018 -0400

    Establish a new connection instead of reconnect!

    https://bugzilla.redhat.com/show_bug.cgi?id=1626005

    It was found that reconnect! against a postgresql SSL connection could
    occur while postgresql was stopping or starting, even before the
    "database system is ready to accept connections."  If this is timed
    correctly, this connection could fail SSL handshaking, and the
    client could "give" up trying SSL connections and only ever try non-SSL.

    The workaround until this commit, was to restart the client process such
    as the worker or evm server.

    This was evident in the postgresql log where you'd see a client,
    172.168.1.99, initiate a SSL connection:

    2018-08-30 11:23:52
    EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG:
    connection authorized: user=root database=vmdb_production SSL enabled
    (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)

    Then, sometime later postgresql restarts and this connection timed it
    "right" and all future connections were attempted with SSL off and fail:

    2018-08-30 11:23:52
    EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL:
    no pg_hba.conf entry for host "172.168.1.99", user "root", database
    "vmdb_production", SSL off

    Somewhere deep in pg/libpq/openssl, the client code failed to initialize
    the SSL handshaking code and will continually only attempt non-SSL
    connections until you restart the process.

    Until we can dig deep into the pg/libpq/openssl code, this code change
    forces the re-initialization of the SSL handshaking code in the client
    by discarding the existing connection and establishing a new one.

    To test this, I first recreated this solely in an ActiveRecord
    environment, by restarting postgresql in one ssh session on an appliance
    and running this in rails console in another:

    ```
    500.times do
      begin
        puts conn.select_value("select count(*) from users;")
      rescue PG::Error, ActiveRecord::StatementInvalid
        begin
          conn.reconnect!
        rescue
          retry
        end
      end
      sleep 0.001
    end
    ```

    Then, I verified this code change in a full appliance server/worker
    environment by first recreating this (after many iterations) and then
    confirming it did not recreate with this change applied:

    ```
    for x in `seq 1 500`
    do
      # restart postgresql is too fast, so I needed to first stop it, sleep
      # a little, then start it, sleep a bit longer, then keep doing that in
      # a loop, while the workers/servers would try to reconnect!.
      small_rand=`ruby -e 'puts rand(4)'`
      rand=`ruby -e 'puts rand(10) + 10'`
      echo $small_rand
      echo $rand
      systemctl stop rh-postgresql95-postgresql
      sleep $small_rand
      systemctl start rh-postgresql95-postgresql
      sleep $rand
    done
    ```

 app/models/miq_queue_worker_base/runner.rb | 5 +-
 app/models/miq_server.rb | 4 +-
 app/models/miq_server/environment_management.rb | 4 +-
 app/models/miq_server/queue_management.rb | 6 +-
 4 files changed, 14 insertions(+), 5 deletions(-)

Comment 16 Joe Rafaniello 2018-09-28 14:18:22 UTC

Prasad, if existing processes are failing to connect to postgresql properly, they might not be fully exiting cleanly when you issue the restart of evmserverd.  Maybe they have to stop the evmserverd service, verify it's completely gone and start it again.  Or perhaps, reboot the appliance if it's not the database appliance.

> 3. still same error reported in the evm.log file. I also tried restarted 
evmserverd immediately after restarting postgresql but no difference. 

Interesting, we'll have to see the postgresql logs and evm.log for this time period to see what's happening.

If the database.yml is the same for each appliance that is connecting to this database, I'm out of ideas.

I have a pull request that was merged which fixes the issue I was able to recreate based on their prior logs.  Basically, we were able to try to reconnect! to the database while it was restarting and if we timed it exactly wrong/right, we'd initiate a db client connection that fails SSL handshaking and will never try to use SSL again.  The only workaround would be to restart the process.  My fix, changes this code to not use reconnect! but instead abandon the existing connection and establish a new one so there is no recycling of the existing postgresql client in the rails/ruby/pg/libpq/openssl stack.  Because this was in a corner case, the overhead of doing this extra work should not be a big deal.

Comment 17 Joe Rafaniello 2018-09-28 16:15:18 UTC

Niladri/Prasad,

If they're still seeing this message after restarting postgreql, waiting, then stop/start or restarting evmserverd:

2018-09-17 20:55:55 EDT:::1(38486):5ba04d1b.256a:root@vmdb_production:[9578]:FATAL:  no pg_hba.conf entry for host "::1", user "root", database "vmdb_production", SSL off

Then, another thing they can try is exporting an environment variable on the system and see if forcing SSL on the client makes any difference:

export PGSSLMODE="require"


https://www.postgresql.org/docs/9.4/static/libpq-connect.html#LIBPQ-CONNECT-SSLMODE

Unfortunately, I believe the SSL attempt fails (since it's preferred by default) and it fails back to non-SSL.  By forcing SSL, it will just fail, I think.

Comment 19 Joe Rafaniello 2018-10-11 14:11:36 UTC

Hi Niladri,

> Cu has tried to launch a new appliance and join it to the existing environment, it also doesn't show up in the UI.

I don't know what this means.  I reviewed the logs for the new appliance and I have no idea what the problem is.  If this is a new issue, please open a new BZ.  

I'll assume the issue with attempting to use non-SSL postgresql connections is resolved and let this bug be shipped as fixed since it was clearly fixed in my recreation environment.  If the customer is not seeing this anymore, we can come back to this when and if it happens again.

Comment 22 Joe Rafaniello 2018-10-12 13:29:51 UTC

Niladri, you describe the exact scenario I fixed 3 weeks ago in https://github.com/ManageIQ/manageiq/pull/18010

In step 1, they started evmserverd
In step 3, they stopped postgresql
In step 6, they started postgresql and started having clients connect with non-SSL.

Nowhere did they restart evmserverd after restarting postgresql.  If you hit bug causing clients to use not try SSL and instead fallback to non-SSL on a SSL only postgresql, then you need to restart the bad ruby processes.

This is exactly what is fixed in the pull request above.

The alternative to the fix above is to restart evmserverd, if they fail to restart due to lack to db connection, kill the evmserver ruby processes and try systemctl start evmserverd.  The new connections should then establish SSL connections to progress as described several times above.

Comment 28 Tasos Papaioannou 2018-10-30 14:22:41 UTC

Verified on 5.10.0.21.

Comment 31 errata-xmlrpc 2019-02-07 23:03:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0212

Note You need to log in before you can comment on or make changes to this bug.