Bug 1626005
Summary: | Appliance won't start with database connection failure at seeding | ||
---|---|---|---|
Product: | Red Hat CloudForms Management Engine | Reporter: | Niladri Roy <niroy> |
Component: | Appliance | Assignee: | Joe Rafaniello <jrafanie> |
Status: | CLOSED ERRATA | QA Contact: | Tasos Papaioannou <tpapaioa> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 5.9.3 | CC: | abellott, dmetzger, jdeubel, jrafanie, nchugh, niroy, obarenbo, pmukhedk, simaishi, tpapaioa |
Target Milestone: | GA | ||
Target Release: | 5.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | 5.10.0.17 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-02-07 23:03:43 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Niladri Roy
2018-09-06 11:16:01 UTC
Without providing more information, there's not much we can do. We recreated the problem in our environment by restarting postgresql and immediately trying to issue queries, causing a few existing cfme processes to try to connect with non-ssl connections and failing since the server was configured for ssl only. Note, we couldn't recreate it every time as the reconnect logic worked most of the time. When we did recreate it and upon restarting cfme, each new process was able to connect to postgresql using ssl and had no issue. Note, in my testing, I was able to establish a connection to a restarting postgresql before it logged "database system is ready to accept connections" message, which may indicate postgresql is wrongly accepting connections before everything is setup and ready. Perhaps their postgresql is taking a significant amount of time to restart and therefore is accepting connections before it's truly ready? Without more information and logs, it's impossible to diagnose. Also, since SSL relies on hostnames, it's possible that if DNS problems are occurring, it may cause SSL certificate verification to fail. I'd want to know: 1) Are there other appliances currently connected and working with the same configuration against this postgresql? 2) Did the restart of the cfme appliance show this appliances to connect with ssl or non-ssl (postgresql.log) OR did some other error occur? 3) Are there any issues resolving hostnames with DNS? Here's what I found: Postgresql adapter's reconnect! method can have issues with reconnecting a SSL connection to postgresql. It's unclear if it's the underlying pg gem, a wrapper around libpq, libpq itself, or openssl. We use reconnect! in several locations to reconnect to the database after errors. This becomes a problem with a SSL connection when postgresql restarts while we're trying to reconnect. The symptoms are we try to reconnect, encounter an error during SSL handshaking [1] (only seen in postgrsql debug logging), and then never try to do SSL connections from that client process again. If the server is configured for hostssl only, each new connection attempt will be a non-SSL connection and fail [2]. To demonstrate this, I recreated this in rails console, and confirmed a workaround: I run this: irb(main):40:1> conn = ActiveRecord::Base.connection; 500.times { begin; puts conn.select_value("select count(*) from users;"); rescue PG::Error, ActiveRecord::StatementInvalid; begin; conn.reconnect!; rescue; retry; end; end; sleep 0.001}; conn.reconnect Switch terminals and run this to restart postgresql at nearly the same time: systemctl restart $APPLIANCE_PG_SERVICE If it successfully recreated the problem, the original terminal that is trying to query the users table 500 times in a retry loop will hang in this loop. Meanwhile, postgresql is continually logging errors for client connections trying to use non-SSL seen in [2]. If I try to make a connection from a different rails console process (on the same machine or different host), it works. To fix the original console, I need to do the following: spec = ActiveRecord::Base.remove_connection("primary") ActiveRecord::Base.establish_connection(spec) I can then issue queries on the first console. We still need to dig down into the low level code but I think we can do the code above instead of issuing a reconnect! in the places we do this. [1] 2018-09-17 20:55:54 EDT:::1(45390):5ba04d1a.2552:[unknown]@[unknown]:[9554]:DEBUG: SSL connection from "(anonymous)" 2018-09-17 20:55:54 EDT:::1(45390):5ba04d1a.2552:root@vmdb_production:[9554]:FATAL: the database system is shutting down ... 2018-09-17 20:55:54 EDT:::1(45390):5ba04d1a.2552:root@vmdb_production:[9554]:DEBUG: SSL: write alert (0x0100) [2] 2018-09-17 20:55:55 EDT:::1(38486):5ba04d1b.256a:root@vmdb_production:[9578]:FATAL: no pg_hba.conf entry for host "::1", user "root", database "vmdb_production", SSL off New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/2f236f684f4d76a884d307e03fc3e80a0beb2bed commit 2f236f684f4d76a884d307e03fc3e80a0beb2bed Author: Joe Rafaniello <jrafanie> AuthorDate: Tue Sep 18 16:04:52 2018 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Tue Sep 18 16:04:52 2018 -0400 Establish a new connection instead of reconnect! https://bugzilla.redhat.com/show_bug.cgi?id=1626005 It was found that reconnect! against a postgresql SSL connection could occur while postgresql was stopping or starting, even before the "database system is ready to accept connections." If this is timed correctly, this connection could fail SSL handshaking, and the client could "give" up trying SSL connections and only ever try non-SSL. The workaround until this commit, was to restart the client process such as the worker or evm server. This was evident in the postgresql log where you'd see a client, 172.168.1.99, initiate a SSL connection: 2018-08-30 11:23:52 EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG: connection authorized: user=root database=vmdb_production SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off) Then, sometime later postgresql restarts and this connection timed it "right" and all future connections were attempted with SSL off and fail: 2018-08-30 11:23:52 EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL: no pg_hba.conf entry for host "172.168.1.99", user "root", database "vmdb_production", SSL off Somewhere deep in pg/libpq/openssl, the client code failed to initialize the SSL handshaking code and will continually only attempt non-SSL connections until you restart the process. Until we can dig deep into the pg/libpq/openssl code, this code change forces the re-initialization of the SSL handshaking code in the client by discarding the existing connection and establishing a new one. To test this, I first recreated this solely in an ActiveRecord environment, by restarting postgresql in one ssh session on an appliance and running this in rails console in another: ``` 500.times do begin puts conn.select_value("select count(*) from users;") rescue PG::Error, ActiveRecord::StatementInvalid begin conn.reconnect! rescue retry end end sleep 0.001 end ``` Then, I verified this code change in a full appliance server/worker environment by first recreating this (after many iterations) and then confirming it did not recreate with this change applied: ``` for x in `seq 1 500` do # restart postgresql is too fast, so I needed to first stop it, sleep # a little, then start it, sleep a bit longer, then keep doing that in # a loop, while the workers/servers would try to reconnect!. small_rand=`ruby -e 'puts rand(4)'` rand=`ruby -e 'puts rand(10) + 10'` echo $small_rand echo $rand systemctl stop rh-postgresql95-postgresql sleep $small_rand systemctl start rh-postgresql95-postgresql sleep $rand done ``` app/models/miq_queue_worker_base/runner.rb | 5 +- app/models/miq_server.rb | 4 +- app/models/miq_server/environment_management.rb | 4 +- app/models/miq_server/queue_management.rb | 6 +- 4 files changed, 14 insertions(+), 5 deletions(-) Prasad, if existing processes are failing to connect to postgresql properly, they might not be fully exiting cleanly when you issue the restart of evmserverd. Maybe they have to stop the evmserverd service, verify it's completely gone and start it again. Or perhaps, reboot the appliance if it's not the database appliance.
> 3. still same error reported in the evm.log file. I also tried restarted
evmserverd immediately after restarting postgresql but no difference.
Interesting, we'll have to see the postgresql logs and evm.log for this time period to see what's happening.
If the database.yml is the same for each appliance that is connecting to this database, I'm out of ideas.
I have a pull request that was merged which fixes the issue I was able to recreate based on their prior logs. Basically, we were able to try to reconnect! to the database while it was restarting and if we timed it exactly wrong/right, we'd initiate a db client connection that fails SSL handshaking and will never try to use SSL again. The only workaround would be to restart the process. My fix, changes this code to not use reconnect! but instead abandon the existing connection and establish a new one so there is no recycling of the existing postgresql client in the rails/ruby/pg/libpq/openssl stack. Because this was in a corner case, the overhead of doing this extra work should not be a big deal.
Niladri/Prasad, If they're still seeing this message after restarting postgreql, waiting, then stop/start or restarting evmserverd: 2018-09-17 20:55:55 EDT:::1(38486):5ba04d1b.256a:root@vmdb_production:[9578]:FATAL: no pg_hba.conf entry for host "::1", user "root", database "vmdb_production", SSL off Then, another thing they can try is exporting an environment variable on the system and see if forcing SSL on the client makes any difference: export PGSSLMODE="require" https://www.postgresql.org/docs/9.4/static/libpq-connect.html#LIBPQ-CONNECT-SSLMODE Unfortunately, I believe the SSL attempt fails (since it's preferred by default) and it fails back to non-SSL. By forcing SSL, it will just fail, I think. Hi Niladri,
> Cu has tried to launch a new appliance and join it to the existing environment, it also doesn't show up in the UI.
I don't know what this means. I reviewed the logs for the new appliance and I have no idea what the problem is. If this is a new issue, please open a new BZ.
I'll assume the issue with attempting to use non-SSL postgresql connections is resolved and let this bug be shipped as fixed since it was clearly fixed in my recreation environment. If the customer is not seeing this anymore, we can come back to this when and if it happens again.
Niladri, you describe the exact scenario I fixed 3 weeks ago in https://github.com/ManageIQ/manageiq/pull/18010 In step 1, they started evmserverd In step 3, they stopped postgresql In step 6, they started postgresql and started having clients connect with non-SSL. Nowhere did they restart evmserverd after restarting postgresql. If you hit bug causing clients to use not try SSL and instead fallback to non-SSL on a SSL only postgresql, then you need to restart the bad ruby processes. This is exactly what is fixed in the pull request above. The alternative to the fix above is to restart evmserverd, if they fail to restart due to lack to db connection, kill the evmserver ruby processes and try systemctl start evmserverd. The new connections should then establish SSL connections to progress as described several times above. Verified on 5.10.0.21. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:0212 |