Bug 1387420

Summary: PG Logical Error: replication set mapping -1931769956:25998 not found (PG::InternalError)
Product: Red Hat CloudForms Management Engine Reporter: Adam Grare <agrare>
Component: ApplianceAssignee: Nick Carboni <ncarboni>
Status: CLOSED CURRENTRELEASE QA Contact: Alex Newman <anewman>
Severity: high Docs Contact:
Priority: high    
Version: 5.6.0CC: abellott, cpelland, jhardy, ncarboni, obarenbo
Target Milestone: GAKeywords: TestOnly
Target Release: 5.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: replication
Fixed In Version: 5.8.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1391713 1404830 (view as bug list) Environment:
Last Closed: 2017-06-12 16:29:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1391713, 1404830    

Description Adam Grare 2016-10-20 20:21:31 UTC
Description of problem:
Errors from pg_logical on worker initialization when syncing config:
/var/www/miq/vmdb/lib/extensions/ar_adapter/ar_pglogical/pglogical_raw.rb:301:in `async_exec': ERROR:  replication set mapping -1931769956:25998 not found (PG::InternalError)
        from /var/www/miq/vmdb/lib/extensions/ar_adapter/ar_pglogical/pglogical_raw.rb:301:in `typed_exec'
        from /var/www/miq/vmdb/lib/extensions/ar_adapter/ar_pglogical/pglogical_raw.rb:281:in `replication_set_remove_table'
        from /var/www/miq/vmdb/lib/miq_pglogical.rb:79:in `block in refresh_excludes'
        from /var/www/miq/vmdb/lib/miq_pglogical.rb:78:in `each' 
        from /var/www/miq/vmdb/lib/miq_pglogical.rb:78:in `refresh_excludes'
        from /var/www/miq/vmdb/lib/vmdb/config/activator.rb:50:in `workers'
        from /var/www/miq/vmdb/lib/vmdb/config/activator.rb:24:in `block in activate'
        from /var/www/miq/vmdb/lib/vmdb/config/activator.rb:20:in `each_key'
        from /var/www/miq/vmdb/lib/vmdb/config/activator.rb:20:in `activate'
        from /var/www/miq/vmdb/lib/vmdb/settings.rb:42:in `activate'
        from /var/www/miq/vmdb/lib/vmdb/settings.rb:20:in `reload!'
        from /var/www/miq/vmdb/app/models/miq_worker/runner.rb:283:in `sync_config'
        from /var/www/miq/vmdb/app/models/miq_worker/runner.rb:63:in `worker_initialization'
        from /var/www/miq/vmdb/app/models/miq_worker/runner.rb:51:in `initialize'
        from /var/www/miq/vmdb/app/models/miq_worker/runner.rb:21:in `new'
        from /var/www/miq/vmdb/app/models/miq_worker/runner.rb:21:in `start_worker'
        from /var/www/miq/vmdb/app/models/miq_worker.rb:347:in `block in start' 


Version-Release number of selected component (if applicable):
5.6.1.2

Comment 4 Nick Carboni 2016-10-28 18:35:51 UTC
This looks like it is being caused by a race condition in MiqPglogical#refresh_excludes

I was able to reproduce this by:
- Creating two appliances pointing to the same database
- Set replication type to "remote" on either appliance
- Remove a table from the excludes on one server (I used log_files)
- Wait for the config to reload on the workers (it may not happen every time)

By taking advantage of a separate issue which causes the replication set to "thrash" when two servers have different excludes each worker that reloads the config will likely have to change the excludes.

It is possible for one worker to read the current replication set, have a different worker remove the table and then have the table be missing when the first one also tries to remove it from the set.

I will look into process synchronization methods to solve this.

Comment 5 Nick Carboni 2016-10-28 18:44:51 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1389821 - Issue referenced in comment 4

Comment 7 CFME Bot 2016-11-01 17:41:08 UTC
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/b89c5ab912994cc26b8aee49546b6b6bd6ef52ee

commit b89c5ab912994cc26b8aee49546b6b6bd6ef52ee
Author:     Jason Frey <fryguy9>
AuthorDate: Mon Oct 31 16:36:51 2016 -0400
Commit:     Oleg Barenboim <chessbyte>
CommitDate: Tue Nov 1 13:37:50 2016 -0400

    Merge pull request #12280 from carbonin/add_logging_to_refresh_excludes
    
    Add logging when the replication set is altered
    (cherry picked from commit 49cfe5798b581595dc91d4d00a9038000c2fd24a)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1387420
    https://bugzilla.redhat.com/show_bug.cgi?id=1386686

 lib/miq_pglogical.rb | 2 ++
 1 file changed, 2 insertions(+)

Comment 8 CFME Bot 2016-11-03 14:31:41 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/320058db5c63a03ebb11e354fb34b983a41ac657

commit 320058db5c63a03ebb11e354fb34b983a41ac657
Author:     Nick Carboni <ncarboni>
AuthorDate: Mon Oct 31 17:02:26 2016 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Mon Oct 31 18:02:23 2016 -0400

    Wrap the contents of MiqPglogical#refresh_excludes in a lock
    
    This will prevent multiple processes from executing this logic
    at the same time which will prevent us from adding a table that
    is already included in the set or removing a table that is not in
    the set.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1387420

 lib/miq_pglogical.rb | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

Comment 10 CFME Bot 2016-12-14 03:21:29 UTC
New commit detected on ManageIQ/manageiq/darga:
https://github.com/ManageIQ/manageiq/commit/7b4b3d961f18a847b9c395e65c0f2575ad108c1a

commit 7b4b3d961f18a847b9c395e65c0f2575ad108c1a
Author:     Nick Carboni <ncarboni>
AuthorDate: Mon Oct 31 17:02:26 2016 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Tue Nov 8 14:57:08 2016 -0500

    Wrap the contents of MiqPglogical#refresh_excludes in a lock
    
    This will prevent multiple processes from executing this logic
    at the same time which will prevent us from adding a table that
    is already included in the set or removing a table that is not in
    the set.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1387420

 lib/miq_pglogical.rb | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)