Bug 1482539
Summary: Upgrade of Satellite to 6.2.11 error on removal of qpid data directory

| Field | Value |
|---|---|
| Product | Red Hat Satellite |
| Component | Installation |
| Version | 6.2.11 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Chris Roberts <chrobert> |
| Assignee | Chris Roberts <chrobert> |
| QA Contact | Sanket Jagtap <sjagtap> |
| CC | ajoseph, bbuckingham, bkearney, brubisch, cdonnell, chrobert, egolov, ehelms, fgarciad, gkonda, hmore, inecas, jalviso, ktordeur, mbacovsk, michiel.smit, mmccune, mmithaiw, mverma, pdwyer, peter.vreman, pgervase, pmoravec, pmutha, sghai, shbharad, sjagtap, smane, vijsingh, xdmoon |
| Keywords | ManyUsersImpacted, PrioBumpGSS, Triaged, UserExperience |
| Target Milestone | Unspecified |
| Target Release | Unused |
| Fixed In Version | katello-installer-base-3.0.0.100-1 |
|  | 1530694 (view as bug list) |
| Type | Bug |
| Last Closed | 2018-02-05 13:54:34 UTC |
Description (Chris Roberts, 2017-08-17 13:45:40 UTC)
No dupe (IMHO), but the error is just a symptom of a bigger problem. What Satellite users usually do, and how it goes wrong:

1) Sat6, including qpidd, is fully running.
2) "yum update" also upgrades the qpid-cpp-server package, which has a tricky post-install script that restarts the qpidd service. From this point on, qpidd is already using the new directory structure, which is empty at this moment.
3) satellite-installer --upgrade migrates data to the new structure (when qpidd is down, but too late). This is where the "rm: cannot remove .." error comes from; it is just a side-effect symptom of this bigger problem.
4) Depending on timing, we:
   - usually end up with 2 journal files instead of 1 for every durable queue (usually this is no problem, but I saw a customer unable to start qpidd because of that, since both queues had some unique journal sequence ID), and
   - sometimes end up with many queues missing (like 2/3 of queues missing at one customer).

Note that we can NOT resolve this in a Satellite upgrade step (like in hooks/pre/30-upgrade.rb), since that point requires "yum update" to have already been run, so qpidd is already running on the new directory structure.

I see three possible ways to resolve it:

A) Somehow allow (yum) updating only with Satellite services down (or at least qpidd down), and run satellite-installer --upgrade right after it, without starting qpidd in between. Could we somehow enforce this?

B) Ship qpid-cpp-server without the post-install script / qpidd restart. Elegant, but:
   - it won't resolve the use case "yum update; reboot (or katello restart); satellite-installer --upgrade", and
   - it will require the changed post-install script in qpid-cpp-server practically forever (to allow upgrades from 6.2.10 to any release).

C) Forget about any data migration and re-build the queues and bindings from scratch (possible steps are at the bottom of KCS 3148641). Rationale:
   - No upgrade shall be done while a (pulp) task is running, so all pulp queues (for workers, resource manager, celery and katello-agent) should be empty. So almost every time *all* queues will be empty and no messages in queues are lost; we can warn about this at the beginning of the upgrade script.
   - It bypasses any script issue.
   - This assumes "yum update" is followed by "satellite-installer --upgrade" with ideally as few operations on the Satellite as possible, since until the upgrade step is run, queues can be missing, with the consequences that implies (tasks can fail).

D) Some other solution?

I personally vote for C), where the particular procedure would be (a rough shell sketch follows at the end of this comment):

(*) keep services (at least) qpidd, postgres and foreman-tasks (and httpd?) running
(*) stop tomcat (to let the katello_event queue drain empty), qdrouterd (so it does not (re)try to create the pulp.agent.* queues) and pulp (so it does not create the pulp queues)
(*) after a while (a few seconds should be enough), stop foreman-tasks (just to be sure)
(*) stop qpidd
(*) rm -rf /var/lib/qpidd/.qpidd (or /var/lib/qpidd/* on RHEL 6); optionally take a backup prior to this step?
(*) start qpidd (no clients shall connect now, so the next step shall create complete and coherent queues/exchanges/bindings)
(*) follow the bottom of KCS 3148641 to rebuild the queues/exchanges/bindings
(*) "step migrate qpid directory is complete" \o/

(I owe a beer to anybody who finds a gotcha in the above procedure.)
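A minimal shell sketch of the option C procedure above, assuming RHEL 7 service names and reusing the qpid-config rebuild commands from KCS 3148641 that are quoted later in this bug. Treat it as an illustration of the idea, not the finished upgrade step:

~~~
#!/bin/bash
# Sketch of option C: rebuild queues/bindings from scratch instead of
# migrating qpid data. Assumes no pulp tasks are running and RHEL 7 paths.

# Keep qpidd, postgres and foreman-tasks up; stop the queue producers/consumers.
systemctl stop tomcat qdrouterd pulp_workers pulp_resource_manager pulp_celerybeat
sleep 10                          # let katello_event_queue drain
systemctl stop foreman-tasks

# Stop qpidd and wipe the stale data directory (optional backup first).
systemctl stop qpidd
cp -a /var/lib/qpidd /var/lib/qpidd.bak
rm -rf /var/lib/qpidd/.qpidd      # /var/lib/qpidd/* on RHEL 6

# Restart qpidd with no clients connected, then rebuild the exchange, queue
# and bindings with the KCS 3148641 commands quoted below in this bug.
systemctl start qpidd
QC="qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671"
$QC add exchange topic event --durable
$QC add queue katello_event_queue --durable
for key in compliance.created entitlement.created entitlement.deleted pool.created pool.deleted; do
    $QC bind event katello_event_queue $key
done

katello-service restart           # pulp re-creates its own queues on startup
~~~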
(In reply to Ivan Necas from comment #6)
> Would it make sense to update the re-create queue procedure to restore all
> the queues, so that the installer would not need to be run?

It makes sense (specifically for the upgrade path from 6.2.10 or older). KCS 3148641 might create more queues than required, especially queues for goferd-less clients. So there are 2 options for creating just the required queues:

1) choose just the systems that have the katello-agent package installed, or
2) identify which queues were there before (and after) the upgrade (a loop sketching this follows below, after these comments):

~~~
ls /var/lib/qpidd/.qpidd/qls/jrnl /var/lib/qpidd/.qpidd/qls/jrnl2 \
   /var/lib/qpidd/qls/jrnl /var/lib/qpidd/qls/jrnl2 2> /dev/null | sort -u | grep pulp.agent
~~~

- jrnl is the old directory from before the upgrade, jrnl2 the one from after it (in case e.g. a system registered after "yum update" but before satellite-installer)
- the other pair of directories covers the different RHEL 6 / RHEL 7 paths

Moving the BZ to NEW, since the work this requires (for the bigger problem) hasn't been implemented.

I ran into this problem too. My sequence of events:

~~~
# yum update        (upgraded the OS from 7.3 to 7.4 and Satellite packages from 6.2.10 to 6.2.11)
# satellite-installer --scenario satellite --upgrade
# systemctl reboot
~~~

- Lots of errors; resolved by restarting goferd and puppet on all Content Hosts.
- A few days later I decided to use the UI to upgrade a few Content Hosts: UI -> Content -> Errata -> apply RHBA-2017:2467 (Satellite Tools 6.2.11 Async Release). The erratum was applied successfully, but now UI -> Hosts -> Content Hosts -> host -> Errata shows no errata and claims the host is up to date, while it is at RHEL 6.8 and has a few hundred errata left to apply.
- So I decided to rerun "satellite-installer --scenario satellite --upgrade" to see whether this would fix the discrepancy between the UI and "yum check-update" on the Content Host, but hit this problem. Resolved by following https://access.redhat.com/solutions/3157651.

I have the same issue, but on a Capsule, so it is not limited to the Satellite Server. In contrast, my Satellite Server upgrade went well.

- The Satellite Server has only itself connected as a client.
- The Satellite Capsule has 3 hosts connected as clients (itself and 2 others).

The KB https://access.redhat.com/solutions/3157651 is therefore incomplete, because some of the proposed fixes do not work on Capsules, where there is no postgres.

(In reply to Peter Vreman from comment #15)
> The KB https://access.redhat.com/solutions/3157651 is therefore incomplete,
> because some of the proposed fixes do not work on Capsules, where there is
> no postgres

As far as I am aware, the qpid-cpp packages of version 0.34 can / should be updated on the Satellite only, since only there can the memory leak they fix show up. So there shouldn't be a need to update a Capsule to 0.34 now (there will be in 6.3, I think).

qpidd on a Capsule has far fewer queues, basically only those that pulp requires, and pulp re-creates them after the relevant services restart. So even if you update the qpid-cpp packages on a Capsule to 0.34 and hit some error message, it can be ignored (unless it stops the installer), since any queue will be recreated automatically. Technically, to clear some trash, it makes sense to "rm -rf /var/lib/qpidd/* /var/lib/qpidd/.*" before the upgrade (assuming no pending pulp tasks).
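A hedged sketch of option 2 from the reply above: capture the pulp.agent.* queue names from the journal directories, then re-create each one as a durable queue with the qpid-config invocation used elsewhere in this bug. Any additional bindings KCS 3148641 calls for are omitted here:

~~~
# Sketch only: capture the queue list BEFORE wiping /var/lib/qpidd
# (or run the ls against the backup taken earlier).
queues=$(ls /var/lib/qpidd/.qpidd/qls/jrnl /var/lib/qpidd/.qpidd/qls/jrnl2 \
            /var/lib/qpidd/qls/jrnl /var/lib/qpidd/qls/jrnl2 2>/dev/null \
         | sort -u | grep pulp.agent)

# ... wipe the data directory and restart qpidd as in the earlier sketch ...

QC="qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671"
for q in $queues; do
    $QC add queue "$q" --durable    # plus any bindings per KCS 3148641
done
~~~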
Correct, qpid-cpp is not installed on the Capsule:

~~~
[crash/LI] root@li-lc-1589:~# rpm -q qpid-cpp
package qpid-cpp is not installed
~~~

In the end I ended up executing the following, based on https://access.redhat.com/solutions/3157651:

~~~
katello-service stop
rm -rf /var/lib/qpidd/.qpidd /var/lib/qpidd/*
service qpidd start
qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add exchange topic event --durable
qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add queue katello_event_queue --durable
for key in compliance.created entitlement.created entitlement.deleted pool.created pool.deleted; do
    qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 bind event katello_event_queue $key
done
for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done
katello-service restart
~~~

After this the Capsule was working again. Before that, the Capsule was not syncing because pulp-manage-db was not run during the upgrade (a sketch of that step follows below). That left me with broken repos, which in turn made all yum commands on the Capsule fail. So the issue is really tricky to repair once you have a self-registered Satellite or Capsules.

Moving this bug to POST for triage into Satellite 6, since the upstream issue http://projects.theforeman.org/issues/20594 has been resolved.
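For the pulp-manage-db step referenced above, a sketch assuming the standard Pulp 2 invocation (not taken verbatim from this bug); mongod must be running and the pulp services stopped:

~~~
# Assumption: standard Pulp 2 migration invocation, run as the apache user.
for i in pulp_workers pulp_resource_manager pulp_celerybeat; do service $i stop; done
sudo -u apache pulp-manage-db     # runs the pulp DB migrations the upgrade skipped
for i in pulp_workers pulp_resource_manager pulp_celerybeat; do service $i start; done
~~~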
Created attachment 1346710 [details]
patch
To apply the patch, do the following:
Download the patch and move it to the /usr/share/katello-installer-base directory
~~~
# mv 121.patch /usr/share/katello-installer-base
# patch -p1 < 121.patch
~~~
Now complete the upgrade to 6.2.12
~~~
# satellite-installer --scenario satellite --upgrade
~~~
~~~
Upgrade Step: upgrade_qpid_paths...
[ INFO 2017-11-01 15:40:34 verbose] Upgrade Step: upgrade_qpid_paths...
[ INFO 2017-11-01 15:40:34 verbose] Qpid directory upgrade is already complete, skipping
Upgrade Step: migrate_pulp...
~~~
*** Bug 1494798 has been marked as a duplicate of this bug. ***

Ideally both cases should be checked. As far as I remember, this was reproduced when actually stopping the services prior to the upgrade.

Build: Satellite 6.2.14 snap 1

Upgraded 6.1.z to 6.2.14:

~~~
katello-service stop
yum update -y
satellite-installer --scenario satellite --upgrade
~~~

Upgraded 6.2.z to 6.2.14:

~~~
yum update -y
satellite-installer --scenario satellite --upgrade
~~~

No issues were discovered on either upgrade path. After the upgrade I also checked the queues:

~~~
[root@hp-dl380pgen8-01 ~]# qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b "amqps://localhost:5671" -q
Queues
  queue                                            dur  autoDel  excl  msg    msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =================================================================================================================================
  00012012-7a98-4d7c-a222-48fe51b7703b:1.0         Y    Y              0      2      2       0      486      486       1     2
  0515af28-08e9-4b9c-a9b8-1f35529d3d43:1.0         Y    Y              0      2      2       0      486      486       1     2
  0e0304d4-1c72-495b-a68c-0a720f008425:1.0         Y    Y              0      2      2       0      486      486       1     2
  14cbd114-af14-4b4e-b726-75cfd41f9140:1.0         Y    Y              0      8      8       0      4.91k    4.91k     1     2
  14cbd114-af14-4b4e-b726-75cfd41f9140:2.0         Y    Y              0      4      4       0      2.55k    2.55k     1     2
  18200480-6416-4ccc-bbcd-e66762e2e425:1.0         Y    Y              0      8      8       0      4.93k    4.93k     1     2
  18200480-6416-4ccc-bbcd-e66762e2e425:2.0         Y    Y              0      4      4       0      2.53k    2.53k     1     2
  18ec18c2-aca2-407c-8ca0-264c054fb558:0.0         Y    Y              0      0      0       0      0        0         1     2
  303bd0e6-3419-4d79-927e-d82a2278ed19:1.0         Y    Y              0      4      4       0      2.42k    2.42k     1     2
  3535f09a-d57e-43dc-b2e3-9fc97d776254:1.0         Y    Y              0      2      2       0      486      486       1     2
  3865a63a-936c-4ce6-b02c-4647e71f29bc:1.0         Y    Y              0      4      4       0      2.46k    2.46k     1     2
  3b8d1175-5fdb-4a5d-a2ca-93684c7179cc:1.0         Y    Y              0      4      4       0      2.46k    2.46k     1     2
  3ec15536-255f-40c1-8aef-53d5b1ac1ea1:1.0         Y    Y              0      4      4       0      2.46k    2.46k     1     2
  47c96a84-498d-45af-bb15-88be107a8d15:1.0         Y    Y              0      8      8       0      4.91k    4.91k     1     2
  47c96a84-498d-45af-bb15-88be107a8d15:2.0         Y    Y              0      4      4       0      2.55k    2.55k     1     2
  4fc14b36-e467-4d8d-b649-10dfc9fb705d:1.0         Y    Y              0      2      2       0      486      486       1     2
  5b25d318-7130-4c53-beef-ed5d65324370:1.0         Y    Y              0      8      8       0      4.91k    4.91k     1     2
  5b25d318-7130-4c53-beef-ed5d65324370:2.0         Y    Y              0      4      4       0      2.55k    2.55k     1     2
  5dbd2a51-e600-41ec-9525-529e6f8002dd:1.0         Y    Y              0      2      2       0      486      486       1     2
  621a5d09-8dff-42b5-8a4c-28c4079fb91f:1.0         Y    Y              0      8      8       0      4.91k    4.91k     1     2
  621a5d09-8dff-42b5-8a4c-28c4079fb91f:2.0         Y    Y              0      4      4       0      2.55k    2.55k     1     2
  633fe998-2a78-49e0-8851-425f44aa8ff8:1.0         Y    Y              0      8      8       0      4.91k    4.91k     1     2
  633fe998-2a78-49e0-8851-425f44aa8ff8:2.0         Y    Y              0      4      4       0      2.55k    2.55k     1     2
  7386dc0a-4a6e-4aa0-9451-4ac33e224fad:1.0         Y    Y              0      4      4       0      2.42k    2.42k     1     2
  7e18abab-3cbb-43c8-a2bd-bc1afcd1e659:1.0         Y    Y              0      0      0       0      0        0         1     2
  808c6a43-e747-43f7-b5a3-12c6ceb2117c:1.0         Y    Y              0      8      8       0      4.91k    4.91k     1     2
  808c6a43-e747-43f7-b5a3-12c6ceb2117c:2.0         Y    Y              0      4      4       0      2.55k    2.55k     1     2
  961ce4f5-925a-40b7-8ab8-499acbbc7c89:1.0         Y    Y              0      4      4       0      2.42k    2.42k     1     2
  b09a58e5-2e3c-4880-a92d-b7864ec94e94:1.0         Y    Y              0      8      8       0      4.91k    4.91k     1     2
  b09a58e5-2e3c-4880-a92d-b7864ec94e94:2.0         Y    Y              0      4      4       0      2.55k    2.55k     1     2
  b2463185-1a5c-49cc-b751-471a60e41c98:1.0         Y    Y              0      8      8       0      4.91k    4.91k     1     2
  b2463185-1a5c-49cc-b751-471a60e41c98:2.0         Y    Y              0      4      4       0      2.55k    2.55k     1     2
  celery                                           Y                   0      20     20      0      16.6k    16.6k     8     2
  celeryev.69726f5d-9fe1-452d-9397-a8ce0556b8de    Y                   0      2.06k  2.06k   0      1.82m    1.82m     1     2
  d0588b3e-a0c5-4ad8-9975-e0ad96c0daff:1.0         Y    Y              0      2      2       0      486      486       1     2
  db3265db-db03-43ce-bee8-6e4f745511bc:1.0         Y    Y              0      5      5       0      2.67k    2.67k     1     2
  dd397b1b-08ca-458b-bc87-ab913b76ac8a:1.0         Y    Y              0      2      2       0      486      486       1     2
  de2ed33e-8be3-4545-a732-d499d156b9eb:1.0         Y    Y              0      4      4       0      2.42k    2.42k     1     2
  fce33adc-7ce3-45f9-8ddf-6af84b109d8b:1.0         Y    Y              0      2      2       0      486      486       1     2
  katello_event_queue                              Y                   0      0      0       0      0        0         1     6
  pulp.agent.78fc319c-8993-4c75-965e-e1d151b59287  Y                   0      1      1       0      661      661       1     1
  pulp.task                                        Y                   0      3      3       0      1.36k    1.36k     3     1
  reserved_resource_worker-0.pidbox                Y                   0      0      0       0      0        0         1     2
  reserved_resource_worker-0                       Y    Y              0      0      0       0      0        0         1     2
  reserved_resource_worker-1.pidbox                Y                   0      0      0       0      0        0         1     2
  reserved_resource_worker-1                       Y    Y              0      0      0       0      0        0         1     2
  reserved_resource_worker-2.pidbox                Y                   0      0      0       0      0        0         1     2
  reserved_resource_worker-2                       Y    Y              0      0      0       0      0        0         1     2
  reserved_resource_worker-3.pidbox                Y                   0      0      0       0      0        0         1     2
  reserved_resource_worker-3                       Y    Y              0      6      6       0      6.79k    6.79k     1     2
  reserved_resource_worker-4.pidbox                Y                   0      0      0       0      0        0         1     2
  reserved_resource_worker-4                       Y    Y              0      0      0       0      0        0         1     2
  reserved_resource_worker-5.pidbox                Y                   0      0      0       0      0        0         1     2
  reserved_resource_worker-5                       Y    Y              0      0      0       0      0        0         1     2
  reserved_resource_worker-6.pidbox                Y                   0      0      0       0      0        0         1     2
  reserved_resource_worker-6                       Y    Y              0      0      0       0      0        0         1     2
  reserved_resource_worker-7.pidbox                Y                   0      0      0       0      0        0         1     2
  reserved_resource_worker-7                       Y    Y              0      0      0       0      0        0         1     2
  resource_manager                                 Y                   0      3      3       0      4.07k    4.07k     1     2
  resource_manager.pidbox                          Y                   0      0      0       0      0        0         1     2
  resource_manager                                 Y    Y              0      0      0       0      0        0         1     2

[root@hp-dl380pgen8-01 ~]# qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b "amqps://localhost:5671" exchanges
Type      Exchange Name        Attributes
==================================================
direct                         --replicate=none
direct    C.dq                 --durable
direct    amq.direct           --durable --replicate=none
fanout    amq.fanout           --durable --replicate=none
headers   amq.match            --durable --replicate=none
topic     amq.topic            --durable --replicate=none
direct    celery               --durable
fanout    celery.pidbox
topic     celeryev             --durable
topic     event                --durable
direct    qmf.default.direct   --replicate=none
topic     qmf.default.topic    --replicate=none
topic     qpid.management      --replicate=none
direct    resource_manager     --durable
~~~

Is there anything else that needs to be verified on the box?

Looks good to me. If you didn't see a message about not being able to remove the directory on upgrade, then this bug is fixed. Marking as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0273