Bug 1482539

Summary:

Upgrade of Satellite to 6.2.11 error on removal of qpid dat2 directory

Product:

Red Hat Satellite

Reporter:

Chris Roberts <chrobert>

Component:

Installation

Assignee:

Chris Roberts <chrobert>

Status:

CLOSED ERRATA

QA Contact:

Sanket Jagtap <sjagtap>

Severity:

high

Docs Contact:

Priority:

high

Version:

6.2.11

CC:

ajoseph, bbuckingham, bkearney, brubisch, cdonnell, chrobert, egolov, ehelms, fgarciad, gkonda, hmore, inecas, jalviso, ktordeur, mbacovsk, michiel.smit, mmccune, mmithaiw, mverma, pdwyer, peter.vreman, pgervase, pmoravec, pmutha, sghai, shbharad, sjagtap, smane, vijsingh, xdmoon

Target Milestone:

Unspecified

Keywords:

ManyUsersImpacted, PrioBumpGSS, Triaged, UserExperience

Target Release:

Unused

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

katello-installer-base-3.0.0.100-1

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1530694 (view as bug list)

Environment:

Last Closed:

2018-02-05 13:54:34 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
patch	none

Description Chris Roberts 2017-08-17 13:45:40 UTC

Description of problem:

Upgrading to 6.2.11 I get a:

[ERROR 2017-08-11 20:40:22 main] rm: cannot remove ‘/var/lib/qpidd/.qpidd/qls/dat2’: Is a directory

Version-Release number of selected component (if applicable): 6.2.11


How reproducible:


Steps to Reproduce:
1. upgrade to 6.2.11
2. watch the installer log with -v

Actual results:
[ERROR 2017-08-11 20:40:22 main] rm: cannot remove ‘/var/lib/qpidd/.qpidd/qls/dat2’: Is a directory

Expected results:

Remove the directory

Comment 5 Pavel Moravec 2017-08-19 16:14:45 UTC

No dupe (IMHO), but the error is just a symptom of a bigger problem.

What Satellite users usually do and how it goes wrong:

    1) Sat6 including qpidd is fully running

    2) "yum update" upgrades also qpid-cpp-server package - which has a tricky post-install script that restarts qpidd service. Since now, qpidd is already using the new directory structure. That is empty at this moment.

    3) satellite-installer --upgrade migrates data to the new structure (when qpidd is down, but too late). Here the "rm: cannot remove .." error comes from - it is just a side-effect symptom of this bigger problem.

    4) Depending on timing, we:
       - usually end up with 2 journal files instead of 1 for every durable queue (usually this is no problem but I saw a customer unable to start qpidd due to that since both queues had some unique journal sequence ID), and
       - sometimes many queues missing (like 2/3 of queues missing at some customer)


Note, that we can NOT resolve this in a Satellite upgrade step (like in hooks/pre/30-upgrade.rb). Since this requires "yum update" has been already run, so qpidd is already running on the new directory structure.

I see three possible ways how to resolve it:

A) somehow allow (yum) updating only with Satellite services down (or at least qpidd down), and running satellite-installer --upgrade just after it / without starting qpidd between. Could we somehow enforce this?

B) have qpid-cpp-server package without the post-install script / qpidd restart. Elegant but 
- wont resolve use case "yum update; reboot (or katello restart); satellite-installer --upgrade".
- will require the changed postinstall script in qpid-cpp-server practically forever (to allow upgrades 6.2.10 -> any release)

C) Forget on any data migration and re-build the queues and bindings from scratch (possible steps are at the bottom of KCS 3148641). Rationale:
- no upgrade shall be done during (pulp) task running, so all pulp queues (for workers, res.manager, celery and katello-agent) should have been empty; so almost everytime *all* queues will be empty; so no messages in queues lost; we can warn about this at the beginning of the upgrade script
- bypassing any script issue
- this assumes "yum update" is followed by "satellite-installer --upgrade" with ideally as few operations on Satellite as possible - since until the upgrade step is run, queues can be missing with its consequences (tasks can fail).

D) Some other solution?


I personally vote for C) where the particular procedure would be:

(*) running services (at least) qpidd, postgres, foreman-tasks (and httpd?)
(*) stop services tomcat (to let katello_event queue made empty), qdrouterd (dont (re)try pulp.agent.* queues to create) and pulp (dont create pulp queues)
(*) after a while (few seconds should be enough), stop foreman-tasks (just for sure)
(*) stop qpidd
(*) rm -rf /var/lib/qpidd/.qpidd (or /var/lib/qpidd/* for RHEL6)
  - optionally take backup prior this step?
(*) start qpidd (no clients shall connect now, so the next step shall create complete and coherent queues/exchanges/bindings)
(*) follow bottom of KCS 3148641 to rebuild queues/exchanges/bindings
(*) "step migrate qpid directory is complete" \o/

(I owe a beer to anybody who finds a gotcha in above procedure)

Comment 6 Ivan Necas 2017-08-21 06:54:56 UTC

Would it make sense to update the re-create queue procedure to restore all the queues, so that the installer would not need to be run?

Comment 7 Pavel Moravec 2017-08-21 08:07:29 UTC

(In reply to Ivan Necas from comment #6)
> Would it make sense to update the re-create queue procedure to restore all
> the queues, so that the installer would not need to be run?

It makes sense (specifically for the upgrade path from 6.2.10 or older).

The KCS 3148641 might create more queus than required, esp. queues for goferd-less clients. So there are 2 options how to create just required queues:

1) chose just systems with katello-agent package installed

2) identify what queues were there before (and after) the upgrade):

ls /var/lib/qpidd/.qpidd/qls/jrnl /var/lib/qpidd/.qpidd/qls/jrnl2 /var/lib/qpidd/qls/jrnl /var/lib/qpidd/qls/jrnl2 2> /dev/null | sort -u | grep pulp.agent

- jrnl for old dir before upgrade, jrnl2 after upgrade (if e.g. a system registered after yum update but before satellite-installer)
- the other pair of dirs is due to RHEL6/RHEL7 different paths


Moving the BZ to NEW since this required work (for the bigger problem) hasnt been implemented.

Comment 9 michiel.smit 2017-08-22 19:55:36 UTC

- I ran into this problem too. My sequence of events:
# yum update
(this upgraded the OS from 7.3 to 7.4, and updated Satellite packages from 6.2.10 to 6.2.11)
# satellite-installer --scenario satellite --upgrade
# systemctl reboot

- lots of errors; resolved by restarting goferd and puppet on all Content Hosts
- a few days later I decided to use the UI and upgrade a few Content Hosts: UI -> Content -> Erratta -> apply RHBA-2017:2467 (Satellite Tools 6.2.11 Async Release)
this resulted in it being applied successfully but now the UI -> Hosts -> Content Hosts -> host -> Errata - no Errata, host is uptodate, meanwhile it is at RHEL 6.8 and has a few hundred errata to be applied
- so I decided to rerun the "satellite-installer --scenario satellite --upgrade" to see if this would fix the discrepany between the UI and "yum check-update" on the Content Host, but hit this problem
- resolved by following https://access.redhat.com/solutions/3157651

Comment 14 Peter Vreman 2017-09-13 17:40:09 UTC

I have the same issue, but on Capsule, so it is not limited to Satellite Server.
Contrast my the Satellite Server upgrade when well.
- Satellite Server has only itself connected as client
- Satellite Capsule has 3 hosts connected (1 itself and 2 others) as client

Comment 15 Peter Vreman 2017-09-13 17:45:35 UTC

The KB https://access.redhat.com/solutions/3157651 is therefor incomplete, because some of the proposed fixes do not work on Capsules because there is no postgres

Comment 16 Pavel Moravec 2017-09-13 17:53:55 UTC

(In reply to Peter Vreman from comment #15)
> The KB https://access.redhat.com/solutions/3157651 is therefor incomplete,
> because some of the proposed fixes do not work on Capsules because there is
> no postgres

As far as I am aware, qpid-cpp packages of version 0.34 can / should be updated on Satellite only, since only there the memory leak they fix can appear. So there shouldnt be a need to update Capsule to 0.34 now (there will be in 6.3, I think).

qpidd on Capsule has far far less queues, basically only those that pulp requires. And pulp re-creates them after relevant services restart. So even if you update qpid-cpp packages on Capsule to 0.34 and hit some error message, it can be ignored (until it stops installer) since any queue will be recreated automatically.

Technically, to clear some trash, it makes sense to "rm -rf /var/lib/qpidd/* /var/lib/qpidd/.*" before the upgrade (assuming no pulp pending tasks).

Comment 17 Peter Vreman 2017-09-13 18:45:44 UTC

Correct the qpid-cpp are not installed on the Capsule

[crash/LI] root@li-lc-1589:~# rpm -q qpid-cpp
package qpid-cpp is not installed

In the end i ended up with executing, based on https://access.redhat.com/solutions/3157651
=========
katello-service stop
rm -rf /var/lib/qpidd/.qpidd /var/lib/qpidd/*
service qpidd start

qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add exchange topic event --durable
qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add queue katello_event_queue --durable
for key in compliance.created entitlement.created entitlement.deleted pool.created pool.deleted; do
    qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 bind event katello_event_queue $key
done

for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done

katello-service restart
=========

After this the Capsule was working again.
Before the Capsule was not syncing because pulp-manage-db was not run during the upgrade. This resulted for me in broken repos that then also made all yum commands on the Capsule failing.

So the issue is really tricky in repairing once you have self-registered Satellite or Capsules.

Comment 18 Satellite Program 2017-10-14 06:07:26 UTC

Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/20594 has been resolved.

Comment 21 Chris Roberts 2017-11-01 19:41:34 UTC

Created attachment 1346710 [details]
patch

To apply the patch do the following:

Download the patch and move it to the /usr/share/katello-installer-base directory

~~~
# mv 121.patch /usr/share/katello-installer-base
# patch -p1 < 121.patch
~~~

Now complete the upgrade to 6.2.12

~~~
# satellite-installer --scenario satellite --upgrade
~~~

Upgrade Step: upgrade_qpid_paths...
[ INFO 2017-11-01 15:40:34 verbose] Upgrade Step: upgrade_qpid_paths...
[ INFO 2017-11-01 15:40:34 verbose] Qpid directory upgrade is already complete, skipping
Upgrade Step: migrate_pulp...

Comment 27 Chris Roberts 2017-11-17 15:33:15 UTC

*** Bug 1494798 has been marked as a duplicate of this bug. ***

Comment 30 Ivan Necas 2018-01-11 11:25:42 UTC

Ideally both cases should be checked. As far as I remember, this was reproduced when actually stopping the services prior upgrade

Comment 31 Sanket Jagtap 2018-01-15 17:06:49 UTC

Build : Satellite 6.2.14 snap 1


Upgraded 6.1.z to 6.2.14 
katello-service stop
yum update -y
satellite-installer --scenario satellite --upgrade

Upgraded 6.2.z to 6.2.14 
yum update -y 
satellite-installer --scenario satellite --upgrade

No issues were discovered on both upgrade path.

After upgrade also checked the queues 

[root@hp-dl380pgen8-01 ~]# qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b "amqps://localhost:5671" -q
Queues
  queue                                                                                 dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =====================================================================================================================================================================
  00012012-7a98-4d7c-a222-48fe51b7703b:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  0515af28-08e9-4b9c-a9b8-1f35529d3d43:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  0e0304d4-1c72-495b-a68c-0a720f008425:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  14cbd114-af14-4b4e-b726-75cfd41f9140:1.0                                                   Y        Y        0     8      8       0   4.91k    4.91k        1     2
  14cbd114-af14-4b4e-b726-75cfd41f9140:2.0                                                   Y        Y        0     4      4       0   2.55k    2.55k        1     2
  18200480-6416-4ccc-bbcd-e66762e2e425:1.0                                                   Y        Y        0     8      8       0   4.93k    4.93k        1     2
  18200480-6416-4ccc-bbcd-e66762e2e425:2.0                                                   Y        Y        0     4      4       0   2.53k    2.53k        1     2
  18ec18c2-aca2-407c-8ca0-264c054fb558:0.0                                                   Y        Y        0     0      0       0      0        0         1     2
  303bd0e6-3419-4d79-927e-d82a2278ed19:1.0                                                   Y        Y        0     4      4       0   2.42k    2.42k        1     2
  3535f09a-d57e-43dc-b2e3-9fc97d776254:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  3865a63a-936c-4ce6-b02c-4647e71f29bc:1.0                                                   Y        Y        0     4      4       0   2.46k    2.46k        1     2
  3b8d1175-5fdb-4a5d-a2ca-93684c7179cc:1.0                                                   Y        Y        0     4      4       0   2.46k    2.46k        1     2
  3ec15536-255f-40c1-8aef-53d5b1ac1ea1:1.0                                                   Y        Y        0     4      4       0   2.46k    2.46k        1     2
  47c96a84-498d-45af-bb15-88be107a8d15:1.0                                                   Y        Y        0     8      8       0   4.91k    4.91k        1     2
  47c96a84-498d-45af-bb15-88be107a8d15:2.0                                                   Y        Y        0     4      4       0   2.55k    2.55k        1     2
  4fc14b36-e467-4d8d-b649-10dfc9fb705d:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  5b25d318-7130-4c53-beef-ed5d65324370:1.0                                                   Y        Y        0     8      8       0   4.91k    4.91k        1     2
  5b25d318-7130-4c53-beef-ed5d65324370:2.0                                                   Y        Y        0     4      4       0   2.55k    2.55k        1     2
  5dbd2a51-e600-41ec-9525-529e6f8002dd:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  621a5d09-8dff-42b5-8a4c-28c4079fb91f:1.0                                                   Y        Y        0     8      8       0   4.91k    4.91k        1     2
  621a5d09-8dff-42b5-8a4c-28c4079fb91f:2.0                                                   Y        Y        0     4      4       0   2.55k    2.55k        1     2
  633fe998-2a78-49e0-8851-425f44aa8ff8:1.0                                                   Y        Y        0     8      8       0   4.91k    4.91k        1     2
  633fe998-2a78-49e0-8851-425f44aa8ff8:2.0                                                   Y        Y        0     4      4       0   2.55k    2.55k        1     2
  7386dc0a-4a6e-4aa0-9451-4ac33e224fad:1.0                                                   Y        Y        0     4      4       0   2.42k    2.42k        1     2
  7e18abab-3cbb-43c8-a2bd-bc1afcd1e659:1.0                                                   Y        Y        0     0      0       0      0        0         1     2
  808c6a43-e747-43f7-b5a3-12c6ceb2117c:1.0                                                   Y        Y        0     8      8       0   4.91k    4.91k        1     2
  808c6a43-e747-43f7-b5a3-12c6ceb2117c:2.0                                                   Y        Y        0     4      4       0   2.55k    2.55k        1     2
  961ce4f5-925a-40b7-8ab8-499acbbc7c89:1.0                                                   Y        Y        0     4      4       0   2.42k    2.42k        1     2
  b09a58e5-2e3c-4880-a92d-b7864ec94e94:1.0                                                   Y        Y        0     8      8       0   4.91k    4.91k        1     2
  b09a58e5-2e3c-4880-a92d-b7864ec94e94:2.0                                                   Y        Y        0     4      4       0   2.55k    2.55k        1     2
  b2463185-1a5c-49cc-b751-471a60e41c98:1.0                                                   Y        Y        0     8      8       0   4.91k    4.91k        1     2
  b2463185-1a5c-49cc-b751-471a60e41c98:2.0                                                   Y        Y        0     4      4       0   2.55k    2.55k        1     2
  celery                                                                                Y                      0    20     20       0   16.6k    16.6k        8     2
  celeryev.69726f5d-9fe1-452d-9397-a8ce0556b8de                                              Y                 0  2.06k  2.06k      0   1.82m    1.82m        1     2
  d0588b3e-a0c5-4ad8-9975-e0ad96c0daff:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  db3265db-db03-43ce-bee8-6e4f745511bc:1.0                                                   Y        Y        0     5      5       0   2.67k    2.67k        1     2
  dd397b1b-08ca-458b-bc87-ab913b76ac8a:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  de2ed33e-8be3-4545-a732-d499d156b9eb:1.0                                                   Y        Y        0     4      4       0   2.42k    2.42k        1     2
  fce33adc-7ce3-45f9-8ddf-6af84b109d8b:1.0                                                   Y        Y        0     2      2       0    486      486         1     2
  katello_event_queue                                                                   Y                      0     0      0       0      0        0         1     6
  pulp.agent.78fc319c-8993-4c75-965e-e1d151b59287                                       Y                      0     1      1       0    661      661         1     1
  pulp.task                                                                             Y                      0     3      3       0   1.36k    1.36k        3     1
  reserved_resource_worker-0.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-0             Y    Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-1.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-1             Y    Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-2.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-2             Y    Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-3.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-3             Y    Y                 0     6      6       0   6.79k    6.79k        1     2
  reserved_resource_worker-4.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-4             Y    Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-5.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-5             Y    Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-6.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-6             Y    Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-7.pidbox       Y                 0     0      0       0      0        0         1     2
  reserved_resource_worker-7             Y    Y                 0     0      0       0      0        0         1     2
  resource_manager                                                                      Y                      0     3      3       0   4.07k    4.07k        1     2
  resource_manager.pidbox                 Y                 0     0      0       0      0        0         1     2
  resource_manager                       Y    Y                 0     0      0       0      0        0         1     2
[root@hp-dl380pgen8-01 ~]# qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b "amqps://localhost:5671" exchanges
Type      Exchange Name       Attributes
==================================================
direct                        --replicate=none
direct    C.dq                --durable
direct    amq.direct          --durable --replicate=none
fanout    amq.fanout          --durable --replicate=none
headers   amq.match           --durable --replicate=none
topic     amq.topic           --durable --replicate=none
direct    celery              --durable
fanout    celery.pidbox      
topic     celeryev            --durable
topic     event               --durable
direct    qmf.default.direct  --replicate=none
topic     qmf.default.topic   --replicate=none
topic     qpid.management     --replicate=none
direct    resource_manager    --durable

Is there anything else that needs to verified on the box?

Comment 32 Chris Roberts 2018-01-15 17:13:11 UTC

Looks good to me, if you didnt see a message regarding cant remove directory on upgrade then this bug is fixed.

Comment 33 Sanket Jagtap 2018-01-15 17:14:23 UTC

Marking as verified

Comment 36 errata-xmlrpc 2018-02-05 13:54:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0273