Description of problem:
=======================
When the cluster has more than one monitor, the rolling update hangs in the task 'compress the store as much as possible' on the second monitor node.

Version-Release number of selected component (if applicable):
==============================================================
update from 10.2.2-38.el7cp.x86_64 to 10.2.2-39.el7cp.x86_64

How reproducible:
=================
always

Steps to Reproduce:
===================
1. Create a cluster via ceph-ansible having 3 MON, 3 OSD and 1 RGW node (10.2.2-38.el7cp.x86_64)

[root@magna044 ceph-ansible]# cat /etc/ansible/hosts
[mons]
magna078
magna084
magna085

[osds]
magna090
magna091
magna085

[rgws]
magna094

2. Create a repo file on all nodes which points to the 10.2.2-39.el7cp.x86_64 bits (a sketch is shown under "Additional info" below).
3. Change the value of 'serial:' to adjust the number of servers to be updated at a time.
4. Use rolling_update.yml to update all nodes.

Actual results:
===============
[root@magna044 ceph-ansible]# ansible-playbook rolling_update.yml
Are you sure you want to upgrade the cluster? [no]: yes

PLAY [confirm whether user really meant to upgrade the cluster] ***************

GATHERING FACTS ***************************************************************
ok: [localhost]

TASK: [exit playbook, if user did not mean to upgrade cluster] ****************
skipping: [localhost]

PLAY [mons;osds;mdss;rgws] ****************************************************

GATHERING FACTS ***************************************************************
ok: [magna084]
ok: [magna078]
ok: [magna085]
ok: [magna091]
ok: [magna090]
ok: [magna094]

TASK: [debug msg="gather facts on all Ceph hosts for following reference"] ****
ok: [magna078] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna084] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna085] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna094] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna090] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna091] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}

TASK: [check if sysvinit] *****************************************************
ok: [magna084]
ok: [magna090]
ok: [magna091]
ok: [magna078]
ok: [magna085]
ok: [magna094]

TASK: [check if upstart] ******************************************************
ok: [magna084]
ok: [magna078]
ok: [magna090]
ok: [magna085]
ok: [magna091]
ok: [magna094]

TASK: [check if systemd] ******************************************************
changed: [magna090]
changed: [magna084]
changed: [magna085]
changed: [magna078]
changed: [magna094]
changed: [magna091]

PLAY [mons] *******************************************************************

GATHERING FACTS ***************************************************************
ok: [magna084]
ok: [magna078]
ok: [magna085]

TASK: [compress the store as much as possible] ********************************
changed: [magna078]

(The playbook hangs at this point; the task never completes for the second monitor node.)

Expected results:
=================
It should update all nodes.

Additional info:
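For step 2 above, the exact repo file was not captured in this report; a minimal sketch of what was placed on each node could look like the following. The repo id, file name, and baseurl are placeholders, so substitute the actual location of the 10.2.2-39.el7cp build:

[root@magna044 ~]# cat > /etc/yum.repos.d/ceph-update.repo <<EOF
[ceph-update]
name=Ceph 10.2.2-39 update bits
# placeholder baseurl -- point this at the actual 10.2.2-39.el7cp x86_64 repository
baseurl=http://example.com/ceph/10.2.2-39/el7/x86_64/
enabled=1
gpgcheck=0
EOF

The same file is then copied to every node listed in /etc/ansible/hosts.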
Can you give me the state of the cluster prior to running this? Were all the monitors started? Can you try to run the compact command manually on the monitor nodes?
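For reference, a manual check on each monitor node could look roughly like the following. The systemd unit name ceph-mon@<hostname> assumes a default systemd deployment, and `ceph tell mon.<id> compact` is the same kind of on-demand store compaction the playbook task issues:

[root@magna078 ~]# systemctl status ceph-mon@magna078    # is the monitor actually running?
[root@magna078 ~]# ceph -s                               # cluster health and monitor quorum before upgrading
[root@magna078 ~]# ceph tell mon.magna078 compact        # run the compaction by hand on this monitor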
I do have one question here. AFAIK, all MONs share the same db, so why do we need to compress on all MONs? We could compress on one MON node and it should work fine, right? Please correct me if I am wrong.
It's weird that we don't know the root cause of that; even if the compaction is not needed by the upgrade, I think it's a nice-to-have. I ran the playbook several times, and the only case where the compact command hung was when the monitor was stopped... I can remove the compact command from the playbook anyway.
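One possible workaround, independent of whatever the pull request linked below ends up doing, would be to only issue the compaction when the monitor daemon is actually up; a hypothetical sketch of that guard, not taken from the playbook:

# hypothetical guard: skip the compaction if the local mon daemon is not active
if systemctl is-active --quiet ceph-mon@"$(hostname -s)"; then
    ceph tell mon."$(hostname -s)" compact
else
    echo "ceph-mon not active on $(hostname -s), skipping compact"
fi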
https://github.com/ceph/ceph-ansible/pull/975
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:2082