Bug 1574995

Summary: [UPGRADES] Error during ceph upgrade: Error EINVAL: bad entity name
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Yurii Prokulevych <yprokule>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED CURRENTRELEASE
QA Contact: Vasishta <vashastr>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.1
CC: adeza, aschoen, augol, ccamacho, ceph-eng-bugs, gabrioux, gfidente, gmeno, johfulto, jstransk, kdreyer, nthomas, sankarshan, scohen, shan, yprokule, yrabl
Target Milestone: rc   
Target Release: 3.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.rc3.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1578282
Environment:
Last Closed: 2019-08-27 05:11:55 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1548353    
Attachments:
Description Flags
osp12.inventory.yaml none

Description Yurii Prokulevych 2018-05-04 13:33:58 UTC
Description of problem:
-----------------------
Ceph upgrade failed:

openstack overcloud ceph-upgrade run \
    --templates /usr/share/openstack-tripleo-heat-templates \
    -e /home/stack/composable_roles/roles/nodes.yaml \
    -e /home/stack/composable_roles/internal.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    -e /home/stack/composable_roles/network/network-environment.yaml \
    -e /home/stack/composable_roles/enable-tls.yaml \
    -e /home/stack/composable_roles/inject-trust-anchor.yaml \
    -e /home/stack/composable_roles/public_vip.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
    -e /home/stack/composable_roles/hostnames.yaml \
    -e /home/stack/composable_roles/debug.yaml \
    -e /home/stack/composable_roles/config_heat.yaml \
    -e /home/stack/composable_roles/docker-images.yaml \
    --container-registry-file /home/stack/composable_roles/docker-images.yaml \
    --roles-file /home/stack/composable_roles/roles/roles_data.yaml 2>&1
...
2018-05-04 11:54:30Z [overcloud-AllNodesDeploySteps-zzils2sbx6gj.WorkflowTasks_Step2]: UPDATE_COMPLETE  state changed
2018-05-04 11:54:31Z [overcloud-AllNodesDeploySteps-zzils2sbx6gj.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS  state changed
2018-05-04 12:00:15Z [overcloud-AllNodesDeploySteps-zzils2sbx6gj.WorkflowTasks_Step2_Execution]: CREATE_FAILED  resources.WorkflowTasks_Step2_Execution: ERROR
2018-05-04 12:00:15Z [overcloud-AllNodesDeploySteps-zzils2sbx6gj]: UPDATE_FAILED  Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2018-05-04 12:00:16Z [AllNodesDeploySteps]: UPDATE_FAILED  resources.AllNodesDeploySteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2018-05-04 12:00:16Z [overcloud]: UPDATE_FAILED  Resource UPDATE failed: resources.AllNodesDeploySteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR

 Stack overcloud UPDATE_FAILED 

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: f17ff6ba-cd6c-4108-a259-6ec36b0289a0
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-8.0.2-9.el7ost.noarch
puppet-ceph-2.5.0-1.el7ost.noarch
ceph-ansible-3.1.0-0.1.beta8.el7cp.noarch


Steps to Reproduce:
-------------------
1. Upgrade the undercloud (UC) from 12 to 13
2. Run the overcloud (oc) upgrade prepare
3. Upgrade all roles
4. Start the ceph upgrade

Actual results:
---------------
Ceph upgrade failed

Expected results:
-----------------
Ceph is upgraded

Comment 6 Giulio Fidente 2018-05-07 10:53:43 UTC
Created attachment 1432569 [details]
osp12.inventory.yaml

Attaching the inventory used for the initial OSP12 deployment. Cmdline was:

ansible-playbook -vv /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actioncHuWZ0/inventory.yaml --private-key /tmp/ansible-mistral-actioncHuWZ0/ssh_private_key --skip-tags package-install,with_pkg

Comment 7 John Fulton 2018-05-08 19:09:50 UTC
Why is the following failing to create the MGR keys?

https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mon/tasks/docker/main.yml#L97

2018-05-04 08:00:09,002 p=15296 u=mistral |  TASK [ceph-mon : create ceph mgr keyring(s) when mon is containerized] *********
2018-05-04 08:00:09,002 p=15296 u=mistral |  task path: /usr/share/ceph-ansible/roles/ceph-mon/tasks/docker/main.yml:97
2018-05-04 08:00:09,002 p=15296 u=mistral |  Friday 04 May 2018  08:00:09 -0400 (0:00:00.040)       0:04:13.446 ************ 
2018-05-04 08:00:09,063 p=15296 u=mistral |   [WARNING]: when statements should not include jinja2 templating delimiters
such as {{ }} or {% %}. Found: {{ groups.get(mgr_group_name, []) | length > 0
}}

2018-05-04 08:00:09,666 p=15296 u=mistral |  failed: [192.168.24.6] (item=192.168.24.9) => {"changed": false, "cmd": ["docker", "exec", "ceph-mon-controller-2", "ceph", "--cluster", "ceph", "auth", "get-or-create", "mgr.controller-0", "mon", "allow profile mgr", "osd", "allow *", "mds", "allow *", "-o", "/etc/ceph/ceph.mgr.controller-0.keyring"], "delta": "0:00:00.313477", "end": "2018-05-04 12:00:09.763679", "item": "192.168.24.9", "msg": "non-zero return code", "rc": 22, "start": "2018-05-04 12:00:09.450202", "stderr": "Error EINVAL: bad entity name", "stderr_lines": ["Error EINVAL: bad entity name"], "stdout": "", "stdout_lines": []}
2018-05-04 08:00:10,273 p=15296 u=mistral |  failed: [192.168.24.6] (item=192.168.24.14) => {"changed": false, "cmd": ["docker", "exec", "ceph-mon-controller-2", "ceph", "--cluster", "ceph", "auth", "get-or-create", "mgr.controller-1", "mon", "allow profile mgr", "osd", "allow *", "mds", "allow *", "-o", "/etc/ceph/ceph.mgr.controller-1.keyring"], "delta": "0:00:00.299650", "end": "2018-05-04 12:00:10.370583", "item": "192.168.24.14", "msg": "non-zero return code", "rc": 22, "start": "2018-05-04 12:00:10.070933", "stderr": "Error EINVAL: bad entity name", "stderr_lines": ["Error EINVAL: bad entity name"], "stdout": "", "stdout_lines": []}
2018-05-04 08:00:11,087 p=15296 u=mistral |  failed: [192.168.24.6] (item=192.168.24.6) => {"changed": false, "cmd": ["docker", "exec", "ceph-mon-controller-2", "ceph", "--cluster", "ceph", "auth", "get-or-create", "mgr.controller-2", "mon", "allow profile mgr", "osd", "allow *", "mds", "allow *", "-o", "/etc/ceph/ceph.mgr.controller-2.keyring"], "delta": "0:00:00.356980", "end": "2018-05-04 12:00:11.185707", "item": "192.168.24.6", "msg": "non-zero return code", "rc": 22, "start": "2018-05-04 12:00:10.828727", "stderr": "Error EINVAL: bad entity name", "stderr_lines": ["Error EINVAL: bad entity name"], "stdout": "", "stdout_lines": []}
2018-05-04 08:00:11,090 p=15296 u=mistral |  RUNNING HANDLER [ceph-defaults : set _mon_handler_called before restart] *******
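
The three failed items above share the same shape, so the relevant fields can be pulled out mechanically. Below is a minimal Python sketch, assuming the JSON payload after `=>` is copied from a log line as-is. The helper name `summarize_failure` is hypothetical, and the sample is a shortened copy of the first failed item. The return code 22 matches the numeric value of EINVAL on Linux, which agrees with the stderr text.

```python
import errno
import json

# Shortened copy of the first failed item from the log above.
SAMPLE = '''{"changed": false, "cmd": ["docker", "exec", "ceph-mon-controller-2", "ceph", "--cluster", "ceph", "auth", "get-or-create", "mgr.controller-0", "mon", "allow profile mgr", "osd", "allow *", "mds", "allow *", "-o", "/etc/ceph/ceph.mgr.controller-0.keyring"], "item": "192.168.24.9", "msg": "non-zero return code", "rc": 22, "stderr": "Error EINVAL: bad entity name"}'''


def summarize_failure(payload: str) -> dict:
    """Extract the container, entity, and error from one failed Ansible item."""
    result = json.loads(payload)
    cmd = result["cmd"]
    return {
        # The container name follows "exec"; the entity follows "get-or-create".
        "container": cmd[cmd.index("exec") + 1],
        "entity": cmd[cmd.index("get-or-create") + 1],
        "rc": result["rc"],
        # errno.errorcode maps the numeric code to its symbolic name.
        "errno_name": errno.errorcode.get(result["rc"], "unknown"),
        "stderr": result["stderr"],
    }


summary = summarize_failure(SAMPLE)
print(summary)
```

Run against all three items, this shows that every call went through the same mon container (ceph-mon-controller-2), which is the detail the later analysis relies on.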

Comment 8 John Fulton 2018-05-08 22:56:50 UTC
controller0 (192.168.24.6) was running the jewel container, while controller{1,2} were running the luminous container [1]. Running the command that creates the mgr key against a jewel container returns "Error EINVAL: bad entity name" [2], which caused the upgrade to fail.

This task should not have been run on controller0 until, as a result of the upgrade, it was running a luminous container.

footnotes:

[1] 
[fultonj@skagra bz1574995]$ grep ceph control*-sos/sosreport-ceph-upgrade-fail-controller-*/sos_commands/docker/docker_ps
control0-sos/sosreport-ceph-upgrade-fail-controller-0-20180507054355/sos_commands/docker/docker_ps:b39cc32df6b4        192.168.24.1:8787/rhceph:2.5-3                                               "/entrypoint.sh"         2 days ago          Up 2 days                                 ceph-mon-controller-0
control1-sos/sosreport-ceph-upgrade-fail-controller-1-20180507055109/sos_commands/docker/docker_ps:a5e2531f3589        192.168.24.1:8787/rhceph:3-6                                                 "/entrypoint.sh"         2 days ago          Up 2 days                                 ceph-mon-controller-1
control2-sos/sosreport-ceph-upgrade-fail-controller-2-20180507055837/sos_commands/docker/docker_ps:14272b012166        192.168.24.1:8787/rhceph:3-6                                                 "/entrypoint.sh"         2 days ago          Up 2 days                                 ceph-mon-controller-2
[fultonj@skagra bz1574995]$ 

[2]
[root@controller-0 ~]# docker ps | grep ceph 
a8bac73cc1b9        192.168.24.1:8787/rhceph:2.5-3                                               "/entrypoint.sh"         31 hours ago        Up 31 hours                                   ceph-mon-controller-0
[root@controller-0 ~]# docker exec ceph-mon-controller-0 ceph --cluster ceph auth get-or-create mgr.controller-0 mon allow profile mgr osd allow * mds allow * -o /etc/ceph/ceph.mgr.controller-0.keyring 
Error EINVAL: bad entity name
[root@controller-0 ~]#
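
The condition described in this comment can be sketched in Python: parse a `docker ps` line and refuse to create mgr keys while the mon container is still on a jewel image. The mapping of `rhceph:2.x` tags to jewel and `rhceph:3.x` tags to luminous is taken from the images in footnote [1]. The function names are hypothetical, and this is not how ceph-ansible implements the check.

```python
import re

# Matches the image column of `docker ps`, e.g. "192.168.24.1:8787/rhceph:2.5-3".
IMAGE_RE = re.compile(r"/rhceph:(\d+)")


def ceph_release(docker_ps_line: str) -> str:
    """Map the major version of an rhceph image tag to a Ceph release name."""
    match = IMAGE_RE.search(docker_ps_line)
    if match is None:
        raise ValueError("no rhceph image found in line")
    major = int(match.group(1))
    # Assumption taken from this bug: rhceph 2.x is jewel, 3.x is luminous.
    return {2: "jewel", 3: "luminous"}.get(major, "unknown")


def can_create_mgr_key(docker_ps_line: str) -> bool:
    """Per this bug, mgr entities are only accepted once the mon runs luminous."""
    return ceph_release(docker_ps_line) == "luminous"


# Abbreviated `docker ps` lines based on footnote [1].
jewel_line = 'b39cc32df6b4  192.168.24.1:8787/rhceph:2.5-3  "/entrypoint.sh"  ceph-mon-controller-0'
luminous_line = 'a5e2531f3589  192.168.24.1:8787/rhceph:3-6  "/entrypoint.sh"  ceph-mon-controller-1'

print(can_create_mgr_key(jewel_line))
print(can_create_mgr_key(luminous_line))
```

Any image tag outside the two known major versions is treated as "unknown" and therefore blocks key creation, which is the conservative choice for a sketch like this.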

Comment 9 Guillaume Abrioux 2018-05-09 09:36:51 UTC
Yurii and I tried to reproduce this bug on a new env, and the update completed successfully.

The ceph-ansible version was still the same, though.
I noticed that the mgr containers were created correctly, as were the mgr keyrings, and then the upgrade completed.

It seems some patches were applied manually before the upgrade, but not in ceph-ansible itself. Yurii, could you give an update on this point?

By the way, I've tried several times to reproduce this issue on another env without the OSP layer, using the same versions of ceph-ansible (from v3.0.27 with jewel container images to v3.1.0beta8 with luminous container images), and the upgrade worked fine on every attempt.

Comment 10 Guillaume Abrioux 2018-05-09 12:57:58 UTC
I've analyzed the rolling_update.yml playbook log on an env that got the failure.

The workflow is the following:

1/ it pulls the new image here: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L120 in ceph-docker-common
2/ still in ceph-docker-common, there are tasks that check whether a new image has been pulled; if so, they notify handlers. The handlers are triggered after all roles have finished. (The handlers are in the ceph-defaults role.)
3/ it keeps going and plays ceph-config and then ceph-mon: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L121 and https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L122
4/ all roles have been played, so it's time to run the handlers; it seems that the containers are restarted at that point.

Monitors are deployed using serial: 1, which means all of this is done one node at a time. When running on the last mon, since the 'ceph-mon' role is played before the handlers are triggered, the mon is still running jewel when the task for mgr keyring creation is called.
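
The ordering problem can be illustrated with a small Python simulation. It is only a model of the workflow described in this comment (image pull, then roles, then handlers, for a single mon). The class and function names are hypothetical, nothing here calls Ansible, and the `flush_handlers_before_mon_role` switch is a comparison case, not necessarily what the upstream patch does.

```python
from dataclasses import dataclass, field


@dataclass
class Mon:
    name: str
    release: str = "jewel"   # release of the currently running container
    pulled: bool = False     # True once the new image has been pulled
    events: list = field(default_factory=list)


def create_mgr_keyring(mon: Mon) -> None:
    # Per this bug, a jewel mon rejects mgr.* entities with EINVAL.
    if mon.release != "luminous":
        mon.events.append("Error EINVAL: bad entity name")
    else:
        mon.events.append("mgr keyring created")


def restart_handler(mon: Mon) -> None:
    # The restart is what switches the running container to the new image.
    if mon.pulled:
        mon.release = "luminous"


def play_mon(mon: Mon, flush_handlers_before_mon_role: bool) -> None:
    mon.pulled = True                    # 1/ ceph-docker-common pulls the image
    if flush_handlers_before_mon_role:   # hypothetical reordering for comparison
        restart_handler(mon)
    create_mgr_keyring(mon)              # 3/ the ceph-mon role runs
    restart_handler(mon)                 # 4/ handlers run after all roles


failing = Mon("controller-2")
play_mon(failing, flush_handlers_before_mon_role=False)

working = Mon("controller-2")
play_mon(working, flush_handlers_before_mon_role=True)

print(failing.events, failing.release)
print(working.events, working.release)
```

In the failing case the mon ends up on luminous anyway, because the handler still runs at the end. That is consistent with the key creation succeeding once the container has actually been restarted.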

Comment 11 Guillaume Abrioux 2018-05-09 17:49:48 UTC
The patch has been merged upstream; it will be in v3.1.0rc2.

https://github.com/ceph/ceph-ansible/releases/tag/v3.1.0rc2

Comment 15 Sébastien Han 2018-05-14 11:47:31 UTC
Yes, running Ansible a second time will fix the issue.

Comment 17 Sébastien Han 2018-05-15 09:20:28 UTC
Please list the content of /tmp/file-mistral-actionhvoeqB/7beb822a-575a-11e8-9b05-525400e6c600//etc/

Thanks

Comment 18 Yurii Prokulevych 2018-05-15 10:19:14 UTC
There is no such directory on uc:

[root@undercloud-0 (undercloud-12-US)~]# ll /tmp/file-mistral-actionhvoeqB/7beb822a-575a-11e8-9b05-525400e6c600/
ls: cannot access /tmp/file-mistral-actionhvoeqB/7beb822a-575a-11e8-9b05-525400e6c600/: No such file or directory

Comment 19 Guillaume Abrioux 2018-05-15 10:29:35 UTC
The issue reported in comment 16 is caused by a mismatch between the path from which the mgr keys are fetched in rolling_update and the path to which they are copied in ceph-mon.

upstream patch: https://github.com/ceph/ceph-ansible/pull/2588/commits/d65bb8f9655d906054e634db67b02abcbb3ea837

will be in v3.1.0rc4
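
A path mismatch of this kind is easy to reproduce in isolation. The Python sketch below uses a made-up directory layout (not the real ceph-ansible paths) and placeholder key content to show how a key copied to one location is invisible to a step that fetches from another, and how deriving both paths from one helper avoids the problem.

```python
import tempfile
from pathlib import Path


def keyring_path(base: Path, cluster: str, host: str) -> Path:
    """Single source of truth for a mgr keyring location (hypothetical layout)."""
    return base / "etc" / "ceph" / f"{cluster}.mgr.{host}.keyring"


with tempfile.TemporaryDirectory() as tmp:
    fetch_dir = Path(tmp)

    # Writer: copies the key using the shared helper (placeholder content).
    written = keyring_path(fetch_dir, "ceph", "controller-0")
    written.parent.mkdir(parents=True)
    written.write_text("[mgr.controller-0]\n")

    # Reader with a mismatched path (drops the "etc/ceph" prefix): finds nothing.
    mismatched = fetch_dir / "ceph.mgr.controller-0.keyring"
    found_mismatched = mismatched.exists()

    # Reader using the same helper as the writer: finds the key.
    found_shared = keyring_path(fetch_dir, "ceph", "controller-0").exists()

print(found_mismatched, found_shared)
```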

Comment 20 Christina Meno 2018-05-15 15:34:55 UTC
Yogev, would you please set qa_ack?

Comment 21 Sébastien Han 2018-05-15 17:41:21 UTC
New tag with the fixes https://github.com/ceph/ceph-ansible/releases/tag/v3.1.0rc3

Comment 23 Yogev Rabl 2018-05-18 18:50:01 UTC
Verified