Description of problem:

1. Install 2.5 from CDN and use the same all.yaml file for the upgrade playbook, e.g.:

   global:
     mon_allow_pool_delete: true
     mon_max_pg_per_osd: 2048
     osd_default_pool_size: 2
     osd_pool_default_pg_num: 128
     osd_pool_default_pgp_num: 128
   ceph_origin: distro
   ceph_repository: rhcs
   ceph_rhcs_version: 3
   ceph_stable_release: luminous
   ceph_test: true
   copy_admin_key: true
   fetch_directory: /home/cephuser/fetch
   osd_auto_discovery: false
   osd_scenario: collocated
   public_network: 172.16.0.0/12
   radosgw_interface: eth0
   upgrade_ceph_packages: true

2. All PGs are in a clean state after the 2.5 install. Use the upgrade playbook to run the upgrade from 2.5 to 3.z2:

   ansible-playbook -e ireallymeanit=yes -vv -i hosts rolling_update.yml

3. Notice that the upgrade playbook tries to activate the OSDs again, which leaves PGs in an unclean state at the end of the upgrade:

   TASK [ceph-osd : manually prepare ceph "filestore" non-containerized osd disk(s) with collocated osd data and journal] ***
   task path: /usr/share/ceph-ansible/roles/ceph-osd/tasks/scenarios/collocated.yml:54
   2018-04-13 19:06:31,744 - ceph.ceph - INFO - skipping: [ceph-clacroix-run712-node5-osd] => (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'", u'end': u'2018-04-13 15:06:28.168932', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, 'item': u'/dev/vdb', u'delta': u'0:00:00.019154', u'stderr': u'', u'rc': 0, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, 'stdout_lines': [], 'failed_when_result': False, u'start': u'2018-04-13 15:06:28.149778', '_ansible_ignore_errors': None, 'failed': False}, u'/dev/vdb']) => {"changed": false, "item": [{"_ansible_ignore_errors": null, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'", "delta": "0:00:00.019154", "end": 2018-04-13 19:06:31,744 - ceph.ceph - INFO - "2018-04-13 15:06:28.168932", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": "/dev/vdb", "rc": 0, "start": "2018-04-13 15:06:28.149778", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/vdb"], "skip_reason": "Conditional result was False"}

Full logs: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1523642631/ceph_ansible_upgrade_to_rhcs_3_nightly_0.log

Expected results:
The upgrade playbook should not try to activate the OSDs again; it should mainly perform upgrades on the roles defined in the hosts file.
No, it does not block z2; we'll just release-note that use of this flag doesn't work with upgrade.
I have removed the OSD-specific config and some other items that shouldn't have been necessary for upgrade, to try to get around this issue. The config I ran with was this:

   ceph_test: True
   ceph_origin: distro
   ceph_repository: rhcs
   ceph_rhcs_version: 3
   ceph_stable_release: luminous
   upgrade_ceph_packages: True
   fetch_directory: ~/fetch
   copy_admin_key: True

The first failure was during execution of check_mandatory_vars.yml, which complained that my public_network was not configured. I configured that and reran the upgrade, hitting a similar error. This time the check returned the following:

   TASK [ceph-osd : make sure an osd scenario was chosen] *************************
   task path: /usr/share/ceph-ansible/roles/ceph-osd/tasks/check_mandatory_vars.yml:23
   fatal: [ceph-clacroix-run113-node5-osd]: FAILED! => {"changed": false, "msg": "please choose an osd scenario"}

Are these config items truly necessary for upgrade?
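In case it helps anyone hitting the same checks: a minimal sketch of what check_mandatory_vars.yml appears to want for this setup, assuming the same collocated filestore layout as the original 2.5 install (the network and device names below are illustrative, not taken from this environment):

   # group_vars/all.yml (sketch)
   ceph_origin: distro
   ceph_repository: rhcs
   ceph_rhcs_version: 3
   ceph_stable_release: luminous
   upgrade_ceph_packages: True
   fetch_directory: ~/fetch
   copy_admin_key: True
   public_network: 172.16.0.0/12      # must match the running cluster's public network

   # group_vars/osds.yml (sketch)
   osd_scenario: collocated           # required even for upgrades; already-prepared devices are skipped
   devices:                           # illustrative device list
     - /dev/vdb
     - /dev/vdc
     - /dev/vdd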
Created attachment 1421631 [details] Logs for failed upgrade scenario
Vasu, after reading the BZ I'm a bit confused: initially you said the OSDs were prepared again during the upgrade. However, the task output you quoted says "INFO - skipping", so the task was skipped. Looking at your log, this same task got skipped there as well. Let's assume for a second that the OSDs did get prepared again: what's the state of these OSDs? Can you share more info on the state of the drives? A "ceph -s" would also be useful. Now, yes, specifying an osd scenario is mandatory; even during an upgrade the whole playbook runs, and devices that have already been prepared are skipped. Only new devices that might appear during the upgrade can be prepared. Please clarify. Thanks.
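If it helps, a quick way to confirm by hand that a device was already prepared is to run the same parted check that shows up in the log above (a sketch; device names will differ per node):

   # On an OSD node; exit status 0 means a ceph partition already exists on the device,
   # which is why the prepare task skips it instead of touching it again
   $ sudo parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'; echo $?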
Vasu, I read the logs again; I suspect a firewall issue with the mgr. Can you check the mgr logs on ceph-clacroix-run712-node1-mon and see if they say anything about not being able to contact the OSDs? Thanks!
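A sketch of how to pull those logs, assuming the default log locations and systemd unit naming (the daemon id usually matches the node's short hostname):

   # On the active mgr node
   $ sudo journalctl -u ceph-mgr@ceph-clacroix-run712-node1-mon --since "2 hours ago"
   # or check the log file directly; lines like "waiting for OSDs" or
   # "Cannot get stat of OSD" would point at mgr <-> OSD connectivity
   $ sudo grep -E 'waiting for OSDs|Cannot get stat' /var/log/ceph/ceph-mgr.*.log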
Vasu, the tasks: 2018-04-13 19:06:31,712 - ceph.ceph - INFO - TASK [ceph-osd : manually prepare ceph "filestore" non-containerized osd disk(s) with collocated osd data and journal] *** task path: /usr/share/ceph-ansible/roles/ceph-osd/tasks/scenarios/collocated.yml:54 2018-04-13 19:06:31,744 - ceph.ceph - INFO - skipping: [ceph-clacroix-run712-node5-osd] => (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'", u'end': u'2018-04-13 15:06:28.168932', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, 'item': u'/dev/vdb', u'delta': u'0:00:00.019154', u'stderr': u'', u'rc': 0, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, 'stdout_lines': [], 'failed_when_result': False, u'start': u'2018-04-13 15:06:28.149778', '_ansible_ignore_errors': None, 'failed': False}, u'/dev/vdb']) => {"changed": false, "item": [{"_ansible_ignore_errors": null, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'", "delta": "0:00:00.019154", "end": 2018-04-13 19:06:31,744 - ceph.ceph - INFO - "2018-04-13 15:06:28.168932", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/vdb print | egrep -sq '^ 1.*ceph'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": "/dev/vdb", "rc": 0, "start": "2018-04-13 15:06:28.149778", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/vdb"], "skip_reason": "Conditional result was False"} 2018-04-13 19:06:31,755 - ceph.ceph - INFO - skipping: [ceph-clacroix-run712-node5-osd] => (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/vdc print | egrep -sq '^ 1.*ceph'", u'end': u'2018-04-13 15:06:28.498038', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, 'item': u'/dev/vdc', u'delta': u'0:00:00.019548', u'stderr': u'', u'rc': 0, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/vdc print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, 'stdout_lines': [], 'failed_when_result': False, u'start': u'2018-04-13 15:06:28.478490', '_ansible_ignore_errors': None, 'failed': False}, u'/dev/vdc']) => {"changed": false, "item": [{"_ansible_ignore_errors": null, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/vdc print | egrep -sq '^ 1.*ceph'", "delta": "0:00:00.019548", "end": 2018-04-13 19:06:31,755 - ceph.ceph - INFO - "2018-04-13 15:06:28.498038", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/vdc print | egrep -sq '^ 1.*ceph'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": "/dev/vdc", "rc": 0, "start": "2018-04-13 15:06:28.478490", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/vdc"], "skip_reason": "Conditional result was False"} 2018-04-13 19:06:31,769 - ceph.ceph - INFO - skipping: 
[ceph-clacroix-run712-node5-osd] => (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/vdd print | egrep -sq '^ 1.*ceph'", u'end': u'2018-04-13 15:06:28.837381', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, 'item': u'/dev/vdd', u'delta': u'0:00:00.018885', u'stderr': u'', u'rc': 0, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/vdd print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, 'stdout_lines': [], 'failed_when_result': False, u'start': u'2018-04-13 15:06:28.818496', '_ansible_ignore_errors': None, 'failed': False}, u'/dev/vdd']) => {"changed": false, "item": [{"_ansible_ignore_errors": null, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/vdd print | egrep -sq '^ 1.*ceph'", "delta": "0:00:00.018885", "end": 2018-04-13 19:06:31,769 - ceph.ceph - INFO - "2018-04-13 15:06:28.837381", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/vdd print | egrep -sq '^ 1.*ceph'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": "/dev/vdd", "rc": 0, "start": "2018-04-13 15:06:28.818496", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/vdd"], "skip_reason": "Conditional result was False"}

They are all "skipping". If you can get into the same state again, please check the firewall rules and make sure the manager can connect to the OSDs. I can also get into the env and debug. Thanks.
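While you're in there, a quick way to eyeball the firewall side (a sketch; adjust the zone if it isn't "public"):

   # On the mon/mgr and OSD nodes
   $ sudo firewall-cmd --list-all                 # runtime rules
   $ sudo firewall-cmd --permanent --list-all     # rules that survive a reload/reboot
   $ sudo iptables -L INPUT -nv                   # what is actually enforced right now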
Tejas, we still need to find the root cause so, for now, there is no fix to be submitted.
I have recreated this issue and have the nodes available. I will leave these up for now just to be safe. If we need any additional node IPs for troubleshooting I can get those as well.

Installer: 10.8.248.47
Mon/Mgr: 10.8.243.153

I ran the health check from the mon and under services it says 9 OSDs are up, so I assume this is not a firewall issue. I also don't think we would have been able to successfully install 2.5 if a firewall issue was present.

From the mon node:

sudo ceph -s
  cluster:
    id:     8f1fe5e1-999a-49bb-8892-63c70538b463
    health: HEALTH_WARN
            Reduced data availability: 832 pgs inactive
            clock skew detected on mon.ceph-clacroix-run224-node3-mon, mon.ceph-clacroix-run224-node2-mon

  services:
    mon: 3 daemons, quorum ceph-clacroix-run224-node1-mon,ceph-clacroix-run224-node3-mon,ceph-clacroix-run224-node2-mon
    mgr: ceph-clacroix-run224-node1-mon(active), standbys: ceph-clacroix-run224-node3-mon, ceph-clacroix-run224-node2-mon
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   7 pools, 832 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     100.000% pgs unknown
             832 unknown
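For completeness, a couple of read-only checks that show the same thing from the mon side (a sketch; run anywhere an admin keyring is present):

   $ sudo ceph health detail    # expands the inactive/unknown PG warning with affected PG details
   $ sudo ceph pg stat          # one-line summary of PG states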
The ceph-mgr owns the stats for all the PGs. Typically, if you see 100.000% pgs unknown, there is a high chance that the mgr cannot reach the OSDs. Seeing the OSDs up and in is a completely different thing; the monitors own those stats. The deploy can still succeed even with a firewall issue blocking mgr communication with the OSDs; I've seen this twice on QE machines already. So again, please check the mgr node logs and paste them here.
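One way to test that specific path, as a sketch (the address comes out of the cluster itself; nothing here is specific to this environment):

   # On a mon node: where is the active mgr listening?
   $ sudo ceph mgr dump | grep active_addr
   # On an OSD node: can the OSDs actually reach that address/port?
   $ telnet <active_mgr_ip> <active_mgr_port>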
Can you tell me what ports you think should be open here for the mgr <-> osd daemons? I don't see that information in either the downstream docs or upstream. Here is the log from the active mgr node:

2018-04-16 21:43:56.337892 7fb4bcb84680 0 set uid:gid to 167:167 (ceph:ceph)
2018-04-16 21:43:56.337915 7fb4bcb84680 0 ceph version 12.2.4-6.el7cp (78f60b924802e34d44f7078029a40dbe6c0c922f) luminous (stable), process (unknown), pid 13681
2018-04-16 21:43:56.339294 7fb4bcb84680 0 pidfile_write: ignore empty --pid-file
2018-04-16 21:43:56.344648 7fb4bcb84680 1 mgr send_beacon standby
2018-04-16 21:43:56.353116 7fb4b3ce7700 1 mgr init Loading python module 'balancer'
2018-04-16 21:43:56.369880 7fb4b3ce7700 1 mgr init Loading python module 'restful'
2018-04-16 21:43:56.492836 7fb4b3ce7700 1 mgr init Loading python module 'status'
2018-04-16 21:43:56.891651 7fb4b3ce7700 1 mgr handle_mgr_map Activating!
2018-04-16 21:43:56.891827 7fb4b3ce7700 1 mgr handle_mgr_map I am now activating
2018-04-16 21:43:56.902512 7fb4a2e8c700 1 mgr load Constructed class from module: balancer
2018-04-16 21:43:56.902697 7fb4a2e8c700 1 mgr load Constructed class from module: restful
2018-04-16 21:43:56.902794 7fb4a2e8c700 1 mgr load Constructed class from module: status
2018-04-16 21:43:56.902817 7fb4a2e8c700 1 mgr send_beacon active
2018-04-16 21:43:56.903994 7fb4a0e88700 1 mgr[restful] server not running: no certificate configured
2018-04-16 21:43:58.345073 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:43:58.345329 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:00.345440 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:00.345786 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:02.345896 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:02.346239 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:04.346348 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:04.346739 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:06.346815 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:06.347152 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:08.347240 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:08.347572 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:10.347668 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:10.347964 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:12.348076 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:12.348400 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:14.348485 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:14.348845 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:16.348957 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:16.349288 7fb4b0ce1700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2018-04-16 21:44:18.349389 7fb4b0ce1700 1 mgr send_beacon active
2018-04-16 21:44:18.349711 7fb4b0ce1700 1 mgr.server send_report Giving up on OSDs that haven't reported yet, sending potentially incomplete PG state to mon
2018-04-16 21:44:18.349745 7fb4b0ce1700 0 Cannot get stat of OSD 0
2018-04-16 21:44:18.349747 7fb4b0ce1700 0 Cannot get stat of OSD 1
2018-04-16 21:44:18.349754 7fb4b0ce1700 0 Cannot get stat of OSD 2
2018-04-16 21:44:18.349754 7fb4b0ce1700 0 Cannot get stat of OSD 3
2018-04-16 21:44:18.349755 7fb4b0ce1700 0 Cannot get stat of OSD 4
2018-04-16 21:44:18.349755 7fb4b0ce1700 0 Cannot get stat of OSD 5
2018-04-16 21:44:18.349756 7fb4b0ce1700 0 Cannot get stat of OSD 6
2018-04-16 21:44:18.349756 7fb4b0ce1700 0 Cannot get stat of OSD 7
2018-04-16 21:44:18.349756 7fb4b0ce1700 0 Cannot get stat of OSD 8
On all the nodes the following ports are open: 6789 and 6800-7300. And I see the active mgr is listening on 6800:

tcp 0 0 172.16.115.18:6800 0.0.0.0:* LISTEN 13681/ceph-mgr
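Worth noting (a general point, not a claim about this particular setup): netstat/ss only shows what the daemon is bound to locally; it says nothing about whether the firewall lets remote nodes in, and the two can disagree. A sketch of checking both views on the mgr node:

   $ sudo ss -tlnp | grep ceph-mgr                    # local listening sockets
   $ sudo firewall-cmd --zone=public --list-ports     # ports firewalld actually allows in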
John, can you tell whether this is a port communication issue?
Hi Vasu, can you check the communication:

- From the node where osd.0 is running (<interface> is the name of the public_network interface on that node, <value> is the MTU on that network, <target_ip> is the public_network interface IP on the active mgr node):

  # ping -W 2 -I <interface> -M do -s <value> <target_ip>

- From the active mgr node (<interface> is the name of the public_network interface on that node, <value> is the MTU on that network, <target_ip> is the public_network interface IP of the node where osd.0 is running):

  # ping -W 2 -I <interface> -M do -s <value> <target_ip>

- With telnet from the node where osd.0 is running (<target_ip> is the public_network interface IP on the active mgr node; the port from comment #19 above should be 6800):

  # telnet <target_ip> <port>

- From the active mgr node (<target_ip> is the public_network interface IP of the node where osd.0 is running, <port> is the port of osd.0 on the public_network, which you can get from the monitor node):

  # ceph osd metadata 0 | grep front_addr
  # telnet <target_ip> <port>
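A small aside on the ping step (an assumption about intent, not a change to the procedure): with -M do the payload size usually has to be the MTU minus 28 bytes of IP/ICMP headers, otherwise the ping fails with "message too long" even on a healthy link. For a standard 1500-byte MTU it would look something like this (interface name and target IP are placeholders):

  # 1472 = 1500 - 28 header bytes; eth0 and 172.16.115.71 are illustrative
  $ ping -W 2 -I eth0 -M do -s 1472 -c 3 172.16.115.71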
I am going to try that, but looks like we have one more here with same symptom https://bugzilla.redhat.com/show_bug.cgi?id=1557063
Hi Tomas, I have executed the commands you mentioned. This is a newly generated stack, however, so the IPs will be different than before. I reproduced the original issue before executing these commands.

Both pings execute successfully with no packet loss.

# telnet <target_ip> <port> (osd.0 -> mgr)

There is no route to host when trying to connect from the osd node to the mgr on port 6800.

[cephuser@ceph-clacroix-run517-node4-osd ~]$ telnet 10.8.246.122 6800
Trying 10.8.246.122...
telnet: connect to address 10.8.246.122: No route to host

# telnet <target_ip> <port> (mgr -> osd.0)

From the other side (mgr -> osd) it looks like I can establish a connection on that port.

[cephuser@ceph-clacroix-run517-node2-mon ~]$ telnet 10.8.246.100 6800
Trying 10.8.246.100...
Connected to 10.8.246.100.
Escape character is '^]'.
ceph v027 ���s0ɺ ��quit
^]
telnet> quit
Connection closed.

# ceph osd metadata 0 | grep front_addr
    "front_addr": "172.16.115.48:6800/44555",
    "hb_front_addr": "172.16.115.48:6803/44555",
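For what it's worth (an observation, not a confirmed diagnosis of this environment): "No route to host" on a network where plain pings succeed usually means the connection was rejected with icmp-host-prohibited by a firewall rule rather than an actual routing problem. A sketch of how to spot that on the mgr node:

  # firewalld's default reject rule answers blocked connections with icmp-host-prohibited,
  # which telnet then reports as "No route to host"
  $ sudo iptables -L -n | grep -i 'icmp-host-prohibited'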
Adding to the RADOS group as per Brett. Josh, we have the setup in the same state; can you look into this one when you get time? Thanks
The mgr is running on 172.16.115.71:6800, as shown by 'ceph mgr dump'. Attempting to telnet to this address and port from the osd nodes results in 'no route to host'. As leseb said, this is caused by the firewall blocking the osd -> mgr communication. It's set up on the mon/mgr nodes to only allow port 6789:

[cephuser@ceph-clacroix-run517-node1-mon ~]$ sudo firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: ssh dhcpv6-client
  ports: 6789/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
Thanks for the confirmation. We open the required ports during the pre-configure scripts and have sanity running across luminous builds in Jenkins; only the upgrade was failing, so I will dig more into what is causing those firewall ports to go away during the upgrade.

[cephuser@ceph-clacroix-run517-node1-mon ~]$ sudo firewall-cmd --zone=public --add-port=6800-7300/tcp
success
[cephuser@ceph-clacroix-run517-node1-mon ~]$ sudo firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: ssh dhcpv6-client
  ports: 6789/tcp 6800-7300/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
For the record, this is where we opened the ports when we installed 2.5:
http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1523642631/ceph_ansible_install_rhcs_2_stable_0.log

2018-04-13 18:15:34,253 - ceph.ceph - INFO - Running command firewall-cmd --zone=public --add-port=6800-7300/tcp on 10.8.246.61
2018-04-13 18:15:34,652 - ceph.ceph - INFO - Command completed successfully
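One thing that might be worth ruling out (an assumption on my side, not a confirmed root cause): firewall-cmd --add-port without --permanent only changes the runtime configuration, so the rule is lost if firewalld is reloaded or the node is rebooted at any point between the 2.5 install and the upgrade. A persistent variant would look like this:

  # Persist the rule and apply it to the running firewall as well
  $ sudo firewall-cmd --permanent --zone=public --add-port=6800-7300/tcp
  $ sudo firewall-cmd --reload
  # Verify it shows up in both runtime and permanent configuration
  $ sudo firewall-cmd --zone=public --list-ports
  $ sudo firewall-cmd --permanent --zone=public --list-ports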