Description of problem:

Version-Release number of selected component (if applicable):
* RHCS 2.5: Ceph 10.2.10-17.el7cp (9865b1b203321435cc7128257833dca28bd779aa)

How reproducible:
1. Purge the scale setup after running into issue: https://bugzilla.redhat.com/show_bug.cgi?id=1599842
2. Deploy Ceph using ceph-ansible:
   # ansible-playbook -vv -i hosts site.yml

Actual results:
Ceph failed to deploy. All the OSD nodes are failing with error:

fatal: [c07-h21-6048r.rdu.openstack.engineering.redhat.com]: FAILED! => {"msg": "'dict object' has no attribute u'ansible_ens3f0'"}

Additional info:
* ansible full log: https://paste.fedoraproject.org/paste/mqMStUdfm7W1nNczjSyT0Q
* ifconfig from OSD node:

enp131s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.70.140  netmask 255.255.0.0  broadcast 172.18.255.255
        inet6 fe80::ae1f:6bff:fe2d:aa50  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:2d:aa:50  txqueuelen 1000  (Ethernet)
        RX packets 164  bytes 9840 (9.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 42  bytes 2700 (2.6 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp131s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.19.70.140  netmask 255.255.0.0  broadcast 172.19.255.255
        inet6 fe80::ae1f:6bff:fe2d:aa51  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:2d:aa:51  txqueuelen 1000  (Ethernet)
        RX packets 164  bytes 9840 (9.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 42  bytes 2700 (2.6 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp131s0f0.103: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.22.70.140  netmask 255.255.0.0  broadcast 172.22.255.255
        inet6 fe80::ae1f:6bff:fe2d:aa50  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:2d:aa:50  txqueuelen 1000  (Ethernet)
        RX packets 49  bytes 2254 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 900 (900.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp131s0f0.200: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.26.70.140  netmask 255.252.0.0  broadcast 172.27.255.255
        inet6 fe80::ae1f:6bff:fe2d:aa50  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:2d:aa:50  txqueuelen 1000  (Ethernet)
        RX packets 49  bytes 2254 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 900 (900.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp131s0f1.104: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.23.70.140  netmask 255.255.0.0  broadcast 172.23.255.255
        inet6 fe80::ae1f:6bff:fe2d:aa51  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:2d:aa:51  txqueuelen 1000  (Ethernet)
        RX packets 49  bytes 2254 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 900 (900.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp131s0f1.200: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.27.70.140  netmask 255.252.0.0  broadcast 172.27.255.255
        inet6 fe80::ae1f:6bff:fe2d:aa51  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:2d:aa:51  txqueuelen 1000  (Ethernet)
        RX packets 49  bytes 2254 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 900 (900.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp5s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.12.70.140  netmask 255.255.254.0  broadcast 10.12.71.255
        inet6 2620:52:0:c46:ec4:7aff:fe6f:32e8  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::ec4:7aff:fe6f:32e8  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:6f:32:e8  txqueuelen 1000  (Ethernet)
        RX packets 1800  bytes 356341 (347.9 KiB)
        RX errors 0  dropped 24  overruns 0  frame 0
        TX packets 578  bytes 114166 (111.4 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens3f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.70.140  netmask 255.255.0.0  broadcast 172.16.255.255
        inet6 fe80::ec4:7aff:fe19:6a18  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:19:6a:18  txqueuelen 1000  (Ethernet)
        RX packets 7978  bytes 559166 (546.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6083  bytes 601362 (587.2 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens3f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.70.140  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::ec4:7aff:fe19:6a19  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:19:6a:19  txqueuelen 1000  (Ethernet)
        RX packets 164  bytes 9840 (9.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 42  bytes 2700 (2.6 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens3f0.101: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.20.70.140  netmask 255.255.0.0  broadcast 172.20.255.255
        inet6 fe80::ec4:7aff:fe19:6a18  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:19:6a:18  txqueuelen 1000  (Ethernet)
        RX packets 49  bytes 2254 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 900 (900.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens3f0.200: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.24.70.140  netmask 255.252.0.0  broadcast 172.27.255.255
        inet6 fe80::ec4:7aff:fe19:6a18  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:19:6a:18  txqueuelen 1000  (Ethernet)
        RX packets 49  bytes 2254 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 900 (900.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens3f1.102: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.21.70.140  netmask 255.255.0.0  broadcast 172.21.255.255
        inet6 fe80::ec4:7aff:fe19:6a19  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:19:6a:19  txqueuelen 1000  (Ethernet)
        RX packets 49  bytes 2254 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 900 (900.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens3f1.200: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.25.70.140  netmask 255.252.0.0  broadcast 172.27.255.255
        inet6 fe80::ec4:7aff:fe19:6a19  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:19:6a:19  txqueuelen 1000  (Ethernet)
        RX packets 49  bytes 2254 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 900 (900.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 6  bytes 318 (318.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6  bytes 318 (318.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Andrew suggested running without the rgw section, and that worked. Further debugging showed that, out of 13 RGW nodes, one node didn't have the right interface. The error message is a bit misleading, since it implies that none of the nodes have the interface; we could fix the error message here for a better experience.

c08-h22-r630.rdu.openstack.engineering.redhat.com | SUCCESS => {
    "ansible_facts": {},
    "changed": false
}
c07-h25-6048r.rdu.openstack.engineering.redhat.com | SUCCESS => {
    "ansible_facts": {
        "ansible_ens3f0": {
            "active": true,
            "device": "ens3f0",
            "features": {
                "busy_poll": "off [fixed]",
                "fcoe_mtu": "off [fixed]",
                "generic_receive_offload": "on",
                "generic_segmentation_offload": "on",
                "highdma": "on",
                "hw_tc_offload": "off [fixed]",
                "l2_fwd_offload": "off [fixed]",
                "large_receive_offload": "off [fixed]",
                "loopback": "off [fixed]",
                "netns_local": "off [fixed]",
                "ntuple_filters": "off",
                "receive_hashing": "on",
                "rx_all": "off [fixed]",
                "rx_checksumming": "on",
                "rx_fcs": "off [fixed]",
                "rx_udp_tunnel_port_offload": "on",
                "rx_vlan_filter": "on [fixed]",
                "rx_vlan_offload": "on",
                "rx_vlan_stag_filter": "off [fixed]",
                "rx_vlan_stag_hw_parse": "off [fixed]"
What engineering work remains for this BZ?
What do you think we should say instead of this error message?
The error message comes straight out of Ansible. We could add a safety check to make sure the interface exists on all the specified nodes and fail otherwise. Setting priority to low; in the end this remains a configuration issue, not ceph-ansible's fault.
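Such a safety check could look roughly like the following Ansible task (a minimal sketch only, not the actual ceph-ansible implementation; the task name and fact lookup are illustrative):

```yaml
# Hypothetical pre-flight validation task: fail early, per host, when the
# configured radosgw_interface is not among the gathered network facts,
# instead of letting a later template lookup blow up with
# "'dict object' has no attribute u'ansible_<iface>'".
- name: fail if radosgw_interface does not exist on the node
  fail:
    msg: "{{ radosgw_interface }} does not exist on {{ inventory_hostname }}"
  when: ('ansible_' + (radosgw_interface | replace('-', '_'))) not in hostvars[inventory_hostname]
```

Because the check runs per host, only the node whose interface is actually missing would fail, which addresses the misleading "all nodes" impression described above.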
The patch is merged upstream, so this will be in RHCS 3.1; targeting it back to RHCS 3.1 unless QE can't ack it.
Right, I was thinking something else but yeah let's put it in 3.2. Thanks
Sorry, priority should be high. This is a serious issue when setting up large clusters, and the error message is misleading. I understand that it's Ansible's fault, but Sébastien, as you said, you can have one level of verification done before our installation starts.
Can we not do this in 3.1? Why was this moved to 3.2?
Sorry Vasu but as much as I'd love to have this for 3.1 this won't be possible. This patch relies on a feature that is planned for 3.2.
*** Bug 1643403 has been marked as a duplicate of this bug. ***
Using ceph-ansible version 3.2.0-0.1.rc3.el7cp, I don't see a safety check to make sure the interface exists on all the specified nodes. The error messages below still come from Ansible:

Tuesday 20 November 2018  20:28:49 +0000 (0:00:00.139)       0:08:30.962 ******
fatal: [mero005]: FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_enp136s1'"}
fatal: [mero006]: FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_enp136s1'"}
fatal: [mero007]: FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_enp136s1'"}
Steps to reproduce the issue:

1. Configure /etc/ansible/hosts with one interface that doesn't exist:

[rgws]
mero005 radosgw_interface=enp136s1
mero006 radosgw_interface=enp136s0
mero007 radosgw_interface=enp136s0

2. Deploy Ceph using ceph-ansible:

# ansible-playbook site.yml
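On a large inventory, the hosts that would trip this failure can be spotted up front with a small stand-alone play (a sketch; the play and task names are illustrative, and the host group and radosgw_interface values follow the inventory above):

```yaml
# Hypothetical check play, run before site.yml: report, per RGW host,
# whether the fact for the configured radosgw_interface was gathered.
- hosts: rgws
  gather_facts: true
  tasks:
    - name: report presence of the configured interface
      debug:
        msg: >-
          {{ radosgw_interface }}
          {{ 'found' if ('ansible_' + (radosgw_interface | replace('-', '_'))) in hostvars[inventory_hostname]
             else 'MISSING' }}
          on {{ inventory_hostname }}
```

With the inventory above, mero005 would report enp136s1 as MISSING while mero006 and mero007 report enp136s0 as found, pinpointing the one misconfigured host.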
Created attachment 1509578 [details] ansible-playbook full log
Verified with the 3.2.0-1.el7cp build. The pre-check message is printed when the interface doesn't exist, as expected:

TASK [ceph-validate : fail if enp136s1 does not exist on mero005] ******************************************************************
Thursday 13 December 2018  16:40:18 +0000 (0:00:00.531)       0:00:36.579 *****
fatal: [mero005]: FAILED! => {"changed": false, "msg": "enp136s1 does not exist on mero005"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0020