Bug 1816478

Summary: The installer's probe mechanism fails on more complex network configurations.
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Paul Cuzner <pcuzner>
Component: Ceph-Installer
Assignee: Paul Cuzner <pcuzner>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.0
CC: aschoen, ceph-eng-bugs, gjose, gmeno, hyelloji, jbrier, jpauling, nthomas, tserlin, vashastr, ykaul
Target Milestone: z1
Target Release: 4.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: cockpit-ceph-installer-1.1-0.el8cp, cockpit-ceph-installer-1.1-0.el7cp
Doc Type: Bug Fix
Doc Text:
.Cockpit Ceph Installer no longer fails on physical network devices with bridges
Previously, the Cockpit Ceph Installer failed if physical network devices were used in a Linux software bridge. This was due to a logic error in the code. In {storage-product} 4.1z1, the code has been fixed and you can use Cockpit Ceph Installer to deploy on nodes with bridges on the physical network interfaces.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-20 14:21:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1816167

Description Paul Cuzner 2020-03-24 05:04:17 UTC
Description of problem:
When the Ceph nodes are part of a collocated service stack, the networking may have bridging over bonded interfaces. When this occurs, the current logic in the check_roles Python module fails with a logic error, resulting in a failed GUI install.
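
As an illustration (the interface names and exact fact layout below are assumptions, not data from the affected hosts), this is roughly how such a bridge-over-bond topology shows up in the ansible facts the probe consumes: bridge and bond members are listed by their bare names, while the per-interface facts themselves are keyed with an "ansible_" prefix.

# Hypothetical, simplified slice of the ansible facts for a node where a
# software bridge sits on top of a bonded interface pair.
facts = {
    "ansible_interfaces": ["lo", "eno3", "eno4", "bond0", "bridge0"],
    "ansible_bridge0": {"type": "bridge", "interfaces": ["bond0"]},
    "ansible_bond0": {"type": "bonding", "slaves": ["eno3", "eno4"]},
    "ansible_eno3": {"type": "ether", "speed": 10000},
    "ansible_eno4": {"type": "ether", "speed": 10000},
}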


Version-Release number of selected component (if applicable):
4.0 GA

How reproducible:
100%

Steps to Reproduce:
1. Define target Ceph hosts with a bridge over bonded interfaces
2. Attempt the deployment

Actual results:
probe fails

Expected results:
probe should succeed


Additional info:
to be provided by the original reporter (Joel Wirāmu Pauling)

Comment 1 RHEL Program Management 2020-03-24 05:04:23 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 3 jwp@redhat.com 2020-03-24 21:45:40 UTC
OK, I've split up the networking so that the team0/bond interfaces on the bare-metal hosts are now addressed directly on the team devices, and I've moved what was attached to those bridges into separate routed segments.

I am still getting errors during the probe; interestingly, now on the mgr node itself.

This is the artifact stdout from the latest run
--

Identity added: /usr/share/ansible-runner-service/artifacts/faf9e914-6e17-11ea-af72-5254008b6c03/ssh_key_data (/usr/share/ansible-runner-service/artifacts/faf9e914-6e17-11ea-af72-5254008b6c03/ssh_key_data)
[WARNING]: log file at /root/ansible/ansible.log is not writeable and we cannot create it, aborting

[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to 
allow bad characters in group names by default, this will change, but still be 
user configurable on deprecation. This feature will be removed in version 2.10.
 Deprecation warnings can be disabled by setting deprecation_warnings=False in 
ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details


PLAY [Validate hosts against desired cluster state] ****************************

TASK [CEPH_CHECK_ROLE] *********************************************************
Tuesday 24 March 2020  21:39:48 +0000 (0:00:00.139)       0:00:00.139 ********* 

ok: [ceph04]
fatal: [ceph03]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": false, "msg": "Failed to create remote module tmp path at dir /root/.ansible/tmp with prefix ansible-moduletmp-1585085990.6926987-: [Errno 13] Permission denied: '/root/.ansible/tmp/ansible-moduletmp-1585085990.6926987-mp6fpjis'"}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'enp67s0f1'
fatal: [ceph02]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": false, "module_stderr": "Traceback (most recent call last):\n  File \"<stdin>\", line 114, in <module>\n  File \"<stdin>\", line 106, in _ansiballz_main\n  File \"<stdin>\", line 49, in invoke_module\n  File \"/usr/lib64/python3.6/imp.py\", line 235, in load_module\n    return load_source(name, filename, file)\n  File \"/usr/lib64/python3.6/imp.py\", line 170, in load_source\n    module = _exec(spec, sys.modules[name])\n  File \"<frozen importlib._bootstrap>\", line 618, in _exec\n  File \"<frozen importlib._bootstrap_external>\", line 678, in exec_module\n  File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed\n  File \"/tmp/ansible_ceph_check_role_payload_ggmmqwwr/__main__.py\", line 847, in <module>\n  File \"/tmp/ansible_ceph_check_role_payload_ggmmqwwr/__main__.py\", line 843, in main\n  File \"/tmp/ansible_ceph_check_role_payload_ggmmqwwr/__main__.py\", line 818, in run_module\n  File \"/tmp/ansible_ceph_check_role_payload_ggmmqwwr/__main__.py\", line 483, in summarize\n  File \"/tmp/ansible_ceph_check_role_payload_ggmmqwwr/__main__.py\", line 357, in get_network_info\nKeyError: 'enp67s0f1'\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'eno3'
fatal: [ceph01]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": false, "module_stderr": "Traceback (most recent call last):\n  File \"<stdin>\", line 114, in <module>\n  File \"<stdin>\", line 106, in _ansiballz_main\n  File \"<stdin>\", line 49, in invoke_module\n  File \"/usr/lib64/python3.6/imp.py\", line 235, in load_module\n    return load_source(name, filename, file)\n  File \"/usr/lib64/python3.6/imp.py\", line 170, in load_source\n    module = _exec(spec, sys.modules[name])\n  File \"<frozen importlib._bootstrap>\", line 618, in _exec\n  File \"<frozen importlib._bootstrap_external>\", line 678, in exec_module\n  File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed\n  File \"/tmp/ansible_ceph_check_role_payload_udpm8mv0/__main__.py\", line 847, in <module>\n  File \"/tmp/ansible_ceph_check_role_payload_udpm8mv0/__main__.py\", line 843, in main\n  File \"/tmp/ansible_ceph_check_role_payload_udpm8mv0/__main__.py\", line 818, in run_module\n  File \"/tmp/ansible_ceph_check_role_payload_udpm8mv0/__main__.py\", line 483, in summarize\n  File \"/tmp/ansible_ceph_check_role_payload_udpm8mv0/__main__.py\", line 357, in get_network_info\nKeyError: 'eno3'\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}

PLAY RECAP *********************************************************************
ceph01                     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
ceph02                     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
ceph03                     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
ceph04                     : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Tuesday 24 March 2020  21:39:52 +0000 (0:00:03.595)       0:00:03.735 ********* 
=============================================================================== 
CEPH_CHECK_ROLE --------------------------------------------------------- 3.60s
[root@ceph03 faf9e914-6e17-11ea-af72-5254008b6c03]#

Comment 4 jwp@redhat.com 2020-03-24 22:01:21 UTC
(In reply to Paul Cuzner from comment #0)
> Description of problem:
> When the Ceph nodes are part of a collocated service stack, the networking
> may have bridging over bonded interfaces. When this occurs the current logic
> in the check_roles python module fails with a logic error resulting in a
> failed gui install
> 
> 
> Version-Release number of selected component (if applicable):
> 4.0 GA
> 
> How reproducible:
> 100%
> 
> Steps to Reproduce:
> 1. Define target ceph hosts with bridge over bonding interfaces
> 2. attempt the deployment
> 3.
> 
> Actual results:
> probe fails
> 
> Expected results:
> probe should succeed
> 
> 
> Additional info:
> to be provided by the original reporter (Joel Wirāmu Pauling)

I am not sure whether the bridge config is a red herring at this stage, given that bridges get deployed by default by a number of package installers (docker, libvirt, etc.); I would imagine this would have been picked up previously. I will try to describe the deployment architecture here:

All nodes are Red Hat Enterprise Linux 8.1 installs with the Ceph channels enabled and valid subscriptions.
--
ceph01 - baremetal node (mon,osd)
- team0 - 10.0.0.1 (ceph01.3d.ae.net.nz DNS)
- bridge0 - Management network - to uplink-gateway on 172.16.253.0/24 (dhcpc)
- virbr1 - Routed virtual network (addressed on hypervisor as 10.0.1.1/24)
(virbr0 exists - default with libvirt unused)

ceph02 - baremetal node  (mon,osd)
- team0 - 10.0.0.2 (ceph02.3d.ae.net.nz DNS)
- bridge0 - Management network - to uplink-gateway on 172.16.253.0/24 (dhcpc)
- (virbr0 exists - default with libvirt unused)

ceph03 - VM node (mon, mgr-Deployer node)
- eth0 - 10.0.1.3/24 (ceph03.3d.ae.net.nz DNS) with route to 10.0.0.0/24
- eth1 - Management network (172.16.253.0/24)

ceph04 (metric)
- eth0 - 10.0.1.4/24 (ceph03.3d.ae.net.nz DNS) with route to 10.0.0.0/24
- eth1 - Management network (172.16.253.0/24)

---

All nodes have a non-root Ansible user (username: jwp) with SSH keys exchanged and password sudo configured.

From the ceph03 deployer node, the ansible-runner-service public key has been added to the jwp account on all nodes.

---


The cockpit-ceph-installer runs from the management interface of ceph03; adding the hosts succeeds. The next step is to probe the hosts. Shortly after the host probe starts, a small error dialog appears on the page listing the artifact UUID to check. Checking this shows an error early during the Ansible facts collection. The expected result is that the probe passes and deployment proceeds.


---

Running ansible -m setup from the ceph03 node manually against the cluster succeeds without issue.

A support ticket is open here (RH internal login needed): https://access.redhat.com/support/cases/#/case/02614455

Comment 5 jwp@redhat.com 2020-03-24 22:19:48 UTC
                                   +---------------------+
                                   | Management Router   |
             +------1g-------------+ 172.16.253.1/24     +------1g----------+
             |                     |                     |                  |
             |                     |                     |                  |
             |                     +---------------------+                  |
             |                                                              |
             |                                                              |
             |                                                              |
             |                                                              |
             |                                                              |
       +------------------ceph01---+              +---ceph02----------------------+
  +----+ bridge0       |           |              |                 |   bridge0   |
  |    |172.16.253.X   |           |              |                 |172.16.253.X |
  |    +---------------+           |              |                 +-------------+
  |    |                           |              |                               |
  |    +---------------------------+              +--------------+                |
  |    | virbr1     |              |              |              |                |
  |    | 10.0.1.1/24|  team0       |              |  team0       |                |
  |    |            | 10.0.0.1/24  +----10G-------+10.0.0.2/24   |                |
  |    |            |              |              |              |                |
  |    |            |              +----10G-------+              |                |
  |    |            |              |              |              |                |
  |    +---------------------------+              +-------------------------------+
  |           |   |
  |           |   |
  |   +-ceph03+-------+
  |   |10.0.1.3   |   |
  +---+           |   |
  |   |           |   |
  |   +---------------+
  |   +-ceph04--------+
  |   |10.0.1.3       |
  +---+               |
      |               |
      +---------------+

Comment 6 jwp@redhat.com 2020-03-24 22:20:43 UTC
The ceph04 address in that topology diagram is a typo; its IP is 10.0.1.4 (plus a DHCP-assigned address on the management segment).

Comment 7 jwp@redhat.com 2020-03-25 00:09:55 UTC
Retried after Paul advised using the jwp account in Cockpit (I was using root). Still failing.

TASK [CEPH_CHECK_ROLE] *********************************************************
Wednesday 25 March 2020  00:08:15 +0000 (0:00:00.109)       0:00:00.109 ******* 
[WARNING]: Unable to use /root/.ansible/tmp as temporary directory, failing
back to system: [Errno 13] Permission denied: '/root/.ansible'

ok: [ceph04]
ok: [ceph03]
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'eno3'
fatal: [ceph01]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": false, "module_stderr": "Traceback (most recent call last):\n  File \"<stdin>\", line 114, in <module>\n  File \"<stdin>\", line 106, in _ansiballz_main\n  File \"<stdin>\", line 49, in invoke_module\n  File \"/usr/lib64/python3.6/imp.py\", line 235, in load_module\n    return load_source(name, filename, file)\n  File \"/usr/lib64/python3.6/imp.py\", line 170, in load_source\n    module = _exec(spec, sys.modules[name])\n  File \"<frozen importlib._bootstrap>\", line 618, in _exec\n  File \"<frozen importlib._bootstrap_external>\", line 678, in exec_module\n  File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed\n  File \"/tmp/ansible_ceph_check_role_payload__789wi0e/__main__.py\", line 847, in <module>\n  File \"/tmp/ansible_ceph_check_role_payload__789wi0e/__main__.py\", line 843, in main\n  File \"/tmp/ansible_ceph_check_role_payload__789wi0e/__main__.py\", line 818, in run_module\n  File \"/tmp/ansible_ceph_check_role_payload__789wi0e/__main__.py\", line 483, in summarize\n  File \"/tmp/ansible_ceph_check_role_payload__789wi0e/__main__.py\", line 357, in get_network_info\nKeyError: 'eno3'\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'enp67s0f1'
fatal: [ceph02]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": false, "module_stderr": "Traceback (most recent call last):\n  File \"<stdin>\", line 114, in <module>\n  File \"<stdin>\", line 106, in _ansiballz_main\n  File \"<stdin>\", line 49, in invoke_module\n  File \"/usr/lib64/python3.6/imp.py\", line 235, in load_module\n    return load_source(name, filename, file)\n  File \"/usr/lib64/python3.6/imp.py\", line 170, in load_source\n    module = _exec(spec, sys.modules[name])\n  File \"<frozen importlib._bootstrap>\", line 618, in _exec\n  File \"<frozen importlib._bootstrap_external>\", line 678, in exec_module\n  File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed\n  File \"/tmp/ansible_ceph_check_role_payload_phvkrc12/__main__.py\", line 847, in <module>\n  File \"/tmp/ansible_ceph_check_role_payload_phvkrc12/__main__.py\", line 843, in main\n  File \"/tmp/ansible_ceph_check_role_payload_phvkrc12/__main__.py\", line 818, in run_module\n  File \"/tmp/ansible_ceph_check_role_payload_phvkrc12/__main__.py\", line 483, in summarize\n  File \"/tmp/ansible_ceph_check_role_payload_phvkrc12/__main__.py\", line 357, in get_network_info\nKeyError: 'enp67s0f1'\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}

PLAY RECAP *********************************************************************
ceph01                     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
ceph02                     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
ceph03                     : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
ceph04                     : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Wednesday 25 March 2020  00:08:18 +0000 (0:00:03.464)       0:00:03.573 ******* 
=============================================================================== 
CEPH_CHECK_ROLE --------------------------------------------------------- 3.46s
[root@kurinui b804099a-6e2c-11ea-85ba-525400f676ce]#

Comment 8 jwp@redhat.com 2020-03-25 00:11:33 UTC
I am going to remove all the management bridges, move to using macvtap for the management interfaces, and destroy the default libvirt network bridge, to see whether that changes the behaviour.

Comment 9 jwp@redhat.com 2020-03-25 01:02:30 UTC
Progress:

After completely removing all physical devices from bridges (including the management network), the probes pass.

It seems virbr/vnet devices must already be filtered out, since the probe passed with those still present.

So, in effect, the logic to handle physical devices in bridges is broken.

Comment 10 Paul Cuzner 2020-03-25 03:29:19 UTC
(In reply to jwp from comment #9)

> So in effect logic to handle physical devices in bridges is broken.

Yes - as discussed, the bridge handling logic wasn't accounting for the ansible_ prefix on the NIC names within the ansible facts (this was a change in ansible 2.7?, and it was missed in the bridge code), and since deploying RHCS with the installer doesn't normally target use cases where the target machines employ bridges, QE testing missed it too. The expectation is that complex deployments would be done using ceph-ansible directly!

Will send a POC of the fix to Joel to verify.
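
For illustration, here is a minimal sketch of the lookup that failed and the corrected form (the function and variable names are hypothetical, not the actual ceph_check_role code):

# Sketch only: resolve the NICs behind a bridge from ansible facts.
# Bridge/bond members are reported by bare name ("bond0", "eno3"), but
# the per-interface facts are keyed with an "ansible_" prefix.
def get_bridge_nics(facts, bridge_name):
    nics = []
    for member in facts["ansible_" + bridge_name]["interfaces"]:
        # The buggy form was effectively facts[member]; bare member names
        # are not fact keys, giving the KeyError seen in the probe output.
        member_facts = facts["ansible_" + member]
        if member_facts.get("type") == "bonding":
            nics.extend(member_facts.get("slaves", []))
        else:
            nics.append(member)
    return nics

With the illustrative facts from the description, get_bridge_nics(facts, "bridge0") returns ['eno3', 'eno4']; the unprefixed lookup is what produced the KeyError tracebacks in the probe output above.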

Comment 11 Paul Cuzner 2020-04-09 05:45:17 UTC
*** Bug 1819043 has been marked as a duplicate of this bug. ***

Comment 12 Paul Cuzner 2020-06-14 21:47:13 UTC
Changes are available in the 1.1 release.

Comment 17 errata-xmlrpc 2020-07-20 14:21:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3003