Bug 1563748 - [CRS][3.7]Install CRS failed due to installer disable firewalld service in external glusterfs cluster [NEEDINFO]
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Jose A. Rivera
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1574370
Reported: 2018-04-04 15:18 UTC by Wenkai Shi
Modified: 2019-02-04 13:44 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1574370 (view as bug list)
Environment:
Last Closed: 2019-02-04 13:44:25 UTC
Target Upstream Version:
jrivera: needinfo? (eboyd)



Description Wenkai Shi 2018-04-04 15:18:44 UTC
Description of problem:
When deploying OCP with CRS, the installer failed on "Verify heketi service" step. During installation, the installer disabled the firewalld service on the hosts in the glusterfs group, which use firewalld as their firewall solution.
There is a similar bug for 3.6, BZ #1473589, but this time the installer only stopped the firewall service; it didn't change the firewall rules.

Version-Release number of the following components:
openshift-ansible-3.7.42-1.git.0.427f18c.el7

How reproducible:
100%

Steps to Reproduce:
1. Install OCP with CRS

Actual results:
# ansible-playbook -i host -v /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
...
TASK [openshift_storage_glusterfs : Load heketi topology] **********************
Wednesday 04 April 2018  02:12:34 -0400 (0:00:01.091)       0:16:58.412 ******* 
fatal: [qe-weshi-master-etcd-1.0404-ivm.qe.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-sbndk", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "X+G3CSggRYk06YUSw6kR8y063tEn5B0MwafV4Tq4VyU=", "topology", "load", "--json=/tmp/openshift-glusterfs-ansible-m2FH2D/topology.json", "2>&1"], "delta": "0:00:03.781876", "end": "2018-04-04 02:13:40.244286", "failed": true, "failed_when_result": true, "rc": 0, "start": "2018-04-04 02:13:36.462410", "stderr": "", "stderr_lines": [], "stdout": "Creating cluster ... ID: dec1ec3748f322149917619cb2cb33aa\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node qe-weshi-glusterfs-1 ... ID: ac7be651d83ad43e815ff3ea338a2f08\n\t\tAdding device /dev/vsda ... OK\n\tCreating node qe-weshi-glusterfs-2 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected\n\tCreating node qe-weshi-glusterfs-3 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected", "stdout_lines": ["Creating cluster ... ID: dec1ec3748f322149917619cb2cb33aa", "\tAllowing file volumes on cluster.", "\tAllowing block volumes on cluster.", "\tCreating node qe-weshi-glusterfs-1 ... ID: ac7be651d83ad43e815ff3ea338a2f08", "\t\tAdding device /dev/vsda ... OK", "\tCreating node qe-weshi-glusterfs-2 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected", "\tCreating node qe-weshi-glusterfs-3 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected"]}
...

Expected results:
Installation succeeds.

Additional info:
# ansible-playbook -i host -v /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
...
TASK [os_firewall : Ensure firewalld service is not enabled] *******************
Wednesday 04 April 2018  01:55:40 -0400 (0:00:00.084)       0:00:04.520 ******* 
changed: [qe-weshi-abkv-glusterfs-2.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
changed: [qe-weshi-abkv-master-etcd-1.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
changed: [qe-weshi-abkv-nrr-1.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
changed: [qe-weshi-abkv-glusterfs-3.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
changed: [qe-weshi-abkv-glusterfs-1.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
...

[root@qe-weshi-abkv-glusterfs-3 ~]# firewall-cmd --list-all
FirewallD is not running
[root@qe-weshi-abkv-glusterfs-3 ~]# systemctl stop iptables 
[root@qe-weshi-abkv-glusterfs-3 ~]# systemctl start firewalld
Failed to start firewalld.service: Unit is masked.
[root@qe-weshi-abkv-glusterfs-3 ~]# systemctl unmask firewalld
Removed symlink /etc/systemd/system/firewalld.service.
[root@qe-weshi-abkv-glusterfs-3 ~]# systemctl start firewalld
[root@qe-weshi-abkv-glusterfs-3 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources: 
  services: ssh dhcpv6-client
  ports: 24007-24021/tcp 111/tcp 38465-38485/tcp 49152-49252/tcp
  protocols: 
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules: 

After doing the same on all three glusterfs cluster nodes, loading the heketi topology succeeds:

[root@qe-weshi-abkv-master-etcd-1 ~]# oc rsh deploy-heketi-storage-1-sbndk heketi-cli -s http://localhost:8080 --user admin --secret X+G3CSggRYk06YUSw6kR8y063tEn5B0MwafV4Tq4VyU= topology load --json=/tmp/openshift-glusterfs-ansible-m2FH2D/topology.json
	Found node qe-weshi-abkv-glusterfs-1 on cluster dec1ec3748f322149917619cb2cb33aa
		Found device /dev/vsda
	Found node qe-weshi-abkv-glusterfs-2 on cluster dec1ec3748f322149917619cb2cb33aa
		Found device /dev/vsda
	Creating node qe-weshi-abkv-glusterfs-3 ... ID: c2905a1045d74e77a6ea92a4566701b9
		Adding device /dev/vsda ... OK
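
The failing task above reports "rc": 0 together with "failed_when_result": true, so the failure is only visible in the stdout text. A minimal sketch, assuming the output format shown in this report, of how the nodes that failed the peer probe could be pulled out of that text. The function name failed_nodes and the regular expression are illustrative and are not part of openshift-ansible or heketi-cli.

```python
import re
from typing import Dict

# Matches lines such as
#   "Creating node <name> ... ID: <id>"
#   "Creating node <name> ... Unable to create node: <reason>"
# as they appear in the "topology load" output quoted in this report.
NODE_LINE = re.compile(r"^\s*Creating node (?P<name>\S+) \.\.\. (?P<rest>.*)$")
FAILURE_PREFIX = "Unable to create node: "


def failed_nodes(stdout: str) -> Dict[str, str]:
    """Map each node that could not be created to the reason that was printed."""
    failures: Dict[str, str] = {}
    for line in stdout.splitlines():
        match = NODE_LINE.match(line)
        if match is None:
            continue  # cluster, device and other lines are ignored
        rest = match.group("rest")
        if rest.startswith(FAILURE_PREFIX):
            failures[match.group("name")] = rest[len(FAILURE_PREFIX):]
    return failures


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): save the task's stdout to a file and pipe it in.
    for name, reason in failed_nodes(sys.stdin.read()).items():
        print(f"{name}: {reason}")
```

Run against the stdout quoted under "Actual results", this would list the glusterfs-2 and glusterfs-3 nodes, which are the hosts whose firewalld state is worth checking first.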

Comment 1 Wenkai Shi 2018-04-04 15:26:09 UTC
Typo: 

In comment 0:
Description 
... installer failed on "Verify heketi service" step ...

Should be:
Description 
... installer failed on "Load heketi topology" step ...

Comment 2 Jose A. Rivera 2018-04-04 15:29:57 UTC
If this is CRS why is the installer modifying the firewall? Could you provide your entire inventory file?

Comment 4 Jose A. Rivera 2018-04-04 16:20:43 UTC
Hm.... I'll note that in your original failure output, the first two nodes succeeded but the third one did not, indicating that the firewall ports were open for the former at that time. Are you sure you set up all three nodes correctly? Can you start over from a fresh environment and see if the issue can be replicated? If so, verify that the firewalls are up and configured on all three nodes before starting the installation and see if the issue reproduces.

Comment 5 Wenkai Shi 2018-04-04 17:19:53 UTC
(In reply to Jose A. Rivera from comment #4)
> Hm.... I'll note that in your original failure output, the first two nodes
> succeeded but the third one did not, indicating that the firewall ports were
> open for the former at that time. Are you sure you set up all three nodes
> correctly? Can you start over from a fresh environment and see if the issue
> can be replicated? If so, verify that the firewalls are up and configured on
> all three nodes before starting the installation and see if the issue
> reproduces.

Before installation, I'm sure the firewalld service is running:

[root@qe-weshi-bug-glusterfs-1 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources: 
  services: ssh dhcpv6-client
  ports: 24007-24021/tcp 111/tcp 38465-38485/tcp 49152-49252/tcp
  protocols: 
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules: 

Later, after I trigger the installation:

[root@qe-weshi-bug-glusterfs-1 ~]# firewall-cmd --list-all
FirewallD is not running

Yes, it reproduces.
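
For comparing the before and after state on several nodes, the two kinds of `firewall-cmd --list-all` output shown above can be checked mechanically. The following is a rough sketch that only parses text in the format quoted in this comment; the helper names firewalld_running, parse_ports and is_open are made up for illustration, and nothing here calls firewall-cmd itself.

```python
from typing import List, Tuple

PortRange = Tuple[int, int, str]  # (first port, last port, protocol)


def firewalld_running(list_all_output: str) -> bool:
    """False when the output is the 'FirewallD is not running' message."""
    return "FirewallD is not running" not in list_all_output


def parse_ports(list_all_output: str) -> List[PortRange]:
    """Read the 'ports:' line, e.g. 'ports: 24007-24021/tcp 111/tcp'."""
    ranges: List[PortRange] = []
    for line in list_all_output.splitlines():
        stripped = line.strip()
        # 'forward-ports:' and 'source-ports:' do not start with 'ports:'.
        if not stripped.startswith("ports:"):
            continue
        for entry in stripped[len("ports:"):].split():
            span, _, proto = entry.partition("/")
            first, _, last = span.partition("-")
            ranges.append((int(first), int(last or first), proto))
    return ranges


def is_open(ranges: List[PortRange], port: int, proto: str = "tcp") -> bool:
    """True when the port falls inside one of the parsed ranges."""
    return any(lo <= port <= hi and p == proto for lo, hi, p in ranges)
```

With the pre-installation output above, is_open(parse_ports(text), 24007) holds because 24007 falls in the 24007-24021/tcp range; with the later output, firewalld_running(text) is false and no ports are parsed.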

Comment 6 Jose A. Rivera 2018-04-04 17:22:11 UTC
Oh all nodes, or just one?

Comment 7 Wenkai Shi 2018-04-04 17:32:54 UTC
(In reply to Jose A. Rivera from comment #6)
> Oh all nodes, or just one?

TASK [openshift_storage_glusterfs : Load heketi topology] **********************
Wednesday 04 April 2018  13:18:09 -0400 (0:00:01.464)       0:19:47.253 ******* 

fatal: [qe-weshi-bug-master-etcd-1.0404-v-0.qe.rhcloud.com]: FAILED! => {"changed": true, "cmd": ["oc", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-r8xbf", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "sr7aPB6PsW1M+ED36XVVMtJi0jENhprtvgPxy/FFH9w=", "topology", "load", "--json=/tmp/openshift-glusterfs-ansible-21eSjG/topology.json", "2>&1"], "delta": "0:00:05.098642", "end": "2018-04-04 13:19:19.538550", "failed": true, "failed_when_result": true, "rc": 0, "start": "2018-04-04 13:19:14.439908", "stderr": "", "stderr_lines": [], "stdout": "Creating cluster ... ID: c5403c5409177bddf8acf2178d36511b\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node qe-weshi-bug-glusterfs-1 ... ID: 43edcc380b3c72497002f77ed01a5bd1\n\t\tAdding device /dev/vsda ... OK\n\tCreating node qe-weshi-bug-glusterfs-2 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected\n\tCreating node qe-weshi-bug-glusterfs-3 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected", "stdout_lines": ["Creating cluster ... ID: c5403c5409177bddf8acf2178d36511b", "\tAllowing file volumes on cluster.", "\tAllowing block volumes on cluster.", "\tCreating node qe-weshi-bug-glusterfs-1 ... ID: 43edcc380b3c72497002f77ed01a5bd1", "\t\tAdding device /dev/vsda ... OK", "\tCreating node qe-weshi-bug-glusterfs-2 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected", "\tCreating node qe-weshi-bug-glusterfs-3 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected"]}

Comment 8 Jose A. Rivera 2018-04-04 17:38:13 UTC
That doesn't directly answer my question, but good enough. :)

So it seems that for some reason some random subset of the three CRS nodes are having their firewalls turned off. Before it was 3 failing, now it's 2 and 3. This makes no sense. Can you provide a full ansible log of the latest run?

Comment 11 Jose A. Rivera 2018-04-04 18:18:36 UTC
I'm going to be a bit of a pain: Would you be able to reproduce this on OCP 3.9? I just can't find what might be causing this...

Comment 12 Wenkai Shi 2018-04-09 07:27:50 UTC
(In reply to Jose A. Rivera from comment #11)
> I'm going to be a bit of a pain: Would you be able to reproduce this on OCP
> 3.9? I just can't find what might be causing this...

3.9 doesn't have this issue.

Comment 13 Jose A. Rivera 2018-04-09 15:53:06 UTC
Okay... there was a big PR that got merged just a bit after 3.7.42, so it should be in the next build. Could you try to reproduce on 3.7.43 when it's out?

Comment 14 Wenkai Shi 2018-04-11 03:12:09 UTC
Yes, I can reproduce it on openshift-ansible-3.7.43-1.git.0.176ff8d.el7, firewalld service on CRS nodes is down.

Comment 15 Yaniv Kaul 2018-08-08 13:34:04 UTC
(In reply to Wenkai Shi from comment #14)
> Yes, I can reproduce it on openshift-ansible-3.7.43-1.git.0.176ff8d.el7,
> firewalld service on CRS nodes is down.

Any update?

Comment 16 Jose A. Rivera 2018-08-08 13:44:45 UTC
No update, haven't had time to look into this one.

Comment 17 Yaniv Kaul 2018-08-08 13:48:02 UTC
(In reply to Jose A. Rivera from comment #16)
> No update, haven't had time to look into this one.

Restoring NEEDINFO so you won't forget about it (+ setting target release for 3.10.z, for the time being).

Comment 18 Jose A. Rivera 2018-08-08 14:19:19 UTC
No, this issue is not seen in 3.9+, so it does not belong in 3.10.z. Leaving NEEDINFO on, though that's no guarantee it will stay on my radar any more than it would without it. :)

Comment 19 Stephen Cuppett 2018-08-23 12:23:54 UTC
At this point, new installs should be leveraging a more recent release. If this issue needs to be addressed, or is still present and can be recreated on a more recent release, please reopen with those details.

Comment 20 Erin Boyd 2018-11-06 19:58:40 UTC
This is still happening with the current version, 3.11: you can't install CRS.
Steps to reproduce:

CRS Validation

Do the following on all instances:
subscription-manager register --username= --password=
subscription-manager attach --pool=8a85f9833e1404a9013e3cddf99305e6
subscription-manager repos --disable=*
subscription-manager repos --enable=rhel-7-server-rpms --enable=rhel-7-server-optional-rpms --enable=rhel-7-server-extras-rpms --enable=rhel-7-server-ansible-2.7-rpms --enable=rh-gluster-3-for-rhel-7-server-rpms --enable=rhel-7-server-ose-3.10-rpms

yum update -y

yum install openshift-ansible-3.10-* -y
yum install heketi-client -y
yum install redhat-storage-server -y
yum install gluster-block -y
yum list glusterfs glusterfs-client-xlators glusterfs-libs glusterfs-fuse
yum install glusterfs-fuse -y

rpm -q kernel
uname -r

# REBOOT

firewall-cmd --zone=home --add-port=24010/tcp --add-port=3260/tcp --add-port=111/tcp --add-port=22/tcp --add-port=24007/tcp --add-port=24008/tcp --add-port=49152-49664/tcp
firewall-cmd --zone=home --add-port=24010/tcp --add-port=3260/tcp --add-port=111/tcp --add-port=22/tcp --add-port=24007/tcp --add-port=24008/tcp --add-port=49152-49664/tcp --permanent

Exec on all nodes:
modprobe target_core_user
modprobe dm_thin_pool
lsmod | grep dm_thin_pool
lsmod | grep target_core_user
modprobe dm_multipath
lsmod | grep dm_multipath

systemctl start sshd
systemctl enable sshd
systemctl start glusterd
systemctl enable glusterd
systemctl start gluster-blockd
systemctl enable gluster-blockd

==== inventory.ini ====
[OSEv3:children]
masters
etcd
nodes
glusterfs
glusterfs-registry

[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=openshift-enterprise
openshift_enable_excluders=false
openshift_storage_glusterfs_image=registry.access.redhat.com/rhgs3/rhgs-server-rhel7:v3.11
openshift_storage_glusterfs_block_image=registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7:v3.11
openshift_storage_glusterfs_heketi_image=registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.11
openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=true
openshift_storage_glusterfs_block_host_vol_create=true
openshift_storage_glusterfs_block_host_vol_size=100
openshift_storage_glusterfs_block_storageclass=true
openshift_storage_glusterfs_block_storageclass_default=false
openshift_storage_glusterfs_is_native=false
openshift_storage_glusterfs_heketi_is_native=true
openshift_storage_glusterfs_heketi_executor=ssh
openshift_storage_glusterfs_heketi_ssh_port=22
openshift_storage_glusterfs_heketi_ssh_user=root
openshift_storage_glusterfs_heketi_ssh_sudo=false
openshift_storage_glusterfs_heketi_ssh_keyfile="/root/.ssh/id_rsa"

[masters]
172.31.0.12

[nodes]
172.31.0.12 openshift_schedulable=True openshift_node_group_name=node0 openshift_node_group_name=node-config-infra
172.31.0.22 openshift_schedulable=True openshift_node_group_name=node1 openshift_node_group_name=node-config-infra
172.31.0.17 openshift_schedulable=True openshift_node_group_name=node2 openshift_node_group_name=node-config-infra

[etcd]
172.31.0.12

[glusterfs]
172.31.0.12 glusterfs_devices='[ "/dev/nvme1" ]' glusterfs_ip=172.31.0.12
172.31.0.22 glusterfs_devices='[ "/dev/nvme1" ]' glusterfs_ip=172.31.0.22
172.31.0.17 glusterfs_devices='[ "/dev/nvme1" ]' glusterfs_ip=172.31.0.17
[glusterfs-registry]
172.31.0.12 glusterfs_devices='[ "/dev/nvme2" ]'
172.31.0.22 glusterfs_devices='[ "/dev/nvme2" ]'
172.31.0.17 glusterfs_devices='[ "/dev/nvme2" ]'
==== EOF ====

ansible-playbook -i ~/inv.ini /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml

ansible-playbook -i ~/inv.ini /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml
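
To see at a glance which hosts the inventory above places in more than one group (for example [nodes] and [glusterfs]), a small script can list the hosts per group. This is only a sketch for the simple INI layout used in this comment; it is not a full Ansible inventory parser, and the helper names group_hosts and overlap are made up for illustration. Whether such an overlap is related to the firewall behaviour is not established in this report.

```python
from typing import Dict, List


def group_hosts(inventory_text: str) -> Dict[str, List[str]]:
    """Return {group: [hosts]} for plain [group] sections.

    [group:vars] and [group:children] sections are skipped, and only
    the first token of each host line (the host name or IP) is kept.
    """
    groups: Dict[str, List[str]] = {}
    current = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "====")):
            continue  # blank lines, comments, and the ==== markers above
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
            current = None if ":" in name else name
            if current is not None:
                groups.setdefault(current, [])
            continue
        if current is not None:
            groups[current].append(line.split()[0])
    return groups


def overlap(groups: Dict[str, List[str]], a: str, b: str) -> List[str]:
    """Hosts that appear in both group a and group b, sorted as strings."""
    return sorted(set(groups.get(a, [])) & set(groups.get(b, [])))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python3 groups.py < ~/inv.ini
    parsed = group_hosts(sys.stdin.read())
    print(overlap(parsed, "nodes", "glusterfs"))
```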

Comment 21 Jose A. Rivera 2018-11-20 17:01:31 UTC
Erin: So you're still seeing the behavior where the firewall service is being mysteriously disabled on random nodes? Are you finding any consistency to this behavior?

Comment 22 Jose A. Rivera 2019-02-04 13:44:25 UTC
Closing this BZ since it has no attached customer cases. If this is still a problem in current OCP 3.x releases, please reopen it with additional information.

