Description of problem:
Deploy OCP with CRS; the installer failed on the "Verify heketi service" step. During installation, the installer disables the firewalld service on the hosts in the [glusterfs] group, which use firewalld as their firewall solution. There is a similar issue from 3.6 (BZ #1473589), but this time the installer just stopped the firewall service and did not change any firewall rules.

Version-Release number of the following components:
openshift-ansible-3.7.42-1.git.0.427f18c.el7

How reproducible:
100%

Steps to Reproduce:
1. Install OCP with CRS
2.
3.

Actual results:
# ansible-playbook -i host -v /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
...
TASK [openshift_storage_glusterfs : Load heketi topology] **********************
Wednesday 04 April 2018 02:12:34 -0400 (0:00:01.091) 0:16:58.412 *******
fatal: [qe-weshi-master-etcd-1.0404-ivm.qe.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-sbndk", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "X+G3CSggRYk06YUSw6kR8y063tEn5B0MwafV4Tq4VyU=", "topology", "load", "--json=/tmp/openshift-glusterfs-ansible-m2FH2D/topology.json", "2>&1"], "delta": "0:00:03.781876", "end": "2018-04-04 02:13:40.244286", "failed": true, "failed_when_result": true, "rc": 0, "start": "2018-04-04 02:13:36.462410", "stderr": "", "stderr_lines": [], "stdout": "Creating cluster ... ID: dec1ec3748f322149917619cb2cb33aa\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node qe-weshi-glusterfs-1 ... ID: ac7be651d83ad43e815ff3ea338a2f08\n\t\tAdding device /dev/vsda ... OK\n\tCreating node qe-weshi-glusterfs-2 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected\n\tCreating node qe-weshi-glusterfs-3 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected", "stdout_lines": ["Creating cluster ... ID: dec1ec3748f322149917619cb2cb33aa", "\tAllowing file volumes on cluster.", "\tAllowing block volumes on cluster.", "\tCreating node qe-weshi-glusterfs-1 ... ID: ac7be651d83ad43e815ff3ea338a2f08", "\t\tAdding device /dev/vsda ... OK", "\tCreating node qe-weshi-glusterfs-2 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected", "\tCreating node qe-weshi-glusterfs-3 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected"]}
...

Expected results:
Installation succeeds.

Additional info:
# ansible-playbook -i host -v /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
...
TASK [os_firewall : Ensure firewalld service is not enabled] *******************
Wednesday 04 April 2018 01:55:40 -0400 (0:00:00.084) 0:00:04.520 *******
changed: [qe-weshi-abkv-glusterfs-2.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
changed: [qe-weshi-abkv-master-etcd-1.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
changed: [qe-weshi-abkv-nrr-1.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
changed: [qe-weshi-abkv-glusterfs-3.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
changed: [qe-weshi-abkv-glusterfs-1.0404-ivm.qe.rhcloud.com] => {"changed": true, ...
...

[root@qe-weshi-abkv-glusterfs-3 ~]# firewall-cmd --list-all
FirewallD is not running
[root@qe-weshi-abkv-glusterfs-3 ~]# systemctl stop iptables
[root@qe-weshi-abkv-glusterfs-3 ~]# systemctl start firewalld
Failed to start firewalld.service: Unit is masked.
[root@qe-weshi-abkv-glusterfs-3 ~]# systemctl unmask firewalld
Removed symlink /etc/systemd/system/firewalld.service.
[root@qe-weshi-abkv-glusterfs-3 ~]# systemctl start firewalld
[root@qe-weshi-abkv-glusterfs-3 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: ssh dhcpv6-client
  ports: 24007-24021/tcp 111/tcp 38465-38485/tcp 49152-49252/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

After the same action on all three GlusterFS cluster nodes, loading the heketi topology succeeded:

[root@qe-weshi-abkv-master-etcd-1 ~]# oc rsh deploy-heketi-storage-1-sbndk heketi-cli -s http://localhost:8080 --user admin --secret X+G3CSggRYk06YUSw6kR8y063tEn5B0MwafV4Tq4VyU= topology load --json=/tmp/openshift-glusterfs-ansible-m2FH2D/topology.json
Found node qe-weshi-abkv-glusterfs-1 on cluster dec1ec3748f322149917619cb2cb33aa
	Found device /dev/vsda
Found node qe-weshi-abkv-glusterfs-2 on cluster dec1ec3748f322149917619cb2cb33aa
	Found device /dev/vsda
Creating node qe-weshi-abkv-glusterfs-3 ... ID: c2905a1045d74e77a6ea92a4566701b9
	Adding device /dev/vsda ... OK
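For reference, the same manual workaround can be applied to all GlusterFS nodes in one step with an Ansible ad-hoc command. This is only a sketch, assuming the same inventory file ("host"), the standard [glusterfs] group, and root access as above:

# Unmask, enable and start firewalld on every host in the [glusterfs] group
ansible glusterfs -i host -m systemd -a "name=firewalld masked=no enabled=yes state=started"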
Typo in comment 0:
  Description ... installer failed on "Verify heketi service" step ...
Should be:
  Description ... installer failed on "Load heketi topology" step ...
If this is CRS, why is the installer modifying the firewall? Could you provide your entire inventory file?
Hm.... I'll note that in your original failure output, the first two nodes succeeded but the third one did not, indicating that the firewall ports were open for the former at that time. Are you sure you set up all three nodes correctly? Can you start over from a fresh environment and see if the issue can be replicated? If so, verify that the firewalls are up and configured on all three nodes before starting the installation and see if the issue reproduces.
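For example, a quick pre-flight check across all three nodes could look like this (just a sketch, assuming the nodes are in a [glusterfs] group in the inventory file "host"):

# Confirm firewalld is running on every GlusterFS node
ansible glusterfs -i host -a "firewall-cmd --state"
# Confirm the GlusterFS ports (24007-24021/tcp, 49152+/tcp, etc.) are open
ansible glusterfs -i host -a "firewall-cmd --list-ports"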
(In reply to Jose A. Rivera from comment #4)
> Hm.... I'll note that in your original failure output, the first two nodes
> succeeded but the third one did not, indicating that the firewall ports were
> open for the former at that time. Are you sure you set up all three nodes
> correctly? Can you start over from a fresh environment and see if the issue
> can be replicated? If so, verify that the firewalls are up and configured on
> all three nodes before starting the installation and see if the issue
> reproduces.

Before the installation, I'm sure the firewalld service is running:

[root@qe-weshi-bug-glusterfs-1 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: ssh dhcpv6-client
  ports: 24007-24021/tcp 111/tcp 38465-38485/tcp 49152-49252/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Later, when I trigger the installation:

[root@qe-weshi-bug-glusterfs-1 ~]# firewall-cmd --list-all
FirewallD is not running

Yes, it reproduces.
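If it helps, something like the following on each CRS node could pin down exactly when firewalld goes down, so the timestamp can be correlated with the ansible log (a diagnostic sketch, not output from this run):

# Show the most recent start/stop events for the firewalld unit
journalctl -u firewalld --no-pager | tail -n 20
# A unit masked by the installer shows up as "masked" here
systemctl is-enabled firewalld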
On all nodes, or just one?
(In reply to Jose A. Rivera from comment #6)
> On all nodes, or just one?

TASK [openshift_storage_glusterfs : Load heketi topology] **********************
Wednesday 04 April 2018 13:18:09 -0400 (0:00:01.464) 0:19:47.253 *******
fatal: [qe-weshi-bug-master-etcd-1.0404-v-0.qe.rhcloud.com]: FAILED! => {"changed": true, "cmd": ["oc", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-r8xbf", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "sr7aPB6PsW1M+ED36XVVMtJi0jENhprtvgPxy/FFH9w=", "topology", "load", "--json=/tmp/openshift-glusterfs-ansible-21eSjG/topology.json", "2>&1"], "delta": "0:00:05.098642", "end": "2018-04-04 13:19:19.538550", "failed": true, "failed_when_result": true, "rc": 0, "start": "2018-04-04 13:19:14.439908", "stderr": "", "stderr_lines": [], "stdout": "Creating cluster ... ID: c5403c5409177bddf8acf2178d36511b\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node qe-weshi-bug-glusterfs-1 ... ID: 43edcc380b3c72497002f77ed01a5bd1\n\t\tAdding device /dev/vsda ... OK\n\tCreating node qe-weshi-bug-glusterfs-2 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected\n\tCreating node qe-weshi-bug-glusterfs-3 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected", "stdout_lines": ["Creating cluster ... ID: c5403c5409177bddf8acf2178d36511b", "\tAllowing file volumes on cluster.", "\tAllowing block volumes on cluster.", "\tCreating node qe-weshi-bug-glusterfs-1 ... ID: 43edcc380b3c72497002f77ed01a5bd1", "\t\tAdding device /dev/vsda ... OK", "\tCreating node qe-weshi-bug-glusterfs-2 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected", "\tCreating node qe-weshi-bug-glusterfs-3 ... Unable to create node: peer probe: failed: Probe returned with Transport endpoint is not connected"]}
That doesn't directly answer my question, but good enough. :) So it seems that for some reason some random subset of the three CRS nodes are having their firewalls turned off. Before it was 3 failing, now it's 2 and 3. This makes no sense. Can you provide a full ansible log of the latest run?
I'm going to be a bit of a pain: Would you be able to reproduce this on OCP 3.9? I just can't find what might be causing this...
(In reply to Jose A. Rivera from comment #11)
> I'm going to be a bit of a pain: Would you be able to reproduce this on OCP
> 3.9? I just can't find what might be causing this...

3.9 doesn't have this issue.
Okay... there was a big PR that got merged just a bit after 3.7.42, so it should be in the next build. Could you try to reproduce on 3.7.43 when it's out?
Yes, I can reproduce it on openshift-ansible-3.7.43-1.git.0.176ff8d.el7; the firewalld service on the CRS nodes is down.
(In reply to Wenkai Shi from comment #14)
> Yes, I can reproduce it on openshift-ansible-3.7.43-1.git.0.176ff8d.el7;
> the firewalld service on the CRS nodes is down.

Any update?
No update, haven't had time to look into this one.
(In reply to Jose A. Rivera from comment #16)
> No update, haven't had time to look into this one.

Restoring NEEDINFO so you won't forget about it (+ setting the target release to 3.10.z for the time being).
No, this issue is not seen in 3.9+, so it does not belong in 3.10.z. Leaving NEEDINFO on, though it's no guarantee to keep it any more on my radar than without it. :)
At this point, new installs should be leveraging a more recent release. If this issue still needs to be addressed, or is still present and can be recreated on a more recent release, please reopen with those details.
This is still happening with the current version, 3.11, where you can't install CRS.

Steps to reproduce (CRS validation):

Do the following on all instances:

subscription-manager register --username= --password=
subscription-manager attach --pool=8a85f9833e1404a9013e3cddf99305e6
subscription-manager repos --disable=*
subscription-manager repos --enable=rhel-7-server-rpms --enable=rhel-7-server-optional-rpms --enable=rhel-7-server-extras-rpms --enable=rhel-7-server-ansible-2.7-rpms --enable=rh-gluster-3-for-rhel-7-server-rpms --enable=rhel-7-server-ose-3.10-rpms
yum update -y
yum install openshift-ansible-3.10-* -y
yum install heketi-client -y
yum install redhat-storage-server -y
yum install gluster-block -y
yum list glusterfs glusterfs-client-xlators glusterfs-libs glusterfs-fuse
yum install glusterfs-fuse -y
rpm -q kernel
uname -r
# REBOOT
firewall-cmd --zone=home --add-port=24010/tcp --add-port=3260/tcp --add-port=111/tcp --add-port=22/tcp --add-port=24007/tcp --add-port=24008/tcp --add-port=49152-49664/tcp
firewall-cmd --zone=home --add-port=24010/tcp --add-port=3260/tcp --add-port=111/tcp --add-port=22/tcp --add-port=24007/tcp --add-port=24008/tcp --add-port=49152-49664/tcp --permanent

Exec on all nodes:

modprobe target_core_user
modprobe dm_thin_pool
lsmod | grep dm_thin_pool
lsmod | grep target_core_user
modprobe dm_multipath
lsmod | grep dm_multipath
systemctl start sshd
systemctl enable sshd
systemctl start glusterd
systemctl enable glusterd
systemctl start gluster-blockd
systemctl enable gluster-blockd

==== inventory.ini ====
[OSEv3:children]
masters
etcd
nodes
glusterfs
glusterfs-registry

[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=openshift-enterprise
openshift_enable_excluders=false

openshift_storage_glusterfs_image=registry.access.redhat.com/rhgs3/rhgs-server-rhel7:v3.11
openshift_storage_glusterfs_block_image=registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7:v3.11
openshift_storage_glusterfs_heketi_image=registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.11

openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=true
openshift_storage_glusterfs_block_host_vol_create=true
openshift_storage_glusterfs_block_host_vol_size=100
openshift_storage_glusterfs_block_storageclass=true
openshift_storage_glusterfs_block_storageclass_default=false
openshift_storage_glusterfs_is_native=false
openshift_storage_glusterfs_heketi_is_native=true
openshift_storage_glusterfs_heketi_executor=ssh
openshift_storage_glusterfs_heketi_ssh_port=22
openshift_storage_glusterfs_heketi_ssh_user=root
openshift_storage_glusterfs_heketi_ssh_sudo=false
openshift_storage_glusterfs_heketi_ssh_keyfile="/root/.ssh/id_rsa"

[masters]
172.31.0.12

[nodes]
172.31.0.12 openshift_schedulable=True openshift_node_group_name=node0 openshift_node_group_name=node-config-infra
172.31.0.22 openshift_schedulable=True openshift_node_group_name=node1 openshift_node_group_name=node-config-infra
172.31.0.17 openshift_schedulable=True openshift_node_group_name=node2 openshift_node_group_name=node-config-infra

[etcd]
172.31.0.12

[glusterfs]
172.31.0.12 glusterfs_devices='[ "/dev/nvme1" ]' glusterfs_ip=172.31.0.12
172.31.0.22 glusterfs_devices='[ "/dev/nvme1" ]' glusterfs_ip=172.31.0.22
172.31.0.17 glusterfs_devices='[ "/dev/nvme1" ]' glusterfs_ip=172.31.0.17

[glusterfs-registry]
172.31.0.12 glusterfs_devices='[ "/dev/nvme2" ]'
172.31.0.22 glusterfs_devices='[ "/dev/nvme2" ]'
172.31.0.17 glusterfs_devices='[ "/dev/nvme2" ]'
==== EOF ====

ansible-playbook -i ~/inv.ini /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml
ansible-playbook -i ~/inv.ini /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml
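A possible workaround that might be worth trying (an assumption based on the os_firewall role variables in openshift-ansible, not verified against this release) is to tell the installer to manage firewalld instead of iptables so it does not mask the service on the CRS nodes, e.g. in [OSEv3:vars]:

# Assumed os_firewall role variables -- please verify against your openshift-ansible version
os_firewall_use_firewalld=True
# or, to keep the installer from touching the firewall on a given host/group at all:
# os_firewall_enabled=False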
Erin: So you're still seeing the behavior where the firewall service is being mysteriously disabled on random nodes? Are you finding any consistency to this behavior?
Closing this BZ since it has no attached customer cases. If this is still a problem in current OCP 3.x releases, please reopen it with additional information.
please close - no longer supported