Description of problem: glusterd fails to start in the container because its rpcbind dependency fails. rpcbind fails because, as of rpcbind-0.2.0-38.el7.x86_64, rpcbind.socket on the host explicitly binds port 111 on the host's IPs at boot; in the previous release it bound only on request. Since the container runs with --net=host, the containerized rpcbind cannot bind the already-occupied port. Had we used the host's (OpenShift node's) rpcbind for an NFS server, we would have hit this earlier.

Workarounds:
1) The image works if we stop rpcbind.socket and rpcbind.service on the host.
2) Disable rpcbind in the container, since we do not use NFS in the gluster containers yet (this approach was taken in the cns-3.4 release on image rhgs3/rhgs-server-rhel7:3.1.3; see BZ#1397255 for more information).

Expectation:
*) Make rpcbind work in containers.
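To illustrate the conflict and workaround 1, a quick check on an affected node might look like this (the commands below are illustrative, not taken from the original report):

#########
# Show what owns port 111 on the host; with rpcbind-0.2.0-38.el7,
# rpcbind.socket binds it at boot, so rpcbind inside a --net=host
# container cannot bind it again:
ss -tulpn | grep -w 111

# Workaround 1: stop the host units so the port is freed:
systemctl stop rpcbind.socket rpcbind.service
#########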
Relates to RHEL7 BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1427806
*** Bug 1457617 has been marked as a duplicate of this bug. ***
*** Bug 1457126 has been marked as a duplicate of this bug. ***
According to the findings in RHEL BZ #1427806, this is solved as follows:
1) The rpcbind service has to run on the host (on the nodes that are supposed to run gluster containers).
2) We need a small change to the gluster container Dockerfile. This change is already done and will ship with the next gluster container image build.
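A minimal sketch of the resulting setup, assuming the container shares the host network (the exact Dockerfile change is not shown here):

#########
# On the host (each node that runs gluster containers):
systemctl enable rpcbind.service
systemctl start rpcbind.service

# From inside a --net=host container, the host's rpcbind should answer:
rpcinfo -p 127.0.0.1
#########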
OK, with a little more analysis the situation looks like this:

1) A change on the host (OCP/RHEL) has led to rpcbind being started by default, while it was not started by default before.

2) Originally the gluster containers started rpcbind, but in the CNS builds this dependency was removed for the RHGS 3.2.0 release in February, since gluster containers don't need rpcbind.
==> I.e. CNS 3.5 images should not have this problem!
==> I don't know how OCP QE could have run into this issue.

3) Now, in preparation for CNS 3.6, the new gluster-blockd component in the RHGS containers does require rpcbind. Hence, testing with the new CNS 3.6 containers, we hit the problem due to the changed host behavior.
==> The solution is to *not* start rpcbind in the container, ever, and to always rely on rpcbind running on the host (see the sanity check sketched below).
==> Changed CNS 3.6 gluster builds are expected tomorrow (July 26).

Summary questions for Brenton:

* Is it true that OCP 3.6 has the changed behavior of always starting rpcbind on the host?

* How has OCP QE possibly hit this issue for OCP 3.6? Have they possibly been using upstream images instead of RHGS/CNS images?
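As a sanity check for points 2) and 3), one way to see whether a given image enables rpcbind at boot (the --entrypoint override, image reference, and path are illustrative, not from this report):

#########
# List units wanted by multi-user.target inside the image; rpcbind.service
# should be absent from a CNS image that relies on the host's rpcbind:
docker run --rm --entrypoint ls rhgs3/rhgs-server-rhel7 /etc/systemd/system/multi-user.target.wants/
#########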
(In reply to Michael Adam from comment #8)
> Summary questions for Brenton:
>
> * Is it true that OCP 3.6 has the changed behavior of always
>   starting rpcbind on the host?

According to the code, rpcbind is only started by the openshift_storage_nfs_lvm role, which means that in the CNS situation it won't be started:

$ grep -nir "rpcbind"
roles/openshift_storage_nfs_lvm/tasks/nfs.yml:6:- name: Start rpcbind
roles/openshift_storage_nfs_lvm/tasks/nfs.yml:8:  name: rpcbind

> * How has OCP QE possibly hit this issue for OCP 3.6?
>   Have they possibly been using upstream images instead of RHGS/CNS images?

Our QE is using brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7, with the tag "latest" by default. In BZ #1457126, the latest was 3.3.0-7, which has the problem. Now the latest is 3.3.0-9, which works well.

Thank you :)
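Since "latest" can move underneath a test run, pinning the fixed tag explicitly avoids this class of surprise; for example (registry and tag taken from the comment above):

#########
docker pull brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7:3.3.0-9
#########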
With the latest CNS 3.6 builds, rpcbind now runs on the host instead of in the containers.

Verified in build: cns-deploy-5.0.0-34.el7rhgs.x86_64
The following steps now seem to be a prerequisite, and the same has been documented in our CNS 3.6 guide [1] as well:

#########
Execute the following commands to enable and run rpcbind on all the nodes hosting the gluster pod:

# systemctl add-wants multi-user rpcbind.service
# systemctl enable rpcbind.service
# systemctl start rpcbind.service
#########

[1] https://access.qa.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/container-native_storage_for_openshift_container_platform/#chap-Documentation-Red_Hat_Gluster_Storage_Container_Native_with_OpenShift_Platform-Setting_the_environment-Preparing_RHOE
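A quick way to confirm the prerequisite took effect on each node (an illustrative verification, not part of the documented steps):

#########
# rpcbind should be active and registered with the local portmapper:
systemctl is-active rpcbind.service
rpcinfo -p localhost
#########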
doc text looks good to me
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2877