Bug 1397255

Summary: container image for gluster on rhel 7.3 - 'rhgs3/rhgs-server-rhel7:latest' - fails to deploy gluster container
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: krishnaram Karthick <kramdoss>
Component: rhgs-server-container    Assignee: Mohamed Ashiq <mliyazud>
Status: CLOSED ERRATA QA Contact: krishnaram Karthick <kramdoss>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: cns-3.4    CC: akhakhar, annair, hchiramm, jarrpa, madam, mliyazud, mzywusko, pprakash, rcyriac, rhs-bugs, rreddy, rtalur, sankarshan, vinug
Target Milestone: ---   
Target Release: CNS 3.4   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rhgs-server-docker-3.1.3-16 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-18 14:59:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1385247, 1396894    
Attachments:
Description                     Flags
output of 'journalctl -xe'      none
netstat output                  none

Description krishnaram Karthick 2016-11-22 04:28:13 UTC
Description of problem:
The latest gluster container image fails to deploy a working gluster container; the glusterd service fails to start.

docker images
REPOSITORY                                                                     TAG                 IMAGE ID            CREATED             SIZE
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7   latest              d670e1d14e3a        20 hours ago        281.9 MB


Snippet of etc-glusterfs-glusterd.vol.log:

    [2016-11-21 08:00:10.318162] I [MSGID: 100030] [glusterfsd.c:2338:main] 0-glusterd: Started running glusterd version 3.7.9 (args: glusterd --xlator-option *.upgrade=on -N)
    [2016-11-21 08:00:10.327302] I [graph.c:269:gf_add_cmdline_options] 0-management: adding option 'upgrade' for volume 'management' with value 'on'
    [2016-11-21 08:00:10.327361] I [MSGID: 106478] [glusterd.c:1383:init] 0-management: Maximum allowed open file descriptors set to 65536
    [2016-11-21 08:00:10.327377] I [MSGID: 106479] [glusterd.c:1432:init] 0-management: Using /var/lib/glusterd as working directory
    [2016-11-21 08:00:10.334289] E [rpc-transport.c:292:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.7.9/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
    [2016-11-21 08:00:10.334303] W [rpc-transport.c:296:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
    [2016-11-21 08:00:10.334310] W [rpcsvc.c:1627:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
    [2016-11-21 08:00:10.334318] E [MSGID: 106243] [glusterd.c:1656:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
    [2016-11-21 08:00:10.335084] E [MSGID: 101032] [store.c:435:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
    [2016-11-21 08:00:10.335102] E [MSGID: 101032] [store.c:435:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
    [2016-11-21 08:00:10.335106] I [MSGID: 106514] [glusterd-store.c:2093:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 30712
    [2016-11-21 08:00:10.335167] I [MSGID: 106194] [glusterd-store.c:3565:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
    [2016-11-21 08:00:10.335213] E [MSGID: 101032] [store.c:435:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/options. [No such file or directory]
    Final graph:
    +------------------------------------------------------------------------------+
      1: volume management
      2:     type mgmt/glusterd
      3:     option rpc-auth.auth-glusterfs on
      4:     option rpc-auth.auth-unix on
      5:     option rpc-auth.auth-null on
      6:     option rpc-auth-allow-insecure on
      7:     option transport.socket.listen-backlog 128
      8:     option upgrade on
      9:     option event-threads 1
     10:     option ping-timeout 0
     11:     option transport.socket.read-fail-log off
     12:     option transport.socket.keepalive-interval 2
     13:     option transport.socket.keepalive-time 10
     14:     option transport-type rdma
     15:     option working-directory /var/lib/glusterd
     16: end-volume
     17:  
    +------------------------------------------------------------------------------+
    [2016-11-21 08:00:10.340245] W [glusterfsd.c:1251:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7ff48729adc5] -->glusterd(glusterfs_sigwaiter+0xe5) [0x7ff488912915] -->glusterd(cleanup_and_exit+0x6b) [0x7ff48891278b] ) 0-: received signum (15), shutting down
    [2016-11-21 08:00:10.340269] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
    [2016-11-21 08:00:10.340840] E [rpcsvc.c:1390:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
    [2016-11-21 08:00:10.340855] E [rpcsvc.c:1516:rpcsvc_program_unregister] 0-rpc-service: portmap unregistration of program failed
    [2016-11-21 08:00:10.340860] E [rpcsvc.c:1554:rpcsvc_program_unregister] 0-rpc-service: Program unregistration failed: GlusterD svc cli, Num: 1238463, Ver: 2, Port: 0
    [2016-11-21 08:00:10.340959] E [rpcsvc.c:1390:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
    [2016-11-21 08:00:10.340967] E [rpcsvc.c:1516:rpcsvc_program_unregister] 0-rpc-service: portmap unregistration of program failed
    [2016-11-21 08:00:10.340971] E [rpcsvc.c:1554:rpcsvc_program_unregister] 0-rpc-service: Program unregistration failed: Gluster Handshake, Num: 14398633, Ver: 2, Port: 0


Version-Release number of selected component (if applicable):
rhgs3/rhgs-server-rhel7:latest image, i.e., image id --> d670e1d14e3a

How reproducible:
100%

Steps to Reproduce:
1. Pull the latest gluster image (d670e1d14e3a)
2. Try to spin up a container from this image (see the sketch below)
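A hypothetical docker-level sketch of the reproduction (illustrative only; the run options such as privileged mode and volume mounts that the image expects are omitted, the container name rhgs-test is made up, and in CNS the pod is normally created from the glusterfs template rather than plain docker run):

    # Pull the affected image and start a container on the host network,
    # then check whether glusterd came up inside it.
    # 'rhgs-test' is an arbitrary name chosen for this sketch.
    docker pull brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7:latest
    docker run -d --net=host --name rhgs-test \
        brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7:latest
    docker exec rhgs-test systemctl status glusterd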


Actual results:
glusterd fails to start, and hence the gluster pod fails to come up.

Expected results:
The gluster pod should come up successfully.

Additional info:
sosreports and other logs shall be attached shortly.

Comment 2 Humble Chirammal 2016-11-22 04:39:40 UTC
Is something wrong with the "/var/lib/glusterd" path?

~~~
[store.c:435:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
    [2016-11-21 08:00:10.335102] E [MSGID: 101032] [store.c:435:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
~~~

Or is something going wrong with portmap?

--snip--
  [2016-11-21 08:00:10.340840] E [rpcsvc.c:1390:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
    [2016-11-21 08:00:10.340855] E [rpcsvc.c:1516:rpcsvc_program_unregister] 0-rpc-service: portmap unregistration of program failed
    [2016-11-21 08:00:10.340860] E [rpcsvc.c:1554:rpcsvc_program_unregister] 0-rpc-service: Program unregistration failed: GlusterD svc cli, Num: 1238463, Ver: 2, Port: 0

--/snip--

Comment 3 krishnaram Karthick 2016-11-22 04:41:34 UTC
Created attachment 1222568 [details]
output of 'journalctl -xe'

Atin helped figure out that there seem to be dependency issues preventing glusterd from starting. Hence, attaching the output of journalctl -xe.

Comment 4 Humble Chirammal 2016-11-22 04:49:42 UTC
Thanks for the logs. It looks like rpcbind is in trouble, and therefore so is glusterd:

-- 
-- Unit sysinit.target has begun starting up.
Nov 22 04:07:25 dhcp42-40.lab.eng.blr.redhat.com systemd[1]: rpcbind.socket failed to listen on sockets: Address already in use
Nov 22 04:07:25 dhcp42-40.lab.eng.blr.redhat.com systemd[1]: Failed to listen on RPCbind Server Activation Socket.
-- Subject: Unit rpcbind.socket has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit rpcbind.socket has failed.
-- 
-- The result is failed.
Nov 22 04:07:25 dhcp42-40.lab.eng.blr.redhat.com systemd[1]: Dependency failed for RPC bind service.
-- Subject: Unit rpcbind.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit rpcbind.service has failed.
-- 
-- The result is dependency.
Nov 22 04:07:25 dhcp42-40.lab.eng.blr.redhat.com systemd[1]: Dependency failed for GlusterFS, a clustered file-system server.
-- Subject: Unit glusterd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit glusterd.service has failed.

Comment 7 Humble Chirammal 2016-11-22 07:35:03 UTC
Ashiq, look at this angle https://bugzilla.redhat.com/show_bug.cgi?id=1357023

Comment 8 Humble Chirammal 2016-11-22 10:07:26 UTC
This looks to be the exact cause of the error: "rpcbind.socket failed to listen on sockets: Address already in use". If I'm correct, rpcbind listens on port 111. Is that port already in use on the host?
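For reference, a quick host-side check for whether something already owns port 111 could look like this (a sketch; either tool works if it is installed on the node):

    # List listeners on port 111 in the host network namespace; the owning
    # process (typically rpcbind) shows up in the last column.
    ss -tulnp | grep -w ':111'
    # or, with the older tooling:
    netstat -tulnp | grep -w ':111'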

Comment 9 Humble Chirammal 2016-11-22 10:10:47 UTC
Karthick, please attach the netstat output from this system.

Comment 10 krishnaram Karthick 2016-11-22 10:27:33 UTC
Created attachment 1222651 [details]
netstat output

Comment 12 Humble Chirammal 2016-11-22 11:16:31 UTC
I disabled the rpcbind.socket unit on the host.
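Roughly, the host-side part amounts to the following (a sketch, not necessarily the exact commands I ran):

    # On the OpenShift node, stop rpcbind so it no longer holds port 111 in the
    # network namespace that the gluster container shares via --net=host.
    systemctl stop rpcbind.socket rpcbind.service
    # Optionally keep it from coming back after a reboot:
    systemctl disable rpcbind.socket rpcbind.service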
Then I deployed the gluster pod:

[root@dhcp42-16 ~]# oc process glusterfs -v GLUSTERFS_NODE=dhcp42-40.lab.eng.blr.redhat.com | oc create -f -
deploymentconfig "glusterfs-dc-dhcp42-40.lab.eng.blr.redhat.com" created

'oc get pods' reports the pod as Running; however, the READY status has not been updated:

[root@dhcp42-16 ~]# oc get pods
NAME                                                     READY     STATUS    RESTARTS   AGE
glusterfs-dc-dhcp42-40.lab.eng.blr.redhat.com-1-54mr8    0/1       Running   0          8m
glusterfs-dc-dhcp42-40.lab.eng.blr.redhat.com-1-deploy   1/1       Running   0          8m
glusterfs-dc-dhcp47-72.lab.eng.blr.redhat.com-1-deploy   0/1       Error     0          7h
storage-project-router-1-btkke                           1/1       Running   0          1d

But I can exec into the pod and run the glusterd status command:

[root@dhcp42-40 /]# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2016-11-22 11:01:08 UTC; 8min ago
  Process: 159 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 160 (glusterd)
   CGroup: /system.slice/docker-e0ff7fb62a21e62c316ef7fd7d00b05d4640794f430fee5b65c0885f1713024a.scope/system.slice/glusterd.service
           └─160 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Nov 22 11:01:06 dhcp42-40.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 22 11:01:08 dhcp42-40.lab.eng.blr.redhat.com systemd[1]: Started GlusterFS, a clustered file-system server.
[root@dhcp42-40 /]# glusterd --version
glusterfs 3.7.9 built on Aug 24 2016 02:50:44
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. <http://www.redhat.com/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
[root@dhcp42-40 /]#

Comment 13 Mohamed Ashiq 2016-11-22 11:32:19 UTC
Hi,

In release rpcbind-0.2.0-38.el7.x86_64, rpcbind.socket explicitly binds to port 111 on the host's IPs; in the previous release it bound only on request. If we had used the host's (OpenShift node's) rpcbind for an NFS server, we would have hit this earlier. This surfaces because we run the container with --net=host.

The new image will work if we stop rpcbind.socket and rpcbind.service on the host.

http://pastebin.test.redhat.com/432506

There is another bug [1] in rpcbind that is triggered when only rpcbind is updated; I hit it when I updated rpcbind alone on the host.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1396291

--Ashiq

Comment 14 Humble Chirammal 2016-11-23 06:58:43 UTC
As we are not using the gluster NFS service in the container, we could *try* to disable it with a sed command in the Dockerfile before glusterd starts. But this needs to be tested explicitly.
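One possible shape for that sed (untested and purely hypothetical; the exact directives in the shipped glusterd.service unit may differ) would be to strip the rpcbind dependency from the unit file baked into the image:

    # Hypothetical Dockerfile snippet - an illustration, not the shipped fix.
    # Remove the rpcbind dependency from glusterd.service so that starting
    # glusterd inside the container does not activate rpcbind.socket, which
    # conflicts with the host's rpcbind under --net=host.
    RUN sed -i -e '/^Requires=rpcbind\.service/d' \
               -e '/^After=/s/ *rpcbind\.service//' \
        /usr/lib/systemd/system/glusterd.service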

Comment 15 Humble Chirammal 2016-11-23 07:03:20 UTC
(In reply to Humble Chirammal from comment #14)
> As we are not using gluster nfs service in the container, we could *try* to
> disable it using a sed command in Dockerfile before the glusterd start. But,
> this need to be tested explictitly.

We need to see how this affects the container upgrade, as the unit file will fall back to its previous state if we upgrade from within the container, which is not really recommended. If we container upgrade it will be taken care.

Comment 16 Humble Chirammal 2016-11-23 09:56:37 UTC
(In reply to Humble Chirammal from comment #15)
> (In reply to Humble Chirammal from comment #14)
> > As we are not using gluster nfs service in the container, we could *try* to
> > disable it using a sed command in Dockerfile before the glusterd start. But,
> > this need to be tested explictitly.
> 
> Need to see how this effect the container upgrade as the unit file will fall
> back to previous state if we upgrade from the container, this is not really
> recommended. If we container upgrade it will be taken care.

"If we container upgrade it will be taken care." 
The above should have read: "if we do a container upgrade, it will be taken care of."

Comment 21 krishnaram Karthick 2016-11-29 05:18:27 UTC
Verified with the following docker image.

REPOSITORY                                                                     TAG                 IMAGE ID            CREATED             SIZE
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7   latest              f019b0d1abd5        4 days ago          281.9 MB

other build info:
[root@dhcp47-115 ~]# openshift version
openshift v3.4.0.29+ca980ba
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
[root@dhcp47-115 ~]# heketi-cli version
Error: unknown command "version" for "heketi-cli"
Run 'heketi-cli --help' for usage.
[root@dhcp47-115 ~]# heketi-cli --version
heketi-cli 3.1.0
[root@dhcp47-115 ~]# rpm -qa | grep 'heketi'
heketi-client-3.1.0-3.el7rhgs.x86_64
heketi-templates-3.1.0-3.el7rhgs.x86_64
[root@dhcp47-115 ~]# docker images

Gluster cluster setup is successful using the above image and builds. Moving the bug to verified.

Comment 22 Mohamed Ashiq 2016-11-29 05:25:26 UTC
(In reply to krishnaram Karthick from comment #21)
> Verified with the following docker image.
> 
> REPOSITORY                                                                  
> TAG                 IMAGE ID            CREATED             SIZE
> brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7
> latest              f019b0d1abd5        4 days ago          281.9 MB
> 
> other build info:
> [root@dhcp47-115 ~]# openshift version
> openshift v3.4.0.29+ca980ba
> kubernetes v1.4.0+776c994
> etcd 3.1.0-rc.0
> [root@dhcp47-115 ~]# heketi-cli version
> Error: unknown command "version" for "heketi-cli"
> Run 'heketi-cli --help' for usage.
> [root@dhcp47-115 ~]# heketi-cli --version
> heketi-cli 3.1.0
> [root@dhcp47-115 ~]# rpm -qa | grep 'heketi'
> heketi-client-3.1.0-3.el7rhgs.x86_64
> heketi-templates-3.1.0-3.el7rhgs.x86_64
> [root@dhcp47-115 ~]# docker images
> 
> gluster cluster setup is successful using the above image and builds. moving
> the bug to verified.

Awesome! thanks. :)

Comment 24 errata-xmlrpc 2017-01-18 14:59:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0149