Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1193795

Summary:	[OSEv3 Beta D1] Restart of openshift-sdn-node daemon on Node
Product:	OpenShift Container Platform	Reporter:	Frederic Hornain <fhornain>
Component:	Networking	Assignee:	Rajat Chopra <rchopra>
Status:	CLOSED NOTABUG	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	3.0.0	CC:	bleanhar, dmcphers, ekuric, jokerman, libra-onpremise-devel, mmccomas
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-02-18 16:48:55 UTC	Type:	Bug

Description Frederic Hornain 2015-02-18 08:54:43 UTC

Description of problem:

On an OSE v3 Beta node, when openshift-sdn-node is running and you run the following command, it hangs forever:

systemctl restart openshift-sdn-node

Kind Regards
Frederic

Comment 2 Scott Dodson 2015-02-18 14:36:54 UTC

This happens whenever openshift-sdn-node is attempting to reach the master but failing to do so. It will eventually kill the process and restart, or at least that's been my experience. Does that happen for you as well?

Comment 3 Frederic Hornain 2015-02-18 15:39:00 UTC

Hi Scott,

Well, your previous message gave me a hint.

In /var/log/messages on my OSEv3 master, which is also a node, I found the following message:

Feb 18 10:34:41 ose3-master openshift-sdn: E0218 10:34:41.828239 00908 controller.go:199] Could not find an allocated subnet for this minion (ose3-master.myredhat.com)(501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]). Waiting..

I am now going to investigate in that area.
If you have any idea why I am getting such messages, please feel free to let me know.

Thanks for your support and your time.

KR
Frederic

Comment 4 Brenton Leanhardt 2015-02-18 15:44:09 UTC

The "All the given peers are not reachable" message is coming from the embedded etcd client. If it can't reach the master, it won't be able to find out any subnet configuration.
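
A quick way to narrow this down from a node is to check whether the master's etcd port accepts TCP connections at all. The following is a minimal sketch, not part of openshift-sdn; the function name endpoint_reachable is hypothetical, and the URL in the usage note is the master address that appears later in this report. Note that it only tests the TCP layer: a wrong scheme (https against a plain-http endpoint) would still pass this check.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(master_url, timeout=3.0):
    """Return True if a TCP connection to the host:port in master_url succeeds.

    This does not speak HTTP or TLS, so it cannot detect a scheme mismatch.
    """
    parts = urlsplit(master_url)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {master_url!r}")
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, endpoint_reachable("http://192.168.122.20:4001") run on a node returns False when nothing is listening on the master's port 4001 or a firewall blocks it.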

Comment 5 Frederic Hornain 2015-02-18 15:53:36 UTC

Indeed, I noticed that my openshift-sdn-master did not start correctly.

I do not know why.

Here is the status output after I ran "systemctl start openshift-sdn-master":

systemctl status -l openshift-sdn-master
openshift-sdn-master.service - OpenShift SDN Master
   Loaded: loaded (/usr/lib/systemd/system/openshift-sdn-master.service; enabled)
   Active: inactive (dead) since Wed 2015-02-18 10:49:09 EST; 11s ago
     Docs: https://github.com/openshift/openshift-sdn
  Process: 2297 ExecStart=/usr/bin/openshift-sdn $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 2297 (code=exited, status=0/SUCCESS)

Feb 18 10:49:09 ose3-master.myredhat.com systemd[1]: Starting OpenShift SDN Master...
Feb 18 10:49:09 ose3-master.myredhat.com systemd[1]: Started OpenShift SDN Master.
Feb 18 10:49:09 ose3-master.myredhat.com openshift-sdn[2297]: I0218 10:49:09.854025 02297 main.go:108] Installing signal handlers
Feb 18 10:49:09 ose3-master.myredhat.com openshift-sdn[2297]: I0218 10:49:09.858536 02297 controller.go:47] Self IP : 192.168.122.20
Feb 18 10:49:09 ose3-master.myredhat.com openshift-sdn[2297]: E0218 10:49:09.859459 02297 controller.go:73] Error in initializing/fetching subnets - 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

Do you have any idea what is wrong here?
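
When the same error is buried in a longer log, it can help to pull out only the lines that report unreachable etcd peers. The sketch below is an illustration only; the helper name find_unreachable_errors is hypothetical, and the pattern is based solely on the message text quoted in this report.

```python
import re

# Matches the etcd client error quoted in this report, e.g.
# "501: All the given peers are not reachable (...)".
ETCD_UNREACHABLE = re.compile(
    r"(?P<code>\d+): All the given peers are not reachable"
)


def find_unreachable_errors(lines):
    """Yield (line_number, error_code) for lines reporting unreachable etcd peers."""
    for number, line in enumerate(lines, start=1):
        match = ETCD_UNREACHABLE.search(line)
        if match:
            yield number, int(match.group("code"))
```

It accepts any iterable of strings, so an open file object for /var/log/messages can be passed directly.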

Comment 6 Frederic Hornain 2015-02-18 15:56:58 UTC

OK, I found why.

At first glance, it seems that my openshift-master.service had not started correctly.

I am going to double-check now and get back to you.
Thanks for your time.

KR
/f

Comment 7 Frederic Hornain 2015-02-18 15:59:07 UTC

FYI, now everything is running correctly on my standalone server which acts as master and minion/node.

I am going to check my other servers, which act as minions/nodes.

[root@ose3-master ~]# systemctl | grep openshift
openshift-master.service                                                            loaded active running   OpenShift Master
openshift-node.service                                                              loaded active running   OpenShift Node
openshift-sdn-master.service                                                        loaded active running   OpenShift SDN Master
openshift-sdn-node.service                                                          loaded active running   OpenShift SDN Node

KR
/f

Comment 8 Scott Dodson 2015-02-18 16:09:02 UTC

Frederic,

If this was just after a reboot, yesterday we pushed out new packages that made service startup after a reboot better for openshift-master, though perhaps not perfect. Please update to openshift-0.3.0-0.git.147.4be9abc.el7ose as it should help with that scenario.

I think this bug should still be considered on its own: openshift-sdn-node should restart more cleanly while attempting to connect to etcd, if possible.

--
Scott

Comment 9 Frederic Hornain 2015-02-18 16:18:01 UTC

Hi Scott,

Here are the packages I am using:

openshift-0.3.0-0.git.146.c125b05.el7ose.x86_64
openshift-node-0.3.0-0.git.146.c125b05.el7ose.x86_64
openshift-sdn-0.4-1.git.0.4809789.el7ose.x86_64
openshift-sdn-node-0.4-1.git.0.4809789.el7ose.x86_64
openshift-master-0.3.0-0.git.146.c125b05.el7ose.x86_64
tuned-profiles-openshift-node-0.3.0-0.git.146.c125b05.el7ose.x86_64
openshift-sdn-master-0.4-1.git.0.4809789.el7ose.x86_64

I tried to update them, but I was notified that there were no packages marked for update.

Meanwhile, I have tested them on a standalone server acting as Master and Node.
It seems to work.

But on my other servers, which act as nodes only, I still get the following message:

Could not find an allocated subnet for this minion (ose3-node1.myredhat.com)(501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]). Waiting..

On my master/node server, I have the following services up and running:

openshift-master.service                                                            loaded active running   OpenShift Master
openshift-node.service                                                              loaded active running   OpenShift Node
openshift-sdn-master.service                                                        loaded active running   OpenShift SDN Master
openshift-sdn-node.service                                                          loaded active running   OpenShift SDN Node

On my Node-only server, I have the following services up and running:

openshift-node.service                                                              loaded active running   OpenShift Node
openshift-sdn-node.service                                                          loaded active running   OpenShift SDN Node

I am going to continue to investigate.

KR
/f

Comment 10 Frederic Hornain 2015-02-18 16:48:55 UTC

OK, I think I found where the problem was.

I had set the MASTER_URL value in the openshift-sdn-node configuration file based on the example provided in that file.

I copied and pasted the example, and that was my mistake (see below):
# Example:
#   MASTER_URL=https://10.0.0.1:4001
MASTER_URL=https://192.168.122.20:4001

The solution was to replace https with http, as follows:

# Example:
#   MASTER_URL=https://10.0.0.1:4001
MASTER_URL=http://192.168.122.20:4001
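
A mistake like this can be caught before restarting the service by reading the MASTER_URL line back and comparing its scheme with what the master's etcd endpoint actually serves. This is a rough sketch under stated assumptions: the helper names read_master_url and scheme_matches are hypothetical, the configuration is passed in as text because the file path is not given here, and "http" is the expected scheme only because that is what worked in this setup.

```python
from urllib.parse import urlsplit


def read_master_url(config_text):
    """Return the last uncommented MASTER_URL value in sysconfig-style text, or None."""
    value = None
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments (including the commented-out example) and non-assignments.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == "MASTER_URL":
            value = raw.strip().strip("\"'")
    return value


def scheme_matches(config_text, expected_scheme):
    """Return True if MASTER_URL is set and uses expected_scheme."""
    url = read_master_url(config_text)
    return url is not None and urlsplit(url).scheme == expected_scheme
```

With the first configuration above, scheme_matches(text, "http") is False; with the corrected one it is True.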

Now the logs look like this, which is normal:

Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: + grep -q '^OPTIONS='\''--insecure-registry=0.0.0.0/0 -b=lbr0 --mtu=1450 --selinux-enabled'\''' /etc/sysconfig/docker
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: + cat
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: + systemctl daemon-reload
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: + systemctl restart docker.service
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.922853 02622 controller.go:275] Output of adding table=0,cookie=0x32,priority=200,ip,in_port=9,nw_dst=10.1.0.0/24,actions=set_field:192.168.122....ut:10:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.939617 02622 controller.go:277] Output of adding table=0,cookie=0x32,priority=200,arp,in_port=9,nw_dst=10.1.0.0/24,actions=set_field:192.168.122...ut:10:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.943949 02622 controller.go:275] Output of adding table=0,cookie=0x97,priority=200,ip,in_port=9,nw_dst=10.1.1.0/24,actions=set_field:192.168.122....ut:10:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.958295 02622 controller.go:277] Output of adding table=0,cookie=0x97,priority=200,arp,in_port=9,nw_dst=10.1.1.0/24,actions=set_field:192.168.122...ut:10:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.962123 02622 controller.go:268] Output of adding table=0,cookie=0xd8,priority=200,ip,in_port=10,nw_dst=10.1.2.0/24,actions=output:9:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.965916 02622 controller.go:270] Output of adding table=0,cookie=0xd8,priority=200,arp,in_port=10,nw_dst=10.1.2.0/24,actions=output:9:  (<nil>)


N.B.
@Scott
You are right.
When the openshift-sdn-node hangs, the only way to restart it is to kill it first.
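
The kill-then-start sequence could be scripted roughly as follows. This is only a sketch: the function name force_restart and the injectable run parameter are hypothetical, and SIGKILL is an assumption about which signal is needed (systemctl kill sends SIGTERM by default).

```python
import subprocess


def force_restart(unit, run=subprocess.run):
    """Kill every process of a systemd unit, then start the unit again.

    Returns the list of commands that were issued.
    """
    kill_command = ["systemctl", "kill", "--signal=SIGKILL", unit]
    start_command = ["systemctl", "start", unit]
    # The kill step may fail if the unit has no running process, so its
    # exit status is not treated as fatal.
    run(kill_command, check=False)
    # A failure to start is a real error and raises CalledProcessError.
    run(start_command, check=True)
    return [kill_command, start_command]
```

Calling force_restart("openshift-sdn-node") as root issues the two systemctl commands in order.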

I am closing this ticket.

Kind Regards
Frederic