Bug 1193795 - [OSEv3 Beta D1] Restart of openshift-sdn-node daemon on Node
Summary: [OSEv3 Beta D1] Restart of openshift-sdn-node daemon on Node
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.0.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Rajat Chopra
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-02-18 08:54 UTC by Frederic Hornain
Modified: 2015-04-29 13:49 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-18 16:48:55 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1194467 0 low CLOSED openshift-sdn-node takes forever to stop/restart 2021-02-22 00:41:40 UTC

Internal Links: 1194467

Description Frederic Hornain 2015-02-18 08:54:43 UTC
Description of problem:

On an OSE v3 Beta node, when openshift-sdn-node is already running, the following command hangs forever:

systemctl restart openshift-sdn-node

Kind Regards
Frederic

Comment 2 Scott Dodson 2015-02-18 14:36:54 UTC
This happens whenever openshift-sdn-node is trying to reach the master and failing. In my experience it eventually kills the process and restarts. Does that happen for you as well?

Comment 3 Frederic Hornain 2015-02-18 15:39:00 UTC
Hi Scott,

Well, your previous message gave me a hint.

In /var/log/messages on my OSEv3 master, which is also a node, I found the following:

Feb 18 10:34:41 ose3-master openshift-sdn: E0218 10:34:41.828239 00908 controller.go:199] Could not find an allocated subnet for this minion (ose3-master.myredhat.com)(501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]). Waiting..

I am now going to investigate in that area.
If you have any idea why I am getting these messages, please let me know.

Thanks for your support and your time.

KR
Frederic

Comment 4 Brenton Leanhardt 2015-02-18 15:44:09 UTC
The "All the given peers are not reachable" message comes from the embedded etcd client. If it can't reach the master, it can't retrieve any subnet configuration.
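
A quick way to confirm whether a node can reach the master's etcd port at all is a plain TCP probe. The following is a minimal sketch and not part of openshift-sdn: the helper names `parse_master_url` and `can_connect` are hypothetical, and the 3-second timeout is an arbitrary choice. A successful TCP connection only shows that the port is reachable; it does not prove that the URL scheme (http or https) matches what the master serves.

```python
import socket
from urllib.parse import urlsplit


def parse_master_url(url):
    """Split a MASTER_URL-style value into (scheme, host, port)."""
    parts = urlsplit(url)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {url!r}")
    return parts.scheme, parts.hostname, parts.port


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, `can_connect(*parse_master_url(url)[1:])` could be run on the node with `url` set to the value configured as MASTER_URL.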

Comment 5 Frederic Hornain 2015-02-18 15:53:36 UTC
Indeed, I noticed that my openshift-sdn-master did not start correctly.

I do not know why.

Here is the service status after I ran "systemctl start openshift-sdn-master":

systemctl status -l openshift-sdn-master
openshift-sdn-master.service - OpenShift SDN Master
   Loaded: loaded (/usr/lib/systemd/system/openshift-sdn-master.service; enabled)
   Active: inactive (dead) since Wed 2015-02-18 10:49:09 EST; 11s ago
     Docs: https://github.com/openshift/openshift-sdn
  Process: 2297 ExecStart=/usr/bin/openshift-sdn $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 2297 (code=exited, status=0/SUCCESS)

Feb 18 10:49:09 ose3-master.myredhat.com systemd[1]: Starting OpenShift SDN Master...
Feb 18 10:49:09 ose3-master.myredhat.com systemd[1]: Started OpenShift SDN Master.
Feb 18 10:49:09 ose3-master.myredhat.com openshift-sdn[2297]: I0218 10:49:09.854025 02297 main.go:108] Installing signal handlers
Feb 18 10:49:09 ose3-master.myredhat.com openshift-sdn[2297]: I0218 10:49:09.858536 02297 controller.go:47] Self IP : 192.168.122.20
Feb 18 10:49:09 ose3-master.myredhat.com openshift-sdn[2297]: E0218 10:49:09.859459 02297 controller.go:73] Error in initializing/fetching subnets - 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

Do you have any idea what is wrong here?

Comment 6 Frederic Hornain 2015-02-18 15:56:58 UTC
OK, I found why.

At first glance, it seems that my openshift-master.service was not started correctly.

I am going to double-check now and get back to you.
Thanks for your time.

KR
/f

Comment 7 Frederic Hornain 2015-02-18 15:59:07 UTC
FYI, everything is now running correctly on my standalone server, which acts as both master and minion/node.

I am going to check my other servers, which act as minions/nodes.

[root@ose3-master ~]# systemctl | grep openshift
openshift-master.service                                                            loaded active running   OpenShift Master
openshift-node.service                                                              loaded active running   OpenShift Node
openshift-sdn-master.service                                                        loaded active running   OpenShift SDN Master
openshift-sdn-node.service                                                          loaded active running   OpenShift SDN Node

KR
/f

Comment 8 Scott Dodson 2015-02-18 16:09:02 UTC
Frederic,

If this was just after a reboot, yesterday we pushed out new packages that made service startup after a reboot better for openshift-master, though perhaps not perfect. Please update to openshift-0.3.0-0.git.147.4be9abc.el7ose as it should help with that scenario.

I think this bug should still be considered on its own: openshift-sdn-node should restart more cleanly while attempting to connect to etcd, if possible.

--
Scott

Comment 9 Frederic Hornain 2015-02-18 16:18:01 UTC
Hi Scott,

Here are the packages I am using:

openshift-0.3.0-0.git.146.c125b05.el7ose.x86_64
openshift-node-0.3.0-0.git.146.c125b05.el7ose.x86_64
openshift-sdn-0.4-1.git.0.4809789.el7ose.x86_64
openshift-sdn-node-0.4-1.git.0.4809789.el7ose.x86_64
openshift-master-0.3.0-0.git.146.c125b05.el7ose.x86_64
tuned-profiles-openshift-node-0.3.0-0.git.146.c125b05.el7ose.x86_64
openshift-sdn-master-0.4-1.git.0.4809789.el7ose.x86_64

I tried to update them, but I was told there were no packages marked for update.

Meanwhile, I have tested them on a standalone server acting as master and node.
It seems to work.

But on my other servers acting as nodes, I still get the following message:

Could not find an allocated subnet for this minion (ose3-node1.myredhat.com)(501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]). Waiting..

On my master/node server I have the following services up and running:

openshift-master.service                                                            loaded active running   OpenShift Master
openshift-node.service                                                              loaded active running   OpenShift Node
openshift-sdn-master.service                                                        loaded active running   OpenShift SDN Master
openshift-sdn-node.service                                                          loaded active running   OpenShift SDN Node

On my node-only server I have the following services up and running:

openshift-node.service                                                              loaded active running   OpenShift Node
openshift-sdn-node.service                                                          loaded active running   OpenShift SDN Node

I am going to continue to investigate.

KR
/f

Comment 10 Frederic Hornain 2015-02-18 16:48:55 UTC
OK, I think I found where the problem was.

I had set the MASTER_URL value in the openshift-sdn-node configuration file based on the example provided in that file.

I copied and pasted the example, including its https scheme, and that was my mistake (see below):
# Example:
#   MASTER_URL=https://10.0.0.1:4001
MASTER_URL=https://192.168.122.20:4001

The solution was to replace https with http, as follows:

# Example:
#   MASTER_URL=https://10.0.0.1:4001
MASTER_URL=http://192.168.122.20:4001
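
The mix-up above (a commented https example copied into the live setting) can be caught by reading the effective MASTER_URL and checking its scheme. This is a minimal sketch, assuming a sysconfig-style file of KEY=value lines with # comments; `read_master_url` and `with_scheme` are hypothetical helper names, and whether the endpoint expects http or https depends on how the master is configured.

```python
from urllib.parse import urlsplit


def read_master_url(config_text):
    """Return the last uncommented MASTER_URL value, or None if absent."""
    value = None
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented examples
        key, sep, rest = line.partition("=")
        if sep and key.strip() == "MASTER_URL":
            value = rest.strip().strip('"\'')
    return value


def with_scheme(url, scheme):
    """Return url with its scheme replaced, leaving host and port untouched."""
    return urlsplit(url)._replace(scheme=scheme).geturl()


# File contents as quoted in this report (before the fix).
sample = """\
# Example:
#   MASTER_URL=https://10.0.0.1:4001
MASTER_URL=https://192.168.122.20:4001
"""

current = read_master_url(sample)
print(current)                       # https://192.168.122.20:4001
print(with_scheme(current, "http"))  # http://192.168.122.20:4001
```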

The logs now look like this, which is normal:

Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: + grep -q '^OPTIONS='\''--insecure-registry=0.0.0.0/0 -b=lbr0 --mtu=1450 --selinux-enabled'\''' /etc/sysconfig/docker
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: + cat
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: + systemctl daemon-reload
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: + systemctl restart docker.service
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.922853 02622 controller.go:275] Output of adding table=0,cookie=0x32,priority=200,ip,in_port=9,nw_dst=10.1.0.0/24,actions=set_field:192.168.122....ut:10:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.939617 02622 controller.go:277] Output of adding table=0,cookie=0x32,priority=200,arp,in_port=9,nw_dst=10.1.0.0/24,actions=set_field:192.168.122...ut:10:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.943949 02622 controller.go:275] Output of adding table=0,cookie=0x97,priority=200,ip,in_port=9,nw_dst=10.1.1.0/24,actions=set_field:192.168.122....ut:10:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.958295 02622 controller.go:277] Output of adding table=0,cookie=0x97,priority=200,arp,in_port=9,nw_dst=10.1.1.0/24,actions=set_field:192.168.122...ut:10:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.962123 02622 controller.go:268] Output of adding table=0,cookie=0xd8,priority=200,ip,in_port=10,nw_dst=10.1.2.0/24,actions=output:9:  (<nil>)
Feb 18 11:45:06 ose3-node2.myredhat.com openshift-sdn[2622]: I0218 11:45:06.965916 02622 controller.go:270] Output of adding table=0,cookie=0xd8,priority=200,arp,in_port=10,nw_dst=10.1.2.0/24,actions=output:9:  (<nil>)


N.B.
@Scott
You are right.
When openshift-sdn-node hangs, the only way to restart it is to kill it first.
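
The kill-then-start workaround can be scripted. This is a minimal sketch, assuming `systemctl` is on the PATH and the script runs with sufficient privileges; the function name `restart_or_kill`, the 30-second timeout, and the choice of SIGKILL are illustrative assumptions, not something the packages provide.

```python
import subprocess


def restart_or_kill(unit, timeout=30, run=subprocess.run):
    """Try a normal restart; if it does not return in time, kill the unit's
    processes and start it again. Returns the list of commands attempted."""
    attempted = []

    def call(cmd, **kwargs):
        attempted.append(cmd)
        return run(cmd, check=False, **kwargs)

    try:
        call(["systemctl", "restart", unit], timeout=timeout)
    except subprocess.TimeoutExpired:
        # The restart is stuck, for example waiting on an unreachable master.
        call(["systemctl", "kill", "--signal=SIGKILL", unit])
        call(["systemctl", "start", unit])
    return attempted
```

Calling `restart_or_kill("openshift-sdn-node")` would attempt the normal restart first and fall back only on timeout. The `run` parameter exists so the logic can be exercised without touching real services.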

I am closing this ticket.

Kind Regards
Frederic

