1317031 – AWS HA installation fails with node failed to obtain clusternetwork (because the master is crashing)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jason DeTiberus
QA Contact:	Xiaoli Tian
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-03-11 18:18 UTC by Andrew Butcher
Modified:	2016-09-07 20:38 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-09-07 20:38:20 UTC
Target Upstream Version:
Embargoed:

Attachments
NetworkManager debug logs - HA ELB 1.0.0-14 (240.47 KB, text/plain) 2016-03-11 18:22 UTC, Andrew Butcher
Node logs - HA ELB 1.0.0-14 (10.59 KB, text/plain) 2016-03-11 18:23 UTC, Andrew Butcher
NetworkManager debug logs - HA ELB 1.0.6-27 (342.78 KB, text/plain) 2016-03-11 18:23 UTC, Andrew Butcher
Node logs - HA ELB 1.0.6-27 (14.21 KB, text/plain) 2016-03-11 18:24 UTC, Andrew Butcher
NetworkManager debug logs - Single master no ELB 1.0.0-14 (600.84 KB, text/plain) 2016-03-11 18:25 UTC, Andrew Butcher
Node logs - Single master no ELB 1.0.0-14 (18.40 KB, text/plain) 2016-03-11 18:26 UTC, Andrew Butcher
sdn debug output (150.22 KB, application/x-gzip) 2016-03-21 21:20 UTC, Andrew Butcher
sdn debug output - HA ELB 1.0.6-27 (775.54 KB, application/x-gzip) 2016-03-22 16:53 UTC, Andrew Butcher
controllers log w/ elb and disabled openshift sdn (274.87 KB, text/x-vhdl) 2016-03-29 14:32 UTC, Andrew Butcher

Description Andrew Butcher 2016-03-11 18:18:58 UTC

Description of problem:
I've created a Native HA OpenShift cluster in AWS with a TCP ELB in front of the master API. Node SDN will consistently fail as follows with NetworkManager-1.0.0-14.git20150121.b4ea599c.el7.x86_64.

Mar 11 10:58:34 ip-172-18-0-84.ec2.internal atomic-openshift-node[25314]: E0311 10:58:34.109473 25314 common.go:197] Failed to obtain ClusterNetwork: Get https://openshift.internal.abutchernosat1.aws.paas.ninja/oapi/v1/clusternetworks/default: EOF
Mar 11 10:58:34 ip-172-18-0-84.ec2.internal atomic-openshift-node[25314]: F0311 10:58:34.109490 25314 node.go:175] SDN Node failed: Get https://openshift.internal.abutchernosat1.aws.paas.ninja/oapi/v1/clusternetworks/default: EOF
Mar 11 10:58:34 ip-172-18-0-84.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a

The same configuration will succeed after updating to NetworkManager-1.0.6-27.el7.x86_64.

A single master cluster with no ELB will succeed with NetworkManager-1.0.0-14.git20150121.b4ea599c.el7.x86_64.

Version-Release number of selected component (if applicable):
ami-10663b78 RHEL-7.1_HVM_GA-20150225-x86_64-1-Access2-GP2
redhat-release-server-7.1-1.el7.x86_64
NetworkManager-1.0.0-14.git20150121.b4ea599c.el7.x86_64
atomic-openshift-node-3.1.1.6-4.git.21.cd70c35.el7aos.x86_64
atomic-openshift-3.1.1.6-4.git.21.cd70c35.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.1.6-4.git.21.cd70c35.el7aos.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create HA OpenShift cluster in AWS based on ami-10663b78 with a TCP ELB in front of the API.
2. Use openshift-ansible to install OpenShift. This will fail when starting and enabling node services.

Actual results:
Node SDN fails.

Expected results:
Node starts normally.

Additional info:

I suspect the ELB is causing the issue but updating to NetworkManager-1.0.6-27.el7.x86_64 fixes things. I will attach NM debug and node logs for each scenario.

Comment 1 Andrew Butcher 2016-03-11 18:22:41 UTC

Created attachment 1135336
NetworkManager debug logs - HA ELB 1.0.0-14

Comment 2 Andrew Butcher 2016-03-11 18:23:14 UTC

Created attachment 1135337
Node logs - HA ELB 1.0.0-14

Comment 3 Andrew Butcher 2016-03-11 18:23:57 UTC

Created attachment 1135338
NetworkManager debug logs - HA ELB 1.0.6-27

Comment 4 Andrew Butcher 2016-03-11 18:24:29 UTC

Created attachment 1135340
Node logs - HA ELB 1.0.6-27

Comment 5 Andrew Butcher 2016-03-11 18:25:41 UTC

Created attachment 1135353
NetworkManager debug logs - Single master no ELB 1.0.0-14

Comment 6 Andrew Butcher 2016-03-11 18:26:24 UTC

Created attachment 1135354
Node logs - Single master no ELB 1.0.0-14

Comment 7 Dan Winship 2016-03-14 14:11:51 UTC

(In reply to Andrew Butcher from comment #0)
> Mar 11 10:58:34 ip-172-18-0-84.ec2.internal atomic-openshift-node[25314]:
> E0311 10:58:34.109473   25314 common.go:197] Failed to obtain
> ClusterNetwork: Get
> https://openshift.internal.abutchernosat1.aws.paas.ninja/oapi/v1/
> clusternetworks/default: EOF

> I suspect the ELB is causing the issue but updating to
> NetworkManager-1.0.6-27.el7.x86_64 fixes things. I will attach NM debug and
> node logs for each scenario.

The "EOF" in the error message suggests that it is successfully making a connection, but then the connection gets closed rather than responding to the HTTP request. So that definitely sounds like some proxying/HA/load-balancing layer is getting in the way and breaking things.

Is it NM 1.0.0 on the master or NM 1.0.0 on the node that doesn't work? Or do they both need to be 1.0.6? I don't see anything obvious in the NM NEWS file that looks like it would be related to this, but dcbw might know of something...
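
To illustrate the "connects, then gets closed without a response" behaviour described above, here is a minimal Python sketch. It is only an analogy: it uses plain HTTP on localhost rather than HTTPS through an ELB, the helper names are hypothetical, and Python reports the condition as http.client.RemoteDisconnected where the Go client in the node log prints EOF.

```python
import http.client
import socket
import threading


def serve_and_hang_up(listener):
    """Accept one connection, read the request, then close without replying."""
    conn, _ = listener.accept()
    data = b""
    # Read until the end of the HTTP headers so the close is a clean FIN,
    # not a reset caused by unread data.
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(4096)
        if not chunk:
            break
        data += chunk
    conn.close()


def request_against_closing_peer():
    """Issue a GET against a peer that hangs up and report what the client sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # ephemeral port, local only
    listener.listen(1)
    port = listener.getsockname()[1]

    worker = threading.Thread(target=serve_and_hang_up, args=(listener,), daemon=True)
    worker.start()

    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        # Path copied from the log line in the description; any path would do here.
        client.request("GET", "/oapi/v1/clusternetworks/default")
        client.getresponse()
        return "response received"
    except http.client.RemoteDisconnected:
        # The TCP connection succeeded, but the peer closed it before answering.
        return "connected, then peer closed without responding"
    finally:
        client.close()
        worker.join(timeout=5)
        listener.close()


if __name__ == "__main__":
    print(request_against_closing_peer())
```

The point of the sketch is that this error class rules out DNS and routing failures: the client did reach something that accepted the connection.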

Comment 8 Andrew Butcher 2016-03-14 14:23:09 UTC

> Is it NM 1.0.0 on the master or NM 1.0.0 on the node that doesn't work? Or do they both need to be 1.0.6?

I've only tested 1.0.0 on all hosts and 1.0.6 on all hosts.

Comment 9 Dan Williams 2016-03-21 17:29:22 UTC

Andrew, any chance you could replicate the issue and then run this script on the master in each situation?

https://github.com/openshift/openshift-sdn/blob/master/hack/debug.sh

What I'd like to do is get more information about the IP configuration of the machines. From the logs it looks like there aren't any interesting differences in the NetworkManager setup, but obviously something is going on.

Comment 10 Andrew Butcher 2016-03-21 21:20:50 UTC

Created attachment 1138811
sdn debug output

This is from a host that is a master and a node. Let me know if I can gather anything else.

Comment 11 Andrew Butcher 2016-03-21 21:22:09 UTC

Note: attachment 1138811 is from an NM 1.0.0 system. I will gather the other scenarios as soon as possible.

Comment 12 Andrew Butcher 2016-03-22 16:53:57 UTC

Created attachment 1139163
sdn debug output - HA ELB 1.0.6-27

Comment 14 Dan Williams 2016-03-23 22:55:40 UTC

Andrew, inspecting the cluster you spun up today, I found that none of the masters were running atomic-openshift-master-controllers, which is a necessary component when the openshift SDN plugins are used. That's what creates the ClusterNetwork and HostSubnet records that the nodes are waiting for. Once I started atomic-openshift-master-controllers on at least one master, I did get at least one node to start and create the SDN.

ip-172-18-0-234.ec2.internal does appear to start master-controllers during the ansible install process, but then at some point (which appears to be 13:35:52 on all three masters) it gets killed with SIGPIPE and never gets restarted.  That causes the nodes (which are installed and started later) to fail because no SDN controller is running.

Any chance you could retry the installation with ELB and *not* use the openshift-sdn plugins so we can see if atomic-openshift-master-controllers dies at the same spot (indicating it is not an openshift-sdn problem) or continues to run (indicating that it is likely an openshift-sdn problem)?
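
A check like the one described above (is the controllers unit still running, or was it killed by a signal?) could be scripted roughly as follows. This is a sketch under assumptions: it relies on `systemctl show -p` and on the unit properties Result and ExecMainStatus reporting "signal" and the signal number for a signal-killed main process; the function names are hypothetical, and the sample text in the usage note is made up for illustration, not taken from the affected hosts.

```python
import signal
import subprocess

# Unit name as used in the comments above.
UNIT = "atomic-openshift-master-controllers"
PROPERTIES = ("ActiveState", "SubState", "Result", "ExecMainCode", "ExecMainStatus")


def parse_show_output(text):
    """Turn `systemctl show` KEY=VALUE lines into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def killed_by_sigpipe(props):
    """True if the unit's main process is recorded as killed by SIGPIPE.

    Assumes Result=signal and ExecMainStatus=<signal number> in that case.
    """
    return (
        props.get("Result") == "signal"
        and props.get("ExecMainStatus") == str(int(signal.SIGPIPE))
    )


def query_unit(unit=UNIT):
    """Ask systemd for the selected properties of one unit (needs a systemd host)."""
    args = ["systemctl", "show", unit]
    for name in PROPERTIES:
        args += ["-p", name]
    result = subprocess.run(
        args, check=True, stdout=subprocess.PIPE, universal_newlines=True
    )
    return parse_show_output(result.stdout)
```

For example, `killed_by_sigpipe(parse_show_output("Result=signal\nExecMainStatus=13\n"))` evaluates to true, while a unit reporting `Result=success` does not match. Running `query_unit()` on each master after the install would show whether the unit died the same way everywhere.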

Comment 15 Andrew Butcher 2016-03-24 00:02:26 UTC

Thanks for looking Dan, I'll test that as soon as I can.

Comment 16 Andrew Butcher 2016-03-29 14:30:09 UTC

> Any chance you could retry the installation with ELB and *not* use the openshift-sdn plugins so we can see if atomic-openshift-master-controllers dies at the same spot (indicating it is not an openshift-sdn problem) or continues to run (indicating that it is likely an openshift-sdn problem)?

Controllers die in this scenario as well - attaching logs from that run.

Comment 17 Andrew Butcher 2016-03-29 14:32:08 UTC

Created attachment 1141285
controllers log w/ elb and disabled openshift sdn

Comment 18 Ben Bennett 2016-04-07 13:42:46 UTC

The logs indicate that the master is crashing... the networking was just the first place it showed up.

Comment 20 Andy Goldstein 2016-04-08 13:56:17 UTC

I worked with Andrew on this yesterday. He installed a new RHEL 7.1 cluster on AWS. What we saw was that atomic-openshift-master-controllers was working fine until the atomic-openshift-node RPM was installed. That brought in an update to the systemd RPM as well. Around the time we saw the last log message from -controllers, we saw that systemd was updated and reexecuted. The systemctl status showed killed/PIPE. The current theory is that the systemd/journald restart caused -controllers to die. Subsequent attempts at reexec'ing systemd and restarting journald didn't yield the same behavior. And once we started the -controllers and -node processes on the appropriate hosts, everything worked fine.

Moving to UpcomingRelease as this doesn't look like a blocker.
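
The killed/PIPE status is consistent with a process writing its log output to a pipe or socket whose reader has gone away. The Python sketch below reproduces only that general mechanism with an ordinary pipe; it does not involve systemd or journald, it is not a confirmed root cause for this bug, and the helper name is hypothetical.

```python
import signal
import subprocess
import sys

# Child program: restore the default SIGPIPE action (the Python interpreter
# ignores SIGPIPE at startup), wait for a go-ahead on stdin, then keep
# writing "log lines" to stdout.
CHILD = r"""
import os, signal, sys
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
sys.stdin.readline()
while True:
    os.write(1, b"log line\n")
"""


def run_writer_with_closed_reader():
    """Start a writer, close the reading end of its stdout, return its exit status."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    proc.stdout.close()        # the log reader disappears
    proc.stdin.write(b"go\n")  # only now let the child start writing
    proc.stdin.flush()
    proc.stdin.close()
    # A negative return code -N means the child was terminated by signal N.
    return proc.wait(timeout=10)


if __name__ == "__main__":
    status = run_writer_with_closed_reader()
    print("killed by SIGPIPE:", status == -signal.SIGPIPE)
```

A daemon that ignores or handles SIGPIPE would instead see a write error and could keep running, which is one reason the behaviour may differ between services and between attempts.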

Comment 21 Scott Dodson 2016-04-08 14:39:39 UTC

Upgrading NetworkManager also pulls in a newer systemd, which explains why we originally came to the conclusion that this was NM related.

Comment 22 Andy Goldstein 2016-04-08 14:43:28 UTC

Makes sense. So there's some issue with at least atomic-openshift-master-controllers dying when systemd is yum updated.

Comment 23 Jordan Liggitt 2016-05-22 15:18:03 UTC

Do we have an idea what component this should be against? releng or installer seems more accurate than REST API, but I'm not sure.

Comment 24 Scott Dodson 2016-05-23 13:29:54 UTC

Installer I guess.

Comment 25 Brenton Leanhardt 2016-06-01 14:15:52 UTC

If the issue is actually with atomic-openshift-master-controllers dying when systemd is yum updated I don't think it's related to AWS.

QE, I don't doubt there is still a problem here but would you mind verifying that the issue is reproducible?

It could be the case that it doesn't happen for all NetworkManager upgrades.  You may have to try a few different "yum downgrade" commands to get to a version that triggers the problem.

Comment 26 Ma xiaoqiang 2016-06-02 08:27:30 UTC

Scenario 1:
Install ose-3.2 on RHEL 7.1 with NetworkManager-1.0.0 on OpenStack, then update the system to RHEL 7.2. Cannot reproduce this issue.

Scenario 2:
Install ose-3.1.1.6-8 on RHEL 7.1 with NetworkManager-1.0.0 on OpenStack, then update the system to RHEL 7.2. Cannot reproduce this issue.

Scenario 3:
Install ose-3.1.1.6-8 on ami-10663b78 on AWS, then update the system to the latest RHEL 7.2. Cannot reproduce this issue.

Scenario 4:
Install ose-3.1.1.6-8 on RHEL 7.2 with NetworkManager-1.0.0-14.git20150121 on AWS, then update the system to the latest RHEL 7.2. Cannot reproduce this issue.

QE cannot reproduce this issue.
