Red Hat Bugzilla – Bug 1317031
AWS HA installation fails with node failed to obtain clusternetwork (because the master is crashing)
Last modified: 2016-09-07 16:38:20 EDT
Description of problem:
I've created a native HA OpenShift cluster in AWS with a TCP ELB in front of the master API. With NetworkManager-1.0.0-14.git20150121.b4ea599c.el7.x86_64, the node SDN consistently fails as follows:
Mar 11 10:58:34 ip-172-18-0-84.ec2.internal atomic-openshift-node: E0311 10:58:34.109473 25314 common.go:197] Failed to obtain ClusterNetwork: Get https://openshift.internal.abutchernosat1.aws.paas.ninja/oapi/v1/clusternetworks/default: EOF
Mar 11 10:58:34 ip-172-18-0-84.ec2.internal atomic-openshift-node: F0311 10:58:34.109490 25314 node.go:175] SDN Node failed: Get https://openshift.internal.abutchernosat1.aws.paas.ninja/oapi/v1/clusternetworks/default: EOF
Mar 11 10:58:34 ip-172-18-0-84.ec2.internal systemd: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
The same configuration will succeed after updating to NetworkManager-1.0.6-27.el7.x86_64.
A single master cluster with no ELB will succeed with NetworkManager-1.0.0-14.git20150121.b4ea599c.el7.x86_64.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create HA OpenShift cluster in AWS based on ami-10663b78 with a TCP ELB in front of the API.
2. Use openshift-ansible to install OpenShift. This will fail when starting and enabling node services.
Expected results:
Node starts normally.
Actual results:
Node SDN fails.
Additional info:
I suspect the ELB is causing the issue but updating to NetworkManager-1.0.6-27.el7.x86_64 fixes things. I will attach NM debug and node logs for each scenario.
Created attachment 1135336 [details]
NetworkManager debug logs - HA ELB 1.0.0-14
Created attachment 1135337 [details]
Node logs - HA ELB 1.0.0-14
Created attachment 1135338 [details]
NetworkManager debug logs - HA ELB 1.0.6-27
Created attachment 1135340 [details]
Node logs - HA ELB 1.0.6-27
Created attachment 1135353 [details]
NetworkManager debug logs - Single master no ELB 1.0.0-14
Created attachment 1135354 [details]
Node logs - Single master no ELB 1.0.0-14
(In reply to Andrew Butcher from comment #0)
> Mar 11 10:58:34 ip-172-18-0-84.ec2.internal atomic-openshift-node:
> E0311 10:58:34.109473 25314 common.go:197] Failed to obtain
> ClusterNetwork: Get
> clusternetworks/default: EOF
> I suspect the ELB is causing the issue but updating to
> NetworkManager-1.0.6-27.el7.x86_64 fixes things. I will attach NM debug and
> node logs for each scenario.
The "EOF" in the error message suggests that the node successfully makes a connection, but the connection is then closed before any response to the HTTP request is sent. So that definitely sounds like some proxying/HA/load-balancing layer is getting in the way and breaking things.
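The "connects, then gets EOF" symptom can be reproduced in isolation. The following is a minimal Python sketch, not taken from OpenShift or from the ELB: `start_closing_listener` is a hypothetical stand-in for any intermediate layer that accepts the TCP connection and then drops it, and `classify_request` is an illustrative helper. It uses plain TCP on localhost with no TLS, so it only shows the shape of the failure.

```python
import socket
import threading


def start_closing_listener():
    """Hypothetical stand-in for a layer that accepts a connection and
    closes it without answering. Returns the local port it listens on."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        conn.recv(4096)  # consume the request so the close is a clean FIN
        conn.close()     # close without sending any HTTP response
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return port


def classify_request(host, port, timeout=5.0):
    """Send one HTTP request and report how the exchange ended:
    "eof" (closed with no data), "reset", or "response"."""
    request = (
        b"GET /oapi/v1/clusternetworks/default HTTP/1.1\r\n"
        b"Host: " + host.encode() + b"\r\n\r\n"
    )
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        try:
            data = sock.recv(4096)
        except ConnectionResetError:
            return "reset"
    # An empty read after a successful connect is the EOF case.
    return "eof" if data == b"" else "response"


if __name__ == "__main__":
    port = start_closing_listener()
    print(classify_request("127.0.0.1", port))
```

A refused or timed-out connection would raise from `socket.create_connection` instead, which is why an EOF points at something accepting the connection rather than at basic reachability.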
Is it NM 1.0.0 on the master or NM 1.0.0 on the node that doesn't work? Or do they both need to be 1.0.6? I don't see anything obvious in the NM NEWS file that looks like it would be related to this, but dcbw might know of something...
> Is it NM 1.0.0 on the master or NM 1.0.0 on the node that doesn't work? Or do they both need to be 1.0.6?
I've only tested 1.0.0 on all hosts and 1.0.6 on all hosts.
Andrew, any chance you could replicate the issue and then run this script on the master in each situation?
What I'd like to do is get more information about the IP configuration of the machines. From the logs it looks like there aren't any interesting differences in the NetworkManager setup, but obviously something is going on.
Created attachment 1138811 [details]
sdn debug output
This is from a host that is a master and a node. Let me know if I can gather anything else.
Note: attachment 1138811 [details] is from an NM 1.0.0 system. I will gather the other scenarios as soon as possible.
Created attachment 1139163 [details]
sdn debug output - HA ELB 1.0.6-27
Andrew, inspecting the cluster you spun up today, I found that none of the masters were running atomic-openshift-master-controllers, which is a necessary component when the openshift-sdn plugins are used. It is what creates the ClusterNetwork and HostSubnet records that the nodes are waiting for. Once I started atomic-openshift-master-controllers on at least one master, I got at least one node to start and create the SDN.
ip-172-18-0-234.ec2.internal does appear to start master-controllers during the ansible install process, but then at some point (which appears to be 13:35:52 on all three masters) it gets killed with SIGPIPE and never gets restarted. That causes the nodes (which are installed and started later) to fail because no SDN controller is running.
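A check along these lines could be run on each master before starting the nodes. This is only a sketch: `is_unit_active` and `missing_units` are hypothetical helper names, and the only external behavior relied on is that `systemctl is-active --quiet UNIT` exits with status 0 when the unit is active. The `run` parameter is an assumption added so the command runner can be swapped out.

```python
import subprocess


def is_unit_active(unit, run=subprocess.run):
    """Return True if systemd reports the unit as active.

    `systemctl is-active --quiet` prints nothing and signals the state
    through its exit status (0 means active).
    """
    result = run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def missing_units(units, run=subprocess.run):
    """Return the subset of `units` that are not currently active."""
    return [unit for unit in units if not is_unit_active(unit, run=run)]


if __name__ == "__main__":
    # Unit name taken from the comment above; run this on a master host.
    down = missing_units(["atomic-openshift-master-controllers"])
    if down:
        print("not running:", ", ".join(down))
```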
Any chance you could retry the installation with ELB and *not* use the openshift-sdn plugins so we can see if atomic-openshift-master-controllers dies at the same spot (indicating it is not an openshift-sdn problem) or continues to run (indicating that it is likely an openshift-sdn problem)?
Thanks for looking Dan, I'll test that as soon as I can.
> Any chance you could retry the installation with ELB and *not* use the openshift-sdn plugins so we can see if atomic-openshift-master-controllers dies at the same spot (indicating it is not an openshift-sdn problem) or continues to run (indicating that it is likely an openshift-sdn problem)?
Controllers die in this scenario as well - attaching logs from that run.
Created attachment 1141285 [details]
controllers log w/ elb and disabled openshift sdn
The logs indicate that the master is crashing... the networking was just the first place it showed up.
I worked with Andrew on this yesterday. He installed a new RHEL 7.1 cluster on AWS. What we saw was that atomic-openshift-master-controllers was working fine until the atomic-openshift-node RPM was installed. That brought in an update to the systemd RPM as well. Around the time we saw the last log message from -controllers, we saw that systemd was updated and reexecuted. The systemctl status showed killed/PIPE. The current theory is that the systemd/journald restart caused -controllers to die. Subsequent attempts at reexec'ing systemd and restarting journald didn't yield the same behavior. And once we started the -controllers and -node processes on the appropriate hosts, everything worked fine.
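The killed/PIPE status is what a process gets when it writes to an output stream whose reader has gone away and it has not chosen to ignore SIGPIPE. The sketch below reproduces that mechanism in Python under stated assumptions: the child is an arbitrary stand-in for a logging service (not -controllers itself), closing the parent's end of the pipe stands in for the log collector being restarted, and `run_writer_with_closed_reader` and `describe_exit` are hypothetical names. It does not establish that this is what happened to -controllers. For context, the Go `os/signal` documentation describes a similar default: a write to a broken pipe on file descriptor 1 or 2 makes the program exit with SIGPIPE unless the program handles that signal.

```python
import signal
import subprocess
import sys

# Child program: keeps writing "log lines" to stdout. Python ignores
# SIGPIPE by default, so the child restores the default disposition to
# behave like a process that does not handle the signal.
CHILD_SOURCE = r"""
import signal, sys
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
while True:
    sys.stdout.write("log line\n")
    sys.stdout.flush()
"""


def run_writer_with_closed_reader():
    """Start the writer, close the reading end of its stdout pipe, and
    return the child's exit status as reported by subprocess."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE], stdout=subprocess.PIPE
    )
    proc.stdout.close()  # the reader goes away
    return proc.wait(timeout=30)


def describe_exit(returncode):
    """Render a subprocess return code; negative values mean the child
    was terminated by that signal number."""
    if returncode < 0:
        return "killed/" + signal.Signals(-returncode).name
    return "exited/" + str(returncode)


if __name__ == "__main__":
    print(describe_exit(run_writer_with_closed_reader()))
```

Note that nothing restarts the child here, which matches the observation that -controllers stayed down until it was started by hand.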
Moving to UpcomingRelease as this doesn't look like a blocker.
Upgrading NetworkManager also pulls in a newer systemd, which explains why we originally came to the conclusion that this was NM related.
Makes sense. So there's some issue with at least atomic-openshift-master-controllers dying when systemd is yum updated.
Do we have an idea what component this should be filed against? releng or installer seems more accurate than REST API, but I'm not sure.
Installer I guess.
If the issue is actually with atomic-openshift-master-controllers dying when systemd is yum updated, I don't think it's related to AWS.
QE, I don't doubt there is still a problem here but would you mind verifying that the issue is reproducible?
It could be the case that it doesn't happen for all NetworkManager upgrades. You may have to try a few different "yum downgrade" commands to get to a version that triggers the problem.
Install ose-3.2 on RHEL 7.1 with NetworkManager-1.0.0 on OpenStack, then update the system to RHEL 7.2. Cannot reproduce this issue.
Install ose-188.8.131.52.-8 on RHEL 7.1 with NetworkManager-1.0.0 on OpenStack, then update the system to RHEL 7.2. Cannot reproduce this issue.
Install ose-184.108.40.206-8 on ami-10663b78 on AWS, then update the system to the latest RHEL 7.2. Cannot reproduce this issue.
Install ose-220.127.116.11-8 on RHEL 7.2 with NetworkManager-1.0.0-14.git20150121 on AWS, then update the system to the latest RHEL 7.2. Cannot reproduce this issue.
QE cannot reproduce this issue.