Bug 1298297 - networking issues appear for previously functioning containers
Summary: networking issues appear for previously functioning containers
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3
 
Reported: 2016-01-13 16:53 UTC by Joel Diaz
Modified: 2016-09-07 21:27 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-07 21:27:08 UTC
Target Upstream Version:
Embargoed:



Description Joel Diaz 2016-01-13 16:53:04 UTC
Description of problem:
In our environment, we have monitoring tests that go into each running container and attempt DNS resolution. The results of those tests are sent up to our monitoring infrastructure.

We've seen nodes that have been running for days/weeks start to flag one or more containers as having DNS/networking issues.

While we are specifically testing DNS, in actuality it appears that all network functionality is failing on these problematic containers.
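
The check amounts to something like the following (a rough sketch for illustration only, not our actual monitoring code):

for c in $(docker ps -q); do
    if docker exec "$c" getent hosts google.com > /dev/null 2>&1; then
        echo "$c OK"
    else
        echo "$c DNS/network FAILURE"
    fi
done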

Version-Release number of selected component (if applicable):
atomic-openshift-node-3.1.0.4-1.git.2.c5fa845.el7aos.x86_64
atomic-openshift-3.1.0.4-1.git.2.c5fa845.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.0.4-1.git.2.c5fa845.el7aos.x86_64
atomic-openshift-clients-3.1.0.4-1.git.2.c5fa845.el7aos.x86_64
tuned-profiles-atomic-openshift-node-3.1.0.4-1.git.2.c5fa845.el7aos.x86_64
openvswitch-2.3.1-2.git20150113.el7.x86_64

How reproducible:
It appears after a node has been up and running for days/weeks.

Steps to Reproduce:
1. Run containers on OpenShift
2. Periodically test network functionality on the OpenShift-managed containers
3. Wait for failure

Actual results:
DNS failure results:
[root@int-node-infra-1f427 ~]# docker exec -ti 2bb797e31266 getent hosts google.com ; echo $?
2


Expected results:
[root@int-node-infra-1f427 ~]# docker exec -ti 8b353d6eaf95 getent hosts google.com ; echo $?
2607:f8b0:400d:c0a::8a google.com
0

Additional info:

Comment 1 Eric Paris 2016-01-13 19:25:14 UTC
Is there any chance that any of the underlying components (docker, atomic-*, openvswitch, etc.) were updated and/or restarted around the time of failure?

Comment 2 Joel Diaz 2016-01-13 19:59:41 UTC
When experiencing networking issues, our process for getting things working again is:

stop atomic-openshift-node and docker
restart openvswitch
start atomic-openshift-node and docker
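
On a systemd-based node that sequence is roughly the following (assuming the standard unit names for this install):

systemctl stop atomic-openshift-node docker
systemctl restart openvswitch
systemctl start docker atomic-openshift-node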

It's possible that our NOC team may have performed these steps in response to reported networking problems. Because of the way we monitor the nodes (they are monitored through privileged containers not managed through OpenShift), stopping docker would mean we wouldn't get the network failure reports until our monitoring containers were started back up.

What I'm trying to point out is that stopping docker/openshift/openvswitch is part of the process to get things working *after* a problem has been reported.

Our release team made some changes to the node names yesterday (which are being rolled back now), but otherwise these services shouldn't be getting restarted unless we are notified about a network issue (we ran into this issue before yesterday's node name changes as well).

Comment 3 Ben Bennett 2016-01-14 16:08:12 UTC
The problem is that there are no OVS rules for the long-running containers.  I see that ports 1, 2, and 9 are connected (vxlan, main interface, and docker bridge respectively).  But nothing else.  And there are no rules for any ports other than those three.

The pod's interface is not connected to the docker bridge (lbr0) either, and as far as I can tell it is as if we had removed the networking rules like when a pod goes down... except the pod didn't go down.
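
(For reference, that state can be inspected with something like the following; br0 is the openshift-sdn OVS bridge, and depending on the flow table version ovs-ofctl may need '-O OpenFlow13':)

ovs-ofctl show br0          # list the OVS ports and their port numbers
ovs-ofctl dump-flows br0    # dump the flow rules; only the defaults for ports 1, 2, and 9 are present here
brctl show lbr0             # show which veth interfaces are attached to the docker bridge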

I'm still digging through the logs and information I have to try to identify when things stopped working and to see if I can correlate that to anything else on the system.

Comment 4 Ben Bennett 2016-01-14 19:04:36 UTC
There are two systems here, 'cluster' and 'infra'.  Both are having problems, and superficially they look similar.  However, the root cause may be different.

We have a working theory for 'infra'.  The facts are:
 - There is one long-running container on the node (running a registry)
 - The networking completely stopped for the node (you cannot ping the host address on the SDN)
 - Digging in a bit shows that only the default OVS rules (for ports 1, 2, and 9) are present, there are no rules for the pod
 - The openshift-node software has stopped (there appears to have been a change to the configuration and openshift stops with: Couldn't open cloud provider configuration /etc/aws/aws.conf: &os.PathError{Op:"open", Path:"/etc/aws/aws.conf", Err:0x2})
 - Systemd tried to restart openshift 5 times

Based on that, we think that when openshift restarted, it cleaned out the OVS rules, set up the three default ones (for ports 1, 2, and 9) and then expected to set the rest up based on the running pods.  However, when it hit the configuration error and exited, it never got a chance to restore the rules.
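
On an affected node, the restart loop and the configuration error should be visible with something like:

systemctl status atomic-openshift-node
journalctl -u atomic-openshift-node | grep -i 'cloud provider'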

I believe this explains the 'infra' node case.  I haven't worked out the 'cluster' case yet... still working on it.

Comment 5 Ben Bennett 2016-01-14 20:47:49 UTC
The 'cluster' machine had the following in the hostsubnets:
- apiVersion: v1
  host: 172.20.1.225
  hostIP: 172.20.1.225
  kind: HostSubnet
  metadata:
    creationTimestamp: 2015-11-12T15:25:52Z
    name: 172.20.1.225
    resourceVersion: "971107"
    selfLink: /oapi/v1/hostsubnets/172.20.1.225
    uid: a8d395de-8951-11e5-9261-06e17e4f40ef
  subnet: 10.1.9.0/24
- apiVersion: v1
  host: ip-172-20-1-225.ec2.internal
  hostIP: 172.20.1.225
  kind: HostSubnet
  metadata:
    creationTimestamp: 2016-01-12T19:41:45Z
    name: ip-172-20-1-225.ec2.internal
    resourceVersion: "15164938"
    selfLink: /oapi/v1/hostsubnets/ip-172-20-1-225.ec2.internal
    uid: 8381a96b-b964-11e5-ab7c-06e17e4f40ef
  subnet: 10.1.3.0/24

And the interfaces on the node ended up configured:
7: lbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
    inet 10.1.9.1/24 scope global lbr0
5037: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN
    inet 10.1.3.1/24 scope global tun0

Note that they have different addresses, where they are supposed to be the same.
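
The mismatch is easy to see by comparing the interface addresses against the HostSubnet records, e.g.:

ip -4 addr show lbr0
ip -4 addr show tun0
oc get hostsubnet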

This was caused by the release team: "I changed nodeName to something like ip-172-31-5-247.ec2.internal, whereas before it was an IP. It's broken all the nodes"

So, while this is a misconfiguration (and because of that I think we can drop the priority), there are a couple of things that we probably should address:
 - The sdn code should make sure that the tun0 and lbr0 interfaces have the same address
 - When a new node is created, it should probably check to see if the hostIP is already in use

Comment 6 Ben Bennett 2016-01-15 13:15:22 UTC
Dan Williams said that there is a HostSubnet validation PR that he will fold the duplicate IP validation into.  That means that this bug is now just:

The SDN code should make sure that the tun0 and lbr0 interfaces have the same address

Comment 7 Dan Williams 2016-04-28 20:39:56 UTC
I don't believe different addresses on lbr0 and tun0 can happen in 3.1.1.6 and later.  If openshift-sdn finds that lbr0's address differs from the HostSubnet, it will tear everything down and recreate it with the correct address.  The only way I can see them being different is if there was a crash right after configuring lbr0 and before tun0, or if tun0's IP address was changed manually.  Neither of these is something we should try to account for (because if we do, where does it end...).

For QA, I think it would be sufficient to let openshift-node configure itself, then stop the node, change the HostSubnet record for the node with 'oc edit HostSubnet', restart openshift-node, and make sure that the tun0 and lbr0 addresses are updated to match the new HostSubnet.
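
In other words, something like this (with <node> standing in for the node's HostSubnet name):

systemctl stop atomic-openshift-node
oc edit hostsubnet <node>        # change the subnet field
systemctl start atomic-openshift-node
ip -4 addr show lbr0             # both should now carry the gateway address of the new subnet
ip -4 addr show tun0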

Comment 8 Yan Du 2016-04-29 06:08:59 UTC
openshift v3.2.0.20
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

After stopping the node, we still get a warning (* subnet: Invalid value: "10.0.2.0/24": cannot change the subnet lease midflight) when changing the subnet record for the node with 'oc edit HostSubnet'.

I had to delete the node and add it back; then the node could allocate a new subnet, and tun0 and lbr0 were updated accordingly.
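
For the record, the sequence that worked was roughly the following (<node> is a placeholder, and re-registration here is via a node service restart, which may differ per setup):

oc delete node <node>
systemctl restart atomic-openshift-node   # node re-registers and is allocated a new subnet
ip -4 addr show lbr0
ip -4 addr show tun0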

So I am moving the bug to verified.

