Bug 1564955 - When using flannel host-gw, routes get deleted intermittently on the openshift-node(instance of OpenStack) [NEEDINFO]
Keywords:
Status: CLOSED EOL
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Casey Callendrello
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On: 1585789
Blocks:
Reported: 2018-04-09 03:01 UTC by Miheer Salunke
Modified: 2019-03-12 14:01 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-21 16:09:40 UTC
Target Upstream Version:
emahoney: needinfo? (rchopra)
bbennett: needinfo? (rchopra)
jchaloup: needinfo? (misalunk)


Comment 1 Rajat Chopra 2018-04-09 04:34:32 UTC
If we restart the flannel daemon, do the routes get restored? The answer will help narrow down the issue.
Also, is this happening only on a particular set of nodes, or on any/all nodes?

Comment 16 Miheer Salunke 2018-04-12 05:54:07 UTC
(In reply to Rajat Chopra from comment #13)
> When flannel stops running on a node for more than a certain period of time,
> the node will lose its subnet lease. Upon restart it will acquire a new
> lease if the older lease has expired.
> The lease duration is usually 24 hours.
> 
> Hence, the first cause of this problem is the flannel daemon dying, and the
> bigger problem occurs when it stays dead for longer than the lease period.
> So, please fix the systemd unit file to allow a restart. That fix is
> necessary irrespective of the root cause of the daemon dying.
> 
> Part 2 of the fix/investigation would need logs from when the daemons
> die. Perhaps a full disk?
> 
> 
> Next up, after we put the restart in the unit file:
> How to fix the nodes that have duplicate leases?
> - The cleanest but abrupt answer is to kill all existing pods on that node
> - It may be possible to do the cleanup in etcd, this will require a special
> fix script

Hi Rajat,

As I understand it, we will first troubleshoot why the routes on the node are incorrect or not updated properly even though flannel is running on the node.
For that, we will add a restart on SIGPIPE so that flannel keeps running after it receives SIGPIPE, which will help us understand the pattern behind the routes not being added. Is that right?

Steps ->

Issue 1
 For the issue where routes are not correct on the node.

1. Mark the node that has the duplicate lease as unschedulable.
2. Drain its pods to a different node that does not have a duplicate lease.
3. Enable debug-level logging in flannel on all nodes in the cluster and restart the node service.
4. Once the route-update issue occurs on any node, collect the flannel logs. To notice that the issue has occurred, though, the customer has to hit a connection problem to a pod on that specific node, right?
5. Analyze the logs and update the BZ.
6. If the issue is found and we have a workaround,
       apply it and then schedule pods on the node again.
   Otherwise,
       leave the pods on the nodes that have no route issue,
       or
       manually correct the routes with ip route.

Once Issue 1 is solved:

Issue 2
  For the issue where flannel gets killed by SIGPIPE.

1. Set debug-level flannel logging.
2. Start strace on the flannel process on all nodes.
3. Start tcpdump on the flannel port.
4. Enable per-minute SAR on the nodes to check resource usage.

Thanks a lot again, Rajat, for looking into this! Let me know if you find anything
wrong in the above steps.

Thanks and regards,
Miheer Salunke

Comment 17 giriraj rajawat 2018-04-24 12:19:02 UTC
The flanneld service restart issue recurred on 23 April at 09:55:45 on app node 17.
- Flanneld logging at level 10 is enabled (FLANNEL_OPTIONS="-v=10 -logtostderr")
- SAR is enabled
- As an interim solution we have applied:
  - IgnoreSIGPIPE=yes
  - Restart=always 

- The latest SOS report and logs are at foobar.

The logs haven't revealed any significant data. A snippet of the analysis follows.
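
A minimal sketch of how the interim settings listed above could be applied as a systemd drop-in rather than by editing the packaged unit. The helper name write_flanneld_dropin and the file name 10-restart.conf are assumptions for illustration, not taken from the customer's nodes:

```shell
# Write a drop-in carrying the two interim [Service] settings.
# $1 is the drop-in directory; the default is the standard location
# systemd reads for flanneld.service overrides.
write_flanneld_dropin() {
    dir=${1:-/etc/systemd/system/flanneld.service.d}
    mkdir -p "$dir" || return 1
    # 10-restart.conf is a hypothetical file name; any *.conf name works.
    cat > "$dir/10-restart.conf" <<'EOF'
[Service]
IgnoreSIGPIPE=yes
Restart=always
EOF
}

# On a node (as root) this would be followed by:
#   systemctl daemon-reload
```

The settings take effect the next time flanneld is started after the daemon-reload.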

Flanneld Logs:
-------------- 

Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com flanneld-start[34628]: I0423 09:55:45.168695   34628 network.go:114] Route to 10.128.80.0/23 via 10.3.168.159 already exists, skipping.
Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com systemd[1]: Started Flanneld overlay address etcd agent.
Apr 23 09:58:19 prod-node17.shift.enoc-airtel.com flanneld-start[34628]: I0423 09:58:19.385187   34628 network.go:83] Subnet added: 10.128.236.0/23 via 10.3.168.196
Apr 23 09:58:19 prod-node17.shift.enoc-airtel.com flanneld-start[34628]: I0423 09:58:19.385600   34628 network.go:114] Route to 10.128.236.0/23 via 10.3.168.196 already exists, skipping.
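
To cross-check log lines like the ones above against the kernel routing table, something along these lines could be used. This is only a sketch: the function names are made up for illustration, it relies on the "Subnet added: SUBNET via GATEWAY" wording seen in these logs, and it ignores any subnets that flannel later removed.

```shell
# Read flanneld log lines on stdin and print "SUBNET via GATEWAY" for every
# "Subnet added" event, de-duplicated.
flannel_expected_routes() {
    sed -n 's|.*Subnet added: \([0-9./]*\) via \([0-9.]*\).*|\1 via \2|p' | sort -u
}

# Read expected "SUBNET via GATEWAY" lines on stdin and print those that are
# absent from the saved routing table in file $1 (e.g. `ip route show > file`).
missing_routes() {
    awk -v table="$1" '
        BEGIN {
            # Index the first three fields of each route: SUBNET via GATEWAY
            while ((getline line < table) > 0) {
                split(line, f, " ")
                have[f[1] " " f[2] " " f[3]] = 1
            }
        }
        {
            key = $1 " " $2 " " $3
            if (!(key in have)) print
        }'
}

# Possible use on a node:
#   ip route show > /tmp/routes
#   journalctl -u flanneld | flannel_expected_routes | missing_routes /tmp/routes
```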



Docker logs:
------------
Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com systemd[1]: Starting Docker Application Container Engine...



Node Logs
--------- 

Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com atomic-openshift-node[107298]: E0423 09:55:45.307538  107298 kubelet_pods.go:736] Error listing containers: &errors.errorString{s:"Cannot connect to the Docker daemon. Is the docker daemon running on this host?"}
Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com atomic-openshift-node[107298]: E0423 09:55:45.307590  107298 kubelet.go:1861] Failed cleaning pods: Cannot connect to the Docker daemon. Is the docker daemon running on this host?


Any pointer will be helpful.

Regards,

Giriraj Rajawat

Comment 25 Ben Bennett 2018-05-01 19:04:33 UTC
Please make sure this doesn't get lost in the other request:

Did someone restart journald?  It looks like you can get this when writing to STDOUT or STDERR fails: https://github.com/golang/go/issues/11845

We can check by looking at:
  > systemctl status systemd-journald
And
  > systemctl status flanneld

And seeing if the restart times are near one another.
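
A rough way to make that comparison without reading the status output by eye, assuming GNU date is available and can parse the timestamp that systemctl show prints (the function names are illustrative only):

```shell
# Print the time a unit last entered the active state, as reported by systemd.
# `systemctl show -p ActiveEnterTimestamp` prints "ActiveEnterTimestamp=<time>".
unit_started_at() {
    systemctl show -p ActiveEnterTimestamp "$1" | cut -d= -f2-
}

# Print the absolute difference, in seconds, between two timestamps.
gap_seconds() {
    a=$(date -d "$1" +%s) || return 1
    b=$(date -d "$2" +%s) || return 1
    if [ "$a" -ge "$b" ]; then
        echo $((a - b))
    else
        echo $((b - a))
    fi
}

# Possible use on a node:
#   gap_seconds "$(unit_started_at systemd-journald)" "$(unit_started_at flanneld)"
```

A gap of only a few seconds would suggest the two restarts are related.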

Comment 26 giriraj rajawat 2018-05-03 06:57:44 UTC
We are hitting warnings, with no output.

Debuginfo packages for flanneld and glibc are installed.

# debuginfo-install install -y glibc
# debuginfo-install install -y flanneld


The warning messages below appeared after running gdb:

[root@prod-node11 /]# gdb --command=/tmp/command -p 1858 > /kdump/error

warning: the debug information found in "/usr/lib/debug//usr/bin/flanneld.debug" does not match "/usr/bin/flanneld" (CRC mismatch).


warning: the debug information found in "/usr/lib/debug/usr/bin/flanneld.debug" does not match "/usr/bin/flanneld" (CRC mismatch).

Comment 27 giriraj rajawat 2018-05-03 08:32:48 UTC
The CRC mismatch was corrected by aligning the package sub-versions. We are waiting on the customer for debug logs.

Comment 28 Rajat Chopra 2018-05-03 18:27:10 UTC
Since we have a mechanism (RestartPolicy) that keeps SIGPIPE from affecting the operation of the cluster, the severity of this bug is being lowered.

Comment 34 Ben Bennett 2018-05-04 16:04:26 UTC
Note from Suresh: What I have observed is that if flannel is trying to push something to etcd and the connection breaks, it gets SIGPIPE.

Comment 35 Rajat Chopra 2018-05-04 17:11:38 UTC
What version of docker are we running? The pods should not get restarted on docker restart.

Comment 42 Rajat Chopra 2018-05-07 17:39:59 UTC
Can we get the current contents of the systemd unit files for flanneld, openshift and docker?
We will explore decoupling docker and flannel as suggested in comment #40.

A few things would still need to be done long term:
1. Upgrade docker to at least 1.13
2. Put a signal handler in flannel (the SIGPIPE is caused by journald restarting). See example: https://github.com/moby/moby/pull/22460
3. Fix the installer for flannel to include the restart option

Comment 44 Rajat Chopra 2018-05-07 21:08:04 UTC
Can we get the systemd version as well?

The root cause is that journald is being stopped and then started. The appropriate way is to run systemctl restart systemd-journald. Done that way, all the socket FDs are reaped and it does not result in processes getting restarted.

Can we find out who/how/why journald is stopped and started? A stop followed by a start is not the same as a restart.

Comment 46 Miheer Salunke 2018-05-07 23:36:08 UTC
(In reply to Rajat Chopra from comment #44)
> Can we get the systemd version as well?
> 
> The root cause is that journald is being stopped and then started. The
> appropriate way is to run systemctl restart systemd-journald. Done that way,
> all the socket FDs are reaped and it does not result in processes getting
> restarted.
> 
> Can we find out who/how/why journald is stopped and started? A stop followed
> by a start is not the same as a restart.

Do you want us to run gdb on systemd-journald?

Something like this?

https://bugzilla.redhat.com/show_bug.cgi?id=1438229#c0

Comment 50 Rajat Chopra 2018-05-08 07:03:02 UTC
The SIGPIPE may not be trappable inside the flannel program itself. Given that the stack trace involves glog, I have written a small program that mimics this crash. SIGPIPE is not easily trappable in it, as EPIPE is returned by the 'write' system call, unless all FDs are re-opened.

We will have to decouple flannel and docker for relief from the current issues while we investigate how to fix the reaping of all FDs. I will send the drop-in file fix after I test it.

Comment 56 Rajat Chopra 2018-05-08 13:46:38 UTC
Can someone verify whether docker and flanneld are indeed coupled on restarts?
For example, kill flanneld and see whether docker restarts.

I could not reproduce this with the given unit files.

Comment 57 Ben Bennett 2018-05-08 14:38:46 UTC
What is the output of these commands?
 systemctl cat docker
 systemctl cat flanneld
 cat /etc/os-release
 rpm -q docker flannel

Thanks

Comment 61 Weibin Liang 2018-05-08 17:23:48 UTC
The docker service keeps running fine after stopping and starting systemd-journald and after stopping and starting flanneld.service.

Tested in: 
RHEL 7.3
atomic-openshift-master-3.5.5.31.67-1.git.0.0a8cf24.el7.x86_64
docker-1.12.6-71.git3e8e77d.el7.x86_64
flannel-0.7.1-3.el7.x86_64

Comment 70 Rajat Chopra 2018-05-09 17:52:12 UTC
Here is the fix for decoupling flanneld restarts from docker:

1. Change the flanneld.service file:
   In the [Install] section, replace RequiredBy=docker.service with WantedBy=docker.service
2. systemctl daemon-reload
3. systemctl disable flanneld
4. systemctl enable flanneld

Make sure we still keep the restart policy as is.

Unfortunately this change cannot be done with a drop-in file, so the edit is required.

Comment 71 Ben Bennett 2018-05-09 18:27:40 UTC
To clarify step 1, add a step 0:

0) cp /usr/lib/systemd/system/flanneld.service /etc/systemd/system

Then in step 1 edit /etc/systemd/system/flanneld.service
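
Steps 0 and 1 can be combined into one small helper, sketched below. The function name patch_flanneld_unit is made up for illustration, and the sed expression assumes the [Install] line reads exactly RequiredBy=docker.service with no trailing whitespace:

```shell
# Copy the packaged unit to the admin override location with RequiredBy=
# swapped for WantedBy=. Both paths can be overridden, which also allows a
# dry run against a scratch copy.
patch_flanneld_unit() {
    src=${1:-/usr/lib/systemd/system/flanneld.service}
    dst=${2:-/etc/systemd/system/flanneld.service}
    sed 's/^RequiredBy=docker\.service$/WantedBy=docker.service/' "$src" > "$dst"
}

# On a node (as root), steps 0-4 would then be:
#   patch_flanneld_unit
#   systemctl daemon-reload
#   systemctl disable flanneld
#   systemctl enable flanneld
```

All other lines, including the restart policy, are copied through unchanged.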

