Bug 1564955
Summary: | When using flannel host-gw, routes get deleted intermittently on the openshift-node (an OpenStack instance) | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Miheer Salunke <misalunk> |
Component: | Networking | Assignee: | Casey Callendrello <cdc> |
Status: | CLOSED EOL | QA Contact: | Meng Bo <bmeng> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.5.0 | CC: | aos-bugs, bbennett, bleanhar, cdc, dcbw, emahoney, grajawat, hongli, jawilson, jchaloup, jkalliya, jkaur, jmalde, misalunk, mmahudha, pasik, rrajaram, schoudha, sgaikwad, weliang |
Target Milestone: | --- | ||
Target Release: | 4.1.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-02-21 16:09:40 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1585789 | ||
Bug Blocks: |
Comment 1
Rajat Chopra
2018-04-09 04:34:32 UTC
(In reply to Rajat Chopra from comment #13)
> When flannel stops running on a node for more than a certain period of time,
> the node will lose its subnet lease. Upon restart it will acquire a new
> lease if the older lease has expired. The lease duration is usually 24 hours.
>
> Hence, the first cause of this problem is the flannel daemon dying, and the
> bigger problem happens when it stays dead for longer than the lease period.
> So, please fix the systemd unit file to allow a restart. That is a necessary
> fix irrespective of the root cause of why the daemon is dying.
>
> Part 2 of the fix/investigation will need logs as and when the daemons die.
> Perhaps a full disk?
>
> Next up, after we put the restart in the unit file:
> how do we fix the nodes that have duplicate leases?
> - The cleanest but abrupt answer is to kill all existing pods on that node.
> - It may be possible to do the cleanup in etcd; this will require a special
>   fix script.

Hi Rajat,

Overall, I think you are saying that first we are going to troubleshoot why the routes are incorrect on the node, or are not getting updated properly, despite flannel running on the node. For that we will be adding a restart on SIGPIPE so that flannel keeps running even when it receives a SIGPIPE, which will help us understand the pattern of why the routes were not added, right?

Steps:

Issue 1 — routes are not correct on the node:
1. Mark the node which has the duplicate lease as unschedulable.
2. Drain the pods to a different node which does not have a duplicate lease.
3. Enable debug-level logging for flannel on all the nodes in the cluster and restart the node service.
4. Once the route-update issue occurs on any node, collect the flannel logs. (But to notice the occurrence of this issue, the customer has to hit a connection problem to a pod on that specific node, right?)
5. Analyze the logs and update the BZ.
6. If the issue is found and we have a workaround, apply it and then schedule pods on the node again; otherwise leave the pods on a node without the route issue, or manually fix the routes with `ip route`.

Once Issue 1 is solved:

Issue 2 — flannel gets killed by SIGPIPE:
1. Set debug-level flannel logging.
2. Start strace on the flannel process on all the nodes.
3. Start tcpdump on the flannel port.
4. Enable SAR per minute on the nodes to check resource usage.

Thanks a lot again, Rajat, for looking into this! Let me know if you find anything wrong in the above steps.

Thanks and regards,
Miheer Salunke

The flanneld service restart issue reoccurred on 23rd April at 09:55:45 on app node 17.
- Flanneld logging at level 10 is enabled (FLANNEL_OPTIONS="-v=10 -logtostderr")
- SAR is enabled
- As an interim solution we have applied:
  - IgnoreSIGPIPE=yes
  - Restart=always
- The latest SOS report and logs are at foobar.

The logs haven't revealed any significant data. Snippet of the analysis:

Flanneld logs:
--------------
Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com flanneld-start[34628]: I0423 09:55:45.168695 34628 network.go:114] Route to 10.128.80.0/23 via 10.3.168.159 already exists, skipping.
Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com systemd[1]: Started Flanneld overlay address etcd agent.
Apr 23 09:58:19 prod-node17.shift.enoc-airtel.com flanneld-start[34628]: I0423 09:58:19.385187 34628 network.go:83] Subnet added: 10.128.236.0/23 via 10.3.168.196
Apr 23 09:58:19 prod-node17.shift.enoc-airtel.com flanneld-start[34628]: I0423 09:58:19.385600 34628 network.go:114] Route to 10.128.236.0/23 via 10.3.168.196 already exists, skipping.

Docker logs:
------------
Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com systemd[1]: Starting Docker Application Container Engine...
Node logs:
----------
Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com atomic-openshift-node[107298]: E0423 09:55:45.307538 107298 kubelet_pods.go:736] Error listing containers: &errors.errorString{s:"Cannot connect to the Docker daemon. Is the docker daemon running on this host?"}
Apr 23 09:55:45 prod-node17.shift.enoc-airtel.com atomic-openshift-node[107298]: E0423 09:55:45.307590 107298 kubelet.go:1861] Failed cleaning pods: Cannot connect to the Docker daemon. Is the docker daemon running on this host?

Any pointer will be helpful.

Regards,
Giriraj Rajawat

Please make sure this doesn't get lost in the other request: did someone restart journald? It looks like you can get this crash when writing to STDOUT or STDERR fails: https://github.com/golang/go/issues/11845

We can check by looking at:
> systemctl status systemd-journald
and
> systemctl status flanneld
and seeing whether the restart times are near one another.

We are hitting a warning, with no useful output. Debug packages for flanneld and glibc were installed:
# debuginfo-install -y glibc
# debuginfo-install -y flanneld

These are the warning messages after running the gdb process:
[root@prod-node11 /]# gdb --command=/tmp/command -p 1858 > /kdump/error
warning: the debug information found in "/usr/lib/debug//usr/bin/flanneld.debug" does not match "/usr/bin/flanneld" (CRC mismatch).
warning: the debug information found in "/usr/lib/debug/usr/bin/flanneld.debug" does not match "/usr/bin/flanneld" (CRC mismatch).

The CRC mismatch was corrected by aligning the package subversions.

We are waiting on the customer for debug logs.

Since we have a mechanism (the restart policy) to keep the SIGPIPE from affecting the operation of the cluster, the severity of this bug is being lowered.

Note from Suresh: what I have observed is that if flannel is trying to push something to etcd and the connection breaks, it gets SIGPIPE.

What version of docker are we running? The pods should not get restarted on a docker restart.
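The restart-time comparison suggested above can be scripted rather than eyeballed. A minimal sketch — the two timestamps below are hypothetical sample values in seconds; on a real node they would come from something like `systemctl show -p ActiveEnterTimestampMonotonic systemd-journald` and the same query for flanneld (that property is in microseconds, so divide accordingly):

```shell
# Hypothetical sample timestamps standing in for live `systemctl show` output,
# so this sketch can run anywhere:
journald_ts=1524477345   # last ActiveEnter of systemd-journald (sample)
flanneld_ts=1524477347   # last ActiveEnter of flanneld (sample)

# Absolute difference between the two restart times.
delta=$(( flanneld_ts - journald_ts ))
if [ "$delta" -lt 0 ]; then delta=$(( -delta )); fi

# If flanneld came back within a few seconds of journald, the journald
# stop/start very likely triggered the SIGPIPE-driven flanneld restart.
if [ "$delta" -le 10 ]; then
  echo "correlated (delta=${delta}s)"
else
  echo "not correlated (delta=${delta}s)"
fi
```

With the sample values this prints `correlated (delta=2s)`; the 10-second window is an arbitrary threshold, not anything systemd defines.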
Can we get the current contents of the systemd unit files for flanneld, openshift, and docker?

We will explore decoupling docker and flannel as suggested in comment #40. A few things would still need to be done long term:
1. Upgrade docker to at least 1.13.
2. Put a signal handler in flannel (the SIGPIPE is caused by journald restarting). See this example: https://github.com/moby/moby/pull/22460
3. Fix the installer for flannel to include the restart option.

Can we get the systemd version as well?

The root cause is that journald is stopped and started. The appropriate way is to do a `systemctl restart systemd-journald`; the correct way reaps all the socket FDs and does not result in the process getting restarted. Can we find out who/how/why journald stops and starts? A stop/start is not the same as a restart.

(In reply to Rajat Chopra from comment #44)
> Can we get the systemd version as well?
>
> The root cause is that journald is stopped and started. The appropriate way
> is to do a systemctl restart systemd-journald. This correct way will reap
> all the socket FDs and it does not result in the process getting restarted.
>
> Can we find out who/how/why the journald stops/starts? Stop/start is not the
> same as a restart.

Do you want us to take gdb on systemd-journald? Something like this? https://bugzilla.redhat.com/show_bug.cgi?id=1438229#c0

The SIGPIPE may not be trappable inside the flannel program itself. Given that the stack trace involves glog, I have written a small program that mimics this crash. SIGPIPE is not easily trappable in it, as EPIPE is returned from the `write` system call, unless all FDs are re-opened. We will have to decouple flannel and docker for relief from the current issues while we investigate how to fix reaping of all FDs. I will send the drop-in file fix after I test it.

Can someone verify whether docker and flanneld are indeed coupled on restarts? Kill flanneld and see if docker restarts? I could not reproduce this with the given unit files.
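The SIGPIPE mechanics described above are easy to reproduce from a shell. This is not flannel itself, just a minimal demonstration of the same failure mode: a writer whose reader goes away is killed by SIGPIPE on its next write, exactly as flanneld is when journald (the reader of its stdout/stderr stream) is stopped:

```shell
# `head` exits after reading one line; the next write by `yes` then hits a
# closed pipe and the kernel delivers SIGPIPE, killing the writer.
# Run under bash explicitly because PIPESTATUS is a bash feature.
writer_status=$(bash -c 'yes | head -n 1 > /dev/null; echo "${PIPESTATUS[0]}"')

# 141 = 128 + 13 (SIGPIPE): the writer died from the signal rather than
# exiting normally -- the same status journald stop/start inflicts on a
# daemon logging to it without IgnoreSIGPIPE/Restart protection.
echo "writer exit status: $writer_status"
```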
What is the output of these commands?
systemctl cat docker
systemctl cat flanneld
cat /etc/os-release
rpm -q docker flannel
Thanks.

The docker service runs fine after a stop/start of systemd-journald and a stop/start of flanneld.service. Tested on:
RHEL 7.3
atomic-openshift-master-3.5.5.31.67-1.git.0.0a8cf24.el7.x86_64
docker-1.12.6-71.git3e8e77d.el7.x86_64
flannel-0.7.1-3.el7.x86_64

Here is the fix for decoupling flanneld restarts from docker:
1. Change the flanneld.service file: in the [Install] section, replace RequiredBy=docker.service with WantedBy=docker.service.
2. systemctl daemon-reload
3. systemctl disable flanneld
4. systemctl enable flanneld

Make sure we still keep the restart policy as is. Unfortunately this change cannot be done with a drop-in file, so the edit is required.

To clarify step 1, add a step 0:
0. cp /usr/lib/systemd/system/flanneld.service /etc/systemd/system
Then in step 1, edit /etc/systemd/system/flanneld.service.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
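Putting the pieces of the fix together, the copied unit file would end up containing something like the excerpt below. This is a sketch, not the verbatim shipped unit: only the two sections the fix touches are shown, and the rest of /etc/systemd/system/flanneld.service is assumed to be copied unchanged from /usr/lib/systemd/system.

```ini
# /etc/systemd/system/flanneld.service (excerpt, hypothetical)

[Service]
# Keep the interim restart policy applied earlier in this bug:
Restart=always
IgnoreSIGPIPE=yes

[Install]
# Was: RequiredBy=docker.service -- a hard dependency that tied docker's
# lifecycle to flanneld's. WantedBy keeps the install-time relationship
# without coupling restarts.
WantedBy=docker.service
```

After editing, `systemctl daemon-reload` followed by `systemctl disable flanneld && systemctl enable flanneld` regenerates the dependency symlinks under the new [Install] section.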