Bug 1453113
Summary: | all veth cannot be recovered after restarting openvswitch service
---|---
Product: | OpenShift Container Platform
Reporter: | Hongan Li <hongli>
Component: | Installer
Assignee: | Scott Dodson <sdodson>
Status: | CLOSED ERRATA
QA Contact: | Hongan Li <hongli>
Severity: | high
Priority: | high
Docs Contact: |
Version: | 3.6.0
CC: | aos-bugs, atragler, bbennett, bleanhar, dcbw, hongli, jialiu, jokerman, mmccomas, rkhan, sdodson, sukulkar, weliang, wsun, xtian, yadu, zzhao
Target Milestone: | ---
Keywords: | Regression
Target Release: | 3.7.0
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: |
Doc Type: | Bug Fix
Doc Text: | Previously the node service was not restarted when openvswitch was restarted, which could result in a misconfigured networking environment. The services have been updated to ensure that the node service is restarted whenever openvswitch is restarted.
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2017-11-28 21:55:46 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Attachments: |
Description: Hongan Li, 2017-05-22 08:00:58 UTC
After restarting the node service, all the existing pods are no longer running.

```
# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-4mchj   1/1       Running   0          1h

# service atomic-openshift-node restart

# oc get po
NAME                      READY     STATUS             RESTARTS   AGE
docker-registry-1-4mchj   0/1       CrashLoopBackOff   4          1h

# oc describe po docker-registry-1-4mchj
<--snip-->
Events:
  FirstSeen  LastSeen  Count  From                                                       SubObjectPath              Type     Reason     Message
  ---------  --------  -----  ----                                                       -------------              ----     ------     -------
  1h         1h        1      default-scheduler                                                                     Normal   Scheduled  Successfully assigned docker-registry-1-4mchj to host-8-175-8.host.centralci.eng.rdu2.redhat.com
  1h         1h        1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Pulling    pulling image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ose-docker-registry:v3.6.84"
  1h         1h        1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Pulled     Successfully pulled image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ose-docker-registry:v3.6.84"
  1h         1h        1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Created    Created container with id 13f2f8c0dda48065e199683800079a3aa5be393c9f41694df252d1203e287354
  1h         1h        1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Started    Started container with id 13f2f8c0dda48065e199683800079a3aa5be393c9f41694df252d1203e287354
  16m        16m       1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Warning  Unhealthy  Liveness probe failed: Get https://10.128.0.4:5000/healthz: http2: no cached connection was available
  23s        23s       1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Started    Started container with id 20dcc80dfb76024def7ff88f80159830f26dd7e0a4eadd10e485ab5bf53fac34
  23s        23s       1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Created    Created container with id 20dcc80dfb76024def7ff88f80159830f26dd7e0a4eadd10e485ab5bf53fac34
  24s        4s        2      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Pulled     Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ose-docker-registry:v3.6.84" already present on machine
  14s        4s        2      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Warning  Unhealthy  Readiness probe failed: Get https://10.128.0.4:5000/healthz: dial tcp 10.128.0.4:5000: getsockopt: no route to host
  4s         4s        1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Warning  Unhealthy  Liveness probe failed: Get https://10.128.0.4:5000/healthz: dial tcp 10.128.0.4:5000: getsockopt: no route to host
  4s         4s        1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Killing    Killing container with id docker://20dcc80dfb76024def7ff88f80159830f26dd7e0a4eadd10e485ab5bf53fac34: pod "docker-registry-1-4mchj_default(6c070063-41e0-11e7-a809-fa163e308807)" container "registry" is unhealthy, it will be killed and re-created.
  3s         3s        1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Created    Created container with id 43b9471c49f82c440d93ee9ec38b18454aa1e12f98998a78a320164bc3e9654d
  3s         3s        1      kubelet, host-8-175-8.host.centralci.eng.rdu2.redhat.com  spec.containers{registry}  Normal   Started    Started container with id 43b9471c49f82c440d93ee9ec38b18454aa1e12f98998a78a320164bc3e9654d
<--snip-->
```

It seems my issue has the same root cause as this bug's initial report. Raising its priority.

Note that we do not support restarting openvswitch underneath openshift. Restarting openshift-node itself, however, should be able to recover.

*** Bug 1453190 has been marked as a duplicate of this bug. ***

Restarting openvswitch is supported in 3.5 and earlier, and we had a test case for https://bugzilla.redhat.com/show_bug.cgi?id=1316202.

Can you run:

```
systemctl cat atomic-openshift-node.service
```

Mine has:

```
# /usr/lib/systemd/system/atomic-openshift-node.service.d/openshift-sdn-ovs.conf
[Unit]
Requires=openvswitch.service
After=ovsdb-server.service
After=ovs-vswitchd.service
After=openvswitch.service
```

But that file comes from atomic-openshift-sdn-ovs-3.6.65-1.git.0.1c96bcd.el7.x86_64, so it is added by the productization team. If that file is missing, then this may be a productization bug. But if you installed from source, it may not be part of the rpm.

To clarify, we support restarting OVS if-and-only-if OpenShift is restarted immediately after OVS. We don't support restarting OVS and then restarting OpenShift later. The file that Ben refers to attempts to ensure that OpenShift is restarted whenever OVS is restarted.

Yes, the unit file contains the OVS-related services as described in comment #5:

```
# systemctl cat atomic-openshift-node.service
...
...
# /usr/lib/systemd/system/atomic-openshift-node.service.d/openshift-sdn-ovs.conf
[Unit]
Requires=openvswitch.service
After=ovsdb-server.service
After=ovs-vswitchd.service
After=openvswitch.service
```

And the openshift service is restarted when OVS gets restarted:

```
# ps -ef | grep openshift
root     13444      1  1 Jun01 ?      01:37:35 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=0
root    128640   1221  0 14:11 pts/0  00:00:00 grep --color=auto openshift
# systemctl restart openvswitch.service
# ps -ef | grep openshift
root    128922      1 12 14:11 ?      00:00:00 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=0
root    129119   1221  0 14:11 pts/0  00:00:00 grep --color=auto openshift
```

So the problem here should be that all the veths are missing when the openshift-node service is restarted.

(In reply to Johnny Liu from comment #1)
> After restarting the node service, all the existing pods are no longer running.
> # service atomic-openshift-node restart
> It seems my issue has the same root cause as this bug's initial report.

No, that's something different (https://github.com/openshift/origin/pull/14446, which doesn't seem to have a bugzilla bug).

OK, so the problem here (currently, which is slightly different from when this was originally filed, before PR 14446 merged) is that if openshift does a full network re-setup on restart (which it will have to do if you restarted openvswitch), then the UpdatePod() loop will fail, because ovscontroller.UpdatePod() tries to get the containerID out of the OVS flows, but the OVS flows won't be there. (There's also a second problem: ovscontroller.UpdatePod() doesn't call ensureOvsPort() any more, which would require it to know the veth name, which it doesn't know.) dcbw: thoughts?
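As context for the support statement above (OVS may be restarted only if the node service is restarted immediately afterwards), here is a minimal sketch of that sequence on an RPM-installed node; the unit names are the ones used in this report, and ovs-ofctl is assumed to be available on the host:

```
# Restart OVS and then the node service back-to-back; restarting OVS alone is not supported.
systemctl restart openvswitch.service
systemctl restart atomic-openshift-node.service

# Confirm the node service came back up.
systemctl is-active atomic-openshift-node.service

# Confirm the SDN has reprogrammed the OpenFlow rules on br0 (an OVS restart wipes them).
ovs-ofctl dump-flows br0 -O openflow13
```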
Tested in the latest build atomic-openshift-3.6.106-1.git.0.1072f4f.el7.x86_64 (PR 14446 has been merged), with the results below:

1. restart node service: OK
2. restart openvswitch: veth cannot be recovered

Upstream fix pull request: https://github.com/openshift/origin/pull/14665

*** Bug 1461709 has been marked as a duplicate of this bug. ***

Verified in atomic-openshift-3.6.133-1.git.0.524e4c8.el7.x86_64; the issue has been fixed.

Re-opening, since the issue can still be reproduced in a containerized installation environment. The previous verification in an RPM installation environment passed.

(In reply to hongli from comment #15)
> Re-opening, since the issue can still be reproduced in a containerized
> installation environment. The previous verification in an RPM installation
> environment passed.

Can you get openshift-node logs with --loglevel=5 when this fails?

Created attachment 1296760 [details]
node logs
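For the log request above, a minimal sketch of one way to capture node logs at loglevel 5; the sysconfig path is the conventional one for an RPM-installed node (an assumption here), and a containerized node passes its options differently:

```
# Raise the node log level to 5 and restart the service so it takes effect
# (assumes the RPM install's sysconfig file carries OPTIONS=--loglevel=N).
sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
systemctl restart atomic-openshift-node.service

# Reproduce the failure (restart openvswitch), then collect the node logs.
journalctl -u atomic-openshift-node.service --since "10 minutes ago" > node.log
```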
Reproduced in a containerized env with build openshift v3.6.140 and attached the node logs.

Test steps:
1. restart atomic-openshift-node service: OK
2. restart openvswitch service: failed to connect to pods

Checked OVS and found all the OpenFlow rules gone but the veths still there, see below:

```
[root@qe-hongli-bugv-node-registry-router-1 ~]# docker exec openvswitch ovs-ofctl dump-flows br0 -O openflow13
OFPST_FLOW reply (OF1.3) (xid=0x2):
[root@qe-hongli-bugv-node-registry-router-1 ~]# docker exec openvswitch ovs-vsctl show
ba9be912-eaac-4e9d-ae91-8e50b3b0d0bd
    Bridge "br0"
        fail_mode: secure
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "veth4c05d454"
            Interface "veth4c05d454"
        Port "br0"
            Interface "br0"
                type: internal
        Port "veth60eb808b"
            Interface "veth60eb808b"
        Port "vethee7b5367"
            Interface "vethee7b5367"
        Port "vethcbe6fe7d"
            Interface "vethcbe6fe7d"
    ovs_version: "2.6.1"
```

(In reply to hongli from comment #18)
> Reproduced in a containerized env with build openshift v3.6.140 and attached
> the node logs.
>
> Test steps:
> 1. restart atomic-openshift-node service: OK
> 2. restart openvswitch service: failed to connect to pods

Just to be clear; we do not support restarting OVS after restarting OpenShift. We do support restarting OVS if OpenShift is restarted right after.

Are you still able to reproduce if you:

test steps:
1. restart atomic-openshift-node service, OK
2. restart openvswitch service, failed to connect Pods
3. restart atomic-openshift-node service

(In reply to Dan Williams from comment #19)
> Just to be clear; we do not support restarting OVS after restarting
> OpenShift.
>
> We do support restarting OVS if OpenShift is restarted right after.
>
> Are you still able to reproduce if you:
>
> test steps:
> 1. restart atomic-openshift-node service, OK
> 2. restart openvswitch service, failed to connect Pods
> 3. restart atomic-openshift-node service

Thanks Dan, the test results are different in the RPM and containerized installation environments.

In the RPM installation env:
1. systemctl restart openvswitch (it is OK, the pod's IP changed and the pods are reachable after OVS restarted)

In the containerized installation env:
1. systemctl restart openvswitch (the pod's IP is not changed and the pod cannot be reached)
2. systemctl restart atomic-openshift-node (it is OK, the pod's IP changed and the pod is reachable)

Is this as expected? Why do we need to restart the node service in the containerized env but not in the RPM env?
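One way to see why the two install types behave differently is to inspect which dependency directives tie the node unit to openvswitch in each environment; a quick sketch using standard systemctl commands and the unit names from this bug:

```
# Show the node unit plus any drop-ins, and the directives that tie it to OVS.
systemctl cat atomic-openshift-node.service | grep -E 'Requires=|Wants=|PartOf=|After='

# List the units that declare a dependency on openvswitch.service.
systemctl list-dependencies --reverse openvswitch.service
```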
(In reply to hongli from comment #20)
> Thanks Dan, the test results are different in the RPM and containerized
> installation environments.
>
> In the RPM installation env:
> 1. systemctl restart openvswitch (it is OK, the pod's IP changed and the
> pods are reachable after OVS restarted)
>
> In the containerized installation env:
> 1. systemctl restart openvswitch (the pod's IP is not changed and the pod
> cannot be reached)
> 2. systemctl restart atomic-openshift-node (it is OK, the pod's IP changed
> and the pod is reachable)
>
> Is this as expected? Why do we need to restart the node service in the
> containerized env but not in the RPM env?

That's quite interesting and unexpected.

Could you grab logs from openshift-node over the time when you restart openvswitch?

Created attachment 1298142 [details]
nodelog without restarting ovs

Created attachment 1298143 [details]
nodelog during restarting ovs
nodelog.1 is from the node running without restarting OVS, and nodelog.2 is the log captured while OVS was being restarted. I did not see much difference between them, except one warning about conversion.go. Logs attached.

Thanks, I also don't see much of a difference.

I seem to recall that we have some kind of dependency in the systemd unit files for non-containerized OpenShift. So I looked around and came across https://bugzilla.redhat.com/show_bug.cgi?id=1316202#c4, which says that due to the Requires+Restart directives in the non-containerized OpenShift node unit file, systemd should restart openshift automatically when openvswitch goes down. However, I cannot reproduce that behavior. So I guess we continue to recommend the existing procedure (like in https://docs.openshift.com/container-platform/3.4/install_config/upgrading/manual_upgrades.html).

Looking further, I *can* make openshift restart when OVS restarts by putting "Requires=openvswitch.service" into the systemd unit file for the node. Ben/Eric, should we do that for all the unit files in the origin tree and ansible?

Dan, trying to understand the ordering of restarting OVS and openshift-node, are my expectations below correct?

1. Restart OVS only: OK
2. Restart node service only: OK
3. Restart OVS just after restarting the node service: OK
4. Restart node service just after restarting OVS: OK
5. Restart OVS, wait for a while, then restart node service: Not OK
6. Restart node service, wait for a while, then restart OVS: OK

(In reply to Weibin Liang from comment #26)
> Dan, trying to understand the ordering of restarting OVS and openshift-node,
> are my expectations below correct?

I think it's easier to simply say: "Whenever you restart OVS, you must restart the node service immediately after." With that in mind, (1), (5), and (6) should be "not OK".

(In reply to hongli from comment #20)
> In the RPM installation env:
> 1. systemctl restart openvswitch (it is OK, the pod's IP changed and the
> pods are reachable after OVS restarted)

Then the above test result is not clear: hongli, did you restart OVS only, or did you restart OVS and then restart the node service immediately after?

(In reply to Weibin Liang from comment #28)
> Then the above test result is not clear: hongli, did you restart OVS only,
> or did you restart OVS and then restart the node service immediately after?

In the RPM installation env you only need to restart OVS, and you can see that the node service is restarted automatically when you restart OVS. But in the containerized env, when you restart OVS only the OVS service is restarted; it does not trigger a restart of the node service. This is the problem. Please help confirm.

So it seems we *used* to have Requires=openvswitch.service in the containerized installs, which would correctly restart openshift-node. But that was changed to Wants back in May due to:
https://bugzilla.redhat.com/show_bug.cgi?id=1451192
https://github.com/openshift/openshift-ansible/pull/4213

Scott: Can you help here? It looks like we have to pick which of two bugs we want to have :-(

Some discussion of other possible solutions here: https://github.com/openshift/openshift-ansible/pull/4820

Asking Giuseppe to look into this.

Scott: Any news? Thanks!

https://github.com/openshift/openshift-ansible/pull/4820 merged 9 hours ago.
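As the next comment notes, the merged change ties the node unit to openvswitch with a PartOf= dependency, which propagates a restart of openvswitch to the node service. A minimal sketch of such a drop-in, for illustration only; the drop-in file name is hypothetical, and the shipped unit files are generated by openshift-ansible:

```
# /etc/systemd/system/atomic-openshift-node.service.d/openvswitch-restart.conf  (hypothetical name)
# PartOf= propagates stop and restart of openvswitch.service to this unit, so
# "systemctl restart openvswitch" also restarts the node service.
[Unit]
PartOf=openvswitch.service
After=openvswitch.service
```

After creating a drop-in like this, `systemctl daemon-reload` is needed for it to take effect.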
Tested in the latest OCP build v3.6.173.0.45, but the PR is not merged yet. Following the PR and adding "PartOf=openvswitch.service" to /etc/systemd/system/atomic-openshift-node.service, the issue is fixed in the containerized env. Will re-test with the next build to see whether it has been merged and fixed.

Verified in atomic-openshift-3.6.173.0.49-1.git.0.0b9377a but it still failed. Checked the file below, but there is no "PartOf=openvswitch.service" in it:

```
# cat /etc/systemd/system/atomic-openshift-node.service
[Unit]
After=atomic-openshift-master.service
After=docker.service
After=openvswitch.service
PartOf=docker.service
Requires=docker.service
Wants=openvswitch.service
After=ovsdb-server.service
After=ovs-vswitchd.service
Wants=atomic-openshift-master.service
Requires=atomic-openshift-node-dep.service
After=atomic-openshift-node-dep.service
Requires=dnsmasq.service
After=dnsmasq.service
```

Scott: Do we want to backport that fix?

Verified in the v3.7.0-0.153.0 containerized env, and the bug has been fixed. See also https://bugzilla.redhat.com/show_bug.cgi?id=1453113#c40

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188