Bug 1635804
Summary: openshift-sdn daemonsets should tolerate taints

Product: OpenShift Container Platform
Component: Networking
Version: 3.11.0
Target Release: 3.11.z
Target Milestone: ---
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified

Reporter: Justin Pierce <jupierce>
Assignee: Vadim Rutkovsky <vrutkovs>
QA Contact: Siva Reddy <schituku>
CC: aos-bugs, jialiu, jokerman, mmccomas, schituku, wmeng

Doc Type: Bug Fix
Doc Text:
Cause: The SDN daemonset did not run on all nodes.
Consequence: Upgrades failed because some nodes were left without the internal network set up.
Fix: The SDN daemonset now tolerates all taints and runs on all nodes.
Result: Upgrades succeed.
(An illustrative sketch of the blanket toleration follows the remaining header fields below.)
Story Points: ---
Last Closed: 2018-12-12 14:15:51 UTC
Type: Bug
Regression: ---
Attachments: sdn pod log (attachment 1506492)
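Concretely, "tolerates all taints" means the daemonset's pod template carries a blanket toleration. The shipped fix lands through openshift-ansible; the command below is only a hypothetical manual equivalent for the sdn daemonset, shown to illustrate the shape of the change (the daemonset name is taken from the pods seen later in this report):

  # Hypothetical illustration only: the real fix ships via openshift-ansible,
  # not by hand-patching. A blanket toleration (operator Exists with no key and
  # no effect) matches every taint, including NoExecute, so the daemonset pods
  # are neither kept off tainted nodes nor evicted from them. Note that this
  # strategic merge patch replaces any tolerations already on the daemonset.
  oc patch daemonset sdn -n openshift-sdn \
    -p '{"spec":{"template":{"spec":{"tolerations":[{"operator":"Exists"}]}}}}'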
Description
Justin Pierce
2018-10-03 17:34:37 UTC
Fix is available in openshift-ansible-3.11.42-1

Tested the fix in the 43 build, but the pods are still affected by the taint. Here are the details.

Version:
openshift v3.11.43
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. $node is the compute node where the pods are running.
2. Note the pods running for sync, ovs, and sdn:
# oc get pods -n openshift-node -o wide | grep $node ; oc get pods -n openshift-sdn -o wide | grep $node
sync-vkxwk   1/1   Running
ovs-gjkts    1/1   Running
sdn-c7c7f    1/1   Running
3. Note that three pods are running: one each for sync, ovs, and sdn.
4. Taint the node:
# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
# oc describe node $node | grep -i taint
Taints: NodeWithImpairedVolumes=true:NoExecute
5. Restart the API and controllers:
# master-restart api
# master-restart controllers
6. Note the pods for sync, ovs, and sdn:
sync-vkxwk   0/1   CrashLoopBackOff
sdn-c7c7f    0/1   CrashLoopBackOff

The ovs pod doesn't even show up, and the sync and sdn pods go into CrashLoopBackOff constantly.

(In reply to Siva Reddy from comment #3)
> Tested the fix in the 43 build, but the pods are still affected by the taint.
> 6. Note the pods for sync, ovs, and sdn:
> sync-vkxwk   0/1   CrashLoopBackOff
> sdn-c7c7f    0/1   CrashLoopBackOff

Do pods start on tainted nodes? What's in the logs for these containers? Why would a master api/controller restart be required?

Do pods start on tainted nodes? They are going into CrashLoopBackOff.
What's in the logs for these containers? I have sent you the cluster details via private message.
Why would a master api/controller restart be required? I was just following the steps in the guide to apply taints and tolerations; it may not be needed: https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html

Can you attach the logs of the crashlooping sdn pods? Taints don't affect pods, they just affect whether or not the pod is scheduled. If the pod is crashlooping, that's a different (but perhaps related) bug.

Yes, I agree. My guess is that the sdn pod is crashlooping because the taint is terminating the ovs pod. The environment is no longer available, but I will spin up a new one and attach the logs. (A quick way to check this is sketched after the attached log below.)

Created attachment 1506492 [details]
sdn pod log
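As a diagnostic aid for the crashloop scenario above, the mismatch can be seen by comparing the node's taints with the daemonsets' tolerations. This is a sketch: the daemonset names sdn and ovs (namespace openshift-sdn) and sync (namespace openshift-node) are inferred from the pod names in this bug.

  # A NoExecute taint on the node that no toleration on the pod template matches
  # will evict the running pods; list both sides to spot the gap.
  oc get node $node -o jsonpath='{.spec.taints}{"\n"}'
  oc get daemonset sdn  -n openshift-sdn  -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
  oc get daemonset ovs  -n openshift-sdn  -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
  oc get daemonset sync -n openshift-node -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'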
Yes, it looks like the OVS daemonset missed the taints as well. Vadim, can you take care of that?

Ah, that would explain it.

Fix is available in openshift-ansible-3.11.45-1

Verified that the sdn daemonsets tolerate the taints and do not crash after the nodes are tainted.

Version:
openshift v3.11.51
kubernetes v1.11.0+d4cacc0
oc v3.11.50
openshift-ansible-3.11.51-1.git.0.51c90a3.el7.noarch.rpm

Verification steps:
1. $node is the compute node where the pods are running.
2. Note the pods running for sync, ovs, and sdn:
# oc get pods -n openshift-node -o wide | grep $node ; oc get pods -n openshift-sdn -o wide | grep $node
sync-vkxwk   1/1   Running
ovs-gjkts    1/1   Running
sdn-c7c7f    1/1   Running
3. Note that three pods are running: one each for sync, ovs, and sdn.
4. Taint the node:
# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
# oc describe node $node | grep -i taint
Taints: NodeWithImpairedVolumes=true:NoExecute
5. Note the pods for sync, ovs, and sdn: all the pods keep running without crashing. (Removal of the test taint afterwards is sketched at the end of this report.)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3743
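For completeness, the NoExecute test taint applied during verification can be removed afterwards; the trailing dash is the standard taint-removal syntax (a sketch using the same key and node variable as the steps above):

  # Remove the test taint once verification is done; the value may also be
  # omitted, i.e. "NodeWithImpairedVolumes:NoExecute-".
  oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute-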