Bug 2056605
Summary: [osp][octavia lb] unschedulable worker node after cloud provider config change

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | medium |
| Version | 4.10 |
| Keywords | TestBlocker |
| Reporter | Jon Uriarte <juriarte> |
| Assignee | Andreas Karis <akaris> |
| QA Contact | Anurag saxena <anusaxen> |
| CC | aos-bugs, m.andre, mdulko, mfedosin, pprinett, rravaiol |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2022-04-04 15:53:18 UTC |
Description (Jon Uriarte, 2022-02-21 15:08:37 UTC)
Seen in 4.8 as well, in this case after adding the LoadBalancer section to the cloud-provider configmap (enabling Octavia has to be done as a day-2 operation):

```
$ oc edit configmap -n openshift-config cloud-provider-config
[...]
config: |
  [...]
  [LoadBalancer]
  use-octavia = True
[...]
```

The CI then waits for the configuration to be applied, but one node never becomes schedulable again:

```
2022-03-01 04:53:38.896 | TASK [tests : Wait until there are no unschedulable nodes - this means the config has been applied] ***
2022-03-01 04:53:38.902 | Tuesday 01 March 2022 04:53:35 +0000 (0:00:24.794) 0:02:07.969 *********
[...]
2022-03-01 05:52:56.702 | FAILED - RETRYING: Wait until there are no unschedulable nodes - this means the config has been applied (1 retries left).
2022-03-01 05:54:59.423 | fatal: [undercloud-0]: FAILED! => {
2022-03-01 05:54:59.426 |     "attempts": 30,
2022-03-01 05:54:59.431 |     "changed": false,
2022-03-01 05:54:59.435 |     "resources": [
2022-03-01 05:54:59.438 |         {
2022-03-01 05:54:59.441 |             "apiVersion": "v1",
2022-03-01 05:54:59.444 |             "kind": "Node",
[...]
2022-03-01 05:55:00.078 |             "spec": {
2022-03-01 05:55:00.081 |                 "providerID": "openstack:///32125a88-8505-4e29-984f-514b229281d7",
2022-03-01 05:55:00.084 |                 "taints": [
2022-03-01 05:55:00.087 |                     {
2022-03-01 05:55:00.090 |                         "effect": "NoSchedule",
2022-03-01 05:55:00.093 |                         "key": "node.kubernetes.io/unschedulable",
2022-03-01 05:55:00.096 |                         "timeAdded": "2022-03-01T04:53:32Z"
2022-03-01 05:55:00.100 |                     }
2022-03-01 05:55:00.103 |                 ],
2022-03-01 05:55:00.106 |                 "unschedulable": true
2022-03-01 05:55:00.110 |             },
2022-03-01 05:55:00.113 |             "status": {
2022-03-01 05:55:00.117 |                 "addresses": [
2022-03-01 05:55:00.120 |                     {
2022-03-01 05:55:00.123 |                         "address": "10.196.2.151",
2022-03-01 05:55:00.127 |                         "type": "InternalIP"
2022-03-01 05:55:00.130 |                     },
2022-03-01 05:55:00.134 |                     {
2022-03-01 05:55:00.137 |                         "address": "ostest-dkswg-worker-0-x75m5",
2022-03-01 05:55:00.140 |                         "type": "Hostname"
2022-03-01 05:55:00.144 |                     }
2022-03-01 05:55:00.148 |                 ],
[...]
```
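The failing Ansible wait task above can be approximated with a small shell loop. This is a hedged sketch: `check_unschedulable` is a hypothetical stand-in for querying the cluster (on a live cluster it would run something like `oc get nodes -o jsonpath` with a filter on `.spec.unschedulable`); here it is stubbed so the sketch is self-contained.

```shell
# Sketch of the CI wait: poll until no node carries spec.unschedulable,
# mirroring the 30-attempt retry in the failed task above.
# check_unschedulable is a stub standing in for an `oc get nodes` query;
# swap in the real command on a live cluster.
check_unschedulable() {
    # stubbed: pretend no node is unschedulable any more
    printf ''
}

wait_for_schedulable() {
    attempts=0
    while [ "$attempts" -lt 30 ]; do
        pending=$(check_unschedulable)
        if [ -z "$pending" ]; then
            echo "all nodes schedulable"
            return 0
        fi
        attempts=$((attempts + 1))
        sleep 10
    done
    echo "timed out waiting for: $pending" >&2
    return 1
}

result=$(wait_for_schedulable)
echo "$result"
```

In the bug's CI run, all 30 attempts were exhausted because the `node.kubernetes.io/unschedulable:NoSchedule` taint never cleared on one worker.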
I kind of see that after the reboot the problematic node has a Kubernetes API connectivity issue. The nodes seem to have an additional network attached. Could this be related to [1]?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2059330

Okay, it shouldn't be connected with the aforementioned bug, as that one is fixed in the version used here. Moving the BZ to the ovn-kubernetes team, as it's related to default route metrics when changing the cloud-provider configuration.

Changing the cloud-provider config triggers the issue:

```
$ oc edit configmap -n openshift-config cloud-provider-config
[...]
[LoadBalancer]
use-octavia = True
lb-provider = ovn
[...]
```

One worker node remains unschedulable after the change:

```
NAME                          STATUS                     ROLES    AGE   VERSION           INTERNAL-IP
ostest-qgfdz-master-0         Ready                      master   82m   v1.23.3+bba255d   10.196.1.71
ostest-qgfdz-master-1         Ready                      master   82m   v1.23.3+bba255d   10.196.0.134
ostest-qgfdz-master-2         Ready                      master   82m   v1.23.3+bba255d   10.196.3.214
ostest-qgfdz-worker-0-fkk2h   Ready,SchedulingDisabled   worker   62m   v1.23.3+bba255d   10.196.3.93
ostest-qgfdz-worker-0-qtj89   Ready                      worker   62m   v1.23.3+bba255d   10.196.3.1
ostest-qgfdz-worker-0-tnbzg   Ready                      worker   62m   v1.23.3+bba255d   10.196.1.39
```

OpenStack subnets:

```
+--------------------------------------+--------------------+--------------------------------------+---------------+
| ID                                   | Name               | Network                              | Subnet        |
+--------------------------------------+--------------------+--------------------------------------+---------------+
| 0c4a9fab-e91a-4250-8306-5389216dcf7c | ostest-qgfdz-nodes | e3c3fca6-d140-4098-85cc-ab2a8ee0e11a | 10.196.0.0/16 |
| 48f330e1-00bd-4e17-abee-f944248e2399 | StorageNFSSubnet   | 9b23f05e-0871-4206-a467-ec571cba27ef | 172.17.5.0/24 |
+--------------------------------------+--------------------+--------------------------------------+---------------+
```

Default routes before the change:

```
ostest-qgfdz-worker-0-fkk2h   <<----
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100

ostest-qgfdz-worker-0-qtj89
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100

ostest-qgfdz-worker-0-tnbzg
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100
```

After the change:

```
ostest-qgfdz-worker-0-fkk2h   <<----
---------------------------
default via 172.17.5.1 dev br-ex proto dhcp metric 49   <<---- the storage network takes priority
default via 10.196.0.1 dev ens3 proto dhcp metric 101

ostest-qgfdz-worker-0-qtj89 (no changes)
----------------------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100

ostest-qgfdz-worker-0-tnbzg (no changes)
----------------------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100
```

```
$ oc describe node ostest-qgfdz-worker-0-fkk2h
Name:               ostest-qgfdz-worker-0-fkk2h
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=nova
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ostest-qgfdz-worker-0-fkk2h
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.cinder.csi.openstack.org/zone=nova
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=nova
Annotations:        csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"f4cb37d6-12b3-448e-8416-34225b4ad0a6","manila.csi.openstack.org":"ostest-qgfdz-worker-0-fkk2h"}
                    k8s.ovn.org/host-addresses: ["10.196.3.93","172.17.5.216"]
                    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ostest-qgfdz-worker-0-fkk2h","mac-address":"fa:16:3e:aa:b9:44","ip-addresses":["172.17.5...
                    k8s.ovn.org/node-chassis-id: 54cc8a65-7ed7-4bdb-adb5-57f26f6d0e8a
                    k8s.ovn.org/node-mgmt-port-mac-address: 9a:b8:37:5e:ea:62
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.17.5.216/24"}
                    k8s.ovn.org/node-subnets: {"default":"10.129.2.0/23"}
                    machine.openshift.io/machine: openshift-machine-api/ostest-qgfdz-worker-0-fkk2h
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-ff3365f8923121ad1960df303d0d147c
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-cc201ca80e2b969ef05d2a4b291debb4
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 03 Mar 2022 11:31:11 +0000
Taints:             node.kubernetes.io/unschedulable:NoSchedule
                    UpdateInProgress:PreferNoSchedule
Unschedulable:      true
[...]
```

** Seen in 4.8 and 4.10 for now.

Hey there! Do you have sosreports and the full journal log (`tar -czf /host/var/log/journal.tar.gz /host/var/log/journal`) for one of the affected nodes? That would be super helpful. Thanks! - Andreas

Can you test and reproduce this on a 4.8 cluster, and on the node with issues, can you provide a sosreport and the full journal log (`tar -czf /host/var/log/journal.tar.gz /host/var/log/journal`) for one of the affected nodes? I will need the journal ovs-configure and NM logs to figure out why this happens in 4.8. But I imagine that anything that triggers a reconfiguration of br-ex would have the potential to trigger this. In 4.10, this will then definitely trigger https://bugzilla.redhat.com/show_bug.cgi?id=2057160; so in that case, it's a dup. https://github.com/openshift/machine-config-operator/commit/9e3e74d4d2c31e449352bb5b9efcdb25f1234c98 will trigger a replumbing of br-ex on every reboot, and that can lead to the issues described here.
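The broken state above comes down to which default route wins: the kernel prefers the lowest metric, so once br-ex picks up the storage network's DHCP lease at metric 49, API traffic leaves via the wrong network. A quick way to see which interface currently wins is to scan the default routes for the lowest metric. The sketch below runs awk over a captured copy of the broken node's routing table; on a live node you would pipe in the real `ip route show default` output instead of the here-doc.

```shell
# Pick the winning default route (lowest metric) from `ip route` output.
# The sample here-doc is the broken node's table from this report;
# on a live node, replace it with:  ip route show default
winner=$(awk '
    $1 == "default" {
        for (i = 1; i < NF; i++) {
            if ($i == "dev")    dev = $(i + 1)
            if ($i == "metric") metric = $(i + 1) + 0
        }
        # keep the route with the lowest metric seen so far
        if (best == "" || metric < best) { best = metric; bestdev = dev }
    }
    END { print bestdev " (metric " best ")" }
' <<EOF
default via 172.17.5.1 dev br-ex proto dhcp metric 49
default via 10.196.0.1 dev ens3 proto dhcp metric 101
EOF
)
echo "$winner"
# prints: br-ex (metric 49)
```

On the broken node this reports br-ex, but br-ex is now plumbed to the StorageNFS subnet (172.17.5.0/24) rather than the nodes subnet, which matches the API connectivity loss described above. Why the metrics flip after the br-ex reconfiguration is the subject of bug 2057160, which this report was ultimately duped to.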
(In reply to Andreas Karis from comment #10)
> Can you test and reproduce this on a 4.8 cluster and on the node with
> issues, can you provide a sosreport and the full journal log (`tar -czf
> /host/var/log/journal.tar.gz /host/var/log/journal`) for one of the affected
> nodes? I will need the journal ovs-configure and NM logs to figure out why
> this happens in 4.8. But I imagine that anything that triggers a
> reconfiguration of br-ex would have the potential to trigger this.

Hey Andreas, I'll try manually reproducing it and getting the info. We're hitting it on our CI but cannot extract that info from there. Thanks

Hi Jon! Did you have any luck reproducing this on 4.8? Thanks!

(In reply to Andreas Karis from comment #13)
> Hi Jon! Did you have any luck reproducing this on 4.8? Thanks!

Hey Andreas, not yet. If we see it's not reproducible in 4.8 anymore, we can close it, but let's wait a couple of weeks and see the results. Thanks

Ok, I'll close this as a duplicate in the meantime to clean up my backlog. If you can reproduce this on 4.8, please simply reopen this BZ and I'll have a look at it.

*** This bug has been marked as a duplicate of bug 2057160 ***
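For completeness, the journal collection requested in the comments above can be exercised without a cluster. This is a hedged sketch: a scratch directory stands in for `/host/var/log/journal` so the tar invocation itself can be verified anywhere; on the affected node (reached for example via `oc debug node/<name>` followed by `chroot /host`) the command is the one Andreas gave.

```shell
# On the affected node the requested command is:
#   tar -czf /host/var/log/journal.tar.gz /host/var/log/journal
# Below, a throwaway directory stands in for the journal so the same
# tar flags can be checked locally.
src=$(mktemp -d)
echo "dummy journal entry" > "$src/system.journal"

# create the compressed archive, then list its contents to verify it
tar -czf "$src.tar.gz" -C "$src" .
listing=$(tar -tzf "$src.tar.gz")

rm -rf "$src" "$src.tar.gz"
echo "$listing"
```

The resulting `journal.tar.gz` (plus a sosreport) is what would carry the ovs-configure and NetworkManager logs needed to confirm the br-ex replumbing path.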