Bug 1894539
Summary: [on-prem] Unable to deploy additional machinesets on separate subnets
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.7
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Martin André <m.andre>
Assignee: Martin André <m.andre>
QA Contact: weiwei jiang <wjiang>
CC: dgautam, dporter, ekasprzy, emarquez, javier.ordax, mkrejci, simore
Doc Type: Bug Fix
Doc Text:
Cause: For all on-premise platforms, when discovering the node IP, baremetal-runtimecfg wrongly assumed that nodes are always attached to a subnet containing the VIP, and looked for an address in that IP range.
Consequence: Nodes on separate subnets failed to generate configuration files with baremetal-runtimecfg.
Fix: Fall back to the IP address associated with the default route.
Result: New compute nodes can join the cluster when deployed on separate subnets.
Story Points: ---
Last Closed: 2021-02-24 15:30:03 UTC
Type: Bug
Regression: ---
Bug Blocks: 1932967
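The fix described in the Doc Text can be illustrated with a short sketch. This is purely illustrative (the actual baremetal-runtimecfg implementation is in Go, and the function and variable names here are hypothetical): prefer a node address whose subnet contains the VIP, and otherwise fall back to the address associated with the default route.

```python
import ipaddress

def choose_node_ip(node_cidrs, vip, default_route_ip):
    """Pick the node IP used for kubelet/CRI-O configuration.

    Hypothetical sketch, not the real baremetal-runtimecfg code.
    Old behaviour: only accept an address whose subnet contains the
    VIP. Fixed behaviour: fall back to the default-route address.
    """
    vip_addr = ipaddress.ip_address(vip)
    for cidr in node_cidrs:  # node addresses, e.g. "192.168.0.79/24"
        iface = ipaddress.ip_interface(cidr)
        if vip_addr in iface.network:
            return str(iface.ip)  # original logic: VIP-subnet match
    return default_route_ip       # the fix: default-route fallback

# A node on the VIP subnet keeps the old behaviour:
print(choose_node_ip(["192.168.0.79/24"], "192.168.0.5", "10.0.0.1"))
# → 192.168.0.79
# A compute node on a separate subnet now falls back instead of failing:
print(choose_node_ip(["192.168.66.124/24"], "192.168.0.5", "192.168.66.124"))
# → 192.168.66.124
```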
Description
Martin André
2020-11-04 13:21:27 UTC
*** Bug 1905134 has been marked as a duplicate of this bug. ***

Hi, I'm facing this issue in 4.6 IPI on VMware. Is this flagged as an error to be solved? If yes, since 4.6 is EUS, will it be fixed in 4.6? Thank you. A further comment: this is not limited to two subnets; it could be one subnet for masters, one subnet for infra, and several subnets for workers. Perhaps it should be configurable which nodes are used for the VIPs.

Hi Javier, there are currently no plans to backport these changes to 4.6 and earlier versions unless there is a strong business case, as it would require a lot of testing. There is also work going on to make this architecture more flexible: https://github.com/openshift/enhancements/pull/524.

Hi Martin, I see this as a bug in the current version. As 4.6 is still in full support, I assumed it would be fixed in 4.6, also taking into account that 4.6 is EUS. Being able to install in different VLANs is a feature of OpenShift, and it is not working with IPI. I have no business case other than this: customers with the requirement of using different VLANs want to use 4.6 to have a period of stability due to the EUS, and this error prevents them from using IPI.

Hi, I have created the following MachineConfig. It overwrites the files related to the keepalived static pod and its configuration, and also disables the nodeip-configuration.service unit.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: masters-chrony-configuration
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    networkd: {}
    passwd: {}
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Writes IP address configuration so that kubelet and crio services select a valid node IP
          # This only applies to VIP managing environments where the kubelet and crio IP
          # address picking logic is flawed and may end up selecting an address from a
          # different subnet or a deprecated address
          Wants=network-online.target
          After=network-online.target ignition-firstboot-complete.service
          Before=kubelet.service crio.service

          [Service]
          # Need oneshot to delay kubelet
          Type=oneshot
          # Would prefer to do Restart=on-failure instead of this bash retry loop, but
          # the version of systemd we have right now doesn't support it. It should be
          # available in systemd v244 and higher.
          ExecStart=/bin/bash -c " \
            until \
            /usr/bin/podman run --rm \
            --authfile /var/lib/kubelet/config.json \
            --net=host \
            --volume /etc/systemd/system:/etc/systemd/system:z \
            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:44dffacfd1b61252df317adcdd1b549c06dfd9d98436adce60fd9a6ec72c7f97 \
            node-ip \
            set --retry-on-failure \
            192.168.1.247; \
            do \
            sleep 5; \
            done"
          ExecStart=/bin/systemctl daemon-reload

          [Install]
          WantedBy=multi-user.target
        enabled: false
        name: nodeip-configuration.service
    storage:
      files:
      - filesystem: root
        overwrite: true
        path: "/etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl"
        contents:
          source: data:,foo
        mode: 420
      - filesystem: root
        overwrite: true
        path: "/etc/kubernetes/manifests/keepalived.yaml"
        contents:
          source: data:,kind%3A%20Pod%0AapiVersion%3A%20v1%0Ametadata%3A%0A%20%20name%3A%20foo-keepalived%0A%20%20namespace%3A%20openshift-vsphere-infra%20%0A%20%20labels%3A%0A%20%20%20%20app%3A%20vsphere-infra-vrrp%0Aspec%3A%0A%20%20containers%3A%0A%20%20-%20name%3A%20foo-keepalived%20%20%20%20%0A%20%20%20%20image%3A%20docker.io%2Fbusybox%20%20%0A%20%20hostNetwork%3A%20true%0A%20%20tolerations%3A%0A%20%20-%20operator%3A%20Exists%0A%20%20priorityClassName%3A%20system-node-critical
        mode: 420
  osImageURL: ""

I overwrite the files because I didn't find how to remove a file using an Ignition config. With this MachineConfig the worker nodes start fine. There is no VIP for workers, but that is not a problem: a load balancer is needed anyway to be productive, and the Ingress VIP is only "temporary" for the installation. What do you think, could this be a valid/supported workaround for OCP 4.6? Thank you.

Apart from some typos, I forgot to mention that I tested it by creating a cluster with 0 workers and master scheduling disabled. After the masters start, I deploy the MachineConfig and create machinesets using a network different from the masters' network. It could also be possible to add it to the <install folder>/openshift folder, to deploy it during cluster creation. In case an Ingress VIP is mandatory, one could avoid applying this MachineConfig to a pair of worker nodes deployed in the same VLAN as the masters, and apply it only to the workers deployed in different VLANs.

Checked with 4.7.0-0.nightly-2020-12-17-201522, and it works well now.
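The keepalived.yaml replacement in the workaround is injected through an Ignition data: URL, with the file body percent-encoded. A minimal sketch of producing such a value (assuming Python; the manifest content here is just the stub pod from the comment):

```python
from urllib.parse import quote

# Stub pod manifest, as in the workaround MachineConfig (illustrative content).
pod_manifest = """\
kind: Pod
apiVersion: v1
metadata:
  name: foo-keepalived
  namespace: openshift-vsphere-infra
"""

# Ignition file sources accept RFC 2397 data URLs; percent-encode the body.
# quote() leaves '/' unescaped by default, which data: URLs tolerate.
data_url = "data:," + quote(pod_manifest)
print(data_url)
# → data:,kind%3A%20Pod%0AapiVersion%3A%20v1%0A...
```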
$ oc get machineset -A
NAMESPACE               NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   wj47ios1218a-rnntt-addit-0    1         1         1       1           8m9s
openshift-machine-api   wj47ios1218a-rnntt-worker-0   3         3         3       3           96m

$ oc get nodes -o wide
NAME                                STATUS   ROLES    AGE   VERSION           INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj47ios1218a-rnntt-addit-0-drz5s    Ready    worker   59s   v1.20.0+87544c5   192.168.66.124   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-master-0         Ready    master   94m   v1.20.0+87544c5   192.168.0.79     <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-master-1         Ready    master   94m   v1.20.0+87544c5   192.168.3.90     <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-master-2         Ready    master   94m   v1.20.0+87544c5   192.168.3.194    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-worker-0-ftt9z   Ready    worker   80m   v1.20.0+87544c5   192.168.0.250    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-worker-0-gtsns   Ready    worker   80m   v1.20.0+87544c5   192.168.0.108    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-worker-0-zxcbd   Ready    worker   80m   v1.20.0+87544c5   192.168.1.103    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633