Bug 1847142
| Summary: | Some of the nodes lose the provisioning network IPv6 address: NetworkManager reports Impossible condition at dhc6.c:273. | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Marius Cornea <mcornea> |
| Component: | Installer | Assignee: | Egor Lunin <elunin> |
| Installer sub component: | OpenShift on Bare Metal IPI | QA Contact: | Victor Voronkov <vvoronko> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | afasano, beth.white, shardy, stbenjam |
| Version: | 4.4 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.6 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-13 13:15:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marius Cornea
2020-06-15 18:46:39 UTC
So, during initial deployment all masters get a DHCP lease from the bootstrap VM, but subsequently the static-ip-manager container configures the provisioning IP statically, and only on the master where the metal3 pod is scheduled (in this case fd00:1101::3).

What we're seeing here, I think, is just the lease expiry on the other masters, which is expected because NetworkManager brought up the NICs and then the DHCP server (on the bootstrap VM) went away. Arguably they should get a new lease from the dnsmasq running in the metal3 pod; I'm not sure why that doesn't happen instead of the failed rebind in the errors.

However, this raises the question of whether we want to support any use of the provisioning NIC on nodes where the metal3 pod is not scheduled. IMHO we don't, so perhaps we can either consider this not a bug or explicitly disable the NICs on nodes not running the metal3 pod?

There is one potential corner case, related to the order in which static-ip-set is run here: https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/baremetal_pod.go#L209

We run the machine-os-downloader container and only afterwards run static-ip-set. This works fine provided the image download happens via the baremetal network. However, in dev-scripts we actually configure the local mirror such that it downloads via the provisioning network. In that case our current testing works only because of these leases from the bootstrap VM; if they have expired by the time the metal3 pod gets rescheduled, the download will fail.

One solution would be to move static-ip-set before machine-os-downloader, but if we do that we may hit this hard-coded lft (address lifetime) timeout and the image download could still fail: https://github.com/openshift/ironic-static-ip-manager/blob/master/set-static-ip#L20

I think it's reasonable to expect image download in all real non-virt environments to happen via the baremetal/external network (regardless of whether we're downloading from the internet or a local mirror), but we may want to add some validation to prevent the corner case above from causing problems.

I've deployed a 4.4.7 cluster and let it run for two weeks. Tested with IPv6, with a DHCP range of fd00:1101::a - fd00:1101::64. During that time I checked the lease renewal logs on a daily basis and never saw an IP address change.

Capturing the IP addresses via:

    for node in $(oc get nodes --template '{{range .items}}{{index .status.addresses 0 "address"}}{{"\n"}}{{end}}'); do ssh core@$node 'journalctl -u NetworkManager | grep "dhcp6 (enp2s0): address"'; done > lease_times.txt

Had the same values for the entire lifetime of the cluster:

master-0: fd2e:6f44:5dd8:c956::14
master-1: fd2e:6f44:5dd8:c956::15
master-2: fd2e:6f44:5dd8:c956::16
worker-1: fd2e:6f44:5dd8:c956::18

As an interesting side note, average time before lease renewal:

master-0: 9 sec.
master-1: 23 sec.
master-2: 24 sec.
worker-1: 25 sec.
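For anyone trying to confirm the symptom from the summary on a suspect node, a minimal sketch of the checks involved, assuming the provisioning interface is enp2s0 (as in the grep above) and the fd00:1101:: provisioning prefix from this report; adjust both for your environment:

```bash
# Look for the failed DHCPv6 rebind reported by NetworkManager
journalctl -u NetworkManager | grep -E 'dhcp6 \(enp2s0\)|Impossible condition at dhc6'

# Check whether the provisioning interface still holds its IPv6 address;
# no fd00:1101:: address here suggests the lease expired and was never renewed
ip -6 addr show dev enp2s0
```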
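Since the static provisioning IP is only configured on the master hosting the metal3 pod, it helps to know which node that is when reading the logs above. A sketch, assuming the metal3 pod runs in the openshift-machine-api namespace (where the machine-api-operator deploys it):

```bash
# Find the node currently running the metal3 pod
oc -n openshift-machine-api get pods -o wide | grep metal3

# The remaining masters only ever had the bootstrap VM's DHCP lease on the provisioning NIC
oc get nodes -o wide
```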
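To illustrate the concern about moving static-ip-set ahead of machine-os-downloader: an address added with a finite kernel lifetime is withdrawn automatically once valid_lft runs out, so a long image download over the provisioning network could outlive it. This is a sketch only; the address, interface, and 86400-second lifetime are illustrative, not the values hard-coded in set-static-ip:

```bash
# Illustrative: add a provisioning address with a finite lifetime, as a
# static-ip-manager style script might; the kernel drops it after valid_lft seconds
ip -6 addr add fd00:1101::3/64 dev enp2s0 valid_lft 86400 preferred_lft 86400

# The remaining lifetime can be watched counting down
ip -6 addr show dev enp2s0 | grep -A1 'fd00:1101::3'
```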