@ddelcian any chance you can help verify this bug? QE does not have this kind of environment immediately available.
Hi,

Can you provide details as to how the customer could test this?

Thanks!
(In reply to Daniel Del Ciancio from comment #6)
> Hi,
>
> Can you provide details as to how the customer could test this?
>
> Thanks!

Try with this build: quay.io/openshift-release-dev/ocp-release:4.7.10-x86_64
Hi again,

The customer is planning to skip 4.7 and move straight to 4.8 for testing. Can you confirm which 4.8 release includes this fix? Does the candidate-4.8 channel include it at this time?
I don't know what's in candidate-4.8, but it's fixed in any 4.8.0-fc.* release
@Daniel Can the customer try with the quay.io/openshift-release-dev/ocp-release:4.8.0-fc.3-x86_64 build? Thanks.
Was able to get partially through. @daniel I am seeing the same issue I saw with the forced/workaround upgrade on an existing cluster (CIDRs for network):

```bash
oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-fc.3   True        False         False      6m37s
baremetal                                  4.8.0-fc.3   True        False         False      26m
cloud-credential                           4.8.0-fc.3   True        False         False      35m
cluster-autoscaler                         4.8.0-fc.3   True        False         False      25m
config-operator                            4.8.0-fc.3   True        False         False      27m
console                                    4.8.0-fc.3   True        False         False      13m
csi-snapshot-controller                    4.8.0-fc.3   True        False         False      26m
dns                                        4.8.0-fc.3   True        False         False      26m
etcd                                       4.8.0-fc.3   True        False         False      25m
image-registry                             4.8.0-fc.3   True        False         False      21m
ingress                                    4.8.0-fc.3   True        False         False      11m
insights                                   4.8.0-fc.3   True        False         False      25m
kube-apiserver                             4.8.0-fc.3   True        False         False      24m
kube-controller-manager                    4.8.0-fc.3   True        False         False      25m
kube-scheduler                             4.8.0-fc.3   True        False         False      25m
kube-storage-version-migrator              4.8.0-fc.3   True        False         False      19m
machine-api                                4.8.0-fc.3   True        False         False      26m
machine-approver                           4.8.0-fc.3   True        False         False      25m
machine-config                             4.8.0-fc.3   True        False         False      25m
marketplace                                4.8.0-fc.3   True        False         False      25m
monitoring                                 4.8.0-fc.3   True        False         False      14m
network                                                 False       True          True       31m
node-tuning                                4.8.0-fc.3   True        False         False      25m
openshift-apiserver                        4.8.0-fc.3   True        False         False      21m
openshift-controller-manager               4.8.0-fc.3   True        False         False      26m
openshift-samples                          4.8.0-fc.3   True        False         False      21m
operator-lifecycle-manager                 4.8.0-fc.3   True        False         False      25m
operator-lifecycle-manager-catalog         4.8.0-fc.3   True        False         False      25m
operator-lifecycle-manager-packageserver   4.8.0-fc.3   True        False         False      22m
service-ca                                 4.8.0-fc.3   True        False         False      27m
storage                                    4.8.0-fc.3   True        False         False      26m
```

Can someone tell me what the CIDRs should be for my network settings below so that I don't run into this issue where the network won't come up?

```yaml
networking:
  stack: IPV6
  vlan:
    cluster: 603
    storage: 540
  networkType: OVNKubernetes
  machineCIDR: 2605:b100:0000:4::/64
  clusterNetwork:
  - cidr: 2605:b100:283::/56
    hostPrefix: 64
  serviceNetwork:
  - 2605:b100:283:104::/112
```
Can you get an "oc adm must-gather" output from the cluster? Or if that doesn't work, at least get "oc get clusteroperator network -o yaml".

FWIW, as the install got past the bootstrap phase, that means that the specific bug that was previously reported is now fixed, and we are now seeing a new bug. I'm going to mark this VERIFIED so that the 4.6 backport of the original bugfix can proceed, but we can continue to debug the new problem here for now.
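For reference, a minimal sketch of the two data-collection commands being requested here; the output file name is only an example, not something prescribed in this bug:

```bash
# Full diagnostic bundle; by default this writes a must-gather.local.* directory
# into the current working directory.
oc adm must-gather

# Fallback if must-gather cannot run: capture the network cluster operator status.
oc get clusteroperator network -o yaml > co-network.yaml
```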
Created attachment 1784593 [details]
co-network -o yaml, omitted managed fields
```log
Normal   AddedInterface          72s                  multus    Add eth0 [2605:b100:283:5::e/64]
Warning  FailedCreatePodSandBox  55s (x52 over 14m)   kubelet   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_envoyv4v6-6584c9b6b7-xxxtt_bell-services_f78acab5-54eb-4ec1-bf22-1ae01e4982c8_0(3a44006cee034a1a2d53a99173f06ee82ba6c71140e1a8c1bbb733833dc60963): [bell-services/envoyv4v6-6584c9b6b7-xxxtt:envoyv4]: error adding container to network "envoyv4": failed to create macvlan: device or resource busy
Normal   AddedInterface          55s
```

Hey Daniel and RH, this ^^^ is also the same error seen on the pod trying to attach a network attachment for the IPv4-IPv6 conversion with the in-place upgrade workaround.
> DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-hhxcm is in CrashLoopBackOff State

Can you get `oc logs -n openshift-ovn-kubernetes ovnkube-master --all-containers`?
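In case it helps, one possible way to collect those logs from every ovnkube-master pod in a single pass (a sketch only; actual pod names such as ovnkube-master-hhxcm differ per cluster):

```bash
# Dump all container logs for each ovnkube-master pod;
# pod names (e.g. ovnkube-master-hhxcm) vary from cluster to cluster.
for pod in $(oc get pods -n openshift-ovn-kubernetes -o name | grep ovnkube-master); do
  echo "===== ${pod} ====="
  oc logs -n openshift-ovn-kubernetes "${pod}" --all-containers
done
```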
@danw Do I need to run it against a particular ovnkube-master pod, or can it be any ovnkube-master pod?
Created attachment 1784813 [details]
Logs from the ovnkube-master in CrashLoopBackOff

Attached are the logs from the ovnkube-master pod that is in CrashLoopBackOff.
Also, I tried deploying a new cluster with the following settings:

```yaml
networkType: OVNKubernetes
machineCIDR: 2605:b100:0000:9::/64
clusterNetwork:
- cidr: 2605:b100:283::/64
  hostPrefix: 64
serviceNetwork:
- 2605:b100:283:104::/112
```

This resulted in:

```log
level=info msg=Waiting up to 30m0s for bootstrapping to complete...
level=error msg=Cluster operator network Degraded is True with RolloutHung: DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-ndzr9 is in CrashLoopBackOff State
level=error msg=DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-b2mb4 is in CrashLoopBackOff State
level=error msg=DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-qlswg is in CrashLoopBackOff State
level=error msg=DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - last change 2021-05-21T12:58:00Z
level=error msg=DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-05-21T12:58:00Z
level=info msg=Cluster operator network ManagementStateDegraded is False with :
level=info msg=Cluster operator network Progressing is True with Deploying: DaemonSet "openshift-multus/network-metrics-daemon" is waiting for other operators to become ready
level=info msg=DaemonSet "openshift-multus/multus-admission-controller" is waiting for other operators to become ready
level=info msg=DaemonSet "openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 3 nodes)
level=info msg=DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 3 nodes)
level=info msg=DaemonSet "openshift-network-diagnostics/network-check-target" is waiting for other operators to become ready
level=info msg=Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
level=info msg=Cluster operator network Available is False with Startup: The network is starting up
level=info msg=Use the following commands to gather logs from the cluster
level=info msg=openshift-install gather bootstrap --help
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=fatal msg=Bootstrap failed to complete
```

Is anything wrong in my networking config shown above?
> clusterNetwork:
> - cidr: 2605:b100:283::/64
>   hostPrefix: 64

This says that each node should get a /64 (hostPrefix), but also that the entire cluster gets a /64 (cidr), so that won't work. The cidr needs to be big enough to contain at least as many /64s as you will have nodes.
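As an illustration only (reusing the /56 prefix already used elsewhere in this thread, not a recommendation for any particular prefix), a clusterNetwork entry whose cidr is large enough for the hostPrefix would look like:

```yaml
clusterNetwork:
# 64 - 56 = 8 bits left for per-node subnets, i.e. up to 256 nodes,
# each receiving its own /64.
- cidr: 2605:b100:283::/56
  hostPrefix: 64
```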
Ah right. So on the flip side, we're running into the limitation that a /56 clusterNetwork CIDR is too big for IPv6, correct? Or am I confusing that with another issue? I do see in the upstream Kubernetes docs that /56 for pod-network-cidr is an acceptable setting, albeit for dual-stack k8s: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/dual-stack-support/
I am also re-creating the first deployment, this time on another cluster, and will try my best to get must-gather logs. The settings from the first deployment are:

```yaml
networkType: OVNKubernetes
machineCIDR: 2605:b100:0000:4::/64
clusterNetwork:
- cidr: 2605:b100:283::/56
  hostPrefix: 64
serviceNetwork:
- 2605:b100:283:104::/112
```
So here's the status of the cluster:

```bash
oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-fc.3   False       False         True       20h
baremetal                                  4.8.0-fc.3   True        False         False      20h
cloud-credential                           4.8.0-fc.3   True        False         False      20h
cluster-autoscaler                         4.8.0-fc.3   True        False         False      20h
config-operator                            4.8.0-fc.3   True        False         False      20h
console                                    4.8.0-fc.3   False       True          False      20h
csi-snapshot-controller                    4.8.0-fc.3   True        False         False      20h
dns                                        4.8.0-fc.3   True        False         False      20h
etcd                                       4.8.0-fc.3   True        False         False      20h
image-registry                             4.8.0-fc.3   True        False         False      20h
ingress                                    4.8.0-fc.3   True        False         True       19h
insights                                   4.8.0-fc.3   True        False         True       20h
kube-apiserver                             4.8.0-fc.3   True        False         False      20h
kube-controller-manager                    4.8.0-fc.3   True        False         False      20h
kube-scheduler                             4.8.0-fc.3   True        False         False      20h
kube-storage-version-migrator              4.8.0-fc.3   True        False         False      20h
machine-api                                4.8.0-fc.3   True        False         False      20h
machine-approver                           4.8.0-fc.3   True        False         False      20h
machine-config                             4.8.0-fc.3   True        False         False      20h
marketplace                                4.8.0-fc.3   True        False         False      20h
monitoring                                 4.8.0-fc.3   True        False         False      7h36m
network                                    4.8.0-fc.3   True        False         False      20h
node-tuning                                4.8.0-fc.3   True        False         False      20h
openshift-apiserver                        4.8.0-fc.3   True        False         False      20h
openshift-controller-manager               4.8.0-fc.3   True        False         False      20h
openshift-samples                          4.8.0-fc.3   True        False         False      20h
operator-lifecycle-manager                 4.8.0-fc.3   True        False         False      20h
operator-lifecycle-manager-catalog         4.8.0-fc.3   True        False         False      20h
operator-lifecycle-manager-packageserver   4.8.0-fc.3   True        False         False      20h
service-ca                                 4.8.0-fc.3   True        False         False      20h
storage                                    4.8.0-fc.3   True        False         False      20h
```

I am unable to collect must-gather due to the pod failing to start. I will upload sosreports for one of the master nodes and for one of the nodes that is failing to assign an IPv4 macvlan network attachment for the IPv4<->IPv6 ingress envoy pod. This is likely preventing the console operator from coming online, because it blocks communication between the authentication pods and the rest of the cluster components.

Here's the error for the envoy pod:

```bash
Warning FailedCreatePodSandBox 2m16s (x4836 over 20h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_envoyv4v6-6584c9b6b7-kr5pg_bell-services_0db3f8ee-d765-4045-b946-d8566a66dd38_0(18ea57de08c04cde8b2f35ccad5496ab6f2cd9c8f66a3b4feaeefaf177824c2c): [bell-services/envoyv4v6-6584c9b6b7-kr5pg:envoyv4]: error adding container to network "envoyv4": failed to create macvlan: device or resource busy
```
(In reply to raj.sarvaiya from comment #23)
> Ah right. So on the flip side, we're running into the limitation that a /56
> clusterNetwork CIDR is too big for IPv6, correct? Or am I confusing that
> with another issue?

There is no "too big" restriction on clusterNetwork. (Hm... well, some upstream documentation might say that it can't be bigger than /48, but that doesn't apply to ovn-kubernetes anyway. There should definitely not be a problem with a /56 anywhere.)

The restrictions are:

- serviceNetwork should be exactly /112
- the clusterNetwork hostPrefix must be exactly 64
- the clusterNetwork length therefore must be < 64

(In reply to raj.sarvaiya from comment #25)
> So here's the status of the cluster.

"oc get co -o yaml" would be more useful, since that would have more detailed status.

> one of the nodes which is
> failing to assign an IPv4 macvlan network attachment for the IPv4<->IPv6
> ingress envoy pod.

There is no "IPv4<->IPv6 ingress envoy pod" in a default OCP install, so this would be a problem with something you are running, not a problem with OCP itself...

-- Dan
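Purely as an illustrative sketch (prefixes reused from the configs posted earlier in this thread; substitute values for your own environment), a networking stanza satisfying all three restrictions would look like:

```yaml
networkType: OVNKubernetes
machineCIDR: 2605:b100:0000:4::/64
clusterNetwork:
- cidr: 2605:b100:283::/56    # prefix length < 64, so it can hold one /64 per node
  hostPrefix: 64              # must be exactly 64
serviceNetwork:
- 2605:b100:283:104::/112     # must be exactly /112
```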
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.12 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1561
(In reply to Dan Winship from comment #26)
> (In reply to raj.sarvaiya from comment #23)
> > Ah right. So on the flip side, we're running into the limitation that a /56
> > clusterNetwork CIDR is too big for IPv6, correct? Or am I confusing that
> > with another issue?
>
> There is no "too big" restriction on clusterNetwork. (Hm... well, some
> upstream documentation might say that it can't be bigger than /48, but that
> doesn't apply to ovn-kubernetes anyway. There should definitely not be a
> problem with a /56 anywhere.)
>
> The restrictions are:
>
> - serviceNetwork should be exactly /112
> - the clusterNetwork hostPrefix must be exactly 64
> - the clusterNetwork length therefore must be < 64

Thanks for clearing that up! Really helps.

> "oc get co -o yaml" would be more useful, since that would have more detailed
> status

Will this contain any sensitive info that needs to be scrubbed?

> > one of the nodes which is
> > failing to assign an IPv4 macvlan network attachment for the IPv4<->IPv6
> > ingress envoy pod.
>
> There is no "IPv4<->IPv6 ingress envoy pod" in a default OCP install, so
> this would be a problem with something you are running, not a problem with
> OCP itself...
>
> -- Dan

Yes, that's something we need for certain envs to enable communication from outside over IPv4. For this we need the extra network attachments from multus to be functional.
(In reply to raj.sarvaiya from comment #29)
> > "oc get co -o yaml" would be more useful, since that would have more detailed
> > status
>
> Will this contain any sensitive info that needs to be scrubbed?

Nothing beyond node hostnames.

> > > one of the nodes which is
> > > failing to assign an IPv4 macvlan network attachment for the IPv4<->IPv6
> > > ingress envoy pod.
> >
> > There is no "IPv4<->IPv6 ingress envoy pod" in a default OCP install, so
> > this would be a problem with something you are running, not a problem with
> > OCP itself...
>
> Yes, that's something we need for certain envs to enable communication from
> outside over IPv4. For this we need the extra network attachments from
> multus to be functional.

Yes, I just mean, if the cluster is failing because of that though, then you'd need to figure that out yourself, since we don't know any of the details of what that pod is doing, because it's something you created, not something we created.
(In reply to Dan Winship from comment #30)
> > Will this contain any sensitive info that needs to be scrubbed?
>
> Nothing beyond node hostnames.

Alright, I am attempting to recreate the conditions and the deployment and will try to get you this.

> Yes, I just mean, if the cluster is failing because of that though, then
> you'd need to figure that out yourself, since we don't know any of the
> details of what that pod is doing, because it's something you created, not
> something we created.

If we can figure out and solve why multus is failing to assign a network attachment there, then I'm fairly certain the cluster will be at the desired functionality level. Basically, the pod doesn't even start, and so far the issue appears to be with multus. When using another CNI such as Cilium, multus does start working.
Hi @raj.sarvaiya, can you provide the install-config.yaml for the OVN cluster? Any difference between the Cilium and OVN network config?
Hey Daniel, I will have to regenerate this via a pipeline run. I can do it tomorrow afternoon; I'm in the middle of a few things at the moment.
Created attachment 1805968 [details]
install-config-4.6-v6-ovn

install-config for a 4.6 IPv6 deployment
I have attached the OVN install-config. There's no difference between this one and the Cilium install-config, except that the Cilium one sets networkType: Cilium.