Bug 2036577 - OCP 4.10 nightly builds from 4.10.0-0.nightly-s390x-2021-12-18-034912 to 4.10.0-0.nightly-s390x-2022-01-11-233015 fail to upgrade from OCP 4.9.11 and 4.9.12 for network type OVNKubernetes for zVM hypervisor environments
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.10
Hardware: s390x
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Prashanth Sundararaman
QA Contact: Douglas Slavens
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-03 09:39 UTC by krmoser
Modified: 2022-03-11 17:58 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-11 17:58:00 UTC
Target Upstream Version:
Embargoed:


Attachments
worker-0 journalctl -b -u NetworkManager output (91.99 KB, text/plain)
2022-01-04 18:38 UTC, krmoser
worker-0 journalctl -b -u ovs-configuration output (199.15 KB, text/plain)
2022-01-04 19:59 UTC, krmoser
master-0 journalctl -b -u ovs-configuration output (43.91 KB, text/plain)
2022-01-04 22:29 UTC, krmoser
worker-1 journalctl -b -u ovs-configuration output (43.91 KB, text/plain)
2022-01-04 22:30 UTC, krmoser
worker-1 journalctl -b -u ovs-configuration output (110.70 KB, text/plain)
2022-01-07 12:03 UTC, krmoser
worker-1 /usr/local/bin/configure-ovs.sh output (26.25 KB, text/plain)
2022-01-07 12:05 UTC, krmoser
OCP 4.9.13 master-0 node /etc/NetworkManager/system-connections output (3.16 KB, text/plain)
2022-01-07 12:15 UTC, krmoser
OCP 4.9.13 master-0 node /etc/NetworkManager/systemConnectionsMerged output (3.36 KB, text/plain)
2022-01-07 12:18 UTC, krmoser


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2902 0 None open Bug 2036577: configure-ovs: do not use overlay directory when checking and copying connections 2022-01-07 17:04:01 UTC
Red Hat Issue Tracker MULTIARCH-2025 0 None None None 2022-01-03 09:40:42 UTC

Description krmoser 2022-01-03 09:39:46 UTC
Description of problem:
For zVM environments: 
1. OCP 4.10 nightly builds from 4.10.0-0.nightly-s390x-2021-12-18-034912 through 4.10.0-0.nightly-s390x-2022-01-02-012917 fail to upgrade from OCP 4.9.11 and 4.9.12 when using network type OVNKubernetes (OVN).

2. These same OCP 4.10 nightly builds successfully upgrade from OCP 4.9.11 and 4.9.12 when using network type openshiftSDN (OVS).


3. For these OCP 4.9.11 and 4.9.12 to OCP 4.10 upgrade failures, using the OCP nightly build 4.10.0-0.nightly-s390x-2021-12-24-235654 as an example, the "oc get clusterversion" command consistently reports:
"Unable to apply 4.10.0-0.nightly-s390x-2021-12-24-235654: wait has exceeded 40 minutes for these operators: ingress"

4. For these upgrade failures, using the OCP nightly build 4.10.0-0.nightly-s390x-2021-12-24-235654 as an example, the "oc get co" command consistently reports information similar to the following:

NAME                                       VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         True       35m     APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      100m    
cloud-controller-manager                   4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      105m    
cloud-credential                           4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      104m    
cluster-autoscaler                         4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      100m    
config-operator                            4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      102m    
console                                    4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      49m     
csi-snapshot-controller                    4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      101m    
dns                                        4.10.0-0.nightly-s390x-2021-12-24-235654   True        True          True       100m    DNS default is degraded
etcd                                       4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      100m    
image-registry                             4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      93m     
ingress                                    4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         True       92m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-cd6bdf7dd-h9nrx" cannot be scheduled: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable. Make sure you have sufficient worker nodes.)
insights                                   4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      94m     
kube-apiserver                             4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      97m     
kube-controller-manager                    4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      99m     
kube-scheduler                             4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      100m    
kube-storage-version-migrator              4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      37m     
machine-api                                4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      101m    
machine-approver                           4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      101m    
machine-config                             4.9.11                                     True        True          True       101m    Unable to apply 4.10.0-0.nightly-s390x-2021-12-24-235654: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-091362904afdd033e03a72cdede84f52 expected f8249fc84f1a7dfd655c88fae80811ee9c76c34c has ddd96b04ede2eba72afea1355468a9985aacafe6: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-108b993eeddf5639db907c553a9834dc, retrying
marketplace                                4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      100m    
monitoring                                 4.10.0-0.nightly-s390x-2021-12-24-235654   False       True          True       22m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.10.0-0.nightly-s390x-2021-12-24-235654   True        True          True       102m    DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-2fhwv is in CrashLoopBackOff State...
node-tuning                                4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      93m     
openshift-apiserver                        4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         True       94m     APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      97m     
openshift-samples                          4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      61m     
operator-lifecycle-manager                 4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      101m    
operator-lifecycle-manager-catalog         4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      101m    
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      58m     
service-ca                                 4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      102m    
storage                                    4.10.0-0.nightly-s390x-2021-12-24-235654   True        False         False      102m    


5. For the network cluster operator, the ovnkube pods are in a CrashLoopBackOff state.  
For example:
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-2fhwv is in CrashLoopBackOff State... 



Version-Release number of selected component (if applicable):
This issue has been consistently found with the following tested builds:
 1. 4.10.0-0.nightly-s390x-2021-12-18-034912
 2. 4.10.0-0.nightly-s390x-2021-12-20-215258
 3. 4.10.0-0.nightly-s390x-2021-12-21-231942
 4. 4.10.0-0.nightly-s390x-2021-12-22-053640
 5. 4.10.0-0.nightly-s390x-2021-12-23-063012
 6. 4.10.0-0.nightly-s390x-2021-12-24-010839
 7. 4.10.0-0.nightly-s390x-2021-12-24-154536
 8. 4.10.0-0.nightly-s390x-2021-12-24-235654
 9. 4.10.0-0.nightly-s390x-2022-01-02-012917


How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. In a zVM environment, using network type OVNKubernetes, attempt to upgrade from OCP 4.9.11 or 4.9.12 to any OCP 4.10 nightly build between 4.10.0-0.nightly-s390x-2021-12-18-034912 and 4.10.0-0.nightly-s390x-2022-01-02-012917.
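
For reference, a minimal sketch of how such an upgrade can be triggered from the CLI. The release image pullspec below, including the registry path, is a placeholder/assumption and is not taken from this report:

# Sketch only: trigger the upgrade to a specific 4.10 nightly by release image.
# Replace <digest> with the pullspec of the nightly build under test.
oc adm upgrade --allow-explicit-upgrade --force \
  --to-image=registry.ci.openshift.org/ocp-s390x/release-s390x@sha256:<digest>

# Watch the rollout:
oc get clusterversion
oc get co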


Actual results:
Upgrade fails with the network cluster operator ovnkube pod CrashLoopBackOff issues described above.

Expected results:
Upgrades should succeed, as they consistently do for network type openshiftSDN (OVS).

Additional info:


Thank you.

Comment 1 Prashanth Sundararaman 2022-01-03 14:50:25 UTC
Can you get the logs from the crashing pod? Like this:

oc logs -n openshift-ovn-kubernetes ovnkube-node-2fhwv -c ovnkube-node

Comment 2 krmoser 2022-01-03 17:27:50 UTC
Prashanth,

Thank you for your assistance and Happy New Year :)

Please see below the requested information from a recreate using OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2022-01-02-012917, using the command:
 "oc logs -n openshift-ovn-kubernetes ovnkube-node-hrlgv -c ovnkube-node"

Thank you,
Kyle



+ [[ -f /env/worker-0.pok-99.ocptest.pok.stglabs.ibm.com ]]
++ date '+%m%d %H:%M:%S.%N'
+ echo 'I0103 17:18:47.674099253 - waiting for db_ip addresses'
I0103 17:18:47.674099253 - waiting for db_ip addresses
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
++ date '+%m%d %H:%M:%S.%N'
I0103 17:18:48.022119631 - disable conntrack on geneve port
+ echo 'I0103 17:18:48.022119631 - disable conntrack on geneve port'
+ iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
+ iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
+ ip6tables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
+ ip6tables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
+ retries=0
+ true
++ timeout 30 kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=10.20.116.211
+ [[ -n 10.20.116.211 ]]
+ break
++ date '+%m%d %H:%M:%S.%N'
I0103 17:18:48.267568989 - starting ovnkube-node db_ip 10.20.116.211
+ echo 'I0103 17:18:48.267568989 - starting ovnkube-node db_ip 10.20.116.211'
+ '[' shared == shared ']'
+ gateway_mode_flags='--gateway-mode shared --gateway-interface br-ex'
+ export_network_flows_flags=
+ [[ -n '' ]]
+ [[ -n '' ]]
+ [[ -n '' ]]
+ [[ -n '' ]]
+ [[ -n '' ]]
+ [[ -n '' ]]
+ gw_interface_flag=
+ '[' -d /sys/class/net/br-ex1 ']'
+ node_mgmt_port_netdev_flags=
+ [[ -n '' ]]
+ exec /usr/bin/ovnkube --init-node worker-0.pok-99.ocptest.pok.stglabs.ibm.com --nb-address ssl:10.20.116.211:9641,ssl:10.20.116.212:9641,ssl:10.20.116.213:9641 --sb-address ssl:10.20.116.211:9642,ssl:10.20.116.212:9642,ssl:10.20.116.213:9642 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --inactivity-probe=180000 --gateway-mode shared --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103 --ovn-metrics-bind-address 127.0.0.1:29105 --metrics-enable-pprof
I0103 17:18:48.393572  186388 ovs.go:93] Maximum command line arguments set to: 191102
I0103 17:18:48.396652  186388 config.go:1674] Parsed config file /run/ovnkube-config/ovnkube.conf
I0103 17:18:48.396676  186388 config.go:1675] Parsed config: {Default:{MTU:1400 RoutableMTU:0 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 MonitorAll:true LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true EnableEgressFirewall:true} Kubernetes:{Kubeconfig: CACert: CAData:[] APIServer:https://api-int.pok-99.ocptest.pok.stglabs.ibm.com:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes: NoHostSubnetNodes:nil HostNetworkNamespace:openshift-host-network PlatformType:None} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 DisablePacketMTUCheck:false RouterSubnet:} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full MgmtPortNetdev: DisableOVNIfaceIdVer:false}}
I0103 17:18:48.398965  186388 node.go:330] OVN Kube Node initialization, Mode: full
I0103 17:18:48.399217  186388 reflector.go:219] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I0103 17:18:48.399234  186388 reflector.go:255] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:134
I0103 17:18:48.399246  186388 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I0103 17:18:48.399260  186388 reflector.go:255] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:134
I0103 17:18:48.399270  186388 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I0103 17:18:48.399280  186388 reflector.go:255] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:134
I0103 17:18:48.399975  186388 reflector.go:219] Starting reflector *v1.Endpoints (0s) from k8s.io/client-go/informers/factory.go:134
I0103 17:18:48.399992  186388 reflector.go:255] Listing and watching *v1.Endpoints from k8s.io/client-go/informers/factory.go:134
I0103 17:18:48.499337  186388 shared_informer.go:270] caches populated
I0103 17:18:48.499371  186388 shared_informer.go:270] caches populated
I0103 17:18:48.499378  186388 shared_informer.go:270] caches populated
I0103 17:18:48.499384  186388 shared_informer.go:270] caches populated
I0103 17:18:48.519823  186388 config.go:1216] exec: /usr/bin/ovs-vsctl --timeout=15 del-ssl
I0103 17:18:48.545175  186388 config.go:1216] exec: /usr/bin/ovs-vsctl --timeout=15 set-ssl /ovn-cert/tls.key /ovn-cert/tls.crt /ovn-ca/ca-bundle.crt
I0103 17:18:48.568785  186388 config.go:1216] exec: /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-remote="ssl:10.20.116.211:9642,ssl:10.20.116.212:9642,ssl:10.20.116.213:9642"
I0103 17:18:48.573402  186388 ovs.go:204] exec(1): /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=10.20.116.214 external_ids:ovn-remote-probe-interval=180000 external_ids:ovn-openflow-probe-interval=180 external_ids:hostname="worker-0.pok-99.ocptest.pok.stglabs.ibm.com" external_ids:ovn-monitor-all=true external_ids:ovn-enable-lflow-cache=true external_ids:ovn-limit-lflow-cache-kb=1048576
I0103 17:18:48.576954  186388 ovs.go:207] exec(1): stdout: ""
I0103 17:18:48.576985  186388 ovs.go:208] exec(1): stderr: ""
I0103 17:18:48.576999  186388 ovs.go:204] exec(2): /usr/bin/ovs-vsctl --timeout=15 -- clear bridge br-int netflow -- clear bridge br-int sflow -- clear bridge br-int ipfix
I0103 17:18:48.584825  186388 ovs.go:207] exec(2): stdout: ""
I0103 17:18:48.584836  186388 ovs.go:208] exec(2): stderr: ""
I0103 17:18:48.594692  186388 node.go:386] Node worker-0.pok-99.ocptest.pok.stglabs.ibm.com ready for ovn initialization with subnet 10.128.2.0/23
I0103 17:18:48.594714  186388 ovs.go:204] exec(3): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.20.116.211:9642,ssl:10.20.116.212:9642,ssl:10.20.116.213:9642 --timeout=15 --columns=up list Port_Binding
I0103 17:18:48.656502  186388 ovs.go:207] exec(3): stdout: "up                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : 
true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : false\n"
I0103 17:18:48.656635  186388 ovs.go:208] exec(3): stderr: ""
I0103 17:18:48.656650  186388 node.go:315] Detected support for port binding with external IDs
I0103 17:18:48.656751  186388 ovs.go:204] exec(4): /usr/bin/ovs-vsctl --timeout=15 -- --if-exists del-port br-int k8s-worker-0.po -- --may-exist add-port br-int ovn-k8s-mp0 -- set interface ovn-k8s-mp0 type=internal mtu_request=1400 external-ids:iface-id=k8s-worker-0.pok-99.ocptest.pok.stglabs.ibm.com
I0103 17:18:48.662638  186388 ovs.go:207] exec(4): stdout: ""
I0103 17:18:48.662650  186388 ovs.go:208] exec(4): stderr: ""
I0103 17:18:48.662658  186388 ovs.go:204] exec(5): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface ovn-k8s-mp0 mac_in_use
I0103 17:18:48.666108  186388 ovs.go:207] exec(5): stdout: "\"66:46:54:36:9e:40\"\n"
I0103 17:18:48.666118  186388 ovs.go:208] exec(5): stderr: ""
I0103 17:18:48.666129  186388 ovs.go:204] exec(6): /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 mac=66\:46\:54\:36\:9e\:40
I0103 17:18:48.670471  186388 ovs.go:207] exec(6): stdout: ""
I0103 17:18:48.670484  186388 ovs.go:208] exec(6): stderr: ""
I0103 17:18:48.719025  186388 gateway_init.go:261] Initializing Gateway Functionality
I0103 17:18:48.719276  186388 gateway_localnet.go:131] Node local addresses initialized to: map[10.128.2.2:{10.128.2.0 fffffe00} 10.20.116.214:{10.20.116.0 ffffff00} 127.0.0.1:{127.0.0.0 ff000000} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::6446:54ff:fe36:9e40:{fe80:: ffffffffffffffff0000000000000000} fe80::d0a8:b5ff:fe0f:5ad3:{fe80:: ffffffffffffffff0000000000000000}]
I0103 17:18:48.719426  186388 helper_linux.go:74] Found default gateway interface enc2e0 10.20.116.247
F0103 17:18:48.719469  186388 ovnkube.go:133] could not find IP addresses: failed to lookup link br-ex: Link not found
#

Comment 3 Prashanth Sundararaman 2022-01-04 17:27:19 UTC
It's failing to set up br-ex:

F0103 17:18:48.719469  186388 ovnkube.go:133] could not find IP addresses: failed to lookup link br-ex: Link not found

Kyle,

Can you log into the nodes which exhibit this problem and check this:

systemctl status ovs-configuration

and then restart this service?

systemctl restart ovs-configuration

Once you restart it, can you delete the ovnkube-node pod? It should get recreated properly. I got these instructions from a Slack thread describing similar symptoms. Let me know if this works and we can escalate to the networking team. Also, for debugging, could you grab the whole journalctl output?
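
Consolidated, the suggested recovery sequence looks roughly like this (a sketch; the pod name is the example from this report):

# On the affected node:
systemctl status ovs-configuration    # inspect why br-ex setup failed
systemctl restart ovs-configuration   # retry the bridge configuration
ip link show br-ex                    # confirm the bridge now exists

# Then, with cluster access, delete the crashing pod so the DaemonSet recreates it:
oc delete pod -n openshift-ovn-kubernetes ovnkube-node-2fhwv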

Thanks
Prashanth

Comment 4 krmoser 2022-01-04 18:17:02 UTC
Prashanth,

Thanks.  Please see the requested information below.

Thank you,
Kyle




[core@worker-0 ~]$ sudo bash
[systemd]
Failed Units: 1
  ovs-configuration.service
[root@worker-0 core]# systemctl status ovs-configuration
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2022-01-03 11:08:08 UTC; 1 day 6h ago
 Main PID: 1694 (code=exited, status=1/FAILURE)
      CPU: 806ms

Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: default via 10.20.116.247 dev enc2e0 proto static metric 100
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: 10.20.116.0/24 dev enc2e0 proto kernel scope link src 10.20.116.214 metric 100
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + ip -6 route show
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: ::1 dev lo proto kernel metric 256 pref medium
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + exit 1
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Jan 03 11:08:08 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Consumed 806ms CPU time
[root@worker-0 core]# systemctl restart ovs-configuration
Job for ovs-configuration.service failed because the control process exited with error code.
See "systemctl status ovs-configuration.service" and "journalctl -xe" for details.
[root@worker-0 core]# 
[root@worker-0 core]# journalctl -xe
Jan 04 18:08:01 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[1]: crio-conmon-e709273713b3386247e178bacbeba6f923c932b9e6858052b18bf45bbb8dcec5.scope: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit crio-conmon-e709273713b3386247e178bacbeba6f923c932b9e6858052b18bf45bbb8dcec5.scope has successfully entered the 'dead' state.
Jan 04 18:08:01 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[1]: crio-conmon-e709273713b3386247e178bacbeba6f923c932b9e6858052b18bf45bbb8dcec5.scope: Consumed 36ms CPU time
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit crio-conmon-e709273713b3386247e178bacbeba6f923c932b9e6858052b18bf45bbb8dcec5.scope completed and consumed the indicated resources.
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: I0104 18:08:02.133053    2896 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-ovn-ku>
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: I0104 18:08:02.134238    2896 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-ovn-ku>
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: I0104 18:08:02.135942    2896 generic.go:296] "Generic (PLEG): container finished" podID=a73bf963-c4b4-4525>
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: I0104 18:08:02.135993    2896 kubelet.go:2115] "SyncLoop (PLEG): event for pod" pod="openshift-ovn-kubernet>
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: I0104 18:08:02.136033    2896 scope.go:110] "RemoveContainer" containerID="c2ef48dbfd18701667ee459252beaf10>
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: I0104 18:08:02.137687    2896 scope.go:110] "RemoveContainer" containerID="e709273713b3386247e178bacbeba6f9>
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: E0104 18:08:02.139536    2896 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"StartConta>
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com crio[2848]: time="2022-01-04 18:08:02.140576557Z" level=info msg="Removing container: c2ef48dbfd18701667ee459252beaf101bf25c>
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[1]: var-lib-containers-storage-overlay-a12fbb526ac9d51b062500745cdf962ad0a5867a832b8003023d822d8dff34d0-merged.mount>
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit var-lib-containers-storage-overlay-a12fbb526ac9d51b062500745cdf962ad0a5867a832b8003023d822d8dff34d0-merged.mount has successfully entered the 'dead' state.
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[910833]: var-lib-containers-storage-overlay-a12fbb526ac9d51b062500745cdf962ad0a5867a832b8003023d822d8dff34d0-merged.>
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit UNIT has successfully entered the 'dead' state.
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com systemd[1]: var-lib-containers-storage-overlay-a12fbb526ac9d51b062500745cdf962ad0a5867a832b8003023d822d8dff34d0-merged.mount>
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit var-lib-containers-storage-overlay-a12fbb526ac9d51b062500745cdf962ad0a5867a832b8003023d822d8dff34d0-merged.mount completed and consumed the indicated resources.
Jan 04 18:08:02 worker-0.pok-99.ocptest.pok.stglabs.ibm.com crio[2848]: time="2022-01-04 18:08:02.451982390Z" level=info msg="Removed container c2ef48dbfd18701667ee459252beaf101bf25cf5>
Jan 04 18:08:03 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: I0104 18:08:03.142584    2896 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-ovn-ku>
Jan 04 18:08:03 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: I0104 18:08:03.149015    2896 scope.go:110] "RemoveContainer" containerID="e709273713b3386247e178bacbeba6f9>
Jan 04 18:08:03 worker-0.pok-99.ocptest.pok.stglabs.ibm.com hyperkube[2896]: E0104 18:08:03.163965    2896 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"StartConta>
[root@worker-0 core]#

Comment 5 Prashanth Sundararaman 2022-01-04 18:22:23 UTC
Thanks Kyle. I think the NetworkManager journal output might be more helpful:

journalctl -b -u NetworkManager

Could you also follow the steps I mentioned above to see if that resolves the problem?

Also, is this problem intermittent, or is it happening on every install/upgrade?

Thanks
Prashanth

Comment 6 krmoser 2022-01-04 18:23:36 UTC
Prashanth,

Upon deleting one of the two ovnkube-node pods stuck in CrashLoopBackOff, the pod was recreated and then went right back into CrashLoopBackOff.

Thank you,
Kyle


[root@ospbmgr7 ~]# oc get pods -A | grep ovnkube
openshift-ovn-kubernetes                           ovnkube-master-dmvkt                                                   6/6     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-master-hdrvp                                                   6/6     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-master-xbcrk                                                   6/6     Running             6                 31h
openshift-ovn-kubernetes                           ovnkube-node-2r4v6                                                     5/5     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-node-hrlgv                                                     4/5     CrashLoopBackOff    373 (4m39s ago)   31h
openshift-ovn-kubernetes                           ovnkube-node-r2pvj                                                     5/5     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-node-wmm6c                                                     4/5     CrashLoopBackOff    373 (3m15s ago)   31h
openshift-ovn-kubernetes                           ovnkube-node-xmzjn                                                     5/5     Running             0                 31h
[root@ospbmgr7 ~]# oc delete pod ovnkube-node-hrlgv   -n openshift-ovn-kubernetes
pod "ovnkube-node-hrlgv" deleted
[root@ospbmgr7 ~]# oc get pods -A | grep ovnkube
openshift-ovn-kubernetes                           ovnkube-master-dmvkt                                                   6/6     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-master-hdrvp                                                   6/6     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-master-xbcrk                                                   6/6     Running             6                 31h
openshift-ovn-kubernetes                           ovnkube-node-2r4v6                                                     5/5     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-node-mwlm4                                                     4/5     Error               1 (2s ago)        6s
openshift-ovn-kubernetes                           ovnkube-node-r2pvj                                                     5/5     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-node-wmm6c                                                     4/5     CrashLoopBackOff    373 (4m27s ago)   31h
openshift-ovn-kubernetes                           ovnkube-node-xmzjn                                                     5/5     Running             0                 31h
[root@ospbmgr7 ~]# oc get pods -A | grep ovnkube
openshift-ovn-kubernetes                           ovnkube-master-dmvkt                                                   6/6     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-master-hdrvp                                                   6/6     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-master-xbcrk                                                   6/6     Running             6                 31h
openshift-ovn-kubernetes                           ovnkube-node-2r4v6                                                     5/5     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-node-mwlm4                                                     4/5     CrashLoopBackOff    1 (3s ago)        8s
openshift-ovn-kubernetes                           ovnkube-node-r2pvj                                                     5/5     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-node-wmm6c                                                     4/5     CrashLoopBackOff    373 (4m29s ago)   31h
openshift-ovn-kubernetes                           ovnkube-node-xmzjn                                                     5/5     Running             0                 31h

 

[root@ospbmgr7 ~]# oc get pods -A | grep ovnkube
openshift-ovn-kubernetes                           ovnkube-master-dmvkt                                                   6/6     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-master-hdrvp                                                   6/6     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-master-xbcrk                                                   6/6     Running             6                 31h
openshift-ovn-kubernetes                           ovnkube-node-2r4v6                                                     5/5     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-node-mwlm4                                                     4/5     CrashLoopBackOff    4 (58s ago)       2m30s
openshift-ovn-kubernetes                           ovnkube-node-r2pvj                                                     5/5     Running             0                 31h
openshift-ovn-kubernetes                           ovnkube-node-wmm6c                                                     4/5     CrashLoopBackOff    374 (107s ago)    31h
openshift-ovn-kubernetes                           ovnkube-node-xmzjn                                                     5/5     Running             0                 31h
[root@ospbmgr7 ~]#

Comment 7 Prashanth Sundararaman 2022-01-04 18:25:51 UTC
Kyle,

Did you restart the ovs-configuration service on the node, make sure it succeeded and that the br-ex interface was up, and then try killing the pod?

Prashanth

Comment 8 krmoser 2022-01-04 18:29:34 UTC
Prashanth,

Please see comment #4, where the ovs-configuration service restart (which sets up br-ex) does not succeed.

Thank you,
Kyle

Comment 9 krmoser 2022-01-04 18:31:28 UTC
Prashanth,

1. This consistently occurs for every OCP 4.9.11 and 4.9.12 upgrade to OCP 4.10 starting with the 4.10.0-0.nightly-s390x-2021-12-18-034912 nightly build.

2. Installs of these OCP 4.10 nightly builds succeed (barring any other issues).

3. Working to provide the requested "journalctl -b -u NetworkManager" information.

Thank you,
Kyle

Comment 10 Prashanth Sundararaman 2022-01-04 18:37:56 UTC
Thanks Kyle. Could you also get the ovs-configuration logs? journalctl -b -u ovs-configuration

Comment 11 krmoser 2022-01-04 18:38:14 UTC
Created attachment 1848922 [details]
worker-0 journalctl -b -u NetworkManager output

Prashanth,

This attachment contains the requested "journalctl -b -u NetworkManager" for worker-0.

Thank you,
Kyle

Comment 12 krmoser 2022-01-04 19:59:48 UTC
Created attachment 1848927 [details]
worker-0 journalctl -b -u ovs-configuration output

Comment 13 Prashanth Sundararaman 2022-01-04 22:14:29 UTC
I see this in the ovs-configuration logs:

Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: ipv4.method:                            manual
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + echo 'Static IP addressing detected on default gateway connection: 5a5fed82-e1bd-4caa-ba14-3dbc812edc26'
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: Static IP addressing detected on default gateway connection: 5a5fed82-e1bd-4caa-ba14-3dbc812edc26
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + egrep -l '^uuid=5a5fed82-e1bd-4caa-ba14-3dbc812edc26' '/etc/NetworkManager/systemConnectionsMerged/*'
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: grep: /etc/NetworkManager/systemConnectionsMerged/*: No such file or directory
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + echo 'WARN: unable to find NM configuration file for conn: 5a5fed82-e1bd-4caa-ba14-3dbc812edc26. Attempting to clone conn'
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: WARN: unable to find NM configuration file for conn: 5a5fed82-e1bd-4caa-ba14-3dbc812edc26. Attempting to clone conn
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + nmcli conn clone 5a5fed82-e1bd-4caa-ba14-3dbc812edc26 5a5fed82-e1bd-4caa-ba14-3dbc812edc26-clone
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: Wired Connection (5a5fed82-e1bd-4caa-ba14-3dbc812edc26) cloned as 5a5fed82-e1bd-4caa-ba14-3dbc812edc26-clone (666a52d4-0c19-4659-8099-bd2b753d0a5a).
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + shopt -s nullglob
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + old_conn_files=(${NM_CONN_PATH}/"${old_conn}"-clone*)
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + shopt -u nullglob
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + '[' 0 -ne 1 ']'
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + echo 'ERROR: unable to locate cloned conn file for 5a5fed82-e1bd-4caa-ba14-3dbc812edc26-clone'
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: ERROR: unable to locate cloned conn file for 5a5fed82-e1bd-4caa-ba14-3dbc812edc26-clone
Jan 03 11:08:07 worker-0.pok-99.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1694]: + exit 1

Looks like it couldn't find the cloned connection file to restore the config after boot. In 4.10, the systemConnectionsMerged overlay directory was removed by https://github.com/openshift/machine-config-operator/pull/2742.
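
For illustration, here is a minimal sketch of the failing code path as reconstructed from the log above (the uuid is the one from the log; this is not the verbatim script):

NM_CONN_PATH="/etc/NetworkManager/systemConnectionsMerged"   # overlay, removed in 4.10
old_conn="5a5fed82-e1bd-4caa-ba14-3dbc812edc26"              # uuid from the log above

# Fails: the overlay directory no longer exists, so no config file is found.
egrep -l "^uuid=${old_conn}" ${NM_CONN_PATH}/*

# The fallback clones the connection inside NetworkManager...
nmcli conn clone "${old_conn}" "${old_conn}-clone"

# ...but the glob still searches the nonexistent overlay directory, matches
# nothing, and the script exits 1.
shopt -s nullglob
old_conn_files=( "${NM_CONN_PATH}/${old_conn}"-clone* )
shopt -u nullglob
if [ ${#old_conn_files[@]} -ne 1 ]; then
  echo "ERROR: unable to locate cloned conn file for ${old_conn}-clone"
  exit 1
fi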

Kyle,

Could you also get the output of journalctl -b -u ovs-configuration on a node where the ovnkube-node pod deployed successfully?

Thanks

Comment 14 krmoser 2022-01-04 22:29:41 UTC
Created attachment 1848952 [details]
master-0 journalctl -b -u ovs-configuration output

Prashanth,

Here's the journalctl -b -u ovs-configuration output for master-0, which is successful for its ovnkube pod operation.

Thank you,
Kyle

Comment 15 krmoser 2022-01-04 22:30:41 UTC
Created attachment 1848953 [details]
worker-1 journalctl -b -u ovs-configuration output

Prashanth,

Here's the journalctl -b -u ovs-configuration output for worker-1, which is successful for its ovnkube pod operation.

Thank you,
Kyle

Comment 16 Prashanth Sundararaman 2022-01-04 22:48:59 UTC
Hmm... that looks like the old script. It looks like worker-1 hasn't started updating yet; worker-0 was the first to upgrade and encountered the error.

Which was the last 4.10 build to succeed in upgrading with OVNKubernetes? Could you let me know the exact version so I can look at the changes?

Comment 17 Prashanth Sundararaman 2022-01-04 23:23:53 UTC
hi Kyle,

Could you also try making a slight modification to the OVS configuration script (/usr/local/bin/configure-ovs.sh) on the node to see if it works? This is the modification:

replace this section at the top of the script:

NM_CONN_OVERLAY="/etc/NetworkManager/systemConnectionsMerged"
NM_CONN_UNDERLAY="/etc/NetworkManager/system-connections"
if [ -d "$NM_CONN_OVERLAY" ]; then
  NM_CONN_PATH="$NM_CONN_OVERLAY"
else
  NM_CONN_PATH="$NM_CONN_UNDERLAY"
fi


with: 

NM_CONN_UNDERLAY="/etc/NetworkManager/system-connections"
NM_CONN_PATH="$NM_CONN_UNDERLAY"


and then run the script to see if it succeeds?
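
One hedged way to apply and verify this on the node (a sketch, not verbatim instructions from this thread):

# Back up the script first, then make the edit described above.
cp /usr/local/bin/configure-ovs.sh /usr/local/bin/configure-ovs.sh.save
vi /usr/local/bin/configure-ovs.sh

# Re-run via the service (which invokes configure-ovs.sh) and verify br-ex:
systemctl restart ovs-configuration
systemctl status ovs-configuration
ip link show br-ex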

Thanks
Prashanth

Comment 18 krmoser 2022-01-05 04:25:27 UTC
Prashanth,

1. The last OCP 4.10 nightly build for which the OCP 4.9.11 and 4.9.12 upgrades worked for network type OVNKubernetes is 4.10.0-0.nightly-s390x-2021-12-16-185334.

2. The OCP 4.9.11 and 4.9.12 upgrades to OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2021-12-17-144433 are broken with the (different) issue documented in bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=2036571.

3. Given this (different) issue with the OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2021-12-17-144433, the first OCP 4.10 nightly build that we see this OVNKubernetes network type upgrade issue is 4.10.0-0.nightly-s390x-2021-12-18-034912.

Thank you,
Kyle

Comment 19 Dan Li 2022-01-05 15:07:45 UTC
Hi Prashanth, can we assign this bug to you since you have already started the conversation with Kyle?

Comment 20 Prashanth Sundararaman 2022-01-05 15:09:24 UTC
(In reply to Dan Li from comment #19)
> Hi Prashanth, can we assign this bug to you since you have already started
> the conversation with Kyle?

Sounds good, Dan.

Comment 21 Dan Li 2022-01-05 15:16:26 UTC
Thanks Prashanth! Changing the assignee. Would you provide or set a "Priority" level for this bug as a part of the triage process?

Also adding the reviewed-in-sprint flag, as we are still investigating and it seems unlikely (if there are any PRs) that this will be resolved before the end of this sprint.

Comment 22 Prashanth Sundararaman 2022-01-05 15:18:57 UTC
(In reply to Prashanth Sundararaman from comment #17)
> hi Kyle,
> 
> Could you also try making a slight modification to the OVS configuration
> script (/usr/local/bin/configure-ovs.sh) on the node to see if it works? This
> is the modification:
> 
> replace this section at the top of the script:
> 
> NM_CONN_OVERLAY="/etc/NetworkManager/systemConnectionsMerged"
> NM_CONN_UNDERLAY="/etc/NetworkManager/system-connections"
> if [ -d "$NM_CONN_OVERLAY" ]; then
>   NM_CONN_PATH="$NM_CONN_OVERLAY"
> else
>   NM_CONN_PATH="$NM_CONN_UNDERLAY"
> fi
> 
> 
> with: 
> 
> NM_CONN_UNDERLAY="/etc/NetworkManager/system-connections"
> NM_CONN_PATH="$NM_CONN_UNDERLAY"
> 
> 
> and then run the script to see if it succeeds?
> 
> Thanks
> Prashanth

This is good information, Kyle. The commit diff between the build from the 16th and the one from the 18th includes https://github.com/openshift/machine-config-operator/pull/2864, which is probably causing this issue. Could you try the workaround mentioned in comment #17? Thanks!
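
As a quick check (a sketch), a node carrying the affected script can be identified by whether its configure-ovs.sh still references the removed overlay directory:

grep -n systemConnectionsMerged /usr/local/bin/configure-ovs.sh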

Comment 23 krmoser 2022-01-05 15:38:09 UTC
Prashanth,

Thanks for the update. This is good news.

I tried the workaround mentioned in comment#17 last night and it did not seem to work, and will try again today and provide an update.

Thank you,
Kyle

Comment 24 Prashanth Sundararaman 2022-01-06 17:41:12 UTC
Kyle,

When you have some time today, can you reach out to me on Slack so we can try debugging this together? The zVM setups we have here do not support OVNKubernetes as they are not VXLAN aware.

Thanks
Prashanth

Comment 25 krmoser 2022-01-07 11:58:22 UTC
Prashanth,

Thanks for all the information we exchanged on Slack. Here's some additional information and logs to help debug this issue.

1. Attempted to upgrade from OCP 4.9.13 to 4.10.0-0.nightly-s390x-2022-01-07-024817.

2. The same network operator ovnkube CrashLoopBackOff issues are seen at the "machine-config" stage of the upgrade, per the "oc get co" command output.

Here is the network cluster operator message reported for the master-2 node:
 DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-927lm is in CrashLoopBackOff State...


3. Here are the 2 CrashLoopBackOff ovnkube-node pods (in the output below, I had already deleted the worker-1 ovnkube-node pod so it would be recreated):

[root@ospbmgr7 bin]# oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS        AGE   IP              NODE                                          NOMINATED NODE   READINESS GATES
ovnkube-master-bmz85   6/6     Running            0               55m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-r58v2   6/6     Running            6               53m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-vbsdb   6/6     Running            0               57m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-927lm     4/5     CrashLoopBackOff   7 (4m32s ago)   16m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-cmqmd     5/5     Running            0               59m   10.20.116.214   worker-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-jbn9q     5/5     Running            0               58m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-wq6hx     4/5     CrashLoopBackOff   8 (34s ago)     17m   10.20.116.215   worker-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-xvvjf     5/5     Running            0               59m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>



4. Here is the requested modified start of the worker-1 /usr/local/bin/configure-ovs.sh script:

[root@worker-1 ~]# cd /usr/local/bin
[root@worker-1 bin]#
[root@worker-1 bin]# ls -al
total 44
drwxr-xr-x.  2 root root    79 Jan  7 11:30 .
drwxr-xr-x. 11 root root   114 Jan  7 03:32 ..
-rwxr-xr-x.  1 root root 20023 Jan  7 11:21 configure-ovs.sh
-rwxr-xr-x.  1 root root 20161 Jan  7 10:55 configure-ovs.sh.save
-rwxr-xr-x.  1 root root  2275 Jan  7 10:55 mco-hostname
[root@worker-1 bin]#
[root@worker-1 bin]# more configure-ovs.sh
#!/bin/bash
set -eux
# This file is not needed anymore in 4.7+, but when rolling back to 4.6
# the ovs pod needs it to know ovs is running on the host.
touch /var/run/ovs-config-executed

NM_CONN_UNDERLAY="/etc/NetworkManager/system-connections"
NM_CONN_PATH="$NM_CONN_UNDERLAY"




MANAGED_NM_CONN_SUFFIX="-slave-ovs-clone"
# Workaround to ensure OVS is installed due to bug in systemd Requires:
# https://bugzilla.redhat.com/show_bug.cgi?id=1888017
copy_nm_conn_files() {
  local src_path="$NM_CONN_PATH"
  local dst_path="$1"
  if [ "$src_path" = "$dst_path" ]; then
    echo "No need to copy configuration files, source and destination are the same"
    return
  fi
  if [ -d "$src_path" ]; then
    echo "$src_path exists"
    local files=("${MANAGED_NM_CONN_FILES[@]}")
    shopt -s nullglob
    files+=($src_path/*${MANAGED_NM_CONN_SUFFIX}.nmconnection $src_path/*${MANAGED_NM_CONN_SUFFIX})
    shopt -u nullglob
    for file in "${files[@]}"; do
      file="$(basename $file)"
      if [ -f "$src_path/$file" ]; then
        if [ ! -f "$dst_path/$file" ]; then
          echo "Copying configuration $file"
          cp "$src_path/$file" "$dst_path/$file"
        elif ! cmp --silent "$src_path/$file" "$dst_path/$file"; then
          echo "Copying updated configuration $file"
          cp -f "$src_path/$file" "$dst_path/$file"
        else
          echo "Skipping $file since it's equal at destination"
        fi
      else
[root@worker-1 bin]#


5. After updating the worker-1 /usr/local/bin/configure-ovs.sh script with the requested change, it still exits with "exit 1".  I'll be attaching a log for this.


6. After deleting the worker-1 ovnkube-node pod, ovnkube-node-wq6hx, and then waiting for its re-creation, we see the following:

[root@ospbmgr7 bin]# oc delete pod   ovnkube-node-wq6hx      -n openshift-ovn-kubernetes
pod "ovnkube-node-wq6hx" deleted
[root@ospbmgr7 bin]# oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS              RESTARTS      AGE   IP              NODE                                          NOMINATED NODE   READINESS GATES
ovnkube-master-bmz85   6/6     Running             0             56m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-r58v2   6/6     Running             6             54m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-vbsdb   6/6     Running             0             58m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-927lm     4/5     CrashLoopBackOff    8 (34s ago)   17m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-cmqmd     5/5     Running             0             60m   10.20.116.214   worker-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-jbn9q     5/5     Running             0             59m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-tkzvg     0/5     ContainerCreating   0             3s    10.20.116.215   worker-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-xvvjf     5/5     Running             0             61m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
[root@ospbmgr7 bin]# oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS      AGE   IP              NODE                                          NOMINATED NODE   READINESS GATES
ovnkube-master-bmz85   6/6     Running            0             56m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-r58v2   6/6     Running            6             54m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-vbsdb   6/6     Running            0             58m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-927lm     4/5     CrashLoopBackOff   8 (42s ago)   17m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-cmqmd     5/5     Running            0             60m   10.20.116.214   worker-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-jbn9q     5/5     Running            0             60m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-tkzvg     4/5     Running            1 (3s ago)    11s   10.20.116.215   worker-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-xvvjf     5/5     Running            0             61m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
[root@ospbmgr7 bin]# oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS      AGE   IP              NODE                                          NOMINATED NODE   READINESS GATES
ovnkube-master-bmz85   6/6     Running            0             56m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-r58v2   6/6     Running            6             54m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-vbsdb   6/6     Running            0             59m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-927lm     4/5     CrashLoopBackOff   8 (47s ago)   17m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-cmqmd     5/5     Running            0             60m   10.20.116.214   worker-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-jbn9q     5/5     Running            0             60m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-tkzvg     4/5     Error              1 (8s ago)    16s   10.20.116.215   worker-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-xvvjf     5/5     Running            0             61m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>

(Repeated polls of "oc get pods -o wide -n openshift-ovn-kubernetes" over the next several seconds showed ovnkube-node-tkzvg remaining in the Error state as its age advanced from 16s to 27s, before it entered CrashLoopBackOff as shown below.)

[root@ospbmgr7 bin]# oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS      AGE   IP              NODE                                          NOMINATED NODE   READINESS GATES
ovnkube-master-bmz85   6/6     Running            0             57m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-r58v2   6/6     Running            6             54m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-master-vbsdb   6/6     Running            0             59m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-927lm     4/5     CrashLoopBackOff   8 (60s ago)   17m   10.20.116.213   master-2.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-cmqmd     5/5     Running            0             60m   10.20.116.214   worker-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-jbn9q     5/5     Running            0             60m   10.20.116.212   master-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-tkzvg     4/5     CrashLoopBackOff   1 (16s ago)   29s   10.20.116.215   worker-1.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
ovnkube-node-xvvjf     5/5     Running            0             61m   10.20.116.211   master-0.pok-99.ocptest.pok.stglabs.ibm.com   <none>           <none>
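
For reference, a minimal sketch of pulling the previous (crashed) log from the failing pod; the container name ovnkube-node inside the pod is an assumption here:

# oc logs ovnkube-node-tkzvg -n openshift-ovn-kubernetes -c ovnkube-node --previous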


7. I'll be attaching the output of the "journalctl -b -u ovs-configuration" command from worker-1.

Thank you,
Kyle

Comment 26 krmoser 2022-01-07 12:03:37 UTC
Created attachment 1849426 [details]
worker-1 journalctl -b -u ovs-configuration output

Per comment 25, here is the "journalctl -b -u ovs-configuration" command output from worker-1.
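
For anyone reproducing this, a sketch of collecting the same log without SSH access to the node, assuming "oc debug" is available against this cluster:

# oc debug node/worker-1.pok-99.ocptest.pok.stglabs.ibm.com -- chroot /host journalctl -b -u ovs-configuration > worker-1-ovs-configuration.log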

Comment 27 krmoser 2022-01-07 12:05:55 UTC
Created attachment 1849427 [details]
worker-1 /usr/local/bin/configure-ovs.sh output

Per comment 25, here is the output of the updated /usr/local/bin/configure-ovs.sh script from worker-1.

Comment 28 krmoser 2022-01-07 12:11:54 UTC
Prashanth,

Please see the pending attachments for the requested contents of these 2 directories on the OCP 4.9.13 master-0 node (a sample collection command is sketched after the list):
1. /etc/NetworkManager/system-connections
2. /etc/NetworkManager/systemConnectionsMerged
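
A minimal sketch of capturing both directory listings in one pass, assuming "oc debug" access (the node name here is illustrative):

# oc debug node/master-0.pok-99.ocptest.pok.stglabs.ibm.com -- chroot /host ls -l /etc/NetworkManager/system-connections /etc/NetworkManager/systemConnectionsMerged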

Thank you,
Kyle

Comment 29 krmoser 2022-01-07 12:15:06 UTC
Created attachment 1849428 [details]
OCP 4.9.13 master-0 node /etc/NetworkManager/system-connections output

Per comment 28, here is the requested /etc/NetworkManager/system-connections output.

Comment 30 krmoser 2022-01-07 12:18:22 UTC
Created attachment 1849429 [details]
OCP 4.9.13 master-0 node /etc/NetworkManager/systemConnectionsMerged output

Per comment 28, here is the requested /etc/NetworkManager/systemConnectionsMerged output.

Comment 31 pdsilva 2022-01-07 14:16:21 UTC
The issue is reproducible on Power as well.
Upgrade path: 4.9.12 --> 4.10.0-0.nightly-ppc64le-2022-01-07-115230

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.12    True        True          106m    Unable to apply 4.10.0-0.nightly-ppc64le-2022-01-07-115230: an unknown error has occurred: MultipleErrors


# oc get pods -A -owide | grep ovn
openshift-ovn-kubernetes                           ovnkube-master-crxjh                                         6/6     Running             0                75m    9.114.97.83    master-1   <none>           <none>
openshift-ovn-kubernetes                           ovnkube-master-hkrzt                                         6/6     Running             2 (80m ago)      80m    9.114.97.99    master-0   <none>           <none>
openshift-ovn-kubernetes                           ovnkube-master-qptwx                                         6/6     Running             6                77m    9.114.97.88    master-2   <none>           <none>
openshift-ovn-kubernetes                           ovnkube-node-5sgrc                                           5/5     Running             0                81m    9.114.97.99    master-0   <none>           <none>
openshift-ovn-kubernetes                           ovnkube-node-blwbn                                           5/5     Running             0                80m    9.114.97.96    worker-1   <none>           <none>
openshift-ovn-kubernetes                           ovnkube-node-h4fmt                                           4/5     CrashLoopBackOff    21 (2m44s ago)   81m    9.114.97.100   worker-0   <none>           <none>
openshift-ovn-kubernetes                           ovnkube-node-k8xmx                                           4/5     CrashLoopBackOff    20 (88s ago)     81m    9.114.97.88    master-2   <none>           <none>
openshift-ovn-kubernetes                           ovnkube-node-zxqwm                                           5/5     Running             0                81m    9.114.97.83    master-1   <none>           <none>
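
The crashing container's status below was presumably captured with something like the following (pod name taken from the listing above):

# oc describe pod ovnkube-node-k8xmx -n openshift-ovn-kubernetes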


   State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:    : true\n\nup                  : true\n\nup                  : true\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n\nup                  : false\n\nup                  : false\n\nup                  : true\n"
I0107 14:05:48.106171   62247 ovs.go:208] exec(3): stderr: ""
I0107 14:05:48.106185   62247 node.go:315] Detected support for port binding with external IDs
I0107 14:05:48.106300   62247 ovs.go:204] exec(4): /usr/bin/ovs-vsctl --timeout=15 -- --if-exists del-port br-int k8s-master-2 -- --may-exist add-port br-int ovn-k8s-mp0 -- set interface ovn-k8s-mp0 type=internal mtu_request=1400 external-ids:iface-id=k8s-master-2
I0107 14:05:48.113384   62247 ovs.go:207] exec(4): stdout: ""
I0107 14:05:48.113401   62247 ovs.go:208] exec(4): stderr: ""
I0107 14:05:48.113424   62247 ovs.go:204] exec(5): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface ovn-k8s-mp0 mac_in_use
I0107 14:05:48.118643   62247 ovs.go:207] exec(5): stdout: "\"62:d7:b8:1f:c3:42\"\n"
I0107 14:05:48.118663   62247 ovs.go:208] exec(5): stderr: ""
I0107 14:05:48.118702   62247 ovs.go:204] exec(6): /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 mac=62\:d7\:b8\:1f\:c3\:42
I0107 14:05:48.124704   62247 ovs.go:207] exec(6): stdout: ""
I0107 14:05:48.124728   62247 ovs.go:208] exec(6): stderr: ""
I0107 14:05:48.172487   62247 gateway_init.go:261] Initializing Gateway Functionality
I0107 14:05:48.172720   62247 gateway_localnet.go:131] Node local addresses initialized to: map[10.129.0.2:{10.129.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 9.114.97.88:{9.114.96.0 fffffc00} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::60d7:b8ff:fe1f:c342:{fe80:: ffffffffffffffff0000000000000000} fe80::bc4a:e1ff:fec1:62fa:{fe80:: ffffffffffffffff0000000000000000}]
I0107 14:05:48.172868   62247 helper_linux.go:74] Found default gateway interface env32 9.114.96.1
F0107 14:05:48.172917   62247 ovnkube.go:133] could not find IP addresses: failed to lookup link br-ex: Link not found

      Exit Code:    1
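
A quick node-side check for the missing bridge can be sketched as follows (node name per the listing above; ip and ovs-vsctl are the standard iproute2 and Open vSwitch tools):

# oc debug node/master-2 -- chroot /host ip link show br-ex
# oc debug node/master-2 -- chroot /host ovs-vsctl list-br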

Comment 32 krmoser 2022-01-08 22:55:25 UTC
Prashanth,

Thanks for the OCP 4.10 on Z build with the fix and your assistance yesterday.

Just an update since we spoke on Slack yesterday: in addition to OCP 4.9.12 on Z, OCP 4.9.11 and 4.9.13 on Z also successfully upgrade to your fix build based on OCP 4.10.0-0.nightly-s390x-2022-01-07-024817.

Specifically:
1. OCP 4.9.11 on Z successfully upgrades to the fix build based on OCP 4.10.0-0.nightly-s390x-2022-01-07-024817 on Z.
2. OCP 4.9.12 on Z successfully upgrades to the fix build based on OCP 4.10.0-0.nightly-s390x-2022-01-07-024817 on Z.
3. OCP 4.9.13 on Z successfully upgrades to the fix build based on OCP 4.10.0-0.nightly-s390x-2022-01-07-024817 on Z.
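
For reference, upgrading to an unsigned fix build like this is typically driven by an explicit release-image pullspec; a minimal sketch, with the pullspec left as a placeholder:

# oc adm upgrade --allow-explicit-upgrade --force --to-image=<fix-build release image pullspec>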

Thank you,
Kyle

Comment 34 Prashanth Sundararaman 2022-01-12 17:12:27 UTC
Kyle,

The latest nightly has the fix: https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp-dev-preview/4.10.0-0.nightly-s390x-2022-01-12-163931/. If you could test that and confirm it works, we can close this.
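
As a side note, the Kubernetes level bundled in a given nightly can be confirmed from its release image before upgrading; a sketch, with the pullspec left as a placeholder:

# oc adm release info <release image pullspec> | grep -i kubernetes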

Thanks
Prashanth

Comment 35 krmoser 2022-01-12 17:42:06 UTC
Prashanth,

Thanks for the updated build.  I'll test today and provide an update. 

Thank you,
Kyle

Comment 36 Prashanth Sundararaman 2022-01-12 19:56:43 UTC
Sorry Kyle, but could you actually test with this build too: https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp-dev-preview/4.10.0-0.nightly-s390x-2022-01-12-163931. It has Kubernetes bumped to 1.23, and that would be good to test as well.

Comment 37 krmoser 2022-01-12 20:02:38 UTC
Prashanth,

Thanks for the update.  Yes, will do.

Thank you,
Kyle

Comment 38 pdsilva 2022-01-13 06:55:17 UTC
Verified upgrade from 4.9.12 to 4.10.0-0.nightly-ppc64le-2022-01-13-022003 on Power.

# oc version
Client Version: 4.9.12
Server Version: 4.10.0-0.nightly-ppc64le-2022-01-13-022003
Kubernetes Version: v1.23.0+50f645e


# oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'
OVNKubernetes

# oc get clusterversion
NAME      VERSION                                      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         18m     Cluster version is 4.10.0-0.nightly-ppc64le-2022-01-13-022003


# oc get co
NAME                                       VERSION                                      AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      29m
baremetal                                  4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
cloud-controller-manager                   4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
cloud-credential                           4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
cluster-autoscaler                         4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
config-operator                            4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
console                                    4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      38m
csi-snapshot-controller                    4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
dns                                        4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
etcd                                       4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
image-registry                             4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      39m
ingress                                    4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      39m
insights                                   4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
kube-apiserver                             4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
kube-controller-manager                    4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
kube-scheduler                             4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
kube-storage-version-migrator              4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      39m
machine-api                                4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
machine-approver                           4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
machine-config                             4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      109m
marketplace                                4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
monitoring                                 4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
network                                    4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
node-tuning                                4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      72m
openshift-apiserver                        4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
openshift-controller-manager               4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      73m
openshift-samples                          4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      73m
operator-lifecycle-manager                 4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
operator-lifecycle-manager-catalog         4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
service-ca                                 4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h
storage                                    4.10.0-0.nightly-ppc64le-2022-01-13-022003   True        False         False      11h

Comment 39 krmoser 2022-01-13 11:02:01 UTC
Prashanth,


Thanks for the updates and builds.

1. For the OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2022-01-12-163931, with Kubernetes 1.23.0, the upgrade from OCP 4.9.14 was successful.  OCP 4.9.14 was previously upgraded from OCP 4.9.13.


# oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-s390x-2022-01-12-163931   True        False         38m     Cluster version is 4.10.0-0.nightly-s390x-2022-01-12-163931

# oc get nodes
NAME                                          STATUS   ROLES    AGE     VERSION
master-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master   4h40m   v1.22.1+6859754
master-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master   4h44m   v1.22.1+6859754
master-2.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master   4h43m   v1.22.1+6859754
worker-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker   4h29m   v1.22.1+6859754
worker-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker   4h29m   v1.22.1+6859754

# oc get nodes -o wide
NAME                                          STATUS   ROLES    AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
master-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master   4h41m   v1.22.1+6859754   10.20.116.11   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201120003-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
master-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master   4h45m   v1.22.1+6859754   10.20.116.12   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201120003-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
master-2.pok-25.ocptest.pok.stglabs.ibm.com   Ready    master   4h44m   v1.22.1+6859754   10.20.116.13   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201120003-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
worker-0.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker   4h30m   v1.22.1+6859754   10.20.116.14   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201120003-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
worker-1.pok-25.ocptest.pok.stglabs.ibm.com   Ready    worker   4h30m   v1.22.1+6859754   10.20.116.15   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201120003-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
#


2. Regarding the Kubernetes versions reported by the OCP CLI "oc get nodes" command, please note that:
    1. When upgrading from OCP 4.9.13 to OCP 4.9.14, the "oc get nodes" command correctly reports the Kubernetes version as 1.22.3, which matches the Kubernetes version 1.22.3 listed in the OCP 4.9.14 build's release.txt file.

    2. When upgrading from OCP 4.9.14 to OCP 4.10.0-0.nightly-s390x-2022-01-12-163931, the "oc get nodes" command incorrectly reports the Kubernetes version as 1.22.1, instead of the 1.23.0 version listed in that build's release.txt file.  The "oc get nodes -o wide" command does report the container runtime as "cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8".  (A direct query for the kubelet-reported version is sketched below.)
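
The VERSION column of "oc get nodes" reflects each node's kubelet version (status.nodeInfo.kubeletVersion), so it can lag behind the cluster version until the nodes themselves have been updated; a minimal sketch of querying it directly:

# oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'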
 

3. We are conducting additional OCP 4.9.x to OCP 4.10 upgrade tests for the new builds with Kubernetes 1.23.0, including with the currently latest available OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2022-01-13-022003, and will post the results here, including the "oc version", "oc get clusterversion", "oc get nodes", "oc get nodes -o wide", and "oc get co" output.

4. OCP 4.9.14 was released yesterday, with OCP 4.9.15 following several hours later.


Thank you,
Kyle

Comment 40 krmoser 2022-01-13 11:40:50 UTC
Prashanth,

1. The upgrade from OCP 4.9.14 to OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2022-01-13-022003 was successful.

# oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         3m49s   Cluster version is 4.10.0-0.nightly-s390x-2022-01-13-022003


# oc get nodes
NAME                                          STATUS   ROLES    AGE    VERSION
master-0.pok-99.ocptest.pok.stglabs.ibm.com   Ready    master   143m   v1.23.0+50f645e
master-1.pok-99.ocptest.pok.stglabs.ibm.com   Ready    master   143m   v1.23.0+50f645e
master-2.pok-99.ocptest.pok.stglabs.ibm.com   Ready    master   143m   v1.23.0+50f645e
worker-0.pok-99.ocptest.pok.stglabs.ibm.com   Ready    worker   128m   v1.23.0+50f645e
worker-1.pok-99.ocptest.pok.stglabs.ibm.com   Ready    worker   128m   v1.23.0+50f645e


# oc get nodes -o wide
NAME                                          STATUS   ROLES    AGE    VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
master-0.pok-99.ocptest.pok.stglabs.ibm.com   Ready    master   143m   v1.23.0+50f645e   10.20.116.211   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201121602-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
master-1.pok-99.ocptest.pok.stglabs.ibm.com   Ready    master   143m   v1.23.0+50f645e   10.20.116.212   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201121602-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
master-2.pok-99.ocptest.pok.stglabs.ibm.com   Ready    master   143m   v1.23.0+50f645e   10.20.116.213   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201121602-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
worker-0.pok-99.ocptest.pok.stglabs.ibm.com   Ready    worker   128m   v1.23.0+50f645e   10.20.116.214   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201121602-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8
worker-1.pok-99.ocptest.pok.stglabs.ibm.com   Ready    worker   128m   v1.23.0+50f645e   10.20.116.215   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201121602-0 (Ootpa)   4.18.0-305.30.1.el8_4.s390x   cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8


# oc get co
NAME                                       VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      16m
baremetal                                  4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      136m
cloud-controller-manager                   4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      143m
cloud-credential                           4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      143m
cluster-autoscaler                         4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      136m
config-operator                            4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      138m
console                                    4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      19m
csi-snapshot-controller                    4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      137m
dns                                        4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      136m
etcd                                       4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      136m
image-registry                             4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      130m
ingress                                    4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      22m
insights                                   4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      131m
kube-apiserver                             4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      133m
kube-controller-manager                    4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      135m
kube-scheduler                             4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      135m
kube-storage-version-migrator              4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      23m
machine-api                                4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      136m
machine-approver                           4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      136m
machine-config                             4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      135m
marketplace                                4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      137m
monitoring                                 4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      124m
network                                    4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      138m
node-tuning                                4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      126m
openshift-apiserver                        4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      131m
openshift-controller-manager               4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      134m
openshift-samples                          4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      82m
operator-lifecycle-manager                 4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      137m
operator-lifecycle-manager-catalog         4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      137m
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      131m
service-ca                                 4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      138m
storage                                    4.10.0-0.nightly-s390x-2022-01-13-022003   True        False         False      139m


# oc version
Client Version: 4.9.14
Server Version: 4.10.0-0.nightly-s390x-2022-01-13-022003
Kubernetes Version: v1.23.0+50f645e


2. The OCP CLI "oc get nodes" and "oc get nodes -o wide" commands correctly report the kubernetes version 1.23.0, as shown in the above output.

3. We are conducting some additional OCP 4.9.x to OCP 4.10 upgrade tests for the new builds with Kubernetes 1.23.0, and will post the results here.


Thank you,
Kyle

Comment 43 Luke Meyer 2022-03-11 17:58:00 UTC
This was shipped in the GA advisory, but was not automatically transitioned because it lacked the new subcomponent field. Closing manually.

