Description of problem:

Based on telemetry data, a UPI AWS cluster has been stuck trying to upgrade from 4.1.18 to 4.2.0-rc.1 for the past 6 days. machine-config is reporting Degraded with RequiredPoolsFailed. Supportshell shows:

Unable to apply 4.2.0-rc.1: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-d7df7ffc4886508dcc5aaa2ed70cad6e expected b8898db9af98e5c3d6a450ae123121677b0dbcb3 has a2175e587b007272f26305fe7d8b603c49e8f1fc, retrying

Version-Release number of selected component (if applicable):
4.1.18 -> 4.2.0-rc.1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
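For reference, a rough way to inspect the mismatch the operator is reporting (a sketch only; the rendered-master name is taken from the error message above, and the generated-by-controller-version annotation is the standard MCO annotation on rendered MachineConfigs):

```
# Which controller version stamped the rendered config the master pool is on
oc get machineconfig rendered-master-d7df7ffc4886508dcc5aaa2ed70cad6e -o yaml \
  | grep generated-by-controller-version

# Pool status and the configuration it is currently targeting
oc get machineconfigpool master -o yaml

# Overall operator / upgrade state
oc get clusteroperator machine-config
oc get clusterversion
```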
Created attachment 1624884 [details] must-gather.partaf
Created attachment 1624885 [details] must-gather.partal
Created attachment 1624886 [details] must-gather.partar
Created attachment 1624887 [details] must-gather.partax
Created attachment 1624888 [details] must-gather.partbd
Created attachment 1624889 [details] must-gather.partaa
Created attachment 1624890 [details] must-gather.partag
Created attachment 1624891 [details] must-gather.partam
Created attachment 1624892 [details] must-gather.partas
Created attachment 1624893 [details] must-gather.partay
Created attachment 1624894 [details] must-gather.partab
Created attachment 1624895 [details] must-gather.partah
Created attachment 1624896 [details] must-gather.partan
Created attachment 1624897 [details] must-gather.partat
Created attachment 1624898 [details] must-gather.partaz
Created attachment 1624899 [details] must-gather.partac
Created attachment 1624900 [details] must-gather.partai
Created attachment 1624901 [details] must-gather.partao
Created attachment 1624902 [details] must-gather.partau
Created attachment 1624903 [details] must-gather.partba
Created attachment 1624904 [details] must-gather.partad
Created attachment 1624905 [details] must-gather.partaj
Created attachment 1624906 [details] must-gather.partap
Created attachment 1624907 [details] must-gather.partav
Created attachment 1624908 [details] must-gather.partbb
Created attachment 1624909 [details] must-gather.partae
Created attachment 1624910 [details] must-gather.partak
Created attachment 1624911 [details] must-gather.partaq
Created attachment 1624912 [details] must-gather.partaw
Created attachment 1624913 [details] must-gather.partbc
So masters on this cluster have been SSH accessed:

2019-10-07T15:37:07.212551663Z I1007 15:37:07.212474   11165 daemon.go:542] Detected a new login session: New session 1 of user core.
2019-10-07T15:37:07.212551663Z I1007 15:37:07.212492   11165 daemon.go:543] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh

Then, the MCD on one of the masters reports a mismatch between the expected kubelet config (A) and what is actually on disk (B):

```
2019-10-11T19:27:21.264402178Z
2019-10-11T19:27:21.264402178Z A: cgroupDriver: systemd
2019-10-11T19:27:21.264402178Z clusterDNS:
2019-10-11T19:27:21.264402178Z - 10.56.0.10
2019-10-11T19:27:21.264402178Z clusterDomain: cluster.local
2019-10-11T19:27:21.264402178Z maxPods: 250
2019-10-11T19:27:21.264402178Z runtimeRequestTimeout: 10m
2019-10-11T19:27:21.264402178Z serializeImagePulls: false
2019-10-11T19:27:21.264402178Z staticPodPath: /etc/kubernetes/manifests
2019-10-11T19:27:21.264402178Z systemReserved:
2019-10-11T19:27:21.264402178Z   cpu: 500m
2019-10-11T19:27:21.264402178Z   memory: 500Mi
2019-10-11T19:27:21.264402178Z featureGates:
2019-10-11T19:27:21.264402178Z   RotateKubeletServerCertificate: true
2019-10-11T19:27:21.264402178Z   ExperimentalCriticalPodAnnotation: true
2019-10-11T19:27:21.264402178Z   SupportPodPidsLimit: true
2019-10-11T19:27:21.264402178Z   LocalStorageCapacityIsolation: false
2019-10-11T19:27:21.264402178Z serverTLSBootstrap: true
2019-10-11T19:27:21.264402178Z
2019-10-11T19:27:21.264402178Z
2019-10-11T19:27:21.264402178Z B: authentication:
2019-10-11T19:27:21.264402178Z   x509:
2019-10-11T19:27:21.264402178Z     clientCAFile: /etc/kubernetes/kubelet-ca.crt
2019-10-11T19:27:21.264402178Z   anonymous:
2019-10-11T19:27:21.264402178Z     enabled: false
2019-10-11T19:27:21.264402178Z cgroupDriver: systemd
2019-10-11T19:27:21.264402178Z clusterDNS:
2019-10-11T19:27:21.264402178Z - 10.56.0.10
2019-10-11T19:27:21.264402178Z clusterDomain: cluster.local
2019-10-11T19:27:21.264402178Z containerLogMaxSize: 50Mi
2019-10-11T19:27:21.264402178Z maxPods: 250
2019-10-11T19:27:21.264402178Z serializeImagePulls: false
2019-10-11T19:27:21.264402178Z staticPodPath: /etc/kubernetes/manifests
2019-10-11T19:27:21.264402178Z systemReserved:
2019-10-11T19:27:21.264402178Z   cpu: 500m
2019-10-11T19:27:21.264402178Z   memory: 500Mi
2019-10-11T19:27:21.264402178Z featureGates:
2019-10-11T19:27:21.264402178Z   RotateKubeletServerCertificate: true
2019-10-11T19:27:21.264402178Z   ExperimentalCriticalPodAnnotation: true
2019-10-11T19:27:21.264402178Z   SupportPodPidsLimit: true
2019-10-11T19:27:21.264402178Z   LocalStorageCapacityIsolation: false
2019-10-11T19:27:21.264402178Z serverTLSBootstrap: true
```

So it looks like someone jumped on the node and changed kubelet.conf manually, and now the MCD is understandably complaining.
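If it helps the investigation, here is a sketch of how to see which masters were flagged for SSH access and what rendered configs the MCD considers current vs. desired (the annotation names are the standard machineconfiguration.openshift.io ones, as seen in the log above; the node name in the second command is a placeholder):

```
# Dump MCD-related annotations for each master
for node in $(oc get nodes -l node-role.kubernetes.io/master= -o name); do
  echo "== $node"
  oc get "$node" -o yaml | grep 'machineconfiguration.openshift.io/'
done

# Inspect the on-disk kubelet config on a given master
oc debug node/<master-node> -- chroot /host cat /etc/kubernetes/kubelet.conf
```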
Moving to 4.3 to investigate further, but since no other such reports have come in, I'm still leaning towards this not being a blocker, as it looks like someone manually patched the configuration.
Hi Antonio, thank you for those notes. I will reach out to the customer to see what changes they made and will update here once they get back to me.
re podman's pivot problem, RHCOS team: PTAL
This may be a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1768879
@erjones Would it be possible to have the customer try out the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1768879#c13 ? Essentially: upgrade the cluster to a newer version of 4.1 that has the fixed `podman`, then try upgrading to OCP 4.2.
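A sketch of that workaround flow with `oc adm upgrade` (the 4.1.z and 4.2 targets below are placeholders; pick a 4.1.z release that actually contains the podman fix referenced in bug 1768879):

```
oc adm upgrade                   # list available updates in the current channel
oc adm upgrade --to=<4.1.z>      # step up to a newer 4.1 release first
oc get clusterversion            # wait for the 4.1.z upgrade to complete
oc adm upgrade --to=<4.2.x>      # then retry the 4.2 upgrade
```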
Without additional information, we are unable to investigate this further in time for the approaching 4.3 deadline. Moving to 4.4.
I believe this was a dup. Support, please try having them upgrade to the latest 4.1 before proceeding to the latest >= 4.2.

*** This bug has been marked as a duplicate of bug 1768879 ***