Bug 2099664 - MachineConfigPool is not getting updated
Summary: MachineConfigPool is not getting updated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: ppc64le
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Yu Qi Zhang
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks: 2103749
 
Reported: 2022-06-21 13:01 UTC by Sridhar Venkat (IBM)
Modified: 2023-01-17 19:50 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:50:08 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3211 0 None open Bug 2099664: daemon: initialize nodewriter before login monitor 2022-06-27 20:17:03 UTC
Red Hat Issue Tracker MULTIARCH-2667 0 None None None 2022-06-27 14:14:46 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:50:25 UTC

Description Sridhar Venkat (IBM) 2022-06-21 13:01:38 UTC

Description of problem:
After deploying OCP on a ppc64le environment, when I run:
[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get MachineConfigPool worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-627087f67ace055b92ca4c26488fdeb7   False     True       False      3              2                   2                     0                      18h

I expect READYMACHINECOUNT to be 3, but it is 2.

Then I stopped kubelet.service on one of the worker nodes. The counts became:

[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get MachineConfigPool worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-627087f67ace055b92ca4c26488fdeb7   False     True       False      3              1                   2                     0                      18h

READYMACHINECOUNT became 1. When I started the service again, the count went back to 2.

Version-Release number of MCO (Machine Config Operator) (if applicable): 4.11

Platform (AWS, VSphere, Metal, etc.): IBM Power

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Yes, as I am looking at MachineConfigPool.

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. As described above.

Actual results:

Expected results:

Additional info:

1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.

2. If a must-gather is unavailable, please provide the output of:

$ oc get co machine-config -o yaml

$ oc get mcp (and oc describe mcp/${degraded_pool} if pools are degraded)

$ oc get mc

$ oc get pod -n openshift-machine-config-operator

$ oc get node -o wide

3. If a node is not accessible via API, please provide console/journal/kubelet logs of the problematic node

4. Are there RHEL nodes on the cluster? If yes, please upload the whole Ansible logs or Jenkins job

Comment 1 Sridhar Venkat (IBM) 2022-06-21 14:26:08 UTC
[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-06-20T18:26:39Z"
  generation: 1
  name: machine-config
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: f588cffc-ca69-471a-b3e3-ae10d9389af9
  resourceVersion: "195134"
  uid: 192f78c1-88a5-431a-9fb8-37588daaece0
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-06-20T18:33:05Z"
    message: Cluster version is 4.11.0-0.nightly-ppc64le-2022-06-16-003709
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-06-20T18:50:36Z"
    message: One or more machine config pools are updating, please see `oc get mcp`
      for further details
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2022-06-21T03:36:12Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-06-21T03:36:16Z"
    message: Cluster has deployed [{operator 4.11.0-0.nightly-ppc64le-2022-06-16-003709}]
    status: "True"
    type: Available
  extension:
    master: all 3 nodes are at latest configuration rendered-master-c87dd4684c54c74ee43bf6cb07a4ff85
    worker: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-worker-627087f67ace055b92ca4c26488fdeb7
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: ""
    resource: controllerconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: kubeletconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: containerruntimeconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigs
  - group: ""
    name: ""
    resource: nodes
  - group: ""
    name: openshift-kni-infra
    resource: namespaces
  - group: ""
    name: openshift-openstack-infra
    resource: namespaces
  - group: ""
    name: openshift-ovirt-infra
    resource: namespaces
  - group: ""
    name: openshift-vsphere-infra
    resource: namespaces
  - group: ""
    name: openshift-nutanix-infra
    resource: namespaces
  versions:
  - name: operator
    version: 4.11.0-0.nightly-ppc64le-2022-06-16-003709
[root@rdr-sri-7cfc-tok04-bastion-0 ~]# 

Nothing is degraded.

[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c87dd4684c54c74ee43bf6cb07a4ff85   True      False      False      3              3                   3                     0                      19h
worker   rendered-worker-627087f67ace055b92ca4c26488fdeb7   False     True       False      3              2                   2                     0                      19h


[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
00-worker                                          0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
01-master-container-runtime                        0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
01-master-kubelet                                  0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
01-worker-container-runtime                        0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
01-worker-kubelet                                  0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
99-master-chrony-configuration                                                                2.2.0             19h
99-master-generated-registries                     0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
99-master-ssh                                                                                 3.2.0             19h
99-worker-chrony-configuration                                                                2.2.0             19h
99-worker-generated-registries                     0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
99-worker-ssh                                                                                 3.2.0             19h
rendered-master-c87dd4684c54c74ee43bf6cb07a4ff85   0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
rendered-worker-627087f67ace055b92ca4c26488fdeb7   0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
[root@rdr-sri-7cfc-tok04-bastion-0 ~]# 


[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get pod -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS             RESTARTS          AGE   IP               NODE                                  NOMINATED NODE   READINESS GATES
machine-config-controller-6c66f4d5f5-ncm5f   2/2     Running            1 (19h ago)       19h   10.129.0.11      tok04-master-1.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-52nvt                  2/2     Running            0                 19h   192.168.36.167   tok04-worker-0.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-l2f9r                  1/2     CrashLoopBackOff   230 (2m47s ago)   19h   192.168.36.100   tok04-worker-2.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-l66fl                  2/2     Running            0                 19h   192.168.36.10    tok04-master-2.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-p4dct                  2/2     Running            0                 19h   192.168.36.52    tok04-master-1.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-rhtxr                  2/2     Running            0                 19h   192.168.36.203   tok04-worker-1.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-x699v                  2/2     Running            0                 19h   192.168.36.78    tok04-master-0.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-operator-8475fdd878-prkwj     1/1     Running            1 (19h ago)       19h   10.128.0.11      tok04-master-0.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-server-d5wpl                  1/1     Running            0                 19h   192.168.36.78    tok04-master-0.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-server-jvp6d                  1/1     Running            0                 19h   192.168.36.52    tok04-master-1.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-server-mqzqq                  1/1     Running            0                 19h   192.168.36.10    tok04-master-2.rdr-sri-7cfc.ibm.com   <none>           <none>
[root@rdr-sri-7cfc-tok04-bastion-0 ~]# 


Logs from the failing pod:

[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc logs machine-config-daemon-l2f9r -n openshift-machine-config-operator
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
I0621 14:21:32.442851 1404020 start.go:112] Version: v4.11.0-202206142048.p0.g0bba861.assembly.stream-dirty (0bba861a74209e78177612789983183cdc77f1a5)
I0621 14:21:32.444651 1404020 start.go:125] Calling chroot("/rootfs")
I0621 14:21:32.445279 1404020 update.go:1962] Running: systemctl start rpm-ostreed
I0621 14:21:37.496328 1404020 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0621 14:21:37.542192 1404020 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0621 14:21:37.587780 1404020 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c42269031936a32eb36410d501b1e7e87b6ff72047f1209bc0b9544c46cfd841 (411.86.202206141021-0)
I0621 14:21:37.598870 1404020 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0621 14:21:37.600049 1404020 start.go:189] overriding kubernetes api to https://api-int.rdr-sri-7cfc.ibm.com:6443
I0621 14:21:37.600638 1404020 metrics.go:106] Registering Prometheus metrics
I0621 14:21:37.600731 1404020 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0621 14:21:37.605236 1404020 daemon.go:971] Detected a new login session: New session 1 of user core.
I0621 14:21:37.605248 1404020 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x125cc30]

goroutine 66 [running]:
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).applySSHAccessedAnnotation(0xc0002ac400?)
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:986 +0x30
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor.func1()
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:973 +0x18c
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:954 +0x1fc
[root@rdr-sri-7cfc-tok04-bastion-0 ~]#

Comment 2 Yu Qi Zhang 2022-06-21 16:45:52 UTC
So there are a few things to unpack here; let's go through them one at a time:

1. platform-none is specifically for vSphere/baremetal non-platformed deployments (and the corresponding templates), and I think this is core MCO, so let's move that sub-component. I'm not sure what differences exist for Power in terms of the MCO.

2. In your initial assessment it sounds like one of your three worker nodes is not ready for whatever reason, and then you SSH'ed(?) into a node. Which node did you SSH into? Was it tok04-worker-2.rdr-sri-7cfc.ibm.com? If so, given that the ready machine count dropped when you stopped kubelet, I think you SSH'ed into one of the good nodes.

3. For the failing pod, what is happening is that the MCO is trying to apply an SSH annotation because the node has been accessed, but the MCD doesn't have an initialized nodewriter, so it panicked when doing so. This is the first time I've ever seen this happen. Is this the node you SSH'ed into earlier to stop kubelet? If so, this should be a different issue than the one you initially reported, and only triggered because you SSH'ed into the node. We still don't know what the original issue is based on the existing information. (A minimal illustration of this panic is sketched at the end of this comment.)

4. If the above assessment is correct, there are two issues with this cluster: the panic due to the SSH annotation failing to be written, and the original node being NotReady. I would probably need a full must-gather from this cluster as well as other context (what this cluster is used for, what customizations are in place, what special things exist for ppc, etc.).


Marking this as not a blocker since it should be very rare and generally not a scenario you would get into. I tested with a regular 4.11 nightly on GCP and SSH seems fine (both the regular annotation write and a forced restart into a boot annotation write, as above).
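
For illustration, a minimal, self-contained Go sketch of the class of panic described in point 3: a method that dereferences a nil pointer receiver produces exactly the "invalid memory address or nil pointer dereference" SIGSEGV seen in the MCD logs above. The type and method names are assumptions for the sketch, not the actual MCO code.

package main

import "fmt"

type nodeWriter struct{ node string }

func (w *nodeWriter) SetSSHAnnotation() {
    // w.node dereferences the receiver; with w == nil this panics with the
    // same "nil pointer dereference" runtime error seen in the stack traces.
    fmt.Printf("applying machineconfiguration.openshift.io/ssh on %s\n", w.node)
}

func main() {
    var w *nodeWriter    // never initialized, like the daemon's writer at login-monitor time
    w.SetSSHAnnotation() // panic: runtime error: invalid memory address or nil pointer dereference
}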

Comment 3 shweta 2022-06-24 10:01:14 UTC
After deploying OCP 4.11.0-0.nightly-ppc64le-2022-06-21-133338 on a ppc64le environment, the same issue is seen with the MCP.


[root@rdr-shw4-1385-bastion-0 ~]# oc version
Client Version: 4.11.0-0.nightly-ppc64le-2022-06-21-133338
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-ppc64le-2022-06-21-133338
Kubernetes Version: v1.24.0+284d62a

[root@rdr-shw4-1385-bastion-0 ~]# oc get pods -A | grep -v Running | grep -v Completed
NAMESPACE                                          NAME                                                         READY   STATUS             RESTARTS          AGE
openshift-kube-apiserver                           installer-8-master-1                                         0/1     Error              0                 18h
openshift-machine-config-operator                  machine-config-daemon-bttgz                                  1/2     CrashLoopBackOff   215 (109s ago)    18h
openshift-machine-config-operator                  machine-config-daemon-j6sbw                                  1/2     CrashLoopBackOff   215 (3m31s ago)   18h


[root@rdr-shw4-1385-bastion-0 ~]# oc describe pod  machine-config-daemon-bttgz -n openshift-machine-config-operator
Name:                 machine-config-daemon-bttgz
Namespace:            openshift-machine-config-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 worker-0/9.114.99.175
Start Time:           Thu, 23 Jun 2022 07:02:03 -0400
Labels:               controller-revision-hash=864967879d
                      k8s-app=machine-config-daemon
                      pod-template-generation=1
Annotations:          openshift.io/scc: privileged
Status:               Running
IP:                   9.114.99.175
IPs:
  IP:           9.114.99.175
Controlled By:  DaemonSet/machine-config-daemon
Containers:
  machine-config-daemon:
    Container ID:  cri-o://3e33796ae57621dc1c501bc1f73fd7f28e8c9b7263dcc7c226f787df80c1d244
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb6f3cef8897c23632e1070264910d1b7371b0c2e5ad91483ff07b13d40ac27
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb6f3cef8897c23632e1070264910d1b7371b0c2e5ad91483ff07b13d40ac27
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/machine-config-daemon
    Args:
      start
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   I0624 05:18:36.510514 1321011 start.go:112] Version: v4.11.0-202206211257.p0.g37b7416.assembly.stream-dirty (37b741601f9b7ff9b2e1870102cc2970b24e1835)
I0624 05:18:36.513320 1321011 start.go:125] Calling chroot("/rootfs")
I0624 05:18:36.513388 1321011 update.go:1962] Running: systemctl start rpm-ostreed
I0624 05:18:42.277171 1321011 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0624 05:18:42.323875 1321011 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0624 05:18:42.370453 1321011 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c42269031936a32eb36410d501b1e7e87b6ff72047f1209bc0b9544c46cfd841 (411.86.202206141021-0)
I0624 05:18:42.384526 1321011 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0624 05:18:42.386124 1321011 start.go:189] overriding kubernetes api to https://api-int.rdr-shw4-1385.redhat.com:6443
I0624 05:18:42.386924 1321011 metrics.go:106] Registering Prometheus metrics
I0624 05:18:42.387025 1321011 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0624 05:18:42.392218 1321011 daemon.go:971] Detected a new login session: New session 1 of user core.
I0624 05:18:42.392234 1321011 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x125cc30]

goroutine 20 [running]:
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).applySSHAccessedAnnotation(0xc000250400?)
  /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:986 +0x30
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor.func1()
  /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:973 +0x18c
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor
  /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:954 +0x1fc

      Exit Code:    2
      Started:      Fri, 24 Jun 2022 01:18:36 -0400
      Finished:     Fri, 24 Jun 2022 01:18:42 -0400
    Ready:          False
    Restart Count:  215
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /rootfs from rootfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dxvbr (ro)
  oauth-proxy:
    Container ID:  cri-o://9e75d0f269b86c2e985a7260f3c670d0f7bf9ac4b9d7b6be46cd1399c3647458
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:78f76006f0a7ed69f40c52398ea7f2a5a86544c7fcefd4c86f8d77aa6182f879
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:78f76006f0a7ed69f40c52398ea7f2a5a86544c7fcefd4c86f8d77aa6182f879
    Port:          9001/TCP
    Host Port:     9001/TCP
    Args:
      --https-address=:9001
      --provider=openshift
      --openshift-service-account=machine-config-daemon
      --upstream=http://127.0.0.1:8797
      --tls-cert=/etc/tls/private/tls.crt
      --tls-key=/etc/tls/private/tls.key
      --cookie-secret-file=/etc/tls/cookie-secret/cookie-secret
      --openshift-sar={"resource": "namespaces", "verb": "get"}
      --openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}
    State:          Running
      Started:      Thu, 23 Jun 2022 07:03:23 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        20m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /etc/tls/cookie-secret from cookie-secret (rw)
      /etc/tls/private from proxy-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dxvbr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  rootfs:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  proxy-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  proxy-tls
    Optional:    false
  cookie-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cookie-secret
    Optional:    false
  kube-api-access-dxvbr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   14m (x214 over 18h)     kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb6f3cef8897c23632e1070264910d1b7371b0c2e5ad91483ff07b13d40ac27" already present on machine
  Warning  BackOff  4m28s (x4968 over 18h)  kubelet  Back-off restarting failed container


[root@rdr-shw4-1385-bastion-0 ~]# oc logs machine-config-daemon-bttgz -n openshift-machine-config-operator
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
I0624 05:34:15.467779 1339582 start.go:112] Version: v4.11.0-202206211257.p0.g37b7416.assembly.stream-dirty (37b741601f9b7ff9b2e1870102cc2970b24e1835)
I0624 05:34:15.470217 1339582 start.go:125] Calling chroot("/rootfs")
I0624 05:34:15.470311 1339582 update.go:1962] Running: systemctl start rpm-ostreed
I0624 05:34:20.899845 1339582 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0624 05:34:20.956738 1339582 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0624 05:34:21.011356 1339582 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c42269031936a32eb36410d501b1e7e87b6ff72047f1209bc0b9544c46cfd841 (411.86.202206141021-0)
I0624 05:34:21.025216 1339582 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0624 05:34:21.027056 1339582 start.go:189] overriding kubernetes api to https://api-int.rdr-shw4-1385.redhat.com:6443
I0624 05:34:21.027965 1339582 metrics.go:106] Registering Prometheus metrics
I0624 05:34:21.028061 1339582 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0624 05:34:21.033329 1339582 daemon.go:971] Detected a new login session: New session 1 of user core.
I0624 05:34:21.033345 1339582 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x125cc30]

goroutine 37 [running]:
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).applySSHAccessedAnnotation(0xc000740400?)
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:986 +0x30
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor.func1()
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:973 +0x18c
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:954 +0x1fc

Comment 4 shweta 2022-06-24 11:40:46 UTC
(In reply to shweta from comment #3)
> [quote of comment #3 snipped]

Must Gather log link: https://drive.google.com/file/d/1UXhTEufxwVATmNJo9MxrHkx7rbaHtzTE/view?usp=sharing

Comment 5 shweta 2022-06-24 12:06:38 UTC
(In reply to shweta from comment #3)
> [quote of comment #3 snipped]

Must Gather log link: https://drive.google.com/file/d/1UXhTEufxwVATmNJo9MxrHkx7rbaHtzTE/view?usp=sharing

Comment 6 Sridhar Venkat (IBM) 2022-06-27 12:49:59 UTC
@sbiragda Thanks for providing the logs. This problem is blocking us from deploying ODF on OCP.

Comment 7 shweta 2022-06-27 13:30:37 UTC
The issue with the MCP is seen on the 4.11.0-fc.3 build as well.

[root@rdr-shw4-e184-bastion-0 ~]# oc version
Client Version: 4.11.0-fc.3
Kustomize Version: v4.5.4
Server Version: 4.11.0-fc.3
Kubernetes Version: v1.24.0+284d62a

[root@rdr-shw4-e184-bastion-0 ~]# oc get pods -A | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                         READY   STATUS             RESTARTS         AGE

openshift-machine-config-operator                  machine-config-daemon-4ztc5                                  1/2     Error              57 (5m20s ago)   4h32m
openshift-machine-config-operator                  machine-config-daemon-9fnh5                                  1/2     CrashLoopBackOff   57 (21s ago)     4h32m

[root@rdr-shw4-e184-bastion-0 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-0f88e229ebe0dcd9249a3f56216ef41d   True      False      False      3              3                   3                     0                      4h46m
worker   rendered-worker-58ade7fb1aa863df1fe5ee1680df0ef0   False     True       False      2              0                   0                     0                      4h46m

Comment 8 Sinny Kumari 2022-06-27 14:00:36 UTC
>> I0624 05:34:21.033329 1339582 daemon.go:971] Detected a new login session: New session 1 of user core.
>> I0624 05:34:21.033345 1339582 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh

I don't understand why there is a need to SSH into a machine while the cluster is being created. Jerry has added some initial thoughts in comment #2.

The MCO team doesn't have access to ppc64le hardware to create a cluster and reproduce this issue. I will let the multi-arch team assess this problem first to understand the root cause.

Comment 9 Yu Qi Zhang 2022-06-27 14:30:24 UTC
Like Sinny said, generally you shouldn't be SSH'ing into nodes in OCP clusters. Where is the SSH coming from (if there is an SSH at all), and could it be skipped?

Comment 10 Sridhar Venkat (IBM) 2022-06-27 15:57:00 UTC
The SSH mentioned in this test was only to demonstrate that the node is accessible. We do not SSH into the nodes until the OCP deployment is complete.
And after the deployment is over, we find the machine config pool not reporting the right number in the READYMACHINECOUNT column. This is causing problems deploying ODF (the prevalidation step of the automated ODF deployment looks for the right number in the READY column, otherwise it quits).

Comment 11 Sridhar Venkat (IBM) 2022-06-27 16:14:28 UTC
I just deployed a cluster. With no SSH into any nodes, I saw this:

[root@nx124-411-4f09-sao01-bastion-0 ~]# oc get MachineConfigPool worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-4a60f72cc2342ef860c5c7bdb0d07de1   False     True       False      3              2                   2                     0                      27m


READYMACHINECOUNT is 2 instead of 3.


[root@nx124-411-4f09-sao01-bastion-0 ~]# oc get pod -n openshift-machine-config-operator -o wide
NAME                                        READY   STATUS             RESTARTS        AGE   IP              NODE                                    NOMINATED NODE   READINESS GATES
machine-config-controller-76975949b-vq6tl   2/2     Running            1 (23m ago)     27m   10.128.0.8      sao01-master-0.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-2vz9d                 2/2     Running            0               11m   192.168.0.56    sao01-worker-1.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-4c6l6                 2/2     Running            0               11m   192.168.0.224   sao01-worker-0.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-6xq4m                 2/2     Running            0               28m   192.168.0.31    sao01-master-0.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-7rts8                 1/2     CrashLoopBackOff   7 (2m54s ago)   14m   192.168.0.159   sao01-worker-2.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-cggl2                 2/2     Running            0               28m   192.168.0.64    sao01-master-1.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-wm8jt                 2/2     Running            0               28m   192.168.0.226   sao01-master-2.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-operator-647c8d6c55-hbdvq    1/1     Running            1 (22m ago)     32m   10.130.0.9      sao01-master-1.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-server-8s8gv                 1/1     Running            0               27m   192.168.0.64    sao01-master-1.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-server-p2wn6                 1/1     Running            0               27m   192.168.0.226   sao01-master-2.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-server-zrlfk                 1/1     Running            0               27m   192.168.0.31    sao01-master-0.nx124-411-4f09.ibm.com   <none>           <none>
[root@nx124-411-4f09-sao01-bastion-0 ~]# 


[root@nx124-411-4f09-sao01-bastion-0 ~]# oc logs machine-config-daemon-7rts8 -n openshift-machine-config-operator
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
I0627 16:10:25.488671   18800 start.go:112] Version: v4.11.0-202206230535.p0.ge595ae9.assembly.stream-dirty (e595ae9c763e14213450d77f557bc01643db9fe3)
I0627 16:10:25.490672   18800 start.go:125] Calling chroot("/rootfs")
I0627 16:10:25.490768   18800 update.go:1962] Running: systemctl start rpm-ostreed
I0627 16:10:28.638650   18800 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0627 16:10:28.685424   18800 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0627 16:10:28.735944   18800 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:adb4ae15df5e062e1e7ff35324b98ddf3be2372ab88511464900887ef8f74343 (411.86.202206231243-0)
I0627 16:10:28.746099   18800 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0627 16:10:28.747021   18800 start.go:189] overriding kubernetes api to https://api-int.nx124-411-4f09.ibm.com:6443
I0627 16:10:28.747642   18800 metrics.go:106] Registering Prometheus metrics
I0627 16:10:28.747754   18800 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0627 16:10:28.752686   18800 daemon.go:971] Detected a new login session: New session 1 of user core.
I0627 16:10:28.752699   18800 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x125cd60]

goroutine 84 [running]:
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).applySSHAccessedAnnotation(0xc00048a400?)
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:986 +0x30
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor.func1()
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:973 +0x18c
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor
        /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:954 +0x1fc
[root@nx124-411-4f09-sao01-bastion-0 ~]#

Comment 12 Sridhar Venkat (IBM) 2022-06-27 16:15:29 UTC
[root@nx124-411-4f09-sao01-bastion-0 ~]# oc get nodes
NAME                                    STATUS   ROLES    AGE   VERSION
sao01-master-0.nx124-411-4f09.ibm.com   Ready    master   31m   v1.24.0+284d62a
sao01-master-1.nx124-411-4f09.ibm.com   Ready    master   31m   v1.24.0+284d62a
sao01-master-2.nx124-411-4f09.ibm.com   Ready    master   31m   v1.24.0+284d62a
sao01-worker-0.nx124-411-4f09.ibm.com   Ready    worker   13m   v1.24.0+284d62a
sao01-worker-1.nx124-411-4f09.ibm.com   Ready    worker   12m   v1.24.0+284d62a
sao01-worker-2.nx124-411-4f09.ibm.com   Ready    worker   15m   v1.24.0+284d62a
[root@nx124-411-4f09-sao01-bastion-0 ~]#

Comment 13 Prashanth Sundararaman 2022-06-27 16:25:35 UTC
Sridhar,

Does the deployment automation SSH into the nodes, by any chance?

Prashanth

Comment 14 Sridhar Venkat (IBM) 2022-06-27 17:29:51 UTC
We use ocp4-upi-powervs to deploy the infrastructure and OCP. I am asking the owner of ocp4-upi-powervs whether there is any SSH operation into the nodes; I doubt it, but I will update this bug with an answer. But you can see that not all MCD pods are in CrashLoopBackOff state; only one of them crashed.

Comment 15 Julie 2022-06-27 17:40:41 UTC
(In reply to Sridhar Venkat (IBM) from comment #6)
> @sbiragda Thanks for providing the logs. This problem is blocking
> us to deploy ODF on OCP.


This is a blocking bug for the Power QE team too.
We are hitting this issue with all recent builds of 4.11, including fc.3. It is blocking the completion of a few regression tests for 4.11.

Comment 16 Yu Qi Zhang 2022-06-27 18:11:02 UTC
> But you can see not all mc pods are in crashloopbackoff state. Only one of them crashed. 

That is the weirdest part for me. If this were a purely infra/MCD issue, all 3 MCDs should be crashing with the same error, as they are doing the exact same thing. So far, this bug points to only a single node having some sort of difference.

The check that the MCD applies is from the output of: journalctl -b -f -o cat -u systemd-logind.service MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66

If you can run that command on all 3 nodes and check for diffs, that would also help us debug the issue.
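
As a convenience, here is a small self-contained Go sketch of that per-node check; it just shells out to the journalctl query above (minus -f, so it returns instead of following) and prints any logind session-start entries from the current boot. The helper itself is an illustration, not part of the MCO.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

func main() {
    // Same logind query as above (session-start MESSAGE_ID), without -f.
    out, err := exec.Command("journalctl", "-b", "-o", "cat",
        "-u", "systemd-logind.service",
        "MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66").CombinedOutput()
    if err != nil {
        fmt.Println("journalctl failed:", err, string(out))
        return
    }
    text := strings.TrimSpace(string(out))
    if text == "" || strings.Contains(text, "No entries") {
        fmt.Println("no logind session entries recorded this boot")
        return
    }
    for _, line := range strings.Split(text, "\n") {
        fmt.Println("session entry:", line)
    }
}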

Again, I cannot reproduce this on a normal cluster. I have only seen this error before when running the MCO tests OUTSIDE of a cluster environment, since there is no node in that case. For the MCD to have gotten to the point of that error, the node should have been initialized by then.

Comment 17 CSZHANG 2022-06-27 18:19:54 UTC
ocp4-upi-powervs calls ocp4-playbooks to perform the OCP installation. During the installation, ocp4-playbooks will try to SSH to all cluster nodes to do some configuration. If the playbook cannot SSH to the cluster nodes, it will time out on SSH and exit the installation with an error. If the playbook completed without error, that means it is not the playbook's problem; it is OCP's problem.

Comment 18 Sridhar Venkat (IBM) 2022-06-27 18:21:13 UTC
Thanks, ChongShi, for your comment. The deployment completed successfully without any problems. And as you can see above, I could also manually SSH into one of the nodes successfully.

Comment 19 Sridhar Venkat (IBM) 2022-06-27 18:56:32 UTC
I reproduced the problem, and the pod is crashing on worker-0. I tried to run journalctl -b -f -o cat -u systemd-logind.service MESSAGE_ID=8d45620c1a4348dbb17410da57c60c6 on all three nodes; it hangs forever. There is no output on any of the worker nodes even after 5 minutes.

Comment 20 Sridhar Venkat (IBM) 2022-06-27 19:01:19 UTC
BTW, in the same environment, I deployed another cluster and am not able to reproduce the problem. On this healthy node, I tried the journalctl command:
[core@sao01-worker-0 ~]$ journalctl -b -f -o cat -u systemd-logind.service MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66
New session 1 of user core.

and then it hangs forever.

In this environment, at least it prints 'New session 1 of user core'. In the unhealthy cluster above, this line is not present (the command does not print anything and hangs).

Comment 21 Yu Qi Zhang 2022-06-27 19:54:21 UTC
Sorry, I put an additional -f in there, which means it is following (it's not hanging, it's listening for SSH's). The actual command should be: journalctl -b -o cat -u systemd-logind.service MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66

In the case of it hanging, it just means there were no additional SSH's. The "New session 1 of user core." message indicates that for that specific node an SSH did happen, likely prior to you running the command.

> ocp4-playbook will try to ssh to all cluster nodes to do some configuration

May I know what this configuration is? And after the nodes have applied the configuration, do the nodes get rebooted, etc.? Generally in OCP 4 we tell people not to SSH into nodes.

I checked our code, and we introduced a race condition in this code path in 4.11, so I will fix that ASAP.
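
For context, a hedged Go sketch of the race and of the fix direction suggested by the linked PR title ("daemon: initialize nodewriter before login monitor"): the writer that the SSH-annotation path needs must exist before the goroutine that can call it is started. All names below are illustrative, not the real MCO API.

package main

import (
    "fmt"
    "sync"
)

type nodeWriter struct{ node string }

func (w *nodeWriter) SetSSHAnnotation() {
    fmt.Printf("applying machineconfiguration.openshift.io/ssh on %s\n", w.node)
}

type daemon struct {
    nodeWriter *nodeWriter
    sessions   chan string
    wg         sync.WaitGroup
}

func (d *daemon) runLoginMonitor() {
    d.wg.Add(1)
    go func() {
        defer d.wg.Done()
        for range d.sessions {
            // With the racy ordering this can run while nodeWriter is
            // still nil, which is the nil-pointer panic in this bug.
            d.nodeWriter.SetSSHAnnotation()
        }
    }()
}

func main() {
    d := &daemon{sessions: make(chan string, 1)}
    // An SSH session already recorded in the journal at daemon startup.
    d.sessions <- "New session 1 of user core."

    // Fixed ordering: create the writer first, then start the monitor.
    // Starting the monitor before this assignment reproduces the crash.
    d.nodeWriter = &nodeWriter{node: "worker-0"}
    d.runLoginMonitor()

    close(d.sessions)
    d.wg.Wait()
}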

Comment 22 Sridhar Venkat (IBM) 2022-06-27 23:46:16 UTC
@jerzhang I am not sure if the nodes are restarted during the deployment. @chonzhan can answer this question.

Let me know when the race condition is fixed, and we will attempt another deployment.

Comment 23 CSZHANG 2022-06-28 13:36:58 UTC
The ocp4-playbook will SSH to the cluster nodes to perform an MTU setting, due to a bug in the ibmveth network adapter; there is no reboot.

Comment 24 Sridhar Venkat (IBM) 2022-06-29 14:57:50 UTC
Two subsequent deploys since early this morning have been good so far, FYI.

Comment 25 Yu Qi Zhang 2022-06-29 15:11:56 UTC
Thanks for the responses.

> Two subsequent deploys are good so far since early today morning. FYI.

This is a race condition, so hopefully it won't happen too often. As a workaround when it does happen, you can try removing the SSH journal entry.

> Let me know when the race condition is fixed, and we will attempt another deployment.

This is currently blocked on CI issues due to branching. We will fix it in 4.12 and backport to 4.11, since code freeze for 4.11.0 is in effect.

Comment 26 Sridhar Venkat (IBM) 2022-06-30 12:56:44 UTC
Thanks, @jerzhang. We will wait. We saw this problem last night.

Comment 28 Sridhar Venkat (IBM) 2022-07-03 18:02:55 UTC
@jerzhang Is this problem fixed? I do not see new builds in https://mirror.openshift.com/pub/openshift-v4/ppc64le/clients/ocp-dev-preview/latest-4.11/. It has install images as of June 23rd. We saw the problem this morning.

Comment 29 Julie 2022-07-04 10:38:50 UTC
(In reply to Sridhar Venkat (IBM) from comment #28)
> @jerzhang Is this problem fixed? I do not see new builds in
> https://mirror.openshift.com/pub/openshift-v4/ppc64le/clients/ocp-dev-
> preview/latest-4.11/. It has install images as of June 23rd. We saw the
> problem today morning.


Just FYI:
@jerzhang @svenkat
The OCP QE on Power team was able to reproduce the issue with the 4.11 rc.0 build dated June 30: https://mirror.openshift.com/pub/openshift-v4/ppc64le/clients/ocp/4.11.0-rc.0/
We are waiting for the next rc build to re-test and verify the fix.

Comment 30 Yu Qi Zhang 2022-07-04 17:27:40 UTC
This has not been fixed for 4.11. The fix version is in the BZ field above; it is now fixed in the 4.12 branches. I will begin the backport-to-4.11 process.

Comment 33 lmcfadde 2022-07-11 19:41:15 UTC
@jerzhang do you have an outlook for completing the backport, and is there a different BZ (or PR) where I can see when it completes? Do you know if it is targeted for OCP 4.11 GA or a future OCP 4.11 z-stream post GA?

Comment 34 Prashanth Sundararaman 2022-07-11 22:00:06 UTC
The latest feature candidate has the fix: https://ppc64le.ocp.releases.ci.openshift.org/releasestream/4-stable-ppc64le/release/4.11.0-rc.2

Comment 35 Rio Liu 2022-07-12 02:12:55 UTC
@lmcfadde The backport BZ is https://bugzilla.redhat.com/show_bug.cgi?id=2103749

Comment 36 Sridhar Venkat (IBM) 2022-07-12 14:06:21 UTC
I see only 4.11.0-rc.1 under https://mirror.openshift.com/pub/openshift-v4/ppc64le/clients/ocp/. Will rc.2 be available there?

Comment 37 Prashanth Sundararaman 2022-07-12 15:11:51 UTC
Until it is available in the mirror, you can test with the image I pointed to above: quay.io/openshift-release-dev/ocp-release:4.11.0-rc.2-ppc64le

Comment 38 shweta 2022-07-13 08:15:09 UTC
The issue with the MCP is not seen on the 4.11.0-rc.2 build.


[root@rdr-ci-ocp-upi-a7bf-tor01-bastion-0 ~]# oc version
Client Version: 4.11.0-rc.2
Kustomize Version: v4.5.4
Server Version: 4.11.0-rc.2
Kubernetes Version: v1.24.0+9546431

[root@rdr-ci-ocp-upi-a7bf-tor01-bastion-0 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-44d47954f24666349149b34be4fdcbf9   True      False      False      3              3                   3                     0                      55m
worker   rendered-worker-495bb191d4d14a21b89733276315bf3b   True      False      False      2              2                   2                     0                      55m

[root@rdr-ci-ocp-upi-a7bf-tor01-bastion-0 ~]# oc get pod -n openshift-machine-config-operator -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP                NODE                                            NOMINATED NODE   READINESS GATES
machine-config-controller-dbcc464ff-n7fjv   2/2     Running   0          68m   10.128.0.10       tor01-master-1.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-57wbb                 2/2     Running   0          51m   193.168.200.80    tor01-worker-0.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-bm22f                 2/2     Running   0          69m   193.168.200.51    tor01-master-2.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-fwt7b                 2/2     Running   0          64m   193.168.200.221   tor01-master-0.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-hhx2b                 2/2     Running   0          50m   193.168.200.230   tor01-worker-1.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-v678k                 2/2     Running   0          69m   193.168.200.3     tor01-master-1.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-operator-86c57dd94c-cmnwb    1/1     Running   0          73m   10.129.0.2        tor01-master-2.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-server-jmm6d                 1/1     Running   0          64m   193.168.200.221   tor01-master-0.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-server-lxkps                 1/1     Running   0          68m   193.168.200.3     tor01-master-1.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-server-xhqdd                 1/1     Running   0          68m   193.168.200.51    tor01-master-2.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>


[root@rdr-ci-ocp-upi-a7bf-tor01-bastion-0 ~]# oc get pods -A | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                           READY   STATUS              RESTARTS      AGE

Comment 39 Sridhar Venkat (IBM) 2022-07-18 01:32:20 UTC
We have not seen this problem for the past four days, FYI.

Comment 40 Julie 2022-07-19 05:43:04 UTC
The issue is NOT reproduced on an OCP 4.12 cluster on the IBM Power platform. All nodes, COs, and pods are in a good state.
Build used: 4.12.0-0.nightly-ppc64le-2022-07-07-144302

# oc version
Client Version: 4.12.0-0.nightly-ppc64le-2022-07-07-144302
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-ppc64le-2022-07-07-144302
Kubernetes Version: v1.24.0+bd7662a

# oc get node
NAME       STATUS   ROLES    AGE   VERSION
master-0   Ready    master   46m   v1.24.0+2dd8bb1
master-1   Ready    master   46m   v1.24.0+2dd8bb1
master-2   Ready    master   46m   v1.24.0+2dd8bb1
worker-0   Ready    worker   32m   v1.24.0+2dd8bb1
worker-1   Ready    worker   32m   v1.24.0+2dd8bb1

[root@rdr-shw-412-ba82-bastion-0 ~]# oc get co
NAME                                       VERSION                                      AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      23m
cloud-controller-manager                   4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      46m
cloud-credential                           4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      48m
cluster-autoscaler                         4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
config-operator                            4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      45m
console                                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      28m
csi-snapshot-controller                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
dns                                        4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
etcd                                       4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      41m
image-registry                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      20m
ingress                                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      31m
insights                                   4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      38m
kube-apiserver                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      39m
kube-controller-manager                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      42m
kube-scheduler                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      41m
kube-storage-version-migrator              4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
machine-api                                4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
machine-approver                           4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
machine-config                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      42m
marketplace                                4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
monitoring                                 4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      28m
network                                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      45m
node-tuning                                4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
openshift-apiserver                        4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      36m
openshift-controller-manager               4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
openshift-samples                          4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      36m
operator-lifecycle-manager                 4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      36m
service-ca                                 4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
storage                                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-3b0ed0ab0a7b81be3e884db08c4062c5   True      False      False      3              3                   3                     0                      48m
worker   rendered-worker-69e5eb440d1ea81cf0fb56566091c996   True      False      False      2              2                   2                     0                      48m

# oc get pods -A | grep -v Running | grep -v Completed
NAMESPACE                                          NAME                                                         READY   STATUS      RESTARTS      AGE

Comment 41 Rio Liu 2022-07-19 05:56:40 UTC
Moving this BZ to VERIFIED.

Comment 42 Yu Qi Zhang 2022-07-19 15:26:59 UTC
Was on PTO. I see that the request has been answered and the BZ is verified. Let me know if anything else is needed from me.

Comment 46 errata-xmlrpc 2023-01-17 19:50:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

