Bug 1807104
| Summary: | Running the OLM and OperatorHub configuration for restricted networks in an IPv6 bare metal deployment leaves nodes in NotReady,SchedulingDisabled state |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Machine Config Operator |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Version: | 4.3.z |
| Target Release: | 4.3.z |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Marius Cornea <mcornea> |
| Assignee: | Antoni Segura Puimedon <asegurap> |
| QA Contact: | Marius Cornea <mcornea> |
| CC: | agurenko, amurdaca, anli, asegurap, cdoan, farandac, jerzhang, jforrest, jparrill, kboumedh, ohochman, sasha, sberens, wsun, yprokule |
| Keywords: | TestBlocker |
| Type: | Bug |
| Cloned to: | 1810331 (view as bug list) |
| Last Closed: | 2020-03-24 14:33:46 UTC |
| Bug Depends On: | 1810331 |
| Bug Blocks: | 1771572 |
| Attachments: | ImageContentSourcePolicy (attachment 1667327) |
**Description** (Marius Cornea, 2020-02-25 15:34:35 UTC)
```
[kni@provisionhost-0 ~]$ oc get nodes master-0.ocp-edge-cluster.qe.lab.redhat.com -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    k8s.ovn.org/l3-gateway-config: '{"default":{"interface-id":"br-local_master-0.ocp-edge-cluster.qe.lab.redhat.com","ip-address":"fd99::2/64","mac-address":"42:24:c5:4e:6c:45","mode":"local","next-hop":"fd99::1","node-port-enable":"true","vlan-id":"0"}}'
    k8s.ovn.org/node-chassis-id: 3aeaac98-fa13-4026-a5ab-840bcd0239c5
    k8s.ovn.org/node-join-subnets: '{"default":"fd98::10/125"}'
    k8s.ovn.org/node-mgmt-port-mac-address: e2:dd:a4:85:85:11
    k8s.ovn.org/node-subnets: '{"default":"fd01:0:0:3::/64"}'
    machineconfiguration.openshift.io/currentConfig: rendered-master-bc96a37b957a32a299d54e386514c8e0
    machineconfiguration.openshift.io/desiredConfig: rendered-master-d69cc937b725ac36d82250e6b0c1096b
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Working
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-02-25T02:37:19Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: master-0.ocp-edge-cluster.qe.lab.redhat.com
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
    node-role.kubernetes.io/worker: ""
    node.openshift.io/os_id: rhcos
  name: master-0.ocp-edge-cluster.qe.lab.redhat.com
  resourceVersion: "218891"
  selfLink: /api/v1/nodes/master-0.ocp-edge-cluster.qe.lab.redhat.com
  uid: 8b250330-ee8d-4f69-93fc-64d25b7a4ca2
spec:
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2020-02-25T14:56:02Z"
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2020-02-25T14:59:36Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2020-02-25T14:59:41Z"
  unschedulable: true
status:
  addresses:
  - address: fd2e:6f44:5dd8:c956::148
    type: InternalIP
  - address: master-0.ocp-edge-cluster.qe.lab.redhat.com
    type: Hostname
  allocatable:
    cpu: 15500m
    ephemeral-storage: "49681111368"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 32319712Ki
    pods: "250"
  capacity:
    cpu: "16"
    ephemeral-storage: 52644Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 32934112Ki
    pods: "250"
  conditions:
  - lastHeartbeatTime: "2020-02-25T14:56:58Z"
    lastTransitionTime: "2020-02-25T14:59:36Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: MemoryPressure
  - lastHeartbeatTime: "2020-02-25T14:56:58Z"
    lastTransitionTime: "2020-02-25T14:59:36Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: DiskPressure
  - lastHeartbeatTime: "2020-02-25T14:56:58Z"
    lastTransitionTime: "2020-02-25T14:59:36Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: PIDPressure
  - lastHeartbeatTime: "2020-02-25T14:56:58Z"
    lastTransitionTime: "2020-02-25T14:59:36Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ac9818465cbe07f63efbd9d38647aaf6f2f8759ffec13903a5a97bcdfab3be4
    - <none>:<none>
    sizeBytes: 830103640
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6f2788587579df093bf990268390f7f469cce47e45a7f36b04b74aa00d1dd9e0
    - <none>:<none>
    sizeBytes: 727725367
  - names:
    - registry.svc.ci.openshift.org/ipv6/ovn-kubernetes@sha256:9bb0217b2dd42d2a963b97d2247832a87f289bd537ad3a154ddd5342edc4da6a
    - <none>:<none>
    sizeBytes: 648407045
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7732dea30c4b20a3df39fb18edfc3f1e78bb2addcabe49834491f19aa1d6c4a1
    - <none>:<none>
    sizeBytes: 474643035
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:41a180d934a95487f00aa2cc41f2e78b6d504e2fe39c18c66123c9e62c776953
    - <none>:<none>
    sizeBytes: 467784344
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:099032ec5bd9219642474a5859998c4f338221f843ff19823b1eebd58bb9ab5a
    - <none>:<none>
    sizeBytes: 423152423
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b57acafd25ff81623864415bf17741fe1147fcb54b71535f6608c3e8a1aaedc
    - <none>:<none>
    sizeBytes: 409924228
  - names:
    - registry.svc.ci.openshift.org/ipv6/machine-config-operator@sha256:01e1fb5bd114ec241f848467004fdfcea47b286717a09cfe434a50e71d675c02
    - <none>:<none>
    sizeBytes: 407437801
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:669493ce03a4c5d94cbae7ac5c2caeb79b8c1d4b4fefc4767c3e49641b3a8a6f
    - <none>:<none>
    sizeBytes: 372668432
  - names:
    - registry.svc.ci.openshift.org/ipv6/cluster-network-operator@sha256:ffd32018be544ebb681a9329487c107c0196aa8cbbce3965a10afb5a63e27b19
    - <none>:<none>
    sizeBytes: 350026677
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca521acc8a1411f8e4781806acfcd2040405045714dced5975e143569106fc88
    - <none>:<none>
    sizeBytes: 341241550
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:afdb21a7d5f1977518de93d30cb526e9d82c26f18f6d3b0e411ec94ae7c98d93
    - <none>:<none>
    sizeBytes: 332965897
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ecacd961eff8cb8fbed1bda112dd14539c4b0c1b48387e329fc3c9e74bf30239
    - <none>:<none>
    sizeBytes: 332439248
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e2d4e5ee1a0ebb9cf599c7f45d4ade4e5ee6b7afc9f2874987edd90f71df32a
    - <none>:<none>
    sizeBytes: 332304625
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a4375971d238f18ada1a996cd3f7709853383c2476f19b962b0c544e9d4e324b
    - <none>:<none>
    sizeBytes: 331306685
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3b661a253769515ebf095c7e3177d2221afd640190d992a49cdb04a1fc9fce12
    - <none>:<none>
    sizeBytes: 329723524
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a98ef6605fc2b7b20edbaac7d890045f374edb8a770d4c562eeebb874ccb9bb6
    - <none>:<none>
    sizeBytes: 317623077
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7014dfe5aab13a081a171299b2a412e2ac3205abf90db0380664bf6d3ff4e812
    - <none>:<none>
    sizeBytes: 315303502
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7b4725c6cd9a5acc63bdbc3d16f5cc195a2a34b456ad8d3b3b9fcd7346a9864
    - <none>:<none>
    sizeBytes: 315105804
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b0ae86a71b4fff62d36c86c105be65f740342165a6d560dde7f0e546e9edf4af
    - <none>:<none>
    sizeBytes: 311932447
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b0a7f2970c2f3d8f06ce0a743c72474f60581fbf3b4917821926418808ea5928
    - <none>:<none>
    sizeBytes: 311593655
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b0a7014c7e86c42f78239d2381d7ee32bd63b3a83a3958b42d09909c325effe4
    - <none>:<none>
    sizeBytes: 310761197
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bda83a90a05ed97033c6065ac35d14e0f0be24d0bcd33fd15221b7b5ba7966f5
    - <none>:<none>
    sizeBytes: 309229038
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:06625988a38a75d7abeba3f5d31f013c8189626444af4492a890f592eab76c10
    - <none>:<none>
    sizeBytes: 308272311
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:101df6bc3b6e171f80d305bdfef699ae34388d2997f993415aee144f08d15654
    - <none>:<none>
    sizeBytes: 305264173
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e32deb224aeed8135d9fc42b516cf07a2391a933e6438e0bac12bf4dfe76f8f8
    - <none>:<none>
    sizeBytes: 304852163
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:388e17432b72c0cb9586b387176ce1500517f2548fd5bba34588de0ee96b6c6d
    - <none>:<none>
    sizeBytes: 301807729
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:359f28880b4110cbd03cc8397540f83d4e7605a51c771117003f488bac569ab1
    - <none>:<none>
    sizeBytes: 300165346
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b908f3195060f494590ced95bce23ff027af730a93e5334d57ccfb28fe0f8f1
    - <none>:<none>
    sizeBytes: 299459222
  - names:
    - registry.svc.ci.openshift.org/ipv6/cluster-kube-apiserver-operator@sha256:c744f7b8c0a2086bdb45bb94ee491a2623d6e956831da99b710a67c26106a844
    - <none>:<none>
    sizeBytes: 298765295
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6fd131fcda17fc956dce1965204c259146151da8276817861adb5baee23d7577
    - <none>:<none>
    sizeBytes: 297132052
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:445b67bd0f586a1bff691cd019c461a8623bd70c8c4d00a5d723a417cf5f038e
    - <none>:<none>
    sizeBytes: 292584523
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c90cafc16f4050d736ed4c6c86f2469b429036a50e27ea3c3639409a246849a4
    - <none>:<none>
    sizeBytes: 292581591
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f61ed683406ed6cd2a3b3117e82c83211c7551d1e4280ff5ce7462f8d650e79a
    - <none>:<none>
    sizeBytes: 285919312
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5759e2c0ae8ce3193cdb408c4a2485866397d8623dfa05a4d1c20adcd48cf073
    - <none>:<none>
    sizeBytes: 279623253
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8109601903d3fb1072e5ee02a3b6c2dc6628c0fff26d69c924c9f7ce4c17e22d
    - <none>:<none>
    sizeBytes: 277849752
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca8762a0c0f0177629d8f6fbebbddcb3438ca8990d8ed993d8dc04574e4a974c
    - <none>:<none>
    sizeBytes: 271271819
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e559edba536c2247aa9286cfb2fa6520516df39f023546acc0b58dfd2d6ef627
    - <none>:<none>
    sizeBytes: 264215014
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0b2f4d3140f87845fab5a88c57b6cb0900774699ea30ba26270f25532abaa2fb
    - <none>:<none>
    sizeBytes: 258011246
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5827985410ea82e3f4f0808cc601e2ed834b7a800304e12079bced988b49dccd
    - <none>:<none>
    sizeBytes: 256980613
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:475f5ece24e33bfa38ac9b2678db33eae039ab10509e7544096b4c96aba0db86
    - <none>:<none>
    sizeBytes: 255944919
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ff152532ba80cce2351febf6925762017095c9bf26840dff208ff8e3e2ccafbe
    - <none>:<none>
    sizeBytes: 250795582
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83c963d56fe4738ec023ac4b740fa1417e30379e138f12ccff82bc193aa610d8
    - <none>:<none>
    sizeBytes: 238844227
  nodeInfo:
    architecture: amd64
    bootID: b7c77884-89ae-41a2-868d-38e2a01ef63e
    containerRuntimeVersion: cri-o://1.16.3-22.dev.rhaos4.3.git11c04e3.el8
    kernelVersion: 4.18.0-147.5.1.el8_1.x86_64
    kubeProxyVersion: v1.16.2
    kubeletVersion: v1.16.2
    machineID: cd6bf73f0e98411e84d875c27eccba68
    operatingSystem: linux
    osImage: Red Hat Enterprise Linux CoreOS 43.81.202002170853.0 (Ootpa)
    systemUUID: cd6bf73f-0e98-411e-84d8-75c27eccba68
```
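All four conditions above report status `Unknown` with reason `NodeStatusUnknown`, and the taints mark the node unreachable and unschedulable. As a triage aid, a one-pass listing of every node's Ready status and taint keys can be produced with something like the following (illustrative sketch only; it assumes `jq` is available on the provisioning host):

```bash
# Print each node's name, Ready condition status, and taint keys as TSV.
# Nodes in the state shown above appear with Ready=Unknown plus
# node.kubernetes.io/unschedulable and node.kubernetes.io/unreachable taints.
oc get nodes -o json | jq -r '
  .items[]
  | [ .metadata.name,
      (.status.conditions[] | select(.type == "Ready") | .status),
      ([ .spec.taints[]?.key ] | join(",")) ]
  | @tsv'
```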
```
[root@master-0 core]# systemctl status kubelet
Warning: The unit file, source configuration file or drop-ins of kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-default-env.conf, 20-nodenet.conf
   Active: active (running) since Tue 2020-02-25 14:59:20 UTC; 38min ago
  Process: 3370 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
  Process: 3368 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
 Main PID: 3372 (hyperkube)
    Tasks: 40 (limit: 26213)
   Memory: 210.9M
      CPU: 1min 19.154s
   CGroup: /system.slice/kubelet.service
           └─3372 /usr/bin/hyperkube kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/ru>

Feb 25 15:38:07 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: I0225 15:38:07.454790    3372 prober.go:129] Liveness probe for "openshift-kube-scheduler-localhost.localdomain_openshift-kube-scheduler(37a2869826ee72d5a8ee916>
Feb 25 15:38:07 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: E0225 15:38:07.530876    3372 kubelet.go:2275] node "localhost.localdomain" not found
Feb 25 15:38:07 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: E0225 15:38:07.631139    3372 kubelet.go:2275] node "localhost.localdomain" not found
Feb 25 15:38:07 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: E0225 15:38:07.731375    3372 kubelet.go:2275] node "localhost.localdomain" not found
Feb 25 15:38:07 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: E0225 15:38:07.831633    3372 kubelet.go:2275] node "localhost.localdomain" not found
Feb 25 15:38:07 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: E0225 15:38:07.931866    3372 kubelet.go:2275] node "localhost.localdomain" not found
Feb 25 15:38:08 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: E0225 15:38:08.032129    3372 kubelet.go:2275] node "localhost.localdomain" not found
Feb 25 15:38:08 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: I0225 15:38:08.088550    3372 httplog.go:90] GET /metrics/cadvisor: (42.690275ms) 200 [Prometheus/2.14.0 [fd2e:6f44:5dd8:c956::135]:49376]
Feb 25 15:38:08 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: I0225 15:38:08.107950    3372 prober.go:129] Readiness probe for "coredns-localhost.localdomain_openshift-kni-infra(50d1e0101b9a4c5cc8d8bc70799a0083):coredns" s>
Feb 25 15:38:08 master-0.ocp-edge-cluster.qe.lab.redhat.com hyperkube[3372]: E0225 15:38:08.132400    3372 kubelet.go:2275] node "localhost.localdomain" not found
```
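The repeated `node "localhost.localdomain" not found` errors show the kubelet trying to register under the wrong hostname. Given the root cause identified later in this bug (the NetworkManager dispatcher scripts losing their SELinux labels), a few checks one might run on the affected node are sketched below; the paths come from the findings further down in this bug, not from the original comment:

```bash
# Inspect what the node believes its hostname is, and whether the dispatcher
# scripts that set it are still executable under SELinux (they should carry
# NetworkManager_initrc_exec_t, not tmp_t).
hostnamectl status
cat /etc/mdns/hostname
ls -Z /etc/NetworkManager/dispatcher.d/
```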
Trying to restart kubelet I can see:

```
[root@master-0 core]# systemctl restart kubelet
Warning: The unit file, source configuration file or drop-ins of kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.
```

After restarting, kubelet still shows pending CSRs:

```
[kni@provisionhost-0 ~]$ export KUBECONFIG=clusterconfigs/auth/kubeconfig
[kni@provisionhost-0 ~]$ oc get csr
NAME        AGE     REQUESTOR                                                 CONDITION
csr-5dcqm   36m     system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Pending
csr-66wz2   20m     system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Pending
csr-ntqr8   51m     system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Pending
csr-r6rq8   5m49s   system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Pending
csr-zx69t   108s    system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Pending
[kni@provisionhost-0 ~]$ oc adm certificate approve csr-5dcqm
certificatesigningrequest.certificates.k8s.io/csr-5dcqm approved
[kni@provisionhost-0 ~]$ oc adm certificate approve csr-66wz2
certificatesigningrequest.certificates.k8s.io/csr-66wz2 approved
[kni@provisionhost-0 ~]$ oc adm certificate approve csr-ntqr8
certificatesigningrequest.certificates.k8s.io/csr-ntqr8 approved
[kni@provisionhost-0 ~]$ oc adm certificate approve csr-r6rq8
certificatesigningrequest.certificates.k8s.io/csr-r6rq8 approved
[kni@provisionhost-0 ~]$ oc adm certificate approve csr-zx69t
certificatesigningrequest.certificates.k8s.io/csr-zx69t approved
[kni@provisionhost-0 ~]$ oc get csr
NAME        AGE     REQUESTOR                                                 CONDITION
csr-5dcqm   36m     system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Approved,Issued
csr-66wz2   21m     system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Approved,Issued
csr-ntqr8   51m     system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Approved,Issued
csr-r6rq8   6m40s   system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Approved,Issued
csr-zx69t   2m39s   system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Approved,Issued
```

Checking the nodes we can see:

```
[kni@provisionhost-0 ~]$ oc get nodes
NAME                                          STATUS                     ROLES           AGE    VERSION
localhost.localdomain                         Ready                      master,worker   103s   v1.16.2
master-0.ocp-edge-cluster.qe.lab.redhat.com   NotReady                   master,worker   13h    v1.16.2
master-1.ocp-edge-cluster.qe.lab.redhat.com   Ready,SchedulingDisabled   master,worker   13h    v1.16.2
master-2.ocp-edge-cluster.qe.lab.redhat.com   Ready                      master,worker   13h    v1.16.2
```

There are two issues here:

1. master-0 changed its hostname to localhost.
2. master-1 went into SchedulingDisabled state.

It looks like the master mcp is updating:

```
[kni@provisionhost-0 ~]$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
master   rendered-master-bc96a37b957a32a299d54e386514c8e0   False     True       False      4              0                   1                     0
worker   rendered-worker-868410f8330337e8136d160eaa007384   True      False      False      0              0                   0                     0
```

It looks like it got stuck trying to update the node:
```
machineconfiguration.openshift.io/currentConfig: rendered-master-bc96a37b957a32a299d54e386514c8e0
machineconfiguration.openshift.io/desiredConfig: rendered-master-d69cc937b725ac36d82250e6b0c1096b
```
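A node whose currentConfig and desiredConfig annotations stay different, as above, is one the MCO never finished updating. A small sketch to print the pair for every node (go-template syntax only; nothing here beyond the annotations already shown):

```bash
# Show current vs. desired rendered config per node; a persistent mismatch
# means the MachineConfig rollout is stuck on that node.
oc get nodes -o go-template='{{range .items}}{{.metadata.name}}
  current: {{index .metadata.annotations "machineconfiguration.openshift.io/currentConfig"}}
  desired: {{index .metadata.annotations "machineconfiguration.openshift.io/desiredConfig"}}
{{end}}'
```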
NotReady,SchedulingDisabled generally indicates that the node died on reboot.

Looking at the instructions in https://access.redhat.com/documentation/en-us/openshift_container_platform/4.3/html-single/operators/index#olm-restricted-networks-operatorhub_olm-restricted-networks, I'm not really sure why it tried to apply an updated machineconfig, though.

Could you provide a must-gather? This might also be more related to the node than to the MCO.
I reproduced the issue: the master-0 node gets rebooted, and after the reboot it comes up with a bad hostname (localhost).

must-gather doesn't seem to work, as the cluster doesn't have access to quay.io. Is there any specific log that I can provide from the master nodes?
```
[kni@provisionhost-0 ~]$ oc adm must-gather
[must-gather ] OUT unable to resolve the imagestream tag openshift/must-gather:latest
[must-gather ] OUT
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather ] OUT namespace/openshift-must-gather-rfcbz created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-msgvj created
[must-gather ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
```
```
[kni@provisionhost-0 ~]$ oc get is -n openshift must-gather -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  creationTimestamp: "2020-02-26T03:19:59Z"
  generation: 2
  name: must-gather
  namespace: openshift
  resourceVersion: "12739"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/must-gather
  uid: c9fe41b8-5ae1-48af-b2e2-da1ee3112c30
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b8bc76d838a68c0ac2bd8f8c65cc21852085d44f1f1d07f57a39ecd496ce5706
    generation: 2
    importPolicy:
      scheduled: true
    name: latest
    referencePolicy:
      type: Source
status:
  dockerImageRepository: ""
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2020-02-26T03:19:59Z"
      message: 'Internal error occurred: [registry.ocp-edge-cluster.qe.lab.redhat.com:5000/localimages/local-release-image@sha256:b8bc76d838a68c0ac2bd8f8c65cc21852085d44f1f1d07f57a39ecd496ce5706:
        Get https://registry.ocp-edge-cluster.qe.lab.redhat.com:5000/v2/: x509: certificate
        signed by unknown authority, quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b8bc76d838a68c0ac2bd8f8c65cc21852085d44f1f1d07f57a39ecd496ce5706:
        Get https://quay.io/v2/: dial tcp 52.45.33.205:443: connect: network is unreachable]'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: latest
```
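The first failure in the import message (`x509: certificate signed by unknown authority`) indicates the mirror registry's CA is not trusted for image imports. A hedged sketch of the usual OCP 4 fix follows; the ConfigMap name and certificate path are placeholders, and note the `hostname..port` key syntax used for registries that include a port:

```bash
# Trust the mirror registry's CA for image imports and pulls.
# "mirror-registry-ca" and /path/to/registry-ca.crt are hypothetical names.
oc create configmap mirror-registry-ca \
  --from-file=registry.ocp-edge-cluster.qe.lab.redhat.com..5000=/path/to/registry-ca.crt \
  -n openshift-config
oc patch image.config.openshift.io/cluster --type merge \
  -p '{"spec":{"additionalTrustedCA":{"name":"mirror-registry-ca"}}}'
```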
When disconnected, must-gather can be run using the --image flag:

```
oc adm must-gather --image myregistry.example.com/must-gather
```

You might have to separately mirror the image used for must-gather if it has not already been picked up by the release mirror command.

The OLM disconnected workflow creates an ImageContentSourcePolicy; go ahead and attach the ICSP that was created.

Created attachment 1667327 [details]: ImageContentSourcePolicy
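The attachment itself is not reproduced in this report. For orientation only, an ImageContentSourcePolicy produced by the OLM disconnected workflow typically looks like the following; the source/mirror pair is a hypothetical stand-in using the mirror registry named elsewhere in this bug, and the real attachment 1667327 defines the actual mappings:

```yaml
# Hypothetical example; not the contents of attachment 1667327.
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: olm-mirror
spec:
  repositoryDigestMirrors:
  - mirrors:
    # Digest pulls for the source repo are redirected to the local mirror.
    - registry.ocp-edge-cluster.qe.lab.redhat.com:5000/olm-mirror/some-operator
    source: registry.redhat.io/some-operator
```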
I managed to reproduce the issue on 4.3.0-0.nightly-2020-03-01-194304.

The issue occurs after creating the ImageContentSourcePolicy (attached to this bug).

At this point `oc adm must-gather` doesn't return anything, as the pods are stuck in the Pending state:
```
[kni@provisionhost-0 ~]$ oc adm must-gather --image registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/must-gather
[must-gather ] OUT Using must-gather plugin-in image: registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/must-gather
[must-gather ] OUT namespace/openshift-must-gather-j6n72 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-x9mkn created
[must-gather ] OUT pod for plug-in image registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/must-gather created
[kni@provisionhost-0 ~]$ oc get pods -A | grep must-gather
openshift-must-gather-2882s   must-gather-dr8hx   0/1   Pending   0   15m
openshift-must-gather-j6n72   must-gather-hjmz7   0/1   Pending   0   5m5s
openshift-must-gather-wsfrj   must-gather-9vbs4   0/1   Pending   0   11m
[kni@provisionhost-0 ~]$ oc get nodes
NAME                                          STATUS                        ROLES    AGE     VERSION
master-0.ocp-edge-cluster.qe.lab.redhat.com   Ready                         master   4h1m    v1.16.2
master-1.ocp-edge-cluster.qe.lab.redhat.com   Ready                         master   4h1m    v1.16.2
master-2.ocp-edge-cluster.qe.lab.redhat.com   NotReady,SchedulingDisabled   master   4h1m    v1.16.2
worker-0.ocp-edge-cluster.qe.lab.redhat.com   NotReady,SchedulingDisabled   worker   3h45m   v1.16.2
worker-1.ocp-edge-cluster.qe.lab.redhat.com   Ready                         worker   3h44m   v1.16.2
```
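For context (an inference consistent with the reproduction above): the MCO translates an ImageContentSourcePolicy into an updated /etc/containers/registries.conf delivered through a new rendered MachineConfig, and each node is then cordoned, drained, and rebooted in turn; it is the post-reboot hostname regression that leaves nodes NotReady. A hedged way to watch that rollout and confirm the new rendered config carries the registries change:

```bash
# Watch the pools converge on the rendered config triggered by the ICSP,
# then check that the pool's target config actually touches registries.conf.
oc get mcp -w
oc get machineconfig \
  "$(oc get mcp master -o jsonpath='{.spec.configuration.name}')" -o yaml \
  | grep registries.conf
```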
This one is a blocker for the management pillar; we'll need to escalate. The issue is blocking installing operators from a disconnected registry, which is needed for the ACM disconnected deployment (ACM should be installed by an extra operator on the IPv6 OCP environment).

After digging a bit I found the following:

```
Mar 03 22:05:08 localhost NetworkManager[1947]: <warn>  [1583273108.4627] dispatcher: (1) /etc/NetworkManager/dispatcher.d/30-resolv-prepender failed (exec failed): Failed to execute child process “/etc/NetworkManager/dispatcher.d/30-resolv-prepender” (Permission denied)
Mar 03 22:05:08 localhost NetworkManager[1947]: <warn>  [1583273108.4628] dispatcher: (1) /etc/NetworkManager/dispatcher.d/40-mdns-hostname failed (exec failed): Failed to execute child process “/etc/NetworkManager/dispatcher.d/40-mdns-hostname” (Permission denied)
Mar 03 22:05:08 localhost NetworkManager[1947]: <warn>  [1583273108.4728] dispatcher: (2) /etc/NetworkManager/dispatcher.d/30-resolv-prepender failed (exec failed): Failed to execute child process “/etc/NetworkManager/dispatcher.d/30-resolv-prepender” (Permission denied)
Mar 03 22:05:08 localhost NetworkManager[1947]: <warn>  [1583273108.4728] dispatcher: (2) /etc/NetworkManager/dispatcher.d/40-mdns-hostname failed (exec failed): Failed to execute child process “/etc/NetworkManager/dispatcher.d/40-mdns-hostname” (Permission denied)
```

Given that /etc/mdns/hostname existed, and that it is only created by /etc/NetworkManager/dispatcher.d/40-mdns-hostname, some MCO operation after the initial boot must have broken the permissions. Looking at SELinux reveals:

```
-rwxr-xr-x. 1 root root system_u:object_r:NetworkManager_initrc_exec_t:s0  100 Mar  3 18:43 04-iscsi
-rwxr-xr-x. 1 root root system_u:object_r:NetworkManager_initrc_exec_t:s0 1062 Mar  3 18:43 11-dhclient
-rwxr-xr-x. 1 root root system_u:object_r:NetworkManager_initrc_exec_t:s0  428 Mar  3 18:43 20-chrony
-rwxr-xr-x. 1 root root system_u:object_r:tmp_t:s0                        1158 Mar  3 22:04 30-resolv-prepender
-rwxr-xr-x. 1 root root system_u:object_r:tmp_t:s0                         392 Mar  3 22:04 40-mdns-hostname
```

This points to the files being rendered to a location carrying the tmp_t label (probably /tmp). After fixing the labels and restarting, the masters and workers went back to Ready. The worker kept the NoSchedule taint, as the machineconfigpool reports that it is still updating and degraded:

```
[kni@provisionhost-0 ~]$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
master   rendered-master-b3566b5e3bda8866335cc9d5e34b723d   False     True       False      3              2                   2                     0
worker   rendered-worker-805a6087a31096e4ea468a8f81cc882d   False     True       True       2              0                   0                     1
```

Finally, the troubleshooting process highlighted that the mdns MCO template's verify-hostname functionality should also handle the case where the hostname is reported as localhost.localdomain, not just localhost.

*** Bug 1810632 has been marked as a duplicate of this bug. ***

Verified on 4.3.0-0.nightly-2020-03-09-172027.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0858
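For reference, a minimal sketch of the relabeling workaround described in the comments above, using standard SELinux tooling (the exact commands used during the debugging session are not recorded in this bug):

```bash
# Restore the default SELinux contexts on the mislabeled dispatcher scripts
# (tmp_t back to NetworkManager_initrc_exec_t per the file-context policy),
# then restart NetworkManager so the hostname dispatcher runs again.
restorecon -Rv /etc/NetworkManager/dispatcher.d/
systemctl restart NetworkManager
```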