Description of problem:

This bug was separated from bug 1647660. I also encountered the "ImagePullError" issue after running the cluster for a while.

Version-Release number of the following components:

[jzhang@dhcp-140-18 installer]$ openshift-install version
openshift-install v0.3.0-172-gfc8e1ec41927fdfe0a1878aa3bb41e916ceda0b1
Terraform v0.11.8

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster by following this doc: https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md
2. Run the cluster for a period of time.

Actual results:
Got the "ImagePullError" issue.

Expected results:
All of the pods running well.

Additional info:
PR https://github.com/openshift/installer/pull/606
Created attachment 1507481 [details] The kubelet logs running on the master node
I tried with the latest version:

[jzhang@dhcp-140-18 ~]$ openshift-install version
openshift-install v0.3.0-250-g30bb25ac57d7c7d3dae519186cbfca9af8aeaca2
Terraform v0.11.8
Your version of Terraform is out of date! The latest version is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

But I got the errors below after running the cluster for a while (about 10 hours).

[jzhang@dhcp-140-18 ~]$ oc get ns
The connection to the server jiazha-api.tt.testing:6443 was refused - did you specify the right host or port?

[jzhang@dhcp-140-18 ~]$ sudo virsh list --all
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 24    jiazha-master-0                running
 25    jiazha-worker-0-75mcl          running

So I accessed the master node and checked the kubelet logs:

[core@jiazha-master-0 ~]$ journalctl -f -u kubelet
...
Nov 21 02:57:58 jiazha-master-0 hyperkube[2791]: E1121 02:57:58.321164 2791 pod_workers.go:186] Error syncing pod f3a5a06d-ec7e-11e8-99bc-b62bec1f3377 ("openshift-cluster-kube-controller-manager-operator-5d478985868l_openshift-cluster-kube-controller-manager-operator(f3a5a06d-ec7e-11e8-99bc-b62bec1f3377)"), skipping: failed to "StartContainer" for "operator" with ImagePullBackOff: "Back-off pulling image \"registry.svc.ci.openshift.org/openshift/origin-v4.0-20181119160732@sha256:ef9dc3e2c8344dcab3e1b511c6631249e883f6cdb225e74be431064f1a89a801\""
Nov 21 02:57:58 jiazha-master-0 hyperkube[2791]: I1121 02:57:58.942336 2791 logs.go:49] http: TLS handshake error from 10.2.0.57:48868: remote error: tls: bad certificate

But I found the containers are running well:

[core@jiazha-master-0 ~]$ sudo crictl ps
CONTAINER ID   IMAGE   CREATED   STATE   NAME   ATTEMPT
34b9fede10290  registry.svc.ci.openshift.org/openshift/origin-release@sha256:1b9a5b3858d219afb508d94a0bc4501aa0d1cf3adcc0012d4950f3984b6fe668  4 hours ago  Running  cluster-version-operator  146
b423f93187e95  80f9ecff4526e8f67bd9e0b2f43e5dec6aa4de0fd193cd1687e5959b71164f06  14 hours ago  Running  machine-config-controller  2
2bd6fc810f8ec  03f4586ea1d23398ab35db0295aa09855841ca31f2fc43d412ef37854b107b42  14 hours ago  Running  machine-api-operator  2
7da351e5a371a  e3d07188c6f5a073ce503993011a0b82d45d377332ca800e41c6b7eb0596d774  14 hours ago  Running  machine-config-operator  2
1d943d6afc18a  71967eedafdbd90bd1d0ba8b3f9b20a522e0e8c03d9aa83cacda32d57dee0850  14 hours ago  Running  sdn-controller  3
d66bd5de99645  71967eedafdbd90bd1d0ba8b3f9b20a522e0e8c03d9aa83cacda32d57dee0850  14 hours ago  Running  controller-manager  3
b44dee1adddf0  09ec5ab8b5b442addc495c9f42e075ee06612501b6f00ad4fe8a51db6f69a311  14 hours ago  Running  apiservice-cabundle-injector-controller  3
90032a9c1e1ad  09ec5ab8b5b442addc495c9f42e075ee06612501b6f00ad4fe8a51db6f69a311  14 hours ago  Running  operator  3
171c25aef2592  09ec5ab8b5b442addc495c9f42e075ee06612501b6f00ad4fe8a51db6f69a311  14 hours ago  Running  configmap-cabundle-injector-controller  3
ef33deea19c6f  09ec5ab8b5b442addc495c9f42e075ee06612501b6f00ad4fe8a51db6f69a311  14 hours ago  Running  service-serving-cert-signer-controller  3
a26f9a0dd0b4f  registry.svc.ci.openshift.org/openshift/origin-v4.0-20181119160732@sha256:f062abd485e0431fe336cb610caf5a0808c7876aa40a0e82699f1fe420c011a4  17 hours ago  Running  console-operator  4
9172a59a04434  registry.svc.ci.openshift.org/openshift/origin-v4.0-20181119160732@sha256:e8896899ef1d40546336a7dd7ab0d6f943c7579844291b428be2d716278c79cd  17 hours ago  Running  cluster-samples-operator  4
ca0eb3145bb98  2b1a570c7c026bdb6639cde0f19331312912717732201db95bf97163de730da5  17 hours ago  Running  cluster-image-registry-operator  4
62c252c5d67a7  1a7aeae2858166cb7a9e1565c9900289d206745a87e61bee8c4085623010e8e7  17 hours ago  Running  machine-config-daemon  2
011fc772b50bc  3129a2ca29d75226dc5657a4629cdd5f38accda7f5b75bc5d5a76f5b4e0e5870  17 hours ago  Running  rules-configmap-reloader  1
aeeaa0b3c5e43  c4aeb8465da88add5b6c839b8d8669737704c639b89d69b74118088ee8c66028  17 hours ago  Running  prom-label-proxy  1
d732c2ccdfb6b  90c45954eb03e399ea5303e516911180dc74a62392e2981ebffe2f26b7321b78  17 hours ago  Running  alertmanager-proxy  1
32f528f8dbb26  8323481d90856e70f1d68024d18cfc0b9140494612cb869474719d96750d3f12  17 hours ago  Running  kube-rbac-proxy  1
f1e1bbad8b1a6  03f4586ea1d23398ab35db0295aa09855841ca31f2fc43d412ef37854b107b42  17 hours ago  Running  nodelink-controller  1
2fe4cc48ab452  90c45954eb03e399ea5303e516911180dc74a62392e2981ebffe2f26b7321b78  17 hours ago  Running  alertmanager-proxy  1
877bd71a4408b  registry.svc.ci.openshift.org/openshift/origin-v4.0-20181119160732@sha256:42ee4b8a4cdf53f34fd2a204d9e577f8a9a6ae71b6f5f8bd378cd4840a80325b  17 hours ago  Running  cluster-node-tuning-operator  1
aa2a0e45b5ac3  90c45954eb03e399ea5303e516911180dc74a62392e2981ebffe2f26b7321b78  17 hours ago  Running  prometheus-proxy  1
1a722d67f2e3b  543de23dc5f295d3c62af6df944d72507de86dde6417454690129fa25b834db1  17 hours ago  Running  olm-operator  2
73339b64f0e67  3129a2ca29d75226dc5657a4629cdd5f38accda7f5b75bc5d5a76f5b4e0e5870  17 hours ago  Running  config-reloader  1
fc432282ffc46  d3a133472950a2a78482aed3acb5dc17793fdb877a4be3f9e580f194fbf89c9d  17 hours ago  Running  machine-controller  1
81c1deeb3bcbf  3129a2ca29d75226dc5657a4629cdd5f38accda7f5b75bc5d5a76f5b4e0e5870  17 hours ago  Running  config-reloader  1
3c465388279ff  5c9c3b993bd397f291ded82b5368daae78c878eba48f978263a4eb1ca9fe277f  17 hours ago  Running  prometheus-config-reloader  1
b2a391a89ba05  d3a133472950a2a78482aed3acb5dc17793fdb877a4be3f9e580f194fbf89c9d  17 hours ago  Running  controller-manager  1
d8ef6c8cf1a9b  68bbd0006378473c03fd9b8880246855a930558ac063a679aa3848a7a5ce8f9c  17 hours ago  Running  alertmanager  1
bc83f070e18d5  68bbd0006378473c03fd9b8880246855a930558ac063a679aa3848a7a5ce8f9c  17 hours ago  Running  alertmanager  1
dfe5a752e33fc  8323481d90856e70f1d68024d18cfc0b9140494612cb869474719d96750d3f12  17 hours ago  Running  kube-rbac-proxy  1
2cfe3c56dbdbf  79c48450571090935cc175ddfe8473ab88b807bc9276bc35a25af9caaaccd6c0  17 hours ago  Running  csi-operator  1
10b5cc4ae6afa  59c3f2318e23ae3219e1e9fb5d939ea559c804bc84ef84ba16a1993fb0cff5b1  17 hours ago  Running  prometheus  2
691eb717c57ab  96b3a943b3a6b542fead7b23c39da2d2532d9603c98ed56f057f43814a19cb74  17 hours ago  Running  console  2
83a449c96c670  b7f6d322a704f7191ea96105570e921b9ad3d4b421ce56640deaaa2f81338f97  17 hours ago  Running  ingress-operator  1
f471ecb2c8311  f9f775bf6d0efcbfa64f5b54c198592f710852a14f9bb149913d2c8f1e0f39be  17 hours ago  Running  node-exporter  1
59dd47936dc64  543de23dc5f295d3c62af6df944d72507de86dde6417454690129fa25b834db1  17 hours ago  Running  catalog-operator  1
d11c55261d109  c9aa1d32a4af2318171a361fab2603af2cb81a1bff9aef38bbe1eb10631082f7  17 hours ago  Running  cluster-dns-operator  1
efeb044e7f415  40b18c4b379ed68f6647fb0ca33f4525c12d4b7272ea1a4cf2b1d4e850214874  17 hours ago  Running  dns  1
9afda236d8c37  d8d79552d3773ff240610d7c64f19c4756a96fbaaea5392ec5c36464a7c6d563  17 hours ago  Running  machine-config-server  1
364ae6565f7b6  64a26eee6ca57e3b57aaa35d979f127f384156cf7ad4ffaf923c4b1314ebcc21  17 hours ago  Running  cluster-network-operator  1
1cee0efc3fd0e  786b620c80bb515d469dbd4872881191914062cb1dbbb03ccb98f39138db5fd5  17 hours ago  Running  openvswitch  1
2e7ede4ce3de7  786b620c80bb515d469dbd4872881191914062cb1dbbb03ccb98f39138db5fd5  17 hours ago  Running  sdn  1
47ff592339a32  94bc3af972c98ce73f99d70bd72144caa8b63e541ccc9d844960b7f0ca77d7c4  17 hours ago  Running  etcd-member  1

I attached the kubelet logs for your checking.

The CVO logs:

[core@jiazha-master-0 ~]$ sudo crictl logs 34b9fede10290
...
E1121 03:08:00.531920 1 leaderelection.go:234] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
I1121 03:08:00.531985 1 leaderelection.go:190] failed to acquire lease openshift-cluster-version/version

The logs of the machine-api-operator:

[core@jiazha-master-0 ~]$ sudo crictl logs 2bd6fc810f8ec
E1121 03:10:31.985214 1 leaderelection.go:234] error retrieving resource lock openshift-cluster-api/machine-api-operator: Get https://10.3.0.1:443/api/v1/namespaces/openshift-cluster-api/configmaps/machine-api-operator: dial tcp 10.3.0.1:443: connect: connection refused
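Both operator failures above are "connection refused" against apiserver endpoints (127.0.0.1:6443 and the 10.3.0.1:443 service IP), so a quick next check is whether the kube-apiserver on the master is serving at all. This is a hypothetical diagnostic sketch, not part of the original report; the port and hostnames are taken from the logs above.

```shell
# Run on the master node. First see whether the local apiserver answers,
# then check whether its container is still up (or has exited/restarted).
curl -k https://127.0.0.1:6443/healthz || echo "local apiserver not responding"
sudo crictl ps -a | grep -i apiserver
```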
Can you try again with the 0.4.0 installer? I recall seeing some fixes related to this failure go into some of the other operators.
Alex, yes, the cluster works well after running for a day. Can you help change the Status to "ON_QA" so that I can verify it?

[jzhang@dhcp-140-18 installer]$ openshift-install version
openshift-install v0.4.0-32-g7e7c26fbf41e1e73a0219d48d04881f9aa1318a9-dirty
Terraform v0.11.8
Your version of Terraform is out of date! The latest version is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

[core@jian-master-0 ~]$ oc get ns
NAME                                        STATUS    AGE
default                                     Active    1d
kube-public                                 Active    1d
kube-system                                 Active    1d
openshift                                   Active    1d
openshift-apiserver                         Active    1d
openshift-cluster-api                       Active    1d
openshift-cluster-dns                       Active    1d
openshift-cluster-dns-operator              Active    1d
openshift-cluster-kube-apiserver-operator   Active    1d
...

[core@jian-master-0 ~]$ oc get nodes
NAME                  STATUS    ROLES     AGE       VERSION
jian-master-0         Ready     master    1d        v1.11.0+d4cacc0
jian-worker-0-hgqvt   Ready     worker    1d        v1.11.0+d4cacc0
Alex, I hit the issue several times where the master API could not be accessed. So I changed the vcpu/memory of the master node:

[jzhang@dhcp-140-18 installer]$ vim data/data/libvirt/main.tf
...
resource "libvirt_domain" "master" {
  count  = "${var.tectonic_master_count}"
  name   = "${var.tectonic_cluster_name}-master-${count.index}"
  memory = "6144"
  vcpu   = "4"

It seems to work well for days. One more question: how do I change the vcpu/memory of the worker node?
You can change the vcpu/memory of workers by editing the corresponding MachineSet custom resource in the openshift-cluster-api namespace. The last time I did that, I had to disable validation (--validate=false).
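As a sketch, the edit described above might look like this (the namespace and MachineSet name are taken from this report, and whether your `oc` build requires `--validate=false` for this may vary):

```shell
# Edit the worker MachineSet, skipping client-side validation as suggested above.
oc -n openshift-cluster-api edit machineset jian-worker-0 --validate=false
# In the editor, change domainVcpu / domainMemory under
# spec.template.spec.providerConfig.value, then save and quit.
```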
Alex, thanks! I'm sorry, but I could not find the "--validate" field. Am I missing something?

[core@jian-master-0 ~]$ oc get machineset jian-worker-0 -o yaml
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineSet
metadata:
  creationTimestamp: 2018-11-28T09:49:31Z
  generation: 3
  labels:
    sigs.k8s.io/cluster-api-cluster: jian
    sigs.k8s.io/cluster-api-machine-role: worker
    sigs.k8s.io/cluster-api-machine-type: worker
  name: jian-worker-0
  namespace: openshift-cluster-api
  resourceVersion: "8527936"
  selfLink: /apis/cluster.k8s.io/v1alpha1/namespaces/openshift-cluster-api/machinesets/jian-worker-0
  uid: e780f96e-f2f2-11e8-b24e-92c2a8062742
spec:
  replicas: 1
  selector:
    matchLabels:
      sigs.k8s.io/cluster-api-cluster: jian
      sigs.k8s.io/cluster-api-machineset: jian-worker-0
  template:
    metadata:
      creationTimestamp: null
      labels:
        sigs.k8s.io/cluster-api-cluster: jian
        sigs.k8s.io/cluster-api-machine-role: worker
        sigs.k8s.io/cluster-api-machine-type: worker
        sigs.k8s.io/cluster-api-machineset: jian-worker-0
    spec:
      metadata:
        creationTimestamp: null
      providerConfig:
        value:
          apiVersion: libvirtproviderconfig.k8s.io/v1alpha1
          autostart: false
          cloudInit: null
          domainMemory: 2048
          domainVcpu: 2
          ignKey: ""
          ignition:
            userDataSecret: worker-user-data
          kind: LibvirtMachineProviderConfig
          networkInterfaceAddress: 192.168.126.0/24
          networkInterfaceHostname: ""
          networkInterfaceName: jian
          networkUUID: ""
          uri: qemu+tcp://192.168.122.1/system
          volume:
            baseVolumeID: /var/lib/libvirt/images/jian-base
            poolName: default
            volumeName: ""
      versions:
        kubelet: ""
...

For this bug: LGTM, verifying it. For example, the OLM has been running for 4 days without errors.

[core@jian-master-0 ~]$ oc get pods
NAME                                READY     STATUS    RESTARTS   AGE
catalog-operator-6845f75dbc-nbl4z   1/1       Running   0          4d
olm-operator-796dc97869-ddgq5       1/1       Running   0          4d
package-server-65db98fcc5-9gbsk     1/1       Running   0          4d

[core@jian-master-0 ~]$ oc get nodes
NAME                  STATUS    ROLES     AGE       VERSION
jian-master-0         Ready     master    5d        v1.11.0+d4cacc0
jian-worker-0-hgqvt   Ready     worker    5d        v1.11.0+d4cacc0
That flag (--validate) is for `oc edit`. If you try to change the domainMemory or domainVcpu fields without that flag, oc will refuse to commit the change. At least, that was the case when I last tested it.
Alex, I can edit it successfully, changing it from `2048` to `3072`:

[core@jian-master-0 ~]$ oc edit machineset jian-worker-0
machineset.cluster.k8s.io/jian-worker-0 edited
[core@jian-master-0 ~]$ oc get machineset jian-worker-0 -o yaml | grep domain
          domainMemory: 3072
          domainVcpu: 2

But it seems the change did NOT take effect; the memory of the worker is still `2048`. Am I missing something?

[core@jian-worker-0-2t97v ~]$ cat /proc/meminfo | grep Mem
MemTotal:        2046932 kB
MemFree:          248624 kB
MemAvailable:     523640 kB
The Machine API Operator doesn't reconcile the attributes of the machines. You'll need to destroy the machine (you can scale down and back up) in order to pick up the new settings. Sorry, I should have mentioned that before.
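The scale-down/scale-up mentioned above could be done along these lines (a sketch; the namespace and MachineSet name are taken from this report, and you may need to wait for the old machine to be fully deleted before scaling back up):

```shell
# Destroy the existing worker machine so a new one is created with the
# updated domainMemory/domainVcpu from the edited MachineSet.
oc -n openshift-cluster-api scale machineset jian-worker-0 --replicas=0
# After the old machine is gone:
oc -n openshift-cluster-api scale machineset jian-worker-0 --replicas=1
```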