Bug 1648270 - Got "ImagePullError" after running the cluster for a while.
Summary: Got "ImagePullError" after running the cluster for a while.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Alex Crawford
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-09 10:02 UTC by Jian Zhang
Modified: 2019-03-12 14:26 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-21 16:55:48 UTC
Target Upstream Version:
Embargoed:


Attachments
The kubelet logs running on the master node (23.60 KB, text/plain)
2018-11-21 03:04 UTC, Jian Zhang

Description Jian Zhang 2018-11-09 10:02:40 UTC
Description of problem:
This bug was split out from bug 1647660.
I also encountered the "ImagePullError" issue after the cluster had been running for a while.

Version-Release number of the following components:
[jzhang@dhcp-140-18 installer]$ openshift-install version
openshift-install v0.3.0-172-gfc8e1ec41927fdfe0a1878aa3bb41e916ceda0b1
Terraform v0.11.8

How reproducible:
always

Steps to Reproduce:
1. Create a cluster by following this doc: https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md
2. Run the cluster for a period of time.


Actual results:
Got an "ImagePullError".

Expected results:
All pods run without errors.

Additional info:
PR https://github.com/openshift/installer/pull/606

Comment 1 Jian Zhang 2018-11-21 03:04:54 UTC
Created attachment 1507481 [details]
The kubelet logs running on the master node

Comment 2 Jian Zhang 2018-11-21 03:12:42 UTC
I tried again with the latest version:
[jzhang@dhcp-140-18 ~]$ openshift-install version
openshift-install v0.3.0-250-g30bb25ac57d7c7d3dae519186cbfca9af8aeaca2
Terraform v0.11.8

Your version of Terraform is out of date! The latest version
is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

But I got the errors below after running the cluster for a while (about 10 hours).
[jzhang@dhcp-140-18 ~]$ oc get ns
The connection to the server jiazha-api.tt.testing:6443 was refused - did you specify the right host or port?

[jzhang@dhcp-140-18 ~]$ sudo virsh list --all
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 24    jiazha-master-0                running
 25    jiazha-worker-0-75mcl          running

So I accessed the master node and checked the kubelet logs:
[core@jiazha-master-0 ~]$ journalctl -f -u kubelet
...
Nov 21 02:57:58 jiazha-master-0 hyperkube[2791]: E1121 02:57:58.321164    2791 pod_workers.go:186] Error syncing pod f3a5a06d-ec7e-11e8-99bc-b62bec1f3377 ("openshift-cluster-kube-controller-manager-operator-5d478985868l_openshift-cluster-kube-controller-manager-operator(f3a5a06d-ec7e-11e8-99bc-b62bec1f3377)"), skipping: failed to "StartContainer" for "operator" with ImagePullBackOff: "Back-off pulling image \"registry.svc.ci.openshift.org/openshift/origin-v4.0-20181119160732@sha256:ef9dc3e2c8344dcab3e1b511c6631249e883f6cdb225e74be431064f1a89a801\""
Nov 21 02:57:58 jiazha-master-0 hyperkube[2791]: I1121 02:57:58.942336    2791 logs.go:49] http: TLS handshake error from 10.2.0.57:48868: remote error: tls: bad certificate

But I found that the containers are running well:

[core@jiazha-master-0 ~]$ sudo crictl ps
CONTAINER ID        IMAGE                                                                                                                                        CREATED             STATE               NAME                                      ATTEMPT
34b9fede10290       registry.svc.ci.openshift.org/openshift/origin-release@sha256:1b9a5b3858d219afb508d94a0bc4501aa0d1cf3adcc0012d4950f3984b6fe668               4 hours ago         Running             cluster-version-operator                  146
b423f93187e95       80f9ecff4526e8f67bd9e0b2f43e5dec6aa4de0fd193cd1687e5959b71164f06                                                                             14 hours ago        Running             machine-config-controller                 2
2bd6fc810f8ec       03f4586ea1d23398ab35db0295aa09855841ca31f2fc43d412ef37854b107b42                                                                             14 hours ago        Running             machine-api-operator                      2
7da351e5a371a       e3d07188c6f5a073ce503993011a0b82d45d377332ca800e41c6b7eb0596d774                                                                             14 hours ago        Running             machine-config-operator                   2
1d943d6afc18a       71967eedafdbd90bd1d0ba8b3f9b20a522e0e8c03d9aa83cacda32d57dee0850                                                                             14 hours ago        Running             sdn-controller                            3
d66bd5de99645       71967eedafdbd90bd1d0ba8b3f9b20a522e0e8c03d9aa83cacda32d57dee0850                                                                             14 hours ago        Running             controller-manager                        3
b44dee1adddf0       09ec5ab8b5b442addc495c9f42e075ee06612501b6f00ad4fe8a51db6f69a311                                                                             14 hours ago        Running             apiservice-cabundle-injector-controller   3
90032a9c1e1ad       09ec5ab8b5b442addc495c9f42e075ee06612501b6f00ad4fe8a51db6f69a311                                                                             14 hours ago        Running             operator                                  3
171c25aef2592       09ec5ab8b5b442addc495c9f42e075ee06612501b6f00ad4fe8a51db6f69a311                                                                             14 hours ago        Running             configmap-cabundle-injector-controller    3
ef33deea19c6f       09ec5ab8b5b442addc495c9f42e075ee06612501b6f00ad4fe8a51db6f69a311                                                                             14 hours ago        Running             service-serving-cert-signer-controller    3
a26f9a0dd0b4f       registry.svc.ci.openshift.org/openshift/origin-v4.0-20181119160732@sha256:f062abd485e0431fe336cb610caf5a0808c7876aa40a0e82699f1fe420c011a4   17 hours ago        Running             console-operator                          4
9172a59a04434       registry.svc.ci.openshift.org/openshift/origin-v4.0-20181119160732@sha256:e8896899ef1d40546336a7dd7ab0d6f943c7579844291b428be2d716278c79cd   17 hours ago        Running             cluster-samples-operator                  4
ca0eb3145bb98       2b1a570c7c026bdb6639cde0f19331312912717732201db95bf97163de730da5                                                                             17 hours ago        Running             cluster-image-registry-operator           4
62c252c5d67a7       1a7aeae2858166cb7a9e1565c9900289d206745a87e61bee8c4085623010e8e7                                                                             17 hours ago        Running             machine-config-daemon                     2
011fc772b50bc       3129a2ca29d75226dc5657a4629cdd5f38accda7f5b75bc5d5a76f5b4e0e5870                                                                             17 hours ago        Running             rules-configmap-reloader                  1
aeeaa0b3c5e43       c4aeb8465da88add5b6c839b8d8669737704c639b89d69b74118088ee8c66028                                                                             17 hours ago        Running             prom-label-proxy                          1
d732c2ccdfb6b       90c45954eb03e399ea5303e516911180dc74a62392e2981ebffe2f26b7321b78                                                                             17 hours ago        Running             alertmanager-proxy                        1
32f528f8dbb26       8323481d90856e70f1d68024d18cfc0b9140494612cb869474719d96750d3f12                                                                             17 hours ago        Running             kube-rbac-proxy                           1
f1e1bbad8b1a6       03f4586ea1d23398ab35db0295aa09855841ca31f2fc43d412ef37854b107b42                                                                             17 hours ago        Running             nodelink-controller                       1
2fe4cc48ab452       90c45954eb03e399ea5303e516911180dc74a62392e2981ebffe2f26b7321b78                                                                             17 hours ago        Running             alertmanager-proxy                        1
877bd71a4408b       registry.svc.ci.openshift.org/openshift/origin-v4.0-20181119160732@sha256:42ee4b8a4cdf53f34fd2a204d9e577f8a9a6ae71b6f5f8bd378cd4840a80325b   17 hours ago        Running             cluster-node-tuning-operator              1
aa2a0e45b5ac3       90c45954eb03e399ea5303e516911180dc74a62392e2981ebffe2f26b7321b78                                                                             17 hours ago        Running             prometheus-proxy                          1
1a722d67f2e3b       543de23dc5f295d3c62af6df944d72507de86dde6417454690129fa25b834db1                                                                             17 hours ago        Running             olm-operator                              2
73339b64f0e67       3129a2ca29d75226dc5657a4629cdd5f38accda7f5b75bc5d5a76f5b4e0e5870                                                                             17 hours ago        Running             config-reloader                           1
fc432282ffc46       d3a133472950a2a78482aed3acb5dc17793fdb877a4be3f9e580f194fbf89c9d                                                                             17 hours ago        Running             machine-controller                        1
81c1deeb3bcbf       3129a2ca29d75226dc5657a4629cdd5f38accda7f5b75bc5d5a76f5b4e0e5870                                                                             17 hours ago        Running             config-reloader                           1
3c465388279ff       5c9c3b993bd397f291ded82b5368daae78c878eba48f978263a4eb1ca9fe277f                                                                             17 hours ago        Running             prometheus-config-reloader                1
b2a391a89ba05       d3a133472950a2a78482aed3acb5dc17793fdb877a4be3f9e580f194fbf89c9d                                                                             17 hours ago        Running             controller-manager                        1
d8ef6c8cf1a9b       68bbd0006378473c03fd9b8880246855a930558ac063a679aa3848a7a5ce8f9c                                                                             17 hours ago        Running             alertmanager                              1
bc83f070e18d5       68bbd0006378473c03fd9b8880246855a930558ac063a679aa3848a7a5ce8f9c                                                                             17 hours ago        Running             alertmanager                              1
dfe5a752e33fc       8323481d90856e70f1d68024d18cfc0b9140494612cb869474719d96750d3f12                                                                             17 hours ago        Running             kube-rbac-proxy                           1
2cfe3c56dbdbf       79c48450571090935cc175ddfe8473ab88b807bc9276bc35a25af9caaaccd6c0                                                                             17 hours ago        Running             csi-operator                              1
10b5cc4ae6afa       59c3f2318e23ae3219e1e9fb5d939ea559c804bc84ef84ba16a1993fb0cff5b1                                                                             17 hours ago        Running             prometheus                                2
691eb717c57ab       96b3a943b3a6b542fead7b23c39da2d2532d9603c98ed56f057f43814a19cb74                                                                             17 hours ago        Running             console                                   2
83a449c96c670       b7f6d322a704f7191ea96105570e921b9ad3d4b421ce56640deaaa2f81338f97                                                                             17 hours ago        Running             ingress-operator                          1
f471ecb2c8311       f9f775bf6d0efcbfa64f5b54c198592f710852a14f9bb149913d2c8f1e0f39be                                                                             17 hours ago        Running             node-exporter                             1
59dd47936dc64       543de23dc5f295d3c62af6df944d72507de86dde6417454690129fa25b834db1                                                                             17 hours ago        Running             catalog-operator                          1
d11c55261d109       c9aa1d32a4af2318171a361fab2603af2cb81a1bff9aef38bbe1eb10631082f7                                                                             17 hours ago        Running             cluster-dns-operator                      1
efeb044e7f415       40b18c4b379ed68f6647fb0ca33f4525c12d4b7272ea1a4cf2b1d4e850214874                                                                             17 hours ago        Running             dns                                       1
9afda236d8c37       d8d79552d3773ff240610d7c64f19c4756a96fbaaea5392ec5c36464a7c6d563                                                                             17 hours ago        Running             machine-config-server                     1
364ae6565f7b6       64a26eee6ca57e3b57aaa35d979f127f384156cf7ad4ffaf923c4b1314ebcc21                                                                             17 hours ago        Running             cluster-network-operator                  1
1cee0efc3fd0e       786b620c80bb515d469dbd4872881191914062cb1dbbb03ccb98f39138db5fd5                                                                             17 hours ago        Running             openvswitch                               1
2e7ede4ce3de7       786b620c80bb515d469dbd4872881191914062cb1dbbb03ccb98f39138db5fd5                                                                             17 hours ago        Running             sdn                                       1
47ff592339a32       94bc3af972c98ce73f99d70bd72144caa8b63e541ccc9d844960b7f0ca77d7c4                                                                             17 hours ago        Running             etcd-member                               1

I attached the kubelet logs for reference.

The CVO logs:
[core@jiazha-master-0 ~]$ sudo crictl logs 34b9fede10290
...
E1121 03:08:00.531920       1 leaderelection.go:234] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
I1121 03:08:00.531985       1 leaderelection.go:190] failed to acquire lease openshift-cluster-version/version

The logs of machine-api-operator:
[core@jiazha-master-0 ~]$ sudo crictl logs 2bd6fc810f8ec
E1121 03:10:31.985214       1 leaderelection.go:234] error retrieving resource lock openshift-cluster-api/machine-api-operator: Get https://10.3.0.1:443/api/v1/namespaces/openshift-cluster-api/configmaps/machine-api-operator: dial tcp 10.3.0.1:443: connect: connection refused

Comment 3 Alex Crawford 2018-11-28 21:55:36 UTC
Can you try again with the 0.4.0 installer? I recall seeing some fixes related to this failure go into some of the other operators.

Comment 4 Jian Zhang 2018-11-30 01:49:01 UTC
Alex,

Yes, the cluster works well after running for a day. Can you change the status to "ON_QA" so that I can verify it?

[jzhang@dhcp-140-18 installer]$ openshift-install version
openshift-install v0.4.0-32-g7e7c26fbf41e1e73a0219d48d04881f9aa1318a9-dirty
Terraform v0.11.8

Your version of Terraform is out of date! The latest version
is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

[core@jian-master-0 ~]$ oc get ns
NAME                                                      STATUS    AGE
default                                                   Active    1d
kube-public                                               Active    1d
kube-system                                               Active    1d
openshift                                                 Active    1d
openshift-apiserver                                       Active    1d
openshift-cluster-api                                     Active    1d
openshift-cluster-dns                                     Active    1d
openshift-cluster-dns-operator                            Active    1d
openshift-cluster-kube-apiserver-operator                 Active    1d
...

[core@jian-master-0 ~]$ oc get nodes
NAME                  STATUS    ROLES     AGE       VERSION
jian-master-0         Ready     master    1d        v1.11.0+d4cacc0
jian-worker-0-hgqvt   Ready     worker    1d        v1.11.0+d4cacc0

Comment 5 Jian Zhang 2018-11-30 03:04:30 UTC
Alex,

I hit the issue of the master API becoming inaccessible several times.
So I changed the vCPU/memory of the master node:

[jzhang@dhcp-140-18 installer]$ vim data/data/libvirt/main.tf
...
resource "libvirt_domain" "master" {
  count = "${var.tectonic_master_count}"

  name = "${var.tectonic_cluster_name}-master-${count.index}"

  memory = "6144"
  vcpu   = "4"

It seems to work well for days. One more question: how do I change the vCPU/memory of the worker node?

Comment 6 Alex Crawford 2018-12-03 21:47:12 UTC
You can change the vcpu/memory of workers by editing the corresponding MachineSet custom resource in the openshift-cluster-api namespace. The last time I did that, I had to disable validation (--validate=false).
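
For illustration, the edit described above could look like the following. This is a hedged sketch, not a verified procedure: the MachineSet name and namespace are taken from elsewhere in this bug, and the exact behavior of `--validate` may differ between oc versions.

```shell
# Open the worker MachineSet for editing, skipping client-side schema
# validation, which may otherwise reject the libvirt providerConfig fields.
oc edit machineset jian-worker-0 -n openshift-cluster-api --validate=false

# In the editor, adjust the fields under
# spec.template.spec.providerConfig.value, e.g.:
#   domainMemory: 3072
#   domainVcpu: 2
```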

Comment 7 Jian Zhang 2018-12-04 02:23:54 UTC
Alex,

Thanks! I'm sorry, but I could not find the "--validate" field. Am I missing something?
[core@jian-master-0 ~]$ oc get machineset jian-worker-0 -o yaml
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineSet
metadata:
  creationTimestamp: 2018-11-28T09:49:31Z
  generation: 3
  labels:
    sigs.k8s.io/cluster-api-cluster: jian
    sigs.k8s.io/cluster-api-machine-role: worker
    sigs.k8s.io/cluster-api-machine-type: worker
  name: jian-worker-0
  namespace: openshift-cluster-api
  resourceVersion: "8527936"
  selfLink: /apis/cluster.k8s.io/v1alpha1/namespaces/openshift-cluster-api/machinesets/jian-worker-0
  uid: e780f96e-f2f2-11e8-b24e-92c2a8062742
spec:
  replicas: 1
  selector:
    matchLabels:
      sigs.k8s.io/cluster-api-cluster: jian
      sigs.k8s.io/cluster-api-machineset: jian-worker-0
  template:
    metadata:
      creationTimestamp: null
      labels:
        sigs.k8s.io/cluster-api-cluster: jian
        sigs.k8s.io/cluster-api-machine-role: worker
        sigs.k8s.io/cluster-api-machine-type: worker
        sigs.k8s.io/cluster-api-machineset: jian-worker-0
    spec:
      metadata:
        creationTimestamp: null
      providerConfig:
        value:
          apiVersion: libvirtproviderconfig.k8s.io/v1alpha1
          autostart: false
          cloudInit: null
          domainMemory: 2048
          domainVcpu: 2
          ignKey: ""
          ignition:
            userDataSecret: worker-user-data
          kind: LibvirtMachineProviderConfig
          networkInterfaceAddress: 192.168.126.0/24
          networkInterfaceHostname: ""
          networkInterfaceName: jian
          networkUUID: ""
          uri: qemu+tcp://192.168.122.1/system
          volume:
            baseVolumeID: /var/lib/libvirt/images/jian-base
            poolName: default
            volumeName: ""
      versions:
        kubelet: ""

...

For this bug, LGTM; verifying it.
For example, OLM has been running for 4 days without errors.
[core@jian-master-0 ~]$ oc get pods
NAME                                READY     STATUS    RESTARTS   AGE
catalog-operator-6845f75dbc-nbl4z   1/1       Running   0          4d
olm-operator-796dc97869-ddgq5       1/1       Running   0          4d
package-server-65db98fcc5-9gbsk     1/1       Running   0          4d
[core@jian-master-0 ~]$ oc get nodes
NAME                  STATUS    ROLES     AGE       VERSION
jian-master-0         Ready     master    5d        v1.11.0+d4cacc0
jian-worker-0-hgqvt   Ready     worker    5d        v1.11.0+d4cacc0

Comment 8 Alex Crawford 2018-12-05 17:47:36 UTC
That flag (--validate) is for `oc edit`. If you try to change the domainMemory or domainVcpu fields without it, oc will refuse to commit the change. At least, that was the case when I last tested it.

Comment 9 Jian Zhang 2018-12-07 01:28:03 UTC
Alex,

I can edit it successfully, changing it from `2048` to `3072`.
[core@jian-master-0 ~]$ oc edit machineset jian-worker-0 
machineset.cluster.k8s.io/jian-worker-0 edited

[core@jian-master-0 ~]$ oc get machineset jian-worker-0 -o yaml |grep domain
          domainMemory: 3072
          domainVcpu: 2

But it seems it did NOT take effect; the memory of the worker is still `2048`. Am I missing something?
[core@jian-worker-0-2t97v ~]$ cat /proc/meminfo |grep Mem
MemTotal:        2046932 kB
MemFree:          248624 kB
MemAvailable:     523640 kB

Comment 10 Alex Crawford 2018-12-07 01:39:39 UTC
The Machine API Operator doesn't reconcile the attributes of the machines. You'll need to destroy the machine (you can scale down and back up) in order to pick up the new settings. Sorry, I should have mentioned that before.
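
The scale-down/scale-up step described above might be done roughly as follows. This is an illustrative sketch using names from earlier comments; if the MachineSet does not expose the scale subresource in this API version, edit `spec.replicas` directly instead.

```shell
# Delete the existing worker by scaling the MachineSet to zero.
oc scale machineset jian-worker-0 -n openshift-cluster-api --replicas=0

# Wait for the old worker Machine to be removed, then scale back up.
# The replacement machine should be created with the updated
# domainMemory/domainVcpu from the MachineSet template.
oc scale machineset jian-worker-0 -n openshift-cluster-api --replicas=1
```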

