Description of problem:
The logging-fluentd pods could not be started after OpenShift was upgraded from v3.7 to v3.9. Comparing the Age of the logging-es pod with the logging-fluentd pods: the logging-es pod is 3h old, which is when the OpenShift upgrade ran, while the logging-fluentd pods are 9h old, which is when they were originally created. It appears the daemonset was not refreshed during the upgrade.

logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2   Running            0    3h
logging-fluentd-5fp5h                         0/1   CrashLoopBackOff   39   9h

Once the daemonset was refreshed, the fluentd pods came back; refer to the Additional info.

Version-Release number of the following components:
OCP-3.7.26 -> OCP-3.9.0-0.24.0
logging-fluentd:v3.7.26
openshift-ansible-3.9.0-0.24.0

How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift v3.7.26
2. Deploy logging v3.7.26
3. Upgrade OpenShift from v3.7.26 to v3.9.0-0.24.0
4. Check the logging status

Actual results:
# The fluentd pods could not be started.
[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY   STATUS             RESTARTS   AGE
logging-curator-1-dmfsp                       1/1     Running            0          3h
logging-curator-ops-1-txmkr                   1/1     Running            0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2     Running            0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2     Running            0          3h
logging-fluentd-5fp5h                         0/1     CrashLoopBackOff   39         9h
logging-fluentd-5gwmg                         0/1     CrashLoopBackOff   39         9h
logging-fluentd-5tlz5                         0/1     CrashLoopBackOff   39         9h
logging-fluentd-6p8pj                         0/1     CrashLoopBackOff   40         9h
logging-fluentd-brqnb                         0/1     CrashLoopBackOff   40         9h
logging-fluentd-cn97s                         0/1     Error              50         9h
logging-fluentd-fckc4                         0/1     CrashLoopBackOff   49         9h
logging-fluentd-g428w                         0/1     CrashLoopBackOff   50         9h
logging-fluentd-jwzvf                         0/1     CrashLoopBackOff   38         9h
logging-fluentd-kh5x4                         0/1     CrashLoopBackOff   40         9h
logging-fluentd-m6cvl                         0/1     CrashLoopBackOff   39         9h
logging-kibana-1-6pbcl                        2/2     Running            0          3h
logging-kibana-ops-1-mmxbj                    2/2     Running            0          3h

[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY   STATUS             RESTARTS   AGE
logging-curator-1-dmfsp                       1/1     Running            0          3h
logging-curator-ops-1-txmkr                   1/1     Running            0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2     Running            0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2     Running            0          3h
logging-fluentd-5fp5h                         1/1     Running            40         9h
logging-fluentd-5gwmg                         0/1     CrashLoopBackOff   39         9h
logging-fluentd-5tlz5                         0/1     CrashLoopBackOff   39         9h
logging-fluentd-6p8pj                         1/1     Running            41         9h
logging-fluentd-brqnb                         0/1     CrashLoopBackOff   40         9h
logging-fluentd-cn97s                         0/1     CrashLoopBackOff   50         9h
logging-fluentd-fckc4                         0/1     CrashLoopBackOff   49         9h
logging-fluentd-g428w                         1/1     Running            51         9h
logging-fluentd-jwzvf                         0/1     CrashLoopBackOff   38         9h
logging-fluentd-kh5x4                         0/1     CrashLoopBackOff   40         9h
logging-fluentd-m6cvl                         0/1     CrashLoopBackOff   39         9h
logging-kibana-1-6pbcl                        2/2     Running            0          3h
logging-kibana-ops-1-mmxbj                    2/2     Running            0          3h

[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY   STATUS             RESTARTS   AGE
logging-curator-1-dmfsp                       1/1     Running            0          3h
logging-curator-ops-1-txmkr                   1/1     Running            0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2     Running            0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2     Running            0          3h
logging-fluentd-5fp5h                         0/1     Error              40         9h
logging-fluentd-5gwmg                         0/1     CrashLoopBackOff   39         9h
logging-fluentd-5tlz5                         0/1     CrashLoopBackOff   39         9h
logging-fluentd-6p8pj                         0/1     Error              41         9h
logging-fluentd-brqnb                         0/1     CrashLoopBackOff   40         9h
logging-fluentd-cn97s                         0/1     CrashLoopBackOff   50         9h
logging-fluentd-fckc4                         0/1     Error              50         9h
logging-fluentd-g428w                         0/1     Error              51         9h
logging-fluentd-jwzvf                         1/1     Running            39         9h
logging-fluentd-kh5x4                         0/1     CrashLoopBackOff   40         9h
logging-fluentd-m6cvl                         0/1     CrashLoopBackOff   39         9h
logging-kibana-1-6pbcl                        2/2     Running            0          3h
logging-kibana-ops-1-mmxbj                    2/2     Running            0          3h

[anli@upg_slave_qeos10 37_39]$ oc logs logging-fluentd-5fp5h
umounts of dead containers will fail. Ignoring...
umount: /var/lib/docker/containers/118cf7ba87a5e67345836b4436127603d2424ac6ff4d68764f2ae5a1687edb6b/shm: not mounted
umount: /var/lib/docker/containers/240269b31a5a8c1d9fc5f800daa8cb583d359c0d2a821a7dc396bd8b11ca7e4f/shm: not mounted
umount: /var/lib/docker/containers/8764f27ad687ea392fa474d657e9c107538accd449421a1da82316051908d833/shm: not mounted
umount: /var/lib/docker/containers/b06b8a881dd7c41653f84d6fa54e6a7abc87a1d26658de66b7f91cb5105fff10/shm: not mounted
umount: /var/lib/docker/containers/e36c31fca1b4466d93944388c2f39355f7094dcc65543f759f0008c94bb28a77/shm: not mounted
umount: /var/lib/docker/containers/e7911c5afbff7217185559b326fbb4136f73e398ba29ed11c0ed8385355add14/shm: not mounted
2018-01-26 03:59:00 -0500 [info]: reading config file path="/etc/fluent/fluent.conf"
2018-01-26 03:59:20 -0500 [error]: unexpected error error="getaddrinfo: Name or service not known"
2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:878:in `initialize'
2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:878:in `open'
2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:878:in `block in connect'
2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/timeout.rb:52:in `timeout'
2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:877:in `connect'
2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:862:in `do_start'
2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:851:in `start'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:328:in `block in api'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:58:in `handle_exception'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:327:in `api'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:322:in `api_valid?'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:227:in `configure'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:145:in `add_filter'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:62:in `block in configure'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:57:in `each'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:57:in `configure'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `block in configure'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `each'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `configure'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/engine.rb:129:in `configure'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/engine.rb:103:in `run_configure'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:489:in `run_configure'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:174:in `block in start'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:366:in `call'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:366:in `main_process'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:170:in `start'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/command/fluentd.rb:173:in `<top (required)>'
2018-01-26 03:59:20 -0500 [error]: /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
2018-01-26 03:59:20 -0500 [error]: /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/bin/fluentd:8:in `<top (required)>'
2018-01-26 03:59:20 -0500 [error]: /usr/bin/fluentd:23:in `load'
2018-01-26 03:59:20 -0500 [error]: /usr/bin/fluentd:23:in `<main>'

Expected results:
The fluentd pods are running and work fine.

Additional info:
Once I enabled debug mode, the fluentd pods came back.
# Debug in fluentd
oc edit configmap logging-fluentd
# then remove the "@include configs.d/openshift/system.conf" line and replace it with:
<system>
  log_level debug
</system>
oc set env ds/logging-fluentd DEBUG=true

^C[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY   STATUS    RESTARTS   AGE
logging-curator-1-dmfsp                       1/1     Running   0          3h
logging-curator-ops-1-txmkr                   1/1     Running   0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2     Running   0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2     Running   0          3h
logging-fluentd-28plc                         1/1     Running   0          12m
logging-fluentd-2h2tp                         1/1     Running   0          12m
logging-fluentd-5fsdq                         1/1     Running   0          12m
logging-fluentd-5vzxn                         1/1     Running   0          12m
logging-fluentd-6vfzs                         1/1     Running   0          12m
logging-fluentd-g4tx2                         1/1     Running   0          12m
logging-fluentd-k5rqh                         1/1     Running   0          12m
logging-fluentd-mwnw8                         1/1     Running   0          12m
logging-fluentd-p2874                         1/1     Running   0          12m
logging-fluentd-vpmjk                         1/1     Running   0          12m
logging-fluentd-wlwz6                         1/1     Running   0          12m
logging-kibana-1-6pbcl                        2/2     Running   0          3h
logging-kibana-ops-1-mmxbj                    2/2     Running   0          3h

After removing the debug configuration values, the fluentd pods are still running.

[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY   STATUS    RESTARTS   AGE
logging-curator-1-dmfsp                       1/1     Running   0          3h
logging-curator-ops-1-txmkr                   1/1     Running   0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2     Running   0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2     Running   0          3h
logging-fluentd-7cq2q                         1/1     Running   0          2m
logging-fluentd-8cnsj                         1/1     Running   0          1m
logging-fluentd-hv2dk                         1/1     Running   0          1m
logging-fluentd-kt7q9                         1/1     Running   0          1m
logging-fluentd-m2sm6                         1/1     Running   0          2m
logging-fluentd-m4rnp                         1/1     Running   0          1m
logging-fluentd-mllh4                         1/1     Running   0          2m
logging-fluentd-sdtqr                         1/1     Running   0          2m
logging-fluentd-tmzxp                         1/1     Running   0          1m
logging-fluentd-wlmgb                         1/1     Running   0          2m
logging-fluentd-z9q9d                         1/1     Running   0          2m
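For completeness, a minimal sketch of how the debug settings could be backed out again, assuming the original configmap stanza was the "@include configs.d/openshift/system.conf" line shown above (the "DEBUG-" form of oc set env unsets the variable):

oc edit configmap logging-fluentd        # put the @include line back in place of the <system> block
oc set env ds/logging-fluentd DEBUG-     # unset DEBUG; this rolls the daemonset pods again
oc get pods -l component=fluentd -w      # watch the replacement pods come up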
@scott, @jeff, this seems to be a daemonset issue rather than a logging issue.
Hi @Anping,

Looking into the error/stack traces in the report, the failure appears to come from the fluentd kubernetes metadata plugin connecting to the Kubernetes API:

2018-01-26 03:59:20 -0500 [error]: unexpected error error="getaddrinfo: Name or service not known"
[...]
2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:227:in `configure'

"filter_kubernetes_metadata.rb"
222       @client = Kubeclient::Client.new @kubernetes_url, @apiVersion,
223                                        ssl_options: ssl_options,
224                                        auth_options: auth_options
225
226       begin
227         @client.api_valid?
228       rescue KubeException => kube_error
229         raise Fluent::ConfigError, "Invalid Kubernetes API #{@apiVersion} endpoint #{@kubernetes_url}: #{kube_error.message}"
230       end

I'm wondering whether the network is healthy after the upgrade. Can we get some more network-related info, e.g. "oc get services"? (I'd believe you have much deeper knowledge about checking the network condition on openshift/kubernetes than I do... ;)

Another question: you mentioned that enabling DEBUG solved the problem:

oc set env ds/logging-fluentd DEBUG=true

I'm not so sure about that. Could it be a timing issue? For instance, can we consider this scenario -- when fluentd was crashing the network was not yet 100% ready, then it came back and you restarted the fluentd daemonset with DEBUG=true? If that's the case, the problem would be that we don't detect whether the environment is ready, or weak self-recovery during the upgrade/startup.

Thanks!
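For reference, a few generic checks along these lines might help confirm the API service is resolvable and reachable from the SDN (a sketch; the pod placeholder and the availability of getent in the image are assumptions, not from this report):

oc get svc kubernetes -n default           # the service kubernetes.default.svc should resolve to
oc get endpoints kubernetes -n default     # confirm the master endpoints behind it are populated
oc rsh <any-running-pod> getent hosts kubernetes.default.svc.cluster.local   # name resolution from inside the SDN, if the image ships getent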
Can you advise whether the endpoint is reachable? Is it a networking issue? Can you debug an instance of the pod and reach the endpoint from the debug pod?
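Something along these lines might work, assuming oc debug accepts the daemonset and the fluentd image ships curl/getent (both assumptions):

oc debug ds/logging-fluentd -n logging     # start a debug copy of a daemonset pod
# then, from the debug shell:
curl -sk https://kubernetes.default.svc:443/version    # API reachability
getent hosts kubernetes.default.svc                    # DNS resolution, since the error above is getaddrinfo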
Created attachment 1391936 [details]
fluentd debug logs

I observed the following strange behavior after upgrading from v3.7 to v3.9. I am sure both masters/nodes had been upgraded to v3.9 and all services were restarted. Maybe it is a pod or upgrade bug rather than logging.

1) All daemonset-controlled pods were/are not working [1]: apiserver-rmmwv, controller-manager-2r8l7, logging-fluentd-tvzwx and apiserver-f58n2.
2) The logging-fluentd-2gfqk pod on the master is in Running status. Its infra ("pause") container is openshift3/ose-pod:v3.9.0-0.38.0 [2].
3) The logging-fluentd-tvzwx pod on the node is in CrashLoopBackOff. Its infra ("pause") container is openshift3/ose-pod:v3.7.28 [3].
4) The daemonset pods are ignored during the upgrade [4].

[1] # oc get pods --all-namespaces
NAMESPACE                           NAME                                      READY   STATUS             RESTARTS   AGE
default                             docker-registry-4-8pb6t                   1/1     Running            0          1h
default                             registry-console-2-8rvbm                  1/1     Running            0          1h
default                             router-2-tvnzc                            1/1     Running            0          1h
install-test                        mongodb-1-6gqxw                           1/1     Running            0          1h
install-test                        nodejs-mongodb-example-1-9kdv2            1/1     Running            0          1h
kube-service-catalog                apiserver-rmmwv                           1/1     Running            4          4h
kube-service-catalog                controller-manager-2r8l7                  1/1     Running            12         4h
logging                             logging-curator-1-hsp6x                   1/1     Running            0          1h
logging                             logging-es-data-master-dtrmuatb-1-czp4g   2/2     Running            0          1h
logging                             logging-fluentd-2gfqk                     1/1     Running            4          3h
logging                             logging-fluentd-tvzwx                     0/1     CrashLoopBackOff   17         3h
logging                             logging-kibana-1-b9xz6                    2/2     Running            0          1h
openshift-ansible-service-broker    asb-1-598b9                               1/1     Running            4          1h
openshift-ansible-service-broker    asb-etcd-1-7qszj                          1/1     Running            0          1h
openshift-template-service-broker   apiserver-f58n2                           0/1     Running            1          4h
openshift-template-service-broker   apiserver-kb2kc                           1/1     Running            4          4h
openshift-web-console               webconsole-74d65548c-xz5sj                1/1     Running            0          1h

[2] [root@qe-anli37-master-etcd-nfs-1 ~]# docker ps | grep fluentd
d2a3b584b3f7   b67c504594fb   "sh run.sh"   About an hour ago   Up About an hour   k8s_fluentd-elasticsearch_logging-fluentd-2gfqk_logging_bbc44e29-0b08-11e8-b686-42010af00005_4
dfc6e1ecb4f6   registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.9.0-0.38.0   "/usr/bin/pod"   About an hour ago   Up About an hour   k8s_POD_logging-fluentd-2gfqk_logging_bbc44e29-0b08-11e8-b686-42010af00005_4

[3] [root@qe-anli37-node-registry-router-1 ~]# docker ps | grep fluentd
74d787f74e44   registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.7.28   "/usr/bin/pod"   3 hours ago   Up 3 hours   k8s_POD_logging-fluentd-tvzwx_logging_bbbff615-0b08-11e8-b686-42010af00005_0

[4] common/openshift-cluster/upgrades/upgrade_nodes.yml:
    - name: Drain Node for Kubelet upgrade
      command: >
        {{ hostvars[groups.oo_first_master.0]['first_master_client_binary'] }} adm drain {{ openshift.node.nodename | lower }}
        --config={{ openshift.common.config_base }}/master/admin.kubeconfig
        --force --delete-local-data --ignore-daemonsets
        --timeout={{ openshift_upgrade_nodes_drain_timeout | default(0) }}s
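A quick way to spot-check this on every host is to list the infra ("pause") container image per pod; an ose-pod:v3.7.x image still running after the upgrade would mean the pod sandbox was never recreated (a sketch, run as root on each master/node):

docker ps --format '{{.Image}}\t{{.Names}}' | grep ose-pod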
Can we get pod logs from the pods themselves and the deployer pods to determine if they have any information of interest?
Thanks for the analysis, @Anping.

The change was made about a year ago for upgrading 1.3 to 1.4: https://github.com/openshift/openshift-ansible/pull/3370
They even mentioned us ("e.g. aggregated logging"). So it looks like the --ignore-daemonsets option has been set when upgrading to 3.4, 3.5, 3.6, and 3.7. Do you have an idea why we run into this issue only now, going from 3.7 to 3.9? Or should we do something extra for the daemonset?
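For what it's worth, the drain behavior can be checked by hand with the same flags the playbook uses; daemonset pods are expected to stay on the node (a sketch; <node> is a placeholder):

oc adm drain <node> --force --delete-local-data --ignore-daemonsets
oc get pods --all-namespaces -o wide | grep <node>    # daemonset pods remain scheduled on the drained node
oc adm uncordon <node>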
No output from "docker logs registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.7.28"
There should be some logs from the failed fluentd pod or the fluentd deployer pod, retrievable with 'oc logs $PODNAME|$PODNAME-deploy'.
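A loop like the following (a sketch, assuming the component=fluentd label used elsewhere in this bug) would collect current and previous-container logs from all fluentd pods:

for p in $(oc get pods -n logging -l component=fluentd -o name); do
  echo "== $p =="
  oc logs -n logging "$p" --previous 2>/dev/null || oc logs -n logging "$p"
done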
Jeff, no deploy happened; the ose-pod container stays alive during the upgrade. I guess the running ose-pod:v3.7 container failed to launch the fluentd containers.
mfojtik: I fail to see any evidence why this is a broken daemonset:
- the pods are in CrashLoopBackOff, meaning the app itself is broken

> The Age of logging-fluentd is 9h, which is the time it was created. It seems the daemonset wasn't refreshed during upgrade.

Was the DS changed during the update? Can we have a yaml dump?

oc get ds <name> -o yaml
oc get po -l <ds-label> -o yaml
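Besides the raw dumps, comparing the daemonset's metadata.generation with status.observedGeneration might show whether the DS controller ever processed an updated template (a sketch, assuming the logging namespace):

oc get ds logging-fluentd -n logging -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'
oc get ds logging-fluentd -n logging -o yaml > logging-fluentd-ds.yaml
oc get pods -n logging -l component=fluentd -o yaml > logging-fluentd-pods.yaml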
@Tomáš, no DS change. I will attach the yaml file the next time I perform the upgrade.
Moving this to 3.9.z to remove it from the blocker list, as it doesn't seem to be a blocker currently, but we still want to ensure it's not an issue.
Confirmed that this is not a logging issue. The root cause is that the logging-fluentd pod's IP address was reassigned to another pod. It should be a pod/cri issue. The quick workaround is to delete the daemonset pods during the upgrade so they get redeployed. Reassigning back to Upgrade.

@scott, could you take a look at this issue?

Root cause: the logging-fluentd pod's IP address was reassigned to other pods. The pods k8s_POD_logging-fluentd-9rn4q_logging_68f64a6d-1cf7-11e8-834f-fa163e6d7e7a_0 and k8s_POD_webconsole-6d845c7d76-vvngn_openshift-web-console_917ef9f6-1d1e-11e8-80e5-fa163e6d7e7a_0 are using the same IP address.

1) Get the fluentd pod's IP address
1.1) # docker ps | grep fluentd
5c646c94f47d   registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.7.35   "/usr/bin/pod"   6 hours ago   Up 6 hours   k8s_POD_logging-fluentd-9rn4q_logging_68f64a6d-1cf7-11e8-834f-fa163e6d7e7a_0
1.2) # docker inspect 5c646c94f47d | grep Pid
    "Pid": 52211,
    "PidMode": "",
    "PidsLimit": 0,
1.3) # nsenter -n -i -t 52211
# ip addr | grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 10.2.4.2/23 brd 10.2.5.255 scope global eth0
    inet6 fe80::ec76:74ff:fea1:6980/64 scope link

2) Check which pod the IP 10.2.4.2 (used by logging-fluentd before the upgrade) is assigned to
2.1) # cat /var/lib/cni/openshift-sdn/10.2.4.2
2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7
2.2) # docker ps | grep 2e12c960c226
2e12c960c226   registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.9.1   "/usr/bin/pod"   2 hours ago   Up 2 hours   k8s_POD_webconsole-6d845c7d76-vvngn_openshift-web-console_917ef9f6-1d1e-11e8-80e5-fa163e6d7e7a_0

3) Check the pod IP of pod ID 2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7
3.1) # docker inspect 2e12c960c226 | grep 2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7
    "Id": "2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7",
    "ResolvConfPath": "/var/lib/docker/containers/2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7/resolv.conf",
    "HostnamePath": "/var/lib/docker/containers/2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7/hostname",
    "HostsPath": "/var/lib/docker/containers/2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7/hosts",
3.2) # docker inspect 2e12c960c226 | grep Pid
    "Pid": 115529,
    "PidMode": "",
    "PidsLimit": 0,
3.3) # nsenter -n -i -t 115529
# ip addr | grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 10.2.4.2/23 brd 10.2.5.255 scope global eth0
    inet6 fe80::4c51:ceff:fe3a:6368/64 scope link
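Based on the layout shown above (one file per IP under /var/lib/cni/openshift-sdn, containing the owning container ID), a rough sweep like this could list every allocated IP and its current owner, which makes duplicates and stale allocations easier to spot (a sketch, run as root on the node):

for f in /var/lib/cni/openshift-sdn/*; do
  ip=$(basename "$f"); cid=$(cat "$f")
  # resolve the container ID to a name, or flag it if the container no longer exists
  name=$(docker inspect --format '{{.Name}}' "$cid" 2>/dev/null || echo "<container gone>")
  echo "$ip -> ${cid:0:12} $name"
done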
This reproduces 100% in my testing during upgrade. The workaround is to trigger a new deploy with 'oc delete pods --selector component=fluentd'.

[root@host-172-16-120-14 ~]# oc get pods
NAME                                          READY   STATUS             RESTARTS   AGE
logging-curator-1-vw7fv                       1/1     Running            0          2h
logging-curator-ops-1-q74n7                   1/1     Running            0          2h
logging-es-data-master-s68eocj5-1-6pqnn       2/2     Running            0          2h
logging-es-ops-data-master-oa0mmdlh-2-hn86w   2/2     Running            0          2h
logging-es-ops-data-master-te7241sz-1-qf9ps   2/2     Running            0          2h
logging-es-ops-data-master-vvoif0gl-1-r8hd7   2/2     Running            0          2h
logging-fluentd-5z7zv                         0/1     CrashLoopBackOff   33         7h
logging-fluentd-6bb22                         0/1     CrashLoopBackOff   42         7h
logging-fluentd-9rn4q                         0/1     CrashLoopBackOff   41         7h
logging-fluentd-fmwnc                         0/1     CrashLoopBackOff   34         7h
logging-fluentd-gs45m                         0/1     CrashLoopBackOff   33         7h
logging-fluentd-kh52s                         0/1     CrashLoopBackOff   42         7h
logging-fluentd-m5zh8                         0/1     CrashLoopBackOff   33         7h
logging-fluentd-mw6g5                         0/1     CrashLoopBackOff   34         7h
logging-fluentd-zqvxv                         0/1     CrashLoopBackOff   32         7h
logging-kibana-2-zvstx                        2/2     Running            0          2h
logging-kibana-ops-2-ltcbd                    2/2     Running            0          2h
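In full, the workaround amounts to (a sketch, assuming the logging namespace):

oc delete pods -n logging --selector component=fluentd    # force the daemonset to recreate the pods with fresh sandboxes
oc get pods -n logging -l component=fluentd -w            # watch them come back to Running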
https://bugzilla.redhat.com/show_bug.cgi?id=1538971#c13
Typo: "It should be a pod/cri issue" should read "It should be a pod/cni issue".
Anping, could you provide me the DS controller logs when this happens?
Can I get access to the cluster where it is happening?
Meanwhile I am trying to reproduce in my local cluster.
I have been trying to reproduce this with some daemonsets (not fluentd) and have been unable to reproduce it. However, my test cluster is pretty small (one master and one node), so I am not sure if that's playing any role here. I have taken the following steps:
1. Deploy OpenShift 3.7 as one master and one node (I had to downgrade golang to 1.8.3 to compile 3.7).
2. Create 5 daemonsets.
3. Stop the master and node (note that the pods stay in the etcd store).
4. Delete the docker containers.
5. Deploy origin latest (should be very close to 3.9) as one master and one node with the same config and etcd storage used with 3.7 (I had to upgrade golang back to 1.9 to compile 3.9).
6. Noticed that all daemonsets and their pods were started fine and new containers were created.
@seth, I will provide the env when this happens.

@Avesh, to reproduce quickly, I think we can try the following steps, which are similar to what openshift-ansible does (see the sketch after this list):
1. Deploy OpenShift v3.7.
2. Deploy logging v3.7 on OpenShift v3.7.
3. Upgrade OpenShift:
   3.1) oc adm drain $node --ignore-daemonsets=false
   3.2) upgrade to v3.9
   3.3) make the node schedulable
4. Watch the DS pod status.
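A minimal sketch of steps 3.3-4, assuming $node holds the node name:

oc adm uncordon $node                                     # make the node schedulable again
oc get pods -n logging -l component=fluentd -o wide -w    # watch the daemonset pods on that node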
Time is up on 3.9; moving to 3.9.z while waiting for reproduction. Reapplying needinfo. Please clear it when the issue is reproduced and additional information is available.
No such issue in the recent upgrade.
Please reopen if this happens again with:

oc get ds <name> -o yaml
oc get po -l <ds-label> -o yaml

and node logs (if possible).