Bug 1538971 - daemonset controlled pod logging-fluentd couldn't be started after upgrade
Summary: daemonset controlled pod logging-fluentd couldn't be started after upgrade
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 3.9.z
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-26 09:53 UTC by Anping Li
Modified: 2019-06-06 15:22 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-19 14:28:47 UTC
Target Upstream Version:
Embargoed:


Attachments:
fluentd debug logs (1.09 KB, application/x-gzip)
2018-02-06 10:06 UTC, Anping Li

Description Anping Li 2018-01-26 09:53:38 UTC
Description of problem:
The logging-fluentd pods couldn't be started once OpenShift was upgraded from v3.7 to v3.9.
Comparing the Age of the logging-es pod with the logging-fluentd pods: the Age of the logging-es pod is 3h, which is when OpenShift was upgraded, while the Age of logging-fluentd is 9h, which is when it was originally created. It seems the daemonset wasn't refreshed during the upgrade.

logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2       Running            0          3h
logging-fluentd-5fp5h                         0/1       CrashLoopBackOff   39         9h


Once the daemonset was refreshed, the fluentd pods came back; see Additional info below.

Version-Release number of the following components:
OCP-3.7.26 -> OCP-3.9.0-0.24.0
logging-fluentd:v3.7.26
openshift-ansible-3.9.0-0.24.0

How reproducible:
always

Steps to Reproduce:
1. install openshift v3.7.26
2. deploy logging v3.7.26
3. upgrade openshift from v3.7.26 to v3.9.0-0.24.0
4. check the logging status


Actual results:
# The fluentd pods couldn't be started.

[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY     STATUS             RESTARTS   AGE
logging-curator-1-dmfsp                       1/1       Running            0          3h
logging-curator-ops-1-txmkr                   1/1       Running            0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2       Running            0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2       Running            0          3h
logging-fluentd-5fp5h                         0/1       CrashLoopBackOff   39         9h
logging-fluentd-5gwmg                         0/1       CrashLoopBackOff   39         9h
logging-fluentd-5tlz5                         0/1       CrashLoopBackOff   39         9h
logging-fluentd-6p8pj                         0/1       CrashLoopBackOff   40         9h
logging-fluentd-brqnb                         0/1       CrashLoopBackOff   40         9h
logging-fluentd-cn97s                         0/1       Error              50         9h
logging-fluentd-fckc4                         0/1       CrashLoopBackOff   49         9h
logging-fluentd-g428w                         0/1       CrashLoopBackOff   50         9h
logging-fluentd-jwzvf                         0/1       CrashLoopBackOff   38         9h
logging-fluentd-kh5x4                         0/1       CrashLoopBackOff   40         9h
logging-fluentd-m6cvl                         0/1       CrashLoopBackOff   39         9h
logging-kibana-1-6pbcl                        2/2       Running            0          3h
logging-kibana-ops-1-mmxbj                    2/2       Running            0          3h

[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY     STATUS             RESTARTS   AGE
logging-curator-1-dmfsp                       1/1       Running            0          3h
logging-curator-ops-1-txmkr                   1/1       Running            0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2       Running            0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2       Running            0          3h
logging-fluentd-5fp5h                         1/1       Running            40         9h
logging-fluentd-5gwmg                         0/1       CrashLoopBackOff   39         9h
logging-fluentd-5tlz5                         0/1       CrashLoopBackOff   39         9h
logging-fluentd-6p8pj                         1/1       Running            41         9h
logging-fluentd-brqnb                         0/1       CrashLoopBackOff   40         9h
logging-fluentd-cn97s                         0/1       CrashLoopBackOff   50         9h
logging-fluentd-fckc4                         0/1       CrashLoopBackOff   49         9h
logging-fluentd-g428w                         1/1       Running            51         9h
logging-fluentd-jwzvf                         0/1       CrashLoopBackOff   38         9h
logging-fluentd-kh5x4                         0/1       CrashLoopBackOff   40         9h
logging-fluentd-m6cvl                         0/1       CrashLoopBackOff   39         9h
logging-kibana-1-6pbcl                        2/2       Running            0          3h
logging-kibana-ops-1-mmxbj                    2/2       Running            0          3h
[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY     STATUS             RESTARTS   AGE
logging-curator-1-dmfsp                       1/1       Running            0          3h
logging-curator-ops-1-txmkr                   1/1       Running            0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2       Running            0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2       Running            0          3h
logging-fluentd-5fp5h                         0/1       Error              40         9h
logging-fluentd-5gwmg                         0/1       CrashLoopBackOff   39         9h
logging-fluentd-5tlz5                         0/1       CrashLoopBackOff   39         9h
logging-fluentd-6p8pj                         0/1       Error              41         9h
logging-fluentd-brqnb                         0/1       CrashLoopBackOff   40         9h
logging-fluentd-cn97s                         0/1       CrashLoopBackOff   50         9h
logging-fluentd-fckc4                         0/1       Error              50         9h
logging-fluentd-g428w                         0/1       Error              51         9h
logging-fluentd-jwzvf                         1/1       Running            39         9h
logging-fluentd-kh5x4                         0/1       CrashLoopBackOff   40         9h
logging-fluentd-m6cvl                         0/1       CrashLoopBackOff   39         9h
logging-kibana-1-6pbcl                        2/2       Running            0          3h
logging-kibana-ops-1-mmxbj                    2/2       Running            0          3h
[anli@upg_slave_qeos10 37_39]$ oc logs logging-fluentd-5fp5h
umounts of dead containers will fail. Ignoring...
umount: /var/lib/docker/containers/118cf7ba87a5e67345836b4436127603d2424ac6ff4d68764f2ae5a1687edb6b/shm: not mounted
umount: /var/lib/docker/containers/240269b31a5a8c1d9fc5f800daa8cb583d359c0d2a821a7dc396bd8b11ca7e4f/shm: not mounted
umount: /var/lib/docker/containers/8764f27ad687ea392fa474d657e9c107538accd449421a1da82316051908d833/shm: not mounted
umount: /var/lib/docker/containers/b06b8a881dd7c41653f84d6fa54e6a7abc87a1d26658de66b7f91cb5105fff10/shm: not mounted
umount: /var/lib/docker/containers/e36c31fca1b4466d93944388c2f39355f7094dcc65543f759f0008c94bb28a77/shm: not mounted
umount: /var/lib/docker/containers/e7911c5afbff7217185559b326fbb4136f73e398ba29ed11c0ed8385355add14/shm: not mounted
2018-01-26 03:59:00 -0500 [info]: reading config file path="/etc/fluent/fluent.conf"
2018-01-26 03:59:20 -0500 [error]: unexpected error error="getaddrinfo: Name or service not known"
  2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:878:in `initialize'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:878:in `open'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:878:in `block in connect'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/timeout.rb:52:in `timeout'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:877:in `connect'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:862:in `do_start'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/ruby/net/http.rb:851:in `start'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:328:in `block in api'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:58:in `handle_exception'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:327:in `api'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:322:in `api_valid?'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:227:in `configure'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:145:in `add_filter'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:62:in `block in configure'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:57:in `each'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:57:in `configure'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `block in configure'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `each'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `configure'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/engine.rb:129:in `configure'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/engine.rb:103:in `run_configure'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:489:in `run_configure'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:174:in `block in start'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:366:in `call'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:366:in `main_process'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:170:in `start'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/command/fluentd.rb:173:in `<top (required)>'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
  2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluentd-0.12.42/bin/fluentd:8:in `<top (required)>'
  2018-01-26 03:59:20 -0500 [error]: /usr/bin/fluentd:23:in `load'
  2018-01-26 03:59:20 -0500 [error]: /usr/bin/fluentd:23:in `<main>'

Expected results:

The fluentd pods are running and work fine.

Additional info:
Once I enabled debug mode, the fluentd pods came back.

# Enable debug mode in fluentd
oc edit configmap logging-fluentd
# then remove the "@include configs.d/openshift/system.conf" line and replace it with:
<system>
   log_level debug
</system>

oc set env ds/logging-fluentd DEBUG=true
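A quick way to confirm the change took effect (a hypothetical sketch, not from the original report; assumes the logging namespace and the component=fluentd label used later in this bug):

# verify the configmap now carries the <system> block
oc get configmap logging-fluentd -n logging -o yaml | grep -A 2 '<system>'
# verify the DEBUG variable is set on the daemonset
oc set env ds/logging-fluentd -n logging --list | grep DEBUG
# watch the fluentd pods get recreated
oc get pods -n logging -l component=fluentd -w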


^C[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-dmfsp                       1/1       Running   0          3h
logging-curator-ops-1-txmkr                   1/1       Running   0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2       Running   0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2       Running   0          3h
logging-fluentd-28plc                         1/1       Running   0          12m
logging-fluentd-2h2tp                         1/1       Running   0          12m
logging-fluentd-5fsdq                         1/1       Running   0          12m
logging-fluentd-5vzxn                         1/1       Running   0          12m
logging-fluentd-6vfzs                         1/1       Running   0          12m
logging-fluentd-g4tx2                         1/1       Running   0          12m
logging-fluentd-k5rqh                         1/1       Running   0          12m
logging-fluentd-mwnw8                         1/1       Running   0          12m
logging-fluentd-p2874                         1/1       Running   0          12m
logging-fluentd-vpmjk                         1/1       Running   0          12m
logging-fluentd-wlwz6                         1/1       Running   0          12m
logging-kibana-1-6pbcl                        2/2       Running   0          3h
logging-kibana-ops-1-mmxbj                    2/2       Running   0          3h

After removing the debug configuration values, the fluentd pods are still running.


[anli@upg_slave_qeos10 37_39]$ oc get pods
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-dmfsp                       1/1       Running   0          3h
logging-curator-ops-1-txmkr                   1/1       Running   0          3h
logging-es-data-master-9i9cywub-1-8jvm2       2/2       Running   0          3h
logging-es-ops-data-master-en9aqvk6-1-qs7xj   2/2       Running   0          3h
logging-fluentd-7cq2q                         1/1       Running   0          2m
logging-fluentd-8cnsj                         1/1       Running   0          1m
logging-fluentd-hv2dk                         1/1       Running   0          1m
logging-fluentd-kt7q9                         1/1       Running   0          1m
logging-fluentd-m2sm6                         1/1       Running   0          2m
logging-fluentd-m4rnp                         1/1       Running   0          1m
logging-fluentd-mllh4                         1/1       Running   0          2m
logging-fluentd-sdtqr                         1/1       Running   0          2m
logging-fluentd-tmzxp                         1/1       Running   0          1m
logging-fluentd-wlmgb                         1/1       Running   0          2m
logging-fluentd-z9q9d                         1/1       Running   0          2m

Comment 1 Anping Li 2018-01-28 12:44:22 UTC
@scott, @jeff, it seems to be a daemonset issue rather than a logging issue.

Comment 2 Noriko Hosoi 2018-01-30 18:20:13 UTC
Hi @Anping,

Looking into the error/stack traces in the report, the failure looks like it comes from connecting to Kubernetes in the fluentd kubernetes metadata plugin.

2018-01-26 03:59:20 -0500 [error]: unexpected error error="getaddrinfo: Name or service not known"
[...]

2018-01-26 03:59:20 -0500 [error]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:227:in `configure'

"filter_kubernetes_metadata.rb"
222         @client = Kubeclient::Client.new @kubernetes_url, @apiVersion,
223                                          ssl_options: ssl_options,
224                                          auth_options: auth_options
225
226         begin
227           @client.api_valid?
228         rescue KubeException => kube_error
229           raise Fluent::ConfigError, "Invalid Kubernetes API #{@apiVersion} endpoint #{@kubernetes_url}: #{kube_error.message}"
230         end

I'm wondering if the network is healthy after the upgrade. Can we get some more network-related info? E.g., "oc get services"? (I believe you have much deeper knowledge about checking the network condition on openshift/kubernetes than I do... ;)
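(A hypothetical set of commands along those lines, assuming the logging namespace; the last one only works while a fluentd container is momentarily up, and getent being present in the image is an assumption:)

oc get services -n logging
oc get endpoints kubernetes -n default
oc exec logging-fluentd-5fp5h -n logging -- getent hosts kubernetes.default.svc.cluster.local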

Another question: you mentioned that enabling DEBUG solved the problem.

oc set env ds/logging-fluentd DEBUG=true

I'm not so sure about that. Could it be a timing issue? For instance, can we consider this scenario: when fluentd was crashing, the network was not 100% ready yet; then it came back and you restarted the fluentd daemonset with DEBUG=true?

If that's the case, the problem would be that we don't detect whether the environment is ready, or weak self-recovery during the upgrade/startup.

Thanks!

Comment 3 Jeff Cantrill 2018-02-05 14:20:12 UTC
Can you advise whether the endpoint is reachable? Is it a networking issue? Can you debug an instance of the pod and reach the endpoint from the debug pod?
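(One possible way to do that -- a sketch only, assuming oc debug accepts the daemonset reference on this version and that curl is available in the fluentd image; any HTTP response at all proves the endpoint is reachable, while a name-resolution failure would reproduce the getaddrinfo error above:)

oc debug ds/logging-fluentd -n logging
# then, inside the debug shell:
getent hosts kubernetes.default.svc.cluster.local
curl -vk https://kubernetes.default.svc.cluster.local/healthz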

Comment 4 Anping Li 2018-02-06 10:06:05 UTC
Created attachment 1391936 [details]
fluentd debug logs

I observed the following strange behavior after the upgrade from v3.7 to v3.9. I am sure both masters and nodes had been upgraded to v3.9 and all services were restarted. Maybe it is a pod or upgrade bug rather than a logging one.

1) All daemonset-controlled pods were/are not working. [1]

apiserver-rmmwv, controller-manager-2r8l7, logging-fluentd-tvzwx and apiserver-f58n2


2) The logging-fluentd-2gfqk pod on the master is in Running status. Its pod infra (ose-pod) container is openshift3/ose-pod:v3.9.0-0.38.0. [2]

3) The logging-fluentd-tvzwx pod on the node is in CrashLoopBackOff. Its pod infra (ose-pod) container is openshift3/ose-pod:v3.7.28. [3]

4) The daemonset pods are ignored during the upgrade (the drain uses --ignore-daemonsets). [4]





[1]# oc  get pods --all-namespaces
NAMESPACE                           NAME                                      READY     STATUS             RESTARTS   AGE
default                             docker-registry-4-8pb6t                   1/1       Running            0          1h
default                             registry-console-2-8rvbm                  1/1       Running            0          1h
default                             router-2-tvnzc                            1/1       Running            0          1h
install-test                        mongodb-1-6gqxw                           1/1       Running            0          1h
install-test                        nodejs-mongodb-example-1-9kdv2            1/1       Running            0          1h
kube-service-catalog                apiserver-rmmwv                           1/1       Running            4          4h
kube-service-catalog                controller-manager-2r8l7                  1/1       Running            12         4h
logging                             logging-curator-1-hsp6x                   1/1       Running            0          1h
logging                             logging-es-data-master-dtrmuatb-1-czp4g   2/2       Running            0          1h
logging                             logging-fluentd-2gfqk                     1/1       Running            4          3h
logging                             logging-fluentd-tvzwx                     0/1       CrashLoopBackOff   17         3h
logging                             logging-kibana-1-b9xz6                    2/2       Running            0          1h
openshift-ansible-service-broker    asb-1-598b9                               1/1       Running            4          1h
openshift-ansible-service-broker    asb-etcd-1-7qszj                          1/1       Running            0          1h
openshift-template-service-broker   apiserver-f58n2                           0/1       Running            1          4h
openshift-template-service-broker   apiserver-kb2kc                           1/1       Running            4          4h
openshift-web-console               webconsole-74d65548c-xz5sj                1/1       Running            0          1h




[2][root@qe-anli37-master-etcd-nfs-1 ~]# docker ps |grep fluentd
d2a3b584b3f7        b67c504594fb                                                                                                                                   "sh run.sh"              About an hour ago   Up About an hour                        k8s_fluentd-elasticsearch_logging-fluentd-2gfqk_logging_bbc44e29-0b08-11e8-b686-42010af00005_4
dfc6e1ecb4f6        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.9.0-0.38.0                                                                            "/usr/bin/pod"           About an hour ago   Up About an hour                        k8s_POD_logging-fluentd-2gfqk_logging_bbc44e29-0b08-11e8-b686-42010af00005_4


[3][root@qe-anli37-node-registry-router-1 ~]# docker ps |grep fluentd
74d787f74e44        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.7.28         "/usr/bin/pod"           3 hours ago         Up 3 hours                              k8s_POD_logging-fluentd-tvzwx_logging_bbbff615-0b08-11e8-b686-42010af00005_0



[4]common/openshift-cluster/upgrades/upgrade_nodes.yml:
  - name: Drain Node for Kubelet upgrade
    command: >
      {{ hostvars[groups.oo_first_master.0]['first_master_client_binary'] }} adm drain {{ openshift.node.nodename | lower }}
      --config={{ openshift.common.config_base }}/master/admin.kubeconfig
      --force --delete-local-data --ignore-daemonsets
      --timeout={{ openshift_upgrade_nodes_drain_timeout | default(0) }}s
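(If a node really does come back with stale daemonset sandboxes after such a drain, a hypothetical manual recovery, not part of the playbook and assuming the logging namespace and the component=fluentd label used elsewhere in this bug, would be:)

oc adm uncordon <node>
oc delete pods -n logging -l component=fluentd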

Comment 5 Jeff Cantrill 2018-02-06 20:28:28 UTC
Can we get pod logs from the pods themselves and the deployer pods to determine if they have any information of interest?

Comment 6 Noriko Hosoi 2018-02-06 21:15:24 UTC
Thanks for the analysis, @Anping.

The change was made about a year ago for upgrading 1.3 to 1.4.
https://github.com/openshift/openshift-ansible/pull/3370

They even mention us ("e.g. aggregated logging"). So it looks like the --ignore-daemonsets option has been set when upgrading to 3.4, 3.5, 3.6, and 3.7. Do you have an idea why we run into this issue this time, 3.7 to 3.9? Or should we do something extra for the daemonset?

Comment 7 Anping Li 2018-02-07 02:07:05 UTC
No output from "docker logs registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.7.28"

Comment 8 Jeff Cantrill 2018-02-07 15:05:20 UTC
There should be some logs from the failed fluentd pod or the fluentd deployer pod, retrievable with 'oc logs $PODNAME|$PODNAME-deploy'.
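(For the crash-looping pods, the previous container attempt's output may be the more useful one; a generic hint, assuming the logging namespace:)

oc logs logging-fluentd-tvzwx -n logging --previous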

Comment 9 Anping Li 2018-02-13 13:13:22 UTC
Jeff, no deploy happened; the ose-pod container stays alive during the upgrade. I guess the running ose-pod:v3.7 container failed to launch the fluentd containers.

Comment 10 Tomáš Nožička 2018-02-20 16:21:28 UTC
mfojtik, I fail to see any evidence that this is a broken daemonset:

 - the pods are in CrashLoopBackOff, meaning the app itself is broken

> The Age of logging-fluentd is 9h, which is the time it was created. It seems the daemonset wasn't refreshed during the upgrade.

Was the DS changed during the update? Can we have a yaml dump?
oc get ds <name> -o yaml
oc get po -l <ds-label> -o yaml
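(For this particular daemonset that would presumably be, filling in the name and label seen elsewhere in this bug:)

oc get ds logging-fluentd -n logging -o yaml
oc get po -l component=fluentd -n logging -o yaml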

Comment 11 Anping Li 2018-02-22 02:48:07 UTC
@Tomáš, no DS change. I will attach the yaml files the next time I perform the upgrade.

Comment 12 Jeff Cantrill 2018-02-26 16:01:44 UTC
Moving this to 3.9.z to remove it from the blocker list, as it doesn't seem to be a blocker currently, but we still want to ensure it's not an issue.

Comment 13 Anping Li 2018-03-01 09:42:34 UTC
Confirmed that this is not a logging issue. The root cause is that the logging-fluentd pod's IP address was reassigned to another pod. It should be a pod/cri issue. The quick workaround is to delete the daemonset pods during the upgrade so that they are redeployed. Reassigning back to the Upgrade component.

@scott, could you take a look at this issue?


Root Cause:
The logging-fluentd pod's IP address was reassigned to other pods. The pods k8s_POD_logging-fluentd-9rn4q_logging_68f64a6d-1cf7-11e8-834f-fa163e6d7e7a_0 and k8s_POD_webconsole-6d845c7d76-vvngn_openshift-web-console_917ef9f6-1d1e-11e8-80e5-fa163e6d7e7a_0 are using the same IP address.

1) Get the fluentd pod's IP address
1.1) # docker ps |grep fluentd
5c646c94f47d        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.7.35                                                                               "/usr/bin/pod"           6 hours ago         Up 6 hours                              k8s_POD_logging-fluentd-9rn4q_logging_68f64a6d-1cf7-11e8-834f-fa163e6d7e7a_0
1.2) #docker inspect 5c646c94f47d |grep Pid
            "Pid": 52211,
            "PidMode": "",
            "PidsLimit": 0,
1.3) # nsenter -n -i -t 52211
   # ip addr |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
    inet 10.2.4.2/23 brd 10.2.5.255 scope global eth0
    inet6 fe80::ec76:74ff:fea1:6980/64 scope link 

2) Check which pod is assigned the IP 10.2.4.2 (which was used by logging-fluentd before the upgrade)
2.1) # cat /var/lib/cni/openshift-sdn/10.2.4.2
2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7

2.2) # docker ps |grep 2e12c960c226
2e12c960c226        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.9.1                                                                                "/usr/bin/pod"           2 hours ago         Up 2 hours                              k8s_POD_webconsole-6d845c7d76-vvngn_openshift-web-console_917ef9f6-1d1e-11e8-80e5-fa163e6d7e7a_0

3) Check the pod IP of pod ID 2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7
3.1) # docker inspect 2e12c960c226 |grep 2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7
        "Id": "2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7",
        "ResolvConfPath": "/var/lib/docker/containers/2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7/hostname",
        "HostsPath": "/var/lib/docker/containers/2e12c960c226b2ffb720af309e3427a81f67ae296ea87f9bf733eea9373715b7/hosts",

3.2) # docker inspect 2e12c960c226 |grep Pid
            "Pid": 115529,
            "PidMode": "",
            "PidsLimit": 0,

3.3)   #  nsenter -n -i -t  115529
   # ip addr |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
    inet 10.2.4.2/23 brd 10.2.5.255 scope global eth0
    inet6 fe80::4c51:ceff:fe3a:6368/64 scope link
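
A rough way to scan a node for this kind of duplicate assignment (a hypothetical sketch built on the same docker/nsenter steps as above; the k8s_POD name filter and the openshift-sdn CNI state directory are assumptions):

# print the IP seen inside every running pause-container network namespace
for c in $(docker ps -q --filter name=k8s_POD); do
    pid=$(docker inspect -f '{{.State.Pid}}' "$c")
    name=$(docker inspect -f '{{.Name}}' "$c")
    addr=$(nsenter -n -t "$pid" ip -4 -o addr show eth0 2>/dev/null | awk '{print $4}')
    echo "$addr $name"
done | sort
# duplicate addresses in the sorted output mean two sandboxes share one IP;
# compare against what openshift-sdn has recorded:
ls /var/lib/cni/openshift-sdn/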

Comment 14 Anping Li 2018-03-01 09:50:36 UTC
100% reproducible in my testing during the upgrade. The workaround is to trigger a new deployment with 'oc delete pods --selector component=fluentd'.

[root@host-172-16-120-14 ~]# oc get pods
NAME                                          READY     STATUS             RESTARTS   AGE
logging-curator-1-vw7fv                       1/1       Running            0          2h
logging-curator-ops-1-q74n7                   1/1       Running            0          2h
logging-es-data-master-s68eocj5-1-6pqnn       2/2       Running            0          2h
logging-es-ops-data-master-oa0mmdlh-2-hn86w   2/2       Running            0          2h
logging-es-ops-data-master-te7241sz-1-qf9ps   2/2       Running            0          2h
logging-es-ops-data-master-vvoif0gl-1-r8hd7   2/2       Running            0          2h
logging-fluentd-5z7zv                         0/1       CrashLoopBackOff   33         7h
logging-fluentd-6bb22                         0/1       CrashLoopBackOff   42         7h
logging-fluentd-9rn4q                         0/1       CrashLoopBackOff   41         7h
logging-fluentd-fmwnc                         0/1       CrashLoopBackOff   34         7h
logging-fluentd-gs45m                         0/1       CrashLoopBackOff   33         7h
logging-fluentd-kh52s                         0/1       CrashLoopBackOff   42         7h
logging-fluentd-m5zh8                         0/1       CrashLoopBackOff   33         7h
logging-fluentd-mw6g5                         0/1       CrashLoopBackOff   34         7h
logging-fluentd-zqvxv                         0/1       CrashLoopBackOff   32         7h
logging-kibana-2-zvstx                        2/2       Running            0          2h
logging-kibana-ops-2-ltcbd                    2/2       Running            0          2h

Comment 15 Anping Li 2018-03-01 10:16:31 UTC
Typo in https://bugzilla.redhat.com/show_bug.cgi?id=1538971#c13: "It should be a pod/cri issue" should be "It should be a pod/cni issue".

Comment 17 Avesh Agarwal 2018-03-02 16:44:25 UTC
Anping,

Could you provide me the DS controller logs when this happens?

Comment 18 Avesh Agarwal 2018-03-02 16:53:18 UTC
Can I get access to the cluster where it is happening?

Comment 19 Avesh Agarwal 2018-03-02 17:11:10 UTC
Meanwhile, I am trying to reproduce this in my local cluster.

Comment 21 Avesh Agarwal 2018-03-05 18:38:54 UTC
I have been trying to reproduce this with some daemonsets (not fluentd) and have been unable to reproduce it. However, my test cluster is pretty small (one master and one node), so I am not sure whether that plays any role here. I have taken the following steps:

1. Deploy OpenShift 3.7 as one master and one node (I had to downgrade golang to 1.8.3 to compile 3.7).
2. Create 5 daemonsets (a minimal example follows this list).
3. Stop the master and node (note that the pods stay in the etcd store).
4. Delete the docker containers.
5. Deploy the latest origin (should be very close to 3.9) as one master and node with the same config and etcd storage used with 3.7 (I had to upgrade golang back to 1.9 to compile 3.9).
6. Noticed that all daemonsets and their pods started fine and new containers were created.
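
For reference, a throwaway daemonset of the sort used in this test could be created like this (hypothetical example; the busybox image and the test-ds name are assumptions, and extensions/v1beta1 is the DaemonSet API group available in both 3.7 and 3.9):

cat <<EOF | oc create -f -
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: test-ds
spec:
  template:
    metadata:
      labels:
        app: test-ds
    spec:
      containers:
      - name: sleeper
        image: busybox
        command: ["sleep", "3600"]
EOF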

Comment 22 Anping Li 2018-03-06 02:05:06 UTC
@seth, I will provide the environment when this happens.

@Avesh, to reproduce this quickly, I think we can try the following steps, which are similar to what openshift-ansible does:
1. Deploy OpenShift v3.7.
2. Deploy logging v3.7 on OpenShift v3.7.
3. Upgrade OpenShift:
  3.1) oc adm drain $node --ignore-daemonsets=false
  3.2) upgrade to v3.9
  3.3) make the node schedulable again
4. Watch the ds pod status.

Comment 23 Seth Jennings 2018-03-06 19:33:30 UTC
Time is up on 3.9; moving to 3.9.z while waiting for reproduction.

Reapplying needinfo.  Please clear it when the issue is reproduced and additional information is available.

Comment 24 Anping Li 2018-03-19 10:09:28 UTC
No such issue in recent upgrades.

Comment 25 Seth Jennings 2018-03-19 14:28:47 UTC
Please reopen if this happens again with:
oc get ds <name> -o yaml
oc get po -l <ds-label> -o yaml
and node logs (if possible)
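
(For the node logs, something along these lines would presumably do, assuming the atomic-openshift-node and docker service names used on OCP 3.9 nodes:)

journalctl -u atomic-openshift-node --since '2 hours ago' > node.log
journalctl -u docker --since '2 hours ago' > docker.log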

