Bug 1475045

Summary: [free-stg] hawkular-cassandra failing to attach volume
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Status: CLOSED DEFERRED
QA Contact: Jianwei Hou <jhou>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.6.0
CC: aos-bugs, aos-storage-staff, bchilds, eparis, hchiramm, jupierce, mwoodson, sdodson, tsmetana
Target Milestone: ---
Target Release: 3.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-15 19:23:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Justin Pierce 2017-07-25 23:48:23 UTC
Description of problem:
When running metrics in free-stg, I see the hawkular-metrics pods unable to attach the volumes backing their PVCs.

Version-Release number of selected component (if applicable):
3.6.170.

How reproducible:


Steps to Reproduce:
1. Use openshift-ansible to upgrade the metrics image version,
   OR update the image specified in the hawkular-cassandra-1 and hawkular-cassandra-2 replication controllers
2. Delete the existing pods

Actual results:
[root@free-stg-master-03fb6 ~]# oc describe pod hawkular-cassandra-2-0t4p3
Name:			hawkular-cassandra-2-0t4p3
Namespace:		openshift-infra
Security Policy:	restricted
Node:			ip-172-31-73-38.us-east-2.compute.internal/172.31.73.38
Start Time:		Tue, 25 Jul 2017 23:14:00 +0000
Labels:			metrics-infra=hawkular-cassandra
			name=hawkular-cassandra-2
			type=hawkular-cassandra
Annotations:		kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"hawkular-cassandra-2","uid":"83986305-1b04-11...
			openshift.io/scc=restricted
Status:			Pending
IP:			
Controllers:		ReplicationController/hawkular-cassandra-2
Containers:
  hawkular-cassandra-2:
    Container ID:	
    Image:		registry.ops.openshift.com/openshift3/metrics-cassandra:v3.6.170
    Image ID:		
    Ports:		9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	3725m
      memory:	2G
    Requests:
      cpu:	223m
      memory:	1200M
    Readiness:	exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CASSANDRA_MASTER:			false
      CASSANDRA_DATA_VOLUME:		/cassandra_data
      JVM_OPTS:				-Dcassandra.commitlog.ignorereplayerrors=true
      POD_NAMESPACE:			openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:			2000000000 (limits.memory)
      CPU_LIMIT:			3725 (limits.cpu)
      TRUSTSTORE_CLIENT_AUTHORITIES:	/hawkular-cassandra-certs/tls.client.truststore.crt
      TRUSTSTORE_NODES_AUTHORITIES:	/hawkular-cassandra-certs/tls.peer.truststore.crt
    Mounts:
      /cassandra_data from cassandra-data (rw)
      /hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-nq5mt (ro)
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  cassandra-data:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	metrics-cassandra-2
    ReadOnly:	false
  hawkular-cassandra-certs:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-cassandra-certs
    Optional:	false
  cassandra-token-nq5mt:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	cassandra-token-nq5mt
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	type=infra
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----			-------------	--------	------		-------
  18m		18m		1	default-scheduler			Normal		Scheduled	Successfully assigned hawkular-cassandra-2-0t4p3 to ip-172-31-73-38.us-east-2.compute.internal
  18m		18m		1	attachdetach				Warning		FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: dfc0146e-0747-48b2-94cf-5cb10de29774. The volume is currently attached to instance "i-09f72ada4b954305f"
  16m		16m	1	attachdetach		Warning	FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: bfafbb66-e70b-41d1-af37-8be49963258b. The volume is currently attached to instance "i-09f72ada4b954305f"
  14m		14m	1	attachdetach		Warning	FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: bf8d1ebc-eb29-4026-8ca2-e7750536147a. The volume is currently attached to instance "i-09f72ada4b954305f"
  12m		12m	1	attachdetach		Warning	FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: f6b6efbb-71d2-4f0d-bc24-8703e3810572. The volume is currently attached to instance "i-09f72ada4b954305f"
  10m		10m	1	attachdetach		Warning	FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: 0ad9354e-954c-4a98-8f3a-a948ef21d0f8. The volume is currently attached to instance "i-09f72ada4b954305f"
  8m		8m	1	attachdetach		Warning	FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: 4745f195-f832-4743-83ac-ba333cf29cfb. The volume is currently attached to instance "i-09f72ada4b954305f"
  6m		6m	1	attachdetach		Warning	FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: ded6b96f-fb07-4c1c-ac30-a9576e3918fe. The volume is currently attached to instance "i-09f72ada4b954305f"
  4m		4m	1	attachdetach		Warning	FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: b9a301b7-8e2b-44c9-aa5f-c87eaa68ab3c. The volume is currently attached to instance "i-09f72ada4b954305f"
  2m		2m	1	attachdetach		Warning	FailedMount	Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: 27b444b8-d4fa-455f-86ea-c6c5ebf505a6. The volume is currently attached to instance "i-09f72ada4b954305f"
  16m		1m	8	kubelet, ip-172-31-73-38.us-east-2.compute.internal		Warning	FailedMount	Unable to mount volumes for pod "hawkular-cassandra-2-0t4p3_openshift-infra(f0f59169-718e-11e7-885d-0203ad7dfcd7)": timeout expired waiting for volumes to attach/mount for pod "openshift-infra"/"hawkular-cassandra-2-0t4p3". list of unattached/unmounted volumes=[cassandra-data]
  16m		1m	8	kubelet, ip-172-31-73-38.us-east-2.compute.internal		Warning	FailedSync	Error syncing pod
  30s		30s	1	attachdetach							Warning	FailedMount	(combined from similar events): Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node "ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume "vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: vol-03add028b3e47f20e is already attached to an instance
		status code: 400, request id: 8732ce90-bf3d-4a34-b965-471fb28c1996. The volume is currently attached to instance "i-09f72ada4b954305f"
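Every FailedMount event above carries the same three identifiers: the EBS volume, the instance the attach was attempted on, and the instance that still holds the volume. A minimal sketch for pulling those out of an event message when triaging; the function name and regular expression are illustrative assumptions, not part of any OpenShift tooling, and the sample message is copied from the first event above.

```python
import re

# Hypothetical pattern matching the attach/detach controller message quoted above.
ATTACH_ERROR = re.compile(
    r'Error attaching EBS volume "(?P<volume>vol-[0-9a-f]+)" '
    r'to instance "(?P<target>i-[0-9a-f]+)": VolumeInUse'
    r'.*?currently attached to instance "(?P<holder>i-[0-9a-f]+)"',
    re.DOTALL,  # the message wraps onto a second line in the event output
)


def parse_volume_in_use(message):
    """Return the volume, target and holder IDs from a VolumeInUse event, or None."""
    match = ATTACH_ERROR.search(message)
    return match.groupdict() if match else None


# First FailedMount event from the pod description, including its wrapped second line.
event = (
    'Failed to attach volume "pvc-865d3bb8-1b04-11e7-b170-02306c0cdc4b" on node '
    '"ip-172-31-73-38.us-east-2.compute.internal" with: Error attaching EBS volume '
    '"vol-03add028b3e47f20e" to instance "i-0fbe8f89fbd2d45bd": VolumeInUse: '
    'vol-03add028b3e47f20e is already attached to an instance\n'
    '\t\tstatus code: 400, request id: dfc0146e-0747-48b2-94cf-5cb10de29774. '
    'The volume is currently attached to instance "i-09f72ada4b954305f"'
)

info = parse_volume_in_use(event)
print(info)
```

For the event shown, `target` is the instance the pod was scheduled on and `holder` is the instance the volume was never detached from.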



Expected results:
hawkular-cassandra should be able to restart consistently. 


Additional info:

i-09f72ada4b954305f is free-stg-node-infra-e2af0
i-0fbe8f89fbd2d45bd is free-stg-node-infra-a18ed
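
Combining this mapping with the IDs in the FailedMount events shows which node still holds the volume. A small sketch of that lookup, assuming only the two instance-to-node pairs listed above; the helper name and the output wording are made up for illustration.

```python
# Instance-to-node mapping as given in this report's "Additional info".
NODE_BY_INSTANCE = {
    "i-09f72ada4b954305f": "free-stg-node-infra-e2af0",
    "i-0fbe8f89fbd2d45bd": "free-stg-node-infra-a18ed",
}


def describe_conflict(volume, target, holder, nodes=NODE_BY_INSTANCE):
    """Summarize which node holds a volume and which node is waiting for it."""
    # Fall back to a placeholder when an instance is not in the mapping.
    holder_node = nodes.get(holder, "unknown node")
    target_node = nodes.get(target, "unknown node")
    return (
        f"{volume} is still attached to {holder_node} ({holder}); "
        f"attach requested on {target_node} ({target})"
    )


# IDs taken from the FailedMount events for hawkular-cassandra-2-0t4p3.
summary = describe_conflict(
    volume="vol-03add028b3e47f20e",
    target="i-0fbe8f89fbd2d45bd",
    holder="i-09f72ada4b954305f",
)
print(summary)
```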

http://file.rdu.redhat.com/~jupierce/share/hawkular-cassandra-1-5hzn0.log
http://file.rdu.redhat.com/~jupierce/share/metrics-rcs.yaml

Comment 2 Justin Pierce 2017-07-25 23:51:00 UTC
The other hawkular-cassandra pod:

Name:			hawkular-cassandra-1-5hzn0
Namespace:		openshift-infra
Security Policy:	restricted
Node:			ip-172-31-75-47.us-east-2.compute.internal/172.31.75.47
Start Time:		Tue, 25 Jul 2017 23:14:00 +0000
Labels:			metrics-infra=hawkular-cassandra
			name=hawkular-cassandra-1
			type=hawkular-cassandra
Annotations:		kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"hawkular-cassandra-1","uid":"82c0edb1-1b04-11...
			openshift.io/scc=restricted
Status:			Running
IP:			10.131.2.146
Controllers:		ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:	docker://8b83cedd009fb629a366822f285e0a3ea1108e67ea55c01105f7e52174b45845
    Image:		registry.ops.openshift.com/openshift3/metrics-cassandra:v3.6.170
    Image ID:		docker-pullable://registry.ops.openshift.com/openshift3/metrics-cassandra@sha256:ab9a1feef28b0cd6f47e3da665cb41a1499a8a2f5f3f6e1563a1ce3829ff4c24
    Ports:		9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
    State:		Running
      Started:		Tue, 25 Jul 2017 23:14:50 +0000
    Ready:		True
    Restart Count:	0
    Limits:
      cpu:	3725m
      memory:	2G
    Requests:
      cpu:	223m
      memory:	1200M
    Readiness:	exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CASSANDRA_MASTER:			true
      CASSANDRA_DATA_VOLUME:		/cassandra_data
      JVM_OPTS:				-Dcassandra.commitlog.ignorereplayerrors=true
      POD_NAMESPACE:			openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:			2000000000 (limits.memory)
      CPU_LIMIT:			3725 (limits.cpu)
      TRUSTSTORE_CLIENT_AUTHORITIES:	/hawkular-cassandra-certs/tls.client.truststore.crt
      TRUSTSTORE_NODES_AUTHORITIES:	/hawkular-cassandra-certs/tls.peer.truststore.crt
    Mounts:
      /cassandra_data from cassandra-data (rw)
      /hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-nq5mt (ro)
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	True 
  PodScheduled 	True 
Volumes:
  cassandra-data:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	metrics-cassandra-1
    ReadOnly:	false
  hawkular-cassandra-certs:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-cassandra-certs
    Optional:	false
  cassandra-token-nq5mt:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	cassandra-token-nq5mt
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	type=infra
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath				Type		Reason		Message
  ---------	--------	-----	----							-------------				--------	------		-------
  36m		36m		1	default-scheduler										Normal		Scheduled	Successfully assigned hawkular-cassandra-1-5hzn0 to ip-172-31-75-47.us-east-2.compute.internal
  35m		35m		1	kubelet, ip-172-31-75-47.us-east-2.compute.internal	spec.containers{hawkular-cassandra-1}	Normal		Pulling		pulling image "registry.ops.openshift.com/openshift3/metrics-cassandra:v3.6.170"
  35m		35m		1	kubelet, ip-172-31-75-47.us-east-2.compute.internal	spec.containers{hawkular-cassandra-1}	Normal		Pulled		Successfully pulled image "registry.ops.openshift.com/openshift3/metrics-cassandra:v3.6.170"
  35m		35m		1	kubelet, ip-172-31-75-47.us-east-2.compute.internal	spec.containers{hawkular-cassandra-1}	Normal		Created		Created container
  35m		35m		1	kubelet, ip-172-31-75-47.us-east-2.compute.internal	spec.containers{hawkular-cassandra-1}	Normal		Started		Started container
  35m		35m		1	kubelet, ip-172-31-75-47.us-east-2.compute.internal	spec.containers{hawkular-cassandra-1}	Warning		Unhealthy	Readiness probe failed: Could not get the Cassandra status. This may mean that the Cassandra instance is not up yet. Will try again
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss