1866778 – [aws] mapi_instance_create_failed doesn't work on aws

Bug 1866778 - [aws] mapi_instance_create_failed doesn't work on aws

Summary: [aws] mapi_instance_create_failed doesn't work on aws

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Danil Grigorev
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-08-06 10:26 UTC by sunzhaohua
Modified:	2020-10-27 16:25 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:25:25 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-aws pull 346	0	None	closed	BUG 1866778: [aws] mapi_instance_create_failed doesn't work on aws	2020-12-18 06:25:58 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:25:42 UTC

Description sunzhaohua 2020-08-06 10:26:44 UTC

Description of problem:
mapi_instance_create_failed doesn't work on aws

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-05-013608

How reproducible:
Always

Steps to Reproduce:
1.Create a failed machine by setting ami to an invalid one
2.Check prometheus metrics
3.

Actual results:
Prometheus web console show "No datapoints found".

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$  oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "mapi_instance_"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64475    0 64475    0     0   530k      0 --:--:-- --:--:-- --:--:--  533k
$ oc get machine 
NAME                                      PHASE     TYPE        REGION      ZONE         AGE
zhsunaws5-xftr9-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   31h
zhsunaws5-xftr9-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   31h
zhsunaws5-xftr9-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   31h
zhsunaws5-xftr9-worker-us-east-2a-kngrn   Running   m5.large    us-east-2   us-east-2a   31h
zhsunaws5-xftr9-worker-us-east-2b-6nk45   Running   m5.large    us-east-2   us-east-2b   31h
zhsunaws5-xftr9-worker-us-east-2c-v6lqj   Failed                                         4h46m

Expected results:
Should show mapi_instance_create_failed detail info.

Additional info:

Comment 1 Joel Speed 2020-08-06 11:47:38 UTC

At present, the metric only registers a failure if the create fails when we send the request to the AWS api. I would have expected this to be updated whenever we return an invalid machine configuration error from the actuator create. @Danil do you think we should move the metric updates into the machine controller rather than inside each actuator?

Comment 2 Alberto 2020-08-06 12:13:29 UTC

I'd expect the actuator metrics to track cloud api requests. 

Generic machine phase is tracked here https://github.com/openshift/machine-api-operator/blob/master/pkg/metrics/metrics.go#L25

Comment 3 Joel Speed 2020-08-06 12:23:33 UTC

In that case we have a couple of options, we can say that the behaviour is correct presently and mark this not a bug, or we can update the actuator to do a metric update for create failed any time a call to the API fails, eg like this one where we look up the ami and that fails, I'm not entirely sure what the intention for this metric was

Comment 4 Alberto 2020-08-06 12:31:55 UTC

>or we can update the actuator to do a metric update for create failed any time a call to the API fails

I'd be in favour of this. I think the actuator metrics are useful to measure the particular cloud response.

Comment 5 Danil Grigorev 2020-08-07 08:19:21 UTC

Closing this BZ. It is not related to a misbehaving metric, but the right recognition between validation error, and an actual resource creation. AWS mapi_instance_create_failed will obtain data points once there would be an actual API call, trying to create some resources such as ec2 instance or load balancer, and more importantly - this request would fail.

This metric should ignore errors related to validation, and this is what was observed.

E0807 07:46:14.364136       1 actuator.go:66] test-v7xqm error: test-v7xqm: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0a9ac601f401105bc"
	status code: 400, request id: dae9a535-83ac-4bbf-8c62-cc26bbe80b52
W0807 07:46:14.364177       1 controller.go:315] test-v7xqm: failed to create machine: test-v7xqm: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0a9ac601f401105bc"
	status code: 400, request id: dae9a535-83ac-4bbf-8c62-cc26bbe80b52
I0807 07:46:14.364189       1 controller.go:415] Actuator returned invalid configuration error: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0a9ac601f401105bc"
	status code: 400, request id: dae9a535-83ac-4bbf-8c62-cc26bbe80b52

Comment 6 Joel Speed 2020-08-07 10:08:11 UTC

@Danil I believe this to be a genuine bug. There are multiple places within the `Create` method that make API calls as part of the process to construct the call to create the instance. In the example given, we try to get the AMI referenced, we call the AWS API to fetch this, it does not exist (and therefore would cause the create call to fail anyway), so I believe this should be covered by the mapi_instance_create_failed metric as well based on what Alberto was saying above.

We should check all the times in the create method that we make an API call, if any fail, this is a failure in the instance creation and we should record that, currently we are only doing a very small subset.

Comment 7 Danil Grigorev 2020-08-20 12:53:39 UTC

Re-opening this after discussion, as it will be a better approach to cover all API calls in AWS populating/validating provided machine spec similar way as it is done in Azure or GCP. Will work on that now.

Comment 10 sunzhaohua 2020-09-07 03:15:21 UTC

Verified.
clusterversion: 4.6.0-0.nightly-2020-09-05-015624
Created machine with invalid ami or instanceType, it works similar as in Azure or GCP.

Comment 12 errata-xmlrpc 2020-10-27 16:25:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Note You need to log in before you can comment on or make changes to this bug.