Bug 1713219
| Field | Value |
| --- | --- |
| Summary | [DOCS] Document how to add new host in to Ready nodes after a node is stopped in AWS |
| Product | OpenShift Container Platform |
| Reporter | Xingxing Xia <xxia> |
| Component | Documentation |
| Assignee | Andrea Hoffer <ahoffer> |
| Status | CLOSED CURRENTRELEASE |
| QA Contact | Xingxing Xia <xxia> |
| Severity | medium |
| Docs Contact | Vikram Goyal <vigoyal> |
| Priority | medium |
| Version | 4.1.0 |
| CC | agarcial, amcdermo, aos-bugs, bleanhar, calfonso, jokerman, mifiedle, mmccomas |
| Target Milestone | --- |
| Keywords | Regression |
| Target Release | 4.1.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2019-05-31 01:49:11 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
Description (Xingxing Xia, 2019-05-23 07:52:35 UTC):
The Machine API is not responsible for what actually runs inside the AMI. Given the logs, it seems the kubeconfig is not properly generated/provided. Switching to the installer team to re-evaluate. As far as the Machine API is involved in the process, the AWS actuator sends a request to the AWS API to provision an instance given the AMI and other AWS provider-specific information. The Machine API does not see anything past the point where an AWS instance starts to run.

I experimented with the following:

- Stand up a cluster; all nodes running.
- Stop one node via the AWS console (running -> stopped). The node goes NotReady.
- The machine-controller creates a new backing instance, because the actuator's current "does instance exist" check requires the instance to be in the 'running' state.
- I then resumed the instance in the AWS console. The existing node transitions back to Ready, and the AWS actuator deletes the newly created instance that failed to turn into a node, so we don't appear to leak instances.

In Xingxing's case I believe DR is being tested. Since we no longer auto-approve client CSRs the way we used to (https://bugzilla.redhat.com/show_bug.cgi?id=1707162), the master recovery docs likely need to be updated with details on how to approve them. Tomas, could you take a look?

(In reply to Brenton Leanhardt from comment #3)
> In Xingxing's case I believe DR is being tested. Since we no longer auto
> approve client CSRs the way we used to
> (https://bugzilla.redhat.com/show_bug.cgi?id=1707162) the master recovery
> docs likely need to be updated with details on how to approve them. Tomas,
> could you take a look?

And if you manually approve the CSRs, then the node comes up just fine. Thanks.

Right, I tried approving the CSRs with `oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve`, and then the new host shows as Ready in `oc get node`.

Converting to a Doc bug per comment 3 for the doc mentioned in comment 0.

I've updated the disaster recovery docs PR [1] to add manual steps for approving the CSRs. @Xingxing Xia, could you please try it out? You can preview it here [2]; the updates were made to step 2.

[1] https://github.com/openshift/openshift-docs/pull/14859
[2] http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html

LGTM, thanks
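For anyone following the same recovery path, here is a minimal shell sketch of the manual CSR approval flow discussed above. The jq pipeline is quoted from the comment thread; the surrounding commands and the note about re-running the approval are illustrative assumptions, not steps taken from the official docs.

```bash
# Inspect the certificate signing requests; pending ones have no status yet.
oc get csr

# Approve every pending CSR in one pass (the same pipeline quoted in the
# comment above). A recovered or new node typically needs a client CSR
# approved first and then a serving CSR shortly after, so this may need to
# be run more than once.
oc get csr -o json | jq -r '.items[] | select(.status == {}) | .metadata.name' \
  | xargs oc adm certificate approve

# Confirm the node has transitioned back to Ready.
oc get nodes
```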