1956507 – master MCP degraded w possibly corrupted configMap after upgrade

Summary: master MCP degraded w possibly corrupted configMap after upgrade

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Yu Qi Zhang
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-03 19:53 UTC by milti leonard
Modified:	2024-06-14 01:26 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-05-20 00:11:46 UTC
Target Upstream Version:
Embargoed:

Description milti leonard 2021-05-03 19:53:00 UTC

Description of problem:

Upgraded from 4.4.9 to 4.5.24 using a local mirror. The process appeared to run successfully; however, the current configuration reports:

Failed to resync 4.5.24 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 3, updated: 3, unavailable: 0)
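
The message above only says that the master pool is degraded; the reason is normally recorded in the pool's own status conditions. The following is a minimal Python sketch, assuming the pool object has been exported as JSON (for example with `oc get mcp master -o json`). The helper name `degraded_conditions` and the sample data are hypothetical and for illustration only.

```python
import json


def degraded_conditions(pool: dict) -> list:
    """Return the pool's '*Degraded' conditions whose status is "True"."""
    conditions = pool.get("status", {}).get("conditions", [])
    return [
        c for c in conditions
        if c.get("type", "").endswith("Degraded") and c.get("status") == "True"
    ]


# Hypothetical, trimmed-down pool object; not taken from the affected cluster.
sample_pool = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Updated", "status": "True", "message": ""},
      {"type": "NodeDegraded", "status": "False", "message": ""},
      {"type": "RenderDegraded", "status": "True",
       "message": "Failed to render configuration for pool master"}
    ]
  }
}
""")

for cond in degraded_conditions(sample_pool):
    # Print each active degraded condition with its explanatory message.
    print(cond["type"], "-", cond["message"])
```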


Version-Release number of selected component (if applicable):
4.5.24

How reproducible:
not very

Steps to Reproduce:
1.
2.
3.

Actual results:
upgrade reports success but master MCP reports a degraded status

Expected results:
upgrade succeeds and MCP is healthy


Additional info:

Comment 4 milti leonard 2021-05-05 16:20:44 UTC

@jerzhang, the customer is hitting RBAC issues collecting both the must-gather and the inspection. Will update the BZ when either becomes available.

Comment 8 milti leonard 2021-05-06 15:07:56 UTC

@jerzhang, can you pull the file bundle from supportshell? That would be quicker than me downloading, splitting, and attaching it to this ticket. The RBAC errors that I posted in the BZ previously are what led me to believe that the CM corruption is preventing them from getting a must-gather (I could be wrong, I frequently am). And yes, this is the result of an upgrade.

Comment 12 milti leonard 2021-05-10 16:48:19 UTC

@jerzhang, must-gather has been executed and attached to the ticket.

Comment 13 milti leonard 2021-05-10 16:52:26 UTC

the error msg for the master MCP has changed ever-so-slightly:

  - lastTransitionTime: "2021-04-29T16:46:04Z"
    message: |-
      Failed to render configuration for pool master: parsing Ignition config failed with error: config is not valid
      Report: error at line 1, column 1178
          1: {"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"files":[{"contents":{"source":"data:text/plain;charset=utf-8;base64,IyBTcGVjaWZ5IHRpbWUgc291cmNlcy4Kc2VydmVyICAgbnRwMmEubWwuY29tCnNlcnZlciAgIG50 cDJiLm1sLmNvbQpzZXJ2ZXIgICBudHAyYy5tbC5jb20Kc2VydmVyICAgbnRwMmQubWwuY29tCgoj IFJlY29yZCB0aGUgcmF0ZSBhdCB3aGljaCB0aGUgc3lzdGVtIGNsb2NrIGdhaW5zL2xvc3NlcyB0 aW1lLgpkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0CgojIEFsbG93IHRoZSBzeXN0ZW0g Y2xvY2sgdG8gYmUgc3RlcHBlZCBpbiB0aGUgZmlyc3QgdGhyZWUgdXBkYXRlcwojIGlmIGl0cyBv ZmZzZXQgaXMgbGFyZ2VyIHRoYW4gMSBzZWNvbmQuCm1ha2VzdGVwIDEuMCAzCgojIEVuYWJsZSBr ZXJuZWwgc3luY2hyb25pemF0aW9uIG9mIHRoZSByZWFsLXRpbWUgY2xvY2sgKFJUQykuCnJ0Y3N5 bmMKCiMgSW5jcmVhc2UgdGhlIG1pbmltdW0gbnVtYmVyIG9mIHNlbGVjdGFibGUgc291cmNlcyBy ZXF1aXJlZCB0byBhZGp1c3QKIyB0aGUgc3lzdGVtIGNsb2NrLgptaW5zb3VyY2VzIDIKCiMgU3Bl Y2lmeSBmaWxlIGNvbnRhaW5pbmcga2V5cyBmb3IgTlRQIGF1dGhlbnRpY2F0aW9uLgprZXlmaWxl IC9ldGMvY2hyb255LmtleXMKCiMgR2V0IFRBSS1VVEMgb2Zmc2V0IGFuZCBsZWFwIHNlY29uZHMg ZnJvbSB0aGUgc3lzdGVtIHR6IGRhdGFiYXNlLgpsZWFwc2VjdHogcmlnaHQvVVRDCgojIFNwZWNp ZnkgZGlyZWN0b3J5IGZvciBsb2cgZmlsZXMuCmxvZ2RpciAvdmFyL2xvZy9jaHJvbnkK
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ^
      invalid data character
    reason: ""
    status: "True"
    type: RenderDegraded

It could just be formatting, but I don't think so.
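
One way to tell whether an "invalid data character" error comes from the base64 payload itself rather than from display formatting is to check the payload strictly. The following is a rough Python sketch under stated assumptions: the helper names (`payload_of`, `find_invalid_base64`, `decodes_cleanly`) are hypothetical, and the sample string is a short stand-in, not the customer's file.

```python
import base64
import binascii
import re

MARKER = ";base64,"
ALLOWED = re.compile(r"[A-Za-z0-9+/=]")  # standard base64 alphabet plus padding


def payload_of(source: str) -> str:
    """Return the part of a data URL that follows ';base64,'."""
    _, sep, payload = source.partition(MARKER)
    if not sep:
        raise ValueError("not a base64 data URL")
    return payload


def find_invalid_base64(source: str) -> list:
    """List (offset, character) for each character outside the base64 alphabet."""
    return [
        (i, ch) for i, ch in enumerate(payload_of(source))
        if not ALLOWED.fullmatch(ch)
    ]


def decodes_cleanly(source: str) -> bool:
    """True if the payload passes strict base64 decoding."""
    try:
        base64.b64decode(payload_of(source), validate=True)
        return True
    except binascii.Error:
        return False


# Illustrative stand-in content (assumption), encoded without line breaks.
prefix = "data:text/plain;charset=utf-8;base64,"
good = prefix + base64.b64encode(b"server ntp.example.com\n").decode("ascii")
# Simulate a stray space eight characters into the payload.
bad = good[: len(prefix) + 8] + " " + good[len(prefix) + 8 :]

print(find_invalid_base64(good))
print(find_invalid_base64(bad))
print(decodes_cleanly(good), decodes_cleanly(bad))
```

Strict decoding (`validate=True`) matters here: by default `base64.b64decode` silently discards characters outside the alphabet, which would hide exactly this kind of problem.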

Comment 15 milti leonard 2021-05-19 17:56:09 UTC

@jerzhang, the customer sent this comment to the ticket:

The base64 data in that yaml was from an attempt to update the ntp server configuration to conform to bank standards.   The base64 text had embedded spaces, which I missed when creating it.  No cr/lf characters, but there were bad characters in the encoding.  I've fixed that, and the ntp.conf files are now correct, meaning the patch apply was successful.  About 15 minutes later, the operators recovered and the cluster now seems to be clear.  I knew there had been a problem with the ntp change apply, but I didn't expect it to cause this level of error with no clear explanation of why it happened.  In any case, thanks for the help, I will leave the cluster as it is for a while, and if it remains normally operational, I will attempt the next stage of the update to 4.6.
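
In the payload quoted in comment 13, the spaces appear to fall after every 76 characters. That pattern would be consistent with line-wrapped base64 output (GNU `base64` wraps at 76 columns unless given `-w 0`), although the report does not say how the text was produced. As a sketch only, the `source` value could instead be generated in Python, where `base64.b64encode` emits no line breaks. The helper names `to_data_url` and `from_data_url` and the chrony text below are illustrative assumptions, not the customer's configuration.

```python
import base64

PREFIX = "data:text/plain;charset=utf-8;base64,"


def to_data_url(text: str) -> str:
    """Encode text as a single-line base64 data URL (no wrapping, no spaces)."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return PREFIX + encoded


def from_data_url(source: str) -> str:
    """Strictly decode a data URL produced by to_data_url."""
    if not source.startswith(PREFIX):
        raise ValueError("unexpected data URL prefix")
    payload = source[len(PREFIX):]
    return base64.b64decode(payload, validate=True).decode("utf-8")


# Illustrative chrony configuration (assumption, not the customer's file).
chrony_conf = (
    "# Specify time sources.\n"
    "server ntp1.example.com\n"
    "server ntp2.example.com\n"
    "driftfile /var/lib/chrony/drift\n"
    "makestep 1.0 3\n"
    "rtcsync\n"
)

source = to_data_url(chrony_conf)
# The encoded value is longer than 76 characters yet contains no whitespace.
print(len(source), any(ch.isspace() for ch in source))
# Round-trip check before the value goes into a MachineConfig.
print(from_data_url(source) == chrony_conf)
```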


They were able to correct the MC and proceed to the final upgrade of the cluster. The support ticket has been closed, if you would like to close this BZ.

Comment 16 Yu Qi Zhang 2021-05-20 00:11:46 UTC

Ok, thank you for the update. Closing this bug.
