Bug 2047318
| Summary: | [GSS] pgs stay in active+degraded state on new ceph cluster | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Samuel Blais-Dowdy <sblaisdo> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Elad <ebenahar> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | madam, ocs-bugs, sabose |
| Target Milestone: | --- | Flags: | sblaisdo: needinfo+ |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-08 20:23:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Samuel Blais-Dowdy
2022-01-27 15:34:47 UTC
One of the OSDs is not up. Since must-gather did not run, is there a way to get the Rook operator logs and attach them to the BZ? To troubleshoot why the OSD is down, please also gather:
- "ceph osd tree" in the toolbox, which will show which osd is down, if it's not obvious from a crashing pod
- Logs for the down osd pod
- OSD pod description for the osd that is down

--- $ ceph osd tree ---
sh-4.4$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                           STATUS  REWEIGHT  PRI-AFF
 -1         2.00000  root default
 -5         2.00000      region us-east-2
 -4         1.00000          zone us-east-2b
 -3         1.00000              host default-0-data-0pv72z
  0    ssd  1.00000                  osd.0               up      1.00000   1.00000
-10         1.00000          zone us-east-2c
 -9         1.00000              host default-2-data-0cd28h
  2    ssd  1.00000                  osd.2               up      1.00000   1.00000
  1         0                osd.1                       down    0         1.00000

--- OSD down pod description/logs ---
As the pod does not exist, and was never scheduled, this information is unfortunately unavailable. If the pods were grouped in a DaemonSet or Deployment, I believe we would receive better updates and information, and we would clearly see 2/3 pods running, hinting at an issue.

Samuel, is there an OSD deployment that is unschedulable? Since osd.1 exists, its OSD prepare job completed successfully, and likely the OSD deployment is pending for some reason. But without a rook operator log, or a description of the rook-ceph-osd-1 deployment, it's difficult to diagnose. Getting the must-gather access will really help troubleshoot.

I dug into the osd.1 prepare jobs (attached the logs), and it seems the job actually failed to execute properly. There are some Python tracebacks, even though it reported as Completed Successfully. ceph-volume is failing just to list whether there are any OSDs already on the volume, with several stacks like this:
2022-01-26 23:10:36.953704 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/default-1-data-0f8266 --format json
2022-01-26 23:10:37.200585 E | cephosd: . Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 39, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
self.list(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
report = self.generate(args.device)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
whoami = oj[dev]['whoami']
KeyError: 'whoami'
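The traceback ends at `whoami = oj[dev]['whoami']` in `ceph_volume/devices/raw/list.py`: the corrupted PV has a bluestore label entry without the `whoami` (OSD id) key, so the plain dict lookup raises. A minimal reproduction sketch, with hypothetical stand-in data rather than real `ceph-volume` output:

```python
# Minimal reproduction of the KeyError above. The function mirrors the
# report loop in ceph_volume/devices/raw/list.py, but the label dicts
# are hypothetical stand-ins for real device labels.

def generate(oj):
    """Build a report from per-device bluestore labels."""
    report = {}
    for dev in oj.keys():
        # This is the failing line: a corrupted/blank device yields a
        # label entry that lacks the 'whoami' (OSD id) key.
        whoami = oj[dev]['whoami']
        report[dev] = {'osd_id': whoami}
    return report

# A healthy device label includes 'whoami'; the corrupted PV's does not.
healthy = {'/dev/sdb': {'whoami': '0', 'osd_uuid': 'abc'}}
corrupted = {'/mnt/default-1-data-0f8266': {'osd_uuid': 'abc'}}

print(generate(healthy))
try:
    generate(corrupted)
except KeyError as e:
    print('KeyError:', e)  # same KeyError: 'whoami' as in the log
```

This is why `raw list` crashes just from inspecting the volume: the failure is in reading the on-disk label, before any OSD logic runs.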
The PV seems corrupted. This is just a new cluster, right? Can you wipe it and allow a new OSD to be created? You should just need to run the job template to purge OSD 1. Are you familiar with that job template?
Yes, this was a new cluster (when the addon was deployed). I am not familiar with that job template. Any links to an SOP?

Here are instructions on running the osd removal job: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/deploying_and_managing_openshift_container_storage_using_red_hat_openstack_platform/replacing-storage-devices_osp

Thanks for the link. I followed the procedure and performed the following steps:
1. Delete the pvc `default-1-data-0f8266`
$ oc delete -n openshift-storage pvc default-1-data-0f8266
2. Remove finalizer (should be added to docs):
$ oc patch -n openshift-storage pvc default-1-data-0f8266 -p '{"metadata":{"finalizers":null}}'
3. Can't delete the associated PV as we don't have sufficient backplane permissions (get,list,watch):
$ oc auth can-i --list -n openshift-storage | grep persistentvolume
persistentvolumeclaims [] [] [get list watch create delete deletecollection patch update]
persistentvolumeclaims/status [] [] [get list watch]
persistentvolumes [] [] [get list watch]
4. Run the job template for OSD.1:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 | oc create -n openshift-storage -f -
5. The job keeps spinning up pods that fail. Inspecting the logs:
$ oc logs -n openshift-storage pod/ocs-osd-removal-job--1-97pmd
2022-02-03 19:58:24.508518 I | rookcmd: starting Rook v4.8.5-1 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1'
2022-02-03 19:58:24.508617 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=1, --service-account=
2022-02-03 19:58:24.508625 I | op-mon: parsing mon endpoints: c=172.30.208.123:6789,a=172.30.211.71:6789,b=172.30.144.228:6789
2022-02-03 19:58:24.523431 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-02-03 19:58:24.523551 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-02-03 19:58:24.523625 D | cephclient: config file @ /etc/ceph/ceph.conf: [global]
fsid = 6c31c42c-6328-48f7-8c08-b7877b484c8c
mon initial members = a b c
mon host = [v2:172.30.211.71:3300,v1:172.30.211.71:6789],[v2:172.30.144.228:3300,v1:172.30.144.228:6789],[v2:172.30.208.123:3300,v1:172.30.208.123:6789]
bdev_flock_retry = 20
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
mon_pg_warn_max_object_skew = 0
[osd]
osd_memory_target_cgroup_limit_ratio = 0.5
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
2022-02-03 19:58:24.523760 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/433535554
2022-02-03 19:58:24.825215 I | cephosd: validating status of osd.1
failed to get osd status for osd 1: not found osd.1 in OSDDump
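The "not found osd.1 in OSDDump" failure amounts to the removal job parsing `ceph osd dump --format json` and not finding the requested id in the `osds` array. A sketch of that check, using a trimmed hypothetical dump (and an illustrative helper, not Rook's actual function):

```python
# Sketch of the validation behind "not found osd.1 in OSDDump".
# The JSON below is a trimmed, hypothetical `ceph osd dump --format json`
# output for a cluster where osd.1 has already been purged.
import json

osd_dump_json = '''
{"osds": [{"osd": 0, "up": 1, "in": 1},
          {"osd": 2, "up": 1, "in": 1}]}
'''

def find_osd_status(dump, osd_id):
    """Return the dump entry for osd_id, or None if it is absent."""
    for entry in dump.get('osds', []):
        if entry['osd'] == osd_id:
            return entry
    return None

dump = json.loads(osd_dump_json)
if find_osd_status(dump, 1) is None:
    # Matches the job's error: osd.1 is already gone from Ceph.
    print('failed to get osd status for osd 1: not found osd.1 in OSDDump')
```

So the removal job failing here is actually a sign the purge already happened, which is what the next comment concludes.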
It seems osd.1 is already removed from Ceph:

2022-02-03 19:58:24.825215 I | cephosd: validating status of osd.1
failed to get osd status for osd 1: not found osd.1 in OSDDump

Did the osd prepare job and OSD get deleted? If the OSD is purged from Ceph now, the operator should be able to create a new OSD to replace osd.1. If the old osd.1 was purged and you're not seeing a new OSD be created automatically, try restarting the operator. Note that Ceph will re-use OSD IDs, so the new OSD will likely also be called osd.1.

I removed the prepare job manually:
$ oc delete -n openshift-storage jobs/rook-ceph-osd-prepare-default-1-data-xxxx
Now waiting on the operator to do its magic. It keeps producing this log:
2022-02-03 21:12:33.616357 I | clusterdisruption-controller: all "zone" failure domains: [us-east-2b us-east-2c]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized Count:73} {StateName:active+undersized+degraded Count:23}]"
2022-02-03 21:12:33.617460 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:12:33.618209 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:13:03.625052 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:13:03.625759 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
Both `$ ceph status` and `$ ceph osd status` confirm that osd.1 is gone.
Those messages from the disruption controller are expected until the new OSD is up. Please try restarting the operator (delete the rook operator pod) to trigger a new reconcile for the OSDs.

Manually deleting the operator pod triggered a reconcile, and:
- rook-ceph-osd-prepare-default-1-data-0bbp5l--1-86bwv job ran
- rook-ceph-osd-1-565c85fd5-8rk6c pod was created
--- $ ceph status ---
sh-4.4$ ceph status
cluster:
id: 6c31c42c-6328-48f7-8c08-b7877b484c8c
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 7d)
mgr: a(active, since 7d)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
osd: 3 osds: 3 up (since 62s), 3 in (since 62s)
data:
pools: 3 pools, 96 pgs
objects: 34 objects, 43 KiB
usage: 3.0 GiB used, 3.0 TiB / 3 TiB avail
pgs: 96 active+clean
io:
client: 853 B/s rd, 1 op/s rd, 0 op/s wr
--- $ ceph osd status ---
sh-4.4$ ceph osd status
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | ip-10-177-160-100.us-east-2.compute.internal | 1024M | 1022G | 0 | 0 | 0 | 0 | exists,up |
| 1 | ip-10-177-160-54.us-east-2.compute.internal | 1024M | 1022G | 0 | 0 | 0 | 0 | exists,up |
| 2 | ip-10-177-160-169.us-east-2.compute.internal | 1024M | 1022G | 0 | 0 | 2 | 106 | exists,up |
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
--- Note ---
It seems we did not need to remove the PV after all. Do you think we will ever need to delete PVs with backplane access in the future? I created this follow-up ticket, if you could please comment: https://issues.redhat.com/browse/MTSRE-430
Attaching logs from the new operator pod as 'rook-ceph-operator-new-logs.txt'.
Thank you, I think we were able to resolve the issue.
--- Open questions to investigate ---

1. Will we be required to delete Persistent Volumes?
2. Why did we have to manually delete the rook-ceph-operator pod to trigger a reconciliation?
3. Why did OCS deploy with a corrupted osd.1, and why was the operator unable to reconcile and fix it?
4. Why did the rook-ceph-osd-prepare-default-1-data job succeed when it really deployed a corrupted osd.1?
5. Why did the rook-ceph-operator not detect the missing osd.1 pod?

Great to see the cluster is healthy again with all three OSDs.

(In reply to Samuel Blais-Dowdy from comment #17)

> 1. Will we be required to delete Persistent Volumes?

You shouldn't need to delete PVs directly; deleting the PVC should be sufficient. PVs generally have a deletion policy of Reclaim, but it's defined by the storage class, which you may not have control of.

> 2. Why did we have to manually delete the rook-ceph-operator pod to trigger a reconciliation?

The Rook operator watches for many types of events to automatically trigger a reconcile. For example, if the CephCluster CR is updated, or if any deployment in the Rook namespace is deleted, a new reconcile is started automatically. But Rook does not trigger a reconcile when a job is deleted, which is what happened in this case.

> 3. Why did OCS deploy with a corrupted osd.1, and why was the operator unable to reconcile and fix it?

I've not seen reports of a corrupt OSD immediately after deployment like this. It seems like a rare condition, but if you see it again or multiple times we should investigate further. Sometimes the underlying storage just goes bad.

> 4. Why did the rook-ceph-osd-prepare-default-1-data job succeed when it really deployed a corrupted osd.1?

The OSD did fail because of the corrupt disk. But even if the job had failed, Rook wouldn't be able to fix it automatically; Rook expects manual intervention when the underlying storage is bad.

> 5. Why did the rook-ceph-operator not detect the missing osd.1 pod?

The operator log did show it was waiting for the other OSD to start. The operator just couldn't recover automatically, so it was stuck.

One thing that bothers me is that the osd.1 prepare job was in a failed state (Python traceback), but reported Completed/Success:
2022-01-26 23:10:36.953704 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/default-1-data-0f8266 --format json
2022-01-26 23:10:37.200585 E | cephosd: . Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 39, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
self.list(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
report = self.generate(args.device)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
whoami = oj[dev]['whoami']
KeyError: 'whoami'
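The Completed status despite the traceback is consistent with how Job/pod status works: Kubernetes only looks at the main container process's exit code. If a wrapper tolerates a failing subprocess and itself exits 0, the Job reports success. A minimal sketch of that mechanism (a hypothetical wrapper, not Rook's actual code):

```python
import subprocess
import sys

# Hypothetical wrapper: run a child process that dies with the same
# KeyError as above, but tolerate the failure. A Kubernetes Job only
# sees the wrapper's own exit code, so the pod still reports Completed.
child = subprocess.run(
    [sys.executable, '-c', "raise KeyError('whoami')"],
    capture_output=True, text=True,
)
print('child exit code:', child.returncode)           # non-zero
print('traceback in stderr:', 'KeyError' in child.stderr)  # True
# The wrapper ignores child.returncode and finishes normally, so the
# container exits 0 and the Job shows Completed/Success.
```

This suggests the prepare job's wrapper treated the `raw list` failure as non-fatal, which would explain the Success status.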
I created a post-mortem meeting and invited ocs-dedicated. Please let me know if I should invite anyone else (or use a different email), and feel free to request a different time. Thank you.

Please invite me as well Samuel

Can we close this issue now, or are we waiting for the post-mortem?

We can close it. And track any action items that arise from the post-mortem in Jira.

Thank you Travis!