Description of problem (please be as detailed as possible and provide log snippets):

There are 23 PGs in a degraded state. Confirmed by running:

$ ceph pg ls

All PGs are active, but some are undersized + degraded. From the docs here: https://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics

- undersized => the PG can't select enough OSDs given its size
- degraded => some objects in the PG are not replicated enough times yet

Also, this link https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#placement-groups-never-get-clean tells us: "When you create a cluster and your cluster remains in active, active+remapped or active+degraded status and never achieves an active+clean status, you likely have a problem with your configuration."

This cluster was created less than 1 hour ago. Plenty of storage left; nodes healthy.

Looking at the logs of the pod triggering the alert (rook-ceph-mgr-a-5869b44f94-th27s), we see this message being logged over and over:

pgmap v1507: 96 pgs: 23 active+undersized+degraded, 73 active+undersized; 43 KiB data, 736 KiB used, 2.0 TiB / 2 TiB avail; 1.2 KiB/s rd, 2 op/s; 34/102 objects degraded (33.333%)

Dumping ceph status as per the SOP confirms the cluster is in HEALTH_WARN:

sh-4.4$ ceph status
  cluster:
    id:     6c31c42c-6328-48f7-8c08-b7877b484c8c
    health: HEALTH_WARN
            Degraded data redundancy: 34/102 objects degraded (33.333%), 23 pgs degraded, 96 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 27m)
    mgr: a(active, since 29m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 2 up (since 29m), 2 in (since 29m)

  data:
    pools:   3 pools, 96 pgs
    objects: 34 objects, 43 KiB
    usage:   2.0 GiB used, 2.0 TiB / 2 TiB avail
    pgs:     34/102 objects degraded (33.333%)
             73 active+undersized
             23 active+undersized+degraded

  io:
    client: 853 B/s rd, 1 op/s rd, 0 op/s wr

sh-4.4$ ceph osd status
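The degraded-object ratio our alert cares about is already embedded in that recurring pgmap line. A minimal parsing sketch (the helper name and regex are my own; only the "N/M objects degraded (P%)" fragment of the log format comes from the message above):

```python
import re

def degraded_ratio(pgmap_line: str):
    """Extract (degraded, total, percent) from a mgr pgmap log line, or None."""
    m = re.search(r"(\d+)/(\d+) objects degraded \(([\d.]+)%\)", pgmap_line)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))

line = ("pgmap v1507: 96 pgs: 23 active+undersized+degraded, 73 active+undersized; "
        "43 KiB data, 736 KiB used, 2.0 TiB / 2 TiB avail; 1.2 KiB/s rd, 2 op/s; "
        "34/102 objects degraded (33.333%)")
print(degraded_ratio(line))  # (34, 102, 33.333)
```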
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+------------+
| id | host                                         | used  | avail | wr ops | wr data | rd ops | rd data | state      |
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+------------+
| 0  | ip-10-177-160-100.us-east-2.compute.internal | 1024M | 1022G |    0   |     0   |    0   |     0   | exists,up  |
| 1  |                                              |    0  |    0  |    0   |     0   |    0   |     0   | exists,new |
| 2  | ip-10-177-160-169.us-east-2.compute.internal | 1024M | 1022G |    0   |     0   |    2   |   106   | exists,up  |
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+------------+

Version of all relevant components (if applicable):
- (Red Hat Addon) ocs-osd-deployer.v1.1.2
- (OLM CSV) ocs-operator.v4.8.5
- (OLM CSV) ose-prometheus-operator.4.8.0
- $ ceph version: ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)
- OSD version 4.9.15

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Everything seems to be healthy; the cluster is only running in a degraded state. No issues seen with Nodes, Pods, Services, PVCs, etc.

Is there any workaround available to the best of your knowledge?
No. After more than 12 hours, the Ceph cluster has not repaired any of its PGs.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5. No idea what we can do, or how to fix the PGs.

Is this issue reproducible?
I do not know; maybe by re-installing the cluster. But as per the docs previously linked, the Ceph documentation suggests this might be a configuration issue.

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:

--- Ceph Version ---
$ ceph version
ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)

--- OSD Cluster info ---
$ ocm get cluster 1pvn1r26egk5hmdliqct9baokdtt1bo7
{
  "kind": "Cluster",
  "id": "1pvn1r26egk5hmdliqct9baokdtt1bo7",
  "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7",
  "name": "r-us-mvp-01",
  "external_id": "274282cf-3ac3-4ba8-951f-d88f12b54ae3",
  "infra_id": "r-us-mvp-01-v8dcg",
  "display_name": "r-us-mvp-01",
  "creation_timestamp": "2022-01-26T21:36:46.324794Z",
  "activity_timestamp": "2022-01-27T01:59:46Z",
  "cloud_provider": {
    "kind": "CloudProviderLink",
    "id": "aws",
    "href": "/api/clusters_mgmt/v1/cloud_providers/aws"
  },
  "openshift_version": "4.9.15",
  "subscription": {
    "kind": "SubscriptionLink",
    "id": "24Fj0Wuxxi8tiYFt1YAaEUYlZfH",
    "href": "/api/accounts_mgmt/v1/subscriptions/24Fj0Wuxxi8tiYFt1YAaEUYlZfH"
  },
  "region": {
    "kind": "CloudRegionLink",
    "id": "us-east-2",
    "href": "/api/clusters_mgmt/v1/cloud_providers/aws/regions/us-east-2"
  },
  "console": {
    "url": "https://console-openshift-console.apps.r-us-mvp-01.uhab.p1.openshiftapps.com"
  },
  "api": {
    "url": "https://api.r-us-mvp-01.uhab.p1.openshiftapps.com:6443",
    "listening": "internal"
  },
  "nodes": {
    "master": 3,
    "infra": 3,
    "autoscale_compute": {
      "min_replicas": 3,
      "max_replicas": 6
    },
    "availability_zones": [
      "us-east-2c",
      "us-east-2b",
      "us-east-2a"
    ],
    "compute_machine_type": {
      "kind": "MachineTypeLink",
      "id": "m5.4xlarge",
      "href": "/api/clusters_mgmt/v1/machine_types/m5.4xlarge"
    },
    "infra_machine_type": {
      "kind": "MachineTypeLink",
      "id": "r5.xlarge",
      "href": "/api/clusters_mgmt/v1/machine_types/r5.xlarge"
    }
  },
  "state": "ready",
  "flavour": {
    "kind": "FlavourLink",
    "id": "osd-4",
    "href": "/api/clusters_mgmt/v1/flavours/osd-4"
  },
  "groups": {
    "kind": "GroupListLink",
    "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/groups"
  },
  "properties": {
    "rosa_cli_version": "1.1.5",
    "rosa_creator_arn": "arn:aws:iam::511224662352:role/WS-00S3-role_AUTOMATE"
  },
  "aws": {
    "subnet_ids": [
      "subnet-0b93a48d837fb0d2d",
      "subnet-03cee1820e31e2fc0",
      "subnet-0d705de3b038533b2"
    ],
    "private_link": true,
    "sts": {
      "enabled": true,
      "role_arn": "arn:aws:iam::511224662352:role/ManagedOpenShift-Installer-Role",
      "support_role_arn": "arn:aws:iam::511224662352:role/ManagedOpenShift-Support-Role",
      "oidc_endpoint_url": "https://rh-oidc.s3.us-east-1.amazonaws.com/1pvn1r26egk5hmdliqct9baokdtt1bo7",
      "operator_iam_roles": [
        {
          "name": "aws-cloud-credentials",
          "namespace": "openshift-machine-api",
          "role_arn": "arn:aws:iam::511224662352:role/r-us-mvp-01-r0d9-openshift-machine-api-aws-cloud-credentials"
        },
        {
          "name": "cloud-credential-operator-iam-ro-creds",
          "namespace": "openshift-cloud-credential-operator",
          "role_arn": "arn:aws:iam::511224662352:role/r-us-mvp-01-r0d9-openshift-cloud-credential-operator-cloud-crede"
        },
        {
          "name": "installer-cloud-credentials",
          "namespace": "openshift-image-registry",
          "role_arn": "arn:aws:iam::511224662352:role/r-us-mvp-01-r0d9-openshift-image-registry-installer-cloud-creden"
        },
        {
          "name": "cloud-credentials",
          "namespace": "openshift-ingress-operator",
          "role_arn": "arn:aws:iam::511224662352:role/r-us-mvp-01-r0d9-openshift-ingress-operator-cloud-credentials"
        },
        {
          "name": "ebs-cloud-credentials",
          "namespace": "openshift-cluster-csi-drivers",
          "role_arn": "arn:aws:iam::511224662352:role/r-us-mvp-01-r0d9-openshift-cluster-csi-drivers-ebs-cloud-credent"
        }
      ],
      "instance_iam_roles": {
        "master_role_arn": "arn:aws:iam::511224662352:role/ManagedOpenShift-ControlPlane-Role",
        "worker_role_arn": "arn:aws:iam::511224662352:role/ManagedOpenShift-Worker-Role"
      }
    }
  },
  "dns": {
    "base_domain": "uhab.p1.openshiftapps.com"
  },
  "network": {
    "type": "OpenShiftSDN",
    "machine_cidr": "10.177.160.0/22",
    "service_cidr": "172.30.0.0/16",
    "pod_cidr": "10.128.0.0/14",
    "host_prefix": 23
  },
  "external_configuration": {
    "kind": "ExternalConfiguration",
    "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/external_configuration",
    "syncsets": {
      "kind": "SyncsetListLink",
      "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/external_configuration/syncsets"
    },
    "labels": {
      "kind": "LabelListLink",
      "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/external_configuration/labels"
    }
  },
  "multi_az": true,
  "managed": true,
  "ccs": {
    "enabled": true,
    "disable_scp_checks": false
  },
  "version": {
    "kind": "Version",
    "id": "openshift-v4.9.15",
    "href": "/api/clusters_mgmt/v1/versions/openshift-v4.9.15",
    "raw_id": "4.9.15",
    "channel_group": "stable",
    "end_of_life_timestamp": "2022-07-18T00:00:00Z"
  },
  "identity_providers": {
    "kind": "IdentityProviderListLink",
    "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/identity_providers"
  },
  "aws_infrastructure_access_role_grants": {
    "kind": "AWSInfrastructureAccessRoleGrantLink",
    "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/aws_infrastructure_access_role_grants"
  },
  "addons": {
    "kind": "AddOnInstallationListLink",
    "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/addons"
  },
  "ingresses": {
    "kind": "IngressListLink",
    "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/ingresses",
    "id": "1pvn1r26egk5hmdliqct9baokdtt1bo7"
  },
  "machine_pools": {
    "kind": "MachinePoolListLink",
    "href": "/api/clusters_mgmt/v1/clusters/1pvn1r26egk5hmdliqct9baokdtt1bo7/machine_pools"
  },
  "product": {
    "kind": "ProductLink",
    "id": "rosa",
    "href": "/api/clusters_mgmt/v1/products/rosa"
  },
  "status": {
    "state": "ready",
    "dns_ready": true,
    "oidc_ready": true,
    "provision_error_message": "",
    "provision_error_code": "",
    "configuration_mode": "full"
  },
  "node_drain_grace_period": {
    "value": 60,
    "unit": "minutes"
  },
  "etcd_encryption": false,
  "billing_model": "standard",
  "disable_user_workload_monitoring": false
}

--- oc adm must-gather fails (backplane insufficient permissions) ---
Failed to use $ oc adm must-gather, as the backplane user does not have sufficient permissions to create a namespace:

$ oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
[must-gather      ] OUT Using must-gather plug-in image: registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
Error from server (Forbidden): namespaces is forbidden: User "system:serviceaccount:openshift-backplane-mtsre:a27a7f29b22279131144ad44f061ad8b" cannot create resource "namespaces" in API group "" at the cluster scope
exit 1
---
One of the OSDs is not up. Since must-gather did not run, is there a way to get the Rook operator logs and attach them to the BZ?
To troubleshoot why the OSD is down, please also gather:
- "ceph osd tree" from the toolbox, which will show which OSD is down if it's not obvious from a crashing pod
- Logs for the down OSD pod
- The pod description for the OSD that is down
--- $ ceph osd tree ---
sh-4.4$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                              STATUS REWEIGHT PRI-AFF
 -1       2.00000 root default
 -5       2.00000     region us-east-2
 -4       1.00000         zone us-east-2b
 -3       1.00000             host default-0-data-0pv72z
  0   ssd 1.00000                 osd.0                  up     1.00000 1.00000
-10       1.00000         zone us-east-2c
 -9       1.00000             host default-2-data-0cd28h
  2   ssd 1.00000                 osd.2                  up     1.00000 1.00000
  1             0 osd.1                                  down         0 1.00000

--- OSD down pod description/logs ---
As the pod does not exist and was never scheduled, this information is unfortunately unavailable. If the pods were grouped in a DaemonSet or Deployment, I believe we would get better status information, and we would clearly see 2/3 pods running, hinting at an issue.
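The same information is available in machine-readable form via `ceph osd tree --format json`, which is handy for alerting. A sketch under the assumption that the JSON has the standard `nodes`/`stray` layout (OSDs with no CRUSH location, like osd.1 here, appear under `stray`); the helper and sample data are illustrative:

```python
import json

def down_osds(tree_json: str):
    """Return names of OSDs reported down in `ceph osd tree --format json` output."""
    tree = json.loads(tree_json)
    # OSDs that never got a CRUSH location (like osd.1 above) land in "stray".
    nodes = tree.get("nodes", []) + tree.get("stray", [])
    return [n["name"] for n in nodes if n.get("type") == "osd" and n.get("status") == "down"]

# Trimmed sample mirroring the tree above.
sample = json.dumps({
    "nodes": [
        {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
        {"id": 2, "name": "osd.2", "type": "osd", "status": "up"},
    ],
    "stray": [{"id": 1, "name": "osd.1", "type": "osd", "status": "down"}],
})
print(down_osds(sample))  # ['osd.1']
```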
Samuel, is there an OSD deployment that is unschedulable? Since osd.1 exists, its OSD prepare job must have completed successfully, and the OSD deployment is likely pending for some reason. But without a Rook operator log, or a description of the rook-ceph-osd-1 deployment, it's difficult to diagnose. Getting must-gather access would really help the troubleshooting.
I dug into the osd.1 prepare jobs (logs attached), and it seems the job actually failed to execute properly. There are Python tracebacks in the logs, even though the job reported Completed/Success.
ceph-volume is failing just trying to list whether there are any OSDs already on the volume, with several stacks like this:

2022-01-26 23:10:36.953704 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/default-1-data-0f8266 --format json
2022-01-26 23:10:37.200585 E | cephosd: . Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 11, in <module>
    load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 39, in __init__
    self.main(self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
    self.list(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
    report = self.generate(args.device)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
    whoami = oj[dev]['whoami']
KeyError: 'whoami'

The PV seems corrupted. This is just a new cluster, right? Can you wipe it and allow a new OSD to be created? You should just need to run the job template to purge OSD 1. Are you familiar with that job template?
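The last frame of the traceback shows the failure mode: `raw list` indexes the per-device dict with a hard `['whoami']` lookup, so a device whose on-disk label is missing or corrupt (no `whoami` field) crashes the listing instead of being skipped. A minimal reproduction of that access pattern (the data shape and function names here are illustrative, not the real ceph-volume code):

```python
# A device entry without a 'whoami' key, as a corrupt/blank volume might yield.
oj = {"/mnt/default-1-data-0f8266": {"osd_uuid": "abc123"}}

def generate_strict(report, dev):
    return report[dev]["whoami"]              # what the traceback shows: raises KeyError

def generate_defensive(report, dev):
    return report.get(dev, {}).get("whoami")  # skip-friendly alternative: returns None

try:
    generate_strict(oj, "/mnt/default-1-data-0f8266")
except KeyError as e:
    print("strict lookup failed:", e)         # strict lookup failed: 'whoami'

print(generate_defensive(oj, "/mnt/default-1-data-0f8266"))  # None
```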
Yes, this was a new cluster (when the addon was deployed). I am not familiar with that job template. Any links to an SOP?
Here are instructions on running the osd removal job: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/deploying_and_managing_openshift_container_storage_using_red_hat_openstack_platform/replacing-storage-devices_osp
Thanks for the link. Followed the procedure and performed the following steps:

1. Delete the PVC `default-1-data-0f8266`:
   $ oc delete -n openshift-storage pvc default-1-data-0f8266
2. Remove the finalizer (this step should be added to the docs):
   $ oc patch -n openshift-storage pvc default-1-data-0f8266 -p '{"metadata":{"finalizers":null}}'
3. Can't delete the associated PV, as we don't have sufficient backplane permissions (only get, list, watch):
   $ oc auth can-i --list -n openshift-storage | grep persistentvolume
   persistentvolumeclaims        [] [] [get list watch create delete deletecollection patch update]
   persistentvolumeclaims/status [] [] [get list watch]
   persistentvolumes             [] [] [get list watch]
4. Run the job template for osd.1:
   $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 | oc create -n openshift-storage -f -
5. The job keeps spinning up pods that fail. Inspecting the logs:

$ oc logs -n openshift-storage pod/ocs-osd-removal-job--1-97pmd
2022-02-03 19:58:24.508518 I | rookcmd: starting Rook v4.8.5-1 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1'
2022-02-03 19:58:24.508617 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=1, --service-account=
2022-02-03 19:58:24.508625 I | op-mon: parsing mon endpoints: c=172.30.208.123:6789,a=172.30.211.71:6789,b=172.30.144.228:6789
2022-02-03 19:58:24.523431 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-02-03 19:58:24.523551 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-02-03 19:58:24.523625 D | cephclient: config file @ /etc/ceph/ceph.conf:
[global]
fsid                = 6c31c42c-6328-48f7-8c08-b7877b484c8c
mon initial members = a b c
mon host            = [v2:172.30.211.71:3300,v1:172.30.211.71:6789],[v2:172.30.144.228:3300,v1:172.30.144.228:6789],[v2:172.30.208.123:3300,v1:172.30.208.123:6789]
bdev_flock_retry = 20
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
mon_pg_warn_max_object_skew = 0

[osd]
osd_memory_target_cgroup_limit_ratio = 0.5

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2022-02-03 19:58:24.523760 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/433535554
2022-02-03 19:58:24.825215 I | cephosd: validating status of osd.1
failed to get osd status for osd 1: not found osd.1 in OSDDump
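The final error makes sense given how the removal job validates its target: it looks the OSD id up in the `ceph osd dump --format json` output it just fetched. A sketch of that lookup (the `osds` array layout is the standard dump format; the helper name and trimmed sample are mine):

```python
import json

def osd_in_dump(osd_id: int, dump_json: str) -> bool:
    """Return True if the given OSD id appears in `ceph osd dump --format json` output."""
    dump = json.loads(dump_json)
    return any(entry.get("osd") == osd_id for entry in dump.get("osds", []))

# Trimmed sample: only osd.0 and osd.2 remain, i.e. osd.1 was already purged.
sample = json.dumps({"osds": [{"osd": 0, "up": 1, "in": 1},
                              {"osd": 2, "up": 1, "in": 1}]})
print(osd_in_dump(1, sample))  # False -> "not found osd.1 in OSDDump"
print(osd_in_dump(0, sample))  # True
```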
It seems osd.1 is already removed from Ceph:

2022-02-03 19:58:24.825215 I | cephosd: validating status of osd.1
failed to get osd status for osd 1: not found osd.1 in OSDDump

Did the OSD prepare job and OSD deployment get deleted? If the OSD is now purged from Ceph, the operator should be able to create a new OSD to replace osd.1. If the old osd.1 was purged and you're not seeing a new OSD being created automatically, try restarting the operator. Note that Ceph will re-use OSD IDs, so the new OSD will likely also be called osd.1.
I removed the prepare job manually:

$ oc delete -n openshift-storage jobs/rook-ceph-osd-prepare-default-1-data-xxxx

Now waiting on the operator to do its magic. It keeps producing this log:

2022-02-03 21:12:33.616357 I | clusterdisruption-controller: all "zone" failure domains: [us-east-2b us-east-2c]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized Count:73} {StateName:active+undersized+degraded Count:23}]"
2022-02-03 21:12:33.617460 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:12:33.618209 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:13:03.625052 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:13:03.625759 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f

Both $ ceph status and $ ceph osd status confirm that osd.1 is gone.
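For what it's worth, the repeated `returning version v1.22.3 instead of v1.22.3+e790d7f` lines are benign: they just show the operator discarding the semver build-metadata suffix before comparing Kubernetes versions. The implied behavior can be sketched as (the helper name is illustrative, not the actual Rook op-k8sutil code):

```python
def strip_build_metadata(version: str) -> str:
    """Drop a semver '+<build>' suffix, e.g. 'v1.22.3+e790d7f' -> 'v1.22.3'."""
    return version.split("+", 1)[0]

print(strip_build_metadata("v1.22.3+e790d7f"))  # v1.22.3
print(strip_build_metadata("v1.22.3"))          # v1.22.3
```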
Those messages from the disruption controller are expected until the new OSD is up. Please try restarting the operator (delete the rook operator pod) to trigger a new reconcile for the OSDs.
Manually deleting the operator pod triggered a reconcile, and:
- the rook-ceph-osd-prepare-default-1-data-0bbp5l--1-86bwv job ran
- the rook-ceph-osd-1-565c85fd5-8rk6c pod was created

--- $ ceph status ---
sh-4.4$ ceph status
  cluster:
    id:     6c31c42c-6328-48f7-8c08-b7877b484c8c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 7d)
    mgr: a(active, since 7d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 62s), 3 in (since 62s)

  data:
    pools:   3 pools, 96 pgs
    objects: 34 objects, 43 KiB
    usage:   3.0 GiB used, 3.0 TiB / 3 TiB avail
    pgs:     96 active+clean

  io:
    client: 853 B/s rd, 1 op/s rd, 0 op/s wr

--- $ ceph osd status ---
sh-4.4$ ceph osd status
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id | host                                         | used  | avail | wr ops | wr data | rd ops | rd data | state     |
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | ip-10-177-160-100.us-east-2.compute.internal | 1024M | 1022G |    0   |     0   |    0   |     0   | exists,up |
| 1  | ip-10-177-160-54.us-east-2.compute.internal  | 1024M | 1022G |    0   |     0   |    0   |     0   | exists,up |
| 2  | ip-10-177-160-169.us-east-2.compute.internal | 1024M | 1022G |    0   |     0   |    2   |   106   | exists,up |
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+

--- Note ---
It seems we did not need to remove the PV after all. Do you think we will ever need to delete PVs with backplane access in the future? I had created this follow-up ticket, if you could please comment :) -> https://issues.redhat.com/browse/MTSRE-430

Attaching logs from the new operator pod as 'rook-ceph-operator-new-logs.txt'.

Thank you, I think we were able to resolve the issue.
--- Open questions to investigate ---

1. Will we be required to delete Persistent Volumes?
2. Why did we have to manually delete the rook-ceph-operator pod to trigger a reconciliation?
3. Why did OCS deploy with a corrupted osd.1, and why was the operator unable to reconcile and fix it?
4. Why did the rook-ceph-osd-prepare-default-1-data job succeed when it really deployed a corrupted osd.1?
5. Why did the rook-ceph-operator not detect the missing osd.1 pod?
Great to see the cluster is healthy again with all three OSDs.

(In reply to Samuel Blais-Dowdy from comment #17)
> --- Open questions to investigate ---
>
> 1. Will we be required to delete Persistent Volumes?

You shouldn't need to delete PVs directly; deleting the PVC should be sufficient. PVs generally have a deletion policy of Reclaim, but it's defined by the storage class, which you may not have control of.

> 2. Why did we have to manually delete the rook-ceph-operator pod to trigger
> a reconciliation?

The Rook operator watches for many types of events to automatically trigger a reconcile. For example, if the cephcluster CR is updated, or if any deployment in the Rook namespace is deleted, a new reconcile is started automatically. But Rook does not trigger a reconcile when a job is deleted, which is what happened in this case.

> 3. Why did OCS deploy with a corrupted osd.1, and why was the operator
> unable to reconcile and fix it?

I've not seen reports of a corrupt OSD immediately after deployment like this. It seems like a rare condition, but if you see it again or multiple times we should investigate further. Sometimes the underlying storage just goes bad, though.

> 4. Why did the rook-ceph-osd-prepare-default-1-data job succeed when it
> really deployed a corrupted osd.1?

The OSD did fail because of the corrupt disk. Even if the job had failed, Rook wouldn't have been able to fix it automatically; Rook expects intervention when the underlying storage is bad.

> 5. Why did the rook-ceph-operator not detect the missing osd.1 pod?

The operator pod did show it was waiting for the other OSD to start. The operator just couldn't recover automatically, so it was stuck.
One thing that bothers me is that the osd.1 prepare job was in a failed state (Python traceback) but reported Completed/Success:

2022-01-26 23:10:36.953704 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/default-1-data-0f8266 --format json
2022-01-26 23:10:37.200585 E | cephosd: . Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 11, in <module>
    load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 39, in __init__
    self.main(self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
    self.list(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
    report = self.generate(args.device)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
    whoami = oj[dev]['whoami']
KeyError: 'whoami'
I created a post-mortem meeting and invited ocs-dedicated. Please let me know if I should invite anyone else (or use a different email), and feel free to request a different time. Thank you.
Please invite me as well
Samuel, can we close this issue now, or are we waiting for the post-mortem?
We can close it, and track any action items that arise from the post-mortem in Jira. Thank you, Travis!