Bug 1755956
| Summary: | Containerized OSD failed to start after upgrading: failed to read label for an LV | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasishta <vashastr> |
| Component: | RADOS | Assignee: | Neha Ojha <nojha> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Manohar Murthy <mmurthy> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.3 | CC: | ceph-eng-bugs, dzafman, jbrier, jdurgin, kchai, nojha, oneumyvakin, sweil, tchandra, vumrao |
| Target Milestone: | z3 | Keywords: | Regression |
| Target Release: | 3.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-12-19 18:42:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Could you retry with --debug-bluestore=30 and attach the osd log? Also, please add dmesg from that node to see if there's anything going on with the disks at the kernel level. Can you attach the output from:

```
dd if=/dev/ceph-411886c1-5eb6-445b-bfd0-badb1c7a412a/osd-data-2232e149-96e3-4969-8079-cbbfab969b0b of=/tmp/foo bs=4K count=2
hexdump -C /tmp/foo
```

Thanks!

---

Note: there was a request to remove this from the release notes, but it was never attached to the 3.3 Release Notes tracker, nor does it have a Doc Text/Doc Type, so it was never actually in the Release Notes.

---

I've run into what looks like a very similar issue on ceph version 15.2.13:

```
[2021-08-08 04:08:22,902][ceph_volume.process][INFO ] stdout TAGS=:systemd:
[2021-08-08 04:08:22,903][ceph_volume.process][INFO ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a --units=b --nosuffix -S lv_path=/dev/sdb -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2021-08-08 04:08:22,963][ceph_volume.process][INFO ] Running command: /usr/bin/lsblk --nodeps -P -o NAME,KNAME,MAJ:MIN,FSTYPE,MOUNTPOINT,LABEL,UUID,RO,RM,MODEL,SIZE,STATE,OWNER,GROUP,MODE,ALIGNMENT,PHY-SEC,LOG-SEC,ROTA,SCHED,TYPE,DISC-ALN,DISC-GRAN,DISC-MAX,DISC-ZERO,PKNAME,PARTLABEL /dev/sdb
[2021-08-08 04:08:22,972][ceph_volume.process][INFO ] stdout NAME="sdb" KNAME="sdb" MAJ:MIN="8:16" FSTYPE="LVM2_member" MOUNTPOINT="" LABEL="" UUID="BVsYZe-nnPG-UCo3-D0xy-H5vr-mESK-IrW4IH" RO="0" RM="0" MODEL="QEMU HARDDISK " SIZE="80G" STATE="running" OWNER="root" GROUP="disk" MODE="brw-rw---" ALIGNMENT="0" PHY-SEC="512" LOG-SEC="512" ROTA="1" SCHED="mq-deadline" TYPE="disk" DISC-ALN="0" DISC-GRAN="4K" DISC-MAX="1G" DISC-ZERO="0" PKNAME="" PARTLABEL=""
[2021-08-08 04:08:22,973][ceph_volume.process][INFO ] Running command: /usr/sbin/blkid -p /dev/sdb
[2021-08-08 04:08:22,977][ceph_volume.process][INFO ] stdout /dev/sdb: UUID="BVsYZe-nnPG-UCo3-D0xy-H5vr-mESK-IrW4IH" VERSION="LVM2 001" TYPE="LVM2_member" USAGE="raid"
[2021-08-08 04:08:22,978][ceph_volume.process][INFO ] Running command: /usr/sbin/pvs --noheadings --readonly --units=b --nosuffix --separator=";" -o vg_name,pv_count,lv_count,vg_attr,vg_extent_count,vg_free_count,vg_extent_size /dev/sdb
[2021-08-08 04:08:23,034][ceph_volume.process][INFO ] stdout ceph-0142ad4b-5029-457d-ae0a-9ed927249e32";"1";"1";"wz--n";"20479";"0";"4194304
[2021-08-08 04:08:23,034][ceph_volume.process][INFO ] Running command: /usr/sbin/pvs --noheadings --readonly --separator=";" -a --units=b --nosuffix -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size /dev/sdb
[2021-08-08 04:08:23,094][ceph_volume.process][INFO ] stdout ceph.block_device=/dev/ceph-0142ad4b-5029-457d-ae0a-9ed927249e32/osd-block-bf95f45b-0849-46cd-9715-7c27db32b9ff,ceph.block_uuid=rL6nxR-ESXS-Xd5B-TwP1-GGNE-MLpa-M4mcHY,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=c6f330e2-f775-11eb-a326-85f44cce260a,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=bf95f45b-0849-46cd-9715-7c27db32b9ff,ceph.osd_id=0,ceph.osdspec_affinity=None,ceph.type=block,ceph.vdo=0";"/dev/ceph-0142ad4b-5029-457d-ae0a-9ed927249e32/osd-block-bf95f45b-0849-46cd-9715-7c27db32b9ff";"osd-block-bf95f45b-0849-46cd-9715-7c27db32b9ff";"ceph-0142ad4b-5029-457d-ae0a-9ed927249e32";"rL6nxR-ESXS-Xd5B-TwP1-GGNE-MLpa-M4mcHY";"85895151616
[2021-08-08 04:08:23,095][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /dev/sdb
[2021-08-08 04:08:23,125][ceph_volume.process][INFO ] stderr unable to read label for /dev/sdb: (2) No such file or directory
```

Running /usr/bin/ceph-bluestore-tool with --log-level=30:

```
/usr/bin/ceph-bluestore-tool show-label --log-level=30 --dev /dev/sdb -l /var/log/ceph/ceph-volume.log
unable to read label for /dev/sdb: (2) No such file or directory
2021-08-08T05:17:45.303+0000 7ff7fc9c0240 10 bluestore(/dev/sdb) _read_bdev_label
2021-08-08T05:17:45.303+0000 7ff7fc9c0240  2 bluestore(/dev/sdb) _read_bdev_label unable to decode label at offset 102: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding
```

I've submitted a separate bug report at https://tracker.ceph.com/issues/52095 where you can find a hexdump of /dev/sdb.
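As an aside for anyone triaging similar ceph-volume logs: the `lsblk -P` invocations above emit one `KEY="value"` pair per column, which is easy to pick apart mechanically. A minimal parsing sketch (the function name and sample line are illustrative, not part of ceph-volume itself):

```python
import re

def parse_lsblk_pairs(line: str) -> dict:
    """Parse one line of `lsblk -P` output (KEY="value" pairs) into a dict."""
    # Keys may contain ':' (e.g. MAJ:MIN); values are double-quoted and may be empty.
    return dict(re.findall(r'(\S+?)="([^"]*)"', line))

# Sample taken from the log above (truncated for brevity):
sample = ('NAME="sdb" KNAME="sdb" MAJ:MIN="8:16" FSTYPE="LVM2_member" '
          'MOUNTPOINT="" SIZE="80G" TYPE="disk"')
info = parse_lsblk_pairs(sample)
print(info["FSTYPE"])   # LVM2_member
print(info["MAJ:MIN"])  # 8:16
```

This makes it straightforward to script checks such as "which block devices are LVM2 members but have no mountpoint" across a node's ceph-volume.log.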
Created attachment 1619586 [details]: file contains the journald log of the respective OSD.

Description of problem:
Tried to upgrade a containerized cluster from [1] to [2] using ceph-ansible [3]. The rolling update failed because the cluster did not reach clean PGs within the timeout (noout and norebalance were set by ceph-ansible during the upgrade). It was observed that only 1 out of 16 OSDs was down, with stderr "failed to read label" for an LV (the LV was present).

Version-Release number of selected component (if applicable):
[1] ceph-3.3-rhel-7-containers-candidate-37269-20190917163216, 12.2.12-70.el7cp
[2] ceph-3.3-rhel-7-containers-candidate-99414-20190918165737, 12.2.12-71.el7cp
[3] ceph-ansible-3.2.27-1.el7cp.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Configure a containerized cluster at the (n-1)th version
2. Upgrade it to the nth version

Actual results:

```
Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-411886c1-5eb6-445b-bfd0-badb1c7a412a/osd-data-2232e149-96e3-4969-8079-cbbfab969b0b --path /var
ceph-osd-run.sh[65833]: stderr: failed to read label for /dev/ceph-411886c1-5eb6-445b-bfd0-badb1c7a412a/osd-data-2232e149-96e3-4969-8079-cbbfab969b0b: (2) No such file or directory
```

The LV itself is present:

```
$ sudo lvdisplay /dev/ceph-411886c1-5eb6-445b-bfd0-badb1c7a412a/osd-data-2232e149-96e3-4969-8079-cbbfab969b0b
  --- Logical volume ---
  LV Path                /dev/ceph-411886c1-5eb6-445b-bfd0-badb1c7a412a/osd-data-2232e149-96e3-4969-8079-cbbfab969b0b
  LV Name                osd-data-2232e149-96e3-4969-8079-cbbfab969b0b
  VG Name                ceph-411886c1-5eb6-445b-bfd0-badb1c7a412a
  LV UUID                7UBCWl-RJSw-maqd-1ezX-N243-7C2I-UQeRmd
  LV Write Access        read/write
  LV Creation host, time magna033, 2019-09-18 05:33:21 +0000
  LV Status              available
  # open                 0
  LV Size                232.12 GiB
  Current LE             59424
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:10
```

Expected results:
OSDs should get updated successfully.

Additional info:
The particular OSD was created using:

```
lvm batch nodexx devices="['/dev/sdb','/dev/sdc','/dev/sdd']" osd_scenario="lvm" osds_per_device=4
```

From initial observation of the whole environment and the logs, we assume the issue is in RADOS, since ceph-bluestore-tool failed to read the label of an LV. Please feel free to reassign to the relevant component based on your findings.
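For what it's worth, the dd/hexdump check requested earlier in this bug can also be scripted. A hexdump of a healthy BlueStore device typically shows the plain-text string `bluestore block device` at the start of the label block. The sketch below only checks that prefix on a file standing in for the device; the magic string and 4 KiB block size are assumptions based on such hexdumps, not an implementation of `bluestore_bdev_label_t::decode`:

```python
import os
import tempfile

BDEV_LABEL_BLOCK_SIZE = 4096  # assumed size of the label block read by _read_bdev_label
LABEL_MAGIC = b"bluestore block device"  # plain-text prefix seen in hexdumps of healthy devices

def looks_like_bluestore(path: str) -> bool:
    """Return True if the first label block starts with the expected magic string."""
    with open(path, "rb") as f:
        return f.read(BDEV_LABEL_BLOCK_SIZE).startswith(LABEL_MAGIC)

# Demo on throwaway files (on a real node you would point this at the LV or /dev/sdb):
with tempfile.NamedTemporaryFile(delete=False) as good:
    good.write(LABEL_MAGIC + b"\x00" * BDEV_LABEL_BLOCK_SIZE)
with tempfile.NamedTemporaryFile(delete=False) as bad:
    bad.write(b"\x00" * BDEV_LABEL_BLOCK_SIZE)  # a wiped/clobbered device has no magic

has_label = looks_like_bluestore(good.name)  # True
no_label = looks_like_bluestore(bad.name)    # False
os.unlink(good.name)
os.unlink(bad.name)
```

A False result on a device that should host an OSD would point at the same symptom seen here: the label block was never written or has been overwritten, which is exactly what the `decode past end of struct encoding` error suggests.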