Bug 1296464
Summary: | Create Cluster: ceph-disk activate-all fails in non deterministic way because of missing device file | ||||||
---|---|---|---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Storage Console | Reporter: | Daniel Horák <dahorak> | ||||
Component: | Ceph | Assignee: | Shubhendu Tripathi <shtripat> | ||||
Status: | CLOSED WONTFIX | QA Contact: | sds-qe-bugs | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 2 | CC: | dahorak, japplewh, kbader, mbukatov, mkudlej, nthomas, sankarshan, shtripat | ||||
Target Milestone: | beta | ||||||
Target Release: | 2 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | rhscon-core-0.0.34-1.el7scon.x86_64 rhscon-ceph-0.0.33-1.el7scon.x86_64 rhscon-ui-0.0.47-1.el7scon.noarch | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2016-07-20 17:45:54 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
Daniel Horák
2016-01-07 10:34:16 UTC
The error message "Error: One or more partitions failed to activate" indicates that the issue is happening due to im-proper disk cleanup, which was used for OSD creation earlier. It happened on freshly created nodes (in OpenStack), so I think it should be clean. I'll ping you on IRC or send email, when I notice it again with login details. Cleaning the disks resolves the issue. Tried in the setup provided in cluster creation was successful. It's probably not 100% reproducible, but it happened quite often on different clusters (though created and installed by the same install script and ansible playbook). The whole cluster is always freshly installed and the data disk (/dev/vdb, clean but formated to ext3 and mounted to /mnt) is cleaned via "dd if=/dev/zero of=/dev/vdb bs=1MB count=10" and then the machine is rebooted (this is done before the USM cluster installation/configuration). Created attachment 1114907 [details] excerpt from /var/log/salt/minion of node1 (In reply to Shubhendu Tripathi from comment #1) > The error message "Error: One or more partitions failed to activate" > indicates that the issue is happening due to im-proper disk cleanup, which > was used for OSD creation earlier. I disagree here. You can see in /var/log/salt/minion except I'm attaching to this BZ (I hit this issue today) that node1's salt minion did: * create new GTP table on /dev/vdb (with 2 paritions) * create xfs filesystem on /dev/vdb1 device (one of new partitions) * have some partx issues * fail on 'ceph-disk activate-all' This means that it doesn't matter what state device /dev/vdb was in, because salt minion did create completely new partition table here and formatted it's 1st partition. Whatever device /dev/vdb contained before, it's long gone since salt minion reclaimed the device. I verified this by checking the state on the node1 myself: ~~~ # fdisk -l /dev/vdb WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion. Disk /dev/vdb: 34.4 GB, 34359738368 bytes, 67108864 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk label type: gpt # Start End Size Type Name 1 2099200 67108830 31G unknown ceph data 2 2048 2097152 1023M unknown ceph journal ~~~ This means that partition table has been created successfully. Going little further: ~~~ # mkdir /mnt/foo # mount /dev/vdb1 /mnt/foo # findmnt /mnt/foo TARGET SOURCE FSTYPE OPTIONS /mnt/foo /dev/vdb1 xfs rw,relatime,seclabel,attr2,inode64,noquota ~~~ This means that 1st partition has been created successfully. So if salt minion managed to create new partition table, 2 new paritions with filesystem on the 1st one, but *failed* to run `ceph-disk activate-all`, it means that salt automation (as it's currently part of rhscon) is broken. (In reply to Shubhendu Tripathi from comment #3) > Cleaning the disks resolves the issue. > Tried in the setup provided in cluster creation was successful. But what you cleaned here was configuration created by salt, so if you need to clean it up and rerun manually, it's just another evidence that disk setup as implemented in rhscon using salt is broken. So let's go even deeper and check the failure: ~~~ stderr: 2016-01-14 16:26:14.815515 7fc2cde44780 -1 did not load config file, using default settings. 2016-01-14 16:26:14.832013 7f08b27e8780 -1 did not load config file, using default settings. mount: special device /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.95d49259-f0e8-4825-9ad6-d4c704d2801a does not exist ceph-disk: Mounting filesystem failed: Command '['/usr/bin/mount', '-t', 'xfs', '-o', 'noatime,inode64', '--', '/dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.95d49259-f0e8-4825-9ad6-d4c704d2801a', '/var/lib/ceph/tmp/mnt.Qa 5BZm']' returned non-zero exit status 32 ~~~ In this piece of log, we can see that mount of some partition fails. Let's check which one it is: ~~~ # ls -l /dev/disk/by-parttypeuuid/ total 0 lrwxrwxrwx. 1 root root 10 Jan 14 16:26 45b0969e-9b03-4f30-b4c6-b4b80ceff106.b0cd1b9a-d345-4255-a90d-aa105acf8d2f -> ../../vdb2 lrwxrwxrwx. 1 root root 10 Jan 14 16:26 4fbd7e29-9d25-41b8-afd0-062c0ceff05d.95d49259-f0e8-4825-9ad6-d4c704d2801a -> ../../vdb1 ~~~ Oh, so it's actually /dev/vdb1, which I did successfully mounted already. But just to be sure: ~~~ # umount /mnt/foo/ # mount /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.95d49259-f0e8-4825-9ad6-d4c704d2801a /mnt/foo/ # findmnt /mnt/foo TARGET SOURCE FSTYPE OPTIONS /mnt/foo /dev/vdb1 xfs rw,relatime,seclabel,attr2,inode64,noquota ~~~ It just works as expected. To sum this up: it seems that the device didn't exist when ceph-disk activate-all tried to mount it. And you would need to fix salt deployment code to one of: wait for the device to appear, or to make sure it appears sooner, or to use different way to work with the device. I'm setting the severity to high as this BZ causes cluster setup to fail in a non deterministic way. Cleaning shouldn't be done with dd, it's much better to use ceph-disk zap. I think I agree with Martin here, we need to wait for the device to appear, or make it appear sooner. I'm not sure of a way to do the later, so the former is likely what needs to happen. As correctly pointed out by Daniel, this path is not applicable now when ceph-installer is used for provisioning the ceph cluster. Closing this now. |