Bug 1445924
Summary: Volume groups not activating due to mdadm devices reporting degraded

Product: Red Hat Enterprise Linux 7
Component: lvm2
lvm2 sub component: Default / Unclassified
Version: 7.3
Reporter: John Pittman <jpittman>
Assignee: LVM and device-mapper development team <lvm-team>
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
CC: agk, cmarthal, cww, dledford, guazhang, hartsjc, heinzm, jbrassow, jpittman, jshortt, loberman, msnitzer, ncroxon, prajnoha, prockai, thornber, vanhoof, xni, zkabelac
Flags: jpittman: needinfo+
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2018-06-06 14:12:00 UTC
Type: Bug
Bug Blocks: 1420851
Description
John Pittman
2017-04-26 19:28:07 UTC
(In reply to John Pittman from comment #0)

> Description of problem:

The boot log from the failing system is much longer than a normal one. There are many error messages, and not only from md/LVM; it looks like the environment itself is broken. Why are there so many messages?

> Apr 25 17:45:42 host kernel: md/raid10:md2: active with 24 out of 24 devices
> Apr 25 17:45:42 host kernel: md/raid10:md4: active with 24 out of 24 devices
> Apr 25 17:45:42 host kernel: md/raid10:md3: active with 24 out of 24 devices
> Apr 25 17:45:42 host kernel: md/raid10:md0: active with 24 out of 24 devices
>
> Apr 25 17:45:42 host kernel: md4: detected capacity change from 0 to 4799449792512
> Apr 25 17:45:42 host kernel: md3: detected capacity change from 0 to 4799449792512
> Apr 25 17:45:42 host kernel: md0: detected capacity change from 0 to 4799449792512
> Apr 25 17:45:42 host kernel: md2: detected capacity change from 0 to 4799449792512
>
> <PV scan did not kick off>
>
> - Assuming it's because systemd thinks raid is degraded
>
> Apr 25 17:46:14 host systemd[1]: mdadm-last-resort changed dead -> start
> Apr 25 17:46:14 host systemd[1]: Starting Activate md array even though degraded...
> Apr 25 17:46:14 host systemd[1]: About to execute: /usr/sbin/mdadm --run /dev/md2
> Apr 25 17:46:14 host systemd[1]: Forked /usr/sbin/mdadm as 3967
> Apr 25 17:46:14 host systemd[1]: mdadm-last-resort changed dead -> start
> Apr 25 17:46:14 host systemd[1]: Starting Activate md array even though degraded...
> Apr 25 17:46:14 host systemd[1]: About to execute: /usr/sbin/mdadm --run /dev/md0
> Apr 25 17:46:14 host systemd[1]: Forked /usr/sbin/mdadm as 3968
> Apr 25 17:46:14 host systemd[1]: mdadm-last-resort changed dead -> start
> Apr 25 17:46:14 host systemd[1]: Starting Activate md array even though degraded...
> Apr 25 17:46:14 host systemd[3966]: Executing: /usr/sbin/mdadm --run /dev/md3
> Apr 25 17:46:14 host systemd[1]: About to execute: /usr/sbin/mdadm --run /dev/md4
> Apr 25 17:46:14 host systemd[1]: Forked /usr/sbin/mdadm as 3969
> Apr 25 17:46:14 host systemd[1]: mdadm-last-resort changed dead -> start
> Apr 25 17:46:14 host systemd[1]: Starting Activate md array even though degraded...
> Apr 25 17:46:14 host systemd[3967]: Executing: /usr/sbin/mdadm --run /dev/md2
>
> Apr 25 17:47:02 host systemd[1]: Job dev-mapper-vg03\x2ddata_lv.device/start timed out.
> Apr 25 17:47:02 host systemd[1]: Job dev-mapper-vg03\x2ddata_lv.device/start finished, result=timeout
> Apr 25 17:47:02 host systemd[1]: Timed out waiting for device dev-mapper-vg03\x2ddata_lv.device.
> Apr 25 17:47:02 host systemd[1]: Job data-influxdb\x2ddata.mount/start finished, result=dependency
>
> Apr 25 17:47:02 host systemd[1]: Timed out waiting for device dev-mapper-vg02\x2dwal_lv.device.
> Apr 25 17:47:02 host systemd[1]: Job data-influxdb\x2dwal.mount/start finished, result=dependency
> Apr 25 17:47:02 host systemd[1]: Dependency failed for /data/influxdb-wal.
> Apr 25 17:47:02 host systemd[1]: Job data-influxdb\x2dwal.mount/start failed with result 'dependency'.
> Apr 25 17:47:02 host systemd[1]: Job dev-mapper-vg02\x2dwal_lv.device/start failed with result 'timeout'.

There are also error messages like the ones below; it looks like the disks are broken. Are they connected via iSCSI? Is the network broken?

Apr 25 17:48:07 carf-metrics-influx01 kernel: sd 11:1:47:0: [sdb] tag#0 Sense Key : Recovered Error [current]
Apr 25 17:48:07 carf-metrics-influx01 kernel: sd 11:1:47:0: [sdb] tag#0 Add. Sense: Defect list not found
Apr 25 17:48:10 carf-metrics-influx01 kernel: sd 11:1:48:0: [sdf] tag#0 Sense Key : Recovered Error [current]
Apr 25 17:48:10 carf-metrics-influx01 kernel: sd 11:1:48:0: [sdf] tag#0 Add. Sense: Defect list not found
Apr 25 17:48:12 carf-metrics-influx01 kernel: sd 0:0:5:0: [sdh] tag#0 Sense Key : Recovered Error [current]
Apr 25 17:48:12 carf-metrics-influx01 kernel: sd 0:0:5:0: [sdh] tag#0 Add. Sense: Defect list not found
Apr 25 17:48:14 carf-metrics-influx01 kernel: sd 11:1:49:0: [sdi] tag#0 Sense Key : Recovered Error [current]
Apr 25 17:48:14 carf-metrics-influx01 kernel: sd 11:1:49:0: [sdi] tag#0 Add. Sense: Defect list not found
Apr 25 17:48:16 carf-metrics-influx01 kernel: sd 0:0:6:0: [sdj] tag#0 Sense Key : Recovered Error [current]
Apr 25 17:48:16 carf-metrics-influx01 kernel: sd 0:0:6:0: [sdj] tag#0 Add. Sense: Defect list not found
Apr 25 17:48:18 carf-metrics-influx01 kernel: sd 0:0:7:0: [sdk] tag#0 Sense Key : Recovered Error [current]
Apr 25 17:48:18 carf-metrics-influx01 kernel: sd 0:0:7:0: [sdk] tag#0 Add. Sense: Defect list not found
Apr 25 17:48:20 carf-metrics-influx01 kernel: sd 11:1:50:0: [sdl] tag#0 Sense Key : Recovered Error [current]
Apr 25 17:48:20 carf-metrics-influx01 kernel: sd 11:1:50:0: [sdl] tag#0 Add. Sense: Defect list not found
Apr 25 17:48:22 carf-metrics-influx01 kernel: sd 0:0:8:0: [sdm] tag#0 Sense Key : Recovered Error [current]
Apr 25 17:48:22 carf-metrics-influx01 kernel: sd 0:0:8:0: [sdm] tag#0 Add. Sense: Defect list not found
Apr 25 17:48:25 carf-metrics-influx01 kernel: sd 11:1:51:0: [sdp] tag#0 Sense Key : Recovered Error [current]

> * Why does mdadm think the array is degraded even though all devices are present? Is it taking too long to build?
> * If we're activating, even though degraded, shouldn't LVM kick off a PV scan afterwards?

Maybe the devices are offline and md can't read data from the underlying devices. In any case, the boot log from the failing system is very messy; there must be some error in the environment.

Thanks,
Xiao

They are SAS-connected drives, attached to the system via a cable to the back of the chassis.
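For readers of this log: the unit names that timed out above, such as dev-mapper-vg03\x2ddata_lv.device, are systemd-escaped device paths ('/' becomes '-', and '-' within a name becomes \x2d). A minimal sketch of that mapping, handling only those two characters (the real escaping also hex-escapes any other byte outside [A-Za-z0-9:_.]):

```shell
# Turn an LV device path into the systemd .device unit name systemd waits on.
# Simplified sketch: escape '-' first (so it survives), then map '/' to '-'.
path="/dev/mapper/vg03-data_lv"
unit=$(printf '%s' "$path" \
  | sed -e 's|^/||' -e 's|/$||' -e 's/-/\\x2d/g' -e 's|/|-|g')
printf '%s.device\n' "$unit"
# Prints: dev-mapper-vg03\x2ddata_lv.device
```

On a live system, `systemd-escape -p --suffix=device /dev/mapper/vg03-data_lv` produces the same name.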
While we clearly see the errors, why do you think the disks are broken when the only issue is LVM activation on the md arrays in the "bad" system? It seems that whatever the cause/trigger is, it is also what prevents the mdadm-last-resort@.timer from being canceled (and thus mdadm-last-resort@.service runs). This is followed by the logical volumes failing to mount, and the system aborts boot, asking the admin to correct it (i.e., LV activation and mount). Is all of this the rest of the errors you mentioned above?

If you're able to reproduce, can you please follow the instructions as described in bug #1324028 comment #10? With this, I'll have the full sequence of events as they appeared since boot - this should help me identify why the lvm2-pvscan@major:minor service was not executed. I can see from the sosreport that the udev db has information about the MD device, but there's no SYSTEMD_WANTS="lvm2-pvscan@9:2.service" even though it's still properly identified as an LVM2 PV by blkid (here for the md2 device with the 9:2 major:minor pair). So I need to see all the previous udev events and which variables were set there, to inspect this further and to check whether at least one of them had SYSTEMD_WANTS="lvm2-pvscan..." set; if not, we can dig into why. But I think this is exactly the cause of the problem - SYSTEMD_WANTS="lvm2-pvscan..." was not set at all in any of the events for the MD devices.

Requested data sent; please let us know if any additional data is needed.

Hello, does the bug reproduce only in the customer environment? If yes, QE will just do a sanity test here and request OtherQA. Could someone help check whether the customer can feed back test results? Thanks, Guazhang

Created attachment 1282268 [details]: lvmetad udev rules to collect env var debug info

If possible, please install these modified 69-dm-lvm-metad.rules in the /etc/udev/rules.d directory - I've added some more variables to store in the udev database so we can see which rule path is taken.
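The udev-db check described above can be reproduced on the affected host with `udevadm info --query=property --name=/dev/md2`. Since that host isn't available here, the sketch below runs the same check against a simulated property dump (the property names are real udev keys; the dump itself is a hypothetical reconstruction of the failing state: an LVM2 PV with no pvscan trigger recorded):

```shell
# Simulated `udevadm info --query=property` output for /dev/md2 (major:minor 9:2).
# Hypothetical reconstruction: blkid identifies an LVM2 PV, but no
# SYSTEMD_WANTS=lvm2-pvscan@9:2.service was recorded, so autoactivation never runs.
props='MAJOR=9
MINOR=2
DEVNAME=/dev/md2
ID_FS_TYPE=LVM2_member'

if printf '%s\n' "$props" | grep -q '^SYSTEMD_WANTS=.*lvm2-pvscan'; then
    status="pvscan trigger present"
else
    status="pvscan trigger missing: lvm2-pvscan@9:2.service will not be pulled in"
fi
printf '%s\n' "$status"
```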
Also, please follow the instructions as in bug #1324028 comment #10. (After collecting all the logs, you can remove /etc/udev/rules.d/69-dm-lvm-metad.rules.)

Case contact was updated from Sebastian to Lucas. Lucas attempted to provide us the information; however, it looks like he ran the test on the system that has always been working (*02) rather than on the failing host (*01). We have requested that he retry on the failing host and provide the data.

We never did get data from the customer, and per conversation with them they are going to be migrating off this hardware. So I think we can probably close this BZ with insufficient data.

(In reply to James Hartsock from comment #40)
> We never did get data from the customer, and per conversation with them they are going to be migrating off this hardware. So I think we can probably close this BZ with insufficient data.

Closing for now; if the problem appears again, please feel free to reopen this report.