Bug 475385
| Summary: | RAID10 - Install ERROR appears during installation of RAID10 isw dmraid raid array in RHEL 5.3 Snapshot5 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Ed Ciechanowski <ed.ciechanowski> |
| Component: | anaconda | Assignee: | Joel Andres Granados <jgranado> |
| Status: | CLOSED ERRATA | QA Contact: | Release Test Team <release-test-team-automation> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | ||
| Version: | 5.3 | CC: | agk, atodorov, borgan, coughlan, cward, ddumas, dwysocha, fernando, hdegoede, heinzm, Jacek.Danecki, jgranado, jjarvis, jvillalo, keve.a.gabbert, krzysztof.wojcik, lvm-team, martinez, mbroz, naveenr, pjones, prockai, rpacheco, rvokal, syeghiay, tao |
| Target Milestone: | rc | Keywords: | OtherQA, Reopened |
| Target Release: | --- | ||
| Hardware: | i386 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-01-20 20:47:50 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 471689 | ||
| Bug Blocks: | 476866 | ||
| Attachments: | |||
Description
Ed Ciechanowski
2008-12-09 00:52:07 UTC
RAID10 (0+1) – Install ERROR appears during installation of a RAID10 isw dmraid array in RHEL 5.3 Snapshot5. SEE ATTACHMENT .JPG. If logs are needed, let me know which ones.

Is this a regression or a critical error? It's getting very late to introduce new changes into the release. Please make your case as soon as possible; otherwise we'll be forced to defer to 5.4. If fixing this in 5.4 is OK, please let me know that too.

This is part of the pyblock changes to address the modifications in the dmraid API. When pyblock searches for the dmraid set it first passes the dmraid drive name and calls group_set with this info. When that fails it shows the error message. If it fails, a second call to group_set is done with {NULL} as an argument.
My tests show that for all Intel-type RAID the error is shown, as it expects a {NULL} to be passed. I didn't completely change over to calling with the {NULL} argument for fear of breaking other RAID types.
This will not break anything; it just looks ugly. If need be, and if reassured that it won't break anything, I can take away the first call and just leave the group_set({NULL}).
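As a rough illustration of the call pattern Joel describes (this is not the actual pyblock source; the Python-level `group_set` wrapper name, its argument convention and its return value are assumptions made for the sketch), the fallback looks roughly like this:

```python
# Hedged sketch of the group_set fallback described above.  The wrapper name
# and the "None stands for {NULL}" convention are assumptions; in the real
# code path pyblock calls into the dmraid library from C.

def find_isw_raid_set(group_set, drive_name):
    """Try grouping the set by dmraid drive name first, then retry with NULL.

    For Intel (isw) arrays the first call is expected to fail -- dmraid prints
    "ERROR: only one argument allowed for this option" -- and the second,
    NULL-argument call is the one that actually groups the set.
    """
    rs = group_set([drive_name])      # first attempt: pass the drive name
    if rs is not None:
        return rs
    return group_set(None)            # second attempt: equivalent of {NULL}
```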
Ed, this is caused by the way the new-in-RHEL-5.3 isw RAID 5 / RAID 10 support has been implemented, as explained by Joel. The implementation of the new isw RAID 5 / RAID 10 support is being tracked in bug 437184 (which is still in progress), so I'm closing this as a dup of 437184.

*** This bug has been marked as a duplicate of bug 437184 ***

Hans, Joel, does this error appear when using kickstart? If so it will break all automated installs. Moving to ASSIGNED until I receive an answer to the above question.

This will not break automated installs.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
RHEL 5.3 adds support for installing to ISW RAID 5 / RAID 10 setups. Due to the way this support is implemented, when you try to install RHEL 5.3 on a system which has an ISW RAID 5 / RAID 10 setup you will see an error message like this one: "ERROR: only one argument allowed for this option". You can safely ignore this error; the installation will continue normally and the RAID array will be available to install to.

Deleted Release Notes Contents. Old Contents: (identical to the release note text above).

Please ignore comments #3 to #9. I was testing with RAID0 and was looking at a different error message. My bad, sorry for the noise.

"During the installing of RHEL 5.3 Snapshot 2 no sw raid devices are active to install OS to." Which do you mean by this -- that anaconda doesn't see the disks, or that you've disabled software RAID and yet you're seeing these errors anyway?

Sorry, but I am not sure if these questions are for me.

Test results of installs for RHEL 5.3 Snapshot5 to isw RAIDs:

RAID 1 (Mirror) – Works correctly; a true mirror is created. You can boot to either drive if one is removed. I would close this bugzilla as verified fixed.

Issues that still exist in this area:
RAID0 (Stripe) – produces I/O errors on boot. The striped RAID system boots and works normally except for the I/O errors on boot. Cloned this bugzilla – see bug 475384
RAID10 (0+1) – Install ERROR appears during installation of a RAID10 isw dmraid array in RHEL 5.3 Snapshot5. Cloned this bugzilla – see bug 475385
RAID5 – Install ERROR appears during installation of a RAID5 isw dmraid array in RHEL 5.3 Snapshot5. Cloned this bugzilla – see bug 475386

If you need logs from any of these issues please let me know. I can collect them today.

(In reply to comment #12)
> Sorry, but I am not sure if these Questions are to me.
>
> Test Results of installs for RHEL5.3 Snapshot5 to isw raids:
>
> RAID 1 (Mirror) – Work correctly, a true mirror is created. You can boot to
> either drive if one is removed. I would closed this bugzilla as verified fixed.

Yes, this has been dealt with and has been verified AFAIK.

> Issues that still exists in this area:
> RAID0 (Strip) – produces i/o errors on boot. Seems to be a Strip RAID system
> boots and work normally except for i/o errors on boot. Cloned this Bugzilla –
> see bug 475384

Yep, this is being handled in bug 475384, and whatever you can add to that bug will be appreciated; please look at my comments there.

> RAID10 (0+1) – Install ERROR appears during installation of RAID10 isw dmraid
> raid array in RHEL 5.3 Snapshot5. Cloned this Bugzilla – see bug 475385

Yes, also correct. 475385 and 475386: in your tests they present the same behaviour and seem to sprout from the same place. These two issues will be handled in this bug, 475385, unless it is proven that their causes are different.

> RAID5 – Install ERROR appears during installation of RAID5 isw dmraid raid
> array in RHEL 5.3 Snapshot5. Cloned this Bugzilla – see bug 475386

The confusion sprouted from the fact that 475385 and 475384 were cloned from 471689, which was a bug containing some issues dating back to snap2. Additionally, 471689 has some output that also added to the confusion (the output there is relevant to neither 475385 nor 475384). The issue in 471689 and the issues in 475385 and 475384 are different and present themselves at different moments in the install/boot.

Since it is now clear that this is a new bug, can you please answer pjones's question: have you disabled RAID? Are you testing with snap5? Furthermore: during install, can you change to tty2, execute `dmraid -ay -vvv -ddd`, and post the output?

I'm currently testing the same scenario as you and get slightly different behaviour: I don't get the error message, but the raid device is not seen. I will continue to investigate and post my findings; hopefully together we may find the solution to this issue. If you can, please provide the logs for the installation. They are usually left in the /root directory.

Created attachment 326396 [details]
txt from tty2 running 'dmraid -ay -vvv -ddd'
Created attachment 326397 [details]
output from the command "dmesg > dmesgout.log"
Created attachment 326398 [details]
syslog file if this will help
If other items are needed, please specify the commands to run. Thanks.
I have been looking at this, and in my tests the installer does not recognize the raid sets; it just shows the underlying devices as separate partitions.

1. When I execute `dmraid -ay` in the installer (tty2) it complains that the kernel does not support raid45.
2. When I execute `dmraid -ay` in the installed system, it correctly activates the raid device and all is dandy.
3. The installer does not handle dmraid through the dmraid command; it handles it through python-pyblock. pyblock correctly detects the raid sets in the installer but fails to activate them (in the installer). Still looking into this, but it has something to do with the table that is received from the dmraid libraries. On my machine it looks like this: "0 312614656 raid45 core 2 65536 nosync raid5_la 1 128 3 -95466056 /dev/sdb 0 /dev/sdc 0 /dev/sdd 0". The not-so-obvious problem is the number that comes before the device list. According to the kernel code this is supposed to be the number of devices to initialize, and its range is -1 to the number of raid devices, so a value of -95466056 is definitely wrong. We don't really modify this string in pyblock; we use whatever libdmraid_make_table gives us. I'm still investigating, but I would really like to hear from Heinz about what might be causing this issue.
4. The same as 3.

I am aware that the symptoms the user sees are slightly different from what I am seeing on my machine, but I trust that fixing whatever is wrong on my test machine will give us a clearer picture of what is going on.

Please test with http://jgranado.fedorapeople.org/temp/raid45.img as an updates image. You need to append updates=http://jgranado.fedorapeople.org/temp/raid45.img to the kernel args to make this work. The image adds 1-second sleeps before calling the dmraid function libdmraid_make_table. In my tests on my running system it fixed pyblock's behaviour; hopefully this will avoid the erratic behaviour in the installer. Before the change pyblock received a strange device-to-init parameter (part of the device-mapper table). It is supposed to be from -1 to the number of devices, but dmraid gives some *really* big positive numbers. When pyblock uses a table containing this bogus number, device-mapper complains that the table has the wrong format. Adding a sleep before the call seems to normalize things. The bug is seen when you execute `for a in 1 2 3 4 5 6 7 8 9 ; do dmraid -tay ; done`; one can see the *big* numbers that go before the device list. An example of the output on my machine is: "isw_bafgeadidc_Volume0: 0 312614656 raid45 core 2 65536 nosync raid5_la 1 128 3 58168736 /dev/sdb 0 /dev/sdc 0 /dev/sdd 0". Notice the 58168736; here it should be -1 <= x <= 3.

One of the reasons this did not work on my test machine is that anaconda did not have the correct modules. I have just added dm-raid45, dm-mem-cache, dm-region_hash and dm-message to the mix. Heinz: do you think some other module should be added?

Joel, why are these modules not in by default, since they are in the 126.el5 kernel?

This fixes the behavior with the big numbers I was seeing. I still need confirmation from Intel about the images. Intel: can we please get a test ASAP!

(In reply to comment #24)
> Joel,
>
> why are these modules not in by default, since they are in the 126.el5 kernel ?

In the installer we have a list of modules that we use for the installation. We only put in stuff that is needed for installs, to cut down on the size of the install images. The list contains various dmraid modules but did not have these specific ones.
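As a hedged sketch of the sanity check implied above (this is not installer code; the parsing heuristics are assumptions based on the raid45 table layout quoted in the comment), the "devices to initialize" field can be validated like this:

```python
def check_devs_to_init(table_line):
    """Check the 'devices to initialize' field of a raid45 DM table line.

    Assumes the layout quoted above: devices appear at the end as
    "<path> <offset>" pairs, and the field right before the first device
    path is the number of devices to initialize (valid range: -1 up to
    the number of raid devices).
    """
    fields = table_line.split()
    dev_paths = [f for f in fields if f.startswith("/dev/")]
    nr_devs = len(dev_paths)
    nr_to_init = int(fields[fields.index(dev_paths[0]) - 1])
    if not (-1 <= nr_to_init <= nr_devs):
        raise ValueError("bogus devices-to-init value %d, expected -1..%d"
                         % (nr_to_init, nr_devs))
    return nr_to_init

# The corrupt table quoted above fails the check:
# check_devs_to_init("0 312614656 raid45 core 2 65536 nosync raid5_la "
#                    "1 128 3 58168736 /dev/sdb 0 /dev/sdc 0 /dev/sdd 0")
# -> ValueError, since 58168736 is not in -1..3.
```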
Assigned to myself after Joel's test result shows that my lib/activate/active.c fix seems appropriate. Joel, do you still need 'needinfo' from Ed, or can you '-' it?

Comment on attachment 326792 [details]
Test SRPM for Joel
Setting attachment to public for Intel to test.
Heinz: to be sure that the dmraid fix actually fixed the anaconda strangeness the reporter was seeing, I still need him to test with my updates image. So I would leave needinfo at "?".

Heinz and Joel,
Does the raid45.img file contain the 'attachment 326792 [details]'? Should the 'attachment 326792 [details]' be included during the install procedure? We will test this as soon as possible. EDC

Ed, Joel's image will do. The SRPM is to allow for complete test coverage on your end.

Joel, please see the email I sent to you directly - jgranado. Thanks, *EDC*

(In reply to comment #35)
> Heinz and Joel,
>
> Does the raid45.img file contain the 'attachment 326792 [details]'?
> Should the 'attachment 326792 [details]' be included during procedure of
> install?

No, the attachment is a src rpm that can be used in a running system. Getting it into the installer is more difficult, and I would like to test with the current image before starting down that road. The image adds a 1-second wait before the call to libdmraid_make_table. This should make the buggy version of dmraid work with the installer.

To use the image you must append the following text to the installation parameters: "updates=http://jgranado.fedorapeople.org/temp/raid45.img". You do not need to download it to a USB stick or do anything else. If the machine you are installing has access to the internet there should be no problem. If you have trouble connecting the machine to the internet, I suggest you use an internal HTTP server to host the updates image and get it from there. HTTP is the easiest way.

Attached are the anaconda log, syslog and dmesg.txt files from installs of RAID10 and RAID5, using "updates=http://jgranado.fedorapeople.org/temp/raid45.img". The error in both anaconda.log files is as follows:

23:04:52 INFO : file location: http://jgranado.fedorapeople.org/temp/raid45.img
23:04:52 INFO : transferring http://jgranado.fedorapeople.org//temp/raid45.img to a fd
23:08:02 ERROR : failed to retrieve http://jgranado.fedorapeople.org///temp/raid45.img

I need to find out if it is the network, or set up an HTTP server of my own. See attached logs. EDC. Will update again soon.

Created attachment 327031 [details]
Anaconda.log file for RAID10 install
Created attachment 327032 [details]
Anaconda.log file for RAID5 install
Created attachment 327037 [details]
Anaconda.log file for RAID10 install w USB img
Created attachment 327038 [details]
output from the command "dmesg > raid10wUSBimg/dmesout.txt"
Created attachment 327039 [details]
syslog file if this will help for RAID10wUSBimg
Created attachment 327067 [details]
Raid 10 with network img still failing
Raid 10 with network img still failing
Created attachment 327068 [details]
Raid 5 with network img still failing
Raid 5 with network img still failing.
I have not seen the 1-second wait resolve the original issue.
The .img file is transferring over the network at home. Started the install with the following command: "boot: linux updates=http://jgranado.fedorapeople.org/temp/raid45.img". It looks like the right .img file is loading, but it still does not recognize the RAID 5 and RAID 10 arrays. Installation still FAILS with raid10 and raid5 installs.

Attached are the tar files from both raid10 and raid5 with the network image; each .tgz contains:
anaconda.log
syslog
dmesg out from install
dmraid_out from TTY2

Please find attached raid10wNETimg.tgz and raid5wNETimg.tgz respectively. TTY1 shows the messages below:

Starting Graphical installation…
ERROR: only one argument allowed for this option
Using updates image… sleeping for 2 secs…
Using updates image… sleeping for 2 secs…
Error
Error opening /dev/mapper/isw_dgdjhihjig_Volume0: no such device or address 80

I have not seen the 1-second wait resolve the original issue. I have a system with me now so I can run quick tests even early or late PST. Thanks, EDC

The first batch of tests failed most likely because the network was misconfigured, but the last output (comment #52) tells me that the sleep before the call did not work. This basically means that the issue Ed is seeing must have another root cause. We have already solved one dmraid issue, but there is still something causing the failure of dmraid at install time. I don't have anything concrete ATM, but I will post to the bug once I do.

~~~ Attention Partners ~~~
The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible. If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.

Apologies. The fix for this bug won't be present in Snap6, but it is scheduled for inclusion in the RC release, which will be available at a later date.

I finally got this to work! Stuff that was going wrong:

1. We didn't have the modules in the installer. I have already committed the change to take care of this. The installer basically did not see the dmraid set; it just saw the separate raid devices. It would not detect that there is a raid set, and would finish the install correctly if one continued the installation onto one of the devices. This brings me to the current situation seen by the Intel tests. As far as I can tell the raid sets can be activated at install time (https://bugzilla.redhat.com/attachment.cgi?id=326396, where you can clearly see that the sets were initialized). So something strange is happening here, because if snap5 or 6 was being used, the modules should not be in the installer. So the Intel test is dealing with another type of raid, or it is putting in some dm modules that are making the process break in a different way.

2. dmraid had an issue with the table creation. Heinz's package has fixed this and now everything seems to work normally. This could be seen in a running system only, not in the installer, as the installer failed because of the lack of modules. So one had to actually install and fiddle with dmraid to observe the bug.

Ed: please start the install and wait for the window asking you for the key. Go to tty2 and execute `lsmod` without executing anything else, and post the output here. This is to know what modules you have by default. Ed: also run `cat /.buildstamp` and post the output please.

Created attachment 327188 [details]
Builstamp.txt from RAID5
lsmodout.txt from RAID5
Builstamp.txt from RAID5
lsmod_R10.txt from RAID10
buildstamp_R10.txt from RAID10
Created attachment 327189 [details]
lsmod from raid5
Created attachment 327190 [details]
lsmod from RAID10
Created attachment 327191 [details]
buildstamp from RAID10
All the lsmod and buildstamp outputs above were gathered from installs started with the boot command "linux updates=http://jgranado.fedorapeople.org/temp/raid45.img" over the known-working (home) network.

Comment #60 reveals that there is still no dm-raid45 module loaded, hence RAID 5 activation will fail. Comment #61 shows dm-mirror and related modules *but* no dm-stripe, hence RAID10 activation has to fail. So the aforementioned modules are still not installed.

Heinz: I expected this. And even though the modules are not there, `dmraid -ay` works for him, as https://bugzilla.redhat.com/attachment.cgi?id=326396 clearly shows. So my conclusion is that we are looking at a different type of raid. Can we confirm this somehow with a dmraid command?

Joel, I am pretty sure Ed is basing this only on Intel Matrix RAID (dmraid isw format). "dmraid -b" shows any discovered block devices, "dmraid -r" any discovered RAID devices together with their format and more, and "dmraid -s" any discovered RAID sets. So, if neither the "-s" nor the "-r" output shows any supported RAID -> confirmation.

Created attachment 327243 [details]
dmraid commands in a .tgz
I am only using dmraid, and only isw raid. I believe this would show up on any SW dmraid devices.
I initialized the install and waited for the window asking me for the key. Went to
tty2 and executed:
dmraid -b -vvv -ddd > dmraid_b_out.txt
dmraid -r -vvv -ddd > dmraid_r_out.txt
dmraid -s -vvv -ddd > dmraid_s_out.txt
dmraid -tay -vvv -ddd > dmraid_tay_out.txt
All attached in a .tgz.
I am available earlier today for quick tests, if you need!
The attachment in comment #67 shows the presence of a RAID01 set. If this is the 127.el5 kernel, all mappings should be available, because RAID0 (the striped target) is built in. Is this test based on the 127.el5 kernel?

uname -a shows:
Linux localhost.localdomain 2.6.18-125.el5 #1 SMP Mon Dec 1 17:38:25 EST 2008 x86_64 unknown

I am downloading Snap6 now.
1). Will that bring me to > 127.el5?
2). Will the raid45.img be needed on the boot arg line?
2a). Will the image work with snap6?
3). What would you like to see from the snap6 tests, if they are not working?
It will take at least an hour to try this. Thanks.

(In reply to comment #69)
> uname -a shows:
> Linux localhost.localdomain 2.6.18-125.el5 #1 SMP Mon Dec 1 17:38:25 EST 2008
> x86_64 unknown
>
> I am downloading Snap6 now.
>
> 1). Will that bring me to > 127.el5?

It will probably be 126.el5.

> 2). Will the raid45.img be needed on the arg boot line?

No. The image worked for me; it made my raid5 installable.

> 2a). Will the image work with snap6?

It will work, but your installations will probably still be failing. Don't use the image anymore.

> 3). What would you like to see from the snap6 tests, if not working?

Please run `dmsetup targets` and post your output.

> It will take at least a hour to try this.
> Thanks.

Created attachment 327285 [details]
dmsetup targets output
dmsetup targets output. I will run some more tests and tar them up here like previously.
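To make the missing-target diagnosis above easier to repeat, here is a hedged sketch (not part of anaconda; the list of required targets is taken from Heinz's comments about dm-raid45, dm-stripe and dm-mirror, and the script only parses `dmsetup targets` output) that reports which device-mapper targets are absent:

```python
import subprocess

# Per the comments above: raid45 is needed for isw RAID5, and striped plus
# mirror are needed for isw RAID10 (0+1).  Adjust the list as appropriate.
REQUIRED_TARGETS = set(["raid45", "striped", "mirror"])

def missing_dm_targets(required=REQUIRED_TARGETS):
    """Return the required device-mapper targets that `dmsetup targets`
    does not list on the running kernel."""
    out = subprocess.Popen(["dmsetup", "targets"],
                           stdout=subprocess.PIPE).communicate()[0]
    present = set(line.split()[0] for line in out.decode().splitlines()
                  if line.strip())
    return sorted(required - present)

if __name__ == "__main__":
    missing = missing_dm_targets()
    if missing:
        print("missing DM targets: " + ", ".join(missing) +
              " (dm-raid45 / dm-stripe / dm-mirror not loaded?)")
    else:
        print("all required DM targets are present")
```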
Created attachment 327288 [details]
Snap6 raid10 install logs
Started the install with no .img arg; just pressed <enter> at the boot: prompt.
This is at the point of the install when the graphical interface shows the Skip and Back buttons for the Installation Number.
Installation of RHEL 5.3 Snapshot6 still FAILS with raid10 and raid5 installs.
Attached are .tgz files of everything from both the raid10 and raid5 runs with no network image; each .tgz contains:
anaconda.log
syslog
buildstamp_out
lsmod_out
uname -a out
dmesg out from install
dmraid_r_out from TTY2
dmraid_s_out from TTY2
dmraid_b_out from TTY2
dmraid_tay_out from TTY2
dmsetup table from TTY2
dmsetup targets from TTY2
Please find attached snap6R10Logs.tgz and raid5Snap6.tgz respectively.
Thanks,
EDC
Created attachment 327289 [details]
RAID 5 w/Snap6 logs
Attached is a .tgz of all files from raid5; it contains:
anaconda.log
syslog
buildstamp_out
lsmod_out
uname -a out
dmesg out from install
dmraid_r_out from TTY2
dmraid_s_out from TTY2
dmraid_b_out from TTY2
dmraid_tay_out from TTY2
dmsetup table from TTY2
dmsetup targets from TTY2
raid10 update: I found the issue on raid10 to be pyblock. Still doing tests, but the issue will no longer be handled by this bug. Ed: let's use https://bugzilla.redhat.com/show_bug.cgi?id=475386 to track this new issue. I just changed the description to correctly describe the situation. Thanks.

Correcting component+assignment.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0078.html