Description of problem:
Geo-replication test cases were disabled in master. I have sent a patch [1] to re-enable them, but it will fail because a few prerequisite steps are needed.

Prerequisite:
1. Set up passwordless SSH for root on all regression test machines. Please add this to the script that spawns new regression machines, so it does not have to be redone every time a failure is noticed (a rough sketch of the manual setup is at the end of this description).

[1]: https://review.gluster.org/#/c/18024/1

I remember we also made a few other changes, because the path where the gluster binaries are installed is different on the regression machines. But let's give it a run once passwordless SSH is in place; we can get through this step by step. Let me know if there are any doubts.

Version-Release number of selected component (if applicable):
mainline

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
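As a rough illustration only (not the exact commands used on the regression machines; paths and hostnames are the usual defaults), the manual setup amounts to something like:

  # generate a default key pair for root (skip if /root/.ssh/id_rsa already exists)
  test -f /root/.ssh/id_rsa || ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
  # authorize that key for root logins on the same machine
  cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
  chmod 600 /root/.ssh/authorized_keys
  # verify: should print root's uid without prompting for a password
  ssh -o StrictHostKeyChecking=no root@$(hostname -f) id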
We already do have authentication setup:
https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/jenkins_builder/tasks/authroot_georep.yml

Can you detail what kind of failure you are seeing now?
(In reply to M. Scherer from comment #1)
> We already do have authentication setup:
> https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/jenkins_builder/tasks/authroot_georep.yml
>
> Can you detail what kind of failure you are seeing now?

Cool, I will trigger a run and try it out. I thought that with the new regression machines it was no longer there.

Thanks,
Kotresh HR
(In reply to Kotresh HR from comment #2)
> (In reply to M. Scherer from comment #1)
> > We already do have authentication setup:
> > https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/jenkins_builder/tasks/authroot_georep.yml
> >
> > Can you detail what kind of failure you are seeing now?
>
> Cool, I will trigger a run and try it out. I thought that with the new
> regression machines it was no longer there.
>
> Thanks,
> Kotresh HR

But the regression [1] is still failing because passwordless SSH for root is not set up, as below:

13:10:11 [13:10:11] Running tests in file ./tests/geo-rep/georep-basic-dr-rsync.t
13:10:24 Passwordless ssh login has not been setup with slave32.cloud.gluster.org for user root.
13:11:25 Geo-replication session between master and slave32.cloud.gluster.org::slave does not exist.
13:11:26 Geo-replication session between master and slave32.cloud.gluster.org::slave does not exist.

[1] https://build.gluster.org/job/centos6-regression/5975/consoleFull

Maybe the key is not being copied to authorized_keys properly?
Kotresh,

I suggest skipping the gsec_create and push-pem steps in the geo-rep tests:

- Run the gsec_create command as part of machine setup and store the generated files outside the build tree (maybe in /root/data/georep_keys/).
- Add the common_secret.pem.pub content to the same node's authorized_keys file.

Every geo-rep test would then copy the secret.*.pem files to $BUILD/var/lib/glusterd/geo-replication/ and create the session without `push-pem`. A rough sketch of what this could look like is below.
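A minimal sketch of the suggestion, assuming the /root/data/georep_keys/ location above and the master/slave volume names used by the tests; the no-verify option is my assumption for skipping the passwordless-SSH check when push-pem is not used:

  # one-time machine setup: generate the geo-rep pem keys and park them
  # outside the build tree
  gluster system:: execute gsec_create
  mkdir -p /root/data/georep_keys
  cp /var/lib/glusterd/geo-replication/*.pem* /root/data/georep_keys/
  cat /root/data/georep_keys/common_secret.pem.pub >> /root/.ssh/authorized_keys

  # per test: reuse the pre-generated keys instead of push-pem
  cp /root/data/georep_keys/secret*.pem $BUILD/var/lib/glusterd/geo-replication/
  gluster volume geo-replication master <slave-host>::slave create no-verify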
(In reply to Aravinda VK from comment #4)
> I suggest skipping the gsec_create and push-pem steps in the geo-rep tests:
>
> - Run the gsec_create command as part of machine setup and store the
>   generated files outside the build tree (maybe in /root/data/georep_keys/).
> - Add the common_secret.pem.pub content to the same node's authorized_keys
>   file.
>
> Every geo-rep test would then copy the secret.*.pem files to
> $BUILD/var/lib/glusterd/geo-replication/ and create the session without
> `push-pem`.

Yeah, that seems to be a better approach. @mscherer, can we get that done?
No, I'd rather try to figure out what was wrong, and why it was working before. And if it was never working, why it wasn't detected sooner. So far I did find a quoting issue (and pushed a fix), but that has likely been present for more than a year. Were the tests disabled for that long?

Also, the key to connect with is /root/.ssh/id_georep; you can use it for the tests. I made sure it is properly limited to localhost, to avoid various security issues.

The reason it is done this way is that the tests were not cleaning up the old root keys when they broke (or when they blocked the system), which in turn caused trouble connecting as root after a while, IIRC (because there is a limit, if only a computational one, to the number of keys you can place in authorized_keys). We managed to avoid breakage last time because we were using salt, but now that we use ansible, any sshd breakage would be much more annoying to fix.
(In reply to M. Scherer from comment #6)
> No, I'd rather try to figure out what was wrong, and why it was working
> before. And if it was never working, why it wasn't detected sooner.

Agreed. Kotresh and Aravinda, please do not fix this from the tests; let us fix this from the infra end.
Hi M. Scherer/Nigel,

I see that a separate ssh key pair is being created for geo-rep (/root/.ssh/id_georep). This will not work, as geo-rep does not use ssh with the -i option. It requires passwordless SSH for root with the default key pair (/root/.ssh/id_rsa).

Test case: ssh root@<local-hostname> (without the -i flag; it should log in without a password on all regression machines — see the check below).

I think previously this had been set up, and it was working on the machines where it was done and failing on those where it was not.

Thanks,
Kotresh HR
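A minimal version of that check, assuming it is run on the regression machine itself (BatchMode makes ssh fail instead of prompting when passwordless login is not set up):

  # should print root's uid with no password prompt; a non-zero exit
  # status means passwordless SSH for root is not set up on this machine
  ssh -o BatchMode=yes -o StrictHostKeyChecking=no root@$(hostname -f) id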
Hi M. Scherer/Nigel,

Did we make any progress on this? It's been a really long time.
So you need to be able to ssh as root to slave21.cloud.gluster.org from inside slave21.cloud.gluster.org, without a password, and have it work?

Misc, is this okay as far as we're concerned? Doing it isn't actually a big deal, probably just a tweak in our scripts.
IIRC, this was fixed; there is a link from id_rsa to the right key.
You mean to say the test case mentioned in comment 8 or comment 11 works on all the regression machines?
Sorry, the latest run also failed for the same reason. Please fix it soon.

https://build.gluster.org/job/centos6-regression/7456/consoleFull
Why is this taking so long to be fixed?
Michael has been taking some time off due to PTOs lapsing. I'll make sure we look into solving this today.
Ok, so this was weird. We have the ssh key in place, in /root/.ssh/id_georep, and it works:

[root@slave23 .ssh]# ssh -i id_georep root@127.0.0.1 id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

And we have a symlink to the key, so it should be picked up right away:

[root@slave23 .ssh]# ls -l id_rsa
lrwxrwxrwx. 1 root root 20 Nov  3 17:56 id_rsa -> /root/.ssh/id_georep

But it didn't work. It turns out that ssh verifies the .pub and fails to load the key if it doesn't match. I am going to fix this cluster-wide (a sketch of the kind of fix is below).
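For reference only (the actual fix is done via ansible, not shown here), making the key pair consistent again could look roughly like this:

  # derive the matching public key from the private key so ssh's
  # consistency check between id_rsa and id_rsa.pub passes
  ssh-keygen -y -f /root/.ssh/id_georep > /root/.ssh/id_rsa.pub
  # alternatively, symlink the .pub alongside the id_rsa symlink:
  # ln -sf /root/.ssh/id_georep.pub /root/.ssh/id_rsa.pub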
But note that geo-rep does not use a specific ssh key. The expectation is that "ssh root@<hostname>" works, not "ssh -i <ssh-key> root@<hostname>".
It does work:

[root@slave25 ~]# ssh root@127.0.0.1 id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Then why is the following run on slave27.cloud.gluster.org failing because passwordless SSH is not set up?

https://build.gluster.org/job/centos6-regression/7672/console

---
06:16:25 [06:16:30] Running tests in file ./tests/geo-rep/georep-basic-dr-rsync.t
06:16:35 Passwordless ssh login has not been setup with slave27.cloud.gluster.org for user root.
06:17:36 Geo-replication session between master and slave27.cloud.gluster.org::slave does not exist.
06:17:36 Geo-replication session between master and slave27.cloud.gluster.org::slave does not exist.
06:19:37 stat: cannot stat `/mnt/glusterfs/1/hybrid_f1': No such file or directory
06:19:38 stat: cannot stat `/mnt/glusterfs/1/hybrid_f1': No such file or directory
-----------------
(In reply to M. Scherer from comment #18)
> It does work:
>
> [root@slave25 ~]# ssh root@127.0.0.1 id
> uid=0(root) gid=0(root) groups=0(root)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

[root@slave27 ~]# ssh root@slave27.cloud.gluster.org id
root@slave27.cloud.gluster.org's password:

It seems to only work for 127.0.0.1.
Ok. Is it easy to fix? Let us know if there are any concerns. We will change geo-rep to accept any given ssh key for the geo-rep setup (https://github.com/gluster/glusterfs/issues/362).
Ah, there's a `from="127.0.0.1"` restriction, which should probably be `from="$(hostname)"`. But this needs an ansible fix, plus misc to look at the security implications. I'll ping him when he's online. (An illustration of what that restriction looks like is below.)
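For illustration only (not the actual entry from the ansible role; the key material and hostname are placeholders), a `from=` restriction in authorized_keys looks roughly like this:

  # /root/.ssh/authorized_keys — the from= option limits where the key may be used from
  from="127.0.0.1" ssh-rsa AAAAB3Nza...EXAMPLE... root@slave27
  # proposed: also allow the machine's own hostname
  from="127.0.0.1,slave27.cloud.gluster.org" ssh-rsa AAAAB3Nza...EXAMPLE... root@slave27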
In the meantime, I will re-trigger after changing the slave host to "127.0.0.1". If it works, we can live with it for now.
If that is a possibility for the tests, that would be the best way to handle this. Fewer security concerns for us.
I have changed the geo-rep scripts to use the exported ssh identity file for geo-rep creation. With the patch [1], "/root/.ssh/id_georep" is used to push the ssh keys. I am assuming passwordless SSH is set up on all regression machines with the identity key "/root/.ssh/id_georep". Once [1] gets merged, we can close this bug.

[1]: https://review.gluster.org/#/c/18024/
Hi Nigel,

The slave25 regression machine failed at [1] even with "/root/.ssh/id_georep". Could you confirm that passwordless SSH is configured on slave25 with that identity key?

[1] https://build.gluster.org/job/centos6-regression/8099/console

Thanks,
Kotresh HR
No, no. SSH will work as root@127.0.0.1 on all machines, without needing a special key. You have to use 127.0.0.1 as the hostname, though (a sketch is below).
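In other words, something like the following should work on any regression machine; the master/slave volume names follow the test output above, and this is a sketch rather than the exact test-suite command:

  # passwordless root login to the loopback address, no -i needed
  ssh root@127.0.0.1 id
  # so the geo-rep session in the tests can target 127.0.0.1 as the slave host
  gluster volume geo-replication master 127.0.0.1::slave create push-pem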
But that is failing for some reason on the regression machines while passing locally. Could you lend me a regression machine to debug?
Can you add your SSH key to the bug, please? Ping me on Freenode once you've added it to the bug and I can give you a machine instantly.
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDQujcXnq9qOCmTWs44GwKOw48BxZGBmo9AeFk3KPRx33Cz5hS7UHPYvJy03KdoMySgxJtuxsOSvjJSPxpXesTtArO5giW8RLyuVsu9q5j/GhPKyABttuGQyexiRokrHQFLVifzgqwsUARwHarWH16Oa1n6fZPFNqSH56c872zS4Pwqgkzx99NKRWh/B+fIk8VmDzzP1qvQAnUXDeTeTOzapAL+8fSNTp3QnhmZbvCHYwUxfSqJyzq1wBL+517WrvzvX7yCY9B0wzm72sIO4daV6UDyAL1B72QpB04vEZZ/skwS/0jmokfBn43HHCMPy/Mxywfi54alFJzIu4pti30V kravishankar
You should be able to SSH in as jenkins.gluster.org
Machine is back in the pool.