1755643 – Upgrade to sssd 2.2.2-1.fc30 breaks setting up FreeIPA replica in containers

Bug 1755643 - Upgrade to sssd 2.2.2-1.fc30 breaks setting up FreeIPA replica in containers

Summary: Upgrade to sssd 2.2.2-1.fc30 breaks setting up FreeIPA replica in containers

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	sssd
Sub Component:
Version:	30
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Alexey Tikhonov
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	sync-to-jira
Depends On:
Blocks:	1772888
TreeView+	depends on / blocked

Reported:	2019-09-25 22:43 UTC by Jan Pazdziora (Red Hat)
Modified:	2020-06-16 09:47 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Clone Of:
Clones:	1772888 (view as bug list)
Environment:
Last Closed:	2020-05-26 18:44:29 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Jan Pazdziora (Red Hat) 2019-09-25 22:43:29 UTC

Description of problem:

With the new sssd 2.2.2-1.fc30, running https://github.com/freeipa/freeipa-container's tests/run-master-and-replica.sh fails due to

Failed to update DNS records.
Missing A/AAAA record(s) for host replica.example.test: 10.88.0.9.

which then leads to

Allocate local instance <class 'lib389.DirSrv'> with ldapi://%2fvar%2frun%2fslapd-EXAMPLE-TEST.socket

Failed to start replication

The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information

RuntimeError: Failed to start replication

Version-Release number of selected component (if applicable):

sssd-common-2.2.2-1.fc30 and all the other versioned sssd subpackages.

How reproducible:

Deterministic.

Steps to Reproduce:
1. Build container image from https://github.com/freeipa/freeipa-container:
   docker build -t local/freeipa-server:fedora-30 -f Dockerfile.fedora-30 .
2. Run the test:
   tests/run-master-and-replica.sh local/freeipa-server:fedora-30

Actual results:

Hostname (replica.example.test) does not have A/AAAA record.
Failed to update DNS records.
Missing A/AAAA record(s) for host replica.example.test: 172.17.0.3.
Missing reverse record(s) for address(es): 172.17.0.3.

[...]

Allocate local instance <class 'lib389.DirSrv'> with ldapi://%2fvar%2frun%2fslapd-EXAMPLE-TEST.socket
Failed to start replication
The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information
 RuntimeError: Failed to start replication
Your system may be partly configured.
Run /usr/sbin/ipa-server-install --uninstall to clean up.
FreeIPA server configuration failed.

Expected results:

Hostname (replica.example.test) does not have A/AAAA record.
Missing reverse record(s) for address(es): 172.17.0.3.

[...]

Allocate local instance <class 'lib389.DirSrv'> with ldapi://%2fvar%2frun%2fslapd-EXAMPLE-TEST.socket
Starting replication, please wait until this has completed.
Update in progress, 3 seconds elapsed
Update succeeded

Additional info:

The command which produces that "Failed to update DNS records." seems to be

args=['/usr/bin/nsupdate', '-g', '/etc/ipa/.dns_update.txt']

which is run with KRB5CCNAME=/etc/ipa/.dns_ccache and in case of the updated sssd packages, it fails with exit code 1, and the A record of the replica is not updated in the master's DNS server, causing the replication to fail.

I have no idea what SSSD has to do with this part of the FreeIPA replica setup but I've been able to show in deterministic fashion on Fedora 29 where the new 2.2.2-1.fc29 is not yet in stable updates repo, that with the new SSSD things fail and with the original version 2.2.0-3.fc29 they pass.

The /etc/ipa/.dns_update.txt content seems to be

debug

update delete replica.example.test. IN A
show
send

update delete replica.example.test. IN AAAA
show
send

update add replica.example.test. 1200 IN A 172.17.0.3
show
send

and the failing run with the new SSSD packages fails with

Reply from SOA query:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id:  34486
;; flags: qr aa rd ra; QUESTION: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;replica.example.test.          IN      SOA

;; AUTHORITY SECTION:
example.test.           0       IN      SOA     ipa.example.test. hostmaster.example.test. 1569430778 3600 900 1209600 3600

Found zone name: example.test
The master is: ipa.example.test
start_gssrequest
send_gssrequest
recvmsg reply from GSS-TSIG query
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id:  17587
;; flags: qr ra; QUESTION: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;671202487.sig-ipa.example.test.        ANY     TKEY

;; ANSWER SECTION:
671202487.sig-ipa.example.test. 0 ANY   TKEY    gss-tsig. 0 0 3 BADNAME 0  0

dns_tkey_gssnegotiate: TKEY is unacceptable 

and the update of replica.example.test's A record to 172.17.0.3 never happens.

What is even more strange, when I replace that /usr/bin/nsupdate with long sleep and run

KRB5CCNAME=/etc/ipa/.dns_ccache /usr/bin/nsupdate.orig -g /etc/ipa/.dns_update.txt

manually from another terminal (via docker exec), it passes.

Comment 1 Lukas Slebodnik 2019-09-26 07:52:36 UTC

It can be related either to https://pagure.io/SSSD/sssd/issue/4047 or https://pagure.io/SSSD/sssd/issue/4012

Comment 2 Tomas Halman 2019-09-26 12:15:48 UTC

I'm almost sure that it is not related to https://pagure.io/SSSD/sssd/issue/4047 . This patch fix the situation when nsupdate is not started at all when 
dyndns_refresh_interval is 0 (default). And here I see that nsupdate is performed.

The other one is not related to DNS too.

Can you provide sssd log with debug_level=9 for that domain?

Comment 3 Jan Pazdziora (Red Hat) 2019-09-26 16:45:16 UTC

This issue appears even if I just run ipa-client-install, so the ipa-replica-install part does not seem to play any role. I've also reproduced the issue against FreeIPA server on Fedora 29 host, so server being in the container is not the factor in the change of behaviour.

Tomáš is correct that this is separate nsupdate -g run by ipa-client-install, before SSSD is even started.

That's why there are no SSSD logs because at that point, the SSSD daemon is not even started.

Checking strace output, I wonder if something sssd_krb5_locator_plugin.so could have caused some difference in behaviour.

Comment 4 Jan Pazdziora (Red Hat) 2019-09-30 09:41:16 UTC

Here's a way to get a setup for potential bisecting. It uses one Fedora 29 machine to run FreeIPA server with DNS server on it, and on the same machine it uses docker to run replica containers.

dnf install -y freeipa-server-dns docker git
# I use --no-dnssec-validation here, otherwise things fail on my network
ipa-server-install -U -r EXAMPLE.TEST -n example.test -p Setup123 -a Secret123 --setup-dns --auto-forwarders --no-ntp --no-dnssec-validation
# The ipa-server-install puts 127.0.0.1 to resolve.conf, we need a public IP address
cp -p /etc/resolv.conf /etc/resolv.conf.ipa && echo nameserver $( dig +short $(hostname -f) ) > /etc/resolv.conf
systemctl start docker
git clone https://github.com/freeipa/freeipa-container.git
docker build -t local/freeipa-server:fedora-29-orig -f freeipa-container/Dockerfile.fedora-29 freeipa-container
# With stable Fedora 29 (and sssd 2.2.0), this following replica setup should pass
docker run --rm --name=freeipa-replica --dns=$( dig +short $( hostname -f ) ) -h replica$RANDOM.example.test local/freeipa-server:fedora-29-orig exit-on-finished ipa-replica-install -U --principal admin --no-ntp -w Secret123

# Now we prepare FreeIPA image with the newer (2.2.2) SSSD
sed -i '/dnf clean all/aRUN dnf upgrade --enablerepo=updates-testing -y sssd-common' freeipa-container/Dockerfile.fedora-29
docker build -t local/freeipa-server:fedora-29-new -f freeipa-container/Dockerfile.fedora-29 freeipa-container

# This replica setup should fail
docker run --rm --name=freeipa-replica --dns=$( dig +short $( hostname -f ) ) -h replica$RANDOM.example.test local/freeipa-server:fedora-29-new exit-on-finished ipa-replica-install -U --principal admin --no-ntp -w Secret123

On the passing setup (SSSD 2.2.0), the output will have

Systemwide CA database updated.
Hostname (replica251.example.test) does not have A/AAAA record.
Missing reverse record(s) for address(es): 172.17.0.2.
SSSD enabled

in it. In the failing case (SSSD 2.2.2), the output will be

Systemwide CA database updated.
Hostname (replica17845.example.test) does not have A/AAAA record.
Failed to update DNS records.
Missing A/AAAA record(s) for host replica17845.example.test: 172.17.0.2.
Missing reverse record(s) for address(es): 172.17.0.2.
SSSD enabled

If you'd like to use this setup for bisecting, replace that

RUN dnf upgrade --enablerepo=updates-testing -y sssd-common

with some command to get the SSSD packages from some bisect repo but make sure that step is not cached when the URL of the repo does not change between runs. Something like

RUN echo change-this-value-every-time
RUN curl -o /etc/yum.repos.d/sssd.repo https://copr.fedorainfracloud.org/coprs/mzidek/sssd-2_2_1/repo/fedora-29/mzidek-sssd-2_2_1-fedora-29.repo
RUN dnf upgrade -y sssd-common

Comment 6 Jan Pazdziora (Red Hat) 2019-10-01 10:25:24 UTC

I've bisected SSSD. The problem was introduced by https://pagure.io/SSSD/sssd/c/548ea574645f405307b14bb1113d66f9da1abf2b?branch=master.

Alexey, any idea what in that changeset could be causing this?

Comment 7 Sumit Bose 2019-10-01 10:41:59 UTC

(In reply to Jan Pazdziora from comment #6)
> I've bisected SSSD. The problem was introduced by
> https://pagure.io/SSSD/sssd/c/
> 548ea574645f405307b14bb1113d66f9da1abf2b?branch=master.
> 
> Alexey, any idea what in that changeset could be causing this?

Hi Jan,

thanks for bisecting this. Just a wild guess, since the commit is about random numbers, maybe the change in SSSD might lead to an empty entropy pool and nsupdate depends on entropy being available? This might explain your observation that it works if you call nsupdate from a different terminal, because you will call it later than the script would have done. Do you know if it works as well if you just add a sleep in the script before calling nsupdate?

bye,
Sumit

Comment 8 Jan Pazdziora (Red Hat) 2019-10-01 11:21:36 UTC

Heya Sumit,

adding

--- /usr/lib/python3.7/site-packages/ipaclient/install/client.py
+++ /usr/lib/python3.7/site-packages/ipaclient/install/client.py
@@ -1313,6 +1313,7 @@ def get_local_ipaddresses(iface=None):
 
 
 def do_nsupdate(update_txt):
+    time.sleep(10)
     logger.debug("Writing nsupdate commands to %s:", UPDATE_FILE)
     logger.debug("%s", update_txt)
 
seems to help when tested locally. I've fired up Travis CI to check it there as well.

Is it expected that the changeset modifies the handling of the entropy pool in such wild way? At the point when the problem happens, no SSSD process is running yet (AFAICT), so the only involvement of the SSSD code in the whole operation is via its shared library plugins.

Comment 9 Sumit Bose 2019-10-01 12:51:45 UTC

Hi,

maybe sss_rand() which calls OpenSSL's RAND_bytes() is used too heavily.

Alexey's comment:
    /* Fallback to libc `rand()`
     * Coverity might complain here: "DC.WEAK_CRYPTO (CWE-327)"
     * But if `RAND_bytes()` failed then there is no entropy available
     * so it doesn't make any sense to try reading "/dev/[u]random"
     */
sounds a bit like RAND_bytes() depends on entropy being available. If that's the case and you say that no other SSSD components are running, it might be the KCM. The changes to src/responder/kcm/kcmsrv_ccache_secdb.c and src/responder/kcm/kcmsrv_ccache_secrets.c use sss_rand() instead of the cheap rand().

Are you using KCM? Can you try if removing the sssd-kcm package before the installation helps as well?

bye,
Sumit

Comment 10 Jan Pazdziora (Red Hat) 2019-10-01 13:35:48 UTC

The Travis CI run still fails: https://travis-ci.org/adelton/freeipa-container/jobs/591967844.

Comment 11 Jan Pazdziora (Red Hat) 2019-10-01 13:36:15 UTC

I don't use KCM, sssd-kcm is not even installed.

Comment 12 Sumit Bose 2019-10-01 16:35:35 UTC

(In reply to Jan Pazdziora from comment #11)
> I don't use KCM, sssd-kcm is not even installed.

Are you sure it is not installed in the base image, it's in the core group in F30 https://pagure.io/fedora-comps/blob/master/f/comps-f30.xml.in#_634.

bye,
Sumit

Comment 13 Jan Pazdziora (Red Hat) 2019-10-01 16:46:01 UTC

Not in container images:

#  docker run --rm registry.fedoraproject.org/fedora:30 rpm -q sssd-kcm
package sssd-kcm is not installed

Which of course gets us to -- should it be installed in FreeIPA container? Does it have the potential of fixing the entropy problem?

Comment 14 Sumit Bose 2019-10-01 17:03:53 UTC

(In reply to Jan Pazdziora from comment #13)
> Not in container images:
> 
> #  docker run --rm registry.fedoraproject.org/fedora:30 rpm -q sssd-kcm
> package sssd-kcm is not installed

ok

> 
> Which of course gets us to -- should it be installed in FreeIPA container?
> Does it have the potential of fixing the entropy problem?

No, this was just to most probable candidate I've seen in the commit you've bisected. Additionally I'm still not sure if there are sufficient evidence that this is really an entropy issue because nowadays everything should use getentropy(). Or is there something special in containers with respect to the kernel and random numbers? 

This commit is quite self contained, shall I prepare a build of sssd-2.2.2-1 just without this commit to verify that it is really the culprit?

bye,
Sumit

Comment 15 Jan Pazdziora (Red Hat) 2019-10-01 17:11:42 UTC

(In reply to Sumit Bose from comment #14)
>
> that this is really an entropy issue because nowadays everything should use
> getentropy(). Or is there something special in containers with respect to
> the kernel and random numbers? 

In containers we already mount /dev/urandom as /dev/random to avoid long startup times of KDC, IIRC. The Travis CI setup is said to have problems with entropy because all the VMs share one source. Why I only see it in Travis with read-write containers and not with read-only ones is not clear to me though.

> This commit is quite self contained, shall I prepare a build of sssd-2.2.2-1
> just without this commit to verify that it is really the culprit?

That'd be great, some dnf/copr repo with build for Fedora 30 would help.

Thank you, Jan

Comment 16 Jan Pazdziora (Red Hat) 2019-10-02 07:10:07 UTC

Tomáš Mráz kindly took a look at the commit.

He notes: the change uses sss_rand() instead of rand() or rand_r(). However, sss_rand()'s return values are in the whole integer range, while man 3 rand says

       The  rand()  function returns a pseudo-random integer in the range 0 to
       RAND_MAX inclusive (i.e., the mathematical range [0, RAND_MAX]).

So the values returned by sss_rand() are not compatible with the expected values returned in the context where rand() used to be.

Comment 17 Jan Pazdziora (Red Hat) 2019-10-02 10:00:21 UTC

I tested SSSD 2.2.0 with

diff --git a/src/util/crypto/libcrypto/crypto_prng.c b/src/util/crypto/libcrypto/crypto_prng.c
index 443cfeeb4..c25ef82c7 100644
--- a/src/util/crypto/libcrypto/crypto_prng.c
+++ b/src/util/crypto/libcrypto/crypto_prng.c
@@ -34,7 +34,7 @@ int sss_rand(void)
     int result;
 
     if (RAND_bytes((unsigned char*)&result, (int)sizeof(int)) == 1) {
-        return result;
+        return result & INT_MAX;
     }
 
     /* Fallback to libc `rand()`

to make the sss_rand()'s output a positive integer but it does not fix the problem when running locally.

Comment 18 Jan Pazdziora (Red Hat) 2019-10-02 10:08:47 UTC

On the other hand, Sumit's build at https://copr.fedorainfracloud.org/coprs/sbose/sssd-test-build/repo/fedora-30/sbose-sssd-test-build-fedora-30.repo which reverses the commit passes even in Travis CI: https://travis-ci.org/adelton/freeipa-container/builds/592473561

Comment 19 Jan Pazdziora (Red Hat) 2019-10-02 11:24:12 UTC

Removing the RAND_bytes part with

diff --git a/src/util/crypto/libcrypto/crypto_prng.c b/src/util/crypto/libcrypto/crypto_prng.c
index 443cfeeb4..de3e6e82a 100644
--- a/src/util/crypto/libcrypto/crypto_prng.c
+++ b/src/util/crypto/libcrypto/crypto_prng.c
@@ -31,11 +31,6 @@
 int sss_rand(void)
 {
     static bool srand_done = false;
-    int result;
-
-    if (RAND_bytes((unsigned char*)&result, (int)sizeof(int)) == 1) {
-        return result;
-    }
 
     /* Fallback to libc `rand()`
      * Coverity might complain here: "DC.WEAK_CRYPTO (CWE-327)"

does not help when tested locally.

Comment 20 Simo Sorce 2019-10-02 14:06:45 UTC

If you are going to change how random is sourced in the code, please pull me in for a crypto review.

Comment 21 Jan Pazdziora (Red Hat) 2019-10-02 19:16:33 UTC

The part of https://pagure.io/SSSD/sssd/c/548ea574645f405307b14bb1113d66f9da1abf2b?branch=master which causes the problem is the

--- a/src/sss_client/common.c
+++ b/src/sss_client/common.c
@@ -566,6 +566,11 @@ static int sss_cli_open_socket(int *errnop, const char *socket_name, int timeout
     /* this piece is adapted from winbind client code */
     wait_time = 0;
     sleep_time = 0;
+    /* This is not security relevant functionality and
+     * it is undesirable to pull unnecessary dependency (util/crypto)
+     * so plain srand() & rand() are used here.
+     */
+    srand(time(NULL) * getpid());
     while (inprogress) {
         int connect_errno = 0;
         socklen_t errnosize;

When I revert that change in Fedora 29 container with SSSD 2.2.2 built from sources, running ipa-replica-install passes, at least when run locally on Fedora 30 host.

I have no idea why removing that srand() call helps.

Comment 22 Simo Sorce 2019-10-02 20:34:46 UTC

I am really baffled and dismayed to see the use of srand() with such lousy justifications in the code, especially in files named *crypto*. This is a score of CVEs waiting to happen and that code should simply be reverted.
If there is an absolute need to avoid blocking then other techniques can be used, like seeding from a, per machine, stored random seed (obviously previously generated by a proper CSPRNG).

For example we use sss_rand() to seed the murmurhash tables.
This is a *security* feature to avoid "dictionary based collision attacks", and srand() as used is *completely predictable*, and totally not appropriate.

This is not the right bug to discuss further this issue, but my recommendation would be to completely revert away any use of srand() and go back to the drawing table on the original problem that brought it in the code.

Comment 23 Jan Pazdziora (Red Hat) 2019-10-03 05:31:54 UTC

I agree.

Given that Sumit's build with the commit reverted worked for our FreeIPA container CI (master + replica), it's be great to have that commit (or this part of the commit) reverted and in Fedoras 29+ as the first step.

Comment 24 Alexey Tikhonov 2019-10-07 10:31:09 UTC

(In reply to Jan Pazdziora from comment #16)
> Tomáš Mráz kindly took a look at the commit.
...
> So the values returned by sss_rand() are not compatible with the expected
> values returned in the context where rand() used to be.

Good catch. I will take a look into this.

Comment 25 Alexey Tikhonov 2019-10-07 11:13:27 UTC

Hi Simo,

(In reply to Simo Sorce from comment #22)
> I am really baffled and dismayed to see the use of srand() with such lousy
> justifications in the code, especially in files named *crypto*. This is a
> score of CVEs waiting to happen and that code should simply be reverted.

Could you please explain?

Jan reports problem is:
https://pagure.io/SSSD/sssd/blob/548ea574645f405307b14bb1113d66f9da1abf2b/f/src/sss_client/common.c#_573

(1) what do you mean by "especially in files named *crypto*"? Is 'is sss_client/commons.c'
(2) PRNG is used in this place to add random delay of [1..2] seconds before next attempt to connect (after getting EAGAIN)
This is not security relevant.
(3) Actually I don't know what is the reason of randomization of this delay at all. But this part of code used `rand()` without `srand()` and this felt wrong.


> If there is an absolute need to avoid blocking then other techniques can be
> used, like seeding from a, per machine, stored random seed (obviously
> previously generated by a proper CSPRNG).
> For example we use sss_rand() to seed the murmurhash tables.

This makes me think you are talking about different patch. But I do not understand what do you mean.

Commit in question replaces majority of calls to plain `srand()&rand()` with new `sss_rand()` which uses `RAND_bytes()` (and only falls back to rand() if RAND_bytes() fails). In a very few places `srand()&rand()` are left untouched for a reason.
So this commit makes usage of PRNG *stronger*/doesn't change, not weaker.


> This is a *security* feature to avoid "dictionary based collision attacks",
> and srand() as used is *completely predictable*, and totally not appropriate.

Actually I do not think this is security feature at least in the sense of FIPS (but I agree this is not a right place to discuss)
But even here:

@@ -1353,8 +1353,7 @@ errno_t sss_mmap_cache_init(TALLOC_CTX *mem_ctx, const char *name,
     /* generate a pseudo-random seed.
      * Needed to fend off dictionary based collision attacks */
-    rseed = time(NULL) * getpid();
-    mc_ctx->seed = rand_r(&rseed);
+    mc_ctx->seed = sss_rand();

  --  call to plain rand_r() was replaces with sss_rand()/RAND_bytes()


So what do you offer to revert?

May be you mean comment 19? But this is not part of a patch set - it was Jan's test.

Comment 26 Alexey Tikhonov 2019-10-07 11:32:29 UTC

(In reply to Simo Sorce from comment #22)
> recommendation would be to completely revert away any use of srand()

(In reply to Jan Pazdziora from comment #23)
> I agree.

I don't agree. Removal of `srand()` in this context looks like hiding an issue.

I think we should debug how use of `srand()` causes problem (because this looks really weird)

Jan, would it be please possible to try following things:

(1) Replace `srand(time(NULL) * getpid())` with `srand(1)`
or
(2) https://pagure.io/SSSD/sssd/blob/548ea574645f405307b14bb1113d66f9da1abf2b/f/src/sss_client/common.c#_608
Change `sleep_time = rand() % 2 + 1;` with `sleep_rime = some_const` (with some_const lets say 0, 1 and 10)
?

Comment 27 Jan Pazdziora (Red Hat) 2019-10-07 11:41:55 UTC

I won't have bandwidth to debug this in foreseeable future, so you guys will have to take this from here. Steps from comment 4 should reasonably reliable way to reproduce the problem.

I assume that the additional srand() does not have much to do with the delay in sss_cli_open_socket() but it resets the seed for some other operation which cares about randomness and which gets interleaved with that sss_cli_open_socket().

Comment 28 Alexey Tikhonov 2019-10-07 12:17:42 UTC

(In reply to Jan Pazdziora from comment #27)
> I won't have bandwidth to debug this in foreseeable future, so you guys will
> have to take this from here.

Ok.

> I assume that the additional srand() does not have much to do with the delay
> in sss_cli_open_socket() but it resets the seed for some other operation
> which cares about randomness and which gets interleaved with that
> sss_cli_open_socket().

Yeap, I thought about this.
Actually I think app that cares about PRNG state should use rand_r().
Otherwise is is just broken design and removal of `srand()` doesn't fully resolve issue since `rand()` in sss_client still can break things.

But may be we should be merciful and get rid of using any PRNG in sss_client at all.

(Setting back "needinfo: ssorce" as I really didn't understand his comment)

Comment 29 Simo Sorce 2019-10-07 17:20:55 UTC

(In reply to Alexey Tikhonov from comment #25)
> Hi Simo,
[..]
> Could you please explain?
[..]
> This makes me think you are talking about different patch. But I do not
> understand what do you mean.
> 
> Commit in question replaces majority of calls to plain `srand()&rand()` with
> new `sss_rand()` which uses `RAND_bytes()` (and only falls back to rand() if
> RAND_bytes() fails). In a very few places `srand()&rand()` are left
> untouched for a reason.
> So this commit makes usage of PRNG *stronger*/doesn't change, not weaker.

Sorry I probably didn't look very careful at the history of the commits here,
but you should never use srand() even as a fallback in sssd code.
It is true that there are many uses where true randomness is not required,
but there are uses (like the one below) where it is. We should avoid issues
in principle by using a decent PRNG (the crypto library provides a good one)
or just use /dev/urandom (although that would be slower), but never srand() or
similar.

I do not even see how that is useful as a fallback, if the crypto library random
generator fails it is eaither a very serious case, or there is a bug that need
fixing, trying to paper over that failure just creates big risks when the
randomness is critical for security.

Once that srand() is out of the window we should try again Jan's scenarios
and see if they still reproduce. And if they do we know it is not due to the
RNG and look elsewhere.

> > This is a *security* feature to avoid "dictionary based collision attacks",
> > and srand() as used is *completely predictable*, and totally not appropriate.
> 
> Actually I do not think this is security feature at least in the sense of
> FIPS (but I agree this is not a right place to discuss)
> But even here:
> 
> @@ -1353,8 +1353,7 @@ errno_t sss_mmap_cache_init(TALLOC_CTX *mem_ctx, const
> char *name,
>      /* generate a pseudo-random seed.
>       * Needed to fend off dictionary based collision attacks */
> -    rseed = time(NULL) * getpid();
> -    mc_ctx->seed = rand_r(&rseed);
> +    mc_ctx->seed = sss_rand();
> 
>   --  call to plain rand_r() was replaces with sss_rand()/RAND_bytes()
> 
> 
> So what do you offer to revert?

The problem is not calling sss_rand() here, that is fine, the problem is in the
fallback to srand() built inside of sss_rand(), that should be eliminated.
 
> May be you mean comment 19? But this is not part of a patch set - it was
> Jan's test.

Yeah sorry for derailing this bug a bit, but I think the RNG was too important
to gloss over it, I also created an upstream Issue.

Comment 30 Alexey Tikhonov 2019-10-08 08:49:24 UTC

> but you should never use srand() even as a fallback in sssd code.

Ok, thank you for explanation.
I understand this claim and I will pay attention to it in the upstream #4024 (I merged #4091 there to keep all relevant concerns/history in a single ticket)


> Once that srand() is out of the window we should try again Jan's scenarios
> and see if they still reproduce

But the thing is, Jan reports other `srand()` is causing a trouble. The one in sss_client.so. So eliminating `srand()` from `sss_rand()` won't affect this bz.

Btw, do you by chance remember a reason of randomization of delay between attempts to connect in the client code?
I am asking because git blame says you are the author (but of course it was very long ago).

Our guess is some app may rely on PRNG state and the call to `srand()` in sss_client.so breaks it.
This of course need to be proved. But if it is the case then I can think about two ways to mitigate the problem:
1) to use `rand_r()` is sss_client.so with any seed (because this is totally non security relevant functionality)
2) to get rid of this randomization completely because honestly I see no reason for it (was it to "hide" a race condition when multiple clients connects to a responder?)

What would you think?

Comment 31 Simo Sorce 2019-10-08 12:48:27 UTC

(In reply to Alexey Tikhonov from comment #30)
> > but you should never use srand() even as a fallback in sssd code.
> 
> Ok, thank you for explanation.
> I understand this claim and I will pay attention to it in the upstream #4024
> (I merged #4091 there to keep all relevant concerns/history in a single
> ticket)
> 
> 
> > Once that srand() is out of the window we should try again Jan's scenarios
> > and see if they still reproduce
> 
> But the thing is, Jan reports other `srand()` is causing a trouble. The one
> in sss_client.so. So eliminating `srand()` from `sss_rand()` won't affect
> this bz.

Ah, got it, if I could I would eliminate srand() from there too.

> Btw, do you by chance remember a reason of randomization of delay between
> attempts to connect in the client code?
> I am asking because git blame says you are the author (but of course it was
> very long ago).

It was to avoid thundering herd issues[1]. We should re-evaluate if this is still needed.

> Our guess is some app may rely on PRNG state and the call to `srand()` in
> sss_client.so breaks it.

That may be an issue indeed, one more reason to not use srand() there, perhaps we should replace with a call to getrandom() these days.

> This of course need to be proved. But if it is the case then I can think
> about two ways to mitigate the problem:
> 1) to use `rand_r()` is sss_client.so with any seed (because this is totally
> non security relevant functionality)
> 2) to get rid of this randomization completely because honestly I see no
> reason for it (was it to "hide" a race condition when multiple clients
> connects to a responder?)
> 
> What would you think?

Try getrandom() first.

[1] https://en.wikipedia.org/wiki/Thundering_herd_problem

Comment 32 Alexey Tikhonov 2019-10-08 14:25:04 UTC

(In reply to Simo Sorce from comment #31)
> > Btw, do you by chance remember a reason of randomization of delay between
> > attempts to connect in the client code?
> 
> It was to avoid thundering herd issues[1]. We should re-evaluate if this is
> still needed.
> 
> > Our guess is some app may rely on PRNG state and the call to `srand()` in
> > sss_client.so breaks it.
> 
> That may be an issue indeed, one more reason to not use srand() there,
> perhaps we should replace with a call to getrandom() these days.

But getrandom() is Linux-specific and was only added to glibc in version 2.25.

I was discussing possibility to use getrandom() in non security/FIPS relevant places with Jakub one day and conclusion was we do not want to worsen portability. Especially for no good reason.
And we really do not need high quality PRNG to fight thundering herd issue. If we need to fight it at all. I would personally vote to get rid of randomization at all: there is nothing like a poll(), there is blunt sleep() for a few seconds, every process/thread will be awaken in a different time anyway (IMO).

Comment 33 Simo Sorce 2019-10-08 14:37:16 UTC

Fine with getting rid of it if people think that is really not needed.
That said getrandom() can be assumed available anywhere it matters, I do not see it as an issue for the client code.

Comment 37 Simo Sorce 2019-10-08 20:28:19 UTC

Alexander, ok I won't assume it is irrelevant, but still it will require custom code, so it can also take care of getrandom() in the same set of ifdefs or in a new client interface.
What I am saying is that AIX compatibility shouldn't be a relevant factor in using better interfaces on other platforms.

Comment 38 Jan Pazdziora (Red Hat) 2019-10-14 07:14:39 UTC

Do we have some short term plan for fixing the regression in Fedora 30+? It affects our ability of testing FreeIPA containerization properly.

Comment 39 Alexey Tikhonov 2019-10-14 11:54:04 UTC

(In reply to Jan Pazdziora from comment #38)
> Do we have some short term plan for fixing the regression in Fedora 30+?

There is upstream PR (https://github.com/SSSD/sssd/pull/900) that should fix the problem according to your comment 21.

Unfortunately I didn't investigate root cause of this issue yet.

Comment 40 Lukas Slebodnik 2019-10-14 12:11:03 UTC

(In reply to Alexey Tikhonov from comment #39)
> (In reply to Jan Pazdziora from comment #38)
> > Do we have some short term plan for fixing the regression in Fedora 30+?
> 
> There is upstream PR (https://github.com/SSSD/sssd/pull/900) that should fix
> the problem according to your comment 21.
> 

We should not call it "fix the problem".
It is just a workaround. Which can be temporarily applied to fedora dist-git
till the problem will be solved.

> Unfortunately I didn't investigate root cause of this issue yet.

That should be preferred way.

Comment 41 Alexey Tikhonov 2019-10-14 12:16:30 UTC

(In reply to Lukas Slebodnik from comment #40)
> > There is upstream PR (https://github.com/SSSD/sssd/pull/900) that should fix
> > the problem according to your comment 21.
> 
> We should not call it "fix the problem".

Ok, I agree.

> Which can be temporarily applied to fedora dist-git

I am inclined to offer patch#900 for master branch independent of root cause of this bz.

Comment 42 Lukas Slebodnik 2019-10-14 12:43:49 UTC

(In reply to Alexey Tikhonov from comment #41)
> (In reply to Lukas Slebodnik from comment #40)
> > > There is upstream PR (https://github.com/SSSD/sssd/pull/900) that should fix
> > > the problem according to your comment 21.
> > 
> > We should not call it "fix the problem".
> 
> Ok, I agree.
> 
> > Which can be temporarily applied to fedora dist-git
> 
> I am inclined to offer patch#900 for master branch independent of root cause
> of this bz.

Sure, you can offer anything and anything can be merged.

But that MR says "Resolves: https://pagure.io/SSSD/sssd/issue/4094"
and it does not resolve the problem. It is just a workaround.

> Unfortunately I didn't investigate root cause of this issue yet.

It might be even solution but there is missing explanation why it is solution.

Comment 43 Jan Pazdziora (Red Hat) 2019-10-14 12:53:43 UTC

In any case, I'd appreciate quick fix in Fedora builds so that we have working FreeIPA setup again, ideally via a Patch in Fedoras' .spec files.

Then the https://pagure.io/SSSD/sssd/issue/4094 can be used to continue with the investigation of the root cause and finding the proper fix. But I'm not excited about the prospect of waiting for merge to master and next upstream release to fix the regression we see in Fedora.

Comment 44 Alexey Tikhonov 2019-10-14 13:06:02 UTC

(In reply to Lukas Slebodnik from comment #42)
> > I am inclined to offer patch#900 for master branch independent of root cause
> > of this bz.
> 
> Sure, you can offer anything and anything can be merged.
> 
> But that MR says "Resolves: https://pagure.io/SSSD/sssd/issue/4094"
> and it does not resolve the problem. It is just a workaround.

This bz != #4094. It might be, but this is not proven yet.
The goal of #4094 is to avoid touching PRNG in a client code to be on a safe side in general (in case of any poorly designed application).
And I think PR#900 resolves #4094 as it is stated.



(In reply to Jan Pazdziora from comment #43)
> But I'm not
> excited about the prospect of waiting for merge to master and next upstream
> release to fix the regression we see in Fedora.

I will ask Michal to build a new Fedora sssd package as soon as patch is ack'ed by other sssd maintainer.

Comment 45 Lukas Slebodnik 2019-10-14 13:25:10 UTC

(In reply to Lukas Slebodnik from comment #42)
> (In reply to Alexey Tikhonov from comment #41)
> > (In reply to Lukas Slebodnik from comment #40)
> > > > There is upstream PR (https://github.com/SSSD/sssd/pull/900) that should fix
> > > > the problem according to your comment 21.
> > > 
> > > We should not call it "fix the problem".
> > 
> > Ok, I agree.
> > 
> > > Which can be temporarily applied to fedora dist-git
> > 
> > I am inclined to offer patch#900 for master branch independent of root cause
> > of this bz.
> 
> Sure, you can offer anything and anything can be merged.
> 
> But that MR says "Resolves: https://pagure.io/SSSD/sssd/issue/4094"
> and it does not resolve the problem. It is just a workaround.
> 

They are tightly related, 
https://pagure.io/SSSD/sssd/issue/4094#comment-604405

IMHO the safest option is to revert to previous state in the client code.
Just remove `srand`
https://pagure.io/SSSD/sssd/blob/master/f/src/sss_client/common.c#_573
(it is to unblock others in fedora)

Then spend time with investigation root cause of the problem.
And then maybe even merge PR#900 (in the current state)

Because if you do not know what is the problem then much better is to
return to previous version rather then create 3rd one which might cause issues in other scenarios.
(reported few months later by enterprise customers)

Comment 46 Alexey Tikhonov 2019-10-14 13:39:25 UTC

(In reply to Lukas Slebodnik from comment #45)
> > But that MR says "Resolves: https://pagure.io/SSSD/sssd/issue/4094"
> > and it does not resolve the problem. It is just a workaround.
> 
> They are tightly related, 
> https://pagure.io/SSSD/sssd/issue/4094#comment-604405

When I have opened #4094 I didn't mean it. Comment you provide link to doesn't change intention of that upstream ticket.
Sorry for any possible misunderstanding. I will clean "Devel Whiteboard".

> IMHO the safest option is to revert to previous state in the client code.
> Just remove `srand`
> https://pagure.io/SSSD/sssd/blob/master/f/src/sss_client/common.c#_573
> (it is to unblock others in fedora)

Might be.

> Because if you do not know what is the problem then much better is to
> return to previous version rather then create 3rd one which might cause
> issues in other scenarios.
> (reported few months later by enterprise customers)

I do not consider PR#900 to be a "3rd version to fix this bz"
I consider PR#900 to be "a general improvement unrelated to this bz whose patch can be used as a (temporarily) fix for this bz"

But yes, it is possible to go with separate patch to remove srand()

Comment 47 Alexey Tikhonov 2019-10-14 13:46:33 UTC

Hi Michal,

what approach you as a package maintainer would prefer to address this bz?

Comment 48 Michal Zidek 2019-10-14 15:21:53 UTC

(In reply to Alexey Tikhonov from comment #47)
> Hi Michal,
> 
> what approach you as a package maintainer would prefer to address this bz?

I would personally prefer to revert the "problematic patch" in Fedora and solve the issue in upstream by the next upstream release.

Comment 49 Michal Zidek 2019-10-14 15:25:49 UTC

Sumit,

you provided Jan with a scratch build that reverted the patch and it worked for Jan. Can you please open upstream PR with the revert patch? I would prefer to have the patch also in upstream until we have a consensus how to address the issue properly.

Michal

Comment 50 Alexey Tikhonov 2019-10-14 15:35:55 UTC

(In reply to Michal Zidek from comment #49)
> Sumit,
> 
> you provided Jan with a scratch build that reverted the patch and it worked
> for Jan. Can you please open upstream PR with the revert patch? I would
> prefer to have the patch also in upstream until we have a consensus how to
> address the issue properly.


I disagree to have whole PR#838 reverted.

I think we should either:
(1) remove srand() in https://pagure.io/SSSD/sssd/blob/master/f/src/sss_client/common.c#_573 "via a Patch in Fedoras' .spec " as a "tmp fix"
or
(2) merge PR#900 upstream and pick it up

Comment 51 Sumit Bose 2019-10-14 17:36:23 UTC

Hi,

I agree with Alexey. Jan already said that just removing srand() worked for him, so it is sufficient to remove the hunk which added it, luckily the hunk didn't include anything else.

bye,
Sumit

Comment 52 Lukas Slebodnik 2019-10-14 22:32:10 UTC

(In reply to Sumit Bose from comment #51)
> Hi,
> 
> I agree with Alexey. Jan already said that just removing srand() worked for
> him, so it is sufficient to remove the hunk which added it, luckily the hunk
> didn't include anything else.
> 
> bye,
> Sumit

It also contains comment :-)

(In reply to Alexey Tikhonov from comment #46)
> (In reply to Lukas Slebodnik from comment #45)
> > > But that MR says "Resolves: https://pagure.io/SSSD/sssd/issue/4094"
> > > and it does not resolve the problem. It is just a workaround.
> > 
> > They are tightly related, 
> > https://pagure.io/SSSD/sssd/issue/4094#comment-604405
> 
> When I have opened #4094 I didn't mean it. Comment you provide link to
> doesn't change intention of that upstream ticket.
> Sorry for any possible misunderstanding. I will clean "Devel Whiteboard".
> 
> > IMHO the safest option is to revert to previous state in the client code.
> > Just remove `srand`
> > https://pagure.io/SSSD/sssd/blob/master/f/src/sss_client/common.c#_573
> > (it is to unblock others in fedora)
> 
> Might be.
> 
> > Because if you do not know what is the problem then much better is to
> > return to previous version rather then create 3rd one which might cause
> > issues in other scenarios.
> > (reported few months later by enterprise customers)
> 
> I do not consider PR#900 to be a "3rd version to fix this bz"
> I consider PR#900 to be "a general improvement unrelated to this bz whose
> patch can be used as a (temporarily) fix for this bz"
> 
> But yes, it is possible to go with separate patch to remove srand()

1st version is to version in 2.2.0 
2nd version is PR#900

OK, it is not 3rd version :-) (off by one mistake)

But I would not call PR#900 "a general improvement". It improves state
in sssd >= 2.2.1 but it is questionable whether it improve something
comparing to sssd <= 2.2.0. I'm sorry there is missing justification for calling it an improvement.
And it is diffifcult to explain anything without knowing root cause of the problem.

Comment 53 Alexey Tikhonov 2019-10-15 08:27:20 UTC

(In reply to Lukas Slebodnik from comment #52)

> But I would not call PR#900 "a general improvement". It improves state
> in sssd >= 2.2.1 but it is questionable whether it improve something
> comparing to sssd <= 2.2.0. I'm sorry there is missing justification for
> calling it an improvement.

Please, feel free to provide opinion directly in #4094 or PR#900.

Comment 54 Michal Zidek 2019-10-15 09:25:00 UTC

I acked the PR#900 and will prepare a Fedora build once it gets merged upstream. If the PR900 will not get merged in a reasonable time I will make it a dowsntream-only patch to get Fedora unblocked.

Comment 55 Tomas Mraz 2019-10-15 16:46:00 UTC

It would be really interesting what was broken by the srand() call. I.E. what was the other library that expected deterministic rand() which was broken by it. It would be good idea to have that other place fixed to use rand_r or something else to generate a stable pseudorandom sequence.

Comment 56 Lukas Slebodnik 2019-10-15 21:18:00 UTC

(In reply to Alexey Tikhonov from comment #53)
> (In reply to Lukas Slebodnik from comment #52)
> 
> > But I would not call PR#900 "a general improvement". It improves state
> > in sssd >= 2.2.1 but it is questionable whether it improve something
> > comparing to sssd <= 2.2.0. I'm sorry there is missing justification for
> > calling it an improvement.
> 
> Please, feel free to provide opinion directly in #4094 or PR#900.

The opinion was provided few times in the BZ.
I is either accepted or it is not considered to be valid.
copy&paste would not change anything.

Comment 57 Jan Pazdziora (Red Hat) 2019-10-21 09:42:45 UTC

I can see the https://github.com/SSSD/sssd/pull/900 was merged.

Any chance of new Fedora 30+ builds coming soon?

Comment 58 Adam Williamson 2019-10-22 18:23:03 UTC

I'm backporting the fix for #1757224, so I'll pull that in too.

Comment 59 Fedora Update System 2019-10-22 19:31:14 UTC

FEDORA-2019-3bdedf56fb has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-3bdedf56fb

Comment 60 Fedora Update System 2019-10-22 19:46:14 UTC

FEDORA-2019-1b16b0e9ae has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-1b16b0e9ae

Comment 61 Fedora Update System 2019-10-23 15:44:50 UTC

sssd-2.2.2-3.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-e9fc910d7f

Comment 62 Alexey Tikhonov 2019-10-23 15:48:24 UTC

(In reply to Tomas Mraz from comment #55)
> It would be really interesting what was broken by the srand() call. I.E.
> what was the other library that expected deterministic rand() which was
> broken by it. It would be good idea to have that other place fixed to use
> rand_r or something else to generate a stable pseudorandom sequence.

Hm... It seems now we don't have a ticket to track this ^^

Comment 63 Fedora Update System 2019-10-25 19:33:44 UTC

sssd-2.2.2-3.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-1b16b0e9ae

Comment 64 Fedora Update System 2019-10-25 21:26:41 UTC

sssd-2.2.2-3.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-3bdedf56fb

Comment 65 Jan Pazdziora (Red Hat) 2019-10-28 06:45:21 UTC

I confirm that the sssd-2.2.2-3.fc30 build in Fedora 30 updates-testing fixes the problem: https://travis-ci.org/freeipa/freeipa-container/jobs/603619439

Comment 66 Jan Pazdziora (Red Hat) 2019-10-28 06:51:42 UTC

(In reply to Alexey Tikhonov from comment #62)
> (In reply to Tomas Mraz from comment #55)
> > It would be really interesting what was broken by the srand() call. I.E.
> > what was the other library that expected deterministic rand() which was
> > broken by it. It would be good idea to have that other place fixed to use
> > rand_r or something else to generate a stable pseudorandom sequence.
> 
> Hm... It seems now we don't have a ticket to track this ^^

It seems like SSSD upstream / maintainers will have to open a new one.

Comment 67 Alexey Tikhonov 2019-10-30 19:05:31 UTC

(In reply to Tomas Mraz from comment #55)
> It would be really interesting what was broken by the srand() call.

Here is my (somewhat vague) understanding.

> ;; QUESTION SECTION:
> ;671202487.sig-ipa.example.test.        ANY     TKEY
> ;; ANSWER SECTION:
> 671202487.sig-ipa.example.test. 0 ANY   TKEY    gss-tsig. 0 0 3 BADNAME 0  0

According to rfc3645[1] "GSS-TSIG : 4.1.1. Receive TKEY Query from Client" :
"If the name is found in the table and the security context for this name is established and not expired, then the server MUST respond to the query with BADNAME error"

and according to rfc2930[2] "TKEY RR : 2.1 The Name Field" :
"The Name field relates to naming keys. ... An attempt to establish another set of keying material at a server for an existing name returns a BADNAME error."

So IIUC this error means that "671202487.sig-" was already in use when `nsupdate` tried to use it.

I am not sure which version of bind package was used in this test, but most probably one based on either 9.11.10 or 9.11.11 upstream release.
In both cases procedure to generate key name is as follows:
bin/nsupdate/nsupdate.c:start_gssrequest():
```
        isc_random_get(&val);
        result = isc_string_printf(mykeystr, sizeof(mykeystr), "%u.sig-%s",
                                   val, namestr);
```
where lib/isc/random.c:isc_random_get():
```
        initialize();
        *val = ((((unsigned int)rand()) & 0xffff0) >> 4) |
               ((((unsigned int)rand()) & 0xffff0) << 12);
```
and initialize() -> initialize_rand():
```
        pid = ((pid << 16) & 0xffff0000) | ((pid >> 16) & 0xffff);
        srand((unsigned)time(NULL) ^ pid);
```

So... I am not entirely sure how did call to `srand()` in sssd_client code lead to the *same* key being generated in nsupdate.
(For example, if an app triggered sss_cli_open_socket() several times in a second, this would reset rng state to the same seed...)

But I am reluctant to dig this deeper since modern version of bind has this part rewritten anyway:
https://gitlab.isc.org/isc-projects/bind9/blob/master/bin/nsupdate/nsupdate.c#L2841
https://gitlab.isc.org/isc-projects/bind9/blob/master/lib/isc/entropy.c#L21
  --  RAND_bytes() is used now.


Petr, does my speculations make any sense for you as a bind maintainer?



[1] https://tools.ietf.org/html/rfc3645
[2] https://tools.ietf.org/html/rfc2930

Comment 68 Alexey Tikhonov 2019-10-30 19:10:57 UTC

(In reply to Alexey Tikhonov from comment #67)
> But I am reluctant to dig this deeper since modern version of bind has this
> part rewritten anyway:
>   --  RAND_bytes() is used now.

*and* `srand()` was already removed in sssd_client upstream code.

Comment 69 Tomas Mraz 2019-10-31 11:00:04 UTC

So to summarize it would be a bind bug which is already fixed in the current bind code. OK then.

Comment 70 Alexey Tikhonov 2019-10-31 11:11:20 UTC

(In reply to Tomas Mraz from comment #69)
> So to summarize it would be a bind bug which is already fixed in the current
> bind code. OK then.

Those parts of bind code are changed in master branch. I do not know when/if it was/will be backported to 9.11 stable branch (which seems to be used in Fedora package). That's something I would like to hear from Petr.

Comment 71 Jan Pazdziora (Red Hat) 2019-10-31 11:16:06 UTC

(In reply to Tomas Mraz from comment #69)
> So to summarize it would be a bind bug which is already fixed in the current
> bind code. OK then.

What definition of "current" are we using here? The problem is still present in Fedora 30 and 31, with only sssd-2.2.2-3.* in updates-testing resolving it. So there does not seem any fix in bind in stable Fedora repos.

Comment 72 Alexey Tikhonov 2019-10-31 15:41:45 UTC

(In reply to Jan Pazdziora from comment #71)
> (In reply to Tomas Mraz from comment #69)
> > So to summarize it would be a bind bug which is already fixed in the current
> > bind code. OK then.
> 
> What definition of "current" are we using here? The problem is still present
> in Fedora 30 and 31, with only sssd-2.2.2-3.* in updates-testing resolving
> it. So there does not seem any fix in bind in stable Fedora repos.

sssd-2.2.2-3 will get into stable in a matter of days.

Do you mean that bind is vulnerable because there might be other parties (besides sssd) messing with PRNG state?
And that Fedora package maintainers could consider backporting relevant patches itself?

Comment 73 Jan Pazdziora (Red Hat) 2019-11-01 13:06:53 UTC

Neither. I've just pointed out that whatever fix for whatever issue in bind might be fixed upstream but it does not seem to be present in Fedora.

Comment 74 Fedora Update System 2019-11-02 02:27:34 UTC

sssd-2.2.2-3.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.

Comment 75 Fedora Update System 2019-11-09 22:38:19 UTC

sssd-2.2.2-3.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

Comment 76 Petr Menšík 2019-11-12 13:57:48 UTC

Hi Alexey,

thanks for the analysis done on bind part. I have backported some modifications into bind-9.11 to use RAND_bytes for entropy usage in BIND.

That part did not include isc_nonce however, it just changes internal calls used in isc_entropy_getdata calls.

(In reply to Alexey Tikhonov from comment #67)
> (In reply to Tomas Mraz from comment #55)
> > It would be really interesting what was broken by the srand() call.
> 
> Here is my (somewhat vague) understanding.
> 
> > ;; QUESTION SECTION:
> > ;671202487.sig-ipa.example.test.        ANY     TKEY
> > ;; ANSWER SECTION:
> > 671202487.sig-ipa.example.test. 0 ANY   TKEY    gss-tsig. 0 0 3 BADNAME 0  0
> 
> According to rfc3645[1] "GSS-TSIG : 4.1.1. Receive TKEY Query from Client" :
> "If the name is found in the table and the security context for this name is
> established and not expired, then the server MUST respond to the query with
> BADNAME error"
> 
> and according to rfc2930[2] "TKEY RR : 2.1 The Name Field" :
> "The Name field relates to naming keys. ... An attempt to establish another
> set of keying material at a server for an existing name returns a BADNAME
> error."
> 
> So IIUC this error means that "671202487.sig-" was already in use when
> `nsupdate` tried to use it.
> 
> I am not sure which version of bind package was used in this test, but most
> probably one based on either 9.11.10 or 9.11.11 upstream release.
> In both cases procedure to generate key name is as follows:
> bin/nsupdate/nsupdate.c:start_gssrequest():
> ```
>         isc_random_get(&val);
>         result = isc_string_printf(mykeystr, sizeof(mykeystr), "%u.sig-%s",
>                                    val, namestr);
> ```
> where lib/isc/random.c:isc_random_get():
> ```
>         initialize();
>         *val = ((((unsigned int)rand()) & 0xffff0) >> 4) |
>                ((((unsigned int)rand()) & 0xffff0) << 12);
> ```
> and initialize() -> initialize_rand():
> ```
>         pid = ((pid << 16) & 0xffff0000) | ((pid >> 16) & 0xffff);
>         srand((unsigned)time(NULL) ^ pid);
> ```
> 
> So... I am not entirely sure how did call to `srand()` in sssd_client code
> lead to the *same* key being generated in nsupdate.
> (For example, if an app triggered sss_cli_open_socket() several times in a
> second, this would reset rng state to the same seed...)

Shouldn't one application initialise it just single time? It includes time
and includes pid. That should make conflicts quite rare. Does that mean small changes
to pid number compensates small change to time, resulting in the same number?

> 
> But I am reluctant to dig this deeper since modern version of bind has this
> part rewritten anyway:
> https://gitlab.isc.org/isc-projects/bind9/blob/master/bin/nsupdate/nsupdate.
> c#L2841
> https://gitlab.isc.org/isc-projects/bind9/blob/master/lib/isc/entropy.c#L21
>   --  RAND_bytes() is used now.
Since that value is related to key usage, it should be definitely used from better random source.
According to my check, this was changed still in BIND 9.13. Commit 99ba29bc5 [3].
> 
> 
> Petr, does my speculations make any sense for you as a bind maintainer?
> 
Yes, I agree with you. This is used as key index. It might leak current state of the random generator,
but the same would be with nonces. CSRNG has to be resistant anyway. Because dig cookies already
use isc_entropy_getdata, I think this part should use it too. It is nonce of sort.
Especially when uses special nonce API in later releases.
> 
> 
> [1] https://tools.ietf.org/html/rfc3645
> [2] https://tools.ietf.org/html/rfc2930
[3] https://gitlab.isc.org/isc-projects/bind9/commit/99ba29bc52f640ec9c2cbb331ffdeceff2f0f26f

Comment 77 Robbie Harwood 2019-11-15 19:13:31 UTC

I realize this is now well past mattering, but in case anyone finds this later:

> In containers we already mount /dev/urandom as /dev/random to avoid long startup times of KDC, IIRC. The Travis CI setup is said to have problems with entropy because all the VMs share one source. Why I only see it in Travis with read-write containers and not with read-only ones is not clear to me though.

KDC startup should be very fast since it uses getrandom() for everything.  If it's not... well, mounting /dev/urandom as /dev/random won't help you because I use getrandom() for everything :)

Comment 78 Ben Cotton 2020-05-26 18:44:29 UTC

Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 79 Jan Pazdziora (Red Hat) 2020-06-16 09:37:45 UTC

Bodhi did not autoclose this bugzilla. Should its resolution be set to CURRENTRELEASE, rather than EOL?

Note You need to log in before you can comment on or make changes to this bug.