Bug 2291228 (CVE-2024-5206)

Summary:	CVE-2024-5206 scikit-learn: Possible sensitive data leak
Product:	[Other] Security Response	Reporter:	Marco Benatto <mbenatto>
Component:	vulnerability	Assignee:	Product Security <prodsec-ir-bot>
Status:	NEW ---	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	unspecified	CC:	amctagga, aoconnor, bniver, epacific, flucifre, gmeno, haoli, hkataria, jajackso, jcammara, jhardy, jmitchel, jneedle, jobarker, kegrant, koliveir, kshier, mabashia, mbenjamin, mhackett, omaciel, pbraun, shvarugh, simaishi, smcdonal, sostapov, stcannon, teagle, tfister, thavo, vereddy, yguenane, zsadeh
Target Milestone:	---	Keywords:	Security
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:	A vulnerability was found in the scikit-learn package. In versions up to and including 1.4.1.post1, scikit-learn's TfidfVectorizer stores all tokens from the training data in the "stop_words_" attribute. As a result, a fitted vectorizer may expose sensitive data that is not used in model training, possibly leaking passwords and keys.	Story Points:	---
Clone Of:		Environment:
Last Closed:		Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2291229, 2291230, 2389947, 2389948, 2389949, 2389950, 2392914, 2392915
Bug Blocks:

Description Marco Benatto 2024-06-10 21:06:17 UTC

A sensitive data leakage vulnerability was identified in scikit-learn's TfidfVectorizer. It affects versions up to and including 1.4.1.post1 and was fixed in version 1.5.0. The fitted vectorizer unexpectedly stores all tokens present in the training data in the `stop_words_` attribute, rather than storing only the subset of tokens required for the TF-IDF technique to function. As a result, the `stop_words_` attribute can contain tokens that were meant to be discarded and not stored, such as passwords or keys, and those tokens travel with the fitted object wherever it is saved or shared. The impact of this vulnerability varies based on the nature of the data being processed by the vectorizer.
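
For triage, here is a minimal sketch of how one might check whether an installed version falls in the affected range and drop the `stop_words_` attribute from a fitted vectorizer before it is pickled or shared. The helper names (`release_tuple`, `is_affected`, `strip_stop_words`) are hypothetical and not part of scikit-learn. The 1.5.0 threshold is taken from this report. The version parsing is deliberately simplified: it ignores pre-release and post-release suffixes, which is an assumption. Upgrading to a fixed release remains the actual remedy; this only illustrates a stopgap.

```python
import re

# First fixed release according to this report (assumption: no backports).
FIXED_RELEASE = (1, 5, 0)


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release segment of a version string.

    Example: '1.4.1.post1' -> (1, 4, 1). Suffixes such as 'rc1' or
    '.post1' are ignored, which is a simplification.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad to three components so that '1.5' compares equal to '1.5.0'.
    return parts + (0,) * (3 - len(parts))


def is_affected(version: str) -> bool:
    """True if the release segment is lower than the first fixed release."""
    return release_tuple(version) < FIXED_RELEASE


def strip_stop_words(vectorizer):
    """Delete the fitted `stop_words_` attribute if the object has one.

    Works on any object; it does not import or depend on scikit-learn.
    """
    if hasattr(vectorizer, "stop_words_"):
        del vectorizer.stop_words_
    return vectorizer
```

In practice one would call `is_affected(sklearn.__version__)` and, if it returns true, pass each fitted vectorizer through `strip_stop_words` before serializing it. This assumes nothing downstream reads `stop_words_` after fitting.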

https://github.com/scikit-learn/scikit-learn/commit/70ca21f106b603b611da73012c9ade7cd8e438b8
https://huntr.com/bounties/14bc0917-a85b-4106-a170-d09d5191517c

Comment 1 Marco Benatto 2024-06-10 21:08:39 UTC

Created python-imbalanced-learn tracking bugs for this issue:

Affects: fedora-all [bug 2291229]


Created python-scikit-learn tracking bugs for this issue:

Affects: fedora-all [bug 2291230]