
Commit

Merge main
Signed-off-by: Adam Li <adam2392@gmail.com>
adam2392 committed Sep 14, 2023
2 parents e2fee00 + 0d701e8 commit 0b6803f
Showing 76 changed files with 2,176 additions and 2,183 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -17,7 +17,7 @@
.. |Nightly wheels| image:: https://github.com/scikit-learn/scikit-learn/workflows/Wheel%20builder/badge.svg?event=schedule
.. _`Nightly wheels`: https://github.com/scikit-learn/scikit-learn/actions?query=workflow%3A%22Wheel+builder%22+event%3Aschedule

.. |PythonVersion| image:: https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue
.. |PythonVersion| image:: https://img.shields.io/pypi/pyversions/scikit-learn.svg
.. _PythonVersion: https://pypi.org/project/scikit-learn/

.. |PyPi| image:: https://img.shields.io/pypi/v/scikit-learn
36 changes: 35 additions & 1 deletion doc/conftest.py
@@ -3,12 +3,15 @@
from os import environ
from os.path import exists, join

import pytest
from _pytest.doctest import DoctestItem

from sklearn.datasets import get_data_home
from sklearn.datasets._base import _pkl_filepath
from sklearn.datasets._twenty_newsgroups import CACHE_NAME
from sklearn.utils import IS_PYPY
from sklearn.utils._testing import SkipTest, check_skip_network
from sklearn.utils.fixes import parse_version
from sklearn.utils.fixes import np_base_version, parse_version


def setup_labeled_faces():
@@ -172,3 +175,34 @@ def pytest_configure(config):
matplotlib.use("agg")
except ImportError:
pass


def pytest_collection_modifyitems(config, items):
"""Called after collect is completed.
Parameters
----------
config : pytest config
items : list of collected items
"""
skip_doctests = False
if np_base_version >= parse_version("2"):
# Skip doctests when using numpy 2 for now. See the following discussion
# to decide what to do in the longer term:
# https://github.com/scikit-learn/scikit-learn/issues/27339
reason = "Due to NEP 51 numpy scalar repr has changed in numpy 2"
skip_doctests = True

# Normally doctest has the entire module's scope. Here we set globs to an empty dict
# to remove the module's scope:
# https://docs.python.org/3/library/doctest.html#what-s-the-execution-context
for item in items:
if isinstance(item, DoctestItem):
item.dtest.globs = {}

if skip_doctests:
skip_marker = pytest.mark.skip(reason=reason)

for item in items:
if isinstance(item, DoctestItem):
item.add_marker(skip_marker)
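
The NEP 51 repr change that motivates the skip above can be seen directly. A minimal sketch (assuming NumPy is installed) that runs under both NumPy 1.x and 2.x:

```python
import numpy as np

# NEP 51: NumPy 2 prints scalars as "np.float64(3.0)" rather than "3.0",
# which breaks doctests that compare printed output verbatim -- hence the
# blanket skip in pytest_collection_modifyitems.
old_style = "3.0"              # NumPy 1.x scalar repr
new_style = "np.float64(3.0)"  # NumPy 2.x scalar repr (NEP 51)
assert repr(np.float64(3.0)) in (old_style, new_style)
```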
2 changes: 1 addition & 1 deletion doc/glossary.rst
@@ -1731,7 +1731,7 @@ functions or non-estimator constructors.
For these models, the number of iterations, reported via
``len(estimators_)`` or ``n_iter_``, corresponds the total number of
estimators/iterations learnt since the initialization of the model.
Thus, if a model was already initialized with `N`` estimators, and `fit`
Thus, if a model was already initialized with `N` estimators, and `fit`
is called with ``n_estimators`` or ``max_iter`` set to `M`, the model
will train `M - N` new estimators.

6 changes: 6 additions & 0 deletions doc/modules/density.rst
@@ -113,6 +113,10 @@ forms, which are shown in the following figure:

.. centered:: |kde_kernels|

|details-start|
**kernels' mathematical expressions**
|details-split|

The form of these kernels is as follows:

* Gaussian kernel (``kernel = 'gaussian'``)
@@ -139,6 +143,8 @@ The form of these kernels is as follows:

:math:`K(x; h) \propto \cos(\frac{\pi x}{2h})` if :math:`x < h`

|details-end|

The kernel density estimator can be used with any of the valid distance
metrics (see :class:`~sklearn.metrics.DistanceMetric` for a list of
available metrics), though the results are properly normalized only
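
Any of the kernels above plugs straight into :class:`~sklearn.neighbors.KernelDensity`; a minimal sketch (the data and bandwidth are illustrative, not from the docs):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.array([[-1.0], [0.0], [1.0]])  # three 1-D samples
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns log p(x) at the query points; density is higher
# near the samples (x=0) than far away (x=5).
log_dens = kde.score_samples(np.array([[0.0], [5.0]]))
```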
44 changes: 35 additions & 9 deletions doc/modules/feature_extraction.rst
@@ -206,8 +206,9 @@ Note the use of a generator comprehension,
which introduces laziness into the feature extraction:
tokens are only processed on demand from the hasher.

Implementation details
----------------------
|details-start|
**Implementation details**
|details-split|

:class:`FeatureHasher` uses the signed 32-bit variant of MurmurHash3.
As a result (and because of limitations in ``scipy.sparse``),
@@ -223,16 +224,18 @@ Since a simple modulo is used to transform the hash function to a column index,
it is advisable to use a power of two as the ``n_features`` parameter;
otherwise the features will not be mapped evenly to the columns.
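
The power-of-two advice can be sketched with :class:`FeatureHasher` directly (the tokens below are illustrative):

```python
from sklearn.feature_extraction import FeatureHasher

# n_features is a power of two so the modulo on the 32-bit MurmurHash3
# value maps hash outputs evenly onto columns.
hasher = FeatureHasher(n_features=2**4, input_type="string")
X = hasher.transform([["dog", "cat", "dog"], ["bird"]])  # 2 samples, 16 columns
```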

.. topic:: References:

* `MurmurHash3 <https://github.com/aappleby/smhasher>`_.

|details-end|

.. topic:: References:

* Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and
Josh Attenberg (2009). `Feature hashing for large scale multitask learning
<https://alex.smola.org/papers/2009/Weinbergeretal09.pdf>`_. Proc. ICML.

* `MurmurHash3 <https://github.com/aappleby/smhasher>`_.


.. _text_feature_extraction:

Text feature extraction
@@ -395,8 +398,9 @@ last document::

.. _stop_words:

Using stop words
................
|details-start|
**Using stop words**
|details-split|

Stop words are words like "and", "the", "him", which are presumed to be
uninformative in representing the content of a text, and which may be
@@ -426,6 +430,9 @@ identify and warn about some kinds of inconsistencies.
<https://aclweb.org/anthology/W18-2502>`__.
In *Proc. Workshop for NLP Open Source Software*.
|details-end|

.. _tfidf:

Tf–idf term weighting
@@ -490,6 +497,10 @@ class::
Again please see the :ref:`reference documentation
<text_feature_extraction_ref>` for the details on all the parameters.

|details-start|
**Numeric example of a tf-idf matrix**
|details-split|

Let's take an example with the following counts. The first term is present
100% of the time hence not very interesting. The two other features only
in less than 50% of the time hence probably more representative of the
@@ -609,6 +620,7 @@ feature extractor with a classifier:

* :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`

|details-end|
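
The collapsed "numeric example" section walks through tf-idf weighting by hand; a hedged sketch of the same idea with illustrative counts (not necessarily those from the collapsed hunk):

```python
from sklearn.feature_extraction.text import TfidfTransformer

# Term 0 appears in every document (uninformative); terms 1 and 2 are rare.
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tfidf = TfidfTransformer(smooth_idf=True).fit_transform(counts).toarray()
# Each row is L2-normalized; absent terms get weight zero.
```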

Decoding text files
-------------------
@@ -637,6 +649,10 @@ or ``"replace"``. See the documentation for the Python function
``bytes.decode`` for more details
(type ``help(bytes.decode)`` at the Python prompt).
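
A small sketch of the strict-vs-replace behavior of ``bytes.decode``, assuming a Latin-1 encoded file:

```python
raw = "naïve café".encode("latin-1")  # bytes as they might arrive from disk

text = raw.decode("latin-1")                 # correct codec round-trips cleanly
bad = raw.decode("utf-8", errors="replace")  # wrong codec: U+FFFD replacement marks
# raw.decode("utf-8") with the default errors="strict" would instead raise
# UnicodeDecodeError on the first invalid byte sequence.
```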

|details-start|
**Troubleshooting decoding text**
|details-split|

If you are having trouble decoding text, here are some things to try:

- Find out what the actual encoding of the text is. The file might come
@@ -690,6 +706,7 @@ About Unicode <https://www.joelonsoftware.com/articles/Unicode.html>`_.

.. _`ftfy`: https://github.com/LuminosoInsight/python-ftfy

|details-end|

Applications and examples
-------------------------
@@ -870,8 +887,9 @@ The :class:`HashingVectorizer` also comes with the following limitations:
model. A :class:`TfidfTransformer` can be appended to it in a pipeline if
required.

Performing out-of-core scaling with HashingVectorizer
------------------------------------------------------
|details-start|
**Performing out-of-core scaling with HashingVectorizer**
|details-split|

An interesting development of using a :class:`HashingVectorizer` is the ability
to perform `out-of-core`_ scaling. This means that we can learn from data that
@@ -890,6 +908,8 @@ time is often limited by the CPU time one wants to spend on the task.
For a full-fledged example of out-of-core scaling in a text classification
task see :ref:`sphx_glr_auto_examples_applications_plot_out_of_core_classification.py`.

|details-end|
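
A minimal out-of-core sketch (the mini-batches and the downstream model are illustrative): because :class:`HashingVectorizer` is stateless, no vocabulary ever has to fit in memory, and each batch can be transformed independently.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # stateless: no fit needed
clf = SGDClassifier(random_state=0)

batches = [
    (["good movie", "great film"], [1, 1]),
    (["bad plot", "awful acting"], [0, 0]),
]
for docs, labels in batches:
    X = vectorizer.transform(docs)  # transform one mini-batch at a time
    clf.partial_fit(X, labels, classes=[0, 1])

pred = clf.predict(vectorizer.transform(["good film"]))
```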

Customizing the vectorizer classes
----------------------------------

@@ -928,6 +948,10 @@ parameters it is possible to derive from the class and override the
``build_preprocessor``, ``build_tokenizer`` and ``build_analyzer``
factory methods instead of passing custom functions.

|details-start|
**Tips and tricks**
|details-split|

Some tips and tricks:

* If documents are pre-tokenized by an external package, then store them in
@@ -982,6 +1006,8 @@ Some tips and tricks:
Customizing the vectorizer can also be useful when handling Asian languages
that do not use an explicit word separator such as whitespace.

|details-end|

.. _image_feature_extraction:

Image feature extraction
