
Commit

Merge main
Signed-off-by: Adam Li <adam2392@gmail.com>
adam2392 committed Sep 14, 2023
2 parents e2fee00 + 0d701e8 commit 0b6803f
Showing 76 changed files with 2,176 additions and 2,183 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -17,7 +17,7 @@
.. |Nightly wheels| image:: https://github.com/scikit-learn/scikit-learn/workflows/Wheel%20builder/badge.svg?event=schedule
.. _`Nightly wheels`: https://github.com/scikit-learn/scikit-learn/actions?query=workflow%3A%22Wheel+builder%22+event%3Aschedule

.. |PythonVersion| image:: https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue
.. |PythonVersion| image:: https://img.shields.io/pypi/pyversions/scikit-learn.svg
.. _PythonVersion: https://pypi.org/project/scikit-learn/

.. |PyPi| image:: https://img.shields.io/pypi/v/scikit-learn
36 changes: 35 additions & 1 deletion doc/conftest.py
@@ -3,12 +3,15 @@
from os import environ
from os.path import exists, join

import pytest
from _pytest.doctest import DoctestItem

from sklearn.datasets import get_data_home
from sklearn.datasets._base import _pkl_filepath
from sklearn.datasets._twenty_newsgroups import CACHE_NAME
from sklearn.utils import IS_PYPY
from sklearn.utils._testing import SkipTest, check_skip_network
from sklearn.utils.fixes import parse_version
from sklearn.utils.fixes import np_base_version, parse_version


def setup_labeled_faces():
@@ -172,3 +175,34 @@ def pytest_configure(config):
matplotlib.use("agg")
except ImportError:
pass


def pytest_collection_modifyitems(config, items):
"""Called after collect is completed.
Parameters
----------
config : pytest config
items : list of collected items
"""
skip_doctests = False
if np_base_version >= parse_version("2"):
# Skip doctests when using numpy 2 for now. See the following discussion
# to decide what to do in the longer term:
# https://github.com/scikit-learn/scikit-learn/issues/27339
reason = "Due to NEP 51 numpy scalar repr has changed in numpy 2"
skip_doctests = True

# Normally doctest has the entire module's scope. Here we set globs to an empty dict
# to remove the module's scope:
# https://docs.python.org/3/library/doctest.html#what-s-the-execution-context
for item in items:
if isinstance(item, DoctestItem):
item.dtest.globs = {}

if skip_doctests:
skip_marker = pytest.mark.skip(reason=reason)

for item in items:
if isinstance(item, DoctestItem):
item.add_marker(skip_marker)
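
The NEP 51 repr change that motivates the skip above can be seen directly. A minimal sketch (assuming NumPy is installed) that runs under both NumPy 1.x and 2.x:

```python
import numpy as np

# NEP 51: NumPy 2 prints scalars as "np.float64(3.0)" rather than "3.0",
# which breaks doctests that compare printed output verbatim -- hence the
# blanket skip in pytest_collection_modifyitems.
old_style = "3.0"              # NumPy 1.x scalar repr
new_style = "np.float64(3.0)"  # NumPy 2.x scalar repr (NEP 51)
assert repr(np.float64(3.0)) in (old_style, new_style)
```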
2 changes: 1 addition & 1 deletion doc/glossary.rst
@@ -1731,7 +1731,7 @@ functions or non-estimator constructors.
For these models, the number of iterations, reported via
``len(estimators_)`` or ``n_iter_``, corresponds the total number of
estimators/iterations learnt since the initialization of the model.
Thus, if a model was already initialized with `N`` estimators, and `fit`
Thus, if a model was already initialized with `N` estimators, and `fit`
is called with ``n_estimators`` or ``max_iter`` set to `M`, the model
will train `M - N` new estimators.

6 changes: 6 additions & 0 deletions doc/modules/density.rst
@@ -113,6 +113,10 @@ forms, which are shown in the following figure:

.. centered:: |kde_kernels|

|details-start|
**kernels' mathematical expressions**
|details-split|

The form of these kernels is as follows:

* Gaussian kernel (``kernel = 'gaussian'``)
@@ -139,6 +143,8 @@ The form of these kernels is as follows:

:math:`K(x; h) \propto \cos(\frac{\pi x}{2h})` if :math:`x < h`

|details-end|

The kernel density estimator can be used with any of the valid distance
metrics (see :class:`~sklearn.metrics.DistanceMetric` for a list of
available metrics), though the results are properly normalized only
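
Any of the kernels above plugs straight into :class:`~sklearn.neighbors.KernelDensity`; a minimal sketch (the data and bandwidth are illustrative, not from the docs):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.array([[-1.0], [0.0], [1.0]])  # three 1-D samples
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns log p(x) at the query points; density is higher
# near the samples (x=0) than far away (x=5).
log_dens = kde.score_samples(np.array([[0.0], [5.0]]))
```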
44 changes: 35 additions & 9 deletions doc/modules/feature_extraction.rst
@@ -206,8 +206,9 @@ Note the use of a generator comprehension,
which introduces laziness into the feature extraction:
tokens are only processed on demand from the hasher.

Implementation details
----------------------
|details-start|
**Implementation details**
|details-split|

:class:`FeatureHasher` uses the signed 32-bit variant of MurmurHash3.
As a result (and because of limitations in ``scipy.sparse``),
@@ -223,16 +224,18 @@ Since a simple modulo is used to transform the hash function to a column index,
it is advisable to use a power of two as the ``n_features`` parameter;
otherwise the features will not be mapped evenly to the columns.
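
The power-of-two advice can be sketched with :class:`FeatureHasher` directly (the tokens below are illustrative):

```python
from sklearn.feature_extraction import FeatureHasher

# n_features is a power of two so the modulo on the 32-bit MurmurHash3
# value maps hash outputs evenly onto columns.
hasher = FeatureHasher(n_features=2**4, input_type="string")
X = hasher.transform([["dog", "cat", "dog"], ["bird"]])  # 2 samples, 16 columns
```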

.. topic:: References:

* `MurmurHash3 <https://github.com/aappleby/smhasher>`_.

|details-end|

.. topic:: References:

* Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and
Josh Attenberg (2009). `Feature hashing for large scale multitask learning
<https://alex.smola.org/papers/2009/Weinbergeretal09.pdf>`_. Proc. ICML.

* `MurmurHash3 <https://github.com/aappleby/smhasher>`_.


.. _text_feature_extraction:

Text feature extraction
@@ -395,8 +398,9 @@ last document::

.. _stop_words:

Using stop words
................
|details-start|
**Using stop words**
|details-split|

Stop words are words like "and", "the", "him", which are presumed to be
uninformative in representing the content of a text, and which may be
@@ -426,6 +430,9 @@ identify and warn about some kinds of inconsistencies.
<https://aclweb.org/anthology/W18-2502>`__.
In *Proc. Workshop for NLP Open Source Software*.
|details-end|

.. _tfidf:

Tf–idf term weighting
@@ -490,6 +497,10 @@ class::
Again please see the :ref:`reference documentation
<text_feature_extraction_ref>` for the details on all the parameters.

|details-start|
**Numeric example of a tf-idf matrix**
|details-split|

Let's take an example with the following counts. The first term is present
100% of the time hence not very interesting. The two other features only
in less than 50% of the time hence probably more representative of the
@@ -609,6 +620,7 @@ feature extractor with a classifier:

* :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`

|details-end|
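
The collapsed "numeric example" section walks through tf-idf weighting by hand; a hedged sketch of the same idea with illustrative counts (not necessarily those from the collapsed hunk):

```python
from sklearn.feature_extraction.text import TfidfTransformer

# Term 0 appears in every document (uninformative); terms 1 and 2 are rare.
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tfidf = TfidfTransformer(smooth_idf=True).fit_transform(counts).toarray()
# Each row is L2-normalized; absent terms get weight zero.
```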

Decoding text files
-------------------
@@ -637,6 +649,10 @@ or ``"replace"``. See the documentation for the Python function
``bytes.decode`` for more details
(type ``help(bytes.decode)`` at the Python prompt).
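
A small sketch of the strict-vs-replace behavior of ``bytes.decode``, assuming a Latin-1 encoded file:

```python
raw = "naïve café".encode("latin-1")  # bytes as they might arrive from disk

text = raw.decode("latin-1")                 # correct codec round-trips cleanly
bad = raw.decode("utf-8", errors="replace")  # wrong codec: U+FFFD replacement marks
# raw.decode("utf-8") with the default errors="strict" would instead raise
# UnicodeDecodeError on the first invalid byte sequence.
```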

|details-start|
**Troubleshooting decoding text**
|details-split|

If you are having trouble decoding text, here are some things to try:

- Find out what the actual encoding of the text is. The file might come
@@ -690,6 +706,7 @@ About Unicode <https://www.joelonsoftware.com/articles/Unicode.html>`_.

.. _`ftfy`: https://github.com/LuminosoInsight/python-ftfy

|details-end|

Applications and examples
-------------------------
@@ -870,8 +887,9 @@ The :class:`HashingVectorizer` also comes with the following limitations:
model. A :class:`TfidfTransformer` can be appended to it in a pipeline if
required.

Performing out-of-core scaling with HashingVectorizer
------------------------------------------------------
|details-start|
**Performing out-of-core scaling with HashingVectorizer**
|details-split|

An interesting development of using a :class:`HashingVectorizer` is the ability
to perform `out-of-core`_ scaling. This means that we can learn from data that
@@ -890,6 +908,8 @@ time is often limited by the CPU time one wants to spend on the task.
For a full-fledged example of out-of-core scaling in a text classification
task see :ref:`sphx_glr_auto_examples_applications_plot_out_of_core_classification.py`.

|details-end|
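
A minimal out-of-core sketch (the mini-batches and the downstream model are illustrative): because :class:`HashingVectorizer` is stateless, no vocabulary ever has to fit in memory, and each batch can be transformed independently.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # stateless: no fit needed
clf = SGDClassifier(random_state=0)

batches = [
    (["good movie", "great film"], [1, 1]),
    (["bad plot", "awful acting"], [0, 0]),
]
for docs, labels in batches:
    X = vectorizer.transform(docs)  # transform one mini-batch at a time
    clf.partial_fit(X, labels, classes=[0, 1])

pred = clf.predict(vectorizer.transform(["good film"]))
```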

Customizing the vectorizer classes
----------------------------------

@@ -928,6 +948,10 @@ parameters it is possible to derive from the class and override the
``build_preprocessor``, ``build_tokenizer`` and ``build_analyzer``
factory methods instead of passing custom functions.

|details-start|
**Tips and tricks**
|details-split|

Some tips and tricks:

* If documents are pre-tokenized by an external package, then store them in
@@ -982,6 +1006,8 @@ Some tips and tricks:
Customizing the vectorizer can also be useful when handling Asian languages
that do not use an explicit word separator such as whitespace.

|details-end|

.. _image_feature_extraction:

Image feature extraction
