## AggrFunc

``` py
AggrFunc(
    features_group: List[List[int]] = None,
    non_longitudinal_features: List[Union[int, str]] = None,
    feature_list_names: List[str] = None,
    aggregation_func: Union[str, Callable] = "mean",
    parallel: bool = False,
    num_cpus: int = -1
)
```
The AggrFunc class applies aggregation functions to feature groups in longitudinal datasets. The motivation is to exploit some of the dataset's temporal information before handing it to traditional machine learning algorithms such as those in Scikit-Learn. Note, however, that aggregation significantly diminishes the overall temporal information of the dataset.

A feature group is a collection of features that share a common base longitudinal attribute while originating from distinct waves of data collection. Refer to the documentation's "Temporal Dependency" page for more details.

Aggregation Function

Suppose a dataset comprises three features, "income_wave1", "income_wave2", and "income_wave3", which together constitute one feature group.

The aggregation function is applied iteratively across the waves of each feature group, producing one aggregated feature per group. For example, when the designated aggregation function is the mean, the features "income_wave1", "income_wave2", and "income_wave3" are reduced to a single consolidated feature named "mean_income".

Support for Custom Functions

The class also accepts custom aggregation functions, as long as they adhere to the callable interface: the user can pass a function that accepts a pandas Series as input and produces a single value as output. The pandas Series represents the longitudinal attribute across the waves.
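As a minimal sketch of that callable contract (the import path is an assumption for illustration; check the library's API index for the exact module), a custom aggregation function could be wired in as follows:

``` py
import pandas as pd

# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.data_preparation import AggrFunc

def amplitude(series: pd.Series) -> float:
    """Custom aggregation: spread of a longitudinal attribute across its waves."""
    return float(series.max() - series.min())

agg = AggrFunc(
    features_group=[[0, 1, 2]],     # income_wave1 .. income_wave3
    non_longitudinal_features=[3],  # e.g. age
    feature_list_names=["income_wave1", "income_wave2", "income_wave3", "age"],
    aggregation_func=amplitude,     # or simply "mean", "median", "mode"
)
```

Each feature group is then collapsed into a single aggregated column when the transformer is applied.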
Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.
- non_longitudinal_features (`List[Union[int, str]]`, optional): A list of indices of features that are not longitudinal attributes. Defaults to None.
- feature_list_names (`List[str]`): A list of feature names in the dataset.
- aggregation_func (`Union[str, Callable]`, optional): The aggregation function to apply. Can be "mean", "median", "mode", or a custom function.
- parallel (`bool`, optional): Whether to use parallel processing for the aggregation. Defaults to False.
- num_cpus (`int`, optional): The number of CPUs to use for parallel processing. Defaults to -1, which uses all available CPUs.

get_params:

- deep (`bool`, optional): If True, will return the parameters for this estimator and contained subobjects that are estimators. Defaults to True.

The data-preparation step takes:

- X (`np.ndarray`): The input data.
- y (`np.ndarray`, optional): The target data. Not particularly relevant for this class. Defaults to None.
## LongitudinalDataset

``` py
LongitudinalDataset(
    file_path: Union[str, Path],
    data_frame: Optional[pd.DataFrame] = None
)
```
The LongitudinalDataset class is a comprehensive container specifically designed for managing and preparing longitudinal datasets. It provides essential data management and transformation capabilities, thereby facilitating the development and application of machine learning algorithms tailored to longitudinal data classification tasks.

Feature Groups and Non-Longitudinal Characteristics

The class employs two crucial attributes, `feature_groups` and `non_longitudinal_features`, which play a vital role in enabling adapted/newly-designed machine learning algorithms to comprehend the temporal structure of longitudinal datasets.

Wrapper Around Pandas DataFrame

This class wraps a `pandas` DataFrame, offering a familiar interface while incorporating enhancements for longitudinal data. It ensures effective processing and learning from data collected over multiple time points.
Args:

- file_path (`Union[str, Path]`): Path to the dataset file. Supports both ARFF and CSV formats.
- data_frame (`Optional[pd.DataFrame]`, optional): If provided, this pandas DataFrame will serve as the dataset, and the file_path parameter will be ignored.

Properties:

- data (`pd.DataFrame`): A read-only property that returns the loaded dataset as a pandas DataFrame.
- target (`pd.Series`): A read-only property that returns the target variable (class variable) as a pandas Series.
- X_train (`np.ndarray`): A read-only property that returns the training data as a numpy array.
- X_test (`np.ndarray`): A read-only property that returns the test data as a numpy array.
- y_train (`pd.Series`): A read-only property that returns the training target data as a pandas Series.
- y_test (`pd.Series`): A read-only property that returns the test target data as a pandas Series.
``` py
.load_target(
    target_column: str,
    target_wave_prefix: str = "class_",
    remove_target_waves: bool = False
)
```
Args:

- target_column (`str`): The name of the column in the dataset to be used as the target variable.
- target_wave_prefix (`str`, optional): The prefix of the columns that represent different waves of the target variable. Defaults to "class_".
- remove_target_waves (`bool`, optional): If True, all the columns with target_wave_prefix and the target_column will be removed from the dataset after extracting the target variable. Note that in longitudinal studies the class is sometimes also collected at different time points, hence the automatic deletion when this parameter is set to True. Defaults to False.

`.load_train_test_split` takes:

- test_size (`float`, optional): The proportion of the dataset to include in the test split. Defaults to 0.2.
- random_state (`int`, optional): Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. Defaults to None.
``` py
.load_data_target_train_test_split(
    target_column: str,
    target_wave_prefix: str = "class_",
    remove_target_waves: bool = False,
    test_size: float = 0.2,
    random_state: int = None
)
```
Args:

- target_column (`str`): The name of the column in the dataset to be used as the target variable.
- target_wave_prefix (`str`, optional): The prefix of the columns that represent different waves of the target variable. Defaults to "class_".
- remove_target_waves (`bool`, optional): If True, all the columns with target_wave_prefix and the target_column will be removed from the dataset after extracting the target variable. Defaults to False.
- test_size (`float`, optional): The proportion of the dataset to include in the test split. Defaults to 0.2.
- random_state (`int`, optional): Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. Defaults to None.

The conversion and saving methods each take:

- output_path (`Union[str, Path]`): Path to store the resulting file.

Feature Group Setup

The `setup_features_group` method sets up feature groups based on the input data provided. The input can be a list of lists of integers, a list of lists of strings (feature names), or a pre-set strategy (e.g., "elsa").

The list-of-lists input works as follows: each inner list contains the indices (or names) of one longitudinal attribute's waves, with each longitudinal attribute having its own sublist. For more information, see the documentation's "Temporal Dependency" page.

Pre-set Strategy

The "elsa" strategy groups features based on their name and the suffixes "_w1", "_w2", etc. For example, if the dataset has features "age_w1" and "age_w2", the method groups them together, with w2 treated as more recent than w1 in the feature group setup.

More pre-set strategies are welcome in the future. Open an issue if you have a suggestion or would like to contribute one. A typical loading workflow is sketched below.
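The following sketch shows the workflow end to end; the import path, file name, and column name are hypothetical placeholders, not verbatim library values:

``` py
# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.data_preparation import LongitudinalDataset

dataset = LongitudinalDataset("./stroke_longitudinal.csv")  # hypothetical file
dataset.load_data_target_train_test_split(
    target_column="class_stroke_w2",  # hypothetical column name
    remove_target_waves=True,
    test_size=0.2,
    random_state=42,
)
dataset.setup_features_group("elsa")  # groups *_w1, *_w2, ... suffixed features

print(dataset.feature_groups())
print(dataset.non_longitudinal_features())
```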
Args:

- input_data (`Union[str, List[List[Union[str, int]]]]`): The input data for setting up the feature groups.

The `feature_groups` and `non_longitudinal_features` accessors each take an optional `bool` flag: if True, the feature names are returned instead of the indices. Defaults to False.

Setters also exist for the data (`pd.DataFrame`), the target (`pd.Series`), the training and test data (`pd.DataFrame`), and the training and test target data (`pd.Series`).

See also `load_data_target_train_test_split` and the `non_longitudinal_features` attribute.

## MerWavTimeMinus

``` py
MerWavTimeMinus(
    features_group: List[List[int]] = None,
    non_longitudinal_features: List[Union[int, str]] = None,
    feature_list_names: List[str] = None
)
```
The `MerWavTimeMinus` class transforms longitudinal data by merging all features across waves into a single set, effectively discarding the temporal information so that traditional machine learning algorithms can be applied. Note that this approach does not leverage any temporal dependencies or patterns inherent in the longitudinal data, nor does it reduce or augment the current features: the input data is passed to the model as-is.

MerWavTime(-)

The `MerWavTimeMinus` method merges all features from all waves into a single set of features, disregarding their time indices. This approach treats different values of the same original longitudinal feature as distinct features, losing the temporal information but simplifying the dataset for traditional machine learning algorithms.
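To make the idea concrete, here is a plain-pandas sketch (not the library's API) of what MerWavTime(-) amounts to: the wave columns are simply treated as unrelated features that any standard estimator can consume.

``` py
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "income_wave1": [10, 20, 15],
    "income_wave2": [12, 22, 14],
    "income_wave3": [15, 25, 13],
    "age": [40, 50, 45],
})
y = [0, 1, 0]

# The classifier sees four independent columns; the wave ordering is ignored.
DecisionTreeClassifier(random_state=0).fit(df, y)
```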
Why is this class important?

Before running a pre-processor or classifier, we sometimes want to make the chosen data-preparation step explicit. This class provides that: no actual reduction or augmentation is performed, it is a plain pass-through step, yet it is important to be able to see it in a pipeline.
Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list.
- non_longitudinal_features (`List[Union[int, str]]`, optional): A list of indices or names of non-longitudinal features. Defaults to None.
- feature_list_names (`List[str]`): A list of feature names in the dataset.

get_params:

- deep (`bool`, optional): If True, will return the parameters for this estimator and contained subobjects that are estimators. Defaults to True.

The data-preparation step takes:

- X (`np.ndarray`): The input data.
- y (`np.ndarray`, optional): The target data. Not particularly relevant for this class. Defaults to None.

For more detailed information, refer to the paper cited in the Notes section.

## MerWavTimePlus

``` py
MerWavTimePlus(
    features_group: List[List[int]] = None,
    non_longitudinal_features: List[Union[int, str]] = None,
    feature_list_names: List[str] = None
)
```
The `MerWavTimePlus` class transforms longitudinal data by merging all features across waves into a single set while keeping their time indices. This approach maintains the temporal structure of the data, allowing longitudinal methods to learn temporal patterns.

MerWavTime(+)

In longitudinal studies, data is collected across multiple waves (time points), resulting in features that capture temporal information. The `MerWavTimePlus` method merges all features from all waves into a single set of features while preserving their time indices. This allows longitudinal machine learning methods to leverage the temporal dependencies and patterns inherent in the longitudinal data.

Why is this class important?

Before running a pre-processor or classifier, we sometimes want to make the chosen data-preparation step explicit. This class provides that: no actual reduction or augmentation is performed, it is a plain pass-through step, yet it is important to be able to see it in a pipeline. Subsequent steps, such as a pre-processor or a classifier, retain access to the temporal information of the dataset they are fed.
Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list.
- non_longitudinal_features (`List[Union[int, str]]`, optional): A list of indices or names of non-longitudinal features. Defaults to None.
- feature_list_names (`List[str]`): A list of feature names in the dataset.

get_params:

- deep (`bool`, optional): If True, will return the parameters for this estimator and contained subobjects that are estimators. Defaults to True.

The data-preparation step takes:

- X (`np.ndarray`): The input data.
- y (`np.ndarray`, optional): The target data. Not particularly relevant for this class. Defaults to None.

This allows `time-aware` machine learning algorithms, leveraging temporal dependencies or patterns inherent in the longitudinal data, to be applied. For more detailed information, refer to the paper cited in the Notes section.

## SepWav

``` py
SepWav(
    estimator: Union[ClassifierMixin, CustomClassifierMixinEstimator] = None,
    features_group: List[List[int]] = None,
    non_longitudinal_features: List[Union[int, str]] = None,
    feature_list_names: List[str] = None,
    voting: LongitudinalEnsemblingStrategy = LongitudinalEnsemblingStrategy.MAJORITY_VOTING,
    stacking_meta_learner: Union[CustomClassifierMixinEstimator, ClassifierMixin, None] = LogisticRegression(),
    n_jobs: int = None,
    parallel: bool = False,
    num_cpus: int = -1,
)
```
The `SepWav` class implements the Separate Waves (SepWav) strategy for longitudinal data analysis. This approach treats each wave (time point) as a separate dataset, trains a classifier on each dataset, and combines their predictions using an ensemble method.

SepWav (Separate Waves) Strategy

In the SepWav strategy, each wave's features and class variable are treated as a separate dataset. Classifiers (not longitudinally focused) are trained on each wave independently, and their predictions are combined into a final predicted class label. This combination can be achieved using various approaches, such as majority voting, weighted voting, and stacking.

Combination Strategies

The SepWav strategy allows different ensemble methods to be used for combining the predictions of the classifiers trained on each wave. The choice of ensemble method can affect the final model's performance and generalisation ability; see the `LongitudinalVoting` and `LongitudinalStacking` classes for the mathematical details.
Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list.
- estimator (`Union[ClassifierMixin, CustomClassifierMixinEstimator]`): The base classifier to use for each wave.
- non_longitudinal_features (`List[Union[int, str]]`, optional): A list of indices or names of non-longitudinal features. Defaults to None.
- feature_list_names (`List[str]`): A list of feature names in the dataset.
- voting (`LongitudinalEnsemblingStrategy`, optional): The ensemble strategy to use. Defaults to `LongitudinalEnsemblingStrategy.MAJORITY_VOTING`. See `LongitudinalVoting` and `LongitudinalStacking` for more details.
- stacking_meta_learner (`Union[CustomClassifierMixinEstimator, ClassifierMixin, None]`, optional): The final estimator to use in stacking. Defaults to `LogisticRegression()`.
- n_jobs (`int`, optional): The number of jobs to run in parallel. Defaults to None.
- parallel (`bool`, optional): Whether to fit the waves in parallel. Defaults to False.
- num_cpus (`int`, optional): The number of CPUs to use for parallel processing. Defaults to -1, which uses all available CPUs.

get_params:

- deep (`bool`, optional): If True, will return the parameters for this estimator and contained subobjects that are estimators. Defaults to True.

The data-preparation step takes:

- X (`np.ndarray`): The input data.
- y (`np.ndarray`, optional): The target data. Defaults to None.

fit:

- X (`Union[List[List[float]], np.ndarray]`): The input samples.
- y (`Union[List[float], np.ndarray]`): The target values.

predict:

- X (`Union[List[List[float]], np.ndarray]`): The input samples.

predict_proba:

- X (`Union[List[List[float]], np.ndarray]`): The input samples.

extract_wave:

- wave (`int`): The wave number to extract.
- X (`Union[List[List[float]], np.ndarray]`): The input samples.

Examples:

- The default strategy is `MAJORITY_VOTING`, which, in a nutshell, predicts the class label that receives the majority of votes from the classifiers trained on each wave. Strategies such as `WEIGHTED_VOTING` and `STACKING` can be used for more advanced ensembles; see the `LongitudinalVoting` and `LongitudinalStacking` classes.
- The `STACKING` strategy combines the predictions of the classifiers trained on each wave; the `stacking_meta_learner` parameter specifies the final estimator of the stacking ensemble, e.g. a `LogisticRegression` classifier as the meta-learner.
- The `parallel` parameter can be set to `True` to enable parallel processing of the waves.
- The `num_cpus` parameter specifies the number of CPUs used for parallel processing. With `num_cpus=4` and four waves, each wave's dedicated estimator is trained at the same time, speeding up the overall process.

A minimal usage sketch follows.
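The sketch below uses the default majority-voting setup on a toy two-wave dataset; the import path is an assumption for illustration, not verbatim library API:

``` py
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.estimators.ensemble.sepwav import SepWav

# Toy data: [smoke_w1, smoke_w2, cholesterol_w1, cholesterol_w2, age, gender]
X = np.array([[0, 1, 0, 1, 45, 1], [1, 1, 1, 1, 50, 0], [0, 0, 0, 0, 55, 1],
              [1, 1, 1, 1, 60, 0], [0, 1, 0, 1, 65, 1]])
y = np.array([0, 1, 0, 1, 0])

clf = SepWav(
    estimator=RandomForestClassifier(random_state=42),  # one classifier per wave
    features_group=[[0, 1], [2, 3]],
    non_longitudinal_features=[4, 5],
    feature_list_names=["smoke_w1", "smoke_w2", "chol_w1", "chol_w2", "age", "gender"],
)
clf.fit(X, y)
print(clf.predict(X))
```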
## LexicoDeepForestClassifier

``` py
LexicoDeepForestClassifier(
    longitudinal_base_estimators: Optional[List[LongitudinalEstimatorConfig]] = None,
    features_group: List[List[int]] = None,
    non_longitudinal_features: List[Union[int, str]] = None,
    diversity_estimators: bool = True, random_state: int = None,
    single_classifier_type: Optional[Union[LongitudinalClassifierType, str]] = None,
    single_count: Optional[int] = None, max_layers: int = 5
)
```
The Lexico Deep Forest Classifier is an advanced ensemble algorithm specifically designed for longitudinal data analysis. It extends the fundamental principles of the Deep Forest framework by incorporating longitudinal-adapted base estimators that capture the temporal complexities and interdependencies inherent in longitudinal data.

Lexico Deep Forest with Lexicographical Optimisation

The lexicographical optimisation used by the base estimators is implemented at the Cython level, in `/scikit-longitudinal/scikit-learn/sklearn/tree/_splitter.pyx`.

The combination of these accurate and weak learners aims to exploit the strengths of each estimator type, leading to more effective and reliable classification performance on longitudinal datasets.

For further scientific references, please refer to the Notes section.
Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.
- longitudinal_base_estimators (`List[LongitudinalEstimatorConfig]`): A list of `LongitudinalEstimatorConfig` objects that define the configuration for each base estimator within the ensemble. Each configuration specifies the type of longitudinal classifier, the number of times it should be instantiated within the ensemble, and an optional dictionary of hyperparameters for finer control over the individual classifiers' behaviour. The available longitudinal classifiers include the Lexico Decision Tree and Lexico Random Forest classifiers (see their respective documentation pages).
- non_longitudinal_features (`List[Union[int, str]]`, optional): A list of indices of features that are not longitudinal attributes. Defaults to None. This parameter is forwarded to the base longitudinal-adapted algorithms if required.
- diversity_estimators (`bool`, optional): A flag indicating whether the ensemble should include diversity estimators; defaults to True. When enabled, diversity estimators, which function as weak learners, are added to the ensemble to enhance its diversity and, by extension, its predictive performance. Disabling this option results in an ensemble comprising solely the specified base longitudinal-adapted algorithms. The diversity is achieved by integrating two additional completely-random LexicoRandomForestClassifier instances into the ensemble.
- random_state (`int`, optional): The seed used by the random number generator. Defaults to None.

fit:

Fit the Deep Forest Longitudinal Classifier model according to the given training data.

- X (`np.ndarray`): The training input samples.
- y (`np.ndarray`): The target values (class labels).

predict:

Predict class labels for samples in X.

- X (`np.ndarray`): The input samples.

predict_proba:

Predict class probabilities for samples in X.

- X (`np.ndarray`): The input samples.

Examples:

Consider the following dataset.

Features:

- `smoke` (longitudinal) with two waves/time-points
- `cholesterol` (longitudinal) with two waves/time-points
- `age` (non-longitudinal)
- `gender` (non-longitudinal)

Target:

- `stroke` (binary classification) at wave/time-point 2 only, for the sake of the example

The dataset is shown below:
| smoke_wave_1 | smoke_wave_2 | cholesterol_wave_1 | cholesterol_wave_2 | age | gender | stroke_wave_2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
The reader is encouraged to refer to the LexicoDecisionTreeClassifier and LexicoRandomForestClassifier documentation for more information on the base longitudinal-adapted algorithms used in the Lexico Deep Forest Classifier.
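A minimal usage sketch on the toy dataset above; the import path, the `LongitudinalEstimatorConfig` field names, and the enum member name are assumptions for illustration:

``` py
import numpy as np

# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.estimators.ensemble.lexicographical.lexico_deep_forest import (
    LexicoDeepForestClassifier,
    LongitudinalClassifierType,
    LongitudinalEstimatorConfig,
)

X = np.array([[0, 1, 0, 1, 45, 1], [1, 1, 1, 1, 50, 0], [0, 0, 0, 0, 55, 1],
              [1, 1, 1, 1, 60, 0], [0, 1, 0, 1, 65, 1]])
y = np.array([0, 1, 0, 1, 0])  # stroke_wave_2

# Field names and enum member are assumptions, not verbatim library API.
lexico_rf_config = LongitudinalEstimatorConfig(
    classifier_type=LongitudinalClassifierType.LEXICO_RF,  # enum member name assumed
    count=2,
)

clf = LexicoDeepForestClassifier(
    features_group=[[0, 1], [2, 3]],   # smoke and cholesterol waves
    longitudinal_base_estimators=[lexico_rf_config],
    non_longitudinal_features=[4, 5],  # age, gender
    random_state=42,
)
clf.fit(X, y)
print(clf.predict(X))
```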
> For more information, see the following paper on the Deep Forest algorithm: Zhou, Z.-H., & Feng, J. (2017). "Deep Forest: Towards an Alternative to Deep Neural Networks." IJCAI 2017.

Here is the initial Python implementation of the Deep Forest algorithm: Deep Forest GitHub
## LexicoGradientBoostingClassifier

``` py
LexicoGradientBoostingClassifier(
    threshold_gain: float = 0.0015, features_group: List[List[int]] = None,
    criterion: str = 'friedman_mse', splitter: str = 'lexicoRF',
    max_depth: Optional[int] = 3, min_samples_split: int = 2, min_samples_leaf: int = 1,
    min_weight_fraction_leaf: float = 0.0, max_features: Optional[Union[int, str]] = None,
    random_state: Optional[int] = None, max_leaf_nodes: Optional[int] = None,
    min_impurity_decrease: float = 0.0, ccp_alpha: float = 0.0, tree_flavor: bool = False,
    n_estimators: int = 100, learning_rate: float = 0.1
)
```
Gradient Boosting Classifier adapted for longitudinal data analysis.

The Lexico Gradient Boosting Classifier is an advanced ensemble algorithm designed specifically for longitudinal datasets, incorporating the fundamental principles of the Gradient Boosting framework. It distinguishes itself through longitudinal-adapted base estimators that capture the temporal complexities and interdependencies intrinsic to longitudinal data.

The base estimators of the Lexico Gradient Boosting Classifier are Lexico Decision Tree Regressors, specialised decision tree models capable of handling longitudinal data.

Lexicographical Optimisation

The primary goal of this approach is to prioritise the selection of more recent data points (wave ids) when determining splits in the decision tree, based on the premise that recent measurements are typically more predictive and relevant than older ones.

Key Features:

1. Splits are chosen lexicographically: among candidate splits whose gain ratios differ by less than `threshold_gain`, the feature from the more recent wave is preferred.
2. The lexicographic optimisation is implemented in the `node_lexicoRF_split` function.

For further scientific references, please refer to the Notes section.

Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.
- threshold_gain (`float`): The threshold value for comparing gain ratios of features during the decision tree construction.
- criterion (`str`, optional, default="friedman_mse"): The function to measure the quality of a split. Do not change this value.
- splitter (`str`, optional, default="lexicoRF"): The strategy used to choose the split at each node. Do not change this value.
- max_depth (`Optional[int]`, default=3): The maximum depth of the tree.
- min_samples_split (`int`, optional, default=2): The minimum number of samples required to split an internal node.
- min_samples_leaf (`int`, optional, default=1): The minimum number of samples required to be at a leaf node.
- min_weight_fraction_leaf (`float`, optional, default=0.0): The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- max_features (`Optional[Union[int, str]]`, default=None): The number of features to consider when looking for the best split.
- random_state (`Optional[int]`, default=None): The seed used by the random number generator.
- max_leaf_nodes (`Optional[int]`, default=None): The maximum number of leaf nodes in the tree.
- min_impurity_decrease (`float`, optional, default=0.0): The minimum impurity decrease required for a node to be split.
- ccp_alpha (`float`, optional, default=0.0): Complexity parameter used for Minimal Cost-Complexity Pruning.
- tree_flavor (`bool`, optional, default=False): Indicates whether to use a specific tree flavor.
- n_estimators (`int`, optional, default=100): The number of boosting stages to run.
- learning_rate (`float`, optional, default=0.1): Learning rate shrinks the contribution of each tree by `learning_rate`.

fit:

Fit the Lexico Gradient Boosting Longitudinal Classifier model according to the given training data.

- X (`np.ndarray`): The training input samples.
- y (`np.ndarray`): The target values (class labels).

predict:

Predict class labels for samples in X.

- X (`np.ndarray`): The input samples.

predict_proba:

Predict class probabilities for samples in X.

- X (`np.ndarray`): The input samples.

Notes:

There is a trade-off between `learning_rate` and `n_estimators`. For more information, please refer to the papers cited in the library's references.
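A minimal usage sketch; the import path is an assumption for illustration, not verbatim library API:

``` py
import numpy as np

# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.estimators.ensemble.lexicographical.lexico_gradient_boosting import (
    LexicoGradientBoostingClassifier,
)

# Toy two-wave data: [smoke_w1, smoke_w2, cholesterol_w1, cholesterol_w2]
X = np.array([[0, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]])
y = np.array([0, 1, 0, 1, 0])

clf = LexicoGradientBoostingClassifier(
    features_group=[[0, 1], [2, 3]],  # waves of the two longitudinal attributes
    n_estimators=50,
    learning_rate=0.1,
)
clf.fit(X, y)
print(clf.predict(X))
```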
+
Here is the initial Python implementation of the Gradient Boosting algorithm: Gradient Boosting Sklearn
## LexicoRandomForestClassifier

``` py
LexicoRandomForestClassifier(
    n_estimators: int = 100, threshold_gain: float = 0.0015,
    features_group: List[List[int]] = None, max_depth: Optional[int] = None,
    min_samples_split: int = 2, min_samples_leaf: int = 1,
    min_weight_fraction_leaf: float = 0.0, max_features: Optional[Union[int, str]] = 'sqrt',
    max_leaf_nodes: Optional[int] = None, min_impurity_decrease: float = 0.0,
    class_weight: Optional[str] = None, ccp_alpha: float = 0.0, random_state: int = None, **kwargs
)
```
The Lexico Random Forest Classifier is an advanced ensemble algorithm specifically designed for longitudinal data analysis. It extends the traditional random forest algorithm by incorporating a lexicographic optimisation approach to select the best split at each node.

Lexicographic Optimisation

The primary goal of this approach is to prioritise the selection of more recent data points (wave ids) when determining splits in the decision tree, based on the premise that recent measurements are typically more predictive and relevant than older ones.

Key Features:

1. Splits are chosen lexicographically: among candidate splits whose gain ratios differ by less than `threshold_gain`, the feature from the more recent wave is preferred.
2. The lexicographic optimisation is implemented in the `node_lexicoRF_split` function.

For further scientific references, please refer to the Notes section.
Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.
- threshold_gain (`float`): The threshold value for comparing gain ratios of features during the decision tree construction.
- n_estimators (`int`, optional, default=100): The number of trees in the forest.
- criterion (`str`, optional, default="entropy"): The function to measure the quality of a split. Do not change this value.
- splitter (`str`, optional, default="lexicoRF"): The strategy used to choose the split at each node. Do not change this value.
- max_depth (`Optional[int]`, default=None): The maximum depth of the tree.
- min_samples_split (`int`, optional, default=2): The minimum number of samples required to split an internal node.
- min_samples_leaf (`int`, optional, default=1): The minimum number of samples required to be at a leaf node.
- min_weight_fraction_leaf (`float`, optional, default=0.0): The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- max_features (`Optional[Union[int, str]]`, default='sqrt'): The number of features to consider when looking for the best split.
- random_state (`Optional[int]`, default=None): The seed used by the random number generator.
- max_leaf_nodes (`Optional[int]`, default=None): The maximum number of leaf nodes in the tree.
- min_impurity_decrease (`float`, optional, default=0.0): The minimum impurity decrease required for a node to be split.
- class_weight (`Optional[str]`, default=None): Weights associated with classes in the form {class_label: weight}.
- ccp_alpha (`float`, optional, default=0.0): Complexity parameter used for Minimal Cost-Complexity Pruning.
- **kwargs (`dict`): Additional keyword arguments for the RandomForestClassifier.

Attributes:

- classes_ (`ndarray` of shape (n_classes,)): The class labels (single output problem).
- n_classes_ (`int`): The number of classes (single output problem).
- n_features_ (`int`): The number of features when fit is performed.
- n_outputs_ (`int`): The number of outputs when fit is performed.
- feature_importances_ (`ndarray` of shape (n_features,)): The impurity-based feature importances.
- max_features_ (`int`): The inferred value of max_features.
- estimators_ (`list` of LexicoDecisionTreeClassifier): The collection of fitted sub-estimators.

fit:

Fit the LexicoRandomForestClassifier model according to the given training data.

- X (`np.ndarray`): The training input samples.
- y (`np.ndarray`): The target values (class labels).
- sample_weight (`Optional[np.ndarray]`, default=None): Sample weights.

predict:

Predict class labels for samples in X.

- X (`np.ndarray`): The input samples.

predict_proba:

Predict class probabilities for samples in X.

- X (`np.ndarray`): The input samples.

Examples:

Consider the following dataset.

Features:

- `smoke` (longitudinal) with two waves/time-points
- `cholesterol` (longitudinal) with two waves/time-points
- `age` (non-longitudinal)
- `gender` (non-longitudinal)

Target:

- `stroke` (binary classification) at wave/time-point 2 only, for the sake of the example

The dataset is shown below:
| smoke_wave_1 | smoke_wave_2 | cholesterol_wave_1 | cholesterol_wave_2 | age | gender | stroke_wave_2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
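A minimal usage sketch on the toy dataset above; the import path is an assumption for illustration:

``` py
import numpy as np

# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.estimators.ensemble.lexicographical.lexico_random_forest import (
    LexicoRandomForestClassifier,
)

X = np.array([[0, 1, 0, 1, 45, 1], [1, 1, 1, 1, 50, 0], [0, 0, 0, 0, 55, 1],
              [1, 1, 1, 1, 60, 0], [0, 1, 0, 1, 65, 1]])
y = np.array([0, 1, 0, 1, 0])  # stroke_wave_2

clf = LexicoRandomForestClassifier(
    n_estimators=100,
    features_group=[[0, 1], [2, 3]],  # smoke and cholesterol waves
    random_state=42,
)
clf.fit(X, y)
print(clf.predict_proba(X))
```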
> For more information, please refer to the papers cited in the library's references.
+
## LongitudinalStackingClassifier

``` py
LongitudinalStackingClassifier(
    estimators: List[CustomClassifierMixinEstimator],
    meta_learner: Optional[Union[CustomClassifierMixinEstimator,
    ClassifierMixin]] = LogisticRegression(), n_jobs: int = 1
)
```
The Longitudinal Stacking Classifier is a sophisticated ensemble method designed to handle the unique challenges posed by longitudinal data. Using a stacking approach, it combines multiple base estimators whose predictions feed a meta-learner, which generates the final prediction. The base estimators are individually trained on the entire dataset, and their predictions serve as inputs to the meta-learner.

When to Use?

This classifier is primarily used when the "SepWav" (Separate Waves) strategy is employed. It can, however, also be applied with longitudinal-based estimators only, without following the SepWav approach.

SepWav (Separate Waves) Strategy

The SepWav strategy involves considering each wave's features and the class variable as a separate dataset, then learning a classifier for each dataset. The class labels predicted by these classifiers are combined into a final predicted class label. This combination can be achieved using various approaches: simple majority voting, weighted voting with weights decaying linearly or exponentially for older waves, weights optimised by cross-validation on the training set (see LongitudinalVoting), and stacking methods (the current class) that use the classifiers' predicted labels as input for learning a meta-classifier (using a decision tree, logistic regression, or random forest algorithm).

Wrapper Around Sklearn StackingClassifier

This class wraps the sklearn `StackingClassifier`, offering a familiar interface while incorporating enhancements for longitudinal data.
Args:

- estimators (`List[CustomClassifierMixinEstimator]`): The base estimators for the ensemble; they need to be trained already.
- meta_learner (`Optional[Union[CustomClassifierMixinEstimator, ClassifierMixin]]`): The meta-learner to be used in stacking.
- n_jobs (`int`): The number of jobs to run in parallel for fitting base estimators.

The fitted underlying sklearn `StackingClassifier` instance is exposed as an attribute.

fit:

Fits the ensemble model.

- X (`np.ndarray`): The input data.
- y (`np.ndarray`): The target data.

predict:

Predicts the target data for the given input data.

- X (`np.ndarray`): The input data.

predict_proba:

Predicts the target data probabilities for the given input data.

- X (`np.ndarray`): The input data.

Examples:

Consider the following dataset.

Features:

- `smoke` (longitudinal) with two waves/time-points
- `cholesterol` (longitudinal) with two waves/time-points
- `age` (non-longitudinal)
- `gender` (non-longitudinal)

Target:

- `stroke` (binary classification) at wave/time-point 2 only, for the sake of the example

The dataset is shown below:
| smoke_wave_1 | smoke_wave_2 | cholesterol_wave_1 | cholesterol_wave_2 | age | gender | stroke_wave_2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
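A minimal usage sketch; the import path is an assumption, and plain sklearn decision trees stand in for the library's custom estimators purely for illustration (the type hints expect `CustomClassifierMixinEstimator` instances):

``` py
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.estimators.ensemble.longitudinal_stacking import (
    LongitudinalStackingClassifier,
)

X = np.array([[0, 1, 0, 1, 45, 1], [1, 1, 1, 1, 50, 0], [0, 0, 0, 0, 55, 1],
              [1, 1, 1, 1, 60, 0], [0, 1, 0, 1, 65, 1]])
y = np.array([0, 1, 0, 1, 0])

# The base estimators must be fitted before they are passed in.
base_estimators = [
    DecisionTreeClassifier(max_depth=2, random_state=seed).fit(X, y)
    for seed in (0, 1, 2)
]

clf = LongitudinalStackingClassifier(
    estimators=base_estimators,
    meta_learner=LogisticRegression(),
    n_jobs=1,
)
clf.fit(X, y)
print(clf.predict(X))
```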
> For more information, please refer to the paper cited in the library's references.
+
## LongitudinalVotingClassifier

``` py
LongitudinalVotingClassifier(
    voting: LongitudinalEnsemblingStrategy = LongitudinalEnsemblingStrategy.MAJORITY_VOTING,
    estimators: List[CustomClassifierMixinEstimator] = None,
    extract_wave: Callable = None, n_jobs: int = 1
)
```
The Longitudinal Voting Classifier is a versatile ensemble method designed to handle the unique challenges posed by longitudinal data. By leveraging different voting strategies, it combines predictions from multiple base estimators to enhance predictive performance. The base estimators are individually trained, and their predictions are aggregated according to the chosen voting strategy to generate the final prediction.

When to Use?

This classifier is primarily used when the "SepWav" (Separate Waves) strategy is employed. It can, however, also be applied with longitudinal-based estimators only, without following the SepWav approach.

SepWav (Separate Waves) Strategy

The SepWav strategy involves considering each wave's features and the class variable as a separate dataset, then learning a classifier for each dataset. The class labels predicted by these classifiers are combined into a final predicted class label. This combination can be achieved using various approaches: simple majority voting, weighted voting with weights decaying linearly or exponentially for older waves, weights optimised by cross-validation on the training set (the current class), and stacking methods that use the classifiers' predicted labels as input for learning a meta-classifier (see LongitudinalStacking).

Wrapper Around Sklearn VotingClassifier

This class wraps the sklearn `VotingClassifier`, offering a familiar interface while incorporating enhancements for longitudinal data.
Args:

- voting (`LongitudinalEnsemblingStrategy`): The voting strategy to be used for the ensemble. Refer to the LongitudinalEnsemblingStrategy enum below.
- estimators (`List[CustomClassifierMixinEstimator]`): A list of classifiers for the ensemble. Note that the classifiers need to be trained before being passed to the LongitudinalVotingClassifier.
- extract_wave (`Callable`): A function to extract specific wave data for training.
- n_jobs (`int`, optional, default=1): The number of jobs to run in parallel.

Note that the voting strategies apply to `predict` and not `predict_proba`: `predict_proba` takes the average of votes, similarly to how sklearn's voting classifier does.

fit:

Fit the ensemble model.

- X (`np.ndarray`): The training data.
- y (`np.ndarray`): The target values.

predict:

Predict using the ensemble model.

- X (`np.ndarray`): The test data.

predict_proba:

Predict probabilities using the ensemble model.

- X (`np.ndarray`): The test data.

Examples:

Consider the following dataset.

Features:

- `smoke` (longitudinal) with two waves/time-points
- `cholesterol` (longitudinal) with two waves/time-points
- `age` (non-longitudinal)
- `gender` (non-longitudinal)

Target:

- `stroke` (binary classification) at wave/time-point 2 only, for the sake of the example

The dataset is shown below:
| smoke_wave_1 | smoke_wave_2 | cholesterol_wave_1 | cholesterol_wave_2 | age | gender | stroke_wave_2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
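A minimal usage sketch; the import paths are assumptions, and plain sklearn decision trees stand in for the library's custom estimators purely for illustration:

``` py
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Import paths assumed for illustration; see the library's API index.
from scikit_longitudinal.estimators.ensemble.longitudinal_voting import (
    LongitudinalEnsemblingStrategy,
    LongitudinalVotingClassifier,
)

X = np.array([[0, 1, 0, 1, 45, 1], [1, 1, 1, 1, 50, 0], [0, 0, 0, 0, 55, 1],
              [1, 1, 1, 1, 60, 0], [0, 1, 0, 1, 65, 1]])
y = np.array([0, 1, 0, 1, 0])

# The classifiers must be trained before being passed in.
base_estimators = [
    DecisionTreeClassifier(max_depth=2, random_state=seed).fit(X, y)
    for seed in (0, 1, 2)
]

clf = LongitudinalVotingClassifier(
    voting=LongitudinalEnsemblingStrategy.MAJORITY_VOTING,
    estimators=base_estimators,
    n_jobs=1,
)
clf.fit(X, y)
print(clf.predict(X))
```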
> For more information, please refer to the paper cited in the library's references.
+
## LongitudinalEnsemblingStrategy

An enum for the different longitudinal voting strategies.

Values:

- MAJORITY_VOTING (`int`): Simple consensus voting where the most frequent prediction is selected.
- WEIGHTED_VOTING (`int`): Weights each classifier's vote based on the recency of its wave.
- A cross-validation-based variant (`int`): weights each classifier based on its cross-validation accuracy on the training data.
- STACKING (`int`): A stacking ensemble strategy that uses a meta-learner to combine the predictions of the base classifiers. The meta-learner is trained on meta-features formed from the base classifiers' predictions. This approach is suitable when the cardinality of the meta-features is smaller than that of the original feature set.

In stacking, for each wave \( i \) (\( i \in \{1, 2, \ldots, N\} \)), a base classifier \( C_i \) is trained on \( (X_i, T_N) \). The prediction from \( C_i \) is denoted \( V_i \), forming the meta-features \( \mathbf{V} = [V_1, V_2, \ldots, V_N] \). The meta-learner \( M \) is then trained on \( (\mathbf{V}, T_N) \) and, for a new instance \( x \), the final prediction is \( P(x) = M(\mathbf{V}(x)) \).

## NestedTreesClassifier

``` py
NestedTreesClassifier(
    features_group: List[List[int]] = None,
    non_longitudinal_features: List[Union[int, str]] = None, max_outer_depth: int = 3,
    max_inner_depth: int = 2, min_outer_samples: int = 5,
    inner_estimator_hyperparameters: Optional[Dict[str, Any]] = None,
    save_nested_trees: bool = False, parallel: bool = False, num_cpus: int = -1
)
```
The Nested Trees Classifier is a unique and innovative classification algorithm specifically designed for longitudinal datasets. It enhances traditional decision tree algorithms by embedding smaller decision trees within the nodes of a primary tree structure, making full use of the information inherent in longitudinal data.

Nested Trees Structure

The outer decision tree employs a custom algorithm that selects longitudinal attributes, categorised as groups of time-specific attributes. The inner embedded decision trees use Scikit-Learn's decision tree algorithm, partitioning the dataset based on the longitudinal attribute of the parent node.

Wrapper Around Sklearn DecisionTreeClassifier

This class wraps the sklearn `DecisionTreeClassifier`, offering a familiar interface while incorporating enhancements for longitudinal data. It ensures effective processing and learning from data collected over multiple time points.
Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.
- non_longitudinal_features (`List[Union[int, str]]`, optional): A list of indices of features that are not longitudinal attributes. Defaults to None.
- max_outer_depth (`int`, optional, default=3): The maximum depth of the outer custom decision tree.
- max_inner_depth (`int`, optional, default=2): The maximum depth of the inner decision trees.
- min_outer_samples (`int`, optional, default=5): The minimum number of samples required to split an internal node in the outer decision tree.
- inner_estimator_hyperparameters (`Dict[str, Any]`, optional): A dictionary of hyperparameters to be passed to the inner Scikit-learn decision tree estimators. Defaults to None.
- save_nested_trees (`bool`, optional, default=False): If set to True, the nested trees structure plot will be saved, which may be useful for model interpretation and visualisation.
- parallel (`bool`, optional, default=False): Whether to use parallel processing.
- num_cpus (`int`, optional, default=-1): The number of CPUs to use for parallel processing. Defaults to -1 (use all available).

The root node of the outer decision tree is exposed as an attribute (`Node`, optional): set to None upon initialisation, it is updated during model fitting.

fit:

Fit the Nested Trees Classifier model according to the given training data.

- X (`np.ndarray`): The training input samples.
- y (`np.ndarray`): The target values (class labels).

predict:

Predict class labels for samples in X.

- X (`np.ndarray`): The input samples.

predict_proba:

Predict class probabilities for samples in X.

- X (`np.ndarray`): The input samples.
``` py
.print_nested_tree(
    node: Optional['NestedTreesClassifier.Node'] = None, depth: int = 0, prefix: str = '',
    parent_name: str = ''
)
```
Print the structure of the nested tree classifier.

- node (`Optional[NestedTreesClassifier.Node]`, optional): The current node in the outer decision tree. If None, start from the root node. Defaults to None.
- depth (`int`, optional, default=0): The current depth of the node in the outer decision tree.
- prefix (`str`, optional, default=""): A string to prepend before the node's name in the output.
- parent_name (`str`, optional, default=""): The name of the parent node in the outer decision tree.

Examples:

Consider the following dataset.

Features:

- `smoke` (longitudinal) with two waves/time-points
- `cholesterol` (longitudinal) with two waves/time-points
- `age` (non-longitudinal)
- `gender` (non-longitudinal)

Target:

- `stroke` (binary classification) at wave/time-point 2 only, for the sake of the example

The dataset is shown below:
| smoke_wave_1 | smoke_wave_2 | cholesterol_wave_1 | cholesterol_wave_2 | age | gender | stroke_wave_2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
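A minimal usage sketch on the toy dataset above; the import path is an assumption for illustration:

``` py
import numpy as np

# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.estimators.trees.nested_trees import NestedTreesClassifier

X = np.array([[0, 1, 0, 1, 45, 1], [1, 1, 1, 1, 50, 0], [0, 0, 0, 0, 55, 1],
              [1, 1, 1, 1, 60, 0], [0, 1, 0, 1, 65, 1]])
y = np.array([0, 1, 0, 1, 0])  # stroke_wave_2

clf = NestedTreesClassifier(
    features_group=[[0, 1], [2, 3]],   # smoke and cholesterol waves
    non_longitudinal_features=[4, 5],  # age, gender
    max_outer_depth=3,
    max_inner_depth=2,
)
clf.fit(X, y)
clf.print_nested_tree()
print(clf.predict(X))
```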
> For more information, see the paper on the Nested Trees algorithm cited in the library's references.

Here is the initial Java implementation of the Nested Trees algorithm: Nested Trees GitHub
## LexicoDecisionTreeClassifier

``` py
LexicoDecisionTreeClassifier(
    threshold_gain: float = 0.0015, features_group: List[List[int]] = None,
    criterion: str = 'entropy', splitter: str = 'lexicoRF',
    max_depth: Optional[int] = None, min_samples_split: int = 2,
    min_samples_leaf: int = 1, min_weight_fraction_leaf: float = 0.0,
    max_features: Optional[Union[int, str]] = None, random_state: Optional[int] = None,
    max_leaf_nodes: Optional[int] = None, min_impurity_decrease: float = 0.0,
    class_weight: Optional[str] = None, ccp_alpha: float = 0.0
)
```
The Lexico Decision Tree Classifier is an advanced classification model specifically designed for longitudinal data. This implementation extends the traditional decision tree algorithm by incorporating a lexicographic optimisation approach.

Lexicographic Optimisation

The primary goal of this approach is to prioritise the selection of more recent data points (wave ids) when determining splits in the decision tree, based on the premise that recent measurements are typically more predictive and relevant than older ones.

Key Features:

1. Splits are chosen lexicographically: among candidate splits whose gain ratios differ by less than `threshold_gain`, the feature from the more recent wave is preferred.
2. The lexicographic optimisation is implemented in the `node_lexicoRF_split` function.

For further scientific references, please refer to the Notes section.
): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.float
): The threshold value for comparing gain ratios of features during the decision tree construction.str
, optional, default="entropy"): The function to measure the quality of a split. Do not change this value.str
, optional, default="lexicoRF"): The strategy used to choose the split at each node. Do not change this value.Optional[int]
, default=None): The maximum depth of the tree.int
, optional, default=2): The minimum number of samples required to split an internal node.int
, optional, default=1): The minimum number of samples required to be at a leaf node.float
, optional, default=0.0): The minimum weighted fraction of the sum total of weights required to be at a leaf node.Optional[Union[int, str]]
, default=None): The number of features to consider when looking for the best split.Optional[int]
, default=None): The seed used by the random number generator.Optional[int]
, default=None): The maximum number of leaf nodes in the tree.float
, optional, default=0.0): The minimum impurity decrease required for a node to be split.Optional[str]
, default=None): Weights associated with classes in the form of {class_label: weight}.float
, optional, default=0.0): Complexity parameter used for Minimal Cost-Complexity Pruning.ndarray
of shape (n_classes,)): The classes labels (single output problem).int
): The number of classes (single output problem).int
): The number of features when fit is performed.int
): The number of outputs when fit is performed.ndarray
of shape (n_features,)): The impurity-based feature importances.int
): The inferred value of max_features.Tree
object): The underlying Tree object..fit(
``` py
.fit(
    X: np.ndarray, y: np.ndarray, sample_weight: Optional[np.ndarray] = None,
    check_input: bool = True, X_idx_sorted: Optional[np.ndarray] = None
)
```
Fit the decision tree classifier.

This method fits the `LexicoDecisionTreeClassifier` to the given data.

- X (`np.ndarray`): The training input samples of shape `(n_samples, n_features)`.
- y (`np.ndarray`): The target values of shape `(n_samples,)`.
- sample_weight (`Optional[np.ndarray]`, default=None): Sample weights.
- check_input (`bool`, default=True): Allows bypassing several input checks. Do not use this parameter unless you know what you are doing.
- X_idx_sorted (`Optional[np.ndarray]`, default=None): The indices of the sorted training input samples. If many trees are grown on the same dataset, this allows the use of sorted representations in max_features and max_depth searches.

predict:

Predict class value for X. The predicted class of an input sample is the majority class of the training samples in the leaf that the sample falls into.

- X (`np.ndarray`): The input samples of shape `(n_samples, n_features)`.

predict_proba:

Predict class probabilities for X. The predicted class probabilities of an input sample are computed as the fraction of training samples of the same class in the leaf that the sample falls into.

- X (`np.ndarray`): The input samples of shape `(n_samples, n_features)`.

Examples:

Consider the following dataset.

Features:

- `smoke` (longitudinal) with two waves/time-points
- `cholesterol` (longitudinal) with two waves/time-points
- `age` (non-longitudinal)
- `gender` (non-longitudinal)

Target:

- `stroke` (binary classification) at wave/time-point 2 only, for the sake of the example

The dataset is shown below:
| smoke_wave_1 | smoke_wave_2 | cholesterol_wave_1 | cholesterol_wave_2 | age | gender | stroke_wave_2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
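Example 1 (default parameters): a minimal usage sketch on the toy dataset above. The import path is an assumption for illustration:

``` py
import numpy as np

# Import path assumed for illustration; see the library's API index.
from scikit_longitudinal.estimators.trees.lexicographical.lexico_decision_tree import (
    LexicoDecisionTreeClassifier,
)

X = np.array([[0, 1, 0, 1, 45, 1], [1, 1, 1, 1, 50, 0], [0, 0, 0, 0, 55, 1],
              [1, 1, 1, 1, 60, 0], [0, 1, 0, 1, 65, 1]])
y = np.array([0, 1, 0, 1, 0])  # stroke_wave_2

clf = LexicoDecisionTreeClassifier(features_group=[[0, 1], [2, 3]])
clf.fit(X, y)
print(clf.predict(X))
```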
> For more information, please refer to the paper cited in the library's references.
+
## LexicoDecisionTreeRegressor

``` py
LexicoDecisionTreeRegressor(
    threshold_gain: float = 0.0015, features_group: List[List[int]] = None,
    criterion: str = 'friedman_mse', splitter: str = 'lexicoRF',
    max_depth: Optional[int] = None, min_samples_split: int = 2,
    min_samples_leaf: int = 1, min_weight_fraction_leaf: float = 0.0,
    max_features: Optional[Union[int, str]] = None, random_state: Optional[int] = None,
    max_leaf_nodes: Optional[int] = None, min_impurity_decrease: float = 0.0,
    ccp_alpha: float = 0.0
)
```
The Lexico Decision Tree Regressor is an advanced regression model specifically designed for longitudinal data. This implementation extends the traditional decision tree algorithm by incorporating a lexicographic optimisation approach.

Lexicographical Optimisation

The primary goal of this approach is to prioritise the selection of more recent data points (wave ids) when determining splits in the decision tree, based on the premise that recent measurements are typically more predictive and relevant than older ones.

Key Features:

1. Splits are chosen lexicographically: among candidate splits whose gain ratios differ by less than `threshold_gain`, the feature from the more recent wave is preferred.
2. The lexicographic optimisation is implemented in the `node_lexicoRF_split` function.

For further scientific references, please refer to the Notes section.

Why is there a regressor?

While Sklong focuses on classification, this regressor is necessary for the Lexico Gradient Boosting classifier. It could, however, also be of use for standalone regression tasks.
Args:

- features_group (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.
- threshold_gain (`float`): The threshold value for comparing gain ratios of features during the decision tree construction.
- criterion (`str`, optional, default="friedman_mse"): The function to measure the quality of a split. It is not recommended to change this value.
- splitter (`str`, optional, default="lexicoRF"): The strategy used to choose the split at each node. Do not change this value.
- max_depth (`Optional[int]`, default=None): The maximum depth of the tree.
- min_samples_split (`int`, optional, default=2): The minimum number of samples required to split an internal node.
- min_samples_leaf (`int`, optional, default=1): The minimum number of samples required to be at a leaf node.
- min_weight_fraction_leaf (`float`, optional, default=0.0): The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- max_features (`Optional[Union[int, str]]`, default=None): The number of features to consider when looking for the best split.
- random_state (`Optional[int]`, default=None): The seed used by the random number generator.
- max_leaf_nodes (`Optional[int]`, default=None): The maximum number of leaf nodes in the tree.
- min_impurity_decrease (`float`, optional, default=0.0): The minimum impurity decrease required for a node to be split.
- ccp_alpha (`float`, optional, default=0.0): Complexity parameter used for Minimal Cost-Complexity Pruning.

Attributes:

- n_features_ (`int`): The number of features when fit is performed.
- n_outputs_ (`int`): The number of outputs when fit is performed.
- feature_importances_ (`ndarray` of shape (n_features,)): The impurity-based feature importances.
- max_features_ (`int`): The inferred value of max_features.
- tree_ (`Tree` object): The underlying Tree object.
``` py
.fit(
    X: np.ndarray, y: np.ndarray, sample_weight: Optional[np.ndarray] = None,
    check_input: bool = True, X_idx_sorted: Optional[np.ndarray] = None
)
```
Fit the decision tree regressor.

This method fits the `LexicoDecisionTreeRegressor` to the given data.

- X (`np.ndarray`): The training input samples of shape `(n_samples, n_features)`.
- y (`np.ndarray`): The target values of shape `(n_samples,)`.
- sample_weight (`Optional[np.ndarray]`, default=None): Sample weights.
- check_input (`bool`, default=True): Allows bypassing several input checks. Do not use this parameter unless you know what you are doing.
- X_idx_sorted (`Optional[np.ndarray]`, default=None): The indices of the sorted training input samples. If many trees are grown on the same dataset, this allows the use of sorted representations in max_features and max_depth searches.

predict:

Predict regression value for X. The predicted value of an input sample is the mean of the training target values in the leaf that the sample falls into.

- X (`np.ndarray`): The input samples of shape `(n_samples, n_features)`.

Examples:

Consider the following dataset.

Features:

- `smoke` (longitudinal) with two waves/time-points
- `cholesterol` (longitudinal) with two waves/time-points
- `age` (non-longitudinal)
- `gender` (non-longitudinal)

Target:

- `blood_pressure` (continuous variable) at wave/time-point 2 only, for the sake of the example

The dataset is shown below:
| smoke_wave_1 | smoke_wave_2 | cholesterol_wave_1 | cholesterol_wave_2 | age | gender | blood_pressure_wave_2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 120 |
| 1 | 1 | 1 | 1 | 50 | 0 | 130 |
| 0 | 0 | 0 | 0 | 55 | 1 | 110 |
| 1 | 1 | 1 | 1 | 60 | 0 | 140 |
| 0 | 1 | 0 | 1 | 65 | 1 | 125 |
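A minimal usage sketch on the toy dataset above (the module path follows the source link for this class; the exported symbol location is otherwise an assumption):

``` py
import numpy as np

from scikit_longitudinal.estimators.trees.lexicographical.lexico_decision_tree_regressor import (
    LexicoDecisionTreeRegressor,
)

X = np.array([[0, 1, 0, 1, 45, 1], [1, 1, 1, 1, 50, 0], [0, 0, 0, 0, 55, 1],
              [1, 1, 1, 1, 60, 0], [0, 1, 0, 1, 65, 1]])
y = np.array([120, 130, 110, 140, 125])  # blood_pressure_wave_2

reg = LexicoDecisionTreeRegressor(features_group=[[0, 1], [2, 3]])
reg.fit(X, y)
print(reg.predict(X))
```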
> The Lexico Decision Tree Regressor leverages advanced techniques in decision tree optimisation to handle longitudinal data effectively. For further reading, please refer to the references cited in the library's documentation.
+
Welcome to the full API documentation of the `Scikit-Longitudinal` toolbox.
## LongitudinalPipeline

``` py
LongitudinalPipeline(
    steps: List[Tuple[str, Any]], features_group: List[List[int]],
    non_longitudinal_features: List[Union[int, str]] = None,
    update_feature_groups_callback: Union[Callable, str] = None,
    feature_list_names: List[str] = None
)
```
Custom pipeline for handling and processing longitudinal techniques (preprocessors, classifiers, etc.). This class extends scikit-learn's `Pipeline` to offer specialised methods and attributes for working with longitudinal data. It ensures that the longitudinal features and their structure are updated throughout the pipeline's transformations.
List[Tuple[str, Any]]
): List of (name, transform) tuples (implementing fit
/transform
) that are chained, in the order in which they are chained, with the last object being an estimator.List[List[int]]
): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.List[Union[int, str]]
, optional): A list of indices of features that are not longitudinal attributes. Defaults to None.Union[Callable, str]
, optional): Callback function to update feature groups. This function is invoked to update the structure of longitudinal features during pipeline transformations.List[str]
, optional): List of names corresponding to the features.np.ndarray
): Longitudinal data being processed.np.ndarray
): Indices of the selected features.Any
): Final step in the pipeline.++Note:
+
+While this class maintains the interface of scikit-learn'sPipeline
, it includes specific methods and validations to ensure the correct processing of longitudinal data.
```python
.fit(
    X: np.ndarray,
    y: Optional[Union[pd.Series, np.ndarray]] = None,
    **fit_params: Dict[str, Any]
)
```
Fit the transformers in the pipeline and then the final estimator. For each step, the transformers are configured and fitted. The data is transformed and updated for each step, ensuring that the longitudinal feature structure is maintained.
- X (`np.ndarray`): The input data.
- y (`Optional[Union[pd.Series, np.ndarray]]`): The target variable.
- fit_params (`Dict[str, Any]`): Additional fitting parameters.

Predict the target values using the final estimator of the pipeline.

- X (`np.ndarray`): The input data.
- predict_params (`Dict[str, Any]`): Additional prediction parameters.

Predict the probability of the target values using the final estimator of the pipeline.

- X (`np.ndarray`): The input data.
- predict_params (`Dict[str, Any]`): Additional prediction parameters.

Transform the input data using the final estimator of the pipeline.

- X (`np.ndarray`): The input data.
- transform_params (`Dict[str, Any]`): Additional transformation parameters.

update_feature_groups_callback
The update_feature_groups_callback is a crucial component of the LongitudinalPipeline. This callback function is responsible for updating the structure of longitudinal features during each step of the pipeline. Here's a detailed breakdown of how it works:

The update_feature_groups_callback ensures that the structure of longitudinal features is accurately maintained and updated as data flows through the pipeline's transformers. This is essential because longitudinal data often requires specific handling to preserve its temporal or grouped characteristics.

By default, the LongitudinalPipeline includes a built-in callback function that automatically manages the update of longitudinal features for the library's current transformers/estimators. This default implementation ensures that users can utilise the pipeline without needing to provide a custom callback for the currently available techniques, simplifying the initial setup.

For more advanced use cases, users can provide their own callback function, e.g., when a new data pre-processing technique changes the temporal structure of the dataset. This custom function can be passed as a lambda or a regular function, and it allows users to implement logic tailored to their unique longitudinal data processing needs.
```python
def update_feature_groups_callback(
    step_idx: int,
    longitudinal_dataset: LongitudinalDataset,
    y: Optional[Union[pd.Series, np.ndarray]],
    name: str,
    transformer: TransformerMixin
) -> Tuple[np.ndarray, List[List[int]], List[Union[int, str]], List[str]]:
    ...
```
- step_idx (`int`): The index of the current step in the pipeline.
- longitudinal_dataset (`LongitudinalDataset`): A custom dataset object that includes the longitudinal data and feature groups.
- y (`Optional[Union[pd.Series, np.ndarray]]`): The target variable.
- name (`str`): The name of the current transformer.
- transformer (`TransformerMixin`): The current transformer being applied in the pipeline.

Returns:

- `np.ndarray`: The updated longitudinal data.
- `List[List[int]]`: The updated grouping of longitudinal features.
- `List[Union[int, str]]`: The updated list of non-longitudinal features.
- `List[str]`: The updated list of feature names.

1. Initialise Dataset: The function starts by initialising a LongitudinalDataset object using the current state of the longitudinal data and feature groups.
2. Update Features: The callback function is invoked with the current step index, the LongitudinalDataset object, the target variable, the name of the transformer, and the transformer itself.
3. Return Updated Data: The callback function returns the updated longitudinal data, features group, non-longitudinal features, and feature list names, which are then used in subsequent steps of the pipeline.
Users can pass a lambda function as the update_feature_groups_callback to quickly define custom update logic. For example:
```python
pipeline = LongitudinalPipeline(
    steps=[...],
    features_group=[...],
    update_feature_groups_callback=lambda step_idx, dataset, y, name, transformer: (
        custom_update_function(step_idx, dataset, y, name, transformer)
    )
)
```
This allows for easy customisation and experimentation with different feature group update strategies. For further details on the LongitudinalPipeline class, refer to the source code or open a GitHub issue.
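For illustration, a pass-through implementation of the hypothetical custom_update_function from the example above might look like the following sketch. The accessor attributes on the dataset object are assumptions; consult the LongitudinalDataset API for the actual names.

```python
from typing import List, Optional, Tuple, Union

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin


def custom_update_function(
    step_idx: int,
    dataset,  # LongitudinalDataset holding the current data and feature groups
    y: Optional[Union[pd.Series, np.ndarray]],
    name: str,
    transformer: TransformerMixin,
) -> Tuple[np.ndarray, List[List[int]], List[Union[int, str]], List[str]]:
    # Pass-through: assumes this transformer does not alter the temporal
    # structure, so the current data, groups, and names are returned unchanged.
    # The attribute names below (data, feature_groups, ...) are assumptions;
    # consult the LongitudinalDataset API for the actual accessors.
    return (
        dataset.data,
        dataset.feature_groups,
        dataset.non_longitudinal_features,
        dataset.feature_list_names,
    )
```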
```python
CorrelationBasedFeatureSelectionPerGroup(
    non_longitudinal_features: Optional[List[int]] = None,
    search_method: str = 'greedySearch',
    features_group: Optional[List[List[int]]] = None,
    parallel: bool = False,
    outer_search_method: str = None,
    inner_search_method: str = 'exhaustiveSearch',
    version = 1,
    num_cpus: int = -1
)
```
Correlation-based Feature Selection (CFS) per group (CFS Per Group).
This class performs feature selection using the CFS-Per-Group algorithm on the given data. The CFS algorithm is, in a nutshell, a filter method that selects features that are strongly correlated with the target variable while exhibiting low mutual correlation with one another. CFS-Per-Group, in turn, is an adaptation of the original CFS, tailored to exploit the temporal structure of longitudinal data.
CFS-Per-Group: a Longitudinal Variation of the Standard CFS Method

CFS-Per-Group, also known as Exh-CFS-Gr in the literature, is a longitudinal variation of the standard CFS method. It is designed to handle longitudinal data by considering temporal variations across multiple waves (time points). The method works in two phases:

1. Phase 1 runs a CFS search (the inner_search_method) separately within each feature group, i.e., across the waves of each longitudinal attribute, and aggregates the features selected per group into a single list.
2. Phase 2 (in version 2) runs a further CFS search (the outer_search_method) over the aggregated list of features from the first phase, together with the non-longitudinal features.

For more scientific references, refer to the Notes section.
Standard CFS Algorithm Implementation Available

If you would like to use the standard CFS algorithm, please refer to the CorrelationBasedFeatureSelection class. Given that this is out of the scope of this documentation, we recommend checking the source code for more information.
- features_group (`Optional[List[Tuple[int, ...]]]`, default=None): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.
- non_longitudinal_features (`Optional[List[int]]`): A list of feature indices that are considered non-longitudinal. In version-2, these features will be employed in the second phase of the CFS per group algorithm.
- search_method (`str`, default="greedySearch"): The search method to use (Phase-1). Options are "exhaustiveSearch" and "greedySearch".
- version (`int`, default=2): The version of the CFS per group algorithm to use. Options are "1" and "2". Version 2 is the improved version, with an outer search over the final aggregated list of features from the first phase.
- outer_search_method (`str`, default=None): The outer (over the final aggregated list of features) search method to use for the CFS per group (longitudinal component). If None, it defaults to the same as search_method.
- inner_search_method (`str`, default="exhaustiveSearch"): The inner (within each longitudinal attribute's waves) search method to use for the CFS per group (longitudinal component).
- parallel (`bool`, default=False): Whether to use parallel processing for the CFS algorithm (especially useful for the exhaustive search method with the CFS per group, i.e., longitudinal component).
- num_cpus (`int`, default=-1): The number of CPUs to use for parallel processing. If -1, all available CPUs will be used.

Attributes:

- selected_features_ (`ndarray` of shape (n_features,)): The indices of the selected features.

Fits the CFS algorithm on the input data and target variable.
- X (`np.ndarray`): The input data of shape (n_samples, n_features).
- y (`np.ndarray`): The target variable of shape (n_samples).

Reduces the input data to only the selected features.

> Warning
>
> Not to be used directly. Use the apply_selected_features_and_rename method instead.

- X (`np.ndarray`): A numpy array of shape (n_samples, n_features) representing the input data.

```python
.apply_selected_features_and_rename(
    df: pd.DataFrame,
    selected_features: List,
    regex_match = '^(.+)_w(\\d+)$'
)
```
Apply selected features to the input DataFrame and rename non-longitudinal features. This function applies the selected features given via the selected_features_ attribute, which you can retrieve with your_model.selected_features_. It also renames longitudinal features that have effectively become non-longitudinal, i.e., when only one wave remains after the feature selection process, to avoid them being treated as longitudinal attributes during future automatic feature grouping.
> Note
>
> To avoid adding a "transform" parameter to the TransformerMixin class, this function was created instead. Given that changes to both longitudinal and non-longitudinal features are needed, this new function replaces transform. Rest assured, the LongitudinalPipeline handles this workaround by default, without requiring anything from the user.
- df (`pd.DataFrame`): The input DataFrame on which to apply the selected features and perform renaming.
- selected_features (`List`): The list of selected features to apply to the input DataFrame.
- regex_match (`str`): The regex pattern to use for renaming non-longitudinal features. By default, it follows the ELSA naming convention for longitudinal features. For more information, see the source code or open an issue.
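For intuition on the default regex_match pattern, the short, illustrative snippet below shows how it splits a wave-suffixed column name into a base attribute name and a wave index:

```python
import re

pattern = r'^(.+)_w(\d+)$'  # the default regex_match from the signature above

match = re.match(pattern, "income_w2")
print(match.groups())  # ('income', '2') -- base attribute name and wave index
```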
Consider the following dataset:

Features:

- smoke (longitudinal) with two waves/time-points
- cholesterol (longitudinal) with two waves/time-points
- age (non-longitudinal)
- gender (non-longitudinal)

Target:

- stroke (binary classification) at wave/time-point 2, only for the sake of the example

The dataset is shown below:
| smoke_wave_1 | smoke_wave_2 | cholesterol_wave_1 | cholesterol_wave_2 | age | gender | stroke_wave_2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
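As a concrete illustration, here is a minimal sketch of running CFS-Per-Group on this dataset and applying the selection. The import path is an assumption about the package layout; fit, selected_features_, and apply_selected_features_and_rename are used as documented above.

```python
import numpy as np
import pandas as pd

# Hypothetical import path; check Scikit-Longitudinal's package layout.
from scikit_longitudinal.preprocessors.feature_selection import (
    CorrelationBasedFeatureSelectionPerGroup,
)

columns = ["smoke_wave_1", "smoke_wave_2", "cholesterol_wave_1",
           "cholesterol_wave_2", "age", "gender"]
df = pd.DataFrame([
    [0, 1, 0, 1, 45, 1],
    [1, 1, 1, 1, 50, 0],
    [0, 0, 0, 0, 55, 1],
    [1, 1, 1, 1, 60, 0],
    [0, 1, 0, 1, 65, 1],
], columns=columns)
y = np.array([0, 1, 0, 1, 0])  # stroke_wave_2

selector = CorrelationBasedFeatureSelectionPerGroup(
    features_group=[(0, 1), (2, 3)],   # waves of smoke and cholesterol
    non_longitudinal_features=[4, 5],  # age, gender
    version=2,
)
selector.fit(df.to_numpy(), y)

# Apply the selection and rename single-wave leftovers, as described above.
df_selected = selector.apply_selected_features_and_rename(
    df, selector.selected_features_
)
print(df_selected.columns.tolist())
```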
The improved Correlation-Based Feature Selection (CFS) algorithm is built upon the following key references: