Nthread in prophet_xgboost and select_best Computation Speed #258

Zhouyi-Joey opened this issue Dec 2, 2024 · 0 comments
Hi Matt,

I’ve been using your fantastic tools for time-series forecasting and have encountered a couple of issues that may need clarification or improvement:

1. When setting set_engine("prophet_xgboost", nthread = 12), the specified number of threads (12) does not appear to be used during computation. Could you confirm whether nthread is actually passed through to the XGBoost backend in this setup? (See the translate() check sketched after this list.)
2. Alternatively, using parallel_start(12, .method = "parallel") enables parallelism for most of the computation steps. However, during the final step, where I run

best_result <- select_best(tune_results, metric = "rmse")

only a single core is used, leading to significant delays for large datasets.
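
Regarding point 1, the snippet below is a sketch of how one might verify where nthread ends up; my assumption is that parsnip's translate() shows the arguments forwarded to the underlying prophet_xgboost fit call:

# Sketch: inspect how parsnip forwards engine arguments (assumes translate()
# reflects the eventual fit call; nthread = 12 should appear in the template)
library(modeltime)
library(parsnip)

prophet_boost(mode = "regression") %>%
    set_engine("prophet_xgboost", nthread = 12) %>%
    translate()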

Could you provide guidance on:
1. Ensuring that nthread is fully activated for XGBoost within the prophet_xgboost engine, or
2. Improving parallelization of the select_best step to speed up the computation? (The parallel_start() setup I'm referring to is sketched after this list.)
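
For reference, my understanding is that parallel_start(12, .method = "parallel") roughly corresponds to registering a foreach backend by hand, as in the sketch below (the exact internals of modeltime may differ):

# Rough equivalent of modeltime::parallel_start(12, .method = "parallel"):
# a PSOCK cluster registered as the foreach backend that tune picks up.
library(parallel)
library(doParallel)

cl <- makePSOCKcluster(12)   # spin up 12 worker R processes
registerDoParallel(cl)       # register them as the foreach backend

# ... run the tuning code here ...

stopCluster(cl)              # counterpart of modeltime::parallel_stop()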

Below is a minimal reproducible example showcasing the issue:

# Load necessary libraries
library(modeltime)      # For time-series forecasting models
library(tidymodels)     # For the tidymodels framework
library(workflowsets)   # For creating multiple workflows
library(tidyverse)      # For data manipulation
library(lubridate)      # For year() and quarter() (attached by recent tidyverse; loaded explicitly to be safe)
library(timetk)         # For time-series toolkits
library(prophet)        # For Prophet time-series forecasting (provides generated_holidays)

# Filter US holidays from prophet's pre-generated holiday dataset
us_holidays <- generated_holidays[generated_holidays$country == "US", ]

# If nthread = 12 is removed and replaced with parallel_start(), it leads to issue 2 described above.
# parallel_start(12, .method = "parallel")

# Load and preprocess data
# Filter specific time-series data
m750 <- m4_monthly %>% filter(id == "M750")

# Ensure the date column is in Date format
m750 <- m750 %>% mutate(date = as.Date(date))

# Add annual statistics: mean and variance
m750 <- m750 %>%
    group_by(year = year(date)) %>%
    mutate(
        annual_mean = mean(value, na.rm = TRUE),
        annual_variance = var(value, na.rm = TRUE)
    ) %>%
    ungroup()

# Add quarterly statistics: mean
m750 <- m750 %>%
    group_by(year = year(date), quarter = quarter(date)) %>%
    mutate(quarter_mean = mean(value, na.rm = TRUE)) %>%
    ungroup()

# Split data into training and testing sets (90% training data)
splits <- initial_time_split(m750, prop = 0.9)

# Define the preprocessing recipe
recipe_spec <- recipe(value ~ ., data = training(splits))

# Prepare and inspect the preprocessed data
recipe_spec %>% prep() %>% juice()

# Define the Prophet + XGBoost model with tunable parameters
prophet_boost_tune <- prophet_boost(
    mode = "regression"
) %>%
    set_engine("prophet_xgboost", holidays = us_holidays, nthread = 12) %>%
    set_args(
        changepoint_range = 0.85,
        trees = 100000,           # large upper bound; early stopping below ends boosting sooner
        tree_depth = tune(),
        learn_rate = 0.01,
        stop_iter = 10,           # stop if no improvement for 10 rounds
        min_n = tune(),
        season = tune(),
        changepoint_num = tune(),
        prior_scale_changepoints = tune(),
        prior_scale_seasonality = tune(),
        prior_scale_holidays = tune()
    )

# Set Bayesian optimization control parameters
bayes_control <- control_bayes(
    verbose = TRUE,
    uncertain = 10,
    no_improve = 10,
    parallel_over = "everything", # Allow parallel processing
    save_pred = TRUE
)

# Define the workflow by combining the recipe and model
workflow_prophet_boost <- workflow() %>%
    add_model(prophet_boost_tune) %>%
    add_recipe(recipe_spec)

# Set seed for reproducibility
set.seed(200)

# Define cross-validation folds for time-series
cv_folds <- time_series_cv(
    data = m750,
    date_var = date,
    initial = "20 years",
    assess = "6 months"
)

# Perform Bayesian tuning
tune_results <- tune_bayes(
    workflow_prophet_boost,
    resamples = cv_folds,
    initial = 10,      # Number of initial points in the Bayesian search
    iter = 10,         # Number of iterations for optimization
    control = bayes_control,
    metrics = metric_set(rmse) # Root Mean Square Error as the evaluation metric
)

# Select the best result based on RMSE
best_result <- select_best(tune_results, metric = "rmse")
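
For reference, my understanding is that select_best() only ranks the metrics already collected during tuning, so the manual equivalent below (a sketch) should return the same configuration:

# Manual equivalent of select_best() (a sketch, assuming select_best() just
# ranks the metrics stored in tune_results):
best_manual <- tune_results %>%
    collect_metrics() %>%
    filter(.metric == "rmse") %>%
    slice_min(mean, n = 1, with_ties = FALSE)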

Thank you for your time and incredible contributions to the R and time-series forecasting communities. Your tools have been a game-changer in streamlining complex workflows and making advanced modeling accessible. I deeply appreciate your hard work and dedication in maintaining and improving these packages.

Best regards,
Yi Zhou
