Simulator

Simulator is a framework for training and evaluating recommendation algorithms on real or synthetic data. The framework is based on the pyspark library for working with big data. As part of the simulation process, the framework includes data generators, response functions and other tools that allow flexible use of the simulator.

import numpy as np
import pandas as pd

import pyspark.sql.types as st
from pyspark.ml import PipelineModel

from sim4rec.modules import RealDataGenerator, Simulator, EvaluateMetrics
from sim4rec.response import NoiseResponse, BernoulliResponse
from sim4rec.recommenders.ucb import UCB
from sim4rec.utils import pandas_to_spark

LOG_SCHEMA = st.StructType([
    st.StructField('user_idx', st.LongType(), True),
    st.StructField('item_idx', st.LongType(), True),
    st.StructField('relevance', st.DoubleType(), False),
    st.StructField('response', st.IntegerType(), False)
])

users_df = pd.DataFrame(
    data=np.random.normal(0, 1, size=(100, 15)),
    columns=[f'user_attr_{i}' for i in range(15)]
)
items_df = pd.DataFrame(
    data=np.random.normal(1, 1, size=(30, 10)),
    columns=[f'item_attr_{i}' for i in range(10)]
)
history_df = pandas_to_spark(pd.DataFrame({
    'user_idx' : [1, 10, 10, 50],
    'item_idx' : [4, 25, 26, 25],
    'relevance' : [1.0, 0.0, 1.0, 1.0],
    'response' : [1, 0, 1, 1]
}), schema=LOG_SCHEMA)

users_df['user_idx'] = np.arange(len(users_df))
items_df['item_idx'] = np.arange(len(items_df))

users_df = pandas_to_spark(users_df)
items_df = pandas_to_spark(items_df)

user_gen = RealDataGenerator(label='users_real')
item_gen = RealDataGenerator(label='items_real')
user_gen.fit(users_df)
item_gen.fit(items_df)
_ = user_gen.generate(100)
_ = item_gen.generate(30)

sim = Simulator(
    user_gen=user_gen,
    item_gen=item_gen,
    data_dir='test_simulator',
    user_key_col='user_idx',
    item_key_col='item_idx',
    log_df=history_df
)

noise_resp = NoiseResponse(mu=0.5, sigma=0.2, outputCol='__noise')
br = BernoulliResponse(inputCol='__noise', outputCol='response')
pipeline = PipelineModel(stages=[noise_resp, br])

model = UCB()
model.fit(log=history_df)

evaluator = EvaluateMetrics(
    userKeyCol='user_idx',
    itemKeyCol='item_idx',
    predictionCol='relevance',
    labelCol='response',
    mllib_metrics=['areaUnderROC']
)

metrics = []
for i in range(10):
    users = sim.sample_users(0.1).cache()

    recs = model.predict(
        log=sim.log, k=5, users=users, items=items_df, filter_seen_items=True
    ).cache()

    true_resp = (
        sim.sample_responses(
            recs_df=recs,
            user_features=users,
            item_features=items_df,
            action_models=pipeline,
        )
        .select("user_idx", "item_idx", "relevance", "response")
        .cache()
    )

    sim.update_log(true_resp, iteration=i)

    metrics.append(evaluator(true_resp))

    model.fit(sim.log.drop("relevance").withColumnRenamed("response", "relevance"))

    users.unpersist()
    recs.unpersist()
    true_resp.unpersist()

Examples

You can find useful examples in the 'notebooks' folder, which demonstrate how to use synthetic data generators, composite generators, evaluate the results of the generators, iteratively refit the recommendation algorithm, use response functions and more.

Experiments with different datasets and a tutorial on writing custom response functions can be found in the 'experiments' folder.

Case studies

Case studies prepared for the demo track are available in the 'demo' directory.

Synthetic data generation
Long-term RS performance evaluation

Build from sources

poetry build
pip install ./dist/sim4rec-0.0.1-py3-none-any.whl

Compile documentation

cd docs
make clean && make html

Tests

The pytest Python library is used for testing, and to run tests for all modules you can run the following command from the repository root directory

pytest

Licence

Sim4Rec is distributed under the [Apache Licence Version 2.0] (https://github.com/sb-ai-lab/Sim4Rec/blob/main/LICENSE), however the SDV package imported by Sim4Rec for synthetic data generation is distributed under the Business Source Licence (BSL) 1.1.

Synthetic tabular data generation is not a purpose of the Sit4Rec framework. Sim4Rec provides an API and wrappers to run simulations with synthetic data, but the method of synthetic data generation is determined by the user. The SDV package is imported for illustration purposes and can be replaced by another synthetic data generation solution.

Thus, synthetic data generation functionality and quality evaluation is provided by the SDV library, namely SDVDataGenerator from generator.py and evaluate_synthetic from evaluation.py should only be used for non-production purposes according to the SDV licence.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.github/workflows		.github/workflows
demo		demo
docs		docs
experiments		experiments
notebooks		notebooks
sim4rec		sim4rec
tests		tests
.pylintrc		.pylintrc
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Simulator

Table of contents

Installation

Quickstart

Examples

Case studies

Build from sources

Compile documentation

Tests

Licence

About

Releases

Contributors 7

Languages

License

sb-ai-lab/Sim4Rec

Folders and files

Latest commit

History

Repository files navigation

Simulator

Table of contents

Installation

Quickstart

Examples

Case studies

Build from sources

Compile documentation

Tests

Licence

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Contributors 7

Languages