We would like to add support for Catboost models. Users of Treelite should be able to load Catboost models and run prediction.
Overview
Catboost has a custom target encoding method to encode categorical data, and produces special kinds of decision trees called oblivious trees. See the Catboost paper for more details.
In general, a target encoder is a function that takes a categorical input and produces a numeric output. The function is an "encoding" in the sense that the categorical input is encoded as a real number. The advantage of target encoding is that every decision tree can rely exclusively on the simple test of the form [feature] < [threshold].
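To make the idea concrete, here is a minimal sketch of a plain mean-based target encoder followed by a threshold test. This is not Catboost's custom flavor of target encoding; the function names `fit_mean_target_encoder` and `goes_left` are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def fit_mean_target_encoder(categories, targets):
    """Map each category to the mean target value observed for it.

    Assumption: plain mean encoding, used here only to illustrate the
    concept. Catboost's actual encoding scheme is more involved.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


def goes_left(category, encoder, threshold):
    """Apply the usual numerical test [feature] < [threshold] to the
    encoded value of a categorical input."""
    return encoder[category] < threshold


encoder = fit_mean_target_encoder(
    ["red", "blue", "red", "green"],
    [1.0, 0.0, 0.0, 1.0],
)
```

Once the categorical value has been replaced by a real number, the tree needs no special categorical split type: `goes_left("blue", encoder, 0.25)` is an ordinary numerical comparison.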
The challenge is that Catboost uses a custom flavor of target encoding. The goal, therefore, is to abstract away as much complexity as possible.
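Proposed Design
The Treelite model spec (treelite/include/treelite/tree.h, lines 792 to 796 in 4cc4f7e) should be updated to include an optional field to store the target encoding function. The target encoding component should be a lookup table in which each possible categorical value is mapped to a vector of length 1 or greater.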
Catboost uses CityHash to convert string categories into int64, so the target encoding field must allow both int64 and float32 types for the categorical input.
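A rough sketch of what such a lookup table could look like, written in Python for brevity. The class `TargetEncodingTable` and the helper `hash_category` are hypothetical names, not part of Treelite. The hash below is a stand-in built on `hashlib.blake2b`; it is not CityHash and will not reproduce Catboost's hash values.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


def hash_category(text: str) -> int:
    """Stand-in for CityHash: map a string category to a signed 64-bit int.

    Assumption: any deterministic 64-bit hash is enough for this sketch.
    """
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little", signed=True)


@dataclass
class TargetEncodingTable:
    """Lookup table from a categorical value to a vector of length >= 1.

    Integer keys (e.g. hashed strings) and float keys are kept in separate
    dicts, since Python treats 1 and 1.0 as the same dict key. Python floats
    are 64-bit; a real implementation would store float32 keys.
    """
    int_table: Dict[int, List[float]] = field(default_factory=dict)
    float_table: Dict[float, List[float]] = field(default_factory=dict)

    def add(self, key, values):
        if len(values) < 1:
            raise ValueError("each category must map to at least one value")
        table = self.int_table if isinstance(key, int) else self.float_table
        table[key] = list(values)

    def lookup(self, key):
        table = self.int_table if isinstance(key, int) else self.float_table
        return table[key]


table = TargetEncodingTable()
table.add(hash_category("cat"), [0.25, 0.75])  # string category, hashed to int64
table.add(3.0, [0.1])                          # numeric category given as a float
```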
Scope
Catboost allows users to save models in two formats: FlatBuffer and JSON. For the initial version, we'll only support the JSON format.
Initially, we'll convert oblivious trees into regular decision trees. We may add an ObliviousTree class to the Treelite model spec in the future.
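The conversion could be sketched as follows. An oblivious tree of depth d applies the same (feature, threshold) test to every node of a given level, so it can be described by d splits and 2^d leaf values. The `Node` class and the leaf ordering (leaves listed left to right, with level 0 as the most significant decision) are assumptions made for this sketch; they are not necessarily the layout of the Catboost JSON format or of the Treelite model spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal node: feature and threshold are set, leaf_value is None.
    # Leaf node: only leaf_value is set.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # taken when x[feature] < threshold
    right: Optional["Node"] = None   # taken otherwise
    leaf_value: Optional[float] = None


def expand_oblivious_tree(splits, leaf_values, level=0, offset=0):
    """Expand an oblivious tree into a regular binary decision tree.

    splits: list of (feature, threshold) pairs, one per level.
    leaf_values: 2 ** len(splits) values, ordered left to right.
    """
    depth = len(splits)
    if len(leaf_values) != 2 ** depth:
        raise ValueError("expected 2 ** depth leaf values")
    if level == depth:
        return Node(leaf_value=leaf_values[offset])
    feature, threshold = splits[level]
    # Number of leaves under each child of a node at this level
    span = 2 ** (depth - level - 1)
    return Node(
        feature=feature,
        threshold=threshold,
        left=expand_oblivious_tree(splits, leaf_values, level + 1, offset),
        right=expand_oblivious_tree(splits, leaf_values, level + 1, offset + span),
    )


def predict(node, x):
    """Walk the expanded tree using the test [feature] < [threshold]."""
    while node.leaf_value is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.leaf_value


root = expand_oblivious_tree([(0, 0.5), (1, 2.0)], [10.0, 20.0, 30.0, 40.0])
```

The expanded tree duplicates each level's split across all nodes of that level, so it has 2^d - 1 internal nodes instead of d splits. That overhead is the cost of not having a dedicated ObliviousTree class.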
In addition, we'll only support the simple_ctr configuration, where the target encoding function takes a single categorical feature at a time. We won't support the combinations_ctr configuration, where multiple categorical features are fed into the target encoder.
TODOs
Add the target encoder to the Treelite model spec.
Implement the deserializer for the Catboost JSON model. The deserializer will be placed in src/frontend.
Update GTIL to support inference with Catboost models.
Update the C codegen to support text inputs and target encoding. I expect this step to be challenging, given the complexity of the C codegen.
hcho3 changed the title from "Catboost support" to "Initial support for Catboost" on Apr 19, 2022