Refactor AssessPy for consistency, stability #24
Conversation
Default 0.05. Float value indicating the confidence
interval to return. 0.05 will return the 95% confidence interval.
I updated a lot of the language just to reflect Python (instead of R) types.
assesspy/data/ccao_sample.parquet
Outdated
All the data files contain exactly the same content, they're just renamed and reorganized (I removed unneeded columns).
I basically rewrote all the tests and only kept the known, good values for each metric to test against. Using pytest fixtures lets us simplify the test code a lot while running more tests overall.
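To illustrate the pattern, a parametrized pytest test can cover every metric with one function body. This is only a sketch: the metric names, placeholder values, and `compute_metric` stand-in below are hypothetical, not the package's real test suite.

```python
import pytest

# Placeholder known-good values; the real suite checks each metric
# against published reference results instead
KNOWN_GOOD = {"cod": 17.81, "prd": 1.048}


def compute_metric(name: str) -> float:
    # Stand-in for running the metric on the sample data
    return {"cod": 17.80, "prd": 1.05}[name]


@pytest.mark.parametrize("metric", sorted(KNOWN_GOOD))
def test_metric_value_is_correct(metric):
    # One parametrized test runs once per metric name
    assert pytest.approx(compute_metric(metric), rel=0.02) == KNOWN_GOOD[metric]
```

Fixtures work the same way for shared setup (e.g. loading the sample data once), which is what keeps the rewritten test code small.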
from .outliers import is_outlier
from .sales_chasing import is_sales_chased
I decided to hide some of the exported functions here (iqr_outlier, quantile_outlier) to keep things in line with how is_sales_chased works. Both functions are accessible via is_outlier anyway.
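The dispatch described above might look roughly like this. This is a sketch under assumptions: the argument name `method`, its accepted values, and the helper implementations are illustrative, not assesspy's exact code.

```python
import pandas as pd


def _iqr_outlier(x: pd.Series, mult: float = 1.5) -> pd.Series:
    # Flag values outside [Q1 - mult*IQR, Q3 + mult*IQR]
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return (x < q1 - mult * iqr) | (x > q3 + mult * iqr)


def _quantile_outlier(x: pd.Series, probs=(0.05, 0.95)) -> pd.Series:
    # Flag values outside the given quantile bounds
    lo, hi = x.quantile(probs[0]), x.quantile(probs[1])
    return (x < lo) | (x > hi)


def is_outlier(x, method: str = "iqr") -> pd.Series:
    # Private helpers are selected via the `method` argument instead of
    # being exported individually
    x = pd.Series(x, dtype=float)
    methods = {"iqr": _iqr_outlier, "quantile": _quantile_outlier}
    return methods[method](x)
```

Keeping the helpers private means the public surface is just one function per task, mirroring is_sales_chased.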
def prd_ci(assessed, sale_price, nboot=100, alpha=0.05):
    return boot_ci(
        prd, assessed=assessed, sale_price=sale_price, nboot=nboot, alpha=alpha

def prb_ci(
The PRB CI was accessible before, but only via a dictionary value returned by prb(). This wasn't consistent with the other functions and also meant there was no prb_ci() function. I've added one here for API consistency, even though it largely duplicates code from PRB.
[Suggestion, non-blocking] If we wanted to reduce duplication between this function and metrics.prb(), we could factor out a shared helper function that validates the inputs and returns the OLS model object for each function to interpret as needed. Not a high priority in my opinion, though, since we're only duplicating logic in two places.
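A shared helper along those lines might look like this. This is a sketch under assumptions: the helper name `_prb_model` is hypothetical, and numpy's `lstsq` stands in for whatever OLS machinery the package actually uses.

```python
import numpy as np


def _prb_model(estimate, sale_price):
    # Hypothetical shared helper: validate inputs once, fit the PRB
    # regression, and return the coefficients for callers to interpret
    estimate = np.asarray(estimate, dtype=float)
    sale_price = np.asarray(sale_price, dtype=float)
    if estimate.shape != sale_price.shape or (sale_price <= 0).any():
        raise ValueError("Inputs must be equal-length, positive numeric vectors")
    ratio = estimate / sale_price
    med = np.median(ratio)
    # PRB regresses the pct. deviation of ratios from their median on
    # log2 of a proxy for property value
    lhs = (ratio - med) / med
    rhs = np.log2((estimate / med + sale_price) / 2)
    X = np.column_stack([np.ones_like(rhs), rhs])
    coef, *_ = np.linalg.lstsq(X, lhs, rcond=None)
    return coef  # [intercept, prb]


def prb(estimate, sale_price) -> float:
    # prb() and prb_ci() would both call _prb_model() and read what they need
    return float(_prb_model(estimate, sale_price)[1])
```

With this shape, prb_ci() would reuse the same validation and fit rather than duplicating them.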
from importlib.resources import as_file, files
This is apparently the hot new way to import package data. setuptools was complaining to me about pkg_resources being deprecated.
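For reference, the files()/as_file() pattern looks like this. A generic sketch: the function name is made up, and in assesspy the body would wrap pd.read_parquet() around the resolved path instead of reading bytes.

```python
from importlib.resources import as_file, files


def read_package_file(package: str, relpath: str) -> bytes:
    # files() returns a Traversable for the package's contents;
    # as_file() guarantees a real filesystem path, even when the
    # package is imported from a zip archive
    resource = files(package).joinpath(relpath)
    with as_file(resource) as path:
        return path.read_bytes()
```

Both functions are available from Python 3.9 onward, which matches the package's compatibility floor.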
    return pd.read_parquet(file)


def quintos_sample() -> pd.DataFrame:
I figured it would be nice to include the Quintos data as an export to use in tests and demos.
def cod(
    estimate: Union[list[int], list[float], pd.Series],
    sale_price: Union[list[int], list[float], pd.Series],
) -> float:
cod() now gets 2 input args instead of the single ratio arg it took before. FINALLY.
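A minimal sketch of the new call shape, with the ratio computed internally from the two inputs (the body is simplified; the real function also validates and aligns its inputs):

```python
import pandas as pd


def cod(estimate, sale_price) -> float:
    # Coefficient of dispersion: average absolute deviation of the
    # assessment ratios from their median, as a percentage of the median
    ratio = pd.Series(estimate, dtype=float) / pd.Series(sale_price, dtype=float)
    med = ratio.median()
    return float(100 * (ratio - med).abs().mean() / med)
```

Callers no longer precompute a ratio column, so every metric can be called the same way: metric(estimate, sale_price).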
    estimate = (
        pd.Series(estimate, dtype=float)
        .rename("estimate")
        .reset_index(drop=True)
    )
    sale_price = (
        pd.Series(sale_price, dtype=float)
        .rename("sale_price")
        .reset_index(drop=True)
    )
    df = pd.concat([estimate, sale_price], axis=1)
    # Mergesort is required for stable sort results
    df.sort_values(by="sale_price", kind="mergesort", inplace=True)
    df.reset_index(drop=True, inplace=True)
The index manipulation here is annoyingly important. If you pass two Series with different indexes as inputs to this function, the result will contain NaN values (due to index joining) unless you reset the indexes first.
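The pitfall is easy to reproduce. A self-contained demo of pandas index alignment, using made-up values:

```python
import pandas as pd

# Two Series with non-overlapping indexes, as can happen after filtering
est = pd.Series([100.0, 200.0], index=[0, 1])
sale = pd.Series([110.0, 190.0], index=[5, 6])

# Without resetting, concat aligns on the index and fills gaps with NaN
misaligned = pd.concat([est, sale], axis=1)
assert misaligned.isna().any().any()

# Resetting both indexes restores positional pairing
aligned = pd.concat(
    [est.reset_index(drop=True), sale.reset_index(drop=True)], axis=1
)
assert not aligned.isna().any().any()
```

The misaligned frame has four rows (the union of both indexes), half of them NaN, which is why the function resets indexes up front.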
def cod(
    estimate: Union[list[int], list[float], pd.Series],
Using Union here instead of the newer list[int] | list[float] format for compatibility with Python 3.9.
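A minimal illustration of the compatible spelling (the alias name and function below are hypothetical):

```python
from typing import Union

import pandas as pd

# Runs on Python 3.9: PEP 604's `list[int] | list[float]` union syntax
# only evaluates at runtime on 3.10+, while typing.Union works on both
Numeric = Union[list[int], list[float], pd.Series]


def median_ratio(estimate: Numeric, sale_price: Numeric) -> float:
    # Tiny example using the alias in an annotation
    return float((pd.Series(estimate) / pd.Series(sale_price)).median())
```

Note that builtin generics like list[int] are themselves fine on 3.9; it is only the `|` operator between types that needs 3.10.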
import numpy as np
from scipy import stats
I don't know why we needed a scipy dependency to calculate the IQR when we can (presumably) do the same thing manually with pandas' quantile(). If there's a good reason we need this specific IQR function, we can add it back.
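The pandas-only replacement would be a one-liner. A sketch (both pandas' quantile() and scipy.stats.iqr default to linear interpolation, so the results should match for the common case):

```python
import pandas as pd


def iqr(x) -> float:
    # Interquartile range via pandas alone, no scipy dependency
    x = pd.Series(x, dtype=float)
    return float(x.quantile(0.75) - x.quantile(0.25))
```

If the scipy version is ever needed for a non-default interpolation mode, that would be the reason to bring the dependency back.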
    pct_actual = pct_in_range(ratio, bounds[0], bounds[1])

-   return pct_actual > pct_ideal
+   return bool(abs(pct_actual - pct_ideal) > gap)
I added a new requirement that the absolute difference in percentages be greater than some threshold, because this method was way too sensitive before.
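The effect of the threshold can be shown in isolation. A sketch with a hypothetical wrapper name and a made-up default for `gap`:

```python
def exceeds_chasing_threshold(
    pct_actual: float, pct_ideal: float, gap: float = 0.03
) -> bool:
    # Old check: any excess of pct_actual over pct_ideal flagged chasing,
    # however tiny. New check: the gap must clear a minimum threshold.
    return bool(abs(pct_actual - pct_ideal) > gap)
```

With a 3-point threshold, a 2-point difference no longer flags a sale sample as chased, while a 12-point difference still does.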
def test_metric_value_is_correct(self, metric, metric_val):
    expected = {
        "cod": 17.81456901196891,
        "prd": 1.0484192615223522,
        "prb": 0.0009470721642262903,
        "mki": 0.794,
        "ki": -0.06,
    }
    assert pt.approx(metric_val, rel=0.02) == expected[metric]
IMO it's worth updating these tests to use known good values rather than just values resulting from our sample data. I opened #25 for this purpose.
Awesome work here 👏🏻 Code is indeed way cleaner, I can see why the rewrite was worth it!
    return gini_assessed, gini_sale_price


def mki(
[Thought, non-blocking] Unrelated to this PR, but these changes are reminding me that we should make a note to add MKI/KI/Gini to our sales ratio glossary considering we're using them more and more.
assesspy/utils.py
Outdated
-raise Exception("All input vectors must be numeric.")
 if check.isnull().values.any():
-    out.append("\nInput vectors contain null values.")
+    out_msg.append("\nAll input values must be numeric.")
[Nitpick, non-blocking] Instead of prepending each error string with a newline, could we just add newlines as separators during the join()?
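That is, let the separator do the work (messages below are just the ones quoted in the diff):

```python
out_msg = [
    "All input values must be numeric.",
    "Input vectors contain null values.",
]

# Separator-based join: no "\n" prefix needed on each message, and
# no leading newline on the combined string
combined = "\n".join(out_msg)
```

This also avoids the stray leading newline the prepend approach produces on the first message.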
Holdover from the old code. Fixed in 085dcac.
assesspy/utils.py
Outdated
-raise Exception("".join(map(str, out)))
+out_msg_set = set(out_msg)
+if len(out_msg_set) > 1:
+    raise Exception("".join(map(str, out_msg_set)))
[Question, non-blocking] What motivates the str() cast here? It kind of seems like all of the elements of out_msg_set are strings already, right?
Likewise a holdover from the old code. Removed!
This PR is a full refactor of the assesspy package designed to make it clearer, more consistent, easier to maintain, and more compatible with other technologies. It was motivated by the work in ccao-data/data-architecture#521, which revealed that the current version of AssessPy is incompatible with Spark.

Warning
This is a breaking refactor. It significantly changes the API of some functions and deprecates others.

Breaking changes

- All metric functions now take the same two inputs (estimate, sale_price) and return the same output (a single float). Previously, some metrics had one input (COD) or different outputs (PRD)
- The individual methods behind detect_chasing and is_outlier are no longer exported to the user. Instead, they can be selected via an argument in their respective functions
- detect_chasing is renamed to is_sales_chased for consistency with is_outlier

Other changes

- Use as little numpy as possible for compatibility with Spark/Athena

Considering the size of this PR, I recommend ignoring most of the removed code and considering only the new stuff. Treat it like a greenfield package.
Closes #15.