
Refactor AssessPy for consistency, stability #24

Merged (48 commits) Nov 25, 2024
Conversation

@dfsnow (Member) commented Nov 23, 2024

This PR is a full refactor of the assesspy package designed to make it clearer, more consistent, easier to maintain, and more compatible with other technologies. It was motivated by the work in ccao-data/data-architecture#521, which revealed that the current version of AssessPy is incompatible with Spark.

Warning

This is a breaking refactor. It significantly changes the API of some functions and deprecates others.

Breaking changes

  • All metrics (COD, PRD, etc.) now take the same inputs (estimate, sale_price) and return the same output (a single float). Previously, some metrics took a single ratio input (COD) or returned a different output type (PRD).
  • The sub-functions of detect_chasing and is_outlier are no longer exported to the user. Instead, they can be selected via an argument in their respective functions.
  • detect_chasing is renamed to is_sales_chased for consistency with is_outlier.
  • Sample datasets are renamed to reflect their respective sources.
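To illustrate the uniform contract, here is a minimal, self-contained sketch of a metric with the new (estimate, sale_price) -> float shape. This is not assesspy's actual implementation (the real package accepts Series and validates its inputs); it only shows the calling convention.

```python
from statistics import mean, median

def cod(estimate: list, sale_price: list) -> float:
    """Coefficient of dispersion: mean absolute deviation of
    sales ratios from their median, as a percentage of the median."""
    ratios = [e / s for e, s in zip(estimate, sale_price)]
    med = median(ratios)
    return 100.0 * mean(abs(r - med) for r in ratios) / med

# Every metric shares this signature, so callers no longer need to
# precompute a ratio column for some metrics but not others.
print(round(cod([105.0, 95.0, 100.0], [100.0, 100.0, 100.0]), 2))  # 3.33
```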

Other changes

  • Removed as much numpy as possible for compatibility with Spark/Athena
  • Added static types to almost everything, which should make this package easier to maintain in the long run
  • Replaced all unit tests with a fixtures matrix and parameters
  • Updated a lot of the documentation structure
  • Updated the example ratio study notebook

Considering the size of this PR, I recommend ignoring most of the removed code and reviewing only the new code. Treat it like a greenfield package.

Closes #15.

Comment on lines +34 to 35
Default 0.05. Float value indicating the confidence
interval to return. 0.05 will return the 95% confidence interval.
@dfsnow (Member Author):

I updated a lot of the language to reflect Python (instead of R) types.

@dfsnow (Member Author):

All the data files contain exactly the same content, they're just renamed and reorganized (I removed unneeded columns).

@dfsnow (Member Author):

I basically rewrote all the tests and only kept the known, good values for each metric to test against. Using pytest fixtures lets us simplify the test code a lot while running more tests overall.
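The pattern can be sketched in plain Python (with hypothetical metric functions, not assesspy's): a single table of inputs and known-good values drives one generic check, which pytest then expands via parametrized fixtures.

```python
from statistics import median

# Hypothetical metrics and known-good values for one shared sample;
# in the real test suite these would be pytest fixtures.
SAMPLE_EST = [105.0, 95.0, 100.0]
SAMPLE_SP = [100.0, 100.0, 100.0]

METRICS = {
    "median_ratio": lambda est, sp: median(e / s for e, s in zip(est, sp)),
    "mean_ratio": lambda est, sp: sum(e / s for e, s in zip(est, sp)) / len(est),
}
EXPECTED = {"median_ratio": 1.0, "mean_ratio": 1.0}

def check_metric(name: str) -> None:
    # One generic assertion replaces a hand-written test per metric
    got = METRICS[name](SAMPLE_EST, SAMPLE_SP)
    assert abs(got - EXPECTED[name]) < 1e-9, name

for name in METRICS:
    check_metric(name)
```

With pytest, the metric table becomes a parametrized fixture and check_metric a single test function, so adding a metric means adding one table row rather than a new test.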

Comment on lines +19 to +20
from .outliers import is_outlier
from .sales_chasing import is_sales_chased
@dfsnow (Member Author):

I decided to hide some of the exported functions here (iqr_outlier, quantile_outlier) to keep things in line with how is_sales_chased works. Both functions are accessible via is_outlier anyway.
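The dispatch pattern this describes might look like the following sketch. The helper bodies and cutoffs here are illustrative, not assesspy's actual internals: private helpers stay unexported, and a method argument on the public function selects between them.

```python
import statistics

def _iqr_outlier(x: list, mult: float = 1.5) -> list:
    # Flag values beyond mult * IQR outside the quartiles
    q1, _, q3 = statistics.quantiles(x, n=4)
    spread = q3 - q1
    lo, hi = q1 - mult * spread, q3 + mult * spread
    return [v < lo or v > hi for v in x]

def _quantile_outlier(x: list, lo_q: float = 0.05, hi_q: float = 0.95) -> list:
    # Flag values outside the [lo_q, hi_q] empirical quantiles
    s = sorted(x)
    lo, hi = s[int(lo_q * (len(s) - 1))], s[int(hi_q * (len(s) - 1))]
    return [v < lo or v > hi for v in x]

def is_outlier(x: list, method: str = "iqr") -> list:
    # Public entry point: the method argument picks the private helper
    methods = {"iqr": _iqr_outlier, "quantile": _quantile_outlier}
    if method not in methods:
        raise ValueError(f"method must be one of {sorted(methods)}")
    return methods[method](x)

data = [1.0, 1.02, 0.98, 1.01, 0.99, 1.0, 1.03, 0.97, 1.0, 50.0]
flags = is_outlier(data, method="iqr")  # only the 50.0 is flagged
```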

def prd_ci(assessed, sale_price, nboot=100, alpha=0.05):
    return boot_ci(
        prd, assessed=assessed, sale_price=sale_price, nboot=nboot, alpha=alpha
    )


def prb_ci(
@dfsnow (Member Author) commented Nov 23, 2024:

The PRB CI was accessible before, but only via a dictionary value returned by prb(). This wasn't consistent with the other functions and also meant there was no prb_ci() function. I've added one here for API consistency, even though it largely duplicates code in prb().

Review comment:

[Suggestion, non-blocking] If we wanted to reduce duplication between this function and metrics.prb(), we could factor out a shared helper function that validates the inputs and returns the OLS model object for each function to interpret as needed. Not a high priority in my opinion though, since we're only duplicating logic in two places.
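A sketch of what that factoring might look like, under stated assumptions: the helper name _validate_and_fit and the simple one-variable OLS below are illustrative only (assesspy's real PRB model is fit with statsmodels on transformed variables).

```python
def _validate_and_fit(estimate, sale_price):
    # Shared step: validate inputs, then fit a simple OLS model
    # (y = a + b * x) of ratios on sale price via the closed form.
    if len(estimate) != len(sale_price):
        raise ValueError("estimate and sale_price must be the same length")
    if any(s <= 0 for s in sale_price):
        raise ValueError("sale prices must be positive")
    x = list(sale_price)
    y = [e / s for e, s in zip(estimate, sale_price)]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b  # intercept, slope

def prb_like(estimate, sale_price) -> float:
    # Point estimate: interpret the fitted slope directly
    _, slope = _validate_and_fit(estimate, sale_price)
    return slope

# A prb_ci_like() would call the same helper on bootstrap resamples,
# so neither function duplicates validation or model-fitting code.
```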

Comment on lines +1 to +2
from importlib.resources import as_file, files

@dfsnow (Member Author):

This is apparently the hot new way to import package data. setuptools was complaining to me about pkg_resources being deprecated.
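For reference, the importlib.resources pattern looks roughly like this. The loader function is illustrative, and it reads a text file from a stdlib package so the sketch is self-contained:

```python
from importlib.resources import as_file, files

def load_text_resource(package: str, name: str) -> str:
    # files() returns a Traversable for the package's data;
    # as_file() guarantees a real filesystem path, even when the
    # package is imported from a zip archive.
    resource = files(package).joinpath(name)
    with as_file(resource) as path:
        return path.read_text()

# In a package like assesspy, the same pattern would hand the path
# to pd.read_parquet() instead of read_text().
text = load_text_resource("email", "__init__.py")
```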

    return pd.read_parquet(file)


def quintos_sample() -> pd.DataFrame:
@dfsnow (Member Author):

I figured it would be nice to include the Quintos data as an export to use in tests and demos.

Comment on lines +10 to +13
def cod(
    estimate: Union[list[int], list[float], pd.Series],
    sale_price: Union[list[int], list[float], pd.Series],
) -> float:
@dfsnow (Member Author):

cod() now gets 2 input args instead of the single ratio arg it took before. FINALLY.

Comment on lines +189 to +202
estimate = (
    pd.Series(estimate, dtype=float)
    .rename("estimate")
    .reset_index(drop=True)
)
sale_price = (
    pd.Series(sale_price, dtype=float)
    .rename("sale_price")
    .reset_index(drop=True)
)
df = pd.concat([estimate, sale_price], axis=1)
# Mergesort is required for stable sort results
df.sort_values(by="sale_price", kind="mergesort", inplace=True)
df.reset_index(drop=True, inplace=True)
@dfsnow (Member Author):

The index manipulation here is annoyingly important. If you pass two Series with different indexes to this function, the result will be NaN (due to index alignment) unless you reset the indexes first.
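A quick standalone demonstration of the pitfall (plain pandas, not assesspy code): two Series with disjoint indexes concatenate into NaN-padded rows unless the indexes are reset.

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=[0, 1])
b = pd.Series([3.0, 4.0], index=[5, 6])  # e.g. came from a filtered frame

# concat aligns on index labels: 4 rows, each half-NaN
misaligned = pd.concat([a, b], axis=1)

# resetting both indexes first pairs values by position: 2 rows, no NaN
aligned = pd.concat(
    [a.reset_index(drop=True), b.reset_index(drop=True)], axis=1
)

print(misaligned.isna().any().any())  # True
print(aligned.isna().any().any())     # False
```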



def cod(
    estimate: Union[list[int], list[float], pd.Series],
@dfsnow (Member Author):

Using Union here instead of the newer list[int] | list[float] format for compatibility with Python 3.9.


import numpy as np
from scipy import stats
@dfsnow (Member Author):

I don't know why we needed a scipy dependency to calculate the IQR when we can (presumably) do the same thing with pandas' quantile(). If there's a good reason we need this specific IQR function, we can add it back.
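For reference, a sketch of the pandas-only equivalent (both pandas' quantile() and scipy.stats.iqr default to linear interpolation, so the results should match):

```python
import pandas as pd

def iqr(x: pd.Series) -> float:
    # Interquartile range via pandas' quantile(), no scipy needed
    return float(x.quantile(0.75) - x.quantile(0.25))

print(iqr(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])))  # 2.0
```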

  pct_actual = pct_in_range(ratio, bounds[0], bounds[1])

- return pct_actual > pct_ideal
+ return bool(abs(pct_actual - pct_ideal) > gap)
@dfsnow (Member Author):

I added a new requirement that the absolute difference in percentages be greater than some threshold, because this method was way too sensitive before.

Comment on lines +17 to +25
def test_metric_value_is_correct(self, metric, metric_val):
    expected = {
        "cod": 17.81456901196891,
        "prd": 1.0484192615223522,
        "prb": 0.0009470721642262903,
        "mki": 0.794,
        "ki": -0.06,
    }
    assert pt.approx(metric_val, rel=0.02) == expected[metric]
@dfsnow (Member Author):

IMO it's worth updating these tests to use known good values rather than just values resulting from our sample data. I opened #25 for this purpose.

@jeancochrane left a comment:

Awesome work here 👏🏻 Code is indeed way cleaner, I can see why the rewrite was worth it!

Resolved review threads on CITATION.cff, assesspy/ci.py, and assesspy/metrics.py.
    return gini_assessed, gini_sale_price


def mki(

Review comment:

[Thought, non-blocking] Unrelated to this PR, but these changes are reminding me that we should make a note to add MKI/KI/Gini to our sales ratio glossary considering we're using them more and more.

Resolved review threads on assesspy/metrics.py and assesspy/tests/conftest.py.
raise Exception("All input vectors must be numeric.")
if check.isnull().values.any():
    out.append("\nInput vectors contain null values.")
out_msg.append("\nAll input values must be numeric.")

Review comment:

[Nitpick, non-blocking] Instead of prepending each error string with a newline, could we just add newlines as separators during the join()?
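In other words, something like this (a generic sketch, not the actual assesspy validation code):

```python
# Collect plain messages; let join() supply the separators.
errors = []
errors.append("Input vectors contain null values.")
errors.append("All input values must be numeric.")

message = "\n".join(errors)
print(message)
```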

@dfsnow (Member Author):

Holdover from the old code. Fixed in 085dcac.

raise Exception("".join(map(str, out)))
out_msg_set = set(out_msg)
if len(out_msg_set) > 1:
    raise Exception("".join(map(str, out_msg_set)))

Review comment:

[Question, non-blocking] What motivates the str() cast here? It seems like all of the elements of out_msg_set are strings already, right?

@dfsnow (Member Author):

Likewise a holdover from the old code. Removed!

@dfsnow merged commit d611de4 into main on Nov 25, 2024 (14 checks passed).
@dfsnow deleted the dfsnow/full-refactor branch on November 25, 2024.
Successfully merging this pull request may close these issues:

  • Reconsider whether detect_chasing helper functions should be documented