Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

topology featurizers debug #460

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

XiaoqZhang
Copy link

@XiaoqZhang XiaoqZhang commented Aug 14, 2024

Hi, I found a bug in the topology featurizers. The original way of splitting atom_types will split 'Cu' to ['C', 'u'], for example. Besides, the tuple type is error-prone to the last , as highlighted by the red circle in the figure attached.

Thus, I made several changes accordingly:

Thermal conductivity

Summary by Sourcery

Fix the atom type splitting issue in topology featurizers by changing the data type from tuple to list and introduce a new function filter_element_for_ph for persistence homology. Uncomment previously commented lines in _tda_helpers.py to restore functionality.

Bug Fixes:

  • Fix the incorrect splitting of atom types in topology featurizers by changing the data type from tuple to list.

Enhancements:

  • Introduce a new function filter_element_for_ph to handle element filtering specifically for persistence homology, ensuring compatibility with existing functionality.

Copy link

sourcery-ai bot commented Aug 14, 2024

Reviewer's Guide by Sourcery

This pull request addresses a bug in the topology featurizers of the mofdscribe library. The main changes involve modifying how atom types are handled, introducing a new function for filtering elements in persistence homology calculations, and adjusting data types to improve robustness. The changes span across multiple files in the project, primarily focusing on the topology-related modules.

File-Level Changes

Files Changes
src/mofdscribe/featurizers/utils/substructures.py Introduced a new function 'filter_element_for_ph' specifically for persistence homology calculations
src/mofdscribe/featurizers/topology/ph_image.py
src/mofdscribe/featurizers/topology/ph_hist.py
src/mofdscribe/featurizers/topology/ph_stats.py
src/mofdscribe/featurizers/topology/ph_vect.py
Changed the data type of 'atom_types' from tuple to list in multiple classes
src/mofdscribe/featurizers/topology/_tda_helpers.py Modified the 'get_persistent_images_for_structure' function to use the new 'filter_element_for_ph' function
src/mofdscribe/featurizers/topology/_tda_helpers.py Uncommented two lines in the 'make_supercell' function to include original coordinates and elements
src/mofdscribe/featurizers/topology/ph_image.py
src/mofdscribe/featurizers/topology/ph_hist.py
src/mofdscribe/featurizers/topology/ph_stats.py
src/mofdscribe/featurizers/topology/ph_vect.py
Updated the handling of atom types in various classes to accommodate the new list data type

Tips
  • Trigger a new Sourcery review by commenting @sourcery-ai review on the pull request.
  • Continue your discussion with Sourcery by replying directly to review comments.
  • You can change your review settings at any time by accessing your dashboard:
    • Enable or disable the Sourcery-generated pull request summary or reviewer's guide;
    • Change the review language;
  • You can always contact us if you have any questions or feedback.

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @XiaoqZhang - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

@@ -35,6 +35,39 @@ def filter_element(
return Structure.from_sites(keep_sites)
else: # input is molecule or IMolecule
return Molecule.from_sites(keep_sites)


def filter_element_for_ph(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider refactoring to consolidate logic into a single function.

The new code introduces unnecessary complexity by duplicating logic in the filter_element_for_ph function, which increases maintenance overhead. The use of elements_group as a tuple doesn't add functional value and makes the code harder to read. Additionally, the handling of elements is inconsistent, which could lead to bugs. Consider refactoring to consolidate the logic into a single function, using list comprehensions for clarity, and ensuring consistent handling of elements. This will improve readability and maintainability.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a suggestion?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly! To address the complexity issue, you could refactor the filter_element and filter_element_for_ph functions to share common logic. Here's a suggestion:

  1. Create a Helper Function: Extract the common logic into a helper function that both filter_element and filter_element_for_ph can call. This will reduce duplication and make the code easier to maintain.

  2. Use List Comprehensions: If possible, use list comprehensions to make the filtering logic more concise and readable.

  3. Consistent Handling of elements: Ensure that the handling of elements is consistent across both functions. This might involve standardizing the input format or processing steps.

Here's a rough example of how you might structure the refactoring:

def filter_elements_common(sites, elements, elements_group):
    return [site for site in sites if site.specie.symbol in elements or site.specie.symbol in elements_group]

def filter_element(structure, elements, elements_group=()):
    keep_sites = filter_elements_common(structure.sites, elements, elements_group)
    return Structure.from_sites(keep_sites) if isinstance(structure, Structure) else Molecule.from_sites(keep_sites)

def filter_element_for_ph(structure, elements, elements_group=()):
    # If there's any specific logic for persistence homology, handle it here
    keep_sites = filter_elements_common(structure.sites, elements, elements_group)
    return Structure.from_sites(keep_sites) if isinstance(structure, Structure) else Molecule.from_sites(keep_sites)

This approach keeps the core filtering logic in one place, making it easier to update and maintain. If filter_element_for_ph requires additional specific logic, you can still handle that separately within its function. Let me know if you need further clarification or assistance!

@kjappelbaum
Copy link
Owner

interesting, thanks for looking into it!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that deepsource is not heavily used at the moment, but why are those changes needed for this PR?

Comment on lines +107 to +108
xyz_periodic_copies.append(coords)
element_copies.append(np.array(elements).reshape(-1,1))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, seems like a good catch, because the x==y==z==0 condition is skipped atm

@@ -34,14 +34,14 @@ class PHHist(MOFBaseFeaturizer):

def __init__(
self,
atom_types: Tuple[str] = (
atom_types: List[str] = [
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuple was originally used because it is immutable and hence a better default for arguments.

But since we do not change the atom_types this is probably a safe choice

Comment on lines +52 to +54
elements_ = []
elements_group = (elements,)
for atom_type in elements_group:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure I understand this change. You are certainly right that the old code behaves in an unexpected way for elements with more than two symbols in the letter. But wouldn't then the clearer fix be something else? For example, always split by - first and then handle other things

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants