
More robust name validation #703

Open · wants to merge 17 commits into base: main

Conversation

@aeisenbarth (Contributor) commented Sep 9, 2024

Closes #624

  • This pull request changes the name validation rules:
    • additionally allow . (now allowing _, -, . and alphanumeric characters, which includes 0-9a-zA-Z but also other Unicode characters such as ɑ and ²)
    • forbid the full names . and ..
    • forbid the prefix __
    • forbid names that differ only in character case, like abc and Abc (only one of them is allowed, no matter which case)
  • Name validation is now also applied to AnnData tables (keys/columns in obs, obsm, obsp, var, varm, varp, uns).
    • For the obs and var dataframes, _index is forbidden.
  • Validation happens at construction time, when adding elements to an element type dictionary (as before).
  • Additionally, validation happens before writing to Zarr.
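The rules above can be sketched as follows. This is a hypothetical simplification, not the actual implementation (which lives in spatialdata and may differ in function names and details); note that Python's str.isalnum() already accepts Unicode alphanumerics such as ɑ and ².

```python
def check_valid_name(name: str) -> None:
    """Raise ValueError if a single element name violates the rules (sketch)."""
    if not name:
        raise ValueError("Name cannot be empty.")
    if name in (".", ".."):
        raise ValueError("Name cannot be '.' or '..'.")
    if name.startswith("__"):
        raise ValueError("Name cannot start with '__'.")
    if not all(c.isalnum() or c in "_-." for c in name):
        raise ValueError(
            "Name must contain only alphanumeric characters, underscores, dots and hyphens."
        )

def check_unique_ignoring_case(names: list[str]) -> None:
    """Forbid names differing only in character case, like 'abc' and 'Abc' (sketch)."""
    lowered = [n.lower() for n in names]
    if len(set(lowered)) != len(lowered):
        raise ValueError("Names must be unique, ignoring character case.")
```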

@aeisenbarth aeisenbarth marked this pull request as draft September 9, 2024 14:41

codecov bot commented Sep 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.10%. Comparing base (27bb4a7) to head (3a83768).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #703      +/-   ##
==========================================
+ Coverage   91.89%   92.10%   +0.20%     
==========================================
  Files          45       46       +1     
  Lines        6919     7038     +119     
==========================================
+ Hits         6358     6482     +124     
+ Misses        561      556       -5     
Files with missing lines Coverage Δ
src/spatialdata/_core/_elements.py 91.86% <100.00%> (-0.10%) ⬇️
src/spatialdata/_core/spatialdata.py 91.55% <100.00%> (+0.65%) ⬆️
src/spatialdata/_core/validation.py 100.00% <100.00%> (ø)
src/spatialdata/models/__init__.py 100.00% <100.00%> (ø)
src/spatialdata/models/models.py 87.74% <100.00%> (-0.08%) ⬇️

@aeisenbarth aeisenbarth marked this pull request as ready for review September 9, 2024 19:47
@LucaMarconato (Member)

Excellent PR @aeisenbarth, thank you!

I performed my code review and applied the code changes directly. I list them here:

  • I added a check for layers of tables as well (and updated the design docs and tests accordingly).
  • There was a bug in _validate_all_elements(): it should be element_type == 'tables' (instead of 'table').
    • That if condition was not covered by tests, so I added a test for it.
  • In test_spatialdata_operations.py, some checks for tables were missing (due to old code that expected a single table); I updated that.
  • Same for some code in test_readwrite.py.
  • In test_writing_invalid_name(), a test for labels was commented out; I uncommented it.
  • I extended test_writing_invalid_name() to consider:
    • writing of a table with a valid name but invalid "subnames" (this is the test I mentioned above that was not covering the "table" vs "tables" bug)
    • incremental writing of single elements (before, validation of table "subnames" was triggered only by write(); now also by write_element()).
  • I now trigger the name validation also in TableModel().validate() and not just in TableModel().parse(). I added tests for that in test_models.py.
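A minimal illustration of the "table" vs "tables" bug described above (a hypothetical simplification of _validate_all_elements; the element representation and the obs_columns key are invented for the sketch, and the real function validates much more):

```python
def collect_name_errors(elements_by_type: dict[str, dict]) -> list[str]:
    """Collect naming problems across all element types (sketch)."""
    errors = []
    for element_type, elements in elements_by_type.items():
        for name, element in elements.items():
            if name.startswith("__"):
                errors.append(f"{element_type}/{name}: name cannot start with '__'")
            # The fixed comparison: the dictionary key is the plural 'tables'.
            # With the old `element_type == "table"`, this branch never ran,
            # so table "subnames" (obs/var columns etc.) were never checked.
            if element_type == "tables":
                for column in element.get("obs_columns", []):
                    if column == "_index":
                        errors.append(f"{element_type}/{name}: column '_index' is forbidden")
    return errors
```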

@LucaMarconato (Member)

Please double-check my code changes; if you agree with them (or after your edits), let's merge 😊

@LucaMarconato (Member)

The explanation in the Discussions on how to read datasets with naming problems is great! One minor to-do:

@aeisenbarth (Contributor, Author)

Thanks, the changes are good.

> add a link to the discussion #707 in the exception that the code raises when reading a dataset with naming problems.

The exception is not raised in a single place. These exceptions would need to be changed:
  • check_valid_name, L74–84 (6 exceptions)
  • _iter_anndata_attr_keys_collect_value_errors, L196
The problem with _iter_anndata_attr_keys_collect_value_errors is that it collects one or more of the above exceptions, so it would include the link multiple times.

We would probably rather refactor the code, or create a wrapper function that adds the link to a raised exception and call that function in place of check_valid_name and validate_table_attr_keys
(in Elements._check_key, SpatialData.write, SpatialData.write_element, SpatialData.write_transformations, TableModel.validate, TableModel.parse).
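One possible shape for such a wrapper (hypothetical name and signature): it runs a validation callable and appends the discussion link exactly once to any ValueError it raises, no matter how many sub-errors were collected into that exception.

```python
DISCUSSION_URL = "https://github.com/scverse/spatialdata/discussions/707"

def validate_with_renaming_hint(validate, *args, **kwargs):
    """Run a validation callable; append the renaming link once to any ValueError."""
    try:
        validate(*args, **kwargs)
    except ValueError as e:
        raise ValueError(
            f"{e}\nFor renaming, please see the discussion here {DISCUSSION_URL} ."
        ) from e
```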

@aeisenbarth (Contributor, Author)

We just discussed the following ideas in the meeting:

  1. Add a flag to optionally skip validation on reading, and maybe on model construction.
  • ➕ This facilitates debugging or fixing data.
  • For reading, this seems feasible.
  • For model construction, every operation that adds an element to the dictionary would also need to offer the flag. This doesn't work for dict[key] = assignment, only for add_element(name, elem, validate=False). It would also affect many places in the code and increase complexity.
  2. It might be better to throw validation errors only on writing, and just warnings on reading or construction.
  • ➕ Allows reading old/invalid files
  • ➕ Allows users to easily fix invalid names by renaming in memory, without extra tools
  • ➖ In-memory representation is not guaranteed to be valid
  • ➖ More complex
  • I consider the possibility of invalid in-memory objects problematic, because other functions in the code cannot trust that the data is valid, and because a non-disk-backed SpatialData may be passed to other libraries (scanpy etc.) that assume all data is disk-backed and validated.
  • In my view, we should expect all datasets newly created after this PR to be valid. There should be very few older datasets that violate these constraints and would benefit from a warning instead of an error, and they need to be migrated anyway, either due to an error on reading or due to a warning.
  • The use case of reading possibly invalid data overlaps with the issues of 1) gracefully reading legacy formats into the latest in-memory representation (partially implemented for parquet) and 2) partially reading corrupted data (Tolerance when reading corrupted data #457).
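Option 2 above could look roughly like this (a hedged sketch with invented names, not an API proposal from the PR): the same check either raises or warns depending on the context it is called from (writing vs. reading/construction).

```python
import warnings

def check_names(names, *, on_error: str = "raise") -> None:
    """Validate element names; raise on write paths, warn on read paths (sketch)."""
    problems = [n for n in names if n in (".", "..") or n.startswith("__")]
    if not problems:
        return
    msg = f"Invalid element names: {problems}"
    if on_error == "warn":
        warnings.warn(msg)  # reading/construction: tolerate old data, but notify
    else:
        raise ValueError(msg)  # writing: never persist invalid names
```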

Any opinions, @LucaMarconato, @giovp ?

@aeisenbarth (Contributor, Author)

I extended the validation error message to include the link to the instructions for renaming misnamed elements.
I think it is ready for a final review.

For example, read_zarr now displays (in the test test_reading_invalid_name):

Cannot construct SpatialData object, input contains invalid elements.
For renaming, please see the discussion here https://github.com/scverse/spatialdata/discussions/707 .
  shapes/non-alnum_#$%&()*+,?@: Name must contain only alphanumeric characters, underscores, dots and hyphens.
  points/has whitespace: Name must contain only alphanumeric characters, underscores, dots and hyphens.
  • There were still some redundant validations that I kept:
    When a given name is used to refer to an existing element (not to add a new one), we can assume the existing elements were validated at construction time, and when an invalid name is queried, no element will be found. This is the case in SpatialData.write_element(element_name), SpatialData.write_transformations(element_name), and SpatialData.write_metadata(element_name).
  • I decided to remove the name validation in SpatialData.delete_element_from_disk, for the reason above, and especially because if an element somehow got an invalid name, we should still allow deleting it.
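The combined message shown above can be produced by collecting per-element errors and raising them once, in the spirit of _iter_anndata_attr_keys_collect_value_errors (this is a hypothetical simplification; the collect_errors name and its dict-of-callables signature are invented for the sketch):

```python
def collect_errors(checks: dict) -> None:
    """Run each named check; raise one ValueError listing every failure (sketch)."""
    failures = []
    for path, check in checks.items():
        try:
            check()
        except ValueError as e:
            failures.append(f"  {path}: {e}")
    if failures:
        raise ValueError(
            "Cannot construct SpatialData object, input contains invalid elements.\n"
            + "\n".join(failures)
        )
```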

@melonora (Collaborator) left a comment

I think this is good to go. @LucaMarconato perhaps a last check from your side?

Successfully merging this pull request may close these issues.

Naming constraints break compatibility with existing datasets
3 participants