
More robust name validation #703

Open · wants to merge 17 commits into base: main

Conversation

@aeisenbarth (Contributor) commented Sep 9, 2024

Closes #624

  • This pull request changes the name validation rules:
    • additionally allow . (now allowing _, -, . and alphanumeric characters, which includes 0-9a-zA-Z but also other Unicode characters such as ɑ and ²)
    • forbid the full names . and ..
    • forbid the prefix __
    • forbid names that differ only in character case, like abc and Abc (only one of them is allowed, no matter which case)
  • Name validation is now also applied to AnnData tables (keys/columns in obs, obsm, obsp, var, varm, varp, uns).
    • For the obs and var dataframes, _index is forbidden.
  • Validation happens at construction time, when adding elements to an element type dictionary (as before).
  • Additionally, validation happens before writing to Zarr.
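The rules above can be sketched as follows. This is a hypothetical simplification, not the actual implementation (which lives in spatialdata and may differ in function names and details); note that Python's str.isalnum() already accepts Unicode alphanumerics such as ɑ and ².

```python
def check_valid_name(name: str) -> None:
    """Raise ValueError if a single element name violates the rules (sketch)."""
    if not name:
        raise ValueError("Name cannot be empty.")
    if name in (".", ".."):
        raise ValueError("Name cannot be '.' or '..'.")
    if name.startswith("__"):
        raise ValueError("Name cannot start with '__'.")
    if not all(c.isalnum() or c in "_-." for c in name):
        raise ValueError(
            "Name must contain only alphanumeric characters, underscores, dots and hyphens."
        )

def check_unique_ignoring_case(names: list[str]) -> None:
    """Forbid names differing only in character case, like 'abc' and 'Abc' (sketch)."""
    lowered = [n.lower() for n in names]
    if len(set(lowered)) != len(lowered):
        raise ValueError("Names must be unique, ignoring character case.")
```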

@aeisenbarth aeisenbarth marked this pull request as draft September 9, 2024 14:41

codecov bot commented Sep 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.10%. Comparing base (27bb4a7) to head (3a83768).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #703      +/-   ##
==========================================
+ Coverage   91.89%   92.10%   +0.20%     
==========================================
  Files          45       46       +1     
  Lines        6919     7038     +119     
==========================================
+ Hits         6358     6482     +124     
+ Misses        561      556       -5     
Files with missing lines Coverage Δ
src/spatialdata/_core/_elements.py 91.86% <100.00%> (-0.10%) ⬇️
src/spatialdata/_core/spatialdata.py 91.55% <100.00%> (+0.65%) ⬆️
src/spatialdata/_core/validation.py 100.00% <100.00%> (ø)
src/spatialdata/models/__init__.py 100.00% <100.00%> (ø)
src/spatialdata/models/models.py 87.74% <100.00%> (-0.08%) ⬇️

@aeisenbarth aeisenbarth marked this pull request as ready for review September 9, 2024 19:47
@LucaMarconato (Member)

Excellent PR @aeisenbarth, thank you!

I performed my code review and applied the code changes directly. I list them here:

  • I added a check for layers of tables as well (and updated the design docs and tests accordingly).
  • There was a bug in _validate_all_elements(): it should be element_type == 'tables' (instead of 'table').
    • That if condition was not covered by tests, so I added a test for it.
  • In test_spatialdata_operations.py, some checks for tables were missing (due to old code that expected a single table); I updated that.
  • Same for some code in test_readwrite.py.
  • In test_writing_invalid_name(), a test for labels was commented out; I uncommented it.
  • I extended test_writing_invalid_name() to consider:
    • writing of a table with a valid name but invalid "subnames" (this is the test I mentioned above that was not covering the "table" vs "tables" bug)
    • incremental writing of single elements (before, validation of table "subnames" was triggered only by write(); now also by write_element()).
  • I now trigger the name validation also in TableModel().validate() and not just in TableModel().parse(). I added tests for that in test_models.py.
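A minimal illustration of the "table" vs "tables" bug described above (a hypothetical simplification of _validate_all_elements; the element representation and the obs_columns key are invented for the sketch, and the real function validates much more):

```python
def collect_name_errors(elements_by_type: dict[str, dict]) -> list[str]:
    """Collect naming problems across all element types (sketch)."""
    errors = []
    for element_type, elements in elements_by_type.items():
        for name, element in elements.items():
            if name.startswith("__"):
                errors.append(f"{element_type}/{name}: name cannot start with '__'")
            # The fixed comparison: the dictionary key is the plural 'tables'.
            # With the old `element_type == "table"`, this branch never ran,
            # so table "subnames" (obs/var columns etc.) were never checked.
            if element_type == "tables":
                for column in element.get("obs_columns", []):
                    if column == "_index":
                        errors.append(f"{element_type}/{name}: column '_index' is forbidden")
    return errors
```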

@LucaMarconato (Member)

Please double-check my code changes; if you agree with them (or after your edits), let's merge 😊

@LucaMarconato (Member)

The explanation in the Discussions on how to read datasets with naming problems is great! One minor to-do:

@aeisenbarth (Contributor, Author)

Thanks, the changes are good.

> add a link to the discussion #707 in the exception that the code raises when reading a dataset with naming problems.

The exception is not raised in a single place. These exceptions would need to be changed:
  • check_valid_name, L74–84 (6 exceptions)
  • _iter_anndata_attr_keys_collect_value_errors, L196
The problem with _iter_anndata_attr_keys_collect_value_errors is that it collects one or more of the above exceptions, so it would include the link multiple times.

We would probably rather refactor the code, or create a wrapper function that adds the link to a raised exception and call that function in place of check_valid_name and validate_table_attr_keys
(in Elements._check_key, SpatialData.write, SpatialData.write_element, SpatialData.write_transformations, TableModel.validate, TableModel.parse).
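One possible shape for such a wrapper (hypothetical name and signature): it runs a validation callable and appends the discussion link exactly once to any ValueError it raises, no matter how many sub-errors were collected into that exception.

```python
DISCUSSION_URL = "https://github.com/scverse/spatialdata/discussions/707"

def validate_with_renaming_hint(validate, *args, **kwargs):
    """Run a validation callable; append the renaming link once to any ValueError."""
    try:
        validate(*args, **kwargs)
    except ValueError as e:
        raise ValueError(
            f"{e}\nFor renaming, please see the discussion here {DISCUSSION_URL} ."
        ) from e
```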

@aeisenbarth (Contributor, Author)

We just discussed the following ideas in the meeting:

  1. Add a flag to optionally skip validation on reading, and maybe on model construction.
  • ➕ This facilitates debugging or fixing data.
  • For reading, this seems feasible.
  • For model construction, every operation that adds an element to the dictionary would also need to offer the flag. This doesn't work for dict[key] = assignment, only for add_element(name, elem, validate=False). It would also affect many places in the code and increase complexity.
  2. It might be better to throw validation errors only on writing, and just warnings on reading or construction.
  • ➕ Allows reading old/invalid files
  • ➕ Allows users to easily fix invalid names by renaming in memory, without extra tools
  • ➖ In-memory representation is not guaranteed to be valid
  • ➖ More complex
  • I consider the possibility of invalid in-memory objects problematic, because other functions in the code cannot trust that the data is valid, and because a non-disk-backed SpatialData may be passed to other libraries (scanpy etc.) that assume all data is disk-backed and validated.
  • In my view, we should expect all datasets newly created after this PR to be valid. There should be very few older datasets that violate these constraints and would benefit from a warning instead of an error, and they need to be migrated anyway, either due to an error on reading or due to a warning.
  • The use case of reading possibly invalid data overlaps with the issues of 1) gracefully reading legacy formats into the latest in-memory representation (partially implemented for parquet) and 2) partially reading corrupted data (Tolerance when reading corrupted data #457).
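Option 2 above could look roughly like this (a hedged sketch with invented names, not an API proposal from the PR): the same check either raises or warns depending on the context it is called from (writing vs. reading/construction).

```python
import warnings

def check_names(names, *, on_error: str = "raise") -> None:
    """Validate element names; raise on write paths, warn on read paths (sketch)."""
    problems = [n for n in names if n in (".", "..") or n.startswith("__")]
    if not problems:
        return
    msg = f"Invalid element names: {problems}"
    if on_error == "warn":
        warnings.warn(msg)  # reading/construction: tolerate old data, but notify
    else:
        raise ValueError(msg)  # writing: never persist invalid names
```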

Any opinions, @LucaMarconato, @giovp ?

@aeisenbarth (Contributor, Author)

I extended the validation error message to include the link to the instructions for renaming misnamed elements.
I think it is ready for a final review.

For example, read_zarr now displays (in the test test_reading_invalid_name):

Cannot construct SpatialData object, input contains invalid elements.
For renaming, please see the discussion here https://github.com/scverse/spatialdata/discussions/707 .
  shapes/non-alnum_#$%&()*+,?@: Name must contain only alphanumeric characters, underscores, dots and hyphens.
  points/has whitespace: Name must contain only alphanumeric characters, underscores, dots and hyphens.
  • There were still some redundant validations that I kept:
    When a given name is used to refer to an existing element (not to add a new one), we can assume the existing elements were validated at construction time, and when an invalid name is queried, no element will be found. This is the case in SpatialData.write_element(element_name), SpatialData.write_transformations(element_name), and SpatialData.write_metadata(element_name).
  • I decided to remove the name validation in SpatialData.delete_element_from_disk, for the reason above, and especially because if an element somehow got an invalid name, we should still allow deleting it.
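The combined message shown above can be produced by collecting per-element errors and raising them once, in the spirit of _iter_anndata_attr_keys_collect_value_errors (this is a hypothetical simplification; the collect_errors name and its dict-of-callables signature are invented for the sketch):

```python
def collect_errors(checks: dict) -> None:
    """Run each named check; raise one ValueError listing every failure (sketch)."""
    failures = []
    for path, check in checks.items():
        try:
            check()
        except ValueError as e:
            failures.append(f"  {path}: {e}")
    if failures:
        raise ValueError(
            "Cannot construct SpatialData object, input contains invalid elements.\n"
            + "\n".join(failures)
        )
```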

@melonora (Collaborator) left a comment

I think this is good to go. @LucaMarconato perhaps a last check from your side?

Successfully merging this pull request may close these issues.

Naming constraints break compatibility with existing datasets
3 participants