Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't store data filename in corpus data structure #548

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented Nov 28, 2024

A fix for #547.

Tests have not been updated yet - wanting to get review of approach.


This change is Reviewable

@johnml1135 johnml1135 changed the title A start Don't store data filename in corpus data structure Nov 28, 2024
Repopulate filename when building and creating USFM files
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 31 of 31 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @johnml1135)


src/Serval/src/Serval.Translation/Services/EngineService.cs line 525 at r1 (raw file):

    }

    public Task DeleteAllCorpusFilesAsync(string dataFileId, CancellationToken cancellationToken = default)

I noticed that this method was never updated to reflect the new corpus model changes. We should fix this.


src/Serval/src/Serval.Shared/Contracts/CorpusResult.cs line 8 at r1 (raw file):

    public required string Language { get; init; }
    public string? Name { get; init; }
    public required IReadOnlyList<CorpusFile> Files { get; set; }

We don't want to pass models between contexts. We should only be passing contracts. A CorpusFileResult should be used to pass file information.


src/Serval/src/Serval.DataFiles/Models/CorpusFile.cs line 5 at r1 (raw file):

public record CorpusFile
{
    public required DataFileReference FileReference { get; init; }

I would prefer not to denormalize the database within a context. We shouldn't keep copies of the data. We also don't want clients to have to specify the same information twice. This should just be a FileRef property.


src/Serval/src/Serval.Shared/Models/CorpusFile.cs line 9 at r1 (raw file):

    public required string TextId { get; set; }

    private string? _filename;

I see where you are going by not storing the filename and then synchronously calling Serval.DataFiles to get the filename when it is used. Rather than redesigning all of this, I would prefer that we stick with the current design. We are already dealing with one type of data file change, specifically deletion. We should just use the same approach when a file is updated. I would also like to keep the number of synchronous calls across contexts down to a minimum. In the current design, we only make these calls on setup. I would prefer not to make them on more frequent calls, such as starting a build or generating USFM. See the DataFileDeletedConsumer for an example.


src/Serval/src/Serval.Shared/Models/MonolingualCorpus.cs line 3 at r1 (raw file):

namespace Serval.Shared.Models;

public record MonolingualCorpus

This should be moved back to Serval.Translation.

@johnml1135
Copy link
Collaborator Author

Let me see if I understand the basic changes you are suggesting:

  • Make a DataFileUpdated consumer - that updates:
    • The Corpus (as in the corpus controller)
    • The Legacy corpus
    • The Parallel Corpus
  • Make a CorpusFileUpdated consumer that updates the Parallel Corpus (when the files change)

@johnml1135
Copy link
Collaborator Author

Start by writing integration tests (that fail) so that the logic can be more easily debugged.
Make "removeAll" work with parallel Corpus.
Ignore Assessment Engine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants