[py-tx][tat] TAT API implementation doesn't work correctly with HMA #1668

Open
aokolish opened this issue Oct 24, 2024 · 5 comments

@aokolish

aokolish commented Oct 24, 2024

I'm testing out hasher-matcher-actioner (HMA) with the Tech Against Terrorism (TAT) API and noticed this log line from the server:

INFO in fetcher: TAT_HASHES[tat] Is a NoCheckpointing class, which hopefully is a test type, and we have a checkpoint. Considering complete

This leads me to believe that HMA would never pull new hashes from the TAT API. Can this somehow be fixed in the TAT exchange implementation?

Here are their API docs - https://www.terrorismanalytics.org/docs/hash-list-v1

There are probably easier steps to reproduce, but my steps were...

  • deploy HMA
  • configure the TAT exchange with prod credentials
  • run HMA fetcher
  • check logs for this message
aokolish changed the title from "[py-tx] TAT" to "[py-tx] Cannot fetch TAT hashes more than once" on Oct 24, 2024
@Bruce-Pouncey-TAT
Contributor

Bruce-Pouncey-TAT commented Oct 24, 2024

Hi @aokolish
Bruce here from TAT, perhaps I can be of assistance.

Our hash list API delivers a JSON file which is updated on a nightly basis as you have seen in our documentation.
In this scenario, a checkpoint would be difficult to keep track of, as the list is delivered in full every time via a single request.

In the current implementation, running threatexchange fetch would download the entire hash list again, along with any new hashes that were not included in the previous fetch. We want the system to assume the list is stale on every fetch.

I'm happy to go over this further to help you find a solution, or if there is something I'm misunderstanding.

@Dcallies
Contributor

Dcallies commented Oct 28, 2024

Hey @aokolish, @Bruce-Pouncey-TAT - I looked into this briefly, and this might be missing functionality in HMA, which currently can't handle APIs like TAT's that don't provide deltas but whose contents change over time. Since there is no efficient way to discover updated records with this kind of API, we'd need to write something new in HMA that loads all the previously downloaded TAT records into memory and then computes the diff itself (especially removals). We could also force-clear all the hashes on every fetch, but as the number of hashes grows over time, this creates weird inconsistencies in the database (hashes disappearing and reappearing in the index) that might have real production impact.

This feature doesn't exist today, and so by default TAT isn't correctly supported by HMA, hence the logs that Alex is seeing.
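The diff-based approach described above could look roughly like the following. This is only a minimal sketch, not HMA code: the function name `diff_snapshots` and the id-to-digest dictionaries are hypothetical stand-ins for however HMA would actually hold the previously downloaded records.

```python
from typing import Dict, Set, Tuple


def diff_snapshots(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[Dict[str, str], Set[str]]:
    """Return (upserts, removals) needed to turn `previous` into `current`.

    Both arguments map a record id to its hash digest. This layout is an
    assumption made for illustration, not the real TAT or HMA schema.
    """
    # Records that are new, or whose digest changed since the last fetch.
    upserts = {
        rid: digest
        for rid, digest in current.items()
        if previous.get(rid) != digest
    }
    # Records present last time but missing from the new full download.
    removals = set(previous) - set(current)
    return upserts, removals


# Placeholder data: "2" was removed, "3" changed, "4" is new.
previous = {"1": "aaa", "2": "bbb", "3": "ccc"}
current = {"1": "aaa", "3": "ccd", "4": "ddd"}
upserts, removals = diff_snapshots(previous, current)
```

Note that both snapshots have to be held in memory at once, so the cost of every fetch grows with the size of the full list, which is the drawback raised above.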

To evaluate the potential solutions:

  1. @Bruce-Pouncey-TAT - my top recommendation is to switch, on the Tech Against Terrorism side, to a delta-based API like those of NCMEC, StopNCII, and ThreatExchange. In the long term, your users will thank you, as the cost of keeping a correct copy grows with the size of your database, and it's not too hard as long as you are storing hashes in a backing database. I have implemented multiple versions of this type of API, and helped other programs make this same jump. This is by far the easiest solution.
  2. If for whatever reason TAT can't update their API, I can more fully describe the implementation outlined in my opening paragraph and add it to an issue for someone to attempt.

Dcallies changed the title from "[py-tx] Cannot fetch TAT hashes more than once" to "[py-tx][tat] TAT API implementation doesn't work correctly with HMA" on Oct 28, 2024
@Bruce-Pouncey-TAT
Contributor

Hi @aokolish, @Dcallies - An update on this: we're currently in the planning phase of updating our API to become delta-based and more compatible with ThreatExchange and HMA. We're using the NCMEC and StopNCII documentation as reference.

A few points I need guidance on:

  • How does one implement this in HMA?
  • What changes are needed in the current TAT API?

Can we please open a new issue for each change, or create a step-by-step list of the changes that need to be made, similar to this issue?

@Dcallies
Contributor

Dcallies commented Nov 27, 2024

Hey @Bruce-Pouncey-TAT

> What changes are needed in the current TAT API

> Can we please open up a new issue for each change

As a point of clarification, the changes I suggested would be made to https://www.terrorismanalytics.org/docs/hash-list-v1, not to the code in this repo. I'm unsure how to respond to your request to open new issues for these changes as a result.

In summary, the suggested change is that, instead of returning a download link to a JSON file of hashes, the API returns the hashes directly, in an ordering that lets users check only for updates. This will make it play nicer with HMA by default, and will also benefit all your users.

Before:

  • GET /api/hash-list/ -> file download link
  • file download link -> json list

After:

  • Add a modified time (mtime) to each record (defaults to created time). Any modification to the record (including deleting it) should update mtime and cause the record to be reordered
  • Add a soft-deletion flag to each record (1=deleted)
    • You can choose to hard-delete records after X days, 90d is a classic choice
  • Add an index on (mtime, id) in whatever backing database stores the hashes
  • GET /api/hashes?after=token -> returns json ordered by (mtime, id) with the following:
    • List records -> Equivalent to current file json, though it also contains deleted records (you can ignore all fields except id)
    • checkpoint -> a token for saving progress
    • next -> the exact URL to call to get the next set of hashes; implicitly /api/hashes?after=checkpoint
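To make the list above concrete, here is a rough server-side sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the table and column names, the tiny page size, and the `get_hashes` helper are hypothetical, and TAT's real backing store and schema are not known here.

```python
import sqlite3
from typing import Any, Dict, Optional

PAGE_SIZE = 2  # hypothetical; kept tiny so the demo needs several pages


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE hashes (id INTEGER PRIMARY KEY, mtime INTEGER NOT NULL, "
        "deleted INTEGER NOT NULL DEFAULT 0, hash_digest TEXT)"
    )
    # The (mtime, id) index lets the paging query walk records in order.
    db.execute("CREATE INDEX hashes_mtime_id ON hashes (mtime, id)")
    return db


def get_hashes(db: sqlite3.Connection, after: Optional[str] = None) -> Dict[str, Any]:
    """Return one page of records ordered by (mtime, id), starting after `after`."""
    if after:
        mtime_str, id_str = after.split(",")
        mtime, last_id = int(mtime_str), int(id_str)
    else:
        mtime, last_id = -1, -1
    rows = db.execute(
        "SELECT id, mtime, deleted, hash_digest FROM hashes "
        "WHERE mtime > ? OR (mtime = ? AND id > ?) "
        "ORDER BY mtime, id LIMIT ?",
        (mtime, mtime, last_id, PAGE_SIZE),
    ).fetchall()
    data = []
    for rid, rmtime, deleted, digest in rows:
        if deleted:
            # Soft-deleted records are still returned so clients can remove them.
            data.append({"id": rid, "mtime": rmtime, "deleted": True})
        else:
            data.append({"id": rid, "mtime": rmtime, "hash_digest": digest})
    # The checkpoint is the (mtime, id) of the last record handed out.
    checkpoint = f"{rows[-1][1]},{rows[-1][0]}" if rows else after
    response: Dict[str, Any] = {"data": data, "checkpoint": checkpoint}
    if rows:
        response["next"] = f"/api/hashes?after={checkpoint}"
    return response


db = make_db()
db.executemany(
    "INSERT INTO hashes (id, mtime, deleted, hash_digest) VALUES (?, ?, ?, ?)",
    [  # placeholder rows, not real TAT data
        (123, 1704085000, 0, "abc"),
        (104, 1704085141, 1, None),
        (124, 1704085200, 0, "def"),
    ],
)
first = get_hashes(db)
second = get_hashes(db, first["checkpoint"])
third = get_hashes(db, second["checkpoint"])
```

Because the query only seeks forward through the (mtime, id) index, a deletion or edit has to bump mtime to become visible to clients that already passed that record, which is why the list asks for every modification to update mtime.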

Example:

curl "https://www.terrorismanalytics.org/api/hashes"
{
  "data": [
    {
      "id": 123,
      "mtime": 170408502,
      "hash_digest": ...
    },
    {
      "id": 104,
      "mtime": 1704085141,
      "deleted": true
    },
    ...
    {
      "id": 124,
      "mtime": 1704085200,
      "hash_digest": ...
    }
  ],
  "checkpoint": "1704085200,124",
  "next": "https://www.terrorismanalytics.org/api/hashes?after=1704085200,124"
}

curl "https://www.terrorismanalytics.org/api/hashes?after=1704085200,124"
{
  "data": [],
  "checkpoint": "1704085200,124"
}

Each call to /api/hashes returns X records, and calling the next URL returns the next X records, until you reach the end. Your users only need to check whether there are new entries after the last checkpoint they stored, which reduces the amount of data transmitted. This is implemented as a simple iteration over the underlying database in the order of the index you added.
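On the consuming side, the loop described above might look like this. It is a sketch under stated assumptions rather than the python-threatexchange implementation: `apply_updates` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for an HTTP call to the proposed /api/hashes endpoint so the example runs without network access.

```python
from typing import Any, Callable, Dict, Optional

# Placeholder records, already ordered by (mtime, id); not real TAT data.
RECORDS = [
    {"id": 123, "mtime": 1704085000, "hash_digest": "abc"},
    {"id": 104, "mtime": 1704085141, "deleted": True},
    {"id": 124, "mtime": 1704085200, "hash_digest": "def"},
]


def fake_fetch_page(after: Optional[str], page_size: int = 2) -> Dict[str, Any]:
    """Stand-in for GET /api/hashes?after=<checkpoint> (hypothetical endpoint)."""
    key = tuple(int(p) for p in after.split(",")) if after else (-1, -1)
    data = [r for r in RECORDS if (r["mtime"], r["id"]) > key][:page_size]
    checkpoint = f'{data[-1]["mtime"]},{data[-1]["id"]}' if data else after
    return {"data": data, "checkpoint": checkpoint}


def apply_updates(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
    store: Dict[int, str],
    checkpoint: Optional[str] = None,
) -> Optional[str]:
    """Pull pages until an empty one arrives; return the checkpoint to persist."""
    while True:
        page = fetch_page(checkpoint)
        if not page["data"]:
            return checkpoint  # caught up; nothing newer than the checkpoint
        for record in page["data"]:
            if record.get("deleted"):
                store.pop(record["id"], None)  # apply the soft-delete locally
            else:
                store[record["id"]] = record["hash_digest"]
        checkpoint = page["checkpoint"]


store = {104: "old", 1: "keep"}
checkpoint = apply_updates(fake_fetch_page, store)
# A later run resumes from the saved checkpoint and transfers no records.
checkpoint_again = apply_updates(fake_fetch_page, store, checkpoint)
```

Persisting the returned checkpoint between runs is what lets each later fetch transfer only the records modified since the previous one.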

@Bruce-Pouncey-TAT
Contributor

@Dcallies @aokolish
Thank you for the clarity. The work we're currently doing is already very close to the example you have provided which is good news.

Once we have it in place, I think some minor changes will need to be made to this repo. As I recall, we used no checkpointing intentionally, so that will need updating, along with some minor changes in exchanges/clients/techagainstterrorism (https://github.com/facebook/ThreatExchange/tree/main/python-threatexchange/threatexchange/exchanges/clients/techagainstterrorism) and impl/techagainstterrorism_api.py.

I will create a new branch for this issue and also update the tests.
