[py-tx][tat] TAT API implementation doesn't work correctly with HMA #1668

aokolish · 2024-10-24T20:58:56Z

I'm testing out hasher-matcher-actioner (HMA) with the Tech Against Terrorism (TAT) API and noticed this log line from the server:

INFO in fetcher: TAT_HASHES[tat] Is a NoCheckpointing class, which hopefully is a test type, and we have a checkpoint. Considering complete

This leads me to believe that HMA would never pull new hashes from the TAT API. Can this somehow be fixed in the TAT exchange implementation?

Here are their API docs - https://www.terrorismanalytics.org/docs/hash-list-v1

There are probably easier steps to reproduce, but my steps were...

deploy HMA
configure the TAT exchange with prod credentials
run HMA fetcher
check logs for this message


Bruce-Pouncey-TAT · 2024-10-24T23:09:02Z

Hi @aokolish
Bruce here from TAT, perhaps I can be of assistance.

Our hash list API delivers a JSON file which is updated on a nightly basis as you have seen in our documentation.
In this scenario a checkpoint would be difficult to keep track of as the list is delivered in full every time via a single request.

In the current implementation, running threatexchange fetch downloads the entire hash list again, including any new hashes that were not in the previous fetch. We want the system to assume the list is stale on every fetch.

I'm happy to go over this further to help you find a solution, or if there is something I'm misunderstanding.

Dcallies · 2024-10-28T17:45:37Z

Hey @aokolish, @Bruce-Pouncey-TAT - I looked into this briefly, and this looks like missing functionality in HMA, which currently can't handle APIs like TAT's that don't provide deltas but whose contents change over time. Since there is no efficient way to discover updated records with this kind of API, we'd need to write something new in HMA that loads all the previously downloaded TAT records into memory and then computes the diff itself (especially removals). We could also force-clear all the hashes on every fetch, but as the number of hashes grows, this creates weird inconsistencies in the database (hashes disappearing and reappearing in the index) that might have real production impact.

This feature doesn't exist today, and so by default TAT isn't correctly supported by HMA, hence the logs that Alex is seeing.
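As a rough illustration of the "create a diff itself" idea, here is a minimal sketch that compares the previously stored records with a fresh full download. It is not HMA code: the function name `diff_snapshots` and the id-to-digest dict shape are assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def diff_snapshots(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[Dict[str, str], Set[str]]:
    """Return (upserts, removals) needed to turn `previous` into `current`.

    Both arguments map a record id to its hash digest (hypothetical shape).
    """
    # New ids, or ids whose digest changed since the last download.
    upserts = {
        record_id: digest
        for record_id, digest in current.items()
        if previous.get(record_id) != digest
    }
    # Ids that were stored before but are absent from the fresh download.
    removals = set(previous) - set(current)
    return upserts, removals


# Made-up data: record 2 was removed, 3 changed, 4 is new.
previous = {"1": "aaa", "2": "bbb", "3": "ccc"}
current = {"1": "aaa", "3": "ccd", "4": "ddd"}
upserts, removals = diff_snapshots(previous, current)
```

Applying only `upserts` and `removals` avoids the clear-and-reload approach, so unchanged hashes never leave the index. The cost is that both snapshots must fit in memory at once.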

To evaluate the potential solutions:

@Bruce-Pouncey-TAT - my top recommendation is to switch to a delta-based API like NCMEC, StopNCII, and ThreatExchange on the Tech Against Terrorism side. In the long term, your users will thank you, as the cost of keeping a correct copy grows with the size of your database, and it's not too hard as long as you are storing hashes in a backing database. I have implemented multiple versions of this type of API, and helped other programs make this same jump. This is by far the easiest solution.
If for whatever reason TAT can't update their API, I can more fully describe the implementation from the opening paragraph and add it to an issue for someone to attempt.

Bruce-Pouncey-TAT · 2024-11-25T11:15:09Z

Hi @aokolish, @Dcallies - An update on this: we're currently in the planning phase of updating our API to become delta-based and more compatible with ThreatExchange and HMA. We're using the NCMEC and StopNCII documentation as reference.

A few points I need guidance on:

How does one implement this in HMA?
What changes are needed in the current TAT API?

Can we please open a new issue for each change, or create a step-by-step list of the changes that need to be made, similar to this issue?

Dcallies · 2024-11-27T21:49:12Z

Hey @Bruce-Pouncey-TAT

> What changes are needed in the current TAT API

> Can we please open up a new issue for each change

As a point of clarification, the changes I suggested would be made to https://www.terrorismanalytics.org/docs/hash-list-v1, not to the code in this repo. I'm unsure how to respond to your request to open new issues for these changes as a result.

In summary, the suggested change is that instead of returning a file download link to the JSON hashes, the API returns the hashes directly, in an ordering that allows users to check only for updates. This will make it play nicer with HMA by default, and will also benefit all your users.

Before:

GET /api/hash-list/ -> file download link
file download link -> json list

After:

Add a modified time (mtime) to each record (defaults to created time). Any modification to the record (including deleting it) should update mtime and cause the record to be reordered
Add a soft-deletion flag to each record (1=deleted)
- You can choose to hard-delete records after X days, 90d is a classic choice
Add an index on whatever backing db stores the hashes on (mtime, id)
GET /api/hashes?after=token -> returns json ordered by (mtime, id) with the following:
- List records -> Equivalent to current file json, though it also contains deleted records (you can ignore all fields except id)
- checkpoint -> a token for saving progress
- next -> contains exactly the URL to call to get the next set of hashes. implicitly /api/hashes?after=checkpoint
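The list above could be sketched on the server side roughly as follows. This is an illustration only, using an in-memory SQLite database. The table name `hashes`, the column layout, the helper names `open_demo_db` and `get_page`, the page size, and the sample rows are all assumptions, not TAT's actual schema or code.

```python
import sqlite3
from typing import Any, Dict, Optional

PAGE_SIZE = 2  # deliberately tiny so the demo needs several pages


def open_demo_db() -> sqlite3.Connection:
    """Create a hypothetical hash table with the (mtime, id) index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hashes (id INTEGER PRIMARY KEY, mtime INTEGER NOT NULL,"
        " deleted INTEGER NOT NULL DEFAULT 0, hash_digest TEXT)"
    )
    conn.execute("CREATE INDEX hashes_mtime_id ON hashes (mtime, id)")
    return conn


def get_page(conn: sqlite3.Connection, after: Optional[str] = None) -> Dict[str, Any]:
    """Return one page of records strictly after the `after` token."""
    if after:
        mtime_part, id_part = after.split(",")
        mtime, last_id = int(mtime_part), int(id_part)
    else:
        mtime, last_id = -1, -1  # sorts before every real record
    rows = conn.execute(
        "SELECT id, mtime, deleted, hash_digest FROM hashes"
        " WHERE mtime > ? OR (mtime = ? AND id > ?)"
        " ORDER BY mtime, id LIMIT ?",
        (mtime, mtime, last_id, PAGE_SIZE),
    ).fetchall()
    data = []
    for record_id, record_mtime, deleted, digest in rows:
        if deleted:
            # Soft-deleted records carry only what a client needs to remove them.
            data.append({"id": record_id, "mtime": record_mtime, "deleted": True})
        else:
            data.append({"id": record_id, "mtime": record_mtime, "hash_digest": digest})
    # With no new rows, hand back the caller's token unchanged.
    checkpoint = f"{rows[-1][1]},{rows[-1][0]}" if rows else after
    page: Dict[str, Any] = {"data": data, "checkpoint": checkpoint}
    if rows:
        page["next"] = f"/api/hashes?after={checkpoint}"
    return page


conn = open_demo_db()
conn.executemany(
    "INSERT INTO hashes (id, mtime, deleted, hash_digest) VALUES (?, ?, ?, ?)",
    [(123, 1704085020, 0, "aa"), (104, 1704085141, 1, None), (124, 1704085200, 0, "bb")],
)
first = get_page(conn)
second = get_page(conn, first["checkpoint"])
third = get_page(conn, second["checkpoint"])  # empty: nothing newer
```

The `mtime > ? OR (mtime = ? AND id > ?)` condition is what makes the token safe when several records share the same mtime; ordering by mtime alone could skip or repeat rows at a page boundary.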

Example:

curl https://www.terrorismanalytics.org/api/hashes
{
  data: [
    {
      id: 123,
      mtime: 170408502,
      hash_digest: ...
    },
    {
      id: 104,
      mtime: 1704085141,
      deleted: true
    },
    ....
    {
      id: 124,
      mtime: 1704085200,
      hash_digest: ...
    }
  ],
  checkpoint: "1704085200,124",
  next: "https://www.terrorismanalytics.org/api/hashes?after=1704085200,124"
}

curl https://www.terrorismanalytics.org/api/hashes?after=1704085200,124
{
  data: [],
  checkpoint: "1704085200,124"
}

Each call to /api/hashes returns X records, and calling the next URL returns the next X records until you reach the end. Your users only need to check whether there are new entries after the last checkpoint they stored, which reduces the amount of data transmitted. This is implemented as a simple iteration over the underlying database in the order of the index you added.
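From the consumer's side, the loop described above might look like the sketch below. It is not the py-tx or HMA implementation: `sync`, `fake_fetch_page`, the in-memory `store` dict, and the sample records are hypothetical, and the fake fetcher stands in for a real HTTP call to the proposed endpoint.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]


def sync(
    fetch_page: Callable[[Optional[str]], Page],
    store: Dict[int, str],
    checkpoint: Optional[str] = None,
) -> Optional[str]:
    """Apply every page after `checkpoint` to `store`; return the new checkpoint."""
    while True:
        page = fetch_page(checkpoint)
        for record in page["data"]:
            if record.get("deleted"):
                store.pop(record["id"], None)  # soft-deleted upstream
            else:
                store[record["id"]] = record["hash_digest"]
        checkpoint = page.get("checkpoint", checkpoint)
        if not page["data"]:  # an empty page means we are caught up
            return checkpoint


# Stand-in for the remote API: records already sorted by (mtime, id).
RECORDS: List[Dict[str, Any]] = [
    {"id": 123, "mtime": 1704085020, "hash_digest": "aa"},
    {"id": 104, "mtime": 1704085141, "deleted": True},
    {"id": 124, "mtime": 1704085200, "hash_digest": "bb"},
]


def fake_fetch_page(after: Optional[str], page_size: int = 2) -> Page:
    key = tuple(int(part) for part in after.split(",")) if after else (-1, -1)
    batch = [r for r in RECORDS if (r["mtime"], r["id"]) > key][:page_size]
    token = f'{batch[-1]["mtime"]},{batch[-1]["id"]}' if batch else after
    return {"data": batch, "checkpoint": token}


store = {104: "old"}  # a hash fetched earlier that has since been deleted
checkpoint = sync(fake_fetch_page, store)
# A later run resumes from the saved checkpoint and transfers no records.
checkpoint_again = sync(fake_fetch_page, store, checkpoint)
```

Persisting the returned checkpoint between runs is what turns each later fetch into a small incremental request instead of a full download.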

Bruce-Pouncey-TAT · 2024-11-28T09:08:08Z

@Dcallies @aokolish
Thank you for the clarity. The work we're currently doing is already very close to the example you have provided which is good news.

Once we have it in place, I think some minor changes will need to be made to this repo. As I recall, we used no checkpointing intentionally, so that will need updating, along with some minor changes in [exchanges/clients/techagainstterrorism](https://github.com/facebook/ThreatExchange/tree/main/python-threatexchange/threatexchange/exchanges/clients/techagainstterrorism) and impl/techagainstterrorism_api.py

I will create a new branch for this issue and also update the tests.

aokolish changed the title ~~[py-tx] TAT~~ [py-tx] Cannot fetch TAT hashes more than once Oct 24, 2024

Dcallies changed the title ~~[py-tx] Cannot fetch TAT hashes more than once~~ [py-tx][tat] TAT API implementation doesn't work correctly with HMA Oct 28, 2024
