[py-tx][tat] TAT API implementation doesn't work correctly with HMA #1668
Comments
Hi @aokolish, our hash list API delivers a JSON file that is updated nightly, as you have seen in our documentation. In the current implementation running I'm happy to go over this further to help you find a solution, or if there is something I'm misunderstanding.
Hey @aokolish, @Bruce-Pouncey-TAT - I looked into this briefly, and this might be missing functionality in HMA, which currently can't handle APIs like TAT's that don't expose deltas but whose contents change over time. Since there is no efficient way to discover updated records with this kind of API, we'd need to write something new in HMA that loads all the previously downloaded TAT records into memory and then computes the diff itself (especially removals). We could also force-clear all the hashes on every fetch, but as the number of hashes grows over time, this creates weird inconsistencies in the database (hashes disappearing and reappearing in the index) that might have real production impact. This feature doesn't exist today, so by default TAT isn't correctly supported by HMA, hence the logs that Alex is seeing. To evaluate the potential solutions:
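The in-memory diff described in the comment above could look roughly like the sketch below. It is not HMA code: the function name `diff_snapshots` and the record shape (a dict keyed by hash value) are assumptions made for illustration only.

```python
from typing import Any, Dict, Set, Tuple

Snapshot = Dict[str, Dict[str, Any]]  # hypothetical shape: hash value -> record metadata


def diff_snapshots(previous: Snapshot, current: Snapshot) -> Tuple[Snapshot, Set[str]]:
    """Compare two full downloads and return (upserts, removals).

    upserts:  records that are new or whose metadata changed
    removals: hash values present before but missing from the new download
    """
    upserts = {
        key: record
        for key, record in current.items()
        if key not in previous or previous[key] != record
    }
    removals = set(previous) - set(current)
    return upserts, removals


# Example: "b" disappeared, "a" changed, "c" is new.
last_night = {"a": {"tags": ["x"]}, "b": {"tags": []}}
tonight = {"a": {"tags": ["y"]}, "c": {"tags": []}}
upserts, removals = diff_snapshots(last_night, tonight)
```

Applying only `upserts` and `removals` to the index avoids the clear-and-reload behaviour where hashes briefly disappear, at the cost of holding the previous snapshot in memory.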
Hi @aokolish, @Dcallies - An update on this: we're currently in the planning phase of updating our API to be delta-based and more compatible with ThreatExchange and HMA. We're using the NCMEC and StopNCII documentation as a reference. A few points I need guidance on:
Can we please open a new issue for each change, or create a step-by-step list of the changes that need to be made, similar to this issue?
As a point of clarification, the changes I suggested would be made to https://www.terrorismanalytics.org/docs/hash-list-v1, not to the code in this repo, so I'm unsure how to respond to your request to open new issues for them. In summary, the suggestion is that instead of returning a file download link to JSON hashes, the API returns the hashes directly, in an ordering that allows users to check only for updates. This will make it play nicer with HMA by default, and will also benefit all your users. Before:
After:
Example:
Each call to /api/hashes returns X records, and each subsequent call returns the next X records until you reach the end. Your users only need to check whether there are new entries after the last checkpoint they stored, which reduces the amount of data transmitted. This can be implemented as a simple iteration over the underlying database in order of the index you added.
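To make the checkpoint idea concrete, here is a minimal sketch that simulates the paging in memory. Nothing here is the real TAT API: `RECORDS`, `get_hashes`, `fetch_since`, the `after`/`limit` parameters, and the integer index are all hypothetical stand-ins for whatever the endpoint ends up exposing.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (monotonic index, hash value) -- assumed shape

# Hypothetical stand-in for the server-side table, ordered by index.
RECORDS: List[Record] = [
    (1, "hash-a"), (2, "hash-b"), (3, "hash-c"), (4, "hash-d"), (5, "hash-e"),
]


def get_hashes(after: int = 0, limit: int = 2) -> List[Record]:
    """Simulates one page of /api/hashes: up to `limit` records past `after`."""
    return [r for r in RECORDS if r[0] > after][:limit]


def fetch_since(checkpoint: int, limit: int = 2) -> Tuple[List[Record], int]:
    """Client side: page until empty, returning new records and the new checkpoint."""
    collected: List[Record] = []
    while True:
        page = get_hashes(after=checkpoint, limit=limit)
        if not page:
            return collected, checkpoint
        collected.extend(page)
        checkpoint = page[-1][0]  # remember the last index seen


first_batch, checkpoint = fetch_since(0)              # initial full sync
RECORDS.append((6, "hash-f"))                         # a new record arrives later
second_batch, checkpoint = fetch_since(checkpoint)    # only the new record is sent
```

With a stored checkpoint, the second sync transfers a single record rather than the whole list. Note that this sketch only covers additions; removals would need to appear as indexed entries too for a client to notice them.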
@Dcallies @aokolish Once we have it in place, I think some minor changes will need to be made to this repo. As I recall, we intentionally used no checkpointing, so that will need updating, along with some minor changes in [exchanges/clients/techagainstterrorism](https://github.com/facebook/ThreatExchange/tree/main/python-threatexchange/threatexchange/exchanges/clients/techagainstterrorism) and impl/techagainstterrorism_api.py. I will create a new branch for this issue and also update the tests.
I'm testing out hasher-matcher-actioner + tech against terrorism (TAT) API and noticed a log line from the server:
This leads me to believe that HMA would never pull new hashes from the TAT API. Can this somehow be fixed in the TAT exchange implementation?
Here are their API docs - https://www.terrorismanalytics.org/docs/hash-list-v1
There are probably easier steps to reproduce, but my steps were...