Suggestion: Regex-based inbound data filters #482

GeorchW · 2024-05-15T08:26:59Z

Being able to filter the inbound data has been the most popular feature request for a very long time and comes up again and again in the issues of different repositories. In the status quo, some watchers (e.g. the window watcher) have their own filtering, but they have various problems:

Discoverability: The configuration is so hidden that even the developers forget that it exists (no offense). Even when it's exposed as a config option, the user still has to find it somewhere in the docs and then edit the configuration file to set it up correctly.
Consistency: The user needs to look into the docs for each watcher to see whether it supports data filtering and to find out how it's configured.
Limited expressiveness: For the window watcher, it's only possible to remove all window titles. Many people say they only want to exclude some sensitive information (e.g. mail subjects). In my case, I'm frequently using an app that shows a timer in the window title, creating a new entry every second, which makes the timeline barely readable and very unresponsive.

Suggestion

We could add regex-based filtering on the heartbeat level: whenever a heartbeat comes in, it's checked against a set of user-configurable regexes. If one matches on any field of the entry, the entry is discarded. We could also extend this feature to allow regex-replacing entries or matching only some fields of the JSON entry.
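To make the proposal concrete, here is a minimal sketch, not an existing ActivityWatch API. It assumes a heartbeat is a dict whose `data` field holds string values such as `app` and `title`. The names `DISCARD_PATTERNS`, `REPLACEMENTS` and `filter_heartbeat`, and the example patterns, are hypothetical.

```python
import re
from typing import Optional

# Hypothetical user-configured rules; not part of any ActivityWatch config.
DISCARD_PATTERNS = [re.compile(r"(?i)inbox|password")]
REPLACEMENTS = [(re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"), "<timer>")]


def should_discard(event: dict) -> bool:
    """Return True if any string field in event["data"] matches a discard pattern."""
    for value in event.get("data", {}).values():
        if isinstance(value, str) and any(p.search(value) for p in DISCARD_PATTERNS):
            return True
    return False


def apply_replacements(event: dict) -> dict:
    """Return a copy of the event with every replacement applied to string fields."""
    data = {}
    for key, value in event.get("data", {}).items():
        if isinstance(value, str):
            for pattern, repl in REPLACEMENTS:
                value = pattern.sub(repl, value)
        data[key] = value
    return {**event, "data": data}


def filter_heartbeat(event: dict) -> Optional[dict]:
    """Drop the event if a discard pattern matches, otherwise scrub and keep it."""
    if should_discard(event):
        return None
    return apply_replacements(event)
```

With these example rules, a heartbeat titled "Inbox - Re: salary" would be dropped, while "Pomodoro 12:34 left" would be stored as "Pomodoro <timer> left", so a ticking timer no longer creates a new entry every second.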

Similar inbound data filters can be found in e.g. Sentry.

I could probably implement this myself, at least in Python and the Vue frontend, and probably in Rust as well, but I'd like to know whether the approach is welcome in the first place. Tbh, if it isn't, I'd consider writing a simple proxy server that does exactly this: applying some replacements to the heartbeat endpoint and passing everything else through.
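The core of such a proxy could look roughly like the sketch below. It only covers the body-rewriting decision, not the HTTP forwarding. It assumes heartbeats arrive as POST requests whose path ends in `/heartbeat` with a JSON body containing a `data` dict. That shape, the function name `rewrite_body` and the timer rule are assumptions for illustration.

```python
import json
import re
from urllib.parse import urlsplit

# Hypothetical replacement rule: collapse a running timer in window titles.
TIMER = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")


def rewrite_body(method: str, url: str, body: bytes) -> bytes:
    """Scrub heartbeat bodies; return every other request body unchanged."""
    is_heartbeat = method == "POST" and urlsplit(url).path.endswith("/heartbeat")
    if not is_heartbeat:
        return body  # pass everything else through untouched
    event = json.loads(body)
    data = event.get("data", {})
    for key, value in list(data.items()):
        if isinstance(value, str):
            data[key] = TIMER.sub("<timer>", value)
    return json.dumps(event).encode("utf-8")
```

A real proxy would call this on each incoming request before forwarding it to the actual server and would relay the response unmodified.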


ErikBjare · 2024-05-19T07:24:44Z

The window watcher recently had another PR merged for filtering window titles by regex on the client-side before sending: ActivityWatch/aw-watcher-window#99

I don't like the idea of server-side data filters (ideally it'd happen already on the client), but totally agree about discoverability/ease of configuration. This could be addressed with the server-side settings that are in the recent betas. Watchers could fetch the server's filter settings (which could be configured in the Settings view) and filter before sending, just like in the PR above.
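A rough sketch of that flow from the watcher's side follows. The settings URL, the key `exclude_title_patterns` and the choice to skip a matching heartbeat entirely are all assumptions for illustration. None of them is confirmed by this thread or taken from the linked PR.

```python
import json
import re
from typing import Callable, Optional
from urllib.request import urlopen

# Assumed URL and key; neither is confirmed by this thread.
SETTINGS_URL = "http://localhost:5600/api/0/settings"
FILTER_KEY = "exclude_title_patterns"  # hypothetical settings key


def fetch_settings(url: str = SETTINGS_URL) -> dict:
    """Fetch the server-side settings as a dict (requires a running server)."""
    with urlopen(url, timeout=5) as response:
        return json.load(response)


def load_patterns(fetch: Callable[[], dict] = fetch_settings) -> list:
    """Compile the regexes stored under FILTER_KEY; a missing key means no filtering."""
    return [re.compile(p) for p in fetch().get(FILTER_KEY, [])]


def prepare_heartbeat(title: str, patterns: list) -> Optional[str]:
    """Return the title to send, or None if the watcher should skip this heartbeat."""
    if any(p.search(title) for p in patterns):
        return None
    return title
```

Passing the fetch function as a parameter keeps the filtering logic usable without a running server, e.g. `load_patterns(lambda: {"exclude_title_patterns": ["(?i)private"]})`.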

I think your plan sounds great, PRs welcome!

GeorchW · 2024-05-20T10:00:05Z

What's your argument for client-side data filtering exactly? I don't see the point of implementing regex filtering in each client again and again tbh. I think even performance-wise it would be nicer to have a single efficient implementation in Rust.

I see that there is a bit of overhead involved with sending the full window title to the server, but after all, the communication is happening on localhost, where the bandwidth is basically unlimited, and we're talking about much less than 1 kByte/s.

I'm not sure if there's any privacy advantage in filtering earlier either. I see that there is some possibility of MITM'ing the server, but I don't think it's very likely that such an attack happens -- and if it does, the early filtering is not a sufficient privacy guarantee at all. On the other hand, it's much more likely that users want to show their timeline to others, but want to make sure that some data will never be visible there. Depending on what each watcher implements, they might not have the ability to do so.

Implementing it in a central place also makes it much easier to iterate on the design, e.g. when adding replacements.

I see that different watchers provide different fields on which the regexes could be applied, so they might want to have some control over the way the filtering works. But then again, the categories work the same way.

Speaking of which, I think by having the filtering implemented in the server, it could deliver a much nicer user experience when setting it up, since it could preview the changes it would have applied if it were active in the past. We could even re-use the implementation for data scrubbing.
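The preview idea can be sketched independently of where it ends up running. The snippet below assumes past events have already been fetched as dicts with an `id` and a `data` dict. The function name `preview_scrub` and the event shape are assumptions for illustration.

```python
import re


def preview_scrub(events: list, pattern: str, field: str = "title") -> list:
    """Return (event id, current value) pairs for events the pattern would affect."""
    regex = re.compile(pattern)
    affected = []
    for event in events:
        value = event.get("data", {}).get(field)
        if isinstance(value, str) and regex.search(value):
            affected.append((event.get("id"), value))
    return affected
```

Nothing is modified here. The returned list is what a settings page could show the user before a filter is saved or a scrub is actually run.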

ErikBjare · 2024-05-20T10:53:29Z

> What's your argument for client-side data filtering exactly?

It feels wrong to send potentially sensitive information (even if locally) only for it to be discarded.

> I think even performance-wise it would be nicer to have a single efficient implementation in Rust.

Performance is not a concern as regexes are fast in any language and the strings involved are short. Most people are still using aw-server-python (default) and we are keeping them at feature-parity, so there'd be no "single implementation" anyway.

> Implementing it in a central place also makes it much easier to iterate on the design, e.g. when adding replacements.

It's practically already implemented in aw-watcher-window.

imo there's very little to iterate on here. The design is clear, just need to add a setting for it in aw-webui and make the watcher respect it.

> I think by having the filtering implemented in the server, it could deliver a much nicer user experience when setting it up, since it could preview the changes it would have applied if it were active in the past.

I don't see how it would affect the user experience in any way. None of those things require filtering implemented in the server.

> We could even re-use the implementation for data scrubbing.

Data scrubbing with previews would be purely a UI feature in aw-webui using the existing API, no changes needed to the server.

> On the other hand, it's much more likely that users want to show their timeline to others, but want to make sure that some data will never be visible there.

This seems like a different but similar feature, where you want some sensitive data stored (not filtered in the first place), but you want it hidden/masked for the purpose of sharing/screenshots (prob what I want instead of a filter). Seems like another purely UI feature. Already stored data matching the filter expression could be hidden/masked by default in the UI.

GeorchW · 2024-05-20T11:48:30Z

Ok, I think I'll just write myself a proxy for scrubbing then

2e3s · 2024-10-06T15:00:38Z

> It feels wrong to send potentially sensitive information (even if locally) only for it to be discarded.

For perspective, we had a similar discussion with a different resolution: #302 (comment)

ErikBjare · 2024-10-07T15:25:51Z

That's fair, I think that's a good take.

I do think clients should be able to check server configuration, via the settings endpoint, and respect filters set there (i.e. from web UI).

But it might get messy. It'd be problematic if we want filters to apply generally: every watcher would have to be changed to opt in by respecting the filter (potentially with different regex engines etc., not great).

I guess we can do both. Have a setting, let privacy-aware watchers filter before sending, and let the server double-filter.

In addition to letting users filter events to avoid saving them, I've been meaning to add "hidden" categories that are hidden by default in visualizations and either replaced with "Private" or hidden altogether. For the sensitive stuff people want to keep but not see (by default).
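As a rough illustration of the "hidden" categories idea, the sketch below masks events at display time and leaves the stored data untouched. It assumes events have already been annotated with a `category` list. That field, the function name `mask_hidden` and the `drop` flag are hypothetical.

```python
def mask_hidden(events: list, hidden: set, drop: bool = False) -> list:
    """Mask or drop events in hidden categories; the input events are not modified."""
    shown = []
    for event in events:
        if tuple(event.get("category", ())) in hidden:
            if drop:
                continue  # hide the event altogether
            # Replace the details with a placeholder, keeping the other event fields.
            event = {**event, "data": {"title": "Private"}}
        shown.append(event)
    return shown
```

Because the masking happens on copies, toggling a "show hidden" switch in a visualization would only mean skipping this step.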
