User behavior analysis in Open Source


I’ve been nerd-sniped into thinking about telemetry in Open Source. Eugenia certainly has a point: telemetry is a four-letter word in Open Source, and deservedly so:

Most telemetry seems designed to collect data first and figure out what to do with it second. Besides violating a fair number of data privacy laws as well as research ethics, such a strategy is also, quite simply, a dick move.

But it doesn’t have to be that way. What if:

  • Telemetry is an opt-in, per-user system configuration scheme
  • Telemetry is collected for a given purpose
  • Telemetry is collected and stored locally
  • Before telemetry is sent out, the user can review it and decide to send or delete the data

What would that look like, from a user’s perspective? Would that make it more acceptable in FOSS environments to collect data on user behavior to guide further development? Could it set a standard to be followed by other parties (one can dream)?

The UI would expose some kind of notification („Would you like to help improve the user experience? [Read on]“) that opens into the telemetry dashboard, which comes with an explanation of how things work.

The user can dismiss the notification, opt in, opt out, or drill into details (e.g. opt in to or out of data collection for specific apps, or see individual data collections).

Once the user has opted in, applications can register data collections (what Firefox would call an experiment). They’re scoped to the application/library and can declare

  • a time frame (start of collection, end of collection),
  • a maximum duration (max. x days within the time frame), and
  • a maximum submission date,

and must come with

  • an explanation of the purpose: what are you trying to figure out with that data? E.g. whether a feature that currently takes 5 mouse clicks to reach is used more often than a toolbar button that is never clicked and just wastes screen real estate, and should therefore be made more easily accessible,
  • a separate explanation of the data being collected (e.g. the frequency, invocation method (menu, toolbar, or shortcut), and grouping of task invocations in the application), and
  • a predefined submission URL to which the data will be sent, if approved.
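
To make this concrete, here’s a minimal sketch of what such a registration could look like, assuming a JSON declaration file in the data collection area; every field name and value is invented for illustration, not part of any existing schema:

```python
import json
from pathlib import Path

# Hypothetical declaration of a data collection; all field names are
# invented for illustration, there is no actual schema (yet).
declaration = {
    "app": "exampleapp",
    "experiment": "toolbar-usage",
    "timeframe": {"start": "2025-06-01", "end": "2025-08-31"},
    "max_duration_days": 14,
    "max_submission_date": "2025-09-30",
    "purpose": "Find out whether features buried in menus are used more "
               "often than toolbar buttons that are never clicked.",
    "data_description": "Frequency, invocation method (menu, toolbar, "
                        "shortcut) and grouping of task invocations.",
    "submission_url": "https://research.example.org/submit",
}

# Register it in the per-user data collection area (path taken from the
# footnote below, also just an example).
base = (Path.home() / ".local/state/data-collection"
        / declaration["app"] / declaration["experiment"])
base.mkdir(parents=True, exist_ok=True)
(base / "declaration.json").write_text(json.dumps(declaration, indent=2))
```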

Data collection is done locally, by storing the data somewhere in the home directory¹ in a standardized format (JSON?) with some broad rules for the structure.
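
The „broad rules“ might amount to little more than a list of timestamped event records. A minimal sketch, again with invented field names and using the example path from the footnote:

```python
import json
import time
from pathlib import Path

def record_event(app: str, experiment: str, event: dict) -> None:
    """Append one event record to the local, user-owned collection.

    Nothing leaves the machine here; the file only grows under
    $HOME/.local/state/data-collection/ until the user reviews it.
    """
    base = Path.home() / ".local/state/data-collection" / app / experiment
    base.mkdir(parents=True, exist_ok=True)
    events_file = base / "events.json"
    events = json.loads(events_file.read_text()) if events_file.exists() else []
    events.append({"timestamp": time.time(), **event})
    events_file.write_text(json.dumps(events, indent=2))

# Example: the application notes that a feature was reached via the menu.
record_event("exampleapp", "toolbar-usage",
             {"action": "export-pdf", "method": "menu"})
```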

Completed data collections are reported with another file in the data collection area and exposed to the user via a notification bubble, or a reminder shown through .profile („You have 5 completed data collections. Run review-data-collections to decide what to do with them.“), …

The user can then revisit the data collection, getting a summarized representation of the data (that’s what the broad rules on structure are for) and an option to see the raw data (e.g. in some JSON browser, from less(1) to jq(1) to something graphical).
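
A review-data-collections helper (the name taken from the reminder above) could be as small as this sketch, which only counts events per collection and points at the raw file; the event structure is assumed from the sketches above:

```python
import json
from collections import Counter
from pathlib import Path

base = Path.home() / ".local/state/data-collection"

# Print a coarse per-collection summary; the broad rules on structure are
# what make a generic summary like this possible at all.
for events_file in base.glob("*/*/events.json"):
    app, experiment = events_file.parts[-3], events_file.parts[-2]
    events = json.loads(events_file.read_text())
    by_method = Counter(e.get("method", "unknown") for e in events)
    print(f"{app}/{experiment}: {len(events)} events, by method: {dict(by_method)}")
    print(f"  raw data: {events_file} (inspect with less, jq, ...)")
```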

The user can decide to delete the data or to send it. Sending implies consent for exactly the purposes described in the data collection’s explanations, nothing more, nothing less.

Submission is not done by the application itself, both because applications that don’t otherwise need network access shouldn’t have to pull in networking code solely for this task, and to demonstrate that they have „nothing up their sleeve“ (e.g. adding more data to the collection during transmission). It’s merely a POST request to a pre-specified URL with data that the user can inspect upfront.
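
The submission tool could accordingly be very small. This sketch assumes the declaration and events files from the earlier sketches and just POSTs the reviewed bytes to the pre-declared URL:

```python
import json
import urllib.request
from pathlib import Path

def submit(collection_dir: Path) -> int:
    """POST one reviewed data collection to its pre-declared URL."""
    declaration = json.loads((collection_dir / "declaration.json").read_text())
    payload = (collection_dir / "events.json").read_bytes()
    request = urllib.request.Request(
        declaration["submission_url"],
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Exactly the bytes the user inspected are sent; nothing is added here.
    with urllib.request.urlopen(request) as response:
        return response.status
```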

By having all such collections in one place, many tools can run experiments, and users can delay submission and batch-send a whole set of tools’ data collections at a time when they’re comfortable looking into this (as much as they want to).

There are a bunch of details to work out, for example:

  • With a sufficiently rigid data collection schema, users could pre-declare what kind of behavioral information they’re comfortable sharing and have the system automatically opt out of data collections that don’t pass their criteria (a sketch follows after this list),
  • What kind of demographic data should be asked for centrally (by the data collection facility) and how could data collections request it? For UX research it’s sometimes useful (or even crucial) to know about color blindness, for example, but should such data be stored indefinitely on the system? Asked on submission every time (e.g. via some „extra questions“ configuration in the data collection description)?
  • Should users be able to subscribe to data collection block lists, and who would run them? E.g. GNOME or Debian maintaining a list of data collections that they consider malicious, keyed by submission URL? Or maybe opt-in through an intermediary, so that distros and/or projects can set up agreements with researchers and „bless“ submission URLs for data collections going through their infrastructure for them? („KDE guarantees that data collections that submit to research.kde.org follow this set of rules“)
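
For the first point, a user-side policy file matched against incoming declarations might be enough. A sketch with an entirely made-up rule format, assuming declarations also tag the categories of data they collect (a field the sketch above doesn’t have):

```python
from urllib.parse import urlparse

# Hypothetical user policy: which kinds of behavioral data the user is
# willing to share, and which submission hosts they trust.
policy = {
    "allowed_data": ["invocation-frequency", "invocation-method"],
    "blocked_data": ["text-input", "filenames"],
    "trusted_submission_hosts": ["research.kde.org"],
}

def acceptable(declaration: dict) -> bool:
    """Auto-decline declarations that don't pass the user's criteria."""
    host = urlparse(declaration["submission_url"]).hostname
    if host not in policy["trusted_submission_hosts"]:
        return False
    requested = set(declaration.get("data_categories", []))
    if requested & set(policy["blocked_data"]):
        return False
    return requested <= set(policy["allowed_data"])
```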

Or is this all a non-starter?

This post might be edited a couple of times as I gain more clarity myself on the issue, and also in response to feedback (if any). Feedback is welcome at @patrick@retro.social or patrick@georgi-clan.de.

  1. e.g. $HOME/.local/state/data-collection/$appname/$experimentname ↩︎
