Tweasel open data Datasette instance
Tweasel is a project building infrastructure for detecting and complaining about tracking and privacy violations in mobile apps on Android and iOS. Among other things, we are developing a suite of tools and libraries for automated app analysis and tracking detection, and maintaining a wiki of HTTP endpoints used by tracking companies (for a full overview of what we’re doing, have a look at our documentation).
For our work, we regularly run large-scale traffic analyses on mobile apps. We use this data, for example, to maintain the tracking endpoint adapters of our TrackHAR library. Our goal is to shine a light on how trackers work and what they collect, so we want as many people as possible to research them. We also want to document how and why we concluded what the values transmitted to a tracking endpoint mean, in a way that others can replicate.
We are therefore publishing our datasets as open data for other researchers, activists, and anyone else who is interested in understanding the inner workings of trackers. We hope this lowers the barrier to entry for people who want to start investigating trackers themselves.
Currently, requests from the following datasets are available, of which the first three were collected as part of student research projects at the Institute for Application Security at TU Braunschweig (more details in the
- Do they track? Automated analysis of Android apps for privacy violations (data from January 2021, view requests)
- iOS watching you: Automated analysis of “zero-touch” privacy violations under iOS (data from June to July 2021, view requests)
- Informed Consent? A Study of “Consent Dialogs” on Android and iOS (data from March to April 2022, view requests)
- Worrying confessions: A look at data safety labels on Android (data from September 2022, view requests)
- Traffic collection for TrackHAR adapter work (July 2023) (data from July 2023, view requests)
Note: We have decided to only publish requests to endpoints that are contacted by apps from at least two different vendors, using Apple’s definition for determining the vendor from the app ID. As such, our data is not suited for reverse-engineering internal app APIs.
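A minimal sketch of such a filter, for illustration only. It assumes that the vendor is the app ID without its last dot-separated component, which is one reading of Apple's rule for deriving a vendor from a bundle ID. The function names and the sample data are hypothetical and not taken from our tooling.

```python
from collections import defaultdict


def vendor_of(app_id: str) -> str:
    """Assumption: the vendor is the app ID without its last component."""
    head, _, _ = app_id.rpartition(".")
    # IDs without a dot have no shorter prefix, so they are their own vendor.
    return head or app_id


def shared_endpoints(requests, min_vendors=2):
    """Return endpoints contacted by apps from at least `min_vendors` vendors.

    `requests` is an iterable of (app_id, endpoint_url) pairs.
    """
    vendors_by_endpoint = defaultdict(set)
    for app_id, endpoint in requests:
        vendors_by_endpoint[endpoint].add(vendor_of(app_id))
    return {e for e, v in vendors_by_endpoint.items() if len(v) >= min_vendors}


# Hypothetical sample: two apps from one vendor, one app from another.
sample = [
    ("com.example.shop", "https://tracker.example/collect"),
    ("com.example.maps", "https://tracker.example/collect"),
    ("org.other.game", "https://tracker.example/collect"),
    ("com.example.shop", "https://api.example.com/internal"),
    ("com.example.maps", "https://api.example.com/internal"),
]
published = shared_endpoints(sample)
```

In this sample only the tracker endpoint is kept. The internal API is contacted by two apps, but both belong to the same assumed vendor, so it is dropped.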
We are publishing the data as a Datasette instance, which allows you to interactively explore the full data online, including running arbitrary SQL queries against it. Here are just a few examples of interesting things you can look at:
- the endpoints that were contacted most often or by the most apps
- requests to a particular host, e.g. doubleclick.net, ordered by length
- requests by a particular app, e.g. Airbnb on Android
- the requests setting the most cookies
- requests containing a particular value anywhere, e.g. an advertising ID
- app IDs that exist on both Android and iOS
- requests to endpoints that were only contacted by an app after a consent dialog was accepted
- the longest requests
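Queries like the ones listed above boil down to short SQL statements. The following sketch runs two of them from Python against a tiny in-memory SQLite database. The table name `requests` and the columns `initiator`, `platform`, `host`, `endpointUrl` and `content` are assumptions made for illustration, so check the actual schema in the Datasette interface before adapting them.

```python
import sqlite3

# Assumed, simplified schema with made-up rows; the real tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (initiator TEXT, platform TEXT, host TEXT,"
    " endpointUrl TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?, ?, ?)",
    [
        ("com.example.a", "android", "doubleclick.net", "https://doubleclick.net/ads", "id=1"),
        ("com.example.b", "ios", "doubleclick.net", "https://doubleclick.net/ads", "id=22&x=y"),
        ("com.example.a", "android", "t.example", "https://t.example/v1", "hello"),
        ("com.example.a", "android", "t.example", "https://t.example/v1", "hi"),
        ("com.example.a", "android", "t.example", "https://t.example/v1", "hey"),
    ],
)

# Endpoints ranked by the number of distinct apps that contacted them,
# with the total request count as a tie-breaker.
by_apps = conn.execute(
    "SELECT endpointUrl, COUNT(DISTINCT initiator) AS apps, COUNT(*) AS n_requests"
    " FROM requests GROUP BY endpointUrl ORDER BY apps DESC, n_requests DESC"
).fetchall()

# Requests to one host, longest body first.
longest = conn.execute(
    "SELECT initiator, content FROM requests WHERE host = ?"
    " ORDER BY LENGTH(content) DESC",
    ("doubleclick.net",),
).fetchall()
```

The same `SELECT` statements can be pasted into the SQL editor of the Datasette instance once the table and column names are adjusted.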
Datasette and the plugins we have installed have lots of additional features that you may find helpful. You can for example:
- Copy data in various formats, e.g. some details about the ten latest requests as a Markdown table, CSV, or JSON
- Use the data as a GraphQL API
- Query using regular expressions thanks to the
- Query based on URLs and paths thanks to the
- Query JSON values using
- Download the full SQLite database for local analysis
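Besides the export links in the interface, Datasette answers SQL queries over HTTP as JSON, which is convenient for scripts. The sketch below only builds such a URL and does not send a request. The base URL, the database name `data`, and the table and column names are placeholders for illustration. The `sql` argument, named `:parameters` passed as extra query-string arguments, and `_shape=array` are standard Datasette options.

```python
from urllib.parse import urlencode


def build_query_url(base_url, database, sql, params=None):
    """Build a Datasette JSON API URL for a SQL query with named parameters."""
    query = {"sql": sql, "_shape": "array"}  # _shape=array returns a plain list of rows
    query.update(params or {})
    return f"{base_url.rstrip('/')}/{database}.json?{urlencode(query)}"


url = build_query_url(
    "https://datasette.example",  # placeholder, not the real instance address
    "data",  # assumed database name
    "select endpointUrl from requests where host = :host limit 10",
    {"host": "doubleclick.net"},
)
```

Passing values as named parameters rather than pasting them into the SQL string keeps quoting simple and makes the same query easy to reuse for other hosts.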