The California Consumer Privacy Act (CCPA) requires our customers to provide, on request, all data held about an end user. This Data Subject Access Request (DSAR) API makes it easy to retrieve all data about a user.

Information

Authentication uses HTTP basic auth and requires an org-level API key. To request an org-level API key, please submit a ticket to our support team at support.amplitude.com.
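For example, with Python's requests library (the credential values here are placeholders):

import requests

# Org-level API key and secret, passed as HTTP basic auth on every /dsar call.
API_KEY = 'YOUR_ORG_API_KEY'
SECRET_KEY = 'YOUR_ORG_SECRET_KEY'
auth = (API_KEY, SECRET_KEY)   # e.g. requests.get(url, auth=auth)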

Due to the potential data volume, this API works asynchronously. Start by making a POST request to /dsar/requests, which returns a requestId. Then make GET requests to check the status of the job. When the job is done, the GET returns a list of URLs from which to fetch the data files.
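As a sketch of the exchange, reusing the auth tuple from the snippet above (the requestId, status, and urls field names match the example client at the end of this page; the literal values are hypothetical):

# Start the job.
r = requests.post('https://amplitude.com/api/2/dsar/requests',
                  auth=auth,
                  data={"amplitudeId": 123456789,
                        "startDate": "2019-03-01",
                        "endDate": "2020-04-01"})
r.json()  # => {"requestId": "SOME_REQUEST_ID"}  (hypothetical value)

# Check the job's status; repeat until it is "done" (or "failed").
r = requests.get('https://amplitude.com/api/2/dsar/requests/SOME_REQUEST_ID',
                 auth=auth)
r.json()  # => {"status": "done", "urls": [...]}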

Expect on the order of 1 file per month per project for which the user has data. Each URL requires the same auth credentials to access and will redirect to a pre-signed URL for a file in S3. For more information on pre-signed URLs, see Amazon's Docs.
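Continuing the sketch, downloading one of the returned URLs is an ordinary authenticated GET that follows the redirect (the local filename here is arbitrary):

# url is one entry from the returned "urls" list.
r = requests.get(url, auth=auth, allow_redirects=True)
with open('dsar-file.gz', 'wb') as f:
    f.write(r.content)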

Because the API is asynchronous, you must poll to check the status of the request. See the Rate Limits section to help select the appropriate polling rate.

Output

Each file will be gzipped, and the contents will adhere to the following rules:

  • One line per event
  • Each line is a JSON object
  • No order guarantee

Example Output

{"amplitude_id":123456789,"app":12345,"event_time":"2020-02-15 01:00:00.123456","event_type":"first_event","server_upload_time":"2020-02-18 01:00:00.234567"}
{"amplitude_id":123456789,"app":12345,"event_time":"2020-02-15 01:00:11.345678","event_type":"second_event","server_upload_time":"2020-02-18 01:00:11.456789"}
{"amplitude_id":123456789,"app":12345,"event_time":"2020-02-15 01:02:00.123456","event_type":"third_event","server_upload_time":"2020-02-18 01:02:00.234567"}

Rate Limits

All APIs under /dsar share a budget of 14,400 "cost" per hour. POSTs cost 8 and GETs cost 1. Requests beyond this budget receive 429 response codes.

In general, for each POST there will be about 1 output file per month per project in which the user has events. For example, if you are fetching 13 months of data for a user with data in 2 projects, expect ~26 files.

If you need to get data for 40 users per hour, you can spend 14400 / 40 = 360 cost per request. Conservatively allocating 52 GETs for output files (2x the previously computed amount) and 8 for the initial POST leaves 360 - 8 - 52 = 300 status polls per request. Given the 5-day SLA for results, this allows for checking the status every (5 × 24 × 60) / 300 = 24 minutes over 5 days. A practical approach might be a service that runs every 30 minutes, posting 20 new requests and checking the status of all outstanding requests.
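The same arithmetic as a sketch, using the numbers worked through above:

HOURLY_BUDGET = 14_400          # shared budget for all /dsar APIs, per hour
POST_COST, GET_COST = 8, 1

users_per_hour = 40
budget_per_user = HOURLY_BUDGET // users_per_hour          # 360
file_gets = 2 * 26 * GET_COST                              # 52: 2x the ~26 expected files
status_polls = budget_per_user - POST_COST - file_gets     # 300
minutes_per_poll = (5 * 24 * 60) / status_polls            # 24 minutes over the 5-day SLA

print(budget_per_user, status_polls, minutes_per_poll)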

SLAs

  • Requests will complete within 5 days
  • Request results expire after 2 days
  • See the Rate Limits section for request limits
  • Users with 100k+ events per month are not supported

Example Client Implementation

import sys
import time

import requests

# Placeholder configuration: fill in credentials, target user, and output location.
API_KEY = 'YOUR_ORG_API_KEY'
SECRET_KEY = 'YOUR_ORG_SECRET_KEY'
AMPLITUDE_ID = 123456789
OUTPUT_DIR = '.'
POLL_DELAY = 60          # seconds before the first status check
POLL_INTERVAL = 60 * 30  # seconds between status checks (see Rate Limits)

base_url = 'https://amplitude.com/api/2/dsar/requests'

payload = {
  "amplitudeId": AMPLITUDE_ID,
  "startDate": "2019-03-01",
  "endDate": "2020-04-01"
}

# Start the asynchronous DSAR job.
r = requests.post(base_url, auth=(API_KEY, SECRET_KEY), data=payload)
request_id = r.json().get('requestId')

# Poll until the job finishes or fails.
time.sleep(POLL_DELAY)
while True:
    r = requests.get(f'{base_url}/{request_id}', auth=(API_KEY, SECRET_KEY))
    response = r.json()
    if response.get('status') == 'failed':
        sys.exit(1)
    if response.get('status') == 'done':
        break
    time.sleep(POLL_INTERVAL)

# Download each output file; each URL redirects to a pre-signed S3 URL.
for url in response.get('urls'):
    r = requests.get(url, auth=(API_KEY, SECRET_KEY), allow_redirects=True)
    index = url.split('/')[-1]
    filename = f'{AMPLITUDE_ID}-{index}.gz'
    with open(f'{OUTPUT_DIR}/{filename}', 'wb') as f:
        f.write(r.content)