The Self Data Backfill Guide can be used to import historical data into Amplitude.
Table of Contents
- Things to Consider
- Instructions for the Backfill
- Instructions for Preexisting Users Backfill
Things to Consider
- Keep historical data separate. We recommend keeping historical data in a separate Amplitude project. Not only does this make the upload easier, it also keeps your live Amplitude data clean and focused on current data going forward. Chances are you will not check historical data often, but when you do, it will still be easily accessible.
- Connecting user data between two data sets. If you want to connect historical data with current data, combine the historical and live data in the same project. You must also have a common ID shared between both data sets; in Amplitude's system, this is the user_id field. Set the user_id of your historical data to the user_id in your current data so the two connect (see the sketch after this list).
- The new user count may change. Amplitude defines a new user based on the earliest event timestamp that Amplitude sees for a given user. As a result, if a user is recorded as new on 6/1/15 and data is backfilled for the user on 2/1/15, the user will then be reflected as new on 2/1/15. Read below for instructions on backfilling users.
- Current application data may be compromised. If there is a mismatch between the Amplitude user_id and the backfilled user_id, then we interpret the two distinct user_ids as two distinct users. As a result, users will be double counted. Since we cannot delete data once it has been recorded, you may have to create a new application altogether to eliminate any data issues.
- Understand how Amplitude identifies unique users. We use the device_id and user_id fields to compute the amplitude_id; see Amplitude's documentation on tracking unique users for details.
- Monthly event limit. Each event backfilled counts toward your monthly event volume.
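To make the user_id point concrete, here is a hypothetical pair of events (the field names follow Amplitude's HTTP API; the IDs and timestamps are illustrative, using the dates from the new-user example above). Because both events carry the same user_id, Amplitude connects them to a single user:

```python
# A live event sent by your app today and a historical event you backfill
# connect to the same user only if they share the same user_id.
live_event = {
    "user_id": "user-12345",
    "event_type": "purchase",
    "time": 1433116800000,    # 6/1/15 in ms since epoch
}
historical_event = {
    "user_id": "user-12345",  # same user_id as the live data
    "event_type": "purchase",
    "time": 1422748800000,    # 2/1/15 in ms since epoch
}
```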
Instructions for the Backfill
- Review the documentation for the HTTP API.
- Understand which fields you want to send and map your historical data to our fields. We highly recommend using the insert_id field so that we can deduplicate events (the upload sketch after this list shows one way to set it).
- Create a test project in Amplitude and send sample data from your backfill to it. Run several tests on a few days' worth of data in a separate Amplitude project before the final upload to the production project. This way, you can review the results and make sure everything looks right. IMPORTANT NOTE: If you mess up the import to your production project, there is no way for us to "undo" the upload.
- Limit your upload to 100 batches/sec and 1,000 events/sec. You can batch events into a single upload, but we recommend sending no more than 10 events per batch; at 10 events per batch, the 100 batches/sec ceiling keeps you within the 1,000 events/sec limit.
- In your upload, retry failed requests aggressively and use high timeouts (see the upload sketch after this list).
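The following is a minimal sketch of such an upload in Python, assuming the HTTP API V2 endpoint (https://api.amplitude.com/2/httpapi) and an iterable historical_rows of dicts with user_id, event_type, and timestamp_ms keys. The field mapping and helper names are illustrative; they are not part of the official sample scripts linked below.

```python
import hashlib
import time
import requests

API_URL = "https://api.amplitude.com/2/httpapi"  # HTTP API V2 endpoint
API_KEY = "YOUR_API_KEY"  # placeholder: use your test project's key first

def make_insert_id(row):
    """Derive a deterministic insert_id so a re-run of the script sends
    identical IDs and Amplitude can deduplicate the events."""
    key = "{}:{}:{}".format(row["user_id"], row["event_type"], row["timestamp_ms"])
    return hashlib.sha256(key.encode()).hexdigest()

def send_batch(events, max_retries=5):
    """POST one batch, retrying aggressively with a high timeout."""
    payload = {"api_key": API_KEY, "events": events}
    for attempt in range(max_retries):
        try:
            resp = requests.post(API_URL, json=payload, timeout=60)
            if resp.status_code == 200:
                return
        except requests.RequestException:
            pass  # network error: fall through to backoff and retry
        time.sleep(2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError("batch failed after {} attempts".format(max_retries))

def backfill(historical_rows):
    """Upload rows in batches of 10, staying under ~100 batches/sec."""
    batch = []
    for row in historical_rows:
        batch.append({
            "user_id": row["user_id"],      # must match your live user_id scheme
            "event_type": row["event_type"],
            "time": row["timestamp_ms"],    # historical timestamp, ms since epoch
            "insert_id": make_insert_id(row),
        })
        if len(batch) == 10:                # no more than 10 events per batch
            send_batch(batch)
            batch = []
            time.sleep(0.01)                # ~100 batches/sec ceiling
    if batch:
        send_batch(batch)
```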
If you send data that has a timestamp of 30 days or older, it can take up to 48 hours to appear in some parts of our system (e.g. Daily Active Users), so do not be alarmed if you do not see everything right away. You can use the Events Summary page to check the events that you are sending as that updates in real-time regardless of the time of the event.
Sample scripts for data import: https://gist.github.com/djih/2a7e7fb2c1d45c8277f7aef64b682ed6
Instructions for Preexisting Users Backfill
If you have preexisting users, backfill them so that Amplitude accurately marks when they were new. Amplitude marks a user as new based on the timestamp of their earliest event.
To backfill your preexisting users, use our HTTP API to send a "dummy event" or a signup event where the timestamp of the event is the time of when the user was actually new. For instance, if a user signed up on Aug 1st, 2013, the timestamp of the event you send would be Aug 1st, 2013.
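As a sketch (again assuming the HTTP API V2 endpoint; the user_id, event name, and insert_id below are placeholders), backfilling that user could look like:

```python
import requests
from datetime import datetime, timezone

signup_time = datetime(2013, 8, 1, tzinfo=timezone.utc)  # when the user was actually new
event = {
    "user_id": "user-12345",
    "event_type": "signup",                       # a signup or "dummy" event type
    "time": int(signup_time.timestamp() * 1000),  # Aug 1st, 2013 in ms since epoch
    "insert_id": "backfill-signup-user-12345",    # deterministic, safe to re-send
}
requests.post(
    "https://api.amplitude.com/2/httpapi",
    json={"api_key": "YOUR_API_KEY", "events": [event]},
    timeout=60,
)
```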