The Self Data Backfill Guide can be used to import historical data into Amplitude.
Table of Contents
- Things to Consider
- Instructions for the Backfill
- Data Ingestion System
- Instructions for Pre-existing Users Backfill
Things to Consider
- Keep historical data separate. We recommend keeping historical data in a separate Amplitude project and not backfilling historical data into a live production project. Not only does it make the upload easier, but it keeps your live Amplitude data clean and keeps you focused on current data and moving forward. Chances are, you are not going to check historical data that often, but when you do it is still easily accessible. In addition, historical user property values would also get processed and would overwrite current live values. Our ingestion system would then sync the out-of-date property values onto new live events coming in. You can read more about our data ingestion system below.
- Connecting user data between two data sets. If you want to connect historical data with current data, then you should combine the historical data and live data in the same project. You must also have a common ID shared between both sets of data, which is the User ID field in Amplitude’s system. You will need to set the User ID of your historical data to the User ID in your current data if you want the data to connect.
- The new user count may change. Amplitude defines a new user based on the earliest event timestamp that Amplitude sees for a given user. As a result, if a user is recorded as new on 6/1/15 and data is backfilled for the user on 2/1/15, the user will then be reflected as new on 2/1/15. Read below for instructions on backfilling users.
- Current application data may be compromised. If there is a mismatch between the Amplitude User ID and the backfilled User ID, then we interpret the two distinct User IDs as two distinct users. As a result, users will be double counted. Since we cannot delete data once it has been recorded, you may have to create a new project altogether to eliminate any data issues.
- Understand how Amplitude identifies unique users. We use the Device ID and User ID fields to compute the Amplitude ID. Read more here.
- Monthly event limit. Each event backfilled counts toward your monthly event volume.
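The effect of a backfill on the new user date can be illustrated with a small sketch. It assumes only what is stated above, namely that a user is marked new at the earliest event timestamp seen for them; the function name `first_seen` and the sample dates are hypothetical.

```python
from datetime import datetime, timezone


def first_seen(event_times):
    """Return the earliest timestamp, which is what marks a user as new."""
    return min(event_times)


# Live data: the user first appears on 6/1/15.
live = [
    datetime(2015, 6, 1, tzinfo=timezone.utc),
    datetime(2015, 6, 3, tzinfo=timezone.utc),
]
# Backfilled data: one historical event from 2/1/15.
backfilled = [datetime(2015, 2, 1, tzinfo=timezone.utc)]

new_before = first_seen(live)               # new user date before the backfill
new_after = first_seen(live + backfilled)   # new user date after the backfill
```

After the backfill, the earliest timestamp moves to the historical event, so the user would be counted as new on the earlier date.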
Instructions for the Backfill
- Review the documentation for the HTTP API.
- Understand which fields you want to send and map your historical data to our fields. We highly recommend that you use the insert_id field so that we can deduplicate events.
- Create a test project in Amplitude where you will send sample data from your backfill. Run several tests on a few days' worth of data in this separate project before the final upload to the production project. This way, you can inspect the results and confirm that everything looks correct. IMPORTANT NOTE: If the import to your production project goes wrong, there is no way for us to "undo" the upload.
- Limit your upload to 100 batches/sec and 1000 events/sec. You can batch events into an upload, but we recommend sending no more than 10 events per batch; at 100 batches/sec, that keeps you within the 1000 events/sec limit. You will also be throttled if you send more than 10 events/sec for a single Device ID. The following is a guideline for our recommended way of backfilling large amounts of data:
  1. Read a large number of events from your system.
  2. Partition those events into requests based on device_id or user_id.
  3. Send your requests concurrently/in parallel to Amplitude.
- To optimize the above process further, you can also do the following:
  - Break up the set of events into mini non-overlapping sets (for example, partition by device_id).
  - Have 1 worker per set of events executing steps 1-3.
- In your upload, retry aggressively with high timeouts, and keep retrying until you receive a 200. If you send an insert_id, then we will remove duplicate events on our end, provided the duplicates are sent within 7 days of each other.
If you send data with a timestamp that is 30 days old or older, then it can take up to 48 hours to appear in some parts of our system, so do not be alarmed if you do not see everything right away. You can use the User Activity tab to check the events that you are sending, as that updates in real time regardless of the time of the event.
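The field-mapping step and the insert_id recommendation above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the input record keys (`occurred_at`, `action`, `properties`) and the function name are hypothetical, the hashing scheme for insert_id is one possible choice, and naive timestamps are assumed to be UTC. The output keys follow the HTTP API event fields; check the HTTP API documentation for the full list.

```python
import hashlib
import json
from datetime import datetime, timezone


def to_amplitude_event(row):
    """Map one historical record (hypothetical schema) to an event dict."""
    ts = datetime.fromisoformat(row["occurred_at"])
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are stored in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    time_ms = int(ts.timestamp() * 1000)  # event time in milliseconds since epoch

    # Deterministic insert_id: re-sending the same record produces the same
    # value, so a retried upload can be deduplicated.
    raw = json.dumps([row["user_id"], row["device_id"], row["action"], time_ms])
    insert_id = hashlib.sha256(raw.encode("utf-8")).hexdigest()

    return {
        "user_id": row["user_id"],
        "device_id": row["device_id"],
        "event_type": row["action"],
        "time": time_ms,
        "event_properties": row.get("properties", {}),
        "insert_id": insert_id,
    }


sample_row = {
    "user_id": "datamonster",
    "device_id": "device-1",
    "action": "View Page A",
    "occurred_at": "2015-02-01T12:00:00",
}
sample_event = to_amplitude_event(sample_row)
```

Deriving the insert_id from the record itself, rather than generating a random value per attempt, is what makes retries safe to repeat.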
Sample scripts for data import: https://gist.github.com/djih/2a7e7fb2c1d45c8277f7aef64b682ed6
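The partitioning guideline from the instructions above can be sketched as follows. The function names are hypothetical and the events are assumed to be dicts with `device_id` and `time` keys; the batch size of 10 is the recommended ceiling stated in this guide.

```python
from collections import defaultdict

MAX_EVENTS_PER_BATCH = 10  # recommended ceiling from this guide


def partition_by_device(events):
    """Group events so that each device's events stay together."""
    groups = defaultdict(list)
    for event in events:
        groups[event["device_id"]].append(event)
    return groups


def chunk(events, size=MAX_EVENTS_PER_BATCH):
    """Yield consecutive slices of at most `size` events."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def plan_requests(events):
    """Return a list of batches, each holding one device's events in time order."""
    requests = []
    for device_events in partition_by_device(events).values():
        ordered = sorted(device_events, key=lambda e: e["time"])
        requests.extend(chunk(ordered))
    return requests
```

Each device's batches could then be handed to a single worker, so that different devices upload in parallel while one device's events are still sent in order and at a controlled rate.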
Data Ingestion System
In Amplitude's ingestion system, each user's current user properties are always being tracked and are synced to that user's incoming events. This diagram details the process of user property syncing. When sending data to Amplitude, customers will either be sending event data or will be sending identify calls to update a user's user properties. These identify calls will update a user's current user property values and will affect the properties being synced to subsequent events received after the identify call. For example, suppose user Datamonster currently has one user property, 'color', and it is set to 'red'. Then, Datamonster logs a 'View Page A' event and triggers an identify that sets 'color' to 'blue'. Afterwards, they log a 'View Page B' event:
logEvent -> 'View Page A'
identify -> 'color':'blue'
logEvent -> 'View Page B'
If Amplitude receives events from Datamonster in that exact order, then you would expect 'View Page A' to have 'color' = 'red' and 'View Page B' to have 'color' = 'blue'. This is because in Amplitude, we maintain the value of user properties at the time of the event. For this reason, the order in which events are uploaded is very important. If the identify was received after 'View Page B', then 'View Page B' would have 'color' = 'red' instead of 'blue'.
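The ordering behavior in the Datamonster example can be reproduced with a toy model. This is only a sketch of the rule described above (each event is stamped with the user property values current at the moment it is received); the function name and the tuple representation of calls are hypothetical and do not reflect Amplitude's actual implementation.

```python
def sync_properties(calls, initial):
    """Replay calls in received order; return (event_type, properties) per event."""
    current = dict(initial)
    stamped = []
    for kind, payload in calls:
        if kind == "identify":
            current.update(payload)                   # update current user properties
        else:                                         # "logEvent"
            stamped.append((payload, dict(current)))  # snapshot at time of receipt
    return stamped


in_order = [
    ("logEvent", "View Page A"),
    ("identify", {"color": "blue"}),
    ("logEvent", "View Page B"),
]
late_identify = [
    ("logEvent", "View Page A"),
    ("logEvent", "View Page B"),
    ("identify", {"color": "blue"}),
]

result_in_order = sync_properties(in_order, {"color": "red"})
result_late = sync_properties(late_identify, {"color": "red"})
```

With `in_order`, 'View Page B' carries 'color' = 'blue'; with `late_identify`, both events carry 'color' = 'red', which is the out-of-order case described above.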
Amplitude guarantees that events are processed in the order in which they are received by processing all of a user's events on the same ingestion worker. In essence, all of Datamonster's events would queue up in order on a single ingestion worker. If these events were instead processed in parallel across two separate workers, then the ordering would be much harder to guarantee (e.g. one worker might be faster than the other).
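One common way to pin all of a user's events to a single worker is to hash a stable key and take the result modulo the number of workers. The sketch below illustrates that general technique only; how Amplitude actually assigns users to ingestion workers is not described here, and the function name is hypothetical. The same idea can be applied on the client side when assigning non-overlapping sets of devices to backfill workers.

```python
import hashlib


def worker_for(key, num_workers):
    """Map a stable key (e.g. a device_id) to a worker index in [0, num_workers)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


# The same key always lands on the same worker, so its events stay in one queue.
first = worker_for("Datamonster", 4)
second = worker_for("Datamonster", 4)
```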
Because each user's events are processed by the same worker, a user who sends an abnormally high number of events in a short amount of time would overload their assigned worker. For this reason, the event upload limit is 10 events/sec per Device ID. Usually, this limit is reasonable because human interactions will not surpass 10 events/sec. However, the same ingestion system that handles live data also handles backfilled data, meaning the same limits apply when you are backfilling data. It is easy for a customer backfill to exceed 10 events/sec if the system iterates through historical data and sends it as fast as possible in parallel.
As a result, Amplitude keeps track of each Device ID's event rate, and if it detects that a particular device is sending faster than 10 events/sec, it rejects the events and returns a 429 throttling HTTP response code. This is why, if you receive a 429 in response to an event upload, the process should sleep for a few seconds and keep retrying the upload until it succeeds, as stated in our HTTP API documentation. This ensures that no events are lost in the backfill process. If you do not retry after a 429 response code, then that specific batch of events will not be ingested.
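The sleep-and-retry behavior described above can be sketched as follows. The upload itself is passed in as a `post` callable that returns an HTTP status code, so the sketch makes no assumption about the HTTP client; the function name, the doubling delay, and the delay values are illustrative choices, and network failures are assumed to surface as `OSError` subclasses.

```python
import time


def upload_with_retry(post, batch, sleep=time.sleep, base_delay=2.0, max_delay=60.0):
    """Call post(batch) until it returns 200; return the number of attempts made."""
    delay = base_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            status = post(batch)
        except OSError:
            status = None  # assumption: timeouts and connection errors are retryable
        if status == 200:
            return attempts
        # 429 (throttled) or any other non-200: wait, then try the same batch again.
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

Because the same batch is re-sent unchanged, including an insert_id on every event lets duplicates from a retry be removed. A production script might also log repeated failures so that a batch which can never succeed does not retry silently.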
Instructions for Pre-existing Users Backfill
If you have pre-existing users, then you should backfill those pre-existing users to accurately mark when those users were new users. Amplitude marks users as new based on the timestamp of their earliest event.
To backfill your pre-existing users, use our HTTP API to send a "dummy event" or a signup event whose timestamp is the time when the user was actually new. For instance, if a user signed up on Aug 1st, 2013, the timestamp of the event you send would be Aug 1st, 2013.
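A signup event for a pre-existing user might be built as in the sketch below. The function name, the event type 'Signup', and the insert_id format are hypothetical; the time field is expressed in milliseconds since the epoch, and the signup time is assumed to be known in UTC.

```python
from datetime import datetime, timezone


def signup_backfill_event(user_id, signed_up_at, event_type="Signup"):
    """Build a dummy event dated at the moment the user actually became new."""
    return {
        "user_id": user_id,
        "event_type": event_type,
        "time": int(signed_up_at.timestamp() * 1000),
        # One signup event per user, so the user ID alone makes a stable insert_id.
        "insert_id": f"signup-backfill-{user_id}",
    }


# A user who signed up on Aug 1st, 2013.
signup_event = signup_backfill_event(
    "datamonster", datetime(2013, 8, 1, tzinfo=timezone.utc)
)
```

The resulting dict can be sent through the HTTP API like any other event; because its timestamp is earlier than the user's live events, it becomes the earliest event for that user.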