Bulk Inserts Performance with Django and PostgreSQL
Using the Django ORM is often a productivity boon. It allows you to describe your models and interact with your data in pure Python code instead of mixing it with SQL queries. It removes the need for a lot of boilerplate by reducing the impedance mismatch between the persisted relational data and the object model used by the application.
The Django ORM also provides useful hooks that you can use to enforce your models' business logic. Custom save() methods and signals are guaranteed to run on every interaction with your models, as long as you use the ORM.
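For example, a custom save() method can keep a derived field up to date, and a signal receiver can react to every newly created record. A minimal sketch, using a hypothetical Invoice model:
from decimal import Decimal

from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver


class Invoice(models.Model):
    # Hypothetical model, for illustration only.
    amount = models.DecimalField(max_digits=10, decimal_places=2)
    tax = models.DecimalField(max_digits=10, decimal_places=2, default=0)

    def save(self, *args, **kwargs):
        # Business logic that runs on every ORM save.
        self.tax = self.amount * Decimal('0.20')
        super(Invoice, self).save(*args, **kwargs)


@receiver(post_save, sender=Invoice)
def invoice_saved(sender, instance, created, **kwargs):
    # Called once per instance, after each save().
    if created:
        pass  # e.g. schedule a notification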
These abstractions come with a performance cost. They work under the assumption that your code operates on records one by one. For example, save() is called on every single model instance, and signals receive one instance at a time.
When you only need to insert a few records into the database, this is not an issue. But when you need to save a batch of records, the performance cost of sending them to the database one by one is huge. Each round trip to the database has a fixed cost: a query needs to be sent to the database server, which might be on a different machine. On top of that, the database only sees one request at a time and can't optimize the creation of multiple records. Sending a single request to create a batch of records minimizes the impact of the round trip to the server, and allows the database to batch the writes to disk.
We can see how much the ORM can affect performance by running a simple benchmark. Each script starts from an empty database and inserts 10,000 records of the following model into the database:
from django.db import models


class TestModel(models.Model):
    field_1 = models.IntegerField()
    field_2 = models.CharField(max_length=255)
    field_3 = models.DateTimeField()

    class Meta:
        app_label = 'app'
It's a very simple model, with only three fields of three different types on top of the implicit id field. It has no foreign keys.
All the code used for this benchmark is available here. You can clone the repository and follow the instructions in the README to run the scripts yourself and reproduce the results presented here.
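The scripts that follow all call a small utils.timed helper, which runs the given function and prints how long it took. The exact implementation lives in the repository; a minimal sketch of what such a helper might look like:
# utils.py -- minimal sketch of the timing helper; the repository version may differ.
import time

N_RECORDS = 10000


def timed(fn, n_records=N_RECORDS):
    start = time.time()
    fn(n_records)
    elapsed_ms = int((time.time() - start) * 1000)
    print('Created {} records in {}ms'.format(n_records, elapsed_ms))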
The Setup
The scripts were run using the following setup:
- Operating System: Ubuntu 14.04
- Database: PostgreSQL 9.3, using the default Ubuntu configuration
- Database driver: psycopg2 2.6.1
- Django 1.8.6
- Hardware: Intel i5-4570S CPU, 16GB of RAM, Samsung 840 EVO SSD
The database and the Python scripts were run on the same computer. This minimizes the cost of the round trips to the database, skewing the results in favor of non-batched operations. As we will see, despite this advantage the performance difference between non-batched and batched operations is still huge.
ORM create()
The most natural way to create records using the ORM is to create them one by one, using a simple loop:
import utils
from django.utils import timezone
from app import models


def orm_create(n_records):
    for i in xrange(0, n_records):
        models.TestModel.objects.create(
            field_1=i,
            field_2=str(i),
            field_3=timezone.now(),
        )


if __name__ == '__main__':
    utils.timed(orm_create)
This approach has several advantages: it's simple, easy to understand, and it respects all your custom model code.
Unfortunately, performance is poor:
$ python orm_create.py
Created 10000 records in 12685ms
If you need to create lots of records, this kind of performance is often unacceptable.
ORM bulk_create()
Django does offer a method to handle exactly this scenario: bulk_create. You can pass it a list of new model instances to create, and it will issue bulk queries to the database:
import utils
from django.utils import timezone
from app import models


def orm_bulk_create(n_records):
    instances = [
        models.TestModel(
            field_1=i,
            field_2=str(i),
            field_3=timezone.now(),
        )
        for i in xrange(0, n_records)
    ]
    models.TestModel.objects.bulk_create(instances)


if __name__ == '__main__':
    utils.timed(orm_bulk_create)
The performance boost is huge:
$ python orm_bulk_create.py
Created 10000 records in 334ms
Performance comes at a price: your custom save() methods and signals won't be invoked. It also doesn't work with multi-table inheritance.
In short, you can no longer trust the ORM to enforce your business logic, losing lots of the advantages of using an ORM in the first place.
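One practical note before moving on: when the list of instances is very large, bulk_create accepts a batch_size argument that splits the work into several INSERT statements, keeping each individual query at a manageable size:
# Insert at most 1,000 rows per INSERT statement.
models.TestModel.objects.bulk_create(instances, batch_size=1000)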
SQL INSERT
Let's now turn away from the ORM to see if we can further improve performance using pure SQL queries.
A first, simple approach is to create records one by one using INSERT statements:
import utils
from contextlib import closing
from django.db import connection
from django.utils import timezone


def sql_simple_insert(n_records):
    with closing(connection.cursor()) as cursor:
        for i in xrange(0, n_records):
            cursor.execute(
                'INSERT INTO app_testmodel (field_1, field_2, field_3) '
                'VALUES (%s, %s, %s)',
                (i, str(i), timezone.now()),
            )


if __name__ == '__main__':
    utils.timed(sql_simple_insert)
Similarly to the first ORM approach, creating records one by one performs poorly:
$ python sql_simple_insert.py
Created 10000 records in 3968ms
It's about three times faster than creating records one by one through the ORM. That's how much the ORM abstractions cost us: the queries sent to the database are the same, but the ORM has to call custom validation and save methods, invoke all the signals, and rebuild the SQL INSERT query every single time.
Using SQL INSERT statements has the same disadvantage we saw with bulk_create, in that custom model business logic is not executed, but it delivers only a fraction of the performance gain.
SQL INSERT, using executemany
The psycopg2 cursor has a method to execute a single SQL statement multiple times, using a different set of arguments each time. It looks like a simple way to batch operations without changing the code too much:
import utils
from contextlib import closing
from django.db import connection
from django.utils import timezone


def sql_simple_insert_executemany(n_records):
    with closing(connection.cursor()) as cursor:
        cursor.executemany(
            'INSERT INTO app_testmodel (field_1, field_2, field_3) '
            'VALUES (%s, %s, %s)',
            [(i, str(i), timezone.now()) for i in xrange(0, n_records)],
        )


if __name__ == '__main__':
    utils.timed(sql_simple_insert_executemany)
Despite looking like a batch operation, performance didn't really improve:
$ python sql_simple_insert_executemany.py
Created 10000 records in 3693ms
As it turns out, under the hood psycopg2 is still sending the queries one by one to the database. With that in mind, it's not surprising that performance is so close to the previous approach.
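As an aside, psycopg2 releases newer than the one benchmarked here (2.7 and later) added execute_values to psycopg2.extras, which does interpolate the parameter tuples into multi-row INSERT statements, page_size rows at a time. A minimal sketch, assuming the same table and that the Django cursor delegates to the underlying psycopg2 cursor (as copy_from does later on):
from contextlib import closing

from django.db import connection
from django.utils import timezone
from psycopg2.extras import execute_values  # available in psycopg2 >= 2.7


def sql_insert_execute_values(n_records):
    with closing(connection.cursor()) as cursor:
        execute_values(
            cursor,
            'INSERT INTO app_testmodel (field_1, field_2, field_3) VALUES %s',
            [(i, str(i), timezone.now()) for i in xrange(0, n_records)],
            page_size=1000,  # rows interpolated into each statement
        )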
SQL batch INSERT
We can do a lot better by building a single SQL INSERT statement to create multiple records at once, similar to the SQL built by bulk_create:
import utils
from contextlib import closing
from django.db import connection
from django.utils import timezone


def sql_batch_insert(n_records):
    sql = 'INSERT INTO app_testmodel (field_1, field_2, field_3) VALUES {}'.format(
        ', '.join(['(%s, %s, %s)'] * n_records),
    )
    params = []
    for i in xrange(0, n_records):
        params.extend([i, str(i), timezone.now()])
    with closing(connection.cursor()) as cursor:
        cursor.execute(sql, params)


if __name__ == '__main__':
    utils.timed(sql_batch_insert)
Building the SQL query manually adds more noise to the code than using bulk_create, but other than that it has no significant disadvantage:
$ python sql_batch_insert.py
Created 10000 records in 167ms
Performance is in the same order of magnitude as bulk_create (about twice as fast in this benchmark): as we saw when comparing the standard ORM create() method against simple SQL INSERT statements, the ORM overhead is far from trivial.
PostgreSQL COPY FROM
If we leave the realm of standard, portable SQL, PostgreSQL offers the COPY command to import and export records in bulk.
Using COPY FROM we can import data in CSV or tab-separated format. The data doesn't have to come from an actual file: it can also be read directly from the connection input stream. psycopg2 exposes this functionality in the cursor copy_from() method, which accepts any Python file-like object to read the records from.
We can use StringIO together with the csv module to build this file-like object in memory, and then import it in the database using copy_from():
import utils
from contextlib import closing
import csv
from cStringIO import StringIO
from django.db import connection
from django.utils import timezone


def copy_from(n_records):
    stream = StringIO()
    writer = csv.writer(stream, delimiter='\t')
    for i in xrange(0, n_records):
        writer.writerow([i, str(i), timezone.now().isoformat()])
    stream.seek(0)
    with closing(connection.cursor()) as cursor:
        cursor.copy_from(
            file=stream,
            table='app_testmodel',
            sep='\t',
            columns=('field_1', 'field_2', 'field_3'),
        )


if __name__ == '__main__':
    utils.timed(copy_from)
The code is relatively easy to understand and arguably more readable than the batch SQL INSERT query, though not as readable as bulk_create. Performance is better than any of the previous approaches:
$ python copy_from.py
Created 10000 records in 96ms
We removed most of the overhead associated with creating and sending queries, and it shows. It's about three times faster than using the fastest ORM method.
This is likely the most performant way to load a large number of records into PostgreSQL from an external data source.
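One caveat about copy_from: the plain text format it uses needs escaping if your values can contain tabs, newlines or backslashes. psycopg2 also exposes copy_expert, which lets you run an explicit COPY ... FROM STDIN WITH CSV statement and leave the quoting to the csv module. A minimal sketch, assuming the same table and an iterable of row tuples:
import csv
from contextlib import closing
from cStringIO import StringIO

from django.db import connection


def copy_expert_csv(rows):
    # rows: an iterable of (field_1, field_2, field_3) tuples.
    stream = StringIO()
    writer = csv.writer(stream)
    for row in rows:
        writer.writerow(row)
    stream.seek(0)
    with closing(connection.cursor()) as cursor:
        cursor.copy_expert(
            'COPY app_testmodel (field_1, field_2, field_3) '
            'FROM STDIN WITH CSV',
            stream,
        )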
Generating the data in the database
Finally, we can improve performance further by cheating. If we generate the data directly in the database, we don't need to send it across the wire, removing all the network overhead. On top of that, the database won't need to parse the record data, since we're not actually sending any records.
This approach is not nearly as generic as the previous ones: in almost all practical cases, data has to come from somewhere (an external file, a web request, etc.) and is not automatically generated inside the database.
In some rare cases though, you do need to generate records programmatically with no external input, and this approach can be used. Examples are static tables and denormalized tables automatically generated from other tables' contents.
This approach also serves as a lower bound on how fast we can hope to insert new records into the database, showing how close to optimal the previous methods were.
We can use the generate_series function to generate the 10,000 records to insert:
import utils
from contextlib import closing
from django.db import connection


def generate_data_in_database(n_records):
    with closing(connection.cursor()) as cursor:
        cursor.execute(
            """
            INSERT INTO app_testmodel (field_1, field_2, field_3)
            SELECT
                i,
                i::text,
                now()
            FROM
                generate_series(0, %s - 1) AS s(i)
            """,
            (n_records,),
        )


if __name__ == '__main__':
    utils.timed(generate_data_in_database)
Performance is obviously better than the previous approaches:
$ python generate_data_in_database.py
Created 10000 records in 44ms
If you find yourself with a problem where this approach can be used, by all means use it: there is no faster way to insert new records.
Conclusions
The Django ORM design offers a good solution when you need to operate on a small set of records, which is the most common scenario in web applications. Using the hooks it provides can lead to readable, well encapsulated code.
On the other hand, as the number of records you need to operate on at once increases, the ORM becomes a performance liability.
If you took advantage of the hooks the ORM provides, and later find yourself needing to import a large number of records, you will be in a tough spot.
You will need to decide between replacing all the custom model code with a batch-friendly solution, or duplicating the necessary business logic in the code that inserts the records in batches. The first approach can be a lot of work if you have more than a handful of models with complex logic. The second can be a good trade-off if you only need to batch load a small subset of the models and their logic is simple, but duplicated logic easily leads to inconsistencies, bugs, and extra maintenance effort.
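If you go down the first road, one common pattern is to move the per-record logic out of save() and the signal handlers into a plain function that both the single-record and the batch code paths call explicitly. A minimal sketch, with hypothetical business logic:
from django.db import models


def apply_business_rules(instance):
    # Hypothetical per-record logic, kept out of save() so it can be reused.
    instance.field_2 = instance.field_2.strip()


class TestModel(models.Model):
    field_1 = models.IntegerField()
    field_2 = models.CharField(max_length=255)
    field_3 = models.DateTimeField()

    def save(self, *args, **kwargs):
        apply_business_rules(self)  # single-record path
        super(TestModel, self).save(*args, **kwargs)


def batch_create(instances):
    # Batch path: run the same logic explicitly, then insert in bulk.
    for instance in instances:
        apply_business_rules(instance)
    TestModel.objects.bulk_create(instances)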
Of course, if you can afford it, you can also decide to accept the performance impact, and insert records one by one.
As often happens in programming, deciding whether or not to take advantage of the ORM is a trade-off between immediate convenience and performance.
Make sure you evaluate that trade-off for your particular situation before blindly deciding in favor of or against the ORM.
If you enjoyed this article and need help with your Django application or your PostgreSQL database, you can get in touch with me.