Quickly insert millions of rows in SQL Server in .NET
A few days ago, I needed to insert about 3 million rows into a table. This is not a frequent operation, but I will do it about once a month, and of course, I want it to be as fast as possible so the table isn't locked for too long. The data comes from several CSV files, to which I apply some processing using .NET code. This means I cannot use the great BULK INSERT statement. Instead, I must use the SqlBulkCopy class from the .NET Framework. Let's see how to use it with .NET objects!
Let's use this table:
CREATE TABLE [dbo].[Customer](
    [Id] [uniqueidentifier] NOT NULL,
    [FirstName] [nvarchar](50) NULL,
    [LastName] [nvarchar](50) NULL,
    [DateOfBirth] [datetime2](7) NULL
)
And the corresponding class:
public class Customer
{
    public Guid Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public DateTime DateOfBirth { get; set; }
}
For testing purposes, I've created a method that generates as many customer instances as needed:
public static IEnumerable<Customer> Generate(int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return new Customer
        {
            Id = Guid.NewGuid(),
            FirstName = "FirstName" + i,
            LastName = "LastName" + i,
            DateOfBirth = DateTime.UtcNow
        };
    }
}
At this step, you have an IEnumerable<Customer>. If you look at the available overloads of the SqlBulkCopy.WriteToServerAsync method, you can use an array of DataRow, a DataTable, or a DbDataReader. Only the last one allows you to stream the data. This is very important for the high volume of data I want to insert (a few million rows). Indeed, you do not want to load everything in memory before sending the first row to SQL Server. So the important question is: how do you transform an IEnumerable<T> into a DbDataReader?
The DbDataReader class has many methods: Read, GetOrdinal, IsDBNull, GetValue, GetByte, GetChar, GetString, and so on. In practice, you don't need to implement all of them; the first four are enough. Also, DbDataReader uses ordinals (the column index) to get a value, whereas in .NET we use a property name. This means we have to add a mapping between property names and ordinals. Property values are also read many times, so instead of using reflection, which may be slow, you can create compiled expressions.
public class ObjectDataReader<T> : DbDataReader
{
    private IEnumerator<T> _iterator;
    private IDictionary<string, int> _propertyNameToOrdinal = new Dictionary<string, int>();
    private IDictionary<int, string> _ordinalToPropertyName = new Dictionary<int, string>();
    private Func<T, object>[] _getPropertyValueFuncs;

    public ObjectDataReader(IEnumerator<T> items)
    {
        _iterator = items ?? throw new ArgumentNullException(nameof(items));
        Initialize();
    }

    private void Initialize()
    {
        var properties = typeof(T).GetProperties();
        _getPropertyValueFuncs = new Func<T, object>[properties.Length];

        int ordinal = 0;
        foreach (var property in properties)
        {
            string propertyName = property.Name;
            _propertyNameToOrdinal.Add(propertyName, ordinal);
            _ordinalToPropertyName.Add(ordinal, propertyName);

            // Lambda: x => (object)x.PropertyName
            var parameterExpression = Expression.Parameter(typeof(T), "x");
            var func = (Func<T, object>)Expression.Lambda(
                    Expression.Convert(
                        Expression.Property(parameterExpression, propertyName),
                        typeof(object)),
                    parameterExpression)
                .Compile();

            _getPropertyValueFuncs[ordinal] = func;
            ordinal++;
        }
    }
}
Once this is done, the next methods are very easy to implement:
public override bool Read()
{
    return _iterator.MoveNext();
}

public override int GetOrdinal(string name)
{
    if (_propertyNameToOrdinal.TryGetValue(name, out var ordinal))
        return ordinal;

    return -1;
}

public override object GetValue(int ordinal)
{
    var func = _getPropertyValueFuncs[ordinal];
    return func(_iterator.Current);
}

public override bool IsDBNull(int ordinal)
{
    return GetValue(ordinal) == null;
}
// code omitted for brevity
// https://gist.github.com/meziantou/174e2791dec966be837746750b87d069
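For reference, here is a minimal sketch of a few of the other members that matter, based on the same fields as above (the full implementation is in the gist):

// FieldCount is used by SqlBulkCopy to enumerate the columns
public override int FieldCount => _getPropertyValueFuncs.Length;

// GetName is the reverse operation of GetOrdinal
public override string GetName(int ordinal) => _ordinalToPropertyName[ordinal];

// Most of the remaining members are not used during a bulk copy, so they can simply throw
public override bool NextResult() => throw new NotSupportedException();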
And that's enough. Now, you can test the code:
using (var connection = new SqlConnection())
{
    connection.ConnectionString = "Server=(local);Database=Sample;Trusted_Connection=True;";
    await connection.OpenAsync(ct); // "ct" is a CancellationToken defined elsewhere

    using (var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, null))
    {
        var customers = Customer.Generate(5_000_000); // digit separator (C#7)
        using (var enumerator = customers.GetEnumerator())
        using (var customerReader = new ObjectDataReader<Customer>(enumerator))
        {
            // Mapping PropertyName - ColumnName
            bulkCopy.DestinationTableName = "Customer";
            bulkCopy.ColumnMappings.Add(nameof(Customer.Id), "Id");
            bulkCopy.ColumnMappings.Add(nameof(Customer.FirstName), "FirstName");
            bulkCopy.ColumnMappings.Add(nameof(Customer.LastName), "LastName");
            bulkCopy.ColumnMappings.Add(nameof(Customer.DateOfBirth), "DateOfBirth");

            bulkCopy.EnableStreaming = true;
            bulkCopy.BatchSize = 10000;
            bulkCopy.NotifyAfter = 1000;
            bulkCopy.SqlRowsCopied += (sender, e) => Console.WriteLine("RowsCopied: " + e.RowsCopied);

            await bulkCopy.WriteToServerAsync(customerReader, ct);
        }
    }
}
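Depending on your scenario, you may also want to tweak the SqlBulkCopyOptions. For example, SqlBulkCopyOptions.TableLock takes a bulk update lock for the duration of the copy, which may improve throughput when nothing else writes to the table at the same time (a sketch, not benchmarked here):

// Hypothetical variant: trade concurrency for speed with a table-level lock
using (var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
{
    // same configuration and WriteToServerAsync call as above
}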
Conclusion
Using SqlBulkCopy, we can insert a few million rows in a few seconds, whereas with plain INSERT statements it would take a few minutes. To avoid using too much memory, you can stream the data to the database by creating your own DbDataReader.
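For comparison, here is a minimal sketch of the row-by-row version, where each row pays the cost of a round trip to the server (assuming the same connection and customers variables as above):

using (var command = connection.CreateCommand())
{
    command.CommandText = "INSERT INTO [Customer] ([Id], [FirstName], [LastName], [DateOfBirth]) " +
                          "VALUES (@Id, @FirstName, @LastName, @DateOfBirth)";

    foreach (var customer in customers)
    {
        // One round trip per row: this is what makes this version much slower
        command.Parameters.Clear();
        command.Parameters.AddWithValue("@Id", customer.Id);
        command.Parameters.AddWithValue("@FirstName", customer.FirstName);
        command.Parameters.AddWithValue("@LastName", customer.LastName);
        command.Parameters.AddWithValue("@DateOfBirth", customer.DateOfBirth);
        await command.ExecuteNonQueryAsync(ct);
    }
}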
A sample code is available on Gist: https://gist.github.com/meziantou/174e2791dec966be837746750b87d069
Do you have a question or a suggestion about this post? Contact me!