Generate fake data to test performance
Often when we develop, we end up testing on our local machines with very few records in the database. Performance testing, re-architecting a solution for scale and similar work need a lot of data already present in the db - concurrently adding or updating 1000 records in an empty table is very different from doing it when the table already has a few million records.
It is actually quite simple to generate fake data and load it into a database. To simulate extreme stress, you don't need to generate obscene loads - you can reduce your available capacity to a small CPU/RAM allocation and get the same effect with moderate loads.
Read below for code snippets covering each part - getting a database up (docker + mariadb), generating data (python), loading it (usql) and then changing database capacity (docker) to see what you need to tune.
Installation
python 3.8
Based on your OS, install Python 3.8. I use PyEnv because it makes it very easy to switch to a different version later.
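With pyenv, that typically looks like the commands below; 3.8.10 here is just an example patch release, so pick whichever 3.8.x pyenv offers you.

pyenv install 3.8.10
pyenv local 3.8.10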
docker
Get and install Docker. It is pretty straightforward. You may need to reboot your machine post installation.
After that, open a command line and check that you can run docker and docker-compose. Both should run and print their usage messages.
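If you prefer an explicit check, the version flags are enough:

docker --version
docker-compose --version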
mariadb
MariaDB is an open source fork of MySQL. Let us see which versions are available in the Docker repositories.
docker search mariadb
This should give a list of various images available. Pick one. Below, I am using version 10.1 of the server.
docker pull mariadb/server:10.1
usql
Oracle has SQL*Plus, PostgreSQL has psql, MySQL has mysql etc. as command line clients for their databases. While they are all good for their specific databases, if you need to work with multiple databases, it is better to pick one client that works with all of them. I've been using usql, a universal CLI, and have been very happy with it.
Go to the link above and install it. It is very easy if you use Homebrew.
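At the time of writing, usql's Homebrew formula lives in the xo tap, so the install should be roughly the command below; check the usql README in case the tap or formula name has changed.

brew install xo/xo/usql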
Generate Data
I usually generate data in CSV and keep it zipped up for reuse.
In the Python world, Faker and Mimesis are two good libraries. If your objective is to create a Pandas dataframe, then look at Farsante, which works very well with Mimesis, though it seems to take more resources due to its use of PySpark. Loading into a dataframe makes it easier to manipulate the generated data (sorting, running sums etc.) before persisting it to a file.
For now, we will use Faker and generate files that aren't sorted.
pip install faker dotmap
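Before generating millions of rows, a quick one-liner (illustrative only) confirms the install and the locale work:

python3 -c "from faker import Faker; fake = Faker('es_MX'); print(fake.name(), fake.free_email())"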
Let us create two files:
Customer, with fields (full_name, email, telephone, city, state, join_date)
Order, with fields (customer_id, order_date, product)
We can load these into two database tables, customer and customer_order, each with an auto-incremented integer id as the primary key. One benefit of this is that we don't need to keep looking up the customer file to find a valid customer id to put in an order - we can simply take a random integer between 1 and the number of records in the customer file.
order_data_faker.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import csv

from faker import Faker
from dotmap import DotMap  # DotMap enables accessing a dictionary with dot notation.

_CUSTOMER = DotMap(
    {
        "filename": "./data/customer_data.csv",
        "max": 1000000,
        "start_date": "-2y",
        "end_date": "-1y",
        "headers": ["full_name", "email", "telephone", "city", "state", "join_date"],
    }
)  # join_date is between last 2 years

_ORDER = DotMap(
    {
        "filename": "./data/order_data.csv",
        "max": 10000000,
        "start_date": "-1y",
        "end_date": "-1d",
        "skus": ["samsung", "apple", "nokia", "sony"],
        "headers": ["customer_id", "order_date", "product"],
    }
)  # order_date is between last year and yesterday

# Initialize our faker
fake = Faker("es_MX")  # let us pick a locale! en_US, en_IN etc are all there

# generate customer records
# TODO: output ideally should be sorted by date
with open(_CUSTOMER.filename, "wt") as csvf:
    writer = csv.writer(csvf)
    writer.writerow(_CUSTOMER.headers)
    for i in range(_CUSTOMER.max):
        writer.writerow(
            [
                fake.name(),
                fake.free_email().replace("@", f""".{i}@"""),  # ensure unique email
                fake.phone_number(),
                fake.city(),
                fake.state(),
                fake.date_between(
                    start_date=_CUSTOMER.start_date,
                    end_date=_CUSTOMER.end_date,
                ).strftime("%Y-%m-%d %H:%M"),
            ]
        )

with open(_ORDER.filename, "wt") as csvf:
    writer = csv.writer(csvf)
    writer.writerow(_ORDER.headers)
    for i in range(_ORDER.max):
        writer.writerow(
            [
                fake.random_int(1, _CUSTOMER.max),
                fake.date_time_between(
                    start_date=_ORDER.start_date,
                    end_date=_ORDER.end_date,
                ).strftime("%Y-%m-%d %H:%M"),
                fake.random_choices(elements=_ORDER.skus, length=1)[0],
            ],
        )
In the script above, we are generating 1 million customer records and 10 million order records. Depending upon your hardware, it might take some time - say 15-30 minutes.
Note: create a sub-directory called data before running this script. That is where the output files are written.
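Concretely, from the folder where the script lives, that is just:

mkdir -p data
python3 order_data_faker.py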
Set up MariaDB
Let us set up a database using docker-compose. Create a file stack.mariadb.yml with content like below. Note that if you move this file to a different folder and run it from there, a new database container will be created, so please decide on a folder structure first. My structure is below, and I cd into the dbsetup folder before executing commands.
.
└── dbsetup
├── data
│ ├── customer_data.csv
│ └── order_data.csv
├── order_data_faker.py
├── order_data_loader.sql
└── stack.mariadb.yml
stack.mariadb.yml
version: '3.1'

services:
  db:
    image: mariadb/server:10.1   # we already pulled this image
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: changeme
      MYSQL_DATABASE: mariadbtest
      MYSQL_USER: scott          # additional user if you need one
      MYSQL_PASSWORD: tiger
    ports:
      - 3306:3306                # internal 3306 is exposed to you as 3306
    volumes:
      - ./data:/tmp/data         # your subdir `data` will be visible inside the container as `/tmp/data`
    deploy:
      resources:                 # you can shut down the machine, change these values and restart to experiment
        limits:
          cpus: 1
          memory: 200M
        reservations:
          cpus: 1
          memory: 100M
It is now fairly easy to start the database machine. Just up it!
docker-compose -f stack.mariadb.yml up
It will take a while the first time. From then on it should be ready to serve within seconds.
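In another terminal, the standard listing command should show the container up with port 3306 mapped:

docker ps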
Test if you can connect to it by running:
usql my://root:changeme@localhost:3306/mariadbtest
You should get the database prompt.
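From that prompt, any trivial query confirms the server is responding; for example:

select version();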
Load Data
order_data_loader.sql
-- basic tables; no indexes other than those created by primary keys and constraints.
drop table if exists customer_order;
drop table if exists customer;

create table customer (
    id int not null auto_increment primary key,
    full_name varchar(100) not null,
    email varchar(100),
    telephone varchar(20),
    city varchar(32) not null,
    state varchar(32) not null,
    join_date datetime not null,
    constraint uc_email unique(email)
) engine = innodb;

create table customer_order (
    id int not null auto_increment primary key,
    customer_id int not null,
    order_date datetime not null,
    product varchar(16) not null,
    constraint fk_customer foreign key (customer_id)
        references customer(id) on update cascade on delete cascade
) engine = innodb;

-- load the files generated. we need to mask the
-- date field to the right format for a correct load.
load data infile '/tmp/data/customer_data.csv' ignore
into table customer
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
ignore 1 rows
(full_name, email, telephone, city, state, @join_date)
set join_date = str_to_date(@join_date, '%Y-%m-%d %H:%i');

load data infile '/tmp/data/order_data.csv' ignore
into table customer_order
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
ignore 1 rows
(customer_id, @order_date, product)
set order_date = str_to_date(@order_date, '%Y-%m-%d %H:%i');

--
-- let us try some queries
-- who are the customers who ordered the most items, and over what duration?
select customer_id, count(1), min(order_date), max(order_date)
from customer_order
group by 1
order by 2 desc
limit 10;
You can run it from within the usql prompt or from the command line as below.
usql my://root:changeme@localhost:3306/mariadbtest -f order_data_loader.sql
On a docker container to which I allocated 2GB memory, it took about 5 minutes to load these files.
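If you want to watch how much of the allocated CPU and memory the container actually uses while loading or querying, the standard stats command is handy:

docker stats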
Now you can shut down the container (CTRL-C), change the stack.mariadb.yml file to allocate less or more memory and CPU, and bring it up again. Then test different queries for speed.
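As a starting point, here are a couple of illustrative queries against the schema above - sanity-check counts plus a join that touches both tables. Adjust them to whatever access pattern you actually care about.

-- sanity check: row counts after the load
select count(*) from customer;
select count(*) from customer_order;

-- orders per state over the last 90 days (joins both tables)
select c.state, count(*) as orders
from customer_order o
join customer c on c.id = o.customer_id
where o.order_date >= date_sub(now(), interval 90 day)
group by c.state
order by orders desc
limit 10;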