The test
database contains two tables users
and messages
, the tables and records in the tables are created by importing the (https://github.com/eMahtab/mysql-test-dataset/blob/main/users-and-messages/test_database.zip) database dump.
Its medium sized database having in total 110 Million records, so importing the database dump will take a long time.
users table = 10 Million records
messages table = 100 Million records
The tables were originally created using below DDL statements :
users
table have id (which denotes user id) as Primary Key, and messages
table also have id (which denotes message id) as Primary Key.
messages
table have columns sender_id
and recipient_id
which are Foreign Key, referencing id
column of users
table
CREATE TABLE users (
id BIGINT,
name VARCHAR(50),
username VARCHAR(30),
PRIMARY KEY (id)
);
CREATE TABLE messages (
id BIGINT,
sender_id BIGINT,
recipient_id BIGINT,
message TEXT,
created_at DATETIME NOT NULL,
edited_at DATETIME DEFAULT NULL,
deleted_at DATETIME DEFAULT NULL,
PRIMARY KEY (id),
FOREIGN KEY (sender_id) REFERENCES users(id),
FOREIGN KEY (recipient_id) REFERENCES users(id)
);
Below is the output from SHOW INDEX FROM users
and SHOW INDEX FROM messages
As we can see from above screenshot, users
table have one index, which is Primary index on id
column
And messages
table have one Primary index (on column id
) and two Secondary indexes (one on column sender_id
and other one on column recipient_id
)
A Primary index is always unique, and it uniquely identifies each row in the table and enforces uniqueness on the primary key column(s). A table can have only one primary index.
A Secondary index may or may not be unique, secondary indexes are created to optimize query performance for columns not covered by the primary index. A table can have multiple secondary indexes.
If you see the CREATE TABLE
command above for messages
table, we did not explicitly created secondary indexes, it was automatically created by MySQL for enforcing referential integrity efficiently. Operations like ON DELETE CASCADE
or ON UPDATE CASCADE
need to locate and modify rows in the referencing table quickly.
You can always check the indexes in a table using SHOW INDEX FROM
command, to get more details.
select * from messages where id = 679802;
select count(*) from messages where sender_id = 6452;
select * from messages
where sender_id = 12098 AND
created_at BETWEEN '2024-12-05 00:00:00' AND '2024-12-05 23:59:59';
select * from messages where sender_id = 44887 order by created_at ASC;
select * from messages
where sender_id = 889000 AND
created_at BETWEEN '2024-12-01 00:00:00' AND '2024-12-07 23:59:59'
order by created_at ASC;
select * from messages where sender_id = 8 AND recipient_id = 859523;
If you query on a column which don't have index on it, then the query will be slow or very slow or will take unacceptable amount of time, depending on how many records that table have. A Table scan is costly and should be avoided. One of the way to solve this issue, is to create an index on the column on which we want to query.
select count(*) from messages
where created_at BETWEEN '2024-12-03 00:00:00' AND '2024-12-03 23:59:59';
Three indexes on messages
table take up around 7.4 GB space.
To get the total size of index and actual data stored inside a table, use below sql command
SELECT TABLE_NAME,
DATA_LENGTH / (1024 * 1024 * 1024) AS DataSize_GB,
INDEX_LENGTH / (1024 * 1024 * 1024) AS IndexSize_GB,
(DATA_LENGTH + INDEX_LENGTH) / (1024 * 1024 * 1024) AS TotalSize_GB
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'messages';
Indexing comes with a small cost, whenever a new row inserted or a row deleted, the indexes needs to updated to reflect the updated state of database table. Similarly when a row is updated (which updates the indexed column), then also index needs to be updated.
So indexing does adds to slowing down the write operation.