Practical SQL use in relational databases

In this post, we will learn how to use SQL to query a plant disease dataset loaded into both MySQL and PostgreSQL. We will also compare the two databases, both in general and for our specific example task.

MySQL vs PostgreSQL - General Comparison

SQL Compliance

  • PostgreSQL adheres more closely to the SQL standard and has long supported advanced features such as window functions, CTEs, and INTERSECT.
  • MySQL deviates from the standard in places but has been closing the gap in recent versions: window functions and CTEs arrived in MySQL 8.0, and INTERSECT in 8.0.31.
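As a quick illustration of one of these standard-SQL features, INTERSECT can be tried locally: SQLite (bundled with Python) also implements it, so no server is needed. The table names and rows below are purely illustrative.

```python
# Sketch of standard-SQL INTERSECT using SQLite as a stand-in.
# train_crops / valid_crops are hypothetical tables for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE train_crops (crop TEXT);
    CREATE TABLE valid_crops (crop TEXT);
    INSERT INTO train_crops VALUES ('Tomato'), ('Apple'), ('Grape');
    INSERT INTO valid_crops VALUES ('Tomato'), ('Grape'), ('Corn');
""")

# Crops present in both splits
rows = conn.execute("""
    SELECT crop FROM train_crops
    INTERSECT
    SELECT crop FROM valid_crops
    ORDER BY crop
""").fetchall()
print(rows)  # [('Grape',), ('Tomato',)]
```

The same query runs unchanged on PostgreSQL, and on MySQL 8.0.31 or later.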

Performance

  • MySQL is known to perform better for read-heavy workloads, which is why it’s popular for web applications. It’s faster for simple read queries.
  • PostgreSQL performs better for complex queries, large datasets, and read-write workloads. It has a more sophisticated query optimizer.

Replication

  • MySQL has built-in asynchronous (source/replica) replication and semi-synchronous options, plus clustering solutions, but advanced high-availability setups may require third-party tools.
  • PostgreSQL has built-in streaming replication, which can run asynchronously or synchronously, making it straightforward to set up high availability.

Concurrency

  • PostgreSQL handles concurrency better with its Multi-Version Concurrency Control (MVCC) architecture. Readers don’t block writers.
  • MySQL’s older MyISAM engine used table-level locking; the default InnoDB engine provides row-level locking and its own MVCC implementation. Even so, PostgreSQL is generally considered stronger under highly concurrent mixed workloads.

Full Text Search

  • PostgreSQL has built-in full text search capabilities.
  • MySQL also supports full text search via FULLTEXT indexes (available on InnoDB since 5.6), but the feature is less configurable than PostgreSQL’s (no custom dictionaries or text search configurations).

Data Types

  • PostgreSQL supports a wider range of data types including native support for arrays, hstore (key-value pairs), and advanced geometric/spatial types.
  • MySQL has a more limited set of data types. It has spatial extensions but not as advanced as PostGIS.

Extensibility

  • PostgreSQL is highly extensible. You can create custom data types, operators, index types, etc. Popular extensions include PostGIS, pg_trgm, hstore.
  • MySQL is less extensible. It allows user-defined functions but not new data types or operators.

Stored Procedures

  • Both support stored procedures but PostgreSQL allows them to be written in multiple languages (PL/pgSQL, PL/Python, PL/Perl, etc.)
  • MySQL stored procedures are written in SQL with some extensions.

Licensing

  • Both are open source but MySQL has a dual licensing model - GPL open source and commercial. It’s owned by Oracle.
  • PostgreSQL is fully open source under the permissive PostgreSQL License and is developed by the community-run PostgreSQL Global Development Group rather than a single company.

Practical implementation - MySQL

We will now look at step-by-step guides (with code examples) of how to initialize and query the Kaggle New Plant Diseases Dataset as databases in MySQL and PostgreSQL.

Let us start with MySQL.

Step 1: Download the Dataset

  1. Download the New Plant Diseases Dataset from Kaggle. It’s around 3 GB in size.

  2. Unzip the downloaded file. You should see directories like train and valid, each containing subdirectories for every class of plant disease.

Step 2: Examine the Dataset Structure

  • The dataset has 38 classes of plant diseases across 14 crop species.
  • The directories are structured as <crop>___<disease>. For example, Tomato___Bacterial_spot.
  • Inside each directory are the individual leaf images for that crop-disease combination.
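The `<crop>___<disease>` naming convention can be parsed with a simple string split. A minimal sketch (the helper name `parse_label` is ours, not part of the dataset):

```python
# Minimal helper for the <crop>___<disease> directory naming convention.
def parse_label(dirname: str):
    """Split a name like 'Tomato___Bacterial_spot' into (crop, disease)."""
    crop, _, disease = dirname.partition("___")
    return crop, disease

print(parse_label("Tomato___Bacterial_spot"))  # ('Tomato', 'Bacterial_spot')
print(parse_label("Apple___healthy"))          # ('Apple', 'healthy')
```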

Step 3: Set up MySQL Database

  1. Install the MySQL Server if not already done. Create a new database to store the plant disease data:

    CREATE DATABASE plant_diseases;
    USE plant_diseases;
    
  2. Create a table to store the image paths and labels:

    CREATE TABLE images (
      id INT AUTO_INCREMENT PRIMARY KEY,
      crop VARCHAR(255),
      disease VARCHAR(255),
      image_path VARCHAR(255)
    );
    

Step 4: Load the Dataset into MySQL

  1. Write a Python script to traverse the unzipped dataset directories and insert the image paths and labels into the MySQL table. Use the os module to walk through the directories and the mysql.connector library to connect to MySQL and execute insert statements.

    import os
    import mysql.connector
    
    mydb = mysql.connector.connect(
      host="localhost",
      user="yourusername",
      password="yourpassword",
      database="plant_diseases"
    )
    mycursor = mydb.cursor()
    
    dataset_path = '/path/to/unzipped/dataset'
    
    for root, dirs, files in os.walk(dataset_path):
        for file in files:
            if file.endswith(".jpg"):
                image_path = os.path.join(root, file)
                label = os.path.basename(root)
                if '___' not in label:
                    continue  # skip directories that don't follow <crop>___<disease>
                crop, disease = label.split('___')

                sql = "INSERT INTO images (crop, disease, image_path) VALUES (%s, %s, %s)"
                val = (crop, disease, image_path)
                mycursor.execute(sql, val)

    mydb.commit()
    mycursor.close()
    mydb.close()
    
  2. Run this script to populate the images table with the dataset.
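Before pointing the script at a live MySQL server, the traversal logic can be dry-run end-to-end. The sketch below builds a tiny fake dataset tree and inserts into an in-memory SQLite database standing in for MySQL (note SQLite uses `?` placeholders where mysql.connector uses `%s`); the directory names are illustrative.

```python
# Self-contained dry run of the loading logic: fake dataset tree ->
# os.walk traversal -> inserts into an in-memory SQLite database.
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as dataset_path:
    # Create two class directories with one image file each
    for label in ("Tomato___Bacterial_spot", "Apple___healthy"):
        d = os.path.join(dataset_path, "train", label)
        os.makedirs(d)
        open(os.path.join(d, "leaf1.jpg"), "w").close()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE images (crop TEXT, disease TEXT, image_path TEXT)")

    for root, dirs, files in os.walk(dataset_path):
        for file in files:
            if file.endswith(".jpg"):
                label = os.path.basename(root)
                if "___" not in label:
                    continue  # skip directories that don't follow the convention
                crop, disease = label.split("___")
                conn.execute(
                    "INSERT INTO images VALUES (?, ?, ?)",
                    (crop, disease, os.path.join(root, file)),
                )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]

print(count)  # 2
```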

Step 5: Analyze the Data

  1. Now you can use SQL queries to analyze the plant disease data. For example:

    -- Count images per crop
    SELECT crop, COUNT(*) AS count 
    FROM images
    GROUP BY crop;
       
    -- Get the distribution of diseases for a specific crop
    SELECT disease, COUNT(*) AS count
    FROM images 
    WHERE crop = 'Tomato'
    GROUP BY disease;
    
  2. You can also join this table with other tables containing more information about the crops, diseases, treatments etc. for deeper analysis.
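The join idea in point 2 can be sketched with a hypothetical `treatments` dimension table; the table and column names here are illustrative, not part of the dataset (SQLite stands in for MySQL so the query is runnable as-is):

```python
# Hedged sketch: joining the images table to a hypothetical treatments
# dimension table for deeper analysis. Sample rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE images (crop TEXT, disease TEXT);
    CREATE TABLE treatments (disease TEXT, treatment TEXT);
    INSERT INTO images VALUES ('Tomato', 'Bacterial_spot'), ('Tomato', 'healthy');
    INSERT INTO treatments VALUES ('Bacterial_spot', 'copper spray');
""")

rows = conn.execute("""
    SELECT i.crop, i.disease, t.treatment
    FROM images i
    JOIN treatments t ON i.disease = t.disease
""").fetchall()
print(rows)  # [('Tomato', 'Bacterial_spot', 'copper spray')]
```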

That’s it! You now have the New Plant Diseases Dataset loaded into MySQL ready for further exploration.

Practical implementation - PostgreSQL

Let us now learn how to query using PostgreSQL. The dataset download and examination steps remain the same as before. Following that:

Step 1: Set up PostgreSQL Database

  1. Install PostgreSQL if not already done. Create a new database to store the plant disease data:

    CREATE DATABASE plant_diseases;
    \c plant_diseases
    
  2. Create a table to store the image paths and labels:

    CREATE TABLE images (
      id SERIAL PRIMARY KEY,
      crop VARCHAR(255),
      disease VARCHAR(255),
      image_path VARCHAR(255)
    );
    

    Note the use of SERIAL instead of MySQL’s AUTO_INCREMENT for the primary key. (In PostgreSQL 10+, GENERATED ALWAYS AS IDENTITY is the standard-conforming alternative.)

Step 2: Load the Dataset into PostgreSQL

  1. Write a Python script to traverse the dataset directories and insert the data into PostgreSQL. Use the psycopg2 library to connect to PostgreSQL and execute SQL statements.

    import os
    import psycopg2
    
    conn = psycopg2.connect(
        host="localhost",
        database="plant_diseases",
        user="yourusername",
        password="yourpassword"
    )
    cur = conn.cursor()
    
    dataset_path = '/path/to/unzipped/dataset'
    
    for root, dirs, files in os.walk(dataset_path):
        for file in files:
            if file.endswith(".jpg"):
                image_path = os.path.join(root, file)
                label = os.path.basename(root)
                if '___' not in label:
                    continue  # skip directories that don't follow <crop>___<disease>
                crop, disease = label.split('___')
                   
                cur.execute("INSERT INTO images (crop, disease, image_path) VALUES (%s, %s, %s)", 
                            (crop, disease, image_path))
    
    conn.commit()
    cur.close()
    conn.close()
    

    Note the slight differences in connecting to the database and executing the insert statement compared to MySQL.

  2. Run this script to populate the images table.

Step 3: Analyze the Data

  1. You can now use SQL to analyze the data in PostgreSQL. For example:

    -- Count images per crop
    SELECT crop, COUNT(*) AS count
    FROM images
    GROUP BY crop;
    
    -- Get the distribution of diseases for a specific crop  
    SELECT disease, COUNT(*) AS count
    FROM images
    WHERE crop = 'Tomato'  
    GROUP BY disease;
    
  2. PostgreSQL provides some additional capabilities compared to MySQL, such as the indexable binary JSONB type, built-in full text search, and geospatial queries via PostGIS. Depending on your analysis needs, you could leverage these features.

For example, if you have JSON metadata about each image, you could store and query it like:

ALTER TABLE images ADD COLUMN metadata JSONB;

UPDATE images 
SET metadata = '{"height": 256, "width": 256, "color_space": "RGB"}'
WHERE id = 1;

SELECT * 
FROM images
WHERE metadata->>'color_space' = 'RGB';
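A portable sketch of the same metadata filter, using SQLite plus Python's json module in place of PostgreSQL's `->>` operator (the metadata values are illustrative):

```python
# Portable equivalent of: WHERE metadata->>'color_space' = 'RGB'
# SQLite stores the JSON as text; Python's json module does the extraction.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, metadata TEXT)")
conn.execute("INSERT INTO images (metadata) VALUES (?)",
             (json.dumps({"height": 256, "width": 256, "color_space": "RGB"}),))
conn.execute("INSERT INTO images (metadata) VALUES (?)",
             (json.dumps({"height": 128, "width": 128, "color_space": "L"}),))

rgb_ids = [row[0] for row in conn.execute("SELECT id, metadata FROM images")
           if json.loads(row[1]).get("color_space") == "RGB"]
print(rgb_ids)  # [1]
```

The difference is that PostgreSQL evaluates the filter inside the database (and can index the JSONB expression), while this sketch pulls rows into the application first.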

That covers the process of loading and analyzing the plant disease dataset in PostgreSQL. The steps are quite similar to MySQL with a few syntax differences.

MySQL vs PostgreSQL - Comparison on Plant Disease dataset

We will now compare MySQL and PostgreSQL for the specific case of the plant disease dataset we loaded:

Schema Design
The relational schema would be very similar in both databases - a table to store image paths and labels with foreign keys to dimension tables for crops and diseases if needed.

PostgreSQL’s support for custom types could be used to define a dedicated ‘image’ type, though that is not strictly necessary here. MySQL’s lack of transactional DDL is a drawback during initial schema creation and later modifications.

Data Loading
Both databases can load the data from CSV files or using programming languages.

PostgreSQL’s COPY command is very efficient for bulk inserts. MySQL’s LOAD DATA INFILE is also fast.

For loading image metadata, PostgreSQL’s JSONB type is an advantage: it stores JSON in a binary format that can be indexed (for example with GIN indexes). MySQL also has a native JSON type (since 5.7), but its JSON indexing and query operators are more limited.
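Both COPY and LOAD DATA INFILE consume flat files, so bulk loading usually starts by writing the rows to CSV. A minimal sketch (the sample rows are illustrative):

```python
# Write (crop, disease, image_path) rows to CSV for bulk loading.
# The resulting text could be fed to PostgreSQL's COPY ... FROM STDIN
# (e.g. via psycopg2's copy_expert) or saved to disk for LOAD DATA INFILE.
import csv
import io

rows = [
    ("Tomato", "Bacterial_spot", "train/Tomato___Bacterial_spot/leaf1.jpg"),
    ("Apple", "healthy", "train/Apple___healthy/leaf2.jpg"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["crop", "disease", "image_path"])  # header row
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # crop,disease,image_path
```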

Querying
Most of the queries needed would be simple filters and aggregations which both can handle well.

For more complex queries that categorize images along multiple dimensions or look for patterns, PostgreSQL’s window functions and CTEs would be useful. The query optimizer would also likely generate more efficient execution plans.

Full text search on metadata could be done in PostgreSQL without any additional setup.
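As a rough sketch of that kind of multi-dimensional query, here is a CTE feeding a window function that ranks diseases by image count within each crop. SQLite supports both features, so the pattern can be tried locally before running it against PostgreSQL; the sample rows are illustrative.

```python
# CTE + window function: rank diseases by image count within each crop.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE images (crop TEXT, disease TEXT);
    INSERT INTO images VALUES
        ('Tomato', 'Bacterial_spot'),
        ('Tomato', 'Bacterial_spot'),
        ('Tomato', 'healthy'),
        ('Apple',  'scab');
""")

rows = conn.execute("""
    WITH counts AS (
        SELECT crop, disease, COUNT(*) AS n
        FROM images
        GROUP BY crop, disease
    )
    SELECT crop, disease, n,
           RANK() OVER (PARTITION BY crop ORDER BY n DESC) AS rnk
    FROM counts
    ORDER BY crop, rnk
""").fetchall()

for row in rows:
    print(row)
```

The same SQL runs on PostgreSQL (and on MySQL 8.0+, which also gained window functions and CTEs).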

Performance
For simple read queries like “select all images for crop X”, MySQL would likely be faster, especially with indexing.

As queries get more complex with joins, aggregations and analytics, PostgreSQL is likely to perform better, especially with large data volumes.

Concurrent write performance may favor PostgreSQL under mixed read-write workloads, although InnoDB also implements MVCC, so the difference depends heavily on the workload.

Geospatial
If there is a geospatial component to the data, like field locations, PostgreSQL with the PostGIS extension would provide advanced capabilities for geospatial indexing and queries that MySQL cannot match.

Machine Learning
For advanced analytics and machine learning, the ability to do more in-database with PostgreSQL is an advantage. You can use extensions like MADlib for in-database ML. With MySQL you’d have to extract data to another tool.

Summary
In summary, while both databases could be made to work, PostgreSQL is likely the better choice for this use case due to:

  • Ability to handle complex queries and analytics as the application grows
  • Native JSON support for unstructured metadata
  • Geospatial capabilities with PostGIS
  • Ability to do in-database machine learning
  • Better handling of concurrency for write workloads

The main advantage of MySQL would be faster performance for simple read queries, but that’s likely outweighed by PostgreSQL’s other benefits as the application scales and becomes more complex. The strong developer community and momentum behind PostgreSQL is also a factor in its favor.



Assisted by claude-3-opus on perplexity.ai

Written on April 2, 2024