PostgreSQL – Page 2 – The Rails Drop

Rails 8 App: Setup Test DB in PostgreSQL | Write SQL Queries

1. Add the test db and users table: https://railsdrop.com/2025/04/25/rails-8-app-postgresql-heap-vs-mysql-innodb-b-tree-indexing/

2. Add fake data into the table: https://railsdrop.com/2025/04/29/rails-8-app-postgresql-faker-extensions-for-rails/

Start Learning (Premium): https://railsdrop.com/sql-postgresql-queries-bitmap-seq-index-scan-db-clustering/

📌 Summary of all queries

Check: https://railsdrop.com/postgresql-queries-a-summary/

Read – Composite vs Individual indexes (Premium): https://railsdrop.com/sql-postgres-understanding-current-composite-index/

Read – Create 1 million sample users: https://railsdrop.com/sql-postgresql-create-1-million-sample-users-data/

👉 1. `SELECT` – Basic Query

🔹 1. Select all columns:

SELECT * FROM users;

This gives you every row and every column in the users table.

🔹 2. Select specific columns:

SELECT username, email FROM users;

This limits the output to only username and email.

👉 2. `ALTER` – Modify Table Structure

The ALTER TABLE statement is used to add, delete, or modify columns in an existing table.

🔹 Example 1: Add a new column

Let’s add a column created_at of type timestamp:

ALTER TABLE users 
  ADD COLUMN created_at timestamp;

🔹 Example 2: Rename a column

Let’s rename phone_number to mobile:

ALTER TABLE users
  RENAME COLUMN phone_number TO mobile;

🔹 Example 3: Drop a column

Let’s say you want to remove the created_at column:

ALTER TABLE users
  DROP COLUMN created_at;

🔹 Example 4: Modify column values (use UPDATE, not ALTER)

UPDATE users
  SET mobile = '123456'
  WHERE mobile IS NULL;

Use UPDATE instead of ALTER when modifying data in a table.
ALTER is used for changing the structure/schema of a table (e.g., adding columns), not for updating data.

👉 3. `DISTINCT` – Remove Duplicate Values

The DISTINCT keyword is used to return only unique (non-duplicate) values in a result set.

🔹 Example 1: Distinct usernames

SELECT DISTINCT username FROM users;

This returns a list of unique usernames, removing any duplicates.

🔹 Example 2: Distinct combinations of username and email

SELECT DISTINCT username, email FROM users;

This checks for uniqueness based on both username and email combined.

DISTINCT also works together with WHERE, and EXPLAIN ANALYZE shows how PostgreSQL executes the query:

SELECT DISTINCT username FROM users WHERE username LIKE '%quin%';
EXPLAIN ANALYZE SELECT DISTINCT username FROM users WHERE username LIKE '%quin%';

👉 4. `WHERE` – Filter Records + Major Combine Types (`AND`, `OR`, `NOT`)

The WHERE clause is used to filter records that meet a certain condition.

Let’s look at basic and combined conditions using our users table.

🔹 Example 1: Simple WHERE

SELECT * FROM users WHERE username = 'john_doe';

🔹 Example 2: `AND` – Combine multiple conditions (all must be true)

SELECT * FROM users 
WHERE username = 'quinton' AND email LIKE '%@gmail.com';

🔹 Example 3: `OR` – At least one condition must be true

SELECT * FROM users 
WHERE username = 'quinton' OR username = 'joaquin_hand';

🔹 Example 4: `NOT` – Negate a condition

SELECT * FROM users 
WHERE NOT email LIKE '%@example.com';

🔹 Example 5: Combine `AND`, `OR`, `NOT` (use parentheses!)

SELECT * FROM users 
WHERE (email like '%example%' OR email like '%test%') 
  AND NOT username = 'admin';

👉 5. `ORDER BY` – Sort the Results

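The ORDER BY clause sorts the result set by one or more columns, ascending (ASC, the default) or descending (DESC). We’ll also look at combined queries afterward.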

🔹 Example 1: Order by a single column (ascending)

SELECT * FROM users 
ORDER BY username;

🔹 Example 2: Order by a column (descending)

SELECT * FROM users 
ORDER BY email DESC;

🔹 Example 3: Order by multiple columns

SELECT * FROM users 
ORDER BY username ASC, email DESC;

👉 6. Combined Queries (UNION, INTERSECT, EXCEPT)

✅ These allow you to combine results from multiple SELECT statements.

⚠ Requirements:

Each query must return the same number of columns.
Data types must be compatible.

🔹 `UNION` – Combine results and remove duplicates

SELECT username FROM users WHERE email LIKE '%@example.com'
UNION
SELECT username FROM users WHERE username LIKE 'ton%';

🔹 `UNION ALL` – Combine results and keep duplicates

SELECT username FROM users WHERE email LIKE '%@gmail.com'
UNION ALL
SELECT username FROM users WHERE username LIKE 'test%';

🔹 `INTERSECT` – Return only common results

SELECT username FROM users 
  WHERE email LIKE '%@gmail.com'
INTERSECT
SELECT username FROM users 
  WHERE username LIKE 'test%';

SELECT username FROM users
  WHERE (email like '%example' OR email like '%test')
INTERSECT
SELECT username FROM users
  WHERE username like 'adam';

🔹 `EXCEPT` – Return results from the first query that are not in the second

SELECT username FROM users 
  WHERE email LIKE '%example'
EXCEPT
SELECT username FROM users 
  WHERE (username like '%ada%' OR username like '%merlin%');
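
For readers coming from Ruby, these operators behave much like Array operators. The sketch below is only an analogy with made-up usernames, not a description of how PostgreSQL executes the queries, and unlike Ruby arrays, SQL results have no guaranteed order without ORDER BY.

```ruby
# Plain-Ruby analogy for SQL set operators (the sample data is made up).
GMAIL_USERS = %w[test_amy quinton test_bob quinton].freeze
TEST_USERS  = %w[test_amy test_bob test_cat].freeze

# UNION: combine both lists and remove duplicates.
def sql_union(a, b)
  a | b
end

# UNION ALL: combine both lists and keep duplicates.
def sql_union_all(a, b)
  a + b
end

# INTERSECT: values present in both lists, without duplicates.
def sql_intersect(a, b)
  a & b
end

# EXCEPT: values of the first list missing from the second, without duplicates.
def sql_except(a, b)
  (a - b).uniq
end

p sql_union(GMAIL_USERS, TEST_USERS)
p sql_union_all(GMAIL_USERS, TEST_USERS)
p sql_intersect(GMAIL_USERS, TEST_USERS)
p sql_except(GMAIL_USERS, TEST_USERS)
```

Note the extra uniq in sql_except: Ruby's Array#- keeps repeated values from the first list, while EXCEPT returns each row once.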

👉 7. `IS NULL` and `IS NOT NULL` – Handling Missing Data

These are used to check if a column contains a NULL value (i.e., no value).

🔹 Example 1: Users with a missing email / users who have an email

-- Find users with a missing email
SELECT * FROM users 
WHERE email IS NULL;

-- Find users who have an email
SELECT * FROM users 
WHERE email IS NOT NULL;

🔹 Example 2: Users with no email and no mobile

SELECT * FROM users 
WHERE email IS NULL AND mobile IS NULL;

🔹 Example 3: Users with either email or mobile missing

SELECT * FROM users 
WHERE email IS NULL OR mobile IS NULL;

🔹 Example 4: Users who have an email and username starts with ‘adam’

SELECT * FROM users 
WHERE email IS NOT NULL AND username LIKE 'adam%';

🔹 Example 5: Users with email missing but username is not empty

SELECT * FROM users 
WHERE email IS NULL AND username IS NOT NULL;

🔹 Example 6: Users where email or mobile is null, but not both (exclusive or)

SELECT * FROM users 
WHERE (email IS NULL AND mobile IS NOT NULL)
   OR (email IS NOT NULL AND mobile IS NULL);
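
The same "one or the other, but not both" check can be sketched in plain Ruby, with nil standing in for NULL. The method name is hypothetical and only illustrates the logic of the query above.

```ruby
# True when exactly one of the two values is missing (nil plays the role of NULL).
def exactly_one_missing?(email, mobile)
  email.nil? ^ mobile.nil?
end

p exactly_one_missing?(nil, "123456")              # only the email is missing
p exactly_one_missing?("a@example.com", nil)       # only the mobile is missing
p exactly_one_missing?(nil, nil)                   # both missing
p exactly_one_missing?("a@example.com", "123456")  # neither missing
```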

👉 8. `LIMIT`, `SELECT TOP`, `SELECT TOP PERCENT` (PostgreSQL-style)

In PostgreSQL, we use LIMIT instead of SELECT TOP.
(PostgreSQL doesn’t support TOP directly like SQL Server.)

🔹 Example 1: Limit number of results (first 10 rows)

SELECT * FROM users 
LIMIT 10;

🔹 Example 2: Combined with `ORDER BY` (top 5 newest usernames)

SELECT username FROM users 
  WHERE username IS NOT NULL
ORDER BY id DESC
LIMIT 5;

🔹 Example 3: Paginate (e.g., 11th to 20th row)

SELECT * FROM users 
ORDER BY id 
OFFSET 10 LIMIT 10;
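
In application code the OFFSET is usually derived from a page number. A minimal sketch of that arithmetic, assuming 1-based pages (page_window is a hypothetical helper name, not a Rails API):

```ruby
# Convert a 1-based page number into OFFSET/LIMIT values.
def page_window(page, per_page)
  raise ArgumentError, "page must be >= 1" if page < 1

  { offset: (page - 1) * per_page, limit: per_page }
end

window = page_window(2, 10) # page 2 with 10 rows per page covers rows 11 to 20
puts "OFFSET #{window[:offset]} LIMIT #{window[:limit]}"
```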

🔔 Simulating `SELECT TOP` and `SELECT TOP PERCENT` in PostgreSQL

🔹 Example 4: Simulate `SELECT TOP 1`

SELECT * FROM users 
ORDER BY id 
LIMIT 1;

🔹 Example 5: Simulate `SELECT TOP 10 PERCENT`

To get the top 10% of users by id, you can use a subquery:

SELECT * FROM users
ORDER BY id
LIMIT (SELECT CEIL(COUNT(*) * 0.10) FROM users);
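
To make the CEIL(COUNT(*) * 0.10) arithmetic concrete, here is a small plain-Ruby sketch. The helper name is hypothetical. Rational is used so the calculation stays exact, roughly the way PostgreSQL's numeric type does; plain floats could round up unexpectedly.

```ruby
# Row limit for "top N percent", rounding up like CEIL(COUNT(*) * 0.10).
# Rational keeps the arithmetic exact instead of relying on floats.
def top_percent_limit(total_rows, percent)
  (total_rows * Rational(percent, 100)).ceil
end

puts top_percent_limit(796, 10)    # 79.6 rounds up to 80
puts top_percent_limit(10_000, 10) # exactly 1000
puts top_percent_limit(30, 10)     # exactly 3
```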

🔹 Example 6: Users with Gmail or Yahoo emails, ordered by ID, limit 5

SELECT id, username, email FROM users
WHERE email LIKE '%@gmail.com' OR email LIKE '%@yahoo.com'
AND username IS NOT NULL
ORDER BY id ASC
LIMIT 5;

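Note: Without parentheses, AND has higher precedence than OR, so the query above is evaluated as email LIKE '%@gmail.com' OR (email LIKE '%@yahoo.com' AND username IS NOT NULL). Gmail users with a NULL username would still be returned.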

🔹 Better version with correct logic:

SELECT id, username, email FROM users
WHERE (email LIKE '%@gmail.com' OR email LIKE '%@yahoo.com')
  AND username IS NOT NULL
ORDER BY id ASC
LIMIT 5;
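
Ruby's && and || follow the same precedence rule as SQL's AND and OR, so the difference can be tried out without a database. This is only an illustration with made-up rows, where nil stands in for NULL:

```ruby
# Plain-Ruby illustration of AND/OR precedence (the sample rows are made up).
SampleUser = Struct.new(:username, :email)

SAMPLE_USERS = [
  SampleUser.new(nil,   "amy@gmail.com"),
  SampleUser.new("bob", "bob@yahoo.com"),
  SampleUser.new(nil,   "cat@yahoo.com")
].freeze

# Mirrors the query without parentheses: gmail OR (yahoo AND username present).
def without_parentheses(users)
  users.select do |u|
    u.email.end_with?("@gmail.com") || u.email.end_with?("@yahoo.com") && !u.username.nil?
  end
end

# Mirrors the corrected query: (gmail OR yahoo) AND username present.
def with_parentheses(users)
  users.select do |u|
    (u.email.end_with?("@gmail.com") || u.email.end_with?("@yahoo.com")) && !u.username.nil?
  end
end

p without_parentheses(SAMPLE_USERS).map(&:email)
p with_parentheses(SAMPLE_USERS).map(&:email)
```

The first call keeps the Gmail row even though its username is missing; the second returns only the row that satisfies both conditions.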

👉 9. Aggregation Functions: `MIN`, `MAX`, `COUNT`, `AVG`, `SUM`

These functions help you perform calculations on column values.

🔹 1. `COUNT` – Number of rows

SELECT COUNT(*) FROM users;

✔️ Total number of users.

SELECT COUNT(email) FROM users WHERE email IS NOT NULL;

✔️ Count of users who have an email. COUNT(email) already ignores NULLs, so the WHERE clause is optional here.

🔹 2. `MIN` and `MAX` – Smallest and largest values

SELECT MIN(id) AS first_user, MAX(id) AS last_user FROM users;

🔹 3. `AVG` – Average (only on numeric fields)

Assuming id is somewhat sequential, we can do:

SELECT AVG(id) AS avg_id FROM users;

🔹 4. `SUM` – Total (again, only on numeric fields)

SELECT SUM(id) AS total_ids FROM users WHERE id < 1000;

Combined Queries with Aggregates

🔹 Example 1: Count users without email and with usernames starting with ‘test’

SELECT COUNT(*) FROM users 
WHERE email IS NULL AND username LIKE 'test%';

🔹 Example 2: Get min, max, avg ID of users with Gmail addresses

SELECT 
  MIN(id) AS min_id,
  MAX(id) AS max_id,
  AVG(id) AS avg_id
FROM users 
WHERE email LIKE '%@gmail.com';

🔹 Example 3: Count how many users per email domain

SELECT 
  SPLIT_PART(email, '@', 2) AS domain,
  COUNT(*) AS total_users
FROM users
WHERE email IS NOT NULL
GROUP BY domain
ORDER BY total_users DESC
LIMIT 5;

♦️ This query breaks email at the @ to group by domain like gmail.com, yahoo.com.
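
For comparison, the same grouping can be sketched in plain Ruby on an in-memory list. The method name and sample emails are made up. One difference to keep in mind: SPLIT_PART returns an empty string when there is no second part, while split("@")[1] returns nil.

```ruby
# Count users per email domain, most common first (nil emails are skipped,
# like the WHERE email IS NOT NULL filter).
def top_domains(emails, limit = 5)
  emails.compact
        .map { |email| email.split("@")[1] }
        .tally
        .sort_by { |_domain, count| -count }
        .first(limit)
end

sample_emails = ["a@gmail.com", "b@gmail.com", "c@yahoo.com", nil, "d@gmail.com"]
p top_domains(sample_emails)
```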

`GROUP BY` Course

Here’s the SQL query to get the maximum mark, minimum mark, and the email (or emails) of users grouped by each course:

Option 1: Basic `GROUP BY` with aggregate functions (only max/min mark per course, not emails)

SELECT
  course,
  MAX(mark) AS max_mark,
  MIN(mark) AS min_mark
FROM users
GROUP BY course;

Option 2: Include emails of users who have the max or min mark per course

(Using a subquery and JOIN)

SELECT u.course, u.email, u.mark
FROM users u
JOIN (
  SELECT
    course,
    MAX(mark) AS max_mark,
    MIN(mark) AS min_mark
  FROM users
  GROUP BY course
) stats ON u.course = stats.course AND (u.mark = stats.max_mark OR u.mark = stats.min_mark)
ORDER BY u.course, u.mark DESC;

♦️ This second query shows all users who have the highest or lowest mark in their course, including ties.

Here’s the updated query that includes:

Course name
Emails of users with the maximum or minimum marks
Their marks
Average mark per course

SELECT
  u.course,
  u.email,
  u.mark,
  stats.max_mark,
  stats.min_mark,
  stats.avg_mark
FROM users u
JOIN (
  SELECT
    course,
    MAX(mark) AS max_mark,
    MIN(mark) AS min_mark,
    ROUND(AVG(mark), 2) AS avg_mark
  FROM users
  GROUP BY course
) stats ON u.course = stats.course AND (u.mark = stats.max_mark OR u.mark = stats.min_mark)
ORDER BY u.course, u.mark DESC;

Notes:

ROUND(AVG(mark), 2) gives the average mark rounded to two decimal places.
Users with the same max or min mark are all included.

Here’s the full query including:

Course
Email
Mark
Max/Min mark
Average mark
User count per course

SELECT
  u.course,
  u.email,
  u.mark,
  stats.max_mark,
  stats.min_mark,
  stats.avg_mark,
  stats.user_count
FROM users u
JOIN (
  SELECT
    course,
    MAX(mark) AS max_mark,
    MIN(mark) AS min_mark,
    ROUND(AVG(mark), 2) AS avg_mark,
    COUNT(*) AS user_count
  FROM users
  GROUP BY course
) stats ON u.course = stats.course AND (u.mark = stats.max_mark OR u.mark = stats.min_mark)
ORDER BY u.course, u.mark DESC;

♦️ This query gives you a full breakdown of top/bottom performers per course along with stats per group.

Here’s a version that adds the rank of each user within their course based on their mark (highest mark = rank 1), along with:

Course
Email
Mark
Rank (within course)
Max mark, Min mark, Average mark, User count per course

WITH ranked_users AS (
  SELECT
    u.*,
    RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS rank_in_course
  FROM users u
),
course_stats AS (
  SELECT
    course,
    MAX(mark) AS max_mark,
    MIN(mark) AS min_mark,
    ROUND(AVG(mark), 2) AS avg_mark,
    COUNT(*) AS user_count
  FROM users
  GROUP BY course
)
SELECT
  r.course,
  r.email,
  r.mark,
  r.rank_in_course,
  cs.max_mark,
  cs.min_mark,
  cs.avg_mark,
  cs.user_count
FROM ranked_users r
JOIN course_stats cs ON r.course = cs.course
ORDER BY r.course, r.rank_in_course;

Key features:

Users are ranked per course using RANK() (supports ties).
The output includes all users, not just those with max/min marks.

NOTE: Here we can see output like:

    course    |                   email                   | mark | rank_in_course | max_mark | min_mark | avg_mark | user_count
--------------+-------------------------------------------+------+----------------+----------+----------+----------+------------
 IT           | lisandra.schoen@borer-effertz.test        | 1000 |              1 |     1000 |      100 |   543.04 |        796
 IT           | leona@jaskolski-jaskolski.test            | 1000 |              1 |     1000 |      100 |   543.04 |        796
 IT           | angle@ankunding-sauer.example             |  999 |              3 |     1000 |      100 |   543.04 |        796
 IT           | drucilla_okeefe@monahan.test              |  999 |              3 |     1000 |      100 |   543.04 |        796
 algebra      | natashia.langosh@luettgen.test            | 1000 |              1 |     1000 |      100 |   541.52 |        779
 algebra      | tiffany.tremblay@bergnaum.example         | 1000 |              1 |     1000 |      100 |   541.52 |        779
 algebra      | kristeen.nikolaus@crist.example           |  999 |              3 |     1000 |      100 |   541.52 |        779
 algebra      | domenic@predovic-dare.example             |  999 |              3 |     1000 |      100 |   541.52 |        779
 algebra      | kit@oconner.example                       |  999 |              3 |     1000 |      100 |   541.52 |        779
 architecture | tierra_reilly@botsford-okuneva.test       |  997 |              1 |      997 |      100 |   549.24 |        776
 architecture | celestine_reilly@bayer.example            |  996 |              2 |      997 |      100 |   549.24 |        776
 architecture | carson@kulas.example                      |  995 |              3 |      997 |      100 |   549.24 |        776
 botany       | hassan@towne.test                         | 1000 |              1 |     1000 |      103 |   554.07 |        760
 botany       | shaunna@hudson.test                       | 1000 |              1 |     1000 |      103 |   554.07 |        760
 botany       | sanford_jacobs@johnston.example           |  999 |              3 |     1000 |      103 |   554.07 |        760
 botany       | arnulfo_cremin@ernser.example             |  999 |              3 |     1000 |      103 |   554.07 |        760

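The ranks have gaps: when two users tie for rank 1, RANK() gives the next user rank 3, not 2. To avoid this we can use DENSE_RANK().

Here’s the updated query using DENSE_RANK() instead of RANK(), which avoids gaps in rank numbering when there are ties. It also keeps only the top 3 ranks per course: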

WITH ranked_users AS (
  SELECT
    u.*,
    DENSE_RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS rank_in_course
  FROM users u
),
course_stats AS (
  SELECT
    course,
    MAX(mark) AS max_mark,
    MIN(mark) AS min_mark,
    ROUND(AVG(mark), 2) AS avg_mark,
    COUNT(*) AS user_count
  FROM users
  GROUP BY course
)
SELECT
  r.course,
  r.email,
  r.mark,
  r.rank_in_course,
  cs.max_mark,
  cs.min_mark,
  cs.avg_mark,
  cs.user_count
FROM ranked_users r
JOIN course_stats cs ON r.course = cs.course
WHERE r.rank_in_course <= 3
ORDER BY r.course, r.rank_in_course;

`DENSE_RANK` difference:

If 2 users tie for 1st place, the next gets rank 2 (not 3 like with RANK).
Ensures consistent top-N output when ties are frequent.
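
The numbering difference is easy to see on a small list of marks. The following is a plain-Ruby model of the two window functions for a single course (the method names are hypothetical), not how PostgreSQL computes them:

```ruby
# Model of RANK(): tied marks share a rank, and the next rank skips ahead.
def sql_rank(marks)
  sorted = marks.sort.reverse
  sorted.map { |mark| sorted.index(mark) + 1 }
end

# Model of DENSE_RANK(): tied marks share a rank, with no gaps afterwards.
def sql_dense_rank(marks)
  distinct = marks.uniq.sort.reverse
  marks.sort.reverse.map { |mark| distinct.index(mark) + 1 }
end

marks = [999, 1000, 998, 1000, 999]
p sql_rank(marks)
p sql_dense_rank(marks)
```

With DENSE_RANK(), a filter such as rank_in_course <= 3 keeps the three highest distinct marks, so it can return more than three rows per course when there are ties.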

🔥 Boom, Bonus: To export the query result as a CSV file in PostgreSQL, you can use the \copy command in psql (PostgreSQL’s CLI), like this:

🧾 Export Top 3 Students per Course to CSV

CREATE TEMP VIEW top_students_per_course AS
WITH ranked_users AS (
  SELECT
    u.*,
    DENSE_RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS rank_in_course
  FROM users u
),
course_stats AS (
  SELECT
    course,
    MAX(mark) AS max_mark,
    MIN(mark) AS min_mark,
    ROUND(AVG(mark), 2) AS avg_mark,
    COUNT(*) AS user_count
  FROM users
  GROUP BY course
)
SELECT
  r.course,
  r.email,
  r.mark,
  r.rank_in_course,
  cs.max_mark,
  cs.min_mark,
  cs.avg_mark,
  cs.user_count
FROM ranked_users r
JOIN course_stats cs ON r.course = cs.course
WHERE r.rank_in_course <= 3;

-- psql reads a \copy meta-command only up to the end of the line,
-- so the long query lives in a temporary view and \copy stays on one line.
\copy (SELECT * FROM top_students_per_course ORDER BY course, rank_in_course) TO 'top_students_per_course.csv' WITH CSV HEADER

✅ Requirements:

Run this in the psql shell.
The file top_students_per_course.csv will be saved in your local working directory (where psql was started).
Make sure the OS user running psql has write permission to that directory; \copy writes the file on the client side, not on the database server.

👉 10. `LIKE`, `%`, `_` – Pattern Matching in SQL

These are used to filter text using wildcards:

% = matches any sequence of characters (0 or more)
_ = matches exactly one character

🔹 Basic `LIKE` Queries

Example 1: Usernames starting with “admin”

SELECT * FROM users 
WHERE username LIKE 'admin%';

Example 2: Usernames ending with “bot”

SELECT * FROM users 
WHERE username LIKE '%bot';

Example 3: Usernames containing “test”

SELECT * FROM users 
WHERE username LIKE '%test%';

🔹 `_` Single-character Wildcard

Example 4: 5-character usernames

SELECT * FROM users 
WHERE username LIKE '_____';

(Each _ stands for one character.)

Example 5: Usernames made of any single character + “ohn” (e.g., “john”, “kohn”)

SELECT * FROM users 
WHERE username LIKE '_ohn';
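
One way to build intuition for the two wildcards is to translate a LIKE pattern into a regular expression. The sketch below is a simplified plain-Ruby model with hypothetical helper names: it ignores LIKE's escape character and is not how PostgreSQL implements LIKE.

```ruby
# Translate a SQL LIKE pattern into an anchored Ruby Regexp.
# Simplification: LIKE's escape character is not supported.
def like_to_regexp(pattern)
  parts = pattern.each_char.map do |ch|
    case ch
    when "%" then ".*"            # any sequence of characters, including none
    when "_" then "."             # exactly one character
    else Regexp.escape(ch)        # everything else matches literally
    end
  end
  # LIKE must match the whole value, hence the \A and \z anchors.
  Regexp.new("\\A#{parts.join}\\z", Regexp::MULTILINE)
end

def like?(value, pattern)
  like_to_regexp(pattern).match?(value)
end

p like?("john", "_ohn")       # one character followed by "ohn"
p like?("johnny", "_ohn")     # too long, so no match
p like?("admin_42", "admin%") # starts with "admin"
```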

Combined Queries with `LIKE`, `%`, `_`

🔹 Example 6: Users whose username contains “test” and email ends with “gmail.com”

SELECT * FROM users 
WHERE username LIKE '%test%' AND email LIKE '%@gmail.com';

🔹 Example 7: Users with 3-character usernames and missing email

SELECT * FROM users 
WHERE username LIKE '___' AND email IS NULL;

🔹 Example 8: Users with usernames that start with “a” or end with “x” and have a mobile number

SELECT * FROM users 
WHERE (username LIKE 'a%' OR username LIKE '%x') AND mobile IS NOT NULL;

👉 11. `IN`, `NOT IN`, `BETWEEN` – Set & Range Filters

These are used to filter based on a list of values (IN) or a range (BETWEEN).

🔹 1. `IN` – Match any of the listed values

SELECT * FROM users 
WHERE username IN ('admin', 'test_user', 'john_doe');

🔹 2. `NOT IN` – Exclude listed values

SELECT * FROM users 
WHERE username NOT IN ('admin', 'test_user');

🔹 3. `BETWEEN` – Match within a range (inclusive)

SELECT * FROM users 
WHERE id BETWEEN 100 AND 200;

♦️ Equivalent to: id >= 100 AND id <= 200

Combined Queries

🔹 Example 1: Users with username in a list and `id` between 1 and 500

SELECT * FROM users 
WHERE username IN ('alice', 'bob', 'carol') 
  AND id BETWEEN 1 AND 500;

🔹 Example 2: Exclude system users and select a range of IDs

SELECT id, username FROM users 
WHERE username NOT IN ('admin', 'system') 
  AND id BETWEEN 1000 AND 2000;

🔹 Example 3: Top 5 users whose email domains are in a specific list

SELECT * FROM users 
WHERE SPLIT_PART(email, '@', 2) IN ('gmail.com', 'yahoo.com', 'hotmail.com')
ORDER BY id
LIMIT 5;

👉 12. SQL Aliases – Renaming Columns or Tables Temporarily

Aliases help improve readability, especially in joins or when using functions.

🔹 1. Column Aliases

Use AS (optional keyword) to rename a column in the result.

Example 1: Rename `username` to `user_name`

SELECT username AS user_name, email AS user_email 
FROM users;

You can also omit AS:

SELECT username user_name, email user_email 
FROM users;

🔹 2. Table Aliases

Assign a short name to a table (very useful in joins).

Example 2: Simple alias for table

SELECT u.username, u.email 
FROM users u 
WHERE u.email LIKE '%@gmail.com';

🔹 3. Alias with functions

SELECT COUNT(*) AS total_users, MAX(id) AS latest_id 
FROM users;

Combined Query with Aliases

🔹 Example 4: Count Gmail users, alias result and filter

SELECT 
  COUNT(*) AS gmail_users 
FROM users u 
WHERE u.email LIKE '%@gmail.com';

🔹 Example 5: List usernames with shortened table name and domain extracted

SELECT 
  u.username AS name, 
  SPLIT_PART(u.email, '@', 2) AS domain 
FROM users u 
WHERE u.email IS NOT NULL 
ORDER BY u.username
LIMIT 10;

Rails 8 App: Setup Test DB in PostgreSQL | Faker | Extensions for Rails app, VSCode

Let’s try to add some sample data first to our database.

Step 1: Install `pgxnclient`

On macOS (with Homebrew):

brew install pgxnclient

On Ubuntu/Debian:

sudo apt install pgxnclient

Step 2: Install the `faker` extension via PGXN

pgxn install faker

I ran into an issue installing faker via pgxn:

~ pgxn install faker
INFO: best version: faker 0.5.3
ERROR: resource not found: 'https://api.pgxn.org/dist/PostgreSQL_Faker/0.5.3/META.json'

⚠️ Note: The faker extension we’re trying to install via pgxn is not available, or is improperly published, on the PGXN network. Unfortunately, it is somewhat unofficial and not actively maintained or reliably hosted.

🚨 You can skip Steps 3, 4, and 5 and go with Option 2.

Step 3: Build and install the extension into PostgreSQL

cd /path/to/pg_faker  # PGXN will print this after install
make
sudo make install

Step 4: Enable it in your database

Inside psql :

CREATE EXTENSION faker;

Step 5: Insert 10,000 fake users

INSERT INTO users (user_id, username, email, phone_number)
SELECT
  gs AS user_id,
  faker_username(),
  faker_email(),
  faker_phone_number()
FROM generate_series(1, 10000) AS gs;

Option 2: Use Ruby + Faker gem (if you’re using Rails or Ruby)

If you’re building your app in Rails, use the faker gem directly:

In Ruby:

require 'faker'
require 'pg'

conn = PG.connect(dbname: 'test_db')

(1..10_000).each do |i|
  conn.exec_params(
    "INSERT INTO users (user_id, username, email, phone_number) VALUES ($1, $2, $3, $4)",
    [i, Faker::Internet.username, Faker::Internet.email, Faker::PhoneNumber.phone_number]
  )
end

In Rails (for test_db), Create the Rake Task:

Create a file at:

lib/tasks/seed_fake_users.rake

# lib/tasks/seed_fake_users.rake

namespace :db do
  desc "Seed 10,000 fake users into the users table"
  task seed_fake_users: :environment do
    require "faker"
    require "pg"

    conn = PG.connect(dbname: "test_db")

    # Delete existing users before loading the fake ones.
    # If user_id is a serial column and you also want to reset its sequence, use instead:
    # conn.exec("TRUNCATE TABLE users RESTART IDENTITY")
    conn.exec("DELETE FROM users")

    puts "Seeding 10,000 fake users ...."
    (1..10_000).each do |i|
      conn.exec_params(
        "INSERT INTO users (user_id, username, email, phone_number) VALUES ($1, $2, $3, $4)",
        [ i, Faker::Internet.username, Faker::Internet.email, Faker::PhoneNumber.phone_number ]
      )
    end
    puts "Seeded 10,000 fake users into the users table"
    conn.close
  end
end

# run the task
bin/rails db:seed_fake_users

For Normal Rails Rake Task:

# lib/tasks/seed_fake_users.rake

namespace :db do
  desc "Seed 10,000 fake users into the users table"
  task seed_fake_users: :environment do
    require 'faker'

    puts "🌱 Seeding 10,000 fake users..."

    users = []

    # delete existing users
    User.destroy_all

    10_000.times do |i|
      users << {
        user_id: i + 1,
        username: Faker::Internet.unique.username,
        email: Faker::Internet.unique.email,
        phone_number: Faker::PhoneNumber.phone_number
      }
    end

    # Use insert_all for performance
    User.insert_all(users)

    puts "✅ Done. Inserted 10,000 users."
  end
end

# run the task
bin/rails db:seed_fake_users
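
If the row list grows well beyond 10,000, one option is to send the rows to insert_all in slices rather than in a single statement. This is an assumption on top of the task above, not something it does. The helpers below are hypothetical, and the rows use simple placeholder values instead of Faker output so the sketch runs with plain Ruby:

```ruby
# Build rows shaped like the hashes passed to insert_all.
# The usernames and emails are placeholders, not Faker output.
def build_user_rows(count)
  (1..count).map do |i|
    { user_id: i, username: "user_#{i}", email: "user_#{i}@example.com" }
  end
end

# Split the rows into fixed-size batches.
def in_batches(rows, batch_size = 1_000)
  rows.each_slice(batch_size).to_a
end

batches = in_batches(build_user_rows(2_500))
p batches.map(&:size)

# Inside the rake task this could become (assuming the User model from the post):
#   in_batches(users).each { |batch| User.insert_all(batch) }
```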

Now let’s discuss PostgreSQL extensions and their usage.

PostgreSQL extensions are add-ons or plug-ins that extend the core functionality of PostgreSQL. They provide additional capabilities such as new data types, functions, operators, index types, or full features like full-text search, spatial data handling, or fake data generation.

🔧 What Extensions Can Do

Extensions can:

Add functions (e.g. gen_random_bytes() from pgcrypto)
Provide data types (e.g. hstore, citext, ltree)
Enable indexing techniques (e.g. btree_gin, pg_trgm)
Provide tools for testing and development (e.g. faker, pg_stat_statements)
Enhance performance monitoring, security, or language support

📦 Common PostgreSQL Extensions

Extension	Purpose
`pgcrypto`	Cryptographic functions (e.g., hashing, random byte generation)
`uuid-ossp`	Functions to generate UUIDs
`postgis`	Spatial and geographic data support
`hstore`	Key-value store in a single PostgreSQL column
`pg_trgm`	Trigram-based text search and indexing
`citext`	Case-insensitive text type
`pg_stat_statements`	SQL query statistics collection
`faker`	Generates fake but realistic data (for testing)

📥 Installing and Enabling Extensions

1. Install (if not built-in)

Via package manager or PGXN (PostgreSQL Extension Network), or compile from source.

2. Enable in a database

CREATE EXTENSION extension_name;

Example:

CREATE EXTENSION pgcrypto;

Enabling an extension makes its functionality available to the current database only.

🤔 Why Use Extensions?

Productivity: Quickly add capabilities without writing custom code.
Performance: Access to advanced indexing, statistics, and optimization tools.
Development: Generate test data (faker), test encryption (pgcrypto), etc.
Modularity: PostgreSQL stays lightweight while letting you add only what you need.

Here’s a categorized list (with a simple visual-style layout) of PostgreSQL extensions that are safe and useful for Rails apps in both development and production environments.

🔌 PostgreSQL Extensions for Rails Apps

# connect with psql (shell)
psql -U username -d database_name

-- list all available extensions
SELECT * FROM pg_available_extensions;

-- e.g. to enable the hstore extension, run
CREATE EXTENSION hstore;

-- verify the installation
SELECT * FROM pg_extension;
SELECT * FROM pg_extension WHERE extname = 'hstore';

🔐 Security & UUIDs

Extension	Use Case	Safe for Prod
`pgcrypto`	Secure random bytes, hashes, UUIDs	✅
`uuid-ossp`	UUID generation (v1, v4, etc.)	✅

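💡 Tip: Use uuid-ossp or pgcrypto to generate UUID primary keys (id: :uuid) in Rails. On PostgreSQL 13 and later, gen_random_uuid() is built in, so no extension is needed for random (v4) UUIDs.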

📘 PostgreSQL Procedures and Triggers — Explained with Importance and Examples

PostgreSQL is a powerful, open-source relational database that supports advanced features like stored procedures and triggers, which are essential for encapsulating business logic inside the database.

🔹 What are Stored Procedures in PostgreSQL?

A stored procedure is a named set of SQL and control-flow statements stored in the database and executed by calling it explicitly with CALL (available since PostgreSQL 11).

Purpose: Encapsulate business logic, reuse complex operations, improve performance, and reduce network overhead.

✅ Benefits of Stored Procedures:

Faster execution for multi-step work (runs inside the database; PL/pgSQL caches query plans per session)
Centralized logic
Reduced client-server round trips
Language support: SQL, PL/pgSQL, Python, etc.

🧪 Example: Create a Procedure to Add a New User

CREATE OR REPLACE PROCEDURE add_user(p_name TEXT, p_email TEXT)
LANGUAGE plpgsql
AS $$
BEGIN
    -- The p_ prefix keeps the parameters distinct from the column names.
    INSERT INTO users (name, email) VALUES (p_name, p_email);
END;
$$;

▶️ Call the procedure:

CALL add_user('John Doe', 'john@example.com');

🔹 What are Triggers in PostgreSQL?

A trigger is a special function that is automatically executed in response to certain events on a table (like INSERT, UPDATE, DELETE).

Purpose: Enforce rules, maintain audit logs, auto-update columns, enforce integrity, etc.

✅ Benefits of Triggers:

Automate tasks on data changes
Enforce business rules and constraints
Keep logs or audit trails
Maintain derived data or counters

🧪 Example: Trigger to Log Inserted Users

1. Create the audit table:

CREATE TABLE user_audit (
    id SERIAL PRIMARY KEY,
    user_id INTEGER,
    name TEXT,
    email TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

2. Create the trigger function:

CREATE OR REPLACE FUNCTION log_user_insert()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO user_audit (user_id, name, email)
    VALUES (NEW.id, NEW.name, NEW.email);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

3. Create the trigger on `users` table:

CREATE TRIGGER after_user_insert
AFTER INSERT ON users
FOR EACH ROW
EXECUTE FUNCTION log_user_insert();

Now, every time a user is inserted, the trigger logs it in the user_audit table automatically.

📌 Difference: Procedures vs. Triggers

Feature	Stored Procedures	Triggers
When executed	Called explicitly with `CALL`	Automatically executed on events
Purpose	Batch processing, encapsulate logic	React to data changes automatically
Control	Full control by developer	Fire based on database event (Insert, Update, Delete)
Returns	No return value (output only via INOUT/OUT parameters)	Trigger function returns `NEW`, `OLD`, or `NULL`

🎯 Why Are Procedures and Triggers Important?

✅ Use Cases for Stored Procedures:

Bulk processing (e.g. daily billing)
Data import/export
Account setup workflows
Multi-step business logic

✅ Use Cases for Triggers:

Auto update updated_at column
Enforce soft-deletes
Maintain counters or summaries (e.g., post comment count)
Audit logs / change history
Cascading updates or cleanups

🚀 Real-World Example: Soft Delete Trigger

Instead of deleting records, mark them as deleted = true.

CREATE OR REPLACE FUNCTION soft_delete_user()
RETURNS TRIGGER AS $$
BEGIN
  UPDATE users SET deleted = TRUE WHERE id = OLD.id;
  RETURN NULL; -- cancel the delete
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER before_user_delete
BEFORE DELETE ON users
FOR EACH ROW
EXECUTE FUNCTION soft_delete_user();

Now any DELETE FROM users WHERE id = 1; will just update the deleted column.

🛠️ Tools to Manage Procedures & Triggers

pgAdmin (GUI)
psql (CLI)
Code-based migrations (via tools like ActiveRecord or pg gem)

🧠 Summary

Feature	Stored Procedure	Trigger
Manual/Auto	Manual (CALL)	Auto (event-based)
Flexibility	Complex logic, loops, variables	Quick logic, row-based or statement-based
Languages	PL/pgSQL, SQL, Python, etc.	PL/pgSQL, PL/Python, C, etc. (not plain SQL)
Best for	Multi-step workflows	Audit, logging, validation

Use Postgres RANDOM()

The application can use PostgreSQL’s built-in RANDOM() function to retrieve a random user from the database. Here’s why this is useful:

Efficiency: Letting PostgreSQL pick the row is cheaper than loading all records into memory and selecting one randomly in Ruby. This matters more as the dataset grows (for example, our 10,000 users).
Database-level Operation: The randomization happens at the database level rather than the application level, which:

Reduces memory usage (we don’t need to load unnecessary records)
Reduces network traffic (only one record is transferred)
Takes advantage of PostgreSQL’s built-in random number generation

Single Query: Using RANDOM() allows us to fetch a random record in a single SQL query, typically something like:

SELECT * FROM users ORDER BY RANDOM() LIMIT 1;

This is in contrast to alternatives like:

Loading all users and using Ruby’s sample method (User.all.sample)
Getting a random ID and then querying for it (which would require two queries)
Using offset with count (which can be slow on large tables)

Keep in mind that ORDER BY RANDOM() still has to read every row of the table to pick one, so it suits small and medium tables such as our test data.

🔍 Full Text Search & Similarity

Extension	Use Case	Safe for Prod
`pg_trgm`	Trigram-based fuzzy search (great with `ILIKE` & `similarity`)	✅
`unaccent`	Remove accents for better search results	✅
`fuzzystrmatch`	Soundex, Levenshtein distance	✅ (heavy use = test!)

💡 Combine pg_trgm + unaccent for powerful search in Rails models using ILIKE.

📊 Performance Monitoring & Dev Insights

Extension	Use Case	Safe for Prod
`pg_stat_statements`	Monitor slow queries, frequency	✅
`auto_explain`	Log plans for slow queries	✅
`hypopg`	Simulate hypothetical indexes	✅ (dev only)

🧪 Dev Tools & Data Generation

Extension	Use Case	Safe for Prod
`faker`	Fake data generation for testing	❌ Dev only
`pgfaker`	Community alternative to `faker`	❌ Dev only

📦 Storage & Structure

Extension	Use Case	Safe for Prod
`hstore`	Key-value storage in a column	✅
`citext`	Case-insensitive text	✅

💡 citext is very handy for case-insensitive email columns in Rails.

🗺️ Geospatial (Advanced)

Extension	Use Case	Safe for Prod
`postgis`	GIS/spatial data support	✅ (big apps)

🎨 Visual Summary

+-------------------+-----------------------------+-----------------+
| Category          | Extension                   | Safe for Prod?  |
+-------------------+-----------------------------+-----------------+
| Security/UUIDs    | pgcrypto, uuid-ossp         | ✅              |
| Search/Fuzziness  | pg_trgm, unaccent, fuzzystr | ✅              |
| Monitoring        | pg_stat_statements          | ✅              |
| Dev Tools         | faker, pgfaker              | ❌ (Dev only)   |
| Text/Storage      | citext, hstore              | ✅              |
| Geo               | postgis                     | ✅              |
+-------------------+-----------------------------+-----------------+

PostgreSQL Extension for VSCode

# 1. open the Command Palette (Cmd + Shift + P)
# 2. Type 'PostgreSQL: Add Connection'
# 3. Enter the database hostname, then the authentication details
# 4. Open Command Palette, type: 'PostgreSQL: New Query'

Enjoy PostgreSQL 🚀

Rails 8 App: Setup Test DB in PostgreSQL | Query Performance Using EXPLAIN ANALYZE

Now we’ll go full-on query performance pro mode using EXPLAIN ANALYZE and real plans. We’ll learn how PostgreSQL makes decisions, how to catch slow queries, and how your indexes make them 10x faster.

💎 Part 1: What is `EXPLAIN ANALYZE`?

EXPLAIN shows how PostgreSQL plans to execute your query.

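ANALYZE actually runs the query and adds actual time, rows, loops, etc. Because the query really executes, wrap an INSERT, UPDATE, or DELETE in a transaction and roll it back.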

Syntax:

EXPLAIN ANALYZE
SELECT * FROM users WHERE username = 'bob';

✏️ Example 1: Without Index

SELECT * FROM users WHERE username = 'bob';

If username has no index, plan shows:

Seq Scan on users
  Filter: (username = 'bob')
  Rows Removed by Filter: 9999

❌ PostgreSQL scans all rows = Sequential Scan = slow!

➕ Add Index:

CREATE INDEX idx_users_username ON users (username);

Now rerun:

EXPLAIN ANALYZE SELECT * FROM users WHERE username = 'bob';

You’ll see:

Index Scan using idx_users_username on users
  Index Cond: (username = 'bob')

✅ PostgreSQL uses B-tree index
🚀 Massive speed-up!

🔥 Want even faster?

SELECT username FROM users WHERE username = 'bob';

If PostgreSQL shows:

Index Only Scan using idx_users_username on users
  Index Cond: (username = 'bob')

🎉 Index Only Scan! = covering index success!
No heap fetch = lightning-fast.

⚠️ Note: Index-only scan only works if:

Index covers all selected columns
Table is vacuumed (PostgreSQL uses visibility map)
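The two conditions above can be pictured with a small Python toy model. This is only a sketch under simplifying assumptions: the `heap`, `index`, and `all_visible` dictionaries are hypothetical stand-ins for the table, the B-tree, and the visibility map, and real visibility rules are far more involved.

```python
# Toy heap: page number -> list of (username, email) rows.
heap = {
    0: [("alice", "a@x.com"), ("bob", "b@x.com")],
    1: [("carol", "c@x.com")],
}
# Toy index on username: key -> (page, slot). The key itself is stored in the index.
index = {"alice": (0, 0), "bob": (0, 1), "carol": (1, 0)}
# Toy visibility map: True means every row on the page is visible to all transactions.
all_visible = {0: True, 1: False}

def index_only_scan(username):
    """Return (username, heap_fetches) for SELECT username ... WHERE username = ?."""
    if username not in index:
        return None, 0
    page, slot = index[username]
    if all_visible[page]:
        return username, 0  # answered from the index alone
    # Page not all-visible: visit the heap to confirm the row is visible.
    row = heap[page][slot]
    return row[0], 1

print(index_only_scan("bob"))    # page 0 is all-visible
print(index_only_scan("carol"))  # page 1 still needs a heap visit
```

In the toy, running VACUUM corresponds to flipping `all_visible[1]` to True, after which the lookup for "carol" would no longer touch the heap.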

If you still get Seq scan output like:

test_db=# EXPLAIN ANALYSE SELECT * FROM users where username = 'aman_chetri';
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..1.11 rows=1 width=838) (actual time=0.031..0.034 rows=1 loops=1)
   Filter: ((username)::text = 'aman_chetri'::text)
   Rows Removed by Filter: 2
 Planning Time: 0.242 ms
 Execution Time: 0.077 ms
(5 rows)

even after adding an index, it is because PostgreSQL is saying:

🤔 “The table is so small (cost = 1.11), scanning the whole thing is cheaper than using the index.”
Also: even if you switch to SELECT username, which is eligible for an Index Only Scan, a heap fetch may still be needed until VACUUM has updated the visibility map.
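Where does a cost like 1.11 come from? The PostgreSQL documentation describes the sequential scan estimate as pages read times seq_page_cost, plus rows scanned times cpu_tuple_cost, plus a cpu_operator_cost per row for a WHERE clause. Here is a small Python sketch of that arithmetic using the default settings; `seq_scan_cost` is a hypothetical helper name, and the 358-page, 10,000-row figures are the example table from the documentation, not our users table.

```python
# Default planner cost constants (PostgreSQL defaults; all are configurable).
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025

def seq_scan_cost(pages, tuples, filter_clauses=1):
    """Estimated total cost of a sequential scan with simple WHERE clauses."""
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = tuples * (CPU_TUPLE_COST + filter_clauses * CPU_OPERATOR_COST)
    return io_cost + cpu_cost

print(seq_scan_cost(358, 10000, filter_clauses=0))  # documentation example, no WHERE
print(seq_scan_cost(358, 10000, filter_clauses=1))  # same table with one WHERE clause
print(seq_scan_cost(1, 3))                          # a one-page table with three rows
```

A one-page table costs barely more than 1.0 to scan, which is the same scale as the 1.11 in the plan above, so an index lookup has almost nothing to save.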

🔧 Step-by-step Fix:

✅ 1. Add Data for Bigger Table

If the table is small (few rows), PostgreSQL will usually prefer Seq Scan, because reading one or two pages is cheaper than walking an index first.

Try adding ~10,000 rows:

INSERT INTO users (username, email, phone_number)
SELECT 'user_' || i, 'user_' || i || '@mail.com', '1234567890'
FROM generate_series(1, 10000) i;

Then run VACUUM ANALYZE users; and retry EXPLAIN.

✅ 2. Confirm Index Exists

First, check your index exists and is recognized:

\d users

You should see something like:

Indexes:
    "idx_users_username" btree (username)

If not, add:

CREATE INDEX idx_users_username ON users(username);

✅ 3. Run `ANALYZE` (Update Stats)

ANALYZE users;

This updates the planner statistics. PostgreSQL might skip the index if its statistics are stale, for example if it still believes the table is tiny.

✅ 4. Vacuum for Index-Only Scan

Index-only scans require the visibility map to be set.

Run:

VACUUM ANALYZE users;

This marks pages in the table as “all-visible,” enabling PostgreSQL to avoid reading the heap.

✅ 5. Force PostgreSQL to Consider Index

You can turn off sequential scan temporarily (for testing):

SET enable_seqscan = OFF;

EXPLAIN SELECT username FROM users WHERE username = 'bob';

You should now see an index-based plan (an Index Scan or, after VACUUM, an Index Only Scan):

Index Scan using idx_users_username on users ...

⚠️ Use this only for testing/debugging — not in production.

💡 Extra Tip (optional): Use `EXPLAIN (ANALYZE, BUFFERS)`

EXPLAIN (ANALYZE, BUFFERS)
SELECT username FROM users WHERE username = 'bob';

This will show:

Whether heap was accessed
Buffer hits
Actual rows

📋 Summary

Step	Command
Check Index	`\d users`
Analyze table	`ANALYZE users;`
Vacuum for visibility	`VACUUM ANALYZE users;`
Disable seq scan for test	`SET enable_seqscan = OFF;`
Add more rows (optional)	`INSERT INTO ...`

🚨 How to catch bad index usage?

Always look for:

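“Seq Scan” where you expected “Index Scan” ➔ possibly a missing index (or a table too small to need one)
“Heap Fetches: N” with N > 0 under an Index Only Scan ➔ the visibility map is stale; run VACUUM
“Rows Removed by Filter” with a large count ➔ inefficient filtering
“loops=1000+” on a plan node ➔ an inner node executed many times (nested loop); check the join indexes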
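If you collect plans as text, a few regular expressions can flag these warning signs automatically. The helper below is a minimal Python sketch, not a real plan parser: `plan_warnings` is a hypothetical name, the 1000-loop threshold is an arbitrary choice, and it only understands the default text format of EXPLAIN ANALYZE.

```python
import re

def plan_warnings(plan_text):
    """Return human-readable hints for common red flags in EXPLAIN ANALYZE text output."""
    warnings = []
    if "Seq Scan" in plan_text:
        warnings.append("Seq Scan: check whether an index would help")
    removed = re.search(r"Rows Removed by Filter: (\d+)", plan_text)
    if removed and int(removed.group(1)) > 0:
        warnings.append(f"Filter discarded {removed.group(1)} rows")
    fetches = re.search(r"Heap Fetches: (\d+)", plan_text)
    if fetches and int(fetches.group(1)) > 0:
        warnings.append(f"Index Only Scan still did {fetches.group(1)} heap fetches")
    loops = [int(n) for n in re.findall(r"loops=(\d+)", plan_text)]
    if loops and max(loops) >= 1000:
        warnings.append(f"A plan node ran {max(loops)} times")
    return warnings

sample_plan = """Seq Scan on users  (cost=0.00..1.11 rows=1 width=838) (actual time=0.031..0.034 rows=1 loops=1)
  Filter: ((username)::text = 'aman_chetri'::text)
  Rows Removed by Filter: 2"""

for warning in plan_warnings(sample_plan):
    print(warning)
```

Treat the output as a prompt to look closer rather than a verdict: on a three-row table, a Seq Scan is exactly what you want.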

Common Pattern Optimizations

Pattern	Fix
`WHERE column = ?`	B-tree index on `column`
`WHERE column LIKE 'prefix%'`	B-tree works (needs `text_pattern_ops` unless the column uses the C collation)
`SELECT col1 WHERE col2 = ?`	Covering index: `(col2, col1)` or `(col2) INCLUDE (col1)`
`WHERE col BETWEEN ? AND ?`	Composite index with the range column second: `(status, created_at)`
`WHERE col IN (?, ?, ?)`	Index still helps
`ORDER BY col LIMIT 10`	Index on `col` helps sort fast

⚡ Tip: Use `pg_stat_statements` to Find Slow Queries

Enable it in postgresql.conf:

shared_preload_libraries = 'pg_stat_statements'

Restart PostgreSQL, run CREATE EXTENSION pg_stat_statements; once in your database, then run:

SELECT query, total_exec_time, calls
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;

🎯 Find your worst queries & optimize them with new indexes!

🧪 Try It Yourself

Want a little lab setup to practice?

CREATE TABLE users (
  user_id serial PRIMARY KEY,
  username VARCHAR(220),
  email VARCHAR(150),
  phone_number VARCHAR(20)
);

-- Insert 100K fake rows
INSERT INTO users (username, email, phone_number)
SELECT
  'user_' || i,
  'user_' || i || '@example.com',
  '999-000-' || i
FROM generate_series(1, 100000) i;

Then test:

EXPLAIN ANALYZE SELECT * FROM users WHERE username = 'user_5000';
Add an index: CREATE INDEX idx_users_username ON users (username);
Re-run, compare speed!

🎯 Extra Pro Tools for Query Performance

EXPLAIN ANALYZE → Always first tool
pg_stat_statements → Find slow queries in real apps
auto_explain → Log slow plans automatically
pgBadger or pgHero → Visual query monitoring

💥 Now We Know:

✅ How to read query plans
✅ When you’re doing full scans vs index scans
✅ How to achieve index-only scans
✅ How to catch bad performance early
✅ How to test and fix in real world

Happy Performance Fixing.. 🚀

Rails 8 App: Setup Test DB in PostgreSQL | Covering Index | BRIN Indexes | Hash Indexes | Create super fast indexes

Let’s look at some features of SQL indexing. This will be super helpful while developing our Rails 8 application.

💎 Part 1: What is a Covering Index?

Normally when you query:

SELECT * FROM users WHERE username = 'bob';

Database searches username index (secondary).
Finds a pointer (TID or PK).
Then fetches full row from table (heap or clustered B-tree).

Problem:

Heap fetch = extra disk read.
Clustered B-tree fetch = extra traversal.

📜 Covering Index idea:

✅ If the index already contains all the columns you need,
✅ Then the database does not need to fetch the full row!

It can answer the query purely by scanning the index! ⚡

Boom: one index lookup, no extra hop!

✏️ Example in PostgreSQL:

Suppose your query is:

SELECT username FROM users WHERE username = 'bob';

You only need username.
PostgreSQL B-tree index entries store the indexed column (here, username) plus the TID.

✅ So in this case the index is already covering!

No heap fetch needed, as long as the visibility map marks the page as all-visible.

✏️ Example in MySQL InnoDB:

Suppose your query is:

SELECT username FROM users WHERE username = 'bob';

Secondary index (username) contains:
- username (indexed column)
- user_id (because secondary indexes in InnoDB always store PK)

♦️ So again, already covering!
No need to jump to the clustered index!

🎯 Key point:

If your query only asks for columns already inside the index,
then only the index is touched ➔ no second lookup ➔ super fast!

💎 Part 2: Real SQL Examples

✨ PostgreSQL

Create a covering index for common query:

CREATE INDEX idx_users_username_email ON users (username, email);

Now if you run:

SELECT email FROM users WHERE username = 'bob';

Postgres can:

Search index on username
Already have email in index
✅ No heap fetch!

(And Postgres is smart: the planner considers an index-only scan automatically.)

✨ MySQL InnoDB

Create a covering index:

CREATE INDEX idx_users_username_email ON users (username, email);

✅ Now query:

SELECT email FROM users WHERE username = 'bob';

Same behavior:

Only secondary index read.
No need to touch primary clustered B-tree.

💎 Part 3: Tips to design smart Covering Indexes

✅ If your query uses WHERE on col1 and SELECT col2,
✅ Best to create index: (col1, col2).

✅ Keep indexes small — don’t add 10 columns unless needed.
✅ Avoid huge TEXT or BLOB columns in covering indexes — they make indexes heavy.

✅ Composite indexes are powerful:

CREATE INDEX idx_users_username_email ON users (username, email);

→ Can be used for:

WHERE username = ?
WHERE username = ? AND email = ?
etc.

✅ Monitor index usage:

PostgreSQL: EXPLAIN ANALYZE
MySQL: EXPLAIN

✅ Always check if Index Only Scan or Using Index appears in EXPLAIN plan!

📚 Quick Summary Table

Database	Normal Query	With Covering Index
PostgreSQL	B-tree ➔ Heap fetch (unless an index-only scan applies)	B-tree scan only
MySQL InnoDB	Secondary B-tree ➔ Primary B-tree	Secondary B-tree only
Result	2 steps	1 step
Speed	Slower	Faster

🏆 Great! — Now We Know:

🧊 How heap fetch works!
🧊 How block lookup is O(1)!
🧊 How covering indexes skip heap fetch!
🧊 How to create super fast indexes for PostgreSQL and MySQL!

🦾 Advanced Indexing Tricks (Real Production Tips)

Now it’s time to look into super heavy functionalities that Postgres supports for making our sql data search/fetch super fast and efficient.

1. 🎯 Partial Indexes (PostgreSQL ONLY)

✅ Instead of indexing the whole table,
✅ You can index only the rows you care about!

Example:

Suppose 95% of users have status = 'inactive', but you only search active users:

SELECT * FROM users WHERE status = 'active' AND email = 'bob@example.com';

👉 Instead of indexing the whole table:

CREATE INDEX idx_active_users_email ON users (email) WHERE status = 'active';

♦️ PostgreSQL will only store rows with status = 'active' in this index!

Advantages:

Smaller index = Faster scans
Less space on disk
Faster index maintenance (fewer index entries to update on inserts/updates)

Important:

MySQL (InnoDB) does NOT support partial indexes 😔. Between the two databases compared here, only PostgreSQL has this superpower.
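To see why a partial index stays small, here is a toy Python model. Dictionaries stand in for the B-trees, and the table sizes and the `find_active_by_email` helper are made up for illustration, so read it as a sketch of the idea, not of PostgreSQL internals.

```python
# Toy table: 1,000 users, 5% active (every 20th row).
users = [
    {"id": i, "email": f"user_{i}@example.com", "status": "active" if i % 20 == 0 else "inactive"}
    for i in range(1000)
]

# Full index on email: one entry per row.
full_index = {u["email"]: u["id"] for u in users}

# Partial index on email WHERE status = 'active': entries only for matching rows.
partial_index = {u["email"]: u["id"] for u in users if u["status"] == "active"}

def find_active_by_email(email):
    """Answer WHERE status = 'active' AND email = ? from the partial index alone."""
    user_id = partial_index.get(email)
    return None if user_id is None else users[user_id]

print(len(full_index), len(partial_index))
print(find_active_by_email("user_40@example.com"))
```

Inactive users never enter the partial index, so inserting or updating them costs no index maintenance at all.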

2. 🎯 INCLUDE Indexes (PostgreSQL 11+)

✅ Normally, a composite index uses all columns for sorting/searching.
✅ With INCLUDE, extra columns are just stored in index, not used for ordering.

Example:

CREATE INDEX idx_username_include_email ON users (username) INCLUDE (email);

Meaning:

username is indexed and ordered.
email is only stored alongside.

Now query:

SELECT email FROM users WHERE username = 'bob';

➔ Index-only scan — no heap fetch.

Advantages:

Upper B-tree levels stay smaller than in a normal composite index, because included columns are stored only in the leaf entries.
Helps to create very efficient covering indexes.

Important:

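MySQL has no INCLUDE clause. The closest equivalent is a regular composite index such as (username, email); MySQL 8.0’s invisible indexes and invisible columns are unrelated features.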

3. 🎯 Composite Index Optimization

✅ Always order columns inside index smartly based on query pattern.

Golden Rules:

⚜️ Equality columns first (WHERE col = ?)
⚜️ Range columns second (WHERE col BETWEEN ?)
⚜️ SELECT columns last (for covering)

Example:

If query is:

SELECT email FROM users WHERE status = 'active' AND created_at > '2024-01-01';

Best index:

CREATE INDEX idx_users_status_created_at ON users (status, created_at) INCLUDE (email);

♦️ status first (equality match)
♦️ created_at second (range)
♦️ email included (covering)

Bad Index: (wrong order)

CREATE INDEX idx_created_at_status ON users (created_at, status);

→ Will not be efficient! The scan has to walk the whole created_at range for every status and filter on status afterwards.
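The effect of column order can be simulated in Python with sorted lists standing in for the two B-trees. This is a toy sketch: the 20 sample rows, `scan_good`, and `scan_bad` are all hypothetical, and the appended `"\x00"` is just a trick to build a search key that sorts immediately after a given string.

```python
import bisect

# Toy rows: (status, created_at as an ISO date string). 10 days x 2 statuses.
rows = [(status, f"2024-01-{day:02d}") for day in range(1, 11) for status in ("active", "inactive")]

good_index = sorted(rows)                     # ordered by (status, created_at)
bad_index = sorted((c, s) for s, c in rows)   # ordered by (created_at, status)

def scan_good(status, after):
    """Equality on the first column, range on the second: one contiguous slice."""
    lo = bisect.bisect_right(good_index, (status, after))
    hi = bisect.bisect_left(good_index, (status + "\x00",))
    return good_index[lo:hi], hi - lo   # (matches, entries scanned)

def scan_bad(status, after):
    """Range on the first column: every status in the range is scanned, then filtered."""
    lo = bisect.bisect_left(bad_index, (after + "\x00",))
    scanned = bad_index[lo:]
    matches = [(s, c) for c, s in scanned if s == status]
    return matches, len(scanned)

print(scan_good("active", "2024-01-05")[1])  # entries scanned with (status, created_at)
print(scan_bad("active", "2024-01-05")[1])   # entries scanned with (created_at, status)
```

Both scans return the same five rows, but the badly ordered index reads twice as many entries here, and the gap grows with the number of distinct status values.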

4. 🎯 BRIN Indexes (PostgreSQL ONLY, super special!)

✅ When your table is very huge (millions/billions of rows),
✅ And rows are naturally ordered (like timestamp, id increasing),
✅ You can create a BRIN (Block Range Index).

Example:

CREATE INDEX idx_users_created_at_brin ON users USING BRIN (created_at);

♦️ BRIN stores summaries of large ranges of pages (e.g., min/max timestamp per 128 pages).

♦️ Ultra small index size.

♦️ Very fast for large range queries like:

SELECT * FROM users WHERE created_at BETWEEN '2024-01-01' AND '2024-04-01';

Important:

BRIN ≠ B-tree
BRIN is approximate, B-tree is precise.
Only useful if data is naturally correlated with physical storage order.

MySQL?

MySQL does not have BRIN natively. PostgreSQL has a big advantage here.
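Here is a toy Python model of the min/max idea behind BRIN. It is a sketch under simplifying assumptions: `build_brin` and `candidate_ranges` are hypothetical helpers, each page is just a list of integers, and only the default range size of 128 pages is taken from PostgreSQL.

```python
PAGES_PER_RANGE = 128  # PostgreSQL's default pages_per_range for BRIN

def build_brin(page_values):
    """page_values[i] is the list of values stored on heap page i.
    Returns one (min, max) summary per block range."""
    summaries = []
    for start in range(0, len(page_values), PAGES_PER_RANGE):
        chunk = [v for page in page_values[start:start + PAGES_PER_RANGE] for v in page]
        summaries.append((min(chunk), max(chunk)))
    return summaries

def candidate_ranges(summaries, low, high):
    """Block ranges whose [min, max] overlaps [low, high]; only these are read and rechecked."""
    return [i for i, (lo, hi) in enumerate(summaries) if lo <= high and hi >= low]

# 1,024 pages of 10 ever-increasing values each, like an append-only created_at column.
pages = [[p * 10 + k for k in range(10)] for p in range(1024)]
brin = build_brin(pages)

print(len(brin))                           # summaries needed for 10,240 values
print(candidate_ranges(brin, 1300, 1400))  # block ranges worth reading
```

If the values were shuffled across pages, every summary would span almost the full value range and every block range would become a candidate, which is why BRIN only pays off when values follow physical order.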

5. 🎯 Hash Indexes (special case)

✅ If your query is always exact equality (not range),
✅ You can use hash indexes.

Example:

CREATE INDEX idx_users_username_hash ON users USING HASH (username);

Useful for:

Simple WHERE username = 'bob'
Never ranges (BETWEEN, LIKE, etc.)

⚠️ Warning:

Hash indexes were not WAL-logged before Postgres 10, so they were not crash-safe and could not be replicated.
Now they are safe, but usually B-tree is still better unless you have very heavy point lookups.

😎 PRO-TIP: Which Index Type to Use?

Use case	Index type
Search small ranges or equality	B-tree
Search on huge tables with natural order (timestamps, IDs)	BRIN
Only exact match, super heavy lookup	Hash
Search only small part of table (active users, special conditions)	Partial index
Need to skip heap fetch	INCLUDE / Covering Index

🗺️ Quick Visual Mindmap:

Your Query
│
├── Need Equality + Range? ➔ B-tree
│
├── Need Huge Time Range Query? ➔ BRIN
│
├── Exact equality only? ➔ Hash
│
├── Want Smaller Index (filtered)? ➔ Partial Index
│
├── Want to avoid Heap Fetch? ➔ INCLUDE columns (Postgres) or Covering Index

🏆 Now we Know:

🧊 Partial Indexes
🧊 INCLUDE Indexes
🧊 Composite Index order tricks
🧊 BRIN Indexes
🧊 Hash Indexes
🧊 How to choose best Index

✅ This is serious pro-level database knowledge.

Enjoy SQL! 🚀

Rails 8 App: Setup Test DB | Comprehensive Guide 📖 for PostgreSQL , Mysql Indexing – PostgreSQL Heap ⛰ vs Mysql InnoDB B-Tree 🌿

Enter into psql terminal:

✗ psql postgres
psql (14.17 (Homebrew))
Type "help" for help.

postgres=# \l
                                     List of databases
           Name            |  Owner   | Encoding | Collate | Ctype |   Access privileges
---------------------------+----------+----------+---------+-------+-----------------------
 studio_development | postgres | UTF8     | C       | C     |

Create a new test database and connect to it (\c test_db)
Create a users Table
Check the db and table details

postgres=# create database test_db;
CREATE DATABASE

test_db=# CREATE TABLE users (
user_id INT,
username VARCHAR(220),
email VARCHAR(150),
phone_number VARCHAR(20)
);
CREATE TABLE

test_db=# \dt
List of relations
 Schema | Name  | Type  |  Owner
--------+-------+-------+----------
 public | users | table | abhilash
(1 row)

test_db=# \d users;
                          Table "public.users"
    Column    |          Type          | Collation | Nullable | Default
--------------+------------------------+-----------+----------+---------
 user_id      | integer                |           |          |
 username     | character varying(220) |           |          |
 email        | character varying(150) |           |          |
 phone_number | character varying(20)  |           |          |

Add a Primary key to users and check the user table.

test_db=# ALTER TABLE users ADD PRIMARY KEY (user_id);
ALTER TABLE

test_db=# \d users;
                          Table "public.users"
    Column    |          Type          | Collation | Nullable | Default
--------------+------------------------+-----------+----------+---------
 user_id      | integer                |           | not null |
 username     | character varying(220) |           |          |
 email        | character varying(150) |           |          |
 phone_number | character varying(20)  |           |          |
Indexes:
    "users_pkey" PRIMARY KEY, btree (user_id)

# OR add primary key when creating the table:
CREATE TABLE users (
  user_id INT PRIMARY KEY,
  username VARCHAR(220),
  email VARCHAR(150),
  phone_number VARCHAR(20)
);

You can see that adding a primary key also adds a NOT NULL constraint and a unique index.

Why does adding a primary key also add an index?

A primary key must guarantee that each value is unique and fast to find.
Without an index, the database would have to scan the whole table every time you look up a primary key, which would be very slow.
So PostgreSQL automatically creates a unique index on the primary key to make lookups efficient and to enforce uniqueness at the database level.

👉 It needs the index for speed and to enforce the “no duplicates” rule of primary keys.

What is btree?

btree refers to the B-tree, a self-balancing tree data structure (often read as “balanced tree”).
It’s the default index type in PostgreSQL.
B-tree indexes organize the data in a tree structure, so that searches, inserts, updates, and deletes are all very efficient — about O(log n) time.
It’s great for looking up exact matches (like WHERE user_id = 123) or range queries (like WHERE user_id BETWEEN 100 AND 200).

👉 So when you see btree, it just means it’s using a very efficient tree structure for your primary key index.

Summary in one line:
Adding a primary key automatically adds a btree index to enforce uniqueness and make lookups super fast.

In MySQL (specifically InnoDB engine, which is default now):

Primary keys always create an index automatically.
The index is a clustered index — this is different from Postgres!
The index uses a B-tree structure too, just like Postgres.

👉 So yes, MySQL also adds an index and uses a B-tree under the hood for primary keys.

But here’s a big difference:

In InnoDB, the table data itself is stored inside the primary key’s B-tree.
- That’s called a clustered index.
- It means the physical storage of the table rows follows the order of the primary key.
In PostgreSQL, the index and the table are stored separately (non-clustered by default).

Example: If you have a table like this in MySQL:

CREATE TABLE users (
  user_id INT PRIMARY KEY,
  username VARCHAR(220),
  email VARCHAR(150)
);

user_id will have a B-tree clustered index.
The rows themselves will be stored sorted by user_id.

Short version:

Database	Primary Key Behavior	B-tree?	Clustered?
PostgreSQL	Separate index created for PK	Yes	No (separate by default)
MySQL (InnoDB)	PK index + Table rows stored inside the PK’s B-tree	Yes	Yes (always clustered)

Why Indexing on Unique Columns (like `email`) Improves Lookup 🔍

Use Case

You frequently run queries like:

SELECT * FROM students WHERE email = 'john@example.com';

Without an index, this results in a full table scan — checking each row one-by-one.

With an index, the database can jump directly to the row using a sorted structure, significantly reducing lookup time — especially in large tables.

🌲 How SQL Stores Indexes Internally (PostgreSQL)

📚 PostgreSQL uses B-Tree indexes by default.

When you run:

CREATE UNIQUE INDEX idx_students_on_email ON students(email);

PostgreSQL creates a balanced B-tree like this:

          m@example.com
         /              \
  d@example.com     t@example.com
  /        \           /         \
...      ...        ...         ...

✅ Keys (email values) are sorted lexicographically.
✅ Each leaf node contains a pointer to the actual row in the students table (called a tuple pointer or TID).
✅ Lookup uses binary search, giving O(log n) performance.
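To make the O(log n) claim concrete, this Python sketch counts comparisons in a plain binary search over one million sorted email keys. It is a simplification: `binary_search_steps` is a hypothetical helper, and a real B-tree spreads the keys over pages holding hundreds of entries each, so it needs only a few page reads with a binary search inside each page.

```python
def binary_search_steps(sorted_keys, target):
    """Return (found, comparisons) for a binary search over sorted keys."""
    lo, hi = 0, len(sorted_keys) - 1
    steps = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_keys[mid] == target:
            return True, steps
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, steps

# Zero-padded numbers keep the list in lexicographic order.
emails = [f"user_{i:07d}@example.com" for i in range(1_000_000)]

found, steps = binary_search_steps(emails, "user_0765432@example.com")
print(found, steps)
```

A full scan of the same list could need up to a million comparisons, while the binary search never needs more than 20 for this size.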

⚙️ Unique Index = Even Faster

Because all email values are unique, the database:

Can stop searching immediately once a match is found.
Doesn’t need to scan multiple leaf entries (no duplicates).

🧠 Summary

Feature	Value
Index Type	B-tree (default in PostgreSQL)
Lookup Time	O(log n) vs O(n) without index
Optimized for	Equality search (`WHERE email = ...`), sorting, joins
Email is unique?	✅ Yes – index helps even more (no need to check multiple rows)
Table scan avoided?	✅ Yes – PostgreSQL jumps directly via B-tree lookup

What Exactly is a Clustered Index in MySQL (InnoDB)?

🔹 In MySQL InnoDB, the primary key IS the table.

🔹 A Clustered Index means:

The table’s data rows are physically organized in the order of the primary key.
No separate storage for the table – it’s merged into the primary key’s B-tree structure.

In simple words:
👉 “The table itself lives inside the primary key B-tree.”

That’s why:

Every secondary index must store the primary key value (not a row pointer).
InnoDB can only have one clustered index (because you can’t physically order a table in two different ways).

📈 Visual for MySQL Clustered Index

Suppose you have:

CREATE TABLE users (
  user_id INT PRIMARY KEY,
  username VARCHAR(255),
  email VARCHAR(255)
);

The storage looks like:

B-tree by user_id (Clustered)

user_id  | username | email
----------------------------
101      | Alice    | a@x.com
102      | Bob      | b@x.com
103      | Carol    | c@x.com

👉 Table rows are stored directly inside the leaf nodes of the B-tree, ordered by user_id!

🔵 PostgreSQL (Primary Key Index = Separate)

Imagine you have a users table:

users table (physical table):

row_id | user_id | username | email
-------------------------------------
  1    |   101   | Alice    | a@example.com
  2    |   102   | Bob      | b@example.com
  3    |   103   | Carol    | c@example.com

And the Primary Key Index looks like:

Primary Key B-Tree (separate structure):

user_id -> row pointer
 101    -> row_id 1
 102    -> row_id 2
 103    -> row_id 3

👉 When you query WHERE user_id = 102, PostgreSQL goes:

Find user_id 102 in the B-tree index,
Then jump to row_id 2 in the actual table.

🔸 Index and Table are separate.
🔸 Extra step: index lookup ➔ then fetch row.

🟠 MySQL InnoDB (Primary Key Index = Clustered)

Same users table, but stored like this:

Primary Key Clustered B-Tree (index + data together):

user_id | username | email
---------------------------------
  101   | Alice    | a@example.com
  102   | Bob      | b@example.com
  103   | Carol    | c@example.com

👉 When you query WHERE user_id = 102, MySQL:

Goes straight to user_id 102 in the B-tree,
Data is already there, no extra lookup.

🔸 Index and Table are merged.
🔸 One step: direct access!

📈 Quick Visual:

PostgreSQL
(Index)    ➔    (Table Row)
    |
    ➔ extra lookup needed

MySQL InnoDB
(Index + Row Together)
    |
    ➔ data found immediately

Summary:

PostgreSQL: primary key index is separate ➔ needs 2 steps (index ➔ table).
MySQL InnoDB: primary key index is clustered ➔ 1 step (index = table).

📚 How Secondary Indexes Work

Secondary Index = an index on a column that is not the primary key.

Example:

CREATE INDEX idx_username ON users(username);

Now you have an index on username.

🔵 PostgreSQL Secondary Index Behavior

Secondary indexes are separate structures from the table (just like the primary key index).
When you query by username, PostgreSQL:
1. Finds the matching row_id using the secondary B-tree index.
2. Then fetches the full row from the table by row_id.
This is called an Index Scan + Heap Fetch.

📜 Example:

Secondary Index (username -> row_id):

username -> row_id
------------------
Alice    -> 1
Bob      -> 2
Carol    -> 3

(users table is separate)

👉 Flexible, but needs 2 steps: index (row_id) ➔ table.

🟠 MySQL InnoDB Secondary Index Behavior

In InnoDB, secondary indexes don’t store row pointers.
Instead, they store the primary key value!

So:

Find the matching primary key using the secondary index.
Use the primary key to find the actual row inside the clustered primary key B-tree.

📜 Example:

Secondary Index (username -> user_id):

username -> user_id
--------------------
Alice    -> 101
Bob      -> 102
Carol    -> 103

(Then find user_id inside Clustered B-Tree)

✅ Needs 2 steps too: secondary index (primary key) ➔ clustered table.

📈 Quick Visual:

Feature	PostgreSQL	MySQL InnoDB
Secondary Index	username ➔ row pointer (row_id)	username ➔ primary key (user_id)
Fetch Full Row	Use row_id to get table row	Use primary key to find row in clustered index
Steps to Fetch	Index ➔ Table	Index ➔ Primary Key ➔ Table (clustered)

Action	PostgreSQL	MySQL InnoDB
Primary Key Lookup	Index ➔ Row (2 steps)	Clustered Index (1 step)
Secondary Index Lookup	Index (row_id) ➔ Row (2 steps)	Secondary Index (PK) ➔ Row (2 steps)
Storage Model	Separate index and table	Primary key and table merged (clustered)
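The two storage models in these tables can be mimicked with a few Python dictionaries. This is a toy sketch only: dictionaries stand in for B-trees, list positions stand in for row pointers, and every name below is hypothetical.

```python
# --- PostgreSQL-style: heap table plus separate indexes that point at row locations ---
pg_heap = [(101, "Alice", "a@example.com"),   # position 0
           (102, "Bob", "b@example.com"),     # position 1
           (103, "Carol", "c@example.com")]   # position 2
pg_pk_index = {101: 0, 102: 1, 103: 2}                   # user_id  -> heap position
pg_username_index = {"Alice": 0, "Bob": 1, "Carol": 2}   # username -> heap position

def pg_find_by_username(name):
    position = pg_username_index[name]   # step 1: secondary index lookup
    return pg_heap[position]             # step 2: direct heap fetch

# --- InnoDB-style: rows live inside the primary key structure ---
innodb_clustered = {101: ("Alice", "a@example.com"),
                    102: ("Bob", "b@example.com"),
                    103: ("Carol", "c@example.com")}               # user_id -> rest of the row
innodb_username_index = {"Alice": 101, "Bob": 102, "Carol": 103}   # username -> user_id

def innodb_find_by_username(name):
    user_id = innodb_username_index[name]          # step 1: secondary index gives the PK
    return (user_id, *innodb_clustered[user_id])   # step 2: look the PK up in the clustered index

print(pg_find_by_username("Bob"))
print(innodb_find_by_username("Bob"))
```

Both lookups return the same row in two steps. The difference is what step 2 is: a positional jump into the heap on the PostgreSQL side, a second keyed lookup on the InnoDB side.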

🌐 Now, let’s do some Real SQL Query ⛁ Examples!

1. Simple `SELECT * FROM users WHERE user_id = 102;`

PostgreSQL:
Look into PK btree ➔ find row pointer ➔ fetch row separately.
MySQL InnoDB:
Directly find the row inside the PK B-tree (no extra lookup).

✅ MySQL is a little faster here because it needs only 1 step!

2. `SELECT username FROM users WHERE user_id = 102;` (Only 1 Column)

PostgreSQL:
Might do an Index Only Scan if all needed data is in the index (very fast).
MySQL:
Clustered index contains all columns already, no special optimization needed.

✅ Both can be very fast, but PostgreSQL shines if the index is “covering” (i.e., contains all needed columns), because such an index is smaller than MySQL’s clustered index, which carries every column of the row.

3. `SELECT * FROM users WHERE username = 'Bob';` (Secondary Index Search)

PostgreSQL:
Secondary index on username ➔ row pointer ➔ fetch table row.
MySQL:
Secondary index on username ➔ get primary key ➔ clustered index lookup ➔ fetch data.

✅ Both are 2 steps, but MySQL needs 2 different B-trees: secondary ➔ primary clustered.

Consider the below situation:

SELECT username FROM users WHERE user_id = 102;

user_id is the Primary Key.
You only want username, not full row.

Now:

🔵 PostgreSQL Behavior

👉 In PostgreSQL, by default:

It uses the primary key btree to find the row pointer.
Then fetches the full row from the table (heap fetch).

👉 But PostgreSQL has an optimization called Index-Only Scan.

If all requested columns are already present in the index,
And if the visibility map marks the row’s page as all-visible (no recently inserted, updated, or deleted rows needing a visibility check),
Then Postgres does not fetch the heap.

👉 So in this case:

If the primary key index also stores username internally (or if an extra index is created covering username), Postgres can satisfy the query just from the index.

✅ Result: No table lookup needed ➔ Very fast (almost as fast as InnoDB clustered lookup).

📢 Postgres primary key indexes usually don’t store extra columns, unless you specifically create an index that includes them (INCLUDE (username) syntax in modern Postgres 11+).

🟠 MySQL InnoDB Behavior

In InnoDB:
Since the primary key B-tree already holds all columns (user_id, username, email),
It directly finds the row from the clustered index.
So when you query by PK, even if you only need one column, it has everything inside the same page/block.

✅ One fast lookup.

🔥 Why sometimes Postgres can still be faster?

If PostgreSQL uses Index-Only Scan, and the page is already cached, and no extra visibility check is needed,
Then Postgres may avoid touching the table at all and only scan the tiny index pages.
In this case, for very narrow queries (e.g., only 1 small field), Postgres can outperform even MySQL clustered fetch.

💡 Because index entries are narrow, far more of them fit in one 8KB page than full table rows do, so fewer pages need to be read.

🎯 Conclusion:

✅ MySQL clustered index is always fast for PK lookups.
✅ PostgreSQL can be even faster for small/narrow queries if Index-Only Scan is triggered.

👉 Quick Tip:

In PostgreSQL, you can force an index to include extra columns by using: CREATE INDEX idx_user_id_username ON users(user_id) INCLUDE (username); Then index-only scans become more common and predictable! 🚀

Isn’t PostgreSQL also doing 2 B-tree scans? One for secondary index and one for table (row_id)?

When you query with a secondary index, like:

SELECT * FROM users WHERE username = 'Bob';

In MySQL InnoDB, I said:
1. Find in secondary index (username ➔ user_id)
2. Then go to primary clustered index (user_id ➔ full row)

Let’s look at PostgreSQL first:

♦️ Step 1: Search Secondary Index B-tree on username.

It finds the matching TID (tuple ID) or row pointer.
- TID is a pair (block_number, row_offset).
- Not a B-tree! Just a physical pointer.

♦️ Step 2: Use the TID to directly jump into the heap (the table).

The heap (table) is not a B-tree — it’s just a collection of unordered pages (blocks of rows).
PostgreSQL goes directly to the block and offset — like jumping straight into a file.

🔔 Important:

Secondary index ➔ TID ➔ heap fetch.
No second B-tree traversal for the table!

🟠 Meanwhile in MySQL InnoDB:

♦️ Step 1: Search Secondary Index B-tree on username.

It finds the Primary Key value (user_id).

♦️ Step 2: Now, search the Primary Key Clustered B-tree to find the full row.

Need another B-tree traversal based on user_id.

🔔 Important:

Secondary index ➔ Primary Key B-tree ➔ data fetch.
Two full B-tree traversals!

Real-world Summary:

♦️ PostgreSQL

Secondary index gives a direct shortcut to the heap.
One B-tree scan (secondary) ➔ Direct heap fetch.

♦️ MySQL

Secondary index gives PK.
Then another B-tree scan (primary clustered) to find full row.

✅ PostgreSQL does not scan a second B-tree when fetching from the table — just a direct page lookup using TID.

✅ MySQL does scan a second B-tree (primary clustered index) when fetching full row after secondary lookup.

Is heap fetch a searching technique? Why is it faster than B-tree?

📚 Let’s start from the basics:

When PostgreSQL finds a match in a secondary index, what it gets is a TID.

♦️ A TID (Tuple ID) is a physical address made of:

Block Number (page number)
Offset Number (row slot inside the page)

Example:

TID = (block_number = 1583, offset = 7)

🔵 How PostgreSQL uses TID?

It directly calculates the location of the block (disk page) using block_number.
It reads that block (if not already in memory).
Inside that block, it finds the row at offset 7.

♦️ No search, no btree, no extra traversal — just:

Find the page (via simple number addressing)
Find the row slot

📈 Visual Example

Secondary index (username ➔ TID):

username	TID
Alice	(1583, 7)
Bob	(1592, 3)
Carol	(1601, 12)

♦️ When you search for “Bob”:

Find (1592, 3) from secondary index B-tree.
Jump directly to Block 1592, Offset 3.
Done ✅!

Answer:

Heap fetch is NOT a search.
It’s a direct address lookup (fixed number).
Heap = unordered collection of pages.
Pages = fixed-size blocks (usually 8 KB each).
TID gives an exact GPS location inside heap — no searching required.

That’s why heap fetch is faster than another B-tree search:

No binary search, no B-tree traversal needed.
Only a simple disk/memory read + row offset jump.

🌿 B-tree vs 📁 Heap Fetch

Action	B-tree	Heap Fetch
What it does	Binary search inside sorted tree nodes	Direct jump to block and slot
Steps needed	Traverse nodes (root ➔ internal ➔ leaf)	Directly read page and slot
Time complexity	O(log n)	O(1)
Speed	Slower (needs comparisons)	Very fast (direct)

🎯 Final and short answer:

♦️ In PostgreSQL, after finding the TID in the secondary index, the heap fetch is a direct, constant-time (O(1)) access — no B-tree needed!
♦️ This is faster than scanning another B-tree like in MySQL InnoDB.

🧩 Our exact question:

When we say:

Jump directly to Block 1592, Offset 3.

We are thinking:

There are thousands of blocks.
How can we directly jump to block 1592?
Shouldn’t that be O(n) (linear time)?
Shouldn’t there be some traversal?

🔵 Here’s the real truth:

No traversal needed.
No O(n) work.
Accessing Block 1592 is O(1) — constant time.

📚 Why?

Because of how files, pages, and memory work inside a database.

When PostgreSQL stores a table (the “heap”), it saves it in a file on disk.
The file is just a long array of fixed-size pages.

Each page = 8KB (default in Postgres).
Each block = 1 page = fixed 8KB chunk.
Block 0 is the first 8KB.
Block 1 is next 8KB.
Block 2 is next 8KB.
…
Block 1592 = (1592 × 8 KB) offset from the beginning.

✅ So block 1592 is simply located at 1592 × 8192 bytes offset from the start of the file.

✅ Operating systems (and PostgreSQL’s Buffer Manager) know exactly how to seek to that byte position without reading everything before it.

📈 Diagram (imagine the table file):

+-----------+-----------+-----------+-----------+-----------+------+
| Block 0   | Block 1   | Block 2   | Block 3   | Block 4   |  ... |
+-----------+-----------+-----------+-----------+-----------+------+
  (8KB)       (8KB)       (8KB)       (8KB)       (8KB)

Finding Block 1592 ➔
Seek directly to offset 1592 * 8192 bytes ➔
Read 8KB ➔
Find row at Offset 3 inside it.
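
To make the offset math concrete, here is a minimal Ruby sketch. It assumes a toy in-memory "heap file" built with StringIO, and the helper names block_byte_offset and read_block are made up for illustration; they are not PostgreSQL APIs.

```ruby
require "stringio"

PAGE_SIZE = 8192 # PostgreSQL's default block size in bytes

# Byte position where a block starts in a flat file of fixed-size pages.
def block_byte_offset(block_number, page_size = PAGE_SIZE)
  block_number * page_size
end

# Read one page by seeking straight to it; nothing before it is read.
def read_block(io, block_number, page_size = PAGE_SIZE)
  io.seek(block_byte_offset(block_number, page_size))
  io.read(page_size)
end

# Toy heap file: 4 pages, each filled with its own block number as a digit.
heap = StringIO.new((0..3).map { |n| n.to_s * PAGE_SIZE }.join)

page = read_block(heap, 2)
puts block_byte_offset(1592) # 13041664
puts page.bytesize           # 8192
puts page[0]                 # 2
```

The cost of the seek does not depend on how many blocks come before the target, which is the whole point of the fixed page size.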

🤔 What happens technically?

If in memory (shared buffers / page cache):

PostgreSQL checks its buffer pool (shared memory).
“Do I already have block 1592 cached?”
- ✅ Yes: immediately access memory address.
- ❌ No: Load block 1592 from disk into memory.

If from disk (rare if cached):

File systems (ext4, xfs, etc) know how to seek to a byte offset in a file without reading previous parts.
Seek to (block_number × 8192) bytes.
Read exactly 8KB into memory.
No need to scan the whole file linearly.

📊 Final Step: Inside the Block

Once the block is loaded:

The block starts with an array of line pointers (item identifiers), one per tuple.
Each line pointer records where its tuple sits inside the page.
Offset 3 ➔ third line pointer ➔ the tuple it points to.

♦️ Again, this is just array lookup — no traversal, no O(n).
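
A tiny sketch of that in-page lookup, as a simplification: it assumes each slot stores a [start, length] pair into the page's data area. ToyPage and tuple_at are hypothetical names, and real PostgreSQL pages carry more bookkeeping than this.

```ruby
# Toy page: each slot stores [start, length] into the page's data area.
# (Real PostgreSQL pages are more involved; this only shows the lookup idea.)
ToyPage = Struct.new(:line_pointers, :data) do
  # Offset numbers in a TID are 1-based, so offset 3 is index 2 of the array.
  def tuple_at(offset_number)
    raise ArgumentError, "offset numbers start at 1" if offset_number < 1

    start, length = line_pointers.fetch(offset_number - 1)
    data[start, length]
  end
end

toy_page = ToyPage.new([[0, 5], [5, 3], [8, 4]], "alicebobcarl")
puts toy_page.tuple_at(3) # carl
```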

⚡ So to summarize:

Question	Answer
How does PostgreSQL jump directly to block?	Using the block number × page size calculation (fixed offset math).
Is it O(n)?	❌ No, it’s O(1) constant time
Is there any traversal?	❌ No traversal. Just a seek + memory read.
How fast?	Extremely fast if the page is cached; a cache miss costs one disk read.

🔥 Key concept:

PostgreSQL heap access is O(1) because the heap file is a flat sequence of fixed-size pages, and the TID gives exact coordinates.

🎯 Simple Real World Example:

Imagine you have a giant book (the table file).
Each page of the book is numbered (block number).

If someone says:

👉 “Go to page 1592.”

♦️ You don’t need to read pages 1 to 1591 first.
♦️ You just flip directly to page 1592.

📗 Same idea: no linear traversal, just positional lookup.

🧠 Deep thought:

Because blocks are fixed size and TID is known,
heap fetch is almost as fast as reading a small array.

(Actually faster than searching B-tree because B-tree needs multiple comparisons at each node.)

Enjoy SQL! 🚀

Setup 🛠 Rails 8 App – Part 13: Composite keys & Candidate keys in Rails DB

🔑 What Is a Composite Key?

A composite key is a primary key made up of two or more columns that together uniquely identify a row in a table.

Use a composite key when no single column is unique on its own, but the combination is.

👉 Example: Composite Key in Action

Let’s say we’re building a table to track which students are enrolled in which courses.

Without Composite Key:

-- This table might allow duplicates
CREATE TABLE Enrollments (
  student_id INT,
  course_id INT
);

Nothing stops the same student from enrolling in the same course multiple times!

With Composite Key:

CREATE TABLE Enrollments (
  student_id INT,
  course_id INT,
  PRIMARY KEY (student_id, course_id)
);

Now:

student_id alone is not unique
course_id alone is not unique
But together → each (student_id, course_id) pair is unique
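
The same rule can be sketched in plain Ruby. This is only an in-memory stand-in for the constraint, assuming a hypothetical ToyEnrollments class; the database enforces the real thing.

```ruby
require "set"

# In-memory stand-in for a table with PRIMARY KEY (student_id, course_id).
class ToyEnrollments
  def initialize
    @keys = Set.new
  end

  # Returns true when the pair is new, false when it would be a duplicate.
  def enroll(student_id, course_id)
    !@keys.add?([student_id, course_id]).nil?
  end
end

enrollments = ToyEnrollments.new
puts enrollments.enroll(1, 101) # true
puts enrollments.enroll(1, 102) # true  (same student, different course)
puts enrollments.enroll(2, 101) # true  (same course, different student)
puts enrollments.enroll(1, 101) # false (duplicate pair)
```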

📌 Why Use Composite Keys?

When to Use	Why
Tracking many-to-many relationships	Ensures unique pairs
Bridging/junction tables	e.g., students-courses, authors-books
No natural single-column key	But the combination is unique

⚠️ Things to Keep in Mind

Composite keys enforce uniqueness across multiple columns.
They can also be used as foreign keys in other tables.
Some developers prefer to add an auto-increment id as the primary key instead—but that’s a design choice.

🔎 What Is a Candidate Key?

A candidate key is any column (or combination of columns) in a table that can uniquely identify each row.

Every table can have multiple candidate keys
One of them is chosen to be the primary key
The rest are called alternate keys

🔑 Think of candidate keys as “potential primary keys”

👉 Example: Users Table

CREATE TABLE Users (
  user_id INT,
  username VARCHAR(80),
  email VARCHAR(150),
  phone_number VARCHAR(30)
);

Let’s get some hands-on experience with SQL queries by creating a test DB. Check https://railsdrop.com/2025/04/25/rails-8-app-part-13-2-test-sql-queries/

Assume:

user_id is unique
username is unique
email is unique

Candidate Keys:

user_id
username
email

You can choose any one of them as the primary key, depending on your design needs.

-- Choosing user_id as the primary key
PRIMARY KEY (user_id)

The rest (username, email) are alternate keys.

📌 Characteristics of Candidate Keys

Property	Description
Uniqueness	Must uniquely identify each row
Non-null	Cannot contain NULL values
Minimality	Must be the smallest set of columns that uniquely identifies a row (no extra columns)
No duplicates	No two rows have the same value(s)
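
These properties can be checked mechanically against sample rows. Here is a rough Ruby sketch with made-up helper names (unique_key?, candidate_key?) and made-up data. Keep in mind that sample data can only disprove a candidate key; whether a column is truly unique is a business rule.

```ruby
# True when the column set has no NULLs and no repeated value combination.
def unique_key?(rows, columns)
  keys = rows.map { |row| row.values_at(*columns) }
  keys.none? { |key| key.include?(nil) } && keys.uniq.size == keys.size
end

# Candidate key = unique, non-null, and minimal (no smaller subset is unique).
def candidate_key?(rows, columns)
  return false unless unique_key?(rows, columns)

  (1...columns.size).none? do |size|
    columns.combination(size).any? { |subset| unique_key?(rows, subset) }
  end
end

rows = [
  { user_id: 1, username: "asha", email: "asha@example.com", city: "Kochi" },
  { user_id: 2, username: "ravi", email: "ravi@example.com", city: "Kochi" }
]

puts candidate_key?(rows, [:user_id])         # true
puts candidate_key?(rows, [:city])            # false (duplicates)
puts candidate_key?(rows, [:user_id, :email]) # false (unique, but not minimal)
```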

👥 Candidate Key vs Composite Key

Concept	Explanation
Candidate Key	Any unique identifier (single or multiple columns)
Composite Key	A candidate key that uses multiple columns

So: All composite keys are candidate keys, but not all candidate keys are composite.

💡 When Designing a Database

Find all possible candidate keys
Choose one as the primary key
(Optional) Define other candidate keys as unique constraints

CREATE TABLE Users (
  user_id INT PRIMARY KEY,
  username VARCHAR NOT NULL UNIQUE,  -- UNIQUE alone still allows NULLs
  email VARCHAR NOT NULL UNIQUE
);

Let’s walk through a real-world example using a schema we are already working on: a shopping app that sells clothing for women, men, kids, and infants.

We’ll look at how candidate keys apply to real tables like Users, Products, Orders, etc.

🛍️ Example Schema: Shopping App

1. Users Table

CREATE TABLE Users (
  user_id SERIAL PRIMARY KEY,
  email VARCHAR UNIQUE,
  username VARCHAR UNIQUE,
  phone_number VARCHAR
);

Candidate Keys:

user_id ✅
email ✅
username ✅

We chose user_id as the primary key, but both email and username could also uniquely identify a user — so they’re candidate keys.

2. Products Table

CREATE TABLE Products (
  product_id SERIAL PRIMARY KEY,
  sku VARCHAR UNIQUE,
  name VARCHAR,
  category VARCHAR
);

Candidate Keys:

product_id ✅
sku ✅ (Stock Keeping Unit – a unique identifier for each product)

sku is a candidate key. We use product_id as the primary key, but you could use sku if you wanted a natural key instead.

3. Orders Table

CREATE TABLE Orders (
  order_id SERIAL PRIMARY KEY,
  user_id INT REFERENCES Users(user_id),
  order_number VARCHAR UNIQUE,
  created_at TIMESTAMP
);

Candidate Keys:

order_id ✅
order_number ✅

You might use order_number (e.g., "ORD-20250417-0012") for external reference and order_id internally. Both are unique identifiers = candidate keys.

4. OrderItems Table (Join Table)

This table links orders to the specific products and quantities purchased.

CREATE TABLE OrderItems (
  order_id INT,
  product_id INT,
  quantity INT,
  PRIMARY KEY (order_id, product_id),
  FOREIGN KEY (order_id) REFERENCES Orders(order_id),
  FOREIGN KEY (product_id) REFERENCES Products(product_id)
);

Candidate Key:

Composite key: (order_id, product_id) ✅

Here, a combination of order_id and product_id uniquely identifies a row — i.e., what product was ordered in which order — making it a composite candidate key, and we’ve selected it as the primary key.

👀 Summary of Candidate Keys by Table

Table	Candidate Keys	Primary Key Used
Users	`user_id`, `email`, `username`	`user_id`
Products	`product_id`, `sku`	`product_id`
Orders	`order_id`, `order_number`	`order_id`
OrderItems	`(order_id, product_id)`	`(order_id, product_id)`

Let’s explore how to implement candidate keys in both SQL and Rails (Active Record). Since we are working on a shopping app in Rails 8, I’ll show how to enforce uniqueness and data integrity in both layers:

🔹 1. Candidate Keys in SQL (PostgreSQL Example)

Let’s take the Users table with multiple candidate keys (email, username, and user_id).

CREATE TABLE users (
  user_id SERIAL PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE,
  username VARCHAR(100) NOT NULL UNIQUE,
  phone_number VARCHAR(20)
);

user_id: chosen as the primary key
email and username: candidate keys, enforced via UNIQUE constraints

💎 Composite Key Example (OrderItems)

CREATE TABLE order_items (
  order_id INT,
  product_id INT,
  quantity INT NOT NULL,
  PRIMARY KEY (order_id, product_id),
  FOREIGN KEY (order_id) REFERENCES orders(order_id),
  FOREIGN KEY (product_id) REFERENCES products(product_id)
);

This sets (order_id, product_id) as a composite candidate key and primary key.

🔸 2. Candidate Keys in Rails (ActiveRecord)

Now let’s do the same with Rails models + migrations + validations.

✅ `users` Migration (with candidate keys)

# db/migrate/xxxxxx_create_users.rb
class CreateUsers < ActiveRecord::Migration[8.0]
  def change
    create_table :users do |t|
      t.string :email, null: false
      t.string :username, null: false
      t.string :phone_number

      t.timestamps
    end

    add_index :users, :email, unique: true
    add_index :users, :username, unique: true
  end
end

✅ `User` Model

class User < ApplicationRecord
  validates :email, presence: true, uniqueness: true
  validates :username, presence: true, uniqueness: true
end

✅ These are candidate keys — email and username could be primary keys, but we are using id instead.

✅ Composite Key with `OrderItem` (Join Table)

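Since Rails 7.1, Active Record supports composite primary keys natively (pass primary_key: [:order_id, :product_id] to create_table). If you would rather keep the join table without a primary key, you can still enforce uniqueness via a multi-column index: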

Migration:

class CreateOrderItems < ActiveRecord::Migration[8.0]
  def change
    create_table :order_items, id: false do |t|
      t.references :order, null: false, foreign_key: true
      t.references :product, null: false, foreign_key: true
      t.integer :quantity, null: false

      t.timestamps
    end

    add_index :order_items, [:order_id, :product_id], unique: true
  end
end

Model:

class OrderItem < ApplicationRecord
  belongs_to :order
  belongs_to :product

  validates :quantity, presence: true
  validates :order_id, uniqueness: { scope: :product_id }
end

🎯 This simulates a composite key behavior: each product can only appear once per order.

➕ Extra: Use `composite_primary_keys` Gem (Optional)

On Rails versions before 7.1, which lack native composite primary keys, you can use:

gem 'composite_primary_keys'

But it’s best to avoid composite primary keys unless your use case demands them; most Rails apps use a surrogate key (id) for simplicity.

to be continued.. 🚀

Setup 🛠 Rails 8 App – Part 11: Convert 🔄 Rails App from SQLite to PostgreSQL

If you’ve already built a Rails 8 app using the default SQLite setup and now want to switch to PostgreSQL, here’s a clean step-by-step guide to make the transition smooth:

1.🔧 Setup PostgreSQL in macOS

🔷 Step 1: Install PostgreSQL via Homebrew

Run the following:

brew install postgresql

This created a default database cluster for me (check the output below). If you see the same message, you can skip Step 3.

==> Summary
🍺  /opt/homebrew/Cellar/postgresql@14/14.17_1: 3,330 files, 45.9MB

==> Running `brew cleanup postgresql@14`...
==> postgresql@14
This formula has created a default database cluster with:
  initdb --locale=C -E UTF-8 /opt/homebrew/var/postgresql@14

To start postgresql@14 now and restart at login:
  brew services start postgresql@14

Or, if you don't want/need a background service you can just run:
  /opt/homebrew/opt/postgresql@14/bin/postgres -D /opt/homebrew/var/postgresql@14

After installation, check the version:

psql --version
> psql (PostgreSQL) 14.17 (Homebrew)

🔷 Step 2: Start PostgreSQL Service

To start PostgreSQL now and have it start automatically at login:

brew services start postgresql
==> Successfully started `postgresql@14` (label: homebrew.mxcl.postgresql@14)

If you just want to run it in the background without autostart:

# pg_ctl — initialize, start, stop, or control a PostgreSQL server
pg_ctl -D /opt/homebrew/var/postgresql@14 start

https://www.postgresql.org/docs/current/app-pg-ctl.html

You can find the installed version using:

brew list | grep postgres

🔷 Step 3: Initialize the Database (if needed)

Sometimes Homebrew does this automatically. If not:

initdb /opt/homebrew/var/postgresql@<version>

Or a more general version:

initdb /usr/local/var/postgres

Key functions of initdb: it creates a new database cluster, sets the cluster’s default locale and character set encoding, and runs a vacuum command.

In essence, initdb prepares the environment for a PostgreSQL database to be used and provides a foundation for creating and managing databases within that cluster.

🔷 Step 4: Create a User and Database

PostgreSQL uses role-based access control. Create a user with superuser privileges:

# createuser creates a new Postgres user
createuser -s postgres

createuser is a wrapper around the SQL command CREATE ROLE, so there is nothing special about creating users this way compared with other methods.

Then switch to psql:

psql postgres

You can also create a database:

createdb <db_name>

🔷 Step 5: Connect and Use psql

psql -d <db_name>

Inside the psql shell, try:

\l    -- list databases
\dt   -- list tables
\q    -- quit

🔷 Step 6: Use a GUI (Optional)

For a friendly UI, install one of the following:

pgAdmin

Postico

TablePlus

2. Update `Gemfile`

Replace SQLite gem with PostgreSQL:

# Remove or comment this:
# gem "sqlite3", "~> 1.4"

# Add this:
gem "pg", "~> 1.4"

Then run:

bundle install

3. Update `config/database.yml`

Replace the entire contents of config/database.yml with the following:

default: &default
  adapter: postgresql
  encoding: unicode
  username: postgres
  password:
  host: localhost
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>

development:
  <<: *default
  database: your_app_name_development

test:
  <<: *default
  database: your_app_name_test

production:
  primary: &primary_production
    <<: *default
    database: your_app_name_production
    username: your_production_username
    password: <%= ENV['YOUR_APP_DATABASE_PASSWORD'] %>
  cache:
    <<: *primary_production
    database: your_app_name_production_cache
    migrations_paths: db/cache_migrate
  queue:
    <<: *primary_production
    database: your_app_name_production_queue
    migrations_paths: db/queue_migrate
  cable:
    <<: *primary_production
    database: your_app_name_production_cable
    migrations_paths: db/cable_migrate

Replace your_app_name with your actual Rails app name.

4. Drop SQLite Database (Optional)

rm storage/development.sqlite3
rm storage/test.sqlite3

5. Create and Setup PostgreSQL Database

rails db:create
rails db:migrate

If you had seed data:

rails db:seed

6. Test It Works

Boot up your server:

bin/dev

Then go to http://localhost:3000 and confirm everything works.

7. Check psql manually (Optional)

psql -d your_app_name_development

Then run:

\dt     -- view tables
\q      -- quit

8. Update `.gitignore`

Note: skip this step if /storage/* is already in your .gitignore.

Make sure SQLite DBs are not accidentally committed:

/storage/*.sqlite3
/storage/*.sqlite3-journal

After moving into PostgreSQL

I ran into an error with a decimal column that has the following definition in the migration:

# migration
t.decimal :rating, precision: 1, scale: 1

# log
ActiveRecord::RangeError (PG::NumericValueOutOfRange: ERROR:  numeric field overflow
12:44:36 web.1  | DETAIL:  A field with precision 1, scale 1 must round to an absolute value less than 1.
12:44:36 web.1  | )

The value passed was 4.3. SQLite never raised this error because it does not enforce the declared precision and scale.

What does `precision: 1, scale: 1` mean?

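precision: Total number of digits (both left and right of the decimal point).
scale: Number of digits after the decimal point.
With precision: 1, scale: 1 the single allowed digit sits after the decimal point, so only values from -0.9 to 0.9 fit, and 4.3 overflows.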

If you want to store ratings like 4.3, 4.5, etc., a good setup is:

t.decimal :rating, precision: 2, scale: 1
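
The rule behind the error can be approximated in a few lines of plain Ruby. This is a rough model of the range check, not PostgreSQL's actual implementation, and fits_numeric? is a hypothetical helper name.

```ruby
# Rough model of the NUMERIC(precision, scale) range check:
# round to `scale` decimals, then make sure the integer part
# fits in the remaining (precision - scale) digits.
def fits_numeric?(value, precision:, scale:)
  rounded = Rational(value.to_s).round(scale)
  rounded.abs < 10**(precision - scale)
end

puts fits_numeric?("4.3", precision: 1, scale: 1) # false (the error above)
puts fits_numeric?("4.3", precision: 2, scale: 1) # true
puts fits_numeric?("0.9", precision: 1, scale: 1) # true
```

With precision: 2, scale: 1 there is one digit left for the integer part, which covers ratings from 0.0 to 9.9.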

# revert and migrate for products table

rails db:migrate:down VERSION=2025031XXXXX -t
rails db:migrate:up VERSION=2025031XXXXXX -t

Then go to http://localhost:3000 and confirm everything works.

to be continued.. 🚀

Setup 🛠 Rails 8 App – Part 10: PostgreSQL Into The Action

For a Ruby on Rails 8 application, the choice of database depends on your specific needs, but here’s a breakdown of the best options and when to use each:

PostgreSQL (Highly Recommended)

Best overall choice for most Rails apps.

Why:

First-class support in Rails.
Advanced features like full-text search, JSONB support, CTEs, window functions.
Strong consistency and reliability.
Scales well vertically and horizontally (with tools like Citus).

Used by: GitLab, Discourse, Mastodon.

Use if:

You’re building a standard Rails web app or API.
You need advanced query features or are handling complex data types (e.g., JSON).

SQLite (For development/testing only)

Lightweight, file-based.
Fast and easy to set up.
But not recommended for production.

Use if:

You’re building a quick prototype or local dev/testing app.
NOT for multi-user production environments.

MySQL / MariaDB

Also supported by Rails.
Can work fine for simpler applications.
Lacks some advanced features (like robust JSON support or full Postgres-style indexing).
Not the default in many modern Rails setups.

Use if:

Your team already has MySQL infrastructure or legacy systems.
You need horizontal scaling with Galera Cluster or similar setups.

Others (NoSQL like MongoDB, Redis, etc.)

Use Redis for caching and background job data (not as primary DB).
Use MongoDB or other NoSQL only if your data model really demands it (e.g., unstructured documents, event sourcing).

Recommendation Summary:

Use Case	Recommended DB
Production web/API app	PostgreSQL
Dev/prototyping/local testing	SQLite
Legacy systems/MySQL infrastructure	MySQL/MariaDB
Background jobs/caching	Redis
Special needs (e.g., documents)	MongoDB (with caution)

If you’re starting fresh or building something scalable and modern with Rails 8, go with PostgreSQL.

Let’s break that down:

💬 What does “robust JSON support” mean?

PostgreSQL supports two JSON column types, json and jsonb, which let you store structured JSON data directly in your database, like hashes or objects.

Why it matters:

You can store dynamic data without needing to change your schema.
You can query inside the JSON using SQL (->, ->>, @>, etc.).
You can index parts of the JSON — for speed.

🔧 Example:

You have a products table with a specs column that holds tech specs in JSON:

specs = {
  "color": "black",
  "brand": "Libas",
  "dimensions": {"chest": "34", "waist": "30", "shoulder": "13.5"}
}

You can query like:

SELECT * FROM products WHERE specs->>'color' = 'black';

Or check if the JSON contains a value:

SELECT * FROM products WHERE specs @> '{"brand": "Libas"}';

You can even add an expression index on specs->>'color' (or a GIN index on the whole specs column for @> queries) to make these queries fast.
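
If the two operators feel abstract, this Ruby sketch mimics them on a plain Hash. It is a simplified model: json_contains? is a made-up helper, and real jsonb containment has a few extra rules (for example around arrays and scalars) that are not reproduced here.

```ruby
specs = {
  "color" => "black",
  "brand" => "Libas",
  "dimensions" => { "chest" => "34", "waist" => "30", "shoulder" => "13.5" }
}

# Simplified model of @>: every key/value in the pattern must appear in the doc.
def json_contains?(doc, pattern)
  case pattern
  when Hash
    doc.is_a?(Hash) &&
      pattern.all? { |key, value| doc.key?(key) && json_contains?(doc[key], value) }
  when Array
    doc.is_a?(Array) &&
      pattern.all? { |item| doc.any? { |candidate| json_contains?(candidate, item) } }
  else
    doc == pattern
  end
end

puts specs["color"] == "black"                                      # true, like specs->>'color' = 'black'
puts json_contains?(specs, { "brand" => "Libas" })                  # true, like specs @> '{"brand": "Libas"}'
puts json_contains?(specs, { "dimensions" => { "waist" => "30" } }) # true
puts json_contains?(specs, { "brand" => "Other" })                  # false
```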

💬 What does “full Postgres-style indexing” mean?

PostgreSQL supports a wide variety of powerful indexing options, which improve query performance and flexibility.

⚙️ Types of Indexes PostgreSQL supports:

Index Type	Use Case
B-Tree	Default; used for most equality and range searches
GIN (Generalized Inverted Index)	Fast indexing for JSON, arrays, full-text search
Partial Indexes	Index only part of the data (e.g., `WHERE active = true`)
Expression Indexes	Index a function or expression (e.g., `LOWER(email)`)
Covering Indexes (INCLUDE)	Fetch data directly from the index, avoiding table reads

B-Tree Indexes: B-tree indexes are more suitable for single-value columns.
When to Use GIN Indexes: When you frequently search for specific elements within arrays, JSON documents, or other composite data types.
Example for GIN Indexes: Imagine you have a table with a JSONB column containing document metadata. A GIN index on this column would allow you to quickly find all documents that have a specific author or belong to a particular category.

Why does this matter for our shopping app?

We can store and filter products with dynamic specs (e.g., kurtas, shorts, pants) without new columns.
Full-text search on product names/descriptions.
Fast filters: color = 'red' AND brand = 'Libas' even if those are stored in JSON.
Index custom expressions like LOWER(email) for case-insensitive login.

💬 What are Common Table Expressions (CTEs)?

CTEs are temporary result sets you can reference within a SQL query — like defining a mini subquery that makes complex SQL easier to read and write.

WITH recent_orders AS (
  SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days'
)
SELECT * FROM recent_orders WHERE total > 100;

CTEs are useful for:

Breaking complex queries into readable parts.
Re-using result sets without repeating subqueries.

In Rails 7.1+ (via the built-in `with` query method):

Order
  .with(recent_orders: Order.where('created_at > ?', 7.days.ago))
  .from('recent_orders AS orders')
  .where('total > ?', 100)

💬 What are Window Functions?

Window functions perform calculations across rows related to the current row — unlike aggregate functions, they don’t group results into one row.

🔧 Example: Rank users by their score within each team:

SELECT
  user_id,
  team_id,
  score,
  RANK() OVER (PARTITION BY team_id ORDER BY score DESC) AS rank
FROM users;

Use cases:

Ranking rows (like leaderboards).
Running totals or moving averages.
Calculating differences between rows (e.g. “How much did this order increase from the last?”).
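
To see what RANK() OVER (PARTITION BY ... ORDER BY ...) computes, here is a small Ruby sketch over in-memory rows. ScoreRow and rank_within_team are hypothetical names and the data is made up.

```ruby
ScoreRow = Struct.new(:user_id, :team_id, :score)

# Every input row stays in the output (no collapsing, unlike GROUP BY).
def rank_within_team(rows)
  rows.group_by(&:team_id).flat_map do |_team_id, members|      # PARTITION BY team_id
    members.sort_by { |row| [-row.score, row.user_id] }.map do |row|
      # RANK(): 1 + number of rows in the partition with a strictly higher score.
      rank = 1 + members.count { |other| other.score > row.score }
      [row.user_id, row.team_id, row.score, rank]
    end
  end
end

rows = [
  ScoreRow.new(1, "A", 90),
  ScoreRow.new(2, "A", 95),
  ScoreRow.new(3, "A", 90),
  ScoreRow.new(4, "B", 70)
]

rank_within_team(rows).each { |row| p row }
# [2, "A", 95, 1]
# [1, "A", 90, 2]
# [3, "A", 90, 2]
# [4, "B", 70, 1]
```

Tied scores share a rank, and the next rank after a tie is skipped, which matches how RANK() differs from ROW_NUMBER().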

🛤 In Rails:

Window functions are available through raw SQL or Arel. Here’s a basic example:

User
  .select("user_id, team_id, score, RANK() OVER (PARTITION BY team_id ORDER BY score DESC) AS rank")

CTEs and Window functions are fully supported in PostgreSQL, making it the go-to DB for any Rails 8 app that needs advanced querying.

JSONB Support

JSONB stands for “JSON Binary” and is a binary representation of JSON data that allows for efficient storage and retrieval of complex data structures.

This can be useful when you have data that doesn’t fit neatly into traditional relational database tables, such as nested or variable-length data structures.

Storing JSON in a relational database (like PostgreSQL) can be super powerful when used wisely. It gives you schema flexibility without abandoning the structure and power of SQL. Here are real-world use cases for using JSON columns in relational databases:

🔧 1. Flexible Metadata / Extra Attributes

Let users store arbitrary attributes that don’t require schema changes every time.

Use case: Product variants, custom fields

t.jsonb :metadata

{
  "color": "red",
  "size": "XL",
  "material": "cotton"
}

=> Good when:

You can’t predict all the attributes users will need.
You don’t want to create dozens of nullable columns.

🎛️ 2. Storing Settings or Preferences

User or app settings that vary a lot.

Use case: Notification preferences, UI layout, feature toggles

{
  "email": true,
  "sms": false,
  "theme": "dark"
}

=> Easy to store and retrieve as a blob without complex joins.

🌐 3. API Response Caching

Store external API responses for caching or auditing.

Use case: Storing Stripe, GitHub, or weather API responses.

t.jsonb :api_response

=> Avoids having to map every response field into a column.

📦 4. Storing Logs or Events

Use case: Audit trails, system logs, user events

{
  "action": "login",
  "timestamp": "2025-04-18T10:15:00Z",
  "ip": "123.45.67.89"
}

=> Great for capturing varied data over time without a rigid schema.

📊 6. Embedded Mini-Structures

Use case: A form builder app storing user-created forms and fields.

{
  "fields": [
    { "type": "text", "label": "Name", "required": true },
    { "type": "email", "label": "Email", "required": false }
  ]
}

=> When each row can have nested, structured data — almost like a mini-document.

🕹️ 7. Device or Browser Info (User Agents)

Use case: Analytics, device fingerprinting

{
  "browser": "Safari",
  "os": "macOS",
  "version": "17.3"
}

=> You don’t need to normalize or query this often — perfect for JSON.

JSON vs JSONB in PostgreSQL

Use jsonb over json unless you need to preserve order or whitespace.

jsonb is binary format → faster and indexable
You can do fancy stuff like:

SELECT * FROM users WHERE preferences ->> 'theme' = 'dark';

Or in Rails:

User.where("preferences ->> 'theme' = ?", 'dark')

store and store_accessor

They let you treat JSON or text-based hash columns like structured data, so you can access fields as if they were real database columns.

🔹 `store`

Used to declare a serialized store on your model, typically backed by a text column (native json/jsonb columns already come back as a Hash, so they only need store_accessor).
Works best with key/value stores.

👉 Example:

Let’s say your users table has a settings column of type text:

# migration
add_column :users, :settings, :text

Now in your model:

class User < ApplicationRecord
  store :settings, accessors: [:theme, :notifications], coder: JSON
end

You can now do this:

user.theme = "dark"
user.notifications = true
user.save

user.settings
# => { "theme" => "dark", "notifications" => true }

🔹 `store_accessor`

A lightweight version that only declares attribute accessors for keys inside a JSON column. Doesn’t include serialization logic — so you usually use it with a json/jsonb/text column that already works as a Hash.

👉 Example:

class User < ApplicationRecord
  store_accessor :settings, :theme, :notifications
end

This gives you:

user.theme, user.theme=
user.notifications, user.notifications=
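
Under the hood the idea is simple: generate a reader and a writer that delegate to a Hash. The following toy reimplementation in plain Ruby shows the concept only. ToyStoreAccessor and ToyUser are made-up names, and this is not how Rails implements store_accessor.

```ruby
# Toy version of the accessor idea: readers/writers that delegate to a Hash.
module ToyStoreAccessor
  def toy_store_accessor(store, *keys)
    keys.each do |key|
      define_method(key) { public_send(store)[key.to_s] }
      define_method("#{key}=") { |value| public_send(store)[key.to_s] = value }
    end
  end
end

class ToyUser
  extend ToyStoreAccessor
  attr_reader :settings

  def initialize
    @settings = {} # stands in for the jsonb column
  end

  toy_store_accessor :settings, :theme, :notifications
end

user = ToyUser.new
user.theme = "dark"
user.notifications = true
puts user.theme             # dark
puts user.settings["theme"] # dark
```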

🤔 When to Use Each?

Feature	When to Use
`store`	When you need both serialization and accessors
`store_accessor`	When your column is already serialized (`jsonb`, etc.)

If you’re using PostgreSQL with jsonb columns — it’s more common to just use store_accessor.

Querying JSON Fields

User.where("settings ->> 'theme' = ?", "dark")

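Note that store_accessor only adds Ruby-level readers and writers, so User.where(theme: "dark") will not work (there is no theme column). To match on a key, query the jsonb column itself, for example with the containment operator:

User.where("settings @> ?", { theme: "dark" }.to_json)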

💡 But remember: you’ll only be able to query these fields efficiently if you’re using jsonb + proper indexes.

🔥 Conclusion:

PostgreSQL can store, search, and index inside JSON fields natively.
This lets you keep your schema flexible and your queries fast.
Combined with its advanced indexing, it’s ideal for a modern e-commerce app with dynamic product attributes, filtering, and searching.

To install and set up PostgreSQL on macOS, the most common and cleanest method is Homebrew. Part 11 of this series covers it step by step.

Learn SQL: Day 7 – Query Optimization Workshop

Welcome to Day 7.

Everything we’ve learned so far leads to this lesson.

Until now, you’ve learned:

How to write SQL
How JOINs work
How GROUP BY works
How indexes work
How PostgreSQL chooses execution plans

Today, we’ll combine everything to solve real-world performance problems.

Today’s Goals

By the end of today, you’ll be able to answer questions like:

Why is this query slow?
Should I add an index?
Should I rewrite the query?
Is this a database problem or an application problem?
How would I debug this in production?

These are exactly the kinds of discussions that happen in senior Rails interviews.

A Senior Engineer’s Workflow

Suppose your manager says:

“The Users page takes 8 seconds to load.”

A junior developer might immediately say:

“Let’s add an index.”

A senior developer thinks:

1. Is the query actually slow?
2. Which query is slow?
3. How much data is involved?
4. What is PostgreSQL doing?
5. Can I rewrite the query?
6. Do I need an index?
7. Is the application causing the problem?

Notice:

Adding an index is Step 6, not Step 1.

Our Practice Schema

Let’s build something closer to a real Rails application.

DROP TABLE IF EXISTS order_items;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;
DROP TABLE IF EXISTS users;

Users

CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    name TEXT,
    email TEXT,
    city TEXT
);

Products

CREATE TABLE products (
    id BIGSERIAL PRIMARY KEY,
    name TEXT,
    price NUMERIC(10,2),
    category TEXT
);

Orders

CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    status TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

Order Items

CREATE TABLE order_items (
    id BIGSERIAL PRIMARY KEY,
    order_id BIGINT NOT NULL REFERENCES orders(id),
    product_id BIGINT NOT NULL REFERENCES products(id),
    quantity INTEGER,
    price NUMERIC(10,2)
);

The Six-Step Performance Checklist

Every slow query investigation should start with this checklist.

Step 1 – Measure

Never optimise blindly.

Run:

			
EXPLAIN ANALYZE
SELECT ...

Step 2 – Understand the Business Question

Example:

Show the last 20 completed orders.

Don’t optimise before understanding what the query should do.

Step 3 – Read the Plan

Look for:

Seq Scan
Nested Loop
Hash Join
Sort
Aggregate
Bitmap Heap Scan

Step 4 – Find the Bottleneck

Ask:

Which node took the most time?
Which node processed the most rows?

Step 5 – Decide the Fix

Possible fixes:

Better index
Better SQL
Better schema
Better ActiveRecord
Better pagination

Step 6 – Measure Again

Never assume the optimisation worked.

Always compare before and after.

Scenario 1 – Missing Index

Query:

			
SELECT *
FROM users
WHERE email='john@example.com';

Execution plan:

			
Seq Scan
rows=100000
actual rows=1

Question:

What’s wrong?

Diagnosis

No index on email, so PostgreSQL reads all 100,000 rows to return one.

Fix

			
CREATE INDEX idx_users_email
ON users(email);

Run again:

			
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE email='john@example.com';

Expect:

Index Scan

Scenario 2 – Wrong Index

Suppose the users table also has an age column and the application runs:

			
SELECT *
FROM users
WHERE city='Chicago'
AND age=30;

Indexes:

			
(city)
(age)

Question:

Better solution?

Answer

Composite index:

			
CREATE INDEX idx_city_age
ON users(city, age);

Because the application almost always filters by both, one composite index can satisfy the whole WHERE clause instead of combining two separate indexes.
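
One way to picture a composite index is a list kept sorted by [city, age], where a lookup is a binary search followed by a short sequential read. The sketch below is a toy model in plain Ruby with made-up data and a hypothetical lookup helper; a real B-tree is a tree of pages, not a flat array.

```ruby
# Toy composite "index": entries sorted by [city, age], each ending in a row id.
index = [
  ["Austin", 25, 7],
  ["Chicago", 30, 2],
  ["Chicago", 30, 9],
  ["Chicago", 41, 4],
  ["Denver", 30, 5]
].sort

def lookup(index, city, age)
  key = [city, age]
  # Binary search for the first entry whose [city, age] is >= the key.
  start = index.bsearch_index { |entry| (entry.first(2) <=> key) >= 0 }
  return [] if start.nil?

  # Matching entries sit next to each other, so read until the key changes.
  index[start..].take_while { |entry| entry.first(2) == key }.map(&:last)
end

p lookup(index, "Chicago", 30) # [2, 9]
p lookup(index, "Boston", 30)  # []
```

Because the entries are ordered by city first, the same structure also serves a filter on city alone, but not a filter on age alone.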

Scenario 3 – Sorting

Query:

			
SELECT *
FROM orders
ORDER BY created_at DESC
LIMIT 20;

Plan:

			
Seq Scan
↓
Sort
↓
Limit

		

Question:

Can we avoid sorting?

Solution

			
CREATE INDEX idx_orders_created_at_desc
ON orders(created_at DESC);

Now PostgreSQL can read the index in order.

Often:

			
Index Scan
↓
Limit

No Sort node.

Scenario 4 – N+1 Queries

Rails code:

orders = Order.limit(100)

orders.each do |order|
  puts order.user.name
end

SQL executed:

SELECT * FROM orders LIMIT 100;

Then:

SELECT * FROM users WHERE id=1;

SELECT * FROM users WHERE id=2;

…

100 additional queries.

Total

101 queries

Fix

Order.includes(:user).limit(100)

Now:

SELECT * FROM orders LIMIT 100;

			
SELECT *
FROM users
WHERE id IN (...);

Two queries.
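
The difference in round trips can be simulated without a database. This sketch assumes a hypothetical FakeDB class that just counts calls; it models the access pattern, not Active Record itself.

```ruby
# Counts "queries" so the two access patterns can be compared.
class FakeDB
  attr_reader :query_count

  def initialize(orders, users)
    @orders = orders
    @users = users # Hash of id => user
    @query_count = 0
  end

  def orders
    @query_count += 1
    @orders
  end

  def user(id) # one query per call: WHERE id = ?
    @query_count += 1
    @users.fetch(id)
  end

  def users_by_id(ids) # one query for all: WHERE id IN (...)
    @query_count += 1
    @users.slice(*ids)
  end
end

users = (1..100).to_h { |id| [id, { id: id, name: "User #{id}" }] }
orders = (1..100).map { |id| { id: id, user_id: id } }

n_plus_one = FakeDB.new(orders, users)
n_plus_one.orders.each { |order| n_plus_one.user(order[:user_id])[:name] }
puts n_plus_one.query_count # 101

preloaded = FakeDB.new(orders, users)
loaded_orders = preloaded.orders
users_by_id = preloaded.users_by_id(loaded_orders.map { |o| o[:user_id] }.uniq)
loaded_orders.each { |order| users_by_id.fetch(order[:user_id])[:name] }
puts preloaded.query_count # 2
```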

Interview Question

Which is faster?

includes

joins

Answer:

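They solve different problems. includes preloads the associated records so the loop does not trigger extra queries; joins adds an INNER JOIN for filtering or sorting but does not load the association, so order.user would still run one query per row.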

Scenario 5 – OFFSET Pagination

Query:

SELECT *
FROM orders
ORDER BY created_at DESC
LIMIT 20
OFFSET 100000;

Looks harmless.

But PostgreSQL must skip:

100000 rows

before returning:

20 rows

Large OFFSET values become increasingly expensive.

Better Solution

Keyset Pagination.

Instead of:

OFFSET 100000

Use:

WHERE created_at < '2026-07-01'
ORDER BY created_at DESC
LIMIT 20;

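This lets PostgreSQL continue from the last seen row instead of counting through earlier rows. If created_at is not unique, add id as a tie-breaker in both the WHERE and the ORDER BY so rows with the same timestamp are not skipped.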

Rails Example

Instead of:

Order.order(created_at: :desc)
     .offset(100000)
     .limit(20)

Use:

Order
  .where("created_at < ?", last_created_at)
  .order(created_at: :desc)
  .limit(20)

This is called keyset pagination or cursor pagination.
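
Here is a small in-memory sketch of the cursor logic, using [created_at, id] as the cursor so that rows sharing a timestamp are not lost between pages. ToyOrder and next_page are hypothetical names, and created_at is a plain integer to keep the example short. A real query would push this comparison into SQL.

```ruby
ToyOrder = Struct.new(:id, :created_at)

# Returns [page, cursor]; pass the cursor back in as `after` for the next page.
def next_page(orders, after: nil, limit: 2)
  newest_first = orders.sort_by { |o| [o.created_at, o.id] }.reverse
  # Keep only rows strictly "older" than the cursor (compares created_at, then id).
  newest_first = newest_first.select { |o| ([o.created_at, o.id] <=> after) < 0 } if after
  page = newest_first.first(limit)
  cursor = page.empty? ? nil : [page.last.created_at, page.last.id]
  [page, cursor]
end

orders = [
  ToyOrder.new(1, 100),
  ToyOrder.new(2, 200),
  ToyOrder.new(3, 200), # same timestamp as order 2
  ToyOrder.new(4, 300)
]

page1, cursor = next_page(orders)
p page1.map(&:id) # [4, 3]

page2, = next_page(orders, after: cursor)
p page2.map(&:id) # [2, 1]
```

With a cursor on created_at alone, order 2 would have been skipped because it shares its timestamp with the last row of page 1.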

Scenario 6 – SELECT *

Query:

			
SELECT *
FROM users;

Returns:

			
id
name
email
city
address
bio
avatar
...

		

Suppose the page only displays:

name
city

Why fetch everything?

Better:

			
SELECT
    name,
    city
FROM users;

Rails:

User.select(:name, :city)

Scenario 7 – COUNT(*)

Suppose:

			
SELECT COUNT(*)
FROM orders;

On:

300 million rows

Question:

Can this be slow?

Yes.

Because PostgreSQL must count visible rows.

Unlike some databases, PostgreSQL generally doesn’t maintain an exact row count that’s instantly available for arbitrary COUNT(*).

Scenario 8 – DISTINCT

Query:

			
SELECT DISTINCT users.*
FROM users
JOIN orders
ON users.id=orders.user_id;

Question:

Why is DISTINCT needed?

Because JOIN duplicates users.

Could EXISTS express the requirement more directly?

			
SELECT *
FROM users u
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.user_id=u.id
);

		

Sometimes that’s clearer.
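
The difference between the two approaches shows up even on tiny in-memory data. This Ruby sketch uses made-up users and orders: the join produces one row per matching pair and needs de-duplication, while the EXISTS-style check asks one yes/no question per user.

```ruby
require "set"

users = [{ id: 1, name: "Asha" }, { id: 2, name: "Ravi" }, { id: 3, name: "Meera" }]
orders = [{ id: 10, user_id: 1 }, { id: 11, user_id: 1 }, { id: 12, user_id: 3 }]

# JOIN: one output row per matching (user, order) pair, so Asha appears twice.
joined = users.flat_map { |u| orders.select { |o| o[:user_id] == u[:id] }.map { u } }
p joined.map { |u| u[:name] }      # ["Asha", "Asha", "Meera"]
p joined.uniq.map { |u| u[:name] } # ["Asha", "Meera"]  (what DISTINCT does)

# EXISTS: "does this user have any order?" Duplicates never arise.
ids_with_orders = orders.map { |o| o[:user_id] }.to_set
with_orders = users.select { |u| ids_with_orders.include?(u[:id]) }
p with_orders.map { |u| u[:name] } # ["Asha", "Meera"]
```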

Scenario 9 – Functions in WHERE

Query:

			
SELECT *
FROM users
WHERE LOWER(email)='john@example.com';

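Would a normal index on email help?

No. The index stores email values, not LOWER(email), so PostgreSQL cannot use it for this predicate.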

Solution:

			
CREATE INDEX idx_lower_email
ON users(LOWER(email));

Expression index.

Scenario 10 – Too Many Indexes

Suppose:

			
users
↓
12 indexes

Problem?

Every INSERT must update:

			
Table
+
12 indexes

Indexes speed reads but slow writes.

Always consider the workload.

Optimization Decision Tree

When a query is slow, ask:

			
Is PostgreSQL scanning too many rows?
        │
        ▼
Yes
        │
        ▼
Would an index help?
        │
        ▼
Yes
        │
        ▼
Do I already have one?
        │
        ▼
No
        │
        ▼
Create the correct index.

		

But also ask:

			
Am I returning unnecessary data?
Am I sorting unnecessarily?
Am I joining unnecessarily?
Am I executing the query too many times?

Real Rails Optimization Example

Suppose this page loads slowly:

@orders = Order
            .where(status: "completed")
            .order(created_at: :desc)
            .limit(20)

Questions:

Is there an index on status?
Is there an index on created_at?
Would a composite index help?

Potential solution:

CREATE INDEX idx_orders_status_created_at
ON orders(status, created_at DESC);

Why?

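Because the query filters by status and orders by created_at, PostgreSQL can walk the index entries for status = 'completed' in created_at order and stop after 20 rows.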

Senior Interview Exercise

Suppose you see:

			
SELECT *
FROM orders
WHERE user_id = 100
ORDER BY created_at DESC
LIMIT 20;

		

Which index would you create?

Many people answer:

			
(user_id)
(created_at)

A stronger answer is:

			
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC);

Because it supports both the filter and the ordering in one index.
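
One way to picture it: the index is a list kept sorted by (user_id, created_at DESC). The sketch below models that with a sorted Ruby array and made-up rows; it is an analogy for how a B-tree is walked, not how PostgreSQL is implemented.

```ruby
# Hypothetical rows: [id, user_id, created_at]
ORDER_ROWS = [
  [1, 100, 5], [2, 200, 9], [3, 100, 8],
  [4, 100, 2], [5, 300, 7], [6, 100, 6]
].freeze

# Index entries sorted by (user_id ASC, created_at DESC), each pointing at a row id.
# Negating created_at makes an ascending sort behave like DESC.
COMPOSITE_INDEX = ORDER_ROWS
  .map { |id, user_id, created_at| [user_id, -created_at, id] }
  .sort
  .freeze

def latest_order_ids(index, user_id, limit)
  # Binary search to the first entry for this user (the "filter" part).
  start = index.bsearch_index { |entry| entry[0] >= user_id }
  return [] if start.nil?

  # Entries are already in created_at DESC order, so no sort is needed:
  # read forward and stop after `limit` entries.
  index[start..]
    .lazy
    .take_while { |entry| entry[0] == user_id }
    .first(limit)
    .map { |entry| entry[2] }
end

LATEST = latest_order_ids(COMPOSITE_INDEX, 100, 3)
p LATEST
```

Two separate single-column indexes can't do this: one finds the user's rows but in no useful order, the other is ordered but covers every user.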

Common Performance Mistakes

Mistake 1

Adding indexes without measuring.

Mistake 2

Using SELECT * everywhere.

Mistake 3

Ignoring N+1 queries.

Mistake 4

Using huge OFFSET values.

Mistake 5

Creating duplicate indexes.

Mistake 6

Ignoring EXPLAIN ANALYZE.

Senior-Level Mental Model

Every query has a “cost.”

The cost comes from:

			
Rows read
+
Rows sorted
+
Rows joined
+
Rows transferred
+
Application round trips

		

The goal of optimisation is to reduce one or more of these.

Practical Exercises

Exercise 1

Create:

			
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC);

Run:

			
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE user_id=1
ORDER BY created_at DESC
LIMIT 20;

		

Observe whether PostgreSQL can avoid an explicit Sort.

Exercise 2

Compare:

			
SELECT *
FROM users;

with:

			
SELECT name, city
FROM users;

Think about the amount of data returned.

Exercise 3

Write three versions of:

“Find users with completed orders.”

Using:

JOIN
EXISTS
IN

Then compare their execution plans.

Exercise 4

Find a query that performs a Seq Scan.

Add an appropriate index.

Run EXPLAIN ANALYZE again.

What changed?

Interview Case Study

Imagine you’re in a senior Rails interview.

The interviewer says:

“A customer reports that the Orders page takes 6 seconds to load.”

A strong answer isn’t:

“I’ll add an index.”

A stronger answer is:

1. Reproduce the issue.
2. Identify the SQL generated by ActiveRecord.
3. Run EXPLAIN ANALYZE.
4. Inspect scan types, joins, and sort operations.
5. Check existing indexes.
6. Decide whether the fix belongs in the SQL, indexes, ActiveRecord code, or schema.
7. Measure again after the change.

That systematic approach demonstrates senior-level thinking.

Homework

Build a small benchmark using your practice schema.

Populate:
- 100,000 users
- 500,000 orders
Measure these queries before and after adding indexes:
- Find a user by email.
- Find recent orders for a user.
- Find completed orders.
- Find users with no orders.
For each query, record:
- Execution plan
- Execution time
- Scan type
- Rows estimated
- Rows returned
Explain why PostgreSQL chose each plan.

What’s Next?

At this point, you’re already covering topics that many experienced Rails developers never study in depth.

For Day 8, I recommend Window Functions:

ROW_NUMBER()
RANK()
DENSE_RANK()
LAG()
LEAD()
Running totals
Moving averages
Top N per group

Window functions are common in reporting, analytics, and senior backend interviews because they solve problems that are difficult or inefficient with plain GROUP BY. Understanding them will significantly broaden your SQL toolkit.

Happy Learning! 🚀

Learn SQL: Day 6C – Mastering EXPLAIN ANALYZE (Think Like the PostgreSQL Query Planner)

Welcome to Day 6C.

This is one of the most valuable lessons in the entire course.

Many developers know how to write SQL.

Very few can answer questions like:

“Why is this query slow?”

“Why did PostgreSQL choose a Bitmap Heap Scan instead of an Index Scan?”

“What would you optimize first?”

This lesson will teach you exactly that.

Today’s Goal

By the end of today, you should be able to:

Read an EXPLAIN ANALYZE plan from start to finish
Understand every important field
Explain why PostgreSQL chose a plan
Identify bottlenecks
Suggest optimizations
Discuss execution plans confidently in a senior interview

First, Understand What EXPLAIN ANALYZE Actually Does

Consider this query:

			
SELECT *
FROM users
WHERE email = 'user50000@example.com';

Without EXPLAIN, PostgreSQL simply returns the result.

With:

			
EXPLAIN
SELECT *
FROM users
WHERE email = 'user50000@example.com';

PostgreSQL says:

“Here’s the plan I intend to use.”

With:

			
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE email = 'user50000@example.com';

PostgreSQL actually executes the query and says:

“Here’s what really happened.”

The Query Planner

Imagine PostgreSQL as a GPS.

You ask:

Go from A to B.

The GPS considers:

Highway
Local roads
Toll roads
Traffic

Then chooses the cheapest route.

PostgreSQL does much the same.

It considers:

Sequential Scan
Index Scan
Bitmap Scan
Hash Join
Nested Loop
Merge Join

and chooses what it estimates to be the cheapest plan.

Our Practice Table

Use the same table from Day 6B.

users

100,000 rows.

Our First Plan

			
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE email='user50000@example.com';

You might see something similar to:

			
Index Scan using idx_users_email on users
  (cost=0.42..8.44 rows=1 width=51)
  (actual time=0.030..0.032 rows=1 loops=1)

Let’s decode every part.

Part 1 – Scan Type

First line:

Index Scan

This answers:

How did PostgreSQL access the table?

Possible answers:

Seq Scan
Index Scan
Index Only Scan
Bitmap Heap Scan

The scan type is the first thing you should notice.

Part 2 – Using Which Index?

using idx_users_email

PostgreSQL tells you exactly which index it used.

If you expected:

idx_users_city

but it chose:

idx_users_age

you should ask yourself why.

Part 3 – Cost

Example:

cost=0.42..8.44

Many beginners think:

“8.44 milliseconds.”

No.

Cost is not time.

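It is PostgreSQL’s internal scoring system, expressed in arbitrary units. By default, one unit is the cost of reading one page sequentially (seq_page_cost = 1.0).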

Think of it like this:

			
Plan A
Cost = 150

			
Plan B
Cost = 70

PostgreSQL chooses Plan B.

Startup Cost

First number:

0.42

Cost before the first row can be returned.

Total Cost

Second number:

8.44

Cost to return every row.

Part 4 – Rows

rows=1

Planner estimate.

Meaning:

			
"I think this query will return
1 row."

Part 5 – Width

width=51

Estimated average size of one returned row, in bytes.

Used internally for memory and I/O estimates.

Part 6 – Actual Time

actual time=0.030..0.032

Meaning:

			
First row
↓
0.030 ms

Entire query finished:

0.032 ms

Part 7 – Actual Rows

actual rows=1

Excellent.

Planner guessed: rows=1

Reality: actual rows=1

Very accurate.

Part 8 – Loops

loops=1

This operation executed once.

You’ll later see plans like:

loops=100000

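That often indicates an expensive nested loop. When loops is greater than 1, actual time and rows are per-loop averages, so multiply them by loops to get the totals.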

Reading Plans from Bottom to Top

This surprises many developers.

Execution plans are printed like a tree.

Example:

			
Limit
↓
Sort
↓
Index Scan

		

Although Limit appears first, execution begins at the bottom.

Conceptually:

			
Index Scan
↓
Sort
↓
Limit

		

Think of a factory:

			
Raw Material
↓
Machine 1
↓
Machine 2
↓
Finished Product

		

The raw material starts at the bottom.
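
The same idea as a small sketch: model the plan as a tree and walk it children-first. PlanNode and data_flow_order are hypothetical names used only for this illustration.

```ruby
PlanNode = Struct.new(:name, :children)

# Limit -> Sort -> Index Scan, in the order EXPLAIN prints it (parent first).
PLAN = PlanNode.new("Limit", [
  PlanNode.new("Sort", [
    PlanNode.new("Index Scan", [])
  ])
])

# Post-order walk: list a node only after its children,
# which is the direction rows flow through the plan.
def data_flow_order(node)
  node.children.flat_map { |child| data_flow_order(child) } + [node.name]
end

FLOW = data_flow_order(PLAN)
p FLOW
```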

Example 2 – Sequential Scan

			
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE city='Chicago';

Output:

			
Seq Scan on users
(cost=0.00..2332.00 rows=24780 width=51)
(actual time=0.080..21.100 rows=25000 loops=1)

Let’s interpret it.

Why Seq Scan?

Question:

How many rows match?

25,000

That’s:

25%

of the table.

Using the index might require:

index lookup
25,000 table lookups

Sequential Scan may simply be cheaper.
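
The 2332.00 in the sample plan can be reproduced by hand. A minimal sketch, assuming the table occupies 1,082 pages (a made-up figure chosen for illustration; the real number is in pg_class.relpages) and PostgreSQL's default planner cost settings:

```ruby
# PostgreSQL's default planner cost settings.
SEQ_PAGE_COST     = 1.0     # reading one page sequentially
CPU_TUPLE_COST    = 0.01    # processing one row
CPU_OPERATOR_COST = 0.0025  # evaluating one operator, here city = 'Chicago'

# Estimated total cost of a sequential scan with one simple WHERE condition.
def seq_scan_cost(pages:, tuples:)
  disk_cost = pages * SEQ_PAGE_COST
  cpu_cost  = tuples * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
  (disk_cost + cpu_cost).round(2)
end

# 100,000 rows as in the practice table; 1,082 pages is an assumption.
SEQ_COST = seq_scan_cost(pages: 1082, tuples: 100_000)
puts SEQ_COST
```

Nothing in this arithmetic involves a clock, which is why cost can't be read as milliseconds.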

Interview Question

If PostgreSQL ignores your index,

does that mean

the index is useless?

Answer:

Absolutely not.

It means PostgreSQL estimated another plan to be cheaper for that specific query.

Example 3 – Bitmap Heap Scan

Suppose you create:

			
CREATE INDEX idx_users_city
ON users(city);

Now:

			
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE city='Chicago';

Output (node names only):

			
Bitmap Heap Scan
↓
Bitmap Index Scan

Notice there are two nodes.

Bitmap Index Scan

First:

Read the index.

			
Chicago
↓
Rows
4
8
12
...

		

Bitmap Heap Scan

Then:

Visit the table efficiently.

Instead of:

			
Index
↓
Table
↓
Index
↓
Table

		

It does:

			
Index
↓
Collect row locations
↓
Read pages together

		

Excellent for medium-sized result sets.
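
A simplified model of the difference, assuming made-up [page, slot] row locations and ignoring caching: following index order can bounce between pages, while the bitmap approach groups locations by page first.

```ruby
# Hypothetical row locations [page, slot], in the order the index returns them.
MATCHES = [[7, 1], [2, 4], [7, 3], [2, 1], [9, 2], [2, 6]].freeze

# Plain index scan: follow each pointer as it comes.
# Count a page visit every time the page number changes.
def index_scan_page_visits(tids)
  tids.map(&:first).chunk_while { |a, b| a == b }.count
end

# Bitmap scan: collect all locations, then visit each page once in physical order.
def bitmap_page_visits(tids)
  tids
    .group_by(&:first)
    .sort_by { |page, _| page }
    .map { |page, entries| [page, entries.map(&:last).sort] }
end

INDEX_VISITS  = index_scan_page_visits(MATCHES)
BITMAP_VISITS = bitmap_page_visits(MATCHES)

puts INDEX_VISITS
p BITMAP_VISITS
```

In this toy case the index-order walk switches pages six times, while the grouped walk touches three pages once each.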

Visual

			
Bitmap Index Scan
↓
Matching Row IDs
↓
Bitmap Heap Scan
↓
Actual Rows

		

Example 4 – Index Only Scan

Suppose:

			
SELECT email
FROM users
WHERE email='user100@example.com';

Plan:

Index Only Scan

Question:

Why is this faster?

Answer:

Because PostgreSQL answered the query using only the index.

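No table lookup, as long as the visibility map marks the relevant pages as all-visible. Check Heap Fetches in the plan to see how often the table was visited anyway.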

Planning Time vs Execution Time

Example:

			
Planning Time: 0.2 ms
Execution Time: 0.3 ms

Planning:

Choosing the route.

Execution:

Driving the route.

Why Estimates Matter

Suppose:

Planner:

rows=5

Reality:

actual rows=50000

Huge difference.

The planner may choose a terrible plan because its estimate was wrong.

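This often indicates stale statistics. It can also happen when the WHERE clause combines correlated columns that the planner treats as independent.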
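
When reading many plans, it helps to turn the comparison into a number. A small sketch; the helper names and the threshold of 10 are arbitrary choices for this example, not PostgreSQL settings.

```ruby
# How far apart the estimate and reality are, as a factor that is always >= 1.0.
def misestimate_factor(estimated_rows, actual_rows)
  est = [estimated_rows, 1].max.to_f  # clamp to 1 to avoid dividing by zero
  act = [actual_rows, 1].max.to_f
  [est / act, act / est].max
end

# The threshold is a rule of thumb chosen for this sketch.
def suspicious_estimate?(estimated_rows, actual_rows, threshold: 10)
  misestimate_factor(estimated_rows, actual_rows) >= threshold
end

puts misestimate_factor(5, 50_000)
puts suspicious_estimate?(5, 50_000)
puts suspicious_estimate?(24_780, 25_000)
```

The first pair is the rows=5 versus actual rows=50000 case above; the second pair is the Chicago Seq Scan, where the estimate was close.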

ANALYZE

Run:

ANALYZE users;

PostgreSQL updates statistics.

The planner now has better information.

VACUUM ANALYZE

Often you’ll see:

VACUUM ANALYZE users;

It does two things:

Cleans dead tuples
Updates statistics

We’ll study MVCC later.

The Most Common Plan Nodes

These are the ones you should know well for interviews.

Seq Scan

Reads every row.

Think:

Read entire book.

Index Scan

Uses an index.

Think:

Use the book's index.

Index Only Scan

Avoids the table whenever the visibility map allows it.

Think:

Everything I need is already in the index.

Bitmap Index Scan

Collect matching row locations.

Bitmap Heap Scan

Fetch those rows efficiently.

Sort

ORDER BY

often produces:

Sort

Sorting millions of rows can be expensive.

Aggregate

Produced by:

			
COUNT()
SUM()
AVG()
GROUP BY

Hash Join

Often used for joins.

We’ll study joins from PostgreSQL’s perspective soon.

Nested Loop

Good when:

One side is tiny.

Terrible when:

Both sides are huge.

Limit

Produced by:

LIMIT 10

Real Example

			
SELECT *
FROM users
ORDER BY created_at DESC
LIMIT 10;

Possible plan:

			
Limit
↓
Sort
↓
Seq Scan

		

Question:

Can we improve it?

Yes.

Index:

			
CREATE INDEX idx_created_at
ON users(created_at DESC);

Now PostgreSQL may avoid sorting completely.

Buffers (Advanced)

Sometimes you’ll see:

			
Buffers: shared hit=500 read=2

Meaning:

Most pages were already in PostgreSQL’s shared buffers (500 hits); only 2 had to be read from outside the buffer cache.

We’ll study this later.

Parallel Query

Sometimes:

			
Gather
↓
Parallel Seq Scan

PostgreSQL used multiple CPU workers.

Very common for huge tables.

How to Read Any Plan

I use this checklist.

Step 1

What is the scan type?

Step 2

Which index?

Step 3

Estimated rows?

Step 4

Actual rows?

Step 5

Huge mismatch?

If yes,

statistics may be wrong.

Step 6

Planning vs execution time.

Step 7

Which operation consumed most of the cost?

Real Interview Example

Interviewer shows:

			
Seq Scan
rows=100000
actual rows=1

Question:

Would you optimize?

Yes.

Probably missing an index. The estimate is also far from the actual row count, so check the statistics too.

Another example:

			
Index Scan
rows=90000

Question:

Should PostgreSQL maybe use Seq Scan?

Possibly.

Need to inspect the query.

Common Mistakes

Mistake 1

Thinking cost is milliseconds.

Wrong.

Mistake 2

Looking only at execution time.

Also inspect:

estimated rows
actual rows

Mistake 3

Ignoring scan type.

Always notice:

			
Seq
Index
Bitmap
Index Only

Mistake 4

Assuming an index must always be used.

False.

Senior-Level Interview Questions

Q1

Difference:

			
EXPLAIN
EXPLAIN ANALYZE

Q2

Why can PostgreSQL ignore an index?

Q3

What does

rows

mean?

Q4

Difference between

			
rows
actual rows

Q5

What is

loops

Q6

Difference between

			
Index Scan
Index Only Scan

Q7

Why is Bitmap Heap Scan useful?

Q8

Why isn’t cost measured in milliseconds?

Q9

How do stale statistics affect query plans?

Q10

Why should you run

ANALYZE

after major data changes?

Practical Exercises

Exercise 1

Run:

			
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE email='user50000@example.com';

Write down:

Scan type
Estimated rows
Actual rows
Execution time

Exercise 2

Run:

			
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE city='Chicago';

Explain why PostgreSQL chose that plan.

Exercise 3

Run:

			
EXPLAIN ANALYZE
SELECT *
FROM users
ORDER BY created_at DESC
LIMIT 10;

		

Then create an index:

			
CREATE INDEX idx_users_created_at_desc
ON users(created_at DESC);

Run the query again and compare the plans.

Exercise 4

Run:

ANALYZE users;

Then compare the estimated rows with the actual rows again.

Senior Rails Interview Tips

If an interviewer gives you an execution plan, don’t immediately suggest adding an index.

Instead, ask:

How many rows are in the table?
How many rows does this query return?
What indexes already exist?
Is the planner’s estimate accurate?
Is the query actually slow?

That line of reasoning demonstrates experience much better than jumping straight to “add an index.”

What’s Next?

From here, I recommend Day 7: Query Optimization Workshop.

Unlike the previous lessons, it won’t introduce many new SQL keywords. Instead, we’ll work through real production-style problems, such as:

A query that takes 8 seconds—how do we optimize it?
Why did PostgreSQL choose a Nested Loop instead of a Hash Join?
N+1 queries in Rails and how to eliminate them.
OFFSET pagination vs keyset pagination.
Rewriting slow SQL into faster SQL.
Using indexes effectively rather than adding them blindly.

This is the stage where you’ll start thinking like a senior backend engineer rather than someone who simply knows SQL syntax.

Happy Learning! 🚀
