Query Processing and Optimization

Unit 3•CLO04

When you write a SQL query, the database doesn't just blindly execute it. Behind the scenes, the query processor analyzes, transforms, and optimizes your query to find the fastest way to get results. Understanding this process helps you write better queries and diagnose performance issues.

💻 Interactive: SQL Query Builder

Build your SQL query and see it in action!

Operation

Columns

Table

Condition (WHERE)

Generated Query:

SELECT * FROM Students;

The Basics

Query processing is the set of activities performed by the DBMS to translate a high-level query (SQL) into an efficient execution plan.

From master.txt, the pipeline is:

Parsing and translation
Optimization (the DML compiler chooses the lowest-cost evaluation plan among alternatives)
Evaluation (execute the chosen plan)

Why it matters: pushing selections/projections early can drastically reduce intermediate results, which often turns “slow” queries into fast ones.

Technical Details

Query Optimization Approaches:

HEURISTIC OPTIMIZATION (Rule-Based):

Rules that generally improve performance:

Rule 1: Perform selections early
πname(σsalary>50000(Employee ⋈ Department))
Better: πname(σsalary>50000(Employee) ⋈ Department)
Why? Smaller intermediate result

Rule 2: Perform projections early
Reduce tuple size, fewer I/Os

Rule 3: Combine selections
σc1(σc2(R)) = σ(c1 AND c2)(R)
Single scan instead of two

Rule 4: Combine projections
Similar to selections

Rule 5: Break complex selections with OR
σ(c1 OR c2)(R) can use union of two index scans

Rule 6: Evaluate Cartesian products with joins
σcondition(R × S) = R ⋈condition S
Join algorithms are optimized

COST-BASED OPTIMIZATION:

Estimate cost of each plan using:

A. Statistics:

Number of tuples (n)
Tuple size (bytes)
Number of distinct values (V)
Index structure (B-tree depth, hash buckets)

B. Cost Components:

Disk I/O cost (dominant factor)
CPU cost (comparisons, sorting)
Memory usage
Network cost (distributed databases)

C. Selectivity Estimation:
Selectivity = fraction of tuples satisfying condition

For A = value: 1/V(A) (uniform distribution assumption)
For A > value: (max - value)/(max - min)
For A AND B: s(A) × s(B) (independence assumption)
For A OR B: s(A) + s(B) - s(A)×s(B)

JOIN STRATEGIES:

Nested Loop Join:
For each tuple r in R:
For each tuple s in S:
If join-condition(r,s): output
Cost: n(R) + n(R)×n(S)
Best when: One relation is small
Block Nested Loop:
Process blocks instead of tuples
Cost: b(R) + b(R)×b(S)
Better I/O performance
Index Nested Loop:
For each tuple r in R:
Use index on S to find matching tuples
Cost: n(R) + n(R)×c
Where c = cost of index lookup
Best when: Index exists on join attribute
Sort-Merge Join:
Sort both relations on join attribute
Merge sorted relations
Cost: b(R) + b(S) + sort cost
Best for: Large sorted or nearly-sorted relations
Hash Join:
Build phase: Hash smaller relation
Probe phase: Hash larger relation, probe hash table
Cost: 3(b(R) + b(S))
Best for: Equi-joins on large unsorted relations
Most commonly used in practice

Query Equivalence:
Two queries are equivalent if they produce same result for any database state.

Equivalence Rules:

σ(c1 AND c2)(R) = σc1(σc2(R))
σc1(σc2(R)) = σc2(σc1(R))
πL1(πL2(...(πLn(R)))) = πL1(R)
σc(R × S) = R ⋈c S
R ⋈ S = S ⋈ R (commutative)
(R ⋈ S) ⋈ T = R ⋈ (S ⋈ T) (associative)

Dynamic Programming for Join Ordering:
For n tables, there are (2n-2)!/(n-1)! possible join orders.
DP finds optimal order in O(n· 2^n) time.

Examples

Example 1: Query Transformation

Original SQL:
SELECT name
FROM Employee, Department
WHERE Employee.deptId = Department.id
AND Department.location = 'NYC'
AND Employee.salary > 50000;

Naive Plan:

Employee × Department (huge Cartesian product)
Filter all conditions
Project name

Optimized Plan:

σlocation='NYC'(Department) — filter departments first
σsalary>50000(Employee) — filter employees first
Filtered_Employee ⋈ Filtered_Department — smaller join
πname — project at end

Example 2: Join Strategy Selection

Query: Employee ⋈deptId Department

Scenario A: Employee (1M rows), Department (50 rows)
Best: Index Nested Loop

Build index on Employee.deptId
For each department, lookup matching employees
Cost: 50 index lookups

Scenario B: Employee (1M rows), Department (500K rows)
Best: Hash Join

Hash smaller relation (Department)
Probe with Employee
Single pass through both relations

Example 3: Cost Estimation

Given:

Employee: 10,000 tuples, 100 tuples/block = 100 blocks
Department: 50 tuples, 5 tuples/block = 10 blocks
Join: Employee.deptId = Department.id

Nested Loop:
Cost = 100 + 10,000 × 10 = 100,100 block accesses

Block Nested Loop:
Cost = 100 + 100 × 10 = 1,100 block accesses

Hash Join:
Cost = 3(100 + 10) = 330 block accesses

Winner: Hash Join

EXPLAIN Command:

EXPLAIN SELECT * FROM Employee WHERE salary > 50000;

Output shows:

Scan type (Sequential, Index, Bitmap)
Rows estimate
Cost estimate
Filter conditions
Join method

Example Output:
Seq Scan on employee (cost=0.00..180.00 rows=1000 width=50)
Filter: (salary > 50000)

Vs with index:
Index Scan using idx_salary (cost=0.00..45.00 rows=1000 width=50)
Index Cond: (salary > 50000)

Real-World Use

Performance Tuning Tips:

Write Efficient Queries:
- Use WHERE to filter early
- Avoid SELECT *, specify needed columns
- Use LIMIT when possible
- Avoid functions on indexed columns
Indexing Strategy:
- Index foreign keys
- Index columns in WHERE, JOIN, ORDER BY
- Avoid over-indexing (slows INSERT/UPDATE)
- Consider composite indexes for multiple columns
Understand Query Plans:
- Use EXPLAIN to see execution plan
- Look for sequential scans on large tables
- Check join methods
- Monitor rows estimates vs actual
Database Configuration:
- Buffer pool size (memory for caching)
- Work_mem (memory for sorting, hashing)
- Statistics collection (ANALYZE command)
Query Hints (use sparingly):
- Force index usage
- Specify join order
- Enable/disable certain optimizations

Real-World Scenarios:

E-commerce: Optimize product search queries
Analytics: Optimize complex aggregation queries
Social media: Optimize feed generation queries
Financial: Optimize transaction history queries

For exams

Expected Questions:

Explain phases of query processing with diagram
What is the difference between heuristic and cost-based optimization?
Compare all join strategies: nested loop, block nested loop, index nested loop, sort-merge, hash
Given query and statistics, calculate cost for different join orders
Transform given SQL query to optimized relational algebra
Explain dynamic programming approach for join ordering
What is selectivity? How is it estimated?
Given EXPLAIN output, identify performance bottlenecks
Why is pushing selections down beneficial?
Explain the role of indexes in query optimization

Key points

Key Concepts:

• Query optimization chooses the fastest execution plan from many alternatives
• Heuristic rules provide quick optimizations (push selections down)
• Cost-based optimization uses statistics to estimate plan costs
• Join strategy selection depends on table sizes and available indexes
• Hash join is usually best for large equi-joins
• Index nested loop is best when one table is small
• Understanding optimization helps write efficient queries
• EXPLAIN command reveals execution plans for tuning

←

Relational Database Design

Storage Strategies

→