Query Processing and Optimization
When you write a SQL query, the database doesn't just blindly execute it. Behind the scenes, the query processor analyzes, transforms, and optimizes your query to find the fastest way to get results. Understanding this process helps you write better queries and diagnose performance issues.
💻 Interactive: SQL Query Builder
Build your SQL query and see it in action!
The Basics
Query processing is the set of activities performed by the DBMS to translate a high-level query (SQL) into an efficient execution plan.
From master.txt, the pipeline is:
- Parsing and translation
- Optimization (the DML compiler chooses the lowest-cost evaluation plan among alternatives)
- Evaluation (execute the chosen plan)
Why it matters: pushing selections/projections early can drastically reduce intermediate results, which often turns “slow” queries into fast ones.
Technical Details
Query Optimization Approaches:
- HEURISTIC OPTIMIZATION (Rule-Based):
Rules that generally improve performance:
Rule 1: Perform selections early
πname(σsalary>50000(Employee ⋈ Department))
Better: πname(σsalary>50000(Employee) ⋈ Department)
Why? Smaller intermediate result
Rule 2: Perform projections early
Reduce tuple size, fewer I/Os
Rule 3: Combine selections
σc1(σc2(R)) = σ(c1 AND c2)(R)
Single scan instead of two
Rule 4: Combine projections
Similar to selections
Rule 5: Break complex selections with OR
σ(c1 OR c2)(R) can use union of two index scans
Rule 6: Evaluate Cartesian products with joins
σcondition(R × S) = R ⋈condition S
Join algorithms are optimized
- COST-BASED OPTIMIZATION:
Estimate cost of each plan using:
A. Statistics:
- Number of tuples (n)
- Tuple size (bytes)
- Number of distinct values (V)
- Index structure (B-tree depth, hash buckets)
B. Cost Components:
- Disk I/O cost (dominant factor)
- CPU cost (comparisons, sorting)
- Memory usage
- Network cost (distributed databases)
C. Selectivity Estimation:
Selectivity = fraction of tuples satisfying condition
For A = value: 1/V(A) (uniform distribution assumption)
For A > value: (max - value)/(max - min)
For A AND B: s(A) × s(B) (independence assumption)
For A OR B: s(A) + s(B) - s(A)×s(B)
JOIN STRATEGIES:
-
Nested Loop Join:
For each tuple r in R:
For each tuple s in S:
If join-condition(r,s): output
Cost: n(R) + n(R)×n(S)
Best when: One relation is small -
Block Nested Loop:
Process blocks instead of tuples
Cost: b(R) + b(R)×b(S)
Better I/O performance -
Index Nested Loop:
For each tuple r in R:
Use index on S to find matching tuples
Cost: n(R) + n(R)×c
Where c = cost of index lookup
Best when: Index exists on join attribute -
Sort-Merge Join:
Sort both relations on join attribute
Merge sorted relations
Cost: b(R) + b(S) + sort cost
Best for: Large sorted or nearly-sorted relations -
Hash Join:
Build phase: Hash smaller relation
Probe phase: Hash larger relation, probe hash table
Cost: 3(b(R) + b(S))
Best for: Equi-joins on large unsorted relations
Most commonly used in practice
Query Equivalence:
Two queries are equivalent if they produce same result for any database state.
Equivalence Rules:
- σ(c1 AND c2)(R) = σc1(σc2(R))
- σc1(σc2(R)) = σc2(σc1(R))
- πL1(πL2(...(πLn(R)))) = πL1(R)
- σc(R × S) = R ⋈c S
- R ⋈ S = S ⋈ R (commutative)
- (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T) (associative)
Dynamic Programming for Join Ordering:
For n tables, there are (2n-2)!/(n-1)! possible join orders.
DP finds optimal order in O(n· 2^n) time.
Examples
Example 1: Query Transformation
Original SQL:
SELECT name
FROM Employee, Department
WHERE Employee.deptId = Department.id
AND Department.location = 'NYC'
AND Employee.salary > 50000;
Naive Plan:
- Employee × Department (huge Cartesian product)
- Filter all conditions
- Project name
Optimized Plan:
- σlocation='NYC'(Department) — filter departments first
- σsalary>50000(Employee) — filter employees first
- Filtered_Employee ⋈ Filtered_Department — smaller join
- πname — project at end
Example 2: Join Strategy Selection
Query: Employee ⋈deptId Department
Scenario A: Employee (1M rows), Department (50 rows)
Best: Index Nested Loop
- Build index on Employee.deptId
- For each department, lookup matching employees
- Cost: 50 index lookups
Scenario B: Employee (1M rows), Department (500K rows)
Best: Hash Join
- Hash smaller relation (Department)
- Probe with Employee
- Single pass through both relations
Example 3: Cost Estimation
Given:
- Employee: 10,000 tuples, 100 tuples/block = 100 blocks
- Department: 50 tuples, 5 tuples/block = 10 blocks
- Join: Employee.deptId = Department.id
Nested Loop:
Cost = 100 + 10,000 × 10 = 100,100 block accesses
Block Nested Loop:
Cost = 100 + 100 × 10 = 1,100 block accesses
Hash Join:
Cost = 3(100 + 10) = 330 block accesses
Winner: Hash Join
EXPLAIN Command:
EXPLAIN SELECT * FROM Employee WHERE salary > 50000;
Output shows:
- Scan type (Sequential, Index, Bitmap)
- Rows estimate
- Cost estimate
- Filter conditions
- Join method
Example Output:
Seq Scan on employee (cost=0.00..180.00 rows=1000 width=50)
Filter: (salary > 50000)
Vs with index:
Index Scan using idx_salary (cost=0.00..45.00 rows=1000 width=50)
Index Cond: (salary > 50000)
Real-World Use
Performance Tuning Tips:
-
Write Efficient Queries:
- Use WHERE to filter early
- Avoid SELECT *, specify needed columns
- Use LIMIT when possible
- Avoid functions on indexed columns
-
Indexing Strategy:
- Index foreign keys
- Index columns in WHERE, JOIN, ORDER BY
- Avoid over-indexing (slows INSERT/UPDATE)
- Consider composite indexes for multiple columns
-
Understand Query Plans:
- Use EXPLAIN to see execution plan
- Look for sequential scans on large tables
- Check join methods
- Monitor rows estimates vs actual
-
Database Configuration:
- Buffer pool size (memory for caching)
- Work_mem (memory for sorting, hashing)
- Statistics collection (ANALYZE command)
-
Query Hints (use sparingly):
- Force index usage
- Specify join order
- Enable/disable certain optimizations
Real-World Scenarios:
- E-commerce: Optimize product search queries
- Analytics: Optimize complex aggregation queries
- Social media: Optimize feed generation queries
- Financial: Optimize transaction history queries
For exams
Expected Questions:
- Explain phases of query processing with diagram
- What is the difference between heuristic and cost-based optimization?
- Compare all join strategies: nested loop, block nested loop, index nested loop, sort-merge, hash
- Given query and statistics, calculate cost for different join orders
- Transform given SQL query to optimized relational algebra
- Explain dynamic programming approach for join ordering
- What is selectivity? How is it estimated?
- Given EXPLAIN output, identify performance bottlenecks
- Why is pushing selections down beneficial?
- Explain the role of indexes in query optimization
Key points
Key Concepts:
• Query optimization chooses the fastest execution plan from many alternatives
• Heuristic rules provide quick optimizations (push selections down)
• Cost-based optimization uses statistics to estimate plan costs
• Join strategy selection depends on table sizes and available indexes
• Hash join is usually best for large equi-joins
• Index nested loop is best when one table is small
• Understanding optimization helps write efficient queries
• EXPLAIN command reveals execution plans for tuning