Query Processing, Indexing (B-tree, Hashing), and Advanced Database Systems

Unit 5 (CLO02, CLO05)

When you execute a SQL query, the database doesn't just blindly run it. Behind the scenes, the query processor analyzes, transforms, and optimizes it to find the fastest execution path. Understanding query processing, indexing, and advanced database architectures is crucial for building high-performance applications that scale from single servers to globally distributed systems handling petabytes of data.

The Basics

Query Processing

Definition: Activities to transform high-level query into efficient execution plan

Steps:

  1. Parsing and Translation

    • Check syntax
    • Verify relations and attributes exist
    • Translate SQL to internal representation (relational algebra)
  2. Optimization

    • Generate alternative execution plans
    • Estimate cost of each plan
    • Choose plan with lowest cost
  3. Evaluation

    • Execute chosen plan
    • Return results to user

Why Optimization Matters:

  • Same query, different plans
  • Execution time can vary by orders of magnitude
  • Example:
    • Bad plan: 10 minutes
    • Good plan: 1 second
    • 600x speedup!

Indexing

Purpose: Fast data retrieval without scanning entire table

Analogy: Book index vs reading cover to cover

Trade-offs:

  • Benefit: Fast SELECT queries
  • Cost: Slower INSERT/UPDATE/DELETE, storage overhead

Types:

  1. B-tree Index: Balanced tree, range queries
  2. Hash Index: Hash table, equality queries
  3. Bitmap Index: Bit vectors, low-cardinality columns
  4. Full-text Index: Text search

Advanced Database Systems

Beyond traditional single-server relational databases:

  1. Distributed Databases: Data across multiple sites
  2. Data Warehouses: Historical data for analytics
  3. NoSQL: Non-relational, scalable
  4. Mobile Databases: Lightweight, synchronization
  5. Web Databases: Internet-scale, cloud-native

Technical Details

QUERY PROCESSING IN DETAIL

1. SELECTION

Algorithms:

A1 (Linear Search):

  • Scan entire file
  • Cost: br (number of blocks)
  • Use when: No index, unsorted

A2 (Binary Search):

  • Requires sorted file
  • Cost: ⌈log₂(br)⌉ + ⌈sc/fr⌉ (sc = number of matching records, fr = records per block)
  • Use when: Sorted on search key

A3 (Primary Index, Equality):

  • Use B+-tree or hash index
  • Cost: HTi + 1 (tree height + 1 data block)
  • Use when: Primary key equality

A4 (Primary Index, Range):

  • Use B+-tree index
  • Cost: HTi + b (blocks in range)
  • Use when: Range query on primary key

A5 (Secondary Index):

  • May retrieve many blocks
  • Cost: HTi + n (n matching records)
  • Use when: Secondary key query

2. SORTING

External Sort-Merge:

Given: M memory buffers, br file blocks

Pass 0 (Create Runs):

  • Read M blocks, sort in memory, write
  • Creates ⌈br/M⌉ runs of up to M blocks each

Pass 1, 2, ... (Merge):

  • Merge M-1 runs at a time
  • Number of passes: ⌈log_{M-1}(br/M)⌉

Total Cost:

  • Read + Write each block in each pass
  • 2br × (number of passes)

Example:

  • File: 1000 blocks
  • Memory: 5 buffers
  • Pass 0: ⌈1000/5⌉ = 200 runs
  • Pass 1: ⌈200/4⌉ = 50 runs
  • Pass 2: ⌈50/4⌉ = 13 runs
  • Pass 3: ⌈13/4⌉ = 4 runs
  • Pass 4: ⌈4/4⌉ = 1 run
  • Total: 5 passes, cost = 2×1000×5 = 10,000 block accesses
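The run and pass counts above follow mechanically from the formulas; a small sketch in Python (helper name is illustrative):

```python
import math

def external_sort_stats(file_blocks: int, memory_buffers: int):
    """Count passes and total block I/O for external sort-merge.
    Assumes memory_buffers >= 3 (one buffer reserved for merge output)."""
    runs = math.ceil(file_blocks / memory_buffers)   # Pass 0: sorted runs
    passes = 1                                       # Pass 0 counts as a pass
    fan_in = memory_buffers - 1                      # merge fan_in runs at a time
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes, 2 * file_blocks * passes          # read + write per pass

passes, cost = external_sort_stats(1000, 5)
# 200 -> 50 -> 13 -> 4 -> 1 runs: 5 passes, 10,000 block accesses
```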

3. JOIN ALGORITHMS

Nested-Loop Join:

for each tuple tr in R:
    for each tuple ts in S:
        if tr and ts match:
            output <tr, ts>

Cost: br + nr × bs
(Read R once; for each R tuple, scan entire S)

Block Nested-Loop Join:

for each block Br of R:
    for each block Bs of S:
        for each tuple tr in Br:
            for each tuple ts in Bs:
                if match: output

Cost: br + br × bs
(For each R block, scan entire S)

Better: Use smaller relation as outer

Indexed Nested-Loop Join:

for each tuple tr in R:
    use index to find matching tuples in S
    for each match: output

Cost: br + nr × c
(c = cost of index lookup)

Best when: S has index on join attribute

Sort-Merge Join:

Algorithm:

  1. Sort R on join attribute
  2. Sort S on join attribute
  3. Merge sorted relations

Cost:

  • Sort R: 2br × ⌈log_{M-1}(br/M)⌉
  • Sort S: 2bs × ⌈log_{M-1}(bs/M)⌉
  • Merge: br + bs
  • Total: Sort cost + br + bs

Best when: Relations already sorted or result needed sorted

Hash Join:

Build Phase:

  1. Hash smaller relation R into M-1 buckets
  2. Keep hash table in memory

Probe Phase:

  1. Hash each tuple of S
  2. Probe corresponding bucket
  3. Output matches

Cost: 3(br + bs)
(Partitioning reads and writes each relation once; the join phase reads each partition once more)

Best when: Equi-join, enough memory for hash table

Most commonly used in practice!
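The build/probe core of a hash join fits in a few lines of Python. This is an in-memory sketch only: the disk partitioning step of the Grace/hybrid variants is omitted, and the relation and attribute names are illustrative:

```python
from collections import defaultdict

def hash_join(r, s, r_key, s_key):
    """In-memory hash join: build a hash table on R (the smaller relation),
    then probe it once per S tuple."""
    table = defaultdict(list)          # build phase
    for tr in r:
        table[tr[r_key]].append(tr)
    out = []                           # probe phase
    for ts in s:
        for tr in table.get(ts[s_key], []):
            out.append({**tr, **ts})
    return out

depts = [{"dept": 10, "dname": "IT"}]
orders = [{"dept": 10, "emp": 1}, {"dept": 20, "emp": 2}]
result = hash_join(depts, orders, "dept", "dept")
# only the dept-10 order finds a match
```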

QUERY OPTIMIZATION

Cost-Based Optimization:

Factors:

  1. Access Cost: Disk I/O
  2. Storage Cost: Intermediate results
  3. Computation Cost: CPU operations
  4. Memory Cost: Buffers needed
  5. Communication Cost: Distributed systems

Statistics Needed:

  • nr: Number of tuples in relation r
  • br: Number of blocks containing tuples of r
  • lr: Size of tuple in relation r
  • fr: Blocking factor (tuples per block)
  • V(A, r): Number of distinct values of attribute A in r

Selectivity Estimation:

Equality: σA=v(r)

  • Selectivity: 1/V(A, r)
  • Result size: nr / V(A, r)

Range: σA≥v(r)

  • Selectivity: (max(A) - v) / (max(A) - min(A))
  • Result size: nr × selectivity

Conjunction: σA=v ∧ B=w(r)

  • Selectivity: s(A) × s(B) (independence assumption)

Disjunction: σA=v ∨ B=w(r)

  • Selectivity: s(A) + s(B) - s(A) × s(B)

Join Size Estimation:

Natural Join: R ⋈ S

If R.A is key for R:

  • Size ≤ |S| (each S tuple matches at most one R tuple)

If S.A is key for S:

  • Size ≤ |R|

If neither is key:

  • Size ≤ |R| × |S| / max(V(A,R), V(A,S))
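Under the stated uniformity and independence assumptions, these estimators are one-liners (function names are illustrative):

```python
def eq_result_size(n_r, v_a_r):
    """sigma_{A=v}(r): assume values spread uniformly over V(A, r) values."""
    return n_r / v_a_r

def range_selectivity(v, a_min, a_max):
    """sigma_{A>=v}(r) under the same uniformity assumption."""
    return (a_max - v) / (a_max - a_min)

def conj_selectivity(s_a, s_b):
    """sigma_{A=v AND B=w}: independence assumption."""
    return s_a * s_b

def disj_selectivity(s_a, s_b):
    """sigma_{A=v OR B=w}: inclusion-exclusion."""
    return s_a + s_b - s_a * s_b

def join_size_bound(n_r, n_s, v_a_r, v_a_s):
    """|R join S| bound when the join attribute is a key in neither relation."""
    return n_r * n_s / max(v_a_r, v_a_s)

# e.g. 10,000 employees over 50 departments -> about 200 per department
```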

Transformation Rules:

  1. σc1∧c2(r) = σc1(σc2(r))
    (Cascade selections)

  2. σc1(σc2(r)) = σc2(σc1(r))
    (Commutative selections)

  3. πL1(πL2(r)) = πL1(r) if L1 ⊆ L2
    (Cascade projections)

  4. σc(r ⋈ s) = σc(r) ⋈ s if c uses only r attributes
    (Push selection through join)

  5. r ⋈ s = s ⋈ r
    (Join commutativity)

  6. (r ⋈ s) ⋈ t = r ⋈ (s ⋈ t)
    (Join associativity)

Optimization Strategy:

  1. Push selections down (reduce tuples early)
  2. Push projections down (reduce attributes early)
  3. Combine selections and cross-products into joins
  4. Choose join order using dynamic programming
  5. Use indexes when beneficial

Example Optimization:

Original:

πName(σDeptName='IT' ∧ Salary>50000(Employee × Department))

Problems:

  • Cartesian product very expensive
  • Selection after product

Optimized:

πName(σSalary>50000(Employee) ⋈ σDeptName='IT'(Department))

Benefits:

  • Selections pushed down (smaller intermediate results)
  • Product replaced with join
  • May use indexes on selections

B-TREE INDEX

Properties:

  • Balanced tree: All leaves at same depth
  • Order n: Each node has [⌈n/2⌉, n] keys (except root)
  • Root has [1, n] keys
  • Internal node with k keys has k+1 children

Structure:

Internal Node: [P₁ K₁ P₂ K₂ ... Pₙ Kₙ Pₙ₊₁]
- Kᵢ: Search keys
- Pᵢ: Pointers to children
- Keys in Pᵢ < Kᵢ < keys in Pᵢ₊₁

Search Algorithm:

Search(key K, node N):
    if N is leaf:
        return record with key K in N
    else:
        find i such that Kᵢ ≤ K < Kᵢ₊₁
        return Search(K, child Pᵢ₊₁)

Cost:

  • Height h = ⌈log_⌈n/2⌉(N)⌉ where N = number of keys
  • Search cost: h disk reads

Insertion:

  1. Search for leaf where key should go
  2. If leaf has space, insert key
  3. If leaf full, split:
    • Create new leaf
    • Distribute keys evenly
    • Promote middle key to parent
    • Recursively split parent if full

Deletion:

  1. Search for key and remove
  2. If node underflows (< ⌈n/2⌉ keys):
    • Try to borrow from sibling
    • If can't borrow, merge with sibling
    • Recursively fix parent

B+-TREE INDEX

Differences from B-tree:

  1. All data in leaves

    • Internal nodes only have keys + pointers
    • Leaves contain actual data records or pointers
  2. Leaves linked

    • Sequential access without tree traversal
    • Efficient range queries
  3. Redundant keys

    • Keys in internal nodes repeated in leaves

Structure:

Internal Node: [P₁ K₁ P₂ K₂ ... Pₙ]
- Guide searches, no data

Leaf Node: [K₁ Data₁ K₂ Data₂ ... Kₙ Dataₙ Next]
- Contains actual records
- Linked to next leaf (→)

Advantages:

  • Higher fanout (more keys per internal node)
  • Shorter tree (fewer disk I/Os)
  • Efficient range scans (follow leaf links)
  • Most databases use B+-tree (not B-tree)

Example B+-tree (order 3):

                    [30]
                   /    \
             [10,20]    [40,50]
            /  |  \     /  |  \
Leaves: [5,7][12,15][25,28][35,38][45,48][55,60]
         ↓     ↓      ↓       ↓       ↓       ↓
Leaves linked: → → → → → →

Range Query Example:

Find all keys 12 ≤ key ≤ 45:

  1. Search for 12 → finds leaf [12,15]
  2. Follow leaf links: [12,15] → [25,28] → [35,38] → [45,48]
  3. Stop when key > 45
  4. Cost: O(log n) + number of result blocks
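The leaf-link walk behind this range query can be sketched in Python. The `Leaf` class and the `bisect`-based lookup below are illustrative stand-ins for the real root-to-leaf descent:

```python
import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # link to the right sibling leaf

def link(leaves):
    for a, b in zip(leaves, leaves[1:]):
        a.next = b

def range_query(leaves, lo, hi):
    """Locate the first candidate leaf (stand-in for the tree descent),
    then follow leaf links, stopping once a key exceeds hi."""
    firsts = [leaf.keys[0] for leaf in leaves]
    i = max(0, bisect.bisect_right(firsts, lo) - 1)
    node, out = leaves[i], []
    while node:
        for k in node.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        node = node.next
    return out

leaves = [Leaf([5, 7]), Leaf([12, 15]), Leaf([25, 28]),
          Leaf([35, 38]), Leaf([45, 48]), Leaf([55, 60])]
link(leaves)
# 12 <= key <= 45 touches four leaves via the sibling links
```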

HASHING INDEX

Static Hashing:

Hash Function: h(K) = K mod M

Buckets: M buckets, each can hold multiple records

Search:

1. Compute h(K)
2. Access bucket h(K)
3. Search bucket for key K

Cost: 1-2 disk reads (bucket + maybe overflow)

Collision Handling:

  • Chaining: Link overflow buckets
  • Open addressing: Probe next bucket

Problem: Fixed M, poor for growing data

Dynamic Hashing:

Extendible Hashing:

  • Use d bits of hash value
  • d grows/shrinks dynamically
  • Directory of 2^d entries
  • Buckets split when full

Example:

d=2 (use 2 bits)
Directory:
00 → Bucket A
01 → Bucket B
10 → Bucket C
11 → Bucket D

Insert causes overflow in Bucket A:
- Split using the 3rd bit: A1 (last bits 000) and A2 (last bits 100)
- The directory must now distinguish 3 bits, so double it (d=3)

Linear Hashing:

  • No directory
  • Buckets split in linear order
  • Uses two hash functions
  • Gradual growth

Hash vs B+-tree:

Feature     Hash        B+-tree
--------    --------    -----------------
Equality    O(1)        O(log n)
Range       O(n)        O(log n + result)
Ordered     No          Yes
Dynamic     Complex     Easy

Use hash for: Exact match lookups (primary keys)
Use B+-tree for: Everything else (default)

DISTRIBUTED DATABASES

Definition: Database stored across multiple sites connected by network

Types:

Homogeneous: Same DBMS at all sites
Heterogeneous: Different DBMS at different sites

Data Distribution:

1. Fragmentation:

Horizontal:

  • Split rows
  • Example: Customers_East, Customers_West

Vertical:

  • Split columns
  • Example: Employee_Public(ID, Name), Employee_Private(ID, Salary)

Hybrid:

  • Both horizontal and vertical

2. Replication:

Full Replication:

  • Complete copy at each site
  • Fast reads, slow writes
  • High availability

Partial Replication:

  • Some fragments replicated
  • Balance availability and cost

No Replication:

  • Each fragment at one site only
  • Fragmentation only

Advantages:

  • Local autonomy
  • Improved performance (data locality)
  • Improved reliability (no single point of failure)
  • Easier expansion

Challenges:

  • Distributed query processing
  • Distributed transaction management
  • Maintaining consistency (replication)
  • Increased complexity

Distributed Transactions:

Two-Phase Commit (2PC):

Phase 1 (Prepare):

  1. Coordinator sends "Prepare" to all participants
  2. Participants execute transaction
  3. Participants respond "Ready" or "Abort"

Phase 2 (Commit/Abort):

  1. If all "Ready": Coordinator sends "Commit"
  2. If any "Abort": Coordinator sends "Abort"
  3. Participants commit/abort and ACK
  4. Coordinator completes

Problem: Blocking if coordinator fails
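The coordinator's side of 2PC can be sketched as a simple vote-collection loop. Participants are modeled as plain callables; a real implementation adds write-ahead logging, timeouts, and recovery:

```python
def two_phase_commit(participants):
    """Coordinator's view of 2PC: collect votes, then broadcast the decision.
    Each participant is a callable that answers 'Prepare' with 'Ready' or
    'Abort', and acknowledges the final decision."""
    # Phase 1 (Prepare): every participant votes
    votes = [p("Prepare") for p in participants]
    # Phase 2 (Commit/Abort): commit only on a unanimous 'Ready'
    decision = "Commit" if all(v == "Ready" for v in votes) else "Abort"
    for p in participants:
        p(decision)  # participants apply the decision and ACK
    return decision

always_ready = lambda msg: "Ready"
always_abort = lambda msg: "Abort" if msg == "Prepare" else "ACK"
# one 'Abort' vote is enough to abort globally
```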

Three-Phase Commit (3PC):

  • Adds "Pre-commit" phase
  • Non-blocking
  • More complex

CAP Theorem:

Can achieve at most 2 of 3:

  • Consistency: All nodes see same data
  • Availability: Every request gets response
  • Partition Tolerance: System works despite network failures

Trade-offs:

  • Traditional RDBMS: CA (sacrifice P)
  • NoSQL: Often AP or CP

DATA WAREHOUSING

Definition: Subject-oriented, integrated, time-variant, non-volatile collection for decision support

Characteristics:

  1. Subject-Oriented:

    • Organized by business subjects (Sales, Inventory)
    • Not by applications
  2. Integrated:

    • Data from multiple sources
    • Consistent naming, encoding, attributes
  3. Time-Variant:

    • Historical data
    • Timestamped
  4. Non-Volatile:

    • Read-only (mostly)
    • Loaded and accessed, not updated

Architecture:

Data Sources (OLTP)
   ↓
ETL (Extract, Transform, Load)
   ↓
Data Warehouse (Central)
   ↓
Data Marts (Departments)
   ↓
OLAP / BI Tools
   ↓
Reports / Dashboards

ETL Process:

Extract:

  • Read from source systems
  • Incremental or full extraction

Transform:

  • Cleanse data (handle nulls, errors)
  • Convert data types
  • Standardize formats
  • Apply business rules
  • Aggregate data

Load:

  • Write to data warehouse
  • Incremental or full load
  • Schedule: nightly, hourly, real-time

Dimensional Modeling:

Star Schema:

     DimTime
        |
        |
DimProduct - FactSales - DimCustomer
        |
        |
     DimStore

Fact Table: Measures (sales, quantity, revenue)
Dimension Tables: Context (who, what, when, where, why)

Snowflake Schema:

  • Normalized dimensions
  • Dimension tables have foreign keys to other dimension tables
  • More complex, less redundant

OLAP Operations:

  1. Roll-Up: Aggregate to higher level

    • Daily → Monthly → Yearly
  2. Drill-Down: Detail to lower level

    • Yearly → Quarterly → Monthly
  3. Slice: Fix one dimension

    • Sales in 2024
  4. Dice: Fix multiple dimensions

    • Sales in 2024, Q1, West region
  5. Pivot: Rotate cube

    • Rows ↔ Columns

DATA MINING

Definition: Discovering patterns, correlations, trends in large datasets

KDD Process:

  1. Data Selection: Choose relevant data
  2. Pre-processing: Clean, integrate
  3. Transformation: Normalize, aggregate
  4. Data Mining: Apply algorithms
  5. Interpretation: Evaluate patterns
  6. Knowledge: Act on insights

Techniques:

1. Association Rules:

Market Basket Analysis:

{Bread, Butter} ⇒ {Milk}
Support = 30% (30% of transactions have all three)
Confidence = 80% (80% with bread and butter also have milk)

Apriori Algorithm:

  1. Find frequent itemsets (support ≥ threshold)
  2. Generate association rules (confidence ≥ threshold)

2. Classification:

Assign items to predefined classes:

  • Decision Trees (ID3, C4.5, CART)
  • Naive Bayes
  • Support Vector Machines (SVM)
  • Neural Networks

Example: Email spam classification

3. Clustering:

Group similar items (no predefined classes):

  • K-Means
  • Hierarchical Clustering
  • DBSCAN (density-based)

Example: Customer segmentation

4. Regression:

Predict numeric values:

  • Linear Regression
  • Logistic Regression
  • Polynomial Regression

Example: House price prediction

5. Anomaly Detection:

Identify outliers:

  • Statistical methods
  • Distance-based
  • Density-based

Example: Fraud detection, intrusion detection

MOBILE AND WEB DATABASES

Mobile Databases:

Characteristics:

  • Limited resources (CPU, memory, battery)
  • Intermittent connectivity
  • Location-aware
  • Lightweight

Examples:

  • SQLite (embedded)
  • Realm
  • Couchbase Lite

Challenges:

  • Synchronization with server
  • Offline operation
  • Conflict resolution
  • Battery efficiency

Synchronization Strategies:

1. Full Sync:

  • Download entire database
  • Simple but inefficient

2. Incremental Sync:

  • Only changed data
  • Requires tracking changes

3. Bidirectional Sync:

  • Client ↔ Server
  • Conflict resolution needed

Conflict Resolution:

  • Last Write Wins (LWW)
  • Version Vectors
  • Application-specific logic

Web Databases:

Backend (Server-Side):

  • Traditional RDBMS (MySQL, PostgreSQL)
  • NoSQL (MongoDB, Cassandra)
  • Cloud databases (Aurora, Cloud SQL)

Frontend (Client-Side):

  • IndexedDB (browser storage)
  • LocalStorage (key-value, 5-10MB limit)
  • WebSQL (deprecated)

Architecture Patterns:

1. Traditional (3-Tier):

Browser → Web Server → Database

2. RESTful API:

Frontend (React/Angular)
   ↓ HTTP/REST
Backend (Node.js/Django)
   ↓ SQL/NoSQL
Database

3. Microservices:

Frontend → API Gateway → Service1 (DB1)
                       → Service2 (DB2)
                       → Service3 (DB3)

4. Serverless:

Frontend → Cloud Functions → Managed Database
(AWS Lambda, Firebase)

Scalability Strategies:

Vertical Scaling:

  • Bigger server (more CPU, RAM)
  • Limited by hardware

Horizontal Scaling:

  • More servers
  • Requires distribution strategy

Caching:

  • Redis, Memcached
  • Reduce database load

Content Delivery Network (CDN):

  • Cache static content
  • Serve from edge locations

Cloud Databases:

SQL:

  • Amazon RDS
  • Google Cloud SQL
  • Azure SQL Database

NoSQL:

  • Amazon DynamoDB
  • MongoDB Atlas
  • Google Firestore

Data Warehouse:

  • Amazon Redshift
  • Google BigQuery
  • Snowflake

Advantages:

  • Managed (less ops)
  • Scalable
  • High availability
  • Pay-as-you-go

Disadvantages:

  • Vendor lock-in
  • Cost (can be high at scale)
  • Less control

Examples

COMPREHENSIVE EXAMPLES

EXAMPLE 1: Query Optimization

Original Query:

SELECT E.Name, D.DeptName
FROM Employee E, Department D, Salary S
WHERE E.DeptID = D.DeptID
  AND E.EmpID = S.EmpID
  AND D.DeptName = 'IT'
  AND S.Salary > 50000;

Inefficient Plan:

1. Employee × Department × Salary  -- Huge cartesian product!
2. σ(DeptName='IT' ∧ Salary>50000)
3. π(Name, DeptName)

Cost: |Employee| × |Department| × |Salary| scans

Optimized Plan:

1. Result1 = σ(DeptName='IT')(Department)   -- Filter early: 1 dept
2. Result2 = σ(Salary>50000)(Salary)        -- Filter early: ~10% of rows
3. Result3 = Result1 ⋈(DeptID) Employee     -- Join filtered dept with employees
4. Result4 = Result3 ⋈(EmpID) Result2       -- Join with filtered salaries
5. π(Name, DeptName)(Result4)

Improvements:

  • Push selections down (reduce intermediate results)
  • Replace cartesian products with joins
  • Use indexes on DeptName and Salary

Cost Reduction: 1000x or more!

EXAMPLE 2: Join Algorithm Selection

Query:

SELECT *
FROM Orders O, Customer C
WHERE O.CustomerID = C.CustomerID;

Scenario A: No Index

  • Orders: 1M rows, 10,000 blocks
  • Customer: 100K rows, 1,000 blocks

Block Nested-Loop:

Cost = 10,000 + 10,000 × 1,000 = 10,010,000 block reads

Sort-Merge (M=100 buffers):

Sort Orders: 2 × 10,000 × ⌈log₉₉(100)⌉ = 40,000
Sort Customer: 2 × 1,000 × ⌈log₉₉(10)⌉ = 2,000
Merge: 10,000 + 1,000 = 11,000
Total = 53,000 block reads

Hash Join:

Partition: 2 × (10,000 + 1,000) = 22,000
Join: 10,000 + 1,000 = 11,000
Total = 33,000 block reads

Winner: Hash Join (300x faster than Nested-Loop!)

Scenario B: Index on Customer.CustomerID

Indexed Nested-Loop:

Cost = 10,000 + 1,000,000 × 2 = 2,010,000 block reads
(For each Order, 2 reads: index + data)

Winner (B): Still Hash Join! (~60x faster than Indexed Nested-Loop)
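Plugging the scenario's block counts into the cost formulas from the join-algorithms section gives the comparison directly; a quick check (sort costs taken from the Sort-Merge figures above):

```python
def block_nested_loop(b_outer, b_inner):
    return b_outer + b_outer * b_inner

def sort_merge(b_r, b_s, sort_r_cost, sort_s_cost):
    return sort_r_cost + sort_s_cost + b_r + b_s

def hash_join(b_r, b_s):
    return 3 * (b_r + b_s)

def indexed_nested_loop(b_outer, n_outer, lookup_cost):
    return b_outer + n_outer * lookup_cost

b_orders, b_cust = 10_000, 1_000     # Orders and Customer block counts
costs = {
    "block-nl":   block_nested_loop(b_orders, b_cust),
    "sort-merge": sort_merge(b_orders, b_cust, 40_000, 2_000),
    "hash":       hash_join(b_orders, b_cust),
    "indexed-nl": indexed_nested_loop(b_orders, 1_000_000, 2),
}
# hash join is cheapest in both scenarios
```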

EXAMPLE 3: B+-tree Operations

B+-tree (order 3: leaves hold at most 2 keys, internal nodes up to 3):

Initial Tree:

             [20]
            /    \
       [10,15]  [25,30]
       /  |  \    /  |  \
    [5,7][12,13][18,19][22,23][27,28][32,35]
     ↓     ↓       ↓      ↓      ↓      ↓
  (leaves linked)

Insert 14:

  1. Search finds leaf [12,13]
  2. Leaf full (2-key leaves), must split
  3. Split [12,13,14] → [12] and [13,14]
  4. Promote 13 to parent [10,15] → [10,13,15]

Result:

             [20]
            /    \
      [10,13,15]  [25,30]
      /  |  |  \    /  |  \
   [5,7][12][13,14][18,19][22,23][27,28][32,35]

Insert 16, 17 (causes cascade):

Leaf [18,19] becomes [16,17,18,19] (overflow!)
Split → [16,17] and [18,19], copy 18 up
Parent [10,13,15] becomes [10,13,15,18] (overflow!)
Split parent → [10,13] and [15,18], promote 15 to root
Root [20] becomes [15,20]

Result:

           [15,20]
          /   |   \
    [10,13] [15,18] [25,30]
      ...     ...     ...

Delete 12:

  1. Remove 12 from leaf [12]
  2. Leaf empty (underflow!)
  3. Try to borrow from sibling [13,14]
  4. Success: Redistribute → [13] and [14]
  5. Update parent key 13 → 14

EXAMPLE 4: Hash Index

Static Hash (M=5 buckets):

h(K) = K mod 5

Insert: 5, 15, 25, 3, 8, 13, 18, 23, 28

Bucket 0: 5 → 15 → 25 (overflow chain!)
Bucket 1: empty
Bucket 2: empty
Bucket 3: 3 → 8 → 13 → 18 → 23 → 28 (long chain!)
Bucket 4: empty

Problem: Clustering, poor distribution
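The clustering above is easy to reproduce (the `bucketize` helper is hypothetical):

```python
from collections import defaultdict

def bucketize(keys, m):
    """Static hashing with h(K) = K mod M; returns bucket -> keys."""
    buckets = defaultdict(list)
    for k in keys:
        buckets[k % m].append(k)
    return dict(buckets)

b = bucketize([5, 15, 25, 3, 8, 13, 18, 23, 28], 5)
# bucket 0 holds [5, 15, 25]; bucket 3 holds a 6-key overflow chain;
# buckets 1, 2, 4 stay empty
```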

Extendible Hash (d=2):

h(K) = last 2 bits of K

Directory (2² = 4 entries):
00 → Bucket A: [4, 8, 12]
01 → Bucket B: [5, 9, 13]
10 → Bucket C: [6, 10, 14]
11 → Bucket D: [7, 11, 15]

Insert 16 (binary: ...10000, last 2 bits: 00):
- Goes to Bucket A
- Bucket A full (assume capacity=3)

Split:
- Use the 3rd bit: keys whose last 3 bits are 000 (8, 16) go to new Bucket A'
- Keys whose last 3 bits are 100 (4, 12) stay in Bucket A
- Increase directory depth to d=3

New Directory (2³ = 8 entries):
000 → Bucket A': [8, 16]
001 → Bucket B
010 → Bucket C
011 → Bucket D
100 → Bucket A: [4, 12]
101 → Bucket B
110 → Bucket C
111 → Bucket D
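A quick bit-arithmetic check of where each key lands after the split (note 16 = 10000₂, so its low three bits are 000):

```python
def last_bits(k, d):
    """Extendible hashing directories index on the low-order d bits of h(K);
    here h(K) = K for simplicity."""
    return k & ((1 << d) - 1)

keys = [4, 8, 12, 16]
# at depth 2 all four keys collide on bits 00 ...
assert {last_bits(k, 2) for k in keys} == {0}
# ... but depth 3 separates them into buckets 000 and 100
groups = {}
for k in keys:
    groups.setdefault(last_bits(k, 3), []).append(k)
# 000 -> [8, 16], 100 -> [4, 12]
```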

EXAMPLE 5: Distributed Query

Query:

SELECT *
FROM Orders O, Customer C
WHERE O.CustomerID = C.CustomerID
  AND C.City = 'Mumbai';

Data Distribution:

  • Orders: Horizontally partitioned by region (Site 1: North, Site 2: South)
  • Customer: Replicated at all sites

Strategy 1: Centralized

  1. Ship Orders from Site 1 to coordinator
  2. Ship Orders from Site 2 to coordinator
  3. Join at coordinator

Cost: Transfer all Orders (expensive!)

Strategy 2: Distributed Join

  1. Apply σ(City='Mumbai')(Customer) at each site
  2. Join locally at each site
  3. Union results

Cost: Transfer only Mumbai customers (small!)

Example Numbers:

  • Orders: 1M rows × 1KB = 1GB
  • Customers: 100K rows × 500B = 50MB
  • Mumbai Customers: 5K rows = 2.5MB

Cost Comparison:

  • Centralized: 1GB network transfer
  • Distributed: 2.5MB network transfer (400x less!)

EXAMPLE 6: Star Schema

Fact Table: FactSales

CREATE TABLE FactSales (
    SaleID INT PRIMARY KEY,
    DateKey INT,           -- FK to DimDate
    ProductKey INT,        -- FK to DimProduct
    CustomerKey INT,       -- FK to DimCustomer
    StoreKey INT,          -- FK to DimStore
    Quantity INT,
    TotalAmount DECIMAL(10,2)
);

Dimension Tables:

CREATE TABLE DimDate (
    DateKey INT PRIMARY KEY,
    Date DATE,
    Day INT,
    Month INT,
    Quarter INT,
    Year INT,
    DayOfWeek VARCHAR(10)
);

CREATE TABLE DimProduct (
    ProductKey INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Category VARCHAR(50),
    Brand VARCHAR(50),
    UnitPrice DECIMAL(10,2)
);

CREATE TABLE DimCustomer (
    CustomerKey INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    City VARCHAR(50),
    State VARCHAR(50),
    Country VARCHAR(50)
);

CREATE TABLE DimStore (
    StoreKey INT PRIMARY KEY,
    StoreName VARCHAR(100),
    City VARCHAR(50),
    Manager VARCHAR(100)
);

OLAP Query:

-- Total sales by quarter and product category
SELECT 
    D.Year,
    D.Quarter,
    P.Category,
    SUM(F.TotalAmount) AS TotalSales
FROM FactSales F
JOIN DimDate D ON F.DateKey = D.DateKey
JOIN DimProduct P ON F.ProductKey = P.ProductKey
WHERE D.Year = 2024
GROUP BY D.Year, D.Quarter, P.Category
ORDER BY D.Quarter, TotalSales DESC;

Roll-Up (Aggregate to Year):

SELECT 
    D.Year,
    P.Category,
    SUM(F.TotalAmount) AS TotalSales
FROM FactSales F
JOIN DimDate D ON F.DateKey = D.DateKey
JOIN DimProduct P ON F.ProductKey = P.ProductKey
WHERE D.Year = 2024
GROUP BY D.Year, P.Category;

Drill-Down (To Month):

SELECT 
    D.Year,
    D.Month,
    P.Category,
    SUM(F.TotalAmount) AS TotalSales
FROM FactSales F
JOIN DimDate D ON F.DateKey = D.DateKey
JOIN DimProduct P ON F.ProductKey = P.ProductKey
WHERE D.Year = 2024 AND D.Quarter = 1
GROUP BY D.Year, D.Month, P.Category;

EXAMPLE 7: Association Rule Mining

Transaction Database:

T1: {Bread, Milk, Eggs}
T2: {Bread, Butter}
T3: {Milk, Butter, Eggs}
T4: {Bread, Milk, Butter}
T5: {Bread, Milk, Eggs, Butter}

Apriori Algorithm (min_support=40%, min_confidence=60%):

Step 1: Find frequent 1-itemsets

{Bread}: 4/5 = 80% ✓
{Milk}: 4/5 = 80% ✓
{Butter}: 4/5 = 80% ✓
{Eggs}: 3/5 = 60% ✓

Step 2: Generate frequent 2-itemsets

{Bread, Milk}: 3/5 = 60% ✓
{Bread, Butter}: 3/5 = 60% ✓
{Milk, Butter}: 3/5 = 60% ✓
{Milk, Eggs}: 3/5 = 60% ✓
{Bread, Eggs}: 2/5 = 40% ✓
{Butter, Eggs}: 2/5 = 40% ✓

Step 3: Generate frequent 3-itemsets

{Bread, Milk, Butter}: 2/5 = 40% ✓
{Bread, Milk, Eggs}: 2/5 = 40% ✓
{Milk, Butter, Eggs}: 2/5 = 40% ✓

Step 4: Generate association rules

{Bread, Milk} ⇒ {Butter}: 
  Support = 40%, Confidence = 2/3 = 67% ✓

{Bread, Butter} ⇒ {Milk}:
  Support = 40%, Confidence = 2/3 = 67% ✓

{Milk, Butter} ⇒ {Bread}:
  Support = 40%, Confidence = 2/3 = 67% ✓

{Milk, Eggs} ⇒ {Bread}:
  Support = 40%, Confidence = 2/3 = 67% ✓
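These support and confidence figures can be verified mechanically over the five transactions (plain Python, no libraries):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Butter"},
    {"Milk", "Butter", "Eggs"},
    {"Bread", "Milk", "Butter"},
    {"Bread", "Milk", "Eggs", "Butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

items = {"Bread", "Milk", "Butter", "Eggs"}
frequent2 = [set(c) for c in combinations(sorted(items), 2)
             if support(set(c)) >= 0.4]
# all six pairs clear the 40% support threshold, as listed in Step 2
```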

Actionable Insight:

  • Place Bread, Milk, Butter close together
  • Bundle discount for combo purchase

EXAMPLE 8: Mobile Database Synchronization

Scenario: Sales app with offline capability

Local SQLite (Mobile):

CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID INTEGER,
    Amount DECIMAL(10,2),
    Status TEXT,
    CreatedAt DATETIME,
    ModifiedAt DATETIME,
    SyncStatus TEXT  -- 'pending', 'synced', 'conflict'
);

Server Database (PostgreSQL):

CREATE TABLE Orders (
    OrderID SERIAL PRIMARY KEY,
    CustomerID INTEGER,
    Amount DECIMAL(10,2),
    Status TEXT,
    CreatedAt TIMESTAMP,
    ModifiedAt TIMESTAMP,
    Version INTEGER  -- For conflict detection
);

Sync Process:

  1. Upload (Mobile → Server):
def sync_upload():
    pending = db.execute("SELECT * FROM Orders WHERE SyncStatus='pending'")
    for order in pending:
        try:
            # Send to server
            response = api.post('/orders/sync', order)
            if response.status == 'success':
                db.execute("UPDATE Orders SET SyncStatus='synced' WHERE OrderID=?", order.id)
            elif response.status == 'conflict':
                db.execute("UPDATE Orders SET SyncStatus='conflict' WHERE OrderID=?", order.id)
        except NetworkError:
            break  # Will retry later
  2. Download (Server → Mobile):
def sync_download():
    last_sync = get_last_sync_time()
    changes = api.get(f'/orders/changes?since={last_sync}')
    
    for change in changes:
        local = db.execute("SELECT * FROM Orders WHERE OrderID=?", change.id)
        
        if not local:
            # New record
            db.insert(change)
        elif local.ModifiedAt < change.ModifiedAt:
            # Server newer
            db.update(change)
        else:
            # Local newer (conflict)
            handle_conflict(local, change)

Conflict Resolution (Last Write Wins):

def handle_conflict(local, server):
    if server.ModifiedAt > local.ModifiedAt:
        # Server wins
        db.update(server)
    else:
        # Local wins, push again
        db.execute("UPDATE Orders SET SyncStatus='pending' WHERE OrderID=?", local.id)

EXAMPLE 9: NoSQL Data Modeling

Relational (Normalized):

CREATE TABLE Users (
    UserID INT PRIMARY KEY,
    Username VARCHAR(50),
    Email VARCHAR(100)
);

CREATE TABLE Posts (
    PostID INT PRIMARY KEY,
    UserID INT,
    Title VARCHAR(200),
    Content TEXT,
    CreatedAt DATETIME,
    FOREIGN KEY (UserID) REFERENCES Users(UserID)
);

CREATE TABLE Comments (
    CommentID INT PRIMARY KEY,
    PostID INT,
    UserID INT,
    Text TEXT,
    CreatedAt DATETIME,
    FOREIGN KEY (PostID) REFERENCES Posts(PostID)
);

Query (3 joins!):

SELECT U.Username, P.Title, C.Text
FROM Users U
JOIN Posts P ON U.UserID = P.UserID
JOIN Comments C ON P.PostID = C.PostID
WHERE U.UserID = 123;

NoSQL (MongoDB - Denormalized):

{
  "_id": ObjectId("..."),
  "username": "john_doe",
  "email": "john@example.com",
  "posts": [
    {
      "postId": 1,
      "title": "My First Post",
      "content": "Hello world!",
      "createdAt": ISODate("2024-01-15"),
      "comments": [
        {
          "commentId": 1,
          "userId": 456,
          "username": "jane_doe",
          "text": "Great post!",
          "createdAt": ISODate("2024-01-15")
        }
      ]
    }
  ]
}

Query (No joins!):

db.users.findOne({ _id: 123 })

Trade-offs:

  • Pros: Faster reads (no joins), natural document structure
  • Cons: Data duplication, update complexity (update username everywhere)

EXAMPLE 10: Cloud Database Auto-Scaling

AWS Aurora Configuration:

# Application Load
Time    Connections    CPU%    Action
08:00   100           20%     Normal (1 reader)
10:00   500           60%     Add reader (2 readers)
12:00   1500          80%     Add reader (3 readers)
14:00   2000          90%     Add reader (4 readers)
16:00   800           40%     Remove reader (3 readers)
20:00   200           25%     Remove reader (2 readers)
00:00   50            10%     Remove reader (1 reader)

Auto-Scaling Policy:

{
  "minCapacity": 1,
  "maxCapacity": 5,
  "scaleUpThreshold": {
    "cpu": 70,
    "connections": 1000
  },
  "scaleDownThreshold": {
    "cpu": 30,
    "connections": 200,
    "cooldownMinutes": 15
  }
}

Cost Optimization:

  • Peak hours (10:00-16:00): 4 instances × $0.50/hr = $2.00/hr × 6hrs = $12
  • Off-peak: 1-2 instances × $0.50/hr = ~$0.75/hr × 18hrs = $13.50
  • Daily Total: ~$25.50
  • vs Fixed 4 instances: $48/day
  • Savings: ~47%
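The daily arithmetic checks out; the $0.50/hr rate is the scenario's own assumption, and the off-peak figure averages the 1-2 instances running outside peak hours:

```python
rate = 0.50                      # assumed $/hr per instance
peak_cost = 4 * rate * 6         # 4 readers during the 6 peak hours
off_peak_cost = 1.5 * rate * 18  # averaging 1.5 instances off-peak
daily = peak_cost + off_peak_cost
fixed = 4 * rate * 24            # always-on 4-instance baseline
savings = 1 - daily / fixed
# daily ~$25.50 vs $48.00 fixed -> roughly 47% saved
```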

Real-World Use

PRACTICAL GUIDELINES

1. Query Optimization Tips:

Use EXPLAIN/EXPLAIN ANALYZE:

-- PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM Orders WHERE CustomerID = 123;

-- MySQL
EXPLAIN
SELECT * FROM Orders WHERE CustomerID = 123;

Output Analysis:

  • Seq Scan: Full table scan (BAD for large tables)
  • Index Scan: Using index (GOOD)
  • Index Only Scan: Only index, no table access (BEST)
  • Nested Loop: Good for small joins
  • Hash Join: Good for large equi-joins
  • Merge Join: Good for pre-sorted data

Common Problems and Fixes:

Problem: Full Table Scan

-- BAD
SELECT * FROM Orders WHERE YEAR(OrderDate) = 2024;
-- Function on column prevents index use

-- GOOD
SELECT * FROM Orders 
WHERE OrderDate >= '2024-01-01' AND OrderDate < '2025-01-01';
-- Index on OrderDate can be used

Problem: Type Mismatch

-- BAD
SELECT * FROM Orders WHERE OrderID = '123';  -- String for INT column

-- GOOD
SELECT * FROM Orders WHERE OrderID = 123;

Problem: OR Conditions

-- BAD
SELECT * FROM Orders 
WHERE Status = 'Pending' OR Status = 'Shipped';

-- GOOD
SELECT * FROM Orders 
WHERE Status IN ('Pending', 'Shipped');
-- Or even better with separate queries and UNION if selective

2. Indexing Strategy:

When to Create Index:

  • Columns in WHERE clause
  • Columns in JOIN conditions
  • Columns in ORDER BY/GROUP BY
  • Foreign keys

When NOT to Create Index:

  • Small tables (< 1000 rows)
  • Columns with low selectivity (e.g., Gender: M/F)
  • Frequently updated columns
  • Rarely queried columns

Index Types by Use Case:

-- B-tree (default): Range queries, sorted access
CREATE INDEX idx_orders_date ON Orders(OrderDate);

-- Hash: Exact match only (PostgreSQL)
CREATE INDEX idx_customers_email ON Customers USING HASH (Email);

-- Covering index: Includes all queried columns
CREATE INDEX idx_orders_covering ON Orders(CustomerID, OrderDate, TotalAmount);

-- Partial index: Only subset of rows
CREATE INDEX idx_active_orders ON Orders(OrderDate) 
WHERE Status = 'Active';

-- Composite index: Multiple columns
CREATE INDEX idx_orders_customer_date ON Orders(CustomerID, OrderDate);
-- Good for: WHERE CustomerID=? AND OrderDate=?
-- NOT for: WHERE OrderDate=? (wrong order!)

Index Maintenance:

-- PostgreSQL: Rebuild bloated index
REINDEX INDEX idx_orders_date;

-- MySQL: Rebuild table and indexes
OPTIMIZE TABLE Orders;

-- Check index usage
SELECT 
    schemaname, tablename, indexname, 
    idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0;  -- Never used!

3. Distributed Database Best Practices:

Data Partitioning:

-- PostgreSQL: Declarative partitioning
CREATE TABLE Orders (
    OrderID BIGINT,
    OrderDate DATE,
    ...
) PARTITION BY RANGE (OrderDate);

CREATE TABLE Orders_2023 PARTITION OF Orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE Orders_2024 PARTITION OF Orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

Benefits:

  • Query only relevant partitions
  • Drop old partitions easily
  • Better manageability
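The routing logic behind range partitioning can be sketched in plain Python; the partition names and bounds mirror the illustrative DDL above (lower bound inclusive, upper bound exclusive, as in PostgreSQL):

```python
from datetime import date

# Illustrative partition map mirroring FOR VALUES FROM ... TO ... above
PARTITIONS = [
    ("Orders_2023", date(2023, 1, 1), date(2024, 1, 1)),
    ("Orders_2024", date(2024, 1, 1), date(2025, 1, 1)),
]

def route(order_date):
    # Pick the partition whose half-open range contains the date
    for name, lo, hi in PARTITIONS:
        if lo <= order_date < hi:
            return name
    raise ValueError("no partition for " + order_date.isoformat())

print(route(date(2023, 6, 15)))  # Orders_2023
print(route(date(2024, 1, 1)))   # Orders_2024 (upper bound of 2023 is exclusive)
```

A query with a WHERE clause on OrderDate can be pruned to just these partitions, which is where the speedup comes from.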

Replication Configuration:

-- PostgreSQL: Streaming replication
-- On primary (both settings take effect after a server restart)
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 5;

-- On replica (settings in postgresql.conf; PostgreSQL 12+ marks a replica with a standby.signal file)
primary_conninfo = 'host=primary_host port=5432 user=replicator'
hot_standby = on

Load Balancing:

# Application-level read/write splitting
def get_db_connection(readonly=False):
    if readonly:
        return connect_to_replica()  # Read from replica
    else:
        return connect_to_primary()  # Write to primary

# Usage
with get_db_connection(readonly=True) as conn:
    results = conn.execute("SELECT ...")  # Read query

with get_db_connection(readonly=False) as conn:
    conn.execute("UPDATE ...")  # Write query
    conn.commit()

4. Data Warehouse Development:

ETL Best Practices:

Incremental Load (Better than Full Load):

-- Track last processed timestamp
CREATE TABLE ETL_Metadata (
    TableName VARCHAR(50) PRIMARY KEY,
    LastLoadTime TIMESTAMP
);

-- Extract only new/modified records
INSERT INTO DW_Orders
SELECT *
FROM OLTP_Orders
WHERE ModifiedAt > (
    SELECT LastLoadTime 
    FROM ETL_Metadata 
    WHERE TableName = 'Orders'
);

-- Update metadata
UPDATE ETL_Metadata
SET LastLoadTime = CURRENT_TIMESTAMP
WHERE TableName = 'Orders';
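A minimal sketch of this watermark-based incremental load, using sqlite3 to stand in for both the OLTP source and the warehouse (all names and timestamps are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE OLTP_Orders (OrderID INTEGER, ModifiedAt TEXT);
CREATE TABLE DW_Orders   (OrderID INTEGER, ModifiedAt TEXT);
CREATE TABLE ETL_Metadata (TableName TEXT PRIMARY KEY, LastLoadTime TEXT);
INSERT INTO ETL_Metadata VALUES ('Orders', '2024-01-01T00:00:00');
INSERT INTO OLTP_Orders VALUES (1, '2023-12-31T09:00:00');  -- already loaded
INSERT INTO OLTP_Orders VALUES (2, '2024-01-02T10:00:00');  -- new since watermark
""")

def incremental_load(now):
    # Extract only rows modified after the stored watermark
    db.execute("""
        INSERT INTO DW_Orders
        SELECT * FROM OLTP_Orders
        WHERE ModifiedAt > (SELECT LastLoadTime FROM ETL_Metadata
                            WHERE TableName = 'Orders')""")
    # Advance the watermark so the next run skips these rows
    db.execute("UPDATE ETL_Metadata SET LastLoadTime = ? WHERE TableName = 'Orders'",
               (now,))

incremental_load('2024-01-03T00:00:00')
loaded = db.execute("SELECT OrderID FROM DW_Orders").fetchall()
print(loaded)  # only the order modified after the watermark crossed over
```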

Slowly Changing Dimensions (SCD Type 2):

CREATE TABLE DimCustomer (
    CustomerKey INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- Surrogate key (auto-generated)
    CustomerID INT,                -- Business key
    CustomerName VARCHAR(100),
    City VARCHAR(50),
    ValidFrom DATE,
    ValidTo DATE,
    IsCurrent BOOLEAN
);

-- When customer moves to new city
-- Old record
UPDATE DimCustomer
SET ValidTo = CURRENT_DATE, IsCurrent = FALSE
WHERE CustomerID = 123 AND IsCurrent = TRUE;

-- New record
INSERT INTO DimCustomer (CustomerID, CustomerName, City, ValidFrom, IsCurrent)
VALUES (123, 'John Doe', 'New York', CURRENT_DATE, TRUE);

-- Now fact table maintains history
-- FactSales joins to CustomerKey (not CustomerID)
-- Can report sales by customer's city at time of sale
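The same Type 2 logic can be exercised end to end with sqlite3; the surrogate key is auto-generated here and the data values are made up:

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("""CREATE TABLE DimCustomer (
    CustomerKey INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    CustomerID INTEGER, City TEXT,
    ValidFrom TEXT, ValidTo TEXT, IsCurrent INTEGER)""")
dw.execute("INSERT INTO DimCustomer (CustomerID, City, ValidFrom, IsCurrent) "
           "VALUES (123, 'Boston', '2020-01-01', 1)")

def scd2_move(customer_id, new_city, today):
    # Close out the current row, then open a new current row
    dw.execute("UPDATE DimCustomer SET ValidTo = ?, IsCurrent = 0 "
               "WHERE CustomerID = ? AND IsCurrent = 1", (today, customer_id))
    dw.execute("INSERT INTO DimCustomer (CustomerID, City, ValidFrom, IsCurrent) "
               "VALUES (?, ?, ?, 1)", (customer_id, new_city, today))

scd2_move(123, 'New York', '2024-06-01')
rows = dw.execute("SELECT City, IsCurrent FROM DimCustomer "
                  "ORDER BY CustomerKey").fetchall()
print(rows)  # old Boston row closed, New York row current; history preserved
```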

Aggregation Tables (OLAP Cubes):

-- Pre-aggregate for common queries
CREATE TABLE AggSalesByMonth AS
SELECT 
    DATE_TRUNC('month', OrderDate) AS Month,
    ProductID,
    SUM(Quantity) AS TotalQuantity,
    SUM(TotalAmount) AS TotalRevenue
FROM FactSales
GROUP BY Month, ProductID;

CREATE INDEX idx_agg_month ON AggSalesByMonth(Month, ProductID);

-- Query aggregate (much faster!)
SELECT * FROM AggSalesByMonth
WHERE Month = '2024-01-01';

5. Mobile Database Patterns:

Offline-First Architecture:

// React Native + SQLite (react-native-sqlite-storage, promise API)
import SQLite from 'react-native-sqlite-storage';
import NetInfo from '@react-native-community/netinfo';

SQLite.enablePromise(true);

class OfflineDB {
  async init() {
    this.db = await SQLite.openDatabase({ name: 'app.db' });
  }
  
  async createOrder(order) {
    // Always save locally first
    const [result] = await this.db.executeSql(
      'INSERT INTO Orders (data, syncStatus) VALUES (?, ?)',
      [JSON.stringify(order), 'pending']
    );
    
    // Try to sync immediately if online
    if (await this.isOnline()) {
      this.syncPendingOrders();
    }
    
    return result.insertId;
  }
  
  async syncPendingOrders() {
    const [pending] = await this.db.executeSql(
      'SELECT * FROM Orders WHERE syncStatus = ?',
      ['pending']
    );
    
    for (const order of pending.rows.raw()) {
      try {
        const response = await fetch('/api/orders', {
          method: 'POST',
          body: order.data
        });
        
        if (response.ok) {
          await this.db.executeSql(
            'UPDATE Orders SET syncStatus = ? WHERE id = ?',
            ['synced', order.id]
          );
        }
      } catch (error) {
        // Network error, will retry later
        console.log('Sync failed, will retry');
      }
    }
  }
  
  async isOnline() {
    // navigator.onLine is unreliable in React Native; use NetInfo instead
    const state = await NetInfo.fetch();
    return state.isConnected;
  }
}

// Periodic background sync
const db = new OfflineDB();
db.init().then(() => {
  setInterval(async () => {
    if (await db.isOnline()) {
      db.syncPendingOrders();
    }
  }, 60000);  // Every minute
});

6. Performance Monitoring:

Key Metrics:

-- PostgreSQL: Slow queries (needs the pg_stat_statements extension;
-- on PostgreSQL 13+ the columns are total_exec_time, min_exec_time, max_exec_time)
SELECT 
    query,
    calls,
    total_time / calls AS avg_time,
    min_time,
    max_time
FROM pg_stat_statements
WHERE total_time / calls > 1000  -- > 1 second avg (times are in milliseconds)
ORDER BY total_time DESC
LIMIT 10;

-- Cache hit ratio (should be > 90%)
SELECT 
    sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) AS cache_hit_ratio
FROM pg_stat_database;

-- Index usage
SELECT 
    schemaname, tablename,
    idx_scan,  -- Index scans
    seq_scan,  -- Sequential scans
    idx_scan / (seq_scan + idx_scan + 0.001) AS index_usage_ratio
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;

-- Largest tables (starting point for a bloat check)
SELECT 
    schemaname, tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

7. Cloud Database Cost Optimization:

Strategies:

Use Read Replicas for Read-Heavy Workloads:

Primary (Expensive): Writes only
Replica 1, 2, 3 (Cheaper): Distribute reads

Schedule Non-Production Databases:

# Stop dev/test databases outside business hours
aws rds stop-db-instance --db-instance-identifier dev-db

# Schedule with Lambda + CloudWatch Events
# Save ~60% on dev/test costs

Right-Size Instances:

# Monitor actual usage (a time window and statistic are required)
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name CPUUtilization \
    --dimensions Name=DBInstanceIdentifier,Value=my-db \
    --statistics Average \
    --period 3600 \
    --start-time 2024-01-01T00:00:00Z \
    --end-time 2024-01-08T00:00:00Z

# If consistently < 40%, downgrade instance
# db.m5.xlarge ($280/mo) → db.m5.large ($140/mo)

Use Reserved Instances:

On-Demand: $0.50/hr = $360/mo
1-Year Reserved: $0.30/hr = $216/mo (40% savings)
3-Year Reserved: $0.20/hr = $144/mo (60% savings)
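A quick check of the arithmetic above, assuming a 720-hour month (30 days × 24 h):

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month

def monthly(rate_per_hour):
    return rate_per_hour * HOURS_PER_MONTH

on_demand  = monthly(0.50)               # 360.0
one_year   = monthly(0.30)               # 216.0
three_year = monthly(0.20)               # 144.0
savings_1y = 1 - one_year / on_demand    # 40% savings
savings_3y = 1 - three_year / on_demand  # 60% savings
print(on_demand, one_year, three_year, savings_1y, savings_3y)
```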

Archive Old Data:

-- Move old data to cheaper storage (S3)
-- Orders > 2 years old
COPY (
    SELECT * FROM Orders 
    WHERE OrderDate < CURRENT_DATE - INTERVAL '2 years'
) TO '/tmp/old_orders.csv' WITH (FORMAT csv, HEADER);

-- Upload to S3, then delete from DB
DELETE FROM Orders 
WHERE OrderDate < CURRENT_DATE - INTERVAL '2 years';

-- Query old data from S3 (AWS Athena) when needed

For exams

IMPORTANT EXAM QUESTIONS

Query Processing:

  1. Explain the steps involved in query processing.

    • Parsing: Check syntax, validate schema
    • Translation: Convert SQL to relational algebra
    • Optimization: Generate and evaluate execution plans
    • Evaluation: Execute chosen plan
  2. What is query optimization? Why is it important?

    • Finding efficient execution plan among alternatives
    • Same query can have vastly different costs (orders of magnitude)
    • Critical for performance at scale
  3. Compare selection algorithms (A1-A5). When is each used?

    • A1 (Linear): No index, unsorted
    • A2 (Binary): Sorted file
    • A3 (Primary Index): Primary key equality
    • A4 (Range): Range query with index
    • A5 (Secondary): Secondary key query
  4. Explain external sort-merge algorithm with an example.

    • Pass 0: Create sorted runs of size M
    • Pass 1+: Merge M-1 runs at a time
    • Cost: 2br × (number of passes)
    • Example with specific numbers
  5. Compare join algorithms: Nested-Loop, Sort-Merge, Hash Join.

    • Nested-Loop: Simple, expensive (worst case nr × bs + br block transfers)
    • Sort-Merge: Good for sorted/large joins
    • Hash Join: Best for equi-joins, most common in practice
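Since hash join is the workhorse in practice, here is a minimal in-memory sketch: build a hash table on the smaller input, probe with the larger (column names are illustrative):

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join: hash the build input, then probe it row by row."""
    table = defaultdict(list)
    for row in build_rows:                  # build phase
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:                  # probe phase
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

customers = [{"CustomerID": 1, "Name": "Ada"}, {"CustomerID": 2, "Name": "Bob"}]
orders = [{"OrderID": 10, "CustomerID": 1}, {"OrderID": 11, "CustomerID": 1},
          {"OrderID": 12, "CustomerID": 3}]
joined = hash_join(customers, orders, "CustomerID", "CustomerID")
print(joined)  # orders 10 and 11 match Ada; order 12 has no matching customer
```

Real systems partition both inputs to disk when the build side exceeds memory (grace hash join), but the build/probe idea is the same.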

Indexing:

  1. What is indexing? Compare B-tree and Hash indexes.

    • Index: Data structure for fast retrieval
    • B-tree: Range queries, sorted access, O(log n)
    • Hash: Exact match only, O(1), no ordering
  2. Explain B+-tree structure and operations.

    • All data in leaves
    • Leaves linked for range scans
    • Insert: May cause splits
    • Delete: May cause merges/redistribution
  3. Why are B+-trees preferred over B-trees in databases?

    • Higher fanout (more keys per node)
    • Efficient range scans (leaf links)
    • All data in leaves (simpler)
  4. Explain extendible hashing with an example.

    • Dynamic hash structure
    • Directory grows/shrinks with d bits
    • Buckets split on overflow
    • No performance degradation with growth
  5. When would you use a hash index vs B-tree index?

    • Hash: Primary key lookups, equality predicates
    • B-tree: Range queries, ORDER BY, default choice
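The trade-off can be illustrated with stdlib stand-ins: a sorted list plus bisect plays the role of a B-tree (ordered, range-scannable), while a dict plays the hash index (equality only):

```python
import bisect

keys = sorted([17, 3, 42, 8, 25])           # "B-tree": keys kept in order
hash_index = {k: f"row-{k}" for k in keys}  # "hash index": equality lookups

def range_scan(lo, hi):
    # O(log n) to find the start, then walk the ordered keys
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return keys[i:j]

print(range_scan(5, 30))   # [8, 17, 25] -- easy when keys are ordered
print(hash_index.get(25))  # row-25      -- O(1) equality lookup
# A range query against the hash index would have to inspect every key.
```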

Distributed Databases:

  1. What is a distributed database? What are its advantages?

    • Database across multiple sites
    • Advantages: Autonomy, performance, reliability, scalability
  2. Explain data fragmentation with examples.

    • Horizontal: Split rows (geographical partitioning)
    • Vertical: Split columns (separate sensitive data)
    • Hybrid: Both
  3. What is the difference between replication and fragmentation?

    • Fragmentation: Split data (no duplicates)
    • Replication: Copy data (duplicates)
    • Can be combined
  4. Explain Two-Phase Commit protocol.

    • Phase 1 (Prepare): Coordinator asks participants to prepare
    • Phase 2 (Commit/Abort): Coordinator decides based on votes
    • Ensures atomicity in distributed transactions
  5. What is CAP theorem? Explain with examples.

    • Consistency, Availability, Partition tolerance
    • Can achieve at most 2 of 3
    • SQL: CA, NoSQL: AP or CP
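A toy coordinator for Two-Phase Commit, modeling only the message flow (no failures, timeouts, or logging):

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "active"

    def prepare(self):            # Phase 1: vote yes/no
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, decision):   # Phase 2: obey the coordinator
        self.state = decision

def two_phase_commit(participants):
    # Phase 1 (Prepare): every participant must vote yes
    votes = [p.prepare() for p in participants]
    decision = "committed" if all(votes) else "aborted"
    # Phase 2 (Commit/Abort): broadcast the global decision
    for p in participants:
        p.finish(decision)
    return decision

ok = two_phase_commit([Participant("A"), Participant("B")])
bad = two_phase_commit([Participant("A"), Participant("B", will_commit=False)])
print(ok, bad)  # committed aborted -- one 'no' vote aborts everyone
```

The real protocol also forces participants to log their vote durably before replying, which is what makes the outcome atomic across crashes.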

Data Warehousing:

  1. What is a data warehouse? How does it differ from OLTP?

    • Subject-oriented, integrated, time-variant, non-volatile
    • OLTP: Current data, transactional, normalized
    • DW: Historical data, analytical, denormalized
  2. Explain star schema with an example.

    • Fact table (center): Measures
    • Dimension tables (points): Context
    • Simple structure, denormalized
  3. What is the difference between star and snowflake schema?

    • Star: Denormalized dimensions (simple, redundant)
    • Snowflake: Normalized dimensions (complex, less redundant)
  4. Explain OLAP operations: Roll-up, Drill-down, Slice, Dice.

    • Roll-up: Aggregate to higher level (month → year)
    • Drill-down: Detail to lower level (year → month)
    • Slice: Fix one dimension
    • Dice: Fix multiple dimensions
  5. What is Slowly Changing Dimension? Explain Type 2 SCD.

    • Dimensions that change over time
    • Type 2: Keep history with validity dates
    • New row for each change, maintains full history
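Roll-up (month to year) can be sketched over a tiny in-memory fact table; the figures are made up:

```python
from collections import defaultdict

# (year, month) -> sales amount
monthly_sales = {("2023", "01"): 100, ("2023", "02"): 150, ("2024", "01"): 200}

def roll_up(by_month):
    # Aggregate away the month level, keeping the year level
    by_year = defaultdict(int)
    for (year, _month), amount in by_month.items():
        by_year[year] += amount
    return dict(by_year)

yearly = roll_up(monthly_sales)
print(yearly)  # {'2023': 250, '2024': 200}
# Drill-down is the inverse: return to monthly_sales for the detail rows.
```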

Data Mining:

  1. What is data mining? Name major techniques.

    • Discovering patterns in large datasets
    • Classification, Clustering, Association Rules, Regression, Anomaly Detection
  2. Explain association rule mining with an example.

    • Market basket analysis: {A, B} ⇒ {C}
    • Support: Frequency of itemset
    • Confidence: Conditional probability
    • Apriori algorithm
  3. What is the difference between classification and clustering?

    • Classification: Supervised (predefined classes)
    • Clustering: Unsupervised (discover groupings)
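Support and confidence for a rule like {bread} ⇒ {butter} can be computed directly over a toy set of baskets (made-up data):

```python
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of baskets containing every item in the set
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    # P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))         # 2 of 4 baskets = 0.5
print(confidence({"bread"}, {"butter"}))  # 2 of the 3 bread baskets
```

Apriori scales this idea up by only counting itemsets whose subsets already meet the support threshold.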

Advanced Topics:

  1. What is MVCC? How does it differ from locking?

    • Multi-Version Concurrency Control
    • Keeps multiple versions of data
    • Readers don't block writers
    • Used in PostgreSQL, Oracle
  2. Explain NoSQL databases. When are they preferred over SQL?

    • Non-relational, schema-less
    • Types: Document, Key-Value, Column-family, Graph
    • Preferred for: Scale, flexibility, specific access patterns
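A toy version-chain store illustrating why MVCC readers never block writers; timestamps here are simple counters rather than real transaction IDs:

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.ts = 0

    def write(self, key, value):
        self.ts += 1        # each write commits a new version
        self.versions.setdefault(key, []).append((self.ts, value))
        return self.ts

    def read(self, key, snapshot_ts):
        # Reader never blocks: it finds the newest version <= its snapshot
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

store = MVCCStore()
snapshot = store.write("x", "old")  # a long-running reader snapshots here
store.write("x", "new")             # a writer updates x concurrently
print(store.read("x", snapshot))    # old -- the reader's view stays stable
print(store.read("x", store.ts))    # new -- a fresh snapshot sees the update
```

PostgreSQL's implementation additionally needs VACUUM to reclaim versions no active snapshot can still see.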

QUICK REVISION

Query Processing:

  • Parse → Optimize → Execute
  • Cost-based optimization
  • Join algorithms: Nested-Loop < Sort-Merge < Hash

Indexing:

  • B+-tree: Default, range queries
  • Hash: Equality only
  • Trade-off: Fast reads vs slow writes

B+-tree:

  • Balanced, O(log n)
  • All data in leaves
  • Leaves linked
  • Insert/Delete may split/merge

Distributed DB:

  • Fragmentation: Split data
  • Replication: Copy data
  • 2PC: Atomic distributed transactions
  • CAP: Pick 2 of 3

Data Warehouse:

  • Subject-oriented, integrated, time-variant
  • Star schema: Fact + Dimensions
  • OLAP: Roll-up, Drill-down, Slice, Dice
  • ETL: Extract, Transform, Load

Data Mining:

  • Association: Market basket
  • Classification: Supervised
  • Clustering: Unsupervised

Advanced:

  • MVCC: Multiple versions
  • NoSQL: Scale, flexibility
  • Mobile: Offline-first, sync

Key points

KEY TAKEAWAYS

Query optimization can improve performance by orders of magnitude: Same query, different plans, vastly different costs

Hash join is most common in practice: Efficient for equi-joins, used by most query optimizers

B+-tree is the default index: Versatile, supports range queries, efficient for most workloads

Push selections and projections down: Reduce intermediate result sizes early in execution

Indexing trades read speed for write speed: Create indexes on frequently queried columns, avoid over-indexing

Distributed databases offer scalability and reliability: But add complexity in query processing and transaction management

CAP theorem guides distributed system design: Can't have all three: Consistency, Availability, Partition tolerance

Data warehouses are optimized for analytics: Denormalized, historical, read-heavy workload

Star schema simplifies OLAP: Fact table + dimension tables, intuitive for business users

ETL is critical for data quality: Extract, Transform, Load - transformation is where data cleaning happens

Data mining discovers hidden patterns: Classification, clustering, association rules enable data-driven decisions

Mobile databases require offline-first design: Local storage + synchronization for reliable mobile apps

NoSQL trades ACID for scale: Different models (document, key-value, graph) for different access patterns

MVCC enables high concurrency: Readers don't block writers, used by PostgreSQL and Oracle

Cloud databases offer elasticity: Auto-scaling, managed services, but watch costs and vendor lock-in

EXPLAIN is your best friend: Always analyze query plans before optimizing

REMEMBER: Modern database systems are complex ecosystems. Understanding query processing, indexing strategies, and advanced architectures (distributed, warehouse, NoSQL) is essential for building scalable, high-performance applications. Always profile before optimizing, and choose the right tool for the job!