Advanced Topics
Relational databases dominated for decades, but new challenges emerged: How to store complex objects? How to analyze petabytes of data? How to handle unstructured data? Advanced database technologies—object-oriented databases, data warehousing, NoSQL, and distributed systems—evolved to address these needs.
The Basics
Advanced database topics extend beyond traditional relational databases to handle complex data types, massive-scale analytics, and modern application requirements.
Key Motivations:
- Complex Data: Multimedia, CAD, scientific simulations
- Big Data: Petabytes of data, distributed processing
- Performance: Real-time analytics, low-latency access
- Scalability: Horizontal scaling, cloud-native
- Flexibility: Schema evolution, heterogeneous data
Technical Details
OBJECT-ORIENTED DATABASES (OODBMS):
Motivation: Impedance mismatch between OOP languages and relational databases
Relational Approach:
class Employee {
int id;
String name;
Address address; // Complex object
}
// Must decompose into multiple tables
Employees(id, name, address_id)
Addresses(address_id, street, city, zip)
// Retrieve requires joins, complex mapping
OODBMS Approach:
Store objects directly, no decomposition
Query using object-oriented syntax
Features:
-
Complex Objects:
- Nested objects
- Collections (lists, sets)
- References between objects
-
Object Identity (OID):
- Each object has unique identifier
- Independent of attribute values
- Persists across sessions
-
Encapsulation:
- Methods stored with data
- Behavior + data together
-
Inheritance:
- Class hierarchies
- Subclasses inherit from superclasses
-
Polymorphism:
- Method overriding
- Dynamic binding
OODBMS Examples:
- db4o (Java, .NET)
- ObjectDB (Java)
- Versant
Object-Relational Databases (ORDBMS):
Hybrid: Relational model + OO features
PostgreSQL Example:
-- User-defined types
CREATE TYPE Address AS (
street VARCHAR(100),
city VARCHAR(50),
zip VARCHAR(10)
);
CREATE TABLE Employee (
id INT PRIMARY KEY,
name VARCHAR(100),
address Address -- Complex type
);
INSERT INTO Employee VALUES (1, 'John', ROW('123 Main St', 'NYC', '10001'));
SELECT name, (address).city FROM Employee;
Advantages:
- Natural mapping from OOP
- No joins for complex objects
- Better performance for complex data
Disadvantages:
- Less mature than RDBMS
- Smaller ecosystem
- No standard query language (unlike SQL)
DATA WAREHOUSING:
Definition: Subject-oriented, integrated, time-variant, non-volatile collection for decision support
Characteristics:
- Subject-Oriented: Organized by business area (Sales, Finance)
- Integrated: Data from multiple sources
- Time-Variant: Historical data, snapshots
- Non-Volatile: Read-only, no updates
Architecture:
Source Systems (OLTP)
↓
ETL (Extract, Transform, Load)
↓
Data Warehouse
↓
Data Marts (department-specific)
↓
OLAP / BI Tools
OLTP vs OLAP:
OLTP (Online Transaction Processing):
- Current data
- Detailed records
- Read/Write
- Normalized (3NF)
- Fast transactions
- Example: Bank account balance
OLAP (Online Analytical Processing):
- Historical data
- Aggregated summaries
- Read-mostly
- Denormalized (star schema)
- Complex queries
- Example: Quarterly sales analysis
Dimensional Modeling:
Star Schema:
Dimension: Time
|
|
Dimension -- Fact Table -- Dimension
(Product) (Sales) (Customer)
|
|
Dimension: Store
Fact Table: Measures (sales amount, quantity)
Dimension Tables: Context (who, what, when, where)
Snowflake Schema:
Normalized dimensions (dimension tables have sub-dimensions)
OLAP Operations:
-
Roll-Up: Aggregate to higher level
Daily sales → Monthly sales -
Drill-Down: Detail to lower level
Yearly sales → Quarterly sales -
Slice: Fix one dimension
Sales in Q1 2024 -
Dice: Fix multiple dimensions
Sales in Q1 2024, Region=West, Product=Laptop -
Pivot: Rotate view
Rows ↔ Columns
DATA MINING:
Definition: Discovering patterns, correlations, trends in large datasets
Knowledge Discovery Process (KDD):
- Data Cleaning: Remove noise, outliers
- Data Integration: Combine sources
- Data Selection: Choose relevant data
- Data Transformation: Normalize, aggregate
- Data Mining: Apply algorithms
- Pattern Evaluation: Identify interesting patterns
- Knowledge Presentation: Visualize, report
Data Mining Techniques:
-
Association Rules:
Market Basket Analysis
{Bread, Butter} ⇒ {Milk} (support=30%, confidence=80%)
Support: % transactions containing itemset
Confidence: % transactions with antecedent that also have consequent -
Classification:
Assign items to predefined categories
Decision Trees, Neural Networks, SVM
Example: Spam vs Not Spam email -
Clustering:
Group similar items (no predefined categories)
K-Means, Hierarchical Clustering
Example: Customer segmentation -
Regression:
Predict continuous values
Linear Regression, Polynomial Regression
Example: Predict house price -
Anomaly Detection:
Identify outliers
Example: Fraud detection -
Sequential Patterns:
Find patterns over time
Example: Web clickstream analysis
DISTRIBUTED DATABASES:
Data stored across multiple locations
Types:
- Homogeneous: Same DBMS at all sites
- Heterogeneous: Different DBMS at different sites
Fragmentation:
-
Horizontal: Split rows
West_Customers (CA, WA, OR)
East_Customers (NY, MA, PA) -
Vertical: Split columns
Employee_Basic (id, name, dept)
Employee_Payroll (id, salary, bonus) -
Hybrid: Both
Replication:
- Full: Complete copy at each site
- Partial: Selected fragments at sites
Advantages:
- Improved performance (data locality)
- Increased availability (redundancy)
- Scalability
Challenges:
- Distributed query processing
- Distributed transactions (2PC)
- Consistency vs Availability (CAP theorem)
NOSQL DATABASES:
Motivation: Scalability, flexibility, performance for web-scale applications
Types:
-
Key-Value: Redis, DynamoDB
Simple: key → value
Fast lookups
Use case: Session storage, caching -
Document: MongoDB, CouchDB
Store JSON/BSON documents
Schema-less
Use case: Content management, catalogs -
Column-Family: Cassandra, HBase
Wide columns, billions of columns
Distributed, high write throughput
Use case: Time-series, logs -
Graph: Neo4j, Amazon Neptune
Nodes and edges
Relationships are first-class
Use case: Social networks, recommendations
CAP Theorem:
Can have at most 2 of 3:
- Consistency: All nodes see same data
- Availability: Every request gets response
- Partition Tolerance: Works despite network splits
Relational: CA (sacrifice partition tolerance)
NoSQL: Often AP or CP
Examples
Example 1: OODBMS (db4o - Java)
// Define classes
class Employee {
String name;
Address address;
List<Project> projects;
}
class Address {
String street;
String city;
}
// Store object directly
Employee emp = new Employee();
emp.name = "John";
emp.address = new Address("123 Main", "NYC");
db.store(emp); // Stores entire object graph
// Query
List<Employee> result = db.query(new Predicate<Employee>() {
public boolean match(Employee emp) {
return emp.address.city.equals("NYC");
}
});
Example 2: Data Warehouse Star Schema (SQL)
CREATE TABLE FactSales (
sale_id INT PRIMARY KEY,
date_id INT,
product_id INT,
customer_id INT,
store_id INT,
quantity INT,
amount DECIMAL(10,2),
FOREIGN KEY (date_id) REFERENCES DimDate(date_id),
FOREIGN KEY (product_id) REFERENCES DimProduct(product_id),
FOREIGN KEY (customer_id) REFERENCES DimCustomer(customer_id),
FOREIGN KEY (store_id) REFERENCES DimStore(store_id)
);
CREATE TABLE DimDate (
date_id INT PRIMARY KEY,
date DATE,
day INT,
month INT,
quarter INT,
year INT
);
CREATE TABLE DimProduct (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
brand VARCHAR(50)
);
-- Analytical Query: Quarterly sales by product category
SELECT d.quarter, p.category, SUM(f.amount) AS total_sales
FROM FactSales f
JOIN DimDate d ON f.date_id = d.date_id
JOIN DimProduct p ON f.product_id = p.product_id
WHERE d.year = 2024
GROUP BY d.quarter, p.category
ORDER BY d.quarter, total_sales DESC;
Example 3: NoSQL (MongoDB)
// Document database - store complex objects
db.employees.insertOne({
_id: 1,
name: "John Doe",
address: {
street: "123 Main St",
city: "NYC",
zip: "10001"
},
projects: [
{ name: "Project A", role: "Lead" },
{ name: "Project B", role: "Developer" }
],
skills: ["Java", "Python", "SQL"]
});
// Query nested fields
db.employees.find({ "address.city": "NYC" });
// Query arrays
db.employees.find({ skills: "Python" });
Example 4: Data Mining - Association Rules (Python)
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd
Transaction data
transactions = [
['Bread', 'Milk', 'Butter'],
['Bread', 'Butter'],
['Milk', 'Butter', 'Eggs'],
['Bread', 'Milk', 'Butter', 'Eggs'],
['Bread', 'Milk']
]
Find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])
Output: {Bread, Butter} => {Milk} (confidence=0.8)
Real-World Use
When to Use Each Technology:
-
Relational Databases (PostgreSQL, MySQL):
- Structured data
- ACID transactions required
- Complex queries, joins
- Examples: Banking, e-commerce orders
-
OODBMS / ORDBMS:
- Complex nested objects
- Multimedia, CAD, scientific data
- Tight OOP integration
- Examples: GIS, engineering simulations
-
Data Warehouses:
- Historical analysis
- Business intelligence
- Reporting, dashboards
- Examples: Sales analytics, financial reporting
-
NoSQL:
Document (MongoDB): Flexible schema, content management
Key-Value (Redis): Caching, session storage
Column-Family (Cassandra): Time-series, logs, high write throughput
Graph (Neo4j): Social networks, recommendations, fraud detection -
Distributed Databases:
- Global applications
- High availability requirements
- Scale beyond single server
- Examples: Multi-region applications
Modern Architecture Trends:
- Polyglot Persistence: Use right database for each use case
- Lambda Architecture: Batch + Stream processing
- Data Lakes: Store raw data, process on-demand
- Cloud Data Warehouses: Snowflake, BigQuery, Redshift
For exams
Key Questions:
- Compare relational databases with object-oriented databases
- What is impedance mismatch? How does OODBMS solve it?
- Explain star schema and snowflake schema
- What is the difference between OLTP and OLAP?
- Describe the ETL process in data warehousing
- Explain OLAP operations: roll-up, drill-down, slice, dice, pivot
- What is data mining? List and explain 3 techniques
- Explain association rule mining with example
- What is the CAP theorem? How does it relate to NoSQL?
- Compare document, key-value, column-family, and graph databases
- What is horizontal vs vertical fragmentation in distributed databases?
- When would you choose NoSQL over relational database?
Key points
Core Concepts:
• OODBMS stores objects directly, eliminating impedance mismatch
• Data warehouses support analytics with denormalized star/snowflake schemas
• OLTP focuses on transactions, OLAP on analysis
• Data mining discovers patterns: association rules, classification, clustering
• NoSQL provides scalability and flexibility for specific use cases
• CAP theorem: Can't have consistency, availability, and partition tolerance together
• Distributed databases provide scalability through fragmentation and replication
• Modern applications use polyglot persistence: right database for each use case