Advanced Topics

Unit 5CLO01

Relational databases dominated for decades, but new challenges emerged: How to store complex objects? How to analyze petabytes of data? How to handle unstructured data? Advanced database technologies—object-oriented databases, data warehousing, NoSQL, and distributed systems—evolved to address these needs.

The Basics

Advanced database topics extend beyond traditional relational databases to handle complex data types, massive-scale analytics, and modern application requirements.

Key Motivations:

  1. Complex Data: Multimedia, CAD, scientific simulations
  2. Big Data: Petabytes of data, distributed processing
  3. Performance: Real-time analytics, low-latency access
  4. Scalability: Horizontal scaling, cloud-native
  5. Flexibility: Schema evolution, heterogeneous data

Technical Details

OBJECT-ORIENTED DATABASES (OODBMS):

Motivation: Impedance mismatch between OOP languages and relational databases

Relational Approach:
class Employee {
int id;
String name;
Address address; // Complex object
}

// Must decompose into multiple tables
Employees(id, name, address_id)
Addresses(address_id, street, city, zip)

// Retrieve requires joins, complex mapping

OODBMS Approach:
Store objects directly, no decomposition
Query using object-oriented syntax

Features:

  1. Complex Objects:

    • Nested objects
    • Collections (lists, sets)
    • References between objects
  2. Object Identity (OID):

    • Each object has unique identifier
    • Independent of attribute values
    • Persists across sessions
  3. Encapsulation:

    • Methods stored with data
    • Behavior + data together
  4. Inheritance:

    • Class hierarchies
    • Subclasses inherit from superclasses
  5. Polymorphism:

    • Method overriding
    • Dynamic binding

OODBMS Examples:

  • db4o (Java, .NET)
  • ObjectDB (Java)
  • Versant

Object-Relational Databases (ORDBMS):

Hybrid: Relational model + OO features

PostgreSQL Example:
-- User-defined types
CREATE TYPE Address AS (
street VARCHAR(100),
city VARCHAR(50),
zip VARCHAR(10)
);

CREATE TABLE Employee (
id INT PRIMARY KEY,
name VARCHAR(100),
address Address -- Complex type
);

INSERT INTO Employee VALUES (1, 'John', ROW('123 Main St', 'NYC', '10001'));

SELECT name, (address).city FROM Employee;

Advantages:

  • Natural mapping from OOP
  • No joins for complex objects
  • Better performance for complex data

Disadvantages:

  • Less mature than RDBMS
  • Smaller ecosystem
  • No standard query language (unlike SQL)

DATA WAREHOUSING:

Definition: Subject-oriented, integrated, time-variant, non-volatile collection for decision support

Characteristics:

  1. Subject-Oriented: Organized by business area (Sales, Finance)
  2. Integrated: Data from multiple sources
  3. Time-Variant: Historical data, snapshots
  4. Non-Volatile: Read-only, no updates

Architecture:

Source Systems (OLTP)

ETL (Extract, Transform, Load)

Data Warehouse

Data Marts (department-specific)

OLAP / BI Tools

OLTP vs OLAP:

OLTP (Online Transaction Processing):

  • Current data
  • Detailed records
  • Read/Write
  • Normalized (3NF)
  • Fast transactions
  • Example: Bank account balance

OLAP (Online Analytical Processing):

  • Historical data
  • Aggregated summaries
  • Read-mostly
  • Denormalized (star schema)
  • Complex queries
  • Example: Quarterly sales analysis

Dimensional Modeling:

Star Schema:

    Dimension: Time
         |
         |

Dimension -- Fact Table -- Dimension
(Product) (Sales) (Customer)
|
|
Dimension: Store

Fact Table: Measures (sales amount, quantity)
Dimension Tables: Context (who, what, when, where)

Snowflake Schema:
Normalized dimensions (dimension tables have sub-dimensions)

OLAP Operations:

  1. Roll-Up: Aggregate to higher level
    Daily sales → Monthly sales

  2. Drill-Down: Detail to lower level
    Yearly sales → Quarterly sales

  3. Slice: Fix one dimension
    Sales in Q1 2024

  4. Dice: Fix multiple dimensions
    Sales in Q1 2024, Region=West, Product=Laptop

  5. Pivot: Rotate view
    Rows ↔ Columns

DATA MINING:

Definition: Discovering patterns, correlations, trends in large datasets

Knowledge Discovery Process (KDD):

  1. Data Cleaning: Remove noise, outliers
  2. Data Integration: Combine sources
  3. Data Selection: Choose relevant data
  4. Data Transformation: Normalize, aggregate
  5. Data Mining: Apply algorithms
  6. Pattern Evaluation: Identify interesting patterns
  7. Knowledge Presentation: Visualize, report

Data Mining Techniques:

  1. Association Rules:
    Market Basket Analysis
    {Bread, Butter} ⇒ {Milk} (support=30%, confidence=80%)
    Support: % transactions containing itemset
    Confidence: % transactions with antecedent that also have consequent

  2. Classification:
    Assign items to predefined categories
    Decision Trees, Neural Networks, SVM
    Example: Spam vs Not Spam email

  3. Clustering:
    Group similar items (no predefined categories)
    K-Means, Hierarchical Clustering
    Example: Customer segmentation

  4. Regression:
    Predict continuous values
    Linear Regression, Polynomial Regression
    Example: Predict house price

  5. Anomaly Detection:
    Identify outliers
    Example: Fraud detection

  6. Sequential Patterns:
    Find patterns over time
    Example: Web clickstream analysis

DISTRIBUTED DATABASES:

Data stored across multiple locations

Types:

  1. Homogeneous: Same DBMS at all sites
  2. Heterogeneous: Different DBMS at different sites

Fragmentation:

  1. Horizontal: Split rows
    West_Customers (CA, WA, OR)
    East_Customers (NY, MA, PA)

  2. Vertical: Split columns
    Employee_Basic (id, name, dept)
    Employee_Payroll (id, salary, bonus)

  3. Hybrid: Both

Replication:

  • Full: Complete copy at each site
  • Partial: Selected fragments at sites

Advantages:

  • Improved performance (data locality)
  • Increased availability (redundancy)
  • Scalability

Challenges:

  • Distributed query processing
  • Distributed transactions (2PC)
  • Consistency vs Availability (CAP theorem)

NOSQL DATABASES:

Motivation: Scalability, flexibility, performance for web-scale applications

Types:

  1. Key-Value: Redis, DynamoDB
    Simple: key → value
    Fast lookups
    Use case: Session storage, caching

  2. Document: MongoDB, CouchDB
    Store JSON/BSON documents
    Schema-less
    Use case: Content management, catalogs

  3. Column-Family: Cassandra, HBase
    Wide columns, billions of columns
    Distributed, high write throughput
    Use case: Time-series, logs

  4. Graph: Neo4j, Amazon Neptune
    Nodes and edges
    Relationships are first-class
    Use case: Social networks, recommendations

CAP Theorem:
Can have at most 2 of 3:

  • Consistency: All nodes see same data
  • Availability: Every request gets response
  • Partition Tolerance: Works despite network splits

Relational: CA (sacrifice partition tolerance)
NoSQL: Often AP or CP

Examples

Example 1: OODBMS (db4o - Java)

// Define classes
class Employee {
String name;
Address address;
List<Project> projects;
}

class Address {
String street;
String city;
}

// Store object directly
Employee emp = new Employee();
emp.name = "John";
emp.address = new Address("123 Main", "NYC");
db.store(emp); // Stores entire object graph

// Query
List<Employee> result = db.query(new Predicate<Employee>() {
public boolean match(Employee emp) {
return emp.address.city.equals("NYC");
}
});

Example 2: Data Warehouse Star Schema (SQL)

CREATE TABLE FactSales (
sale_id INT PRIMARY KEY,
date_id INT,
product_id INT,
customer_id INT,
store_id INT,
quantity INT,
amount DECIMAL(10,2),
FOREIGN KEY (date_id) REFERENCES DimDate(date_id),
FOREIGN KEY (product_id) REFERENCES DimProduct(product_id),
FOREIGN KEY (customer_id) REFERENCES DimCustomer(customer_id),
FOREIGN KEY (store_id) REFERENCES DimStore(store_id)
);

CREATE TABLE DimDate (
date_id INT PRIMARY KEY,
date DATE,
day INT,
month INT,
quarter INT,
year INT
);

CREATE TABLE DimProduct (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
brand VARCHAR(50)
);

-- Analytical Query: Quarterly sales by product category
SELECT d.quarter, p.category, SUM(f.amount) AS total_sales
FROM FactSales f
JOIN DimDate d ON f.date_id = d.date_id
JOIN DimProduct p ON f.product_id = p.product_id
WHERE d.year = 2024
GROUP BY d.quarter, p.category
ORDER BY d.quarter, total_sales DESC;

Example 3: NoSQL (MongoDB)

// Document database - store complex objects
db.employees.insertOne({
_id: 1,
name: "John Doe",
address: {
street: "123 Main St",
city: "NYC",
zip: "10001"
},
projects: [
{ name: "Project A", role: "Lead" },
{ name: "Project B", role: "Developer" }
],
skills: ["Java", "Python", "SQL"]
});

// Query nested fields
db.employees.find({ "address.city": "NYC" });

// Query arrays
db.employees.find({ skills: "Python" });

Example 4: Data Mining - Association Rules (Python)

from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

Transaction data

transactions = [
['Bread', 'Milk', 'Butter'],
['Bread', 'Butter'],
['Milk', 'Butter', 'Eggs'],
['Bread', 'Milk', 'Butter', 'Eggs'],
['Bread', 'Milk']
]

Find frequent itemsets

frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

Generate association rules

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])

Output: {Bread, Butter} => {Milk} (confidence=0.8)

Real-World Use

When to Use Each Technology:

  1. Relational Databases (PostgreSQL, MySQL):

    • Structured data
    • ACID transactions required
    • Complex queries, joins
    • Examples: Banking, e-commerce orders
  2. OODBMS / ORDBMS:

    • Complex nested objects
    • Multimedia, CAD, scientific data
    • Tight OOP integration
    • Examples: GIS, engineering simulations
  3. Data Warehouses:

    • Historical analysis
    • Business intelligence
    • Reporting, dashboards
    • Examples: Sales analytics, financial reporting
  4. NoSQL:
    Document (MongoDB): Flexible schema, content management
    Key-Value (Redis): Caching, session storage
    Column-Family (Cassandra): Time-series, logs, high write throughput
    Graph (Neo4j): Social networks, recommendations, fraud detection

  5. Distributed Databases:

    • Global applications
    • High availability requirements
    • Scale beyond single server
    • Examples: Multi-region applications

Modern Architecture Trends:

  • Polyglot Persistence: Use right database for each use case
  • Lambda Architecture: Batch + Stream processing
  • Data Lakes: Store raw data, process on-demand
  • Cloud Data Warehouses: Snowflake, BigQuery, Redshift

For exams

Key Questions:

  1. Compare relational databases with object-oriented databases
  2. What is impedance mismatch? How does OODBMS solve it?
  3. Explain star schema and snowflake schema
  4. What is the difference between OLTP and OLAP?
  5. Describe the ETL process in data warehousing
  6. Explain OLAP operations: roll-up, drill-down, slice, dice, pivot
  7. What is data mining? List and explain 3 techniques
  8. Explain association rule mining with example
  9. What is the CAP theorem? How does it relate to NoSQL?
  10. Compare document, key-value, column-family, and graph databases
  11. What is horizontal vs vertical fragmentation in distributed databases?
  12. When would you choose NoSQL over relational database?

Key points

Core Concepts:

• OODBMS stores objects directly, eliminating impedance mismatch
• Data warehouses support analytics with denormalized star/snowflake schemas
• OLTP focuses on transactions, OLAP on analysis
• Data mining discovers patterns: association rules, classification, clustering
• NoSQL provides scalability and flexibility for specific use cases
• CAP theorem: Can't have consistency, availability, and partition tolerance together
• Distributed databases provide scalability through fragmentation and replication
• Modern applications use polyglot persistence: right database for each use case