SAM Clustering Models: Complete Catalog

Overview

SAM (Supervised Agentic Modelling) provides access to 6 state-of-the-art clustering algorithms, ranging from traditional statistical methods to modern density-based approaches. Our AI system automatically selects the combination best suited to your data's characteristics, balancing accuracy, robustness, and processing time.

Model Categories

Centroid-Based Models - Fast & Interpretable

Traditional clustering methods that work well with spherical clusters and provide clear cluster centers.

Density-Based Models - Advanced & Adaptive

Modern approaches that excel with irregular cluster shapes and handle noise effectively.

Probabilistic Models - Purpose-Built

Algorithms designed for soft clustering and uncertainty quantification.

Hierarchical Models - Interpretable & Flexible

Tree-based approaches ideal for understanding cluster relationships and business hierarchies.


Centroid-Based Models

K-Means

Best For: Spherical clusters, large datasets, fast processing

  • Strengths: Fast execution, interpretable results, works well with numeric data
  • Data Requirements: Minimum 50 observations, works best with 2-20 clusters
  • Processing Time: Low (1-3 minutes for optimization)
  • Use Cases: Customer segmentation, product categorization, market analysis
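As a point of reference outside SAM, the centroid behaviour described above can be sketched with scikit-learn's KMeans on synthetic data. This is an illustration of the algorithm, not SAM's internal code:

```python
# Illustrative sketch of centroid-based clustering with scikit-learn's
# KMeans on synthetic spherical data (not SAM's internal implementation).
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated spherical blobs of synthetic data.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)   # one interpretable center per cluster
print(model.labels_[:5])        # cluster assignments for the first 5 points
```

The explicit cluster centers are what makes K-Means results easy to interpret: each segment can be summarized by its center's feature values.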

Mini-Batch K-Means

Best For: Very large datasets, streaming data, memory-constrained environments

  • Strengths: Memory efficient, handles millions of records, incremental updates
  • Data Requirements: Minimum 100 observations, scales to millions of records
  • Processing Time: Very Low (30 seconds - 2 minutes)
  • Use Cases: Big data analytics, real-time clustering, scalable applications

Density-Based Models

DBSCAN (Density-Based Spatial Clustering)

Best For: Irregular cluster shapes, noise detection, varying densities

  • Strengths: Handles arbitrary shapes, identifies outliers, no need to specify cluster count
  • Data Requirements: Minimum 30 observations, works with any cluster shape
  • Processing Time: Medium (2-5 minutes for optimization)
  • Use Cases: Geographic clustering, anomaly detection, complex pattern recognition
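The outlier-identification behaviour described above can be seen with scikit-learn's DBSCAN, which labels noise points `-1` instead of forcing them into a cluster. An illustrative sketch, not SAM's implementation:

```python
# Illustrative sketch of density-based clustering with scikit-learn's
# DBSCAN (not SAM's internal implementation).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cluster_a = rng.normal((0, 0), 0.2, size=(40, 2))
cluster_b = rng.normal((4, 4), 0.2, size=(40, 2))
outlier = np.array([[10.0, -10.0]])        # an isolated noise point
X = np.vstack([cluster_a, cluster_b, outlier])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
# DBSCAN assigns the label -1 to points too isolated to join any cluster.
print(labels[-1])   # -1
```

Note that the cluster count was never specified: DBSCAN discovers it from the density structure, which is exactly why it suits irregular shapes.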

HDBSCAN (Hierarchical DBSCAN)

Best For: Complex cluster hierarchies, varying densities, noise robustness

  • Strengths: Hierarchical clustering, robust to parameter selection, excellent noise handling
  • Data Requirements: Minimum 50 observations, handles complex cluster structures
  • Processing Time: Medium-High (3-8 minutes)
  • Use Cases: Customer behavior analysis, market segmentation, complex business patterns

Probabilistic Models

Gaussian Mixture Model (GMM)

Best For: Soft clustering, uncertainty quantification, overlapping clusters

  • Strengths: Probabilistic assignments, handles overlapping clusters, uncertainty estimates
  • Data Requirements: Minimum 100 observations, works with mixed data types
  • Processing Time: Medium (2-6 minutes)
  • Use Cases: Risk assessment, customer lifetime value, probabilistic segmentation
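The soft-clustering behaviour described above can be sketched with scikit-learn's GaussianMixture, whose `predict_proba` returns per-cluster membership probabilities rather than hard labels (illustrative only, not SAM's internal code):

```python
# Illustrative sketch of soft clustering with scikit-learn's
# GaussianMixture (not SAM's internal implementation).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(3.0, 1.0, size=(100, 2)),   # overlaps the first component
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
proba = gmm.predict_proba(X)   # per-point membership probabilities
# Each row sums to 1; points between the components receive split weight,
# which is the uncertainty estimate that hard clustering cannot provide.
print(proba[0])
```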

Hierarchical Models

Agglomerative Clustering

Best For: Interpretable hierarchies, small to medium datasets, business logic

  • Strengths: Clear hierarchy visualization, deterministic results, business interpretable
  • Data Requirements: Minimum 20 observations, works well with < 10,000 records
  • Processing Time: Medium-High (3-10 minutes)
  • Use Cases: Organizational structure, product hierarchies, strategic planning
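The deterministic, hierarchy-based behaviour described above can be sketched with scikit-learn's AgglomerativeClustering (illustrative only, not SAM's internal code):

```python
# Illustrative sketch of hierarchical clustering with scikit-learn's
# AgglomerativeClustering (not SAM's internal implementation).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2]])

# Ward linkage merges the two tight groups last, so cutting the merge
# tree at 2 clusters recovers them; there is no random initialization,
# so repeated runs give identical results.
model = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print(model.labels_)
```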

Model Selection Guide

Automatic Selection Criteria

Our AI system selects models based on these data characteristics:

For Spherical Clusters (Clear Centers)

  1. K-Means - Classic centroid-based approach
  2. Mini-Batch K-Means - For large datasets
  3. GMM - For probabilistic assignments
  4. Agglomerative - For interpretable hierarchies

For Irregular Clusters (Complex Shapes)

  1. HDBSCAN - Best for complex hierarchies
  2. DBSCAN - Robust density-based approach
  3. GMM - Flexible probabilistic modeling
  4. Agglomerative - For structured hierarchies

For Large Datasets (10,000+ records)

  1. Mini-Batch K-Means - Designed for scalability
  2. HDBSCAN - Efficient density-based clustering
  3. K-Means - Fast centroid-based approach
  4. DBSCAN - Memory-efficient density clustering

For Noisy/Outlier Data

  1. HDBSCAN - Excellent noise handling
  2. DBSCAN - Built-in outlier detection
  3. GMM - Probabilistic robustness
  4. Mini-Batch K-Means - Scalable option for mildly noisy data

For Business Interpretability

  1. K-Means - Clear cluster centers
  2. Agglomerative - Hierarchical business logic
  3. GMM - Probabilistic business insights
  4. Mini-Batch K-Means - Scalable interpretability
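Taken together, the scenario rankings above amount to a simple lookup. A minimal stdlib sketch (the function and dictionary names here are hypothetical, not SAM's API):

```python
# Minimal sketch of the scenario-based rankings above; RANKED_MODELS and
# candidate_models are illustrative names, not SAM's actual API.
RANKED_MODELS = {
    "spherical":     ["K-Means", "Mini-Batch K-Means", "GMM", "Agglomerative"],
    "irregular":     ["HDBSCAN", "DBSCAN", "GMM", "Agglomerative"],
    "large":         ["Mini-Batch K-Means", "HDBSCAN", "K-Means", "DBSCAN"],
    "noisy":         ["HDBSCAN", "DBSCAN", "GMM", "Mini-Batch K-Means"],
    "interpretable": ["K-Means", "Agglomerative", "GMM", "Mini-Batch K-Means"],
}

def candidate_models(scenario: str) -> list[str]:
    """Return the ranked model shortlist for a data scenario."""
    return RANKED_MODELS[scenario]

print(candidate_models("noisy")[0])   # HDBSCAN
```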

Performance Matrix

| Model | Accuracy | Speed | Scalability | Shape Flexibility | Noise Robustness | Interpretability |
|---|---|---|---|---|---|---|
| K-Means | High | Very High | High | | | |
| Mini-Batch K-Means | High | Very High | Very High | | | |
| DBSCAN | High | Medium | Medium | Medium | | |
| HDBSCAN | Very High | Medium | High | High | | |
| GMM | High | Medium | Medium | Medium | Medium | High |
| Agglomerative | Medium | Low | Low | Medium | | Very High |

How SAM Selects Models

Intelligent Model Selection Process

SAM automatically chooses the best clustering models for your data through a 3-step AI-driven process:

Step 1: Data Analysis

Our system analyzes your dataset across 28 characteristics:

  • Clusterability: Hopkins statistic and silhouette analysis
  • Shape Requirements: Spherical vs irregular cluster detection
  • Data Quality: Outlier percentage and noise assessment
  • Size & Complexity: Dataset size and dimensionality evaluation
  • Feature Types: Numeric vs categorical data analysis
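Of the checks above, the Hopkins statistic can be sketched compactly. The version below is a hedged, stdlib-only illustration with brute-force nearest-neighbour search, not SAM's internal implementation:

```python
# Hedged sketch of the Hopkins statistic mentioned above, one of the
# clusterability checks; brute-force nearest neighbours, stdlib only
# (illustrative, not SAM's internal implementation).
import math
import random

def hopkins(points, m=10, seed=0):
    """Hopkins statistic in [0, 1]: ~0.5 suggests no cluster tendency,
    values near 1 suggest strong clustering."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def nn_dist(q, data):
        # Distance from q to its nearest point in data (excluding itself).
        return min(math.dist(q, p) for p in data if p is not q)

    # u: nearest-data distances for m uniform points in the bounding box.
    u = sum(nn_dist((rng.uniform(min(xs), max(xs)),
                     rng.uniform(min(ys), max(ys))), points)
            for _ in range(m))
    # w: nearest-neighbour distances for m sampled real data points.
    w = sum(nn_dist(p, points) for p in rng.sample(points, m))
    return u / (u + w)

rng = random.Random(42)
clustered = ([(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(30)]
             + [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(30)])
print(round(hopkins(clustered), 2))   # high value = clear cluster tendency
```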

Step 2: Model Scoring

Each of the 6 available models receives a suitability score (0-10):

  • Centroid Models (K-Means): Best for spherical clusters and large datasets
  • Density Models (DBSCAN, HDBSCAN): Optimal for irregular shapes and noise
  • Probabilistic Models (GMM): Ideal for soft clustering and uncertainty
  • Hierarchical Models (Agglomerative): Perfect for interpretable business logic

Step 3: Smart Selection

The AI doesn't simply pick the highest-scoring models - it also ensures diversity:

  • Balanced Portfolio: Combines different model types for robustness
  • Optimal Count: Selects 2-5 models based on data complexity
  • Performance Priority: Balances accuracy with processing speed
  • Category Limits: Prevents over-reliance on any single approach
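The category-limit step above can be sketched as a greedy selection with a per-category cap. The scores, category labels, and cap values below are hypothetical, chosen only to illustrate the mechanism:

```python
# Sketch of diversity-aware selection with per-category caps
# (hypothetical scores and caps, not SAM's real values).
def select_models(scores, categories, max_models=3, per_category=1):
    """Pick the top-scoring models while capping each category."""
    chosen, used = [], {}
    for name in sorted(scores, key=scores.get, reverse=True):
        cat = categories[name]
        if used.get(cat, 0) < per_category:
            chosen.append(name)
            used[cat] = used.get(cat, 0) + 1
        if len(chosen) == max_models:
            break
    return chosen

scores = {"HDBSCAN": 9.1, "DBSCAN": 8.7, "K-Means": 8.2, "GMM": 7.9,
          "Mini-Batch K-Means": 7.5, "Agglomerative": 6.8}
categories = {"HDBSCAN": "density", "DBSCAN": "density",
              "K-Means": "centroid", "Mini-Batch K-Means": "centroid",
              "GMM": "probabilistic", "Agglomerative": "hierarchical"}

# DBSCAN is skipped despite ranking 2nd: the density slot is taken.
print(select_models(scores, categories))
```

The cap is what produces a balanced portfolio: without it, the two density models would crowd out the probabilistic and centroid approaches.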

What You See

When clustering starts, you'll receive:

  • Selected Models: "AI chose HDBSCAN, K-Means, and GMM"
  • Selection Reason: "Best for irregular business patterns with noise handling"
  • Expected Quality: "Excellent cluster separation anticipated"
  • Processing Time: "Estimated completion in 6-12 minutes"

User Control Options

While AI selection is recommended, you can:

  • Specify Models: Choose exact algorithms if needed
  • Set Priorities: Emphasize speed vs accuracy vs interpretability
  • Use Presets: Industry-optimized combinations available

SAM Mathematical Framework

Core SAM Formula

The SAM (Supervised Agentic Modelling) system uses a sophisticated mathematical framework to evaluate and select clustering algorithms:

Final Score = α × PS + β × PP - γ × RuntimePenalty

Where:

  • PS (Predicted Suitability): Theoretical algorithm-dataset compatibility score
  • PP (Post-Performance): Empirical performance score from actual testing
  • RuntimePenalty: Computational cost penalty
  • α, β, γ: Weighting coefficients (typically α=0.4, β=0.5, γ=0.1)
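A direct transcription of the formula with the typical weights from the text (the example inputs are illustrative):

```python
# Direct transcription of the Final Score formula above, using the
# typical weights α=0.4, β=0.5, γ=0.1 (example inputs are illustrative).
def final_score(ps, pp, runtime_penalty, alpha=0.4, beta=0.5, gamma=0.1):
    """Final Score = α × PS + β × PP - γ × RuntimePenalty."""
    return alpha * ps + beta * pp - gamma * runtime_penalty

# Strong predicted fit, good empirical results, moderate runtime cost:
print(round(final_score(ps=0.8, pp=0.75, runtime_penalty=0.5), 3))  # 0.645
```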

Predicted Suitability (PS) Calculation

The PS score evaluates how well an algorithm theoretically matches the dataset characteristics:

PS = Σ(wᵢ × Cᵢ) / Σ(wᵢ)

Where each criterion Cᵢ is evaluated on a 0-1 scale:

Criterion A: Data Type Fit (Weight: 2.0)

C_A = match_score(algorithm.data_requirements, dataset.characteristics)

Criterion B: Shape Adaptability (Weight: 1.5)

C_B = shape_compatibility_score(algorithm.cluster_shape_capability, dataset.cluster_shapes)

Criterion C: Noise/Outlier Robustness (Weight: 1.5)

C_C = noise_handling_score(algorithm.noise_tolerance, dataset.outlier_percentage)

Criterion D: Scalability (Weight: 2.0)

C_D = scalability_score(algorithm.computational_complexity, dataset.size)

Criterion E: Interpretability (Weight: 1.0)

C_E = interpretability_score(algorithm.business_friendliness, use_case.requirements)
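Putting the five criteria and their weights together, the weighted average can be sketched directly (the criterion values below are illustrative):

```python
# Minimal sketch of the PS weighted average above, using the five
# criterion weights from the text (criterion values are illustrative).
WEIGHTS = {"data_type_fit": 2.0, "shape_adaptability": 1.5,
           "noise_robustness": 1.5, "scalability": 2.0,
           "interpretability": 1.0}

def predicted_suitability(criteria):
    """PS = Σ(wᵢ × Cᵢ) / Σ(wᵢ), each Cᵢ in [0, 1]."""
    total_w = sum(WEIGHTS.values())
    return sum(WEIGHTS[k] * criteria[k] for k in WEIGHTS) / total_w

criteria = {"data_type_fit": 0.9, "shape_adaptability": 0.6,
            "noise_robustness": 0.8, "scalability": 1.0,
            "interpretability": 0.7}
print(round(predicted_suitability(criteria), 3))  # 0.825
```

Because the result is normalized by the weight sum, PS stays on the same 0-1 scale as the individual criteria.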

Post-Performance (PP) Calculation

The PP score measures actual performance on a representative sample. Each metric is first rescaled to a 0-1 range with higher = better (the Davies-Bouldin index is inverted, since its raw value is lower-is-better):

PP = (Silhouette_Score × 0.4) + (Davies_Bouldin_Score × 0.3) + (Calinski_Harabasz_Score × 0.3)
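As a sketch, the combination is a plain weighted sum once the inputs are comparable; the example values below are illustrative, already-normalized scores:

```python
# Sketch of the PP combination above; assumes each metric has already
# been rescaled to [0, 1] with higher = better (the raw Davies-Bouldin
# index is lower-is-better, so it is inverted before weighting).
def post_performance(silhouette, davies_bouldin, calinski_harabasz):
    return 0.4 * silhouette + 0.3 * davies_bouldin + 0.3 * calinski_harabasz

print(round(post_performance(0.7, 0.8, 0.9), 2))  # 0.79
```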

Runtime Penalty Calculation

RuntimePenalty = min(1.0, actual_runtime / expected_runtime)

This framework ensures that SAM makes data-driven, objective decisions about algorithm selection while considering both technical performance and business requirements.