SAM Clustering Models: Complete Catalog
Overview
SAM (Supervised Agentic Modelling) provides access to 6 clustering algorithms, ranging from traditional centroid-based methods to modern density-based approaches. The AI system analyzes your data's characteristics and automatically selects the combination of algorithms best suited to them.
Model Categories
Centroid-Based Models - Fast & Interpretable
Traditional clustering methods that work well with spherical clusters and provide clear cluster centers.
Density-Based Models - Advanced & Adaptive
Modern approaches that excel with irregular cluster shapes and handle noise effectively.
Probabilistic Models - Soft & Uncertainty-Aware
Algorithms designed for soft clustering and uncertainty quantification.
Hierarchical Models - Interpretable & Flexible
Tree-based approaches ideal for understanding cluster relationships and business hierarchies.
Centroid-Based Models
K-Means
Best For: Spherical clusters, large datasets, fast processing
- Strengths: Fast execution, interpretable results, works well with numeric data
- Data Requirements: Minimum 50 observations, works best with 2-20 clusters
- Processing Time: Low (1-3 minutes for optimization)
- Use Cases: Customer segmentation, product categorization, market analysis
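As a minimal sketch (using scikit-learn, with synthetic data standing in for a real dataset), K-Means fits quickly and exposes one interpretable center per cluster:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a customer table: 300 rows, 4 spherical groups
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
labels = km.labels_              # hard assignment per row
centers = km.cluster_centers_    # one interpretable center per cluster
```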
Mini-Batch K-Means
Best For: Very large datasets, streaming data, memory-constrained environments
- Strengths: Memory efficient, handles millions of records, incremental updates
- Data Requirements: Minimum 100 observations, scales to millions of records
- Processing Time: Very Low (30 seconds - 2 minutes)
- Use Cases: Big data analytics, real-time clustering, scalable applications
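A sketch of the incremental-update pattern: feeding fixed-size batches through `partial_fit`, as you would with a stream too large for memory (the data here is synthetic and the batch size is illustrative):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Well-separated synthetic clusters standing in for a large table
centers = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]]
X, _ = make_blobs(n_samples=10_000, centers=centers, cluster_std=0.8,
                  random_state=0)

mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3, random_state=0)
for start in range(0, len(X), 256):      # simulate a stream of 256-row batches
    mbk.partial_fit(X[start:start + 256])

labels = mbk.predict(X)
```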
Density-Based Models
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Best For: Irregular cluster shapes, noise detection (note: the single eps parameter assumes roughly uniform density; for widely varying densities, prefer HDBSCAN)
- Strengths: Handles arbitrary shapes, identifies outliers, no need to specify cluster count
- Data Requirements: Minimum 30 observations, works with any cluster shape
- Processing Time: Medium (2-5 minutes for optimization)
- Use Cases: Geographic clustering, anomaly detection, complex pattern recognition
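A minimal sketch of DBSCAN's shape flexibility: two interleaved crescents, a pattern centroid methods cannot separate, with points labeled -1 treated as noise (the eps value is tuned to this synthetic data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: a shape K-Means cannot separate
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)  # -1 = noise
```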
HDBSCAN (Hierarchical DBSCAN)
Best For: Complex cluster hierarchies, varying densities, noise robustness
- Strengths: Hierarchical clustering, robust to parameter selection, excellent noise handling
- Data Requirements: Minimum 50 observations, handles complex cluster structures
- Processing Time: Medium-High (3-8 minutes)
- Use Cases: Customer behavior analysis, market segmentation, complex business patterns
Probabilistic Models
Gaussian Mixture Model (GMM)
Best For: Soft clustering, uncertainty quantification, overlapping clusters
- Strengths: Probabilistic assignments, handles overlapping clusters, uncertainty estimates
- Data Requirements: Minimum 100 observations, numeric features only (the Gaussian components assume continuous data)
- Processing Time: Medium (2-6 minutes)
- Use Cases: Risk assessment, customer lifetime value, probabilistic segmentation
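A minimal sketch of soft clustering: where hard labels hide ambiguity, `predict_proba` returns per-cluster membership probabilities for each row (synthetic overlapping data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Overlapping groups: soft assignments are more honest than hard labels here
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=1)

gmm = GaussianMixture(n_components=3, random_state=1).fit(X)
proba = gmm.predict_proba(X)   # each row: membership probabilities summing to 1
```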
Hierarchical Models
Agglomerative Clustering
Best For: Interpretable hierarchies, small to medium datasets, business logic
- Strengths: Clear hierarchy visualization, deterministic results, business interpretable
- Data Requirements: Minimum 20 observations, works well with < 10,000 records
- Processing Time: Medium-High (3-10 minutes)
- Use Cases: Organizational structure, product hierarchies, strategic planning
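A sketch illustrating the determinism claim: agglomerative clustering has no random initialization, so refitting the same data yields identical labels:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=3)

agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
labels_a = agg.labels_
# Deterministic: refitting the same data yields identical labels
labels_b = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X).labels_
```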
Model Selection Guide
Automatic Selection Criteria
Our AI system selects models based on these data characteristics:
For Spherical Clusters (Clear Centers)
- K-Means - Classic centroid-based approach
- Mini-Batch K-Means - For large datasets
- GMM - For probabilistic assignments
- Agglomerative - For interpretable hierarchies
For Irregular Clusters (Complex Shapes)
- HDBSCAN - Best for complex hierarchies
- DBSCAN - Robust density-based approach
- GMM - Flexible probabilistic modeling
- Agglomerative - For structured hierarchies
For Large Datasets (10,000+ records)
- Mini-Batch K-Means - Designed for scalability
- HDBSCAN - Efficient density-based clustering
- K-Means - Fast centroid-based approach
- DBSCAN - Memory-efficient density clustering
For Noisy/Outlier Data
- HDBSCAN - Excellent noise handling
- DBSCAN - Built-in outlier detection
- GMM - Probabilistic robustness
(K-Means variants are sensitive to outliers; if you need a centroid model here, remove outliers first.)
For Business Interpretability
- K-Means - Clear cluster centers
- Agglomerative - Hierarchical business logic
- GMM - Probabilistic business insights
- Mini-Batch K-Means - Clear cluster centers at scale
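The guide above can be sketched as a rule-of-thumb function. The thresholds and ordering here are illustrative, not SAM's actual selection logic:

```python
def suggest_models(n_rows, irregular_shapes=False, noisy=False,
                   need_hierarchy=False):
    """Illustrative rule-of-thumb mirroring the selection guide above."""
    picks = []
    if irregular_shapes or noisy:
        picks += ["HDBSCAN", "DBSCAN"]          # density-based models lead
    picks.append("Mini-Batch K-Means" if n_rows >= 10_000 else "K-Means")
    if need_hierarchy and n_rows < 10_000:
        picks.append("Agglomerative")           # hierarchies for smaller data
    picks.append("GMM")                         # probabilistic option
    return list(dict.fromkeys(picks))           # de-duplicate, keep order
```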
Performance Matrix
| Model | Accuracy | Speed | Scalability | Shape Flexibility | Noise Robust | Interpretability |
|---|---|---|---|---|---|---|
| K-Means | High | Very High | High | Low | Low | High |
| Mini-Batch K-Means | High | Very High | Very High | Low | Low | High |
| DBSCAN | High | Medium | Medium | High | High | Medium |
| HDBSCAN | Very High | Medium | High | High | High | High |
| GMM | High | Medium | Medium | Medium | Medium | High |
| Agglomerative | Medium | Low | Low | Medium | Low | Very High |
How SAM Selects Models
Intelligent Model Selection Process
SAM automatically chooses the best clustering models for your data through a 3-step AI-driven process:
Step 1: Data Analysis
Our system analyzes your dataset across 28 characteristics:
- Clusterability: Hopkins statistic and silhouette analysis
- Shape Requirements: Spherical vs irregular cluster detection
- Data Quality: Outlier percentage and noise assessment
- Size & Complexity: Dataset size and dimensionality evaluation
- Feature Types: Numeric vs categorical data analysis
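The Hopkins statistic named above has a short standard implementation: compare nearest-neighbor distances of real points against uniform random probes, giving ~0.5 for structureless data and values approaching 1 for clustered data. A sketch (the sample size `m` is a conventional choice, not SAM's):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=50, seed=0):
    """Hopkins statistic: ~0.5 for uniform data, approaching 1 when clustered."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sample = X[rng.choice(n, size=m, replace=False)]
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    w = nn.kneighbors(sample)[0][:, 1]                 # nearest *other* point (col 0 is self)
    u = nn.kneighbors(probes, n_neighbors=1)[0][:, 0]  # nearest data point per probe
    return u.sum() / (u.sum() + w.sum())

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.5, random_state=0)
h = hopkins(X)   # clearly clustered data, so h should be well above 0.5
```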
Step 2: Model Scoring
Each of the 6 available models receives a suitability score (0-10):
- Centroid Models (K-Means): Best for spherical clusters and large datasets
- Density Models (DBSCAN, HDBSCAN): Optimal for irregular shapes and noise
- Probabilistic Models (GMM): Ideal for soft clustering and uncertainty
- Hierarchical Models (Agglomerative): Perfect for interpretable business logic
Step 3: Smart Selection
The AI does not simply pick the highest scores; it also enforces diversity:
- Balanced Portfolio: Combines different model types for robustness
- Optimal Count: Selects 2-5 models based on data complexity
- Performance Priority: Balances accuracy with processing speed
- Category Limits: Prevents over-reliance on any single approach
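The category-limit rule above can be sketched as a greedy pass over score-ranked models with a per-category cap. The scores, category labels, and cap below are made up for illustration:

```python
def select_models(scores, categories, k=3, per_category=2):
    """Pick up to k models by score, capping picks from any one category."""
    chosen, used = [], {}
    for name, _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        cat = categories[name]
        if used.get(cat, 0) < per_category:
            chosen.append(name)
            used[cat] = used.get(cat, 0) + 1
        if len(chosen) == k:
            break
    return chosen

# Hypothetical suitability scores on the 0-10 scale described above
scores = {"K-Means": 8.1, "Mini-Batch K-Means": 7.9, "HDBSCAN": 9.2,
          "DBSCAN": 8.7, "GMM": 7.5, "Agglomerative": 6.8}
categories = {"K-Means": "centroid", "Mini-Batch K-Means": "centroid",
              "HDBSCAN": "density", "DBSCAN": "density",
              "GMM": "probabilistic", "Agglomerative": "hierarchical"}
```

With a cap of one per category, the second-best density model (DBSCAN) is skipped in favor of a more diverse portfolio.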
What You See
When clustering starts, you'll receive:
- Selected Models: "AI chose HDBSCAN, K-Means, and GMM"
- Selection Reason: "Best for irregular business patterns with noise handling"
- Expected Quality: "Excellent cluster separation anticipated"
- Processing Time: "Estimated completion in 6-12 minutes"
User Control Options
While AI selection is recommended, you can:
- Specify Models: Choose exact algorithms if needed
- Set Priorities: Emphasize speed vs accuracy vs interpretability
- Use Presets: Industry-optimized combinations available
SAM Mathematical Framework
Core SAM Formula
The SAM (Supervised Agentic Modelling) system uses a sophisticated mathematical framework to evaluate and select clustering algorithms:
Final Score = α × PS + β × PP - γ × RuntimePenalty
Where:
- PS (Predicted Suitability): Theoretical algorithm-dataset compatibility score
- PP (Post-Performance): Empirical performance score from actual testing
- RuntimePenalty: Computational cost penalty
- α, β, γ: Weighting coefficients (typically α=0.4, β=0.5, γ=0.1)
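The formula translates directly into a one-line function using the default weights stated above:

```python
def final_score(ps, pp, runtime_penalty, alpha=0.4, beta=0.5, gamma=0.1):
    """Final Score = alpha*PS + beta*PP - gamma*RuntimePenalty."""
    return alpha * ps + beta * pp - gamma * runtime_penalty
```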
Predicted Suitability (PS) Calculation
The PS score evaluates how well an algorithm theoretically matches the dataset characteristics:
PS = Σ(w_i × C_i) / Σ(w_i)
Where each criterion C_i is evaluated on a 0-1 scale and weighted as follows:
Criterion A: Data Type Fit (Weight: 2.0)
C_A = match_score(algorithm.data_requirements, dataset.characteristics)
Criterion B: Shape Adaptability (Weight: 1.5)
C_B = shape_compatibility_score(algorithm.cluster_shape_capability, dataset.cluster_shapes)
Criterion C: Noise/Outlier Robustness (Weight: 1.5)
C_C = noise_handling_score(algorithm.noise_tolerance, dataset.outlier_percentage)
Criterion D: Scalability (Weight: 2.0)
C_D = scalability_score(algorithm.computational_complexity, dataset.size)
Criterion E: Interpretability (Weight: 1.0)
C_E = interpretability_score(algorithm.business_friendliness, use_case.requirements)
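The PS weighted mean can be sketched directly from the criteria above; the weights come from the text, while the criterion scores below are hypothetical:

```python
def predicted_suitability(criteria, weights):
    """Weighted mean of criterion scores: PS = sum(w_i * C_i) / sum(w_i)."""
    total = sum(weights.values())
    return sum(weights[name] * criteria[name] for name in weights) / total

# Weights A-E from the text; the 0-1 criterion scores here are hypothetical
weights = {"data_type_fit": 2.0, "shape_adaptability": 1.5,
           "noise_robustness": 1.5, "scalability": 2.0, "interpretability": 1.0}
criteria = {"data_type_fit": 0.9, "shape_adaptability": 0.6,
            "noise_robustness": 0.8, "scalability": 1.0, "interpretability": 0.7}
ps = predicted_suitability(criteria, weights)
```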
Post-Performance (PP) Calculation
The PP score measures actual performance on a representative sample. Because the raw metrics live on different scales (silhouette in [-1, 1]; Davies-Bouldin unbounded, with lower being better; Calinski-Harabasz unbounded, with higher being better), each is first normalized to the [0, 1] range, with Davies-Bouldin inverted, before the weighted blend:
PP = (Silhouette_Score × 0.4) + (Davies_Bouldin_Score × 0.3) + (Calinski_Harabasz_Score × 0.3)
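A sketch of computing PP with scikit-learn's validity metrics. The normalization choices below (shifting silhouette into [0, 1], and 1/(1+x)-style squashing for the two unbounded metrics) are illustrative assumptions, not SAM's exact scheme:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = (silhouette_score(X, labels) + 1) / 2      # [-1, 1] -> [0, 1]
db = 1 / (1 + davies_bouldin_score(X, labels))   # lower is better -> invert
ch = calinski_harabasz_score(X, labels)
ch_norm = ch / (1 + ch)                          # squash unbounded score into [0, 1)

pp = 0.4 * sil + 0.3 * db + 0.3 * ch_norm
```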
Runtime Penalty Calculation
RuntimePenalty = min(1.0, actual_runtime / expected_runtime)
This framework ensures that SAM makes data-driven, objective decisions about algorithm selection while considering both technical performance and business requirements.