Before applying any clustering, the analyst loads the VAST Challenge 2011 dataset—approximately 1,000,000 simulated microblog messages from the last 4 days of activity in the fictional city of Vastopolis. The map and space-time cube below show the raw spatial and spatio-temporal distributions. At this stage, sickness-related messages are indistinguishable from background noise; the spatial concentrations caused by the contamination event are not yet isolated.
An initial exploratory DBSCAN run with a very broad parameter sweep (ε from 0.05 to 5.0, min_samples = 150) provides a first overview of how the parameter space maps to clustering outcomes. The IteraScope display below shows the resulting quality metrics, Sankey transitions, and 2D embedding—most iterations produce either excessively fragmented results (small ε) or a single giant cluster (large ε).
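The two failure modes described above (fragmentation at small ε, one giant cluster at large ε) can be reproduced on toy data. The following is a minimal sketch, not the tool's implementation: it uses a textbook brute-force DBSCAN written in plain Python, three synthetic Gaussian blobs instead of the microblog data, and a toy `min_samples = 5` instead of 150. The names `dbscan`, `summarise`, `make_toy_points`, and the noise label `-1` are assumptions introduced for illustration.

```python
import math
import random

NOISE = -1  # assumed noise label, following the common scikit-learn convention


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Textbook DBSCAN with brute-force neighbour search; one label per point."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep growing
    return labels


def summarise(labels):
    """Return (number of clusters, noise percentage) for one iteration."""
    clusters = {lab for lab in labels if lab != NOISE}
    return len(clusters), 100.0 * labels.count(NOISE) / len(labels)


def make_toy_points(seed=0):
    """Three well-separated Gaussian blobs of 40 points each (synthetic data)."""
    rng = random.Random(seed)
    centres = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
    return [(rng.gauss(cx, 0.2), rng.gauss(cy, 0.2))
            for cx, cy in centres for _ in range(40)]


points = make_toy_points()
# Toy sweep: a tiny, an intermediate, and a very large distance threshold.
sweep = {eps: summarise(dbscan(points, eps, min_samples=5))
         for eps in (0.01, 0.3, 5.0)}
for eps, (k, noise_pct) in sweep.items():
    print(f"eps={eps}: clusters={k}, noise={noise_pct:.1f}%")
```

On this toy data the smallest threshold leaves every point as noise and the largest merges everything into one cluster, which mirrors the extremes of the broad sweep. For the real dataset, a library implementation with a spatial index would be the practical choice.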
Based on the initial exploration, the analyst focuses on ε = 0.05 to 1.0 (step 0.05) with min_samples = 150. This round serves Phase 1 of the SI workflow: identifying the coarse structure of the parameter landscape and selecting a promising intermediate value for further refinement.
| Parameter | Value |
|---|---|
| ε (distance threshold) | 0.05 … 1.0, step 0.05 |
| min_samples | 150 (fixed) |
| Iterations | 20 |
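The iteration count in the table follows directly from the range and step. A small sketch of how such an inclusive grid could be generated is given here; `sweep_values` is a hypothetical helper, and it derives the values from integer step indices so that repeated floating-point addition does not drop or duplicate the end point.

```python
def sweep_values(start, stop, step, ndigits=2):
    """Inclusive grid from start to stop, built from integer step indices."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, ndigits) for i in range(count)]


eps_grid = sweep_values(0.05, 1.0, 0.05)
print(len(eps_grid), eps_grid[0], eps_grid[-1])  # 20 0.05 1.0
```

The same helper works for any evenly spaced sweep of ε or min_samples.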
The IteraScope display reveals the full parameter landscape. At ε = 0.05, dozens of tiny clusters are discovered, with a very high noise percentage. As ε increases, clusters merge rapidly. By ε = 0.25, nearly all messages belong to one or two clusters. The intermediate value ε = 0.10 shows a moderate number of clusters with reasonable silhouette and manageable noise, making it the candidate for domain validation.
Sharing the ε = 0.10 grouping to the map and space-time cube confirms the spatial and temporal validity of the discovered clusters. On the map, clusters align with hospitals, the river corridor, and downwind neighbourhoods. In the space-time cube, temporal layering becomes visible: day 1 shows scattered noise, days 2–4 show spatially extended clusters, and hospital clusters appear only on days 3–4.
A second map/space-time cube view shows the clustering result with noise removed, isolating only the identified clusters for clearer inspection of their spatial and temporal extent.
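Removing noise for such a view amounts to filtering on the cluster label. A minimal sketch follows, assuming noise is encoded as label `-1` (a common convention, not confirmed by the source) and that each message is a hypothetical `(x, y, day)` record; the values are toy data.

```python
NOISE = -1  # assumed noise label


def drop_noise(records, labels):
    """Keep only (record, label) pairs that belong to an actual cluster."""
    return [(rec, lab) for rec, lab in zip(records, labels) if lab != NOISE]


def noise_percentage(labels):
    """Share of items labelled as noise, in percent."""
    return 100.0 * sum(1 for lab in labels if lab == NOISE) / len(labels)


# Toy records: (x, y, day); not taken from the dataset.
records = [(0.1, 0.2, 1), (0.4, 0.1, 2), (0.9, 0.8, 3), (0.5, 0.5, 1)]
labels = [0, -1, 1, -1]

kept = drop_noise(records, labels)
print(kept)                      # [((0.1, 0.2, 1), 0), ((0.9, 0.8, 3), 1)]
print(noise_percentage(labels))  # 50.0
```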
Guided by Round 1, the analyst narrows the distance range to ε = 0.05 … 0.15 (step 0.01) to find the optimal distance threshold with finer resolution. This round corresponds to a refined Phase 1: the analyst seeks the specific ε value where quality metrics peak and cluster structure is most stable.
| Parameter | Value |
|---|---|
| ε (distance threshold) | 0.05 … 0.15, step 0.01 |
| min_samples | 150 (fixed) |
| Iterations | 11 |
Before examining the IteraScope metrics in detail, the analyst shares the ε = 0.07 grouping to domain views. The map and space-time cube confirm that this value produces tightly delineated clusters with sharper boundaries than ε = 0.10, while retaining all major spatial and temporal patterns.
The IteraScope display for the refined sweep provides a detailed view of the ε = 0.05–0.15 landscape. Silhouette peaks sharply at ε = 0.07; Calinski–Harabasz confirms this peak; Davies–Bouldin reaches a local minimum. The Sankey transitions show that the cluster structure at ε = 0.07 persists through ε = 0.08 and 0.09 with only minor boundary adjustments. Several iterations near ε = 0.07 are marked as "complete" (all DBSCAN archetypes present).
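The data behind a Sankey transition view between two adjacent iterations is a table of flows: how many items move from each group in one iteration to each group in the next. The sketch below computes these flows and a simple persistence score. Both `transition_counts` and `retained_fraction` are hypothetical helpers, and the score is an illustrative measure, not the one IteraScope uses; the label lists are toy data.

```python
from collections import Counter


def transition_counts(labels_a, labels_b):
    """Count items flowing from each group in iteration A to each group in B."""
    return Counter(zip(labels_a, labels_b))


def retained_fraction(labels_a, labels_b):
    """Share of items that follow their source group's dominant flow."""
    best = {}
    for (src, _dst), n in transition_counts(labels_a, labels_b).items():
        best[src] = max(best.get(src, 0), n)
    return sum(best.values()) / len(labels_a)


# Toy labels for two neighbouring iterations (-1 is assumed to mean noise).
labels_a = [0, 0, 0, 1, 1, -1]
labels_b = [0, 0, 0, 1, -1, -1]

flows = transition_counts(labels_a, labels_b)
print(flows[(0, 0)], flows[(1, -1)])          # 3 1
print(retained_fraction(labels_a, labels_b))  # 5/6, about 0.83
```

A value close to 1 indicates that groups pass through almost unchanged, which corresponds to the "minor boundary adjustments" pattern described above.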
The following views show additional metric perspectives and the tooltip for ε = 0.07, confirming per-group statistics (prevalence, distance to medoid, noise percentage) and archetype membership.
Having fixed ε = 0.07 as the optimal distance threshold, the analyst now sweeps the second parameter—min_samples—from 100 to 200 (step 10) to test robustness of the cluster structure to the neighbourhood-size requirement. This round exercises Phases 2–4 of the SI workflow: transition assessment, confidence evaluation, and domain-contextualised interpretation of borderline members.
| Parameter | Value |
|---|---|
| ε (distance threshold) | 0.07 (fixed) |
| min_samples | 100 … 200, step 10 |
| Iterations | 11 |
The IteraScope display for the min_samples sweep shows that silhouette rises gently as min_samples increases from 100 to about 150 and then plateaus, while the noise percentage increases steadily. The core cluster structure remains stable across the entire range: the same clusters persist, differing only in how many borderline members they retain versus shed to noise.
The analyst decides to compare the two most promising candidates (130 and 150) in detail, hiding all intermediate axes so that only 100, 130, 150, and 200 remain visible. The Sankey view confirms an identical core structure; the analyst then focuses on which messages transition to noise between 130 and 150.
The analyst activates "highlight transitions from" on the noise group at min_samples = 150 to identify which messages were reassigned from non-noise clusters at 130 to noise at 150. This creates a temporary class attribute (e.g., "130.3 → 150.noise") that is propagated to all linked views. The 2D embedding shows connection lines with document counts; the map reveals that these borderline members are located at cluster edges in the city centre and river corridor.
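The temporary class attribute can be thought of as a per-message label derived from two label vectors. The sketch below shows one way to build it, assuming noise is encoded as `-1` and using the "130.3 → 150.noise" naming pattern quoted above; `transition_classes` is a hypothetical function and the label lists are toy data, not results from the dataset.

```python
NOISE = -1  # assumed noise label


def transition_classes(labels_from, labels_to, name_from, name_to, target=NOISE):
    """Tag each item that moves into `target` from a different group; else None."""
    def fmt(name, lab):
        return f"{name}.{'noise' if lab == NOISE else lab}"

    classes = []
    for a, b in zip(labels_from, labels_to):
        if b == target and a != target:
            classes.append(f"{fmt(name_from, a)} → {fmt(name_to, b)}")
        else:
            classes.append(None)  # item did not make the highlighted transition
    return classes


# Toy labels at min_samples = 130 and 150.
labels_130 = [3, 3, 0, -1, 0]
labels_150 = [3, -1, 0, -1, -1]

classes = transition_classes(labels_130, labels_150, "130", "150")
print(classes)
# [None, '130.3 → 150.noise', None, None, '130.0 → 150.noise']
```

Items that were already noise at 130 are left untagged, so the attribute isolates exactly the members shed between the two settings.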
The table below summarises the four rounds of the density-based clustering workflow, showing how each round narrowed the parameter space and what analytical questions it addressed.
| Round | Parameters | Workflow Phase | Key Finding |
|---|---|---|---|
| 0 | ε = 0.05 … 5.0, min_samples = 150 | Exploratory overview | Useful structure exists only in ε = 0.05–0.2; focus subsequent analysis here |
| 1 | ε = 0.05 … 1.0, step 0.05, min_samples = 150 | Phase 1 + Phase 4 | ε = 0.10 produces moderate K with good silhouette; clusters align with event geography and show temporal layering |
| 2 | ε = 0.05 … 0.15, step 0.01, min_samples = 150 | Phase 1 (refined) + Phase 5 | ε = 0.07 is optimal: silhouette peak, Sankey stability across neighbours, multiple complete iterations |
| 3 | ε = 0.07 (fixed), min_samples = 100 … 200, step 10 | Phases 2–4 | Core structure is robust across min_samples; members lost between 130 and 150 are genuine border cases; final choice: min_samples = 150 |
Final selected parameters: ε = 0.07, min_samples = 150. This configuration produces 26 clusters capturing all major contamination-related spatial concentrations (hospitals, river corridor, downwind areas) with high membership confidence, correct temporal layering, and validation against the known ground truth.
End of Appendix to Section 6.1