Visual Analytics Workflows for Supervising Unsupervised Data Grouping
Fraunhofer IAIS, Sankt Augustin, Germany | City, University of London, UK
SmartIterator (SI) is a visual analytics approach that treats the full sequence of unsupervised grouping results across a parameter sweep as a first-class analytical object. Rather than seeking a single "optimal" parameter configuration, SI guides the analyst through systematic exploration of how data structure emerges, persists, and transforms across configurations—building cumulative understanding that no single run can provide.
The approach is operationalized through IteraScope (IS), a coordinated visual display that combines:
Three method-specific workflows are provided—for density-based clustering, partition-based clustering, and NMF topic modeling—each following a common six-phase structure:
Key insight: Each phase contributes to cumulative understanding of data structure. The knowledge gained from studying transitions between configurations—which groups persist, which split, what vocabulary migrates, where boundaries lie—is often as valuable as the final parameter choice itself. The analyst does not merely pick a number; the analyst learns about the data.
Each appendix provides full-resolution figures and detailed analysis steps for one demonstration from the paper. The corresponding Google Colab notebooks implement key parts of the computational pipeline (parameter sweeps, quality metrics, archetype detection, embedding computation, word cloud generation) in reproducible Python code.
Approximately 1,000,000 simulated microblog messages from the fictional city of Vastopolis. A contamination event caused spatially concentrated health complaints that vary substantially in density—from tight hospital clusters to elongated, diffuse wind- and water-dispersed patterns. The workflow sweeps DBSCAN parameters (ε and min_samples) across three progressive rounds, using map and space-time cube views for domain validation against the known event geography.
Highlights: Multi-round progressive refinement; transition-class propagation to isolate borderline members; temporal layering revealed in space-time cube; validation against ground truth.
Demographic indicators (gender and age structure) for ~1,500 European NUTS-3 regions from Eurostat. The workflow sweeps the number of K-means clusters (K = 3…50), compares candidate configurations through violin plots and transition-class propagation, validates geographic coherence on choropleth maps, and verifies robustness through a 30-seed stability sweep.
Highlights: Violin-based confidence comparison between K = 20 and K = 24; transition decomposition revealing peripheral losses; seed-stability verification; parallel coordinates for demographic profiling.
~3,800 IEEE VIS papers (1990–2024) with titles and abstracts. The workflow begins with data cleaning (removal of encoding artifacts), sweeps the number of NMF topics (5–25), diagnoses a "wastebasket" topic at ntopics = 16 through Sankey transition analysis, investigates the progressive differentiation of the "time" topic, and validates the final 15-topic solution through 34-year temporal prevalence charts.
Highlights: Phase 0 data cleaning via artifact detection; wastebasket diagnosis through multi-parent Sankey pattern; term-level transition word clouds; temporal prevalence alignment with known community evolution (VAST track founding, growth of evaluation culture).
Each notebook implements key computational components from the paper (Section 5), including:
Note: The notebooks do not implement the full IteraScope interactive display (Sankey rendering, violin plots, hover/click interactions, domain-view linking). These are part of the Java-based V-Analytics environment. The notebooks produce the input data files for IS and provide standalone exploratory visualizations.
The following recommendations generalize from our experience applying SmartIterator across three method families and diverse datasets. They are intended as actionable guidance for analysts beginning their own parameter-supervision workflows.
Begin with a wide parameter sweep at coarse resolution (e.g., K = 3…50 with step 1, or ε = 0.05…1.0 with step 0.05) to identify the region of interest. Examine the metrics chart for elbows, peaks, and plateaus, and note where "complete" iterations (those containing representatives of all detected archetypes) cluster. Then narrow the range and reduce the step size for a second round. Two to three rounds typically suffice, each informed by the findings of the previous one. This avoids wasting computation on irrelevant regions and gives early orientation before committing to detailed inspection.
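The round-to-round narrowing described above can be sketched as a small helper. This is an illustration only, not code from the notebooks: the names `frange`, `refine_range`, and `toy_quality` are hypothetical, and picking the single best-scoring value stands in for the analyst's reading of elbows, peaks, and plateaus in the metrics chart.

```python
def frange(start, stop, step):
    """Inclusive float grid; values are rounded to suppress accumulated error."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def refine_range(values, scores, factor=5):
    """Return a finer grid spanning the neighbours of the best-scoring value.

    Assumption: 'best' means highest score. In practice the analyst would
    choose the region of interest by inspecting the metrics chart.
    """
    best = max(range(len(values)), key=lambda i: scores[i])
    lo = values[max(best - 1, 0)]
    hi = values[min(best + 1, len(values) - 1)]
    coarse_step = values[1] - values[0]
    return frange(lo, hi, coarse_step / factor)


def toy_quality(eps):
    """Hypothetical stand-in for a real quality metric; peaks near eps = 0.32."""
    return -(eps - 0.32) ** 2


# Round 1: coarse sweep over the full range.
round1 = frange(0.05, 1.0, 0.05)
scores1 = [toy_quality(eps) for eps in round1]

# Round 2: same span as the best value's neighbours, at one fifth of the step.
round2 = refine_range(round1, scores1)
print(round2[0], round2[-1], len(round2))
```

In a real workflow, `toy_quality` would be replaced by whatever metric the sweep records for each configuration, and the refined grid would be fed back into the clustering run for the next round.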
A silhouette peak or coherence plateau narrows the candidate range, but the definitive stability evidence comes from transition flows. A "good" configuration is one where the groups at the selected parameter persist with minimal change at neighboring values—visible as wide, horizontal Sankey bands connecting the same vertical positions across consecutive iterations. Specifically:
If a group splits at the very next parameter step, it may be an artifact of the current granularity rather than genuine structure.
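The persistence check described above can be approximated numerically from the label vectors of two consecutive iterations. The sketch below is not the IteraScope implementation: the function names, the choice of Jaccard overlap as a persistence score, and the toy label vectors are assumptions made for illustration. The pair counts returned by `transition_flows` correspond to the band widths a Sankey display would draw.

```python
from collections import Counter


def transition_flows(labels_a, labels_b):
    """Count how many items move from each group in A to each group in B."""
    return Counter(zip(labels_a, labels_b))


def persistence(labels_a, labels_b):
    """For each group in A, return (best-matching group in B, Jaccard overlap)."""
    flows = transition_flows(labels_a, labels_b)
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    result = {}
    for (a, b), shared in flows.items():
        jaccard = shared / (size_a[a] + size_b[b] - shared)
        if jaccard > result.get(a, (None, 0.0))[1]:
            result[a] = (b, jaccard)
    return result


# Toy example: ten items clustered at two neighbouring parameter values.
# Group 0 is unchanged; group 1 splits into groups 1 and 2.
labels_step1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
labels_step2 = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]

stable = persistence(labels_step1, labels_step2)
for group, (match, score) in sorted(stable.items()):
    print(f"group {group} -> {match}: Jaccard {score:.2f}")
```

A score near 1 corresponds to a wide, horizontal band; a markedly lower score flags a group that splits or dissolves at the next step. For density-based results, the noise label (for example -1 in scikit-learn's DBSCAN) would usually be excluded or treated separately before computing overlaps.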
When multiple configurations score similarly on quality metrics, propagate each to domain-linked views and compare. Domain coherence frequently discriminates between statistically equivalent alternatives:
The domain view often reveals the "right" answer instantly when metrics are ambiguous.
Transition-class propagation is most informative when applied to understand losses: which members leave a group at the next parameter step, and where do they go? The spatial, temporal, or semantic distribution of lost members reveals the nature of the transition:
This "what changes" perspective often generates more domain insight than examining static group composition at a single configuration.
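As a rough sketch of the loss-oriented view, the members that leave a group at the next parameter step can be isolated by comparing two label vectors. The function name `lost_members`, its arguments, and the toy labels are hypothetical; the notebooks and V-Analytics may organize this differently.

```python
from collections import Counter


def lost_members(labels_now, labels_next, group, successor):
    """Return indices of items that are in `group` now but not in `successor`
    at the next step, together with a count of where they went."""
    lost = [
        i
        for i, (a, b) in enumerate(zip(labels_now, labels_next))
        if a == group and b != successor
    ]
    destinations = Counter(labels_next[i] for i in lost)
    return lost, destinations


# Toy example: group "A" keeps four members and loses two to a new group "C";
# one former member of "B" joins "A" (a gain, so it is not reported as a loss).
labels_now = ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"]
labels_next = ["A", "A", "A", "A", "C", "C", "B", "B", "B", "A"]

lost, destinations = lost_members(labels_now, labels_next, "A", "A")
print(lost, dict(destinations))
```

The returned indices can then be used to select the corresponding records (coordinates, timestamps, documents) for display in a domain view, so that the spatial, temporal, or semantic distribution of the lost members can be inspected.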
A "complete" iteration (containing representatives of all HDBSCAN-detected archetypes) captures the full diversity of recurrent patterns—making it a strong candidate. However:
K-means and other partition-based methods depend on random initialization. Once a promising K is identified:
For topic modeling (NMF), seed sensitivity is typically lower due to the non-negativity constraint, but verification remains good practice.
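A seed-stability check can be sketched as follows: run the same configuration under several seeds and measure how well the resulting partitions agree. Everything here is an assumption made for illustration: the minimal Lloyd's-algorithm `kmeans`, the use of the plain Rand index as the agreement measure, and the toy points. The source does not state which agreement measure the stability sweep uses.

```python
import random
from itertools import combinations


def nearest(pt, centers):
    """Index of the center closest to pt (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])),
    )


def kmeans(points, k, seed, iters=50):
    """Minimal Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = None
    for _ in range(iters):
        new_labels = [nearest(pt, centers) for pt in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of item pairs on which two partitions agree (same/different)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def seed_stability(points, k, seeds):
    """Mean pairwise Rand index across runs that differ only in the seed."""
    runs = [kmeans(points, k, s) for s in seeds]
    scores = [rand_index(x, y) for x, y in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Toy data: two loose groups of 2-D points.
points = [(0.0, 0.1), (0.4, 1.2), (1.1, 0.3), (0.9, 0.8),
          (9.8, 10.2), (10.5, 11.0), (11.2, 9.7), (10.1, 10.9)]
stability = seed_stability(points, k=2, seeds=range(30))
print(f"mean pairwise Rand index over 30 seeds: {stability:.3f}")
```

A value close to 1 indicates that the partition barely depends on initialization. The pairwise comparison is quadratic in the number of items, so for datasets of realistic size a library routine such as scikit-learn's `adjusted_rand_score` is likely a better fit than this pure-Python version.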
When an alternative configuration is rejected, record:
Documented rejections serve two purposes: they strengthen confidence in the final selection (by demonstrating that alternatives were considered and found wanting), and they contribute to cumulative understanding (by revealing what the data does not support).
The SI workflow does not require converging to a single "winner." If two configurations offer complementary views of the data—e.g., a coarse typology (K=7) and a fine-grained one (K=20)—both may be worth retaining as analytical lenses for different purposes. The knowledge gained from understanding how the coarse groups differentiate into fine-grained ones is itself a valuable analytical product.