SmartIterator

Visual Analytics Workflows for Supervising Unsupervised Data Grouping

Gennady Andrienko and Natalia Andrienko

Fraunhofer IAIS, Sankt Augustin, Germany  |  City, University of London, UK

The Approach

SmartIterator (SI) is a visual analytics approach that treats the full sequence of unsupervised grouping results across a parameter sweep as a first-class analytical object. Rather than seeking a single "optimal" parameter configuration, SI guides the analyst through systematic exploration of how data structure emerges, persists, and transforms across configurations—building cumulative understanding that no single run can provide.

The approach is operationalized through IteraScope (IS), a coordinated visual display that combines a metrics chart, Sankey transition flows, violin plots, a 2D embedding, and links to domain views.

Three method-specific workflows are provided—for density-based clustering, partition-based clustering, and NMF topic modeling—each following a common six-phase structure:

Phase 1: Metric Overview & Archetype Completeness

Phase 2: Transition Assessment (Sankey flows)

Phase 3: Confidence Evaluation (violin plots)

Phase 4: Content & Context Inspection (domain views)

Phase 5: Archetype Verification (2D embedding)

Phase 6: Decision or Refinement

Key insight: Each phase contributes to cumulative understanding of data structure. The knowledge gained from studying transitions between configurations—which groups persist, which split, what vocabulary migrates, where boundaries lie—is often as valuable as the final parameter choice itself. The analyst does not merely pick a number; the analyst learns about the data.

Supplementary Appendices and Notebooks

Each appendix provides full-resolution figures and detailed analysis steps for one demonstration from the paper. The corresponding Google Colab notebooks implement key parts of the computational pipeline (parameter sweeps, quality metrics, archetype detection, embedding computation, word cloud generation) in reproducible Python code.

Note: The notebooks implement the computational engine described in the paper (Section 5). The IteraScope visual display is a separate Java application within the V-Analytics environment; the notebooks produce the data files that IS consumes, and include standalone visualizations for exploration without IS.

Appendix A — Density-Based Clustering (VAST Challenge 2011)

Approximately 1,000,000 simulated microblog messages from the fictional city of Vastopolis. A contamination event caused spatially concentrated health complaints that vary substantially in density—from tight hospital clusters to elongated, diffuse wind- and water-dispersed patterns. The workflow sweeps DBSCAN parameters (ε and min_samples) across three progressive rounds, using map and space-time cube views for domain validation against the known event geography.

Highlights: Multi-round progressive refinement; transition-class propagation to isolate borderline members; temporal layering revealed in space-time cube; validation against ground truth.
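A DBSCAN sweep over ε and min_samples can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic points, not the notebook's code: the function name dbscan_sweep, the parameter grid, and the two synthetic blobs (one tight, one diffuse) are assumptions chosen only to mimic groups of differing density.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_sweep(points, eps_values, min_samples_values):
    """Run DBSCAN for every (eps, min_samples) pair and keep every result."""
    results = {}
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
            # DBSCAN marks noise with -1; it is not counted as a cluster.
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            results[(eps, min_samples)] = {
                "labels": labels,
                "n_clusters": n_clusters,
                "noise_share": float(np.mean(labels == -1)),
            }
    return results

# Hypothetical stand-in for message coordinates: a tight and a diffuse blob.
rng = np.random.default_rng(0)
tight = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(200, 2))
diffuse = rng.normal(loc=(3.0, 3.0), scale=0.6, size=(200, 2))
points = np.vstack([tight, diffuse])

sweep = dbscan_sweep(points, eps_values=[0.1, 0.3, 0.5], min_samples_values=[5, 10])
for (eps, ms), run in sorted(sweep.items()):
    print(f"eps={eps:.1f} min_samples={ms}: "
          f"{run['n_clusters']} clusters, noise share {run['noise_share']:.2f}")
```

Keeping the label vector of every run, rather than only summary metrics, is what makes the later transition analysis between consecutive configurations possible.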

Appendix B — Partition-Based Clustering (EU NUTS-3 Population)

Demographic indicators (gender and age structure) for ~1,500 European NUTS-3 regions from Eurostat. The workflow sweeps the number of K-means clusters (K = 3…50), compares candidate configurations through violin plots and transition-class propagation, validates geographic coherence on choropleth maps, and verifies robustness through a 30-seed stability sweep.

Highlights: Violin-based confidence comparison between K = 20 and K = 24; transition decomposition revealing peripheral losses; seed-stability verification; parallel coordinates for demographic profiling.
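A sweep over the number of K-means clusters with one quality metric per run might look like the sketch below. It assumes scikit-learn and uses silhouette as the only metric; the helper name kmeans_sweep and the four synthetic blobs are illustrative assumptions, not the Eurostat data or the notebook's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_sweep(X, k_values, seed=0):
    """Fit K-means for each K and return {K: (labels, silhouette)}."""
    out = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        out[k] = (labels, float(silhouette_score(X, labels)))
    return out

# Hypothetical data: four well-separated groups in two dimensions.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5]])
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in centers])

sweep = kmeans_sweep(X, range(2, 8))
for k, (_, sil) in sweep.items():
    print(f"K={k}: silhouette={sil:.3f}")

# The metric only narrows the candidate range; it does not decide the choice.
best_k = max(sweep, key=lambda k: sweep[k][1])
```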

Appendix C — Topic Modeling (IEEE VIS Papers, NMF)

~3,800 IEEE VIS papers (1990–2024) with titles and abstracts. The workflow begins with data cleaning (removal of encoding artifacts), sweeps the number of NMF topics (5–25), diagnoses a "wastebasket" topic at ntopics = 16 through Sankey transition analysis, investigates the progressive differentiation of the "time" topic, and validates the final 15-topic solution through 34-year temporal prevalence charts.

Highlights: Phase 0 data cleaning via artifact detection; wastebasket diagnosis through multi-parent Sankey pattern; term-level transition word clouds; temporal prevalence alignment with known community evolution (VAST track founding, growth of evaluation culture).
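The topic-count sweep can be sketched with scikit-learn's TfidfVectorizer and NMF. The toy corpus, the helper name nmf_sweep, the small topic counts, and the choice of the nndsvd initialization are all assumptions made for illustration; the paper's sweep covers 5 to 25 topics on the real abstracts.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus standing in for titles and abstracts.
docs = [
    "density based clustering of spatial points",
    "spatial clustering of points on a map",
    "map visualization of spatial events",
    "topic modeling of text documents",
    "text documents and topic extraction",
    "matrix factorization for topic modeling",
    "time series visualization of events",
    "temporal patterns in time series",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
# Terms ordered by their column index in the TF-IDF matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

def nmf_sweep(tfidf, terms, n_topics_values, top_n=3):
    """Fit NMF for each topic count; keep dominant topic per doc and top terms."""
    runs = {}
    for k in n_topics_values:
        model = NMF(n_components=k, init="nndsvd", max_iter=500, random_state=0)
        doc_topic = model.fit_transform(tfidf)   # documents x topics
        topic_term = model.components_           # topics x terms
        top_terms = [[terms[i] for i in np.argsort(row)[::-1][:top_n]]
                     for row in topic_term]
        runs[k] = {"doc_topic": doc_topic.argmax(axis=1), "top_terms": top_terms}
    return runs

runs = nmf_sweep(tfidf, terms, n_topics_values=[2, 3, 4])
for k, run in runs.items():
    print(k, run["top_terms"])
```

Assigning each document to its highest-weight topic, as done here, is one simple way to obtain the hard groupings that transition flows between consecutive topic counts require.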

Notebook Contents Overview

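Each notebook implements key computational components from the paper (Section 5), including parameter sweeps, quality metrics, archetype detection, embedding computation, and word cloud generation.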

Note: The notebooks do not implement the full IteraScope interactive display (Sankey rendering, violin plots, hover/click interactions, domain-view linking). These are part of the Java-based V-Analytics environment. The notebooks produce the input data files for IS and provide standalone exploratory visualizations.

Practical Recommendations

The following recommendations generalize from our experience applying SmartIterator across three method families and diverse datasets. They are intended as actionable guidance for analysts beginning their own parameter-supervision workflows.

1. Start Coarse, Refine Progressively

Begin with a wide parameter sweep at coarse resolution (e.g., K = 3…50 with step 1, or ε = 0.05…1.0 with step 0.05) to identify the region of interest. Examine the metrics chart for elbows, peaks, and plateaus, and note where complete iterations (those containing representatives of all archetypes; see recommendation 5) cluster. Then narrow the range and reduce the step size for a second round. Two to three rounds typically suffice, each informed by the findings of the previous one. This avoids wasting computation on irrelevant regions and gives early orientation before committing to detailed inspection.
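The narrowing step between rounds can be sketched as a small helper that builds a finer grid around the most promising coarse value. The function refine_range is hypothetical, and taking the highest score is only a stand-in for the analyst's reading of the metrics chart; it also assumes an evenly spaced coarse grid with at least two values.

```python
def refine_range(values, scores, margin=1, factor=5):
    """Return a finer grid around the best-scoring coarse value.

    values: evenly spaced, sorted coarse parameter values
    scores: one score per value (higher is better)
    margin: how many coarse steps to keep on each side of the best value
    factor: how many fine steps replace one coarse step
    """
    best = max(range(len(values)), key=lambda i: scores[i])
    lo = values[max(best - margin, 0)]
    hi = values[min(best + margin, len(values) - 1)]
    fine_step = (values[1] - values[0]) / factor
    n_steps = round((hi - lo) / fine_step)
    # Rounding suppresses floating-point drift in the generated grid.
    return [round(lo + i * fine_step, 10) for i in range(n_steps + 1)]

# Round 1: coarse grid 0.05 ... 1.0 with step 0.05 and made-up scores
# that peak at 0.30.
coarse = [round(0.05 * i, 2) for i in range(1, 21)]
scores = [1 - abs(v - 0.30) for v in coarse]

# Round 2: step 0.01 between the neighbors of the best coarse value.
fine = refine_range(coarse, scores)
print(fine)
```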

2. Let Sankey Bands, Not Metrics Alone, Define Stability

A silhouette peak or coherence plateau narrows the candidate range, but the definitive stability evidence comes from transition flows. A "good" configuration is one where the groups at the selected parameter persist with minimal change at neighboring values—visible as wide, horizontal Sankey bands connecting the same vertical positions across consecutive iterations. Specifically:

If a group splits at the very next parameter step, it may be an artifact of the current granularity rather than genuine structure.
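The flows that a Sankey diagram draws between two consecutive iterations are simply counts of items per (group before, group after) pair. The sketch below computes those counts and a per-group persistence share; the function names and the toy label vectors are assumptions for illustration, not part of IteraScope.

```python
from collections import Counter

def transition_counts(labels_a, labels_b):
    """Count items moving from each group in iteration A to each group in B."""
    return Counter(zip(labels_a, labels_b))

def persistence(labels_a, labels_b):
    """For each group in A, the largest share of its members that stay together in B."""
    flows = transition_counts(labels_a, labels_b)
    sizes = Counter(labels_a)
    best = {}
    for (a, _b), n in flows.items():
        best[a] = max(best.get(a, 0.0), n / sizes[a])
    return best

# Toy labels for the same ten items at K = 3 and K = 4.
labels_k3 = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
labels_k4 = [0, 0, 0, 0, 1, 1, 3, 3, 2, 2]

print(transition_counts(labels_k3, labels_k4))
print(persistence(labels_k3, labels_k4))  # group 1 splits evenly: share 0.5
```

A persistence share close to 1 corresponds to a wide horizontal band; a share near 0.5 signals a split at the next step.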

3. Use Domain Views to Break Metric Ties

When multiple configurations score similarly on quality metrics, propagate each to domain-linked views and compare. Domain coherence frequently discriminates between statistically equivalent alternatives:

The domain view often reveals the "right" answer instantly when metrics are ambiguous.

4. Inspect What Changes, Not Only What Persists

Transition-class propagation is most informative when applied to understand losses: which members leave a group at the next parameter step, and where do they go? The spatial, temporal, or semantic distribution of lost members reveals the nature of the transition:

This "what changes" perspective often generates more domain insight than examining static group composition at a single configuration.
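Isolating the members a group loses at the next parameter step can be sketched as follows. The helper lost_members is hypothetical: it treats the destination receiving most of the group's members as the successor and returns the indices of everyone else, keyed by where they went. Those indices are what would then be highlighted in a map, timeline, or text view.

```python
from collections import Counter, defaultdict

def lost_members(labels_a, labels_b, group):
    """Split a group's members into those following the main successor and the rest.

    Returns (successor, {destination_group: [item indices]}).
    Assumes `group` occurs at least once in labels_a.
    """
    members = [i for i, a in enumerate(labels_a) if a == group]
    destinations = Counter(labels_b[i] for i in members)
    successor, _ = destinations.most_common(1)[0]
    lost = defaultdict(list)
    for i in members:
        if labels_b[i] != successor:
            lost[labels_b[i]].append(i)
    return successor, dict(lost)

# Toy labels for eight items at two consecutive parameter values.
labels_a = [0, 0, 0, 0, 0, 1, 1, 1]
labels_b = [0, 0, 0, 2, 1, 1, 1, 1]

successor, lost = lost_members(labels_a, labels_b, group=0)
print(successor, lost)  # item 3 moved to group 2, item 4 to group 1
```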

5. Treat Archetype Completeness as Necessary, Not Sufficient

A "complete" iteration (containing representatives of all HDBSCAN-detected archetypes) captures the full diversity of recurrent patterns—making it a strong candidate. However:

6. Verify Seed Robustness for Partition-Based Methods

K-means and other partition-based methods depend on random initialization. Once a promising K is identified:

For topic modeling (NMF), seed sensitivity is typically lower due to the non-negative constraint, but verification remains good practice.
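One way to quantify seed robustness is to refit with many seeds and compare the resulting partitions pairwise with the adjusted Rand index (ARI). The sketch below assumes scikit-learn; the helper name seed_stability, the use of ARI as the agreement measure, and the synthetic data are illustrative choices, not necessarily those used in the paper's 30-seed sweep.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def seed_stability(X, k, seeds):
    """Fit K-means once per seed; return mean and minimum pairwise ARI."""
    runs = [KMeans(n_clusters=k, n_init=1, random_state=s).fit_predict(X)
            for s in seeds]
    scores = [adjusted_rand_score(runs[i], runs[j])
              for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return float(np.mean(scores)), float(np.min(scores))

# Hypothetical data: three groups in two dimensions.
rng = np.random.default_rng(2)
centers = np.array([[0, 0], [4, 0], [2, 4]])
X = np.vstack([rng.normal(c, 0.5, size=(60, 2)) for c in centers])

mean_ari, min_ari = seed_stability(X, k=3, seeds=range(30))
print(f"mean ARI={mean_ari:.3f}, min ARI={min_ari:.3f}")
```

ARI is invariant to label permutations, so runs that find the same groups under different cluster numbers still score 1. A low minimum flags individual seeds that land in a different solution even when the mean looks reassuring.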

7. Document Rejections Explicitly

When an alternative configuration is rejected, record:

Documented rejections serve two purposes: they strengthen confidence in the final selection (by demonstrating that alternatives were considered and found wanting), and they contribute to cumulative understanding (by revealing what the data does not support).

8. When in Doubt, Keep Multiple Candidates

The SI workflow does not require converging to a single "winner." If two configurations offer complementary views of the data—e.g., a coarse typology (K=7) and a fine-grained one (K=20)—both may be worth retaining as analytical lenses for different purposes. The knowledge gained from understanding how the coarse groups differentiate into fine-grained ones is itself a valuable analytical product.
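How a coarse typology differentiates into a finer one can be summarized by assigning each fine group to the coarse group that holds most of its members. The helper nesting below is a hypothetical sketch on toy labels; a purity below 1 marks a fine group that draws members from more than one coarse group.

```python
from collections import Counter, defaultdict

def nesting(coarse, fine):
    """Map each coarse group to its fine-grained children.

    Returns {coarse_group: [(fine_group, purity), ...]}, where purity is the
    share of the fine group's members that belong to that coarse group.
    """
    parents = defaultdict(Counter)
    for c, f in zip(coarse, fine):
        parents[f][c] += 1
    children = defaultdict(list)
    for f, counts in parents.items():
        parent, n = counts.most_common(1)[0]
        children[parent].append((f, n / sum(counts.values())))
    return {p: sorted(v) for p, v in children.items()}

# Toy labels for the same eight items under a coarse and a fine configuration.
coarse = [0, 0, 0, 0, 1, 1, 1, 1]
fine = [0, 0, 1, 1, 1, 2, 2, 3]

for parent, kids in nesting(coarse, fine).items():
    print(parent, kids)
```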