The first NMF run (ntopics = 5…25, 300 top unigram terms, 200 bigrams) reveals a large, stable topic whose top keywords include strange entries such as etx, lt, and gt—clearly encoding artefacts rather than meaningful research terms. This topic persists across all values of ntopics, always capturing a substantial share of documents, which is a strong signal that it reflects a data-quality issue rather than genuine thematic content.
Investigation of the raw text identifies 283 papers whose abstracts end with the marker <<ETX>> (a textual rendering of the end-of-text control character), a legacy encoding error in the source dataset that was not caught during data collection. The TF-IDF pipeline tokenised the marker into lt, gt, and etx (the lt and gt tokens presumably stem from the angle brackets being stored as the HTML entities &lt; and &gt;), forming a spurious but highly distinctive term cluster that NMF reliably isolated as its own topic. After removing these artefacts from all 283 affected abstracts and recomputing the TF-IDF matrix, the spurious topic disappears entirely, and the freed capacity is redistributed among genuine research themes.
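The cleaning step can be sketched as follows. This is a minimal illustration rather than the pipeline used for the study: the function names `strip_etx` and `clean_corpus` are hypothetical, and the code assumes the marker may occur either literally or with HTML-escaped angle brackets.

```python
import html
import re

# Assumption: the marker appears as "<<ETX>>" or as "&lt;&lt;ETX&gt;&gt;".
# The escaped form would explain why "lt" and "gt" end up as tokens.
ETX_PATTERN = re.compile(r"\s*<<\s*ETX\s*>>\s*", flags=re.IGNORECASE)


def strip_etx(abstract: str) -> str:
    """Return the abstract with any <<ETX>> marker removed."""
    # Note: html.unescape also decodes other entities such as "&amp;".
    unescaped = html.unescape(abstract)
    # Replace the marker (and surrounding whitespace) with a single space
    # so that words on either side are not glued together.
    return ETX_PATTERN.sub(" ", unescaped).strip()


def clean_corpus(abstracts):
    """Clean every abstract; also return how many contained the marker."""
    n_affected = sum(
        1 for a in abstracts if ETX_PATTERN.search(html.unescape(a))
    )
    return [strip_etx(a) for a in abstracts], n_affected
```

On the real corpus, the second return value would be expected to match the 283 affected papers reported above; the TF-IDF matrix then has to be recomputed from the cleaned abstracts.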
The artefact topic (top terms etx, lt, gt) is visible across all iterations. This discovery led to the identification of 283 papers with <<ETX>> encoding errors in their abstracts. The topic's persistence and size across all ntopics values made it immediately conspicuous in the Sankey view, an example of how the SI workflow can surface data-quality issues that would otherwise remain hidden in a single-run analysis.
After cleaning, the analyst reruns NMF for ntopics = 5…25. This round serves Phase 1 of the SI workflow: identifying a stable number of topics that captures the major research themes in the IEEE VIS corpus without over-fragmentation. The key question is whether the corpus is best described by a small number of broad themes or a larger number of specialised sub-communities.
| Parameter | Value |
|---|---|
| ntopics | 5 … 25, step 1 |
| Top terms (unigrams) | 300 |
| Top bigrams | 200 |
| Papers (after cleaning) | ~3,517 |
The IS display shows that approximately 15 recurrent thematic archetypes emerge across the sweep. HDBSCAN classifies only ~11% of topic instances as noise (unstable or non-recurring), indicating high overall topic stability. The metrics chart confirms that coherence stabilises in the range ntopics = 14…18, suggesting that the corpus naturally supports around 15 well-differentiated research themes. Two views are presented below: the first emphasises the metrics chart, the second highlights the bar-chart representation and 2D archetype embedding.
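The stability figures above can be derived from cluster labels along the following lines. This is a sketch on toy data: the function name `stability_summary` is hypothetical, and it relies on the HDBSCAN convention that unclustered points receive the label -1.

```python
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN's conventional label for points in no cluster

# Assumption: each run contributes ntopics topic instances, so the sweep
# 5..25 yields 5 + 6 + ... + 25 = 315 instances to be clustered.
N_INSTANCES_IN_SWEEP = sum(range(5, 26))


def stability_summary(labels):
    """Summarise archetype count and noise share from cluster labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_noise = counts.get(NOISE_LABEL, 0)
    n_archetypes = sum(1 for label in counts if label != NOISE_LABEL)
    return {
        "archetypes": n_archetypes,
        "noise_share": n_noise / len(labels),
        "stable_share": 1 - n_noise / len(labels),
    }


# Toy labels for ten topic instances (illustrative only, not study data).
summary = stability_summary([0, 0, 1, 1, -1, 2, 2, 2, -1, 0])
```

A noise share of about 0.11 corresponds to the ~11% reported above, leaving roughly 89% of topic instances assigned to a recurrent archetype.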
The analyst activates the Sankey transition view between ntopics = 15 and 16, combined with violin plots in split mode to assess membership confidence simultaneously. At ntopics = 16, a new "method" topic (shown in cyan) appears with 662 documents. The Sankey bands reveal that this topic collects method-related terms from multiple parent topics rather than emerging from a clean split of a single parent—a "wastebasket" pattern indicating conflated methodological vocabulary across unrelated research areas. The violin plots at ntopics = 15 show narrower, more concentrated distributions than at ntopics = 16, confirming higher membership confidence for the 15-topic solution.
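The flows behind a Sankey transition view, and the distinction between a clean split and a wastebasket, can be illustrated with a small sketch. All names here are hypothetical, each document is reduced to its dominant topic, and the 50% threshold is an arbitrary assumption rather than a criterion taken from the workflow.

```python
from collections import Counter


def transition_counts(assign_a, assign_b):
    """Count documents moving from each topic in run A to each in run B.

    Both arguments map a document id to its dominant topic label.
    """
    flows = Counter()
    for doc_id, src in assign_a.items():
        dst = assign_b.get(doc_id)
        if dst is not None:
            flows[(src, dst)] += 1
    return flows


def parent_shares(flows, target):
    """Share of the target topic's documents supplied by each source."""
    incoming = {src: n for (src, dst), n in flows.items() if dst == target}
    total = sum(incoming.values())
    return {src: n / total for src, n in incoming.items()} if total else {}


def looks_like_wastebasket(shares, max_parent_share=0.5):
    """Hypothetical heuristic: no single parent supplies the majority."""
    return bool(shares) and max(shares.values()) < max_parent_share


# Toy assignments (illustrative only): topic "m" draws evenly from a, b, c.
run_15 = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}
run_16 = {1: "a", 2: "m", 3: "b", 4: "m", 5: "c", 6: "m"}
shares = parent_shares(transition_counts(run_15, run_16), "m")
```

A topic produced by a clean split would show one parent with a share close to 1, whereas the "method" topic described above would show several parents with comparable shares.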
To understand the composition of the wastebasket topic, the analyst examines word clouds for the documents that transitioned into it. The frequency-weighted view (left figure) shows which terms are most common among the contributing documents. The term-weight-change view (right figure) visualizes how term weights shift during the transition: word size encodes the magnitude of change, while color encodes direction—blue for terms losing weight (leaving behind their source-topic identity) and red for terms gaining weight (acquiring the destination topic's character). Together, the two views confirm that the "method" topic aggregates generic methodological terms (method, approach, technique, based, model) drawn from a variety of unrelated research areas.
Generic terms such as based, method, and model appear prominently across all contributing sources, confirming that the wastebasket topic does not have a distinctive thematic identity.
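The size and colour encoding of the term-weight-change view can be expressed compactly. The sketch below assumes that each topic is available as a mapping from term to weight; the function name `weight_changes` and the toy weights are hypothetical.

```python
def weight_changes(source_weights, dest_weights):
    """Return (term, magnitude, colour) tuples, largest change first.

    Magnitude drives word size; colour is "red" for terms gaining weight
    and "blue" for terms losing weight. Unchanged terms are omitted.
    """
    changes = []
    for term in set(source_weights) | set(dest_weights):
        delta = dest_weights.get(term, 0.0) - source_weights.get(term, 0.0)
        if delta == 0:
            continue
        colour = "red" if delta > 0 else "blue"
        changes.append((term, abs(delta), colour))
    # Sort by descending magnitude, then alphabetically for stable output.
    changes.sort(key=lambda item: (-item[1], item[0]))
    return changes


# Toy weights (illustrative only, not values from the study).
source = {"time": 0.75, "temporal": 0.5, "spatial": 0.25}
dest = {"time": 0.25, "spatial": 0.75, "trajectory": 0.125}
encoded = weight_changes(source, dest)
```

Terms missing from one side are treated as having weight zero there, so a term that only exists in the destination topic is shown as a pure gain.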
The complete 16-topic word-cloud display provides a broader view. While 15 of the 16 topics retain coherent, interpretable thematic profiles identical to those in the 15-topic solution, the additional "method" topic (typically in the bottom-right cell) lacks thematic focus. Its word cloud is dominated by terms that could belong to any research area, further confirming its role as a wastebasket that absorbs generic vocabulary without contributing interpretive value.
The additional "method" topic collects generic terms (method, approach, technique) without thematic coherence, while the remaining 15 topics retain clearly distinct research profiles.
The final 15-topic solution produces clearly interpretable topics, each corresponding to a recognisable IEEE VIS research community. The frequency-weighted view (left) reveals the most commonly used terms within each topic, reflecting the vocabulary that practitioners in each area use most often. The TF-IDF-weighted view (right) highlights the most discriminative terms—those that best distinguish each topic from the rest of the corpus—providing a complementary lens for interpretation. Comparing the two views helps the analyst separate generic high-frequency vocabulary from the truly distinctive signature of each research theme.
The TF-IDF-weighted view favours discriminative terms such as node, edge, and layout over generic terms like data or visual that appear across many topics.
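The contrast between the two word-cloud weightings can be reproduced on a toy example. The sketch below is one plausible weighting, not necessarily the one used in the tool: it treats each topic as a single pooled document and computes the inverse document frequency over topics, and the name `topic_term_views` is hypothetical.

```python
import math
from collections import Counter


def topic_term_views(topic_docs):
    """Return, per topic, raw term frequencies and TF-IDF-style weights.

    topic_docs maps a topic label to a list of tokenised documents.
    """
    freq = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in topic_docs.items()
    }
    n_topics = len(freq)
    # Number of topics in which each term occurs at least once.
    topic_df = Counter(term for counts in freq.values() for term in counts)
    views = {}
    for topic, counts in freq.items():
        tfidf = {
            term: n * math.log(n_topics / topic_df[term])
            for term, n in counts.items()
        }
        views[topic] = (counts, tfidf)
    return views


# Toy corpus (illustrative only): "data" occurs in every topic.
views = topic_term_views({
    "graph": [["data", "node", "edge"], ["data", "node", "layout"]],
    "volume": [["data", "volume", "rendering"], ["data", "volume"]],
})
graph_freq, graph_tfidf = views["graph"]
```

In the frequency view, data ties with node as the most common term in the graph topic; in the weighted view its weight drops to zero because it occurs in every topic, which leaves node as the top term.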
Within the 15-topic solution, topic 12 relates to temporal data (top terms: time, temporal, series, event, sequence). The analyst notices that this topic becomes substantially smaller when moving from ntopics = 15 to 16, losing 106 of its 352 documents. This iteration investigates where these temporal papers migrate and what aspects of time-related research are absorbed by other thematic topics at higher granularity. Understanding this decomposition helps validate that ntopics = 15 is the appropriate level at which temporal research remains a coherent community rather than being fragmented across domain-specific topics.
The IS display below focuses on the transition around the time topic. The Sankey view reveals that the time topic acts as a "hub" that progressively differentiates as the number of topics K (= ntopics) increases: at low K, it encompasses all temporal research; as K grows, spatio-temporal papers migrate to the spatial topic, time-varying graph papers to the graph topic, and temporal event sequences to the event-analysis topic. At ntopics = 15, the time topic still holds these sub-communities together as a coherent whole; at ntopics = 16, the first major fragmentation occurs.
The first pair of word clouds shows the overall transition pattern to topic 15.12. The frequency view reveals which terms are most common among the documents that join the time topic, while the TF-IDF view highlights the most discriminative vocabulary of these migrating documents. Together, they confirm that the incoming papers carry time-related vocabulary (time, temporal, events, changes).
The second pair shows all destination topics for papers leaving the time topic at ntopics = 16. In the frequency view (left figure), each cell corresponds to a different destination. In the term-weight-change view (right figure), each cell shows how weights shift for that specific destination: size encodes magnitude, blue = terms losing weight, red = terms gaining weight. The multi-directional dispersal is evident: time-varying graphs move to the graph topic, temporal events to event-related topics, and streaming/dynamic data to analytics topics. This pattern confirms that the time topic at ntopics = 15 serves as an integrative hub for all temporal research.
The third pair provides a detailed view of the largest single outflow from the time topic: the 106 spatio-temporal papers assigned to topic 15.12 (time) but not to topic 16.12. These papers combine spatial and temporal analysis—covering trajectories, movement patterns, space-time cubes, and geographic event sequences. At ntopics = 15, their temporal vocabulary keeps them within the time topic; at ntopics = 16, their equally strong spatial vocabulary pulls them toward the spatial/geographic topic. The frequency view (left figure) shows term prevalence among these 106 papers. The term-weight view (right figure) uses a three-panel layout: source weights (left panel, cyan), destination weights (right panel, cyan), and weight changes (center panel—size encodes magnitude, blue = decrease, red = increase). This makes the dual identity concretely visible: temporal terms shrink while spatial terms grow.
To interpret the 15-topic solution in the context of the IEEE VIS community's evolution over 34 years (1990–2024), the analyst aggregates papers by their dominant topic and publication year. Eight alternative visualisations are produced, combining two chart types (stacked area vs. line), two quantity types (absolute count vs. proportion per year), and two smoothing options (raw data vs. Gaussian smoothing). Together, these views reveal clear long-term trends in the research landscape that would be invisible in any single-year snapshot.
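Key findings: visual analytics (VA) rises after 2004, the share of user-study papers grows to about 20%, rendering declines, and the share of high-dimensional data research remains stable.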
These trends are consistent with known shifts in the VIS research landscape and demonstrate that the discovered topics capture genuine, temporally coherent research communities rather than statistical artefacts. The fact that temporal patterns align with known historical events (VAST track founding, growth of HCI-oriented evaluation culture) provides external validation of the topic model's quality.
Stacked area charts show how the total publication volume is distributed among topics over time. The absolute-count versions reveal overall growth in the field, while the proportional versions isolate relative shifts between topics (controlling for the increasing number of papers per year). Gaussian smoothing removes year-to-year fluctuations caused by small sample sizes in early years and conference-cycle effects.
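The aggregation behind these charts can be sketched in a few functions. This is an illustration under assumptions: papers are reduced to (year, dominant topic) pairs, all function names are hypothetical, and the kernel width `sigma` is a free parameter that the text does not specify.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# 2 chart types x 2 quantity types x 2 smoothing options = 8 variants.
VARIANTS = list(product(
    ("stacked area", "line"),
    ("absolute count", "proportion per year"),
    ("raw", "gaussian"),
))


def counts_by_year(papers):
    """Count papers per year and dominant topic from (year, topic) pairs."""
    table = defaultdict(Counter)
    for year, topic in papers:
        table[year][topic] += 1
    return table


def proportions(table):
    """Convert per-year counts to per-year shares that sum to 1."""
    result = {}
    for year, counts in table.items():
        total = sum(counts.values())
        result[year] = {topic: n / total for topic, n in counts.items()}
    return result


def gaussian_smooth(values, sigma=1.0):
    """Smooth a yearly series with a truncated Gaussian kernel.

    The kernel is renormalised at the boundaries so that the first and
    last years are not pulled towards zero.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    radius = max(1, int(3 * sigma))
    smoothed = []
    for i in range(len(values)):
        num = den = 0.0
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        for j in range(lo, hi):
            w = math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
            num += w * values[j]
            den += w
        smoothed.append(num / den)
    return smoothed
```

Applying `gaussian_smooth` to one topic's yearly series, taken either from the counts or from the proportions, gives the smoothed variants; the raw variants plot the series unchanged.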
Line charts provide a complementary view where individual topic trajectories can be traced independently without the stacking effect that can obscure trends in lower layers. They are particularly useful for identifying crossing points (when one topic overtakes another in prevalence) and for comparing growth rates between specific topics of interest.
The table below summarises the four iterations of the topic-modelling workflow, showing how each iteration addressed specific analytical questions and built upon previous findings. The process illustrates the iterative, human-guided nature of the SI workflow: each round's findings inform the questions posed in the next round, progressively deepening the analyst's understanding of the corpus structure.
| Iteration | Focus | Workflow Phase | Key Finding |
|---|---|---|---|
| 0 | Data quality check | Phase 0 (preprocessing) | 283 papers contain <<ETX>> artefacts causing a spurious stable topic; removed before further analysis. Demonstrates how the SI workflow can surface hidden data-quality issues. |
| 1 | Topic count selection (ntopics = 5…25) | Phases 1–3 | ~15 stable archetypes emerge; ntopics = 16 introduces a "method" wastebasket topic that collects generic vocabulary from multiple parents; ntopics = 15 selected for coherence and membership confidence. |
| 2 | Time topic investigation | Phase 4 (domain validation) | Temporal research forms a coherent hub at ntopics = 15; at higher K, 106 spatio-temporal papers (with vocabulary like spatial, space-time, trajectory) disperse to domain-specific topics, validating ntopics = 15 as the integration level. |
| 3 | Temporal prevalence analysis | Phase 4 (temporal context) | Clear 34-year trends validated: VA rise post-2004, user-study growth to ~20%, rendering decline, stable high-dimensional data share. Alignment with known historical events provides external validation. |
Final selected configuration: NMF with ntopics = 15. This produces 15 coherent, interpretable research themes that are stable across the parameter sweep (~89% non-noise archetype instances), exhibit high membership confidence (narrow violin distributions), and show temporally valid prevalence patterns consistent with the known evolution of the IEEE VIS community over 34 years. The 16-topic alternative was rejected because its additional topic acts as a methodological wastebasket rather than a genuine research community.
End of Appendix C – Section 6.3