ATWL Session — VIS Trends Recommendation from ATWL

User #1

The attached document defines a language called ATWL, designed to represent visual analytics workflows. You need to understand the concepts and semantics the language is meant to express, as well as the syntax used to express them. In the next step, you will be given a "library" of VA workflows extracted from published papers. Your task will be to help a user design a workflow that achieves the user's goals by finding relevant examples in the "library" and suggesting how they can be adapted to the user's case.

Assistant #1

💭 Show reasoning

The user has provided me with a detailed document describing ATWL (Artifact–Transform Workflow Language), a declarative language for representing visual analytics workflows. Let me summarize my understanding of the key concepts:

**Core Building Blocks:**

1. **Artifacts** - identifiable analytical objects (8 types):
   - **entities** - collections of analytical objects (with internal structure, embedment, and features)
   - **feature** - descriptors of properties of entities
   - **arrangement** - organization of entities within a context
   - **visualisation** - visual representations for human perception
   - **pattern** - abstracted regularities/structures
   - **model** - formal/computational representations for prediction/simulation/explanation
   - **knowledge** - explicitly formulated knowledge (derived or injected)
   - **specification** - parameters, settings, constraints, or method choices
2. **Transforms** - how artifacts are derived/modified (8 generic intents):
   - **define-unit** - create/redefine entities as units of analysis
   - **characterise** - compute/transform features
   - **contextualise** - place entities into a context (arrangement)
   - **visualise** - create visual representations
   - **abstract** - derive patterns/conceptual structures
   - **build-model** - construct/refine models
   - **generate-knowledge** - formulate knowledge/specifications
   - **assess** - evaluate quality/adequacy of artifacts
3. **Control Structures:**
   - **Loops** - iterative refinement with qualitative stopping conditions
   - **Conditionals** - alternative paths
   - **Assignments** - binding artifact identifiers to new versions

**Key Concepts:**

- Internal structure (within entities): elementary, group/cluster, sequence, episode, region, formation
- Embedment (between entities): set, time, space, sequence, relational structure
- Value structure for features: atomic, list, vector, matrix, relational configuration
- Value types: numeric, ordinal, categorical, temporal, spatial, text, reference
- Actors: human, machine, hybrid

I've carefully read and understood the ATWL (Artifact–Transform Workflow Language) specification. Here's a brief summary of my understanding:

Key Concepts

Component	Role
Artifacts (8 types)	Analytical objects: entities, feature, arrangement, visualisation, pattern, model, knowledge, specification
Transforms (8 intents)	Operations: define-unit, characterise, contextualise, visualise, abstract, build-model, generate-knowledge, assess
Control Structures	Loops, conditionals, assignments for workflow logic

Core Semantics

Workflows are directed graphs: transforms consume input artifacts and produce output artifacts.
Entities have internal structure (how components within each entity are organized) and embedment (how entities relate to each other in shared environments).
Transforms have an intent (purpose), optional manner (how it's realized), and an actor (human/machine/hybrid).
Specifications serve as control knowledge, influencing how transforms execute.
Knowledge can be evaluative (from assess) or substantive (from generate-knowledge).
Templates provide high-level summaries of workflow stages using intent sequences.
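
To make the graph reading of these semantics concrete, here is a minimal Python sketch. It is only an illustration: the class names `Artifact` and `Transform`, their fields, and the `edges` helper are my own assumptions and are not defined by the ATWL document.

```python
from dataclasses import dataclass

# Hypothetical Python mirror of the ATWL concepts summarised above.
# The vocabularies come from the summary; the class design is an assumption.
ARTIFACT_TYPES = {"entities", "feature", "arrangement", "visualisation",
                  "pattern", "model", "knowledge", "specification"}
INTENTS = {"define-unit", "characterise", "contextualise", "visualise",
           "abstract", "build-model", "generate-knowledge", "assess"}
ACTORS = {"human", "machine", "hybrid"}


@dataclass(frozen=True)
class Artifact:
    name: str
    type: str

    def __post_init__(self):
        if self.type not in ARTIFACT_TYPES:
            raise ValueError(f"unknown artifact type: {self.type}")


@dataclass
class Transform:
    intent: str
    inputs: list
    outputs: list
    actor: str
    manner: str = ""  # optional description of how the intent is realised

    def __post_init__(self):
        if self.intent not in INTENTS:
            raise ValueError(f"unknown intent: {self.intent}")
        if self.actor not in ACTORS:
            raise ValueError(f"unknown actor: {self.actor}")


def edges(transforms):
    """Return the workflow graph as (input artifact, intent, output artifact) edges."""
    return [(i.name, t.intent, o.name)
            for t in transforms for i in t.inputs for o in t.outputs]


# Usage: one machine transform that derives a feature from entities.
papers = Artifact("papers", "entities")
vectors = Artifact("text_vectors", "feature")
step = Transform("characterise", [papers], [vectors], actor="machine")
workflow_edges = edges([step])
```

Validating the intent and actor against the fixed vocabularies keeps a hand-written workflow from drifting away from the eight intents and three actor kinds.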

I'm ready for the next step — please provide the library of VA workflows and/or the user's goals, and I'll help find relevant examples and suggest adapted workflows.

User #2

I have attached an additional file with the workflow library. You can perform an initial analysis to reveal reusable and adaptable parts.

Assistant #2

💭 Show reasoning

The user has provided a comprehensive library of 17 ATWL-represented visual analytics workflows. Let me analyze this library to identify reusable and adaptable parts, common patterns, and structural elements that can be leveraged when helping users design new workflows. Let me organize my analysis by:

1. **Common workflow templates/patterns** - recurring structural motifs
2. **Reusable workflow segments** - modular parts that appear across multiple workflows
3. **Domain categorization** - grouping by application area
4. **Transform intent patterns** - common sequences of intents

## Analysis

### 1. Common Structural Patterns

**Pattern A: Iterative Clustering Refinement**
This shows up across multiple workflows like Cluster-Calendar, MobilityGraphs, Events-to-Places, Progressive Clustering, ST Analysis, and Behaviour Pattern Recognition. The core structure follows a cycle: define a cluster unit, visualize it, assess the results, then either adjust parameters or exit the loop.

**Pattern B: Iterative Model Building/Refinement**
I'm seeing this pattern in Partition-based Regression, ST Modelling, Behaviour Pattern Recognition, and others: a similar iterative approach, but focused on constructing and refining analytical models rather than just clustering.

**Pattern C: Progressive Exploration**
This one appears in Dynamic Networks, Topic Modelling, Multivariate Temporal analysis, Random Forest, TensorFlow, and the What-If Tool. The workflow here is lighter: visualize first, then abstract away complexity, assess what you've learned, and decide whether to dig deeper or stop.

**Pattern D: Data Simplification/Abstraction Pipeline**
Found across Dynamic Networks, MobilityGraphs, EventFlow, and TensorFlow. These workflows focus on reducing data complexity through a structured pipeline.

**Pattern E: Prescriptive/Action Planning Loop**
EventAction follows this: generate a plan, estimate its impact, visualize the results, assess them, then either refine or wrap up.

Initial Analysis of the Workflow Library

I've analyzed all 17 workflows and identified reusable/adaptable structural patterns. Here's my synthesis:

1. Domain Categorization

Domain	Workflows	Key Data Types
Temporal data	1.1, 1.4, 1.5, 1.9	Time series, event sequences
Movement/spatial	1.3, 1.6, 1.7, 1.12	Trajectories, flows, places
Networks/graphs	1.2, 1.16	Dynamic networks, dataflow graphs
Text/documents	1.8	Document corpora
ML model analysis	1.10, 1.11, 1.13, 1.14, 1.15, 1.17	Models, predictions, residuals
Prescriptive analytics	1.5	Historical records + outcomes

2. Recurring Structural Motifs

Motif	Description	Found In
A: Iterative Clustering	`define-unit(cluster) → visualise → assess → [adjust | exit]`	1.1, 1.3, 1.6, 1.7, 1.11, 1.12
B: Model Build–Evaluate–Refine	`build-model → characterise(evaluate) → visualise → assess → [refine | exit]`	1.10, 1.11, 1.12, 1.13, 1.14
C: Progressive Exploration	`visualise → abstract → assess → [explore more | exit]`	1.2, 1.8, 1.9, 1.15, 1.16, 1.17
D: Vectorize–Project–Visualize	`characterise(vectors) → contextualise(DR) → visualise(scatterplot)`	1.2, 1.8, 1.12
E: Residual-driven Refinement	`build-model → compute residuals → visualise → assess → refine`	1.10, 1.11
F: Action Plan Tuning	`specify plan → estimate impact → visualise → assess → [refine | exit]`	1.5

3. Reusable Modular Segments

Segment 1: Similarity-based Spatial Overview

Source workflows: 1.2, 1.8, 1.12, 1.15
Template: characterise(vectorize) → [characterise(normalize)] → contextualise(projection) → visualise(scatterplot)
Adaptable to: Any scenario needing overview of entity similarities (documents, time series, images, model components).

Segment 2: Calendar/Context-based Temporal Distribution

Source workflows: 1.1, 1.3
Template: contextualise(calendar) → visualise(color-coded grid)
Adaptable to: Any temporal cluster membership displayed over a regular time structure.

Segment 3: Feature Engineering Loop

Source workflows: 1.12 (primary), 1.9, 1.10
Template: characterise(features) → define-unit(cluster) → visualise → assess → [refine features | exit]
Adaptable to: Any domain where feature adequacy must be iteratively refined via visual feedback.

Segment 4: Two-stage Spatial Clustering

Source workflow: 1.6
Template: define-unit(ST-cluster) → define-unit(filter noise) → define-unit(S-cluster)
Adaptable to: Any spatial event data needing noise filtering and place delineation.

Segment 5: Semi-supervised Model Steering

Source workflows: 1.8 (topics), 1.15 (boundaries)
Template: visualise(model output) → assess → generate-knowledge(user guidance) → define-unit/build-model(constrained refit)
Adaptable to: Any model where human domain knowledge should guide retraining.

Segment 6: Model Diagnostics Pipeline

Source workflows: 1.14, 1.17
Template: characterise(predictions) → characterise(explanations) → define-unit(group) → visualise(multi-level) → loop(explore → abstract → assess)
Adaptable to: Any trained classifier/regressor needing post-hoc interpretability analysis.

Segment 7: Progressive Simplification

Source workflows: 1.4 (EventFlow), 1.16 (TensorFlow)
Template: define-unit(simplify/cluster) → define-unit(extract auxiliary) → visualise → loop(assess → simplify further | exit)
Adaptable to: Any complex structure (graphs, sequences) needing iterative decluttering.

4. Common Intent Sequences (Templates)

Discovery:     characterise → visualise → abstract → generate-knowledge
Refinement:    assess → generate-knowledge(adjust) → [loop back]
Model building: generate-knowledge(specify) → build-model → assess
Simplification: define-unit(filter/merge) → visualise → assess
Contextualization: contextualise(arrange) → visualise → abstract

5. Cross-cutting Design Principles

Nested loops — outer loops change strategy/parameters; inner loops refine within a strategy (1.11, 1.14).
Specification-driven iteration — a specification artifact controls machine behavior; human updates it after assessment.
Coordinated multi-view visualization — almost all workflows combine ≥2 linked views for assessment.
Human-at-decision-points — machine does computation; human performs assessment, interpretation, and parameter/strategy decisions.
Progressive abstraction — raw data → features → clusters/patterns → knowledge, with quality gates at each level (1.9 exemplifies this most clearly).

I'm ready to help you design a workflow. Please describe your goals, data characteristics, and any constraints, and I'll match relevant library examples and suggest an adapted workflow.

User #3

I need a workflow design recommendation for the following task. I want to analyse the dataset that contains information on IEEE Visualization (IEEE VIS) publications from 1990-2024 as well as IEEE TVCG and IEEE CG&A articles published at IEEE VIS. The dataset includes for each paper the publication year, title, and abstract. I want to reveal the major research topics and how they evolved over the years. I am interested in seeing trends rather than minor fluctuations. Using the workflow library, compose an appropriate workflow.

Assistant #3

💭 Show reasoning

The user wants to analyze a dataset of IEEE VIS publications (1990-2024) containing year, title, and abstract for each paper. Their goals are:

1. Reveal major research topics
2. Show how topics evolved over years
3. Focus on trends rather than minor fluctuations

Let me think about which workflows from the library are most relevant:

1. **Workflow 1.8** (Topic modeling of document corpora) - This is the most directly relevant. It deals with text documents, topic extraction, and iterative refinement of topic models with human guidance.
2. **Workflow 1.1** (Temporal patterns with clustering) - Relevant for the temporal evolution aspect, particularly the calendar-based visualization and trend identification.
3. **Workflow 1.9** (Progressive temporal analysis) - Useful for revealing trends over time and progressive abstraction.
4. **Motif D** (Vectorize-Project-Visualize) - Gives an overview of paper similarities across the corpus.
5. **Motif C** (Progressive Exploration) - Enables iterative exploration of topics and their temporal patterns.

For the workflow structure, I need to start with topic discovery by processing the titles and abstracts to extract topics through vectorization and topic modeling, then move into the temporal phase where I can track how these topics shift across the decades.

Workflow Design: Research Topic Evolution in IEEE VIS (1990–2024)

Relevant Library Sources

Library Element	Contribution to Design
Workflow 1.8 (Topic modeling)	Core topic discovery loop with human-steered refinement
Workflow 1.1 (Temporal clustering)	Temporal profiling of cluster/topic membership over calendar time
Workflow 1.9 (Progressive abstraction)	Smoothing & trend-level abstraction over temporal data
Motif D (Vectorize→Project→Visualise)	Document similarity overview for topic validation
Motif B (Build–Evaluate–Refine)	Iterative model quality loop
Motif C (Progressive Exploration)	Explore → abstract → assess for trend interpretation

Workflow in ATWL Notation

WORKFLOW  TopicEvolutionAnalysis

── PHASE 1: DATA PREPARATION & TOPIC DISCOVERY  (adapted from 1.8) ──────────

ARTIFACTS
  papers : entities
    internal_structure: set of attributes {year: temporal, title: textual, abstract: textual}
    embedment: none (independent records)

  spec_model : specification
    content: {method: "BERTopic"|"LDA"|"NMF", num_topics: integer, 
              min_topic_size: integer, constraints: list}

TRANSFORMS

  T1: characterise
    input:   papers
    output:  text_vectors : feature
               structure: vector per entity (from title + abstract)
    manner:  vectorise text using sentence embeddings (or TF-IDF)
    actor:   machine

  T2: build-model
    input:   text_vectors, spec_model
    output:  topic_model : model
               content: topic-term distributions, document-topic assignments
    manner:  fit topic model per spec_model parameters
    actor:   machine

  T3: characterise
    input:   topic_model, papers
    output:  topic_assignments : feature
               structure: dominant topic label + probability per paper
    manner:  assign each paper to most probable topic; extract top terms per topic
    actor:   machine

  T4: contextualise
    input:   text_vectors, topic_assignments
    output:  doc_arrangement : arrangement
               structure: 2-D coordinates (UMAP/t-SNE projection coloured by topic)
    manner:  dimensionality reduction of text_vectors
    actor:   machine

  T5: visualise
    input:   doc_arrangement, topic_model
    output:  vis_topics : visualisation
               content: {scatterplot of documents coloured by topic,
                         topic-term bar charts / word clouds,
                         topic size bar chart}
    manner:  coordinated multiple views
    actor:   machine

  T6: assess
    input:   vis_topics
    output:  eval_topics : knowledge (evaluative)
               content: topic coherence, interpretability, granularity judgement
    actor:   human

  LOOP L1  (topic refinement — from Motif B / Workflow 1.8 steering loop)
    condition: eval_topics ≠ "satisfactory"

    T7: generate-knowledge
      input:   eval_topics
      output:  spec_model_update : specification
                 content: revised num_topics, merge/split instructions, 
                          keyword constraints, stopword additions
      manner:  analyst decides how to improve model
      actor:   human

    ASSIGN  spec_model ← spec_model_update
    GOTO T2

  END LOOP L1


── PHASE 2: TEMPORAL TOPIC PROFILING  (adapted from 1.1, 1.9) ───────────────

ARTIFACTS
  spec_smoothing : specification
    content: {window_size: integer, method: "moving_avg"|"loess"|"gaussian",
              time_granularity: "year"|"3-year"|"5-year"}

TRANSFORMS

  T8: characterise
    input:   topic_assignments, papers.year
    output:  topic_temporal_profiles : feature
               structure: matrix (topics × years), cell = paper count or proportion
    manner:  aggregate topic membership counts per year
    actor:   machine

  T9: characterise
    input:   topic_temporal_profiles, spec_smoothing
    output:  topic_trends : feature
               structure: smoothed time series per topic
    manner:  apply temporal smoothing (moving average / LOESS / Gaussian kernel)
    actor:   machine
    note:    suppresses minor fluctuations per user requirement

  T10: contextualise
    input:   topic_trends
    output:  temporal_arrangement : arrangement
               structure: topics positioned along shared time axis 1990–2024
    manner:  align all topic curves on common timeline
    actor:   machine

  T11: visualise
    input:   temporal_arrangement, topic_trends
    output:  vis_evolution : visualisation
               content: {streamgraph / stacked area chart (overall composition),
                         small-multiple line charts (individual topic trajectories),
                         horizon graph (compact comparison)}
    manner:  coordinated temporal views
    actor:   machine

  T12: assess
    input:   vis_evolution
    output:  eval_trends : knowledge (evaluative)
               content: judgement on smoothing level, readability, 
                        whether trends (not noise) are visible
    actor:   human

  LOOP L2  (smoothing/granularity adjustment — from Workflow 1.9 progressive abstraction)
    condition: eval_trends ≠ "satisfactory"

    T13: generate-knowledge
      input:   eval_trends
      output:  spec_smoothing_update : specification
      manner:  adjust smoothing window or switch time granularity
      actor:   human

    ASSIGN  spec_smoothing ← spec_smoothing_update
    GOTO T9

  END LOOP L2


── PHASE 3: TREND INTERPRETATION & KNOWLEDGE GENERATION ─────────────────────

ARTIFACTS
  spec_eras : specification
    content: {method: "manual"|"change-point detection"}

TRANSFORMS

  T14: abstract
    input:   vis_evolution, vis_topics
    output:  era_patterns : pattern
               content: {rising topics, declining topics, stable topics,
                         inflection points, dominant eras}
    manner:  identify qualitative phases (e.g., "rise of ML-based VIS post-2015")
    actor:   human

  T15: define-unit  (optional — era segmentation, adapted from 1.1 clustering)
    input:   topic_temporal_profiles, spec_eras
    output:  eras : entities
               internal_structure: contiguous year ranges with coherent topic mix
    manner:  segment timeline into research eras (manual or change-point detection)
    actor:   human + machine

  T16: visualise
    input:   eras, topic_trends
    output:  vis_eras : visualisation
               content: annotated timeline with era labels overlaid on streamgraph
    manner:  overlay era boundaries on vis_evolution
    actor:   machine

  T17: generate-knowledge
    input:   era_patterns, vis_eras, eval_topics
    output:  findings : knowledge (substantive)
               content: narrative summary of research evolution, 
                        key shifts, emerging/fading themes
    manner:  synthesise observations into structured report
    actor:   human

END WORKFLOW

High-Level Template

Phase 1 (Topic Discovery):
  characterise → build-model → characterise → contextualise → visualise → assess
    ↺ [generate-knowledge → rebuild model]

Phase 2 (Temporal Trends):
  characterise(aggregate) → characterise(smooth) → contextualise → visualise → assess
    ↺ [adjust smoothing]

Phase 3 (Interpretation):
  abstract → define-unit(eras) → visualise → generate-knowledge

Design Rationale & Adaptation Notes

Design Decision	Rationale
BERTopic as default method	Handles short texts (titles+abstracts) well; produces interpretable clusters; allows hierarchical merging — suited for adjusting granularity in L1
Smoothing before temporal visualisation	User explicitly wants trends not fluctuations; LOESS or 3–5 year windows appropriate for 35-year span
Streamgraph for evolution	Shows both individual topic trajectories and compositional change simultaneously (adapts 1.1's calendar view to continuous time)
Two sequential loops	L1 ensures topic quality before temporal analysis begins; L2 then tunes the trend-vs-noise tradeoff. Both reuse the specification-driven refinement loops of workflows 1.11/1.14, but they run one after the other rather than nested
Era segmentation (T15)	Optional but valuable for a 35-year span; provides narrative anchors ("the InfoVis era", "the deep-learning era")
Scatterplot overview (T4–T5)	From Motif D; provides spatial validation that topics are coherent clusters in embedding space

Suggested Tool Stack (for implementation)

Phase	Tools
Vectorization (T1)	Sentence-BERT, SciBERT
Topic model (T2)	BERTopic, Top2Vec, or Gensim LDA
Projection (T4)	UMAP
Temporal vis (T11)	D3.js streamgraph, Matplotlib, Altair
Smoothing (T9)	Pandas rolling / statsmodels LOESS

Would you like me to refine any phase (e.g., add a sub-workflow for comparing specific topic pairs, or add a citation-impact weighting step)?

User #4

To test this workflow, create a Jupyter Notebook (Python). Output the notebook as a JSON file.

Assistant #4

💭 Show reasoning

The user wants me to create a Jupyter Notebook (Python) that implements the workflow I designed for analyzing IEEE VIS publications (1990-2024) to reveal major research topics and their evolution over time. I'll create a complete notebook as a JSON file. The notebook should implement: 1. Phase 1: Data Preparation & Topic Discovery (vectorization, topic modeling, projection, visualization, assessment) 2. Phase 2: Temporal Topic Profiling (aggregation, smoothing, temporal visualization) 3. Phase 3: Trend Interpretation & Knowledge Generation (era segmentation, final visualization) I'll use BERTopic for topic modeling, UMAP for projection, and various visualization libraries.

(The assistant produced a complete Jupyter Notebook JSON implementing the workflow with BERTopic, UMAP, LOESS smoothing, stacked area charts, era segmentation, and a structured findings report.)

{
 "cells": [
  {"cell_type": "markdown", "source": ["# Research Topic Evolution in IEEE VIS (1990–2024)\n", ...]},
  {"cell_type": "code", "source": ["# Install required packages\n", "# !pip install bertopic sentence-transformers umap-learn ..."]},
  {"cell_type": "code", "source": ["import pandas as pd\n", "from bertopic import BERTopic\n", "from umap import UMAP\n", ...]},
  ... (28 cells total: markdown + code implementing all 3 phases)
 ],
 "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python"}},
 "nbformat": 4, "nbformat_minor": 4
}

Key Implementation Notes

Workflow Element	Implementation Choice
T1 (vectorise)	`sentence-transformers` with `all-MiniLM-L6-v2` (fast); upgrade to `allenai/scibert_scivocab_uncased` for better domain coverage
T2 (build-model)	BERTopic with configurable UMAP + HDBSCAN parameters
Loop L1	Manual flag `topics_satisfactory` + merge/reduce options
T9 (smooth)	LOESS by default (frac=0.15 ≈ 5-year effective window on 35 years)
Loop L2	Manual flag `smoothing_satisfactory` + parameter guidance
T14 (abstract)	Automatic trend classification (rising/declining/stable/peaked)
T15 (eras)	Cosine-dissimilarity based change-point detection
T16/T17	Annotated streamgraph + heatmap + structured text report

To run the notebook, update DATA_PATH to point to your IEEE VIS dataset CSV (expected columns: year, title, abstract). If the file isn't found, synthetic demo data is generated for testing the pipeline.
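
As a rough illustration of that loading behaviour, the sketch below uses only the standard library. It is not the notebook's actual code: the function names `load_papers` and `make_demo_papers`, the demo vocabulary, and the default of 200 synthetic records are all assumptions made for this example.

```python
import csv
import random
from pathlib import Path

REQUIRED_COLUMNS = ("year", "title", "abstract")

# Made-up vocabulary, used only to generate obviously synthetic test records.
DEMO_THEMES = {
    "volume": ["volume", "rendering", "scalar", "transfer", "isosurface"],
    "graphs": ["graph", "layout", "network", "node", "edge"],
    "ml": ["model", "neural", "embedding", "training", "classifier"],
}


def make_demo_papers(n=200, seed=0):
    """Generate synthetic papers so the pipeline can be exercised without real data."""
    rng = random.Random(seed)
    papers = []
    for i in range(n):
        theme = rng.choice(sorted(DEMO_THEMES))
        words = rng.choices(DEMO_THEMES[theme], k=12)
        papers.append({
            "year": rng.randint(1990, 2024),
            "title": f"Demo paper {i} on {theme}",
            "abstract": " ".join(words),
        })
    return papers


def load_papers(data_path):
    """Read year/title/abstract records from a CSV, or fall back to demo data."""
    path = Path(data_path)
    if not path.is_file():
        return make_demo_papers()
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = set(REQUIRED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return [{"year": int(row["year"]),
                 "title": row["title"],
                 "abstract": row["abstract"]} for row in reader]


# Usage: a path that does not exist triggers the synthetic fallback.
papers = load_papers("no_such_ieee_vis_file.csv")
```

Checking the column names up front means a mislabelled CSV fails immediately rather than partway through topic modelling.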

User #5

Describe the workflow in natural language.

Assistant #5

💭 Show reasoning

The user wants a natural language description of the workflow I designed for analyzing IEEE VIS publication topics and their evolution over time. I'll describe it clearly, walking through the phases and explaining what happens at each step, including the iterative loops.

Workflow Description: Research Topic Evolution in IEEE VIS (1990–2024)

Goal

The workflow takes a dataset of IEEE VIS publications (each described by year, title, and abstract) and reveals the major research topics within the community and how their prominence has changed over 35 years, emphasising long-term trends rather than year-to-year noise.

Phase 1: Topic Discovery

The workflow begins by preparing the text data. Each paper's title and abstract are concatenated into a single document and transformed into a numerical vector using a pre-trained sentence embedding model. These vectors capture semantic similarity — papers about related subjects end up with similar vectors.

Next, a topic model (BERTopic) is fitted to the document vectors. The model clusters semantically similar papers together and identifies representative keywords for each cluster. Each cluster corresponds to a research topic. Every paper is then assigned to its most probable topic, and the top terms characterising each topic are extracted.

To allow the analyst to visually verify topic quality, the high-dimensional vectors are projected into two dimensions using UMAP. The resulting document map is displayed as a scatterplot where each dot is a paper, coloured by its assigned topic. A companion bar chart shows the size of each topic. Together, these views let the analyst judge whether the topics are coherent, well-separated, and at an appropriate level of granularity.

The analyst then assesses the result. If topics are too fine-grained, too coarse, or contain incoherent mixtures, the analyst adjusts the model parameters — for example, changing the minimum topic size, merging similar topics, or constraining the total number of topics — and the model is rebuilt. This refinement loop repeats until the analyst is satisfied with topic quality.

Phase 2: Temporal Profiling

Once stable topics are established, the workflow shifts to temporal analysis. For each year in the dataset, the number (or proportion) of papers belonging to each topic is counted, producing a topic-by-year matrix.
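
The aggregation step can be sketched in a few lines of plain Python. The function name `topic_year_matrix` and the nested-dictionary layout are assumptions for illustration; the notebook may organise this differently.

```python
from collections import Counter, defaultdict


def topic_year_matrix(years, topics, normalise=False):
    """Return {topic: {year: value}} covering every year in the observed range.

    Values are paper counts, or per-year proportions when normalise is True.
    """
    counts = defaultdict(Counter)   # topic -> year -> number of papers
    per_year = Counter()            # year -> total number of papers
    for year, topic in zip(years, topics):
        counts[topic][year] += 1
        per_year[year] += 1

    all_years = range(min(per_year), max(per_year) + 1)
    matrix = {}
    for topic in sorted(counts):
        row = {}
        for year in all_years:
            count = counts[topic][year]  # Counter yields 0 for unseen years
            if normalise:
                row[year] = count / per_year[year] if per_year[year] else 0.0
            else:
                row[year] = count
        matrix[topic] = row
    return matrix


# Usage with four toy papers; 1992 has no papers and is filled with zeros.
years = [1990, 1990, 1991, 1993]
topics = [0, 1, 0, 0]
counts_matrix = topic_year_matrix(years, topics)
share_matrix = topic_year_matrix(years, topics, normalise=True)
```

Proportions are usually the safer choice when the number of papers per year grows over time, since raw counts would make every topic look like it is rising.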

Because raw yearly counts are noisy (especially in earlier years with fewer publications), a smoothing function is applied. By default, LOESS regression gently smooths each topic's time series so that multi-year trends become visible while short-lived fluctuations are suppressed. The smoothing window is chosen to reveal patterns spanning roughly five or more years.
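
To show the idea of suppressing short-lived fluctuations without extra dependencies, here is a centred moving average. Note that this is a simpler stand-in, not LOESS itself; the function name `moving_average` and the choice to shrink the window at the series edges are assumptions for this sketch.

```python
def moving_average(values, window=5):
    """Smooth a yearly series with a centred moving average.

    The window shrinks near the ends so the output has the same length
    as the input. `window` must be a positive odd integer.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Usage: a deliberately jagged toy series becomes much flatter.
raw = [0, 10, 0, 10, 0]
smooth = moving_average(raw, window=3)
```

A 5-year window on a 35-year series covers roughly the same share of the data as the LOESS fraction mentioned earlier, but LOESS fits local regressions and therefore handles slopes and edges more gracefully.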

The smoothed trends are then visualised in multiple coordinated views: a stacked area chart (streamgraph) showing how the overall topic composition evolves, a set of individual line charts (small multiples) showing each topic's trajectory in isolation, and a heatmap showing topic intensity across years.

The analyst assesses whether the smoothing level is appropriate. If the curves are still too jagged, the smoothing fraction is increased; if important transitions are being blurred out, it is decreased. This adjustment loop continues until the temporal visualisation clearly communicates long-term trends.

Phase 3: Interpretation and Knowledge Generation

With validated topic trends in hand, the analyst interprets the patterns. Each topic is automatically classified as rising (gaining prominence over time), declining (fading), stable (consistently present), or peaked (rose and then fell). This gives a quick structural summary of the field's evolution.
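
One possible heuristic for this classification is sketched below. The rule (comparing the mean of the first and last thirds of the series, and checking for an interior maximum) and the `tolerance` value are assumptions for illustration, not necessarily the rule used in the notebook.

```python
def classify_trend(series, tolerance=0.25):
    """Label a smoothed series as 'rising', 'declining', 'stable', or 'peaked'.

    `tolerance` is the minimum change, relative to the series maximum,
    that counts as a real difference (an assumed threshold).
    """
    n = len(series)
    if n < 3:
        raise ValueError("need at least three points")
    third = max(1, n // 3)
    start = sum(series[:third]) / third
    end = sum(series[-third:]) / third
    peak = max(series)
    scale = peak if peak > 0 else 1.0
    peak_pos = series.index(peak)

    # Peaked: the maximum lies in the middle third and clearly exceeds both ends.
    in_middle = third <= peak_pos < n - third
    if (in_middle and (peak - start) / scale > tolerance
            and (peak - end) / scale > tolerance):
        return "peaked"

    change = (end - start) / scale
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "declining"
    return "stable"


# Usage on four toy series.
labels = [
    classify_trend([1, 2, 3, 4, 5, 6, 7, 8, 9]),
    classify_trend([9, 8, 7, 6, 5, 4, 3, 2, 1]),
    classify_trend([5, 5, 5, 5, 5, 5, 5, 5, 5]),
    classify_trend([1, 2, 3, 8, 9, 8, 3, 2, 1]),
]
```

Such labels are only a starting point; the analyst should still check each topic's curve, since a fixed threshold can mislabel topics with very few papers.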

The workflow then segments the 35-year timeline into distinct research eras — contiguous periods during which the topic composition remains relatively stable, separated by years of notable compositional shift. For each era, the dominant topics are identified.
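
A minimal sketch of this segmentation, assuming eras are split wherever the cosine dissimilarity between consecutive years' topic compositions exceeds a threshold. The function names and the threshold of 0.3 are hypothetical choices for the example.

```python
import math


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two topic-composition vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty year gives no evidence of change
    return 1.0 - dot / (norm_a * norm_b)


def segment_eras(years, compositions, threshold=0.3):
    """Split the timeline into contiguous (first_year, last_year) eras.

    A new era starts at year i when the composition differs from year i-1
    by more than `threshold`.
    """
    if not years:
        return []
    eras = []
    start = 0
    for i in range(1, len(years)):
        if cosine_dissimilarity(compositions[i - 1], compositions[i]) > threshold:
            eras.append((years[start], years[i - 1]))
            start = i
    eras.append((years[start], years[-1]))
    return eras


# Usage: toy two-topic compositions with one abrupt shift after 2002.
years = [2000, 2001, 2002, 2003, 2004, 2005]
compositions = [[1, 0], [0.9, 0.1], [1, 0.05], [0.1, 0.9], [0, 1], [0.05, 1]]
eras = segment_eras(years, compositions)
```

Running this on smoothed rather than raw profiles tends to give fewer, longer eras, which matches the goal of showing trends rather than fluctuations.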

An annotated visualisation overlays era boundaries on the streamgraph, making it easy to see when the community transitioned from one set of dominant concerns to another.

Finally, the analyst synthesises all observations into a structured knowledge report: which topics have emerged, which have faded, when major shifts occurred, and what characterises each era of IEEE VIS research.

Iterative Structure

The workflow contains two human-in-the-loop feedback cycles:

Topic refinement loop — ensures the discovered topics are meaningful before any temporal analysis begins.
Smoothing adjustment loop — ensures the temporal visualisation reveals genuine trends at the right level of abstraction.

Both loops follow the same principle: the machine computes, the human assesses, and if the result is unsatisfactory, the human updates a specification that controls the machine's next iteration.
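
This shared principle can be written as one generic loop. The sketch below is an assumption-laden illustration: the function name `refine`, the three callbacks, and the `max_iterations` safety cap are mine; the workflow itself only specifies a qualitative stopping condition judged by the analyst.

```python
def refine(spec, compute, assess, update, max_iterations=10):
    """Repeat compute -> assess -> update until the assessment is satisfactory.

    compute(spec)          -> result           (machine step)
    assess(result)         -> (ok, feedback)   (human judgement, any callable here)
    update(spec, feedback) -> new spec         (human revises the specification)
    """
    for iteration in range(1, max_iterations + 1):
        result = compute(spec)
        ok, feedback = assess(result)
        if ok:
            return spec, result, iteration
        spec = update(spec, feedback)
    raise RuntimeError("no satisfactory result within max_iterations")


# Usage with stand-in callbacks: the "analyst" finds more than 20 topics
# too fine-grained and halves the requested number each round.
def fit_topics(spec):
    return spec["num_topics"]  # placeholder for fitting a topic model


def judge_granularity(num_topics):
    return num_topics <= 20, "too fine-grained"


def halve_topics(spec, feedback):
    return {**spec, "num_topics": spec["num_topics"] // 2}


final_spec, final_result, rounds = refine(
    {"num_topics": 40}, fit_topics, judge_granularity, halve_topics)
```

In an interactive notebook the `assess` step would be the analyst looking at the charts and setting a flag by hand, rather than a function call.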

Roles

Actor	Responsibilities
Machine	Text vectorisation, topic modeling, dimensionality reduction, smoothing, aggregation, change-point detection, rendering visualisations
Human	Assessing topic coherence, judging smoothing adequacy, interpreting trends, identifying eras, synthesising narrative findings

Output

The workflow produces three main outputs:

A validated set of research topics with representative keywords and a document map.
Smoothed temporal trend charts showing each topic's trajectory from 1990 to 2024.
A structured narrative summarising rising, declining, and stable themes, major transition points, and distinct research eras in the IEEE VIS community.

User #6

Regarding the Jupyter Notebook: you have included the use of BERTopic for topic modelling. It is hard to install; replace it with NMF. Output the modified notebook as a JSON file.

Assistant #6

💭 Show reasoning

The user wants me to replace BERTopic with NMF (Non-negative Matrix Factorization) for topic modeling in the Jupyter Notebook. NMF is available in scikit-learn and is much easier to install. I'll need to: 1. Remove BERTopic, HDBSCAN dependencies 2. Use TF-IDF vectorization + NMF from scikit-learn 3. Keep the sentence embeddings for the UMAP projection (document map), or alternatively use TF-IDF vectors for UMAP too 4. Adjust the topic extraction logic to work with NMF components 5. Maintain the overall workflow structure I'll use TF-IDF vectors for both the topic modeling and UMAP projection to keep dependencies minimal and installation straightforward.

(The assistant produced a modified Jupyter Notebook JSON replacing BERTopic with sklearn NMF + TF-IDF. Full notebook available in the JSON session log.)

{
 "cells": [
  {"cell_type": "markdown", "source": ["# Research Topic Evolution in IEEE VIS (1990–2024)\n", ...]},
  {"cell_type": "code", "source": ["# !pip install pandas numpy matplotlib seaborn scikit-learn umap-learn statsmodels scipy"]},
  {"cell_type": "code", "source": ["from sklearn.feature_extraction.text import TfidfVectorizer\n",
                                    "from sklearn.decomposition import NMF\n", ...]},
  ... (27 cells total: NMF-based implementation of all 3 phases)
 ],
 "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python"}},
 "nbformat": 4, "nbformat_minor": 4
}

Summary of Changes

Aspect	Before (BERTopic)	After (NMF)
Topic model	BERTopic (HDBSCAN clustering)	sklearn `NMF` on TF-IDF matrix
Text vectorisation	Sentence-BERT embeddings	`TfidfVectorizer` from sklearn
Clustering	HDBSCAN (automatic cluster count)	NMF with user-specified `n_topics`
2D projection	UMAP on dense embeddings	UMAP on sparse TF-IDF matrix
Topic assignment	Hard cluster membership	Argmax of NMF document-topic weights (soft weights available)
Temporal profiles	Hard counts only	Soft NMF weights summed per year (smoother)
Heavy dependencies removed	`bertopic`, `sentence-transformers`, `hdbscan`	—
Remaining dependencies	—	`scikit-learn`, `umap-learn`, `statsmodels`, `scipy`

The workflow logic, iterative loops, visualisations, and knowledge generation remain identical.