User #1

The attached archive includes published research papers containing descriptions of visual analytics workflows. Your task will be to use this as a workflow library for recommending workflows addressing new problems. Analyse the library and extract reusable patterns.

Assistant #1

💭 Show reasoning

The user wants me to analyze the library of visual analytics workflow papers and extract reusable patterns. Let me go through each paper and identify the key workflow patterns, then synthesize them into reusable patterns. Let me analyze the papers systematically:

1. A_Partition-Based_Framework_for_Building_and_Validating_Regression_Models - Interactive feature selection workflow for regression models using partition-based visualization and ranking.
2. Cluster_and_calendar_based_visualization_of_time_series_data - Clustering daily patterns + calendar visualization for time series exploration.
3. Episodes and Topics in Multivariate Temporal Data - Progressive abstraction: SAX encoding → topic modeling → distribution exploration.
4. EuroVis2019-Exploratity_model_analysis - Exploratory Model Analysis follows a structured pipeline from initial data exploration through problem specification, model generation, exploration, selection, and export.
5. EventAction - Prescriptive analytics for temporal sequences involves finding similar records, exploring their outcomes, reviewing recommendations, and tuning plans.
6. event_sequence_simplification - Temporal sequences get simplified through filtering, transformation, and find-and-replace operations.
7. explainMLVis_binary_classifiers - Visual diagnostics for classifiers move through outcome-level analysis, then feature-level explanations, down to instance-level inspection.
8. From_movement_tracks_through_events_to_places - Movement analysis extracts events from tracks, clusters them, delineates places, aggregates spatio-temporal data, and analyzes the results.
9. Human-in-the-Loop - Feature engineering for pattern recognition progresses through data representation, temporal abstraction, feature assessment, grouping and labeling, then evaluation.
10. MobilityGraphs - Mass mobility analysis simplifies space through clustering, then time through clustering, before visually exploring the resulting spatio-temporal patterns.
11. Reducing_Snapshots_to_Points - Dynamic network exploration discretizes data, vectorizes and normalizes it, applies dimensionality reduction, then visualizes and interacts with the results.
12. RfX - Random Forest exploration partitions trees into clusters, identifies representative decision trees, explores individual trees, and optimizes them.
13. The_What-If_Tool - clusters.

Now I'm identifying the core patterns across these papers...

Reusable Visual Analytics Workflow Patterns

After analysing all 17 papers in the library, I extracted the following reusable workflow patterns that can be applied to new problems:

Pattern 1: Progressive Abstraction

Source papers: Episodes and Topics, From Movement Tracks, MobilityGraphs, Reducing Snapshots to Points

Structure:

Start with raw, elementary data
Apply a first-level transformation/abstraction (e.g., symbolic encoding, event extraction)
Visualize and validate the intermediate representation
Apply a second-level abstraction (e.g., topic modelling, clustering)
Visualize and explore distribution of higher-level patterns in context

Applicable when: Data is hierarchically structured or too granular for direct insight; understanding requires building up from elements to patterns to distributions.

Pattern 2: Rank-then-Explore

Source papers: Partition-Based Regression Framework, Exploratory Model Analysis

Structure:

Compute quantitative relevance/quality measures for many candidates (features, models, variable pairs)
Rank candidates by measures; display as ordered table/list
Coordinate ranked list with small-multiple visualisations showing qualitative structure
User inspects top-ranked candidates visually to confirm relevance or spot issues
Select candidate and proceed (e.g., add to model, export)

Applicable when: The search space is large (many features, many models) and a combination of automated scoring + human judgement is needed.
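The ranking step of this pattern can be sketched as follows. This is a minimal illustration, not the measure used in the source papers: it assumes absolute Pearson correlation as the relevance score, and the function name `rank_features` and the toy data are hypothetical.

```python
import numpy as np

def rank_features(X, y, names):
    """Return (name, score) pairs sorted by |Pearson r| with y, highest first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        scores.append((name, abs(r)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy data (made up for illustration): y depends strongly on "a", weakly on "b".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = rng.normal(size=200)
y = 3.0 * a + 0.5 * b + rng.normal(scale=0.1, size=200)

ranking = rank_features(np.column_stack([a, b, c]), y, ["a", "b", "c"])
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The ordered list is only the quantitative half of the pattern; the top entries would still be inspected visually before one is selected.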

Pattern 3: Iterative Residual-Driven Model Refinement

Source papers: Partition-Based Regression Framework, Visual Diagnostics of Binary Classifiers

Structure:

Build an initial model using the most dominant effect
Compute residuals (or errors/misclassifications)
Visualize residuals conditioned on remaining candidate variables
Identify systematic patterns (bias, local structures, interactions)
Augment the model with a new variable/transformation addressing the pattern
Repeat from the residual computation until a stopping criterion is met

Applicable when: Building interpretable predictive models incrementally, especially when domain knowledge must guide feature selection.
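The loop above can be sketched in code, with the visual inspection replaced by an automatic stand-in. The sketch assumes a linear model fitted by least squares and uses the absolute correlation between residuals and each remaining variable as a proxy for "systematic pattern"; the function name, the threshold `min_abs_corr`, and the toy data are all hypothetical.

```python
import numpy as np

def residual_driven_selection(X, y, max_vars=3, min_abs_corr=0.2):
    """Greedy loop: fit, compute residuals, add the variable most correlated with them."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    selected = []
    residuals = y - y.mean()              # intercept-only starting model
    while len(selected) < max_vars:
        remaining = [j for j in range(p) if j not in selected]
        if not remaining:
            break
        corrs = {j: abs(np.corrcoef(X[:, j], residuals)[0, 1]) for j in remaining}
        best = max(corrs, key=corrs.get)
        if corrs[best] < min_abs_corr:    # stopping criterion: no clear pattern left
            break
        selected.append(best)
        design = np.column_stack([np.ones(n), X[:, selected]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef     # residuals of the augmented model
    return selected, residuals

# Toy data (made up): only columns 0 and 2 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=300)
selected, residuals = residual_driven_selection(X, y)
print("Selected columns:", selected)
```

In the pattern as described, the choice of the next variable is made by the analyst looking at the residual plots; the correlation test here only marks where that judgement would enter.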

Pattern 4: Cluster-Label-Distribute

Source papers: Cluster and Calendar, MobilityGraphs, Episodes and Topics, Human-in-the-Loop, Reducing Snapshots to Points

Structure:

Define a meaningful feature/distance space for the items
Cluster items by similarity (density-based or partition-based)
Visualise cluster representatives (prototypes, summaries, averages)
Interpret/label clusters using domain knowledge
Explore the distribution of clusters over context dimensions (time, space, categories)

Applicable when: A dataset contains many instances that need to be grouped for overview, and the analyst wants to understand when/where/why different groups occur.
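A minimal sketch of the cluster and distribute steps, assuming daily profiles as items and the weekday as the context dimension. The plain k-means with farthest-first initialisation is a simple stand-in for whichever clustering method a given paper uses, and the toy profiles are made up.

```python
import numpy as np
from collections import Counter

def kmeans(points, k, n_iter=20):
    """Plain k-means with farthest-first initialisation (deterministic)."""
    points = np.asarray(points, dtype=float)
    centres = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centres], axis=0)
        centres.append(points[int(np.argmax(d))])
    centres = np.array(centres)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = points[labels == c].mean(axis=0)
    return labels, centres

# Toy daily profiles (made up): workdays peak at midday, weekend days stay low.
workday = np.array([1.0, 5.0, 5.0, 1.0])
weekend = np.array([1.0, 1.5, 1.5, 1.0])
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 2
profiles = np.array([weekend if d in ("Sat", "Sun") else workday for d in days])
profiles = profiles + np.random.default_rng(1).normal(scale=0.1, size=profiles.shape)

labels, centres = kmeans(profiles, k=2)
distribution = Counter(zip(labels.tolist(), days))   # cluster x weekday counts
print(centres.round(1))      # cluster representatives to interpret and label
print(distribution)
```

The centres play the role of the cluster representatives, and the counter is the raw material for a calendar or heatmap view of where each cluster occurs.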

Pattern 5: Simplify → Analyse → Restore

Source papers: Temporal Event Sequence Simplification, MobilityGraphs, Reducing Snapshots to Points

Structure:

Start with complex, cluttered data
Apply user-driven simplifications (filtering, merging, replacing, aggregating)
Measure complexity reduction (e.g., visual element count)
Analyse the simplified view for patterns
Selectively restore detail when needed for deeper investigation
Record simplification history for reproducibility

Applicable when: Raw data is too noisy/complex for pattern detection; the signal-to-noise ratio must be improved interactively.
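The simplify, measure, and restore steps can be illustrated with a small pure-Python sketch. The event names and the helper functions `drop_events`, `collapse_repeats`, and `simplify` are hypothetical; the history list is one simple way to keep every intermediate state available for restoring detail.

```python
def drop_events(seq, unwanted):
    """Filter: remove event types that are irrelevant to the question."""
    return [e for e in seq if e not in unwanted]

def collapse_repeats(seq):
    """Merge: replace runs of the same event with a single occurrence."""
    out = []
    for e in seq:
        if not out or out[-1] != e:
            out.append(e)
    return out

def simplify(seq, steps):
    """Apply named steps in order and keep each intermediate state."""
    history = [("original", list(seq))]
    for name, fn in steps:
        history.append((name, fn(history[-1][1])))
    return history

raw = ["admit", "test", "test", "note", "test", "drug", "drug", "note", "discharge"]
history = simplify(raw, [
    ("drop notes", lambda s: drop_events(s, {"note"})),
    ("collapse repeats", collapse_repeats),
])
for name, state in history:
    print(f"{name:17s} {len(state)} events")   # complexity measure: element count
```

Restoring detail amounts to stepping back to an earlier entry of `history`, and the list of step names doubles as the reproducible simplification record.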

Pattern 6: Similarity-Based Prescription

Source papers: EventAction, Reducing Snapshots to Points

Structure:

Define the current record/state of interest
Compute similarity to archived/historical records
Visualise the distribution of outcomes for similar records
Identify actions/sequences correlated with desired outcomes
Construct a personalised action plan
Provide feedback on estimated likelihood of success
Iterate to refine

Applicable when: The goal is recommending future actions based on what has worked for similar cases in the past (prescriptive analytics).
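The first three steps can be sketched as a nearest-neighbour lookup. This assumes records are numeric feature vectors compared by Euclidean distance, which is a simplification; similarity between temporal event sequences generally needs a richer measure. The function name and the toy records are hypothetical.

```python
import numpy as np

def outcomes_of_similar(query, records, outcomes, k=3):
    """Return indices of the k most similar records and their outcome shares."""
    records = np.asarray(records, dtype=float)
    dists = np.linalg.norm(records - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    share = {}
    for i in nearest:
        share[outcomes[i]] = share.get(outcomes[i], 0.0) + 1.0 / k
    return nearest, share

# Toy archive (made up): two features per record plus a known outcome.
records = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
outcomes = ["success", "success", "fail", "fail", "fail", "success"]

nearest, share = outcomes_of_similar([0.2, 0.2], records, outcomes, k=3)
print("Similar records:", sorted(nearest.tolist()))
print("Outcome shares:", share)
```

The outcome shares correspond to the distribution shown to the user; identifying which actions separate the successful neighbours from the others is the next step and is not covered here.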

Pattern 7: Human-in-the-Loop Feature Engineering for Classification

Source papers: Human-in-the-Loop, UTOPIAN, RfX

Structure:

Derive interval-based or aggregate features from raw temporal/sequential data
Assess feature discriminability through visual inspection (distributions, projections)
Select/refine features iteratively
Group data by features (clustering or dimensionality reduction)
Visually inspect groups; label representative examples in bulk
Validate labelled set (e.g., k-NN test) before training a full model

Applicable when: Training labels are scarce, pattern definitions are vague, and domain expertise is needed to bridge raw data and meaningful categories.

Pattern 8: Exploratory Model Analysis (Multi-Problem, Multi-Model)

Source papers: Exploratory Model Analysis (EuroVis 2019), What-If Tool

Structure:

Explore data overview to understand attributes and distributions
Generate/explore candidate modelling problems (target, features, type)
Specify the problem; auto-generate diverse models (AutoML or equivalent)
Compare models on holdout predictions (confusion matrices, residuals, fairness)
Select preferred model(s) considering task-specific criteria beyond accuracy
Export models for deployment

Applicable when: The modelling goal itself is not fixed a priori; the analyst wants to discover what predictions are possible and robust.

Pattern 9: Spatial Event Extraction → Place Delineation → Flow Analysis

Source papers: From Movement Tracks through Events to Places, MobilityGraphs

Structure:

Compute dynamic attributes from trajectories (speed, direction, proximity)
Extract movement events by filtering on attribute thresholds (visually guided)
Cluster events spatially (and optionally temporally) using density-based methods
Build spatial buffers around clusters to delineate "places"
Aggregate events/trajectories by places and time intervals
Analyse aggregated time series or flows for spatio-temporal patterns

Applicable when: Relevant places are not predefined but must emerge from movement behaviour; analysis focuses on place-based temporal patterns.
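A rough sketch of the extract and aggregate steps. Note the substitution: instead of density-based clustering with spatial buffers, events are simply binned into square grid cells, which is a much cruder way to form "places". The speed threshold, cell size, function names, and the toy track are assumptions for illustration.

```python
import math
from collections import Counter

def stop_events(track, max_speed=0.5):
    """track: list of (t_seconds, x_m, y_m). Returns midpoints of slow segments."""
    events = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue
        speed = math.hypot(x1 - x0, y1 - y0) / dt      # metres per second
        if speed <= max_speed:
            events.append(((t0 + t1) / 2, (x0 + x1) / 2, (y0 + y1) / 2))
    return events

def aggregate_by_cell_and_hour(events, cell_size=100.0):
    """Count events per (grid cell, hour); grid cells stand in for delineated places."""
    counts = Counter()
    for t, x, y in events:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[(cell, int(t // 3600))] += 1
    return counts

# Toy track (made up): fast move, short stop near x = 305 m, fast move again.
track = [(0, 0, 0), (60, 300, 0), (120, 305, 0), (180, 306, 0), (240, 600, 0)]
events = stop_events(track)
counts = aggregate_by_cell_and_hour(events)
print(counts)
```

In the pattern itself the threshold is chosen with visual guidance and the places emerge from the event clusters, so both fixed parameters here would be interactive controls.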

Pattern 10: Model Explanation → Diagnostic Drill-Down

Source papers: Visual Diagnostics of Binary Classifiers, What-If Tool, RfX

Structure:

Outcome-level: Visualise overall prediction distribution and accuracy (confusion matrix, ROC, histograms)
Explanation-level: Aggregate instance-level explanations into groups; show which features drive decisions and how accurate each decision group is
Instance-level: Drill into individual data items to see raw feature values and understand root causes of errors
Generate diagnostic insights (data quality issues, missing features, label leakage)

Applicable when: A trained classifier needs semantic validation by domain experts, beyond aggregate accuracy metrics.
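The three levels of the drill-down can be mimicked with plain counting. This sketch simplifies the explanation level considerably: it groups instances by a single feature value rather than by aggregated instance-level explanations. The records and field names are made up.

```python
from collections import Counter, defaultdict

# Toy predictions (made up): true label, predicted label, and one feature.
records = [
    {"id": 1, "age_group": "young",  "label": 1, "pred": 1},
    {"id": 2, "age_group": "young",  "label": 0, "pred": 0},
    {"id": 3, "age_group": "young",  "label": 1, "pred": 1},
    {"id": 4, "age_group": "young",  "label": 0, "pred": 1},
    {"id": 5, "age_group": "senior", "label": 1, "pred": 0},
    {"id": 6, "age_group": "senior", "label": 1, "pred": 0},
    {"id": 7, "age_group": "senior", "label": 0, "pred": 0},
    {"id": 8, "age_group": "senior", "label": 1, "pred": 1},
]

# Outcome level: confusion counts keyed by (label, prediction).
confusion = Counter((r["label"], r["pred"]) for r in records)

# Group level: error rate per feature value.
by_group = defaultdict(list)
for r in records:
    by_group[r["age_group"]].append(r)
error_rate = {g: sum(r["label"] != r["pred"] for r in rs) / len(rs)
              for g, rs in by_group.items()}

# Instance level: misclassified items in the worst group, for manual inspection.
worst = max(error_rate, key=error_rate.get)
suspects = [r["id"] for r in by_group[worst] if r["label"] != r["pred"]]
print(confusion, error_rate, worst, suspects)
```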

Pattern 11: Temporal Segmentation + Symbolic Encoding + Co-occurrence Mining

Source papers: Episodes and Topics, Cluster and Calendar, Human-in-the-Loop

Structure:

Segment multivariate time series into episodes (sliding window or event-based)
Encode each attribute's variation within an episode symbolically (e.g., SAX)
Treat episodes as "documents" and attribute patterns as "words"
Apply topic modelling (LDA, NMF) to discover co-occurring patterns ("topics")
Interpret topics visually; merge semantically close topics
Explore topic distribution over time, space, or other context dimensions

Applicable when: The goal is to find and understand recurring combinations of behaviours across multiple attributes simultaneously.
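The symbolic encoding step can be sketched with a simplified SAX: z-normalise, average over equal segments, and map each segment mean to a letter using standard-normal quartile breakpoints. The four-letter alphabet, the function name `sax_word`, and the toy episode are assumptions; the resulting per-attribute words are what a topic model would then treat as the "words" of an episode "document".

```python
import numpy as np

BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])  # standard-normal quartiles, 4 symbols
ALPHABET = "abcd"

def sax_word(series, n_segments):
    """Encode one attribute's variation within an episode as a short string."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)
    paa = np.array([seg.mean() for seg in np.array_split(z, n_segments)])
    return "".join(ALPHABET[i] for i in np.digitize(paa, BREAKPOINTS))

rising = sax_word(range(8), 4)
falling = sax_word([7, 6, 5, 4, 3, 2, 1, 0], 4)
print(rising, falling)

# One episode with two attributes (values made up) becomes a two-word document.
episode = {"speed": [1, 2, 4, 8, 9, 9, 9, 9], "load": [5, 5, 4, 4, 2, 2, 1, 1]}
document = " ".join(f"{attr}_{sax_word(vals, 4)}" for attr, vals in episode.items())
print(document)
```

Prefixing each word with its attribute name keeps patterns from different attributes apart in the vocabulary.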

Pattern 12: Interactive Model Steering (Semantic Interaction)

Source papers: UTOPIAN, What-If Tool, RfX, Partition-Based Regression

Structure:

System produces an initial model/result automatically
User inspects the result through coordinated visualisations
User directly manipulates the result (merge topics, adjust thresholds, edit counterfactuals, prune trees)
System re-computes model incorporating user changes
Updated result is shown immediately for validation
Iterate until the user is satisfied

Applicable when: The model requires domain knowledge for refinement that cannot be easily formalised as constraints; experts need to inject knowledge interactively.

Summary Table

#	Pattern Name	Key Mechanism	Typical Output
1	Progressive Abstraction	Multi-level aggregation with validation at each level	High-level interpretable patterns
2	Rank-then-Explore	Quantitative ranking + qualitative visual inspection	Selected features/variables
3	Residual-Driven Refinement	Iterative error analysis and model augmentation	Interpretable multi-variable model
4	Cluster-Label-Distribute	Cluster → interpret → explore distribution	Labelled groups with contextual understanding
5	Simplify → Analyse → Restore	Complexity reduction with reversibility	Clean visual overview + preserved detail
6	Similarity-Based Prescription	Find similar → show outcomes → plan actions	Recommended action sequences
7	Human-in-the-Loop Feature Engineering	Feature design + bulk visual labelling + validation	Training set for ML model
8	Exploratory Model Analysis	Problem exploration + multi-model comparison	Deployed predictive model(s)
9	Event → Place → Flow	Event extraction → spatial clustering → aggregation	Spatio-temporal activity patterns
10	Explanation Drill-Down	Aggregate → explanation groups → raw instances	Diagnostic insights for model improvement
11	Symbolic Encoding + Topic Mining	SAX + topic modelling + distribution exploration	Multi-attribute temporal patterns
12	Interactive Model Steering	Direct manipulation → re-computation loop	Domain-tuned model/result

These patterns can be composed (e.g., Pattern 4 inside Pattern 1, or Pattern 3 using Pattern 2 at each step) and adapted to specific data types and analysis goals.

User #2

I need a workflow recommendation for the following task. I want to analyse the dataset that contains information on IEEE Visualization (IEEE VIS) publications from 1990-2024 as well as IEEE TVCG and IEEE CG&A articles published at IEEE VIS. The dataset includes for each paper the publication year, title, and abstract. I want to reveal the major research topics and how they evolved over the years. I am interested in seeing trends rather than minor fluctuations. Using the workflow library, compose an appropriate workflow.

Assistant #2

💭 Show reasoning

The user wants to analyse a dataset of IEEE VIS publications (1990-2024) containing year, title, and abstract. Their goals are: 1. Reveal major research topics 2. Show how topics evolved over the years 3. Focus on trends rather than minor fluctuations Let me compose a workflow from the patterns I identified. The most relevant patterns are: - Pattern 11: Temporal Segmentation + Symbolic Encoding + Co-occurrence Mining - This is about discovering co-occurring patterns using topic modelling. The "documents" here are the papers themselves (title + abstract), and we can apply topic modelling directly. - Pattern 4: Cluster-Label-Distribute - Cluster papers by similarity, label the clusters, then explore distribution over time. - Pattern 1: Progressive Abstraction - Start with raw papers, abstract to topics, then look at temporal distributions. - Pattern 5: Simplify → Analyse → Restore - To focus on trends rather than minor fluctuations, we need smoothing/simplification. - Pattern 12: Interactive Model Steering - To refine topics interactively (merge, split, relabel). So the workflow would start with text preprocessing on the abstracts to clean them up, then move into topic extraction and temporal analysis.

Recommended Workflow: Research Topic Evolution in IEEE VIS (1990–2024)

Composed from Patterns: 1 → 11 → 12 → 4 → 5

Overview Diagram

Raw Papers (year, title, abstract)
        │
        ▼
┌─────────────────────────┐
│ 1. Text Preprocessing   │  (Pattern 1 – first abstraction level)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 2. Topic Extraction     │  (Pattern 11 – topic mining)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 3. Interactive Refine   │  (Pattern 12 – model steering)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 4. Label & Distribute   │  (Pattern 4 – cluster-label-distribute)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 5. Trend Smoothing      │  (Pattern 5 – simplify for trends)
└───────────┬─────────────┘
            ▼
      Trend Narratives

Step-by-Step Workflow

Step 1 — Text Preprocessing (Pattern 1: Progressive Abstraction, Level 1)

Action	Detail
Concatenate title + abstract per paper	Creates one "document" per publication
Tokenise, lowercase, remove stopwords	Standard NLP pipeline
Remove domain-generic terms	e.g., "paper", "propose", "result", "method"
Optional: bigram/trigram collocation	Preserves phrases like "parallel coordinates", "deep learning"
Represent as bag-of-words or TF-IDF matrix	Input for topic model

Output: Document-term matrix (≈ 4 000–5 000 papers × vocabulary)

Step 2 — Topic Extraction (Pattern 11: Topic Mining)

Action	Detail
Choose topic model	LDA, NMF, or BERTopic (embedding-based)
Set initial k (number of topics)	Start with k = 25–40 (intentionally over-segment)
Fit model	Each paper receives a topic-probability vector
Inspect top terms per topic	Ranked word lists or word clouds
Inspect representative papers per topic	Highest-probability papers as exemplars

Output: k candidate topics, each described by top terms and representative papers; per-paper topic assignments (soft or hard).

Rationale for over-segmentation: It is easier to merge semantically close topics in the next step than to split overly broad ones.

Step 3 — Interactive Topic Refinement (Pattern 12: Interactive Model Steering)

Action	Detail
Visualise topic similarity	2-D projection (t-SNE/UMAP of topic-term vectors) or dendrogram
Merge near-duplicate topics	e.g., "graph drawing" + "network layout" → single topic
Split incoherent topics	If top terms mix unrelated concepts, re-cluster that subset
Remove junk topics	Boilerplate or overly generic clusters
Re-label topics with human-readable names	e.g., "Volume Rendering", "Text & NLP", "Uncertainty Visualisation"
Validate	Check that representative papers match the label

Output: Refined set of ~10–20 well-labelled major topics; each paper assigned to one or more topics.

Iterate until topic set is stable and interpretable.
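To support the merge decision, candidate pairs can be listed automatically and then confirmed or rejected by the analyst. The sketch below assumes a topic-term weight matrix `H` (rows are topics) and flags pairs whose cosine similarity exceeds a threshold; the threshold of 0.5 and the function name are arbitrary illustrative choices.

```python
import numpy as np

def merge_candidates(H, threshold=0.5):
    """Return (i, j, similarity) for topic pairs above the threshold, most similar first."""
    H = np.asarray(H, dtype=float)
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    unit = H / np.where(norms == 0, 1.0, norms)     # avoid division by zero
    sim = unit @ unit.T
    pairs = [(i, j, float(sim[i, j]))
             for i in range(len(H)) for j in range(i + 1, len(H))
             if sim[i, j] >= threshold]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Toy topic-term matrix (made up): topics 0 and 1 share their top terms.
H = [[1.0, 1.0, 0.0, 0.0],
     [1.0, 0.8, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
pairs = merge_candidates(H)
for i, j, s in pairs:
    print(f"Topics {i} and {j}: similarity {s:.2f}")
```

High term similarity is only a hint; two topics can share vocabulary and still deserve separate labels, which is why the final call stays with the analyst.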

Step 4 — Label & Distribute over Time (Pattern 4: Cluster-Label-Distribute)

Action	Detail
For each year, count papers per topic	Absolute counts or proportions (share of total)
Construct a year × topic table of paper counts	Basis for temporal visualisation
Visualise as stacked area chart or ThemeRiver	Shows relative topic prominence over 35 years
Alternatively: small-multiple line charts	One line per topic; easier to compare slopes

Output: Raw temporal distribution of topics (will still be noisy year-to-year).
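The counting step can be sketched with pandas. The column names and the six toy rows are made up; `pd.crosstab` with `normalize="index"` gives per-year shares directly.

```python
import pandas as pd

# Toy assignment of papers to topics (made up for illustration).
papers = pd.DataFrame({
    "year":  [2000, 2000, 2000, 2001, 2001, 2002],
    "topic": ["Volume Rendering", "Volume Rendering", "Graphs",
              "Graphs", "Graphs", "Volume Rendering"],
})

counts = pd.crosstab(papers["year"], papers["topic"])                      # absolute
shares = pd.crosstab(papers["year"], papers["topic"], normalize="index")   # per-year share
print(counts)
print(shares.round(2))
```

Shares are usually the better basis for trend reading here, because the total number of papers per year changes a lot over the period.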

Step 5 — Trend Smoothing & Narrative (Pattern 5: Simplify → Analyse → Restore)

Action	Detail
Apply temporal smoothing	Moving average (window = 3–5 years) or LOESS/lowess curve per topic
Suppress minor fluctuations	Focus on monotone rises, falls, or plateaus
Identify trend phases	"Emerging" (recent upward), "Declining" (downward), "Stable" (flat)
Annotate key inflection points	e.g., "Deep-learning topics surge after 2016"
Restore detail on demand	Allow drill-down to individual years/papers for any interesting inflection

Output: Smooth trend lines per topic with annotations; classification of topics into trend categories.
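One possible way to turn a smoothed series into a trend category is sketched below. It assumes a centred moving average followed by a straight-line fit, and labels the topic by the fitted rise over the whole period relative to the mean level. The tolerance of 0.3, the function name, and the synthetic series are illustrative assumptions, not part of the workflow library.

```python
import numpy as np
import pandas as pd

def classify_trend(shares, window=5, tol=0.3):
    """Label a yearly share series as 'emerging', 'declining', or 'stable'."""
    smooth = shares.rolling(window, center=True, min_periods=1).mean()
    years = smooth.index.to_numpy(dtype=float)
    x = years - years.mean()                    # centre years for a stable fit
    slope, _ = np.polyfit(x, smooth.to_numpy(), 1)
    level = smooth.mean()
    if level <= 0:
        return "stable"
    change = slope * (years[-1] - years[0]) / level   # fitted rise relative to mean level
    if change > tol:
        return "emerging"
    if change < -tol:
        return "declining"
    return "stable"

# Synthetic series (made up): a steady rise and a flat line, both with yearly wiggle.
years = pd.RangeIndex(1990, 2025)
wiggle = np.where(np.arange(len(years)) % 2 == 0, 0.01, -0.01)
rising = pd.Series(np.linspace(0.02, 0.20, len(years)) + wiggle, index=years)
flat = pd.Series(0.10 + wiggle, index=years)
falling = pd.Series(rising.to_numpy()[::-1], index=years)

labels = {name: classify_trend(s)
          for name, s in {"rising": rising, "flat": flat, "falling": falling}.items()}
print(labels)
```

A single straight line cannot describe a topic that rises and then falls, so such topics would still need to be identified from the smoothed curves and annotated by hand.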

Final Deliverables

Artefact	Description
Topic catalogue	Table of major research topics with label, top terms, and 3–5 exemplar papers
Trend chart	Smoothed stacked area or small-multiple line chart (1990–2024)
Trend summary	Narrative listing emerging topics (e.g., ML4VIS, fairness), declining topics (e.g., VRML, early VR), and stable pillars (e.g., volume rendering, information visualisation)
Interactive view (optional)	Linked dashboard: select a topic → see its trend + representative papers per period

Justification of Pattern Composition

User Requirement	Addressed by
"Reveal major research topics"	Patterns 11 + 12 (topic extraction + refinement)
"How they evolved over the years"	Pattern 4 (distribute over time)
"Trends rather than minor fluctuations"	Pattern 5 (temporal smoothing)
Interpretability & trust	Pattern 1 (progressive abstraction with validation at each level)

Tool Suggestions (optional)

Topic modelling: scikit-learn (NMF/LDA), BERTopic, or MALLET
Visualisation: Matplotlib/Seaborn (static), Streamlit + Altair (interactive), or dedicated VA tools like pyLDAvis
Smoothing: statsmodels' lowess (statsmodels.nonparametric.smoothers_lowess.lowess) or pandas rolling windows (DataFrame.rolling)

User #3

To test this workflow, create a Jupyter Notebook (Python).

Assistant #3

💭 Show reasoning

The user wants me to create a Jupyter Notebook in Python that implements the workflow I recommended for analyzing IEEE VIS publications (1990-2024) to reveal major research topics and their evolution over time. I'll create a complete Jupyter Notebook with all the steps:

1. Text Preprocessing
2. Topic Extraction (using NMF or LDA)
3. Interactive Topic Refinement (manual merging/labeling)
4. Label & Distribute over Time
5. Trend Smoothing & Narrative

I'll structure it as a proper notebook with markdown cells and code cells. I'll make the notebook flexible to handle a CSV dataset with year, title, and abstract columns.

(The assistant produced a complete Python implementation. The code is shown below as it was presented in the session — as a percent-format notebook script.)


# %% [markdown]
# # IEEE VIS Research Topic Evolution (1990–2024)
#
# This notebook implements the recommended visual analytics workflow to reveal
# major research topics in IEEE VIS publications and how they evolved over time.
#
# **Workflow:** Progressive Abstraction → Topic Mining → Interactive Steering →
# Cluster-Label-Distribute → Simplify for Trends

# %% [markdown]
# ## 0. Setup & Installation

# %%
# !pip install pandas numpy scikit-learn matplotlib seaborn nltk wordcloud adjustText

# %%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# NLP
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Topic Modelling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.manifold import TSNE

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

print("Setup complete.")

# %% [markdown]
# ## 1. Data Loading & Exploration
#
# Expects a CSV file with at least columns: `Year`, `Title`, `Abstract`.

# %%
# === CONFIGURE THIS ===
DATA_PATH = "ieee_vis_papers.csv"  # <-- Adjust path to your dataset

# Load data
df = pd.read_csv(DATA_PATH)
df.columns = df.columns.str.strip().str.lower()

COLUMN_MAP = {
    'year': 'year',
    'title': 'title',
    'abstract': 'abstract',
}
df = df.rename(columns=COLUMN_MAP)

assert 'year' in df.columns
assert 'title' in df.columns
assert 'abstract' in df.columns

print(f"Dataset shape: {df.shape}")
print(f"Year range: {df['year'].min()} – {df['year'].max()}")
df.head(3)

# %%
# Drop rows with missing or very short abstracts (50 characters or fewer)
before = len(df)
df = df.dropna(subset=['abstract']).copy()
df = df[df['abstract'].str.strip().str.len() > 50]
print(f"Dropped {before - len(df)} rows. Remaining: {len(df)}")

# %%
# Papers per year
fig, ax = plt.subplots(figsize=(12, 4))
df.groupby('year').size().plot(kind='bar', ax=ax, color='steelblue', width=0.8)
ax.set_title("Number of Papers per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Count")
plt.tight_layout()
plt.show()

# %% [markdown]
# ## 2. Text Preprocessing (Pattern 1 – Progressive Abstraction, Level 1)

# %%
df['text'] = df['title'].fillna('') + ' ' + df['abstract'].fillna('')

CUSTOM_STOPWORDS = {
    'paper', 'propose', 'proposed', 'approach', 'method', 'methods',
    'result', 'results', 'show', 'shown', 'use', 'used', 'using',
    'present', 'presented', 'new', 'novel', 'technique', 'techniques',
    'based', 'provide', 'system', 'also', 'work', 'problem',
    'describe', 'described', 'demonstrate', 'demonstrated',
    'algorithm', 'algorithms', 'however', 'many', 'one', 'two',
    'first', 'second', 'allow', 'allows', 'well', 'can', 'may',
    'different', 'several', 'existing', 'within', 'number',
    'introduction', 'conclusion', 'abstract', 'study', 'studies',
    'example', 'examples', 'application', 'applications'
}

stop_words = set(stopwords.words('english')).union(CUSTOM_STOPWORDS)
lemmatizer = WordNetLemmatizer()

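def preprocess_text(text):
    """Lowercase, strip non-letters, drop stopwords, and lemmatize."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = []
    for t in text.split():
        if t in stop_words:
            continue
        # Check the lemma too, so plurals such as "systems" hit the stopword list
        lemma = lemmatizer.lemmatize(t)
        if lemma not in stop_words and len(lemma) > 2:
            tokens.append(lemma)
    return ' '.join(tokens)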

df['processed_text'] = df['text'].apply(preprocess_text)
print("Preprocessing complete.")

# %% [markdown]
# ## 3. Topic Extraction (Pattern 11 – Topic Mining)
#
# NMF on TF-IDF features; intentionally over-segment (k=30).

# %%
N_TOPICS = 30
MAX_FEATURES = 5000
MIN_DF = 5
MAX_DF = 0.7
NGRAM_RANGE = (1, 2)

# %%
tfidf_vectorizer = TfidfVectorizer(
    max_features=MAX_FEATURES,
    min_df=MIN_DF, max_df=MAX_DF,
    ngram_range=NGRAM_RANGE,
    token_pattern=r'(?u)\b[a-z][a-z]+\b'
)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'])
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"TF-IDF matrix: {tfidf_matrix.shape[0]} docs × {tfidf_matrix.shape[1]} terms")

# %%
nmf_model = NMF(n_components=N_TOPICS, random_state=42, max_iter=500, init='nndsvda')
W = nmf_model.fit_transform(tfidf_matrix)
H = nmf_model.components_
print(f"NMF complete. Reconstruction error: {nmf_model.reconstruction_err_:.2f}")

# %%
N_TOP_TERMS = 12

def display_topics(model, feature_names, n_top=N_TOP_TERMS):
    topics = []
    for idx, topic_vec in enumerate(model.components_):
        top_indices = topic_vec.argsort()[:-n_top - 1:-1]
        top_terms = [feature_names[i] for i in top_indices]
        topics.append(top_terms)
        print(f"Topic {idx:2d}: {', '.join(top_terms)}")
    return topics

print("=" * 70)
print("TOP TERMS PER TOPIC")
print("=" * 70)
topic_terms = display_topics(nmf_model, feature_names)

# %%
df['dominant_topic'] = W.argmax(axis=1)
df['topic_weight'] = W.max(axis=1)

topic_distributions = pd.DataFrame(
    W, columns=[f"topic_{i}" for i in range(N_TOPICS)], index=df.index
)
print("\nPapers per topic:")
print(df['dominant_topic'].value_counts().sort_index())

# %% [markdown]
# ## 4. Interactive Topic Refinement (Pattern 12 – Model Steering)

# %%
from sklearn.metrics.pairwise import cosine_similarity

topic_similarity = cosine_similarity(H)

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(topic_similarity, annot=False, cmap='YlOrRd',
            xticklabels=range(N_TOPICS), yticklabels=range(N_TOPICS), ax=ax)
ax.set_title("Topic-Topic Cosine Similarity")
plt.tight_layout()
plt.show()

# %%
# Representative papers per topic
for t in range(N_TOPICS):
    topic_papers = df[df['dominant_topic'] == t].nlargest(3, 'topic_weight')
    print(f"\n--- Topic {t} ---")
    print(f"    Terms: {', '.join(topic_terms[t][:8])}")
    for _, row in topic_papers.iterrows():
        print(f"    [{row['year']}] {row['title'][:80]}")

# %% [markdown]
# ### 4a. Define Topic Merging & Labelling
#
# Edit MERGE_MAP, TOPIC_LABELS, and EXCLUDE_TOPICS based on inspection.

# %%
MERGE_MAP = {i: i for i in range(N_TOPICS)}
# MERGE_MAP[5] = 3    # Example: merge topic 5 into topic 3

EXCLUDE_TOPICS = []
# EXCLUDE_TOPICS = [0, 22]

TOPIC_LABELS = {i: f"Topic {i}" for i in range(N_TOPICS)}
# TOPIC_LABELS[0] = "Volume Rendering"
# TOPIC_LABELS[1] = "Graph & Network Visualization"
# ...

print("Merge map and labels defined. Adjust above as needed.")

# %%
df['merged_topic'] = df['dominant_topic'].map(MERGE_MAP)
df = df[~df['dominant_topic'].isin(EXCLUDE_TOPICS)].copy()

merged_groups = {}
for orig, merged in MERGE_MAP.items():
    if orig not in EXCLUDE_TOPICS:
        merged_groups.setdefault(merged, []).append(orig)

for group_id, members in merged_groups.items():
    member_cols = [f"topic_{m}" for m in members]
    existing_cols = [c for c in member_cols if c in topic_distributions.columns]
    if existing_cols:
        topic_distributions[f"merged_{group_id}"] = topic_distributions[existing_cols].sum(axis=1)

df['topic_label'] = df['merged_topic'].map(TOPIC_LABELS)

final_topics = sorted(df['merged_topic'].unique())
print(f"Number of final topics after merging: {len(final_topics)}")
for t in final_topics:
    count = (df['merged_topic'] == t).sum()
    print(f"  {TOPIC_LABELS.get(t, t):40s} ({count} papers)")

# %% [markdown]
# ## 5. Temporal Distribution (Pattern 4 – Cluster-Label-Distribute)

# %%
yearly_counts = df.groupby(['year', 'topic_label']).size().unstack(fill_value=0)
yearly_proportions = yearly_counts.div(yearly_counts.sum(axis=1), axis=0)
print(f"Yearly proportions table: {yearly_proportions.shape}")

# %%
fig, ax = plt.subplots(figsize=(16, 8))
yearly_proportions.plot.area(ax=ax, alpha=0.8, linewidth=0.5)
ax.set_title("Topic Proportions Over Time (Raw)", fontsize=14)
ax.set_xlabel("Year")
ax.set_ylabel("Proportion of Papers")
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)
ax.set_xlim(df['year'].min(), df['year'].max())
plt.tight_layout()
plt.show()

# %% [markdown]
# ## 6. Trend Smoothing (Pattern 5 – Simplify → Analyse → Restore)

# %%
WINDOW_SIZE = 5  # 5-year moving average

yearly_smoothed = yearly_proportions.rolling(
    window=WINDOW_SIZE, center=True, min_periods=2
).mean()
yearly_smoothed = yearly_smoothed.dropna()
print(f"Smoothed range: {yearly_smoothed.index.min()} – {yearly_smoothed.index.max()}")

# %%
fig, ax = plt.subplots(figsize=(16, 8))
yearly_smoothed.plot.area(ax=ax, alpha=0.85, linewidth=0.5)
ax.set_title(f"Topic Proportions ({WINDOW_SIZE}-Year Moving Average)", fontsize=14)
ax.set_xlabel("Year")
ax.set_ylabel("Proportion of Papers")
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)
plt.tight_layout()
plt.show()

# %%
# Small-multiple line charts
n_final_topics = len(yearly_smoothed.columns)
ncols = 4
nrows = int(np.ceil(n_final_topics / ncols))

fig, axes = plt.subplots(nrows, ncols, figsize=(16, 3*nrows), sharex=True, sharey=True)
axes_flat = axes.flatten()

for idx, topic_name in enumerate(yearly_smoothed.columns):
    ax = axes_flat[idx]
    ax.plot(yearly_smoothed.index, yearly_smoothed[topic_name], color='steelblue', lw=2)
    ax.fill_between(yearly_smoothed.index, yearly_smoothed[topic_name], alpha=0.3)
    ax.set_title(topic_name, fontsize=9, fontweight='bold')
    ax.grid(True, alpha=0.3)

for idx in range(n_final_topics, len(axes_flat)):
    axes_flat[idx].set_visible(False)

fig.suptitle(f"Individual Topic Trends ({WINDOW_SIZE}-Year Smoothing)", fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

# %% [markdown]
# ## 7. Trend Classification & Summary

# %%
years = yearly_smoothed.index
n_years = len(years)
third = max(1, n_years // 3)  # at least one year per period, so the early slice is never empty

early_period = yearly_smoothed.iloc[:third]
late_period = yearly_smoothed.iloc[-third:]

trend_summary = []
for topic_name in yearly_smoothed.columns:
    early_mean = early_period[topic_name].mean()
    late_mean = late_period[topic_name].mean()
    if early_mean > 0:
        rel_change = (late_mean - early_mean) / early_mean
    else:
        rel_change = 1.0 if late_mean > 0 else 0.0

    if rel_change > 0.3:
        trend = "📈 Emerging"
    elif rel_change < -0.3:
        trend = "📉 Declining"
    else:
        trend = "➡️ Stable"

    peak_year = yearly_smoothed[topic_name].idxmax()
    trend_summary.append({
        'Topic': topic_name,
        'Early Avg (%)': f"{early_mean*100:.1f}",
        'Late Avg (%)': f"{late_mean*100:.1f}",
        'Relative Change': f"{rel_change:+.0%}",
        'Peak Year': int(peak_year),
        'Trend': trend
    })

trend_df = pd.DataFrame(trend_summary).sort_values('Trend')
print("=" * 70)
print("TOPIC TREND SUMMARY")
print("=" * 70)
trend_df

# %% [markdown]
# ## 8. Detail-on-Demand (Pattern 5 – Restore)

# %%
DRILL_TOPIC = yearly_smoothed.columns[0]  # Change to topic of interest

print(f"Drilling into: '{DRILL_TOPIC}'")
topic_papers = df[df['topic_label'] == DRILL_TOPIC].sort_values('year')

for era_start in range(int(df['year'].min()), int(df['year'].max()) + 1, 5):
    era_end = era_start + 4
    era_papers = topic_papers[
        (topic_papers['year'] >= era_start) & (topic_papers['year'] <= era_end)
    ]
    if len(era_papers) > 0:
        print(f"\n--- {era_start}–{era_end} ({len(era_papers)} papers) ---")
        for _, row in era_papers.nlargest(3, 'topic_weight').iterrows():
            print(f"  [{row['year']}] {row['title'][:90]}")

# %% [markdown]
# ## 9. Export Results

# %%
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

df[['year', 'title', 'topic_label', 'topic_weight']].to_csv(
    output_dir / "papers_with_topics.csv", index=False)
yearly_smoothed.to_csv(output_dir / "yearly_topic_proportions_smoothed.csv")
trend_df.to_csv(output_dir / "trend_summary.csv", index=False)

print(f"Results saved to '{output_dir}/'")

# %% [markdown]
# ## 10. (Optional) 2-D Topic Map

# %%
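if N_TOPICS > 5:
    # t-SNE needs perplexity < number of samples (here: one sample per topic)
    perplexity = min(5, N_TOPICS - 1)
    tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity)
    # H is the topic-term matrix, so each row is embedded as one point
    topic_coords = tsne.fit_transform(H)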

    fig, ax = plt.subplots(figsize=(10, 8))
    ax.scatter(topic_coords[:, 0], topic_coords[:, 1],
               s=100, c=range(N_TOPICS), cmap='tab20', alpha=0.8, edgecolors='black')
    for i in range(N_TOPICS):
        if i not in EXCLUDE_TOPICS:
            ax.annotate(TOPIC_LABELS.get(i, f"T{i}"),
                       (topic_coords[i, 0], topic_coords[i, 1]),
                       fontsize=7, ha='center', va='bottom')
    ax.set_title("Topic Map (t-SNE of topic-term vectors)")
    plt.tight_layout()
    plt.show()

How to Use

1. Save the code as topic_evolution.py, then use jupytext or VS Code to convert the # %% cells into a notebook (topic_evolution.ipynb).
2. Place your dataset (CSV with year, title, abstract columns) in the same directory and update DATA_PATH.
3. Run cells sequentially through Section 3 (topic extraction).
4. Inspect the printed topics and representative papers.
5. Edit MERGE_MAP, TOPIC_LABELS, and EXCLUDE_TOPICS in Section 4a based on your inspection; this is the human-in-the-loop step.
6. Re-run from Section 4a onward to see the refined trend charts.
7. Iterate until the topic set is satisfactory.

The key interactive decision point is Section 4a — that's where your domain expertise shapes the final result.
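
If you would rather not install jupytext, the conversion of # %% cells into a notebook can be sketched with the standard library alone. This is a simplified illustration, not a replacement for jupytext: the helper name `percent_script_to_notebook` is hypothetical, and the sketch assumes the script uses only `# %%` and `# %% [markdown]` markers, with markdown stored as comment lines prefixed by `# `.

```python
import json


def percent_script_to_notebook(script: str) -> dict:
    """Split a '# %%' percent-format script into nbformat-4 cells (simplified)."""
    chunks = []  # list of (cell_type, [lines])
    for line in script.splitlines():
        stripped = line.strip()
        if stripped.startswith("# %%"):
            # A marker line opens a new cell; '[markdown]' selects the cell type
            cell_type = "markdown" if "[markdown]" in stripped else "code"
            chunks.append((cell_type, []))
        elif chunks:
            chunks[-1][1].append(line)
        elif stripped:
            # Non-blank text before the first marker becomes a code cell
            chunks.append(("code", [line]))

    cells = []
    for cell_type, lines in chunks:
        if cell_type == "markdown":
            # Markdown is stored as comments: drop the leading "# " (or bare "#")
            lines = [l[2:] if l.startswith("# ") else l.lstrip("#") for l in lines]
        source = "\n".join(lines).strip("\n")
        if not source.strip():
            continue  # skip empty cells
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            # Code cells additionally carry an execution counter and outputs
            cell["execution_count"] = None
            cell["outputs"] = []
        cells.append(cell)

    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
        },
        "nbformat": 4,
        "nbformat_minor": 4,
    }


# Tiny demo script (assumed content, for illustration only)
demo_script = """# %% [markdown]
# ## 9. Export Results

# %%
output_dir = "output"
print(output_dir)
"""

notebook = percent_script_to_notebook(demo_script)
notebook_json = json.dumps(notebook, indent=1)  # write this string to a .ipynb file
```

For the real script, read topic_evolution.py into a string, pass it to the helper, and write the resulting JSON to topic_evolution.ipynb. Edge cases such as cell magics or cell titles after the marker are deliberately not handled here.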

User #4

Please output the notebook as json file

Assistant #4

💭 Show reasoning

The user wants me to output the Jupyter Notebook as a JSON file (the standard .ipynb format). I'll convert the Python script into the proper Jupyter notebook JSON format with cells properly separated.

(The assistant produced the complete Jupyter Notebook in standard .ipynb JSON format: more than 30 cells covering all workflow steps from data loading through trend classification and export.)

{
 "cells": [
  {"cell_type": "markdown", "source": ["# IEEE VIS Research Topic Evolution (1990–2024)\n", ...]},
  {"cell_type": "code", "source": ["# !pip install pandas numpy scikit-learn ..."]},
  {"cell_type": "code", "source": ["import pandas as pd\n", "import numpy as np\n", ...]},
  {"cell_type": "markdown", "source": ["## 1. Data Loading & Exploration\n", ...]},
  {"cell_type": "code", "source": ["DATA_PATH = \"ieee_vis_papers.csv\"\n", ...]},
  {"cell_type": "code", "source": ["# Drop rows with missing abstracts\n", ...]},
  {"cell_type": "code", "source": ["# Papers per year\n", ...]},
  {"cell_type": "markdown", "source": ["## 2. Text Preprocessing\n", ...]},
  {"cell_type": "code", "source": ["df['text'] = ...\n", "CUSTOM_STOPWORDS = {...}\n", ...]},
  {"cell_type": "markdown", "source": ["## 3. Topic Extraction (NMF)\n", ...]},
  {"cell_type": "code", "source": ["N_TOPICS = 30\n", ...]},
  {"cell_type": "code", "source": ["tfidf_vectorizer = TfidfVectorizer(...)\n", ...]},
  {"cell_type": "code", "source": ["nmf_model = NMF(...)\n", ...]},
  {"cell_type": "code", "source": ["# Display top terms per topic\n", ...]},
  {"cell_type": "code", "source": ["# Assign dominant topic\n", ...]},
  {"cell_type": "markdown", "source": ["## 4. Interactive Topic Refinement\n", ...]},
  {"cell_type": "code", "source": ["# Cosine similarity heatmap\n", ...]},
  {"cell_type": "code", "source": ["# Representative papers\n", ...]},
  {"cell_type": "markdown", "source": ["### 4a. Merging & Labelling\n", ...]},
  {"cell_type": "code", "source": ["MERGE_MAP = {...}\n", "EXCLUDE_TOPICS = [...]\n", ...]},
  {"cell_type": "code", "source": ["# Apply merging\n", ...]},
  {"cell_type": "markdown", "source": ["## 5. Temporal Distribution\n", ...]},
  {"cell_type": "code", "source": ["yearly_counts = ...\n", ...]},
  {"cell_type": "code", "source": ["# Raw stacked area chart\n", ...]},
  {"cell_type": "markdown", "source": ["## 6. Trend Smoothing\n", ...]},
  {"cell_type": "code", "source": ["WINDOW_SIZE = 5\n", ...]},
  {"cell_type": "code", "source": ["# Smoothed stacked area chart\n", ...]},
  {"cell_type": "code", "source": ["# Small-multiple line charts\n", ...]},
  {"cell_type": "markdown", "source": ["## 7. Trend Classification\n", ...]},
  {"cell_type": "code", "source": ["# Classify: Emerging / Declining / Stable\n", ...]},
  {"cell_type": "code", "source": ["# Highlight top trends\n", ...]},
  {"cell_type": "markdown", "source": ["## 8. Detail-on-Demand\n", ...]},
  {"cell_type": "code", "source": ["DRILL_TOPIC = ...\n", ...]},
  {"cell_type": "markdown", "source": ["## 9. Export Results\n", ...]},
  {"cell_type": "code", "source": ["output_dir = Path('output')\n", ...]},
  {"cell_type": "markdown", "source": ["## 10. (Optional) Topic Map\n", ...]},
  {"cell_type": "code", "source": ["# t-SNE of topic-term vectors\n", ...]},
  {"cell_type": "markdown", "source": ["---\n", "## Summary\n", ...]}
 ],
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.10.12"}
 },
 "nbformat": 4, "nbformat_minor": 4
}

The notebook contains 30+ cells implementing the complete 5-step workflow: text preprocessing, NMF topic extraction, interactive refinement (merge/label/exclude), temporal distribution with smoothing, and trend classification with export.
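
The outline above is abridged, so a quick sanity check on the saved file may be useful. The sketch below counts cells by type and reports fields that the nbformat 4 schema expects on each cell (`metadata` and `source` on every cell, plus `execution_count` and `outputs` on code cells). The function name `check_notebook` is hypothetical, and the check covers only these basics, not the full schema.

```python
import json
from collections import Counter

# Fields expected per cell type in nbformat 4 (basic subset only)
REQUIRED_FIELDS = {
    "markdown": {"cell_type", "metadata", "source"},
    "code": {"cell_type", "metadata", "source", "execution_count", "outputs"},
}


def check_notebook(nb: dict) -> dict:
    """Count cells by type and list missing required fields."""
    problems = []
    if nb.get("nbformat") != 4:
        problems.append("nbformat is not 4")
    counts = Counter()
    for i, cell in enumerate(nb.get("cells", [])):
        kind = cell.get("cell_type")
        counts[kind] += 1
        missing = REQUIRED_FIELDS.get(kind, set()) - set(cell)
        if missing:
            problems.append(f"cell {i} ({kind}) is missing {sorted(missing)}")
    return {"counts": dict(counts), "problems": problems}


# Example: a cell written in the abridged style of the outline fails the check
abridged = json.loads('{"nbformat": 4, "cells": [{"cell_type": "code", "source": ["x = 1"]}]}')
report = check_notebook(abridged)
print(report["counts"])
for problem in report["problems"]:
    print(problem)
```

To check the real file, load it with `json.load` and pass the resulting dictionary to `check_notebook`; an empty `problems` list means these basic fields are present.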