The attached archive includes published research papers containing descriptions of visual analytics workflows. Your task will be to use this as a workflow library for recommending workflows addressing new problems. Analyse the library and extract reusable patterns.
Reusable Visual Analytics Workflow Patterns
After analysing all 17 papers in the library, I extracted the following reusable workflow patterns that can be applied to new problems:
Pattern 1: Progressive Abstraction
Source papers: Episodes and Topics, From Movement Tracks, MobilityGraphs, Reducing Snapshots to Points
Structure:
- Start with raw, elementary data
- Apply a first-level transformation/abstraction (e.g., symbolic encoding, event extraction)
- Visualize and validate the intermediate representation
- Apply a second-level abstraction (e.g., topic modelling, clustering)
- Visualize and explore distribution of higher-level patterns in context
Applicable when: Data is hierarchically structured or too granular for direct insight; understanding requires building up from elements to patterns to distributions.
Pattern 2: Rank-then-Explore
Source papers: Partition-Based Regression Framework, Exploratory Model Analysis
Structure:
- Compute quantitative relevance/quality measures for many candidates (features, models, variable pairs)
- Rank candidates by measures; display as ordered table/list
- Coordinate ranked list with small-multiple visualisations showing qualitative structure
- User inspects top-ranked candidates visually to confirm relevance or spot issues
- Select candidate and proceed (e.g., add to model, export)
Applicable when: The search space is large (many features, many models) and a combination of automated scoring + human judgement is needed.
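The scoring and ranking steps of this pattern can be sketched as follows. This is a minimal illustration only: it assumes absolute Pearson correlation as the relevance measure (the source papers may use other measures), and the feature names and values are hypothetical toy data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    # A constant sequence has no variance; treat its relevance as zero.
    return cov / (sx * sy) if sx and sy else 0.0

def rank_candidates(features, target):
    """Return (name, score) pairs sorted by descending |correlation| with target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: three candidate features and one target.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "strong": [2.1, 3.9, 6.2, 8.0, 9.8],  # nearly linear in the target
    "weak": [3.0, 1.0, 4.0, 1.0, 5.0],    # loosely related
    "flat": [7.0, 7.0, 7.0, 7.0, 7.0],    # constant, carries no signal
}
ranking = rank_candidates(features, target)
for name, score in ranking:
    print(f"{name:8s} {score:.3f}")
```

The ranked list is only the quantitative half of the pattern; the top candidates would still be inspected visually before one is selected.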
Pattern 3: Iterative Residual-Driven Model Refinement
Source papers: Partition-Based Regression Framework, Visual Diagnostics of Binary Classifiers
Structure:
- Build an initial model using the most dominant effect
- Compute residuals (or errors/misclassifications)
- Visualize residuals conditioned on remaining candidate variables
- Identify systematic patterns (bias, local structures, interactions)
- Augment the model with a new variable/transformation addressing the pattern
- Repeat from the residual computation until a stopping criterion is met
Applicable when: Building interpretable predictive models incrementally, especially when domain knowledge must guide feature selection.
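The refinement loop can be illustrated with a small sketch. It uses greedy forward stagewise regression as a stand-in for the interactive step: at each iteration the variable that best explains the current residuals is added automatically, whereas in the pattern an analyst would make that choice visually. The function names, the `min_gain` stopping threshold, and the toy data are all assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared values, used here as the error measure."""
    return sum(v * v for v in values)

def fit_line(x, y):
    """Least-squares slope and intercept for y ~ a * x + b."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return 0.0, my
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return a, my - a * mx

def refine(candidates, y, min_gain=0.05, max_steps=10):
    """Greedily add the variable that best explains the current residuals."""
    residuals = [yi - mean(y) for yi in y]  # initial model: the mean of y
    total = sse(residuals)
    steps = []
    if total == 0:
        return steps, residuals
    for _ in range(max_steps):
        best = None
        for name, x in candidates.items():
            a, b = fit_line(x, residuals)
            new_res = [r - (a * xi + b) for xi, r in zip(x, residuals)]
            gain = (sse(residuals) - sse(new_res)) / total
            if best is None or gain > best[1]:
                best = (name, gain, new_res)
        if best is None or best[1] < min_gain:  # stopping criterion
            break
        steps.append((best[0], round(best[1], 3)))
        residuals = best[2]
    return steps, residuals

# Hypothetical toy data: y = 3 * x1 + 2 * x2
candidates = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1, -1, 1, -1, 1, -1],
}
y = [5, 4, 11, 10, 17, 16]
steps, residuals = refine(candidates, y)
print(steps)
```

The dominant effect (`x1`) is picked first, and the alternating structure left in the residuals then points to `x2`; the loop stops once no variable reduces the error by at least `min_gain` of the initial error.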
Pattern 4: Cluster-Label-Distribute
Source papers: Cluster and Calendar, MobilityGraphs, Episodes and Topics, Human-in-the-Loop, Reducing Snapshots to Points
Structure:
- Define a meaningful feature/distance space for the items
- Cluster items by similarity (density-based or partition-based)
- Visualise cluster representatives (prototypes, summaries, averages)
- Interpret/label clusters using domain knowledge
- Explore the distribution of clusters over context dimensions (time, space, categories)
Applicable when: A dataset contains many instances that need to be grouped for overview, and the analyst wants to understand when/where/why different groups occur.
Pattern 5: Simplify → Analyse → Restore
Source papers: Temporal Event Sequence Simplification, MobilityGraphs, Reducing Snapshots to Points
Structure:
- Start with complex, cluttered data
- Apply user-driven simplifications (filtering, merging, replacing, aggregating)
- Measure complexity reduction (e.g., visual element count)
- Analyse the simplified view for patterns
- Selectively restore detail when needed for deeper investigation
- Record simplification history for reproducibility
Applicable when: Raw data is too noisy/complex for pattern detection; the signal-to-noise ratio must be improved interactively.
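One way to make simplification reversible is to keep every intermediate state together with a log of what was done. The sketch below assumes an event sequence represented as a plain list of category names; the class, the two simplification functions, and the event names are hypothetical and not taken from any of the source papers.

```python
class SimplificationHistory:
    """Keeps each simplified state so that detail can be restored step by step."""

    def __init__(self, events):
        self._states = [list(events)]
        self.log = []  # (description, size before, size after)

    @property
    def current(self):
        return self._states[-1]

    def apply(self, description, func):
        """Apply a simplification and record its effect on complexity."""
        before = len(self.current)
        simplified = func(self.current)
        self._states.append(simplified)
        self.log.append((description, before, len(simplified)))
        return simplified

    def restore(self):
        """Undo the most recent simplification, if any."""
        if len(self._states) > 1:
            self._states.pop()
            return self.log.pop()
        return None

def merge_repeats(events):
    """Collapse runs of identical consecutive events into one."""
    return [e for i, e in enumerate(events) if i == 0 or e != events[i - 1]]

def drop_category(category):
    """Build a filter that removes all events of one category."""
    return lambda events: [e for e in events if e != category]

# Hypothetical event sequence
history = SimplificationHistory(
    ["admit", "vitals", "vitals", "vitals", "lab", "vitals", "discharge"]
)
history.apply("merge repeated events", merge_repeats)
history.apply("filter out 'vitals'", drop_category("vitals"))
simplified = list(history.current)
full_log = list(history.log)
history.restore()  # bring the 'vitals' events back for closer inspection
restored = list(history.current)
print(simplified, restored, full_log, sep="\n")
```

The log doubles as the complexity measure (element counts before and after each step) and as the record needed for reproducibility.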
Pattern 6: Similarity-Based Prescription
Source papers: EventAction, Reducing Snapshots to Points
Structure:
- Define the current record/state of interest
- Compute similarity to archived/historical records
- Visualise the distribution of outcomes for similar records
- Identify actions/sequences correlated with desired outcomes
- Construct a personalised action plan
- Provide feedback on estimated likelihood of success
- Iterate to refine
Applicable when: The goal is recommending future actions based on what has worked for similar cases in the past (prescriptive analytics).
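The first steps of this pattern (find similar records, then summarise their outcomes per action) can be sketched as below. Euclidean distance over two numeric features and the record layout are assumptions chosen for brevity; real applications would need a domain-specific similarity measure, and the resulting rates are correlations among similar cases, not causal estimates.

```python
from collections import defaultdict
from math import dist

def nearest_records(current, archive, k):
    """Return the k archived records closest to `current` (Euclidean distance)."""
    return sorted(archive, key=lambda rec: dist(current, rec["features"]))[:k]

def outcome_rates(records):
    """Share of successful outcomes per action among the given records."""
    counts = defaultdict(lambda: [0, 0])  # action -> [successes, total]
    for rec in records:
        counts[rec["action"]][0] += int(rec["success"])
        counts[rec["action"]][1] += 1
    return {action: s / n for action, (s, n) in counts.items()}

# Hypothetical archive: feature vector, action taken, and whether it succeeded.
archive = [
    {"features": (3.1, 1.0), "action": "A", "success": True},
    {"features": (2.9, 1.2), "action": "A", "success": True},
    {"features": (3.0, 0.8), "action": "B", "success": False},
    {"features": (3.2, 1.1), "action": "B", "success": True},
    {"features": (0.5, 4.0), "action": "B", "success": True},   # dissimilar
    {"features": (0.4, 3.8), "action": "A", "success": False},  # dissimilar
]
current = (3.0, 1.0)
similar = nearest_records(current, archive, k=4)
rates = outcome_rates(similar)
print(rates)
```

Restricting the summary to similar records changes the picture: across the whole archive both actions succeed equally often, but among the four nearest neighbours action "A" looks more promising.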
Pattern 7: Human-in-the-Loop Feature Engineering for Classification
Source papers: Human-in-the-Loop, UTOPIAN, RfX
Structure:
- Derive interval-based or aggregate features from raw temporal/sequential data
- Assess feature discriminability through visual inspection (distributions, projections)
- Select/refine features iteratively
- Group data by features (clustering or dimensionality reduction)
- Visually inspect groups; label representative examples in bulk
- Validate labelled set (e.g., k-NN test) before training a full model
Applicable when: Training labels are scarce, pattern definitions are vague, and domain expertise is needed to bridge raw data and meaningful categories.
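The validation step mentioned above (a k-NN test on the labelled set) might look like the following leave-one-out check. The feature values and labels are hypothetical, and using plain Euclidean distance on two features is an assumption made to keep the sketch short.

```python
from collections import Counter
from math import dist

def loo_knn_accuracy(points, labels, k=1):
    """Leave-one-out accuracy of a k-nearest-neighbour vote on a labelled set."""
    correct = 0
    for i, p in enumerate(points):
        # Distances from the held-out point to all other labelled points.
        others = sorted(
            (dist(p, q), labels[j]) for j, q in enumerate(points) if j != i
        )
        votes = Counter(label for _, label in others[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(points)

# Hypothetical labelled examples in a two-feature space.
points = [(0.0, 0.0), (0.0, 1.0), (1.5, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
clean_labels = ["a", "a", "a", "b", "b", "b"]
noisy_labels = ["a", "a", "b", "b", "b", "b"]  # third example mislabelled

clean_acc = loo_knn_accuracy(points, clean_labels)
noisy_acc = loo_knn_accuracy(points, noisy_labels)
print(f"clean: {clean_acc:.2f}, noisy: {noisy_acc:.2f}")
```

A drop in leave-one-out accuracy is a cheap signal that either some labels are inconsistent or the chosen features do not separate the categories well, both of which are worth resolving before training a full model.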
Pattern 8: Exploratory Model Analysis (Multi-Problem, Multi-Model)
Source papers: Exploratory Model Analysis (EuroVis 2019), What-If Tool
Structure:
- Explore data overview to understand attributes and distributions
- Generate/explore candidate modelling problems (target, features, type)
- Specify the problem; auto-generate diverse models (AutoML or equivalent)
- Compare models on holdout predictions (confusion matrices, residuals, fairness)
- Select preferred model(s) considering task-specific criteria beyond accuracy
- Export models for deployment
Applicable when: The modelling goal itself is not fixed a priori; the analyst wants to discover what predictions are possible and robust.
Pattern 9: Spatial Event Extraction → Place Delineation → Flow Analysis
Source papers: From Movement Tracks through Events to Places, MobilityGraphs
Structure:
- Compute dynamic attributes from trajectories (speed, direction, proximity)
- Extract movement events by filtering on attribute thresholds (visually guided)
- Cluster events spatially (and optionally temporally) using density-based methods
- Build spatial buffers around clusters to delineate "places"
- Aggregate events/trajectories by places and time intervals
- Analyse aggregated time series or flows for spatio-temporal patterns
Applicable when: Relevant places are not predefined but must emerge from movement behaviour; analysis focuses on place-based temporal patterns.
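The first half of this pattern (derive an attribute, extract events by threshold, group events into places) can be sketched as follows. Note the substitution: instead of density-based clustering with spatial buffers, the sketch bins events into a regular grid, which is a much cruder way to delineate places. The track, the speed threshold, and the cell size are hypothetical.

```python
from collections import Counter
from math import hypot

def speeds(track):
    """Speed between consecutive (t, x, y) samples, attached to the later sample."""
    out = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        out.append((t1, x1, y1, hypot(x1 - x0, y1 - y0) / (t1 - t0)))
    return out

def extract_stops(track, max_speed):
    """Movement events: samples reached at a speed below `max_speed`."""
    return [(t, x, y) for t, x, y, s in speeds(track) if s < max_speed]

def grid_places(events, cell):
    """Count events per grid cell (a simple stand-in for density clustering)."""
    return Counter((int(x // cell), int(y // cell)) for _, x, y in events)

# Hypothetical trajectory: fast movement interrupted by two slow phases.
track = [
    (0, 0.0, 0.0), (1, 10.0, 0.0), (2, 20.0, 0.0),
    (3, 20.5, 0.2), (4, 20.7, 0.1),
    (5, 30.0, 0.0), (6, 40.0, 0.0), (7, 40.2, 0.3),
]
stops = extract_stops(track, max_speed=1.0)
places = grid_places(stops, cell=5.0)
print(stops)
print(places)
```

In the pattern itself the threshold would be chosen with visual guidance, and the per-place counts would then be aggregated by time interval to obtain the time series analysed in the last step.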
Pattern 10: Model Explanation → Diagnostic Drill-Down
Source papers: Visual Diagnostics of Binary Classifiers, What-If Tool, RfX
Structure:
- Outcome-level: Visualise overall prediction distribution and accuracy (confusion matrix, ROC, histograms)
- Explanation-level: Aggregate instance-level explanations into groups; show which features drive decisions and how accurate each decision group is
- Instance-level: Drill into individual data items to see raw feature values and understand root causes of errors
- Generate diagnostic insights (data quality issues, missing features, label leakage)
Applicable when: A trained classifier needs semantic validation by domain experts, beyond aggregate accuracy metrics.
Pattern 11: Temporal Segmentation + Symbolic Encoding + Co-occurrence Mining
Source papers: Episodes and Topics, Cluster and Calendar, Human-in-the-Loop
Structure:
- Segment multivariate time series into episodes (sliding window or event-based)
- Encode each attribute's variation within an episode symbolically (e.g., SAX)
- Treat episodes as "documents" and attribute patterns as "words"
- Apply topic modelling (LDA, NMF) to discover co-occurring patterns ("topics")
- Interpret topics visually; merge semantically close topics
- Explore topic distribution over time, space, or other context dimensions
Applicable when: The goal is to find and understand recurring combinations of behaviours across multiple attributes simultaneously.
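The symbolic encoding step can be made concrete with a simplified SAX sketch: z-normalise the episode, average it over equal-length segments, and map each segment mean to a letter using equiprobable breakpoints of the standard normal distribution. The function assumes the series length is divisible by the number of segments, and the attribute names and values are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet="abc"):
    """Simplified SAX: z-normalise, average per segment, map means to symbols."""
    if len(series) % n_segments:
        raise ValueError("series length must be divisible by n_segments")
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma if sigma else 0.0 for v in series]
    step = len(z) // n_segments
    segment_means = [mean(z[i * step:(i + 1) * step]) for i in range(n_segments)]
    # Breakpoints split the standard normal into equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(alphabet[bisect_right(breakpoints, m)] for m in segment_means)

# Hypothetical episode with two attributes; each attribute pattern becomes a "word".
episode = {
    "speed": [1, 1, 2, 2, 5, 5, 9, 9],
    "heading": [3, 3, 3, 3, 3, 3, 3, 3],
}
document = [f"{name}:{sax(values, 4)}" for name, values in episode.items()]
print(document)
```

Each episode then becomes a short "document" of attribute-pattern words, which is the input a topic model such as LDA or NMF expects.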
Pattern 12: Interactive Model Steering (Semantic Interaction)
Source papers: UTOPIAN, What-If Tool, RfX, Partition-Based Regression
Structure:
- System produces an initial model/result automatically
- User inspects the result through coordinated visualisations
- User directly manipulates the result (merge topics, adjust thresholds, edit counterfactuals, prune trees)
- System re-computes model incorporating user changes
- Updated result is shown immediately for validation
- Iterate until the user is satisfied
Applicable when: The model requires domain knowledge for refinement that cannot be easily formalised as constraints; experts need to inject knowledge interactively.
Summary Table
| # | Pattern Name | Key Mechanism | Typical Output |
|---|---|---|---|
| 1 | Progressive Abstraction | Multi-level aggregation with validation at each level | High-level interpretable patterns |
| 2 | Rank-then-Explore | Quantitative ranking + qualitative visual inspection | Selected features/variables |
| 3 | Residual-Driven Refinement | Iterative error analysis and model augmentation | Interpretable multi-variable model |
| 4 | Cluster-Label-Distribute | Cluster → interpret → explore distribution | Labelled groups with contextual understanding |
| 5 | Simplify → Analyse → Restore | Complexity reduction with reversibility | Clean visual overview + preserved detail |
| 6 | Similarity-Based Prescription | Find similar → show outcomes → plan actions | Recommended action sequences |
| 7 | Human-in-the-Loop Feature Engineering | Feature design + bulk visual labelling + validation | Training set for ML model |
| 8 | Exploratory Model Analysis | Problem exploration + multi-model comparison | Deployed predictive model(s) |
| 9 | Event → Place → Flow | Event extraction → spatial clustering → aggregation | Spatio-temporal activity patterns |
| 10 | Explanation Drill-Down | Aggregate → explanation groups → raw instances | Diagnostic insights for model improvement |
| 11 | Symbolic Encoding + Topic Mining | SAX + topic modelling + distribution exploration | Multi-attribute temporal patterns |
| 12 | Interactive Model Steering | Direct manipulation → re-computation loop | Domain-tuned model/result |
These patterns can be composed (e.g., Pattern 4 inside Pattern 1, or Pattern 3 using Pattern 2 at each step) and adapted to specific data types and analysis goals.
I need a workflow recommendation for the following task. I want to analyse the dataset that contains information on IEEE Visualization (IEEE VIS) publications from 1990-2024 as well as IEEE TVCG and IEEE CG&A articles published at IEEE VIS. The dataset includes for each paper the publication year, title, and abstract. I want to reveal the major research topics and how they evolved over the years. I am interested in seeing trends rather than minor fluctuations. Using the workflow library, compose an appropriate workflow.
Recommended Workflow: Research Topic Evolution in IEEE VIS (1990–2024)
Composed from Patterns: 1 → 11 → 12 → 4 → 5
Overview Diagram
Raw Papers (year, title, abstract)
│
▼
┌─────────────────────────┐
│ 1. Text Preprocessing │ (Pattern 1 – first abstraction level)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 2. Topic Extraction │ (Pattern 11 – topic mining)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 3. Interactive Refine │ (Pattern 12 – model steering)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 4. Label & Distribute │ (Pattern 4 – cluster-label-distribute)
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ 5. Trend Smoothing │ (Pattern 5 – simplify for trends)
└───────────┬─────────────┘
▼
Trend Narratives
Step-by-Step Workflow
Step 1 — Text Preprocessing (Pattern 1: Progressive Abstraction, Level 1)
| Action | Detail |
|---|---|
| Concatenate title + abstract per paper | Creates one "document" per publication |
| Tokenise, lowercase, remove stopwords | Standard NLP pipeline |
| Remove domain-generic terms | e.g., "paper", "propose", "result", "method" |
| Optional: bigram/trigram collocation | Preserves phrases like "parallel coordinates", "deep learning" |
| Represent as bag-of-words or TF-IDF matrix | Input for topic model |
Output: Document-term matrix (≈ 4 000–5 000 papers × vocabulary)
Step 2 — Topic Extraction (Pattern 11: Topic Mining)
| Action | Detail |
|---|---|
| Choose topic model | LDA, NMF, or BERTopic (embedding-based) |
| Set initial k (number of topics) | Start with k = 25–40 (intentionally over-segment) |
| Fit model | Each paper receives a topic-probability vector |
| Inspect top terms per topic | Ranked word lists or word clouds |
| Inspect representative papers per topic | Highest-probability papers as exemplars |
Output: k candidate topics, each described by top terms and representative papers; per-paper topic assignments (soft or hard).
Rationale for over-segmentation: It is easier to merge semantically close topics in the next step than to split overly broad ones.
Step 3 — Interactive Topic Refinement (Pattern 12: Interactive Model Steering)
| Action | Detail |
|---|---|
| Visualise topic similarity | 2-D projection (t-SNE/UMAP of topic-term vectors) or dendrogram |
| Merge near-duplicate topics | e.g., "graph drawing" + "network layout" → single topic |
| Split incoherent topics | If top terms mix unrelated concepts, re-cluster that subset |
| Remove junk topics | Boilerplate or overly generic clusters |
| Re-label topics with human-readable names | e.g., "Volume Rendering", "Text & NLP", "Uncertainty Visualisation" |
| Validate | Check that representative papers match the label |
Output: Refined set of ~10–20 well-labelled major topics; each paper assigned to one or more topics.
Iterate until topic set is stable and interpretable.
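Merge candidates can be suggested automatically before the analyst decides. The sketch below assumes each topic is a vector of term weights, proposes a merge for every pair whose cosine similarity reaches a threshold, and groups linked topics with a small union-find. The threshold of 0.8 and the toy vectors are assumptions; the output is a suggestion to be confirmed or overridden by hand, not a final merge.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def suggest_merges(topic_vectors, threshold=0.8):
    """Map each topic index to the lowest index of its similarity group."""
    n = len(topic_vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(topic_vectors[i], topic_vectors[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return {i: find(i) for i in range(n)}

# Hypothetical topic-term weights over a four-term vocabulary.
topics = [
    [0.9, 0.8, 0.0, 0.0],
    [0.8, 0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
merge_map = suggest_merges(topics)
print(merge_map)
```

Because similarity is transitive under this grouping, a low threshold can chain unrelated topics together; inspecting the suggested groups against their representative papers remains necessary.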
Step 4 — Label & Distribute over Time (Pattern 4: Cluster-Label-Distribute)
| Action | Detail |
|---|---|
| For each year, count papers per topic | Absolute counts or proportions (share of total) |
| Construct a year × topic table | Basis for temporal visualisation |
| Visualise as stacked area chart or ThemeRiver | Shows relative topic prominence over 35 years |
| Alternatively: small-multiple line charts | One line per topic; easier to compare slopes |
Output: Raw temporal distribution of topics (will still be noisy year-to-year).
Step 5 — Trend Smoothing & Narrative (Pattern 5: Simplify → Analyse → Restore)
| Action | Detail |
|---|---|
| Apply temporal smoothing | Moving average (window = 3–5 years) or LOESS/lowess curve per topic |
| Suppress minor fluctuations | Focus on monotone rises, falls, or plateaus |
| Identify trend phases | "Emerging" (recent upward), "Declining" (downward), "Stable" (flat) |
| Annotate key inflection points | e.g., "Deep-learning topics surge after 2016" |
| Restore detail on demand | Allow drill-down to individual years/papers for any interesting inflection |
Output: Smooth trend lines per topic with annotations; classification of topics into trend categories.
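A small worked example shows how smoothing and trend classification interact. It uses a centred moving average whose window shrinks at the edges of the series, and classifies a topic by comparing the mean of the first and last third of the smoothed values. The 30% threshold, the window of 3, and the yearly shares are illustrative assumptions.

```python
def centred_moving_average(values, window=5):
    """Centred moving average; the window is truncated at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def classify_trend(smoothed, threshold=0.3):
    """Compare the first and last third of the series by relative change."""
    third = max(1, len(smoothed) // 3)
    early = sum(smoothed[:third]) / third
    late = sum(smoothed[-third:]) / third
    if early == 0:
        return "emerging" if late > 0 else "stable"
    change = (late - early) / early
    if change > threshold:
        return "emerging"
    if change < -threshold:
        return "declining"
    return "stable"

# Hypothetical yearly shares of one topic, with a one-year spike at index 3.
shares = [0.02, 0.02, 0.03, 0.10, 0.03, 0.04, 0.05, 0.06, 0.07]
smoothed = centred_moving_average(shares, window=3)
label = classify_trend(smoothed)
print([round(v, 3) for v in smoothed], label)
```

The spike is flattened to roughly half its height while the gradual rise survives, so the topic is labelled as emerging on the basis of the long-term movement rather than the outlier year.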
Final Deliverables
| Artefact | Description |
|---|---|
| Topic catalogue | Table of major research topics with label, top terms, and 3–5 exemplar papers |
| Trend chart | Smoothed stacked area or small-multiple line chart (1990–2024) |
| Trend summary | Narrative listing emerging topics (e.g., ML4VIS, fairness), declining topics (e.g., VRML, early VR), and stable pillars (e.g., volume rendering, information visualisation) |
| Interactive view (optional) | Linked dashboard: select a topic → see its trend + representative papers per period |
Justification of Pattern Composition
| User Requirement | Addressed by |
|---|---|
| "Reveal major research topics" | Patterns 11 + 12 (topic extraction + refinement) |
| "How they evolved over the years" | Pattern 4 (distribute over time) |
| "Trends rather than minor fluctuations" | Pattern 5 (temporal smoothing) |
| Interpretability & trust | Pattern 1 (progressive abstraction with validation at each level) |
Tool Suggestions (optional)
- Topic modelling: scikit-learn (NMF/LDA), BERTopic, or MALLET
- Visualisation: Matplotlib/Seaborn (static), Streamlit + Altair (interactive), or dedicated VA tools like pyLDAvis
- Smoothing: statsmodels lowess or pandas rolling
To test this workflow, create a Jupyter Notebook (Python).
(The assistant produced a complete Python implementation. The code is shown below as it was presented in the session — as a percent-format notebook script.)
# %% [markdown]
# # IEEE VIS Research Topic Evolution (1990–2024)
#
# This notebook implements the recommended visual analytics workflow to reveal
# major research topics in IEEE VIS publications and how they evolved over time.
#
# **Workflow:** Progressive Abstraction → Topic Mining → Interactive Steering →
# Cluster-Label-Distribute → Simplify for Trends
# %% [markdown]
# ## 0. Setup & Installation
# %%
# !pip install pandas numpy scikit-learn matplotlib seaborn nltk wordcloud adjustText
# %%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')
# NLP
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Topic Modelling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.manifold import TSNE
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
print("Setup complete.")
# %% [markdown]
# ## 1. Data Loading & Exploration
#
# Expects a CSV file with at least columns: `Year`, `Title`, `Abstract`.
# %%
# === CONFIGURE THIS ===
DATA_PATH = "ieee_vis_papers.csv" # <-- Adjust path to your dataset
# Load data
df = pd.read_csv(DATA_PATH)
df.columns = df.columns.str.strip().str.lower()
COLUMN_MAP = {
'year': 'year',
'title': 'title',
'abstract': 'abstract',
}
df = df.rename(columns=COLUMN_MAP)
assert 'year' in df.columns
assert 'title' in df.columns
assert 'abstract' in df.columns
print(f"Dataset shape: {df.shape}")
print(f"Year range: {df['year'].min()} – {df['year'].max()}")
df.head(3)
# %%
# Drop rows with missing abstracts
before = len(df)
df = df.dropna(subset=['abstract']).copy()
df = df[df['abstract'].str.strip().str.len() > 50]
print(f"Dropped {before - len(df)} rows. Remaining: {len(df)}")
# %%
# Papers per year
fig, ax = plt.subplots(figsize=(12, 4))
df.groupby('year').size().plot(kind='bar', ax=ax, color='steelblue', width=0.8)
ax.set_title("Number of Papers per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Count")
plt.tight_layout()
plt.show()
# %% [markdown]
# ## 2. Text Preprocessing (Pattern 1 – Progressive Abstraction, Level 1)
# %%
df['text'] = df['title'].fillna('') + ' ' + df['abstract'].fillna('')
CUSTOM_STOPWORDS = {
'paper', 'propose', 'proposed', 'approach', 'method', 'methods',
'result', 'results', 'show', 'shown', 'use', 'used', 'using',
'present', 'presented', 'new', 'novel', 'technique', 'techniques',
'based', 'provide', 'system', 'also', 'work', 'problem',
'describe', 'described', 'demonstrate', 'demonstrated',
'algorithm', 'algorithms', 'however', 'many', 'one', 'two',
'first', 'second', 'allow', 'allows', 'well', 'can', 'may',
'different', 'several', 'existing', 'within', 'number',
'introduction', 'conclusion', 'abstract', 'study', 'studies',
'example', 'examples', 'application', 'applications'
}
stop_words = set(stopwords.words('english')).union(CUSTOM_STOPWORDS)
lemmatizer = WordNetLemmatizer()
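def preprocess_text(text):
    """Lowercase, strip non-letters, lemmatize, and drop stopwords and short tokens."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = text.split()
    lemmas = [
        lemmatizer.lemmatize(t)
        for t in tokens
        if t not in stop_words and len(t) > 2
    ]
    # Filter again after lemmatization so that plural forms of stopwords
    # (e.g. "systems" -> "system") are removed as well.
    return ' '.join(t for t in lemmas if t not in stop_words and len(t) > 2)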
df['processed_text'] = df['text'].apply(preprocess_text)
print("Preprocessing complete.")
# %% [markdown]
# ## 3. Topic Extraction (Pattern 11 – Topic Mining)
#
# NMF on TF-IDF features; intentionally over-segment (k=30).
# %%
N_TOPICS = 30
MAX_FEATURES = 5000
MIN_DF = 5
MAX_DF = 0.7
NGRAM_RANGE = (1, 2)
# %%
tfidf_vectorizer = TfidfVectorizer(
max_features=MAX_FEATURES,
min_df=MIN_DF, max_df=MAX_DF,
ngram_range=NGRAM_RANGE,
token_pattern=r'(?u)\b[a-z][a-z]+\b'
)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'])
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"TF-IDF matrix: {tfidf_matrix.shape[0]} docs × {tfidf_matrix.shape[1]} terms")
# %%
nmf_model = NMF(n_components=N_TOPICS, random_state=42, max_iter=500, init='nndsvda')
W = nmf_model.fit_transform(tfidf_matrix)
H = nmf_model.components_
print(f"NMF complete. Reconstruction error: {nmf_model.reconstruction_err_:.2f}")
# %%
N_TOP_TERMS = 12
def display_topics(model, feature_names, n_top=N_TOP_TERMS):
    topics = []
    for idx, topic_vec in enumerate(model.components_):
        top_indices = topic_vec.argsort()[:-n_top - 1:-1]
        top_terms = [feature_names[i] for i in top_indices]
        topics.append(top_terms)
        print(f"Topic {idx:2d}: {', '.join(top_terms)}")
    return topics
print("=" * 70)
print("TOP TERMS PER TOPIC")
print("=" * 70)
topic_terms = display_topics(nmf_model, feature_names)
# %%
df['dominant_topic'] = W.argmax(axis=1)
df['topic_weight'] = W.max(axis=1)
topic_distributions = pd.DataFrame(
W, columns=[f"topic_{i}" for i in range(N_TOPICS)], index=df.index
)
print("\nPapers per topic:")
print(df['dominant_topic'].value_counts().sort_index())
# %% [markdown]
# ## 4. Interactive Topic Refinement (Pattern 12 – Model Steering)
# %%
from sklearn.metrics.pairwise import cosine_similarity
topic_similarity = cosine_similarity(H)
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(topic_similarity, annot=False, cmap='YlOrRd',
xticklabels=range(N_TOPICS), yticklabels=range(N_TOPICS), ax=ax)
ax.set_title("Topic-Topic Cosine Similarity")
plt.tight_layout()
plt.show()
# %%
# Representative papers per topic
for t in range(N_TOPICS):
    topic_papers = df[df['dominant_topic'] == t].nlargest(3, 'topic_weight')
    print(f"\n--- Topic {t} ---")
    print(f"  Terms: {', '.join(topic_terms[t][:8])}")
    for _, row in topic_papers.iterrows():
        # str() guards against missing titles, which pandas stores as NaN
        print(f"  [{row['year']}] {str(row['title'])[:80]}")
# %% [markdown]
# ### 4a. Define Topic Merging & Labelling
#
# Edit MERGE_MAP, TOPIC_LABELS, and EXCLUDE_TOPICS based on inspection.
# %%
MERGE_MAP = {i: i for i in range(N_TOPICS)}
# MERGE_MAP[5] = 3 # Example: merge topic 5 into topic 3
EXCLUDE_TOPICS = []
# EXCLUDE_TOPICS = [0, 22]
TOPIC_LABELS = {i: f"Topic {i}" for i in range(N_TOPICS)}
# TOPIC_LABELS[0] = "Volume Rendering"
# TOPIC_LABELS[1] = "Graph & Network Visualization"
# ...
print("Merge map and labels defined. Adjust above as needed.")
# %%
df['merged_topic'] = df['dominant_topic'].map(MERGE_MAP)
df = df[~df['dominant_topic'].isin(EXCLUDE_TOPICS)].copy()
merged_groups = {}
for orig, merged in MERGE_MAP.items():
    if orig not in EXCLUDE_TOPICS:
        merged_groups.setdefault(merged, []).append(orig)
for group_id, members in merged_groups.items():
    member_cols = [f"topic_{m}" for m in members]
    existing_cols = [c for c in member_cols if c in topic_distributions.columns]
    if existing_cols:
        topic_distributions[f"merged_{group_id}"] = topic_distributions[existing_cols].sum(axis=1)
df['topic_label'] = df['merged_topic'].map(TOPIC_LABELS)
final_topics = sorted(df['merged_topic'].unique())
print(f"Number of final topics after merging: {len(final_topics)}")
for t in final_topics:
    count = (df['merged_topic'] == t).sum()
    print(f"  {TOPIC_LABELS.get(t, t):40s} ({count} papers)")
# %% [markdown]
# ## 5. Temporal Distribution (Pattern 4 – Cluster-Label-Distribute)
# %%
yearly_counts = df.groupby(['year', 'topic_label']).size().unstack(fill_value=0)
yearly_proportions = yearly_counts.div(yearly_counts.sum(axis=1), axis=0)
print(f"Yearly proportions table: {yearly_proportions.shape}")
# %%
fig, ax = plt.subplots(figsize=(16, 8))
yearly_proportions.plot.area(ax=ax, alpha=0.8, linewidth=0.5)
ax.set_title("Topic Proportions Over Time (Raw)", fontsize=14)
ax.set_xlabel("Year")
ax.set_ylabel("Proportion of Papers")
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)
ax.set_xlim(df['year'].min(), df['year'].max())
plt.tight_layout()
plt.show()
# %% [markdown]
# ## 6. Trend Smoothing (Pattern 5 – Simplify → Analyse → Restore)
# %%
WINDOW_SIZE = 5 # 5-year moving average
yearly_smoothed = yearly_proportions.rolling(
window=WINDOW_SIZE, center=True, min_periods=2
).mean()
yearly_smoothed = yearly_smoothed.dropna()
print(f"Smoothed range: {yearly_smoothed.index.min()} – {yearly_smoothed.index.max()}")
# %%
fig, ax = plt.subplots(figsize=(16, 8))
yearly_smoothed.plot.area(ax=ax, alpha=0.85, linewidth=0.5)
ax.set_title(f"Topic Proportions ({WINDOW_SIZE}-Year Moving Average)", fontsize=14)
ax.set_xlabel("Year")
ax.set_ylabel("Proportion of Papers")
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)
plt.tight_layout()
plt.show()
# %%
# Small-multiple line charts
n_final_topics = len(yearly_smoothed.columns)
ncols = 4
nrows = int(np.ceil(n_final_topics / ncols))
fig, axes = plt.subplots(nrows, ncols, figsize=(16, 3*nrows), sharex=True, sharey=True)
axes_flat = axes.flatten()
for idx, topic_name in enumerate(yearly_smoothed.columns):
    ax = axes_flat[idx]
    ax.plot(yearly_smoothed.index, yearly_smoothed[topic_name], color='steelblue', lw=2)
    ax.fill_between(yearly_smoothed.index, yearly_smoothed[topic_name], alpha=0.3)
    ax.set_title(topic_name, fontsize=9, fontweight='bold')
    ax.grid(True, alpha=0.3)
# Hide unused panels in the last row of the grid
for idx in range(n_final_topics, len(axes_flat)):
    axes_flat[idx].set_visible(False)
fig.suptitle(f"Individual Topic Trends ({WINDOW_SIZE}-Year Smoothing)", fontsize=14, y=1.01)
plt.tight_layout()
plt.show()
# %% [markdown]
# ## 7. Trend Classification & Summary
# %%
years = yearly_smoothed.index
n_years = len(years)
third = n_years // 3
early_period = yearly_smoothed.iloc[:third]
late_period = yearly_smoothed.iloc[-third:]
trend_summary = []
for topic_name in yearly_smoothed.columns:
    early_mean = early_period[topic_name].mean()
    late_mean = late_period[topic_name].mean()
    if early_mean > 0:
        rel_change = (late_mean - early_mean) / early_mean
    else:
        rel_change = 1.0 if late_mean > 0 else 0.0
    if rel_change > 0.3:
        trend = "📈 Emerging"
    elif rel_change < -0.3:
        trend = "📉 Declining"
    else:
        trend = "➡️ Stable"
    peak_year = yearly_smoothed[topic_name].idxmax()
    trend_summary.append({
        'Topic': topic_name,
        'Early Avg (%)': f"{early_mean*100:.1f}",
        'Late Avg (%)': f"{late_mean*100:.1f}",
        'Relative Change': f"{rel_change:+.0%}",
        'Peak Year': int(peak_year),
        'Trend': trend
    })
trend_df = pd.DataFrame(trend_summary).sort_values('Trend')
print("=" * 70)
print("TOPIC TREND SUMMARY")
print("=" * 70)
trend_df
# %% [markdown]
# ## 8. Detail-on-Demand (Pattern 5 – Restore)
# %%
DRILL_TOPIC = yearly_smoothed.columns[0] # Change to topic of interest
print(f"Drilling into: '{DRILL_TOPIC}'")
topic_papers = df[df['topic_label'] == DRILL_TOPIC].sort_values('year')
for era_start in range(int(df['year'].min()), int(df['year'].max()) + 1, 5):
    era_end = era_start + 4
    era_papers = topic_papers[
        (topic_papers['year'] >= era_start) & (topic_papers['year'] <= era_end)
    ]
    if len(era_papers) > 0:
        print(f"\n--- {era_start}–{era_end} ({len(era_papers)} papers) ---")
        for _, row in era_papers.nlargest(3, 'topic_weight').iterrows():
            # str() guards against missing titles, which pandas stores as NaN
            print(f"  [{row['year']}] {str(row['title'])[:90]}")
# %% [markdown]
# ## 9. Export Results
# %%
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)
df[['year', 'title', 'topic_label', 'topic_weight']].to_csv(
output_dir / "papers_with_topics.csv", index=False)
yearly_smoothed.to_csv(output_dir / "yearly_topic_proportions_smoothed.csv")
trend_df.to_csv(output_dir / "trend_summary.csv", index=False)
print(f"Results saved to '{output_dir}/'")
# %% [markdown]
# ## 10. (Optional) 2-D Topic Map
# %%
if N_TOPICS > 5:
    perplexity = min(5, N_TOPICS - 1)
    tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity)
    topic_coords = tsne.fit_transform(H)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.scatter(topic_coords[:, 0], topic_coords[:, 1],
               s=100, c=range(N_TOPICS), cmap='tab20', alpha=0.8, edgecolors='black')
    for i in range(N_TOPICS):
        if i not in EXCLUDE_TOPICS:
            ax.annotate(TOPIC_LABELS.get(i, f"T{i}"),
                        (topic_coords[i, 0], topic_coords[i, 1]),
                        fontsize=7, ha='center', va='bottom')
    ax.set_title("Topic Map (t-SNE of topic-term vectors)")
    plt.tight_layout()
    plt.show()
How to Use
- Save the code as topic_evolution.py and convert it to topic_evolution.ipynb (jupytext or VS Code can turn the # %% cells into a notebook).
- Place your dataset (CSV with year, title, abstract columns) in the same directory and update DATA_PATH.
- Run cells sequentially through Section 3 (topic extraction).
- Inspect the printed topics and representative papers.
- Edit MERGE_MAP, TOPIC_LABELS, and EXCLUDE_TOPICS in Section 4a based on your inspection; this is the human-in-the-loop step.
- Re-run from Section 4a onward to see the refined trend charts.
- Iterate until the topic set is satisfactory.
The key interactive decision point is Section 4a — that's where your domain expertise shapes the final result.
Please output the notebook as json file
(The assistant produced the complete Jupyter Notebook in standard .ipynb JSON format, with more than 30 cells covering all workflow steps from data loading through trend classification and export. An abbreviated outline of the file follows.)
{
"cells": [
{"cell_type": "markdown", "source": ["# IEEE VIS Research Topic Evolution (1990–2024)\n", ...]},
{"cell_type": "code", "source": ["# !pip install pandas numpy scikit-learn ..."]},
{"cell_type": "code", "source": ["import pandas as pd\n", "import numpy as np\n", ...]},
{"cell_type": "markdown", "source": ["## 1. Data Loading & Exploration\n", ...]},
{"cell_type": "code", "source": ["DATA_PATH = \"ieee_vis_papers.csv\"\n", ...]},
{"cell_type": "code", "source": ["# Drop rows with missing abstracts\n", ...]},
{"cell_type": "code", "source": ["# Papers per year\n", ...]},
{"cell_type": "markdown", "source": ["## 2. Text Preprocessing\n", ...]},
{"cell_type": "code", "source": ["df['text'] = ...\n", "CUSTOM_STOPWORDS = {...}\n", ...]},
{"cell_type": "markdown", "source": ["## 3. Topic Extraction (NMF)\n", ...]},
{"cell_type": "code", "source": ["N_TOPICS = 30\n", ...]},
{"cell_type": "code", "source": ["tfidf_vectorizer = TfidfVectorizer(...)\n", ...]},
{"cell_type": "code", "source": ["nmf_model = NMF(...)\n", ...]},
{"cell_type": "code", "source": ["# Display top terms per topic\n", ...]},
{"cell_type": "code", "source": ["# Assign dominant topic\n", ...]},
{"cell_type": "markdown", "source": ["## 4. Interactive Topic Refinement\n", ...]},
{"cell_type": "code", "source": ["# Cosine similarity heatmap\n", ...]},
{"cell_type": "code", "source": ["# Representative papers\n", ...]},
{"cell_type": "markdown", "source": ["### 4a. Merging & Labelling\n", ...]},
{"cell_type": "code", "source": ["MERGE_MAP = {...}\n", "EXCLUDE_TOPICS = [...]\n", ...]},
{"cell_type": "code", "source": ["# Apply merging\n", ...]},
{"cell_type": "markdown", "source": ["## 5. Temporal Distribution\n", ...]},
{"cell_type": "code", "source": ["yearly_counts = ...\n", ...]},
{"cell_type": "code", "source": ["# Raw stacked area chart\n", ...]},
{"cell_type": "markdown", "source": ["## 6. Trend Smoothing\n", ...]},
{"cell_type": "code", "source": ["WINDOW_SIZE = 5\n", ...]},
{"cell_type": "code", "source": ["# Smoothed stacked area chart\n", ...]},
{"cell_type": "code", "source": ["# Small-multiple line charts\n", ...]},
{"cell_type": "markdown", "source": ["## 7. Trend Classification\n", ...]},
{"cell_type": "code", "source": ["# Classify: Emerging / Declining / Stable\n", ...]},
{"cell_type": "code", "source": ["# Highlight top trends\n", ...]},
{"cell_type": "markdown", "source": ["## 8. Detail-on-Demand\n", ...]},
{"cell_type": "code", "source": ["DRILL_TOPIC = ...\n", ...]},
{"cell_type": "markdown", "source": ["## 9. Export Results\n", ...]},
{"cell_type": "code", "source": ["output_dir = Path('output')\n", ...]},
{"cell_type": "markdown", "source": ["## 10. (Optional) Topic Map\n", ...]},
{"cell_type": "code", "source": ["# t-SNE of topic-term vectors\n", ...]},
{"cell_type": "markdown", "source": ["---\n", "## Summary\n", ...]}
],
"metadata": {
"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
"language_info": {"name": "python", "version": "3.10.12"}
},
"nbformat": 4, "nbformat_minor": 4
}
The notebook contains 30+ cells implementing the complete 5-step workflow: text preprocessing, NMF topic extraction, interactive refinement (merge/label/exclude), temporal distribution with smoothing, and trend classification with export.