VIS Trends — Recommendation from Papers Library

Session with Claude Opus 4 · 4 rounds · Prose condition, Problem B

File: VIS_trends_recommend_from_papers_Claude_Opus_4_6R.json  |  Condition: Prose (paper PDFs)  |  Problem: B — Topic evolution in IEEE VIS 1990–2024  |  Rounds: 4 exchanges (4 user + 4 assistant)
User #1

The attached archive includes published research papers containing descriptions of visual analytics workflows. Your task will be to use this as a workflow library for recommending workflows addressing new problems. Analyse the library and extract reusable patterns.

Assistant #1
Reasoning:
The user wants me to analyze the library of visual analytics workflow papers and extract reusable patterns. Let me go through each paper and identify the key workflow patterns, then synthesize them into reusable patterns. Let me analyze the papers systematically:

  1. A_Partition-Based_Framework_for_Building_and_Validating_Regression_Models - Interactive feature selection workflow for regression models using partition-based visualization and ranking.
  2. Cluster_and_calendar_based_visualization_of_time_series_data - Clustering daily patterns + calendar visualization for time series exploration.
  3. Episodes and Topics in Multivariate Temporal Data - Progressive abstraction: SAX encoding → topic modeling → distribution exploration.
  4. EuroVis2019-Exploratity_model_analysis - Exploratory Model Analysis follows a structured pipeline from initial data exploration through problem specification, model generation, exploration, selection, and export.
  5. EventAction - Prescriptive analytics for temporal sequences involves finding similar records, exploring their outcomes, reviewing recommendations, and tuning plans.
  6. event_sequence_simplification - Temporal sequences get simplified through filtering, transformation, and find-and-replace operations.
  7. explainMLVis_binary_classifiers - Visual diagnostics for classifiers move through outcome-level analysis, then feature-level explanations, down to instance-level inspection.
  8. From_movement_tracks_through_events_to_places - Movement analysis extracts events from tracks, clusters them, delineates places, aggregates spatio-temporal data, and analyzes the results.
  9. Human-in-the-Loop - Feature engineering for pattern recognition progresses through data representation, temporal abstraction, feature assessment, grouping and labeling, then evaluation.
  10. MobilityGraphs - Mass mobility analysis simplifies space through clustering, then time through clustering, before visually exploring the resulting spatio-temporal patterns.
  11. Reducing_Snapshots_to_Points - Dynamic network exploration discretizes data, vectorizes and normalizes it, applies dimensionality reduction, then visualizes and interacts with the results.
  12. RfX - Random Forest exploration partitions trees into clusters, identifies representative decision trees, explores individual trees, and optimizes them.
  13. The_What-If_Tool - clusters.

Now I'm identifying the core patterns across these papers...

Reusable Visual Analytics Workflow Patterns

After analysing all 17 papers in the library, I extracted the following reusable workflow patterns that can be applied to new problems:


Pattern 1: Progressive Abstraction

Source papers: Episodes and Topics, From Movement Tracks, MobilityGraphs, Reducing Snapshots to Points

Structure:

  1. Start with raw, elementary data
  2. Apply a first-level transformation/abstraction (e.g., symbolic encoding, event extraction)
  3. Visualize and validate the intermediate representation
  4. Apply a second-level abstraction (e.g., topic modelling, clustering)
  5. Visualize and explore distribution of higher-level patterns in context

Applicable when: Data is hierarchically structured or too granular for direct insight; understanding requires building up from elements to patterns to distributions.
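
As a rough illustration of the two abstraction levels, the sketch below maps raw values to coarse symbols (level 1) and then summarises each symbol sequence as a frequency profile (level 2). The thresholds, the L/M/H alphabet, and the sample data are assumptions made for this example and do not come from the library papers.

```python
from collections import Counter

def to_symbols(values, low, high):
    """Level 1: map each raw value to a coarse symbol (L, M or H)."""
    return ["L" if v < low else "H" if v > high else "M" for v in values]

def to_profile(symbols):
    """Level 2: summarise a symbol sequence as relative symbol frequencies."""
    counts = Counter(symbols)
    return {s: counts[s] / len(symbols) for s in sorted(counts)}

# Hypothetical raw data: a few measurements per day.
raw = {"day1": [1, 2, 9, 8], "day2": [5, 5, 5, 6]}

# Inspect level1 before moving on (step 3: validate the intermediate form).
level1 = {day: to_symbols(vals, low=3, high=7) for day, vals in raw.items()}
level2 = {day: to_profile(syms) for day, syms in level1.items()}
```

Keeping `level1` as a separate, inspectable object is the point of the pattern: the analyst can check the intermediate representation before the second abstraction is applied.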


Pattern 2: Rank-then-Explore

Source papers: Partition-Based Regression Framework, Exploratory Model Analysis

Structure:

  1. Compute quantitative relevance/quality measures for many candidates (features, models, variable pairs)
  2. Rank candidates by measures; display as ordered table/list
  3. Coordinate ranked list with small-multiple visualisations showing qualitative structure
  4. User inspects top-ranked candidates visually to confirm relevance or spot issues
  5. Select candidate and proceed (e.g., add to model, export)

Applicable when: The search space is large (many features, many models) and a combination of automated scoring + human judgement is needed.
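
A minimal sketch of steps 1 and 2, assuming absolute Pearson correlation as the relevance measure (the source papers define their own measures; this choice is only a stand-in). The function name `rank_features` and the synthetic data are hypothetical.

```python
import numpy as np

def rank_features(X, y, names):
    """Return (name, score) pairs sorted by |Pearson r| with y, best first."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(scores)[::-1]
    return [(names[j], float(scores[j])) for j in order]

# Synthetic example: feature "b" dominates the target by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

ranking = rank_features(X, y, ["a", "b", "c"])
```

The ranked list would then be coordinated with small-multiple plots (step 3) so that the analyst can confirm or reject the top candidates visually.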


Pattern 3: Iterative Residual-Driven Model Refinement

Source papers: Partition-Based Regression Framework, Visual Diagnostics of Binary Classifiers

Structure:

  1. Build an initial model using the most dominant effect
  2. Compute residuals (or errors/misclassifications)
  3. Visualize residuals conditioned on remaining candidate variables
  4. Identify systematic patterns (bias, local structures, interactions)
  5. Augment the model with a new variable/transformation addressing the pattern
  6. Repeat from step 2 until stopping criterion is met

Applicable when: Building interpretable predictive models incrementally, especially when domain knowledge must guide feature selection.
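
The loop can be sketched as a greedy procedure in which the visual inspection of residuals (steps 3 and 4) is replaced by an automatic correlation check. That substitution, the linear model, the `min_corr` stopping threshold, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def residual_driven_selection(X, y, max_vars=3, min_corr=0.3):
    """Greedily add the variable most correlated with the current residuals."""
    selected = []
    residual = y - y.mean()              # start from a constant-only model
    while len(selected) < max_vars:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if not remaining:
            break
        corrs = {j: abs(np.corrcoef(X[:, j], residual)[0, 1]) for j in remaining}
        best = max(corrs, key=corrs.get)
        if corrs[best] < min_corr:       # stopping criterion: no structure left
            break
        selected.append(best)
        A = np.column_stack([np.ones(len(y)), X[:, selected]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef          # recompute residuals for the next round
    return selected, residual

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

selected, residual = residual_driven_selection(X, y)
```

In an interactive setting the `max(corrs, ...)` line is where the analyst would look at residual plots and choose the next variable or transformation.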


Pattern 4: Cluster-Label-Distribute

Source papers: Cluster and Calendar, MobilityGraphs, Episodes and Topics, Human-in-the-Loop, Reducing Snapshots to Points

Structure:

  1. Define a meaningful feature/distance space for the items
  2. Cluster items by similarity (density-based or partition-based)
  3. Visualise cluster representatives (prototypes, summaries, averages)
  4. Interpret/label clusters using domain knowledge
  5. Explore the distribution of clusters over context dimensions (time, space, categories)

Applicable when: A dataset contains many instances that need to be grouped for overview, and the analyst wants to understand when/where/why different groups occur.
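
A small sketch of the cluster, label, and distribute steps. It uses a bare-bones k-means with deterministic farthest-point initialisation (an assumption chosen for reproducibility; the source papers use their own clustering methods), and an automatic naming rule stands in for the analyst's labelling. All data and names are hypothetical.

```python
import numpy as np
import pandas as pd

def kmeans(X, k, n_iter=10):
    """Tiny k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

# Hypothetical daily profiles (two features per day) and their context.
X = np.array([[10, 10], [11, 9], [10, 11], [1, 1], [0, 2], [1, 0]], dtype=float)
day_type = ["weekday", "weekday", "weekday", "weekend", "weekend", "weekend"]

labels, centers = kmeans(X, k=2)

# Stand-in for manual labelling: name clusters by their prototype level.
names = {c: ("high" if centers[c].mean() > 5 else "low") for c in range(2)}

df_demo = pd.DataFrame({"day_type": day_type,
                        "cluster": [names[l] for l in labels]})
distribution = pd.crosstab(df_demo["day_type"], df_demo["cluster"])
```

The `distribution` table is the "distribute" step: it shows how the labelled clusters spread over a context dimension, here the type of day.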


Pattern 5: Simplify → Analyse → Restore

Source papers: Temporal Event Sequence Simplification, MobilityGraphs, Reducing Snapshots to Points

Structure:

  1. Start with complex, cluttered data
  2. Apply user-driven simplifications (filtering, merging, replacing, aggregating)
  3. Measure complexity reduction (e.g., visual element count)
  4. Analyse the simplified view for patterns
  5. Selectively restore detail when needed for deeper investigation
  6. Record simplification history for reproducibility

Applicable when: Raw data is too noisy/complex for pattern detection; the signal-to-noise ratio must be improved interactively.
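
One way to make simplification reversible is to keep the raw data untouched and replay a recorded history of steps. The sketch below assumes two step kinds ("filter" and "replace") and counts events as the complexity measure; the step vocabulary and the sample sequences are hypothetical.

```python
def apply_step(sequences, step):
    """Apply one simplification step to a list of event sequences."""
    kind, arg = step
    if kind == "filter":                 # drop every event of one type
        return [[e for e in seq if e != arg] for seq in sequences]
    if kind == "replace":                # rename one event type to another
        old, new = arg
        return [[new if e == old else e for e in seq] for seq in sequences]
    raise ValueError(f"unknown step kind: {kind}")

def simplify(sequences, history):
    """Replay a recorded history of steps on the raw sequences."""
    out = sequences
    for step in history:
        out = apply_step(out, step)
    return out

def complexity(sequences):
    """Complexity proxy: total number of events to draw."""
    return sum(len(seq) for seq in sequences)

raw = [["admit", "xray", "note", "ct", "discharge"],
       ["admit", "note", "note", "discharge"]]
history = [("filter", "note"),
           ("replace", ("xray", "imaging")),
           ("replace", ("ct", "imaging"))]

simplified = simplify(raw, history)
# Restoring detail means replaying a shorter prefix of the history.
restored = simplify(raw, history[:1])
```

Because `raw` is never modified, any earlier state can be recovered, and the `history` list doubles as the reproducibility record from step 6.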


Pattern 6: Similarity-Based Prescription

Source papers: EventAction, Reducing Snapshots to Points

Structure:

  1. Define the current record/state of interest
  2. Compute similarity to archived/historical records
  3. Visualise the distribution of outcomes for similar records
  4. Identify actions/sequences correlated with desired outcomes
  5. Construct a personalised action plan
  6. Provide feedback on estimated likelihood of success
  7. Iterate to refine

Applicable when: The goal is recommending future actions based on what has worked for similar cases in the past (prescriptive analytics).
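
Steps 2 to 4 can be sketched with a plain nearest-neighbour lookup. Euclidean distance on unscaled features, the value of `k`, and the toy records are assumptions for illustration; a real system would normalise features and use a domain-specific similarity.

```python
import numpy as np
from collections import Counter

def similar_records(query, features, k):
    """Indices of the k archived records closest to the query."""
    d = np.linalg.norm(features - query, axis=1)
    return np.argsort(d)[:k]

# Hypothetical archive: two numeric attributes, a binary outcome,
# and the actions each record took.
features = np.array([[3.9, 2], [3.8, 1], [2.1, 0], [3.7, 2], [2.0, 1]], dtype=float)
outcomes = np.array([1, 1, 0, 0, 0])
actions = [["internship", "research"], ["internship"], ["course"],
           ["course"], ["research"]]

query = np.array([3.8, 2.0])
idx = similar_records(query, features, k=3)

# Outcome distribution among similar records (step 3).
success_rate = float(outcomes[idx].mean())

# Actions that co-occur with the desired outcome (step 4).
action_counts = Counter(a for i in idx if outcomes[i] == 1 for a in actions[i])
```

The counts only show co-occurrence with success among similar records, so they are a starting point for the plan in step 5 rather than evidence of causation.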


Pattern 7: Human-in-the-Loop Feature Engineering for Classification

Source papers: Human-in-the-Loop, UTOPIAN, RfX

Structure:

  1. Derive interval-based or aggregate features from raw temporal/sequential data
  2. Assess feature discriminability through visual inspection (distributions, projections)
  3. Select/refine features iteratively
  4. Group data by features (clustering or dimensionality reduction)
  5. Visually inspect groups; label representative examples in bulk
  6. Validate labelled set (e.g., k-NN test) before training a full model

Applicable when: Training labels are scarce, pattern definitions are vague, and domain expertise is needed to bridge raw data and meaningful categories.
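
The validation in step 6 could look like the leave-one-out k-NN check below. The function name, `k = 3`, and the toy feature vectors are assumptions; the idea is only that a labelled set whose items disagree with their nearest neighbours probably needs another look.

```python
import numpy as np

def loo_knn_accuracy(X, labels, k=3):
    """Leave-one-out k-NN accuracy of a labelled set."""
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                         # leave the item itself out
        nn = np.argsort(d)[:k]
        values, counts = np.unique(labels[nn], return_counts=True)
        correct += int(values[counts.argmax()] == labels[i])
    return correct / len(X)

# Two well-separated hypothetical groups in feature space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
clean_labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
noisy_labels = ["b", "a", "a", "a", "b", "b", "b", "b"]  # first item mislabelled

accuracy_clean = loo_knn_accuracy(X, clean_labels)
accuracy_noisy = loo_knn_accuracy(X, noisy_labels)
```

A drop in this score after a bulk-labelling round points to items worth re-inspecting before a full model is trained.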


Pattern 8: Exploratory Model Analysis (Multi-Problem, Multi-Model)

Source papers: Exploratory Model Analysis (EuroVis 2019), What-If Tool

Structure:

  1. Explore data overview to understand attributes and distributions
  2. Generate/explore candidate modelling problems (target, features, type)
  3. Specify the problem; auto-generate diverse models (AutoML or equivalent)
  4. Compare models on holdout predictions (confusion matrices, residuals, fairness)
  5. Select preferred model(s) considering task-specific criteria beyond accuracy
  6. Export models for deployment

Applicable when: The modelling goal itself is not fixed a priori; the analyst wants to discover what predictions are possible and robust.


Pattern 9: Spatial Event Extraction → Place Delineation → Flow Analysis

Source papers: From Movement Tracks through Events to Places, MobilityGraphs

Structure:

  1. Compute dynamic attributes from trajectories (speed, direction, proximity)
  2. Extract movement events by filtering on attribute thresholds (visually guided)
  3. Cluster events spatially (and optionally temporally) using density-based methods
  4. Build spatial buffers around clusters to delineate "places"
  5. Aggregate events/trajectories by places and time intervals
  6. Analyse aggregated time series or flows for spatio-temporal patterns

Applicable when: Relevant places are not predefined but must emerge from movement behaviour; analysis focuses on place-based temporal patterns.
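
A reduced sketch of steps 1, 2, and 4 for a single trajectory. It treats all extracted stop events as one cluster, so the density-based clustering of step 3 is deliberately omitted. The speed threshold, the buffer size, and the coordinates are made-up values.

```python
import numpy as np

def segment_speeds(t, xy):
    """Speed of each segment between consecutive track points."""
    step = np.linalg.norm(np.diff(xy, axis=0), axis=1)
    return step / np.diff(t)

# Hypothetical track: timestamps in seconds, positions in metres.
t = np.array([0, 10, 20, 30, 40, 50], dtype=float)
xy = np.array([[0, 0], [100, 0], [101, 0], [102, 1], [200, 1], [300, 1]], dtype=float)

v = segment_speeds(t, xy)

# Step 2: extract "stop" events as segment end points with low speed.
stop_mask = v < 1.0                     # threshold in m/s (assumed)
stop_points = xy[1:][stop_mask]

# Step 4 (simplified): one circular "place" around the stop events plus a buffer.
centre = stop_points.mean(axis=0)
radius = np.linalg.norm(stop_points - centre, axis=1).max() + 25.0

# Step 5: count track points that fall inside the place.
in_place = np.linalg.norm(xy - centre, axis=1) <= radius
visits = int(in_place.sum())
```

With several trajectories, the stop events would first be clustered, and one buffer would be built per cluster before aggregating by place and time interval.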


Pattern 10: Model Explanation → Diagnostic Drill-Down

Source papers: Visual Diagnostics of Binary Classifiers, What-If Tool, RfX

Structure:

  1. Outcome-level: Visualise overall prediction distribution and accuracy (confusion matrix, ROC, histograms)
  2. Explanation-level: Aggregate instance-level explanations into groups; show which features drive decisions and how accurate each decision group is
  3. Instance-level: Drill into individual data items to see raw feature values and understand root causes of errors
  4. Generate diagnostic insights (data quality issues, missing features, label leakage)

Applicable when: A trained classifier needs semantic validation by domain experts, beyond aggregate accuracy metrics.


Pattern 11: Temporal Segmentation + Symbolic Encoding + Co-occurrence Mining

Source papers: Episodes and Topics, Cluster and Calendar, Human-in-the-Loop

Structure:

  1. Segment multivariate time series into episodes (sliding window or event-based)
  2. Encode each attribute's variation within an episode symbolically (e.g., SAX)
  3. Treat episodes as "documents" and attribute patterns as "words"
  4. Apply topic modelling (LDA, NMF) to discover co-occurring patterns ("topics")
  5. Interpret topics visually; merge semantically close topics
  6. Explore topic distribution over time, space, or other context dimensions

Applicable when: The goal is to find and understand recurring combinations of behaviours across multiple attributes simultaneously.
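
Steps 2 and 3 can be illustrated with a compact SAX encoder: z-normalise the series, average it over equal segments, and map each segment mean to a letter using standard-normal breakpoints. The alphabet sizes offered, the attribute names, and the sample episode are assumptions; the series length must be divisible by the number of segments and the series must not be constant.

```python
import numpy as np

# Breakpoints that split a standard normal into equiprobable regions.
BREAKPOINTS = {3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}

def sax(series, n_segments, alphabet_size=3):
    """Encode a numeric series as a SAX word."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                      # z-normalise
    paa = x.reshape(n_segments, -1).mean(axis=1)      # piecewise aggregate means
    idx = np.searchsorted(BREAKPOINTS[alphabet_size], paa)
    return "".join("abcd"[i] for i in idx)

# One hypothetical episode with two attributes.
episode = {"speed": [1, 1, 2, 2, 9, 9, 5, 5],
           "heading": [0, 0, 0, 0, 1, 1, 1, 1]}

# Step 3: the episode becomes a "document" whose "words" are attribute patterns.
words = [f"{attr}_{sax(values, n_segments=4)}" for attr, values in episode.items()]
document = " ".join(words)
```

A collection of such documents can then be passed to a bag-of-words vectoriser and a topic model for step 4.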


Pattern 12: Interactive Model Steering (Semantic Interaction)

Source papers: UTOPIAN, What-If Tool, RfX, Partition-Based Regression

Structure:

  1. System produces an initial model/result automatically
  2. User inspects the result through coordinated visualisations
  3. User directly manipulates the result (merge topics, adjust thresholds, edit counterfactuals, prune trees)
  4. System re-computes model incorporating user changes
  5. Updated result is shown immediately for validation
  6. Iterate until the user is satisfied

Applicable when: The model requires domain knowledge for refinement that cannot be easily formalised as constraints; experts need to inject knowledge interactively.


Summary Table

# | Pattern Name | Key Mechanism | Typical Output
1 | Progressive Abstraction | Multi-level aggregation with validation at each level | High-level interpretable patterns
2 | Rank-then-Explore | Quantitative ranking + qualitative visual inspection | Selected features/variables
3 | Residual-Driven Refinement | Iterative error analysis and model augmentation | Interpretable multi-variable model
4 | Cluster-Label-Distribute | Cluster → interpret → explore distribution | Labelled groups with contextual understanding
5 | Simplify → Analyse → Restore | Complexity reduction with reversibility | Clean visual overview + preserved detail
6 | Similarity-Based Prescription | Find similar → show outcomes → plan actions | Recommended action sequences
7 | Human-in-the-Loop Feature Engineering | Feature design + bulk visual labelling + validation | Training set for ML model
8 | Exploratory Model Analysis | Problem exploration + multi-model comparison | Deployed predictive model(s)
9 | Event → Place → Flow | Event extraction → spatial clustering → aggregation | Spatio-temporal activity patterns
10 | Explanation Drill-Down | Aggregate → explanation groups → raw instances | Diagnostic insights for model improvement
11 | Symbolic Encoding + Topic Mining | SAX + topic modelling + distribution exploration | Multi-attribute temporal patterns
12 | Interactive Model Steering | Direct manipulation → re-computation loop | Domain-tuned model/result

These patterns can be composed (e.g., Pattern 4 inside Pattern 1, or Pattern 3 using Pattern 2 at each step) and adapted to specific data types and analysis goals.

User #2

I need a workflow recommendation for the following task. I want to analyse the dataset that contains information on IEEE Visualization (IEEE VIS) publications from 1990-2024 as well as IEEE TVCG and IEEE CG&A articles published at IEEE VIS. The dataset includes for each paper the publication year, title, and abstract. I want to reveal the major research topics and how they evolved over the years. I am interested in seeing trends rather than minor fluctuations. Using the workflow library, compose an appropriate workflow.

Assistant #2
Reasoning:
The user wants to analyse a dataset of IEEE VIS publications (1990-2024) containing year, title, and abstract. Their goals are:

  1. Reveal major research topics
  2. Show how topics evolved over the years
  3. Focus on trends rather than minor fluctuations

Let me compose a workflow from the patterns I identified. The most relevant patterns are:

  • Pattern 11: Temporal Segmentation + Symbolic Encoding + Co-occurrence Mining - This is about discovering co-occurring patterns using topic modelling. The "documents" here are the papers themselves (title + abstract), and we can apply topic modelling directly.
  • Pattern 4: Cluster-Label-Distribute - Cluster papers by similarity, label the clusters, then explore distribution over time.
  • Pattern 1: Progressive Abstraction - Start with raw papers, abstract to topics, then look at temporal distributions.
  • Pattern 5: Simplify → Analyse → Restore - To focus on trends rather than minor fluctuations, we need smoothing/simplification.
  • Pattern 12: Interactive Model Steering - To refine topics interactively (merge, split, relabel).

So the workflow would start with text preprocessing on the abstracts to clean them up, then move into topic extraction and temporal analysis.

Recommended Workflow: Research Topic Evolution in IEEE VIS (1990–2024)

Composed from Patterns: 11 → 12 → 4 → 5 → 1


Overview Diagram

Raw Papers (year, title, abstract)
        │
        ▼
┌─────────────────────────┐
│ 1. Text Preprocessing   │  (Pattern 1 – first abstraction level)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 2. Topic Extraction     │  (Pattern 11 – topic mining)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 3. Interactive Refine   │  (Pattern 12 – model steering)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 4. Label & Distribute   │  (Pattern 4 – cluster-label-distribute)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 5. Trend Smoothing      │  (Pattern 5 – simplify for trends)
└───────────┬─────────────┘
            ▼
      Trend Narratives

Step-by-Step Workflow

Step 1 — Text Preprocessing (Pattern 1: Progressive Abstraction, Level 1)

Action | Detail
Concatenate title + abstract per paper | Creates one "document" per publication
Tokenise, lowercase, remove stopwords | Standard NLP pipeline
Remove domain-generic terms | e.g., "paper", "propose", "result", "method"
Optional: bigram/trigram collocation | Preserves phrases like "parallel coordinates", "deep learning"
Represent as bag-of-words or TF-IDF matrix | Input for topic model

Output: Document-term matrix (≈ 4 000–5 000 papers × vocabulary)


Step 2 — Topic Extraction (Pattern 11: Topic Mining)

Action | Detail
Choose topic model | LDA, NMF, or BERTopic (embedding-based)
Set initial k (number of topics) | Start with k = 25–40 (intentionally over-segment)
Fit model | Each paper receives a topic-probability vector
Inspect top terms per topic | Ranked word lists or word clouds
Inspect representative papers per topic | Highest-probability papers as exemplars

Output: k candidate topics, each described by top terms and representative papers; per-paper topic assignments (soft or hard).

Rationale for over-segmentation: It is easier to merge semantically close topics in the next step than to split overly broad ones.


Step 3 — Interactive Topic Refinement (Pattern 12: Interactive Model Steering)

Action | Detail
Visualise topic similarity | 2-D projection (t-SNE/UMAP of topic-term vectors) or dendrogram
Merge near-duplicate topics | e.g., "graph drawing" + "network layout" → single topic
Split incoherent topics | If top terms mix unrelated concepts, re-cluster that subset
Remove junk topics | Boilerplate or overly generic clusters
Re-label topics with human-readable names | e.g., "Volume Rendering", "Text & NLP", "Uncertainty Visualisation"
Validate | Check that representative papers match the label

Output: Refined set of ~10–20 well-labelled major topics; each paper assigned to one or more topics.

Iterate until topic set is stable and interpretable.
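
To support the merge decision, candidate pairs can be listed automatically from the topic-term matrix before the analyst reviews them. The sketch below assumes a matrix with one row per topic (as produced by NMF or LDA) and a cosine-similarity threshold of 0.5; both the threshold and the demo matrix are illustrative guesses, not tuned values.

```python
import numpy as np

def merge_candidates(H, threshold=0.5):
    """Return (i, j, similarity) for topic pairs above the cosine threshold."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    unit = H / np.clip(norms, 1e-12, None)    # guard against all-zero topics
    sim = unit @ unit.T
    pairs = [(i, j, float(sim[i, j]))
             for i in range(len(H)) for j in range(i + 1, len(H))
             if sim[i, j] > threshold]
    return sorted(pairs, key=lambda p: -p[2])  # most similar pair first

# Hypothetical topic-term weights: topics 0 and 1 share their top terms.
H_demo = np.array([[1.0, 0.9, 0.0, 0.0],
                   [0.8, 1.0, 0.1, 0.0],
                   [0.0, 0.0, 1.0, 0.9]])

candidates = merge_candidates(H_demo)
```

The list is only a suggestion: whether two similar topics really belong together is still decided by reading their top terms and representative papers.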


Step 4 — Label & Distribute over Time (Pattern 4: Cluster-Label-Distribute)

Action | Detail
For each year, count papers per topic | Absolute counts or proportions (share of total)
Construct a year × topic table of paper counts | Basis for temporal visualisation
Visualise as stacked area chart or ThemeRiver | Shows relative topic prominence over 35 years
Alternatively: small-multiple line charts | One line per topic; easier to compare slopes

Output: Raw temporal distribution of topics (will still be noisy year-to-year).


Step 5 — Trend Smoothing & Narrative (Pattern 5: Simplify → Analyse → Restore)

Action | Detail
Apply temporal smoothing | Moving average (window = 3–5 years) or LOESS/lowess curve per topic
Suppress minor fluctuations | Focus on monotone rises, falls, or plateaus
Identify trend phases | "Emerging" (recent upward), "Declining" (downward), "Stable" (flat)
Annotate key inflection points | e.g., "Deep-learning topics surge after 2016"
Restore detail on demand | Allow drill-down to individual years/papers for any interesting inflection

Output: Smooth trend lines per topic with annotations; classification of topics into trend categories.
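
A minimal sketch of smoothing followed by phase labelling, assuming a centred moving average and a least-squares slope with a tolerance of 0.002 (0.2 percentage points per year). The tolerance, the function names, and the synthetic topic share are assumptions for illustration.

```python
import numpy as np

def moving_average(values, window=5):
    """Centred moving average; the output is shorter by window - 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

def trend_phase(years, values, tol=0.002):
    """Label a series by the slope of a straight-line fit."""
    x = np.asarray(years, dtype=float)
    slope = np.polyfit(x - x.mean(), values, 1)[0]
    if slope > tol:
        return "emerging"
    if slope < -tol:
        return "declining"
    return "stable"

# Synthetic topic share: steady growth plus year-to-year zig-zag noise.
years = np.arange(2000, 2012)
share = 0.05 + 0.01 * np.arange(12) + 0.02 * (-1.0) ** np.arange(12)

smoothed = moving_average(share, window=5)
smoothed_years = years[2:-2]          # years that have a full 5-year window
phase = trend_phase(smoothed_years, smoothed)
```

Fitting the slope on the smoothed rather than the raw series keeps single-year spikes from deciding the label, which matches the stated interest in trends over minor fluctuations.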


Final Deliverables

Artefact | Description
Topic catalogue | Table of major research topics with label, top terms, and 3–5 exemplar papers
Trend chart | Smoothed stacked area or small-multiple line chart (1990–2024)
Trend summary | Narrative listing emerging topics (e.g., ML4VIS, fairness), declining topics (e.g., VRML, early VR), and stable pillars (e.g., volume rendering, information visualisation)
Interactive view (optional) | Linked dashboard: select a topic → see its trend + representative papers per period

Justification of Pattern Composition

User Requirement | Addressed by
"Reveal major research topics" | Patterns 11 + 12 (topic extraction + refinement)
"How they evolved over the years" | Pattern 4 (distribute over time)
"Trends rather than minor fluctuations" | Pattern 5 (temporal smoothing)
Interpretability & trust | Pattern 1 (progressive abstraction with validation at each level)

Tool Suggestions (optional)

  • Topic modelling: scikit-learn (NMF/LDA), BERTopic, or MALLET
  • Visualisation: Matplotlib/Seaborn (static), Streamlit + Altair (interactive), or dedicated VA tools like pyLDAvis
  • Smoothing: statsmodels (statsmodels.api.nonparametric.lowess) or pandas (DataFrame.rolling)

User #3

To test this workflow, create a Jupyter Notebook (Python).

Assistant #3
Reasoning:
The user wants me to create a Jupyter Notebook in Python that implements the workflow I recommended for analyzing IEEE VIS publications (1990-2024) to reveal major research topics and their evolution over time. I'll create a complete Jupyter Notebook with all the steps:

  1. Text Preprocessing
  2. Topic Extraction (using NMF or LDA)
  3. Interactive Topic Refinement (manual merging/labeling)
  4. Label & Distribute over Time
  5. Trend Smoothing & Narrative

I'll structure it as a proper notebook with markdown cells and code cells. I'll make the notebook flexible to handle a CSV dataset with year, title, and abstract columns.

(The assistant produced a complete Python implementation. The code is shown below as it was presented in the session — as a percent-format notebook script.)

# %% [markdown]
# # IEEE VIS Research Topic Evolution (1990–2024)
#
# This notebook implements the recommended visual analytics workflow to reveal
# major research topics in IEEE VIS publications and how they evolved over time.
#
# **Workflow:** Progressive Abstraction → Topic Mining → Interactive Steering →
# Cluster-Label-Distribute → Simplify for Trends

# %% [markdown]
# ## 0. Setup & Installation

# %%
# !pip install pandas numpy scikit-learn matplotlib seaborn nltk wordcloud adjustText

# %%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# NLP
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Topic Modelling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.manifold import TSNE

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

print("Setup complete.")

# %% [markdown]
# ## 1. Data Loading & Exploration
#
# Expects a CSV file with at least columns: `Year`, `Title`, `Abstract`.

# %%
# === CONFIGURE THIS ===
DATA_PATH = "ieee_vis_papers.csv"  # <-- Adjust path to your dataset

# Load data
df = pd.read_csv(DATA_PATH)
df.columns = df.columns.str.strip().str.lower()

COLUMN_MAP = {
    'year': 'year',
    'title': 'title',
    'abstract': 'abstract',
}
df = df.rename(columns=COLUMN_MAP)

assert 'year' in df.columns
assert 'title' in df.columns
assert 'abstract' in df.columns

print(f"Dataset shape: {df.shape}")
print(f"Year range: {df['year'].min()} – {df['year'].max()}")
df.head(3)

# %%
# Drop rows with missing abstracts
before = len(df)
df = df.dropna(subset=['abstract']).copy()
df = df[df['abstract'].str.strip().str.len() > 50]
print(f"Dropped {before - len(df)} rows. Remaining: {len(df)}")

# %%
# Papers per year
fig, ax = plt.subplots(figsize=(12, 4))
df.groupby('year').size().plot(kind='bar', ax=ax, color='steelblue', width=0.8)
ax.set_title("Number of Papers per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Count")
plt.tight_layout()
plt.show()

# %% [markdown]
# ## 2. Text Preprocessing (Pattern 1 – Progressive Abstraction, Level 1)

# %%
df['text'] = df['title'].fillna('') + ' ' + df['abstract'].fillna('')

CUSTOM_STOPWORDS = {
    'paper', 'propose', 'proposed', 'approach', 'method', 'methods',
    'result', 'results', 'show', 'shown', 'use', 'used', 'using',
    'present', 'presented', 'new', 'novel', 'technique', 'techniques',
    'based', 'provide', 'system', 'also', 'work', 'problem',
    'describe', 'described', 'demonstrate', 'demonstrated',
    'algorithm', 'algorithms', 'however', 'many', 'one', 'two',
    'first', 'second', 'allow', 'allows', 'well', 'can', 'may',
    'different', 'several', 'existing', 'within', 'number',
    'introduction', 'conclusion', 'abstract', 'study', 'studies',
    'example', 'examples', 'application', 'applications'
}

stop_words = set(stopwords.words('english')).union(CUSTOM_STOPWORDS)
lemmatizer = WordNetLemmatizer()

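def preprocess_text(text):
    """Lowercase, strip non-letters, drop stopwords, and lemmatise."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = []
    for t in text.split():
        if t in stop_words or len(t) <= 2:
            continue
        lemma = lemmatizer.lemmatize(t)
        # Re-check after lemmatising so that e.g. "systems" -> "system" is dropped too
        if lemma not in stop_words and len(lemma) > 2:
            tokens.append(lemma)
    return ' '.join(tokens)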

df['processed_text'] = df['text'].apply(preprocess_text)
print("Preprocessing complete.")

# %% [markdown]
# ## 3. Topic Extraction (Pattern 11 – Topic Mining)
#
# NMF on TF-IDF features; intentionally over-segment (k=30).

# %%
N_TOPICS = 30
MAX_FEATURES = 5000
MIN_DF = 5
MAX_DF = 0.7
NGRAM_RANGE = (1, 2)

# %%
tfidf_vectorizer = TfidfVectorizer(
    max_features=MAX_FEATURES,
    min_df=MIN_DF, max_df=MAX_DF,
    ngram_range=NGRAM_RANGE,
    token_pattern=r'(?u)\b[a-z][a-z]+\b'
)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'])
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"TF-IDF matrix: {tfidf_matrix.shape[0]} docs × {tfidf_matrix.shape[1]} terms")

# %%
nmf_model = NMF(n_components=N_TOPICS, random_state=42, max_iter=500, init='nndsvda')
W = nmf_model.fit_transform(tfidf_matrix)
H = nmf_model.components_
print(f"NMF complete. Reconstruction error: {nmf_model.reconstruction_err_:.2f}")

# %%
N_TOP_TERMS = 12

def display_topics(model, feature_names, n_top=N_TOP_TERMS):
    topics = []
    for idx, topic_vec in enumerate(model.components_):
        top_indices = topic_vec.argsort()[:-n_top - 1:-1]
        top_terms = [feature_names[i] for i in top_indices]
        topics.append(top_terms)
        print(f"Topic {idx:2d}: {', '.join(top_terms)}")
    return topics

print("=" * 70)
print("TOP TERMS PER TOPIC")
print("=" * 70)
topic_terms = display_topics(nmf_model, feature_names)

# %%
df['dominant_topic'] = W.argmax(axis=1)
df['topic_weight'] = W.max(axis=1)

topic_distributions = pd.DataFrame(
    W, columns=[f"topic_{i}" for i in range(N_TOPICS)], index=df.index
)
print("\nPapers per topic:")
print(df['dominant_topic'].value_counts().sort_index())

# %% [markdown]
# ## 4. Interactive Topic Refinement (Pattern 12 – Model Steering)

# %%
from sklearn.metrics.pairwise import cosine_similarity

topic_similarity = cosine_similarity(H)

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(topic_similarity, annot=False, cmap='YlOrRd',
            xticklabels=range(N_TOPICS), yticklabels=range(N_TOPICS), ax=ax)
ax.set_title("Topic-Topic Cosine Similarity")
plt.tight_layout()
plt.show()

# %%
# Representative papers per topic
for t in range(N_TOPICS):
    topic_papers = df[df['dominant_topic'] == t].nlargest(3, 'topic_weight')
    print(f"\n--- Topic {t} ---")
    print(f"    Terms: {', '.join(topic_terms[t][:8])}")
    for _, row in topic_papers.iterrows():
        print(f"    [{row['year']}] {row['title'][:80]}")

# %% [markdown]
# ### 4a. Define Topic Merging & Labelling
#
# Edit MERGE_MAP, TOPIC_LABELS, and EXCLUDE_TOPICS based on inspection.

# %%
MERGE_MAP = {i: i for i in range(N_TOPICS)}
# MERGE_MAP[5] = 3    # Example: merge topic 5 into topic 3

EXCLUDE_TOPICS = []
# EXCLUDE_TOPICS = [0, 22]

TOPIC_LABELS = {i: f"Topic {i}" for i in range(N_TOPICS)}
# TOPIC_LABELS[0] = "Volume Rendering"
# TOPIC_LABELS[1] = "Graph & Network Visualization"
# ...

print("Merge map and labels defined. Adjust above as needed.")

# %%
df['merged_topic'] = df['dominant_topic'].map(MERGE_MAP)
df = df[~df['dominant_topic'].isin(EXCLUDE_TOPICS)].copy()

merged_groups = {}
for orig, merged in MERGE_MAP.items():
    if orig not in EXCLUDE_TOPICS:
        merged_groups.setdefault(merged, []).append(orig)

for group_id, members in merged_groups.items():
    member_cols = [f"topic_{m}" for m in members]
    existing_cols = [c for c in member_cols if c in topic_distributions.columns]
    if existing_cols:
        topic_distributions[f"merged_{group_id}"] = topic_distributions[existing_cols].sum(axis=1)

df['topic_label'] = df['merged_topic'].map(TOPIC_LABELS)

final_topics = sorted(df['merged_topic'].unique())
print(f"Number of final topics after merging: {len(final_topics)}")
for t in final_topics:
    count = (df['merged_topic'] == t).sum()
    # str() keeps the format spec valid if a merged id has no entry in TOPIC_LABELS
    print(f"  {str(TOPIC_LABELS.get(t, t)):40s} ({count} papers)")

# %% [markdown]
# ## 5. Temporal Distribution (Pattern 4 – Cluster-Label-Distribute)

# %%
yearly_counts = df.groupby(['year', 'topic_label']).size().unstack(fill_value=0)
yearly_proportions = yearly_counts.div(yearly_counts.sum(axis=1), axis=0)
print(f"Yearly proportions table: {yearly_proportions.shape}")

# %%
fig, ax = plt.subplots(figsize=(16, 8))
yearly_proportions.plot.area(ax=ax, alpha=0.8, linewidth=0.5)
ax.set_title("Topic Proportions Over Time (Raw)", fontsize=14)
ax.set_xlabel("Year")
ax.set_ylabel("Proportion of Papers")
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)
ax.set_xlim(df['year'].min(), df['year'].max())
plt.tight_layout()
plt.show()

# %% [markdown]
# ## 6. Trend Smoothing (Pattern 5 – Simplify → Analyse → Restore)

# %%
WINDOW_SIZE = 5  # 5-year moving average

yearly_smoothed = yearly_proportions.rolling(
    window=WINDOW_SIZE, center=True, min_periods=2
).mean()
yearly_smoothed = yearly_smoothed.dropna()
print(f"Smoothed range: {yearly_smoothed.index.min()} – {yearly_smoothed.index.max()}")

# %%
fig, ax = plt.subplots(figsize=(16, 8))
yearly_smoothed.plot.area(ax=ax, alpha=0.85, linewidth=0.5)
ax.set_title(f"Topic Proportions ({WINDOW_SIZE}-Year Moving Average)", fontsize=14)
ax.set_xlabel("Year")
ax.set_ylabel("Proportion of Papers")
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)
plt.tight_layout()
plt.show()

# %%
# Small-multiple line charts
n_final_topics = len(yearly_smoothed.columns)
ncols = 4
nrows = int(np.ceil(n_final_topics / ncols))

fig, axes = plt.subplots(nrows, ncols, figsize=(16, 3*nrows), sharex=True, sharey=True)
axes_flat = axes.flatten()

for idx, topic_name in enumerate(yearly_smoothed.columns):
    ax = axes_flat[idx]
    ax.plot(yearly_smoothed.index, yearly_smoothed[topic_name], color='steelblue', lw=2)
    ax.fill_between(yearly_smoothed.index, yearly_smoothed[topic_name], alpha=0.3)
    ax.set_title(topic_name, fontsize=9, fontweight='bold')
    ax.grid(True, alpha=0.3)

for idx in range(n_final_topics, len(axes_flat)):
    axes_flat[idx].set_visible(False)

fig.suptitle(f"Individual Topic Trends ({WINDOW_SIZE}-Year Smoothing)", fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

# %% [markdown]
# ## 7. Trend Classification & Summary

# %%
years = yearly_smoothed.index
n_years = len(years)
third = n_years // 3

early_period = yearly_smoothed.iloc[:third]
late_period = yearly_smoothed.iloc[-third:]

trend_summary = []
for topic_name in yearly_smoothed.columns:
    early_mean = early_period[topic_name].mean()
    late_mean = late_period[topic_name].mean()
    if early_mean > 0:
        rel_change = (late_mean - early_mean) / early_mean
    else:
        rel_change = 1.0 if late_mean > 0 else 0.0

    if rel_change > 0.3:
        trend = "📈 Emerging"
    elif rel_change < -0.3:
        trend = "📉 Declining"
    else:
        trend = "➡️ Stable"

    peak_year = yearly_smoothed[topic_name].idxmax()
    trend_summary.append({
        'Topic': topic_name,
        'Early Avg (%)': f"{early_mean*100:.1f}",
        'Late Avg (%)': f"{late_mean*100:.1f}",
        'Relative Change': f"{rel_change:+.0%}",
        'Peak Year': int(peak_year),
        'Trend': trend
    })

trend_df = pd.DataFrame(trend_summary).sort_values('Trend')
print("=" * 70)
print("TOPIC TREND SUMMARY")
print("=" * 70)
trend_df

# %% [markdown]
# ## 8. Detail-on-Demand (Pattern 5 – Restore)

# %%
DRILL_TOPIC = yearly_smoothed.columns[0]  # Change to topic of interest

print(f"Drilling into: '{DRILL_TOPIC}'")
topic_papers = df[df['topic_label'] == DRILL_TOPIC].sort_values('year')

for era_start in range(int(df['year'].min()), int(df['year'].max()) + 1, 5):
    era_end = era_start + 4
    era_papers = topic_papers[
        (topic_papers['year'] >= era_start) & (topic_papers['year'] <= era_end)
    ]
    if len(era_papers) > 0:
        print(f"\n--- {era_start}–{era_end} ({len(era_papers)} papers) ---")
        for _, row in era_papers.nlargest(3, 'topic_weight').iterrows():
            print(f"  [{row['year']}] {row['title'][:90]}")

# %% [markdown]
# ## 9. Export Results

# %%
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

df[['year', 'title', 'topic_label', 'topic_weight']].to_csv(
    output_dir / "papers_with_topics.csv", index=False)
yearly_smoothed.to_csv(output_dir / "yearly_topic_proportions_smoothed.csv")
trend_df.to_csv(output_dir / "trend_summary.csv", index=False)

print(f"Results saved to '{output_dir}/'")

# %% [markdown]
# ## 10. (Optional) 2-D Topic Map

# %%
if N_TOPICS > 5:
    perplexity = min(5, N_TOPICS - 1)
    tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity)
    topic_coords = tsne.fit_transform(H)

    fig, ax = plt.subplots(figsize=(10, 8))
    ax.scatter(topic_coords[:, 0], topic_coords[:, 1],
               s=100, c=range(N_TOPICS), cmap='tab20', alpha=0.8, edgecolors='black')
    for i in range(N_TOPICS):
        if i not in EXCLUDE_TOPICS:
            ax.annotate(TOPIC_LABELS.get(i, f"T{i}"),
                       (topic_coords[i, 0], topic_coords[i, 1]),
                       fontsize=7, ha='center', va='bottom')
    ax.set_title("Topic Map (t-SNE of topic-term vectors)")
    plt.tight_layout()
    plt.show()

How to Use

  1. Save the code as topic_evolution.py, then use jupytext or VS Code to convert the # %% cells into a notebook (topic_evolution.ipynb).
  2. Place your dataset (CSV with year, title, abstract columns) in the same directory and update DATA_PATH.
  3. Run cells sequentially through Section 3 (topic extraction).
  4. Inspect the printed topics and representative papers.
  5. Edit MERGE_MAP, TOPIC_LABELS, and EXCLUDE_TOPICS in Section 4a based on your inspection — this is the human-in-the-loop step.
  6. Re-run from Section 4a onward to see the refined trend charts.
  7. Iterate until the topic set is satisfactory.

The key interactive decision point is Section 4a, where your domain expertise shapes the final result.
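
To make the Section 4a edit more concrete, here is a minimal sketch of how the three settings could reshape topic assignments. The value shapes are assumptions, not taken from the notebook: MERGE_MAP is assumed to map a source topic id to a target topic id, TOPIC_LABELS to map a topic id to a display name, and EXCLUDE_TOPICS to be a list of ids. The helper `resolve_labels` and all example values are hypothetical; only the `T{i}` fallback label mirrors the notebook code.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical example values; real ones come from inspecting your own topics.
MERGE_MAP: Dict[int, int] = {7: 2, 11: 2}   # fold topics 7 and 11 into topic 2
TOPIC_LABELS: Dict[int, str] = {2: "Volume Rendering", 4: "Graph Drawing"}
EXCLUDE_TOPICS: List[int] = [9]             # e.g. a noise or boilerplate topic


def resolve_labels(
    dominant_topics: Sequence[int],
    merge_map: Dict[int, int],
    topic_labels: Dict[int, str],
    exclude_topics: Sequence[int],
) -> List[Optional[str]]:
    """Map each paper's dominant topic id to its final label (None if excluded)."""
    excluded = set(exclude_topics)
    labels: List[Optional[str]] = []
    for topic_id in dominant_topics:
        merged_id = merge_map.get(topic_id, topic_id)  # unmerged topics keep their id
        if topic_id in excluded or merged_id in excluded:
            labels.append(None)
        else:
            # Unlabelled topics fall back to a generic "T<id>" name
            labels.append(topic_labels.get(merged_id, f"T{merged_id}"))
    return labels


example = resolve_labels([2, 7, 4, 9, 5], MERGE_MAP, TOPIC_LABELS, EXCLUDE_TOPICS)
print(example)
```

In this sketch, papers from topics 2 and 7 end up under the same label, topic 9 is dropped, and topic 5 keeps a placeholder name until you label it. Each pass through steps 5 to 7 amounts to changing these three values and re-running the downstream cells.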

User #4

Please output the notebook as json file

Assistant #4
💭 Show reasoning
The user wants me to output the Jupyter Notebook as a JSON file (the standard .ipynb format). I'll convert the Python script into the proper Jupyter notebook JSON format with cells properly separated.

(The assistant produced the complete Jupyter Notebook in standard .ipynb JSON format: more than 30 cells covering all workflow steps from data loading through trend classification and export.)

Notebook JSON structure (abridged):
{
 "cells": [
  {"cell_type": "markdown", "source": ["# IEEE VIS Research Topic Evolution (1990–2024)\n", ...]},
  {"cell_type": "code", "source": ["# !pip install pandas numpy scikit-learn ..."]},
  {"cell_type": "code", "source": ["import pandas as pd\n", "import numpy as np\n", ...]},
  {"cell_type": "markdown", "source": ["## 1. Data Loading & Exploration\n", ...]},
  {"cell_type": "code", "source": ["DATA_PATH = \"ieee_vis_papers.csv\"\n", ...]},
  {"cell_type": "code", "source": ["# Drop rows with missing abstracts\n", ...]},
  {"cell_type": "code", "source": ["# Papers per year\n", ...]},
  {"cell_type": "markdown", "source": ["## 2. Text Preprocessing\n", ...]},
  {"cell_type": "code", "source": ["df['text'] = ...\n", "CUSTOM_STOPWORDS = {...}\n", ...]},
  {"cell_type": "markdown", "source": ["## 3. Topic Extraction (NMF)\n", ...]},
  {"cell_type": "code", "source": ["N_TOPICS = 30\n", ...]},
  {"cell_type": "code", "source": ["tfidf_vectorizer = TfidfVectorizer(...)\n", ...]},
  {"cell_type": "code", "source": ["nmf_model = NMF(...)\n", ...]},
  {"cell_type": "code", "source": ["# Display top terms per topic\n", ...]},
  {"cell_type": "code", "source": ["# Assign dominant topic\n", ...]},
  {"cell_type": "markdown", "source": ["## 4. Interactive Topic Refinement\n", ...]},
  {"cell_type": "code", "source": ["# Cosine similarity heatmap\n", ...]},
  {"cell_type": "code", "source": ["# Representative papers\n", ...]},
  {"cell_type": "markdown", "source": ["### 4a. Merging & Labelling\n", ...]},
  {"cell_type": "code", "source": ["MERGE_MAP = {...}\n", "EXCLUDE_TOPICS = [...]\n", ...]},
  {"cell_type": "code", "source": ["# Apply merging\n", ...]},
  {"cell_type": "markdown", "source": ["## 5. Temporal Distribution\n", ...]},
  {"cell_type": "code", "source": ["yearly_counts = ...\n", ...]},
  {"cell_type": "code", "source": ["# Raw stacked area chart\n", ...]},
  {"cell_type": "markdown", "source": ["## 6. Trend Smoothing\n", ...]},
  {"cell_type": "code", "source": ["WINDOW_SIZE = 5\n", ...]},
  {"cell_type": "code", "source": ["# Smoothed stacked area chart\n", ...]},
  {"cell_type": "code", "source": ["# Small-multiple line charts\n", ...]},
  {"cell_type": "markdown", "source": ["## 7. Trend Classification\n", ...]},
  {"cell_type": "code", "source": ["# Classify: Emerging / Declining / Stable\n", ...]},
  {"cell_type": "code", "source": ["# Highlight top trends\n", ...]},
  {"cell_type": "markdown", "source": ["## 8. Detail-on-Demand\n", ...]},
  {"cell_type": "code", "source": ["DRILL_TOPIC = ...\n", ...]},
  {"cell_type": "markdown", "source": ["## 9. Export Results\n", ...]},
  {"cell_type": "code", "source": ["output_dir = Path('output')\n", ...]},
  {"cell_type": "markdown", "source": ["## 10. (Optional) Topic Map\n", ...]},
  {"cell_type": "code", "source": ["# t-SNE of topic-term vectors\n", ...]},
  {"cell_type": "markdown", "source": ["---\n", "## Summary\n", ...]}
 ],
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.10.12"}
 },
 "nbformat": 4, "nbformat_minor": 4
}

The notebook contains 30+ cells implementing the complete 5-step workflow: text preprocessing, NMF topic extraction, interactive refinement (merge/label/exclude), temporal distribution with smoothing, and trend classification with export.
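
For readers who want to see how a # %% script relates to this JSON layout, the following is a minimal sketch using only the Python standard library. It is not how the notebook in this session was generated, and the function names `percent_script_to_notebook` and `_make_cell` are hypothetical. It assumes markdown cells are marked with `# %% [markdown]` and written as `# `-prefixed comment lines; a tool such as jupytext handles many more cases and is the better choice in practice.

```python
import json
from typing import Dict, List, Optional


def _make_cell(cell_type: str, lines: List[str]) -> Optional[Dict]:
    """Build one nbformat-4 cell dict, or None if the cell has no content."""
    body = list(lines)
    while body and not body[0].strip():    # trim leading blank lines
        body.pop(0)
    while body and not body[-1].strip():   # trim trailing blank lines
        body.pop()
    if not body:
        return None
    if cell_type == "markdown":
        # Markdown cells are stored as '# '-prefixed comments in the script
        body = [ln[2:] if ln.startswith("# ") else ln.lstrip("#") for ln in body]
    # Notebook sources are lists of lines; all but the last end with a newline
    source = [ln + "\n" for ln in body[:-1]] + [body[-1]]
    cell: Dict = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        cell.update({"execution_count": None, "outputs": []})
    return cell


def percent_script_to_notebook(script: str) -> Dict:
    """Split a '# %%' script into cells and wrap them in a notebook dict."""
    cells: List[Dict] = []
    cell_type, lines = "code", []
    for line in script.splitlines():
        if line.startswith("# %%"):
            cell = _make_cell(cell_type, lines)
            if cell is not None:
                cells.append(cell)
            cell_type = "markdown" if "[markdown]" in line else "code"
            lines = []
        else:
            lines.append(line)
    cell = _make_cell(cell_type, lines)    # flush the final cell
    if cell is not None:
        cells.append(cell)
    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}
        },
        "nbformat": 4,
        "nbformat_minor": 4,
    }


script = (
    "# %% [markdown]\n# ## 9. Export Results\n\n"
    "# %%\noutput_dir = 'output'\nprint(output_dir)\n"
)
notebook = percent_script_to_notebook(script)
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)
```

Writing `notebook_json` to a file with an .ipynb extension should give something Jupyter can open, though validating it with the nbformat package first would be prudent.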