[
  {
    "content": "\nThe attached archive includes published research papers containing descriptions of visual analytics workflows. Your task will be to use this as a workflow library for recommending workflows addressing new problems. Analyse the library and extract reusable patterns.",
    "role": "user"
  },
  {
    "content": "<think>The user wants me to analyze the library of visual analytics workflow papers and extract reusable patterns. Let me go through each paper and identify the key workflow patterns, then synthesize them into reusable patterns.\n\nLet me analyze the papers systematically:\n\n1. **A_Partition-Based_Framework_for_Building_and_Validating_Regression_Models** - Interactive feature selection workflow for regression models using partition-based visualization and ranking.\n\n2. **Cluster_and_calendar_based_visualization_of_time_series_data** - Clustering daily patterns + calendar visualization for time series exploration.\n\n3. **Episodes and Topics in Multivariate Temporal Data** - Progressive abstraction: SAX encoding → topic modeling → distribution exploration.\n\n\n4. **EuroVis2019-Exploratity_model_analysis** - Exploratory Model Analysis follows a structured pipeline from initial data exploration through problem specification, model generation, exploration, selection, and export.\n\n5. **EventAction** - Prescriptive analytics for temporal sequences involves finding similar records, exploring their outcomes, reviewing recommendations, and tuning plans.\n\n6. **event_sequence_simplification** - Temporal sequences get simplified through filtering, transformation, and find-and-replace operations.\n\n7. **explainMLVis_binary_classifiers** - Visual diagnostics for classifiers move through outcome-level analysis, then feature-level explanations, down to instance-level inspection.\n\n8. **From_movement_tracks_through_events_to_places** - Movement analysis extracts events from tracks, clusters them, delineates places, aggregates spatio-temporal data, and analyzes the results.\n\n9. **Human-in-the-Loop** - Feature engineering for pattern recognition progresses through data representation, temporal abstraction, feature assessment, grouping and labeling, then evaluation.\n\n10. **MobilityGraphs** - Mass mobility analysis simplifies space through clustering, then time through clustering, before visually exploring the resulting spatio-temporal patterns.\n\n11. **Reducing_Snapshots_to_Points** - Dynamic network exploration discretizes data, vectorizes and normalizes it, applies dimensionality reduction, then visualizes and interacts with the results.\n\n12. **RfX** - Random Forest exploration partitions trees into clusters, identifies representative decision trees, explores individual trees, and optimizes them.\n\n13. **The_What-If_Tool** - Still reviewing the remaining workflows. clusters. Now I'm identifying the core patterns across these papers. The first pattern involves progressive abstraction—starting with raw data and applying successive levels of aggregation while using visualization at each step to verify and guide the next abstraction. The second pattern is an iterative refinement loop where you select features or model parameters, visualize the results, refine based on domain knowledge, and repeat. The third pattern is rank-then-explore, where you compute quantitative measures for candidates first.\n\nThe fourth pattern is cluster-then-explore, which appears frequently across papers—you cluster similar items, visualize their representatives, allow drill-down into individual clusters, and compare them to identify patterns. The fifth pattern adapts Shneiderman's mantra: start with an aggregate overview, filter to subsets of interest, drill down to individual items, and loop back to refine. I'm also seeing a sixth pattern emerging around simplification. - Provide visual comparison mechanisms\n- Support selection of preferred candidate\n- Enable export/deployment of selection\n\n**Pattern 9: Similarity-Based Retrieval + Outcome Analysis**\nPapers: EventAction, Reducing Snapshots\n- Define similarity measure\n- Find similar records/cases\n- Analyze outcomes/distributions of similar cases\n- Use patterns to inform decisions\n\n**Pattern 10: Temporal Pattern Discovery via Segmentation + Encoding**\nPapers: Episodes and Topics, Cluster and Calendar, Human-in-the-Loop\n- Segment time series into episodes\n- Encode episodes using symbolic/compact representations\n- Find patterns of co-occurrence or similarity\n- Map patterns back to temporal/spatial context\n\n**Pattern 11: Residual/Error-Driven Iterative Refinement**\nPapers: Partition-Based Regression, explainMLVis\n- Build initial model\n- Visual analysis of residuals and errors to spot systematic issues\n- Add features or adjust the model based on error patterns\n- Iterate until performance stabilizes\n\n**Pattern 12: Context-Enriched Classification/Labeling**\nPapers: Human-in-the-Loop, From Movement Tracks\n- Extract features from raw data\n- Use visual grouping to identify pattern classes\n- Label groups simultaneously in batch\n- Validate labels through classification evaluation\n\nNow I'm ready to synthesize these patterns into a comprehensive answer.</think>\n\n# Reusable Visual Analytics Workflow Patterns\n\nAfter analysing all 17 papers in the library, I extracted the following reusable workflow patterns that can be applied to new problems:\n\n---\n\n## Pattern 1: Progressive Abstraction\n\n**Source papers:** *Episodes and Topics*, *From Movement Tracks*, *MobilityGraphs*, *Reducing Snapshots to Points*\n\n**Structure:**\n1. Start with raw, elementary data\n2. Apply a first-level transformation/abstraction (e.g., symbolic encoding, event extraction)\n3. Visualize and validate the intermediate representation\n4. Apply a second-level abstraction (e.g., topic modelling, clustering)\n5. Visualize and explore distribution of higher-level patterns in context\n\n**Applicable when:** Data is hierarchically structured or too granular for direct insight; understanding requires building up from elements to patterns to distributions.\n\n---\n\n## Pattern 2: Rank-then-Explore\n\n**Source papers:** *Partition-Based Regression Framework*, *Exploratory Model Analysis*\n\n**Structure:**\n1. Compute quantitative relevance/quality measures for many candidates (features, models, variable pairs)\n2. Rank candidates by measures; display as ordered table/list\n3. Coordinate ranked list with small-multiple visualisations showing qualitative structure\n4. User inspects top-ranked candidates visually to confirm relevance or spot issues\n5. Select candidate and proceed (e.g., add to model, export)\n\n**Applicable when:** The search space is large (many features, many models) and a combination of automated scoring + human judgement is needed.\n\n---\n\n## Pattern 3: Iterative Residual-Driven Model Refinement\n\n**Source papers:** *Partition-Based Regression Framework*, *Visual Diagnostics of Binary Classifiers*\n\n**Structure:**\n1. Build an initial model using the most dominant effect\n2. Compute residuals (or errors/misclassifications)\n3. Visualize residuals conditioned on remaining candidate variables\n4. Identify systematic patterns (bias, local structures, interactions)\n5. Augment the model with a new variable/transformation addressing the pattern\n6. Repeat from step 2 until stopping criterion is met\n\n**Applicable when:** Building interpretable predictive models incrementally, especially when domain knowledge must guide feature selection.\n\n---\n\n## Pattern 4: Cluster-Label-Distribute\n\n**Source papers:** *Cluster and Calendar*, *MobilityGraphs*, *Episodes and Topics*, *Human-in-the-Loop*, *Reducing Snapshots to Points*\n\n**Structure:**\n1. Define a meaningful feature/distance space for the items\n2. Cluster items by similarity (density-based or partition-based)\n3. Visualise cluster representatives (prototypes, summaries, averages)\n4. Interpret/label clusters using domain knowledge\n5. Explore the distribution of clusters over context dimensions (time, space, categories)\n\n**Applicable when:** A dataset contains many instances that need to be grouped for overview, and the analyst wants to understand when/where/why different groups occur.\n\n---\n\n## Pattern 5: Simplify → Analyse → Restore\n\n**Source papers:** *Temporal Event Sequence Simplification*, *MobilityGraphs*, *Reducing Snapshots to Points*\n\n**Structure:**\n1. Start with complex, cluttered data\n2. Apply user-driven simplifications (filtering, merging, replacing, aggregating)\n3. Measure complexity reduction (e.g., visual element count)\n4. Analyse the simplified view for patterns\n5. Selectively restore detail when needed for deeper investigation\n6. Record simplification history for reproducibility\n\n**Applicable when:** Raw data is too noisy/complex for pattern detection; the signal-to-noise ratio must be improved interactively.\n\n---\n\n## Pattern 6: Similarity-Based Prescription\n\n**Source papers:** *EventAction*, *Reducing Snapshots to Points*\n\n**Structure:**\n1. Define the current record/state of interest\n2. Compute similarity to archived/historical records\n3. Visualise the distribution of outcomes for similar records\n4. Identify actions/sequences correlated with desired outcomes\n5. Construct a personalised action plan\n6. Provide feedback on estimated likelihood of success\n7. Iterate to refine\n\n**Applicable when:** The goal is recommending future actions based on what has worked for similar cases in the past (prescriptive analytics).\n\n---\n\n## Pattern 7: Human-in-the-Loop Feature Engineering for Classification\n\n**Source papers:** *Human-in-the-Loop*, *UTOPIAN*, *RfX*\n\n**Structure:**\n1. Derive interval-based or aggregate features from raw temporal/sequential data\n2. Assess feature discriminability through visual inspection (distributions, projections)\n3. Select/refine features iteratively\n4. Group data by features (clustering or dimensionality reduction)\n5. Visually inspect groups; label representative examples in bulk\n6. Validate labelled set (e.g., k-NN test) before training a full model\n\n**Applicable when:** Training labels are scarce, pattern definitions are vague, and domain expertise is needed to bridge raw data and meaningful categories.\n\n---\n\n## Pattern 8: Exploratory Model Analysis (Multi-Problem, Multi-Model)\n\n**Source papers:** *Exploratory Model Analysis (EuroVis 2019)*, *What-If Tool*\n\n**Structure:**\n1. Explore data overview to understand attributes and distributions\n2. Generate/explore candidate modelling problems (target, features, type)\n3. Specify the problem; auto-generate diverse models (AutoML or equivalent)\n4. Compare models on holdout predictions (confusion matrices, residuals, fairness)\n5. Select preferred model(s) considering task-specific criteria beyond accuracy\n6. Export models for deployment\n\n**Applicable when:** The modelling goal itself is not fixed a priori; the analyst wants to discover what predictions are possible and robust.\n\n---\n\n## Pattern 9: Spatial Event Extraction → Place Delineation → Flow Analysis\n\n**Source papers:** *From Movement Tracks through Events to Places*, *MobilityGraphs*\n\n**Structure:**\n1. Compute dynamic attributes from trajectories (speed, direction, proximity)\n2. Extract movement events by filtering on attribute thresholds (visually guided)\n3. Cluster events spatially (and optionally temporally) using density-based methods\n4. Build spatial buffers around clusters to delineate \"places\"\n5. Aggregate events/trajectories by places and time intervals\n6. Analyse aggregated time series or flows for spatio-temporal patterns\n\n**Applicable when:** Relevant places are not predefined but must emerge from movement behaviour; analysis focuses on place-based temporal patterns.\n\n---\n\n## Pattern 10: Model Explanation → Diagnostic Drill-Down\n\n**Source papers:** *Visual Diagnostics of Binary Classifiers*, *What-If Tool*, *RfX*\n\n**Structure:**\n1. **Outcome-level:** Visualise overall prediction distribution and accuracy (confusion matrix, ROC, histograms)\n2. **Explanation-level:** Aggregate instance-level explanations into groups; show which features drive decisions and how accurate each decision group is\n3. **Instance-level:** Drill into individual data items to see raw feature values and understand root causes of errors\n4. Generate diagnostic insights (data quality issues, missing features, label leakage)\n\n**Applicable when:** A trained classifier needs semantic validation by domain experts, beyond aggregate accuracy metrics.\n\n---\n\n## Pattern 11: Temporal Segmentation + Symbolic Encoding + Co-occurrence Mining\n\n**Source papers:** *Episodes and Topics*, *Cluster and Calendar*, *Human-in-the-Loop*\n\n**Structure:**\n1. Segment multivariate time series into episodes (sliding window or event-based)\n2. Encode each attribute's variation within an episode symbolically (e.g., SAX)\n3. Treat episodes as \"documents\" and attribute patterns as \"words\"\n4. Apply topic modelling (LDA, NMF) to discover co-occurring patterns (\"topics\")\n5. Interpret topics visually; merge semantically close topics\n6. Explore topic distribution over time, space, or other context dimensions\n\n**Applicable when:** The goal is to find and understand recurring combinations of behaviours across multiple attributes simultaneously.\n\n---\n\n## Pattern 12: Interactive Model Steering (Semantic Interaction)\n\n**Source papers:** *UTOPIAN*, *What-If Tool*, *RfX*, *Partition-Based Regression*\n\n**Structure:**\n1. System produces an initial model/result automatically\n2. User inspects the result through coordinated visualisations\n3. User directly manipulates the result (merge topics, adjust thresholds, edit counterfactuals, prune trees)\n4. System re-computes model incorporating user changes\n5. Updated result is shown immediately for validation\n6. Iterate until the user is satisfied\n\n**Applicable when:** The model requires domain knowledge for refinement that cannot be easily formalised as constraints; experts need to inject knowledge interactively.\n\n---\n\n## Summary Table\n\n| # | Pattern Name | Key Mechanism | Typical Output |\n|---|---|---|---|\n| 1 | Progressive Abstraction | Multi-level aggregation with validation at each level | High-level interpretable patterns |\n| 2 | Rank-then-Explore | Quantitative ranking + qualitative visual inspection | Selected features/variables |\n| 3 | Residual-Driven Refinement | Iterative error analysis and model augmentation | Interpretable multi-variable model |\n| 4 | Cluster-Label-Distribute | Cluster → interpret → explore distribution | Labelled groups with contextual understanding |\n| 5 | Simplify → Analyse → Restore | Complexity reduction with reversibility | Clean visual overview + preserved detail |\n| 6 | Similarity-Based Prescription | Find similar → show outcomes → plan actions | Recommended action sequences |\n| 7 | Human-in-the-Loop Feature Engineering | Feature design + bulk visual labelling + validation | Training set for ML model |\n| 8 | Exploratory Model Analysis | Problem exploration + multi-model comparison | Deployed predictive model(s) |\n| 9 | Event → Place → Flow | Event extraction → spatial clustering → aggregation | Spatio-temporal activity patterns |\n| 10 | Explanation Drill-Down | Aggregate → explanation groups → raw instances | Diagnostic insights for model improvement |\n| 11 | Symbolic Encoding + Topic Mining | SAX + topic modelling + distribution exploration | Multi-attribute temporal patterns |\n| 12 | Interactive Model Steering | Direct manipulation → re-computation loop | Domain-tuned model/result |\n\nThese patterns can be composed (e.g., Pattern 4 inside Pattern 1, or Pattern 3 using Pattern 2 at each step) and adapted to specific data types and analysis goals.",
    "role": "assistant"
  },
  {
    "content": "I need a workflow recommendation for the following task. I want to analyse the dataset that contains information on IEEE Visualization (IEEE VIS) publications from 1990-2024 as well as IEEE TVCG and IEEE CG&A articles published at IEEE VIS. The dataset includes for each paper the publication year, title, and abstract. I want to reveal the major research topics and how they evolved over the years. I am interested in seeing trends rather than minor fluctuations. Using the workflow library, compose an appropriate workflow.",
    "role": "user"
  },
  {
    "content": "<think>The user wants to analyse a dataset of IEEE VIS publications (1990-2024) containing publication year, title, and abstract. Their goals are:\n1. Reveal major research topics\n2. Show how topics evolved over the years\n3. Focus on trends rather than minor fluctuations\n\nLet me compose a workflow from the patterns I identified. The most relevant patterns are:\n\n- **Pattern 11: Temporal Segmentation + Symbolic Encoding + Co-occurrence Mining** - This is about discovering co-occurring patterns using topic modelling. The \"documents\" here are the papers themselves (title + abstract), and we can apply topic modelling directly.\n\n\n- **Pattern 4: Cluster-Label-Distribute** - Cluster papers by similarity, label the clusters, then explore distribution over time.\n- **Pattern 1: Progressive Abstraction** - Start with raw papers, abstract to topics, then look at temporal distributions.\n- **Pattern 5: Simplify → Analyse → Restore** - To focus on trends rather than minor fluctuations, we need smoothing/simplification.\n- **Pattern 12: Interactive Model Steering** - To refine topics interactively (merge, split, relabel).\n\nSo the workflow would start with text preprocessing on the abstracts to clean them up, then move into topic extraction and temporal analysis.</think>\n\n# Recommended Workflow: Research Topic Evolution in IEEE VIS (1990–2024)\n\n## Composed from Patterns: 11 → 12 → 4 → 5 → 1\n\n---\n\n## Overview Diagram\n\n```\nRaw Papers (year, title, abstract)\n        │\n        ▼\n┌─────────────────────────┐\n│ 1. Text Preprocessing   │  (Pattern 1 – first abstraction level)\n└───────────┬─────────────┘\n            ▼\n┌─────────────────────────┐\n│ 2. Topic Extraction     │  (Pattern 11 – topic mining)\n└───────────┬─────────────┘\n            ▼\n┌─────────────────────────┐\n│ 3. Interactive Refine   │  (Pattern 12 – model steering)\n└───────────┬─────────────┘\n            ▼\n┌─────────────────────────┐\n│ 4. Label & Distribute   │  (Pattern 4 – cluster-label-distribute)\n└───────────┬─────────────┘\n            ▼\n┌─────────────────────────┐\n│ 5. Trend Smoothing      │  (Pattern 5 – simplify for trends)\n└───────────┬─────────────┘\n            ▼\n      Trend Narratives\n```\n\n---\n\n## Step-by-Step Workflow\n\n### Step 1 — Text Preprocessing (Pattern 1: Progressive Abstraction, Level 1)\n\n| Action | Detail |\n|--------|--------|\n| Concatenate title + abstract per paper | Creates one \"document\" per publication |\n| Tokenise, lowercase, remove stopwords | Standard NLP pipeline |\n| Remove domain-generic terms | e.g., \"paper\", \"propose\", \"result\", \"method\" |\n| Optional: bigram/trigram collocation | Preserves phrases like \"parallel coordinates\", \"deep learning\" |\n| Represent as bag-of-words or TF-IDF matrix | Input for topic model |\n\n**Output:** Document-term matrix (≈ 4 000–5 000 papers × vocabulary)\n\n---\n\n### Step 2 — Topic Extraction (Pattern 11: Topic Mining)\n\n| Action | Detail |\n|--------|--------|\n| Choose topic model | LDA, NMF, or BERTopic (embedding-based) |\n| Set initial *k* (number of topics) | Start with *k* = 25–40 (intentionally over-segment) |\n| Fit model | Each paper receives a topic-probability vector |\n| Inspect top terms per topic | Ranked word lists or word clouds |\n| Inspect representative papers per topic | Highest-probability papers as exemplars |\n\n**Output:** *k* candidate topics, each described by top terms and representative papers; per-paper topic assignments (soft or hard).\n\n**Rationale for over-segmentation:** It is easier to merge semantically close topics in the next step than to split overly broad ones.\n\n---\n\n### Step 3 — Interactive Topic Refinement (Pattern 12: Interactive Model Steering)\n\n| Action | Detail |\n|--------|--------|\n| Visualise topic similarity | 2-D projection (t-SNE/UMAP of topic-term vectors) or dendrogram |\n| Merge near-duplicate topics | e.g., \"graph drawing\" + \"network layout\" → single topic |\n| Split incoherent topics | If top terms mix unrelated concepts, re-cluster that subset |\n| Remove junk topics | Boilerplate or overly generic clusters |\n| Re-label topics with human-readable names | e.g., \"Volume Rendering\", \"Text & NLP\", \"Uncertainty Visualisation\" |\n| Validate | Check that representative papers match the label |\n\n**Output:** Refined set of ~10–20 well-labelled major topics; each paper assigned to one or more topics.\n\n**Iterate** until topic set is stable and interpretable.\n\n---\n\n### Step 4 — Label & Distribute over Time (Pattern 4: Cluster-Label-Distribute)\n\n| Action | Detail |\n|--------|--------|\n| For each year, count papers per topic | Absolute counts or proportions (share of total) |\n| Construct a **papers × year × topic** table | Basis for temporal visualisation |\n| Visualise as stacked area chart or ThemeRiver | Shows relative topic prominence over 35 years |\n| Alternatively: small-multiple line charts | One line per topic; easier to compare slopes |\n\n**Output:** Raw temporal distribution of topics (will still be noisy year-to-year).\n\n---\n\n### Step 5 — Trend Smoothing & Narrative (Pattern 5: Simplify → Analyse → Restore)\n\n| Action | Detail |\n|--------|--------|\n| Apply temporal smoothing | Moving average (window = 3–5 years) or LOESS/lowess curve per topic |\n| Suppress minor fluctuations | Focus on monotone rises, falls, or plateaus |\n| Identify trend phases | \"Emerging\" (recent upward), \"Declining\" (downward), \"Stable\" (flat) |\n| Annotate key inflection points | e.g., \"Deep-learning topics surge after 2016\" |\n| Restore detail on demand | Allow drill-down to individual years/papers for any interesting inflection |\n\n**Output:** Smooth trend lines per topic with annotations; classification of topics into trend categories.\n\n---\n\n## Final Deliverables\n\n| Artefact | Description |\n|----------|-------------|\n| **Topic catalogue** | Table of major research topics with label, top terms, and 3–5 exemplar papers |\n| **Trend chart** | Smoothed stacked area or small-multiple line chart (1990–2024) |\n| **Trend summary** | Narrative listing emerging topics (e.g., ML4VIS, fairness), declining topics (e.g., VRML, early VR), and stable pillars (e.g., volume rendering, information visualisation) |\n| **Interactive view** (optional) | Linked dashboard: select a topic → see its trend + representative papers per period |\n\n---\n\n## Justification of Pattern Composition\n\n| User Requirement | Addressed by |\n|------------------|--------------|\n| \"Reveal major research topics\" | Patterns 11 + 12 (topic extraction + refinement) |\n| \"How they evolved over the years\" | Pattern 4 (distribute over time) |\n| \"Trends rather than minor fluctuations\" | Pattern 5 (temporal smoothing) |\n| Interpretability & trust | Pattern 1 (progressive abstraction with validation at each level) |\n\n---\n\n## Tool Suggestions (optional)\n\n- **Topic modelling:** `scikit-learn` (NMF/LDA), `BERTopic`, or `MALLET`\n- **Visualisation:** `Matplotlib`/`Seaborn` (static), `Streamlit` + `Altair` (interactive), or dedicated VA tools like `pyLDAvis`\n- **Smoothing:** `statsmodels.lowess` or `pandas.rolling`",
    "role": "assistant"
  },
  {
    "content": "To test this workflow, create a Jupyter Notebook (Python).",
    "role": "user"
  },
  {
    "content": "<think>The user wants me to create a Jupyter Notebook in Python that implements the workflow I recommended for analyzing IEEE VIS publications (1990-2024) to reveal major research topics and their evolution over time.\n\nI'll create a complete Jupyter Notebook with all the steps:\n1. Text Preprocessing\n2. Topic Extraction (using NMF or LDA)\n3. Interactive Topic Refinement (manual merging/labeling)\n4. Label & Distribute over Time\n5. Trend Smoothing & Narrative\n\nI'll structure it as a proper notebook with markdown cells and code cells. Since I can't actually provide a .ipynb file, I'll provide the Python code that can be converted to a notebook, or I'll format it as a notebook-style document with clear cell separations.\n\n\nNow I'm assuming the dataset is a CSV file with columns like year, title, and abstract, and I'll make the notebook flexible to handle that structure. I'll organize it into logical sections: loading and exploring the data, preprocessing the text, running topic models, inspecting and refining the topics, analyzing temporal patterns, and creating visualizations to show how research themes have evolved.</think>\n\n```python\n# %% [markdown]\n# # IEEE VIS Research Topic Evolution (1990–2024)\n#\n# This notebook implements the recommended visual analytics workflow to reveal\n# major research topics in IEEE VIS publications and how they evolved over time.\n#\n# **Workflow:** Progressive Abstraction → Topic Mining → Interactive Steering →\n# Cluster-Label-Distribute → Simplify for Trends\n\n# %% [markdown]\n# ## 0. Setup & Installation\n\n# %%\n# !pip install pandas numpy scikit-learn matplotlib seaborn nltk wordcloud adjustText\n\n# %%\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom pathlib import Path\nimport warnings\nwarnings.filterwarnings('ignore')\n\n# NLP\nimport re\nimport nltk\nfrom nltk.corpus import stopwords\nfrom nltk.stem import WordNetLemmatizer\n\n# Topic Modelling\nfrom sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\nfrom sklearn.decomposition import NMF, LatentDirichletAllocation\nfrom sklearn.manifold import TSNE\n\nnltk.download('stopwords', quiet=True)\nnltk.download('wordnet', quiet=True)\nnltk.download('omw-1.4', quiet=True)\n\nprint(\"Setup complete.\")\n\n# %% [markdown]\n# ## 1. Data Loading & Exploration\n#\n# Expects a CSV file with at least columns: `Year`, `Title`, `Abstract`.\n# Adjust the file path and column names to match your dataset.\n\n# %%\n# === CONFIGURE THIS ===\nDATA_PATH = \"ieee_vis_papers.csv\"  # <-- Adjust path to your dataset\n\n# Load data\ndf = pd.read_csv(DATA_PATH)\n\n# --- Normalise column names (handle common variations) ---\ndf.columns = df.columns.str.strip().str.lower()\n\n# Map to expected names (adjust if your columns differ)\nCOLUMN_MAP = {\n    'year': 'year',\n    'title': 'title',\n    'abstract': 'abstract',\n    # Add mappings if needed, e.g. 'paper title': 'title'\n}\ndf = df.rename(columns=COLUMN_MAP)\n\n# Basic validation\nassert 'year' in df.columns, \"Column 'year' not found. Adjust COLUMN_MAP.\"\nassert 'title' in df.columns, \"Column 'title' not found. Adjust COLUMN_MAP.\"\nassert 'abstract' in df.columns, \"Column 'abstract' not found. Adjust COLUMN_MAP.\"\n\nprint(f\"Dataset shape: {df.shape}\")\nprint(f\"Year range: {df['year'].min()} – {df['year'].max()}\")\nprint(f\"Sample columns: {list(df.columns[:10])}\")\ndf.head(3)\n\n# %%\n# Drop rows with missing abstracts\nbefore = len(df)\ndf = df.dropna(subset=['abstract']).copy()\ndf = df[df['abstract'].str.strip().str.len() > 50]  # Remove very short abstracts\nprint(f\"Dropped {before - len(df)} rows with missing/short abstracts. Remaining: {len(df)}\")\n\n# %%\n# Papers per year\nfig, ax = plt.subplots(figsize=(12, 4))\ndf.groupby('year').size().plot(kind='bar', ax=ax, color='steelblue', width=0.8)\nax.set_title(\"Number of Papers per Year\")\nax.set_xlabel(\"Year\")\nax.set_ylabel(\"Count\")\nplt.tight_layout()\nplt.show()\n\n# %% [markdown]\n# ## 2. Text Preprocessing (Pattern 1 – Progressive Abstraction, Level 1)\n#\n# Concatenate title and abstract, clean, tokenise, remove stopwords, and lemmatise.\n\n# %%\n# Combine title + abstract\ndf['text'] = df['title'].fillna('') + ' ' + df['abstract'].fillna('')\n\n# Custom stopwords: domain-generic terms that don't help distinguish topics\nCUSTOM_STOPWORDS = {\n    'paper', 'propose', 'proposed', 'approach', 'method', 'methods',\n    'result', 'results', 'show', 'shown', 'use', 'used', 'using',\n    'present', 'presented', 'new', 'novel', 'technique', 'techniques',\n    'based', 'provide', 'system', 'also', 'work', 'problem',\n    'describe', 'described', 'demonstrate', 'demonstrated',\n    'algorithm', 'algorithms', 'however', 'many', 'one', 'two',\n    'first', 'second', 'allow', 'allows', 'well', 'can', 'may',\n    'different', 'several', 'existing', 'within', 'number',\n    'introduction', 'conclusion', 'abstract', 'study', 'studies',\n    'example', 'examples', 'application', 'applications'\n}\n\nstop_words = set(stopwords.words('english')).union(CUSTOM_STOPWORDS)\nlemmatizer = WordNetLemmatizer()\n\n\ndef preprocess_text(text):\n    \"\"\"Clean, tokenise, remove stopwords, lemmatise.\"\"\"\n    # Lowercase\n    text = text.lower()\n    # Remove non-alphabetic characters (keep spaces)\n    text = re.sub(r'[^a-z\\s]', ' ', text)\n    # Tokenise\n    tokens = text.split()\n    # Remove stopwords, short tokens, and lemmatise\n    tokens = [\n        lemmatizer.lemmatize(t)\n        for t in tokens\n        if t not in stop_words and len(t) > 2\n    ]\n    return ' '.join(tokens)\n\n\ndf['processed_text'] = df['text'].apply(preprocess_text)\nprint(\"Preprocessing complete.\")\nprint(f\"Example:\\n{df['processed_text'].iloc[0][:200]}...\")\n\n# %% [markdown]\n# ## 3. Topic Extraction (Pattern 11 – Topic Mining)\n#\n# We use **NMF** (Non-negative Matrix Factorisation) on TF-IDF features.\n# NMF typically produces more coherent topics than LDA for short scientific texts.\n# We intentionally over-segment (k=30) to allow merging later.\n\n# %%\n# === PARAMETERS (tune as needed) ===\nN_TOPICS = 30          # Intentionally over-segment; will merge later\nMAX_FEATURES = 5000    # Vocabulary size\nMIN_DF = 5             # Minimum document frequency\nMAX_DF = 0.7           # Maximum document frequency (fraction)\nNGRAM_RANGE = (1, 2)   # Unigrams + bigrams\n\n# %%\n# Build TF-IDF matrix\ntfidf_vectorizer = TfidfVectorizer(\n    max_features=MAX_FEATURES,\n    min_df=MIN_DF,\n    max_df=MAX_DF,\n    ngram_range=NGRAM_RANGE,\n    token_pattern=r'(?u)\\b[a-z][a-z]+\\b'\n)\n\ntfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'])\nfeature_names = tfidf_vectorizer.get_feature_names_out()\n\nprint(f\"TF-IDF matrix: {tfidf_matrix.shape[0]} documents × {tfidf_matrix.shape[1]} terms\")\n\n# %%\n# Fit NMF model\nnmf_model = NMF(\n    n_components=N_TOPICS,\n    random_state=42,\n    max_iter=500,\n    init='nndsvda'\n)\n\nW = nmf_model.fit_transform(tfidf_matrix)  # Document-topic matrix\nH = nmf_model.components_                   # Topic-term matrix\n\nprint(f\"NMF fitting complete. Reconstruction error: {nmf_model.reconstruction_err_:.2f}\")\n\n# %%\n# Display top terms per topic\nN_TOP_TERMS = 12\n\n\ndef display_topics(model, feature_names, n_top=N_TOP_TERMS):\n    \"\"\"Print top terms for each topic.\"\"\"\n    topics = []\n    for idx, topic_vec in enumerate(model.components_):\n        top_indices = topic_vec.argsort()[:-n_top - 1:-1]\n        top_terms = [feature_names[i] for i in top_indices]\n        topics.append(top_terms)\n        print(f\"Topic {idx:2d}: {', '.join(top_terms)}\")\n    return topics\n\n\nprint(\"=\" * 70)\nprint(\"TOP TERMS PER TOPIC (inspect for coherence)\")\nprint(\"=\" * 70)\ntopic_terms = display_topics(nmf_model, feature_names)\n\n# %%\n# Assign dominant topic to each paper\ndf['dominant_topic'] = W.argmax(axis=1)\ndf['topic_weight'] = W.max(axis=1)\n\n# Store full topic distribution for soft assignment\ntopic_distributions = pd.DataFrame(\n    W, columns=[f\"topic_{i}\" for i in range(N_TOPICS)],\n    index=df.index\n)\n\nprint(\"\\nPapers per topic:\")\nprint(df['dominant_topic'].value_counts().sort_index())\n\n# %% [markdown]\n# ## 4. Interactive Topic Refinement (Pattern 12 – Model Steering)\n#\n# Inspect topic similarity, merge near-duplicate topics, remove junk topics,\n# and assign human-readable labels.\n\n# %%\n# Visualise topic similarity via cosine similarity heatmap\nfrom sklearn.metrics.pairwise import cosine_similarity\n\ntopic_similarity = cosine_similarity(H)\n\nfig, ax = plt.subplots(figsize=(12, 10))\nsns.heatmap(\n    topic_similarity, annot=False, cmap='YlOrRd',\n    xticklabels=range(N_TOPICS),\n    yticklabels=range(N_TOPICS),\n    ax=ax\n)\nax.set_title(\"Topic-Topic Cosine Similarity (identify candidates for merging)\")\nplt.tight_layout()\nplt.show()\n\n# %%\n# Show representative papers for each topic (top 3 by weight)\nprint(\"=\" * 70)\nprint(\"REPRESENTATIVE PAPERS PER TOPIC\")\nprint(\"=\" * 70)\n\nfor t in range(N_TOPICS):\n    top_papers = df.nlargest(3, f'topic_weight' if t == df['dominant_topic'].iloc[0] else 'topic_weight')\n    # Get papers where dominant topic is t, sorted by weight\n    topic_papers = df[df['dominant_topic'] == t].nlargest(3, 'topic_weight')\n    print(f\"\\n--- Topic {t} ---\")\n    print(f\"    Terms: {', '.join(topic_terms[t][:8])}\")\n    for _, row in topic_papers.iterrows():\n        print(f\"    [{row['year']}] {row['title'][:80]}\")\n\n# %% [markdown]\n# ### 4a. Define Topic Merging & Labelling\n#\n# **Instructions:** After inspecting the topics above, edit the dictionaries below:\n# - `MERGE_MAP`: maps topic indices to a group ID (topics with same group ID will be merged)\n# - `TOPIC_LABELS`: human-readable name for each group ID\n# - `EXCLUDE_TOPICS`: list of topic indices to drop (junk/incoherent)\n#\n# The example below is a **template** – adjust based on YOUR model's output!\n\n# %%\n# ============================================================\n# EDIT THIS SECTION BASED ON YOUR INSPECTION OF THE TOPICS\n# ============================================================\n\n# Example merge map: {original_topic_id: merged_group_id}\n# Topics assigned to the same group_id will be merged.\n# Start by mapping each topic to itself (no merging), then adjust.\nMERGE_MAP = {i: i for i in range(N_TOPICS)}\n\n# --- Example edits (uncomment and adjust): ---\n# MERGE_MAP[5] = 3    # Merge topic 5 into topic 3\n# MERGE_MAP[12] = 7   # Merge topic 12 into topic 7\n# MERGE_MAP[20] = 15  # Merge topic 20 into topic 15\n\n# Topics to exclude entirely (junk / too generic)\nEXCLUDE_TOPICS = []\n# EXCLUDE_TOPICS = [0, 22]  # Example: exclude topics 0 and 22\n\n# Human-readable labels for merged groups\n# Only need to label the \"target\" group IDs that remain after merging.\nTOPIC_LABELS = {i: f\"Topic {i}\" for i in range(N_TOPICS)}\n\n# --- Example labels (uncomment and adjust): ---\n# TOPIC_LABELS[0] = \"Volume Rendering\"\n# TOPIC_LABELS[1] = \"Graph & Network Visualization\"\n# TOPIC_LABELS[2] = \"Flow Visualization\"\n# TOPIC_LABELS[3] = \"Information Visualization\"\n# TOPIC_LABELS[4] = \"Scientific Visualization\"\n# TOPIC_LABELS[5] = \"Text & Document Visualization\"\n# TOPIC_LABELS[6] = \"Uncertainty Visualization\"\n# TOPIC_LABELS[7] = \"Machine Learning for Visualization\"\n# TOPIC_LABELS[8] = \"User Studies & Evaluation\"\n# TOPIC_LABELS[9] = \"Geospatial & Cartography\"\n# TOPIC_LABELS[10] = \"High-Dimensional Data\"\n# ...\n\nprint(\"Merge map and labels defined. Adjust above as needed.\")\n\n# %%\n# Apply merging and exclusion\ndf['merged_topic'] = df['dominant_topic'].map(MERGE_MAP)\ndf = df[~df['dominant_topic'].isin(EXCLUDE_TOPICS)].copy()\n\n# Merge topic distributions\nmerged_groups = {}\nfor orig, merged in MERGE_MAP.items():\n    if orig not in EXCLUDE_TOPICS:\n        merged_groups.setdefault(merged, []).append(orig)\n\n# Create merged topic weight (sum of component topic weights)\nfor group_id, members in merged_groups.items():\n    member_cols = [f\"topic_{m}\" for m in members]\n    existing_cols = [c for c in member_cols if c in topic_distributions.columns]\n    if existing_cols:\n        topic_distributions[f\"merged_{group_id}\"] = topic_distributions[existing_cols].sum(axis=1)\n\n# Final label\ndf['topic_label'] = df['merged_topic'].map(TOPIC_LABELS)\n\n# Count unique merged topics\nfinal_topics = sorted(df['merged_topic'].unique())\nprint(f\"Number of final topics after merging: {len(final_topics)}\")\nprint(\"\\nFinal topic labels:\")\nfor t in final_topics:\n    count = (df['merged_topic'] == t).sum()\n    print(f\"  {TOPIC_LABELS.get(t, t):40s} ({count} papers)\")\n\n# %% [markdown]\n# ## 5. Temporal Distribution (Pattern 4 – Cluster-Label-Distribute)\n#\n# Compute topic proportions per year and visualise.\n\n# %%\n# Compute yearly topic counts (proportion-based to account for varying publication volume)\nyearly_counts = df.groupby(['year', 'topic_label']).size().unstack(fill_value=0)\nyearly_proportions = yearly_counts.div(yearly_counts.sum(axis=1), axis=0)\n\nprint(f\"Yearly proportions table: {yearly_proportions.shape}\")\nyearly_proportions.tail()\n\n# %%\n# Raw (unsmoothed) stacked area chart\nfig, ax = plt.subplots(figsize=(16, 8))\nyearly_proportions.plot.area(ax=ax, alpha=0.8, linewidth=0.5)\nax.set_title(\"Topic Proportions Over Time (Raw)\", fontsize=14)\nax.set_xlabel(\"Year\")\nax.set_ylabel(\"Proportion of Papers\")\nax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)\nax.set_xlim(df['year'].min(), df['year'].max())\nplt.tight_layout()\nplt.show()\n\n# %% [markdown]\n# ## 6. Trend Smoothing & Narrative (Pattern 5 – Simplify → Analyse → Restore)\n#\n# Apply moving-average smoothing to suppress year-to-year fluctuations\n# and reveal underlying trends.\n\n# %%\n# === SMOOTHING PARAMETER ===\nWINDOW_SIZE = 5  # 5-year moving average (adjust: 3 for less smoothing, 7 for more)\n\n# Apply rolling mean\nyearly_smoothed = yearly_proportions.rolling(\n    window=WINDOW_SIZE, center=True, min_periods=2\n).mean()\n\n# Drop NaN edges from rolling\nyearly_smoothed = yearly_smoothed.dropna()\n\nprint(f\"Smoothed data range: {yearly_smoothed.index.min()} – {yearly_smoothed.index.max()}\")\n\n# %%\n# Smoothed stacked area chart (ThemeRiver-style)\nfig, ax = plt.subplots(figsize=(16, 8))\nyearly_smoothed.plot.area(ax=ax, alpha=0.85, linewidth=0.5)\nax.set_title(\n    f\"Topic Proportions Over Time ({WINDOW_SIZE}-Year Moving Average)\",\n    fontsize=14\n)\nax.set_xlabel(\"Year\")\nax.set_ylabel(\"Proportion of Papers\")\nax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)\nax.set_xlim(yearly_smoothed.index.min(), yearly_smoothed.index.max())\nplt.tight_layout()\nplt.show()\n\n# %%\n# Small-multiple line charts (one per topic) – easier to read individual trends\nn_final_topics = len(yearly_smoothed.columns)\nncols = 4\nnrows = int(np.ceil(n_final_topics / ncols))\n\nfig, axes = plt.subplots(nrows, ncols, figsize=(16, 3 * nrows), sharex=True, sharey=True)\naxes_flat = axes.flatten()\n\nfor idx, topic_name in enumerate(yearly_smoothed.columns):\n    ax = axes_flat[idx]\n    ax.plot(yearly_smoothed.index, yearly_smoothed[topic_name], color='steelblue', linewidth=2)\n    ax.fill_between(yearly_smoothed.index, yearly_smoothed[topic_name], alpha=0.3, color='steelblue')\n    ax.set_title(topic_name, fontsize=9, fontweight='bold')\n    ax.set_ylim(0, yearly_smoothed.values.max() * 1.1)\n    ax.grid(True, alpha=0.3)\n\n# Hide unused subplots\nfor idx in range(n_final_topics, len(axes_flat)):\n    axes_flat[idx].set_visible(False)\n\nfig.suptitle(f\"Individual Topic Trends ({WINDOW_SIZE}-Year Smoothing)\", fontsize=14, y=1.01)\nplt.tight_layout()\nplt.show()\n\n# %% [markdown]\n# ## 7. Trend Classification & Summary\n\n# %%\n# Classify each topic's trend: Emerging, Declining, Stable\n# Compare average proportion in first third vs. last third of time span\n\nyears = yearly_smoothed.index\nn_years = len(years)\nthird = n_years // 3\n\nearly_period = yearly_smoothed.iloc[:third]\nlate_period = yearly_smoothed.iloc[-third:]\n\ntrend_summary = []\nfor topic_name in yearly_smoothed.columns:\n    early_mean = early_period[topic_name].mean()\n    late_mean = late_period[topic_name].mean()\n    overall_mean = yearly_smoothed[topic_name].mean()\n\n    # Relative change\n    if early_mean > 0:\n        rel_change = (late_mean - early_mean) / early_mean\n    else:\n        rel_change = 1.0 if late_mean > 0 else 0.0\n\n    # Classify\n    if rel_change > 0.3:\n        trend = \"📈 Emerging\"\n    elif rel_change < -0.3:\n        trend = \"📉 Declining\"\n    else:\n        trend = \"➡️ Stable\"\n\n    # Peak year\n    peak_year = yearly_smoothed[topic_name].idxmax()\n\n    trend_summary.append({\n        'Topic': topic_name,\n        'Early Avg (%)': f\"{early_mean * 100:.1f}\",\n        'Late Avg (%)': f\"{late_mean * 100:.1f}\",\n        'Relative Change': f\"{rel_change:+.0%}\",\n        'Peak Year': int(peak_year),\n        'Trend': trend\n    })\n\ntrend_df = pd.DataFrame(trend_summary)\ntrend_df = trend_df.sort_values('Trend', ascending=True)\n\nprint(\"=\" * 70)\nprint(\"TOPIC TREND SUMMARY\")\nprint(\"=\" * 70)\ntrend_df\n\n# %%\n# Highlight: Top emerging and declining topics\nprint(\"\\n🔥 TOP EMERGING TOPICS:\")\nemerging = trend_df[trend_df['Trend'] == '📈 Emerging']\nfor _, row in emerging.iterrows():\n    print(f\"   • {row['Topic']} ({row['Early Avg (%)']}% → {row['Late Avg (%)']}%)\")\n\nprint(\"\\n❄️ TOP DECLINING TOPICS:\")\ndeclining = trend_df[trend_df['Trend'] == '📉 Declining']\nfor _, row in declining.iterrows():\n    print(f\"   • {row['Topic']} ({row['Early Avg (%)']}% → {row['Late Avg (%)']}%)\")\n\nprint(\"\\n⚖️ STABLE PILLARS:\")\nstable = trend_df[trend_df['Trend'] == '➡️ Stable']\nfor _, row in stable.iterrows():\n    print(f\"   • {row['Topic']} (peak: {row['Peak Year']})\")\n\n# %% [markdown]\n# ## 8. Detail-on-Demand: Drill into a Specific Topic (Pattern 5 – Restore)\n\n# %%\n# Select a topic to drill into\nDRILL_TOPIC = yearly_smoothed.columns[0]  # <-- Change to any topic of interest\n\nprint(f\"Drilling into: '{DRILL_TOPIC}'\")\nprint(\"=\" * 70)\n\ntopic_papers = df[df['topic_label'] == DRILL_TOPIC].sort_values('year')\n\n# Show papers per 5-year era\nfor era_start in range(df['year'].min(), df['year'].max() + 1, 5):\n    era_end = era_start + 4\n    era_papers = topic_papers[\n        (topic_papers['year'] >= era_start) & (topic_papers['year'] <= era_end)\n    ]\n    if len(era_papers) > 0:\n        print(f\"\\n--- {era_start}–{era_end} ({len(era_papers)} papers) ---\")\n        # Show top 3 by topic weight\n        for _, row in era_papers.nlargest(3, 'topic_weight').iterrows():\n            print(f\"  [{row['year']}] {row['title'][:90]}\")\n\n# %% [markdown]\n# ## 9. Export Results\n\n# %%\n# Save results to CSV\noutput_dir = Path(\"output\")\noutput_dir.mkdir(exist_ok=True)\n\n# Papers with topic assignments\ndf[['year', 'title', 'topic_label', 'topic_weight']].to_csv(\n    output_dir / \"papers_with_topics.csv\", index=False\n)\n\n# Yearly proportions (smoothed)\nyearly_smoothed.to_csv(output_dir / \"yearly_topic_proportions_smoothed.csv\")\n\n# Trend summary\ntrend_df.to_csv(output_dir / \"trend_summary.csv\", index=False)\n\nprint(f\"Results saved to '{output_dir}/' directory:\")\nprint(f\"  • papers_with_topics.csv\")\nprint(f\"  • yearly_topic_proportions_smoothed.csv\")\nprint(f\"  • trend_summary.csv\")\n\n# %% [markdown]\n# ## 10. (Optional) 2-D Topic Map\n#\n# Visualise topic positions using t-SNE on the topic-term matrix to see\n# which topics are semantically close.\n\n# %%\n# t-SNE of topic-term vectors\nif N_TOPICS > 5:\n    perplexity = min(5, N_TOPICS - 1)\n    tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity)\n    topic_coords = tsne.fit_transform(H)\n\n    fig, ax = plt.subplots(figsize=(10, 8))\n    scatter = ax.scatter(\n        topic_coords[:, 0], topic_coords[:, 1],\n        s=100, c=range(N_TOPICS), cmap='tab20', alpha=0.8, edgecolors='black'\n    )\n\n    for i in range(N_TOPICS):\n        if i not in EXCLUDE_TOPICS:\n            label = TOPIC_LABELS.get(i, f\"T{i}\")\n            ax.annotate(\n                label, (topic_coords[i, 0], topic_coords[i, 1]),\n                fontsize=7, ha='center', va='bottom'\n            )\n\n    ax.set_title(\"Topic Map (t-SNE of topic-term vectors)\")\n    ax.set_xlabel(\"t-SNE 1\")\n    ax.set_ylabel(\"t-SNE 2\")\n    plt.tight_layout()\n    plt.show()\n\n# %% [markdown]\n# ---\n# ## Summary\n#\n# This notebook implemented the full workflow:\n#\n# | Step | Pattern | What it does |\n# |------|---------|--------------|\n# | 1–2 | Progressive Abstraction | Raw text → TF-IDF matrix |\n# | 3 | Topic Mining (NMF) | Extract candidate topics |\n# | 4 | Interactive Steering | Merge/label/exclude topics |\n# | 5–6 | Cluster-Label-Distribute | Topic proportions over time |\n# | 7 | Simplify for Trends | Moving-average smoothing + trend classification |\n# | 8 | Restore Detail | Drill into specific topics/eras |\n#\n# **Next steps:**\n# - Refine `MERGE_MAP`, `TOPIC_LABELS`, and `EXCLUDE_TOPICS` iteratively\n# - Try different `N_TOPICS` (20–40) and compare coherence\n# - Experiment with BERTopic for embedding-based topics\n# - Add interactive widgets (`ipywidgets`) for steering without code edits\n```\n\n---\n\n## How to Use\n\n1. **Save** the code above as `topic_evolution.ipynb` (or use `jupytext` / VS Code to convert the `# %%` cells into a notebook).\n2. **Place your dataset** (CSV with `year`, `title`, `abstract` columns) in the same directory and update `DATA_PATH`.\n3. **Run cells sequentially** through Step 3 (topic extraction).\n4. **Inspect** the printed topics and representative papers.\n5. **Edit** `MERGE_MAP`, `TOPIC_LABELS`, and `EXCLUDE_TOPICS` in Section 4a based on your inspection — this is the human-in-the-loop step.\n6. **Re-run** from Section 4a onward to see the refined trend charts.\n7. **Iterate** until the topic set is satisfying.\n\nThe key interactive decision point is **Section 4a** — that's where your domain expertise shapes the final result.",
    "role": "assistant"
  },
  {
    "content": "Please output the notebook as json file",
    "role": "user"
  },
  {
    "content": "<think>The user wants me to output the Jupyter Notebook as a JSON file (the standard .ipynb format). I'll convert the Python script into the proper Jupyter notebook JSON format with cells properly separated.</think>\n\n```json\n{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# IEEE VIS Research Topic Evolution (1990–2024)\\n\",\n    \"\\n\",\n    \"This notebook implements the recommended visual analytics workflow to reveal\\n\",\n    \"major research topics in IEEE VIS publications and how they evolved over time.\\n\",\n    \"\\n\",\n    \"**Workflow:** Progressive Abstraction → Topic Mining → Interactive Steering →\\n\",\n    \"Cluster-Label-Distribute → Simplify for Trends\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 0. Setup & Installation\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# !pip install pandas numpy scikit-learn matplotlib seaborn nltk wordcloud adjustText\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import seaborn as sns\\n\",\n    \"from pathlib import Path\\n\",\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"# NLP\\n\",\n    \"import re\\n\",\n    \"import nltk\\n\",\n    \"from nltk.corpus import stopwords\\n\",\n    \"from nltk.stem import WordNetLemmatizer\\n\",\n    \"\\n\",\n    \"# Topic Modelling\\n\",\n    \"from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\\n\",\n    \"from sklearn.decomposition import NMF, LatentDirichletAllocation\\n\",\n    \"from sklearn.manifold import TSNE\\n\",\n    \"\\n\",\n    \"nltk.download('stopwords', quiet=True)\\n\",\n    \"nltk.download('wordnet', quiet=True)\\n\",\n    \"nltk.download('omw-1.4', quiet=True)\\n\",\n    \"\\n\",\n    \"print(\\\"Setup complete.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 1. Data Loading & Exploration\\n\",\n    \"\\n\",\n    \"Expects a CSV file with at least columns: `Year`, `Title`, `Abstract`.\\n\",\n    \"Adjust the file path and column names to match your dataset.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# === CONFIGURE THIS ===\\n\",\n    \"DATA_PATH = \\\"ieee_vis_papers.csv\\\"  # <-- Adjust path to your dataset\\n\",\n    \"\\n\",\n    \"# Load data\\n\",\n    \"df = pd.read_csv(DATA_PATH)\\n\",\n    \"\\n\",\n    \"# --- Normalise column names (handle common variations) ---\\n\",\n    \"df.columns = df.columns.str.strip().str.lower()\\n\",\n    \"\\n\",\n    \"# Map to expected names (adjust if your columns differ)\\n\",\n    \"COLUMN_MAP = {\\n\",\n    \"    'year': 'year',\\n\",\n    \"    'title': 'title',\\n\",\n    \"    'abstract': 'abstract',\\n\",\n    \"    # Add mappings if needed, e.g. 'paper title': 'title'\\n\",\n    \"}\\n\",\n    \"df = df.rename(columns=COLUMN_MAP)\\n\",\n    \"\\n\",\n    \"# Basic validation\\n\",\n    \"assert 'year' in df.columns, \\\"Column 'year' not found. Adjust COLUMN_MAP.\\\"\\n\",\n    \"assert 'title' in df.columns, \\\"Column 'title' not found. Adjust COLUMN_MAP.\\\"\\n\",\n    \"assert 'abstract' in df.columns, \\\"Column 'abstract' not found. Adjust COLUMN_MAP.\\\"\\n\",\n    \"\\n\",\n    \"print(f\\\"Dataset shape: {df.shape}\\\")\\n\",\n    \"print(f\\\"Year range: {df['year'].min()} \\\\u2013 {df['year'].max()}\\\")\\n\",\n    \"print(f\\\"Sample columns: {list(df.columns[:10])}\\\")\\n\",\n    \"df.head(3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Drop rows with missing abstracts\\n\",\n    \"before = len(df)\\n\",\n    \"df = df.dropna(subset=['abstract']).copy()\\n\",\n    \"df = df[df['abstract'].str.strip().str.len() > 50]  # Remove very short abstracts\\n\",\n    \"print(f\\\"Dropped {before - len(df)} rows with missing/short abstracts. Remaining: {len(df)}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Papers per year\\n\",\n    \"fig, ax = plt.subplots(figsize=(12, 4))\\n\",\n    \"df.groupby('year').size().plot(kind='bar', ax=ax, color='steelblue', width=0.8)\\n\",\n    \"ax.set_title(\\\"Number of Papers per Year\\\")\\n\",\n    \"ax.set_xlabel(\\\"Year\\\")\\n\",\n    \"ax.set_ylabel(\\\"Count\\\")\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2. Text Preprocessing (Pattern 1 – Progressive Abstraction, Level 1)\\n\",\n    \"\\n\",\n    \"Concatenate title and abstract, clean, tokenise, remove stopwords, and lemmatise.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Combine title + abstract\\n\",\n    \"df['text'] = df['title'].fillna('') + ' ' + df['abstract'].fillna('')\\n\",\n    \"\\n\",\n    \"# Custom stopwords: domain-generic terms that don't help distinguish topics\\n\",\n    \"CUSTOM_STOPWORDS = {\\n\",\n    \"    'paper', 'propose', 'proposed', 'approach', 'method', 'methods',\\n\",\n    \"    'result', 'results', 'show', 'shown', 'use', 'used', 'using',\\n\",\n    \"    'present', 'presented', 'new', 'novel', 'technique', 'techniques',\\n\",\n    \"    'based', 'provide', 'system', 'also', 'work', 'problem',\\n\",\n    \"    'describe', 'described', 'demonstrate', 'demonstrated',\\n\",\n    \"    'algorithm', 'algorithms', 'however', 'many', 'one', 'two',\\n\",\n    \"    'first', 'second', 'allow', 'allows', 'well', 'can', 'may',\\n\",\n    \"    'different', 'several', 'existing', 'within', 'number',\\n\",\n    \"    'introduction', 'conclusion', 'abstract', 'study', 'studies',\\n\",\n    \"    'example', 'examples', 'application', 'applications'\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"stop_words = set(stopwords.words('english')).union(CUSTOM_STOPWORDS)\\n\",\n    \"lemmatizer = WordNetLemmatizer()\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def preprocess_text(text):\\n\",\n    \"    \\\"\\\"\\\"Clean, tokenise, remove stopwords, lemmatise.\\\"\\\"\\\"\\n\",\n    \"    # Lowercase\\n\",\n    \"    text = text.lower()\\n\",\n    \"    # Remove non-alphabetic characters (keep spaces)\\n\",\n    \"    text = re.sub(r'[^a-z\\\\s]', ' ', text)\\n\",\n    \"    # Tokenise\\n\",\n    \"    tokens = text.split()\\n\",\n    \"    # Remove stopwords, short tokens, and lemmatise\\n\",\n    \"    tokens = [\\n\",\n    \"        lemmatizer.lemmatize(t)\\n\",\n    \"        for t in tokens\\n\",\n    \"        if t not in stop_words and len(t) > 2\\n\",\n    \"    ]\\n\",\n    \"    return ' '.join(tokens)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"df['processed_text'] = df['text'].apply(preprocess_text)\\n\",\n    \"print(\\\"Preprocessing complete.\\\")\\n\",\n    \"print(f\\\"Example:\\\\n{df['processed_text'].iloc[0][:200]}...\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3. Topic Extraction (Pattern 11 – Topic Mining)\\n\",\n    \"\\n\",\n    \"We use **NMF** (Non-negative Matrix Factorisation) on TF-IDF features.\\n\",\n    \"NMF typically produces more coherent topics than LDA for short scientific texts.\\n\",\n    \"We intentionally over-segment (k=30) to allow merging later.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# === PARAMETERS (tune as needed) ===\\n\",\n    \"N_TOPICS = 30          # Intentionally over-segment; will merge later\\n\",\n    \"MAX_FEATURES = 5000    # Vocabulary size\\n\",\n    \"MIN_DF = 5             # Minimum document frequency\\n\",\n    \"MAX_DF = 0.7           # Maximum document frequency (fraction)\\n\",\n    \"NGRAM_RANGE = (1, 2)   # Unigrams + bigrams\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Build TF-IDF matrix\\n\",\n    \"tfidf_vectorizer = TfidfVectorizer(\\n\",\n    \"    max_features=MAX_FEATURES,\\n\",\n    \"    min_df=MIN_DF,\\n\",\n    \"    max_df=MAX_DF,\\n\",\n    \"    ngram_range=NGRAM_RANGE,\\n\",\n    \"    token_pattern=r'(?u)\\\\b[a-z][a-z]+\\\\b'\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'])\\n\",\n    \"feature_names = tfidf_vectorizer.get_feature_names_out()\\n\",\n    \"\\n\",\n    \"print(f\\\"TF-IDF matrix: {tfidf_matrix.shape[0]} documents \\\\u00d7 {tfidf_matrix.shape[1]} terms\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Fit NMF model\\n\",\n    \"nmf_model = NMF(\\n\",\n    \"    n_components=N_TOPICS,\\n\",\n    \"    random_state=42,\\n\",\n    \"    max_iter=500,\\n\",\n    \"    init='nndsvda'\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"W = nmf_model.fit_transform(tfidf_matrix)  # Document-topic matrix\\n\",\n    \"H = nmf_model.components_                   # Topic-term matrix\\n\",\n    \"\\n\",\n    \"print(f\\\"NMF fitting complete. Reconstruction error: {nmf_model.reconstruction_err_:.2f}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Display top terms per topic\\n\",\n    \"N_TOP_TERMS = 12\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def display_topics(model, feature_names, n_top=N_TOP_TERMS):\\n\",\n    \"    \\\"\\\"\\\"Print top terms for each topic.\\\"\\\"\\\"\\n\",\n    \"    topics = []\\n\",\n    \"    for idx, topic_vec in enumerate(model.components_):\\n\",\n    \"        top_indices = topic_vec.argsort()[:-n_top - 1:-1]\\n\",\n    \"        top_terms = [feature_names[i] for i in top_indices]\\n\",\n    \"        topics.append(top_terms)\\n\",\n    \"        print(f\\\"Topic {idx:2d}: {', '.join(top_terms)}\\\")\\n\",\n    \"    return topics\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"print(\\\"TOP TERMS PER TOPIC (inspect for coherence)\\\")\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"topic_terms = display_topics(nmf_model, feature_names)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Assign dominant topic to each paper\\n\",\n    \"df['dominant_topic'] = W.argmax(axis=1)\\n\",\n    \"df['topic_weight'] = W.max(axis=1)\\n\",\n    \"\\n\",\n    \"# Store full topic distribution for soft assignment\\n\",\n    \"topic_distributions = pd.DataFrame(\\n\",\n    \"    W, columns=[f\\\"topic_{i}\\\" for i in range(N_TOPICS)],\\n\",\n    \"    index=df.index\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\nPapers per topic:\\\")\\n\",\n    \"print(df['dominant_topic'].value_counts().sort_index())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 4. Interactive Topic Refinement (Pattern 12 – Model Steering)\\n\",\n    \"\\n\",\n    \"Inspect topic similarity, merge near-duplicate topics, remove junk topics,\\n\",\n    \"and assign human-readable labels.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Visualise topic similarity via cosine similarity heatmap\\n\",\n    \"from sklearn.metrics.pairwise import cosine_similarity\\n\",\n    \"\\n\",\n    \"topic_similarity = cosine_similarity(H)\\n\",\n    \"\\n\",\n    \"fig, ax = plt.subplots(figsize=(12, 10))\\n\",\n    \"sns.heatmap(\\n\",\n    \"    topic_similarity, annot=False, cmap='YlOrRd',\\n\",\n    \"    xticklabels=range(N_TOPICS),\\n\",\n    \"    yticklabels=range(N_TOPICS),\\n\",\n    \"    ax=ax\\n\",\n    \")\\n\",\n    \"ax.set_title(\\\"Topic-Topic Cosine Similarity (identify candidates for merging)\\\")\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Show representative papers for each topic (top 3 by weight)\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"print(\\\"REPRESENTATIVE PAPERS PER TOPIC\\\")\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"\\n\",\n    \"for t in range(N_TOPICS):\\n\",\n    \"    topic_papers = df[df['dominant_topic'] == t].nlargest(3, 'topic_weight')\\n\",\n    \"    print(f\\\"\\\\n--- Topic {t} ---\\\")\\n\",\n    \"    print(f\\\"    Terms: {', '.join(topic_terms[t][:8])}\\\")\\n\",\n    \"    for _, row in topic_papers.iterrows():\\n\",\n    \"        print(f\\\"    [{row['year']}] {row['title'][:80]}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 4a. Define Topic Merging & Labelling\\n\",\n    \"\\n\",\n    \"**Instructions:** After inspecting the topics above, edit the dictionaries below:\\n\",\n    \"- `MERGE_MAP`: maps topic indices to a group ID (topics with same group ID will be merged)\\n\",\n    \"- `TOPIC_LABELS`: human-readable name for each group ID\\n\",\n    \"- `EXCLUDE_TOPICS`: list of topic indices to drop (junk/incoherent)\\n\",\n    \"\\n\",\n    \"The example below is a **template** – adjust based on YOUR model's output!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ============================================================\\n\",\n    \"# EDIT THIS SECTION BASED ON YOUR INSPECTION OF THE TOPICS\\n\",\n    \"# ============================================================\\n\",\n    \"\\n\",\n    \"# Example merge map: {original_topic_id: merged_group_id}\\n\",\n    \"# Topics assigned to the same group_id will be merged.\\n\",\n    \"# Start by mapping each topic to itself (no merging), then adjust.\\n\",\n    \"MERGE_MAP = {i: i for i in range(N_TOPICS)}\\n\",\n    \"\\n\",\n    \"# --- Example edits (uncomment and adjust): ---\\n\",\n    \"# MERGE_MAP[5] = 3    # Merge topic 5 into topic 3\\n\",\n    \"# MERGE_MAP[12] = 7   # Merge topic 12 into topic 7\\n\",\n    \"# MERGE_MAP[20] = 15  # Merge topic 20 into topic 15\\n\",\n    \"\\n\",\n    \"# Topics to exclude entirely (junk / too generic)\\n\",\n    \"EXCLUDE_TOPICS = []\\n\",\n    \"# EXCLUDE_TOPICS = [0, 22]  # Example: exclude topics 0 and 22\\n\",\n    \"\\n\",\n    \"# Human-readable labels for merged groups\\n\",\n    \"# Only need to label the \\\"target\\\" group IDs that remain after merging.\\n\",\n    \"TOPIC_LABELS = {i: f\\\"Topic {i}\\\" for i in range(N_TOPICS)}\\n\",\n    \"\\n\",\n    \"# --- Example labels (uncomment and adjust): ---\\n\",\n    \"# TOPIC_LABELS[0] = \\\"Volume Rendering\\\"\\n\",\n    \"# TOPIC_LABELS[1] = \\\"Graph & Network Visualization\\\"\\n\",\n    \"# TOPIC_LABELS[2] = \\\"Flow Visualization\\\"\\n\",\n    \"# TOPIC_LABELS[3] = \\\"Information Visualization\\\"\\n\",\n    \"# TOPIC_LABELS[4] = \\\"Scientific Visualization\\\"\\n\",\n    \"# TOPIC_LABELS[5] = \\\"Text & Document Visualization\\\"\\n\",\n    \"# TOPIC_LABELS[6] = \\\"Uncertainty Visualization\\\"\\n\",\n    \"# TOPIC_LABELS[7] = \\\"Machine Learning for Visualization\\\"\\n\",\n    \"# TOPIC_LABELS[8] = \\\"User Studies & Evaluation\\\"\\n\",\n    \"# TOPIC_LABELS[9] = \\\"Geospatial & Cartography\\\"\\n\",\n    \"# TOPIC_LABELS[10] = \\\"High-Dimensional Data\\\"\\n\",\n    \"\\n\",\n    \"print(\\\"Merge map and labels defined. Adjust above as needed.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Apply merging and exclusion\\n\",\n    \"df['merged_topic'] = df['dominant_topic'].map(MERGE_MAP)\\n\",\n    \"df = df[~df['dominant_topic'].isin(EXCLUDE_TOPICS)].copy()\\n\",\n    \"\\n\",\n    \"# Merge topic distributions\\n\",\n    \"merged_groups = {}\\n\",\n    \"for orig, merged in MERGE_MAP.items():\\n\",\n    \"    if orig not in EXCLUDE_TOPICS:\\n\",\n    \"        merged_groups.setdefault(merged, []).append(orig)\\n\",\n    \"\\n\",\n    \"# Create merged topic weight (sum of component topic weights)\\n\",\n    \"for group_id, members in merged_groups.items():\\n\",\n    \"    member_cols = [f\\\"topic_{m}\\\" for m in members]\\n\",\n    \"    existing_cols = [c for c in member_cols if c in topic_distributions.columns]\\n\",\n    \"    if existing_cols:\\n\",\n    \"        topic_distributions[f\\\"merged_{group_id}\\\"] = topic_distributions[existing_cols].sum(axis=1)\\n\",\n    \"\\n\",\n    \"# Final label\\n\",\n    \"df['topic_label'] = df['merged_topic'].map(TOPIC_LABELS)\\n\",\n    \"\\n\",\n    \"# Count unique merged topics\\n\",\n    \"final_topics = sorted(df['merged_topic'].unique())\\n\",\n    \"print(f\\\"Number of final topics after merging: {len(final_topics)}\\\")\\n\",\n    \"print(\\\"\\\\nFinal topic labels:\\\")\\n\",\n    \"for t in final_topics:\\n\",\n    \"    count = (df['merged_topic'] == t).sum()\\n\",\n    \"    print(f\\\"  {TOPIC_LABELS.get(t, t):40s} ({count} papers)\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 5. Temporal Distribution (Pattern 4 – Cluster-Label-Distribute)\\n\",\n    \"\\n\",\n    \"Compute topic proportions per year and visualise.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Compute yearly topic counts (proportion-based to account for varying publication volume)\\n\",\n    \"yearly_counts = df.groupby(['year', 'topic_label']).size().unstack(fill_value=0)\\n\",\n    \"yearly_proportions = yearly_counts.div(yearly_counts.sum(axis=1), axis=0)\\n\",\n    \"\\n\",\n    \"print(f\\\"Yearly proportions table: {yearly_proportions.shape}\\\")\\n\",\n    \"yearly_proportions.tail()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Raw (unsmoothed) stacked area chart\\n\",\n    \"fig, ax = plt.subplots(figsize=(16, 8))\\n\",\n    \"yearly_proportions.plot.area(ax=ax, alpha=0.8, linewidth=0.5)\\n\",\n    \"ax.set_title(\\\"Topic Proportions Over Time (Raw)\\\", fontsize=14)\\n\",\n    \"ax.set_xlabel(\\\"Year\\\")\\n\",\n    \"ax.set_ylabel(\\\"Proportion of Papers\\\")\\n\",\n    \"ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)\\n\",\n    \"ax.set_xlim(df['year'].min(), df['year'].max())\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 6. Trend Smoothing & Narrative (Pattern 5 – Simplify → Analyse → Restore)\\n\",\n    \"\\n\",\n    \"Apply moving-average smoothing to suppress year-to-year fluctuations\\n\",\n    \"and reveal underlying trends.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# === SMOOTHING PARAMETER ===\\n\",\n    \"WINDOW_SIZE = 5  # 5-year moving average (adjust: 3 for less smoothing, 7 for more)\\n\",\n    \"\\n\",\n    \"# Apply rolling mean\\n\",\n    \"yearly_smoothed = yearly_proportions.rolling(\\n\",\n    \"    window=WINDOW_SIZE, center=True, min_periods=2\\n\",\n    \").mean()\\n\",\n    \"\\n\",\n    \"# Drop NaN edges from rolling\\n\",\n    \"yearly_smoothed = yearly_smoothed.dropna()\\n\",\n    \"\\n\",\n    \"print(f\\\"Smoothed data range: {yearly_smoothed.index.min()} \\\\u2013 {yearly_smoothed.index.max()}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Smoothed stacked area chart (ThemeRiver-style)\\n\",\n    \"fig, ax = plt.subplots(figsize=(16, 8))\\n\",\n    \"yearly_smoothed.plot.area(ax=ax, alpha=0.85, linewidth=0.5)\\n\",\n    \"ax.set_title(\\n\",\n    \"    f\\\"Topic Proportions Over Time ({WINDOW_SIZE}-Year Moving Average)\\\",\\n\",\n    \"    fontsize=14\\n\",\n    \")\\n\",\n    \"ax.set_xlabel(\\\"Year\\\")\\n\",\n    \"ax.set_ylabel(\\\"Proportion of Papers\\\")\\n\",\n    \"ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), fontsize=8)\\n\",\n    \"ax.set_xlim(yearly_smoothed.index.min(), yearly_smoothed.index.max())\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Small-multiple line charts (one per topic) – easier to read individual trends\\n\",\n    \"n_final_topics = len(yearly_smoothed.columns)\\n\",\n    \"ncols = 4\\n\",\n    \"nrows = int(np.ceil(n_final_topics / ncols))\\n\",\n    \"\\n\",\n    \"fig, axes = plt.subplots(nrows, ncols, figsize=(16, 3 * nrows), sharex=True, sharey=True)\\n\",\n    \"axes_flat = axes.flatten()\\n\",\n    \"\\n\",\n    \"for idx, topic_name in enumerate(yearly_smoothed.columns):\\n\",\n    \"    ax = axes_flat[idx]\\n\",\n    \"    ax.plot(yearly_smoothed.index, yearly_smoothed[topic_name], color='steelblue', linewidth=2)\\n\",\n    \"    ax.fill_between(yearly_smoothed.index, yearly_smoothed[topic_name], alpha=0.3, color='steelblue')\\n\",\n    \"    ax.set_title(topic_name, fontsize=9, fontweight='bold')\\n\",\n    \"    ax.set_ylim(0, yearly_smoothed.values.max() * 1.1)\\n\",\n    \"    ax.grid(True, alpha=0.3)\\n\",\n    \"\\n\",\n    \"# Hide unused subplots\\n\",\n    \"for idx in range(n_final_topics, len(axes_flat)):\\n\",\n    \"    axes_flat[idx].set_visible(False)\\n\",\n    \"\\n\",\n    \"fig.suptitle(f\\\"Individual Topic Trends ({WINDOW_SIZE}-Year Smoothing)\\\", fontsize=14, y=1.01)\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 7. Trend Classification & Summary\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Classify each topic's trend: Emerging, Declining, Stable\\n\",\n    \"# Compare average proportion in first third vs. last third of time span\\n\",\n    \"\\n\",\n    \"years = yearly_smoothed.index\\n\",\n    \"n_years = len(years)\\n\",\n    \"third = n_years // 3\\n\",\n    \"\\n\",\n    \"early_period = yearly_smoothed.iloc[:third]\\n\",\n    \"late_period = yearly_smoothed.iloc[-third:]\\n\",\n    \"\\n\",\n    \"trend_summary = []\\n\",\n    \"for topic_name in yearly_smoothed.columns:\\n\",\n    \"    early_mean = early_period[topic_name].mean()\\n\",\n    \"    late_mean = late_period[topic_name].mean()\\n\",\n    \"    overall_mean = yearly_smoothed[topic_name].mean()\\n\",\n    \"\\n\",\n    \"    # Relative change\\n\",\n    \"    if early_mean > 0:\\n\",\n    \"        rel_change = (late_mean - early_mean) / early_mean\\n\",\n    \"    else:\\n\",\n    \"        rel_change = 1.0 if late_mean > 0 else 0.0\\n\",\n    \"\\n\",\n    \"    # Classify\\n\",\n    \"    if rel_change > 0.3:\\n\",\n    \"        trend = \\\"\\\\U0001f4c8 Emerging\\\"\\n\",\n    \"    elif rel_change < -0.3:\\n\",\n    \"        trend = \\\"\\\\U0001f4c9 Declining\\\"\\n\",\n    \"    else:\\n\",\n    \"        trend = \\\"\\\\u27a1\\\\ufe0f Stable\\\"\\n\",\n    \"\\n\",\n    \"    # Peak year\\n\",\n    \"    peak_year = yearly_smoothed[topic_name].idxmax()\\n\",\n    \"\\n\",\n    \"    trend_summary.append({\\n\",\n    \"        'Topic': topic_name,\\n\",\n    \"        'Early Avg (%)': f\\\"{early_mean * 100:.1f}\\\",\\n\",\n    \"        'Late Avg (%)': f\\\"{late_mean * 100:.1f}\\\",\\n\",\n    \"        'Relative Change': f\\\"{rel_change:+.0%}\\\",\\n\",\n    \"        'Peak Year': int(peak_year),\\n\",\n    \"        'Trend': trend\\n\",\n    \"    })\\n\",\n    \"\\n\",\n    \"trend_df = pd.DataFrame(trend_summary)\\n\",\n    \"trend_df = trend_df.sort_values('Trend', ascending=True)\\n\",\n    \"\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"print(\\\"TOPIC TREND SUMMARY\\\")\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"trend_df\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Highlight: Top emerging and declining topics\\n\",\n    \"print(\\\"\\\\n\\\\U0001f525 TOP EMERGING TOPICS:\\\")\\n\",\n    \"emerging = trend_df[trend_df['Trend'] == '\\\\U0001f4c8 Emerging']\\n\",\n    \"for _, row in emerging.iterrows():\\n\",\n    \"    print(f\\\"   \\\\u2022 {row['Topic']} ({row['Early Avg (%)']}% \\\\u2192 {row['Late Avg (%)']}%)\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\\u2744\\\\ufe0f TOP DECLINING TOPICS:\\\")\\n\",\n    \"declining = trend_df[trend_df['Trend'] == '\\\\U0001f4c9 Declining']\\n\",\n    \"for _, row in declining.iterrows():\\n\",\n    \"    print(f\\\"   \\\\u2022 {row['Topic']} ({row['Early Avg (%)']}% \\\\u2192 {row['Late Avg (%)']}%)\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\\u2696\\\\ufe0f STABLE PILLARS:\\\")\\n\",\n    \"stable = trend_df[trend_df['Trend'] == '\\\\u27a1\\\\ufe0f Stable']\\n\",\n    \"for _, row in stable.iterrows():\\n\",\n    \"    print(f\\\"   \\\\u2022 {row['Topic']} (peak: {row['Peak Year']})\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 8. Detail-on-Demand: Drill into a Specific Topic (Pattern 5 – Restore)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Select a topic to drill into\\n\",\n    \"DRILL_TOPIC = yearly_smoothed.columns[0]  # <-- Change to any topic of interest\\n\",\n    \"\\n\",\n    \"print(f\\\"Drilling into: '{DRILL_TOPIC}'\\\")\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"\\n\",\n    \"topic_papers = df[df['topic_label'] == DRILL_TOPIC].sort_values('year')\\n\",\n    \"\\n\",\n    \"# Show papers per 5-year era\\n\",\n    \"for era_start in range(int(df['year'].min()), int(df['year'].max()) + 1, 5):\\n\",\n    \"    era_end = era_start + 4\\n\",\n    \"    era_papers = topic_papers[\\n\",\n    \"        (topic_papers['year'] >= era_start) & (topic_papers['year'] <= era_end)\\n\",\n    \"    ]\\n\",\n    \"    if len(era_papers) > 0:\\n\",\n    \"        print(f\\\"\\\\n--- {era_start}\\\\u2013{era_end} ({len(era_papers)} papers) ---\\\")\\n\",\n    \"        # Show top 3 by topic weight\\n\",\n    \"        for _, row in era_papers.nlargest(3, 'topic_weight').iterrows():\\n\",\n    \"            print(f\\\"  [{row['year']}] {row['title'][:90]}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 9. Export Results\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Save results to CSV\\n\",\n    \"output_dir = Path(\\\"output\\\")\\n\",\n    \"output_dir.mkdir(exist_ok=True)\\n\",\n    \"\\n\",\n    \"# Papers with topic assignments\\n\",\n    \"df[['year', 'title', 'topic_label', 'topic_weight']].to_csv(\\n\",\n    \"    output_dir / \\\"papers_with_topics.csv\\\", index=False\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"# Yearly proportions (smoothed)\\n\",\n    \"yearly_smoothed.to_csv(output_dir / \\\"yearly_topic_proportions_smoothed.csv\\\")\\n\",\n    \"\\n\",\n    \"# Trend summary\\n\",\n    \"trend_df.to_csv(output_dir / \\\"trend_summary.csv\\\", index=False)\\n\",\n    \"\\n\",\n    \"print(f\\\"Results saved to '{output_dir}/' directory:\\\")\\n\",\n    \"print(f\\\"  \\\\u2022 papers_with_topics.csv\\\")\\n\",\n    \"print(f\\\"  \\\\u2022 yearly_topic_proportions_smoothed.csv\\\")\\n\",\n    \"print(f\\\"  \\\\u2022 trend_summary.csv\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 10. (Optional) 2-D Topic Map\\n\",\n    \"\\n\",\n    \"Visualise topic positions using t-SNE on the topic-term matrix to see\\n\",\n    \"which topics are semantically close.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# t-SNE of topic-term vectors\\n\",\n    \"if N_TOPICS > 5:\\n\",\n    \"    perplexity = min(5, N_TOPICS - 1)\\n\",\n    \"    tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity)\\n\",\n    \"    topic_coords = tsne.fit_transform(H)\\n\",\n    \"\\n\",\n    \"    fig, ax = plt.subplots(figsize=(10, 8))\\n\",\n    \"    scatter = ax.scatter(\\n\",\n    \"        topic_coords[:, 0], topic_coords[:, 1],\\n\",\n    \"        s=100, c=range(N_TOPICS), cmap='tab20', alpha=0.8, edgecolors='black'\\n\",\n    \"    )\\n\",\n    \"\\n\",\n    \"    for i in range(N_TOPICS):\\n\",\n    \"        if i not in EXCLUDE_TOPICS:\\n\",\n    \"            label = TOPIC_LABELS.get(i, f\\\"T{i}\\\")\\n\",\n    \"            ax.annotate(\\n\",\n    \"                label, (topic_coords[i, 0], topic_coords[i, 1]),\\n\",\n    \"                fontsize=7, ha='center', va='bottom'\\n\",\n    \"            )\\n\",\n    \"\\n\",\n    \"    ax.set_title(\\\"Topic Map (t-SNE of topic-term vectors)\\\")\\n\",\n    \"    ax.set_xlabel(\\\"t-SNE 1\\\")\\n\",\n    \"    ax.set_ylabel(\\\"t-SNE 2\\\")\\n\",\n    \"    plt.tight_layout()\\n\",\n    \"    plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\\n\",\n    \"## Summary\\n\",\n    \"\\n\",\n    \"This notebook implemented the full workflow:\\n\",\n    \"\\n\",\n    \"| Step | Pattern | What it does |\\n\",\n    \"|------|---------|---------------|\\n\",\n    \"| 1–2 | Progressive Abstraction | Raw text → TF-IDF matrix |\\n\",\n    \"| 3 | Topic Mining (NMF) | Extract candidate topics |\\n\",\n    \"| 4 | Interactive Steering | Merge/label/exclude topics |\\n\",\n    \"| 5–6 | Cluster-Label-Distribute | Topic proportions over time |\\n\",\n    \"| 7 | Simplify for Trends | Moving-average smoothing + trend classification |\\n\",\n    \"| 8 | Restore Detail | Drill into specific topics/eras |\\n\",\n    \"\\n\",\n    \"**Next steps:**\\n\",\n    \"- Refine `MERGE_MAP`, `TOPIC_LABELS`, and `EXCLUDE_TOPICS` iteratively\\n\",\n    \"- Try different `N_TOPICS` (20–40) and compare coherence\\n\",\n    \"- Experiment with BERTopic for embedding-based topics\\n\",\n    \"- Add interactive widgets (`ipywidgets`) for steering without code edits\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbformat_minor\": 2,\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.12\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 4\n}\n```",
    "role": "assistant"
  }
]