[
  {
    "content": "The attached document contains a definition of a language called ATWL designed for representation of visual analytics workflows. You need to understand the language concepts and semantics it is meant to express, and also the syntax used for expressing the semantics. In the next step, you will be given a \"library\" of VA workflows extracted from published papers. Your task will be to help a user to design a workflow for achieving user's goals by finding relevant examples in the \"library\" and suggesting how they can be adapted to the user's case.",
    "role": "user"
  },
  {
    "content": "<think>The user has provided me with a detailed document describing ATWL (Artifact–Transform Workflow Language), a declarative language for representing visual analytics workflows. Let me summarize my understanding of the key concepts:\n\n**Core Building Blocks:**\n1. **Artifacts** - identifiable analytical objects (8 types):\n   - **entities** - collections of analytical objects (with internal structure, embedment, and features)\n   - **feature** - descriptors of properties of entities\n   - **arrangement** - organization of entities within a context\n   - **visualisation** - visual representations for human perception\n   - **pattern** - abstracted regularities/structures\n   - **model** - formal/computational representations for prediction/simulation/explanation\n   - **knowledge** - explicitly formulated knowledge (derived or injected)\n   - **specification** - parameters, settings, constraints, or method choices\n\n2. **Transforms** - how artifacts are derived/modified (8 generic intents):\n   - **define-unit** - create/redefine entities as units of analysis\n   - **characterise** - compute/transform features\n   - **contextualise** - place entities into a context (arrangement)\n   - **visualise** - create visual representations\n   - **abstract** - derive patterns/conceptual structures\n   - **build-model** - construct/refine models\n   - **generate-knowledge** - formulate knowledge/specifications\n   - **assess** - evaluate quality/adequacy of artifacts\n\n3. **Control Structures:**\n   - **Loops** - iterative refinement with qualitative stopping conditions\n   - **Conditionals** - alternative paths\n   - **Assignments** - binding artifact identifiers to new versions\n\n**Key Concepts:**\n- Internal structure (within entities): elementary, group/cluster, sequence, episode, region, formation\n- Embedment (between entities): set, time, space, sequence, relational structure\n- Value structure for features: atomic, list, vector, matrix, relational configuration\n- Value types: numeric, ordinal, categorical, temporal, spatial, text, reference\n- Actors: human, machine, hybrid\n\nI've mapped out this workflow language framework and I'm ready to help the user design workflows using these components.</think>\n\nI've carefully read and understood the ATWL (Artifact–Transform Workflow Language) specification. Here's a brief summary of my understanding:\n\n## Key Concepts\n\n| Component | Role |\n|-----------|------|\n| **Artifacts** (8 types) | Analytical objects: entities, feature, arrangement, visualisation, pattern, model, knowledge, specification |\n| **Transforms** (8 intents) | Operations: define-unit, characterise, contextualise, visualise, abstract, build-model, generate-knowledge, assess |\n| **Control Structures** | Loops, conditionals, assignments for workflow logic |\n\n## Core Semantics\n\n- **Workflows** are directed graphs: transforms consume input artifacts and produce output artifacts.\n- **Entities** have *internal structure* (how components within each entity are organized) and *embedment* (how entities relate to each other in shared environments).\n- **Transforms** have an *intent* (purpose), optional *manner* (how it's realized), and an *actor* (human/machine/hybrid).\n- **Specifications** serve as control knowledge, influencing how transforms execute.\n- **Knowledge** can be *evaluative* (from `assess`) or *substantive* (from `generate-knowledge`).\n- **Templates** provide high-level summaries of workflow stages using intent sequences.\n\nI'm ready for the next step — please provide the library of VA workflows and/or the user's goals, and I'll help find relevant examples and suggest adapted workflows.",
    "role": "assistant"
  },
  {
    "content": "I have attached an additional file with the workflow library. You can perform initial analysis for revealing reusable and adaptable parts.",
    "role": "user"
  },
  {
    "content": "<think>The user has provided a comprehensive library of 17 ATWL-represented visual analytics workflows. Let me analyze this library to identify reusable and adaptable parts, common patterns, and structural elements that can be leveraged when helping users design new workflows.\n\nLet me organize my analysis by:\n\n1. **Common workflow templates/patterns** - recurring structural motifs\n2. **Reusable workflow segments** - modular parts that appear across multiple workflows\n3. **Domain categorization** - grouping by application area\n4. **Transform intent patterns** - common sequences of intents\n\n## Analysis\n\n### 1. Common Structural Patterns\n\n**Pattern A: Iterative Clustering Refinement**\n\n\nThis shows up across multiple workflows like Cluster-Calendar, MobilityGraphs, Events-to-Places, Progressive Clustering, ST Analysis, and Behaviour Pattern Recognition. The core structure follows a cycle: define a cluster unit, visualize it, assess the results, then either adjust parameters or exit the loop.\n\n**Pattern B: Iterative Model Building/Refinement**\nI'm seeing this pattern in Partition-based Regression, ST Modelling, Behaviour Pattern Recognition, and others—a similar iterative approach but focused on constructing and refining analytical models rather than just clustering.\n\n**Pattern C: Progressive Exploration**\nThis one appears in Dynamic Networks, Topic Modelling, Multivariate Temporal analysis, Random Forest, TensorFlow, and the What-If Tool. The workflow here is lighter: visualize first, then abstract away complexity, assess what you've learned, and decide whether to dig deeper or stop.\n\n**Pattern D: Data Simplification/Abstraction Pipeline**\nFound across Dynamic Networks, MobilityGraphs, EventFlow, and TensorFlow—these workflows focus on reducing data complexity through a structured pipeline.\n\n**Pattern E: Prescriptive/Action Planning Loop**\nEventAction follows this: generate a plan, estimate its impact, visualize the results, assess them, then either refine or wrap up.\n\nI'm also noticing some reusable segments that cut across multiple tools. There's a similarity-based projection workflow that shows up in several places—vectorizing entities, normalizing them, reducing dimensions, then plotting them. Calendar-based temporal patterns are another common segment, where you assign time labels and display them in a grid with color coding. I'm starting to see a two-part segment emerging as well. ization → Assessment → User-guided refinement cycles found in the multivariate and topic modeling sections. Now I'm organizing these patterns by domain: temporal data analysis covers regular time series, event sequences, and multivariate time series; movement and spatial data includes mobility flows, trajectory events, and vessel behavior patterns; and network analysis handles temporal graph structures.\n\nI'm also categorizing machine learning model analysis across regression, time series, classification, and deep learning architectures, along with diagnostic and exploratory tools. The key discovery pattern that emerges is characterizing phenomena, then visualizing them, and finally abstracting to higher-level insights.\n\nI'm seeing several operational patterns: refinement cycles where I assess results and generate knowledge to adjust parameters, model building sequences that move from specification through construction to evaluation, and simplification approaches that filter or merge units before visualizing and assessing. The work splits between machine-driven preprocessing tasks like characterization and visualization, human-driven assessment and decision-making, and hybrid interactive operations like clustering and exploration.</think>\n\n## Initial Analysis of the Workflow Library\n\nI've analyzed all 17 workflows and identified reusable/adaptable structural patterns. Here's my synthesis:\n\n---\n\n### 1. Domain Categorization\n\n| Domain | Workflows | Key Data Types |\n|--------|-----------|----------------|\n| **Temporal data** | 1.1, 1.4, 1.5, 1.9 | Time series, event sequences |\n| **Movement/spatial** | 1.3, 1.6, 1.7, 1.12 | Trajectories, flows, places |\n| **Networks/graphs** | 1.2, 1.16 | Dynamic networks, dataflow graphs |\n| **Text/documents** | 1.8 | Document corpora |\n| **ML model analysis** | 1.10, 1.11, 1.13, 1.14, 1.15, 1.17 | Models, predictions, residuals |\n| **Prescriptive analytics** | 1.5 | Historical records + outcomes |\n\n---\n\n### 2. Recurring Structural Motifs\n\n| Motif | Description | Found In |\n|-------|-------------|----------|\n| **A: Iterative Clustering** | `define-unit(cluster) → visualise → assess → [adjust\\|exit]` | 1.1, 1.3, 1.6, 1.7, 1.11, 1.12 |\n| **B: Model Build–Evaluate–Refine** | `build-model → characterise(evaluate) → visualise → assess → [refine\\|exit]` | 1.10, 1.11, 1.12, 1.13, 1.14 |\n| **C: Progressive Exploration** | `visualise → abstract → assess → [explore more\\|exit]` | 1.2, 1.8, 1.9, 1.15, 1.16, 1.17 |\n| **D: Vectorize–Project–Visualize** | `characterise(vectors) → contextualise(DR) → visualise(scatterplot)` | 1.2, 1.8, 1.12 |\n| **E: Residual-driven Refinement** | `build-model → compute residuals → visualise → assess → refine` | 1.10, 1.11 |\n| **F: Action Plan Tuning** | `specify plan → estimate impact → visualise → assess → [refine\\|exit]` | 1.5 |\n\n---\n\n### 3. Reusable Modular Segments\n\n#### Segment 1: Similarity-based Spatial Overview\n**Source workflows:** 1.2, 1.8, 1.12, 1.15  \n**Template:** `characterise(vectorize) → [characterise(normalize)] → contextualise(projection) → visualise(scatterplot)`  \n**Adaptable to:** Any scenario needing overview of entity similarities (documents, time series, images, model components).\n\n#### Segment 2: Calendar/Context-based Temporal Distribution\n**Source workflows:** 1.1, 1.3  \n**Template:** `contextualise(calendar) → visualise(color-coded grid)`  \n**Adaptable to:** Any temporal cluster membership displayed over a regular time structure.\n\n#### Segment 3: Feature Engineering Loop\n**Source workflow:** 1.12 (primary), 1.9, 1.10  \n**Template:** `characterise(features) → define-unit(cluster) → visualise → assess → [refine features | exit]`  \n**Adaptable to:** Any domain where feature adequacy must be iteratively refined via visual feedback.\n\n#### Segment 4: Two-stage Spatial Clustering\n**Source workflow:** 1.6  \n**Template:** `define-unit(ST-cluster) → define-unit(filter noise) → define-unit(S-cluster)`  \n**Adaptable to:** Any spatial event data needing noise filtering and place delineation.\n\n#### Segment 5: Semi-supervised Model Steering\n**Source workflows:** 1.8 (topics), 1.15 (boundaries)  \n**Template:** `visualise(model output) → assess → generate-knowledge(user guidance) → define-unit/build-model(constrained refit)`  \n**Adaptable to:** Any model where human domain knowledge should guide retraining.\n\n#### Segment 6: Model Diagnostics Pipeline\n**Source workflows:** 1.14, 1.17  \n**Template:** `characterise(predictions) → characterise(explanations) → define-unit(group) → visualise(multi-level) → loop(explore → abstract → assess)`  \n**Adaptable to:** Any trained classifier/regressor needing post-hoc interpretability analysis.\n\n#### Segment 7: Progressive Simplification\n**Source workflows:** 1.4 (EventFlow), 1.16 (TensorFlow)  \n**Template:** `define-unit(simplify/cluster) → define-unit(extract auxiliary) → visualise → loop(assess → simplify further | exit)`  \n**Adaptable to:** Any complex structure (graphs, sequences) needing iterative decluttering.\n\n---\n\n### 4. Common Intent Sequences (Templates)\n\n```\nDiscovery:     characterise → visualise → abstract → generate-knowledge\nRefinement:    assess → generate-knowledge(adjust) → [loop back]\nModel building: generate-knowledge(specify) → build-model → assess\nSimplification: define-unit(filter/merge) → visualise → assess\nContextualization: contextualise(arrange) → visualise → abstract\n```\n\n---\n\n### 5. Cross-cutting Design Principles\n\n1. **Nested loops** — outer loops change strategy/parameters; inner loops refine within a strategy (1.11, 1.14).\n2. **Specification-driven iteration** — a `specification` artifact controls machine behavior; human updates it after assessment.\n3. **Coordinated multi-view visualization** — almost all workflows combine ≥2 linked views for assessment.\n4. **Human-at-decision-points** — machine does computation; human performs assessment, interpretation, and parameter/strategy decisions.\n5. **Progressive abstraction** — raw data → features → clusters/patterns → knowledge, with quality gates at each level (1.9 exemplifies this most clearly).\n\n---\n\nI'm ready to help you design a workflow. Please describe your **goals**, **data characteristics**, and any **constraints**, and I'll match relevant library examples and suggest an adapted workflow.",
    "role": "assistant"
  },
  {
    "content": "I need a workflow design recommendation for the following task. I want to analyse the dataset that contains information on IEEE Visualization (IEEE VIS) publications from 1990-2024 as well as IEEE TVCG and IEEE CG&A articles published at IEEE VIS. The dataset includes for each paper the publication year, title, and abstract. I want to reveal the major research topics and how they evolved over the years. I am interested in seeing trends rather than minor fluctuations. Using the workflow library, compose an appropriate workflow.",
    "role": "user"
  },
  {
    "content": "<think>The user wants to analyze a dataset of IEEE VIS publications (1990-2024) containing year, title, and abstract for each paper. Their goals are:\n1. Reveal major research topics\n2. Show how topics evolved over years\n3. Focus on trends rather than minor fluctuations\n\nLet me think about which workflows from the library are most relevant:\n\n1. **Workflow 1.8** (Topic modeling of document corpora) - This is the most directly relevant. It deals with text documents, topic extraction, and iterative refinement of topic models with human guidance.\n\n2. **Workflow 1.1** (Temporal patterns with clustering) - Relevant for the temporal evolution aspect, particularly the calendar-based visualization and trend identification.\n\n\n3. **Workflow 1.9** (Progressive temporal analysis) - Useful for revealing trends over time and progressive abstraction.\n\n4. **Segment D** (Vectorize-Project-Visualize) - Gives an overview of paper similarities across the corpus.\n\n5. **Motif C** (Progressive Exploration) - Enables iterative exploration of topics and their temporal patterns.\n\nFor the workflow structure, I need to start with topic discovery by processing the titles and abstracts to extract topics through vectorization and topic modeling, then move into the temporal phase where I can track how these topics shift across the decades.\n\nOnce topics are identified, I'll analyze their prevalence over time using temporal pattern analysis, but since the user wants to see trends rather than year-to-year noise, I'll need to apply smoothing or aggregation to abstract away from individual years and reveal the broader patterns.\n\nLooking at the available workflows more closely, Workflow 1.8 handles the interactive topic modeling piece—vectorizing documents, building the topic model, and visualizing topics through word clouds and document-topic distributions with the ability to steer the model iteratively. Workflow 1.9 handles the temporal sequences with progressive abstraction, and Workflow 1.1 covers the temporal analysis foundation.\n\nSo my approach would be: ingest papers with their years, titles, and abstracts; extract text features like TF-IDF or embeddings; build a topic model using something like LDA or BERTopic; visualize the topic-term relationships and document assignments; let the user assess and refine the topics by adjusting parameters or adding constraints; then compute how each topic's prevalence changes year by year; smooth those temporal curves or group years into periods to focus on the major trends; and finally visualize how topics evolve over time.\n\nNow I'm thinking about how to express this workflow in ATWL format. I need to map each step to the right artifact types and transform intents—artifacts can be entities, features, arrangements, visualizations, patterns, models, or knowledge; transforms have intents like define-unit, characterise, contextualise, visualise, abstract, build-model, generate-knowledge, and assess. Let me structure this as a proper directed graph where transforms connect artifacts as inputs and outputs.\n\nI'm organizing this into three main phases: first, data preparation and topic modeling drawing from the library's document corpus approach; second, temporal analysis using clustering and progressive abstraction; and third, trend identification and synthesis. I'll model this similarly to the workflows I've seen in the library examples.\n\nNow I'm structuring the workflow with artifact definitions, transform operations, and control flow. I'm defining the input papers as entities with attributes like year, title, and abstract, then building out the transforms—starting with vectorizing the text features, then applying topic modeling to extract themes from the corpus.\n\nThe workflow includes a feedback loop where I assess the topic quality visually through word clouds and heatmaps, and if the results aren't satisfactory, I adjust the modeling parameters and iterate back through the process. After refining the topics, I move into the temporal analysis phase to track how these topics evolve over time. to need adjustment, I refine the smoothing parameters or time granularity based on feedback, then loop back to recompute the trends. Once the temporal visualization is satisfactory, I move into the final phase where I extract major research phases and eras from the data, then synthesize everything into a coherent narrative about how the field has evolved. I'm also considering whether to add an overview visualization that projects documents into 2D space and colors them by topic, which could provide useful context.</think>\n\n## Workflow Design: Research Topic Evolution in IEEE VIS (1990–2024)\n\n---\n\n### Relevant Library Sources\n\n| Library Element | Contribution to Design |\n|---|---|\n| **Workflow 1.8** (Topic modeling) | Core topic discovery loop with human-steered refinement |\n| **Workflow 1.1** (Temporal clustering) | Temporal profiling of cluster/topic membership over calendar time |\n| **Workflow 1.9** (Progressive abstraction) | Smoothing & trend-level abstraction over temporal data |\n| **Segment D** (Vectorize→Project→Visualise) | Document similarity overview for topic validation |\n| **Motif B** (Build–Evaluate–Refine) | Iterative model quality loop |\n| **Motif C** (Progressive Exploration) | Explore → abstract → assess for trend interpretation |\n\n---\n\n### Workflow in ATWL Notation\n\n```\nWORKFLOW  TopicEvolutionAnalysis\n\n── PHASE 1: DATA PREPARATION & TOPIC DISCOVERY  (adapted from 1.8) ──────────\n\nARTIFACTS\n  papers : entities\n    internal_structure: set of attributes {year: temporal, title: textual, abstract: textual}\n    embedment: none (independent records)\n\n  spec_model : specification\n    content: {method: \"BERTopic\"|\"LDA\"|\"NMF\", num_topics: integer, \n              min_topic_size: integer, constraints: list}\n\nTRANSFORMS\n\n  T1: characterise\n    input:   papers\n    output:  text_vectors : feature\n               structure: vector per entity (from title + abstract)\n    manner:  vectorise text using sentence embeddings (or TF-IDF)\n    actor:   machine\n\n  T2: build-model\n    input:   text_vectors, spec_model\n    output:  topic_model : model\n               content: topic-term distributions, document-topic assignments\n    manner:  fit topic model per spec_model parameters\n    actor:   machine\n\n  T3: characterise\n    input:   topic_model, papers\n    output:  topic_assignments : feature\n               structure: dominant topic label + probability per paper\n    manner:  assign each paper to most probable topic; extract top terms per topic\n    actor:   machine\n\n  T4: contextualise\n    input:   text_vectors, topic_assignments\n    output:  doc_arrangement : arrangement\n               structure: 2-D coordinates (UMAP/t-SNE projection coloured by topic)\n    manner:  dimensionality reduction of text_vectors\n    actor:   machine\n\n  T5: visualise\n    input:   doc_arrangement, topic_model\n    output:  vis_topics : visualisation\n               content: {scatterplot of documents coloured by topic,\n                         topic-term bar charts / word clouds,\n                         topic size bar chart}\n    manner:  coordinated multiple views\n    actor:   machine\n\n  T6: assess\n    input:   vis_topics\n    output:  eval_topics : knowledge (evaluative)\n               content: topic coherence, interpretability, granularity judgement\n    actor:   human\n\n  LOOP L1  (topic refinement — from Motif B / Workflow 1.8 steering loop)\n    condition: eval_topics ≠ \"satisfactory\"\n\n    T7: generate-knowledge\n      input:   eval_topics\n      output:  spec_model_update : specification\n                 content: revised num_topics, merge/split instructions, \n                          keyword constraints, stopword additions\n      manner:  analyst decides how to improve model\n      actor:   human\n\n    ASSIGN  spec_model ← spec_model_update\n    GOTO T2\n\n  END LOOP L1\n\n\n── PHASE 2: TEMPORAL TOPIC PROFILING  (adapted from 1.1, 1.9) ───────────────\n\nTRANSFORMS\n\n  T8: characterise\n    input:   topic_assignments, papers.year\n    output:  topic_temporal_profiles : feature\n               structure: matrix (topics × years), cell = paper count or proportion\n    manner:  aggregate topic membership counts per year\n    actor:   machine\n\n  T9: characterise\n    input:   topic_temporal_profiles, spec_smoothing\n    output:  topic_trends : feature\n               structure: smoothed time series per topic\n    manner:  apply temporal smoothing (moving average / LOESS / Gaussian kernel)\n    actor:   machine\n    note:    suppresses minor fluctuations per user requirement\n\n  ARTIFACTS\n    spec_smoothing : specification\n      content: {window_size: integer, method: \"moving_avg\"|\"loess\"|\"gaussian\",\n                time_granularity: \"year\"|\"3-year\"|\"5-year\"}\n\n  T10: contextualise\n    input:   topic_trends\n    output:  temporal_arrangement : arrangement\n               structure: topics positioned along shared time axis 1990–2024\n    manner:  align all topic curves on common timeline\n    actor:   machine\n\n  T11: visualise\n    input:   temporal_arrangement, topic_trends\n    output:  vis_evolution : visualisation\n               content: {streamgraph / stacked area chart (overall composition),\n                         small-multiple line charts (individual topic trajectories),\n                         horizon graph (compact comparison)}\n    manner:  coordinated temporal views\n    actor:   machine\n\n  T12: assess\n    input:   vis_evolution\n    output:  eval_trends : knowledge (evaluative)\n               content: judgement on smoothing level, readability, \n                        whether trends (not noise) are visible\n    actor:   human\n\n  LOOP L2  (smoothing/granularity adjustment — from Workflow 1.9 progressive abstraction)\n    condition: eval_trends ≠ \"satisfactory\"\n\n    T13: generate-knowledge\n      input:   eval_trends\n      output:  spec_smoothing_update : specification\n      manner:  adjust smoothing window or switch time granularity\n      actor:   human\n\n    ASSIGN  spec_smoothing ← spec_smoothing_update\n    GOTO T9\n\n  END LOOP L2\n\n\n── PHASE 3: TREND INTERPRETATION & KNOWLEDGE GENERATION ─────────────────────\n\nTRANSFORMS\n\n  T14: abstract\n    input:   vis_evolution, vis_topics\n    output:  era_patterns : pattern\n               content: {rising topics, declining topics, stable topics,\n                         inflection points, dominant eras}\n    manner:  identify qualitative phases (e.g., \"rise of ML-based VIS post-2015\")\n    actor:   human\n\n  T15: define-unit  (optional — era segmentation, adapted from 1.1 clustering)\n    input:   topic_temporal_profiles, spec_eras\n    output:  eras : entities\n               internal_structure: contiguous year ranges with coherent topic mix\n    manner:  segment timeline into research eras (manual or change-point detection)\n    actor:   human + machine\n\n  T16: visualise\n    input:   eras, topic_trends\n    output:  vis_eras : visualisation\n               content: annotated timeline with era labels overlaid on streamgraph\n    manner:  overlay era boundaries on vis_evolution\n    actor:   machine\n\n  T17: generate-knowledge\n    input:   era_patterns, vis_eras, eval_topics\n    output:  findings : knowledge (substantive)\n               content: narrative summary of research evolution, \n                        key shifts, emerging/fading themes\n    manner:  synthesise observations into structured report\n    actor:   human\n\nEND WORKFLOW\n```\n\n---\n\n### High-Level Template\n\n```\nPhase 1 (Topic Discovery):\n  characterise → build-model → characterise → contextualise → visualise → assess\n    ↺ [generate-knowledge → rebuild model]\n\nPhase 2 (Temporal Trends):\n  characterise(aggregate) → characterise(smooth) → contextualise → visualise → assess\n    ↺ [adjust smoothing]\n\nPhase 3 (Interpretation):\n  abstract → define-unit(eras) → visualise → generate-knowledge\n```\n\n---\n\n### Design Rationale & Adaptation Notes\n\n| Design Decision | Rationale |\n|---|---|\n| **BERTopic as default method** | Handles short texts (titles+abstracts) well; produces interpretable clusters; allows hierarchical merging — suited for adjusting granularity in L1 |\n| **Smoothing before temporal visualisation** | User explicitly wants *trends not fluctuations*; LOESS or 3–5 year windows appropriate for 35-year span |\n| **Streamgraph for evolution** | Shows both individual topic trajectories and compositional change simultaneously (adapts 1.1's calendar view to continuous time) |\n| **Two nested loops** | L1 ensures topic quality before temporal analysis; L2 tunes the trend-vs-noise tradeoff — mirrors the nested loop structure of workflows 1.11/1.14 |\n| **Era segmentation (T15)** | Optional but valuable for a 35-year span; provides narrative anchors (\"the InfoVis era\", \"the deep-learning era\") |\n| **Scatterplot overview (T4–T5)** | From Segment D — provides spatial validation that topics are coherent clusters in embedding space |\n\n---\n\n### Suggested Tool Stack (for implementation)\n\n| Phase | Tools |\n|---|---|\n| Vectorization (T1) | Sentence-BERT, SciBERT |\n| Topic model (T2) | BERTopic, Top2Vec, or Gensim LDA |\n| Projection (T4) | UMAP |\n| Temporal vis (T11) | D3.js streamgraph, Matplotlib, Altair |\n| Smoothing (T9) | Pandas rolling / statsmodels LOESS |\n\n---\n\nWould you like me to refine any phase (e.g., add a sub-workflow for comparing specific topic pairs, or add a citation-impact weighting step)?",
    "role": "assistant"
  },
  {
    "content": "To test this workflow, create a Jupyter Notebook (Python). Output the notebook as json file.",
    "role": "user"
  },
  {
    "content": "<think>The user wants me to create a Jupyter Notebook (Python) that implements the workflow I designed for analyzing IEEE VIS publications (1990-2024) to reveal major research topics and their evolution over time. I'll create a complete notebook as a JSON file.\n\nThe notebook should implement:\n1. Phase 1: Data Preparation & Topic Discovery (vectorization, topic modeling, projection, visualization, assessment)\n2. Phase 2: Temporal Topic Profiling (aggregation, smoothing, temporal visualization)\n3. Phase 3: Trend Interpretation & Knowledge Generation (era segmentation, final visualization)\n\nI'll need to make some assumptions about the data format. The dataset likely has columns: year, title, abstract. I'll use BERTopic for topic modeling, UMAP for projection, and various visualization libraries.\n\n\nNow I'm building the complete notebook structure with all the necessary imports, data loading, and the three-phase analysis pipeline.</think>\n\n```json\n{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Research Topic Evolution in IEEE VIS (1990–2024)\\n\",\n    \"\\n\",\n    \"This notebook implements the ATWL workflow for discovering major research topics\\n\",\n    \"in IEEE VIS publications and analysing how they evolved over time.\\n\",\n    \"\\n\",\n    \"**Workflow Phases:**\\n\",\n    \"1. Data Preparation & Topic Discovery\\n\",\n    \"2. Temporal Topic Profiling (with smoothing for trends)\\n\",\n    \"3. Trend Interpretation & Knowledge Generation\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Setup & Dependencies\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Install required packages (uncomment if needed)\\n\",\n    \"# !pip install bertopic sentence-transformers umap-learn hdbscan pandas numpy matplotlib seaborn plotly scikit-learn statsmodels\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import matplotlib.cm as cm\\n\",\n    \"import seaborn as sns\\n\",\n    \"from sentence_transformers import SentenceTransformer\\n\",\n    \"from bertopic import BERTopic\\n\",\n    \"from umap import UMAP\\n\",\n    \"from hdbscan import HDBSCAN\\n\",\n    \"from sklearn.feature_extraction.text import CountVectorizer\\n\",\n    \"from statsmodels.nonparametric.smoothers_lowess import lowess\\n\",\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"plt.rcParams['figure.dpi'] = 120\\n\",\n    \"plt.rcParams['figure.figsize'] = (14, 6)\\n\",\n    \"sns.set_style('whitegrid')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Phase 1: Data Preparation & Topic Discovery\\n\",\n    \"\\n\",\n    \"**ATWL Transforms:** T1 (characterise/vectorise) → T2 (build-model) → T3 (characterise/assign) → T4 (contextualise/project) → T5 (visualise) → T6 (assess)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Load Data\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Load the dataset\\n\",\n    \"# Adjust the file path and format as needed (CSV, JSON, etc.)\\n\",\n    \"# Expected columns: 'year', 'title', 'abstract'\\n\",\n    \"\\n\",\n    \"DATA_PATH = \\\"ieee_vis_publications.csv\\\"  # <-- ADJUST THIS PATH\\n\",\n    \"\\n\",\n    \"try:\\n\",\n    \"    df = pd.read_csv(DATA_PATH)\\n\",\n    \"except FileNotFoundError:\\n\",\n    \"    print(f\\\"File not found at {DATA_PATH}. Please update DATA_PATH.\\\")\\n\",\n    \"    print(\\\"Creating synthetic demo data for workflow testing...\\\")\\n\",\n    \"    # Generate synthetic data for testing the workflow\\n\",\n    \"    np.random.seed(42)\\n\",\n    \"    n_papers = 500\\n\",\n    \"    years = np.random.randint(1990, 2025, n_papers)\\n\",\n    \"    sample_topics = [\\n\",\n    \"        \\\"We present a novel graph visualization technique for exploring network structures using node-link diagrams.\\\",\\n\",\n    \"        \\\"This paper describes an interactive volume rendering method for medical imaging data using ray casting.\\\",\\n\",\n    \"        \\\"We propose a machine learning approach for automatic classification of visualization types in scientific publications.\\\",\\n\",\n    \"        \\\"A new method for visual analytics of temporal event sequences in electronic health records is presented.\\\",\\n\",\n    \"        \\\"We introduce a collaborative immersive analytics system using virtual reality for data exploration.\\\",\\n\",\n    \"        \\\"This work presents a technique for uncertainty visualization in ensemble weather forecast data.\\\",\\n\",\n    \"        \\\"We describe a text visualization system for topic modeling and document corpus exploration.\\\",\\n\",\n    \"        \\\"A scalable parallel rendering algorithm for large-scale scientific simulation data is proposed.\\\",\\n\",\n    \"        \\\"We present a user study evaluating the effectiveness of color encoding in multivariate data visualization.\\\",\\n\",\n    \"        \\\"This paper introduces a deep learning method for automatic view recommendation in exploratory data analysis.\\\"\\n\",\n    \"    ]\\n\",\n    \"    abstracts = [sample_topics[i % len(sample_topics)] + f\\\" Variant {i}.\\\" for i in range(n_papers)]\\n\",\n    \"    titles = [f\\\"Paper {i}: \\\" + a.split('.')[0][:60] for i, a in enumerate(abstracts)]\\n\",\n    \"    df = pd.DataFrame({'year': years, 'title': titles, 'abstract': abstracts})\\n\",\n    \"\\n\",\n    \"print(f\\\"Dataset: {len(df)} papers, years {df['year'].min()}–{df['year'].max()}\\\")\\n\",\n    \"df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Prepare text: combine title and abstract for richer representation\\n\",\n    \"df['text'] = df['title'].fillna('') + '. ' + df['abstract'].fillna('')\\n\",\n    \"df = df.dropna(subset=['text']).reset_index(drop=True)\\n\",\n    \"documents = df['text'].tolist()\\n\",\n    \"print(f\\\"Prepared {len(documents)} documents for analysis.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T1: Characterise (Vectorise Text)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T1: characterise — vectorise text using sentence embeddings\\n\",\n    \"# Using a sentence transformer model suitable for scientific text\\n\",\n    \"\\n\",\n    \"print(\\\"Computing sentence embeddings (this may take a few minutes)...\\\")\\n\",\n    \"embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # lightweight; use 'allenai/scibert_scivocab_uncased' for better scientific coverage\\n\",\n    \"embeddings = embedding_model.encode(documents, show_progress_bar=True, batch_size=64)\\n\",\n    \"print(f\\\"Embeddings shape: {embeddings.shape}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T2: Build Model (Topic Model)\\n\",\n    \"\\n\",\n    \"**Specification artifact (spec_model):** controls topic modeling parameters.\\n\",\n    \"Adjust these in Loop L1 if topics are not satisfactory.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# SPECIFICATION: spec_model (adjust in Loop L1 if needed)\\n\",\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"spec_model = {\\n\",\n    \"    'min_topic_size': 15,        # minimum papers per topic\\n\",\n    \"    'nr_topics': 'auto',         # set to integer to force specific count, or 'auto'\\n\",\n    \"    'n_neighbors': 15,           # UMAP parameter\\n\",\n    \"    'n_components': 5,           # UMAP dimensions for clustering\\n\",\n    \"    'min_cluster_size': 15,      # HDBSCAN parameter\\n\",\n    \"    'top_n_words': 10,           # words per topic representation\\n\",\n    \"    'ngram_range': (1, 2),       # unigrams and bigrams\\n\",\n    \"}\\n\",\n    \"# ═══════════════════════════════════════════════════════════\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T2: build-model — fit BERTopic\\n\",\n    \"\\n\",\n    \"umap_model = UMAP(\\n\",\n    \"    n_neighbors=spec_model['n_neighbors'],\\n\",\n    \"    n_components=spec_model['n_components'],\\n\",\n    \"    min_dist=0.0,\\n\",\n    \"    metric='cosine',\\n\",\n    \"    random_state=42\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"hdbscan_model = HDBSCAN(\\n\",\n    \"    min_cluster_size=spec_model['min_cluster_size'],\\n\",\n    \"    metric='euclidean',\\n\",\n    \"    prediction_data=True\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"vectorizer_model = CountVectorizer(\\n\",\n    \"    ngram_range=spec_model['ngram_range'],\\n\",\n    \"    stop_words='english',\\n\",\n    \"    min_df=3\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"topic_model = BERTopic(\\n\",\n    \"    embedding_model=embedding_model,\\n\",\n    \"    umap_model=umap_model,\\n\",\n    \"    hdbscan_model=hdbscan_model,\\n\",\n    \"    vectorizer_model=vectorizer_model,\\n\",\n    \"    top_n_words=spec_model['top_n_words'],\\n\",\n    \"    min_topic_size=spec_model['min_topic_size'],\\n\",\n    \"    nr_topics=spec_model['nr_topics'] if spec_model['nr_topics'] != 'auto' else None,\\n\",\n    \"    verbose=True\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"topics, probs = topic_model.fit_transform(documents, embeddings)\\n\",\n    \"print(f\\\"\\\\nDiscovered {len(set(topics)) - (1 if -1 in topics else 0)} topics (+ outlier group -1)\\\")\\n\",\n    \"print(f\\\"Outlier papers: {topics.count(-1)} ({100*topics.count(-1)/len(topics):.1f}%)\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T3: Characterise (Topic Assignments)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T3: characterise — extract topic info and assign to papers\\n\",\n    \"\\n\",\n    \"topic_info = topic_model.get_topic_info()\\n\",\n    \"print(\\\"\\\\nTopic Overview:\\\")\\n\",\n    \"print(topic_info[['Topic', 'Count', 'Name']].to_string())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Add topic assignments to dataframe\\n\",\n    \"df['topic'] = topics\\n\",\n    \"df['topic_prob'] = [p.max() if len(p) > 0 else 0 for p in probs] if probs is not None else 0\\n\",\n    \"\\n\",\n    \"# Show top terms per topic\\n\",\n    \"print(\\\"\\\\nTop terms per topic:\\\")\\n\",\n    \"for topic_id in sorted(set(topics)):\\n\",\n    \"    if topic_id == -1:\\n\",\n    \"        continue\\n\",\n    \"    terms = topic_model.get_topic(topic_id)\\n\",\n    \"    term_str = ', '.join([t[0] for t in terms[:8]])\\n\",\n    \"    count = topic_info[topic_info['Topic'] == topic_id]['Count'].values[0]\\n\",\n    \"    print(f\\\"  Topic {topic_id:2d} ({count:4d} papers): {term_str}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T4: Contextualise (Dimensionality Reduction for Visualisation)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T4: contextualise — project to 2D for document map\\n\",\n    \"\\n\",\n    \"umap_2d = UMAP(\\n\",\n    \"    n_neighbors=15,\\n\",\n    \"    n_components=2,\\n\",\n    \"    min_dist=0.1,\\n\",\n    \"    metric='cosine',\\n\",\n    \"    random_state=42\\n\",\n    \")\\n\",\n    \"coords_2d = umap_2d.fit_transform(embeddings)\\n\",\n    \"df['x'] = coords_2d[:, 0]\\n\",\n    \"df['y'] = coords_2d[:, 1]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T5: Visualise (Topic Overview)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T5: visualise — document scatterplot coloured by topic\\n\",\n    \"\\n\",\n    \"fig, axes = plt.subplots(1, 2, figsize=(18, 7))\\n\",\n    \"\\n\",\n    \"# Left: Document map\\n\",\n    \"ax = axes[0]\\n\",\n    \"unique_topics = sorted(df[df['topic'] != -1]['topic'].unique())\\n\",\n    \"colors = cm.get_cmap('tab20', len(unique_topics))\\n\",\n    \"topic_color_map = {t: colors(i) for i, t in enumerate(unique_topics)}\\n\",\n    \"topic_color_map[-1] = (0.8, 0.8, 0.8, 0.3)  # grey for outliers\\n\",\n    \"\\n\",\n    \"# Plot outliers first (background)\\n\",\n    \"outlier_mask = df['topic'] == -1\\n\",\n    \"ax.scatter(df.loc[outlier_mask, 'x'], df.loc[outlier_mask, 'y'],\\n\",\n    \"           c='lightgrey', s=5, alpha=0.3, label='Outliers')\\n\",\n    \"\\n\",\n    \"# Plot topics\\n\",\n    \"for t in unique_topics:\\n\",\n    \"    mask = df['topic'] == t\\n\",\n    \"    label = topic_model.get_topic_info()[topic_model.get_topic_info()['Topic'] == t]['Name'].values[0][:30]\\n\",\n    \"    ax.scatter(df.loc[mask, 'x'], df.loc[mask, 'y'],\\n\",\n    \"               c=[topic_color_map[t]], s=12, alpha=0.6, label=f'T{t}')\\n\",\n    \"\\n\",\n    \"ax.set_title('Document Map (UMAP projection, coloured by topic)')\\n\",\n    \"ax.set_xlabel('UMAP-1')\\n\",\n    \"ax.set_ylabel('UMAP-2')\\n\",\n    \"ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=7, ncol=1)\\n\",\n    \"\\n\",\n    \"# Right: Topic sizes\\n\",\n    \"ax2 = axes[1]\\n\",\n    \"topic_counts = df[df['topic'] != -1].groupby('topic').size().sort_values(ascending=True)\\n\",\n    \"bars = ax2.barh(range(len(topic_counts)), topic_counts.values)\\n\",\n    \"ax2.set_yticks(range(len(topic_counts)))\\n\",\n    \"topic_labels = []\\n\",\n    \"for t in topic_counts.index:\\n\",\n    \"    terms = topic_model.get_topic(t)\\n\",\n    \"    topic_labels.append(f\\\"T{t}: {', '.join([x[0] for x in terms[:3]])}\\\")\\n\",\n    \"ax2.set_yticklabels(topic_labels, fontsize=8)\\n\",\n    \"ax2.set_xlabel('Number of papers')\\n\",\n    \"ax2.set_title('Topic Sizes')\\n\",\n    \"\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Additional visualisation: BERTopic built-in topic visualisation\\n\",\n    \"try:\\n\",\n    \"    fig_topics = topic_model.visualize_barchart(top_n_topics=12, n_words=8)\\n\",\n    \"    fig_topics.show()\\n\",\n    \"except Exception as e:\\n\",\n    \"    print(f\\\"Interactive plot not available: {e}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T6: Assess (Human Evaluation)\\n\",\n    \"\\n\",\n    \"**Evaluative knowledge:** Examine the topics above and answer:\\n\",\n    \"- Are topics coherent and interpretable?\\n\",\n    \"- Is the granularity appropriate (not too many small topics, not too few coarse ones)?\\n\",\n    \"- Are there obvious merge/split candidates?\\n\",\n    \"\\n\",\n    \"If unsatisfactory → adjust `spec_model` and re-run from T2 (Loop L1).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# T6/T7: ASSESSMENT & LOOP L1 (Topic Refinement)\\n\",\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# Set this flag based on your assessment:\\n\",\n    \"topics_satisfactory = True  # Change to False to trigger refinement\\n\",\n    \"\\n\",\n    \"if not topics_satisfactory:\\n\",\n    \"    # ─── Option A: Reduce number of topics (merge similar) ───\\n\",\n    \"    # topic_model.reduce_topics(documents, nr_topics=15)\\n\",\n    \"    # topics = topic_model.topics_\\n\",\n    \"    \\n\",\n    \"    # ─── Option B: Merge specific topics manually ───\\n\",\n    \"    # topics_to_merge = [[1, 3], [5, 7]]  # merge topic 1 with 3, and 5 with 7\\n\",\n    \"    # topic_model.merge_topics(documents, topics_to_merge)\\n\",\n    \"    # topics = topic_model.topics_\\n\",\n    \"    \\n\",\n    \"    # ─── Option C: Go back and change spec_model parameters ───\\n\",\n    \"    # Update spec_model dict above and re-run from T2 cell\\n\",\n    \"    \\n\",\n    \"    print(\\\"Please adjust parameters and re-run from T2.\\\")\\n\",\n    \"else:\\n\",\n    \"    print(\\\"✓ Topics accepted. Proceeding to Phase 2.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\\n\",\n    \"## Phase 2: Temporal Topic Profiling\\n\",\n    \"\\n\",\n    \"**ATWL Transforms:** T8 (characterise/aggregate) → T9 (characterise/smooth) → T10 (contextualise) → T11 (visualise) → T12 (assess)\\n\",\n    \"\\n\",\n    \"**Goal:** Show how topics rise and fall over 1990–2024, emphasising trends over fluctuations.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T8: Characterise (Temporal Aggregation)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T8: characterise — aggregate topic membership per year\\n\",\n    \"\\n\",\n    \"# Exclude outliers for temporal analysis\\n\",\n    \"df_topics = df[df['topic'] != -1].copy()\\n\",\n    \"\\n\",\n    \"# Count papers per topic per year\\n\",\n    \"topic_year_counts = df_topics.groupby(['year', 'topic']).size().unstack(fill_value=0)\\n\",\n    \"\\n\",\n    \"# Also compute proportions (share of each topic per year)\\n\",\n    \"topic_year_props = topic_year_counts.div(topic_year_counts.sum(axis=1), axis=0)\\n\",\n    \"\\n\",\n    \"# Ensure all years in range are present\\n\",\n    \"all_years = range(df['year'].min(), df['year'].max() + 1)\\n\",\n    \"topic_year_counts = topic_year_counts.reindex(all_years, fill_value=0)\\n\",\n    \"topic_year_props = topic_year_props.reindex(all_years, fill_value=0)\\n\",\n    \"\\n\",\n    \"print(f\\\"Temporal matrix: {topic_year_counts.shape[0]} years × {topic_year_counts.shape[1]} topics\\\")\\n\",\n    \"topic_year_counts.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T9: Characterise (Temporal Smoothing)\\n\",\n    \"\\n\",\n    \"**Specification artifact (spec_smoothing):** controls trend extraction.\\n\",\n    \"Adjust in Loop L2 if trends are too noisy or over-smoothed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# SPECIFICATION: spec_smoothing (adjust in Loop L2 if needed)\\n\",\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"spec_smoothing = {\\n\",\n    \"    'method': 'loess',          # 'moving_avg', 'loess', or 'gaussian'\\n\",\n    \"    'window_size': 5,           # for moving_avg: window in years\\n\",\n    \"    'loess_frac': 0.15,         # for loess: fraction of data for local fit\\n\",\n    \"    'use_proportions': True,    # True = show share; False = show absolute counts\\n\",\n    \"}\\n\",\n    \"# ═══════════════════════════════════════════════════════════\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T9: characterise — apply smoothing to get trends\\n\",\n    \"\\n\",\n    \"data_for_smoothing = topic_year_props if spec_smoothing['use_proportions'] else topic_year_counts\\n\",\n    \"\\n\",\n    \"def smooth_series(years, values, spec):\\n\",\n    \"    \\\"\\\"\\\"Apply smoothing according to specification.\\\"\\\"\\\"\\n\",\n    \"    if spec['method'] == 'moving_avg':\\n\",\n    \"        s = pd.Series(values, index=years)\\n\",\n    \"        return s.rolling(window=spec['window_size'], center=True, min_periods=1).mean().values\\n\",\n    \"    elif spec['method'] == 'loess':\\n\",\n    \"        smoothed = lowess(values, years, frac=spec['loess_frac'], return_sorted=True)\\n\",\n    \"        return smoothed[:, 1]\\n\",\n    \"    elif spec['method'] == 'gaussian':\\n\",\n    \"        from scipy.ndimage import gaussian_filter1d\\n\",\n    \"        sigma = spec['window_size'] / 2.0\\n\",\n    \"        return gaussian_filter1d(values.astype(float), sigma=sigma)\\n\",\n    \"    else:\\n\",\n    \"        return values\\n\",\n    \"\\n\",\n    \"# Apply smoothing to each topic\\n\",\n    \"years_array = np.array(data_for_smoothing.index)\\n\",\n    \"topic_trends = pd.DataFrame(index=data_for_smoothing.index)\\n\",\n    \"\\n\",\n    \"for topic_id in data_for_smoothing.columns:\\n\",\n    \"    values = data_for_smoothing[topic_id].values\\n\",\n    \"    topic_trends[topic_id] = smooth_series(years_array, values, spec_smoothing)\\n\",\n    \"\\n\",\n    \"# Clip negative values (can occur with smoothing)\\n\",\n    \"topic_trends = topic_trends.clip(lower=0)\\n\",\n    \"\\n\",\n    \"print(f\\\"Smoothed trends computed using '{spec_smoothing['method']}' method.\\\")\\n\",\n    \"topic_trends.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T10 & T11: Contextualise & Visualise (Temporal Evolution)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T10/T11: Streamgraph / Stacked Area Chart\\n\",\n    \"\\n\",\n    \"fig, axes = plt.subplots(2, 1, figsize=(16, 12))\\n\",\n    \"\\n\",\n    \"# Get topic labels\\n\",\n    \"topic_labels_map = {}\\n\",\n    \"for t in topic_trends.columns:\\n\",\n    \"    terms = topic_model.get_topic(t)\\n\",\n    \"    topic_labels_map[t] = f\\\"T{t}: {', '.join([x[0] for x in terms[:3]])}\\\"\\n\",\n    \"\\n\",\n    \"# Stacked area chart\\n\",\n    \"ax = axes[0]\\n\",\n    \"colors_list = cm.get_cmap('tab20', len(topic_trends.columns))\\n\",\n    \"ax.stackplot(topic_trends.index, \\n\",\n    \"             [topic_trends[t].values for t in topic_trends.columns],\\n\",\n    \"             labels=[topic_labels_map.get(t, f'T{t}') for t in topic_trends.columns],\\n\",\n    \"             colors=[colors_list(i) for i in range(len(topic_trends.columns))],\\n\",\n    \"             alpha=0.8)\\n\",\n    \"ax.set_title('Topic Evolution — Stacked Area Chart (smoothed proportions)', fontsize=13)\\n\",\n    \"ax.set_xlabel('Year')\\n\",\n    \"ax.set_ylabel('Proportion' if spec_smoothing['use_proportions'] else 'Count')\\n\",\n    \"ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=7)\\n\",\n    \"ax.set_xlim(years_array.min(), years_array.max())\\n\",\n    \"\\n\",\n    \"# Small multiples: individual topic trend lines\\n\",\n    \"ax2 = axes[1]\\n\",\n    \"for i, t in enumerate(topic_trends.columns):\\n\",\n    \"    ax2.plot(topic_trends.index, topic_trends[t].values,\\n\",\n    \"             color=colors_list(i), linewidth=1.5, alpha=0.8,\\n\",\n    \"             label=topic_labels_map.get(t, f'T{t}'))\\n\",\n    \"ax2.set_title('Individual Topic Trends (smoothed)', fontsize=13)\\n\",\n    \"ax2.set_xlabel('Year')\\n\",\n    \"ax2.set_ylabel('Proportion' if spec_smoothing['use_proportions'] else 'Count')\\n\",\n    \"ax2.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=7)\\n\",\n    \"ax2.set_xlim(years_array.min(), years_array.max())\\n\",\n    \"\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Alternative: Small multiples (one subplot per topic) for clarity\\n\",\n    \"\\n\",\n    \"n_topics_plot = len(topic_trends.columns)\\n\",\n    \"n_cols = 4\\n\",\n    \"n_rows = int(np.ceil(n_topics_plot / n_cols))\\n\",\n    \"\\n\",\n    \"fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 3 * n_rows), sharex=True, sharey=True)\\n\",\n    \"axes_flat = axes.flatten() if n_topics_plot > 1 else [axes]\\n\",\n    \"\\n\",\n    \"for i, t in enumerate(topic_trends.columns):\\n\",\n    \"    ax = axes_flat[i]\\n\",\n    \"    ax.fill_between(topic_trends.index, topic_trends[t].values,\\n\",\n    \"                    alpha=0.3, color=colors_list(i))\\n\",\n    \"    ax.plot(topic_trends.index, topic_trends[t].values,\\n\",\n    \"            color=colors_list(i), linewidth=1.5)\\n\",\n    \"    ax.set_title(topic_labels_map.get(t, f'T{t}'), fontsize=8)\\n\",\n    \"    ax.set_xlim(years_array.min(), years_array.max())\\n\",\n    \"\\n\",\n    \"# Hide empty subplots\\n\",\n    \"for j in range(i + 1, len(axes_flat)):\\n\",\n    \"    axes_flat[j].set_visible(False)\\n\",\n    \"\\n\",\n    \"fig.suptitle('Topic Trends — Small Multiples', fontsize=13, y=1.01)\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T12: Assess (Smoothing Adequacy)\\n\",\n    \"\\n\",\n    \"**Evaluative knowledge:** \\n\",\n    \"- Are the trends smooth enough to see long-term patterns?\\n\",\n    \"- Is too much detail lost?\\n\",\n    \"- Do individual topic curves show clear rise/decline/stability?\\n\",\n    \"\\n\",\n    \"If unsatisfactory → adjust `spec_smoothing` and re-run from T9 (Loop L2).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# T12/T13: ASSESSMENT & LOOP L2 (Smoothing Refinement)\\n\",\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"smoothing_satisfactory = True  # Change to False to try different smoothing\\n\",\n    \"\\n\",\n    \"if not smoothing_satisfactory:\\n\",\n    \"    # Adjust spec_smoothing above and re-run from T9 cell\\n\",\n    \"    # Suggestions:\\n\",\n    \"    #   - Increase loess_frac (e.g., 0.2, 0.25) for smoother trends\\n\",\n    \"    #   - Decrease loess_frac (e.g., 0.1) for more detail\\n\",\n    \"    #   - Try method='moving_avg' with window_size=3 or 7\\n\",\n    \"    #   - Try method='gaussian' with window_size=5\\n\",\n    \"    print(\\\"Please adjust spec_smoothing and re-run from T9.\\\")\\n\",\n    \"else:\\n\",\n    \"    print(\\\"✓ Smoothing accepted. Proceeding to Phase 3.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\\n\",\n    \"## Phase 3: Trend Interpretation & Knowledge Generation\\n\",\n    \"\\n\",\n    \"**ATWL Transforms:** T14 (abstract) → T15 (define-unit/eras) → T16 (visualise) → T17 (generate-knowledge)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T14: Abstract (Identify Patterns)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T14: abstract — identify rising, declining, and stable topics\\n\",\n    \"\\n\",\n    \"def classify_trend(series, early_years=5, late_years=5):\\n\",\n    \"    \\\"\\\"\\\"Classify a topic trend as rising, declining, stable, or peaked.\\\"\\\"\\\"\\n\",\n    \"    values = series.values\\n\",\n    \"    early_mean = values[:early_years].mean()\\n\",\n    \"    late_mean = values[-late_years:].mean()\\n\",\n    \"    peak_idx = values.argmax()\\n\",\n    \"    peak_position = peak_idx / len(values)  # 0=start, 1=end\\n\",\n    \"    \\n\",\n    \"    change_ratio = (late_mean - early_mean) / (early_mean + 1e-8)\\n\",\n    \"    \\n\",\n    \"    if change_ratio > 0.5:\\n\",\n    \"        return 'rising', change_ratio\\n\",\n    \"    elif change_ratio < -0.5:\\n\",\n    \"        return 'declining', change_ratio\\n\",\n    \"    elif 0.3 < peak_position < 0.7 and values.max() > 2 * late_mean:\\n\",\n    \"        return 'peaked', change_ratio\\n\",\n    \"    else:\\n\",\n    \"        return 'stable', change_ratio\\n\",\n    \"\\n\",\n    \"print(\\\"Topic Trend Classification:\\\")\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"trend_classifications = {}\\n\",\n    \"for t in topic_trends.columns:\\n\",\n    \"    trend_type, ratio = classify_trend(topic_trends[t])\\n\",\n    \"    trend_classifications[t] = trend_type\\n\",\n    \"    label = topic_labels_map.get(t, f'T{t}')\\n\",\n    \"    print(f\\\"  {label:<45} → {trend_type.upper():<10} (Δ={ratio:+.2f})\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n--- Summary ---\\\")\\n\",\n    \"for category in ['rising', 'declining', 'stable', 'peaked']:\\n\",\n    \"    topics_in_cat = [t for t, c in trend_classifications.items() if c == category]\\n\",\n    \"    if topics_in_cat:\\n\",\n    \"        print(f\\\"  {category.upper()}: {len(topics_in_cat)} topics\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T15: Define-Unit (Era Segmentation)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T15: define-unit — segment timeline into research eras\\n\",\n    \"# Using change-point detection on overall topic composition\\n\",\n    \"\\n\",\n    \"from scipy.spatial.distance import cosine\\n\",\n    \"\\n\",\n    \"def detect_era_boundaries(topic_props, min_era_length=5):\\n\",\n    \"    \\\"\\\"\\\"Detect eras by finding years where topic composition changes significantly.\\\"\\\"\\\"\\n\",\n    \"    years = topic_props.index.values\\n\",\n    \"    dissimilarities = []\\n\",\n    \"    \\n\",\n    \"    for i in range(1, len(years)):\\n\",\n    \"        v1 = topic_props.iloc[i-1].values\\n\",\n    \"        v2 = topic_props.iloc[i].values\\n\",\n    \"        if v1.sum() > 0 and v2.sum() > 0:\\n\",\n    \"            d = cosine(v1, v2)\\n\",\n    \"        else:\\n\",\n    \"            d = 0\\n\",\n    \"        dissimilarities.append(d)\\n\",\n    \"    \\n\",\n    \"    dissimilarities = np.array(dissimilarities)\\n\",\n    \"    \\n\",\n    \"    # Find peaks in dissimilarity (potential era boundaries)\\n\",\n    \"    threshold = np.percentile(dissimilarities, 75)\\n\",\n    \"    boundaries = [years[0]]\\n\",\n    \"    \\n\",\n    \"    for i, d in enumerate(dissimilarities):\\n\",\n    \"        year = years[i + 1]\\n\",\n    \"        if d > threshold and (year - boundaries[-1]) >= min_era_length:\\n\",\n    \"            boundaries.append(year)\\n\",\n    \"    \\n\",\n    \"    boundaries.append(years[-1] + 1)\\n\",\n    \"    return boundaries\\n\",\n    \"\\n\",\n    \"era_boundaries = detect_era_boundaries(topic_trends, min_era_length=5)\\n\",\n    \"print(f\\\"Detected era boundaries: {era_boundaries}\\\")\\n\",\n    \"\\n\",\n    \"# Create era labels\\n\",\n    \"eras = []\\n\",\n    \"for i in range(len(era_boundaries) - 1):\\n\",\n    \"    start = era_boundaries[i]\\n\",\n    \"    end = era_boundaries[i + 1] - 1\\n\",\n    \"    eras.append({'start': start, 'end': end, 'label': f'Era {i+1} ({start}–{end})'})\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\nIdentified Eras:\\\")\\n\",\n    \"for era in eras:\\n\",\n    \"    # Find dominant topics in this era\\n\",\n    \"    era_mask = (topic_trends.index >= era['start']) & (topic_trends.index <= era['end'])\\n\",\n    \"    era_means = topic_trends[era_mask].mean()\\n\",\n    \"    top_topics = era_means.nlargest(3).index.tolist()\\n\",\n    \"    top_labels = [topic_labels_map.get(t, f'T{t}') for t in top_topics]\\n\",\n    \"    print(f\\\"  {era['label']}: dominant = {', '.join(top_labels)}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T16: Visualise (Annotated Timeline)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T16: visualise — annotated streamgraph with era boundaries\\n\",\n    \"\\n\",\n    \"fig, ax = plt.subplots(figsize=(18, 7))\\n\",\n    \"\\n\",\n    \"# Stacked area\\n\",\n    \"ax.stackplot(topic_trends.index,\\n\",\n    \"             [topic_trends[t].values for t in topic_trends.columns],\\n\",\n    \"             labels=[topic_labels_map.get(t, f'T{t}') for t in topic_trends.columns],\\n\",\n    \"             colors=[colors_list(i) for i in range(len(topic_trends.columns))],\\n\",\n    \"             alpha=0.75)\\n\",\n    \"\\n\",\n    \"# Era boundaries\\n\",\n    \"for boundary in era_boundaries[1:-1]:\\n\",\n    \"    ax.axvline(x=boundary, color='black', linestyle='--', linewidth=1.5, alpha=0.7)\\n\",\n    \"\\n\",\n    \"# Era labels\\n\",\n    \"for era in eras:\\n\",\n    \"    mid = (era['start'] + era['end']) / 2\\n\",\n    \"    ax.text(mid, ax.get_ylim()[1] * 0.95, era['label'],\\n\",\n    \"            ha='center', va='top', fontsize=9, fontweight='bold',\\n\",\n    \"            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7))\\n\",\n    \"\\n\",\n    \"ax.set_title('IEEE VIS Research Topic Evolution (1990–2024) — Annotated with Research Eras',\\n\",\n    \"             fontsize=13, fontweight='bold')\\n\",\n    \"ax.set_xlabel('Year', fontsize=11)\\n\",\n    \"ax.set_ylabel('Topic Proportion (smoothed)', fontsize=11)\\n\",\n    \"ax.set_xlim(years_array.min(), years_array.max())\\n\",\n    \"ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=7)\\n\",\n    \"\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Heatmap view: topic intensity over time\\n\",\n    \"\\n\",\n    \"fig, ax = plt.subplots(figsize=(16, max(4, len(topic_trends.columns) * 0.4)))\\n\",\n    \"\\n\",\n    \"heatmap_data = topic_trends.T\\n\",\n    \"heatmap_data.index = [topic_labels_map.get(t, f'T{t}')[:40] for t in heatmap_data.index]\\n\",\n    \"\\n\",\n    \"sns.heatmap(heatmap_data, cmap='YlOrRd', ax=ax, xticklabels=5,\\n\",\n    \"            linewidths=0.1, cbar_kws={'label': 'Proportion'})\\n\",\n    \"ax.set_title('Topic Intensity Heatmap (Year × Topic)', fontsize=13)\\n\",\n    \"ax.set_xlabel('Year')\\n\",\n    \"ax.set_ylabel('')\\n\",\n    \"\\n\",\n    \"# Add era boundaries\\n\",\n    \"for boundary in era_boundaries[1:-1]:\\n\",\n    \"    idx = list(topic_trends.index).index(boundary) if boundary in topic_trends.index else None\\n\",\n    \"    if idx is not None:\\n\",\n    \"        ax.axvline(x=idx, color='blue', linestyle='--', linewidth=1.5, alpha=0.6)\\n\",\n    \"\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T17: Generate Knowledge (Summary Report)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T17: generate-knowledge — structured summary of findings\\n\",\n    \"\\n\",\n    \"print(\\\"=\\\"*80)\\n\",\n    \"print(\\\"FINDINGS: Research Topic Evolution in IEEE VIS (1990–2024)\\\")\\n\",\n    \"print(\\\"=\\\"*80)\\n\",\n    \"\\n\",\n    \"print(f\\\"\\\\n📊 Dataset: {len(df)} papers analysed\\\")\\n\",\n    \"print(f\\\"📅 Time span: {df['year'].min()}–{df['year'].max()}\\\")\\n\",\n    \"print(f\\\"🏷️  Topics discovered: {len(topic_trends.columns)}\\\")\\n\",\n    \"print(f\\\"📐 Smoothing: {spec_smoothing['method']} (frac={spec_smoothing.get('loess_frac', spec_smoothing.get('window_size'))})\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\"*80)\\n\",\n    \"print(\\\"RISING TOPICS (gaining prominence):\\\")\\n\",\n    \"print(\\\"-\\\"*80)\\n\",\n    \"for t, cat in trend_classifications.items():\\n\",\n    \"    if cat == 'rising':\\n\",\n    \"        print(f\\\"  📈 {topic_labels_map.get(t, f'T{t}')}\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\"*80)\\n\",\n    \"print(\\\"DECLINING TOPICS (reduced prominence):\\\")\\n\",\n    \"print(\\\"-\\\"*80)\\n\",\n    \"for t, cat in trend_classifications.items():\\n\",\n    \"    if cat == 'declining':\\n\",\n    \"        print(f\\\"  📉 {topic_labels_map.get(t, f'T{t}')}\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\"*80)\\n\",\n    \"print(\\\"STABLE TOPICS (consistent presence):\\\")\\n\",\n    \"print(\\\"-\\\"*80)\\n\",\n    \"for t, cat in trend_classifications.items():\\n\",\n    \"    if cat == 'stable':\\n\",\n    \"        print(f\\\"  ➡️  {topic_labels_map.get(t, f'T{t}')}\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\"*80)\\n\",\n    \"print(\\\"PEAKED TOPICS (rose then declined):\\\")\\n\",\n    \"print(\\\"-\\\"*80)\\n\",\n    \"for t, cat in trend_classifications.items():\\n\",\n    \"    if cat == 'peaked':\\n\",\n    \"        peak_year = topic_trends[t].idxmax()\\n\",\n    \"        print(f\\\"  ⛰️  {topic_labels_map.get(t, f'T{t}')} (peak ~{peak_year})\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\"*80)\\n\",\n    \"print(\\\"RESEARCH ERAS:\\\")\\n\",\n    \"print(\\\"-\\\"*80)\\n\",\n    \"for era in eras:\\n\",\n    \"    era_mask = (topic_trends.index >= era['start']) & (topic_trends.index <= era['end'])\\n\",\n    \"    era_means = topic_trends[era_mask].mean()\\n\",\n    \"    top3 = era_means.nlargest(3).index.tolist()\\n\",\n    \"    print(f\\\"\\\\n  🕐 {era['label']}\\\")\\n\",\n    \"    print(f\\\"     Dominant topics:\\\")\\n\",\n    \"    for t in top3:\\n\",\n    \"        print(f\\\"       • {topic_labels_map.get(t, f'T{t}')}\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"=\\\"*80)\\n\",\n    \"print(\\\"Note: Interpret these findings considering venue composition changes\\\")\\n\",\n    \"print(\\\"(InfoVis/SciVis/VAST merger in 2021) and overall publication volume growth.\\\")\\n\",\n    \"print(\\\"=\\\"*80)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\\n\",\n    \"## Workflow Complete\\n\",\n    \"\\n\",\n    \"### Workflow Summary (ATWL Template):\\n\",\n    \"```\\n\",\n    \"Phase 1: characterise(vectorise) → build-model(topics) → characterise(assign) → contextualise(project) → visualise → assess ↺[refine]\\n\",\n    \"Phase 2: characterise(aggregate) → characterise(smooth) → contextualise(timeline) → visualise → assess ↺[adjust smoothing]\\n\",\n    \"Phase 3: abstract(patterns) → define-unit(eras) → visualise(annotated) → generate-knowledge(report)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"### Next Steps:\\n\",\n    \"- Fine-tune topics by merging/splitting in Loop L1\\n\",\n    \"- Experiment with different smoothing parameters in Loop L2\\n\",\n    \"- Add citation/impact weighting if data available\\n\",\n    \"- Cross-reference with known community events (e.g., VAST introduced 2006, VIS unified 2021)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.10.0\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 4\n}\n```\n\n---\n\n### Key Implementation Notes\n\n| Workflow Element | Implementation Choice |\n|---|---|\n| **T1 (vectorise)** | `sentence-transformers` with `all-MiniLM-L6-v2` (fast); upgrade to `allenai/scibert` for better domain coverage |\n| **T2 (build-model)** | BERTopic with configurable UMAP + HDBSCAN parameters |\n| **Loop L1** | Manual flag `topics_satisfactory` + merge/reduce options |\n| **T9 (smooth)** | LOESS by default (frac=0.15 ≈ 5-year effective window on 35 years) |\n| **Loop L2** | Manual flag `smoothing_satisfactory` + parameter guidance |\n| **T14 (abstract)** | Automatic trend classification (rising/declining/stable/peaked) |\n| **T15 (eras)** | Cosine-dissimilarity based change-point detection |\n| **T16/T17** | Annotated streamgraph + heatmap + structured text report |\n\nTo run the notebook, update `DATA_PATH` to point to your IEEE VIS dataset CSV (expected columns: `year`, `title`, `abstract`). If the file isn't found, synthetic demo data is generated for testing the pipeline.",
    "role": "assistant"
  },
  {
    "content": "Describe the workflow in natural language.",
    "role": "user"
  },
  {
    "content": "<think>The user wants a natural language description of the workflow I designed for analyzing IEEE VIS publication topics and their evolution over time. I'll describe it clearly, walking through the phases and explaining what happens at each step, including the iterative loops.</think>\n\n## Workflow Description: Research Topic Evolution in IEEE VIS (1990–2024)\n\n### Goal\n\nThe workflow takes a dataset of IEEE VIS publications (each described by year, title, and abstract) and reveals the major research topics within the community and how their prominence has changed over 35 years, emphasising long-term trends rather than year-to-year noise.\n\n---\n\n### Phase 1: Topic Discovery\n\nThe workflow begins by preparing the text data. Each paper's title and abstract are concatenated into a single document and transformed into a numerical vector using a pre-trained sentence embedding model. These vectors capture semantic similarity — papers about related subjects end up with similar vectors.\n\nNext, a topic model (BERTopic) is fitted to the document vectors. The model clusters semantically similar papers together and identifies representative keywords for each cluster. Each cluster corresponds to a research topic. Every paper is then assigned to its most probable topic, and the top terms characterising each topic are extracted.\n\nTo allow the analyst to visually verify topic quality, the high-dimensional vectors are projected into two dimensions using UMAP. The resulting document map is displayed as a scatterplot where each dot is a paper, coloured by its assigned topic. A companion bar chart shows the size of each topic. Together, these views let the analyst judge whether the topics are coherent, well-separated, and at an appropriate level of granularity.\n\nThe analyst then assesses the result. If topics are too fine-grained, too coarse, or contain incoherent mixtures, the analyst adjusts the model parameters — for example, changing the minimum topic size, merging similar topics, or constraining the total number of topics — and the model is rebuilt. This refinement loop repeats until the analyst is satisfied with topic quality.\n\n---\n\n### Phase 2: Temporal Profiling\n\nOnce stable topics are established, the workflow shifts to temporal analysis. For each year in the dataset, the number (or proportion) of papers belonging to each topic is counted, producing a topic-by-year matrix.\n\nBecause raw yearly counts are noisy (especially in earlier years with fewer publications), a smoothing function is applied. By default, LOESS regression gently smooths each topic's time series so that multi-year trends become visible while short-lived fluctuations are suppressed. The smoothing window is chosen to reveal patterns spanning roughly five or more years.\n\nThe smoothed trends are then visualised in multiple coordinated views: a stacked area chart (streamgraph) showing how the overall topic composition evolves, a set of individual line charts (small multiples) showing each topic's trajectory in isolation, and a heatmap showing topic intensity across years.\n\nThe analyst assesses whether the smoothing level is appropriate. If the curves are still too jagged, the smoothing fraction is increased; if important transitions are being blurred out, it is decreased. This adjustment loop continues until the temporal visualisation clearly communicates long-term trends.\n\n---\n\n### Phase 3: Interpretation and Knowledge Generation\n\nWith validated topic trends in hand, the analyst interprets the patterns. Each topic is automatically classified as **rising** (gaining prominence over time), **declining** (fading), **stable** (consistently present), or **peaked** (rose and then fell). This gives a quick structural summary of the field's evolution.\n\nThe workflow then segments the 35-year timeline into distinct **research eras** — contiguous periods during which the topic composition remains relatively stable, separated by years of notable compositional shift. For each era, the dominant topics are identified.\n\nAn annotated visualisation overlays era boundaries on the streamgraph, making it easy to see when the community transitioned from one set of dominant concerns to another.\n\nFinally, the analyst synthesises all observations into a structured knowledge report: which topics have emerged, which have faded, when major shifts occurred, and what characterises each era of IEEE VIS research.\n\n---\n\n### Iterative Structure\n\nThe workflow contains two human-in-the-loop feedback cycles:\n\n1. **Topic refinement loop** — ensures the discovered topics are meaningful before any temporal analysis begins.\n2. **Smoothing adjustment loop** — ensures the temporal visualisation reveals genuine trends at the right level of abstraction.\n\nBoth loops follow the same principle: the machine computes, the human assesses, and if the result is unsatisfactory, the human updates a specification that controls the machine's next iteration.\n\n---\n\n### Roles\n\n| Actor | Responsibilities |\n|-------|-----------------|\n| **Machine** | Text vectorisation, topic modeling, dimensionality reduction, smoothing, aggregation, change-point detection, rendering visualisations |\n| **Human** | Assessing topic coherence, judging smoothing adequacy, interpreting trends, identifying eras, synthesising narrative findings |\n\n---\n\n### Output\n\nThe workflow produces three main outputs:\n\n1. **A validated set of research topics** with representative keywords and a document map.\n2. **Smoothed temporal trend charts** showing each topic's trajectory from 1990 to 2024.\n3. **A structured narrative** summarising rising, declining, and stable themes, major transition points, and distinct research eras in the IEEE VIS community.",
    "role": "assistant"
  },
  {
    "content": "Regarding the Jupyter Notebook: you have included the use of BERTopic for topic modelling. It is hard to install; replace it by NMF. Output the modified notebook as json file.",
    "role": "user"
  },
  {
    "content": "<think>The user wants me to replace BERTopic with NMF (Non-negative Matrix Factorization) for topic modeling in the Jupyter Notebook. NMF is available in scikit-learn and is much easier to install. I'll need to:\n\n1. Remove BERTopic, HDBSCAN dependencies\n2. Use TF-IDF vectorization + NMF from scikit-learn\n3. Keep the sentence embeddings for the UMAP projection (document map), or alternatively use TF-IDF vectors for UMAP too\n4. Adjust the topic extraction logic to work with NMF components\n\n\n5. Maintain the overall workflow structure\n\nFor the redesign, I'll use TF-IDF vectorization with NMF for topic modeling, and keep UMAP for the 2D projection while removing the BERTopic and HDBSCAN dependencies that were causing installation issues. just stick with TF-IDF vectors for both the topic modeling and UMAP projection to keep dependencies minimal and installation straightforward.</think>\n\n```json\n{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Research Topic Evolution in IEEE VIS (1990–2024)\\n\",\n    \"\\n\",\n    \"This notebook implements the ATWL workflow for discovering major research topics\\n\",\n    \"in IEEE VIS publications and analysing how they evolved over time.\\n\",\n    \"\\n\",\n    \"**Workflow Phases:**\\n\",\n    \"1. Data Preparation & Topic Discovery (using NMF)\\n\",\n    \"2. Temporal Topic Profiling (with smoothing for trends)\\n\",\n    \"3. Trend Interpretation & Knowledge Generation\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Setup & Dependencies\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Install required packages (uncomment if needed)\\n\",\n    \"# !pip install pandas numpy matplotlib seaborn scikit-learn umap-learn statsmodels scipy\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import matplotlib.cm as cm\\n\",\n    \"import seaborn as sns\\n\",\n    \"from sklearn.feature_extraction.text import TfidfVectorizer\\n\",\n    \"from sklearn.decomposition import NMF\\n\",\n    \"from umap import UMAP\\n\",\n    \"from statsmodels.nonparametric.smoothers_lowess import lowess\\n\",\n    \"from scipy.spatial.distance import cosine\\n\",\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"plt.rcParams['figure.dpi'] = 120\\n\",\n    \"plt.rcParams['figure.figsize'] = (14, 6)\\n\",\n    \"sns.set_style('whitegrid')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Phase 1: Data Preparation & Topic Discovery\\n\",\n    \"\\n\",\n    \"**ATWL Transforms:** T1 (characterise/vectorise) → T2 (build-model) → T3 (characterise/assign) → T4 (contextualise/project) → T5 (visualise) → T6 (assess)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Load Data\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Load the dataset\\n\",\n    \"# Adjust the file path and format as needed (CSV, JSON, etc.)\\n\",\n    \"# Expected columns: 'year', 'title', 'abstract'\\n\",\n    \"\\n\",\n    \"DATA_PATH = \\\"ieee_vis_publications.csv\\\"  # <-- ADJUST THIS PATH\\n\",\n    \"\\n\",\n    \"try:\\n\",\n    \"    df = pd.read_csv(DATA_PATH)\\n\",\n    \"except FileNotFoundError:\\n\",\n    \"    print(f\\\"File not found at {DATA_PATH}. Please update DATA_PATH.\\\")\\n\",\n    \"    print(\\\"Creating synthetic demo data for workflow testing...\\\")\\n\",\n    \"    # Generate synthetic data for testing the workflow\\n\",\n    \"    np.random.seed(42)\\n\",\n    \"    n_papers = 500\\n\",\n    \"    years = np.random.randint(1990, 2025, n_papers)\\n\",\n    \"    sample_topics = [\\n\",\n    \"        \\\"We present a novel graph visualization technique for exploring network structures using node-link diagrams and adjacency matrices.\\\",\\n\",\n    \"        \\\"This paper describes an interactive volume rendering method for medical imaging data using ray casting and transfer functions.\\\",\\n\",\n    \"        \\\"We propose a machine learning approach for automatic classification of visualization types in scientific publications.\\\",\\n\",\n    \"        \\\"A new method for visual analytics of temporal event sequences in electronic health records is presented.\\\",\\n\",\n    \"        \\\"We introduce a collaborative immersive analytics system using virtual reality for data exploration and sensemaking.\\\",\\n\",\n    \"        \\\"This work presents a technique for uncertainty visualization in ensemble weather forecast data using probabilistic models.\\\",\\n\",\n    \"        \\\"We describe a text visualization system for topic modeling and document corpus exploration using natural language processing.\\\",\\n\",\n    \"        \\\"A scalable parallel rendering algorithm for large-scale scientific simulation data is proposed for high performance computing.\\\",\\n\",\n    \"        \\\"We present a user study evaluating the effectiveness of color encoding in multivariate data visualization dashboards.\\\",\\n\",\n    \"        \\\"This paper introduces a deep learning method for automatic view recommendation in exploratory data analysis workflows.\\\"\\n\",\n    \"    ]\\n\",\n    \"    abstracts = [sample_topics[i % len(sample_topics)] + f\\\" Extended discussion variant {i}.\\\" for i in range(n_papers)]\\n\",\n    \"    titles = [f\\\"Paper {i}: \\\" + a.split('.')[0][:60] for i, a in enumerate(abstracts)]\\n\",\n    \"    df = pd.DataFrame({'year': years, 'title': titles, 'abstract': abstracts})\\n\",\n    \"\\n\",\n    \"print(f\\\"Dataset: {len(df)} papers, years {df['year'].min()}–{df['year'].max()}\\\")\\n\",\n    \"df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Prepare text: combine title and abstract for richer representation\\n\",\n    \"df['text'] = df['title'].fillna('') + '. ' + df['abstract'].fillna('')\\n\",\n    \"df = df.dropna(subset=['text']).reset_index(drop=True)\\n\",\n    \"documents = df['text'].tolist()\\n\",\n    \"print(f\\\"Prepared {len(documents)} documents for analysis.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T1: Characterise (Vectorise Text using TF-IDF)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# SPECIFICATION: spec_model (adjust in Loop L1 if needed)\\n\",\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"spec_model = {\\n\",\n    \"    'n_topics': 12,              # number of topics for NMF\\n\",\n    \"    'max_df': 0.85,              # ignore terms appearing in >85% of docs\\n\",\n    \"    'min_df': 5,                 # ignore terms appearing in <5 docs\\n\",\n    \"    'ngram_range': (1, 2),       # unigrams and bigrams\\n\",\n    \"    'max_features': 5000,        # vocabulary size limit\\n\",\n    \"    'top_n_words': 10,           # words per topic representation\\n\",\n    \"    'nmf_init': 'nndsvda',       # NMF initialisation method\\n\",\n    \"    'nmf_max_iter': 500,         # NMF max iterations\\n\",\n    \"}\\n\",\n    \"# ═══════════════════════════════════════════════════════════\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T1: characterise — vectorise text using TF-IDF\\n\",\n    \"\\n\",\n    \"print(\\\"Computing TF-IDF vectors...\\\")\\n\",\n    \"tfidf_vectorizer = TfidfVectorizer(\\n\",\n    \"    max_df=spec_model['max_df'],\\n\",\n    \"    min_df=spec_model['min_df'],\\n\",\n    \"    ngram_range=spec_model['ngram_range'],\\n\",\n    \"    max_features=spec_model['max_features'],\\n\",\n    \"    stop_words='english'\\n\",\n    \")\\n\",\n    \"tfidf_matrix = tfidf_vectorizer.fit_transform(documents)\\n\",\n    \"feature_names = tfidf_vectorizer.get_feature_names_out()\\n\",\n    \"\\n\",\n    \"print(f\\\"TF-IDF matrix shape: {tfidf_matrix.shape} (documents × terms)\\\")\\n\",\n    \"print(f\\\"Vocabulary size: {len(feature_names)} terms\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T2: Build Model (NMF Topic Model)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T2: build-model — fit NMF topic model\\n\",\n    \"\\n\",\n    \"print(f\\\"Fitting NMF with {spec_model['n_topics']} topics...\\\")\\n\",\n    \"nmf_model = NMF(\\n\",\n    \"    n_components=spec_model['n_topics'],\\n\",\n    \"    init=spec_model['nmf_init'],\\n\",\n    \"    max_iter=spec_model['nmf_max_iter'],\\n\",\n    \"    random_state=42\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"# W matrix: documents × topics (document-topic weights)\\n\",\n    \"doc_topic_matrix = nmf_model.fit_transform(tfidf_matrix)\\n\",\n    \"\\n\",\n    \"# H matrix: topics × terms (topic-term weights)\\n\",\n    \"topic_term_matrix = nmf_model.components_\\n\",\n    \"\\n\",\n    \"reconstruction_error = nmf_model.reconstruction_err_\\n\",\n    \"print(f\\\"NMF reconstruction error: {reconstruction_error:.4f}\\\")\\n\",\n    \"print(f\\\"Document-topic matrix shape: {doc_topic_matrix.shape}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T3: Characterise (Topic Assignments & Top Terms)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T3: characterise — extract top terms per topic and assign papers\\n\",\n    \"\\n\",\n    \"def get_top_terms(topic_term_matrix, feature_names, n_top=10):\\n\",\n    \"    \\\"\\\"\\\"Extract top-N terms for each topic.\\\"\\\"\\\"\\n\",\n    \"    topics_terms = {}\\n\",\n    \"    for topic_idx, topic_vec in enumerate(topic_term_matrix):\\n\",\n    \"        top_indices = topic_vec.argsort()[::-1][:n_top]\\n\",\n    \"        top_terms = [(feature_names[i], topic_vec[i]) for i in top_indices]\\n\",\n    \"        topics_terms[topic_idx] = top_terms\\n\",\n    \"    return topics_terms\\n\",\n    \"\\n\",\n    \"topics_terms = get_top_terms(topic_term_matrix, feature_names, n_top=spec_model['top_n_words'])\\n\",\n    \"\\n\",\n    \"# Assign each paper to its dominant topic\\n\",\n    \"df['topic'] = doc_topic_matrix.argmax(axis=1)\\n\",\n    \"df['topic_weight'] = doc_topic_matrix.max(axis=1)\\n\",\n    \"\\n\",\n    \"# Create topic labels from top terms\\n\",\n    \"topic_labels_map = {}\\n\",\n    \"print(\\\"\\\\nDiscovered Topics:\\\")\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"for topic_id, terms in topics_terms.items():\\n\",\n    \"    term_str = ', '.join([t[0] for t in terms[:5]])\\n\",\n    \"    topic_labels_map[topic_id] = f\\\"T{topic_id}: {term_str}\\\"\\n\",\n    \"    count = (df['topic'] == topic_id).sum()\\n\",\n    \"    full_terms = ', '.join([t[0] for t in terms[:8]])\\n\",\n    \"    print(f\\\"  Topic {topic_id:2d} ({count:4d} papers): {full_terms}\\\")\\n\",\n    \"\\n\",\n    \"print(f\\\"\\\\nTotal topics: {spec_model['n_topics']}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Topic size overview\\n\",\n    \"topic_sizes = df.groupby('topic').size().sort_values(ascending=False)\\n\",\n    \"print(\\\"\\\\nTopic Size Distribution:\\\")\\n\",\n    \"print(topic_sizes.to_string())\\n\",\n    \"print(f\\\"\\\\nSmallest topic: {topic_sizes.min()} papers\\\")\\n\",\n    \"print(f\\\"Largest topic: {topic_sizes.max()} papers\\\")\\n\",\n    \"print(f\\\"Median topic size: {topic_sizes.median():.0f} papers\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T4: Contextualise (Dimensionality Reduction for Visualisation)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T4: contextualise — project TF-IDF vectors to 2D using UMAP\\n\",\n    \"\\n\",\n    \"print(\\\"Computing UMAP 2D projection (this may take a moment)...\\\")\\n\",\n    \"umap_2d = UMAP(\\n\",\n    \"    n_neighbors=15,\\n\",\n    \"    n_components=2,\\n\",\n    \"    min_dist=0.1,\\n\",\n    \"    metric='cosine',\\n\",\n    \"    random_state=42\\n\",\n    \")\\n\",\n    \"coords_2d = umap_2d.fit_transform(tfidf_matrix)\\n\",\n    \"df['x'] = coords_2d[:, 0]\\n\",\n    \"df['y'] = coords_2d[:, 1]\\n\",\n    \"print(\\\"Done.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T5: Visualise (Topic Overview)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T5: visualise — document scatterplot coloured by topic + topic size bar chart\\n\",\n    \"\\n\",\n    \"fig, axes = plt.subplots(1, 2, figsize=(18, 7))\\n\",\n    \"\\n\",\n    \"n_topics = spec_model['n_topics']\\n\",\n    \"colors_cmap = cm.get_cmap('tab20', n_topics)\\n\",\n    \"\\n\",\n    \"# Left: Document map\\n\",\n    \"ax = axes[0]\\n\",\n    \"for t in range(n_topics):\\n\",\n    \"    mask = df['topic'] == t\\n\",\n    \"    ax.scatter(df.loc[mask, 'x'], df.loc[mask, 'y'],\\n\",\n    \"               c=[colors_cmap(t)], s=12, alpha=0.6, label=f'T{t}')\\n\",\n    \"\\n\",\n    \"ax.set_title('Document Map (UMAP projection, coloured by topic)', fontsize=12)\\n\",\n    \"ax.set_xlabel('UMAP-1')\\n\",\n    \"ax.set_ylabel('UMAP-2')\\n\",\n    \"ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=7, ncol=1)\\n\",\n    \"\\n\",\n    \"# Right: Topic sizes with labels\\n\",\n    \"ax2 = axes[1]\\n\",\n    \"topic_counts_sorted = df.groupby('topic').size().sort_values(ascending=True)\\n\",\n    \"bars = ax2.barh(range(len(topic_counts_sorted)), topic_counts_sorted.values,\\n\",\n    \"                color=[colors_cmap(t) for t in topic_counts_sorted.index])\\n\",\n    \"ax2.set_yticks(range(len(topic_counts_sorted)))\\n\",\n    \"ylabels = [topic_labels_map.get(t, f'T{t}')[:45] for t in topic_counts_sorted.index]\\n\",\n    \"ax2.set_yticklabels(ylabels, fontsize=8)\\n\",\n    \"ax2.set_xlabel('Number of papers')\\n\",\n    \"ax2.set_title('Topic Sizes', fontsize=12)\\n\",\n    \"\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Additional: Top terms bar chart per topic\\n\",\n    \"\\n\",\n    \"n_cols = 4\\n\",\n    \"n_rows = int(np.ceil(n_topics / n_cols))\\n\",\n    \"fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 2.5 * n_rows))\\n\",\n    \"axes_flat = axes.flatten()\\n\",\n    \"\\n\",\n    \"for topic_id in range(n_topics):\\n\",\n    \"    ax = axes_flat[topic_id]\\n\",\n    \"    terms = topics_terms[topic_id][:8]\\n\",\n    \"    words = [t[0] for t in terms][::-1]\\n\",\n    \"    weights = [t[1] for t in terms][::-1]\\n\",\n    \"    ax.barh(words, weights, color=colors_cmap(topic_id), alpha=0.8)\\n\",\n    \"    ax.set_title(f'Topic {topic_id}', fontsize=9, fontweight='bold')\\n\",\n    \"    ax.tick_params(axis='y', labelsize=7)\\n\",\n    \"    ax.tick_params(axis='x', labelsize=7)\\n\",\n    \"\\n\",\n    \"# Hide empty subplots\\n\",\n    \"for j in range(n_topics, len(axes_flat)):\\n\",\n    \"    axes_flat[j].set_visible(False)\\n\",\n    \"\\n\",\n    \"fig.suptitle('Top Terms per Topic (NMF weights)', fontsize=13, y=1.01)\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T6: Assess (Human Evaluation)\\n\",\n    \"\\n\",\n    \"**Evaluative knowledge:** Examine the topics above and answer:\\n\",\n    \"- Are topics coherent and interpretable?\\n\",\n    \"- Is the granularity appropriate (not too many small topics, not too few coarse ones)?\\n\",\n    \"- Are there topics that seem redundant or too broad?\\n\",\n    \"\\n\",\n    \"If unsatisfactory → adjust `spec_model` (especially `n_topics`) and re-run from T1 (Loop L1).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# T6/T7: ASSESSMENT & LOOP L1 (Topic Refinement)\\n\",\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# Set this flag based on your assessment:\\n\",\n    \"topics_satisfactory = True  # Change to False to trigger refinement\\n\",\n    \"\\n\",\n    \"if not topics_satisfactory:\\n\",\n    \"    # ─── Suggestions for refinement ───\\n\",\n    \"    # 1. Change n_topics: increase for finer granularity, decrease for coarser\\n\",\n    \"    #    spec_model['n_topics'] = 15  # or 8, 10, 20...\\n\",\n    \"    #\\n\",\n    \"    # 2. Adjust vocabulary: change max_df, min_df, max_features\\n\",\n    \"    #    spec_model['max_df'] = 0.8\\n\",\n    \"    #    spec_model['min_df'] = 3\\n\",\n    \"    #\\n\",\n    \"    # 3. Add domain-specific stop words:\\n\",\n    \"    #    Add terms like 'paper', 'method', 'approach', 'propose' to stop words\\n\",\n    \"    #\\n\",\n    \"    # After adjusting, re-run from the spec_model cell and T1.\\n\",\n    \"    print(\\\"Please adjust spec_model parameters and re-run from T1.\\\")\\n\",\n    \"else:\\n\",\n    \"    print(\\\"✓ Topics accepted. Proceeding to Phase 2.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\\n\",\n    \"## Phase 2: Temporal Topic Profiling\\n\",\n    \"\\n\",\n    \"**ATWL Transforms:** T8 (characterise/aggregate) → T9 (characterise/smooth) → T10 (contextualise) → T11 (visualise) → T12 (assess)\\n\",\n    \"\\n\",\n    \"**Goal:** Show how topics rise and fall over 1990–2024, emphasising trends over fluctuations.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T8: Characterise (Temporal Aggregation)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T8: characterise — aggregate topic membership per year\\n\",\n    \"# Using soft assignments (NMF weights) for more nuanced temporal profiles\\n\",\n    \"\\n\",\n    \"# Option A: Hard assignment (dominant topic per paper)\\n\",\n    \"topic_year_counts = df.groupby(['year', 'topic']).size().unstack(fill_value=0)\\n\",\n    \"\\n\",\n    \"# Option B: Soft assignment (sum of NMF weights per year)\\n\",\n    \"df_weights = pd.DataFrame(doc_topic_matrix, columns=range(n_topics))\\n\",\n    \"df_weights['year'] = df['year'].values\\n\",\n    \"topic_year_soft = df_weights.groupby('year').sum()\\n\",\n    \"\\n\",\n    \"# Use soft assignments for smoother, more representative profiles\\n\",\n    \"topic_year_data = topic_year_soft.copy()\\n\",\n    \"\\n\",\n    \"# Compute proportions (share of each topic per year)\\n\",\n    \"topic_year_props = topic_year_data.div(topic_year_data.sum(axis=1), axis=0)\\n\",\n    \"\\n\",\n    \"# Ensure all years in range are present\\n\",\n    \"all_years = range(df['year'].min(), df['year'].max() + 1)\\n\",\n    \"topic_year_props = topic_year_props.reindex(all_years, fill_value=0)\\n\",\n    \"topic_year_counts = topic_year_counts.reindex(all_years, fill_value=0)\\n\",\n    \"\\n\",\n    \"print(f\\\"Temporal matrix: {topic_year_props.shape[0]} years × {topic_year_props.shape[1]} topics\\\")\\n\",\n    \"topic_year_props.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T9: Characterise (Temporal Smoothing)\\n\",\n    \"\\n\",\n    \"**Specification artifact (spec_smoothing):** controls trend extraction.\\n\",\n    \"Adjust in Loop L2 if trends are too noisy or over-smoothed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# SPECIFICATION: spec_smoothing (adjust in Loop L2 if needed)\\n\",\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"spec_smoothing = {\\n\",\n    \"    'method': 'loess',          # 'moving_avg', 'loess', or 'gaussian'\\n\",\n    \"    'window_size': 5,           # for moving_avg / gaussian: window in years\\n\",\n    \"    'loess_frac': 0.15,         # for loess: fraction of data for local fit\\n\",\n    \"    'use_proportions': True,    # True = show share; False = show absolute counts\\n\",\n    \"}\\n\",\n    \"# ═══════════════════════════════════════════════════════════\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T9: characterise — apply smoothing to get trends\\n\",\n    \"\\n\",\n    \"data_for_smoothing = topic_year_props if spec_smoothing['use_proportions'] else topic_year_counts\\n\",\n    \"\\n\",\n    \"def smooth_series(years, values, spec):\\n\",\n    \"    \\\"\\\"\\\"Apply smoothing according to specification.\\\"\\\"\\\"\\n\",\n    \"    if spec['method'] == 'moving_avg':\\n\",\n    \"        s = pd.Series(values, index=years)\\n\",\n    \"        return s.rolling(window=spec['window_size'], center=True, min_periods=1).mean().values\\n\",\n    \"    elif spec['method'] == 'loess':\\n\",\n    \"        smoothed = lowess(values, years, frac=spec['loess_frac'], return_sorted=True)\\n\",\n    \"        return smoothed[:, 1]\\n\",\n    \"    elif spec['method'] == 'gaussian':\\n\",\n    \"        from scipy.ndimage import gaussian_filter1d\\n\",\n    \"        sigma = spec['window_size'] / 2.0\\n\",\n    \"        return gaussian_filter1d(values.astype(float), sigma=sigma)\\n\",\n    \"    else:\\n\",\n    \"        return values\\n\",\n    \"\\n\",\n    \"# Apply smoothing to each topic\\n\",\n    \"years_array = np.array(data_for_smoothing.index)\\n\",\n    \"topic_trends = pd.DataFrame(index=data_for_smoothing.index)\\n\",\n    \"\\n\",\n    \"for topic_id in data_for_smoothing.columns:\\n\",\n    \"    values = data_for_smoothing[topic_id].values\\n\",\n    \"    topic_trends[topic_id] = smooth_series(years_array, values, spec_smoothing)\\n\",\n    \"\\n\",\n    \"# Clip negative values (can occur with some smoothing methods)\\n\",\n    \"topic_trends = topic_trends.clip(lower=0)\\n\",\n    \"\\n\",\n    \"print(f\\\"Smoothed trends computed using '{spec_smoothing['method']}' method.\\\")\\n\",\n    \"topic_trends.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T10 & T11: Contextualise & Visualise (Temporal Evolution)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T10/T11: Stacked area chart + individual trend lines\\n\",\n    \"\\n\",\n    \"fig, axes = plt.subplots(2, 1, figsize=(16, 12))\\n\",\n    \"\\n\",\n    \"# Stacked area chart (streamgraph-like)\\n\",\n    \"ax = axes[0]\\n\",\n    \"ax.stackplot(topic_trends.index,\\n\",\n    \"             [topic_trends[t].values for t in topic_trends.columns],\\n\",\n    \"             labels=[topic_labels_map.get(t, f'T{t}')[:35] for t in topic_trends.columns],\\n\",\n    \"             colors=[colors_cmap(i) for i in range(n_topics)],\\n\",\n    \"             alpha=0.8)\\n\",\n    \"ax.set_title('Topic Evolution — Stacked Area Chart (smoothed proportions)', fontsize=13)\\n\",\n    \"ax.set_xlabel('Year')\\n\",\n    \"ax.set_ylabel('Proportion')\\n\",\n    \"ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=7)\\n\",\n    \"ax.set_xlim(years_array.min(), years_array.max())\\n\",\n    \"\\n\",\n    \"# Individual topic trend lines\\n\",\n    \"ax2 = axes[1]\\n\",\n    \"for i, t in enumerate(topic_trends.columns):\\n\",\n    \"    ax2.plot(topic_trends.index, topic_trends[t].values,\\n\",\n    \"             color=colors_cmap(i), linewidth=1.8, alpha=0.8,\\n\",\n    \"             label=topic_labels_map.get(t, f'T{t}')[:35])\\n\",\n    \"ax2.set_title('Individual Topic Trends (smoothed)', fontsize=13)\\n\",\n    \"ax2.set_xlabel('Year')\\n\",\n    \"ax2.set_ylabel('Proportion')\\n\",\n    \"ax2.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=7)\\n\",\n    \"ax2.set_xlim(years_array.min(), years_array.max())\\n\",\n    \"\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Small multiples: one subplot per topic for detailed inspection\\n\",\n    \"\\n\",\n    \"n_cols = 4\\n\",\n    \"n_rows = int(np.ceil(n_topics / n_cols))\\n\",\n    \"\\n\",\n    \"fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 3 * n_rows), sharex=True, sharey=True)\\n\",\n    \"axes_flat = axes.flatten()\\n\",\n    \"\\n\",\n    \"for i, t in enumerate(topic_trends.columns):\\n\",\n    \"    ax = axes_flat[i]\\n\",\n    \"    ax.fill_between(topic_trends.index, topic_trends[t].values,\\n\",\n    \"                    alpha=0.3, color=colors_cmap(i))\\n\",\n    \"    ax.plot(topic_trends.index, topic_trends[t].values,\\n\",\n    \"            color=colors_cmap(i), linewidth=1.5)\\n\",\n    \"    ax.set_title(topic_labels_map.get(t, f'T{t}')[:40], fontsize=8)\\n\",\n    \"    ax.set_xlim(years_array.min(), years_array.max())\\n\",\n    \"\\n\",\n    \"# Hide empty subplots\\n\",\n    \"for j in range(len(topic_trends.columns), len(axes_flat)):\\n\",\n    \"    axes_flat[j].set_visible(False)\\n\",\n    \"\\n\",\n    \"fig.suptitle('Topic Trends — Small Multiples', fontsize=13, y=1.01)\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T12: Assess (Smoothing Adequacy)\\n\",\n    \"\\n\",\n    \"**Evaluative knowledge:**\\n\",\n    \"- Are the trends smooth enough to see long-term patterns?\\n\",\n    \"- Is too much detail lost?\\n\",\n    \"- Do individual topic curves show clear rise/decline/stability?\\n\",\n    \"\\n\",\n    \"If unsatisfactory → adjust `spec_smoothing` and re-run from T9 (Loop L2).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"# T12/T13: ASSESSMENT & LOOP L2 (Smoothing Refinement)\\n\",\n    \"# ═══════════════════════════════════════════════════════════\\n\",\n    \"smoothing_satisfactory = True  # Change to False to try different smoothing\\n\",\n    \"\\n\",\n    \"if not smoothing_satisfactory:\\n\",\n    \"    # Adjust spec_smoothing above and re-run from T9 cell\\n\",\n    \"    # Suggestions:\\n\",\n    \"    #   - Increase loess_frac (e.g., 0.2, 0.25) for smoother trends\\n\",\n    \"    #   - Decrease loess_frac (e.g., 0.1) for more detail\\n\",\n    \"    #   - Try method='moving_avg' with window_size=3 or 7\\n\",\n    \"    #   - Try method='gaussian' with window_size=5\\n\",\n    \"    print(\\\"Please adjust spec_smoothing and re-run from T9.\\\")\\n\",\n    \"else:\\n\",\n    \"    print(\\\"✓ Smoothing accepted. Proceeding to Phase 3.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\\n\",\n    \"## Phase 3: Trend Interpretation & Knowledge Generation\\n\",\n    \"\\n\",\n    \"**ATWL Transforms:** T14 (abstract) → T15 (define-unit/eras) → T16 (visualise) → T17 (generate-knowledge)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T14: Abstract (Identify Patterns)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T14: abstract — identify rising, declining, and stable topics\\n\",\n    \"\\n\",\n    \"def classify_trend(series, early_years=5, late_years=5):\\n\",\n    \"    \\\"\\\"\\\"Classify a topic trend as rising, declining, stable, or peaked.\\\"\\\"\\\"\\n\",\n    \"    values = series.values\\n\",\n    \"    early_mean = values[:early_years].mean()\\n\",\n    \"    late_mean = values[-late_years:].mean()\\n\",\n    \"    overall_mean = values.mean()\\n\",\n    \"    peak_idx = values.argmax()\\n\",\n    \"    peak_position = peak_idx / len(values)  # 0=start, 1=end\\n\",\n    \"\\n\",\n    \"    # Compute relative change\\n\",\n    \"    denominator = max(early_mean, 0.001)  # avoid division by zero\\n\",\n    \"    change_ratio = (late_mean - early_mean) / denominator\\n\",\n    \"\\n\",\n    \"    if change_ratio > 0.5:\\n\",\n    \"        return 'rising', change_ratio\\n\",\n    \"    elif change_ratio < -0.4:\\n\",\n    \"        return 'declining', change_ratio\\n\",\n    \"    elif 0.25 < peak_position < 0.75 and values.max() > 1.8 * max(early_mean, late_mean):\\n\",\n    \"        return 'peaked', change_ratio\\n\",\n    \"    else:\\n\",\n    \"        return 'stable', change_ratio\\n\",\n    \"\\n\",\n    \"print(\\\"Topic Trend Classification:\\\")\\n\",\n    \"print(\\\"=\\\" * 70)\\n\",\n    \"trend_classifications = {}\\n\",\n    \"for t in topic_trends.columns:\\n\",\n    \"    trend_type, ratio = classify_trend(topic_trends[t])\\n\",\n    \"    trend_classifications[t] = trend_type\\n\",\n    \"    label = topic_labels_map.get(t, f'T{t}')\\n\",\n    \"    print(f\\\"  {label:<45} → {trend_type.upper():<10} (Δ={ratio:+.2f})\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n--- Summary ---\\\")\\n\",\n    \"for category in ['rising', 'declining', 'stable', 'peaked']:\\n\",\n    \"    topics_in_cat = [t for t, c in trend_classifications.items() if c == category]\\n\",\n    \"    if topics_in_cat:\\n\",\n    \"        print(f\\\"  {category.upper()}: {len(topics_in_cat)} topics\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T15: Define-Unit (Era Segmentation)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T15: define-unit — segment timeline into research eras\\n\",\n    \"# Using cosine dissimilarity on topic composition between consecutive years\\n\",\n    \"\\n\",\n    \"def detect_era_boundaries(topic_props, min_era_length=5):\\n\",\n    \"    \\\"\\\"\\\"Detect eras by finding years where topic composition changes significantly.\\\"\\\"\\\"\\n\",\n    \"    years = topic_props.index.values\\n\",\n    \"    dissimilarities = []\\n\",\n    \"\\n\",\n    \"    for i in range(1, len(years)):\\n\",\n    \"        v1 = topic_props.iloc[i-1].values\\n\",\n    \"        v2 = topic_props.iloc[i].values\\n\",\n    \"        if v1.sum() > 0 and v2.sum() > 0:\\n\",\n    \"            d = cosine(v1, v2)\\n\",\n    \"        else:\\n\",\n    \"            d = 0\\n\",\n    \"        dissimilarities.append(d)\\n\",\n    \"\\n\",\n    \"    dissimilarities = np.array(dissimilarities)\\n\",\n    \"\\n\",\n    \"    # Find peaks in dissimilarity (potential era boundaries)\\n\",\n    \"    threshold = np.percentile(dissimilarities, 75)\\n\",\n    \"    boundaries = [years[0]]\\n\",\n    \"\\n\",\n    \"    for i, d in enumerate(dissimilarities):\\n\",\n    \"        year = years[i + 1]\\n\",\n    \"        if d > threshold and (year - boundaries[-1]) >= min_era_length:\\n\",\n    \"            boundaries.append(year)\\n\",\n    \"\\n\",\n    \"    boundaries.append(years[-1] + 1)\\n\",\n    \"    return boundaries, dissimilarities\\n\",\n    \"\\n\",\n    \"era_boundaries, yearly_dissimilarity = detect_era_boundaries(topic_trends, min_era_length=5)\\n\",\n    \"print(f\\\"Detected era boundaries: {era_boundaries}\\\")\\n\",\n    \"\\n\",\n    \"# Create era labels\\n\",\n    \"eras = []\\n\",\n    \"for i in range(len(era_boundaries) - 1):\\n\",\n    \"    start = era_boundaries[i]\\n\",\n    \"    end = era_boundaries[i + 1] - 1\\n\",\n    \"    eras.append({'start': start, 'end': end, 'label': f'Era {i+1} ({start}–{end})'})\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\nIdentified Research Eras:\\\")\\n\",\n    \"for era in eras:\\n\",\n    \"    # Find dominant topics in this era\\n\",\n    \"    era_mask = (topic_trends.index >= era['start']) & (topic_trends.index <= era['end'])\\n\",\n    \"    era_means = topic_trends[era_mask].mean()\\n\",\n    \"    top_topics = era_means.nlargest(3).index.tolist()\\n\",\n    \"    top_labels = [topic_labels_map.get(t, f'T{t}')[:40] for t in top_topics]\\n\",\n    \"    print(f\\\"  {era['label']}\\\")\\n\",\n    \"    for lbl in top_labels:\\n\",\n    \"        print(f\\\"    • {lbl}\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T16: Visualise (Annotated Timeline)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T16: visualise — annotated streamgraph with era boundaries\\n\",\n    \"\\n\",\n    \"fig, ax = plt.subplots(figsize=(18, 7))\\n\",\n    \"\\n\",\n    \"# Stacked area\\n\",\n    \"ax.stackplot(topic_trends.index,\\n\",\n    \"             [topic_trends[t].values for t in topic_trends.columns],\\n\",\n    \"             labels=[topic_labels_map.get(t, f'T{t}')[:35] for t in topic_trends.columns],\\n\",\n    \"             colors=[colors_cmap(i) for i in range(n_topics)],\\n\",\n    \"             alpha=0.75)\\n\",\n    \"\\n\",\n    \"# Era boundaries\\n\",\n    \"for boundary in era_boundaries[1:-1]:\\n\",\n    \"    ax.axvline(x=boundary, color='black', linestyle='--', linewidth=1.5, alpha=0.7)\\n\",\n    \"\\n\",\n    \"# Era labels\\n\",\n    \"for era in eras:\\n\",\n    \"    mid = (era['start'] + era['end']) / 2\\n\",\n    \"    ax.text(mid, ax.get_ylim()[1] * 0.95 if ax.get_ylim()[1] > 0 else 0.95,\\n\",\n    \"            era['label'],\\n\",\n    \"            ha='center', va='top', fontsize=9, fontweight='bold',\\n\",\n    \"            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7))\\n\",\n    \"\\n\",\n    \"ax.set_title('IEEE VIS Research Topic Evolution (1990–2024) — Annotated with Research Eras',\\n\",\n    \"             fontsize=13, fontweight='bold')\\n\",\n    \"ax.set_xlabel('Year', fontsize=11)\\n\",\n    \"ax.set_ylabel('Topic Proportion (smoothed)', fontsize=11)\\n\",\n    \"ax.set_xlim(years_array.min(), years_array.max())\\n\",\n    \"ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=7)\\n\",\n    \"\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Heatmap view: topic intensity over time\\n\",\n    \"\\n\",\n    \"fig, ax = plt.subplots(figsize=(16, max(4, n_topics * 0.5)))\\n\",\n    \"\\n\",\n    \"heatmap_data = topic_trends.T.copy()\\n\",\n    \"heatmap_data.index = [topic_labels_map.get(t, f'T{t}')[:40] for t in heatmap_data.index]\\n\",\n    \"\\n\",\n    \"sns.heatmap(heatmap_data, cmap='YlOrRd', ax=ax, xticklabels=5,\\n\",\n    \"            linewidths=0.1, cbar_kws={'label': 'Proportion'})\\n\",\n    \"ax.set_title('Topic Intensity Heatmap (Year × Topic)', fontsize=13)\\n\",\n    \"ax.set_xlabel('Year')\\n\",\n    \"ax.set_ylabel('')\\n\",\n    \"\\n\",\n    \"# Add era boundaries\\n\",\n    \"for boundary in era_boundaries[1:-1]:\\n\",\n    \"    if boundary in topic_trends.index:\\n\",\n    \"        idx = list(topic_trends.index).index(boundary)\\n\",\n    \"        ax.axvline(x=idx, color='blue', linestyle='--', linewidth=1.5, alpha=0.6)\\n\",\n    \"\\n\",\n    \"plt.tight_layout()\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### T17: Generate Knowledge (Summary Report)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# T17: generate-knowledge — structured summary of findings\\n\",\n    \"\\n\",\n    \"print(\\\"=\\\" * 80)\\n\",\n    \"print(\\\"FINDINGS: Research Topic Evolution in IEEE VIS (1990–2024)\\\")\\n\",\n    \"print(\\\"=\\\" * 80)\\n\",\n    \"\\n\",\n    \"print(f\\\"\\\\n📊 Dataset: {len(df)} papers analysed\\\")\\n\",\n    \"print(f\\\"📅 Time span: {df['year'].min()}–{df['year'].max()}\\\")\\n\",\n    \"print(f\\\"🏷️  Topics discovered: {n_topics} (via NMF)\\\")\\n\",\n    \"print(f\\\"📐 Smoothing: {spec_smoothing['method']} (frac={spec_smoothing.get('loess_frac', spec_smoothing.get('window_size'))})\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\" * 80)\\n\",\n    \"print(\\\"📈 RISING TOPICS (gaining prominence):\\\")\\n\",\n    \"print(\\\"-\\\" * 80)\\n\",\n    \"rising = [t for t, c in trend_classifications.items() if c == 'rising']\\n\",\n    \"if rising:\\n\",\n    \"    for t in rising:\\n\",\n    \"        print(f\\\"    • {topic_labels_map.get(t, f'T{t}')}\\\")\\n\",\n    \"else:\\n\",\n    \"    print(\\\"    (none detected)\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\" * 80)\\n\",\n    \"print(\\\"📉 DECLINING TOPICS (reduced prominence):\\\")\\n\",\n    \"print(\\\"-\\\" * 80)\\n\",\n    \"declining = [t for t, c in trend_classifications.items() if c == 'declining']\\n\",\n    \"if declining:\\n\",\n    \"    for t in declining:\\n\",\n    \"        print(f\\\"    • {topic_labels_map.get(t, f'T{t}')}\\\")\\n\",\n    \"else:\\n\",\n    \"    print(\\\"    (none detected)\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\" * 80)\\n\",\n    \"print(\\\"➡️  STABLE TOPICS (consistent presence):\\\")\\n\",\n    \"print(\\\"-\\\" * 80)\\n\",\n    \"stable = [t for t, c in trend_classifications.items() if c == 'stable']\\n\",\n    \"if stable:\\n\",\n    \"    for t in stable:\\n\",\n    \"        print(f\\\"    • {topic_labels_map.get(t, f'T{t}')}\\\")\\n\",\n    \"else:\\n\",\n    \"    print(\\\"    (none detected)\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\" * 80)\\n\",\n    \"print(\\\"⛰️  PEAKED TOPICS (rose then declined):\\\")\\n\",\n    \"print(\\\"-\\\" * 80)\\n\",\n    \"peaked = [t for t, c in trend_classifications.items() if c == 'peaked']\\n\",\n    \"if peaked:\\n\",\n    \"    for t in peaked:\\n\",\n    \"        peak_year = topic_trends[t].idxmax()\\n\",\n    \"        print(f\\\"    • {topic_labels_map.get(t, f'T{t}')} (peak ~{peak_year})\\\")\\n\",\n    \"else:\\n\",\n    \"    print(\\\"    (none detected)\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"-\\\" * 80)\\n\",\n    \"print(\\\"🕐 RESEARCH ERAS:\\\")\\n\",\n    \"print(\\\"-\\\" * 80)\\n\",\n    \"for era in eras:\\n\",\n    \"    era_mask = (topic_trends.index >= era['start']) & (topic_trends.index <= era['end'])\\n\",\n    \"    era_means = topic_trends[era_mask].mean()\\n\",\n    \"    top3 = era_means.nlargest(3).index.tolist()\\n\",\n    \"    print(f\\\"\\\\n  {era['label']}\\\")\\n\",\n    \"    print(f\\\"  Dominant topics:\\\")\\n\",\n    \"    for t in top3:\\n\",\n    \"        print(f\\\"    • {topic_labels_map.get(t, f'T{t}')}\\\")\\n\",\n    \"\\n\",\n    \"print(\\\"\\\\n\\\" + \\\"=\\\" * 80)\\n\",\n    \"print(\\\"Note: Interpret these findings considering venue composition changes\\\")\\n\",\n    \"print(\\\"(InfoVis/SciVis/VAST merger into VIS in 2021) and overall publication\\\")\\n\",\n    \"print(\\\"volume growth over the 35-year span.\\\")\\n\",\n    \"print(\\\"=\\\" * 80)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\\n\",\n    \"## Workflow Complete\\n\",\n    \"\\n\",\n    \"### Workflow Summary (ATWL Template):\\n\",\n    \"```\\n\",\n    \"Phase 1: characterise(TF-IDF) → build-model(NMF) → characterise(assign) → contextualise(UMAP) → visualise → assess ↺[refine]\\n\",\n    \"Phase 2: characterise(aggregate) → characterise(smooth) → contextualise(timeline) → visualise → assess ↺[adjust smoothing]\\n\",\n    \"Phase 3: abstract(patterns) → define-unit(eras) → visualise(annotated) → generate-knowledge(report)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"### Dependencies:\\n\",\n    \"- `pandas`, `numpy` — data handling\\n\",\n    \"- `scikit-learn` — TF-IDF vectorisation and NMF topic modeling\\n\",\n    \"- `umap-learn` — 2D projection for document map\\n\",\n    \"- `matplotlib`, `seaborn` — visualisation\\n\",\n    \"- `statsmodels` — LOESS smoothing\\n\",\n    \"- `scipy` — cosine distance for era detection\\n\",\n    \"\\n\",\n    \"### Next Steps:\\n\",\n    \"- Fine-tune `n_topics` in Loop L1 (try 8, 10, 15, 20)\\n\",\n    \"- Experiment with smoothing parameters in Loop L2\\n\",\n    \"- Add custom stop words for the VIS domain (e.g., 'paper', 'propose', 'novel', 'approach')\\n\",\n    \"- Cross-reference eras with known community events\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.10.0\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 4\n}\n```\n\n---\n\n### Summary of Changes\n\n| Aspect | Before (BERTopic) | After (NMF) |\n|--------|-------------------|-------------|\n| **Topic model** | BERTopic (HDBSCAN clustering) | sklearn `NMF` on TF-IDF matrix |\n| **Text vectorisation** | Sentence-BERT embeddings | `TfidfVectorizer` from sklearn |\n| **Clustering** | HDBSCAN (automatic cluster count) | NMF with user-specified `n_topics` |\n| **2D projection** | UMAP on dense embeddings | UMAP on sparse TF-IDF matrix |\n| **Topic assignment** | Hard cluster membership | Argmax of NMF document-topic weights (soft weights available) |\n| **Temporal profiles** | Hard counts only | Soft NMF weights summed per year (smoother) |\n| **Heavy dependencies removed** | `bertopic`, `sentence-transformers`, `hdbscan` | — |\n| **Remaining dependencies** | — | `scikit-learn`, `umap-learn`, `statsmodels`, `scipy` |\n\nThe workflow logic, iterative loops, visualisations, and knowledge generation remain identical.",
    "role": "assistant"
  }
]