The attached document contains a definition of a language called ATWL, designed for representing visual analytics (VA) workflows. You need to understand the language concepts and the semantics it is meant to express, as well as the syntax used to express them. In the next step, you will be given a "library" of VA workflows extracted from published papers. Your task will be to help a user design a workflow that achieves the user's goals by finding relevant examples in the "library" and suggesting how they can be adapted to the user's case.
I've carefully read and understood the ATWL (Artifact–Transform Workflow Language) specification. Here's a brief summary of my understanding:
Key Concepts
| Component | Role |
|---|---|
| Artifacts (8 types) | Analytical objects: entities, feature, arrangement, visualisation, pattern, model, knowledge, specification |
| Transforms (8 intents) | Operations: define-unit, characterise, contextualise, visualise, abstract, build-model, generate-knowledge, assess |
| Control Structures | Loops, conditionals, assignments for workflow logic |
Core Semantics
- Workflows are directed graphs: transforms consume input artifacts and produce output artifacts.
- Entities have internal structure (how components within each entity are organized) and embedment (how entities relate to each other in shared environments).
- Transforms have an intent (purpose), optional manner (how it's realized), and an actor (human/machine/hybrid).
- Specifications serve as control knowledge, influencing how transforms execute.
- Knowledge can be evaluative (from assess) or substantive (from generate-knowledge).
- Templates provide high-level summaries of workflow stages using intent sequences.
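To make the graph reading of these semantics concrete, here is a minimal Python sketch of artifacts and transforms as typed records. The class names, field names, and the `workflow_edges` helper are my own illustrative assumptions, not part of the ATWL specification; only the three vocabularies are taken from the summary above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Vocabularies as listed in the summary; everything else here is hypothetical.
ARTIFACT_TYPES = {"entities", "feature", "arrangement", "visualisation",
                  "pattern", "model", "knowledge", "specification"}
INTENTS = {"define-unit", "characterise", "contextualise", "visualise",
           "abstract", "build-model", "generate-knowledge", "assess"}
ACTORS = {"human", "machine", "hybrid"}


@dataclass(frozen=True)
class Artifact:
    name: str
    kind: str  # must be one of ARTIFACT_TYPES

    def __post_init__(self) -> None:
        if self.kind not in ARTIFACT_TYPES:
            raise ValueError(f"unknown artifact type: {self.kind}")


@dataclass(frozen=True)
class Transform:
    name: str
    intent: str                      # purpose of the operation
    inputs: Tuple[Artifact, ...]
    outputs: Tuple[Artifact, ...]
    actor: str = "machine"
    manner: Optional[str] = None     # optional: how the intent is realised

    def __post_init__(self) -> None:
        if self.intent not in INTENTS:
            raise ValueError(f"unknown intent: {self.intent}")
        if self.actor not in ACTORS:
            raise ValueError(f"unknown actor: {self.actor}")


def workflow_edges(transforms: List[Transform]) -> List[Tuple[str, str, str]]:
    """List (input artifact, transform, output artifact) edges of the graph."""
    return [(i.name, t.name, o.name)
            for t in transforms for i in t.inputs for o in t.outputs]


data = Artifact("data", "entities")
features = Artifact("features", "feature")
t1 = Transform("T1", "characterise", (data,), (features,), manner="compute attributes")
print(workflow_edges([t1]))
```

Validating intent and actor at construction time is one possible design choice; it keeps malformed steps out of the graph before any traversal happens.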
I'm ready for the next step — please provide the library of VA workflows and/or the user's goals, and I'll help find relevant examples and suggest adapted workflows.
I have attached an additional file with the workflow library. You can perform initial analysis for revealing reusable and adaptable parts.
Initial Analysis of the Workflow Library
I've analyzed all 17 workflows and identified reusable/adaptable structural patterns. Here's my synthesis:
1. Domain Categorization
| Domain | Workflows | Key Data Types |
|---|---|---|
| Temporal data | 1.1, 1.4, 1.5, 1.9 | Time series, event sequences |
| Movement/spatial | 1.3, 1.6, 1.7, 1.12 | Trajectories, flows, places |
| Networks/graphs | 1.2, 1.16 | Dynamic networks, dataflow graphs |
| Text/documents | 1.8 | Document corpora |
| ML model analysis | 1.10, 1.11, 1.13, 1.14, 1.15, 1.17 | Models, predictions, residuals |
| Prescriptive analytics | 1.5 | Historical records + outcomes |
2. Recurring Structural Motifs
| Motif | Description | Found In |
|---|---|---|
| A: Iterative Clustering | define-unit(cluster) → visualise → assess → [adjust \| exit] | 1.1, 1.3, 1.6, 1.7, 1.11, 1.12 |
| B: Model Build–Evaluate–Refine | build-model → characterise(evaluate) → visualise → assess → [refine \| exit] | 1.10, 1.11, 1.12, 1.13, 1.14 |
| C: Progressive Exploration | visualise → abstract → assess → [explore more \| exit] | 1.2, 1.8, 1.9, 1.15, 1.16, 1.17 |
| D: Vectorize–Project–Visualize | characterise(vectors) → contextualise(DR) → visualise(scatterplot) | 1.2, 1.8, 1.12 |
| E: Residual-driven Refinement | build-model → compute residuals → visualise → assess → refine | 1.10, 1.11 |
| F: Action Plan Tuning | specify plan → estimate impact → visualise → assess → [refine \| exit] | 1.5 |
3. Reusable Modular Segments
Segment 1: Similarity-based Spatial Overview
Source workflows: 1.2, 1.8, 1.12, 1.15
Template: characterise(vectorize) → [characterise(normalize)] → contextualise(projection) → visualise(scatterplot)
Adaptable to: Any scenario needing overview of entity similarities (documents, time series, images, model components).
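As a rough illustration of the Segment 1 template, the sketch below runs the vectorize, normalize, and project steps on four invented sentences using scikit-learn. The documents are made up, and `TruncatedSVD` is swapped in as a lightweight stand-in for the projection step; the library workflows may use other methods such as UMAP or t-SNE.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy documents, invented purely for illustration.
docs = [
    "volume rendering of medical scalar fields",
    "direct volume rendering on graphics hardware",
    "graph layout for large social networks",
    "interactive network layout and edge bundling",
]

# characterise(vectorize): one TF-IDF vector per entity.
vectors = TfidfVectorizer().fit_transform(docs)

# characterise(normalize): rescale rows to unit length. TfidfVectorizer already
# L2-normalises by default, so this only makes the optional step explicit.
vectors = normalize(vectors)

# contextualise(projection): 2-D coordinates per entity (stand-in for UMAP/t-SNE).
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(vectors)

# visualise(scatterplot): coords[:, 0] and coords[:, 1] would feed a scatterplot.
print(coords.shape)
```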
Segment 2: Calendar/Context-based Temporal Distribution
Source workflows: 1.1, 1.3
Template: contextualise(calendar) → visualise(color-coded grid)
Adaptable to: Any temporal cluster membership displayed over a regular time structure.
Segment 3: Feature Engineering Loop
Source workflows: 1.12 (primary), 1.9, 1.10
Template: characterise(features) → define-unit(cluster) → visualise → assess → [refine features | exit]
Adaptable to: Any domain where feature adequacy must be iteratively refined via visual feedback.
Segment 4: Two-stage Spatial Clustering
Source workflow: 1.6
Template: define-unit(ST-cluster) → define-unit(filter noise) → define-unit(S-cluster)
Adaptable to: Any spatial event data needing noise filtering and place delineation.
Segment 5: Semi-supervised Model Steering
Source workflows: 1.8 (topics), 1.15 (boundaries)
Template: visualise(model output) → assess → generate-knowledge(user guidance) → define-unit/build-model(constrained refit)
Adaptable to: Any model where human domain knowledge should guide retraining.
Segment 6: Model Diagnostics Pipeline
Source workflows: 1.14, 1.17
Template: characterise(predictions) → characterise(explanations) → define-unit(group) → visualise(multi-level) → loop(explore → abstract → assess)
Adaptable to: Any trained classifier/regressor needing post-hoc interpretability analysis.
Segment 7: Progressive Simplification
Source workflows: 1.4 (EventFlow), 1.16 (TensorFlow)
Template: define-unit(simplify/cluster) → define-unit(extract auxiliary) → visualise → loop(assess → simplify further | exit)
Adaptable to: Any complex structure (graphs, sequences) needing iterative decluttering.
4. Common Intent Sequences (Templates)
Discovery: characterise → visualise → abstract → generate-knowledge
Refinement: assess → generate-knowledge(adjust) → [loop back]
Model building: generate-knowledge(specify) → build-model → assess
Simplification: define-unit(filter/merge) → visualise → assess
Contextualization: contextualise(arrange) → visualise → abstract
5. Cross-cutting Design Principles
- Nested loops: outer loops change strategy/parameters; inner loops refine within a strategy (1.11, 1.14).
- Specification-driven iteration: a specification artifact controls machine behavior; the human updates it after assessment.
- Coordinated multi-view visualization: almost all workflows combine ≥2 linked views for assessment.
- Human-at-decision-points: machine does computation; human performs assessment, interpretation, and parameter/strategy decisions.
- Progressive abstraction: raw data → features → clusters/patterns → knowledge, with quality gates at each level (1.9 exemplifies this most clearly).
I'm ready to help you design a workflow. Please describe your goals, data characteristics, and any constraints, and I'll match relevant library examples and suggest an adapted workflow.
I need a workflow design recommendation for the following task. I want to analyse the dataset that contains information on IEEE Visualization (IEEE VIS) publications from 1990-2024 as well as IEEE TVCG and IEEE CG&A articles published at IEEE VIS. The dataset includes for each paper the publication year, title, and abstract. I want to reveal the major research topics and how they evolved over the years. I am interested in seeing trends rather than minor fluctuations. Using the workflow library, compose an appropriate workflow.
Workflow Design: Research Topic Evolution in IEEE VIS (1990–2024)
Relevant Library Sources
| Library Element | Contribution to Design |
|---|---|
| Workflow 1.8 (Topic modeling) | Core topic discovery loop with human-steered refinement |
| Workflow 1.1 (Temporal clustering) | Temporal profiling of cluster/topic membership over calendar time |
| Workflow 1.9 (Progressive abstraction) | Smoothing & trend-level abstraction over temporal data |
| Motif D (Vectorize→Project→Visualise) | Document similarity overview for topic validation |
| Motif B (Build–Evaluate–Refine) | Iterative model quality loop |
| Motif C (Progressive Exploration) | Explore → abstract → assess for trend interpretation |
Workflow in ATWL Notation
WORKFLOW TopicEvolutionAnalysis
── PHASE 1: DATA PREPARATION & TOPIC DISCOVERY (adapted from 1.8) ──────────
ARTIFACTS
papers : entities
internal_structure: set of attributes {year: temporal, title: textual, abstract: textual}
embedment: none (independent records)
spec_model : specification
content: {method: "BERTopic"|"LDA"|"NMF", num_topics: integer,
min_topic_size: integer, constraints: list}
TRANSFORMS
T1: characterise
input: papers
output: text_vectors : feature
structure: vector per entity (from title + abstract)
manner: vectorise text using sentence embeddings (or TF-IDF)
actor: machine
T2: build-model
input: text_vectors, spec_model
output: topic_model : model
content: topic-term distributions, document-topic assignments
manner: fit topic model per spec_model parameters
actor: machine
T3: characterise
input: topic_model, papers
output: topic_assignments : feature
structure: dominant topic label + probability per paper
manner: assign each paper to most probable topic; extract top terms per topic
actor: machine
T4: contextualise
input: text_vectors, topic_assignments
output: doc_arrangement : arrangement
structure: 2-D coordinates (UMAP/t-SNE projection coloured by topic)
manner: dimensionality reduction of text_vectors
actor: machine
T5: visualise
input: doc_arrangement, topic_model
output: vis_topics : visualisation
content: {scatterplot of documents coloured by topic,
topic-term bar charts / word clouds,
topic size bar chart}
manner: coordinated multiple views
actor: machine
T6: assess
input: vis_topics
output: eval_topics : knowledge (evaluative)
content: topic coherence, interpretability, granularity judgement
actor: human
LOOP L1 (topic refinement — from Motif B / Workflow 1.8 steering loop)
condition: eval_topics ≠ "satisfactory"
T7: generate-knowledge
input: eval_topics
output: spec_model_update : specification
content: revised num_topics, merge/split instructions,
keyword constraints, stopword additions
manner: analyst decides how to improve model
actor: human
ASSIGN spec_model ← spec_model_update
GOTO T2
END LOOP L1
── PHASE 2: TEMPORAL TOPIC PROFILING (adapted from 1.1, 1.9) ───────────────
ARTIFACTS
spec_smoothing : specification
content: {window_size: integer, method: "moving_avg"|"loess"|"gaussian",
time_granularity: "year"|"3-year"|"5-year"}
TRANSFORMS
T8: characterise
input: topic_assignments, papers.year
output: topic_temporal_profiles : feature
structure: matrix (topics × years), cell = paper count or proportion
manner: aggregate topic membership counts per year
actor: machine
T9: characterise
input: topic_temporal_profiles, spec_smoothing
output: topic_trends : feature
structure: smoothed time series per topic
manner: apply temporal smoothing (moving average / LOESS / Gaussian kernel)
actor: machine
note: suppresses minor fluctuations per user requirement
T10: contextualise
input: topic_trends
output: temporal_arrangement : arrangement
structure: topics positioned along shared time axis 1990–2024
manner: align all topic curves on common timeline
actor: machine
T11: visualise
input: temporal_arrangement, topic_trends
output: vis_evolution : visualisation
content: {streamgraph / stacked area chart (overall composition),
small-multiple line charts (individual topic trajectories),
horizon graph (compact comparison)}
manner: coordinated temporal views
actor: machine
T12: assess
input: vis_evolution
output: eval_trends : knowledge (evaluative)
content: judgement on smoothing level, readability,
whether trends (not noise) are visible
actor: human
LOOP L2 (smoothing/granularity adjustment — from Workflow 1.9 progressive abstraction)
condition: eval_trends ≠ "satisfactory"
T13: generate-knowledge
input: eval_trends
output: spec_smoothing_update : specification
manner: adjust smoothing window or switch time granularity
actor: human
ASSIGN spec_smoothing ← spec_smoothing_update
GOTO T9
END LOOP L2
── PHASE 3: TREND INTERPRETATION & KNOWLEDGE GENERATION ─────────────────────
TRANSFORMS
T14: abstract
input: vis_evolution, vis_topics
output: era_patterns : pattern
content: {rising topics, declining topics, stable topics,
inflection points, dominant eras}
manner: identify qualitative phases (e.g., "rise of ML-based VIS post-2015")
actor: human
T15: define-unit (optional — era segmentation, adapted from 1.1 clustering)
input: topic_temporal_profiles, spec_eras
output: eras : entities
internal_structure: contiguous year ranges with coherent topic mix
manner: segment timeline into research eras (manual or change-point detection)
actor: human + machine
T16: visualise
input: eras, topic_trends
output: vis_eras : visualisation
content: annotated timeline with era labels overlaid on streamgraph
manner: overlay era boundaries on vis_evolution
actor: machine
T17: generate-knowledge
input: era_patterns, vis_eras, eval_topics
output: findings : knowledge (substantive)
content: narrative summary of research evolution,
key shifts, emerging/fading themes
manner: synthesise observations into structured report
actor: human
END WORKFLOW
High-Level Template
Phase 1 (Topic Discovery):
characterise → build-model → characterise → contextualise → visualise → assess
↺ [generate-knowledge → rebuild model]
Phase 2 (Temporal Trends):
characterise(aggregate) → characterise(smooth) → contextualise → visualise → assess
↺ [adjust smoothing]
Phase 3 (Interpretation):
abstract → define-unit(eras) → visualise → generate-knowledge
Design Rationale & Adaptation Notes
| Design Decision | Rationale |
|---|---|
| BERTopic as default method | Handles short texts (titles+abstracts) well; produces interpretable clusters; allows hierarchical merging — suited for adjusting granularity in L1 |
| Smoothing before temporal visualisation | User explicitly wants trends not fluctuations; LOESS or 3–5 year windows appropriate for 35-year span |
| Streamgraph for evolution | Shows both individual topic trajectories and compositional change simultaneously (adapts 1.1's calendar view to continuous time) |
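| Two sequential loops | L1 ensures topic quality before temporal analysis begins; L2 then tunes the trend-vs-noise tradeoff. Unlike the nested loops of workflows 1.11/1.14, the two run one after the other, but each reuses the same assess-then-update-specification pattern |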
| Era segmentation (T15) | Optional but valuable for a 35-year span; provides narrative anchors ("the InfoVis era", "the deep-learning era") |
| Scatterplot overview (T4–T5) | From Motif D: provides spatial validation that topics form coherent clusters in embedding space |
Suggested Tool Stack (for implementation)
| Phase | Tools |
|---|---|
| Vectorization (T1) | Sentence-BERT, SciBERT |
| Topic model (T2) | BERTopic, Top2Vec, or Gensim LDA |
| Projection (T4) | UMAP |
| Temporal vis (T11) | D3.js streamgraph, Matplotlib, Altair |
| Smoothing (T9) | Pandas rolling / statsmodels LOESS |
Would you like me to refine any phase (e.g., add a sub-workflow for comparing specific topic pairs, or add a citation-impact weighting step)?
To test this workflow, create a Jupyter Notebook (Python). Output the notebook as json file.
(The assistant produced a complete Jupyter Notebook JSON implementing the workflow with BERTopic, UMAP, LOESS smoothing, stacked area charts, era segmentation, and a structured findings report.)
{
"cells": [
{"cell_type": "markdown", "source": ["# Research Topic Evolution in IEEE VIS (1990–2024)\n", ...]},
{"cell_type": "code", "source": ["# Install required packages\n", "# !pip install bertopic sentence-transformers umap-learn ..."]},
{"cell_type": "code", "source": ["import pandas as pd\n", "from bertopic import BERTopic\n", "from umap import UMAP\n", ...]},
... (28 cells total: markdown + code implementing all 3 phases)
],
"metadata": {"kernelspec": {"display_name": "Python 3", "language": "python"}},
"nbformat": 4, "nbformat_minor": 4
}
Key Implementation Notes
| Workflow Element | Implementation Choice |
|---|---|
| T1 (vectorise) | sentence-transformers with all-MiniLM-L6-v2 (fast); a SciBERT checkpoint such as allenai/scibert_scivocab_uncased may give better domain coverage |
| T2 (build-model) | BERTopic with configurable UMAP + HDBSCAN parameters |
| Loop L1 | Manual flag topics_satisfactory + merge/reduce options |
| T9 (smooth) | LOESS by default (frac=0.15 ≈ 5-year effective window on 35 years) |
| Loop L2 | Manual flag smoothing_satisfactory + parameter guidance |
| T14 (abstract) | Automatic trend classification (rising/declining/stable/peaked) |
| T15 (eras) | Cosine-dissimilarity based change-point detection |
| T16/T17 | Annotated streamgraph + heatmap + structured text report |
To run the notebook, update DATA_PATH to point to your IEEE VIS dataset CSV (expected columns: year, title, abstract). If the file isn't found, synthetic demo data is generated for testing the pipeline.
Describe the workflow in natural language.
Workflow Description: Research Topic Evolution in IEEE VIS (1990–2024)
Goal
The workflow takes a dataset of IEEE VIS publications (each described by year, title, and abstract) and reveals the major research topics within the community and how their prominence has changed over 35 years, emphasising long-term trends rather than year-to-year noise.
Phase 1: Topic Discovery
The workflow begins by preparing the text data. Each paper's title and abstract are concatenated into a single document and transformed into a numerical vector using a pre-trained sentence embedding model. These vectors capture semantic similarity — papers about related subjects end up with similar vectors.
Next, a topic model (BERTopic) is fitted to the document vectors. The model clusters semantically similar papers together and identifies representative keywords for each cluster. Each cluster corresponds to a research topic. Every paper is then assigned to its most probable topic, and the top terms characterising each topic are extracted.
To allow the analyst to visually verify topic quality, the high-dimensional vectors are projected into two dimensions using UMAP. The resulting document map is displayed as a scatterplot where each dot is a paper, coloured by its assigned topic. A companion bar chart shows the size of each topic. Together, these views let the analyst judge whether the topics are coherent, well-separated, and at an appropriate level of granularity.
The analyst then assesses the result. If topics are too fine-grained, too coarse, or contain incoherent mixtures, the analyst adjusts the model parameters — for example, changing the minimum topic size, merging similar topics, or constraining the total number of topics — and the model is rebuilt. This refinement loop repeats until the analyst is satisfied with topic quality.
Phase 2: Temporal Profiling
Once stable topics are established, the workflow shifts to temporal analysis. For each year in the dataset, the number (or proportion) of papers belonging to each topic is counted, producing a topic-by-year matrix.
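A minimal sketch of this aggregation step, assuming each paper already carries a `year` and a dominant `topic` label. The six rows and the column names are invented for illustration and do not come from the actual dataset.

```python
import pandas as pd

# Toy topic assignments, invented for illustration.
papers = pd.DataFrame({
    "year":  [1990, 1990, 1991, 1991, 1991, 1992],
    "topic": ["volume", "flow", "volume", "volume", "graphs", "graphs"],
})

# Topic-by-year matrix of paper counts (rows: topics, columns: years).
counts = pd.crosstab(papers["topic"], papers["year"])

# Proportions: divide each year's column by the number of papers in that year,
# so that growth of the venue itself does not look like growth of every topic.
proportions = counts / counts.sum(axis=0)

print(counts)
print(proportions.round(2))
```

Whether counts or proportions are the better basis depends on the question: counts show absolute activity, while proportions show each topic's share of the field in a given year.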
Because raw yearly counts are noisy (especially in earlier years with fewer publications), a smoothing function is applied. By default, LOESS regression gently smooths each topic's time series so that multi-year trends become visible while short-lived fluctuations are suppressed. The smoothing window is chosen to reveal patterns spanning roughly five or more years.
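The effect of smoothing can be sketched with a centred moving average, one of the methods listed in spec_smoothing. It is used here as a simpler stand-in for LOESS, which is the notebook's default. The function name, the toy series, and the window of 5 years are assumptions for illustration.

```python
import pandas as pd


def smooth_profiles(profiles: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Centred moving average along the year axis (topics in rows, years in columns).

    min_periods=1 lets the window shrink at both ends of the timeline
    instead of producing missing values there.
    """
    return profiles.T.rolling(window=window, center=True, min_periods=1).mean().T


# Toy, deliberately jagged series for one hypothetical topic.
years = list(range(2000, 2010))
raw = pd.DataFrame([[2, 8, 3, 9, 4, 10, 5, 11, 6, 12]],
                   index=["topic_a"], columns=years, dtype=float)

smoothed = smooth_profiles(raw, window=5)
print(smoothed.round(2))
```

A larger window suppresses more of the year-to-year variation at the cost of blurring genuine transitions, which is the tradeoff the analyst tunes in the adjustment loop.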
The smoothed trends are then visualised in multiple coordinated views: a stacked area chart (streamgraph) showing how the overall topic composition evolves, a set of individual line charts (small multiples) showing each topic's trajectory in isolation, and a heatmap showing topic intensity across years.
The analyst assesses whether the smoothing level is appropriate. If the curves are still too jagged, the smoothing fraction is increased; if important transitions are being blurred out, it is decreased. This adjustment loop continues until the temporal visualisation clearly communicates long-term trends.
Phase 3: Interpretation and Knowledge Generation
With validated topic trends in hand, the analyst interprets the patterns. Each topic is automatically classified as rising (gaining prominence over time), declining (fading), stable (consistently present), or peaked (rose and then fell). This gives a quick structural summary of the field's evolution.
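One possible heuristic for this four-way classification is sketched below. The rule (compare the mean of the first and last thirds of the series, and check for an interior maximum) and the `rel_tol` threshold are my own assumptions, not necessarily what the notebook implements.

```python
import numpy as np


def classify_trend(series, rel_tol=0.25):
    """Label a smoothed topic series as rising, declining, stable, or peaked.

    rel_tol is a hypothetical threshold: the minimum change, relative to the
    series maximum, that counts as a real trend.
    """
    y = np.asarray(series, dtype=float)
    scale = y.max() if y.max() > 0 else 1.0
    third = max(1, len(y) // 3)
    start, end = y[:third].mean(), y[-third:].mean()

    # Peaked: the maximum lies in the middle third and clearly exceeds both ends.
    peak_pos = int(y.argmax())
    interior = third <= peak_pos < len(y) - third
    if interior and (y.max() - max(start, end)) / scale > rel_tol:
        return "peaked"

    change = (end - start) / scale
    if change > rel_tol:
        return "rising"
    if change < -rel_tol:
        return "declining"
    return "stable"


for name, values in {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [6, 5, 4, 3, 2, 1],
    "c": [3, 3, 3, 3, 3, 3],
    "d": [1, 3, 6, 6, 3, 1],
}.items():
    print(name, classify_trend(values))
```

Since T14 is assigned to a human actor in the workflow, such labels are best treated as a starting point for interpretation rather than as final findings.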
The workflow then segments the 35-year timeline into distinct research eras — contiguous periods during which the topic composition remains relatively stable, separated by years of notable compositional shift. For each era, the dominant topics are identified.
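The era segmentation can be sketched as change-point detection on the cosine dissimilarity between the topic mixes of consecutive years. The function names, the threshold of 0.3, and the toy two-topic profiles below are illustrative assumptions; the sketch also assumes consecutive integer years.

```python
import numpy as np


def era_boundaries(profiles, years, threshold=0.3):
    """Return the years that start a new era.

    profiles: array of shape (n_years, n_topics), one topic mix per year.
    A boundary is placed wherever the cosine dissimilarity between a year
    and its predecessor exceeds the (hypothetical) threshold.
    """
    p = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(p, axis=1)
    norms[norms == 0] = 1.0                      # avoid division by zero
    unit = p / norms[:, None]
    dissim = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)
    return [years[i + 1] for i, d in enumerate(dissim) if d > threshold]


def eras_from_boundaries(years, boundaries):
    """Turn boundary years into (first_year, last_year) ranges."""
    starts = [years[0]] + list(boundaries)
    ends = [b - 1 for b in boundaries] + [years[-1]]
    return list(zip(starts, ends))


# Toy data: topic 0 dominates until 2002, topic 1 from 2003 onwards.
years = [2000, 2001, 2002, 2003, 2004, 2005]
profiles = [[0.80, 0.20], [0.75, 0.25], [0.80, 0.20],
            [0.20, 0.80], [0.25, 0.75], [0.20, 0.80]]

boundaries = era_boundaries(profiles, years)
print(eras_from_boundaries(years, boundaries))
```

Running this on smoothed rather than raw profiles should reduce spurious boundaries caused by single noisy years, though the threshold still needs the analyst's judgement.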
An annotated visualisation overlays era boundaries on the streamgraph, making it easy to see when the community transitioned from one set of dominant concerns to another.
Finally, the analyst synthesises all observations into a structured knowledge report: which topics have emerged, which have faded, when major shifts occurred, and what characterises each era of IEEE VIS research.
Iterative Structure
The workflow contains two human-in-the-loop feedback cycles:
- Topic refinement loop — ensures the discovered topics are meaningful before any temporal analysis begins.
- Smoothing adjustment loop — ensures the temporal visualisation reveals genuine trends at the right level of abstraction.
Both loops follow the same principle: the machine computes, the human assesses, and if the result is unsatisfactory, the human updates a specification that controls the machine's next iteration.
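The shared loop principle can be written down as a small generic driver. This is only a sketch: the function and parameter names are hypothetical, and the `assess` callback stands in for a human judgement that in the real workflow would come from looking at the visualisations.

```python
def refine_until_satisfactory(spec, compute, assess, update, max_rounds=10):
    """Generic specification-driven loop.

    compute(spec)         -> result   (machine step)
    assess(result)        -> verdict  (human step; "satisfactory" ends the loop)
    update(spec, verdict) -> new spec (human step)
    """
    for _ in range(max_rounds):
        result = compute(spec)
        verdict = assess(result)
        if verdict == "satisfactory":
            return spec, result
        spec = update(spec, verdict)
    raise RuntimeError("no satisfactory result within max_rounds")


# Toy stand-ins: the "result" is just the window size, and the simulated
# analyst accepts any window of at least 5 years.
final_spec, final_result = refine_until_satisfactory(
    spec={"window": 1},
    compute=lambda s: s["window"],
    assess=lambda r: "satisfactory" if r >= 5 else "too jagged",
    update=lambda s, verdict: {**s, "window": s["window"] + 2},
)
print(final_spec, final_result)
```

Both L1 (topic refinement) and L2 (smoothing adjustment) fit this shape; they differ only in which specification is updated and which transform is re-run.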
Roles
| Actor | Responsibilities |
|---|---|
| Machine | Text vectorisation, topic modeling, dimensionality reduction, smoothing, aggregation, change-point detection, rendering visualisations |
| Human | Assessing topic coherence, judging smoothing adequacy, interpreting trends, identifying eras, synthesising narrative findings |
Output
The workflow produces three main outputs:
- A validated set of research topics with representative keywords and a document map.
- Smoothed temporal trend charts showing each topic's trajectory from 1990 to 2024.
- A structured narrative summarising rising, declining, and stable themes, major transition points, and distinct research eras in the IEEE VIS community.
Regarding the Jupyter Notebook: you have included the use of BERTopic for topic modelling. It is hard to install; replace it by NMF. Output the modified notebook as json file.
(The assistant produced a modified Jupyter Notebook JSON replacing BERTopic with sklearn NMF + TF-IDF. Full notebook available in the JSON session log.)
{
"cells": [
{"cell_type": "markdown", "source": ["# Research Topic Evolution in IEEE VIS (1990–2024)\n", ...]},
{"cell_type": "code", "source": ["# !pip install pandas numpy matplotlib seaborn scikit-learn umap-learn statsmodels scipy"]},
{"cell_type": "code", "source": ["from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import NMF\n", ...]},
... (27 cells total: NMF-based implementation of all 3 phases)
],
"metadata": {"kernelspec": {"display_name": "Python 3", "language": "python"}},
"nbformat": 4, "nbformat_minor": 4
}
Summary of Changes
| Aspect | Before (BERTopic) | After (NMF) |
|---|---|---|
| Topic model | BERTopic (HDBSCAN clustering) | sklearn NMF on TF-IDF matrix |
| Text vectorisation | Sentence-BERT embeddings | TfidfVectorizer from sklearn |
| Clustering | HDBSCAN (automatic cluster count) | NMF with user-specified n_topics |
| 2D projection | UMAP on dense embeddings | UMAP on sparse TF-IDF matrix |
| Topic assignment | Hard cluster membership | Argmax of NMF document-topic weights (soft weights available) |
| Temporal profiles | Hard counts only | Soft NMF weights summed per year (smoother) |
| Heavy dependencies removed | bertopic, sentence-transformers, hdbscan | — |
| Remaining dependencies | — | scikit-learn, umap-learn, statsmodels, scipy |
The workflow logic, iterative loops, visualisations, and knowledge generation remain identical.