
Implement a new diversity aspect in Scholia based on DivinWD queries #2778

@CristianCantoro

Description


I would like to propose a new Scholia aspect, diversity, to integrate and adapt the queries from the DivinWD project into Scholia. More information about the project is available in a paper to be presented next week at the Wiki Workshop 2026.

The queries are already written in SPARQL and are available in the dataset-resources repository, where they are currently embedded in Python scripts.

Unlike in the paper, we will not use data from the Genderize API. Instead, we will query live data from Wikidata, with the option of adding data from ROR and S2 FOS (see below).

The goal is to integrate the existing queries and their logic into Scholia in a way that matches Scholia's architecture:

  • create a new aspect named diversity
  • port the existing Python query logic to aspect-specific .sparql files
  • rewrite all visualizations for Scholia
  • add endpoint support for querying the Research Organization Registry (ror)
  • wire the resulting panels into a new diversity.html page

This would provide a dedicated analytical aspect for diversity-related views over scholarly metadata, initially for global/corpus-wide queries, with the option of extending it to filtered views in the future.

What kind of aspect would we like to add to Scholia

With DivinWD we have already designed a set of diversity-oriented queries, including:

  • year
  • language
  • gender
  • field of study
  • affiliation
  • affiliation continents heatmap
  • nationality

At the moment these are implemented as Python scripts that run SPARQL queries and generate static visualizations externally. Integrating them into Scholia would mean that we could get up-to-date results from Wikidata.

Scope of the proposed integration

1. New aspect: diversity

Introduce a new aspect named diversity.

Planned UI entry point:

  • /diversity/ for the main aspect page

Potential future entry points:

  • /diversity/topic/<QID>
  • /diversity/venue/<QID> (for conferences/journals)
  • others

The first implementation target is the global/corpus-wide version.

2. Port Python queries to .sparql

The query logic in DivinWD should be moved into Scholia-native .sparql files.

Plan:

  • port the simple queries directly into .sparql files (e.g. year, language, gender)
  • for the more complex queries, port the SPARQL itself into Scholia, but rebuild the visualization client-side in JavaScript (affiliation heatmap, etc.)
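One possible first step for the port is to dump each query string that is currently embedded in Python into its own file. The sketch below is only an illustration under stated assumptions: the `QUERIES` dict, the `export_queries` helper, and the query bodies are hypothetical stand-ins (not the actual DivinWD queries), and the `diversity_<name>.sparql` naming just follows the file pattern proposed in this issue.

```python
from pathlib import Path

# Hypothetical stand-ins for query strings embedded in the DivinWD scripts.
# The bodies are illustrative only, not the real DivinWD queries.
QUERIES = {
    "year": (
        "SELECT ?year (COUNT(?work) AS ?count) WHERE { "
        "?work wdt:P577 ?date . BIND(YEAR(?date) AS ?year) } GROUP BY ?year"
    ),
    "language": (
        "SELECT ?language (COUNT(?work) AS ?count) WHERE { "
        "?work wdt:P407 ?language . } GROUP BY ?language"
    ),
}


def export_queries(queries, target_dir, aspect="diversity"):
    """Write each query to <target_dir>/<aspect>_<name>.sparql.

    Returns the written paths, sorted by query name.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, sparql in sorted(queries.items()):
        path = target / f"{aspect}_{name}.sparql"
        # Normalise to exactly one trailing newline.
        path.write_text(sparql.strip() + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Calling `export_queries(QUERIES, "scholia/app/templates")` would then produce files matching the `diversity_*.sparql` pattern used later in this proposal.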

3. Add endpoint support for fos and ror

Some of the DivinWD queries depend on data from the Research Organization Registry (ror) and on data obtained from Allen AI's s2_fos model (fos). Allen AI is the institute that develops Semantic Scholar.

One solution would be to add an external endpoint for querying ROR and FOS data, so that the new diversity aspect can execute panels that require these data.

4. diversity.html

All panels for this aspect should be wired into a new diversity.html template.

The page would initially host the global views and later expand with filtered variants.

Proposed implementation plan

Phase 1 — Create the aspect scaffold

Add a new aspect with the minimal files and routing needed to render a dedicated page:

  • scholia/app/templates/diversity.html
  • scholia/app/templates/diversity_*.sparql
  • route/view support for the new aspect
  • optional index/landing page if needed

Initial focus: load and render a page with a few working panels.

Phase 2 — Port the simple queries as Scholia-native SPARQL panels

The following queries should be ported first as .sparql files and rendered via query-GUI charts:

  • year
  • language
  • gender
  • field of study
  • affiliation (where possible in a simple form)

These should become standard Scholia panels using the existing query/template machinery.

Expected output type:

  • query-GUI charts for simple visualizations
  • standard table or iframe panels where appropriate
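To make the query-GUI idea concrete, here is a minimal sketch of how a simple query could be turned into an embeddable chart URL, assuming the panel is embedded through the Wikidata Query Service embed page. The query text and the `embed_url` helper are hypothetical; in Scholia this step happens in the browser, and Python is used here only to show the URL construction. A corpus-wide query like this one may well need tuning to avoid timeouts.

```python
from urllib.parse import quote

# Embed page of the Wikidata Query Service; the query goes in the URL fragment.
WDQS_EMBED = "https://query.wikidata.org/embed.html#"

# Illustrative query only (not the DivinWD one). The first line is the
# query-service directive that selects the default chart type.
YEAR_QUERY = """#defaultView:BarChart
SELECT ?year (COUNT(?work) AS ?count) WHERE {
  ?work wdt:P31 wd:Q13442814 ;
        wdt:P577 ?date .
  BIND(YEAR(?date) AS ?year)
}
GROUP BY ?year
ORDER BY ?year
"""


def embed_url(sparql, base=WDQS_EMBED):
    """Return an iframe-ready URL with the query percent-encoded in the fragment."""
    return base + quote(sparql, safe="")
```

The resulting string can be used as the `src` of an iframe panel.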

Phase 3 — Rebuild complex visualizations client-side in JS

The following queries are more complex and should be reimplemented as JavaScript visualizations fed by Scholia-managed SPARQL results:

  • affiliation continents heatmap
  • nationality
  • any year trend view that currently depends on Python-side modeling or custom plotting
  • any panel that currently depends on multi-query combination and Python post-processing

Plan:

  • move the SPARQL into Scholia .sparql files
  • expose the results in a form consumable by the frontend
  • implement the charts client-side in JS

This keeps the query logic in Scholia while avoiding server-side Python plotting.
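The main intermediate step for a heatmap is pivoting flat SPARQL result rows into a matrix. The sketch below pins down that data shape, assuming the endpoint returns the standard SPARQL JSON results format. It is written in Python only for illustration; under this proposal the same pivot would run client-side in JS. The function name and the variable names passed to it are hypothetical.

```python
def bindings_to_matrix(results, row_var, col_var, value_var):
    """Pivot SPARQL JSON results into (row_labels, col_labels, matrix).

    `results` follows the SPARQL 1.1 JSON results layout:
    {"results": {"bindings": [{var: {"value": ...}, ...}, ...]}}.
    Missing (row, column) combinations are filled with 0.
    """
    cells = {}
    for binding in results["results"]["bindings"]:
        key = (binding[row_var]["value"], binding[col_var]["value"])
        # Sum the values in case the same pair appears more than once.
        cells[key] = cells.get(key, 0) + int(binding[value_var]["value"])

    rows = sorted({row for row, _ in cells})
    cols = sorted({col for _, col in cells})
    matrix = [[cells.get((row, col), 0) for col in cols] for row in rows]
    return rows, cols, matrix
```

A `(rows, cols, matrix)` triple like this is what most heatmap components expect, so fixing the shape early would leave the choice of charting library open.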

Phase 4 — Add ROR endpoint handling

Implement support for routing relevant diversity panels to an endpoint that can resolve ror-dependent queries.

This likely needs:

  • endpoint configuration support at panel/query level
  • clear separation between standard Wikidata/Scholia queries and ROR-backed queries
  • documentation for how the diversity aspect determines which endpoint a panel uses
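As a starting point for discussion, panel-level endpoint configuration could be as simple as a lookup table with a default. Everything in this sketch is an assumption: the ROR endpoint URL is a placeholder (no such service exists yet), and mapping the heatmap panel to it is purely illustrative, since this issue does not yet fix which panels need ROR data.

```python
# Known endpoints. The "ror" URL is a hypothetical placeholder.
ENDPOINTS = {
    "wikidata": "https://query.wikidata.org/sparql",
    "ror": "https://ror.example.org/sparql",
}

# Panels that deviate from the default endpoint (illustrative mapping only).
PANEL_ENDPOINTS = {
    "affiliation_continents_heatmap": "ror",
}


def endpoint_for(panel, default="wikidata"):
    """Return the SPARQL endpoint URL a panel should be sent to."""
    key = PANEL_ENDPOINTS.get(panel, default)
    if key not in ENDPOINTS:
        raise ValueError(f"Unknown endpoint key {key!r} for panel {panel!r}")
    return ENDPOINTS[key]
```

Keeping the mapping in one place would also double as the documentation of which panels are ROR-backed.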

Phase 5 — Add filtered variants

After the global/corpus-wide panels are working, extend the aspect to support filtering by:

  • topic / field of study
  • venues, i.e. conferences, journals (one or more)
  • other filters (we are open to proposals)

This should be done only after the global panels are stable, since it will require deciding on URL design, parameterization, and the expected semantics of the filters.
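Whatever URL design is chosen, the filter value ends up inside a SPARQL query, so it should be validated first. The sketch below shows one way to do that for a topic filter. The query template and helper name are hypothetical, and using "main subject" (P921) as the topic link is an assumption, not a decision made in this issue.

```python
import re
from string import Template

# A Wikidata item identifier: "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")

# Illustrative template only: works whose main subject (P921) is the topic.
TOPIC_YEAR_QUERY = Template("""SELECT ?year (COUNT(?work) AS ?count) WHERE {
  ?work wdt:P921 wd:$topic ;
        wdt:P577 ?date .
  BIND(YEAR(?date) AS ?year)
}
GROUP BY ?year
""")


def render_topic_query(qid):
    """Fill the topic QID into the template, rejecting anything that is not a QID."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"Not a valid Wikidata QID: {qid!r}")
    return TOPIC_YEAR_QUERY.substitute(topic=qid)
```

A route such as /diversity/topic/<QID> could pass its path parameter through a check like this before the query is built.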

Proposed initial panel set

Initial global panels for diversity.html:

Open questions

  1. Additional endpoint

    • Should we set up a backend for ror- and fos-dependent queries? If so, how?
  2. Frontend charting

    • What JS visualization approach is preferred for new custom charts in Scholia?
    • Are there existing chart helpers/components we should build on?
  3. Filtered URLs

    • How to define and design queries for topic/venue filters?
  4. Data shape for JS visualizations

    • How to handle complex panels/visualizations and the necessary intermediate data transformations?

Roadmap

We could split this project into smaller PRs:

  1. Add diversity aspect scaffold
  2. Add diversity.html
  3. Port simple .sparql queries and wire them as query-GUI charts
  4. Add endpoint support for ror
  5. Implement first JS-based complex visualization
  6. Add remaining complex visualizations
  7. Add filtered variants by topic / conference / set of conferences

I am happy to discuss any practical detail related to this proposal.
