litstudy in Action: A Real-World Example of Scientific Literature Analysis

This post explores how tools like Litstudy [1] and Jupyter Notebook can change the way you survey scientific literature. Drawing on my experience using Litstudy to write the introductory chapter of the literature survey in my M.Tech dissertation, it gives an overview of the library's features. It is written for readers already familiar with Jupyter Notebook [3] and shows how Litstudy helps you navigate a new research domain, analyze scientific papers, and enrich the analysis with interactive visualizations.

Figure 1.0 Software architecture of litstudy, showcasing its integration with Jupyter notebooks and Python scripts. The system uses various libraries such as bibtexparser, pandas, and gensim for metadata retrieval, analysis, and visualization.

Search Functionality of Litstudy

The litstudy [2] search feature aggregates metadata from scientific publications, such as the title, authors, publication date, and DOI. It does not provide access to the full text of the documents. A DOI (Digital Object Identifier) is a persistent, unique identifier for digital content, which makes it useful for reliable citation and access. With this metadata, users can obtain a detailed overview of what a literature database contains.

The literature databases supported by this library, and their associated functionality, include:

Scopus:
- Offers functions to retrieve metadata for documents using identifiers (DOI, PubMed ID, Scopus ID) or approximate title-based searches.
- Enables submitting queries to the Scopus API and importing CSV files exported from Scopus.
Semantic Scholar:
- Allows metadata retrieval using various identifiers (e.g., DOI, ArXiv ID, Corpus ID).
- Provides functions for refining metadata, submitting queries to the Semantic Scholar API, and obtaining results.
CrossRef:
- Offers metadata retrieval for documents using DOI with timeout settings to manage server communication.
- Supports querying the CrossRef API with options for sorting and filtering results.
CSV:
- General-purpose CSV loading with options for field name customization and filtering.
- litstudy [2] attempts to infer the purpose of each field, or uses explicitly defined field names.
Other Platforms:
- Functions for importing metadata from specific databases, including IEEE Xplore, Springer Link, DBLP, arXiv, BibTeX, and RIS files.
- Each platform supports specific queries or file formats for metadata retrieval.

Document handling classes:

Document: Represents a single publication, storing metadata such as title, authors, abstract, DOI, and publication source.
DocumentSet: Manages a collection of documents, supporting filtering, deduplication, merging, and set operations (e.g., union, intersection).
Author: Represents an author, containing details like name and affiliations.
PublicationSource: Defines the source of a document (e.g., Scopus, PubMed), tracking source-specific metadata.
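To make the roles of these classes concrete, here is a minimal sketch in plain Python. It is not litstudy's implementation: the class bodies, the `key` method, and the DOIs are all made up for illustration, and the only assumption is that documents are deduplicated by DOI (falling back to the title), which is one reasonable way to get the union and intersection behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Author:
    name: str
    affiliation: Optional[str] = None


@dataclass(frozen=True)
class Document:
    title: str
    authors: tuple = ()
    doi: Optional[str] = None

    def key(self) -> str:
        # Hypothetical identity rule: prefer the DOI, fall back to the title.
        return (self.doi or self.title).strip().lower()


class DocumentSet:
    """Illustrative stand-in for a document collection (not litstudy's class)."""

    def __init__(self, docs=()):
        # Deduplicate on insertion, keeping the first document seen per key.
        self._docs = {}
        for doc in docs:
            self._docs.setdefault(doc.key(), doc)

    def __len__(self):
        return len(self._docs)

    def __iter__(self):
        return iter(self._docs.values())

    def __or__(self, other):
        # Union: every document from both sets, duplicates collapsed.
        return DocumentSet(list(self) + list(other))

    def __and__(self, other):
        # Intersection: documents whose key appears in both sets.
        return DocumentSet(d for d in self if d.key() in other._docs)

    def filter(self, predicate):
        return DocumentSet(d for d in self if predicate(d))


# Made-up records; the DOIs are placeholders, not real publications.
a = DocumentSet([
    Document("Topic models for surveys", (Author("A. Example"),), "10.1000/a"),
    Document("Citation graphs", doi="10.1000/b"),
])
b = DocumentSet([
    Document("Citation Graphs (preprint)", doi="10.1000/B"),
    Document("Word embeddings", doi="10.1000/c"),
])

print(len(a | b))  # the shared DOI is counted once
print(len(a & b))
```

The two records with the same DOI (differing only in letter case) collapse into one, so the union holds three documents and the intersection holds one.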

Figure 2.0 UML diagram illustrating the relationships among key classes in Litstudy, including Document, DocumentSet, Author, and Publication Source. It highlights the attributes, functionalities, and interactions within the system for handling scientific publication metadata.

Fetching Documents using Search Query

Figure 3.0 Python code snippets using the "litstudy.search_semanticscholar" function to retrieve academic documents for different search queries [z1-z3], along with the corresponding document counts. The resulting document sets support operations such as union (|) and intersection (&).

Figure 4.0 Documents fetched from Semantic Scholar
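The pattern in the figures above can be sketched as follows. The helper `combine_queries` and the stand-in index are hypothetical names introduced here; the only litstudy call is `search_semanticscholar`, and the `limit` keyword is my assumption about its signature, so check it against your installed version. The demonstration runs offline on plain Python sets, which support the same `|` and `&` operators.

```python
def combine_queries(search, query_a, query_b):
    """Run two searches and return (union, intersection) of their results."""
    docs_a = search(query_a)
    docs_b = search(query_b)
    either = docs_a | docs_b  # documents matching at least one query
    both = docs_a & docs_b    # documents matching both queries
    return either, both


def semantic_scholar_search(query, limit=50):
    # Imported lazily so the helper above can be tried without litstudy installed.
    import litstudy

    # Assumption: `limit` caps the number of results returned by the API.
    return litstudy.search_semanticscholar(query, limit=limit)


# Offline demonstration with a stand-in search function over plain sets.
fake_index = {
    "topic modeling": {"doc1", "doc2", "doc3"},
    "citation analysis": {"doc3", "doc4"},
}
either, both = combine_queries(
    fake_index.__getitem__, "topic modeling", "citation analysis"
)
print(len(either), len(both))
```

With litstudy installed and network access, passing `semantic_scholar_search` instead of the stand-in should give document sets that combine in the same way.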

Create a Document Corpus from a Collection of Documents

The build_corpus function prepares text data from documents for tasks like topic modeling, natural language processing, and clustering. It uses advanced preprocessing techniques (e.g., token filtering and n-gram merging) and outputs word-frequency vectors alongside a word-index dictionary, integrating seamlessly with other Litstudy features.
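As a rough illustration of what "word-frequency vectors alongside a word-index dictionary" means, the sketch below builds both from raw text using only the standard library. It is not how build_corpus is implemented: the stop-word list, the `min_docs` filter, and the function names are assumptions chosen to keep the example small.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stop words and very short tokens.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 2]


def build_toy_corpus(texts, min_docs=1):
    tokenized = [tokenize(t) for t in texts]
    # Document frequency: in how many texts does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = sorted(tok for tok, n in doc_freq.items() if n >= min_docs)
    dictionary = {tok: idx for idx, tok in enumerate(vocab)}
    # One sparse frequency vector per document: sorted (token_index, count) pairs.
    vectors = [
        sorted(Counter(dictionary[t] for t in toks if t in dictionary).items())
        for toks in tokenized
    ]
    return dictionary, vectors


texts = [
    "Topic modeling of the GPU computing literature",
    "GPU computing for deep learning",
]
dictionary, vectors = build_toy_corpus(texts, min_docs=1)
print(dictionary)
print(vectors)
```

Raising `min_docs` to 2 keeps only the tokens shared by both texts, which is the kind of token filtering the paragraph above refers to.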

Topic Modeling: Discovers popular topics and trends in scientific publications using NLP methods (e.g., LDA, embedding-based techniques) and generates topic lists with keywords, visualizations, and thematic insights.
NMF Model: Extracts latent topics from document corpora with adjustable parameters (e.g., number of topics, iterations) to support literature reviews, research exploration, and thematic grouping.
N-Gram Merging: Combines frequently co-occurring tokens into meaningful phrases (e.g., bigrams, trigrams) to enhance text representation and improve accuracy in clustering and topic modeling. This retains important context that could be lost if words were analyzed individually.
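N-gram merging is easiest to see on a small example. The sketch below joins adjacent token pairs that occur together at least `min_count` times. A plain count threshold stands in for whatever scoring litstudy applies internally, so treat the function name and the rule as illustrative assumptions.

```python
from collections import Counter


def merge_bigrams(token_lists, min_count=2):
    """Join adjacent token pairs that co-occur at least `min_count` times."""
    pair_counts = Counter(
        pair for toks in token_lists for pair in zip(toks, toks[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_lists = []
    for toks in token_lists:
        merged, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) in frequent:
                merged.append(toks[i] + "_" + toks[i + 1])
                i += 2  # skip the token consumed by the merge
            else:
                merged.append(toks[i])
                i += 1
        merged_lists.append(merged)
    return merged_lists


docs = [
    ["deep", "learning", "for", "machine", "translation"],
    ["machine", "learning", "and", "deep", "learning"],
    ["deep", "learning", "systems"],
]
print(merge_bigrams(docs, min_count=2))
```

Only the pair ("deep", "learning") appears often enough to be merged, so "machine learning" in the second list stays as two tokens.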

Topic Clouds

This feature visualizes topics identified through modeling by creating customizable word clouds, where the size of the words indicates their significance, helping users intuitively understand key terms and thematic structures. It accepts trained topic models (such as NMF or LDA) and allows adjustments to parameters like font size and color schemes. These visualizations can be saved as images, used for topic labeling, and easily integrate with Litstudy pipelines and Jupyter Notebooks for interactive exploration.

Figure 5.0 Word clouds illustrating key terms for various topics, highlighting the central concepts with size denoting significance. These visualizations offer a quick grasp of thematic structures and aid in topic labeling and further analysis.
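The idea that word size denotes significance can be sketched as a linear mapping from token weight to font size. The function name, the size range, and the weights below are invented for illustration and say nothing about how Litstudy or its word-cloud backend actually scale words.

```python
def font_sizes(token_weights, min_size=10, max_size=48):
    """Linearly map token weights to font sizes in [min_size, max_size]."""
    lo, hi = min(token_weights.values()), max(token_weights.values())
    span = hi - lo
    sizes = {}
    for token, weight in token_weights.items():
        # If all weights are equal, use the largest size for every token.
        scale = (weight - lo) / span if span else 1.0
        sizes[token] = round(min_size + scale * (max_size - min_size))
    return sizes


# Made-up weights for one topic.
topic = {"gpu": 0.9, "kernel": 0.5, "memory": 0.1}
print(font_sizes(topic))
```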

DocumentSet enables efficient management and analysis of document collections. To pinpoint the most pertinent documents for a specific topic, the trained topic model provides the member function "best_documents_for_topic".

Figure 6.0 The `litstudy.nlp.TopicModel` is a trained model designed to analyze topics and relationships within documents and tokens. It incorporates matrices to represent topic-document and topic-token mappings, along with the `best_documents_for_topic` function that identifies key documents associated with a specific topic.
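Conceptually, finding the best documents for a topic is a ranking over one column of the document-topic weights. The sketch below assumes a matrix laid out as one row per document and one column per topic. That layout, the function name `best_documents`, and the numbers are assumptions for illustration, not the internals of `litstudy.nlp.TopicModel`.

```python
def best_documents(doc_topic, topic_id, limit=3):
    """Return document indices sorted by descending weight for `topic_id`."""
    ranked = sorted(
        range(len(doc_topic)),
        key=lambda doc_id: doc_topic[doc_id][topic_id],
        reverse=True,
    )
    return ranked[:limit]


# Made-up weights: rows are documents, columns are topics.
doc_topic = [
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
]
print(best_documents(doc_topic, topic_id=0, limit=2))
```

The returned indices can then be used to look up titles in the document collection the model was trained on.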

Plot Embedding

The Litstudy Plot Embedding functionality visualizes document relationships in a low-dimensional scatter plot, where proximity indicates similarity. It uses precomputed embeddings generated by techniques like word or sentence embeddings and topic modeling.

Figure 7.0 A scatter plot visualizing relationships between documents in a low-dimensional space, where proximity of points reflects document similarity based on embeddings generated through techniques like word and sentence embeddings or topic modeling.
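The notion that nearby points are similar documents can be checked numerically before any plotting. The sketch below compares made-up embedding vectors with cosine similarity and finds each document's nearest neighbor. It does not reproduce the projection used for the scatter plot, and cosine similarity is my choice of measure here, not something the library documentation is quoted as using.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(vectors, index):
    """Index of the vector most similar to vectors[index], excluding itself."""
    others = [i for i in range(len(vectors)) if i != index]
    return max(others, key=lambda i: cosine_similarity(vectors[index], vectors[i]))


# Made-up document embeddings (for example, topic weights per document).
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
]
print(nearest_neighbor(embeddings, 0))
```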

Citation Network

This feature illustrates the citation network of scientific papers by generating a graph where nodes symbolize documents and edges indicate citation links. By utilizing a DocumentSet as input, it permits customization of node size, edge thickness, and color coding according to metadata such as publication year or source. The resulting graph can be either interactive or static, facilitating the analysis of relationships and connections among documents.
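The graph described above can be sketched without any plotting library. The example below turns a mapping of document ids to cited ids into an edge list and derives a node size from how often each document is cited within the set. The function name, the input format, and the sizing rule are assumptions for illustration, not Litstudy's API.

```python
from collections import defaultdict


def build_citation_graph(references):
    """Build (edges, node_size) from a mapping of doc id -> list of cited ids."""
    known = set(references)
    # Keep only citations that point to another document inside the set.
    edges = [
        (src, dst)
        for src, cited in references.items()
        for dst in cited
        if dst in known and dst != src
    ]
    cited_by = defaultdict(int)
    for _, dst in edges:
        cited_by[dst] += 1
    # Hypothetical sizing rule: base size 1 plus the in-set citation count.
    node_size = {doc: 1 + cited_by[doc] for doc in references}
    return edges, node_size


# Made-up citation data; "X" is outside the set and is ignored.
references = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["C", "X"],
}
edges, node_size = build_citation_graph(references)
print(edges)
print(node_size)
```

An edge list and a per-node size like these are the typical inputs a graph-drawing library needs to render nodes as documents and edges as citation links.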

Summary

The key capabilities of Litstudy for literature research in an M.Tech dissertation include:

Comprehensive Metadata Management: Enables the retrieval and management of metadata from various literature databases such as Scopus, Semantic Scholar, and CrossRef.
Advanced Text Analysis: Offers features like document filtering, n-gram merging, and corpus creation for NLP applications.
Topic Modeling and Visualization: Provides tools to identify trends using LDA, NMF, word clouds, and embedding plots for thematic insights.
Citation Network Analysis: Illustrates citation connections among documents with customizable graphs for detailed research linkages.
Seamless Workflow Integration: Integrates with libraries like bibtexparser and pandas in Jupyter Notebooks to enhance analysis and interactivity.

References

[1] S. Heldens, A. Sclocco, H. Dreuning, B. van Werkhoven, P. Hijma, J. Maassen & R.V. van Nieuwpoort (2022), "litstudy: A Python package for literature reviews", SoftwareX, 20, 101207. DOI: 10.1016/j.softx.2022.101207.
[2] Literature Databases, litstudy 0.1 documentation.
[3] Jupyter Notebook for this post on Google Colab (link available on request).
