litstudy in Action: A Real-World Example of Scientific Literature Analysis
- Kasturi Murthy
- Mar 23
Updated: Apr 11
This post looks at how tools like litstudy [1] and Jupyter Notebook can support a systematic exploration of scientific literature. It draws on my experience using litstudy to write the introductory chapter of the literature survey in my M.Tech dissertation, and gives an overview of the library's features. It is written for readers already familiar with Jupyter Notebook [3], and shows how litstudy helps you get oriented in a new research domain, analyze the metadata of scientific papers, and support that analysis with interactive visualizations.
Search Functionality of Litstudy
The litstudy [2] search feature aggregates metadata from scientific publications, such as the title, authors, publication date, and DOI. It does not provide access to the full text of the documents; its purpose is to give users a detailed overview of what the literature databases contain. A DOI (Digital Object Identifier) is a persistent, unique identifier for a piece of digital content such as a paper. It resolves to the publication's landing page, which makes it a dependable key for citing and locating a document.
The literature databases supported by this library, and the functionality available for each, include:
Scopus:
Offers functions to retrieve metadata for documents using identifiers (DOI, PubMed ID, Scopus ID) or approximate title-based searches.
Enables submitting queries to the Scopus API and importing CSV files exported from Scopus.
Semantic Scholar:
Allows metadata retrieval using various identifiers (e.g., DOI, ArXiv ID, Corpus ID).
Provides functions for refining metadata, submitting queries to the Semantic Scholar API, and obtaining results.
CrossRef:
Offers metadata retrieval for documents using DOI with timeout settings to manage server communication.
Supports querying the CrossRef API with options for sorting and filtering results.
CSV:
General-purpose CSV loading with options for field name customization and filtering.
litstudy [2] attempts to infer the purpose of each field, or uses explicitly defined field names.
Other Platforms:
Functions for importing metadata from specific databases, including IEEE Xplore, Springer Link, DBLP, arXiv, BibTeX, and RIS files.
Each platform supports specific queries or file formats for metadata retrieval.
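To make the idea of field inference for CSV files concrete, here is a minimal sketch using only the Python standard library. It is not litstudy's implementation: the alias table, the function names `infer_fields` and `load_records`, and the sample data are all assumptions made for illustration. The sketch guesses which column holds the title, authors, year, and DOI, lets explicit overrides take precedence, and applies an optional filter.

```python
import csv
import io

# Hypothetical alias table: canonical field -> column names that may denote it.
FIELD_ALIASES = {
    "title": ["title", "document title", "article title"],
    "authors": ["authors", "author", "author names"],
    "year": ["year", "publication year", "pubyear"],
    "doi": ["doi", "digital object identifier"],
}


def infer_fields(header, overrides=None):
    """Map canonical field names to CSV columns.

    Explicit `overrides` win; remaining fields are guessed by comparing
    lower-cased column names against FIELD_ALIASES.
    """
    overrides = overrides or {}
    normalized = {name.strip().lower(): name for name in header}
    mapping = {}
    for field, aliases in FIELD_ALIASES.items():
        if field in overrides:
            mapping[field] = overrides[field]
            continue
        for alias in aliases:
            if alias in normalized:
                mapping[field] = normalized[alias]
                break
    return mapping


def load_records(text, overrides=None, keep=None):
    """Parse CSV text into dicts keyed by canonical field names."""
    reader = csv.DictReader(io.StringIO(text))
    mapping = infer_fields(reader.fieldnames, overrides)
    records = []
    for row in reader:
        record = {field: row[column].strip() for field, column in mapping.items()}
        # `keep` is an optional predicate used to filter records.
        if keep is None or keep(record):
            records.append(record)
    return records


sample = (
    "Document Title,Author Names,Publication Year,DOI\n"
    "Paper A,Ada; Bob,2019,10.1000/a\n"
    "Paper B,Cy,2022,10.1000/b\n"
)
recent = load_records(sample, keep=lambda r: int(r["year"]) >= 2020)
print(recent)
```

In practice the library's own CSV loader should be preferred; the sketch only shows why a loader can often work without being told which column is which, and why explicit field names remain useful when the guess fails.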
Document handling classes:
Document: Represents a single publication, storing metadata such as title, authors, abstract, DOI, and publication source.
DocumentSet: Manages a collection of documents, supporting filtering, deduplication, merging, and set operations (e.g., union, intersection).
Author: Represents an author, containing details like name and affiliations.
PublicationSource: Defines the source of a document (e.g., Scopus, PubMed), tracking source-specific metadata.
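The set-like behaviour of a document collection can be sketched in a few lines of plain Python. The classes `Doc` and `DocSet` below are hypothetical stand-ins written for this post, not litstudy's own `Document` and `DocumentSet`; they assume that two records describe the same paper when their DOIs match (ignoring case), with the title as a fallback key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a publication record."""
    title: str
    doi: str
    year: int

    def key(self):
        # Identify a document by DOI, falling back to a normalized title.
        return self.doi.lower() if self.doi else self.title.strip().lower()


class DocSet:
    """Hypothetical stand-in for a collection with set semantics."""

    def __init__(self, docs):
        # Deduplicate on construction: the first record per key is kept.
        self._docs = {}
        for doc in docs:
            self._docs.setdefault(doc.key(), doc)

    def __len__(self):
        return len(self._docs)

    def __iter__(self):
        return iter(self._docs.values())

    def __or__(self, other):
        # Union: documents found in either set.
        return DocSet(list(self) + list(other))

    def __and__(self, other):
        # Intersection: documents found in both sets.
        return DocSet(d for d in self if d.key() in other._docs)

    def filter(self, predicate):
        # Keep only the documents for which the predicate is true.
        return DocSet(d for d in self if predicate(d))


a = Doc("Topic models for surveys", "10.1000/A1", 2019)
b = Doc("Citation graphs", "10.1000/b2", 2021)
c = Doc("Topic Models for Surveys", "10.1000/a1", 2019)  # same DOI as `a`

query1 = DocSet([a, b])
query2 = DocSet([c])

print(len(query1 | query2))   # size of the union
print(len(query1 & query2))   # size of the intersection
print([d.title for d in query1.filter(lambda d: d.year >= 2020)])
```

Keying on the DOI is what makes merging results from several databases practical: the same paper returned by two searches collapses into one entry instead of being counted twice.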

Fetching Documents using Search Query
![Figure 3.0 illustrates Python code snippets utilizing the "litstudy.search_semanticscholar" function. It showcases the retrieval of academic documents based on different search queries [z1-z3] along with corresponding document counts. Enables operations such as union (|) and intersection (&).](https://static.wixstatic.com/media/48f00c_40ad6c0caa98432b959c7ba1d3b8a3dc~mv2.png/v1/fill/w_57,h_72,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_avif,quality_auto/48f00c_40ad6c0caa98432b959c7ba1d3b8a3dc~mv2.png)

Create a Document Corpus from a Collection of Documents
The build_corpus function prepares text data from documents for tasks like topic modeling, natural language processing, and clustering. It applies preprocessing steps such as token filtering and n-gram merging, and outputs a word-frequency vector for each document alongside a word-index dictionary, both of which are consumed by other litstudy features.
Topic Modeling: Discovers popular topics and trends in scientific publications using NLP methods (e.g., LDA, embedding-based techniques) and generates topic lists with keywords, visualizations, and thematic insights.
NMF Model: Extracts latent topics from document corpora with adjustable parameters (e.g., number of topics, iterations) to support literature reviews, research exploration, and thematic grouping.
N-Gram Merging: Combines frequently co-occurring tokens into meaningful phrases (e.g., bigrams, trigrams), retaining context that would be lost if the words were analyzed individually. This enriches the text representation and can improve clustering and topic modeling.
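The preprocessing described above can be illustrated with a small standard-library sketch. This is not how build_corpus is implemented; the stopword list, the count threshold for merging, and the function names `tokenize`, `merge_bigrams`, and `build_vectors` are assumptions chosen to keep the example short. It filters tokens, merges adjacent word pairs that occur often enough, and produces a word-index dictionary plus one frequency vector per document.

```python
import re
from collections import Counter

# Assumed stopword list, kept tiny for the example.
STOPWORDS = {"the", "of", "for", "and", "a", "in", "with", "on"}


def tokenize(text, min_length=3):
    """Lower-case, split on non-letters, drop stopwords and short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_length and w not in STOPWORDS]


def merge_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times overall."""
    pair_counts = Counter(p for doc in docs for p in zip(doc, doc[1:]))
    frequent = {p for p, n in pair_counts.items() if n >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # both tokens were consumed by the merged phrase
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


def build_vectors(docs):
    """Return a word-index dictionary and one frequency vector per document."""
    vocabulary = sorted({w for doc in docs for w in doc})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return index, vectors


abstracts = [
    "Topic modeling of the deep learning literature",
    "Deep learning for citation analysis",
    "Citation analysis with topic modeling",
]
tokens = merge_bigrams([tokenize(a) for a in abstracts])
word_index, frequency_vectors = build_vectors(tokens)
print(word_index)
print(frequency_vectors)
```

After merging, "deep learning" and "topic modeling" are each a single vocabulary entry, so a topic model sees them as phrases rather than as unrelated occurrences of "deep", "learning", "topic", and "modeling".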
Topic Clouds
This feature visualizes topics identified through modeling by creating customizable word clouds, where the size of the words indicates their significance, helping users intuitively understand key terms and thematic structures. It accepts trained topic models (such as NMF or LDA) and allows adjustments to parameters like font size and color schemes. These visualizations can be saved as images, used for topic labeling, and easily integrate with Litstudy pipelines and Jupyter Notebooks for interactive exploration.
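The core idea of a word cloud, that a term's size reflects its weight in the topic, can be sketched as a simple linear scaling. The function name `font_sizes`, the size range, and the example weights below are assumptions for illustration; the actual rendering is left to the plotting library.

```python
def font_sizes(token_weights, min_size=10, max_size=48):
    """Linearly map topic weights to font sizes in [min_size, max_size]."""
    lowest = min(token_weights.values())
    highest = max(token_weights.values())
    # Avoid division by zero when every token has the same weight.
    span = (highest - lowest) or 1.0
    return {
        token: round(min_size + (weight - lowest) / span * (max_size - min_size))
        for token, weight in token_weights.items()
    }


# Hypothetical weights of three tokens within one topic.
topic = {"graph": 0.5, "citation": 1.0, "network": 0.75}
print(font_sizes(topic))
```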

DocumentSet enables efficient management and analysis of document collections. To pinpoint the most pertinent documents for a specific topic, the trained topic model offers the method "best_documents_for_topic"; it returns the positions of the highest-ranked documents, which can then be looked up in the DocumentSet.
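Finding the best documents for a topic comes down to sorting the documents by their weight for that topic. The sketch below shows that idea on a hand-written document-topic matrix; the helper `best_documents`, the titles, and the weights are made up for illustration and do not reproduce the library's code.

```python
def best_documents(doc_topic_weights, topic_id, limit=5):
    """Return indices of the documents with the largest weight for a topic."""
    ranked = sorted(
        range(len(doc_topic_weights)),
        key=lambda i: doc_topic_weights[i][topic_id],
        reverse=True,
    )
    return ranked[:limit]


titles = ["Paper A", "Paper B", "Paper C", "Paper D"]
# Hypothetical weights: one row per document, one column per topic.
weights = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
]

top = best_documents(weights, topic_id=1, limit=2)
print([titles[i] for i in top])
```

Reading the titles of the top few documents for each topic is a quick way to decide on a human-readable label for that topic.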

Plot Embedding
The litstudy Plot Embedding functionality visualizes document relationships in a low-dimensional scatter plot, where proximity indicates similarity. It uses precomputed embeddings generated by techniques such as word or sentence embeddings, or topic modeling.
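What "similarity" means for two documents can be made concrete with cosine similarity between their vectors. The sketch below is only an illustration of that notion: the vectors are invented topic weights, the helper names are my own, and the dimensionality-reduction step that turns such vectors into 2-D plot coordinates is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor(vectors, index):
    """Index of the document most similar to the one at `index`."""
    others = [i for i in range(len(vectors)) if i != index]
    return max(others, key=lambda i: cosine_similarity(vectors[index], vectors[i]))


# Hypothetical topic weights for three documents.
vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
]
print(nearest_neighbor(vectors, 0))
```

Documents 0 and 1 share the same dominant topic, so they would be expected to sit close together in an embedding plot, while document 2 would land elsewhere.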

Citation Network
This feature illustrates the citation network of scientific papers by generating a graph where nodes symbolize documents and edges indicate citation links. By utilizing a DocumentSet as input, it permits customization of node size, edge thickness, and color coding according to metadata such as publication year or source. The resulting graph can be either interactive or static, facilitating the analysis of relationships and connections among documents.
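The data structure behind such a graph is straightforward: nodes are documents and a directed edge runs from a citing paper to a cited one. The following sketch builds the edge list from invented records and derives a node size from the number of citations received within the collection. The record layout, the helper names, and the sizing formula are assumptions for illustration, not litstudy's internals.

```python
# Hypothetical records: each document lists the DOIs it cites.
papers = {
    "10.1000/a": {"year": 2018, "cites": []},
    "10.1000/b": {"year": 2020, "cites": ["10.1000/a"]},
    "10.1000/c": {"year": 2022, "cites": ["10.1000/a", "10.1000/b", "10.1000/x"]},
}


def build_citation_edges(papers):
    """Return (citing, cited) pairs, keeping only links inside the collection."""
    return [
        (source, target)
        for source, meta in papers.items()
        for target in meta["cites"]
        if target in papers
    ]


def in_degree(papers, edges):
    """Count how often each document is cited by others in the collection."""
    counts = {doi: 0 for doi in papers}
    for _, cited in edges:
        counts[cited] += 1
    return counts


edges = build_citation_edges(papers)
citations = in_degree(papers, edges)
# Assumed sizing rule: a base size plus a fixed increment per citation.
node_sizes = {doi: 100 + 50 * n for doi, n in citations.items()}
print(edges)
print(node_sizes)
```

Note that the reference to "10.1000/x" is dropped because that paper is not part of the collection; a citation network drawn from a DocumentSet can only show links between documents that are both in the set.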

Summary
The key capabilities litstudy offers for a literature survey, such as the one in an M.Tech dissertation, include:
Comprehensive Metadata Management: Enables the retrieval and management of metadata from various literature databases such as Scopus, Semantic Scholar, and CrossRef.
Advanced Text Analysis: Offers features like document filtering, n-gram merging, and corpus creation for NLP applications.
Topic Modeling and Visualization: Provides tools to identify trends using LDA, NMF, word clouds, and embedding plots for thematic insights.
Citation Network Analysis: Illustrates citation connections among documents with customizable graphs for detailed research linkages.
Seamless Workflow Integration: Integrates with libraries like bibtexparser and pandas in Jupyter Notebooks to enhance analysis and interactivity.
References
S. Heldens, A. Sclocco, H. Dreuning, B. van Werkhoven, P. Hijma, J. Maassen & R.V. van Nieuwpoort (2022), "litstudy: A Python package for literature reviews", SoftwareX, 20, 101207. DOI: 10.1016/j.softx.2022.101207.
Link to the Jupyter Notebook (hosted on Google Colab) available on request.