Artificial intelligence concept. Electronic circuit of brain shape. 3D rendered illustration.

CREDIT: vchal via Shutterstock

A Librarian’s Guide to AI in Academic Search Tools

As Katina covers the integration of AI into academic search tools and other library products, this guide offers useful background information on the technology.

By Aaron Tay Chee Hsien



This guide provides an overview of these subjects:

  1. Large Language Models (LLMs)
  2. Constructing an Answer with Retrieval Augmented Generation (RAG)
  3. Understanding (Vector) Embedding Search
  4. Why the Use of Embeddings in Retrieval Reduces Interpretability
  5. Embedding Search in Practice
  6. Why Embedding Search Leads to Less Reproducible Results
  7. Reranking with Embedding Search
  8. Hybrid Search and Rerankers

Large Language Models (LLMs)

LLMs are advanced AI systems trained on vast amounts of text data, enabling them to both understand text and generate human-like summaries and answers to questions. Popular LLMs include proprietary models like OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and open-source models like Meta’s Llama and DeepSeek.

(Note: Modern LLMs powered by Transformer architectures reach or approach human-level scores on many prominent natural language benchmarks for tasks such as summarization and sentiment analysis. However, philosophers and linguists still argue over whether such models have genuine “understanding” in the way humans do, as opposed to merely being very sophisticated statistical pattern-matchers (Bender et al., 2021; Bender & Koller, 2020). I’ll use “understand” and similar terms throughout this guide but take no position in the larger debate.)

Constructing an Answer with Retrieval Augmented Generation (RAG)

The term “retrieval augmented generation,” or RAG, was first coined in a 2020 paper (Lewis et al., 2021). RAG is popularly used in academic search systems to blend search with large language models. A typical RAG system will not just generate an answer to your query but will provide citations to documents that support the answer (see Figure 1 for a sample answer). But how does it do so?

FIGURE 1

Think about the difference between a closed-book exam (where you rely only on memory) and an open-book exam (where you look things up before answering). RAG is like an open-book exam for an AI system.

Figure 2 illustrates how RAG typically works in academic AI search engines, including Primo Research Assistant, Web of Science Research Assistant, and Scopus AI.

FIGURE 2

Here’s a simplified explanation:

1. You Ask: You type your research question into the tool.

2. The System Searches (Retrieval): Instead of answering immediately, the tool first searches its specific, trusted academic database (like Primo’s index, Scopus, or Web of Science) for documents related to your question. It picks out the top few results.

3. The System Reads and Writes (Augmented Generation): The system then feeds the retrieved documents, along with a default, built-in prompt, to the LLM, which uses the information from those specific documents to write a summary answer to your question, including citations that point back to the documents it used.

An example of what this default prompt might look like appears in Figure 3.

FIGURE 3
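
To make the three steps concrete, here is a minimal RAG sketch in Python. It is illustrative only: search_index() is a hypothetical stand-in for querying a scholarly index, and the OpenAI chat API and model name are just one possible LLM choice, not what any particular product uses.

```python
from openai import OpenAI  # assumes the openai package is installed and an API key is configured

client = OpenAI()

def rag_answer(question, search_index, top_k=5):
    # 1. Retrieval: search_index() is a hypothetical function that queries a
    #    scholarly index and returns records with 'title' and 'abstract' fields.
    docs = search_index(question, limit=top_k)

    # 2. Build the "default prompt": number each source so the LLM can cite it.
    sources = "\n\n".join(
        f"[{i + 1}] {d['title']}\n{d['abstract']}" for i, d in enumerate(docs)
    )
    prompt = (
        "Answer the question using ONLY the sources below, citing them as [number]. "
        "If the sources are insufficient, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

    # 3. Augmented generation: the LLM writes a summary grounded in the retrieved text.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content, docs
```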

What are the advantages of RAG versus just using an LLM directly without search?

  • More Accurate Answers: Because the AI bases its answer on specific documents from a specific database that it just identified, the information is more likely to be accurate and relevant.
  • Real Citations: This process reduces the chance of the AI making up citations (“hallucinating”). While the prevalence of hallucinations varies with implementation, empirical studies show that, typically, 95 to 99 percent of the papers cited by RAG systems do exist. These studies likely overcount hallucinations because they fail to distinguish them from other kinds of citation error (for example, flagging a paper as fictitious when its citation is merely garbled by incorrect metadata in the source the RAG system used).
  • Straightforward Verification: Even if the RAG system distorts what the retrieved documents say by paraphrasing (which does happen), you can verify the answer against the source used.

Do note that many modern LLMs, like ChatGPT, Gemini, Claude, and Grok, have search/RAG capabilities. However, two things still distinguish them from academic AI search engines: first, while they can search, they do not always “choose” to search; second, they use general web search (e.g., Bing), while academic AI search systems search academic content in scholarly indexes.

Understanding (Vector) Embedding Search

Beyond traditional keyword searching, some modern AI tools utilize embedding search, also sometimes known as vector search, neural search, or semantic search. At the core of this approach are “embeddings,” or “vectors”: sophisticated numerical representations of key parts of the text (like titles, abstracts, or chunks of full text). Embeddings are essentially long lists of numbers, often with hundreds or even thousands of dimensions (common sizes include 768 or 1,536 dimensions), where each dimension represents some aspect of the text’s meaning. The high dimensionality allows these vectors to capture very complex semantic nuances and relationships between concepts.

Embeddings are generated using complex neural network models. State-of-the-art systems typically employ a type of neural network architecture dubbed the “Transformer” in the seminal 2017 paper “Attention Is All You Need” (Vaswani et al., 2017), which laid the foundation for popular models used today (e.g., BERT, GPT). These models are trained on massive datasets and excel at understanding the context in which words appear—analyzing surrounding words and sentence structure. This process allows them to encode nuanced meaning, relationships, and context into the high-dimensional vectors, creating a rich mathematical representation of the text’s core concepts.

Embedding search leverages these vector representations. When you submit a query, it is run through an embedding model that converts the query into its own embedding vector to represent the query.

The system then mathematically compares this query vector to the pre-calculated embedding vectors of the documents in the index (which represent the documents). It identifies documents whose vectors are mathematically closest to the query vector within that high-dimensional “meaning space.” This proximity, often calculated using measures like cosine similarity, signifies strong semantic relevance between the query and the document, forming the basis of the search results.
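
As a quick illustration, here is what that comparison looks like with cosine similarity, using tiny made-up three-dimensional vectors rather than real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths:
    # 1.0 = pointing in the same direction, values near 0 = unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = np.array([0.2, 0.7, 0.1])           # made-up query embedding
doc_vecs = {
    "doc_1": np.array([0.25, 0.65, 0.05]),      # close in "meaning space" to the query
    "doc_2": np.array([0.9, -0.1, 0.3]),        # further away
}
ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
print(ranked)  # ['doc_1', 'doc_2']
```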

Why do this? The embedding method can overcome the “vocabulary mismatch problem,” where relevant documents are missed because the query terms used do not match. But it is not a silver bullet, and in some situations, is outperformed by keyword search. This is why many systems (for example, Scopus AI) are a hybrid of both methods.

Why the Use of Embeddings in Retrieval Reduces Interpretability

The first popular vector embedding for individual words was Word2Vec, released by Google in 2013 (Mikolov et al., 2013).

The main idea of Word2Vec is that words with similar meanings tend to appear in similar contexts. Word2Vec uses a simple neural network to learn by looking at words that occur near each other in sentences.

The Word2Vec method involves feeding the network sample sentences from the internet and training the network to predict neighboring words for a given word (Skip-gram) or to predict the given word from its neighbours (CBOW). By adjusting the network’s internal weights through many examples, Word2Vec gradually learns numerical representations that capture word meanings.

Google found that once they trained Word2Vec this way over a huge set of data (Google News text, or about 100 billion words), the embeddings (300 dimensions) that were produced seemed to encode semantic meaning. For example, they famously found that the embeddings for king - man + woman ≈ queen (Mikolov et al., 2013).
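
If you want to try this yourself, the gensim library can download the same pretrained Google News vectors; a rough sketch (the download is large, on the order of 1.6 GB):

```python
import gensim.downloader as api

# Loads the 300-dimensional Word2Vec vectors trained on Google News (large download on first use).
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected top result: ('queen', ...) with a similarity score of roughly 0.7
```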

Today, modern embeddings go beyond the basic neural networks used in Word2Vec, mostly using Transformer-based neural network architectures instead (e.g. BERT).

While Word2Vec is a word-level embedding, the concept of embeddings can be used at different levels of granularity:

  1. Word level: Map individual words to vector embeddings (e.g., Word2Vec, GloVe (Pennington et al., 2014)).

  2. Sentence embeddings: Map whole sentences to vector embeddings (e.g., SBERT (Reimers & Gurevych, 2019), Universal Sentence Encoder (Cer et al., 2018)).

  3. Document embeddings: Map whole documents to vector embeddings.

FIGURE 4

Because embeddings are trained using neural networks, they are black boxes: it is almost never possible to interpret what each number means.

In the example below, I ran the words “Apple,” “Orange,” and “Car” individually through a popular OpenAI embedding model named text-embedding-3-small, which has 1,536 dimensions (OpenAI, 2024). Another way to say this is that text in this embedding model is represented by 1,536 numbers, or in 1,536-dimensional space.

Here are the first two numbers and the last number of the embeddings representing “Apple,” “Orange,” and “Car.”



Dimension    1          2          …          1536
Apple        0.0176     -0.0168    …          0.007
Orange       -0.0259    -0.0055    …          -0.0147
Car          0.0085     -0.0040    …          -0.003

While it is true that if you compute a similarity measure like cosine similarity between the three embeddings, you will find “Apple” is indeed closer to “Orange” (that is, has higher cosine similarity) than to “Car,” you will not be able to interpret what each individual number in the embeddings means.
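
For the curious, here is a sketch of how numbers like those in the table can be generated and compared, assuming the openai Python package and an API key; the exact values returned may differ slightly from those shown above:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

words = ["Apple", "Orange", "Car"]
resp = client.embeddings.create(model="text-embedding-3-small", input=words)
emb = {w: np.array(d.embedding) for w, d in zip(words, resp.data)}  # 1,536 numbers per word

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["Apple"], emb["Orange"]))  # higher similarity ...
print(cosine(emb["Apple"], emb["Car"]))     # ... than this, yet no single dimension is interpretable
```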

This problem is compounded if you use sentence-level embeddings, as all the words in the sentence are “squeezed” into one embedding. There are more sophisticated multi-vector approaches, like ColBERT, that store an individual embedding for each token or word in a text chunk, which improves interpretability.

Embedding Search in Practice

FIGURE 5

Information retrieval needs to be fast. A common way to leverage embeddings is to adopt a “bi-encoder” model (Reimers & Gurevych, 2019). When documents are indexed, they are run through an embedding model that converts their text into embeddings representing the documents; these embeddings are then stored in a special index known as a vector store.

When you enter a query, it is converted in real time into an embedding using the same embedding model, creating a representation of the query. Assuming your RAG system wants to find the top eight results, the query embedding is compared against the document embeddings using a comparison function, such as cosine similarity, to find the eight closest document embeddings.
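
A minimal bi-encoder sketch using the sentence-transformers library; the model name is an illustrative public choice, and the in-memory list stands in for a real vector store:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative public bi-encoder model

# Indexing time: embed the documents once and keep the vectors (the "vector store").
documents = [
    "Tariff escalation and retaliation in twentieth-century trade disputes",
    "Coral reef bleaching under ocean warming",
    "Smoot-Hawley and the politics of protectionism",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# Query time: embed the query with the SAME model and find the closest documents.
query_embedding = model.encode("history of trade wars", convert_to_tensor=True)
for hit in util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]:
    print(documents[hit["corpus_id"]], round(hit["score"], 3))
```

Note that util.semantic_search here compares the query against every stored embedding, which is fine for a handful of documents.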

But imagine if the index has 100 million document embeddings. If you tried to compare the query embedding, or representation, against all 100 million document embeddings, it would take too long.

There are exact methods that can guarantee you will find the closest document embeddings without comparing against all 100 million, but in practice, most embedding search engines use faster shortcut methods known as “approximate nearest neighbor” (ANN) search.

While ANN methods are fast, they often lead to searches being less reproducible—that is, the same exact query entered multiple times sometimes results in different documents being retrieved or in documents being retrieved in different orders, even if the index is unchanged.

Why Embedding Search Leads to Less Reproducible Results

Faster methods like ANN use “guessing” to approximate results, which may miss some of the true closest documents (Elastic, 2024).

Here’s a rough analogy: imagine a library of one million books indexed by subject headings that cluster similar subjects together. A librarian is asked to find the five books most relevant to “history of trade wars.” Rather than scanning every book, she:

  • Pulls roughly 10 candidate books from a shelf with a relevant subject heading and skims them.
  • After skimming those candidates, she decides whether her shortlist looks good enough. If not, she jumps to the next-closest shelf cluster and repeats until she runs out of time. In the end, she will have a pretty good (but maybe not perfect) set of five books.

Another librarian might choose a different starting point, yielding slightly different results.

This is how approximate nearest neighbor (ANN) search works: the same query can produce varied results depending on the starting point.

Technically, some systems store the “seed,” or starting point, within a session to ensure consistent results, but starting a new session with the same query may trigger the selection of a different starting point, leading to different results.

Searching more thoroughly reduces the chance of missing the true top eight, but the differences are often minor enough that the trade-off for faster speed is worth it.
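
Here is a sketch of this trade-off with the hnswlib ANN library (random vectors stand in for real embeddings); the ef setting is roughly the “how many shelves to check” knob from the analogy:

```python
import hnswlib
import numpy as np

dim, n_docs = 384, 100_000
doc_vectors = np.random.rand(n_docs, dim).astype("float32")  # stand-ins for document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_docs, ef_construction=200, M=16)
index.add_items(doc_vectors, np.arange(n_docs))

query = np.random.rand(dim).astype("float32")

index.set_ef(50)    # shallow search: fast, but may miss some true nearest neighbours
fast_labels, _ = index.knn_query(query, k=8)

index.set_ef(500)   # deeper search: slower, results closer to an exhaustive scan
thorough_labels, _ = index.knn_query(query, k=8)
# The two top-8 lists usually overlap heavily but are not guaranteed to be identical.
```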

Reranking with Embedding Search

Implementing embedding search as the primary retrieval method across an entire index can be computationally intensive and costly, often requiring the creation and maintenance of large vector databases to store document embeddings during the pre-indexing phase. A more efficient and cost-effective alternative is to use embedding search for reranking.

This approach works in two stages:

  1. Initial Broad Search: The system performs a retrieval using traditional methods like keyword search to identify an initial set of potentially relevant documents.
  2. Embedding Reranking of Top Candidates: Instead of applying computationally expensive embedding comparisons to the whole index, the system calculates embeddings only for the user’s query and the top few results from the initial search (e.g., the top 30 or 100). It then measures the semantic similarity between the query embedding and these candidate document embeddings as described above and reorders (“reranks”) this smaller set based on these scores.

This method leverages the speed of traditional search in the initial filtering and applies the nuanced semantic understanding of embeddings only where it’s most likely to matter—on the most promising candidates. Because the system only needs to calculate the embedding of a small number of documents, it can do so on the fly rather than constructing embeddings of the whole index in advance and storing them in a separate vector database.
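
A two-stage sketch, using the rank_bm25 library for the keyword first pass and a sentence-transformers bi-encoder to rerank only the top candidates on the fly (both library and model choices are illustrative):

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Tariff escalation and retaliation in twentieth-century trade disputes",
    "Coral reef bleaching under ocean warming",
    "Smoot-Hawley and the politics of protectionism",
    "Trade wars and their effect on global supply chains",
]
query = "history of trade wars"

# Stage 1: cheap keyword retrieval over the whole collection (top 30-100 in a real system).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
candidates = bm25.get_top_n(query.lower().split(), corpus, n=3)

# Stage 2: embed ONLY the query and the shortlisted candidates, then rerank by similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(model.encode(query, convert_to_tensor=True),
                      model.encode(candidates, convert_to_tensor=True))[0]
reranked = [doc for _, doc in sorted(zip(scores.tolist(), candidates), reverse=True)]
print(reranked)
```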

Hybrid Search and Rerankers

Because embedding search and keyword searching have their separate strengths and weaknesses, hybrid searches that combine two or more retrieval systems are common (Cardenas, 2025).

Scopus AI, for example, uses both a keyword Boolean search (with a search strategy generated by an LLM) and an embedding search in parallel, combining the top results.

FIGURE 6

Taking this approach, an AI search engine might combine the top 500 results from a keyword search and the top 500 from an embedding search, dedupe the combined result set, and rank them again.

How would it rank the combined result set? One simple yet popular method is reciprocal rank fusion (Cormack et al., 2009). This table shows reciprocal rank fusion in action:





         Rank in algo 1    Rank in algo 2    Overall Relevancy Score
Doc A    1                 2                 1/1 + 1/2 = 1.5
Doc B    5                 3                 1/5 + 1/3 ≈ 0.533
Doc C    10                8                 1/10 + 1/8 = 0.225

As the table shows, you just sum the reciprocals of each rank to create an overall ranking.

(In practice, the actual formula for each ranked position’s contribution to the RRF score is 1/(k + rank), where k is typically 60. So, for example, rank 1 contributes 1/(60 + 1) ≈ 0.0164 instead of 1. The constant k is a damping factor used to reduce the impact of top-ranking positions.)
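
A minimal sketch of reciprocal rank fusion with the damped formula (k = 60):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking using RRF."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["A", "B", "C", "D"]    # ranking from the keyword search
embedding_results = ["A", "E", "B", "C"]  # ranking from the embedding search
print(reciprocal_rank_fusion([keyword_results, embedding_results]))
# Documents ranked highly by both searches (here A and B) rise to the top of the fused list.
```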

Besides using simple methods like RRF to combine results, it is also common to run the combined results set from the hybrid search through a reranker.

A reranker is a more powerful algorithm that is very accurate but too slow to run over all the documents in your index. There are many types of rerankers—this blog post offers a detailed but still accessible discussion (Clavié, 2024)—but I will focus on two here.

First let’s discuss the cross-encoder model.

FIGURE 7

We already met the bi-encoder model, in which the query and the documents are converted separately into embeddings (typically by a Transformer-based encoder model) and then compared using a similarity measure like cosine similarity. The bi-encoder model has the advantage of being able to convert the documents into embeddings “in advance.”

But the cross-encoder model is even more powerful. In this approach, both the query and each candidate document are fed into the cross-encoder model, which outputs a relevancy score.

This allows the model to directly consider the interactions between the query and each document. Empirical research shows that cross-encoders in general perform at much higher levels than bi-encoders, but because you need to run each query/document pair in real time, they are much slower and cannot be used as a first stage retrieval system with many documents.

This is why cross-encoders are almost always used as a reranker after a first stage retrieval system (keyword, embedding, or even hybrid search system) has already narrowed down the results to a few hundred.
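
A minimal cross-encoder reranking sketch with the sentence-transformers library; the pretrained model named here is a common public choice, not the one any particular product uses:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative public reranker

query = "history of trade wars"
candidates = [
    "Tariff escalation and retaliation in twentieth-century trade disputes",
    "Coral reef bleaching under ocean warming",
    "Smoot-Hawley and the politics of protectionism",
]

# Each query/document pair is scored jointly: accurate but slow, so this is only
# run on the few hundred candidates returned by the first-stage search.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked)
```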

More recently, researchers have experimented with replacing the cross-encoder with a Transformer-based large language model to assess relevance (Pradeep et al., 2023).

To get a sense of how this works, you could feed an LLM like GPT-4 the query and the candidate document (say, the title and abstract) with a prompt to do one of these things (a minimal sketch of the first option follows the list):

  1. Classify the document into relevancy classes (e.g. Not relevant, Partially relevant, Highly relevant)
  2. Score relevancy with a number
  3. Rank documents against other documents in terms of relevancy.
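
Here is what that first option might look like in code; the OpenAI API and model name are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured

def judge_relevance(query, title, abstract):
    prompt = (
        "You are assessing academic search results.\n"
        f"Query: {query}\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Reply with exactly one label: Not relevant, Partially relevant, or Highly relevant."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```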

Many deep research tools, like OpenAI Deep Research, and academic search tools, like Undermind, already use LLMs directly to assess relevancy, which enables highly relevant results even for very nuanced queries.

References

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922

Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185–5198. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.463

Cardenas, E. (2025, January 27). Hybrid Search Explained | Weaviate. Weaviate. https://weaviate.io/blog/hybrid-search-explained

Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., & Kurzweil, R. (2018). Universal Sentence Encoder (No. arXiv:1803.11175). arXiv. https://doi.org/10.48550/arXiv.1803.11175

Clavié, B. (2024, September 16). rerankers: A Lightweight Python Library to Unify Ranking Methods. Answer.AI. https://www.answer.ai/posts/2024-09-16-rerankers.html

Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 758–759. https://doi.org/10.1145/1571941.1572114

Elastic. (2024, April 17). Understanding the approximate nearest neighbor (ANN) algorithm. Elastic Blog. https://www.elastic.co/blog/understanding-ann

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2021). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (No. arXiv:2005.11401). arXiv. https://doi.org/10.48550/arXiv.2005.11401

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space (No. arXiv:1301.3781). arXiv. https://doi.org/10.48550/arXiv.1301.3781

OpenAI. (2024, March 13). New embedding models and API updates. OpenAI. https://openai.com/index/new-embedding-models-and-api-updates/

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. In A. Moschitti, B. Pang, & W. Daelemans (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162

Pradeep, R., Sharifymoghaddam, S., & Lin, J. (2023). RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! (No. arXiv:2312.02724). arXiv. https://doi.org/10.48550/arXiv.2312.02724

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (No. arXiv:1908.10084). arXiv. https://doi.org/10.48550/arXiv.1908.10084

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000-6010. https://dl.acm.org/doi/10.5555/3295222.3295349
