By Sudam Rohanadeera

Incorporating Custom Data into Large Language Models

Updated: Feb 16, 2024



We had the honor of presenting an abstract titled "Enhancing Large Language Models with Custom Data: Leveraging of Document Indexing with Contextual Embeddings" at the APAN56 Conference, organized by the Asia Pacific Advanced Network (APAN), held on 24 August 2023 at the Galle Face Hotel.


Introduction

Large Language Models (LLMs) have transformed the landscape of natural language processing, enabling automation across diverse tasks. However, these models encounter limitations, including inaccurate responses and a lack of contextual knowledge in specialized domains. In this study, we explore the integration of custom data to enhance LLM applications, particularly those with restricted input sizes, such as GPT-3, whose context window is limited to 4,097 tokens.


The Challenge

LLMs face a challenge when dealing with extensive context due to their token limitations. Injecting entire documents into an LLM is neither efficient nor cost-effective. The need arises for a more targeted approach that leverages relevant context without overwhelming the model.
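The scale of the problem is easy to demonstrate. The sketch below is a toy illustration, not part of the study: it uses a rough words-based token estimate (a real pipeline would use an actual tokenizer) against GPT-3's 4,097-token window to show why injecting an entire document into a prompt fails.

```python
# Toy illustration of the context-window problem. The token count is a
# rough heuristic (~1 token per 0.75 English words), not a real tokenizer.
MAX_TOKENS = 4097  # GPT-3's context window

def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 1 token per 0.75 words."""
    return int(len(text.split()) / 0.75)

def fits_in_prompt(document: str, question: str, reserve_for_answer: int = 500) -> bool:
    """Check whether a document plus a question fits the model's window,
    leaving room for the model's answer."""
    used = estimate_tokens(document) + estimate_tokens(question)
    return used + reserve_for_answer <= MAX_TOKENS

# A 10,000-word document far exceeds the window, so naive injection fails.
big_doc = "word " * 10_000
print(fits_in_prompt(big_doc, "What does the BIT program cover?"))  # False
```

Since whole documents rarely fit, the prompt must carry only the passages that matter for the question at hand.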


Introducing LlamaIndex

To address these limitations, we introduce LlamaIndex, an innovative Open-Source Python library. LlamaIndex serves two critical purposes:

  1. Document Indexing: LlamaIndex abstracts the process of extracting contextually rich segments from documents. It identifies relevant sections and feeds them to the LLM prompt.

  2. Embeddings: Leveraging embeddings, LlamaIndex enhances the LLM’s understanding of input. These numerical representations capture semantic relationships between words, enabling better context extraction.
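The two purposes above can be sketched in a few lines of plain Python. This is a toy model of the retrieve-then-prompt idea, not LlamaIndex's actual API: it uses a bag-of-words "embedding" and cosine similarity, where a real pipeline would use a learned embedding model. The sample chunks and the query are invented for illustration.

```python
# Minimal sketch of document indexing + embedding retrieval:
# embed each chunk, then place only the chunks most similar to the
# query into the LLM prompt.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system uses a learned model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_index(chunks):
    """Index step: pair each chunk with its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query: str, top_k: int = 2):
    """Retrieval step: return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

chunks = [
    "The BIT program is offered by UCSC.",
    "Registration closes in March each year.",
    "LLMs have a limited context window.",
]
index = build_index(chunks)
print(retrieve(index, "When does BIT registration close?", top_k=1))
```

The retrieved chunks, rather than the full corpus, are what gets prepended to the LLM prompt, keeping the request well inside the token limit.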

Experimental Validation

We conducted experiments using three knowledge sources related to the Bachelor of Information Technology (BIT) program at the University of Colombo School of Computing (UCSC). Our findings include:

  1. Diverse Responses: Identical prompts yielded different responses when processed by LlamaIndex-equipped LLMs, confirming that the injected context meaningfully shaped the model's output.

  2. Quality Enhancement: Integrating document-indexing and embeddings improved the quality of LLM-generated answers. Responses spanned a spectrum from concise to detailed, allowing nuanced interactions.

Implications and Future Directions

Our research demonstrates the potential of document-indexing and embeddings in enhancing LLM applications. By seamlessly integrating external data sources, we overcome traditional prompt engineering limitations. LlamaIndex paves the way for context-aware LLMs, making them more accurate, relevant, and adaptable.


For more details, you can refer to the original slide deck presented at the APAN56 Conference.

