

LLM Fact Auditor

Project Overview

The LLM Fact Auditor is a sophisticated NLP system designed to act as a crucial verification layer for answers generated by Large Language Models (LLMs) like Llama 2 and 3. In an era where AI-generated content is becoming ubiquitous, ensuring its factual accuracy is paramount. This project addresses the challenge of LLM "hallucinations" by grounding the model's abstract text in structured knowledge, effectively serving as a sentinel against misinformation.

The Challenge

While LLMs are incredibly powerful, their generative nature means they can produce plausible-sounding but factually incorrect statements. The core challenge was to build an automated pipeline that could systematically deconstruct an LLM's response, validate its claims against a reliable knowledge base, and enrich it with verifiable links, all without human intervention. This required a pragmatic approach, combining the semantic understanding of neural models with the structured certainty of a knowledge graph like Wikidata.

Key Features

  • Multi-Model LLM Integration: Seamlessly generates answers using wrappers for both Llama 2 and Llama 3, allowing for flexibility and performance comparisons.
  • Advanced Entity Linking: Identifies named entities in text and disambiguates them by linking to the correct Wikipedia page using a multi-step process involving popularity, context similarity, and inter-entity coherence.
  • Concise Answer Extraction: Classifies the user's query type (boolean, entity, or statement) and distills the often verbose LLM output into a direct, concise answer (e.g., "yes," "no," or the name of an entity); a minimal heuristic sketch follows this list.
  • Knowledge-Based Fact-Checking: Constructs a semantic triple (Subject-Relation-Object) from the question and answer, then queries Wikidata to verify if this relationship exists, providing a robust check on the answer's factual basis.
  • Fully Containerized Environment: Utilizes Docker to package the entire application, including all models and dependencies, ensuring complete reproducibility and ease of setup.
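
To make the answer-extraction feature concrete, here is a minimal sketch of how a verbose response to a boolean question might be distilled to a bare "yes" or "no". The `distill_boolean_answer` helper and its cue lists are illustrative assumptions, not the project's exact heuristics:

```python
import re

# Hypothetical cue lists; the real project's heuristics may differ.
NEGATION_CUES = ("no", "not", "never", "false", "incorrect")
AFFIRMATION_CUES = ("yes", "indeed", "true", "correct")

def distill_boolean_answer(llm_text: str) -> str:
    """Reduce a verbose LLM response to "yes" or "no" by scanning
    the first sentence for negation/affirmation cues."""
    first_sentence = llm_text.strip().split(".")[0].lower()
    # Check negations first so "No, it is not..." wins over stray cues.
    if any(re.search(rf"\b{cue}\b", first_sentence) for cue in NEGATION_CUES):
        return "no"
    if any(re.search(rf"\b{cue}\b", first_sentence) for cue in AFFIRMATION_CUES):
        return "yes"
    return "yes"  # fall back to affirmative when no cue is found

print(distill_boolean_answer("No, the Eiffel Tower is not in Berlin."))  # -> no
```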

Technical Deep Dive

The project is architected as a sequential pipeline where the output of each stage becomes the input for the next.

Answer Generation & NLP Pipeline: The process begins by feeding a question to an LLM (Llama 2 or 3) via the llama-cpp-python library. The resulting text is then processed through a series of NLP models. Initial question classification is handled by specialized Transformer models (shahrukhx01/question-vs-statement-classifier and PrimeQA/tydiqa-boolean-question-classifier) to determine the expected answer format.
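
A minimal sketch of these first two stages, assuming a locally stored GGUF weights file; the model path and prompt format are illustrative, and the classification labels follow each model's card:

```python
from llama_cpp import Llama
from transformers import pipeline

# Generate a raw answer with a local Llama model (path is an assumption).
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", verbose=False)
question = "Is Rome the capital of Italy?"
out = llm(f"Question: {question}\nAnswer:", max_tokens=128)
raw_answer = out["choices"][0]["text"]

# Classify the question to decide the expected answer format.
q_vs_s = pipeline("text-classification",
                  model="shahrukhx01/question-vs-statement-classifier")
boolean_clf = pipeline("text-classification",
                       model="PrimeQA/tydiqa-boolean-question-classifier")
print(q_vs_s(question), boolean_clf(question))
```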

Entity Linking & Disambiguation: This is the core of the system's grounding capability; a code sketch of steps 2 and 3 follows the list.

  1. Recognition: Entities are first identified using spaCy and the dslim/bert-base-NER model for high accuracy.
  2. Candidate Generation: For each entity mention, the Wikidata SPARQL endpoint is queried to retrieve potential candidate entities, ranked by popularity.
  3. Disambiguation: A DistilBERT model calculates the cosine similarity between the embedding of the entity's context and the embedding of the first paragraph of each candidate's Wikipedia page. This semantic check, combined with an exact-match heuristic, effectively resolves ambiguity (e.g., distinguishing between Apple the fruit and Apple Inc.).
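
A condensed sketch of steps 2 and 3, using SPARQLWrapper against the public Wikidata endpoint and mean-pooled DistilBERT embeddings. The sitelink-count popularity proxy, the `distilbert-base-uncased` checkpoint, and mean pooling are assumptions about the implementation details:

```python
import torch
from SPARQLWrapper import SPARQLWrapper, JSON
from transformers import AutoModel, AutoTokenizer

def wikidata_candidates(mention: str, limit: int = 5):
    """Step 2: fetch candidate entities for a mention, ranked by
    sitelink count as a rough popularity proxy (an assumption)."""
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="fact-auditor-demo")
    sparql.setQuery(f"""
        SELECT ?item ?itemLabel ?sitelinks WHERE {{
          SERVICE wikibase:mwapi {{
            bd:serviceParam wikibase:endpoint "www.wikidata.org";
                            wikibase:api "EntitySearch";
                            mwapi:search "{mention}";
                            mwapi:language "en".
            ?item wikibase:apiOutputItem mwapi:item.
          }}
          ?item wikibase:sitelinks ?sitelinks.
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
        }}
        ORDER BY DESC(?sitelinks)
        LIMIT {limit}""")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(r["item"]["value"], r["itemLabel"]["value"]) for r in rows]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled sentence embedding from DistilBERT's last hidden layer."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

def disambiguate(context: str, candidate_intros: list[str]) -> int:
    """Step 3: return the index of the candidate whose Wikipedia intro
    is semantically closest to the mention's surrounding context."""
    ctx = embed(context)
    scores = [torch.cosine_similarity(ctx, embed(p), dim=0).item()
              for p in candidate_intros]
    return max(range(len(scores)), key=scores.__getitem__)
```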

Fact-Checking & Verification: To validate the final answer, Stanford CoreNLP (accessed through Stanza's client interface) is used to perform open information extraction, converting the final statement into a relational triple. This structured triple is then used to query Wikidata: the system checks whether a property linking the subject and object exists. If the relation is confirmed in the knowledge base, the answer is marked as "correct."
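
A sketch of this stage, assuming a local CoreNLP installation (Stanza's client expects `CORENLP_HOME` to be set) and a toy sentence; the ASK query mirrors the "does any property link subject and object" check described above, with the QIDs hard-coded for illustration:

```python
from SPARQLWrapper import SPARQLWrapper, JSON
from stanza.server import CoreNLPClient

# Extract (subject, relation, object) triples with CoreNLP's OpenIE annotator.
with CoreNLPClient(
    annotators=["tokenize", "ssplit", "pos", "lemma",
                "depparse", "natlog", "openie"],
    be_quiet=True,
) as client:
    ann = client.annotate("Rome is the capital of Italy.")
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            print(triple.subject, "|", triple.relation, "|", triple.object)

def relation_exists(subj_qid: str, obj_qid: str) -> bool:
    """ASK Wikidata whether any direct statement links the two entities."""
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="fact-auditor-demo")
    sparql.setQuery(f"ASK {{ wd:{subj_qid} ?p wd:{obj_qid} . }}")
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]

# Q220 = Rome, Q38 = Italy: a link such as P1376 ("capital of") exists.
print(relation_exists("Q220", "Q38"))  # -> True
```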

Containerization & Environment: The entire application is defined in a Dockerfile, using karmaresearch/wdps2 as a base image. A setup.py script pre-downloads and caches all required models (from Hugging Face and Stanza) into the Docker image, ensuring that the environment is self-contained and ready to run without requiring further downloads.
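
The pre-caching step presumably looks something like the sketch below; the exact download calls are assumptions, with the model list taken from those named in this write-up:

```python
# setup.py-style pre-caching sketch: pull every model into the image at
# build time so the container can run fully offline.
import stanza
from huggingface_hub import snapshot_download

HF_MODELS = [
    "shahrukhx01/question-vs-statement-classifier",
    "PrimeQA/tydiqa-boolean-question-classifier",
    "dslim/bert-base-NER",
    "distilbert-base-uncased",  # assumed DistilBERT checkpoint
]

for repo_id in HF_MODELS:
    snapshot_download(repo_id)  # populates the local Hugging Face cache

stanza.download("en")       # caches the English Stanza pipeline
stanza.install_corenlp()    # fetches CoreNLP for the OpenIE stage
```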

Personal Learnings

This project was a deep dive into building a complex, real-world NLP application. A key takeaway was the power of combining different AI paradigms: using the creative, text-generation capabilities of LLMs while imposing the rigid, factual constraints of a symbolic knowledge base like Wikidata. I gained significant experience in managing multiple pre-trained models, orchestrating a multi-stage data processing pipeline, and designing heuristics to solve complex disambiguation challenges. Furthermore, containerizing a heavy, model-dependent application with Docker was a valuable lesson in creating reproducible and portable data science solutions.