I work as an AI engineer in a specific niche: document automation and information extraction. In my industry, the use of large language models (LLMs) has introduced several challenges around hallucinations. Imagine an AI misreading an invoice amount as $100,000 instead of $1,000, resulting in a 100x overpayment. Faced with risks like that, preventing hallucinations becomes a critical part of building robust AI solutions. These are some of the core principles I focus on when designing solutions that may be susceptible to hallucinations.
There are a variety of ways to incorporate human oversight into AI systems. In some workflows, the extracted information is always shown to a human for review; for example, a parsed resume may be displayed to the user before it is submitted to an applicant tracking system (ATS). More often, the extracted information is added to the system automatically and is flagged for human review only if there are potential issues.
A critical part of any AI platform is deciding when to include human oversight, which often involves various types of validation rules:
1. Simple rules, such as checking that the line-item amounts sum to the invoice total (see the sketch after this list).
2. Lookups and integrations, such as validating the total amount for a purchase order in an accounting system or checking payment details against a supplier’s previous records.
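As a rough illustration of rule 1 (and the purchase-order lookup from rule 2), here is a minimal sketch of the kind of deterministic check that decides whether an extracted invoice gets flagged. The field names, tolerance, and `po_total` lookup are hypothetical, not a real schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LineItem:
    description: str
    amount: float

@dataclass
class ExtractedInvoice:
    line_items: list[LineItem]
    total: float

def review_reasons(invoice: ExtractedInvoice, po_total: Optional[float] = None,
                   tolerance: float = 0.01) -> list[str]:
    """Return reasons to flag the invoice for human review; empty means it can pass through."""
    reasons = []
    # Rule 1: line-item amounts must sum to the stated invoice total.
    if abs(sum(item.amount for item in invoice.line_items) - invoice.total) > tolerance:
        reasons.append("line items do not sum to invoice total")
    # Rule 2: if the purchase order can be looked up, the totals should agree.
    if po_total is not None and abs(invoice.total - po_total) > tolerance:
        reasons.append("invoice total does not match purchase order")
    return reasons
```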
This process works well, but we also don't want an AI that constantly trips these safeguards and forces manual human intervention. Hallucinations can defeat the purpose of using AI if the safeguards are triggered all the time.
One way to avoid hallucinations is to use a small language model (SLM) that is "extractive," meaning the model labels parts of the document and collects those labels into a structured output. Instead of defaulting to an LLM for every problem, we recommend using an SLM wherever possible. For example, in resume parsing for job boards, waiting 30+ seconds for an LLM to process a resume is often unacceptable. For this use case we have found that an SLM can return results in 2-3 seconds with higher accuracy than larger models such as GPT-4o.
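To make "extractive" concrete, here is a minimal sketch using a token-classification pipeline: the model labels spans in the raw text, and those spans are collected into a structured record. The model name is a placeholder for a small fine-tuned SLM, and the label set is an assumption.

```python
from transformers import pipeline

# Hypothetical small model fine-tuned to tag fields like NAME, EMAIL, TOTAL, etc.
tagger = pipeline(
    "token-classification",
    model="your-org/field-tagger-slm",   # placeholder, not a real checkpoint
    aggregation_strategy="simple",       # merge sub-word tokens into whole spans
)

def extract_fields(text: str) -> dict[str, list[str]]:
    """Collect labelled spans into a structured output.

    Every value is a span taken from `text` itself, so the model cannot
    invent an amount or a name that is not on the page.
    """
    fields: dict[str, list[str]] = {}
    for span in tagger(text):
        fields.setdefault(span["entity_group"], []).append(span["word"])
    return fields
```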
Example of a pipeline
At our startup, a document can be processed by up to 7 different models, only 2 of which might be LLMs. That is because LLMs are not always the best tool for the job. Some steps, such as retrieval-augmented generation, rely on small multimodal models to produce useful embeddings for search. The first step, detecting whether something is a document at all, uses a small, very fast model that achieves 99.9% accuracy. It is important to break the problem down into small chunks and work out where LLMs are genuinely the best fit. That alone reduces the opportunities for hallucination.
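The exact pipeline is specific to our product, but the overall shape looks something like the sketch below: cheap, specialised models act as early gates, and generative models are reserved for the few steps that need them. Every function here is a stub standing in for a separate model or rule set.

```python
def is_document(data: bytes) -> bool:
    """Stub for the small, very fast is-this-a-document classifier."""
    return len(data) > 0

def classify_document(data: bytes) -> str:
    """Stub for a small document-type classifier (invoice, resume, ...)."""
    return "invoice"

def run_extraction(data: bytes, doc_type: str) -> dict:
    """Stub for the extractive SLM that does the bulk of the field extraction."""
    return {"total": "$1,000.00"}

def validate(fields: dict, doc_type: str) -> list[str]:
    """Stub for deterministic validation rules and lookups."""
    return []

def process_document(data: bytes) -> dict:
    """Hypothetical multi-model pipeline: an LLM would be one stage among
    several, invoked only where generation is genuinely needed."""
    if not is_document(data):
        return {"status": "rejected", "reason": "not a document"}
    doc_type = classify_document(data)
    fields = run_extraction(data, doc_type)
    issues = validate(fields, doc_type)
    return {
        "status": "needs_review" if issues else "auto_approved",
        "fields": fields,
        "issues": issues,
    }
```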
Distinguishing between hallucinations and mistakes
I think it is important to distinguish between hallucinations (the model inventing information that is not there) and mistakes (the model misinterpreting information that is there). For example, picking the wrong dollar amount as the receipt total is a mistake, while making up a dollar amount that does not exist is a hallucination. Extractive models can only make mistakes; generative models can make both mistakes and hallucinations.
When using generative models, we need a way to eliminate hallucinations. The answer is grounding: any technique that forces a generative AI model to reference authoritative information to justify its output. How grounding is managed depends on the risk tolerance of each project.
For example, a company with a general-purpose inbox might want to identify action items. Normally, emails that require action are sent directly to an account manager, while a general inbox filled with invoices, spam, and simple replies ("Thank you," "Confirmed," etc.) holds far too many messages for a human to read. So what happens when an action item is accidentally sent to this general inbox? It is routinely missed. If the model makes occasional mistakes but is broadly accurate, it is already doing better than nothing. In this case the tolerance for mistakes and hallucinations can be high.
In other situations, a particularly low risk tolerance is required. Consider financial documents and straight-through processing, where extracted information is added to the system automatically with no human review. For example, a company may not allow invoices to be added to the accounting system automatically unless (1) the payment amount exactly matches the amount on the purchase order and (2) the payment method matches the supplier's previous payment method.
Even when the stakes are low, I still err on the side of caution. Whenever I am working on information extraction, I follow a simple rule:
Any text extracted from a document must match the text found in that document exactly.
This gets tricky when the information is structured (e.g. tables), especially since PDFs convey no reliable information about the order of words on the page. For example, if a line-item description is split across multiple lines, the goal is to draw a coherent bounding box around the extracted text regardless of the left-to-right order of the words (or right-to-left order in some languages).
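A minimal way to enforce the rule above is a post-extraction check that every extracted value appears verbatim in the source text, with whitespace normalised so that line breaks inside a table cell do not cause false rejections. This sketch ignores the bounding-box handling just described and assumes plain extracted strings.

```python
import re

def normalise(text: str) -> str:
    """Collapse all whitespace so line breaks inside a field don't break matching."""
    return re.sub(r"\s+", " ", text).strip()

def is_verbatim(extracted_value: str, document_text: str) -> bool:
    """The extracted value must be a substring of the document, modulo whitespace."""
    return normalise(extracted_value) in normalise(document_text)

def drop_unsupported(fields: dict[str, str], document_text: str) -> dict[str, str]:
    """Keep only fields whose values can actually be found on the page;
    anything else is treated as a potential hallucination and flagged."""
    return {k: v for k, v in fields.items() if is_verbatim(v, document_text)}
```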
Forcing a model to point to the exact text in a document is called "strong grounding." Strong grounding is not limited to information extraction. For example, a customer-service chatbot might be required to quote (verbatim) from standardized responses in an internal knowledge base. This is not always ideal, since a standardized response may not actually answer the customer's question.
Another tricky situation is when information has to be inferred from context. For example, a medical assistant AI might infer the presence of a condition from its symptoms when the condition is never explicitly mentioned. Identifying where those symptoms are mentioned is a form of "weak grounding": the justification for the output must be present in the context, but the output itself has to be synthesized from that information. A further grounding step would be to force the model to look up the condition and justify that those symptoms are relevant to it. This may still only be weakly grounded, because symptoms can often be described in many different ways.
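Weak grounding can still be checked mechanically to some extent: ask the model to return not just the inferred condition but the verbatim passages it relied on, then verify that those passages really exist in the context. The prompt wording and JSON shape below are assumptions, not a tested medical workflow.

```python
import json
import re

PROMPT = (
    "Read the clinical note below. If a condition can be inferred, reply with JSON: "
    '{"condition": "<inferred condition>", "evidence": ["<verbatim quote from the note>", ...]}. '
    'If nothing can be inferred, reply with {"condition": null, "evidence": []}.\n\nNote:\n'
)

def _norm(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

def check_weak_grounding(model_reply: str, note: str) -> dict:
    """Accept an inferred condition only if every quoted evidence span
    actually appears in the note it was supposedly drawn from."""
    answer = json.loads(model_reply)
    quotes = answer.get("evidence", [])
    missing = [q for q in quotes if _norm(q) not in _norm(note)]
    if answer.get("condition") and (missing or not quotes):
        raise ValueError(f"ungrounded inference; unverifiable quotes: {missing}")
    return answer
```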
As AI takes on increasingly complex problems, grounding gets harder to apply. For example, if a model is supposed to reason about or infer information from context, how do you ground the output? Here are some considerations for adding grounding to complex problems:
- Identify complex decisions that can be broken down into a set of rules. Instead of having the model generate the final decision, have it generate the components of that decision, then use the rules to derive the result (a sketch of this pattern follows after this list). (A caution: this can sometimes make hallucinations worse, since asking the model several questions gives it several opportunities to hallucinate. Asking a single question may be better, but we have found that current models are generally worse at complex multi-step reasoning.)
- If something can be expressed in many ways (e.g. descriptions of symptoms), a first step might be to have the model tag and normalize the text (commonly called "coding"), which can open up opportunities for stronger grounding.
- Restrict the output to a very specific structure by defining "tools" the model can call. We do not want to execute arbitrary code generated by the LLM; we want to define the tools the model may call and constrain what goes into them.
- Ground tool usage as well, where possible, for example by validating responses against the context before sending them to downstream systems.
- Is there a way to verify the final output? If hand-written rules are not feasible, can you craft a verification prompt? (And apply the rules above to the verification model too.)
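As a sketch of the first and third points above: the model is only asked to fill in the low-level components of the decision in a fixed schema (via constrained decoding or a tool call), and the final routing decision is computed by plain rules. The schema, thresholds, and routing labels are all hypothetical.

```python
from typing import Literal
from pydantic import BaseModel

class InvoiceComponents(BaseModel):
    """The model fills in these components; it never emits the final decision."""
    po_number: str
    total: float
    payment_method: Literal["bank_transfer", "card", "cheque", "unknown"]

def route(c: InvoiceComponents, po_total: float, known_method: str) -> str:
    """Deterministic rule layer: the decision is derived from the components,
    not generated. `po_total` and `known_method` come from the accounting
    system, not from the LLM."""
    if c.payment_method == "unknown" or c.payment_method != known_method:
        return "human_review"
    if abs(c.total - po_total) > 0.01:
        return "human_review"
    return "auto_approve"
```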
To sum up my approach:
- In information extraction, never tolerate output that cannot be found in the original context.
- Follow extraction with a verification step to catch mistakes and hallucinations.
- Everything beyond that is risk assessment and risk minimization.
- Break complex problems down into smaller steps and check whether an LLM is needed at each one.
- For complex problems, take a systematic approach to identifying verifiable tasks:
  - Strong grounding forces the LLM to quote directly from trusted sources. Prefer strong grounding whenever you can use it.
  - Where strong grounding is not possible, weak grounding still requires the LLM to refer to trusted sources, but allows synthesis and inference.
  - If you can break a problem into smaller tasks, apply strong grounding to each task wherever possible.