Large Language Models do not Hallucinate, but Humans Sure do!

Dr M Maruf Hossain, PhD, GAICD
Mar 2

Hallucination in Large Language Models (LLMs) refers to the generation of content that may deviate from factual accuracy or the provided source content. While these instances may occur, it’s crucial to recognise that LLMs are primarily designed for general-purpose language understanding and generation, not for delivering absolute precision.

Originally published at LinkedIn Pulse on 7 January 2024.

Consider an example where an image generator, such as DALL-E, is prompted with “Generate a picture of a woman without eyebrows and with an enigmatic smile against a backdrop of a distant, misty landscape.” Despite the prompt’s resemblance to Leonardo da Vinci’s iconic “Mona Lisa”, the expectation is not to replicate this masterpiece. Instead, the goal is to produce something new, for instance by varying the artistic style so that the result is novel and distinctive.

This prompts the question: why should LLMs deviate from this paradigm? Both image generators and LLMs belong to the generative AI ecosystem, whose fundamental objective is to create new content. Consequently, expecting LLMs to furnish factual information is an erroneous premise from the outset.

Nevertheless, LLMs are routinely redeployed as information-finding tools. Repurposing a tool in this way often results in inconsistent functionality, efficiency challenges, security vulnerabilities, maintenance complexities, and a lack of dedicated support. To avoid these pitfalls, it’s often more effective to use tools specifically designed for the intended task. This ensures optimal performance and mitigates the risks of adapting tools for purposes they were not originally designed for.

However, the challenge lies in understanding and addressing the factors contributing to hallucinations in LLMs. Several technical reasons are responsible for these occurrences:

Lack of real-world knowledge. By design, LLMs lack access to real-world knowledge beyond their training data, hindering their ability to verify factual accuracy.
Inability to distinguish fact from fiction. The training data may contain inaccuracies or fictional content, making it difficult for LLMs to distinguish fact from fiction.
Biased or misleading training data. Biased or misleading training data can lead LLMs to produce results that are statistically plausible but inaccurate.

To maximise the utility of LLMs and minimise hallucinations, consider the following approaches:

Pre-defined Input Templates: Guide the model’s responses by employing pre-defined input templates, reducing the likelihood of hallucinations. For example, rather than posing a broad inquiry to the model, such as “Tell me about X,” a more refined approach is to seek specific information by framing the question as “What are the key facts about X according to your training data?” The narrower framing asks for specific facts rather than open-ended prose, which leaves the model less room to invent details.
Prompt Engineering: Craft prompts carefully to steer the model towards more accurate responses. For example, instructing the model with a directive such as “Provide a detailed, factual response to the following question” is likely to yield better results than a more generic prompt like “Answer this question.” This nuanced guidance emphasises the importance of specificity in eliciting accurate and comprehensive responses from the model.
Reasoning and Iterative Querying: Improve the consistency and accuracy of outputs by incorporating reasoning and iterative querying techniques. For example, instruct the model to give the rationale behind its responses, or ask the same question in several different ways and compare the answers. This makes the model’s reasoning more transparent and exposes answers that change from one phrasing to the next.
Retrieval-Augmented Generation (RAG): Leverage external knowledge sources to enhance responses and retrieve specific information from organisational databases. For example, the model can access pertinent documents within a database and use this information to formulate responses. This is a widely favoured method for extracting precise information from an organisational knowledge base and articulating it in natural language. Bing.ai uses a comparable strategy: it first interprets the user’s query, then searches online for relevant content, and finally constructs its response using only the pages it identified.
Adjusting the Temperature Parameter: A model’s temperature is a scalar value used to rescale the probability distribution the model predicts over its next token. In the context of LLMs, this parameter strikes a balance between adhering to the most likely continuation learned from training data and generating diverse or creative responses. Notably, the creative outputs are more susceptible to hallucinations. Generally, a lower temperature makes the model’s outputs more deterministic, while a higher temperature introduces greater diversity. In scenarios demanding accuracy, it is advisable to supply an information-rich context and set the temperature to zero, so that responses stay anchored in that context. This deliberate reduction in creativity serves to reduce the probability of hallucinations. The technique also combines well with RAG. The combination matters most when the knowledge base contains no close match for a query: the model should then decline to answer rather than generate a fictitious response from inadequate matches. This safeguards against misinformation in critical decision-making or customer-facing contexts.
Adopting Reinforcement Learning from Human Feedback (RLHF): Here the model is fine-tuned using human rankings of its earlier responses: a reward model is trained to predict those rankings, and the LLM is then optimised to produce responses that the reward model scores highly. This process is resource-intensive. Unless your team develops the foundational model internally, embarking on this route may present formidable challenges. RLHF entails creating a new model, which can introduce efficiency concerns, security vulnerabilities, maintenance complexities, and a lack of support from the original vendor. OpenAI has successfully employed this technique to enhance ChatGPT’s performance, demonstrating its viability when applied with prudent judgment in the right context.
Incorporating External Knowledge Sources: Cross-reference responses with trusted external databases to verify answers.
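
To make the iterative querying approach above concrete, here is a minimal sketch in Python. It asks the same question in several phrasings and accepts an answer only when most of the responses agree. The function name `consistent_answer`, the `min_agreement` threshold, and the `fake_model` stub are all hypothetical; a real deployment would replace the stub with a call to whichever model it uses.

```python
from collections import Counter

def consistent_answer(ask_model, prompts, min_agreement=0.6):
    """Ask several phrasings; return the majority answer, or None if answers disagree."""
    if not prompts:
        return None
    # Normalise answers so trivial differences in case or spacing do not count as disagreement.
    answers = [ask_model(p).strip().lower() for p in prompts]
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return answer
    return None  # too much disagreement: treat the answer as unreliable

# Hypothetical stand-in for a real model call, used only to make the sketch runnable.
CANNED = {
    "In which year did Apollo 11 land on the Moon?": "1969",
    "Apollo 11 landed on the Moon in what year?": "1969",
    "State the year of the Apollo 11 Moon landing.": "1968",  # simulated inconsistent reply
}

def fake_model(prompt):
    return CANNED[prompt]

prompts = list(CANNED)
print(consistent_answer(fake_model, prompts))                     # majority answer
print(consistent_answer(fake_model, prompts, min_agreement=0.9))  # stricter threshold
```

With the default threshold, two matching replies out of three are enough to accept the answer; raising the threshold makes the check reject the same set of replies. Agreement is only a signal of reliability, since a model can be consistently wrong.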
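
The RAG approach above, including the case where the knowledge base has no good match, can be sketched as follows. This is a toy illustration under stated assumptions, not a production design: the `DOCUMENTS` list, the stop-word set, the word-overlap scoring, and the `min_overlap` threshold are all hypothetical, and a real system would typically query a database or vector index instead. The sketch only builds the prompt; sending it to a model is left out.

```python
import re

# Hypothetical in-memory knowledge base standing in for an organisational database.
DOCUMENTS = [
    "Annual leave requests must be submitted two weeks in advance.",
    "The office is closed on public holidays.",
    "Expense claims are reimbursed within thirty days of approval.",
]

# Small illustrative stop-word list so common words do not create false matches.
STOPWORDS = {"a", "an", "the", "is", "are", "be", "of", "on", "in", "to", "what", "when"}

def tokenize(text):
    """Lower-case the text and return its set of non-stop-word tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def retrieve(question, documents, min_overlap=2):
    """Return documents sharing at least min_overlap words with the question, best first."""
    q = tokenize(question)
    scored = [(len(q & tokenize(d)), d) for d in documents]
    hits = [(score, d) for score, d in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits]

def build_prompt(question, documents):
    """Build a prompt grounded in retrieved passages, or None when nothing relevant is found."""
    passages = retrieve(question, documents)
    if not passages:
        return None  # caller should answer "I don't know" rather than query the model
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say you do not know.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}"
    )

print(build_prompt("When are expense claims reimbursed?", DOCUMENTS))
print(build_prompt("What is the capital of France?", DOCUMENTS))
```

The first question matches the expense policy and yields a grounded prompt; the second matches nothing, so the function returns `None` and the application can decline to answer instead of letting the model improvise.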
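
The effect of the temperature parameter described above can be seen in a short sketch of the standard softmax-with-temperature calculation. The scores in `logits` are made-up values for three candidate next tokens, and treating a temperature of zero as "always pick the top-scoring token" is a common convention assumed here, not a description of any particular vendor's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, rescaled by the temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding,
        # so all probability goes to the highest-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature falls, probability concentrates on the top-scoring token; as it rises, the distribution flattens and less likely tokens are sampled more often, which is where the extra diversity comes from.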

Concluding remark

While these strategies can help reduce the occurrence of hallucinations, complete elimination may not be feasible. It is essential to validate information generated by LLMs, acknowledging that their responses, while fluent and convincing, may not always be factually accurate. Additional research and validation should be conducted as needed to ensure the reliability of the information provided by LLMs.