From Promise to Pitfalls: 8 Key Takeaways from Our Generative AI Experience

Dr M Maruf Hossain, PhD, GAICD
Mar 3

Over the past two years, I have spearheaded multiple Generative AI (GenAI) initiatives to enhance customer service, streamline internal processes, and gain a competitive edge. These projects have spanned various applications, from deploying customer service chatbots to automating internal workflows.

Originally published at LinkedIn Pulse on 24 March 2024.

This journey has provided us with critical insights, revealing both the immense potential and the challenges of GenAI. While we have seen significant improvements in efficiency and customer satisfaction, we have also encountered challenges that underscore the need for a balanced approach. These include managing data privacy, addressing algorithmic bias, and ensuring robust governance frameworks.

Our experience has highlighted the importance of continuous refinement and human oversight to maintain and enhance the performance of GenAI systems. By treating GenAI with the same level of caution and rigour as any other AI technology, we can effectively mitigate risks while fully leveraging its potential.

Lesson 1: GenAI is Simply Another Form of AI

Both engineers and leadership initially viewed GenAI as a revolutionary technology primed to transform our operations. However, we soon recognised that GenAI, like any other AI technology, carries inherent risks in a large enterprise environment. Critical challenges such as data privacy, algorithmic bias, and the need for robust governance frameworks are equally pertinent for GenAI.

Recommendation

We emphasise the importance of treating GenAI with the same level of caution and rigour as any other AI technology. This approach ensures that we address potential risks effectively while leveraging GenAI’s capabilities to their fullest potential.

Lesson 2: Quick Initial Performance, But Refinement Takes Time

Our initial deployment of GenAI for customer service chatbots yielded promising results. The chatbots effectively managed basic inquiries and provided quick responses, improving customer satisfaction. Reaching this initial level of performance was relatively quick and straightforward. Improving the system beyond that point, however, required considerable time and manual effort for evaluation and enhancement. Our data science team has been continuously refining system prompts, addressing edge cases, and improving the system’s ability to handle more complex queries accurately. This ongoing effort is essential to maintaining and improving the system’s performance.

Recommendation

We must continue investing in these refinement processes to sustain and enhance the chatbots’ capabilities. This will ensure the chatbots continue to meet and exceed customer expectations.

Lesson 3: Unreliable for Critical Tasks

During our pilot program, we observed that GenAI models, although impressive at first glance, are unreliable for critical tasks. When we used them to generate financial reports, the drafts contained subtle errors that could have led to significant financial misinterpretations.

For an information retrieval task, we used the Retrieval-Augmented Generation (RAG) technique with a Large Language Model (LLM), aiming to move the user experience from merely searching for information to receiving tailored responses to specific queries or tasks. The knowledge base comprised several thousand HTML pages detailing organisational processes and procedures, and we observed subtle procedural errors in over five hundred of the responses generated from it. These errors, including process mix-ups and missing steps, posed significant risks of misinterpretation.

Recommendation

To mitigate these risks, we recommend limiting GenAI’s use to non-critical tasks. Specifically, GenAI can be effectively utilised to generate content to enhance retrieval systems and to draft marketing content and internal communications, where the impact of potential errors is minimal.

Lesson 4: Dependency on Retrievers for RAG

In a RAG system, the effectiveness of the GenAI model relies heavily on the retriever’s ability to provide precise information. Our current retrievers use vector embeddings and primarily conduct similarity searches, which in practice behave more like ‘keyword searches’ than context-based searches. Consequently, the quality of the generated responses is directly tied to the retriever’s capability to locate relevant information.
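To make the ‘keyword search’ behaviour concrete, here is a minimal sketch of similarity-based retrieval. It is not our production retriever: the embed function is a toy bag-of-words stand-in for a real embedding model, and the documents, function names, and query are all hypothetical. It ranks documents by cosine similarity to the query and returns the top k.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    A real retriever would call a trained embedding model here; this
    simplification is an assumption made to keep the sketch self-contained.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical procedure titles, for illustration only.
docs = [
    "How to reset a password for staff accounts",
    "How to reset a password for customer accounts",
    "Annual leave approval process for managers",
]
top = retrieve("reset staff password", docs, k=2)
```

In this toy setup both password procedures rank at the top because they share surface terms with the query, even though only the staff procedure answers it. Whatever the retriever returns is what the model has to work with.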

Recommendation

To enhance the performance of GenAI models, we should focus on improving the retriever’s context-based search capabilities. This will ensure more accurate and relevant information retrieval, thereby improving the overall quality of the responses generated by the GenAI models.

Lesson 5: Challenges with Knowledge Graph-Based Retrievers

Although knowledge graph-based retrievers excel at preserving text context, they face significant scalability challenges with large document bases. For instance, with just 500 HTML pages, the schema for the underlying Neo4j database surpassed the context length that GPT-4 Turbo could handle. This limitation renders knowledge graphs impractical for large-scale document retrieval.
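A simple pre-flight check can show when a graph schema is about to outgrow the model. The sketch below is illustrative only: it assumes the schema is available as plain text, uses a rough four-characters-per-token heuristic rather than a real tokenizer, and takes 128,000 tokens as GPT-4 Turbo's context window. The function names, the reserved-token figure, and the sample schema lines are hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # GPT-4 Turbo's context window
CHARS_PER_TOKEN = 4              # rough heuristic for English text (assumption)


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(schema_text: str, reserved_tokens: int = 4_000) -> bool:
    """Check whether the schema fits, leaving room for the question and answer.

    reserved_tokens is a hypothetical allowance for the user query and the
    model's response.
    """
    return estimate_tokens(schema_text) + reserved_tokens <= CONTEXT_WINDOW_TOKENS


# Made-up schema lines, repeated to simulate a small and a large graph.
small_schema = "(:Process)-[:HAS_STEP]->(:Step)\n" * 100
large_schema = "(:Process)-[:HAS_STEP]->(:Step)\n" * 20_000

small_ok = fits_in_context(small_schema)
large_ok = fits_in_context(large_schema)
```

A real check would count tokens with the tokenizer that matches the model, but even a rough estimate like this can flag the problem before a request fails.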

Recommendation

We recommend exploring alternative solutions that can handle large-scale document retrieval while maintaining context integrity. This approach will ensure more efficient and scalable information retrieval systems.

Lesson 6: Sensitivity to Noise and Irrelevant Information

Our data scientists have identified a critical limitation in GenAI models: they are not particularly resilient to noise and struggle with excessive irrelevant information. For instance, when analysing customer feedback, the models were often overwhelmed by unrelated comments. In the RAG system, when provided with procedures for the same operation across different user groups or systems, the models often mixed up steps, omitted crucial steps, or referenced steps that were never stated.
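One way to reduce this kind of noise is to filter retrieved passages by metadata before they reach the model. The sketch below is a possible mitigation under stated assumptions, not a description of what we built: it assumes each chunk was tagged with a user group at indexing time, and the Chunk class, field names, and sample texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    user_group: str  # hypothetical metadata tag attached at indexing time


def filter_by_group(chunks: list[Chunk], user_group: str) -> list[Chunk]:
    """Keep only the chunks written for the requesting user's group."""
    return [chunk for chunk in chunks if chunk.user_group == user_group]


def build_context(chunks: list[Chunk]) -> str:
    """Join the remaining chunks into the context passed to the model."""
    return "\n\n".join(chunk.text for chunk in chunks)


# Two procedures for the same operation, aimed at different user groups.
retrieved = [
    Chunk("Staff: 1. Open the HR portal. 2. Select 'Reset password'.", "staff"),
    Chunk("Customers: 1. Open the app. 2. Tap 'Forgot password'.", "customer"),
]
context = build_context(filter_by_group(retrieved, "staff"))
```

With the customer procedure removed, the model only sees steps that apply to the person asking, which removes one source of mixed-up steps.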

Recommendation

Despite these challenges, GenAI models perform adequately when instructed to rewrite content, such as rephrasing policy documents or generating summaries of meeting notes. Therefore, we recommend leveraging GenAI for these specific tasks to maximise their effectiveness while minimising risks.

Lesson 7: Variability in LLM Capabilities

Our data scientists observed significant variability in the capabilities of different LLMs. Sampling parameters such as temperature, top_p, and top_k yielded markedly different outcomes from one model to another, even when set to identical values. This variability necessitated extensive experimentation by our team to identify the optimal settings for specific use cases.
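Part of the reason is that these parameters reshape each model's own next-token distribution, so identical values act on different starting points. The sketch below applies temperature scaling, then top_k, then top_p (nucleus) filtering to two made-up sets of logits. The logits, token names, and function names are illustrative assumptions, and real inference libraries may apply these filters in a different order.

```python
import math


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert logits to probabilities; temperature must be greater than 0."""
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def apply_top_k_top_p(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalise what remains."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def candidate_tokens(logits, temperature, top_k, top_p):
    probs = softmax_with_temperature(logits, temperature)
    return apply_top_k_top_p(probs, top_k, top_p)


# Hypothetical logits from two different models for the same prompt.
model_a_logits = {"approve": 4.0, "escalate": 1.0, "reject": 0.5}
model_b_logits = {"approve": 1.2, "escalate": 1.0, "reject": 0.9}

settings = dict(temperature=0.7, top_k=3, top_p=0.9)
a = candidate_tokens(model_a_logits, **settings)
b = candidate_tokens(model_b_logits, **settings)
```

With these numbers, the confident model A is left with a single candidate token, while the flatter model B keeps all three. The same settings therefore give near-deterministic output on one model and varied output on the other.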

Recommendation

To maximise the effectiveness of LLMs, it is crucial to invest in thorough experimentation and tuning of these parameters for each model and use case. This approach will enable us to achieve optimal performance tailored to our unique requirements.

Lesson 8: Summarisation Limitations

When summarising large volumes of text, the GenAI models often appear to draw on arbitrary segments of the source, producing summaries that are coherent but not always accurate. This issue was particularly pronounced with lengthy procedural documents, where critical details were frequently missed. To address this limitation, we implemented additional layers of human review to ensure the accuracy and completeness of the summaries. This step was essential to maintain the integrity of the information provided.
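A lightweight automated check can help prioritise which summaries reviewers look at first. The sketch below is an assumption about how such a check might look, not our review process: it expects a reviewer-supplied list of critical terms and flags any summary that omits one. The function names, terms, and sample summary are hypothetical, and a simple substring match like this supports human review rather than replacing it.

```python
def missing_terms(summary: str, critical_terms: list[str]) -> list[str]:
    """Return the critical terms that do not appear in the summary
    (case-insensitive substring match)."""
    lowered = summary.lower()
    return [term for term in critical_terms if term.lower() not in lowered]


def needs_priority_review(summary: str, critical_terms: list[str]) -> bool:
    """Flag a summary for early human review if any critical term is missing."""
    return bool(missing_terms(summary, critical_terms))


# Hypothetical terms a reviewer considers essential for this procedure.
critical = ["manager approval", "30 days", "audit log"]
summary = "Requests need manager approval and are recorded in the audit log."

gaps = missing_terms(summary, critical)
flagged = needs_priority_review(summary, critical)
```

Here the summary reads well but drops the 30-day limit, so it would be flagged for a closer look.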

Recommendation

To enhance the reliability of GenAI-generated summaries, we recommend continuing the practice of human oversight, especially for critical documents. This approach will ensure that all essential details are accurately captured and conveyed.

Concluding Remarks

Our journey with GenAI over the past two years has been both enlightening and transformative. While we have successfully leveraged GenAI to enhance customer service, streamline internal processes, and gain a competitive edge, we have also encountered significant challenges that underscore the need for a balanced approach.

To fully harness the potential of GenAI while mitigating its risks, we must continue to treat it with the same caution and rigour as any other AI technology. By focusing on targeted applications, investing in continuous refinement, and maintaining robust oversight, we can leverage GenAI to drive innovation and operational excellence.