Published on 10/09/2024 | Written by Heather Wright
AI summaries performed lower on all criteria in ASIC trial…
AI – at least for summarising information – may be creating more work for people rather than easing workloads, according to a trial by Australian regulator ASIC.
The Australian Securities and Investments Commission (ASIC) trialled Meta’s Llama2-70B large language model to summarise public submissions made to an external Parliamentary Joint Committee inquiry, over a five-week period which included selection of the model.
Final results of the trial, which ran over January and February 2024, showed AI summaries performed lower on all criteria compared to human summaries.
“The findings support the view that GenAI should be positioned as a tool to augment and not replace human tasks,” ASIC said in response to questions from the Select Committee on Adopting Artificial Intelligence.
Summarising information is one of the key use cases being touted for GenAI, but the trial confirms what many users of the technology have suspected.
The work of Llama2-70B – deemed the most promising model for the task – was assessed alongside summaries created by ASIC staff. The five assessors weren’t told AI was involved at all, though once informed that the summaries were AI generated, three said they had suspected it was an AI trial.
Out of a maximum of 75 points, human summaries scored 61 – or 81 percent. The aggregated GenAI summaries came in at just 35, or 47 percent.
“Assessors generally agreed that AI outputs could potentially create more work if used (in current state), due to the need to fact check outputs, or because the original source material actually presented information better.”
One of the most significant issues was GenAI’s limited ability to pick up nuance or context, with one assessor noting of an overall summary provided by the AI: ‘… it didn’t pick up the issue in a nuanced way. I would have found it difficult to even use an output to craft a summary…’
Another noted that ‘it was wordy and pointless – just repeating what was in the submission’.
Summaries also included incorrect information, such as analysis which didn’t come from the document supposedly being summarised, while in other cases relevant information – and even the central point of the submission – was missed.
Giving minor points undue prominence was also an issue, with one assessor noting the model ‘made strange choices about what to highlight’; summaries also included irrelevant information from the submissions.
The report notes that summarising a document can involve multiple actions, such as answering questions, finding references or imposing word limits, and that the selected LLM performed strongly on some actions and less capably on others.
The final assessment results show that, out of a possible 15, the AI-generated summaries achieved an aggregated score of 10 for coherency/consistency. That was the top-ranking category for the AI summaries, and put them just behind the human result of 12. When it came to identifying recommendations on how conflicts of interest should be regulated, the AI scored a lowly five, though human summaries also scored low, at just eight.
Asked to provide a summary of mentions of ASIC with brief context – without quoting the original query, and giving the final answer in a concise, human-like response – the AI again scored a dismal five, versus 15 for human summaries.
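ASIC’s report doesn’t reproduce the prompts it used, but a summarisation task bundling constraints like those might look something like the sketch below. This is a hypothetical illustration only, using Meta’s Llama 2 70B chat model on Amazon Bedrock (the trial was run with AWS, though the report doesn’t say which service, model configuration or prompts ASIC actually used).

```python
# Hypothetical sketch only - not ASIC's actual prompt or pipeline.
# Assumes AWS credentials and access to Meta's Llama 2 70B chat model
# on Amazon Bedrock (model ID meta.llama2-70b-chat-v1).
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def summarise_submission(submission_text: str) -> str:
    # One summarisation task can bundle several distinct actions:
    # answering a question (mentions of ASIC), imposing a word limit,
    # and constraining the style of the final answer.
    prompt = (
        "[INST] Read the public submission below. Summarise every mention "
        "of ASIC, with brief context for each. Do not quote these "
        "instructions. Give the final answer as one concise, human-like "
        "paragraph of no more than 150 words.\n\n"
        f"Submission:\n{submission_text} [/INST]"
    )
    response = bedrock.invoke_model(
        modelId="meta.llama2-70b-chat-v1",
        body=json.dumps({
            "prompt": prompt,
            "max_gen_len": 512,
            "temperature": 0.2,  # keep generations close to the source text
        }),
    )
    return json.loads(response["body"].read())["generation"]
```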
In May, ASIC chair Joe Longo labelled the AI-generated summaries ‘bland’ when he appeared before the committee.
“It didn’t really capture what the submissions were saying, while the human was able to extract nuances and substance,” he said.
The report does note limitations of the trial, including the limited timeframe available to optimise the model.
“ASIC has taken a range of learnings from the PoC, including the value of robust experimentation, the need for collaboration between subject matter experts and data science specialists, the necessity of carefully designed prompt engineering, and given the rapidly evolving AI landscape, the importance of providing a safe environment that allows for rapid experimentation to ensure ASIC has continued understanding of the various use cases for AI, including its shortcomings.”
ASIC stressed in the documents released that the work, conducted with AWS, was purely a proof-of-concept trial, and was not used for ASIC’s regulatory or operational purposes.
Despite the apparent failure of the model in this trial, ASIC appears to be remaining open-minded about future use, noting that the technology is rapidly advancing and future models are likely to improve the performance and accuracy of results.