Finding the Needle in the Haystack: AI Tool To Improve the Usability of Government Reports
Combing through countless PDF reports for hours in search of a piece of relevant information is no one’s idea of an interesting day at work. Tedious, overwhelming, soul-crushing, maybe. Engaging? Not so much.
Dedicated public servants, and many others, do it often anyway, in service to some larger goal: to make the case for a new policy, to advocate for funding or to explain a position.
Recently, a team of graduate students, working together in a course and drawn in part from the Integrated Innovation Institute, came up with a generative AI application that helps researchers find the information they seek in a matter of seconds, not hours.
Their tool, GovScan, provides government workers the ability to locate the proverbial needle in a haystack.
Team members Davis Craig, Aakash Dolas, Tyler Faris and Eashwari Samant spent seven weeks creating a tool that would improve the usability of government reports. Craig and Faris both study in the Master of Science in Public Policy and Management program, and Dolas and Samant study in the Master of Integrated Innovation for Products and Services program.
Maya Mechenbier, a project lead for the United States Digital Service, shared a real-life challenge she has faced in government for the GovScan team to solve.
For this scenario, students connected with government workers tasked with reviewing reports for child care funding from all 50 states; each report might contain hundreds of pages. Policy analysts needed to find particular data points within those reports in order to be able to analyze and compare the effectiveness of programs.
“Whether it’s for Medicaid or the Child Care Development Fund subsidy dollars, states’ plans are typically stored and made public in a PDF form,” explained Mechenbier. “Fifty states might do 50 different things with their programs.” The magnitude and variation can make it hard for a policy analyst to absorb such large quantities of data, determine who might be addressing certain rules in certain ways, or understand trends emerging across the country.
The student team created a working model that sifts through thousands of pages of those reports to answer analysts’ questions. For example, an analyst might ask GovScan, “Which states provide child care funding for low-income, single-parent households?” The tool scans all the PDF reports in its database and provides a list of results, complete with the source citations.
“GovScan is like the ‘Control F’ search function on steroids,” explained Craig.
Why It’s a Game-Changer
The tool has two main benefits. The first is efficiency.
Policy analysts told the team that they typically spend three to four hours looking for data points within these reports. The GovScan platform gives an answer within about 30 seconds.
“It’s not efficiency for efficiency’s sake,” said Faris. “It’s efficiency for better decision-making and better management.”
Another challenge for analysts is knowing whether the haystack even contains a needle.
“People we interviewed were frustrated with the inherent uncertainty. It’s one thing to know that what you’re looking for is in a particular report and it’s just taking time to find it,” Samant said, but spending hours in search of information that doesn’t exist feels like a waste of time. GovScan helps analysts use their time more effectively by identifying which reports contain the information they need.
The GovScan application was designed not to replace humans, but to serve as a tool to help them work more efficiently and effectively.
“It reduces the cognitive load for researchers,” explained Dolas. “The saved time and effort free up humans to spend their time and attention on analyzing and understanding the results.”
The application is distinct from other search tools in a couple of important ways.
Platforms like Google or Bing search the internet for information. Large language models such as ChatGPT or Bard also rely on the internet as a data source.
Conversely, GovScan searches within a single, secure database of PDF files provided by an organization. The distinction is important because it eliminates false information as part of the data source.
GovScan has another key difference from LLMs like ChatGPT. GovScan’s results are linked to the source material. When the user receives an answer to a prompt, they can click on the link for each fact and find the exact location of the source of the information within the original report.
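One way to picture how an answer stays tied to its sources is a lookup from citation markers back to a report and page. The sketch below is only an illustration: the `[n]` marker format, the `resolve_citations` function and the file names are hypothetical, not taken from GovScan's code.

```python
import re


def resolve_citations(answer: str, sources: dict[int, tuple[str, int]]) -> list[tuple[int, str, int]]:
    """Return (marker, report, page) for each known [n] marker, in order of first appearance."""
    seen: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        # Skip markers that have no matching source and markers already recorded.
        if number in sources and number not in seen:
            seen.append(number)
    return [(number, *sources[number]) for number in seen]


# Hypothetical sources: marker number -> (report file, page number).
sources = {1: ("state_b.pdf", 40), 2: ("state_a.pdf", 12)}
answer = "State A funds single-parent households [2]. State B does too [1][2]."
for marker, report, page in resolve_citations(answer, sources):
    print(f"[{marker}] {report}, page {page}")
```

Each marker in the answer resolves to one place in one report, which is what lets a reader check a fact against the original document.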
How It Works
Craig uses the analogy of a library to explain Retrieval Augmented Generation (RAG), the technology behind GovScan.
“Imagine if you went to a library and there’s a big pile of books on the ground. It would be really hard to find the specific information you want,” Craig said. “That’s the issue with unstructured data, with all those PDF reports. So what we do is basically what librarians do: take all the books and index them so that they’re organized neatly.”
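The indexing step Craig describes can be sketched in a few lines. This is a minimal illustration, assuming each report's text has already been extracted page by page; the `Chunk` record, the `index_reports` function and the 50-word passage size are hypothetical choices, not details of GovScan itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    report: str  # hypothetical report identifier, such as a file name
    page: int    # 1-based page number within the report
    text: str    # the passage itself


def index_reports(reports: dict[str, list[str]], max_words: int = 50) -> list[Chunk]:
    """Split every page of every report into passages of at most max_words words."""
    chunks: list[Chunk] = []
    for report, pages in reports.items():
        for page_number, page_text in enumerate(pages, start=1):
            words = page_text.split()
            for start in range(0, len(words), max_words):
                passage = " ".join(words[start:start + max_words])
                chunks.append(Chunk(report, page_number, passage))
    return chunks


# Made-up example: one report with two short pages.
reports = {"state_a.pdf": ["one two three four five", "six seven"]}
for chunk in index_reports(reports, max_words=3):
    print(chunk)
```

Keeping the report name and page number on every passage is what makes it possible to point back to the source later.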
The next step is to do “semantic search.” Natural language processing engineers, in this case Craig, use a technique called vector embeddings to capture the semantic meaning of the question and then scan those indexed reports to find which reports are most relevant, and which data points within those reports are most applicable to the user’s query. The application functions like a librarian helping someone use the card catalog to locate a particular book, with a particular piece of information in it.
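The ranking idea behind semantic search can be shown with a small sketch. One loud assumption: a real system would get its vectors from a trained embedding model that captures meaning, while the `embed` function here is a toy stand-in that only counts words, so it matches on shared vocabulary rather than meaning. The function names and sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, passages: list[tuple[str, str]], top_k: int = 2) -> list[tuple[float, str, str]]:
    """Rank (source, text) passages by similarity to the query, best first."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(text)), source, text) for source, text in passages]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Made-up passages, each tagged with a hypothetical source location.
passages = [
    ("state_a.pdf p.12", "Child care funding for single-parent households"),
    ("state_b.pdf p.40", "Highway maintenance budget"),
]
for score, source, text in search("child care funding", passages, top_k=1):
    print(f"{score:.2f}  {source}  {text}")
```

Swapping `embed` for a real embedding model would leave the ranking logic unchanged: embed the question, embed the passages, and keep the closest matches.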
Then the application puts the results together, gives them to an LLM, and the LLM is instructed to handle the information in a way that meets the specific use case. With GovScan, the model is told to summarize the results, provide citations for the information and link to the information source.
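That final hand-off can be sketched as assembling a prompt from the retrieved passages. This is an assumption-heavy illustration: the instruction wording, the `build_prompt` function and the passage fields are hypothetical, and the call to an actual LLM is left out.

```python
def build_prompt(question: str, passages: list[dict]) -> str:
    """Combine retrieved passages and instructions into one prompt string."""
    lines = [
        "Answer the question using only the numbered sources below.",
        "Summarize the findings and cite each fact with its source number, for example [1].",
        "",
        "Sources:",
    ]
    for number, passage in enumerate(passages, start=1):
        # Each source line carries its report and page so the answer can link back to it.
        lines.append(f"[{number}] {passage['report']}, page {passage['page']}: {passage['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Made-up retrieved passage with a hypothetical report name and page.
retrieved = [
    {"report": "state_a.pdf", "page": 12, "text": "Funds child care for single-parent households."},
]
print(build_prompt("Which states fund child care for single-parent households?", retrieved))
```

Numbering the sources in the prompt gives the model a simple way to cite, and gives the application a way to turn each citation back into a link.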
What Happens Next
Craig, Dolas, Faris and Samant have made their work available under an MIT open-source license, including the code they created for the query engine and data pipeline that enable GovScan’s operation. They are exploring options for further developing the tool.
The student team is careful to note that the application needs additional testing, but they are optimistic that GovScan is a workable tool that can help research officers and policy analysts do their jobs better.
“The tool might not seem all that flashy, but the utility of it against the sheer volume of data is significant,” said Goranson. “The team took the time to really understand the challenges facing their partner and then created something that directly addressed the problem.”
Mechenbier said the tool could be useful across many disciplines and for any federal agency that must process and analyze large quantities of data from PDF files.
“Their tool is something that could really improve the lives of policymakers in a tangible way, allowing these creative, smart people to do the analysis and writing they really want to be doing,” Mechenbier said.