The Range of AI Capabilities in Document Search and Discovery

Daniel Faggella

Daniel Faggella is Head of Research at Emerj. Called upon by the United Nations, World Bank, INTERPOL, and leading enterprises, Daniel is a globally sought-after expert on the competitive strategy implications of AI for business and government leaders.

Over the last three years of AI Opportunity Landscape research, we’ve examined many broad capabilities across the AI ecosystem, from computer vision to conversational interfaces to anomaly detection and beyond. Some of our earliest client research work focused on back-office automation – mostly in financial services and healthcare – and it brought us face-to-face with an array of vendors, use-cases, and opportunities for applying AI for document search and discovery.

In this article, we’ll break down some of the lessons we learned from one of our more recent analyses of document search and discovery vendors in the financial services industry – and explore some of their capabilities. We’ll then look at some of the key factors for delivering business value with these applications.

For enterprise leaders interested in AI use-cases, this article will highlight some of the unique areas where AI might add value to white-collar workflows – and help with vendor selection.

For AI vendors and service providers, this article will highlight some of the most important factors in delivering client value, and should serve to prioritize product roadmap ideas for search and discovery applications.

We’ll begin with a breakdown of the capabilities included in document search and discovery.

Document Search and Discovery Use-Cases

AI-enabled document search and discovery might be best defined as any AI-based application that allows an enterprise user to find the documents or data that they’re looking for. This is incredibly open-ended and can be applied to an unlimited number of potential use-cases or workflows. “Search” and “discovery” have different implications, but drive towards the same general aim:

  • Search – Looking for something specific, and finding it. For example, if a customer service agent at a bank is looking for specific information about a financial product and needs to find all the available information on that product quickly, AI might be used to search for the product name or details, and bring up the blurb or details that are specifically relevant for the customer.
  • Discovery – Looking for patterns, connections, or “types.” For example, a procurement professional might train an AI solution with hundreds of examples of different kinds of clauses within procurement contracts – and then use the trained system to automatically label the company’s entire corpus of hundreds of thousands of contracts, tagging each one based on which kinds of clauses it includes. This level of richness in the corpus would allow the procurement professional to “discover” which contracts have specific clause types – or combinations of clauses that the company deems to be dangerous.
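
As a rough illustration of the difference between the two, the sketch below runs both a search and a discovery query over a tiny, made-up contract corpus in Python. The clause labels, trigger phrases, contract texts, and function names are all hypothetical, and the keyword-based labeler is only a stand-in for a model trained on human-labeled examples; it is not meant to represent how any particular vendor implements these capabilities.

```python
from dataclasses import dataclass, field

# Hypothetical clause labels and trigger phrases. A real system would use a
# classifier trained on human-labeled examples rather than keyword rules.
CLAUSE_KEYWORDS = {
    "auto_renewal": ["automatically renew"],
    "unlimited_liability": ["unlimited liability"],
    "termination": ["terminate"],
}


@dataclass
class Contract:
    name: str
    text: str
    labels: set = field(default_factory=set)


def label_contract(contract):
    """Attach every clause label whose trigger phrase appears in the text."""
    lowered = contract.text.lower()
    contract.labels = {
        label
        for label, phrases in CLAUSE_KEYWORDS.items()
        if any(phrase in lowered for phrase in phrases)
    }
    return contract


def search(contracts, query):
    """Search: return contracts whose text contains every query term."""
    terms = query.lower().split()
    return [c.name for c in contracts if all(t in c.text.lower() for t in terms)]


def discover(contracts, risky_combination):
    """Discovery: return contracts carrying every label in the combination."""
    return [c.name for c in contracts if risky_combination <= c.labels]


# Made-up example corpus, labeled once up front.
corpus = [
    label_contract(Contract(
        "A",
        "This agreement will automatically renew each year. "
        "Supplier accepts unlimited liability.",
    )),
    label_contract(Contract(
        "B",
        "Either party may terminate with 30 days notice.",
    )),
    label_contract(Contract(
        "C",
        "This agreement will automatically renew unless either party "
        "elects to terminate.",
    )),
]

print(search(corpus, "30 days"))                                 # ['B']
print(discover(corpus, {"auto_renewal", "unlimited_liability"}))  # ['A']
```

The search call looks for specific words the user already has in mind, while the discovery call relies on labels that were attached to the whole corpus ahead of time, which is why the quality of those labels matters so much.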

Applications of these kinds can help companies cut down on the man-hours required to complete specific workflows, improve customer satisfaction (by finding requested information faster and not having to put customers on hold), or reduce legal risk (as in the procurement contract example above).

In the following two sub-sections, we’ll explore the range of capabilities across the document search and discovery landscape, and the key elements involved in unlocking value and making the most of these applications in a business context.

Capabilities and Differentiation for Document Search and Discovery Vendors

Some capabilities are more challenging than others to build, maintain, and deliver in the enterprise. Below, we’ve listed some of the capabilities that AI-based search and discovery vendors employ, rated by their relative level of differentiation:

Capability Type 1: OCR (e.g. Turning physical invoices and wholly non-text files into digital text with meaning)

  • Level of Differentiation: High
  • Reason Why: Optical character recognition (OCR) is incredibly hard to do well. For years to come, there will be paper documents, PDFs, image files, and other wholly unstructured documents flowing into banks – documents whose format a bank does not control. Systems that handle these use-cases well will be in strong demand.

Capability Type 2: Internal Search with Platform (e.g. Users load their new digital docs in through your platform)

  • Level of Differentiation: Middle-High
  • Reason Why: Becoming integrated as a data “gate” for your client company is powerful positioning, and ingrains use. This lets a vendor be more than a search tool: it becomes something that a client relies on to make sense of and audit/harmonize their data.

Capability Type 3: Internal Search (e.g. Users don’t load new docs through your interface, but use your interface to search)

  • Level of Differentiation: Middle-Low
  • Reason Why: For data that is already entirely digital, AI-enabled search is far less challenging. This data is harder to access than publicly available data, however, so finding value with internal search applications will only be possible for firms that get access to that internal data, which is a competitive moat in and of itself.

Capability Type 4: External Search (e.g. Scraping / extracting financial data from the web)

  • Level of Differentiation: Low
  • Reason Why: Searching secondary or publicly available data (see: AI for media monitoring) is easy, and any startup that raises money will have access to this data, and the ability to apply the most advanced algorithms to deriving value from it.

It should be noted that the gradient of “differentiation” above is also a gradient of difficulty. A more differentiated (i.e. defensible) business model or positioning is also harder to build. This isn’t to say that enterprise buyers should look for OCR if they don’t need it – but from the perspective of the vendor ecosystem, OCR with document search and discovery is rare, and extremely challenging. 

The easiest task of all is scraping and finding patterns in public data (what we’re calling External Search). Being the OCR portal that serves as the gateway and storage system for all data within a bank is massively hard to achieve, but it is also a massively powerful position to arrive at. We’ll explore the application of these insights in the Recommendations section below.

Value Drivers for Document Search and Discovery

The value of document search and discovery applications lies not in the code or in the technical nuances of the implementation, but in the upfront thinking about designing data labels and ontologies that serve the specific use-cases that will drive business value. For example:

  • In-house legal counsel might work with data scientists and end-user legal employees to think through the types and categories of legal clauses and documents that they look for most frequently. From there, the team could determine the searches or workflows that would deliver the most business value with the least amount of technical complexity. This upfront work would be valuable even if it only created a new set of categories and labels. The AI application delivers value only insofar as it is trained on labels and categories that have meaning to humans, and can be easily integrated into human workflows.

Our research shows that the key to unlocking these deeper workflow conversations with clients is trust. Specifically, enough trust to access data and apply subject-matter expertise to truly find the workflows that matter (the junctures where AI can be implemented to unlock value). Below we’ve ranked the elements of vendor advantage (i.e. their ability to deliver value to clients) from highest to lowest.

Competitive Advantage 1: Client relationships with data access (storage, analytics, etc)

  • Level of Client Value: High
  • Reason Why: Existing access to data means an ability to (a) position a search product as a value-add to an existing offering, backed by an existing working relationship of trust, and (b) learn and iterate from existing data in order to define useful client value.

Competitive Advantage 2: Client relationships without data access (trust)

  • Level of Client Value: Middle-High
  • Reason Why: Trust and working relationships with existing clients allow a vendor to explore what could be valuable with that client through frank and clear dialogue – and to leverage existing trust to win initial search application pilots.

Competitive Advantage 3: Knowledge of the subject-matter (types of data)

  • Level of Client Value: Middle
  • Reason Why: Understanding the business value that a client needs is critical. Knowing the priorities, constraints, and economic levers that are relevant to business leadership in a specific department or company allows a vendor to solve their pain.

Competitive Advantage 4: Knowledge of systems and workflows (processes, IT systems)

  • Level of Client Value: Middle
  • Reason Why: Knowledge of systems and workflows allows a vendor to integrate their processes and systems into those of the client, finding the most seamless process to add value.

Competitive Advantage 5: Data science talent (experience with applied AI, ability to iterate models)

  • Level of Client Value: Middle-Low
  • Reason Why: An application must deliver value. Making use of client data (accessed through trust and competence) requires the skill not only to navigate the data itself, but also to ask the right questions. That said, these skills are arguably less important than a rich understanding of the workflows themselves.

It should be noted that data science is still very important. What the ranking above illustrates is the fact that delivering value is more a matter of integrating into workflows and helping with a business problem, which depends more on an understanding of workflows and access to clients and client data than it does on having massive teams of Stanford AI PhDs. Again, we’ll explore the relevance of these insights in the Recommendations section below.
