Document Search and Data Mining in Banking – Natural Language Processing Capabilities

Raghav Bharadwaj

Raghav serves as Analyst at Emerj, covering AI trends across major industry updates and conducting qualitative and quantitative research. He previously worked for Frost & Sullivan and Infiniti Research.

This article was originally written as part of an in-depth AI report sponsored by Iron Mountain, and was written, edited and published in alignment with our transparent Emerj sponsored content guidelines. Learn more about our thought leadership and content creation services on our Thought Leadership Services page.

Banks and other financial institutions seem to be adopting AI applications ranging from business process automation to credit scoring. Historically, banks have collected vast amounts of data, and even some of the more traditional banks tend to have the resources needed for AI projects. Navigating these data stores efficiently to gain business insights requires understanding what AI can do in information search and discovery applications for the banking sector.

One of the earliest applications of text mining in banking was the development of Automatic Processing of Money Transfer Messages (ATRANS). This system could extract information in a predefined template from telex messages, which were sent between banks to confirm money transfers.

The telex messages were highly predictable in their content, and the standardized format made ATRANS a successful system. AI is not a magic fix-it-all solution for challenges faced by banks today. Rather, business leaders can think of it as a tool that can help them find ways to improve outcomes that already exist within their data. AI software is only as good as the data it consumes.

Most natural language processing, or NLP, and text/data mining applications are developed based on existing software libraries and established use-cases. International banks that collect data in languages other than English might also find it difficult to develop NLP software that can directly understand contextual documents in local languages.

For instance, the fact that there are currently more NLP libraries for English than for German might force a German bank to take the additional step of converting its documents to English before building any AI applications.

A report from the Financial Stability Board suggests that NLP and machine learning can be applied to several applications in banking, such as customer-facing banking chatbots and credit scoring.

Data search and information discovery usually involve several steps even after the data has been collected and organized. NLP and machine learning can be used to create a searchable index of all internal documents. After indexing, the data can be made searchable through an interface. Search interfaces could offer basic search options, such as Boolean (and/or/not), segment, and numeric range, or advanced search options that might include natural language search, fuzzy search, and concept search.
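The indexing and search steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a description of any vendor's product: the sample documents, the function names, and the 0.8 similarity cutoff are all assumptions made for the example, and fuzzy matching is approximated with difflib rather than a production search engine.

```python
import re
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical internal documents, invented for illustration.
DOCS = {
    "loan_001": "Mortgage application with closing statement attached",
    "loan_002": "Auto loan application missing income verification",
    "memo_003": "Internal memo on mortgage closing delays",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build an inverted index mapping each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_and(index, *terms):
    """Boolean AND: documents containing every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Boolean OR: documents containing at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

def search_not(index, all_ids, term):
    """Boolean NOT: documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())

def search_fuzzy(index, term, cutoff=0.8):
    """Fuzzy search: match indexed terms that are spelled similarly to the query."""
    matches = get_close_matches(term.lower(), list(index.keys()), n=3, cutoff=cutoff)
    return set().union(*(index[m] for m in matches))

INDEX = build_index(DOCS)
```

With this index, search_and(INDEX, "mortgage", "closing") returns the two mortgage-related documents, and search_fuzzy(INDEX, "morgage") finds the same documents despite the misspelling. Natural language and concept search would need language models beyond what this sketch covers.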

The same tools driving advances in machine learning in search engines are being adopted in the banking industry. For example, entity recognition tools similar to those used in search engines are now being used to identify news and social media conversations relevant to publicly traded firms. As more firms adopt NLP and machine learning methods, the incentives to access additional data may increase among banks.

Giacomo Domeniconi, a post-doctoral researcher at IBM Watson TJ Research Center and adjunct professor for the course “High-Performance Machine Learning” at New York University, suggests that search tools might get even better in the near future:

Search tools which can contextually retrieve information from both structured and unstructured data might not be that far away from now. This might especially be true in sectors like banking where the companies have the economical resources to spend on gathering information from both structured and unstructured data.

NLP-based document search and data mining software seems most useful for three applications:

  • Mortgages
  • Credit Scoring
  • Product Development

Mortgages


According to the AI and data science experts at Iron Mountain, data search and discovery can help banks automate certain operations in mortgage management. Evidence suggests that banks are engaging in structured data capture through forms (HUD-1, tax forms, and loan applications). That said, unstructured documents such as PDFs, audio files, video files, and handwritten notes still account for almost 80% of all documents in the mortgage industry, according to Axis Technical.

Banks can use AI to extract information from structured, semi-structured, and unstructured documents. Structured documents are usually easier to prepare as input for machine learning software. Unstructured documents can be digitized using advanced optical character recognition (OCR) or computer vision. A bank can then apply NLP techniques to the extracted text and gain insights from all the different types of data it is collecting. Below is an example of document search from Axis AI:

Axis AI’s Document Search Software

Semi-structured documents, such as invoices and closing statements, have parts that are structured and others that may even be handwritten. AI software can help digitize and organize the information that a bank might consider critical from these documents.

Unstructured documents present the biggest challenge for document imaging since the metadata that banks need to extract in order to train machine learning models is free-form and might be a sentence, paragraph, or whole page within a document.  

Radim Rehurek, who earned his PhD in Computer Science from the Masaryk University Brno and founded RARE Technologies, points out:

Historically a lot of the internal workflows in banking or insurance, deal with manual data entry in some form. A significant portion of these internal documents is also usually unstructured documents such as PDFs, handwritten notes, audio or video files.

Automated document processing with AI software may be a low-hanging-fruit application in the mortgage sector. Mortgage companies may find that this automation helps them reduce labor costs, including data-entry training costs.

NLP and machine learning can also help banks speed up loan and mortgage application processing, especially large banks with high volumes of incoming forms. At that scale, AI software can be far more cost-effective than having human officers manually review every document. AI can help banks analyze mortgage and loan applications to see whether the forms are missing any information and immediately ask customers to supply it. This could allow reviewing officers to steer clear of applications with rudimentary problems (such as missing information) and deal only with cases involving more complex issues.
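The missing-information check described above is the simplest part of this workflow to illustrate. The sketch below assumes the form has already been digitized into a dictionary of extracted fields; the field names, the required-field list, and the two routing labels are hypothetical and would differ for any real bank.

```python
# Hypothetical list of fields a complete application must contain.
REQUIRED_FIELDS = ("applicant_name", "annual_income", "property_address", "loan_amount")

def find_missing_fields(application, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in an extracted application."""
    missing = []
    for field in required:
        value = application.get(field)
        # Treat None and whitespace-only strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing

def triage(application):
    """Route incomplete applications back to the customer and complete ones to an officer."""
    missing = find_missing_fields(application)
    if missing:
        return {"route": "request_info", "missing": missing}
    return {"route": "officer_review", "missing": []}

# Example: an application whose income field came back empty from extraction.
incomplete = {
    "applicant_name": "A. Customer",
    "annual_income": "",
    "property_address": "1 Example Street",
    "loan_amount": 250000,
}
result = triage(incomplete)
```

Here result routes the application to "request_info" and lists "annual_income" as the missing field, so an officer never has to open it. The harder problem, which this sketch does not attempt, is extracting those fields reliably from scanned or handwritten forms.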

AI software for automated data capture could also help banks reduce human error and operate at a scale that human teams could not match. Once all the documents have been scanned and the data has been captured and organized, mortgage firms can create a front-end search interface that gives employees access to the information and documents they need to make strategic business decisions.

That being said, training these algorithms requires a large amount of data, and small or medium-sized mortgage operations with fewer than 100 transactions per week may not meet the data requirements for training an AI model.

AI for information search and discovery in mortgages might be most useful for large mortgage operations with over 500 transactions per week. Such firms can potentially capture data from web pages, Word documents, emails, and other forms of structured text to make it accessible to employees.

Credit Scoring

Another prominent application for AI-based data search and discovery in banking is in credit scoring for loans. The value of a loan is directly related to how likely a bank thinks an individual or a business may default on that loan. Credit scoring is a critical part of the loan process because it helps banks identify the likelihood of default by accessing customer data, including credit histories, social media posts, and, in some cases, an individual’s entire digital footprint.

The vastness of this data makes it almost impossible for human analysts to comb through and identify customers with a high probability of defaulting. NLP and machine learning can help banks crawl through digital footprint data, such as social media posts, internet browsing data, geolocation, and other smartphone-captured information, to generate a credit score for each customer.

Most credit scoring models have historically been based on transaction and payment histories between customers and banks. But banks seem to be increasingly turning to unstructured and semi-structured data sources to capture a more nuanced view of creditworthiness and improve the accuracy of credit scores.
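To make the idea of combining traditional and alternative data concrete, here is a minimal sketch of a logistic scoring function. It is not how Lenddo, ZestFinance, or any bank actually scores customers: the feature names, the weights, and the bias are invented for illustration and have not been trained on any data. A real model would learn its weights from historical loan outcomes.

```python
import math

# Hypothetical, hand-set weights. Positive weights raise the estimated default risk.
WEIGHTS = {
    "late_payments": 0.6,          # traditional: count of late payments on record
    "debt_to_income": 2.0,         # traditional: ratio between 0 and 1
    "years_with_bank": -0.15,      # traditional: length of the relationship
    "negative_text_signals": 0.4,  # alternative: count of risk phrases found by NLP
}
BIAS = -2.0

def default_probability(features):
    """Map a customer's features to an estimated probability of default (0 to 1)."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

low_risk = {"late_payments": 0, "debt_to_income": 0.2, "years_with_bank": 8}
high_risk = {"late_payments": 4, "debt_to_income": 0.6, "years_with_bank": 1,
             "negative_text_signals": 3}

p_low = default_probability(low_risk)
p_high = default_probability(high_risk)
```

In this toy setup p_high comes out well above p_low, which is the behavior a lender would expect. The only role NLP plays here is producing the negative_text_signals feature from unstructured sources; the scoring step itself is ordinary statistics.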

Several AI vendors, such as Lenddo and ZestFinance, already offer credit scoring solutions for banking customers. The value proposition of these vendors seems to be around speeding up lending decisions and processes while simultaneously limiting risks. Below is a flowchart showing which kinds of data Lenddo claims to collect for their scoring solution:

Lenddo’s Credit Scoring and Approval Process

That said, according to the Financial Stability Board, it has not yet been proven that machine learning-based credit scoring models drastically outperform traditional software in assessing creditworthiness.

Product Development

Banks have several channels for collecting data from customer interactions through social media conversations, emails, phone calls, and website form-fills. With large amounts of data being collected about banking customers, the other broad application for search and discovery AI seems to be in customizing products to improve customer engagement.

We interviewed Adnan Masood, who earned a PhD in Machine Learning from Nova Southeastern University and is Chief Architect of AI and Machine Learning at UST Global. Masood is also a visiting scholar at the Stanford Artificial Intelligence Lab and MIT AI Lab. Masood points out:

AI-driven data search and discovery primarily helps understand the intent. The consumer might be looking for a specific fund, a credit card, loan, an ETF, or checking the health and progress of existing products. A cognitive search provides a richer experience by integrating likes and dislikes, life events, and transaction history, providing context-sensitive results for products which fit these individuals in their specific stages of life and needs.

According to Masood, AI data search and discovery tools help map these recommendations for customized product development. Large banks are exploring the possibilities of “Segment-of-One” financial products, which involve the ability to track and understand individual customer behavior.

In the banking industry, customer interactions and opinions are usually very clear and are rarely neutral. More often, customer reviews are highly positive or highly negative. Negative statements from customers are also usually much lengthier than positive reviews. Large banks collect vast amounts of customer information from multiple channels and need a way to personalize the banking experience.  

By analyzing conversations from chatbots, customer feedback forms, social media pages, and transaction histories, a bank might discover the customer’s intent faster and more accurately. For instance, the Royal Bank of Scotland (RBS) claims it uses NLP text-mining techniques to extract trends from customer feedback. According to RBS, it deployed AI software that could ingest data from customer emails, surveys, and call center conversations to identify which issues affect its customers the most.
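A very simple version of this kind of trend extraction is keyword-based issue tagging, sketched below. RBS has not published its method, so this should not be read as its approach: the issue categories, keyword lists, and sample messages are all hypothetical, and a production system would likely use trained classifiers rather than fixed keywords.

```python
import re
from collections import Counter

# Hypothetical issue categories and the keywords that signal them.
ISSUE_KEYWORDS = {
    "fees": {"fee", "fees", "charge", "charges"},
    "mobile_app": {"app", "login", "crash"},
    "wait_times": {"wait", "queue", "hold"},
}

def tag_issues(message):
    """Return the set of issue categories whose keywords appear in a message."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return {issue for issue, words in ISSUE_KEYWORDS.items() if tokens & words}

def top_issues(messages, n=3):
    """Count how many messages mention each issue and return the n most common."""
    counts = Counter()
    for message in messages:
        counts.update(tag_issues(message))
    return counts.most_common(n)

# Invented feedback pooled from emails, surveys, and call transcripts.
FEEDBACK = [
    "Why was I charged an overdraft fee twice?",
    "The app keeps crashing at login.",
    "On hold for 40 minutes, then another fee appeared.",
    "Great service at the branch today.",
]

most_common_issue = top_issues(FEEDBACK, n=1)
```

On this sample, fees are the most frequently mentioned issue, appearing in two of the four messages. Note that exact keyword matching misses variants such as "charged" and "crashing", which is one reason real deployments rely on stemming or learned models.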

Masood further provides context with a personal anecdote:

A large financial institution in Australia whom I work closely with has applied cognitive capabilities, including NLU/NLP, to provide categorical natural language search in a variety of spending categories…The financial institutions which get organized around their data can use this information to invest in creating better experiences and returns for their clients.

Banks can also effectively use search and discovery AI software to provide their customers with a complete view of their interactions with the banks when it comes to their accounts, share deposits, and insurance contracts.

With AI-based search and discovery, banks could discover where customers spend most of their money and open up potential cross-selling opportunities. The scale of the customer-facing side in the banking sector makes it almost impossible for a human team to sift through all the data.

What Business Leaders in Banking Should Know

Data search and discovery applications for AI in the banking sector are emerging rapidly due to the availability of vast amounts of data and large multinational companies with the resources to implement new use cases.

AI search and discovery might automate repetitive tasks for banks and improve sales and customer service processes on the front-end.

Among all the applications discussed in this report, credit scoring seems to have the most traction. Several AI vendors offer credit scoring solutions to banks. AI has made it possible to gain a deeper understanding of the default risks for customers by enabling banks and financial institutions to analyze their entire digital footprint to generate their credit scores.

This has also made it possible for human loan officers to search and retrieve the data relevant to a case from both structured and unstructured documents. As a result, loan officers can review applications faster and more accurately.

The availability of historical data across a bank’s range of customers and products will determine the success of such AI implementations.


Header Image Credit: E-SPIN
