David is the co-founder and COO of Keeper, who leads the machine learning, finance, and operations functions. His team's data science work has been featured in VentureBeat, OpenAI, GeekWire, and the Brookings Institute, among other places. Prior to Keeper, David worked as an equity options trader. He holds an undergraduate degree in Mathematics and Economics from the University of Chicago. In his free time, he likes to play tennis and chess.
Automatic write-off detection
The challenges, and how to overcome them
Automatic write-off detection
The challenges, and how to overcome them
Around 40 million American freelancers overpay on taxes by an average of $1,249 every year. We call this phenomenon the [(Freelance Tax Tragedy)(Middle and low-income Americans who take on supplemental sources of income such as independent contracting / gig work are also extremely likely to miss out on valuable tax deductions and therefore overpay on their taxes.)], and our mission at Keeper is to put an end to it.
To do so, we’ve had to develop software that can find tax deductions by searching through bank statements that contain both business and personal expenses.
We call this technology “automatic write-off detection” and below is a summary of why we developed it, how it works, and how it compares to off-the-shelf solutions.
The problem (AKA why you can’t just use Plaid + simple rules)
Anyone who’s used apps like Mint is familiar with the state of today’s expense categorization technology – from rampant miscategorization, to laughable merchant identification and 5 days of pending transaction delay. Now imagine using this kind of data to predict something as sensitive and complex as deciding which of your transactions are tax deductible, where the stakes are literally tax fraud and jail time.
Before building anything in-house we first tried every off-the-shelf transaction intelligence service available – Plaid, MX, Finicity, nTropy – you name it. Unfortunately, we quickly discovered that nothing could hold a candle to the challenges presented by automatic write-off detection.
Below are a few examples of the types of challenges Keeper had to overcome in order to be able to offer the first genuine “automatic write-off detection”, rated 4.8 stars by over 500,000 Americans.
Theoretically, merchants do pass a clean business name to the processor who then passes it to the bank. However, by the time a transaction actually comes out the other side of this journey it is often completely unrecognizable thanks to arbitrary truncations, prepended and appended substrings, and inserted metadata.
This is why typical software transaction cleaning is so seemingly primitive – it’s simply a hard problem. Neither heuristic models, nor LLMs, can solve it alone and it causes user readability issues and disrupts categorization accuracy.
One of the most important predictors of tax deduction status is knowing *what* was purchased, AKA categorization. Unfortunately, this is no trivial task. Even human-level intelligence struggles to differentiate the 1000+ businesses that have a trademark with the substring “Lemonade”.
Effective categorization requires squeezing every drop of intelligence out of other metadata such as merchant identity, transaction amounts, geographical location, and user preferences. Even then, the best you can do is arrive at a high confidence score.
Pending / duplicate transactions
Most banks have two types of transactions – pending, and approved – nominally to account for holds and tips. This seemingly simple issue causes major technical challenges when trying to show recent expenses without resulting in duplicates when the approved transaction comes in. Frequent problems include situations where the purchase amount has been changed (for instance, due to a tip), or when the transaction description has been completely updated (example: PENDING ONLINE PURCHASE → eBay).
Most software, such as Quickbooks, address this problem by avoiding it (i.e. they won’t show a transaction until it’s finalized). While it makes the engineering easier, it’s a terrible user experience because by the time a transaction is finalized, a week has gone by since the actual purchase was made, and the user has forgotten what it was for.
Some merchants can sell a variety of goods or services, making it difficult to categorize purchases with any degree of confidence.
The way to handle this is to combine data intelligence with clever user interface. On the intelligence side, metadata such as transaction amount, date, and past user actions can help narrow down the realistic options for every merchant. Then clever UI can help users quickly tell us that this big Venmo payment is rent and to create a rule moving forward.
Industry and user-level preferences & regulation
After all the data pipelining has been laid out, the real prediction work begins. Getting accurate merchant identification and categorization was just the first step. Category rules alone are a terrible predictor of deduction status. Variables like work industry, personal preference, amount paid, day-of-week of purchase, location, and the user’s past behavior all factor into determining tax write-off probability.
There’s a lot that can be done by implementing basic rules like “if you drive a car then gas is a write-off”, or “Linkedin Premium is always a write-off.” However, relying solely on these types of heuristics eventually stops scaling. The real solution requires a combination of heuristics, AI, and a vast and high-quality training data set.
At this point, we hope it’s clear why genuine “automatic write-off detection” isn’t as simple as hooking up a category-based rules engine to Plaid. Below is an overview of how we’ve managed to overcome those challenges at Keeper.
Last year, 100,000+ users passed three hundred million transactions through Keeper’s automatic write-off detection system. When all was said and done, we found over $88M in tax write-offs for our users at a 96% overall precision rate.
The pipeline that makes this possible is neither a labyrinth of rules-based heuristics, nor a giant black-box ML system. It’s a fine-tuned combination of both, and it’s constantly improving every day.
At a high level, the system has five components: merchant identification, categorization, transaction de-duplication, write-off prediction, and training. Let’s break them down.
Our automatic write-off detection process starts with parsing the relevant merchant entity from the raw transaction description. On first pass, we use the BM25F search algorithm with fuzzy match to scan each bank description against an internal database of over 30,000 merchant entities.
We compiled our merchant database from a combination of scrapping lists online, and by curating merchants through a multi-year process from our own dataset of bank statements.
This basic approach works relatively well when the desired output is unambiguous, but runs into issues when there are merchant conflicts. For instance, it's common for a debit card purchase to include both the bank account used as well as the merchant information within the description:
Using BM25F’s ranking method at face value, our algorithm will return "Wells Fargo" as the underlying merchant, rather than "Uber", which is the desired answer in this case. To handle this, we've taken a page from more sophisticated Named Entity Recognition (NER) approaches. Rather than treating all merchants as equals, we've subdivided our merchant database into subcategories, including:
Then, instead of identifying the merchant in a single sweep, we first return all the relevant entities and their associated sub-categories in the description, then apply our own prioritization of the parsed entities.
This internal approach addresses about 90% of the data that passes through our system. For the remaining 10% we leverage OpenAI's GPT-3. We’ve worked closely with OpenAI's team to develop several fine-tuned models starting in 2020, and Keeper was one of their earliest fine-tuned GPT-3 customers.
LLMs such as GPT-3 work quite well for this use case, provided it's trained on a robust dataset that covers the long tail of edge cases that need to be handled. We’ve done this using 10,000 examples for fine-tuning GPT-3 that map to over 250 specific scenarios. The resulting pipeline successfully identifies the correct merchant in over 98% of cases, as scored by a manual QA team every month.
Note that the reason we do not pass all transactions through GPT-3/4 is because it's relatively slow and expensive to call at scale. Therefore, it’s best used as a fallback option, and not as a catch-all clutch.
Once we have successfully identified the merchant, we need to categorize it. In many cases, this process is straightforward. During the process of compiling our merchant database, we've already taken the time to seed each merchant with a default category. In the example above, our database has transportation as the default category for Uber.
But not all situations are so simple. A frequent problem that arises is that the category is ambiguous. In cases where a merchant name can be associated with more than one company, we leverage an in-house multi-class classification model. The model takes into account merchant-adjacent factors such as location, amount, time of purchase, along with the user's prior data.
To understand why this works, we can consider an example: recurring payments. These purchases tend to be fairly consistent in both payment amount and date of payment. Thus, to differentiate between Lemonade the insurance company vs. Lemonade the restaurant, we can examine the user's prior Lemonade purchases.
If purchase amounts are the same every time, and payments are each spaced roughly 1 month apart, this strongly suggests that we're looking at Lemonade insurance payments, rather than restaurant payments.
Pending transaction / de-duplication
The next step in our write-off detection process entails verifying that incoming transactions are not merely duplicates of previous pending transactions that have already been recorded.
Very early on, we developed a classification algorithm trained on purchase amount, number of prior duplications for the user, banking institution, and the amount of time that has passed between purchases. After years of iteration, this algorithm is able to handle the vast majority of the nuances involved in de-duplication. For example, an interesting quirk in the data is that while pending and finalized purchases often appear on a bank statement 1 business day apart, they almost never appear together on the exact same day. As a result, if we see two identical purchases appear on the same date, our algorithm correctly predicts this to be 2 separate purchases rather than duplicates.
After all this, we are finally ready to predict which transactions are tax deductions. In broad terms, our objective is to classify each banking transaction into one of three buckets: very likely a write-off, possibly a write-off, not a write-off.
One key factor of the success of our prediction model is user clustering. For example, Keeper supports 185 work industry types – from delivery drivers to pest control contractors. Clustering the training data for those user segments allows the prediction algorithm to achieve that extra “human accountant-level” intelligence when determining tax deduction status for our users.
Other important inputs into the prediction model include:
- Aggregated features across the user base like the overall write-off rate for each category (e.g. GoDaddy web hosting tends to be for work)
- Individual user preferences collected at onboarding (e.g. whether the user travels for work, gets business meals, their income level, their banking institution, etc.)
- An individual user's past transactions and any information they’ve given the platform (e.g. write-off count by merchant, category, amount, recency, etc.)
Over time, our data team has translated these factors into specific numerical or categorical features such as the following:
- trailing 60 day frequency by merchant category for the user’s transactions
- the user's prior year AGI
- number of days since most recent write-off
- standard deviation on a merchant's purchase amounts
All of these variables are then fed into boosted decision-trees, which scale well and are easily interpretable, resulting in a probability score. Probability scores above 95% are automatically added to the user’s write-offs, probabilities between 15% - 95% are suggested as potential write-offs to the user, and probabilities below 15% are automatically ignored.
No matter how fancy our models are or how smart the LLMs get, the real reason Keeper’s system works as well as it does is our unique access to high-quality training data. Without it, it is near impossible to validate prediction accuracy. Moreover, this training set is not so easy to compile simply by having access to a ton of users’ bank accounts. Identifying tax write-offs reliably is difficult for even a well-trained bookkeeper, and that means you cannot simply ask a bunch of folks to label training data for you manually. To arrive at a large, high-quality dataset, we use a couple of approaches:
High-quality user-generated data
Over time, we’ve developed techniques for identifying users who keep an accurate and up-to-date record of their tax write-offs in our app. In some cases, these are users who will reach out right away if we make the slightest error (e.g. we miscategorize a merchant as a restaurant instead of a coffee shop). In other cases, it’s more subtle. Users who are making updates to transactions right before submitting their finalized tax return are more likely to provide accurate data. Data from these users are collated into datasets used to train our models.
The algorithmic approach we use for write-off detection follows a self-reinforcement learning process. New write-off predictions today become part of the inputs for the model when we run it tomorrow. In cases where users are helping us identify every last possible error, this allows our model to quickly converge on making very good predictions. However, when users are less engaged, this can result in an adverse feedback loop. Bad predictions become new inputs into the model, which results in more bad predictions, and so forth. To combat this, we’ve had to develop a regularization framework that pushes the model to become less confident over time given factors like user inactivity. As a side benefit, this technique provides an additional method for finding high-quality data for model training purposes.
Every week we run a small, representative sample of write-off predictions through our bookkeepers and accountants to QA. The results from this process isn’t a large part of our training data since it’s too expensive to scale, but it provides a vital data point in determining whether our model passes the "common sense" test. We’ve found that examining results in a production environment provides our data team with valuable insights into what we could be doing better.
To demonstrate Keeper's automatic write-off detection in action, here is one year's worth of transactions generated from a fictitious Chicago-based freelance graphic designer with a home office, who sometimes meets with clients over coffee and meals but doesn't commute between gigs or travel out-of-town for work. Three models are compared: (1) a "simple" rules-based model using Plaid’s categorizations (standin for apps like Quickbooks), (2) Keeper's model with no additional user input, and (3) Keeper's model after one typical 2-minute engagement session.
But don’t just trust us. Test us by sending your own sample transaction data!
- date (string, format: YYYY-DD-MM): The transaction date.
- name (string): The name associated with the transaction.
- amount (float): The transaction amount.
- deduction_status (string, values: 'yes', 'no', 'duplicate'): The ground truth deduction status of the transaction.
- bank_account (string): The associated bank account for the transaction.
- sub_account (string): The associated sub-account for the transaction.
Additionally, a user profile with the following fields is necessary. Required Fields:
- freelance_job_types (list of strings): A list of freelance job types chosen from the provided options.
- drives_car_for_work (boolean): Indicates whether the user drives a car for freelance work (True, False).
- has_home_office (boolean): Indicates whether the user has a home office (True, False).
- has_business_meals (boolean): Indicates whether the user has business meals (True, False).
- conducts_business_travel (boolean): Indicates whether the user conducts business travel (True, False).
Send this information to email@example.com.
Pellentesque id nibh tortor id aliquet. Iaculis nunc sed augue lacus viverra vitae congue eu. Commodo viverra maecenas accumsan lacus. Purus sit amet luctus venenatis lectus magna fringilla urna porttitor.