Anatomy of an ML startup – redux: part II

By Jay Bartot, Partner • August 22, 2023

Not all data are created equally

In Part I, I discussed the need for AI/ML startup founders to think early about a defensibility strategy - a moat for their startup against competitors. This need arises from the fact that a startup’s AI/ML technology is likely not inherently defensible due to the commodity nature of most AI/ML technologies. Instead, proprietary data has historically been and continues to be a strong component of a moat. 

But in my experience, different forms of data come with various challenges. Given that data is still king, let’s look at a few data sources commonly used to power machine learning startups and review some of the challenges and their pros and cons. BTW, these informal categories come directly from my experiences and are certainly not an exhaustive list of what you may encounter in practice. 

Customer data

“Customer data” refers to your startup’s own data, generated by customers using your products. Although this could be any type of behavior as measured by your telemetry and user metrics technologies, it is often the purchasing behavior of your customers, assuming you have an e-commerce site of some sort. Once amassed over time, this data can be rich and powerful.

For proof of this power, just think about the insight and predictions Amazon, Google, and Meta can glean about their customers based upon a continuous cycle of collecting and analyzing this behavioral data over time. These companies have also been at the forefront of leveraging machine learning technologies to extract the incredible value that is inherent to this data.  

Interestingly, some of these companies have also been extremely generous with releasing their AI/ML technologies to the community. Why give that technology away to potential future competitors?  The answer points to their current state-of-the-art machine learning technology being ultimately less valuable than the data they have amassed and the data moat they’ve created. 

It can be hard for a startup to call itself an AI/ML company when it can’t put that technology to use until it has accumulated enough historical data on its users and customers. Investors will recognize this chicken-and-egg problem, especially because every startup these days calls itself an AI/ML company. I sometimes call these startups “tech-later,” meaning that once any company has been in business long enough, there will be data and analytics value-creation opportunities. Still, it might take a long time to get there.

To get to that mass of data, where you can apply machine learning technologies to give your product that extra edge, you must build momentum by some other means. For example, collaborative filtering and recommendation technology can’t help until you have enough observations of your users’ behavior. Or perhaps you’re developing a proprietary matching algorithm that will power your two-sided marketplace application. Again, you’ll need plenty of data for it to be effective. Before you jump into this arena, think about how you will bootstrap your application before you have a critical mass of customers and data. Paul Graham wrote a classic essay years ago entitled “Do things that don’t scale.” I mention it because sometimes, early on, you might need to create data manually (e.g., labeling it for supervised machine learning) in order to get the ball rolling. This approach won’t scale but will at least help you bootstrap your flywheel.
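To make the cold-start problem concrete, here is a minimal sketch in Python, not a production recommender. It assumes a toy, hypothetical data shape (a dictionary mapping each user to the items they have bought) and a made-up threshold, MIN_CO_OCCURRENCES, for how many times two items must appear together before the signal is trusted. When there isn't enough co-purchase evidence, it falls back to plain popularity, which is one simple way to keep a product useful while the data accumulates.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical threshold: how many users must have bought two items together
# before we trust that pairing as a recommendation signal.
MIN_CO_OCCURRENCES = 2


def build_co_occurrence(baskets):
    """Count how often each pair of items appears in the same user's history."""
    co_counts = defaultdict(Counter)
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts


def recommend(user, baskets, top_n=3):
    """Recommend unseen items, falling back to popularity when data is thin."""
    seen = set(baskets.get(user, []))
    co_counts = build_co_occurrence(baskets)

    # Score each unseen item by how often it was bought alongside items
    # this user already has, ignoring pairings below the threshold.
    scores = Counter()
    for item in seen:
        for other, count in co_counts[item].items():
            if other not in seen and count >= MIN_CO_OCCURRENCES:
                scores[other] += count

    if scores:
        return [item for item, _ in scores.most_common(top_n)]

    # Cold start: not enough behavioral data yet, so use overall popularity.
    popularity = Counter(i for items in baskets.values() for i in set(items))
    ranked = [item for item, _ in popularity.most_common() if item not in seen]
    return ranked[:top_n]


if __name__ == "__main__":
    history = {
        "ann": ["a", "b"],
        "bob": ["a", "b", "c"],
        "cat": ["a", "c"],
        "dan": ["a"],
    }
    print(recommend("dan", history))  # uses co-purchase data
    print(recommend("eve", history))  # brand-new user: popularity fallback
```

With only a handful of users, most item pairs never clear the threshold, so nearly every request lands in the fallback branch. That is the chicken-and-egg problem in miniature: the interesting branch only starts to matter once the behavioral data exists.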

Operations data (“corporate exhaust” data)

Most corporations generate large volumes of data in the course of their daily operations. This data could be server log files or communication data (chat, email, calendar), financial operations data, legal documents, or data collected using SaaS core systems of record applications (HR, Finance, etc.). Sometimes referred to as “corporate exhaust” data, a lot of this data is often lying around, serving a single purpose (e.g., for communications or operations).  These days, almost everything a company does as part of its operations leaves a trail of data somewhere.  There may be gold in this data if it can be collected, analyzed, and presented to a user as actionable insights, helping answer the perennial question, “How can we run our company more efficiently?”

Startups and their enterprise customers are just beginning to realize there is value in this data that can be extracted with the latest AI/ML technologies, especially unstructured (language) data. Note that the ability to make sense of unstructured text is yet another astonishing ability of LLMs. 

One of the biggest challenges of leveraging this kind of data for your product is data sensitivity and security. You may convince a few pilot customers that you have the technical know-how to extract actionable insights from their “exhaust” data. But do they trust you and your startup to handle and secure their sensitive data appropriately? And does the mere fact that someone or something is looking at this data feel like surveillance to your customer’s employees? Picture this: a CISO at a Fortune 500 company you’re hoping to land as a pilot customer (one that may unlock your seed round) agrees to engage with you but then sends you a 200-item questionnaire that makes your heart stop, asking, among other things, “please describe the composition of your change control committee” (my personal favorite). This can be pretty intimidating.

You may be a 5-person startup with minimal cloud infrastructure and few resources to create a highly secured environment, but building security into your culture from day one is very important if you’re going to ask your customers to transfer their data to your cloud. Having worked on several products and startups that consume exhaust data, I can tell you that the security scrutiny will come. For these reasons, SOC 2 certifications are becoming an increasingly common request of vendors. Get ahead of these requirements, and don’t be caught off guard. If your startup is going to play in this data space, plan on (and budget for) getting SOC 2 certified (for example, from Strike Graph) in your first year of operations, and make sure your initial infrastructure is simple, well documented, and secure. That includes adhering to security best practices as part of your culture, e.g., the principle of least privilege (PoLP), no data outside the cloud on laptops, strong password discipline, MFA on all accounts, encrypted data at rest, etc.

Industry data

I’ll forever think of travel industry data (airline, hotel, rental cars) and MLS (real-estate) data when I think of this data type. I was involved early in a startup called Farecast, where we predicted changes (future price direction) in airfares and hotel rooms. We relied heavily on a complex stew of data from individual airlines, some of which was filtered through a third-party aggregator (a GDS, or global distribution system) and ultimately computed into a priced airfare a consumer could buy. We collected huge amounts of this data over time, building a historical view of price changes.

It turned out that there were only a few companies in the world that had the knowledge, expertise, and computing resources to cook up the amount of data we needed on a daily basis. We discovered this well into our journey, having assumed the big challenge would be the efficacy of our AI/ML algorithms. As a defensive measure, we spent a lot of time and effort building our expertise in the construction of airline data and pricing to make the most of the data we were purchasing and to engineer appropriate features for our AI/ML algorithms.

At Farecast, we ended up having several data vulnerabilities that we hadn't fully considered at the outset of our journey: 1) procuring the data, particularly in the volume we needed it,  2) the cost of the data (I’ll just say the cost was non-trivial and a significant portion of our operating budget), and 3) the need to build up a historical cache of data, which takes time and may need to cover at least a few cycles of whatever signals you are trying to predict or forecast. 

Be careful with this type of data. If you can get your hands on it, which might be more difficult if you are an industry outsider, remember that your competitors will probably be able to get it too.  The fact that you’ve managed to crack the data procurement challenge will be a signal to other startups that they’ll be able to do it as well.

Public data

A ton of public “open source” data is released regularly by governments and public and private institutions alike. The last 15 years have seen a deluge of this kind of data released to the public (e.g., crime data, census data, medical research data, etc.). What is nice about this kind of data is it is essentially free, and large volumes of it can be available. The challenge can be that not only is the data accessible to you, but it is also accessible to your competitors, hampering the defensibility of your moat. With this kind of data, your edge may be your cleverness, perhaps figuring out how to combine or mash up multiple data sources (even mixing in proprietary data) in a unique and powerful way and then applying your AI/ML technology solution.  

One experience I had with this type of data was at a startup I co-founded called Medify. At Medify, we mined medical research literature with NLP/text-mining technologies to extract signals about the relationships between the studied treatments and conditions, patient demographics, and study outcomes. We rolled up this data across thousands of studies, providing a powerful tool to see aggregate information on treatments, their applicability, and efficacy across a wide swath of patient cohorts. The semi-structured data was freely available from PubMed (as were complex medical taxonomies supported by the NIH). The challenge was ultimately deciphering the unstructured parts, i.e., the medical language. That’s where the technology challenge came in, and some of our moat in that era. Again, in the modern AI/ML age, an LLM’s ability to help you make sense of unstructured human language data is very powerful. So now is a good time to get a head start using LLMs to unlock the oceans of unstructured data that have accumulated over the last 15 years.

Try to come up with creative mashups of multiple data sources (perhaps some public and some proprietary) that ultimately give your data a proprietary edge. If you’re developing a supervised AI/ML solution, the proprietary part may be your labeling strategy or process. This won’t hold off competitors forever but may help give you a head start.

Personal health information (PHI)

It’s difficult to enumerate all of the different kinds of data you may encounter and need as you build your startup. Still, another important type to mention is healthcare data, especially PHI data (protected health information or personal health information). In some respects, this is the holy grail of data. 

Healthcare system data - doctor visits, medical tests, procedures, etc. - is all recorded somewhere. If you could get your hands on a lot of it, you could likely build some very transformative AI/ML applications for healthcare and humanity. But this type of data can be very challenging to procure. And it is not just other people’s PHI that is difficult to get. Our own PHI data can be difficult to obtain, even though we are legally entitled to copies of it via the HIPAA and HITECH acts.

Although HIPAA is meant to protect and ensure the privacy of individuals, it has undoubtedly played a role in making access to PHI data for AI/ML difficult. Suffice it to say there are also cultural complications and friction in the healthcare industry that contribute to the lack of accessibility of data for technology applications.

Like many technologists, I have wandered into the healthcare space before and seen many problems I could solve with the latest and greatest technologies. But like many others before and after me, I found obtaining compelling data to be very challenging and time-consuming. I’ll also say that it’s not impossible to get sensitive PHI data. I know of startups that have gotten over this hump. But it can take a lot of time and perseverance to build relationships and trust inside organizations that can facilitate access to the data you need. If you’re going to build an AI/ML application in this space, preserve your resources so you can have the time you need to get access to the data and sell to slow-moving buyers. Also, even more so in healthcare than in other verticals, if you’re a technologist, you should seriously consider having a co-founder with health domain expertise who can facilitate the key relationships you’ll need to get your foot in the door.

Before finishing this post, I want to mention that procuring data is usually not a once-and-done endeavor. Startups should be thinking about a virtuous cycle of data, as articulated by Soma Somasegar and Daniel Li of Madrona Venture Group. The idea is that given that your product is data-dependent, the more data you collect, the better your product will be. Cyclically, the better your product is, the more popular it will become and the more data you will collect.  

This concept might be easiest to understand in the context of the “Customer data” type described above, assuming you have a consumer application (although the idea can also apply to enterprise applications). By iteratively incorporating new data into your product, making it a better product at each turn, you are building layers of value that can be part of your moat and defensibility story. Be sure to articulate clearly to your investors how you plan to exploit the virtuous data cycle to benefit your product and customers.

What’s the best kind of data?

Finally, we ask, “What’s the best kind of data?” It is likely a veritable stew of heterogeneous sources, proprietary, public, etc., synthesized into a high-octane fuel by analytics, AI/ML, and deep domain knowledge and know-how. My advice is to think through this strategy as early as you can in your startup journey.  
