Beyond the Consensus: Navigating AI's Frontier in 2025

View presentation: https://luxcapital.com/ai-frontier

I recently gave the above presentation at the AI Engineer Summit in New York City - you can watch the full livestream here.

Lux was founded in NYC, our first AI investment was in NYC, and a majority of our AI portfolio companies have HQs or a significant presence here. We’ve long been proponents of the NYC x AI tech ecosystem and were lucky to be early partners to several leading AI companies - all the way back to 2013 - with consequential investments in Runway (2018), Hugging Face (2019), MosaicML (2020), Together AI (2023), Sakana AI (2023), Physical Intelligence (2024) and more.

We’ve seen exponential progress in AI over the last 2.5 years, and in particular the last 6 months - models are getting more performant and compute-efficient, and model releases are becoming more frequent and more widely distributed: it’s not just OpenAI and Anthropic publishing models, it’s also xAI, Meta, Mistral, DeepSeek and Kyutai. We’ve seen the rise of reasoning models like OpenAI’s o1 and o3 and DeepSeek’s R1, alongside test-time compute - applying compute at inference instead of training - improving performance. We’ve seen meaningful engineering and hardware optimizations down the stack: no matter what you posit DeepSeek’s latest R1 model cost to train, it was a feat of engineering efficiency and hardware optimization, illustrating that cheaper inference and optimized hardware can deliver impressive results. And of course we’ve seen billions of dollars earmarked for global data center and compute infrastructure, with the US Stargate Project committing $500B to US data centers and French President Macron’s $112B AI investment package. We’re poised to have a perfect storm for AI agents in 2025 - but in reality, AI agents aren’t working quite yet.

Why don’t AI agents work? Let’s take a seemingly simple example of trying to book a flight from NYC to San Francisco. In reality, it’s a pretty complex query: I need to leave after 3 pm but avoid traffic, sit in an aisle seat close to the front of the plane, optimize for my chances of getting an upgrade reward given my chosen airlines and obey business expense requirements. 

We’re all familiar with AI models hallucinating - producing wild fabrications and stretching the truth. But the reason my AI agent didn’t work was a series of tiny, cumulative errors that compound in a complex system: decision errors (the model chooses the wrong fact, like San Francisco, Peru instead of San Francisco, CA), or implementation errors (the model doesn’t have access to the right database or system). Sometimes these errors are related to human preferences, like heuristic errors (the model doesn’t account for traffic to JFK) or taste errors (the model doesn’t factor in my personal preference of avoiding flying on a Boeing 737 MAX). And of course there is the ultimate “perfection paradox,” where AI agents are unreliable and inconsistent, underwhelming human expectations.

Again, it’s not that an AI model today can’t do any one of these tasks like understanding a stated preference or searching the web, it’s how these cumulative errors compound and the complexity of the overall system.

Take two agents - one that is 99% reliable on each task and one that is 95% reliable on each task - both pretty high-performing to start. After 50 sequential tasks you end up with a big disparity between the two: the 99% agent ends up at only about 60% end-to-end accuracy, while the 95% agent drops to roughly 8%. Something that seems simple, like booking a flight, is complex in reality, and small errors get amplified in a multi-agent, multi-step system.
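The arithmetic behind this compounding is easy to verify yourself - end-to-end reliability is just per-step reliability raised to the number of sequential steps:

```python
# End-to-end success rate of an agent that must get every
# sequential step right: per_step_reliability ** num_steps.

def end_to_end_success(per_step: float, steps: int) -> float:
    return per_step ** steps

for per_step in (0.99, 0.95):
    rate = end_to_end_success(per_step, 50)
    print(f"{per_step:.0%} per step -> {rate:.1%} after 50 steps")
# 99% per step -> 60.5% after 50 steps
# 95% per step -> 7.7% after 50 steps
```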

So the challenge for AI leaders and builders is how do you optimize an agent taking into account all of these possible errors to consistently and reliably make the right decision?

The truth is - it’s hard! Here are 5 strategies we’ve seen portfolio builders and experts leverage to mitigate these errors.

  1. Data Curation - how do we make sure the agent has the information it needs?

Data today is messy, unstructured and siloed across organizations. It is exploding beyond web and text data to image, video, audio, machine data and even agent data. Data is a critical asset, and curating it makes it even more effective: curating your proprietary data, the data the agent generates, and even the data you use in your model workflow for quality control all matter. Data is also dynamic - how do you design an agent data flywheel from day one, where every time a user uses the product it automatically improves in real time and at scale? Back to our flight example: could we get a curated dataset of my travel preferences, or automatically update one based on my agent interactions over time?
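A minimal sketch of what such a flywheel could look like: every agent interaction emits a preference signal that is folded back into a per-user profile, which conditions the next request. All the names and signal formats here are illustrative assumptions, not any particular product’s implementation.

```python
# Hypothetical agent data flywheel: log each interaction's signal,
# aggregate into a per-user preference profile.
from collections import Counter

class PreferenceStore:
    def __init__(self) -> None:
        self.counts: dict[str, Counter] = {}

    def record(self, user: str, signal: str) -> None:
        # e.g. signal = "seat:aisle", observed from a completed booking
        self.counts.setdefault(user, Counter())[signal] += 1

    def top_preferences(self, user: str, n: int = 3) -> list[str]:
        # Most frequently observed signals feed the agent's next prompt.
        return [s for s, _ in self.counts.get(user, Counter()).most_common(n)]

store = PreferenceStore()
for signal in ["seat:aisle", "seat:aisle", "airline:delta", "depart:after-3pm"]:
    store.record("grace", signal)
print(store.top_preferences("grace"))  # "seat:aisle" ranks first (seen twice)
```

The point is the loop, not the data structure: each booking the agent completes becomes training signal for the next one, with no extra work from the user.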

  2. Importance of Evals - how do we collect and measure a model’s response?

Evaluations have long been important in AI and machine learning. They are pretty simple to construct when there are obvious yes-or-no answers, as with verifiable math or science problems. But how do we set up evaluations for systems that aren’t yes-or-no - like whether Grace will like this plane seat? We’ve recently seen the launch of several impressive Deep Research products from Google Gemini, OpenAI and Perplexity. Non-verifiable evals depend on the eye of the beholder: which product is better for VC market research vs. scientific research vs. everyday research? The answer lies in continuing to collect signals to create personalized evals based on human preferences and characteristics. And of course, sometimes the best eval is just trying out the agent yourself - “vibes” based on your needs that no number or leaderboard can capture.
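The two eval styles can be sketched side by side - an exact-match check where ground truth exists, and a preference-signal eval that simply averages user approval where it doesn’t. The data here is made up for illustration:

```python
# Verifiable eval: accuracy against known answers (math/science tasks).
def verifiable_eval(predictions: list[str], answers: list[str]) -> float:
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Non-verifiable eval: share of interactions the user approved -
# there is no ground truth, only collected preference signals.
def preference_eval(feedback: list[bool]) -> float:
    return sum(feedback) / len(feedback)

print(verifiable_eval(["4", "9"], ["4", "8"]))     # 0.5
print(preference_eval([True, True, False, True]))  # 0.75
```

The first number is objective; the second only means something relative to the specific user whose feedback produced it, which is exactly why personalized evals matter.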

  3. Scaffolding Systems - how do we scaffold our agent infrastructure so one error doesn’t have a cascading effect?

We can improve agent infrastructure by scaffolding a compound AI system that clearly delineates how all of the underlying frameworks, tools and humans work together. Ramp in the Lux portfolio has done a great job with this: when Ramp launches a new applied AI feature and it fails, infrastructure logic ensures the failure doesn’t take down the broader Ramp infrastructure. For reasoning models this gets even more important - we need to adapt scaffolding to stronger agents that self-heal and grow, agents that realize they are wrong and correct their path, or agents that break execution when they aren’t sure. Back to our flight example: could we interrupt the agent to steer it in another direction?
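One way to picture this kind of scaffolding, sketched under assumed names (no specific framework implied): each step in the compound system runs with a retry, an optional fallback, and a hard stop that hands control back to the human instead of letting an uncertain answer cascade downstream.

```python
# Illustrative scaffolding: retry a step, fall back on repeated failure,
# and break execution when the agent isn't confident enough to proceed.

class EscalateToHuman(Exception):
    """Raised to interrupt execution rather than guess."""

def run_step(step, fallback=None, retries=1, min_confidence=0.8):
    for attempt in range(retries + 1):
        try:
            result, confidence = step()  # step returns (answer, confidence)
            if confidence < min_confidence:
                raise EscalateToHuman(f"low confidence: {confidence:.2f}")
            return result
        except EscalateToHuman:
            raise  # never silently proceed on an uncertain answer
        except Exception:
            if attempt == retries and fallback is not None:
                return fallback()  # contained failure, no cascade
    raise RuntimeError("step failed and no fallback provided")

# A flaky step that fails once, then succeeds confidently.
calls = {"n": 0}
def flaky_search():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("search backend down")
    return "SFO", 0.95

print(run_step(flaky_search, retries=1))  # SFO
```

The `EscalateToHuman` path is the interruption point from the flight example - the place where a person can steer the agent in another direction.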

  4. User Experience (UX) is the Moat that Matters - how do we know the user better to become a better copilot?

AI apps today are ALL using the same models, and foundation models are the fastest-depreciating asset on the market. Contrary to what many believed at the start of LLM application development, “GPT wrappers” are cool. Put another way, UX really does make a difference for those who reimagine products and deeply understand the user workflow, promoting elegant collaboration between human and machine. Take OpenAI’s Deep Research asking clarifying questions, Codeium’s Windsurf understanding the developer psyche, or Harvey integrating seamlessly with legacy systems to deliver ROI for legal workflows. Even the biggest companies across major AI app categories - sales, customer support and coding - are all using the same models; it’s the UX and product quality that make a company stand out. At Lux, we’re really excited about the new AI frontier: robotics, defense, manufacturing and the life sciences. Companies with proprietary data sources who know the workflow of a biologist or a defense contractor better than anyone else are uniquely poised to create magical product experiences.

  5. Build multimodally - how do we get away from the AI chatbot as an interface?

How do we leverage new interfaces and pick modalities to create 10x more personalized experiences? How do we make AI more human, going as far as to anthropomorphize it with eyes, ears, a nose and a voice? We’ve seen scarily good improvements in voice over the past few months, as well as the emergence of touch with robotics and embodiment. I’m also excited about incorporating more AI memories - how do we make AI truly personal? Doing all of this reframes what perfection means to a human: even if the agent is inconsistent or unreliable, the visionary nature of the product can exceed all expectations. Tldraw is a great example in the Lux portfolio, reimagining the visual canvas by implementing AI through brush strokes instead of chat interfaces.

Ultimately, it’s a really exciting time to be building in AI in 2025 - the confluence of so many elements of a “perfect storm” alongside improving strategies for harnessing AI agents across many more interactions. If you’re building an AI agent, especially one that reimagines the product experience in any of these spaces, I’d love to chat - grace@lux.vc.

A huge thanks to Nithanth Ram of Lux Capital for his incredible work as a co-author of this piece as well as several colleagues and friends who gave input - Danny Crichton, Laurence Pevsner, Saunaz Moradi, Shawn “swyx” Wang, Victor Sanh, Ankit Mathur, Emma Guo, Gian Segato and many more!

written by

As a Partner based in our New York City office, Grace invests in companies innovating at the nexus of the computational sciences – data, AI and ML infrastructure, open source software, network infrastructure, developer tools, vertical software applications and more.

Before joining Lux, Grace was a principal at Canvas Ventures where she sourced 10 investments. She got her start as a campus scout while attending Stanford University.

Every problem is an opportunity. The bigger the problem, the bigger the opportunity. –Dr. Tina Seelig

Prior to Canvas, Grace worked on the LP side at the Stanford Management Company, in product at edtech startup Handshake, and in growth equity at Stripes Group. She earned a Bachelor of Science and a Master of Science in Management Science and Engineering from Stanford, where she was a Mayfield Fellow and served as Co-President of Stanford Women in Business, the campus’s largest pre-professional organization for women. In addition, she is on the board of the Stanford Technology Ventures Program, the university’s entrepreneurship center, and is an active member of All Raise, focused on accelerating the success of female and non-binary founders and funders.

Grace is originally from Connecticut, although has lived in Tokyo, Japan, and aspires to re-learn Japanese. She’s an avid runner and cyclist.
