In this post, I’m sharing how we’re developing our quality evaluation framework for AI-generated outputs at Tofu. As a team, we’re aware that consumers are still early in their journey of trusting any LLM-generated output. Any successful generative AI application needs to nail quality evaluation to produce a best-in-class product. A couple of reasons why we want to #buildinpublic:
Source: Sequoia Capital: Generative AI Act Two
For AI-first companies to reach their full potential, it’s crucial to gain user trust similar to that placed in human assistants. While most AI companies start with automating singular workflows, their biggest opportunities involve automating entire systems. This is only possible if trust is established early on.
For example, at Tofu, our vision begins with AI-assisted content for marketing and evolves toward a fully automated top-of-funnel workflow based on data-driven recommendations.
2. Openness Fosters Accountability and Learning: We had been testing quality for months before we launched out of stealth. As we rapidly introduce new features, we remain unwavering in prioritizing quality. Sharing our progress not only holds us accountable but also helps us learn best practices from our broader network.
A Glimpse into Personalization with Tofu
Before delving into our quality evaluation design, here’s a brief on what Tofu does.
We enable B2B marketing teams to generate on-brand, omnichannel personalized content, from emails to eBooks. Customers feed their brand collateral and target segment information into our proprietary Playbook, and Tofu crafts countless variations of tailored content from there.
As a simple example, I’ll walk you through how we personalize a generic Tofu landing page for an account (Heap.io) and a persona (CMO).
In the Tofu app, we select the components of the original that we want to personalize by account and persona, and generate the output in our Factory.
As you can see, the output adds some personalized detail about the industry Heap is in (digital insights) as well as details relevant to the CMO role.
Our ultimate goal is for marketers to confidently publish Tofu-generated content with minimal oversight.
Quality Evaluation Blueprint
Our CEO, EJ, outlined the foundational guidelines for our testing process a few months back. In the spirit of authenticity, the following key points are directly extracted from the original document:
Designing our Metrics and Scoring System
In our first iteration, we used a 10-point scale. Note that we stuck to a binary metric for every criterion except personalization.
We decided from the start that we would recruit third-party evaluators to eliminate biases from our team. We want our criteria and instructions to be simple enough that someone who has never heard of Tofu can follow them. We decided against a purely automated process because we want our human review to mirror that of a real marketing professional evaluating content that Tofu has generated for them.
Scoring Criteria — 10 points possible
We also added a column for Additional Comments, where Evaluators could note any errors that weren’t accounted for in our scoring criteria, or anything they were confused about. This feedback is extremely helpful in the early iterations of your tests for clarifying instructions, particularly since some of the criteria, like the ones relating to Alignment, are subjective.
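A minimal sketch of how a scorecard like this might be represented in Python. The criterion names are borrowed from ones mentioned in this post, but the point values, the `Scorecard` class, and the `RUBRIC` mapping are all assumptions made for illustration, and the rubric is deliberately smaller than the 10-point one described above.

```python
from dataclasses import dataclass

# Hypothetical rubric: criterion name -> maximum points.
# A maximum of 1 means the criterion is binary (pass/fail).
RUBRIC = {
    "alignment_original": 1,
    "repetition": 1,
    "format": 1,
    "personalization": 3,  # the only non-binary criterion in this sketch
}

MAX_TOTAL = sum(RUBRIC.values())


@dataclass
class Scorecard:
    """One evaluator's scores for one piece of generated content."""

    content_id: str
    evaluator: str
    scores: dict[str, int]
    comments: str = ""  # free-text "Additional Comments" column

    def validate(self) -> None:
        """Raise ValueError if a criterion is missing, unknown, or out of range."""
        missing = set(RUBRIC) - set(self.scores)
        unknown = set(self.scores) - set(RUBRIC)
        if missing or unknown:
            raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
        for name, points in self.scores.items():
            if not 0 <= points <= RUBRIC[name]:
                raise ValueError(f"{name}: {points} not in 0..{RUBRIC[name]}")

    def total(self) -> int:
        """Sum of all criterion scores, after validation."""
        self.validate()
        return sum(self.scores.values())
```

Keeping the comments as a plain text field alongside the numeric scores means evaluator confusion can be reviewed next to the score it may have affected.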
Running the Test
We trained a handful of evaluators on Upwork, a contractor marketplace. Here’s the exact first job description we posted.
Some of our best advice for the test design process:
Processing Results — First Pass
This was our first-ever quality test results dashboard. The goal was to highlight which content types and criteria our team should focus on improving, and to show the quality gap between our live models (at the time, GPT-3.5 and GPT-4), given that generating with the latter is much more expensive.
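The kind of aggregation behind a dashboard like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual tooling: the row layout, the `pass_rates` helper, and every value in `ROWS` are placeholders invented for the example, not real test results.

```python
from collections import defaultdict

# Placeholder rows, NOT real results:
# (content_type, model, criterion, passed)
ROWS = [
    ("email", "gpt-3.5", "repetition", True),
    ("email", "gpt-3.5", "format", False),
    ("email", "gpt-4", "repetition", True),
    ("email", "gpt-4", "format", True),
    ("landing_page", "gpt-3.5", "repetition", False),
    ("landing_page", "gpt-4", "repetition", True),
]


def pass_rates(rows, key_fields):
    """Return the pass rate per group; key_fields are indices into each row."""
    counts = defaultdict(lambda: [0, 0])  # group key -> [passes, total]
    for row in rows:
        key = tuple(row[i] for i in key_fields)
        counts[key][0] += int(row[3])
        counts[key][1] += 1
    return {key: passes / total for key, (passes, total) in counts.items()}


# Which content type and criterion combinations need the most work?
by_type_and_criterion = pass_rates(ROWS, (0, 2))

# How large is the quality gap between the two models?
by_model = pass_rates(ROWS, (1,))
model_gap = by_model[("gpt-4",)] - by_model[("gpt-3.5",)]
```

Grouping by different field combinations from the same flat rows keeps the per-criterion view and the per-model view consistent with each other.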
Iteration 1 — Scorecard for Launch
Leading up to our launch, we decided to prioritize a subset of the initial metrics (P0 Criteria below). We also modified the point scales for Alignment-Original, Repetition, Personalization, and Format to account for nuances/confusion that our Evaluators flagged to us the first time.
Below are our current criteria (changes from our first iteration are bolded).
P0 Criteria
P1 Criteria
P2 Criteria
In addition to the changes in criteria, we added two more aggregate criteria:
Reviewing Results
What Comes Next
We’re always looking to refine our existing testing framework. Here are two approaches we’re excited to experiment with next.
We’d love to hear from you!
Whether you’re a fellow AI builder, an automated testing provider, or just have tips for us, we’d love to hear your thoughts. You can reach me at jacqueline@tofuhq.com.