In this post, I’m sharing how we’re developing our quality evaluation framework for AI-generated outputs at Tofu. As a team, we’re aware that consumers are still early in their journey of trusting any LLM-generated output. Any successful generative AI application needs to nail quality evaluation to produce a best-in-class product. A couple of reasons why we want to #buildinpublic:
Source: Sequoia Capital: Generative AI Act Two
For AI-first companies to reach their full potential, it’s crucial to gain user trust similar to that placed in human assistants. While most AI companies start with automating singular workflows, their biggest opportunities involve automating entire systems. This is only possible if trust is established early on.
For example, at Tofu, our vision begins with AI-assisted content for marketing and evolves toward a fully automated top-of-funnel workflow based on data-driven recommendations.
2. Openness Fosters Accountability and Learning: We’ve been testing quality for months before our launch out of stealth. As we rapidly introduce new features, we’re unwavering in prioritizing quality. Sharing our progress not only holds us accountable but also helps us learn best practices from our broader network.
A Glimpse into Personalization with Tofu
Before delving into our quality evaluation design, here’s a brief on what Tofu does.
We enable B2B marketing teams to generate on-brand, omnichannel personalized content, from emails to eBooks. Customers feed their brand collateral and target segment information into our proprietary Playbook, and Tofu crafts countless variations of tailored content from there.
As a simple example, I’ll walk you through how we personalize a generic Tofu landing page for an account, Heap.io, and a Persona (CMO).
In the Tofu app, we select the components of the original that we want to personalize by account and persona, and generate the output in our Factory.
As you can see, the output adds some personalized detail about the industry Heap is in (digital insights) as well as details relevant to the CMO role.
Our ultimate goal is for marketers to confidently publish Tofu-generated content with minimal oversight.
Quality Evaluation Blueprint
Our CEO, EJ, outlined the foundational guidelines for our testing process a few months back. In the spirit of authenticity, the following key points are directly extracted from the original document:
Designing our Metrics and Scoring System
In our first iteration, we had a 10-point scale. Notice that for all of our criteria besides personalization, we stuck to a binary metric.
We decided from the start that we would recruit third-party evaluators to eliminate biases from our team. We want our criteria and instructions to be simple enough that someone who has never heard of Tofu can follow them. We decided against a purely automated process because we want our human review to mirror that of a real marketing professional evaluating content that Tofu has generated for them.
Scoring Criteria — 10 points possible
We also added a column for Additional Comments, where Evaluators could note any other errors that weren’t accounted for in our scoring criteria, or anything they were confused about. This feedback is extremely helpful in the early iterations of your tests to help you clarify instructions, particularly as some of the criteria, like the ones relating to Alignment, are subjective.
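As a minimal sketch, a scorecard like this could be represented in code as follows. The criterion names and the point split are hypothetical: the sketch assumes six binary checks plus a 0 to 4 personalization score so that the maximum comes to 10. `Scorecard` is an illustrative name, not part of any Tofu tooling.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical rubric: names and point values are illustrative assumptions,
# chosen only so that the maximum adds up to 10.
BINARY_CRITERIA = [
    "alignment_original",
    "alignment_target",
    "repetition",
    "format",
    "grammar",
    "accuracy",
]
PERSONALIZATION_MAX = 4  # the one non-binary criterion in this sketch
MAX_SCORE = len(BINARY_CRITERIA) + PERSONALIZATION_MAX  # 10 in this sketch


@dataclass
class Scorecard:
    content_id: str
    binary: dict[str, bool]  # criterion name -> pass/fail
    personalization: int     # 0..PERSONALIZATION_MAX
    comments: str = ""       # free-text "Additional Comments" column

    def __post_init__(self) -> None:
        # Reject incomplete or out-of-range scorecards early.
        missing = set(BINARY_CRITERIA) - set(self.binary)
        if missing:
            raise ValueError(f"missing criteria: {sorted(missing)}")
        if not 0 <= self.personalization <= PERSONALIZATION_MAX:
            raise ValueError("personalization score out of range")

    def total(self) -> int:
        # Each passed binary criterion is worth one point.
        binary_points = sum(1 for name in BINARY_CRITERIA if self.binary[name])
        return binary_points + self.personalization
```

Keeping the comments field on the same record as the scores makes it easy to review an evaluator's notes next to the criteria they relate to.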
Running the Test
We trained a handful of evaluators on Upwork, a contractor marketplace. Here’s the exact first job description we posted.
Some of our best advice for the test design process:
Processing Results — First Pass
This was our first-ever quality test results dashboard. The goal was to highlight which content types and criteria our team should focus on improving, and also to highlight the quality gap between our live models (at the time, GPT-3.5 and GPT-4), given that the latter is much more expensive to run.
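The kind of breakdown such a dashboard shows can be sketched roughly as below. The row layout, model labels, content type, and pass/fail values are illustrative placeholders rather than actual test results. The helper names `pass_rates` and `model_gap` are hypothetical.

```python
from collections import defaultdict


def pass_rates(rows):
    """Return {(model, content_type, criterion): fraction of passes}."""
    passed = defaultdict(int)
    seen = defaultdict(int)
    for row in rows:
        key = (row["model"], row["content_type"], row["criterion"])
        seen[key] += 1
        if row["passed"]:
            passed[key] += 1
    return {key: passed[key] / seen[key] for key in seen}


def model_gap(rates, pricier, cheaper):
    """Pass-rate difference (pricier - cheaper) per (content_type, criterion)."""
    gaps = {}
    for (model, content_type, criterion), rate in rates.items():
        if model != pricier:
            continue
        other = rates.get((cheaper, content_type, criterion))
        if other is not None:  # only compare cells both models were scored on
            gaps[(content_type, criterion)] = rate - other
    return gaps


# Placeholder evaluator rows, one per (output, criterion) judgment.
rows = [
    {"model": "gpt-4", "content_type": "email", "criterion": "repetition", "passed": True},
    {"model": "gpt-4", "content_type": "email", "criterion": "repetition", "passed": True},
    {"model": "gpt-3.5", "content_type": "email", "criterion": "repetition", "passed": True},
    {"model": "gpt-3.5", "content_type": "email", "criterion": "repetition", "passed": False},
]

rates = pass_rates(rows)
gaps = model_gap(rates, pricier="gpt-4", cheaper="gpt-3.5")
```

Sorting `gaps` by value would surface the content type and criterion pairs where the cheaper model falls furthest behind.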
Iteration 1 — Scorecard for Launch
Leading up to our launch, we decided to prioritize a subset of the initial metrics (P0 Criteria below). We also modified the point scales for Alignment-Original, Repetition, Personalization, and Format to account for nuances/confusion that our Evaluators flagged to us the first time.
Below are our current criteria (changes from our first iteration are bolded):
P0 Criteria
P1 Criteria
P2 Criteria
In addition to the changes in criteria, we added two more aggregate criteria:
Reviewing Results
What Comes Next
We’re always looking to refine our existing testing framework. Here are two approaches we’re excited to experiment with next.
We’d love to hear from you!
Whether you’re a fellow AI builder, an automated testing provider, or just have tips for us, we’d love to hear your thoughts. You can reach me at jacqueline@tofuhq.com.