In this post, I’m sharing how we’re developing our quality evaluation framework for AI-generated outputs at Tofu. As a team, we know consumers are still early in their journey of trusting LLM-generated output, and any successful generative AI application needs to nail quality evaluation to produce a best-in-class product. A couple of reasons why we want to #buildinpublic:
Source: Sequoia Capital, “Generative AI’s Act Two”
1. Trust Must Be Established Early: For AI-first companies to reach their full potential, it’s crucial to gain user trust similar to that placed in human assistants. While most AI companies start with automating singular workflows, their biggest opportunities involve automating entire systems. This is only possible if trust is established early on.
For example, at Tofu, our vision begins with AI-assisted content for marketing and evolves toward a fully automated top-of-funnel workflow based on data-driven recommendations.
2. Openness Fosters Accountability and Learning: We’ve been testing quality for months before our launch out of stealth. As we rapidly introduce new features, we’re unwavering in prioritizing quality. Sharing our progress not only holds us accountable but also helps us learn best practices from our broader network.
A Glimpse into Personalization with Tofu
Before delving into our quality evaluation design, here’s a brief on what Tofu does.
We enable B2B marketing teams to generate on-brand, omnichannel personalized content, from emails to eBooks. Customers feed their brand collateral and target segment information into our proprietary Playbook, and Tofu crafts countless variations of tailored content from there.
As a simple example, I’ll walk you through how we personalize a generic Tofu landing page for an account, Heap.io, and a Persona (CMO).
In the Tofu app, we select the components of the original that we want to personalize by account and persona, and generate the output in our Factory.
As you can see, the output adds some personalized detail about the industry Heap is in (digital insights) as well as details relevant to the CMO role.
Our ultimate goal is for marketers to confidently publish Tofu-generated content with minimal oversight.
Quality Evaluation Blueprint
Our CEO, EJ, outlined the foundational guidelines for our testing process a few months back. In the spirit of authenticity, the following key points are directly extracted from the original document:
Designing our Metrics and Scoring System
In our first iteration, we used a 10-point scale. Note that for every criterion besides Personalization, we stuck to a binary metric.
We decided from the start to recruit third-party evaluators to eliminate bias from our team. We want our criteria and instructions to be clear enough that someone who has never heard of Tofu can follow them. We decided against a purely automated process because we want our human review to mirror that of a real marketing professional evaluating content Tofu has generated for them.
Scoring Criteria — 10 points possible
We also added a column for Additional Comments, where Evaluators could note any other errors that weren’t accounted for in our scoring criteria, or anything they were confused about. This feedback is extremely helpful in the early iterations of a test, as it helps clarify instructions, particularly since some of the criteria, like those relating to Alignment, are subjective.
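To make the rubric concrete, here’s a minimal sketch of how a scorecard like this could be encoded. The criterion names, point weights, and the 0–4 personalization scale below are hypothetical stand-ins for illustration, not Tofu’s exact rubric; the structure just mirrors the idea of mostly binary checks plus one scaled metric, summing to 10 points.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    """Hypothetical 10-point rubric: binary checks plus a 0-4 personalization scale."""
    grammar_ok: bool              # 1 point: no grammar or spelling errors
    format_ok: bool               # 1 point: output matches the original's format
    no_repetition: bool           # 1 point: no repeated phrases or ideas
    aligned_with_original: bool   # 1 point: stays on-message with the source content
    factually_grounded: bool      # 2 points: no hallucinated claims
    personalization: int          # 0-4: depth of account/persona personalization

    def total(self) -> int:
        """Sum the rubric to a score out of 10."""
        if not 0 <= self.personalization <= 4:
            raise ValueError("personalization must be between 0 and 4")
        binary = (
            self.grammar_ok
            + self.format_ok
            + self.no_repetition
            + self.aligned_with_original
        )
        return binary + 2 * self.factually_grounded + self.personalization
```

Free-text feedback like the Additional Comments column lives outside the numeric score; keeping it separate makes the scored fields easy to aggregate while still capturing evaluator confusion.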
Running the Test
We trained a handful of evaluators on Upwork, a contractor marketplace. Here’s the exact first job description we posted.
Some of our best advice for the test design process:
Processing Results — First Pass
This was our first-ever quality test results dashboard. The goal was to highlight which content types and criteria our team should focus on improving, and also to surface the quality gap between our live models at the time (GPT-3.5 and GPT-4), given the latter is much more expensive to run.
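A dashboard like that can start from a very small aggregation step. The sketch below (field names are illustrative, not our actual schema) groups raw evaluator scores by content type, criterion, and model, then averages them, which is enough to see the per-criterion gap between GPT-3.5 and GPT-4 at a glance:

```python
from collections import defaultdict
from statistics import mean

def summarize(rows):
    """Average evaluator scores per (content_type, criterion, model) bucket.

    `rows` is a list of dicts like:
      {"content_type": "landing_page", "criterion": "personalization",
       "model": "gpt-4", "score": 3}
    """
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r["content_type"], r["criterion"], r["model"])].append(r["score"])
    return {key: mean(scores) for key, scores in buckets.items()}

rows = [
    {"content_type": "landing_page", "criterion": "personalization", "model": "gpt-4", "score": 4},
    {"content_type": "landing_page", "criterion": "personalization", "model": "gpt-4", "score": 3},
    {"content_type": "landing_page", "criterion": "personalization", "model": "gpt-3.5", "score": 2},
]
summary = summarize(rows)
# summary[("landing_page", "personalization", "gpt-4")] -> 3.5
```

From here, pivoting the summary into a table per content type (criteria as rows, models as columns) is what turns raw evaluator spreadsheets into a prioritization view.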
Iteration 1 — Scorecard for Launch
Leading up to our launch, we decided to prioritize a subset of the initial metrics (P0 Criteria below). We also modified the point scales for Alignment-Original, Repetition, Personalization, and Format to account for nuances/confusion that our Evaluators flagged to us the first time.
Below are our current criteria (changes from our first iteration in bold):
P0 Criteria
P1 Criteria
P2 Criteria
In addition to the changes in criteria, we added two more aggregate criteria:
Reviewing Results
What Comes Next
We’re always looking to refine our existing testing framework. Here are two approaches we’re excited to experiment with next.
We’d love to hear from you!
Whether you’re a fellow AI builder, an automated testing provider, or just have tips for us, we’d love to hear your thoughts. You can reach me at jacqueline@tofuhq.com