In this post, I’m sharing how we’re developing our quality evaluation framework for AI-generated outputs at Tofu. As a team, we’re aware that consumers are still early in their journey of trusting any LLM-generated output. Any successful generative AI application needs to nail quality evaluation to produce a best-in-class product. A couple of reasons why we want to #buildinpublic:
Source: Sequoia Capital: Generative AI Act Two
For AI-first companies to reach their full potential, it’s crucial to gain user trust similar to that placed in human assistants. While most AI companies start with automating singular workflows, their biggest opportunities involve automating entire systems. This is only possible if trust is established early on.
For example, at Tofu, our vision begins with AI-assisted content for marketing and evolves toward a fully automated top-of-funnel workflow based on data-driven recommendations.
2. Openness Fosters Accountability and Learning: We’ve been testing quality for months before our launch out of stealth. As we rapidly introduce new features, we’re unwavering in prioritizing quality. Sharing our progress not only holds us accountable but also helps us learn best practices from our broader network.
A Glimpse into Personalization with Tofu
Before delving into our quality evaluation design, here’s a brief on what Tofu does.
We enable B2B marketing teams to generate on-brand, omnichannel personalized content, from emails to eBooks. Customers feed their brand collateral and target segment information into our proprietary Playbook, and Tofu crafts countless variations of tailored content from there.
As a simple example, I’ll walk you through how we personalize a generic Tofu landing page for an account, Heap.io, and a Persona (CMO).
In the Tofu app, we select the components of the original that we want to personalize by account and persona, and generate the output in our Factory.
As you can see, the output adds some personalized detail about the industry Heap is in (digital insights) as well as details relevant to the CMO role.
Our ultimate goal is for marketers to confidently publish Tofu-generated content with minimal oversight.
Quality Evaluation Blueprint
Our CEO, EJ, outlined the foundational guidelines for our testing process a few months back. In the spirit of authenticity, the following key points are directly extracted from the original document:
Designing our Metrics and Scoring System
In our first iteration, we had a 10-point scale. Notice that for all of our criteria besides personalization, we stuck to a binary metric.
We decided from the start that we would recruit third-party evaluators to eliminate biases from our team. We want our criteria and instructions to be simple enough that someone who has never heard of Tofu can follow them. We decided against a purely automated process because we want our human review to mirror that of a real marketing professional evaluating content that Tofu has generated for them.
Scoring Criteria — 10 points possible
We also added a column for Additional Comments, where Evaluators could note any other errors that weren’t accounted for in our scoring criteria, or anything they were confused about. This feedback is extremely helpful in the early iterations of your tests to help you clarify instructions, particularly as some of the criteria, like the ones relating to Alignment, are subjective.
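To make the scoring structure concrete, here is a minimal sketch of how one evaluator's scorecard could be represented. It assumes a lot: the criterion names are partly borrowed from this post (Alignment-Original, Repetition, Format, Personalization) and partly hypothetical (alignment_target, grammar), and the point caps are illustrative values chosen only so that binary criteria plus a graded personalization score add up to 10. It is not our actual rubric.

```python
from dataclasses import dataclass

# Hypothetical criteria and point caps. The only facts taken from the post are
# that the scale totals 10 points and that every criterion except
# personalization is binary (0 or 1).
MAX_POINTS = {
    "alignment_original": 1,
    "alignment_target": 1,   # hypothetical criterion name
    "repetition": 1,
    "format": 1,
    "grammar": 1,            # hypothetical criterion name
    "personalization": 5,    # the one non-binary criterion (cap is assumed)
}


@dataclass
class Evaluation:
    """One evaluator's scores for one piece of generated content."""
    evaluator: str
    scores: dict
    comments: str = ""  # the free-text Additional Comments column

    def total(self) -> int:
        """Validate every score against its cap, then return the sum."""
        missing = set(MAX_POINTS) - set(self.scores)
        if missing:
            raise ValueError(f"missing criteria: {sorted(missing)}")
        for name, value in self.scores.items():
            if name not in MAX_POINTS:
                raise ValueError(f"unknown criterion: {name}")
            if not 0 <= value <= MAX_POINTS[name]:
                raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
        return sum(self.scores.values())


example = Evaluation(
    evaluator="evaluator-1",
    scores={
        "alignment_original": 1,
        "alignment_target": 1,
        "repetition": 0,
        "format": 1,
        "grammar": 1,
        "personalization": 3,
    },
    comments="Unsure whether a repeated tagline counts as repetition.",
)
```

Validating scores against the caps up front catches data-entry mistakes from evaluators early, and keeping the comments alongside the scores makes it easier to trace a confusing score back to an unclear instruction.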
Running the Test
We trained a handful of evaluators on Upwork, a contractor marketplace. Here’s the exact first job description we posted.
Some of our best advice for the test design process:
Processing Results — First Pass
This was our first-ever quality test results dashboard. The goal was to highlight which content types and criteria our team should focus on improving, and also to highlight the quality gap between our live models (at the time, GPT-3.5 and GPT-4), given that the latter is much more expensive to generate with.
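As a rough illustration of the kind of roll-up behind a dashboard like this, the sketch below averages total scores by content type and model, then computes the gap between two models. The row layout, field names, and sample numbers are all made up for illustration; they are not our results or our actual pipeline.

```python
from collections import defaultdict
from statistics import mean


def average_scores(rows, keys):
    """Group rows by the given keys and average their 'score' field."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["score"])
    return {group: mean(values) for group, values in groups.items()}


def model_gap(rows, baseline, candidate):
    """Average score of the candidate model minus that of the baseline."""
    by_model = average_scores(rows, ["model"])
    return by_model[(candidate,)] - by_model[(baseline,)]


# Made-up sample rows: one total score (out of 10) per evaluated output.
sample_rows = [
    {"content_type": "email", "model": "gpt-3.5", "score": 6},
    {"content_type": "email", "model": "gpt-4", "score": 9},
    {"content_type": "landing_page", "model": "gpt-3.5", "score": 8},
    {"content_type": "landing_page", "model": "gpt-4", "score": 7},
]

by_type_and_model = average_scores(sample_rows, ["content_type", "model"])
gap = model_gap(sample_rows, baseline="gpt-3.5", candidate="gpt-4")
```

The same `average_scores` helper works per criterion too if each row carries a criterion name, which is how a view like this can point at the specific content type and criterion that most needs work.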
Iteration 1 — Scorecard for Launch
Leading up to our launch, we decided to prioritize a subset of the initial metrics (P0 Criteria below). We also modified the point scales for Alignment-Original, Repetition, Personalization, and Format to account for nuances/confusion that our Evaluators flagged to us the first time.
Below are our current criteria (changes from our first iteration bolded):
P0 Criteria
P1 Criteria
P2 Criteria
In addition to the changes in criteria, we added two more aggregate criteria:
Reviewing Results
What Comes Next
We’re always looking to refine our existing testing framework. Here are two approaches we’re excited to experiment with next.
We’d love to hear from you!
Whether you’re a fellow AI builder, an automated testing provider, or just have tips for us, we’d love to hear your thoughts. You can reach me at jacqueline@tofuhq.com.