Building our Quality Evaluation Framework at Tofu

In this post, I’m sharing how we’re developing our quality evaluation framework for AI-generated outputs at Tofu. As a team, we’re aware that consumers are still early in their journey of trusting any LLM-generated output. Any successful generative AI application needs to nail quality evaluation to produce a best-in-class product. A couple of reasons why we want to #buildinpublic:

1. Building Trust is Essential for AI Adoption: Given ChatGPT’s meteoric rise as the fastest-growing application, the demand for generative AI tools is undeniable. However, the excitement fades quickly when users can’t obtain trustworthy, high-quality outputs. Despite ChatGPT boasting the highest retention among AI-first peers, it retains only 56% of its users after one month, compared to the 63% median for traditional incumbents.

Source: Sequoia Capital: Generative AI Act Two

For AI-first companies to reach their full potential, it’s crucial to gain user trust similar to that placed in human assistants. While most AI companies start with automating singular workflows, their biggest opportunities involve automating entire systems. This is only possible if trust is established early on.

For example, at Tofu, our vision begins with AI-assisted content for marketing and evolves toward a fully automated top-of-funnel workflow based on data-driven recommendations.

2. Openness Fosters Accountability and Learning: We’ve been testing quality for months before our launch out of stealth. As we rapidly introduce new features, we’re unwavering in prioritizing quality. Sharing our progress not only holds us accountable but also helps us learn best practices from our broader network.

A Glimpse into Personalization with Tofu

Before delving into our quality evaluation design, here’s a brief on what Tofu does.

We enable B2B marketing teams to generate on-brand, omnichannel personalized content, from emails to eBooks. Customers feed their brand collateral and target segment information into our proprietary Playbook, and Tofu crafts countless variations of tailored content from there.

As a simple example, I’ll walk you through how we personalize a generic Tofu landing page for an account, Heap.io, and a Persona (CMO).

In the Tofu app, we select the components of the original that we want to personalize by account and persona, and generate the output in our Factory.

As you can see, the output adds some personalized detail about the industry Heap is in (digital insights) as well as details relevant to the CMO role.

Our ultimate goal is for marketers to confidently publish Tofu-generated content with minimal oversight.

Quality Evaluation Blueprint

Our CEO, EJ, outlined the foundational guidelines for our testing process a few months back. In the spirit of authenticity, the following key points are directly extracted from the original document:

Goals
- Define a metrics system to measure quality of content generated by Tofu — what does “great” look like in the context of our use cases?
- Establish a process to measure this repeatedly and reliably
Metrics Guidelines
- Easy to understand and codify numerically (e.g. 1–10 scale)
- Less is more — ideally one aggregate number summarizes a set of evaluation criteria
- Reliable: If the same evaluation process is run 10 times, the metric should output the same value (or one within a reasonable margin of error)
Evaluation Use Cases
- Weekly: We want to regularly run our evaluation framework to assess the quality of our overall system. These tests are run regardless of whether we ship anything new.
- One-off: To evaluate any new models or specific changes that impact output, we want to be able to “run” a quick test. As such, our system needs to be set up to get feedback within 24 hours.
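
The reliability guideline above (the same evaluation run 10 times should land on roughly the same value) could be checked with something like the sketch below. It is only an illustration: the function name `is_repeatable`, the 0.5-point tolerance, and the sample scores are assumptions, not part of the original guidelines.

```python
from statistics import mean


def is_repeatable(run_scores, tolerance=0.5):
    """Return True if every run's aggregate score is within `tolerance` of the mean.

    run_scores: aggregate scores (for example on a 1-10 scale) from repeated
        runs of the same evaluation on the same content.
    tolerance: hypothetical margin of error, in points.
    """
    if not run_scores:
        raise ValueError("need at least one run")
    center = mean(run_scores)
    return all(abs(score - center) <= tolerance for score in run_scores)


# Hypothetical aggregate scores from 10 repeated runs of the same evaluation.
stable_runs = [8.0, 8.2, 7.9, 8.1, 8.0, 8.0, 7.8, 8.1, 8.0, 7.9]
noisy_runs = [8.0, 5.5, 9.0, 6.0, 8.5, 7.0, 9.5, 5.0, 8.0, 7.5]

print(is_repeatable(stable_runs))  # True
print(is_repeatable(noisy_runs))   # False
```

What counts as a "reasonable margin of error" is a judgment call; the tolerance here is a placeholder to tune against real evaluator data.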

Designing our Metrics and Scoring System

In our first iteration, we had a 10-point scale. Notice that for all of our criteria besides personalization, we stuck to a binary metric.

We decided from the start that we would recruit third-party evaluators to eliminate biases from our team. We want our criteria and instructions to be simple enough that someone who has never heard of Tofu can follow them. We decided against a purely automated process because we want our human review to mirror that of a real marketing professional evaluating content that Tofu has generated for them.

Scoring Criteria — 10 points possible

1 point each (1 if yes, 0 if no)
- Format: Do the bolding, paragraph styles, indents, capitalization, punctuation, colors, and number of lines match the original?
- Word Count: Is it within +/- 10% of the original word count?
- Tone: Does it maintain the tone of the original? For example, if the original is informative and objective, the generated copy should not try to sell your product. The tone should not be casual or conversational if the original is academic.
- Alignment — Original: Does the generated copy for the selection match the context of the original? For example, if the original copy talks about the benefits of a certain product, the generated output should talk about those benefits as well. If the original copy talks about the pain points or needs of a specific target, the new copy should also focus on those instead of selling a product.
- Alignment — Surrounding: Does the generated section flow with the rest of the asset? Does it take into consideration the sections that come before and after it?
- Repetition: Does the generated section present new information rather than repeating information already stated in other sections?
- Grammar: Is the output free of grammatical errors such as typos, punctuation errors, and run-on sentences? Sometimes the model will output stray HTML symbols like /n or /location/. These should be counted as grammatical errors.
Personalization — 3 points possible
- 0 points: Model inappropriately incorporates the target, such as adding information about the wrong target. This is worse than if you just left the original content the same because the personalization is wrong.
- 1 point: Model doesn’t incorporate the target/essentially leaves the content the same as the original. Thus the generated output doesn’t add value but is neutral relative to the original.
- 2 points: Model incorporates the target name correctly
- 3 points: In addition to meeting the bar for 2, the model incorporates relevant details about the target into the output (e.g., when customizing this page for Airship, a drone technology company, our output mentions a specific strategic initiative, “innovative drone technology”)
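
To make the arithmetic concrete, here is a minimal sketch of how a single component could be totaled under this first rubric: seven binary criteria plus up to 3 points for personalization, for 10 points possible. The function names, dictionary keys, and sample text are hypothetical; in practice our Evaluators filled these scores into a spreadsheet by hand.

```python
# The seven binary criteria (1 = passes, 0 = fails). Key names are illustrative.
BINARY_CRITERIA = [
    "format",
    "word_count",
    "tone",
    "alignment_original",
    "alignment_surrounding",
    "repetition",  # 1 means no unwanted repetition
    "grammar",
]
MAX_PERSONALIZATION = 3


def word_count_ok(original, generated, tolerance=0.10):
    """True if the generated word count is within +/- 10% of the original."""
    original_words = len(original.split())
    generated_words = len(generated.split())
    return abs(generated_words - original_words) <= tolerance * original_words


def score_component(binary_scores, personalization):
    """Total a component's score out of 10 under the first-iteration rubric."""
    for name in BINARY_CRITERIA:
        if binary_scores.get(name) not in (0, 1):
            raise ValueError(f"{name} must be scored 0 or 1")
    if personalization not in range(MAX_PERSONALIZATION + 1):
        raise ValueError("personalization must be between 0 and 3")
    return sum(binary_scores[name] for name in BINARY_CRITERIA) + personalization


# Hypothetical original and generated copy (10 and 11 words).
original = "Tofu helps B2B marketing teams create personalized content at scale"
generated = "Tofu helps Heap marketing teams create personalized insights content at scale"

example_scores = {name: 1 for name in BINARY_CRITERIA}
example_scores["word_count"] = int(word_count_ok(original, generated))
example_scores["repetition"] = 0  # evaluator flagged repeated information

print(score_component(example_scores, personalization=3))  # 9
```

Only the word count check is mechanical here; every other criterion still relies on an Evaluator's judgment.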

We also added a column for Additional Comments, where Evaluators could note any other errors that weren’t accounted for in our scoring criteria, or anything they were confused about. This feedback is extremely helpful in the early iterations of your tests to help you clarify instructions, particularly as some of the criteria, like the ones relating to Alignment, are subjective.

Running the Test

We trained a handful of evaluators on Upwork, a contractor marketplace. Here’s the exact first job description we posted.

Some of our best advice for the test design process:

You can never be too detailed: In addition to posting specific step-by-step instructions and clear examples of how I would score certain outputs, I also recorded a Loom walking through how to navigate the scoring spreadsheet and the Tofu app.
Have a quick demo in your screening process: To make sure that the raters we chose truly understood each of our criteria, we had them complete a demo (just 8 examples) as a requirement for applying to the job. In general, these should be tasks that take 5–10 minutes max. I gave them direct feedback through Upwork’s chat feature, and only extended a contract if they came back with the right edits promptly.
Geographic focus: We focused on raters in the Philippines because of cost effectiveness and their level of English proficiency. The time difference was also ideal: it allowed us to send out an evaluation test on Friday night and have the results back by Sunday night, just in time for our team syncs on Monday.
Have two test sets:
- Core: This test set represents the most common use cases across all our customers that we want to run weekly, so we chose 2 examples of each content type (Landing Page, PDFs, Emails, Blog Posts).
- Discretionary: Specific tests tailored to certain issues or as Tofu expands use cases. For example, if the team made a big push to improve formatting, you might want to run a ton of tests on just PDFs where formatting errors are most common.
Ship quickly: This first iteration (picking our criteria, designing our test process, creating our results dashboard) took just a week from having no process in place.
Train multiple raters: Having a bench of Evaluators readily available ensures reliable results and availability during peak testing periods.
Be more hands-on at first: I went through and filled out the entire evaluation myself the first time, and then also made edits on every Evaluator’s submission. I offered to hop on Zoom calls to clarify any outstanding questions they had. While this is a pain in the early days, it pays off in the long run.

Processing Results — First Pass

This was our first-ever quality test results dashboard. The goal was to highlight which content types and criteria our team should focus on improving, and also to highlight the quality gap between our live models (at the time, GPT-3.5 and GPT-4), given that generating with the latter is much more expensive.

Iteration 1 — Scorecard for Launch

Leading up to our launch, we decided to prioritize a subset of the initial metrics (P0 Criteria below). We also modified the point scales for Alignment-Original, Repetition, Personalization, and Format to account for nuances/confusion that our Evaluators flagged to us the first time.

Below are our current criteria (changes from our first iteration bolded).

P0 Criteria

Alignment — Original (2 points vs 1 point originally): Does the generated content cover all the key points of the original selection
- 1 point: Follows most of the original structure of the sentences, may add some things or omit some things, but not overly distracting (a reader who didn’t see the original would not notice an obvious error)
- 2 points: Perfectly matches and emphasizes the key points of the original
Alignment — Surrounding Context (1 point)
Tone (1 point)
Correctness of Personalization (1 point): If the generated content brings in wrong information (e.g., it adds information for the wrong target) or contains random HTML code that doesn’t make sense, it’s a 0. Otherwise, it’s a 1.
Repetition (2 points): Does the generated content repeat other already-generated content or the same value prop over and over again?
- 0 points: Repetition is an obvious error. Any reader could tell it’s a mistake.
- 1 point: Some repetition that isn’t in the original, but not a glaring error (e.g., paraphrasing or repeating a few key points). Generally, the piece still seems written by a marketing professional.
- 2 points: no repetitiveness that isn’t included in the original
Grammar (1 point)

P1 Criteria

Level of Personalization (2 points):
- 0 points: If the component didn’t get a point for correctness of personalization, it automatically gets a 0 for level of personalization as well.
- 1 point: In addition to getting the point for Correctness of Personalization in P0, Tofu correctly added the name of the new target in the generation. For example, if you’re personalizing a page about “health clinics” for the target location Sunnyvale and all Tofu does is change it to “Sunnyvale health clinic,” it gets 1 point.
- 2 points: In addition to meeting the criteria for 1 point, Tofu generated additional details that are unique and relevant to the target. If the component is fewer than 30 words and got the point for Correctness of Personalization in P0, it automatically gets 2 points for this section.
Length (1 point)

P2 Criteria

Format (2 points)
- 0 points: Formatting issues clearly stand out to a new reader
- 1 point: Some formatting does not match the original, but a new reader wouldn’t be able to tell/mistake isn’t glaring
- 2 points: Formatting perfectly matches the original with no errors
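
One way to picture the revised rubric is as a small table of maximum points per criterion, grouped by priority, plus the two automatic rules for Level of Personalization. The sketch below is illustrative only: the key names and helper functions are hypothetical, and the point values are taken from the criteria above.

```python
# Maximum points per criterion, grouped by priority level.
RUBRIC = {
    "P0": {
        "alignment_original": 2,
        "alignment_surrounding": 1,
        "tone": 1,
        "correctness_of_personalization": 1,
        "repetition": 2,
        "grammar": 1,
    },
    "P1": {"level_of_personalization": 2, "length": 1},
    "P2": {"format": 2},
}


def max_points(priority=None):
    """Maximum achievable points for one priority level, or for all levels if None."""
    levels = [priority] if priority else list(RUBRIC)
    return sum(sum(RUBRIC[level].values()) for level in levels)


def apply_personalization_rules(scores, word_count):
    """Return a copy of `scores` with the automatic Level of Personalization rules applied."""
    adjusted = dict(scores)
    if adjusted["correctness_of_personalization"] == 0:
        # Wrong personalization: the level is automatically 0.
        adjusted["level_of_personalization"] = 0
    elif word_count < 30:
        # Short component that is correctly personalized: automatically 2 points.
        adjusted["level_of_personalization"] = 2
    return adjusted


def validate(scores):
    """Raise ValueError if any criterion is missing or outside its point range."""
    for criteria in RUBRIC.values():
        for name, limit in criteria.items():
            if name not in scores or not 0 <= scores[name] <= limit:
                raise ValueError(f"{name} must be between 0 and {limit}")


# Hypothetical evaluator scores for one short (12-word) component.
raw_scores = {
    "alignment_original": 2,
    "alignment_surrounding": 1,
    "tone": 1,
    "correctness_of_personalization": 1,
    "repetition": 2,
    "grammar": 1,
    "level_of_personalization": 1,
    "length": 1,
    "format": 2,
}
final_scores = apply_personalization_rules(raw_scores, word_count=12)
validate(final_scores)

print(final_scores["level_of_personalization"])  # 2
print(max_points("P0"), max_points())  # 8 13
```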

In addition to the changes in criteria, we expanded our aggregate metrics. We now track:

All Criteria: This was the original metric we had, which is just the total number of points a selected component scores divided by 13 (all criteria regardless of priority level).
P0 Criteria: % of P0 points that a component scores
P0 Pass Rate: For a component to be publishable, we believe it needs to score at least 1 point on every P0 criterion. This is the % of components that do.
Publishable: We added a section in our dashboard focused on each content type (e.g., Tofu Landing Page). For an entire piece to be publishable, all of its components need to average >90% on P0 Criteria and pass the P0 test.
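
As a rough sketch, the aggregate metrics above could be computed like this. The function names, dictionary keys, and sample scores are hypothetical, and the publishable check encodes one reading of the rule: the average P0 percentage across a piece's components must exceed 90%, and every component must pass the P0 test.

```python
from statistics import mean

# Maximum points for each P0 criterion (8 points total).
P0_MAX = {
    "alignment_original": 2,
    "alignment_surrounding": 1,
    "tone": 1,
    "correctness_of_personalization": 1,
    "repetition": 2,
    "grammar": 1,
}
ALL_MAX_POINTS = 13  # P0 (8) + P1 (3) + P2 (2)


def all_criteria_pct(scores):
    """All Criteria: total points scored divided by 13."""
    return sum(scores.values()) / ALL_MAX_POINTS


def p0_pct(scores):
    """P0 Criteria: share of the available P0 points that a component scores."""
    return sum(scores[name] for name in P0_MAX) / sum(P0_MAX.values())


def passes_p0(scores):
    """P0 test: at least 1 point on every P0 criterion."""
    return all(scores[name] >= 1 for name in P0_MAX)


def p0_pass_rate(components):
    """P0 Pass Rate: fraction of components that pass the P0 test."""
    return sum(passes_p0(scores) for scores in components) / len(components)


def is_publishable(components):
    """Publishable: P0 average above 90% and every component passes the P0 test."""
    average = mean(p0_pct(scores) for scores in components)
    return average > 0.90 and all(passes_p0(scores) for scores in components)


# Hypothetical scores for three components of one landing page.
perfect = dict(P0_MAX, level_of_personalization=2, length=1, format=2)
minor_gap = dict(perfect, alignment_original=1)  # loses 1 P0 point, still passes
grammar_miss = dict(perfect, grammar=0)          # fails the P0 test

print(all_criteria_pct(perfect))                           # 1.0
print(p0_pct(minor_gap))                                   # 0.875
print(is_publishable([perfect, minor_gap]))                # True
print(is_publishable([perfect, minor_gap, grammar_miss]))  # False
```

The last example shows why the pass rate is tracked separately: the three components still average above 90% on P0, but a single zero on grammar keeps the piece from being publishable.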

Link to Google Sheet Version

Reviewing Results

Weekly Priorities in Notion: I paste the two charts below into our team Notion weekly and give the 30-second update on what changes we made and what criteria are performing well. Anything more granular happens in smaller meetings.
#Qualitytesting: We have a Slack channel for more day-to-day updates. Here we err on the side of overcommunicating. We’ll post any updates related to quality that come up in engineering team meetings, quick readouts from quality evaluation meetings, feature updates, etc. It’s a detailed play-by-play where anyone new to the team can trace back all the big decisions around our quality testing process from the first day we started measuring.
Physical Printouts: More for fun, but also great motivation. We print out our weekly dashboards and paste them on our whiteboard, so they’re the first thing you see when you walk into our office.

What Comes Next

We’re always looking to refine our existing testing framework. Here are two approaches we’re excited to experiment with next.

GPT-4 Utilization: Training GPT-4 to handle 90% of routine evaluations and reserving human review for niche cases. This seems to be the most popular approach among our peers today. We’re excited about the potential of this method given GPT-4’s recently launched update, which allows it to analyze images.
Automated LLM Testing Startups: They’ve popped up across the entire software stack, from infrastructure-level criteria (data security, latency, observability, hallucinations) to more user-facing criteria (context of outputs, session tracking, prompt engineering, etc.). A few on our radar are Patronus AI, Guardrails, Humanloop, Baserun, and Braintrust, among others.

We’d love to hear from you!

Whether you’re a fellow AI builder, an automated testing provider, or just have tips for us, we’d love to hear your thoughts. You can reach me at jacqueline@tofuhq.com

‍
