Prompt engineering and evaluating AI-generated copy
LLM engineering and evaluation is still a relatively new field. During my time at Thumbtack, I learned and applied new skills in this next generation of content editing.
One of my projects in 2025 involved using LLMs to pre-fill messages that homeowners could send to a home care professional. These messages would be sent whenever a customer sent a job request for work to be done on their house.
LLM-generated customer messages
One of our pro product managers came to me with a problem. Customers in our baseline product filled out a job request for some kind of home care - landscaping, roof repair, plumbing, etc. They would pick one or more pros, send their job request, and ask for a quote and availability.
They could also optionally send a message to the pro, with any relevant information not covered in the details of the job request. This could include any special needs, initial suggestions for dates and times, or anything the customer felt their pro should know about the house.
The opportunity
Over time, pros had started to rely on these customer messages to help them understand the customer’s project better. Often pros preferred to look at the customer messages first before looking at the details of the project. This was because the message felt more conversational, and created an open door through which to reply to the customer. Pros were more likely to reply quickly on the same day if the customer filled out these messages. Otherwise, the pro might shelve the project to look at later, when they had time to comb through a list of details.
However, these customer messages were optional. Only 32% of customers bothered to create one. The rest opted to skip this step and leave the message field blank. As a business, Thumbtack felt that making the message mandatory would cause too much friction for the average customer.
Among the 32% of messages sent, many contained very little in terms of helpful information for the pro. Customers often used the space to write things like:
“Could you give me your rate per hour? I’m shopping around. Thank you.”
…or,
“I was wondering what the process would look like, and if it’d be possible for this to be completed soon. Let me know.”
This was better than nothing, because it still opened that door of communication for the pro. But it meant more back-and-forth between the pro and the customer, with the pro often asking for details that could have been supplied in the initial message. During that back-and-forth, the customer or pro might get busy or distracted with other priorities. This led to fewer leads converting into finished jobs.
The PM asked me if there was some way to:
Increase the number of customer messages;
Improve the quality of these messages; and
Get pros replying faster and potentially converting more jobs.
Content goals
I came up with the idea of using an LLM to pre-fill the message box with a sample message. This sample message would be conversational. But it would also include a summary of the project details filled out by the customer, as well as a request for a reply from the pro. The customer could then choose to A) send the message as-is; B) edit the message to their heart’s content, and then send; or C) wipe it clean and write their own or leave it blank.
It was important to me that the prefilled message:
Sound plausibly human, like something a customer might actually type;
Have some variety and differentiation, so pros weren’t getting identical-sounding messages from different customers;
Contain the most useful information from the customer’s project details, for use by the pro; and
Do it all in the briefest space possible - preferably under 300 characters, to match the average length of messages currently being sent by customers.
A fifth goal was the subject of debate right from the beginning. I believed the messages should be flagged as having an AI influence. I felt we shouldn't try to fool our pros into thinking these messages were always authentically written by customers.
Beyond it being the honest thing to do, I worried that AI-generated messages could never seem perfectly human. Pros might start to 'sniff out' the AI influence. If we didn't label it as AI upfront, they might find the whole process disingenuous, and feel that the overall lead quality was low.
The product manager argued that, since the customer could edit the messages at any time, this flag wasn’t necessary. Theoretically, every message would be at least approved, if not edited, by the customer.
Choosing the right LLM
I started to feed rough draft prompts, using sample project details and other information, to a series of LLMs that we had access to at Thumbtack. This included LLMs developed or owned by OpenAI, Anthropic, Meta, and others available on Amazon Web Services.
The goal was to suss out the LLM that would provide the most readable, conversational language, based on the project variables provided. In the first round of prompting, I judged that there was a close tie between two very similar AWS Bedrock LLMs: Llama, an open-source model from Meta, and Mistral, created by the French company of the same name. Both of these LLMs arranged the draft content in a way that felt human and logical, even before additional prompt refinement.
Early samples of customer messages from ChatGPT, Anthropic’s Claude, Mistral AI, Qwen Chat, and Meta’s Llama.
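To give a sense of the comparison workflow, here's a minimal sketch of how the same draft prompt might be sent to a couple of Bedrock-hosted models for side-by-side review. The model IDs, prompt text, and configuration are illustrative assumptions for the sketch, not Thumbtack's actual pipeline.

```python
# Minimal sketch: send one draft prompt to several Bedrock-hosted models and
# print the candidate customer messages for side-by-side review.
# Model IDs and prompt text are illustrative placeholders, not production values.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

CANDIDATE_MODELS = {
    "Llama": "meta.llama3-1-70b-instruct-v1:0",    # example ID; check your account's model catalog
    "Mistral": "mistral.mistral-large-2402-v1:0",  # example ID
}

draft_prompt = (
    "You are a homeowner on a home-services marketplace. "
    "Summarize the project details below as a short, friendly opening message to a pro.\n\n"
    "Project: interior painting, 3 rooms, flexible on dates."
)

for name, model_id in CANDIDATE_MODELS.items():
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": draft_prompt}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0.7},
    )
    message = response["output"]["message"]["content"][0]["text"]
    print(f"--- {name} ---\n{message}\n")
```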
Prompt refining
Once I’d picked my top 2 LLMs, I started to enhance and expand the prompts. My goal was to make the initial messages they spit back more useful, more accurate, and more human. This came down to 5 essential sections of the prompt:
1) Persona
The LLM needed to understand the kind of customer it was imitating. It was important to instruct the LLM to write as a homeowner looking for a professional.
But I also wanted the LLM to understand the context. This customer was on a double-sided marketplace, pairing pros with customers; and had just filled out a short series of details to initiate a new home project.
Sample of a prompt sent to Llama 3.1.
2) Grammar and Tone
I wanted these messages to err on the side of friendly and conversational, but not at the expense of a character limit. Although in the real world, customers could sometimes be curt or rude to pros, this was an opening message designed to spark conversation between both parties. So I fed the LLM examples that were on the more pleasant side. In particular, I told it to avoid aggressive or demanding openers, like “You need to help me with my bathroom.”
3) Input Data
I worked with our engineering team to create a data set that mapped to the project intake form filled out by the customer. It included the following (an illustrative example follows this list):
the full search query from the customer (a hand-typed answer from the customer to the question “what are you looking for?”);
a list of answers chosen by the customer to short questions Thumbtack automatically asked about the project; and
some initial scheduling information.
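For context, the input payload might look something like the sketch below. The field names and values are my own illustration of those three buckets, not the actual Thumbtack schema.

```python
# Hypothetical shape of the input data passed into the prompt; field names are
# illustrative, not the actual Thumbtack schema.
project_input = {
    # the customer's hand-typed answer to "what are you looking for?"
    "search_query": "need someone to repaint my living room and hallway",
    # answers the customer chose to Thumbtack's structured project questions
    "structured_answers": [
        {"question": "What needs painting?", "answer": "Interior walls"},
        {"question": "How many rooms?", "answer": "2 rooms"},
    ],
    # initial scheduling information
    "scheduling": {"preferred_start": "within 2 weeks", "flexible": True},
}
```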
4) Exceptions
Certain data didn’t work well in the LLM responses. In particular, we offered multiple-choice answers to certain project questions that displayed as a range - for example, “I am 50-65 years old,” or, “the house is between 6-8 rooms.” These answers felt awkward and inhuman when echoed back by the AI - for example, “I am 50-65 years old and I’m looking for a personal trainer,” or, “I need some painting done in my between 6-8 room house.” (Everyone knows their exact age, and how many rooms are in their own home.)
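One way to handle this, sketched below, is to flag range-style answers before they reach the prompt, so they can be rephrased or omitted. The regex and the filtering rule are illustrative assumptions, not the exceptions we actually shipped.

```python
import re

# Hypothetical preprocessing step: flag range-style answers ("6-8 rooms",
# "50-65 years old") so the prompt can tell the model to rephrase or omit them
# rather than echo them verbatim.
RANGE_PATTERN = re.compile(r"\b\d+\s*-\s*\d+\b")

def is_range_answer(answer: str) -> bool:
    return bool(RANGE_PATTERN.search(answer))

answers = ["Interior walls", "between 6-8 rooms", "50-65 years old"]
usable = [a for a in answers if not is_range_answer(a)]
flagged = [a for a in answers if is_range_answer(a)]

print(usable)   # ['Interior walls']
print(flagged)  # ['between 6-8 rooms', '50-65 years old']
```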
5) Task
With all these pieces in place, I refined the AI’s task as follows (a sketch of the assembled prompt appears after this list):
summarize the customer request in one to three sentences;
do your best to make the summary seem like it could have come from a human;
reword or rephrase the original customer query, as well as answers to Thumbtack questions, so they sound more natural; and
optimize most of all for readability.
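Put together, the prompt had roughly the shape sketched below. The wording is a paraphrase and the helper function is hypothetical; it shows how the five sections slot together, not the exact production prompt.

```python
# Rough shape of the final prompt, assembled from the five sections above.
# The wording is an illustrative paraphrase, not the exact production prompt.
def build_prompt(project_input: dict) -> str:
    persona = (
        "You are a homeowner on Thumbtack, a marketplace that pairs customers "
        "with home-care professionals. You just filled out a short project request."
    )
    tone = (
        "Write in a friendly, conversational tone. Avoid aggressive or demanding "
        "openers like 'You need to help me with my bathroom.' Stay under 300 characters."
    )
    exceptions = (
        "If an answer is a numeric range (like '6-8 rooms'), rephrase it vaguely "
        "or leave it out; a real person would know the exact number."
    )
    task = (
        "Summarize the request below in one to three sentences, rewording the "
        "original query and answers so they sound natural. Optimize for readability."
    )
    data = (
        f"Search query: {project_input['search_query']}\n"
        f"Answers: {project_input['structured_answers']}\n"
        f"Scheduling: {project_input['scheduling']}"
    )
    return "\n\n".join([persona, tone, exceptions, task, data])
```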
Evaluating LLM results
I fed a sampling of real, past customer project requests through my refined prompts for both Mistral and Llama. Each of these customers had left their message to the pro blank, or had written only a little unhelpful text. I wanted to see which of the two models would do a better job. I created a simple 3-tier grading structure for each message the models produced (a toy tally of these grades is sketched after the list below).
“PASS” meant I felt the message could pass as an authentic customer message, and could also be useful to the pro.
“NEEDS IMPROVING” meant the message contained something odd in its structure, or wasn’t as helpful as it could be.
“FAIL” meant there was something in the tone or structure of the message that didn’t fulfill the goals of the project.
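As a rough illustration, a grading pass could be tallied with something as simple as the sketch below. The grades shown are invented for the example; the real evaluation was a manual read of each generated message.

```python
from collections import Counter

# Toy illustration of the 3-tier grading pass. Grades here are invented for
# the example; the real evaluation was a manual read of each generated message.
GRADES = ("PASS", "NEEDS IMPROVING", "FAIL")

sample_grades = {
    "Mistral": ["PASS", "PASS", "NEEDS IMPROVING", "PASS", "FAIL"],
    "Llama":   ["PASS", "NEEDS IMPROVING", "NEEDS IMPROVING", "PASS", "FAIL"],
}

for model, grades in sample_grades.items():
    counts = Counter(grades)
    total = len(grades)
    summary = ", ".join(f"{g}: {counts.get(g, 0)}/{total}" for g in GRADES)
    print(f"{model}: {summary}")
```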
Overall, the models were pretty similar. But Mistral compiled the job details in a way that made slightly more grammatical sense on average. It got higher marks for readability and organization of information. For example, Llama described needing “a painting service for a business with walls that need major repairs.” This isn’t impossible to understand, but it takes a couple of reads. Mistral framed it as “I own a business with interior walls in need of major repairs and painting,” which reads much more easily.
Llama also tended to be more curt, and sometimes even a bit rude. Its messages felt a touch less conversational and natural, and had a more robotic vibe. It used phrases like “I need this done within X hours,” or, “someone has to walk my dog for 15 minutes per day,” which came off as a little bossy; Mistral framed the same requests as “let’s get started in the next X hours,” or, “can you walk my dog for 15 minutes per day?”
Design
When it was time for me to mock up the message in context in the product, we returned to the debate around flagging AI influence. Should we be clear with customers that we are using an AI to draft their messages? And what about pros? Should they know if a customer chooses an AI-generated message to send?
To help frame the debate visually, I created three variations of the customer message design:
Version A: no AI flag at all.
Version B: the message is referenced as “AI-generated” in sub copy.
Version C: sub copy reference, plus a standard AI icon.
I felt options B or C were strongly preferable. Making it clear, on both the customer and pro side, that AI contributed to the messaging seemed like the ethical and honest thing to do. I also worried that the messages would not always sound perfectly human, despite everything we’d done to refine the prompts. What would the effect on user sentiment be if people sniffed out the AI contribution on their own, without us admitting it?
The product manager preferred option A. He wasn’t confident that customers in 2025 liked AI enough to use it knowingly, but believed they’d be more likely to use a summarized message from Thumbtack if they weren’t thinking hard about where it came from. He also worried that the average pro might perceive an overtly AI-generated message as a “fake lead”: a low-quality effort from Thumbtack to generate fake customers.
In the end, the debate was settled by a review with our in-house legal advisor. Since AI was still very new to Thumbtack at the time, and the laws around it were new and changing, our legal team wanted to err on the side of maximum caution. So we went with option C.
Working with legal, I also added a scope item: conditionalize the design and content within the pro’s messenger at the point where they first see the new message. I created three variants, each of which might appear to the pro depending on what the customer had done with the original AI draft (sketched in rough form after this list):
Variant 1: The customer sends the AI-generated draft as their message, with no editing. In this case, display the AI symbol next to the message and preface the message itself with “AI-generated.”
Variant 2: The customer used the AI-generated draft, but edited it first. In this case, don’t display the symbol, but show the alternate copy “AI-assisted.”
Variant 3: The customer wiped the AI-generated draft clean and wrote their own message. In this case, the experience was essentially the same as baseline, with no AI symbol or copy.
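In engineering terms, the messenger treatment boiled down to a small conditional, sketched below with hypothetical field and label names; the real implementation lived in the messenger UI.

```python
# Hypothetical sketch of the pro-side messenger treatment. Field and label
# names are illustrative, not the production implementation.
def ai_label_for_message(used_ai_draft: bool, customer_edited: bool) -> dict:
    if used_ai_draft and not customer_edited:
        # Variant 1: draft sent untouched -> AI icon plus "AI-generated" preface
        return {"show_ai_icon": True, "preface": "AI-generated"}
    if used_ai_draft and customer_edited:
        # Variant 2: draft edited before sending -> no icon, "AI-assisted" copy
        return {"show_ai_icon": False, "preface": "AI-assisted"}
    # Variant 3: customer wiped the draft and wrote their own -> baseline, no labeling
    return {"show_ai_icon": False, "preface": None}
```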
Creating AI guidelines
Since this was one of my last projects at Thumbtack, I wasn’t around to see how the experiment turned out. Would customers send more messages as a result of this project? Would both pros & customers perceive value in these AI-generated summaries? Would the summaries be perceived as higher quality than the typical manually-written customer message? Would it encourage more pros to reply quickly to customers?
I wouldn’t get answers to these questions. But I did have the opportunity to shape the future usage of AI at Thumbtack going forward, by working with our legal and design team to create the first draft of our AI guidelines. I circulated these guidelines for feedback throughout the org, and published them on our internal style guide tool.