Prompt autopilot tools comparison.
Or a practical approach to TextGrad, DSPy, and other tooling.
TL;DR: You cannot rely on users' prompt engineering skills when you are building your AI application, so you need to use one of the prompt improvement frameworks like DSPy or TextGrad. This article is for developers wondering how to choose one and how to evaluate the results.
Join FestiveTechCalendar for more exciting content!
Plan
The main problem we have now is the velocity of change, which results in the adoption of frameworks that are still in alpha or pre-release stage without proper evaluation. So please consider the information in this article and evaluate the frameworks for your particular case. And if you are building a simple chatbot over enterprise data, it might be better to skip the tooling below :).
- The problem with user prompting
- DSPy: what it is and what it isn't
- TextGrad as a potential rival
- Summary
The problem
Do not expect that system users can express what they want, and do not force people to learn specific tricks or prompt engineering to work with your application. The example below is from an engineering application powered with extra tooling.
Example prompt: I have a spike at TJS23 and
alert at the nearest measuring point, what should I do?
A normal chat app will just hallucinate something or find irrelevant information for the abbreviation TJS23, but your fancy app should understand that it needs to search for the entity TJS23 somewhere, know what it is, and know how it relates to the measuring point.
So what would you do? Yes, you will need to connect some data sources and implement search plugins to query the database, but you also need to clarify the prompt into something like the one below. And getting this kind of result is not an easy task :).
Enhanced prompt: I have a temperature spike at sensor TJS23 located
at customer factory and temperature alert for temperature above 80 C
triggered at the nearest measuring point located at power grid,
what measures should be taken instantly, who should I notify about this?
From my perspective, these tools shine when you have a very specific AI app for a narrow business domain that can be enhanced in an automatic or semi-automatic way, or when you are using medium or small self-hosted LLMs.
DSPy
DSPy is a framework that enables the algorithmic optimization of large language model (LLM) prompts and weights; it also provides pipelines, which are out of scope for now :).
It is built around modules like dspy.Predict, dspy.ChainOfThought, or dspy.ReAct that have specific functionality and target different scenarios. Similar to how PyTorch simplifies the creation of neural networks, DSPy provides modules to replace manual prompting, plus optimizers.
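For illustration, here is a minimal sketch of the simplest module, dspy.Predict. The model name is just an example; configure whichever LM you actually use.
import dspy

# Configure any LM you have access to; 'openai/gpt-4o-mini' is only an example.
dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))

# A module is declared from a signature ('inputs -> outputs'), no prompt string required.
classify = dspy.Predict('sentence -> sentiment')
print(classify(sentence="The turbine vibration report looks alarming.").sentiment)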
How much improvement can you get with these modules? From real-world examples, around 10–20%, depending on the module type and scenario.
And the simplest thing you can use right away is the chain-of-thought functionality, applied to any user input.
import dspy  # assumes an LM is configured, e.g. dspy.configure(lm=dspy.LM(...)) as above

def search_wikipedia(query: str) -> list[str]:
    # Top-3 passages from a hosted ColBERTv2 index over Wikipedia 2017 abstracts.
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
    return [x['text'] for x in results]

rag = dspy.ChainOfThought('context, question -> response')
question = "What's the name of the castle that David Gregory inherited?"
rag(context=search_wikipedia(question), question=question)
And the output can look something like this:
Prediction(
reasoning='The context provides information about David Gregory,
a Scottish physician and inventor. It specifically mentions that he
inherited Kinnairdy Castle in 1664. This detail directly answers the
question about the name of the castle that David Gregory inherited.',
response='Kinnairdy Castle'
)
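If you want to see the exact prompt DSPy sent to the model for that call (handy when you later compare optimized and unoptimized programs), recent DSPy versions can dump the last LM interaction:
# Print the most recent prompt/completion exchanged with the LM
# (available in recent DSPy releases; the exact output format may vary).
dspy.inspect_history(n=1)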
The best part of DSPy that I used is Optimizers: algorithms that tune the parameters of a DSPy program. An optimizer needs three components: a Module, a Metric, and a few training inputs.
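A metric is just a Python function, and the training inputs are dspy.Example objects. Below is a minimal sketch with a toy exact-match metric and a hypothetical one-item training set; real data and a softer metric (like dspy.SemanticF1 used later) will serve you better.
# DSPy metrics receive (example, prediction, trace=None) and return a bool or float.
def exact_response_match(example, pred, trace=None):
    return example.response.strip().lower() == pred.response.strip().lower()

# A hypothetical, tiny training set; with_inputs() marks which fields are inputs.
trainset = [
    dspy.Example(question="What's the name of the castle that David Gregory inherited?",
                 response="Kinnairdy Castle").with_inputs("question"),
]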
There are four types of optimizers:
- Automatic Few-Shot Learning with BootstrapFewShotWithRandomSearch, BootstrapFewShot, LabeledFewShot, KNNFewShot
- Automatic Instruction Optimization with COPRO and MIPROv2
- Automatic Finetuning
- Program Transformations
In practice, choose MIPROv2 (Multiprompt Instruction PRoposal Optimizer Version 2) if you don't have any few-shot samples; otherwise, build a good set of up to 1000 samples and use BootstrapFewShotWithRandomSearch, which will bring the best results (a sketch of it follows the MIPROv2 example below).
class RAG(dspy.Module):
    def __init__(self, num_docs=5):
        super().__init__()
        self.num_docs = num_docs
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        context = search(question, k=self.num_docs)  # search() is not defined in this snippet
        return self.respond(context=context, question=question)

tp = dspy.MIPROv2(metric=dspy.SemanticF1(), auto="medium", num_threads=24)
optimized_rag = tp.compile(RAG(), trainset=trainset,
                           max_bootstrapped_demos=2, max_labeled_demos=2)
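If you do have labeled samples, the few-shot route looks almost identical. Here is a hedged sketch with illustrative, untuned parameter values:
fewshot_tp = dspy.BootstrapFewShotWithRandomSearch(
    metric=dspy.SemanticF1(),
    max_bootstrapped_demos=4,    # demos generated by running the program itself
    max_labeled_demos=4,         # demos taken directly from the trainset
    num_candidate_programs=8,    # random-search budget over candidate prompt/demo sets
)
fewshot_rag = fewshot_tp.compile(RAG(), trainset=trainset)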
Overall, DSPy is much more than prompt optimization and can serve as the core of your AI application, although it might be too heavy for that as well.
TextGrad
TextGrad is a framework that builds automatic backpropagation through text feedback provided by LLMs; in other words, an engine for textual gradients.
In essence we have two language models (LLMs): a teacher model, which is supposed to be smarter, and a student model. The teacher model reviews and helps improve the prompts for the student model. However, if the task is too difficult for the teacher model, the whole system can fail :).
Let's start with the core idea from the paper.
In machine learning, gradient descent reduces errors by tweaking model parameters in the direction that decreases the error the most. The model's settings are repeatedly adjusted based on gradients, which point along the steepest direction in the error landscape.
This idea is applied to prompts by using “textual gradients” instead of numeric ones. These textual gradients are created by assessing how well a prompt performs on a training dataset and then describing its weaknesses or areas for improvement in plain language.
You can try version 0.1.4 via the quickstart below. My take: it is good for experimenting with a generic system or pilot, especially if you are using weaker models. https://textgrad.readthedocs.io/en/latest/quickstart.html
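To make the teacher/student loop concrete, here is a minimal sketch based on the TextGrad quickstart; the model names and the evaluation instruction are placeholders, and an API key for the chosen engines is assumed.
import textgrad as tg

# The "teacher" (backward) engine that writes the textual gradients.
tg.set_backward_engine("gpt-4o", override=True)

# The "student" whose output we want to improve.
model = tg.BlackboxLLM("gpt-4o")
question = tg.Variable(
    "If it takes 1 hour to dry 25 shirts under the sun, how long will it take to dry 30 shirts?",
    role_description="question to the LLM",
    requires_grad=False,
)

answer = model(question)
answer.set_role_description("concise and accurate answer to the question")

# The "loss" is an LLM-written critique; backward() turns it into textual gradients.
loss_fn = tg.TextLoss("Evaluate the answer. Be critical, logical, and concise.")
optimizer = tg.TGD(parameters=[answer])

loss = loss_fn(answer)
loss.backward()
optimizer.step()      # rewrites `answer` according to the textual gradient
print(answer.value)
For prompt optimization, the same loop is applied with the system prompt declared as a tg.Variable with requires_grad=True, so the teacher rewrites the prompt instead of a single answer.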
Alternatives
An honorable mention that I will cover in the next article:
AdalFlow
Summary
From a software engineer's perspective, DSPy is much more mature, has great samples and a strong community, and can bring stable improvements to users' output, especially if you are building something very specific, like civil engineering in my case.
But if you are building a generic PoC or MVP without a few-shot approach or sample data, you should do a head-on comparison between textgrad.TGD and dspy.MIPROv2 and evaluate the results to find the ideal choice.
Thanks for reading the article, Cheers!